XTC-Bench framework reveals unified multimodal models show weak cross-task consistency despite high individual performance

arXiv cs.CV · April 29, 2026

AI Summary

• Researchers introduced XTC-Bench, a scene-graph-grounded evaluation framework that measures whether unified multimodal models (systems that support both visual understanding and generation in a shared representation) stay semantically consistent across tasks for a given visual concept.
• The framework uses Continuous Cross-Task Agreement (CCTA), a fine-grained metric that quantifies semantic agreement between generation and understanding over matched atomic facts (objects, attributes, and relations), isolating internal consistency from standalone task accuracy.
• Experiments on eight open-source unified models and one commercial model found that high generation or understanding performance does not imply strong cross-task alignment; architectural analysis shows that consistency is governed by how tightly learning objectives are coupled across modalities, not by architectural unification alone.
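
The summary does not give the CCTA formula, so the following is only an illustrative sketch of what agreement over matched atomic facts could look like. It is not the paper's definition. The function name `cross_task_agreement`, the tuple encoding of facts, the [0, 1] per-fact scores, and the `1 - |g - u|` agreement rule are all assumptions made for this example.

```python
from typing import Dict, Tuple

# Hypothetical encoding of an atomic scene-graph fact, e.g.
# ("object", "dog") or ("relation", "dog", "on", "sofa").
Fact = Tuple[str, ...]


def cross_task_agreement(
    gen_scores: Dict[Fact, float],
    und_scores: Dict[Fact, float],
) -> float:
    """Mean per-fact agreement between two tasks (assumed formula).

    gen_scores: how well each fact is realized in the generated image, in [0, 1].
    und_scores: how well the same model recognizes each fact, in [0, 1].
    Only facts scored on both sides ("matched" facts) are compared.
    """
    matched = gen_scores.keys() & und_scores.keys()
    if not matched:
        raise ValueError("no matched facts to compare")
    # Agreement is 1 when both tasks score a fact identically, 0 when they
    # disagree completely. It does not reward being right, only being consistent.
    total = sum(1.0 - abs(gen_scores[f] - und_scores[f]) for f in matched)
    return total / len(matched)


# Toy scores, invented purely for illustration.
gen = {
    ("object", "dog"): 1.0,
    ("attribute", "dog", "brown"): 0.5,
    ("relation", "dog", "on", "sofa"): 0.0,
}
und = {
    ("object", "dog"): 1.0,
    ("attribute", "dog", "brown"): 1.0,
    ("relation", "dog", "on", "sofa"): 0.0,
}
score = cross_task_agreement(gen, und)
```

In this toy case the relation fact fails in both generation and understanding, yet it still counts as full agreement. That is one way a consistency score can be separated from standalone task accuracy, as the summary describes.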
