XTC-Bench framework reveals unified multimodal models show weak cross-task consistency despite high individual performance
arXiv cs.CV · April 29, 2026
AI Summary
•Researchers introduced XTC-Bench, a scene-graph-grounded evaluation framework that measures whether unified multimodal models (systems that support both visual understanding and generation in a shared representation) stay semantically consistent across tasks when handling the same visual concept.
•The framework uses Continuous Cross-Task Agreement (CCTA), a fine-grained metric that quantifies semantic agreement between generation and understanding over matched atomic facts (objects, attributes, and relations), isolating internal consistency from standalone task accuracy.
•Experiments on eight open-source unified models and one commercial model found that high generation or understanding performance does not imply strong cross-task alignment. Architectural analysis shows that consistency is governed by how tightly learning objectives are coupled across modalities, not by architectural unification alone.
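
To make the idea of agreement over matched atomic facts concrete, here is a minimal Python sketch. The summary does not give the CCTA formula, so the scoring rule used here (one minus the absolute gap between the two per-fact scores, averaged over matched facts) is an assumption for illustration only. The names `Fact` and `cross_task_agreement` and all example scores are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One atomic fact from a scene graph (hypothetical structure)."""
    kind: str  # "object", "attribute", or "relation"
    text: str


def cross_task_agreement(gen_scores, und_scores):
    """Illustrative stand-in for a continuous agreement score.

    gen_scores / und_scores map each Fact to a score in [0, 1] saying how
    well the model handled that fact when generating / when understanding.
    Only facts present in both maps (the matched facts) are compared.
    """
    matched = gen_scores.keys() & und_scores.keys()
    if not matched:
        raise ValueError("no matched facts to compare")
    # Assumed rule: per-fact agreement is 1 minus the absolute score gap.
    total = sum(1.0 - abs(gen_scores[f] - und_scores[f]) for f in matched)
    return total / len(matched)


dog = Fact("object", "dog")
brown = Fact("attribute", "dog is brown")
on_sofa = Fact("relation", "dog on sofa")

# Made-up scores for one visual concept.
gen = {dog: 1.0, brown: 0.5, on_sofa: 0.0}
und = {dog: 1.0, brown: 1.0, on_sofa: 0.0}

score = cross_task_agreement(gen, und)
gen_accuracy = sum(gen.values()) / len(gen)
```

In this toy case the relation fact fails in both tasks, so it counts as full agreement even though it lowers each task's accuracy. That is the sense in which a consistency score can be separated from standalone task performance.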