XTC-Bench framework reveals unified multimodal models show weak cross-task consistency despite high individual performance

arXiv cs.CV · April 29, 2026

AI Summary

  • Researchers introduced XTC-Bench, a scene-graph-grounded evaluation framework that measures whether unified multimodal models (systems that support both visual understanding and visual generation in a shared representation) stay semantically consistent across tasks when handling the same visual concept.
  • The framework uses Continuous Cross-Task Agreement (CCTA), a fine-grained metric that quantifies semantic agreement between generation and understanding over matched atomic facts (objects, attributes, and relations), isolating internal consistency from standalone task accuracy.
  • Experiments on eight open-source unified models and one commercial unified model found that high generation or understanding performance does not imply strong cross-task alignment; architectural analysis shows that consistency is governed by how tightly learning objectives are coupled across modalities, not by architectural unification alone.
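
To make the idea of agreement over matched atomic facts concrete, here is a minimal Python sketch. It is not the paper's CCTA formula, which this summary does not give. It assumes each atomic fact receives a score in [0, 1] from the generation side and from the understanding side, and it takes agreement to be one minus the absolute difference, averaged over the facts that both sides scored. The names `AtomicFact` and `cross_task_agreement` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class AtomicFact:
    """One scene-graph fact (hypothetical type for this sketch)."""
    kind: str  # "object", "attribute", or "relation"
    text: str


def cross_task_agreement(gen_scores, und_scores):
    """Average per-fact agreement between generation and understanding.

    ASSUMPTION: scores lie in [0, 1] and agreement is 1 - |g - u|.
    This is an illustrative stand-in, not the CCTA definition.
    """
    # Only facts scored by both tasks are comparable.
    matched = gen_scores.keys() & und_scores.keys()
    if not matched:
        raise ValueError("no matched atomic facts to compare")
    return mean(1.0 - abs(gen_scores[f] - und_scores[f]) for f in matched)


dog = AtomicFact("object", "dog")
brown = AtomicFact("attribute", "dog is brown")
on_sofa = AtomicFact("relation", "dog on sofa")

# Made-up scores: the two tasks disagree only on the attribute.
gen = {dog: 1.0, brown: 0.75, on_sofa: 0.25}
und = {dog: 1.0, brown: 0.25, on_sofa: 0.25}
score = cross_task_agreement(gen, und)

# Both tasks fail on the same fact: low accuracy, yet full agreement.
both_wrong = cross_task_agreement({on_sofa: 0.0}, {on_sofa: 0.0})
```

In this toy setup a model that is wrong in the same way on both tasks still scores as fully consistent, which is the sense in which a consistency measure is separate from standalone task accuracy.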
