TraceElephant benchmark introduces full execution traces for diagnosing failures in LLM-based multi-agent systems
arXiv cs.MA (Multi-Agent) · April 27, 2026
AI Summary
•Researchers introduced TraceElephant, a benchmark designed for failure attribution (identifying which agent and step caused a failure) in LLM-based multi-agent systems (AI systems where multiple language models work together and reason in natural language).
•Existing benchmarks omit the inputs and context that developers rely on when debugging, whereas TraceElephant captures complete execution traces and reproducible environments. With full traces, attribution accuracy improves by up to 76% over partial-observation counterparts.
•The benchmark aims to advance failure attribution research and promote evaluation practices aligned with real-world debugging scenarios, supporting development of more transparent multi-agent systems.
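
To make "failure attribution" concrete, here is a minimal sketch of how predictions of the form (agent, step) might be scored against ground-truth labels. The `Attribution` record, the `attribution_accuracy` function, the agent names, and the rule that a step-level hit requires both the agent and the step to match are all assumptions made for illustration; the summary does not describe TraceElephant's actual data schema or metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """Hypothetical record: which agent and which step are blamed for a failure."""
    agent: str  # name of the agent held responsible
    step: int   # index of the decisive step in the execution trace


def attribution_accuracy(predictions, labels):
    """Return (agent-level accuracy, step-level accuracy).

    Assumption: a step-level hit requires both agent and step to match.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one labeled failure is required")

    agent_hits = sum(p.agent == g.agent for p, g in zip(predictions, labels))
    step_hits = sum(p == g for p, g in zip(predictions, labels))
    n = len(labels)
    return agent_hits / n, step_hits / n


# Illustrative (made-up) labels and predictions for four failed runs.
labels = [
    Attribution("planner", 3),
    Attribution("coder", 7),
    Attribution("verifier", 2),
    Attribution("coder", 5),
]
predictions = [
    Attribution("planner", 3),   # agent and step both correct
    Attribution("coder", 6),     # right agent, wrong step
    Attribution("planner", 2),   # wrong agent
    Attribution("coder", 5),     # agent and step both correct
]

agent_acc, step_acc = attribution_accuracy(predictions, labels)
print(f"agent-level: {agent_acc:.2f}, step-level: {step_acc:.2f}")
```

Reporting the two numbers separately reflects that naming the responsible agent is usually easier than pinpointing the exact step, so a single combined score could hide where an attribution method falls short.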