TraceElephant benchmark introduces full execution traces for diagnosing failures in LLM-based multi-agent systems
arXiv cs.MA (Multi-Agent) · April 27, 2026
AI summary
•Researchers introduced TraceElephant, a benchmark designed for failure attribution (identifying which agent and step caused a failure) in LLM-based multi-agent systems (AI systems where multiple language models work together and reason in natural language).
•Existing benchmarks omit inputs and context that developers rely on when debugging, whereas TraceElephant captures complete execution traces and reproducible environments. With full traces, attribution accuracy improves by up to 76% over partial-observation counterparts.
•The benchmark aims to advance failure attribution research and promote evaluation practices aligned with real-world debugging scenarios, supporting development of more transparent multi-agent systems.
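
To make "attribution accuracy" concrete, here is a minimal sketch of how predictions might be scored against labels at the agent level and at the step level. It is an illustration under stated assumptions, not TraceElephant's actual scoring code: the `Attribution` record, the `attribution_accuracy` function, the agent names, and the rule that a step counts as correct when its index matches are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """Hypothetical record: which agent failed, and at which step index."""
    agent: str
    step: int


def attribution_accuracy(predictions, labels):
    """Return agent-level and step-level accuracy over paired records.

    Assumption: a prediction is agent-correct when the agent names match,
    and step-correct when the step indices match.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one labeled failure is required")

    total = len(labels)
    agent_hits = sum(p.agent == g.agent for p, g in zip(predictions, labels))
    step_hits = sum(p.step == g.step for p, g in zip(predictions, labels))
    return {
        "agent_accuracy": agent_hits / total,
        "step_accuracy": step_hits / total,
    }


# Made-up example data: four failed runs with ground-truth labels
# and one attribution method's predictions.
labels = [
    Attribution("planner", 3),
    Attribution("coder", 5),
    Attribution("coder", 2),
    Attribution("verifier", 7),
]
predictions = [
    Attribution("planner", 3),   # agent and step both correct
    Attribution("coder", 4),     # agent correct, step wrong
    Attribution("planner", 2),   # agent wrong, step correct
    Attribution("verifier", 6),  # agent correct, step wrong
]

scores = attribution_accuracy(predictions, labels)
print(scores)
```

Running the same scorer on predictions made from full traces and on predictions made from partial observations would give the kind of side-by-side comparison the summary describes.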