New benchmark shows even the best AI models solve only 3 percent of real-world knowledge work tasks completely, exposing a gap between lab performance and practical capability.

THE DECODER · 11h ago · 2 min read

3 Key Points

1. What happened: Artificial Analysis released the AA-Briefcase benchmark, which tests AI models on multi-week knowledge work projects built from thousands of fragmented source files such as Slack threads, emails, meeting transcripts, and data exports. Claude Fable 5, the top performer, fully solved just 3 percent of tasks, and on 31 of the 91 tasks, no model cleared 50 percent.
2. Why it matters: As models improve, the type of failure changes: weaker models miss obvious steps or produce unusable output, while stronger models meet the basic requirements but miss details that require piecing together information from multiple sources. This suggests that raw capability does not automatically translate into handling the messy, multi-source reality of how knowledge workers actually operate.
3. What to watch: Per-task costs span more than 800x, ranging from about $0.04 for DeepSeek V4 Flash to over $31 for Claude Fable 5, highlighting a significant trade-off between model power and expense for real-world deployment.
