Study finds harness complexity does not improve LLM agent reliability uniformly across model capability tiers

Hacker News · May 28, 2026


3 Key Points

Researchers ran a 432-run experiment that crossed six models, spanning four capability tiers, with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark. The results refute the assumption that higher-capability models need proportionally less structural guidance.
For the frontier chat model (Gemini 2.5 Flash), increased harness verbosity lowered VTSR (a measure of task success and result validity) by 29-38 percentage points, the opposite of the expected effect. For the frontier reasoning model (Qwen3.5-122B with extended thinking), the strict harness achieved the highest VTSR, at 91.7%, and the lowest latency.
A failure taxonomy identified format_violation as the dominant failure type in capable models, while wrong_file dominates in low-capability models. Harness sensitivity appears non-monotone across the models evaluated and depends critically on model type (chat vs. reasoning).
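
The run count in the first point is consistent with a full crossing of the design: 6 models × 3 harness conditions × 24 tasks = 432. The sketch below enumerates that grid, assuming one run per cell (the summary does not state this). The model and task labels are hypothetical placeholders, since the summary names only two of the six models. The `vtsr` helper is likewise an assumption: it treats VTSR as a simple percentage of valid, successful runs, under which the reported 91.7% would correspond to 22 of 24 tasks.

```python
from itertools import product

# Placeholder labels (assumption): the summary names only two of the six models.
MODELS = [f"model_{i}" for i in range(1, 7)]
HARNESSES = ["light", "balanced", "strict"]
# Placeholder task IDs (assumption): HEAT-24 has 24 tasks, names unknown.
TASKS = [f"task_{i:02d}" for i in range(1, 25)]


def build_run_grid(models=MODELS, harnesses=HARNESSES, tasks=TASKS):
    """Return every (model, harness, task) combination, one run per cell."""
    return list(product(models, harnesses, tasks))


def vtsr(successes, total):
    """Percentage of valid, successful runs, rounded to one decimal.

    Assumed definition; the study's exact VTSR formula is not given here.
    """
    return round(100.0 * successes / total, 1)


runs = build_run_grid()
print(len(runs))     # 432
print(vtsr(22, 24))  # 91.7
```

Under these assumptions, each model and harness pair is scored over 24 runs, so a single task changes VTSR by about 4.2 percentage points.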
