AIToday

Study finds harness complexity does not improve LLM agent reliability uniformly across model capability tiers

Hacker News · 6d ago · 1 min read


3 Key Points

  1. Researchers ran a 432-run experiment that crossed six models, spanning four capability tiers, with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark. The results contradict the assumption that higher-capability models need proportionally less structural guidance.

  2. For the frontier chat model (Gemini 2.5 Flash), increased harness verbosity lowered VTSR (a measure of task success and result validity) by 29-38 percentage points, the opposite of the expected effect. For the frontier reasoning model (Qwen3.5-122B with extended thinking), the strict harness achieved the highest VTSR, 91.7%, and the lowest latency.

  3. A failure taxonomy identified format_violation as the dominant failure type in capable models, while wrong_file dominated in low-capability models. Harness sensitivity appears non-monotonic across the models evaluated and depends critically on model type (chat vs. reasoning).
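
The run count and the headline percentage can be sanity-checked with a small sketch. It assumes one run per (model, harness, task) combination, which is consistent with the 432 total but is not stated in the summary, and it assumes VTSR counts a run only when it is both successful and valid. The model and task names are hypothetical placeholders, since the summary names only two of the six models.

```python
from itertools import product

# Hypothetical placeholders: the summary names only two of the six models.
MODELS = [f"model_{i}" for i in range(1, 7)]
HARNESSES = ["light", "balanced", "strict"]       # conditions named in the summary
TASKS = [f"task_{i:02d}" for i in range(1, 25)]   # HEAT-24 has 24 tasks

def run_grid():
    """Enumerate every (model, harness, task) cell, assuming one run per cell."""
    return list(product(MODELS, HARNESSES, TASKS))

def vtsr(outcomes):
    """Percentage of runs that are both successful and valid.

    This definition is an assumption; the summary only describes VTSR as
    'a measure of task success and result validity'.
    """
    passed = sum(1 for success, valid in outcomes if success and valid)
    return 100.0 * passed / len(outcomes)

grid = run_grid()
print(len(grid))  # 432

# Illustrative data only: 22 of 24 tasks passing in one model/harness cell.
cell = [(True, True)] * 22 + [(True, False)] * 2
print(round(vtsr(cell), 1))  # 91.7
```

Under these assumptions, a 91.7% VTSR in a single model and harness cell would correspond to 22 of 24 tasks passing, so each task shifts the score by roughly 4 percentage points.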
