Researchers ran a 432-run experiment crossing six models spanning four capability tiers with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark. The results contradict the assumption that higher-capability models need proportionally less structural guidance.
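The 432-run figure is consistent with a full grid of six models, three harness conditions, and 24 tasks at one run per cell. The summary does not state the number of repetitions, so that is an assumption. A minimal Python sketch of such a grid follows; the model and task names are hypothetical placeholders, not identifiers from the study.

```python
from itertools import product

# Hypothetical placeholder names; the summary names only two of the six models.
MODELS = [f"model_{i}" for i in range(1, 7)]
HARNESSES = ["light", "balanced", "strict"]      # harness conditions from the summary
TASKS = [f"task_{i:02d}" for i in range(1, 25)]  # HEAT-24 has 24 tasks

# Assumption: exactly one run per (model, harness, task) combination.
runs = list(product(MODELS, HARNESSES, TASKS))
runs_per_cell = len(runs) // (len(MODELS) * len(HARNESSES))

print(len(runs))       # 432
print(runs_per_cell)   # 24 runs for each model-harness pair
```

Under this assumption, each model-harness pair covers 24 runs, so a rate of 91.7% would correspond to 22 of 24 tasks.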
For the frontier chat model (Gemini 2.5 Flash), increased harness verbosity lowered VTSR (a measure of task success and result validity) by 29 to 38 percentage points, the opposite of the expected effect. For the frontier reasoning model (Qwen3.5-122B with extended thinking), the strict harness achieved both the highest VTSR, at 91.7%, and the lowest latency.
A failure taxonomy identified format_violation as the dominant failure type in capable models, while wrong_file dominates in low-capability models. Harness sensitivity appears non-monotone in capability across the models evaluated and depends critically on model type (chat vs. reasoning).