AIToday

UK AI institute finds benchmarks underestimate what AI agents can actually do

THE DECODER · 1d ago · 6 min read

Key takeaway

The UK's AI Security Institute (AISI) found that standard AI benchmarks run at a fixed compute budget systematically underestimate what frontier models can do. As models are given a larger token budget, success rates climb significantly: some cybersecurity tasks take 50 million tokens to solve and are therefore marked as failures under standard test conditions. Because token costs are falling, capabilities that are expensive today may become cheaper to reach, which makes measuring performance across several compute levels essential for deployment and risk decisions.

3 Key Points

  • What happened

    The UK's AI Security Institute tested frontier AI models across seven benchmarks and found that fixed evaluation budgets systematically underestimate agent capabilities. Performance improves significantly when models are given higher token budgets. In cybersecurity, about 8 percent of tasks were solved only when the budget exceeded 10 million tokens, and some required 50 million. On software engineering tasks, success rates jumped about 25 percent when the token budget went from one million to ten million.

  • Why it matters

    Standard benchmarks with low compute budgets measure the floor, not the ceiling, of what these AI systems can do. Test scores that skip higher budgets can skew decisions about deployment and risk assessment. Since token costs are falling, capabilities once thought unaffordable could become cheaper to reach over time, making it increasingly important to measure performance across different compute levels rather than at a single fixed point.

  • What to watch

    The institute found that newer models benefit far more from extra compute than older ones, so frontier AI progress may be moving faster than benchmarks suggest: the time horizon of AI agents doubles more quickly when measured at higher budgets than at a fixed budget of 2.5 million tokens. AISI now tests models at multiple budgets and uses "minimum informative budgets" to determine when a model's reach stops growing with extra compute.
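
The effect described in these points can be illustrated with a small budget sweep. The sketch below is not AISI's method: the per-task token costs are made-up numbers, and `minimum_informative_budget` uses an assumed definition (the smallest tested budget whose solve rate matches the rate at the largest tested budget), since the summary does not spell out how the institute computes it.

```python
from typing import Optional, Sequence

# Hypothetical data: tokens each task needed before it was solved.
# None means the task was never solved at any tested budget.
TASK_COSTS: list[Optional[int]] = [
    200_000, 800_000, 3_000_000, 12_000_000, 50_000_000, None,
]


def solve_rate(min_tokens_needed: Sequence[Optional[int]], budget: int) -> float:
    """Fraction of tasks that would count as solved under a given token cap."""
    solved = sum(1 for t in min_tokens_needed if t is not None and t <= budget)
    return solved / len(min_tokens_needed)


def minimum_informative_budget(
    min_tokens_needed: Sequence[Optional[int]],
    budgets: Sequence[int],
    tolerance: float = 0.0,
) -> int:
    """Smallest tested budget whose solve rate is within `tolerance`
    of the solve rate at the largest tested budget (assumed definition)."""
    ordered = sorted(budgets)
    ceiling = solve_rate(min_tokens_needed, ordered[-1])
    for budget in ordered:
        if ceiling - solve_rate(min_tokens_needed, budget) <= tolerance:
            return budget
    return ordered[-1]


if __name__ == "__main__":
    budgets = [1_000_000, 2_500_000, 10_000_000, 50_000_000, 100_000_000]
    for b in budgets:
        print(f"budget {b:>11,} tokens: solve rate {solve_rate(TASK_COSTS, b):.2f}")
    print("minimum informative budget:", minimum_informative_budget(TASK_COSTS, budgets))
```

With data shaped like this, a benchmark capped at a low budget reports only the tasks that happen to be cheap, which is the sense in which a single fixed budget measures the floor.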

FAQ

What is the relationship between task difficulty and compute cost?
The research found a power law relationship: a one-minute task costs an AI agent thousands of tokens, a one-hour task costs millions, and a one-week task costs billions. A cyber task called "The Last Ones", which takes a human expert about 20 hours, was not solved by any tested model with fewer than 30 million tokens.
Do all AI model improvements work the same way with extra compute?
No. Extra compute helps most where agents can verify their own work, for example by running code or testing an exploit. It barely moves the needle on tasks where feedback is missing or delayed, such as the medical tasks in HealthBench, where all models hit a plateau within the standard budget.
How much faster is frontier AI progress at higher budgets?
At a fixed budget of 2.5 million tokens, the time horizon of frontier models doubles roughly every 4.7 months. At 50 million tokens, the doubling happens every 40 to 50 days instead of every 67 to 91 days, a trend that is roughly 60 percent steeper.
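
The arithmetic behind these answers can be sketched in a few lines of Python. This is a rough illustration, not the institute's analysis. It relies on two standard facts: the slope of a log-scale trend is inversely proportional to its doubling time, and a power law `cost = a * duration**b` is a straight line with slope `b` on log-log axes. The function names are illustrative, and the `1e3` and `1e6` token values are order-of-magnitude placeholders for "thousands" and "millions", not reported figures.

```python
import math


def annual_growth_factor(doubling_days: float) -> float:
    """Growth over 365 days for a quantity that doubles every `doubling_days`."""
    return 2.0 ** (365.0 / doubling_days)


def relative_steepness(slow_doubling_days: float, fast_doubling_days: float) -> float:
    """How much steeper the faster trend is, as a fraction.

    The slope of log2(horizon) over time is 1 / doubling_time, so the
    ratio of the two slopes is slow_doubling_days / fast_doubling_days.
    """
    return slow_doubling_days / fast_doubling_days - 1.0


def power_law_exponent(t1: float, c1: float, t2: float, c2: float) -> float:
    """Exponent b of cost = a * duration**b through two (duration, cost) points."""
    return math.log(c2 / c1) / math.log(t2 / t1)


if __name__ == "__main__":
    # Doubling times quoted in the answer above, in days.
    for days in (40, 50, 67, 91):
        print(f"doubling every {days} days: x{annual_growth_factor(days):,.0f} per year")

    # Placeholder magnitudes: 1 minute -> ~1e3 tokens, 60 minutes -> ~1e6 tokens.
    b = power_law_exponent(1, 1e3, 60, 1e6)
    print(f"implied exponent between one minute and one hour: {b:.2f}")
```

Passing a pair of doubling times to `relative_steepness` shows how sensitive the "percent steeper" figure is to which ends of the quoted ranges are compared.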
