Large Language Models

UK AI institute finds benchmarks underestimate what AI agents can actually do

THE DECODER · 1d ago · 6 min read

Key takeaway

The UK's AI Security Institute found that standard AI benchmarks with fixed compute budgets systematically underestimate what frontier models can do. As models receive larger token budgets, success rates climb significantly: some cybersecurity tasks take 50 million tokens to solve, yet count as failures under standard test conditions. Because token costs are falling, such capabilities may become cheaper to reach, which makes measuring performance across different compute levels essential for real deployment and risk decisions.

3 Key Points

What happened
The UK's AI Security Institute tested frontier AI models across seven benchmarks and found that fixed evaluation budgets systematically underestimate agent capabilities. Performance improves significantly when models are given higher token budgets. In cybersecurity, about 8 percent of tasks were solved only when the budget exceeded 10 million tokens, and some required 50 million. On software engineering tasks, success rates jumped about 25 percent when the token budget went from one million to ten million.
Why it matters
Standard benchmarks with low compute budgets measure the floor, not the ceiling, of what these AI systems can do. Test scores that skip higher budgets can skew decisions about deployment and risk assessment. Since token costs are falling, capabilities once thought unaffordable could become cheaper to reach over time, making it increasingly important to measure performance across different compute levels rather than at a single fixed point.
What to watch
The institute found that newer models benefit far more from extra compute than older ones, and that frontier AI progress may be moving faster than benchmarks suggested: the time horizon of AI agents doubles faster at higher budgets than at a fixed budget of 2.5 million tokens. AISI now tests models at multiple budgets, using "minimum informative budgets" to determine when a model's reach stops growing with extra compute.
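
As a rough illustration of the multi-budget idea described above, the sketch below scans success rates measured at increasing token budgets and reports the smallest budget after which more compute adds less than a chosen margin. The function name, the `min_gain` threshold, and the sample numbers are all hypothetical and are not taken from the AISI report; the institute's actual definition of a minimum informative budget may differ.

```python
from typing import Sequence


def plateau_budget(
    budgets: Sequence[int],
    success_rates: Sequence[float],
    min_gain: float = 0.01,
) -> int:
    """Return the smallest budget after which no larger budget raises
    the success rate by more than min_gain (hypothetical criterion)."""
    if not budgets or len(budgets) != len(success_rates):
        raise ValueError("budgets and success_rates must be non-empty and equal in length")

    # Sort by budget so "later" always means "more tokens".
    pairs = sorted(zip(budgets, success_rates))

    for i, (budget, rate) in enumerate(pairs):
        later_rates = [r for _, r in pairs[i + 1:]]
        # The largest budget trivially qualifies: there is nothing beyond it.
        if not later_rates or max(later_rates) - rate <= min_gain:
            return budget

    # Unreachable: the last pair always satisfies the condition above.
    raise AssertionError("no plateau budget found")


# Made-up measurements for illustration only (not AISI data).
sample_budgets = [1_000_000, 2_500_000, 10_000_000, 50_000_000]
sample_rates = [0.40, 0.48, 0.65, 0.655]
```

With these made-up numbers, a benchmark run at 2.5 million tokens would stop well before the curve flattens, which is the kind of underestimate the article describes.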

FAQ

What is the relationship between task difficulty and compute cost?
The research found a power-law relationship: a one-minute task costs an AI agent thousands of tokens, a one-hour task costs millions, and a one-week task costs billions. No tested model solved "The Last Ones", a cyber task that takes a human expert about 20 hours, with fewer than 30 million tokens.

Do all AI model improvements work the same way with extra compute?
No. Extra compute helps most where agents can verify their own work, for example by running code or testing an exploit. It barely moves the needle on tasks where feedback is missing or delayed, such as medical tasks on HealthBench, where all models hit a plateau within the standard budget.

How much faster is frontier AI progress at higher budgets?
At a fixed budget of 2.5 million tokens, the time horizon of frontier models doubles roughly every 4.7 months. At 50 million tokens, the doubling happens every 40 to 50 days instead of every 67 to 91 days, a trend that is roughly 60 percent steeper.
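
To make the doubling-time figures above easier to compare, here is a minimal sketch of the arithmetic. It assumes clean exponential growth, converts a doubling time into a growth factor over a given period, and compares two doubling times as a ratio of slopes on a log scale. The function names are illustrative, and how the article pairs the endpoints of the 40 to 50 day and 67 to 91 day ranges to arrive at "roughly 60 percent" is not stated, so the ratio here may not reproduce that exact figure.

```python
def growth_factor(doubling_days: float, period_days: float = 365.0) -> float:
    """Factor by which a quantity grows over period_days if it doubles
    every doubling_days (assumes steady exponential growth)."""
    if doubling_days <= 0:
        raise ValueError("doubling_days must be positive")
    return 2.0 ** (period_days / doubling_days)


def steepness_ratio(slow_doubling_days: float, fast_doubling_days: float) -> float:
    """Ratio of log-scale slopes: how many times more doublings per unit
    time the faster trend achieves compared with the slower one."""
    if slow_doubling_days <= 0 or fast_doubling_days <= 0:
        raise ValueError("doubling times must be positive")
    return slow_doubling_days / fast_doubling_days


# Doubling every 4.7 months, using an average month of about 30.44 days.
yearly_at_fixed_budget = growth_factor(4.7 * 30.44)

# Pairing the lower ends of the two ranges quoted in the article.
ratio_low_ends = steepness_ratio(67, 40)
```

A ratio of 1.0 means both trends rise at the same pace; any value above 1.0 means the second doubling time corresponds to a steeper trend.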
