
The UK's AI Security Institute discovered that standard AI benchmarks with fixed compute budgets systematically underestimate what frontier models can do. As models receive more tokens (computational budget) to work with, success rates climb significantly—some cybersecurity tasks require 50 million tokens to solve, yet are marked as failures under standard test conditions. Since token costs are falling, this finding has practical implications: capabilities may become cheaper to achieve, making accurate measurement across different compute levels essential for real deployment and risk decisions.
What happened
The UK's AI Security Institute tested frontier AI models across seven benchmarks and found that fixed evaluation budgets systematically underestimate agent capabilities. Performance improves significantly when models are given higher token budgets—in cybersecurity tasks, about 8 percent of tasks were only solved when the budget exceeded 10 million tokens, and some required 50 million. On software engineering tasks, success rates jumped about 25 percent when the token budget went from one million to ten million.
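To make the measurement concrete, here is a minimal Python sketch of how a success rate could be tabulated at several token budgets. The function name, the data layout (tokens used per solved task, `None` for unsolved tasks) and the numbers are all hypothetical illustrations, not AISI's data or tooling.

```python
from typing import Optional, Sequence


def success_rate_by_budget(
    tokens_to_solve: Sequence[Optional[int]],
    budgets: Sequence[int],
) -> dict[int, float]:
    """Return the fraction of tasks solved within each token budget.

    tokens_to_solve[i] is the number of tokens the agent had used when it
    solved task i, or None if it never solved it (assumed data layout).
    """
    total = len(tokens_to_solve)
    if total == 0:
        raise ValueError("need at least one task")
    return {
        budget: sum(
            1 for used in tokens_to_solve if used is not None and used <= budget
        ) / total
        for budget in budgets
    }


# Made-up numbers for illustration only; these are not AISI results.
runs = [200_000, 900_000, 4_000_000, 12_000_000, 48_000_000, None, None, 700_000]
rates = success_rate_by_budget(runs, [1_000_000, 10_000_000, 50_000_000])
for budget, rate in rates.items():
    print(f"{budget:>12,} tokens: {rate:.1%} of tasks solved")
```

In this toy data, a task that needs 48 million tokens counts as a failure at the 1 million and 10 million budgets, which is the kind of undercount the institute describes.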
Why it matters
Standard benchmarks with low compute budgets measure the floor, not the ceiling, of what these AI systems can do. Test scores that skip higher budgets can skew decisions about deployment and risk assessment. Since token costs are falling, capabilities once thought unaffordable could become cheaper to reach over time, making it increasingly important to measure performance across different compute levels rather than at a single fixed point.
What to watch
The institute found that newer models benefit far more from extra compute than older ones, and frontier AI progress may be moving faster than benchmarks have suggested: the AI time horizon doubles faster when measured at higher budgets than at a fixed budget of 2.5 million tokens. AISI now tests models at multiple budgets, using "minimum informative budgets" to determine when a model's reach stops growing with extra compute.
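The summary does not say how AISI computes a "minimum informative budget". One plausible reading, offered purely as an assumption, is the smallest tested budget beyond which the success rate no longer improves by a meaningful margin. The sketch below implements that reading; the function name, the tolerance and the numbers are hypothetical.

```python
def smallest_plateau_budget(rates: dict[int, float], tolerance: float = 0.01) -> int:
    """Smallest tested budget whose success rate is within `tolerance`
    of the best rate observed at any tested budget.

    This is an assumed, simplified stand-in for a "minimum informative
    budget", not AISI's published definition.
    """
    if not rates:
        raise ValueError("need at least one (budget, rate) pair")
    best = max(rates.values())
    return min(budget for budget, rate in rates.items() if rate >= best - tolerance)


# Made-up success rates keyed by token budget; not AISI results.
observed = {
    1_000_000: 0.40,
    2_500_000: 0.55,
    10_000_000: 0.70,
    50_000_000: 0.705,
}
plateau = smallest_plateau_budget(observed)
print(f"Success rate stops growing meaningfully at about {plateau:,} tokens")
```

With these invented numbers, a score taken at 2.5 million tokens would sit well below the plateau, while testing beyond 10 million tokens adds almost nothing.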