Researchers explain why larger language models learn rare tasks that smaller ones cannot master, even with extended training

THE DECODER · Jun 7, 2026

3 Key Points

A study by researchers at Anthropic, Stanford, and other institutions found that small models can fail to reliably learn tasks that make up just 0.25 percent of the training data, whereas larger models do learn such rarely interspersed tasks. The team trained OLMo models ranging from 4 million to 4 billion parameters on up to 210 billion tokens from the Dolma corpus, mixing in artificial tasks like number comparison and modular addition at varying frequencies.
The mechanism: smaller models fall into an 'update-and-forget' loop where rare task signals are erased by subsequent training steps on frequent tasks, whereas larger models retain enough capacity to hold onto rare signals between observations. Once a large model masters frequent tasks, freed-up capacity allows it to build on rare patterns, while small models rarely reach that point.
The study proposes a practical alternative to scaling up model size: increasing the frequency of a target task in the training data can anchor that specific skill in a smaller model.
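The data-mixing setup described above can be sketched roughly as follows. This is a minimal illustration, not the study's actual pipeline: the function names, the modulus of 97, the text format of the task examples, and the choice to count the task fraction per document rather than per token are all assumptions made here for clarity.

```python
import random


def make_modular_addition_example(rng, modulus=97):
    """Build one synthetic modular-addition example as plain text.

    The modulus and the string format are illustrative assumptions.
    """
    a = rng.randrange(modulus)
    b = rng.randrange(modulus)
    return f"{a} + {b} mod {modulus} = {(a + b) % modulus}"


def mix_rare_task(corpus_docs, task_fraction, rng,
                  make_example=make_modular_addition_example):
    """Yield a training stream in which each slot holds a synthetic task
    example with probability `task_fraction`, and the original corpus
    document otherwise.

    Assumption: the fraction is applied per document, not per token.
    """
    for doc in corpus_docs:
        if rng.random() < task_fraction:
            yield make_example(rng)
        else:
            yield doc


if __name__ == "__main__":
    rng = random.Random(0)
    corpus = (f"web text {i}" for i in range(200_000))
    # 0.25 percent is the rare-task frequency mentioned in the summary.
    stream = list(mix_rare_task(corpus, task_fraction=0.0025, rng=rng))
    n_task = sum(" mod " in doc for doc in stream)
    print(f"{n_task} task examples out of {len(stream)} documents")
```

In this sketch, the alternative the study proposes corresponds to raising `task_fraction` for the target task while leaving the model size unchanged.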
