
A study by researchers at Anthropic, Stanford, and other institutions found that small models can fail to reliably learn tasks that make up just 0.25 percent of the training data. Only larger models learn tasks that appear this rarely. The team trained OLMo models ranging from 4 million to 4 billion parameters on up to 210 billion tokens from the Dolma corpus, mixing in artificial tasks such as number comparison and modular addition at varying frequencies.
The mechanism: smaller models fall into an 'update-and-forget' loop where rare task signals are erased by subsequent training steps on frequent tasks, whereas larger models retain enough capacity to hold onto rare signals between observations. Once a large model masters frequent tasks, freed-up capacity allows it to build on rare patterns, while small models rarely reach that point.
The study proposes a practical alternative to scaling up model size: increasing the frequency of a target task in the training data can anchor that specific skill in a smaller model.
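To make "frequency of a task in the training data" concrete, here is a minimal Python sketch. It is not the study's pipeline: the function names (`make_comparison_example`, `mix_stream`, `expected_gap`), the text format of the synthetic task, and the choice to mix per example rather than per token are all assumptions made for illustration.

```python
import random


def make_comparison_example(rng):
    """Hypothetical synthetic task: state whether the first integer is larger."""
    a, b = rng.randrange(1000), rng.randrange(1000)
    return f"{a} > {b} ? {'yes' if a > b else 'no'}"


def mix_stream(corpus_docs, task_fraction, n_examples, seed=0):
    """Yield n_examples items; each is a synthetic task with probability task_fraction.

    Assumption: mixing happens per example, which may differ from the study.
    """
    rng = random.Random(seed)
    for _ in range(n_examples):
        if rng.random() < task_fraction:
            yield ("task", make_comparison_example(rng))
        else:
            yield ("corpus", rng.choice(corpus_docs))


def expected_gap(task_fraction):
    """Average number of examples from one task example to the next."""
    return 1.0 / task_fraction
```

Under these assumptions, a task at 0.25 percent shows up on average once every 400 examples (`expected_gap(0.0025)`), which is the window in which intervening updates could overwrite what was learned. Raising the fraction tenfold to 2.5 percent shrinks that average gap to 40 examples.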