Researchers develop a 524-item benchmark to measure how well large language models monitor their own accuracy across six cognitive domains.
arXiv cs.CL · April 20, 2026
AI Summary
•The Metacognitive Monitoring Battery adapts frameworks from human psychology to evaluate self-awareness in 20 frontier LLMs across 10,480 total evaluations
•Tests span six domains: learning, metacognitive calibration, social cognition, attention, executive function, and prospective regulation, each based on established experimental paradigms
•After each answer, models choose to KEEP or WITHDRAW their response and place BETs on its correctness; the key metric is the 'withdraw delta', the difference in withdrawal rates between incorrect and correct answers
•Five of six task groups were pre-registered on the Open Science Framework before data collection to ensure methodological rigor
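The withdraw delta described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the `withdraw_delta` function, the trial format, and the example numbers are all hypothetical, chosen only to show how the metric rewards withdrawing wrong answers more often than right ones.

```python
def withdraw_delta(trials):
    """Withdrawal rate on incorrect answers minus withdrawal rate on correct ones.

    trials: list of (correct: bool, withdrew: bool) tuples, one per question.
    A well-calibrated model withdraws wrong answers more often, so delta > 0.
    """
    incorrect = [withdrew for correct, withdrew in trials if not correct]
    correct = [withdrew for correct, withdrew in trials if correct]
    rate = lambda flags: sum(flags) / len(flags) if flags else 0.0
    return rate(incorrect) - rate(correct)

# Hypothetical run: the model withdraws 6 of 10 wrong answers
# but only 2 of 10 right ones, giving a delta of 0.6 - 0.2 = 0.4.
trials = ([(False, True)] * 6 + [(False, False)] * 4
          + [(True, True)] * 2 + [(True, False)] * 8)
print(round(withdraw_delta(trials), 2))
```

A delta near zero would mean the model withdraws indiscriminately, with no sensitivity to its own accuracy.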