Researchers develop a 524-item benchmark to measure how well large language models monitor their own accuracy across six cognitive domains.

arXiv cs.CL · April 20, 2026

AI summary

  • The Metacognitive Monitoring Battery uses human psychology frameworks to evaluate self-awareness in 20 frontier LLMs through 10,480 total evaluations
  • Tests span six domains: learning, metacognitive calibration, social cognition, attention, executive function, and prospective regulation, each based on established experimental paradigms
  • After each answer, models choose to KEEP or WITHDRAW their response and place a BET; the key metric is the 'withdraw delta', the difference in withdrawal rates between incorrect and correct answers
  • Five of six task groups were pre-registered on the Open Science Framework before data collection to ensure methodological rigor
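
As a rough illustration of the withdraw delta described above, the sketch below computes it from a handful of made-up trials. The `Trial` record, the function names, and the sign convention (withdrawal rate on incorrect answers minus withdrawal rate on correct answers) are assumptions for illustration only; the summary does not specify the benchmark's data format or exact formula.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One hypothetical evaluation record (not the benchmark's real schema)."""
    correct: bool   # whether the model's answer was right
    withdrew: bool  # True if the model chose WITHDRAW, False if it chose KEEP


def withdraw_rate(trials: list[Trial]) -> float:
    """Fraction of trials in which the model withdrew its answer."""
    if not trials:
        raise ValueError("need at least one trial to compute a rate")
    return sum(t.withdrew for t in trials) / len(trials)


def withdraw_delta(trials: list[Trial]) -> float:
    """Assumed convention: rate on incorrect answers minus rate on correct ones."""
    incorrect = [t for t in trials if not t.correct]
    correct = [t for t in trials if t.correct]
    return withdraw_rate(incorrect) - withdraw_rate(correct)


# Toy data: the model withdraws 3 of 4 wrong answers and 1 of 4 right answers.
trials = [
    Trial(correct=True, withdrew=False),
    Trial(correct=True, withdrew=False),
    Trial(correct=True, withdrew=True),
    Trial(correct=True, withdrew=False),
    Trial(correct=False, withdrew=True),
    Trial(correct=False, withdrew=True),
    Trial(correct=False, withdrew=False),
    Trial(correct=False, withdrew=True),
]

delta = withdraw_delta(trials)
print(f"withdraw delta = {delta:.2f}")
```

Under this convention, a delta near zero would suggest the model withdraws about as often whether it is right or wrong, while a larger positive value would suggest it withdraws selectively on its errors.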
