Mass General Brigham researchers created BRIDGE, a benchmark that reveals AI models perform far worse on real clinical tasks than on medical licensing exams, highlighting the gap between lab performance and patient-care readiness.

Hacker News · 23h ago · 2 min read

3 Key Points

1
What happened: Researchers at Mass General Brigham developed BRIDGE, a multilingual benchmark that evaluates how well AI language models understand clinical text from electronic health records, clinical case reports, and patient-doctor consultations across nine languages. After testing 95 LLMs on text drawn from 59 clinical sources, they found that while the best-performing model scored 92 on standardized medical exams, it earned only 44.8% on BRIDGE, exposing significant gaps in understanding nuanced clinical language.
2
Why it matters: Medical AI has traditionally been evaluated using standardized licensing exam questions, which do not fully reflect the complexity of real-world clinical interactions. BRIDGE addresses this by assessing AI on actual patient-care tasks, including triage, diagnosis, prognosis, and billing coding, that span 14 clinical specialties. This helps clinicians and developers understand which AI tools are genuinely ready for clinical use and where performance still falls short.
3
What to watch: The researchers created a public, continuously updated leaderboard that now includes 107 LLMs, enabling clinicians and AI developers to compare model performance across clinical tasks. Because BRIDGE includes clinical data in nine languages, it is designed to help identify and address performance gaps for non-English-speaking patients.
