What happened: Researchers at Mass General Brigham developed BRIDGE, a multilingual benchmark that evaluates how well AI language models understand clinical text from electronic health records, clinical case reports, and patient-doctor consultations across nine languages. They tested 95 LLMs on data drawn from 59 clinical sources and found that while the best-performing model scored 92 on standardized medical exams, it earned only 44.8% on BRIDGE, exposing significant gaps in understanding nuanced clinical language.
Why it matters: Medical AI has traditionally been evaluated using standardized licensing exam questions that do not fully reflect the complexity of real-world clinical interactions. BRIDGE addresses this by assessing AI on actual patient-care tasks—including triage, diagnosis, prognosis, and billing coding—spanning 14 clinical specialties. This helps clinicians and developers understand which AI tools are genuinely ready for clinical use and where performance still falls short.
What to watch: The researchers created a public, continuously updated leaderboard that now includes 107 LLMs, so clinicians and AI developers can compare model performance across clinical tasks. Because BRIDGE includes clinical data in nine languages, it is also designed to help identify and address performance gaps for non-English-speaking patients.