AIToday

Two new AI medical systems match or beat experienced doctors on diagnostic accuracy and treatment planning in controlled studies, but researchers caution the results may not translate to real-world hospitals.

THE DECODER · 1d ago · 3 min read
3 Key Points

  1. What happened: Two research teams published studies showing autonomous AI agents handling medical tasks. MIRA, developed at TUD Dresden and Heidelberg University, reached the correct diagnosis in 88.9 percent of more than 500 real emergency department cases. In a head-to-head comparison it scored 87.8 percent, versus 78.1 percent for specialists and 71.1 percent for residents. Google's AMIE system, which uses two agents working together on patient conversations and guideline matching, was rated appropriate on 95 percent of first-visit plans, compared with 72 percent for 21 primary care physicians, across 100 cases.

  2. Why it matters: These results suggest AI systems could handle routine diagnostic and planning tasks that currently consume physician time. However, the authors and independent experts stress important caveats: both studies used simulations rather than real clinical settings, MIRA's test cases may have come from training data, AMIE's text-only interactions don't capture messy real-world patient communication, and neither system has yet been tested in live hospitals. A researcher from MIRA's team compared such AI to an airplane's autopilot: capable of handling routine work, but requiring physicians to retain ultimate responsibility.

  3. What to watch: A buried finding in AMIE's supplementary analysis reveals a risk to these systems' longevity. When Google's researchers ran the same specialized setup on the newer Gemini 2.5 Flash instead of the older Gemini 1.5 Flash used in the study, the performance advantage 'almost vanished.' Newer general-purpose models like Gemini 2.5 Pro, o3, and GPT-5 already score 'largely comparable' to the full AMIE system on drug knowledge tests, suggesting that the elaborate scaffolding designed to compensate for older models' weaknesses becomes redundant as models improve. The source code for MIRA is available on GitHub.
