A study audits four LLMs for reliability in psychiatric hospitalization risk assessment, finding that clinically insignificant variables increase both predicted risk scores and output variability across all models.

arXiv cs.LG · April 27, 2026

AI Summary

  • Researchers evaluated Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, and GPT-4o mini using synthetic patient profiles (n = 50) with 15 clinically relevant features and up to 50 clinically insignificant features, tested across four prompt reframings (neutral, logical, human impact, clinical judgment).
  • Including medically insignificant variables resulted in a statistically significant increase in absolute mean predicted hospitalization risk and output variability across all models and prompts, indicating reduced predictive stability as contextual noise increased. Prompt variations independently affected the trajectory of instability in a model-dependent manner.
  • The findings demonstrate that LLM-based psychiatric risk assessments are sensitive to non-clinical information, highlighting the need for systematic evaluations of attributional stability and uncertainty behavior before clinical deployment.
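The audit design above (synthetic profiles with a fixed clinical core, a growing set of irrelevant features, and repeated scoring to measure drift) can be illustrated with a toy simulation. Everything here is a hypothetical stand-in: `predict_risk` is a stub for the real LLM queries, and its jitter term is hand-tuned to mimic the reported effect (rising mean and variability with contextual noise), not derived from the paper.

```python
import random
import statistics

def make_profile(n_noise):
    """Hypothetical synthetic profile: 15 clinically relevant features
    plus n_noise clinically insignificant ones (values in [0, 1])."""
    profile = {f"clinical_{i}": random.random() for i in range(15)}
    profile.update({f"noise_{i}": random.random() for i in range(n_noise)})
    return profile

def predict_risk(profile):
    """Stub risk score in [0, 1]. The real study queried four LLMs;
    here, irrelevant features inject a small, slightly positive-biased
    perturbation to imitate the reported instability."""
    base = sum(v for k, v in profile.items() if k.startswith("clinical_")) / 15
    jitter = sum(random.uniform(-0.018, 0.022)
                 for k in profile if k.startswith("noise_"))
    return min(1.0, max(0.0, base + jitter))

def stability_audit(n_profiles=50, noise_levels=(0, 25, 50), repeats=10):
    """For each noise level, score every profile repeatedly and report
    the mean predicted risk and its standard deviation."""
    results = {}
    for n_noise in noise_levels:
        random.seed(0)  # reseed per level so runs are reproducible
        scores = []
        for _ in range(n_profiles):
            profile = make_profile(n_noise)
            scores.extend(predict_risk(profile) for _ in range(repeats))
        results[n_noise] = (statistics.mean(scores), statistics.stdev(scores))
    return results

for n_noise, (mean, sd) in stability_audit().items():
    print(f"noise features={n_noise:2d}  mean risk={mean:.3f}  sd={sd:.3f}")
```

In this sketch, both the mean score and the spread grow as irrelevant features are added, which is the qualitative pattern the study reports; a real audit would substitute API calls to each model and the four prompt reframings for the stub.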
