New self-supervised method enriches medical imaging reports by adding omitted positive findings, boosting vision-language model performance by up to 7.47%
arXiv cs.LG · April 14, 2026
AI Summary
• SemEnrich addresses a bias in medical datasets: clinicians predominantly report abnormalities and omit positive/neutral findings
• The method semantically clusters report sentences and uses the clusters to automatically enrich training reports with relevant observations from clusters they do not already cover
• Reported gains: 5.63% on COMET, 7.47% on RadGraph-F1, 7.40% on Sentence BLEU, and 5.30% on CheXbert-F1
• Ablation studies confirmed that the semantic clustering, not random data augmentation, drives the improvements
• The researchers also developed a way to incorporate semantic cluster information into reward design for GRPO training
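
The clustering-and-enrichment idea in the summary can be illustrated with a toy sketch. This is not the paper's algorithm: the summary does not say how sentences are embedded or clustered, or how the sentences to add are chosen. Everything below is an assumption made for illustration, including the function names (`cluster_sentences`, `enrich`), the greedy Jaccard word-overlap clustering (a stand-in for real semantic embeddings), the 0.3 threshold, and the rule of appending the first sentence from each cluster a report does not cover.

```python
import re
from collections import defaultdict


def tokens(sentence):
    """Lowercased word set; a crude stand-in for a semantic embedding (assumption)."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_sentences(sentences, threshold=0.3):
    """Greedy clustering: join the first cluster whose seed sentence is
    similar enough, otherwise start a new cluster. Hypothetical helper."""
    seeds = []   # token set of the first sentence in each cluster
    labels = []  # cluster id per input sentence
    for sentence in sentences:
        toks = tokens(sentence)
        for cluster_id, seed in enumerate(seeds):
            if jaccard(toks, seed) >= threshold:
                labels.append(cluster_id)
                break
        else:
            seeds.append(toks)
            labels.append(len(seeds) - 1)
    return labels


def enrich(report, corpus, labels):
    """Append one sentence from every cluster the report does not cover.
    The selection rule (first sentence of the cluster) is an assumption."""
    by_cluster = defaultdict(list)
    for sentence, label in zip(corpus, labels):
        by_cluster[label].append(sentence)
    label_of = dict(zip(corpus, labels))
    covered = {label_of[s] for s in report if s in label_of}
    extra = [by_cluster[c][0] for c in sorted(by_cluster) if c not in covered]
    return report + extra


# Made-up example sentences, not taken from any dataset.
corpus = [
    "The heart size is normal.",
    "Heart size is within normal limits.",
    "There is a small left pleural effusion.",
    "No pleural effusion is seen.",
    "No pneumothorax.",
]
labels = cluster_sentences(corpus)
report = ["There is a small left pleural effusion."]
enriched = enrich(report, corpus, labels)
print(labels)
print(enriched)
```

In this toy version the report that mentions only an effusion gains one sentence each from the heart-size and pneumothorax clusters. A real pipeline would need to verify that any added observation is actually true for the image in question, which this sketch does not attempt.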