The paper introduces MedHELM, a comprehensive framework for evaluating large language models across a broad spectrum of medical and operational tasks. Developed in collaboration with clinicians, the suite organizes 37 diverse benchmarks under a hierarchical taxonomy to assess functions ranging from clinical decision support to administrative workflows.
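As a rough illustration (not the authors' actual schema), a hierarchical benchmark taxonomy of this kind can be modeled as nested categories, each grouping named benchmarks. The category and benchmark names below are hypothetical placeholders drawn from the task areas mentioned above.

```python
from dataclasses import dataclass, field

@dataclass
class Benchmark:
    """A single evaluation dataset and its scoring metric."""
    name: str
    metric: str  # e.g. exact-match accuracy, or an LLM-jury rating

@dataclass
class Category:
    """A taxonomy node: a task area grouping benchmarks and subcategories."""
    name: str
    benchmarks: list[Benchmark] = field(default_factory=list)
    subcategories: list["Category"] = field(default_factory=list)

# Hypothetical slice of such a taxonomy; the entries are illustrative only.
taxonomy = Category(
    name="Medical LLM Evaluation",
    subcategories=[
        Category(
            name="Clinical Decision Support",
            benchmarks=[Benchmark("diagnosis_qa", "exact_match")],
        ),
        Category(
            name="Administration and Workflow",
            benchmarks=[Benchmark("discharge_summary_gen", "llm_jury_score")],
        ),
    ],
)

def count_benchmarks(node: Category) -> int:
    """Recursively total the benchmarks under a taxonomy node."""
    return len(node.benchmarks) + sum(count_benchmarks(c) for c in node.subcategories)

print(count_benchmarks(taxonomy))  # -> 2 in this toy slice
```

A registry structured this way also makes it straightforward to aggregate scores per category rather than only overall.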
The research highlights that rankings from general-purpose AI leaderboards often fail to predict medical competence: several top-tier models showed significant performance drops when confronted with real-world healthcare data. To keep assessment scalable without sacrificing nuance, the authors implemented an LLM-jury system that demonstrates strong agreement with human expert ratings while containing the high cost of professional clinical review.
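The paper's exact jury protocol is not reproduced here; the sketch below shows one plausible shape for such a system, in which several judge models independently score a response against a rubric and the scores are averaged. `call_judge` and the judge names are hypothetical placeholders for real model API calls.

```python
from statistics import mean

def call_judge(judge_model: str, prompt: str) -> float:
    """Hypothetical placeholder: swap in a real LLM API call returning a 1-5 score."""
    return {"judge-a": 4.0, "judge-b": 5.0, "judge-c": 4.0}[judge_model]

RUBRIC = (
    "Rate the response from 1 (poor) to 5 (excellent) for clinical accuracy "
    "and completeness. Reply with a single number."
)

def jury_score(question: str, response: str,
               judges: tuple[str, ...] = ("judge-a", "judge-b", "judge-c")) -> float:
    """Average independent rubric scores from several judge models."""
    prompt = f"{RUBRIC}\n\nQuestion:\n{question}\n\nResponse:\n{response}"
    return mean(call_judge(j, prompt) for j in judges)

print(jury_score("What is first-line therapy for X?", "Drug Y, because..."))  # -> 4.33...
```

Agreement with clinicians can then be estimated by correlating jury scores with expert ratings on a doubly-scored sample.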
Ultimately, the study identifies DeepSeek R1 and o3-mini as the current performance leaders, while models such as Claude 3.5 Sonnet are noted for striking a superior balance between clinical accuracy and computational cost.
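To make that accuracy-versus-cost tradeoff concrete, the short sketch below computes a Pareto frontier over (score, cost) pairs; all model names and figures are invented for illustration and are not results from the paper.

```python
# Hypothetical (mean benchmark score, USD cost per 1M tokens) pairs;
# the figures are invented for illustration, not taken from the paper.
models = {
    "model-a": (0.82, 8.00),
    "model-b": (0.81, 1.10),
    "model-c": (0.78, 3.00),
    "model-d": (0.74, 0.60),
}

def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """Keep models not dominated by another with a score at least as high and a cost at least as low."""
    frontier = []
    for name, (score, cost) in models.items():
        dominated = any(
            s >= score and c <= cost and (s > score or c < cost)
            for other, (s, c) in models.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))  # -> ['model-a', 'model-b', 'model-d']
```

Here "model-c" is dominated: "model-b" scores higher at lower cost, which is the sense in which a cheaper model can offer a better balance than a nominal leader.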
References:
- Bedi S, Cui H, Fuentes M, et al. Holistic evaluation of large language models for medical tasks with MedHELM. Nature Medicine, 2026: 1-9.