This paper critically examines the use of multiple-choice question (MCQ) benchmarks to assess the medical knowledge and reasoning capabilities of Large Language Models (LLMs). The central argument is that high LLM performance on medical MCQs may overestimate true medical understanding, because scores can be driven by factors other than genuine knowledge and reasoning. The authors propose and utilize a novel benchmark of paired free-response and MCQ questions (FreeMedQA) to investigate this hypothesis.
This study provides compelling evidence that performance on medical MCQ benchmarks may not be a reliable indicator of the true medical knowledge and reasoning abilities of LLMs. The substantial performance drop on free-response questions, coupled with above-chance MCQ accuracy even when the question text is completely masked, suggests that LLMs may be exploiting the structure of the answer options rather than demonstrating genuine understanding. The findings underscore the importance of developing and utilizing more rigorous evaluation methods, such as free-response questions, to accurately assess the potential and limitations of LLMs in medical applications.
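To make the masked-question probe concrete, the sketch below shows one way such an experiment could be set up: the question stem is replaced with a mask token while the answer options are left intact, and MCQ accuracy is then compared against the chance rate. This is a minimal illustration under stated assumptions, not the paper's actual protocol; the `mask_question`, `mcq_accuracy`, and `answer_fn` names are hypothetical, and the real FreeMedQA prompts and masking scheme may differ.

```python
import random
from typing import Callable, Sequence

def mask_question(question: str, fraction: float = 1.0, token: str = "[MASKED]") -> str:
    """Replace a fraction of the question's words with a mask token.
    fraction=1.0 corresponds to a fully masked question stem."""
    words = question.split()
    n_mask = int(round(fraction * len(words)))
    masked_idx = set(random.sample(range(len(words)), n_mask))
    return " ".join(token if i in masked_idx else w for i, w in enumerate(words))

def mcq_accuracy(
    items: Sequence[dict],                              # each: {"question", "options", "answer_idx"}
    answer_fn: Callable[[str, Sequence[str]], int],     # model wrapper: (question, options) -> chosen index
    mask_fraction: float = 0.0,
) -> float:
    """Score multiple-choice accuracy, optionally masking the question text
    so that only the answer options carry information."""
    correct = 0
    for item in items:
        q = mask_question(item["question"], mask_fraction) if mask_fraction > 0 else item["question"]
        pred = answer_fn(q, item["options"])
        correct += int(pred == item["answer_idx"])
    return correct / len(items)

if __name__ == "__main__":
    # Toy example with a random-guessing stand-in for the model;
    # in practice answer_fn would wrap an LLM call and parse its chosen option.
    demo = [{
        "question": "A 45-year-old presents with crushing substernal chest pain radiating to the left arm.",
        "options": ["Aortic dissection", "Myocardial infarction", "GERD", "Costochondritis"],
        "answer_idx": 1,
    }]
    guess = lambda q, opts: random.randrange(len(opts))
    print("Fully masked-question accuracy:", mcq_accuracy(demo, guess, mask_fraction=1.0))
```

Under this setup, accuracy on fully masked questions that remains well above 1/k for k options would indicate that the options alone leak exploitable signal, which is the pattern the review describes.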