Performance of Large Language Models in Diagnosing Rare Hematologic Diseases and the Impact of Their Diagnostic Outputs on Physicians: Combined Retrospective and Prospective Study.

Author(s): Yu, Hongbin¹ (AUTHOR); Chen, Tian¹ (AUTHOR); Zhang, Xin¹ (AUTHOR); Yang, Yunfan¹ (AUTHOR); Liu, Qinyu¹ (AUTHOR); Yang, Chenlu¹ (AUTHOR); Shen, Kai¹ (AUTHOR); Li, He¹ (AUTHOR); Tang, Wenjiao¹ (AUTHOR); Zhong, Xushu¹ (AUTHOR); Shuai, Xiao¹ (AUTHOR); Yu, Xinmei¹ (AUTHOR); Liao, Yi¹ (AUTHOR); Wang, Chiyi¹ (AUTHOR); Zhu, Huanling¹ (AUTHOR); Wu, Yu¹ (AUTHOR)
Source:
Journal of Medical Internet Research. 2025, Vol. 27, p1-14. 14p.
Subject Terms:
*Longitudinal method; *Retrospective studies; Blood diseases; Diagnosis; Causal inference; Interdisciplinary research; Language models

Additional Information
- Abstract:
  Background: Rare hematologic diseases are frequently underdiagnosed or misdiagnosed due to their clinical complexity. Whether new-generation large language models (LLMs), particularly those using chain-of-thought reasoning, can improve diagnostic accuracy remains unclear.
  Objective: This study aimed to evaluate the diagnostic performance of new-generation commercial LLMs in rare hematologic diseases and to determine whether the LLM output enhances physicians' diagnostic accuracy.
  Methods: We conducted a 2-phase study. In the retrospective phase, we evaluated 7 mainstream LLMs on 158 nonpublic real-world admission records covering 9 rare hematologic diseases, assessed diagnostic performance using top-10 accuracy and mean reciprocal rank (MRR), and evaluated ranking stability via Jaccard similarity and entropy. Spearman rank correlation was used to examine the association between physicians' diagnoses and LLM-generated outputs. In the prospective phase, 28 physicians with varying levels of experience diagnosed 5 cases each, gaining access to LLM-generated diagnoses across 3 sequential steps to assess whether LLMs can improve diagnostic accuracy.
  Results: In the retrospective phase, ChatGPT-o1-preview demonstrated the highest top-10 accuracy (70.3%) and MRR (0.577), and DeepSeek-R1 ranked second. Diagnostic performance was low for amyloid light-chain (AL) amyloidosis; Castleman disease; Erdheim-Chester disease; and polyneuropathy, organomegaly, endocrinopathy, monoclonal gammopathy, and skin changes (POEMS) syndrome. Interestingly, higher accuracy often correlated with lower ranking stability across most LLMs. The physician performance showed a strong correlation with both top-10 accuracy (ρ=0.565) and MRR (ρ=0.650). In the prospective phase, LLMs significantly improved the diagnostic accuracy of less-experienced physicians; no significant benefit was observed for specialists. However, when LLMs generated biased responses, physician performance often failed to improve or even declined.
  Conclusions: Without fine-tuning, new-generation commercial LLMs, particularly those with chain-of-thought reasoning, can identify diagnoses of rare hematologic diseases with high accuracy and significantly enhance the diagnostic performance of less-experienced physicians. Nevertheless, biased LLM outputs may mislead clinicians, highlighting the need for critical appraisal and cautious clinical integration with appropriate safeguard systems.
  Trial Registration: Chinese Clinical Trial Registry ChiCTR2400089959; https://www.chictr.org.cn/hvshowproject.html?id=260575 [ABSTRACT FROM AUTHOR]
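
The abstract names its metrics (top-10 accuracy, MRR, Jaccard similarity) without defining them. The sketch below shows one common way such metrics are computed. It is not the authors' code. It assumes exact string matching between a predicted and a reference diagnosis, a reciprocal rank of 0 when the reference diagnosis is absent from the list, and Jaccard similarity taken over two diagnosis lists treated as sets. The function names and the toy cases are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], truths: Sequence[str], k: int = 10) -> float:
    """Fraction of cases whose reference diagnosis appears in the first k predictions."""
    hits = sum(1 for preds, truth in zip(ranked, truths) if truth in preds[:k])
    return hits / len(truths)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], truths: Sequence[str]) -> float:
    """Mean of 1/rank of the reference diagnosis; a miss contributes 0 (assumption)."""
    total = 0.0
    for preds, truth in zip(ranked, truths):
        if truth in preds:
            total += 1.0 / (list(preds).index(truth) + 1)
    return total / len(truths)


def jaccard_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Size of the intersection over size of the union of two diagnosis lists."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty lists are treated as identical (assumption)
    return len(set_a & set_b) / len(set_a | set_b)


# Toy data for illustration only; not taken from the study.
ranked_lists = [
    ["POEMS syndrome", "Castleman disease"],
    ["AL amyloidosis", "Erdheim-Chester disease"],
    ["Castleman disease"],
]
reference = ["Castleman disease", "AL amyloidosis", "POEMS syndrome"]

top10 = top_k_accuracy(ranked_lists, reference, k=10)
mrr = mean_reciprocal_rank(ranked_lists, reference)
overlap = jaccard_similarity(ranked_lists[0], ranked_lists[2])
```

On this toy data, two of the three reference diagnoses appear in their lists, at ranks 2 and 1, so MRR rewards the second case more than the first even though both count equally toward top-10 accuracy.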
