Performance of ChatGPT, Gemini and DeepSeek for non-critical triage support using real-world conversations in emergency department.

Author(s): Lee S; Jung S; Park JH; Cho H; Moon S; Ahn S
Source:
BMC emergency medicine [BMC Emerg Med] 2025 Sep 01; Vol. 25 (1), pp. 176. Date of Electronic Publication: 2025 Sep 01.
Publication Type:
Journal Article; Multicenter Study; Observational Study
Language:
English

Additional Information
- Source:
  Publisher: BioMed Central Country of Publication: England NLM ID: 100968543 Publication Model: Electronic Cited Medium: Internet ISSN: 1471-227X (Electronic) Linking ISSN: 1471227X NLM ISO Abbreviation: BMC Emerg Med Subsets: MEDLINE
- Publication Information:
  Original Publication: London : BioMed Central, [2001-
- Subject Terms:
  Triage*/methods ; Clinical Decision-Making*/methods ; Generative Artificial Intelligence* ; Emergency Service, Hospital* ; Large Language Models*; Retrospective Studies ; Prospective Studies ; Communication ; Sensitivity and Specificity ; Republic of Korea ; Tertiary Care Centers ; Humans ; Male ; Female ; Young Adult ; Adult ; Middle Aged ; Aged
- Abstract:
  Background: Timely and accurate triage is crucial for emergency department (ED) care. Recently, there has been growing interest in applying large language models (LLMs) to support triage decision-making. However, most existing studies have evaluated these models using simulated scenarios rather than real-world clinical cases. Therefore, we evaluated the performance of multiple commercial LLMs for non-critical triage support in the ED using real-world clinical conversations.
  Methods: We retrospectively analyzed real-world triage conversations prospectively collected from three tertiary hospitals in South Korea. Multiple commercial LLMs (OpenAI GPT-4o, GPT-4.1, and o3; Google Gemini 2.0 Flash, Gemini 2.5 Flash, and Gemini 2.5 Pro; DeepSeek V3 and DeepSeek R1) were evaluated for accuracy in triaging patient urgency based solely on unsummarized dialogue. The Korean Triage and Acuity Scale (KTAS) level assigned by triage nurses was used as the gold standard for evaluating the LLM classifications. Model performance was assessed under both a zero-shot prompting condition and a few-shot prompting condition that included representative examples.
  Results: A total of 1,057 triage cases were included in the analysis. Among the models, Gemini 2.5 Flash achieved the highest accuracy (73.8%), specificity (88.9%), and positive predictive value (PPV; 94.0%). Gemini 2.5 Pro demonstrated the highest sensitivity (90.9%) and F1-score (82.4%), though with lower specificity (23.3%). GPT-4.1 also showed balanced, high accuracy (70.6%) and sensitivity (81.3%) with a practical response time (1.79 s). Performance varied widely between models and even between different versions from the same vendor. With few-shot prompting, most models showed further improvements in accuracy and F1-score.
  Conclusions: LLMs can accurately triage ED patient urgency using real-world clinical conversations. Several models demonstrated both high sensitivity and acceptable response times, supporting the feasibility of LLMs in non-critical triage support tools in diverse clinical environments. These findings apply to non-critical patients (KTAS 3-5), and further research should address integration with objective clinical data and real-time workflows.
  (© 2025. The Author(s).)
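The metrics reported in the Results (accuracy, sensitivity, specificity, PPV, and F1-score) can all be derived from a two-by-two confusion matrix comparing model output with the nurse-assigned reference. The following is a minimal sketch of that arithmetic, not the authors' code. The abstract does not state how the binary outcome was defined, so treating KTAS 3 as the positive class (versus KTAS 4-5) is an assumption made only for illustration, and the function name, class name, and sample labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    ppv: float
    f1: float


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a class is absent from the sample.
    return numerator / denominator if denominator else float("nan")


def binary_metrics(reference: Sequence[bool], predicted: Sequence[bool]) -> BinaryMetrics:
    """Compute metrics from paired labels, where True marks the positive class."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    fp = sum(1 for r, p in pairs if not r and p)
    tn = sum(1 for r, p in pairs if not r and not p)

    return BinaryMetrics(
        accuracy=_ratio(tp + tn, len(pairs)),
        sensitivity=_ratio(tp, tp + fn),   # recall on the positive class
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),           # precision
        f1=_ratio(2 * tp, 2 * tp + fp + fn),
    )


# Hypothetical KTAS levels for eight cases (illustrative only, not study data).
nurse_ktas = [3, 4, 5, 3, 4, 3, 5, 4]
model_ktas = [3, 4, 4, 4, 4, 3, 3, 5]

# ASSUMPTION: the positive class is KTAS 3; the abstract does not define it.
metrics = binary_metrics(
    [level == 3 for level in nurse_ktas],
    [level == 3 for level in model_ktas],
)
print(metrics)
```

Because sensitivity and specificity trade off against each other, a model can reach a high F1-score while its specificity stays low, which is the pattern the Results describe for Gemini 2.5 Pro.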
- Declarations:
  Ethics approval and consent to participate: This study was approved by the Institutional Review Board of Korea University Ansan Hospital (IRB No. 2025AS0116) and conducted in accordance with the principles of the Declaration of Helsinki. The requirement for informed consent was waived due to the retrospective analysis of publicly available anonymized data. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.
- Contributed Indexing:
  Keywords: Artificial intelligence; Clinical conversation; Korean triage and acuity scale; Large language model; Triage
- Publication Date:
  Date Created: 20250901 Date Completed: 20250917 Latest Revision: 20250917
- Publication Date:
  20260130
- Accession Number:
  PMC12403343
- Accession Number:
  10.1186/s12873-025-01337-2
- Accession Number:
  40890624
