Abstract: Large language models (LLMs), such as ChatGPT and DeepSeek, are increasingly used by patients to obtain medical information about their health conditions. While several studies have explored ChatGPT’s performance on spinal cord injury (SCI)-related queries, no direct comparison between GPT-4o and DeepSeek-V3 has been conducted in this context. Forty-eight questions spanning the five most-searched SCI-related topics were generated from top Google Trends search terms. Responses were generated using GPT-4o and DeepSeek-V3, with three outputs per question. The resulting two hundred and eighty-eight responses were independently evaluated by three rehabilitation physicians using the S.C.O.R.E. framework, which rates five domains on a 5-point Likert scale: Safety, Consensus with Guidelines, Objectivity, Reproducibility, and Explainability. Paired t-tests were used to compare model performance. Both models achieved high ratings for Safety and Consensus with Guidelines, with no significant differences between them in these two domains. DeepSeek-V3 scored slightly but significantly higher in Objectivity (P = 0.014), Reproducibility (P = 0.007), and Explainability (P < 0.001). Qualitative review highlighted more consistent and contextually richer answers from DeepSeek-V3. While both GPT-4o and DeepSeek-V3 are generally safe and informative tools for SCI patient education, DeepSeek-V3 demonstrated slightly superior performance in delivering consistent, objective, and well-explained responses. LLMs may serve as useful adjuncts in SCI patient education, but ongoing evaluation and clinician oversight remain essential.