Large language models (LLMs) are highly sophisticated deep learning models pre-trained on massive datasets, with ChatGPT representing a prominent application of LLMs among generative models. Since the release of ChatGPT at the end of 2022, generative chatbots have been widely adopted across medical disciplines. In evidence-based medicine (EBM), a crucial discipline guiding clinical practice, the use of generative chatbots such as ChatGPT is gradually increasing. However, the potential, challenges, and intricacies of their application in EBM remain unclear. Through a review of the relevant literature, this paper explores and discusses the prospects, challenges, and considerations associated with applying ChatGPT in EBM. The discussion spans four aspects: evidence generation, evidence synthesis, evidence assessment, and evidence dissemination and implementation, providing researchers with insights into the latest developments and suggestions for future research.
Objective To systematically review the accuracy and consistency of large language models (LLMs) in assessing risk of bias in analytical studies. Methods Cohort and case-control studies related to COVID-19 were included, drawn from the team's published systematic review of the clinical characteristics of COVID-19. Two researchers independently screened the studies and extracted data; risk of bias of the included studies was then assessed automatically with the LLM-based BiasBee model (Non-RCT version). Kappa statistics and score differences were used to analyze agreement between LLM and human evaluations, with subgroup analyses for Chinese- and English-language studies. Results A total of 210 studies were included. Meta-analysis showed that LLM scores were generally higher than those of human evaluators, particularly for representativeness of exposed cohorts (Δ=0.764) and selection of external controls (Δ=0.109). Kappa analysis indicated slight agreement on items such as exposure assessment (κ=0.059) and adequacy of follow-up (κ=0.093), but marked discrepancies on more subjective items such as control selection (κ=−0.112) and non-response rate (κ=−0.115). Subgroup analysis revealed higher scoring consistency for the LLM in English-language studies than in Chinese-language studies. Conclusion LLMs demonstrate potential in risk of bias assessment; however, notable differences remain on more subjective items. Future research should focus on optimizing prompt engineering and model fine-tuning to enhance LLM accuracy and consistency in complex tasks.
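As a minimal sketch of the agreement analysis this abstract describes, the snippet below computes Cohen's kappa and the mean score difference (Δ) between paired human and LLM ratings for a single checklist item. The ratings are illustrative placeholders, not the study's data, and the binary coding (star awarded or not) is an assumption about how items were scored.

```python
# Sketch of per-item human-vs-LLM agreement; data are invented for illustration.
from sklearn.metrics import cohen_kappa_score

# Paired ratings for one risk-of-bias item across eight studies:
# 1 = item awarded, 0 = not awarded (hypothetical coding).
human = [1, 0, 1, 1, 0, 1, 1, 0]
llm   = [1, 1, 1, 0, 0, 1, 1, 1]

kappa = cohen_kappa_score(human, llm)                        # chance-corrected agreement
delta = sum(l - h for h, l in zip(human, llm)) / len(human)  # mean score difference (Δ)
print(f"kappa = {kappa:.3f}, delta = {delta:.3f}")
```

In practice this would be repeated item by item, which is how item-level values such as κ=0.059 for exposure assessment could be obtained.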
Large language models (LLMs), a key component of artificial intelligence (AI), represent a significant breakthrough in natural language processing. As the capabilities of LLMs continue to evolve, their potential applications and future implications in clinical medical education warrant considerable attention. This study systematically reviews the development of LLMs, explores their innovative applications within the context of current challenges in clinical medical education, and critically assesses both the advantages and limitations of their implementation. The objective is to provide a comprehensive reference for the continued integration of AI-driven LLMs into clinical medical education.
As the volume of medical research using large language models (LLMs) surges, the need for standardized and transparent reporting becomes increasingly critical. In January 2025, Nature Medicine published the TRIPOD-LLM reporting guideline for studies using large language models. This represents the first comprehensive reporting framework specifically tailored to studies that develop prediction models based on LLMs. It comprises a checklist with 19 main items (encompassing 50 sub-items), a flowchart, and an abstract checklist (containing 12 items). This article interprets TRIPOD-LLM's development methods, primary content, scope, and the specific details of its items. The goal is to help researchers, clinicians, editors, and healthcare decision-makers deeply understand and correctly apply TRIPOD-LLM, thereby improving the quality and transparency of reporting in LLM medical research and promoting the standardized and ethical integration of LLMs into healthcare.
Objective To compare the performance of ChatGPT-4.5 and DeepSeek-V3 across five key domains of physical therapy for knee osteoarthritis (KOA), evaluating the accuracy, completeness, reliability, and readability of their responses and exploring their potential for clinical application. Methods Twenty-one core questions were extracted from 10 authoritative KOA rehabilitation guidelines published between September 2011 and January 2024, covering five task categories: rehabilitation assessment, physical agent modalities, exercise therapy, assistive device use, and patient education. Responses were generated with both ChatGPT-4.5 and DeepSeek-V3 and evaluated by four physical therapists, each with over five years of clinical experience, using Likert scales (accuracy and completeness: 5 points; reliability: 7 points). Scale scores were compared between the two large language models, and language style was additionally assessed by clustering analysis. Results Most scale scores did not follow a normal distribution and are therefore presented as median (lower quartile, upper quartile). ChatGPT-4.5 outperformed DeepSeek-V3 in accuracy [4.75 (4.75, 4.75) vs. 4.75 (4.50, 5.00), P=0.018], completeness [4.75 (4.50, 5.00) vs. 4.25 (4.00, 4.50), P=0.006], and reliability [5.75 (5.50, 6.00) vs. 5.50 (5.50, 5.50), P=0.015]. Clustering analysis of language styles revealed that ChatGPT-4.5 had a more diverse linguistic style, whereas DeepSeek-V3 responses were more standardized. ChatGPT-4.5 scored higher than DeepSeek-V3 in lexical richness [4.792 (4.720, 4.912) vs. 4.564 (4.409, 4.653), P<0.001] but lower in syntactic richness [2.133 (2.072, 2.154) vs. 2.187 (2.154, 2.206), P=0.003]. Conclusions ChatGPT-4.5 demonstrates superior accuracy, completeness, and reliability, indicating a stronger capacity for task execution; it uses a more diverse vocabulary and generates language more flexibly. DeepSeek-V3 exhibits greater syntactic richness and more standardized language. ChatGPT-4.5 is better suited to content-rich tasks requiring detailed explanation, while DeepSeek-V3 is more appropriate for standardized question-answering applications.
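The non-parametric comparison reported above can be outlined as follows: scores that fail a normality check are summarized as median (lower quartile, upper quartile) and compared with a rank-based test such as the Mann-Whitney U test. The sketch below uses invented scores, not the study's data, and assumes independent score sets per model.

```python
# Sketch of a median (Q1, Q3) summary plus Mann-Whitney U comparison;
# all values are hypothetical stand-ins for the raters' Likert scores.
import numpy as np
from scipy.stats import mannwhitneyu

gpt45 = np.array([4.75, 4.75, 5.00, 4.50, 4.75, 4.75, 5.00, 4.75])
dsv3  = np.array([4.75, 4.50, 4.50, 5.00, 4.25, 4.75, 4.50, 4.50])

for name, scores in [("ChatGPT-4.5", gpt45), ("DeepSeek-V3", dsv3)]:
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    print(f"{name}: {med:.2f} ({q1:.2f}, {q3:.2f})")   # median (lower, upper quartile)

u, p = mannwhitneyu(gpt45, dsv3, alternative="two-sided")  # no normality assumption
print(f"U = {u:.1f}, P = {p:.3f}")
```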
Objective To evaluate differences in the quality of recommendations generated by a large language model (LLM) and by clinical practitioners for sarcopenia-related questions. Methods A sarcopenia knowledge base was constructed from the latest domestic and international research and consensus guidelines. In a Python environment, a locally deployed, sarcopenia-focused hybrid vertical LLM (referred to as LC) was implemented via LangChain-LLM. Eight fixed questions covering etiology, diagnosis, and prevention were selected, along with eight virtual patient cases. The evaluation team assessed the quality of answers generated by LC and those written by clinical practitioners, and quantitative analysis was performed on the precision, recall, and F1 scores (the harmonic mean of precision and recall) of treatment recommendations. Results The responses were generally perceived as "possibly written by humans or AI", with a stronger inclination toward being AI-generated, although the accuracy of such judgments was low. Regarding answer quality, LC's responses were superior to those of clinical practitioners in guideline consistency (P<0.01), showed similar acceptability (P>0.05), better practicality (P<0.05), and a lower proportion of "1–2 errors" (P<0.05). Quantitative analysis of treatment recommendations, illustrated by the sketch below, indicated that LC and GPT-4.0 outperformed clinical practitioners in recall and F1 scores (P<0.05), with minimal differences between LC and GPT-4.0. Conclusion The locally deployed, sarcopenia-focused hybrid vertical LLM demonstrates high accuracy and applicability on sarcopenia-related questions, outperforming clinical practitioners and exhibiting strong clinical decision-support capability.
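To make the quantitative metrics concrete, the toy example below treats one answer's treatment recommendations and a guideline-based reference as sets and computes precision, recall, and F1 (the harmonic mean of precision and recall, as defined in the abstract). The recommendation items are hypothetical, and the set-based matching is an assumption about how recommendations were compared.

```python
# Toy precision/recall/F1 computation for one answer; item names are invented.
def precision_recall_f1(predicted: set, reference: set) -> tuple:
    """Set-based precision, recall, and F1 against a reference set."""
    tp = len(predicted & reference)                  # recommendations matching the reference
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical recommendation sets for one virtual sarcopenia case.
model_recs = {"resistance training", "protein supplementation", "vitamin D"}
reference  = {"resistance training", "protein supplementation", "balance training"}
print(precision_recall_f1(model_recs, reference))    # ≈ (0.667, 0.667, 0.667)
```

Averaging these per-case scores across the eight virtual cases would yield the model-level recall and F1 values compared in the Results.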
The burgeoning application of large language models (LLMs) in healthcare demonstrates immense potential, yet simultaneously poses new challenges for the standardization of research reporting. To enhance the transparency and reliability of medical LLM research, an international expert group published the TRIPOD-LLM reporting guideline in Nature Medicine in January 2025. As an extension of the TRIPOD+AI guideline, TRIPOD-LLM provides detailed reporting items tailored to the unique characteristics of LLMs, covering both general foundation models (e.g., GPT-4) and domain-specific fine-tuned models (e.g., Med-PaLM 2). It addresses critical aspects such as prompt engineering, inference parameters, generative evaluation, and fairness considerations. Notably, the guideline introduces an innovative modular design and a "living guideline" mechanism. This paper provides a systematic, item-by-item interpretation and example-based analysis of the TRIPOD-LLM guideline, intended to serve as a clear and practical handbook for researchers in this field, as well as for journal reviewers and editors responsible for assessing the quality of such studies, thereby fostering the high-quality development of medical LLM research in China.