
Benchmarking the diagnostic performance of open source LLMs in 1933 Eurorad case reports



Abstract

Recent advancements in large language models (LLMs) have created new ways to support radiological diagnostics. While both open-source and proprietary LLMs can address privacy concerns through local or cloud deployment, open-source models provide advantages in continuity of access and potentially lower costs.

This study evaluated the diagnostic performance of fifteen open-source LLMs and one closed-source LLM (GPT-4o) in 1,933 cases from the Eurorad library. LLMs provided differential diagnoses based on clinical history and imaging findings. Responses were considered correct if the true diagnosis appeared in the top three suggestions.

Models were further tested on 60 non-public brain MRI cases from a tertiary hospital to assess generalizability. In both datasets, GPT-4o demonstrated superior performance, closely followed by Llama-3-70B, revealing how open-source LLMs are rapidly closing the gap to proprietary models. Our findings highlight the potential of open-source LLMs as decision support tools for radiological differential diagnosis in challenging, real-world cases.


Introduction


Recent advancements in artificial intelligence (AI) have transformed medical diagnostics, offering innovative tools to support clinical decision-making. One promising development is the emergence of large language models (LLMs), which excel at processing and generating natural language. In radiology, these models have demonstrated potential in various applications, including defining study protocols [1,2], performing differential diagnosis [3,4], generating reports [5,6], and extracting information from free-text reports [7,8].

However, a significant barrier to widespread clinical adoption is data privacy. The LLMs primarily used in previous studies are proprietary, closed-source models, such as GPT-4, Claude 3, or Gemini [9,10,11]. Access to these models is typically provided via web-based interfaces or via application programming interfaces (APIs), both of which necessitate the transfer of data to third-party servers, thereby increasing the risk of unauthorized access or misuse of sensitive health information and limiting their use on patient data.

While cloud-based solutions for proprietary LLMs can address some privacy concerns, they may still be subject to commercial update cycles and potentially higher long-term costs.

Open-source models offer a viable alternative, enabling care institutions to retain patient data within their local infrastructure, mitigating these privacy concerns, and providing continuity of access independent of commercial update cycles, which can lower costs due to their free availability. While historically open-source LLMs have underperformed in clinical decision support tasks [12,13], Meta’s latest Llama-3 has shown performance levels on par with leading proprietary models in some areas, such as answering radiology board exam questions [14]. However, the diagnostic accuracy of such models in real-world clinical cases remains largely unexplored.

A well-suited resource for such an evaluation is Eurorad, a comprehensive repository of peer-reviewed radiological case reports managed by the European Society of Radiology (ESR). Eurorad serves as a valuable educational resource for radiologists, residents, and medical students, and encompasses a wide range of cases across radiological subspecialties such as abdominal imaging, neuroradiology, uroradiology, and pediatric radiology [15].

Therefore, the aim of this study was to evaluate the performance of state-of-the-art open-source LLMs in radiological diagnosis using Eurorad case reports.


Results


Dataset


The initial dataset retrieved from the Eurorad library consisted of 4827 case reports. Using the Llama-3-70B model, we identified 2894 cases where the diagnosis was explicitly stated within the case description. These cases were subsequently excluded, resulting in a final dataset of 1933 cases for analysis.

This filtering process ensured that the LLMs were evaluated on genuinely challenging cases that required inference rather than simple information extraction. The dataset was primarily composed of cases from neuroradiology (21.4%), abdominal imaging (18.1%), and musculoskeletal imaging (14.6%), whereas breast imaging (3.4%) and interventional radiology (1.4%) were underrepresented (Table 1). This distribution broadly reflects the relative prevalence of different radiological subspecialties in clinical practice.

Table 1 Dataset composition by subspecialty


LLM judge performance

To use Llama-3-70B as an automated LLM judge for assessing model responses in the large Eurorad dataset, we first needed to calibrate its response assessment against human expert assessment. In a subset of 140 Eurorad cases, Llama-3-70B exhibited a high accuracy of 87.8% in classifying responses as “correct” or “incorrect” (123 out of 140 responses; 95% CI: 0.82–0.93).

Furthermore, in a separate subset of 20 responses that were rated by all three radiologists, the interrater agreement was found to be 100%, indicating complete consensus among the human experts.

The high level of agreement between Llama-3-70B and the human radiologists, as well as the complete consensus among the radiologists themselves, supports the validity of using Llama-3-70B as an automated judge for the larger LLM response dataset. This allows us to include the small inaccuracies of Llama-3-70B in the overall confidence interval assessment, as detailed in the “Statistics” section of the Methods.
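The reported interval is consistent with a standard binomial confidence interval for 123 of 140 correct classifications; a minimal, purely illustrative check in Python:

```python
import math

# Sanity check of the judge calibration figures: 123/140 correct classifications.
p = 123 / 140                       # ≈ 0.879 (reported as 87.8%)
se = math.sqrt(p * (1 - p) / 140)   # standard binomial standard error
ci = (p - 1.96 * se, p + 1.96 * se)
print(round(p, 3), tuple(round(x, 2) for x in ci))  # 0.879 (0.82, 0.93)
```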

Model performance

Across all models, the highest levels of diagnostic accuracy were achieved in interventional radiology (67.8 ± 6.2%), cardiovascular imaging (62.5 ± 3.2%), and abdominal imaging (60.5 ± 1.8%), whereas lower accuracy was observed in breast imaging (50.0 ± 4.3%) and musculoskeletal imaging (50.4 ± 2.1%). Granular accuracy metrics by subspecialty and model are provided in Supplementary Table 2.

GPT-4o demonstrated superior diagnostic performance across all subspecialties except interventional radiology, achieving a rate of 79.6 ± 2.3% correct responses. Meta-Llama-3-70B revealed the highest performance among open-source LLMs (73.2 ± 2.5%), with a considerable margin ahead of Mistral-Small (63.3 ± 2.6%), Qwen2.5-32B (62.5 ± 2.6%), and OpenBioLLM-Llama-3-70B (62.5 ± 2.6%).

Lowest performance was seen in Medalpaca-13B (34.0 ± 2.6%), Meditron-7B (44.3 ± 2.7%), and BioMistral-7B (44.5 ± 2.7%). Meta-Llama-3-70B’s accuracy was substantially higher than its predecessor model Meta-Llama-2-70B (Figs. 1, 2).

Fig. 1: Model performance across subspecialty.


Models were ranked by overall accuracy and grouped into radar plots, with four models displayed per plot. The four top-performing models are shown in the top left corner.


Fig. 2: Performance of open-source LLMs in the Eurorad dataset (n = 1933) and the local brain MRI dataset (n = 60).

Error bars indicate adjusted 95% confidence intervals. Readers 1 and 2 were radiologists with two and four years of dedicated neuroradiology experience, respectively.

In the local brain MRI dataset, similar results were observed, with GPT-4o (76.7 ± 15.1%) and Llama-3-70B (71.7 ± 12.2%) again leading the rankings. Reader 2, a board-certified neuroradiologist, achieved the highest accuracy with 83.3 ± 13.3% correct responses. Reader 1, a radiologist with 2 years of neuroradiology experience, achieved rates comparable to GPT-4o and Meta-Llama-3-70B (75.0 ± 15.5%).

Several other models showed a drop in performance levels in the local dataset of up to 16% (e.g., Llama-2-70B: 47.8 ± 2.7% to 31.7 ± 12.6%) (Fig. 2).

Correlation analysis

The relationship between model accuracy and model size (in billion parameters) is illustrated in Fig. 3. A Pearson correlation coefficient of 0.54 was determined, indicating a moderate positive correlation.

Fig. 3: Scatter plot: accuracy vs model size.

Models fine-tuned with biomedical corpora are highlighted in red. A Pearson correlation coefficient of 0.54 was determined, indicating a moderate positive correlation.


LLMs fine-tuned with domain-specific training data showed lower accuracy compared to general-purpose models of comparable size. For instance, both OpenBioLLM-Llama-3-70B (62.4 ± 2.6%) and OpenBioLLM-Llama-3-8B (45.4 ± 2.7%) demonstrated performance levels inferior to their respective base models, Meta-Llama-3-70B (73.2 ± 2.5%) and Meta-Llama-3-8B (56.4 ± 2.6%).

Discussion

In this study, we benchmarked the diagnostic performance of fifteen leading open-source LLMs in a heterogeneous, challenging cohort of 1933 peer-reviewed case reports from the Eurorad library. Although GPT-4o outperformed all included open-source LLMs (79.6%), Meta’s Llama-3-70B followed very closely (73.2%), highlighting how open-source LLMs are quickly closing the gap to proprietary LLMs.

This level of performance is noteworthy given the complexity and diversity of the cases included in our dataset. In the local brain MRI dataset, both models reached accuracy rates comparable to or only slightly lower than those of two experienced radiologists. The remaining models followed with a sizeable margin, underscoring the current dominance of Llama-3 among open-source models.

This trend is further supported by Llama-3’s proficiency in other clinical tasks, such as answering close-ended medical questions, summarizing clinical documents, and patient education [14,16].

Importantly, this study assessed the diagnostic performance of LLMs based on real case descriptions, more accurately representing the complexities of real-life clinical decision-making than questions with pre-defined response options. This approach provides a more realistic evaluation of LLMs’ potential in clinical settings, where the ability to interpret nuanced clinical information is crucial.

Our results revealed variations in performance across radiological subspecialties, with higher accuracy in genital (female) imaging and lower accuracy in musculoskeletal imaging. These differences may reflect inherent complexities within each subspecialty, variations in the quality or specificity of case descriptions, or potential biases in the models’ training data.

Further investigation into these subspecialty-specific performance variations could provide valuable insights for targeted model improvements and clinical applications.

Although we observed a moderate positive correlation between model size and diagnostic accuracy, some lighter models such as Meta-Llama-3-8B exhibited strong performance, outperforming larger models with more parameters (e.g., Llama-2-70B and Vicuna-13B). This suggests that smaller, lower-cost models with nonetheless robust results are attainable, making the implementation of LLMs in resource-constrained healthcare settings more viable.

Interestingly, medically fine-tuned models tended to perform worse than their respective base model or other general-purpose models of comparable size. This finding challenges the widely held assumption that domain-adaptive pretraining enhances model performance, although some recent studies support our observation [17,18].

Employing a state-of-the-art LLM to automate the evaluation of LLM responses facilitated the large-scale analysis of thousands of cases, a scope unrealizable through manual processing. This strategy establishes a methodical benchmark for future large-scale investigations of clinical text documents.

Overall, this study highlights the potential of open-source LLMs as decision-support tools for radiological differential diagnosis in real-world cases. Yet, several obstacles to successful implementation and adoption remain to be addressed.

To begin with, how physicians can effectively interact with LLMs and how potential risks can be mitigated is yet to be determined. Whereas this study investigated the isolated diagnostic performance of LLMs, it is more realistic for them to serve as tools enhancing the capabilities of physicians [19,20]. In the context of radiological diagnosis, LLMs could help rapidly generate multiple hypotheses that require further validation. A potential threat originating from LLM suggestions is automation bias, which is the common tendency of humans to excessively rely on automated decision-making systems. This cognitive phenomenon has been observed in AI-based systems for mammography classification [21] and cerebral aneurysm detection [22], and could lead to systematic errors in physicians who fail to critically evaluate LLM suggestions. In contrast, LLMs could also play a role in reducing cognitive biases if they are intentionally utilized to provide different perspectives and uncover common fallacies [23].

While effective human-AI interaction is crucial, practical implementation of LLMs hinges on overcoming technical barriers. Operating open-source LLMs locally requires an adequate hardware and software infrastructure, as well as IT expertise that might be available in large academic centers but not in smaller institutions or practices [24]. Vendors of PACS, RIS, or EHR systems could be instrumental in overcoming these barriers by integrating LLM-based features in a privacy-preserving and user-friendly manner.

In addition to technical considerations, economic implications from a healthcare provider and system perspective warrant careful examination. Deploying LLMs incurs costs related to infrastructure, query usage, and physician training. Cost-effectiveness studies are needed to assess whether these investments are justified by the potential gains in diagnostic accuracy, productivity, and patient outcomes.

Moreover, establishing effective regulatory frameworks for the development and use of LLM-based tools in medicine remains a considerable challenge. LLMs used in clinical decision-making should meet rigorous safety and reliability standards, yet the near-infinite range of inputs and outputs complicates the definition of comprehensive guidelines [25]. Regulatory authorities such as the FDA (Food and Drug Administration) and the EMA (European Medicines Agency) should aim for adaptable yet robust oversight mechanisms to harness the potential benefits while preventing potential patient harm.

Lastly, patient perception of AI-enhanced diagnosis is another critical factor to consider. Transparent communication about the role of LLMs in the diagnostic process will likely ensure patient trust, ultimately fostering acceptance [26].

This study has several limitations. First, data contamination of LLMs cannot be definitively ruled out. Given the lack of transparency regarding the LLM training datasets, it is possible that the case reports used in this study overlap with the training data of some models. However, our complementary assessment on a non-public brain MRI dataset revealed largely consistent overall model rankings, although some models exhibited diminished performance.

The detection and estimation of data contamination is an active area of research, and several methods have been proposed [27,28,29,30]. In one approach, the LLM in question is instructed to complete the initial segment of a dataset partition, and the overlap of the LLM output with the original data is measured [27]. Another framework evaluates data contamination based on LLM output distribution characteristics [30]. Despite progress, significant weaknesses such as limited reproducibility and the lack of baseline comparisons persist, and robust, validated methods for detecting data contamination are yet to be established [29].

Second, while the use of an LLM for the evaluation of LLM responses significantly enhanced the scalability of the analysis, it did so at the expense of reduced accuracy. To mitigate this limitation, we adjusted the standard error of model performance assessment based on our evaluation of Llama-3-70B’s judging accuracy in a subset of the data.

Third, this study did not evaluate the multimodal performance of vision-language models (VLMs) capable of ingesting both text and image data as input. Both closed-source (e.g., GPT-4o, Claude 3 Opus, Gemini 1.5 Pro) and open-source VLMs (e.g., LLaVA-1.5, Qwen-VL, CogVLM) [31,32,33,34,35,36] have become available, and their role in radiology report generation (either standalone [37] or as an aid to radiologists [38]) is being explored. Several studies further evaluated the performance of VLMs in differential diagnosis but showed mixed results [4,10,39,40]. Whereas most VLMs are not capable of processing 3-dimensional image inputs, new models supporting 3D CT or MRI scans have been developed [41,42].

Fourth, we did not investigate the impact of temperature settings or prompt design on LLM performance. To ensure deterministic responses, we applied a temperature of 0, but higher temperatures could potentially improve diagnostic accuracy [10]. Similarly, the optimal task-specific prompting strategy for radiological diagnosis is yet to be determined [43].

Finally, this study did not account for the influence of varying descriptions of the same case. A recent study evaluating GPT-4(V) in radiological diagnosis revealed that image description is a major determinant of LLM accuracy [4]. The Eurorad case descriptions were written in awareness of the correct diagnosis, and their use of specific terminology or emphasis on certain image characteristics might have introduced a positive bias in LLM performance.

In conclusion, we found that several open-source LLMs demonstrate promising performance in identifying the correct diagnosis based on case descriptions from the Eurorad library, highlighting their potential as a decision-support tool for radiological differential diagnosis.

Methods


The need for informed consent was waived by the Ethics Committee of the Technical University of Munich, as it involved the retrospective analysis of publicly available data and de-identified local data, posing minimal risk to the participants.


Data

To create a comprehensive and diverse dataset of challenging radiology cases, we automatically downloaded case report data—including “Clinical History,” “Imaging Findings,” “Final Diagnosis,” and “Section”—from the European Society of Radiology’s case report library at https://eurorad.org/. Containing information on patient demographics, symptoms, pre-existing conditions, lab values, and detailed imaging findings, the case descriptions were sufficient to determine the accurate diagnosis in most cases. The final diagnosis, as indicated in the Eurorad dataset, served as the ground truth.

All case reports published after July 6, 2015, and licensed under the Creative Commons License CC BY-NC-SA 4.0, were scraped using the Python library “Scrapy” (version 2.11.2) on June 15, 2024.

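For illustration, a minimal Scrapy spider along these lines could collect the four fields; the entry URL, link selector, and section selectors below are placeholders and do not reflect the actual Eurorad page markup.

```python
import scrapy

class EuroradCaseSpider(scrapy.Spider):
    """Sketch of a spider collecting the four case-report fields used in this study."""
    name = "eurorad_cases"
    # Placeholder entry point; a real crawl would start from the Eurorad case listing pages.
    start_urls = ["https://eurorad.org/case-reports"]

    def parse(self, response):
        # Follow links to individual case reports (selector is illustrative only).
        for href in response.css("a.case-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_case)

    def parse_case(self, response):
        # Extract the sections used as LLM input and ground truth; selectors are placeholders.
        yield {
            "clinical_history": " ".join(response.css("#clinical-history ::text").getall()).strip(),
            "imaging_findings": " ".join(response.css("#imaging-findings ::text").getall()).strip(),
            "final_diagnosis": " ".join(response.css("#final-diagnosis ::text").getall()).strip(),
            "section": response.css(".subspecialty::text").get(),
        }
```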

To address potential data contamination concerns and assess generalizability, we further validated the performance of LLMs in a local dataset of 60 brain MRI cases. These were obtained from our local imaging database, as reported previously [4], and equally contained a brief clinical history and imaging findings. True diagnoses in the local dataset were determined based either on histopathology or through the independent agreement of at least two neuroradiologists, taking into account all relevant clinical follow-up information. This local dataset is not publicly accessible and, thus, highly unlikely to have been included in the LLMs’ training data.

LLM selection

GPT-4o was included as a state-of-the-art closed-source LLM by OpenAI. Among open-source LLMs, general-purpose models from leading developers (Meta, Microsoft, Mistral, Alibaba, and Google) and top-ranking medically fine-tuned LLMs chosen based on trend and download metrics on HuggingFace (https://huggingface.co/models) were selected. One medically fine-tuned model, Meditron-70B, was tested but eventually excluded from the analysis as it returned nonsensical responses, possibly because it was not specifically trained to execute user instructions (also known as “instruction fine-tuning”).

LLM setup

To evaluate a range of open-source large language models (LLMs), we developed a Python-based workflow utilizing the “llama_cpp_python” library (version 0.2.79). This library provides Python bindings for the widely used “llama_cpp” software, enabling the execution of local, quantized LLMs in GGUF (GPT-generated unified format).

Quantization involves reducing the precision of the model’s numerical weights, typically transitioning from floating-point to lower-bit representations, which results in a smaller and faster model while preserving performance. For most models, Q5_K_M was chosen as the quantization level, typically offering a good balance between compression and quality.

For the 70B models, a quantization factor of Q4_K_M was selected to allow full GPU offloading. The “llama_cpp_python” library allows for detailed control over relevant hyperparameters. In our experiments, we fully offloaded the LLMs to a GPU for higher computational speed, set the temperature to 0 to ensure deterministic responses, and limited the context width to 1024 tokens, which we previously validated to accommodate all case reports and responses.
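A minimal sketch of such a setup with llama_cpp_python is shown below; the GGUF file name, the prompt assembly, and the output-length setting are placeholders and assumptions, not the exact study code (which is available in the GitHub repository referenced below).

```python
from llama_cpp import Llama  # provided by the llama_cpp_python package

# Sketch: load a quantized GGUF model with full GPU offloading,
# a 1024-token context, and deterministic (temperature 0) decoding.
llm = Llama(
    model_path="Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",  # placeholder path to a downloaded GGUF file
    n_ctx=1024,        # context width validated to fit all case reports and responses
    n_gpu_layers=-1,   # offload all layers to the GPU
    verbose=False,
)

case_text = "Clinical History: ...\nImaging Findings: ..."  # concatenated Eurorad case fields
prompt = (
    "You are a senior radiologist. Below, you will find information about a patient: "
    "first, the clinical presentation, followed by imaging findings. Based on this information, "
    "name the three most likely differential diagnoses, with a short rationale for each.\n\n"
    + case_text
)

out = llm(prompt, max_tokens=512, temperature=0.0)  # output budget is an assumption, not stated in the text
print(out["choices"][0]["text"])
```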

We chose these settings to balance performance and reproducibility, although we acknowledge that different configurations might yield varying results. Our Python code for prompt construction, along with detailed links to all models (downloaded from https://huggingface.co/), is publicly available in our GitHub repository at https://github.com/ai-idt/os_llm_eurorad. The fifteen open-source LLM models included in this study are detailed in Table 2. All experiments were conducted using an Nvidia P8000 GPU with 48 GB of video memory.

Table 2 Open-source LLM details


GPT-4o (“gpt-4o-2024-08-06”) was accessed via OpenAI’s application programming interface (API) (https://platform.openai.com/docs/models#gpt-4o).
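For reference, a minimal call with the official openai Python client might look as follows; the message layout is an assumption, the case text is a placeholder, and temperature 0 mirrors the setting used for the open-source models (whether the same value was applied to GPT-4o is not stated above).

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

case_text = "Clinical History: ...\nImaging Findings: ..."  # placeholder for the concatenated case fields

# Sketch of a single GPT-4o query with the differential-diagnosis instruction.
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    temperature=0,
    messages=[
        {"role": "user", "content": (
            "You are a senior radiologist. Below, you will find information about a patient: "
            "first, the clinical presentation, followed by imaging findings. Based on this "
            "information, name the three most likely differential diagnoses, with a short "
            "rationale for each.\n\n" + case_text
        )},
    ],
)
print(response.choices[0].message.content)
```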

Human reader performance

To contrast the diagnostic performance of LLMs with radiologists, two readers were instructed to provide up to three differential diagnoses for the local dataset of 60 brain MRI cases. Reader 1 was a radiologist with two years of dedicated neuroradiology experience, while Reader 2 was a board-certified neuroradiologist with four years of experience.

To create equal conditions, the two readers were provided with the textual case descriptions but not the image data, consistent with the setup for the LLMs, even though this did not represent a realistic clinical scenario.

Case selection and response assessment

Upon review, we noted that a significant proportion of cases already contained the correct diagnosis within the “Clinical History” and “Imaging Findings” sections. Drawing inspiration from the “LLM-as-a-Judge” paradigm [44], we employed the most advanced open-source model available at the outset of this study, Llama-3-70B, to filter out these cases. A recent study indicated that Llama-3-70B, along with GPT-4 Turbo, demonstrated the closest alignment with human evaluations [45], making it particularly suitable for this task. We prompted Llama-3-70B to assess all cases with the following instructions:

“You are a senior radiologist. Below, you will find a case description for a patient diagnosed with [Diagnosis]. Please check if the diagnosis or any part of it is mentioned, discussed, or suggested in the case description. Respond with either ‘mentioned’ (if the diagnosis is included) or ‘not mentioned,’ and nothing else.”

Subsequently, we prompted each of the sixteen LLMs (15 open-source LLMs + GPT-4o) to provide three differential diagnoses along with a brief rationale for each, using the concatenated “Clinical History” and “Imaging Findings” as input:

“You are a senior radiologist. Below, you will find information about a patient: first, the clinical presentation, followed by imaging findings. Based on this information, name the three most likely differential diagnoses, with a short rationale for each.”

Finally, we again utilized Llama-3-70B to evaluate each LLM’s responses on a binary scale, categorizing them as either “correct” (if the correct diagnosis was among the three differential diagnoses) or “wrong.” The prompt for this evaluation was:

“You are a senior radiologist. Below, you will find the correct diagnosis (indicated after ‘Correct Diagnosis:’) followed by the differential diagnoses provided by a Radiology Assistant during an exam. Please assess whether the Radiology Assistant included the correct diagnosis in their differential diagnosis. Respond only with ‘correct’ (if the correct diagnosis is included) or ‘wrong’ (if it is not).”
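A sketch of how this final judging step can be wired up is shown below; the helper names and the way the ground-truth diagnosis and model answer are concatenated into the prompt are assumptions, since only the instruction text is given above.

```python
def build_judge_prompt(true_diagnosis: str, assistant_answer: str) -> str:
    """Assemble the binary judging prompt from the ground-truth diagnosis and a model's differentials."""
    instruction = (
        "You are a senior radiologist. Below, you will find the correct diagnosis (indicated after "
        "'Correct Diagnosis:') followed by the differential diagnoses provided by a Radiology Assistant "
        "during an exam. Please assess whether the Radiology Assistant included the correct diagnosis "
        "in their differential diagnosis. Respond only with 'correct' (if the correct diagnosis is "
        "included) or 'wrong' (if it is not)."
    )
    return f"{instruction}\n\nCorrect Diagnosis: {true_diagnosis}\n\nRadiology Assistant: {assistant_answer}"

def parse_verdict(judge_output: str) -> bool:
    """Map the judge's free-text reply onto a binary correct/wrong label."""
    return judge_output.strip().lower().startswith("correct")

# Hypothetical usage, where `judge_llm` is a Llama-3-70B completion function as set up above:
# verdict = parse_verdict(judge_llm(build_judge_prompt(truth, answer))["choices"][0]["text"])
```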

An exemplary case with details of the LLM query and response evaluation is shown in Fig. 4.

Fig. 4: Study design.

A total of 2894 cases were excluded as the true diagnosis was mentioned in the case description to be provided as LLM input. 16 LLMs were prompted to output the three most likely differential diagnoses, based on the Eurorad case reports. Llama-3-70B was used to automatically determine the percentage of correct responses, given the ground truth diagnosis. A subset of 140 LLM responses was additionally rated by radiologists to evaluate the judging accuracy of Llama-3-70B. DDx, differential diagnoses.

Human evaluation

In order to gain an understanding of Llama-3-70B’s performance as an LLM judge for correctness of diagnoses, three experienced radiologists (SHK, with 2 years of experience; DMH and BW, board-certified radiologists with 10 years of experience each) additionally evaluated 60 LLM responses each for correctness, of which 20 were shared between all three reviewers to assess human interrater agreement.

Using a total of 140 LLM responses for which both human “ground truth” and LLM judge assessments were known, we calculated the accuracy of the LLM judge (Fig. 5).

Fig. 5: Exemplary LLM query and analysis (Eurorad case ID 12746).

Mistral-Small was instructed to determine the most likely differential diagnoses based on a textual case description that included a condensed medical history and imaging findings. The LLM output contained three diagnoses with a rationale for each suggestion. Llama-3-70B was subsequently utilized to classify the response as correct (if the true diagnosis is included in the suggestions) or wrong. The true diagnosis in this case was ‘jugular foramen meningioma’.

Statistics

Both the LLM judge and the human raters evaluated LLM responses on a binary scale, i.e., whether or not the correct diagnosis was among the top three differential diagnoses listed by the LLM. From this response data, we calculated the standard error per model and category as:

$$SE=\sqrt{\frac{p(1-p)}{n}}$$

(1)

where p is the proportion of correct responses and n is the number of samples.

However, from our human evaluation of the LLM judge performance, we know about its inaccuracies and have to adjust the SE to account for this:

$$S{E}_{adj}=\sqrt{\frac{A\,\ast \,(1-A)}{n}\,\ast \,{SE}^{2}}$$

(2)

where A is the accuracy of the LLM judge. The adjusted 95% confidence interval is then:

$$95{\rm{ \% }}CI=p\pm 1.96\,\ast \,S{E}_{adj}$$

(3)

To assess the relationship between LLM size (measured by the number of parameters) and diagnostic accuracy, the Pearson correlation coefficient was calculated.
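A worked sketch of these calculations is given below, implementing equations (1)–(3) exactly as written above and using SciPy for the correlation; all numeric inputs are illustrative placeholders, not study results.

```python
import numpy as np
from scipy import stats

def adjusted_ci(n_correct: int, n_total: int, judge_accuracy: float):
    """Proportion of correct responses with the judge-adjusted 95% CI, per Eqs. (1)-(3) as written."""
    p = n_correct / n_total
    se = np.sqrt(p * (1 - p) / n_total)                                          # Eq. (1)
    se_adj = np.sqrt(judge_accuracy * (1 - judge_accuracy) / n_total * se ** 2)  # Eq. (2)
    return p, (p - 1.96 * se_adj, p + 1.96 * se_adj)                             # Eq. (3)

# Illustrative call (placeholder counts; judge accuracy from the calibration subset):
p, ci = adjusted_ci(n_correct=700, n_total=1000, judge_accuracy=0.878)

# Pearson correlation between model size (billion parameters) and accuracy; placeholder values.
sizes = np.array([7.0, 8.0, 13.0, 32.0, 70.0])
accuracies = np.array([0.45, 0.56, 0.34, 0.62, 0.73])
r, p_value = stats.pearsonr(sizes, accuracies)
print(f"p = {p:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), Pearson r = {r:.2f}")
```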

Data availability

The Eurorad case report library is publicly accessible at https://eurorad.org/.

Code availability

Our Python code for prompt construction, along with detailed links to all models (downloaded from https://huggingface.co/), is publicly available in our GitHub repository at https://github.com/ai-idt/os_llm_eurorad.

References

1. Gertz, R. J. et al. GPT-4 for automated determination of radiologic study and protocol based on radiology request forms: a feasibility study. Radiology 307, e230877 (2023).
2. Rau, A. et al. A context-based chatbot surpasses radiologists and generic ChatGPT in following the ACR appropriateness guidelines. Radiology 308, e230970 (2023).
3. Kottlors, J. et al. Feasibility of differential diagnosis based on imaging patterns using a large language model. Radiology 308, e231167 (2023).
4. Schramm, S. et al. Impact of multimodal prompt elements on diagnostic performance of GPT-4V in challenging brain MRI cases. Radiology 314, e240689 (2025).
5. Mallio, C. A., Sertorio, A. C., Bernetti, C. & Beomonte Zobel, B. Large language models for structured reporting in radiology: performance of GPT-4, ChatGPT-3.5, Perplexity and Bing. Radiol. Med. 128, 808–812 (2023).
6. Doshi, R. et al. Quantitative evaluation of large language models to streamline radiology report impressions: a multimodal retrospective analysis. Radiology 310, e231593 (2024).
7. Le Guellec, B. et al. Performance of an open-source large language model in extracting information from free-text radiology reports. Radiol. Artif. Intell. 6, 230364 (2024).
8. Lehnen, N. C. et al. Data extraction from free-text reports on mechanical thrombectomy in acute ischemic stroke using ChatGPT: a retrospective analysis. Radiology 311, e232741 (2024).
9. Katz, U. et al. GPT versus resident physicians — a benchmark based on official board scores. NEJM AI 1 (2024).
10. Suh, P. S. et al. Comparing diagnostic accuracy of radiologists versus GPT-4V and Gemini Pro Vision using image inputs from Diagnosis Please cases. Radiology 312, e240273 (2024).
11. Sonoda, Y. et al. Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in “Diagnosis Please” cases. Jpn. J. Radiol. 42, 1231–1235 (2024).
12. Wu, S. et al. Benchmarking open-source large language models, GPT-4 and Claude 2 on multiple-choice questions in nephrology. NEJM AI 1 (2024).
13. Sandmann, S., Riepenhausen, S., Plagwitz, L. & Varghese, J. Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nat. Commun. 15, 2050 (2024).
14. Adams, L. C. et al. Llama 3 challenges proprietary state-of-the-art large language models in radiology board–style examination questions. Radiology 312, e241191 (2024).
15. Eurorad. Homepage. https://eurorad.org/ (2024).
16. Liu, F. et al. Large language models in the clinic: a comprehensive benchmark. Preprint at medRxiv https://doi.org/10.1101/2024.04.24.24306315 (2024).
17. Jeong, D. P., Garg, S., Lipton, Z. C. & Oberst, M. Medical adaptation of large language and vision-language models: are we making progress? In Proc. 2024 Conference on Empirical Methods in Natural Language Processing 12143–12170 (Association for Computational Linguistics, 2024).
18. Dorfner, F. J. et al. Biomedical large languages models seem not to be superior to generalist models on unseen medical data. Preprint at arXiv:2408.13833 (2024).
19. Kim, S. H. et al. Human-AI collaboration in large language model-assisted brain MRI differential diagnosis: a usability study. Eur. Radiol. (in press).
20. Siepmann, R. et al. The virtual reference radiologist: comprehensive AI assistance for clinical image reading and interpretation. Eur. Radiol. 34, 6652–6666 (2024).
21. Dratsch, T. et al. Automation bias in mammography: the impact of artificial intelligence BI-RADS suggestions on reader performance. Radiology 307, e222176 (2023).
22. Kim, S. H. et al. Automation bias in AI-assisted detection of cerebral aneurysms on time-of-flight MR-angiography. Radiol. Med. (in press).
23. Ke, Y. et al. Mitigating cognitive biases in clinical decision-making through multi-agent conversations using large language models: simulation study. J. Med. Internet Res. 26, e59439 (2024).
24. Klang, E. et al. A strategy for cost-effective large language model use at health system-scale. npj Digit. Med. 7, 320 (2024).
25. Gilbert, S., Harvey, H., Melvin, T., Vollebregt, E. & Wicks, P. Large language model AI chatbots require approval as medical devices. Nat. Med. 29, 2396–2398 (2023).
26. Zhang, Z. et al. Patients’ perceptions of using artificial intelligence (AI)-based technology to comprehend radiology imaging data. Health Informatics J. 27, 14604582211011215 (2021).
27. Golchin, S. & Surdeanu, M. Time travel in LLMs: tracing data contamination in large language models. In 12th International Conference on Learning Representations, ICLR 2024 (2023).
28. Golchin, S. & Surdeanu, M. Data contamination quiz: a tool to detect and estimate contamination in large language models. Preprint at arXiv:2311.06233 (2023).
29. Balloccu, S., Schmidtová, P., Lango, M. & Dušek, O. Leak, cheat, repeat: data contamination and evaluation malpractices in closed-source LLMs. In Proc. 18th Conference of the European Chapter of the Association for Computational Linguistics 67–93 (Association for Computational Linguistics, 2024).
30. Dong, Y. et al. Generalization or memorization: data contamination and trustworthy evaluation for large language models. Preprint at arXiv:2402.15938 (2024).
31. Gemini models. Gemini API. Google AI for Developers. https://ai.google.dev/gemini-api/docs/models/gemini (2024).
32. Models. Anthropic. https://docs.anthropic.com/en/docs/about-claude/models (2024).
33. Models. OpenAI API. https://platform.openai.com/docs/models/gp (2024).
34. Wang, W. et al. CogVLM: visual expert for pretrained language models. Preprint at arXiv:2311.03079 (2023).
35. Bai, J. et al. Qwen-VL: a versatile vision-language model for understanding, localization, text reading, and beyond. Preprint at arXiv:2308.12966 (2023).
36. Liu, H., Li, C., Li, Y. & Lee, Y. J. Improved baselines with visual instruction tuning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024).
37. Mohsan, M. M. et al. Vision transformer and language model based radiology report generation. IEEE Access 11, 1814–1824 (2023).
38. Tanno, R. et al. Collaboration between clinicians and vision-language models in radiology report generation. Nat. Med. https://doi.org/10.1038/s41591-024-03302-1 (2024).
39. Horiuchi, D. et al. Comparing the diagnostic performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists in challenging neuroradiology cases. Clin. Neuroradiol. 34, 779–787 (2024).
40. Wu, C. et al. Can GPT-4V(ision) serve medical applications? Case studies on GPT-4V for multimodal medical diagnosis. Preprint at arXiv:2310.09909 (2023).
41. Wu, C., Zhang, X., Zhang, Y., Wang, Y. & Xie, W. Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. Preprint at arXiv:2308.02463 (2023).
42. Blankemeier, L. et al. Merlin: a vision language foundation model for 3D computed tomography. Res. Sq. https://doi.org/10.21203/RS.3.RS-4546309/V1 (2024).
43. Sivarajkumar, S., Kelley, M., Samolyk-Mazzanti, A., Visweswaran, S. & Wang, Y. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: algorithm development and validation study. JMIR Med. Inform. 12, e55318 (2024).
44. Zheng, L. et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Adv. Neural Inf. Process. Syst. 36, 46595–46623 (2023).
45. Singh Thakur, A., Choudhary, K., Srinik Ramayapally, V., Vaidyanathan, S. & Hupkes, D. Judging the judges: evaluating alignment and vulnerabilities in LLMs-as-judges. Preprint at arXiv:2406.12624 (2024).


Acknowledgements

Not applicable.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Department of Diagnostic and Interventional Neuroradiology, Klinikum rechts der Isar, School of Medicine and Health, Technical University of Munich, Munich, Germany
Su Hwan Kim, Severin Schramm, Paul-Sören Platzek, Karolin Johanna Paprottka, Claus Zimmer, Dennis M. Hedderich & Benedikt Wiestler

Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, School of Medicine and Health, Technical University of Munich, Munich, Germany
Lisa C. Adams & Rickmer Braren

Department of Cardiovascular Radiology and Nuclear Medicine, German Heart Center Munich, School of Medicine and Health, Technical University of Munich, Munich, Germany
Keno K. Bressem

Computer Aided Medical Procedures, Technical University of Munich, Munich, Germany
Matthias Keicher

AI for Image-Guided Diagnosis and Therapy, School of Medicine and Health, Technical University of Munich, Munich, Germany
Benedikt Wiestler


Contributions

S.H.K., D.M.H., and B.W. conceived and designed the study. S.H.K. and B.W. drafted the original manuscript. S.S., L.A., R.B., K.K.B., M.K., and C.Z. contributed to data interpretation. P.S.P. and K.J.P. contributed to the reader study. All authors critically reviewed and approved the manuscript.

Corresponding author

Correspondence to Su Hwan Kim.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

supplement-1

Rights and permissions

Open Access

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Kim, S.H., Schramm, S., Adams, L.C. et al. Benchmarking the diagnostic performance of open source LLMs in 1933 Eurorad case reports. npj Digit. Med. 8, 97 (2025). https://doi.org/10.1038/s41746-025-01488-3

Received: 14 September 2024

Accepted: 28 January 2025

Published: 12 February 2025

DOI: https://doi.org/10.1038/s41746-025-01488-3


Subjects

Diagnosis

Medical imaging

Translational research