[Paper Close Reading: Evaluating Large Model Outputs] A Systematic Evaluation of GPT-4V’s Multimodal Capability for Medical Image Analysis


海蛎子程序员


Tags: artificial intelligence, nlp, natural language processing, AIGC

Article link: https://blog.csdn.net/oLIMBO123/article/details/142912239


A Systematic Evaluation of GPT-4V’s Multimodal Capability for Medical Image Analysis

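1. Keywords

GPT-4V - multimodal model
Medical Image - medical imaging
Radiology Report Generation - generating radiology reports
Medical Visual Question Answering - answering questions about medical images
Medical Visual Grounding - locating the region of an image that a piece of text refers to
Large Language Model Evaluation - evaluating large models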

2. Research Problem

The paper evaluates GPT-4V's capability for medical image analysis, focusing on three tasks: 1. Radiology Report Generation 2. Medical Visual Question Answering 3. Medical Visual Grounding
For the evaluation, a set of prompts is designed for each task to elicit the corresponding capability of GPT-4V and produce sufficiently good outputs.
Three evaluation approaches (quantitative analysis, human evaluation, and case study) are combined to achieve an in-depth and extensive evaluation.

3. Key Contributions

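Compares the traditional metrics for evaluating NLP models (precision, recall, F1 score, BLEU, CIDEr, ROUGE, METEOR) against human evaluation by domain experts. The experiments show that traditional metrics cannot properly assess open-ended questions (questions that cannot be answered with a brief yes or no and instead need a sentence or a paragraph of more detailed text). New evaluation metrics are therefore urgently needed.
GPT-4V performs well on the Radiology Report Generation and Medical Visual Question Answering tasks: although it is not a model specialized for the medical domain, its performance is close to SOTA. On the Medical Visual Grounding task, however, it performs poorly, and there is still a lot of room for improvement.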
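To make the first point concrete, here is a minimal sketch of a token-overlap precision/recall/F1 score. The two sentences are made up for illustration and are not taken from the paper or its datasets, and `token_f1` is a hypothetical helper, not the paper's evaluation code. It shows how an answer that reads as roughly the same finding can still score zero when it shares no words with the reference.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Return (precision, recall, F1) based on overlapping lowercase tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Multiset intersection: each token counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical sentences, chosen so that the wording differs completely.
reference = "no acute cardiopulmonary abnormality"
paraphrase = "the heart and lungs appear normal"

print(token_f1(paraphrase, reference))  # no shared tokens
print(token_f1(reference, reference))   # identical wording
```

A free-text answer can be phrased in many ways, so word-overlap scores of this kind tend to punish wording rather than meaning, which is consistent with the gap between metrics and expert ratings reported in the paper.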

4. Methodology

Dataset: The study utilized MIMIC-CXR for radiology reports, the VQA-RAD dataset for visual question answering and MS-CXR for Medical Visual Grounding.
Model architecture: GPT-4V, a multimodal model capable of processing both text and images.
Training procedure: zero-shot or few-shot prompts are used to generate responses from GPT-4V.
Evaluation metrics: Included traditional NLP metrics (BLEU, ROUGE, METEOR, CIDEr) and human evaluations for qualitative assessment. (The human evaluation really was just domain experts assessing the results for 100 randomly sampled test cases; at first I thought it was done with an Agent =.=)
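The difference between the zero-shot and few-shot prompting mentioned above can be sketched as follows. This is only an illustration: `build_prompt`, the instruction text, and the example question/answer pairs are all hypothetical, the paper's actual prompts are not reproduced here, and attaching the image to the request is left out.

```python
def build_prompt(task_instruction, question, examples=()):
    """Assemble a plain-text prompt.

    With no examples the prompt is zero-shot; each (question, answer)
    pair in `examples` adds one demonstration, making it few-shot.
    """
    parts = [task_instruction]
    for ex_question, ex_answer in examples:
        parts.append(f"Question: {ex_question}\nAnswer: {ex_answer}")
    # The final question is left unanswered for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical instruction and demonstrations (not from the paper).
instruction = "You are given a radiology image. Answer the question about it."
demos = [("Is there a pleural effusion?", "No.")]

zero_shot = build_prompt(instruction, "Is the heart enlarged?")
few_shot = build_prompt(instruction, "Is the heart enlarged?", demos)
print(few_shot)
```

The only change between the two settings is the list of demonstrations placed before the final question; the model itself is not trained or fine-tuned in either case.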

5. Main Results

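The traditional metrics rate GPT-4V's outputs as poor across the board, while the experts consider them fairly good.
The authors therefore compared the results of the traditional metrics with those of the human evaluation, using Kendall's coefficient and the p-value. Both the traditional metrics and the experts rated GPT-4V's outputs, and Kendall's coefficient between the two hovers around 0. (Kendall's coefficient ranges from -1 to 1: 1 means the two rankings agree completely, -1 means they are completely reversed, and 0 means there is no association between them.)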

Kendall’s τ Coefficient
Definition: Kendall’s τ (tau) coefficient is a statistic used to measure the ordinal association between two quantities. It assesses how well the relationship between two variables can be described using a monotonic function. The coefficient is calculated based on the number of concordant and discordant pairs of observations.
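As a minimal sketch of this definition: for n observations, the simplest variant (tau-a) is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs n(n-1)/2. The scores below are hypothetical and are not data from the paper; `kendall_tau_a` is an illustrative helper, and the paper may have used a tie-corrected variant.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both sequences order i and j the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means a tie in x or y; tau-a counts it as neither.
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical scores for five generated reports (not from the paper).
metric_scores = [0.10, 0.25, 0.05, 0.30, 0.20]
expert_scores = [3, 2, 4, 5, 1]

print(kendall_tau_a(metric_scores, expert_scores))  # 5 concordant, 5 discordant -> 0.0
print(kendall_tau_a([1, 2, 3], [1, 2, 3]))          # identical ranking -> 1.0
print(kendall_tau_a([1, 2, 3], [3, 2, 1]))          # reversed ranking -> -1.0
```

In the first call the automatic metric and the expert agree on the order of half the pairs and disagree on the other half, which is the kind of near-zero result described in the main results.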
[Figure: Kendall's τ coefficient]