Paper Reading Notes (21): MDNet: A Semantically and Visually Interpretable Medical Image Diagnosis Network

The inability to interpret model predictions in semantically and visually meaningful ways is a well-known shortcoming of most existing computer-aided diagnosis methods. In this paper, we propose MDNet to establish a direct multimodal mapping between medical images and diagnostic reports; it can read images, generate diagnostic reports, retrieve images by symptom descriptions, and visualize attention, providing justifications for the network's diagnosis process. MDNet includes an image model and a language model. The image model is designed to enhance multi-scale feature ensembles and utilization efficiency. The language model, integrated with our improved attention mechanism, reads and explores discriminative image feature descriptions from reports to learn a direct mapping from sentence words to image pixels. The overall network is trained end-to-end using our developed optimization strategy. On a pathology bladder cancer image and diagnostic report (BCIDR) dataset, we conduct extensive experiments to demonstrate that MDNet outperforms comparative baselines. The proposed image model also obtains state-of-the-art performance on two CIFAR datasets.

In recent years, the rapid development of deep learning technologies has had a remarkable impact on the biomedical image domain. Conventional image analysis tasks, such as segmentation and detection [2], support quick knowledge discovery from medical metadata to aid specialists' manual diagnosis and decision-making. Automatic decision-making tasks (e.g. diagnosis) are usually treated as standard classification problems. However, generic classification models are not an optimal solution for intelligent computer-aided diagnosis, because such models conceal the rationale for their conclusions and therefore lack interpretable justifications to support their decision-making process. It is rather difficult to investigate how well such a model captures and understands the critical biomarker information. A model that is able to visually and semantically interpret the underlying reasons supporting its diagnosis results is therefore significant and critical (Figure 1).

In clinical practice, medical specialists usually write diagnosis reports to record microscopic findings from images in order to diagnose and select treatment options. Teaching machine learning models to automatically imitate this process is one way to make such models interpretable. Recently, image-to-language generation [14, 22, 4, 33] and attention [36] methods have attracted considerable research interest.

In this paper, we present a unified network, namely MDNet, that can read images, generate diagnostic reports, retrieve images by symptom descriptions, and visualize network attention, to provide justifications of the network diagnosis process. For evaluation, we have applied MDNet to a pathology bladder cancer image dataset with diagnostic reports (Section 5.2 introduces dataset details). In bladder pathology images, changes in the size and density of urothelial cell nuclei or thickening of the urothelial neoplasm of bladder tissue indicate carcinoma. Accurately describing these features facilitates accurate diagnosis and is critical for identifying early-stage bladder cancer. Accurately discriminating these subtle appearance changes is challenging even for observers with extensive experience. To train MDNet, we address the problem of directly mining discriminative image feature information from reports and learning a direct multimodal mapping from report sentence words to image pixels. This problem is significant because the discriminative image features that support diagnostic conclusion inference are "latent" in reports rather than offered by specific image/object labels. Effectively utilizing this semantic information in reports is necessary for effective image-language modeling.

For image modeling based on convolutional neural networks (CNNs), we address the network's capability to capture size-variant image features (such as mitosis depicted at the pixel level or cell polarity depicted at the region level) for image representations. We analyze the weakness of the residual network (ResNet) [6, 7] from the ensemble learning perspective and propose ensemble-connection to encourage multi-scale representation integration, which results in more efficient feature utilization according to our experimental results. For language modeling, we adopt Long Short-Term Memory (LSTM) networks [33], but focus on investigating how LSTM can mine discriminative information from reports and compute effective gradients to guide the image model training. We develop an optimization approach to train the overall network end-to-end from scratch. We integrate the attention mechanism [36] into our language model and propose to enhance its visual feature alignment with sentence words to obtain sharper attention maps.
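The contrast between residual addition and multi-scale integration can be sketched in a few lines (a hypothetical NumPy illustration; `ensemble_connection`, the pooling choice, and the toy shapes are our own assumptions, not the paper's exact formulation). A residual block sums features at a single scale, whereas an ensemble-style connection lets later layers see pooled features from every scale:

```python
import numpy as np

def residual_block(x, w):
    # pre-activation-style residual unit: output = x + f(x), same scale throughout
    return x + np.maximum(0.0, x @ w)

def ensemble_connection(feature_maps):
    # hypothetical sketch: global-average-pool each scale's map to a vector,
    # then concatenate, so the classifier uses features from every scale
    pooled = [fm.mean(axis=(0, 1)) for fm in feature_maps]
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
x = rng.normal(size=(8,))
w = rng.normal(size=(8, 8))
y = residual_block(x, w)          # same dimensionality as the input

scales = [rng.normal(size=(32, 32, 16)),   # fine scale
          rng.normal(size=(16, 16, 32)),   # middle scale
          rng.normal(size=(8, 8, 64))]     # coarse scale
z = ensemble_connection(scales)   # 16 + 32 + 64 = 112 features
```

The point of the sketch is only that concatenation preserves each scale's representation, while addition merges everything into one fixed-size feature.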

To our knowledge, this is the first study to develop an interpretable attention-based model that can explicitly simulate the medical (pathology) image diagnosis process. We perform extensive experimental analysis with complementary evaluation metrics to demonstrate that MDNet generates promising and reliable results and outperforms well-known image captioning baselines [14] on the BCIDR dataset. In addition, we validate the state-of-the-art performance of the image model within MDNet on two public CIFAR datasets [18].

Image and language modeling: Joint image and language modeling enables the generation of semantic descriptions, which provides more intelligible predictions. Image captioning is one typical application [16]. Recent methods use recurrent neural networks (RNNs) to model natural language conditioned on image information modeled by CNNs [14, 33, 13, 38]. They typically employ powerful pre-trained CNN models, such as GoogLeNet [28], to provide image features. Semantic image features play a key role in accurate captioning [22, 4]. Many methods focus on learning better alignments from natural language words to the provided visual features, such as attention mechanisms [36, 38, 37], multimodal RNNs [22, 14, 4], and others [24, 37]. However, in the medical image domain, pre-trained universal CNN models are not available. A completely end-to-end trainable model for joint image-sentence modeling is an attractive open question, as it can facilitate multimodal knowledge sharing between the image and language models.
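The CNN-conditions-RNN pattern described above can be illustrated with a single toy LSTM step (NumPy sketch; the weight names, sizes, and the choice to feed the image feature as the first input are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # one LSTM step: gates are computed jointly, then split into
    # input (i), forget (f), output (o) gates and the cell candidate (g)
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(1)
d_in, d_h = 16, 8
W = rng.normal(scale=0.1, size=(4 * d_h, d_in))
U = rng.normal(scale=0.1, size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)

image_feature = rng.normal(size=(d_in,))   # toy stand-in for a CNN-encoded image
h = np.zeros(d_h)
c = np.zeros(d_h)
# condition the language model by feeding the image feature as the first input;
# subsequent steps would feed word embeddings and emit word distributions
h, c = lstm_step(image_feature, h, c, W, U, b)
```

Once the recurrent state is conditioned on the image, each subsequent hidden state `h` would be projected to a vocabulary distribution to generate report words.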

Image-sentence alignment also encourages visual explanations of a network's inner workings [15]. Hence, attention mechanisms become particularly necessary [36]. There is growing interest in exploring attention to achieve network interpretability [41, 27]. This line of work has vast potential to renovate computer-aided medical diagnosis, yet related work remains scarce. To date, [25] and [17] deal with the problem of generating disease keywords for radiology images.
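The soft-attention idea referenced here amounts to a weighted average over spatial image features, where the weights come from aligning each region with the current language state (illustrative NumPy sketch; dot-product scoring is our assumption — attention models often use a small learned MLP for the alignment score instead):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def soft_attention(features, query):
    # features: (N, D) vectors, one per image region; query: (D,) language state
    scores = features @ query        # alignment score per region
    alpha = softmax(scores)          # attention weights: non-negative, sum to 1
    context = alpha @ features       # (D,) attended context vector
    return context, alpha

rng = np.random.default_rng(2)
regions = rng.normal(size=(49, 64))  # e.g. a 7x7 feature map, flattened
state = rng.normal(size=(64,))
context, alpha = soft_attention(regions, state)
```

Because `alpha` is defined over image regions, reshaping it back to the spatial grid gives the attention map that can be overlaid on the image for visual interpretation.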

Skip-connection: Based on the residual network (ResNet) [6], the newer pre-activation ResNet [7] introduces identity-mapping skip-connections to address the difficulty of network training. Identity mapping has gradually become an acknowledged strategy for overcoming the barrier of training very deep networks [7, 11, 39, 10]. Besides, skip-connections encourage the integration of multi-scale representations for more efficient feature utilization [21, 1, 35].
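Why identity mapping eases training can be seen in a one-line derivative: for y = x + F(x), dy/dx = 1 + F'(x), so the constant 1 lets gradients flow unattenuated through many stacked units even when the residual branch is inactive. A scalar toy check (our own illustration, not code from [7]):

```python
def residual_forward(x, w):
    # y = x + F(x) with a tiny ReLU branch F(x) = max(0, w * x)
    return x + max(0.0, w * x)

def residual_grad(x, w):
    # dy/dx = 1 + F'(x); the identity term survives even when F'(x) = 0
    branch_grad = w if w * x > 0 else 0.0
    return 1.0 + branch_grad

y = residual_forward(2.0, 0.5)   # 2 + max(0, 1) = 3
g_active = residual_grad(2.0, 0.5)    # branch on: 1 + 0.5
g_inactive = residual_grad(-2.0, 0.5)  # branch off: gradient is still exactly 1
```

Without the identity term, `g_inactive` would be 0 and the gradient signal would vanish at that unit; with it, every unit passes at least the identity gradient through.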

This paper presents a novel unified network, namely MDNet, to establish a direct multimodal mapping between medical images and diagnostic reports. Our method provides a novel perspective on medical image diagnosis: generating diagnostic reports and the corresponding network attention makes the network diagnosis and decision-making process semantically and visually interpretable. Extensive experiments validate our proposed method.

Based on this work, several limitations and open questions remain: building and testing large-scale pathology image-report datasets; generating finer [27] attention for small biomarker localization; and applying the method to whole-slide diagnosis. We expect to address these in future work.

Copyright notice: This is an original post by the blogger and may not be reproduced without permission. https://blog.csdn.net/sunshine_010/article/details/79953172