Title
TransUNet: Rethinking the U-Net architecture design for medical image segmentation through the lens of transformers
01
Introduction
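Convolutional neural networks (CNNs), particularly fully convolutional networks (FCNs) (Long et al., 2015), have gained significant traction in medical image segmentation. Among their various iterations, the U-Net model (Ronneberger et al., 2015), with its symmetric encoder-decoder design and skip connections that enhance detail retention, has become the preferred choice for many researchers. Building on this approach, notable progress has been made across a range of medical imaging tasks, including cardiac segmentation in magnetic resonance imaging (MRI) (Yu et al., 2017), organ delineation using computed tomography (CT) (Zhou et al., 2017; Li et al., 2018b; Yu et al., 2018; Luo et al., 2021), and polyp segmentation in colonoscopy (Zhou et al., 2019).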
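Although CNNs have unparalleled representational power, the locality of convolution operations often leaves them lacking when modeling long-range relationships. This limitation is especially apparent when textures, shapes, and sizes vary widely across patients. Recognizing this shortcoming, the research community has increasingly turned to Transformers, models built entirely on attention mechanisms, because of their inherent strength in capturing global context (Vaswani et al., 2017). However, Transformers process their input as a 1D sequence and prioritize global context modeling, which tends to produce low-resolution features. A more promising approach is therefore a hybrid that combines CNN and Transformer encoders.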
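The hybrid idea described above, where a CNN feature map is flattened into a 1D token sequence and passed through self-attention, can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the TransUNet implementation: the function names (`tokenize_feature_map`, `self_attention`), the random 8-channel 4x4 feature map, the single attention head, and the random projection weights are all hypothetical stand-ins chosen for brevity.

```python
import numpy as np


def tokenize_feature_map(feat, patch=1):
    """Flatten a (C, H, W) feature map into (N, C * patch * patch) tokens.

    Hypothetical helper: each non-overlapping patch x patch window of the
    feature map becomes one token, ordered row by row.
    """
    c, h, w = feat.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    gh, gw = h // patch, w // patch
    # (C, gh, patch, gw, patch) -> (gh, gw, C, patch, patch) -> (N, C*patch*patch)
    x = feat.reshape(c, gh, patch, gw, patch).transpose(1, 3, 0, 2, 4)
    return x.reshape(gh * gw, c * patch * patch)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)  # every token attends to every other token
    return attn @ v, attn


rng = np.random.default_rng(0)

# Assumed stand-in for the output of a CNN encoder stage: 8 channels, 4x4 grid.
feat = rng.standard_normal((8, 4, 4))

tokens = tokenize_feature_map(feat, patch=1)  # 16 tokens, 8 features each
d = tokens.shape[1]

# Random projections stand in for learned query/key/value weights.
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out, attn = self_attention(tokens, wq, wk, wv)

# Fold the sequence back into a spatial map so a convolutional decoder
# could consume it (valid here because patch=1 keeps the channel count).
out_map = out.T.reshape(d, 4, 4)

print(tokens.shape, attn.shape, out_map.shape)
```

Each row of `attn` spans all 16 positions, which is how a single attention layer gives every location access to global context, in contrast to the local receptive field of a convolution. A practical model would add positional embeddings, multiple heads, and stacked layers, all omitted here.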
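TransUNet (Chen et al., 2021), first introduced in 2021, was one of the first models to integrate Transformers into medical image analysis. The method draws on the high-resolution spatial detail from the U-Net encoder while exploiting Transformers' strength in global context modeling, which is crucial in medical image segmentation. This innovation spurred subsequent research (Cao et al., 2022; Xie et al., 2021; Hatamizadeh et al., 2021). Nevertheless, a comprehensive understanding of Transformers' self-attention in the different U-Net components is still lacking.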
Abstract
Medical image segmentation is crucial for healthcare, yet convolution-based methods like U-Net face limitations in modeling long-range dependencies. To address this, Transformers designed for sequence-to-sequence predictions have been integrated into medical image segmentation. However, a comprehensive understanding of Transformers’ self-attention in U-Net components is lacking. TransUNet, first introduced in 2021, is widely recognized as one of the first models to integrate Transformer into medical image analysis. In this study, we present the versatile framework of TransUNet that encapsulates Transformers’ self-attention into two key modules: (1) a Transformer encoder tokenizing image patches from a convolution neural network (CNN) feature map, facilitating global context extraction, and (2) a Transformer decoder refining candidate regions through cross-attention between proposals and U-Net features. These modules can be flexibly inserted into the U-Net backbone, resulting in three configurations: Encoder-only, Decoder-only, and Encoder+Decoder. TransUNet provides a library encompassing both 2D and 3D implementations, enabling users to easily tailor the chosen architecture. Our findings highlight the encoder’s efficacy in modeling interactions among multiple abdominal organs and the decoder’s strength in handling small targets like tumors. It excels in diverse medical applications, such as multi-organ segmentation, pancreatic tumor segmentation, and hepatic vessel segmentation. Notably, our Trans