OBELISK[^1]
Abstract
Deep networks have set the state of the art in most image analysis tasks by replacing handcrafted features with learned convolution filters within end-to-end trainable architectures. Still, the specification of a convolutional network is subject to much manual design – the shape and size of the receptive field of convolutional operations is a very sensitive part that has to be tuned for different image analysis applications. 3D fully-convolutional multi-scale architectures with skip-connections excel at semantic segmentation and landmark localisation, but have huge memory requirements and rely on large annotated datasets – an important limitation for wider adoption in medical image analysis.
Extensive validation experiments indicate that the performance of sparse deformable convolutions is due to their ability to capture large spatial context with few expressive filter parameters and that network depth is not always necessary to learn complex shape and appearance features. A combination with conventional CNNs further improves the delineation of small organs with large shape variations and the fast inference time using flexible image sampling may offer new potential use cases for deep networks in computer-assisted, image-guided interventions.
Introduction and Motivation
To address the computational cost of patch-based classification, so-called fully convolutional networks (FCNs) have been proposed (Long et al., 2015; Ronneberger et al., 2015) that use an intrinsic multi-scale approach within an encoder-decoder architecture together with residual or skip connections to obtain a good trade-off between accuracy and computational demand, while still relying on dozens of convolutional layers. However, we note that most object and medical segmentation tasks deal with fewer than a dozen classes of anatomy, which raises the question of whether very deep networks with many spatial convolution layers are indeed required to obtain high-quality results.
In this work, we present an alternative concept to dilated convolutions in which both the spatial filter offsets and the coefficients of a large and sparse convolutional kernel are learned in a continuous, differentiable space. We strongly believe that these spatially deformable kernels, which automatically adapt their filter layout, are an extremely important clue for deepening the understanding of the processes within convolutional networks and for improving their applicability to medical volumetric data.
Method
Fig. 1: Implementation of conventional convolutions in deep networks using the im2col operator to extract overlapping patches, followed by a matrix multiplication with a filter bank and reshaping to the expected feature map dimensions. The deformable convolutions in OBELISK follow a similar principle, but replace the rectangular patch extraction with a continuously sampled spatial filter offset layout, which is added to the feature map coordinates (here a 3×5 grid), and bilinear interpolation using the gridsample operator. The subsequent hardware-optimised single-precision matrix multiplication with a filter bank is equally efficient in terms of computation time, but may capture much more spatial context within a single layer and with few trainable parameters.
This concept can easily be integrated into commonly used U-Net or V-Net architectures.
For improved generality, we will now describe OBELISK for multichannel inputs (for ease of description explained in 2D) and define a more general deformable convolution operation as follows. It requires a dense input tensor (feature maps) of size $B \times C_{in} \times H \times W$, a spatial sampling coordinate tensor of size $B \times 1 \times S_{sp} \times 2$, a deformable offset tensor of size $1 \times K \times 1 \times 2$, and a weight tensor of size $1 \times C_{out} \times C_{in} \cdot K$. Here, $B$ is the batch size, $C_{in}$ the number of input channels, $C_{out}$ the number of output channels, $H$ and $W$ the spatial dimensions of the input feature map, $S_{sp}$ the number of spatial output locations, and $K$ the number of kernel elements.
When using OBELISK layers within a fully convolutional architecture, the spatial sampling has to be chosen so that $S_{sp}$ corresponds to a regular grid for further processing. Using the grid sample operation, the input is sampled to a new size of $B \times C_{in} \times K \times S_{sp}$, similar to the im2col process for classical convolutions (see also Fig. 1).
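To make the shapes above concrete, here is a minimal 2D PyTorch sketch (our own illustration, not the authors' released code; all variable names and the offset initialisation scale are assumptions): the learned offsets are broadcast-added to a regular sampling grid, `grid_sample` performs the bilinear gather that replaces im2col, and the filter bank is applied as a single matrix multiplication.

```python
import torch
import torch.nn.functional as F

# Shapes follow the text: input B x C_in x H x W, sampling grid B x 1 x S_sp x 2,
# offsets 1 x K x 1 x 2, weight C_out x (C_in * K).
B, C_in, C_out, H, W, K = 2, 4, 8, 32, 32, 128

feat = torch.randn(B, C_in, H, W)
# regular 16x16 output grid in normalised [-1, 1] coordinates, flattened to S_sp points
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 16),
                        torch.linspace(-1, 1, 16), indexing='ij')
sample_xy = torch.stack((xs, ys), dim=-1).reshape(1, 1, -1, 2).expand(B, -1, -1, -1)
S_sp = sample_xy.shape[2]

# learnable spatial filter offsets and filter coefficients
offsets = torch.nn.Parameter(0.2 * torch.randn(1, K, 1, 2))
weight = torch.nn.Parameter(torch.randn(C_out, C_in * K) / (C_in * K) ** 0.5)

# broadcast-add offsets to every sampling location -> grid of B x K x S_sp x 2
grid = sample_xy + offsets
# bilinear sampling replaces im2col: output is B x C_in x K x S_sp
sampled = F.grid_sample(feat, grid, mode='bilinear', align_corners=True)
# fold kernel elements into the channel dimension and apply the filter bank
# as one matrix multiplication (a 1x1 convolution over the S_sp positions)
sampled = sampled.view(B, C_in * K, S_sp)
out = torch.einsum('oc,bcs->bos', weight, sampled)  # B x C_out x S_sp
out = out.view(B, C_out, 16, 16)                    # reshape to the coarse grid
```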
Hybrid OBELISK-CNN
Conventional CNN layers at high and medium resolution are followed by two OBELISK layers with spatially deformable convolutions. These kernels use (in contrast to the original OBELISK network) a regular coarse grid for the generated feature maps, and again employ only 1×1 filters as a multilayer perceptron for feature abstraction. Every voxel can be computed independently, and far fewer parameters, which are shared across sampling locations (translation invariance), are required.
Hybrid OBELISK-CNN: Finally, we also explore a hybrid architecture that combines the advantages of OBELISK with the edge-preserving filters learned by a shallow U-Net (see Fig. 4). For this network, we use two OBELISK layers – one with 512 spatial filter offsets and a very coarse regular sampling grid (spacing of eight w.r.t. the size of the input) and another one with 128 spatial filter offsets and a coarse grid (quarter size of each input dimension) – both followed by a fully-convolutional 1×1 dense-net. We resorted to the simpler unary variant, where no offsets are paired or subtracted. These layers contain fewer than 50k parameters and each uses a smoothed version of the original scans as input. Their outputs have 32 and 16 channels respectively and are concatenated within a shallow U-Net that follows the baseline U-Net model in Table 1, but with the most parameter-intensive layers #6 - #10 removed, so that only around 140k trainable parameters remain. A shape walk-through of this fusion is sketched below.
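The following PyTorch sketch traces only the tensor shapes of this fusion for a $144^3$ input; where exactly the concatenation happens inside the decoder is our own assumption, not taken from the released code.

```python
import torch
import torch.nn.functional as F

# For a 144^3 input, the stride-8 OBELISK branch yields an 18^3 grid and the
# stride-4 branch a 36^3 grid; the coarse branch is upsampled and both are
# concatenated with the shallow U-Net's feature maps at quarter resolution.
B = 1
obelisk_coarse = torch.randn(B, 32, 18, 18, 18)  # 512 offsets, grid spacing 8
obelisk_mid    = torch.randn(B, 16, 36, 36, 36)  # 128 offsets, grid spacing 4
unet_feats     = torch.randn(B, 32, 36, 36, 36)  # shallow U-Net features (assumed)

up = F.interpolate(obelisk_coarse, size=(36, 36, 36),
                   mode='trilinear', align_corners=False)
fused = torch.cat([up, obelisk_mid, unet_feats], dim=1)  # B x 80 x 36^3
```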
U-Net:
Layer | (Out) Size | Kernel / Stride | # Channels | Skip
---|---|---|---|---
Input | $144^3$ | | 1 |
#1 | $144^3$ | $3 \times 3 \times 3$ | 1 → 5 | → #13
#2 | $72^3$ | $3 \times 3 \times 3$ / 2 | 5 → 16 |
#3 | $72^3$ | $3 \times 3 \times 3$ | 16 → 16 | → #12
#4 | $36^3$ | $3 \times 3 \times 3$ / 2 | 16 → 32 |
#5 | $36^3$ | $3 \times 3 \times 3$ | 32 → 32 | → #11
… | | | |
#11 | $36^3$ | $3 \times 3 \times 3$ | x → 32 | #5 →
 | $72^3$ | Upsample | 32 |
#12 | $72^3$ | $3 \times 3 \times 3$ | 48 → 9 | #3 →
 | $144^3$ | Upsample | 64 |
#13 | $144^3$ | $3 \times 3 \times 3$ | 14 → #L | #1 →
#14 | $144^3$ | $3 \times 3 \times 3$ | #L → #L |

(#L denotes the number of segmentation labels.)
OBELISK:
Description of our original OBELISK model with 1×1 Dense-Net.
Layer | (Out) Size | Kernel / Stride | # Channels / Groups | Skip
---|---|---|---|---
Input | $144^3$ | | 1 |
Avg | $144^3$ | $3 \times 3 \times 3$ | 1 |
Avg | $144^3$ | $3 \times 3 \times 3$ | 1 |
offsets | $1024 \times 3$ | | |
 | $S_{sp}$ | gridsample | 1 → 1024 |
#1 | $S_{sp}$ | $1 \times 1 \times 1$ | 1024 → 256 / 4 |
#2 | $S_{sp}$ | $1 \times 1 \times 1$ | 256 → 128 | → #3 - #6
#3 | $S_{sp}$ | $1 \times 1 \times 1$ | 128 → 32 | → #4 - #6
#4 | $S_{sp}$ | $1 \times 1 \times 1$ | 160 → 32 | → #5 - #6
#5 | $S_{sp}$ | $1 \times 1 \times 1$ | 192 → 32 | → #6
#6 | $S_{sp}$ | $1 \times 1 \times 1$ | 224 → 32 |
#7 | $S_{sp}$ | $1 \times 1 \times 1$ | 32 → #L |
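The dense connectivity in the table (layer #2 feeding #3 - #6, each subsequent layer adding 32 channels: 128 → 160 → 192 → 224) can be written compactly. The following PyTorch sketch mirrors only the channel counts above; the module structure and names are our own, and sampled features are treated as a flat list of $S_{sp}$ positions via 1D convolutions.

```python
import torch
import torch.nn as nn

class DenseHead1x1(nn.Module):
    """Illustrative 1x1 Dense-Net head matching the channel counts in the table."""
    def __init__(self, num_labels):
        super().__init__()
        self.l1 = nn.Conv1d(1024, 256, 1, groups=4)  # #1: grouped 1x1 (1024 -> 256 / 4)
        self.l2 = nn.Conv1d(256, 128, 1)             # #2
        self.l3 = nn.Conv1d(128, 32, 1)              # #3
        self.l4 = nn.Conv1d(160, 32, 1)              # #4: 128 + 32 inputs
        self.l5 = nn.Conv1d(192, 32, 1)              # #5: 128 + 2*32 inputs
        self.l6 = nn.Conv1d(224, 32, 1)              # #6: 128 + 3*32 inputs
        self.l7 = nn.Conv1d(32, num_labels, 1)       # #7: final label prediction
        self.act = nn.ReLU()

    def forward(self, x):            # x: B x 1024 x S_sp (gridsampled features)
        x2 = self.act(self.l2(self.act(self.l1(x))))
        x3 = self.act(self.l3(x2))
        x4 = self.act(self.l4(torch.cat([x2, x3], 1)))
        x5 = self.act(self.l5(torch.cat([x2, x3, x4], 1)))
        x6 = self.act(self.l6(torch.cat([x2, x3, x4, x5], 1)))
        return self.l7(x6)           # B x num_labels x S_sp
```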
Experiments
We performed extensive experiments on two 3D CT datasets with different challenges. First, we explore the difficulties of a limited-size training dataset (10 VISCERAL training scans (Jimenez-del Toro et al., 2016)) with a set of moderately challenging abdominal structures, as done in our conference paper, without specific preprocessing of the scans. Second, we perform additional validation experiments of different network architectures on a public multi-label dataset that is based on the TCIA CT pancreas dataset (with 43 scans) introduced in (Roth et al., 2015), but has been extended with more manually labelled organs and uses the tightly cropped regions of interest as done in (Gibson et al., 2018).
The VISCERAL experiment is carried out with leave-one-out cross-validation for the segmentation of the following seven anatomical structures: …
The only pre-processing applied was a resampling to an isotropic voxel size of 3 mm and cropping to dimensions of 156 × 115 × 160 voxels, without using any guidance information.
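A minimal preprocessing sketch along these lines, using nibabel and scipy; the function name and the centre-crop/pad policy for reaching the fixed dimensions are our own assumptions.

```python
import nibabel as nib
import numpy as np
from scipy.ndimage import zoom

def preprocess(path, target_shape=(156, 115, 160), iso_mm=3.0):
    img = nib.load(path)
    data = img.get_fdata().astype(np.float32)
    # zoom factor per axis = original spacing / target spacing (3 mm isotropic)
    factors = [s / iso_mm for s in img.header.get_zooms()[:3]]
    data = zoom(data, factors, order=1)  # trilinear resampling
    # centre-crop every axis that is too large ...
    for ax, t in enumerate(target_shape):
        if data.shape[ax] > t:
            start = (data.shape[ax] - t) // 2
            data = np.take(data, range(start, start + t), axis=ax)
    # ... and symmetrically pad every axis that is too small
    pads = [((t - s) // 2, t - s - (t - s) // 2)
            for s, t in zip(data.shape, target_shape)]
    return np.pad(data, pads, constant_values=float(data.min()))
```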
In our new TCIA experiments we applied the same preprocessing as proposed by (Gibson et al., 2018), which involves a tight manual bounding-box cropping and resampling to a common dimension of $144^3$ voxels (which results in irregular voxel spacings). It was discussed in (Gibson et al., 2018) that the manual bounding-box selection was an important aspect, and it was argued that single-pass feed-forward U-Nets without attention gates (Oktay et al., 2018) depend on accurate initialisation. Due to the larger database size of 43 scans, we employ a four-fold cross-validation with either 32 or 33 training images and 10 or 11 test scans.
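The fold sizes follow directly from splitting 43 scans into four parts; a minimal sketch (the randomisation and indexing are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
folds = np.array_split(rng.permutation(43), 4)  # test sets of 11, 11, 11 and 10 scans
for test_ids in folds:
    train_ids = np.setdiff1d(np.arange(43), test_ids)  # 32 or 33 training scans
```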
Even though we have not used any post-processing such as edge-preserving smoothing (Heinrich and Blendowski, 2016), more aggressive data augmentation (including elastic deformations and contrast variations), or curvature flow smoothing (Gibson et al., 2018), our new approach comes close to state-of-the-art accuracies for two challenging 3D CT segmentation tasks.
We experimented with different schemes for dealing with class imbalance, including the conventional Dice loss (Milletari et al., 2016) and a hinge Dice loss (Gibson et al., 2018), but found the differences to a classical cross-entropy loss to be negligible after 300 epochs for all U-Net architectures. For OBELISK we employ online hard example mining as before, together with a cross-entropy loss weighted by the square root of the inverse label frequency (the foreground weights are scaled to an average of 1 and the background is weighted with 0.5). Randomly applied affine image transformations are used throughout as the only type of data augmentation unless otherwise mentioned.
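A sketch of this weighting scheme, reconstructed from the description above (the online hard example mining step is omitted, and the helper name is ours):

```python
import torch
import torch.nn.functional as F

def make_class_weights(labels, num_classes):
    """Square root of inverse label frequency; foreground mean 1, background 0.5."""
    counts = torch.bincount(labels.reshape(-1), minlength=num_classes).float()
    w = (counts.sum() / counts.clamp(min=1)).sqrt()  # sqrt inverse frequency
    w[1:] = w[1:] / w[1:].mean()                     # rescale foreground mean to 1
    w[0] = 0.5                                       # fixed background weight
    return w

# usage: logits B x L x D x H x W, target B x D x H x W (long)
# weights = make_class_weights(target, num_classes=L)
# loss = F.cross_entropy(logits, target, weight=weights)
```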
As the quantitative segmentation accuracies show, OBELISK layers with very few parameters can replace conventional convolution kernels with fixed geometry and often yield better representation-learning capabilities, although they may incur more complex numerical operations due to irregular memory access patterns and the additional computation of trilinear interpolation.
In conclusion, we believe that OBELISK's deformable convolutions are a promising and powerful plug-in that can easily be integrated into existing shallow U-Nets, and that they offer great potential for use as pre-trained low-parameter shape encoders.
[^1]: Heinrich, M. P., Oktay, O., Bouteldja, N. "OBELISK-Net: Fewer Layers to Solve 3D Multi-Organ Segmentation with Sparse Deformable Convolutions." Medical Image Analysis 54 (2019): 1–19.