Dynamic Routing Between Capsules



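A capsule uses an activity vector to represent an object. The length of the vector represents the probability that the object is present, and the orientation of the vector represents the object's instantiation parameters (such as its pose). Lower-level capsules make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, the higher-level capsule becomes active. Iterative routing-by-agreement: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a large scalar product with the prediction coming from the lower-level capsule.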

1. Introduction


2. How the vector inputs and outputs of a capsule are computed

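We want the length of a capsule's output vector to represent the probability that the object represented by the capsule is present. We therefore use a non-linear "squashing" function to ensure that short vectors are shrunk to almost zero length and long vectors are shrunk to a length slightly below 1. The squashing function:

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}$$

where $v_j$ is the vector output of capsule $j$ and $s_j$ is its total input.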
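
The squashing function can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation; the function name `squash` and the small `eps` term (added only to avoid division by zero) are assumptions of this example.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Rescale s to length ||s||^2 / (1 + ||s||^2), keeping its direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)
    # eps is an assumption of this sketch; it guards against ||s|| = 0.
    return scale * s / np.sqrt(sq_norm + eps)

# A short vector is shrunk to almost zero, a long one to just below 1.
short_len = np.linalg.norm(squash(np.array([0.1, 0.0])))   # 0.01 / 1.01
long_len = np.linalg.norm(squash(np.array([10.0, 0.0])))   # 100 / 101
```

The direction of the input is preserved; only the length is mapped into the interval [0, 1).
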
For all but the first layer of capsules, the total input to a capsule $s_j$ is a weighted sum over all "prediction vectors" $\hat{u}_{j|i}$ from the capsules in the layer below, and each prediction vector is produced by multiplying the output $u_i$ of a capsule in the layer below by a weight matrix $W_{ij}$:
$$s_j = \sum_i c_{ij} \hat{u}_{j|i}, \qquad \hat{u}_{j|i} = W_{ij} u_i$$

where the $c_{ij}$ are coupling coefficients that are determined by the iterative dynamic routing process:
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$$

$$b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$$
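
The three formulas above can be combined into one routing loop. This is a hedged NumPy sketch for a single input example, not the authors' TensorFlow code; the function names, the tensor layout `(num_lower, num_upper, dim)`, and the default of 3 iterations are assumptions of this example.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))  # shifted for stability
    return e / e.sum(axis=axis, keepdims=True)

def route(u_hat, num_iters=3):
    """u_hat[i, j] is the prediction vector u_hat_{j|i} = W_ij u_i."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))            # routing logits b_ij
    for _ in range(num_iters):
        c = softmax(b, axis=1)                      # c_ij, normalized over parents j
        s = np.einsum("ij,ijd->jd", c, u_hat)       # s_j = sum_i c_ij u_hat_{j|i}
        v = squash(s)                               # v_j
        b = b + np.einsum("ijd,jd->ij", u_hat, v)   # b_ij += u_hat_{j|i} . v_j
    return v, c

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 3, 4))   # 6 lower capsules, 3 parents, 4D predictions
v, c = route(u_hat)
```

Because the logits start at zero, the first pass uses uniform coupling coefficients; later passes shift weight toward parents whose output agrees with a capsule's prediction.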

3. Margin loss for digit existence

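To allow for multiple digits, we use a separate margin loss, $L_k$, for each digit capsule $k$:

$$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2$$

where $T_k = 1$ iff a digit of class $k$ is present, $m^+ = 0.9$ and $m^- = 0.1$. The $\lambda$ down-weighting of the loss for absent digit classes stops the initial learning from shrinking the lengths of the activity vectors of all the digit capsules.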
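
As a small numeric illustration, the margin loss can be sketched in NumPy, assuming the form $L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2$. The function name and the value `lam=0.5` are assumptions of this example (these notes do not state $\lambda$); $m^+$ and $m^-$ are the values given above.

```python
import numpy as np

def margin_loss(v_norm, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """v_norm: capsule lengths ||v_k||; targets: T_k in {0, 1}, same shape."""
    present = targets * np.maximum(0.0, m_plus - v_norm) ** 2
    # lam down-weights the penalty for classes that are absent (assumed 0.5).
    absent = lam * (1.0 - targets) * np.maximum(0.0, v_norm - m_minus) ** 2
    return np.sum(present + absent, axis=-1)

v_norm = np.array([0.95, 0.05, 0.5])   # lengths of three digit capsules
targets = np.array([1.0, 0.0, 0.0])    # only class 0 is present
loss = margin_loss(v_norm, targets)    # only the third capsule is penalized
```

Here the first capsule is long enough (above 0.9) and the second short enough (below 0.1), so the only contribution is 0.5 × (0.5 − 0.1)² from the third capsule.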

4. CapsNet architecture

The architecture is shallow with only two convolutional layers and one fully connected layer.
Conv1 has 256 $9 \times 9$ convolution kernels with a stride of 1 and ReLU activation. (The features it extracts are used as the inputs to the primary capsules.)
The second layer (PrimaryCapsules) is a convolutional capsule layer with 32 channels of convolutional 8D capsules (i.e. each primary capsule contains 8 convolutional units with a $9 \times 9$ kernel and a stride of 2). Each primary capsule output sees the outputs of all $256 \times 81$ Conv1 units whose receptive fields overlap with the location of the center of the capsule.
In total PrimaryCapsules has $32 \times 6 \times 6$ capsule outputs (each output is an 8D vector), and the capsules in each $6 \times 6$ grid share their weights with each other.
One can see PrimaryCapsules as a convolution layer with the squashing function as its block non-linearity.
The final Layer (DigitCaps) has one 16D capsule per digit class and each of these capsules receives input from all the capsules in the layer below.
We have routing only between two consecutive capsule layers (e.g. PrimaryCapsules and DigitCaps).
Since Conv1 output is 1D, there is no orientation in its space to agree on. (A one-dimensional feature has no direction to compute.)
Therefore, no routing is used between Conv1 and PrimaryCapsules. All the routing logits ($b_{ij}$) are initialized to zero. Therefore, initially a capsule output ($u_i$) is sent to all parent capsules with equal probability ($c_{ij}$).
Our implementation is in TensorFlow and we use the Adam optimizer with its TensorFlow default parameters, including the exponentially decaying learning rate, to minimize the sum of the margin losses.
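
The layer sizes quoted in this section can be checked with a few lines of arithmetic. This sketch assumes a 28×28 MNIST input, no padding, and 10 digit classes; the helper `conv_out` and the variable names are made up for this example.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded convolution (assumption: no padding)."""
    return (size - kernel) // stride + 1

conv1 = conv_out(28, 9, 1)            # Conv1 feature map: 20 x 20
primary = conv_out(conv1, 9, 2)       # PrimaryCapsules grid: 6 x 6
num_primary = 32 * primary * primary  # 32 channels -> 1152 primary capsules

# Each W_ij maps an 8D primary capsule to a 16D prediction for one of
# 10 digit capsules, so it has 8 * 16 entries.
num_routing_weights = num_primary * 10 * 8 * 16

print(conv1, primary, num_primary, num_routing_weights)
```

Under these assumptions, the 6×6 grid in the text follows directly from the two 9×9 convolutions with strides 1 and 2.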

4.1. Reconstruction as a regularization method

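We minimize the sum of squared differences between the decoder outputs and the pixel intensities. This reconstruction loss is scaled down by a factor of 0.0005 so that it does not dominate the margin loss during training. Reconstructions from the 16D output of the CapsNet are robust while keeping only the important details.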
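
How the scaled reconstruction term combines with the margin loss can be sketched as follows. This is an illustration only: the function name, the flattened 784-pixel image, and the example values are assumptions; the 0.0005 factor is the one quoted above.

```python
import numpy as np

def total_loss(margin, reconstruction, image, recon_weight=0.0005):
    """Margin loss plus the down-weighted sum of squared pixel errors."""
    recon = np.sum((reconstruction - image) ** 2, axis=-1)
    return margin + recon_weight * recon

image = np.zeros(784)                  # flattened 28 x 28 image (assumed size)
reconstruction = np.full(784, 0.1)     # decoder output, off by 0.1 per pixel
loss = total_loss(0.08, reconstruction, image)
# Squared error is 784 * 0.01 = 7.84; after scaling it adds only 0.00392.
```

Without the scaling, the reconstruction error of 7.84 in this example would be roughly a hundred times larger than the margin loss of 0.08.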

5. Capsules on MNIST

CapsNet is found to outperform the state of the art on MNIST.

5.1. What the individual dimensions of a capsule represent


5.2. Robustness to Affine Transformations

Affine transformations: spatial transformations such as translation, rotation, and scaling.
An under-trained CapsNet with early stopping which achieved 99.23% accuracy on the expanded MNIST test set achieved 79% accuracy on the affNIST test set.
A traditional convolutional model with a similar number of parameters which achieved similar accuracy (99.22%) on the expanded MNIST test set only achieved 66% on the affNIST test set.

6. Segmenting highly overlapping digits


6.1. MultiMNIST dataset

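Pairs of MNIST digits are overlaid on a $36 \times 36$ canvas. Each digit can be regarded as bounded by a $20 \times 20$ box, and the bounding boxes of the two digits overlap by 80% on average. The training set size is 60M and the test set size is 10M.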

6.2. MultiMNIST results

Our 3 layer CapsNet model trained from scratch on MultiMNIST training data achieves higher test classification accuracy than our baseline convolutional model.

7. Other datasets

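Like other generative models, one drawback of capsules is that they like to account for everything in the image, so when fitting more cluttered images they do better with soft labels than with one-hot labels.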

8. Discussion and previous work

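For nearly thirty years, the state of the art in speech recognition used hidden Markov models with Gaussian mixtures as output distributions. These models were easy to learn on small computers, but compared with RNNs that use distributed representations they have a fatal representational limitation: they represent information in a one-hot-like way, which is inefficient. To double the amount of information an HMM can remember, the number of hidden nodes has to be squared. For a recurrent net, the number of hidden neurons only needs to be doubled.
Capsules aim to address this representational limitation. They target a problem in CNNs analogous to the one that separates GMM-HMMs from RNNs, and our experiments also show their strong representational power.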
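
The counting argument above can be made concrete under an idealized assumption that is not in the notes: each hidden unit of the recurrent net is binary. An HMM with $N$ hidden states is in exactly one state at a time, so it carries at most $\log_2 N$ bits of history, while a net with $n$ binary hidden units can be in $2^n$ configurations and carries up to $n$ bits.

$$2 \log_2 N = \log_2 N^2 \;\Rightarrow\; N \to N^2, \qquad 2n = \log_2 2^{2n} \;\Rightarrow\; n \to 2n$$

For example, going from 10 to 20 bits takes $2^{10} = 1024 \to 2^{20} \approx 10^6$ HMM states, but only $10 \to 20$ binary units.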



