Notes on the paper "Dynamic Routing Between Capsules"

authors: Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton
year: 2017
institution: Google Brain
★ Patents have been filed on the techniques described in the paper.

Abstract

  • A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity, such as an object or an object part.
  • The length of the activity vector represents the probability that the entity exists.
  • The orientation of the activity vector represents the instantiation parameters.
  • Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher-level capsule becomes active.
  • A multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits.
  • The paper uses an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.

Introduction

  • Human vision uses a carefully determined sequence of fixation points to ignore irrelevant details, ensuring that only a tiny fraction of the optic array is ever processed at the highest resolution.
  • Introspection is a poor guide to understanding how much of our knowledge of a scene comes from the sequence of fixations and how much we glean from a single fixation. The paper assumes that a single fixation gives us much more than just a single identified object and its properties.
  • We assume that our multi-layer visual system creates a parse-tree-like structure on each fixation, and we ignore the issue of how these single-fixation parse trees are coordinated over multiple fixations.
  • Parse trees are generally constructed on the fly by dynamically allocating memory. Here, instead, a parse tree is carved out of a fixed multi-layer neural network, like a sculpture is carved from a rock. Each layer is divided into many small groups of neurons called "capsules", and each node in the parse tree corresponds to an active capsule.
  • Using an iterative routing process, each active capsule chooses a capsule in the layer above to be its parent in the tree. For the higher levels of a visual system, this iterative process solves the problem of assigning parts to wholes.
  • The activities of the neurons within an active capsule represent the various properties of a particular entity that is present in the image. These properties can include many different types of instantiation parameter, such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc.
  • One very special property is the existence of the instantiated entity in the image. An obvious way to represent existence is with a separate logistic unit whose output is the probability that the entity exists. The paper explores an interesting alternative: use the overall length of the vector of instantiation parameters to represent the existence of the entity, and force the orientation of the vector to represent the properties of the entity. The length of a capsule's vector output is kept below 1 by applying a non-linearity that leaves the orientation of the vector unchanged but scales down its magnitude.
  • Because the output of a capsule is a vector, a powerful dynamic routing mechanism can be used to ensure that the output is sent to an appropriate parent in the layer above. Initially, the output is routed to all possible parents, scaled down by coupling coefficients that sum to 1.
  • For each possible parent, the capsule computes a "prediction vector" by multiplying its own output by a weight matrix. If this prediction vector has a large scalar product with the output of a possible parent, top-down feedback increases the coupling coefficient for that parent and decreases it for the other parents.
  • This increases the contribution the capsule makes to that parent, further increasing the scalar product of the capsule's prediction with the parent's output. This "routing by agreement" should be far more effective than the very primitive form of routing implemented by max-pooling, which lets neurons in one layer ignore all but the most active feature detector in a local pool in the layer below. The paper demonstrates that dynamic routing is an effective way to implement the "explaining away" needed for segmenting highly overlapping objects.
  • Convolutional neural networks (CNNs) use translated replicas of learned feature detectors, which lets them transfer knowledge about good weight values acquired at one position in an image to other positions. This has proven extremely helpful in image interpretation.
  • Even though the scalar-output feature detectors of CNNs are replaced with vector-output capsules and max-pooling with routing by agreement, we would still like to replicate learned knowledge across space. To achieve this, all but the last layer of capsules are convolutional. As with CNNs, higher-level capsules cover larger regions of the image.
  • Unlike max-pooling, however, information about the precise position of the entity within the region is not thrown away. For low-level capsules, location information is "place-coded" by which capsule is active. Ascending the hierarchy, more and more of the positional information is "rate-coded" in the real-valued components of a capsule's output vector. This shift from place-coding to rate-coding, combined with the fact that higher-level capsules represent more complex entities with more degrees of freedom, suggests that the dimensionality of capsules should increase as we ascend the hierarchy.

How the vector inputs and outputs of a capsule are computed

We want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. We therefore use a non-linear "squashing" function to ensure that short vectors get shrunk to almost zero length and long vectors get shrunk to a length slightly below 1. It is left to discriminative learning to make good use of this non-linearity.
$$\mathbf{v}_{j}=\frac{\left\|\mathbf{s}_{j}\right\|^{2}}{1+\left\|\mathbf{s}_{j}\right\|^{2}} \cdot \frac{\mathbf{s}_{j}}{\left\|\mathbf{s}_{j}\right\|}$$
where $\mathbf{v}_j$ is the vector output of capsule $j$ and $\mathbf{s}_j$ is its total input.
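
The squashing function is easy to state in code. Below is a minimal numpy sketch (my own, not the authors' code; the `eps` term is an assumption added for numerical stability at zero length):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink short vectors to ~0 and long vectors to length slightly
    below 1, without changing their orientation."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

# Example: a long input vector keeps its direction, length approaches 1.
v = squash(np.array([3.0, 4.0]))  # input length 5.0
print(np.linalg.norm(v))          # ~0.96, same direction as [3, 4]
```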

For all but the first layer of capsules, the total input $\mathbf{s}_j$ is a weighted sum over the "prediction vectors" $\hat{\mathbf{u}}_{j \mid i}$ from the capsules in the layer below, each produced by multiplying the output $\mathbf{u}_i$ of a lower-layer capsule by a weight matrix $\mathbf{W}_{ij}$:
$$\mathbf{s}_{j}=\sum_{i} c_{i j} \hat{\mathbf{u}}_{j \mid i}, \qquad \hat{\mathbf{u}}_{j \mid i}=\mathbf{W}_{i j} \mathbf{u}_{i}$$
where the $c_{ij}$ are coupling coefficients determined by the iterative dynamic routing process. The coupling coefficients between capsule $i$ and all capsules in the layer above sum to 1 and are determined by a "routing softmax", whose initial logits $b_{ij}$ are the log prior probabilities that capsule $i$ should be coupled to capsule $j$:
$$c_{i j}=\frac{\exp \left(b_{i j}\right)}{\sum_{k} \exp \left(b_{i k}\right)}$$
These log priors can be learned discriminatively along with all the other weights. They depend on the location and type of the two capsules, but not on the current input image. The initial coupling coefficients are then iteratively refined by measuring the agreement between the current output $\mathbf{v}_j$ of each capsule $j$ in the layer above and the prediction $\hat{\mathbf{u}}_{j \mid i}$ made by capsule $i$.

This agreement is simply the scalar product $a_{ij} = \mathbf{v}_{j} \cdot \hat{\mathbf{u}}_{j \mid i}$. The agreement is treated as if it were a log likelihood and added to the initial logit $b_{ij}$ before computing the new values of all the coupling coefficients linking capsule $i$ to higher-level capsules.

In convolutional capsule layers, each capsule outputs a local grid of vectors to each type of capsule in the layer above, using different transformation matrices for each member of the grid as well as for each type of capsule.

[Figure: Procedure 1 — the routing algorithm (capsules_procedure1.png; original image unavailable)]
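
Procedure 1 can be summarized in a short numpy sketch (a minimal reimplementation under my own naming, assuming a fixed number of routing iterations; not the authors' code):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Same squashing non-linearity as defined above.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iterations=3):
    """u_hat: prediction vectors u_hat_{j|i},
    shape [num_lower, num_upper, dim_upper]."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))                      # initial logits b_ij
    for _ in range(num_iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # routing softmax over j
        s = np.einsum('ij,ijd->jd', c, u_hat)                 # s_j = sum_i c_ij * u_hat_{j|i}
        v = squash(s)                                         # v_j = squash(s_j)
        b = b + np.einsum('ijd,jd->ij', u_hat, v)             # b_ij += agreement a_ij
    return v                                                  # upper-layer capsule outputs

# Example with PrimaryCaps -> DigitCaps shapes: 1152 lower capsules,
# 10 upper capsules, 16-dimensional outputs.
u_hat = np.random.randn(1152, 10, 16) * 0.01
print(dynamic_routing(u_hat).shape)  # (10, 16)
```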

Margin loss for digit existence

We are using the length of the instantiation vector to represent the probability that a capsule's entity exists. We would like the top-level capsule for digit class $k$ to have a long instantiation vector if and only if that digit is present in the image. To allow for multiple digits, a separate margin loss $L_k$ is used for each digit capsule $k$:
$$L_{k}=T_{k} \max \left(0, m^{+}-\left\|\mathbf{v}_{k}\right\|\right)^{2}+\lambda\left(1-T_{k}\right) \max \left(0,\left\|\mathbf{v}_{k}\right\|-m^{-}\right)^{2}$$
where $T_k = 1$ iff a digit of class $k$ is present, $m^{+} = 0.9$ and $m^{-} = 0.1$. The factor $\lambda$ down-weights the loss for absent digit classes so that the initial learning does not shrink the lengths of the activity vectors of all the digit capsules; the paper uses $\lambda = 0.5$. The total loss is simply the sum of the losses of all digit capsules.
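
A minimal numpy sketch of this loss (my own translation of the formula; the `v_lengths` input and the batch-mean reduction are assumptions):

```python
import numpy as np

def margin_loss(v_lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """v_lengths: capsule lengths ||v_k||, shape [batch, 10].
    targets: one-hot labels T_k, shape [batch, 10]."""
    present = targets * np.maximum(0.0, m_pos - v_lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, v_lengths - m_neg) ** 2
    # Total loss is the sum over all digit capsules (averaged over the batch here).
    return (present + absent).sum(axis=-1).mean()
```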

Architecture

[Figure: CapsNet architecture — Conv1, PrimaryCaps, DigitCaps (original image unavailable)]

  • The first convolutional layer (Conv1) uses 256 9×9 kernels (input depth 1, stride 1) with a ReLU activation. Its output tensor is 20×20×256; the receptive field of each kernel is 9×9. The number of weights into this layer is 9×9×1×256 + 256 = 20,992, where the trailing 256 counts the biases.

  • The second convolutional layer builds the tensor structure that feeds the capsule layer. It uses 32×8 9×9 kernels (depth 256) with stride 2. The output tensor is 6×6×8×32, i.e. 6×6×32 capsule vectors of dimension 8. The number of weights between the two layers is 9×9×256×8×32 + 8×32 = 5,308,672.

  • The PrimaryCaps layer thus has 6×6×32 capsules, with 6×6 capsules per map. Different maps represent different capsule types, while different capsules within the same map represent different positions.

  • The third layer, DigitCaps, propagates and routes on top of the second layer's output vectors. The second layer emits 6×6×32 = 1152 capsules in total, so layer $i$ has 1152 capsules and layer $j$ has 10 capsules (each a 16-dimensional vector).

    • There are 1152×10 matrices $\mathbf{W}_{ij}$, each of size 8×16.
    • Multiplying $\mathbf{u}_i$ (an 8×1 vector) by $\mathbf{W}_{ij}$ gives the prediction vector ($[8,16]^{T} \times [8,1] = [16,1]$); there are accordingly 1152×10 coupling coefficients $c_{ij}$.
    • Passing $\mathbf{s}_j$ through squashing gives the activated final output $\mathbf{v}_j$.
    • Among the parameters between the DigitCaps and PrimaryCaps layers, the $\mathbf{W}_{ij}$ account for 1152×10×8×16 = 1,474,560 parameters, the $c_{ij}$ for 1152×10, and the $b_{ij}$ for 1152×10. These counts are verified in the sketch after this list.
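
The parameter counts quoted above can be verified with a few lines of plain Python:

```python
conv1   = 9 * 9 * 1 * 256 + 256            # Conv1 weights + biases = 20992
primary = 9 * 9 * 256 * (8 * 32) + 8 * 32  # PrimaryCaps weights + biases = 5308672
W_ij    = 1152 * 10 * 8 * 16               # DigitCaps transformation matrices = 1474560
c_ij = b_ij = 1152 * 10                    # coupling coefficients and routing logits
print(conv1, primary, W_ij, c_ij)
```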

Experimental results

[Figure: experimental results (original image unavailable)]

Code

To be completed.

Summary

Motivation: the training perspective

  1. CNNs generalize well to samples like those they have seen and poorly to those they have not (e.g., a network trained on frontal faces struggles to recognize profile views). This is the CNN's inductive bias.
  2. To generalize well across most situations, training samples covering all of those situations are needed, which greatly increases the cost of training.
  3. Capsules aim to learn latent representations from which the many variants of a sample can be inferred straightforwardly, letting the model generalize more broadly and addressing this limitation of CNNs.
  4. Because capsules can infer those variants from the latent representations, the number of training samples can be greatly reduced.

Motivation: the conceptual perspective

  1. A CNN does not understand the objects in an image; it merely extracts complex relationships between pixels.
  2. Capsules aim to pay more attention to the entities in the image, recognizing it correctly by explaining and describing those entities.

Differences from traditional convolutional nets

So far the network has only been applied to classification on MNIST. In practice, looking at implementations, capsule networks do not perform as well as one might hope: training is slow and convergence takes many iterations. Nevertheless, the design and the motivation behind it are forward-looking, and this style of research is worth learning from.
