MDNet: A Semantically and Visually Interpretable Medical Image Diagnosis Network
ResNet
$$y_{l}=F_{l}(x_{l})+h(x_{l}) \tag{1}$$
$$x_{l+1}=f(y_{l}) \tag{2}$$
where $f$ and $h$ are identity mappings, i.e.
$$x_{l+1}=y_{l}=x_{l}+F_{l}(x_{l}) \tag{3}$$
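Eq. (3) says that with identity mappings for both $f$ and $h$, a block simply adds its residual branch to its input. A minimal NumPy sketch (the ReLU-then-linear $F_l$ is an arbitrary stand-in, not the paper's actual conv block):

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W):
    # Identity-mapping residual block, Eq. (3): x_{l+1} = x_l + F_l(x_l).
    # The residual branch F_l here is a toy ReLU-then-linear map (an
    # assumption for illustration, not the paper's conv block).
    return x + np.maximum(x, 0.0) @ W

x = rng.standard_normal(4)
W = 0.1 * rng.standard_normal((4, 4))
out = residual_block(x, W)

# The identity path passes x through unchanged; only F_l(x) is added on top.
assert np.allclose(out - x, np.maximum(x, 0.0) @ W)
```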
where $x_{l}=x_{l-1}+F_{l-1}(x_{l-1})$, i.e.,
$$x_{l+1}=x_{l-1}+F_{l}(x_{l})+F_{l-1}(x_{l-1}) \tag{4}$$
It follows that
$$x_{L}=x_{l}+\sum_{i=l}^{L-1}F_{i}(x_{i}) \tag{5}$$
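Unrolling the recursion makes Eq. (5) a plain sum of residual-branch outputs, which is easy to check numerically (the toy $F_i$ below are assumed stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
L = 5
Ws = [0.1 * rng.standard_normal((4, 4)) for _ in range(L)]

def F(i, x):
    # Toy residual branch F_i (stand-in for the real conv block).
    return np.maximum(x, 0.0) @ Ws[i]

x = rng.standard_normal(4)
xs = [x]                                  # xs[i] holds x_i
residuals = []
for i in range(L):
    residuals.append(F(i, xs[-1]))
    xs.append(xs[-1] + residuals[-1])     # x_{i+1} = x_i + F_i(x_i)

# Eq. (5): x_L = x_l + sum_{i=l}^{L-1} F_i(x_i), checked here for l = 0
assert np.allclose(xs[L], xs[0] + sum(residuals))
```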
Backpropagation:
$$\frac{\partial\mathcal{L}}{\partial x_{l}}=\frac{\partial\mathcal{L}}{\partial x_{L}}\frac{\partial x_{L}}{\partial x_{l}}=\frac{\partial\mathcal{L}}{\partial x_{L}}\left(1+\frac{\partial}{\partial x_{l}}\sum_{i=l}^{L-1}F_{i}(x_{i})\right) \tag{6}$$
Decouple ensemble network outputs
$$p^{c}=\sum_{k}w_{k}^{c}\cdot \sum_{i,j}y_{L}^{(k)}(i,j) \tag{7}$$
where $p^{c}$ is the output score for class $c$ and $(i,j)$ indexes spatial locations. $\boldsymbol{w}^{c}=[w_{1}^{c},\cdots,w_{k}^{c},\cdots]^{T}$ is the $c$-th column of the fully connected layer's weight matrix, and $y_{L}^{(k)}$ is the $k$-th feature map of the last residual block.
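Eq. (7) is just global sum pooling over each feature map followed by a fully connected layer; a sketch with random tensors (the shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
K, H, W, C = 8, 7, 7, 3                 # feature maps, spatial size, classes (assumed)
y_L = rng.standard_normal((K, H, W))    # y_L: feature maps of the last residual block
w = rng.standard_normal((K, C))         # FC weights; column c is w^c

# Eq. (7): p^c = sum_k w_k^c * sum_{i,j} y_L^{(k)}(i, j)
p = w.T @ y_L.sum(axis=(1, 2))

# The same computation viewed as global sum pooling then a linear layer
assert np.allclose(p, y_L.sum(axis=(1, 2)) @ w)
```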
Substituting Eq. (5) into Eq. (7) gives:
$$p^{c}=\sum_{i,j}\boldsymbol{w}^{c}\cdot y_{L}=\sum_{i,j}\boldsymbol{w}^{c}\left(y_{1}+\sum_{m=1}^{L-1}F_{m}(y_{m})\right) \tag{8}$$
The paper points out a drawback of ResNet: "Using a single weighting function in the classification module is suboptimal in this situation. This is because the outputs of all ensembles share classifiers such that the importance of their individual features are undermined."
To address this issue, they propose to decouple the ensemble outputs and apply classifiers to them individually by using:
$$p^{c}=\sum_{i,j}\left(\boldsymbol{w}_{1}^{c}\cdot y_{1}+\sum_{m=1}^{L-1}\boldsymbol{w}_{m+1}^{c}\cdot F_{m}(y_{m})\right) \tag{9}$$
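The decoupling in Eq. (9) gives each ensemble output its own classifier weights instead of one shared $\boldsymbol{w}^{c}$. A sketch, with the ensemble outputs $F_m(y_m)$ replaced by random tensors (assumed shapes):

```python
import numpy as np

rng = np.random.default_rng(3)
L, K, H, W, C = 4, 8, 7, 7, 3
y1 = rng.standard_normal((K, H, W))                          # first block's output
Fs = [rng.standard_normal((K, H, W)) for _ in range(L - 1)]  # stand-ins for F_m(y_m)
ws = rng.standard_normal((L, K, C))                          # one classifier per ensemble member

# Eq. (9): each ensemble output is pooled and classified by its own weights
p = y1.sum(axis=(1, 2)) @ ws[0]
for m, Fm in enumerate(Fs, start=1):
    p = p + Fm.sum(axis=(1, 2)) @ ws[m]

# Equivalent vectorized form over all L pooled ensemble outputs
pooled = np.stack([y1.sum(axis=(1, 2))] + [F.sum(axis=(1, 2)) for F in Fs])
assert np.allclose(p, np.einsum('lk,lkc->c', pooled, ws))
```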
They also propose a new skip connection, defined as:
$$y_{l+1}=F_{l}(y_{l})\otimes y_{l} \tag{10}$$
where $\otimes$ is the concatenation operation. They call this skip-connection scheme ensemble-connection.
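Because Eq. (10) concatenates along the channel axis instead of adding, every earlier output is preserved and the channel count grows with depth. A sketch with a toy residual branch (the fixed channel mix is my assumption, not the paper's block):

```python
import numpy as np

rng = np.random.default_rng(4)

def F(y, out_ch=4):
    # Toy residual branch: ReLU then a fixed channel mix (an assumption;
    # the real F_l is a conv block). Always emits out_ch channels.
    mix = np.ones((y.shape[0], out_ch)) / y.shape[0]
    return np.einsum('chw,co->ohw', np.maximum(y, 0.0), mix)

y = rng.standard_normal((4, 7, 7))
for _ in range(3):
    # Eq. (10): y_{l+1} = F_l(y_l) ⊗ y_l, with ⊗ = channel concatenation
    y = np.concatenate([F(y), y], axis=0)

# Unlike addition, concatenation keeps every earlier output: 4 -> 8 -> 12 -> 16 channels
assert y.shape == (16, 7, 7)
```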
Language model
$$\log p(\boldsymbol{x}_{0:T}\mid I;\theta_{L})=\sum_{t=0}^{T}\log p(\boldsymbol{x}_{t}\mid I,\boldsymbol{x}_{0:t-1}; \theta_{L}) \tag{11}$$
where $\{\boldsymbol{x}_{0},\cdots,\boldsymbol{x}_{T}\}$ are the words of a sentence and $\theta_{L}$ are the parameters of the LSTM.
$$\boldsymbol{h}_{t}=\mathrm{LSTM}(E\boldsymbol{x}_{t-1},\boldsymbol{h}_{t-1},\boldsymbol{z}_{t}) \tag{12}$$
where $E$ is the word embedding matrix and $\boldsymbol{z}_{t}$ is a context vector.
$$\boldsymbol{z}_{t}=\boldsymbol{a}_{t}\,\mathcal{C}(I)^{T} \tag{13}$$
where $\mathcal{C}(I)$ denotes the convolutional feature maps produced by the image model (i.e., $y_{L}$).
$$\boldsymbol{a}_{t}=\mathrm{softmax}\left(W_{att}\tanh(W_{h}\boldsymbol{h}_{t-1}+\boldsymbol{c})\right),\qquad \boldsymbol{c}=(\boldsymbol{w}^{c})^{T}\mathcal{C}(I) \tag{14}$$
where $W_{att}$ and $W_{h}$ are learned embedding matrices, and $\boldsymbol{c}$ is the convolutional feature embedding obtained through $\boldsymbol{w}^{c}$.
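Eqs. (13)–(14) form a standard soft-attention step: score each spatial location, normalize with softmax, then take the weighted sum of features. A sketch with assumed shapes (flattening the $(i,j)$ grid into $N$ locations is my convention):

```python
import numpy as np

rng = np.random.default_rng(5)
K, N, D = 8, 49, 16                  # feature maps, spatial locations, hidden size (assumed)
C_I = rng.standard_normal((K, N))    # C(I): conv feature maps, (i, j) flattened into N
w_c = rng.standard_normal(K)         # classifier column w^c
W_h = rng.standard_normal((N, D))    # learned embedding matrices
W_att = rng.standard_normal((N, N))
h_prev = rng.standard_normal(D)      # previous LSTM hidden state h_{t-1}

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Eq. (14): c = (w^c)^T C(I), then attention scores over the N locations
c = w_c @ C_I
a_t = softmax(W_att @ np.tanh(W_h @ h_prev + c))

# Eq. (13): context vector z_t = a_t C(I)^T, a weighted sum over locations
z_t = C_I @ a_t
assert np.isclose(a_t.sum(), 1.0) and z_t.shape == (K,)
```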