[Show and Tell: A Neural Image Caption Generator](https://arxiv.org/pdf/1411.4555v1.pdf)
Overview
This post summarizes the NIC (Neural Image Caption) model, which couples a CNN with an LSTM to do, in effect, the grade-school exercise of describing a picture in words: the CNN extracts an image feature that is fed into the LSTM as the input at step $t = -1$, while each word $S_t$ of the description is converted to a one-hot vector and mapped through an embedding matrix before being fed into the LSTM.
Formulas
Training maximizes the likelihood of the correct caption given the image, over model parameters $\theta$:
$$\theta^{\star}=\arg \max _{\theta} \sum_{(I, S)} \log p(S \mid I ; \theta)$$
The log-likelihood of a caption decomposes over time steps by the chain rule:
$$\log p(S \mid I)=\sum_{t=0}^{N} \log p\left(S_{t} \mid I, S_{0}, \ldots, S_{t-1}\right)$$
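As a sanity check on this decomposition: the sentence log-likelihood is simply the sum of the per-step log-probabilities of the correct words. A minimal sketch, where the per-step probabilities are toy values rather than real model outputs:

```python
import math

def caption_log_likelihood(step_probs):
    """Sum of log p(S_t | I, S_0..S_{t-1}) over all time steps.

    `step_probs` is a hypothetical list holding the model's probability
    for the correct word at each step t = 0..N.
    """
    return sum(math.log(p) for p in step_probs)

# Toy three-word caption: summing logs equals the log of the product.
probs = [0.5, 0.25, 0.8]
ll = caption_log_likelihood(probs)  # = log(0.5 * 0.25 * 0.8) = log(0.1)
```

In practice this quantity is maximized (equivalently, its negative is minimized as a cross-entropy loss) with stochastic gradient descent.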
In the LSTM, the hidden state is updated recurrently:
$$h_{t+1}=f\left(h_{t}, x_{t}\right)$$
$$\begin{aligned} i_{t} &=\sigma\left(W_{ix} x_{t}+W_{im} m_{t-1}\right) \\ f_{t} &=\sigma\left(W_{fx} x_{t}+W_{fm} m_{t-1}\right) \\ o_{t} &=\sigma\left(W_{ox} x_{t}+W_{om} m_{t-1}\right) \\ c_{t} &=f_{t} \odot c_{t-1}+i_{t} \odot h\left(W_{cx} x_{t}+W_{cm} m_{t-1}\right) \\ m_{t} &=o_{t} \odot c_{t} \\ p_{t+1} &=\operatorname{Softmax}\left(m_{t}\right) \end{aligned}$$
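These gate updates can be sketched directly in NumPy. Everything below is illustrative: the dimensions and weight values are invented, `h` is taken to be tanh, and biases are omitted because the equations above omit them. In the paper $p_{t+1}=\operatorname{Softmax}(m_t)$ implicitly includes a projection to vocabulary size; this toy version applies the softmax to $m_t$ directly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

def lstm_step(x_t, m_prev, c_prev, W):
    # Input, forget, and output gates (biases omitted, as in the equations)
    i_t = sigmoid(W["ix"] @ x_t + W["im"] @ m_prev)
    f_t = sigmoid(W["fx"] @ x_t + W["fm"] @ m_prev)
    o_t = sigmoid(W["ox"] @ x_t + W["om"] @ m_prev)
    # Cell update: forget part of the old memory, write a new candidate
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["cm"] @ m_prev)
    m_t = o_t * c_t
    p_next = softmax(m_t)          # distribution over the next word
    return m_t, c_t, p_next

# Toy sizes: input dimension 4, hidden dimension 3
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((3, 4) if k.endswith("x") else (3, 3))
     for k in ["ix", "im", "fx", "fm", "ox", "om", "cx", "cm"]}
m, c, p = lstm_step(rng.standard_normal(4), np.zeros(3), np.zeros(3), W)
```

Note the recurrence runs through $m_{t-1}$, the gated memory output, rather than through a separate hidden vector.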
Inputs and outputs:
$$\begin{aligned} x_{-1} &=\operatorname{CNN}(I) \\ x_{t} &=W_{e} S_{t}, \quad t \in\{0 \ldots N-1\} \\ p_{t+1} &=\operatorname{LSTM}\left(x_{t}\right), \quad t \in\{0 \ldots N-1\} \end{aligned}$$
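Putting the input scheme together: the image feature enters exactly once, at $t=-1$, and every later input is the embedding of a caption word. A toy sketch with stand-in values (the feature vector, embedding matrix, and word indices are all invented for illustration; a real CNN would produce `cnn_feature`):

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 5                        # toy embedding size / vocabulary size

cnn_feature = rng.standard_normal(d)   # x_{-1} = CNN(I), stand-in feature
W_e = rng.standard_normal((d, vocab))  # word embedding matrix W_e
caption = [0, 3, 1]                    # toy word indices S_0 .. S_{N-1}

def one_hot(idx, n):
    v = np.zeros(n)
    v[idx] = 1.0
    return v

# LSTM input sequence: the image feature once, then x_t = W_e S_t
inputs = [cnn_feature] + [W_e @ one_hot(s, vocab) for s in caption]
```

Multiplying the embedding matrix by a one-hot vector just selects a column of $W_e$, so each word is mapped to a dense vector of the same size as the image feature before entering the LSTM.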