A machine-learning beginner's non-rigorous mathematical derivation of how RNNs work (1)
—— SKYWALKER2099@CSDN 20230410
Before everything:
When I wanted to truly understand how an RNN works, writing one by hand was clearly the best approach. But most pure-numpy derivations online either lack a complete accompanying explanation, or present the math in a way that seems understandable on the surface while leaving it unclear how to actually implement the underlying mechanics from it.
While searching for material I found the code rnn_lstm_from_scratch, which seems quite detailed, but likewise I could not fully follow its math. [Note: in that code the meanings of V and W are swapped relative to most diagrams and to this article, so read carefully.] So, combining ideas from various sources, I try to lay out its mathematics clearly here.
What follows is a deliberately informal account of the basic math of an RNN, i.e. its underlying computational logic. (The mathematical notation is not very formal; it is meant to convey the idea.)
————
The loss function (ignoring any regularization term) is:
$$L = \sum_{t=1}^{T} L^{(t)} \tag{1}$$
$$L^{(t)} = -\sum_{i=1}^{C} p_i \log(q_i) \tag{2}$$
where $C$ is the number of classes, $p_i$ is the true distribution, and $q_i$ is the prediction.
For example:
TRUE: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0], PRED: [0.1, 0.6, 0.3, 0, 0, 0, 0, 0, 0, 0]
Then the cross-entropy is $-\ln(0.6) \approx 0.51$.
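As a sanity check, the cross-entropy above can be computed in a few lines of numpy (the arrays are the TRUE/PRED vectors from the example; masking the zero entries of p avoids log(0)):

```python
import numpy as np

# True one-hot distribution and predicted distribution from the example above
p = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
q = np.array([0.1, 0.6, 0.3, 0, 0, 0, 0, 0, 0, 0], dtype=float)

# Cross-entropy: -sum_i p_i * ln(q_i); mask out p_i = 0 to avoid log(0)
mask = p > 0
loss = -np.sum(p[mask] * np.log(q[mask]))
print(round(loss, 2))  # 0.51
```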
1. Forward pass
To make the derivation concrete I make the following assumption: this input is a 4×2×1 array (matching how the example program lays it out), i.e. four samples, each described by two features. For example, for the string abba, I want the network to learn the relationship between the letters; a training set can contain many such inputs of varying length (say x×2×1).
A parenthesized superscript $t$ marks the quantities produced when the $t$-th sample is fed in, e.g. $a^{(3)}$.
input (four samples (characters), each represented by two features):
$$\begin{pmatrix} [1,0]^T,[0,1]^T,[0,1]^T,[1,0]^T \end{pmatrix} \tag{3.1}$$
that is,
$$\begin{pmatrix} 1&0&0&1\\ 0&1&1&0 \end{pmatrix}$$
In symbols:
$$x^{(t)}=\begin{pmatrix} [x_{0}^{(0)},x_{1}^{(0)}]^T,\,...\,,[x_{0}^{(t)},x_{1}^{(t)}]^T,\,[x_{0}^{(3)},x_{1}^{(3)}]^T \end{pmatrix} \quad (0 \le t \le 3) \tag{3.2}$$
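A small sketch of how the "abba" input in (3.1) could be built as one-hot columns (the vocab mapping {a: 0, b: 1} is my assumption, not taken from the original code):

```python
import numpy as np

# Hypothetical vocabulary: map each character to a one-hot row index
vocab = {"a": 0, "b": 1}
seq = "abba"

# Rows are features, columns are time steps: shape (vocab_size, seq_len) = (2, 4)
X = np.zeros((len(vocab), len(seq)))
for t, ch in enumerate(seq):
    X[vocab[ch], t] = 1.0

print(X)
# [[1. 0. 0. 1.]
#  [0. 1. 1. 0.]]
```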
The training process begins with the forward pass: each sample in the input (here, each letter) is fed into the network in turn; each step produces an output and a hidden-layer state, and the hidden state is fed into the next step (as in the standard unrolled-RNN diagram). Each output $o^{(t)}$, where $t$ is at most 3 here (0, 1, 2, 3), yields a corresponding loss $L^{(t)}$.
The hidden state is given by $HOUT^{(t)} = f(Ux^{(t)} + Ws^{(t-1)})$, and we set the hidden layer's width to 3. Since $Ux^{(t)}$ is a (?×2)·(2×1) product [this denotes np.dot, i.e. U times $x^{(t)}$] [note: np.dot only computes an inner product when both arguments are 1-D vectors, i.e. like $a \cdot b^T$] and must produce a vector of width 3, $U$ must be 3×2 (that is, hidden_size × vocab_size, where vocab_size is the feature dimension of each letter). Assume $U$ takes the value:
$$U =\begin{pmatrix} 0.5&0.6&0.7\\ 0.1&0.2&0.0 \end{pmatrix}^T =\begin{pmatrix} 0.5&0.1\\ 0.6&0.2\\ 0.7&0.0 \end{pmatrix} \tag{3.3}$$
In symbols:
$$U =\begin{pmatrix} U_{00}^{(t)}&U_{01}^{(t)}\\ U_{10}^{(t)}&U_{11}^{(t)}\\ U_{20}^{(t)}&U_{21}^{(t)} \end{pmatrix} \tag{3.4}$$
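The shape bookkeeping for $Ux^{(t)}$ can be checked directly with np.dot, using the assumed U from (3.3) and $x^{(3)} = [1,0]^T$:

```python
import numpy as np

U = np.array([[0.5, 0.1],
              [0.6, 0.2],
              [0.7, 0.0]])       # (3, 2): hidden_size x vocab_size
x3 = np.array([[1.0], [0.0]])    # (2, 1): x^{(3)} = [1, 0]^T

out = np.dot(U, x3)              # (3, 2) @ (2, 1) -> (3, 1)
print(out.shape)                 # (3, 1)
print(out.ravel())               # [0.5 0.6 0.7]
```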
Now consider the output at the last step, i.e. $t=3$ (with $0 \le t \le 3$):
$$HINX^{(t)} = U \cdot x^{(t)} =\begin{pmatrix} HINX_{00}^{(t)}\\ HINX_{10}^{(t)}\\ HINX_{20}^{(t)} \end{pmatrix} =\begin{pmatrix} 0.5\\ 0.6\\ 0.7 \end{pmatrix}\Bigg|_{t=3} \tag{3.5}$$
Since $HINX$, $HINW$, $HIN$, and $HOUT$ must all have the same shape, they are all 3×1 (hidden_size × 1).
$$HINW^{(t)} = W \cdot HOUT^{(t-1)}$$
from which $W$ must be 3×3 (hidden_size × hidden_size). Assume $W$ is:
$$W =\begin{pmatrix} 0.5&0.2&0.1\\ 0.5&0.3&0.2\\ 0.4&0.3&0.4 \end{pmatrix} \tag{3.6}$$
In symbols:
$$W =\begin{pmatrix} W_{00}^{(t)}&W_{01}^{(t)}&W_{02}^{(t)}\\ W_{10}^{(t)}&W_{11}^{(t)}&W_{12}^{(t)}\\ W_{20}^{(t)}&W_{21}^{(t)}&W_{22}^{(t)} \end{pmatrix} \tag{3.7}$$
and assume
$$HOUT^{(t-1)} =\begin{pmatrix} 0.1\\ 0.2\\ 0.3 \end{pmatrix}\Bigg|_{t=3} \tag{3.8}$$
(for simplicity we look only at the final round, so $t=3$).
Then:
$$HINW^{(t)} = W \cdot HOUT^{(t-1)} =\begin{pmatrix} 0.12\\ 0.17\\ 0.22 \end{pmatrix} =\begin{pmatrix} HINW_{00}^{(t)}\\ HINW_{10}^{(t)}\\ HINW_{20}^{(t)} \end{pmatrix}\Bigg|_{t=3} \tag{3.9}$$
So:
$$HIN^{(t)} = HINW^{(t)} + HINX^{(t)} = \begin{pmatrix} 0.62\\ 0.77\\ 0.92 \end{pmatrix} =\begin{pmatrix} HIN_{00}^{(t)}\\ HIN_{10}^{(t)}\\ HIN_{20}^{(t)} \end{pmatrix}\Bigg|_{t=3} \tag{3.10}$$
After the hidden layer's activation function we obtain what is usually written $s^{(t)}$, which this article calls $HOUT^{(t)}$:
$$s^{(t)} = HOUT^{(t)} = f(Ux^{(t)} + Ws^{(t-1)}) = f(HINX^{(t)} + HINW^{(t)})$$
Assume the activation function $f$ is tanh; then $HOUT^{(t)}$ is:
$$HOUT^{(t)} = \tanh(HIN^{(t)}) \approx \begin{pmatrix} 0.55\\ 0.65\\ 0.73 \end{pmatrix} =\begin{pmatrix} HOUT_{00}^{(t)}\\ HOUT_{10}^{(t)}\\ HOUT_{20}^{(t)} \end{pmatrix}\Bigg|_{t=3} \tag{3.11}$$
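Steps (3.5)-(3.11) taken together form one hidden-layer update; with the example values they can be reproduced directly:

```python
import numpy as np

U = np.array([[0.5, 0.1], [0.6, 0.2], [0.7, 0.0]])
W = np.array([[0.5, 0.2, 0.1], [0.5, 0.3, 0.2], [0.4, 0.3, 0.4]])
x3 = np.array([[1.0], [0.0]])          # x^{(3)} = [1, 0]^T
s2 = np.array([[0.1], [0.2], [0.3]])   # assumed HOUT^{(2)} from (3.8)

HINX = U @ x3          # (3.5):  [0.5, 0.6, 0.7]^T
HINW = W @ s2          # (3.9):  [0.12, 0.17, 0.22]^T
HIN = HINX + HINW      # (3.10): [0.62, 0.77, 0.92]^T
HOUT = np.tanh(HIN)    # (3.11): about [0.55, 0.65, 0.73]^T
```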
The output here needs two dimensions (one can define an output_dimension; since one-hot encoding is used, output_dimension = vocab_size), so the output should be a 2×1 matrix. Given that $HOUT$ is 3×1 and that
$$a^{(t)} = V \cdot HOUT^{(t)}$$
$V$ must be 2×3 (output_dimension × hidden_size, here equal to vocab_size × hidden_size). Assume $V$ is:
$$V =\begin{pmatrix} 0.1&0.3&0.2\\ 0.5&0.8&0.2 \end{pmatrix} \tag{3.12}$$
In symbols:
$$V =\begin{pmatrix} V_{00}^{(t)}&V_{01}^{(t)}&V_{02}^{(t)}\\ V_{10}^{(t)}&V_{11}^{(t)}&V_{12}^{(t)} \end{pmatrix} \tag{3.13}$$
Then the
outputs:
$$a^{(t)} = V \cdot HOUT^{(t)} \approx \begin{pmatrix} 0.40\\ 0.94 \end{pmatrix}\Bigg|_{t=3} \tag{4.1}$$
In symbols:
$$a^{(t)}=\begin{pmatrix} a_{00}^{(t)}\\ a_{10}^{(t)} \end{pmatrix} \tag{4.2}$$
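The output layer at $t=3$ is one more matrix product on top of the hidden step; note that the exact digits depend on how much rounding is carried through the tanh values:

```python
import numpy as np

V = np.array([[0.1, 0.3, 0.2], [0.5, 0.8, 0.2]])    # (2, 3)
HOUT = np.tanh(np.array([[0.62], [0.77], [0.92]]))  # (3, 1), HIN^{(3)} from (3.10)

a3 = V @ HOUT   # (2, 1), roughly [0.39-0.40, 0.94]^T
print(a3.ravel())
```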
For the later calculations, assume the outputs for $t<3$ are:
$$a^{(0)} = \begin{pmatrix} 0.21\\ 1.00 \end{pmatrix} \quad a^{(1)} = \begin{pmatrix} 1.21\\ 0.50 \end{pmatrix} \quad a^{(2)} = \begin{pmatrix} 0.71\\ 0.90 \end{pmatrix} \tag{4.3}$$
Finally a softmax is applied (outputs after softmax):
softmax:
$$\mathrm{Softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
$$o^{(t)} = \mathrm{Softmax}(a^{(t)})$$
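A numerically stable softmax sketch (subtracting the max changes nothing mathematically but avoids overflow for large inputs), checked against $a^{(0)}$ from (4.3):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged
    e = np.exp(x - np.max(x))
    return e / e.sum()

o0 = softmax(np.array([0.21, 1.00]))  # a^{(0)} from (4.3)
print(o0.round(2))                    # [0.31 0.69]
```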
This expression yields a row/column vector (shown here as a row vector for readability).
$$o^{(0)}=\begin{pmatrix} 0.31, 0.69 \end{pmatrix} \tag{4.4}$$
$$o^{(1)}=\begin{pmatrix} 0.67, 0.33 \end{pmatrix} \tag{4.5}$$
$$o^{(2)}=\begin{pmatrix} 0.45, 0.55 \end{pmatrix} \tag{4.6}$$
$$o^{(3)}=\begin{pmatrix} 0.37, 0.63 \end{pmatrix} \tag{4.7}$$
In symbols:
$$o^{(t)}=\begin{pmatrix} o_{00}^{(t)}, o_{01}^{(t)} \end{pmatrix} \tag{5.1}$$
Assume the true labels to be predicted are:
targets:
$$\begin{pmatrix} [0,1]^T,[0,1]^T,[1,0]^T,[1,0]^T \end{pmatrix} \tag{5.2}$$
that is,
$$\begin{pmatrix} 0&0&1&1\\ 1&1&0&0 \end{pmatrix}$$
In symbols:
$$P=\begin{pmatrix} p_{0}^{(0)}&p_{0}^{(1)}&p_{0}^{(2)}&p_{0}^{(3)}\\ p_{1}^{(0)}&p_{1}^{(1)}&p_{1}^{(2)}&p_{1}^{(3)} \end{pmatrix} \tag{5.3}$$
In other words, for next-word prediction the target is simply the input shifted left by one position, with the last position representing EOS. (To keep the arithmetic small I used only two dimensions here, so the one-hot code [1,0] has to stand for both 'a' and EOS, which is obviously not right, but it serves for illustration.)
The loss function is the cross-entropy from the beginning (taken with base $e$ to simplify later calculations). Assume the cross-entropies computed in the first three rounds are:
$$L^{(0)} = -(0 + 1\cdot\ln(0.69)) \approx 0.37 \tag{6}$$
$$L^{(1)} = -(0 + 1\cdot\ln(0.33)) \approx 1.11 \tag{7}$$
$$L^{(2)} = -(1\cdot\ln(0.45) + 0) \approx 0.80 \tag{8}$$
The cross-entropy at $t=3$ is
$$L^{(3)} = -(1\cdot\ln(0.37) + 0) \approx 0.99 \tag{9}$$
With this forward pass complete, the total loss is
$$L = 0.37 + 1.11 + 0.80 + 0.99 = 3.27 \tag{10}$$
(here the upper index $T$ of the sum in (1) is 3).
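The whole forward pass of this section can be put together in one loop. Note that the text above assumed several intermediate values (HOUT^{(2)} in (3.8), a^{(0)}..a^{(2)} in (4.3)) instead of computing them from a zero initial state, so the per-step losses below will differ from (6)-(9); what this sketch illustrates is the structure of the computation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Example weights from (3.3), (3.6), (3.12)
U = np.array([[0.5, 0.1], [0.6, 0.2], [0.7, 0.0]])
W = np.array([[0.5, 0.2, 0.1], [0.5, 0.3, 0.2], [0.4, 0.3, 0.4]])
V = np.array([[0.1, 0.3, 0.2], [0.5, 0.8, 0.2]])

inputs = np.array([[1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)   # "abba", (3.1)
targets = np.array([[0, 0, 1, 1], [1, 1, 0, 0]], dtype=float)  # (5.2)

s = np.zeros((3, 1))   # initial hidden state
total_loss = 0.0
for t in range(inputs.shape[1]):
    x = inputs[:, [t]]                    # x^{(t)}, shape (2, 1)
    s = np.tanh(U @ x + W @ s)            # HOUT^{(t)} = f(Ux^{(t)} + Ws^{(t-1)})
    o = softmax(V @ s)                    # o^{(t)}, shape (2, 1)
    p = targets[:, [t]]                   # true distribution at step t
    total_loss += -float(np.sum(p * np.log(o)))  # cross-entropy L^{(t)}

print(round(total_loss, 2))
```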