RNN (Recurrent Neural Network) is a class of neural networks designed for processing sequential data.
Main reference for this article:
https://blog.csdn.net/zhaojc1995/article/details/80572098
1. RNN
The figure below shows the forward pass of a standard RNN. In it, $x$ is the input, $h$ is the hidden state, $o$ is the output, $L$ is the loss function, $y$ is the training label, and $t$ indexes the time step. $U$ holds the input-to-hidden weights, $W$ the weights from the previous hidden state to the current one, and $V$ the hidden-to-output weights. $U$, $W$, and $V$ are the model's parameters, i.e. the quantities updated iteratively during training.
1.1 RNN Forward Propagation
For time step $t$, the forward pass is computed as follows.
Hidden-layer pre-activation:
$$a^{(t)} = Ux^{(t)} + Wh^{(t-1)} + b$$
Hidden-layer post-activation:
$$h^{(t)} = g(a^{(t)}) = g(Ux^{(t)} + Wh^{(t-1)} + b)$$
where $g(\cdot)$ is the activation function and $b$ is the bias (in effect $b = b_a + b_h$).
Output pre-activation at time $t$:
$$o^{(t)} = Vh^{(t)} + c$$
Output post-activation at time $t$:
$$\hat{y}^{(t)} = \sigma(o^{(t)})$$
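To make these four equations concrete, here is a minimal NumPy sketch of one forward time step. The choice of tanh for $g$ and softmax for $\sigma$ is illustrative, as are all names and shapes; the derivation above does not fix them.

```python
import numpy as np

def rnn_forward_step(x_t, h_prev, U, W, V, b, c):
    """One forward step of a vanilla RNN.
    Shapes: x_t (n_in,), h_prev (n_h,), U (n_h, n_in),
            W (n_h, n_h), V (n_y, n_h), b (n_h,), c (n_y,)."""
    a_t = U @ x_t + W @ h_prev + b      # a^{(t)} = Ux^{(t)} + Wh^{(t-1)} + b
    h_t = np.tanh(a_t)                  # h^{(t)} = g(a^{(t)}), here g = tanh
    o_t = V @ h_t + c                   # o^{(t)} = Vh^{(t)} + c
    y_hat_t = np.exp(o_t - o_t.max())   # sigma = softmax (one common choice)
    y_hat_t /= y_hat_t.sum()
    return h_t, y_hat_t
```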
1.2 RNN Backward Propagation: BPTT
BPTT (back-propagation through time) is the standard method for training RNNs; it is essentially the ordinary BP algorithm applied to the network unrolled over time.
There are three parameters to optimize: $V$, $W$, and $U$.
Since $V$ does not take part in the propagation through time, its gradient is the simplest:
$$
\begin{aligned}
\frac{\partial L^{(t)}}{\partial V} &= \frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial o^{(t)}} \frac{\partial o^{(t)}}{\partial V} \\
&= \frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial o^{(t)}} \frac{\partial [Vh^{(t)} + c]}{\partial V} \\
&= L'^{(t)}(\hat{y}^{(t)})\,\sigma'(o^{(t)})\,h^{(t)}
\end{aligned}
$$
Similarly:
$$\frac{\partial L^{(t)}}{\partial c} = L'^{(t)}(\hat{y}^{(t)})\,\sigma'(o^{(t)})$$
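In code, these two output-layer gradients reduce to an outer product and a vector. A small sketch under the same shape assumptions as above, where `dLdo` stands for the common factor $L'^{(t)}(\hat{y}^{(t)})\,\sigma'(o^{(t)})$:

```python
import numpy as np

def output_layer_grads(dLdo, h_t):
    """Per-step gradients for V and c, given dLdo = dL^{(t)}/do^{(t)}."""
    dV = np.outer(dLdo, h_t)  # L'^{(t)}(y_hat) sigma'(o^{(t)}) h^{(t)}
    dc = dLdo                 # L'^{(t)}(y_hat) sigma'(o^{(t)})
    return dV, dc
```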
Next, take the derivative of $L^{(t)}$ with respect to $h^{(t)}$:
$$
\begin{aligned}
\frac{\partial L^{(t)}}{\partial h^{(t)}} &= \frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial o^{(t)}} \frac{\partial o^{(t)}}{\partial h^{(t)}} \\
&= \frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}} \frac{\partial \hat{y}^{(t)}}{\partial o^{(t)}} \frac{\partial [Vh^{(t)} + c]}{\partial h^{(t)}} \\
&= L'^{(t)}(\hat{y}^{(t)})\,\sigma'(o^{(t)})\,V
\end{aligned}
$$
Because $W$ and $U$ take part in the propagation through time, their gradients involve the whole history. For concreteness, first compute the partial derivative of $L^{(3)}$ with respect to $W$:
$$
\begin{aligned}
\frac{\partial L^{(3)}}{\partial W} &= \frac{\partial L^{(3)}}{\partial h^{(3)}} \frac{\partial h^{(3)}}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial W} + \frac{\partial L^{(3)}}{\partial h^{(3)}} \frac{\partial h^{(3)}}{\partial h^{(2)}} \frac{\partial h^{(2)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial W} + \frac{\partial L^{(3)}}{\partial h^{(3)}} \frac{\partial h^{(3)}}{\partial h^{(2)}} \frac{\partial h^{(2)}}{\partial h^{(1)}} \frac{\partial h^{(1)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial W} \\
&= \frac{\partial L^{(3)}}{\partial h^{(3)}} \frac{\partial h^{(3)}}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial W} + \frac{\partial L^{(3)}}{\partial h^{(3)}} \frac{\partial h^{(3)}}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial h^{(2)}} \frac{\partial h^{(2)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial W} \\
&\quad + \frac{\partial L^{(3)}}{\partial h^{(3)}} \frac{\partial h^{(3)}}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial h^{(2)}} \frac{\partial h^{(2)}}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial h^{(1)}} \frac{\partial h^{(1)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial W} \\
&= \frac{\partial L^{(3)}}{\partial h^{(3)}} g'(a^{(3)}) h^{(2)} + \frac{\partial L^{(3)}}{\partial h^{(3)}} g'(a^{(3)}) W g'(a^{(2)}) h^{(1)} + \frac{\partial L^{(3)}}{\partial h^{(3)}} g'(a^{(3)}) W g'(a^{(2)}) W g'(a^{(1)}) h^{(0)}
\end{aligned}
$$
Similarly:
$$\frac{\partial L^{(2)}}{\partial W} = \frac{\partial L^{(2)}}{\partial h^{(2)}} g'(a^{(2)}) h^{(1)} + \frac{\partial L^{(2)}}{\partial h^{(2)}} g'(a^{(2)}) W g'(a^{(1)}) h^{(0)}$$
$$\frac{\partial L^{(1)}}{\partial W} = \frac{\partial L^{(1)}}{\partial h^{(1)}} g'(a^{(1)}) h^{(0)}$$
Summing over all time steps gives:
$$\frac{\partial L}{\partial W} = \sum_{t=1}^{T} \frac{\partial L^{(t)}}{\partial W}$$
The same derivation yields the derivatives with respect to $U$:
$$\frac{\partial L^{(3)}}{\partial U} = \frac{\partial L^{(3)}}{\partial h^{(3)}} g'(a^{(3)}) x^{(3)} + \frac{\partial L^{(3)}}{\partial h^{(3)}} g'(a^{(3)}) W g'(a^{(2)}) x^{(2)} + \frac{\partial L^{(3)}}{\partial h^{(3)}} g'(a^{(3)}) W g'(a^{(2)}) W g'(a^{(1)}) x^{(1)}$$
$$\frac{\partial L^{(2)}}{\partial U} = \frac{\partial L^{(2)}}{\partial h^{(2)}} g'(a^{(2)}) x^{(2)} + \frac{\partial L^{(2)}}{\partial h^{(2)}} g'(a^{(2)}) W g'(a^{(1)}) x^{(1)}$$
$$\frac{\partial L^{(1)}}{\partial U} = \frac{\partial L^{(1)}}{\partial h^{(1)}} g'(a^{(1)}) x^{(1)}$$
$$\frac{\partial L}{\partial U} = \sum_{t=1}^{T} \frac{\partial L^{(t)}}{\partial U}$$
And for the bias $b$:
$$\frac{\partial L^{(3)}}{\partial b} = \frac{\partial L^{(3)}}{\partial h^{(3)}} g'(a^{(3)}) + \frac{\partial L^{(3)}}{\partial h^{(3)}} g'(a^{(3)}) W g'(a^{(2)}) + \frac{\partial L^{(3)}}{\partial h^{(3)}} g'(a^{(3)}) W g'(a^{(2)}) W g'(a^{(1)})$$
$$\frac{\partial L^{(2)}}{\partial b} = \frac{\partial L^{(2)}}{\partial h^{(2)}} g'(a^{(2)}) + \frac{\partial L^{(2)}}{\partial h^{(2)}} g'(a^{(2)}) W g'(a^{(1)})$$
$$\frac{\partial L^{(1)}}{\partial b} = \frac{\partial L^{(1)}}{\partial h^{(1)}} g'(a^{(1)})$$
$$\frac{\partial L}{\partial b} = \sum_{t=1}^{T} \frac{\partial L^{(t)}}{\partial b}$$
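Before turning to the full implementation, the summed formulas above can be sanity-checked numerically. The sketch below implements the forward pass and the per-step BPTT expansion for a tiny network, then compares $\partial L/\partial W$ against a finite difference. The concrete choices here (tanh as $g$, identity $\sigma$ with $V = I$, a squared-error loss, and all sizes) are made only for this check and are not prescribed by the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, T = 3, 4, 5
U = rng.normal(size=(n_h, n_in))
b = np.zeros(n_h)
W0 = rng.normal(size=(n_h, n_h)) * 0.1
xs = rng.normal(size=(T, n_in))
ys = rng.normal(size=(T, n_h))  # per-step targets (V = I, sigma = identity)

def loss_and_grad_W(W):
    # forward: cache pre-activations a^{(t)} and states h^{(t)}; h[0] = 0
    h, a = [np.zeros(n_h)], []
    for t in range(T):
        a.append(U @ xs[t] + W @ h[-1] + b)
        h.append(np.tanh(a[-1]))
    L = sum(0.5 * np.sum((h[t + 1] - ys[t]) ** 2) for t in range(T))
    # BPTT: dL/dW = sum_t dL^{(t)}/dW, each expanded back to the first step
    dW = np.zeros_like(W)
    for t in range(T):                         # loss term L^{(t)}
        dh = h[t + 1] - ys[t]                  # dL^{(t)}/dh^{(t)}
        for k in range(t, -1, -1):             # unroll back through time
            da = dh * (1 - np.tanh(a[k]) ** 2) # dh * g'(a^{(k)})
            dW += np.outer(da, h[k])           # ... g'(a^{(k)}) h^{(k-1)}
            dh = W.T @ da                      # propagate to h^{(k-1)}
    return L, dW

L0, dW = loss_and_grad_W(W0)
eps = 1e-6
Wp = W0.copy()
Wp[0, 1] += eps
# analytic vs. finite-difference gradient for one entry; should agree closely
print(dW[0, 1], (loss_and_grad_W(Wp)[0] - L0) / eps)
```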
2. Implementing an RNN in Python
The code below is adapted from: https://github.com/ddbourgin/numpy-ml/tree/master/numpy_ml/neural_nets/layers
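The listing assumes `numpy` as well as `Tanh` activation and `SGD` optimizer classes from that repository. To keep this post self-contained, here are minimal stand-ins with the call signatures the cell below relies on (sketches, not the numpy-ml originals):

```python
import numpy as np

class Tanh:
    """Minimal tanh activation with the interface the cell below expects."""
    def __call__(self, z):
        return np.tanh(z)

    def grad(self, z):
        # derivative of tanh, evaluated at the pre-activation z
        return 1.0 - np.tanh(z) ** 2

class SGD:
    """Minimal SGD optimizer, callable as optimizer(param, grad, name)."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def __call__(self, param, grad, name):
        return param - self.lr * grad
```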
```python
class RNNCell(object):
    """RNN cell: runs a single time step of the recurrence."""

    def __init__(self, n_out, act_fn="Tanh", optimizer=None):
        self.n_in = None
        self.n_out = n_out
        self.n_timesteps = None
        self.params = {
            "U": None,
            "W": None,
            "b": None
        }
        # act_fn is kept for interface compatibility; only Tanh is implemented here
        self.act_fn = Tanh()
        # if no optimizer is given, update() falls back to plain SGD
        self.optimizer = optimizer
        self.is_initialized = False

    def __str__(self):
        return 'RNNCell(n_in={}, n_out={})'.format(str(self.n_in), str(self.n_out))

    def __call__(self, X):
        return self.forward(X)

    def __init_params(self):
        """Initialize parameters lazily, once the input dimension is known."""
        self.X = []
        U = np.random.randn(self.n_in, self.n_out)
        W = np.random.randn(self.n_out, self.n_out)
        b = np.zeros((self.n_out, 1))
        self.params = {
            "U": U,
            "W": W,
            "b": b
        }
        # accumulated gradients, used for the parameter update
        self.gradients = {
            "U": np.zeros_like(U),
            "W": np.zeros_like(W),
            "b": np.zeros_like(b)
        }
        # intermediate results cached for the backward pass
        self.derived_variables = {
            "h": [],
            "a": [],
            "n_timesteps": 0,
            "current_step": 0,
            "dLdh_accumulator": None
        }
        self.is_initialized = True

    def forward(self, X):
        """
        Forward pass for one time step.
        :param X: numpy array of shape (n_ex, n_in)
        :return: hidden state h_t of shape (n_ex, n_out)
        """
        if self.is_initialized is False:
            self.n_in = X.shape[1]
            self.__init_params()
        self.derived_variables["n_timesteps"] += 1
        self.derived_variables["current_step"] += 1
        b = self.params['b']
        U = self.params['U']
        W = self.params['W']
        # hidden states; seed with h^{(0)} = 0 on the first step
        h_s = self.derived_variables['h']
        if 0 == len(h_s):
            n_ex, n_in = X.shape
            h_s.append(np.zeros((n_ex, self.n_out)))
        a_t = h_s[-1] @ W + X @ U + b.T  # a^{(t)} = W h^{(t-1)} + U x^{(t)} + b
        h_t = self.act_fn(a_t)           # h^{(t)} = g(a^{(t)})
        self.derived_variables['a'].append(a_t)
        self.derived_variables['h'].append(h_t)
        self.X.append(X)
        return h_t

    def backward(self, dLdh):
        """
        Backward pass: accumulate parameter gradients for one time step.
        :param dLdh: numpy array of shape (n_ex, n_out)
        :return: dX_t, the gradient with respect to this step's input
        """
        self.derived_variables['current_step'] -= 1
        a_s = self.derived_variables['a']
        h_s = self.derived_variables['h']
        t = self.derived_variables['current_step']
        dh_acc = self.derived_variables['dLdh_accumulator']
        if dh_acc is None:
            dh_acc = np.zeros_like(h_s[0])
        # fetch parameters
        U = self.params['U']
        W = self.params['W']
        # total gradient at h^{(t)}: the local loss term plus the part
        # propagated back from step t+1
        dh = dLdh + dh_acc
        da = self.act_fn.grad(a_s[t]) * dh
        dX_t = da @ U.T
        self.gradients['W'] += h_s[t].T @ da  # h_s[t] is h^{(t-1)}
        self.gradients['U'] += self.X[t].T @ da
        self.gradients['b'] += da.sum(axis=0, keepdims=True).T
        self.derived_variables['dLdh_accumulator'] = da @ W.T
        return dX_t

    def flush_gradients(self):
        """Clear accumulated gradients and cached values between iterations."""
        self.X = []
        self.gradients = {
            "U": np.zeros_like(self.params['U']),
            "W": np.zeros_like(self.params['W']),
            "b": np.zeros_like(self.params['b'])
        }
        self.derived_variables = {
            "h": [],
            "a": [],
            "n_timesteps": 0,
            "current_step": 0,
            "dLdh_accumulator": None
        }

    def update(self):
        """Apply the accumulated gradients to the parameters."""
        if self.optimizer is None:
            # plain SGD fallback with a fixed learning rate of 0.01
            self.params['W'] -= 0.01 * self.gradients['W']
            self.params['U'] -= 0.01 * self.gradients['U']
            self.params['b'] -= 0.01 * self.gradients['b']
        else:
            self.params['W'] = self.optimizer(self.params['W'], self.gradients['W'], 'W')
            self.params['U'] = self.optimizer(self.params['U'], self.gradients['U'], 'U')
            self.params['b'] = self.optimizer(self.params['b'], self.gradients['b'], 'b')


class RNN(object):
    """Chains RNNCell over the time dimension."""

    def __init__(self, n_out, act_fn="Tanh", optimizer=None):
        self.n_in = None
        self.n_out = n_out
        self.cell = RNNCell(
            n_out=self.n_out,
            act_fn=act_fn,
            optimizer=optimizer
        )

    def __str__(self):
        return "RNN"

    def __call__(self, X):
        return self.forward(X)

    def forward(self, X):
        """
        Forward pass over all time steps.
        :param X: numpy array of shape (n_ex, n_in, n_t), where n_ex is the number
                  of examples, n_in the number of features per example, and n_t
                  the number of time steps
        :return: numpy array of shape (n_ex, n_out, n_t)
        """
        Y = []
        n_ex, n_in, n_t = X.shape
        for t in range(n_t):
            yt = self.cell(X[:, :, t])
            Y.append(yt)
        return np.dstack(Y)

    def backward(self, dLdh):
        """
        Backward pass: walk the time steps in reverse and accumulate gradients.
        :param dLdh: numpy array of shape (n_ex, n_out, n_t)
        :return: gradient with respect to the input, shape (n_ex, n_in, n_t)
        """
        dLdX = []
        n_ex, n_out, n_t = dLdh.shape
        for t in reversed(range(n_t)):
            dLdXt = self.cell.backward(dLdh[:, :, t])
            dLdX.insert(0, dLdXt)
        return np.dstack(dLdX)

    def flush_gradients(self):
        self.cell.flush_gradients()

    def update(self):
        """Update parameters, then clear the cached computation."""
        self.cell.update()
        self.cell.flush_gradients()
```
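A quick smoke test of the layer on random data; the shapes, the seed, and the squared-error upstream gradient are all illustrative:

```python
if __name__ == "__main__":
    np.random.seed(42)
    n_ex, n_in, n_t, n_out = 8, 5, 10, 4
    rnn = RNN(n_out=n_out)             # no optimizer: plain SGD fallback
    X = np.random.randn(n_ex, n_in, n_t)
    target = np.random.randn(n_ex, n_out, n_t)
    for epoch in range(5):
        H = rnn(X)                     # (n_ex, n_out, n_t)
        loss = 0.5 * np.sum((H - target) ** 2)
        rnn.backward(H - target)       # upstream gradient of the squared error
        rnn.update()                   # update, then flush cached state
        print("epoch {}: loss={:.4f}".format(epoch, loss))
```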