RNN (Part 2): Forward Pass and BPTT
Tags (space-separated): RNN BPTT
basic definition
To simplify notation, the RNN here contains only one input layer, one hidden layer and one output layer. The notation is listed below:
neural layer | node | index | number
---|---|---|---
input layer | x(t) | i | N
previous hidden layer | s(t-1) | h | M
hidden layer | s(t) | j | M
output layer | y(t) | k | O
input->hidden | V(t) | i,j | N->M
previous hidden->hidden | U(t) | h,j | M->M
hidden->output | W(t) | j,k | M->O
In addition, P is the total number of available training samples, which are indexed by l.
forward
1. input->hidden
   $s_{lj}(t) = f\big(\mathrm{net}_{lj}(t)\big) = f\Big(\sum_i^{N} v_{ji}\, x_{li}(t) + \sum_h^{M} u_{jh}\, s_{lh}(t-1)\Big)$
2. hidden->output
   $y_{lk}(t) = g\big(\mathrm{net}_{lk}(t)\big) = g\Big(\sum_j^{M} w_{kj}\, s_{lj}(t)\Big)$
f and g are the activation functions of the hidden layer and the output layer, respectively.
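As a rough NumPy sketch of one forward step under the notation above (the sizes, the random initialisation and the names `V`, `U`, `W`, `x_t`, `s_prev` are made up for illustration, with a sigmoid hidden layer and a softmax output):

```python
import numpy as np

N, M, O = 4, 3, 2                        # example sizes for input, hidden and output layers

rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(M, N))   # input -> hidden weights v_ji
U = rng.normal(scale=0.1, size=(M, M))   # previous hidden -> hidden weights u_jh
W = rng.normal(scale=0.1, size=(O, M))   # hidden -> output weights w_kj

def sigmoid(net):                        # f, the hidden activation
    return 1.0 / (1.0 + np.exp(-net))

def softmax(net):                        # g, the output activation
    e = np.exp(net - net.max())          # shifted for numerical stability
    return e / e.sum()

def forward(x_t, s_prev):
    net_j = V @ x_t + U @ s_prev         # net_j(t) = sum_i v_ji x_i(t) + sum_h u_jh s_h(t-1)
    s_t = sigmoid(net_j)                 # s_j(t) = f(net_j(t))
    net_k = W @ s_t                      # net_k(t) = sum_j w_kj s_j(t)
    y_t = softmax(net_k)                 # y_k(t) = g(net_k(t))
    return s_t, y_t

x_t, s_prev = rng.normal(size=N), np.zeros(M)
s_t, y_t = forward(x_t, s_prev)          # one step: new hidden state and prediction
```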
backpropagation
prerequisite
Any network structure can be trained with backpropagation when desired output patterns exist and each function that has been used to calculate the actual output patterns is differentiable.
cost function
1. summed squared error (SSE)
The cost function can be any differentiable function that measures the loss of the predicted values against the gold answers. SSE is frequently used and works well for training conventional feed-forward neural networks.
2. cross entropy (CE)
The cross-entropy loss is used in Recurrent Neural Network Language Models (RNNLM) and performs well.
Discussion below is based on SSE.
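Written out in the notation above (with $d_{lk}$ the desired output of node k for sample l, and the conventional factor $\frac{1}{2}$ that makes the SSE derivative below come out cleanly), the two losses are:

$$C_{\mathrm{SSE}} = \frac{1}{2}\sum_l^{P}\sum_k^{O}\big(d_{lk} - y_{lk}\big)^2 \qquad\qquad C_{\mathrm{CE}} = -\sum_l^{P}\sum_k^{O} d_{lk}\,\log y_{lk}$$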
error component
- error for output nodes
  $\delta_{lk} = -\frac{\partial C}{\partial \mathrm{net}_{lk}} = -\frac{\partial C}{\partial y_{lk}}\frac{\partial y_{lk}}{\partial \mathrm{net}_{lk}} = (d_{lk} - y_{lk})\, g'(\mathrm{net}_{lk})$
- error for hidden nodes
  $\delta_{lj} = -\left(\sum_k^{O} \frac{\partial C}{\partial y_{lk}}\frac{\partial y_{lk}}{\partial \mathrm{net}_{lk}}\frac{\partial \mathrm{net}_{lk}}{\partial s_{lj}}\right)\frac{\partial s_{lj}}{\partial \mathrm{net}_{lj}} = \sum_k^{O} \delta_{lk}\, w_{kj}\, f'(\mathrm{net}_{lj})$
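In code, and anticipating the derivatives worked out in the next subsection ($f'(\mathrm{net}_j) = s_j(1-s_j)$ for the sigmoid, $g'(\mathrm{net}_k) = y_k(1-y_k)$ for the per-node softmax derivative), the two deltas for one sample might be computed roughly as follows (the variable names and toy numbers are made up):

```python
import numpy as np

def output_deltas(d, y):
    # delta_lk = (d_lk - y_lk) * g'(net_lk), with g'(net_k) = y_k (1 - y_k)
    return (d - y) * y * (1.0 - y)

def hidden_deltas(delta_k, W, s):
    # delta_lj = (sum_k delta_lk w_kj) * f'(net_lj), with f'(net_j) = s_j (1 - s_j)
    return (W.T @ delta_k) * s * (1.0 - s)

# toy values: M = 3 hidden nodes, O = 2 output nodes
W = np.array([[0.1, -0.2, 0.3],
              [0.0,  0.4, -0.1]])        # w_kj, shape (O, M)
s = np.array([0.5, 0.2, 0.8])            # hidden activations s_lj
y = np.array([0.7, 0.3])                 # predicted outputs y_lk
d = np.array([1.0, 0.0])                 # desired outputs d_lk

delta_k = output_deltas(d, y)            # errors at the output nodes
delta_j = hidden_deltas(delta_k, W, s)   # errors at the hidden nodes
```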
activation function
- sigmoid
  $f(\mathrm{net}) = \frac{1}{1 + e^{-\mathrm{net}}}$
  $f'(\mathrm{net}) = f(\mathrm{net})\,\big(1 - f(\mathrm{net})\big)$
- softmax
  $g(\mathrm{net}_k) = \frac{e^{\mathrm{net}_k}}{\sum_{k'}^{O} e^{\mathrm{net}_{k'}}}$
  $g'(\mathrm{net}_k) = \frac{e^{\mathrm{net}_k}\left(\sum_j^{O} e^{\mathrm{net}_j} - e^{\mathrm{net}_k}\right)}{\left(\sum_j^{O} e^{\mathrm{net}_j}\right)^2}$
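A standalone sketch of these four formulas, with a finite-difference check of the sigmoid derivative (the test point and tolerance are arbitrary):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_prime(net):
    f = sigmoid(net)
    return f * (1.0 - f)                        # f'(net) = f(net) (1 - f(net))

def softmax(net):
    e = np.exp(net - net.max())                 # the shift does not change the result
    return e / e.sum()

def softmax_prime_diag(net):
    e = np.exp(net - net.max())
    return e * (e.sum() - e) / e.sum() ** 2     # g'(net_k), the per-node (diagonal) derivative

# finite-difference check of f'(net) at an arbitrary point
net, eps = 0.3, 1e-6
numeric = (sigmoid(net + eps) - sigmoid(net - eps)) / (2 * eps)
print(np.isclose(sigmoid_prime(net), numeric))  # True
```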
gradient descent
According to gradient descent, each weight change in the network should be proportional to the negative gradient of the cost function with respect to the specific weight:

$\Delta w = -\eta\, \frac{\partial C}{\partial w}$

where η is the learning rate.
1. hidden->output
   $\Delta w_{kj} = \eta \sum_l^{P} \delta_{lk}\, s_{lj}$
2. input->hidden
   $\Delta v_{ji} = \eta \sum_l^{P} \delta_{lj}\, x_{li}$
3. previous hidden->hidden
   $\Delta u_{jh} = \eta \sum_l^{P} \delta_{lj}\, s_{lh}(t-1)$
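For a single training sample, each of the three updates is an outer product between a delta vector and the inputs of the corresponding layer; a rough sketch (the function and variable names are made up):

```python
import numpy as np

def sgd_step(eta, delta_k, delta_j, s_t, s_prev, x_t, W, V, U):
    W_new = W + eta * np.outer(delta_k, s_t)      # hidden -> output:          w_kj += eta * delta_lk * s_lj
    V_new = V + eta * np.outer(delta_j, x_t)      # input -> hidden:           v_ji += eta * delta_lj * x_li
    U_new = U + eta * np.outer(delta_j, s_prev)   # previous hidden -> hidden: u_jh += eta * delta_lj * s_lh(t-1)
    return W_new, V_new, U_new
```

Summing these outer products over the P samples (and, after unfolding, over the T time steps) gives the batch updates listed in the summary below.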
unfolding
In a recurrent neural network, errors can be propagated further back, i.e. through more than two layers, in order to capture longer history information. This process is usually called unfolding.
In an unfolded RNN, the recurrent weight is duplicated spatially for an arbitrary number of time steps, here referred to as T.

The error for the hidden nodes is then propagated back through time as:

$\delta_{lj}(t-1) = \sum_h^{M} \delta_{lh}(t)\, u_{hj}\, f'\big(s_{lj}(t-1)\big)$
where h is the index for the hidden node at time step t, and j for the hidden node at time step t-1.
Note: the original paper uses $s_{lj}(t-1)$ here; I feel it should be $\mathrm{net}_{lj}(t-1)$, but that notation is also hard to explain, since the subscript corresponding to time step t is h, not j.
After all error deltas have been obtained, the unfolded weights are folded back, with the deltas added up into one big change for each weight.
1. input->hidden
   $\Delta v_{ji} = \eta \sum_z^{T} \sum_l^{P} \delta_{lj}(t-z)\, x_{li}(t-z)$
2. previous hidden->hidden
   $\Delta u_{jh} = \eta \sum_z^{T} \sum_l^{P} \delta_{lj}(t-z)\, s_{lh}(t-1-z)$
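Putting the pieces together, here is a minimal end-to-end BPTT sketch for one sequence (a toy illustration, not the exact setup of the referenced papers: the output layer is kept linear so that $\delta_{lk} = d_{lk} - y_{lk}$, and the data, sizes and truncation depth T are made up):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def bptt_step(xs, d, V, U, W, eta=0.1, T=3):
    """One weight update from a sequence xs = [x(0), ..., x(t)] with target d at the
    last step, unfolding the recurrent weights for at most T time steps back."""
    M = U.shape[0]
    s = [np.zeros(M)]                            # s[i+1] is the hidden state after input xs[i]
    for x in xs:                                 # forward pass through the whole sequence
        s.append(sigmoid(V @ x + U @ s[-1]))
    y = W @ s[-1]                                # linear output for simplicity (g = identity)

    delta_k = d - y                              # output error (g' = 1 for the identity output)
    dW = np.outer(delta_k, s[-1])

    dV, dU = np.zeros_like(V), np.zeros_like(U)
    delta_j = (W.T @ delta_k) * s[-1] * (1.0 - s[-1])    # error at the last hidden layer
    for z in range(min(T, len(xs))):                     # propagate the error back through time
        t = len(xs) - 1 - z
        dV += np.outer(delta_j, xs[t])                   # accumulate changes for v_ji
        dU += np.outer(delta_j, s[t])                    # s[t] is the hidden state at step t-1
        delta_j = (U.T @ delta_j) * s[t] * (1.0 - s[t])  # delta one time step earlier

    return V + eta * dV, U + eta * dU, W + eta * dW      # fold the accumulated changes back

# toy usage
rng = np.random.default_rng(0)
N, M, O = 4, 3, 2
V = rng.normal(scale=0.1, size=(M, N))
U = rng.normal(scale=0.1, size=(M, M))
W = rng.normal(scale=0.1, size=(O, M))
xs = [rng.normal(size=N) for _ in range(5)]
d = np.array([1.0, 0.0])
V, U, W = bptt_step(xs, d, V, U, W)
```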
summary
- input->hidden
  $v_{ji}(t+1) = v_{ji}(t) + \eta \sum_z^{T} \sum_l^{P} \delta_{lj}(t-z)\, x_{li}(t-z)$
- previous hidden->hidden
  $u_{jh}(t+1) = u_{jh}(t) + \eta \sum_z^{T} \sum_l^{P} \delta_{lj}(t-z)\, s_{lh}(t-1-z)$
- hidden->output
  $w_{kj}(t+1) = w_{kj}(t) + \eta \sum_l^{P} \delta_{lk}\, s_{lj}$
references
- Backpropagation Through Time
- A guide to recurrent neural networks and backpropagation