RNN (Part 2): Forward Pass and BPTT

Tags: RNN BPTT


basic definition

To simplify notation, the RNN here contains only one input layer, one hidden layer, and one output layer. The notation is listed below:

| neural layer            | node   | index | number |
|-------------------------|--------|-------|--------|
| input layer             | x(t)   | i     | N      |
| previous hidden layer   | s(t-1) | h     | M      |
| hidden layer            | s(t)   | j     | M      |
| output layer            | y(t)   | k     | O      |
| input->hidden           | V(t)   | j,i   | N->M   |
| previous hidden->hidden | U(t)   | j,h   | M->M   |
| hidden->output          | W(t)   | k,j   | M->O   |

In addition, P is the total number of available training samples, which are indexed by l.
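To make the notation concrete, here is a minimal NumPy sketch of how these quantities could be laid out; the sizes and variable names (N, M, O, V, U, W, theta_h, theta_o) are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

# Illustrative sizes (assumed): N input nodes, M hidden nodes, O output nodes
N, M, O = 4, 8, 3

rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(M, N))   # input -> hidden weights, v_{ji}
U = rng.normal(scale=0.1, size=(M, M))   # previous hidden -> hidden weights, u_{jh}
W = rng.normal(scale=0.1, size=(O, M))   # hidden -> output weights, w_{kj}
theta_h = np.zeros(M)                    # hidden-layer bias, theta_j
theta_o = np.zeros(O)                    # output-layer bias, theta_k

x_t = rng.normal(size=N)                 # input vector x(t)
s_prev = np.zeros(M)                     # previous hidden state s(t-1)
```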

forward

(Figure: RNN forward pass)
1. input->hidden

$net_j(t) = \sum_i^N x_i(t)\, v_{ji} + \sum_h^M s_h(t-1)\, u_{jh} + \theta_j$

$s_j(t) = f(net_j(t))$

2. hidden->output

$net_k(t) = \sum_j^M s_j(t)\, w_{kj} + \theta_k$

$y_k(t) = g(net_k(t))$

f and g are the activation functions of the hidden layer and the output layer, respectively.
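As a rough sketch, the two forward steps could be written in NumPy as follows, taking sigmoid for f and softmax for g (the choices discussed in the activation function subsection below); the function and variable names are mine, not from the original.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def softmax(net):
    e = np.exp(net - net.max())              # shift by the max for numerical stability
    return e / e.sum()

def rnn_forward_step(x_t, s_prev, V, U, W, theta_h, theta_o):
    """One forward step of the single-hidden-layer RNN described above."""
    net_h = V @ x_t + U @ s_prev + theta_h   # net_j(t)
    s_t = sigmoid(net_h)                     # s_j(t) = f(net_j(t))
    net_o = W @ s_t + theta_o                # net_k(t)
    y_t = softmax(net_o)                     # y_k(t) = g(net_k(t))
    return net_h, s_t, net_o, y_t
```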

backpropagation

prerequisite

Any network structure can be trained with backpropagation, provided that desired output patterns exist and every function used to compute the actual output patterns is differentiable.

cost function

1. summed squared error (SSE)
The cost function can be any differentiable function that measures the loss of the predicted values against the gold answers. SSE is frequently used and works well for training conventional feed-forward neural networks.

$C = \frac{1}{2} \sum_l^P \sum_k^O (d_{lk} - y_{lk})^2$

2. cross entropy (CE)
The cross-entropy loss is used in Recurrent Neural Network Language Models (RNNLM) and performs well.

$C = -\sum_l^P \sum_k^O \left[ d_{lk} \ln y_{lk} + (1 - d_{lk}) \ln(1 - y_{lk}) \right]$

Discussion below is based on SSE.
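Both cost functions are straightforward to express in NumPy. A small sketch, with d and y as P x O arrays of targets and predictions; the eps clamp in the cross-entropy is my own addition to avoid log(0):

```python
import numpy as np

def sse(d, y):
    """Summed squared error over all samples l and output nodes k."""
    return 0.5 * np.sum((d - y) ** 2)

def cross_entropy(d, y, eps=1e-12):
    """Cross-entropy loss summed over all samples and output nodes."""
    y = np.clip(y, eps, 1.0 - eps)           # guard against log(0)
    return -np.sum(d * np.log(y) + (1.0 - d) * np.log(1.0 - y))
```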

error component

  1. error for output nodes
    $\delta_{lk} = -\frac{\partial C}{\partial net_{lk}} = -\frac{\partial C}{\partial y_{lk}} \frac{\partial y_{lk}}{\partial net_{lk}} = (d_{lk} - y_{lk})\, g'(net_{lk})$
  2. error for hidden nodes
    $\delta_{lj} = \left( \sum_k^O -\frac{\partial C}{\partial y_{lk}} \frac{\partial y_{lk}}{\partial net_{lk}} \frac{\partial net_{lk}}{\partial s_{lj}} \right) \frac{\partial s_{lj}}{\partial net_{lj}} = \sum_k^O \delta_{lk}\, w_{kj}\, f'(net_{lj})$
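Using the sigmoid and softmax derivatives given in the next subsection (f'(net) = s(1 - s) and the diagonal softmax term y(1 - y)), the two error components for a single sample might be sketched in NumPy like this; the helper names are assumptions:

```python
import numpy as np

def output_delta(d, y):
    """delta_lk = (d_lk - y_lk) * g'(net_lk); for softmax, the diagonal derivative is y*(1-y)."""
    return (d - y) * y * (1.0 - y)

def hidden_delta(delta_o, W, s_t):
    """delta_lj = sum_k delta_lk * w_kj * f'(net_lj); for sigmoid, f'(net_lj) = s_lj*(1-s_lj)."""
    return (W.T @ delta_o) * s_t * (1.0 - s_t)
```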

activation function

  1. sigmoid
    $f(net) = \frac{1}{1 + e^{-net}}$

    $f'(net) = f(net)\{1 - f(net)\}$
  2. softmax
    $g(net_k) = \frac{e^{net_k}}{\sum_{k'}^{O} e^{net_{k'}}}$

    $g'(net_k) = \frac{e^{net_k} \left( \sum_j^O e^{net_j} - e^{net_k} \right)}{\left( \sum_j^O e^{net_j} \right)^2}$
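A quick numerical check of the two derivative formulas, as a sketch; the diagonal softmax derivative above simplifies to $y_k(1 - y_k)$, which the assert below verifies:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_prime(net):
    f = sigmoid(net)
    return f * (1.0 - f)                     # f'(net) = f(net){1 - f(net)}

def softmax(net):
    e = np.exp(net - net.max())
    return e / e.sum()

def softmax_prime_diag(net):
    """Diagonal softmax derivative, as written above: e^net_k (sum - e^net_k) / sum^2."""
    e = np.exp(net - net.max())
    s = e.sum()
    return e * (s - e) / s ** 2

net = np.array([0.2, -1.0, 0.5])
y = softmax(net)
assert np.allclose(softmax_prime_diag(net), y * (1.0 - y))   # equals y_k (1 - y_k)
```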

gradient descent

According to gradient descent, each weight change in the network should be proportional to the negative gradient of the cost function with respect to the specific weight:

$\Delta w = -\eta \frac{\partial C}{\partial w}$

where η is the learning rate.
1. hidden->output

$\Delta w_{kj} = -\eta \frac{\partial C}{\partial w_{kj}} = -\eta \sum_l^P \frac{\partial C}{\partial net_{lk}} \frac{\partial net_{lk}}{\partial w_{kj}} = \eta \sum_l^P \delta_{lk} \frac{\partial net_{lk}}{\partial w_{kj}} = \eta \sum_l^P \delta_{lk}\, s_{lj}$

2. input->hidden

$\Delta v_{ji} = -\eta \frac{\partial C}{\partial v_{ji}} = \eta \sum_l^P \delta_{lj}\, x_{li}$

3. previous hidden->hidden

$\Delta u_{jh} = -\eta \frac{\partial C}{\partial u_{jh}} = \eta \sum_l^P \delta_{lj}\, s_{(l-1)h}$
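For a single training sample, the three weight changes reduce to outer products of the error components with the corresponding inputs. A sketch under that assumption (summing the outer products over l recovers the batch formulas above; variable names follow the earlier sketches):

```python
import numpy as np

def weight_updates(delta_o, delta_h, s_t, s_prev, x_t, eta=0.1):
    """Per-sample weight changes, before any unfolding through time."""
    dW = eta * np.outer(delta_o, s_t)      # Delta w_kj = eta * delta_k * s_j
    dV = eta * np.outer(delta_h, x_t)      # Delta v_ji = eta * delta_j * x_i
    dU = eta * np.outer(delta_h, s_prev)   # Delta u_jh = eta * delta_j * s_h(t-1)
    return dW, dV, dU
```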

unfolding

In a recurrent neural network, errors can be propagated back further, i.e. through more than two layers, in order to capture longer history information. This process is usually called unfolding.
In an unfolded RNN, the recurrent weight is duplicated spatially for an arbitrary number of time steps, referred to here as T.

$net_{lj}(t) = \sum_i^N x_{li}(t)\, v_{ji} + \sum_h^M s_{(l-1)h}\, u_{jh} + \theta_j$

$s_{(l-1)h} = f(net_{(l-1)h})$

The error for the hidden nodes is propagated back through time as:

$\delta_{lj}(t-1) = -\frac{\partial C}{\partial net_{(l-1)j}} = -\sum_h^M \frac{\partial C}{\partial net_{lh}} \frac{\partial net_{lh}}{\partial net_{(l-1)j}}$

$= \sum_h^M \left( -\frac{\partial C}{\partial net_{lh}} \right) \frac{\partial net_{lh}}{\partial s_{(l-1)j}} \frac{\partial s_{(l-1)j}}{\partial net_{(l-1)j}}$

$= \sum_h^M \delta_{lh}(t)\, u_{hj}\, f'(net_{(l-1)j})$

where h is the index for the hidden node at time step t, and j for the hidden node at time step t-1.
Note: the original paper writes $s_{lj}(t-1)$ here, but I think it should be $net_{lj}(t-1)$. That notation is hard to explain, though, because the subscript for time step t is l and the subscript for time step t-1 would then also be l. I therefore changed it to $net_{(l-1)j}$, taking l as the subscript for time step t and l-1 as the subscript for time step t-1.

After all error deltas have been obtained, the unfolded copies are folded back, summing up to one overall change for each weight.
1. input->hidden

$\Delta v_{ji}(t) = \eta \sum_z^T \sum_l^P \delta_{lj}(t-z)\, x_{(l-z)i}$

2. previous hidden->hidden

$\Delta u_{jh}(t) = \eta \sum_z^T \sum_l^P \delta_{lj}(t-z)\, s_{(l-1-z)h}$
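One way to sketch the unfolding in NumPy: propagate the hidden delta back through T stored time steps, then fold the per-step outer products back into single changes for V and U. The history layout (lists indexed by z, with index 0 meaning time step t) and the sigmoid assumption for f are my own choices, not prescribed by the original.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def bptt_deltas(delta_h_t, U, net_h_history, T):
    """delta_j(t-z-1) = sum_h delta_h(t-z) * u_hj * f'(net_j(t-z-1)).
    net_h_history[z] holds net_j(t-z); entry 0 is the current step."""
    deltas = [delta_h_t]
    for z in range(1, T + 1):
        s = sigmoid(net_h_history[z])                    # s_j(t-z) = f(net_j(t-z))
        deltas.append((U.T @ deltas[-1]) * s * (1.0 - s))
    return deltas                                        # [delta(t), delta(t-1), ..., delta(t-T)]

def bptt_updates(deltas, x_history, s_prev_history, eta=0.1):
    """Fold the unfolded copies back into one change per weight:
    Delta v_ji(t) = eta * sum_z delta_j(t-z) x_i(t-z),
    Delta u_jh(t) = eta * sum_z delta_j(t-z) s_h(t-z-1)."""
    dV = eta * sum(np.outer(d, x) for d, x in zip(deltas, x_history))
    dU = eta * sum(np.outer(d, s) for d, s in zip(deltas, s_prev_history))
    return dV, dU
```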

summary

  1. input->hidden
    $v_{ji}(t+1) = v_{ji}(t) + \eta \sum_z^T \sum_l^P \delta_{lj}(t-z)\, x_{(l-z)i}$
  2. previous hidden->hidden
    $u_{jh}(t+1) = u_{jh}(t) + \eta \sum_z^T \sum_l^P \delta_{lj}(t-z)\, s_{(l-1-z)h}$
  3. hidden->output
    $w_{kj}(t+1) = w_{kj}(t) + \eta \sum_l^P \delta_{lk}\, s_{lj}$
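Putting everything together, here is a self-contained toy step (unfolding depth 0, single sample) under the same assumptions as the earlier sketches; it is meant only to show how the pieces connect, not as a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, O, eta = 4, 8, 3, 0.1

V = rng.normal(scale=0.1, size=(M, N))     # input -> hidden
U = rng.normal(scale=0.1, size=(M, M))     # previous hidden -> hidden
W = rng.normal(scale=0.1, size=(O, M))     # hidden -> output
theta_h, theta_o = np.zeros(M), np.zeros(O)

x_t, s_prev = rng.normal(size=N), np.zeros(M)
d = np.array([0.0, 1.0, 0.0])              # desired output pattern

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# forward
net_h = V @ x_t + U @ s_prev + theta_h
s_t = sigmoid(net_h)
net_o = W @ s_t + theta_o
y = np.exp(net_o - net_o.max()); y /= y.sum()        # softmax output

# error components (diagonal softmax derivative, as in the text)
delta_o = (d - y) * y * (1.0 - y)
delta_h = (W.T @ delta_o) * s_t * (1.0 - s_t)

# weight updates
W += eta * np.outer(delta_o, s_t)
V += eta * np.outer(delta_h, x_t)
U += eta * np.outer(delta_h, s_prev)
```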

references

BackPropagation Through Time
A guide to recurrent neural networks and backpropagation
