RNN (Part 2): Forward Pass and BPTT
Tags (space-separated): RNN BPTT
basic definition
To simplify notation, the RNN here contains only one input layer, one hidden layer and one output layer. The notation is listed below:
neural layer | node | index | number
---|---|---|---
input layer | x(t) | i | N
previous hidden layer | s(t-1) | h | M
hidden layer | s(t) | j | M
output layer | y(t) | k | O
input->hidden | V(t) | i,j | N->M
previous hidden->hidden | U(t) | h,j | M->M
hidden->output | W(t) | j,k | M->O
In addition, P is the total number of available training samples, which are indexed by l.
forward
1. input->hidden
   $s_{lj}(t) = f\big(\mathrm{net}_{lj}(t)\big) = f\Big(\sum_i^{N} v_{ji}\, x_{li}(t) + \sum_h^{M} u_{jh}\, s_{lh}(t-1)\Big)$
2. hidden->output
   $y_{lk}(t) = g\big(\mathrm{net}_{lk}(t)\big) = g\Big(\sum_j^{M} w_{kj}\, s_{lj}(t)\Big)$
f and g are the activation functions of the hidden layer and the output layer, respectively.
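As a rough NumPy sketch of one forward step under the notation above (the sizes, the random initialisation and the names `V`, `U`, `W`, `x_t`, `s_prev` are made up for illustration, with a sigmoid hidden layer and a softmax output):

```python
import numpy as np

N, M, O = 4, 3, 2                        # example sizes for input, hidden and output layers

rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(M, N))   # input -> hidden weights v_ji
U = rng.normal(scale=0.1, size=(M, M))   # previous hidden -> hidden weights u_jh
W = rng.normal(scale=0.1, size=(O, M))   # hidden -> output weights w_kj

def sigmoid(net):                        # f, the hidden activation
    return 1.0 / (1.0 + np.exp(-net))

def softmax(net):                        # g, the output activation
    e = np.exp(net - net.max())          # shifted for numerical stability
    return e / e.sum()

def forward(x_t, s_prev):
    net_j = V @ x_t + U @ s_prev         # net_j(t) = sum_i v_ji x_i(t) + sum_h u_jh s_h(t-1)
    s_t = sigmoid(net_j)                 # s_j(t) = f(net_j(t))
    net_k = W @ s_t                      # net_k(t) = sum_j w_kj s_j(t)
    y_t = softmax(net_k)                 # y_k(t) = g(net_k(t))
    return s_t, y_t

x_t, s_prev = rng.normal(size=N), np.zeros(M)
s_t, y_t = forward(x_t, s_prev)          # one step: new hidden state and prediction
```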
backpropagation
prerequisite
Any network structure can be trained with backpropagation when desired output patterns exist and each function that has been used to calculate the actual output patterns is differentiable.
cost function
1. summed squared error (SSE)
The cost function can be any differentiable function that measures the loss of the predicted values against the gold answers. SSE is frequently used and works well for training conventional feed-forward neural networks.
2. cross entropy (CE)
The cross-entropy loss is used in Recurrent Neural Network Language Models (RNNLM) and performs well.
Discussion below is based on SSE.
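Written out in the notation above (with $d_{lk}$ the desired output of node k for sample l, and the conventional factor $\frac{1}{2}$ that makes the SSE derivative below come out cleanly), the two losses are:

$$C_{\mathrm{SSE}} = \frac{1}{2}\sum_l^{P}\sum_k^{O}\big(d_{lk} - y_{lk}\big)^2 \qquad\qquad C_{\mathrm{CE}} = -\sum_l^{P}\sum_k^{O} d_{lk}\,\log y_{lk}$$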
error component
- error for output nodes
  $\delta_{lk} = -\frac{\partial C}{\partial \mathrm{net}_{lk}} = -\frac{\partial C}{\partial y_{lk}}\frac{\partial y_{lk}}{\partial \mathrm{net}_{lk}} = (d_{lk} - y_{lk})\, g'(\mathrm{net}_{lk})$
- error for hidden nodes
  $\delta_{lj} = -\left(\sum_k^{O} \frac{\partial C}{\partial y_{lk}}\frac{\partial y_{lk}}{\partial \mathrm{net}_{lk}}\frac{\partial \mathrm{net}_{lk}}{\partial s_{lj}}\right)\frac{\partial s_{lj}}{\partial \mathrm{net}_{lj}} = \sum_k^{O} \delta_{lk}\, w_{kj}\, f'(\mathrm{net}_{lj})$
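In code, and anticipating the derivatives worked out in the next subsection ($f'(\mathrm{net}_j) = s_j(1-s_j)$ for the sigmoid, $g'(\mathrm{net}_k) = y_k(1-y_k)$ for the per-node softmax derivative), the two deltas for one sample might be computed roughly as follows (the variable names and toy numbers are made up):

```python
import numpy as np

def output_deltas(d, y):
    # delta_lk = (d_lk - y_lk) * g'(net_lk), with g'(net_k) = y_k (1 - y_k)
    return (d - y) * y * (1.0 - y)

def hidden_deltas(delta_k, W, s):
    # delta_lj = (sum_k delta_lk w_kj) * f'(net_lj), with f'(net_j) = s_j (1 - s_j)
    return (W.T @ delta_k) * s * (1.0 - s)

# toy values: M = 3 hidden nodes, O = 2 output nodes
W = np.array([[0.1, -0.2, 0.3],
              [0.0,  0.4, -0.1]])        # w_kj, shape (O, M)
s = np.array([0.5, 0.2, 0.8])            # hidden activations s_lj
y = np.array([0.7, 0.3])                 # predicted outputs y_lk
d = np.array([1.0, 0.0])                 # desired outputs d_lk

delta_k = output_deltas(d, y)            # errors at the output nodes
delta_j = hidden_deltas(delta_k, W, s)   # errors at the hidden nodes
```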
activation function
- sigmoid
  $f(\mathrm{net}) = \frac{1}{1 + e^{-\mathrm{net}}}$
  $f'(\mathrm{net}) = f(\mathrm{net})\,\big(1 - f(\mathrm{net})\big)$
- softmax
  $g(\mathrm{net}_k) = \frac{e^{\mathrm{net}_k}}{\sum_{k'}^{O} e^{\mathrm{net}_{k'}}}$
  $g'(\mathrm{net}_k) = \frac{e^{\mathrm{net}_k}\left(\sum_j^{O} e^{\mathrm{net}_j} - e^{\mathrm{net}_k}\right)}{\left(\sum_j^{O} e^{\mathrm{net}_j}\right)^2}$
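A standalone sketch of these four formulas, with a finite-difference check of the sigmoid derivative (the test point and tolerance are arbitrary):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_prime(net):
    f = sigmoid(net)
    return f * (1.0 - f)                        # f'(net) = f(net) (1 - f(net))

def softmax(net):
    e = np.exp(net - net.max())                 # the shift does not change the result
    return e / e.sum()

def softmax_prime_diag(net):
    e = np.exp(net - net.max())
    return e * (e.sum() - e) / e.sum() ** 2     # g'(net_k), the per-node (diagonal) derivative

# finite-difference check of f'(net) at an arbitrary point
net, eps = 0.3, 1e-6
numeric = (sigmoid(net + eps) - sigmoid(net - eps)) / (2 * eps)
print(np.isclose(sigmoid_prime(net), numeric))  # True
```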
gradient descent
According to gradient descent, each weight change in the network should be proportional to the negative gradient of the cost function with respect to the specific weight:

$\Delta w = -\eta\, \frac{\partial C}{\partial w}$

where η is the learning rate.
1. hidden->output
   $\Delta w_{kj} = \eta \sum_l^{P} \delta_{lk}\, s_{lj}$
2. input->hidden
   $\Delta v_{ji} = \eta \sum_l^{P} \delta_{lj}\, x_{li}$
3. previous hidden->hidden
   $\Delta u_{jh} = \eta \sum_l^{P} \delta_{lj}\, s_{lh}(t-1)$
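For a single training sample, each of the three updates is an outer product between a delta vector and the inputs of the corresponding layer; a rough sketch (the function and variable names are made up):

```python
import numpy as np

def sgd_step(eta, delta_k, delta_j, s_t, s_prev, x_t, W, V, U):
    W_new = W + eta * np.outer(delta_k, s_t)      # hidden -> output:          w_kj += eta * delta_lk * s_lj
    V_new = V + eta * np.outer(delta_j, x_t)      # input -> hidden:           v_ji += eta * delta_lj * x_li
    U_new = U + eta * np.outer(delta_j, s_prev)   # previous hidden -> hidden: u_jh += eta * delta_lj * s_lh(t-1)
    return W_new, V_new, U_new
```

Summing these outer products over the P samples (and, after unfolding, over the T time steps) gives the batch updates listed in the summary below.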
unfolding
In a recurrent neural network, errors can be propagated further back, i.e. through more than two layers, in order to capture longer history information. This process is usually called unfolding.
In an unfolded RNN, the recurrent weight is duplicated spatially for an arbitrary number of time steps, here referred to as T.

The error for the hidden nodes is then propagated back through time as:

$\delta_{lj}(t-1) = \sum_h^{M} \delta_{lh}(t)\, u_{hj}\, f'\big(s_{lj}(t-1)\big)$
where h is the index for the hidden node at time step t, and j for the hidden node at time step t-1.
Note: the original paper uses $s_{lj}(t-1)$ here; I feel it should be $\mathrm{net}_{lj}(t-1)$, but that notation is also hard to explain, since the subscript corresponding to time step t is h, not j.
After all error deltas have been obtained, the unfolded weights are folded back, with the deltas added up into one big change for each weight.
1. input->hidden
   $\Delta v_{ji} = \eta \sum_z^{T} \sum_l^{P} \delta_{lj}(t-z)\, x_{li}(t-z)$
2. previous hidden->hidden
   $\Delta u_{jh} = \eta \sum_z^{T} \sum_l^{P} \delta_{lj}(t-z)\, s_{lh}(t-1-z)$
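Putting the pieces together, here is a minimal end-to-end BPTT sketch for one sequence (a toy illustration, not the exact setup of the referenced papers: the output layer is kept linear so that $\delta_{lk} = d_{lk} - y_{lk}$, and the data, sizes and truncation depth T are made up):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def bptt_step(xs, d, V, U, W, eta=0.1, T=3):
    """One weight update from a sequence xs = [x(0), ..., x(t)] with target d at the
    last step, unfolding the recurrent weights for at most T time steps back."""
    M = U.shape[0]
    s = [np.zeros(M)]                            # s[i+1] is the hidden state after input xs[i]
    for x in xs:                                 # forward pass through the whole sequence
        s.append(sigmoid(V @ x + U @ s[-1]))
    y = W @ s[-1]                                # linear output for simplicity (g = identity)

    delta_k = d - y                              # output error (g' = 1 for the identity output)
    dW = np.outer(delta_k, s[-1])

    dV, dU = np.zeros_like(V), np.zeros_like(U)
    delta_j = (W.T @ delta_k) * s[-1] * (1.0 - s[-1])    # error at the last hidden layer
    for z in range(min(T, len(xs))):                     # propagate the error back through time
        t = len(xs) - 1 - z
        dV += np.outer(delta_j, xs[t])                   # accumulate changes for v_ji
        dU += np.outer(delta_j, s[t])                    # s[t] is the hidden state at step t-1
        delta_j = (U.T @ delta_j) * s[t] * (1.0 - s[t])  # delta one time step earlier

    return V + eta * dV, U + eta * dU, W + eta * dW      # fold the accumulated changes back

# toy usage
rng = np.random.default_rng(0)
N, M, O = 4, 3, 2
V = rng.normal(scale=0.1, size=(M, N))
U = rng.normal(scale=0.1, size=(M, M))
W = rng.normal(scale=0.1, size=(O, M))
xs = [rng.normal(size=N) for _ in range(5)]
d = np.array([1.0, 0.0])
V, U, W = bptt_step(xs, d, V, U, W)
```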
summary
- input->hidden
  $v_{ji}(t+1) = v_{ji}(t) + \eta \sum_z^{T} \sum_l^{P} \delta_{lj}(t-z)\, x_{li}(t-z)$
- previous hidden->hidden
  $u_{jh}(t+1) = u_{jh}(t) + \eta \sum_z^{T} \sum_l^{P} \delta_{lj}(t-z)\, s_{lh}(t-1-z)$
- hidden->output
  $w_{kj}(t+1) = w_{kj}(t) + \eta \sum_l^{P} \delta_{lk}\, s_{lj}$
references
- Backpropagation Through Time
- A guide to recurrent neural networks and backpropagation