RNN (1): Crossing the Threshold of RNN

This article records my learning notes on the RNN chapters of the book Dive into Deep Learning.

Introduction

RNN is designed for time sequences. It keeps a state variable that stores historical information, and the current output depends on both the state variable and the current input. RNN is commonly used to process sequence data.

Language Model

The language model is an important technique in NLP.

Introduction

What is a language model?

We regard a plain natural-language text as a time sequence. For example:

$$Text = w_1, w_2, w_3, \ldots \qquad (w_i \text{ is a word})$$

We regard $w_i$ as the output of the $i$-th time step. A language model is a probability function that evaluates the probability of a given sequence. For instance, in machine translation, 'You go first.' could be rendered as '你先走' or '你走先'; the language model tells us that '你先走' has the higher probability.

How to compute a language model?

Suppose we have a sequence $(w_1, w_2, w_3, \ldots)$. The language model can then be computed as follows:

$$P(sequence)=P(w_1)\,P(w_2\mid w_1)\,P(w_3\mid w_1,w_2)\,P(w_4\mid w_1,w_2,w_3)\cdots$$

To obtain the language model, we need to compute the frequency of each word and the conditional probability of a word given all the words in front of it. These quantities are the parameters of the language model. Note the following calculation of the conditional probability:

$$P(w_3\mid w_1,w_2)=\frac{P(w_1,w_2,w_3)}{P(w_1,w_2)}$$

where $P(w_1,w_2,w_3)$ is the frequency with which the three words appear next to each other as a sequence.
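A minimal sketch of how this conditional probability could be estimated from raw counts, using a toy whitespace-tokenized English corpus invented purely for illustration:

from collections import Counter

# a toy corpus, just to illustrate the counting
tokens = 'you go first you go home you go first'.split()
pair_counts   = Counter(zip(tokens, tokens[1:]))
triple_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

# P(w3 | w1, w2) is approximated by count(w1, w2, w3) / count(w1, w2)
p = triple_counts[('you', 'go', 'first')] / pair_counts[('you', 'go')]
print(p)   # 2 / 3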

n-grams

As you can see, we need to compute and store the frequency of every word sequence whose words appear next to each other. As the number of words in the training set grows, the computational complexity of building the language model grows exponentially. The complexity is so bad because the method above assumes that a word is related to all the words before it. We can simplify the computation and reduce the complexity with the n-gram method.

Markov Assumption

The Markov assumption states that a word is only related to the n words before it. This is also called a Markov chain of order n. Although the assumption may not hold, it reduces the computation.

Definition of n-grams

A Markov chain of order $n-1$ gives an n-gram model. The n-gram expression is:

$$P(w_1,w_2,\ldots,w_T)=\prod_{t=1}^T P(w_t\mid w_{t-(n-1)},\ldots,w_{t-1})$$

It approximates the full expression:

$$P(w_1,w_2,\ldots,w_T)=\prod_{t=1}^T P(w_t\mid w_1,\ldots,w_{t-1})\approx\prod_{t=1}^T P(w_t\mid w_{t-(n-1)},\ldots,w_{t-1})$$

because $w_{index}$ with $index<t-(n-1)$ is assumed to have no relation with $w_t$; $w_t$ is only related to the words with $t-(n-1)\le index\le t-1$.
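A minimal self-contained sketch of scoring a sequence with a bigram model ($n=2$) built from counts, under the Markov assumption above; the toy corpus is again invented for illustration:

from collections import Counter

tokens = 'you go first you go home you go first'.split()
uni = Counter((w,) for w in tokens)          # unigram counts
bi  = Counter(zip(tokens, tokens[1:]))       # bigram counts

def bigram_prob(seq, uni, bi):
    # P(w_1) * prod_t P(w_t | w_{t-1}), with both factors estimated from counts
    p = uni[(seq[0],)] / sum(uni.values())
    for prev, cur in zip(seq, seq[1:]):
        p *= bi[(prev, cur)] / uni[(prev,)]
    return p

print(bigram_prob(['you', 'go', 'first'], uni, bi))   # 1/3 * 1 * 2/3 = 2/9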

Advantages and Disadvantages

When n is small, the result is inaccurate; when n is large, the cost of building the language model is high. So the role of n is to balance computational complexity against model accuracy. Note that n-grams do reduce the complexity of computing the parameters. We will discuss how to strike this balance in later sections.

First Steps with RNN

RNN stores previous information in a hidden state, rather than rigidly remembering all fixed-length subsequences.

Multi-layer Perceptron

We first discuss a multi-layer perceptron with a single hidden layer. Suppose we have $n$ samples, each with a $d$-dimensional feature vector. The computation proceeds as follows:
$$
\begin{aligned}
&\text{Suppose: } n \text{ samples, } d\text{-dimensional input feature vectors} \\
&\text{Input: } X \in R^{n \times d} \\
&\text{Hidden layer weight: } W_1 \in R^{d \times h} \\
&\text{Hidden layer output: } H^{n \times h} = X W_1 + b^{1 \times h} \ (\text{broadcast mechanism}) \\
&\text{Output layer weight: } W_2 \in R^{h \times q} \\
&\text{Output: } o = H^{n \times h} W_2 + b^{1 \times q} \\
&\text{Classifier: } softmax(o) \text{ to get a probability distribution}
\end{aligned}
$$
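A minimal sketch of this forward pass with mxnet.ndarray, using arbitrary small shapes; the variable names mirror the formulas above:

from mxnet import nd

n, d, h, q = 4, 3, 5, 2                      # samples, input dim, hidden dim, output classes
X = nd.random.normal(shape=(n, d))
W1, b1 = nd.random.normal(shape=(d, h)), nd.zeros((1, h))
W2, b2 = nd.random.normal(shape=(h, q)), nd.zeros((1, q))

H = nd.dot(X, W1) + b1                        # hidden layer output, shape (n, h); b1 broadcasts
o = nd.dot(H, W2) + b2                        # output layer, shape (n, q)
probs = nd.softmax(o)                         # probability distribution over q classes
print(probs.shape)                            # (4, 2)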

Multi-layer Perceptron with Hidden State

The difference from the plain multi-layer perceptron is that we keep the hidden layer output of the previous time step, $H_{t-1}$. We add a new weight $W_3 \in R^{h \times h}$, which describes how the previous hidden layer output is used. The hidden layer output of the current time step depends on the current input and on $H_{t-1}$. Let the activation function of the hidden layer be $\phi$. Then:

$$H_t = \phi(X W_1 + H_{t-1} W_3 + b^{1 \times h})$$

Through the relation between $H_{t-1}$ and $H_t$, $H_t$ can capture historical information from the earlier sequence. This means the network has memory, and the remembered information acts as the state of the current time step. Because $H_t$ of the current time step uses $H_{t-1}$ of the previous time step, the hidden layer computation is recurrent, so we call this network a recurrent neural network. Why is it recurrent? Different time steps share the same weights $W_1, W_2, W_3$. The process is shown in the following graph: when the graph is folded, the recurrence happens in the hidden layer.
[Figure: the RNN computation graph, folded and unrolled over time steps]
So the weights of the RNN only consist of $V, W, U$ and the biases. The RNN uses shared weights, so the total number of weights does not grow as time steps advance. This is an advantage over n-grams and the count-based language model.
[Figure: computing the hidden state from $X_t$ and $H_{t-1}$]
If we concatenate $X_t$ and $H_{t-1}$ along the column dimension (getting $M^{n \times (d+h)}$) and concatenate $W^{d \times h}$ and $W^{h \times h}$ along the row dimension (getting $N^{(d+h) \times h}$), it is easy to prove that $M N = X W_1 + H_{t-1} W_3$. This is just a small trick.
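A quick numerical check of this trick with mxnet.ndarray, using arbitrary shapes; it only verifies the equality above:

from mxnet import nd

n, d, h = 3, 4, 5
X, H_prev = nd.random.normal(shape=(n, d)), nd.random.normal(shape=(n, h))
W1, W3 = nd.random.normal(shape=(d, h)), nd.random.normal(shape=(h, h))

separate = nd.dot(X, W1) + nd.dot(H_prev, W3)
merged = nd.dot(nd.concat(X, H_prev, dim=1), nd.concat(W1, W3, dim=0))
print((separate - merged).abs().max())   # close to 0 up to floating-point error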

Character-level Recurrent Neural Network Language Model

We use an RNN to predict the next character given the preceding characters. During training, we apply the $softmax$ function to get a probability distribution, then use the cross-entropy loss to measure the error between the predicted distribution and the target label, which is the next character. The prediction depends on the input of the current time step and on all preceding characters. Because the input unit is a character, this model is called a character-level recurrent neural network. Since the number of distinct characters is much smaller than the number of distinct words, the model is simpler.
[Figure: a character-level RNN language model]

Why can an RNN express the conditional probability of a character given all preceding characters?

Take the graph above as an example. We feed '想', '要', '有' into the network and obtain a probability distribution. One dimension of this distribution gives the probability of '直', which is exactly the conditional probability of '直' given the preceding characters '想', '要', '有'. So an RNN can express this conditional probability.

Language Model Dataset

In this section, we preprocess a language model dataset, then train a character-level RNN on it to obtain a language model. Finally, we use the trained model to generate lyrics.
The dataset is Jay Chou's lyrics, from the first album Jay to the tenth album 跨时代.

Read data set

The lyrics file ships with d2l-zh. We only use the first 10,000 characters to train our RNN.

import zipfile
# the zip file ships with d2l-zh; the path here is machine-specific
with zipfile.ZipFile('/home/*/*/jaychou_lyrics.txt.zip') as zf:
    with zf.open('jaychou_lyrics.txt') as f:
        l=f.read().decode('utf-8')
l=l.replace('\n',' ').replace('\r',' ')  # treat line breaks as spaces
l=l[0:10000]                             # keep only the first 10000 characters

Result:
[Figure: output, the first characters of the lyrics corpus]

Build the character index

For simplicity, we replace each character with an index.

inx_to_ch=list(set(l))                                        # index -> character
ch_to_inx=dict([(cha,i) for i,cha in enumerate(inx_to_ch)])   # character -> index
print(len(ch_to_inx))                                         # vocabulary size

corpus_indices=[ch_to_inx[char] for char in l]                # the whole corpus as indices
sample=corpus_indices[0:20]
print('chars are :',''.join([inx_to_ch[i] for i in sample]))
print('indices are :',sample)

Result:

[Figure: output of the indexing code]

Sampling sequential data

When training the RNN model, we need to randomly pick mini-batches of samples and labels. Unlike image data, each training sample here is a contiguous subsequence of the original sequence, and its label consists of the characters that follow it by one position. There are two ways to sample training data and labels: stochastic sampling and adjacent sampling.

Stochastic sampling

We split the corpus into $num\_examples$ subsequences of length $num\_steps$ and draw $batch\_size$ of them for each mini-batch. Before each mini-batch we must reinitialize the hidden state, because the samples of different mini-batches are not contiguous in the original sequence.

import random
from mxnet import nd

def data_creator(corpus_indices,batch_size,num_steps):
    # subtract 1 because each label is its input shifted one step to the right
    num_examples=(len(corpus_indices)-1)//num_steps
    num_batches=num_examples//batch_size
    examples_indices=list(range(num_examples))
    random.shuffle(examples_indices)                 # random order -> stochastic sampling
    def _data(start):
        return corpus_indices[start:start+num_steps]
    for i in range(num_batches):
        batch_start=i*batch_size
        batch_examples=examples_indices[batch_start:batch_start+batch_size]
        X=[_data(j*num_steps) for j in batch_examples]
        Y=[_data(j*num_steps+1) for j in batch_examples]
        yield nd.array(X),nd.array(Y)

my_seq=list(range(30))
for x,y in data_creator(my_seq,2,6):
    print('x:',x)
    print('y:',y)

Result:
[Figure: output of stochastic sampling]
Note that adjacent mini-batches may also be discontinuous. For example, $[0,1,2,3,4,5]$ and $[12,13,14,15,16,17]$ are discontinuous.

Adjacent sampling

With adjacent sampling we get contiguous mini-batches, which means the hidden state at the final time step of the current mini-batch can be used to initialize the hidden state of the next mini-batch. In this way all mini-batches are linked, and we only need to initialize the hidden state once. However, the output of the next mini-batch then depends on the input of the current one, so the gradient of the weights would depend on all preceding mini-batches; within the same epoch, the cost of computing the gradient would keep growing as iterations go on. To make each weight update (gradient computation) use only one mini-batch, we detach the hidden state from the computation graph before reading the next mini-batch. We will see how this is done in the implementation part of the "Implementing an RNN from Scratch" section.
Note that the samples within a mini-batch can be processed in parallel; we can merge them into a single computation, just as in a CNN. There is no execution order inside a mini-batch; the order only exists between mini-batches.

def data_creator_adjacent_sampling(corpus_indices,batch_size,num_steps):
    corpus_indices=nd.array(corpus_indices)
    total_len=len(corpus_indices)
    batch_len=total_len//batch_size
    # reshape so that row k holds the k-th contiguous slice of the corpus
    ci=corpus_indices[0:batch_len*batch_size].reshape((batch_size,batch_len))
    num_batches=(batch_len-1)//num_steps
    for i in range(num_batches):
        X=ci[:,i*num_steps:(i+1)*num_steps]
        Y=ci[:,i*num_steps+1:(i+1)*num_steps+1]
        yield X,Y

my_seq=list(range(30))
for x,y in data_creator_adjacent_sampling(my_seq,2,6):
    print('x:',x)
    print('y:',y)

Result:
[Figure: output of adjacent sampling]
Obviously, the adjacent mini-batches here are contiguous.
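As mentioned above, with adjacent sampling the hidden state carried over from the previous mini-batch must be detached from the computation graph before each update. A minimal sketch of that training-loop fragment, assuming state is the tuple of hidden-state NDArrays defined later in this post:

# inside the training loop, before processing the next adjacent mini-batch:
# keep the hidden-state values but cut their gradient history, so that
# backpropagation only covers the current mini-batch
state = tuple(s.detach() for s in state)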

Implementing an RNN from Scratch

We will implement a model based on Jay Chou's lyrics to generate lyrics.

one-hot vector

We can turn a character into a one-hot vector as the input of the RNN. Suppose we have $N$ different characters; in other words, the length of our $ch\_to\_inx$ dictionary is $N$. In $ch\_to\_inx$, each character corresponds to an index from $0$ to $N-1$. For a character whose index is $i$, its one-hot vector has length $N$, with all elements zero except the $i$-th element, which is $1$.
For example, the one-hot vectors of the character with index 0 and the character with index 2:
[Figure: one-hot vectors for indices 0 and 2]
Then we transform the input of shape $[batch\_size, num\_steps]$ into $num\_steps$ matrices of shape $[batch\_size, vocab\_size]$ as the RNN input. That is, the input at time step $t$ is $X_t \in R^{n \times d}$, where $n$ is $batch\_size$ and $d$ is the length of the $ch\_to\_inx$ dictionary. This is similar to the input format of a fully connected neural network.

vocab_size=len(ch_to_inx)   # number of distinct characters

def to_one_hot(x,size):
    # turn a (batch_size, num_steps) index matrix into num_steps matrices of shape (batch_size, size)
    return [nd.one_hot(j,size) for j in x.T]

X=nd.arange(10).reshape((2,5))
inputs=to_one_hot(X,vocab_size)
print((len(inputs),inputs[0].shape))
print(inputs)

Result:
[Figure: output of to_one_hot]

Initialize weights of model

$num\_hiddens$ is a hyper-parameter. In the following, we initialize all the weights.

def get_params(num_vocabs,num_hiddens,ctx):
    def _one(shape,ctx):
        return nd.random.normal(scale=0.01,shape=shape,ctx=ctx)
    #hidden layer
    w_dh=_one((num_vocabs,num_hiddens),ctx)
    w_hh=_one((num_hiddens,num_hiddens),ctx)
    b_h=nd.zeros((1,num_hiddens),ctx=ctx)
    #output layer
    w_hq=_one((num_hiddens,num_vocabs),ctx)
    b_q=nd.zeros((1,num_vocabs),ctx=ctx)

    params=[w_dh,w_hh,b_h,w_hq,b_q]
    for p in params:
        p.attach_grad()   # allocate gradient buffers for autograd
    return params

Define Model

We use a tuple to represent the initial hidden state because some models need to handle more than one state variable, for example an LSTM with a memory cell.

def init_rnn_state(batch_size,num_hiddens,ctx):
    # the initial hidden state is all zeros; returned as a tuple for generality
    return (nd.zeros((batch_size,num_hiddens),ctx=ctx),)

Now, we compute the outputs and the hidden state of the RNN.

def rnn(inputs,state,params):
    # inputs and outputs are both lists of num_steps matrices of shape (batch_size, num_vocabs)
    h,=state
    w_dh,w_hh,b_h,w_hq,b_q=params
    outputs=[]
    for i in inputs:
        h=nd.tanh(nd.dot(i,w_dh)+nd.dot(h,w_hh)+b_h)   # update the hidden state
        o=nd.dot(h,w_hq)+b_q                            # output of the current time step
        outputs.append(o)  # append (not extend): keep one matrix per time step
    return outputs,(h,)

For example:

#example
import d2lzh as d2l
def example():
    num_hiddens=256
    ctx=d2l.try_gpu()
    inputs=to_one_hot(X.as_in_context(ctx),vocab_size)
    state=init_rnn_state(X.shape[0],num_hiddens,ctx)
    params=get_params(vocab_size,num_hiddens,ctx)
    o,(h,)=rnn(inputs,state,params)
    print((len(o),o[0].shape,h.shape))
example()

Result:
(5, (2, 1027), (2, 256))

Define prediction function

# this function is also saved in the d2lzh package for later use
def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
                num_hiddens, vocab_size, ctx, idx_to_char, char_to_idx):
    state = init_rnn_state(1, num_hiddens, ctx)
    output = [char_to_idx[prefix[0]]]
    for t in range(num_chars + len(prefix) - 1):
        # use the output of the previous time step as the input of the current time step
        X = to_one_hot(nd.array([output[-1]], ctx=ctx), vocab_size)
        # compute the output and update the hidden state
        (Y, state) = rnn(X, state, params)
        # the input of the next time step is either the next character of prefix
        # or the current best predicted character
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(int(Y[0].argmax(axis=1).asscalar()))
    return ''.join([idx_to_char[i] for i in output])
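A usage sketch, mirroring how the book calls this function with the names defined above; before training, the generated characters are essentially random:

# set up fresh parameters (num_hiddens and ctx chosen as in the example above)
num_hiddens, ctx = 256, d2l.try_gpu()
params = get_params(vocab_size, num_hiddens, ctx)
# generate 10 characters after the prefix '分开'; the result is gibberish before training
print(predict_rnn('分开', 10, rnn, params, init_rnn_state,
                  num_hiddens, vocab_size, ctx, inx_to_ch, ch_to_inx))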

Rather than acting as a mere porter, it is more direct to read the original book. The remaining material is simple and there is no need to "copy" it here, so I will not reproduce it. My original purpose in writing this blog was to record how I solved the problems I ran into, my understanding of certain topics, and notes on difficult points, so that I can review them later and check whether I really understand them well enough to write them down. GitHub link: github resources (Berkeley Spring 2019 textbook).

Backpropagation Through Time

We unroll the recurrent structure over time steps to obtain the dependency relationships between the model variables and the weights. Then, according to the chain rule, we apply backpropagation to compute and store the gradients.

Definition of Model

Suppose we use an RNN without $bias$ terms, with the identity activation function ($\phi(x)=x$), and with a single sample as input. Then:

$$
\begin{aligned}
h_t &= W_{hx}\,x_t + W_{hh}\,h_{t-1} \\
o_t &= W_{qh}\,h_t \\
Loss &= \frac{1}{T}\sum_{t=1}^{T} l(o_t, y_t)
\end{aligned}
$$
In the following, $Loss$ is called the objective function.

Computation Graph of Model

To visualize the dependency relationships between the model variables and the weights, we draw the computation graph of the model:
[Figure 6.3: the computation graph of the model with 3 time steps. Shaded circles are operators, shaded rectangles are parameters, and the remaining nodes are model variables.]

Method

The parameters that need to be updated are $W_{hx}, W_{hh}, W_{qh}$. We use the operator $prod$ to denote matrix multiplication with the necessary transpositions.
(1) First of all, we compute the gradient of the output variable at time step $t$:

$$\frac{\partial Loss}{\partial o_t}=\frac{\partial l(o_t,y_t)}{T\,\partial o_t}\in R^{q\times 1}
=\begin{bmatrix}\partial o_{t1}\\ \partial o_{t2}\\ \vdots\\ \partial o_{tq}\end{bmatrix}$$

(2) Then we compute the gradient of $W_{qh}$. As we know,

$$h_t=\begin{bmatrix}h_{t1}\\ h_{t2}\\ \vdots\\ h_{th}\end{bmatrix}$$

$$\frac{\partial Loss}{\partial W_{qh}}=\sum_{t=1}^T prod\left(\frac{\partial Loss}{\partial o_t},\frac{\partial o_t}{\partial W_{qh}}\right)=\sum_{t=1}^T\frac{\partial Loss}{\partial o_t}\,h_t^T\in R^{q\times h}
=\begin{bmatrix}
\partial o_{t1}h_{t1} & \partial o_{t1}h_{t2} & \cdots & \partial o_{t1}h_{th}\\
\partial o_{t2}h_{t1} & \partial o_{t2}h_{t2} & \cdots & \partial o_{t2}h_{th}\\
\vdots & \vdots & & \vdots\\
\partial o_{tq}h_{t1} & \partial o_{tq}h_{t2} & \cdots & \partial o_{tq}h_{th}
\end{bmatrix}$$

You can verify this by writing out $o_{t1}, o_{t2}, \ldots$:

$$o_{t1}=W_{11}h_{t1}+W_{12}h_{t2}+\cdots+W_{1h}h_{th},\quad\ldots,\quad o_{tq}=W_{q1}h_{t1}+W_{q2}h_{t2}+\cdots+W_{qh}h_{th}$$

where $W_{ij}$ is an element of $W_{qh}$. The gradient matrix above then follows directly.

(3) Because $Loss$ depends on $h_T$ only through $o_T$, we first compute

$$\frac{\partial Loss}{\partial h_T}=prod\left(\frac{\partial Loss}{\partial o_T},\frac{\partial o_T}{\partial h_T}\right)=W_{qh}^T\,\frac{\partial Loss}{\partial o_T}
=\begin{bmatrix}
W_{11}\partial o_{T1}+W_{21}\partial o_{T2}+\cdots+W_{q1}\partial o_{Tq}\\
\vdots\\
W_{1h}\partial o_{T1}+W_{2h}\partial o_{T2}+\cdots+W_{qh}\partial o_{Tq}
\end{bmatrix}$$

Given the computation of $o_T$ above, this is simple.

(4) For $t<T$, $Loss$ depends on $h_t$ through $h_{t+1}$ and $o_t$. By the chain rule, we compute the gradients of $h_t$ in order of decreasing time step:

$$\frac{\partial Loss}{\partial h_t}=prod\left(\frac{\partial Loss}{\partial h_{t+1}},\frac{\partial h_{t+1}}{\partial h_{t}}\right)+prod\left(\frac{\partial Loss}{\partial o_{t}},\frac{\partial o_{t}}{\partial h_{t}}\right)
=W^{T}_{hh}\,\frac{\partial Loss}{\partial h_{t+1}}+W^T_{qh}\,\frac{\partial Loss}{\partial o_{t}}$$

The expression above is a recursion; unrolling it gives

$$\frac{\partial Loss}{\partial h_t}=\sum_{i=t}^{T}\left(W^T_{hh}\right)^{T-i}W^T_{qh}\,\frac{\partial Loss}{\partial o_{T+t-i}}$$

This is easy to prove. I had a question: $h_{t+2}, h_{t+3}, \ldots$ also "remember" $h_t$, so why does the recursion only use step $t+1$? If you added $t+2, t+3, \ldots$ to the recursion, you would repeatedly compute some redundant gradients. The expression also has another interpretation: you can think of the current time step as remembering all previous information and consider the influence of $h_t$ on all later outputs,

$$\frac{\partial Loss}{\partial h_t}=\sum_{i=t+1}^T \frac{\partial Loss}{\partial o_i}\,\frac{\partial o_i}{\partial h_t}$$

(here we suppose each $o_i$ is written in terms of all previous hidden-layer outputs, i.e. the hidden state of the previous time step is unrolled, so what appears is not $h_t$ itself but its contents). The final result is the same as before.
From the exponential term $(W^T_{hh})^{T-i}$ in the expression for $\frac{\partial Loss}{\partial h_t}$, we can see that when $T$ is large or $t$ is small, $\frac{\partial Loss}{\partial h_t}$ may suffer from vanishing or exploding gradients. This also affects the gradients of $W_{hx}$ and $W_{hh}$, which are:
$$\frac{\partial Loss}{\partial W_{hx}}=\sum_{t=1}^T prod\left(\frac{\partial Loss}{\partial h_{t}},\frac{\partial h_t}{\partial W_{hx}}\right)=\sum_{t=1}^T\frac{\partial Loss}{\partial h_{t}}\,x_t^T$$

$$\frac{\partial Loss}{\partial W_{hh}}=\sum_{t=1}^T prod\left(\frac{\partial Loss}{\partial h_{t}},\frac{\partial h_t}{\partial W_{hh}}\right)=\sum_{t=1}^T\frac{\partial Loss}{\partial h_{t}}\,h_{t-1}^T$$
We store the gradients as they are computed so that the gradients of earlier layers can reuse them. That is what backpropagation means: we compute gradients from the last layer back to the first, the earlier gradients depend on the later ones, and the later ones are never recomputed. We also need to store the outputs of forward propagation, since, for example, $\frac{\partial Loss}{\partial W_{hh}}$ depends on the hidden-layer outputs. Because of the exploding-gradient risk above, we clip the gradients to prevent the exploding gradient problem.
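A minimal sketch of gradient clipping by global norm, following the approach the book uses (d2lzh ships a similar helper); it rescales all parameter gradients when their combined L2 norm exceeds a threshold theta:

def grad_clipping(params, theta, ctx):
    # compute the global L2 norm of all parameter gradients
    norm = nd.array([0], ctx)
    for param in params:
        norm += (param.grad ** 2).sum()
    norm = norm.sqrt().asscalar()
    if norm > theta:
        # rescale every gradient so the global norm becomes theta
        for param in params:
            param.grad[:] *= theta / norm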

Gated recurrent neural network

Gradient clipping cannot solve the vanishing gradient problem. This means a plain RNN struggles to capture dependencies between two time steps that are far apart in the sequence: the later time step has almost no influence on the gradient of the earlier one, so the dependency becomes weaker as the time-step distance grows. Gated recurrent neural networks address this with learnable gates. Two popular realizations are the GRU and the LSTM.

GRU(gated recurrent unit)

A GRU contains a reset gate and an update gate, which change how the hidden state is computed. The inputs of the reset gate and the update gate are $X_t$ and $H_{t-1}$, and their outputs are computed by a fully connected layer.
[Figure: the reset gate and update gate in a GRU]
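For reference, the gate computations described above can be written as follows (following the book's GRU section, with $\sigma$ the sigmoid function):

$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r), \qquad Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)$$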
