RNN (1): Crossing the Threshold of RNN

This article records my learning notes on the RNN chapters of the book Dive into Deep Learning.

Introduction

RNN is designed for time sequences. It keeps a state variable that stores historical information, and the current output depends on both the state variable and the current input. RNN is commonly used to process sequence data.

Language Model

The language model is an important technique in NLP.

Introduction

What is a language model?

We regard a plain natural-language text as a time sequence. For example:

$$Text = w_1, w_2, w_3, \ldots \qquad (w_i \text{ is a word})$$

We regard $w_i$ as the output of the $i$-th time step. A language model is a probability function that evaluates the probability of a given sequence. For instance, in machine translation, 'You go first.' could be rendered as '你先走' or '你走先'; the language model tells us that '你先走' has the higher probability.

How to compute a language model?

Suppose we have a sequence $(w_1, w_2, w_3, \ldots)$. The language model can then be computed as follows:

$$P(sequence)=P(w_1)\,P(w_2\mid w_1)\,P(w_3\mid w_1,w_2)\,P(w_4\mid w_1,w_2,w_3)\cdots$$

To obtain the language model, we need to compute the frequency of each word and the conditional probability of a word given all the words in front of it. These quantities are the parameters of the language model. Note the following calculation of the conditional probability:

$$P(w_3\mid w_1,w_2)=\frac{P(w_1,w_2,w_3)}{P(w_1,w_2)}$$

where $P(w_1,w_2,w_3)$ is the frequency with which the three words appear next to each other as a sequence.
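A minimal sketch of how this conditional probability could be estimated from raw counts, using a toy whitespace-tokenized English corpus invented purely for illustration:

from collections import Counter

# a toy corpus, just to illustrate the counting
tokens = 'you go first you go home you go first'.split()
pair_counts   = Counter(zip(tokens, tokens[1:]))
triple_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

# P(w3 | w1, w2) is approximated by count(w1, w2, w3) / count(w1, w2)
p = triple_counts[('you', 'go', 'first')] / pair_counts[('you', 'go')]
print(p)   # 2 / 3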

n-grams

As you can see, we need to compute and store the frequency of every word sequence whose words appear next to each other. As the number of words in the training set grows, the computational complexity of building the language model grows exponentially. The complexity is so bad because the method above assumes that a word is related to all the words before it. We can simplify the computation and reduce the complexity with the n-gram method.

Markov Assumption

The Markov assumption states that a word is only related to the n words before it. This is also called a Markov chain of order n. Although the assumption may not hold, it reduces the computation.

Definition of n-grams

A Markov chain of order $n-1$ gives an n-gram model. The n-gram expression is:

$$P(w_1,w_2,\ldots,w_T)=\prod_{t=1}^T P(w_t\mid w_{t-(n-1)},\ldots,w_{t-1})$$

It approximates the full expression:

$$P(w_1,w_2,\ldots,w_T)=\prod_{t=1}^T P(w_t\mid w_1,\ldots,w_{t-1})\approx\prod_{t=1}^T P(w_t\mid w_{t-(n-1)},\ldots,w_{t-1})$$

because $w_{index}$ with $index<t-(n-1)$ is assumed to have no relation with $w_t$; $w_t$ is only related to the words with $t-(n-1)\le index\le t-1$.
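A minimal self-contained sketch of scoring a sequence with a bigram model ($n=2$) built from counts, under the Markov assumption above; the toy corpus is again invented for illustration:

from collections import Counter

tokens = 'you go first you go home you go first'.split()
uni = Counter((w,) for w in tokens)          # unigram counts
bi  = Counter(zip(tokens, tokens[1:]))       # bigram counts

def bigram_prob(seq, uni, bi):
    # P(w_1) * prod_t P(w_t | w_{t-1}), with both factors estimated from counts
    p = uni[(seq[0],)] / sum(uni.values())
    for prev, cur in zip(seq, seq[1:]):
        p *= bi[(prev, cur)] / uni[(prev,)]
    return p

print(bigram_prob(['you', 'go', 'first'], uni, bi))   # 1/3 * 1 * 2/3 = 2/9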

Advantages and Disadvantages

When n is small, the result is inaccurate; when n is large, the cost of building the language model is high. So the role of n is to balance computational complexity against model accuracy. Note that n-grams do reduce the complexity of computing the parameters. We will discuss how to strike this balance in later sections.

First Steps with RNN

RNN stores previous information in a hidden state, rather than rigidly remembering all fixed-length subsequences.

Multi-layer Perceptron

We first discuss a multi-layer perceptron with a single hidden layer. Suppose we have $n$ samples, each with a $d$-dimensional feature vector. The computation proceeds as follows:
$$
\begin{aligned}
&\text{Suppose: } n \text{ samples, } d\text{-dimensional input feature vectors} \\
&\text{Input: } X \in R^{n \times d} \\
&\text{Hidden layer weight: } W_1 \in R^{d \times h} \\
&\text{Hidden layer output: } H^{n \times h} = X W_1 + b^{1 \times h} \ (\text{broadcast mechanism}) \\
&\text{Output layer weight: } W_2 \in R^{h \times q} \\
&\text{Output: } o = H^{n \times h} W_2 + b^{1 \times q} \\
&\text{Classifier: } softmax(o) \text{ to get a probability distribution}
\end{aligned}
$$
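A minimal sketch of this forward pass with mxnet.ndarray, using arbitrary small shapes; the variable names mirror the formulas above:

from mxnet import nd

n, d, h, q = 4, 3, 5, 2                      # samples, input dim, hidden dim, output classes
X = nd.random.normal(shape=(n, d))
W1, b1 = nd.random.normal(shape=(d, h)), nd.zeros((1, h))
W2, b2 = nd.random.normal(shape=(h, q)), nd.zeros((1, q))

H = nd.dot(X, W1) + b1                        # hidden layer output, shape (n, h); b1 broadcasts
o = nd.dot(H, W2) + b2                        # output layer, shape (n, q)
probs = nd.softmax(o)                         # probability distribution over q classes
print(probs.shape)                            # (4, 2)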

Multi-layer Perceptron with Hidden State

The difference from the plain multi-layer perceptron is that we keep the hidden layer output of the previous time step, $H_{t-1}$. We add a new weight $W_3 \in R^{h \times h}$, which describes how the previous hidden layer output is used. The hidden layer output of the current time step depends on the current input and on $H_{t-1}$. Let the activation function of the hidden layer be $\phi$. Then:

$$H_t = \phi(X W_1 + H_{t-1} W_3 + b^{1 \times h})$$

Through the relation between $H_{t-1}$ and $H_t$, $H_t$ can capture historical information from the earlier sequence. This means the network has memory, and the remembered information acts as the state of the current time step. Because $H_t$ of the current time step uses $H_{t-1}$ of the previous time step, the hidden layer computation is recurrent, so we call this network a recurrent neural network. Why is it recurrent? Different time steps share the same weights $W_1, W_2, W_3$. The process is shown in the following graph: when the graph is folded, the recurrence happens in the hidden layer.
[Figure: the RNN computation graph, folded and unrolled over time steps]
So the weights of the RNN only consist of $V, W, U$ and the biases. The RNN uses shared weights, so the total number of weights does not grow as time steps advance. This is an advantage over n-grams and the count-based language model.
[Figure: computing the hidden state from $X_t$ and $H_{t-1}$]
If we concatenate $X_t$ and $H_{t-1}$ along the column dimension (getting $M^{n \times (d+h)}$) and concatenate $W^{d \times h}$ and $W^{h \times h}$ along the row dimension (getting $N^{(d+h) \times h}$), it is easy to prove that $M N = X W_1 + H_{t-1} W_3$. This is just a small trick.
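A quick numerical check of this trick with mxnet.ndarray, using arbitrary shapes; it only verifies the equality above:

from mxnet import nd

n, d, h = 3, 4, 5
X, H_prev = nd.random.normal(shape=(n, d)), nd.random.normal(shape=(n, h))
W1, W3 = nd.random.normal(shape=(d, h)), nd.random.normal(shape=(h, h))

separate = nd.dot(X, W1) + nd.dot(H_prev, W3)
merged = nd.dot(nd.concat(X, H_prev, dim=1), nd.concat(W1, W3, dim=0))
print((separate - merged).abs().max())   # close to 0 up to floating-point error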

Character-level Recurrent Neural Network Language Model

We use an RNN to predict the next character given the preceding characters. During training, we apply the $softmax$ function to get a probability distribution, then use the cross-entropy loss to measure the error between the predicted distribution and the target label, which is the next character. The prediction depends on the input of the current time step and on all preceding characters. Because the input unit is a character, this model is called a character-level recurrent neural network. Since the number of distinct characters is much smaller than the number of distinct words, the model is simpler.
[Figure: a character-level RNN language model]

Why can an RNN express the conditional probability of a character given all preceding characters?

Take the graph above as an example. We feed '想', '要', '有' into the network and obtain a probability distribution. One dimension of this distribution gives the probability of '直', which is exactly the conditional probability of '直' given the preceding characters '想', '要', '有'. So an RNN can express this conditional probability.

Language Model Dataset

In this section, we preprocess a language model dataset, then train a character-level RNN on it to obtain a language model. Finally, we use the trained model to generate lyrics.
The dataset is Jay Chou's lyrics, from the first album Jay to the tenth album 跨时代.

Read data set

The lyrics file ships with d2l-zh. We only use the first 10,000 characters to train our RNN.

import zipfile
# the zip file ships with d2l-zh; the path here is machine-specific
with zipfile.ZipFile('/home/*/*/jaychou_lyrics.txt.zip') as zf:
    with zf.open('jaychou_lyrics.txt') as f:
        l=f.read().decode('utf-8')
l=l.replace('\n',' ').replace('\r',' ')  # treat line breaks as spaces
l=l[0:10000]                             # keep only the first 10000 characters

Result:
[Figure: output, the first characters of the lyrics corpus]

Build the character index

For simplicity, we replace each character with an index.

inx_to_ch=list(set(l))                                        # index -> character
ch_to_inx=dict([(cha,i) for i,cha in enumerate(inx_to_ch)])   # character -> index
print(len(ch_to_inx))                                         # vocabulary size

corpus_indices=[ch_to_inx[char] for char in l]                # the whole corpus as indices
sample=corpus_indices[0:20]
print('chars are :',''.join([inx_to_ch[i] for i in sample]))
print('indices are :',sample)

Result:

[Figure: output of the indexing code]

Sampling sequential data

When training the RNN model, we need to randomly pick mini-batches of samples and labels. Unlike image data, each training sample here is a contiguous subsequence of the original sequence, and its label consists of the characters that follow it by one position. There are two ways to sample training data and labels: stochastic sampling and adjacent sampling.

Stochastic sampling

We split the corpus into $num\_examples$ subsequences of length $num\_steps$ and draw $batch\_size$ of them for each mini-batch. Before each mini-batch we must reinitialize the hidden state, because the samples of different mini-batches are not contiguous in the original sequence.

import random
from mxnet import nd

def data_creator(corpus_indices,batch_size,num_steps):
    # subtract 1 because each label is its input shifted one step to the right
    num_examples=(len(corpus_indices)-1)//num_steps
    num_batches=num_examples//batch_size
    examples_indices=list(range(num_examples))
    random.shuffle(examples_indices)                 # random order -> stochastic sampling
    def _data(start):
        return corpus_indices[start:start+num_steps]
    for i in range(num_batches):
        batch_start=i*batch_size
        batch_examples=examples_indices[batch_start:batch_start+batch_size]
        X=[_data(j*num_steps) for j in batch_examples]
        Y=[_data(j*num_steps+1) for j in batch_examples]
        yield nd.array(X),nd.array(Y)

my_seq=list(range(30))
for x,y in data_creator(my_seq,2,6):
    print('x:',x)
    print('y:',y)

Result:
[Figure: output of stochastic sampling]
Note that adjacent mini-batches may also be discontinuous. For example, $[0,1,2,3,4,5]$ and $[12,13,14,15,16,17]$ are discontinuous.

Adjacent sampling

With adjacent sampling we get contiguous mini-batches, which means the hidden state at the final time step of the current mini-batch can be used to initialize the hidden state of the next mini-batch. In this way all mini-batches are linked, and we only need to initialize the hidden state once. However, the output of the next mini-batch then depends on the input of the current one, so the gradient of the weights would depend on all preceding mini-batches; within the same epoch, the cost of computing the gradient would keep growing as iterations go on. To make each weight update (gradient computation) use only one mini-batch, we detach the hidden state from the computation graph before reading the next mini-batch. We will see how this is done in the implementation part of the "Implementing an RNN from Scratch" section.
Note that the samples within a mini-batch can be processed in parallel; we can merge them into a single computation, just as in a CNN. There is no execution order inside a mini-batch; the order only exists between mini-batches.

def data_creator_adjacent_sampling(corpus_indices,batch_size,num_steps):
    corpus_indices=nd.array(corpus_indices)
    total_len=len(corpus_indices)
    batch_len=total_len//batch_size
    # reshape so that row k holds the k-th contiguous slice of the corpus
    ci=corpus_indices[0:batch_len*batch_size].reshape((batch_size,batch_len))
    num_batches=(batch_len-1)//num_steps
    for i in range(num_batches):
        X=ci[:,i*num_steps:(i+1)*num_steps]
        Y=ci[:,i*num_steps+1:(i+1)*num_steps+1]
        yield X,Y

my_seq=list(range(30))
for x,y in data_creator_adjacent_sampling(my_seq,2,6):
    print('x:',x)
    print('y:',y)

Result:
[Figure: output of adjacent sampling]
Obviously, the adjacent mini-batches here are contiguous.
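As mentioned above, with adjacent sampling the hidden state carried over from the previous mini-batch must be detached from the computation graph before each update. A minimal sketch of that training-loop fragment, assuming state is the tuple of hidden-state NDArrays defined later in this post:

# inside the training loop, before processing the next adjacent mini-batch:
# keep the hidden-state values but cut their gradient history, so that
# backpropagation only covers the current mini-batch
state = tuple(s.detach() for s in state)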

Implementing an RNN from Scratch

We will implement a model based on Jay Chou's lyrics to generate lyrics.

one-hot vector

We can turn a character into a one-hot vector as the input of the RNN. Suppose we have $N$ different characters; in other words, the length of our $ch\_to\_inx$ dictionary is $N$. In $ch\_to\_inx$, each character corresponds to an index from $0$ to $N-1$. For a character whose index is $i$, its one-hot vector has length $N$, with all elements zero except the $i$-th element, which is $1$.
For example, the one-hot vectors of the character with index 0 and the character with index 2:
[Figure: one-hot vectors for indices 0 and 2]
Then we transform the input of shape $[batch\_size, num\_steps]$ into $num\_steps$ matrices of shape $[batch\_size, vocab\_size]$ as the RNN input. That is, the input at time step $t$ is $X_t \in R^{n \times d}$, where $n$ is $batch\_size$ and $d$ is the length of the $ch\_to\_inx$ dictionary. This is similar to the input format of a fully connected neural network.

vocab_size=len(ch_to_inx)   # number of distinct characters

def to_one_hot(x,size):
    # turn a (batch_size, num_steps) index matrix into num_steps matrices of shape (batch_size, size)
    return [nd.one_hot(j,size) for j in x.T]

X=nd.arange(10).reshape((2,5))
inputs=to_one_hot(X,vocab_size)
print((len(inputs),inputs[0].shape))
print(inputs)

Result:
[Figure: output of to_one_hot]

Initialize weights of model

$num\_hiddens$ is a hyper-parameter. In the following, we initialize all the weights.

def get_params(num_vocabs,num_hiddens,ctx):
    def _one(shape,ctx):
        return nd.random.normal(scale=0.01,shape=shape,ctx=ctx)
    #hidden layer
    w_dh=_one((num_vocabs,num_hiddens),ctx)
    w_hh=_one((num_hiddens,num_hiddens),ctx)
    b_h=nd.zeros((1,num_hiddens),ctx=ctx)
    #output layer
    w_hq=_one((num_hiddens,num_vocabs),ctx)
    b_q=nd.zeros((1,num_vocabs),ctx=ctx)

    params=[w_dh,w_hh,b_h,w_hq,b_q]
    for p in params:
        p.attach_grad()   # allocate gradient buffers for autograd
    return params

Define Model

We use a tuple to represent the initial hidden state because some models need to handle more than one state variable, for example an LSTM with a memory cell.

def init_rnn_state(batch_size,num_hiddens,ctx):
    # the initial hidden state is all zeros; returned as a tuple for generality
    return (nd.zeros((batch_size,num_hiddens),ctx=ctx),)

Now, we compute the outputs and the hidden state of the RNN.

def rnn(inputs,state,params):
    # inputs and outputs are both lists of num_steps matrices of shape (batch_size, num_vocabs)
    h,=state
    w_dh,w_hh,b_h,w_hq,b_q=params
    outputs=[]
    for i in inputs:
        h=nd.tanh(nd.dot(i,w_dh)+nd.dot(h,w_hh)+b_h)   # update the hidden state
        o=nd.dot(h,w_hq)+b_q                            # output of the current time step
        outputs.append(o)  # append (not extend): keep one matrix per time step
    return outputs,(h,)

For example:

#example
import d2lzh as d2l
def example():
    num_hiddens=256
    ctx=d2l.try_gpu()
    inputs=to_one_hot(X.as_in_context(ctx),vocab_size)
    state=init_rnn_state(X.shape[0],num_hiddens,ctx)
    params=get_params(vocab_size,num_hiddens,ctx)
    o,(h,)=rnn(inputs,state,params)
    print((len(o),o[0].shape,h.shape))
example()

Result:
(5, (2, 1027), (2, 256))

Define prediction function

# this function is also saved in the d2lzh package for later use
def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
                num_hiddens, vocab_size, ctx, idx_to_char, char_to_idx):
    state = init_rnn_state(1, num_hiddens, ctx)
    output = [char_to_idx[prefix[0]]]
    for t in range(num_chars + len(prefix) - 1):
        # use the output of the previous time step as the input of the current time step
        X = to_one_hot(nd.array([output[-1]], ctx=ctx), vocab_size)
        # compute the output and update the hidden state
        (Y, state) = rnn(X, state, params)
        # the input of the next time step is either the next character of prefix
        # or the current best predicted character
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(int(Y[0].argmax(axis=1).asscalar()))
    return ''.join([idx_to_char[i] for i in output])
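A usage sketch, mirroring how the book calls this function with the names defined above; before training, the generated characters are essentially random:

# set up fresh parameters (num_hiddens and ctx chosen as in the example above)
num_hiddens, ctx = 256, d2l.try_gpu()
params = get_params(vocab_size, num_hiddens, ctx)
# generate 10 characters after the prefix '分开'; the result is gibberish before training
print(predict_rnn('分开', 10, rnn, params, init_rnn_state,
                  num_hiddens, vocab_size, ctx, inx_to_ch, ch_to_inx))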

Rather than acting as a mere porter, it is more direct to read the original book. The remaining material is simple and there is no need to "copy" it here, so I will not reproduce it. My original purpose in writing this blog was to record how I solved the problems I ran into, my understanding of certain topics, and notes on difficult points, so that I can review them later and check whether I really understand them well enough to write them down. GitHub link: github resources (Berkeley Spring 2019 textbook).

Backpropagation Through Time

We unroll the recurrent structure over time steps to obtain the dependency relationships between the model variables and the weights. Then, according to the chain rule, we apply backpropagation to compute and store the gradients.

Definition of Model

Suppose we use an RNN without $bias$ terms, with the identity activation function ($\phi(x)=x$), and with a single sample as input. Then:

$$
\begin{aligned}
h_t &= W_{hx}\,x_t + W_{hh}\,h_{t-1} \\
o_t &= W_{qh}\,h_t \\
Loss &= \frac{1}{T}\sum_{t=1}^{T} l(o_t, y_t)
\end{aligned}
$$
In the following, $Loss$ is called the objective function.

Computation Graph of Model

To visualize the dependency relationships between the model variables and the weights, we draw the computation graph of the model:
[Figure 6.3: the computation graph of the model with 3 time steps. Shaded circles are operators, shaded rectangles are parameters, and the remaining nodes are model variables.]

Method

The parameters that need to be updated are $W_{hx}, W_{hh}, W_{qh}$. We use the operator $prod$ to denote matrix multiplication with the necessary transpositions.
(1) First of all, we compute the gradient of the output variable at time step $t$:

$$\frac{\partial Loss}{\partial o_t}=\frac{\partial l(o_t,y_t)}{T\,\partial o_t}\in R^{q\times 1}
=\begin{bmatrix}\partial o_{t1}\\ \partial o_{t2}\\ \vdots\\ \partial o_{tq}\end{bmatrix}$$

(2) Then we compute the gradient of $W_{qh}$. As we know,

$$h_t=\begin{bmatrix}h_{t1}\\ h_{t2}\\ \vdots\\ h_{th}\end{bmatrix}$$

$$\frac{\partial Loss}{\partial W_{qh}}=\sum_{t=1}^T prod\left(\frac{\partial Loss}{\partial o_t},\frac{\partial o_t}{\partial W_{qh}}\right)=\sum_{t=1}^T\frac{\partial Loss}{\partial o_t}\,h_t^T\in R^{q\times h}
=\begin{bmatrix}
\partial o_{t1}h_{t1} & \partial o_{t1}h_{t2} & \cdots & \partial o_{t1}h_{th}\\
\partial o_{t2}h_{t1} & \partial o_{t2}h_{t2} & \cdots & \partial o_{t2}h_{th}\\
\vdots & \vdots & & \vdots\\
\partial o_{tq}h_{t1} & \partial o_{tq}h_{t2} & \cdots & \partial o_{tq}h_{th}
\end{bmatrix}$$

You can verify this by writing out $o_{t1}, o_{t2}, \ldots$:

$$o_{t1}=W_{11}h_{t1}+W_{12}h_{t2}+\cdots+W_{1h}h_{th},\quad\ldots,\quad o_{tq}=W_{q1}h_{t1}+W_{q2}h_{t2}+\cdots+W_{qh}h_{th}$$

where $W_{ij}$ is an element of $W_{qh}$. The gradient matrix above then follows directly.

(3) Because $Loss$ depends on $h_T$ only through $o_T$, we first compute

$$\frac{\partial Loss}{\partial h_T}=prod\left(\frac{\partial Loss}{\partial o_T},\frac{\partial o_T}{\partial h_T}\right)=W_{qh}^T\,\frac{\partial Loss}{\partial o_T}
=\begin{bmatrix}
W_{11}\partial o_{T1}+W_{21}\partial o_{T2}+\cdots+W_{q1}\partial o_{Tq}\\
\vdots\\
W_{1h}\partial o_{T1}+W_{2h}\partial o_{T2}+\cdots+W_{qh}\partial o_{Tq}
\end{bmatrix}$$

Given the computation of $o_T$ above, this is simple.

(4) For $t<T$, $Loss$ depends on $h_t$ through $h_{t+1}$ and $o_t$. By the chain rule, we compute the gradients of $h_t$ in order of decreasing time step:

$$\frac{\partial Loss}{\partial h_t}=prod\left(\frac{\partial Loss}{\partial h_{t+1}},\frac{\partial h_{t+1}}{\partial h_{t}}\right)+prod\left(\frac{\partial Loss}{\partial o_{t}},\frac{\partial o_{t}}{\partial h_{t}}\right)
=W^{T}_{hh}\,\frac{\partial Loss}{\partial h_{t+1}}+W^T_{qh}\,\frac{\partial Loss}{\partial o_{t}}$$

The expression above is a recursion; unrolling it gives

$$\frac{\partial Loss}{\partial h_t}=\sum_{i=t}^{T}\left(W^T_{hh}\right)^{T-i}W^T_{qh}\,\frac{\partial Loss}{\partial o_{T+t-i}}$$

This is easy to prove. I had a question: $h_{t+2}, h_{t+3}, \ldots$ also "remember" $h_t$, so why does the recursion only use step $t+1$? If you added $t+2, t+3, \ldots$ to the recursion, you would repeatedly compute some redundant gradients. The expression also has another interpretation: you can think of the current time step as remembering all previous information and consider the influence of $h_t$ on all later outputs,

$$\frac{\partial Loss}{\partial h_t}=\sum_{i=t+1}^T \frac{\partial Loss}{\partial o_i}\,\frac{\partial o_i}{\partial h_t}$$

(here we suppose each $o_i$ is written in terms of all previous hidden-layer outputs, i.e. the hidden state of the previous time step is unrolled, so what appears is not $h_t$ itself but its contents). The final result is the same as before.
From the exponential term $(W^T_{hh})^{T-i}$ in the expression for $\frac{\partial Loss}{\partial h_t}$, we can see that when $T$ is large or $t$ is small, $\frac{\partial Loss}{\partial h_t}$ may suffer from vanishing or exploding gradients. This also affects the gradients of $W_{hx}$ and $W_{hh}$, which are:
$$\frac{\partial Loss}{\partial W_{hx}}=\sum_{t=1}^T prod\left(\frac{\partial Loss}{\partial h_{t}},\frac{\partial h_t}{\partial W_{hx}}\right)=\sum_{t=1}^T\frac{\partial Loss}{\partial h_{t}}\,x_t^T$$

$$\frac{\partial Loss}{\partial W_{hh}}=\sum_{t=1}^T prod\left(\frac{\partial Loss}{\partial h_{t}},\frac{\partial h_t}{\partial W_{hh}}\right)=\sum_{t=1}^T\frac{\partial Loss}{\partial h_{t}}\,h_{t-1}^T$$
We store the gradients as they are computed so that the gradients of earlier layers can reuse them. That is what backpropagation means: we compute gradients from the last layer back to the first, the earlier gradients depend on the later ones, and the later ones are never recomputed. We also need to store the outputs of forward propagation, since, for example, $\frac{\partial Loss}{\partial W_{hh}}$ depends on the hidden-layer outputs. Because of the exploding-gradient risk above, we clip the gradients to prevent the exploding gradient problem.
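A minimal sketch of gradient clipping by global norm, following the approach the book uses (d2lzh ships a similar helper); it rescales all parameter gradients when their combined L2 norm exceeds a threshold theta:

def grad_clipping(params, theta, ctx):
    # compute the global L2 norm of all parameter gradients
    norm = nd.array([0], ctx)
    for param in params:
        norm += (param.grad ** 2).sum()
    norm = norm.sqrt().asscalar()
    if norm > theta:
        # rescale every gradient so the global norm becomes theta
        for param in params:
            param.grad[:] *= theta / norm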

Gated recurrent neural network

Gradient clipping cannot solve the vanishing gradient problem. This means a plain RNN struggles to capture dependencies between two time steps that are far apart in the sequence: the later time step has almost no influence on the gradient of the earlier one, so the dependency becomes weaker as the time-step distance grows. Gated recurrent neural networks address this with learnable gates. Two popular realizations are the GRU and the LSTM.

GRU(gated recurrent unit)

A GRU contains a reset gate and an update gate, which change how the hidden state is computed. The inputs of the reset gate and the update gate are $X_t$ and $H_{t-1}$, and their outputs are computed by a fully connected layer.
[Figure: the reset gate and update gate in a GRU]
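For reference, the gate computations described above can be written as follows (following the book's GRU section, with $\sigma$ the sigmoid function):

$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r), \qquad Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)$$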
