Sequence Models (Part 1)

Task: identify people's names.
Given an input sequence $x$, output $y$ indicating whether each word is part of a person's name.

Here the input has 9 words, so the output also has length 9, with each entry indicating whether the corresponding word is part of a person's name.

Notation:

$x: x^{\langle 1\rangle}, x^{\langle 2\rangle}, \dots, x^{\langle t\rangle}, \dots, x^{\langle T_x\rangle}$
$x^{\langle t\rangle}$: the $t$-th position in the input sequence
$x^{(i)}$: the $i$-th input sequence
$T_x$: the length of the input sequence
$T_x^{(i)}$: the input sequence length for training example $i$

$y: y^{\langle 1\rangle}, y^{\langle 2\rangle}, \dots, y^{\langle t\rangle}, \dots, y^{\langle T_y\rangle}$
$y^{\langle t\rangle}$: the $t$-th position in the output sequence
$y^{(i)}$: the $i$-th output sequence
$T_y$: the length of the output sequence
$T_y^{(i)}$: the output sequence length for training example $i$
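
For concreteness, with the course's running example sentence "Harry Potter and Hermione Granger invented a new spell", $T_x = T_y = 9$, $x^{\langle 1\rangle}$ is "Harry", $x^{\langle 2\rangle}$ is "Potter", and the target is $y = 1, 1, 0, 1, 1, 0, 0, 0, 0$: a 1 wherever the word is part of a person's name.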

So how do we represent each word in the sentence?
So, to represent a word in the sentence, the first thing you do is come up with a vocabulary, sometimes also called a dictionary: a list of the words that you will use in your representations.


So the first word in the vocabulary is a; that will be the first word in the dictionary. The second word is Aaron, then a little bit further down is the word and, then eventually you get to the word Harry, then the word Potter, and all the way down, the last word in the dictionary is maybe Zulu. And so, a will be word one, Aaron is word two, and in my dictionary the word and appears at position 367. Harry appears in position 4075, Potter in position 6830, and Zulu, the last word in the dictionary, is maybe word 10,000.

So in this example, I'm going to use a dictionary of 10,000 words.
If you choose a dictionary of 10,000 words, one way to build it is to look through your training set and find the top 10,000 occurring words, or look through some of the online dictionaries that tell you the most common 10,000 words in the English language, say. What you can then do is use one-hot representations to represent each of these words. For example, $x^{\langle 1\rangle}$, which represents the word Harry, would be a vector of all zeros except for a 1 in position 4075, because that was the position of Harry in the dictionary. And then $x^{\langle 2\rangle}$ will similarly be a vector of all zeros except for a 1 in position 6830 and zeros everywhere else. Each of these would be a 10,000-dimensional vector if your vocabulary has 10,000 words.

So in this representation, $x^{\langle t\rangle}$ for each value of $t$ in a sentence will be a one-hot vector, one-hot because exactly one element is 1 and everything else is 0, and you will have nine of them to represent the nine words in this sentence. The goal is, given this representation for $x$, to learn a mapping with a sequence model to the target output $y$, framed as a supervised learning problem.

For words not in the vocabulary, we can add a special token 'UNK' (unknown) to represent them.
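
As a minimal sketch of this one-hot representation with the UNK fallback, in Python/NumPy; the toy vocabulary and the `one_hot` helper are illustrative, not the course's code:

```python
import numpy as np

# Toy vocabulary; the course's dictionary has 10,000 words plus <UNK>.
vocab = ["a", "aaron", "and", "harry", "potter", "zulu", "<UNK>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's
    dictionary index; out-of-vocabulary words fall back to <UNK>."""
    vec = np.zeros(len(word_to_index))
    idx = word_to_index.get(word.lower(), word_to_index["<UNK>"])
    vec[idx] = 1.0
    return vec

x1 = one_hot("Harry")      # 1 at index 3 in this toy vocabulary
xu = one_hot("Hermione")   # not in vocab -> maps to <UNK>
```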


For the task described above, we can first try using a standard neural network.
Now, one thing you could do is try to use a standard neural network for this task. So, in our previous example, we had nine input words. You could imagine taking these nine input words, maybe the nine one-hot vectors, and feeding them into a standard neural network with a few hidden layers, and then eventually having it output nine values, zero or one, that tell you whether each word is part of a person's name.

But this turns out not to work well, and there are really two main problems. The first is that the inputs and outputs can be different lengths in different examples: each example can have a different $T_x^{(i)}$ and $T_y^{(i)}$. You could zero-pad every sequence up to some maximum length, but that is still not a good representation.

And then a second, and maybe more serious, problem is that a naive neural network architecture like this doesn't share features learned across different positions of text. Each position in the sequence is fed in independently, but for sequential data the input at the previous position $x^{\langle t-1\rangle}$ strongly influences how to interpret the input at the current position $x^{\langle t\rangle}$.

So, what is a recurrent neural network?

So if you are reading the sentence from left to right, the first word you read is some first word, say $x^{\langle 1\rangle}$, and what we're going to do is take the first word and feed it into a neural network layer. So there's a hidden layer of the first neural network, and we can have the neural network maybe try to predict the output: is this part of a person's name or not?

And what a recurrent neural network does is, when it goes on to read the second word in the sentence, say $x^{\langle 2\rangle}$, instead of just predicting $y^{\langle 2\rangle}$ using only $x^{\langle 2\rangle}$, it also gets to input some information from what it computed at time step one. In particular, the activation value from time step one is passed on to time step two. Then at the next time step, the recurrent neural network inputs the third word $x^{\langle 3\rangle}$ and tries to output a prediction $\hat{y}^{\langle 3\rangle}$, and so on, up until the last time step, where it inputs $x^{\langle T_x\rangle}$ and outputs $\hat{y}^{\langle T_y\rangle}$.

In this example, $T_x = T_y$. If they are not equal, the network architecture needs some adjustments.
So at each time step, the recurrent neural network passes on its activation to the next time step for it to use. And to kick off the whole thing, we'll also have some made-up activation at time zero, $a^{\langle 0\rangle}$; this is usually the vector of zeros. Some researchers initialize $a^{\langle 0\rangle}$ randomly, and there are other ways to initialize it, but having a vector of zeros as the fake time-zero activation is the most common choice. So that gets input to the neural network.

I'll tend to draw the unrolled diagram like the one on the left, but if you see something like the diagram on the right in a textbook or a research paper, the way I tend to think about it is to mentally unroll it into the diagram on the left. The recurrent neural network scans through the data from left to right, and the parameters it uses at each time step are shared. The parameters governing the connection from $x^{\langle 1\rangle}$ to the hidden layer will be some set of parameters we write as $W_{ax}$, and it is the same $W_{ax}$ that is used at every time step.

The parameters are shared across all time steps.
The activations, the horizontal connections, will be governed by some set of parameters $W_{aa}$, and the same parameters $W_{aa}$ are used on every time step; similarly, $W_{ya}$ governs the output predictions. I'll describe next exactly how these parameters work.


What the RNN does: when predicting $y^{\langle 3\rangle}$, it uses not only the information in the current input $x^{\langle 3\rangle}$ but also the information from the earlier inputs $x^{\langle 1\rangle}, x^{\langle 2\rangle}$. Of course, this RNN structure still has a shortcoming: it makes no use of inputs at later positions. The bidirectional RNNs discussed later solve this problem.
So one limitation of this particular neural network structure is that the prediction at a certain time uses inputs or uses information from the inputs earlier in the sequence but not information later in the sequence. We will address this in a later video where we talk about bi-directional recurrent neural networks or BRNNs.

$$a^{\langle 0\rangle} = \vec{0}$$
$$a^{\langle t\rangle} = g_1\big(W_{aa}\, a^{\langle t-1\rangle} + W_{ax}\, x^{\langle t\rangle} + b_a\big)$$
$$\hat{y}^{\langle t\rangle} = g_2\big(W_{ya}\, a^{\langle t\rangle} + b_y\big)$$
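
As a minimal sketch, here is this forward pass in NumPy, assuming $g_1 = \tanh$ and $g_2$ is a sigmoid (a common choice for this binary name-detection output); the function and variable names are illustrative, not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, a0, Waa, Wax, Wya, ba, by):
    """Forward prop through time: scan left to right, reusing the
    same parameters Waa, Wax, Wya at every time step.
    xs: list of input vectors x<1>..x<Tx>, each of shape (n_x,)."""
    a = a0                                        # a<0>, usually zeros
    y_hats = []
    for x in xs:
        a = np.tanh(Waa @ a + Wax @ x + ba)       # a<t> = g1(Waa a<t-1> + Wax x<t> + ba)
        y_hats.append(sigmoid(Wya @ a + by))      # y_hat<t> = g2(Wya a<t> + by)
    return y_hats

# Toy sizes: 10,000-dim one-hot inputs, 100 hidden units, scalar output.
n_x, n_a, n_y = 10_000, 100, 1
rng = np.random.default_rng(0)
Waa = 0.01 * rng.standard_normal((n_a, n_a))
Wax = 0.01 * rng.standard_normal((n_a, n_x))
Wya = 0.01 * rng.standard_normal((n_y, n_a))
ba, by = np.zeros(n_a), np.zeros(n_y)

xs = [np.zeros(n_x) for _ in range(9)]            # stand-in for 9 one-hot word vectors
preds = rnn_forward(xs, np.zeros(n_a), Waa, Wax, Wya, ba, by)
```

Note how the loop body captures the parameter sharing described above: the same $W_{aa}$, $W_{ax}$, and $W_{ya}$ are applied at every time step.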

For forward prop, you compute these activations from left to right through the network, outputting all of the predictions. In backprop, as you might already have guessed, you end up carrying out the backpropagation calculations in basically the opposite direction of the forward-prop arrows.

In this backpropagation procedure, the most significant message, or the most significant recursive calculation, is the one that goes from right to left, and that's why this algorithm gets the rather fancy full name backpropagation through time. The motivation for this name is that for forward prop you are scanning from left to right, increasing the time index $t$, whereas for backpropagation you're going from right to left; you're kind of going backwards in time.
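
To make the quantity being backpropagated concrete: for this binary name-recognition task, a standard choice (the one used in the course) is the element-wise cross-entropy loss, summed over all time steps:

$$\mathcal{L}^{\langle t\rangle}\big(\hat{y}^{\langle t\rangle}, y^{\langle t\rangle}\big) = -\,y^{\langle t\rangle}\log \hat{y}^{\langle t\rangle} - \big(1 - y^{\langle t\rangle}\big)\log\big(1 - \hat{y}^{\langle t\rangle}\big)$$

$$\mathcal{L}(\hat{y}, y) = \sum_{t=1}^{T_y} \mathcal{L}^{\langle t\rangle}\big(\hat{y}^{\langle t\rangle}, y^{\langle t\rangle}\big)$$

Backpropagation through time differentiates this total loss with respect to the shared parameters, accumulating gradients across all time steps as it sweeps from right to left.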

