Andrew Ng Deep Learning, Course 5: Sequence Models - CSDN Blog

Article link: https://blog.csdn.net/zhuazengbian9095/article/details/114581548

I. Recurrent Sequence Models

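1.1 Why sequence models?
Sequence models are not only for long, linearly ordered sequences such as speech, music, and text.
In action recognition the frame is the smallest unit, so a video stream is also a sequence and calls for a sequence model as well.
Seen this way, sequence models are useful not only for speech and language processing but also for many CV projects.
(Example task: sentiment classification)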

1.9 GRU: Gated Recurrent Unit

$\tanh x\in(-1,1)$
$\mathrm{sigmoid}(x)\in (0,1)$

Original RNN: $a^t = g(W_a[a^{t-1}, x^t]+b_a)$
Simplified GRU:
New variable: $c$ (memory cell)
In this version, $c^t = a^t$ (unlike in LSTM, where the two differ)
…
(1) At every step, compute a candidate for rewriting the memory $c^t$:
$\hat c^t=\tanh(W_c[c^{t-1},x^t]+b_c)$,
(2) Update gate: decides whether or not to rewrite $c^t$ (in many illustrations it is drawn as a red/blue sigmoid-like curve)
$\Gamma_u=\mathrm{sigmoid}(W_u[c^{t-1},x^t]+b_u)\in(0,1)$
(3) Then the new memory cell: when the gate is 1, overwrite it with the candidate $\hat c^t$; otherwise keep $c^t$ the same as $c^{t-1}$
$c^t=\Gamma_u \hat c^t+(1-\Gamma_u) c^{t-1}$
When the memory is no longer needed, it gets rewritten. (For example, once "is" has been written after "cat", or "are" after "cats", we no longer need to remember whether it was "cat" or "cats", so we can clear or rewrite the memory cell.)
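
The three simplified-GRU steps above can be sketched in NumPy as follows. This is only a minimal illustration: the function name `simplified_gru_step`, the toy sizes, and the random weights are assumptions made for the example, not part of the course material.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simplified_gru_step(c_prev, x_t, W_c, b_c, W_u, b_u):
    """One step of the simplified GRU (update gate only, no relevance gate)."""
    concat = np.concatenate([c_prev, x_t])       # [c^{t-1}, x^t]
    c_hat = np.tanh(W_c @ concat + b_c)          # (1) candidate memory
    gamma_u = sigmoid(W_u @ concat + b_u)        # (2) update gate, in (0, 1)
    c_t = gamma_u * c_hat + (1.0 - gamma_u) * c_prev   # (3) blend
    return c_t, c_hat

# Toy sizes and random weights (hypothetical values for illustration).
rng = np.random.default_rng(0)
n_c, n_x = 4, 3
W_c = rng.normal(size=(n_c, n_c + n_x))
b_c = np.zeros(n_c)
W_u = rng.normal(size=(n_c, n_c + n_x))
c_prev = rng.normal(size=n_c)
x_t = rng.normal(size=n_x)

# A very negative gate bias pushes the gate to ~0: the memory is kept.
c_keep, _ = simplified_gru_step(c_prev, x_t, W_c, b_c, W_u, np.full(n_c, -50.0))
# A very positive gate bias pushes the gate to ~1: the memory is overwritten.
c_new, c_hat = simplified_gru_step(c_prev, x_t, W_c, b_c, W_u, np.full(n_c, 50.0))
```

With the gate near 0 the cell simply copies $c^{t-1}$ forward, which is what lets the GRU carry information such as "cat" vs "cats" across many time steps.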


Full GRU
(1) There is one more gate, the relevance gate $\Gamma_r$,
which tells how relevant $c^{t-1}$ is to computing the candidate $\hat c^t$ for $c^t$.

(2) Why do this?
Because researchers have tried many different versions over the years and found this one, the GRU, to be robust and useful.
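
As a hedged sketch of the full GRU: in the course's formulation the relevance gate is $\Gamma_r=\mathrm{sigmoid}(W_r[c^{t-1},x^t]+b_r)$ and it scales $c^{t-1}$ inside the candidate, $\hat c^t=\tanh(W_c[\Gamma_r * c^{t-1},x^t]+b_c)$. The function name `full_gru_step`, the parameter dictionary `p`, and the toy sizes below are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def full_gru_step(c_prev, x_t, p):
    """One full-GRU step with relevance gate (gamma_r) and update gate (gamma_u)."""
    concat = np.concatenate([c_prev, x_t])               # [c^{t-1}, x^t]
    gamma_r = sigmoid(p["W_r"] @ concat + p["b_r"])      # relevance gate
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])      # update gate
    gated = np.concatenate([gamma_r * c_prev, x_t])      # [gamma_r * c^{t-1}, x^t]
    c_hat = np.tanh(p["W_c"] @ gated + p["b_c"])         # candidate memory
    return gamma_u * c_hat + (1.0 - gamma_u) * c_prev    # new memory c^t

# Hypothetical toy parameters.
rng = np.random.default_rng(0)
n_c, n_x = 4, 3
p = {}
for name in ("r", "u", "c"):
    p["W_" + name] = rng.normal(scale=0.5, size=(n_c, n_c + n_x))
    p["b_" + name] = np.zeros(n_c)

# Run the cell over a short random input sequence.
c = np.zeros(n_c)
for x_t in rng.normal(size=(5, n_x)):
    c = full_gru_step(c, x_t, p)
```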

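1.10 LSTM (Long Short-Term Memory)

(The memory is short-term, but in an LSTM this short-term memory lasts relatively longer.)
In fact, LSTM predates GRU and is somewhat more powerful. GRU can be seen as a simplified LSTM: faster but slightly weaker, much as a separable convolution relates to an ordinary convolution. LSTM is the usual default.
Formulations
Picture from Chris Olah's blog post: Understanding LSTM Networks
Peephole connection
A variant of LSTM that adds $c^{t-1}$ to the computation of all three gates.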
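
The LSTM formulas themselves are not written out in these notes. As a hedged sketch, the standard LSTM step as taught in the course (without peephole connections) uses three gates, update $\Gamma_u$, forget $\Gamma_f$ and output $\Gamma_o$, with $c^t=\Gamma_u \hat c^t+\Gamma_f c^{t-1}$ and $a^t=\Gamma_o \tanh(c^t)$. The function name `lstm_step`, the dictionary `p`, and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, p):
    """One LSTM step; unlike the GRU, a^t and c^t are separate quantities."""
    concat = np.concatenate([a_prev, x_t])             # [a^{t-1}, x^t]
    c_hat = np.tanh(p["W_c"] @ concat + p["b_c"])      # candidate memory
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])    # update gate
    gamma_f = sigmoid(p["W_f"] @ concat + p["b_f"])    # forget gate
    gamma_o = sigmoid(p["W_o"] @ concat + p["b_o"])    # output gate
    c_t = gamma_u * c_hat + gamma_f * c_prev           # separate gates, not (1 - gamma_u)
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t

# Hypothetical toy parameters.
rng = np.random.default_rng(0)
n_a, n_x = 4, 3
p = {}
for name in ("c", "u", "f", "o"):
    p["W_" + name] = rng.normal(scale=0.5, size=(n_a, n_a + n_x))
    p["b_" + name] = np.zeros(n_a)

a, c = np.zeros(n_a), np.zeros(n_a)
for x_t in rng.normal(size=(5, n_x)):
    a, c = lstm_step(a, c, x_t, p)
```

The main difference from the GRU is that the forget gate is independent of the update gate, so the cell can both keep the old memory and add the new candidate.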

1.11 Bidirectional RNN

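A bidirectional RNN can look both forward and backward in the sequence, which enhances its processing ability.

Key points
The course figure is less clear than Hung-yi Lee's (李宏毅), so I drew my own version.
Drawback: you do need the entire sequence, so it cannot be used for real-time speech recognition, because you have to wait for the speaker to finish the whole sentence before recognizing it.
NLP usually uses a bidirectional LSTM.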
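
A minimal sketch of why the whole sequence is needed: the output at each step combines a forward state and a backward state, and the backward pass starts from the last input. Plain tanh RNN cells are used here instead of LSTM cells to keep the example short; the function names, sizes, and random weights are all assumptions made for illustration.

```python
import numpy as np

def rnn_pass(xs, W, b):
    """Run a plain tanh RNN over xs and return the hidden state at every step."""
    a = np.zeros(W.shape[0])
    states = []
    for x in xs:
        a = np.tanh(W @ np.concatenate([a, x]) + b)
        states.append(a)
    return states

def birnn_outputs(xs, W_f, b_f, W_b, b_b, W_y, b_y):
    fwd = rnn_pass(xs, W_f, b_f)                  # left-to-right states
    bwd = rnn_pass(xs[::-1], W_b, b_b)[::-1]      # right-to-left states, re-aligned
    # y^t is computed from both the forward and the backward state at step t.
    return [W_y @ np.concatenate([f, b]) + b_y for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
n_a, n_x, n_y, T = 4, 3, 2, 4
W_f = rng.normal(scale=0.5, size=(n_a, n_a + n_x)); b_f = np.zeros(n_a)
W_b = rng.normal(scale=0.5, size=(n_a, n_a + n_x)); b_b = np.zeros(n_a)
W_y = rng.normal(scale=0.5, size=(n_y, 2 * n_a));   b_y = np.zeros(n_y)

xs = rng.normal(size=(T, n_x))
ys = birnn_outputs(xs, W_f, b_f, W_b, b_b, W_y, b_y)

# Changing only the LAST input changes the FIRST output.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
ys_changed = birnn_outputs(xs_changed, W_f, b_f, W_b, b_b, W_y, b_y)
```

Because `ys[0]` depends on `xs[-1]`, no output can be finalized until the last input has arrived, which is the real-time limitation noted above.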

1.12 Deep RNN

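A three-layer RNN is already considered fairly deep,
unless the part that produces y is replaced by a deep network with only vertical connections and no horizontal (recurrent) ones.
[Figure: a three-layer RNN, stacked vertically]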

II. NLP

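2.1 Word representation: dimensionality reduction and relationships between words

1. Word embedding: lets the algorithm learn relationships between words automatically, e.g. man is to woman as king is to queen.

2. Previously, words were encoded with a dictionary and one-hot vectors.
This leaves each word isolated from the others:
the representation does not know that man is to woman as king is to queen,
and it also has the drawback of very high dimensionality.

3. So we use word embeddings: a featurized representation.

With this, orange and apple end up close together. If the model has seen "I want a glass of orange juice", then when filling in the blank in "I want a glass of apple ____" it knows that "juice" fits.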

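4. Why is it called an embedding?
For example, with 10000 words and 300 features, each of the 10000 words is mapped to a point in a 300-dimensional space whose axes are the features, as if it were embedded in a hypercube.
This also reduces the dimensionality: 10000 -> 300

The figure uses the t-SNE algorithm to project the 300-D space down to 2-D in order to look at the relationships (in practice, though, the relationships are likely to be no longer visible after projecting to 2-D).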

2.2 Using Word Embedding (In NLP)

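1. What to do with a small dataset
Suppose the small dataset does not contain "durian" or "cultivator" (as in "durian cultivator"). How can the model handle words it has never seen?
Use embeddings learned on a large corpus downloaded from the internet. This is transfer learning.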

2.3 Properties of word embeddings

Word embeddings support analogy reasoning, and analogy reasoning in turn helps build intuition for what word embeddings are doing.
Q: man is to woman as king is to what?
A: Because $e_{man}-e_{woman}\approx e_{king}-e_{queen}$, the analogy is: man is to woman as king is to queen. The procedure is to search all embeddings for the one closest to $e_{king}-e_{man}+e_{woman}$; it turns out to be the vector for queen, so the answer is queen.
1. Formulation:
$\arg\max_{w}\ \mathrm{sim}(e_w,\ e_{king}-e_{man}+e_{woman})$
2. sim()

$\mathrm{sim}(u,v)=\frac{u^Tv}{\|u\|_2\|v\|_2}$ (i.e. the normalized inner product, or $\cos\theta$, since $u^Tv=\|u\|_2\|v\|_2\cos\theta$)

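At an angle of 0 degrees the vectors are parallel and sim takes its maximum value, 1.
The larger the angle, the less parallel the vectors (not what we want) and the smaller sim becomes.

Alternatively, use the squared Euclidean distance $\|u-v\|^2$. This measures dissimilarity, so smaller means more similar.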
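
A small worked example of the analogy search with cosine similarity. The 3-feature embeddings below (roughly "gender", "royal", "food") are made-up toy values, not learned embeddings, and `solve_analogy` is a hypothetical helper name. Excluding the three query words from the candidates is a common convention and is assumed here.

```python
import numpy as np

def cosine_sim(u, v):
    """sim(u, v) = u^T v / (||u||_2 ||v||_2)"""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy embeddings with hand-picked values (hypothetical).
emb = {
    "man":    np.array([-1.00, 0.01, 0.03]),
    "woman":  np.array([ 1.00, 0.02, 0.02]),
    "king":   np.array([-0.95, 0.93, 0.01]),
    "queen":  np.array([ 0.97, 0.95, 0.01]),
    "apple":  np.array([ 0.00, -0.01, 0.95]),
    "orange": np.array([ 0.01, 0.00, 0.97]),
}

def solve_analogy(a, b, c, emb):
    """Find w such that a is to b as c is to w."""
    target = emb[c] - emb[a] + emb[b]              # e_king - e_man + e_woman
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_sim(emb[w], target))

answer = solve_analogy("man", "woman", "king", emb)
```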

2.4 Embedding matrix

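$O_i$ denotes the one-hot vector whose $i$-th element is 1 and all other elements are 0, i.e. the one-hot representation of the $i$-th word;
$e_i$ denotes the $i$-th embedding column (vector) of the embedding matrix, i.e. the word embedding vector of the $i$-th word.
$E$ is the embedding matrix.
Then $e_i = E \cdot O_i$
(That is how the formula is written, but in practice it is just taking the $i$-th column slice in Python.)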
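
To make the equivalence concrete, here is a short NumPy check that multiplying by the one-hot vector and slicing the column give the same result. The sizes follow the 10000-word, 300-feature example used earlier in these notes; the random matrix and the index `i` are arbitrary assumptions.

```python
import numpy as np

vocab_size, n_features = 10000, 300
rng = np.random.default_rng(0)
E = rng.normal(size=(n_features, vocab_size))   # embedding matrix, one column per word

i = 6257                                        # arbitrary word index (hypothetical)
o_i = np.zeros(vocab_size)
o_i[i] = 1.0                                    # one-hot vector O_i

e_via_matmul = E @ o_i                          # e_i = E . O_i, costs 300 x 10000 multiplies
e_via_slice = E[:, i]                           # same vector, a plain column lookup
```

The matrix product spends almost all of its work multiplying by zeros, which is why real implementations use a lookup instead.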

2.5 Language modeling problem

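The early algorithms were fairly complex; later ones became simpler and simpler while still giving very good results (the "alchemy" of deep learning).
Below, the complex approach comes first, then the simple one.
1. Use a fixed window length (e.g. 4 words, on the left, the right, or both) so that the input is a fixed-length sequence; compare a CNN, for example.

2. A simpler algorithm, skip-gram: use 1 nearby word (instead of 4) as context
glass -> juice
(see the formulation in Section 2.6)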

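2.6 Word2Vec: how to learn the embedding matrix

1. Model 1, skip-grams: context/target pairs; the embeddings are learned with supervised learning

In effect, given the center (context) word, the model computes the probability that a non-center word, i.e. the target word, is a particular word.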
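
A minimal sketch of how context/target pairs can be built from a sentence. In the course the target is sampled at random from a window around the context word; for simplicity this example enumerates every pair inside the window instead, which is an assumption. The helper name `skipgram_pairs` and the window size are hypothetical.

```python
def skipgram_pairs(tokens, window=1):
    """Return (context, target) pairs for every word within `window` of each context word."""
    pairs = []
    for i, context in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # a word is not its own target
                pairs.append((context, tokens[j]))
    return pairs

sentence = "I want a glass of orange juice".split()
pairs = skipgram_pairs(sentence, window=1)
```

Each pair is one supervised training example: the context word is the input x and the target word is the label y.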

$t$ means the target, i.e. $y$
$E$ is the embedding matrix and is also a parameter to be learned
$\theta_t$ is the softmax parameter associated with output $t$
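
As a hedged sketch of the model, the skip-gram softmax from the course is $p(t\mid c)=\frac{e^{\theta_t^T e_c}}{\sum_{j} e^{\theta_j^T e_c}}$, where $e_c$ is the embedding of the context word and the sum runs over the whole vocabulary. The function name `skipgram_softmax`, the matrix name `Theta` (rows are the $\theta_j$), and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def skipgram_softmax(E, Theta, c):
    """Return p(t | c) for every word t in the vocabulary."""
    e_c = E[:, c]                  # embedding of the context word
    logits = Theta @ e_c           # theta_j^T e_c for every j
    logits -= logits.max()         # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()             # the sum over the whole vocabulary is the slow part

# Hypothetical toy sizes and random parameters.
rng = np.random.default_rng(0)
vocab_size, n_features = 50, 8
E = rng.normal(size=(n_features, vocab_size))       # learned parameter
Theta = rng.normal(size=(vocab_size, n_features))   # learned softmax parameters

c, t = 3, 17                                        # arbitrary context/target indices
probs = skipgram_softmax(E, Theta, c)
loss = -np.log(probs[t])                            # cross-entropy loss for the pair (c, t)
```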
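2. The problem with softmax classification: because the vocabulary is large, the computation is far too slow.

Solution 1: hierarchical softmax (a tree of classifiers)
The tree does not have to be balanced; in the spirit of Huffman coding, how often a word is used can decide its depth in the tree.
Solution 2: negative sampling, see Section 2.7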

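3. How to sample/choose the context c?
the/of/a are extremely frequent
orange/durian are not frequent
Training on these less frequent words is where the payoff is
(For example, if a child spends 5 hours every day on trivially easy problems like 1+1, the child does not improve and scores only 10 on the exam;
if the time goes into learning equations, the child advances, the model gets smarter, and the score is 90.)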
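
These notes do not say which sampling rule to use. One heuristic that could serve is the subsampling rule described in the word2vec paper by Mikolov et al. (2013): each occurrence of a word with corpus frequency $f$ is kept with probability $\min(1, \sqrt{t/f})$ for a small threshold $t$. The sketch below is an illustration only; the word frequencies are made-up values and `keep_probability` is a hypothetical helper name.

```python
import numpy as np

def keep_probability(freq, t=1e-5):
    """Probability of keeping one occurrence of a word with relative frequency `freq`."""
    return np.minimum(1.0, np.sqrt(t / freq))

# Made-up relative frequencies (hypothetical, for illustration only).
freqs = {"the": 0.05, "of": 0.03, "orange": 1e-5, "durian": 1e-7}
keep = {w: float(keep_probability(f)) for w, f in freqs.items()}
```

Very frequent words such as "the" are mostly dropped, while rare words such as "durian" are always kept, so more of the training effort goes to the informative words.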
4. Model 2: CBOW
Continuous bag of words: uses the words on both the left and the right as input to predict the middle word.
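
A minimal CBOW sketch. The notes only say that the surrounding words are the input; averaging their embeddings before the softmax is the usual formulation and is assumed here. The function name `cbow_probs`, the matrix name `Theta`, and the toy sizes are hypothetical.

```python
import numpy as np

def cbow_probs(E, Theta, context_ids):
    """Predict the middle word from the average embedding of the surrounding words."""
    h = E[:, context_ids].mean(axis=1)    # average of the context embeddings
    logits = Theta @ h
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
vocab_size, n_features = 50, 8
E = rng.normal(size=(n_features, vocab_size))
Theta = rng.normal(size=(vocab_size, n_features))

# Two words on the left and two on the right (arbitrary indices).
p_mid = cbow_probs(E, Theta, [4, 9, 21, 30])
# "Bag" of words: the order of the context words does not matter.
p_shuffled = cbow_probs(E, Theta, [30, 4, 21, 9])
```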