Stanford NLP1

最新推荐文章于 2024-05-11 09:36:41 发布

LinMoson

最新推荐文章于 2024-05-11 09:36:41 发布

阅读量257

点赞数

分类专栏：笔记 stanford网课文章标签： nlp 深度学习 python

本文链接：https://blog.csdn.net/qq_36257407/article/details/108039800

版权

笔记同时被 2 个专栏收录

15 篇文章 0 订阅

订阅专栏

stanford网课

4 篇文章 0 订阅

订阅专栏

基于Stanford 2019年NLP课程，疫情期间在家憋出来的

Lesson 1

Intro

we don’t know how others interpret our words,
我们能做的就是get better at guessing how your words affect others, or make them feel sth. like what we want they to feel

重点
we want to represent word’s meaning.We relate meaning to idea.

尝试

利用WordNet这个工具可以查找同义词set（synonym）/上义词（hypernym）

找同义词
比如找到good的同义词，比如good作名词时的同义词，作形容词的同义词——adj中还分不同意思，不同意思的同义词

from nltk.corpus import wordnet as wn
poses={"n":"noun","v":"verb","s":"adj","a":"adj","r":"adj"}
for synset in wn.synsets{"good"}:
	print("{}:{}".format(poses[synset.pos()],			#前一个输出词性
		",".join(l.name() for l in synset.lemmas())))	#后一个输出这个词对应词性的同义词
#OUTPUT
noun:good
noun:good,goodness
adj:good
adj(sat):estimable

找一个词的上义词

from nltk.corpus import wordnet as wn
panda=wn.synset("panda.n.01")
hyper=lambda s:s.hypernyms()
output=list(panda.closure(hyper))

Synset('procyonid.n.01')
Synset('carnivore.n.01')
Synset('placental.n.01')
Synset('mammal.n.01')
Synset('vertebrate.n.01')
Synset('chordate.n.01')
Synset('animal.n.01')
Synset('organism.n.01')
Synset('living_thing.n.01')
Synset('whole.n.02')
Synset('object.n.01')
Synset('physical_entity.n.01')
Synset('entity.n.01')

缺点也很明显
1. 会Miss一些新的词
2. 而且这样给词汇ignore词之间的nuance(细微差别)
3. 过于主观
4. 需要人手动编纂

最早，人们认为每个词都有一个Particular place to put（one-hot vector）

缺点
1. 词太多了，这样vector会很长，而且词会变形，导致词的数量进一步爆炸
2. 这样忽略了词之间的关系，比如hotel和motel，词虽然很近，但是one-hot vector中看不出任何联系（in math terms,these two words are orthogonal）

Word Vector

来源于distributional semantics(分布语义学)
word’s meaning is given by the words that frequently appear close-by(根据周围常出现的content来理解词汇意思）

这样我们就能build a dense vector for each word,similar words have similar vectors
同理，将他们project，再visualization之后会发现similar vector会聚集在一起
word vector也被称为word embeddings/word representations

Word2vec

知道了原理，怎么做？

我们有大量文本
each word in a fixed vocabulary is represented by a vector(用随机值初始化它们)
遍历文章中每个位置t，每个位置都有一个center word $c$ 还有周围words(content) $o$
利用 $o$ 和 $c$ 的similarity来计算p(o|c)我们用中间词来predict周围词
调整word vector来maximize this probability，具体调整的措施就是第三步中移动position的位置，相当于回到第三步

Cost/Loss/Objective function
对于每个position t，我们predict words within a fixed size $m$ window，given center word $w_j$
$Likelihood=L(\theta)=\prod^T_{t=1}\prod_{-m\le{j}\le{m},j\neq{0}} p(w_{t+j}|w_t;\theta)$
$\theta$ is all variables to be optimized
而Loss function $J(\theta)=-\frac{1}{T}\log{L(\theta)}$
我们需要做的就是minimize loss function，即maximize predictive accuracy
How to calculate $P(w_{t+j}|w_t;\theta)$ ?
- Answer：实际上针对每个词 $\omega$ ，我们有两个vector
1. $v_w$ when w is a center word(use to predict context word)
2. $u_w$ when w is a context word
  像下图，根据中心into来predict周围的problems，实际上是缩写
对于一个center word $c$ 和一个context word $o$ 而言
$P(o|c)=\frac{exp(u_o^Tv_c)}{\sum_{\omega\in V}exp(u_w^Tv_c)}$
分子： $p (目标 o ∣ 中心 c)$ ，dot product 来比较 $o$ 和 $c$ 的similarity，exp确保得到的是正数
分母： $p (周围词 w ∣ 中心 c)$ 求和，达到normalize all vocabularies to give probability distribution

(有点像 $softmax(x_i)=\frac{\exp{x_i}}{\sum_{j=1}^n{\exp{(x_j)}}}$ ，把一个value $x_i$ map to a probability distribution。
max是因为它将largest value的概率给放大了，叫soft是因为给那些较小值仍然分配了一些概率)

how to optimize model

we optimize the parameter by walking down the gradient
$\theta$ represent all the parameters.Each word has two vectors,so their size would be $2 d * V$
在这里插入图片描述
$\max J'(\theta)=\prod^T_{t=1}\prod_{-m\le{j}\le{m},j\neq{0}} p(w_{t+j}|w_t;\theta)$
左右同时log， $J(\theta)$ 变到分母上，求最大变成了求最小
$J(\theta)=-\frac{1}{T}\sum_{t=1}^T\sum_{-m\le{j}\le{m},j\neq{0}}\log{p(w'_{t+1}|w_t)}$
又已知
$P(o|c)=\frac{exp(u_o^Tv_c)}{\sum_{\omega\in V}exp(u_o^Tv_c)}$

目标：minimize $J(\theta)$ by changing $\theta$ parameters， $\theta$ with respect to P(o|c)
然后一般也不会只predict一个词，一般会预测比如10个，这样也有更大的概率会猜中
因为会有different context和different words，所以matrix会loose
$\frac{\partial}{\partial v_c}\log{\frac{exp(u_o^Tv_c)}{\sum_{\omega\in V}exp(u_w^Tv_c)}}$
$\frac{\partial}{\partial v_c}\log{(exp(u_o^Tv_c))}-\frac{\partial}{\partial v_c}\log\sum^{v}_{w=1}\exp{u_o^Tv_c}$
左边分子numerator,右边分母denominator；
- 这边的 $v_c$ 是个vector，有几百维的dimension，如果我们把 $v_c$ 拆开来写，则会变成
  $\frac{\partial}{\partial v_c}u_o^Tv_c=\frac{\partial}{\partial v_c}(u_{o_1}v_{c_1}+u_{o_2}v_{c_2}+u_{o_3}v_{c_3}+...+u_{o_100}v_{c_100})$
  那么对于 $v_{c_1}$ 求偏导时，后面都不含有 $v_{c_1}$ ，所以最后只剩下 $u_{o_1}$ .
  后面以此类推，得到 $u_{o_1},u_{o_2},u_{o_3},...]$ ,最后可以写成 $u_o$
- 利用chain rule，对denominator偏导
  $=\frac{1}{\sum_{\omega=1}^v{\exp(u_o^Tv_c)}}\frac{\partial}{\partial{v_c}}\sum^{v}_{x=1}\exp{(u_x^Tv_c)}$ log后看为一个整体
  $=\frac{1}{\sum_{\omega=1}^v{\exp(u_o^Tv_c)}}\sum^{v}_{x=1}\frac{\partial}{\partial{v_c}}\exp{(u_x^Tv_c)}$ exp()看为一个函数，里面的视为一个参数
  后半部分 $=\sum^{v}_{x=1}\exp{(u_x^Tv_c)}*\frac{\partial}{\partial{v_c}}(u_x^Tv_c)$
  $=\sum^{v}_{x=1}\exp{(u_x^Tv_c)}*u_x^T$
- 最终结果
  $\frac{\partial}{\partial v_c}\log{P(o|c)}=u_0-\frac{\sum^{v}_{x=1}\exp{(u_x^Tv_c)}*u_x^T}{\sum_{\omega=1}^v{\exp(u_o^Tv_c)}}$
  $=u_0-\sum_{x=1}^v{\frac{\exp{(u_x^Tv_c)}}{\sum_{\omega=1}^v{\exp(u_o^Tv_c)}}*u_x}$
  观察得 $\frac{\exp{(u_x^Tv_c)}}{\sum_{\omega=1}^v{\exp(u_o^Tv_c)}}=p(x|c)$
最终化简
$=u_0-\sum^v_{x=1}p(x|c)*u_x$
- 解读：
  $u_0$ 是观察到的context representation,
  减去 $\sum^v_{x=1}p(x|c)$ 模型所认为的context should look like
  这是一个Expectation表达式，它认为合适的概率应该是：weighted average of the models of each word’s representation * 它在目前model中的概率
  这种差（slope）能告诉我们words的该往哪种direction走

运用

from gensim.models import KeyedVectors
model=KeyedVectors.load_word2vec_format(wordvec_file)	#导入pre_train模型
#找一个词的最接近的词
model.most_similar('word')
#我们可以对其进行加加减减
model.most_similar(negative=['word1','word2']，
					positive=['word3','word4'])

def analogy(x1,x2,y1):	#类推
	result=model.most_similar(postive=[y1,x2],negative=[x1])
	return result[0][0]
analogy('japan','japanese','austria')	
#输出'austrian'：改变最后一个词，就能输出对应国家的人

Lesson 2

PCA做点会丢失很多数据

def display_pca_scatter(model,words=None,sample=0):
	if words==None:
		if sample>0:
			words=np.random.choice(list(model.vocab.keys()),sample)
		else:
			words=[word for word in model.vocab]
	#将词向量提取出来
	word_vectors=np.array([model[w] for w in words])
	#降维
	twodim=PCA().fit_transform(word_vectors)[:,:2]
	#作图
	plt.figure(figsize=(6,6))
	plt.scatter(twodim[:,0],twodim[:,1],edgecolors='k',c='r')
	#将点的word标注出来
	for word,(x,y) in zip(words,twodim):
		plt.text(x+0.05,y+0.05,word)

PCA的特点
Nokia is close to Samsung,but it is also close to Finland for some other reasons.在高维中，这些词会很反直觉的和一些看上去不相关的词靠的很近

word vector的具体计算/优化

无论在tensorflow还是pytorch中，word vector都以row的方式储存
所以我们进行运算，一般都是拿竖的一列word vector(即原来是横的，要转置成竖的)来和整个matrix做点乘，即 $U_{matrix}*v_4^T$
我们想要的是能生成word vector的model，all the words have a reasonably high probability occur in the context

例外：and/or/that/of 和任何词的dot product都很高
根据一些论文里的建议，你的word vector里first component的得分很高，一般它们就是of/and/that（或者说就是受到frequency effect影响），把开头丢掉即可

Optimization

非常不适合全局算gradient，Stochatic Gradient Descent在实际中用得也少，一般都是用Mini-batch

优点
1. get less noisy estimation(因为我们average)
2. mini-batch可以用GPU加速，这样可以并行计算
  但如果真用SGD来计算Word Vector，即一个window移动一格就做一次gradient descent

相当于一个Mini-batch（size=window size）只有2m+1个词，得到的matrix very sparse
只能update已经出现过的词，最后只会更新固定几行（只更新几个词）
得到的基本上都是接近0的数

Skip-Grams

用center predict surrounding

CBOW

用surrounding predict center

Negative Sampling

Naive Softmax
以这个为例
$P(o|c)=\frac{exp(u_o^Tv_c)}{\sum_{\omega\in V}exp(u_o^Tv_c)}$
分母需要很多次dot product之后加起来，这种运算很慢

Solution:
Train binary logistic regression for a true pair(center word and word in its context window)
分母=中间词*目标词（observed word）+随机选的几个词的dot product
因为随机，所以和目标词（observed word）关系不大，dot product必然小，这样就可以代替所有词了，不必全部加和
他们要求maximization而不是minimization， $\sigma$ 是激活函数sigmoid（sigmoid像Binary function,在（0,0.5）处对称）
$J_t(\theta)=\log{\sigma{(u_o^Tv_C)}}+\sum_{i=1}^kE_{j~P(\omega)}[\log\sigma{(-u_i^Tv_c)}]$
$J_{neg_sample}(o,v_c,U)=-\log{(\sigma(u_o^Tv_c))}-\sum_{k=1}^K\log{(\sigma(-u_k^Tv_c))}$
1. 这里的k的意思是take k negative samples(选它们的word probabilities)，可以选10/15个
2. maximize真实出现在center word周边的词的probability
  minimize那些随机选的词出现在center word周边的概率
3. 我们随机选的词也不是有要求的
  Unigram distribution $U{(\omega)}$ ：independent统计每个词在corpus中出现的次数
  进行变形: $P(\omega)=\frac{U(\omega)^{\frac{3}{4}}}{Z}$
  Z是一个normalization term=所有词的词频的 $\frac{3}{4}$ 倍结果之和
  这样可以让Less frequent words能被更多的选中
  在实际操作中，系统在每个epoch中，会利用shuffling opertaion,一开始就shuffle data into different sequence，这样避免了挑选数据的时间，每个并行的计算也都是拿的不同的数据

window vs. full doc

co-occurence matrix，统计一起出现的次数
比如这里I LIKE一起出现的次数为2,同理，由于symmetrical，所以LIKE I一起出现的次数也是2
在这里插入图片描述
如果数据足够多的话，I LIKE和I LEARNING这一起出现的次数会一样多，那样LIKE和LEARNING的vector就会很similar

缺点
如果词很多，vector的dimension会很高，会需要大量storage
而且大量cell会是0，者就变成了一个sparse matrix,model会less robust
solution:把维度降下来，降到25-1000，但是同时尽可能保留数据
Singular Value Decomposition(SVD)
$X_{origin}=U\sum V^T$
U,V是orthogonal matrix,size=row/column

浅黄色的部分用不到，深黄色的部分是smallest singular value，也可以丢弃，这样就降低了维度
这样原本的x就会被降为 $x^k$ ，k rank的matrix

import numpy as np
la=np.linalg
words=['I','like','enjoy','deep','learning','NLP','flying','.']
#先说有哪些词
X=np.array([[0,2,1,0,0,0,0,0],
			[2,0,0,1,0,1,0,0],
			...)	#构建matrix
U,S,Vh=la.svd(X,full_matrices=False)	#SVD分解
for i in xrange(len(words)):
	plt.text(U[i,0],U[i,1],words[i])	#作图

我们承认:去算大量文本的时候，High dimension word vector的确会更准确，比如1000比如300准确，但是消耗也更多。但一般公认最好的是300

转折
18年有人写了论文，证明了随着word vector dimension增高，word vector的准确率不会下降。
新的缺点
he,the,has这三个词too frequent
解决办法
1. 用 $\log$
2. set min(X,t)，t设成100，X是某个非常常见的单词，这样就算出现次数多了，100也相当于给它设定了天花板（ceiling func）
3. 无视所有这种太常见的词
4. 对于window内的单词，不是一视同仁，对于离center word近的词，count more/给更大的权重
5. 不再用count这种单纯的数数方法，而是利用Pearson correlations来算，如果得到负数，就把负数设为0

这样得到的图，更像是线性的，名词和名词连成一条线，动词和动词，形容词和形容词，彼此之间隔得很开在这里插入图片描述

两种主要方法的差异
对于数数的（左边），他们用了整个matrix来进行predict；
对于找方向的（右边的），用概率的方式来Predict，它们从所有里面找sample，就不会很费memory

结合上面两种(GloVe)

共同出现的概率可以给meaning component 编码（encode）
比如ice，那么solid就应该常和它一起出现，而gas就不应该；steam则反之
但是ice和水一起出现，gas和水也会一起出现
所以我们将他们相除，这样差异大的就可以体现出他们的区别，差异不大的说明是它们的共有meaning
在这里插入图片描述

我们通过 $\log$ 把级数级别的差异转化为Linear meaning components
$\omega_x*(\omega_a-\omega_b)=\log\frac{P(x|a)}{P(x|b)}$
转化一下变为
$\omega_i*\omega_j=\log P(i|j)$
最终的loss func为
$J=\sum^V_{i,j=1}f(X_{ij})(\omega_x^T \hat{\omega_j}+b_i+\hat b_j-\log{X_ij})^2$
最后减去的log是为了剔除掉两个词相同的情况，两个bias $b$ 是为了平衡两个词都是uncommon/common words的情况

特点
1. Fast training
2. scalable to huge corpora
3. good performance even with small corpus and small vectors

how to evaluate word vector

两个方面入手进行评估

intrinsic
1. 对于一些subtask的完成情况（能不能准确认识一个词/把同义词放在一起）
2. Fast to compute
3. Help us understand the system
4. 除非real task中实现了非常好的结果，让人ardently admit，不然是Not clear
extrinsic
1. evaluation on real task
2. can take a long time to compute accuracy
3. 可以取代现实中的某个subsystem
4. 对于最后表现好/不好，不能确定是否是某个subsystem的问题

评估word vector相似度

计算cos distance和angle
$d=\arg \max_i \frac{(x_b-x_a+x_c)^Tx_i}{||x_b-x_a+x_c||}$
禁止系统返回自己输入的单词
human judgement on word，人工评估word vector distance(数据来源：WordSim353)

Tips：wikipedia的数据比新闻上的数据训练效果更好

Word senses & Word sense amibuguity

Intrinsic

许多词都有多个常用意义，比如right

把相近意思的词cluster到一起
build linear algebraic structure for each word
比如一个词，有多个意思，根据词意的流行程度，给他们不同的权重，最后加起来
权重的frequency的比重 $\alpha_1=\frac{f_1}{f_1+f_2+f_3}$
实际上效果不赖，可以有效的一个词的不同含义(different sense)

Extrinsic

是否能完成subsequent task，比如实体命名（name entity recognition）:找到人，组织，地点

Lesson 3

Classifier

以softmax为例 $p(y|x)=\frac{\exp{(W_yx)}}{\sum_{c=1}^C{\exp{(W_cx)}}}$
这里的W是个Matrix，它有y个row，每条row对应一个class。
这样在计算的时候，每次选出一条row和x进行Dot product，看x和这个class的接近度
$W_yx=\sum_{i=1}^dW_{yi}x_i=f_y$
$f_y$ 即对x关于某一类label的predict，这样其实相当于对 $f_y$ 做softmax

实际操作时，比如有C种类别，那么其实vector只需要有C-1行，因为用1-其余的，自然就是剩下的；相当于binary classification只要一列weight即可

在优化时
1. 要么maximize correct class’s prob
2. 要么minimize $-\log(prob)$ ，这种也被称为NLL Loss(Negative Log-likelihood loss)
3. 实际上，用的最多的还是cross-entropy loss(来源于Information theory)
  p是true prob,q是computed prob
  $H(p,q)=-\sum_{c=1}^C{p(c)\log{q(c)}}$
  用one-hot encode解释，只有当在正确位置上是，p©=1是在正确位置上，其余位置时全是0
  但是在实际计算的时候，需要除以example numbers
  $J(\theta)=\frac{1}{N}\sum{-\log{\frac{e^{f_{y_i}}}{\sum_{c=1}^C{e^{f_{y_c}}}}}}$
  最终 $f_y=f_y(x)=W_yx$ 变写为 $f = W x$

传统的方法是求gradient然后优化W

Commonly in NLP

change W and word vectors simultaneously
通过one-hot encoding将word vector提取出来。

Neural Network

介绍了neural network+activation func。四个neuron的output，可以同时做三个激活函数，
写作 $f([z_1,z_1,z_3])$ 则是同一个激活函数，写作 $f(z_1),f(z_1),f(z_3)])$ 则可以是三个不同的激活函数
但是在实际中时，激活函数一般是一个，Activation Layer会设置一个
最后他们的output做weighted sum得到结果
所有的neuron都分成两部分，前一部分接受所有收到的value，并对他们进行函数 $f$ 的运算，后半部分输出这个neuron的值。在上一层neuron到下一层neuron的连线，则是权重

对于名词entity recognition的训练

直接将所有word vector求均值，效果差，丢失位置之间信息
一句话所有的word组成一个长word vector，对于这个这么长的word vector，我们对其采用softmax来求每个W的值，这里的 $\hat{y_y}$ 是predicted model output probability
$\hat{y_y}=p(y|x)=\frac{\exp{(W_yx)}}{\sum^C_{c=1}\exp{(W_cx)}}$
然后利用cross entropy来优化，这里 $f_{y_i}=W_yx$
$J(\theta)=\frac{1}{N}\sum_{i=1}^N-\log{(\frac{e^{f_{y_i}}}{\sum_{c=1}^Ce^{f_c}})}$
1. concatenate word vector
2. 得到hidden layer
  $z = W x + b$ and $a = f (z)$
3. 再乘以matrix得到最终结果：如果U选的location是我们要的location，那么得分高，反之得分低
  $s=U^Ta$

compute gradients by hand

对于 $f(x_1,x_2,...)$ 这种，求偏导等于
$\frac{\partial f}{\partial x}=[\frac{\partial f}{\partial x_1},\frac{\partial f}{\partial x_2},\frac{\partial f}{\partial x_3},...]$
给了你各个dimension的gradient
对于一个有N个input和m个output(m个f函数)，它的结果则是Jaconbian matrix（m*n）
$\frac{\partial f}{\partial x}= \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & ... & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & ... & \frac{\partial f_2}{\partial x_n} \\ . & ...\\ . & ...\\ . & ...\\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & ... & \frac{\partial f_m}{\partial x_n} \\ \end{bmatrix} \quad$
即 $(\frac{\partial f}{\partial x})_{ij}=\frac{\partial f_i}{\partial x_j}$

Elementwise activation func
$h_i=f(z_i)$ ，若 $i\neq j$ 则等于0，所以
$\frac{\partial f}{\partial x}= \begin{bmatrix} f'(z_1) & & & & 0 \\ & .\\ & &.\\ & &&.\\ 0 &&&& f'(z_n) \end{bmatrix} \quad=diag(f'(z))$
other equation
$\frac{\partial }{\partial x}(Wx+b)=W$
$\frac{\partial }{\partial b}(Wx+b)=I$ identity matrix
$\frac{\partial }{\partial u}(u^Th)=h^T$

比如： $s=u^Th$ $h = f (z)$ $z = W x + b$ ，求 $\frac{\partial s}{\partial b}$
$\frac{\partial s}{\partial b}=\frac{\partial s}{\partial h}\frac{\partial h}{\partial z}\frac{\partial z}{\partial b}=h^T diag(f'(z))I$
同理，求 $\frac{\partial s}{\partial W}$ ，会发现 $\frac{\partial s}{\partial b}=\frac{\partial s}{\partial h}\frac{\partial h}{\partial z}\frac{\partial z}{\partial W}$ ，只有最后一项不一样
从neural network来说，就很节省计算量，我们把前两项记为 $\delta=\frac{\partial s}{\partial h}\frac{\partial h}{\partial z}$ ，考虑到z=Wx+b
$\frac{\partial s}{\partial W}=\delta \frac{\partial z}{\partial W}=\delta^T x^T$
$\delta$ 是local error signal at z,x是input。 $[n * m]$ = $[n * 1]$ * $[1 * m]$

参数
我们需要优化的是W，即 $\frac{\partial s}{\partial W}$ ，根据nm个input，1个output，我们需要一个1*nm的Jacobian matrix
$\frac{\partial s}{\partial W}= \begin{bmatrix} \frac{\partial s}{\partial W_{11}} & ... & \frac{\partial s}{\partial W_{1m}} \\ . \\ . \\ . \\ \frac{\partial s}{\partial W_{n1}} & ... & \frac{\partial s}{\partial W_{nm}} \\ \end{bmatrix} \quad$

Lesson 4 Backpropagation

看这个例子
$\frac{\partial s}{\partial W}=\delta \frac{\partial z}{\partial W}=\delta \frac{\partial}{\partial W}Wx+b$
$W_{ij}$ 只对 $z_i$ 有贡献，就比如 $W_{23}$ 只对 $z_2$ 有贡献
$\frac{\partial z_i}{\partial{w_{ij}}}=\frac{\partial}{\partial{w_{ij}}}W_ix+b_i=\frac{\partial}{\partial{w_{ij}}}\sum^{d}_{k=1}W_{ik}x_k=x_j$
我们现在相当于把之前的x从matrix解开为一个个element，所以我们可以得到
$\frac{\partial s}{\partial W_{ij}}=\delta_ix_j$
即我们有一列 $\delta$ 和一个转置过来（一行）的x matrix相乘，就可以得到结果

当一个词在window内出现两次，那么我们就要将它在更新到word vector matrix中时，更新两次

update的pitfall
的确会变得更好，但是会和原先一些词距离变远。
比如TV,telly,television三个词同义，但是前两个在train dataset中，后一个在test dataset中。
我们要train sentiment classifier，在训练中，TV和telly会移动，但是television不会，这就会导致最后classifier训练好之后，会把三者分开
solution
1. pretrain model，用尽可能多的word去做一个词库，从里面random select来train（machine translation中使用）
2. 是否要用移植训练（fine tune）?
  如果dataset小，那就word vector不管pretrain那些下降的gradient，即set word vector fixed；
  如果dataset大，那么train - update - fine-tune会有好的效果

假设正常情况下这么走，最终函数是s，f是neuron
$\rightarrow^z f \rightarrow^{h}$

左侧downstream gradient= $\frac{\partial s}{\partial z}=\frac{\partial s}{\partial h}\frac{\partial h}{\partial z}$
中间f local gradient= $\frac{\partial h}{\partial z}$
右边 upstream gradient= $\frac{\partial s}{\partial h}$
downstream gradient=upstream gradient * local gradient

$\max(x,y)$ 的偏导是1/0(根据大小)
之前我们计算 $\delta$ 的优势就出来了，每个节点都有计算一个结果，利用chain rule可以将它们连起来

Backpropagation时，先Initialize output gradient=1，然后reverse访问
fprop和bprop的复杂度O()是一样的

class ComputationalGraph(object):
	def forward(inputs):
		#1.pass inputs to input gates
		#2. forward the computational graph
		for gate in self.graph.nodes_topologically_sorted():
			gate.forward()
		return loss

	def backward():
		for gate in reverse(self.graph.nodes_topologically_sorted()):
			gate.backward()	#apply chain rules
		return inputs_gradients

multiple gate

class MultiplyGate(object):
	def forward(x,y):
		z=x*y
		self.x=x	#keep the value
		self.y=y
		return z
	def backward(dz):
		dx=self.y*dz	#dz/dx * dL/dz
		dy=self.x*dz	#dz/dy * dL/dz  local * upstrea
		return [dx,dy]

gradient check：检查梯度是否精准
$f'(x)=\frac{f(x+h)-f(x+h)}{2h}$ $h\approx 1e^{-4}$
2014年之前会这么做，approximate but very slow,这种方法check implementation很有用

L2 regularization可以有效的prevent overfitting

Vectorization

想要deep learning在GPU上跑得快，就需要vectorize them
W点乘一行word vector，然后loop下一行的速度,慢于W直接点乘一个matrix

from numpy import random
N=500
D=300
C=5
W=random.rand(C,d)
wordvec_list=[random.rand(d,1) for i in range(N)]
wordvec_one_matrix=random.rand(d,N)	#速度更快

这里的速度是matrix 10x faster than for-loop

activation func

Hard tanh
$= - 1 i f x < - 1$
$hardtanh(x)=\left\{ \begin{aligned} -1&&{if}&& x<-1 \\ x &&{if}&& -1\le x \le 1 \\ 1 &&{if}&& x>1 \end{aligned} \right.$

Parameter initialization

initialize all other weights ~uniform(-r,r)分布，r不要太大也不要太小
Xavier Initialization：variance inversely proportional to $n_{in}$ (前一层layer size)和 $n_{out}$ (下一层size)
$Var(W_i)=\frac{2}{n_{in}+n_{out}}$

Learning rate

如果能随着epoch而把learning rate调低，效果会更好

比如每k个epoch，把learning rate取半
cyclic learning rate（lr会变大变小）
Fancier optimizer会要你给一个initial learning rate，随着训练learning rate shrink，所以会一般会给0.1

Lesson 5

两种句子构成的想法

语言学家是怎么想的

Phrase Structure Grammar（context-free grammar）
cat that cuddly by car the
这些word组成了Phrase(短句)the cuddly cat, by the car,
然后这些short phrase组成bigger phrases

the a 这种是determiner，cat dog是noun，然后我们发现了一个规律det + n
然后我们又遇到了large barking这种adj，in a crate/on the table这种介宾词
于是我们更新grammar为：det+(adj)+n+（介+det+n）
the table也是一个det+n的组合，后面介宾我们称之为preposition(介词) phrase

然后
4. talk to the cat/walk behind，我们又可以学到verb+(preposition phrase）

Dependency Structure
它展示了which words depend on(modify or are arguments of) which other words（相互之间的依赖关系）
EG: 他们之间的相互依赖关系
机器需要interpret sentence structure这样才能理解meaning
但是这样会有歧义：当verb后有n，并且n之后又prepositional phrase
- EG：Cops kill man with knife.(警察用刀杀了人/警察杀了持刀者)
- EG：Scientists count whales from space(数太空中的鲸鱼/在太空中数鲸鱼)
  这种没有歧义，用常识可以理解，但是机器不能，编程中一般对介词加括号，认为它和closest n有关联
特别复杂的句子，prepositional phrase跟一个prepositional phrase，特别复杂，找到他们的dependency关系很困难
为了得到具体的modify个数，我们需要引入一个Catalan numbers
$C_n=\frac{(2n)!}{(n+1)!n!}$
它会arise in many tree-like contexts
歧义2
- （来源于前置定语/修饰词）
- 来源于comma（逗号）：是表示断句还是表示and
- 来源于一些约定俗称的短语（类似于中文断句）
  first hand / job experience & first / hand job experience
- 来源于修饰第几个n
  
  总之，他们有两种表达形式
1. 一句句子写成一条Line,上方画箭头表示互相之间dependency(箭头从谁指向谁，一直有两种争论，都接受，自己选定一种就行)
2. 把head of the sentence放在top，写成树状，然后彼此关系一目了然
  实际中我们不用那些标签，只要有arrow就可以表示关系了
  
  有一个总的数据库：Universal Dependencies treebank
  尽管全是人工手工标注，但是给了很多帮助：
3. 比如词频和distribution；
4. 比如涉及的词汇/语言广泛,而且随机，是很好的机器学习data
5. 比如词之间的距离

Dependency Parsing

每个词都要depend on别人，自己是他人的dependent.除了有一个词要depend on 起始标准位作为ROOT

最好不要让这些抽样的dependency变成一个cycle，我们想要一个tree
Dependency最好不要cross,虽然偶尔会发生，所以在现实中，我们会换种说法
I’ll give a talk on bootstrapping tomorrow

常见的四种Dependency Parsing方法
1. Dynamic programming
2. Graph algorithms
3. Constraint Satisfaction
4. Transition-based parsing/deterministic dependency parsing（最后这个在现实中常用）

Transition-based parsing

ROOT标志位理解为一个stack，句子要处理的部分理解为buffer
在这里插入图片描述
ate depend on I,所以我们可以把ROOT从原点更新到I,
有点像build a tree，选了一个词有了左弧(leftchild)，就要给他配置右弧(rightchild)

MaltParser(Transition-based parsing的update)

build machine learning classifier to predict left/right arc
不用search了，直接tell you which direction to go
provides fast linear time parsing

Evaluate on dependency parsing

machine build arcs，然后算有几个是对的
然后可以看结构和labels有没有都预测对；也可以只看结构对不对，不看label对不对
在这里插入图片描述

为什么要train neural dependency parser?

conventional dependency parse的缺点
1. sparse(features match very few things)
2. incomplete（不可能全都标记完）
3. expensive computation(95%时间都在算feature）
Solution:利用了distributed representation
1. 每个词用d-dimensional dense vector表示（similar words are close）
2. 同时利用了part-of-speech(POS) and dependency labels（提取TOKENS）
  noun,verb,adj这种，这样有些词就会和一些词建立dependency(work - working - works)
  比如numerical modifier 应该和adj modifier靠的近
  把两者组合起来，组成一个新的vector
Final mode

用cross entropy来衡量效果如何，来进行优化
然后用bigger,deeper network会有更好的效果

Lesson 6

language model的目的是根据前面的句子，预测下一个词：即要根据已知，计算下一个词的probability（假设下一个词是在一列pre-define的list中）
$P(x^{(t+1)}|x^{(1)},x^{(2)},...,x^{(t)})$
这样language model就变成一个assign probability to a piece of text的系统了
$P(x^{(1)},x^{(2)},...,x^{(t)})=P(x^{(1)})*P(x^{(2)}|x^{(1)})*...*P(x^{(T)}|x^{(T-1)},...,x^{(1)})$
$=\prod^{T}_{t=1}P(x^{(t)}|x^{(t-1)},...,x^{(1)})$
应用：输入短信时预测下一个词，搜索时预测你要搜什么

N-gram language model

n-gram是一个n个consecutive word的集合,收集different n-grams的frequent

unigram:the,students,open,their
bigrams:the students,students open,open their
trigrams:the students open,students open their
4-grams:the students opened their

Assumption：
$x^{(t+1)}$ depends on 之前的n-1个words,而不适合所有的都有关系
conditional probability后面有n-1个words

$\begin{aligned} P(x^{(t+1)}|x^{(t)},...,x^{(1)})&=P(x^{(t+1)}|x^{(t)},...,x^{(t-n+2)})\\ &=\frac{P(x^{(t+1)},x^{(t)},...,x^{(t-n+2)})}{P(x^{(t)},...,x^{(t-n+2)})}\\ &=\frac{P(n-gram)}{P((n-1)-gram)} \end{aligned}$

如何得到这些n-gram probability?

solution：count them in large corpus of text
- EG:比如我们要得到一个4-gram的LM，那么我们要靠前3个词来预测最后一个词
  as the proctor started the clock,the students opened their _____
  
  前面的都不要，只要’student opened their’，要预测的词叫w
  
  $their)=\frac{count(students \, open \, their \, w)}{count(students \, open \, their)}$ = $\frac{四词出现的频数}{三词出现的频数}$
  
  如果students open their出现1k次，students open their book出现400次，那么Book出现的概率是0.4
进一步思考：把前面的丢掉是否真的好？很明显，结合前文，有的词更合适填入
- Idea：很明显，我们需要把握n-grams的长度，不然丢掉太多词反而不利于predict
进一步思考（sparsity problem）：有的词从来没有出现过，那么它的prob=0，我们该怎么办
- solution:Smoothing，所有词都加一个 $\delta$ ，确保每个词的概率 $\neq 0$ ，不然你会发现概率矩阵里大多数都=0（sparse matrix）
进一步思考：分母=0怎么办？
- solution：（back-off）前面的n-1缩短为n-2（原来是找前3个，现在找前2个）
  N一般不会大于5

Storage Problem

你设置了N，那你就要存取所有见到的N-grams——而随着N增加，这种N-grams的种类也在变多，你要存的也多，model gets bigger
总之：N增大，storage增大，sparse matrix问题变严重

Generate text

比如给了开头两个词，那么第三个词会有一个conditional sample，从中sample出一个词
然后继续拿==最后两个词（其中一个是刚生成的）==来进行预测
我们要把逗号、句号这类也当成一个word，这样才能让model学会加标点

Neural Language Model

window-based neural model

fixed-size window

先discard一部分词
用one-hot vector来代表这些词
model转化输出
最后输出一个关于vector的probability（即更有可能是哪个vector），然后选出那个词

特点
- improvement:
  1. no sparsity problem(总归会有一个输出)
  2. don’t need to store all n-grams you saw
- problem:
  1. fixed window is too small(会丢失一些info)
  2. enlarge window size会同时enlarge weight matrix W,使得模型复杂
  3. window can never be large enough
  4. $x^{(1)}$ 和 $x^{(2)}$ 乘的是完全不同的matrix，这里有4个word vector，相当于我们在同时train四个similar funcs
总结：我们需要能够变长window to process any length(这样词和词可以有交互)

RNN

在这里插入图片描述
中间叫hidden states是因为我们可以把它们认为是single state mutates all the time

hidden states要靠input和previous states才能计算， $H_0$ 我们称之为initial states
$number_{hidden\,states}=number_{inputs}$
在RNN training过程中，可以train word vector，也可以直接用pre-trained word vector(图中word vector e的维度可以自己指定，也可用别人)

特点
- 优点
  1. 可以process any length
  2. in theory,在时间序列t时，可以利用many steps back之前的info
  3. model size won’t change( $W_E$ & $W_H$ )
  4. 每个timestep计算时，都是用same weight( $W_H$ 在一个epoch内是一致的，我们要学的是一个general matrix)
- 缺点
  1. slow(不能parallel，只能先一个再一个)
  2. 实际中，很难得到many steps back之前的info

how to train RNN-LM

compute output distribution $\hat{y}^{(t)}$ for every step t，即根据已给的单词，predict每个词的probability
Loss func use cross-entropy：predicted prob dist $\hat{y}^{(t)}$ and true next word $y^{(t)}$ (one-hot for $x^{(t+1)}$ )
$\begin{aligned} J^{(t)}(\theta)&=CE(y^{(t)},\hat{y}^{(t)})\\ &=-\sum_{\omega\in V}y_{\omega}^{(t)} \log{\hat{y}^{(t)}_{\omega}}\\ &=-\log{\hat{y}^{(t)}_{t+1}} \end{aligned}$
在这里插入图片描述
average to get overall loss for entire training set
$J(\theta)=\frac{1}{T}\sum^T_{t=1}J^{(t)}(\theta)=\frac{1}{T}\sum^T_{t=1}-\log{\hat{y^{(t)}_{x_{t+1}}}}$

但用entire corpus计算gradient loss，too expensive
所以现实中，把 $x^{(1)},...,x^{(T)}$ 当做是一个sentenc/document
Stochastic Gradient Descent可以让我们加速，计算loss $J(\theta)$ for sentence，compute gradients and updates weights

Backpropagation for RNN

Q:derivative of $J(\theta)$ w.r.t. $W_h$ ?
A: $\frac{\partial J^{(t)}}{\partial W_h}=\sum^{t}_{i=1}\frac{\partial J^{(t)}}{\partial W_h}|_{(i)}$

利用multivariable chain rule
$\frac{d}{dt}f(x(t),y(t))=\frac{\partial f}{\partial x}\frac{dx}{dt}+\frac{\partial f}{\partial y}\frac{dy}{dt}$

RNN generate texts

输入一个初始词，会得到一个output probability distribution
从中sample一个词，这样就相当于生成了一个词
用sample的这个词来作为新的input，又得到一个output probability distribution
从中sample一个，往复3.4步

用什么样的语料去训练，就会得到怎样style的text(尽管有时候狗屁不通)

evaluation

perplexity=inverse probability of the corpus(越小越好)
$perplexity=\prod_{t=1}^T{\frac{1}{P_{LM}(x^{(t+1)}|x^{(t)},...,x^{(1)})}}^{\frac{1}{T}}$
1. 对于corpus内的每个单词，我们计算 $x^{(t+1)}$ 在given $x^{(t)},...,x^{(1)})$ 的概率，并Inverse它
2. 幂上的 $\frac{1}{T}$ 是用来normalize的，T是words的个数
  我们要这么做，是因为corpus如果bigger,perplex会smaller。但我们很明显希望有更多的corpus可以参与训练，所以要normalize它
perplexity=exponential of cross-entropy loss $J(\theta)$
$perplexity=\prod_{t=1}^T{\frac{1}{\hat{y}_{x_{t+1}}^{(t)}}}=\exp{(\frac{1}{T}\sum_{t=1}^T-\log{\hat{y}_{x_{t+1}}^{(t)}})}=\exp{(J(\theta))}$
language model的重要性
1. a bench task helps us measure our progresson understanding language
2. 现实中很多时候用的到
  - speech recognition
  - 预测搜索词输入
  - spelling correction(猜出这个人想写什么)
  - 机器翻译
  - 根据不同style，猜测是谁写的这篇文章（authorship identification）
  - summarization(根据Input，总结出在说什么)

RNN的作用

tag word标注词性(named entity recognition)
sentiment classification句子情感分类
- encode the sentence with RNN(而不是单纯将一对词的word vector组合起来)
  1. 可以用final hidden state来作为sentence encoding，因为我们假设final state contains all the info
  2. take element-wise max/mean来作给sentence encoding
question answering(RNN可以作为很好地encoder)
给出一段文章，然后给出问题，把文章作为input来生成答案
RNN基于question跑，然后它encode question,它就是question。我们也可以element-wise take max/mean来讲问题转化为一个vector
speech translation：用word error rate来评估，也可以用perplexity