[Course Notes] CS224N: Stanford Natural Language Processing

Course Homepage: http://web.stanford.edu/class/cs224n/

Lecture One

2021.2.25

Plan of Lecture One

[image omitted]

Distributed representation of word vector

A word’s meaning is given by the words that frequently appear close by.

“You shall know a word by the company it keeps.” — J. R. Firth
[image omitted]
Loss function:
[image omitted]
Finally:

[image omitted]
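Since the slide images are not reproduced here, the objective is presumably the standard skip-gram one from the lecture: maximize the likelihood of the context words within a window of size $m$ around each center word, which is equivalent to minimizing the average negative log-likelihood:

$$L(\theta)=\prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} P\left(w_{t+j} \mid w_{t} ; \theta\right), \qquad J(\theta)=-\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P\left(w_{t+j} \mid w_{t} ; \theta\right)$$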
The above is the training objective; next is how the model predicts a context word:

[image omitted]
Use the dot product of the two vectors to calculate their similarity.
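Concretely, writing $v_c$ for the center-word vector and $u_o$ for the outside (context) word vector, the predicted probability is a softmax over these dot products (this is the standard word2vec form; the notation on the slide may differ slightly):

$$P(o \mid c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)}$$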
Now we need to compute the partial derivatives:
[image omitted]
The numerator is easy; the denominator takes more work.
Use the chain rule:

[image omitted]
Then move the derivative inside the sum:
[image omitted]
Use the chain rule again:
[image omitted]

Finally:
[image omitted]
Write it into a simple form:
[image omitted]
It means the gradient takes the “difference” between the expected context word (under the current model) and the actual context word that showed up.

This difference gives us the slope telling us in which direction we should walk to change the word vectors.
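In symbols, this is the standard “observed minus expected” form of the gradient of the log softmax with respect to the center-word vector:

$$\frac{\partial}{\partial v_{c}} \log P(o \mid c)=u_{o}-\sum_{x \in V} P(x \mid c)\, u_{x}$$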
Exercise one
Solution of Exercise one

Lecture Two

[image omitted]

More details of word embeddings

Tricks

  • SGD is faster, and it brings in some noise, which acts like regularization.
  • The objective function is non-convex, so it is important to choose a good starting point.
  • $\nabla_{\theta} J_{t}(\theta)$ is very sparse (only the words in the window receive gradient updates), so in practice we either use hashing or update only the relevant rows of the embedding matrices, or we use negative sampling to turn the problem into binary classification (see the sketch after this list).
    [image omitted]
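A minimal sketch of the “update only the relevant rows” trick mentioned above, assuming numpy embedding matrices `V` (center words) and `U` (outside words); the names and signature here are my own, not from the lecture code:

```python
import numpy as np

def sparse_sgd_step(V, U, center_idx, grad_center, grads_outside, lr=0.025):
    """Update only the embedding rows touched by this training example.

    V, U: |vocab| x d matrices (center / outside word vectors).
    center_idx: index of the center word; grad_center: its d-dim gradient.
    grads_outside: dict mapping outside-word index -> d-dim gradient.
    """
    V[center_idx] -= lr * grad_center      # one row of V
    for idx, g in grads_outside.items():   # only a handful of rows of U
        U[idx] -= lr * g
```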

The ideas of Skip-Gram (SG) and CBOW:
[image omitted]
[image omitted]

Negative Sampling: train binary logistic regressions for a true pair (the center word and a word in its context window) versus a couple of noise pairs (the center word paired with a random word).

[image omitted]
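The image is not reproduced, but the objective being described is presumably the standard negative-sampling loss for one (center word $c$, outside word $o$) pair with $K$ sampled noise words:

$$J_{\text{neg}}\left(v_{c}, o, U\right)=-\log \sigma\left(u_{o}^{T} v_{c}\right)-\sum_{k=1}^{K} \log \sigma\left(-u_{k}^{T} v_{c}\right), \qquad \sigma(x)=\frac{1}{1+e^{-x}}$$

where the $K$ noise words are drawn from the sampling distribution $P(w)$ described below.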
Here, the capital Z is used as a normalization term:

To get the probability distribution used for sampling words, you raise the count of each word in the vocabulary to the 3/4 power, sum these values over the whole vocabulary to get the total, and divide by that total, which yields a probability distribution.

(Note: in this class, Z denotes a normalization term that turns things into probabilities.)
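A minimal sketch of that sampling distribution (function and variable names are mine): raise each unigram count to the 3/4 power and normalize by Z.

```python
import numpy as np

def negative_sampling_probs(counts):
    """counts: unigram counts for every word in the vocabulary."""
    powered = np.asarray(counts, dtype=float) ** 0.75
    return powered / powered.sum()   # Z = powered.sum()

# e.g. negative_sampling_probs([1000, 100, 10]) gives a flatter distribution
# than the raw counts, so rare words get sampled a bit more often.
```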

GloVe

Symbol Definition

Reference
$X_{ij}$: for center word $i$, the number of times word $j$ appears in its context.
$X_{i}=\sum_{k} X_{ik}$: the total number of context appearances for center word $i$.
$P_{i,k}=\frac{X_{i,k}}{X_{i}}$: the frequency with which word $k$ appears in the context of word $i$.
$\text{ratio}_{i,j,k}=\frac{P_{i,k}}{P_{j,k}}$: for two different center words $i$ and $j$, the ratio of how often word $k$ appears in their contexts.

Now we want to get function g:

$$\frac{P_{i, k}}{P_{j, k}}=\text{ratio}_{i, j, k}=g\left(w_{i}, w_{j}, w_{k}\right)$$

The following derivation is rather terse.

  • Suppose $g$ contains the term $\left(w_{i}-w_{j}\right)$, because the function should ultimately capture the difference between $i$ and $j$.
  • $g$ is a scalar, so $g$ probably contains $\left(w_{i}-w_{j}\right)^{T} w_{k}$.
  • Let $g$ simply be the exp function: $g\left(w_{i}, w_{j}, w_{k}\right)=\exp \left(\left(w_{i}-w_{j}\right)^{T} w_{k}\right)$.

Finally:

$$\begin{aligned} \frac{P_{i, k}}{P_{j, k}} &=\exp \left(\left(w_{i}-w_{j}\right)^{T} w_{k}\right) \\ \frac{P_{i, k}}{P_{j, k}} &=\exp \left(w_{i}^{T} w_{k}-w_{j}^{T} w_{k}\right) \\ \frac{P_{i, k}}{P_{j, k}} &=\frac{\exp \left(w_{i}^{T} w_{k}\right)}{\exp \left(w_{j}^{T} w_{k}\right)} \end{aligned}$$

A sufficient condition is $P_{i, j}=\exp \left(w_{i}^{T} w_{j}\right)$. Taking logs gives $\log \left(X_{i, j}\right)-\log \left(X_{i}\right)=w_{i}^{T} w_{j}$; since $\log \left(X_{i}\right)$ does not depend on $j$, it is absorbed into per-word bias terms to keep the expression symmetric:

$$\begin{array}{c} P_{i, j}=\exp \left(w_{i}^{T} w_{j}\right) \\ \log \left(X_{i, j}\right)-\log \left(X_{i}\right)=w_{i}^{T} w_{j} \\ \log \left(X_{i, j}\right)=w_{i}^{T} w_{j}+b_{i}+b_{j} \end{array}$$

The loss function is:
$$J=\sum_{i, j}^{N}\left(w_{i}^{T} w_{j}+b_{i}+b_{j}-\log \left(X_{i, j}\right)\right)^{2}$$

Add a weighting term:

$$\begin{array}{c} J=\sum_{i, j}^{N} f\left(X_{i, j}\right)\left(v_{i}^{T} v_{j}+b_{i}+b_{j}-\log \left(X_{i, j}\right)\right)^{2} \\ f(x)=\left\{\begin{array}{ll} \left(x / x_{\max }\right)^{0.75}, & \text { if } x<x_{\max } \\ 1, & \text { if } x \geq x_{\max } \end{array}\right. \end{array}$$

Here $f(x)$ means that the more frequently a word pair co-occurs, the higher the weight it gets, with the weight capped at 1 once the count reaches $x_{\max}$.
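A minimal sketch of this weighted least-squares objective, written to mirror the formula above; `W`, `b`, `X`, and `x_max = 100` (the value used in the GloVe paper) are my own choices for illustration:

```python
import numpy as np

def glove_loss(W, b, X, x_max=100.0, alpha=0.75):
    """W: |V| x d word vectors, b: |V| biases, X: |V| x |V| co-occurrence counts."""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):                 # only co-occurring pairs contribute
        f = min((X[i, j] / x_max) ** alpha, 1.0)     # the weighting function f(X_ij)
        diff = W[i] @ W[j] + b[i] + b[j] - np.log(X[i, j])   # v_i^T v_j + b_i + b_j - log X_ij
        loss += f * diff ** 2
    return loss
```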

Lecture Three

[image omitted]
There is little new knowledge in this lecture.

Binary word window classification

[image omitted]
Use the window of words around the center word to classify the center word:

[image omitted]
[image omitted]
With this bigger vector, we get:

[image omitted]
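A minimal sketch of the idea (the names below are mine, and the slides' actual classifier may add a hidden layer): concatenate the vectors of all words in the window into one bigger input vector and score it with a logistic classifier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify_window(E, window_ids, w, b):
    """E: |V| x d embeddings, window_ids: indices of the words in the window,
    w: weight vector of length len(window_ids) * d, b: scalar bias."""
    x_window = np.concatenate([E[i] for i in window_ids])  # the "bigger vector"
    return sigmoid(w @ x_window + b)  # probability that the center word is in the class
```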

Lecture Four

[image omitted]
“In 2019, deep learning is still a kind of craft.”

I write the matrix derivation techniques here.
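A few standard matrix-calculus identities of the kind this lecture relies on (textbook facts, not reproduced from the linked post):

$$\frac{\partial (W x)}{\partial x}=W, \qquad \frac{\partial\left(a^{T} x\right)}{\partial x}=a, \qquad \frac{\partial\left(x^{T} A x\right)}{\partial x}=\left(A+A^{T}\right) x$$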

Appendix

Common Vocabulary

zilch — n. zero; something of no value; (also a surname)
Notation — n. symbol/notation; musical score; numeral system
Orthogonality — n. orthogonality
Determinant — n. determining factor; determinant (of a matrix); adj. decisive

Related Resources

A very basic SVD derivation
Matrix derivative techniques (矩阵求导术)
