[Course Notes] CS224N: Stanford Natural Language Processing

Author: Haor.L

Categories: Classic Machine Learning Models, Pytorch Deep Learning. Tags: machine learning, natural language processing, deep learning

Article link: https://blog.csdn.net/weixin_46233323/article/details/114091707

[Work in progress] CS224N: Stanford Natural Language Processing

Lecture one
- Plan of lecture one
- Distributed representation of word vector
Lecture Two
Lecture Three
- Binary word window classification
Lecture Four
Appendix
- Common word statistics
- Related resources

Course Homepage: http://web.stanford.edu/class/cs224n/

Lecture one

2021.2.25

Plan of lecture one

[Image not preserved]

Distributed representation of word vector

A word’s meaning is given by the words that frequently appear close by.

“You shall know a word by the company it keeps.” (J. R. Firth)
[Image not preserved]
Loss function:

Finally:

[Image not preserved]
The above is the training objective; the prediction function is:

[Image not preserved]
The dot product of two word vectors measures their similarity.
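As a minimal sketch of this prediction step, assume the standard skip-gram softmax $P(o \mid c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w} \exp \left(u_{w}^{T} v_{c}\right)}$. The formula was an image in the original notes, so this form is an assumption, and the names `U`, `v_c`, and `context_probs` are illustrative only.

```python
import numpy as np

def context_probs(U, v_c):
    """Return P(o | c) for every word o, given outside vectors U (V x d)
    and a center vector v_c (d,). Assumes the standard softmax form."""
    scores = U @ v_c                 # dot-product similarity with each outside vector
    scores = scores - scores.max()   # shift for numerical stability; softmax is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))   # toy vocabulary of 5 words, 3-dimensional vectors
v_c = rng.normal(size=3)
probs = context_probs(U, v_c)
```

The word whose outside vector has the largest dot product with `v_c` receives the highest probability, which is what "similarity via dot product" means here.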
Now we need to compute the partial derivative:

The numerator is easy; look at the denominator:
Use the chain rule:

[Image not preserved]
Then move the derivative inside the sum:

Use the chain rule again:

Finally:
[Image not preserved]
Write it in a simpler form:
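A reconstruction of these steps, assuming the standard skip-gram softmax with center vector $v_c$, outside vectors $u_w$, and vocabulary size $V$ (the original equations were images, so the notation here is an assumption):

$\begin{aligned} \frac{\partial}{\partial v_{c}} \log \frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w=1}^{V} \exp \left(u_{w}^{T} v_{c}\right)} &=\frac{\partial}{\partial v_{c}} u_{o}^{T} v_{c}-\frac{\partial}{\partial v_{c}} \log \sum_{w=1}^{V} \exp \left(u_{w}^{T} v_{c}\right) \\ &=u_{o}-\frac{1}{\sum_{w=1}^{V} \exp \left(u_{w}^{T} v_{c}\right)} \sum_{x=1}^{V} \exp \left(u_{x}^{T} v_{c}\right) u_{x} \\ &=u_{o}-\sum_{x=1}^{V} P(x \mid c) u_{x} \end{aligned}$

This is the gradient of the log-probability; the gradient of the loss (the negative log-probability) has the opposite sign.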

This is the “difference” between the actual context word that showed up and the context word the model expected.

This difference gives us the slope, i.e. the direction in which we should move the word vectors.
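The "observed minus expected" form can be checked numerically. The sketch below assumes the standard softmax over dot products (not shown in the surviving text) and compares the analytic gradient with central finite differences; all names are illustrative.

```python
import numpy as np

def log_prob(U, v_c, o):
    """log P(o | c) under the softmax over dot products."""
    scores = U @ v_c
    return scores[o] - np.log(np.exp(scores).sum())

def grad_center(U, v_c, o):
    """Analytic gradient of log P(o | c) w.r.t. v_c: observed minus expected outside vector."""
    scores = U @ v_c
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return U[o] - probs @ U

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 4))   # 6 outside vectors of dimension 4
v_c = rng.normal(size=4)
o = 2                         # index of the observed context word

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric = np.array([
    (log_prob(U, v_c + eps * e, o) - log_prob(U, v_c - eps * e, o)) / (2 * eps)
    for e in np.eye(4)
])
analytic = grad_center(U, v_c, o)
max_error = np.abs(numeric - analytic).max()
```

If the two gradients agree up to finite-difference error, the simplified form is consistent with the softmax it was derived from.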
Exercise one
Solution of Exercise one

Lecture Two

[Image not preserved]

More details on word embeddings

Tricks

SGD is faster, and the noise it introduces acts like a form of regularization.
The objective function is non-convex, so it is very important to choose a good starting point.
$\nabla_{\theta} J_{t}(\theta)$ is very sparse, because only the words in the current window are updated. So we sometimes keep the gradients in a hash map, or update only the relevant rows of the embedding matrix. Alternatively, we use negative sampling to turn the problem into binary classification.
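A minimal sketch of the "update only the relevant rows" trick, assuming the embeddings live in a NumPy matrix `E` with one row per word. The window ids, gradients, and learning rate are made-up values for illustration.

```python
import numpy as np

V, d = 10, 4                      # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))       # embedding matrix, one row per word
E_before = E.copy()

window_ids = np.array([3, 7, 3])        # word ids in the current window (word 3 appears twice)
row_grads = rng.normal(size=(3, d))     # hypothetical gradient for each occurrence
lr = 0.1

# np.subtract.at accumulates over repeated indices, unlike E[window_ids] -= ...,
# which would apply only one of the two updates for word 3.
np.subtract.at(E, window_ids, lr * row_grads)

changed_rows = np.flatnonzero(np.any(E != E_before, axis=1))
```

Only the rows for the words in the window are touched, so the cost of a step scales with the window size rather than with the vocabulary size.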

The ideas behind skip-gram (SG) and CBOW are:
[Image not preserved]

Negative sampling: train binary logistic regressions for a true pair (the center word and a word in its context window) versus several noise pairs (the center word paired with a random word).
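A minimal sketch of this binary-classification objective, assuming the commonly used form $J=-\log \sigma\left(u_{o}^{T} v_{c}\right)-\sum_{k} \log \sigma\left(-u_{k}^{T} v_{c}\right)$. The exact formula is not in the surviving text, so this form and the names below are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    """Binary logistic loss: one true (center, context) pair vs. K noise pairs."""
    true_term = -np.log(sigmoid(u_o @ v_c))               # push the true pair's score up
    noise_term = -np.log(sigmoid(-(U_neg @ v_c))).sum()   # push the noise pairs' scores down
    return true_term + noise_term

v_c = np.array([1.0, 0.0])
u_o = np.array([2.0, 0.0])              # aligned with the center word
U_neg = np.array([[-2.0, 0.0],          # noise vectors pointing away from the center word
                  [-1.0, 1.0]])
good = neg_sampling_loss(v_c, u_o, U_neg)
bad = neg_sampling_loss(v_c, -u_o, -U_neg)   # flipped signs: a poor fit
```

The loss is small when the true pair has a high dot product and the noise pairs have low ones, and it grows when the signs are flipped.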

[Image not preserved]
Here, the capital $Z$ is used as the normalization term:

$P(w)=\frac{U(w)^{3 / 4}}{Z}, \quad Z=\sum_{w^{\prime}} U\left(w^{\prime}\right)^{3 / 4}$, where $U(w)$ is the count of word $w$.

In other words, to get the probability distribution over words, raise the count of every word in the vocabulary to the three-quarters power, sum these numbers over the vocabulary to get $Z$, and divide each number by $Z$. The result is a probability distribution.

(Note: in this class, $Z$ denotes the normalization term that turns things into probabilities.)
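The normalization described above can be sketched as follows. The counts are made-up numbers for a three-word vocabulary, and the variable names are illustrative.

```python
import numpy as np

counts = np.array([100.0, 10.0, 1.0])   # hypothetical word counts for a 3-word vocabulary
powered = counts ** 0.75                 # three-quarters power of each count
Z = powered.sum()                        # normalization term
noise_dist = powered / Z                 # sums to 1, so it is a probability distribution

unigram = counts / counts.sum()          # plain unigram distribution, for comparison
rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=noise_dist)   # draw 5 noise word ids
```

Compared with the plain unigram distribution, the 3/4 power gives rare words a somewhat larger share and frequent words a somewhat smaller one.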

GloVe

Symbol Definition

Reference
$X_{i,j}$: the number of times word $j$ appears in the context of center word $i$.
$X_i=\sum_{k} X_{i,k}$: the total number of times any word appears in the context of word $i$.
$P_{i,k}=\frac{X_{i,k}}{X_{i}}$: the frequency with which word $k$ appears in the context of word $i$.
$\text { ratio }_{i, j, k}=\frac{P_{i, k}}{P_{j, k}}$: the ratio of the frequencies with which word $k$ appears in the contexts of two different center words $i$ and $j$.
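A minimal sketch of these quantities on a toy corpus. The corpus, the symmetric window size, and the helper `ratio` are assumptions made for illustration.

```python
import numpy as np

corpus = "ice is solid and steam is gas and ice is cold".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
window = 2   # hypothetical symmetric context window size

X = np.zeros((len(vocab), len(vocab)))
for pos, center in enumerate(corpus):
    lo, hi = max(0, pos - window), min(len(corpus), pos + window + 1)
    for ctx_pos in range(lo, hi):
        if ctx_pos != pos:
            X[idx[center], idx[corpus[ctx_pos]]] += 1   # X_{i,j}

X_i = X.sum(axis=1, keepdims=True)   # X_i = sum_k X_{i,k}
P = X / X_i                           # P_{i,k} = X_{i,k} / X_i

def ratio(i, j, k):
    """ratio_{i,j,k} = P_{i,k} / P_{j,k}; undefined (division by zero) when P_{j,k} = 0."""
    return P[idx[i], idx[k]] / P[idx[j], idx[k]]

r = ratio("ice", "steam", "is")
```

Each row of `P` sums to 1, and with a symmetric window the count matrix `X` is symmetric.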

Now we want to find a function $g$ such that:

$\frac{P_{i, k}}{P_{j, k}}=\text { ratio }_{i, j, k}=g\left(w_{i}, w_{j}, w_{k}\right)$

The following derivation is very terse.

Suppose $g$ depends on ( $w_{i}-w_{j}$ ), because the function should ultimately capture the difference between $i$ and $j$.
The ratio is a scalar, so $g$ should involve the inner product ( $\left(w_{i}-w_{j}\right)^{T} w_{k}$ ).
Let $g$ simply be the exponential function: $g\left(w_{i}, w_{j}, w_{k}\right)=\exp \left(\left(w_{i}-w_{j}\right)^{T} w_{k}\right)$

Finally:

$\begin{aligned} \frac{P_{i, k}}{P_{j, k}} &=\exp \left(\left(w_{i}-w_{j}\right)^{T} w_{k}\right) \\ \frac{P_{i, k}}{P_{j, k}} &=\exp \left(w_{i}^{T} w_{k}-w_{j}^{T} w_{k}\right) \\ \frac{P_{i, k}}{P_{j, k}} &=\frac{\exp \left(w_{i}^{T} w_{k}\right)}{\exp \left(w_{j}^{T} w_{k}\right)} \end{aligned}$

$\begin{array}{c} P_{i, j}=\exp \left(w_{i}^{T} w_{j}\right) \\ \log \left(X_{i, j}\right)-\log \left(X_{i}\right)=w_{i}^{T} w_{j} \\ \log \left(X_{i, j}\right)=w_{i}^{T} w_{j}+b_{i}+b_{j} \end{array}$

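Here $\log \left(X_{i}\right)$ is absorbed into the bias $b_{i}$, and $b_{j}$ is added so that the expression stays symmetric in $i$ and $j$. The least-squares loss function is: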
$J=\sum_{i, j}^{N}\left(w_{i}^{T} w_{j}+b_{i}+b_{j}-\log \left(X_{i, j}\right)\right)^{2}$

Add the weighting term:

$\begin{array}{c} J=\sum_{i, j}^{N} f\left(X_{i, j}\right)\left(w_{i}^{T} w_{j}+b_{i}+b_{j}-\log \left(X_{i, j}\right)\right)^{2} \\ f(x)=\left\{\begin{array}{ll} \left(x / x_{\max }\right)^{0.75}, & \text { if } x<x_{\max } \\ 1, & \text { if } x \geq x_{\max } \end{array}\right. \end{array}$
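A minimal sketch of this weighted loss, following the notation of these notes (one set of vectors $w$ and biases $b$). The value `x_max=100.0`, the restriction to pairs with $X_{i,j}>0$, and the toy inputs are assumptions not stated in the notes.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """f(x): down-weights rare co-occurrences and caps at 1 for x >= x_max."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, b, X):
    """Weighted least-squares loss over pairs with X_{i,j} > 0 (log 0 is undefined)."""
    i, j = np.nonzero(X)
    pred = np.sum(W[i] * W[j], axis=1) + b[i] + b[j]   # w_i^T w_j + b_i + b_j
    err = pred - np.log(X[i, j])
    return np.sum(weight(X[i, j]) * err ** 2)

X = np.array([[0.0, 4.0], [4.0, 200.0]])   # toy co-occurrence counts
W = np.zeros((2, 3))                       # all-zero vectors give a hand-checkable case
b = np.zeros(2)
loss = glove_loss(W, b, X)
```

With all parameters at zero, each pair contributes $f\left(X_{i,j}\right) \log \left(X_{i,j}\right)^{2}$, so the pair with count 200 gets full weight while the pairs with count 4 are strongly down-weighted.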