Lecture 1
Introduction and Word Vectors
I mainly recorded the word-vector part of the lecture.
- Required tools: pure Python, plus PyTorch or TensorFlow.
Definition: meaning (from the Webster dictionary)
- the idea that is represented by a word, phrase, etc.
- the idea that a person wants to express by using words, signs, etc.
- the idea that is expressed in a work of writing, art, etc.
The commonest linguistic way of thinking of meaning:
signifier (symbol) <=> signified (idea or thing)
=> denotational semantics
Compared with traditional NLP:
In traditional NLP, we regard words as discrete symbols: "hotel", "motel" (a localist representation).
Words can be represented by one-hot vectors:
For example:
motel = [0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]
hotel = [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0]
vector dimension = number of words in the vocabulary (e.g., 500,000)
But how do you capture the relationships between the meanings of words?
Example: in web search, if a user searches for "Seattle motel", we would also like to match documents containing "Seattle hotel", because the two words mean almost the same thing.
But the two words' one-hot vectors are completely different.
motel = [0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]
hotel = [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0]
They have no similarity relationship between them; in mathematical terms, these two vectors are orthogonal.
There is no natural notion of similarity for one-hot vectors.
And a hand-built word-similarity table would be far too huge and incomplete.
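To make the orthogonality concrete, here is a minimal Python sketch (the toy vocabulary and indices are made up for illustration) showing that the dot product of any two distinct one-hot vectors is zero:

```python
import numpy as np

# A toy vocabulary; a real one would have ~500,000 entries.
vocab = ["seattle", "hotel", "motel", "airport"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word: all zeros except a single 1."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

hotel = one_hot("hotel")
motel = one_hot("motel")

# The dot product of two different one-hot vectors is always 0,
# i.e. they are orthogonal and encode no notion of similarity.
print(hotel @ motel)  # 0.0
print(hotel @ hotel)  # 1.0 (a word is only "similar" to itself)
```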
Solution: learn to encode similarity in the vectors themselves.
Introduce a new definition: Distributional Semantics: a word's meaning is given by the words that frequently appear close by.
Word vectors (sometimes called word embeddings or word representations) are a distributed representation.
A word vector is a smallish dense vector in which all of the numbers are non-zero.
In the course video there is an example for the word "banking", shown as a nine-dimensional vector. In practice, people always use a larger dimensionality: 50 is roughly the minimum, 300 is a typical size for a laptop, and if you really want to maximize performance, 1000 to 3000 dimensions may be better.
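As a rough illustration of what such a dense representation looks like in code, here is a small PyTorch sketch (the vocabulary and the 300-dimensional size are chosen for illustration; the vectors are randomly initialized, not trained):

```python
import torch
import torch.nn as nn

vocab = ["the", "banking", "crises", "hotel", "motel"]
word_to_index = {w: i for i, w in enumerate(vocab)}

# An embedding table: one dense 300-dimensional vector per word in the vocabulary.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)

banking_vec = embedding(torch.tensor(word_to_index["banking"]))
print(banking_vec.shape)  # torch.Size([300]); essentially every entry is non-zero
```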
Word2vec --- a framework for learning word vectors.
Main ideas:
- We have a large corpus of text (a "body" of text).
- Every word in a fixed vocabulary is represented by a vector.
- Go through each position t in the text, which has a centre word "c" and context words "o" ("outside").
- Use the similarity of the word vectors for "c" and "o" to calculate the probability of "o" given "c" (or vice versa); see the sketch after this list.
- Keep adjusting the word vectors to maximize this probability.
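The probability in the fourth bullet is usually computed as a softmax over dot products of the word vectors. Below is a minimal sketch of that calculation with randomly initialized vectors and a made-up toy vocabulary; it only illustrates the formula, not the full training loop:

```python
import numpy as np

np.random.seed(0)
vocab = ["problems", "turning", "into", "banking", "crises", "as"]
dim = 50
idx = {w: i for i, w in enumerate(vocab)}

# Word2vec keeps two vectors per word: one for when it is the centre word (v)
# and one for when it is a context ("outside") word (u).
V = np.random.randn(len(vocab), dim) * 0.01  # centre-word vectors
U = np.random.randn(len(vocab), dim) * 0.01  # context-word vectors

def p_o_given_c(o, c):
    """P(o | c) = exp(u_o . v_c) / sum over the vocabulary of exp(u_w . v_c)."""
    scores = U @ V[idx[c]]                      # dot product of v_c with every u_w
    exp_scores = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp_scores[idx[o]] / exp_scores.sum()

print(p_o_given_c("banking", "into"))
```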
Then I attach some slides here, which I think are hard to understand if you have no background in machine learning or deep learning. (Having just finished probability theory, I feel this is very similar to maximum likelihood estimation.)
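For reference, the formulas on those slides are essentially the likelihood of the context words and the corresponding average negative log-likelihood objective; as I understand them (with text length T and window size m), they are roughly:

$$
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta),
\qquad
J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta).
$$

Maximizing the likelihood is the same as minimizing J(θ), which is why this looks so much like maximum likelihood estimation.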
This is my preliminary understanding of Lecture 1. If anything is incorrect, I will keep refining and correcting it later.
Lecture 2
How to Represent Word Meaning
A word's meaning is the concept or thing the word refers to, so how can we obtain a usable representation of word meaning in a computer? For English, the usual approach is to use WordNet, which organizes words into a network based on synonym relations and word hierarchies. The well-known NLP library NLTK includes WordNet. Below are two examples of using WordNet:
The example on the left retrieves the synonyms of each sense of the word "good"; the example on the right retrieves the hypernyms of the word "panda".
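Since the slide images are not reproduced here, the following is a rough sketch of what those two examples look like with NLTK's WordNet interface (it assumes the WordNet data has already been downloaded):

```python
from nltk.corpus import wordnet as wn
# import nltk; nltk.download('wordnet')  # run once to fetch the WordNet data

# Left example: the synonym set (with part of speech) for each sense of "good".
for synset in wn.synsets("good"):
    print(synset.pos(), [lemma.name() for lemma in synset.lemmas()])

# Right example: the hypernyms (more general concepts) of "panda".
panda = wn.synset("panda.n.01")
hypernym = lambda s: s.hypernyms()
print(list(panda.closure(hypernym)))
```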
Representing words as discrete one-hot vectors instead (as in Lecture 1) is much simpler than WordNet, but it also has many problems, for example:
- The curse of dimensionality: such sparse vectors are extremely expensive to store and to train with.
- Every pair of vectors is orthogonal and all the Euclidean distances are equal, so the similarity between words cannot be measured directly.
So people want to construct dense vectors that represent words by their properties (such as the contexts they appear in), so that the vectors can capture word meaning.
Representing Word Meaning Through Context
The resulting vectors look like the figure below. This way of representing words is called word vectors, also known as word embeddings or word representations. Such word vectors are relatively easy to obtain, and word similarity can be computed with methods such as cosine similarity.
An example of word vectors
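A minimal sketch of the cosine-similarity calculation mentioned above, using two made-up low-dimensional dense vectors just to show the arithmetic:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up low-dimensional "word vectors", only to illustrate the calculation.
hotel = np.array([0.2, -0.4, 0.7, 0.1])
motel = np.array([0.25, -0.3, 0.6, 0.05])
print(cosine_similarity(hotel, motel))  # close to 1, i.e. similar words
```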
Word2Vec
Word2Vec [1] is a method that uses a neural network to learn word vectors from contextual relationships. First we introduce the two most basic Word2Vec models: Skip-gram and CBOW (Continuous Bag-Of-Words).
Note: to describe Word2Vec in more detail, this section also draws on reference [2], so the notation is not exactly the same as in the course.
Skip-gram
The idea behind Skip-gram is simple: first pick a center word, then use this word to predict the context words within a window of a certain size. For example:
When we pick "into" as the center word with a window size of 2, the context words are "problems", "turning", "banking", and "crises". Note that in Word2Vec the distance between the center word and a context word is considered unimportant; all context words are treated equally. Our goal is to use the word "into" to predict the other four words, and whenever a prediction does not match the actual word, the center word's vector is updated. We then move the window one word to the right and get:
Now the center word becomes "banking" and the context words become "turning", "into", "crises", and "as". By continually updating in this way, we eventually obtain our word vectors.
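A small sketch of how the (center word, context word) pairs in this example are generated as the window slides over the sentence; the sentence fragment and window size follow the example above:

```python
# Generate (center, context) training pairs for Skip-gram with a given window size.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

sentence = "problems turning into banking crises as".split()
for center, context in skipgram_pairs(sentence, window=2):
    print(center, "->", context)
# e.g. "into" predicts problems, turning, banking, crises;
#      "banking" predicts turning, into, crises, as.
```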
Lecture 2: to be continued.