Lecture 1
Introduction and Word Vectors
I mainly recorded the word-vector part of the lecture.
- Required tools: pure Python, plus PyTorch or TensorFlow.
Definition: meaning (from the Webster dictionary)
- the idea that is represented by a word, phrase, etc.
- the idea that a person wants to express by using words, signs, etc.
- the idea that is expressed in a work of writing, art, etc.
The commonest linguistic way of thinking of meaning:
signifier (symbol) <=> signified (idea or thing)
=> denotational semantics
Compared with traditional NLP:
In traditional NLP, we regard words as discrete symbols: "hotel", "motel" (a localist representation).
Words can be represented by one-hot vectors:
For example:
motel = [0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]
hotel = [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0]
vector dimension = number of words in the vocabulary (e.g., 500,000)
But how do you capture the relationships between the meanings of words?
Example: in web search, if a user searches for "Seattle motel", we would also like to match documents containing "Seattle hotel", because the two words mean almost the same thing.
But the two words' one-hot vectors are completely different.
motel = [0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]
hotel = [0,0,0,0,0,0,0,1,0,0,0,0,0,0,0]
They have no similarity relationship between them; in mathematical terms, these two vectors are orthogonal.
There is no natural notion of similarity for one-hot vectors.
And a hand-built word-similarity table would be far too huge and incomplete.
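To make the orthogonality concrete, here is a minimal Python sketch (the toy vocabulary and indices are made up for illustration) showing that the dot product of any two distinct one-hot vectors is zero:

```python
import numpy as np

# A toy vocabulary; a real one would have ~500,000 entries.
vocab = ["seattle", "hotel", "motel", "airport"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word: all zeros except a single 1."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

hotel = one_hot("hotel")
motel = one_hot("motel")

# The dot product of two different one-hot vectors is always 0,
# i.e. they are orthogonal and encode no notion of similarity.
print(hotel @ motel)  # 0.0
print(hotel @ hotel)  # 1.0 (a word is only "similar" to itself)
```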
Solution: learn to encode similarity in the vectors themselves.
Introduce a new definition: Distributional Semantics: a word's meaning is given by the words that frequently appear close by.
Word vectors (sometimes called word embeddings or word representations) are a distributed representation.
A word vector is a smallish dense vector in which all of the numbers are non-zero.
In the course video there is an example for the word "banking", shown as a nine-dimensional vector. In practice, people always use a larger dimensionality: 50 is roughly the minimum, 300 is a typical size for a laptop, and if you really want to maximize performance, 1000 to 3000 dimensions may be better.
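As a rough illustration of what such a dense representation looks like in code, here is a small PyTorch sketch (the vocabulary and the 300-dimensional size are chosen for illustration; the vectors are randomly initialized, not trained):

```python
import torch
import torch.nn as nn

vocab = ["the", "banking", "crises", "hotel", "motel"]
word_to_index = {w: i for i, w in enumerate(vocab)}

# An embedding table: one dense 300-dimensional vector per word in the vocabulary.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)

banking_vec = embedding(torch.tensor(word_to_index["banking"]))
print(banking_vec.shape)  # torch.Size([300]); essentially every entry is non-zero
```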
Word2vec --- a framework for learning word vectors.
Main ideas:
- We have a large corpus of text (a "body" of text).
- Every word in a fixed vocabulary is represented by a vector.
- Go through each position t in the text, which has a centre word "c" and context words "o" ("outside").
- Use the similarity of the word vectors for "c" and "o" to calculate the probability of "o" given "c" (or vice versa); see the sketch after this list.
- Keep adjusting the word vectors to maximize this probability.
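The probability in the fourth bullet is usually computed as a softmax over dot products of the word vectors. Below is a minimal sketch of that calculation with randomly initialized vectors and a made-up toy vocabulary; it only illustrates the formula, not the full training loop:

```python
import numpy as np

np.random.seed(0)
vocab = ["problems", "turning", "into", "banking", "crises", "as"]
dim = 50
idx = {w: i for i, w in enumerate(vocab)}

# Word2vec keeps two vectors per word: one for when it is the centre word (v)
# and one for when it is a context ("outside") word (u).
V = np.random.randn(len(vocab), dim) * 0.01  # centre-word vectors
U = np.random.randn(len(vocab), dim) * 0.01  # context-word vectors

def p_o_given_c(o, c):
    """P(o | c) = exp(u_o . v_c) / sum over the vocabulary of exp(u_w . v_c)."""
    scores = U @ V[idx[c]]                      # dot product of v_c with every u_w
    exp_scores = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp_scores[idx[o]] / exp_scores.sum()

print(p_o_given_c("banking", "into"))
```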
Then I attach some slides here, which I think are hard to understand if you have no background in machine learning or deep learning. (Having just finished probability theory, I feel this is very similar to maximum likelihood estimation.)
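For reference, the formulas on those slides are essentially the likelihood of the context words and the corresponding average negative log-likelihood objective; as I understand them (with text length T and window size m), they are roughly:

$$
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta),
\qquad
J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta).
$$

Maximizing the likelihood is the same as minimizing J(θ), which is why this looks so much like maximum likelihood estimation.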
This is my preliminary understanding of Lecture 1. If anything is incorrect, I will keep refining and correcting it later.
Lecture 2
How to Represent Word Meaning
A word's meaning is the concept or thing the word refers to, so how can we obtain a usable representation of word meaning in a computer? For English, the usual approach is to use WordNet, which organizes words into a network based on synonym relations and word hierarchies. The well-known NLP library NLTK includes WordNet. Below are two examples of using WordNet:
The example on the left retrieves the synonyms of each sense of the word "good"; the example on the right retrieves the hypernyms of the word "panda".
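Since the slide images are not reproduced here, the following is a rough sketch of what those two examples look like with NLTK's WordNet interface (it assumes the WordNet data has already been downloaded):

```python
from nltk.corpus import wordnet as wn
# import nltk; nltk.download('wordnet')  # run once to fetch the WordNet data

# Left example: the synonym set (with part of speech) for each sense of "good".
for synset in wn.synsets("good"):
    print(synset.pos(), [lemma.name() for lemma in synset.lemmas()])

# Right example: the hypernyms (more general concepts) of "panda".
panda = wn.synset("panda.n.01")
hypernym = lambda s: s.hypernyms()
print(list(panda.closure(hypernym)))
```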
Representing words as discrete one-hot vectors instead (as in Lecture 1) is much simpler than WordNet, but it also has many problems, for example:
- The curse of dimensionality: such sparse vectors are extremely expensive to store and to train with.
- Every pair of vectors is orthogonal and all the Euclidean distances are equal, so the similarity between words cannot be measured directly.
So people want to construct dense vectors that represent words by their properties (such as the contexts they appear in), so that the vectors can capture word meaning.
Representing Word Meaning Through Context
The resulting vectors look like the figure below. This way of representing words is called word vectors, also known as word embeddings or word representations. Such word vectors are relatively easy to obtain, and word similarity can be computed with methods such as cosine similarity.
An example of word vectors
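A minimal sketch of the cosine-similarity calculation mentioned above, using two made-up low-dimensional dense vectors just to show the arithmetic:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up low-dimensional "word vectors", only to illustrate the calculation.
hotel = np.array([0.2, -0.4, 0.7, 0.1])
motel = np.array([0.25, -0.3, 0.6, 0.05])
print(cosine_similarity(hotel, motel))  # close to 1, i.e. similar words
```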
Word2Vec
Word2Vec [1] is a method that uses a neural network to learn word vectors from contextual relationships. First we introduce the two most basic Word2Vec models: Skip-gram and CBOW (Continuous Bag-Of-Words).
Note: to describe Word2Vec in more detail, this section also draws on reference [2], so the notation is not exactly the same as in the course.
Skip-gram
The idea behind Skip-gram is simple: first pick a center word, then use this word to predict the context words within a window of a certain size. For example:
When we pick "into" as the center word with a window size of 2, the context words are "problems", "turning", "banking", and "crises". Note that in Word2Vec the distance between the center word and a context word is considered unimportant; all context words are treated equally. Our goal is to use the word "into" to predict the other four words, and whenever a prediction does not match the actual word, the center word's vector is updated. We then move the window one word to the right and get:
Now the center word becomes "banking" and the context words become "turning", "into", "crises", and "as". By continually updating in this way, we eventually obtain our word vectors.
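A small sketch of how the (center word, context word) pairs in this example are generated as the window slides over the sentence; the sentence fragment and window size follow the example above:

```python
# Generate (center, context) training pairs for Skip-gram with a given window size.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

sentence = "problems turning into banking crises as".split()
for center, context in skipgram_pairs(sentence, window=2):
    print(center, "->", context)
# e.g. "into" predicts problems, turning, banking, crises;
#      "banking" predicts turning, into, crises, as.
```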
Lecture 2: to be continued.