Overview
In a nutshell: it is dimensionality reduction!
We train a simple neural network with a single hidden layer; what we actually want are the weights of that hidden layer, because those weights are the word vectors themselves.
Tricks:
- Subsampling: reduce the number of words used for training.
- Negative sampling: each training sample updates only a small fraction of the model weights, which speeds up training.
Introduction
https://machinelearningmastery.com/what-are-word-embeddings/
https://www.zhihu.com/question/32275069
Word embedding is a collective term for the language modeling and representation learning techniques in natural language processing (NLP). Conceptually, it means embedding a high-dimensional space, whose dimensionality equals the vocabulary size, into a continuous vector space of much lower dimensionality, so that every word or phrase is mapped to a real-valued vector.
One of the benefits of using dense and low-dimensional vectors is computational: the majority of neural network toolkits do not play well with very high-dimensional, sparse vectors. … The main benefit of the dense representations is generalization power: if we believe some features may provide similar clues, it is worthwhile to provide a representation that is able to capture these similarities.
Algorithms
1. Embedding Layer
It requires that the documents are clean and that each word is encoded as one-hot. The size of the vector space is specified as part of the model, such as 50, 100, or 300 dimensions.
This approach of learning an embedding layer requires a lot of training data and can be slow, but will learn an embedding both targeted to the specific text data and the NLP task.
Every word needs its own one-hot vector, which is computationally expensive, and relationships between words are not represented.
As the following picture shows, the word 'girl' does not help the training of any other word in the first layer.
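As a minimal sketch (assuming PyTorch; the vocabulary and layer sizes below are made up for illustration), an embedding layer is just a trainable lookup table that replaces the explicit one-hot multiplication:

```python
import torch
import torch.nn as nn

# Toy vocabulary; the integer indices stand in for one-hot vectors.
vocab = {"the": 0, "quick": 1, "brown": 2, "fox": 3, "girl": 4}

# nn.Embedding(vocab_size, dim) is equivalent to multiplying a one-hot vector
# by a (vocab_size x dim) weight matrix that is learned during training.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=50)

word_ids = torch.tensor([vocab["girl"], vocab["fox"]])
vectors = embedding(word_ids)
print(vectors.shape)   # torch.Size([2, 50])
```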
2. Word2Vec
paper: Linguistic Regularities in Continuous Space Word Representations, 2013.
It is good at capturing syntactic and semantic regularities in language.
Two different learning models were introduced that can be used as part of the Word2Vec approach to learn word embeddings:
- Continuous Bag-of-Words (CBOW) model: learns the embedding for a word by predicting it from the surrounding context words.
- Continuous Skip-Gram model: learns the embedding by predicting the surrounding words from the given word (both are sketched just below).
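A tiny sketch of the difference (my own illustration, not code from the papers): with window size 1, CBOW produces (context, centre) training examples, while skip-gram produces (centre, context) pairs.

```python
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 1

cbow_examples, skipgram_pairs = [], []
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_examples.append((context, center))              # predict centre from context
    skipgram_pairs.extend((center, c) for c in context)  # predict each context word from centre

print(cbow_examples[1])    # (['the', 'brown'], 'quick')
print(skipgram_pairs[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```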
3. GloVe
paper: GloVe: Global Vectors for Word Representation, 2014.
GloVe, the Global Vectors for Word Representation algorithm, is an extension of word2vec that can learn word vectors more efficiently.
It combines global corpus statistics (of the kind used by, e.g., Latent Semantic Analysis (LSA)) with the local context-window learning of word2vec, which makes it more effective.
Classical vector space model representations of words were developed using matrix factorization techniques such as Latent Semantic Analysis (LSA) that do a good job of using global text statistics but are not as good as the learned methods like word2vec at capturing meaning and demonstrating it on tasks like calculating analogies (e.g. the King and Queen example above).
GloVe is an approach to marry both the global statistics of matrix factorization techniques like LSA with the local context-based learning in word2vec.
Rather than using a window to define local context, GloVe constructs an explicit word-context or word co-occurrence matrix using statistics across the whole text corpus. The result is a learning model that may result in generally better word embeddings.
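A rough sketch of the global word co-occurrence counts GloVe starts from (toy corpus, symmetric window; the actual GloVe model then fits vectors to weighted log counts, which is not shown here):

```python
from collections import defaultdict

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
window = 2

cooc = defaultdict(float)   # (word, context word) -> co-occurrence count
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                cooc[(word, sentence[j])] += 1.0

print(cooc[("sat", "on")])  # how often 'on' occurs within 2 words of 'sat' in the whole corpus
```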
How to Use Word Embeddings
1. Learn an Embedding
You may choose to learn a word embedding for your problem.
This will require a large amount of text data to ensure that useful embeddings are learned, such as millions or billions of words.
You have two main options when training your word embedding:
- Learn it Standalone, where a model is trained to learn the embedding, which is saved and used as a part of another model for your task later. This is a good approach if you would like to use the same embedding in multiple models.
- Learn Jointly, where the embedding is learned as part of a large task-specific model. This is a good approach if you only intend to use the embedding on one task.
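Either way, training your own embedding in Python might look like the following gensim sketch (a minimal example; the parameter values are illustrative, and the argument is `vector_size` in gensim 4.x but `size` in older releases):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; in practice this would be millions of sentences.
sentences = [["the", "quick", "brown", "fox"],
             ["the", "lazy", "dog", "sleeps"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # sg=1 selects skip-gram
vector = model.wv["fox"]   # the learned 100-dimensional vector for "fox"
print(vector.shape)        # (100,)
```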
2. Reuse an Embedding
It is common for researchers to use pre-trained word embeddings. For example, both word2vec and GloVe word embeddings are available for free download.
These can be used on your project instead of training your own embeddings from scratch.
You can either use them as-is or fine-tune them on your own data.
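A sketch of the reuse path, assuming you have downloaded a GloVe text file such as `glove.6B.100d.txt` (each line is a word followed by its vector components; the path is hypothetical):

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# embeddings = load_glove("glove.6B.100d.txt")  # hypothetical local file
# print(embeddings["king"].shape)               # (100,)
```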
Articles
- Word embedding on Wikipedia
- Word2vec on Wikipedia
- GloVe on Wikipedia
- An overview of word embeddings and their connection to distributional semantic models, 2016.
- Deep Learning, NLP, and Representations, 2014.
Papers
- Distributional structure, 1956.
- A Neural Probabilistic Language Model, 2003.
- A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning, 2008.
- Continuous space language models, 2007.
- Efficient Estimation of Word Representations in Vector Space, 2013
- Distributed Representations of Words and Phrases and their Compositionality, 2013.
- GloVe: Global Vectors for Word Representation, 2014.
Word2Vec Algorithm
In a nutshell: it is dimensionality reduction!
We train a simple neural network with a single hidden layer; what we actually want are the weights of that hidden layer, because those weights are the word vectors themselves.
This trick shows up in many other places as well.
Another place you may have seen this trick is in unsupervised feature learning, where you train an auto-encoder to compress an input vector in the hidden layer, and decompress it back to the original in the output layer. After training it, you strip off the output layer (the decompression step) and just use the hidden layer–it’s a trick for learning good image features without having labeled training data.
1. Fake Task
Fake task: given a word, output the probability of every word in the vocabulary appearing near it.
We train a neural network by feeding it word pairs (words that fall inside the same context window are combined into pairs).
We’re going to train the neural network to do the following. Given a specific word in the middle of a sentence (the input word), look at the words nearby and pick one at random. The network is going to tell us the probability for every word in our vocabulary of being the “nearby word” that we chose.
When I say “nearby”, there is actually a “window size” parameter to the algorithm. A typical window size might be 5, meaning 5 words behind and 5 words ahead (10 in total).
We’ll train the neural network to do this by feeding it word pairs found in our training documents. The below example shows some of the training samples (word pairs) we would take from the sentence “The quick brown fox jumps over the lazy dog.” I’ve used a small window size of 2 just for the example. The word highlighted in blue is the input word.
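A small sketch of how these training pairs can be generated for that sentence (window size 2, matching the example):

```python
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

training_pairs = []
for i, input_word in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            training_pairs.append((input_word, sentence[j]))

print([pair for pair in training_pairs if pair[0] == "fox"])
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```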
2. Model Details
Suppose the vocabulary has 10,000 words and the hidden layer has 300 neurons.
Each word is first represented as a one-hot vector. We then build a network whose input is this one-hot vector and whose output is also a 10,000-dimensional vector, where each element is the probability of the corresponding word appearing near the input word.
The network is trained on word pairs.
Dimensions: 1 x 10,000 -> (10,000 x 300) -> 1 x 300 -> (300 x 10,000) -> 1 x 10,000
Each output neuron (one per word in our vocabulary!) will produce an output between 0 and 1, and the sum of all these output values will add up to 1.
There is no activation function on the hidden layer neurons, but the output neurons use softmax. We’ll come back to this later.
When training this network on word pairs, the input is a one-hot vector representing the input word and the training output is also a one-hot vector representing the output word. But when you evaluate the trained network on an input word, the output vector will actually be a probability distribution (i.e., a bunch of floating point values, not a one-hot vector).
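A numpy sketch of the architecture just described (10,000-word vocabulary, 300 hidden units; the weights are random here because we are only checking shapes and the softmax behaviour):

```python
import numpy as np

vocab_size, hidden_size = 10_000, 300

W_in = np.random.randn(vocab_size, hidden_size) * 0.01    # hidden-layer weights (the future word vectors)
W_out = np.random.randn(hidden_size, vocab_size) * 0.01   # output-layer weights

x = np.zeros(vocab_size)
x[42] = 1.0                       # one-hot vector for an arbitrary input word

h = x @ W_in                      # 1 x 300; no activation on the hidden layer
scores = h @ W_out                # 1 x 10,000
probs = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()   # softmax

print(h.shape, probs.shape, probs.sum())   # (300,) (10000,) 1.0 (approximately)
```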
3. The Hidden Layer
The hidden layer has 10,000 x 300 weights; each row of this matrix is the 300-dimensional vector we want for the corresponding word.
Multiplying a word's one-hot vector by this matrix yields the compressed word vector, which is simply the corresponding row of the weight matrix.
If you look at the rows of the weight matrix, these are actually what will be our word vectors.
So the end goal of all of this is really just to learn this hidden layer weight matrix – the output layer we’ll just toss when we’re done!
Let’s get back, though, to working through the definition of this model that we’re going to train.
Now, you might be asking yourself–“That one-hot vector is almost all zeros… what’s the effect of that?” If you multiply a 1 x 10,000 one-hot vector by a 10,000 x 300 matrix, it will effectively just select the matrix row corresponding to the “1”. Here’s a small example to give you a visual.
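To see that "row selection" concretely, here is a tiny 5 x 3 example instead of 10,000 x 300:

```python
import numpy as np

W = np.arange(15).reshape(5, 3)        # pretend 5-word vocabulary, 3-dimensional word vectors
one_hot = np.array([0, 0, 0, 1, 0])    # one-hot vector for the word with index 3

print(one_hot @ W)   # [ 9 10 11]
print(W[3])          # identical: the multiplication just selects row 3 of the weight matrix
```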
4. Output Layer
The output layer is a softmax regression: every output neuron produces a probability in (0, 1), and all the outputs sum to 1.
Below, the word vector for "ants" from the previous layer is dotted with the output-layer weight vector for "car"; the result is the probability that "car" appears near "ants".
The 1 x 300 word vector for "ants" then gets fed to the output layer. The output layer is a softmax regression classifier. There's an in-depth tutorial on Softmax Regression here, but the gist of it is that each output neuron (one per word in our vocabulary!) will produce an output between 0 and 1, and the sum of all these output values will add up to 1.
Specifically, each output neuron has a weight vector which it multiplies against the word vector from the hidden layer, then it applies the function exp(x) to the result. Finally, in order to get the outputs to sum up to 1, we divide this result by the sum of the results from all 10,000 output nodes.
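Written as a formula (notation is mine, not the tutorial's: $v_{\text{ants}}$ is the 300-dimensional hidden-layer output for "ants" and $u_w$ is the output-layer weight vector for word $w$):

$$P(\text{car} \mid \text{ants}) = \frac{\exp(u_{\text{car}}^{\top} v_{\text{ants}})}{\sum_{w=1}^{10{,}000} \exp(u_{w}^{\top} v_{\text{ants}})}$$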
The last picture is an illustration of calculating the output of the output neuron for the word "car".
Note that neural network does not know anything about the offset of the output word relative to the input word. It does not learn a different set of probabilities for the word before the input versus the word after. To understand the implication, let’s say that in our training corpus, every single occurrence of the word ‘York’ is preceded by the word ‘New’. That is, at least according to the training data, there is a 100% probability that ‘New’ will be in the vicinity of ‘York’. However, if we take the 10 words in the vicinity of ‘York’ and randomly pick one of them, the probability of it being ‘New’ is not 100%; you may have picked one of the other words in the vicinity.
5. Negative Sampling
- Subsample frequent words to reduce the number of training examples.
- Negative sampling: each training example updates only a small fraction of the model's weights.
You may have noticed something: it's a huge network! The hidden layer and the output layer each have a weight matrix with 10,000 x 300 = 3 million weights.
The authors of Word2Vec addressed these issues in their second paper with the following two innovations:
- Subsampling frequent words to decrease the number of training examples.
- Modifying the optimization objective with a technique they called “Negative Sampling”, which causes each training sample to update only a small percentage of the model’s weights.
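A rough numpy sketch of the skip-gram negative-sampling update for a single training pair (my own illustration of the idea, not the reference implementation; the learning rate, the number of negative words, and how the negatives are drawn from the unigram distribution are all simplified):

```python
import numpy as np

def sgns_update(W_in, W_out, center, context, neg_ids, lr=0.025):
    """One negative-sampling step: only the centre word's input row and the
    output columns of the true context word plus the sampled negatives change."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    h = W_in[center].copy()                 # hidden layer = the centre word's vector
    grad_h = np.zeros_like(h)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in neg_ids]:
        g = sigmoid(h @ W_out[:, word]) - label   # logistic-loss gradient for this pair
        grad_h += g * W_out[:, word]
        W_out[:, word] -= lr * g * h              # touch only this word's output weights
    W_in[center] -= lr * grad_h                   # touch only the centre word's input row
    return W_in, W_out

# Toy usage (sizes and indices are arbitrary).
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(10_000, 300))
W_out = rng.normal(scale=0.01, size=(300, 10_000))
W_in, W_out = sgns_update(W_in, W_out, center=42, context=7, neg_ids=[5, 991, 3004, 77, 1234])
```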
Subsampling Frequent Words
'The' appears almost everywhere, so seeing it contributes almost nothing to the understanding of other words.
We define a probability $P(w_i)$ of keeping each occurrence of word $w_i$ in the training data.
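For reference, the keep-probability used by the word2vec implementation (with $z(w_i)$ denoting the fraction of all corpus tokens that are $w_i$, and $0.001$ the default sampling threshold) is:

$$P(w_i) = \left(\sqrt{\frac{z(w_i)}{0.001}} + 1\right) \cdot \frac{0.001}{z(w_i)}$$

so very frequent words are kept with low probability, while rare words are almost always kept.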