1. Background
- Word representation: a process that transforms word symbols into machine-understandable meanings
- Definition of meaning (Webster Dictionary)
  - The thing one intends to convey
  - The logical extension of a word
- How to represent meaning so that machines can understand it?
  - Compute word similarity
    - WR(Star) ≈ WR(Sun)
    - WR(Motel) ≈ WR(Hotel)
  - Infer word relations (semantic relations), as in the vector-arithmetic sketch after this list
    - WR(China) - WR(Beijing) ≈ WR(Japan) - WR(Tokyo)
    - WR(Man) ≈ WR(King) - WR(Queen) + WR(Woman)
    - WR(Swimming) ≈ WR(Walking) - WR(Walk) + WR(Swim)
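Such analogy relations can be checked with simple vector arithmetic once each word has a vector. A minimal sketch with hand-picked toy vectors (real representations would be learned from a corpus, e.g., by Word2Vec, introduced later):

```python
import numpy as np

# Toy 3-d vectors chosen by hand for illustration only; real word
# representations would be learned from large-scale text.
WR = {
    "King":  np.array([0.9, 0.8, 0.1]),
    "Queen": np.array([0.9, 0.1, 0.8]),
    "Man":   np.array([0.1, 0.8, 0.1]),
    "Woman": np.array([0.1, 0.1, 0.8]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Analogy check: WR(King) - WR(Queen) + WR(Woman) should land near WR(Man)
predicted = WR["King"] - WR["Queen"] + WR["Woman"]
print(cosine(predicted, WR["Man"]))   # close to 1.0 for these toy vectors
```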
2. One-Hot Representation
2.1. Definition
- Regard words as discrete symbols
- Word ID or one-hot representation
  - E.g., the i-th word in the vocabulary is represented by a vector with a 1 at position i and 0s elsewhere
- Vector dimension = # words in the vocabulary
- The order of words in the vocabulary is not important
2.2. Problems of One-Hot Representation
- similarity(star, sun) = V_star · V_sun = 0 (see the sketch below)
- All the vectors are orthogonal, so there is no natural notion of similarity for one-hot vectors
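A minimal sketch of the one-hot representation and its similarity problem, using a made-up four-word vocabulary:

```python
import numpy as np

vocab = ["star", "sun", "motel", "hotel"]   # toy vocabulary
V = len(vocab)                              # vector dimension = # words in the vocabulary

def one_hot(word):
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0            # a single 1 at the word's index
    return vec

v_star, v_sun = one_hot("star"), one_hot("sun")
print(v_star @ v_sun)   # 0.0 -- any two different one-hot vectors are orthogonal
```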
2.3. Represent Words by Context
- The meaning of a word is given by the words that frequently appear close by
  - "You shall know a word by the company it keeps." (J. R. Firth, 1957: 11)
  - One of the most successful ideas of modern statistical NLP
- Use context words to represent a word, e.g., stars
- Two ways to build such context-based representations: co-occurrence counts and word embeddings
  - Term-Term matrix: how often a word occurs with another word
  - Term-Document matrix: how often a word occurs in a document
E.g., term-term co-occurrence counts for the word stars:

| | shining | bright | trees | dark | look |
|---|---|---|---|---|---|
| stars | 38 | 45 | 2 | 27 | 12 |
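A minimal sketch of how such term-term counts could be collected; the two-sentence corpus and window size below are made up for illustration, and the real matrix would be built from a much larger corpus:

```python
from collections import defaultdict

# Count how often each pair of words occurs within a fixed-size window.
corpus = [
    "bright stars are shining in the dark sky".split(),
    "we look at the bright stars near the trees".split(),
]
window = 2

counts = defaultdict(int)
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                counts[(word, sentence[j])] += 1

print(counts[("stars", "bright")])   # co-occurrence count of "stars" with "bright"
```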
2.4. Problems of Count-Based Representation
- Increase in size with the vocabulary
- Require a lot of storage
- Sparsity issues for less frequent words
  - Subsequent classification models will be less robust
2.5. Word Embedding
- Distributed representation
  - Build a dense vector for each word, learned from large-scale text corpora (a toy sketch follows this list)
  - Learning method: Word2Vec
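A toy sketch of what a distributed representation looks like as a data structure: one dense, low-dimensional vector per word. The vectors here are randomly initialized purely for illustration; Word2Vec (next section) would learn them from text:

```python
import numpy as np

vocab = ["star", "sun", "motel", "hotel"]
dim = 8   # dense dimension; in practice a few hundred, far smaller than |V|

# Randomly initialized for illustration only; real values are learned from a corpus.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), dim))

def embed(word):
    return embeddings[vocab.index(word)]   # dense vector lookup

print(embed("star").shape)   # (8,) -- every word gets a compact dense vector
```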
3. Word2Vec
- Word2Vec uses shallow neural networks that associate words with distributed representations
- It can capture many linguistic regularities, such as the analogy relations shown in the Background section
- Word2Vec uses a sliding window of a fixed size moving along a sentence
- In each window, the middle word is the target word and the other words are the context words
  - Given the context words, CBOW predicts the probability of the target word
  - Given a target word, skip-gram predicts the probabilities of the context words (both variants are sketched below)
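A minimal sketch of both variants using the gensim library (assumed to be installed, 4.x API); the two-sentence corpus is made up and far too small to produce meaningful vectors:

```python
from gensim.models import Word2Vec   # assumes gensim 4.x is available

sentences = [
    "never too late to learn".split(),
    "you shall know a word by the company it keeps".split(),
]

# sg=0 -> CBOW (predict target from context), sg=1 -> skip-gram (predict context from target)
cbow     = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["learn"].shape)                     # a 50-dimensional dense vector
print(skipgram.wv.most_similar("learn", topn=3))  # nearest words under the toy model
```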
3.1. CBOW & Skip-Gram
- In the CBOW architecture, the model predicts the target word given a window of surrounding context words
- According to the bag-of-words assumption, the order of the context words does not influence the prediction
- Example (see the sketch after this list): suppose the window size is 5
  - Sentence: "Never too late to learn"
  - P(late | [never, too, to, learn]), ...
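A small sketch that generates the (context, target) pairs a CBOW model would be trained on for the example sentence above, with a window size of 5 (two context words on each side of the target):

```python
# Slide a window of size 5 along the sentence; the middle word is the target,
# the remaining words in the window are the context.
sentence = "never too late to learn".split()
window = 5
half = window // 2

pairs = []
for i, target in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - half), min(len(sentence), i + half + 1))
               if j != i]
    pairs.append((context, target))

for context, target in pairs:
    print(f"P({target} | {context})")
# e.g. P(late | ['never', 'too', 'to', 'learn']) for the middle word
```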
3.2. Problems of Full Softmax
- When the vocabulary size is very large:
  - Computing the softmax over all words at every step depends on a huge number of model parameters, which is computationally impractical (see the sketch below)
  - We need to improve the computational efficiency
  - Two common remedies: negative sampling and hierarchical softmax
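A rough sketch of why the full softmax is expensive: every training step scores every word in the vocabulary. The sizes below are made up but typical:

```python
import numpy as np

V = 50_000                                  # vocabulary size (illustrative)
hidden = np.random.randn(300)               # 300-d representation of the context
output_weights = np.random.randn(V, 300)    # one output vector per vocabulary word

# A full softmax scores *every* word in the vocabulary at every step,
# and a gradient step would touch all V x 300 output weights.
logits = output_weights @ hidden
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)   # (50000,) -- this per-step cost is why full softmax is impractical
```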
3.3. Improving Computational Efficiency
In fact, we do not need a full probabilistic model in word2vec.
There are two main improvement methods for word2vec:
- Negative sampling
  - As discussed before, the vocabulary is very large, which means the model has a tremendous number of weights that would need to be updated at every step
  - The idea of negative sampling is to update only a small percentage of the weights at every step
  - Since we have the vocabulary and know the context words, we can select a few words that are not in the context word list according to a probability distribution (see the sketch at the end of this section)
- Hierarchical softmax
  - Organizes the vocabulary as a tree so that computing a word's probability needs only a logarithmic number of operations instead of scoring the whole vocabulary
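To make the negative-sampling idea above concrete, a minimal sketch that picks a few negative words outside the context word list. Negatives are sampled uniformly here for simplicity; actual Word2Vec implementations typically use a smoothed unigram distribution, which is not covered above:

```python
import numpy as np

vocab = ["never", "too", "late", "to", "learn", "star", "sun", "hotel"]
rng = np.random.default_rng(0)

def sample_negatives(context_words, k=3):
    # Choose k words that are NOT in the context word list.
    candidates = [w for w in vocab if w not in context_words]
    return list(rng.choice(candidates, size=k, replace=False))

context = ["never", "too", "to", "learn"]   # context words for the target "late"
negatives = sample_negatives(context, k=3)
print(negatives)
# Only the target word, these few negatives (and the context) have their weights
# updated in this step, instead of the whole vocabulary.
```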