Word Representation

1. Background

  • Word representation: a process that transforms symbols into machine-understandable meanings
  • Definition of meaning (Webster Dictionary)
    • The thing one intends to convey
    • The logical extension of a word
  • How can we represent meaning so that machines can understand it?
  • Compute word similarity
    • WR(Star) ≈ WR(Sun)
    • WR(Motel) ≈ WR(Hotel)
  • Infer word relations (a sketch of both operations follows this list)
    • WR(China) - WR(Beijing) ≈ WR(Japan) - WR(Tokyo)
    • WR(Man) ≈ WR(King) - WR(Queen) + WR(Woman)
    • WR(Swimming) ≈ WR(Walking) - WR(Walk) + WR(Swim)
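
A minimal sketch of both operations with toy dense word vectors: cosine similarity for word similarity, vector offsets for word relations. All numbers are invented purely for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two word vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy word vectors; the values are made up for illustration only.
WR = {
    "star":    np.array([0.9, 0.1, 0.2]),
    "sun":     np.array([0.8, 0.2, 0.1]),
    "china":   np.array([0.5, 0.9, 0.1]),
    "beijing": np.array([0.5, 0.7, 0.1]),
    "japan":   np.array([0.4, 0.9, 0.6]),
    "tokyo":   np.array([0.4, 0.7, 0.6]),
}

# Word similarity: WR(star) and WR(sun) should score close to 1.
print(cosine_similarity(WR["star"], WR["sun"]))

# Word relation: the country - capital offset should be roughly the same
# across pairs, i.e. WR(China) - WR(Beijing) ≈ WR(Japan) - WR(Tokyo).
print(cosine_similarity(WR["china"] - WR["beijing"], WR["japan"] - WR["tokyo"]))
```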

2. One-Hot Representation

2.1. Definition

  • Regard words as discrete symbols
  • Word ID or one-hot representation
  • E.g.
    • Vector dimension = # words in vocabulary
    • Order is not important

2.2. Problems of One-Hot Representation

  • similarity(star, sun) = V_star · V_sun = 0
  • All one-hot vectors are orthogonal, so there is no natural notion of similarity for one-hot vectors (see the sketch below)
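
A small sketch with a toy five-word vocabulary that makes the problem concrete: every one-hot vector has a single 1, and the dot product of any two different words is 0:

```python
import numpy as np

vocab = ["star", "sun", "motel", "hotel", "tree"]   # toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """A |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

v_star, v_sun = one_hot("star"), one_hot("sun")
print(v_star)                  # [1. 0. 0. 0. 0.]
print(np.dot(v_star, v_sun))   # 0.0 -> different words are always orthogonal
```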

2.3. Represent Words by Context

  • The meaning of a word is given by the words that frequently appear close by
    • "You shall know a word by the company it keeps." (J.R.Firth 1957:11).
    • One of the most successful ideas of modern statistical NLP.
  • Use context words to represent a word, e.g. stars
    • Co-occurrence counts (see the table and sketch below)
      • Term-term matrix: how often a word occurs together with another word
      • Term-document matrix: how often a word occurs in a document
    • Word embeddings

Example: co-occurrence counts of "stars" with some of its context words

    context word:   shining   bright   trees   dark   look
    stars:               38       45       2     27     12
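
A minimal sketch of how such a term-term matrix can be collected with a sliding window; the toy corpus and window size here are made up for illustration:

```python
from collections import defaultdict

corpus = [
    "the stars are shining bright in the dark".split(),
    "look at the bright stars".split(),
]
window = 2  # count context words within +/- 2 positions of the target

# term-term matrix: counts[w][c] = how often c appears near w
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][sentence[j]] += 1

print(dict(counts["stars"]))  # {'the': 2, 'are': 1, 'shining': 1, 'bright': 1}
```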

2.4. Problems of Count-Based Representation

  • Vector size increases with the vocabulary
  • Requires a lot of storage
  • Sparsity issues for less frequent words
  • Subsequent classification models will be less robust

2.5. Word Embedding

  • Distributed representation
    • Build a dense vector for each word, learned from large-scale text corpora
    • Learning method: Word2Vec (a usage sketch follows)
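
As an illustration of the learning method, a hedged sketch of training dense vectors with the gensim library, assuming gensim 4.x and a tiny made-up corpus (a real corpus would be large-scale text):

```python
from gensim.models import Word2Vec

# A tiny tokenized corpus, purely for illustration.
sentences = [
    ["never", "too", "late", "to", "learn"],
    ["the", "stars", "are", "shining", "bright"],
]

# sg=0 selects CBOW, sg=1 selects skip-gram (both are described below).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

print(model.wv["stars"].shape)                  # (100,) dense vector
print(model.wv.similarity("stars", "bright"))   # cosine similarity of two words
```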

3. Word2Vec

  • Word2Vec uses shallow neural networks that associate words with distributed representations
  • It can capture many linguistic regularities, such as the analogy relations shown in the Background section (e.g. WR(China) - WR(Beijing) ≈ WR(Japan) - WR(Tokyo))
  • Word2Vec uses a sliding window of a fixed size moving along a sentence (see the sketch below)
  • In each window, the middle word is the target word and the other words are the context words
    • Given the context words, CBOW predicts the probabilities of the target word
    • Given a target word, skip-gram predicts the probabilities of the context words
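
A minimal sketch of how the sliding window turns a sentence into (target, context) training examples, assuming a window of two words on each side (i.e. window size 5 including the target):

```python
def sliding_window_pairs(tokens, window=2):
    """In each window the middle word is the target; the words within
    `window` positions on either side are its context."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((target, context))
    return pairs

for target, context in sliding_window_pairs("never too late to learn".split()):
    print(target, "<-", context)
# CBOW predicts `target` from `context`;
# skip-gram predicts each word in `context` from `target`.
```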

3.1 CBOW & SKIP-GRAM

  • In the CBOW architecture, the model predicts the target word given a window of surrounding context words
  • According to the bag-of-words assumption, the order of the context words does not influence the prediction
    • Suppose the window size is 5
      • Sentence: "Never too late to learn"
      • P(late | [never, too, to, learn]), ... (a numpy sketch follows)
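
A hedged numpy sketch of the CBOW prediction step: the context embeddings are averaged (so word order does not matter) and a softmax over the vocabulary yields P(target | context). The vocabulary, dimensions, and random weights are illustrative only:

```python
import numpy as np

vocab = ["never", "too", "late", "to", "learn"]
word2id = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 8                      # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, d))            # input (context) embeddings
W_out = rng.normal(size=(V, d))           # output (target) embeddings

def cbow_probs(context_words):
    """P(w | context) for every word w in the vocabulary."""
    h = W_in[[word2id[w] for w in context_words]].mean(axis=0)  # average -> order-free
    scores = W_out @ h                    # one score per vocabulary word
    exp = np.exp(scores - scores.max())   # numerically stabilized softmax
    return exp / exp.sum()

p = cbow_probs(["never", "too", "to", "learn"])
print(p[word2id["late"]])                 # P(late | [never, too, to, learn])
```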

3.2 Problems of Full Softmax

  • When the vocabulary size is very large
    • Computing the softmax over all words at every step involves a huge number of model parameters, which is computationally impractical (see the sketch after this list)
    • We need to improve the computational efficiency
      • Negative sampling
      • Hierarchical softmax
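
The cost can be seen directly in the normalization: the softmax denominator sums over the entire vocabulary, so every training step touches all |V| output vectors. A small sketch with purely illustrative sizes:

```python
import numpy as np

V, d = 50_000, 100                   # illustrative vocabulary and embedding sizes
W_out = np.random.normal(size=(V, d))
h = np.random.normal(size=d)         # hidden vector produced from the context

scores = W_out @ h                   # one score per vocabulary word: O(|V| * d)
probs = np.exp(scores - scores.max())
probs /= probs.sum()                 # the normalization also sums over all |V| words
# A gradient step on this softmax reads/updates O(|V| * d) output parameters.
```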

3.3 Improving Computational Efficiency

In fact, we do not need a full probabilistic model in word2vec.

There are two main improvement methods for word2vec:

  • Negative sampling
    • As we discussed before, the vocabulary is very large, which means our model has a tremendous number of weights that need to be updated at every step
    • The idea of negative sampling is to update only a small percentage of the weights at every step
    • Since we have the vocabulary and know the context words, we can sample a few words that are not in the context word list according to a probability distribution (see the sketch after this list)
  • Hierarchical softmax
    • Organizes the vocabulary into a binary tree, so that computing a word's probability requires only O(log |V|) operations instead of O(|V|)
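
A hedged numpy sketch of the negative-sampling objective for a single (target, context) pair: only the true context word and k sampled negatives are scored, with negatives drawn from a unigram distribution raised to the 3/4 power. The sizes and word frequencies here are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10_000, 100, 5                 # vocab size, dimension, negatives per pair
W_in = rng.normal(scale=0.1, size=(V, d))
W_out = rng.normal(scale=0.1, size=(V, d))

word_freq = rng.random(V)                # stand-in for real corpus counts
noise_dist = word_freq ** 0.75           # unigram distribution raised to 3/4
noise_dist /= noise_dist.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target_id, context_id):
    """Loss for one (target, context) pair with k sampled negatives.
    Only k + 1 output vectors are touched instead of all |V|."""
    v_t = W_in[target_id]
    neg_ids = rng.choice(V, size=k, p=noise_dist)
    pos = -np.log(sigmoid(W_out[context_id] @ v_t))       # pull true pair together
    neg = -np.log(sigmoid(-(W_out[neg_ids] @ v_t))).sum() # push negatives apart
    return pos + neg

print(negative_sampling_loss(target_id=42, context_id=7))
```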