Word Vectors Explained (1)

In NLP we want to represent each word as a vector. There are several families of methods for doing this.

1 one-hot Vector

Represent every word as a |V|×1 vector of all 0s, with a single 1 at the index of that word in the sorted vocabulary, where V is the vocabulary.
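A minimal sketch of one-hot vectors (the toy vocabulary here is made up for illustration):

```python
# Toy vocabulary; in practice V is the full sorted vocabulary of the corpus.
vocab = sorted({"cat", "dog", "sat", "the"})
word_to_index = {w: i for i, w in enumerate(vocab)}  # word -> position in V

def one_hot(word):
    """Return the |V|-dimensional one-hot vector for `word`."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec
```

Every vector has exactly one nonzero entry, so no two word vectors share any information — one reason this representation is rarely used directly.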

2 SVD Based Methods

2.1 Window based Co-occurrence Matrix

Represent a word by means of its neighbors: count the number of times each word appears inside a window of a fixed size around the word of interest.

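As a concrete sketch, here is window-based co-occurrence counting on a made-up toy corpus (the sentences and window size are assumptions for illustration):

```python
from collections import defaultdict

# Toy corpus of tokenized sentences (illustrative only).
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
window = 1  # symmetric window of 1 word on each side

counts = defaultdict(int)
for sentence in corpus:
    for i, word in enumerate(sentence):
        # Count every neighbor within the window (excluding the word itself).
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                counts[(word, sentence[j])] += 1
```

The resulting `counts` dictionary holds the entries of the |V|×|V| co-occurrence matrix; note it is symmetric since the window is symmetric.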

The resulting matrix is very large and sparse, so we reduce its dimensionality with SVD:

  1. Generate a |V|×|V| co-occurrence matrix X.
  2. Apply SVD on X to get X = U S V^T.

    • Select the first k columns of U to get k-dimensional word vectors.
    • The ratio (σ_1 + ... + σ_k) / (σ_1 + ... + σ_|V|) indicates the amount of variance captured by the first k dimensions.
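The two steps above can be sketched with NumPy (the matrix X and the choice k = 2 are made-up examples):

```python
import numpy as np

# Toy symmetric |V| x |V| co-occurrence matrix (illustrative values).
X = np.array([[0., 2., 1.],
              [2., 0., 1.],
              [1., 1., 0.]])

U, S, Vt = np.linalg.svd(X)     # X = U S V^T; S holds singular values sigma_i

k = 2
word_vectors = U[:, :k]         # first k columns of U: k-dim word vectors

# Fraction of variance captured by the first k singular values.
variance_captured = S[:k].sum() / S.sum()
```

`numpy.linalg.svd` returns the singular values in descending order, so taking the first k columns of U keeps the directions that capture the most variance.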

2.2 Shortcomings

SVD-based methods do not scale well to large matrices, and it is hard to incorporate new words or documents. The computational cost of SVD for an m×n matrix is O(mn²).

3 Iteration Based Methods - Word2Vec

3.1 Language Models (Unigrams, Bigrams, etc.)

We need a model that assigns a probability to a sequence of tokens.

For example
* The cat jumped over the puddle. —high probability
* Stock boil fish is toy. —low probability

Unigrams:
We can take the unigram language model approach and break apart this probability by assuming the word occurrences are completely independent:

P(w1, w2, ..., wn) = P(w1) · P(w2) · ... · P(wn)
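A minimal sketch of this independence assumption, estimating unigram probabilities from a made-up six-token corpus:

```python
from collections import Counter

# Toy training corpus (illustrative only).
tokens = "the cat jumped over the puddle".split()
unigram_counts = Counter(tokens)
total = sum(unigram_counts.values())

def unigram_prob(sentence):
    """P(w1..wn) as a product of independent unigram probabilities."""
    p = 1.0
    for w in sentence.split():
        p *= unigram_counts[w] / total
    return p
```

Since "the" occurs 2 times out of 6 tokens, P("the") = 2/6; the probability of a whole sentence is just the product of such terms, which is exactly why this model cannot capture word order.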