Word2Vec Notes

  1. Find Word2Vec tools and try them out to see how they perform
    • Word2Vec(Google):
      • Capture many linguistic regularities
        For example, the vector operation vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) results in a vector that is very close to vector(‘Rome’) (see the analogy sketch after this list)
      • From words to phrases and beyond
        For example, a single vector representing ‘san francisco’ (see the phrase sketch after this list)
      • Word cosine distance (demonstrated in the analogy sketch after this list)
      • Word clustering
        Deriving word classes from huge data sets, achieved by performing K-means clustering on top of the word vectors; the output is a vocabulary file with words and their corresponding class IDs (see the clustering sketch after this list)
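
A minimal analogy/cosine sketch using gensim's pretrained Google News vectors as an assumed stand-in for the original C tool's word-analogy and distance binaries (loading the model is a sizeable one-time download):

```python
# Sketch: vector arithmetic and cosine similarity with gensim (an assumed
# stand-in for Google's C word2vec tool described above).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # pretrained KeyedVectors

# Linguistic regularity via vector arithmetic:
# vector('Paris') - vector('France') + vector('Italy') ~ vector('Rome')
print(wv.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=3))

# Cosine similarity between two word vectors (1 - cosine distance).
print(wv.similarity("Paris", "Rome"))
```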
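
The words-to-phrases step can be sketched with gensim's Phrases model, an assumed stand-in for the word2phrase tool shipped with word2vec; the toy corpus and threshold here are illustrative only:

```python
# Sketch: learning multi-word tokens such as 'san_francisco' with gensim.
from gensim.models.phrases import Phrases

sentences = [
    ["i", "live", "in", "san", "francisco"],
    ["san", "francisco", "is", "in", "california"],
    ["new", "york", "and", "san", "francisco"],
]
# Bigrams that co-occur often enough (relative to their parts) are merged
# into a single underscore-joined token.
phrases = Phrases(sentences, min_count=1, threshold=0.1)
print(phrases[["i", "love", "san", "francisco"]])
# -> ['i', 'love', 'san_francisco']
```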
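
And the clustering sketch: K-means on top of the learned vectors. The C tool does this internally via its -classes option; scikit-learn's KMeans on a toy corpus is an assumed stand-in here:

```python
# Sketch: deriving word classes by clustering word vectors.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sentences = [
    ["cat", "and", "dog", "are", "pets"],
    ["paris", "and", "rome", "are", "cities"],
    ["a", "dog", "chases", "a", "cat"],
    ["rome", "is", "older", "than", "paris"],
]
model = Word2Vec(sentences, vector_size=25, min_count=1, seed=1)

words = model.wv.index_to_key    # vocabulary, most frequent first
vectors = model.wv[words]        # (vocab_size, dim) matrix
labels = KMeans(n_clusters=3, n_init=10).fit_predict(vectors)

# Emit a word -> class-ID mapping, like the classes file word2vec writes.
for word, class_id in zip(words, labels):
    print(word, class_id)
```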
    • Performance (the knobs below map onto training parameters; see the sketch after this list)
      • Architecture:
        • Skip-Gram: slower, better for infrequent words
        • CBOW: fast
      • The training algorithm:
        • hierarchical softmax: better for infrequent words
        • negative sampling: better for frequent words, better with low dimensional vectors
      • Sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in range 1e-3 to 1e-5)
      • Dimensionality of the word vectors: usually more is better, but not always
      • Context(window) size:
        • skip-gram: around 10
        • CBOW: around 5
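
These settings map directly onto gensim's Word2Vec constructor; a sketch under the assumption that gensim is used instead of the original C tool (the corpus path is hypothetical):

```python
# Sketch: the performance settings above expressed as gensim parameters.
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_file="corpus.txt",  # hypothetical file, one tokenized sentence per line
    sg=1,             # 1 = Skip-gram (slower, better for infrequent words); 0 = CBOW
    hs=0,             # 1 = hierarchical softmax (better for infrequent words)
    negative=5,       # negative sampling (better for frequent words, low dims)
    sample=1e-4,      # sub-sampling of frequent words, in the 1e-3..1e-5 range
    vector_size=300,  # dimensionality: usually more is better, but not always
    window=10,        # context size: ~10 for skip-gram, ~5 for CBOW
)
```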
    • Obtaining training data (URLs for the datasets in bold are available on the reference site)
      • First billion characters from Wikipedia (use the pre-processing perl script from the bottom of Matt Mahoney’s page; a rough Python approximation follows this list)
      • Latest Wikipedia dump: use the same script as above to obtain clean text; should be more than 3 billion words
      • WMT11 site: text data for several languages (duplicate sentences should be removed before training the models)
      • Dataset from the “One Billion Word Language Modeling Benchmark”: almost 1B words of already pre-processed text
      • UMBC webbase corpus: around 3 billion words; needs further processing (mainly tokenization)
      • Text data for more languages can be obtained at statmt.org and from the Polyglot project (personally tested and recommended)
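
A rough Python approximation of the kind of cleanup Matt Mahoney's perl script performs (lowercasing and keeping only letters; the real script also spells out digits, which this simplification skips):

```python
# Sketch: minimal text cleanup before word2vec training.
import re

def clean(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z]+", " ", text)  # replace everything but a-z with spaces
    return " ".join(text.split())         # collapse runs of whitespace

print(clean("First billion characters, from Wikipedia (2006 dump)!"))
# -> "first billion characters from wikipedia dump"
```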
    • In short, Google's word2vec site has plenty more worth exploring
    • Factors affecting word-vector quality
      • Quantity and quality of the training data
      • Dimensionality of the word vectors
      • The training algorithm