Natural-Language Cognitive Behavior with NLP (gensim)

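Training a word model is an unavoidable step in NLP. With word vectors, whatever you feed the model is what comes back out, haha. For example, if the input corpus is all about technology, the similar words the model maps to will also be technology-related; feed it novels and you get the vocabulary of novels. Of course, in the era of big data, the larger the corpus and the more samples you draw, the more accurate the results should be. (That is only my subjective expectation; the final results may turn out to be closer to the Buddha, haha, and make you question your life.)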

D:\Python35\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2018-05-24 09:09:26,558 : INFO : collecting all words and their counts
2018-05-24 09:09:30,098 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-05-24 09:09:35,650 : INFO : collected 529468 word types from a corpus of 31656800 raw words and 3166 sentences
2018-05-24 09:09:35,651 : INFO : Loading a fresh vocabulary
2018-05-24 09:09:37,089 : INFO : min_count=1 retains 529468 unique words (100% of original 529468, drops 0)
2018-05-24 09:09:37,089 : INFO : min_count=1 leaves 31656800 word corpus (100% of original 31656800, drops 0)
2018-05-24 09:09:38,360 : INFO : deleting the raw counts dictionary of 529468 items
2018-05-24 09:09:38,377 : INFO : sample=0.001 downsamples 32 most-common words
2018-05-24 09:09:38,377 : INFO : downsampling leaves estimated 27122423 word corpus (85.7% of prior 31656800)
2018-05-24 09:09:38,892 : INFO : constructing a huffman tree from 529468 words
2018-05-24 09:09:53,343 : INFO : built huffman tree with maximum node depth 25
2018-05-24 09:09:54,448 : INFO : estimated required memory for 529468 words and 100 dimensions: 1005989200 bytes
2018-05-24 09:09:54,448 : INFO : resetting layer weights
2018-05-24 09:09:59,394 : INFO : training model with 3 workers on 529468 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=3

2018-05-24 09:13:36,289 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-05-24 09:13:36,296 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-05-24 09:13:36,299 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-05-24 09:13:36,299 : INFO : EPOCH - 5 : training on 31656800 raw words (27122332 effective words) took 43.8s, 618655 effective words/s
2018-05-24 09:13:36,299 : INFO : training on a 158284000 raw words (135611129 effective words) took 216.9s, 625209 effective words/s
2018-05-24 09:13:36,299 : INFO : storing 529468x100 projection weights into model.txt
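
Two numbers in the log can be checked by hand: the memory estimate and the total raw word count. The sketch below is my reading of how gensim 3.x builds the estimate, assuming a fixed per-word bookkeeping cost (700 bytes with hierarchical softmax, 500 without) plus one float32 matrix each for the word vectors, the hierarchical-softmax weights, and the negative-sampling weights. The function name and constants are illustrative, not the author's code. Note that the log reports `hs=1 negative=5`: setting `hs=1` does not switch off negative sampling, whose default is 5, so both weight matrices are allocated.

```python
# Figures copied from the training log above.
VOCAB_SIZE = 529468
VECTOR_SIZE = 100
RAW_WORDS_PER_EPOCH = 31656800
EPOCHS = 5
FLOAT32_BYTES = 4


def estimate_memory(vocab_size, vector_size, hs, negative):
    """Assumed breakdown, modelled on gensim 3.x's memory estimate."""
    report = {
        # Rough per-word cost of the vocabulary structures (assumed constant).
        "vocab": vocab_size * (700 if hs else 500),
        # The word vectors themselves.
        "vectors": vocab_size * vector_size * FLOAT32_BYTES,
    }
    if hs:
        # Output weights used by hierarchical softmax.
        report["syn1"] = vocab_size * vector_size * FLOAT32_BYTES
    if negative:
        # Output weights used by negative sampling.
        report["syn1neg"] = vocab_size * vector_size * FLOAT32_BYTES
    report["total"] = sum(report.values())
    return report


report = estimate_memory(VOCAB_SIZE, VECTOR_SIZE, hs=1, negative=5)
total_raw_words = RAW_WORDS_PER_EPOCH * EPOCHS

print(report["total"])   # 1005989200, the figure in the log
print(total_raw_words)   # 158284000, the figure in the log
```

Under these assumptions, passing `negative=0` alongside `hs=1` would drop one 529468x100 matrix, roughly 212 MB, from the estimate.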

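from gensim.models import word2vec

# Corpus: about 200 MB of articles on society and the humanities
sentences = word2vec.LineSentence('./语料.txt')

# size= is the gensim 3.x parameter name; gensim 4 renamed it to vector_size=
model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, size=100)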

The resulting model is 600 MB.

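Test sentences

   He is busy every day, has a lot of work to handle, has no time to rest, and gives up his holidays
------------------------------------------------------------------------

    Attribute   : right, success    0.491986

    Psychology  : pain, sadness     0.464617


   People in this city are all lazy; they always spend a lot of time chatting idly and never do anything meaningful
------------------------------------------------------------------------

    Attribute   : right, success    0.548144

    Psychology  : happy, joyful     0.522422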
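
The post does not say how these scores were computed. One common way to score a sentence against a group of label words is the cosine similarity between the mean vector of the sentence tokens and the mean vector of the label words, which is what gensim's `n_similarity` computes. The sketch below shows that calculation in plain Python, assuming this is roughly what was done; the two-dimensional toy vectors and the helper names are invented for illustration and do not come from the trained model.

```python
import math


def mean_vector(words, vectors):
    """Average the vectors of the words that exist in the vocabulary."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("none of the words are in the vocabulary")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_similarity(sentence_tokens, label_words, vectors):
    """Cosine similarity between the mean vectors of two word groups."""
    return cosine(
        mean_vector(sentence_tokens, vectors),
        mean_vector(label_words, vectors),
    )


# Hypothetical 2-d vectors standing in for the trained 100-d vectors.
toy_vectors = {
    "busy": [1.0, 0.2],
    "work": [0.9, 0.4],
    "success": [0.8, 0.3],
    "pain": [0.1, 1.0],
}

# "holiday" is not in the toy vocabulary, so it is skipped.
sentence = ["busy", "work", "holiday"]
score_success = group_similarity(sentence, ["success"], toy_vectors)
score_pain = group_similarity(sentence, ["pain"], toy_vectors)
print(score_success, score_pain)
```

A score computed this way only reflects which words co-occur in the training corpus, which may be why a sentence about lazy people can still land close to "right, success".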
