Natural Language Cognitive Behavior: NLP (gensim)

Author: HJUGujbwi223

Categories: python, NLP, gensim. Tags: NLP, PYTHON, gensim, model

Article link: https://blog.csdn.net/hjugujbwi223/article/details/80432604

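Training a word model is unavoidable in NLP. With word vectors, what you feed the model is what comes back out, haha. For example, if the corpus you feed in is all technology text, the similar words the model maps to will also be technology-related; feed it novels and you get the vocabulary of novels. Of course, in the era of big data, the more corpus you have and the more samples you draw, the more accurate the results should be. (That is only my subjective view. The final results may end up closer to Buddha, haha, and make you question your life.)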
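To make the "you get out what you feed in" point concrete, here is a minimal sketch that does not use word2vec at all. It only counts which words appear within a small window around a target word in two made-up toy corpora. The corpora, the `neighbors` helper, and the window size are assumptions for illustration; trained word vectors capture the same effect in a much richer way.

```python
from collections import Counter

def neighbors(sentences, target, window=3):
    """Count words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word != target:
                continue
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[words[j]] += 1
    return counts

# Two hypothetical toy corpora that both mention "apple".
tech_corpus = [
    "apple released a new phone chip",
    "the apple laptop runs fast software",
]
novel_corpus = [
    "she bit the red apple slowly",
    "an apple fell in the quiet orchard",
]

tech_neighbors = neighbors(tech_corpus, "apple")
novel_neighbors = neighbors(novel_corpus, "apple")

# The same word ends up with different company depending on the corpus.
print(sorted(tech_neighbors))
print(sorted(novel_neighbors))
```

The same target word is surrounded by hardware vocabulary in one corpus and by storytelling vocabulary in the other, which is the behavior the paragraph above describes.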

D:\Python35\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2018-05-24 09:09:26,558 : INFO : collecting all words and their counts
2018-05-24 09:09:30,098 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-05-24 09:09:35,650 : INFO : collected 529468 word types from a corpus of 31656800 raw words and 3166 sentences
2018-05-24 09:09:35,651 : INFO : Loading a fresh vocabulary
2018-05-24 09:09:37,089 : INFO : min_count=1 retains 529468 unique words (100% of original 529468, drops 0)
2018-05-24 09:09:37,089 : INFO : min_count=1 leaves 31656800 word corpus (100% of original 31656800, drops 0)
2018-05-24 09:09:38,360 : INFO : deleting the raw counts dictionary of 529468 items
2018-05-24 09:09:38,377 : INFO : sample=0.001 downsamples 32 most-common words
2018-05-24 09:09:38,377 : INFO : downsampling leaves estimated 27122423 word corpus (85.7% of prior 31656800)
2018-05-24 09:09:38,892 : INFO : constructing a huffman tree from 529468 words
2018-05-24 09:09:53,343 : INFO : built huffman tree with maximum node depth 25
2018-05-24 09:09:54,448 : INFO : estimated required memory for 529468 words and 100 dimensions: 1005989200 bytes
2018-05-24 09:09:54,448 : INFO : resetting layer weights
2018-05-24 09:09:59,394 : INFO : training model with 3 workers on 529468 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=3

2018-05-24 09:13:36,289 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-05-24 09:13:36,296 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-05-24 09:13:36,299 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-05-24 09:13:36,299 : INFO : EPOCH - 5 : training on 31656800 raw words (27122332 effective words) took 43.8s, 618655 effective words/s
2018-05-24 09:13:36,299 : INFO : training on a 158284000 raw words (135611129 effective words) took 216.9s, 625209 effective words/s
2018-05-24 09:13:36,299 : INFO : storing 529468x100 projection weights into model.txt
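Several numbers in the log above can be reproduced with simple arithmetic. The sketch below is a reading of the log, not gensim code. The 700 bytes of per-word overhead (with hs=1), the 4-byte float size, and the 10000-word chunk size for `LineSentence` are assumptions about how gensim of that era behaved; they are used here because they reproduce the logged values.

```python
import math

vocab_size = 529468   # "collected 529468 word types"
dims = 100            # "100 dimensions"
raw_words = 31656800  # "corpus of 31656800 raw words"
epochs = 5            # the last epoch in the log is "EPOCH - 5"

# Assumed memory model: about 700 bytes of vocabulary bookkeeping per word
# when hs=1, plus three float32 matrices of shape (vocab_size, dims):
# the word vectors, the hierarchical-softmax weights (hs=1), and the
# negative-sampling weights (negative=5, which the log shows is also active).
bytes_per_float = 4
vocab_bytes = vocab_size * 700
matrix_bytes = vocab_size * dims * bytes_per_float
estimated_memory = vocab_bytes + 3 * matrix_bytes
print(estimated_memory)  # compare with "estimated required memory ... 1005989200 bytes"

# Total raw words seen across all epochs.
total_raw_words = raw_words * epochs
print(total_raw_words)  # compare with "training on a 158284000 raw words"

# Share of the corpus kept after downsampling the most frequent words.
kept_percent = round(100 * 27122423 / raw_words, 1)
print(kept_percent)  # compare with "85.7% of prior 31656800"

# Assumed: lines longer than 10000 words are cut into 10000-word chunks.
chunks = math.ceil(raw_words / 10000)
print(chunks)  # compare with "3166 sentences"
```

If the chunking assumption holds, the sentence count suggests the corpus file has very few line breaks, so the "sentences" seen during training were mostly fixed 10000-word slices rather than real sentences.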

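from gensim.models import word2vec

sentences = word2vec.LineSentence('./语料.txt')  # about 200 MB of articles on society and the humanities

model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, size=100)  # gensim 4.0+ renames size to vector_size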

The resulting model is about 600 MB.
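The log line "storing 529468x100 projection weights into model.txt" suggests the vectors were written in the plain-text word2vec format: a header line with the vocabulary size and dimension, then one line per word followed by its values. As a rough sketch under that assumption, the file can be read without gensim. The `load_word2vec_text` and `cosine` helpers and the three-word sample file are hypothetical stand-ins, not part of the original post.

```python
import math
import os
import tempfile

def load_word2vec_text(path, limit=None):
    """Read vectors from a word2vec text-format file into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        # Header line: "<vocab_size> <dims>"
        vocab_size, dims = map(int, f.readline().split())
        for line_no, line in enumerate(f):
            if limit is not None and line_no >= limit:
                break
            parts = line.rstrip().split(" ")
            word, values = parts[0], [float(x) for x in parts[1:]]
            if len(values) != dims:
                raise ValueError("unexpected vector length for %r" % word)
            vectors[word] = values
    return vectors

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical 3-word, 2-dimensional file standing in for the real model.txt.
sample = "3 2\n忙碌 1.0 0.0\n工作 0.8 0.6\n休息 0.0 1.0\n"
path = os.path.join(tempfile.mkdtemp(), "model_sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

vectors = load_word2vec_text(path)
print(cosine(vectors["忙碌"], vectors["工作"]))
```

With the real file, passing a `limit` keeps memory use manageable, since the full vocabulary has more than half a million entries.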

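Test sentences

He is busy every day, has a lot of work to handle, has no time to rest, and gives up his holidays
------------------------------------------------------------------------

Attribute: right, success 0.491986

Mental state: pain, sadness 0.464617

People in this city are all lazy; they always spend a lot of time chatting idly and do not do anything meaningful
------------------------------------------------------------------------
Attribute: right, success 0.548144

Mental state: happy, joyful 0.522422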
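The post does not show how the scores above were computed. One common approach, sketched here purely as an assumption, is to average the word vectors of the tokenized sentence, average the vectors of each label's keywords, and compare the two averages by cosine similarity. The tiny `vectors` table, the label keyword lists, and the `score` helper are hypothetical; a real run would use the trained model's vectors and a segmented Chinese sentence.

```python
import math

# Hypothetical 2-dimensional vectors standing in for the trained model.
vectors = {
    "busy": [0.9, 0.1],
    "work": [0.8, 0.3],
    "rest": [0.1, 0.9],
    "success": [0.7, 0.4],
    "sadness": [0.2, 0.8],
}

def mean_vector(words):
    """Average the vectors of the words found in the vocabulary."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("none of the words are in the vocabulary")
    dims = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dims)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def score(sentence_words, label_words):
    """Similarity between a tokenized sentence and a label's keywords."""
    return cosine(mean_vector(sentence_words), mean_vector(label_words))

sentence = ["busy", "work", "rest"]
labels = {"success": ["success"], "sadness": ["sadness"]}
for name, words in labels.items():
    print(name, round(score(sentence, words), 6))
```

Because every label is compared against the same averaged sentence vector, scores for different labels can land close together, and word order and negation in the sentence are ignored entirely.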