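Training a word model is unavoidable in NLP. With word vectors, what you feed the model is what you get back, haha. For example, if the input corpus is all about technology, the similar words the model maps out will also be technology related; feed it novels and you get the vocabulary of novels. Of course, in the era of big data, the larger the corpus and the more samples you have, the more accurate the results should be. (That is only my subjective expectation; the actual results may turn out far more "zen" than that, haha, enough to make you question your life.)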
D:\Python35\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2018-05-24 09:09:26,558 : INFO : collecting all words and their counts
2018-05-24 09:09:30,098 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-05-24 09:09:35,650 : INFO : collected 529468 word types from a corpus of 31656800 raw words and 3166 sentences
2018-05-24 09:09:35,651 : INFO : Loading a fresh vocabulary
2018-05-24 09:09:37,089 : INFO : min_count=1 retains 529468 unique words (100% of original 529468, drops 0)
2018-05-24 09:09:37,089 : INFO : min_count=1 leaves 31656800 word corpus (100% of original 31656800, drops 0)
2018-05-24 09:09:38,360 : INFO : deleting the raw counts dictionary of 529468 items
2018-05-24 09:09:38,377 : INFO : sample=0.001 downsamples 32 most-common words
2018-05-24 09:09:38,377 : INFO : downsampling leaves estimated 27122423 word corpus (85.7% of prior 31656800)
2018-05-24 09:09:38,892 : INFO : constructing a huffman tree from 529468 words
2018-05-24 09:09:53,343 : INFO : built huffman tree with maximum node depth 25
2018-05-24 09:09:54,448 : INFO : estimated required memory for 529468 words and 100 dimensions: 1005989200 bytes
2018-05-24 09:09:54,448 : INFO : resetting layer weights
2018-05-24 09:09:59,394 : INFO : training model with 3 workers on 529468 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=3
2018-05-24 09:13:36,289 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-05-24 09:13:36,296 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-05-24 09:13:36,299 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-05-24 09:13:36,299 : INFO : EPOCH - 5 : training on 31656800 raw words (27122332 effective words) took 43.8s, 618655 effective words/s
2018-05-24 09:13:36,299 : INFO : training on a 158284000 raw words (135611129 effective words) took 216.9s, 625209 effective words/s
2018-05-24 09:13:36,299 : INFO : storing 529468x100 projection weights into model.txt
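A few of the figures in this log can be cross-checked with plain arithmetic. The sketch below is my reading of how gensim 3.x arrives at its memory estimate: about 700 bytes of bookkeeping per word when hs=1, plus three float32 matrices (the word vectors, syn1 for hierarchical softmax, and syn1neg for negative sampling). It also uses the usual word2vec subsampling formula for the keep probability. The function names and the 1,000,000 word count are hypothetical, chosen only for illustration.

```python
import math

# Figures copied from the training log above.
VOCAB_SIZE = 529468
VECTOR_SIZE = 100
RAW_WORDS = 31656800
EPOCHS = 5
SAMPLE = 0.001
FLOAT32_BYTES = 4


def estimate_memory(vocab_size, vector_size, hs=True, negative=True):
    """Assumed breakdown of the logged memory estimate, in bytes."""
    report = {
        # Rough per-word bookkeeping cost; higher when Huffman codes are kept.
        "vocab": vocab_size * (700 if hs else 500),
        # Input word vectors.
        "vectors": vocab_size * vector_size * FLOAT32_BYTES,
    }
    if hs:
        # Output weights used by hierarchical softmax.
        report["syn1"] = vocab_size * vector_size * FLOAT32_BYTES
    if negative:
        # Output weights used by negative sampling.
        report["syn1neg"] = vocab_size * vector_size * FLOAT32_BYTES
    report["total"] = sum(report.values())
    return report


def keep_probability(word_count, total_words, sample):
    """Chance that a single occurrence of a word survives subsampling."""
    threshold = sample * total_words
    prob = (math.sqrt(word_count / threshold) + 1) * (threshold / word_count)
    return min(prob, 1.0)


memory = estimate_memory(VOCAB_SIZE, VECTOR_SIZE)
total_raw_words = RAW_WORDS * EPOCHS

print(memory["total"])      # compare with "estimated required memory" in the log
print(total_raw_words)      # compare with the final "training on a ... raw words" line
print(keep_probability(1000000, RAW_WORDS, SAMPLE))  # hypothetical very frequent word
```

With hs=1 and negative=5 both active, as the log shows, the model carries two output matrices on top of the word vectors, which accounts for a large part of the estimated memory. Rare words are always kept; only words far above the sample threshold are thinned out, which is consistent with just 32 words being downsampled.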
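from gensim.models import word2vec
# Corpus: about 200 MB of social science and humanities articles
sentences = word2vec.LineSentence('./语料.txt')
model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, size=100)
The resulting model is about 600 MB.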
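Test sentences
He is busy every day, has a lot of work to handle, has no time to rest, and gives up his holidays
------------------------------------------------------------------------
Attribute: right, success 0.491986
Mood: pain, sadness 0.464617
The people in this city are all lazy, always spend a lot of time chatting, and never do anything meaningful
------------------------------------------------------------------------
Attribute: right, success 0.548144
Mood: happy, joyful 0.522422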
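The post does not say how these scores were computed. One common way to get a number like this from word vectors, and only an assumption here, is to average the vectors of the words in the sentence and take the cosine similarity with the vector of each label word. The sketch below shows that idea with hypothetical 3-dimensional toy vectors in place of the real 100-dimensional ones; none of the values come from the trained model.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, made up for illustration only.
toy_vectors = {
    "busy": [0.9, 0.1, 0.2],
    "work": [0.8, 0.2, 0.1],
    "rest": [0.1, 0.9, 0.3],
    "success": [0.7, 0.3, 0.2],
    "sadness": [0.2, 0.8, 0.4],
}

# Represent the sentence as the mean of its word vectors.
sentence = mean_vector([toy_vectors[w] for w in ("busy", "work", "rest")])

# Score the sentence against each label word.
scores = {
    label: cosine(sentence, toy_vectors[label])
    for label in ("success", "sadness")
}

for label, score in scores.items():
    print(label, round(score, 6))
```

gensim's KeyedVectors also has an n_similarity method that compares the mean vectors of two word lists in the same spirit. Averaging ignores word order and negation, which may be one reason a sentence about laziness can still score high for "success" and "happy".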