A first pass at word2vec in Python

This is a well-written, easy-to-follow article about word2vec, reposted here.
I originally meant to just post a link, but decided to copy the whole thing instead.

1. Preface

At first, setting up word2vec looked fairly involved: I spent half a day installing Cygwin without really figuring it out. Then it occurred to me that instead of the C version I could simply use the Python one, and that led me to gensim: install the gensim package and word2vec is ready to use. Note that gensim implements both of word2vec's training schemes, CBOW and skip-gram (CBOW is the default); for other tools or variants you would have to look at word2vec implementations in other languages.

 

2. Preparing the corpus

Once gensim is installed, most tutorials online just pass in a txt file, but they rarely explain what that file should look like or what data format it uses, and none provide a sample file for download. It turns out that this txt file is simply a large body of text that has already been word-segmented. The corpus shown below is one I built myself: I took 7,000 news articles that I had previously crawled and segmented them. Note that the words must be separated by spaces:

(Figure: a sample of the segmented corpus file, omitted here)
Segmentation here is done with jieba (结巴分词).
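The corpus format itself is nothing special: one document per line, tokens separated by single spaces. A minimal stdlib-only sketch of writing and reading such a file (the tokens below are made up for illustration, not taken from the real news corpus):

```python
# Build a tiny corpus file in the space-separated format word2vec tools expect.
# The tokens here are hypothetical examples.
segmented_docs = [
    ["北京", "出台", "控烟", "条例"],
    ["记者", "报道", "控烟", "执法", "进展"],
]

with open("fenci_demo.txt", "w", encoding="utf-8") as f:
    for doc in segmented_docs:
        f.write(" ".join(doc) + "\n")

# Reading it back: one whitespace-split token list per line,
# which is exactly the sentence structure word2vec training consumes.
with open("fenci_demo.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

print(sentences[0])
```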

The code for this step is as follows:

import jieba

f1 = open("fenci.txt")
f2 = open("fenci_result.txt", "a")
lines = f1.readlines()  # read the whole file
for line in lines:
    # strip tabs, newlines and pre-existing spaces before segmenting
    # (str.replace returns a new string, so the result must be assigned)
    line = line.replace('\t', '').replace('\n', '').replace(' ', '')
    seg_list = jieba.cut(line, cut_all=False)
    f2.write(" ".join(seg_list) + "\n")  # one segmented line per input line

f1.close()
f2.close()

One more thing to note: the corpus really needs to be large. Most corpora you see online are several gigabytes. At first I used a single news article as the corpus, and the results were terrible: every output was 0. I then switched to 7,000 news articles; after segmentation the resulting fenci_result.txt was about 20 MB, which is still small, but already enough to get useful first results.

 

3. Training a model with gensim's word2vec

The code is as follows:

from gensim.models import word2vec
import logging

# main program
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus(u"fenci_result.txt")  # load the corpus
model = word2vec.Word2Vec(sentences, size=200)  # train the model (CBOW by default, window=5)

print model

# similarity between two words
try:
    y1 = model.similarity(u"国家", u"国务院")
except KeyError:
    y1 = 0
print u"【国家】和【国务院】的相似度为:", y1
print "-----\n"

# the most related words for a given word
y2 = model.most_similar(u"控烟", topn=20)  # top 20 most related
print u"和【控烟】最相关的词有:\n"
for item in y2:
    print item[0], item[1]
print "-----\n"

# analogies: 书 is to 不错 as 质量 is to ?
print u"书-不错,质量-"
y3 = model.most_similar([u'质量', u'不错'], [u'书'], topn=3)
for item in y3:
    print item[0], item[1]
print "----\n"

# find the word that does not belong
y4 = model.doesnt_match(u"书 书籍 教材 很".split())
print u"不合群的词:", y4
print "-----\n"

# save the model for reuse
model.save(u"书评.model")
# corresponding load:
# model_2 = word2vec.Word2Vec.load("text8.model")

# store the word vectors in a format the original C tool can parse
# model.save_word2vec_format(u"书评.model.bin", binary=True)
# corresponding load:
# model_3 = word2vec.Word2Vec.load_word2vec_format("text8.model.bin", binary=True)
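Under the hood, model.similarity is just the cosine similarity of the two words' vectors. A stdlib-only sketch with made-up 3-dimensional vectors (the real vectors trained above have 200 dimensions):

```python
import math

def cosine_similarity(v1, v2):
    # dot(v1, v2) / (||v1|| * ||v2||)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# hypothetical toy vectors, for illustration only
vec_a = [0.2, 0.8, 0.1]
vec_b = [0.3, 0.7, 0.4]
print(cosine_similarity(vec_a, vec_b))  # ≈ 0.92
```

Identical directions score 1.0, orthogonal vectors score 0, which is why rare words that barely got trained can still report misleadingly high similarities.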

The output:

"D:\program files\python2.7.0\python.exe" "D:/pycharm workspace/毕设/cluster_test/word2vec.py"
D:\program files\python2.7.0\lib\site-packages\gensim\utils.py:840: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
D:\program files\python2.7.0\lib\site-packages\gensim\utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
  warnings.warn("Pattern library is not installed, lemmatization won't be available.")
2016-12-12 15:37:43,331: INFO: collecting all words and their counts
2016-12-12 15:37:43,332: INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-12-12 15:37:45,236: INFO: collected 99865 word types from a corpus of 3561156 raw words and 357 sentences
2016-12-12 15:37:45,236: INFO: Loading a fresh vocabulary
2016-12-12 15:37:45,413: INFO: min_count=5 retains 29982 unique words (30% of original 99865, drops 69883)
2016-12-12 15:37:45,413: INFO: min_count=5 leaves 3444018 word corpus (96% of original 3561156, drops 117138)
2016-12-12 15:37:45,602: INFO: deleting the raw counts dictionary of 99865 items
2016-12-12 15:37:45,615: INFO: sample=0.001 downsamples 29 most-common words
2016-12-12 15:37:45,615: INFO: downsampling leaves estimated 2804247 word corpus (81.4% of prior 3444018)
2016-12-12 15:37:45,615: INFO: estimated required memory for 29982 words and 200 dimensions: 62962200 bytes
2016-12-12 15:37:45,746: INFO: resetting layer weights
2016-12-12 15:37:46,782: INFO: training model with 3 workers on 29982 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2016-12-12 15:37:46,782: INFO: expecting 357 sentences, matching count from corpus used for vocabulary survey
2016-12-12 15:37:47,818: INFO: PROGRESS: at 1.96% examples, 267531 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:37:48,844: INFO: PROGRESS: at 3.70% examples, 254229 words/s, in_qsize 3, out_qsize 1
2016-12-12 15:37:49,871: INFO: PROGRESS: at 5.99% examples, 273509 words/s, in_qsize 3, out_qsize 1
2016-12-12 15:37:50,867: INFO: PROGRESS: at 8.18% examples, 281557 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:37:51,872: INFO: PROGRESS: at 10.20% examples, 280918 words/s, in_qsize 5, out_qsize 0
2016-12-12 15:37:52,898: INFO: PROGRESS: at 12.44% examples, 284750 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:37:53,911: INFO: PROGRESS: at 14.17% examples, 278948 words/s, in_qsize 0, out_qsize 0
2016-12-12 15:37:54,956: INFO: PROGRESS: at 16.47% examples, 284101 words/s, in_qsize 2, out_qsize 1
2016-12-12 15:37:55,934: INFO: PROGRESS: at 18.60% examples, 285781 words/s, in_qsize 6, out_qsize 1
2016-12-12 15:37:56,933: INFO: PROGRESS: at 20.84% examples, 288045 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:37:57,973: INFO: PROGRESS: at 23.03% examples, 289083 words/s, in_qsize 6, out_qsize 2
2016-12-12 15:37:58,993: INFO: PROGRESS: at 24.87% examples, 285990 words/s, in_qsize 6, out_qsize 1
2016-12-12 15:38:00,006: INFO: PROGRESS: at 27.17% examples, 288266 words/s, in_qsize 4, out_qsize 1
2016-12-12 15:38:01,081: INFO: PROGRESS: at 29.52% examples, 290197 words/s, in_qsize 1, out_qsize 2
2016-12-12 15:38:02,065: INFO: PROGRESS: at 31.88% examples, 292344 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:03,188: INFO: PROGRESS: at 34.01% examples, 291356 words/s, in_qsize 2, out_qsize 2
2016-12-12 15:38:04,161: INFO: PROGRESS: at 36.02% examples, 290805 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:05,174: INFO: PROGRESS: at 38.26% examples, 292174 words/s, in_qsize 3, out_qsize 0
2016-12-12 15:38:06,214: INFO: PROGRESS: at 40.56% examples, 293297 words/s, in_qsize 4, out_qsize 1
2016-12-12 15:38:07,201: INFO: PROGRESS: at 42.69% examples, 293428 words/s, in_qsize 4, out_qsize 1
2016-12-12 15:38:08,266: INFO: PROGRESS: at 44.65% examples, 292108 words/s, in_qsize 1, out_qsize 1
2016-12-12 15:38:09,295: INFO: PROGRESS: at 46.83% examples, 292097 words/s, in_qsize 4, out_qsize 1
2016-12-12 15:38:10,315: INFO: PROGRESS: at 49.13% examples, 292968 words/s, in_qsize 2, out_qsize 2
2016-12-12 15:38:11,326: INFO: PROGRESS: at 51.37% examples, 293621 words/s, in_qsize 5, out_qsize 0
2016-12-12 15:38:12,367: INFO: PROGRESS: at 53.39% examples, 292777 words/s, in_qsize 2, out_qsize 2
2016-12-12 15:38:13,348: INFO: PROGRESS: at 55.35% examples, 292187 words/s, in_qsize 5, out_qsize 0
2016-12-12 15:38:14,349: INFO: PROGRESS: at 57.31% examples, 291656 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:15,374: INFO: PROGRESS: at 59.50% examples, 292019 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:16,403: INFO: PROGRESS: at 61.68% examples, 292318 words/s, in_qsize 4, out_qsize 2
2016-12-12 15:38:17,401: INFO: PROGRESS: at 63.81% examples, 292275 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:18,410: INFO: PROGRESS: at 65.71% examples, 291495 words/s, in_qsize 4, out_qsize 1
2016-12-12 15:38:19,433: INFO: PROGRESS: at 67.62% examples, 290443 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:20,473: INFO: PROGRESS: at 69.58% examples, 289655 words/s, in_qsize 6, out_qsize 2
2016-12-12 15:38:21,589: INFO: PROGRESS: at 71.71% examples, 289388 words/s, in_qsize 2, out_qsize 2
2016-12-12 15:38:22,533: INFO: PROGRESS: at 73.78% examples, 289366 words/s, in_qsize 0, out_qsize 1
2016-12-12 15:38:23,611: INFO: PROGRESS: at 75.46% examples, 287542 words/s, in_qsize 5, out_qsize 1
2016-12-12 15:38:24,614: INFO: PROGRESS: at 77.25% examples, 286609 words/s, in_qsize 3, out_qsize 0
2016-12-12 15:38:25,609: INFO: PROGRESS: at 79.33% examples, 286732 words/s, in_qsize 5, out_qsize 1
2016-12-12 15:38:26,621: INFO: PROGRESS: at 81.40% examples, 286595 words/s, in_qsize 2, out_qsize 0
2016-12-12 15:38:27,625: INFO: PROGRESS: at 83.53% examples, 286807 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:28,683: INFO: PROGRESS: at 85.32% examples, 285651 words/s, in_qsize 5, out_qsize 3
2016-12-12 15:38:29,729: INFO: PROGRESS: at 87.56% examples, 286175 words/s, in_qsize 6, out_qsize 1
2016-12-12 15:38:30,706: INFO: PROGRESS: at 89.86% examples, 286920 words/s, in_qsize 5, out_qsize 0
2016-12-12 15:38:31,714: INFO: PROGRESS: at 92.10% examples, 287368 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:32,756: INFO: PROGRESS: at 94.40% examples, 288070 words/s, in_qsize 4, out_qsize 2
2016-12-12 15:38:33,755: INFO: PROGRESS: at 96.30% examples, 287543 words/s, in_qsize 1, out_qsize 0
2016-12-12 15:38:34,802: INFO: PROGRESS: at 98.71% examples, 288375 words/s, in_qsize 4, out_qsize 0
2016-12-12 15:38:35,286: INFO: worker thread finished; awaiting finish of 2 more threads
2016-12-12 15:38:35,286: INFO: worker thread finished; awaiting finish of 1 more threads
Word2Vec(vocab=29982, size=200, alpha=0.025)




【国家】和【国务院】的相似度为: 0.387535493256
-----

2016-12-12 15:38:35,293: INFO: worker thread finished; awaiting finish of 0 more threads
2016-12-12 15:38:35,293: INFO: training on 17805780 raw words (14021191 effective words) took 48.5s, 289037 effective words/s
2016-12-12 15:38:35,293: INFO: precomputing L2-norms of word weight vectors
和【控烟】最相关的词有:

禁烟 0.6038454175
防烟 0.585186183453
执行 0.530897378922
烟控 0.516572892666
广而告之 0.508533298969
履约 0.507428050041
执法 0.494115233421
禁烟令 0.471616715193
修法 0.465247869492
该项 0.457907706499
落实 0.457776963711
控制 0.455987215042
这方面 0.450040221214
立法 0.44820779562
控烟办 0.436062157154
执行力 0.432559013367
控烟会 0.430508673191
进展 0.430286765099
监管 0.429748386145
惩罚 0.429243773222
-----

书-不错,质量-
生存 0.613928854465
稳定 0.595371186733
整体 0.592055797577
----

不合群的词: 很
-----

2016-12-12 15:38:35,515: INFO: saving Word2Vec object under 书评.model, separately None
2016-12-12 15:38:35,515: INFO: not storing attribute syn0norm
2016-12-12 15:38:35,515: INFO: not storing attribute cum_table
2016-12-12 15:38:36,490: INFO: saved 书评.model

Process finished with exit code 0
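The analogy query in the output (书-不错, 质量-?) works by vector offset: most_similar(positive=[质量, 不错], negative=[书]) ranks words by how close their vectors are to v(质量) + v(不错) - v(书). A toy stdlib sketch with hypothetical 3-d vectors:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# hypothetical toy vectors, for illustration only
vocab = {
    "书":   [1.0, 0.1, 0.0],
    "不错": [0.1, 1.0, 0.0],
    "质量": [0.9, 0.2, 0.3],
    "好":   [0.1, 0.9, 0.4],
    "差":   [0.2, -0.8, 0.1],
}

# target vector: 质量 + 不错 - 书
positive, negative = ["质量", "不错"], ["书"]
target = [0.0, 0.0, 0.0]
for w in positive:
    target = [t + x for t, x in zip(target, vocab[w])]
for w in negative:
    target = [t - x for t, x in zip(target, vocab[w])]

# rank every remaining word by cosine similarity to the target
candidates = [w for w in vocab if w not in positive + negative]
best = max(candidates, key=lambda w: cosine(vocab[w], target))
print(best)
```

With these toy vectors "好" wins, since it points in the same direction as the offset; with the real 20 MB corpus above the top answers were much noisier (生存, 稳定, 整体), which again reflects the small corpus size.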







This article is a repost.
Original: http://blog.csdn.net/xiaoquantouer/article/details/53583980
Original author: 小拳头
