Training word vectors with gensim's word2vec on an English corpus

1. Preprocess the corpus

Batch-process the corpus files: strip the punctuation from each article and append the cleaned text to a single new file.

import os

txt_path = "path/to/reports"   # directory holding the raw .txt files (placeholder path)
save_path = 'D:/all.txt'       # merged, punctuation-free output file

txt_names = sorted(os.listdir(txt_path))
fo = open(save_path, 'w', encoding='utf-8')  # open once; opening with 'w' inside the loop would overwrite earlier files
for name in txt_names:
    f = open(os.path.join(txt_path, name), 'r')  # open one patient report
    text = f.read()  # read its full text
    f.close()
    str_out = ' '.join(text.split()).replace(',', '').replace('.', '').replace('?', '').replace('!', '') \
        .replace('"', '').replace('@', '').replace(':', '').replace('(', '').replace(')', '') \
        .replace('-', '').replace('#', '').replace('/', '').replace('+', '').replace("'", '') \
        .replace('*', '')  # strip punctuation
    fo.write(str_out + ' ')
fo.close()
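Incidentally, the long chain of replace() calls can be written more compactly with str.translate. This is a minimal sketch of the same cleanup (strip_punct is a hypothetical helper, and the character set is assumed to be exactly the punctuation removed above):

def strip_punct(text):
    # translation table that deletes the same punctuation
    # characters as the replace() chain above
    table = str.maketrans('', '', ',.?!"@:()-#/+\'*')
    return ' '.join(text.split()).translate(table)

print(strip_punct("Chest X-ray: no acute findings."))  # -> Chest Xray no acute findings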

2. Train the word vectors and save the model

from gensim.models import word2vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus('D:/all.txt')  # whitespace-tokenised corpus
# sg=1: skip-gram; size: vector dimensionality; window: context window;
# min_count: drop words seen fewer than 5 times; hs=1: hierarchical softmax;
# negative: number of negative-sampling noise words; workers: training threads
model = word2vec.Word2Vec(sentences, sg=1, size=100, window=5, min_count=5,
                          negative=3, sample=0.001, hs=1, workers=4)
model.save('D:/text83.model')
# gensim 3.x style; in gensim 4.x use model.wv['of'], model.wv.most_similar(), model.wv.similarity()
print(model['of'])

y2 = model.most_similar("of", topn=20)  # the 20 most similar words
print(u"Words most similar to of:\n")
for item in y2:
    print(item[0], item[1])
print("--------\n")

y1 = model.similarity("of", "the")
print(u"Similarity between of and the:", y1)
print("--------\n")

model.wv.save_word2vec_format('D:/w2v_mod.txt', binary=False)  # export plain-text vectors
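The saved D:/text83.model can be reloaded in a later session without retraining. A minimal sketch, assuming the paths above:

from gensim.models import word2vec

# reload the full model (vocabulary, weights and training state)
model = word2vec.Word2Vec.load('D:/text83.model')
print(model.wv.similarity('of', 'the'))  # same query as above, via the wv attribute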

3. Run output

C:\ProgramData\Anaconda3\python.exe C:/DOSB/python/1.py
2020-04-11 20:11:54,404 : INFO : collecting all words and their counts
2020-04-11 20:11:54,404 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-11 20:11:54,405 : INFO : collected 598 word types from a corpus of 1253 raw words and 1 sentences
2020-04-11 20:11:54,405 : INFO : Loading a fresh vocabulary
2020-04-11 20:11:54,405 : INFO : effective_min_count=5 retains 40 unique words (6% of original 598, drops 558)
2020-04-11 20:11:54,405 : INFO : effective_min_count=5 leaves 504 word corpus (40% of original 1253, drops 749)
2020-04-11 20:11:54,405 : INFO : deleting the raw counts dictionary of 598 items
2020-04-11 20:11:54,405 : INFO : sample=0.001 downsamples 40 most-common words
2020-04-11 20:11:54,405 : INFO : downsampling leaves estimated 114 word corpus (22.8% of prior 504)
2020-04-11 20:11:54,405 : INFO : constructing a huffman tree from 40 words
2020-04-11 20:11:54,406 : INFO : built huffman tree with maximum node depth 7
2020-04-11 20:11:54,406 : INFO : estimated required memory for 40 words and 100 dimensions: 76000 bytes
2020-04-11 20:11:54,406 : INFO : resetting layer weights
2020-04-11 20:11:54,407 : INFO : training model with 4 workers on 40 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=3 window=5
2020-04-11 20:11:54,411 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-04-11 20:11:54,411 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-11 20:11:54,411 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-11 20:11:54,411 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-11 20:11:54,411 : INFO : EPOCH - 1 : training on 1253 raw words (107 effective words) took 0.0s, 68231 effective words/s
2020-04-11 20:11:54,422 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-04-11 20:11:54,422 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-11 20:11:54,422 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-11 20:11:54,423 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-11 20:11:54,423 : INFO : EPOCH - 2 : training on 1253 raw words (123 effective words) took 0.0s, 53898 effective words/s
2020-04-11 20:11:54,426 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-04-11 20:11:54,426 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-11 20:11:54,426 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-11 20:11:54,426 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-11 20:11:54,426 : INFO : EPOCH - 3 : training on 1253 raw words (116 effective words) took 0.0s, 75827 effective words/s
2020-04-11 20:11:54,429 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-04-11 20:11:54,429 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-11 20:11:54,429 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-11 20:11:54,429 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-11 20:11:54,429 : INFO : EPOCH - 4 : training on 1253 raw words (126 effective words) took 0.0s, 85878 effective words/s
2020-04-11 20:11:54,432 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-04-11 20:11:54,432 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-11 20:11:54,432 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-11 20:11:54,432 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-11 20:11:54,432 : INFO : EPOCH - 5 : training on 1253 raw words (114 effective words) took 0.0s, 77567 effective words/s
2020-04-11 20:11:54,432 : INFO : training on a 6265 raw words (586 effective words) took 0.0s, 22721 effective words/s
2020-04-11 20:11:54,432 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
2020-04-11 20:11:54,432 : INFO : saving Word2Vec object under D:/text83.model, separately None
2020-04-11 20:11:54,433 : INFO : not storing attribute vectors_norm
2020-04-11 20:11:54,433 : INFO : not storing attribute cum_table
2020-04-11 20:11:54,435 : INFO : saved D:/text83.model
C:/DOSB/python/1.py:7: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
  print(model['of'])
C:/DOSB/python/1.py:9: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
  y2 = model.most_similar("of", topn=20)  # 20个最相关的
2020-04-11 20:11:54,436 : INFO : precomputing L2-norms of word weight vectors
C:/DOSB/python/1.py:15: DeprecationWarning: Call to deprecated `similarity` (Method will be removed in 4.0.0, use self.wv.similarity() instead).
  y1 = model.similarity("of", "the")
2020-04-11 20:11:54,437 : INFO : storing 40x100 projection weights into D:/w2v_mod.txt
[ 4.5119943e-03  1.1110865e-03  4.7830464e-03  3.2766715e-03
  1.4255361e-05  4.7274330e-03  2.6914128e-03 -5.1430068e-03
  7.1067740e-03  2.3664068e-03  4.8416734e-04  4.6614781e-03
 -1.6550183e-03 -1.3269378e-03 -9.4731981e-03  7.4856781e-04
  1.7661846e-03  1.2955075e-04 -3.2728771e-03  9.0239588e-03
  7.1609071e-03  2.2960042e-03  5.9234416e-03 -2.4910760e-04
  4.5021693e-03 -4.2646695e-03  2.3406495e-03 -3.0547287e-03
  3.6683748e-05  3.8325130e-03 -4.9320315e-03 -5.0613913e-03
  3.6021832e-03 -2.3138372e-03  1.7006524e-03  6.1490024e-03
  6.9686287e-04 -6.3259329e-04 -4.1721454e-03  2.5062754e-03
 -5.1129638e-04  4.3048696e-03 -2.7197471e-04 -3.2922096e-04
  1.4095093e-03 -9.4465399e-03 -2.8034057e-03  2.7519900e-03
  5.6970068e-03 -5.6160893e-03 -4.8419149e-03 -3.1580466e-03
 -7.8896349e-03 -1.0767973e-02  1.4045762e-03  2.0113366e-03
 -5.5731642e-03  1.5514770e-03  4.1259183e-03  7.0777321e-03
  1.1041010e-03 -8.7191779e-03  8.4786414e-04  1.2240792e-03
 -5.8230339e-03 -1.2301673e-03  4.4685923e-03  6.5877289e-03
  2.3564792e-04 -2.6430183e-03 -4.8967930e-03  2.9890998e-03
 -9.0697380e-03 -7.7300547e-03  6.1238389e-03 -1.1534996e-03
 -2.6939285e-04  7.2782440e-03  4.9038430e-05 -6.6716187e-03
 -3.4231436e-03 -1.3689960e-03 -1.9137915e-03 -3.5094414e-04
  6.0740514e-03  5.0583384e-03  1.1700102e-03  2.2965777e-03
  4.3817397e-04  2.8374954e-04 -4.5496868e-03  6.2393737e-03
  5.8327024e-03 -4.6927286e-03  6.4144228e-03  1.7604369e-04
 -2.3924143e-03  1.6456917e-03 -2.2331469e-03  3.4285260e-03]
Words most similar to of:
was 0.6871980428695679
and 0.6559514999389648
the 0.6459410190582275
patient 0.5905261635780334
to 0.585208535194397
His 0.5622400641441345
an 0.5490080118179321
breath 0.548919141292572
a 0.5240625739097595
were 0.5178908109664917
for 0.5129812955856323
with 0.5029577016830444
in 0.49395352602005005
left 0.4854786694049835
low 0.4786217510700226
The 0.478326678276062
20 0.47305724024772644
his 0.45636075735092163
no 0.4553608000278473
had 0.44865986704826355
--------

Similarity between of and the: 0.645941
--------


Process finished with exit code 0
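Because D:/w2v_mod.txt was exported with save_word2vec_format(binary=False), it is in the standard word2vec text format and can be loaded back as lightweight KeyedVectors, independently of the full model. A minimal sketch, assuming the same path:

from gensim.models import KeyedVectors

# load the plain-text vectors exported in step 2; no full model needed for queries
wv = KeyedVectors.load_word2vec_format('D:/w2v_mod.txt', binary=False)
print(wv['of'])                       # the 100-dimensional vector for "of"
print(wv.most_similar('of', topn=5))  # five nearest neighbours by cosine similarity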
