Training word vectors with gensim's word2vec on an English corpus

1. Preprocess the corpus

Batch-process the corpus files: strip the punctuation from each article and append the cleaned text to a single new file.

import os

txt_path = "path/to/reports"   # directory holding the raw .txt files (placeholder path)
save_path = 'D:/all.txt'       # merged, punctuation-free output file

txt_names = sorted(os.listdir(txt_path))
fo = open(save_path, 'w', encoding='utf-8')  # open once; opening with 'w' inside the loop would overwrite earlier files
for name in txt_names:
    f = open(os.path.join(txt_path, name), 'r')  # open one patient report
    text = f.read()  # read its full text
    f.close()
    str_out = ' '.join(text.split()).replace(',', '').replace('.', '').replace('?', '').replace('!', '') \
        .replace('"', '').replace('@', '').replace(':', '').replace('(', '').replace(')', '') \
        .replace('-', '').replace('#', '').replace('/', '').replace('+', '').replace("'", '') \
        .replace('*', '')  # strip punctuation
    fo.write(str_out + ' ')
fo.close()
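Incidentally, the long chain of replace() calls can be written more compactly with str.translate. This is a minimal sketch of the same cleanup (strip_punct is a hypothetical helper, and the character set is assumed to be exactly the punctuation removed above):

def strip_punct(text):
    # translation table that deletes the same punctuation
    # characters as the replace() chain above
    table = str.maketrans('', '', ',.?!"@:()-#/+\'*')
    return ' '.join(text.split()).translate(table)

print(strip_punct("Chest X-ray: no acute findings."))  # -> Chest Xray no acute findings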

2. Train the word vectors and save the model

from gensim.models import word2vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus('D:/all.txt')  # whitespace-tokenised corpus
# sg=1: skip-gram; size: vector dimensionality; window: context window;
# min_count: drop words seen fewer than 5 times; hs=1: hierarchical softmax;
# negative: number of negative-sampling noise words; workers: training threads
model = word2vec.Word2Vec(sentences, sg=1, size=100, window=5, min_count=5,
                          negative=3, sample=0.001, hs=1, workers=4)
model.save('D:/text83.model')
# gensim 3.x style; in gensim 4.x use model.wv['of'], model.wv.most_similar(), model.wv.similarity()
print(model['of'])

y2 = model.most_similar("of", topn=20)  # the 20 most similar words
print(u"Words most similar to of:\n")
for item in y2:
    print(item[0], item[1])
print("--------\n")

y1 = model.similarity("of", "the")
print(u"Similarity between of and the:", y1)
print("--------\n")

model.wv.save_word2vec_format('D:/w2v_mod.txt', binary=False)  # export plain-text vectors
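The saved D:/text83.model can be reloaded in a later session without retraining. A minimal sketch, assuming the paths above:

from gensim.models import word2vec

# reload the full model (vocabulary, weights and training state)
model = word2vec.Word2Vec.load('D:/text83.model')
print(model.wv.similarity('of', 'the'))  # same query as above, via the wv attribute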

3. Run output

C:\ProgramData\Anaconda3\python.exe C:/DOSB/python/1.py
2020-04-11 20:11:54,404 : INFO : collecting all words and their counts
2020-04-11 20:11:54,404 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-11 20:11:54,405 : INFO : collected 598 word types from a corpus of 1253 raw words and 1 sentences
2020-04-11 20:11:54,405 : INFO : Loading a fresh vocabulary
2020-04-11 20:11:54,405 : INFO : effective_min_count=5 retains 40 unique words (6% of original 598, drops 558)
2020-04-11 20:11:54,405 : INFO : effective_min_count=5 leaves 504 word corpus (40% of original 1253, drops 749)
2020-04-11 20:11:54,405 : INFO : deleting the raw counts dictionary of 598 items
2020-04-11 20:11:54,405 : INFO : sample=0.001 downsamples 40 most-common words
2020-04-11 20:11:54,405 : INFO : downsampling leaves estimated 114 word corpus (22.8% of prior 504)
2020-04-11 20:11:54,405 : INFO : constructing a huffman tree from 40 words
2020-04-11 20:11:54,406 : INFO : built huffman tree with maximum node depth 7
2020-04-11 20:11:54,406 : INFO : estimated required memory for 40 words and 100 dimensions: 76000 bytes
2020-04-11 20:11:54,406 : INFO : resetting layer weights
2020-04-11 20:11:54,407 : INFO : training model with 4 workers on 40 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=3 window=5
2020-04-11 20:11:54,411 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-04-11 20:11:54,411 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-11 20:11:54,411 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-11 20:11:54,411 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-11 20:11:54,411 : INFO : EPOCH - 1 : training on 1253 raw words (107 effective words) took 0.0s, 68231 effective words/s
2020-04-11 20:11:54,422 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-04-11 20:11:54,422 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-11 20:11:54,422 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-11 20:11:54,423 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-11 20:11:54,423 : INFO : EPOCH - 2 : training on 1253 raw words (123 effective words) took 0.0s, 53898 effective words/s
2020-04-11 20:11:54,426 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-04-11 20:11:54,426 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-11 20:11:54,426 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-11 20:11:54,426 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-11 20:11:54,426 : INFO : EPOCH - 3 : training on 1253 raw words (116 effective words) took 0.0s, 75827 effective words/s
2020-04-11 20:11:54,429 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-04-11 20:11:54,429 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-11 20:11:54,429 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-11 20:11:54,429 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-11 20:11:54,429 : INFO : EPOCH - 4 : training on 1253 raw words (126 effective words) took 0.0s, 85878 effective words/s
2020-04-11 20:11:54,432 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-04-11 20:11:54,432 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-04-11 20:11:54,432 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-04-11 20:11:54,432 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-04-11 20:11:54,432 : INFO : EPOCH - 5 : training on 1253 raw words (114 effective words) took 0.0s, 77567 effective words/s
2020-04-11 20:11:54,432 : INFO : training on a 6265 raw words (586 effective words) took 0.0s, 22721 effective words/s
2020-04-11 20:11:54,432 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
2020-04-11 20:11:54,432 : INFO : saving Word2Vec object under D:/text83.model, separately None
2020-04-11 20:11:54,433 : INFO : not storing attribute vectors_norm
2020-04-11 20:11:54,433 : INFO : not storing attribute cum_table
2020-04-11 20:11:54,435 : INFO : saved D:/text83.model
C:/DOSB/python/1.py:7: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
  print(model['of'])
C:/DOSB/python/1.py:9: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
  y2 = model.most_similar("of", topn=20)  # 20个最相关的
2020-04-11 20:11:54,436 : INFO : precomputing L2-norms of word weight vectors
C:/DOSB/python/1.py:15: DeprecationWarning: Call to deprecated `similarity` (Method will be removed in 4.0.0, use self.wv.similarity() instead).
  y1 = model.similarity("of", "the")
2020-04-11 20:11:54,437 : INFO : storing 40x100 projection weights into D:/w2v_mod.txt
[ 4.5119943e-03  1.1110865e-03  4.7830464e-03  3.2766715e-03
  1.4255361e-05  4.7274330e-03  2.6914128e-03 -5.1430068e-03
  7.1067740e-03  2.3664068e-03  4.8416734e-04  4.6614781e-03
 -1.6550183e-03 -1.3269378e-03 -9.4731981e-03  7.4856781e-04
  1.7661846e-03  1.2955075e-04 -3.2728771e-03  9.0239588e-03
  7.1609071e-03  2.2960042e-03  5.9234416e-03 -2.4910760e-04
  4.5021693e-03 -4.2646695e-03  2.3406495e-03 -3.0547287e-03
  3.6683748e-05  3.8325130e-03 -4.9320315e-03 -5.0613913e-03
  3.6021832e-03 -2.3138372e-03  1.7006524e-03  6.1490024e-03
  6.9686287e-04 -6.3259329e-04 -4.1721454e-03  2.5062754e-03
 -5.1129638e-04  4.3048696e-03 -2.7197471e-04 -3.2922096e-04
  1.4095093e-03 -9.4465399e-03 -2.8034057e-03  2.7519900e-03
  5.6970068e-03 -5.6160893e-03 -4.8419149e-03 -3.1580466e-03
 -7.8896349e-03 -1.0767973e-02  1.4045762e-03  2.0113366e-03
 -5.5731642e-03  1.5514770e-03  4.1259183e-03  7.0777321e-03
  1.1041010e-03 -8.7191779e-03  8.4786414e-04  1.2240792e-03
 -5.8230339e-03 -1.2301673e-03  4.4685923e-03  6.5877289e-03
  2.3564792e-04 -2.6430183e-03 -4.8967930e-03  2.9890998e-03
 -9.0697380e-03 -7.7300547e-03  6.1238389e-03 -1.1534996e-03
 -2.6939285e-04  7.2782440e-03  4.9038430e-05 -6.6716187e-03
 -3.4231436e-03 -1.3689960e-03 -1.9137915e-03 -3.5094414e-04
  6.0740514e-03  5.0583384e-03  1.1700102e-03  2.2965777e-03
  4.3817397e-04  2.8374954e-04 -4.5496868e-03  6.2393737e-03
  5.8327024e-03 -4.6927286e-03  6.4144228e-03  1.7604369e-04
 -2.3924143e-03  1.6456917e-03 -2.2331469e-03  3.4285260e-03]
Words most similar to of:
was 0.6871980428695679
and 0.6559514999389648
the 0.6459410190582275
patient 0.5905261635780334
to 0.585208535194397
His 0.5622400641441345
an 0.5490080118179321
breath 0.548919141292572
a 0.5240625739097595
were 0.5178908109664917
for 0.5129812955856323
with 0.5029577016830444
in 0.49395352602005005
left 0.4854786694049835
low 0.4786217510700226
The 0.478326678276062
20 0.47305724024772644
his 0.45636075735092163
no 0.4553608000278473
had 0.44865986704826355
--------

Similarity between of and the: 0.645941
--------


Process finished with exit code 0
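Because D:/w2v_mod.txt was exported with save_word2vec_format(binary=False), it is in the standard word2vec text format and can be loaded back as lightweight KeyedVectors, independently of the full model. A minimal sketch, assuming the same path:

from gensim.models import KeyedVectors

# load the plain-text vectors exported in step 2; no full model needed for queries
wv = KeyedVectors.load_word2vec_format('D:/w2v_mod.txt', binary=False)
print(wv['of'])                       # the 100-dimensional vector for "of"
print(wv.most_similar('of', topn=5))  # five nearest neighbours by cosine similarity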
