gensim Notes 02: Building a word2vec Word-Vector Model with gensim

1. Basic usage

Configure logging

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Build the tokenized corpus

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

Create the model

from gensim.models import word2vec

model = word2vec.Word2Vec(texts, min_count=1)

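Parameter notes:

  • min_count: the minimum word frequency you need depends on the size of the corpus. In a larger corpus you usually want to ignore words that occur only once or twice, and min_count sets that cutoff: words seen fewer than min_count times are dropped. Reasonable values generally lie between 0 and 100; the default is 5.
  • size: sets the dimensionality of the word vectors, which is the size of the network's hidden layer (not the number of layers). The default in Word2Vec is 100. Larger values need more training data but can give more accurate models; reasonable values range from tens to a few hundred. gensim 4.0 renamed this parameter to vector_size.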
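
To make the effect of min_count concrete, here is a minimal sketch in plain Python of frequency-based vocabulary pruning. It is not gensim's implementation: prune_vocab is a hypothetical helper introduced only for illustration, and the sample reuses the first three tokenized documents from the example above.

```python
from collections import Counter

def prune_vocab(texts, min_count):
    """Hypothetical helper: keep only words that occur at least min_count times."""
    counts = Counter(word for text in texts for word in text)
    return {word for word, n in counts.items() if n >= min_count}

# A small tokenized sample in the same shape as `texts` above.
sample = [
    ["human", "machine", "interface", "lab", "abc", "computer", "applications"],
    ["survey", "user", "opinion", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "management", "system"],
]

vocab_all = prune_vocab(sample, min_count=1)    # every distinct word survives
vocab_twice = prune_vocab(sample, min_count=2)  # only words seen at least twice
print(sorted(vocab_twice))
```

On a toy corpus like this one, almost every word is rare, which is presumably why the example passes min_count=1 explicitly instead of relying on the default of 5.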

2. Using the Wikipedia corpus

Corpus download link
Extraction code: 2ugv

Train a model from the corpus

import logging
import os.path
import sys
import multiprocessing
from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':

    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    # check and process input arguments
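    if len(sys.argv) < 4:
        # the script has no module docstring, so print an explicit usage line
        print("usage: python %s <segmented_corpus> <model_out> <vectors_out>" % program)
        sys.exit(1)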
    inp, outp1, outp2 = sys.argv[1:4]
    # `size` is the gensim 3.x name; gensim 4.0 renamed it to `vector_size`
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5, workers=multiprocessing.cpu_count())
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)
# python word2vec_model.py zh.jian.wiki.seg.txt wiki.zh.text.model wiki.zh.text.vector
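# Run this file from a terminal. word2vec_model.py is the name of this script; the first argument after it is the segmented Wikipedia corpus, the second is the file name under which the trained model is saved, and the third is the file name for the exported vectors.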
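
The script feeds the corpus to Word2Vec through LineSentence, which expects a text file with one sentence per line and tokens separated by whitespace (hence the pre-segmented input file). As a rough illustration of that input contract, here is a plain-Python stand-in. iter_line_sentences is a hypothetical name, the two demo sentences are made up, and the sketch leaves out details of the real class.

```python
import os
import tempfile

def iter_line_sentences(path):
    """Hypothetical stand-in for LineSentence: yield one token list per line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if tokens:  # skip blank lines (a simplification of this sketch)
                yield tokens

# Write a tiny two-line "segmented corpus" to a temporary file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("数学 是 一 门 学科\n篮球 是 一 项 运动\n")
    demo_path = tmp.name

sentences = list(iter_line_sentences(demo_path))
os.remove(demo_path)
print(sentences)
```

Reading the file lazily, one line at a time, is what lets a corpus the size of Wikipedia be streamed without loading it into memory.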


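When the run finishes, the two output files named on the command line are written: the saved model (wiki.zh.text.model in the example command) and the exported vectors (wiki.zh.text.vector).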
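
The vector file written by save_word2vec_format with binary=False is plain text: a header line holding the vocabulary size and the vector dimensionality, followed by one line per word with the word and its components separated by spaces. The sketch below parses that layout with the standard library only. read_word2vec_text is a hypothetical helper and the two-word sample is made up; in practice gensim's own KeyedVectors.load_word2vec_format is the loader to use.

```python
import io

def read_word2vec_text(handle):
    """Parse the plain-text word2vec format into a {word: [floats]} dict."""
    header = handle.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        if not line.strip():
            continue  # tolerate a trailing blank line
        parts = line.rstrip("\n").split(" ")
        word, values = parts[0], [float(x) for x in parts[1:] if x]
        if len(values) != vector_size:
            raise ValueError("unexpected vector length for %r" % word)
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError("header promised %d words, found %d" % (vocab_size, len(vectors)))
    return vectors

# Made-up two-word, three-dimensional file contents, for illustration only.
demo = io.StringIO("2 3\n苹果 0.1 0.2 0.3\n数学 -0.4 0.5 0.6\n")
vectors = read_word2vec_text(demo)
print(vectors["数学"])
```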

Test the trained model

from gensim.models import Word2Vec

wiki_word2vec_model = Word2Vec.load('wiki.zh.text.model')

test_word = ['苹果', '数学', '学术', '白痴', '篮球']
for word in test_word:
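    # query through .wv: calling most_similar on the model itself is deprecated and removed in gensim 4.0
    res = wiki_word2vec_model.wv.most_similar(word)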
    print(word)
    print(res)

Output

苹果
[('洋葱', 0.5287384986877441), ('apple', 0.5183306932449341), ('好想变', 0.49525684118270874), ('苹果公司', 0.4800078570842743), ('咬一口', 0.4660457372665405), ('饼干', 0.46428003907203674), ('柳丁', 0.4600396454334259), ('葡萄柚', 0.45942795276641846), ('西打', 0.45623183250427246), ('水果', 0.45601674914360046)]
数学
[('微积分', 0.7172355055809021), ('算术', 0.7024766206741333), ('数学分析', 0.6578953862190247), ('概率论', 0.6521844267845154), ('统计学', 0.6293018460273743), ('高等数学', 0.6234102249145508), ('数论', 0.6179691553115845), ('逻辑学', 0.615193247795105), ('拓扑学', 0.6122117042541504), ('解析几何', 0.6048047542572021)]
学术
[('学术研究', 0.7457844018936157), ('汉学', 0.6014567613601685), ('教研', 0.583657443523407), ('学术界', 0.5773233771324158), ('史学', 0.566792905330658), ('科研', 0.5661078691482544), ('学术思想', 0.5613530874252319), ('社会科学', 0.5518584251403809), ('科学研究', 0.5512319207191467), ('学术活动', 0.5493532419204712)]
白痴
[('疯子', 0.6032769083976746), ('爱哭鬼', 0.564818263053894), ('书呆子', 0.561726450920105), ('骗子', 0.5472438335418701), ('笨蛋', 0.538104772567749), ('萝莉控', 0.5373555421829224), ('傻子', 0.5323609113693237), ('小聪明', 0.5319529175758362), ('老哥', 0.5271468162536621), ('调皮', 0.5128688216209412)]
篮球
[('美式足球', 0.6249872446060181), ('橄榄球', 0.5859279036521912), ('男子篮球', 0.5840273499488831), ('棒球', 0.5770983695983887), ('冰球', 0.5764968991279602), ('排球', 0.5746052265167236), ('篮球队', 0.5532582998275757), ('中国篮球', 0.541188657283783), ('网球', 0.5303999185562134), ('足球', 0.5260317921638489)]
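
Each pair in the output above is a neighbouring word and its cosine similarity to the query word, sorted from most to least similar. The following is a minimal brute-force sketch of that ranking in plain Python. The function names and the 2-dimensional vectors are made up for illustration, and gensim's actual implementation is vectorized, not a Python loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors, topn=10):
    """Hypothetical brute-force nearest-neighbour query over a {word: vector} dict."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-dimensional vectors, not taken from the trained model.
toy_vectors = {
    "篮球": [1.0, 0.0],
    "排球": [0.8, 0.6],
    "数学": [0.0, 1.0],
}
ranking = most_similar("篮球", toy_vectors, topn=2)
print(ranking)
```

Because cosine similarity depends only on direction, a score near 1 means the two words appear in very similar contexts, while a score near 0 means their vectors are close to orthogonal.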
