NLP | The gensim Library: gensim for NLP

十三吖

Article link: https://blog.csdn.net/qq_40006058/article/details/86534843

0 Quick Example

#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim import corpora, models, similarities
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]
tfidf = models.TfidfModel(corpus)
vec = [(0, 1), (4, 1)]
print(tfidf[vec])

D:\anaconda\envs\tensorflow\lib\site-packages\gensim\utils.py:1212: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")


[(0, 0.8075244024440723), (4, 0.5898341626740045)]

# Compute the similarity between vec and every document in the corpus
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)
#print(index)
sims = index[tfidf[vec]]
#print(sims)
print(list(enumerate(sims)))

[(0, 0.4662244), (1, 0.19139354), (2, 0.2460055), (3, 0.82094586), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
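
To make the numbers above less opaque, here is a minimal pure-Python sketch of what the quick example appears to compute. It assumes the weighting is count * log2(N / document frequency) followed by scaling each vector to unit length, and that the similarity is the cosine. Those assumptions are inferred from the printed output, and the helper names (document_frequencies, to_tfidf, cosine) are made up for this illustration rather than taken from gensim.

```python
import math

# Bag-of-words corpus from the quick example: one list of (term_id, count) per document.
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]

def document_frequencies(corpus):
    """Count, for every term id, the number of documents that contain it."""
    dfs = {}
    for doc in corpus:
        for term_id, _ in doc:
            dfs[term_id] = dfs.get(term_id, 0) + 1
    return dfs

def to_tfidf(bow, dfs, num_docs):
    """Weight counts by log2(N / df), then scale the vector to unit length."""
    weights = [(t, c * math.log2(num_docs / dfs[t])) for t, c in bow if t in dfs]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(t, w / norm) for t, w in weights] if norm else []

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    v_map = dict(v)
    return sum(w * v_map.get(t, 0.0) for t, w in u)

dfs = document_frequencies(corpus)
query = to_tfidf([(0, 1), (4, 1)], dfs, len(corpus))
sims = [cosine(query, to_tfidf(doc, dfs, len(corpus))) for doc in corpus]
print(query)
print([round(s, 4) for s in sims])
```

Under those assumptions the query vector and the nine similarities agree with the gensim output above up to rounding.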

1 Corpora and Vector Spaces

#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

1.1 From Strings to Vectors

Start with documents represented as strings:

from gensim import corpora
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
# First, tokenize the documents, then remove common words and words that appear only once in the corpus
stoplist = set('for a of to in the and'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
from collections import defaultdict
frequency = defaultdict(int)  # token -> number of occurrences in the corpus
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1] for text in texts]
from pprint import pprint  # pretty-printer for Python data structures; its output is neatly laid out and easy to read
pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

# Build the dictionary, which maps each token to an integer id
dictionary = corpora.Dictionary(texts)
dictionary.save('deerwester.dict')
print(dictionary)
print(dictionary.token2id)

new_doc = "Human computer interaction in the"
new_vec = dictionary.doc2bow(new_doc.lower().split())  # words that are not in the dictionary are ignored
print(new_vec)

Dictionary(12 unique tokens: ['eps', 'human', 'graph', 'trees', 'time']...)
{'eps': 8, 'human': 1, 'graph': 10, 'trees': 9, 'time': 6, 'computer': 0, 'interface': 2, 'system': 5, 'survey': 4, 'user': 7, 'minors': 11, 'response': 3}
[(0, 1), (1, 1)]

corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('deerwester.mm', corpus)
pprint(corpus)

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]
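
The bag-of-words conversion used in this section can be sketched in a few lines of plain Python. This is only an illustration of the idea, not gensim's implementation: build_token2id and this doc2bow are hypothetical stand-ins, and the ids they assign (order of first appearance) will generally differ from the ids gensim assigns.

```python
from collections import Counter

def build_token2id(texts):
    """Assign a consecutive integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from token2id are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

texts = [['human', 'interface', 'computer'],
         ['system', 'human', 'system', 'eps']]
token2id = build_token2id(texts)
print(token2id)
print(doc2bow("human computer interaction".split(), token2id))  # 'interaction' is unknown and dropped
print(doc2bow(texts[1], token2id))                              # 'system' occurs twice
```

Each document becomes a sparse vector: only the ids that actually occur are stored, together with their counts.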

1.2 Corpus Streaming - One Document at a Time

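Note that the corpus above resides fully in memory, as a plain Python list. In this simple example it doesn't matter much, but to make things clear, assume there are millions of documents in the corpus. Storing all of them in RAM won't do.

Instead, let's assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus be able to return one document vector at a time: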

class MyCorpus(object):
    def __iter__(self):
        with open('mycorpus.txt') as f:
            for line in f:
                # one document per line, tokens separated by whitespace
                yield dictionary.doc2bow(line.lower().split())

corpus_memory_friendly = MyCorpus()
print(corpus_memory_friendly)

<__main__.MyCorpus object at 0x000001E3DB444BA8>

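The corpus is now an object. We didn't define any way to print it, so print just outputs the address of the object in memory. Not very useful. To see the constituent vectors, let's iterate over the corpus and print each document vector (one at a time):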

for vector in corpus_memory_friendly:
    print(vector)

[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]

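Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.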
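
One detail worth spelling out is why the corpus is a class with __iter__ rather than a bare generator: an object with __iter__ can be traversed again and again, which matters for any algorithm that makes more than one pass over the data. The sketch below demonstrates the difference with a throwaway temporary file; LineCorpus and the sample text are invented for the demonstration.

```python
import os
import tempfile

class LineCorpus:
    """Stream one whitespace-tokenized document per line; each iteration reopens the file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:
                yield line.lower().split()

# Write a tiny two-document corpus to a temporary file for the demonstration.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write("Human machine interface\nGraph minors survey\n")
    path = tmp.name

streamed = LineCorpus(path)
first_pass = list(streamed)
second_pass = list(streamed)      # works: __iter__ starts over from the top of the file

one_shot = (doc for doc in streamed)
consumed = list(one_shot)
empty = list(one_shot)            # a plain generator is exhausted after one pass

os.remove(path)
print(first_pass == second_pass, empty)
```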

Similarly, to construct the dictionary without loading all texts into memory:

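# Collect statistics about all tokens, reading the file one line at a time
with open('mycorpus.txt') as f:
    dictionary = corpora.Dictionary(line.lower().split() for line in f)

# ids of the stop words, and ids of the words that appear in only one document
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]

dictionary.filter_tokens(stop_ids + once_ids)  # remove both groups from the dictionary
dictionary.compactify()  # remove the gaps in the id sequence left by the deleted words
print(dictionary)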

Dictionary(12 unique tokens: ['eps', 'human', 'graph', 'trees', 'system']...)

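That is all there is to it, at least as far as the bag-of-words representation is concerned. What we do with such a corpus is another question; it is not at all obvious how counting word frequencies could be useful. As it turns out, it isn't, and we first need to apply a transformation to this simple representation before we can use it to compute any meaningful document-to-document similarities. Transformations are covered in section 2, but before that, let's briefly turn to corpus persistence.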

1.3 Corpus Formats

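Several file formats exist for serializing a vector-space corpus to disk. Gensim implements them via the streaming corpus interface mentioned earlier: documents are read from (or stored to) disk lazily, one document at a time, without reading the whole corpus into main memory at once.

1. The Matrix Market format is one of the more notable file formats. To save a corpus in Matrix Market format: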

corpus = [[(1,0.5)], []]

corpora.MmCorpus.serialize('./corpus.mm', corpus)

2. Other formats:

corpora.SvmLightCorpus.serialize('./corpus.svmlight', corpus)
corpora.BleiCorpus.serialize('./corpus.lda-c', corpus)
corpora.LowCorpus.serialize('./corpus.low', corpus)

3. Loading:

corpus = corpora.MmCorpus('./corpus.mm')
print(corpus)
print(list(corpus))

MmCorpus(2 documents, 2 features, 1 non-zero entries)
[[(1, 0.5)], []]
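
For readers curious about what such a file contains, the following sketch renders a corpus as Matrix Market coordinate text by hand. It assumes documents map to rows and terms to columns, and that both indices are 1-based as the Matrix Market format requires. to_matrix_market is a hypothetical helper, and the file gensim actually writes may differ in details such as number formatting.

```python
def to_matrix_market(corpus, num_terms):
    """Render a corpus as Matrix Market coordinate text: documents are rows, terms are columns."""
    entries = [(doc_no, term_id, weight)
               for doc_no, doc in enumerate(corpus)
               for term_id, weight in doc]
    lines = ["%%MatrixMarket matrix coordinate real general",
             f"{len(corpus)} {num_terms} {len(entries)}"]
    # Matrix Market indices are 1-based, so shift both document and term ids by one.
    lines += [f"{d + 1} {t + 1} {w}" for d, t, w in entries]
    return "\n".join(lines)

corpus = [[(1, 0.5)], []]
print(to_matrix_market(corpus, num_terms=2))
```

The second line of the result, "2 2 1", carries the same three numbers that MmCorpus reports above: 2 documents, 2 features, 1 non-zero entry.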

1.4 Compatibility with Numpy and Scipy

Gensim also contains efficient utility functions for converting to and from numpy matrices and scipy.sparse matrices:

import gensim
import numpy as np
numpy_matrix = np.random.randint(10, size=[5,2])
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms = 5)

import scipy.sparse
scipy_sparse_matrix = scipy.sparse.random(5,2)  # random sparse matrix as example
corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
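
The orientation of the matrix is easy to get wrong, so here is a small list-based sketch of the dense conversion. It assumes, as I understand the default of Dense2Corpus to be, that each column of the matrix is one document and each row is one term, which is why the 5 x 2 matrix above pairs with num_terms = 5. The two helper functions are illustrative stand-ins, not gensim code.

```python
def dense_to_corpus(matrix):
    """Treat each column of a terms-by-documents matrix as one document; keep nonzero entries."""
    num_terms, num_docs = len(matrix), len(matrix[0])
    return [[(t, matrix[t][d]) for t in range(num_terms) if matrix[t][d] != 0]
            for d in range(num_docs)]

def corpus_to_dense(corpus, num_terms):
    """Inverse conversion: rebuild the terms-by-documents matrix, filling gaps with zeros."""
    matrix = [[0] * len(corpus) for _ in range(num_terms)]
    for d, doc in enumerate(corpus):
        for t, value in doc:
            matrix[t][d] = value
    return matrix

matrix = [[3, 0],
          [0, 7],
          [1, 0],
          [0, 0],
          [2, 5]]          # 5 terms (rows) x 2 documents (columns)
corpus = dense_to_corpus(matrix)
print(corpus)
print(corpus_to_dense(corpus, num_terms=5) == matrix)
```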

2 Topics and Transformations

#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

2.1 Transformation interface

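In the previous section, Corpora and Vector Spaces, we created a corpus of documents represented as a stream of vectors. To continue, let's fire up gensim and use that corpus: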

from gensim import corpora, models, similarities
import os
if (os.path.exists('./deerwester.dict')) :
    dictionary = corpora.Dictionary.load('./deerwester.dict')  # the dictionary (token -> id mapping)
    corpus = corpora.MmCorpus('./deerwester.mm')  # the corpus; use print(list(corpus)) to inspect it
    print("Used files generated from first tutorial")
else:
    print("Please run first tutorial to generate data set")

Used files generated from first tutorial

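In this tutorial, I will show how to transform documents from one vector representation into another. This process serves two goals:

1. To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and more semantic way.
2. To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, noise is reduced).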

2.1.1 Creating a Transformation

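We use the old corpus from section 1 to initialize (train) the transformation model. Different transformations may require different initialization parameters; in the case of TfIdf, the "training" simply consists of going through the supplied corpus once and computing the document frequencies of all its features.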

tfidf = models.TfidfModel(corpus)

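A transformation always converts between two specific vector spaces. The same vector space (= the same set of feature ids) must be used for training as well as for subsequent vector transformations. Failure to use the same input feature space, such as applying different string preprocessing, using different feature ids, or passing bag-of-words vectors where TfIdf vectors are expected, will result in a feature mismatch during transformation calls and consequently in garbage output and/or runtime exceptions.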

2.1.2 Transforming vectors

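From now on, tfidf is treated as a read-only object that can be used to convert any vector from the old representation (bag-of-words integer counts) to the new representation (TfIdf real-valued weights):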

doc_bow = [(0,1), (1,1)]
print(tfidf[doc_bow])  # convert the original representation to the new one

[(0, 0.7071067811865476), (1, 0.7071067811865476)]

Or apply the transformation to the whole corpus:

corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]

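In this particular case, we are transforming the same corpus that we used for training, but this is only incidental. Once the transformation model has been initialized, it can be used on any vectors (provided they come from the same vector space, of course), even if they were not used in the training corpus at all. This is achieved by a process called folding-in for LSA, by topic inference for LDA, and so on.

Calling model[corpus] only creates a wrapper around the old corpus document stream; the actual conversions are done on the fly, during document iteration. We cannot convert the entire corpus at the time of calling corpus_transformed = model[corpus], because that would mean storing the result in main memory, which contradicts gensim's objective of memory independence. If you will be iterating over corpus_transformed multiple times and the transformation is costly, serialize the resulting corpus to disk first and continue using that.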
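
The wrapper idea is simple enough to sketch in plain Python. The class below is a hypothetical illustration of lazy transformation, not gensim's own code: constructing it does no work, and the transform only runs while the corpus is being iterated.

```python
class LazyTransformedCorpus:
    """Wrap a corpus and apply a per-document transform lazily, during iteration."""
    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.transform(doc)

calls = []

def double_weights(doc):
    """Toy transformation that records each call and doubles every weight."""
    calls.append(doc)
    return [(term_id, 2 * weight) for term_id, weight in doc]

corpus = [[(0, 1), (1, 1)], [(2, 3)]]
wrapped = LazyTransformedCorpus(double_weights, corpus)
calls_before = len(calls)          # still 0: wrapping did no work
result = list(wrapped)             # the transform runs here, one document at a time
stacked = list(LazyTransformedCorpus(double_weights, wrapped))  # wrappers can be stacked
print(calls_before, result, stacked)
```

Because every pass over the wrapper repeats the transformation, an expensive model pays its cost on each iteration, which is the reason for the advice to serialize the transformed corpus when it will be reused.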

Transformations can also be chained, one on top of another:

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]
for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)

[(0, 0.06600783396090446), (1, -0.5200703306361849)]
[(0, 0.19667592859142596), (1, -0.7609563167700044)]
[(0, 0.08992639972446512), (1, -0.7241860626752507)]
[(0, 0.07585847652178257), (1, -0.6320551586003429)]
[(0, 0.10150299184980194), (1, -0.5737308483002952)]
[(0, 0.7032108939378308), (1, 0.16115180214025876)]
[(0, 0.8774787673119827), (1, 0.16758906864659512)]
[(0, 0.9098624686818575), (1, 0.14086553628719112)]
[(0, 0.6165825350569281), (1, -0.0539290756638931)]

lsi.print_topics(2)

[(0,
  '0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'),
 (1,
  '-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"time" + -0.320*"response" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

Saving and loading the model:

lsi.save('./model.lsi') # same for tfidf, lda, ...
lsi = models.LsiModel.load('./model.lsi')

2.2 Available Transformations

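Gensim implements several popular vector space model algorithms:

1. Term Frequency * Inverse Document Frequency, TF-IDF, expects a bag-of-words (integer-valued) training corpus during initialization.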

tfidf = models.TfidfModel(corpus, normalize=True)

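2. Latent Semantic Indexing, LSI (sometimes called LSA), transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of lower dimensionality.

model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)

3. Random Projections, RP, aim to reduce vector space dimensionality. This is a very efficient (both memory- and CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness. The recommended target dimensionality is again in the hundreds or thousands, depending on your dataset.

model = models.RpModel(corpus_tfidf, num_topics=500)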

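4. Latent Dirichlet Allocation, LDA, is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA (also called multinomial PCA) is a probabilistic extension of LSA, so LDA's topics can be interpreted as probability distributions over words. As with LSA, these distributions are inferred automatically from a training corpus.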

model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)

5. Hierarchical Dirichlet Process, HDP, is a non-parametric Bayesian method (note the absence of a requested number of topics):

model = models.HdpModel(corpus, id2word=dictionary)
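
Of the methods listed above, random projection (item 3) is the easiest to sketch without any library. The toy version below multiplies a vector by a matrix of random +1/-1 entries and scales by 1/sqrt(num_topics); that particular construction is an assumption chosen for illustration and is not claimed to match RpModel in detail. The point is only that distances between documents survive the reduction approximately.

```python
import math
import random

def random_projection_matrix(num_terms, num_topics, seed=0):
    """Build a num_topics x num_terms matrix whose entries are +1 or -1 with equal probability."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(num_terms)] for _ in range(num_topics)]

def project(vector, matrix):
    """Project a dense vector and scale by 1/sqrt(num_topics) so lengths are roughly preserved."""
    scale = 1.0 / math.sqrt(len(matrix))
    return [scale * sum(r * x for r, x in zip(row, vector)) for row in matrix]

def euclidean(u, v):
    """Euclidean distance between two dense vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two made-up dense "documents" in a 1000-term space.
rng = random.Random(1)
doc_a = [rng.random() for _ in range(1000)]
doc_b = [rng.random() for _ in range(1000)]

matrix = random_projection_matrix(num_terms=1000, num_topics=300)
small_a, small_b = project(doc_a, matrix), project(doc_b, matrix)
ratio = euclidean(small_a, small_b) / euclidean(doc_a, doc_b)
print(len(small_a), round(ratio, 2))   # the ratio should be close to 1
```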

3 Similarity Queries

3.1 Similarity Interface

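In the previous sections, Corpora and Vector Spaces and Topics and Transformations, we covered what it means to create a corpus in the vector space model and how to transform it between different vector spaces. A common reason for doing all this is that we want to determine the similarity between pairs of documents, or the similarity between a specific document and a set of other documents (such as a user query versus indexed documents).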

from gensim import corpora, models, similarities
dictionary = corpora.Dictionary.load('./deerwester.dict')
corpus = corpora.MmCorpus('./deerwester.mm')
print(corpus)

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=3)

doc = 'Human computer interaction'
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
print(vec_lsi)

MmCorpus(9 documents, 12 features, 28 non-zero entries)
[(0, 0.46182100453271524), (1, 0.07002766527899892), (2, -0.12452907551899137)]

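We will use cosine similarity to determine the similarity of two vectors. Cosine similarity is a standard measure in vector space modeling, but wherever the vectors represent probability distributions, a different similarity measure may be more appropriate.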
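
As one example of a measure designed for probability distributions, the sketch below computes the Hellinger distance. The text above does not name a particular alternative, so this choice is mine, and the three topic distributions are invented for the illustration.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as equal-length lists."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

# Made-up topic distributions for three documents (each sums to 1).
topics_a = [0.7, 0.2, 0.1]
topics_b = [0.6, 0.3, 0.1]
topics_c = [0.0, 0.1, 0.9]

print(hellinger(topics_a, topics_a))   # identical distributions are at distance 0
print(hellinger(topics_a, topics_b) < hellinger(topics_a, topics_c))
```

The distance lies between 0 (identical distributions) and 1 (distributions with no overlap), so smaller means more similar, the opposite direction from cosine similarity.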

index1 = similarities.MatrixSimilarity(lsi[corpus])
#index2 = similarities.SparseMatrixSimilarity(lsi[corpus], num_features=)
#index3 = similarities.Similarity(lsi[corpus], num_features=)

index1.save('./deerwester.index')
index1 = similarities.MatrixSimilarity.load('./deerwester.index')

D:\anaconda\envs\tensorflow\lib\site-packages\gensim\matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int32 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):

Compute the similarity between doc = 'Human computer interaction' and every document:

sims = index1[vec_lsi] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples

[(0, 0.9925859), (1, 0.6613736), (2, 0.997788), (3, 0.9276979), (4, 0.35544106), (5, 0.0013079792), (6, 0.002062656), (7, 0.0023344755), (8, 0.0825683)]

sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)

[(2, 0.997788), (0, 0.9925859), (3, 0.9276979), (1, 0.6613736), (4, 0.35544106), (8, 0.0825683), (7, 0.0023344755), (6, 0.002062656), (5, 0.0013079792)]
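
When only the best few matches are needed, sorting the whole list is more work than necessary. A small sketch using the standard library's heapq.nlargest on the similarities printed above (copied in by hand so the snippet runs on its own):

```python
import heapq

# (document_number, similarity) pairs as printed by the query above.
sims = [(0, 0.9925859), (1, 0.6613736), (2, 0.997788), (3, 0.9276979), (4, 0.35544106),
        (5, 0.0013079792), (6, 0.002062656), (7, 0.0023344755), (8, 0.0825683)]

# Keep only the 3 best matches without sorting the entire list.
top3 = heapq.nlargest(3, sims, key=lambda item: item[1])
print(top3)
```

For nine documents the difference is irrelevant, but for a large index it avoids a full sort.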

4 Experiments on the English Wikipedia

4.1 Preparing the corpus

First, download the dump of all Wikipedia articles from http://download.wikimedia.org/enwiki/ (you want the file enwiki-latest-pages-articles.xml.bz2, or enwiki-YYYYMMDD-pages-articles.xml.bz2 for date-specific dumps). This file is about 8GB in size and contains (a compressed version of) all articles from the English Wikipedia.

Convert the articles to plain text (process Wiki markup) and store the result as sparse TF-IDF vectors. In Python, this is easy to do on-the-fly and we don’t even need to uncompress the whole archive to disk. There is a script included in gensim that does just that, run:

$ python -m gensim.scripts.make_wiki

4.2 Latent Semantic Analysis

import gensim

id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')

lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
lsi.print_topics(10)

4.3 Latent Dirichlet Allocation

import gensim

id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')

lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
lda.print_topics(10)

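In short, be careful when using LDA to incrementally add new documents to a model over time. Batch usage of LDA, where the entire training corpus is either known beforehand or does not exhibit topic drift, is fine and not affected.

To run batch LDA (not online), train the LdaModel with: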

lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, passes=20)
# As usual, a trained model can be used to transform new, unseen documents (plain bag-of-words count vectors) into LDA topic distributions:
doc_lda = lda[doc_bow]

5 Example

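In gensim, the word2vec API lives in the package gensim.models.word2vec, and the parameters of the algorithm belong to the class gensim.models.word2vec.Word2Vec. The parameters to pay attention to are:

    1) sentences: the corpus to analyze. It can be a list, or it can be streamed from a file; an example of reading from a file follows below.

    2) size: the dimensionality of the word vectors; the default is 100. A suitable value generally depends on the size of the corpus. For a small corpus, say less than 100 MB of text, the default is usually fine; for a very large corpus, consider increasing it. (gensim 4.0 renamed this parameter to vector_size.)

    3) window: the maximum distance between a word and its context words, denoted c in the companion article on the theory of the algorithm. The larger the window, the more distant the words that still count as context. The default is 5. In practice the window can be adjusted to the task; for a small corpus it can be set lower, and for a typical corpus a value in [5, 10] is recommended.

    4) sg: selects between the two word2vec models. 0 means CBOW and 1 means Skip-Gram; the default is 0 (CBOW).

    5) hs: selects between the two word2vec training methods. 1 means Hierarchical Softmax; 0, together with a negative-sample count negative greater than 0, means Negative Sampling. The default is 0 (Negative Sampling).

    6) negative: the number of negative samples when Negative Sampling is used; the default is 5, and a value in [3, 10] is recommended. This parameter is denoted neg in the theory article.

    7) cbow_mean: used only by CBOW when doing the projection. If 0, x_w is the sum of the context word vectors; if 1, it is their mean. The theory article describes the mean. I personally prefer the mean for x_w; the default is also 1, and changing it is not recommended.

    8) min_count: the minimum frequency a word needs in order to get a vector. It filters out rare, low-frequency words; the default is 5. For a small corpus, lower this value.

    9) iter: the maximum number of iterations of stochastic gradient descent; the default is 5. For a large corpus this value can be increased. (gensim 4.0 renamed this parameter to epochs.)

    10) alpha: the initial step size of stochastic gradient descent, denoted η in the theory article; the default is 0.025.

    11) min_alpha: the algorithm gradually reduces the step size as training proceeds, and min_alpha gives the smallest step size. The step size in each round of stochastic gradient descent follows from iter, alpha and min_alpha together. Since this is not a core part of the word2vec algorithm, the theory article does not cover it. For a large corpus, alpha, min_alpha and iter need to be tuned together to find a suitable combination.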
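
Two of the parameters above, window and the alpha/min_alpha pair, are easier to picture with a few lines of code. The sketch below is a simplified illustration, not gensim's training loop: context_pairs uses a fixed window (real implementations typically vary the effective window at random), and learning_rate assumes a linear decay with min_alpha = 0.0001 as the floor, a value that is my assumption and not stated in this article.

```python
def context_pairs(tokens, window):
    """Yield (center, context) pairs for every word within `window` positions of the center."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        for j in range(start, min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

def learning_rate(progress, alpha=0.025, min_alpha=0.0001):
    """Linearly decay the step size from alpha to min_alpha as progress goes from 0 to 1."""
    return max(min_alpha, alpha - (alpha - min_alpha) * progress)

tokens = ["graph", "minors", "a", "survey"]
print(list(context_pairs(tokens, window=1)))
print([round(learning_rate(p), 5) for p in (0.0, 0.5, 1.0)])
```

A larger window produces more pairs per word and therefore links words that sit further apart, which is the effect described for the window parameter.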

    Those are the main parameters of gensim's word2vec. Now let's learn word2vec through a practical example.

# -*- coding: utf-8 -*-

import jieba
import jieba.analyse
from gensim.models import word2vec

# Register the character names so that jieba keeps each one as a single token
names = ['沙瑞金', '田国富', '高育良', '侯亮平', '钟小艾', '陈岩石', '欧阳菁',
         '易学习', '王大路', '蔡成功', '孙连城', '季昌明', '丁义珍', '郑西坡',
         '赵东来', '高小琴', '赵瑞龙', '林华华', '陆亦可', '刘新建', '刘庆祝']
for name in names:
    jieba.suggest_freq(name, True)

# Segment the novel with jieba and write the space-separated tokens to a new file.
# The input is assumed to be UTF-8; change the encoding if your copy uses another one (e.g. GBK).
with open('./in_the_name_of_people.txt', encoding='utf-8') as f:
    document = f.read()
    # jieba.cut returns a generator, so printing its contents here would
    # exhaust it and leave nothing for the join below.
    document_cut = jieba.cut(document)
    result = ' '.join(document_cut)
    with open('./in_the_name_of_people_segment.txt', 'w', encoding='utf-8') as f2:
        f2.write(result)

sentences = word2vec.LineSentence('./in_the_name_of_people_segment.txt')

# Model (gensim 4.0+ calls the dimensionality parameter vector_size; older releases call it size)
model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, vector_size=100)

# Use 1 (the most common): find the words closest to a given word vector
req_count = 5
for key in model.wv.similar_by_word('沙瑞金', topn=100):
    if len(key[0]) == 3:
        req_count -= 1
        print(key[0], key[1])
        if req_count == 0:
            break

# Use 2: measure how close two word vectors are; here, the similarity of two pairs of characters from the book
print(model.wv.similarity('沙瑞金', '高育良'))
print(model.wv.similarity('李达康', '王大路'))

# Use 3: find the word that does not belong with the others; here, a question about grouping the characters
print(model.wv.doesnt_match("沙瑞金 高育良 李达康 刘庆祝".split()))
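
The odd-one-out query can be understood as a small geometric computation. The sketch below shows one plausible way to do it, under my assumption that the answer is the word whose unit vector is least similar to the mean of all the unit vectors. The 2-D embeddings are invented for illustration, and odd_one_out is a hypothetical helper, not the gensim function.

```python
import math

def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def odd_one_out(word_vectors):
    """Return the word whose unit vector is least similar to the mean of all unit vectors."""
    units = {word: unit(vec) for word, vec in word_vectors.items()}
    dims = len(next(iter(units.values())))
    mean = unit([sum(vec[i] for vec in units.values()) / len(units) for i in range(dims)])
    return min(units, key=lambda word: sum(a * b for a, b in zip(units[word], mean)))

# Hypothetical 2-D embeddings, invented for illustration only.
toy_vectors = {"apple": [0.9, 0.1], "pear": [0.8, 0.2], "plum": [0.85, 0.15], "truck": [0.1, 0.9]}
print(odd_one_out(toy_vectors))
```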