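Summary: A quick, high-level look at the file structure and interfaces of the gensim package, to get an intuitive feel for what the gensim source code contains. This is the first step toward understanding the Gensim source. It covers the file structure, the core interfaces, the Corpora module, the Models module, the Similarity module, the Parsing module, scripts, sklearn integration, summarization and keywords, unit tests, and topic coherence.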
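0. File structure
Open the gensim package and the directory structure looks like this:
The package is split into modules such as corpora and models. In addition, interfaces.py holds the core interfaces, matutils.py the math utilities, and utils.py the shared helper functions. nosy.py is not important: it is used to watch the .py files for modifications.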
1. Gensim core interfaces [interfaces.py]
1.1 CorpusABC
Interface (abstract base class) for corpora. A corpus is simply an iterable, where each iteration step yields one document:
>>> for doc in corpus:
...     pass  # do something with the doc...
A document is a sequence of (fieldId, fieldValue) 2-tuples:
>>> for attr_id, attr_value in doc:
...     pass  # do something with the attribute
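The iteration protocol above can be sketched without gensim at all. The following is a minimal illustration, assuming a hypothetical in-memory class named TinyCorpus (not part of gensim) that builds its own token-to-id mapping and yields each document as a sorted list of (token_id, count) 2-tuples:

```python
class TinyCorpus:
    """Hypothetical corpus: iterating yields one bag-of-words document at a time."""

    def __init__(self, texts):
        self.texts = texts
        # Assign an integer id to every distinct token, in order of first appearance.
        self.token2id = {}
        for text in texts:
            for token in text.lower().split():
                self.token2id.setdefault(token, len(self.token2id))

    def __iter__(self):
        for text in self.texts:
            counts = {}
            for token in text.lower().split():
                token_id = self.token2id[token]
                counts[token_id] = counts.get(token_id, 0) + 1
            # A document is a sequence of (fieldId, fieldValue) 2-tuples.
            yield sorted(counts.items())

    def __len__(self):
        return len(self.texts)


corpus = TinyCorpus(["human computer interaction", "computer system computer"])
docs = list(corpus)
for doc in docs:
    for attr_id, attr_value in doc:
        print(attr_id, attr_value)
```

Because the documents are produced inside `__iter__`, the corpus can be iterated any number of times, and a file-backed variant would never need to hold every document in memory at once.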
1.2 SimilarityABC
Abstract interface for similarity searches over a corpus.
In all instances, there is a corpus against which we want to perform the similarity search.
For each similarity search, the input is a document and the output is its similarities to the individual corpus documents.
Similarity queries are realized by calling self[query_document].
There is also a convenience wrapper, where iterating over self yields the similarities of each document in the corpus against the whole corpus (i.e., the query is each corpus document in turn).
1.3 TransformationABC
Interface for transformations. A 'transformation' is any object which accepts a sparse document via the dictionary notation [] and returns another sparse document in its stead:
2. Corpora module
This package contains implementations of various streaming corpus I/O formats.
The class hierarchy is shown below; each subclass can be thought of as one storage format for a corpus:
3. Models module
This package contains algorithms for extracting document representations from their raw bag-of-word counts.
File structure of the models package:
Inheritance relationships between the classes:
4. Similarity module
This package contains implementations of pairwise similarity queries.
It has only two files: docsim.py and index.py.
The classes in docsim.py are listed below; they all inherit from the SimilarityABC interface.
Class diagram of the Similarity module:
5. Parsing module
This package contains functions to preprocess raw text.
It contains two files:
preprocessing.py: document preprocessing, for example stop-word removal and case handling.
porter.py: the Porter Stemming Algorithm, from the paper
Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
no. 3, pp 130-137,
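More information on the algorithm: http://www.tartarus.org/~martin/PorterStemmer
Stemming reduces inflected forms of a word, such as plurals and third-person forms, to a common stem. For example, the docstring of the step that strips plurals and -ed/-ing endings reads: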
"""Get rid of plurals and -ed or -ing. E.g.,
caresses -> caress
ponies -> poni
ties -> ti
caress -> caress
cats -> cat
feed -> feed
agreed -> agree
disabled -> disable
matting -> mat
mating -> mate
meeting -> meet
milling -> mill
messing -> mess
meetings -> meet
"""
6. scripts
This is a collection of scripts that make processing and format conversion easier.
For example, glove2word2vec.py converts the GloVe vectors format into the word2vec text format:
USAGE: $ python -m gensim.scripts.glove2word2vec --input <GloVe vector file> --output <Word2vec vector file>
Where:
<GloVe vector file>: Input GloVe .txt file
<Word2vec vector file>: Desired name of output Word2vec .txt file
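The two text formats differ mainly in that the word2vec text format begins with a header line giving the number of vectors and their dimensionality, which GloVe files lack. The sketch below illustrates that idea in plain Python. The helper name glove_to_word2vec_text and the sample vectors are hypothetical, and this is not the script's actual implementation:

```python
import os
import tempfile


def glove_to_word2vec_text(glove_path, output_path):
    """Hypothetical helper: copy GloVe lines and prepend a '<count> <dims>' header."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]
    num_vectors = len(lines)
    # Each line is a token followed by its vector components.
    num_dims = len(lines[0].split(" ")) - 1
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{num_vectors} {num_dims}\n")
        for line in lines:
            dst.write(line + "\n")
    return num_vectors, num_dims


# Build a tiny GloVe-style file to convert (made-up vectors).
workdir = tempfile.mkdtemp()
glove_path = os.path.join(workdir, "glove.txt")
w2v_path = os.path.join(workdir, "w2v.txt")
with open(glove_path, "w", encoding="utf-8") as f:
    f.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

shape = glove_to_word2vec_text(glove_path, w2v_path)
with open(w2v_path, encoding="utf-8") as f:
    header = f.readline().strip()
print(shape, header)
```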
word2vec2tensor converts a word2vec model into tensor form:
USAGE: $ python -m gensim.scripts.word2vec2tensor --input <Word2Vec model file> --output <TSV tensor filename prefix> [--binary] <Word2Vec binary flag>
Where:
<Word2Vec model file>: Input Word2Vec model
<TSV tensor filename prefix>: 2D tensor TSV output file name prefix
<Word2Vec binary flag>: Set True if Word2Vec model is binary. Defaults to False.
Output:
The script will create two TSV files: a 2D tensor format file and a word embedding metadata file. Both files will
use the --output file name as prefix.
7. sklearn integration
Scikit-learn wrappers for gensim: SklearnWrapperLdaModel and SklearnWrapperLsiModel
8. summarization
8.1 Keywords
def keywords(text, ratio=0.2, words=None, split=False, scores=False, pos_filter=['NN', 'JJ'], lemmatize=False, deacc=True)
Keyword extraction is computed on a graph.
8.2 Summarization
def summarize(text, ratio=0.2, word_count=None, split=False)
It mainly uses the TextRank algorithm, which is also computed on a graph.
8.3 Related data structures and algorithms
BM25 [bm25.py]
TextRank algorithm
Graph [common.py, graph.py]
9. Unit tests
10. Topic coherence
Topic models can be evaluated with coherence measures. Some references on the subject:
What is Topic Coherence?
https://rare-technologies.com/what-is-topic-coherence/
Exploring the Space of Topic Coherence Measures
http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf
Evaluating topic coherence measures
https://mimno.infosci.cornell.edu/nips2013ws/nips2013tm_submission_7.pdf
Topic Coherence To Evaluate Topic Models
http://qpleple.com/topic-coherence-to-evaluate-topic-models/
A demonstration of topic coherence:
https://nbviewer.jupyter.org/github/dsquareindia/gensim/blob/280375fe14adea67ce6384ba7eabf362b05e6029/docs/notebooks/topic_coherence_tutorial.ipynb
Topic mining and classification based on semantic coherence:
http://blog.csdn.net/shirdrn/article/details/7076505
[Author: happyprince, http://blog.csdn.net/ld326/article/details/78379449]