NLP05 - Gensim Source Code [Packages and Interfaces]

Latest recommended article published on 2024-08-20 09:53:30

happyprince

Article link: https://blog.csdn.net/ld326/article/details/78379449

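Abstract: A rough, high-level look at the file structure and interfaces of the gensim package, to get an intuitive feel for what the gensim source code contains. This is the first step toward understanding the Gensim source. It covers the file structure, the core interfaces, the Corpora module, the Models module, the Similarity module, the Parsing module, scripts, scikit-learn integration, summarization and keywords, unit tests, and topic coherence.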

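0. File structure

Opening the gensim package shows the following directory structure:
[Figure: directory structure of the gensim package]
The modules are divided into corpora, models, and so on. In addition, interfaces.py holds the core interfaces, matutils.py the math utilities, and utils.py the common helper functions. nosy.py is not important; it is used to watch the .py files for modifications.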

1. Gensim core interfaces [interfaces.py]


1.1 CorpusABC

Interface (abstract base class) for corpora. A corpus is simply an iterable, where each iteration step yields one document:

>>> for doc in corpus:
...     pass  # do something with the doc

A document is a sequence of (fieldId, fieldValue) 2-tuples:

>>> for attr_id, attr_value in doc:
...     pass  # do something with the attribute
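
The two definitions above can be tried out without gensim itself. The following is only a pure-Python sketch, not gensim's CorpusABC: the class name TinyCorpus, the toy texts and the word2id mapping are all made up for illustration.

```python
from collections import Counter


class TinyCorpus:
    """Hypothetical stand-in for a corpus: an iterable of sparse documents."""

    def __init__(self, texts, word2id):
        self.texts = texts        # list of raw strings (toy data)
        self.word2id = word2id    # word -> integer field id

    def __iter__(self):
        # Each iteration step yields one document as a sorted list of
        # (field_id, field_value) 2-tuples; unknown words are skipped.
        for text in self.texts:
            counts = Counter(
                self.word2id[w] for w in text.split() if w in self.word2id
            )
            yield sorted(counts.items())

    def __len__(self):
        return len(self.texts)


word2id = {"human": 0, "computer": 1, "interface": 2}
corpus = TinyCorpus(["human computer interface", "computer computer"], word2id)

for doc in corpus:
    for attr_id, attr_value in doc:
        print(attr_id, attr_value)
```

Because __iter__ is a generator method, the corpus can be iterated more than once, which is what separates an iterable from a one-shot iterator.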

1.2 SimilarityABC

Abstract interface for similarity searches over a corpus.
In all instances, there is a corpus against which we want to perform the similarity search.
For each similarity search, the input is a document and the output is its similarities to the individual corpus documents.
Similarity queries are realized by calling self[query_document].
There is also a convenience wrapper, where iterating over self yields similarities of each document in the corpus against the whole corpus (i.e., the query is each corpus document in turn).
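
The self[query_document] convention and the iteration wrapper can be sketched in a few lines. This is a brute-force illustration under assumed names (TinySimilarity and cosine are hypothetical, and cosine similarity is just one possible measure); it is not how gensim's docsim classes are implemented.

```python
import math


def cosine(doc_a, doc_b):
    """Cosine similarity of two sparse documents given as (id, value) lists."""
    a, b = dict(doc_a), dict(doc_b)
    dot = sum(value * b.get(field_id, 0.0) for field_id, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinySimilarity:
    """Hypothetical brute-force index; not gensim's SimilarityABC."""

    def __init__(self, corpus):
        self.index = [list(doc) for doc in corpus]

    def __getitem__(self, query):
        # One similarity per indexed document, in corpus order.
        return [cosine(query, doc) for doc in self.index]

    def __iter__(self):
        # Convenience wrapper: each indexed document is used as the query.
        for doc in self.index:
            yield self[doc]


index = TinySimilarity([[(0, 1.0), (1, 1.0)], [(1, 2.0)], [(2, 3.0)]])
sims = index[[(1, 1.0)]]   # similarities of the query to all three documents
```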

1.3 TransformationABC

Interface for transformations. A 'transformation' is any object which accepts a sparse document via the dictionary notation [] and returns another sparse document in its stead.
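
As a rough illustration of the [] convention, here is a toy transformation that rescales a sparse document to unit length. The class name UnitLengthTransform is hypothetical and the class does not derive from gensim's TransformationABC; it only mimics the calling style.

```python
import math


class UnitLengthTransform:
    """Hypothetical transformation; not gensim's TransformationABC."""

    def __getitem__(self, doc):
        # Accept one sparse document via [] and return another one,
        # rescaled so that its Euclidean norm is 1.
        norm = math.sqrt(sum(value * value for _, value in doc))
        if norm == 0.0:
            return []
        return [(field_id, value / norm) for field_id, value in doc]


to_unit = UnitLengthTransform()
unit_doc = to_unit[[(0, 3.0), (2, 4.0)]]
print(unit_doc)
```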

2. Corpora module

This package contains implementations of various streaming corpus I/O formats.
In the class hierarchy (figure not preserved), each subclass can be viewed as one storage format for a corpus.

3. Models module

This package contains algorithms for extracting document representations from their raw bag-of-word counts.
[Figure: file structure of the models package]
[Figure: inheritance relationships among the model classes; image not preserved]

4. Similarity module

This package contains implementations of pairwise similarity queries.
It has only two files: docsim.py and index.py.
The classes in docsim.py all inherit from the SimilarityABC interface.
[Figure: class diagram of the Similarity module]

5. Parsing module

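This package contains functions to preprocess raw text.
It contains two files:
preprocessing.py: document preprocessing, for example stop-word removal and case handling.
porter.py: the Porter Stemming Algorithm, from the paper
Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
no. 3, pp 130-137.
More about the algorithm: http://www.tartarus.org/~martin/PorterStemmer
Stemming strips inflectional endings such as plurals and third-person forms so that related words map to a common stem. The stem is not always a dictionary word (ponies becomes poni). For example: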

"""Get rid of plurals and -ed or -ing. E.g.,

   caresses  ->  caress
   ponies    ->  poni
   ties      ->  ti
   caress    ->  caress
   cats      ->  cat

   feed      ->  feed
   agreed    ->  agree
   disabled  ->  disable

   matting   ->  mat
   mating    ->  mate
   meeting   ->  meet
   milling   ->  mill
   messing   ->  mess

   meetings  ->  meet
"""
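
The plural cases in the first group can be reproduced with the suffix rules of step 1a from Porter's paper. The sketch below covers only that sub-step; the -ed and -ing cases need the paper's measure and vowel conditions and are left out. The function name step_1a is chosen here for illustration and is not taken from porter.py.

```python
def step_1a(word):
    """Plural handling from step 1a of Porter (1980); the first matching
    rule wins, in the order SSES, IES, SS, S."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word


for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step_1a(w))
```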

6. scripts

This is a collection of scripts that make processing and format conversion easier, for example:

glove2word2vec.py converts the GloVe vectors format to the word2vec text format:
USAGE: $ python -m gensim.scripts.glove2word2vec --input <GloVe vector file> --output <Word2vec vector file>
Where:
    <GloVe vector file>: Input GloVe .txt file
    <Word2vec vector file>: Desired name of output Word2vec .txt file
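
The two text formats differ only in a header: a GloVe .txt file starts directly with the vectors, while the word2vec text format begins with a line holding the number of vectors and their dimensionality. The following pure-Python sketch shows that idea on a toy file; the function name glove_to_word2vec and the file names are made up here, and this is not the gensim script itself.

```python
import os
import tempfile


def glove_to_word2vec(glove_path, output_path):
    """Copy a GloVe text file, prepending the '<count> <dims>' header line."""
    with open(glove_path, encoding="utf-8") as fin:
        lines = [line.rstrip("\n") for line in fin if line.strip()]
    num_vectors = len(lines)
    # Every line is "<word> <v1> ... <vN>", so dims = number of fields - 1.
    num_dims = len(lines[0].split()) - 1 if lines else 0
    with open(output_path, "w", encoding="utf-8") as fout:
        fout.write(f"{num_vectors} {num_dims}\n")
        for line in lines:
            fout.write(line + "\n")
    return num_vectors, num_dims


# Toy input written to a temporary directory.
workdir = tempfile.mkdtemp()
glove_file = os.path.join(workdir, "glove.txt")
w2v_file = os.path.join(workdir, "w2v.txt")
with open(glove_file, "w", encoding="utf-8") as f:
    f.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

shape = glove_to_word2vec(glove_file, w2v_file)
```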
word2vec2tensor converts a word2vec model into tensor form:
USAGE: $ python -m gensim.scripts.word2vec2tensor --input <Word2Vec model file> --output <TSV tensor filename prefix> [--binary] <Word2Vec binary flag>
Where:
    <Word2Vec model file>: Input Word2Vec model
    <TSV tensor filename prefix>: 2D tensor TSV output file name prefix
    <Word2Vec binary flag>: Set True if Word2Vec model is binary. Defaults to False.
Output:
    The script will create two TSV files. A 2d tensor format file, and a Word Embedding metadata file. Both files will
    use the --output file name as prefix
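
The two output files described above can be pictured with a small sketch: one TSV with a vector per row and one metadata file with the matching word per row. The function name write_tensor_tsv, the toy vectors and the exact file-name suffixes are assumptions made for illustration; check the script itself for the names it really uses.

```python
import os
import tempfile


def write_tensor_tsv(vectors, prefix):
    """Write <prefix>_tensor.tsv (one tab-separated vector per row) and
    <prefix>_metadata.tsv (the matching word per row)."""
    tensor_path = prefix + "_tensor.tsv"        # assumed suffix
    metadata_path = prefix + "_metadata.tsv"    # assumed suffix
    with open(tensor_path, "w", encoding="utf-8") as tensor_file, \
            open(metadata_path, "w", encoding="utf-8") as metadata_file:
        for word, vector in vectors.items():
            metadata_file.write(word + "\n")
            tensor_file.write("\t".join(str(x) for x in vector) + "\n")
    return tensor_path, metadata_path


prefix = os.path.join(tempfile.mkdtemp(), "toy")
paths = write_tensor_tsv({"cat": [0.1, 0.2], "dog": [0.3, 0.4]}, prefix)
```

Row i of the tensor file and row i of the metadata file describe the same word, which is why both files are written in the same loop.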

7. scikit-learn integration

scikit-learn wrappers for gensim: SklearnWrapperLdaModel and SklearnWrapperLsiModel.

8. summarization

8.1 Keywords

def keywords(text, ratio=0.2, words=None, split=False, scores=False, pos_filter=['NN', 'JJ'], lemmatize=False, deacc=True)
Keyword extraction is computed on a graph.

8.2 Summarization

def summarize(text, ratio=0.2, word_count=None, split=False)
It mainly relies on the TextRank algorithm, which is also computed on a graph.
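
TextRank ranks the nodes of a graph (sentences or words) with a PageRank-style score. The sketch below is a plain power iteration on a tiny undirected weighted graph. The function name pagerank, the damping value 0.85, the fixed iteration count and the toy edge weights are assumptions for illustration, not values read from gensim's code.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration on an undirected weighted graph given as
    {node: {neighbour: weight}}; every node needs at least one edge."""
    nodes = list(graph)
    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # Each neighbour passes on a share of its score that is
            # proportional to the weight of the connecting edge.
            incoming = sum(
                scores[other] * weight / sum(graph[other].values())
                for other, weight in graph[node].items()
            )
            new_scores[node] = (1 - damping) / len(nodes) + damping * incoming
        scores = new_scores
    return scores


# Toy sentence graph: edge weights stand for sentence similarity.
graph = {
    "s1": {"s2": 1.0, "s3": 1.0},
    "s2": {"s1": 1.0},
    "s3": {"s1": 1.0},
}
scores = pagerank(graph)
best = max(scores, key=scores.get)   # the most central sentence
```

A summary would then keep the highest-scoring sentences, in their original order, up to the requested ratio or word count.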

8.3 Related data structures and algorithms

BM25 [bm25.py]
TextRank algorithm
Graph [common.py, graph.py]
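
For reference, Okapi BM25 scores a document against a query from term frequency, document frequency and document length. The following is a sketch of the textbook formula, using the IDF variant that adds 1 inside the logarithm so that it stays positive. The function name bm25_scores and the parameter values k1=1.5 and b=0.75 are assumptions here and may differ from what bm25.py uses.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Okapi BM25 score of every tokenised document for a tokenised query."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            tf = term_freq[term]
            # Longer-than-average documents are penalised through b.
            length_norm = 1.0 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


documents = [
    ["topic", "model", "coherence"],
    ["word", "vector", "model"],
    ["porter", "stemming", "algorithm"],
]
scores = bm25_scores(["topic", "model"], documents)
```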

9. Unit tests

10. Topic coherence

Topic models can be evaluated with coherence measures. Some related material:

What is Topic Coherence?
https://rare-technologies.com/what-is-topic-coherence/

Exploring the Space of Topic Coherence Measures
http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf

Evaluating topic coherence measures
https://mimno.infosci.cornell.edu/nips2013ws/nips2013tm_submission_7.pdf

Topic Coherence To Evaluate Topic Models
http://qpleple.com/topic-coherence-to-evaluate-topic-models/

A demonstration of topic coherence:
https://nbviewer.jupyter.org/github/dsquareindia/gensim/blob/280375fe14adea67ce6384ba7eabf362b05e6029/docs/notebooks/topic_coherence_tutorial.ipynb

Topic mining and classification based on semantic coherence
http://blog.csdn.net/shirdrn/article/details/7076505