NLP07 - A Brief Look at the Gensim Source Code [MmCorpus & SvmLightCorpus]

happyprince

Categories: NLP, python. Tags: source code, nlp, gensim

Article link: https://blog.csdn.net/ld326/article/details/78396982

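Abstract: This post reads through the source code of MmCorpus and SvmLightCorpus to see in what form a corpus is saved to disk, and reviews the matrix storage formats involved along the way.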

1. MmCorpus

1.1 Introduction to MM

MM is a storage format for matrices: the Matrix Market File Format.
"Text File Formats":
http://math.nist.gov/MatrixMarket/formats.html
"The Matrix Market File Format":
http://people.sc.fsu.edu/~jburkardt/data/mm/mm.html

Characteristics of a Matrix Market (MM) file:
● ASCII format;
● allow comment lines, which begin with a percent sign;
● use a “coordinate” format for sparse matrices;
● use an “array” format for general dense matrices;
A file in the Matrix Market format comprises four parts:
1. Header line: contains an identifier, and four text fields;
2. Comment lines: allow a user to store information and comments;
3. Size line: specifies the number of rows and columns, and the number of nonzero elements;
4. Data lines: specify the location of the matrix entries (implicitly or explicitly) and their values.

Coordinate Format - sparse matrices;
Array Format - dense matrices.
An example of converting between the two:
[Figure not available: conversion between array format and coordinate format]
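
The conversion between the two representations can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the function names `dense_to_coordinate` and `coordinate_to_dense` are made up for this example, values are integers, and the `%%MatrixMarket` header line and comment lines are left out so that only the size line and the data lines remain.

```python
def dense_to_coordinate(matrix):
    """Return the size line and the data lines of a coordinate-format body."""
    rows, cols = len(matrix), len(matrix[0])
    # Keep only nonzero entries; MM indices are 1-based.
    entries = [(i + 1, j + 1, v)
               for i, row in enumerate(matrix)
               for j, v in enumerate(row) if v != 0]
    lines = ["%d %d %d" % (rows, cols, len(entries))]  # rows, columns, nonzeros
    lines += ["%d %d %s" % entry for entry in entries]
    return lines


def coordinate_to_dense(lines):
    """Rebuild the dense matrix from the lines produced above."""
    rows, cols, _nnz = (int(x) for x in lines[0].split())
    matrix = [[0] * cols for _ in range(rows)]
    for line in lines[1:]:
        i, j, v = line.split()
        matrix[int(i) - 1][int(j) - 1] = int(v)
    return matrix


dense = [[1, 0, 0],
         [0, 0, 2],
         [0, 3, 0]]
coo = dense_to_coordinate(dense)
print("\n".join(coo))
```

Only the three nonzero cells are written, which is why the coordinate form suits sparse matrices such as bag-of-words corpora.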

1.2 gensim example

Demo：

import os

from gensim import corpora

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]
# Map each distinct token to an integer id
dictionary = corpora.Dictionary(texts)
# Convert each document to a sparse list of (term id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]
# Make sure the output directory exists before writing
os.makedirs('tmp', exist_ok=True)
corpora.MmCorpus.serialize('tmp/deerwester.mm', corpus)

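Contents of the file written by gensim:
[Figure not available: contents of tmp/deerwester.mm]
It stores a 9 x 12 matrix with 28 nonzero entries in total.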
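
The shape and the nonzero count can be checked by hand. The sketch below assumes the corpus is the list of (term id, count) vectors that corresponds to the SVMlight output shown in section 2, with the ids shifted back to 0-based; the variable names are chosen for this example only.

```python
# Bag-of-words vectors for the nine documents, 0-based term ids
# (transcribed from the SVMlight output in section 2, minus 1 on each id).
corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(1, 1), (5, 1), (7, 1), (8, 1)],
    [(2, 1), (7, 2), (8, 1)],
    [(4, 1), (5, 1), (6, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(3, 1), (10, 1), (11, 1)],
]

num_docs = len(corpus)                                  # matrix rows
num_terms = 1 + max(t for doc in corpus for t, _ in doc)  # matrix columns
num_nnz = sum(len(doc) for doc in corpus)               # stored entries
density = 100.0 * num_nnz / (num_docs * num_terms)

print("%ix%i matrix, %i nonzeros, density=%.3f%%"
      % (num_docs, num_terms, num_nnz, density))
```

Each document becomes one matrix row and each term id one column, so only 28 of the 108 cells need to be stored.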
Tracing the calls in the source code:

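MmCorpus is a subclass of IndexedCorpus. Saving an MmCorpus is delegated to MmWriter; the process can be thought of as converting a two-dimensional array into coordinate form, that is, storing it as a sparse matrix.
MmCorpus implements save_corpus(), which calls MmWriter.write_corpus, a static method:
[Figure not available: source of MmCorpus.save_corpus()]
The MmWriter.write_corpus method: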

# Write the data to disk in MM format [static method of MmWriter]
def write_corpus(fname, corpus, progress_cnt=1000, index=False, num_terms=None, metadata=False):
    """
    Save the vector space representation of an entire corpus to disk.
    """
    # create the MmWriter object
    mw = MmWriter(fname)

    # write empty headers to the file (with enough space to be overwritten later)
    # reserve the stats line: 50 spaces followed by a newline
    mw.write_headers(-1, -1, -1)  # will print 50 spaces followed by newline on the stats line

    # calculate necessary header info (nnz elements, num terms, num docs) while writing out vectors
    # the header needs the number of nonzero elements, the number of terms and the number of documents
    _num_terms, num_nnz = 0, 0
    docno, poslast = -1, -1
    offsets = []
    # check whether the corpus has a metadata attribute
    if hasattr(corpus, 'metadata'):
        orig_metadata = corpus.metadata
        corpus.metadata = metadata
        if metadata:
            docno2metadata = {}
    else:
        metadata = False
    # iterate over the corpus; each document is a list of (term id, term frequency) pairs,
    # e.g. [[(id1, freq), (id2, freq), ...], [(id3, freq), (id2, freq), ...], ...]
    for docno, doc in enumerate(corpus):
        if metadata:
            bow, data = doc
            docno2metadata[docno] = data
        else:
            bow = doc
        if docno % progress_cnt == 0:
            logger.info("PROGRESS: saving document #%i" % docno)
        if index:
            posnow = mw.fout.tell()
            if posnow == poslast:
                offsets[-1] = -1
            offsets.append(posnow)
            poslast = posnow
        # write the vector, one "row column nonzero-value" line per entry
        max_id, veclen = mw.write_vector(docno, bow)
        _num_terms = max(_num_terms, 1 + max_id)
        num_nnz += veclen
    if metadata:
        utils.pickle(docno2metadata, fname + '.metadata.cpickle')
        corpus.metadata = orig_metadata

    num_docs = docno + 1
    num_terms = num_terms or _num_terms

    if num_docs * num_terms != 0:
        logger.info("saved %ix%i matrix, density=%.3f%% (%i/%i)" % (
            num_docs, num_terms,
            100.0 * num_nnz / (num_docs * num_terms),
            num_nnz,
            num_docs * num_terms))

    # now write proper headers, by seeking and overwriting the spaces written earlier
    # fill in the stats line that was left blank earlier
    mw.fake_headers(num_docs, num_terms, num_nnz)

    mw.close()
    if index:
        return offsets
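
The "write a blank stats line first, overwrite it at the end" trick is what lets write_corpus stream the corpus in a single pass without knowing the counts up front. The following is a simplified sketch of that idea, not gensim's MmWriter: the function `write_mm`, the constant names, and the use of an in-memory `io.BytesIO` buffer are assumptions made for illustration.

```python
import io

HEADER = b"%%MatrixMarket matrix coordinate real general\n"
STATS_WIDTH = 50  # width of the placeholder, as in the comment in write_corpus


def write_mm(fout, corpus):
    """Stream a BoW corpus to fout, then backfill the size line."""
    fout.write(HEADER)
    stats_pos = fout.tell()                  # remember where the size line starts
    fout.write(b" " * STATS_WIDTH + b"\n")   # placeholder to be overwritten later

    num_terms, num_nnz, docno = 0, 0, -1
    for docno, bow in enumerate(corpus):
        for term_id, weight in sorted(bow):
            # +1 on both indices because MM counts from 1
            line = "%i %i %s\n" % (docno + 1, term_id + 1, weight)
            fout.write(line.encode("utf8"))
        if bow:
            num_terms = max(num_terms, 1 + max(t for t, _ in bow))
        num_nnz += len(bow)

    stats = ("%i %i %i" % (docno + 1, num_terms, num_nnz)).encode("utf8")
    if len(stats) > STATS_WIDTH:
        raise ValueError("size line does not fit into the reserved space")
    fout.seek(stats_pos)                     # jump back and overwrite the spaces
    fout.write(stats)


buf = io.BytesIO()
write_mm(buf, [[(0, 1), (2, 3)], [(1, 2)]])
lines = buf.getvalue().decode("utf8").splitlines()
print(lines)
```

The size line keeps its trailing spaces after the overwrite, which is harmless because the fields are separated by whitespace.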

Each document is written by the write_vector method of the MmWriter class.

# Each vector is converted to coordinate form like this [method of MmWriter]
for termid, weight in vector:  # write term ids in sorted order
    self.fout.write(utils.to_utf8("%i %i %s\n" % (docno + 1, termid + 1, weight)))  # +1 because MM format starts counting from 1
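
Reading the file back is the inverse operation: subtract 1 from both indices and group the data lines by document. Below is a minimal sketch of that step; `read_mm_vectors` is a hypothetical helper written for this post, not gensim's reader, and it assumes a well-formed coordinate file held in memory as a list of lines.

```python
def read_mm_vectors(lines):
    """Group 'docno termid weight' data lines into 0-based BoW documents."""
    # Drop the header and comment lines (they start with '%') and blank lines.
    data = [line for line in lines
            if line.strip() and not line.startswith("%")]
    num_docs, _num_terms, _num_nnz = (int(x) for x in data[0].split())
    docs = [[] for _ in range(num_docs)]
    for line in data[1:]:
        docno, term_id, weight = line.split()
        # -1 converts the 1-based MM indices back to gensim's 0-based ids
        docs[int(docno) - 1].append((int(term_id) - 1, float(weight)))
    return docs


mm_lines = [
    "%%MatrixMarket matrix coordinate real general",
    "2 3 3",
    "1 1 1",
    "1 3 3",
    "2 2 2",
]
docs = read_mm_vectors(mm_lines)
print(docs)
```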

2. SvmLightCorpus

For more on SVMlight, see: http://svmlight.joachims.org/
To save the corpus above in SVMlight format, add this line as a test:

corpora.SvmLightCorpus.serialize('tmp/deerwester.svm', corpus)

Result:
0 1:1 2:1 3:1
0 1:1 4:1 5:1 6:1 7:1 8:1
0 2:1 6:1 8:1 9:1
0 3:1 8:2 9:1
0 5:1 6:1 7:1
0 10:1
0 10:1 11:1
0 10:1 11:1 12:1
0 4:1 11:1 12:1

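The output looks very similar to the bag-of-words representation, except that the ids here start at 1, while the bag-of-words ids start at 0.
The leading "0" on each line is the default label. This file format is meant for storing classification data, so when no classes are given, every vector is assigned to the same class.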

with utils.smart_open(fname, 'wb') as fout:
    for docno, doc in enumerate(corpus):
        label = labels[docno] if labels else 0 # target class is 0 by default
        offsets.append(fout.tell())
        fout.write(utils.to_utf8(SvmLightCorpus.doc2line(doc, label)))
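
Recording `fout.tell()` before each write gives the byte offset at which every document starts, which is what makes random access by document number possible later. A small sketch of the idea, using an in-memory buffer instead of a file; `line_at` is a name invented for this example.

```python
import io

lines_to_write = [b"0 1:1 2:1 3:1\n", b"0 10:1\n", b"0 4:1 11:1 12:1\n"]

buf = io.BytesIO()
offsets = []
for line in lines_to_write:
    offsets.append(buf.tell())  # byte position where this document starts
    buf.write(line)


def line_at(fileobj, offsets, docno):
    """Jump straight to document docno without scanning earlier lines."""
    fileobj.seek(offsets[docno])
    return fileobj.readline()


third = line_at(buf, offsets, 2)
print(offsets, third)
```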

The doc2line method:

pairs = ' '.join("%i:%s" % (termid + 1, termval) for termid, termval in doc) # +1 to convert 0-base to 1-base
return "%s %s\n" % (label, pairs)
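
To see the 0-based to 1-based conversion in both directions, here is a standalone sketch. `doc2line` mirrors the two lines quoted above; `line2doc` is a simplified inverse written for this example, and it assumes plain `label id:value ...` lines without the comments or `qid` fields that the SVMlight format also allows.

```python
def doc2line(doc, label=0):
    """Format a 0-based BoW document as one SVMlight line."""
    pairs = " ".join("%i:%s" % (term_id + 1, value) for term_id, value in doc)
    return "%s %s\n" % (label, pairs)


def line2doc(line):
    """Parse one SVMlight line back into (0-based BoW document, label)."""
    parts = line.split()
    label, pairs = parts[0], parts[1:]
    doc = []
    for pair in pairs:
        feature_id, value = pair.split(":")
        doc.append((int(feature_id) - 1, float(value)))  # back to 0-based ids
    return doc, label


line = doc2line([(2, 1), (7, 2), (8, 1)])
doc, label = line2doc(line)
print(repr(line), doc, label)
```

The example document is the fourth one from the corpus above, so the line matches the fourth row of the result listing.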

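In each line, label is the class of the vector that follows. The line number identifies the document, i.e. the matrix row. In each id:value pair, the first number is the column index and the second is the value of that entry.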

[Author: happyprince, http://blog.csdn.net/ld326/article/details/78396982]
