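Abstract: This post examines the source code of MmCorpus and SvmLightCorpus to see in what form a corpus is saved, reviews the relevant matrix storage formats, and reads through the related code.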
1. MmCorpus
1.1 Introduction to MM
MM is a matrix storage format: the Matrix Market File Format.
《Text File Formats》
http://math.nist.gov/MatrixMarket/formats.html
《The Matrix Market File Format》
http://people.sc.fsu.edu/~jburkardt/data/mm/mm.html
MM file characteristics, from "The Matrix Market File Format":
● ASCII format;
● allow comment lines, which begin with a percent sign;
● use a “coordinate” format for sparse matrices;
● use an “array” format for general dense matrices;
A file in the Matrix Market format comprises four parts:
1. Header line: contains an identifier, and four text fields;
2. Comment lines: allow a user to store information and comments;
3. Size line: specifies the number of rows and columns, and the number of nonzero elements;
4. Data lines: specify the location of the matrix entries (implicitly or explicitly) and their values.
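To make the four parts concrete, here is a minimal sketch that splits a small hand-written coordinate file into header, comments, size line and data lines. The function name split_mm and the sample text are made up for illustration; this is not how gensim reads the format.

```python
def split_mm(text):
    """Split Matrix Market text into header, comments, size line and data lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()  # identifier followed by four text fields
    # Comment lines begin with a percent sign.
    comments = [ln for ln in lines[1:] if ln.startswith('%')]
    body = [ln for ln in lines[1:] if not ln.startswith('%')]
    # Size line of a coordinate file: rows, columns, number of nonzeros.
    size = tuple(int(x) for x in body[0].split())
    data = [ln.split() for ln in body[1:]]
    return header, comments, size, data


# Hypothetical 2x3 sparse matrix with two nonzero entries.
sample = """%%MatrixMarket matrix coordinate real general
% a tiny hand-written example
2 3 2
1 1 1.5
2 3 4
"""

header, comments, size, data = split_mm(sample)
print(header)
print(size)
print(data)
```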
Coordinate Format: sparse matrices;
Array Format: dense matrices.
The two formats can be converted into each other.
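As an illustration of that conversion, the sketch below turns a small dense matrix into 1-based (row, column, value) triples and back. A nested Python list stands in for the dense form, and both function names are hypothetical.

```python
def dense_to_coordinate(matrix):
    """Return (rows, cols, entries); entries are 1-based (i, j, value) triples."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    entries = [(i + 1, j + 1, v)
               for i, row in enumerate(matrix)
               for j, v in enumerate(row) if v != 0]
    return rows, cols, entries


def coordinate_to_dense(rows, cols, entries):
    """Rebuild the dense matrix; positions not listed are zero."""
    matrix = [[0] * cols for _ in range(rows)]
    for i, j, v in entries:
        matrix[i - 1][j - 1] = v
    return matrix


dense = [[1, 0, 0],
         [0, 0, 2],
         [0, 3, 0]]
rows, cols, entries = dense_to_coordinate(dense)
print(entries)  # [(1, 1, 1), (2, 3, 2), (3, 2, 3)]
print(coordinate_to_dense(rows, cols, entries) == dense)  # True
```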
1.2 gensim example
Demo:
from gensim import corpora
texts = [['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('tmp/deerwester.mm', corpus)
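The file written by gensim stores a 9×12 matrix with 28 nonzero entries in total.
Tracing the calls in the source code:
MmCorpus is a subclass of IndexedCorpus. Saving an MmCorpus is done mainly by having MmCorpus call MmWriter; this can be seen as converting a two-dimensional array into coordinate form, that is, saving it as a sparse matrix.
save_corpus(), as implemented by MmCorpus, calls MmWriter.write_corpus, which is a static method: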
# write the data to disk in MM format [static method of MmWriter]
def write_corpus(fname, corpus, progress_cnt=1000, index=False, num_terms=None, metadata=False):
    """
    Save the vector space representation of an entire corpus to disk.
    """
    # create the MmWriter object
    mw = MmWriter(fname)

    # write empty headers to the file (with enough space to be overwritten later)
    # the stats line is reserved as 50 spaces followed by a newline
    mw.write_headers(-1, -1, -1)  # will print 50 spaces followed by newline on the stats line

    # calculate necessary header info (nnz elements, num terms, num docs) while writing out vectors
    _num_terms, num_nnz = 0, 0
    docno, poslast = -1, -1
    offsets = []
    # check whether the corpus has a metadata attribute
    if hasattr(corpus, 'metadata'):
        orig_metadata = corpus.metadata
        corpus.metadata = metadata
        if metadata:
            docno2metadata = {}
    else:
        metadata = False
    # iterate over the 2-D array; each element is a (term id, term frequency) pair,
    # e.g. [[(id1, freq), (id2, freq), ...], [(id3, freq), (id2, freq), ...], ...]
    for docno, doc in enumerate(corpus):
        if metadata:
            bow, data = doc
            docno2metadata[docno] = data
        else:
            bow = doc
        if docno % progress_cnt == 0:
            logger.info("PROGRESS: saving document #%i" % docno)
        if index:
            posnow = mw.fout.tell()
            if posnow == poslast:
                offsets[-1] = -1
            offsets.append(posnow)
            poslast = posnow
        # write the vector as lines of: coordinate 1, coordinate 2, nonzero value
        max_id, veclen = mw.write_vector(docno, bow)
        _num_terms = max(_num_terms, 1 + max_id)
        num_nnz += veclen
    if metadata:
        utils.pickle(docno2metadata, fname + '.metadata.cpickle')
        corpus.metadata = orig_metadata

    num_docs = docno + 1
    num_terms = num_terms or _num_terms

    if num_docs * num_terms != 0:
        logger.info("saved %ix%i matrix, density=%.3f%% (%i/%i)" % (
            num_docs, num_terms,
            100.0 * num_nnz / (num_docs * num_terms),
            num_nnz,
            num_docs * num_terms))

    # now write proper headers, by seeking and overwriting the spaces written earlier
    mw.fake_headers(num_docs, num_terms, num_nnz)

    mw.close()
    if index:
        return offsets
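The central trick in write_corpus is that the matrix dimensions are unknown until every document has been streamed, so a blank stats line is reserved first and overwritten at the end. The following is a minimal sketch of that idea on an in-memory buffer. The function name write_with_late_header is hypothetical, and the details (for example the exact header text and the 50-byte placeholder) are assumptions modelled on the comments in the excerpt, not gensim's actual implementation.

```python
import io

# Assumed header line for a sparse, real-valued, general matrix.
HEADER = b"%%MatrixMarket matrix coordinate real general\n"


def write_with_late_header(fout, docs):
    """Stream docs as MM coordinate lines, then fill in the stats line."""
    fout.write(HEADER)
    stats_pos = fout.tell()
    fout.write(b" " * 50 + b"\n")  # placeholder for "rows cols nnz"
    num_terms, num_nnz, docno = 0, 0, -1
    for docno, bow in enumerate(docs):
        for termid, weight in sorted(bow):
            # +1 because MM indices start at 1
            line = "%i %i %s\n" % (docno + 1, termid + 1, weight)
            fout.write(line.encode("utf8"))
            num_terms = max(num_terms, termid + 1)
            num_nnz += 1
    num_docs = docno + 1
    stats = ("%i %i %i" % (num_docs, num_terms, num_nnz)).encode("utf8")
    if len(stats) > 50:
        raise ValueError("stats line does not fit in the reserved space")
    # Seek back and overwrite the leading part of the placeholder.
    fout.seek(stats_pos)
    fout.write(stats)


buf = io.BytesIO()
write_with_late_header(buf, [[(0, 1), (2, 1)], [(1, 2)]])
text = buf.getvalue().decode("utf8")
print(text)
```

The trailing spaces left on the stats line are harmless because the three numbers are separated by whitespace when read back.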
Each record is saved by calling the write_vector method of the MmWriter class.
# each vector is converted to this coordinate form [MmWriter instance method]
for termid, weight in vector:  # write term ids in sorted order
    self.fout.write(utils.to_utf8("%i %i %s\n" % (docno + 1, termid + 1, weight)))  # +1 because MM format starts counting from 1
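A standalone sketch of what this loop produces for one document is given below. The function vector_to_lines is a hypothetical stand-in that returns the lines instead of writing them, together with the largest term id and the vector length, which are the two values write_corpus unpacks from write_vector. The input vector is invented for the example.

```python
def vector_to_lines(docno, vector):
    """Return (lines, max_id, veclen) for one sparse document vector."""
    vector = sorted(vector)  # term ids in ascending order
    lines = ["%i %i %s" % (docno + 1, termid + 1, weight)
             for termid, weight in vector]
    max_id = vector[-1][0] if vector else -1
    return lines, max_id, len(vector)


# Document number 3 (0-based) with three 0-based term ids and their counts.
lines, max_id, veclen = vector_to_lines(3, [(7, 2), (2, 1), (8, 1)])
print(lines)           # ['4 3 1', '4 8 2', '4 9 1']
print(max_id, veclen)  # 8 3
```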
2. SvmLightCorpus
For more about SVMlight, see: http://svmlight.joachims.org/
To save the corpus above in svmlight format, add this line to the test code:
corpora.SvmLightCorpus.serialize('tmp/deerwester.svm', corpus)
Result:
0 1:1 2:1 3:1
0 1:1 4:1 5:1 6:1 7:1 8:1
0 2:1 6:1 8:1 9:1
0 3:1 8:2 9:1
0 5:1 6:1 7:1
0 10:1
0 10:1 11:1
0 10:1 11:1 12:1
0 4:1 11:1 12:1
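The output closely resembles the bag-of-words representation, except that term ids here start at 1, whereas bag-of-words ids start at 0.
The leading "0" is the default label. This file format is meant for storing classified data; when no class is specified, all vectors end up in the same class.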
with utils.smart_open(fname, 'wb') as fout:
    for docno, doc in enumerate(corpus):
        label = labels[docno] if labels else 0  # target class is 0 by default
        offsets.append(fout.tell())
        fout.write(utils.to_utf8(SvmLightCorpus.doc2line(doc, label)))
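The offsets collected in this loop record the byte position at which each document starts, which is what makes random access to a single document possible later. A minimal sketch of that idea on an in-memory buffer follows; both helper names are hypothetical and this is not gensim's own indexing code.

```python
import io


def write_lines_with_offsets(fout, lines):
    """Write each line and remember the byte position where it starts."""
    offsets = []
    for line in lines:
        offsets.append(fout.tell())
        fout.write(line.encode("utf8"))
    return offsets


def read_line_at(fin, offset):
    """Jump straight to a stored offset and read one document line."""
    fin.seek(offset)
    return fin.readline().decode("utf8")


buf = io.BytesIO()
offsets = write_lines_with_offsets(
    buf, ["0 1:1 2:1 3:1\n", "0 10:1\n", "0 10:1 11:1\n"])
print(offsets)                             # [0, 14, 21]
print(repr(read_line_at(buf, offsets[1])))  # '0 10:1\n'
```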
The doc2line method:
pairs = ' '.join("%i:%s" % (termid + 1, termval) for termid, termval in doc) # +1 to convert 0-base to 1-base
return "%s %s\n" % (label, pairs)
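In each line, label is the class of the vector that follows. The line number identifies the document, that is, the row; within each pair, the first number is the column index and the last is the value.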
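The round trip between a bag-of-words vector and one such line can be sketched as follows. doc2line mirrors the two lines quoted above; line2doc is a simplified, hypothetical inverse that assumes a plain "label id:value ..." line without comments or other extensions of the format.

```python
def doc2line(doc, label=0):
    """Format one bag-of-words vector as an svmlight line."""
    # +1 to convert 0-based term ids to 1-based
    pairs = ' '.join("%i:%s" % (termid + 1, termval) for termid, termval in doc)
    return "%s %s\n" % (label, pairs)


def line2doc(line):
    """Parse a plain svmlight line back into (doc, label)."""
    parts = line.split()
    label = parts[0]
    doc = []
    for pair in parts[1:]:
        termid, termval = pair.split(':')
        doc.append((int(termid) - 1, float(termval)))  # back to 0-based ids
    return doc, label


line = doc2line([(2, 1), (7, 2), (8, 1)])
print(repr(line))      # '0 3:1 8:2 9:1\n'
print(line2doc(line))  # ([(2, 1.0), (7, 2.0), (8, 1.0)], '0')
```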
[Author: happyprince, http://blog.csdn.net/ld326/article/details/78396982]