Corpora and Vector Spaces

Thinking_boy1992

If you want to see logging events, don't forget to set:

>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

From Strings to Vectors
Let's start from documents represented as strings:

>>> from gensim import corpora
>>>
>>> documents = ["Human machine interface for lab abc computer applications",
...              "A survey of user opinion of computer system response time",
...              "The EPS user interface management system",
...              "System and human system engineering testing of EPS",
...              "Relation of user perceived response time to error measurement",
...              "The generation of random binary unordered trees",
...              "The intersection graph of paths in trees",
...              "Graph minors IV Widths of trees and well quasi ordering",
...              "Graph minors A survey"]

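This small corpus contains only nine documents, each consisting of a single sentence.
First, let's tokenize the documents and remove common words (using a small stop list) as well as words that appear only once in the corpus: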

>>> # remove common words and tokenize
>>> stoplist = set('for a of the and to in'.split())
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
...          for document in documents]
>>>
>>> # remove words that appear only once
>>> from collections import defaultdict
>>> frequency = defaultdict(int)
>>> for text in texts:
...     for token in text:
...         frequency[token] += 1
...
>>> texts = [[token for token in text if frequency[token] > 1]
...          for text in texts]
>>>
>>> from pprint import pprint  # pretty-printer
>>> pprint(texts)
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

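Your way of processing documents will likely differ; here, we only split on whitespace and then lowercase each word.
The ways to process documents are so varied, and so dependent on the application and the language, that no fixed interface is imposed on them. Instead, a document is represented by the features extracted from it, not by its surface string form; how you extract those features is up to you.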
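
As an illustration of how the tokenization step can vary, here is a minimal sketch of an alternative tokenizer that also strips punctuation. The function name `simple_tokenize` and the regular expression are assumptions made for this example; they are not part of gensim.

```python
import re

def simple_tokenize(text, stoplist=frozenset()):
    """Lowercase, keep runs of ASCII letters and digits, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stoplist]

stoplist = frozenset("for a of the and to in".split())
tokens = simple_tokenize("Human-machine interface, for lab ABC!", stoplist)
print(tokens)
```

Whether punctuation should be removed, or words stemmed, depends on the application; the rest of this tutorial keeps the plain whitespace split.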
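Below, we describe one common, general-purpose approach (called bag-of-words), but keep in mind that different application domains call for different features.
To convert documents to vectors, we use a document representation called bag-of-words.
In this representation, each document is represented by one vector, and each vector element records how many times a particular word (for example, "system") appears in the document.
It is advantageous to refer to words by their integer ids. The mapping between words and their ids is called a dictionary: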

>>> dictionary = corpora.Dictionary(texts)
>>> dictionary.save('/tmp/deerwester.dict')  # store the dictionary, for future reference
>>> print(dictionary)
Dictionary(12 unique tokens)

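Here we assigned a unique integer id to every word appearing in the corpus, using the gensim.corpora.dictionary.Dictionary class.
This sweeps across the texts, collecting word counts and related statistics. In the end, there are twelve distinct words in the processed corpus, which means each document will be represented by twelve numbers (a 12-dimensional vector). The mapping between words and ids is shown below: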

>>> print(dictionary.token2id)
{'minors': 11, 'graph': 10, 'system': 5, 'trees': 9, 'eps': 8, 'computer': 0,
'survey': 4, 'user': 7, 'human': 1, 'time': 6, 'interface': 2, 'response': 3}
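
To make the idea of a dictionary concrete, the following is a minimal pure-Python sketch of assigning consecutive integer ids to tokens. The helper `build_token2id` is hypothetical. It assumes new tokens are numbered in order of first appearance, visiting the tokens of each document in sorted order; this happens to reproduce the ids printed above, but the assignment order inside gensim is an implementation detail that should not be relied on.

```python
def build_token2id(texts):
    """Map each distinct token to a consecutive integer id."""
    token2id = {}
    for text in texts:
        # Assumption: new tokens within a document are visited in sorted order.
        for token in sorted(set(text)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

token2id = build_token2id(texts)
print(len(token2id), token2id['computer'], token2id['minors'])
```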

To convert tokenized documents to vectors:

>>> new_doc = "Human computer interaction"
>>> new_vec = dictionary.doc2bow(new_doc.lower().split())
>>> print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored
[(0, 1), (1, 1)]

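The function doc2bow() simply counts the number of occurrences of each distinct word, converts the word to its integer id, and returns the result as a sparse vector. The sparse vector [(0, 1), (1, 1)] therefore reads: in the document "Human computer interaction", the words "computer" (id 0) and "human" (id 1) each appear once; the other ten dictionary words appear (implicitly) zero times.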
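
The counting step can be sketched in a few lines of plain Python. The function `doc2bow_sketch` below is an illustrative stand-in, not gensim's implementation; it assumes the word-to-id mapping shown earlier and silently ignores unknown words.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Return a sorted list of (token_id, count) pairs for known tokens."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())

token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7,
            'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

new_vec = doc2bow_sketch("human computer interaction".split(), token2id)
repeated = doc2bow_sketch(['system', 'human', 'system', 'eps'], token2id)
print(new_vec)   # "interaction" is not in the mapping and is skipped
print(repeated)  # "system" occurs twice in this document
```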

>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus)  # store to disk, for later use
>>> for doc in corpus:  # print one sparse vector per line
...     print(doc)
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]

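Corpus Streaming – One Document at a Time
Note that the corpus above resides fully in memory, as a plain Python list. Suppose instead that the corpus contains millions of documents; storing all of them in RAM is not feasible. Let's assume the documents are stored in a file on disk, one document per line.
Gensim only requires that a corpus be able to return one document vector at a time: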

>>> class MyCorpus(object):
...     def __iter__(self):
...         # assume there's one document per line, tokens separated by whitespace
...         with open('mycorpus.txt') as f:  # close the file when iteration ends
...             for line in f:
...                 yield dictionary.doc2bow(line.lower().split())

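The assumption that each document occupies one line in a single file is not important; you can change the __iter__ function to fit whatever input format you have, for example by parsing XML or accessing the network.
Just parse your input into a clean list of tokens for each document, then convert the tokens to ids via the dictionary, and yield the resulting sparse vector inside __iter__.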
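
The streaming pattern itself does not depend on gensim. The sketch below uses only the standard library; the class name `LineCorpus`, the three-word mapping, and the temporary file are all assumptions made for the example. It reopens the file on every pass, so the corpus can be iterated more than once while holding a single document in memory at a time.

```python
import os
import tempfile
from collections import Counter

class LineCorpus:
    """Hypothetical streaming corpus: one whitespace-tokenized document per line."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        # Reopen the file on each pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                ids = (self.token2id[t] for t in line.lower().split()
                       if t in self.token2id)
                yield sorted(Counter(ids).items())

token2id = {"graph": 0, "minors": 1, "trees": 2}
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Graph minors trees\nTrees trees graph\n")
    path = tmp.name

corpus = LineCorpus(path, token2id)
first_pass = list(corpus)
second_pass = list(corpus)  # a second iteration yields the same vectors
os.remove(path)
print(first_pass)
```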

>>> corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
>>> print(corpus_memory_friendly)
<__main__.MyCorpus object at 0x10d5690>

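The corpus is now an object. We didn't define any way to print it, so print just outputs the address of the object in memory. To see the constituent vectors, let's iterate over the corpus and print each document vector, one at a time: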

>>> for vector in corpus_memory_friendly:  # load one vector into memory at a time
...     print(vector)
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]

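Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.
Similarly, the dictionary can be constructed without loading all the texts into memory: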

>>> # collect statistics about all tokens
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
>>> # remove stop words and words that appear only once
>>> stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
...             if stopword in dictionary.token2id]
>>> # dictionary.dfs maps token id -> number of documents containing the token
>>> once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
>>> dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
>>> dictionary.compactify()  # remove gaps in id sequence after words that were removed
>>> print(dictionary)
Dictionary(12 unique tokens)
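
For readers who want to see the filtering logic without gensim, here is a rough standard-library sketch of the same idea: stream over the documents once, count in how many documents each token occurs, then drop stop words and tokens seen in only one document. The helper `build_filtered_vocab` is hypothetical, and it numbers the surviving tokens alphabetically, which differs from the ids gensim assigns.

```python
from collections import Counter

def build_filtered_vocab(token_stream, stoplist):
    """token_stream yields one list of tokens per document."""
    doc_freq = Counter()
    for tokens in token_stream:
        doc_freq.update(set(tokens))  # count each token once per document
    kept = sorted(token for token, df in doc_freq.items()
                  if df > 1 and token not in stoplist)
    # Assign gap-free ids, alphabetically (an assumption of this sketch).
    return {token: idx for idx, token in enumerate(kept)}

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
stoplist = set("for a of the and to in".split())

# The generator expression feeds one tokenized document at a time.
vocab = build_filtered_vocab((doc.lower().split() for doc in documents), stoplist)
print(len(vocab))
```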

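That is all there is to the bag-of-words representation. Raw word counts are of limited use on their own, however: we first need to apply a transformation to this simple representation before we can use it to compute anything meaningful, such as document similarity.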

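Corpus Formats
There are several file formats for serializing a vector space corpus to disk.
Gensim implements them via the streaming corpus interface described above: documents are read from disk lazily, one document at a time, without the whole corpus being loaded into memory at once.
One of the more notable file formats is the Matrix Market format. To save a corpus in the Matrix Market format: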

>>> # create a toy corpus of 2 documents, as a plain Python list
>>> corpus = [[(1, 0.5)], []]  # make one document empty, for the heck of it
>>>
>>> corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)
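
To give a feel for what such a file contains, the sketch below renders a sparse corpus as Matrix Market coordinate text: a banner line, a size line with the number of rows, columns, and non-zero entries, then one 1-based "row column value" line per entry. The function `to_matrix_market_lines` is a hypothetical helper that treats documents as rows; it is not gensim's writer, and the file gensim produces may differ in details.

```python
def to_matrix_market_lines(corpus, num_terms):
    """Render a list of sparse documents as Matrix Market coordinate lines."""
    entries = []
    for doc_no, doc in enumerate(corpus, start=1):
        for term_id, weight in doc:
            # Matrix Market indices are 1-based.
            entries.append(f"{doc_no} {term_id + 1} {weight}")
    header = "%%MatrixMarket matrix coordinate real general"
    size_line = f"{len(corpus)} {num_terms} {len(entries)}"
    return [header, size_line] + entries

lines = to_matrix_market_lines([[(1, 0.5)], []], num_terms=2)
print("\n".join(lines))
```

The empty second document contributes no entry lines; it is recorded only through the document count in the size line.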

Other formats include Joachims' SVMlight format, Blei's LDA-C format and the GibbsLDA++ format:

>>> corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
>>> corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
>>> corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)
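
As a rough picture of the SVMlight layout, each document becomes one line of the form "label index:value index:value ...", with feature indices starting at 1 and listed in increasing order. The helper `to_svmlight_line` below is hypothetical, and the default label 0 is a placeholder assumption, since a bag-of-words corpus carries no class labels.

```python
def to_svmlight_line(doc, label=0):
    """Format one sparse document as an SVMlight-style line."""
    pairs = " ".join(f"{term_id + 1}:{weight}" for term_id, weight in sorted(doc))
    return f"{label} {pairs}".rstrip()

single = to_svmlight_line([(1, 0.5)])
empty = to_svmlight_line([])
unsorted_doc = to_svmlight_line([(3, 2), (0, 1)])
print(single)
print(unsorted_doc)
```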

Conversely, to load a corpus iterator from a Matrix Market file:

>>> corpus = corpora.MmCorpus('/tmp/corpus.mm')

Corpus objects are streams, so printing one directly shows only a summary, not its contents:

>>> print(corpus)
MmCorpus(2 documents, 2 features, 1 non-zero entries)

Instead, to view the contents of a corpus:

>>> # one way of printing a corpus: load it entirely into memory
>>> print(list(corpus))  # calling list() will convert any sequence to a plain Python list
[[(1, 0.5)], []]

or:

>>> # another way of doing it: print one document at a time, making use of the streaming interface
>>> for doc in corpus:
...     print(doc)
[(1, 0.5)]
[]

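The second way is obviously more memory friendly, although for testing and development purposes nothing beats the simplicity of calling list(corpus).
To save the same Matrix Market document stream in Blei's LDA-C format: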

>>> corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)

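In this way, gensim can also be used as a memory-efficient I/O format conversion tool: load a document stream in one format and immediately save it in another.

Compatibility with NumPy and SciPy
Gensim also contains efficient utility functions to help with converting to and from numpy matrices: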

>>> import gensim
>>> import numpy as np
>>> numpy_matrix = np.random.randint(10, size=[5,2])  # random matrix as an example
>>> corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
>>> # by default, rows are features and columns are documents
>>> number_of_corpus_features = numpy_matrix.shape[0]
>>> numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms=number_of_corpus_features)
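
The round trip between a dense matrix and a sparse corpus can be sketched with nested Python lists. Both helpers below are hypothetical and assume the same orientation as the example above: rows are features and columns are documents. The sketch also shows why the number of features must be passed explicitly on the way back: a sparse corpus does not record features that never occur.

```python
def dense_to_corpus(matrix):
    """Treat each column of a nested-list matrix as one sparse document."""
    num_terms, num_docs = len(matrix), len(matrix[0])
    return [[(t, matrix[t][d]) for t in range(num_terms) if matrix[t][d] != 0]
            for d in range(num_docs)]

def corpus_to_dense(corpus, num_terms):
    """Rebuild the feature-by-document matrix from sparse documents."""
    matrix = [[0] * len(corpus) for _ in range(num_terms)]
    for d, doc in enumerate(corpus):
        for t, weight in doc:
            matrix[t][d] = weight
    return matrix

matrix = [[1, 0],
          [0, 2],
          [3, 0],
          [0, 0]]  # the last feature never occurs in any document
corpus = dense_to_corpus(matrix)
restored = corpus_to_dense(corpus, num_terms=4)
print(corpus)
```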

and to and from scipy.sparse matrices:

>>> import scipy.sparse
>>> scipy_sparse_matrix = scipy.sparse.random(5,2)  # random sparse matrix as example
>>> corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
>>> scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)

For a complete reference, see the API documentation.

The next tutorial covers Topics and Transformations.
