gensim (Part 2): Converting Between Corpora and Vectors

This article shows how to convert text into vector representations, and how to stream a corpus document by document and save it to disk.

import logging
from pprint import pprint
from collections import defaultdict
# configure the logging format and level
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
# remove stop words
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]
pprint(texts)

There are many ways to preprocess documents, and the quality of the preprocessed data directly affects the final result; which preprocessing approach to choose depends on the content of the documents.
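For reference, gensim also ships a simple built-in tokenizer, gensim.utils.simple_preprocess, which lowercases the text, tokenizes it, and drops very short or very long tokens. A minimal sketch of applying it to one of the documents above (the output shown is what its default settings would produce):

from gensim.utils import simple_preprocess

# lowercase, tokenize and keep tokens of length 2-15 (the defaults)
print(simple_preprocess("Human machine interface for lab abc computer applications"))
# ['human', 'machine', 'interface', 'for', 'lab', 'abc', 'computer', 'applications']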

Save the dictionary

from gensim import corpora

dictionary = corpora.Dictionary(texts)
dictionary.save('tmp/deerwester.dict')  # store the dictionary for future reference
print(dictionary)
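In a later session the persisted dictionary can be loaded back from disk with corpora.Dictionary.load; a short sketch using the path from the save call above:

from gensim import corpora

# reload the dictionary saved with dictionary.save(...)
dictionary = corpora.Dictionary.load('tmp/deerwester.dict')
print(dictionary)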

There are 12 distinct words in the dictionary, so each document will be represented by a 12-dimensional vector.
Inspect the dictionary's token-to-id mapping:

print(dictionary.token2id)
{'minors': 11, 'computer': 0, 'graph': 10, 'survey': 4, 'time': 6, 'eps': 8, 'interface': 2, 'system': 5, 'human': 1, 'user': 7, 'response': 3, 'trees': 9}

Convert a document to a vector

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

The word interaction does not appear in the dictionary, so it is simply ignored; only words present in the dictionary show up in the resulting vector:

[(0, 1), (1, 1)]

doc2bow counts the number of occurrences of each distinct word, maps each word to its integer id, and returns the result as a sparse bag-of-words vector.
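To make the counting behaviour concrete, here is a small example built on the token2id mapping printed above ('computer' has id 0 and 'human' has id 1), so a repeated in-dictionary word simply gets a higher count:

print(dictionary.doc2bow("Human computer human".lower().split()))
# [(0, 1), (1, 2)]  -- computer appears once, human twice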

Save the corpus to disk:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('tmp/deerwester.mm', corpus)  
print(corpus)

In the code above, the whole bag-of-words corpus is held in memory. That is fine for a small corpus, but for a large corpus the memory will run out.

In that case, the documents should be streamed with an iterator; this way the raw corpus can be read from disk or over the network one document at a time.

from smart_open import open  # for transparently opening remote files

class MyCorpus(object):
    def __iter__(self):
        for line in open('https://radimrehurek.com/gensim/mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

The same pattern works for a local file, processed line by line:

class MyCorpus(object):
    def __iter__(self):
        for line in open('jieba_zhu',encoding='utf-8'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())
corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)

Likewise, building the dictionary does not require loading every document into memory at the same time; it too can be done line by line.

# build the dictionary by streaming the file one line at a time
dictionary = corpora.Dictionary(line.split('/') for line in open('jieba_zhuxian', encoding='utf-8'))
# collect the ids of stop words that made it into the dictionary
stop_ids = [
    dictionary.token2id[stopword]
    for stopword in stoplist
    if stopword in dictionary.token2id
]
# collect the ids of words that appear in only one document
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
# remove stop words and words that appear only once
dictionary.filter_tokens(stop_ids + once_ids)
# remove the gaps in the id sequence left by the removed words
dictionary.compactify()
print(dictionary)
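The Dictionary class also has a built-in helper, filter_extremes, that covers this kind of pruning in one call; a hedged sketch with illustrative thresholds (not taken from the original post):

# keep tokens that occur in at least 2 documents and in no more than 50% of all documents,
# then keep at most the 100000 most frequent remaining tokens
dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=100000)
print(dictionary)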

gensim supports several serialization formats for corpora:
Matrix Market format

corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)
# load the saved corpus back
corpus = corpora.MmCorpus('/tmp/corpus.mm')

Joachim’s SVMlight format

corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)

Blei’s LDA-C format

corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)

GibbsLDA++ format

corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)
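Whichever format is used, the serialized corpus can be loaded back and iterated lazily, one document at a time, without reading everything into memory; for example with the Matrix Market file saved above:

corpus = corpora.MmCorpus('/tmp/corpus.mm')
print(corpus)          # prints only metadata; the documents stay on disk
for doc in corpus:     # each document is loaded from disk on demand
    print(doc)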

Convert a NumPy matrix to a gensim corpus:

import gensim
import numpy as np
numpy_matrix = np.random.randint(10, size=[5, 2])  # random matrix as an example
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
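The reverse conversion is gensim.matutils.corpus2dense, which needs to know the vocabulary size; a brief sketch reusing the corpus just created (num_terms=5 matches the 5 rows of numpy_matrix):

dense = gensim.matutils.corpus2dense(corpus, num_terms=5)
print(dense.shape)  # (5, 2): terms x documents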

And converting to and from a scipy.sparse matrix:

import scipy.sparse
scipy_sparse_matrix = scipy.sparse.random(5, 2)  # random sparse matrix as example
corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)