gensim: Corpora and Vector Spaces

Official documentation:
http://radimrehurek.com/gensim/tutorial.html#first-example
Corpus: close to a vocabulary, but not the same thing; it is a matrix-like representation of a whole document set.
Vector: each document is described by one vector in the vector space.
For example:

Vocabulary: fly  sky  moon (at positions 0, 1, 2)
Document set:
    I fly in the sky,will fly to the moon
    I love blue sky with white clouds
    let's go
The corresponding corpus is:
    [
        [(0,2),(1,1),(2,1)]  # fly appears twice, sky once, moon once
        [(1,1)]
        []
    ]
Vectors: [(1,1)], [(0,2),(1,1),(2,1)], and [] are each a document vector in this vector space.
For example, a document with the content "The moon goes round the earth"
has the vector [(2,1)].
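The bag-of-words conversion in this example can be sketched in plain Python; the hard-coded `vocab` mapping and the `doc2bow` helper below are illustrative stand-ins for what gensim does internally:

```python
import re

# Toy vocabulary from the example above: token -> id
vocab = {"fly": 0, "sky": 1, "moon": 2}

def doc2bow(doc, vocab):
    """Count vocabulary tokens in doc; words outside the vocabulary are ignored."""
    counts = {}
    for token in re.findall(r"[a-z']+", doc.lower()):
        if token in vocab:
            counts[vocab[token]] = counts.get(vocab[token], 0) + 1
    return sorted(counts.items())

print(doc2bow("I fly in the sky,will fly to the moon", vocab))  # [(0, 2), (1, 1), (2, 1)]
print(doc2bow("The moon goes round the earth", vocab))          # [(2, 1)]
print(doc2bow("let's go", vocab))                               # []
```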

1. Building a corpus from a document set

1. First derive a vocabulary from the document set:
0) Tokenize
1) Remove meaningless stop words (articles, conjunctions, etc.)
2) Remove low-frequency words: drop words that appear only once (so the corpus matrix does not get too sparse)
3) Build the vocabulary from the remaining words
2. Then use the vocabulary to convert the document set into a corpus.

# Document set
documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]
# Remove articles, conjunctions, and other stop words
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
# Remove words that appear only once
texts = [[token for token in text if frequency[token] > 1]
          for text in texts]
from pprint import pprint
pprint(texts)

# Build the vocabulary (Dictionary)
from gensim import corpora
dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')
print(dictionary)
print(dictionary.token2id)

# Convert each document into a bag-of-words vector; together they form the corpus
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)

2. Getting started with gensim corpora

1. corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus) saves the corpus to the file corpus.mm.
2. Get the vector new_vec for a new document new_doc:

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())

3. Reading a document set from a file

class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())
corpus_memory_friendly = MyCorpus()
for vector in corpus_memory_friendly:
    print(vector)

4. The following way of building a corpus is the one commonly used in practice:

from six import iteritems
# Build the dictionary by streaming the file line by line
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
# Collect the ids of stop words
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
             if stopword in dictionary.token2id]
# Collect the ids of words that appear in only one document
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)
print(dictionary.token2id)
# Remove the gaps in the id sequence left by the removed tokens
dictionary.compactify()
print(dictionary.token2id)