gensim: Corpora and Vector Spaces

Official documentation:
http://radimrehurek.com/gensim/tutorial.html#first-example
Corpus: close to a vocabulary, but not the same thing; it is a matrix-like representation of a whole document set.
Vector: each document is described by one vector in the vector space.
For example:

Vocabulary: fly  sky  moon (at positions 0, 1, 2)
Document set:
    I fly in the sky,will fly to the moon
    I love blue sky with white clouds
    let's go
The corresponding corpus is:
    [
        [(0,2),(1,1),(2,1)]  # fly appears twice, sky once, moon once
        [(1,1)]
        []
    ]
Vectors: [(1,1)], [(0,2),(1,1),(2,1)], and [] are each a document vector in this vector space.
For example, a document with the content "The moon goes round the earth"
has the vector [(2,1)].
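The bag-of-words conversion in this example can be sketched in plain Python; the hard-coded `vocab` mapping and the `doc2bow` helper below are illustrative stand-ins for what gensim does internally:

```python
import re

# Toy vocabulary from the example above: token -> id
vocab = {"fly": 0, "sky": 1, "moon": 2}

def doc2bow(doc, vocab):
    """Count vocabulary tokens in doc; words outside the vocabulary are ignored."""
    counts = {}
    for token in re.findall(r"[a-z']+", doc.lower()):
        if token in vocab:
            counts[vocab[token]] = counts.get(vocab[token], 0) + 1
    return sorted(counts.items())

print(doc2bow("I fly in the sky,will fly to the moon", vocab))  # [(0, 2), (1, 1), (2, 1)]
print(doc2bow("The moon goes round the earth", vocab))          # [(2, 1)]
print(doc2bow("let's go", vocab))                               # []
```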

1. Building a corpus from a document set

1. First derive a vocabulary from the document set:
0) Tokenize
1) Remove meaningless stop words (articles, conjunctions, etc.)
2) Remove low-frequency words: drop words that appear only once (so the corpus matrix does not get too sparse)
3) Build the vocabulary from the remaining words
2. Then use the vocabulary to convert the document set into a corpus.

# Document set
documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]
# Remove articles, conjunctions, and other stop words
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
# Remove words that appear only once
texts = [[token for token in text if frequency[token] > 1]
          for text in texts]
from pprint import pprint
pprint(texts)

# Build the vocabulary (Dictionary)
from gensim import corpora
dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')
print(dictionary)
print(dictionary.token2id)

# Convert each document into a bag-of-words vector; together they form the corpus
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)

2. Getting started with gensim corpora

1. corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus) saves the corpus to the file corpus.mm.
2. Get the vector new_vec for a new document new_doc:

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())

3. Reading a document set from a file

class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())
corpus_memory_friendly = MyCorpus()
for vector in corpus_memory_friendly:
    print(vector)

4. The following way of building a corpus is the one commonly used in practice:

from six import iteritems
# Build the dictionary by streaming the file line by line
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
# Collect the ids of stop words
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
             if stopword in dictionary.token2id]
# Collect the ids of words that appear in only one document
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)
print(dictionary.token2id)
# Remove the gaps in the id sequence left by the removed tokens
dictionary.compactify()
print(dictionary.token2id)