Word-by-word document vectorization on top of the gensim bag-of-words model: custom code

姚贤贤

Article link: https://blog.csdn.net/u011311291/article/details/79199091

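In gensim, the only built-in way to vectorize a text is dictionary.doc2bow, which produces a vector of type list(tuple(id, freq)). That representation records how often each word occurs but not the order in which the words appear, so to capture the sequence of words in a document gensim has to be extended.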
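To make the difference concrete, here is a minimal stand-alone sketch in plain Python (it does not call gensim). The toy `token2id` mapping and the helper names `bag_of_words` and `id_sequence` are hypothetical and exist only for this illustration. It shows two sentences with the same words in a different order: their (id, freq) representations are identical, while their id sequences differ.

```python
from collections import Counter

# Hypothetical toy vocabulary, standing in for a dictionary's token -> id mapping
token2id = {"我": 0, "喜欢": 1, "你": 2}


def bag_of_words(tokens):
    """Return a sorted list of (id, freq) pairs, the same shape doc2bow produces."""
    counts = Counter(token2id[t] for t in tokens)
    return sorted(counts.items())


def id_sequence(tokens):
    """Return the token ids in document order."""
    return [token2id[t] for t in tokens]


a = ["我", "喜欢", "你"]
b = ["你", "喜欢", "我"]

print(bag_of_words(a) == bag_of_words(b))  # True: word order is lost
print(id_sequence(a))                      # [0, 1, 2]
print(id_sequence(b))                      # [2, 1, 0]
```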

import jieba
from gensim import corpora

# Load the data
wordslist = ["我在玉龙雪山", "我喜欢玉龙雪山", "我还要去玉龙雪山"]
# Segment each sentence into words
textTest = [list(jieba.cut(words)) for words in wordslist]
# Build the dictionary
dictionary = corpora.Dictionary(textTest, prune_at=2000000)
# Print each token id, the token itself, and its document frequency
for key in dictionary.keys():
    print(key, dictionary.get(key), dictionary.dfs[key])
# 1 在 1
# 5 还要 1
# 0 我 3
# 2 玉龙雪山 3
# 3 喜欢 1
# 4 去 1

# Test
testwords = "我在玉龙雪山"
corpus1 = dictionary.doc2bow(jieba.cut(testwords))
print(corpus1)
# [(0, 1), (1, 1), (2, 1)]
# doc2vec is not part of gensim; it is the custom method defined further down
corpus2 = dictionary.doc2vec(jieba.cut(testwords))
print(corpus2)
# [0, 1, 2]
# i.e. 我, 在, 玉龙雪山
# Documents shorter than max_document_length are padded with 0;
# documents longer than max_document_length are truncated
corpus2 = dictionary.doc2vec(jieba.cut(testwords), max_document_length=10)
print(corpus2)
# [0, 1, 2, 0, 0, 0, 0, 0, 0, 0]

Open the file D:\Python27\Lib\site-packages\gensim\corpora\dictionary.py and add the following method to class Dictionary(utils.SaveLoad, Mapping):

def doc2vec(self, document, max_document_length=None):
    """Map each token of `document` to its id, keeping the word order."""
    # Raises KeyError for any token that is not in the dictionary
    vecs = [self.token2id[word] for word in document]
    if max_document_length is None:
        return vecs
    if max_document_length < 1:
        raise ValueError("max_document_length must be >= 1")
    # Pad with 0 up to max_document_length, or cut off the excess.
    # Note: 0 is also a valid token id, so padding and that token look the same.
    vecs.extend([0] * (max_document_length - len(vecs)))
    return vecs[:max_document_length]
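Editing a file under site-packages is lost whenever gensim is reinstalled or upgraded, and padding with 0 collides with the token whose id is 0 (我 in the output above). The following is a sketch of one possible alternative, not part of gensim: a stand-alone function (the name `doc_to_id_sequence` is hypothetical) that takes any token-to-id mapping, such as `dictionary.token2id`, shifts every id by one so that 0 is reserved for padding, and skips unknown tokens instead of raising KeyError. The `token2id` literal below simply copies the ids printed earlier.

```python
def doc_to_id_sequence(tokens, token2id, max_document_length=None):
    """Map tokens to ids in document order, with 0 reserved for padding.

    Ids taken from token2id are shifted by +1 so that the padding value 0
    never collides with a real token. Tokens missing from token2id are skipped.
    """
    ids = [token2id[t] + 1 for t in tokens if t in token2id]
    if max_document_length is None:
        return ids
    if max_document_length < 1:
        raise ValueError("max_document_length must be >= 1")
    # Truncate long documents, then pad short ones with 0
    ids = ids[:max_document_length]
    return ids + [0] * (max_document_length - len(ids))


# Same ids as in the dictionary output shown earlier
token2id = {"我": 0, "在": 1, "玉龙雪山": 2, "喜欢": 3, "去": 4, "还要": 5}

print(doc_to_id_sequence(["我", "在", "玉龙雪山"], token2id, max_document_length=5))
# [1, 2, 3, 0, 0]
```

With a real dictionary, the call would presumably look like `doc_to_id_sequence(jieba.cut(testwords), dictionary.token2id, 10)`; remember to subtract 1 when mapping the ids back to tokens.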

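Ready-made modules also implement this. TensorFlow, for example, supports this kind of model, although it does not seem to handle Chinese documents well: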

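import numpy as np
# Requires TensorFlow 1.x; tf.contrib was removed in TensorFlow 2.x
from tensorflow.contrib import learn

x_text = [
    'i love you',
    'me too'
]
# Fixed length of every output row
max_document_length = 4
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
# fit_transform builds the vocabulary and maps each document to word ids
x = np.array(list(vocab_processor.fit_transform(x_text)))
print(x)
# [[1 2 3 0]
#  [4 5 0 0]]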
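For readers without TensorFlow 1.x, the same idea can be sketched in plain Python. This is an illustration of the technique, not the VocabularyProcessor implementation: the function names `fit_vocabulary` and `transform` are hypothetical, ids are assigned from 1 in order of first appearance, and 0 is used both for padding and for unknown tokens. The tokenizer is a parameter, so Chinese text could be handled by passing a segmenter such as `jieba.lcut` instead of the default whitespace split.

```python
def fit_vocabulary(documents, tokenizer=str.split):
    """Assign ids starting at 1, in order of first appearance; 0 stays reserved."""
    vocab = {}
    for doc in documents:
        for token in tokenizer(doc):
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def transform(documents, vocab, max_document_length, tokenizer=str.split):
    """Map each document to a fixed-length list of ids (0 = padding or unknown)."""
    rows = []
    for doc in documents:
        ids = [vocab.get(token, 0) for token in tokenizer(doc)]
        ids = ids[:max_document_length]
        rows.append(ids + [0] * (max_document_length - len(ids)))
    return rows


x_text = ['i love you', 'me too']
vocab = fit_vocabulary(x_text)
print(transform(x_text, vocab, max_document_length=4))
# [[1, 2, 3, 0], [4, 5, 0, 0]]
```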
