A Simple Python Implementation of the Bag-of-Words Model

Introduction

As shown in the figure below, the bag-of-words model splits a text description into words to build a vocabulary, then encodes each description by whether (and how often) each vocabulary word appears in it, producing a word vector. In the figure, 'she loves pizza is … are the best' is the vocabulary, and the encoding below it is the word-vector encoding of one description.
[Figure: a sample vocabulary and the bag-of-words vector of one sentence]

Implementation with the Keras Tokenizer

from keras.preprocessing.text import Tokenizer

text = [
  'There was a man',
  'The man had a dog',
  'The dog and the man walked',
]

# fit the tokenizer on the corpus
model = Tokenizer()
model.fit_on_texts(text)

# print the learned vocabulary
print(model.word_index.keys())

# bag-of-words representation (term counts per document)
rep = model.texts_to_matrix(text, mode='count')
print(rep)

# integer sequences (one index per word)
vector = model.texts_to_sequences(text)
print(vector)

With mode='count', texts_to_matrix returns term-frequency vectors, which is not yet the vector representation we usually need. The recommended way to use Tokenizer is therefore: first call fit_on_texts to learn the vocabulary from the corpus; word_index is then a dict mapping each word to an integer index, and texts_to_sequences uses it to turn every string into a sequence of integers; pad the sequences to a common length; and finally vectorize them with Keras' built-in Embedding layer.
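The fit / sequence / pad steps above can be sketched in plain Python, without Keras, so each step is explicit. Index 0 is reserved for padding (Keras also starts word indices at 1); note that for simplicity this pads at the end, whereas Keras' pad_sequences pads at the front by default:

```python
# Plain-Python sketch of the recommended Tokenizer workflow:
# fit_on_texts -> texts_to_sequences -> padding.
texts = ['There was a man', 'The man had a dog',
         'The dog and the man walked']

# "fit_on_texts": build a word -> index mapping (indices start at 1)
word_index = {}
for t in texts:
    for w in t.lower().split():
        if w not in word_index:
            word_index[w] = len(word_index) + 1

# "texts_to_sequences": replace each word by its integer index
sequences = [[word_index[w] for w in t.lower().split()] for t in texts]

# "pad_sequences": pad with zeros to a common length (post-padding here)
maxlen = max(len(s) for s in sequences)
padded = [s + [0] * (maxlen - len(s)) for s in sequences]
print(padded)
```

The padded integer matrix is what you would then feed into an Embedding layer.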

Implementation with NLTK

#Importing the required modules
import numpy as np
from nltk.tokenize import word_tokenize
from collections import defaultdict 

#Sample text corpus
data = ['She loves pizza, pizza is delicious.','She is a good person.','good people are the best.']

#clean the corpus.
sentences = []
vocab = []
for sent in data:
    x = word_tokenize(sent)
    sentence = [w.lower() for w in x if w.isalpha() ]
    sentences.append(sentence)
    for word in sentence:
        if word not in vocab:
            vocab.append(word)

#number of words in the vocab
len_vector = len(vocab)

#Index dictionary to assign an index to each word in vocabulary
index_word = {word: i for i, word in enumerate(vocab)}

def bag_of_words(sent):
    count_dict = defaultdict(int)
    vec = np.zeros(len_vector)
    for item in sent:
        count_dict[item] += 1
    for key,item in count_dict.items():
        vec[index_word[key]] = item
    return vec 
#Testing our model
vector = bag_of_words(sentences[0])
print(vector)
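To vectorize the whole corpus rather than a single sentence, the per-sentence count vectors can be stacked into a document-term matrix. The sketch below is self-contained: it replaces nltk's word_tokenize with str.split plus punctuation stripping (a simplification, so it runs without the punkt download):

```python
import string
import numpy as np

data = ['She loves pizza, pizza is delicious.',
        'She is a good person.',
        'good people are the best.']

# tokenize, lowercase, and build the vocabulary in order of appearance
sentences, vocab = [], []
for sent in data:
    words = [w.strip(string.punctuation).lower() for w in sent.split()]
    words = [w for w in words if w.isalpha()]
    sentences.append(words)
    for w in words:
        if w not in vocab:
            vocab.append(w)

index_word = {w: i for i, w in enumerate(vocab)}

# one count vector per document, stacked into a (docs x vocab) matrix
matrix = np.zeros((len(sentences), len(vocab)))
for row, words in enumerate(sentences):
    for w in words:
        matrix[row, index_word[w]] += 1
print(matrix)
```

Each row of the matrix is the bag-of-words vector of one document, which is exactly what bag_of_words above produces for a single sentence.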

If this is the first time the NLTK packages are run, the code above may fail with an error like:

LookupError:
Resource 'tokenizers/punkt/english.pickle' not found.

This is fixed by downloading the tokenizer data once with nltk.download('punkt'); see also https://blog.csdn.net/quiet_girl/article/details/72604691 for a walkthrough.

References

[1] Bag of Words Model in Python [In 10 Lines of Code!] (source of the Keras approach)
[2] Creating Bag of Words Model from Scratch in python (source of the NLTK approach)
[3] 如何科学地使用keras的Tokenizer进行文本预处理 (on using Keras' Tokenizer properly for text preprocessing)
