Preface
As the figure below shows, bag-of-words splits a text description into individual words to build a dictionary, then encodes each description by how often each dictionary word occurs in it, producing a word-count vector. In the figure, 'she loves pizza is … are the best' is the dictionary, and the encoding underneath it is the word vector for one particular description.
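To make the idea concrete, here is a minimal dependency-free sketch of the encoding step described above; the corpus and dictionary mirror the figure's example, and the `encode` helper is an illustration added here, not code from the referenced posts.

```python
# Minimal illustration of the bag-of-words idea: build a dictionary
# from the corpus, then encode a sentence as per-word counts.
corpus = [
    "she loves pizza pizza is delicious",
    "good people are the best",
]

# Build the dictionary: every distinct word, in order of first appearance
vocab = []
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)

def encode(sentence):
    """Count how often each dictionary word occurs in the sentence."""
    words = sentence.split()
    return [words.count(w) for w in vocab]

print(vocab)
print(encode("she loves pizza pizza is delicious"))
# 'pizza' occurs twice, so its slot in the vector holds 2
```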
Implementation with the Keras Tokenizer
from keras.preprocessing.text import Tokenizer

text = [
    'There was a man',
    'The man had a dog',
    'The dog and the man walked',
]

# Fit the tokenizer on the corpus
model = Tokenizer()
model.fit_on_texts(text)

# Print the learned vocabulary
print(model.word_index.keys())

# Bag-of-words representation: one count vector per text
rep = model.texts_to_matrix(text, mode='count')
print(rep)

# Integer index sequences for each text
vector = model.texts_to_sequences(text)
print(vector)
With mode='count', texts_to_matrix returns word-frequency vectors, not the representation usually needed downstream. The sound way to use Tokenizer is therefore: first learn the vocabulary from the corpus with fit_on_texts, after which word_index is a dict mapping each word to an integer; use texts_to_sequences to convert each string into a sequence of those integers (this is what we actually need); then pad the sequences to a common length; and finally vectorize them with Keras's built-in Embedding layer.
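The three steps just described can be sketched without Keras at all; the following dependency-free version re-derives what fit_on_texts, texts_to_sequences, and padding do (most-frequent word gets index 1, index 0 reserved for padding, as Keras does), so the names here are reconstructions rather than Keras internals.

```python
from collections import Counter

texts = [
    'There was a man',
    'The man had a dog',
    'The dog and the man walked',
]

# Step 1, "fit_on_texts": build word_index, most frequent words first,
# with indices starting at 1 (0 is reserved for padding)
tokens = [t.lower().split() for t in texts]
freq = Counter(w for sent in tokens for w in sent)
word_index = {w: i + 1 for i, (w, _) in enumerate(freq.most_common())}

# Step 2, "texts_to_sequences": map each word to its integer index
sequences = [[word_index[w] for w in sent] for sent in tokens]

# Step 3, padding: right-pad every sequence with 0s to the longest length,
# so the rows could feed an Embedding layer
maxlen = max(len(s) for s in sequences)
padded = [s + [0] * (maxlen - len(s)) for s in sequences]
print(padded)
```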
Implementation with NLTK
#Importing the required modules
import numpy as np
from nltk.tokenize import word_tokenize
from collections import defaultdict
#Sample text corpus
data = ['She loves pizza, pizza is delicious.','She is a good person.','good people are the best.']
#Clean the corpus and build the vocabulary
sentences = []
vocab = []
for sent in data:
    x = word_tokenize(sent)
    sentence = [w.lower() for w in x if w.isalpha()]
    sentences.append(sentence)
    for word in sentence:
        if word not in vocab:
            vocab.append(word)

#Number of words in the vocab
len_vector = len(vocab)

#Index dictionary to assign an index to each word in the vocabulary
index_word = {}
i = 0
for word in vocab:
    index_word[word] = i
    i += 1

def bag_of_words(sent):
    count_dict = defaultdict(int)
    vec = np.zeros(len_vector)
    for item in sent:
        count_dict[item] += 1
    for key, item in count_dict.items():
        vec[index_word[key]] = item
    return vec

#Testing our model
vector = bag_of_words(sentences[0])
print(vector)
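The same counting logic can stack the per-sentence vectors into a document-term matrix for the whole corpus. The self-contained variant below is an addition for illustration: it substitutes str.split() plus a simple strip of trailing punctuation for word_tokenize, so it runs without downloading any NLTK data.

```python
import numpy as np

data = ['She loves pizza, pizza is delicious.',
        'She is a good person.',
        'good people are the best.']

# Same cleaning as above, with str.split() plus punctuation stripping
# standing in for word_tokenize
sentences, vocab = [], []
for sent in data:
    words = [w.lower().strip('.,') for w in sent.split()]
    words = [w for w in words if w.isalpha()]
    sentences.append(words)
    for w in words:
        if w not in vocab:
            vocab.append(w)

index_word = {w: i for i, w in enumerate(vocab)}

# One row per sentence, one column per vocabulary word
matrix = np.zeros((len(sentences), len(vocab)))
for row, sent in enumerate(sentences):
    for w in sent:
        matrix[row, index_word[w]] += 1

print(matrix)
```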
The first time the NLTK-based code is run, it may fail with an error like:
LookupError:
Resource 'tokenizers/punkt/english.pickle' not found.
…
The fix is to download the missing resource with nltk.download('punkt'); see https://blog.csdn.net/quiet_girl/article/details/72604691 for details.
References
[1] Bag of Words Model in Python [In 10 Lines of Code!] (source of the Keras approach)
[2] Creating Bag of Words Model from Scratch in Python (source of the NLTK approach)
[3] 如何科学地使用keras的Tokenizer进行文本预处理 (on using Keras's Tokenizer properly for text preprocessing)