Preface
As the figure below shows, bag-of-words splits a text description into individual words to build a dictionary, then encodes each description by how often each dictionary word occurs in it, producing a word-count vector. In the figure, 'she loves pizza is … are the best' is the dictionary, and the encoding underneath it is the word vector for one particular description.
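To make the idea concrete, here is a minimal dependency-free sketch of the encoding step described above; the corpus and dictionary mirror the figure's example, and the `encode` helper is an illustration added here, not code from the referenced posts.

```python
# Minimal illustration of the bag-of-words idea: build a dictionary
# from the corpus, then encode a sentence as per-word counts.
corpus = [
    "she loves pizza pizza is delicious",
    "good people are the best",
]

# Build the dictionary: every distinct word, in order of first appearance
vocab = []
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)

def encode(sentence):
    """Count how often each dictionary word occurs in the sentence."""
    words = sentence.split()
    return [words.count(w) for w in vocab]

print(vocab)
print(encode("she loves pizza pizza is delicious"))
# 'pizza' occurs twice, so its slot in the vector holds 2
```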
Implementation with the Keras Tokenizer
from keras.preprocessing.text import Tokenizer

text = [
    'There was a man',
    'The man had a dog',
    'The dog and the man walked',
]

# Fit the tokenizer on the corpus
model = Tokenizer()
model.fit_on_texts(text)

# Print the learned vocabulary
print(model.word_index.keys())

# Bag-of-words representation: one count vector per text
rep = model.texts_to_matrix(text, mode='count')
print(rep)

# Integer index sequences for each text
vector = model.texts_to_sequences(text)
print(vector)
With mode='count', texts_to_matrix returns word-frequency vectors, not the representation usually needed downstream. The sound way to use Tokenizer is therefore: first learn the vocabulary from the corpus with fit_on_texts, after which word_index is a dict mapping each word to an integer; use texts_to_sequences to convert each string into a sequence of those integers (this is what we actually need); then pad the sequences to a common length; and finally vectorize them with Keras's built-in Embedding layer.
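The three steps just described can be sketched without Keras at all; the following dependency-free version re-derives what fit_on_texts, texts_to_sequences, and padding do (most-frequent word gets index 1, index 0 reserved for padding, as Keras does), so the names here are reconstructions rather than Keras internals.

```python
from collections import Counter

texts = [
    'There was a man',
    'The man had a dog',
    'The dog and the man walked',
]

# Step 1, "fit_on_texts": build word_index, most frequent words first,
# with indices starting at 1 (0 is reserved for padding)
tokens = [t.lower().split() for t in texts]
freq = Counter(w for sent in tokens for w in sent)
word_index = {w: i + 1 for i, (w, _) in enumerate(freq.most_common())}

# Step 2, "texts_to_sequences": map each word to its integer index
sequences = [[word_index[w] for w in sent] for sent in tokens]

# Step 3, padding: right-pad every sequence with 0s to the longest length,
# so the rows could feed an Embedding layer
maxlen = max(len(s) for s in sequences)
padded = [s + [0] * (maxlen - len(s)) for s in sequences]
print(padded)
```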
Implementation with NLTK
#Importing the required modules
import numpy as np
from nltk.tokenize import word_tokenize
from collections import defaultdict
#Sample text corpus
data = ['She loves pizza, pizza is delicious.','She is a good person.','good people are the best.']
#Clean the corpus and build the vocabulary
sentences = []
vocab = []
for sent in data:
    x = word_tokenize(sent)
    sentence = [w.lower() for w in x if w.isalpha()]
    sentences.append(sentence)
    for word in sentence:
        if word not in vocab:
            vocab.append(word)

#Number of words in the vocab
len_vector = len(vocab)

#Index dictionary to assign an index to each word in the vocabulary
index_word = {}
i = 0
for word in vocab:
    index_word[word] = i
    i += 1

def bag_of_words(sent):
    count_dict = defaultdict(int)
    vec = np.zeros(len_vector)
    for item in sent:
        count_dict[item] += 1
    for key, item in count_dict.items():
        vec[index_word[key]] = item
    return vec

#Testing our model
vector = bag_of_words(sentences[0])
print(vector)
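The same counting logic can stack the per-sentence vectors into a document-term matrix for the whole corpus. The self-contained variant below is an addition for illustration: it substitutes str.split() plus a simple strip of trailing punctuation for word_tokenize, so it runs without downloading any NLTK data.

```python
import numpy as np

data = ['She loves pizza, pizza is delicious.',
        'She is a good person.',
        'good people are the best.']

# Same cleaning as above, with str.split() plus punctuation stripping
# standing in for word_tokenize
sentences, vocab = [], []
for sent in data:
    words = [w.lower().strip('.,') for w in sent.split()]
    words = [w for w in words if w.isalpha()]
    sentences.append(words)
    for w in words:
        if w not in vocab:
            vocab.append(w)

index_word = {w: i for i, w in enumerate(vocab)}

# One row per sentence, one column per vocabulary word
matrix = np.zeros((len(sentences), len(vocab)))
for row, sent in enumerate(sentences):
    for w in sent:
        matrix[row, index_word[w]] += 1

print(matrix)
```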
The first time the NLTK-based code is run, it may fail with an error like:
LookupError:
Resource 'tokenizers/punkt/english.pickle' not found.
…
The fix is to download the missing resource with nltk.download('punkt'); see https://blog.csdn.net/quiet_girl/article/details/72604691 for details.
References
[1] Bag of Words Model in Python [In 10 Lines of Code!] (source of the Keras approach)
[2] Creating Bag of Words Model from Scratch in Python (source of the NLTK approach)
[3] 如何科学地使用keras的Tokenizer进行文本预处理 (on using Keras's Tokenizer properly for text preprocessing)