Set-of-words model: the set of words in a document; each word is recorded at most once.
Bag-of-words model: every word is counted, recording how many times it occurs.
train_x below holds 6 documents in total, one sample (one document) per row. The goal is to turn train_x into a trainable matrix, i.e. to generate a word vector for each sample. This can be done by building either a set-of-words model or a bag-of-words model over train_x.
train_x = [["my", "dog", "has", "flea", "problems", "help", "please"],
["maybe", "not", "take", "him", "to", "dog", "park", "stupid"],
["my", "dalmation", "is", "so", "cute", "I", "love", "him"],
["stop", "posting", "stupid", "worthless", "garbage"],
["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"],
["quit", "buying", "worthless", "dog", "food", "stupid"]]
1. Set-of-words model
Algorithm steps:
1) Merge all words into one deduplicated set; for this corpus the resulting set has length wordSetLen = 31.
2) With sampleCnt = 6 documents/samples, build a sampleCnt * wordSetLen = 6 * 31 matrix; once filled with valid values, this is the final trainable matrix m.
3) Traverse m and fill in the 0/1 values: 0 means the word of the current column does not appear in the sample/document of the current row, 1 means it does.
4) The result is a 6 * 31 trainable matrix.
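The steps above can be sketched on a tiny two-document corpus (a minimal sketch; the vocabulary here is sorted only to make the output deterministic):

```python
# Minimal sketch of the set-of-words steps on a tiny corpus.
docs = [["my", "dog", "has", "flea"],
        ["stop", "posting", "stupid"]]

# Step 1: merge all words into one deduplicated vocabulary.
vocab = sorted(set(w for doc in docs for w in doc))

# Steps 2-4: one 0/1 row per document, one column per vocabulary word.
matrix = [[1 if w in doc else 0 for w in vocab] for doc in docs]

print(vocab)      # ['dog', 'flea', 'has', 'my', 'posting', 'stop', 'stupid']
print(matrix[0])  # [1, 1, 1, 1, 0, 0, 0]
```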
2. Bag-of-words model
In the bag-of-words model, the training matrix contains not only 0 and 1 but also larger numbers, which record how many times each word occurs in the current sample.
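For example, the fifth document in train_x contains "him" twice, so its bag-of-words vector holds a 2 where the set-of-words vector holds a 1. A minimal sketch over that single document's own vocabulary:

```python
# Set-of-words vs. bag-of-words on the fifth document of train_x.
sample = ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"]
vocab = sorted(set(sample))

set_vec = [1 if w in sample else 0 for w in vocab]  # membership only
bag_vec = [sample.count(w) for w in vocab]          # occurrence counts

print(vocab)    # ['ate', 'him', 'how', 'licks', 'my', 'steak', 'stop', 'to']
print(set_vec)  # [1, 1, 1, 1, 1, 1, 1, 1]
print(bag_vec)  # [1, 2, 1, 1, 1, 1, 1, 1]
```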
# -*- coding: utf-8 -*-
import numpy as np


def load_data():
    """1. Load train_x and its labels."""
    train_x = [["my", "dog", "has", "flea", "problems", "help", "please"],
               ["maybe", "not", "take", "him", "to", "dog", "park", "stupid"],
               ["my", "dalmation", "is", "so", "cute", "I", "love", "him"],
               ["stop", "posting", "stupid", "worthless", "garbage"],
               ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"],
               ["quit", "buying", "worthless", "dog", "food", "stupid"]]
    label = [0, 1, 0, 1, 0, 1]
    return train_x, label


def setOfWord(train_x):
    """2. Collect every distinct word into one list.

    train_x: the document collection; each sample is one document
    wordSet: a list built from the set of all words
    """
    wordList = []
    for sample in train_x:
        wordList.extend(sample)
    wordSet = list(set(wordList))
    return wordSet


def create_wordVec(sample, wordSet, mode="wordSet"):
    """3. Turn one sample into a word vector."""
    length = len(wordSet)
    wordVec = [0] * length
    if mode == "wordSet":
        # Set-of-words: 1 if the word occurs in the sample, else 0.
        for i in range(length):
            if wordSet[i] in sample:
                wordVec[i] = 1
    elif mode == "wordBag":
        # Bag-of-words: count how often each word occurs in the sample.
        for i in range(length):
            for word in sample:
                if word == wordSet[i]:
                    wordVec[i] += 1
    else:
        raise ValueError("The mode must be wordSet or wordBag.")
    return wordVec


def main(mode="wordSet"):
    train_x, label = load_data()
    wordSet = setOfWord(train_x)
    train_matrix = []
    for sample in train_x:
        # Pass the requested mode through to each word vector.
        train_matrix.append(create_wordVec(sample, wordSet, mode))
    return train_matrix


if __name__ == "__main__":
    train_matrix = main("wordSet")
    print(np.array(train_matrix).shape)  # (6, 31)
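As a quick sanity check on the sizes claimed above (6 documents, 31 distinct words), the vocabulary and set-of-words matrix can be rebuilt inline; this sketch repeats the train_x data so it runs on its own:

```python
import numpy as np

train_x = [["my", "dog", "has", "flea", "problems", "help", "please"],
           ["maybe", "not", "take", "him", "to", "dog", "park", "stupid"],
           ["my", "dalmation", "is", "so", "cute", "I", "love", "him"],
           ["stop", "posting", "stupid", "worthless", "garbage"],
           ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"],
           ["quit", "buying", "worthless", "dog", "food", "stupid"]]

# Deduplicated vocabulary and the set-of-words training matrix.
wordSet = sorted({w for doc in train_x for w in doc})
train_matrix = np.array([[1 if w in doc else 0 for w in wordSet]
                         for doc in train_x])

print(len(wordSet))        # 31
print(train_matrix.shape)  # (6, 31)
```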