Data Mining - Set-of-Words Model & Bag-of-Words Model

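Set-of-words model: a set built from the words; each word is recorded only once, no matter how often it occurs.

Bag-of-words model: every word is counted, so the number of times each word occurs is recorded as well.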
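As a small illustration of the difference, the sketch below applies Python's built-in set and collections.Counter to a single tokenized document. The variable names (doc, word_set, word_bag) are only illustrative and are not used elsewhere in this post.

```python
from collections import Counter

# One tokenized document; "him" occurs twice.
doc = ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"]

word_set = set(doc)       # set-of-words view: distinct words only
word_bag = Counter(doc)   # bag-of-words view: word -> occurrence count

print(len(doc), len(word_set))   # 9 tokens, 8 distinct words
print(word_bag["him"])           # 2
```

The set discards the second "him", while the counter keeps it as a count of 2.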


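train_x contains 6 documents; each row is one sample, i.e. one document. The goal is to convert train_x into a trainable matrix by generating a word vector for each sample. This can be done by building either a set-of-words model or a bag-of-words model on train_x.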

train_x = [["my", "dog", "has", "flea", "problems", "help", "please"],
               ["maybe", "not", "take", "him", "to", "dog", "park", "stupid"],
               ["my", "dalmation", "is", "so", "cute", "I", "love", "him"],
               ["stop", "posting", "stupid", "worthless", "garbage"],
               ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"],
               ["quit", "buying", "worthless", "dog", "food", "stupid"]]


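1. Set-of-words model

Algorithm steps:

1) Merge all words into one set. For this corpus the resulting set has length wordSetLen = 31.

2) With sampleCnt = 6 documents/samples, create a matrix of size sampleCnt * wordSetLen = 6 * 31. Once it is filled with valid values, this matrix is the final trainable matrix m.

3) Traverse the matrix m and fill in 0 or 1. A 0 means the word of the current column does not occur in the sample/document of the current row; a 1 means it does.

4) The result is a 6 * 31 trainable matrix.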
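The four steps above can be sketched in a few lines of plain Python. This is only a minimal sketch: the names vocab, col_of, word_set_len and sample_cnt are illustrative, and sorting the vocabulary is an added assumption that keeps the column order reproducible.

```python
train_x = [["my", "dog", "has", "flea", "problems", "help", "please"],
           ["maybe", "not", "take", "him", "to", "dog", "park", "stupid"],
           ["my", "dalmation", "is", "so", "cute", "I", "love", "him"],
           ["stop", "posting", "stupid", "worthless", "garbage"],
           ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"],
           ["quit", "buying", "worthless", "dog", "food", "stupid"]]

# Step 1: collect every distinct word; sorting fixes the column order.
vocab = sorted({word for doc in train_x for word in doc})
word_set_len = len(vocab)   # 31 for this corpus
sample_cnt = len(train_x)   # 6

# Step 2: allocate a sample_cnt x word_set_len matrix of zeros.
m = [[0] * word_set_len for _ in range(sample_cnt)]

# Step 3: set m[row][col] = 1 when the column's word occurs in the row's document.
col_of = {word: col for col, word in enumerate(vocab)}
for row, doc in enumerate(train_x):
    for word in doc:
        m[row][col_of[word]] = 1

# Step 4: m is now the 6 x 31 trainable matrix.
print(len(m), len(m[0]))   # 6 31
print(sum(m[4]))           # 8 distinct words in the fifth document
```

The dictionary col_of maps each word to its column, so each document is scanned once instead of testing every vocabulary word against every document.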


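2. Bag-of-words model

In the bag-of-words model, the training matrix contains not only 0 and 1 but also other numbers; each entry is the number of times the word occurs in the current sample.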
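To make the relationship between the two models concrete, here is a hedged sketch on two of the documents. It builds count vectors with collections.Counter and then clips every count to at most 1, which yields the set-of-words vectors again. The helper bag_vector and the other names are hypothetical and are not part of the code later in this post.

```python
from collections import Counter

docs = [["stop", "posting", "stupid", "worthless", "garbage"],
        ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"]]

# Shared vocabulary for both documents, in a fixed (sorted) order.
vocab = sorted({word for doc in docs for word in doc})


def bag_vector(doc, vocab):
    """Return how often each vocabulary word occurs in doc."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]   # Counter gives 0 for absent words


bag_matrix = [bag_vector(doc, vocab) for doc in docs]

# Clipping every count to at most 1 recovers the set-of-words matrix.
set_matrix = [[min(count, 1) for count in row] for row in bag_matrix]

him = vocab.index("him")
print(bag_matrix[1][him], set_matrix[1][him])   # 2 1
```

Each row of bag_matrix sums to the number of tokens in that document, whereas each row of set_matrix sums to the number of distinct words.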

# -*- coding: utf-8 -*-
import numpy as np

def load_data():
    """ 1. Load train_x and its labels (train_y) """
    train_x = [["my", "dog", "has", "flea", "problems", "help", "please"],
               ["maybe", "not", "take", "him", "to", "dog", "park", "stupid"],
               ["my", "dalmation", "is", "so", "cute", "I", "love", "him"],
               ["stop", "posting", "stupid", "worthless", "garbage"],
               ["him", "licks", "ate", "my", "steak", "how", "to", "stop", "him"],
               ["quit", "buying", "worthless", "dog", "food", "stupid"]]
    label = [0, 1, 0, 1, 0, 1]
    return train_x, label


def setOfWord(train_x):
    """ 2. Collect all words, without duplicates, into one list
    train_x: the document collection; each sample is one document
    wordSet: list built from the set of all words
    """
    wordList = []
    for sample in train_x:
        wordList.extend(sample)
    # Sort so that the column order is the same on every run
    wordSet = sorted(set(wordList))
    return wordSet


def create_wordVec(sample, wordSet, mode="wordSet"):
    """ 3. Convert one sample into a word vector """
    length = len(wordSet)
    wordVec = [0] * length

    if mode == "wordSet":
        # 1 if the word occurs in the sample, otherwise 0
        for i in range(length):
            if wordSet[i] in sample:
                wordVec[i] = 1
    elif mode == "wordBag":
        # Number of times the word occurs in the sample
        for i in range(length):
            wordVec[i] = sample.count(wordSet[i])
    else:
        raise ValueError("The mode must be wordSet or wordBag.")
    return wordVec


def main(mode="wordSet"):
    train_x, label = load_data()
    wordSet = setOfWord(train_x)

    # Build one word vector per sample, using the requested mode
    train_matrix = []
    for sample in train_x:
        train_matrix.append(create_wordVec(sample, wordSet, mode))
    return train_matrix
        

if __name__ == "__main__":
    train_x, label = load_data()
    wordSet = setOfWord(train_x)
    train_matrix = main("wordSet")
