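The Bag of Words Model and Its Python Implementation

The bag-of-words model is a way of representing text as numeric features.

Concretely, each word in the vocabulary is compared against the text to be represented: a word that does not occur is recorded as 0, and a word that does occur is recorded as the number of times it appears.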

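For example:
Sentence 1: 我/爱/知乎,知乎/真好。 (I / love / Zhihu, Zhihu / is great.)
Sentence 2: 我/爱/微博,微博/真好。 (I / love / Weibo, Weibo / is great.)
This gives vocabulary = ['我','爱','知乎','真好','微博']

Since len(vocabulary) = 5, sentence 1 and sentence 2 are each represented by a 5-dimensional vector.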

Sentence 1 is represented as [1,1,2,1,0]  # sentence 1 does not contain '微博'

Sentence 2 is represented as [1,1,0,1,2]  # sentence 2 does not contain '知乎'
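
The worked example above can be reproduced in a few lines of plain Python. This is only a minimal sketch: it assumes the sentences have already been segmented into words, and the helper names `build_vocab` and `to_vector` are made up for illustration.

```python
from collections import Counter

# The two example sentences, already segmented into words (assumed given).
sentence_1 = ["我", "爱", "知乎", "知乎", "真好"]
sentence_2 = ["我", "爱", "微博", "微博", "真好"]

def build_vocab(sentences):
    """Collect each distinct word once, in order of first appearance."""
    vocab = []
    seen = set()
    for words in sentences:
        for w in words:
            if w not in seen:
                seen.add(w)
                vocab.append(w)
    return vocab

def to_vector(words, vocab):
    """Count how often each vocabulary word occurs in one sentence."""
    counts = Counter(words)  # missing words count as 0
    return [counts[w] for w in vocab]

vocab = build_vocab([sentence_1, sentence_2])
vec_1 = to_vector(sentence_1, vocab)
vec_2 = to_vector(sentence_2, vocab)
print(vocab)  # ['我', '爱', '知乎', '真好', '微博']
print(vec_1)  # [1, 1, 2, 1, 0]
print(vec_2)  # [1, 1, 0, 1, 2]
```

Keeping the vocabulary in order of first appearance is a choice made here so that the vector positions line up with the vocabulary listed in the example; any fixed order works as long as every sentence uses the same one.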

Python implementation

import numpy as np
from nltk.corpus import stopwords
# Requires the NLTK stop-word corpus; run nltk.download('stopwords') once beforehand.

# Step 1: Tokenize a sentence
def word_extraction(sentence):
    # Split the sentence on whitespace and drop English stop words.
    words = sentence.split()
    stop_words = set(stopwords.words('english'))
    # Note: the stop-word check runs before lower-casing, so capitalized
    # stop words such as "The" and "I" pass the filter and end up in the vocabulary.
    cleaned_text = [w.lower() for w in words if w not in stop_words]
    return cleaned_text

# Step 2: Apply tokenization to all sentences
def tokenize(sentences):
    # Run step 1 on every sentence and build the sorted vocabulary.
    words = []
    for sentence in sentences:
        w = word_extraction(sentence)
        words.extend(w)
    words = sorted(list(set(words)))
    return words

# Step 3: Build vocabulary and generate vectors
def generate_bow(allsentences):
    vocab = tokenize(allsentences)
    print("Word List for Document \n{0} \n".format(vocab))
    for sentence in allsentences:
        words = word_extraction(sentence)
        # One counter per vocabulary word, all starting at zero.
        bag_vector = np.zeros(len(vocab))
        for w in words:
            for i, word in enumerate(vocab):
                if word == w:
                    bag_vector[i] += 1

        print("{0}\n{1}\n".format(sentence, bag_vector))

allsentences = ["Joe waited for the train", "The train was late", "Mary and Samantha took the bus",
"I looked for Mary and Samantha at the bus station",
"Mary and Samantha arrived at the bus station early but waited until noon for the bus"]

# generate_bow prints its results itself and returns None, so call it directly.
generate_bow(allsentences)

Output:
Word List for Document 
['arrived', 'bus', 'early', 'i', 'joe', 'late', 'looked', 'mary', 'noon', 'samantha', 'station', 'the', 'took', 'train', 'waited'] 

Joe waited for the train
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]

The train was late
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0.]

Mary and Samantha took the bus
[0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0.]

I looked for Mary and Samantha at the bus station
[0. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0.]

Mary and Samantha arrived at the bus station early but waited until noon for the bus
[1. 2. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1.]
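
The vocabulary in this output still contains 'i' and 'the', because the listing checks for stop words before lower-casing, so "I" and "The" are not recognized as stop words. Below is a minimal sketch of a variant that lower-cases first and looks up each word's position through a dictionary rather than scanning the vocabulary. `STOP_WORDS` is a small hand-picked set standing in for NLTK's English list, and the function names `extract_words` and `bow_vectors` are hypothetical; none of this is from the original post.

```python
from collections import Counter

# Assumption: a small hand-picked stop-word set standing in for NLTK's
# English list, so the sketch runs without downloading any corpus.
STOP_WORDS = {"i", "the", "for", "was", "and", "at", "but", "until"}

def extract_words(sentence):
    """Lower-case each token first, then drop stop words."""
    tokens = [w.lower() for w in sentence.split()]
    return [w for w in tokens if w not in STOP_WORDS]

def bow_vectors(sentences):
    """Return the sorted vocabulary and one count vector per sentence."""
    tokenized = [extract_words(s) for s in sentences]
    vocab = sorted({w for words in tokenized for w in words})
    index = {w: i for i, w in enumerate(vocab)}  # word -> vector position
    vectors = []
    for words in tokenized:
        vec = [0] * len(vocab)
        for w, n in Counter(words).items():
            vec[index[w]] = n
        vectors.append(vec)
    return vocab, vectors

sentences = ["Joe waited for the train", "The train was late"]
vocab, vectors = bow_vectors(sentences)
print(vocab)    # ['joe', 'late', 'train', 'waited']
print(vectors)  # [[1, 0, 1, 1], [0, 1, 1, 0]]
```

Returning the vectors instead of printing them inside the function makes the result reusable, for example as input to a classifier.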