The Bag of Words Model and Its Python Implementation

Latest recommended article published on 2023-05-07 13:27:52

不可能打工

Tags: python, natural language processing, nlp, machine learning, deep learning

Article link: https://blog.csdn.net/ewen_lee/article/details/108142900

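The bag-of-words model is a way of representing text as features.

Concretely, each word in the vocabulary is checked against the text to be represented: if the word does not appear, its entry is 0; if it does, its entry is the number of times it occurs.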

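For example:
Sentence 1: 我/爱/知乎，知乎/真好。 (I / love / Zhihu, Zhihu / is great.)
Sentence 2: 我/爱/微博，微博/真好。 (I / love / Weibo, Weibo / is great.)
This gives the vocabulary ['我', '爱', '知乎', '真好', '微博'].

Since len(vocabulary) = 5, sentence 1 and sentence 2 are each represented by a 5-dimensional vector.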

Sentence 1 is represented as [1,1,2,1,0]  # sentence 1 does not contain '微博'

Sentence 2 is represented as [1,1,0,1,2]  # sentence 2 does not contain '知乎'
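
The worked example can be reproduced in a few lines. This is a minimal sketch, assuming the two sentences have already been segmented by hand into token lists; the helper names `build_vocab` and `bow_vector` are illustrative and do not come from any library. The vocabulary is kept in order of first appearance so that it matches the one listed in the example.

```python
from collections import Counter

# Pre-tokenized sentences from the example (segmentation is assumed, done by hand).
sentence_1 = ["我", "爱", "知乎", "知乎", "真好"]
sentence_2 = ["我", "爱", "微博", "微博", "真好"]

def build_vocab(sentences):
    """Collect distinct tokens in order of first appearance."""
    vocab = []
    seen = set()
    for tokens in sentences:
        for tok in tokens:
            if tok not in seen:
                seen.add(tok)
                vocab.append(tok)
    return vocab

def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)  # missing words count as 0
    return [counts[word] for word in vocab]

vocab = build_vocab([sentence_1, sentence_2])
vec_1 = bow_vector(sentence_1, vocab)
vec_2 = bow_vector(sentence_2, vocab)

print(vocab)  # ['我', '爱', '知乎', '真好', '微博']
print(vec_1)  # [1, 1, 2, 1, 0]
print(vec_2)  # [1, 1, 0, 1, 2]
```

Word order inside each sentence is discarded; only the counts survive, which is why the representation is called a "bag".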

Python implementation

import numpy as np
import nltk
from nltk.corpus import stopwords

# Fetch the NLTK stop-word list if it is not installed yet.
nltk.download('stopwords', quiet=True)
STOP_WORDS = set(stopwords.words('english'))

# Step 1: Tokenize a sentence
def word_extraction(sentence):
    # Extract the words of a sentence: split on whitespace and drop stop words.
    # Lowercase first: the NLTK list is all lowercase, so "The" or "I"
    # would otherwise pass through the filter.
    words = sentence.lower().split()
    cleaned_text = [w for w in words if w not in STOP_WORDS]
    return cleaned_text

# Step 2: Apply tokenization to all sentences
def tokenize(sentences):
    # Run step 1 on every sentence and build the sorted vocabulary.
    words = []
    for sentence in sentences:
        w = word_extraction(sentence)
        words.extend(w)
    words = sorted(set(words))
    return words

# Step 3: Build vocabulary and generate vectors
def generate_bow(allsentences):
    vocab = tokenize(allsentences)
    print("Word List for Document \n{0} \n".format(vocab))
    for sentence in allsentences:
        words = word_extraction(sentence)
        bag_vector = np.zeros(len(vocab))
        # Increment the entry of every vocabulary word found in the sentence.
        for w in words:
            for i, word in enumerate(vocab):
                if word == w:
                    bag_vector[i] += 1

        print("{0}\n{1}\n".format(sentence, bag_vector))

allsentences = ["Joe waited for the train", "The train was late", "Mary and Samantha took the bus",
"I looked for Mary and Samantha at the bus station",
"Mary and Samantha arrived at the bus station early but waited until noon for the bus"]

# generate_bow prints its results and returns nothing, so call it directly.
generate_bow(allsentences)

Output:
Word List for Document 
['arrived', 'bus', 'early', 'joe', 'late', 'looked', 'mary', 'noon', 'samantha', 'station', 'took', 'train', 'waited'] 

Joe waited for the train
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1.]

The train was late
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0.]

Mary and Samantha took the bus
[0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0.]

I looked for Mary and Samantha at the bus station
[0. 1. 0. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0.]

Mary and Samantha arrived at the bus station early but waited until noon for the bus
[1. 2. 1. 0. 0. 0. 1. 1. 1. 1. 0. 0. 1.]
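
The vector-building step can also be written with a word-to-index dictionary, so each token is located in one lookup instead of a scan over the whole vocabulary. The following is a sketch under stated assumptions: `SMALL_STOP_WORDS` is a hypothetical hand-picked set standing in for a real stop-word list (it is not the NLTK list), and `extract_words` and `bow_matrix` are illustrative names.

```python
# Hypothetical stand-in for a real stop-word list; NOT the NLTK list.
SMALL_STOP_WORDS = {"the", "for", "was", "and", "at", "i", "but", "until"}

def extract_words(sentence):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in SMALL_STOP_WORDS]

def bow_matrix(sentences):
    """Return the sorted vocabulary and one count vector per sentence."""
    tokenized = [extract_words(s) for s in sentences]
    vocab = sorted({w for words in tokenized for w in words})
    index = {word: i for i, word in enumerate(vocab)}  # word -> column
    matrix = []
    for words in tokenized:
        row = [0] * len(vocab)
        for w in words:
            row[index[w]] += 1
        matrix.append(row)
    return vocab, matrix

vocab, matrix = bow_matrix(["Joe waited for the train", "The train was late"])
print(vocab)   # ['joe', 'late', 'train', 'waited']
print(matrix)  # [[1, 0, 1, 1], [0, 1, 1, 0]]
```

Returning the vocabulary and vectors, rather than printing them, lets the result be passed on to later processing.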
