Bag of Words Model (Concept + Code Implementation)

hUaleeF

Published 2022-10-25 16:47:11

Article link: https://blog.csdn.net/hua_453/article/details/127516178

Bag of words

The bag-of-words (BOW) model is a representation that turns arbitrary text into fixed-length vectors by counting how many times each word appears. This process is often referred to as vectorization.

Let’s take an example to understand this concept in depth.

- “It was the best of times”
- “It was the worst of times”
- “It was the age of wisdom”
- “It was the age of foolishness”

We treat each sentence as a separate document and list every unique word across all four documents, ignoring case. This gives a vocabulary of 10 words:

- ‘it’, ‘was’, ‘the’, ‘best’, ‘of’, ‘times’, ‘worst’, ‘age’, ‘wisdom’, ‘foolishness’

The next step is to create the vectors, which turn the text into numbers that a machine learning algorithm can use. We take the first document, “It was the best of times”, and count how often each of the 10 unique words appears in it:

- “it” = 1
- “was” = 1
- “the” = 1
- “best” = 1
- “of” = 1
- “times” = 1
- “worst” = 0
- “age” = 0
- “wisdom” = 0
- “foolishness” = 0

Applying the same counting to every document gives:

- “It was the best of times” = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
- “It was the worst of times” = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
- “It was the age of wisdom” = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
- “It was the age of foolishness” = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
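The manual counting above can be sketched in plain Python. This is a minimal illustration rather than a production tokenizer: the helper names `tokenize` and `bag_of_words` are made up for this example, and it assumes that lowercasing and splitting on whitespace is enough to separate the words.

```python
from collections import Counter

docs = [
    "It was the best of times",
    "It was the worst of times",
    "It was the age of wisdom",
    "It was the age of foolishness",
]


def tokenize(text):
    # Simplifying assumption: lowercase, then split on whitespace only.
    return text.lower().split()


# Build the vocabulary in order of first appearance across all documents.
vocabulary = []
for doc in docs:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)


def bag_of_words(text, vocabulary):
    # Count every word in the text; words absent from the text count as 0.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(vocabulary)
for doc, vector in zip(docs, vectors):
    print(f"{doc!r} = {vector}")
```

Every vector has the same length as the vocabulary, regardless of how long the document is, which is what makes the representation fixed-length.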

In this approach, each word or token is called a “gram”. The vocabulary above is built from single words (unigrams); building a vocabulary from pairs of adjacent words, such as “it was” or “was the”, gives a bigram model.
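To make the bigram idea concrete, here is a small sketch that builds a two-word vocabulary from the same four sentences. The function name `ngrams` is hypothetical, and the whitespace tokenization is a simplifying assumption.

```python
def ngrams(text, n):
    # Slide a window of n tokens over the lowercased, whitespace-split text.
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


docs = [
    "It was the best of times",
    "It was the worst of times",
    "It was the age of wisdom",
    "It was the age of foolishness",
]

# Collect unique bigrams in order of first appearance.
bigram_vocabulary = []
for doc in docs:
    for gram in ngrams(doc, 2):
        if gram not in bigram_vocabulary:
            bigram_vocabulary.append(gram)

print(ngrams(docs[0], 2))
print(len(bigram_vocabulary), bigram_vocabulary)
```

Calling `ngrams(text, 1)` returns the single words again, so the unigram model is the special case n = 1.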

In machine learning, the process of converting text into numbers is called vectorization. Two common ways to convert text into vectors are:

- Counting the number of times each word appears in a document.
- Calculating the relative frequency of each word: its count divided by the total number of words in the document.
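The two options differ only in a final normalization step, as the sketch below tries to show. The function names `count_vector` and `frequency_vector` are illustrative, and the vocabulary is hard-coded from the example sentences.

```python
from collections import Counter

vocabulary = ["it", "was", "the", "best", "of", "times",
              "worst", "age", "wisdom", "foolishness"]


def count_vector(text, vocabulary):
    # Option 1: raw number of occurrences of each vocabulary word.
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def frequency_vector(text, vocabulary):
    # Option 2: occurrences divided by the total number of words in the text.
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return [0.0 for _ in vocabulary]
    return [counts[word] / total for word in vocabulary]


doc = "It was the best of times"
print(count_vector(doc, vocabulary))
print(frequency_vector(doc, vocabulary))
```

Because every word of this document is in the vocabulary, the entries of its frequency vector sum to 1. Normalizing like this keeps long and short documents on a comparable scale.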

Implementing BOW in Python

We use Keras’s Tokenizer class:

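# Note: Tokenizer ships with Keras 2.x / tf.keras and is marked as deprecated
# in recent TensorFlow releases.
from keras.preprocessing.text import Tokenizer

docs = [
    'It was the best of times',
    'It was the worst of times',
    'It was the age of wisdom',
    'It was the age of foolishness'
]

# Step 1: Determine the vocabulary
tokenizer = Tokenizer()  # lowercases text and strips punctuation by default
tokenizer.fit_on_texts(docs)
print(f'Vocabulary: {list(tokenizer.word_index.keys())}')

# Step 2: Count how often each vocabulary word occurs in each document
vectors = tokenizer.texts_to_matrix(docs, mode='count')
print(vectors)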

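Running that code gives the output below. The word order differs from the manual example because Keras sorts the vocabulary by descending word frequency, and each row has 11 columns rather than 10 because index 0 is reserved and never assigned to a word: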

Vocabulary: ['it', 'was', 'the', 'of', 'times', 'age', 'best', 'worst', 'wisdom', 'foolishness']
[[0. 1. 1. 1. 1. 1. 0. 1. 0. 0. 0.]
 [0. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0.]
 [0. 1. 1. 1. 1. 0. 1. 0. 0. 1. 0.]
 [0. 1. 1. 1. 1. 0. 1. 0. 0. 0. 1.]]
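The column layout of this matrix can be reproduced without Keras. The sketch below only mimics the behavior visible in the output (most frequent words first, ties kept in order of first appearance, word indices starting at 1); it is an illustration, not Keras's actual implementation.

```python
from collections import Counter

docs = [
    "It was the best of times",
    "It was the worst of times",
    "It was the age of wisdom",
    "It was the age of foolishness",
]

# Count every word over the whole corpus (a Counter keeps insertion order).
word_counts = Counter()
for doc in docs:
    word_counts.update(doc.lower().split())

# Most frequent first; sorted() is stable, so ties keep first-appearance order.
ordered = sorted(word_counts, key=lambda word: -word_counts[word])

# Word indices start at 1, which leaves column 0 unused.
word_index = {word: i for i, word in enumerate(ordered, start=1)}

matrix = []
for doc in docs:
    row = [0] * (len(word_index) + 1)
    for word in doc.lower().split():
        row[word_index[word]] += 1
    matrix.append(row)

print(ordered)
for row in matrix:
    print(row)
```

To read a row, look up the column number in `word_index`: for example, column 7 corresponds to “best”, which is why only the first row has a 1 there.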