13. Working with text data (1): using one-hot encoding


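Text is one of the most common forms of sequence data. It can be understood as a sequence of characters or a sequence of words, but it is most often processed at the word level.
Like all other neural networks, deep learning models do not take raw text as input; they only work with numeric tensors.
Vectorizing text is the process of transforming text into numeric tensors. This can be done in several ways: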

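Segment the text into words, and transform each word into a vector.
Segment the text into characters, and transform each character into a vector.
Extract n-grams of words or characters, and transform each n-gram into a vector. An n-gram is a group of several consecutive words or characters (n-grams may overlap).

The units into which text is broken down (words, characters, or n-grams) are called tokens, and breaking text into such tokens is called tokenization.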
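
As a minimal sketch of tokenization, the snippet below splits one sentence into word-level and character-level tokens using only the standard library. The punctuation handling (stripping `string.punctuation` from each word) is an assumption made for this illustration, not a rule taken from the text.

```python
import string

text = "The cat sat on the mat."

# Word-level tokens: split on whitespace, then strip surrounding punctuation.
word_tokens = [w.strip(string.punctuation) for w in text.split()]
word_tokens = [w for w in word_tokens if w]  # drop tokens that were only punctuation

# Character-level tokens: every character (including spaces) is its own token.
char_tokens = list(text)

print(word_tokens)
print(char_tokens[:7])
```

Either token list can then be mapped to vectors; the rest of this post does that with one-hot encoding.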

Understanding n-grams and bag-of-words

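An n-gram is a group of N (or fewer) consecutive words extracted from a sentence. The same concept applies if "words" is replaced with "characters".
Here is a simple example. Consider the sentence "The cat sat on the mat." It can be decomposed into the following set of 2-grams: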

{“The”, “The cat”, “cat”, “cat sat”, “sat”, “sat on”, “on”, “on the”, “the”, “the mat”, “mat”}

It can also be decomposed into the following set of 3-grams:

{“The”, “The cat”, “cat”, “cat sat”, “The cat sat”, “sat”, “sat on”, “on”, “cat sat on”, “on the”, “the”, “sat on the”, “the mat”, “mat”, “on the mat”}

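Such sets are called a bag-of-2-grams and a bag-of-3-grams, respectively. The term "bag" here refers to the fact that we are dealing with a set of tokens rather than a list or a sequence: the tokens have no specific order. This family of tokenization methods is called bag-of-words.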
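
The two sets above can be reproduced with a few lines of Python. This is a small sketch: the helper name `bag_of_ngrams` is made up for this example, and the sentence is assumed to be already split into words with the final period removed.

```python
def bag_of_ngrams(tokens, n):
    """Return the set of all k-grams (1 <= k <= n) of consecutive tokens."""
    bag = set()
    for k in range(1, n + 1):
        for start in range(len(tokens) - k + 1):
            bag.add(" ".join(tokens[start:start + k]))
    return bag

tokens = ["The", "cat", "sat", "on", "the", "mat"]

bag2 = bag_of_ngrams(tokens, 2)  # bag-of-2-grams
bag3 = bag_of_ngrams(tokens, 3)  # bag-of-3-grams

print(len(bag2), len(bag3))  # 11 15
print(sorted(bag2))

# A bag has no order: reversing the words gives the same bag of single words.
print(bag_of_ngrams(tokens, 1) == bag_of_ngrams(tokens[::-1], 1))  # True
```

Because the result is a Python `set`, any information about where each n-gram occurred in the sentence is discarded, which is exactly what "bag" means here.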

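Because bag-of-words is a tokenization method that does not preserve order, it tends to be used in shallow language-processing models rather than in deep learning models. Extracting n-grams is a form of feature engineering, and deep learning does away with this kind of rigid, brittle approach, replacing it with hierarchical feature learning.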

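When using lightweight, shallow text-processing models (such as logistic regression and random forests), n-grams are a powerful and indispensable feature-engineering tool.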
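
To make the feature-engineering idea concrete, here is a rough sketch of how n-grams could be turned into fixed-length count vectors, the kind of input a shallow model expects. The two tiny documents, the helper names `ngrams` and `count_vector`, and the choice of raw counts as features are all assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All k-grams (1 <= k <= n) of consecutive tokens, duplicates kept."""
    return [" ".join(tokens[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(tokens) - k + 1)]

# Two already-tokenized toy documents (hypothetical data).
docs = [["the", "cat", "sat"], ["the", "dog", "sat"]]

# Vocabulary: every distinct 1-gram and 2-gram seen in the corpus, sorted.
vocab = sorted({g for doc in docs for g in ngrams(doc, 2)})

def count_vector(tokens):
    """One count per vocabulary entry, in vocabulary order."""
    counts = Counter(ngrams(tokens, 2))
    return [counts.get(g, 0) for g in vocab]

features = [count_vector(doc) for doc in docs]

print(vocab)
for row in features:
    print(row)
```

Each row of `features` has one column per n-gram in the vocabulary, so it could be passed to a classifier such as logistic regression.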

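One-hot encoding of words and characters

Below are two simple examples: one is word-level one-hot encoding, the other is character-level one-hot encoding.

Word-level one-hot encoding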

import numpy as np

# This is our initial data; one entry per "sample"
# (in this toy example, a "sample" is just a sentence, but
# it could be an entire document).
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# First, build an index of all tokens in the data.
token_index = {}
for sample in samples:
    # We simply tokenize the samples via the `split` method.
    # In real life, we would also strip punctuation and special characters
    # from the samples.
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word
            token_index[word] = len(token_index) + 1
            # Note that we don't attribute index 0 to anything.

# Next, we vectorize our samples.
# We will only consider the first `max_length` words in each sample.
max_length = 10

# This is where we store our results:
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
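
As a quick sanity check on the word-level encoding, the sketch below rebuilds the same index in compact form, prints the tensor's shape, and decodes the first sample back into words. The `decode` helper and the `index_token` reverse mapping are additions for illustration; they assume that an all-zero row means "no word at this position".

```python
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Same index as above: one integer per unique word, starting at 1.
token_index = {}
for sample in samples:
    for word in sample.split():
        token_index.setdefault(word, len(token_index) + 1)

max_length = 10
results = np.zeros((len(samples), max_length, len(token_index) + 1))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i, j, token_index[word]] = 1.

# Reverse mapping: integer index -> word.
index_token = {index: word for word, index in token_index.items()}

def decode(one_hot_sample):
    """Turn a (max_length, vocab_size) one-hot matrix back into words."""
    words = []
    for row in one_hot_sample:
        if row.sum() == 0:  # all-zero row: padding, no word here
            continue
        words.append(index_token[int(row.argmax())])
    return words

print(results.shape)        # (2, 10, 11)
print(decode(results[0]))   # ['The', 'cat', 'sat', 'on', 'the', 'mat.']
```

Note that 'The' and 'the' get different indices and that 'mat.' keeps its period, because `split` does no normalization.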

Character-level one-hot encoding

import string

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable  # All printable ASCII characters.
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
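for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        # Skip characters outside `string.printable` (e.g. non-ASCII):
        # `get` returns None for them, and indexing with None would
        # set the whole row to 1 instead of a single entry.
        if index is not None:
            results[i, j, index] = 1.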

Word-level one-hot encoding with Keras

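# Note: `Tokenizer` is a legacy API. It is deprecated in recent TensorFlow
# releases and is not part of Keras 3, so this import assumes an older
# Keras (or tf.keras 2.x) installation.
from keras.preprocessing.text import Tokenizer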

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# We create a tokenizer, configured to only take
# into account the top-1000 most common words
tokenizer = Tokenizer(num_words=1000)
# This builds the word index
tokenizer.fit_on_texts(samples)

# This turns strings into lists of integer indices.
sequences = tokenizer.texts_to_sequences(samples)

# You could also directly get the one-hot binary representations.
# Note that other vectorization modes than one-hot encoding are supported!
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# This is how you can recover the word index that was computed
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
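
To show roughly what `texts_to_matrix(samples, mode='binary')` returns, here is a NumPy-only sketch that does not call Keras. It is an approximation under stated assumptions, not the library's implementation: `simple_tokens` is a hypothetical stand-in for the Tokenizer's default lowercasing and punctuation filtering, words are ranked by frequency starting at index 1, and only indices below `num_words` are kept.

```python
import re
from collections import Counter

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
num_words = 1000

def simple_tokens(text):
    # Lowercase, then keep runs of ASCII letters, digits and apostrophes.
    # (Rough stand-in for the Tokenizer's default filtering.)
    return re.findall(r"[a-z0-9']+", text.lower())

# Rank words by frequency; the most frequent word gets index 1.
counts = Counter(tok for s in samples for tok in simple_tokens(s))
word_index = {word: rank
              for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# One row per sample, one column per word index; 1 if the word occurs at all.
matrix = np.zeros((len(samples), num_words))
for i, s in enumerate(samples):
    for tok in simple_tokens(s):
        idx = word_index[tok]
        if idx < num_words:  # assumed cutoff: keep only the top words
            matrix[i, idx] = 1.

print('Found %s unique tokens.' % len(word_index))
print(matrix.shape)
```

Unlike the earlier examples, this representation has one vector per sample rather than one per word, so word order within a sample is lost.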