Keras Text Sequences: Text Vectorization, Part 1 (one-hot encoding of tokens)

Reference:
https://blog.csdn.net/qq_30614345/article/details/98714874

6.1.1 One-hot encoding of words and characters
Listing 6-1 Word-level one-hot encoding (toy example)
Listing 6-2 Character-level one-hot encoding (toy example)
Listing 6-3 Using Keras for word-level one-hot encoding

Listing 6-4 Word-level one-hot encoding with the hashing trick (toy example)


One-hot encoding of words and characters

One-hot encoding is the most common, most basic way to turn a token into a vector. You already used it in the IMDB and Reuters examples of chapter 3 (done with words, in those cases). It consists of associating a unique integer index with every word, then turning this integer index i into a binary vector of size N (the size of the vocabulary); the vector is all zeros except for the i-th entry, which is 1.

Of course, one-hot encoding can also be done at the character level. To make clear what one-hot encoding is and how to implement it, listings 6-1 and 6-2 show two toy examples: one for words, the other for characters.
Listing 6-1 Word-level one-hot encoding (toy example)

import numpy as np

# This is our initial data; one entry per "sample"
# (in this toy example, a "sample" is just a sentence, but
# it could be an entire document).
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# First, build an index of all tokens in the data.
token_index = {}
for sample in samples:
    # We simply tokenize the samples via the `split` method.
    # In real life, we would also strip punctuation and special
    # characters from the samples.
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word.
            # Note that we don't attribute index 0 to anything.
            token_index[word] = len(token_index) + 1

# Next, we vectorize our samples.
# We will only consider the first `max_length` words in each sample.
max_length = 10

# This is where we store our results:
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
        
print(token_index)         
print(results)        
{'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6, 'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10}
[[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

 [[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]
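As a quick sanity check, you can invert token_index and decode the tensor back into tokens. Below is a minimal sketch, assuming the token_index and results variables from listing 6-1; the reverse_index helper is ours, not part of the original listing:

# Decode the one-hot tensor back into tokens (sanity check).
# `reverse_index` is a hypothetical helper, not part of listing 6-1.
reverse_index = {index: word for word, index in token_index.items()}
for i in range(results.shape[0]):
    decoded = []
    for j in range(results.shape[1]):
        # argmax returns 0 for all-zero rows; since index 0 is never
        # assigned to a token, 0 means "no token at this position".
        index = int(results[i, j].argmax())
        if index != 0:
            decoded.append(reverse_index[index])
    print(' '.join(decoded))
# Prints the tokenized samples:
# The cat sat on the mat.
# The dog ate my homework.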
Character level one-hot encoding (toy example)

Listing 6-2 Character-level one-hot encoding (toy example)

import string

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

characters = string.printable  # All printable ASCII characters.
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.
        
print(token_index)         
print(results)                
{'0': 1, '1': 2, '2': 3, '3': 4, '4': 5, '5': 6, '6': 7, '7': 8, '8': 9, '9': 10, 'a': 11, 'b': 12, 'c': 13, 'd': 14, 'e': 15, 'f': 16, 'g': 17, 'h': 18, 'i': 19, 'j': 20, 'k': 21, 'l': 22, 'm': 23, 'n': 24, 'o': 25, 'p': 26, 'q': 27, 'r': 28, 's': 29, 't': 30, 'u': 31, 'v': 32, 'w': 33, 'x': 34, 'y': 35, 'z': 36, 'A': 37, 'B': 38, 'C': 39, 'D': 40, 'E': 41, 'F': 42, 'G': 43, 'H': 44, 'I': 45, 'J': 46, 'K': 47, 'L': 48, 'M': 49, 'N': 50, 'O': 51, 'P': 52, 'Q': 53, 'R': 54, 'S': 55, 'T': 56, 'U': 57, 'V': 58, 'W': 59, 'X': 60, 'Y': 61, 'Z': 62, '!': 63, '"': 64, '#': 65, '$': 66, '%': 67, '&': 68, "'": 69, '(': 70, ')': 71, '*': 72, '+': 73, ',': 74, '-': 75, '.': 76, '/': 77, ':': 78, ';': 79, '<': 80, '=': 81, '>': 82, '?': 83, '@': 84, '[': 85, '\\': 86, ']': 87, '^': 88, '_': 89, '`': 90, '{': 91, '|': 92, '}': 93, '~': 94, ' ': 95, '\t': 96, '\n': 97, '\r': 98, '\x0b': 99, '\x0c': 100}
[[[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]]
Note that Keras has built-in utilities for one-hot encoding text at the word level or character level, starting from raw text data. This is what you should actually be using, as it takes care of a number of important features, such as stripping special characters from strings, or only taking into account the top N most common words in your dataset (a common restriction, to avoid dealing with very large input vector spaces).

Using Keras for word-level one-hot encoding:

Listing 6-3 Using Keras for word-level one-hot encoding

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# We create a tokenizer, configured to only take into account
# the 1,000 most common words.
tokenizer = Tokenizer(num_words=1000)

# This builds the word index.
tokenizer.fit_on_texts(samples)

# This turns strings into lists of integer indices.
sequences = tokenizer.texts_to_sequences(samples)

# You could also directly get the one-hot binary representations.
# Vectorization modes other than one-hot encoding are supported as well.
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# This is how you can recover the word index that was computed.
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
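Running this listing prints Found 9 unique tokens.: by default the Tokenizer lowercases text and strips punctuation, so 'The' and 'the' collapse into a single token. A small inspection sketch (the exact index values shown assume Keras's default ordering, most frequent word first):

print(sequences)
# [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
# 'the' appears three times across the samples, so it gets index 1.
print(one_hot_results.shape)
# (2, 1000): one row per sample, one column per word index up to num_words.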

A variant of one-hot encoding is the so-called one-hot hashing trick, which you can use when the number of unique tokens in your vocabulary is too large to handle explicitly. Instead of explicitly assigning an index to each word and keeping a reference of these indices in a dictionary, you hash words into vectors of fixed size, typically with a very lightweight hash function. The main advantage of this method is that it does away with maintaining an explicit word index, which saves memory and allows online encoding of the data (you can generate token vectors right away, before you have seen all of the available data). The one drawback is the possibility of hash collisions: two different words may end up with the same hash value, and subsequently any machine learning model looking at these hash values will not be able to tell these words apart. The likelihood of hash collisions decreases when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed.

Listing 6-4 Word-level one-hot encoding with the hashing trick (toy example)

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# We will store our words as vectors of size 1000.
# Note that if you have close to 1,000 words (or more),
# you will start seeing many hash collisions, which
# will decrease the accuracy of this encoding method.
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into a "random" integer index between
        # 0 and `dimensionality` - 1. Note that Python's built-in
        # hash() for strings is salted per process, so the indices
        # will differ between runs unless PYTHONHASHSEED is fixed.
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
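To see the collision effect the comment above warns about, you can count how many distinct buckets a vocabulary actually occupies. A minimal sketch with a synthetic word list (the word0, word1, ... tokens are made up purely for illustration):

# Hash 1,000 synthetic words into 1,000 buckets and count distinct indices.
vocab = ['word%d' % k for k in range(1000)]
buckets = {abs(hash(w)) % dimensionality for w in vocab}
print('distinct indices: %d for %d words' % (len(buckets), len(vocab)))
# With as many words as buckets, you typically see only ~632 distinct
# indices (about 1 - 1/e of them), i.e. hundreds of collisions; with far
# fewer words than buckets, collisions become rare.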