Keras Text Preprocessing (Using Tokenizer)

Latest recommended article published on 2024-07-17 12:02:11

眠眠菇

Article link: https://blog.csdn.net/weixin_44060440/article/details/107745086

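Note: parts of this article are based on the Keras Chinese documentation.
Tokenizer
A utility class for text tokenization.

This class vectorizes a text corpus in one of two ways: it turns each text into a sequence of integers (each integer is the index of a token in a dictionary), or it turns each text into a vector in which the coefficient for each token can be binary, a word count, a TF-IDF weight, and so on.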
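
To make the two modes concrete, here is a minimal plain-Python sketch. It is an illustration only, not the Keras implementation; the `word_index` dictionary and the helper names `to_sequence` and `to_count_vector` are hypothetical.

```python
# Hypothetical word index; index 0 is left unused, as in Keras.
word_index = {"i": 1, "love": 2, "dog": 3, "cat": 4}


def to_sequence(text):
    """Mode 1: map each known word to its integer index."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


def to_count_vector(text):
    """Mode 2: one fixed-length vector per text, holding word counts."""
    vector = [0] * (len(word_index) + 1)  # slot 0 stays empty
    for idx in to_sequence(text):
        vector[idx] += 1
    return vector


print(to_sequence("I love dog"))           # [1, 2, 3]
print(to_count_vector("I love love cat"))  # [0, 1, 2, 0, 1]
```

In Keras itself, the first mode corresponds to `texts_to_sequences` and the second to `texts_to_matrix`, whose `mode` argument selects binary, count, frequency, or TF-IDF coefficients.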

keras.preprocessing.text.Tokenizer(num_words=None, 
                                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                                   lower=True, 
                                   split=' ', 
                                   char_level=False, 
                                   oov_token=None, 
                                   document_count=0)

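num_words: The maximum number of words to keep, based on word frequency. Only the most frequent num_words - 1 words are kept. The limit applies when texts are converted (for example in texts_to_sequences); word_index still lists every word seen.
filters: A string in which each character is one that will be filtered out of the text. The default is all punctuation plus tab and newline, minus the ' character.
lower: Boolean. Whether to convert the text to lowercase.
split: String. The separator used to split the text.
char_level: If True, every character is treated as a token.
oov_token: If given, it is added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls.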
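
The interaction between num_words and oov_token can be sketched in plain Python. This is a simplified imitation of the lookup rule, not the library code; the function `to_sequence`, the `oov_index` parameter, and the sample index are hypothetical names used for illustration.

```python
def to_sequence(words, word_index, num_words=None, oov_index=None):
    """Convert words to indices, imitating the num_words / oov_token rules."""
    seq = []
    for w in words:
        idx = word_index.get(w)
        if idx is None or (num_words is not None and idx >= num_words):
            # Unknown or too-rare word: replace it with the OOV index,
            # or drop it when no OOV token is configured.
            if oov_index is not None:
                seq.append(oov_index)
        else:
            seq.append(idx)
    return seq


# Hypothetical index: the OOV token takes index 1 when one is configured.
word_index = {"<OOV>": 1, "i": 2, "love": 3, "dog": 4, "cat": 5}
words = ["i", "love", "cat", "food"]

print(to_sequence(words, word_index, num_words=5, oov_index=1))  # [2, 3, 1, 1]
print(to_sequence(words, word_index, num_words=5))               # [2, 3]
```

With num_words=5 only indices 1 to 4 survive, which is why "cat" (index 5) is treated like the unseen word "food".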

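By default, all punctuation is removed, turning each text into a space-separated sequence of words (words may contain the ' character). These sequences are then split into lists of tokens, which are then indexed or vectorized.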
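
The default clean-up step can be approximated as follows. This is a sketch under the assumption that filtered characters are replaced by the split string before splitting; `to_words` and `DEFAULT_FILTERS` are hypothetical names, not part of the Keras API.

```python
# Default filter set: punctuation plus tab and newline, without the apostrophe.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def to_words(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Lowercase, strip filtered characters, and split into words."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split string.
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Drop empty strings produced by consecutive separators.
    return [w for w in text.split(split) if w]


print(to_words("I love dogs, don't you?\tYes!"))
# ['i', 'love', 'dogs', "don't", 'you', 'yes']
```

Note that the apostrophe in "don't" survives because it is not in the filter set.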

0 is a reserved index that is never assigned to any word.
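
A frequency-ranked index that starts at 1 can be built as in the sketch below. It only illustrates the idea (more frequent words get smaller indices, 0 stays free); `build_word_index` is a hypothetical helper, not a Keras function.

```python
from collections import Counter


def build_word_index(tokenized_texts):
    """Rank words by frequency and number them from 1."""
    counts = Counter()
    for words in tokenized_texts:
        counts.update(words)
    # most_common() sorts by count, highest first; ties keep first-seen order.
    ranked = [w for w, _ in counts.most_common()]
    # Start at 1 so that 0 stays free (for example, as a padding value).
    return {w: i for i, w in enumerate(ranked, start=1)}


texts = [["i", "love", "dog"], ["i", "love", "cat"]]
print(build_word_index(texts))
# {'i': 1, 'love': 2, 'dog': 3, 'cat': 4}
```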

Example:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ["I love dog",
             "I love cat"]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

>>> {'i': 1, 'love': 2, 'dog': 3, 'cat': 4}

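Class methods

fit_on_texts(texts)
- texts: list of texts to fit on
texts_to_sequences(texts)
- texts: list of texts to convert to sequences
- Returns: a list of sequences, one for each input text
texts_to_sequences_generator(texts)
- The generator version of texts_to_sequences
- texts: list of texts to convert to sequences
- Returns: yields one sequence per input text on each iteration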
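
The relationship between the list version and the generator version can be sketched in plain Python. The two functions below are hypothetical stand-ins, assuming a ready-made `word_index`; they are not the Keras methods themselves.

```python
def sequences_generator(texts, word_index):
    """Yield one integer sequence per text, computed lazily."""
    for text in texts:
        yield [word_index[w] for w in text.lower().split() if w in word_index]


def sequences(texts, word_index):
    """List version: simply exhausts the generator."""
    return list(sequences_generator(texts, word_index))


word_index = {"i": 1, "love": 2, "dog": 3, "cat": 4}
texts = ["I love dog", "I love cat"]

gen = sequences_generator(texts, word_index)
print(next(gen))                     # [1, 2, 3]
print(sequences(texts, word_index))  # [[1, 2, 3], [1, 2, 4]]
```

The generator form avoids holding every sequence in memory at once, which matters mainly for large corpora.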

Example:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ["I love dog",
             "I love cat"]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print("word_index:", word_index)
seq_normal = tokenizer.texts_to_sequences(sentences)  # list version
print("seq_normal:", seq_normal)
seq = tokenizer.texts_to_sequences_generator(sentences)  # generator version
print("seq_generator:", end=" ")
for s in seq:  # the loop ends when the generator is exhausted
    print(s, end=" ")

>>> word_index: {'i': 1, 'love': 2, 'dog': 3, 'cat': 4}
>>> seq_normal: [[1, 2, 3], [1, 2, 4]]
>>> seq_generator: [1, 2, 3] [1, 2, 4]

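pad_sequences
A function in the keras.preprocessing.sequence module.
Pads or truncates multiple sequences to the same length.

This function converts a list of num_samples sequences (lists of integers) into a 2D NumPy array of shape (num_samples, num_timesteps). num_timesteps is either the maxlen argument, if given, or the length of the longest sequence.

Sequences shorter than num_timesteps are padded with value until they reach num_timesteps.

Sequences longer than num_timesteps are truncated to fit. Where padding and truncation happen is controlled by the padding and truncating arguments, respectively.

Padding at the start ('pre') is the default.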

keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.0)

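Arguments

sequences: List of lists, where each element is a sequence.
maxlen: Integer, the maximum length of all sequences. If None, the length of the longest sequence is used.
dtype: Type of the output sequences. To pad sequences with variable-length strings, you can use object.
padding: String, 'pre' or 'post': pad at the start or at the end of each sequence.
truncating: String, 'pre' or 'post': remove values from sequences longer than maxlen, either at the start or at the end.
value: Float, the padding value.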

Returns

x: NumPy array of shape (len(sequences), maxlen).
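
The padding and truncation rules can be sketched with plain lists. This is a simplified illustration that returns nested lists rather than a NumPy array; the function name `pad` is hypothetical, and it is not the Keras implementation.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        # Truncate first: keep the tail for 'pre', the head for 'post'.
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


seqs = [[1, 2, 3], [1, 5, 2, 6], [1, 2, 7, 8, 9, 10, 11]]
print(pad(seqs, maxlen=5))
# [[0, 0, 1, 2, 3], [0, 1, 5, 2, 6], [7, 8, 9, 10, 11]]
print(pad(seqs, maxlen=5, padding="post", truncating="post"))
# [[1, 2, 3, 0, 0], [1, 5, 2, 6, 0], [1, 2, 7, 8, 9]]
```

With the defaults, both the padding values and the discarded elements sit at the start of each sequence, so the most recent tokens are the ones that survive.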

Example:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["I love dog",
             "I love cat",
             "I really love tigers",
             "I love rabbits since they are cute"]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print("word_index:", word_index)
seq_normal = tokenizer.texts_to_sequences(sentences)
# Word count of the longest sentence, used as the target length
max_len = max(len(sentence.split(" ")) for sentence in sentences)
padded1 = pad_sequences(seq_normal, maxlen=max_len)  # defaults: pad at the start with 0
print("padded1:\n", padded1)
padded2 = pad_sequences(seq_normal, maxlen=max_len, padding='post', value=101)  # pad at the end with 101
print("padded2:\n", padded2)
padded3 = pad_sequences(seq_normal, maxlen=5, truncating='post')  # cut long sequences at the end
print("padded3:\n", padded3)

>>> word_index: {'i': 1, 'love': 2, 'dog': 3, 'cat': 4, 'really': 5, 'tigers': 6, 'rabbits': 7, 'since': 8, 'they': 9, 'are': 10, 'cute': 11}
>>> padded1:
 [[ 0  0  0  0  1  2  3]
 [ 0  0  0  0  1  2  4]
 [ 0  0  0  1  5  2  6]
 [ 1  2  7  8  9 10 11]]
>>> padded2:
 [[  1   2   3 101 101 101 101]
 [  1   2   4 101 101 101 101]
 [  1   5   2   6 101 101 101]
 [  1   2   7   8   9  10  11]]
>>> padded3:
 [[0 0 1 2 3]
 [0 0 1 2 4]
 [0 1 5 2 6]
 [1 2 7 8 9]]
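
When choosing maxlen, keep in mind that the length of a token sequence can differ from the number of space-separated pieces in the raw sentence, because filtered characters produce no tokens. The short sketch below illustrates this in plain Python; the sentence "I love dog !" is a made-up input, and the sequences are copied from the output above.

```python
# Token sequences like those returned by texts_to_sequences above.
seq_normal = [[1, 2, 3], [1, 2, 4], [1, 5, 2, 6], [1, 2, 7, 8, 9, 10, 11]]

# Length of the longest token sequence.
maxlen = max(len(seq) for seq in seq_normal)
print(maxlen)  # 7

# Counting raw space-separated pieces can overestimate the length:
# a stand-alone "!" is a piece here, but the default filters would remove it.
pieces = "I love dog !".split(" ")
print(len(pieces))  # 4
```

Leaving maxlen=None in pad_sequences gives the same result as the first computation, since the longest sequence then sets the length.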