The Keras Tokenizer

小孟Tec

Article link: https://blog.csdn.net/m0_38024592/article/details/102963978

Tokenizer

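Tokenizer is a class that vectorizes text by turning each text into a sequence of integer word indices. It covers the tokenization and encoding steps of text preprocessing.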

Import the class:

from keras.preprocessing.text import Tokenizer

The default parameters are:

keras.preprocessing.text.Tokenizer(num_words=None, 
                                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                                   lower=True, 
                                   split=' ', 
                                   char_level=False, 
                                   oov_token=None, 
                                   document_count=0)

Parameters:

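num_words: None by default, meaning every word is kept. If set to an integer, only the most frequent words are used when texts are converted: num_words - 1 words in total, because index 0 is reserved. word_index itself still lists every word seen during fitting.
filters: characters to strip from the text. The default shown above is usually fine.
lower: whether to convert all text to lowercase.
split: the separator string used to split words. The default is a space, which is how English words are separated.
char_level: if True, every character is treated as a token.
oov_token: if given, it is added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls.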
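
To make the interaction between num_words and oov_token concrete, here is a small pure-Python sketch of the indexing logic as I understand it. It is a simplified model, not the Keras source: the helper names split_words, build_word_index and encode are made up for this illustration, and only the default word-level behavior is covered.

```python
from collections import Counter

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text, filters=FILTERS, lower=True, split=" "):
    """Lowercase, replace filtered characters with the separator, then split."""
    if lower:
        text = text.lower()
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    return [w for w in text.split(split) if w]

def build_word_index(texts, oov_token=None):
    """Rank words by frequency; indices start at 1, the OOV token (if any) comes first."""
    counts = Counter()
    for t in texts:
        counts.update(split_words(t))
    # Most frequent first; the sort is stable, so ties keep first-seen order.
    ranked = [w for w, _ in sorted(counts.items(), key=lambda kv: -kv[1])]
    if oov_token is not None:
        ranked.insert(0, oov_token)
    return {w: i for i, w in enumerate(ranked, start=1)}

def encode(text, word_index, num_words=None, oov_token=None):
    """Map words to indices; words that are unknown or ranked >= num_words are dropped or replaced."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    seq = []
    for w in split_words(text):
        i = word_index.get(w)
        if i is not None and (num_words is None or i < num_words):
            seq.append(i)
        elif oov_index is not None:
            seq.append(oov_index)
    return seq

corpus = ["the cat sat", "the cat", "the"]
word_index = build_word_index(corpus)
print(word_index)                                      # {'the': 1, 'cat': 2, 'sat': 3}
print(encode("The cat sat.", word_index, num_words=3)) # [1, 2]

word_index_oov = build_word_index(corpus, oov_token="<unk>")
print(encode("the dog sat", word_index_oov, num_words=3, oov_token="<unk>"))  # [2, 1, 1]
```

With num_words=3 only indices 1 and 2 survive, which is the "num_words - 1 words" rule from the parameter list. Once an OOV token is set it takes index 1, so it uses up one of those slots.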

Method	Arguments	Returns
fit_on_texts(texts)	texts: list of texts to fit on	-
texts_to_sequences(texts)	texts: list of texts to convert to sequences	A list of sequences, one per input text
texts_to_sequences_generator(texts)	texts: list of texts to convert to sequences	Generator version of texts_to_sequences; yields one sequence per input text
texts_to_matrix(texts, mode)	texts: list of texts to vectorize; mode: one of 'binary', 'count', 'tfidf', 'freq', default 'binary'	NumPy array of shape (len(texts), num_words)
fit_on_sequences(sequences)	sequences: list of sequences to fit on	-
sequences_to_matrix(sequences, mode)	sequences: list of sequences to vectorize; mode: as above	NumPy array of shape (len(sequences), num_words)
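
The matrix modes are easier to picture with a toy version. The following pure-Python sketch mimics what I understand the 'binary', 'count' and 'freq' modes to do, using nested lists instead of a NumPy array. The function name to_matrix is made up for this illustration, and the 'tfidf' mode is left out on purpose.

```python
from collections import Counter

def to_matrix(sequences, num_words, mode="binary"):
    """One row per sequence, one column per word index (column 0 stays empty)."""
    if mode not in ("binary", "count", "freq"):
        raise ValueError("this sketch only covers 'binary', 'count' and 'freq'")
    matrix = [[0.0] * num_words for _ in sequences]
    for row, seq in zip(matrix, sequences):
        # Indices at or above num_words have no column and are skipped.
        counts = Counter(i for i in seq if i < num_words)
        for i, c in counts.items():
            if mode == "binary":
                row[i] = 1.0              # word present or not
            elif mode == "count":
                row[i] = float(c)         # raw occurrences in this sequence
            else:
                row[i] = c / len(seq)     # occurrences divided by sequence length
    return matrix

sequences = [[1, 2, 3], [4, 5, 6], [7, 8]]
for row in to_matrix(sequences, num_words=9):
    print(row)

print(to_matrix([[1, 1, 2]], num_words=3, mode="count"))  # [[0.0, 2.0, 1.0]]
```

When num_words is None, the number of columns is len(word_index) + 1, because index 0 is never assigned to a word.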

Attributes

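word_counts: dict mapping each word (string) to the number of times it appeared during fitting. Set only after fit_on_texts has been called.
word_docs: dict mapping each word (string) to the number of documents or texts it appeared in during fitting. Set only after fit_on_texts has been called.
word_index: dict mapping each word (string) to its rank, i.e. its index. Set only after fit_on_texts has been called.
document_count: integer. The number of documents (texts or sequences) the tokenizer was fitted on. Set only after fit_on_texts or fit_on_sequences has been called.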
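
The difference between word_counts and word_docs only shows up when a word repeats inside one text. A minimal stdlib sketch of the two counts follows; it uses a plain lowercase-and-split in place of the real tokenization, and the sample sentences are made up.

```python
from collections import Counter

docs = ["the cat saw the dog", "the dog slept"]

word_counts = Counter()  # total occurrences across all texts
word_docs = Counter()    # number of texts containing the word
for doc in docs:
    words = doc.lower().split()
    word_counts.update(words)
    word_docs.update(set(words))  # count each word once per text

document_count = len(docs)

print(word_counts["the"], word_docs["the"])  # 3 2
print(word_counts["cat"], word_docs["cat"])  # 1 1
print(document_count)                        # 2
```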

In Action

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences  # pads sequences to equal length


tokenizer = Tokenizer(num_words=None,
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,
                      split=' ',
                      char_level=False,
                      oov_token=None,
                      document_count=0)
"""
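num_words: None by default, meaning every word is kept. If set to an integer,
only the most frequent words are used when converting texts: num_words - 1 words in total.
filters: characters to strip from the text. The default shown above is usually fine.
lower: whether to convert all text to lowercase.
split: the separator string used to split words. The default is a space, as in English.
char_level: if True, every character is treated as a token.
oov_token: if given, it is added to word_index
and used to replace out-of-vocabulary words during texts_to_sequences calls.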
"""


corpus = ['Updates internal vocabulary.',
          'In the case,',
          'Required before']
# fit_on_texts method
tokenizer.fit_on_texts(corpus)

# word_counts attribute
print(tokenizer.word_counts)
# Output
# OrderedDict([('updates', 1), ('internal', 1), ('vocabulary', 1), 
# ('in', 1), ('the', 1), ('case', 1), ('required', 1), ('before', 1)])
print(tokenizer.word_docs)
# Output (key order may vary)
# defaultdict(<class 'int'>, {'vocabulary': 1, 'updates': 1, 'internal': 1, 
# 'case': 1, 'in': 1, 'the': 1, 'required': 1, 'before': 1})
print(tokenizer.word_index)
# Output
# {'updates': 1, 'internal': 2, 'vocabulary': 3, 'in': 4, 'the': 5, 'case': 6, 'required': 7, 'before': 8}
print(tokenizer.document_count)
# Output: 3

print(tokenizer.texts_to_sequences(corpus))
# Output
# [[1, 2, 3], [4, 5, 6], [7, 8]]
print(tokenizer.texts_to_matrix(corpus))
# Output
# [[0. 1. 1. 1. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 1. 1. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 1. 1.]]
print(tokenizer.texts_to_matrix(corpus).shape)
# Output: (3, 9)

tokenizer.fit_on_sequences(tokenizer.texts_to_sequences(corpus))
print(tokenizer.sequences_to_matrix(tokenizer.texts_to_sequences(corpus)))
# Output
# [[0. 1. 1. 1. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 1. 1. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 1. 1.]]

Note:

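One thing to watch out for: English words are separated by spaces as a matter of writing convention, so split=' ' tokenizes English text directly. That does not work for Chinese. When calling tokenizer.fit_on_texts(text), English text can be passed as is, for example text = ["Today is raining.", "I feel tired today."]. Chinese text has to be word-segmented first, with the words joined by spaces, before it is passed in: text = ["今天 北京 下 雨 了", "我 今天 加班"]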

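This is the pitfall I fell into. The code I had copied used English text and ran fine, but my input was Chinese. Splitting on spaces turned each whole sentence into a single token, and a token that could not be found in the vocabulary at that, so I ended up with large numbers of identical encoded inputs and identical prediction scores.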

So for English documents the Keras Tokenizer handles both steps, tokenization and encoding. For Chinese it really only does the encoding step.
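
The pitfall can be reproduced without Keras, since it comes down to splitting on spaces. In the sketch below, segment_by_char is a made-up stand-in that simply puts a space between characters. A real project would more likely use a proper word segmenter (jieba is a common choice) and join its output with spaces in the same way.

```python
raw = ["今天北京下雨了", "我今天加班"]

# Splitting unsegmented Chinese on spaces gives one token per sentence.
print([s.split(" ") for s in raw])
# [['今天北京下雨了'], ['我今天加班']]

def segment_by_char(sentence):
    """Hypothetical stand-in for a word segmenter: one token per character."""
    return " ".join(ch for ch in sentence if not ch.isspace())

segmented = [segment_by_char(s) for s in raw]
print(segmented)
# ['今 天 北 京 下 雨 了', '我 今 天 加 班']

# After segmentation, splitting on spaces yields usable tokens.
print([s.split(" ") for s in segmented])
```

Character-level splitting is only a placeholder here. It loses word boundaries such as 今天 and 北京, which a word segmenter would keep.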

Encoding example

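import joblib
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# 1. Create the Tokenizer object
tokenizer = Tokenizer()  # adjust the arguments to suit your data

# 2. Assemble the corpus; Chinese text must already be segmented with spaces
text = ["今天 北京 下 雨 了", "我 今天 加班"]

# 3. Fit the tokenizer on the corpus to build its vocabulary
tokenizer.fit_on_texts(text)

# 4. Save the fitted tokenizer so the same corpus need not be fitted again
save_path = "tokenizer.pkl"  # placeholder path, choose your own
joblib.dump(tokenizer, save_path)

# 5. Assemble the text to encode; Chinese text again needs space segmentation
new_text = ["今天 回家 吃饭", "我 今天 生病 了"]

# 6. Convert the text to integer sequences
#    (words not seen during fitting are dropped because oov_token is None)
list_tokenized = tokenizer.texts_to_sequences(new_text)

# 7. Pad the sequences to a fixed length to form the training input
X_train = pad_sequences(list_tokenized, maxlen=200)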
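
For readers wondering what the padding step does to the sequences, here is a pure-Python sketch of the default behavior as I understand it: zeros are added at the front, and sequences that are too long lose their earliest items. The function name pad_pre is made up, the result is a nested list rather than a NumPy array, and the input sequences are the ones I would expect the example above to produce under the default settings.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value`, and keep only the last `maxlen` items of longer sequences."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                        # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

# "今天 回家 吃饭" keeps only 今天; "我 今天 生病 了" keeps 我, 今天, 了.
list_tokenized = [[1], [6, 1, 5]]

print(pad_pre(list_tokenized, maxlen=5))
# [[0, 0, 0, 0, 1], [0, 0, 6, 1, 5]]

print(pad_pre([[1, 2, 3, 4]], maxlen=2))
# [[3, 4]]
```

Padding with 0 is safe because the tokenizer never assigns index 0 to a word.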

Reference:

http://codewithzhangyi.com/2019/04/23/keras-tokenizer/