Keras text.Tokenizer: Text and Sequence Preprocessing

Keras Chinese documentation: http://keras-cn.readthedocs.io/en/latest/preprocessing/text/

1. Introduction

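Text has to be preprocessed before it can be used for natural language processing.
This article introduces the text module and the sequence-processing module, sequence, both of which live in Keras's preprocessing package, keras.preprocessing.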

2. Functions provided by the text module

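text_to_word_sequence(text, filters): roughly comparable to str.split. It splits the text into a list of words; by default it also lowercases the text and removes the characters listed in filters (punctuation).
one_hot(text, n): uses a hash function with bucket size n (the vocabulary size) to convert a line of text into a list of integers, one per word. For example, n=5 maps every word to an integer below 5. Because the mapping is hash-based, different words can end up with the same integer.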
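The behaviour of these two functions can be approximated in plain Python. The sketch below is only an illustration, not Keras's source: the helper names `simple_word_sequence` and `simple_hashing_one_hot` are hypothetical, the filter string is an assumed punctuation set, and MD5 is used as the hash so that the output is stable between runs.

```python
import hashlib

# Assumed punctuation set for this sketch (hypothetical constant name).
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Lowercase, replace filtered characters by the separator, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Drop empty strings produced by consecutive separators.
    return [tok for tok in text.split(split) if tok]

def simple_hashing_one_hot(text, n):
    """Map each word to an integer in 1..n-1 via a hash (collisions possible)."""
    words = simple_word_sequence(text)
    return [
        int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
        for w in words
    ]

tokens = simple_word_sequence("The cat sat on the mat.")
codes = simple_hashing_one_hot("The cat sat on the mat.", 5)
print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(codes)   # six integers, each between 1 and 4
```

Index 0 is never produced, and the two occurrences of "the" always receive the same code, which is the property a hash-based encoding relies on.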

3. The text.Tokenizer class

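This class counts the words in a set of texts and builds a vocabulary from them, so that each text can then be converted into a vector based on the words' positions in that vocabulary.
__init__(num_words): the constructor. num_words is the maximum size of the vocabulary; only the most frequent words are kept.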

3.1 Methods

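fit_on_texts(texts): builds the token vocabulary from a collection of documents. texts is a list in which each element is one document.
texts_to_sequences(texts): converts multiple documents into lists of word indices. The result has one list per document, and the length of each list depends on that document: (number of documents, length of each document).
texts_to_matrix(texts): converts multiple documents into a matrix of shape [len(texts), num_words].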
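To make the two output shapes concrete, here is a minimal pure-Python sketch of what the two conversion methods do once a vocabulary exists. It is not the Keras implementation: `to_sequences`, `to_binary_matrix`, and the hand-written `word_index` are hypothetical, and tokenization is reduced to lowercasing and splitting on spaces and periods.

```python
def to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its index; one variable-length list per text."""
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().replace(".", " ").split():
            idx = word_index.get(word)
            # Unknown words, and words ranked at or beyond num_words, are dropped.
            if idx is not None and (num_words is None or idx < num_words):
                ids.append(idx)
        sequences.append(ids)
    return sequences

def to_binary_matrix(sequences, num_words):
    """One row per text, one column per index; 1.0 marks a word that occurs."""
    matrix = [[0.0] * num_words for _ in sequences]
    for row, ids in zip(matrix, sequences):
        for idx in ids:
            row[idx] = 1.0
    return matrix

# Hypothetical vocabulary, as if it had been fitted beforehand.
word_index = {"the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
seqs = to_sequences(["The cat sat on the mat.", "The cat"], word_index, num_words=6)
matrix = to_binary_matrix(seqs, num_words=6)
print(seqs)    # [[1, 2, 3, 4, 1, 5], [1, 2]]
print(matrix)  # two rows of six columns each
```

The sequences keep word order and repetition but differ in length, while the matrix rows all have num_words columns and lose the ordering. Column 0 stays empty because word indices start at 1.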

3.2 Attributes

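document_count: the number of documents processed.
word_index: a dict mapping each word to its integer id, starting from 1.
word_counts: a dict holding the number of times each word occurs across all documents.
word_docs: a dict holding the number of documents in which each word occurs.
index_docs: a dict holding, for each word id, the number of documents in which that word occurs.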
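The relationship between these attributes is easiest to see by computing them by hand. The following sketch uses `collections.Counter` on two pre-tokenized documents; it mirrors the attribute names for readability but is an assumption-laden illustration, not the library's code. In particular, ranking ids by descending frequency with ties kept in first-seen order is an assumption of this sketch.

```python
from collections import Counter

# Two documents, already lowercased and split into words.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "my", "homework"],
]

word_counts = Counter()  # total occurrences of each word
word_docs = Counter()    # number of documents containing each word
for doc in docs:
    word_counts.update(doc)
    word_docs.update(set(doc))  # count each word once per document
document_count = len(docs)

# Ids start at 1; more frequent words get smaller ids (ties keep first-seen order).
ranked = sorted(word_counts, key=lambda w: -word_counts[w])
word_index = {w: i + 1 for i, w in enumerate(ranked)}

# Same information as word_docs, keyed by id instead of by word.
index_docs = {word_index[w]: c for w, c in word_docs.items()}

print(word_counts["the"], word_docs["the"])  # 3 2
print(word_index["the"], len(word_index))    # 1 9
```

"the" occurs three times but in only two documents, which is exactly the difference between word_counts and word_docs.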

3.3 Code example

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
tokenizer = Tokenizer(num_words=1000)  # only consider the 1000 most common words
tokenizer.fit_on_texts(samples)  # build the vocabulary (word index)

sequences = tokenizer.texts_to_sequences(samples)  # turn each string into a list of integer indices

one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')  # turn the documents into a binary matrix

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
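
The integer sequences produced by a tokenizer usually differ in length, so they are typically padded to a common length before being fed to a model. The sequence module offers pad_sequences for this; the pure-Python sketch below only illustrates the idea. The helper `pad_pre` is hypothetical, the input values are assumed to be what the two samples above encode to, and padding and truncating at the front with 0 is an assumption of this sketch.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (or left-truncate) every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# Assumed encodings of the two sample sentences (6 and 5 tokens).
sequences = [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
padded = pad_pre(sequences)
print(padded)  # [[1, 2, 3, 4, 1, 5], [0, 1, 6, 7, 8, 9]]
```

Using 0 as the padding value is safe here because word ids start at 1, so 0 never collides with a real word.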