Keras Text Preprocessing: text and sequence

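This article walks through text preprocessing with Keras, covering splitting sentences into words, word-to-index mapping, and one-hot encoding. Worked examples show how to use the Tokenizer class to count word frequencies, build a word index, and vectorize documents.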


Preprocessing

Sentence splitting and one-hot encoding:

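from keras.preprocessing import text
from keras.preprocessing.text import Tokenizer

# Same three sample documents as in the second listing further down
texts = ['some thing to eat', 'some some thing to drink', 'thing to eat food']

# num_words: None or an int. Words are ranked by frequency and only the top num_words - 1
# are kept when texts are converted; all other words are dropped.
tokenizer = Tokenizer(num_words=4)
tokenizer.fit_on_texts(texts)
print(tokenizer.word_index)  # word -> index mapping; it lists every word, regardless of num_words
print(tokenizer.texts_to_sequences(texts))  # replaces each kept word with its index; one list per document, lengths vary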

{'some': 1, 'thing': 2, 'to': 3, 'eat': 4, 'drink': 5, 'food': 6}
[[1, 2, 3], [1, 1, 2, 3], [2, 3]]
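To make the effect of num_words concrete, here is a minimal pure-Python sketch of the same two steps. It is not the Keras implementation: the helper names build_word_index and texts_to_sequences are made up for illustration, and tokenization is simplified to lowercasing and splitting on whitespace (the real Tokenizer also strips punctuation).

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by total frequency (ties keep first-seen order); indices start at 1."""
    counts = Counter()
    for doc in texts:
        counts.update(doc.lower().split())
    # sorted() is stable, so equally frequent words stay in first-seen order
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank + 1 for rank, (word, _) in enumerate(ranked)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Map words to indices, dropping any word whose index is >= num_words."""
    sequences = []
    for doc in texts:
        seq = []
        for word in doc.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue  # word never seen while building the index
            if num_words is not None and idx >= num_words:
                continue  # outside the top num_words - 1 words
            seq.append(idx)
        sequences.append(seq)
    return sequences

texts = ['some thing to eat', 'some some thing to drink', 'thing to eat food']
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=4)
print(word_index)
print(sequences)
```

Because index 0 is reserved, num_words=4 leaves only indices 1 to 3, which is why 'eat' (index 4) disappears from the sequences.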

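print(tokenizer.texts_to_matrix(texts))  # binary bag-of-words matrix (default mode='binary'), unlike the index sequences above; shape is (len(texts), num_words)
print(tokenizer.word_counts)  # total occurrences of each word across all documents; with num_words=4 only 'some', 'thing', 'to' are kept
print(tokenizer.word_docs)  # number of documents each word appears in
print(tokenizer.index_docs)  # the same document counts, keyed by word index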

[[0. 1. 1. 1.]
 [0. 1. 1. 1.]
 [0. 0. 1. 1.]]
OrderedDict([('some', 3), ('thing', 3), ('to', 3), ('eat', 2), ('drink', 1), ('food', 1)])
{'thing': 3, 'some': 2, 'eat': 2, 'to': 3, 'drink': 1, 'food': 1}
{2: 3, 1: 2, 4: 2, 3: 3, 5: 1, 6: 1}
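The matrix and the document counts above can be reproduced by hand. The following is a rough sketch under simplifying assumptions (whitespace tokenization, plain lists instead of NumPy arrays); the function names are hypothetical and only mirror what the default binary mode appears to do.

```python
from collections import Counter

def texts_to_binary_matrix(sequences, num_words):
    """One row per document; column i is 1.0 if word index i occurs in it."""
    matrix = [[0.0] * num_words for _ in sequences]
    for row, seq in zip(matrix, sequences):
        for idx in seq:
            if idx < num_words:  # column 0 is never set because indices start at 1
                row[idx] = 1.0
    return matrix

def document_frequency(texts):
    """Count, for each word, how many documents contain it at least once."""
    freq = Counter()
    for doc in texts:
        freq.update(set(doc.lower().split()))  # set(): count each word once per document
    return dict(freq)

texts = ['some thing to eat', 'some some thing to drink', 'thing to eat food']
sequences = [[1, 2, 3], [1, 1, 2, 3], [2, 3]]  # index sequences for num_words=4

matrix = texts_to_binary_matrix(sequences, num_words=4)
doc_freq = document_frequency(texts)
for row in matrix:
    print(row)
print(doc_freq)
```

Note that 'some' occurs three times in total but in only two documents, which is the difference between word_counts and word_docs.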

from keras.preprocessing import text
from keras.preprocessing.text import Tokenizer

text1='some thing to eat'
text2='some some thing to drink'
text3='thing to eat food'
texts=[text1, text2, text3]


print(text.text_to_word_sequence(text3))

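print(text.one_hot(text2, 20))  # hashes each word to an integer in [1, n-1]; different words can collide (below, 'thing' and 'drink' both map to 9)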
print(text.one_hot(text2,5))

['thing', 'to', 'eat', 'food']
[5, 5, 9, 19, 9]
[2, 2, 1, 3, 4]
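Despite its name, one_hot returns hashed integer codes rather than one-hot vectors. By default Keras hashes with Python's built-in hash, which is salted per process for strings, so the exact numbers shown above may not reproduce from run to run. The sketch below illustrates the same hashing idea with a stable MD5-based hash; stable_hash and hashing_one_hot are illustrative names, not Keras functions, and tokenization is again simplified to a whitespace split.

```python
import hashlib

def stable_hash(word):
    """Deterministic integer hash of a word, based on MD5."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)

def hashing_one_hot(text, n):
    """Map each word to an integer in [1, n-1]; index 0 stays reserved."""
    return [stable_hash(word) % (n - 1) + 1 for word in text.lower().split()]

codes_20 = hashing_one_hot('some some thing to drink', 20)
codes_5 = hashing_one_hot('some some thing to drink', 5)
print(codes_20)  # repeated words always get the same code
print(codes_5)   # a smaller n makes collisions between different words more likely
```

A larger n lowers the chance that two different words share a code, at the cost of a larger index space.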