Keras Tokenizer: Vectorizing Words, a Detailed Hands-On Walkthrough

Latest recommended article published 2024-09-20 10:18:19

火星种萝卜

Category: NLP

Original link: https://blog.csdn.net/qq_16234613/article/details/79436941

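This article walks through the text preprocessing utilities in Keras, including converting text to sequences, word counts, and document-frequency statistics. Worked examples show how to use the Tokenizer class to turn texts into index sequences and into a per-document binary matrix. It is aimed at readers interested in NLP and deep learning.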

from keras.preprocessing import text
from keras.preprocessing.text import Tokenizer

text1='some thing to eat'
text2='some some thing to drink'
text3='thing to eat food'
texts=[text1, text2, text3]

# keras.preprocessing.text.text_to_word_sequence(text,
# filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n',
# lower=True,
# split=" ")
print(text.text_to_word_sequence(text3))

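# Encode a line of text as a list of integer indices using hashing rather than a dictionary-based mapping.
# Despite the name, the result is a list of word indices, not a one-hot vector.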
# keras.preprocessing.text.one_hot(text,
# n,
# filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n',
# lower=True,
# split=" ")
print(text.one_hot(text2, 20))  # n is the size of the hashing space: indices fall in [1, n-1] (0 is reserved), and different words can collide
print(text.one_hot(text2,5))

#result

Using TensorFlow backend.

['thing', 'to', 'eat', 'food']
[6, 6, 1, 2, 18]
[3, 3, 2, 4, 4]
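
The two calls above can be approximated in plain Python to make the mechanics visible. This is a minimal sketch, not the Keras implementation: the helper names `to_word_sequence`, `stable_hash`, and `hashed_indices` are hypothetical, and MD5 is swapped in for the default hash function purely so that indices are reproducible between runs (Python's built-in `hash` for strings is salted per process unless `PYTHONHASHSEED` is fixed, so the numbers printed above may differ on another run).

```python
import hashlib

# Same default filter characters as in the signature shown above.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def to_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Approximate text_to_word_sequence: lowercase, strip filtered chars, split."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the separator, then split.
    table = str.maketrans({ch: split for ch in filters})
    return [w for w in text.translate(table).split(split) if w]


def stable_hash(word):
    """Hypothetical stand-in for the hash function: MD5 digest as an integer."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)


def hashed_indices(text, n):
    """Map each word to an index in [1, n-1]; index 0 is left unused."""
    return [stable_hash(w) % (n - 1) + 1 for w in to_word_sequence(text)]


words = to_word_sequence("thing to eat food")
ids = hashed_indices("some some thing to drink", 20)
print(words)
print(ids)
```

Identical words always receive the same index, but nothing prevents two different words from sharing one. The `[3, 3, 2, 4, 4]` output above, where `to` and `drink` both map to 4, is such a collision, and a smaller `n` makes collisions more likely.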

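# num_words: None or an int. Words are ranked by frequency and only those with index < num_words
# (the num_words - 1 most frequent words) are used by texts_to_sequences / texts_to_matrix; the rest are dropped.
tokenizer = Tokenizer(num_words=4)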
tokenizer.fit_on_texts(texts)
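print("1=", tokenizer.texts_to_sequences(texts))  # Map each word to its index via the fitted dictionary; one variable-length list per document
print("2=", tokenizer.texts_to_matrix(texts))  # Binary bag-of-words matrix (default mode='binary'), unlike the hashed indices from one_hot earlier; shape is (len(texts), num_words)
print("3=", tokenizer.word_counts)  # Total occurrences of each word across all documents; with num_words=4 only some, thing, to are kept
print("4=", tokenizer.word_docs)  # Number of documents each word appears in
print("5=", tokenizer.word_index)  # Word-to-index mapping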

print("6=", tokenizer.index_docs)  # Number of documents in which the word behind each index appears

#result

1= [[1, 2, 3], [1, 1, 2, 3], [2, 3]]
2= [[0. 1. 1. 1.]
 [0. 1. 1. 1.]
 [0. 0. 1. 1.]]
3= OrderedDict([('some', 3), ('thing', 3), ('to', 3), ('eat', 2), ('drink', 1), ('food', 1)])
4= defaultdict(<class 'int'>, {'eat': 2, 'thing': 3, 'some': 2, 'to': 3, 'drink': 1, 'food': 1})
5= {'some': 1, 'thing': 2, 'to': 3, 'eat': 4, 'drink': 5, 'food': 6}
6= defaultdict(<class 'int'>, {4: 2, 2: 3, 1: 2, 3: 3, 5: 1, 6: 1})
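
To see where these numbers come from, the statistics can be rebuilt with the standard library alone. The following is a sketch under stated assumptions rather than the Tokenizer's actual code: it assumes whitespace splitting is enough for these three texts, and it assumes that ties in frequency are broken by first appearance, which matches the `word_index` printed above.

```python
from collections import Counter

texts = ["some thing to eat", "some some thing to drink", "thing to eat food"]
num_words = 4

docs = [t.lower().split() for t in texts]

# Total occurrences of each word across all documents (cf. word_counts).
word_counts = Counter(w for doc in docs for w in doc)
# Number of documents containing each word (cf. word_docs).
word_docs = Counter(w for doc in docs for w in set(doc))

# Rank by descending frequency; sorted() is stable, so ties keep first-seen order.
ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
# Indices start at 1; index 0 is never assigned to a word.
word_index = {w: i + 1 for i, (w, _) in enumerate(ranked)}

# Keep only words whose index is below num_words (cf. texts_to_sequences).
sequences = [[word_index[w] for w in doc if word_index[w] < num_words] for doc in docs]

# One row per document, one column per index; 1.0 marks presence (binary mode).
matrix = []
for seq in sequences:
    row = [0.0] * num_words
    for idx in seq:
        row[idx] = 1.0
    matrix.append(row)

print(sequences)
print(matrix)
print(word_index)
```

Column 0 of the matrix is always zero because no word is ever assigned index 0, which is why `num_words=4` leaves room for only three words.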