Text preprocessing with the Keras Tokenizer


solumin


Category: Machine learning experiments

Article link: https://blog.csdn.net/solumin/article/details/100173183


from keras.preprocessing import text

# facts, accu_label, article_label, imprison_label = load_data()
somestr = ['ha ha gua angry', 'howa ha gua excited naive']

tok = text.Tokenizer()  # initialize the tokenizer
tok.fit_on_texts(somestr)  # build the vocabulary from the texts
word_index = tok.word_index  # dict mapping each word to its integer index
print(word_index)
sequences = tok.texts_to_sequences(somestr)  # replace every word of every string with its index from that dict
print(sequences)

{'naive': 6, 'ha': 1, 'excited': 5, 'angry': 3, 'gua': 2, 'howa': 4}
[[1, 1, 2, 3], [4, 1, 2, 5, 6]]
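The indices above follow word frequency: the most frequent word gets index 1, and words with equal counts keep the order in which they first appear. The following is a minimal pure-Python sketch of that ranking, not the Keras implementation. The helper names `build_word_index` and `to_sequences` are made up for illustration, and the sketch only lowercases and splits on whitespace, so it ignores the punctuation filtering that the real Tokenizer applies by default.

```python
from collections import Counter


def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, most frequent first."""
    counts = Counter()
    for t in texts:
        counts.update(t.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # Index 0 is left unused so it can serve as the padding value later.
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index):
    """Hypothetical helper: map each known word to its index, skip unknown words."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]


somestr = ['ha ha gua angry', 'howa ha gua excited naive']
word_index = build_word_index(somestr)
print(word_index)
print(to_sequences(somestr, word_index))
```

On the two sample strings this reproduces the same mapping and sequences as the output shown above.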

Pad the sequences to a fixed length

from keras.preprocessing import sequence

maxlen = 10
x = sequence.pad_sequences(sequences, maxlen=maxlen, dtype='int16')  # pad every text to the same fixed length
print(x)

[[0 0 0 0 0 0 1 1 2 3]
 [0 0 0 0 0 4 1 2 5 6]]
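The zeros appear on the left because `pad_sequences` pads at the front by default, and it also truncates from the front when a sequence is longer than `maxlen`. The sketch below imitates that default behaviour with plain lists. The function name `pad_pre` is hypothetical, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Hypothetical helper: left-pad (and left-truncate) each sequence to maxlen."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


sequences = [[1, 1, 2, 3], [4, 1, 2, 5, 6]]
for row in pad_pre(sequences, maxlen=10):
    print(row)

# A sequence longer than maxlen loses its leading items.
print(pad_pre([[4, 1, 2, 5, 6]], maxlen=3))
```

Padding on the left keeps the real tokens at the end of each row, which is often preferred for recurrent models because the last time steps then carry actual content.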

import numpy as np

lenofdata = len(x)
# Take the first 80% of the rows as the training set (no shuffling).
x_train = x[:int(lenofdata * 0.8)]
print(x_train)

[[0 0 0 0 0 0 1 1 2 3]]
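The split above always takes the leading rows. If the data is ordered (for example, grouped by label), shuffling the row indices before cutting may be preferable. The original does not say this was intended, so the following is only a sketch under that assumption. It uses the standard library rather than NumPy, and `shuffled_split`, `train_fraction` and `seed` are illustrative names.

```python
import random


def shuffled_split(rows, train_fraction=0.8, seed=0):
    """Hypothetical helper: shuffle row indices, then cut at train_fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = int(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


rows = [[0, 0, 0, 0, 0, 0, 1, 1, 2, 3],
        [0, 0, 0, 0, 0, 4, 1, 2, 5, 6]]
train_rows, test_rows = shuffled_split(rows)
print(len(train_rows), len(test_rows))  # 1 1
```

With only two rows, `int(2 * 0.8)` is 1, so one row lands in each part, matching the size of `x_train` above.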

np.vstack((x, x_train))

array([[0, 0, 0, 0, 0, 0, 1, 1, 2, 3],
[
