keras + tensorflow: text processing

五道口纳什

Article link: https://blog.csdn.net/lanchunhui/article/details/51247517

0. Obtaining the corpus

Download it from Amazon S3:

import keras

txtpath = keras.utils.get_file('nietzsche.txt', 
	origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with open(txtpath, encoding='utf-8') as f:
	text = f.read().lower()	# lowercase the whole corpus

1. One-hot encoding of text

from keras.preprocessing.text import Tokenizer

# Encode the following two lines of text
samples = ['The cat sat on the mat.', 'The dog ate my homework']

tokenizer = Tokenizer(num_words=1000)	# keep only the most frequent words (word indices below 1000)
tokenizer.fit_on_texts(samples)
sequences = tokenizer.texts_to_sequences(samples)
	# [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
	# 1 ⇒ the, 2 ⇒ cat, 3 ⇒ sat, 4 ⇒ on ...
tokenizer.word_index

	{'the': 1,
	 'cat': 2,
	 'sat': 3,
	 'on': 4,
	 'mat': 5,
	 'dog': 6,
	 'ate': 7,
	 'my': 8,
	 'homework': 9}
	
one_hot_mat = tokenizer.texts_to_matrix(samples, mode='binary')
# array([[0., 1., 1., ..., 0., 0., 0.],
#        [0., 1., 0., ..., 0., 0., 0.]])
one_hot_mat.shape
	# (2, 1000)
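
To make the binary mode concrete, here is a minimal pure-Python sketch of what the word index and the one-hot matrix amount to. It is not how Tokenizer is implemented: the helper names tokenize, build_word_index and to_binary_matrix are hypothetical, and the tokenization rule (lowercase, drop ASCII punctuation, split on whitespace) is a simplifying assumption.

```python
from collections import Counter
import string


def tokenize(text):
    # Assumption: lowercase, drop ASCII punctuation, split on whitespace.
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table).split()


def build_word_index(texts):
    # Rank words by frequency, most frequent first; ties keep first-seen order
    # because Counter preserves insertion order and sorted() is stable.
    counts = Counter()
    for t in texts:
        counts.update(tokenize(t))
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {w: i + 1 for i, w in enumerate(ranked)}  # index 0 is left unused


def to_binary_matrix(texts, word_index, num_words):
    # One row per text; column j is 1.0 if the word with index j occurs in it.
    rows = []
    for t in texts:
        row = [0.0] * num_words
        for w in tokenize(t):
            j = word_index.get(w)
            if j is not None and j < num_words:
                row[j] = 1.0
        rows.append(row)
    return rows


samples = ['The cat sat on the mat.', 'The dog ate my homework']
word_index = build_word_index(samples)
matrix = to_binary_matrix(samples, word_index, num_words=1000)
print(word_index['the'], word_index['homework'])  # 1 9
print(matrix[0][:6])  # [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Column 0 is never set because word indices start at 1, which is why both rows of the matrix begin with 0.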

2. Preparing the dataset

Using the imdb dataset bundled with keras as an example:

from keras.datasets import imdb

max_features = 1000		# keep only the 1000 most frequent words
max_len = 20			# maximum number of words per record

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
	# X_train, X_test: each is a 1-D array of lists rather than a 2-D array, because the records differ in length

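from keras.preprocessing import sequence
X_train = sequence.pad_sequences(X_train, maxlen=max_len)
X_test = sequence.pad_sequences(X_test, maxlen=max_len)
	# X_train and X_test are now 2-D arrays;
	# rows originally longer than 20 are truncated to their last 20 entries, and rows shorter than 20 are padded with 0 at the front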
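
The truncate-and-pad behaviour described above can be sketched in plain Python. This is only an illustration of the default "pre" padding and "pre" truncation, not the library's implementation; pad_pre is a hypothetical helper and it assumes maxlen is at least 1.

```python
def pad_pre(seqs, maxlen, value=0):
    # Assumes maxlen >= 1.
    # Long sequences keep their last `maxlen` items; short ones are left-padded.
    out = []
    for s in seqs:
        s = list(s)[-maxlen:]
        out.append([value] * (maxlen - len(s)) + s)
    return out


padded = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4)
print(padded)  # [[0, 1, 2, 3], [6, 7, 8, 9]]
```

Every row ends up with the same length, which is what allows the result to be stored as a 2-D array.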