Processing Text Data with Keras

格拉迪沃

Article link: https://blog.csdn.net/qq_32796253/article/details/88835699

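1 Processing text data
- One-hot encoding of words and characters
Using word embeddings
- Learning word embeddings with the Embedding layer
From raw text to word embeddings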

1 Processing text data

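Text is sequence data made up of characters or words, and getting a computer to read it so that it can be processed further is a key step. In essence, written language is a set of abstract symbols that carbon-based life forms such as humans find easy to understand. For a computer, a kind of embryonic silicon-based life form, vectors are probably the easiest form to work with, so the rest of this section covers how to vectorize text.
The units that text is broken into are called tokens, and the process of breaking text into tokens is called tokenization. There are three kinds of token: words, characters, and n-grams of words or characters. N-grams are rarely used in deep learning, so they are not discussed further here. There are two main ways to associate a vector with a token: one-hot encoding and token embedding (which includes word embedding).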
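
To make the three kinds of token concrete, here is a minimal pure-Python sketch. The helper names (word_tokens, char_tokens, ngrams) are hypothetical and chosen only for illustration, and splitting on whitespace is a simplifying assumption rather than a full tokenizer.

```python
def word_tokens(text):
    """Split on whitespace to get word-level tokens."""
    return text.split()


def char_tokens(text):
    """Treat every single character as a token."""
    return list(text)


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = 'The cat sat on the mat.'
words = word_tokens(sentence)
chars = char_tokens(sentence)
bigrams = ngrams(words, 2)  # word-level 2-grams

print(words)
print(chars[:7])
print(bigrams)
```

The same ngrams helper works on character tokens as well, for example ngrams(chars, 3) for character 3-grams.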

One-hot encoding of words and characters

Word-level one-hot encoding

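import numpy as np

# Two toy samples; each one becomes a matrix of one-hot rows.
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build the word index. Indices start at 1, so index 0 is left unassigned.
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

# Only the first max_length words of each sample are encoded.
max_length = 10
# Shape: (number of samples, max_length, largest index + 1).
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.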
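
The heading above also mentions characters. The same idea at the character level might look like the sketch below. It assumes the vocabulary is the set of printable ASCII characters (string.printable), and the names char_index and char_results are hypothetical.

```python
import string

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Assumption: the character vocabulary is all printable ASCII characters.
characters = string.printable
# Map each character to an index starting at 1; index 0 is left unassigned.
char_index = {ch: i + 1 for i, ch in enumerate(characters)}

# Only the first max_length characters of each sample are encoded.
max_length = 50
char_results = np.zeros((len(samples), max_length, len(characters) + 1))
for i, sample in enumerate(samples):
    for j, ch in enumerate(sample[:max_length]):
        char_results[i, j, char_index[ch]] = 1.

print(char_results.shape)
```

Each sample becomes a max_length by (vocabulary size + 1) matrix with a single 1 in every row that corresponds to an actual character. Rows past the end of the sample stay all zero.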

Word-level one-hot encoding with the Keras API

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# Create a tokenizer that only considers the 1,000 most common words
tokenizer = Tokenizer(num_words=1000)
# Build the word index
tokenizer.fit_on_texts(samples)

# Turn the strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)

# Get the one-hot binary representation directly
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# Recover the word index that was computed
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
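
To get a feel for what the integer sequences and the binary matrix contain, the sketch below approximates the same steps in plain Python and NumPy. It is not the Keras implementation: simple_tokenize is a hypothetical helper, and the punctuation handling and tie-breaking between equally frequent words are simplifying assumptions that may differ from Tokenizer in edge cases.

```python
import string
from collections import Counter

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
num_words = 1000


def simple_tokenize(text):
    """Lowercase, drop ASCII punctuation, split on whitespace (an approximation)."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()


# Count word frequencies over all samples.
counts = Counter(word for s in samples for word in simple_tokenize(s))
# The most frequent word gets index 1; index 0 is left unassigned.
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

# Each text becomes a list of integer indices, keeping indices below num_words.
sequences = [[word_index[w] for w in simple_tokenize(s) if word_index[w] < num_words]
             for s in samples]

# Binary matrix: entry (i, k) is 1 if the word with index k occurs in sample i.
matrix = np.zeros((len(samples), num_words))
for i, seq in enumerate(sequences):
    matrix[i, seq] = 1.

print(sequences)
print(matrix.shape)
```

Note that the binary matrix has one row per text rather than one row per word, so word order and repeated words are not preserved in it.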

Using word embeddings

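Is there an ideal word-embedding space that maps human language perfectly and could be used for every natural language processing task? Possibly, but nobody has found it yet. The reasonable approach is therefore to learn a new embedding space for each new task.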

Learning word embeddings with the Embedding layer

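The IMDB dataset ships with Keras. It has to be downloaded the first time it is imported, and after that it can be used directly.
The IMDB dataset contains 50,000 highly polarized reviews from the Internet Movie Database. It is split into 25,000 reviews for training and 25,000 reviews for testing, and both sets contain 50% positive and 50% negative reviews. The dataset has already been preprocessed: each review (a sequence of words) has been converted into a sequence of integers, where each integer stands for a specific word in a dictionary.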

from keras.layers import Embedding
from keras.datasets import imdb
from keras import preprocessing
from keras.models import Sequential
from keras.layers import Flatten
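
Conceptually, an Embedding layer can be thought of as a lookup table that maps integer word indices to dense vectors. The NumPy sketch below illustrates only that lookup step and is not the Keras implementation. The sizes vocab_size and embedding_dim, the example indices in batch, and the random initialization are all assumptions made for illustration. In Keras the table's values would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10000   # hypothetical number of distinct word indices
embedding_dim = 8    # hypothetical size of each word vector

# One row per word index; random here, learned by training in a real model.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 integer sequences of length 5 (made-up indices, 0 used as padding).
batch = np.array([[1, 14, 22, 16, 0],
                  [1, 194, 1153, 194, 0]])

# The lookup: every integer is replaced by its row of the embedding matrix.
embedded = embedding_matrix[batch]

print(embedded.shape)  # (2, 5, 8)
```

The input has shape (samples, sequence_length) and the output has shape (samples, sequence_length, embedding_dim). Identical indices always map to identical vectors.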
