Deep Learning with Python: Working with Text Data


CDFMLR


Category: Python Deep Learning. Tags: python, deep learning, machine learning, nlp

Article link: https://blog.csdn.net/u012419550/article/details/107936335


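This article covers how to process text data for deep learning in Python, including n-grams, the bag-of-words model, one-hot encoding, and word embeddings. It discusses pretrained word embeddings such as GloVe and shows an example of applying word embeddings to sentiment analysis on the IMDB dataset.

Summary generated automatically by CSDN.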

Deep Learning with Python

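This post is one of a series of notes I wrote while studying Deep Learning with Python (2nd edition, by François Chollet). The content was converted from Jupyter notebooks to Markdown; you can find the original .ipynb notebooks on GitHub or Gitee.

You can read the authorized English original of the book online, and the book's author also provides companion Jupyter notebooks.

These are my notes on Chapter 6, Deep learning for text and sequences.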

6.1 Working with text data

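To process text with a deep-learning neural network, the data must first be vectorized, just as with images: text -> numeric tensors.

This can be done by turning each word into a vector, each character into a vector, or groups of consecutive words or characters (called n-grams) into vectors.

However the text is split, the units it is broken into are called tokens, and the process of splitting is called tokenization.

Note: the usual Chinese translation of "token" is “标记” ("mark") 😂. The translation feels off: "token" can indeed mean a mark, but rendering it that way here loses the flavor. To me, a token is something that stands in for something else, and it is a tangible thing, like a voucher. As a noun, “标记” is defined in the dictionary as "a sign that serves to indicate something", and a sign does not strike me as very tangible. A voucher is not a sign, so it cannot be called a “标记”; likewise, the token here is a tangible thing and should not be called one either. Since I disagree with that translation, every occurrence below is written as "token" rather than “标记”.

Vectorizing text means first tokenizing it, then associating each resulting token with a numeric vector, and finally assembling those vectors into a tensor that represents the original text. The interesting part is how to associate tokens with vectors. Two ways of doing this are introduced below: one-hot encoding and token embedding. Token embedding is typically applied to words, in which case it is called word embedding.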

Figure: Text vectorization, from text to tokens to tensors
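The first two steps of that pipeline (text to tokens, tokens to numbers) can be sketched in a few lines of plain Python. This is only an illustration: the lowercasing, the stripping of the final period, and the names `word_tokens`, `char_tokens`, `vocab` and `indices` are assumptions made for this example, not something the book prescribes.

```python
text = 'The cat sat on the mat.'

# Word-level tokens: lowercase, drop the trailing period, split on whitespace.
word_tokens = text.lower().rstrip('.').split()

# Character-level tokens: every character is its own token.
char_tokens = list(text)

# Build a vocabulary: one unique integer per distinct token (0 is left unused).
vocab = {}
for token in word_tokens:
    if token not in vocab:
        vocab[token] = len(vocab) + 1

# Replace each token by its integer index.
indices = [vocab[token] for token in word_tokens]

print(word_tokens)      # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(indices)          # [1, 2, 3, 4, 1, 5]
print(char_tokens[:7])  # ['T', 'h', 'e', ' ', 'c', 'a', 't']
```

Turning these integer indices into vectors is the job of one-hot encoding or an embedding.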

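n-grams and bag-of-words

n-grams are the groups of at most N consecutive words that can be extracted from a sentence. Take the sentence "The cat sat on the mat." as an example.

Decomposed into 2-grams, this sentence becomes: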

{"The", "The cat", "cat", "cat sat", "sat",
  "sat on", "on", "on the", "the", "the mat", "mat"}

This set is called a bag-of-2-grams.

Decomposed into 3-grams, it becomes:

{"The", "The cat", "cat", "cat sat", "The cat sat",
  "sat", "sat on", "on", "cat sat on", "on the", "the",
  "sat on the", "the mat", "mat", "on the mat"}

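This set is called a bag-of-3-grams.

It is called a "bag" because it is just a set of tokens that preserves neither the word order nor the overall sense of the original text. The family of tokenization methods that split text into such bags is called bag-of-words.

Because a bag-of-words does not preserve order (the result is a set, not a sequence), it is generally not used in deep learning. In lightweight, shallow text-processing models, however, n-grams and bag-of-words remain important techniques.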
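The two bags above can be reproduced with a short function. This is a minimal sketch: the helper name `bag_of_ngrams` is made up for this example, and it assumes that words are separated by whitespace and that only the final period needs to be stripped.

```python
def bag_of_ngrams(sentence, n):
    """Return the set of all groups of 1..n consecutive words in the sentence."""
    words = sentence.rstrip('.').split()
    bag = set()
    for size in range(1, n + 1):
        # Slide a window of `size` words over the sentence.
        for start in range(len(words) - size + 1):
            bag.add(' '.join(words[start:start + size]))
    return bag

bag2 = bag_of_ngrams('The cat sat on the mat.', 2)
bag3 = bag_of_ngrams('The cat sat on the mat.', 3)

print(sorted(bag2))
print(len(bag2), len(bag3))  # 11 15
```

Because the result is a Python set, the order of the words in the sentence is lost, which is exactly the property that gives the bag its name.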

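One-hot encoding

One-hot encoding is a basic and widely used approach. Each token is associated with a unique integer index, and the integer index i is then converted into a binary vector of length N (N is the vocabulary size) whose i-th element is 1 and whose other elements are all 0.

Here are two toy versions of one-hot encoding: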

# Word-level one-hot encoding

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Assign a unique index to each word; index 0 is left unassigned.
# split() does not strip punctuation, so 'mat.' and 'homework.' are kept as-is.
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

# Tokenize the samples, considering only the first max_length words of each
max_length = 10

results = np.zeros(shape=(len(samples),
                          max_length,
                          max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.

print(results)

[[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

 [[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]

# Character-level one-hot encoding

import string

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

characters = string.printable    # All printable ASCII characters
# Map each character to a unique index starting at 1; index 0 is left unassigned
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.

print(results.shape)
print(results[0, :3].argmax(axis=1))    # Indices of 'T', 'h', 'e'

(2, 50, 101)
[56 18 15]

Keras ships with one-hot encoding utilities that are far more capable than these toy versions. In real-world use, prefer them over the toy versions above:

from tensorflow.keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

tokenizer = Tokenizer(num_words=1000)    # Only consider the 1,000 most common words
tokenizer.fit_on_texts(samples)

sequences = tokenizer.texts_to_sequences(samples)    # Turn the strings into lists of integer indices
print('sequences:', sequences)

one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')  # Get the binary one-hot representation directly

word_index = tokenizer.word_index    # The word index, i.e. the vocabulary dict; it can be used to decode the data

print(f'one_hot_results: shape={one_hot_results.shape}:\n', one_hot_results, )
print(f'Found {len(word_index)} unique tokens.', 'word_index:', word_index)

sequences: [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
one_hot_results: shape=(2, 1000):
 [[0. 1. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]
Found 9 unique tokens. word_index: {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}
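The word index can be inverted to turn integer sequences back into words. The sketch below does not call Keras: it copies the `word_index` and `sequences` values from the output above as plain literals, and the names `index_word` and `decoded` are chosen only for this example.

```python
# Values copied from the Tokenizer output shown above.
word_index = {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5,
              'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}
sequences = [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]

# Invert the mapping: integer index -> word.
index_word = {index: word for word, index in word_index.items()}

# Look up every index and join the words back into a sentence.
decoded = [' '.join(index_word[i] for i in seq) for seq in sequences]
print(decoded)  # ['the cat sat on the mat', 'the dog ate my homework']
```

The decoded text is lowercase and has no punctuation, because the tokenizer discarded both before it built the index.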

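A simple variant of one-hot encoding is the one-hot hashing trick. Instead of associating each token with a unique integer index, it uses a hash function to map tokens directly into vectors of a fixed size.

This saves the memory needed to maintain a word index and allows online encoding (tokens are encoded as they arrive, without affecting those encoded before or after). It also has drawbacks: hash collisions can occur, and the encoded data cannot be decoded back.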

# Word-level one-hot encoding with the hashing trick
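What follows is a minimal sketch of the hashing trick described above, not the article's own listing. It assumes an MD5 digest from `hashlib` as the hash function, chosen here only because it gives the same result on every run, whereas Python's built-in `hash()` is salted per process for strings. The names `hashed_one_hot`, `dimensionality` and `max_length` are illustrative, and plain lists are used instead of NumPy arrays to keep the example dependency-free.

```python
import hashlib

def hashed_one_hot(text, dimensionality=1000, max_length=10):
    """Encode the first max_length words of text as one-hot rows via hashing."""
    rows = [[0.0] * dimensionality for _ in range(max_length)]
    for j, word in enumerate(text.split()[:max_length]):
        # Hash the word to a stable integer, then fold it into the vector size.
        digest = hashlib.md5(word.encode('utf-8')).hexdigest()
        index = int(digest, 16) % dimensionality
        rows[j][index] = 1.0
    return rows

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
encoded = [hashed_one_hot(sample) for sample in samples]

print(len(encoded), len(encoded[0]), len(encoded[0][0]))  # 2 10 1000
```

No word index is stored anywhere, which is where the memory saving comes from. The price is that two different words may hash to the same position; a `dimensionality` much larger than the number of distinct tokens makes such collisions less likely.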
