Chapter 6: Deep Learning for Text and Sequences

Category: Python深度学习

Article link: https://blog.csdn.net/weixin_44818787/article/details/106266403

1.1. One-hot encoding of words and characters

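The code below gives four toy examples: word-level one-hot encoding, character-level one-hot encoding, word-level one-hot encoding with the Keras Tokenizer, and word-level one-hot encoding with the hashing trick.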

"""
1.1. One-hot encoding of words and characters
"""

# In[1]: Word-level one-hot encoding (toy example)
import numpy as np

# Initial data: a list of 2 samples, one entry per sample
# (here a sample is a sentence, but it could be a whole document)
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
print(len(samples))

token_index = {}                     # index of every token in the data (a dict)
for sample in samples:
    for word in sample.split():      # tokenize with split(); real code would also strip punctuation and special characters
        if word not in token_index:  # add the word if it has not been seen yet
            token_index[word] = len(token_index) + 1    # give each word a unique index; index 0 is deliberately left unassigned
    print(token_index)               # show the index as it grows after each sample

max_length = 10             # only consider the first max_length words of each sample

# Store the result in results: a 3D array of zeros with shape (2, 10, 11)
results = np.zeros(shape=(len(samples),
                          max_length,
                          max(token_index.values()) + 1))
print(results.shape)

# enumerate() yields (position, item) pairs while iterating
for i, sample in enumerate(samples):
    print(i, sample)
    for j, word in list(enumerate(sample.split()))[:max_length]:
        print(j, word)
        index = token_index.get(word)
        results[i, j, index] = 1.

print(results)                          # the tokens are now encoded as a tensor


# In[2]: Character-level one-hot encoding (toy example)
import string

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable           # all printable ASCII characters

token_index = dict(zip(characters, range(1, len(characters) + 1)))   # build the dict with zip and dict

max_length = 50                # only consider the first max_length characters of each sample

# Store the result in results: a 3D array of zeros with shape (2, 50, 101)
results = np.zeros((len(samples),
                    max_length,
                    max(token_index.values()) + 1))
print(results.shape)

for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):   # iterate over characters, not over the words returned by split()
        index = token_index.get(character)
        results[i, j, index] = 1.



# In[3]: Word-level one-hot encoding with Keras (toy example, using the Tokenizer)
from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

tokenizer = Tokenizer(num_words=1000)   # create a tokenizer that only considers the 1000 most common words
tokenizer.fit_on_texts(samples)         # build the word index

sequences = tokenizer.texts_to_sequences(samples)     # turn the strings into lists of integer indices
print(sequences)

# The one-hot binary representation is also available directly. The tokenizer supports vectorization modes other than one-hot as well
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

word_index = tokenizer.word_index     # recover the word index that was computed
print('Found %s unique tokens.' % len(word_index))      # 9 unique tokens (well under the 1000-word limit)



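# In[4]: Word-level one-hot encoding with the hashing trick (toy example)
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Store words as vectors of size 1000. If the number of words is close to 1000 (or more), there will be many hash collisions, which reduces the accuracy of this encoding
dimensionality = 1000   # size of the hashed vector
max_length = 10

results = np.zeros((len(samples),             # 3D array of zeros with shape (2, 10, 1000)
                    max_length,
                    dimensionality))
print(results.shape)

for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into an integer index in the range 0 to 999.
        # Python salts str hashes per process, so these indices change between runs unless PYTHONHASHSEED is fixed
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
print(results)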
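
Python's built-in hash() is salted per process for strings, so the indices produced by the hashing-trick code are not reproducible from one run to the next. The sketch below shows one possible workaround and is not part of the original example: it swaps in an MD5 digest from hashlib as a stable hash. The helper name stable_index is hypothetical, and plain Python lists stand in for the NumPy array so the snippet has no dependencies.

```python
import hashlib

def stable_index(word, dimensionality=1000):
    """Map a word to an index in [0, dimensionality) that is identical in every run.

    Hypothetical helper: MD5 is used only as a stable hash here, not for security.
    """
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % dimensionality

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
dimensionality = 1000
max_length = 10

# hashed[i][j] is the position of the single 1 in the one-hot row for word j of sample i
hashed = [[stable_index(word, dimensionality) for word in sample.split()[:max_length]]
          for sample in samples]
print(hashed)
```

Identical words always land on the same index (both sentences start with 'The'), while distinct words can still collide. That trade-off is inherent to the hashing trick, whichever hash function is used.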

Output of the code:
Part 1: [image not available]
Part 2: (2, 50, 101)
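
One way to sanity-check the character-level encoding is to decode it back into text. The sketch below is an illustration under a few assumptions: it uses nested Python lists instead of a NumPy array, and the reverse lookup index_token is a hypothetical name that does not appear in the original code.

```python
import string

characters = string.printable                                       # 100 printable ASCII characters
token_index = {ch: i for i, ch in enumerate(characters, start=1)}   # index 0 stays unused
index_token = {i: ch for ch, i in token_index.items()}              # reverse lookup (hypothetical name)

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
max_length = 50
width = max(token_index.values()) + 1                               # 101

# One-hot encode: results[i][j] is a row of `width` zeros with at most a single 1
results = [[[0.0] * width for _ in range(max_length)] for _ in samples]
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        results[i][j][token_index[character]] = 1.0

shape = (len(results), len(results[0]), len(results[0][0]))
print(shape)

# Decode: find the position of the 1 in each non-empty row and map it back to a character
decoded = [''.join(index_token[row.index(1.0)] for row in sample_rows if 1.0 in row)
           for sample_rows in results]
print(decoded)
```

Because both sentences are shorter than max_length, decoding recovers them exactly; rows past the end of a sentence stay all zeros and are skipped.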
Part 3:
[[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
Found 9 unique tokens.
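
These sequences are consistent with how the Tokenizer is generally understood to behave by default: text is lowercased, punctuation is filtered out, and words are numbered from 1 in order of decreasing frequency. The sketch below imitates that in plain Python, without Keras. It is an approximation for illustration only, not the library's implementation; the function names tokenize and build_word_index, the use of string.punctuation as the filter set, and the tie-breaking by first appearance are all assumptions.

```python
import string
from collections import Counter

def tokenize(text):
    """Lowercase, replace ASCII punctuation with spaces, then split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in string.punctuation})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Number words from 1 by decreasing frequency (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # Counter keeps insertion order and sorted() is stable,
    # so words with equal counts keep their first-appearance order
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
word_index = build_word_index(samples)
sequences = [[word_index[w] for w in tokenize(s)] for s in samples]
print(sequences)
print('Found %s unique tokens.' % len(word_index))
```

Here 'the' occurs three times and therefore gets index 1; every other word occurs once and is numbered in order of first appearance.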
Part 4:
