import numpy as np
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
token_index = {}
for sample in samples:
    # split() works for whitespace-delimited text; Chinese text would need word segmentation first
    for word in sample.split():
        if word not in token_index:
            # assign each unique word a unique index, starting from 1
            token_index[word] = len(token_index) + 1
print(token_index)
# We will only consider the first `max_length` words in each sample.
max_length = 10
# Store the encoded results in a tensor:
# initialize an all-zero tensor of shape (number of samples, max_length, max index + 1),
# then set the corresponding positions to 1 according to token_index.
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # dict.get returns the value stored for the given key
        index = token_index.get(word)
        results[i, j, index] = 1.
print(results)
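To sanity-check the encoding, the tensor can be decoded back into tokens by inverting token_index. A minimal sketch, assuming the results and token_index variables from the listing above (reverse_index is an illustrative name, not part of any library):

reverse_index = {index: word for word, index in token_index.items()}

for i in range(results.shape[0]):
    decoded = []
    for j in range(results.shape[1]):
        # argmax returns the position of the 1 in each one-hot vector;
        # all-zero padding rows yield 0, which is not a valid word index
        index = int(results[i, j].argmax())
        if index in reverse_index:
            decoded.append(reverse_index[index])
    print(' '.join(decoded))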
When the vocabulary grows too large, token_index becomes unwieldy. The hashing trick avoids storing an explicit index: each word's position is determined by hashing it into a fixed number of dimensions.
# size of the hash space; 1000 is used here for illustration
dimensionality = 1000
results = np.zeros((len(samples), max_length, dimensionality))
print(results.shape)
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # map the word to a pseudo-random integer index in [0, dimensionality)
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
print(results)
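The cost of the hashing trick is hash collisions: as the vocabulary size approaches the dimensionality, distinct words start mapping to the same index and become indistinguishable. A minimal sketch for measuring the collision rate over a word list (here the vocabulary is just the words from samples; in practice you would scan your real corpus):

vocabulary = set(word for sample in samples for word in sample.split())
seen = {}
collisions = 0
for word in vocabulary:
    index = abs(hash(word)) % dimensionality
    # a collision means a different word already claimed this index
    if index in seen:
        collisions += 1
    seen[index] = word
print('%d collisions among %d words' % (collisions, len(vocabulary)))

Note that Python's built-in hash for strings is randomized per interpreter process unless PYTHONHASHSEED is set, so the indices are not stable across runs; a deterministic hash (e.g. from hashlib) would be needed if the encoding must be reproducible.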
Keras's built-in one-hot utilities:
from keras.preprocessing.text import Tokenizer
import numpy as np
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# build a tokenizer that keeps only the 20 most frequent words
tokenizer = Tokenizer(num_words=20)
# build the word index from the samples
tokenizer.fit_on_texts(samples)
# turn each sample into a list of integer word indices
sequences = tokenizer.texts_to_sequences(samples)
# get the binary one-hot matrix representation directly
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
print(one_hot_results[0])
print(one_hot_results[1])
word_index = tokenizer.word_index
print(word_index)
print('Found %s unique tokens.' % len(word_index))
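The sequences list shows what the tokenizer actually kept. A minimal sketch decoding each sequence back to words by inverting word_index (reverse_word_index is an illustrative name, not a Keras attribute):

reverse_word_index = {index: word for word, index in word_index.items()}

for sequence in sequences:
    # the Tokenizer lowercases text and strips punctuation,
    # so the decoded words will not match the original casing
    print(' '.join(reverse_word_index[index] for index in sequence))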