One-hot encoding

飞向Hadoop

Category: NLP

Article link: https://blog.csdn.net/hunannanhu/article/details/105528955

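Introduction to one-hot:

One-hot encoding is also called one-bit-effective encoding. It uses an N-bit state register to encode N states: each state has its own register bit, and at any given time exactly one of those bits is set.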
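As a minimal sketch of this definition, each of N states can be mapped to an N-bit vector with exactly one bit set. The helper name `one_hot_state` is hypothetical and introduced only for illustration:

```python
from typing import List


def one_hot_state(state: int, num_states: int) -> List[int]:
    """Return an N-bit code in which only the bit for `state` is set."""
    if not 0 <= state < num_states:
        raise ValueError("state must be in [0, num_states)")
    return [1 if bit == state else 0 for bit in range(num_states)]


# Encode each of 4 states; every code has exactly one bit set.
codes = [one_hot_state(s, 4) for s in range(4)]
for s, code in enumerate(codes):
    print(s, code)
```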

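Applications of one-hot:

In feature extraction, one-hot encoding belongs to the bag-of-words model. For example, suppose our corpus contains this passage: 我毕业于湖南工业大学我就职于长沙代码研究所 ("I graduated from Hunan University of Technology; I work at the Changsha Code Research Institute"). Features are extracted from the passage as follows: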

1. First, segment the corpus into words, collect all distinct words, and assign each word an index:

'我': 1, '毕业': 2, '于': 3, '湖南工业大学': 4, '就职': 5, '长沙代码研究所': 6

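2. Then use one-hot encoding to extract a feature vector for each sentence.

3. Finally, obtain the feature vectors.

Below is an implementation of one-hot encoding: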
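To make steps 2 and 3 concrete, here is a small sketch that applies the word index from step 1 to the passage. It assumes the passage has already been segmented into two space-separated sentences, and the names `sentences` and `vectors` are chosen here for illustration:

```python
import numpy as np

# Word index taken from step 1 (index 0 is left unused).
token_index = {'我': 1, '毕业': 2, '于': 3, '湖南工业大学': 4,
               '就职': 5, '长沙代码研究所': 6}

# Assumption: the passage is pre-segmented into two sentences.
sentences = ['我 毕业 于 湖南工业大学', '我 就职 于 长沙代码研究所']

# One row per sentence, one column per word index.
vectors = np.zeros((len(sentences), len(token_index) + 1), dtype=int)
for row, sentence in enumerate(sentences):
    for word in sentence.split():
        vectors[row, token_index[word]] = 1

print(vectors)
# [[0 1 1 1 1 0 0]
#  [0 1 0 1 0 1 1]]
```

Each row marks which words occur in the sentence, regardless of their order, which is why this representation counts as bag of words.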

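import numpy as np

# Each sample is a sentence already segmented into space-separated words.
samples = ['我 毕业 于 湖南工业大学', '我 就职 于 长沙代码研究所']

# Build the word index; numbering starts at 1, so index 0 is left unused.
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

print(len(token_index))

print(token_index)

# Word-level encoding: one one-hot vector per word position in each sample.
# The second axis is the length of the longest sample, not the vocabulary size.
max_length = max(len(sample.split()) for sample in samples)
results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))
print(results.shape)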

for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()):
        index = token_index.get(word)
        print(j, index, word)
        # Set the bit for this word at position j of sample i.
        results[i, j, index] = 1


print("\n\nPrinting results\n\n")
print(results)

# Sentence-level encoding: one multi-hot (bag-of-words) vector per sample.
results2 = np.zeros(shape=(len(samples), max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    for word in sample.split():
        index = token_index.get(word)
        results2[i, index] = 1

print("\n\nPrinting results2\n\n")
print(results2)

Screenshot of the output (image not included here):
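As an alternative to filling the array in a loop, word-level one-hot vectors can also be obtained by indexing into an identity matrix. The sketch below is not the code from this post; it reuses the same word index, and the names `index_token`, `one_hot`, and `decoded` are introduced here for illustration. It also shows that `argmax` recovers the original words:

```python
import numpy as np

# Same word index as the loop-based version builds for the two samples.
token_index = {'我': 1, '毕业': 2, '于': 3, '湖南工业大学': 4,
               '就职': 5, '长沙代码研究所': 6}
index_token = {index: word for word, index in token_index.items()}

words = '我 就职 于 长沙代码研究所'.split()
indices = np.array([token_index[w] for w in words])

# Row k of the identity matrix is the one-hot vector for index k.
one_hot = np.eye(len(token_index) + 1, dtype=int)[indices]
print(one_hot.shape)  # (4, 7)

# argmax finds the position of the single set bit in each row.
decoded = [index_token[int(k)] for k in one_hot.argmax(axis=1)]
print(decoded)
```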

Implementation with Keras:

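# Note: this import path comes from older Keras releases and may not be
# available in recent ones.
from keras.preprocessing.text import Tokenizer

samples = ['我 毕业 于 湖南工业大学', '我 就职 于 长沙代码研究所']

# Build the word index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(samples)

word_index = tokenizer.word_index
print(word_index)
print(len(word_index))

# Convert each sample into a list of integer word indices
sequences = tokenizer.texts_to_sequences(samples)
print(sequences)

# Encode each sample as a binary (multi-hot) vector; 'binary' is the default mode
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
print(one_hot_results)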

Screenshot of the output (image not included here):
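One thing to watch for: as far as I know, the Keras `Tokenizer` numbers words by descending frequency rather than by first appearance, so its `word_index` can differ from the hand-built `token_index`. The following dependency-free sketch imitates that ordering with `collections.Counter`. It is a stand-in written under that assumption, not Keras's own code:

```python
from collections import Counter

samples = ['我 毕业 于 湖南工业大学', '我 就职 于 长沙代码研究所']

# Count word occurrences across all samples.
counts = Counter(word for sample in samples for word in sample.split())

# Assumption: indices start at 1 and follow descending frequency;
# words with equal counts keep the order in which they were first seen.
word_index = {word: i
              for i, (word, _) in enumerate(counts.most_common(), start=1)}
print(word_index)

# Replace each word by its index, sample by sample.
sequences = [[word_index[w] for w in sample.split()] for sample in samples]
print(sequences)
```

Under this ordering, 我 and 于 (which each appear twice) get indices 1 and 2, whereas the hand-built index gives 于 the index 3.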