Implementing simple one-hot encoding for text


Einstellung


Category: Deep Learning

Article link: https://blog.csdn.net/Einstellung/article/details/82865224


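One-hot encoding is the most common and most basic way to turn tokens into vectors. Each token is assigned a unique integer index, and that index is then converted into a binary vector that is all zeros except for a single 1 at the token's position. The sections below cover character-level and word-level one-hot encoding in turn.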
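To make the idea concrete before the full listings, here is a minimal sketch of the token-to-vector mapping. The three-token vocabulary and the `one_hot` helper are hypothetical names chosen only for illustration; they are not part of the article's own code, which builds its index from sample sentences instead.

```python
import numpy as np

# Hypothetical three-token vocabulary, chosen only for illustration.
vocab = ['cat', 'dog', 'mat']
index_of = {token: i for i, token in enumerate(vocab)}

def one_hot(token):
    """Return a vector of zeros with a single 1 at the token's index."""
    vector = np.zeros(len(vocab))
    vector[index_of[token]] = 1.0
    return vector

print(one_hot('dog'))  # [0. 1. 0.]
```

The vector length equals the vocabulary size, so this representation grows with the number of distinct tokens.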

Word-level one-hot encoding

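import numpy as np

# Initial data: one entry per sample. Here each sample is a sentence,
# but it could just as well be an entire document.
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build an index of all tokens in the data.
token_index = {}
for sample in samples:
    # Tokenize each sample with split(). In real applications you would
    # probably also need to handle punctuation.
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word.
            # Note that index 0 is not assigned to any word.
            token_index[word] = len(token_index) + 1

# Vectorize the samples, considering only the first max_length words of each.
max_length = 10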

results