One-Hot Encoding

__Watson__

Original article: https://blog.csdn.net/a595130080/article/details/64442800

I. Introduction

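In data processing and feature engineering we often run into categorical data: gender takes values in [male, female] (setting other options aside for now), mobile carrier takes values in [China Mobile, China Unicom, China Telecom], and so on. We usually convert such values to numbers before feeding them to a model, e.g. [0, 1] or [-1, 0, 1]. But models generally treat numeric inputs as continuous, which implies an ordering and a distance between categories that we never intended, and this can hurt model performance.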

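One-hot encoding addresses this problem. It uses an N-bit state register to encode N states: each state has its own register bit, and at any time exactly one bit is active.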

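For example, with two states the natural encoding is: 0, 1

The one-hot encoding is: 10, 01

Put another way, a feature with m possible values becomes m binary features after one-hot encoding, and exactly one of them is active for any given sample.

For example, in digit recognition over 0 to 9, the one-hot encoding of 6 is:

0000001000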
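
As a minimal sketch of the idea above (the helper name `one_hot` is made up for illustration and does not come from any library), a one-hot code can be built by setting exactly one of m positions:

```python
def one_hot(index, num_states):
    """Return a list of num_states bits with only position `index` set to 1."""
    if not 0 <= index < num_states:
        raise ValueError("index must be in [0, num_states)")
    bits = [0] * num_states
    bits[index] = 1
    return bits


# Two states: the natural codes 0 and 1 become 10 and 01
two_state_codes = ["".join(map(str, one_hot(i, 2))) for i in range(2)]

# Digit 6 out of the ten digits 0 to 9
digit_six = "".join(map(str, one_hot(6, 10)))

print(two_state_codes)
print(digit_six)
```

Position 0 is taken to be the leftmost bit here, which matches the examples in the text.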

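II. Advantages

One-hot encoding has the following advantages:

1. It can handle non-continuous (categorical) features.
2. To some extent it also expands the feature set. For example, gender is a single feature, but after one-hot encoding it becomes two features, male and female.

Of course, when a feature has many categories, the one-hot encoded data can become very sparse.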
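
To make the sparsity remark concrete: each encoded row holds exactly one 1 among m entries, so the fraction of zeros is (m - 1) / m. The sketch below is only an illustration; the function name `zero_fraction` and the sample category counts are assumptions, not part of the original article.

```python
def zero_fraction(num_categories):
    """Fraction of zero entries in a one-hot row with num_categories columns."""
    return (num_categories - 1) / num_categories


# Gender: one feature with 2 values becomes 2 binary columns
genders = ["male", "female"]
encoded = {g: [1 if g == other else 0 for other in genders] for g in genders}
print(encoded)

# The more categories a feature has, the sparser each encoded row becomes
for m in (2, 10, 1000):
    print(m, zero_fraction(m))
```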

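III. Implementation

We can implement one-hot encoding ourselves to fit the problem at hand, for example for 0 to 9 digit recognition:

import numpy as np

# Convert labels to one-hot encoding, e.g. [2] -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
# Digit 0 is represented by label 10 in this data, so
# [10] -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# `labels` is assumed to be loaded already, each entry a 1-element sequence such as [2]
labels = np.array([x[0] for x in labels])
one_hot_labels = []
for num in labels:
    one_hot = [0.0] * 10
    if num == 10:
        one_hot[0] = 1.0
    else:
        one_hot[num] = 1.0
    one_hot_labels.append(one_hot)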
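
The same mapping can also be written without an explicit loop. This is a sketch, assuming NumPy is available and that labels are integers where 10 stands for digit 0, as in the comments above; the sample values in `raw_labels` are made up.

```python
import numpy as np

# Made-up sample in the same layout as above: each label wrapped in a list
raw_labels = [[2], [10], [7]]

labels = np.array([x[0] for x in raw_labels])
# Label 10 maps to column 0; labels 1..9 keep their own column
columns = labels % 10
# Row k of the 10x10 identity matrix is the one-hot code for column k
one_hot_labels = np.eye(10)[columns]

print(one_hot_labels)
```

Indexing the identity matrix by an integer array selects one row per label, so the result has one one-hot row for each input label.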

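scikit-learn also provides a ready-made encoder that can be called directly:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# Note: n_values_ and feature_indices_ exist only in older scikit-learn
# releases (deprecated in 0.20 and removed in later releases);
# newer versions expose enc.categories_ instead.
print("enc.n_values_ is:", enc.n_values_)
print("enc.feature_indices_ is:", enc.feature_indices_)
print(enc.transform([[0, 1, 1]]).toarray())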

Output:

enc.n_values_ is: [2 3 4]
enc.feature_indices_ is: [0 2 5 9]
[[ 1. 0. 0. 1. 0. 0. 1. 0. 0.]]

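The samples here have three dimensions (features): the first takes 2 values, the second 3 values, and the third 4 values, hence enc.n_values_ = [2 3 4]. enc.feature_indices_ is the running total of these counts, i.e. the column offset at which each feature's block starts. So [0, 1, 1] is encoded as:

[[ 1. 0. 0. 1. 0. 0. 1. 0. 0.]]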
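
The relationship between the per-feature value counts, the cumulative offsets, and the encoded row can be reproduced by hand. The following is a plain-Python sketch of that arithmetic, not scikit-learn's implementation; it assumes each feature takes the values 0..max as in the example, and the function name `encode` is illustrative.

```python
from itertools import accumulate

samples = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]

# Number of distinct values per column (assumes values run from 0 to the max)
n_values = [max(col) + 1 for col in zip(*samples)]
# Running total with a leading 0: the start column of each feature's block
feature_indices = [0] + list(accumulate(n_values))


def encode(row):
    """Concatenate the one-hot codes of all features in `row`."""
    out = [0.0] * feature_indices[-1]
    for start, value in zip(feature_indices, row):
        out[start + value] = 1.0
    return out


print(n_values)
print(feature_indices)
print(encode([0, 1, 1]))
```

For [0, 1, 1] the active columns are 0 + 0, 2 + 1 and 5 + 1, which gives the row shown above.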

1. What to_categorical does

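Simply put, to_categorical converts a class vector into a binary (0 and 1 only) matrix representation, i.e. it turns the original class vector into one-hot form. Let's look at some code first:

# Older Keras versions also exposed this function via keras.utils.np_utils
from keras.utils import to_categorical

# Define the class vector
b = [0, 1, 2, 3, 4, 5, 6, 7, 8]
# Call to_categorical to convert b using 9 classes
b = to_categorical(b, 9)
print(b)
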
The output is:

[[1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]]

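to_categorical is a utility function provided by Keras. As the output above shows, each value in the original class vector becomes one row vector of the matrix, whose columns correspond from left to right to the 9 classes 0, 1, 2, ..., 8. For example, 2 is represented as [0. 0. 1. 0. 0. 0. 0. 0. 0.]: only the 3rd position is 1 (the active bit) and all the others are 0.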
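
Because each row has a single active position, the original class value can be read back as the index of that position. A small plain-Python sketch of this reverse step (the helper name `decode_one_hot` and the sample rows are illustrative and not part of Keras):

```python
def decode_one_hot(rows):
    """Recover class indices from one-hot rows (each row must hold exactly one 1)."""
    labels = []
    for row in rows:
        values = list(row)
        if any(v not in (0, 1) for v in values) or values.count(1) != 1:
            raise ValueError("not a one-hot row: %r" % (values,))
        labels.append(values.index(1))
    return labels


matrix = [
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # class 2
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # class 0
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # class 8
]
print(decode_one_hot(matrix))
```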

2. Introduction to one-hot encoding

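One-hot encoding is also known as one-bit-effective encoding. The code example above is in fact converting a class vector into a one-hot class matrix: each class value k becomes a row with a 1 in column k and 0 everywhere else.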

So here is an exercise: if you had to implement the conversion from a class vector to one-hot encoding yourself, how would you do it?

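Below is a small, rough example I wrote myself, for reference only:

import numpy as np

def convert_to_one_hot(labels, num_classes):
    # Number of labels, i.e. the number of rows in the result
    num_labels = len(labels)
    # Create the one-hot matrix, initialized to all zeros
    labels_one_hot = np.zeros((num_labels, num_classes))
    # Offset of the start of each row in the flattened matrix
    index_offset = np.arange(num_labels) * num_classes
    # Set the flattened position (row offset + class value) of each label to 1
    labels_one_hot.flat[index_offset + labels] = 1
    return labels_one_hot

# Test it
b = [2, 4, 6, 8, 6, 2, 3, 7]
print(convert_to_one_hot(b, 9))
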
Test output:

[[0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0.]]
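
The `.flat` indexing in convert_to_one_hot relies on row-major layout: element (row, col) of a matrix with num_classes columns sits at position row * num_classes + col once the matrix is flattened. The sketch below redoes the same arithmetic with plain lists so each step is visible; it uses no NumPy, and the function name `one_hot_via_flat_index` is illustrative.

```python
def one_hot_via_flat_index(labels, num_classes):
    """Build one-hot rows by writing into a flattened buffer, then re-splitting it."""
    num_labels = len(labels)
    flat = [0.0] * (num_labels * num_classes)
    for row, label in enumerate(labels):
        # Start of this row in the flat buffer, plus the class column
        flat[row * num_classes + label] = 1.0
    # Cut the flat buffer back into rows of num_classes entries
    return [flat[r * num_classes:(r + 1) * num_classes] for r in range(num_labels)]


b = [2, 4, 6, 8, 6, 2, 3, 7]
flat_positions = [row * 9 + label for row, label in enumerate(b)]
print(flat_positions)
print(one_hot_via_flat_index(b, 9)[0])
```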

3. Source code walkthrough

to_categorical lives in utils/np_utils.py in Keras. The source is as follows:

import numpy as np  # imported at module level in np_utils.py

def to_categorical(y, num_classes=None, dtype='float32'):
    """Converts a class vector (integers) to binary class matrix.

    E.g. for use with categorical_crossentropy.

    # Arguments
        y: class vector to be converted into a matrix
            (integers from 0 to num_classes).
        num_classes: total number of classes.
        dtype: The data type expected by the input, as a string
            (`float32`, `float64`, `int32`...)

    # Returns
        A binary matrix representation of the input. The classes axis
        is placed last.

    # Example

    ```python
    # Consider an array of 5 labels out of a set of 3 classes {0, 1, 2}:
    > labels
    array([0, 2, 1, 2, 0])
    # `to_categorical` converts this into a matrix with as many
    # columns as there are classes. The number of rows
    # stays the same.
    > to_categorical(labels)
    array([[ 1.,  0.,  0.],
           [ 0.,  0.,  1.],
           [ 0.,  1.,  0.],
           [ 0.,  0.,  1.],
           [ 1.,  0.,  0.]], dtype=float32)
    ```
    """
    # Convert the input vector y to an integer array
    y = np.array(y, dtype='int')
    # Record the shape of the array
    input_shape = y.shape
    # Drop a trailing axis of length 1, e.g. (n, 1) -> (n,)
    if input_shape and input_shape[-1] == 1 and len(input_shape) > 1:
        input_shape = tuple(input_shape[:-1])
    # Flatten y to a 1-D array
    y = y.ravel()
    # If the caller did not pass the number of classes, infer it
    if not num_classes:
        num_classes = np.max(y) + 1
    n = y.shape[0]
    # Create an all-zero matrix with n rows and num_classes columns
    categorical = np.zeros((n, num_classes), dtype=dtype)
    # np.arange(n) gives the row index of each label, y gives its column index
    categorical[np.arange(n), y] = 1
    # Reshape back to the input shape with the classes axis appended
    output_shape = input_shape + (num_classes,)
    categorical = np.reshape(categorical, output_shape)
    return categorical
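
Two details in the source are easy to miss: how the output shape is derived from the input shape, and how num_classes is inferred when it is not given. The sketch below isolates just those two rules in plain Python. The helper names `categorical_output_shape` and `infer_num_classes` are made up for illustration and are not Keras functions.

```python
def categorical_output_shape(input_shape, num_classes):
    """Mirror the shape handling in the quoted source (illustrative helper)."""
    input_shape = tuple(input_shape)
    # A trailing axis of length 1 is dropped, but only for inputs with 2+ axes
    if input_shape and input_shape[-1] == 1 and len(input_shape) > 1:
        input_shape = input_shape[:-1]
    # The classes axis is appended last
    return input_shape + (num_classes,)


def infer_num_classes(labels):
    """Same rule the source uses when num_classes is not given: max label + 1."""
    return max(labels) + 1


print(categorical_output_shape((5,), 3))
print(categorical_output_shape((5, 1), 3))
print(categorical_output_shape((2, 4), 3))
print(infer_num_classes([0, 2, 1, 2, 0]))
print(infer_num_classes([0, 1, 1, 0]))
```

The last call shows why passing num_classes explicitly can matter: if the labels at hand happen not to include the highest class, the inferred count (2 here) is smaller than the real number of classes and the resulting matrix would be too narrow.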

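After reading the source, I do feel that my own code still needs work. For APIs in a framework, we can first think about how we would write them ourselves and then compare our version against the source. It is a good way to learn.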
