Several ways to convert a categorical variable (class_label) to One_Hot in Python

fitzgerald0

Category: Machine Learning

Article link: https://blog.csdn.net/fitzgerald0/article/details/79200902

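First, a note on the difference between one_hot encoding and dummy variables: one_hot encoding creates one 0/1 column per category (k columns for k categories), while dummy-variable coding usually keeps only k-1 columns and treats the dropped category as the baseline.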
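To make the distinction concrete, here is a minimal sketch using pandas. The toy column and its name `color` are made up for illustration, and `drop_first=True` is just one common way to obtain k-1 dummy columns.

```python
import pandas as pd

# Hypothetical toy column; the name "color" is only for illustration.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One-hot style: one indicator column per category (k columns).
one_hot = pd.get_dummies(colors, dtype=int)

# Dummy-variable style: drop the first category (k - 1 columns);
# the dropped category shows up as a row of all zeros.
dummies = pd.get_dummies(colors, dtype=int, drop_first=True)

print(one_hot)
print(dummies)
```

Every row of `one_hot` contains exactly one 1, while in `dummies` the row for the dropped category ("blue" here, the first in sorted order) is all zeros.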

When using Keras, there is a ready-made utility for the one_hot conversion:

from keras.utils import to_categorical

data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
encoded=to_categorical(data)
print(encoded)

The printed result is:

[[ 0.  1.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 1.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  1.  0.  0.]
 [ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]]
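If Keras is not installed, the same matrix can be produced with plain NumPy. The sketch below is an alternative and not part of Keras: it assumes the labels are non-negative integers and indexes the rows of an identity matrix.

```python
import numpy as np

data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
labels = np.asarray(data)

# Number of classes is inferred as max label + 1 (assumes labels start at 0).
num_classes = labels.max() + 1

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the label array picks one row per sample.
encoded = np.eye(num_classes)[labels]
print(encoded)
```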

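For reference, to_categorical in Keras was implemented essentially as follows:

import numpy as np

def to_categorical(y, num_classes=None):
    # Convert integer class labels to a one-hot matrix
    y = np.array(y, dtype='int')
    input_shape = y.shape
    # Drop a trailing dimension of size 1, e.g. (n, 1) -> (n,)
    if input_shape and input_shape[-1] == 1 and len(input_shape) > 1:
        input_shape = tuple(input_shape[:-1])
    y = y.ravel()
    # Infer the number of classes from the largest label if not given
    if not num_classes:
        num_classes = np.max(y) + 1
    n = y.shape[0]
    categorical = np.zeros((n, num_classes))
    # Set a single 1 per row, in the column given by the label
    categorical[np.arange(n), y] = 1
    output_shape = input_shape + (num_classes,)
    categorical = np.reshape(categorical, output_shape)
    return categorical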
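Two practical points follow from the code above: the width of the result is inferred from the largest label unless num_classes is passed, and the integer labels can be recovered with argmax. The sketch below uses a hypothetical helper named `one_hot` that mirrors the same indexing trick for 1-D input only. It is not the Keras function itself.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Hypothetical helper mirroring the indexing trick above (1-D input only)."""
    labels = np.asarray(labels, dtype=int).ravel()
    if num_classes is None:
        num_classes = labels.max() + 1
    out = np.zeros((labels.shape[0], num_classes))
    out[np.arange(labels.shape[0]), labels] = 1
    return out

labels = [0, 2, 1]
auto = one_hot(labels)           # width inferred from the data: max label + 1
fixed = one_hot(labels, 5)       # width forced to 5 classes
decoded = fixed.argmax(axis=1)   # recover the integer labels

print(auto.shape, fixed.shape, decoded)
```

Passing the class count explicitly is the safer choice when a batch might not contain the highest class, since the inferred width would then be too small.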

One_hot encoding in sklearn

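from sklearn import preprocessing

# sparse_output=False returns a dense array
# (scikit-learn >= 1.2; older versions use sparse=False instead)
one_hot = preprocessing.OneHotEncoder(sparse_output=False)
# Each column is treated as a separate categorical feature
cat_data = [[1, 3, 2],
            [2, 1, 1],
            [4, 2, 2]]
print(one_hot.fit_transform(cat_data))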

The printed result is:

[[1. 0. 0. 0. 0. 1. 0. 1.]
 [0. 1. 0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 1. 0. 0. 1.]]
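The result has 8 columns because each input column is encoded separately: 3 distinct values in the first column, 3 in the second, and 2 in the third. As a rough sketch, the same encoder also works on string categories. The toy data below is made up for illustration; `categories_` shows which value each output column stands for, and `inverse_transform` maps the matrix back to the original values.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical columns with string values.
X = [["male", "red"],
     ["female", "blue"],
     ["male", "blue"]]

# handle_unknown="ignore" encodes unseen categories as all zeros at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")

# The default output is a SciPy sparse matrix; toarray() makes it dense.
encoded = encoder.fit_transform(X).toarray()

print(encoder.categories_)   # sorted categories found in each column
print(encoded)

# Map the one-hot rows back to the original category values.
restored = encoder.inverse_transform(encoded)
print(restored)
```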

If the variable is binary (two classes), another option is to define a small function:

import numpy as np

def cat_to_num(data):
    # Build one 0/1 indicator array per distinct category
    # (np.unique returns the categories in sorted order)
    categories = np.unique(data)
    features = []
    for cat in categories:
        binary = (data == cat)
        features.append(binary.astype("int"))
    return features

cat_data = np.array(['male', 'female', 'male', 'male'])

print(cat_to_num(cat_data))

This prints:

[array([0, 1, 0, 0]), array([1, 0, 1, 1])]

The conversion function defined above accepts one-dimensional input, and the values may be strings.
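The same idea can be written without the loop. The sketch below is an alternative formulation, not the function above: it relies on NumPy broadcasting to compare every sample against every category at once, and returns a single matrix with one row per sample and one column per category.

```python
import numpy as np

cat_data = np.array(['male', 'female', 'male', 'male'])

# Sorted distinct categories: ['female', 'male']
categories = np.unique(cat_data)

# cat_data[:, None] has shape (4, 1); comparing it with the (2,) category
# array broadcasts to a (4, 2) boolean matrix, which is then cast to 0/1.
indicator = (cat_data[:, None] == categories).astype(int)

print(categories)
print(indicator)
```

Note that the layout is the transpose of what cat_to_num returns: here each row is a sample, whereas cat_to_num returns one array per category.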

Of course, sklearn also provides an implementation (binary / two-class encoding):

from sklearn.preprocessing import Binarizer

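When Binarizer is called directly, the input must be a two-dimensional numeric matrix, and a threshold should be set: values greater than the threshold become 1 and all others become 0.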

import numpy as np
cat_data = np.array([[1, 2],
                     [3, 4]])

# Values greater than the threshold become 1, all others become 0
binarizer = Binarizer(threshold=2.1)
print(binarizer.transform(cat_data))
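To show what the threshold does, the sketch below compares Binarizer with the equivalent plain NumPy comparison on the same matrix. The variable names `binarized` and `manual` are introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

cat_data = np.array([[1, 2],
                     [3, 4]])

# Binarizer is stateless, so transform can be called without fitting first.
binarized = Binarizer(threshold=2.1).transform(cat_data)

# Equivalent NumPy expression: strictly greater than the threshold -> 1, else 0.
manual = (cat_data > 2.1).astype(int)

print(binarized)
print(manual)
```

Both give 0 for the values 1 and 2 and 1 for the values 3 and 4. Binarizer thresholds numeric values, so string categories need one of the encoders shown earlier.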

Below is the pandas approach using dummy variables:

import pandas as pd

data_dummy=pd.get_dummies(data)
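As a slightly fuller sketch, get_dummies can also be applied to selected columns of a DataFrame. The DataFrame below and its column names `sex` and `age` are made up for illustration; `dtype=int` is passed so the indicators are 0/1 integers regardless of the pandas version's default.

```python
import pandas as pd

# Hypothetical toy DataFrame with one categorical and one numeric column.
df = pd.DataFrame({
    "sex": ["male", "female", "male", "male"],
    "age": [23, 31, 45, 19],
})

# Only the listed columns are encoded; other columns are kept unchanged.
# New column names are built as "<column>_<category>".
data_dummy = pd.get_dummies(df, columns=["sex"], dtype=int)

print(data_dummy)
```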
