Encoding schemes are used to handle features whose values are discrete (categorical). This post covers one-hot encoding and label encoding.
First, one-hot encoding.
Method 1: the pd.get_dummies() function
Official API:
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

Convert categorical variable into dummy/indicator variables.

Parameters:
    data : array-like, Series, or DataFrame
        Data of which to get dummy indicators.
    prefix : str, list of str, or dict of str, default None
        String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.
    prefix_sep : str, default '_'
        If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.
    dummy_na : bool, default False
        Add a column to indicate NaNs; if False NaNs are ignored.
    columns : list-like, default None
        Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.
    sparse : bool, default False
        Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).
    drop_first : bool, default False
        Whether to get k-1 dummies out of k categorical levels by removing the first level.
    dtype : dtype, default np.uint8
        Data type for new columns. Only a single dtype is allowed.

Returns:
    DataFrame
        Dummy-coded data.
data : array-like, Series, or DataFrame
    The input data.
prefix : string, list of strings, or dict of strings, default None
    The prefix for the column names produced by get_dummies.
columns : list-like, default None
    The columns to be converted to dummy variables.
dummy_na : bool, default False
    Adds a column marking missing values; if False, NaNs are ignored.
drop_first : bool, default False
    Keeps k-1 of the k category levels by dropping the first one.
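A quick sketch of how these parameters interact, on a toy Series of my own choosing:

```python
import pandas as pd
import numpy as np

s = pd.Series(['a', 'b', np.nan, 'a'])

# default: the NaN row gets an all-zero dummy row
print(pd.get_dummies(s))

# dummy_na=True adds an explicit column for NaN
print(pd.get_dummies(s, dummy_na=True))

# drop_first=True keeps k-1 of the k levels (avoids the dummy-variable trap)
print(pd.get_dummies(s, drop_first=True))
```

With two levels 'a' and 'b', the default output has two columns, dummy_na=True adds a third, and drop_first=True leaves only the 'b' column.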
A word about the sparse parameter: when it is True, get_dummies returns a DataFrame whose dummy columns are backed by sparse storage (in old pandas versions, a SparseDataFrame). First, look at how an ordinary DataFrame is converted to CSR (using the old to_sparse() API, which was removed in pandas 1.0):
import pandas as pd
import numpy as np

x = pd.DataFrame({
    'A': np.array([1, 0, 3], dtype='int32'),
    'B': np.array([0, 0, 0], dtype='int32'),
    'c': np.array([0, 0, 3], dtype='int32')
})
print(x)
print(x.to_sparse().to_coo())          # COO representation (pandas < 1.0 API)
print("*" * 10)
print(x.to_sparse().to_coo().tocsr())  # convert COO to CSR
Output:
A B c
0 1 0 0
1 0 0 0
2 3 0 3
(0, 0) 1
(1, 0) 0
(2, 0) 3
(0, 1) 0
(1, 1) 0
(2, 1) 0
(0, 2) 0
(1, 2) 0
(2, 2) 3
**********
(0, 0) 1
(0, 1) 0
(0, 2) 0
(1, 0) 0
(1, 1) 0
(1, 2) 0
(2, 0) 3
(2, 1) 0
(2, 2) 3
Such a DataFrame can only be made sparse via the to_sparse() function. But if a DataFrame was returned by get_dummies(sparse=True), all you need is the sparse accessor:
df.sparse.to_coo().tocsr()
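A minimal sketch of that path, assuming pandas >= 1.0 (where the .sparse accessor replaced SparseDataFrame) and scipy installed; the toy 'color' column is my own:

```python
import pandas as pd

# dummy-encode a toy column with sparse-backed output
df = pd.DataFrame({'color': ['red', 'blue', 'red']})
dummies = pd.get_dummies(df, sparse=True, dtype="uint8")

# the .sparse accessor converts an all-sparse frame to scipy COO, then CSR
csr = dummies.sparse.to_coo().tocsr()
print(csr.shape)  # 3 rows, 2 dummy columns
```

Note that .sparse.to_coo() requires every column of the frame to be sparse, which is the case here because get_dummies encoded the only column.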
Method 2: scikit-learn's OneHotEncoder
Docstring (from an older scikit-learn version; n_values and categorical_features were later deprecated in favor of categories):
"""Encode categorical integer features using a one-hot aka one-of-K scheme.
The input to this transformer should be a matrix of integers, denoting
the values taken on by categorical (discrete) features. The output will be
a sparse matrix where each column corresponds to one possible value of one
feature. It is assumed that input features take on values in the range
[0, n_values).
This encoding is needed for feeding categorical data to many scikit-learn
estimators, notably linear models and SVMs with the standard kernels.
Note: a one-hot encoding of y labels should use a LabelBinarizer
instead.
Read more in the :ref:`User Guide <preprocessing_categorical_features>`.
Parameters
----------
n_values : 'auto', int or array of ints
Number of values per feature.
- 'auto' : determine value range from training data.
- int : number of categorical values per feature.
Each feature value should be in ``range(n_values)``
- array : ``n_values[i]`` is the number of categorical values in
``X[:, i]``. Each feature value should be
in ``range(n_values[i])``
categorical_features : "all" or array of indices or mask
Specify what features are treated as categorical.
- 'all' (default): All features are treated as categorical.
- array of indices: Array of categorical feature indices.
- mask: Array of length n_features and with dtype=bool.
Non-categorical features are always stacked to the right of the matrix.
dtype : number type, default=np.float
Desired dtype of output.
sparse : boolean, default=True
Will return sparse matrix if set True else will return an array.
handle_unknown : str, 'error' or 'ignore'
Whether to raise an error or ignore if a unknown categorical feature is
present during transform.
Attributes
----------
active_features_ : array
Indices for active features, meaning values that actually occur
in the training set. Only available when n_values is ``'auto'``.
feature_indices_ : array of shape (n_features,)
Indices to feature ranges.
Feature ``i`` in the original data is mapped to features
from ``feature_indices_[i]`` to ``feature_indices_[i+1]``
(and then potentially masked by `active_features_` afterwards)
n_values_ : array of shape (n_features,)
Maximum number of values per feature.
An example:
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  # fit learns the encoding
enc.transform([[0, 1, 3]]).toarray()                   # apply the encoding
Output: array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])
The data matrix is 4x3: 4 samples and 3 feature dimensions.
0 0 3    Look at the matrix on the left. The first column is the first feature; it takes two values (0/1), so its codes are 10 and 01.
1 1 0    Likewise, the second column is the second feature with three values (0/1/2), coded 100, 010, 001.
0 2 1    Likewise, the third column is the third feature with four values (0/1/2/3), coded 1000, 0100, 0010, 0001.
1 0 2
Now encode the input [0, 1, 3]: the 0 in the first feature becomes 10, the 1 in the second becomes 010, and the 3 in the third becomes 0001, so the result is 1 0 0 1 0 0 0 0 1.
But raw strings like the above are not directly usable. The usual pattern in practice is to first map non-numeric values to integers with LabelEncoder, then one-hot encode with sklearn.
Code:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
import numpy as np

data = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'ahj', 'c'], 'c': [1, 2, 3]})
print("Original data:")
print(data)
print("One-hot encoding with pd.get_dummies...")
new_data = pd.get_dummies(data, prefix=['col1', 'col2'])
print(new_data)
print("Label encoding with sklearn's LabelEncoder...")
label = LabelEncoder()
new_data = label.fit_transform(data['B'])
print(new_data)
print("One-hot encoding with sklearn's OneHotEncoder...")
One = OneHotEncoder(sparse=False)  # in sklearn >= 1.2 this parameter is named sparse_output
new_data = One.fit_transform(new_data.reshape(-1, 1))
print(new_data)
The reshape(-1, 1) is needed because newer versions of sklearn require all input data to be a 2-D matrix, even when it is a single column; similarly, a single sample at prediction time must be reshaped with reshape(1, -1). The output:
Original data:
   A    B  c
0  a    b  1
1  b  ahj  2
2  a    c  3
One-hot encoding with pd.get_dummies...
   c  col1_a  col1_b  col2_ahj  col2_b  col2_c
0  1       1       0         0       1       0
1  2       0       1         1       0       0
2  3       1       0         0       0       1
Label encoding with sklearn's LabelEncoder...
[1 0 2]
One-hot encoding with sklearn's OneHotEncoder...
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
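In newer scikit-learn versions (0.20+), the LabelEncoder detour is unnecessary: OneHotEncoder can fit a string column directly. A sketch on the same toy column:

```python
from sklearn.preprocessing import OneHotEncoder

# one column of strings, shaped (n_samples, 1)
X = [['b'], ['ahj'], ['c']]

enc = OneHotEncoder()                 # returns a sparse matrix by default
onehot = enc.fit_transform(X).toarray()
print(enc.categories_)                # learned categories, sorted per column
print(onehot)
```

Each row of the result has exactly one 1, and the column order follows the sorted categories ['ahj', 'b', 'c'].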
Method 3: sklearn's label-binarization API, LabelBinarizer

from sklearn.preprocessing import LabelBinarizer
import pandas as pd

data = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'ahj', 'c'], 'c': [1, 2, 3]})
print("One-hot encoding with LabelBinarizer...")
new_data = LabelBinarizer().fit_transform(data['B'])
print(new_data)

Output:
[[0 1 0]
 [1 0 0]
 [0 0 1]]
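LabelBinarizer also keeps the sorted class list and can invert the encoding; a small sketch:

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
onehot = lb.fit_transform(['b', 'ahj', 'c'])
print(lb.classes_)                   # sorted classes: ['ahj' 'b' 'c']
print(lb.inverse_transform(onehot))  # recovers the original labels
```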
Now for LabelEncoder.
Method 1: with sklearn:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
encoded = le.fit_transform(data['B'])  # data as defined above
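A self-contained sketch, including the round trip back to the original labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['b', 'ahj', 'c', 'b'])
print(codes)                        # classes are sorted, so 'ahj'->0, 'b'->1, 'c'->2
print(le.classes_)                  # the learned (sorted) classes
print(le.inverse_transform(codes))  # back to the original strings
```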
Method 2: with pd.factorize()
Parameters:
    values : sequence
        A 1-D sequence. Sequences that aren't pandas objects are coerced to ndarrays before factorization.
    sort : bool, default False
        Sort uniques and shuffle labels to maintain the relationship.
    order : None
        Deprecated since version 0.23.0: this parameter has no effect.
    na_sentinel : int, default -1
        Value to mark "not found".
    size_hint : int, optional
        Hint to the hashtable sizer.

Returns:
    labels : ndarray
        An integer ndarray that's an indexer into uniques.
    uniques : ndarray, Index, or Categorical
        The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned. Note: even if there's a missing value in values, uniques will not contain an entry for it.
See also: pd.cut (discretize a continuous-valued array), pd.unique (find the unique values in an array).
Here are examples (from the official docs) of how the parameters behave; they are self-explanatory:
Examples
These examples all show factorize as a top-level method like pd.factorize(values). The results are identical for methods like Series.factorize().
>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> labels
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
With sort=True, the uniques will be sorted, and labels will be shuffled so that the relationship is maintained.
>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
>>> labels
array([1, 1, 0, 2, 1])
>>> uniques
array(['a', 'b', 'c'], dtype=object)
Missing values are indicated in labels with na_sentinel (-1 by default). Note that missing values are never included in uniques.
>>> labels, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
>>> labels
array([ 0, -1, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.
>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> labels, uniques = pd.factorize(cat)
>>> labels
array([0, 0, 1])
>>> uniques
[a, c]
Categories (3, object): [a, b, c]
Notice that 'b' is in uniques.categories, despite not being present in cat.values.
For all other pandas objects, an Index of the appropriate type is returned.
>>> cat = pd.Series(['a', 'a', 'c'])
>>> labels, uniques = pd.factorize(cat)
>>> labels
array([0, 0, 1])
>>> uniques
Index(['a', 'c'], dtype='object')
Code:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np

data = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'ahj', 'c'], 'c': [1, 2, 3]})
print("Original data:")
print(data)
print("Label encoding with sklearn's LabelEncoder...")
label = LabelEncoder()
new_data = label.fit_transform(data['B'])
print(new_data)
print("Label encoding with pd.factorize()...")
new_data = pd.factorize(data['B'])
print(new_data[0])

Output:
Original data:
   A    B  c
0  a    b  1
1  b  ahj  2
2  a    c  3
Label encoding with sklearn's LabelEncoder...
[1 0 2]
Label encoding with pd.factorize()...
[0 1 2]
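The two results differ because LabelEncoder assigns codes in sorted order while factorize assigns them in order of first appearance; passing sort=True makes factorize match LabelEncoder. A quick sketch:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

values = ['b', 'ahj', 'c']

sk_codes = LabelEncoder().fit_transform(values)  # sorted order:     ahj->0, b->1, c->2
pd_codes, uniques = pd.factorize(values)         # appearance order: b->0, ahj->1, c->2
pd_sorted, _ = pd.factorize(values, sort=True)   # matches LabelEncoder

print(list(sk_codes))   # [1, 0, 2]
print(list(pd_codes))   # [0, 1, 2]
print(list(pd_sorted))  # [1, 0, 2]
```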