A Summary of Encoding Methods in Machine Learning

Encoding methods are used to handle features whose values are discrete (categorical). This post covers one-hot encoding and label encoding.

Let's start with one-hot encoding.

Implementation 1: the pd.get_dummies() function

Official API:

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

Convert categorical variable into dummy/indicator variables.

Parameters:

data : array-like, Series, or DataFrame

Data of which to get dummy indicators.

prefix : str, list of str, or dict of str, default None

String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.

prefix_sep : str, default ‘_’

If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.

dummy_na : bool, default False

Add a column to indicate NaNs, if False NaNs are ignored.

columns : list-like, default None

Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.

sparse : bool, default False

Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).

drop_first : bool, default False

Whether to get k-1 dummies out of k categorical levels by removing the first level.

New in version 0.18.0.

dtype : dtype, default np.uint8

Data type for new columns. Only a single dtype is allowed.

New in version 0.23.0.

Returns:

DataFrame

Dummy-coded data.

data : array-like, Series, or DataFrame
The input data.
prefix : string, list of strings, or dict of strings, default None
Prefix for the column names produced by get_dummies.
columns : list-like, default None
The columns of the DataFrame to be encoded; if None, all object/category columns are converted.
dummy_na : bool, default False
Add a column indicating missing values; if False, NaNs are ignored.
drop_first : bool, default False
Keep k-1 out of the k category columns by dropping the first level.
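As a quick illustration of these parameters, here is a minimal sketch; the toy 'color' column is made up for this example:

import numpy as np
import pandas as pd

# One categorical column with a missing value
df = pd.DataFrame({'color': ['red', 'green', np.nan, 'red']})

# dummy_na=True adds an indicator column for NaN;
# drop_first=True keeps only k-1 of the k indicator columns (here: color_red, color_nan)
print(pd.get_dummies(df, prefix='color', dummy_na=True, drop_first=True))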

A special note on the sparse parameter: when it is True, the returned DataFrame is backed by a sparse structure (a SparseDataFrame in older pandas). First, let's look at how an ordinary DataFrame is converted to CSR format:

import pandas as pd
import numpy as np

x = pd.DataFrame({
    'A': np.array([1, 0, 3], dtype='int32'),
    'B': np.array([0, 0, 0], dtype='int32'),
    'c': np.array([0, 0, 3], dtype='int32')
})

print(x)
# to_sparse() is the pre-1.0 pandas API: deprecated in 0.25 and removed in 1.0
print(x.to_sparse().to_coo())
print("*" * 10)
print(x.to_sparse().to_coo().tocsr())
Output:
 A  B  c
0  1  0  0
1  0  0  0
2  3  0  3
  (0, 0)	1
  (1, 0)	0
  (2, 0)	3
  (0, 1)	0
  (1, 1)	0
  (2, 1)	0
  (0, 2)	0
  (1, 2)	0
  (2, 2)	3
**********
  (0, 0)	1
  (0, 1)	0
  (0, 2)	0
  (1, 0)	0
  (1, 1)	0
  (1, 2)	0
  (2, 0)	3
  (2, 1)	0
  (2, 2)	3

A DataFrame like this can only be turned into a sparse form via the to_sparse() function. If, however, a DataFrame was produced by

get_dummies(sparse=True)

then all you need to write is:

df.sparse.to_coo().tocsr()
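A minimal sketch of that path, assuming pandas >= 0.25 where the DataFrame.sparse accessor is available (the toy frame below is made up):

import pandas as pd

# All columns are categorical, so every resulting dummy column is sparse
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['x', 'y', 'z']})

dummies = pd.get_dummies(df, sparse=True, dtype='uint8')  # SparseArray-backed columns
csr = dummies.sparse.to_coo().tocsr()                     # COO first, then CSR
print(csr)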

 

Implementation 2: sklearn's OneHotEncoder

Docstring (from the sklearn source):

"""Encode categorical integer features using a one-hot aka one-of-K scheme.

    The input to this transformer should be a matrix of integers, denoting
    the values taken on by categorical (discrete) features. The output will be
    a sparse matrix where each column corresponds to one possible value of one
    feature. It is assumed that input features take on values in the range
    [0, n_values).

    This encoding is needed for feeding categorical data to many scikit-learn
    estimators, notably linear models and SVMs with the standard kernels.

    Note: a one-hot encoding of y labels should use a LabelBinarizer
    instead.

    Read more in the :ref:`User Guide <preprocessing_categorical_features>`.

    Parameters
    ----------
    n_values : 'auto', int or array of ints
        Number of values per feature.

        - 'auto' : determine value range from training data.
        - int : number of categorical values per feature.
                Each feature value should be in ``range(n_values)``
        - array : ``n_values[i]`` is the number of categorical values in
                  ``X[:, i]``. Each feature value should be
                  in ``range(n_values[i])``

    categorical_features : "all" or array of indices or mask
        Specify what features are treated as categorical.

        - 'all' (default): All features are treated as categorical.
        - array of indices: Array of categorical feature indices.
        - mask: Array of length n_features and with dtype=bool.

        Non-categorical features are always stacked to the right of the matrix.

    dtype : number type, default=np.float
        Desired dtype of output.

    sparse : boolean, default=True
        Will return sparse matrix if set True else will return an array.

    handle_unknown : str, 'error' or 'ignore'
        Whether to raise an error or ignore if an unknown categorical feature is
        present during transform.

    Attributes
    ----------
    active_features_ : array
        Indices for active features, meaning values that actually occur
        in the training set. Only available when n_values is ``'auto'``.

    feature_indices_ : array of shape (n_features,)
        Indices to feature ranges.
        Feature ``i`` in the original data is mapped to features
        from ``feature_indices_[i]`` to ``feature_indices_[i+1]``
        (and then potentially masked by `active_features_` afterwards)

    n_values_ : array of shape (n_features,)
        Maximum number of values per feature.

An example:

from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])    # fit learns the encoding
enc.transform([[0, 1, 3]]).toarray()    # apply the encoding

 

Output: array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

The data matrix is 4x3: 4 samples and 3 feature dimensions.

0 0 3
1 1 0
0 2 1
1 0 2

Looking at this matrix: the first column is the first feature and takes two values, 0 and 1, so its codes are 10 and 01. Likewise, the second column is the second feature with three values (0, 1, 2), coded as 100, 010, 001; and the third column is the third feature with four values (0, 1, 2, 3), coded as 1000, 0100, 0010, 0001.

Now look at the sample to be encoded, [0, 1, 3]: 0 in the first feature encodes to 10, 1 in the second feature encodes to 010, and 3 in the third feature encodes to 0001. The overall result is therefore 1 0 0 1 0 0 0 0 1.
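In newer scikit-learn versions (roughly 0.22 and later) the fitted encoder exposes the learned value sets through the categories_ attribute, which makes the reasoning above easy to verify; a minimal sketch:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# One array of learned values per input column:
# [array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])]
print(enc.categories_)
print(enc.transform([[0, 1, 3]]).toarray())   # [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]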

But raw values like the ones above are not what we usually have. In engineering practice, the approach is to first use LabelEncoder to convert non-numeric values into integers and then apply sklearn's OneHotEncoder.

Code:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
import numpy as np

data = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'ahj', 'c'], 'c': [1, 2, 3]})
print("Original data:")
print(data)
print("One-hot encoding with pd.get_dummies...")
new_data = pd.get_dummies(data, prefix=['col1', 'col2'])
print(new_data)

print("Label encoding with sklearn's LabelEncoder")
label = LabelEncoder()
new_data = label.fit_transform(data['B'])
print(new_data)

print("One-hot encoding with sklearn's OneHotEncoder...")
One = OneHotEncoder(sparse=False)
new_data = One.fit_transform(new_data.reshape(-1, 1))
print(new_data)


new_data.reshape(-1, 1) is needed because, in newer versions of sklearn, all input data must be a 2-D matrix, even when it is only a single row or a single column (for example, when predicting on just one sample). Here the label-encoded column is 1-D, so it is converted with .reshape(-1, 1); a single sample would instead use .reshape(1, -1). The output is shown below, and a small sketch after it makes the shapes explicit.

Original data:
   A    B  c
0  a    b  1
1  b  ahj  2
2  a    c  3
One-hot encoding with pd.get_dummies...
   c  col1_a  col1_b  col2_ahj  col2_b  col2_c
0  1       1       0         0       1       0
1  2       0       1         1       0       0
2  3       1       0         0       0       1
Label encoding with sklearn's LabelEncoder
[1 0 2]
One-hot encoding with sklearn's OneHotEncoder...
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
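To make the shape requirement concrete, here is a minimal sketch showing why the 1-D output of LabelEncoder has to be reshaped into a single column before being passed to OneHotEncoder:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labels = LabelEncoder().fit_transform(['b', 'ahj', 'c'])
print(labels.shape)              # (3,) -- 1-D, but OneHotEncoder expects a 2-D array
column = labels.reshape(-1, 1)   # shape (3, 1): three samples, one feature column
print(OneHotEncoder().fit_transform(column).toarray())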

Implementation 3: sklearn's label binarization API, LabelBinarizer

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

data = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'ahj', 'c'], 'c': [1, 2, 3]})
print("One-hot encoding with LabelBinarizer")
new_data = LabelBinarizer().fit_transform(data['B'])
print(new_data)
[[0 1 0]
 [1 0 0]
 [0 0 1]]
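One detail worth knowing: when there are only two distinct labels, LabelBinarizer returns a single 0/1 column instead of two; a small sketch:

from sklearn.preprocessing import LabelBinarizer

# Three distinct labels -> three indicator columns, as above
print(LabelBinarizer().fit_transform(['b', 'ahj', 'c']))

# Only two distinct labels -> a single 0/1 indicator column
print(LabelBinarizer().fit_transform(['a', 'b', 'a']))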

Now let's turn to label encoding (LabelEncoder).

Implementation 1: using sklearn:

from sklearn.preprocessing import LabelEncoder
labelencoding = LabelEncoder()
new_data = labelencoding.fit_transform(data)   # data: a 1-D sequence of category labels

Implementation 2: using pd.factorize()

Parameters:

values : sequence

A 1-D sequence. Sequences that aren’t pandas objects are coerced to ndarrays before factorization.

sort : bool, default False

Sort uniques and shuffle labels to maintain the relationship.

order : None

Deprecated since version 0.23.0: This parameter has no effect and is deprecated.

na_sentinel : int, default -1

Value to mark “not found”.

size_hint : int, optional

Hint to the hashtable sizer.

Returns:

labels : ndarray

An integer ndarray that’s an indexer into uniques. uniques.take(labels) will have the same values as values.

uniques : ndarray, Index, or Categorical

The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.

Note

 

Even if there’s a missing value in values, uniques will not contain an entry for it.

See also

cut

Discretize continuous-valued array.

unique

Find the unique value in an array.

Below are examples showing how the parameters behave; they are largely self-explanatory.

Examples

These examples all show factorize as a top-level method like pd.factorize(values). The results are identical for methods like Series.factorize().

>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> labels
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)

With sort=True, the uniques will be sorted, and labels will be shuffled so that the relationship is maintained.

>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
>>> labels
array([1, 1, 0, 2, 1])
>>> uniques
array(['a', 'b', 'c'], dtype=object)

Missing values are indicated in labels with na_sentinel (-1 by default). Note that missing values are never included in uniques.

>>> labels, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
>>> labels
array([ 0, -1,  1,  2,  0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)

Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.

>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> labels, uniques = pd.factorize(cat)
>>> labels
array([0, 0, 1])
>>> uniques
[a, c]
Categories (3, object): [a, b, c]

Notice that 'b' is in uniques.categories, despite not being present in cat.values.

For all other pandas objects, an Index of the appropriate type is returned.

>>> cat = pd.Series(['a', 'a', 'c'])
>>> labels, uniques = pd.factorize(cat)
>>> labels
array([0, 0, 1])
>>> uniques
Index(['a', 'c'], dtype='object')

Code:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
import numpy as np

data = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'ahj', 'c'], 'c': [1, 2, 3]})
print("Original data:")
print(data)

print("Label encoding with sklearn's LabelEncoder")
label = LabelEncoder()
new_data = label.fit_transform(data['B'])
print(new_data)

print("Label encoding with pd.factorize()")
new_data = pd.factorize(data['B'])
print(new_data[0])
Original data:
   A    B  c
0  a    b  1
1  b  ahj  2
2  a    c  3
Label encoding with sklearn's LabelEncoder
[1 0 2]
Label encoding with pd.factorize()
[0 1 2]
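Both approaches can map the integer codes back to the original values: pd.factorize returns the uniques array, so uniques.take(labels) reconstructs the input, while LabelEncoder offers inverse_transform. A minimal sketch:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

values = ['b', 'ahj', 'c', 'b']

labels, uniques = pd.factorize(values)
print(uniques.take(labels))          # ['b' 'ahj' 'c' 'b'] -- original values recovered

le = LabelEncoder()
codes = le.fit_transform(values)
print(le.inverse_transform(codes))   # ['b' 'ahj' 'c' 'b']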

 
