Encoding schemes are used to handle features whose values are discrete (categorical). This post covers one-hot encoding and label encoding.
First, one-hot encoding.
Method 1: the pd.get_dummies() function
Official API:
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

Convert categorical variable into dummy/indicator variables.

Parameters:
    data : array-like, Series, or DataFrame
        Data of which to get dummy indicators.
    prefix : str, list of str, or dict of str, default None
        String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.
    prefix_sep : str, default '_'
        If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.
    dummy_na : bool, default False
        Add a column to indicate NaNs; if False NaNs are ignored.
    columns : list-like, default None
        Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.
    sparse : bool, default False
        Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).
    drop_first : bool, default False
        Whether to get k-1 dummies out of k categorical levels by removing the first level.
    dtype : dtype, default np.uint8
        Data type for new columns. Only a single dtype is allowed.

Returns:
    DataFrame
        Dummy-coded data.
data : array-like, Series, or DataFrame
    The input data.
prefix : string, list of strings, or dict of strings, default None
    The prefix for the column names produced by get_dummies.
columns : list-like, default None
    The columns to be converted to dummy variables.
dummy_na : bool, default False
    Adds a column marking missing values; if False, NaNs are ignored.
drop_first : bool, default False
    Keeps k-1 of the k category levels by dropping the first one.
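A quick sketch of how these parameters interact, on a toy Series of my own choosing:

```python
import pandas as pd
import numpy as np

s = pd.Series(['a', 'b', np.nan, 'a'])

# default: the NaN row gets an all-zero dummy row
print(pd.get_dummies(s))

# dummy_na=True adds an explicit column for NaN
print(pd.get_dummies(s, dummy_na=True))

# drop_first=True keeps k-1 of the k levels (avoids the dummy-variable trap)
print(pd.get_dummies(s, drop_first=True))
```

With two levels 'a' and 'b', the default output has two columns, dummy_na=True adds a third, and drop_first=True leaves only the 'b' column.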
A word about the sparse parameter: when it is True, get_dummies returns a DataFrame whose dummy columns are backed by sparse storage (in old pandas versions, a SparseDataFrame). First, look at how an ordinary DataFrame is converted to CSR (using the old to_sparse() API, which was removed in pandas 1.0):
import pandas as pd
import numpy as np

x = pd.DataFrame({
    'A': np.array([1, 0, 3], dtype='int32'),
    'B': np.array([0, 0, 0], dtype='int32'),
    'c': np.array([0, 0, 3], dtype='int32')
})
print(x)
print(x.to_sparse().to_coo())          # COO representation (pandas < 1.0 API)
print("*" * 10)
print(x.to_sparse().to_coo().tocsr())  # convert COO to CSR
Output:
A B c
0 1 0 0
1 0 0 0
2 3 0 3
(0, 0) 1
(1, 0) 0
(2, 0) 3
(0, 1) 0
(1, 1) 0
(2, 1) 0
(0, 2) 0
(1, 2) 0
(2, 2) 3
**********
(0, 0) 1
(0, 1) 0
(0, 2) 0
(1, 0) 0
(1, 1) 0
(1, 2) 0
(2, 0) 3
(2, 1) 0
(2, 2) 3
Such a DataFrame can only be made sparse via the to_sparse() function. But if a DataFrame was returned by get_dummies(sparse=True), all you need is the sparse accessor:
df.sparse.to_coo().tocsr()
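A minimal sketch of that path, assuming pandas >= 1.0 (where the .sparse accessor replaced SparseDataFrame) and scipy installed; the toy 'color' column is my own:

```python
import pandas as pd

# dummy-encode a toy column with sparse-backed output
df = pd.DataFrame({'color': ['red', 'blue', 'red']})
dummies = pd.get_dummies(df, sparse=True, dtype="uint8")

# the .sparse accessor converts an all-sparse frame to scipy COO, then CSR
csr = dummies.sparse.to_coo().tocsr()
print(csr.shape)  # 3 rows, 2 dummy columns
```

Note that .sparse.to_coo() requires every column of the frame to be sparse, which is the case here because get_dummies encoded the only column.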
Method 2: scikit-learn's OneHotEncoder
Docstring (from an older scikit-learn version; n_values and categorical_features were later deprecated in favor of categories):
"""Encode categorical integer features using a one-hot aka one-of-K scheme.
The input to this transformer should be a matrix of integers, denoting
the values taken on by categorical (discrete) features. The output will be
a sparse matrix where each column corresponds to one possible value of one
feature. It is assumed that input features take on values in the range
[0, n_values).
This encoding is needed for feeding categorical data to many scikit-learn
estimators, notably linear models and SVMs with the standard kernels.
Note: a one-hot encoding of y labels should use a LabelBinarizer
instead.
Read more in the :ref:`User Guide <preprocessing_categorical_features>`.
Parameters
----------
n_values : 'auto', int or array of ints
Number of values per feature.
- 'auto' : determine value range from training data.
- int : number of categorical values per feature.
Each feature value should be in ``range(n_values)``
- array : ``n_values[i]`` is the number of categorical values in
``X[:, i]``. Each feature value should be
in ``range(n_values[i])``
categorical_features : "all" or array of indices or mask
Specify what features are treated as categorical.
- 'all' (default): All features are treated as categorical.
- array of indices: Array of categorical feature indices.
- mask: Array of length n_features and with dtype=bool.
Non-categorical features are always stacked to the right of the matrix.
dtype : number type, default=np.float
Desired dtype of output.
sparse : boolean, default=True
Will return sparse matrix if set True else will return an array.
handle_unknown : str, 'error' or 'ignore'
Whether to raise an error or ignore if a unknown categorical feature is
present during transform.
Attributes
----------
active_features_ : array
Indices for active features, meaning values that actually occur
in the training set. Only available when n_values is ``'auto'``.
feature_indices_ : array of shape (n_features,)
Indices to feature ranges.
Feature ``i`` in the original data is mapped to features
from ``feature_indices_[i]`` to ``feature_indices_[i+1]``
(and then potentially masked by `active_features_` afterwards)
n_values_ : array of shape (n_features,)
Maximum number of values per feature.
An example:
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  # fit learns the encoding
enc.transform([[0, 1, 3]]).toarray()                   # apply the encoding
Output: array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])
The data matrix is 4x3: 4 samples and 3 feature dimensions.
0 0 3    Look at the matrix on the left. The first column is the first feature; it takes two values (0/1), so its codes are 10 and 01.
1 1 0    Likewise, the second column is the second feature with three values (0/1/2), coded 100, 010, 001.
0 2 1    Likewise, the third column is the third feature with four values (0/1/2/3), coded 1000, 0100, 0010, 0001.
1 0 2
Now encode the input [0, 1, 3]: the 0 in the first feature becomes 10, the 1 in the second becomes 010, and the 3 in the third becomes 0001, so the result is 1 0 0 1 0 0 0 0 1.
But raw strings like the above are not directly usable. The usual pattern in practice is to first map non-numeric values to integers with LabelEncoder, then one-hot encode with sklearn.
Code:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
import numpy as np

data = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'ahj', 'c'], 'c': [1, 2, 3]})
print("Original data:")
print(data)
print("One-hot encoding with pd.get_dummies...")
new_data = pd.get_dummies(data, prefix=['col1', 'col2'])
print(new_data)
print("Label encoding with sklearn's LabelEncoder...")
label = LabelEncoder()
new_data = label.fit_transform(data['B'])
print(new_data)
print("One-hot encoding with sklearn's OneHotEncoder...")
One = OneHotEncoder(sparse=False)  # in sklearn >= 1.2 this parameter is named sparse_output
new_data = One.fit_transform(new_data.reshape(-1, 1))
print(new_data)
The reshape(-1, 1) is needed because newer versions of sklearn require all input data to be a 2-D matrix, even when it is a single column; similarly, a single sample at prediction time must be reshaped with reshape(1, -1). The output:
Original data:
   A    B  c
0  a    b  1
1  b  ahj  2
2  a    c  3
One-hot encoding with pd.get_dummies...
   c  col1_a  col1_b  col2_ahj  col2_b  col2_c
0  1       1       0         0       1       0
1  2       0       1         1       0       0
2  3       1       0         0       0       1
Label encoding with sklearn's LabelEncoder...
[1 0 2]
One-hot encoding with sklearn's OneHotEncoder...
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
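In newer scikit-learn versions (0.20+), the LabelEncoder detour is unnecessary: OneHotEncoder can fit a string column directly. A sketch on the same toy column:

```python
from sklearn.preprocessing import OneHotEncoder

# one column of strings, shaped (n_samples, 1)
X = [['b'], ['ahj'], ['c']]

enc = OneHotEncoder()                 # returns a sparse matrix by default
onehot = enc.fit_transform(X).toarray()
print(enc.categories_)                # learned categories, sorted per column
print(onehot)
```

Each row of the result has exactly one 1, and the column order follows the sorted categories ['ahj', 'b', 'c'].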
Method 3: sklearn's label-binarization API, LabelBinarizer

from sklearn.preprocessing import LabelBinarizer
import pandas as pd

data = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'ahj', 'c'], 'c': [1, 2, 3]})
print("One-hot encoding with LabelBinarizer...")
new_data = LabelBinarizer().fit_transform(data['B'])
print(new_data)

Output:
[[0 1 0]
 [1 0 0]
 [0 0 1]]
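LabelBinarizer also keeps the sorted class list and can invert the encoding; a small sketch:

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
onehot = lb.fit_transform(['b', 'ahj', 'c'])
print(lb.classes_)                   # sorted classes: ['ahj' 'b' 'c']
print(lb.inverse_transform(onehot))  # recovers the original labels
```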
Now for LabelEncoder.
Method 1: with sklearn:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
encoded = le.fit_transform(data['B'])  # data as defined above
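A self-contained sketch, including the round trip back to the original labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['b', 'ahj', 'c', 'b'])
print(codes)                        # classes are sorted, so 'ahj'->0, 'b'->1, 'c'->2
print(le.classes_)                  # the learned (sorted) classes
print(le.inverse_transform(codes))  # back to the original strings
```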
Method 2: with pd.factorize()
Parameters:
    values : sequence
        A 1-D sequence. Sequences that aren't pandas objects are coerced to ndarrays before factorization.
    sort : bool, default False
        Sort uniques and shuffle labels to maintain the relationship.
    order : None
        Deprecated since version 0.23.0: this parameter has no effect.
    na_sentinel : int, default -1
        Value to mark "not found".
    size_hint : int, optional
        Hint to the hashtable sizer.

Returns:
    labels : ndarray
        An integer ndarray that's an indexer into uniques.
    uniques : ndarray, Index, or Categorical
        The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned. Note: even if there's a missing value in values, uniques will not contain an entry for it.
See also: pd.cut (discretize a continuous-valued array), pd.unique (find the unique values in an array).
Here are examples (from the official docs) of how the parameters behave; they are self-explanatory:
Examples
These examples all show factorize as a top-level method like pd.factorize(values). The results are identical for methods like Series.factorize().
>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> labels
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
With sort=True, the uniques will be sorted, and labels will be shuffled so that the relationship is maintained.
>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
>>> labels
array([1, 1, 0, 2, 1])
>>> uniques
array(['a', 'b', 'c'], dtype=object)
Missing values are indicated in labels with na_sentinel (-1 by default). Note that missing values are never included in uniques.
>>> labels, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
>>> labels
array([ 0, -1, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.
>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> labels, uniques = pd.factorize(cat)
>>> labels
array([0, 0, 1])
>>> uniques
[a, c]
Categories (3, object): [a, b, c]
Notice that 'b' is in uniques.categories, despite not being present in cat.values.
For all other pandas objects, an Index of the appropriate type is returned.
>>> cat = pd.Series(['a', 'a', 'c'])
>>> labels, uniques = pd.factorize(cat)
>>> labels
array([0, 0, 1])
>>> uniques
Index(['a', 'c'], dtype='object')
Code:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np

data = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'ahj', 'c'], 'c': [1, 2, 3]})
print("Original data:")
print(data)
print("Label encoding with sklearn's LabelEncoder...")
label = LabelEncoder()
new_data = label.fit_transform(data['B'])
print(new_data)
print("Label encoding with pd.factorize()...")
new_data = pd.factorize(data['B'])
print(new_data[0])

Output:
Original data:
   A    B  c
0  a    b  1
1  b  ahj  2
2  a    c  3
Label encoding with sklearn's LabelEncoder...
[1 0 2]
Label encoding with pd.factorize()...
[0 1 2]
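The two results differ because LabelEncoder assigns codes in sorted order while factorize assigns them in order of first appearance; passing sort=True makes factorize match LabelEncoder. A quick sketch:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

values = ['b', 'ahj', 'c']

sk_codes = LabelEncoder().fit_transform(values)  # sorted order:     ahj->0, b->1, c->2
pd_codes, uniques = pd.factorize(values)         # appearance order: b->0, ahj->1, c->2
pd_sorted, _ = pd.factorize(values, sort=True)   # matches LabelEncoder

print(list(sk_codes))   # [1, 0, 2]
print(list(pd_codes))   # [0, 1, 2]
print(list(pd_sorted))  # [1, 0, 2]
```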