Data Preprocessing: Feature Encoding and Discretization

Feature Encoding

1. OneHotEncoder

sklearn.preprocessing.OneHotEncoder

preprocessing.OneHotEncoder(
    categories='auto',              # categories (unique values) per feature
    drop='first',                   # strategy for dropping one of the categories per feature (default is None)
    dtype=<class 'numpy.float64'>,  # desired dtype of the output
    sparse=True,                    # if True, return a sparse matrix; otherwise return an array
    handle_unknown='error'          # raise an error or ignore unknown categories seen during transform
)

Docstring:
Encode categorical features as a one-hot numeric array.

The input to this transformer should be an array-like of integers or
strings, denoting the values taken on by categorical (discrete) features.
The features are encoded using a one-hot (aka 'one-of-K' or 'dummy')
encoding scheme. This creates a binary column for each category and
returns a sparse matrix or dense array (depending on the ``sparse``
parameter).

By default, the encoder derives the categories based on the unique values
in each feature. Alternatively, you can also specify the categories
manually.

This encoding is needed for feeding categorical data to many scikit-learn
estimators, notably linear models and SVMs with the standard kernels.

Note: a one-hot encoding of y labels should use a LabelBinarizer
instead.

Read more in the :ref:`User Guide <preprocessing_categorical_features>`.

Parameters

categories : 'auto' or a list of array-like, default='auto'
Categories (unique values) per feature:

- 'auto' : Determine categories automatically from the training data.
- list : ``categories[i]`` holds the categories expected in the ith
  column. The passed categories should not mix strings and numeric
  values within a single feature, and should be sorted in case of
  numeric values.

The used categories can be found in the ``categories_`` attribute.

.. versionadded:: 0.20

drop : {'first', 'if_binary'} or an array-like of shape (n_features,), default=None
Specifies a methodology to use to drop one of the categories per
feature. This is useful in situations where perfectly collinear
features cause problems, such as when feeding the resulting data
into a neural network or an unregularized regression.

However, dropping one category breaks the symmetry of the original
representation and can therefore induce a bias in downstream models,
for instance for penalized linear classification or regression models.

- None : retain all features (the default).
- 'first' : drop the first category in each feature. If only one
  category is present, the feature will be dropped entirely.
- 'if_binary' : drop the first category in each feature with two
  categories. Features with 1 or more than 2 categories are
  left intact.
- array : ``drop[i]`` is the category in feature ``X[:, i]`` that
  should be dropped.

.. versionadded:: 0.21
   The parameter `drop` was added in 0.21.

.. versionchanged:: 0.23
   The option `drop='if_binary'` was added in 0.23.

sparse : bool, default=True
Will return a sparse matrix if set to True, else will return an array.

dtype : number type, default=float
Desired dtype of output.

handle_unknown : {'error', 'ignore'}, default='error'
Whether to raise an error or ignore if an unknown categorical feature
is present during transform (default is to raise). When this parameter
is set to 'ignore' and an unknown category is encountered during
transform, the resulting one-hot encoded columns for this feature
will be all zeros. In the inverse transform, an unknown category
will be denoted as None.

Attributes

categories_ : list of arrays
The categories of each feature determined during fitting
(in order of the features in X and corresponding with the output
of transform). This includes the category specified in drop
(if any).

drop_idx_ : array of shape (n_features,)
- drop_idx_[i] is the index in categories_[i] of the category
to be dropped for each feature.
- drop_idx_[i] = None if no category is to be dropped from the
feature with index i, e.g. when drop='if_binary' and the
feature isn’t binary.
- drop_idx_ = None if all the transformed features will be
retained.

.. versionchanged:: 0.23
   Added the possibility to contain `None` values.

Examples

Given a dataset with two features, we let the encoder find the unique
values per feature and transform the data to a binary one-hot encoding.

from sklearn.preprocessing import OneHotEncoder

One can discard categories not seen during fit:

enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)
OneHotEncoder(handle_unknown='ignore')

enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]

enc.transform([['Female', 1], ['Male', 4]]).toarray()
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])

enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
array([['Male', 1],
       [None, 2]], dtype=object)

enc.get_feature_names(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'],
      dtype=object)

One can always drop the first column for each feature:

drop_enc = OneHotEncoder(drop='first').fit(X)
drop_enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]

drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 0., 0.],
       [1., 1., 0.]])

Or drop a column only for features having exactly 2 categories:

drop_binary_enc = OneHotEncoder(drop='if_binary').fit(X)
drop_binary_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 1., 0., 0.],
       [1., 0., 1., 0.]])
One-hot encoding converts a discrete feature with K possible values into K binary features (each taking the value 0 or 1).
Advantages: after one-hot encoding, the different original categories are all equidistant from one another; one-hot encoding often noticeably improves regression and classification models that take discrete features as input.
Disadvantages: the number of features grows substantially, and correlation among the features increases.
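The docstring examples above feed plain Python lists to the encoder; as a minimal sketch (with a hypothetical two-column DataFrame), it accepts a pandas DataFrame just as well:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy DataFrame with one string column and one integer column.
df = pd.DataFrame({"gender": ["M", "F", "F"], "group": [1, 3, 2]})

enc = OneHotEncoder(handle_unknown="ignore")
dense = enc.fit_transform(df).toarray()  # 2 gender columns + 3 group columns
print(dense.shape)  # (3, 5)
```

Column order follows the sorted categories per feature, as recorded in `enc.categories_`.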

2. get_dummies

pandas.get_dummies()

pd.get_dummies(
    data,
    prefix=None,
    prefix_sep='_',
    dummy_na=False,
    columns=None,
    sparse=False,
    drop_first=False,
    dtype=None,
) -> 'DataFrame'
Docstring:
Convert categorical variable into dummy/indicator variables.

Parameters

data : array-like, Series, or DataFrame
Data of which to get dummy indicators.
prefix : str, list of str, or dict of str, default None
String to append DataFrame column names.
Pass a list with length equal to the number of columns
when calling get_dummies on a DataFrame. Alternatively, prefix
can be a dictionary mapping column names to prefixes.
prefix_sep : str, default '_'
If appending prefix, separator/delimiter to use. Or pass a
list or dictionary as with prefix.
dummy_na : bool, default False
Add a column to indicate NaNs, if False NaNs are ignored.
columns : list-like, default None
Column names in the DataFrame to be encoded.
If columns is None then all the columns with
object or category dtype will be converted.
sparse : bool, default False
Whether the dummy-encoded columns should be backed by
a :class:`SparseArray` (True) or a regular NumPy array (False).
drop_first : bool, default False
Whether to get k-1 dummies out of k categorical levels by removing the
first level.
dtype : dtype, default np.uint8
Data type for new columns. Only a single dtype is allowed.
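The pasted docstring stops at the parameter list; here is a minimal sketch of `dummy_na`, a `prefix` dict, and `drop_first` on a hypothetical DataFrame:

```python
import pandas as pd

# dummy_na=True adds an extra indicator column for missing values.
s = pd.Series(["a", "b", None, "a"])
with_na = pd.get_dummies(s, dummy_na=True)
print(with_na.shape)  # (4, 3): columns 'a', 'b', and NaN

# A prefix dict renames the dummies per source column; drop_first=True keeps
# k-1 of k levels, mirroring OneHotEncoder(drop='first').
df = pd.DataFrame({"gender": ["M", "F", "F"], "grade": ["A", "B", "A"]})
dummies = pd.get_dummies(df, prefix={"gender": "sex", "grade": "g"}, drop_first=True)
print(dummies.columns.tolist())  # ['sex_M', 'g_B']
```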

3. LabelEncoder

sklearn.preprocessing.LabelEncoder()

Encode target labels with value between 0 and n_classes-1.

This transformer should be used to encode target values, i.e. y, and
not the input X.

Read more in the :ref:`User Guide <preprocessing_targets>`.

.. versionadded:: 0.12

Attributes

classes_ : ndarray of shape (n_classes,)
Holds the label for each class.

Examples

LabelEncoder can be used to normalize labels.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit([1, 2, 2, 6])
LabelEncoder()

le.classes_
array([1, 2, 6])

le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])

le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also be used to transform non-numerical labels (as long as they are
hashable and comparable) to numerical labels.

le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()

list(le.classes_)
['amsterdam', 'paris', 'tokyo']

le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])

list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

4. map

e.g.:

data['gender'].map({'M': 1, 'F': 2})
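One caveat worth showing: `map` silently turns any category missing from the mapping into NaN, so check coverage first. A minimal sketch with a hypothetical Series:

```python
import pandas as pd

# Hypothetical gender column; 'X' has no entry in the mapping.
gender = pd.Series(["M", "F", "X"])
mapped = gender.map({"M": 1, "F": 2})
print(mapped.isna().sum())  # 1 -- the unmapped category became NaN
```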

Discretization

Equal-width binning

pandas.cut
e.g.:
d1 = pd.cut(data, k, labels=range(k))  # each bin spans roughly the same range of values

Equal-frequency binning

d1 = pd.qcut(data, k, labels=range(k))  # each bin contains roughly the same number of samples
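To see the difference between the two schemes, a minimal sketch on a hypothetical series with one outlier:

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5, 100])  # one large outlier
k = 3

# Equal-width: bins span equal value ranges, so the outlier sits alone in the top bin.
equal_width = pd.cut(data, k, labels=range(k))
# Equal-frequency: each bin holds roughly the same number of samples.
equal_freq = pd.qcut(data, k, labels=range(k))

print(equal_width.tolist())  # [0, 0, 0, 0, 0, 2]
print(equal_freq.tolist())   # [0, 0, 1, 1, 2, 2]
```

Equal-width binning is sensitive to outliers; equal-frequency binning keeps bins balanced at the cost of unequal bin widths.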

Binarization (Binarizer)

Usage:
Binarizer(*, threshold=0.0, copy=True)
Docstring:
Binarize data (set feature values to 0 or 1) according to a threshold.
e.g.:

from sklearn.preprocessing import Binarizer

# Binarize with threshold 10: values greater than 10 become 1, values of 10 or below become 0.
# Binarizer expects 2-D input, so select the column as a one-column DataFrame.
scaler = Binarizer(threshold=10)  # instantiate
scaler.fit_transform(data[["xxx"]])
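Since `data` above is a placeholder, here is a self-contained sketch with a hypothetical 2-D array:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[5.0], [10.0], [15.0]])  # Binarizer requires a 2-D array
scaler = Binarizer(threshold=10)
binary = scaler.fit_transform(X)
print(binary.ravel())  # [0. 0. 1.] -- only values strictly above 10 become 1
```

Note that a value exactly equal to the threshold maps to 0, since the rule is strictly greater-than.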

Author: ITLiu_JH