One-Hot Encoding: The Standard Approach for Categorical Features
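Categorical feature: for example, the color of a flower: yellow, red, green.
One-hot encoding: an encoding scheme that uses as many bits as there are states (also called category values); exactly one bit is 1 and all the others are 0.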
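To make the definition concrete, here is a minimal sketch of one-hot encoding written by hand, without pandas. The helper name `one_hot` and the fixed category order are assumptions made for illustration only.

```python
def one_hot(value, categories):
    """Return a list with one bit per category; only the bit for `value` is 1."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# The three states from the flower-color example; the order is arbitrary.
colors = ["yellow", "red", "green"]

# Map each color to its bit pattern.
encoded = {c: one_hot(c, colors) for c in colors}

for color, bits in encoded.items():
    print(color, bits)
```

Three categories give three bits per value, and every encoded value contains exactly one 1.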
One-hot encoding:
Pandas offers a convenient function called get_dummies to get one-hot encodings. Call it like this:
import pandas as pd
one_hot_encoded_data = pd.get_dummies(data)  # data: a DataFrame that contains categorical columns
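As a small illustration of the call above, the sketch below runs get_dummies on a toy DataFrame. The column names and values are made up for the example. Depending on the pandas version, the indicator columns hold 0/1 integers or booleans.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column.
data = pd.DataFrame({
    "color": ["yellow", "red", "green", "red"],
    "petal_length": [1.4, 4.7, 5.1, 4.5],
})

one_hot_encoded_data = pd.get_dummies(data)

# The text column is replaced by one indicator column per category;
# the numeric column passes through unchanged.
print(one_hot_encoded_data.columns.tolist())
print(one_hot_encoded_data)
```

Each row has exactly one of the color_* indicator columns set.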
help(pd.get_dummies)
Help on function get_dummies in module pandas.core.reshape.reshape:
get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
Convert categorical variable into dummy/indicator variables
Parameters
----------
data : array-like, Series, or DataFrame
prefix : string, list of strings, or dict of strings, default None
String to append DataFrame column names.
Pass a list with length equal to the number of columns
when calling get_dummies on a DataFrame. Alternatively, `prefix`
can be a dictionary mapping column names to prefixes.
prefix_sep : string, default '_'
If appending prefix, separator/delimiter to use. Or pass a
list or dictionary as with `prefix.`
dummy_na : bool, default False
Add a column to indicate NaNs, if False NaNs are ignored.
columns : list-like, default None
Column names in the DataFrame to be encoded.
If `columns` is None then all the columns with
`object` or `category` dtype will be converted.
sparse : bool, default False
Whether the dummy columns should be sparse or not. Returns
SparseDataFrame if `data` is a Series or if all columns are included.
Otherwise returns a DataFrame with some SparseBlocks.
drop_first : bool, default False
Whether to get k-1 dummies out of k categorical levels by removing the
first level.
.. versionadded:: 0.18.0
dtype : dtype, default np.uint8
Data type for new columns. Only a single dtype is allowed.
.. versionadded:: 0.23.0
Returns
-------
dummies : DataFrame or SparseDataFrame
Examples
--------
>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
>>> s1 = ['a', 'b', np.nan]
>>> pd.get_dummies(s1)
   a  b
0  1  0
1  0  1
2  0  0
>>> pd.get_dummies(s1, dummy_na=True)
   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1
>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
...                    'C': [1, 2, 3]})
>>> pd.get_dummies(df, prefix=['col1', 'col2'])
   C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1       1       0       0       1       0
1  2       0       1       1       0       0
2  3       1       0       0       0       1
>>> pd.get_dummies(pd.Series(list('abcaa')))
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  1  0  0
>>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
   b  c
0  0  0
1  1  0
2  0  1
3  0  0
4  0  0
>>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
See Also
--------
Series.str.get_dummies
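The help text has no example for the `columns` parameter, so here is a small sketch. By default only `object` or `category` columns are converted, which means a category stored as integer codes is left alone unless it is listed explicitly. The DataFrame and the column name `size_code` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["yellow", "red", "green"],
    "size_code": [1, 2, 1],  # integer codes that really stand for categories
})

# Default: only the text column is encoded; size_code stays numeric.
default_encoded = pd.get_dummies(df)

# Explicit: name the columns to encode, including the integer-coded one.
# dtype=int is passed so the indicators are 0/1 regardless of the version default.
explicit_encoded = pd.get_dummies(df, columns=["color", "size_code"], dtype=int)

print(default_encoded.columns.tolist())
print(explicit_encoded.columns.tolist())
```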
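align: when the training and test data are one-hot encoded separately, the two results can end up with different columns; align makes them share the same columns.
final_train_predictors, final_test_predictors = one_hot_encoded_training_data_predictors.align(
    one_hot_encoded_test_data_predictors, join='left', axis=1, fill_value=0)
# axis=1: align on columns
# join='left': keep exactly the columns from our training data
# fill_value=0: fill positions that have no value after alignment with 0 (the default fill is NaN)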
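The effect of this align call is easier to see on toy data. In the sketch below the test set contains a color that never appears in training and lacks two that do. The frames and their values are invented for illustration.

```python
import pandas as pd

train = pd.DataFrame({"color": ["yellow", "red", "green"]})
test = pd.DataFrame({"color": ["red", "blue"]})  # "blue" is unseen in training

# Encode each set separately; dtype=int keeps the indicators as 0/1.
train_encoded = pd.get_dummies(train, dtype=int)
test_encoded = pd.get_dummies(test, dtype=int)

# Keep the training columns, add missing ones to the test set filled with 0,
# and drop test-only columns such as color_blue.
final_train, final_test = train_encoded.align(
    test_encoded, join="left", axis=1, fill_value=0)

print(final_train.columns.tolist())
print(final_test)
```

After alignment both frames have the same columns in the same order, so a model fitted on the training frame can be applied to the test frame. The "blue" row ends up as all zeros.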
#align
help(one_hot_encoded_X.align)
Help on method align in module pandas.core.frame:
align(self, other, join='outer', axis=None, level=None, copy=True, fill_value=None, method=None, limit=None, fill_axis=0, broadcast_axis=None) method of pandas.core.frame.DataFrame instance
Align two objects on their axes with the
specified join method for each axis Index
Parameters
----------
other : DataFrame or Series
join : {'outer', 'inner', 'left', 'right'}, default 'outer'
axis : allowed axis of the other object, default None
Align on index (0), columns (1), or both (None)
level : int or level name, default None
Broadcast across a level, matching Index values on the
passed MultiIndex level
copy : boolean, default True
Always returns new objects. If copy=False and no reindexing is
required then original objects are returned.
fill_value : scalar, default np.NaN
Value to use for missing values. Defaults to NaN, but can be any
"compatible" value
method : str, default None
limit : int, default None
fill_axis : {0 or 'index', 1 or 'columns'}
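To compare the four `join` options listed in the help text, the following sketch aligns two small frames on their columns and records which columns survive in each case. The frames `left` and `right` are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"a": [1], "b": [2]})
right = pd.DataFrame({"b": [3], "c": [4]})

columns_by_join = {}
for how in ["outer", "inner", "left", "right"]:
    # Both returned frames share the same columns, so inspecting one is enough.
    aligned_left, aligned_right = left.align(right, join=how, axis=1, fill_value=0)
    columns_by_join[how] = list(aligned_left.columns)
    print(how, columns_by_join[how])
```

'outer' keeps the union of the columns, 'inner' the intersection, and 'left' or 'right' the columns of the calling or the other frame respectively.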