[category_encoders] Encoding Methods for Categorical Features

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Use a font with CJK glyphs and keep the minus sign rendering correctly
plt.rcParams["font.sans-serif"] = ["FangSong"]
plt.rcParams["axes.unicode_minus"] = False
import warnings
warnings.filterwarnings("ignore")  # silence library warnings in the notebook
import category_encoders as ce

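Unsupervised, non-expanding (one column in, one column out): OrdinalEncoder, CountEncoder
Supervised, non-expanding (uses the target y): TargetEncoder, LeaveOneOutEncoder, CatBoostEncoder, WOEEncoder
Unsupervised, expanding (one column becomes several): OneHotEncoder, BinaryEncoder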

dir(ce)
['BackwardDifferenceEncoder',
 'BaseNEncoder',
 'BinaryEncoder',
 'CatBoostEncoder',
 'CountEncoder',
 'GLMMEncoder',
 'HashingEncoder',
 'HelmertEncoder',
 'JamesSteinEncoder',
 'LeaveOneOutEncoder',
 'MEstimateEncoder',
 'OneHotEncoder',
 'OrdinalEncoder',
 'PolynomialEncoder',
 'SumEncoder',
 'TargetEncoder',
 'WOEEncoder',
 '__all__',
 '__author__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'backward_difference',
 'basen',
 'binary',
 'cat_boost',
 'count',
 'glmm',
 'hashing',
 'helmert',
 'james_stein',
 'leave_one_out',
 'm_estimate',
 'one_hot',
 'ordinal',
 'polynomial',
 'sum_coding',
 'target_encoder',
 'utils',
 'woe']

# np.array() coerces the mixed rows to strings, so 'Type' holds '10', '20', '30' as text
X = pd.DataFrame(np.array([['male',10],['female', 20], ['male',10], 
                       ['female',20],['female',10],['female',30],['male',10]]),
             columns = ['Sex','Type'])
y = np.array([0,1,1,0,1,0,1])
X
      Sex Type
0    male   10
1  female   20
2    male   10
3  female   20
4  female   10
5  female   30
6    male   10

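OrdinalEncoder: ordinal encoding

Comparable to LabelEncoder in sklearn: each category is replaced by an integer code.
It does not encode missing values in a useful way (NaN is mapped to the sentinel -2 in the mapping below), so fill them beforehand.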
encoder = ce.OrdinalEncoder(cols = ['Sex', 'Type']).fit(X,y)
encoder
OrdinalEncoder(cols=['Sex', 'Type'],
               mapping=[{'col': 'Sex', 'data_type': dtype('O'),
                         'mapping': male      1
female    2
NaN      -2
dtype: int64},
                        {'col': 'Type', 'data_type': dtype('O'),
                         'mapping': 10     1
20     2
30     3
NaN   -2
dtype: int64}])
encoder.transform(X) 
   Sex  Type
0    1     1
1    2     2
2    1     1
3    2     2
4    2     1
5    2     3
6    1     1
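
The codes in this table can be reproduced without the library. This is a minimal sketch, assuming codes are handed out in order of first appearance starting at 1, which is what the printed mapping suggests. `pd.factorize` is a pandas function and not part of category_encoders.

```python
import pandas as pd

X = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "female", "female", "male"],
    "Type": ["10", "20", "10", "20", "10", "30", "10"],
})

# factorize() numbers categories from 0 in order of first appearance;
# adding 1 gives the 1-based codes seen in the mapping.
ordinal = pd.DataFrame({c: pd.factorize(X[c])[0] + 1 for c in X.columns})
print(ordinal)
```

The integers carry no meaning beyond identity, so a linear model will read an order into them that the categories may not have.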

OneHotEncoder: one-hot encoding

Equivalent to get_dummies in pandas.
encoder = ce.OneHotEncoder(cols = ['Sex', 'Type'],drop_invariant= True).fit(X,y)
encoder.transform(X) 
   Sex_1  Sex_2  Type_1  Type_2  Type_3
0      1      0       1       0       0
1      0      1       0       1       0
2      1      0       1       0       0
3      0      1       0       1       0
4      0      1       1       0       0
5      0      1       0       0       1
6      1      0       1       0       0
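
For comparison, here is the pandas route mentioned above, as a small sketch on the same data. The only visible difference is naming: the table above has columns suffixed with the ordinal code (Sex_1 is male), while `get_dummies` uses the category value and sorts the categories.

```python
import pandas as pd

X = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "female", "female", "male"],
    "Type": ["10", "20", "10", "20", "10", "30", "10"],
})

# dtype=int gives 0/1 columns instead of booleans
dummies = pd.get_dummies(X, columns=["Sex", "Type"], dtype=int)
print(dummies)
```

Every row has exactly one 1 per original column, so the row sums here are always 2.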

TargetEncoder: target encoding

Reference: https://zhuanlan.zhihu.com/p/119093636

encoder = ce.TargetEncoder(cols = ['Sex', 'Type'],drop_invariant= True).fit(X,y)
encoder.transform(X)
        Sex      Type
0  0.655314  0.741531
1  0.503388  0.519210
2  0.655314  0.741531
3  0.503388  0.519210
4  0.503388  0.741531
5  0.503388  0.571429
6  0.655314  0.741531
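
The values above are not plain category means (male would be 2/3). They are consistent with a sigmoid-weighted blend of the category mean and the global mean. The sketch below reproduces them under two assumptions that are inferred from the table, not taken from the library source: `min_samples_leaf=1` and `smoothing=1.0`, and a category seen only once falls back to the global mean. Other versions of category_encoders may use different defaults and give different numbers. The function name `smoothed_target_mean` is hypothetical.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "female", "female", "male"],
    "Type": ["10", "20", "10", "20", "10", "30", "10"],
})
y = pd.Series([0, 1, 1, 0, 1, 0, 1])

def smoothed_target_mean(col, y, min_samples_leaf=1, smoothing=1.0):
    """Blend each category's target mean with the global mean (the prior)."""
    prior = y.mean()
    stats = y.groupby(col).agg(["count", "mean"])
    # Sigmoid weight: close to 1 for frequent categories, smaller for rare ones
    weight = 1.0 / (1.0 + np.exp(-(stats["count"] - min_samples_leaf) / smoothing))
    encoded = prior * (1.0 - weight) + stats["mean"] * weight
    # Assumption: a category with a single row is replaced by the prior
    encoded[stats["count"] == 1] = prior
    return col.map(encoded)

target_encoded = pd.DataFrame({c: smoothed_target_mean(X[c], y) for c in X.columns})
print(target_encoded.round(6))
```

For male, the weight is 1 / (1 + exp(-2)) ≈ 0.8808, so the encoding is 0.5714 × 0.1192 + 0.6667 × 0.8808 ≈ 0.6553.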

BinaryEncoder: binary encoding

encoder = ce.BinaryEncoder(cols = ['Sex', 'Type']).fit(X,y)
encoder.transform(X)
   Sex_0  Sex_1  Type_0  Type_1  Type_2
0      0      1       0       0       1
1      1      0       0       1       0
2      0      1       0       0       1
3      1      0       0       1       0
4      1      0       0       0       1
5      1      0       0       1       1
6      0      1       0       0       1

BaseNEncoder: base-N encoding

With the default base=2 the result is the same as BinaryEncoder.

encoder = ce.BaseNEncoder(cols = ['Sex', 'Type']).fit(X,y)
encoder.transform(X)
   Sex_0  Sex_1  Type_0  Type_1  Type_2
0      0      1       0       0       1
1      1      0       0       1       0
2      0      1       0       0       1
3      1      0       0       1       0
4      1      0       0       0       1
5      1      0       0       1       1
6      0      1       0       0       1
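
Both the binary and the base-N tables read as the ordinal code of each category written out digit by digit (Type 30 has code 3, which is 011). A minimal pure-Python sketch of that digit expansion follows. The helper `to_base_n` is hypothetical, and the column width is passed in by hand to match the tables instead of being derived the way the library does it.

```python
def to_base_n(code, base=2, width=3):
    """Digits of a non-negative integer in the given base, left-padded to `width`."""
    digits = []
    while code > 0:
        code, remainder = divmod(code, base)
        digits.append(remainder)
    # Pad with zeros, then put the most significant digit first
    digits += [0] * (width - len(digits))
    return digits[::-1]

# Ordinal codes as in the OrdinalEncoder mapping for 'Type'
type_codes = {"10": 1, "20": 2, "30": 3}
binary_rows = {value: to_base_n(code, base=2, width=3) for value, code in type_codes.items()}
print(binary_rows)
```

A larger base needs fewer columns for the same number of categories: in base 3, code 3 is written as the two digits 1 0.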

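LeaveOneOutEncoder: leave-one-out encoding

Similar to target encoding: each category is replaced by its mean target value. When encoding training rows together with y, the row's own target is left out of the mean; transform(X) without y, as used here, gives the plain category mean.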
encoder = ce.LeaveOneOutEncoder(cols = ['Sex', 'Type']).fit(X,y)

encoder.transform(X)
        Sex      Type
0  0.666667  0.750000
1  0.500000  0.500000
2  0.666667  0.750000
3  0.500000  0.500000
4  0.500000  0.750000
5  0.500000  0.571429
6  0.666667  0.750000
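
The two modes can be sketched in pandas as follows. The function names are hypothetical, and the fallback to the global mean for a category with a single row is an assumption read off the table (Type 30 shows 0.571429, which is 4/7).

```python
import pandas as pd

X = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "female", "female", "male"],
    "Type": ["10", "20", "10", "20", "10", "30", "10"],
})
y = pd.Series([0, 1, 1, 0, 1, 0, 1])

def leave_one_out(col, y):
    """Category target mean per row, excluding that row's own target."""
    group = y.groupby(col)
    total, count = group.transform("sum"), group.transform("count")
    loo = (total - y) / (count - 1)
    # A single-row category gives 0/0 (NaN); use the global mean instead (assumption)
    return loo.fillna(y.mean())

def category_mean(col, y):
    """Plain category mean, with single-row categories set to the global mean."""
    group = y.groupby(col)
    mean, count = group.transform("mean"), group.transform("count")
    return mean.where(count > 1, y.mean())

loo_sex = leave_one_out(X["Sex"], y)
mean_type = category_mean(X["Type"], y)
print(loo_sex.round(6).tolist())
print(mean_type.round(6).tolist())
```

Row 0 is male with y = 0, so its leave-one-out value is (2 - 0) / (3 - 1) = 1.0, while the plain male mean is 2/3.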

HashingEncoder: hash encoding

encoder = ce.HashingEncoder(cols = ['Sex', 'Type']).fit(X,y)
encoder.transform(X)
   col_0  col_1  col_2  col_3  col_4  col_5  col_6  col_7
0      1      0      0      0      0      1      0      0
1      0      0      0      0      1      1      0      0
2      1      0      0      0      0      1      0      0
3      0      0      0      0      1      1      0      0
4      1      0      0      0      0      1      0      0
5      0      0      0      0      0      1      0      1
6      1      0      0      0      0      1      0      0
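
The idea behind this output is the hashing trick: every value from every encoded column is hashed into one of a fixed number of shared buckets, and the bucket counter is incremented. The sketch below illustrates that idea only. The choice of md5 and of 8 buckets are assumptions (the table has 8 columns), the helper `hash_row` is hypothetical, and it is not claimed to land values in the same columns as the library.

```python
import hashlib

def hash_row(values, n_components=8):
    """Hashing trick: each value increments one of n_components buckets."""
    buckets = [0] * n_components
    for value in values:
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        buckets[int(digest, 16) % n_components] += 1
    return buckets

rows = [("male", "10"), ("female", "20"), ("female", "30")]
hashed = [hash_row(row) for row in rows]
for row, vector in zip(rows, hashed):
    print(row, vector)
```

Since the buckets are shared, different values can collide. In the table above, male and female both appear to fall into col_5, so the Sex column is not recoverable from this encoding.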

CatBoostEncoder: CatBoost target encoding

encoder = ce.CatBoostEncoder(cols = ['Sex', 'Type']).fit(X,y)
encoder.transform(X)
        Sex      Type
0  0.642857  0.714286
1  0.514286  0.523810
2  0.642857  0.714286
3  0.514286  0.523810
4  0.514286  0.714286
5  0.514286  0.571429
6  0.642857  0.714286
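
The table is consistent with (category sum + prior × a) / (category count + a), where the prior is the global target mean and a = 1. The sketch below reproduces it and also shows the ordered variant that CatBoost-style encoding uses on training rows, where each row only sees the targets of earlier rows in the same category. The function names, the value a = 1, and the fallback to the prior for single-row categories are assumptions inferred from the table.

```python
import pandas as pd

X = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "female", "female", "male"],
    "Type": ["10", "20", "10", "20", "10", "30", "10"],
})
y = pd.Series([0, 1, 1, 0, 1, 0, 1])

def catboost_transform(col, y, a=1.0):
    """Smoothed category mean using all rows (what the table shows)."""
    prior = y.mean()
    stats = y.groupby(col).agg(["sum", "count"])
    encoded = (stats["sum"] + prior * a) / (stats["count"] + a)
    # Assumption: a category with a single row is replaced by the prior
    encoded[stats["count"] == 1] = prior
    return col.map(encoded)

def ordered_target_statistic(col, y, a=1.0):
    """Same formula, but each row uses only earlier rows of its category."""
    prior = y.mean()
    group = y.groupby(col)
    prev_sum = group.cumsum() - y   # target sum over earlier rows
    prev_count = group.cumcount()   # number of earlier rows
    return (prev_sum + prior * a) / (prev_count + a)

cb_all = pd.DataFrame({c: catboost_transform(X[c], y) for c in X.columns})
cb_ordered_sex = ordered_target_statistic(X["Sex"], y)
print(cb_all.round(6))
print(cb_ordered_sex.round(6).tolist())
```

The ordered variant depends on row order, so shuffling the training data changes the encoded values.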

CountEncoder: count (frequency) encoding

encoder = ce.CountEncoder(cols = ['Sex', 'Type']).fit(X,y)
encoder.transform(X)
   Sex  Type
0    3     4
1    4     2
2    3     4
3    4     2
4    4     4
5    4     1
6    3     4
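
Each value is simply replaced by how often it occurs in the fitted data. An equivalent pandas sketch on the same data:

```python
import pandas as pd

X = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "female", "female", "male"],
    "Type": ["10", "20", "10", "20", "10", "30", "10"],
})

# value_counts() gives the frequency per category; map() looks it up for every row
counts = pd.DataFrame({c: X[c].map(X[c].value_counts()) for c in X.columns})
print(counts)
```

Categories that happen to share a count become indistinguishable after this encoding.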

WOEEncoder: weight-of-evidence encoding

encoder = ce.WOEEncoder(cols = ['Sex', 'Type']).fit(X,y)
encoder.transform(X)
        Sex      Type
0  0.223144  0.510826
1 -0.182322 -0.182322
2  0.223144  0.510826
3 -0.182322 -0.182322
4 -0.182322  0.510826
5 -0.182322  0.000000
6  0.223144  0.510826
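
Weight of evidence is the log of the ratio between a category's share of positives and its share of negatives. The numbers above are consistent with a regularized form of that ratio, sketched below in plain Python. The regularization value 1.0 and the rule that a category with a single row gets 0 are assumptions inferred from the table, and `woe_encode` is a hypothetical helper.

```python
import math
from collections import Counter

sex = ["male", "female", "male", "female", "female", "female", "male"]
typ = ["10", "20", "10", "20", "10", "30", "10"]
y = [0, 1, 1, 0, 1, 0, 1]

def woe_encode(values, y, regularization=1.0):
    """Regularized weight of evidence per category for a binary target."""
    total_pos = sum(y)
    total_neg = len(y) - total_pos
    count = Counter(values)
    pos = Counter(v for v, t in zip(values, y) if t == 1)
    mapping = {}
    for value, n in count.items():
        if n <= 1:
            mapping[value] = 0.0  # assumption: too little evidence, so neutral
            continue
        share_pos = (pos[value] + regularization) / (total_pos + 2 * regularization)
        share_neg = (n - pos[value] + regularization) / (total_neg + 2 * regularization)
        mapping[value] = math.log(share_pos / share_neg)
    return [mapping[v] for v in values]

woe_sex = woe_encode(sex, y)
woe_type = woe_encode(typ, y)
print([round(v, 6) for v in woe_sex])
print([round(v, 6) for v in woe_type])
```

For male: (2 + 1) / (4 + 2) = 0.5 and (1 + 1) / (3 + 2) = 0.4, so the encoding is ln(0.5 / 0.4) ≈ 0.2231. Positive values mean the category leans toward y = 1.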
