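Handling non-numeric features in sklearn
sklearn offers two ways to handle non-numeric features.
1. The first is OrdinalEncoder. This estimator converts each categorical feature into a new integer-valued feature (0 to n_categories - 1).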
from sklearn import preprocessing
encoder1=preprocessing.OrdinalEncoder()
X= [[23,'male', 'from US', 'uses Safari'], [26,'female', 'from Europe', 'uses Firefox'],[27,'female', 'from Asia', 'uses Google']]  # the dataset has only three samples
encoder1.fit(X)  # fit the encoder first
encoder1.transform(X)  # use the fitted encoder to transform the samples
array([[0., 1., 2., 2.],
[1., 0., 1., 0.],
[2., 0., 0., 1.]])
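To see where these integers come from, here is a minimal pure-Python sketch of the same idea. The helper `ordinal_encode` is hypothetical (it is not part of sklearn); it assumes each column's distinct values are sorted and numbered from 0, which is how the default `categories='auto'` setting orders them.

```python
def ordinal_encode(rows):
    """Hypothetical helper: number each column's sorted unique values 0..n-1."""
    columns = list(zip(*rows))
    # One value -> code dictionary per column, built from the sorted unique values.
    mappings = [
        {value: float(index) for index, value in enumerate(sorted(set(column)))}
        for column in columns
    ]
    encoded = [
        [mappings[j][value] for j, value in enumerate(row)]
        for row in rows
    ]
    return encoded, mappings


X = [[23, 'male', 'from US', 'uses Safari'],
     [26, 'female', 'from Europe', 'uses Firefox'],
     [27, 'female', 'from Asia', 'uses Google']]
encoded, mappings = ordinal_encode(X)
print(encoded)      # same values as the array above
print(mappings[1])  # mapping for the gender column
```

Note that the integer codes imply an order ('from Asia' < 'from Europe' < 'from US') that the original categories do not have, which is one reason to consider one-hot encoding instead.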
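2. The other is OneHotEncoder. It creates one binary column for every possible value of every feature: the column is 1 if the sample has that value and 0 otherwise. One-hot encoded feature vectors can therefore become very long.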
encoder2=preprocessing.OneHotEncoder()
encoder2.fit(X)
encoder2.transform(X).toarray()
array([[1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1.],
[0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0.],
[0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0.]])
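As you can see, the length of the encoded feature vector is the sum of the number of possible values over all features (here 3 + 2 + 3 + 3 = 11).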
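The width of the encoding can be checked against the fitted encoder's `categories_` attribute. The sketch below is an illustration only: it uses just the three string columns of the sample data, and the names `X_demo`, `enc_demo`, `sizes` and `width` are made up for this example.

```python
from sklearn.preprocessing import OneHotEncoder

# Only the string columns of the sample data (assumption made for brevity).
X_demo = [['male', 'from US', 'uses Safari'],
          ['female', 'from Europe', 'uses Firefox'],
          ['female', 'from Asia', 'uses Google']]

enc_demo = OneHotEncoder().fit(X_demo)

# categories_ holds one array of learned values per input feature.
sizes = [len(cats) for cats in enc_demo.categories_]
width = enc_demo.transform(X_demo).shape[1]

print(sizes)              # number of distinct values per feature
print(sum(sizes), width)  # the two numbers agree
```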
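You can also use the categories parameter to list the possible values of each feature explicitly. For feature values that may not have appeared in the training data, set handle_unknown='ignore'. In the sklearn version used here this parameter exists only on OneHotEncoder; since version 0.24, OrdinalEncoder also accepts handle_unknown (with the value 'use_encoded_value').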
genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers],handle_unknown='ignore')
# Note that some categories listed for the 2nd and 3rd features never
# appear in the data below
X = [['male', 'from US', 'uses Safari'], ['female', 'from Suzhou', 'uses Firefox']]  # 'from Suzhou' is not in locations, so it is ignored
enc.fit(X)
enc.transform([['female', 'from Suzhou', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0., 1., 0., 0., 0.]])
As you can see, the encoded feature vector still has length 10 (2 + 4 + 4); "from Suzhou" is ignored, so all four location columns are 0.
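One way to confirm that the unknown value was dropped is to decode the row again with `inverse_transform`. This is a small sketch under the same category lists; the variable names `enc_check`, `row` and `decoded` are made up for the illustration, and the encoder is fitted on rows that contain only known values.

```python
from sklearn.preprocessing import OneHotEncoder

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']

enc_check = OneHotEncoder(categories=[genders, locations, browsers],
                          handle_unknown='ignore')
enc_check.fit([['male', 'from US', 'uses Safari'],
               ['female', 'from Asia', 'uses Firefox']])

# 'from Suzhou' is not a listed location, so its four columns stay at zero.
row = enc_check.transform([['female', 'from Suzhou', 'uses Chrome']]).toarray()
print(row[0, 2:6])

# Decoding an all-zero block yields None for that feature.
decoded = enc_check.inverse_transform(row)
print(decoded)
```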
Below we walk through a complete example.
import pandas as pd
from sklearn import preprocessing
from sklearn import svm
data=pd.read_csv("data/breast-cancer/breast-cancer.data",names=["class","age","menopause","tumor-size","inv-nodes","node-caps","deg-malig","breast","breast-quad","irradiat"])  # no stray spaces in the column names
data.head()  # breast cancer data
| | class | age | menopause | tumor-size | inv-nodes | node-caps | deg-malig | breast | breast-quad | irradiat |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | no-recurrence-events | 30-39 | premeno | 30-34 | 0-2 | no | 3 | left | left_low | no |
| 1 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2 | right | right_up | no |
| 2 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2 | left | left_low | no |
| 3 | no-recurrence-events | 60-69 | ge40 | 15-19 | 0-2 | no | 2 | right | left_up | no |
| 4 | no-recurrence-events | 40-49 | premeno | 0-4 | 0-2 | no | 2 | right | right_low | no |
First encoding (OrdinalEncoder)
train_num=int(0.75*len(data))
print(train_num)
encoder=preprocessing.OrdinalEncoder()
encoder.fit(data.iloc[:,:-1])
num_data=encoder.transform(data.iloc[:,:-1])
num_data
214
array([[0., 1., 2., ..., 2., 0., 2.],
[0., 2., 2., ..., 1., 1., 5.],
[0., 2., 2., ..., 1., 0., 2.],
...,
[1., 4., 0., ..., 0., 1., 3.],
[1., 2., 0., ..., 2., 0., 2.],
[1., 3., 0., ..., 2., 0., 2.]])
## data preprocessing
#min_max_scaler=preprocessing.MinMaxScaler()
#num_data=min_max_scaler.fit_transform(num_data)
num_data=preprocessing.scale(num_data)
print(num_data.mean(axis=0))
print(num_data.std(axis=0))
[ 4.96883032e-17 1.92542175e-16 1.52170429e-16 -1.73909061e-16
7.45324548e-17 -2.23597364e-16 0.00000000e+00 -7.45324548e-17
-1.52170429e-16]
[1. 1. 1. 1. 1. 1. 1. 1. 1.]
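The near-zero means and unit standard deviations follow from what `preprocessing.scale` does by default: each column has its mean subtracted and is divided by its (population) standard deviation. A minimal check on a small made-up array (`demo` is hypothetical data, not the breast cancer set):

```python
import numpy as np
from sklearn import preprocessing

# Small made-up matrix; every column has non-zero variance.
demo = np.array([[0., 1., 2.],
                 [1., 0., 1.],
                 [2., 0., 0.],
                 [1., 2., 5.]])

# Standardize by hand: subtract the column mean, divide by the column std.
manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)
scaled = preprocessing.scale(demo)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

The means come out as tiny floating-point values rather than exact zeros, which is why the output above shows numbers on the order of 1e-16.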
## labels (the last column, irradiat, serves as the target here)
encoder.fit(data.iloc[:,-1:])
num_target=encoder.transform(data.iloc[:,-1:])
model=svm.SVC(gamma='scale',C=1,kernel='rbf')#sigmoid,rbf,poly,precomputed
model.fit(num_data[:train_num,:],num_target[:train_num,:].ravel())  # ravel() passes y as a 1d array, which avoids a DataConversionWarning
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
model.score(num_data[:train_num,:],num_target[:train_num,:])  # accuracy on the training rows themselves
0.8925233644859814
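The score above is measured on the same rows the model was trained on. To estimate how well the model generalizes, the rows after the split index can be scored instead. The sketch below shows the pattern on synthetic data (`X_demo`, `y_demo` and `split` are made up for the illustration, so the printed numbers say nothing about the breast cancer data).

```python
import numpy as np
from sklearn import svm

# Synthetic stand-in data: 40 samples, 3 features, label depends on feature 0.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

split = int(0.75 * len(X_demo))  # same 75% / 25% split rule as train_num

clf = svm.SVC(gamma='scale', C=1, kernel='rbf')
clf.fit(X_demo[:split], y_demo[:split])

train_acc = clf.score(X_demo[:split], y_demo[:split])  # rows seen in training
test_acc = clf.score(X_demo[split:], y_demo[split:])   # held-out rows
print(train_acc, test_acc)
```

With the variables from this article, the held-out score would be computed on `num_data[train_num:,:]` and `num_target[train_num:,:]`.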
Second encoding (OneHotEncoder)
encoder=preprocessing.OneHotEncoder(handle_unknown="ignore")
encoder.fit(data.iloc[:,:-1])
num_data=encoder.transform(data.iloc[:,:-1]).toarray()
num_data
array([[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 1.],
[1., 0., 0., ..., 0., 0., 0.],
...,
[0., 1., 0., ..., 1., 0., 0.],
[0., 1., 0., ..., 0., 0., 0.],
[0., 1., 0., ..., 0., 0., 0.]])
num_data=preprocessing.scale(num_data)
print(num_data.mean(axis=0))
print(num_data.std(axis=0))
[-1.43940803e-15 1.43940803e-15 -1.51927810e-16 -1.50229479e-16
6.67686574e-17 8.38490117e-17 1.26549897e-16 -2.18162706e-16
4.96883032e-17 -1.31596366e-16 -7.76379738e-18 3.72856369e-16
-2.74062047e-16 -1.86331137e-17 5.04646829e-18 -1.25773517e-16
-2.85707743e-16 2.36019440e-16 -3.84307970e-17 5.91989550e-18
2.86290028e-17 1.55275948e-16 1.41301112e-15 -1.58284419e-16
-3.33843287e-17 -1.33052078e-16 9.72027431e-16 -3.65286666e-16
5.23862228e-16 2.14863092e-16 -8.48583053e-16 -3.28796819e-16
4.28561615e-16 -1.27326277e-16 6.35078625e-16 2.40677719e-17
-2.40677719e-17 -8.35093455e-17 -4.93001133e-17 1.64592504e-16
3.10551895e-17 1.46735770e-16 -5.16292525e-17]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
## reuse the labels from before
model=svm.SVC(gamma="scale",C=1,kernel="rbf")
model.fit(num_data[:train_num,:],num_target[:train_num,:].ravel())  # ravel() passes y as a 1d array, which avoids a DataConversionWarning
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
model.score(num_data[:train_num,:],num_target[:train_num,:])  # accuracy on the training rows themselves
0.9018691588785047
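In the example above the encoder is fitted on all rows before the split. An alternative, sketched here as an assumption rather than as the approach used in this article, is to chain the encoder and the classifier in a pipeline and fit both on the training rows only; `handle_unknown='ignore'` then covers categories that first show up in the held-out rows. The tiny dataset is made up for the illustration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Made-up rows in the style of the age and breast columns.
X_train = [['30-39', 'left'], ['40-49', 'right'],
           ['40-49', 'left'], ['60-69', 'right']]
y_train = ['no', 'no', 'yes', 'yes']
X_test = [['50-59', 'left'], ['30-39', 'right']]  # '50-59' is unseen in training

pipe = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),  # unseen values become all-zero blocks
    SVC(gamma='scale', C=1, kernel='rbf'),
)
pipe.fit(X_train, y_train)  # encoder and classifier see the training rows only
pred = pipe.predict(X_test)
print(pred)
```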