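Continuing from the previous section: Machine Learning Example 2
In the previous section we split the data into a training set and a test set. This section focuses on feature processing. The features of the data set were described in detail in the previous section, so here we go straight to the processing: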
from sklearn.feature_extraction import DictVectorizer
dict_vect=DictVectorizer(sparse=False)
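What the code above does:
The sklearn.feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from data sets consisting of formats such as text and images.
DictVectorizer implements what is known as one-of-K or "one-hot" encoding for categorical (also called nominal or discrete) features. Categorical features are "attribute-value" pairs where the value is restricted to a list of discrete possibilities without ordering (e.g. topic identifiers, types of objects, tags, names…).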
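To make the one-hot idea concrete, here is a small hand-rolled sketch in plain Python. It is not scikit-learn's implementation. It only imitates the "name=value" column naming that DictVectorizer reports, and the helper one_hot_dicts and the toy records are made up for illustration.

```python
def one_hot_dicts(records):
    """Hand-rolled illustration of one-of-K encoding for a list of dicts.

    Numeric values keep a single column named after the key; string values
    get one 'key=value' column per distinct value.
    """
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0  # mark the active category
            else:
                row[index[key]] = float(value)      # pass numbers through
        rows.append(row)
    return names, rows


# Hypothetical toy records, not taken from the census data set.
toy = [
    {"age": 39, "sex": "Male"},
    {"age": 28, "sex": "Female"},
]
names, rows = one_hot_dicts(toy)
print(names)  # ['age', 'sex=Female', 'sex=Male']
print(rows)   # [[39.0, 0.0, 1.0], [28.0, 1.0, 0.0]]
```

Each categorical value becomes its own 0/1 column, while numeric features such as age stay as a single column.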
X_train=dict_vect.fit_transform(X_train.to_dict(orient='records'))
dict_vect.feature_names_
Output of the feature-processing code above:
['age',
'capital-gain',
'capital-loss',
'education-num',
'education= 10th',
'education= 11th',
'education= 12th',
'education= 1st-4th',
'education= 5th-6th',
'education= 7th-8th',
'education= 9th',
'education= Assoc-acdm',
'education= Assoc-voc',
'education= Bachelors',
'education= Doctorate',
'education= HS-grad',
'education= Masters',
'education= Preschool',
'education= Prof-school',
'education= Some-college',
'hours-per-week',
'marital-status= Divorced',
'marital-status= Married-AF-spouse',
'marital-status= Married-civ-spouse',
'marital-status= Married-spouse-absent',
'marital-status= Never-married',
'marital-status= Separated',
'marital-status= Widowed',
'native-country= Cambodia',
'native-country= Canada',
'native-country= China',
'native-country= Columbia',
'native-country= Cuba',
'native-country= Dominican-Republic',
'native-country= Ecuador',
'native-country= El-Salvador',
'native-country= England',
'native-country= France',
'native-country= Germany',
'native-country= Greece',
'native-country= Guatemala',
'native-country= Haiti',
'native-country= Holand-Netherlands',
'native-country= Honduras',
'native-country= Hong',
'native-country= Hungary',
'native-country= India',
'native-country= Iran',
'native-country= Ireland',
'native-country= Italy',
'native-country= Jamaica',
'native-country= Japan',
'native-country= Laos',
'native-country= Mexico',
'native-country= Nicaragua',
'native-country= Outlying-US(Guam-USVI-etc)',
'native-country= Peru',
'native-country= Philippines',
'native-country= Poland',
'native-country= Portugal',
'native-country= Puerto-Rico',
'native-country= Scotland',
'native-country= South',
'native-country= Taiwan',
'native-country= Thailand',
'native-country= Trinadad&Tobago',
'native-country= United-States',
'native-country= Vietnam',
'native-country= Yugoslavia',
'occupation= Adm-clerical',
'occupation= Armed-Forces',
'occupation= Craft-repair',
'occupation= Exec-managerial',
'occupation= Farming-fishing',
'occupation= Handlers-cleaners',
'occupation= Machine-op-inspct',
'occupation= Other-service',
'occupation= Priv-house-serv',
'occupation= Prof-specialty',
'occupation= Protective-serv',
'occupation= Sales',
'occupation= Tech-support',
'occupation= Transport-moving',
'race= Amer-Indian-Eskimo',
'race= Asian-Pac-Islander',
'race= Black',
'race= Other',
'race= White',
'relationship= Husband',
'relationship= Not-in-family',
'relationship= Other-relative',
'relationship= Own-child',
'relationship= Unmarried',
'relationship= Wife',
'sex= Female',
'sex= Male',
'workclass= Federal-gov',
'workclass= Local-gov',
'workclass= Private',
'workclass= Self-emp-inc',
'workclass= Self-emp-not-inc',
'workclass= State-gov',
'workclass= Without-pay']
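We process the test set in the same way:
X_test=dict_vect.transform(X_test.to_dict(orient='records'))
One detail to note: we call fit_transform on the training set and then only transform on the test set, so that both data sets are encoded according to the same standard, namely the feature columns learned from the training set.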
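The fit/transform split can be illustrated with a tiny stand-in encoder. This is only a sketch under my own assumptions: the class TinyDictEncoder and the toy records are hypothetical, and the real DictVectorizer is more capable. The point it shows is that the columns are fixed by the training data, so the test rows always have the same width, and a category never seen during fitting simply gets no column.

```python
class TinyDictEncoder:
    """Minimal stand-in for a fit/transform encoder (illustrative only)."""

    def fit(self, records):
        # Learn the column layout from the training records only.
        names = set()
        for rec in records:
            for key, value in rec.items():
                names.add(self._name(key, value))
        self.feature_names_ = sorted(names)
        self._index = {n: i for i, n in enumerate(self.feature_names_)}
        return self

    def transform(self, records):
        # Encode records using the layout learned in fit().
        rows = []
        for rec in records:
            row = [0.0] * len(self.feature_names_)
            for key, value in rec.items():
                col = self._index.get(self._name(key, value))
                if col is not None:  # columns unseen during fit are skipped
                    row[col] = 1.0 if isinstance(value, str) else float(value)
            rows.append(row)
        return rows

    @staticmethod
    def _name(key, value):
        return f"{key}={value}" if isinstance(value, str) else key


# Hypothetical toy data: 'Other' appears only in the test record.
train = [{"age": 39, "race": "White"}, {"age": 50, "race": "Black"}]
test = [{"age": 28, "race": "Other"}]

enc = TinyDictEncoder().fit(train)
train_rows = enc.transform(train)
test_rows = enc.transform(test)
print(enc.feature_names_)  # ['age', 'race=Black', 'race=White']
print(test_rows)           # [[28.0, 0.0, 0.0]]
```

Had we fitted a second encoder on the test set, its columns could differ from the training columns and the model would receive inconsistent inputs.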
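That completes the feature processing. Here I choose two ensemble models, random forest and XGBoost, for training:
A random forest, as the name suggests, builds a forest in a randomized way. The forest consists of many decision trees, and the trees are built independently of one another. Once the forest is built, each new input sample is passed to every tree, each tree decides which class the sample belongs to (for classification), and the class chosen most often becomes the prediction for that sample.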
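The voting step described above can be sketched in a few lines. The function majority_vote and the list of votes are hypothetical, and this shows only the hard-voting idea from the text. As far as I know, scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so treat this as a conceptual sketch.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for a single sample.
votes = [" <=50K", " >50K", " <=50K", " <=50K", " >50K"]
winner = majority_vote(votes)
print(repr(winner))  # ' <=50K' (three of the five votes)
```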
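For machine learning classification problems outside deep learning, XGBoost is one of the most popular libraries. Because XGBoost scales well to large data sets and supports multiple languages, it is particularly useful in commercial settings.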
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
from xgboost import XGBClassifier
xgbc=XGBClassifier()
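First, cross-validation. The training data is split into k folds, and in each of the k rounds:
k-1 folds are used to train the model.
The remaining fold is used to validate the model.
The average of the values computed in this loop is the performance measure reported by k-fold cross-validation. This approach can be computationally expensive, but it does not waste much data (as happens when an arbitrary test set is held out).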
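The loop described above can be sketched in plain Python. This is a simplified illustration, not what cross_val_score does internally: the folds here are contiguous and unshuffled, and k_fold_indices, cross_val_mean and toy_score are names I made up for the example.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_mean(score_fn, n_samples, k):
    """Average score_fn(train_idx, valid_idx) over the k folds."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, valid_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(score_fn(train_idx, valid_idx))
    return sum(scores) / k


# Hypothetical scorer: a real one would fit a model on train_idx and
# return its accuracy on valid_idx. Here it just reports the train share.
def toy_score(train_idx, valid_idx):
    return len(train_idx) / (len(train_idx) + len(valid_idx))


folds = k_fold_indices(10, 5)
print(folds)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
mean_score = cross_val_mean(toy_score, 10, 5)
print(mean_score)
```

Each sample lands in the validation fold exactly once, which is why no data is wasted.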
from sklearn.model_selection import cross_val_score
cross_val_score(rfc,X_train,y_train,cv=5).mean()
0.8347554650214203
cross_val_score(xgbc,X_train,y_train,cv=5).mean()
0.8586709654097728
Next, train the models:
# Random forest prediction with default settings
rfc.fit(X_train,y_train)
rfc_y_predict=rfc.predict(X_test)
rfc.score(X_test,y_test)
Output:
0.840737302744994
xgbc.fit(X_train,y_train)
xgbc_y_predict=xgbc.predict(X_test)
xgbc.score(X_test,y_test)
Output:
0.8635459488131547
A look at the actual predictions:
print(rfc_y_predict[1:20])
print(xgbc_y_predict[1:20])
print(y_test)
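Note that the slice [1:20] returns 19 predictions (positions 1 through 19), so they line up with the rows of y_test starting from the second row. In the output printed below, XGBoost gets one of these 19 samples wrong, while the random forest gets three wrong.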
[' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K'
' >50K' ' <=50K' ' >50K' ' <=50K' ' <=50K' ' >50K' ' >50K' ' <=50K'
' <=50K' ' >50K' ' >50K']
[' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K'
' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' <=50K' ' >50K' ' <=50K'
' <=50K' ' <=50K' ' >50K']
32240 <=50K
16568 <=50K
16406 <=50K
13051 <=50K
29453 <=50K
31063 <=50K
6741 <=50K
29557 <=50K
32097 <=50K
19604 <=50K
30108 <=50K
20128 <=50K
13205 <=50K
2493 <=50K
1193 <=50K
31161 >50K
10252 <=50K
20591 <=50K
8290 >50K
25915 >50K
26877 <=50K
24098 <=50K
16900 <=50K
19801 <=50K
25334 <=50K
25366 <=50K
12910 <=50K
4579 <=50K
18281 <=50K
27193 <=50K
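Rather than comparing the printed lists by eye, the mismatches can be counted over the same positions in both sequences. The sketch below uses made-up labels and a hypothetical helper named mismatches. With the real data one would compare the prediction array against the values of y_test over the same slice.

```python
def mismatches(y_pred, y_true, start, stop):
    """Return the positions in [start, stop) where prediction and truth differ."""
    return [i for i in range(start, stop) if y_pred[i] != y_true[i]]


# Hypothetical labels, not the census results.
y_true = [" <=50K", " <=50K", " >50K", " <=50K", " >50K"]
y_pred = [" <=50K", " >50K", " >50K", " <=50K", " <=50K"]

wrong = mismatches(y_pred, y_true, 1, 5)
print(wrong)  # [1, 4]
print(len(wrong), "of", 5 - 1, "predictions differ")
```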
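Quantitatively:
from sklearn.metrics import classification_report
print('Random forest prediction accuracy:')
print(classification_report(y_test,rfc_y_predict,target_names=['result']))
print('XGBoost prediction accuracy:')
print(classification_report(y_test,xgbc_y_predict,target_names=['result']))
Output:
Random forest prediction accuracy:
             precision    recall  f1-score   support
     result       0.87      0.92      0.90      5613
avg / total       0.83      0.84      0.84      7541
XGBoost prediction accuracy:
             precision    recall  f1-score   support
     result       0.87      0.95      0.91      5613
avg / total       0.86      0.86      0.86      7541
Because only one name is passed in target_names, the single "result" row appears to describe the first class, ' <=50K' (5613 of the 7541 test samples), while the "avg / total" row covers both classes. Recent scikit-learn versions may reject a target_names list whose length differs from the number of classes.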
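For readers unfamiliar with the columns in the report, here is a small sketch of how precision, recall and F1 are computed for one class. The labels are made up and precision_recall_f1 is a hypothetical helper. The definitions are the standard ones, but this is not scikit-learn's code.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels, not the census results.
y_true = [" <=50K", " <=50K", " <=50K", " >50K", " >50K"]
y_pred = [" <=50K", " <=50K", " >50K", " >50K", " <=50K"]

p, r, f1 = precision_recall_f1(y_true, y_pred, positive=" <=50K")
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.67 0.67 0.67
```

Precision is the share of predicted ' <=50K' samples that really are ' <=50K'. Recall is the share of true ' <=50K' samples that the model found. F1 is their harmonic mean.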