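Today I was analyzing census income data, trying to work out what characterizes people who earn more than 50K a year, and along the way I ran into a parameter problem with cross-validation.
The data comes from the UCI repository (the adult.data file) and is loaded like this: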
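import pandas as pd

# adult.data has no header row, so supply the column names.
csv_file = pd.read_csv('adult.data', header=None, names=[
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'salary',
])
# Hand-picked feature columns; .copy() so later column assignments do not act on a view of csv_file.
x_data = csv_file[['age', 'workclass', 'education', 'education-num', 'sex', 'hours-per-week']].copy()
y_label = csv_file['salary']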
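pandas reads the file, and I split it into features and labels. From the many available columns I picked the ones I thought were most likely to affect income, so the data I end up with is 6-dimensional: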
age workclass education education-num sex \
0 39 State-gov Bachelors 13 Male
1 50 Self-emp-not-inc Bachelors 13 Male
2 38 Private HS-grad 9 Male
3 53 Private 11th 7 Male
4 28 Private Bachelors 13 Female
5 37 Private Masters 14 Female
6 49 Private 9th 5 Female
7 52 Self-emp-not-inc HS-grad 9 Male
8 31 Private Masters 14 Female
9 42 Private Bachelors 13 Male
10 37 Private Some-college 10 Male
11 30 State-gov Bachelors 13 Male
12 23 Private Bachelors 13 Female
13 32 Private Assoc-acdm 12 Male
14 40 Private Assoc-voc 11 Male
15 34 Private 7th-8th 4 Male
16 25 Self-emp-not-inc HS-grad 9 Male
17 32 Private HS-grad 9 Male
18 38 Private 11th 7 Male
19 43 Self-emp-not-inc Masters 14 Female
20 40 Private Doctorate 16 Male
21 54 Private HS-grad 9 Female
22 35 Federal-gov 9th 5 Male
23 43 Private 11th 7 Male
24 59 Private HS-grad 9 Female
25 56 Local-gov Bachelors 13 Male
26 19 Private HS-grad 9 Male
27 54 ? Some-college 10 Male
28 39 Private HS-grad 9 Male
29 49 Private HS-grad 9 Male
.. ... ... ... ... ...
111 38 Private Prof-school 15 Male
112 56 Self-emp-not-inc HS-grad 9 Male
113 28 Private Some-college 10 Female
114 36 Private HS-grad 9 Female
115 53 Private 9th 5 Male
116 56 Self-emp-inc Some-college 10 Male
117 49 Local-gov Assoc-voc 11 Male
118 55 Private Some-college 10 Male
119 22 Private HS-grad 9 Male
120 21 Private Some-college 10 Female
121 40 Private Bachelors 13 Male
122 30 Private Bachelors 13 Male
123 29 State-gov Bachelors 13 Male
124 19 Private Some-college 10 Male
125 47 Private Bachelors 13 Female
126 20 Private Some-college 10 Female
127 31 Private Assoc-acdm 12 Male
128 35 ? HS-grad 9 Male
129 39 Private Some-college 10 Male
130 28 Private Assoc-acdm 12 Female
131 24 Private HS-grad 9 Female
132 38 Self-emp-not-inc HS-grad 9 Male
133 37 Private Bachelors 13 Male
134 46 Private Assoc-acdm 12 Female
135 38 Federal-gov Masters 14 Male
136 43 Self-emp-not-inc HS-grad 9 Male
137 27 Private Assoc-voc 11 Female
138 20 Private Some-college 10 Male
139 49 Private Some-college 10 Male
140 61 Self-emp-inc HS-grad 9 Male
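As you can see, many of the columns are not numeric, which does not suit scikit-learn estimators, so they need to be converted to numeric data. Fortunately scikit-learn provides the preprocessing package, which makes this easy.
For string features we can use LabelEncoder, which maps each distinct string to an integer code. Note that these codes are just category indices, not magnitudes.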
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform refits the encoder on every call, so one instance can be reused per column.
x_data['workclass'] = le.fit_transform(x_data['workclass'].values)
x_data['sex'] = le.fit_transform(x_data['sex'].values)
x_data['education'] = le.fit_transform(x_data['education'].values)
age workclass education education-num sex hours-per-week
0 39 6 7 13 1 40
1 50 5 7 13 1 13
2 38 3 9 9 1 40
3 53 3 1 7 1 40
4 28 3 7 13 0 40
5 37 3 10 14 0 40
6 49 3 4 5 0 16
7 52 5 9 9 1 45
8 31 3 10 14 0 50
9 42 3 7 13 1 40
10 37 3 12 10 1 80
11 30 6 7 13 1 40
12 23 3 7 13 0 30
13 32 3 5 12 1 50
14 40 3 6 11 1 40
15 34 3 3 4 1 45
16 25 5 9 9 1 35
17 32 3 9 9 1 40
18 38 3 1 7 1 50
19 43 5 10 14 0 45
20 40 3 8 16 1 60
21 54 3 9 9 0 20
22 35 1 4 5 1 40
23 43 3 1 7 1 40
24 59 3 9 9 0 40
25 56 2 7 13 1 40
26 19 3 9 9 1 40
27 54 0 12 10 1 60
28 39 3 9 9 1 80
29 49 3 9 9 1 40
.. ... ... ... ... ... ...
111 38 3 11 15 1 40
112 56 5 9 9 1 50
113 28 3 12 10 0 25
114 36 3 9 9 0 40
115 53 3 4 5 1 50
116 56 4 12 10 1 50
117 49 2 6 11 1 40
118 55 3 12 10 1 56
119 22 3 9 9 1 41
120 21 3 12 10 0 40
121 40 3 7 13 1 60
122 30 3 7 13 1 40
123 29 6 7 13 1 50
124 19 3 12 10 1 35
125 47 3 7 13 0 40
126 20 3 12 10 0 28
127 31 3 5 12 1 40
128 35 0 9 9 1 40
129 39 3 12 10 1 40
130 28 3 5 12 0 60
131 24 3 9 9 0 40
132 38 5 9 9 1 35
133 37 3 7 13 1 50
134 46 3 5 12 0 36
135 38 1 10 14 1 40
136 43 5 9 9 1 60
137 27 3 6 11 0 35
138 20 3 12 10 1 20
139 49 3 12 10 1 40
140 61 4 9 9 1 40
In the end, all of the data has been converted to numeric values.
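To make the encoding step easier to inspect, here is a minimal sketch on a toy frame. The four rows and the `encoders` dictionary are illustrative assumptions, not part of the original script. Keeping one fitted encoder per column makes it possible to map the integer codes back to the original strings.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for a few rows of the adult data (illustrative values).
toy = pd.DataFrame({
    "workclass": ["State-gov", "Private", "Private", "?"],
    "sex": ["Male", "Female", "Male", "Male"],
})

encoders = {}          # hypothetical helper: one fitted encoder per column
encoded = toy.copy()
for col in ["workclass", "sex"]:
    enc = LabelEncoder()
    # Classes are sorted, then numbered 0..n_classes-1.
    encoded[col] = enc.fit_transform(toy[col].to_numpy())
    encoders[col] = enc

print(encoded)
# The learned mapping: position in classes_ is the integer code.
print(list(encoders["workclass"].classes_))
# Codes can be turned back into the original strings.
print(list(encoders["workclass"].inverse_transform([0, 1, 2])))
```

Because the codes follow sorted order, `?` ends up as 0 for `workclass`, which matches rows 27 and 128 in the table above.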
Next I use the chi-squared test to pick the 4 features that most influence the outcome.
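from sklearn.feature_selection import SelectKBest, chi2

# label is the binarized salary column (LabelBinarizer step in the full source); ravel() makes it 1-D.
sb = SelectKBest(chi2, k=4)
new_x = sb.fit_transform(x_data, label.ravel())
print(sb.scores_)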
new_x now holds the 4 highest-scoring features; you can print it to take a look.
[ 41.37096004 0.04822683 2.9466853 16.12471998 0.45100373
37.02336004]
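These are the chi-squared scores, one per feature in column order. The 4 highest are kept, which here means age, hours-per-week, education-num and education.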
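Reading the scores by position is error-prone, so here is a small sketch that pairs each score with a feature name. It runs on synthetic count data, and the names `signal_a`, `noise_a` and so on are made up for illustration. `get_support()` returns a boolean mask over the input columns.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(0)
n = 200
y = rng.randint(0, 2, size=n)

# Synthetic non-negative features (chi2 requires non-negative values):
# the "signal" columns depend on y, the "noise" columns do not.
feature_names = ["signal_a", "noise_a", "signal_b", "noise_b"]
X = np.column_stack([
    rng.poisson(5 + 10 * y),
    rng.poisson(5, size=n),
    rng.poisson(3 + 6 * y),
    rng.poisson(5, size=n),
])

selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)

# Pair each score with its column name, highest first.
for name, score in sorted(zip(feature_names, selector.scores_), key=lambda p: -p[1]):
    print(name, round(score, 2))

# Boolean mask -> names of the columns that were kept.
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print(selected)
```

On the real data the same pattern would use `x_data.columns` in place of `feature_names`.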
Next I try classifying with a decision tree, evaluated with k-fold cross-validation.
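from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

clf = DecisionTreeClassifier()
# label has shape (n_samples, 1); flatten it to the 1-D shape (n_samples,) expected for y.
labels = label.ravel()
scores = cross_val_score(clf, x_data.values, labels)
print(scores)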
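While writing this I hit a problem with cross_val_score: IndexError: too many indices for array. The cause is that the label array does not have the shape cross_val_score expects. LabelBinarizer returns a 2-D column of shape (n_samples, 1), whereas y should be 1-D with shape (n_samples,). Reshaping the labels to 1-D fixes it.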
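The shape mismatch is easy to check in isolation. The sketch below uses synthetic data (the feature matrix and the rule that generates the labels are assumptions for illustration) and prints the label shape before and after flattening. Whether passing the 2-D array actually raises an error depends on the scikit-learn version, so the sketch only passes the flattened labels.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelBinarizer
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 10, size=(60, 3))
# Synthetic string labels derived from the first column.
raw = np.where(X[:, 0] > 4, ">50K", "<=50K")

label = LabelBinarizer().fit_transform(raw)
print(label.shape)    # (60, 1): a 2-D column

labels = label.ravel()
print(labels.shape)   # (60,): the 1-D shape expected for y

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, labels, cv=3)
print(scores)
```

`ravel()` does the same job as `reshape(c,)` in the snippet above without needing to unpack the shape first.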
Python source:
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# adult.data has no header row, so supply the column names.
csv_file = pd.read_csv('adult.data', header=None, names=[
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'salary',
])

# Hand-picked feature columns and the target column.
x_data = csv_file[['age', 'workclass', 'education', 'education-num', 'sex', 'hours-per-week']].copy()
y_label = csv_file['salary']

# Encode the string-valued features as integer codes.
le = LabelEncoder()
for column in ['workclass', 'sex', 'education']:
    x_data[column] = le.fit_transform(x_data[column].values)

# Binarize the salary label; the result has shape (n_samples, 1), so flatten it to 1-D.
lb = LabelBinarizer()
label = lb.fit_transform(y_label.values)
labels = label.ravel()

# Chi-squared feature selection: keep the 4 highest-scoring features.
sb = SelectKBest(chi2, k=4)
new_x = sb.fit_transform(x_data, labels)
print(new_x)
print(sb.scores_)

# Decision tree scored with cross-validation on the six feature columns.
clf = DecisionTreeClassifier()
scores = cross_val_score(clf, x_data.values, labels)
print(scores)
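Running this shows that a decision tree on these features reaches an accuracy of only a little over 70%, so this selection strategy does not seem to work very well.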
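Note that the script scores the tree on all six columns (`x_data.values`) rather than on `new_x`. One possible way to evaluate the selected features is to put the selector and the classifier in a single pipeline, so that selection is refit inside each fold. This is a sketch on synthetic data, not a result for the adult dataset; the data and the choice of `k=1` and `cv=5` are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n = 300
y = rng.randint(0, 2, size=n)

# One informative count feature followed by two pure-noise features.
X = np.column_stack([
    rng.poisson(5 + 10 * y),
    rng.poisson(5, size=n),
    rng.poisson(5, size=n),
])

# Selection happens inside each training fold, then the tree sees only the kept column.
model = make_pipeline(
    SelectKBest(chi2, k=1),
    DecisionTreeClassifier(random_state=0),
)
scores = cross_val_score(model, X, y, cv=5)
print(scores)
print(scores.mean())
```

For the real data, `X` would be `x_data.values` and `y` the flattened labels.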