Continuing from Machine Learning Example, Part 1:
Next we need to sort out what each feature in the dataset looks like and propose a suitable treatment for each of them.
Machine learning models work on numeric data, but part of this dataset is text, so the different text fields each need their own treatment. That is the next step:
- Feature categories
age: continuous numeric; a possible treatment is binning into age groups;
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
Employer type, multi-category; the usual treatment is to map the categories to numbers, e.g. the eight values above could be encoded as 1-8 (just an illustration, not what this article recommends);
fnlwgt: continuous numeric; the number of people the census takers believe the observation represents. This feature is not used in this article; the author does not consider it important.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. Education level, multi-category; handled the same way as workclass;
education-num: continuous numeric, years of education; generally, the higher this value, the higher the income;
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. Marital status, multi-category;
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. Occupation, multi-category;
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. Family relationship, multi-category;
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. Race, multi-category; although the US opposes racial discrimination, in practice this feature matters quite a bit when separating incomes;
sex: Female, Male. Gender, the simplest binary split (0 & 1);
capital-gain: capital gains, continuous numeric;
capital-loss: capital losses, continuous numeric;
hours-per-week: weekly working hours, continuous numeric;
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. Country of origin, multi-category;
result: the label, ">50K" or "<=50K", a binary class and the prediction target of this article (0 & 1);
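As a toy illustration of the "map categories to numbers" idea mentioned for workclass above, pandas' categorical dtype can assign the integer codes for us (the tiny DataFrame below is made up purely for demonstration):

```python
import pandas as pd

# A made-up frame standing in for the real workclass column
df = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Never-worked"]})

# Casting to the categorical dtype assigns one integer code per distinct label
# (codes follow the alphabetical order of the labels)
df["workclass_code"] = df["workclass"].astype("category").cat.codes
print(df["workclass_code"].tolist())  # [1, 2, 1, 0]
```

Note that such ordinal codes impose an artificial ordering on the categories, which is one reason the 1-8 scheme above is shown only as an illustration.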
2. Feature processing:
As you can see, the features fall into exactly three groups:
- continuous numeric features, such as age, the easiest to handle;
- binary text features, handled with a simple 0/1 encoding;
- multi-category text features, which need more care.
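For the binary case, a minimal sketch of the 0/1 encoding (the Series below is a made-up stand-in for the sex column):

```python
import pandas as pd

# Toy stand-in for the sex column
sex = pd.Series(["Male", "Female", "Female", "Male"])

# Map the two text labels straight to 0/1
sex_encoded = sex.map({"Male": 1, "Female": 0})
print(sex_encoded.tolist())  # [1, 0, 0, 1]
```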
Now let's process the features.
First, check the dataset for missing values:
adult.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age 32561 non-null int64
workclass 32561 non-null object
fnlwgt 32561 non-null int64
education 32561 non-null object
education-num 32561 non-null int64
marital-status 32561 non-null object
occupation 32561 non-null object
relationship 32561 non-null object
race 32561 non-null object
sex 32561 non-null object
capital-gain 32561 non-null int64
capital-loss 32561 non-null int64
hours-per-week 32561 non-null int64
native-country 32561 non-null object
result 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
Or, a second way:
adult.isnull().any()
age False
workclass False
fnlwgt False
education False
education-num False
marital-status False
occupation False
relationship False
race False
sex False
capital-gain False
capital-loss False
hours-per-week False
native-country False
result False
dtype: bool
But what does the data actually look like?
Open the csv file itself and you can easily spot the cells marked "?": they are not missing (NaN), but they are invalid and useless, so we want to get rid of them as well.
Here is a small trick.
First look at:
adult.shape
(32561, 15)
So we currently have 32561 rows. With a regular expression we can replace '?' with NaN (np is numpy, imported as `import numpy as np`):
adult_clean = adult.replace(regex=[r'\?|\.|\$'], value=np.nan)
This replaces the three symbols ? . $ with NaN; adjust the pattern to your own data.
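The same trick on a tiny made-up frame, so the effect of the regex replacement is visible end to end:

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of "?" placeholders as the real csv
df = pd.DataFrame({"occupation": ["Sales", "?", "Tech-support"],
                   "workclass": ["Private", "State-gov", "?"]})

# Cells matching the regex are replaced by NaN
cleaned = df.replace(regex=[r'\?'], value=np.nan)
print(cleaned.isnull().any().tolist())  # [True, True]
```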
Now check again:
adult_clean.isnull().any()
age False
workclass True
fnlwgt False
education False
education-num False
marital-status False
occupation True
relationship False
race False
sex False
capital-gain False
capital-loss False
hours-per-week False
native-country True
result False
dtype: bool
Then we drop every row that contains a missing value. For some datasets you would fill empty cells with the column mean instead, but here the columns containing '?' are categorical (workclass, occupation, native-country), so a mean makes no sense; and since we are predicting income, simply dropping the rows does not hurt the result, whereas a clumsy fill would actually degrade the training set:
adult = adult_clean.dropna(how='any')
# drop every row that contains a NaN
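A minimal sketch of what dropna(how='any') does, on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up frame: the middle row has a missing workclass
df = pd.DataFrame({"age": [25, 38, 50],
                   "workclass": ["Private", np.nan, "State-gov"]})

# how='any' drops a row as soon as it contains at least one NaN
cleaned = df.dropna(how="any")
print(cleaned.shape)  # (2, 2)
```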
Check again:
adult.shape
(30162, 15)
Roughly 2000 rows had an invalid value in some feature. With that, the missing-data cleanup is done.
Since we want to predict a label and the data comes labelled, we use supervised learning.
First, split the data into a training set and a test set.
But before that, drop the "least useful" feature:
adult=adult.drop(['fnlwgt'],axis=1)
adult.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 0 to 32560
Data columns (total 14 columns):
age 30162 non-null int64
workclass 30162 non-null object
education 30162 non-null object
education-num 30162 non-null int64
marital-status 30162 non-null object
occupation 30162 non-null object
relationship 30162 non-null object
race 30162 non-null object
sex 30162 non-null object
capital-gain 30162 non-null int64
capital-loss 30162 non-null int64
hours-per-week 30162 non-null int64
native-country 30162 non-null object
result 30162 non-null object
dtypes: int64(5), object(9)
memory usage: 4.7+ MB
Import the function needed for the split (in current scikit-learn it lives in sklearn.model_selection; in older versions it was in sklearn.cross_validation):
from sklearn.model_selection import train_test_split
col_names = ["age", "workclass", "education", "education-num", "marital-status",
             "occupation", "relationship", "race", "sex", "capital-gain",
             "capital-loss", "hours-per-week", "native-country", "result"]
Split the data 75% training / 25% test; random_state can be any number, it just fixes the shuffle so the split is reproducible:
X_train, X_test, y_train, y_test = train_test_split(adult[col_names[1:13]], adult[col_names[13]], test_size=0.25, random_state=33)
# X_train, X_test, y_train, y_test = train_test_split(adult[1:13], adult[13], test_size=0.25, random_state=33)
Many people write the second form, but it is wrong: adult[1:13] slices rows 1-12 by position instead of selecting feature columns, and adult[13] raises a KeyError because there is no column named 13. Try it yourself.
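The difference between the two spellings can be seen on a made-up frame: a bare integer slice goes through row positions, while a list of names selects columns:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]})

# df[1:3] slices ROWS 1 and 2 by position -- not columns
rows = df[1:3]
print(rows.shape)  # (2, 2)

# df[["a", "b"]] selects COLUMNS by name, which is what feature selection needs
cols = df[["a", "b"]]
print(cols.shape)  # (4, 2)
```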
Check the shapes of the two sets:
print(X_train.shape)
print(X_test.shape)
(22621, 12)
(7541, 12)
The test set looks the same. Concretely:
X_train.head()
Partial output:
workclass education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
20607 Private Some-college 10 Married-civ-spouse Craft-repair Husband White Male 0 0 50 United-States
31257 Private HS-grad 9 Married-civ-spouse Other-service Husband Black Male 0 0 50 United-States
31892 Private HS-grad 9 Never-married Adm-clerical Not-in-family White Female 0 0 45 United-States
20220 Private HS-grad 9 Divorced Machine-op-inspct Unmarried Black Female 0 0 40 United-States
24044 Private Some-college 10 Divorced Sales Not-in-family White Female 0 0 45 United-States
y_train.head()
20607 >50K
31257 <=50K
31892 <=50K
20220 <=50K
24044 >50K
Name: result, dtype: object
At this point the training and test sets are split. So can we fire up the random forest right away? Not so fast: the most important features are still unprocessed, and how to handle them will be revealed next time.
Part 3, the finale, is coming.
If you think this article is decent, give me a follow so we can improve together; and if you are job hunting and interested in Alibaba, you can also send me your resume for an internal referral: