24. [Capstone] Case Study 1: Predicting Titanic Passenger Survival

1. Import the data

#-*- coding:utf-8 -*-

import pandas as pd 
train = pd.read_csv('/Users/jianghui/Desktop/kaggle/titanic/train.csv')
test  = pd.read_csv('/Users/jianghui/Desktop/kaggle/titanic/test.csv')

print(train.info())
print(test.info())

selected_features = ['Pclass','Sex','Age','Embarked','SibSp','Parch','Fare']

# Copy so that the later fillna assignments do not trigger SettingWithCopyWarning
X_train = train[selected_features].copy()
X_test  = test[selected_features].copy()

y_train = train['Survived']

2. Preprocess the raw data (fill missing values)

print(X_train.info())
print(X_test.info())

Output:

RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
Pclass      891 non-null int64
Sex         891 non-null object
Age         714 non-null float64
Embarked    889 non-null object
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
dtypes: float64(2), int64(3), object(2)
memory usage: 48.8+ KB
None
———————————————————————————————————————
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
Pclass      418 non-null int64
Sex         418 non-null object
Age         332 non-null float64
Embarked    418 non-null object
SibSp       418 non-null int64
Parch       418 non-null int64
Fare        417 non-null float64
dtypes: float64(2), int64(3), object(2)
memory usage: 22.9+ KB
None

3. Observations:
In X_train, 'Age' and 'Embarked' have missing values;
in X_test, 'Age' and 'Fare' have missing values (Embarked is complete in the test set).
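The missing-value counts reported by info() can also be read off directly with isnull().sum(). A minimal sketch on a toy frame (illustrative values, not the real Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the selected Titanic columns (illustrative values)
df = pd.DataFrame({
    'Pclass':   [3, 1, 2],
    'Age':      [22.0, np.nan, 30.0],
    'Embarked': ['S', 'C', None],
    'Fare':     [7.25, np.nan, 8.05],
})

missing = df.isnull().sum()   # per-column count of missing entries
print(missing[missing > 0])   # show only the columns that need filling
```

Running `train[selected_features].isnull().sum()` on the real data gives exactly the gaps listed above.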

(1) Fill missing values in the Embarked (port of embarkation) feature

print(X_train['Embarked'].value_counts())
print(X_test['Embarked'].value_counts())

Output:

S    644
C    168
Q     77
Name: Embarked, dtype: int64
————————————————————————————
S    270
C    102
Q     46
Name: Embarked, dtype: int64
# For a categorical feature like Embarked, fill missing values with the most
# frequent category ('S'); this tends to introduce relatively little error
X_train['Embarked'] = X_train['Embarked'].fillna('S')
X_test['Embarked'] = X_test['Embarked'].fillna('S')

(2) Fill missing values in the Age and Fare features

# For a numerical feature like Age, the usual practice is to fill missing
# values with the mean or the median, which keeps the introduced error small
X_train['Age'] = X_train['Age'].fillna(X_train['Age'].mean())
X_test['Age'] = X_test['Age'].fillna(X_test['Age'].mean())

# Likewise, fill the missing Fare value with the mean
X_test['Fare'] = X_test['Fare'].fillna(X_test['Fare'].mean())
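Since Fare is heavily right-skewed (a few very expensive tickets), the median is often a more robust fill value than the mean. A minimal sketch on illustrative numbers (not the actual Titanic data):

```python
import numpy as np
import pandas as pd

# Illustrative fares: one extreme outlier and one missing value (not real data)
fare = pd.Series([7.25, 8.05, np.nan, 512.33])

mean_fill = fare.fillna(fare.mean())      # mean is dragged up by the outlier
median_fill = fare.fillna(fare.median())  # median stays near typical fares

print(mean_fill[2])    # ~175.88
print(median_fill[2])  # 8.05
```

Either choice works here; the point is that a single outlier can pull the mean far away from a "typical" fare.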

4. Vectorize the features with DictVectorizer

from sklearn.feature_extraction import DictVectorizer
dict_vec = DictVectorizer(sparse=False)
X_train = dict_vec.fit_transform(X_train.to_dict(orient='records'))
X_test = dict_vec.transform(X_test.to_dict(orient='records'))

print(dict_vec.feature_names_)

Output:

# dict_vec.feature_names_
['Age', 'Embarked=C', 'Embarked=Q', 'Embarked=S', 'Fare', 'Parch', 'Pclass', 'Sex=female', 'Sex=male', 'SibSp']
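To see how DictVectorizer produces the feature names above, here is a minimal sketch on two hypothetical passenger records: numeric keys pass through unchanged, while string-valued keys are one-hot encoded as 'key=value' columns.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical passenger records (illustrative values)
rows = [{'Pclass': 3, 'Sex': 'male',   'Age': 22.0},
        {'Pclass': 1, 'Sex': 'female', 'Age': 38.0}]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)

print(vec.feature_names_)  # ['Age', 'Pclass', 'Sex=female', 'Sex=male']
print(X)                   # numeric columns kept, Sex expanded to 0/1 columns
```

This is why Sex and Embarked each expand into several binary columns in the real feature matrix, while Age, Fare, SibSp, Parch, and Pclass stay as single numeric columns.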

5. Set up the random forest and XGBoost classifier models

from sklearn.ensemble import RandomForestClassifier
# Initialize RandomForestClassifier with default settings
rfc = RandomForestClassifier()

from xgboost import XGBClassifier
# Initialize XGBClassifier with default settings
xgbc = XGBClassifier()

6. Evaluate the default-configured RandomForestClassifier and XGBClassifier on the training set with 5-fold cross-validation and report the mean classification accuracy

from sklearn.model_selection import cross_val_score
cross_val_score(rfc,X_train,y_train,cv=5).mean()
# Output: 0.8193501494971074
cross_val_score(xgbc,X_train,y_train,cv=5).mean()
# Output: 0.81824559798311
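What cross_val_score(...).mean() computes can be sketched by hand: split the training indices into k folds, hold each fold out once, and average the per-fold accuracies. A toy illustration using a hypothetical majority-class "model" (function names and data here are made up for illustration):

```python
import numpy as np

def kfold_mean_score(fit_and_score, X, y, k=5):
    """Manual k-fold: roughly what cross_val_score(...).mean() does
    (sklearn additionally stratifies for classifiers; this sketch does not)."""
    folds = np.array_split(np.arange(len(X)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(fit_and_score(X[train_idx], y[train_idx],
                                    X[test_idx], y[test_idx]))
    return float(np.mean(scores))

# Toy usage with a trivial majority-class "model" (hypothetical data)
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

def majority(X_tr, y_tr, X_te, y_te):
    pred = np.bincount(y_tr).argmax()    # most frequent class in the training fold
    return float((y_te == pred).mean())  # accuracy on the held-out fold

print(kfold_mean_score(majority, X, y))  # 0.8
```

Averaging over folds gives a more stable accuracy estimate than a single train/test split, which is why it is used to compare the two default models here.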

7. Predict and save the results to CSV files

# Predict with the default-configured RandomForestClassifier
rfc.fit(X_train,y_train)
rfc_predict = rfc.predict(X_test)
rfc_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':rfc_predict})
rfc_submission.to_csv('/Users/jianghui/Desktop/kaggle/titanic/rfc_submisson.csv',index=False)
# Predict with the default-configured XGBClassifier
xgbc.fit(X_train,y_train)
xgbc_predict = xgbc.predict(X_test)
xgbc_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':xgbc_predict})
xgbc_submission.to_csv('/Users/jianghui/Desktop/kaggle/titanic/xgbc_submisson.csv',index=False)

8. Use a parallel grid search to find a better hyperparameter combination, aiming to further improve the XGBClassifier's predictive performance

from sklearn.model_selection import GridSearchCV
params ={'max_depth':range(2,7),'n_estimators':range(100,1100,200),'learning_rate':[0.05,0.1,0.25,0.5,1.0]}
xgbc_best = XGBClassifier()
gs = GridSearchCV(xgbc_best,params,n_jobs=-1,verbose=1)
gs.fit(X_train,y_train)

# Inspect the tuned XGBClassifier's best hyperparameters and cross-validation accuracy
print(gs.best_score_)
# Output: 0.826038159371
print(gs.best_params_)
# Output: {'n_estimators': 100, 'learning_rate': 0.05, 'max_depth': 6}

9. Predict with the hyperparameter-tuned XGBClassifier and save the results to a CSV file

# Store the tuned XGBClassifier's predictions on the test data in xgbc_best_submisson.csv
xgbc_best_predict = gs.predict(X_test)
xgbc_best_submission = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':xgbc_best_predict})
xgbc_best_submission.to_csv('/Users/jianghui/Desktop/kaggle/titanic/xgbc_best_submisson.csv',index= False)