Contents
1. Data Preview
1.1 Data Structure
Kaggle already splits the data into a training set and a test set; the test set will later be used to predict Survived.
import pandas as pd

data = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
1.2 Data head()
data.head()
Name, PassengerId, and Ticket carry little predictive value and will be dropped during preprocessing.
1.3 Null Detection
print('Missing values per column:\n', data.isnull().sum())
print('Missing values per column:\n', test.isnull().sum())
- Cabin has too many missing values to impute meaningfully, so it will be dropped during preprocessing
- In the training set, Age has 177 missing values and Embarked has 2; in the test set, Age has 86 missing and Fare has 1
1.4 Check 'Survived' distribution
The classes are reasonably balanced, so undersampling is not really needed.
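The distribution check itself isn't shown above; a minimal sketch using a made-up toy frame (on the real data the same `value_counts` call is applied to the loaded `data`):

```python
import pandas as pd

# Toy stand-in for the Titanic training frame (values are made up);
# the real train set is roughly 62/38 died/survived
data = pd.DataFrame({'Survived': [0, 1, 0, 0, 1, 0, 1, 0]})

# Class proportions; a mild imbalance like this rarely needs undersampling
dist = data['Survived'].value_counts(normalize=True)
print(dist)
```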
2. Data Preprocessing
2.1 Concatenate Train and Test Data Together
First concatenate the test and train data (adding a test=True column to the test data so they can be split apart again later).
test['test'] = True
df = pd.concat([data,test],axis=0,sort=False)
2.2 Drop Useless Features
df = df.drop(['Name','Cabin','PassengerId','Ticket'],axis=1)
2.3 Deal with NaN values
- From the earlier check, one sample in the test set is missing Fare; comparing it with similar samples, a value of about 9 looks reasonable
df.loc[df['Fare'].isnull(), 'Fare'] = 9
- Likewise, two samples are missing Embarked
df[df['Embarked'].isnull()]
Most similar samples embarked at C, so fill with 'C'.
df.loc[df['Embarked'].isnull(), 'Embarked'] = 'C'
- Age has many missing values; fill them with the median over all samples
df['Age'] = df['Age'].fillna(df['Age'].median())
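A quick sanity check that the three imputations above leave no NaNs behind, sketched on a made-up toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the three imputed columns (values are made up)
df = pd.DataFrame({
    'Fare': [7.25, np.nan, 8.05],
    'Embarked': ['S', None, 'C'],
    'Age': [22.0, np.nan, 38.0],
})

df.loc[df['Fare'].isnull(), 'Fare'] = 9            # constant fill
df.loc[df['Embarked'].isnull(), 'Embarked'] = 'C'  # most-common-port fill
df['Age'] = df['Age'].fillna(df['Age'].median())   # median fill

assert df.isnull().sum().sum() == 0  # no missing values remain
```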
2.4 One-hot transform
- First see which columns are of object (string) type
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.columns
Index(['Sex', 'Embarked', 'test'], dtype='object')
The test column will be dropped later, so it needs no encoding.
- one-hot
df = pd.get_dummies(df, columns=["Sex","Embarked"])
2.5 Split Back into Train and Test Sets
test_data = df.loc[df['test']==True,:]
test_data = test_data.drop(['test','Survived'],axis=1)
train_data = df.loc[df['test']!=True,:]
train_data = train_data.drop(['test'],axis=1)
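Because the split relies on the `test` flag being NaN for training rows, it is worth checking the resulting shapes; a sketch on a made-up 3-row frame:

```python
import pandas as pd

# Toy combined frame: train rows have 'test' = NaN, the test row lacks 'Survived'
df = pd.DataFrame({
    'Survived': [0.0, 1.0, None],
    'Age': [22.0, 38.0, 26.0],
    'test': [None, None, True],
})

# NaN == True is False, so train rows are excluded; NaN != True is True
test_data = df.loc[df['test'] == True, :].drop(['test', 'Survived'], axis=1)
train_data = df.loc[df['test'] != True, :].drop(['test'], axis=1)

print(train_data.shape, test_data.shape)
```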
3. Build Model (RandomForest)
3.1 Split the Features and Label
features_data = train_data.drop(['Survived'],axis=1)
label_data = train_data['Survived']
3.2 GridSearchCV for best parameters
- First scan n_estimators over 120-180 in steps of 4 to locate the optimal range
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
parameters = {'n_estimators':np.arange(120,180,4)}
clf = GridSearchCV(estimator=RandomForestClassifier(n_jobs=-1),param_grid=parameters,cv=8,n_jobs=-1) #scoring=
clf.fit(features_data,np.ravel(label_data))
print(clf.best_score_)
print(clf.best_params_)
0.8181818181818182
{'n_estimators': 128}
The optimum lies around 128.
- Narrow the n_estimators range to 120-140
parameters = {'n_estimators':np.arange(120,140)}
clf = GridSearchCV(estimator=RandomForestClassifier(n_jobs=-1),param_grid=parameters,cv=8,n_jobs=-1) #scoring=
clf.fit(features_data,np.ravel(label_data))
print(clf.best_score_)
print(clf.best_params_)
0.8181818181818182
{'n_estimators': 132}
This gives the optimal n_estimators = 132.
- Search for the best max_depth and min_samples_leaf
parameters = {'max_depth':np.arange(10,15),'min_samples_leaf':np.arange(2,6)}
clf = GridSearchCV(estimator=RandomForestClassifier(n_estimators= 132,n_jobs=-1),param_grid=parameters,cv=8,n_jobs=-1) #scoring=
clf.fit(features_data,np.ravel(label_data))
print(clf.best_score_)
print(clf.best_params_)
0.8395061728395061
{'max_depth': 11, 'min_samples_leaf': 3}
- Search for the best min_samples_split
parameters = {'min_samples_split':np.arange(2,5)}
clf = GridSearchCV(estimator=RandomForestClassifier(max_depth=11,min_samples_leaf=3, n_estimators= 129,n_jobs=-1),param_grid=parameters,cv=8,n_jobs=-1) #scoring=
clf.fit(features_data,np.ravel(label_data))
print(clf.best_score_)
print(clf.best_params_)
0.8305274971941639
{'min_samples_split': 4}
- Build the model with the best parameters
clf = RandomForestClassifier(max_depth=11,min_samples_leaf=3, n_estimators= 132,min_samples_split=4,n_jobs=-1)
- Cross-validation on this model gives decent results, so it is used to predict the test set
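The cross-validation call itself isn't shown above; a minimal sketch with synthetic data standing in for `features_data`/`label_data` from 3.1 (on the real data the call is identical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for features_data / label_data
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)  # easy, learnable target

clf = RandomForestClassifier(max_depth=11, min_samples_leaf=3,
                             n_estimators=132, min_samples_split=4,
                             n_jobs=-1, random_state=0)
scores = cross_val_score(clf, X, y, cv=8)  # same cv=8 as the grid searches
print(scores.mean())
```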
3.3 Predict the Test Set
clf = RandomForestClassifier(max_depth=11,min_samples_leaf=3, n_estimators= 132,min_samples_split=4,n_jobs=-1)
clf.fit(features_data,np.ravel(label_data))
forecast = clf.predict(test_data)
submission = pd.DataFrame({"PassengerId":test['PassengerId'], "Survived":forecast})
submission['Survived']= submission['Survived'].astype(int)
submission.to_csv("submission.csv", index = False)
The final leaderboard accuracy is 78.947%.
4. Build Model (LogisticRegression)
4.1 Split the Features and Label
features_data = train_data.drop(['Survived'],axis=1)
label_data = train_data['Survived']
4.2 LogisticRegression
Setting class_weight to 'balanced' still did not give high accuracy, so custom class weights are used instead to compensate for the slightly imbalanced Survived distribution.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
penalty = {0: 4.5,1: 3.4}
clf = LogisticRegression(solver='liblinear',class_weight=penalty)
4.3 Cross-Validation
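No code survives for this step; a minimal sketch, with synthetic data standing in for `features_data`/`label_data` from 4.1 and the class weights from 4.2:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for features_data / label_data
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)

# class weights as defined in 4.2
clf = LogisticRegression(solver='liblinear', class_weight={0: 4.5, 1: 3.4})
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```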
4.4 Predict the Test Set
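Prediction and submission are not shown for this model; a sketch mirroring the RandomForest submission in 3.3, on synthetic stand-in data (the PassengerId values here are made up; the real ones come from `test['PassengerId']`):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for features_data/label_data (4.1) and test_data (2.5)
rng = np.random.RandomState(0)
X = rng.rand(80, 4)
y = (X[:, 0] > 0.5).astype(int)
X_test = rng.rand(20, 4)

clf = LogisticRegression(solver='liblinear', class_weight={0: 4.5, 1: 3.4})
clf.fit(X, y)
forecast = clf.predict(X_test)

submission = pd.DataFrame({'PassengerId': np.arange(892, 892 + len(forecast)),
                           'Survived': forecast.astype(int)})
# submission.to_csv('submission.csv', index=False)
print(submission.shape)
```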
The accuracy is acceptable; the final competition score is 76.555%.
5. Build Model (XGBoost)
5.1 Initialization
import xgboost as xgb
dtrain = xgb.DMatrix(data=features_data, label=label_data)
# initial parameter settings
params_dict = {
'seed':33,
'eta':0.1,
'colsample_bytree':0.7,
'silent':1,
'subsample':0.7,
'objective':'reg:logistic',
'max_depth':5,
'min_child_weight':3
}
5.2 Search for the Best Number of Boosting Rounds
Tune num_boost_round with xgb.cv:
%%time
xgb_cv = xgb.cv(params_dict, dtrain, num_boost_round=60,nfold=5,seed=33,early_stopping_rounds=10)
print('CV score:',xgb_cv.iloc[-1,:]['test-rmse-mean'])
Plot the learning curves:
import matplotlib.pyplot as plt
plt.figure()
xgb_cv[['train-rmse-mean','test-rmse-mean']].plot()
The result is roughly num_boost_round = 100.
5.3 Wrap XGBoost in a Convenience Class
class XGBoost(object):
    def __init__(self, **kwargs):
        self.params = kwargs
        if 'num_boost_round' in self.params:
            self.num_boost_round = self.params['num_boost_round']
        else:
            self.num_boost_round = 50  # fallback so fit() still works
        self.params.update({'silent': 1, 'objective': 'binary:hinge', 'seed': 0})

    def fit(self, x_train, y_train):
        dtrain = xgb.DMatrix(x_train, y_train)
        self.bst = xgb.train(params=self.params, dtrain=dtrain,
                             num_boost_round=self.num_boost_round)

    def predict(self, x_pred):
        dpred = xgb.DMatrix(x_pred)
        return self.bst.predict(dpred)

    def kfold(self, x_train, y_train, nfold=5):
        dtrain = xgb.DMatrix(x_train, y_train)
        cv_rounds = xgb.cv(params=self.params, dtrain=dtrain,
                           num_boost_round=self.num_boost_round,
                           nfold=nfold, early_stopping_rounds=10)
        return cv_rounds.iloc[-1, :]

    def plot_feature_importances(self):
        feature_importance = pd.Series(self.bst.get_fscore()).sort_values(ascending=False)
        feature_importance.plot(title='Feature Importance')
        plt.ylabel('Feature Importance Score')

    def get_params(self, deep=True):
        return self.params

    def set_params(self, **params):
        self.params.update(params)
        return self
5.4 Hyperparameter Tuning
- Tune 'max_depth' and 'min_child_weight' while keeping the other initial parameters fixed
%%time
params_dict = {
'max_depth':np.arange(10,16),
'min_child_weight':np.arange(1,6)
}
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(XGBoost(eta=0.1,num_boost_round=100,colsample_bytree=0.5,subsample=0.7),param_grid=params_dict,cv=5,scoring='accuracy')
grid.fit(features_data,label_data)
print(grid.best_score_,grid.best_params_)
(0.8372615039281706, {'max_depth': 14, 'min_child_weight': 5})
- Tune 'min_child_weight'
%%time
params_dict = {
'min_child_weight':np.arange(6,14)
}
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(XGBoost(max_depth = 14,eta=0.1,num_boost_round=100,colsample_bytree=0.5,subsample=0.7),param_grid=params_dict,cv=5,scoring='f1_weighted')
grid.fit(features_data,label_data)
grid.best_score_,grid.best_params_
(0.836100750932872, {'min_child_weight': 9})
- Tune 'gamma'
params_dict = {
'gamma':[0.15,0.2,0.22,0.24]
}
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(XGBoost(max_depth=14,min_child_weight=9,eta=0.1,num_boost_round=100,colsample_bytree=0.5,subsample=0.7),param_grid=params_dict,cv=5,scoring='f1_weighted')
grid.fit(features_data,label_data)
grid.best_score_,grid.best_params_
(0.8342887064266276, {'gamma': 0.2})
- Tune 'subsample' and 'colsample_bytree'
params_dict = {
'subsample':[0.6,0.7,0.8,0.9,0.95],
'colsample_bytree':[0.6,0.7,0.8,0.9,0.95]
}
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(XGBoost(gamma=0.2,max_depth=14,min_child_weight=9,eta=0.1,num_boost_round=100),param_grid=params_dict,cv=5,scoring='f1_weighted')
grid.fit(features_data,label_data)
grid.best_score_,grid.best_params_
- Tune 'eta'
params_dict = {
'eta':[0.01,0.03,0.05,0.075,0.1,0.2,0.3],
}
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(XGBoost(gamma=0.2,max_depth=14,min_child_weight=9,num_boost_round=100,colsample_bytree=0.8,subsample=0.8),param_grid=params_dict,cv=5,scoring='f1_weighted')
grid.fit(features_data,label_data)
grid.best_score_,grid.best_params_
(0.8391645498500734, {'eta': 0.1})
5.5 Build the Final Model and Predict
# XGBoost here is the wrapper class defined in 5.3
clf = XGBoost(gamma=0.2,max_depth=14,min_child_weight=9,num_boost_round=100,colsample_bytree=0.8,subsample=0.8,eta=0.1)
clf.fit(x_train=features_data,y_train=label_data)
forecast = clf.predict(test_data)
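Writing the submission works the same as in 3.3; a sketch (forecast values and PassengerIds are made-up stand-ins) showing that the `binary:hinge` output, which is already 0/1, only needs an integer cast:

```python
import numpy as np
import pandas as pd

# Stand-in for the wrapper's predict() output; with objective 'binary:hinge'
# XGBoost returns 0.0/1.0 floats directly rather than probabilities
forecast = np.array([0.0, 1.0, 1.0, 0.0])
ids = np.arange(892, 892 + len(forecast))  # stand-in for test['PassengerId']

submission = pd.DataFrame({'PassengerId': ids,
                           'Survived': forecast.astype(int)})
# submission.to_csv('submission.csv', index=False)
print(submission)
```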
The final leaderboard score is around 76.55%.