1. Hands-on Practice (Titanic)

1. Data Preview

1.1 Data Structure

Kaggle has already split the data into a training set and a test set for us; the test set will be used later to predict Survived.

import pandas as pd

data = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')

1.2 Data head()

data.head()

(figure: data.head() output)
We can see that Name, PassengerId, and Ticket are not useful, so they should be dropped during preprocessing.

1.3 Null Detection

print('Missing values per column:\n', data.isnull().sum())

(figure: missing-value counts for the training set)

print('Missing values per column:\n', test.isnull().sum())

(figure: missing-value counts for the test set)

  • The Cabin column has too many missing values to fill in, so it should be dropped during preprocessing
  • In the training set, Age is missing 177 values and Embarked 2; in the test set, Age is missing 86 values and Fare 1

1.4 Check ‘Survived’ distribution

(figure: distribution of Survived)
The classes are reasonably balanced, so undersampling is not really needed.
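A quick way to check this distribution (a minimal sketch, assuming data is the training DataFrame loaded above):

print(data['Survived'].value_counts())                 # absolute counts per class
print(data['Survived'].value_counts(normalize=True))   # class proportions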

2. Data Preprocessing

2.1 Concatenate Train and Test Data Together

First merge the test data and train data (adding a column test=True to the test data so it can be split off again later)

test['test'] = True

(figure: test data with the new test column)

df = pd.concat([data,test],axis=0,sort=False)

2.2 Drop Useless Features

df = df.drop(['Name','Cabin','PassengerId','Ticket'],axis=1)

2.3 Deal with NaN values

  1. From the earlier inspection, one sample in the test set is missing Fare; after looking at comparable samples, a value of about 9 seems reasonable
df.loc[df['Fare'].isnull() == True,'Fare'] = 9
  2. Similarly, two samples are missing Embarked
df[df['Embarked'].isnull()==True]

(figure: the two rows with missing Embarked)
Most comparable samples have Embarked = C, so fill with C (a quick check is sketched after the fill below).

df.loc[df['Embarked'].isnull()==True,'Embarked'] = 'C'
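A possible way to verify this (a sketch; treating first-class passengers with a fare near 80 as "comparable" is an assumption based on the two rows shown above):

df[(df['Pclass'] == 1) & (df['Fare'].between(70, 90))]['Embarked'].value_counts()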
  3. Age has quite a few missing values as well; fill them with the median over all samples
df['Age'].fillna(df['Age'].median(),inplace=True)

2.4 One-hot transform

  1. First check which columns are of string (object) type
obj_df = df.select_dtypes(include=['object']).copy()
obj_df.columns

Index(['Sex', 'Embarked', 'test'], dtype='object')
The test column will be dropped later, so it does not need to be transformed.

  2. One-hot encode the remaining categorical columns
df = pd.get_dummies(df, columns=["Sex","Embarked"])

(figure: DataFrame after one-hot encoding)

2.5 Split the Test Set and Training Set

test_data = df.loc[df['test']==True,:]
test_data = test_data.drop(['test','Survived'],axis=1)

train_data = df.loc[df['test']!=True,:]
train_data = train_data.drop(['test'],axis=1)
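A quick sanity check that the split recovers the original sizes (a minimal sketch; the Titanic train and test sets have 891 and 418 rows):

print(train_data.shape, test_data.shape)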

3. Build Model (RandomForest)

3.1 Split the Features and Label

features_data = train_data.drop(['Survived'],axis=1)
label_data = train_data['Survived']

3.2 GridSearchCV for best parameters

  1. First scan n_estimators from 120 to 180 with a step of 4 to locate the rough optimum
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators':np.arange(120,180,4)}
clf = GridSearchCV(estimator=RandomForestClassifier(n_jobs=-1),param_grid=parameters,cv=8,n_jobs=-1)  #scoring=
clf.fit(features_data,np.ravel(label_data))
print(clf.best_score_)
print(clf.best_params_)

0.8181818181818182
{'n_estimators': 128}
The optimum is around 128.

  2. Narrow the n_estimators range to 120-140
parameters = {'n_estimators':np.arange(120,140)}
clf = GridSearchCV(estimator=RandomForestClassifier(n_jobs=-1),param_grid=parameters,cv=8,n_jobs=-1)  #scoring=
clf.fit(features_data,np.ravel(label_data))
print(clf.best_score_)
print(clf.best_params_)

0.8181818181818182
{'n_estimators': 132}
The best n_estimators is 132.

  3. Search for the best max_depth and min_samples_leaf
parameters = {'max_depth':np.arange(10,15),'min_samples_leaf':np.arange(2,6)}
clf = GridSearchCV(estimator=RandomForestClassifier(n_estimators= 132,n_jobs=-1),param_grid=parameters,cv=8,n_jobs=-1)  #scoring=
clf.fit(features_data,np.ravel(label_data))
print(clf.best_score_)
print(clf.best_params_)

0.8395061728395061
{'max_depth': 11, 'min_samples_leaf': 3}

  4. Search for the best min_samples_split
parameters = {'min_samples_split':np.arange(2,5)}
clf = GridSearchCV(estimator=RandomForestClassifier(max_depth=11,min_samples_leaf=3, n_estimators= 129,n_jobs=-1),param_grid=parameters,cv=8,n_jobs=-1)  #scoring=
clf.fit(features_data,np.ravel(label_data))
print(clf.best_score_)
print(clf.best_params_)

0.8305274971941639
{'min_samples_split': 4}

  5. Build the model with the best parameters
clf = RandomForestClassifier(max_depth=13,min_samples_leaf=3, n_estimators= 129,min_samples_split=4,n_jobs=-1)
  6. Cross-validation gives the following result (see the sketch after this list)
    (figure: cross-validation scores)
    Not bad, so we use this model to predict the test set
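A minimal sketch of the cross-validation step (assuming the clf built above; the fold count of 8 mirrors the GridSearchCV setting and is otherwise an assumption):

scores = cross_val_score(clf, features_data, np.ravel(label_data), cv=8)
print(scores)
print('mean CV accuracy:', scores.mean())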

3.3 Predict the Test Set

clf = RandomForestClassifier(max_depth=11,min_samples_leaf=3, n_estimators= 132,min_samples_split=4,n_jobs=-1)
clf.fit(features_data,np.ravel(label_data))
forecast = clf.predict(test_data)
submission = pd.DataFrame({"PassengerId":test['PassengerId'], "Survived":forecast})
submission['Survived']= submission['Survived'].astype(int)
submission.to_csv("submission.csv", index = False)

The final prediction accuracy is 78.947%.

4. Build Model (LogisticRegression)

4.1 Split the Features and Label

features_data = train_data.drop(['Survived'],axis=1)
label_data = train_data['Survived']

4.2 LogisticRegression

Setting class_weight to 'balanced' did not raise the accuracy much, so a custom class-weight dictionary (named penalty in the code below) is used instead to rebalance the slightly imbalanced Survived classes.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
penalty = {0: 4.5,1: 3.4}
clf = LogisticRegression(solver='liblinear',class_weight=penalty)

4.3 Cross-Validation

(figure: cross-validation scores for the logistic regression model)
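A minimal sketch of this cross-validation step (assuming the clf defined above; the fold count is an assumption):

scores = cross_val_score(clf, features_data, label_data, cv=8)
print(scores, scores.mean())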

4.4 Predict the Test Set

The accuracy looks acceptable.
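A sketch of the prediction and submission step, mirroring the random-forest workflow in section 3.3 (the exact code here is an assumption):

clf.fit(features_data, label_data)
forecast = clf.predict(test_data)
submission = pd.DataFrame({"PassengerId": test['PassengerId'], "Survived": forecast})
submission['Survived'] = submission['Survived'].astype(int)
submission.to_csv("submission.csv", index=False)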
The final competition score is 76.555%.

5. Build Model (XGBoost)

5.1 Initialization

import xgboost as xgb
    
dtrain = xgb.DMatrix(data=features_data, label=label_data)
# initial parameter settings
params_dict = {
    'seed':33,
    'eta':0.1,
    'colsample_bytree':0.7,
    'silent':1,
    'subsample':0.7,
    'objective':'reg:logistic',
    'max_depth':5,
    'min_child_weight':3
}

5.2 Search for the Optimal Number of Boosting Rounds

Tune num_boost_round:

%%time
xgb_cv = xgb.cv(params_dict, dtrain, num_boost_round=60,nfold=5,seed=33,early_stopping_rounds=10)
print('CV score:',xgb_cv.iloc[-1,:]['test-rmse-mean'])

Plot the learning curves:

import matplotlib.pyplot as plt
plt.figure()
xgb_cv[['train-rmse-mean','test-rmse-mean']].plot()

The result suggests num_boost_round is around 100.

5.3 Create a Model Class for Convenience

Wrapping xgb.train in a class that exposes fit, predict, get_params, and set_params lets it be used directly with sklearn's GridSearchCV below.

class XGBoost(object):
    def __init__(self,**kwargs):
        self.params = kwargs
        if 'num_boost_round' in self.params:
            self.num_boost_round = self.params['num_boost_round']
        self.params.update({'silent':1,'objective':'binary:hinge','seed':0})
    
    def fit(self, x_train, y_train):
        dtrain = xgb.DMatrix(x_train, y_train)
        self.bst = xgb.train(params=self.params, dtrain=dtrain, num_boost_round=self.num_boost_round)
        
    def predict(self, x_pred):
        dpred = xgb.DMatrix(x_pred)
        return self.bst.predict(dpred)
    
    def kfold(self, x_train,y_train,nfold=5):
        dtrain = xgb.DMatrix(x_train,y_train)
        cv_rounds = xgb.cv(params=self.params, dtrain=dtrain, num_boost_round=self.num_boost_round, nfold=nfold,early_stopping_rounds=10)
        return cv_rounds.iloc[-1,:]
    
    def plot_feature_importances(self):
        feature_importance = pd.Series(self.bst.get_fscore()).sort_values(ascending=False)
        feature_importance.plot(title='Feature Importance')
        plt.ylabel('Feature Importance Score')
    
    def get_params(self, deep=True):
        # get_params / set_params make this wrapper compatible with GridSearchCV
        return self.params
    
    def set_params(self,**params):
        self.params.update(params)
        return self

5.4 Parameter Tuning

  1. Tune 'max_depth' and 'min_child_weight', keeping the other initial parameters fixed
%%time
params_dict = {
    'max_depth':np.arange(10,16),
    'min_child_weight':np.arange(1,6)
}

from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(XGBoost(eta=0.1,num_boost_round=100,colsample_bytree=0.5,subsample=0.7),param_grid=params_dict,cv=5,scoring='accuracy')
grid.fit(features_data,label_data)
print(grid.best_score_,grid.best_params_)

(0.8372615039281706, {'max_depth': 14, 'min_child_weight': 5})

  2. Tune 'min_child_weight' further
%%time
params_dict = {
    'min_child_weight':np.arange(6,14)
}

from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(XGBoost(max_depth = 14,eta=0.1,num_boost_round=100,colsample_bytree=0.5,subsample=0.7),param_grid=params_dict,cv=5,scoring='f1_weighted')
grid.fit(features_data,label_data)

grid.best_score_,grid.best_params_

(0.836100750932872, {'min_child_weight': 9})

  3. Tune 'gamma'
params_dict = {
    'gamma':[0.15,0.2,0.22,0.24]
}

from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(XGBoost(max_depth=14,min_child_weight=9,eta=0.1,num_boost_round=100,colsample_bytree=0.5,subsample=0.7),param_grid=params_dict,cv=5,scoring='f1_weighted')
grid.fit(features_data,label_data)

grid.best_score_,grid.best_params_

(0.8342887064266276, {'gamma': 0.2})

  4. Tune 'subsample' and 'colsample_bytree'
params_dict = {
    'subsample':[0.6,0.7,0.8,0.9,0.95],
    'colsample_bytree':[0.6,0.7,0.8,0.9,0.95]
}

from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(XGBoost(gamma=0.2,max_depth=14,min_child_weight=9,eta=0.1,num_boost_round=100),param_grid=params_dict,cv=5,scoring='f1_weighted')
grid.fit(features_data,label_data)

grid.best_score_,grid.best_params_

  5. Tune 'eta'
params_dict = {
    'eta':[0.01,0.03,0.05,0.075,0.1,0.2,0.3],
}

from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(XGBoost(gamma=0.2,max_depth=14,min_child_weight=9,num_boost_round=100,colsample_bytree=0.8,subsample=0.8),param_grid=params_dict,cv=5,scoring='f1_weighted')
grid.fit(features_data,label_data)

grid.best_score_,grid.best_params_

(0.8391645498500734, {'eta': 0.1})

5.5 Build the Model with the Best Parameters and Predict

# XGBoost here is the wrapper class defined in section 5.3
clf = XGBoost(gamma=0.2,max_depth=14,min_child_weight=9,num_boost_round=100,colsample_bytree=0.8,subsample=0.8,eta=0.1)
clf.fit(x_train=features_data,y_train=label_data)
forecast = clf.predict(test_data)
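The submission file can then be written the same way as in section 3.3 (a sketch, reusing the same column handling):

submission = pd.DataFrame({"PassengerId": test['PassengerId'], "Survived": forecast})
submission['Survived'] = submission['Survived'].astype(int)
submission.to_csv("submission.csv", index=False)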

The final prediction score is also around 76.55%.
