Kaggle's Eight Power Tools (Part 1): An Introduction to XGBoost, the Competition Workhorse

Latest recommended article published on 2021-08-16 18:17:41

ericchzy

Tags: Machine Learning Package Man XGBoost

Article link: https://blog.csdn.net/u014073373/article/details/90679788

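XGBoost is a highly effective machine learning model that shows up in all kinds of competitions and is also widely used in industry. Using the Titanic dataset as a case study, this post walks briefly through the process of building an XGBoost model, with a short introduction to data cleaning and feature engineering along the way. It serves as my study notes on XGBoost; corrections and suggestions are welcome.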

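Dataset

The dataset comes from the competition Titanic: Machine Learning from Disaster. The task is to use a machine learning model to predict whether each passenger on board survived, based on information the dataset provides such as passenger ID, name, and sex.

The sinking of the RMS Titanic was a famous maritime disaster that took place in the North Atlantic from late at night on 14 April 1912 into the early hours of 15 April, on the fifth day of the ship's maiden voyage from Southampton, England, to New York City. The Titanic was the largest ocean liner in the world at the time. Before she struck an iceberg at 23:40 on Sunday, 14 April 1912, the ship had received six sea-ice warnings, but she was travelling at close to her top speed when the lookouts sighted the iceberg. Unable to turn quickly enough, the ship took a blow along her starboard side that opened gaps in part of the hull and flooded 5 of her 16 watertight compartments. The Titanic had been designed to withstand flooding in only 4 watertight compartments, so she sank. --Wikipedia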

import pandas as pd
pd.options.mode.chained_assignment = None

titanic = pd.read_csv('Titanic/train.csv')
titanic.head(3)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S

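Data Cleaning

Keeping the dataset clean is critical to modeling. The reliability of a dataset can be assessed along the following lines:

Data reliability: this is guaranteed by the original dataset
Data versioning: input data has a large effect on machine learning models. If the training input keeps changing, the resulting model may well be wrong; in other words, a sudden change in the upstream process that supplies the data propagates into model building
Feature necessity: the number of features and model accuracy are not strictly positively correlated
Feature correlation: during modeling we try to keep the number of mutually correlated features as low as possible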
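
One way to act on the last point is to inspect pairwise correlations between numeric features and flag the pairs above a threshold. The sketch below is illustrative only: the helper name `highly_correlated_pairs`, the 0.9 threshold, and the toy columns are assumptions and do not come from the original notebook.

```python
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, corr) for numeric column pairs whose
    absolute Pearson correlation exceeds `threshold` (hypothetical helper)."""
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs

# Toy data: fare_doubled is an exact multiple of fare, so the two are redundant.
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "fare_doubled": [14.5, 142.56, 15.84, 106.2, 16.1],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

A flagged pair is only a candidate for removal; which of the two columns to keep is still a modeling decision.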

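In this example, Name and Ticket have little bearing on whether a passenger survived, so we can consider dropping them. For the missing values in the dataset, common filling strategies are:

For floating-point values, fill with the mean or median of the column
For strings, fill with the most frequent value in the column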
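
The two fill strategies can be sketched with plain pandas as follows. The toy frame and the choice of the median for the numeric column are assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

filled = toy.copy()
# Numeric column: replace NaN with the column median (the mean works the same way).
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
# String column: replace NaN with the most frequent value (the mode).
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(filled)
```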

# Percentage of missing values in each column
import matplotlib.pyplot as plt
import seaborn as sns

missing_ratio = titanic.isnull().sum() / len(titanic) * 100
missing_ratio = missing_ratio.sort_values(ascending=False)
missing_ratio = pd.DataFrame({"missing ratio": missing_ratio})

missing_ratio.head(5)

	missing ratio
Cabin	77.104377
Age	19.865320
Embarked	0.224467
Fare	0.000000
Ticket	0.000000

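The table shows that 77% of the Cabin values are missing. Although we would expect cabin location to be closely tied to survival, too much data is missing, so we have to drop this column. Based on the analysis above, the selected features are: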

PassengerId
Pclass
Age
Sex
SibSp
Parch
Embarked
Fare

X = titanic[['PassengerId', 'Pclass', 'Age', 'Sex', 'SibSp', 'Parch', 'Embarked', 'Fare']]
y = titanic['Survived']

# Code based on [sveitser's post](https://stackoverflow.com/a/25562948)

import numpy as np
from sklearn.base import TransformerMixin
class DataFrameImputer(TransformerMixin):
    """Fillna function that will apply the most frequent value for NaN in object columns and
    the mean value for NaN in float columns
    """
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.fill)

# Fill the missing values using the imputer defined above
X = DataFrameImputer().fit_transform(X)

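Feature Engineering

Traditional programming focuses on the process of writing code, whereas machine learning and data analysis practitioners concentrate on how the data's features are represented. Through feature engineering (new features can be derived by applying logical operations to the dataset's original features), developers build a good feature representation of the data. The main tasks in feature engineering are:

Mapping string values to integers
Mapping enumerated values with one-hot encoding

In this example, the Sex values of the Titanic dataset are mapped to integers, and the other string columns are handled the same way. This post uses LabelEncoder for the conversion; because LabelEncoder numbers the labels in sorted order, female becomes 0 and male becomes 1.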

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for column in X:
    if X[column].dtype == np.dtype('O'):
        X[column] = le.fit_transform(X[column])
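
LabelEncoder assigns integers in sorted label order, which imposes an artificial ordering on a nominal column such as Embarked. One-hot encoding, the second technique listed above, avoids that. Here is a minimal sketch using `pandas.get_dummies` on a toy frame; the frame and the `Embarked` prefix are illustrative assumptions, and the original post does not apply this step.

```python
import pandas as pd

toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Fare": [7.25, 71.28, 7.75, 8.05],
})

# Each distinct port becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=["Embarked"], prefix="Embarked", dtype=int)
print(encoded)
```

Tree-based models often cope reasonably well with integer-coded categories, so whether one-hot encoding helps here is something to check empirically.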

from sklearn.model_selection import train_test_split

# Shuffle the original dataset and split it into training and test sets, with 25% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

# Inspect the training data
X_train.head(5)

	PassengerId	Pclass	Age	Sex	SibSp	Parch	Embarked	Fare
110	111	1	47.000000	1	0	0	2	52.0000
360	361	3	40.000000	1	1	4	2	27.9000
364	365	3	29.699118	1	1	0	1	15.5000
320	321	3	22.000000	1	0	0	2	7.2500
296	297	3	23.500000	1	0	0	0	7.2292
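
When the classes are imbalanced, a purely random split can leave the training and test sets with different class proportions. A hedged sketch of a stratified split follows, using the `stratify` parameter of `train_test_split` on synthetic labels. The toy arrays are assumptions, not the Titanic data, and the original post uses an unstratified split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 40% of them positive.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 40 + [0] * 60)

# stratify=y_toy keeps the positive rate the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=33, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```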

from sklearn.ensemble import RandomForestClassifier

titanic_rf = RandomForestClassifier()
titanic_rf.fit(X_train, y_train)

print('The accuracy of Random Forest Classifier on testing set:', 
      round(titanic_rf.score(X_test, y_test), 4))

The accuracy of Random Forest Classifier on testing set: 0.8386

from xgboost import XGBClassifier

titanic_xgb = XGBClassifier()
titanic_xgb.fit(X_train, y_train)

print('The accuracy of eXtreme Gradient Boosting Classifier on testing set:', 
      round(titanic_xgb.score(X_test, y_test), 4))

The accuracy of eXtreme Gradient Boosting Classifier on testing set: 0.8565

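Classification Results

Commonly used metrics for classification are precision, recall, and the F1 score, defined as follows:

Precision: $\frac{T_P}{T_P + F_P}$
Recall: $\frac{T_P}{T_P + F_N}$
F1 score: $2\times \frac{Precision \times Recall}{Precision + Recall}$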
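
As a worked example of the three formulas, the sketch below computes them from raw counts. The function name `precision_recall_f1` and the counts are hypothetical, and the function assumes the denominators are nonzero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true-positive,
    false-positive, and false-negative counts (hypothetical helper)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 4), round(r, 4), round(f1, 4))
```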

from sklearn.metrics import classification_report, precision_recall_curve
from sklearn.metrics import f1_score

rf_result = titanic_rf.predict(X_test)
xgb_result = titanic_xgb.predict(X_test)

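# classification_report expects (y_true, y_pred); swapping them interchanges precision and recall
print('Random forest model: \n ' + classification_report(y_test, rf_result, digits=4))
print('XGBoost model: \n ' + classification_report(y_test, xgb_result, digits=4))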

Random forest model: 
               precision    recall  f1-score   support

           0     0.9030    0.8403    0.8705       144
           1     0.7416    0.8354    0.7857        79

   micro avg     0.8386    0.8386    0.8386       223
   macro avg     0.8223    0.8379    0.8281       223
weighted avg     0.8458    0.8386    0.8405       223

XGBoost model: 
               precision    recall  f1-score   support

           0     0.9179    0.8542    0.8849       144
           1     0.7640    0.8608    0.8095        79

   micro avg     0.8565    0.8565    0.8565       223
   macro avg     0.8410    0.8575    0.8472       223
weighted avg     0.8634    0.8565    0.8582       223

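The weighted-average F1 scores of the random forest and XGBoost models are 0.8405 and 0.8582 respectively, so XGBoost comes out slightly ahead on the Titanic dataset.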