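The dataset used in this article comes from the competition Titanic: Machine Learning from Disaster. The task is to use the passenger ID, name, sex and other information in the dataset to predict, with a machine learning model, whether each passenger on board survived.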
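The sinking of the RMS Titanic was a famous maritime disaster that took place in the North Atlantic from late on the night of 14 April 1912 into the early morning of 15 April. It happened on the fifth day of the Titanic's maiden voyage from Southampton, England, to New York, USA; the ship was then the largest ocean liner in the world. Before she grazed an iceberg at 23:40 on Sunday, 14 April 1912, she had already received six sea-ice warnings, yet she was travelling at close to her top speed when the lookouts sighted the iceberg. Unable to turn quickly enough, the ship took a blow along her starboard side that opened seams in part of the hull and flooded 5 of her 16 watertight compartments. The Titanic was designed to withstand flooding in only 4 watertight compartments, and so she sank. --Wikipedia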
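import pandas as pd
# Silence pandas' chained-assignment warning for the column assignments made later
pd.options.mode.chained_assignment = None
# Load the competition's training data
titanic = pd.read_csv('Titanic/train.csv')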
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
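- Data reliability: this is guaranteed by the original dataset.
- Data version control: the input data has a large effect on machine learning modeling. If the training input keeps changing, we may well fail to produce a correct model; in other words, a sudden change in the upstream process that supplies the data propagates into model building.
- Feature necessity: the number of features used for modeling and the accuracy of the model are not strictly positively correlated.
- Feature correlation: during modeling we keep the number of mutually correlated features as small as possible.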
- Floating-point values can be filled with the mean or the median of the column.
- Strings can be filled with the most frequent value in the column.
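The two filling rules above can be sketched in a few lines of pandas. This is a minimal illustration on a made-up toy frame (the values are hypothetical), and it uses the median for the float column, which is one of the two options listed.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one float column and one string column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

filled = toy.copy()
for col in filled.columns:
    if filled[col].dtype.kind == "f":
        # Float column: fill with the median, which is less sensitive to outliers than the mean.
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # String column: fill with the most frequent value (the mode).
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

Swapping `median()` for `mean()` gives the other float-filling option.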
# Percentage of missing values in each column
import matplotlib.pyplot as plt
import seaborn as sns
missing_ratio = titanic.isnull().sum() / len(titanic) * 100
missing_ratio = missing_ratio.sort_values(ascending=False)
missing_ratio = pd.DataFrame({"missing ratio": missing_ratio})
| missing ratio |
Cabin | 77.104377 |
Age | 19.865320 |
Embarked | 0.224467 |
Fare | 0.000000 |
Ticket | 0.000000 |
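One way to read these percentages is to separate columns that are too sparse to fill sensibly from columns that can reasonably be imputed. The sketch below copies the five values shown in the table and assumes a 50% cut-off; `MAX_MISSING_PCT` is a hypothetical threshold chosen for illustration, not a value from this article.

```python
import pandas as pd

# Missing-value percentages copied from the table above.
missing_ratio = pd.Series(
    {"Cabin": 77.104377, "Age": 19.865320, "Embarked": 0.224467,
     "Fare": 0.0, "Ticket": 0.0},
    name="missing ratio",
)

MAX_MISSING_PCT = 50.0  # hypothetical cut-off, not taken from the article

# Columns with more missing values than the cut-off.
too_sparse = missing_ratio[missing_ratio > MAX_MISSING_PCT].index.tolist()
# Columns with some missing values, but few enough to fill in.
imputable = missing_ratio[
    (missing_ratio > 0) & (missing_ratio <= MAX_MISSING_PCT)
].index.tolist()

print("too sparse to impute:", too_sparse)
print("candidates for imputation:", imputable)
```

Under this assumed threshold only Cabin is set aside, while Age and Embarked remain candidates for filling.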
- PassengerId
- Pclass
- Age
- Sex
- SibSp
- Parch
- Embarked
- Fare
X = titanic[['PassengerId', 'Pclass', 'Age', 'Sex', 'SibSp', 'Parch', 'Embarked', 'Fare']]
y = titanic['Survived']
# Code based on [sveitser's post](https://stackoverflow.com/a/25562948)
import numpy as np
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    """Fill NaN with the most frequent value in object columns and
    with the mean value in float columns.
    """
    def fit(self, X, y=None):
        # One fill value per column, indexed by column name
        self.fill = pd.Series([X[c].value_counts().index[0]
                               if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
                              index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
# Fill the missing values with the imputer defined above
X = DataFrameImputer().fit_transform(X)
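- Map string values to integers
- Map enumerated values with one-hot encoding
With the first approach, the female and male values of the Sex column are mapped to the integers 0 and 1 respectively (LabelEncoder numbers the classes in sorted order), and the other string columns are handled the same way. This article uses LabelEncoder.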
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Encode every string (object) column as integers; the encoder is refit for each column
for column in X:
    if X[column].dtype == np.dtype('O'):
        X[column] = le.fit_transform(X[column])
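For comparison, the one-hot option from the list above might look like the following. This is a minimal sketch on a made-up Embarked column using `pd.get_dummies`; it is not part of the pipeline used in the rest of this article. Integer codes imply an ordering between categories (C < Q < S), whereas one-hot columns do not.

```python
import pandas as pd

# Toy column (hypothetical rows) with the three embarkation ports.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(one_hot)
```

Each row has exactly one 1 across the three new columns, at the cost of a wider feature matrix.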
from sklearn.model_selection import train_test_split
# Shuffle the original dataset, then split it into a training set and a test set with a test fraction of 0.25
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
# Inspect the training set
| PassengerId | Pclass | Age | Sex | SibSp | Parch | Embarked | Fare |
110 | 111 | 1 | 47.000000 | 1 | 0 | 0 | 2 | 52.0000 |
360 | 361 | 3 | 40.000000 | 1 | 1 | 4 | 2 | 27.9000 |
364 | 365 | 3 | 29.699118 | 1 | 1 | 0 | 1 | 15.5000 |
320 | 321 | 3 | 22.000000 | 1 | 0 | 0 | 2 | 7.2500 |
296 | 297 | 3 | 23.500000 | 1 | 0 | 0 | 0 | 7.2292 |
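The effect of `test_size=0.25` and `random_state=33` can be checked on stand-in data. The sketch below assumes 891 rows, the size of the public train.csv, and uses placeholder values in place of the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 891 rows (assumed size of train.csv); the values are placeholders.
X_demo = np.arange(891 * 2).reshape(891, 2)
y_demo = np.arange(891) % 2

# Two splits with the same seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=33)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=33)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))  # 668 223

# A fixed random_state makes the shuffle reproducible.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same)
```

Fixing the seed means the same rows land in the test set on every run, so scores from different models are compared on identical data.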
from sklearn.ensemble import RandomForestClassifier
titanic_rf = RandomForestClassifier()
titanic_rf.fit(X_train, y_train)
print('The accuracy of Random Forest Classifier on testing set:',
      round(titanic_rf.score(X_test, y_test), 4))
The accuracy of Random Forest Classifier on testing set: 0.8386
from xgboost import XGBClassifier
titanic_xgb = XGBClassifier()
titanic_xgb.fit(X_train, y_train)
print('The accuracy of eXtreme Gradient Boosting Classifier on testing set:',
      round(titanic_xgb.score(X_test, y_test), 4))
The accuracy of eXtreme Gradient Boosting Classifier on testing set: 0.8565
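Both accuracies above come from a single 75/25 split, and a `RandomForestClassifier()` created without a `random_state` can give slightly different numbers on each run. A common way to get a less split-dependent estimate is k-fold cross-validation. The sketch below is illustrative only: it uses synthetic data from `make_classification` in place of the Titanic features, and the model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; not the Titanic dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Fixing random_state makes the forest itself reproducible.
model = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation: five train/validation splits, one accuracy per fold.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print("fold accuracies:", scores.round(4))
print("mean accuracy:", round(scores.mean(), 4))
```

The spread between folds gives a rough sense of how much a single-split score can move.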
- Precision: $Precision = \frac{T_P}{T_P + F_P}$
- Recall: $Recall = \frac{T_P}{T_P + F_N}$
- F1 score: $F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$
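As a quick worked example, the three formulas can be evaluated from raw counts. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; `precision_recall_f1` is an illustrative helper, not a library function.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.8 0.6667 0.7273
```

F1 sits between precision and recall and is pulled toward the lower of the two.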
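from sklearn.metrics import classification_report, precision_recall_curve
from sklearn.metrics import f1_score
rf_result = titanic_rf.predict(X_test)
xgb_result = titanic_xgb.predict(X_test)
# Note: classification_report expects (y_true, y_pred). Predictions are passed first here,
# so precision and recall are swapped in the printed reports and support counts predicted labels.
print('Random forest model: \n ' + classification_report(rf_result, y_test, digits=4))
print('XGBoost model: \n ' + classification_report(xgb_result, y_test, digits=4))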
precision recall f1-score support
0 0.9030 0.8403 0.8705 144
1 0.7416 0.8354 0.7857 79
micro avg 0.8386 0.8386 0.8386 223
macro avg 0.8223 0.8379 0.8281 223
weighted avg 0.8458 0.8386 0.8405 223
precision recall f1-score support
0 0.9179 0.8542 0.8849 144
1 0.7640 0.8608 0.8095 79
micro avg 0.8565 0.8565 0.8565 223
macro avg 0.8410 0.8575 0.8472 223
weighted avg 0.8634 0.8565 0.8582 223
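When reading these reports, keep in mind that scikit-learn's metric functions take the ground truth first, as in `classification_report(y_true, y_pred)`. The calls that produced the reports pass the predictions first, so each class's precision and recall appear in each other's column and the support column counts predicted labels; the F1 score is symmetric and is unaffected. The sketch below shows the swap on a small set of made-up labels.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 3 actual positives, 4 predicted positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

# Conventional order: ground truth first.
p_correct = precision_score(y_true, y_pred)
r_correct = recall_score(y_true, y_pred)

# Swapped order: predictions treated as if they were the ground truth.
p_swapped = precision_score(y_pred, y_true)
r_swapped = recall_score(y_pred, y_true)

print("correct order:", round(p_correct, 4), round(r_correct, 4))
print("swapped order:", round(p_swapped, 4), round(r_swapped, 4))
```

Precision under one argument order equals recall under the other, so the two columns of a report produced with swapped arguments should be read the other way round.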