1 Titanic Survival Prediction
1.1 Background
- This case study concerns the sinking of the Titanic in 1912, a disaster in which 1502 of the 2224 passengers and crew died. Who survived and who perished was not entirely random: survival was related to factors such as sex, age, and social class. Our task is to treat these factors as features and survival as the prediction target, and to build a machine learning model that predicts survival from the dataset.
1.2 Dataset Description

Column | Meaning
---|---
PassengerId | Passenger ID
Survived | Survival (0 = no, 1 = yes)
Pclass | Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name | Name
Sex | Sex
Age | Age
SibSp | Number of siblings/spouses aboard
Parch | Number of parents/children aboard
Ticket | Ticket number
Fare | Fare
Cabin | Cabin number
Embarked | Port of embarkation
2 Reading the Dataset
2.1 Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import font_manager
import warnings
warnings.filterwarnings("ignore")
# Chinese font for plot labels (macOS path; adjust on other systems)
my_font = font_manager.FontProperties(fname="/System/Library/Fonts/PingFang.ttc")
sns.set(style="darkgrid")
2.2 Loading the Dataset
data = pd.read_csv("./data.csv", index_col="PassengerId")
data.shape
(891, 11)
data.describe()
| | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
data.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
3 Data Preprocessing
3.1 Missing Values
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
data["Age"].fillna(data["Age"].mean(), inplace=True)
index = data[data["Embarked"].isnull()].index
data.drop(index=index, axis=0, inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 1 to 891
Data columns (total 11 columns):
Survived 889 non-null int64
Pclass 889 non-null int64
Name 889 non-null object
Sex 889 non-null object
Age 889 non-null float64
SibSp 889 non-null int64
Parch 889 non-null int64
Ticket 889 non-null object
Fare 889 non-null float64
Cabin 202 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.3+ KB
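Filling with the global mean is the simplest choice, but it ignores structure in the data. A minimal sketch of a common alternative, group-wise median imputation, which would replace the fillna call above (an aside, not part of the original pipeline):

# Median age of passengers of the same class and sex, broadcast per row
age_by_group = data.groupby(["Pclass", "Sex"])["Age"].transform("median")
# Run this instead of the global-mean fill above
data["Age"] = data["Age"].fillna(age_by_group)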
3.2 Duplicates
data.duplicated().sum()
0
4 Data Analysis
4.1 Effect of Passenger Class on Survival
pclass_sur = data.groupby(["Pclass", "Survived"]).count()["Name"]
display(pclass_sur)
Pclass  Survived
1       0            80
        1           134
2       0            97
        1            87
3       0           372
        1           119
Name: Name, dtype: int64
sns.countplot(x="Pclass", hue="Survived", data=data)
(plot: survival counts by Pclass)
- The plot shows that first-class passengers (Pclass = 1) survived in clearly greater numbers than the other classes; the sketch below turns these counts into rates.
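Because the three classes differ greatly in size, survival rates are more telling than raw counts. A one-line addition to the original analysis (Survived is a 0/1 column, so its mean is the survival rate):

# Survival rate per passenger class
data.groupby("Pclass")["Survived"].mean()

From the counts above this works out to roughly 63% for class 1, 47% for class 2, and 24% for class 3.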
4.2 Effect of Sex on Survival
sex_sur = data.groupby(["Sex", "Survived"]).count()["Name"]
display(sex_sur)
Sex     Survived
female  0            81
        1           231
male    0           468
        1           109
Name: Name, dtype: int64
sns.countplot(x="Sex", hue="Survived", data=data)
(plot: survival counts by Sex)
- Most women survived (231 of 312), while far fewer men did (109 of 577).
4.3 Effect of Age on Survival
data["Age"] = data["Age"].astype(np.int32)
bins = np.arange(0, 85, 10)
count_bins = pd.cut(data["Age"], bins)
age_data = data.groupby([count_bins, "Survived"]).count()["Name"]
age_data
Age       Survived
(0, 10]   0            26
          1            31
(10, 20]  0            72
          1            44
(20, 30]  0           272
          1           136
(30, 40]  0            86
          1            68
(40, 50]  0            51
          1            33
(50, 60]  0            25
          1            17
(60, 70]  0            14
          1             3
(70, 80]  0             3
          1             1
Name: Name, dtype: int64
data["count_bins"] = count_bins.values
sns.countplot(x="count_bins", hue="Survived", data=data)
(plot: survival counts by age bin)
- The plot shows that children aged 0-10 had a comparatively good chance of survival, while very few passengers over 60 made it. So the tradition of protecting the young is visible here; respect for the elderly, not so much.
4.4 Effect of the SibSp and Parch Features
sns.countplot(x="SibSp", hue="Survived", data=data)
(plot: survival counts by SibSp)
sns.countplot(x="Parch", hue="Survived", data=data)
(plot: survival counts by Parch)
- Both plots suggest that passengers with one or two family members aboard survived at a higher rate than those travelling alone.
4.5 Effect of Embarked on Survival
sns.countplot(x="Embarked", hue="Survived", data=data)
(plot: survival counts by Embarked)
4.6 Conclusions
- 1. First-class passengers survived in clearly greater numbers, and at a higher rate, than the other classes.
- 2. Far fewer men survived than women.
- 3. Younger passengers had a better chance of survival than other age groups.
- 4. Passengers with relatives aboard had a somewhat better chance of survival.
- 5. More passengers who embarked at port C survived, possibly because a larger share of high-class passengers boarded there; the sketch below checks this.
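Conclusion 5 is a speculation we can verify directly. A small addition to the original analysis: cross-tabulating embarkation port against passenger class shows each port's class composition (at this point Embarked still holds the raw S/C/Q codes).

# Row-normalized crosstab: each port's class proportions sum to 1
pd.crosstab(data["Embarked"], data["Pclass"], normalize="index")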
5 Model Building
5.1 Data Preprocessing
# Drop columns that will not be used as features
data.drop(["Name", "Ticket", "Cabin", "count_bins"], axis=1, inplace=True)
data.head()
| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | male | 22 | 1 | 0 | 7.2500 | S |
| 2 | 1 | 1 | female | 38 | 1 | 0 | 71.2833 | C |
| 3 | 1 | 3 | female | 26 | 0 | 0 | 7.9250 | S |
| 4 | 1 | 1 | female | 35 | 1 | 0 | 53.1000 | S |
| 5 | 0 | 3 | male | 35 | 0 | 0 | 8.0500 | S |
# Integer-encode Embarked by order of appearance (here S=0, C=1, Q=2)
labels = data["Embarked"].unique().tolist()
data["Embarked"] = data["Embarked"].apply(lambda x: labels.index(x))
data["Embarked"].value_counts()
0 644
1 168
2 77
Name: Embarked, dtype: int64
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
data.head()
| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | 0 | 22 | 1 | 0 | 7.2500 | 0 |
| 2 | 1 | 1 | 1 | 38 | 1 | 0 | 71.2833 | 1 |
| 3 | 1 | 3 | 1 | 26 | 0 | 0 | 7.9250 | 0 |
| 4 | 1 | 1 | 1 | 35 | 1 | 0 | 53.1000 | 0 |
| 5 | 0 | 3 | 0 | 35 | 0 | 0 | 8.0500 | 0 |
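A caveat: integer-encoding Embarked imposes an artificial ordering on a nominal feature. Tree models are largely indifferent to this, but for logistic regression and KNN a one-hot encoding is often preferable. A minimal sketch (an alternative to the encoding above; it assumes Embarked still holds the raw S/C/Q strings):

# Expand Embarked into three 0/1 indicator columns
data = pd.get_dummies(data, columns=["Embarked"], prefix="Embarked")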
5.2 Extracting Features and Labels, Splitting into Train and Test Sets
from sklearn.model_selection import train_test_split
# Features are all columns except the target
x = data.iloc[:, data.columns != "Survived"]
y = data.iloc[:, data.columns == "Survived"]
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3)
# Reset the indices of the four resulting frames
for i in [xtrain, xtest, ytrain, ytest]:
    i.index = range(i.shape[0])
xtrain.head()
| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 35 | 0 | 0 | 512.3292 | 1 |
| 1 | 3 | 0 | 16 | 0 | 0 | 8.0500 | 0 |
| 2 | 2 | 1 | 17 | 0 | 0 | 10.5000 | 0 |
| 3 | 2 | 0 | 27 | 0 | 0 | 13.0000 | 0 |
| 4 | 1 | 1 | 17 | 1 | 0 | 57.0000 | 0 |
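Note that the split above is unseeded, so it (and every score below) will vary from run to run. A sketch of a reproducible, class-balanced split (the random_state value is arbitrary):

# Fix the seed and keep the survived/perished ratio equal in both splits
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.3, random_state=25, stratify=y)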
5.3 Decision Tree Model
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
tree = DecisionTreeClassifier(random_state=25)
tree.fit(xtrain, ytrain)
tree.score(xtest, ytest)
0.8014981273408239
score_ = cross_val_score(tree, x, y, cv=10).mean()
score_
0.7828907048008171
tr = []
te = []
for i in range(10):
    tree = DecisionTreeClassifier(criterion="entropy"
                                  , max_depth=i+1
                                  , random_state=25)
    tree.fit(xtrain, ytrain)
    score_tr = tree.score(xtest, ytest)                    # hold-out accuracy
    score_te = cross_val_score(tree, x, y, cv=10).mean()   # 10-fold CV accuracy
    tr.append(score_tr)
    te.append(score_te)
print(max(te))
# tr holds hold-out test scores and te holds cross-validation scores,
# so the curves are labelled accordingly
plt.plot(range(1, 11), tr, color="red", label="hold-out")
plt.plot(range(1, 11), te, color="blue", label="10-fold CV")
plt.xticks(range(1, 11))
plt.legend()
plt.show()
0.8166624106230849
params = {"splitter": ("random", "best")
          , "criterion": ("gini", "entropy")
          , "min_samples_split": [*range(2, 20)]
          , "max_depth": [*range(1, 10)]
          , "min_impurity_decrease": np.linspace(0, 0.5, 20)
          }
tree = DecisionTreeClassifier(random_state=25)
gs = GridSearchCV(estimator=tree
                  , cv=10
                  , param_grid=params
                  , scoring="accuracy"
                  , n_jobs=-1)
gs.fit(xtrain, ytrain)
print(gs.best_params_)
print(gs.best_estimator_)
print(gs.best_score_)
{'criterion': 'gini', 'max_depth': 9, 'min_impurity_decrease': 0.0, 'min_samples_split': 5, 'splitter': 'random'}
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=5,
min_weight_fraction_leaf=0.0, presort=False,
random_state=25, splitter='random')
0.8215434083601286
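The score above is cross-validated within the training split; GridSearchCV refits the best estimator on all of xtrain by default, so as a final check it can also be scored on the untouched test set (a small addition to the original):

# Accuracy of the tuned tree on the held-out test split
gs.best_estimator_.score(xtest, ytest)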
5.4 Logistic Regression Model
from sklearn.linear_model import LogisticRegression
params = [{"penalty": ["l1", "l2"], "C": [0.1, 1, 10], "solver": ["liblinear"]},
          {"penalty": ["elasticnet"], "C": [0.1, 1, 10], "solver": ["saga"], "l1_ratio": [0.5]}]
lr_gs = GridSearchCV(estimator=LogisticRegression()
                     , cv=10
                     , param_grid=params
                     , verbose=10
                     , scoring="accuracy"
                     , n_jobs=-1)
lr_gs.fit(xtrain, ytrain)
lr_score = lr_gs.best_score_
print(lr_score)
0.8038585209003215
5.5 KNN Model
from sklearn.neighbors import KNeighborsClassifier
params = {"n_neighbors": [*range(1, 11)]
          , "weights": ["uniform", "distance"]
          , "p": [2]
          }
knn_gs = GridSearchCV(estimator=KNeighborsClassifier()
                      , cv=5
                      , param_grid=params
                      , verbose=10
                      , scoring="accuracy"
                      , n_jobs=-1)
knn_gs.fit(xtrain, ytrain)
knn_score = knn_gs.best_score_
print(knn_score)
0.7186495176848875
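KNN measures Euclidean distance on the raw features, so Fare (ranging up to 512) dominates while 0/1 columns like Sex barely register, which largely explains the weak score; unscaled features also slow the saga solver above. A minimal sketch of scaling inside a pipeline (an addition, not the original code):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features before KNN so no single column dominates distances
knn_pipe = Pipeline([("scaler", StandardScaler()),
                     ("knn", KNeighborsClassifier())])
# Grid keys address the pipeline step by name via the double underscore
pipe_params = {"knn__n_neighbors": [*range(1, 11)],
               "knn__weights": ["uniform", "distance"]}
knn_scaled_gs = GridSearchCV(knn_pipe, pipe_params, cv=5, scoring="accuracy", n_jobs=-1)
knn_scaled_gs.fit(xtrain, ytrain.values.ravel())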
5.6 Model Comparison
rects = plt.bar(range(1, 4), [gs.best_score_, lr_score, knn_score])
plt.xticks(range(1, 4), labels=["决策树", "逻辑回归", "KNN"], fontproperties=my_font, size=15)
# Annotate each bar with its accuracy as a percentage
for rect in rects:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2, height + 0.1, f"{height*100:.2f}%", va="center", ha="center")
plt.ylim(0, 1.1)
plt.show()
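A caveat on the comparison: the three bars are cross-validation scores obtained under slightly different protocols (10-fold for the tree and logistic regression, 5-fold for KNN) on an unseeded split, so the ranking should be read loosely. A sketch of a like-for-like check on the shared held-out test set (an addition to the original):

# Score each tuned model on the same untouched test split
for name, model in [("Decision Tree", gs), ("Logistic Regression", lr_gs), ("KNN", knn_gs)]:
    print(name, model.score(xtest, ytest))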