Titanic Survival Prediction

1 Titanic Survival Prediction

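1.1 Case Background

  • The sinking of the Titanic in 1912 killed 1502 of the 2224 passengers and crew on board. Who survived was not entirely random: survival was related to factors such as sex, age, and social class. Our task is to use these factors as features and the survival outcome as the prediction target, and to build machine learning models that predict survival.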

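1.2 Dataset Description

Column        Meaning
PassengerId   Passenger ID
Survived      Survival (0 = died, 1 = survived)
Pclass        Passenger class (1, 2 or 3)
Name          Name
Sex           Sex
Age           Age
SibSp         Number of siblings/spouses aboard
Parch         Number of parents/children aboard
Ticket        Ticket number
Fare          Fare
Cabin         Cabin
Embarked      Port of embarkation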

2 Reading the Dataset

2.1 Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import font_manager
import warnings

warnings.filterwarnings("ignore")
my_font = font_manager.FontProperties(fname="/System/Library/Fonts/PingFang.ttc")
sns.set(style="darkgrid")

2.2 Loading the Dataset

data = pd.read_csv("./data.csv", index_col="PassengerId")
data.shape
(891, 11)
  • Once the dataset is loaded, take a quick look at its contents
data.describe()
         Survived      Pclass         Age       SibSp       Parch        Fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
data.head()
             Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
PassengerId
1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

3 Data Preprocessing

3.1 Missing Values

# 3.1.1 Check for missing values
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
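# 3.1.2 Handle missing values
# Fill the Age column with its mean
data["Age"] = data["Age"].fillna(data["Age"].mean())
# Embarked is missing in only two rows, so drop those rows
index = data[data["Embarked"].isnull()].index
data.drop(index=index, inplace=True)
# Cabin is missing for most passengers (only 204 of 891 values are present), so the whole column is dropped later
# 3.1.3 Check the result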
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 1 to 891
Data columns (total 11 columns):
Survived    889 non-null int64
Pclass      889 non-null int64
Name        889 non-null object
Sex         889 non-null object
Age         889 non-null float64
SibSp       889 non-null int64
Parch       889 non-null int64
Ticket      889 non-null object
Fare        889 non-null float64
Cabin       202 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.3+ KB
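
The fill-and-drop steps above can also be checked on a small frame. The following is a minimal sketch on toy values (not the Titanic data): `isnull().sum()` gives the per-column missing counts directly, and the median is used for Age here only as an assumed alternative to the mean, since it is less sensitive to a few very old passengers.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Number of missing values per column
missing = toy.isnull().sum()
print(missing)

# Fill Age by assignment, then drop the rows that have no Embarked value
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy = toy.dropna(subset=["Embarked"])
print(toy)
```

Assigning the result of `fillna` back to the column avoids relying on an in-place call on a column selection, which newer pandas versions warn about.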

3.2 Duplicates

# 3.2.1 Check for duplicates
data.duplicated().sum()
0
  • There are no duplicate rows, so nothing needs to be handled

4 Data Analysis

4.1 Effect of Social Class on Survival

# 4.1.1 First, count deaths and survivors in each class
pclass_sur = data.groupby(["Pclass", "Survived"]).count()["Name"]
display(pclass_sur)
Pclass  Survived
1       0            80
        1           134
2       0            97
        1            87
3       0           372
        1           119
Name: Name, dtype: int64
# 4.1.2 Draw a bar chart
sns.countplot(x="Pclass", hue="Survived", data=data)
<matplotlib.axes._subplots.AxesSubplot at 0x10dc26eb8>

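[Figure: count plot of Survived by Pclass]

  • The plot shows that first-class passengers (Pclass 1) survived at a much higher rate than the others: 134 of 214 (about 63%), against 87 of 184 (about 47%) in second class and 119 of 491 (about 24%) in third class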
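
Raw counts are hard to compare because the classes differ in size, so it helps to turn them into rates. A minimal sketch, using only the counts printed above (the variable names are my own):

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.Series(
    [80, 134, 97, 87, 372, 119],
    index=pd.MultiIndex.from_product(
        [[1, 2, 3], [0, 1]], names=["Pclass", "Survived"]
    ),
)

# Move Survived into columns, then divide survivors by the class total
table = counts.unstack("Survived")
survival_rate = table[1] / table.sum(axis=1)
print(survival_rate.round(3))
```

Because Survived is coded 0/1, `data.groupby("Pclass")["Survived"].mean()` on the full frame should give the same rates in one step.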

4.2 Effect of Sex on Survival

sex_sur = data.groupby(["Sex", "Survived"]).count()["Name"]
display(sex_sur)
Sex     Survived
female  0            81
        1           231
male    0           468
        1           109
Name: Name, dtype: int64
sns.countplot(x="Sex", hue="Survived", data=data)
<matplotlib.axes._subplots.AxesSubplot at 0x1a206f58d0>

[Figure: count plot of Survived by Sex]

  • Most women survived (231 of 312, about 74%), while only 109 of 577 men did (about 19%)
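
The same comparison can be read off a normalized cross-tabulation. This sketch rebuilds a two-column frame from the counts printed above (it does not load the real data) and lets `pd.crosstab` with `normalize="index"` convert each row of counts into proportions:

```python
import pandas as pd

# Rebuild a two-column frame from the counts printed above
rows = (
    [("female", 0)] * 81 + [("female", 1)] * 231
    + [("male", 0)] * 468 + [("male", 1)] * 109
)
frame = pd.DataFrame(rows, columns=["Sex", "Survived"])

# normalize="index" makes every row sum to 1
rates = pd.crosstab(frame["Sex"], frame["Survived"], normalize="index")
print(rates.round(3))
```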

4.3 Effect of Age on Survival

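# Truncate ages to whole years; ages under 1 become 0 and fall outside the first bin (0, 10]
data["Age"] = data["Age"].astype(np.int32)
# Bin edges 0, 10, ..., 80
bins = np.arange(0, 85, 10)
count_bins = pd.cut(data["Age"], bins)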

age_data = data.groupby([count_bins, "Survived"]).count()["Name"]
age_data
Age       Survived
(0, 10]   0            26
          1            31
(10, 20]  0            72
          1            44
(20, 30]  0           272
          1           136
(30, 40]  0            86
          1            68
(40, 50]  0            51
          1            33
(50, 60]  0            25
          1            17
(60, 70]  0            14
          1             3
(70, 80]  0             3
          1             1
Name: Name, dtype: int64
data["count_bins"] = count_bins.values
sns.countplot(x="count_bins", hue="Survived", data=data)
<matplotlib.axes._subplots.AxesSubplot at 0x1a2078b780>

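[Figure: count plot of Survived by age bin]

  • According to the plot, the (0, 10] bin is the only age group with more survivors than deaths (31 of 57, about 54%), while few passengers over 60 survived (4 of 21). Children seem to have been given priority; there is no sign that the elderly were. Note that the (20, 30] bin is inflated by the missing ages that were filled with the mean (about 29.7)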
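
One detail of the binning above: `pd.cut` builds right-closed intervals by default, so an age of 0 (any infant under one year after truncation) lands in no bin and silently drops out of the counts. A small sketch on made-up ages shows the effect and one possible remedy, left-closed bins via `right=False`:

```python
import numpy as np
import pandas as pd

# Illustrative ages only, including two infants under one year
ages = pd.Series([0.42, 0.92, 5.0, 10.0, 29.0, 80.0]).astype(np.int32)

# Same edges as above: (0, 10], (10, 20], ..., (70, 80]
right_closed = pd.cut(ages, np.arange(0, 85, 10))

# Left-closed bins [0, 10), [10, 20), ...; one extra edge so that 80 is covered
left_closed = pd.cut(ages, np.arange(0, 95, 10), right=False)

print(right_closed.isna().sum())  # infants are lost with the default bins
print(left_closed.isna().sum())
```

Passing `include_lowest=True` to `pd.cut` is another option if the right-closed bins should be kept.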

4.4 Effect of SibSp and Parch

# Do passengers with siblings or a spouse aboard survive more often? Plot it to find out
sns.countplot(x="SibSp", hue="Survived", data=data)
<matplotlib.axes._subplots.AxesSubplot at 0x1a20b987f0>

[Figure: count plot of Survived by SibSp]

# Parch (number of parents/children aboard) behaves much like the feature above
sns.countplot(x="Parch", hue="Survived", data=data)
<matplotlib.axes._subplots.AxesSubplot at 0x1a20c5c7f0>

[Figure: count plot of Survived by Parch]

4.5 Effect of Embarked on Survival

sns.countplot(x="Embarked", hue="Survived", data=data)
<matplotlib.axes._subplots.AxesSubplot at 0x1a2078b278>

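[Figure: count plot of Survived by Embarked]

4.6 Conclusions

  • 1. First-class passengers (Pclass 1) survived at a clearly higher rate than the other classes
  • 2. Men survived at a far lower rate than women
  • 3. Young children had a better chance of survival than the other age groups
  • 4. Passengers with relatives aboard appear to have had a better chance of survival
  • 5. Passengers who embarked at port C survived at a higher rate, possibly because more of them travelled in a higher class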

5 Model Building

5.1 Data Preprocessing

# 5.1.1 Drop the columns that are not needed
data.drop(["Name", "Ticket", "Cabin", "count_bins"], axis=1, inplace=True)
data.head()
             Survived  Pclass  Sex     Age  SibSp  Parch  Fare     Embarked
PassengerId
1            0         3       male    22   1      0      7.2500   S
2            1         1       female  38   1      0      71.2833  C
3            1         3       female  26   0      0      7.9250   S
4            1         1       female  35   1      0      53.1000  S
5            0         3       male    35   0      0      8.0500   S
# 5.1.2 Convert string columns to numeric values
labels = data["Embarked"].unique().tolist()
data["Embarked"] = data["Embarked"].apply(lambda x: labels.index(x))
data["Embarked"].value_counts()
0    644
1    168
2     77
Name: Embarked, dtype: int64
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
data.head()
             Survived  Pclass  Sex  Age  SibSp  Parch  Fare     Embarked
PassengerId
1            0         3       0    22   1      0      7.2500   0
2            1         1       1    38   1      0      71.2833  1
3            1         3       1    26   0      0      7.9250   0
4            1         1       1    35   1      0      53.1000  0
5            0         3       0    35   0      0      8.0500   0
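
Mapping Embarked to 0, 1, 2 by order of appearance gives the ports a numeric order that has no real meaning. Tree models are usually not bothered by this, but logistic regression and KNN treat the codes as magnitudes. One common alternative, sketched here on a toy frame rather than the real data, is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Toy rows only; the real frame has more columns
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q"],
    "Sex": ["male", "female", "female", "male"],
})

# One indicator column per port, so no order between ports is implied
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
# Sex has only two values, so a single 0/1 column is enough
encoded["Sex"] = encoded["Sex"].map({"male": 0, "female": 1})
print(encoded)
```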

5.2 Extracting the Labels and Feature Matrix, Splitting into Training and Test Sets

from sklearn.model_selection import train_test_split

x = data.iloc[:, data.columns != "Survived"]
y = data.iloc[:, data.columns == "Survived"]
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3)

# Reset the indices of the training and test sets
for i in [xtrain, xtest, ytrain, ytest]:
    i.index = range(i.shape[0])
    
xtrain.head()
   Pclass  Sex  Age  SibSp  Parch  Fare      Embarked
0  1       1    35   0      0      512.3292  1
1  3       0    16   0      0      8.0500    0
2  2       1    17   0      0      10.5000   0
3  2       0    27   0      0      13.0000   0
4  1       1    17   1      0      57.0000   0
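
The split above is unseeded, so the rows shown and every score that follows will change from run to run. A hedged sketch of a reproducible variant on synthetic data (the names `x_demo` and `y_demo` and the seed 25 are my own choices): `random_state` fixes the shuffle and `stratify` keeps the class ratio the same in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and label (not the Titanic data)
rng = np.random.default_rng(0)
x_demo = pd.DataFrame({"f1": rng.normal(size=100), "f2": rng.normal(size=100)})
y_demo = pd.Series([0] * 60 + [1] * 40, name="Survived")

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=25, stratify=y_demo
)

# reset_index(drop=True) renumbers the rows from 0
x_tr, x_te = x_tr.reset_index(drop=True), x_te.reset_index(drop=True)
y_tr, y_te = y_tr.reset_index(drop=True), y_te.reset_index(drop=True)
print(y_tr.mean(), y_te.mean())
```

Passing the label as a Series rather than a one-column DataFrame also avoids the column-vector warning that scikit-learn estimators raise in `fit`.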

5.3 Decision Tree Model

# 5.3.1 Start with a baseline model for a rough idea of performance
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

tree = DecisionTreeClassifier(random_state=25)
tree.fit(xtrain, ytrain)
tree.score(xtest, ytest)
0.8014981273408239
score_ = cross_val_score(tree, x, y, cv=10).mean()
score_
0.7828907048008171
# 5.3.2 Observe how the model fits at different max_depth values
holdout_scores = []
cv_scores = []
for i in range(10):
    tree = DecisionTreeClassifier(criterion="entropy"
                                 ,max_depth=i+1
                                 ,random_state=25)
    tree.fit(xtrain, ytrain)
    # Accuracy on the held-out test split
    holdout_scores.append(tree.score(xtest, ytest))
    # Mean 10-fold cross-validation accuracy on the full data
    cv_scores.append(cross_val_score(tree, x, y, cv=10).mean())

print(max(cv_scores))
plt.plot(range(1,11),holdout_scores,color="red",label="hold-out test")
plt.plot(range(1,11),cv_scores,color="blue",label="10-fold CV")
plt.xticks(range(1,11))
plt.legend()
plt.show()
0.8166624106230849

[Figure: accuracy curves for max_depth from 1 to 10]
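
To see overfitting as depth grows, it helps to put the training score next to a validation score. scikit-learn's `validation_curve` computes both for each parameter value. This is a sketch on synthetic data from `make_classification`, not a rerun of the experiment above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 7 features, standing in for the Titanic feature matrix
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=25)

depths = np.arange(1, 11)
train_scores, cv_scores = validation_curve(
    DecisionTreeClassifier(criterion="entropy", random_state=25),
    X_demo, y_demo,
    param_name="max_depth", param_range=depths, cv=5,
)

# Each result has one row per depth and one column per fold
train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)
best_depth = depths[cv_mean.argmax()]
print(best_depth, cv_mean.max())
```

A training curve that keeps rising while the cross-validation curve flattens or falls is the usual sign that the deeper trees are memorizing the training folds.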

# 5.3.3 Use grid search with cross-validation
from sklearn.model_selection import GridSearchCV

params = {"splitter": ("random", "best")
          ,"criterion": ("gini", "entropy")
          ,"min_samples_split": [*range(2, 20)]
          ,"max_depth": [*range(1, 10)]
          ,"min_impurity_decrease": np.linspace(0, 0.5, 20)
        }

tree = DecisionTreeClassifier(random_state=25)
gs = GridSearchCV(estimator=tree
                 ,cv=10
                 ,param_grid=params
#                  ,verbose=10
                 ,scoring="accuracy"
                 ,n_jobs=-1)
gs.fit(xtrain, ytrain)
print(gs.best_params_)
print(gs.best_estimator_)
print(gs.best_score_)
{'criterion': 'gini', 'max_depth': 9, 'min_impurity_decrease': 0.0, 'min_samples_split': 5, 'splitter': 'random'}
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=25, splitter='random')
0.8215434083601286
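
The grid above contains 2 × 2 × 18 × 9 × 20 = 12,960 parameter combinations, which comes to 129,600 fits with 10 folds. When that is too slow, `RandomizedSearchCV` samples a fixed number of combinations from the same space. A sketch on synthetic data, where `n_iter=50` and `cv=5` are arbitrary choices of mine:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=25)

# Same search space as the grid above
param_distributions = {
    "splitter": ["random", "best"],
    "criterion": ["gini", "entropy"],
    "min_samples_split": list(range(2, 20)),
    "max_depth": list(range(1, 10)),
    "min_impurity_decrease": np.linspace(0, 0.5, 20).tolist(),
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=25),
    param_distributions=param_distributions,
    n_iter=50,        # number of sampled combinations
    cv=5,
    scoring="accuracy",
    random_state=25,  # makes the sampling repeatable
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(search.best_score_)
```

The trade-off is that a random sample may miss the best combination that the full grid would find.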

5.4 Logistic Regression Model

from sklearn.linear_model import LogisticRegression

params = [{"penalty": ["l1", "l2"], "C":[0.1, 1, 10], "solver": ["liblinear"]},
         {"penalty": ["elasticnet"], "C":[0.1, 1, 10], "solver": ["saga"], "l1_ratio": [0.5]}]
lr_gs = GridSearchCV(estimator=LogisticRegression()
                    ,cv=10
                    ,param_grid=params
                    ,verbose=10
                    ,scoring="accuracy"
                    ,n_jobs=-1)
lr_gs.fit(xtrain, ytrain)
lr_score = lr_gs.best_score_
print(lr_score)
0.8038585209003215

5.5 KNN Model

from sklearn.neighbors import KNeighborsClassifier

params = {"n_neighbors": [*range(1, 11)]
          ,"weights": ["uniform", "distance"]
          ,"p": [2]
         }

knn_gs = GridSearchCV(estimator=KNeighborsClassifier()
                    ,cv=5
                    ,param_grid=params
                    ,verbose=10
                    ,scoring="accuracy"
                    ,n_jobs=-1)
knn_gs.fit(xtrain, ytrain)
knn_score = knn_gs.best_score_
print(knn_score)
0.7186495176848875
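
KNN works on distances, and the features here are on very different scales (Fare goes up to about 512 while Sex is 0 or 1), so Fare can dominate the distance. A common remedy is to standardize inside a `Pipeline` so that the scaler is refit on each training fold. The sketch below uses synthetic data with one deliberately inflated feature; whether scaling actually raises the Titanic score is not something this example shows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature is put on a much larger scale, as Fare is relative to Sex
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=25)
X_demo[:, 0] *= 100

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
params = {
    "knn__n_neighbors": list(range(1, 11)),
    "knn__weights": ["uniform", "distance"],
}
knn_search = GridSearchCV(pipe, param_grid=params, cv=5, scoring="accuracy")
knn_search.fit(X_demo, y_demo)
print(knn_search.best_params_, knn_search.best_score_)
```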

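5.6 Model Comparison

# Compare the best cross-validation scores: the decision tree and logistic regression perform well (more models can be tested later)
# Note that KNN was tuned with cv=5 while the other two used cv=10
rects = plt.bar(range(1, 4), [gs.best_score_, lr_score, knn_score])
plt.xticks(range(1, 4), labels=["Decision Tree", "Logistic Regression", "KNN"], size=15)

# Annotate each bar with its score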
for rect in rects:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2, height+0.1, f"{height*100:.2f}%", va="center", ha="center")
plt.ylim(0, 1.1)
plt.show()

[Figure: bar chart of the best cross-validation accuracy for the decision tree, logistic regression and KNN]
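
For a like-for-like comparison, every model can be scored on the same folds. A minimal sketch on synthetic data: the hyperparameters below are placeholders of mine, not the tuned values from the searches above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and label
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=25)

# A seeded splitter yields the same 10 folds for every model
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=25)

# Placeholder hyperparameters, chosen only for illustration
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=25),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=folds).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```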
