1 Titanic Survival Prediction
1.1 Background
- This case study concerns the sinking of the Titanic in 1912, a disaster in which 1502 of the 2224 passengers and crew died. Who survived and who perished was not entirely random: survival was related to factors such as sex, age, and social class. Our task is to treat these factors as features and survival as the prediction target, and to build a machine learning model that predicts survival from the dataset.
1.2 Dataset Description

Column | Meaning
---|---
PassengerId | Passenger ID
Survived | Survival (0 = no, 1 = yes)
Pclass | Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name | Name
Sex | Sex
Age | Age
SibSp | Number of siblings/spouses aboard
Parch | Number of parents/children aboard
Ticket | Ticket number
Fare | Fare
Cabin | Cabin number
Embarked | Port of embarkation
2 Reading the Dataset
2.1 Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import font_manager
import warnings
warnings.filterwarnings("ignore")
# Chinese font for plot labels (macOS path; adjust on other systems)
my_font = font_manager.FontProperties(fname="/System/Library/Fonts/PingFang.ttc")
sns.set(style="darkgrid")
2.2 Loading the Dataset
data = pd.read_csv("./data.csv", index_col="PassengerId")
data.shape
(891, 11)
data.describe()
| | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
data.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
3 Data Preprocessing
3.1 Missing Values
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
data["Age"].fillna(data["Age"].mean(), inplace=True)
index = data[data["Embarked"].isnull()].index
data.drop(index=index, axis=0, inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 1 to 891
Data columns (total 11 columns):
Survived 889 non-null int64
Pclass 889 non-null int64
Name 889 non-null object
Sex 889 non-null object
Age 889 non-null float64
SibSp 889 non-null int64
Parch 889 non-null int64
Ticket 889 non-null object
Fare 889 non-null float64
Cabin 202 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.3+ KB
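Filling with the global mean is the simplest choice, but it ignores structure in the data. A minimal sketch of a common alternative, group-wise median imputation, which would replace the fillna call above (an aside, not part of the original pipeline):

# Median age of passengers of the same class and sex, broadcast per row
age_by_group = data.groupby(["Pclass", "Sex"])["Age"].transform("median")
# Run this instead of the global-mean fill above
data["Age"] = data["Age"].fillna(age_by_group)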
3.2 Duplicates
data.duplicated().sum()
0
4 Data Analysis
4.1 Effect of Passenger Class on Survival
pclass_sur = data.groupby(["Pclass", "Survived"]).count()["Name"]
display(pclass_sur)
Pclass  Survived
1       0            80
        1           134
2       0            97
        1            87
3       0           372
        1           119
Name: Name, dtype: int64
sns.countplot(x="Pclass", hue="Survived", data=data)
(plot: survival counts by Pclass)
- The plot shows that first-class passengers (Pclass = 1) survived in clearly greater numbers than the other classes; the sketch below turns these counts into rates.
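Because the three classes differ greatly in size, survival rates are more telling than raw counts. A one-line addition to the original analysis (Survived is a 0/1 column, so its mean is the survival rate):

# Survival rate per passenger class
data.groupby("Pclass")["Survived"].mean()

From the counts above this works out to roughly 63% for class 1, 47% for class 2, and 24% for class 3.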
4.2 Effect of Sex on Survival
sex_sur = data.groupby(["Sex", "Survived"]).count()["Name"]
display(sex_sur)
Sex     Survived
female  0            81
        1           231
male    0           468
        1           109
Name: Name, dtype: int64
sns.countplot(x="Sex", hue="Survived", data=data)
(plot: survival counts by Sex)
- Most women survived (231 of 312), while far fewer men did (109 of 577).
4.3 Effect of Age on Survival
data["Age"] = data["Age"].astype(np.int32)
bins = np.arange(0, 85, 10)
count_bins = pd.cut(data["Age"], bins)
age_data = data.groupby([count_bins, "Survived"]).count()["Name"]
age_data
Age       Survived
(0, 10]   0            26
          1            31
(10, 20]  0            72
          1            44
(20, 30]  0           272
          1           136
(30, 40]  0            86
          1            68
(40, 50]  0            51
          1            33
(50, 60]  0            25
          1            17
(60, 70]  0            14
          1             3
(70, 80]  0             3
          1             1
Name: Name, dtype: int64
data["count_bins"] = count_bins.values
sns.countplot(x="count_bins", hue="Survived", data=data)
(plot: survival counts by age bin)
- The plot shows that children aged 0-10 had a comparatively good chance of survival, while very few passengers over 60 made it. So the tradition of protecting the young is visible here; respect for the elderly, not so much.
4.4 Effect of the SibSp and Parch Features
sns.countplot(x="SibSp", hue="Survived", data=data)
(plot: survival counts by SibSp)
sns.countplot(x="Parch", hue="Survived", data=data)
(plot: survival counts by Parch)
- Both plots suggest that passengers with one or two family members aboard survived at a higher rate than those travelling alone.
4.5 Effect of Embarked on Survival
sns.countplot(x="Embarked", hue="Survived", data=data)
(plot: survival counts by Embarked)
4.6 Conclusions
- 1. First-class passengers survived in clearly greater numbers, and at a higher rate, than the other classes.
- 2. Far fewer men survived than women.
- 3. Younger passengers had a better chance of survival than other age groups.
- 4. Passengers with relatives aboard had a somewhat better chance of survival.
- 5. More passengers who embarked at port C survived, possibly because a larger share of high-class passengers boarded there; the sketch below checks this.
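Conclusion 5 is a speculation we can verify directly. A small addition to the original analysis: cross-tabulating embarkation port against passenger class shows each port's class composition (at this point Embarked still holds the raw S/C/Q codes).

# Row-normalized crosstab: each port's class proportions sum to 1
pd.crosstab(data["Embarked"], data["Pclass"], normalize="index")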
5 Model Building
5.1 Data Preprocessing
# Drop columns that will not be used as features
data.drop(["Name", "Ticket", "Cabin", "count_bins"], axis=1, inplace=True)
data.head()
| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | male | 22 | 1 | 0 | 7.2500 | S |
| 2 | 1 | 1 | female | 38 | 1 | 0 | 71.2833 | C |
| 3 | 1 | 3 | female | 26 | 0 | 0 | 7.9250 | S |
| 4 | 1 | 1 | female | 35 | 1 | 0 | 53.1000 | S |
| 5 | 0 | 3 | male | 35 | 0 | 0 | 8.0500 | S |
# Integer-encode Embarked by order of appearance (here S=0, C=1, Q=2)
labels = data["Embarked"].unique().tolist()
data["Embarked"] = data["Embarked"].apply(lambda x: labels.index(x))
data["Embarked"].value_counts()
0 644
1 168
2 77
Name: Embarked, dtype: int64
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
data.head()
| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | 0 | 22 | 1 | 0 | 7.2500 | 0 |
| 2 | 1 | 1 | 1 | 38 | 1 | 0 | 71.2833 | 1 |
| 3 | 1 | 3 | 1 | 26 | 0 | 0 | 7.9250 | 0 |
| 4 | 1 | 1 | 1 | 35 | 1 | 0 | 53.1000 | 0 |
| 5 | 0 | 3 | 0 | 35 | 0 | 0 | 8.0500 | 0 |
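A caveat: integer-encoding Embarked imposes an artificial ordering on a nominal feature. Tree models are largely indifferent to this, but for logistic regression and KNN a one-hot encoding is often preferable. A minimal sketch (an alternative to the encoding above; it assumes Embarked still holds the raw S/C/Q strings):

# Expand Embarked into three 0/1 indicator columns
data = pd.get_dummies(data, columns=["Embarked"], prefix="Embarked")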
5.2 Extracting Features and Labels, Splitting into Train and Test Sets
from sklearn.model_selection import train_test_split
# Features are all columns except the target
x = data.iloc[:, data.columns != "Survived"]
y = data.iloc[:, data.columns == "Survived"]
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3)
# Reset the indices of the four resulting frames
for i in [xtrain, xtest, ytrain, ytest]:
    i.index = range(i.shape[0])
xtrain.head()
| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 35 | 0 | 0 | 512.3292 | 1 |
| 1 | 3 | 0 | 16 | 0 | 0 | 8.0500 | 0 |
| 2 | 2 | 1 | 17 | 0 | 0 | 10.5000 | 0 |
| 3 | 2 | 0 | 27 | 0 | 0 | 13.0000 | 0 |
| 4 | 1 | 1 | 17 | 1 | 0 | 57.0000 | 0 |
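Note that the split above is unseeded, so it (and every score below) will vary from run to run. A sketch of a reproducible, class-balanced split (the random_state value is arbitrary):

# Fix the seed and keep the survived/perished ratio equal in both splits
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.3, random_state=25, stratify=y)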
5.3 Decision Tree Model
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
tree = DecisionTreeClassifier(random_state=25)
tree.fit(xtrain, ytrain)
tree.score(xtest, ytest)
0.8014981273408239
score_ = cross_val_score(tree, x, y, cv=10).mean()
score_
0.7828907048008171
tr = []
te = []
for i in range(10):
    tree = DecisionTreeClassifier(criterion="entropy"
                                  , max_depth=i+1
                                  , random_state=25)
    tree.fit(xtrain, ytrain)
    score_tr = tree.score(xtest, ytest)                    # hold-out accuracy
    score_te = cross_val_score(tree, x, y, cv=10).mean()   # 10-fold CV accuracy
    tr.append(score_tr)
    te.append(score_te)
print(max(te))
# tr holds hold-out test scores and te holds cross-validation scores,
# so the curves are labelled accordingly
plt.plot(range(1, 11), tr, color="red", label="hold-out")
plt.plot(range(1, 11), te, color="blue", label="10-fold CV")
plt.xticks(range(1, 11))
plt.legend()
plt.show()
0.8166624106230849
params = {"splitter": ("random", "best")
          , "criterion": ("gini", "entropy")
          , "min_samples_split": [*range(2, 20)]
          , "max_depth": [*range(1, 10)]
          , "min_impurity_decrease": np.linspace(0, 0.5, 20)
          }
tree = DecisionTreeClassifier(random_state=25)
gs = GridSearchCV(estimator=tree
                  , cv=10
                  , param_grid=params
                  , scoring="accuracy"
                  , n_jobs=-1)
gs.fit(xtrain, ytrain)
print(gs.best_params_)
print(gs.best_estimator_)
print(gs.best_score_)
{'criterion': 'gini', 'max_depth': 9, 'min_impurity_decrease': 0.0, 'min_samples_split': 5, 'splitter': 'random'}
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=5,
min_weight_fraction_leaf=0.0, presort=False,
random_state=25, splitter='random')
0.8215434083601286
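The score above is cross-validated within the training split; GridSearchCV refits the best estimator on all of xtrain by default, so as a final check it can also be scored on the untouched test set (a small addition to the original):

# Accuracy of the tuned tree on the held-out test split
gs.best_estimator_.score(xtest, ytest)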
5.4 Logistic Regression Model
from sklearn.linear_model import LogisticRegression
params = [{"penalty": ["l1", "l2"], "C": [0.1, 1, 10], "solver": ["liblinear"]},
          {"penalty": ["elasticnet"], "C": [0.1, 1, 10], "solver": ["saga"], "l1_ratio": [0.5]}]
lr_gs = GridSearchCV(estimator=LogisticRegression()
                     , cv=10
                     , param_grid=params
                     , verbose=10
                     , scoring="accuracy"
                     , n_jobs=-1)
lr_gs.fit(xtrain, ytrain)
lr_score = lr_gs.best_score_
print(lr_score)
0.8038585209003215
5.5 KNN Model
from sklearn.neighbors import KNeighborsClassifier
params = {"n_neighbors": [*range(1, 11)]
          , "weights": ["uniform", "distance"]
          , "p": [2]
          }
knn_gs = GridSearchCV(estimator=KNeighborsClassifier()
                      , cv=5
                      , param_grid=params
                      , verbose=10
                      , scoring="accuracy"
                      , n_jobs=-1)
knn_gs.fit(xtrain, ytrain)
knn_score = knn_gs.best_score_
print(knn_score)
0.7186495176848875
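KNN measures Euclidean distance on the raw features, so Fare (ranging up to 512) dominates while 0/1 columns like Sex barely register, which largely explains the weak score; unscaled features also slow the saga solver above. A minimal sketch of scaling inside a pipeline (an addition, not the original code):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features before KNN so no single column dominates distances
knn_pipe = Pipeline([("scaler", StandardScaler()),
                     ("knn", KNeighborsClassifier())])
# Grid keys address the pipeline step by name via the double underscore
pipe_params = {"knn__n_neighbors": [*range(1, 11)],
               "knn__weights": ["uniform", "distance"]}
knn_scaled_gs = GridSearchCV(knn_pipe, pipe_params, cv=5, scoring="accuracy", n_jobs=-1)
knn_scaled_gs.fit(xtrain, ytrain.values.ravel())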
5.6 Model Comparison
rects = plt.bar(range(1, 4), [gs.best_score_, lr_score, knn_score])
plt.xticks(range(1, 4), labels=["决策树", "逻辑回归", "KNN"], fontproperties=my_font, size=15)
# Annotate each bar with its accuracy as a percentage
for rect in rects:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2, height + 0.1, f"{height*100:.2f}%", va="center", ha="center")
plt.ylim(0, 1.1)
plt.show()
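A caveat on the comparison: the three bars are cross-validation scores obtained under slightly different protocols (10-fold for the tree and logistic regression, 5-fold for KNN) on an unseeded split, so the ranking should be read loosely. A sketch of a like-for-like check on the shared held-out test set (an addition to the original):

# Score each tuned model on the same untouched test split
for name, model in [("Decision Tree", gs), ("Logistic Regression", lr_gs), ("KNN", knn_gs)]:
    print(name, model.score(xtest, ytest))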