[Deep Dive into Python] Data Science with Python (14): pandas Data Processing (5): Perished Souls on the "RMS Titanic"

Article link: https://blog.csdn.net/weixin_43031313/article/details/138260208

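Preface

For setting up the data science environment, see my earlier post:

[Deep Dive into Python] Data Science with Python (1): Environment Setup

Earlier posts in this data science series:

[Deep Dive into Python] Data Science with Python (2): jupyter-lab and NumPy Arrays

[Deep Dive into Python] Data Science with Python (3): NumPy Constants, Functions, and Linear Spaces

[Deep Dive into Python] Data Science with Python (4): Exercises and Solutions (Book p. 337)

[Deep Dive into Python] Data Science with Python (5): Matplotlib Visualization (1)

[Deep Dive into Python] Data Science with Python (6): Matplotlib Visualization (2)

[Deep Dive into Python] Data Science with Python (7): Exercises from Book p. 352

[Deep Dive into Python] Data Science with Python (8): pandas Data Structures: Series and DataFrame

[Deep Dive into Python] Data Science with Python (9): Exercises from Book p. 361

[Deep Dive into Python] Data Science with Python (10): pandas Data Processing (1)

[Deep Dive into Python] Data Science with Python (11): pandas Data Processing (2)

[Deep Dive into Python] Data Science with Python (12): pandas Data Processing (3)

[Deep Dive into Python] Data Science with Python (13): pandas Data Processing (4): Exercises from Book p. 377

Note on the code: because the snippets were run in a live session, some import statements may have been omitted.

Note on the article: the images referenced in this post come from the film Titanic and are used for study only. Please do not use them commercially without permission.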

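In this installment we analyze passenger data from the 1912 sinking of the RMS Titanic (see Wikipedia) and try to find the male and female leads of James Cameron's 1997 film Titanic. The Titanic museum is in Belfast, a major port city and the place where the "unsinkable" ship was built. It is said that the disaster owed much to the backward state of maritime communications at the time: after the collision late on 14 April 1912 (the ship sank in the early hours of 15 April), the crew repeatedly radioed for help, but the calls were reportedly ignored or lost in transmission, and tragedy followed.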


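1. Data Overview

First, read the CSV file with pandas and print its first few rows, which also shows the name of each column (data field). The file is read with the pd.read_csv() function. Passing a column name to its index_col parameter uses that column as the row index instead of the default integer index. See the code below: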

import matplotlib.pyplot as plt
import pandas as pd

URL = "https://learnenough.s3.amazonaws.com/titanic.csv"
titanic = pd.read_csv(URL)
print(titanic.head())
titanic2 = pd.read_csv(URL, index_col="Name")
print(titanic2.head())

Output:

# Output with the default integer index (before setting Name as the index):
#  passenger ID  survived  class
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
#                                               name     sex   age  siblings/spouses
                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   
#  parents/children  ticket   fare  cabin  port of embarkation
   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S
# Output after setting index_col = "Name": rows are labeled by passenger name, still in file order
                                                    PassengerId  Survived  \
Name                                                                        
Braund, Mr. Owen Harris                                       1         0   
Cumings, Mrs. John Bradley (Florence Briggs Tha...            2         1   
Heikkinen, Miss. Laina                                        3         1   
Futrelle, Mrs. Jacques Heath (Lily May Peel)                  4         1   
Allen, Mr. William Henry                                      5         0   

                                                    Pclass     Sex   Age  \
Name                                                                       
Braund, Mr. Owen Harris                                  3    male  22.0   
Cumings, Mrs. John Bradley (Florence Briggs Tha...       1  female  38.0   
Heikkinen, Miss. Laina                                   3  female  26.0   
Futrelle, Mrs. Jacques Heath (Lily May Peel)             1  female  35.0   
Allen, Mr. William Henry                                 3    male  35.0   

                                                    SibSp  Parch  \
Name                                                               
Braund, Mr. Owen Harris                                 1      0   
Cumings, Mrs. John Bradley (Florence Briggs Tha...      1      0   
Heikkinen, Miss. Laina                                  0      0   
Futrelle, Mrs. Jacques Heath (Lily May Peel)            1      0   
Allen, Mr. William Henry                                0      0   

                                                              Ticket     Fare  \
Name                                                                            
Braund, Mr. Owen Harris                                    A/5 21171   7.2500   
Cumings, Mrs. John Bradley (Florence Briggs Tha...          PC 17599  71.2833   
Heikkinen, Miss. Laina                              STON/O2. 3101282   7.9250   
Futrelle, Mrs. Jacques Heath (Lily May Peel)                  113803  53.1000   
Allen, Mr. William Henry                                      373450   8.0500   

                                                   Cabin Embarked  
Name                                                               
Braund, Mr. Owen Harris                              NaN        S  
Cumings, Mrs. John Bradley (Florence Briggs Tha...   C85        C  
Heikkinen, Miss. Laina                               NaN        S  
Futrelle, Mrs. Jacques Heath (Lily May Peel)        C123        S  
Allen, Mr. William Henry                             NaN        S
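
To make the effect of index_col concrete, here is a minimal sketch on a three-row in-memory CSV. The rows are copied from the output above; using io.StringIO in place of the URL is an assumption made only so the example runs offline. It suggests that index_col changes the row labels, not the row order.

```python
import io

import pandas as pd

# Three rows copied from the output above; StringIO stands in for the URL.
CSV_TEXT = '''PassengerId,Survived,Name
1,0,"Braund, Mr. Owen Harris"
3,1,"Heikkinen, Miss. Laina"
5,0,"Allen, Mr. William Henry"
'''

default_index = pd.read_csv(io.StringIO(CSV_TEXT))
name_index = pd.read_csv(io.StringIO(CSV_TEXT), index_col="Name")

print(list(default_index.index))  # integer row labels
print(list(name_index.index))     # names as row labels, still in file order

# With names as the index, a row can be looked up by its label.
survived = name_index.loc["Heikkinen, Miss. Laina", "Survived"]
print(survived)
```

Note that "Allen" stays last: the index is not sorted alphabetically unless sort_index() is called explicitly.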

As is well known, the female and male leads of the film Titanic are Rose (Rose DeWitt Bukater) and Jack (Jack Dawson). Let's search the data file for these two:

print(titanic[titanic["Name"].astype('string').str.contains("Rose")])
print(titanic[titanic["Name"].astype('string').str.contains("Jack")])

Program output:

# "Rose"
     PassengerId  Survived  Pclass                        Name     Sex   Age  \
855          856         1       3  Aks, Mrs. Sam (Leah Rosen)  female  18.0   

     SibSp  Parch  Ticket  Fare Cabin Embarked  
855      0      1  392091  9.35   NaN        S

# "Jack"
     PassengerId  Survived  Pclass                       Name   Sex  Age  \
766          767         0       1  Brewe, Dr. Arthur Jackson  male  NaN   

     SibSp  Parch  Ticket  Fare Cabin Embarked  
766      0      0  112379  39.6   NaN        C

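The output shows that an 18-year-old woman, Mrs. Sam Aks (Leah Rosen), survived, while a man of unknown age, Dr. Arthur Jackson Brewe, did not. Both are only substring matches ("Rosen" contains "Rose" and "Jackson" contains "Jack"), so no passenger in this dataset is literally named Rose or Jack. Still, Rose is 17 when she boards in the film, which is close to 18, so one may playfully regard these two passengers as the inspiration for the film's leads.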
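
The difference between a substring match and a whole-word match can be sketched as follows. The first two names are copied from the search results above; the third name is hypothetical and added only for illustration.

```python
import pandas as pd

names = pd.Series([
    "Aks, Mrs. Sam (Leah Rosen)",
    "Brewe, Dr. Arthur Jackson",
    "Doe, Miss. Rose",  # hypothetical name, not in the dataset
])

# Plain substring search: "Rosen" matches because it contains "Rose".
substring_hits = names[names.str.contains("Rose", regex=False)]

# Whole-word search: \b marks a word boundary, so "Rosen" no longer matches.
whole_word_hits = names[names.str.contains(r"\bRose\b", regex=True)]

print(substring_hits.tolist())
print(whole_word_hits.tolist())
```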


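2. Computing the Survival Rate

Next, we compute the overall survival rate of the Titanic disaster by calling mean() on the "Survived" column. Because the column holds only 0 (died) and 1 (survived), its mean is the fraction of survivors: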

print(titanic.iloc[0]["Survived"])
print(titanic.iloc[1]["Survived"])
print(titanic["Survived"].mean())

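Program output:

0
1
0.3838383838383838  # overall survival rate

Wikipedia states: "Of the estimated 2,224 passengers and crew aboard, approximately 1,500 died, making the incident the deadliest sinking of a single ship at the time."

The overall survival rate implied by these figures is (2224 - 1500) / 2224 ≈ 0.3255396, which differs somewhat from the 0.3838 computed from the CSV file. Some discrepancy is to be expected, since the CSV file covers only 891 people, a subset of everyone aboard.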
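
The arithmetic behind the two rates can be checked with a short sketch. All figures come from this post: the mean printed above, the Wikipedia estimate, and the row count of 891 that info() reports later on.

```python
# Figures taken from this post.
csv_rows = 891                 # rows in titanic.csv
csv_rate = 0.3838383838383838  # titanic["Survived"].mean()

# Survived holds only 0 and 1, so mean = survivors / rows.
csv_survivors = round(csv_rate * csv_rows)

# Wikipedia's estimate for everyone aboard (passengers and crew).
aboard, died = 2224, 1500
wiki_rate = (aboard - died) / aboard

print(csv_survivors)                    # survivors recorded in the CSV
print(round(wiki_rate, 7))              # rate implied by Wikipedia
print(round(csv_rate - wiki_rate, 4))   # gap between the two rates
```

Under these figures the CSV records 342 survivors, and the two rates differ by roughly 0.058.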

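3. Factors Affecting Passenger Survival

First, call the info() method to list all the columns, and pick from them the factors we expect to influence survival the most. Note that info() prints its report itself and returns None, which is why None appears at the end of the output: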

print(titanic.info())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64     # passenger ID
 1   Survived     891 non-null    int64     # survived (1) or not (0)
 2   Pclass       891 non-null    int64     # passenger class
 3   Name         891 non-null    object    # name
 4   Sex          891 non-null    object    # sex
 5   Age          714 non-null    float64   # age
 6   SibSp        891 non-null    int64     # siblings and spouses aboard (total)
 7   Parch        891 non-null    int64     # parents and children aboard (total)
 8   Ticket       891 non-null    object    # ticket number
 9   Fare         891 non-null    float64   # fare paid
 10  Cabin        204 non-null    object    # cabin number
 11  Embarked     889 non-null    object    # port of embarkation
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None

For any column in the table above, calling its unique() method returns all of its distinct values:

print(titanic["SibSp"].unique())
print(titanic["Pclass"].unique())
print(titanic["Parch"].unique())
print(titanic["Embarked"].unique())

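Program output:

[1 0 3 4 2 5 8]  # up to 8 siblings/spouses aboard
[3 1 2]          # first, second and third class
[0 1 2 5 3 4 6]  # up to 6 parents/children aboard
['S' 'C' 'Q' nan]  # S = Southampton, C = Cherbourg, Q = Queenstown; nan = unknown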
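
unique() lists the distinct values but not how often each occurs. As a small sketch on a toy column (the values below are assumed, not taken from titanic.csv), value_counts() adds the frequencies, and dropna=False keeps the missing entries visible:

```python
import pandas as pd

# Toy column (assumed values) mimicking Embarked, with one missing entry.
embarked = pd.Series(["S", "C", "S", "Q", None, "S"], name="Embarked")

print(embarked.unique())  # distinct values, the missing one included

# Frequency of each value; dropna=False also counts the missing entry.
counts = embarked.value_counts(dropna=False)
print(counts)

# Number of missing entries, counted directly.
missing = embarked.isna().sum()
print(missing)
```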

Effect of Passenger Class on Survival Rate

Use the groupby() method to analyze how passenger class affects the survival rate:

print(titanic.groupby("Pclass")["Survived"].mean())

Program output:

Pclass
1    0.629630   # first class
2    0.472826   # second class
3    0.242363   # third class
Name: Survived, dtype: float64
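
What groupby(...).mean() computes here can be sketched on toy data (the values below are assumed, not taken from titanic.csv): for each class, the mean of the 0/1 column is the number of survivors divided by the number of passengers.

```python
import pandas as pd

# Toy data (assumed values), not taken from titanic.csv.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Per class: number of passengers, number of survivors, survival rate.
summary = toy.groupby("Pclass")["Survived"].agg(["count", "sum", "mean"])
print(summary)
```

Showing the count next to the rate also makes it easier to judge how much weight each group's rate deserves.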

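In effect, money bought survival.
As a brief aside, the image below shows a third-class ticket from the film.
[Image: third-class ticket from the film]

The ticket in the image is numbered 92302. The following code checks whether such a ticket really exists in the data: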

print(titanic[titanic["Ticket"].astype('string').str.contains("92302")])

This produces the following output:

Empty DataFrame
Columns: [PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
Index: []

No ticket with this number appears in the dataset, so the ticket is most likely fictional.


Similarly, Jack's cabin number in the film is G-60. Let's check whether this cabin exists:

print(titanic[titanic["Cabin"].astype('string').str.contains("G-60")])

This produces the following output:

Empty DataFrame
Columns: [PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
Index: []

Likewise, no such cabin appears in the dataset, so this room appears to be fictional as well.
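
Most Cabin entries are missing (only 204 of 891 are non-null), so it can help to state explicitly how missing values should be treated during the search. The sketch below uses a toy column: "C85" and "C123" appear in the output earlier in this post, and None marks a missing cabin. The na and regex arguments are an optional variation, not what the code above uses.

```python
import pandas as pd

# Toy Cabin column: two cabins from the earlier output, two missing entries.
cabins = pd.Series(["C85", None, "C123", None], name="Cabin")

# na=False treats missing cabins as non-matches; regex=False searches literally.
mask = cabins.str.contains("G-60", na=False, regex=False)
print(mask.tolist())
print(cabins[mask].empty)  # True when nothing matched

# The same call finds a cabin that does exist.
has_c85 = cabins.str.contains("C85", na=False, regex=False)
print(has_c85.tolist())
```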

Let's visualize the survival rates by class computed above as a bar plot:

survival_rates = titanic.groupby("Pclass")["Survived"].mean()
survival_rates.plot.bar()
plt.grid()
plt.title("Survival Rates of Different Classes")
plt.ylabel("Survival Rate")
plt.show()

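Program output:

[Figure: bar chart of survival rate by passenger class]

Effect of Passenger Sex on Survival Rate

Look at the values taken by the Sex column: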

print(titanic["Sex"].unique())

Program output:

['male' 'female']  # the dataset records only these two values

Similarly, we draw a bar plot to compare the survival rates of men and women:

survival_rates = titanic.groupby("Sex")["Survived"].mean()
survival_rates.plot.bar()
plt.grid()
plt.title("Survival Rates of Different Sex")
plt.ylabel("Survival Rate")
plt.show()

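Program output:

[Figure: bar chart of survival rate by sex]

After all, "Women and children first."

Effect of Passenger Age on Survival Rate

Passenger ages are stored in the Age column. We first print the minimum and maximum ages, and tentatively plan to use 7 bars in the plot: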

print(titanic["Age"].min())
print(titanic["Age"].max())
print((titanic["Age"].max() - titanic["Age"].min()) / 7)

Program output:

0.42  # minimum age
80.0  # maximum age
11.368571428571428  # width of each bin; the first bin (about 0.42 to 11.79) counts as "children"
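
The bin edges implied by this width can be worked out directly. The sketch below uses only the minimum and maximum ages printed above. The two extra ages (5.0 and 30.0) are assumed values added to show which bin a value falls into.

```python
import pandas as pd

age_min, age_max, n_bins = 0.42, 80.0, 7
width = (age_max - age_min) / n_bins

# Equal-width edges from the minimum age to the maximum age.
edges = [round(age_min + i * width, 3) for i in range(n_bins + 1)]
print(edges)

# pd.cut with an integer bin count builds equal-width bins over the same range.
binned = pd.cut(pd.Series([age_min, 5.0, 30.0, age_max]), n_bins)
print(binned.cat.codes.tolist())  # bin number (0 to 6) of each age
```

The first edge pair is 0.42 and about 11.79, so the "children" bin ends a little below age 12.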

Next, use the notna() method to drop the rows whose age is unknown:

print(titanic["Age"].notna())
valid_ages = titanic[titanic["Age"].notna()]

Program output:

0       True
1       True
2       True
3       True
4       True
       ...  
886     True
887     True
888    False
889     True
890     True
Name: Age, Length: 891, dtype: bool
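
The boolean-mask filtering above can be compared with dropna() on toy data. The names and ages below are hypothetical. As far as this sketch shows, the two approaches give the same result:

```python
import pandas as pd

# Hypothetical passengers; two of them have no recorded age.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age":  [22.0, None, 35.0, None],
})

via_mask = toy[toy["Age"].notna()]       # boolean-mask filtering, as above
via_dropna = toy.dropna(subset=["Age"])  # drops rows whose Age is missing

print(via_mask)
print(via_mask.equals(via_dropna))
```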

Sort the passengers by age in ascending order:

sorted_by_age = valid_ages.sort_values(by="Age")

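Use the pd.cut() function to split the sorted passenger data into 7 equal-width age bins (pd.cut() does not require sorted input, so the sorting step only makes the data easier to inspect):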

sorted_by_age["Age range"] = pd.cut(sorted_by_age["Age"], 7)

Next, compute the average survival rate of the passengers in each age bin:

survival_rates = sorted_by_age.groupby("Age range")["Survived"].mean()
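
The binning and grouping steps can be sketched end to end on toy data. The ages and outcomes below are assumed, and two bins are used instead of seven for brevity. Passing observed=False explicitly is a choice made in this sketch to keep empty bins in the result, not something the code above does.

```python
import pandas as pd

# Toy data (assumed values), not taken from titanic.csv.
toy = pd.DataFrame({
    "Age":      [2.0, 8.0, 25.0, 30.0, 40.0, 70.0],
    "Survived": [1,   1,   0,    1,    0,    0],
})

# Two equal-width bins for brevity (the post uses 7).
toy["Age range"] = pd.cut(toy["Age"], 2)

# observed=False keeps bins that contain no passengers in the result.
rates = toy.groupby("Age range", observed=False)["Survived"].mean()
print(rates)
```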

Draw the bar plot:

survival_rates.plot.bar()
plt.title("Survival Rates of Different Ages")
plt.ylabel("Survival Rate")
plt.grid()
plt.show()

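Program output:

[Figure: bar chart of survival rate by age range]

As the plot shows, children have the highest survival rate, followed by adults, while the elderly aged about 68 to 80 have the lowest.

Finally, compute the mean age of male and female passengers: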

print(titanic[titanic["Sex"] == "male"]["Age"].mean())
print(titanic[titanic["Sex"] == "female"]["Age"].mean())

Program output:

30.72664459161148  # mean age of male passengers
27.915708812260537  # mean age of female passengers
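
The two filtered means can also be obtained in one call with groupby(). The sketch below uses toy data (assumed values) with one missing age, to show that mean() skips missing values by default:

```python
import pandas as pd

# Toy data (assumed values); one male passenger has no recorded age.
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male"],
    "Age": [30.0, None, 20.0, 28.0, 40.0],
})

# One call instead of two boolean filters; missing ages are skipped by mean().
mean_age = toy.groupby("Sex")["Age"].mean()
print(mean_age)
```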

That concludes this installment's analysis. I wish everyone safe travels, whatever the means of transport.

Bon Voyage.


Reference

Michael Hartl, Learn Enough Python to Be Dangerous: Software Development, Flask Web Apps, and Beginning Data Science with Python. Boston: Pearson, 2023.