Machine Learning: Titanic Survivor Prediction
Titanic Disaster Data Description
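- PassengerId: The passenger's ID, a sequential number that uniquely identifies each passenger. It has no bearing on survival, so we do not use it.
- Survived: 1 means the passenger survived, 0 means the passenger died. This is the label (ground truth).
- Pclass: Ticket class, an important feature. Readers who have seen the film know that passengers in higher classes could reach the deck faster and were therefore more likely to be rescued.
- Name: The passenger's name. We treat it as unrelated to survival and drop it.
- Sex: The passenger's sex. Readers who have seen the film know that, because there were too few lifeboats, the captain ordered women and children to board first. This is therefore also an important feature.
- Age: The passenger's age. Children were given priority for the lifeboats, and physically strong adults also had a somewhat higher chance of surviving.
- SibSp: Number of siblings and spouses aboard.
- Parch: Number of parents and children aboard.
- Ticket: The ticket number. We drop this feature.
- Fare: The fare the passenger paid for the ticket.
- Cabin: The passenger's cabin number. This feature does have some relation to survival: for example, passengers in the cabins that flooded first had a lower chance of surviving. However, most of its values are missing and we have no additional data for grouping cabins, so we drop it.
- Embarked: The port where the passenger boarded. The port values need to be converted to numeric data.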
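The feature choices in the list above can be sketched in pandas roughly as follows. This is a minimal illustration, not the article's own preprocessing: the helper name `prepare_features` and the integer codes chosen for `Sex` and `Embarked` are assumptions, and the three rows are copied from the table printed in this article rather than loaded from the full file.

```python
import numpy as np
import pandas as pd

# Three rows copied from the table shown in this article (not the full data set).
sample = pd.DataFrame({
    "PassengerId": [1, 2, 891],
    "Survived": [0, 1, 0],
    "Pclass": [3, 1, 3],
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley (Florence Briggs Th...",
             "Dooley, Mr. Patrick"],
    "Sex": ["male", "female", "male"],
    "Age": [22.0, 38.0, 32.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "370376"],
    "Fare": [7.25, 71.2833, 7.75],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", "Q"],
})


def prepare_features(df):
    """Drop the columns the list marks as unused and encode Sex and Embarked."""
    out = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
    # Assumed encodings, chosen only for illustration.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    return out


features = prepare_features(sample)
print(features)
```

Mapping the ports to 0, 1, 2 imposes an artificial order on them; one-hot encoding (for example with `pd.get_dummies`) is a common alternative when that order is unwanted.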
Data Exploration
Import the required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
Load the data from the following path
path = './data/train.csv'
Name the DataFrame titanic
titanic = pd.read_csv(path)
titanic
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |

891 rows × 12 columns
Inspect the column names
titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
Descriptive analysis
titanic.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |

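These basic descriptive statistics already support a few simple conclusions that give the rest of the analysis a rough direction. The overall survival rate is about 38%. The large majority of passengers travelled without parents or children (the 75th percentile of Parch is 0). At least a quarter travelled with siblings or a spouse (the 75th percentile of SibSp is 1). The mean age is about 30, and roughly a quarter of the passengers with a recorded age were 20 or younger. Fares vary widely: the mean is about 32, while the maximum reaches 512.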
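The quantities quoted above can also be computed directly instead of being read off the `describe()` table. The sketch below assumes a hypothetical helper named `summarize` and runs it on six rows copied from the table printed earlier in this article, so the numbers it prints describe that small sample, not the full 891-row file.

```python
import pandas as pd

# Six rows (PassengerId 1-5 and 888) copied from the table shown in this article.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 19.0],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Parch": [0, 0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 30.0],
})


def summarize(df):
    """Return the headline figures discussed in the text as plain floats."""
    known_age = df["Age"].dropna()  # ignore passengers with no recorded age
    return {
        "survival_rate": float(df["Survived"].mean()),
        "share_without_parch": float((df["Parch"] == 0).mean()),
        "share_with_sibsp": float((df["SibSp"] > 0).mean()),
        "share_under_20": float((known_age < 20).mean()),
        "mean_fare": float(df["Fare"].mean()),
        "max_fare": float(df["Fare"].max()),
    }


summary = summarize(sample)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Calling `summarize(titanic)` on the full DataFrame loaded earlier would give the figures for the whole training set.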
Descriptive statistics for the non-numeric columns
titanic.describe(include='object')
| | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| count | 891 | 891 | 891 | 204 | 889 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top | Oreskovic, Mr. Luka | male | 1601 | C23 C25 C27 | S |
| freq | 1 | 577 | 7 | 4 | 644 |

Count the missing values
titanic.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
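Raw counts are easier to judge as a share of all rows, and the columns with gaps need some treatment before modelling. The sketch below shows one common approach under stated assumptions: filling Age with the median and Embarked with the most frequent port are illustrative choices, not steps taken from this article, and the three rows are copied from the table printed above.

```python
import numpy as np
import pandas as pd

# Three rows (PassengerId 1, 2 and 889) copied from the table shown in this article.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", "S"],
})

# Share of missing values per column (0.0 to 1.0).
missing_share = sample.isnull().mean()
print(missing_share)

# Cabin is mostly empty, so it is dropped, as the feature list suggests.
cleaned = sample.drop(columns=["Cabin"])
# Assumed imputation strategy: median age, most frequent port.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
print(cleaned)
```

On the full data set the counts above correspond to 177/891 missing for Age, 687/891 for Cabin, and 2/891 for Embarked.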
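Set PassengerId as the index (set_index returns a new DataFrame, so titanic itself is unchanged unless the result is assigned back)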
titanic.set_index('PassengerId').head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

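`DataFrame.set_index` does not modify the frame it is called on by default. The short sketch below, using three rows copied from the table above and a hypothetical variable name `indexed`, shows that the original keeps its `PassengerId` column and how to look a passenger up by ID once the result is kept.

```python
import pandas as pd

# Three rows copied from the table shown in this article.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
})

# set_index returns a new DataFrame; `sample` is left as it was.
indexed = sample.set_index("PassengerId")

print("PassengerId" in sample.columns)   # still a regular column in the original
print(indexed.index.name)                # the new frame is indexed by PassengerId
print(indexed.loc[2, "Sex"])             # label-based lookup by passenger ID
```

To keep the new index on the article's DataFrame, the result would have to be assigned back, for example `titanic = titanic.set_index('PassengerId')`.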
Draw a pie chart showing the proportion of male and female passengers
males = (titanic['Sex']=='male').sum()
females = (titanic['Sex']=='female').sum()
proportions=[males,females]
plt.pie(
    proportions,
    labels=['Males','Females'],
    shadow=False,
    colors=['blue','red'],
    explode=(0.15,0),
    startangle=90,
    autopct='%1.1f%%'
)
plt.axis('equal')
plt.title("Sex Proportion")
plt.tight_layout()
plt.show()
![png](https://i-blog.csdnimg.cn/blog_migrate/a3ddc07a5b3129ead9912ba2046d57b0.png)
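The percentages that `autopct` prints on the chart can be checked without plotting. The sketch below is an alternative to counting each sex separately, using `value_counts(normalize=True)`. The Series is rebuilt from the counts reported earlier in this article (577 males out of 891 passengers, hence 314 females); the name `sex_shares` is a hypothetical helper.

```python
import pandas as pd


def sex_shares(sex_column):
    """Return the share of each sex as percentages rounded to one decimal."""
    return (sex_column.value_counts(normalize=True) * 100).round(1)


# Rebuilt from the counts in the describe(include='object') output:
# 577 of the 891 passengers are male, so 314 are female.
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="Sex")

shares = sex_shares(sex)
print(shares)
```

With the real data, `sex_shares(titanic['Sex'])` yields the same two percentages shown on the pie chart.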
Draw a scatter plot of Fare against Age, colored by Sex
lm = sns.lmplot(x = 'Age', y = 'Fare', data = titanic, hue = 'Sex', fit_reg=False)
lm.set(title = 'Fare x Age')
axes = lm.axes
axes[0,0]<