Kaggle Competition: Titanic Survivor Prediction (Part 1)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Importing the data
titanic = pd.read_csv(r'E:\DataScience\ML\Titanic\train.csv')
titanic_test = pd.read_csv(r'E:\DataScience\ML\Titanic\test.csv')
titanic.head()
   PassengerId  Survived  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                          male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                           female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)     female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                         male    35.0  0      0      373450            8.0500   NaN    S
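Field     Meaning                            Key
survival  Whether the passenger survived     0 = No, 1 = Yes
pclass    Social class (ticket class)        1 = elite, 2 = middle class, 3 = ordinary passengers
sex       Sex
Age       Age
sibsp     Number of siblings/spouses aboard
parch     Number of parents/children aboard
ticket    Ticket number
fare      Ticket fare
cabin     Cabin number
embarked  Port of embarkation                C = Cherbourg, Q = Queenstown, S = Southampton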
titanic.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
titanic.info()
print(titanic.isnull().sum())
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
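To put these counts in perspective, here is a minimal sketch that turns them into percentages of the 891 rows. The numbers are copied from the output above rather than read from the CSV, so the snippet runs on its own; on the real DataFrame, `titanic.isnull().mean()` should give the same fractions.

```python
import pandas as pd

# Missing-value counts copied from titanic.isnull().sum() above.
missing = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891  # row count reported by titanic.describe()

# Share of rows with a missing value, as a percentage rounded to one decimal.
missing_pct = (missing / n_rows * 100).round(1)
print(missing_pct)
```

Roughly a fifth of the ages and more than three quarters of the cabin numbers are missing, while Embarked is nearly complete.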
Data cleaning
Handling missing values
titanic['Age'] = titanic['Age'].fillna(-30)  # mark missing ages with the sentinel value -30
titanic.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
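One thing to keep in mind with a sentinel such as -30: it is still a number, so any statistic computed on Age afterwards will include it unless those rows are filtered out. The sketch below illustrates this on a toy Series; the ages are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy ages, made up for illustration; None marks a missing value.
ages = pd.Series([22.0, 38.0, None, 35.0, None], name="Age")

# Same sentinel fill as in the post.
filled = ages.fillna(-30)

# The sentinel rows pull the plain mean down ...
naive_mean = filled.mean()

# ... so filter them out before computing age statistics.
known_mean = filled[filled >= 0].mean()

print(naive_mean, known_mean)
```

The same `Age >= 0` filter can be applied to `titanic` before any step that treats Age as a real measurement.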
Data analysis
Effect of Sex on survival
titanic.groupby(['Sex', 'Survived'])['Survived'].count()
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
df_sex = titanic[['Sex', 'Survived']].groupby(['Sex']).mean()
df_sex
        Survived
Sex
female  0.742038
male    0.188908
df_sex.plot(kind='bar',
            figsize=(8, 6),
            rot=0,
            fontsize=18,
            stacked=True)
plt.grid(True, linestyle='--')
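The result runs counter to the common assumption that men are better able to survive than women: about 74% of the women survived, compared with about 19% of the men. This suggests that "ladies first" played a major role.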
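Because Survived is coded 0/1, the group mean is simply survivors divided by group size. As a quick check, this sketch recomputes the two rates from the counts in the groupby output above; the counts are typed in by hand so that the snippet does not depend on the CSV.

```python
import pandas as pd

# Counts copied from the groupby output above.
# Rows are Sex, columns are the Survived value (0 = died, 1 = survived).
counts = pd.DataFrame({0: [81, 468], 1: [233, 109]},
                      index=["female", "male"])

# Survival rate = survivors / everyone in the group.
rate = counts[1] / counts.sum(axis=1)
print(rate.round(6))
```

The values agree with `df_sex`, which confirms that `.mean()` on a 0/1 column is the survival rate.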
Relationship between social class (Pclass) and survival
titanic.groupby(['Pclass', 'Survived'])['Pclass'].count()
Pclass  Survived
1       0            80
        1           136
2       0            97
        1            87
3       0           372
        1           119
Name: Pclass, dtype: int64
df_pclass = titanic[['Pclass', 'Survived']].groupby(['Pclass']).mean()
df_pclass
        Survived
Pclass
1       0.629630
2       0.472826
3       0.242363
df_pclass.plot(kind='bar',
               rot=0,
               fontsize=18,
               figsize=(8, 6))
plt.show()
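The higher the class, the better the chance of survival: about 63% in first class, 47% in second, and 24% in third. So does "ladies first" hold across class boundaries?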
df_psex = titanic[['Pclass', 'Sex', 'Survived']].groupby(['Pclass', 'Sex']).mean()
df_psex
                Survived
Pclass  Sex
1       female  0.968085
        male    0.368852
2       female  0.921053
        male    0.157407
3       female  0.500000
        male    0.135447
df_psex.plot(kind='bar',
             rot=0,
             fontsize=12,
             figsize=(8, 6))
plt.show()
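"Ladies first" did hold across class lines: even third-class women (50%) survived at a higher rate than first-class men (about 37%). Still, class cannot be ignored: for both sexes the survival rate drops from first class to third class.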
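The comparison is easier to read with one row per class and one column per sex. Here is a minimal sketch, using the rates from the table above typed in by hand; on the real data, `df_psex['Survived'].unstack('Sex')` should produce the same layout.

```python
import pandas as pd

# Survival rates copied from the Pclass x Sex table above.
index = pd.MultiIndex.from_product([[1, 2, 3], ["female", "male"]],
                                   names=["Pclass", "Sex"])
rates = pd.Series([0.968085, 0.368852, 0.921053, 0.157407, 0.500000, 0.135447],
                  index=index, name="Survived")

# One row per class, one column per sex.
wide = rates.unstack("Sex")

# Gap between female and male survival within each class.
gap = wide["female"] - wide["male"]

# Worst-off women versus best-off men.
lowest_female = wide["female"].min()
highest_male = wide["male"].max()

print(wide)
print(gap)
print(lowest_female > highest_male)
```

The last comparison is the claim made above in numeric form: the lowest female rate (third class) is still above the highest male rate (first class).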
Effect of Age on survival
Plot the age distribution for each class and each sex, and how it relates to survival.
fig, ax = plt.subplots(1, 2, figsize=(18, 8))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally.
sns.violinplot(x='Pclass', y='Age', hue='Survived', data=titanic, split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived', size=18)
ax[0].set_yticks(range(0, 110, 10))
sns.violinplot("Sex" ,