Titanic Survival Factor Analysis and Survival Prediction Modeling

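Background:

The luxury liner Titanic was about to sink, and there were not enough lifeboats for everyone on board. The deputy captain's "women and children first" instruction meant that survival was no longer a matter of random chance.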

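Analysis process:

1. Workflow design: data preparation, data cleaning, analysis and presentation, modeling and evaluation
2. Data preparation and inspection
3. Data preprocessing: cleaning, transformation, filling missing values, etc.
4. Analysis and presentation (one-dimensional and two-dimensional analysis)
5. Modeling and evaluation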

I: Data Preparation

1.1 Loading the data

import numpy as np
import pandas as pd
import seaborn as sns
titanic_df=pd.read_csv('titanic-data.csv')

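1.2 Inspecting the data

The survival rate is 38.38%. More than 75% of passengers had no parents or children aboard, and more than 50% had no siblings or spouse aboard.
There are 891 records in total. Cabin is missing 687 values, Age is missing 177, and Embarked is missing 2. The minimum Fare is 0, which can also be treated as a missing value.
Name, Sex, Ticket, Cabin, and Embarked are string (object) columns.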
print (titanic_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
## Show the first 5 records
print (titanic_df.head(5))
print ('--------------')
## Show the distribution of the numeric columns
print (titanic_df.describe())
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
--------------
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
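
The missing-value counts quoted above can also be read off directly instead of subtracting the non-null counts from 891 by hand. The following is a minimal sketch on a small made-up frame (`toy_df` and its values are illustrative assumptions, not rows from titanic-data.csv). It also counts zero fares separately, since a 0 is a placeholder that `isnull()` will not catch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for titanic-data.csv (values invented for illustration).
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 0.0, 7.925, 53.1],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
})

# Number of missing (NaN) values in each column.
missing = toy_df.isnull().sum()

# Zero fares are not NaN, so they have to be counted with an explicit comparison.
zero_fares = int((toy_df['Fare'] == 0).sum())

print(missing)
print('zero fares:', zero_fares)
```

On the real data the same two expressions would be applied to `titanic_df`.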

II: Data Preprocessing

2.1 Data cleaning

Some records have a Fare of 0, which is clearly unreasonable.
## Handle zero-value fares
titanic_df[titanic_df['Fare']==0]

  PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
179 180 0 3 Leonard, Mr. Lionel male 36.0 0 0 LINE 0.0 NaN S
263 264 0 1 Harrison, Mr. William male 40.0 0 0 112059 0.0 B94 S
271 272 1 3 Tornquist, Mr. William Henry male 25.0 0 0 LINE 0.0 NaN S
277 278 0 2 Parkes, Mr. Francis "Frank" male NaN 0 0 239853 0.0 NaN S
302 303 0 3 Johnson, Mr. William Cahoone Jr male 19.0 0 0 LINE 0.0 NaN S
413 414 0 2 Cunningham, Mr. Alfred Fleming male NaN 0 0 239853 0.0 NaN S
466 467 0 2 Campbell, Mr. William male NaN 0 0 239853 0.0 NaN S
481 482 0 2 Frost, Mr. Anthony Wood "Archie" male NaN 0 0 239854 0.0 NaN S
597 598 0 3 Johnson, Mr. Alfred male 49.0 0 0 LINE 0.0 NaN S
633 634 0 1 Parr, Mr. William Henry Marsh male NaN 0 0 112052 0.0 NaN S
674 675 0 2 Watson, Mr. Ennis Hastings male NaN 0 0 239856 0.0 NaN S
732 733 0 2 Knight, Mr. Robert J male NaN 0 0 239855 0.0 NaN S
806 807 0 1 Andrews, Mr. Thomas Jr male 39.0 0 0 112050 0.0 A36 S
815 816 0 1 Fry, Mr. Richard male NaN 0 0 112058 0.0 B102 S
822 823 0 1 Reuchlin, Jonkheer. John George male 38.0 0 0 19972 0.0 NaN S

import matplotlib.pyplot as plt
fig=plt.figure(figsize=(15,5))
titanic_df.Fare[titanic_df['Embarked']=='S'].plot(kind='kde') 
plt.show()

## Fares for passengers who embarked at port S are mostly concentrated in the 0-20 range

titanic_df[titanic_df['Embarked']=='S'].Fare.median()

13.0

## Fill the zero-value fares with the median
titanic_df.loc[titanic_df['Fare']==0,'Fare']=13.0
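
One possible variation on the step above (an assumption on my part, not what the original code does): the median at L127 is computed while the zero fares are still in the column, so they pull it down slightly. The sketch below marks zeros as missing first, takes the median of the remaining port-S fares, and fills from that value instead of a hard-coded constant. `toy_df` is a made-up frame for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values invented for illustration).
toy_df = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'S', 'C'],
    'Fare': [0.0, 8.0, 13.0, 26.0, 80.0],
})

# Treat zero fares as missing so they are excluded from the median.
fare = toy_df['Fare'].replace(0, np.nan)

# Median of the known fares among passengers who embarked at S.
s_median = fare[toy_df['Embarked'] == 'S'].median()

# Fill only the rows that originally had a zero fare.
toy_df.loc[toy_df['Fare'] == 0, 'Fare'] = s_median

print(toy_df)
```

Whether the difference matters on the real data depends on how many zeros there are relative to the group size; with 15 zero fares it is likely small.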

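2.2 Filling missing values

2.2.1 Cabin: 204 values present, 687 missing (approach: with 77% of the values missing the column has no analytical value, so drop it)

#### Cabin missing values: drop the column
titanic_df=titanic_df.drop('Cabin',axis=1)

2.2.2 Embarked (port of embarkation): 889 values present, 2 missing (approach: fill from similar records)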

# Look at the two records with a missing Embarked value; both have a fare of 80
titanic_df[titanic_df.Embarked.isnull()]

  PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 NaN
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 NaN

fig=plt.figure(figsize=(15,5))
df=titanic_df[titanic_df['Pclass']==1]
i=0
for x in ['S','C','Q']:
    plt.subplot2grid((1,3),(0,i))
    i=i+1
    df.Fare[df.Embarked==x].plot(kind='box')
    plt.title(x)
plt.show()

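## For first-class passengers who embarked at C, the box plot is centered near a fare of 80, so fill the missing Embarked values with C
titanic_df.loc[titanic_df['Embarked'].isnull(),'Embarked']='C'

2.2.3 Age: only 714 values present, 177 missing (approach: fill with random integers drawn from the Q1-Q3 range)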

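### Handle missing Age values: fill with random integers between Q1 and Q3
Q1=titanic_df['Age'].describe()['25%']
Q3=titanic_df['Age'].describe()['75%']
num=titanic_df['Age'].isnull().sum()
## Truncate the bounds to integers explicitly; the upper bound of randint is exclusive
random_ages=np.random.randint(int(Q1),int(Q3),size=num)
## Assign through .loc to avoid chained indexing
titanic_df.loc[titanic_df['Age'].isnull(),'Age']=random_ages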
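
Because the fill is random, every run produces a slightly different dataset. A minimal sketch of a repeatable version follows, assuming NumPy's `default_rng` generator is available; the frame `toy_df` and the seed value are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values invented for illustration).
toy_df = pd.DataFrame({'Age': [4.0, 20.0, np.nan, 30.0, 40.0, np.nan, 62.0]})

# Interquartile bounds of the known ages, truncated to integers.
low = int(toy_df['Age'].quantile(0.25))
high = int(toy_df['Age'].quantile(0.75))

# A seeded generator makes the random fill repeatable across runs.
rng = np.random.default_rng(seed=0)
missing_mask = toy_df['Age'].isnull()

# integers() draws from [low, high): the upper bound is excluded.
toy_df.loc[missing_mask, 'Age'] = rng.integers(low, high, size=int(missing_mask.sum()))

print(toy_df)
```

Note that drawing only from the interquartile range narrows the age distribution: every filled value lands in the middle of the range, so none of them can fall into the child or elderly groups used later.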

2.3 Encoding nominal variables


def trans(data):  
    data.loc[data['Sex']=='male','Sex']=0 
    data.loc[data['Sex']=='female','Sex']=1
                                               
    data.loc[data['Embarked']=='S','Embarked']=0
    data.loc[data['Embarked']=='C','Embarked']=1
    data.loc[data['Embarked']=='Q','Embarked']=2

trans(titanic_df)
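
As an alternative sketch of the same encoding (not the author's code), `Series.map` with a dictionary applies all codes in one pass and yields an integer column, whereas assigning numbers into a string column through `.loc` typically leaves it with object dtype. The dictionaries reuse the codes from `trans()`; `toy_df`, `sex_codes`, and `port_codes` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration).
toy_df = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# Same codes as trans(): male=0, female=1; S=0, C=1, Q=2.
sex_codes = {'male': 0, 'female': 1}
port_codes = {'S': 0, 'C': 1, 'Q': 2}

# Work on a copy so the original strings stay available.
encoded = toy_df.copy()
encoded['Sex'] = encoded['Sex'].map(sex_codes)
encoded['Embarked'] = encoded['Embarked'].map(port_codes)

print(encoded.dtypes)
```

One caveat: any value that is not a key in the dictionary becomes NaN under `map`, so missing values need to be filled before encoding.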


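III: Analysis and Presentation

3.1 One-dimensional analysis

3.1.1 Pclass (ticket class)
First class: 216 passengers, 136 survived, survival rate 62.96%
Second class: 184 passengers, 87 survived, survival rate 47.28%
Third class: 491 passengers, 119 survived, survival rate 24.24%
There was no "rich first" rule, yet the survival rate rises with cabin class. This may be related to location: first class was on the upper decks, which made escape easier.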

## Group by Pclass to get the number of survivors and the survival rate
Pc_dsc=titanic_df.groupby(['Pclass']).describe()['Survived']
Pc_dsc['Survived']=Pc_dsc['count']*Pc_dsc['mean']

Pc_dsc

  count mean std min 25% 50% 75% max Survived
Pclass                  
1 216.0 0.629630 0.484026 0.0 0.0 1.0 1.0 1.0 136.0
2 184.0 0.472826 0.500623 0.0 0.0 0.0 1.0 1.0 87.0
3 491.0 0.242363 0.428949 0.0 0.0 0.0 0.0 1.0 119.0

Pc_dsc['mean'].plot(kind='bar')
plt.title('Pclass_survived_rate')
plt.show()
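
The survivor count above is recovered as `count * mean`. Since Survived is 0 or 1, a sketch of a more direct route is to aggregate with `count`, `sum`, and `mean` in one call. The frame `toy_df` and the renamed column labels are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration).
toy_df = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# For a 0/1 column: count = passengers, sum = survivors, mean = survival rate.
summary = toy_df.groupby('Pclass')['Survived'].agg(['count', 'sum', 'mean'])
summary.columns = ['passengers', 'survivors', 'survival_rate']

print(summary)
```

The same pattern works for any of the grouping columns used in this section (Sex, Age_class, Fare_class).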
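
3.1.2 Sex
Men: 577 in total, 109 survived, survival rate 18.89%
Women: 314 in total, 233 survived, survival rate 74.20%
The deputy captain's call appears to have worked: the survival rate for women is clearly higher than that of any cabin class, while the survival rate for men is lower than that of any cabin class.

## Group by Sex to get the number of survivors and the survival rate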
Sex_dsc=titanic_df.groupby(['Sex']).describe()['Survived']
Sex_dsc['Survived']=Sex_dsc['count']*Sex_dsc['mean']
Sex_dsc

  count mean std min 25% 50% 75% max Survived
Sex                  
0 577.0 0.188908 0.391775 0.0 0.0 0.0 0.0 1.0 109.0
1 314.0 0.742038 0.438211 0.0 0.0 1.0 1.0 1.0 233.0

Sex_dsc['mean'].plot(kind='bar')
plt.title('Sex of survived')
plt.show()
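
3.1.3 Age
Children: 69, of whom 40 survived, survival rate 57.97% (evidence of "children first")
Teenagers: 70, of whom 30 survived, survival rate 42.86%
Adults: 712, of whom 260 survived, survival rate 36.52%
Elderly: 40, of whom 12 survived, survival rate 30%

## Create the age-group field
'''
0-12 children, 12-18 teenagers, 18-55 adults, 55-80 elderly
'''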
titanic_df['Age_class']=pd.cut(titanic_df['Age'],bins=[0,12,18,55,80])

Age_dsc=titanic_df.groupby(['Age_class']).describe()['Survived']
Age_dsc['Survived']=Age_dsc['count']*Age_dsc['mean']
Age_dsc

  count mean std min 25% 50% 75% max Survived
Age_class                  
(0, 12] 69.0 0.579710 0.497222 0.0 0.0 1.0 1.0 1.0 40.0
(12, 18] 70.0 0.428571 0.498445 0.0 0.0 0.0 1.0 1.0 30.0
(18, 55] 712.0 0.365169 0.481816 0.0 0.0 0.0 1.0 1.0 260.0
(55, 80] 40.0 0.300000 0.464095 0.0 0.0 0.0 1.0 1.0 12.0

Age_dsc['mean'].plot(kind='bar')
plt.title('Age of survived')
plt.show()
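
3.1.4 Fare: relationship between fare and survival rate
Fares are split into 4 levels at the quartile points. The chart shows that the survival rate rises as the fare increases.

## Fares are widely spread, so look at the distribution first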
titanic_df.Fare.describe()

count    891.000000
mean      32.423063
std       49.579484
min        4.012500
25%        7.925000
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

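## Create the fare-group field Fare_class, using the quartile points above (7.91, 14.45, 31) to split passengers into four roughly equal groups
titanic_df['Fare_class']=pd.cut(titanic_df['Fare'],bins=[0,7.91,14.45,31,513])
## Survival rate: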
fare_dsc=titanic_df.groupby(['Fare_class']).describe()['Survived']
fare_dsc

  count mean std min 25% 50% 75% max
Fare_class                
(0.0, 7.91] 208.0 0.206731 0.405938 0.0 0.0 0.0 0.0 1.0
(7.91, 14.45] 232.0 0.293103 0.456170 0.0 0.0 0.0 1.0 1.0
(14.45, 31.0] 229.0 0.445415 0.498100 0.0 0.0 0.0 1.0 1.0
(31.0, 513.0] 222.0 0.581081 0.494497 0.0 0.0 1.0 1.0 1.0

fare_dsc['mean'].plot(kind='bar')
plt.show()
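
The bin edges above are typed in by hand from the `describe()` output. As a hedged alternative, `pd.qcut` computes quantile-based edges from the data itself, so the bins stay balanced if the fares change (for example after the zero-fare fill). The frame `toy_df` below is a made-up example.

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration).
toy_df = pd.DataFrame({
    'Fare':     [5.0, 7.0, 8.0, 10.0, 15.0, 20.0, 40.0, 90.0],
    'Survived': [0,   0,   0,   1,    0,    1,    1,    1],
})

# qcut picks the cut points from the data so each bin holds about the same number of rows.
toy_df['Fare_class'] = pd.qcut(toy_df['Fare'], q=4)

# Survival rate within each fare quartile.
rate_by_fare = toy_df.groupby('Fare_class', observed=True)['Survived'].mean()

print(rate_by_fare)
```

With heavily repeated fare values the quantile edges can coincide, in which case `qcut` raises an error unless `duplicates='drop'` is passed.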

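3.2 Two-dimensional analysis

3.2.1 Sex and cabin class
Sex and cabin class both show a steep gradient in survival rate, so analyzing the two factors together may give a clearer picture.
Men and women in first and third class are grouped as: rich men, poor men, rich women, poor women
--------------
Poor men have a survival rate below 20%, while rich women have a survival rate close to 100%.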

fig=plt.figure(figsize=(15,5))
## Rich men: survival
plt.subplot2grid((1,4),(0,0))
titanic_df.Survived[(titanic_df.Pclass==1)&(titanic_df.Sex==0)].value_counts(normalize=True).plot(kind='bar',alpha=0.5)
plt.title('Rich Men survived')

## Poor men: survival
plt.subplot2grid((1,4),(0,1))
titanic_df.Survived[(titanic_df.Pclass==3)&(titanic_df.Sex==0)].value_counts(normalize=True).plot(kind='bar',alpha=0.5)
plt.title('Poor Men survived')

## Rich women: survival
plt.subplot2grid((1,4),(0,2))
titanic_df.Survived[(titanic_df.Pclass==1)&(titanic_df.Sex==1)].value_counts(normalize=True).plot(kind='bar',alpha=0.5)
plt.title('Rich Women survived')

## Poor women: survival
plt.subplot2grid((1,4),(0,3))
titanic_df.Survived[(titanic_df.Pclass==3)&(titanic_df.Sex==1)].value_counts(normalize=True).plot(kind='bar',alpha=0.5)
plt.title('Poor Women survived')

plt.show()
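
The four bar charts each show one combination of class and sex. A compact way to read the same rates as numbers is a pivot table with class on the rows and sex on the columns. This is a sketch on made-up data (`toy_df` is an illustrative assumption), using the 0 = male, 1 = female coding from section 2.3.

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration); Sex: 0 = male, 1 = female.
toy_df = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex':      [0, 0, 1, 1, 0, 0, 1, 1],
    'Survived': [1, 0, 1, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 column is the survival rate for each (Pclass, Sex) cell.
rates = toy_df.pivot_table(index='Pclass', columns='Sex',
                           values='Survived', aggfunc='mean')

print(rates)
```

Calling `rates.plot(kind='bar')` would then draw all combinations on a single axis for direct comparison.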
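
3.2.2 Cabin class and age group
For every age group except children, the survival rate rises with cabin class.
Children do not follow this pattern, but first class contains only 4 children, which is too few to be representative.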

fig=plt.figure(figsize=(15,5))
titanic_df.groupby(['Age_class','Pclass'])['Survived'].mean().unstack().plot(kind='bar')
plt.title('Pclass_age wrt Survived')
plt.show()

## Examine why the children's age group departs from the pattern
titanic_df[titanic_df.Age<=12].groupby('Pclass').describe()['Survived']

  count mean std min 25% 50% 75% max
Pclass                
1 4.0 0.750000 0.500000 0.0 0.75 1.0 1.0 1.0
2 17.0 1.000000 0.000000 1.0 1.00 1.0 1.0 1.0
3 48.0 0.416667 0.498224 0.0 0.00 0.0 1.0 1.0

3.2.3 Sex and age group
Children have a high survival rate regardless of sex.
Among non-children, however, "women first" shows up clearly.

fig=plt.figure(figsize=(15,5))
titanic_df.groupby(['Age_class','Sex'])['Survived'].mean().unstack().plot(kind='bar')
plt.title('Sex_age wrt Survived')
plt.show()

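3.3 Preliminary conclusions from the data

Sex, age (child versus non-child), and cabin class are the main factors affecting the survival rate.

IV: Modeling and Evaluation

This is a typical binary classification problem: whether a passenger survived depends on several factors. I chose logistic regression and decision trees for modeling.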

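4.1 Logistic regression modeling

The model is evaluated with cross-validation. The data are split into several folds (three here); in each round the classifier is trained on all but one fold (the training set) and the resulting model is scored on the held-out fold (the validation set). The mean score across the rounds serves as the measure of the classifier's performance.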

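from  sklearn.model_selection import cross_val_score
from  sklearn.linear_model import LogisticRegression

## Initialize the algorithm
alg=LogisticRegression(C=1.0,penalty='l2',tol=1e-6,random_state=1)

## Select the features that influence the survival prediction
titanic=titanic_df.loc[:,['Survived','Pclass','Sex','Fare','SibSp','Parch','Embarked','Age']]
##titanic=titanic_df.ix[:,['Survived','Pclass','Sex','Fare','SibSp','Parch','Embarked','Ca']]
## Sex and Embarked were encoded in place and may still have object dtype, so cast the features to float
X=titanic.iloc[:,1:].astype(float).values
y=titanic['Survived'].values

## 3-fold cross-validation; the mean accuracy over the folds is the score
scores=cross_val_score(alg,X,y,cv=3)
scores.mean()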

0.78563411896745228
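
To make concrete what the cross-validation score above measures, here is a minimal sketch that spells out the fold loop by hand and compares it with scikit-learn's `cross_val_score`. It runs on synthetic data (`X_demo`, `y_demo`, and the seed are assumptions for illustration, not the Titanic features) and uses a default `LogisticRegression` rather than the settings above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical synthetic data: 90 samples, 3 features, a noisy binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=90) > 0).astype(int)

# Three stratified folds: each fold keeps roughly the same class balance.
folds = StratifiedKFold(n_splits=3)

# Manual loop: train on two folds, score accuracy on the held-out fold.
manual_scores = []
for train_idx, valid_idx in folds.split(X_demo, y_demo):
    model = LogisticRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[valid_idx], y_demo[valid_idx]))

# The helper performs the same split, fit, and score steps internally.
auto_scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=folds)

print(np.mean(manual_scores), auto_scores.mean())
```

Every sample is used for validation exactly once, so the averaged score uses all of the data without ever scoring a model on rows it was trained on.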
