Titanic Survival Factor Analysis and Survival Prediction Modeling

YUI0908

Tags: data analysis

Article link: https://blog.csdn.net/qq_26185193/article/details/79187949

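Background:

The luxury liner Titanic was sinking and there were not enough lifeboats for everyone. The deputy captain's order of 'women and children first' meant that survival was no longer a matter of random chance.

Analysis process:

1. Workflow design: data preparation, data cleaning, analysis and visualization, modeling and evaluation
2. Data preparation and inspection
3. Data preprocessing: cleaning, transformation, filling missing values, etc.
4. Analysis and visualization (univariate and bivariate analysis)
5. Modeling and evaluation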

I. Data preparation

1.1 Importing the data

import numpy as np
import pandas as pd
import seaborn as sns
titanic_df=pd.read_csv('titanic-data.csv')

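1.2 Inspecting the data

The survival rate is 38.38%. At least 75% of passengers had no parents or children aboard, and at least 50% had no siblings or spouse aboard.
There are 891 records in total. Cabin is missing 687 values, Age 177 and Embarked 2. The minimum Fare is 0, which can also be treated as a missing value.
Sex, Cabin and Embarked hold string data (as do Name and Ticket).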

print (titanic_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None

## Show the first 5 records
print (titanic_df.head(5))
print ('--------------')
## Show summary statistics for the numeric columns
print (titanic_df.describe())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
--------------
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200
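
The missing-value counts quoted at the start of this section can also be read off directly instead of subtracting the non-null counts from 891. A minimal sketch, assuming a small made-up frame (`toy_df`, hypothetical) in place of `titanic-data.csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.2833, 0.0, 8.05],
})

# Number of missing values per column
missing_counts = toy_df.isnull().sum()
# Fraction of missing values per column
missing_ratio = toy_df.isnull().mean()
# Zero fares are not NaN, so they have to be counted separately
zero_fares = int((toy_df['Fare'] == 0).sum())

print(missing_counts)
print(missing_ratio)
print(zero_fares)
```

On the real data the same three expressions would be applied to `titanic_df`.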

II. Data preprocessing

2.1 Data cleaning

Some fares are 0, which is clearly unreasonable.

## Handle the zero-valued fares
titanic_df[titanic_df['Fare']==0]

	PassengerId	Survived	Pclass	Name	Sex	Age	Ticket	Cabin	Embarked
179	180	0	3	Leonard, Mr. Lionel	male	36.0	LINE	NaN	S
263	264	0	1	Harrison, Mr. William	male	40.0	112059	B94	S
271	272	1	3	Tornquist, Mr. William Henry	male	25.0	LINE	NaN	S
277	278	0	2	Parkes, Mr. Francis "Frank"	male	NaN	239853	NaN	S
302	303	0	3	Johnson, Mr. William Cahoone Jr	male	19.0	LINE	NaN	S
413	414	0	2	Cunningham, Mr. Alfred Fleming	male	NaN	239853	NaN	S
466	467	0	2	Campbell, Mr. William	male	NaN	239853	NaN	S
481	482	0	2	Frost, Mr. Anthony Wood "Archie"	male	NaN	239854	NaN	S
597	598	0	3	Johnson, Mr. Alfred	male	49.0	LINE	NaN	S
633	634	0	1	Parr, Mr. William Henry Marsh	male	NaN	112052	NaN	S
674	675	0	2	Watson, Mr. Ennis Hastings	male	NaN	239856	NaN	S
732	733	0	2	Knight, Mr. Robert J	male	NaN	239855	NaN	S
806	807	0	1	Andrews, Mr. Thomas Jr	male	39.0	112050	A36	S
815	816	0	1	Fry, Mr. Richard	male	NaN	112058	B102	S
822	823	0	1	Reuchlin, Jonkheer. John George	male	38.0	19972	NaN	S

import matplotlib.pyplot as plt
fig=plt.figure(figsize=(15,5))
titanic_df.Fare[titanic_df['Embarked']=='S'].plot(kind='kde') 
plt.show()

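## Fares for passengers who embarked at port S (all the zero-fare passengers did) are mostly concentrated between 0 and 20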

titanic_df[titanic_df['Embarked']=='S'].Fare.median()

13.0

## Fill the zero fares with the median fare of port S
titanic_df.loc[titanic_df['Fare']==0,'Fare']=13.0
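
The fill above uses one constant, the port-S median, for every zero fare. As a hedged alternative (not the author's method), zero fares can be marked as missing and filled with the median of the passenger's own class, since the zero-fare rows span all three classes. The frame `toy_df` and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two classes, each with one zero fare.
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Fare': [50.0, 0.0, 70.0, 8.0, 0.0, 10.0],
})

# Mark zero fares as missing so they do not distort the medians
fare = toy_df['Fare'].replace(0, np.nan)
# Median fare of each class, broadcast back to every row
class_median = fare.groupby(toy_df['Pclass']).transform('median')
# Fill only the missing fares with the median of their class
toy_df['Fare'] = fare.fillna(class_median)

print(toy_df)
```

Whether a per-class fill is better than a single constant depends on how much fares differ between classes in the real data.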

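2.2 Filling missing values

2.2.1 Cabin: 204 values present, 687 missing (77% of the values are missing, so the column has little analytical value and is dropped)

#### Handle missing Cabin values: drop the column
titanic_df=titanic_df.drop('Cabin',axis=1)

2.2.2 Embarked (port of embarkation): 889 values present, 2 missing (filled from similar passengers)

# Show the two records with missing Embarked; both are first class with a fare of 80
titanic_df[titanic_df.Embarked.isnull()]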

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked
61	62	1	1	Icard, Miss. Amelie	female	38.0	0	0	113572	80.0	NaN
829	830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	0	113572	80.0	NaN

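## Box plots of first-class fares, one panel per port of embarkation
fig=plt.figure(figsize=(15,5))
df=titanic_df[titanic_df['Pclass']==1]
for i,x in enumerate(['S','C','Q']):
    plt.subplot2grid((1,3),(0,i))
    df.Fare[df.Embarked==x].plot(kind='box')
    plt.title(x)
plt.show()

## For first-class passengers who embarked at C, the median fare is close to 80, so fill the missing Embarked values with C
titanic_df.loc[titanic_df['Embarked'].isnull(),'Embarked']='C'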
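
The 'similar passengers' choice can also be made numerically instead of by reading the box plots. A minimal sketch, assuming made-up first-class fares (`first_class` is hypothetical; only the fare of 80 comes from the two incomplete records):

```python
import pandas as pd

# Hypothetical first-class fares by port; the numbers are illustrative only.
first_class = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'C', 'C', 'C', 'Q', 'Q'],
    'Fare': [30.0, 52.0, 60.0, 70.0, 78.0, 90.0, 90.0, 90.0],
})
missing_fare = 80.0  # fare paid by the two passengers with no Embarked value

# Median fare per port of embarkation
median_by_port = first_class.groupby('Embarked')['Fare'].median()
# Port whose median fare is closest to the fare of the incomplete rows
best_port = (median_by_port - missing_fare).abs().idxmin()

print(median_by_port)
print(best_port)
```

With the real data, `first_class` would be `titanic_df[titanic_df['Pclass']==1]`.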

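2.2.3 Age: 714 values present, 177 missing (filled with random integers drawn between Q1 and Q3)

### Fill missing Age values with random integers between Q1 and Q3
Q1=titanic_df['Age'].quantile(0.25)
Q3=titanic_df['Age'].quantile(0.75)
num=titanic_df['Age'].isnull().sum()
## randint needs integer bounds; the upper bound is exclusive
random_ages=np.random.randint(int(Q1),int(Q3),size=num)
## Assign through .loc to avoid chained-assignment problems
titanic_df.loc[titanic_df['Age'].isnull(),'Age']=random_ages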
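
Because the fill is random, every run produces a slightly different data set. A minimal sketch of a reproducible variant, assuming a seeded NumPy generator and a made-up series (`toy_age` is hypothetical); unlike the code above, the upper bound here is inclusive:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values.
toy_age = pd.Series([10.0, 20.0, np.nan, 30.0, 40.0, np.nan, 50.0])

# Quartiles are computed on the non-missing values only
q1 = toy_age.quantile(0.25)
q3 = toy_age.quantile(0.75)
is_missing = toy_age.isnull()

# A fixed seed makes the random fill repeatable between runs
rng = np.random.default_rng(0)
fill_values = rng.integers(int(q1), int(q3) + 1, size=is_missing.sum())

filled_age = toy_age.copy()
filled_age[is_missing] = fill_values

print(filled_age)
```

The seed value 0 is arbitrary; any fixed seed gives a repeatable fill.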

2.3 Encoding nominal variables

def trans(data):
    ## Encode Sex: male -> 0, female -> 1
    data['Sex']=data['Sex'].map({'male':0,'female':1})
    ## Encode Embarked: S -> 0, C -> 1, Q -> 2
    data['Embarked']=data['Embarked'].map({'S':0,'C':1,'Q':2})

trans(titanic_df)
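
The codes 0/1/2 give Embarked an order (S < C < Q) that the ports do not really have, and a linear model such as logistic regression will treat that order as meaningful. One common alternative, shown here as a sketch rather than as the author's approach, is one-hot encoding with `pd.get_dummies`. The frame `toy_df` is made up:

```python
import pandas as pd

# Hypothetical rows covering both sexes and all three ports.
toy_df = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

encoded = toy_df.copy()
# A binary variable can keep a single 0/1 column
encoded['Sex'] = encoded['Sex'].map({'male': 0, 'female': 1})
# One indicator column per port instead of the codes 0/1/2
encoded = pd.get_dummies(encoded, columns=['Embarked'], dtype=int)

print(encoded)
```

The rest of this article keeps the integer codes, so the group labels in the outputs below are 0 and 1 for Sex.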

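III. Analysis and visualization

3.1 Univariate analysis

3.1.1 Pclass (ticket class)

First class: 216 passengers, 136 survivors, survival rate 62.96%
Second class: 184 passengers, 87 survivors, survival rate 47.28%
Third class: 491 passengers, 119 survivors, survival rate 24.24%
There was no 'rich first' rule, yet the survival rate rises with cabin class. This may be related to location: first-class cabins were on the upper decks, which made escape easier.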

## Group by Pclass and compute the number of survivors and the survival rate
Pc_dsc=titanic_df.groupby(['Pclass']).describe()['Survived']
Pc_dsc['Survived']=Pc_dsc['count']*Pc_dsc['mean']

Pc_dsc

	count	mean	std	min	25%	50%	75%	max	Survived
Pclass
1	216.0	0.629630	0.484026	0.0	0.0	1.0	1.0	1.0	136.0
2	184.0	0.472826	0.500623	0.0	0.0	0.0	1.0	1.0	87.0
3	491.0	0.242363	0.428949	0.0	0.0	0.0	0.0	1.0	119.0

Pc_dsc['mean'].plot(kind='bar')
plt.title('Pclass_survived_rate')
plt.show()
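
The survival rate of a group is simply survivors divided by passengers, so the three numbers quoted per class can be computed in one `agg` call without going through `describe()`. A minimal sketch on made-up data (`toy_df` and the column names in `summary` are hypothetical):

```python
import pandas as pd

# Hypothetical passengers: 2 in first class, 2 in second, 4 in third.
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# count = passengers, sum = survivors, mean = survival rate
summary = toy_df.groupby('Pclass')['Survived'].agg(['count', 'sum', 'mean'])
summary.columns = ['passengers', 'survivors', 'survival_rate']

print(summary)
```

The same pattern works for the Sex, Age_class and Fare_class groupings used later.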

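3.1.2 Sex

Men: 577 passengers, 109 survivors, survival rate 18.89%
Women: 314 passengers, 233 survivors, survival rate 74.20%
The deputy captain's call appears to have worked: the women's survival rate is higher than that of any cabin class, while the men's survival rate is lower than that of any cabin class.

## Group by Sex and compute the number of survivors and the survival rate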
Sex_dsc=titanic_df.groupby(['Sex']).describe()['Survived']
Sex_dsc['Survived']=Sex_dsc['count']*Sex_dsc['mean']
Sex_dsc

	count	mean	std	min	25%	50%	75%	max	Survived
Sex
0	577.0	0.188908	0.391775	0.0	0.0	0.0	0.0	1.0	109.0
1	314.0	0.742038	0.438211	0.0	0.0	1.0	1.0	1.0	233.0

Sex_dsc['mean'].plot(kind='bar')
plt.title('Sex of survived')
plt.show()

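3.1.3 Age

Children: 69 passengers, 40 survivors, survival rate 57.97% (evidence of 'children first')
Teenagers: 70 passengers, 30 survivors, survival rate 42.86%
Adults: 712 passengers, 260 survivors, survival rate 36.52% (this group includes the 177 passengers whose age was imputed)
Seniors: 40 passengers, 12 survivors, survival rate 30%

## Create the age group field
'''
0-12 children, 12-18 teenagers, 18-55 adults, 55-80 seniors
'''
titanic_df['Age_class']=pd.cut(titanic_df['Age'],bins=[0,12,18,55,80])

Age_dsc=titanic_df.groupby(['Age_class']).describe()['Survived']
Age_dsc['Survived']=Age_dsc['count']*Age_dsc['mean']
Age_dsc

	count	mean	std	min	25%	50%	75%	max	Survived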
Age_class
(0, 12]	69.0	0.579710	0.497222	0.0	0.0	1.0	1.0	1.0	40.0
(12, 18]	70.0	0.428571	0.498445	0.0	0.0	0.0	1.0	1.0	30.0
(18, 55]	712.0	0.365169	0.481816	0.0	0.0	0.0	1.0	1.0	260.0
(55, 80]	40.0	0.300000	0.464095	0.0	0.0	0.0	1.0	1.0	12.0

Age_dsc['mean'].plot(kind='bar')
plt.title('Age of survived')
plt.show()

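3.1.4 Fare and survival rate

Fares are split into four levels at the quartiles. The plot shows that the survival rate increases with the fare.

## Fares are widely spread, so look at the distribution first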
titanic_df.Fare.describe()

count    891.000000
mean      32.423063
std       49.579484
min        4.012500
25%        7.925000
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

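## Create the fare group field Fare_class, using the quartile cut points (7.91, 14.45, 31) to split passengers into four groups of roughly equal size
titanic_df['Fare_class']=pd.cut(titanic_df['Fare'],bins=[0,7.91,14.45,31,513])
## Survival rate: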
fare_dsc=titanic_df.groupby(['Fare_class']).describe()['Survived']
fare_dsc

	count	mean	std	min	25%	50%	75%	max
Fare_class
(0.0, 7.91]	208.0	0.206731	0.405938	0.0	0.0	0.0	0.0	1.0
(7.91, 14.45]	232.0	0.293103	0.456170	0.0	0.0	0.0	1.0	1.0
(14.45, 31.0]	229.0	0.445415	0.498100	0.0	0.0	0.0	1.0	1.0
(31.0, 513.0]	222.0	0.581081	0.494497	0.0	0.0	1.0	1.0	1.0

fare_dsc['mean'].plot(kind='bar')
plt.show()
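
The bin edges above are typed in by hand from the `describe()` output. As a hedged alternative, `pd.qcut` computes quantile-based edges itself, so the edges follow the data if the fares change. A minimal sketch on made-up fares (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical fares and outcomes.
toy_df = pd.DataFrame({
    'Fare': [5.0, 7.0, 8.0, 10.0, 15.0, 20.0, 40.0, 100.0],
    'Survived': [0, 0, 0, 1, 0, 1, 1, 1],
})

# q=4 cuts at the quartiles, so each bin holds about a quarter of the rows
toy_df['Fare_class'] = pd.qcut(toy_df['Fare'], q=4)
# Survival rate per fare bin, in ascending fare order
rate_by_fare = toy_df.groupby('Fare_class', observed=True)['Survived'].mean()

print(toy_df['Fare_class'].value_counts().sort_index())
print(rate_by_fare)
```

When many passengers paid exactly the same fare, the quantile bins can still come out unequal in size, as they do with the hand-picked edges above.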

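3.2 Bivariate analysis

3.2.1 Sex and cabin class

Sex and cabin class are the two indicators with the steepest survival-rate gradients, so analyzing them together may give a clearer picture.
Men and women in first and third class are grouped as: rich men, poor men, rich women, poor women.
--------------
The survival rate of poor men is below 20%, while that of rich women is close to 100%.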

fig=plt.figure(figsize=(15,5))
## (title, Pclass, Sex) for each group; Sex is encoded as 0=male, 1=female
groups=[('Rich Men survived',1,0),('Poor Men survived',3,0),
        ('Rich Women survived',1,1),('Poor Women survived',3,1)]
for i,(title,pclass,sex) in enumerate(groups):
    plt.subplot2grid((1,4),(0,i))
    mask=(titanic_df.Pclass==pclass)&(titanic_df.Sex==sex)
    ## sort_index keeps the bar order (0 = died, 1 = survived) the same in every panel
    titanic_df.Survived[mask].value_counts(normalize=True).sort_index().plot(kind='bar',alpha=0.5)
    plt.title(title)

plt.show()
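
The four panels can also be summarized as a single table of survival rates. A minimal sketch using `pivot_table` on made-up data (`toy_df` is hypothetical; Sex uses the same 0=male, 1=female coding as above):

```python
import pandas as pd

# Hypothetical passengers: two per Sex and Pclass combination.
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex': [0, 0, 1, 1, 0, 0, 1, 1],
    'Survived': [1, 0, 1, 1, 0, 0, 1, 0],
})

# Rows are Sex, columns are Pclass, cells are mean survival (the survival rate)
rate_table = toy_df.pivot_table(index='Sex', columns='Pclass',
                                values='Survived', aggfunc='mean')

print(rate_table)
```

A table like this makes the exact rates readable, which bar heights alone do not.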

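3.2.2 Cabin class and age group

For every age group except children, the survival rate increases with cabin class.
Children do not follow this pattern, but there are only 4 children in first class, too few to be representative.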

## Pass figsize to plot directly; a separate plt.figure() call would only create an empty figure
titanic_df.groupby(['Age_class','Pclass'])['Survived'].mean().unstack().plot(kind='bar',figsize=(15,5))
plt.title('Pclass_age wrt Survived')
plt.show()

## Look into why the child age group deviates from the pattern
titanic_df[titanic_df.Age<=12].groupby('Pclass').describe()['Survived']

	count	mean	std	min	25%	50%	75%	max
Pclass
1	4.0	0.750000	0.500000	0.0	0.75	1.0	1.0	1.0
2	17.0	1.000000	0.000000	1.0	1.00	1.0	1.0	1.0
3	48.0	0.416667	0.498224	0.0	0.00	0.0	1.0	1.0
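
Small groups like the four first-class children are easy to miss in a bar chart of rates. A minimal sketch that keeps the group size next to each rate and sets aside groups below a threshold; `toy_df`, the group labels and `min_count` are all hypothetical:

```python
import pandas as pd

# Hypothetical passengers; the (child, class 1) group has a single member.
toy_df = pd.DataFrame({
    'Age_class': ['child'] * 3 + ['adult'] * 6,
    'Pclass': [1, 3, 3, 1, 1, 1, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 1, 0, 0, 0, 1],
})
min_count = 2  # hypothetical minimum group size for a rate to be reported

# Group size and survival rate for every (age group, class) combination
stats = toy_df.groupby(['Age_class', 'Pclass'])['Survived'].agg(['count', 'mean'])
# Keep only the groups that are large enough
reliable = stats[stats['count'] >= min_count]

print(stats)
print(reliable)
```

The right threshold is a judgment call; the value here only illustrates the filtering step.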

3.2.3 Sex and age group

Regardless of sex, children had a high survival rate.
For non-children, however, 'women first' is clearly visible.

titanic_df.groupby(['Age_class','Sex'])['Survived'].mean().unstack().plot(kind='bar',figsize=(15,5))
plt.title('Sex_age wrt Survived')
plt.show()

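3.3 Preliminary conclusions from the data

Sex, age (child vs. non-child) and cabin class are the main factors affecting the survival rate.

IV. Modeling and evaluation

This is a typical binary classification problem in which survival depends on several factors. I chose logistic regression and decision trees for modeling.

4.1 Logistic regression

The model is evaluated with cross-validation. The data is split into folds; the classifier is trained on the training folds and the resulting model is then tested on the held-out validation fold, with each fold taking a turn as the validation set. The validation scores serve as the measure of the classifier's performance. A separate test set (test_set), kept apart from both the training set (train_set) and the validation set (valid_set), would be used for a final check; the code below performs 3-fold cross-validation only.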

## sklearn.cross_validation was removed from scikit-learn; model_selection provides cross_val_score
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

## Initialize the algorithm
alg=LogisticRegression(C=1.0,penalty='l2',tol=1e-6,random_state=1)

## Select the features that influence survival (.ix and as_matrix() were removed from pandas; use .loc and .values)
titanic=titanic_df.loc[:,['Survived','Pclass','Sex','Fare','SibSp','Parch','Embarked','Age']]
X=titanic.values[:,1:].astype(float)
y=titanic.values[:,0].astype(int)

## 3-fold cross-validation; the score of each fold is its accuracy
scores=cross_val_score(alg,X,y,cv=3)
scores.mean()

0.78563411896745228
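
The mean above hides how much the score varies between folds. A minimal sketch that makes the folds explicit, assuming synthetic data (`X_toy` and `y_toy` are hypothetical, not the Titanic features) and a shuffled, stratified 3-fold split:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem: the label depends mostly on the first feature.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(90, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=90) > 0).astype(int)

# Each of the 3 folds is used once for validation and twice for training;
# stratification keeps the class balance similar in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
model = LogisticRegression(C=1.0, tol=1e-6, random_state=1)
fold_scores = cross_val_score(model, X_toy, y_toy, cv=cv)

print(fold_scores)
print(fold_scores.mean(), fold_scores.std())
```

Reporting the per-fold scores or their standard deviation alongside the mean shows whether a difference between two models is larger than the fold-to-fold noise.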
