Kaggle - Titanic Survival Prediction

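This post describes the author's participation in the Kaggle Titanic survival prediction competition, using Python 3 for data preprocessing, analysis, and modeling. The analysis finds that sex, age, ticket class, and port of embarkation have a significant effect on survival. In preprocessing, missing values are filled, continuous variables are discretized, and new features are created. Several models are tried, including logistic regression, decision trees, random forests, AdaBoost, and gradient boosted trees, with a best Kaggle accuracy of 0.81339.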

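This is my first Kaggle competition, and Titanic is my way in. The goal is to predict whether each passenger survived, based on the passenger information provided. I use Python 3 throughout.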

I. Data Overview

According to the Kaggle platform, the training set has 891 records and the test set has 418 records. The variables provided are:

Variable    Definition                                     Key
survival    Survival                                       0 = No, 1 = Yes
pclass      Ticket class                                   1 = 1st, 2 = 2nd, 3 = 3rd
sex         Sex
Age         Age in years
sibsp       # of siblings / spouses aboard the Titanic
parch       # of parents / children aboard the Titanic
ticket      Ticket number
fare        Passenger fare
cabin       Cabin number
embarked    Port of Embarkation                            C = Cherbourg, Q = Queenstown, S = Southampton

Variable notes:

pclass: A proxy for socio-economic status (SES). 1st = Upper, 2nd = Middle, 3rd = Lower.
Age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.
sibsp: Sibling = brother, sister, stepbrother, stepsister. Spouse = husband, wife (mistresses and fiancés were ignored).
parch: Parent = mother, father. Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.

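First, look at the basic information for the training and test sets to get an overall picture of the data size, the data type of each feature, and where values are missing: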

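import re

import matplotlib.pyplot as plt  # used by the plots later in this post
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score

# Read the data
train = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/train.csv')
test = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/test.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
train_test_combined = pd.concat([train, test], ignore_index=True)

# Show basic information (info() prints directly and returns None)
train.info()
test.info()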

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

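From this output: in the training set, Age (177 missing), Cabin (687 missing), and Embarked (2 missing) are incomplete; in the test set, Age (86 missing), Cabin (327 missing), and Fare (1 missing) are incomplete.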
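
The same missing-value picture can be produced as a compact table instead of being read off `info()`. The following is a minimal sketch, assuming a small made-up DataFrame (`demo`) in place of the real `train`; the helper name `missing_summary` is hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return count and share of missing values for columns with any gaps."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "ratio": counts / len(df),
    })
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up rows for illustration only; not taken from the Titanic data.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
result = missing_summary(demo)
print(result)
```

On the real data, the call would presumably be `missing_summary(train)` and `missing_summary(test)`.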

Next, look at what the data actually looks like:

# Print the first 5 rows (the default)
print (train.head())

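I use the Sublime editor, and because there are so many columns the output wraps across several lines and is hard to read. I therefore looked at the data directly on Kaggle instead.

[Screenshot: preview of the data on Kaggle]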
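
If the wrapped output is the only obstacle, pandas display options may be enough to keep all columns on one line. This is a minimal sketch rather than what the author did: the helper `wide_repr` and the placeholder frame `demo` are assumptions, while `display.max_columns` and `display.expand_frame_repr` are standard pandas options.

```python
import pandas as pd


def wide_repr(df, n=5):
    """Return the first n rows rendered without wrapping columns."""
    with pd.option_context(
        "display.max_columns", None,          # show every column
        "display.expand_frame_repr", False,   # do not wrap across lines
    ):
        return repr(df.head(n))


# Made-up frame with the 12 training-set column names; values are placeholders.
cols = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
        "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
demo = pd.DataFrame([[0] * len(cols)] * 3, columns=cols)
text = wide_repr(demo)
print(text)
```

Using `option_context` keeps the change local, so the rest of the script prints with the default settings.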

II. Preliminary Data Analysis

1. Basic passenger attributes

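For the categorical variables Survived, Sex, Pclass, and Embarked, pie charts show the composition. For the discrete numeric variables SibSp and Parch, bar charts show the distribution. For the continuous numeric variables Age and Fare, histograms show the distribution.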

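# Draw pie charts for the categorical variables
# labeldistance: distance of the labels from the center, in radii (1.1 = 1.1 times the radius)
# autopct: format of the percentage text inside the wedges; "%5.2f%%" means minimum width 5 with 2 decimal places
# shadow: whether the pie has a shadow
# startangle: angle at which the first wedge starts, measured counterclockwise from the x-axis; 90 degrees usually looks best
# pctdistance: distance of the percentage text from the center, in radii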

plt.subplot(2,2,1)
# sort_index() fixes the order to 0, 1 so that it matches the labels below
survived_counts = train['Survived'].value_counts().sort_index()
survived_labels = ['Died','Survived']
plt.pie(x=survived_counts,labels=survived_labels,autopct="%5.2f%%",  pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Survived')
# Equal axis scaling so the pie is drawn as a circle
plt.axis('equal')
#plt.show()

plt.subplot(2,2,2)
gender_counts = train['Sex'].value_counts()
plt.pie(x=gender_counts,labels=gender_counts.keys(),autopct="%5.2f%%",  pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Gender')
plt.axis('equal')

plt.subplot(2,2,3)
pclass_counts = train['Pclass'].value_counts()
plt.pie(x=pclass_counts,labels=pclass_counts.keys(),autopct="%5.2f%%",  pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Pclass')
plt.axis('equal')

plt.subplot(2,2,4)
embarked_counts = train['Embarked'].value_counts()
plt.pie(x=embarked_counts,labels=embarked_counts.keys(),autopct="%5.2f%%",  pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Embarked')
plt.axis('equal')

plt.show()

plt.subplot(2,2,1)
sibsp_counts = train['SibSp'].value_counts().to_dict()
plt.bar(list(sibsp_counts.keys()),list(sibsp_counts.values()))
plt.title('SibSp')

plt.subplot(2,2,2)
parch_counts = train['Parch'].value_counts().to_dict()
plt.bar(list(parch_counts.keys()),list(parch_counts.values()))
plt.title('Parch')

plt.style.use( 'ggplot')
plt.subplot(2,2,3)
# Drop missing ages before binning; bins are 5-year intervals from 0 to 95
plt.hist(train.Age.dropna(),bins=np.arange(0,100,5),color = 'steelblue', edgecolor = 'k')
plt.title('Age')

plt.subplot(2,2,4)
plt.hist(train.Fare,bins=20,color = 'steelblue', edgecolor = 'k')
plt.title('Fare')

plt.show()
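
The percentages that the pie charts display can also be read as plain numbers with `value_counts(normalize=True)`. The sketch below is an illustration under stated assumptions: the helper `composition` is hypothetical, and the Sex column is rebuilt from the training-set counts reported in the sex analysis of this post (577 male, 314 female) rather than loaded from `train.csv`.

```python
import pandas as pd


def composition(series):
    """Percentage share of each category, rounded to two decimals."""
    return (series.value_counts(normalize=True) * 100).round(2)


# Rebuild the Sex column from the counts in this post:
# 577 male and 314 female passengers among the 891 training rows.
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="Sex")
sex_share = composition(sex)
print(sex_share)
```

With the real data loaded, `composition(train['Pclass'])` or `composition(train['Embarked'])` would give the other pie-chart shares.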

2. Relationship between individual factors and survival

(1) Sex:

Compute the survival rate by sex:

print (train.groupby('Sex')['Survived'].value_counts())
print (train.groupby('Sex')['Survived'].mean())

Output:

Sex     Survived
female  1           233
        0            81
male    0           468
        1           109

Sex
female    0.742038
male      0.188908

The survival rate is 74.20% for women and only 18.89% for men. Women survived at a far higher rate than men, so sex is an important factor.
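
As a cross-check, the two rates follow directly from the counts printed above: 233 / (233 + 81) for women and 109 / (109 + 468) for men. A minimal sketch using `pd.crosstab`, where the frame `demo` is rebuilt from those printed counts instead of being read from `train.csv`:

```python
import pandas as pd

# Rebuild Sex/Survived pairs from the counts printed above.
rows = (
    [("female", 1)] * 233 + [("female", 0)] * 81
    + [("male", 0)] * 468 + [("male", 1)] * 109
)
demo = pd.DataFrame(rows, columns=["Sex", "Survived"])

# normalize="index" turns each row of counts into row-wise proportions.
rates = pd.crosstab(demo["Sex"], demo["Survived"], normalize="index")
print(rates)
```

The column labelled 1 holds the survival rate for each sex, and each row sums to 1.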

(2) Age:

Compute the survival rate by age:

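fig, axis1 = plt.subplots(1,1,figsize=(18,4))
# Copy so that adding a column does not trigger SettingWithCopyWarning
train_age = train.dropna(subset=['Age']).copy()
# Truncate ages to whole years
train_age["Age_int"] = train_age["Age"].astype(int)
train_age.groupby('Age_int')['Survived'].mean().plot(kind='bar', ax=axis1)
plt.show()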

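Output: [figure: bar chart of the survival rate at each age in whole years]

Young children have a comparatively high survival rate, while among older passengers several ages have a survival rate of 0. Next, look at the actual number of survivors and non-survivors at each age.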

print (train_age.groupby('Age_int')['Survived'].value_counts())

Output:

Age_int  Survived
0        1            7
1        1            5
         0            2
2        0            7
         1            3
3        1            5
         0            1
4        1            7
         0            3
5        1            4
6        1            2
         0            1
7        0            2
         1            1
8        0            2
         1            2
9        0            6
         1            2
10       0            2
11       0            3
         1            1
12       1            1
13       1            2
14       0            4
         1            3
15       1            4
         0            1
16       0           11
         1            6
17       0            7
         1            6
18       0           17
         1            9
19       0           16
         1            9
20       0           13
         1            3
21       0           19
         1            5
22       0           16
         1           11
23       0           11
         1            5
24       0           16
         1           15
25       0           17
         1            6
26       0           12
         1            6
27       1           11
         0            7
28       0           20
         1            7
29       0           12
         1            8
30       0           17
         1           10
31       0            9
         1            8
32       0           10
         1           10
33       0            9
         1            6
34       0           10
         1            6
35       1           11
         0            7
36       0           12
         1           11
37       0            5
         1            1
38       0            6
         1            5
39       0            9
         1            5
40       0            9
         1            6
41       0            4
         1            2
42       0            7
         1            6
43       0            4
         1            1
44       0            6
         1            3
45       0            9
         1            5
46       0            3
47       0            8
         1            1
48       1            6
         0            3
49       1          
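
Single-year counts like these are small and noisy, so one possible follow-up is to group ages into wider bands before comparing survival rates. The following is a minimal sketch with `pd.cut`; the band edges, the labels, and the rows in `demo` are assumptions chosen for illustration and are not the author's choices or real Titanic records.

```python
import pandas as pd

# Made-up passengers for illustration only.
demo = pd.DataFrame({
    "Age": [0.8, 4.0, 11.0, 15.0, 17.0, 25.0, 40.0, 59.0, 63.0, 71.0],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

# Hypothetical band edges; right=False makes each band [low, high).
edges = [0, 12, 18, 60, 100]
labels = ["child", "teen", "adult", "senior"]
demo["Age_band"] = pd.cut(demo["Age"], bins=edges, labels=labels, right=False)

# Survival rate and group size per band.
by_band = demo.groupby("Age_band", observed=False)["Survived"].agg(["mean", "count"])
print(by_band)
```

Reporting the group size next to the rate makes it easier to see which bands rest on only a few passengers.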