Kaggle - Titanic Survival Prediction

chenhui229

Article link: https://blog.csdn.net/chenhui229/article/details/81451754

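Summary (auto-generated by CSDN): This article describes the author's participation in the Kaggle Titanic survival prediction competition, using Python 3 for data preprocessing, analysis, and modeling. The analysis finds that sex, age, ticket class, and port of embarkation have a marked effect on the survival rate. During preprocessing, missing values were filled, continuous variables were discretized, and new features were created. Several models were tried, including logistic regression, decision tree, random forest, AdaBoost, and gradient boosted trees; the best accuracy achieved on Kaggle was 0.81339.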

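This is my first Kaggle competition, and I am using Titanic as an introduction. The goal of the competition is to predict whether each passenger survived, based on the passenger information. All of the work is done in Python 3.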

I. Data Overview

According to the Kaggle platform, the training set has 891 records and the test set has 418 records. The following variables are provided:

Variable	Definition	Key
survival	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd. A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower
sex	Sex
Age	Age in years	Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp	# of siblings / spouses aboard the Titanic	Sibling = brother, sister, stepbrother, stepsister. Spouse = husband, wife (mistresses and fiancés were ignored)
parch	# of parents / children aboard the Titanic	Parent = mother, father. Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.
ticket	Ticket number
fare	Passenger fare
cabin	Cabin number
embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

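First, look at the basic information of the training and test sets to get an overall picture of the size of the data, the data type of each feature, and whether any values are missing: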

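import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt  # needed by the plotting code further down
from sklearn.feature_selection import chi2
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; cross_val_score now lives in model_selection
from sklearn.model_selection import cross_val_score

# Read the data
train = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/train.csv')
test = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/test.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same combined frame
train_test_combined = pd.concat([train, test], ignore_index=True)

# Show basic information; info() prints its report itself and returns None, so print() is not needed
train.info()
test.info()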

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

So Age, Cabin, and Embarked have missing values in the training set, and Age, Cabin, and Fare have missing values in the test set.
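
The missing-value counts read off the `info()` report can also be computed directly with `isnull().sum()`. The following is a minimal sketch on a small made-up frame called `demo` (a hypothetical stand-in, not the competition data); with the real data, `train` or `test` would take its place.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data; the values are illustrative only.
demo = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# Count missing entries per column, then keep only columns that have any.
missing_counts = demo.isnull().sum()
missing_columns = missing_counts[missing_counts > 0]
print(missing_columns)
```

Filtering on `missing_counts > 0` leaves only the columns that need attention, which is easier to scan than the full `info()` report when there are many features.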

Next, let's look at what the data actually looks like:

# head() returns the first 5 rows by default
print (train.head())

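I use the Sublime editor. Because there are so many columns, the output wraps across several lines and is hard to read, so I looked at the data directly on Kaggle instead.

[Figure: screenshot of the training data preview on Kaggle]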
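
The wrapped output can also be avoided from within pandas by adjusting its display options. This is a minimal sketch, assuming a made-up frame `demo` that only reuses the 12 column names of train.csv; the width value of 1000 is an arbitrary choice.

```python
import pandas as pd

# Made-up stand-in frame with the same column names as train.csv (values are placeholders).
columns = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
           'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
demo = pd.DataFrame([[i] * len(columns) for i in range(3)], columns=columns)

# Temporarily lift the column limit and widen the display so rows are not wrapped.
with pd.option_context('display.max_columns', None, 'display.width', 1000):
    wide_text = repr(demo.head())

print(wide_text)
```

`pd.option_context` restores the previous settings when the block ends; `pd.set_option` with the same option names would apply them for the whole session instead.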

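II. Preliminary Data Analysis

1. Basic passenger attributes

For the categorical variables Survived, Sex, Pclass, and Embarked, pie charts show their composition. For the discrete numeric variables SibSp and Parch, bar charts show their distributions. For the continuous numeric variables Age and Fare, histograms show their distributions.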

# Pie charts for the categorical variables
# labeldistance: distance of the labels from the center, as a multiple of the radius (1.1 = 1.1 x radius)
# autopct: format of the percentage text inside each wedge; "%5.2f%%" means minimum width 5, two decimals, then a literal %
# shadow: whether to draw a shadow under the pie
# startangle: angle at which the first wedge starts, measured counterclockwise from the x-axis; 90 usually looks best
# pctdistance: distance of the percentage text from the center, as a fraction of the radius

plt.subplot(2,2,1)
# sort_index() orders the counts as 0, 1 so that they line up with the labels below
survived_counts = train['Survived'].value_counts().sort_index()
survived_labels = ['Died','Survived']
plt.pie(x=survived_counts,labels=survived_labels,autopct="%5.2f%%",  pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Survived')
# Use an equal aspect ratio so the pie is drawn as a circle
plt.axis('equal')

plt.subplot(2,2,2)
gender_counts = train['Sex'].value_counts()
plt.pie(x=gender_counts,labels=gender_counts.keys(),autopct="%5.2f%%",  pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Gender')
plt.axis('equal')

plt.subplot(2,2,3)
pclass_counts = train['Pclass'].value_counts()
plt.pie(x=pclass_counts,labels=pclass_counts.keys(),autopct="%5.2f%%",  pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Pclass')
plt.axis('equal')

plt.subplot(2,2,4)
embarked_counts = train['Embarked'].value_counts()
plt.pie(x=embarked_counts,labels=embarked_counts.keys(),autopct="%5.2f%%",  pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Embarked')
plt.axis('equal')

plt.show()

# Set the style before creating the figure so that all four subplots use it
plt.style.use('ggplot')

# Bar charts for the discrete numeric variables
plt.subplot(2,2,1)
sibsp_counts = train['SibSp'].value_counts().to_dict()
plt.bar(list(sibsp_counts.keys()),list(sibsp_counts.values()))
plt.title('SibSp')

plt.subplot(2,2,2)
parch_counts = train['Parch'].value_counts().to_dict()
plt.bar(list(parch_counts.keys()),list(parch_counts.values()))
plt.title('Parch')

# Histograms for the continuous numeric variables
plt.subplot(2,2,3)
# Drop missing ages before plotting; with explicit bin edges the range argument is redundant
plt.hist(train.Age.dropna(),bins=np.arange(0,100,5),color = 'steelblue', edgecolor = 'k')
plt.title('Age')

plt.subplot(2,2,4)
plt.hist(train.Fare,bins=20,color = 'steelblue', edgecolor = 'k')
plt.title('Fare')

plt.show()

2. Relationship between individual factors and survival

(1) Sex:

Compute the survival rate by sex:

print (train.groupby('Sex')['Survived'].value_counts())
print (train.groupby('Sex')['Survived'].mean())

Output:

Sex     Survived
female  1           233
        0            81
male    0           468
        1           109

Sex
female    0.742038
male      0.188908

So the survival rate is 74.20% for women and only 18.89% for men. Women survived at a far higher rate than men, so sex is an important factor.
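
Because Survived is coded as 0/1, the group mean is simply survivors divided by group size. As a quick check, the sketch below recomputes the two rates from the counts printed above; the dictionary layout and the name `counts` are only illustrative.

```python
# Counts copied from the value_counts() output above.
counts = {
    'female': {'survived': 233, 'died': 81},
    'male': {'survived': 109, 'died': 468},
}

# Survival rate = survivors / (survivors + non-survivors) for each group.
rates = {
    sex: c['survived'] / (c['survived'] + c['died'])
    for sex, c in counts.items()
}

for sex, rate in rates.items():
    print(f"{sex}: {rate:.6f}")
```

The results agree with the `mean()` output: 233 / 314 for women and 109 / 577 for men.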

(2) Age:

Compute the survival rate at each age:

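fig, axis1 = plt.subplots(1,1,figsize=(18,4))
# .copy() avoids a SettingWithCopyWarning when the new column is added below
train_age = train.dropna(subset=['Age']).copy()
# Truncate each age to a whole number of years
train_age["Age_int"] = train_age["Age"].astype(int)
# Draw on the axes created above
train_age.groupby('Age_int')['Survived'].mean().plot(kind='bar', ax=axis1)
plt.show()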

Output:

[Figure: bar chart of survival rate by integer age]

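So young children have a relatively high survival rate, while several ages among the elderly have a survival rate of 0, which means the elderly survived at a lower rate. Next, let's look at the number of survivors and non-survivors at each age.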

print (train_age.groupby('Age_int')['Survived'].value_counts())

Output:

Age_int  Survived
0        1            7
1        1            5
         0            2
2        0            7
         1            3
3        1            5
         0            1
4        1            7
         0            3
5        1            4
6        1            2
         0            1
7        0            2
         1            1
8        0            2
         1            2
9        0            6
         1            2
10       0            2
11       0            3
         1            1
12       1            1
13       1            2
14       0            4
         1            3
15       1            4
         0            1
16       0           11
         1            6
17       0            7
         1            6
18       0           17
         1            9
19       0           16
         1            9
20       0           13
         1            3
21       0           19
         1            5
22       0           16
         1           11
23       0           11
         1            5
24       0           16
         1           15
25       0           17
         1            6
26       0           12
         1            6
27       1           11
         0            7
28       0           20
         1            7
29       0           12
         1            8
30       0           17
         1           10
31       0            9
         1            8
32       0           10
         1           10
33       0            9
         1            6
34       0           10
         1            6
35       1           11
         0            7
36       0           12
         1           11
37       0            5
         1            1
38       0            6
         1            5
39       0            9
         1            5
40       0            9
         1            6
41       0            4
         1            2
42       0            7
         1            6
43       0            4
         1            1
44       0            6
         1            3
45       0            9
         1            5
46       0            3
47       0            8
         1            1
48       1            6
         0            3
49       1