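This is my first Kaggle competition, and I am using Titanic as an introduction. The goal of the competition is to predict whether each passenger survived, based on the passenger information provided. I use Python 3 throughout.
I. Data Overview
According to the Kaggle platform, the training set has 891 records and the test set has 418 records. The variables provided are: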
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd. A proxy for socio-economic status (SES). |
sex | Sex | |
Age | Age in years | Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5. |
sibsp | # of siblings / spouses aboard the Titanic | Sibling = brother, sister, stepbrother, stepsister. Spouse = husband, wife (mistresses and fiancés were ignored). |
parch | # of parents / children aboard the Titanic | Parent = mother, father. Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them. |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
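First, look at the basic information of the training and test sets to get an overall picture of the data size, the data type of each feature, and whether any values are missing:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import chi2
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
# cross_val_score lives in sklearn.model_selection (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import cross_val_score
# Read the data
train = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/train.csv')
test = pd.read_csv('/Users/jingxuan.ljx/Documents/machine learning/kaggle/Titanic/test.csv')
# Stack train and test into one frame (DataFrame.append was removed in pandas 2.0, so use pd.concat)
train_test_combined = pd.concat([train, test], ignore_index=True)
# Show basic information; info() prints directly and returns None, so it is not wrapped in print()
train.info()
test.info()
Output: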
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
From this we can see that Age, Cabin, and Embarked have missing values in the training set, and Age, Cabin, and Fare have missing values in the test set.
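The same conclusion can be read off without scanning the `info()` output by counting missing values per column. The following is only a minimal sketch: the small frame `demo` and its values are invented for illustration and stand in for `train` or `test`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train/test; the values are invented.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing values per column, then keep only the columns that have any.
missing = demo.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Applied to the real data, replacing `demo` with `train` or `test` should list exactly the columns named above.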
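Next, look at what the data actually looks like:
# head() returns the first 5 rows by default
print (train.head())
I use the Sublime editor, and because there are so many columns the output is wrapped across several lines and is hard to read. I therefore viewed the data directly on Kaggle instead and took a screenshot of it there.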
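If you would rather stay in the editor, pandas display options can reduce the wrapping. This is a sketch under assumptions: the width of 200 characters is an arbitrary choice, and the frame `demo` is a made-up stand-in for `train`.

```python
import pandas as pd

# Show every column instead of eliding the middle ones with "...".
pd.set_option("display.max_columns", None)
# Allow wider lines before pandas wraps the frame; 200 is an arbitrary choice.
pd.set_option("display.width", 200)

# Hypothetical wide frame standing in for train.
demo = pd.DataFrame({f"col{i}": [i] for i in range(12)})
print(demo.head())
```

Whether the result fits on one line still depends on the width of the console pane.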
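II. Preliminary Data Analysis
1. Basic passenger attributes
For the categorical variables Survived, Sex, Pclass, and Embarked, pie charts show the proportion of each category. For the discrete numeric variables SibSp and Parch, bar charts show the distribution. For the continuous numeric variables Age and Fare, histograms show the distribution.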
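# Pie charts for the categorical variables
# labeldistance: distance of the labels from the center, as a multiple of the radius (1.1 means 1.1x the radius)
# autopct: format of the text inside the wedges; e.g. %3.1f%% prints a float with a minimum width of 3 and 1 decimal place, followed by a literal %
# shadow: whether the pie has a shadow
# startangle: angle at which the first wedge starts, measured counterclockwise from 0; starting from 90 degrees usually looks better
# pctdistance: distance of the percentage text from the center
plt.subplot(2,2,1)
survived_counts = train['Survived'].value_counts()
# value_counts() sorts by count, so derive the labels from the index values rather than from position
survived_labels = [{0: 'Died', 1: 'Survived'}[k] for k in survived_counts.index]
plt.pie(x=survived_counts,labels=survived_labels,autopct="%5.2f%%", pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Survived')
# Equal aspect ratio so the pie is drawn as a circle
plt.axis('equal')
#plt.show()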
plt.subplot(2,2,2)
gender_counts = train['Sex'].value_counts()
plt.pie(x=gender_counts,labels=gender_counts.keys(),autopct="%5.2f%%", pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Gender')
plt.axis('equal')
plt.subplot(2,2,3)
pclass_counts = train['Pclass'].value_counts()
plt.pie(x=pclass_counts,labels=pclass_counts.keys(),autopct="%5.2f%%", pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Pclass')
plt.axis('equal')
plt.subplot(2,2,4)
embarked_counts = train['Embarked'].value_counts()
plt.pie(x=embarked_counts,labels=embarked_counts.keys(),autopct="%5.2f%%", pctdistance=0.6,
shadow=False, labeldistance=1.1, startangle=90)
plt.title('Embarked')
plt.axis('equal')
plt.show()
# Set the style before creating any subplot so all four panels look the same
plt.style.use('ggplot')
# Bar charts for the discrete numeric variables
plt.subplot(2,2,1)
sibsp_counts = train['SibSp'].value_counts().to_dict()
plt.bar(list(sibsp_counts.keys()),list(sibsp_counts.values()))
plt.title('SibSp')
plt.subplot(2,2,2)
parch_counts = train['Parch'].value_counts().to_dict()
plt.bar(list(parch_counts.keys()),list(parch_counts.values()))
plt.title('Parch')
# Histograms for the continuous numeric variables
plt.subplot(2,2,3)
# Age has missing values, so drop them first; the explicit bin edges already cover 0 to 95 in steps of 5
plt.hist(train.Age.dropna(),bins=np.arange(0,100,5),color = 'steelblue', edgecolor = 'k')
plt.title('Age')
plt.subplot(2,2,4)
plt.hist(train.Fare,bins=20,color = 'steelblue', edgecolor = 'k')
plt.title('Fare')
plt.show()
2. Relationship between individual factors and survival
(1) Sex:
Compute the survival rate for each sex:
print (train.groupby('Sex')['Survived'].value_counts())
print (train.groupby('Sex')['Survived'].mean())
Output:
Sex Survived
female 1 233
0 81
male 0 468
1 109
Sex
female 0.742038
male 0.188908
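From this we can see that the survival rate is 74.20% for women and only 18.89% for men. Women survived at a far higher rate than men, so sex is an important factor.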
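As a check on the arithmetic, the rates can be recomputed from the four counts printed above. This is a small sketch rather than part of the original workflow; the frame `counts` is built by hand from that output, and the column name `n` is an arbitrary choice.

```python
import pandas as pd

# Counts copied from the groupby output above: (Sex, Survived, number of passengers).
counts = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Survived": [1, 0, 0, 1],
    "n": [233, 81, 468, 109],
})

# Reshape into a Sex x Survived table of counts.
table = counts.pivot(index="Sex", columns="Survived", values="n")

# Divide each row by its total; column 1 then holds the survival rate.
rates = table.div(table.sum(axis=1), axis=0)
print(rates.round(4))
```

With the full data loaded, `pd.crosstab(train['Sex'], train['Survived'], normalize='index')` should give the same table in one call.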
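(2) Age:
Compute the survival rate at each age:
fig, axis1 = plt.subplots(1,1,figsize=(18,4))
# Drop rows without an age; copy so that adding a column does not trigger SettingWithCopyWarning
train_age = train.dropna(subset=['Age']).copy()
# Truncate fractional ages to whole years
train_age["Age_int"] = train_age["Age"].astype(int)
# Draw on the axes created above so the figsize takes effect
train_age.groupby('Age_int')['Survived'].mean().plot(kind='bar', ax=axis1)
plt.show()
Output:
From this we can see that young children have a relatively high survival rate, while among older passengers several ages have a survival rate of 0. Next, look at the actual numbers of survivors and non-survivors at each age.
print (train_age.groupby('Age_int')['Survived'].value_counts())
Output: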
Age_int Survived
0 1 7
1 1 5
0 2
2 0 7
1 3
3 1 5
0 1
4 1 7
0 3
5 1 4
6 1 2
0 1
7 0 2
1 1
8 0 2
1 2
9 0 6
1 2
10 0 2
11 0 3
1 1
12 1 1
13 1 2
14 0 4
1 3
15 1 4
0 1
16 0 11
1 6
17 0 7
1 6
18 0 17
1 9
19 0 16
1 9
20 0 13
1 3
21 0 19
1 5
22 0 16
1 11
23 0 11
1 5
24 0 16
1 15
25 0 17
1 6
26 0 12
1 6
27 1 11
0 7
28 0 20
1 7
29 0 12
1 8
30 0 17
1 10
31 0 9
1 8
32 0 10
1 10
33 0 9
1 6
34 0 10
1 6
35 1 11
0 7
36 0 12
1 11
37 0 5
1 1
38 0 6
1 5
39 0 9
1 5
40 0 9
1 6
41 0 4
1 2
42 0 7
1 6
43 0 4
1 1
44 0 6
1 3
45 0 9
1 5
46 0 3
47 0 8
1 1
48 1 6
0 3
49 1