Titanic Prediction Analysis

After a month of studying machine learning, I was itching to try it out in practice. Following a very detailed Kaggle kernel, I worked through the Titanic prediction analysis step by step over about a week, which I count as my entry into the field.

From reading along, to writing the code myself, to generating the submission file, I hit all kinds of problems along the way. How do I annotate a plot? What to do about garbled Chinese labels? For pandas slicing, loc or iloc? Why does training complain that a y label of shape (n_samples, 1) is not the same as (n_samples,)? What do logspace and linspace mean? What do all the model parameters mean when tuning...? And when I finally reached the last step, my heart sank as the submission scored 0 (it turned out I had not converted the float output to int). The analysis itself is fairly simple and leaves plenty of room for improvement, but it was very satisfying.

What struck me most: understanding is the hard part, doing is the easy part. As the saying goes, real knowledge comes from practice. Practice surfaces problems you never anticipated, and solving them one by one is how you truly learn.

Reference: https://www.kaggle.com/eraaz1/a-comprehensive-guide-to-titanic-machine-learning

Contents

1 Import Libraries and Data

2 Examine Variable Types

2.1 Variable Description

2.2 Variable Data Types

3 Univariate Analysis

3.1 Categorical Variables

3.2 Numerical Variables

4 Feature Engineering

5 Missing Value Handling

6 Bivariate Analysis

6.1 Categorical vs. Categorical Variables

6.2 Numerical vs. Categorical Variables

7 Data Transformation

7.1 Binning Age & Fare

7.2 Dropping Unused Features

7.3 Converting Data Types

7.4 Encoding

8 Model Building and Evaluation

8.1 Initial Model Selection and Training Scores

8.2 Cross-Validation

8.3 Hyperparameter Tuning

8.4 Retrain and Validate with Tuned Parameters

9 Model Selection and Prediction

RF: score 0.75119, rank 8361

GBC: score 0.76555, rank 7317

SVC: score 0.77990, rank 4824

KNN: score 0.73684

DT: score 0.7320



1 Import Libraries and Data

In [1]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown
def bold(string):
    display(Markdown(string))
# load the data
train=pd.read_csv("train.csv")
test=pd.read_csv("test.csv")
bold('**preview train data**')  # to render Chinese inside bold(), pass a u'...' string
display(train.head())
bold('**preview test data**')
display(test.head())
train.shape
test.shape

preview train data

   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0            1         0       3  Braund, Mr. Owen Harris                             male    22.0      1      0  A/5 21171          7.2500  NaN    S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C85    C
2            3         1       3  Heikkinen, Miss. Laina                              female  26.0      0      0  STON/O2. 3101282   7.9250  NaN    S
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0      1      0  113803            53.1000  C123   S
4            5         0       3  Allen, Mr. William Henry                            male    35.0      0      0  373450             8.0500  NaN    S

preview test data

   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked
0          892       3  Kelly, Mr. James                              male    34.5      0      0  330911    7.8292  NaN    Q
1          893       3  Wilkes, Mrs. James (Ellen Needs)              female  47.0      1      0  363272    7.0000  NaN    S
2          894       2  Myles, Mr. Thomas Francis                     male    62.0      0      0  240276    9.6875  NaN    Q
3          895       3  Wirz, Mr. Albert                              male    27.0      0      0  315154    8.6625  NaN    S
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875  NaN    S

Out[1]:

(418, 11)

2 Examine Variable Types

2.1 Variable Description

To simplify later analysis and preprocessing, first concatenate the train and test sets.

In [3]:

merge=pd.concat([train,test],ignore_index=True,sort=False)
bold('**preview merge data**')
display(merge.head())
display(merge.shape)
display(merge.columns)

preview merge data

   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0            1       0.0       3  Braund, Mr. Owen Harris                             male    22.0      1      0  A/5 21171          7.2500  NaN    S
1            2       1.0       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C85    C
2            3       1.0       3  Heikkinen, Miss. Laina                              female  26.0      0      0  STON/O2. 3101282   7.9250  NaN    S
3            4       1.0       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0      1      0  113803            53.1000  C123   S
4            5       0.0       3  Allen, Mr. William Henry                            male    35.0      0      0  373450             8.0500  NaN    S
(1309, 12)
Index([u'PassengerId', u'Survived', u'Pclass', u'Name', u'Sex', u'Age',
       u'SibSp', u'Parch', u'Ticket', u'Fare', u'Cabin', u'Embarked'],
      dtype='object')

Categorical variables:

Pclass, Name, Sex, SibSp, Parch, Ticket, Cabin, Embarked, Survived

Numerical variables:

Age, Fare, PassengerId

2.2 Variable Data Types

In [4]:

display(merge.dtypes)
PassengerId      int64
Survived       float64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

int: PassengerId, Pclass, SibSp, Parch

float: Age, Fare, Survived

object (strings, some mixed with digits): Name, Sex, Ticket, Cabin, Embarked

3 Univariate Analysis

Look at the distribution of each variable:

categorical variables are usually shown with bar plots;

numerical variables with histograms.

3.1 Categorical Variables

In [5]:

# annotation helper for absolute-frequency bar plots
def abs_bar_labels():
    font_size = 15
    plt.ylabel('Absolute Frequency', fontsize = font_size)
    plt.xticks(rotation =0, fontsize = font_size)
    plt.yticks([])
    
    # set individual bar labels in absolute numbers
    for x in ax.patches:
        ax.annotate(x.get_height(), 
        (x.get_x() + x.get_width()/2., x.get_height()),  ha = 'center', va = 'center',xytext = (0,7), 
        textcoords = 'offset points', fontsize = font_size, color = 'black')

# annotation helper for relative-frequency bar plots        
def pct_bar_labels():
    font_size = 15
    plt.ylabel('Relative Frequency (%)', fontsize = font_size)
    plt.xticks(rotation = 0, fontsize = font_size)
    plt.yticks([]) 
    
    # set individual bar labels on a percentage scale
    for x in ax1.patches:
        ax1.annotate(str(x.get_height()) + '%', 
        (x.get_x() + x.get_width()/2., x.get_height()), ha = 'center', va = 'center', xytext = (0, 7), 
        textcoords = 'offset points', fontsize = font_size, color = 'black')
In [6]:

# helper: show a feature's absolute and relative frequencies and plot both as bar charts
def abs_rel_f(feature):
    abs_frequency=feature.value_counts()
    relative_frequency=feature.value_counts(normalize=True).round(3)*100
    abs_rel_f=pd.DataFrame({'Absolute Frequency':abs_frequency,'Relative Frequency(%)':relative_frequency})
    display(abs_rel_f) 
    # draw the bar charts
    global ax,ax1 # declare globals so the annotation helpers above can see ax and ax1
    ax=abs_frequency.plot.bar(figsize=(18,7))
    abs_bar_labels()  # displays bar labels on the absolute scale
    plt.show()
    ax1=relative_frequency.plot.bar(figsize=(18,7))
    pct_bar_labels()
    plt.show()
In [10]:

bold('**Survived**')
abs_rel_f(merge.Survived)

Survived

     Absolute Frequency  Relative Frequency(%)
0.0                 549                   61.6
1.0                 342                   38.4

Analysis: the survival rate is only about 38%; 62% of passengers did not survive.

In [24]:

bold(u'**Pclass**')
abs_rel_f(merge.Pclass)

Pclass

   Absolute Frequency  Relative Frequency(%)
3                 709                   54.2
1                 323                   24.7
2                 277                   21.2

Analysis: the classes are unevenly distributed: 3rd class holds the most passengers at about 54%, with 1st (about 25%) and 2nd (about 21%) together making up the remaining 46%.

In [26]:

bold('**Sex**')
abs_rel_f(merge.Sex)

Sex

 Absolute FrequencyRelative Frequency(%)
male84364.4
female46635.6

分析:性别分布也不均匀,男性较多约64%

In [27]:

bold('**SibSp**')
abs_rel_f(merge.SibSp)

SibSp

   Absolute Frequency  Relative Frequency(%)
0                 891                   68.1
1                 319                   24.4
2                  42                    3.2
4                  22                    1.7
3                  20                    1.5
8                   9                    0.7
5                   6                    0.5

Analysis: passengers with no siblings or spouse aboard are by far the most common at about 68%, followed by those with one at about 24%; every other count is under 5%.

In [28]:

bold('**Parch**')
abs_rel_f(merge.Parch)

Parch

   Absolute Frequency  Relative Frequency(%)
0                1002                   76.5
1                 170                   13.0
2                 113                    8.6
3                   8                    0.6
5                   6                    0.5
4                   6                    0.5
9                   2                    0.2
6                   2                    0.2

Analysis: passengers with no parents or children aboard are the most common at about 77%; about 13% have one, about 9% have two, and every other count is under 1%.

In [47]:

bold('**Embarked**')
abs_rel_f(merge.Embarked)

Embarked

   Absolute Frequency  Relative Frequency(%)
S                 914                   69.9
C                 270                   20.7
Q                 123                    9.4

Analysis: the embarkation ports are also unevenly used: most passengers boarded at S, about 70%, followed by C at about 21%, with Q the smallest.

In [63]:

bold('**Name&Cabin&Ticket**')
bold('Name:'+str(merge.Name.value_counts().count()))
display(merge.Name.head(5))
bold('Cabin:'+str(merge.Cabin.value_counts().count()))
display(merge.Cabin.value_counts(dropna=False).head(5))  
# value_counts skips NaN when counting; pass dropna=False to see the missing values too
bold('Ticket:'+str(merge.Ticket.value_counts().count()))
display(merge.Ticket.value_counts(dropna=False).head(5))

Name&Cabin&Ticket

Name:1307

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

Cabin:186

NaN                1014
C23 C25 C27           6
B57 B59 B63 B66       5
G6                    5
C78                   4
Name: Cabin, dtype: int64

Ticket:929

CA. 2343        11
CA 2144          8
1601             8
S.O.C. 14879     7
PC 17608         7
Name: Ticket, dtype: int64

Analysis: there are 1307 distinct names (all strings), 186 distinct cabin values (plus 1014 missing; letters mixed with digits), and 929 distinct tickets (letters mixed with digits). Far too many distinct values to analyze directly.

3.2 Numerical Variables

Histograms, density plots, and summary statistics (describe) for Age & Fare

In [155]:

# histogram and density-plot helpers
plt.rcParams['axes.unicode_minus']=False  # render minus signs correctly
def hist(Feature):
    global ax
    fon_size=15
    fig_size=(18,7)
    ax=Feature.plot.hist(bins=20,figsize=fig_size,color='g')
    plt.xlabel('%s'% Feature.name+ ' Histogram',fontsize=30)
    plt.xticks(fontsize=fon_size)
    plt.yticks(fontsize=fon_size)
    abs_bar_labels()  # the bar-annotation helper from above works here too, via ax
    
def density(Feature):
    fon_size=15
    fig_size=(18,7)
    Feature.plot.hist(bins=20,density=True,figsize=fig_size)
    Feature.plot.kde(style='r--')
    plt.xlabel('%s'% Feature.name+ u' Histogram',fontsize=30)
    plt.xticks(fontsize=fon_size)
    plt.yticks(fontsize=fon_size)
In [166]:

bold('**Age**')
hist(merge.Age)

Age

In [167]:

density(merge.Age)

In [168]:

merge.Age.describe()

Out[168]:

count    1046.000000
mean       29.881138
std        14.413493
min         0.170000
25%        21.000000
50%        28.000000
75%        39.000000
max        80.000000
Name: Age, dtype: float64

Analysis: ages are unevenly distributed, with 20-25 year-olds the largest group; the oldest passenger is 80 and the youngest is a roughly two-month-old infant (0.17 years).

In [169]:

bold('**Fare**')
hist(merge.Fare)

Fare

In [170]:

density(merge.Fare)

In [171]:

merge.Fare.describe()

Out[171]:

count    1308.000000
mean       33.295479
std        51.758668
min         0.000000
25%         7.895800
50%        14.454200
75%        31.275000
max       512.329200
Name: Fare, dtype: float64

Analysis: fares are also unevenly distributed, heavily skewed toward the cheap end.

4 Feature Engineering

Combine, create, and regroup features (targeting the features from Section 3 whose distinct values were too numerous to analyze).

In [178]:

# inspect Cabin: distinct values and missing count
display(merge.Cabin.value_counts().count())
display(merge.Cabin.isnull().sum())
186
1014
In [179]:

merge.Cabin.head(5)

Out[179]:

0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object
In [182]:

merge.Cabin.fillna(value='X',inplace=True)
merge.Cabin.head(5)

Out[182]:

0       X
1     C85
2       X
3    C123
4       X
Name: Cabin, dtype: object
In [192]:

merge.Cabin=merge.Cabin.apply(lambda x: x[0])
display(merge.Cabin.value_counts().count())
display(merge.Cabin.value_counts())
9
X    1014
C      94
B      65
D      46
E      41
A      22
F      21
G       5
T       1
Name: Cabin, dtype: int64

For Cabin, extract the first letter to replace each original value and mark missing values with 'X'; this reduces the original 186 distinct values to 9.

In [193]:

abs_rel_f(merge.Cabin)
   Absolute Frequency  Relative Frequency(%)
X                1014                   77.5
C                  94                    7.2
B                  65                    5.0
D                  46                    3.5
E                  41                    3.1
A                  22                    1.7
F                  21                    1.6
G                   5                    0.4
T                   1                    0.1

In [216]:

bold('**Ticket**')
merge.Ticket=merge.Ticket.apply(lambda x: x[0])
display(merge.head())
merge.Ticket.value_counts()

Ticket

   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket  Fare     Cabin  Embarked  Family_size
0            1       0.0       3  Braund, Mr. Owen Harris                             male    22.0      1      0  A        7.2500  X      S                   2
1            2       1.0       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  P       71.2833  C      C                   2
2            3       1.0       3  Heikkinen, Miss. Laina                              female  26.0      0      0  S        7.9250  X      S                   1
3            4       1.0       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0      1      0  1       53.1000  C      S                   2
4            5       0.0       3  Allen, Mr. William Henry                            male    35.0      0      0  3        8.0500  X      S                   1

Out[216]:

3    429
2    278
1    210
S     98
P     98
C     77
A     42
W     19
7     13
F     13
4     11
6      9
L      5
5      3
9      2
8      2
Name: Ticket, dtype: int64

Analysis: the first character of Ticket is not always a letter (some are digits), and even after regrouping there are still many distinct values, so this treatment is not good enough; the feature is dropped in the end.

In [196]:

bold(u'**Combine SibSp & Parch**')

Combine SibSp & Parch

In [206]:

merge['Family_size']=merge.SibSp+merge.Parch+1
In [208]:

merge.head()

Out[208]:

   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  Family_size
0            1       0.0       3  Braund, Mr. Owen Harris                             male    22.0      1      0  A/5 21171          7.2500  X      S                   2
1            2       1.0       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C      C                   2
2            3       1.0       3  Heikkinen, Miss. Laina                              female  26.0      0      0  STON/O2. 3101282   7.9250  X      S                   1
3            4       1.0       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0      1      0  113803            53.1000  C      S                   2
4            5       0.0       3  Allen, Mr. William Henry                            male    35.0      0      0  373450             8.0500  X      S                   1

5 Missing Value Handling

Methods for filling missing values:

Categorical variables: fill with the mode (when not much is missing)

Numerical variables:

Few missing: fill with the mean (normal distribution) or the median (non-normal distribution)

Many missing: build a model to predict the missing values, or group the data and fill with each group's median (both sketched below)
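A toy sketch of the two "many missing" strategies, on made-up data rather than the Titanic frame (the model-based route is the one this notebook ends up not taking):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({'Pclass': [1, 1, 2, 3, 3, 3],
                   'Fare':   [80.0, 60.0, 20.0, 8.0, 7.5, 7.0],
                   'Age':    [40.0, None, 30.0, 22.0, None, 24.0]})
# strategy 1: group by a correlated feature, fill with each group's median
df['Age_grouped'] = df.groupby('Pclass')['Age'].transform(lambda s: s.fillna(s.median()))
# strategy 2: fit a regressor on rows where Age is known, then predict the rest
known = df[df.Age.notnull()]
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(known[['Pclass', 'Fare']], known['Age'])
df.loc[df.Age.isnull(), 'Age_model'] = model.predict(df.loc[df.Age.isnull(), ['Pclass', 'Fare']])
print(df)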

In [219]:

# check for missing values:
merge.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
PassengerId    1309 non-null int64
Survived       891 non-null float64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          1309 non-null object
Embarked       1307 non-null object
Family_size    1309 non-null int64
dtypes: float64(3), int64(5), object(5)
memory usage: 133.0+ KB

Analysis: Age has many missing values; Fare is missing one value and Embarked two, so those few can be filled simply (mode for Embarked, median for Fare).

In [252]:

bold(u'**Embarked: categorical with only two missing values, fill with the mode**')
merge.loc[merge.Embarked.isnull(),'Embarked']=merge.Embarked.mode().iloc[0]  
# loc slices by row/column labels; iloc slices only by integer position (small example after this cell)
#merge.Embarked.fillna(value='S',inplace=True)
merge.Embarked.count()

Embarked: categorical with only two missing values, fill with the mode

Out[252]:

1309
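Since loc vs. iloc was one of the questions that tripped me up, a tiny standalone example (toy frame, not the Titanic data):

import pandas as pd

df = pd.DataFrame({'val': [10, 20, 30]}, index=['a', 'b', 'c'])
print(df.loc['a', 'val'])   # 10: loc selects by label
print(df.iloc[0, 0])        # 10: iloc selects by integer position
print(df.loc['a':'b'])      # label slices include both endpoints
print(df.iloc[0:1])         # integer slices exclude the stop position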

In [253]:

bold(u'**Fare: numerical and non-normally distributed, fill with the median**')
merge.Fare.fillna(merge.Fare.median(),inplace=True)
merge.Fare.count()

Fare: numerical and non-normally distributed, fill with the median

Out[253]:

1309
In [256]:

merge.Age.isnull().sum()

Out[256]:

263

Age: numerical with many missing values; simply filling with the median or mean could introduce substantial bias.

In [274]:

s=merge[['Pclass','Age']].groupby(['Pclass'])
s.median()

Out[274]:

         Age
Pclass      
1       39.0
2       29.0
3       24.0
In [275]:

merge.Age=s.transform(lambda x: x.fillna(x.median()))

Group Age by Pclass and fill each group's missing values with that group's median. What is the difference between transform and apply? See the sketch below.
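To answer my own question, a toy example (not the Titanic data): transform returns a result aligned to the original index, so it can be assigned straight back to the column, while apply returns one result per group.

import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1.0, None, 3.0]})
# transform: same index as df, can be written back in place
print(df.groupby('g')['x'].transform(lambda s: s.fillna(s.median())))  # [1.0, 1.0, 3.0]
# apply: one result per group, indexed by the group key
print(df.groupby('g')['x'].apply(lambda s: s.median()))  # a: 1.0, b: 3.0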

6 Bivariate Analysis

Three families of correlation analysis (a combined sketch follows this list):

1 numerical vs. numerical: Pearson correlation, Spearman correlation

2 numerical vs. categorical: point-biserial correlation coefficient, ANOVA

3 categorical vs. categorical: chi-square test

6.1 Categorical vs. Categorical Variables

In [324]:

ax=sns.countplot('Sex',hue='Survived',data=merge)
abs_bar_labels()
plt.xlabel('Sex & Survived',fontsize=15)

Out[324]:

Text(0.5,0,'Sex & Survived')

Analysis: women indeed survived at a much higher rate than men.

In [325]:

ax=sns.countplot('Pclass',hue='Survived',data=merge)
abs_bar_labels()
plt.xlabel('Pclass & Survived',fontsize=15)

Out[325]:

Text(0.5,0,'Pclass & Survived')

Analysis: the higher the class, the higher the survival rate.

In [327]:

ax=sns.countplot('Embarked',hue='Survived',data=merge)
abs_bar_labels()
plt.xlabel('Embarked & Survived',fontsize=15)

Out[327]:

Text(0.5,0,'Embarked & Survived')

Analysis: passengers who embarked at C had the highest survival rate, followed by Q, with S lowest.

In [328]:

ax=sns.countplot('Family_size',hue='Survived',data=merge)
abs_bar_labels()
plt.xlabel('Family_size & Survived',fontsize=15)

Out[328]:

Text(0.5,0,'Family_size & Survived')

Analysis: passengers traveling alone had the lowest survival rate.

In [329]:

ax=sns.countplot('Cabin',hue='Survived',data=merge)
abs_bar_labels()
plt.xlabel('Cabin & Survived',fontsize=15)

Out[329]:

Text(0.5,0,'Cabin & Survived')

Analysis: group X, i.e., passengers whose cabin is missing, has the lowest survival rate; there is no other clear pattern.

6.2 Numerical vs. Categorical Variables

Methods: boxplot, histogram, ANOVA (a sketch below)

Age & Fare
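This step never actually gets run in the notebook; a sketch of what it could look like at this point, while merge still contains the raw Age column:

import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

train_part = merge[merge.Survived.notnull()]
sns.boxplot(x='Survived', y='Age', data=train_part)
plt.show()
# one-way ANOVA: does mean Age differ between the two Survived groups?
groups = [g.dropna() for _, g in train_part.groupby('Survived')['Age']]
print(stats.f_oneway(*groups))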

7 Data Transformation

1 discretize the numerical variables; 2 drop unused features; 3 encode the discrete variables

7.1 Binning Age & Fare

In [334]:

age_label=['infant','child','teenager','young_adult','adult','old']
split_points=[0,5,12,18,35,60,81]
merge['Age_split']=pd.cut(merge.Age,split_points,labels=age_label)

In [335]:

merge.head(10)

Out[335]:

   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket  Fare     Cabin  Embarked  Family_size  Age_split
0            1       0.0       3  Braund, Mr. Owen Harris                             male    22.0      1      0  A        7.2500  X      S                   2  young_adult
1            2       1.0       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  P       71.2833  C      C                   2  adult
2            3       1.0       3  Heikkinen, Miss. Laina                              female  26.0      0      0  S        7.9250  X      S                   1  young_adult
3            4       1.0       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0      1      0  1       53.1000  C      S                   2  young_adult
4            5       0.0       3  Allen, Mr. William Henry                            male    35.0      0      0  3        8.0500  X      S                   1  young_adult
5            6       0.0       3  Moran, Mr. James                                    male    24.0      0      0  3        8.4583  X      Q                   1  young_adult
6            7       0.0       1  McCarthy, Mr. Timothy J                             male    54.0      0      0  1       51.8625  E      S                   1  adult
7            8       0.0       3  Palsson, Master. Gosta Leonard                      male     2.0      3      1  3       21.0750  X      S                   5  infant
8            9       1.0       3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0      2  3       11.1333  X      S                   3  young_adult
9           10       1.0       2  Nasser, Mrs. Nicholas (Adele Achem)                 female  14.0      1      0  2       30.0708  X      C                   2  teenager
In [340]:

fare_label=['low','medium','high','very_high']
f_split_points=[-1,30,100,300,600]
merge['Fare_split']=pd.cut(merge.Fare,f_split_points,labels=fare_label)
merge[['Fare','Fare_split']].head(10)

Out[340]:

      Fare  Fare_split
0   7.2500         low
1  71.2833      medium
2   7.9250         low
3  53.1000      medium
4   8.0500         low
5   8.4583         low
6  51.8625      medium
7  21.0750         low
8  11.1333         low
9  30.0708      medium

7.2 Dropping Unused Features

To keep things simple, Name and Ticket are not considered, so the following features are dropped: Name, Ticket, Fare, Age, SibSp, Parch

In [341]:

merge.head()

Out[341]:

   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket  Fare     Cabin  Embarked  Family_size  Age_split    Fare_split
0            1       0.0       3  Braund, Mr. Owen Harris                             male    22.0      1      0  A        7.2500  X      S                   2  young_adult  low
1            2       1.0       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  P       71.2833  C      C                   2  adult        medium
2            3       1.0       3  Heikkinen, Miss. Laina                              female  26.0      0      0  S        7.9250  X      S                   1  young_adult  low
3            4       1.0       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0      1      0  1       53.1000  C      S                   2  young_adult  medium
4            5       0.0       3  Allen, Mr. William Henry                            male    35.0      0      0  3        8.0500  X      S                   1  young_adult  low
In [345]:

merge.drop(['Name','Ticket','Fare','Age','SibSp','Parch'],inplace=True,axis=1)
merge.head()

Out[345]:

   PassengerId  Survived  Pclass  Sex     Cabin  Embarked  Family_size  Age_split    Fare_split
0            1       0.0       3  male    X      S                   2  young_adult  low
1            2       1.0       1  female  C      C                   2  adult        medium
2            3       1.0       3  female  X      S                   1  young_adult  low
3            4       1.0       1  female  C      S                   2  young_adult  medium
4            5       0.0       3  male    X      S                   1  young_adult  low

7.3 Converting Data Types

In [348]:

merge.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
PassengerId    1309 non-null int64
Survived       891 non-null float64
Pclass         1309 non-null int64
Sex            1309 non-null object
Cabin          1309 non-null object
Embarked       1309 non-null object
Family_size    1309 non-null int64
Age_split      1309 non-null category
Fare_split     1292 non-null category
dtypes: category(2), float64(1), int64(3), object(3)
memory usage: 74.6+ KB

For the encoding step later, all categorical variables need the category dtype, so the object columns are converted to category here.

In [352]:
merge.loc[:,['Pclass','Sex','Cabin','Embarked']]=merge.loc[:,['Pclass','Sex','Cabin','Embarked']].astype('category')
merge.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
PassengerId    1309 non-null int64
Survived       891 non-null float64
Pclass         1309 non-null category
Sex            1309 non-null category
Cabin          1309 non-null category
Embarked       1309 non-null category
Family_size    1309 non-null int64
Age_split      1309 non-null category
Fare_split     1292 non-null category
dtypes: category(6), float64(1), int64(2)
memory usage: 39.5 KB

In [356]:

Family_size is still int; I forgot to discretize it earlier.

merge.Family_size.value_counts()

Out[356]:

1     790
2     235
3     159
4      43
6      25
5      22
7      16
11     11
8       8
dtype: int64
In [358]:

family_label=['single','small','medium','high']
family_split_points=[0,1,4,8,12]
merge['Family_scale']=pd.cut(merge.Family_size,family_split_points,labels=family_label)
merge[['Family_size','Family_scale']].head(10)

Out[358]:

   Family_size  Family_scale
0            2         small
1            2         small
2            1        single
3            2         small
4            1        single
5            1        single
6            1        single
7            5        medium
8            3         small
9            2         small
In [364]:

merge.drop(['Family_size'],inplace=True,axis=1)
merge.loc[:,['Family_scale']]=merge.loc[:,['Family_scale']].astype('category')
merge.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
PassengerId     1309 non-null int64
Survived        891 non-null float64
Pclass          1309 non-null category
Sex             1309 non-null category
Cabin           1309 non-null category
Embarked        1309 non-null category
Age_split       1309 non-null category
Fare_split      1292 non-null category
Family_scale    1309 non-null category
dtypes: category(7), float64(1), int64(1)
memory usage: 30.7 KB
In [372]:

merge.Fare_split.value_counts(dropna=False)

Out[372]:

low          949
medium       259
high          80
NaN           17
very_high      4
Name: Fare_split, dtype: int64

Question: Fare_split is short 17 values. Why? Going back over the code revealed:

When Fare was binned, the first bin edge originally started at 0, and pd.cut intervals exclude the left edge, so fares equal to 0 fell into no bin; all 17 missing values are rows with Fare == 0 (the cell further up already shows the corrected lower edge of -1).
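A quick standalone check of the pd.cut boundary behavior behind the bug:

import pandas as pd

fares = pd.Series([0.0, 7.25, 31.0])
# default intervals are open on the left: (0, 30], (30, 100], so 0 falls out
print(pd.cut(fares, bins=[0, 30, 100]))
# starting the lower edge below zero (or passing include_lowest=True) keeps the 0 fares
print(pd.cut(fares, bins=[-1, 30, 100]))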

In [377]:

# fill the Fare_split gaps: the zero-fare rows get 'low'
merge.Fare_split=merge.Fare_split.fillna('low')
merge.Fare_split.value_counts(dropna=False)

Out[377]:

low          966
medium       259
high          80
very_high      4
Name: Fare_split, dtype: int64

In [378]:

merge.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
PassengerId     1309 non-null int64
Survived        891 non-null float64
Pclass          1309 non-null category
Sex             1309 non-null category
Cabin           1309 non-null category
Embarked        1309 non-null category
Age_split       1309 non-null category
Fare_split      1309 non-null category
Family_scale    1309 non-null category
dtypes: category(7), float64(1), int64(1)
memory usage: 30.7 KB

Analysis: those earlier small mistakes caused several detours here: I had forgotten to bin Family_size, and the initial lower edge of the Fare bins left out fares of 0.

At this point every feature (PassengerId and Survived are not features) has the category dtype, so the dummy encoding can proceed.

7.4 Encoding

The final step of data preparation; after this we can build models and start training and predicting.

In [379]:

# pandas get_dummies encodes only the category (and object) columns by default, so the int and float columns are left untouched

merge=pd.get_dummies(merge)

In [380]:

merge.head()

Out[380]:

   PassengerId  Survived  Pclass_1  Pclass_2  Pclass_3  Sex_female  Sex_male  Cabin_A  Cabin_B  Cabin_C  ...  Age_split_adult  Age_split_old  Fare_split_low  Fare_split_medium  Fare_split_high  Fare_split_very_high  Family_scale_single  Family_scale_small  Family_scale_medium  Family_scale_high
0            1       0.0         0         0         1           0         1        0        0        0  ...                0              0               1                  0                0                     0                    0                   1                    0                  0
1            2       1.0         1         0         0           1         0        0        0        1  ...                1              0               0                  1                0                     0                    0                   1                    0                  0
2            3       1.0         0         0         1           1         0        0        0        0  ...                0              0               1                  0                0                     0                    1                   0                    0                  0
3            4       1.0         1         0         0           1         0        0        0        1  ...                0              0               0                  1                0                     0                    0                   1                    0                  0
4            5       0.0         0         0         1           0         1        0        0        0  ...                0              0               1                  0                0                     0                    1                   0                    0                  0

5 rows × 33 columns

8 Model Building and Evaluation

In [403]:

# first split the merged frame back into train and test parts, then build the feature and label sets
train_data=merge.loc[0:890,:]
test_data=merge.loc[891:,:]
display(train_data.shape)
display(test_data.shape)
(891, 33)
(418, 33)
In [486]:

# extract the label set and feature set from the training data
bold(u'**Training labels and features**')
y_train=train_data.loc[:,['Survived']]
display(y_train.shape)
y_train=y_train.values.ravel()  # flatten the 2-D (891, 1) array to 1-D (891,), or training complains (tiny example after this cell)
x_train=train_data.drop(['PassengerId','Survived'],axis=1)
display(y_train.shape)
display(x_train.shape)

Training labels and features

(891, 1)
(891L,)
(891, 31)
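A tiny example of the shape difference that ravel() fixes:

import numpy as np

y = np.array([[0.], [1.], [1.]])  # what train_data[['Survived']].values gives: a 2-D column
print(y.shape)          # (3, 1)
print(y.ravel().shape)  # (3,), the 1-D shape scikit-learn expects for labels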
In [487]:

bold(u'**Test set**')
x_test=test_data.drop(['Survived','PassengerId'],axis=1)
display(x_test.shape)

Test set

(418, 31)

8.1 Initial Model Selection and Training Scores

Candidate models: LogisticRegression, KNN, GaussianNB, SVC, DecisionTreeClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

In [630]:

seed=33  # fix the random seed so every run uses the same random numbers
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier()
from sklearn.naive_bayes import GaussianNB
gnb=GaussianNB()
from sklearn.svm import SVC
svc=SVC(gamma='auto')
from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier(random_state = seed)
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(random_state = seed)
from sklearn.ensemble import GradientBoostingClassifier
gbc=GradientBoostingClassifier(random_state = seed)
from sklearn.ensemble import AdaBoostClassifier
abc= AdaBoostClassifier(random_state = seed)
In [497]:

# helper: fit a model and return its training-set accuracy
def train_accu(model):
    model.fit(x_train,y_train)
    train_accuracy=model.score(x_train,y_train)
    train_accuracy=round(train_accuracy*100,2)
    return train_accuracy
In [514]:

Train_accu=pd.DataFrame({'train_accuracy(%)':[train_accu(lr),train_accu(knn),train_accu(gnb),train_accu(svc),train_accu(dt),
                         train_accu(rf),train_accu(gbc),train_accu(abc)]})
Train_accu.index=['lr','knn','gnb','svc','dt','rf','gbc','abc']
display(Train_accu)
     train_accuracy(%)
lr               82.04
knn              85.07
gnb              73.18
svc              80.36
dt               88.22
rf               87.88
gbc              86.08
abc              81.37
In [517]:

sortedTrain_accu=Train_accu.sort_values(by='train_accuracy(%)',ascending=False)
bold(u'**Models ranked by training accuracy**')
display(sortedTrain_accu)

Models ranked by training accuracy

     train_accuracy(%)
dt               88.22
rf               87.88
gbc              86.08
knn              85.07
lr               82.04
abc              81.37
svc              80.36
gnb              73.18

Analysis: the tree-based models (decision tree, random forest, gradient boosting) score highest, and Gaussian naive Bayes scores lowest.

But this says little about performance on the test set: training-set scores cannot measure a model's generalization ability, which calls for cross-validation.

8.2 Cross-Validation

In [528]:

def cross_val_score(model):
    # the local import rebinds sklearn's function inside this wrapper
    # (the sklearn.cross_validation module became sklearn.model_selection in newer versions)
    from sklearn.cross_validation import cross_val_score
    score=cross_val_score(model,x_train,y_train,cv=10,scoring='accuracy').mean()
    score=round(score*100,2)
    return score
In [530]:

Cross_accu=pd.DataFrame({'Cross_val accuracy(%)':[cross_val_score(lr),cross_val_score(knn),cross_val_score(gnb),cross_val_score(svc),
                                                  cross_val_score(dt),cross_val_score(rf),cross_val_score(gbc),cross_val_score(abc)]})
Cross_accu.index=['lr','knn','gnb','svc','dt','rf','gbc','abc']
display(Cross_accu)
     Cross_val accuracy(%)
lr                   81.26
knn                  80.37
gnb                  73.08
svc                  80.13
dt                   80.60
rf                   82.17
gbc                  82.50
abc                  81.04
In [531]:

sortedCross_accu=Cross_accu.sort_values(by='Cross_val accuracy(%)',ascending=False)
bold(u'**Models ranked by cross-validation accuracy**')
display(sortedCross_accu)

Models ranked by cross-validation accuracy

     Cross_val accuracy(%)
gbc                  82.50
rf                   82.17
lr                   81.26
abc                  81.04
dt                   80.60
knn                  80.37
svc                  80.13
gnb                  73.08

Analysis: cross-validation scores reflect generalization ability, which makes them the valid way to evaluate models. Gradient boosting, random forest, and logistic regression now lead, a noticeably different ranking from the training scores. Gaussian naive Bayes and SVC still sit at the bottom.

But this ranking is still not the final word: the models all ran with their default parameters, and the soundest comparison is cross-validation with each model's tuned parameters.

8.3 Hyperparameter Tuning

In [623]:

from sklearn.model_selection import GridSearchCV
In [663]:

lr_param={'penalty':['l1','l2'],'C':np.logspace(0,4,10)}
knn_param={'n_neighbors':[3, 4, 5, 6, 7, 8],
              'leaf_size':[1, 2, 3, 5, 6, 7],
              'weights':['uniform', 'distance'],
              'algorithm':['auto', 'ball_tree','kd_tree','brute']}
# gnb: Gaussian naive Bayes has no hyperparameters to tune
svc_param={'C':[1,2,3,4,5,6,7,8,9,10], 
              'kernel': ['linear','rbf'],
              'gamma': [0.5, 0.2, 0.1, 0.001, 0.0001]}
dt_param={'max_features': ['auto', 'sqrt', 'log2'],
             'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 
             'min_samples_leaf':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
             'random_state':[seed]}
rf_param={'criterion':['gini','entropy'],
             'n_estimators':[10, 15, 20, 25, 30],
             'min_samples_leaf':[1, 2, 3],
             'min_samples_split':[3, 4, 5, 6, 7], 
             'max_features':['sqrt', 'auto', 'log2'],
             'random_state':[seed]}
gbc_param={'learning_rate': [0.01, 0.02, 0.05, 0.1],
              'max_depth': [4, 6, 8],
              'max_features': [1.0, 0.3, 0.1], 
              'min_samples_split': [ 2, 3, 4],
              'random_state':[seed]}
abc_param={'n_estimators':[ 5, 10, 15, 20, 25, 40, 50, 60, 80, 100, 130, 160, 200, 250, 300],
              'learning_rate':[0.0001, 0.001, 0.01, 0.1, 0.2, 0.3,1.5],
              'random_state':[seed]}
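Another of my early questions, answered: the C grid above uses np.logspace, which spaces values evenly on a log scale (handy for regularization strengths spanning orders of magnitude), whereas np.linspace spaces them evenly on a linear scale:

import numpy as np

print(np.linspace(0, 4, 5))  # [0. 1. 2. 3. 4.]
print(np.logspace(0, 4, 5))  # [1.e+00 1.e+01 1.e+02 1.e+03 1.e+04], i.e. 10**linspace(0, 4, 5)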
In [632]:

def grid(model,params):
    grid=GridSearchCV(model,params,cv=10,scoring='accuracy')
    grid.fit(x_train,y_train)
    return round(grid.best_score_*100,2),grid.best_params_

 

In [606]:

%%time
lr_best_score,lr_best_param=grid(lr,lr_param)
display(lr_best_param)
display(lr_best_score)
 
{'C': 2.7825594022071245, 'penalty': 'l1'}
81.59
Wall time: 1.56 s
In [607]:

%%time
knn_best_score,knn_best_param=grid(knn,knn_param)
display(knn_best_param)
display(knn_best_score)
{'algorithm': 'auto', 'leaf_size': 2, 'n_neighbors': 6, 'weights': 'distance'}
82.15
Wall time: 1min 49s
In [608]:

%%time
svc_best_score,svc_best_param=grid(svc,svc_param)
display(svc_best_param)
display(svc_best_score)
{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
82.6
Wall time: 31.9 s
In [609]:

%%time
dt_best_score,dt_best_param=grid(dt,dt_param)
display(dt_best_param)
display(dt_best_score)
{'max_features': 'auto',
 'min_samples_leaf': 2,
 'min_samples_split': 9,
 'random_state': 33}
81.93
Wall time: 9.93 s
In [664]:

%%time
rf_best_score,rf_best_param=grid(rf,rf_param)
display(rf_best_param)
display(rf_best_score)
{'criterion': 'entropy',
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 7,
 'n_estimators': 30,
 'random_state': 33}
83.5
Wall time: 2min 52s
In [628]:

%%time
abc_best_score,abc_best_param=grid(abc,abc_param)
display(abc_best_param)
display(abc_best_score)
{'learning_rate': 0.3, 'n_estimators': 60, 'random_state': 33}
81.37
Wall time: 2min 59s
In [633]:

%%time
gbc_best_score,gbc_best_param=grid(gbc,gbc_param)
display(gbc_best_param)
display(gbc_best_score)
{'learning_rate': 0.02,
 'max_depth': 4,
 'max_features': 1.0,
 'min_samples_split': 3,
 'random_state': 33}
83.28
Wall time: 3min
In [665]:

tunned_acc=pd.DataFrame({'Tuned accuracy(%)':[lr_best_score,knn_best_score,svc_best_score,
                            dt_best_score,rf_best_score,gbc_best_score,abc_best_score,]})
tunned_acc.index=['lr','knn','svc','dt','rf','gbc','abc']
sorted_tunned_accu=tunned_acc.sort_values(by='Tuned accuracy(%)',ascending=False)
display(sorted_tunned_accu)

     Tuned accuracy(%)
rf               83.50
gbc              83.28
svc              82.60
knn              82.15
dt               81.93
lr               81.59
abc              81.37

Analysis: the tree-based models take the longest to tune (partly because they have more parameter combinations). After tuning, random forest scores best, followed by gradient boosting (gbc) and the support vector machine (SVC).

8.4 Retrain and Validate with Tuned Parameters

In [666]:

lr=LogisticRegression(**lr_best_param)
knn=KNeighborsClassifier(**knn_best_param)
svc=SVC(**svc_best_param)
dt=DecisionTreeClassifier(**dt_best_param)
rf=RandomForestClassifier(**rf_best_param)
gbc=GradientBoostingClassifier(**gbc_best_param)
abc= AdaBoostClassifier(**abc_best_param)
models={'LR':lr,'KNN':knn,'SVC':svc,'DT':dt,'RF':rf,'GBC':gbc,'ABC':abc}
for (keys,items) in models.items():
    from sklearn.cross_validation import cross_val_score
    score=cross_val_score(items,x_train,y_train,cv=10,scoring='accuracy')*100
    print '%0.4f(±%0.4f) [%s]' % (score.mean(),score.std(),keys)
82.1643(±4.0666) [KNN]
81.3754(±2.2015) [ABC]
83.2842(±4.2737) [GBC]
82.6038(±2.9810) [SVC]
83.5128(±3.4590) [RF]
81.5976(±2.9117) [LR]
81.9296(±2.4371) [DT]

Analysis: retraining with the best parameters and cross-validating reproduces the scores reported by the grid search.

9 Model Selection and Prediction

In [688]:

# best models so far: RF & GBC

gbc=GradientBoostingClassifier(**gbc_best_param)
gbc.fit(x_train,y_train)
submission_gbc=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':gbc.predict(x_test).astype(int)})  # cast to int: my float submission scored 0
submission_gbc.to_csv('submission_gbc.csv',index=False)

rf=RandomForestClassifier(**rf_best_param)
rf.fit(x_train,y_train)
submission_rf=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':rf.predict(x_test).astype(int)})
submission_rf.to_csv('submission_rf.csv',index=False)
In [693]:

svc=SVC(**svc_best_param)
svc.fit(x_train,y_train)
submission_svc=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':svc.predict(x_test).astype(int)})
submission_svc.to_csv('submission_svc.csv',index=False)

knn=KNeighborsClassifier(**knn_best_param)
knn.fit(x_train,y_train)
submission_knn=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':knn.predict(x_test).astype(int)})
submission_knn.to_csv('submission_knn.csv',index=False)

dt=DecisionTreeClassifier(**dt_best_param)
dt.fit(x_train,y_train)
submission_dt=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':dt.predict(x_test).astype(int)})
submission_dt.to_csv('submission_dt.csv',index=False)

Results after submission:

RF: score 0.75119, rank 8361

GBC: score 0.76555, rank 7317

SVC: score 0.77990, rank 4824

KNN: score 0.73684

DT: score 0.7320

After submitting I found that SVC, only third in the local ranking, actually scored highest.

So cross-validation scores only reflect general generalization ability; the real test-set scores can differ.

That completes the analysis. The feature handling was still fairly rough: Name and Ticket were simply dropped, and Age was filled with group medians rather than model-based predictions, all of which limits the models' accuracy.

If you are still hesitating, leafing through other people's articles and notes and feeling you don't know where to start, go type out the code yourself; you will learn far more from the process than from reading.
