A First Look at Kaggle: Titanic Survival Prediction

Continuing my study of data mining, I tried the Titanic survival prediction competition on Kaggle.

Titanic for Machine Learning

Imports and data loading

# data processing
import numpy as np
import pandas as pd
import re
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
train = pd.read_csv('D:/data/titanic/train.csv')
test = pd.read_csv('D:/data/titanic/test.csv')
train.head()
   PassengerId  Survived  Pclass                                              Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                           Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                            Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1      Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                          Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

The features are: PassengerId, which has no special meaning.
Pclass, the cabin class: does it affect survival? Did the higher classes get more chances?
Name: helps us infer sex and approximate age.
Sex: do women have a higher survival rate?
Age: do different age groups survive at different rates?
SibSp and Parch: counts of siblings/spouses and parents/children aboard. Does having relatives raise or lower survival?
Fare: did a higher fare buy better odds?
Cabin and Embarked: the cabin and the embarkation port... intuitively these should not affect survival.
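A quick numeric sanity check of these hunches (a rough sketch; Pearson correlation against a 0/1 target is only a crude signal):

#correlation of each numeric column with Survived
train.corr()['Survived'].sort_values(ascending=False)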

train.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
train.describe(include=['O'])# 'O' selects the object (categorical) columns
                                                   Name   Sex Ticket        Cabin Embarked
count                                               891   891    891          204      889
unique                                              891     2    681          147        3
top     Hippach, Mrs. Louis Albert (Ida Sophia Fischer)  male   1601  C23 C25 C27        S
freq                                                  1   577      7            4      644

The target feature: Survived

survive_num = train.Survived.value_counts()
survive_num.plot.pie(explode=[0,0.1],autopct='%1.1f%%',labels=['died','survived'],shadow=True)
plt.show()

[Figure: pie chart of died vs. survived shares]

x=[0,1]
plt.bar(x,survive_num,width=0.35)
plt.xticks(x,('died','survived'))
plt.show()

[Figure: bar chart of died vs. survived counts]

Feature analysis

num_f = [f for f in train.columns if train.dtypes[f] != 'object']
cat_f = [f for f in train.columns if train.dtypes[f]=='object']
print('there are %d numerical features:'%len(num_f),num_f)
print('there are %d category features:'%len(cat_f),cat_f)

there are 7 numerical features: ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
there are 5 category features: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

Feature types:
- numerical
- categorical: ordinal or nominal
- nominal (unorderable) categorical here: Sex, Embarked

Categorical features

Sex
train.groupby(['Sex'])['Survived'].count()
Sex
female    314
male      577
Name: Survived, dtype: int64
f,ax = plt.subplots(figsize=(8,6))
fig = sns.countplot(x='Sex',hue='Survived',data=train)
fig.set_title('Sex:Survived vs Dead')
plt.show()

[Figure: count of survived vs. dead by Sex]

train.groupby(['Sex'])['Survived'].sum()/train.groupby(['Sex'])['Survived'].count()
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

Far more men than women were aboard, yet women survived at about 74%, versus 18-19% for men. Women's survival rate is far higher, so Sex is an important feature.
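The same numbers come straight from a single groupby, since the mean of a 0/1 column is the survival rate (equivalent one-liner):

train.groupby('Sex')['Survived'].mean()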
Embarked
sns.factorplot('Embarked','Survived',data=train)
plt.show()

[Figure: survival rate by embarkation port]

f,ax = plt.subplots(1,3,figsize=(24,6))
sns.countplot('Embarked',data=train,ax=ax[0])
ax[0].set_title('No. Of Passengers Boarded')
sns.countplot(x='Embarked',hue='Survived',data=train,ax=ax[1])
ax[1].set_title('Embarked vs Survived')
sns.countplot('Embarked',hue='Pclass',data=train,ax=ax[2])
ax[2].set_title('Embarked vs Pclass')
#plt.subplots_adjust(wspace=0.2,hspace=0.5)
plt.show()

[Figure: passenger count, survival, and Pclass distribution by port]

#pd.pivot_table(train,index='Embarked',columns='Pclass',values='Fare')
sns.boxplot(x='Embarked',y='Fare',hue='Pclass',data=train)
plt.show()

[Figure: Fare by Embarked and Pclass (boxplot)]

The plots show that most passengers boarded at port S, the majority of them class 3, although S also has the most class 1 passengers of the three ports. Port C has the highest survival rate, about 0.55, because class 1 passengers make up a relatively large share there; port Q passengers are almost all class 3. The mean class 1 and 2 fares at port C are higher, which may hint at higher social standing. Logically, though, the embarkation port itself should not affect survival, so it can be converted to dummy variables or dropped.

Pclass
train.groupby('Pclass')['Survived'].value_counts()
Pclass  Survived
1       1           136
        0            80
2       0            97
        1            87
3       0           372
        1           119
Name: Survived, dtype: int64
plt.subplots(figsize=(8,6))
f = sns.countplot('Pclass',hue='Survived',data=train)

[Figure: count of survived vs. dead by Pclass]

sns.factorplot('Pclass','Survived',hue='Sex',data=train)
plt.show()

[Figure: survival rate by Pclass, split by Sex]

Classes 1 and 2 survived at clearly higher rates: more than half of class 1 survived, class 2 is roughly even, and women in classes 1 and 2 approach a rate of 1. Cabin class therefore has a large effect on survival.

SibSp
train[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
   SibSp  Survived
1      1  0.535885
2      2  0.464286
0      0  0.345395
3      3  0.250000
4      4  0.166667
5      5  0.000000
6      8  0.000000
sns.factorplot('SibSp','Survived',data=train)
plt.show()

[Figure: survival rate by SibSp]

#pd.pivot_table(train,values='Survived',index='SibSp',columns='Pclass')
sns.countplot(x='SibSp',hue='Pclass',data=train)
plt.show()

[Figure: SibSp counts by Pclass]

Without any siblings or spouses aboard, survival is about 0.3. One companion gives the highest rate, above 0.5, probably because a larger share of those passengers were in classes 1 and 2. The rate then falls as SibSp grows, mainly because passengers with more than 3 siblings/spouses were mostly in class 3, where such groups rarely survived.

Parch
#pd.pivot_table(train,values='Survived',index='Parch',columns='Pclass')
sns.countplot(x='Parch',hue='Pclass',data=train)
plt.show()

[Figure: Parch counts by Pclass]

sns.factorplot('Parch','Survived',data=train)
plt.show()

[Figure: survival rate by Parch]

The trend mirrors SibSp: traveling alone means lower survival, 1-3 parents/children raise it, and it then falls off quickly, since most of those passengers are in class 3.

Age
train.groupby('Survived')['Age'].describe()
          count       mean        std   min   25%   50%   75%   max
Survived
0         424.0  30.626179  14.172110  1.00  21.0  28.0  39.0  74.0
1         290.0  28.343690  14.950952  0.42  19.0  28.0  36.0  80.0
f,ax = plt.subplots(1,2,figsize=(16,6))
sns.violinplot('Pclass','Age',hue='Survived',data=train,split=True,ax=ax[0])
ax[0].set_title('Pclass Age & Survived')
sns.violinplot('Sex','Age',hue='Survived',data=train,split=True,ax=ax[1])
ax[1].set_title('Sex Age & Survived')
plt.show()

[Figure: violin plots of Age vs. Survived, split by Pclass (left) and Sex (right)]

Among first-class passengers the rescued skew younger overall, but survivors span a wide age range, with 20-50-year-olds faring comparatively well, perhaps because first-class passengers are older on the whole. Children around 10 show a clear survival boost in classes 2 and 3, and the same boost appears for boys among the men. Surviving women are concentrated in young and middle adulthood, while 20-40-year-olds account for the most deaths.

Name

Name is mainly useful for identifying sex and for filling missing ages from passengers who share the same title.

#use a regular expression to pull the title (Mr, Mrs, ...) out of each name
def getTitle(data):

    # findall returns one-element lists like [' Mr.'] (plus extra matches when
    # a name contains another abbreviation, which is where the odd 'Mrs,L'
    # title below comes from)
    name_sal = []
    for i in range(len(data['Name'])):
        name_sal.append(re.findall(r'.\w*\.',data.Name[i]))

    # strip the list brackets, quotes, periods and spaces
    Salut = []
    for i in range(len(name_sal)):
        name = str(name_sal[i])
        name = name[1:-1].replace("'","")
        name = name.replace(".","").strip()
        name = name.replace(" ","")
        Salut.append(name)

    data['Title'] = Salut

getTitle(train)
train.head(2)
   PassengerId  Survived  Pclass                                              Name     Sex   Age  SibSp  Parch     Ticket     Fare Cabin Embarked Title
0            1         0       3                           Braund, Mr. Owen Harris    male  22.0      1      0  A/5 21171   7.2500   NaN        S    Mr
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0      1      0   PC 17599  71.2833   C85        C   Mrs
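For reference, pandas' string methods can do the same extraction in one line (a sketch; like the regex above, it assumes the title is a single word ending in a period):

#one-line alternative to getTitle: capture the word that precedes a period
train['Name'].str.extract(r' ([A-Za-z]+)\.',expand=False).head()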
pd.crosstab(train['Title'],train['Sex'])
Sex       female  male
Title
Capt           0     1
Col            0     2
Countess       1     0
Don            0     1
Dr             1     6
Jonkheer       0     1
Lady           1     0
Major          0     2
Master         0    40
Miss         182     0
Mlle           2     0
Mme            1     0
Mr             0   517
Mrs          124     0
Mrs,L          1     0
Ms             1     0
Rev            0     6
Sir            0     1

A quick vocabulary note: Mme addresses a married (or professional) woman from a non-English-speaking "upper class", equivalent to Mrs; Jonkheer: a Dutch squire; Capt: captain; Lady: noblewoman; Don: a Spanish honorific for nobles and men of standing; the Countess: countess; Ms (or Mz): a woman of unspecified marital status; Col: colonel; Major: major; Mlle: mademoiselle, i.e. Miss; Rev: reverend.

Fare
train.groupby('Pclass')['Fare'].mean()
Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64
sns.distplot(train['Fare'].dropna())
plt.xlim((0,200))
plt.xticks(np.arange(0,200,10))
plt.show()

[Figure: Fare distribution (0-200)]

Preliminary conclusions:
- Women survived at a far higher rate than men.
- First class survived at a high rate and third class at a low one; women in classes 1 and 2 survived at rates close to 1.
- Children around 10 and under show a clear survival boost.
- SibSp and Parch behave similarly: traveling alone lowers survival, 1-2 siblings/spouses or 1-3 parents/children raise it, and beyond that it drops sharply.
- Name and Age can be processed over the full dataset: extract the title from Name, then fill missing ages with per-title means.

Data preprocessing

#combine the training and test sets
passID = test['PassengerId']
all_data = pd.concat([train,test],keys=["train","test"])
all_data.shape
#all_data.head()
(1309, 13)
#count missing values in train and test
NAs = pd.concat([train.isnull().sum(),train.isnull().sum()/train.isnull().count(),test.isnull().sum(),test.isnull().sum()/test.isnull().count()],axis=1,keys=["train","percent_train","test","percent"])
NAs[NAs.sum(axis=1)>1].sort_values(by="percent",ascending=False)
          train  percent_train   test   percent
Cabin       687       0.771044  327.0  0.782297
Age         177       0.198653   86.0  0.205742
Fare          0       0.000000    1.0  0.002392
Embarked      2       0.002245    0.0  0.000000
#drop uninformative features: PassengerId (no signal) and Cabin (mostly missing)
all_data.drop(['PassengerId','Cabin'],axis=1,inplace=True)

all_data.head(2)
          Age Embarked     Fare                                              Name  Parch  Pclass     Sex  SibSp  Survived     Ticket Title
train 0  22.0        S   7.2500                           Braund, Mr. Owen Harris      0       3    male      1       0.0  A/5 21171    Mr
      1  38.0        C  71.2833  Cumings, Mrs. John Bradley (Florence Briggs Th…      0       1  female      1       1.0   PC 17599   Mrs
Handling Age
#first extract the title from Name, now on the combined data
getTitle(all_data)
pd.crosstab(all_data['Title'], all_data['Sex'])
Sex       female  male
Title
Capt           0     1
Col            0     4
Countess       1     0
Don            0     1
Dona           1     0
Dr             1     7
Jonkheer       0     1
Lady           1     0
Major          0     2
Master         0    61
Miss         260     0
Mlle           2     0
Mme            1     0
Mr             0   757
Mrs          196     0
Mrs,L          1     0
Ms             2     0
Rev            0     8
Sir            0     1

all_data['Title'] = all_data['Title'].replace(
    ['Lady','Dr','Dona','Mme','Countess'],'Mrs')
all_data['Title'] =all_data['Title'].replace('Mlle','Miss')
all_data['Title'] =all_data['Title'].replace('Mrs,L','Mrs')
all_data['Title'] = all_data['Title'].replace('Ms', 'Miss')
#all_data['Title'] = all_data['Title'].replace('Mme', 'Mrs')
all_data['Title'] = all_data['Title'].replace(['Capt','Col','Don','Major','Rev','Jonkheer','Sir'],'Mr')
'''
all_data['Title'] = all_data.Title.replace({'Mlle':'Miss','Mme':'Mrs','Ms':'Miss','Dr':'Mrs',
                        'Major':'Mr','Lady':'Mrs','Countess':'Mrs',
                        'Jonkheer':'Mr','Col':'Mr','Rev':'Mr',
                        'Capt':'Mr','Sir':'Mr','Don':'Mr','Mrs,L':'Mrs'})

'''
all_data.Title.isnull().sum()
0
all_data[:train.shape[0]].groupby('Title')['Age'].mean()
Title
Master     4.574167
Miss      21.845638
Mr        32.891990
Mrs       36.188034
Name: Age, dtype: float64
#fill missing ages with the rounded per-title mean ages from the training set
all_data.loc[(all_data.Age.isnull()) & (all_data.Title=='Mr'),'Age']=32
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Mrs'),'Age']=36
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Master'),'Age']=5
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Miss'),'Age']=22
#all_data.loc[(all_data.Age.isnull())&(all_data.Title=='other'),'Age']=46

all_data.Age.isnull().sum()
0
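For reference, the hard-coded values above are just the rounded per-title means; a sketch of a programmatic equivalent that avoids the magic numbers:

#compute the per-title mean ages on the training rows, then fill NaNs by title
title_age = all_data[:train.shape[0]].groupby('Title')['Age'].mean()
all_data['Age'] = all_data['Age'].fillna(all_data['Title'].map(title_age))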
all_data[:train.shape[0]][['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
    Title  Survived
0  Master  0.575000
1    Miss  0.702703
2      Mr  0.158192
3     Mrs  0.777778
f,ax = plt.subplots(1,2,figsize=(16,6))
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Sex=='female','Age'],color='red',ax=ax[0])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Sex=='male','Age'],color='blue',ax=ax[0])

sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==0,'Age' ],
                 color='red', label='Not Survived', ax=ax[1])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==1,'Age' ],
                 color='blue', label='Survived', ax=ax[1])
plt.legend(loc='best')
plt.show()

[Figure: Age distributions by sex (left) and by survival (right)]

  • Children around 16 and under survived at a higher rate, and the oldest passenger (80) survived
  • A large number of 16-40-year-olds did not survive
  • Most passengers are between 16 and 40
  • To help the classifiers, bin Age into bands as a new feature, and add a child feature
add isChild
def male_female_child(passenger):
    # unpack age and sex
    age,sex = passenger
    # flag children as their own category
    if age < 16:
        return 'child'
    else:
        return sex
# create the new feature
all_data['person'] = all_data[['Age','Sex']].apply(male_female_child,axis=1)
#ages run 0-80; split into 3 bands: child, young/middle-aged adult, senior

all_data['Age_band']=0
all_data.loc[all_data['Age']<=16,'Age_band']=0
all_data.loc[(all_data['Age']>16)&(all_data['Age']<=40),'Age_band']=1
all_data.loc[all_data['Age']>40,'Age_band']=2
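The same banding can be written with pd.cut (an equivalent sketch, assuming the same cut points at 16 and 40):

#bins (0,16], (16,40], (40,81] match the conditions above
all_data['Age_band'] = pd.cut(all_data['Age'],bins=[0,16,40,81],labels=[0,1,2]).astype(int)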
Handling Name
df = pd.get_dummies(all_data['Title'],prefix='Title')
all_data = pd.concat([all_data,df],axis=1)
all_data.drop('Title',axis=1,inplace=True)
#drop name
all_data.drop('Name',axis=1,inplace=True)
fillna Embarked
all_data.loc[all_data.Embarked.isnull()]
           Age Embarked  Fare  Parch  Pclass     Sex  SibSp  Survived  Ticket Title  person  Age_band
train 61  38.0      NaN  80.0      0       1  female      0       1.0  113572     2  female         1
      829 62.0      NaN  80.0      0       1  female      0       1.0  113572     3  female         2

Fare 80 and first class: very likely they embarked at C.
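A quick way to check that guess (a sketch): the median first-class fare per embarkation port; C's median should come out closest to the 80.0 these two passengers paid.

all_data[all_data['Pclass']==1].groupby('Embarked')['Fare'].median()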

all_data['Embarked'].fillna('C',inplace=True)

all_data.Embarked.isnull().any()
False
embark_dummy = pd.get_dummies(all_data.Embarked)
all_data = pd.concat([all_data,embark_dummy],axis=1)
all_data.head(2)
          Age Embarked     Fare  Parch  Pclass     Sex  SibSp  Survived     Ticket  person  Age_band  Title_Master  Title_Miss  Title_Mr  Title_Mrs  C  Q  S
train 0  22.0        S   7.2500      0       3    male      1       0.0  A/5 21171    male         1             0           0         1          0  0  0  1
      1  38.0        C  71.2833      0       1  female      1       1.0   PC 17599  female         1             0           0         0          1  1  0  0
add SibSp and Parch
#create two new features: Family_size and alone
all_data['Family_size'] = all_data['SibSp']+all_data['Parch']# total relatives aboard
all_data['alone'] = 0# default: not alone
all_data.loc[all_data.Family_size==0,'alone']=1# 1 marks passengers traveling alone
f,ax=plt.subplots(1,2,figsize=(16,6))
sns.factorplot('Family_size','Survived',data=all_data[:train.shape[0]],ax=ax[0])
ax[0].set_title('Family_size vs Survived')
sns.factorplot('alone','Survived',data=all_data[:train.shape[0]],ax=ax[1])
ax[1].set_title('alone vs Survived')
plt.close(2)
plt.close(3)
plt.show()

[Figure: survival rate by Family_size (left) and by alone (right)]

Passengers traveling alone survived at only about 0.3. With 1-3 family members the rate rises, but beyond 4 it drops sharply again.

#then bin Family_size into three groups
all_data['Family_size'] = np.where(all_data['Family_size']==0, 'solo',
                                    np.where(all_data['Family_size']<=3, 'normal', 'big'))
sns.factorplot('alone','Survived',hue='Sex',data=all_data[:train.shape[0]],col='Pclass')
plt.show()

[Figure: survival rate for alone vs. not, by Sex, one panel per Pclass]

For women in classes 1 and 2, traveling alone makes little difference; for third-class women, survival is actually higher when traveling alone.

all_data['poor_girl'] = 0
all_data.loc[(all_data['Sex']=='female')&(all_data['Pclass']==3)&(all_data['alone']==1),'poor_girl']=1
Filling and binning the continuous Fare
#fill the missing Fare values using the class means found earlier
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==1),'Fare']=84
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==2),'Fare']=21
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==3),'Fare']=14
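A sketch of an equivalent fill that derives the values from the data instead of hard-coding them (here the per-class median, which is more robust to fare outliers than the mean):

all_data['Fare'] = all_data.groupby('Pclass')['Fare'].transform(lambda s: s.fillna(s.median()))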
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==0,'Fare' ],
                 color='red', label='Not Survived')
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==1,'Fare' ],
                 color='blue', label='Survived')
plt.xlim((0,100))
(0, 100)

[Figure: Fare distributions for survivors vs. non-survivors (0-100)]

sns.lmplot('Fare','Survived',data=all_data[:train.shape[0]])
plt.show()

[Figure: lmplot of Survived against Fare]

#split Fare into three equal-frequency bands and check survival in each
all_data['Fare_band'] = pd.qcut(all_data['Fare'],3)

all_data[:train.shape[0]].groupby('Fare_band')['Survived'].mean()
Fare_band
(-0.001, 8.662]    0.198052
(8.662, 26.0]      0.402778
(26.0, 512.329]    0.559322
Name: Survived, dtype: float64
#discretize the continuous Fare using the tercile edges above

all_data['Fare_cut'] = 0
all_data.loc[all_data['Fare']<=8.662,'Fare_cut'] = 0
all_data.loc[((all_data['Fare']>8.662) & (all_data['Fare']<=26)),'Fare_cut'] = 1
#all_data.loc[((all_data['Fare']>14.454) & (all_data['Fare']<=31.275)),'Fare_cut'] = 2
all_data.loc[((all_data['Fare']>26) & (all_data['Fare']<513)),'Fare_cut'] = 2
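Since the cut points are exactly the tercile edges from qcut, the same feature can be produced directly (equivalent sketch):

#labels=False returns the integer bin codes 0/1/2
all_data['Fare_cut'] = pd.qcut(all_data['Fare'],3,labels=False)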

sns.factorplot('Fare_cut','Survived',hue='Sex',data=all_data[:train.shape[0]])
plt.show()

[Figure: survival rate by Fare_cut, split by Sex]

Survival rises with fare, especially markedly for men.

# create a feature for rich men
all_data['rich_man'] = 0
all_data.loc[((all_data['Fare']>=80) & (all_data['Sex']=='male')),'rich_man'] = 1
Encoding categorical features numerically
all_data.head()
          Age Embarked     Fare  Parch  Pclass     Sex  SibSp  Survived            Ticket  person ...  Title_Mrs  C  Q  S  Family_size  alone  poor_girl        Fare_band  Fare_cut  rich_man
train 0  22.0        S   7.2500      0       3    male      1       0.0         A/5 21171    male ...          0  0  0  1       normal      0          0  (-0.001, 8.662]         0         0
      1  38.0        C  71.2833      0       1  female      1       1.0          PC 17599  female ...          1  1  0  0       normal      0          0  (26.0, 512.329]         2         0
      2  26.0        S   7.9250      0       3  female      0       1.0  STON/O2. 3101282  female ...          0  0  0  1         solo      1          1  (-0.001, 8.662]         0         0
      3  35.0        S  53.1000      0       1  female      1       1.0            113803  female ...          1  0  0  1       normal      0          0  (26.0, 512.329]         2         0
      4  35.0        S   8.0500      0       3    male      0       0.0            373450    male ...          0  0  0  1         solo      1          0  (-0.001, 8.662]         0         0

5 rows × 24 columns

Features to drop: Embarked (already dummy-encoded), Fare and Fare_band (replaced by Fare_cut), Sex (replaced by person), Age (replaced by Age_band), Ticket, the redundant C dummy, and SibSp and Parch.

'''
Drop features that are no longer needed: Age is replaced by Age_band,
Fare and Fare_band are replaced by Fare_cut,
and Ticket carries no obvious signal
'''
#all_data.drop(['Age','Fare','Fare_band','Ticket'],axis=1,inplace=True)
#all_data.drop(['Age','Fare','Fare_band','Ticket','Embarked','C'],axis=1,inplace=True)
all_data.drop(['Age','Fare','Ticket','Embarked','C','Fare_band','SibSp','Parch'],axis=1,inplace=True)
all_data.head(2)
         Pclass     Sex  Survived  person  Age_band  Title_Master  Title_Miss  Title_Mr  Title_Mrs  Q  S  Family_size  alone  poor_girl  Fare_cut  rich_man
train 0       3    male       0.0    male         1             0           0         1          0  0  1       normal      0          0         0         0
      1       1  female       1.0  female         1             0           0         0          1  0  0       normal      0          0         2         0
df1 = pd.get_dummies(all_data['Family_size'],prefix='Family_size')
df2 = pd.get_dummies(all_data['person'],prefix='person')
df3 = pd.get_dummies(all_data['Age_band'],prefix='age')
all_data = pd.concat([all_data,df1,df2,df3],axis=1)
all_data.head()
         Pclass     Sex  Survived  person  Age_band  Title_Master  Title_Miss  Title_Mr  Title_Mrs  Q ...  rich_man  Family_size_big  Family_size_normal  Family_size_solo  person_child  person_female  person_male  age_0  age_1  age_2
train 0       3    male       0.0    male         1             0           0         1          0  0 ...         0                0                   1                 0             0              0            1      0      1      0
      1       1  female       1.0  female         1             0           0         0          1  0 ...         0                0                   1                 0             0              1            0      0      1      0
      2       3  female       1.0  female         1             0           1         0          0  0 ...         0                0                   0                 1             0              1            0      0      1      0
      3       1  female       1.0  female         1             0           0         0          1  0 ...         0                0                   1                 0             0              1            0      0      1      0
      4       3    male       0.0    male         1             0           0         1          0  0 ...         0                0                   0                 1             0              0            1      0      1      0

5 rows × 25 columns

all_data.drop(['Sex','person','Age_band','Family_size'],axis=1,inplace=True)
all_data.head()
         Pclass  Survived  Title_Master  Title_Miss  Title_Mr  Title_Mrs  Q  S  alone  poor_girl ...  rich_man  Family_size_big  Family_size_normal  Family_size_solo  person_child  person_female  person_male  age_0  age_1  age_2
train 0       3       0.0             0           0         1          0  0  1      0          0 ...         0                0                   1                 0             0              0            1      0      1      0
      1       1       1.0             0           0         0          1  0  0      0          0 ...         0                0                   1                 0             0              1            0      0      1      0
      2       3       1.0             0           1         0          0  0  1      1          1 ...         0                0                   0                 1             0              1            0      0      1      0
      3       1       1.0             0           0         0          1  0  1      0          0 ...         0                0                   1                 0             0              1            0      0      1      0
      4       3       0.0             0           0         1          0  0  1      1          0 ...         0                0                   0                 1             0              0            1      0      1      0

5 rows × 21 columns

Model building

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix# matrix of predictions vs. targets
from sklearn.model_selection import cross_val_predict# returns cross-validated predictions

from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
train_data = all_data[:train.shape[0]]
test_data = all_data[train.shape[0]:]
print('train data:'+str(train_data.shape))
print('test data:'+str(test_data.shape))
train data:(891, 21)
test data:(418, 21)

#use fresh names for the split: reusing `train`/`test` would shadow the original
#DataFrames and break the all_data[:train.shape[0]] slices on a re-run
train_split,test_split = train_test_split(train_data,test_size = 0.25, random_state=0,stratify=train_data['Survived'])
train_x = train_split.drop('Survived',axis=1)

train_y = train_split['Survived']

test_x = test_split.drop('Survived',axis=1)
test_y = test_split['Survived']
print(train_x.shape)
print(test_x.shape)
(668, 20)
(223, 20)
# cross-validated accuracy on the training split
def cv_score(model):
    cv_result = cross_val_score(model,train_x,train_y,cv=10,scoring = "accuracy")
    return(cv_result)

# cross-validated accuracy on the held-out split
def cv_score_test(model):
    cv_result_test = cross_val_score(model,test_x,test_y,cv=10,scoring = "accuracy")
    return(cv_result_test)

rbf SVM

# RBF SVM model

param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], }
clf_svc = GridSearchCV(svm.SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf_svc = clf_svc.fit(train_x, train_y)
print("Best estimator found by grid search:")
print(clf_svc.best_estimator_)
acc_svc_train = cv_score(clf_svc.best_estimator_).mean()
acc_svc_test = cv_score_test(clf_svc.best_estimator_).mean()
print(acc_svc_train)
print(acc_svc_test)
Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
0.826306967835
0.816196122718

Decision Tree

#a simple tree

clf_tree = DecisionTreeClassifier()
clf_tree.fit(train_x,train_y)
acc_tree_train = cv_score(clf_tree).mean()
acc_tree_test = cv_score_test(clf_tree).mean()
print(acc_tree_train)
print(acc_tree_test)
0.808216271583
0.811631846414

KNN

#test n_neighbors 

pred = []
for i in range(1,11):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(train_x,train_y)
    pred.append(cv_score(model).mean())
n = list(range(1,11))
plt.plot(n,pred)
plt.xticks(range(1,11))
plt.show()  

[Figure: mean CV accuracy vs. n_neighbors for KNN]

clf_knn = KNeighborsClassifier(n_neighbors=4)
clf_knn.fit(train_x,train_y)
acc_knn_train = cv_score(clf_knn).mean()
acc_knn_test = cv_score_test(clf_knn).mean()
print(acc_knn_train)
print(acc_knn_test)
0.826239790353
0.829653679654

Logistic Regression

#logistic regression

clf_LR = LogisticRegression()
clf_LR.fit(train_x,train_y)
acc_LR_train = cv_score(clf_LR).mean()
acc_LR_test = cv_score_test(clf_LR).mean()
print(acc_LR_train)
print(acc_LR_test)
0.838226647511
0.811848296631

Gaussian Naive Bayes



clf_gb = GaussianNB()
clf_gb.fit(train_x,train_y)
acc_gb_train = cv_score(clf_gb).mean()
acc_gb_test = cv_score_test(clf_gb).mean()
print(acc_gb_train)
print(acc_gb_test)
0.794959693511
0.789695087521

Random Forest



n_estimators = range(100,1000,100)
grid = {'n_estimators':n_estimators}

clf_forest = GridSearchCV(RandomForestClassifier(random_state=0),param_grid=grid,verbose=True)
clf_forest.fit(train_x,train_y)
print(clf_forest.best_estimator_)
print(clf_forest.best_score_)
#print(cv_score(clf_forest).mean())
#print(cv_score_test(clf_forest).mean())
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:   32.2s finished
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=200, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)
0.817365269461
clf_forest = RandomForestClassifier(n_estimators=200)
clf_forest.fit(train_x,train_y)
acc_forest_train = cv_score(clf_forest).mean()
acc_forest_test = cv_score_test(clf_forest).mean()
print(acc_forest_train)
print(acc_forest_test)
0.811178066885
0.811434217956
pd.Series(clf_forest.feature_importances_,train_x.columns).sort_values(ascending=True).plot.barh(width=0.8)
plt.show()

[Figure: random forest feature importances]


models = pd.DataFrame({
    'model':['SVM','Decision Tree','KNN','Logistic regression','Gaussian Bayes','Random Forest'],
    'score on train':[acc_svc_train,acc_tree_train,acc_knn_train,acc_LR_train,acc_gb_train,acc_forest_train],
    'score on test':[acc_svc_test,acc_tree_test,acc_knn_test,acc_LR_test,acc_gb_test,acc_forest_test]
})
models.sort_values(by='score on test', ascending=False)
                 model  score on test  score on train
2                  KNN       0.829654        0.826240
0                  SVM       0.816196        0.826307
3  Logistic regression       0.811848        0.838227
1        Decision Tree       0.811632        0.808216
5        Random Forest       0.811434        0.811178
4       Gaussian Bayes       0.789695        0.794960

Ensemble

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import GradientBoostingClassifier

# bagging with the tuned SVM as the base estimator (despite the variable name, not a tree)
from sklearn.ensemble import BaggingClassifier
bag_tree = BaggingClassifier(base_estimator=clf_svc.best_estimator_,n_estimators=200,random_state=0)
bag_tree.fit(train_x,train_y)
acc_bagtree_train = cv_score(bag_tree).mean()
acc_bagtree_test =cv_score_test(bag_tree).mean()
print(acc_bagtree_train)
print(acc_bagtree_test)
0.82782211935
0.816196122718

AdaBoost

n_estimators = range(100,1000,100)
a = [0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
grid = {'n_estimators':n_estimators,'learning_rate':a}
ada = GridSearchCV(AdaBoostClassifier(),param_grid=grid,verbose=True)
ada.fit(train_x,train_y)
print(ada.best_estimator_)
print(ada.best_score_)
#acc_ada_train = cv_score(ada).mean()
#acc_ada_test = cv_score_test(ada).mean()

#print(acc_ada_train)
#print(acc_ada_test)
Fitting 3 folds for each of 90 candidates, totalling 270 fits


[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed:  5.4min finished


AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.05, n_estimators=200, random_state=None)
0.835329341317
ada = AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.2)
ada.fit(train_x,train_y)

acc_ada_train = cv_score(ada).mean()
acc_ada_test = cv_score_test(ada).mean()

print(acc_ada_train)
print(acc_ada_test)
0.829248144305
0.825719932242
#confusion matrix to inspect the predictions

y_pred = cross_val_predict(ada,test_x,test_y,cv=10)
sns.heatmap(confusion_matrix(test_y,y_pred),cmap='winter',annot=True,fmt='2.0f')
plt.show()

[Figure: confusion matrix heatmap of cross-validated AdaBoost predictions]

GradientBoosting


n_estimators = range(100,1000,100)
a = [0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
grid = {'n_estimators':n_estimators,'learning_rate':a}
grad = GridSearchCV(GradientBoostingClassifier(),param_grid=grid,verbose=True)
grad.fit(train_x,train_y)
print(grad.best_estimator_)
print(grad.best_score_)
Fitting 3 folds for each of 90 candidates, totalling 270 fits


[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed:  2.4min finished


GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.05, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=200, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)
0.824850299401
#use the settings found by the gradient boosting grid search

clf_grad=GradientBoostingClassifier(n_estimators=200,random_state=0,learning_rate=0.05)
clf_grad.fit(train_x,train_y)
acc_grad_train = cv_score(clf_grad).mean()
acc_grad_test = cv_score_test(clf_grad).mean()

print(acc_grad_train)
print(acc_grad_test)
0.818709926304
0.807500470544
from sklearn.metrics import precision_score

# Simple stacking: fit each base estimator, then train a logistic
# regression on their stacked predictions as a second-level model
class Ensemble(object):

    def __init__(self,estimators):
        self.estimator_names = []
        self.estimators = []
        for i in estimators:
            self.estimator_names.append(i[0])
            self.estimators.append(i[1])
        self.clf = LogisticRegression()

    def fit(self, train_x, train_y):
        # fit the base models, then the meta-model on their predictions
        for i in self.estimators:
            i.fit(train_x,train_y)
        x = np.array([i.predict(train_x) for i in self.estimators]).T
        y = train_y
        self.clf.fit(x, y)

    def predict(self,x):
        x = np.array([i.predict(x) for i in self.estimators]).T
        return self.clf.predict(x)

    def score(self,x,y):
        # note: precision, not accuracy
        s = precision_score(y,self.predict(x))
        return s
ensem = Ensemble([('Ada',ada),('Bag',bag_tree),('SVM',clf_svc.best_estimator_),('LR',clf_LR),('gbdt',clf_grad)])
score = 0
for i in range(0,10):
    ensem.fit(train_x, train_y)
    sco = round(ensem.score(test_x,test_y) * 100, 2)
    score+=sco
print(score/10)
89.83
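VotingClassifier was imported earlier but never used; for comparison, a hard-voting sketch over the same base models (hard voting because the tuned SVC was fitted without probability=True):

voting = VotingClassifier(estimators=[('ada',ada),('bag',bag_tree),
        ('svm',clf_svc.best_estimator_),('lr',clf_LR),('gbdt',clf_grad)],voting='hard')
voting.fit(train_x,train_y)
print(cv_score_test(voting).mean())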

Submission

#test_data still carries the empty Survived column: drop it before predicting
pre = ensem.predict(test_data.drop('Survived',axis=1))
submission = pd.DataFrame({'PassengerId':passID,'Survived':pre.astype(int)})
submission.to_csv('titanic_submission.csv',index=False)# any filename works for upload

Judging by the submitted score, the ensemble brings no clear improvement over the single models. Possible reasons: the base models are strongly correlated, the training data is limited, or the one-hot encoding introduces collinearity. And although the training and held-out scores are close, the leaderboard score is noticeably lower, which suggests too little data, insufficient training, and few, strongly correlated features; adding more features would be worth trying.
