After a month of studying machine learning, I was itching to try it out in practice. Following a very detailed post on Kaggle, I spent a week, on and off, working through the Titanic prediction analysis step by step, which I count as my entry into the field.
Going from reading the code to writing it myself to generating and submitting the results file, I ran into all kinds of problems along the way. How do I annotate a plot? What do I do about garbled Chinese characters? For pandas slicing, should I use loc or iloc? Why does training complain that a y label of shape (n_samples, 1) is not the same as (n_samples,)? What do logspace and linspace mean? What do all the model parameters mean when tuning...? When I finally reached the last step, my heart sank as the submission scored 0 (it turned out I had not converted the float output to int). The analysis is admittedly simple and leaves plenty of room for improvement, but it was very satisfying.
My deepest takeaway: knowing is the hard part, doing is the easy part. Real knowledge comes from practice: it is precisely by running into unexpected problems and solving them one by one that you truly learn.
Reference: https://www.kaggle.com/eraaz1/a-comprehensive-guide-to-titanic-machine-learning
Table of Contents
1 Import Libraries and Data
In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Markdown
def bold(string):
    display(Markdown(string))
# Load the data
train=pd.read_csv("train.csv")
test=pd.read_csv("test.csv")
bold('**preview train data**') # bold cannot render Chinese characters here -> use u'str' literals
display(train.head())
bold('**preview test data**')
display(test.head())
train.shape
test.shape # in a notebook cell only the last expression's value is echoed, hence the single Out below
preview train data
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
preview test data
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
Out[1]:
(418, 11)
2 Examine the Variables
2.1 Variable Description
To simplify later analysis and preprocessing, first concatenate the train and test sets
In [3]:
merge=pd.concat([train,test],ignore_index=True,sort=False)
bold('**preview merge data**')
display(merge.head())
display(merge.shape)
display(merge.columns)
preview merge data
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
(1309, 12)
Index([u'PassengerId', u'Survived', u'Pclass', u'Name', u'Sex', u'Age',
u'SibSp', u'Parch', u'Ticket', u'Fare', u'Cabin', u'Embarked'],
dtype='object')
Categorical variables:
Pclass Name Sex SibSp Parch Ticket Cabin Embarked Survived
Numerical variables:
Age Fare PassengerId
2.2 变量数据类型
In [4]:
display(merge.dtypes)
PassengerId int64
Survived float64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
int: PassengerId, Pclass, SibSp, Parch
float: Age, Fare, Survived
object (strings + digits): Name, Sex, Ticket, Cabin, Embarked
3 Univariate Analysis
Examine each variable's distribution:
categorical variables: usually bar plots;
numerical variables: usually histograms
3.1 Categorical Variable Analysis
In [5]:
# Define annotations for the absolute-frequency bar plot
def abs_bar_labels():
    font_size = 15
    plt.ylabel('Absolute Frequency', fontsize = font_size)
    plt.xticks(rotation = 0, fontsize = font_size)
    plt.yticks([])
    # Set individual bar labels in absolute numbers
    for x in ax.patches:
        ax.annotate(x.get_height(),
                    (x.get_x() + x.get_width()/2., x.get_height()), ha = 'center', va = 'center', xytext = (0,7),
                    textcoords = 'offset points', fontsize = font_size, color = 'black')
# Define annotations for the relative-frequency bar plot
def pct_bar_labels():
    font_size = 15
    plt.ylabel('Relative Frequency (%)', fontsize = font_size)
    plt.xticks(rotation = 0, fontsize = font_size)
    plt.yticks([])
    # Set individual bar labels on a proportional scale
    for x in ax1.patches:
        ax1.annotate(str(x.get_height()) + '%',
                     (x.get_x() + x.get_width()/2., x.get_height()), ha = 'center', va = 'center', xytext = (0, 7),
                     textcoords = 'offset points', fontsize = font_size, color = 'black')
In [6]:
# Define a function that shows the absolute and relative frequencies plus their bar plots:
def abs_rel_f(feature):
    abs_frequency=feature.value_counts()
    relative_frequency=feature.value_counts(normalize=True).round(3)*100
    abs_rel_f=pd.DataFrame({'Absolute Frequency':abs_frequency,'Relative Frequency(%)':relative_frequency})
    display(abs_rel_f)
    # Draw the bar plots
    global ax,ax1 # declare globals so the annotation functions above can access ax and ax1 (an alternative sketch follows after this cell)
    ax=abs_frequency.plot.bar(figsize=(18,7))
    abs_bar_labels() # Displays bar labels in abs scale.
    plt.show()
    ax1=relative_frequency.plot.bar(figsize=(18,7))
    pct_bar_labels()
    plt.show()
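The global-variable trick above works in a notebook, but passing the Axes object explicitly is usually cleaner. A minimal alternative sketch under that assumption (the helper name annotate_bars is mine, not part of the original notebook):
# Alternative sketch: pass the Axes explicitly instead of relying on globals,
# so the annotation helper can be reused for any bar plot.
def annotate_bars(ax_in, suffix=''):
    # label each bar with its height, placed just above the bar
    for patch in ax_in.patches:
        ax_in.annotate(str(patch.get_height()) + suffix,
                       (patch.get_x() + patch.get_width()/2., patch.get_height()),
                       ha='center', va='bottom')
fig, ax2 = plt.subplots(figsize=(18,7))
merge.Sex.value_counts().plot.bar(ax=ax2)
annotate_bars(ax2)
plt.show()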
In [10]:
bold('**Survived**')
abs_rel_f(merge.Survived)
Survived
Absolute Frequency | Relative Frequency(%) | |
---|---|---|
0.0 | 549 | 61.6 |
1.0 | 342 | 38.4 |
Analysis: the survival rate is only about 38%; roughly 62% of passengers did not survive
In [24]:
bold(u'**Pclass**')
abs_rel_f(merge.Pclass)
Pclass
Absolute Frequency | Relative Frequency(%) | |
---|---|---|
3 | 709 | 54.2 |
1 | 323 | 24.7 |
2 | 277 | 21.2 |
Analysis: the classes are unevenly distributed: 3rd class is largest at about 54%, with 1st class (about 25%) and 2nd class (about 21%) making up the rest
In [26]:
bold('**Sex**')
abs_rel_f(merge.Sex)
Sex
Absolute Frequency | Relative Frequency(%) | |
---|---|---|
male | 843 | 64.4 |
female | 466 | 35.6 |
Analysis: the sex distribution is also uneven, with males the majority at about 64%
In [27]:
bold('**SibSp**')
abs_rel_f(merge.SibSp)
SibSp
Absolute Frequency | Relative Frequency(%) | |
---|---|---|
0 | 891 | 68.1 |
1 | 319 | 24.4 |
2 | 42 | 3.2 |
4 | 22 | 1.7 |
3 | 20 | 1.5 |
8 | 9 | 0.7 |
5 | 6 | 0.5 |
Analysis: passengers with no siblings or spouse aboard are the most common at about 68%, followed by those with one at about 24%; each remaining group is under 5%
In [28]:
bold('**Parch**')
abs_rel_f(merge.Parch)
Parch
Absolute Frequency | Relative Frequency(%) | |
---|---|---|
0 | 1002 | 76.5 |
1 | 170 | 13.0 |
2 | 113 | 8.6 |
3 | 8 | 0.6 |
5 | 6 | 0.5 |
4 | 6 | 0.5 |
9 | 2 | 0.2 |
6 | 2 | 0.2 |
Analysis: passengers with no parents or children aboard are the most common at about 77%, those with one about 13%, those with two about 9%; the rest are each under 1%
In [47]:
bold('**Embarked**')
abs_rel_f(merge.Embarked)
Embarked
Absolute Frequency | Relative Frequency(%) | |
---|---|---|
S | 914 | 69.9 |
C | 270 | 20.7 |
Q | 123 | 9.4 |
Analysis: the embarkation ports are uneven as well: most passengers boarded at S (about 70%), then C (about 21%), and fewest at Q
In [63]:
bold('**Name&Cabin&Ticket**')
bold('Name:'+str(merge.Name.value_counts().count()))
display(merge.Name.head(5))
bold('Cabin:'+str(merge.Cabin.value_counts().count()))
display(merge.Cabin.value_counts(dropna=False).head(5))
#value_counts ignores missing values by default; pass dropna=False to include them in the output
bold('Ticket:'+str(merge.Ticket.value_counts().count()))
display(merge.Ticket.value_counts(dropna=False).head(5))
Name&Cabin&Ticket
Name:1307
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
Name: Name, dtype: object
Cabin:186
NaN 1014
C23 C25 C27 6
B57 B59 B63 B66 5
G6 5
C78 4
Name: Cabin, dtype: int64
Ticket:929
CA. 2343 11
CA 2144 8
1601 8
S.O.C. 14879 7
PC 17608 7
Name: Ticket, dtype: int64
Analysis: there are about 1,307 distinct names (all strings), 186 distinct cabin numbers (plus 1,014 missing values; letters + digits), and 929 distinct ticket numbers (letters + digits), far too many distinct values to analyze directly
3.2 Numerical Variable Analysis
Histograms, density plots, and summary statistics (describe) for Age & Fare
In [155]:
# Define histogram and density-plot display functions
plt.rcParams['axes.unicode_minus']=False # display minus signs correctly
def hist(Feature):
    global ax
    fon_size=15
    fig_size=(18,7)
    ax=Feature.plot.hist(bins=20,figsize=fig_size,color='g')
    plt.xlabel('%s'% Feature.name+ ' Histogram',fontsize=30)
    plt.xticks(fontsize=fon_size)
    plt.yticks(fontsize=fon_size)
    abs_bar_labels() # the bar-annotation function also works here, since ax is global
def density(Feature):
    fon_size=15
    fig_size=(18,7)
    Feature.plot.hist(bins=20,density=True,figsize=fig_size)
    Feature.plot.kde(style='r--')
    plt.xlabel('%s'% Feature.name+ u' Histogram',fontsize=30)
    plt.xticks(fontsize=fon_size)
    plt.yticks(fontsize=fon_size)
In [166]:
bold('**Age**')
hist(merge.Age)
Age
In [167]:
density(merge.Age)
In [168]:
merge.Age.describe()
Out[168]:
count 1046.000000
mean 29.881138
std 14.413493
min 0.170000
25% 21.000000
50% 28.000000
75% 39.000000
max 80.000000
Name: Age, dtype: float64
Analysis: ages are unevenly distributed; 20 to 25 year-olds are the largest group, the oldest passenger is 80, and the youngest is an infant of about two months (0.17 years)
In [169]:
bold('**Fare**')
hist(merge.Fare)
Fare
In [170]:
density(merge.Fare)
In [171]:
merge.Fare.describe()
Out[171]:
count 1308.000000
mean 33.295479
std 51.758668
min 0.000000
25% 7.895800
50% 14.454200
75% 31.275000
max 512.329200
Name: Fare, dtype: float64
Analysis: fares are also unevenly distributed, skewed toward cheap tickets
4 Feature Engineering
Combine, create, and regroup features (for the variables from section 3 that had too many distinct values to analyze)
In [178]:
# Inspect the missing Cabin values
display(merge.Cabin.value_counts().count())
display(merge.Cabin.isnull().sum())
186
1014
In [179]:
merge.Cabin.head(5)
Out[179]:
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: Cabin, dtype: object
In [182]:
merge.Cabin.fillna(value='X',inplace=True)
merge.Cabin.head(5)
Out[182]:
0 X
1 C85
2 X
3 C123
4 X
Name: Cabin, dtype: object
In [192]:
merge.Cabin=merge.Cabin.apply(lambda x: x[0])
display(merge.Cabin.value_counts().count())
display(merge.Cabin.value_counts())
9
X 1014
C 94
B 65
D 46
E 41
A 22
F 21
G 5
T 1
Name: Cabin, dtype: int64
For Cabin, extract the first letter to replace the original value and mark missing values with X; this reduces the original 186 distinct values to 9
In [193]:
abs_rel_f(merge.Cabin)
Absolute Frequency | Relative Frequency(%) | |
---|---|---|
X | 1014 | 77.5 |
C | 94 | 7.2 |
B | 65 | 5.0 |
D | 46 | 3.5 |
E | 41 | 3.1 |
A | 22 | 1.7 |
F | 21 | 1.6 |
G | 5 | 0.4 |
T | 1 | 0.1 |
In [216]:
bold('**Ticket**')
merge.Ticket=merge.Ticket.apply(lambda x: x[0])
display(merge.head())
merge.Ticket.value_counts()
Ticket
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Family_size | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A | 7.2500 | X | S | 2 |
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | P | 71.2833 | C | C | 2 |
2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | S | 7.9250 | X | S | 1 |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 1 | 53.1000 | C | S | 2 |
4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 3 | 8.0500 | X | S | 1 |
Out[216]:
3 429
2 278
1 210
S 98
P 98
C 77
A 42
W 19
7 13
F 13
4 11
6 9
L 5
5 3
9 2
8 2
Name: Ticket, dtype: int64
Analysis: the first characters of Ticket turn out to be a mix of letters and digits, and even after regrouping there are still many distinct values, so this treatment is not good enough; in the end I decided to drop the column
In [196]:
bold(u'**Combine SibSp & Parch**')
Combine SibSp & Parch
In [206]:
merge['Family_size']=merge.SibSp+merge.Parch+1
In [208]:
merge.head()
Out[208]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Family_size | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | X | S | 2 |
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | C | 2 |
2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | X | S | 1 |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C | S | 2 |
4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | X | S | 1 |
5 Missing-Value Handling
Filling strategies:
Categorical variables: fill with the mode (when few values are missing)
Numerical variables:
few missing: fill with the mean (normal distribution) or the median (non-normal distribution)
many missing: build a model to predict the missing values, or group the data and fill with each group's median (a hedged sketch of the model-based option follows this list)
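The model-based option is not what this notebook ends up using (Age is filled with group medians below), but a hedged sketch of it could look like the following; the choice of predictor columns here is my own assumption:
# Hedged sketch: predict missing Age values from a few numeric features
# with a regressor, instead of filling with a median.
from sklearn.ensemble import RandomForestRegressor
age_df=merge[['Age','Pclass','SibSp','Parch','Fare']].fillna({'Fare': merge.Fare.median()})
known=age_df[age_df.Age.notnull()]
unknown=age_df[age_df.Age.isnull()]
reg=RandomForestRegressor(n_estimators=100,random_state=33)
reg.fit(known.drop('Age',axis=1),known.Age)
# merge.loc[merge.Age.isnull(),'Age']=reg.predict(unknown.drop('Age',axis=1)) # uncomment to apply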
In [219]:
#Check for missing values:
merge.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
PassengerId 1309 non-null int64
Survived 891 non-null float64
Pclass 1309 non-null int64
Name 1309 non-null object
Sex 1309 non-null object
Age 1046 non-null float64
SibSp 1309 non-null int64
Parch 1309 non-null int64
Ticket 1309 non-null object
Fare 1308 non-null float64
Cabin 1309 non-null object
Embarked 1307 non-null object
Family_size 1309 non-null int64
dtypes: float64(3), int64(5), object(5)
memory usage: 133.0+ KB
Analysis: Age has many missing values; Fare is missing one value and Embarked two, which can be filled with the mode
In [252]:
bold(u'**Embarked -- categorical, few missing: fill with the mode**')
merge.loc[merge.Embarked.isnull(),'Embarked']=merge.Embarked.mode().iloc[0]
#loc slices by row/column labels; iloc slices only by integer positions (a tiny illustration follows below)
#merge.Embarked.fillna(value='S',inplace=True)
merge.Embarked.count()
Embarked -- categorical, few missing: fill with the mode
Out[252]:
1309
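A tiny self-contained illustration of the loc/iloc comment above:
# loc is label-based, iloc is position-based
demo=pd.DataFrame({'Age':[22,38]},index=['Braund','Cumings'])
display(demo.loc['Braund','Age']) # by label -> 22
display(demo.iloc[0,0]) # by position -> 22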
In [253]:
bold(u'**Fare -- numerical, non-normal distribution: fill with the median**')
merge.Fare.fillna(merge.Fare.median(),inplace=True)
merge.Fare.count()
Fare -- numerical, non-normal distribution: fill with the median
Out[253]:
1309
In [256]:
merge.Age.isnull().sum()
Out[256]:
263
Age: a numerical variable with many missing values; simply filling with the median or the mean could introduce considerable bias
In [274]:
s=merge[['Pclass','Age']].groupby(['Pclass'])
s.median()
Out[274]:
Age | |
---|---|
Pclass | |
1 | 39.0 |
2 | 29.0 |
3 | 24.0 |
In [275]:
merge.Age=s.transform(lambda x: x.fillna(x.median()))
Group Age by Pclass and fill each group's missing values with the group median. What is the difference between transform and apply here? (see the small demo below)
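A small demo of the difference on a toy frame: transform returns a result aligned to the original rows (one value per row), while apply returns one result per group (or whatever shape the function yields).
# transform vs. apply on a toy frame
demo=pd.DataFrame({'Pclass':[1,1,3,3],'Age':[40.0,None,24.0,None]})
g=demo.groupby('Pclass')['Age']
display(g.transform('median')) # length 4, aligned to demo's rows: 40,40,24,24
display(g.apply(lambda x: x.median())) # length 2, one median per group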
6 Bivariate Analysis
Three kinds of correlation analysis:
1 numerical vs. numerical: Pearson correlation, Spearman correlation
2 numerical vs. categorical: point-biserial correlation coefficient, ANOVA
3 categorical vs. categorical: chi-square test (a sketch follows below)
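A hedged sketch of option 3, a chi-square test of independence between Sex and Survived using scipy (not part of the original analysis):
# chi-square test: are Sex and Survived independent?
from scipy.stats import chi2_contingency
contingency=pd.crosstab(merge.Sex,merge.Survived) # rows with missing Survived are dropped
chi2,p,dof,expected=chi2_contingency(contingency)
display('chi2 = %.2f, p-value = %.4f'%(chi2,p)) # a tiny p-value suggests the two are dependent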
6.1 Categorical vs. Categorical Variables
In [324]:
ax=sns.countplot('Sex',hue='Survived',data=merge)
abs_bar_labels()
plt.xlabel('Sex & Survived',fontsize=15)
Out[324]:
Text(0.5,0,'Sex & Survived')
Analysis: women's survival rate is indeed much higher than men's
In [325]:
ax=sns.countplot('Pclass',hue='Survived',data=merge)
abs_bar_labels()
plt.xlabel('Pclass & Survived',fontsize=15)
Out[325]:
Text(0.5,0,'Pclass & Survived')
Analysis: the higher the class, the higher the survival rate
In [327]:
ax=sns.countplot('Embarked',hue='Survived',data=merge)
abs_bar_labels()
plt.xlabel('Embarked & Survived',fontsize=15)
Out[327]:
Text(0.5,0,'Embarked & Survived')
Analysis: passengers who boarded at C have the highest survival rate, followed by Q, with S the lowest
In [328]:
ax=sns.countplot('Family_size',hue='Survived',data=merge)
abs_bar_labels()
plt.xlabel('Family_size & Survived',fontsize=15)
Out[328]:
Text(0.5,0,'Family_size & Survived')
Analysis: passengers traveling alone have the lowest survival rate
In [329]:
ax=sns.countplot('Cabin',hue='Survived',data=merge)
abs_bar_labels()
plt.xlabel('Cabin & Survived',fontsize=15)
Out[329]:
Text(0.5,0,'Cabin & Survived')
Analysis: group X (missing cabin) has the lowest survival rate; there is no other obvious pattern
6.2 Numerical vs. Categorical Variables
Methods: boxplot, histogram, ANOVA (a minimal sketch follows below)
Age & Fare
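These are not implemented in the notebook, so here is a minimal sketch for Age vs. Survived, assuming scipy is available: a boxplot plus a one-way ANOVA.
# boxplot of Age by Survived, plus a one-way ANOVA F-test
from scipy.stats import f_oneway
sns.boxplot(x='Survived',y='Age',data=merge[merge.Survived.notnull()])
plt.show()
age_died=merge.loc[merge.Survived==0,'Age'].dropna()
age_lived=merge.loc[merge.Survived==1,'Age'].dropna()
f_stat,p=f_oneway(age_died,age_lived)
display('F = %.2f, p-value = %.4f'%(f_stat,p))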
7 Data Transformation
1 discretize the numerical variables 2 drop useless features 3 encode the discrete variables
7.1 Binning Age & Fare
In [334]:
age_label=['infant','child','teenager','young_adult','adult','old']
split_points=[0,5,12,18,35,60,81]
merge['Age_split']=pd.cut(merge.Age,split_points,labels=age_label)
In [335]:
merge.head(10)
Out[335]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Family_size | Age_split | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A | 7.2500 | X | S | 2 | young_adult |
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | P | 71.2833 | C | C | 2 | adult |
2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | S | 7.9250 | X | S | 1 | young_adult |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 1 | 53.1000 | C | S | 2 | young_adult |
4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 3 | 8.0500 | X | S | 1 | young_adult |
5 | 6 | 0.0 | 3 | Moran, Mr. James | male | 24.0 | 0 | 0 | 3 | 8.4583 | X | Q | 1 | young_adult |
6 | 7 | 0.0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 1 | 51.8625 | E | S | 1 | adult |
7 | 8 | 0.0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 3 | 21.0750 | X | S | 5 | infant |
8 | 9 | 1.0 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 3 | 11.1333 | X | S | 3 | young_adult |
9 | 10 | 1.0 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 2 | 30.0708 | X | C | 2 | teenager |
In [340]:
fare_label=['low','medium','high','very_high']
f_split_points=[-1,30,100,300,600]
merge['Fare_split']=pd.cut(merge.Fare,f_split_points,labels=fare_label)
merge[['Fare','Fare_split']].head(10)
Out[340]:
Fare | Fare_split | |
---|---|---|
0 | 7.2500 | low |
1 | 71.2833 | medium |
2 | 7.9250 | low |
3 | 53.1000 | medium |
4 | 8.0500 | low |
5 | 8.4583 | low |
6 | 51.8625 | medium |
7 | 21.0750 | low |
8 | 11.1333 | low |
9 | 30.0708 | medium |
7.2 Drop Useless Features
For simplicity, Name and Ticket are not considered further, so drop the following features: Name & Ticket & Fare & Age & SibSp & Parch
In [341]:
merge.head()
Out[341]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Family_size | Age_split | Fare_split | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A | 7.2500 | X | S | 2 | young_adult | low |
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | P | 71.2833 | C | C | 2 | adult | medium |
2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | S | 7.9250 | X | S | 1 | young_adult | low |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 1 | 53.1000 | C | S | 2 | young_adult | medium |
4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 3 | 8.0500 | X | S | 1 | young_adult | low |
In [345]:
merge.drop(['Name','Ticket','Fare','Age','SibSp','Parch'],inplace=True,axis=1)
merge.head()
Out[345]:
PassengerId | Survived | Pclass | Sex | Cabin | Embarked | Family_size | Age_split | Fare_split | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | male | X | S | 2 | young_adult | low |
1 | 2 | 1.0 | 1 | female | C | C | 2 | adult | medium |
2 | 3 | 1.0 | 3 | female | X | S | 1 | young_adult | low |
3 | 4 | 1.0 | 1 | female | C | S | 2 | young_adult | medium |
4 | 5 | 0.0 | 3 | male | X | S | 1 | young_adult | low |
7.3 Convert Data Types and Encode
In [348]:
merge.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
PassengerId 1309 non-null int64
Survived 891 non-null float64
Pclass 1309 non-null int64
Sex 1309 non-null object
Cabin 1309 non-null object
Embarked 1309 non-null object
Family_size 1309 non-null int64
Age_split 1309 non-null category
Fare_split 1292 non-null category
dtypes: category(2), float64(1), int64(3), object(3)
memory usage: 74.6+ KB
The later encoding step needs every categorical variable to have the category dtype, so convert the object columns to category here
In [352]:
merge.loc[:,['Pclass','Sex','Cabin','Embarked']]=merge.loc[:,['Pclass','Sex','Cabin','Embarked']].astype('category')
merge.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
PassengerId 1309 non-null int64
Survived 891 non-null float64
Pclass 1309 non-null category
Sex 1309 non-null category
Cabin 1309 non-null category
Embarked 1309 non-null category
Family_size 1309 non-null int64
Age_split 1309 non-null category
Fare_split 1292 non-null category
dtypes: category(6), float64(1), int64(2)
memory usage: 39.5 KB
In [356]:
# Family_size is still int: I forgot to discretize it earlier
merge.Family_size.value_counts()
Out[356]:
1 790
2 235
3 159
4 43
6 25
5 22
7 16
11 11
8 8
dtype: int64
In [358]:
family_label=['sigle','small','medium','high']
family_split_points=[0,1,4,8,12]
merge['Family_scale']=pd.cut(merge.Family_size,family_split_points,labels=family_label)
merge[['Family_size','Family_scale']].head(10)
Out[358]:
Family_size | Family_scale | |
---|---|---|
0 | 2 | small |
1 | 2 | small |
2 | 1 | sigle |
3 | 2 | small |
4 | 1 | sigle |
5 | 1 | sigle |
6 | 1 | sigle |
7 | 5 | medium |
8 | 3 | small |
9 | 2 | small |
In [364]:
merge.drop(['Family_size'],inplace=True,axis=1)
merge.loc[:,['Family_scale']]=merge.loc[:,['Family_scale']].astype('category')
merge.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
PassengerId 1309 non-null int64
Survived 891 non-null float64
Pclass 1309 non-null category
Sex 1309 non-null category
Cabin 1309 non-null category
Embarked 1309 non-null category
Age_split 1309 non-null category
Fare_split 1292 non-null category
Family_scale 1309 non-null category
dtypes: category(7), float64(1), int64(1)
memory usage: 30.7 KB
In [372]:
merge.Fare_split.value_counts(dropna=False)
Out[372]:
low 949
medium 259
high 80
NaN 17
very_high 4
Name: Fare_split, dtype: int64
Problem: Fare_split is missing 17 values. Why? Going back through the code I found:
when binning Fare, the first breakpoint started at 0, so fares of exactly 0 fell outside every bin; in other words, the missing entries are exactly the rows with Fare == 0 (a tiny demonstration follows below)
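A tiny demonstration of that boundary behavior: pd.cut builds right-closed intervals, so with breakpoints starting at 0 the first bin (0, 30] excludes a fare of exactly 0. Starting below 0 (as the Fare cell above now does with -1) or passing include_lowest=True both avoid this.
demo=pd.Series([0.0,7.25])
display(pd.cut(demo,[0,30])) # 0.0 -> NaN, since (0, 30] excludes 0
display(pd.cut(demo,[0,30],include_lowest=True)) # 0.0 now falls into the first bin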
In [377]:
# Fill Fare_split: use 'low' for the missing entries (the fares of 0)
merge.Fare_split=merge.Fare_split.fillna('low')
merge.Fare_split.value_counts(dropna=False)
Out[377]:
low 966
medium 259
high 80
very_high 4
Name: Fare_split, dtype: int64
In [378]:
merge.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
PassengerId 1309 non-null int64
Survived 891 non-null float64
Pclass 1309 non-null category
Sex 1309 non-null category
Cabin 1309 non-null category
Embarked 1309 non-null category
Age_split 1309 non-null category
Fare_split 1309 non-null category
Family_scale 1309 non-null category
dtypes: category(7), float64(1), int64(1)
memory usage: 30.7 KB
Analysis: those small earlier mistakes cost me several detours: forgetting to bin Family_size, and choosing the wrong starting breakpoint when binning Fare so that 0 fell outside the bins.
At this point every feature (PassengerId and Survived are not features) has the category dtype, so dummy encoding can proceed
7.4 Encoding
This is the last step of data preparation; after it we can build models and start training and predicting
In [379]:
#pandas get_dummies by default encodes only the object and category columns, so the int and float columns are left untouched
merge=pd.get_dummies(merge)
In [380]:
merge.head()
Out[380]:
PassengerId | Survived | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | Cabin_A | Cabin_B | Cabin_C | ... | Age_split_adult | Age_split_old | Fare_split_low | Fare_split_medium | Fare_split_high | Fare_split_very_high | Family_scale_sigle | Family_scale_small | Family_scale_medium | Family_scale_high | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 2 | 1.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 3 | 1.0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | 4 | 1.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 5 | 0.0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
5 rows × 33 columns
8 Model Building and Evaluation
In [403]:
# First split the merged data back into train and test, then build the feature and label sets
train_data=merge.loc[0:890,:]
test_data=merge.loc[891:,:]
display(train_data.shape)
display(test_data.shape)
(891, 33)
(418, 33)
In [486]:
# Extract the label and feature sets from the training data
bold(u'**Training labels and features**')
y_train=train_data.loc[:,['Survived']]
display(y_train.shape)
y_train=y_train.values.ravel() # flatten the 2d array (891, 1) into a 1d array (891,), otherwise training complains later
x_train=train_data.drop(['PassengerId','Survived'],axis=1)
display(y_train.shape)
display(x_train.shape)
Training labels and features
(891, 1)
(891L,)
(891, 31)
In [487]:
bold(u'**Test set**')
x_test=test_data.drop(['Survived','PassengerId'],axis=1)
display(x_test.shape)
Test set
(418, 31)
8.1 Initial Model Selection and Training Evaluation
Candidate models: LogisticRegression, KNN, GaussianNB, SVC, DecisionTree, RandomForest, GradientBoostingClassifier, AdaBoostClassifier
In [630]:
seed=33 # fix the random seed so results stay the same across runs
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier()
from sklearn.naive_bayes import GaussianNB
gnb=GaussianNB()
from sklearn.svm import SVC
svc=SVC(gamma='auto')
from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier(random_state = seed)
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(random_state = seed)
from sklearn.ensemble import GradientBoostingClassifier
gbc=GradientBoostingClassifier(random_state = seed)
from sklearn.ensemble import AdaBoostClassifier
abc= AdaBoostClassifier(random_state = seed)
In [497]:
# Define a function that computes the training accuracy
def train_accu(model):
    model.fit(x_train,y_train)
    train_accuracy=model.score(x_train,y_train)
    train_accuracy=round(train_accuracy*100,2)
    return train_accuracy
In [514]:
Train_accu=pd.DataFrame({'train_accuracy(%)':[train_accu(lr),train_accu(knn),train_accu(gnb),train_accu(svc),train_accu(dt),
train_accu(rf),train_accu(gbc),train_accu(abc)]})
Train_accu.index=['lr','knn','gnb','svc','dt','rf','gbc','abc']
display(Train_accu)
train_accuracy(%) | |
---|---|
lr | 82.04 |
knn | 85.07 |
gnb | 73.18 |
svc | 80.36 |
dt | 88.22 |
rf | 87.88 |
gbc | 86.08 |
abc | 81.37 |
In [517]:
sortedTrain_accu=Train_accu.sort_values(by='train_accuracy(%)',ascending=False)
bold(u'**Models ranked by training score**')
display(sortedTrain_accu)
Models ranked by training score
train_accuracy(%) | |
---|---|
dt | 88.22 |
rf | 87.88 |
gbc | 86.08 |
knn | 85.07 |
lr | 82.04 |
abc | 81.37 |
svc | 80.36 |
gnb | 73.18 |
Analysis: the tree-based models (decision tree, random forest, gradient boosting) score highest, and Gaussian naive Bayes lowest.
But this says nothing about performance on the test set: training scores do not reveal generalization ability, which has to be assessed with cross-validation.
8.2 Cross-Validation Evaluation
In [528]:
def cross_val_score(model):
    # convenience wrapper; note it shadows sklearn's function of the same name
    from sklearn.model_selection import cross_val_score # sklearn.cross_validation was removed in sklearn 0.20
    score=cross_val_score(model,x_train,y_train,cv=10,scoring='accuracy').mean()
    score=round(score*100,2)
    return score
In [530]:
Cross_accu=pd.DataFrame({'Cross_val accuracy(%)':[cross_val_score(lr),cross_val_score(knn),cross_val_score(gnb),cross_val_score(svc),
cross_val_score(dt),cross_val_score(rf),cross_val_score(gbc),cross_val_score(abc)]})
Cross_accu.index=['lr','knn','gnb','svc','dt','rf','gbc','abc']
display(Cross_accu)
Cross_val accuracy(%) | |
---|---|
lr | 81.26 |
knn | 80.37 |
gnb | 73.08 |
svc | 80.13 |
dt | 80.60 |
rf | 82.17 |
gbc | 82.50 |
abc | 81.04 |
In [531]:
sortedCross_accu=Cross_accu.sort_values(by='Cross_val accuracy(%)',ascending=False)
bold(u'**Models ranked by cross-validation score**')
display(sortedCross_accu)
Models ranked by cross-validation score
Cross_val accuracy(%) | |
---|---|
gbc | 82.50 |
rf | 82.17 |
lr | 81.26 |
abc | 81.04 |
dt | 80.60 |
knn | 80.37 |
svc | 80.13 |
gnb | 73.08 |
Analysis: cross-validation scores reflect generalization ability, which is what model evaluation should measure. Gradient boosting, random forest, and logistic regression now rank at the top, a different order from the training scores, while Gaussian naive Bayes and SVC stay at the bottom.
But this ranking still is not the most reasonable one: the models are all using their default parameters. The fairest comparison is cross-validation with each model's tuned optimal parameters.
8.3 Hyperparameter Tuning
In [623]:
from sklearn.model_selection import GridSearchCV
In [663]:
# models to tune: 'lr','knn','gnb','svc','dt','rf','gbc','abc'
lr_param={'penalty':['l1','l2'],'C':np.logspace(0,4,10)}
knn_param={'n_neighbors':[3, 4, 5, 6, 7, 8],
'leaf_size':[1, 2, 3, 5, 6, 7],
'weights':['uniform', 'distance'],
'algorithm':['auto', 'ball_tree','kd_tree','brute']}
# gnb: GaussianNB has no hyperparameters to tune
svc_param={'C':[1,2,3,4,5,6,7,8,9,10],
'kernel': ['linear','rbf'],
'gamma': [0.5, 0.2, 0.1, 0.001, 0.0001]}
dt_param={'max_features': ['auto', 'sqrt', 'log2'],
'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
'min_samples_leaf':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
'random_state':[seed]}
rf_param={'criterion':['gini','entropy'],
'n_estimators':[10, 15, 20, 25, 30],
'min_samples_leaf':[1, 2, 3],
'min_samples_split':[3, 4, 5, 6, 7],
'max_features':['sqrt', 'auto', 'log2'],
'random_state':[seed]}
gbc_param={'learning_rate': [0.01, 0.02, 0.05, 0.1],
'max_depth': [4, 6, 8],
'max_features': [1.0, 0.3, 0.1],
'min_samples_split': [ 2, 3, 4],
'random_state':[seed]}
abc_param={'n_estimators':[ 5, 10, 15, 20, 25, 40, 50, 60, 80, 100, 130, 160, 200, 250, 300],
'learning_rate':[0.0001, 0.001, 0.01, 0.1, 0.2, 0.3,1.5],
'random_state':[seed]}
In [632]:
def grid(model,params):
    grid=GridSearchCV(model,params,cv=10,scoring='accuracy')
    grid.fit(x_train,y_train)
    return round(grid.best_score_*100,2),grid.best_params_
In [606]:
%%time
lr_best_score,lr_best_param=grid(lr,lr_param)
display(lr_best_param)
display(lr_best_score)
{'C': 2.7825594022071245, 'penalty': 'l1'}
81.59
Wall time: 1.56 s
In [607]:
%%time
knn_best_score,knn_best_param=grid(knn,knn_param)
display(knn_best_param)
display(knn_best_score)
{'algorithm': 'auto', 'leaf_size': 2, 'n_neighbors': 6, 'weights': 'distance'}
82.15
Wall time: 1min 49s
In [608]:
%%time
svc_best_score,svc_best_param=grid(svc,svc_param)
display(svc_best_param)
display(svc_best_score)
{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
82.6
Wall time: 31.9 s
In [609]:
%%time
dt_best_score,dt_best_param=grid(dt,dt_param)
display(dt_best_param)
display(dt_best_score)
{'max_features': 'auto',
'min_samples_leaf': 2,
'min_samples_split': 9,
'random_state': 33}
81.93
Wall time: 9.93 s
In [664]:
%%time
rf_best_score,rf_best_param=grid(rf,rf_param)
display(rf_best_param)
display(rf_best_score)
{'criterion': 'entropy',
'max_features': 'sqrt',
'min_samples_leaf': 1,
'min_samples_split': 7,
'n_estimators': 30,
'random_state': 33}
83.5
Wall time: 2min 52s
In [628]:
%%time
abc_best_score,abc_best_param=grid(abc,abc_param)
display(abc_best_param)
display(abc_best_score)
{'learning_rate': 0.3, 'n_estimators': 60, 'random_state': 33}
81.37
Wall time: 2min 59s
In [633]:
%%time
gbc_best_score,gbc_best_param=grid(gbc,gbc_param)
display(gbc_best_param)
display(gbc_best_score)
{'learning_rate': 0.02,
'max_depth': 4,
'max_features': 1.0,
'min_samples_split': 3,
'random_state': 33}
83.28
Wall time: 3min
In [665]:
tunned_acc=pd.DataFrame({u'Tuned score (%)':[lr_best_score,knn_best_score,svc_best_score,
                                             dt_best_score,rf_best_score,gbc_best_score,abc_best_score,]})
tunned_acc.index=['lr','knn','svc','dt','rf','gbc','abc']
sorted_tunned_accu=tunned_acc.sort_values(by=u'Tuned score (%)',ascending=False)
display(sorted_tunned_accu)
Tuned score (%) | |
---|---|
rf | 83.50 |
gbc | 83.28 |
svc | 82.60 |
knn | 82.15 |
dt | 81.93 |
lr | 81.59 |
abc | 81.37 |
Analysis: the tree-based models take noticeably longer to tune (partly because they have more parameter combinations). After tuning, random forest (rf) performs best, followed by gradient boosting (gbc) and the support vector machine (svc)
8.4 Retrain and Validate with the Tuned Parameters
In [666]:
lr=LogisticRegression(**lr_best_param)
knn=KNeighborsClassifier(**knn_best_param)
svc=SVC(**svc_best_param)
dt=DecisionTreeClassifier(**dt_best_param)
rf=RandomForestClassifier(**rf_best_param)
gbc=GradientBoostingClassifier(**gbc_best_param)
abc= AdaBoostClassifier(**abc_best_param)
models={'LR':lr,'KNN':knn,'SVC':svc,'DT':dt,'RF':rf,'GBC':gbc,'ABC':abc}
from sklearn.model_selection import cross_val_score # re-import sklearn's function (the wrapper above shadowed the name)
for (keys,items) in models.items():
    score=cross_val_score(items,x_train,y_train,cv=10,scoring='accuracy')*100
    print '%0.4f(±%0.4f) [%s]' % (score.mean(),score.std(),keys)
82.1643(±4.0666) [KNN]
81.3754(±2.2015) [ABC]
83.2842(±4.2737) [GBC]
82.6038(±2.9810) [SVC]
83.5128(±3.4590) [RF]
81.5976(±2.9117) [LR]
81.9296(±2.4371) [DT]
Analysis: retraining with the optimal parameters and cross-validating again gives the same scores that grid search reported
9 Select Models and Predict
In [688]:
# Best models: RF & GBC
gbc=GradientBoostingClassifier(**gbc_best_param)
gbc.fit(x_train,y_train)
submission_gbc=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':gbc.predict(x_test).astype(int)})
submission_gbc.to_csv('submission_gbc.csv',index=False) # remember the .csv extension and the int conversion for Kaggle
rf=RandomForestClassifier(**rf_best_param)
rf.fit(x_train,y_train)
submission_rf=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':rf.predict(x_test).astype(int)})
submission_rf.to_csv('submission_rf.csv',index=False)
In [693]:
svc=SVC(**svc_best_param)
svc.fit(x_train,y_train)
submission_svc=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':svc.predict(x_test).astype(int)})
submission_svc.to_csv('submission_svc.csv',index=False)
knn=KNeighborsClassifier(**knn_best_param)
knn.fit(x_train,y_train)
submission_knn=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':knn.predict(x_test).astype(int)})
submission_knn.to_csv('submission_knn.csv',index=False)
dt=DecisionTreeClassifier(**dt_best_param)
dt.fit(x_train,y_train)
submission_dt=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':dt.predict(x_test).astype(int)})
submission_dt.to_csv('submission_dt.csv',index=False)
Results after submission:
RF:score 0.75119-rank 8361
GBC:score 0.76555-rank 7317
SVC:score 0.77990-rank 4824
KNN:score 0.73684
DT: score 0.7320
After submitting I found that SVC, which had only ranked third locally, scored the highest.
So a cross-validation score only indicates general generalization ability; it can differ from the true test-set score.
That completes the analysis. The feature handling was still fairly crude: the Name and Ticket features were simply dropped, and Age was filled with group medians rather than model predictions, all of which affects model accuracy (a sketch of the Name idea follows below).
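For example, one common way to keep Name (a hypothetical sketch, not part of this notebook) is to extract the title and lump the rare ones together:
# Hedged sketch: derive a Title feature from Name instead of dropping it
titles=train.Name.str.extract(r' ([A-Za-z]+)\.',expand=False)
rare=titles.value_counts()[titles.value_counts()<10].index
train['Title']=titles.replace(rare,'Rare')
display(train.Title.value_counts())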
If you are still hesitating, leafing through other people's articles and notes and feeling you don't know where to start, go type out the code yourself; you will learn far more from doing than from reading.