Lab 5: Survival Analysis of Titanic Passengers
The data come from the Kaggle website, which provides two files: one for training and one for test submission (unlabeled). This lab therefore uses the first (training) file for all the analysis below.
https://www.kaggle.com
A First Look at the Data
# Read the training data into pandas
import pandas as pd
data = pd.read_csv("data/titan_train.csv")
data.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
- PassengerId: passenger ID
- Survived: whether the passenger survived (1 = survived, 0 = did not)
- Pclass: cabin class (1st, 2nd or 3rd)
- Name: passenger name (contains titles such as Mr, Mrs, Miss, Master)
- Sex: sex
- Age: age
- SibSp: number of siblings/spouses aboard
- Parch: number of parents/children aboard
- Ticket: ticket number
- Fare: fare paid
- Cabin: cabin number
- Embarked: port of embarkation (S, C, Q)
# Basic information about the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
The output shows:
- 891 passengers in total
- 12 features (12 data columns)
- Age, Cabin and Embarked have missing values (counted precisely in the check below)
- The data contain both numeric and string types
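A quick way to confirm those missing-value counts (a minimal check of my own, using the same `data` frame):
# Count missing entries per column: Age, Cabin and Embarked should be the non-zero ones
data.isnull().sum()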
# Summary statistics of the numeric features
data.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
data.loc[(data["Age"]== 0.42) | (data["Age"] == 80), :]
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
630 | 631 | 1 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80.00 | 0 | 0 | 27042 | 30.0000 | A23 | S |
803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | male | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | C |
The output shows:
- Age has missing values (its count is 714, not 891)
- The statistics for PassengerId are meaningless
- The mean of Survived is 0.3838, i.e. 38.4% of the passengers survived
- The mean age is 29.7; the youngest passenger was 0.42 years old, the oldest 80 (and both survived)
- 2nd and 3rd class dominate; at least half of the passengers were in 3rd class (per-class survival rates are checked right below)
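To quantify the class effect hinted at here, a quick check of my own using pandas groupby:
# Mean of the 0/1 Survived column per class = survival rate per class
data.groupby("Pclass")["Survived"].mean()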
Preliminary Data Analysis
Each passenger has 12 features. Which of them are most useful for analysis and prediction, and how should we examine them?
Distribution of Passenger Features
Let's look at how each feature relates to survival.
# Import the plotting library and set the relevant parameters
%matplotlib inline
import matplotlib.pyplot as plt
# Font settings
plt.rcParams["font.sans-serif"] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False
# Figure size
fig = plt.figure(figsize = (12,8))
fig.set(alpha = 0.2) # transparency
# Subplot grid (2 rows x 3 columns); first panel: survival counts as a bar chart
plt.subplot2grid((2,3),(0,0))
data["Survived"].value_counts().plot(kind="bar")
plt.title("Survival counts")
plt.ylabel("Passengers")
plt.xlabel("0 = died, 1 = survived")
# Bar chart of the cabin class distribution
plt.subplot2grid((2,3),(0,1))
data["Pclass"].value_counts().plot(kind="bar")
plt.title("Cabin class distribution")
plt.ylabel("Passengers")
# Scatter plot of age against survival
plt.subplot2grid((2,3),(0,2))
plt.scatter(data["Survived"],data["Age"])
plt.title("Age by survival (1 = survived)")
plt.ylabel("Age")
plt.grid(True, which="major", axis="y")
# Age distribution within each cabin class
plt.subplot2grid((2,3),(1,0), colspan=2)
data.loc[data["Pclass"]==1, "Age"].plot(kind="kde")
data.loc[data["Pclass"]==2, "Age"].plot(kind="kde")
data.loc[data["Pclass"]==3, "Age"].plot(kind="kde")
plt.title("Age distribution per cabin class")
plt.xlabel("Age")
plt.ylabel("Density")
plt.legend(["1st class", "2nd class", "3rd class"])
# Passengers per embarkation port
plt.subplot2grid((2,3), (1,2))
data["Embarked"].value_counts().plot(kind="bar")
# print(data["Embarked"].value_counts())
plt.title("Passengers per port")
plt.ylabel("Passengers")
plt.tight_layout() # adjust spacing between subplots; called last so all panels are laid out
What the plots show:
- Most passengers did not survive
- 3rd class holds more than half of the passengers; 2nd class is the smallest group
- Passengers over 60 had a low chance of survival
- 2nd and 3rd class are mostly passengers aged 20-30, while 1st class skews toward 40 and above
- Most passengers boarded at S. S: Southampton, England; C: Cherbourg-Octeville, France; Q: Queenstown, Ireland. The great majority boarded at the port of origin.
Some ideas (the port/class question is probed right after this list):
- Cabin class likely correlates with wealth and social status; does the chance of survival differ by class?
- Do age and sex affect survival? (In the movie the first officer says: women and children first)
- Is the embarkation port related to survival? Perhaps the port reflects a passenger's background, status, and nationality
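One way to probe the port idea, a cross table of port against class (my own sketch, not in the original notebook):
# How cabin classes are distributed across embarkation ports
pd.crosstab(data["Embarked"], data["Pclass"])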
# My own check: how many first-class passengers survived
a = data.loc[data["Pclass"]==1, :]
a.info()
b = a.loc[a["Survived"]==1, :]
b.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 216 entries, 1 to 889
Data columns (total 12 columns):
PassengerId 216 non-null int64
Survived 216 non-null int64
Pclass 216 non-null int64
Name 216 non-null object
Sex 216 non-null object
Age 186 non-null float64
SibSp 216 non-null int64
Parch 216 non-null int64
Ticket 216 non-null object
Fare 216 non-null float64
Cabin 176 non-null object
Embarked 214 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 21.9+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 136 entries, 1 to 889
Data columns (total 12 columns):
PassengerId 136 non-null int64
Survived 136 non-null int64
Pclass 136 non-null int64
Name 136 non-null object
Sex 136 non-null object
Age 122 non-null float64
SibSp 136 non-null int64
Parch 136 non-null int64
Ticket 136 non-null object
Fare 136 non-null float64
Cabin 117 non-null object
Embarked 134 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 13.8+ KB
# Relationship between cabin class and survival
# Import the plotting library and set the relevant parameters
%matplotlib inline
import matplotlib.pyplot as plt
# Font settings
plt.rcParams["font.sans-serif"] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False
# Count survivors and victims in each class
s_0 = data.loc[data["Survived"]==0, "Pclass"].value_counts()
s_1 = data.loc[data["Survived"]==1, "Pclass"].value_counts()
# Stacked bar chart; df.plot creates its own figure, so the size is passed here
df = pd.DataFrame({"Survived":s_1, "Died":s_0})
df.plot(kind="bar", stacked=True, figsize=(6,4))
plt.title("Survival by cabin class")
plt.ylabel("Passengers")
First class survived at a very high rate and third class at a very low one. Can wealth and status really decide life and death?
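The same comparison as rates rather than raw counts (a quick check of my own):
# Share of survivors within each class (rows sum to 1)
pd.crosstab(data["Pclass"], data["Survived"], normalize="index")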
data.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# In-class exercise: the relationship between a passenger's sex and survival
# Bar chart
# Import the plotting library and set the relevant parameters
%matplotlib inline
import matplotlib.pyplot as plt
# Font settings
plt.rcParams["font.sans-serif"] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False
# Count survivors and victims for each sex
s_0 = data.loc[data["Survived"]==0, "Sex"].value_counts()
s_1 = data.loc[data["Survived"]==1, "Sex"].value_counts()
# Grouped bar chart (stacked=False draws the two bars side by side)
df = pd.DataFrame({"Survived":s_1, "Died":s_0})
df.plot(kind="bar", stacked=False, figsize=(6,4))
plt.title("Survival by sex")
plt.ylabel("Passengers")
Clearly, women survived at a far higher rate than men, a showing of the "ladies first" gentlemanly culture.
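And as a single number per group (a quick check of my own):
# Survival rate by sex
data.groupby("Sex")["Survived"].mean()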
data.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Extension: the relationship between age and survival (best shown by age band; a tidier pd.cut version follows the next cell)
# Idea: first split the ages into bands
# Import the plotting library and set the relevant parameters
%matplotlib inline
import matplotlib.pyplot as plt
# Font settings
plt.rcParams["font.sans-serif"] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False
# Figure size
fig = plt.figure(figsize = (16,4))
fig.set(alpha = 0.2) # transparency
# One panel per age band; without separate axes the four bar plots would draw over each other
ax1 = fig.add_subplot(141)
data.Survived[data.Age <= 10].value_counts().plot(kind="bar", color="green")
ax1.set_title("Age <= 10")
ax2 = fig.add_subplot(142)
data.Survived[data.Age > 10].value_counts().plot(kind="bar", color="red")
ax2.set_title("Age > 10")
ax3 = fig.add_subplot(143)
data.Survived[data.Age >= 40].value_counts().plot(kind="bar", color="blue")
ax3.set_title("Age >= 40")
ax4 = fig.add_subplot(144)
data.Survived[data.Age >= 60].value_counts().plot(kind="bar", color="yellow")
ax4.set_title("Age >= 60")
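A tidier alternative to the overlapping masks above is pd.cut; the bin edges here are my own choice, not the notebook's:
# Bin ages into non-overlapping bands, then compute the survival rate per band
bands = pd.cut(data["Age"], bins=[0, 10, 20, 40, 60, 100])
data.groupby(bands)["Survived"].mean()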
# Survival by cabin class and sex
# Import the plotting library and set the relevant parameters
%matplotlib inline
import matplotlib.pyplot as plt
# Font settings
plt.rcParams["font.sans-serif"] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False
# Figure size
fig = plt.figure(figsize = (16,4))
fig.set(alpha = 0.2) # transparency
fig.suptitle("Survival by cabin class and sex") # figure-level title (plt.title would attach to a single panel)
# Women, 1st/2nd class
ax1 = fig.add_subplot(141) # 1 row x 4 columns, panel 1
data.Survived[(data.Sex=="female") & (data.Pclass!=3)].value_counts().plot(kind="bar", color="red")
ax1.set_xticklabels(["Survived","Died"], rotation=0)
ax1.legend(["Women / high class"], loc="best")
# Women, 3rd class
ax2 = fig.add_subplot(142) # panel 2
data.Survived[(data.Sex=="female") & (data.Pclass==3)].value_counts().plot(kind="bar", color="yellow")
ax2.set_xticklabels(["Survived","Died"], rotation=0)
ax2.legend(["Women / low class"], loc="best")
# Men, 1st/2nd class
ax3 = fig.add_subplot(143) # panel 3
data.Survived[(data.Sex=="male") & (data.Pclass!=3)].value_counts().plot(kind="bar", color="green")
ax3.set_xticklabels(["Died","Survived"], rotation=0)
ax3.legend(["Men / high class"], loc="best")
# Men, 3rd class
ax4 = fig.add_subplot(144) # panel 4
data.Survived[(data.Sex=="male") & (data.Pclass==3)].value_counts().plot(kind="bar", color="blue")
ax4.set_xticklabels(["Died","Survived"], rotation=0)
ax4.legend(["Men / low class"], loc="best")
The panels support these conclusions:
- Women in the higher classes (1st and 2nd) almost all survived
- About half of the women in 3rd class survived
- Under 30% of the men in the higher classes survived
- Under 20% of the men in 3rd class survived
So "women first" still held at the moment of life and death.
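The four panels condensed into rates (a quick check of my own):
# Survival rate for every sex/class combination
data.groupby(["Sex", "Pclass"])["Survived"].mean()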
# Survival counts for children
data.Survived[data.Age < 10].value_counts()
1 38
0 24
Name: Survived, dtype: int64
# Import the plotting library and set the relevant parameters
%matplotlib inline
import matplotlib.pyplot as plt
# Font settings
plt.rcParams["font.sans-serif"] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False
# Count survivors and victims for each embarkation port
s_0 = data.Embarked[data.Survived==0].value_counts()
s_1 = data.Embarked[data.Survived==1].value_counts()
# Stacked bar chart; df.plot creates its own figure, so the size is passed here
df = pd.DataFrame({"Survived":s_1, "Died":s_0})
df.plot(kind="bar", stacked=True, figsize=(6,4))
plt.title("Survival by embarkation port")
plt.ylabel("Passengers")
The embarkation port seems to have no direct bearing on survival, although passengers who boarded at C (Cherbourg-Octeville, France) survived at a somewhat higher rate. (Are the French better at escaping? Just the teacher's joke.)
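As rates (a quick check of my own):
# Survival rate by embarkation port
data.groupby("Embarked")["Survived"].mean()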
# Does the number of siblings/spouses aboard affect survival?
g = data.groupby(["SibSp","Survived"])
df = pd.DataFrame(g.count()["PassengerId"])
df
SibSp | Survived | PassengerId
---|---|---
0 | 0 | 398
0 | 1 | 210
1 | 0 | 97
1 | 1 | 112
2 | 0 | 15
2 | 1 | 13
3 | 0 | 12
3 | 1 | 4
4 | 0 | 15
4 | 1 | 3
5 | 0 | 5
8 | 0 | 7
The more siblings aboard, the worse the odds; passengers with exactly one sibling/spouse survived at better than 50% (although the sample is small).
# Does the number of parents/children aboard affect survival?
g = data.groupby(["Parch","Survived"])
df = pd.DataFrame(g.count()["PassengerId"])
df
Parch | Survived | PassengerId
---|---|---
0 | 0 | 445
0 | 1 | 233
1 | 0 | 53
1 | 1 | 65
2 | 0 | 40
2 | 1 | 40
3 | 0 | 2
3 | 1 | 3
4 | 0 | 4
5 | 0 | 4
5 | 1 | 1
6 | 0 | 1
Passengers with more than 3 parents/children aboard, or more than 4 siblings, almost all died; don't take the whole clan on one trip. (Both count tables are converted to rates just below.)
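Turning both count tables into rates makes the cutoff easier to see (a quick check of my own):
# Survival rate by sibling/spouse count and by parent/child count
print(data.groupby("SibSp")["Survived"].mean())
print(data.groupby("Parch")["Survived"].mean())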
# The cabin numbers themselves are probably meaningless, and too many are missing
# But we can test whether merely having a cabin number matters
# Import the plotting library and set the relevant parameters
%matplotlib inline
import matplotlib.pyplot as plt
# Font settings
plt.rcParams["font.sans-serif"] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False
# Count survivors and victims with and without a recorded cabin number
s_c = data.Survived[pd.notnull(data.Cabin)].value_counts()
s_nc = data.Survived[pd.isnull(data.Cabin)].value_counts()
# Stacked bar chart; df.plot creates its own figure, so the size is passed here
df = pd.DataFrame({"Has cabin no.":s_c, "No cabin no.":s_nc}).transpose()
df.plot(kind="bar", stacked=True, figsize=(6,4))
plt.title("Survival by presence of a cabin number")
plt.ylabel("Passengers")
Passengers with a cabin number clearly survived at a higher rate. Perhaps numbered tickets belonged to the higher classes, or perhaps passengers with complete records simply had higher social standing.
# Check the cabin class of passengers who do have a cabin number, to verify the conclusion above
data.Pclass[pd.notnull(data.Cabin)].value_counts()
1 176
2 16
3 12
Name: Pclass, dtype: int64
The vast majority of passengers with a cabin number were in first class, which naturally explains their higher survival rate, and also suggests the ship's record-keeping gave first-class passengers special attention.
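As a rate (a quick check of my own; True means a cabin number is recorded):
# Survival rate with vs. without a recorded cabin number
data.groupby(data["Cabin"].notnull())["Survived"].mean()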
# The ticket field probably does not hold much information
# Check survival for tickets whose number carries a prefix
# for i in range(len(data)):
#     if data.loc[i,"Ticket"].find(" ") != -1:
#         print(data.loc[i,"Pclass"])
s = s_s = 0
for i in range(len(data.Ticket)):
    if data.Ticket[i].find(" ") != -1:  # a space separates the prefix from the number
        s += 1
        if data.Survived[i] == 1:
            s_s += 1
print("Tickets with a prefix:", s)
print("Of which survived:", s_s)
Tickets with a prefix: 226
Of which survived: 87
87 of 226 is about 38.5%, essentially the overall survival rate of 38.4%, so a ticket prefix by itself tells us little.
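A vectorized equivalent of the loop above that yields the rate directly (a sketch of my own):
# True where the ticket contains a space, i.e. carries a prefix
has_prefix = data["Ticket"].str.contains(" ")
data.groupby(has_prefix)["Survived"].mean()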
# Now the fares
data.Survived[data.Fare >= 50].value_counts()
1 109
0 52
Name: Survived, dtype: int64
About two thirds of the passengers who paid a high fare (over 50 pounds) survived (109 of 161), again showing that higher-class passengers fared better.
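Fare can also be cut into quartiles to see the gradient; the quartile split is my own choice:
# Survival rate per fare quartile
fare_bands = pd.qcut(data["Fare"], 4)
data.groupby(fare_bands)["Survived"].mean()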
# Age
print(data.Survived[data.Age <= 10].value_counts())
print(data.Survived[data.Age >= 60].value_counts())
print(data.Survived[(data.Age >= 20) & (data.Age < 30) & (data.Sex == "male")].value_counts())
1 38
0 26
Name: Survived, dtype: int64
0 19
1 7
Name: Survived, dtype: int64
0 123
1 25
Name: Survived, dtype: int64
These figures again suggest how deeply rooted the gentlemanly code was in Britain: young men died at the highest rate.
**That concludes the feature-by-feature survival analysis. What else might be worth mining? For example: the titles in Name, whether port and cabin class are correlated, and whether binning age (child, youth, middle-aged, elderly) would give better results.**
Starting from the next cell we begin the machine learning workflow.
Simple Data Preprocessing
# Read the data
import pandas as pd
data = pd.read_csv("data/titan_train.csv")
data.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Handling Missing Values
The first step of preprocessing is handling missing values, which occur in three fields: Age, Cabin and Embarked.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
# Handle the missing ages: fill with the mean age
# Passengers with a missing age overwhelmingly come from 3rd class
data.Pclass[pd.isnull(data.Age)].value_counts()
3 136
1 30
2 11
Name: Pclass, dtype: int64
# Compute the mean age
import numpy as np
aver_age = round(np.mean(data.Age), 1)
aver_age
29.7
# Fill the missing ages
data.loc[pd.isnull(data.Age), "Age"] = aver_age
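Since the missing ages are concentrated in 3rd class, a class-wise mean is arguably a better fill. A sketch of that alternative, left commented out because the notebook keeps the global mean above:
# Alternative: fill each missing age with the mean age of the passenger's class
# data["Age"] = data.groupby("Pclass")["Age"].transform(lambda s: s.fillna(s.mean()))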
# Handle Cabin: set Yes where a cabin number exists, No where it is missing
# The Yes assignment must come first; setting the missing entries to No first would leave no nulls to tell the two groups apart
data.loc[pd.notnull(data.Cabin), "Cabin"] = "Yes"
data.loc[pd.isnull(data.Cabin), "Cabin"] = "No"
data.Cabin.value_counts()
No 687
Yes 204
Name: Cabin, dtype: int64
# Handle Embarked: only two values are missing, so fill with the most frequent value, "S"
data.loc[pd.isnull(data.Embarked), "Embarked"] = "S"
# Finally print the info: no missing values remain
data.info()
data.head(10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 891 non-null object
Embarked 891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | No | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | Yes | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | No | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | Yes | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | No | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | 29.7 | 0 | 0 | 330877 | 8.4583 | No | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | Yes | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | No | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | No | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | No | C |
Feature Engineering
With the missing values handled, the data are now complete. The next step is to encode the text fields (One-Hot encoding): Embarked, Cabin and Sex, plus Pclass (a numeric field; does it need encoding?). Fields we will not process: PassengerId, Name and Ticket.
data.head(3)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | No | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | Yes | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | No | S |
# One-Hot encode the Sex field
d_sex = pd.get_dummies(data.Sex, prefix="Sex")
# One-Hot encode the Cabin field
d_cabin = pd.get_dummies(data.Cabin, prefix="Cabin")
# One-Hot encode the Embarked field
d_embarked = pd.get_dummies(data.Embarked, prefix="Embarked")
# One-Hot encode the Pclass field
# Whether to encode this one is debatable: it is numeric, and the ordering of its values may be meaningful
d_pclass = pd.get_dummies(data.Pclass, prefix="Pclass")
Next we replace the original columns in the data with the One-Hot columns just computed.
# Concatenate the new columns onto the dataset
df = pd.concat([data, d_sex, d_cabin, d_embarked, d_pclass], axis=1) # axis=1 joins column-wise
df.head(3)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Sex_female | Sex_male | Cabin_No | Cabin_Yes | Embarked_C | Embarked_Q | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | ... | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
3 rows × 22 columns
# Drop the replaced columns (Sex, Cabin, Embarked, Pclass) and the unneeded ones (PassengerId, Name, Ticket)
# inplace=True modifies the frame in place; axis=1 drops columns
df.drop(["Sex", "Cabin", "Embarked", "Pclass", "PassengerId", "Name", "Ticket"], axis=1, inplace=True)
df.head()
Survived | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Cabin_No | Cabin_Yes | Embarked_C | Embarked_Q | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 22.0 | 1 | 0 | 7.2500 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
2 | 1 | 26.0 | 0 | 0 | 7.9250 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
3 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
4 | 0 | 35.0 | 0 | 0 | 8.0500 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
# Save the data
df.to_csv("data/titan.csv", index=False)
A quick summary of the feature engineering:
- Drop features with no plausible relationship to the target (survival)
- One-Hot encode string features and unordered categorical features
What we did not do:
- Use (statistical) algorithms to pick the most useful features (important when dimensionality is high)
- Reduce dimensionality (PCA, LDA) for very high-dimensional datasets
Looking at the data, Age and Fare take values on a much larger scale than the other features, so they need normalization. Common choices are z-score standardization and min-max scaling; we use the latter here.
# Min-max scaling (maps each value into the 0-1 range)
# Formula: (X - min) / (max - min)
from sklearn.preprocessing import MinMaxScaler # from the preprocessing module
# Create the scaler
scaler = MinMaxScaler()
Age_scale = scaler.fit_transform(df.Age.values.reshape(-1,1))
Fare_scale = scaler.fit_transform(df.Fare.values.reshape(-1,1))
df["Age_scale"] = Age_scale
df["Fare_scale"] = Fare_scale
df.drop(["Age", "Fare"], axis=1, inplace=True)
df.head()
Survived | SibSp | Parch | Sex_female | Sex_male | Cabin_No | Cabin_Yes | Embarked_C | Embarked_Q | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 | Age_scale | Fare_scale | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0.271174 | 0.014151 |
1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0.472229 | 0.139136 |
2 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0.321438 | 0.015469 |
3 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0.434531 | 0.103644 |
4 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0.434531 | 0.015713 |
# Save the data again
df.to_csv("data/titan.csv", index=False)
At this point preprocessing and feature engineering are complete, and the data are ready for the machine learning steps below.
Machine Learning Modeling
- Prepare the dataset: features and target (X and y)
- Split into training and test sets
- Try several classification algorithms
- Select among them (cross-validation, etc.)
- Tune the chosen algorithm (hyperparameter selection via grid search)
- Use the fitted model for prediction
Preparing and Splitting the Dataset
# Prepare the data: X holds the features, y the target (survived or not)
import pandas as pd
df = pd.read_csv("data/titan.csv")
X = df.iloc[:,1:].values
y = df.iloc[:,0].values
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=33)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(623, 14) (623,) (268, 14) (268,)
Choosing a Machine Learning Algorithm
- k-nearest neighbors
- Logistic regression
- Decision tree
- Support vector machine
- Random forest
- Other algorithms
# KNN classifier
from sklearn.neighbors import KNeighborsClassifier
# Build the model
knn = KNeighborsClassifier(n_neighbors=7, p=1)
print(knn)
# Train
knn.fit(X_train, y_train)
# Evaluate
print("Model parameters:",knn)
print("Training accuracy:",knn.score(X_train, y_train))
print("Test accuracy:",knn.score(X_test, y_test))
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=7, p=1,
weights='uniform')
Model parameters: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=7, p=1,
weights='uniform')
Training accuracy: 0.8282504012841091
Test accuracy: 0.8283582089552238
# Logistic regression
from sklearn.linear_model import LogisticRegression
# Build the model
lr = LogisticRegression()
# Train
lr.fit(X_train, y_train)
# Evaluate
print("Model parameters:",lr)
print("Training accuracy:",lr.score(X_train, y_train))
print("Test accuracy:",lr.score(X_test, y_test))
Model parameters: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
Training accuracy: 0.8073836276083467
Test accuracy: 0.8208955223880597
# Decision tree
from sklearn.tree import DecisionTreeClassifier
# Build the model
dtc = DecisionTreeClassifier(splitter="random")
# Train
dtc.fit(X_train, y_train)
# Evaluate
print("Model parameters:",dtc)
print("Training accuracy:",dtc.score(X_train, y_train))
print("Test accuracy:",dtc.score(X_test, y_test))
Model parameters: DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='random')
Training accuracy: 0.9887640449438202
Test accuracy: 0.8059701492537313
# Support vector machine
from sklearn.svm import SVC
# Build the model
svc = SVC(gamma=0.1, C=15)
# Train
svc.fit(X_train, y_train)
# Evaluate
print("Model parameters:",svc)
print("Training accuracy:",svc.score(X_train, y_train))
print("Test accuracy:",svc.score(X_test, y_test))
Model parameters: SVC(C=15, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Training accuracy: 0.8378812199036918
Test accuracy: 0.832089552238806
# Random forest (an ensemble of decision trees)
from sklearn.ensemble import RandomForestClassifier
# Build the model
rfc = RandomForestClassifier(n_estimators=9)
# Train
rfc.fit(X_train, y_train)
# Evaluate
print("Model parameters:",rfc)
print("Training accuracy:",rfc.score(X_train, y_train))
print("Test accuracy:",rfc.score(X_test, y_test))
Model parameters: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=9, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
Training accuracy: 0.9662921348314607
Test accuracy: 0.8059701492537313
Cross-Validation
Also called k-fold validation: split the data into k folds, hold one fold out as the test set and train on the remaining k-1, repeat k times, and take the mean score as the model's evaluation.
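A sketch of what cross_val_score does internally, written with KFold (it assumes X and y from the cells above; note that for classifiers cross_val_score actually uses stratified folds, so the numbers will differ slightly):
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
# Manual 5-fold loop: fit on 4 folds, score on the held-out fold, repeat
kf = KFold(n_splits=5, shuffle=True, random_state=33)
scores = []
for train_idx, test_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=7, p=1)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print("Mean of the fold scores:", np.mean(scores))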
# KNN
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
knn = KNeighborsClassifier(n_neighbors=7, p=1)
scores = cross_val_score(knn, X, y, cv=5) # 5-fold cross-validation
print(scores)
print("交叉验证的平均分:", np.mean(scores))
[0.77094972 0.77094972 0.8258427 0.80337079 0.8079096 ]
Mean cross-validation score: 0.7958045058013248
# Decision tree
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
dtc = DecisionTreeClassifier(splitter="random")
scores = cross_val_score(dtc, X, y, cv=10) # 10-fold cross-validation
print(scores)
print("交叉验证的平均分:", np.mean(scores))
[0.71111111 0.71111111 0.70786517 0.80898876 0.84269663 0.78651685
0.80898876 0.75280899 0.80898876 0.73863636]
Mean cross-validation score: 0.7677712518442855
# Logistic regression
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
lr = LogisticRegression(penalty='l1', solver='liblinear') # liblinear supports the l1 penalty
scores = cross_val_score(lr, X, y, cv=10) # 10-fold cross-validation
print(scores)
print("交叉验证的平均分:", np.mean(scores))
[0.83333333 0.81111111 0.7752809 0.83146067 0.82022472 0.76404494
0.78651685 0.79775281 0.83146067 0.84090909]
Mean cross-validation score: 0.8092095108387243
# Support vector machine
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
svc = SVC(gamma=0.1, C=15, kernel="poly", degree=2)
scores = cross_val_score(svc, X, y, cv=10) # 10-fold cross-validation
print(scores)
print("交叉验证的平均分:", np.mean(scores))
print(svc)
[0.83333333 0.8 0.78651685 0.86516854 0.83146067 0.82022472
0.82022472 0.75280899 0.82022472 0.84090909]
Mean cross-validation score: 0.817087163772557
SVC(C=15, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=2, gamma=0.1, kernel='poly',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Hyperparameter Tuning with Grid Search
Finally we pick the best algorithm together with its best parameter combination.
Reference: https://blog.csdn.net/ling_mochen/article/details/80219850
In-class Exercise
- Fix the algorithm as a support vector machine:
- The kernel may be chosen among rbf, linear, poly and sigmoid (the kernel parameter)
- Main parameters to tune: C, gamma, kernel, degree
- Use grid search to determine the optimal parameters
- Build a model with the optimal parameters and report the mean k-fold score
# Grid search with cross-validation (to find the best-suited parameters)
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV # grid search with cross-validation
params = {"C":[1,2,3,4,5,10,11,12,13,14,15,20,30,40,50,100], "gamma":[0.0001, 0.001,0.005, 0.01, 0.05, 0.1], "kernel":["rbf", "linear","poly","sigmoid"],"degree":[2,3,4,5]}
svc = SVC() # base estimator whose parameters the grid will vary
grid = GridSearchCV(svc, params, cv=5, scoring="accuracy") # cv is usually at least 3
grid.fit(X, y)
print(grid.best_score_)
print(grid.best_params_)
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
svc = SVC(gamma=0.1, C=13, kernel="poly", degree=2)
scores = cross_val_score(svc, X, y, cv=5) # 5-fold cross-validation
print(scores)
print("交叉验证的平均分:", np.mean(scores))
print(svc)