Machine Learning, Day 4: Decision Trees in Practice, Validation, and Tuning; Comparing Several Regressors

1. Applying a Decision Tree: Kaggle Titanic Survival Prediction

Dataset features:

PassengerId: a sequential ID that uniquely identifies each passenger. Unrelated to survival; not used.

Survived: 1 = survived, 0 = died. This is the label.

Pclass: cabin class, an important feature. Passengers in higher classes could reach the deck faster and were more likely to be rescued.

Name: passenger name, unrelated to survival; dropped.

Sex: passenger sex. Women and children were put on the lifeboats first, so this is an important feature.

Age: passenger age; children were given priority.

SibSp: number of siblings and spouses aboard.

Parch: number of parents and children aboard.

Ticket: ticket number; not used.

Fare: the ticket fare (price paid).

Cabin: cabin number. This does relate to survival (passengers in cabins that flooded first had lower survival rates), but it has a lot of missing values, so we drop it.

Embarked: port of embarkation; needs to be converted from text to numeric values.

Some of these features are useless, so we drop them to reduce computation.

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

Load the data and take a look

data = pd.read_csv('data.csv')
data
   PassengerId  Survived  Pclass  Name                                              Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                           male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                            female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S

Check the descriptive statistics of the numeric features: mainly the mean, the min/max, and the standard deviation (a std that is large relative to the mean, as with Fare below, signals a wide spread or outliers).

data.describe()
       PassengerId  Survived    Pclass      Age         SibSp       Parch       Fare
count  891.000000   891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean   446.000000   0.383838    2.308642    29.699118   0.523008    0.381594    32.204208
std    257.353842   0.486592    0.836071    14.526497   1.102743    0.806057    49.693429
min    1.000000     0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%    223.500000   0.000000    2.000000    20.125000   0.000000    0.000000    7.910400
50%    446.000000   0.000000    3.000000    28.000000   0.000000    0.000000    14.454200
75%    668.500000   1.000000    3.000000    38.000000   1.000000    0.000000    31.000000
max    891.000000   1.000000    3.000000    80.000000   8.000000    6.000000    512.329200
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

A handy library for visualizing missing data

import missingno
missingno.matrix(data)
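(missingno is a third-party package; if it is not installed, pip install missingno fetches it.)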

[Figure: missingno matrix of the raw data]
Age and Cabin have many missing values, and Embarked has a couple as well.

Clean the data

Drop the Cabin column (too many missing values):

del data['Cabin']

Look at the Age column:

plt.hist(data['Age'])

[Figure: histogram of Age]
The mean and median are close, so either works as a fill value (I use the median, which is a whole number):

data.Age.mean() # 29.69911764705882
data.Age.median() # 28.0

Fill the missing ages:

data['Age'].fillna(data['Age'].median(),inplace=True)
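A hedged alternative (not what this notebook does): impute Age with the median of each (Pclass, Sex) group, which preserves more structure than a single global median.

# Group-wise median imputation: each passenger gets the median age of
# passengers with the same class and sex.
data['Age'] = data.groupby(['Pclass', 'Sex'])['Age'].transform(
    lambda s: s.fillna(s.median()))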

Fill the missing Embarked values:

data['Embarked'].fillna(method='ffill',inplace=True)
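Forward-fill borrows the previous row's port, which is harmless here because only two values are missing. A common alternative (again an aside, not the original approach) is to fill with the most frequent port:

# Fill missing Embarked values with the mode instead of forward-filling.
data['Embarked'].fillna(data['Embarked'].mode()[0], inplace=True)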

The data is now clean:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB

missingno.matrix(data)

[Figure: missingno matrix after cleaning; no missing values remain]

Select and encode features

data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Embarked'],
dtype='object')

X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']].copy()  # .copy() avoids SettingWithCopyWarning when we encode columns below
X.head()
   Pclass  Sex     Age   SibSp  Parch  Fare     Embarked
0  3       male    22.0  1      0      7.2500   S
1  1       female  38.0  1      0      71.2833  C
2  3       female  26.0  0      0      7.9250   S
3  1       female  35.0  1      0      53.1000  S
4  3       male    35.0  0      0      8.0500   S

Encode Sex:

X['Sex'] = 1*(X['Sex']=='male')  # male -> 1, female -> 0
X.head()
   Pclass  Sex  Age   SibSp  Parch  Fare     Embarked
0  3       1    22.0  1      0      7.2500   S
1  1       0    38.0  1      0      71.2833  C
2  3       0    26.0  0      0      7.9250   S
3  1       0    35.0  1      0      53.1000  S
4  3       1    35.0  0      0      8.0500   S

Encode the embarkation port:

unique = data.Embarked.unique().tolist()
unique # ['S', 'C', 'Q']
X['Embarked']=data['Embarked'].apply(lambda x:unique.index(x))  # S -> 0, C -> 1, Q -> 2
X
     Pclass  Sex  Age   SibSp  Parch  Fare     Embarked
0    3       1    22.0  1      0      7.2500   0
1    1       0    38.0  1      0      71.2833  1
2    3       0    26.0  0      0      7.9250   0
3    1       0    35.0  1      0      53.1000  0
4    3       1    35.0  0      0      8.0500   0
...
886  2       1    27.0  0      0      13.0000  0
887  1       0    19.0  0      0      30.0000  0
888  3       0    28.0  1      2      23.4500  0
889  1       1    26.0  0      0      30.0000  1
890  3       1    32.0  0      0      7.7500   2
891 rows × 7 columns
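Integer codes impose an ordering S < C < Q. Trees split on thresholds and tolerate this, but as a sketch of the safer option for linear models, Embarked could be one-hot encoded (X_alt is a hypothetical name, not used later):

# One-hot encode Embarked so no artificial ordering is imposed.
X_alt = pd.concat(
    [X.drop(columns='Embarked'),
     pd.get_dummies(data['Embarked'], prefix='Embarked')],
    axis=1)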

Split the dataset

from sklearn.model_selection import train_test_split
y = data['Survived']
xtrain,xtest,ytrain,ytest = train_test_split(X,y,random_state =60)
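train_test_split holds out 25% of the rows by default, so xtrain has 668 rows and xtest has 223.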

Import the model and fit it

from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier(random_state=120).fit(xtrain,ytrain)

DT.score(xtest,ytest) # 0.7713004484304933
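Accuracy alone can hide how each class is treated; a quick per-class breakdown (an addition, not in the original notebook):

from sklearn.metrics import classification_report
# Precision, recall, and F1 for each class (0 = died, 1 = survived).
print(classification_report(ytest, DT.predict(xtest)))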

Validation (cross-validation)

from sklearn.model_selection import cross_val_score
cross_val_score(DT,xtrain,ytrain,cv=10)

array([0.8358209 , 0.85074627, 0.71641791, 0.67164179, 0.86567164,
0.7761194 , 0.86567164, 0.76119403, 0.8030303 , 0.81818182])
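Averaging the folds collapses them into a single number that is easier to compare against the plain test score:

cross_val_score(DT, xtrain, ytrain, cv=10).mean()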

Compare training-set, test-set, and cross-validation scores for decision trees of different depths:

cross = []
score = []
train = []
for i in np.arange(1,20):
    DT1 = DecisionTreeClassifier(random_state=20,max_depth = i).fit(xtrain,ytrain)
    c = cross_val_score(DT1,xtrain,ytrain,cv=5).mean()
    cross.append(c)
    score.append(DT1.score(xtest,ytest))
    train.append(DT1.score(xtrain,ytrain))
plt.plot(np.arange(1,20),cross ,label = 'cross')
plt.plot(np.arange(1,20),score,label = 'test')
plt.plot(np.arange(1,20),train,label = 'train')
plt.legend()
plt.xticks(np.arange(1,20))

[Figure: mean CV, test, and train accuracy vs. max_depth]
Pick the best parameter from the plot:

DT = DecisionTreeClassifier(random_state = 20, max_depth = 6)
cross_val_score(DT, xtrain,ytrain,cv=5).mean() # 0.8278083267871171

Cross-validation accuracy peaks at max_depth = 6.

[*zip(np.arange(1,20),cross)]

[(1, 0.7978341375827629),
(2, 0.7783525979126922),
(3, 0.8023678599483783),
(4, 0.8248008079901246),
(5, 0.8203119739647626),
(6, 0.8278083267871171),
(7, 0.8158455841095276),
(8, 0.8218045112781954),
(9, 0.8158904724497813),
(10, 0.8038603972618112),
(11, 0.8039052856020648),
(12, 0.8054090450005612),
(13, 0.8039165076871282),
(14, 0.8009650993154528),
(15, 0.7979351363483336),
(16, 0.8009426551453259),
(17, 0.7979351363483336),
(18, 0.7964425990349007),
(19, 0.7994276736617663)]
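The same conclusion can be read off programmatically:

# Depth with the highest mean CV score in the sweep above.
int(np.arange(1, 20)[np.argmax(cross)])  # 6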

Tuning with grid search (GridSearchCV)

from sklearn.model_selection import GridSearchCV

Define the parameter grid to search

paras = {
    "max_depth":np.arange(1,20),
    "min_samples_leaf":np.arange(1,20),
    "criterion":['gini','entropy']
        }
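This grid has 19 × 19 × 2 = 722 parameter combinations; with cv = 8 the search trains 722 × 8 = 5,776 trees, which is why it takes a while.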

Instantiate the model (no need to fit it yet)

DT = DecisionTreeClassifier()

Set up the grid search and fit() the data

GS = GridSearchCV(DT,param_grid=paras,cv = 8).fit(xtrain,ytrain)

Best parameters:

GS.best_params_

Result: {'criterion': 'entropy', 'max_depth': 9, 'min_samples_leaf': 9}

Best score:

GS.best_score_

Result: 0.8441802925989673

Best estimator:

GS.best_estimator_

Result: DecisionTreeClassifier(criterion='entropy', max_depth=14, min_samples_leaf=9)

(The max_depth here disagrees with best_params_ above; since the DecisionTreeClassifier was created without a fixed random_state, the two cells were probably run at different times and found slightly different optima.)

Based on this, build the tuned decision tree:

DT = DecisionTreeClassifier(criterion='entropy', max_depth=14, min_samples_leaf=9).fit(xtrain,ytrain)
DT.score(xtest,ytest) # 0.8071748878923767

Use the classifier on a new sample:

# features in order [Pclass, Sex, Age, SibSp, Parch, Fare, Embarked]
DT.predict([[1,0,30,1,2,58,0]]) # array([1], dtype=int64)

Visualize the decision tree:

import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'  # make a local Graphviz install visible (Windows path)
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(DT
                                ,out_file = None
                                ,feature_names= X.columns
                                ,class_names=['died','survived']  # label order follows classes 0, 1
                                ,filled=True
                                ,rounded=True
                                )
graph = graphviz.Source(dot_data) 
graph

Compute feature importances:

[*zip(DT.feature_importances_,X.columns)]

[(0.17948345431191473, 'Pclass'),
(0.4272386937323802, 'Sex'),
(0.12579044257521602, 'Age'),
(0.060547878544091265, 'SibSp'),
(0.0, 'Parch'),
(0.1842363885730208, 'Fare'),
(0.0227031422633769, 'Embarked')]
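Sorting makes the ranking easier to read: Sex dominates, and Parch is never used by this tree.

sorted(zip(DT.feature_importances_, X.columns), reverse=True)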

Predicted probabilities:

DT.predict_proba(xtest)

[Output: an array of shape (n_test, 2); column 0 is the predicted probability of class 0 (died), column 1 of class 1 (survived)]

2. Random Forest

from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(max_depth = 4).fit(xtrain,ytrain)
RF.score(xtest,ytest)

Result: 0.8026905829596412

Test the random forest at different depths, again with cross-validation:

from sklearn.model_selection import cross_val_score
cross_val_score(RF,xtrain,ytrain,cv=10)
cross = []
score = []
train = []
for i in np.arange(1,20):
    RF1 = RandomForestClassifier(random_state=20,max_depth = i).fit(xtrain,ytrain)
    c = cross_val_score(RF1,xtrain,ytrain,cv=5).mean()
    cross.append(c)
    score.append(RF1.score(xtest,ytest))
    train.append(RF1.score(xtrain,ytrain))

plt.plot(np.arange(1,20),cross ,label = 'cross')
plt.plot(np.arange(1,20),score,label = 'test')
plt.plot(np.arange(1,20),train,label = 'train')
plt.legend()
plt.xticks(np.arange(1,20))

[Figure: mean CV, test, and train accuracy vs. max_depth for the random forest]

RF = RandomForestClassifier(random_state = 20, max_depth = 5).fit(xtrain,ytrain)
cross_val_score(RF, xtrain,ytrain,cv=5).mean() # 0.8367523285826506

Result: 0.8367523285826506

RF.score(xtest,ytest)

Result: 0.8071748878923767

[*zip(np.arange(1,20),cross)]

[(1, 0.7858489507350466),
(2, 0.7888452474469756),
(3, 0.8053192683200538),
(4, 0.8158006957692739),
(5, 0.8367523285826506),
(6, 0.8232970485916283),
(7, 0.8188306587363933),
(8, 0.82933453035574),
(9, 0.8203456402199528),
(10, 0.8128492873975984),
(11, 0.8113791942542925),
(12, 0.802412748288632),
(13, 0.802412748288632),
(14, 0.7964089327797105),
(15, 0.7994276736617664),
(16, 0.7979239142632701),
(17, 0.7934238581528448),
(18, 0.7949163954662776),
(19, 0.7949163954662776)]

Tune with grid search (slow; roughly 10 minutes):

from sklearn.model_selection import GridSearchCV
paras = {
    "max_depth":np.arange(1,20),
    "min_samples_leaf":np.arange(1,20),
    "criterion":['gini','entropy']
        }
RF = RandomForestClassifier()
GS = GridSearchCV(RF,param_grid=paras).fit(xtrain,ytrain)
print(GS.best_params_) # {'criterion': 'entropy', 'max_depth': 8, 'min_samples_leaf': 3}
print(GS.best_score_) # 0.8382785321512737
print(GS.best_estimator_) # RandomForestClassifier(criterion='entropy', max_depth=8, min_samples_leaf=3)

Rebuild the random forest with the best parameters

RF = RandomForestClassifier(criterion='entropy', max_depth=8, min_samples_leaf=3).fit(xtrain,ytrain)
RF.score(xtest,ytest) # 0.8026905829596412
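The grid above only tunes tree shape. As a sketch of a broader search (an extension, not in the original), n_estimators is usually worth including, and fixing random_state makes reruns reproducible:

# Hypothetical wider grid; n_jobs=-1 parallelizes the search across CPU cores.
paras = {
    "n_estimators": [50, 100, 200],
    "max_depth": np.arange(3, 10),
    "min_samples_leaf": np.arange(1, 10),
}
GS = GridSearchCV(RandomForestClassifier(random_state=20),
                  param_grid=paras, cv=5, n_jobs=-1).fit(xtrain, ytrain)
GS.best_params_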

3. Comparing Several Regressors (Boston dataset); to do: data standardization and normalization

Prepare the data

from sklearn.datasets import load_boston  # load_boston was removed in scikit-learn 1.2; this assumes an older version
boston = load_boston()
X = pd.DataFrame(boston.data,columns=boston.feature_names)
y = boston.target
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

xtrain,xtest,ytrain,ytest =train_test_split(X,y,random_state = 20)

Regression tree

DTR = DecisionTreeRegressor(max_depth = 8,random_state = 20).fit(xtrain,ytrain)
DTR.score(xtest,ytest),mean_squared_error(ytest,DTR.predict(xtest)) # (0.603193016561408, 31.984825561881298)

Random forest regression

# random forest

from sklearn.ensemble import RandomForestRegressor

RFR = RandomForestRegressor(random_state = 20).fit(xtrain,ytrain)
RFR.score(xtest,ytest),mean_squared_error(ytest,RFR.predict(xtest)) # (0.8051659689805782, 15.704694614173214)

Ridge regression

from sklearn.linear_model import Ridge
LR = Ridge().fit(xtrain,ytrain)
LR.score(xtest,ytest),mean_squared_error(ytest,LR.predict(xtest)) # (0.7214294743488996, 22.45431668671955)

Polynomial regression (PolynomialFeatures + Ridge)

from sklearn.preprocessing import PolynomialFeatures  
PF = PolynomialFeatures(degree=2).fit(xtrain)
xtrain_poly = pd.DataFrame(PF.transform(xtrain),columns=PF.get_feature_names(input_features=X.columns))  # get_feature_names_out in newer scikit-learn
xtest_poly = pd.DataFrame(PF.transform(xtest),columns=PF.get_feature_names(input_features=X.columns))
LR2 = Ridge().fit(xtrain_poly,ytrain)
LR2.score(xtest_poly,ytest),mean_squared_error(ytest,LR2.predict(xtest_poly)) #(0.7348578552292532, 21.37191532292632)
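As a sketch of the standardization "to do" in the section heading: scale the features before the linear models. Trees are insensitive to feature scale, but Ridge on polynomial features benefits from it (pipe is a hypothetical name; its score is not from the original run):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize, expand to degree-2 polynomial features, then ridge-regress.
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), Ridge())
pipe.fit(xtrain, ytrain)
pipe.score(xtest, ytest), mean_squared_error(ytest, pipe.predict(xtest))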