python: RandomForest modeling and analysis experiment on the Titanic dataset

偷偷搞塌

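import pandas as pd

def read_dataset(fname):
    # Use the first column (PassengerId) as the row index
    data = pd.read_csv(fname, index_col=0)
    # Drop columns that are not used as features
    data.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
    # Encode sex: male -> 1, female -> 0
    data['Sex'] = (data['Sex'] == 'male').astype('int')
    # Encode the port of embarkation as the position of each value in the list of unique values
    labels = data['Embarked'].unique().tolist()
    data['Embarked'] = data['Embarked'].apply(lambda n: labels.index(n))
    # Fill the remaining missing values with 0
    data = data.fillna(0)
    return data

# read_dataset now handles missing values, label encoding, and column removal while loading the data
train = read_dataset('C:/Users/dell/Desktop/code/datasets/titanic/train.csv')
train.head(5)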

Out[8]:
             Survived  Pclass  Sex  ...  Parch     Fare  Embarked
PassengerId                         ...
1                   0       3    1  ...      0   7.2500         0
2                   1       1    0  ...      0  71.2833         1
3                   1       3    0  ...      0   7.9250         0
4                   1       1    0  ...      0  53.1000         0
5                   0       3    1  ...      0   8.0500         0

[5 rows x 8 columns]
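The encoding step in read_dataset looks up each Embarked value with labels.index(n), which depends on how the missing value (NaN) compares with itself. A minimal sketch of a more defensive variant follows, assuming pd.factorize is an acceptable substitute. The toy frame and the encode helper are hypothetical, not part of the original post, and unlike the original this variant maps a missing Embarked value to -1 instead of giving it its own label.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for three Titanic columns (values are made up).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", np.nan],
    "Age": [22.0, np.nan, 26.0],
})

def encode(df):
    """Hypothetical helper: encode Sex and Embarked, then fill gaps with 0."""
    out = df.copy()
    out["Sex"] = (out["Sex"] == "male").astype(int)
    # factorize numbers the labels in order of first appearance; NaN becomes -1
    codes, uniques = pd.factorize(out["Embarked"])
    out["Embarked"] = codes
    return out.fillna(0), list(uniques)

encoded, embarked_labels = encode(toy)
print(encoded)
print(embarked_labels)
```

Keeping the list of labels next to the encoded frame makes it possible to map the integer codes back to the original port letters later.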

from sklearn.model_selection import train_test_split

y = train['Survived'].values
X = train.drop(['Survived'], axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print('train dataset:{0}; test dataset:{1}'.format(X_train.shape, X_test.shape))
train dataset:(712, 7); test dataset:(179, 7)
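Because train_test_split shuffles randomly, every run of the split above produces different subsets. As a rough sketch, fixing random_state makes the split repeatable, and stratify keeps the class ratio similar in both parts. The arrays X_demo and y_demo are synthetic stand-ins with the same shape as the Titanic data, not the real dataset, and the seed values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 891 rows and 7 features, like the prepared Titanic data.
rng = np.random.RandomState(0)
X_demo = rng.rand(891, 7)
y_demo = (rng.rand(891) < 0.38).astype(int)  # made-up labels

# random_state fixes the shuffle; stratify preserves the class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```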

Next, fit the data with DecisionTreeClassifier.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
Out[15]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False, random_state=None,
                       splitter='best')
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)

print('train score:{0:.6f};test score:{1:.6f}'.format(train_score, test_score))

train score:0.981742;test score:0.770950
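# The score on the training samples is very high while the score on the test set is much lower, which is the signature of overfitting. The usual remedy for an overfitted DecisionTreeClassifier is pruning.

Unfortunately, the scikit-learn version used here does not support post-pruning. (Releases from 0.22 onward add minimal cost-complexity pruning through the ccp_alpha parameter.)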
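For readers on scikit-learn 0.22 or later, post-pruning is available as minimal cost-complexity pruning. The sketch below uses synthetic data from make_classification rather than the Titanic file, and picking the middle alpha from the pruning path is an arbitrary choice for illustration, not a recommended selection rule.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the Titanic features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Grow an unpruned tree and compute the candidate pruning strengths.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
path = full_tree.cost_complexity_pruning_path(X_tr, y_tr)

# Take the middle alpha as an example; clamp to 0 to guard against rounding.
alpha = float(max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0))
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)

print('leaves:', full_tree.get_n_leaves(), '->', pruned_tree.get_n_leaves())
print('test score:', full_tree.score(X_te, y_te), '->', pruned_tree.score(X_te, y_te))
```

In practice alpha would be chosen by comparing validation scores over the values in path.ccp_alphas, in the same way the post searches over max_depth.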

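# Tune the model parameters: first search programmatically for the best max_depth,
# then do the same for min_impurity_split, a threshold on node impurity: a node is split only while its impurity is above the threshold, otherwise it becomes a leaf

1. Tuning max_depth

# range(a, b) runs from a up to, but not including, b
# Tuning the parameter max_depth

# cv_score(d) trains a tree with max_depth=d and returns its two scores
def cv_score(d):
    clf = DecisionTreeClassifier(max_depth=d)
    clf.fit(X_train, y_train)
    tr_score = clf.score(X_train, y_train)
    # Called the "cv" score in this post, although it is computed on the single held-out test split
    te_score = clf.score(X_test, y_test)
    return (tr_score, te_score)
# So cv_score returns a pair: element 0 is the training score and element 1 is the test-set score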

# Next, build the range of values for d = max_depth,
# compute the model score for each value, and find the parameter with the highest score
import numpy as np

depths = range(2, 15)
# In fact this runs from 2 to 14, because 15 is not included
# The square brackets form a list comprehension: cv_score(d) is called for every d
# and the returned (training score, test score) pairs are collected into a list
scores = [cv_score(d) for d in depths]
# tr_scores takes element 0 of each pair
tr_scores = [s[0] for s in scores]
# cv_scores takes element 1 of each pair
cv_scores = [s[1] for s in scores]

# Use argmax() to find the index of the best score
best_score_index = np.argmax(cv_scores)
best_param = depths[best_score_index]

best_score = cv_scores[best_score_index]

print('best param:{0:.4f}; best score:{1:.4f}'.format(best_param, best_score))
# .4f formats the number with 4 digits after the decimal point
# .6f formats the number with 6 digits after the decimal point
best param:3.0000; best score:0.8268
# For the depth parameter, the best max_depth is 3, and the corresponding score on the held-out set is 0.8268
print('best param:{}'.format(best_param))
best param:3
print(best_score_index)
1
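The score labelled "cross-validation" above comes from a single train/test split. A minimal sketch of the same depth search with actual k-fold cross-validation, using cross_val_score, is given below. It runs on synthetic data from make_classification as a stand-in for the Titanic features, and the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the Titanic features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, n_informative=4, random_state=0
)

depths = range(2, 15)
# For each depth, average the accuracy over 5 folds.
mean_scores = [
    cross_val_score(
        DecisionTreeClassifier(max_depth=d, random_state=0), X_demo, y_demo, cv=5
    ).mean()
    for d in depths
]

best_index = int(np.argmax(mean_scores))
best_depth = depths[best_index]
print('best max_depth:', best_depth, 'mean cv score:', mean_scores[best_index])
```

Averaging over folds makes the selected depth less sensitive to one particular random split.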

Next, plot the model score against the parameter value to see the trend more directly.

import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4), dpi=144)

plt.grid()
plt.xlabel('max depth of decision tree')
plt.ylabel('score')

plt.plot(depths, cv_scores, '.g-', label='cross-validation score')
plt.plot(depths, tr_scores, '.r--', label='training score')

plt.legend()
plt.show()

[Figure: training score and cross-validation score versus max depth of the decision tree]

2. Tuning min_impurity_split in the same way

# (This parameter sets a threshold on node impurity, measured by entropy or Gini impurity: a node is split only while its impurity is above the threshold, otherwise it becomes a leaf. It was deprecated in scikit-learn 0.19 and removed in 1.0, so this code needs an older release.)

def cv_score(val):
    clf = DecisionTreeClassifier(criterion='gini', min_impurity_split=val)
    clf.fit(X_train, y_train)
    tr_score = clf.score(X_train, y_train)
    te_score = clf.score(X_test, y_test)
    return (tr_score, te_score)

# Specify the range for the parameter
values = np.linspace(0, 0.5, 50)

scores = [cv_score(v) for v in values]
tr_scores = [s[0] for s in scores]
cv_scores = [s[1] for s in scores]

best_score_index = np.argmax(cv_scores)
best_score = cv_scores[best_score_index]
best_param = values[best_score_index]
print('best parameter:{0:.4f}; best score:{1:.4f}'.format(best_param, best_score))
# The best min_impurity_split is 0.1939, and the corresponding score on the held-out set is 0.8324
best parameter:0.1939; best score:0.8324

# Plot the score against the parameter value
# (plt.figure, not plt.plot, is the call that accepts figsize and dpi)
plt.figure(figsize=(6, 4), dpi=144)
plt.grid()

plt.xlabel('impurity threshold (min_impurity_split)')
plt.ylabel('score')

plt.plot(values, cv_scores, '.g-', label='cross-validation score')
plt.plot(values, tr_scores, '.r--', label='training score')

plt.legend()
plt.show()

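As the threshold approaches 0.5, both the training score and the cv score drop sharply, which indicates underfitting. For two classes the Gini impurity of a node is at most 0.5, so a threshold that high stops almost every split.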
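Since min_impurity_split is gone from scikit-learn 1.0 and later, the closest remaining knob is min_impurity_decrease, which thresholds the weighted impurity decrease produced by a split rather than the impurity of the node itself. The following is a sketch of the same scan with that parameter. The data is synthetic, the helper name score_for is hypothetical, and the range 0 to 0.05 is an assumed starting point rather than a tuned one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the Titanic features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

def score_for(threshold):
    """Hypothetical helper: train with the given threshold, return both scores."""
    clf = DecisionTreeClassifier(
        criterion='gini', min_impurity_decrease=threshold, random_state=0
    )
    clf.fit(X_tr, y_tr)
    return clf.score(X_tr, y_tr), clf.score(X_te, y_te)

values = np.linspace(0, 0.05, 11)
scores = [score_for(v) for v in values]
te_scores = [s[1] for s in scores]

best_index = int(np.argmax(te_scores))
print('best threshold:', values[best_index], 'score:', te_scores[best_index])
```

Because the two parameters measure different things, the best value found here is not comparable with the 0.1939 reported for min_impurity_split.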

3. Parameter selection with GridSearchCV from sklearn.model_selection (covered in the next post)

GridSearchCV can select the best combination of several parameters in one search.
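As a brief preview, a grid search over two parameters might look like the sketch below. It uses synthetic data instead of the Titanic file, substitutes min_impurity_decrease for the removed min_impurity_split, and the grid values are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the Titanic features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, n_informative=4, random_state=0
)

# Every combination of these values is evaluated with 5-fold cross-validation.
param_grid = {
    'max_depth': list(range(2, 15)),
    'min_impurity_decrease': [0.0, 0.001, 0.005, 0.01],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_demo, y_demo)

print('best params:', search.best_params_)
print('best score:', search.best_score_)
```

After fitting, search.best_estimator_ holds a tree retrained on all of the data with the winning combination.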
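The hand-written search above has two problems. (1) The result is unstable: one run may pick max_depth 7 and the next run 6. The cause is that train_test_split partitions the data randomly on every call, so the training and test sets differ from run to run, and so do the trained models. The fix is to repeat the computation and take the mean: for each value of a parameter, split the dataset several times, train a model on each split, and compute the lowest, highest, and mean score for that value. (2) It cannot select several parameters at once. With some more work the code could handle combinations of parameters too (though I do not know how to do that yet...).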
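The repeated-split idea from point (1) can be sketched as follows. The helper name repeated_scores, the 10 repetitions, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the Titanic features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=7, n_informative=4, random_state=0
)

def repeated_scores(depth, n_repeats=10):
    """Hypothetical helper: min, max, and mean test score over several splits."""
    scores = []
    for seed in range(n_repeats):
        # A different seed gives a different random split each time.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo, y_demo, test_size=0.2, random_state=seed
        )
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
        clf.fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    return min(scores), max(scores), float(np.mean(scores))

summary = {d: repeated_scores(d) for d in range(2, 8)}
for d, (lo, hi, mean) in summary.items():
    print('max_depth={}: min={:.4f} max={:.4f} mean={:.4f}'.format(d, lo, hi, mean))
```

Choosing the depth with the highest mean, and checking that its minimum is not far below, gives a more stable choice than a single split.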