1. Reference Blogs
Task: build four models (random forest, GBDT, XGBoost, and LightGBM); any scoring method may be used.
https://blog.csdn.net/w952470866/article/details/78987265 Random Forest
https://blog.csdn.net/xiaoliuhexiaolu/article/details/80582247 GBDT
https://blog.csdn.net/q383700092/article/details/53744277 GBDT
https://blog.csdn.net/hb707934728/article/details/70739040 XGBoost
https://blog.csdn.net/luanpeng825485697/article/details/80236759 LightGBM
2. Imports, Loading Data, and Splitting the Dataset
Imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb
Loading and splitting the data
# Load the data
data_all = pd.read_csv('./data_all.csv', encoding='gbk')
data_all.head()
# Split the dataset
from sklearn.model_selection import train_test_split
features = [x for x in data_all.columns if x not in ['status']]  # feature columns
X = data_all[features]  # feature matrix
y = data_all['status']  # labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)
# Standardize the features
scaler = StandardScaler()
scaler.fit(X_train)
X_train_stand = scaler.transform(X_train)
X_test_stand = scaler.transform(X_test)
# Note: refitting the scaler on the full X (test rows included) leaks test
# information into the preprocessing; kept here to match the original run
scaler.fit(X)
X_stand = scaler.transform(X)
X_test_stand
3. Introductory Examples of Each Algorithm
A. Random Forest
A random forest combines many decision trees through ensemble learning: its basic unit is the decision tree, and the method belongs to a major branch of machine learning, ensemble learning.
rf = RandomForestClassifier(n_estimators=230, max_features=0.2, random_state=2018)
rf.fit(X_stand, y)  # note: fit on the *full* standardized dataset, test rows included
rf_score = rf.score(X_test_stand, y_test)
rf_score
# An astonishing 0.9964961457603364 -- but only because the model already saw
# the test rows during training (data leakage), so this score is inflated
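For comparison, here is a leakage-free version of the evaluation that fits both the scaler and the forest on the training split only. Since data_all.csv is not available in this sketch, a synthetic dataset from make_classification stands in for the real features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data_all.csv (assumption: binary `status` label)
X, y = make_classification(n_samples=1000, n_features=20, random_state=2018)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler().fit(X_train)
X_train_stand = scaler.transform(X_train)
X_test_stand = scaler.transform(X_test)

# Train on the training split only -- the test set stays unseen
rf = RandomForestClassifier(n_estimators=230, max_features=0.2, random_state=2018)
rf.fit(X_train_stand, y_train)
score = rf.score(X_test_stand, y_test)
print(score)
```

With this setup the held-out accuracy is realistic rather than near-perfect, which explains the 0.996 above.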
B. GBDT
GBDT stands for Gradient Boosting Decision Tree.
GBDT uses CART regression trees as its base learners: whether the task is regression, binary classification, or multi-class classification, the trees are always CART regression trees.
model = GradientBoostingClassifier(n_estimators=230, learning_rate=1.0, random_state=2018)
model.fit(X_stand, y)  # again fit on the full dataset, so the score is optimistic
model_score = model.score(X_test_stand, y_test)
model_score
model = GradientBoostingClassifier(n_estimators=230, max_features=0.2, random_state=2018)
model.fit(X_stand, y)
model_score = model.score(X_test_stand, y_test)
model_score
0.8521373510861948  (first run, learning_rate=1.0)
0.8472319551506657  (second run, max_features=0.2)
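The claim that GBDT always uses CART regression trees can be checked directly in scikit-learn: the fitted `estimators_` array of a `GradientBoostingClassifier` holds `DecisionTreeRegressor` instances even for a classification task, because each tree is fit to gradients, not to class labels. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor

X, y = make_classification(n_samples=500, random_state=2018)
model = GradientBoostingClassifier(n_estimators=10, random_state=2018)
model.fit(X, y)

# estimators_ has shape (n_estimators, K); for binary classification K is 1,
# and each entry is a CART *regression* tree fit to the loss gradients
first_tree = model.estimators_[0, 0]
print(type(first_tree).__name__)  # DecisionTreeRegressor
```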
C. XGBoost
As noted above, XGBoost is an implementation of gradient boosting decision trees. A straightforward gradient boosting implementation tends to be slow, because each iteration must build a new tree and append it to the model sequence. XGBoost stands out for its training speed and strong model performance, which are exactly the goals of this project.
data_train = xgb.DMatrix(X_train_stand, label=y_train)
data_test = xgb.DMatrix(X_test_stand, label=y_test)
watch_list = [(data_test, 'eval'), (data_train, 'train')]
# multi:softmax with num_class=2 works here, but binary:logistic is the
# usual choice for a two-class problem
param = {'max_depth': 2, 'silent': 0, 'objective': 'multi:softmax', 'num_class': 2}
bst = xgb.train(param, data_train, num_boost_round=6, evals=watch_list)
y_hat = bst.predict(data_test)
result = y_test == y_hat
print('Accuracy:\t', float(np.sum(result)) / len(y_hat))
[16:51:49] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=2
[0] eval-merror:0.245971 train-merror:0.22122
[1] eval-merror:0.234057 train-merror:0.216712
[2] eval-merror:0.229853 train-merror:0.211602
[3] eval-merror:0.221444 train-merror:0.210099
[4] eval-merror:0.225648 train-merror:0.203787
[5] eval-merror:0.220743 train-merror:0.203186
Accuracy: 0.7792571829011913
D. LightGBM
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is distributed and efficient, and offers the following advantages:
faster training, lower memory usage, higher accuracy, support for parallel learning, and the ability to handle large-scale data.
from sklearn.metrics import mean_squared_error  # needed for the RMSE below

# Build the model (note: LGBMRegressor on a 0/1 label treats the task as
# regression; LGBMClassifier would be the natural fit for classification)
gbm = lgb.LGBMRegressor(objective='regression', n_estimators=230)
# Train the model
gbm.fit(X_train_stand, y_train, eval_set=[(X_test_stand, y_test)])
print('Start predicting...')
# Predict on the test set
y_pred = gbm.predict(X_test_stand, num_iteration=gbm.best_iteration_)
# Evaluate the model
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)
4. Open Questions
1. Running LightGBM repeatedly killed the kernel. If low memory usage is one of its advertised strengths, why does the kernel die?
2. The various scoring metrics (AUC, accuracy, and so on) are still unclear to me.
3. The internal reasoning of each algorithm needs more study.
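Regarding question 2, accuracy and AUC measure different things: accuracy counts correct hard predictions after thresholding, while AUC measures how well predicted probabilities rank positives above negatives. A tiny worked example with hand-made labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.4, 0.8, 0.9, 0.3, 0.1])

# Accuracy needs hard labels; threshold the probabilities at 0.5
acc = accuracy_score(y_true, (y_prob > 0.5).astype(int))
# AUC works on the raw probabilities: the probability that a random
# positive example is ranked above a random negative one
auc = roc_auc_score(y_true, y_prob)
print(acc, auc)  # 5/6 correct; 8 of 9 positive/negative pairs ranked correctly
```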