1. Reference Blogs
Task: build four models (random forest, GBDT, XGBoost, and LightGBM); any scoring method may be used.
https://blog.csdn.net/w952470866/article/details/78987265 Random Forest
https://blog.csdn.net/xiaoliuhexiaolu/article/details/80582247 GBDT
https://blog.csdn.net/q383700092/article/details/53744277 GBDT
https://blog.csdn.net/hb707934728/article/details/70739040 XGBoost
https://blog.csdn.net/luanpeng825485697/article/details/80236759 LightGBM
2. Imports, Loading Data, and Splitting the Dataset
Imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb
Loading and splitting the data
# Load the data
data_all = pd.read_csv('./data_all.csv', encoding='gbk')
data_all.head()
# Split the dataset
from sklearn.model_selection import train_test_split
features = [x for x in data_all.columns if x not in ['status']]  # feature columns
X = data_all[features]  # feature matrix
y = data_all['status']  # labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)
# Standardize the features
scaler = StandardScaler()
scaler.fit(X_train)
X_train_stand = scaler.transform(X_train)
X_test_stand = scaler.transform(X_test)
# Note: refitting the scaler on the full X (test rows included) leaks test
# information into the preprocessing; kept here to match the original run
scaler.fit(X)
X_stand = scaler.transform(X)
X_test_stand
3. Introductory Examples of Each Algorithm
A. Random Forest
A random forest combines many decision trees through ensemble learning: its basic unit is the decision tree, and the method belongs to a major branch of machine learning, ensemble learning.
rf = RandomForestClassifier(n_estimators=230, max_features=0.2, random_state=2018)
rf.fit(X_stand, y)  # note: fit on the *full* standardized dataset, test rows included
rf_score = rf.score(X_test_stand, y_test)
rf_score
# An astonishing 0.9964961457603364 -- but only because the model already saw
# the test rows during training (data leakage), so this score is inflated
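For comparison, here is a leakage-free version of the evaluation that fits both the scaler and the forest on the training split only. Since data_all.csv is not available in this sketch, a synthetic dataset from make_classification stands in for the real features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data_all.csv (assumption: binary `status` label)
X, y = make_classification(n_samples=1000, n_features=20, random_state=2018)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler().fit(X_train)
X_train_stand = scaler.transform(X_train)
X_test_stand = scaler.transform(X_test)

# Train on the training split only -- the test set stays unseen
rf = RandomForestClassifier(n_estimators=230, max_features=0.2, random_state=2018)
rf.fit(X_train_stand, y_train)
score = rf.score(X_test_stand, y_test)
print(score)
```

With this setup the held-out accuracy is realistic rather than near-perfect, which explains the 0.996 above.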
B. GBDT
GBDT stands for Gradient Boosting Decision Tree.
GBDT uses CART regression trees as its base learners: whether the task is regression, binary classification, or multi-class classification, the trees are always CART regression trees.
model = GradientBoostingClassifier(n_estimators=230, learning_rate=1.0, random_state=2018)
model.fit(X_stand, y)  # again fit on the full dataset, so the score is optimistic
model_score = model.score(X_test_stand, y_test)
model_score
model = GradientBoostingClassifier(n_estimators=230, max_features=0.2, random_state=2018)
model.fit(X_stand, y)
model_score = model.score(X_test_stand, y_test)
model_score
0.8521373510861948  (first run, learning_rate=1.0)
0.8472319551506657  (second run, max_features=0.2)
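The claim that GBDT always uses CART regression trees can be checked directly in scikit-learn: the fitted `estimators_` array of a `GradientBoostingClassifier` holds `DecisionTreeRegressor` instances even for a classification task, because each tree is fit to gradients, not to class labels. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor

X, y = make_classification(n_samples=500, random_state=2018)
model = GradientBoostingClassifier(n_estimators=10, random_state=2018)
model.fit(X, y)

# estimators_ has shape (n_estimators, K); for binary classification K is 1,
# and each entry is a CART *regression* tree fit to the loss gradients
first_tree = model.estimators_[0, 0]
print(type(first_tree).__name__)  # DecisionTreeRegressor
```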
C. XGBoost
As noted above, XGBoost is an implementation of gradient boosting decision trees. A straightforward gradient boosting implementation tends to be slow, because each iteration must build a new tree and append it to the model sequence. XGBoost stands out for its training speed and strong model performance, which are exactly the goals of this project.
data_train = xgb.DMatrix(X_train_stand, label=y_train)
data_test = xgb.DMatrix(X_test_stand, label=y_test)
watch_list = [(data_test, 'eval'), (data_train, 'train')]
# multi:softmax with num_class=2 works here, but binary:logistic is the
# usual choice for a two-class problem
param = {'max_depth': 2, 'silent': 0, 'objective': 'multi:softmax', 'num_class': 2}
bst = xgb.train(param, data_train, num_boost_round=6, evals=watch_list)
y_hat = bst.predict(data_test)
result = y_test == y_hat
print('Accuracy:\t', float(np.sum(result)) / len(y_hat))
[16:51:49] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=2
[0] eval-merror:0.245971 train-merror:0.22122
[1] eval-merror:0.234057 train-merror:0.216712
[2] eval-merror:0.229853 train-merror:0.211602
[3] eval-merror:0.221444 train-merror:0.210099
[4] eval-merror:0.225648 train-merror:0.203787
[5] eval-merror:0.220743 train-merror:0.203186
Accuracy: 0.7792571829011913
D. LightGBM
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is distributed and efficient, and offers the following advantages:
faster training, lower memory usage, higher accuracy, support for parallel learning, and the ability to handle large-scale data.
from sklearn.metrics import mean_squared_error  # needed for the RMSE below

# Build the model (note: LGBMRegressor on a 0/1 label treats the task as
# regression; LGBMClassifier would be the natural fit for classification)
gbm = lgb.LGBMRegressor(objective='regression', n_estimators=230)
# Train the model
gbm.fit(X_train_stand, y_train, eval_set=[(X_test_stand, y_test)])
print('Start predicting...')
# Predict on the test set
y_pred = gbm.predict(X_test_stand, num_iteration=gbm.best_iteration_)
# Evaluate the model
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)
4. Open Questions
1. Running LightGBM repeatedly killed the kernel. If low memory usage is one of its advertised strengths, why does the kernel die?
2. The various scoring metrics (AUC, accuracy, and so on) are still unclear to me.
3. The internal reasoning of each algorithm needs more study.
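Regarding question 2, accuracy and AUC measure different things: accuracy counts correct hard predictions after thresholding, while AUC measures how well predicted probabilities rank positives above negatives. A tiny worked example with hand-made labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.4, 0.8, 0.9, 0.3, 0.1])

# Accuracy needs hard labels; threshold the probabilities at 0.5
acc = accuracy_score(y_true, (y_prob > 0.5).astype(int))
# AUC works on the raw probabilities: the probability that a random
# positive example is ranked above a random negative one
auc = roc_auc_score(y_true, y_prob)
print(acc, auc)  # 5/6 correct; 8 of 9 positive/negative pairs ranked correctly
```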