Task: Model Building
Build four models (random forest, GBDT, XGBoost, and LightGBM) and score each one. Any scoring method is acceptable, for example accuracy or AUC.
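As a reminder of what the two example metrics measure, here is a minimal pure-Python sketch. The function names and the toy labels and scores are illustrative assumptions, not taken from data_all.csv; in practice a library implementation would normally be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, y_score):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]                       # toy labels
y_score = [0.1, 0.4, 0.35, 0.8]             # toy predicted scores
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
print(accuracy(y_true, y_pred), auc(y_true, y_score))
```

Accuracy depends on the 0.5 cut-off used to turn scores into labels, while AUC only depends on how the scores rank positives against negatives.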
1. Installation resources
- Random forest and GBDT are both included in the sklearn package.
- LightGBM: https://github.com/Microsoft/LightGBM
- It is now available on PyPI, so it can be installed with pip.
- XGBoost: https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost, https://github.com/dmlc/xgboost
Tip: if pip is slow or times out during installation, switch to a mirror:
sudo pip install -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com lightgbm
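After installing, it can help to confirm which versions are actually visible to the interpreter. The sketch below uses only the standard library; the helper name installed_versions is an assumption for illustration, and the names passed in are the PyPI distribution names.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["scikit-learn", "xgboost", "lightgbm"]).items():
    print(name, version if version else "not installed")
```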
2. Data loading + standardization
import pandas as pd
from sklearn.model_selection import train_test_split
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingRegressor
import warnings
from sklearn.preprocessing import StandardScaler
warnings.filterwarnings(action ='ignore', category = DeprecationWarning)
## Load the data
data = pd.read_csv("data_all.csv")
x = data.drop(labels='status', axis=1)
y = data['status']
x_train, x_test, y_train, y_test = train_test_split(x, y,test_size=0.3,random_state=2018)
print(len(x)) # 4754
## Standardize the features (the scaler is fitted on the training set only)
scaler = StandardScaler()
scaler.fit(x_train)
x_train_stand = scaler.transform(x_train)
x_test_stand = scaler.transform(x_test)
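The standardization step above can be sketched in plain Python to show what is being computed: per-column mean and population standard deviation are estimated from the training rows only and then applied to both sets. The helper names and toy rows are assumptions for illustration; this is not a replacement for StandardScaler.

```python
import math


def fit_scaler(rows):
    """Return per-column mean and population standard deviation of the rows."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    # Guard constant columns so the division below never hits zero.
    stds = [s if s > 0 else 1.0 for s in stds]
    return means, stds


def transform(rows, means, stds):
    """Apply (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]   # toy training rows
test = [[4.0, 20.0]]                               # toy test row
means, stds = fit_scaler(train)
print(transform(train, means, stds))
print(transform(test, means, stds))
```

Reusing the training statistics on the test rows keeps information about the test set out of the preprocessing step.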
3. Random forest model
Idea: random forest is an ensemble algorithm that combines many trees through Bagging; its base unit is the decision tree.
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
rfc_score = rfc.score(x_test, y_test)
print("The score of RF:",rfc_score)
rfc1 = RandomForestClassifier()
rfc1.fit(x_train_stand, y_train)
rfc1_score = rfc1.score(x_test_stand, y_test)
print("The score of RF(with preprocessing):",rfc1_score)
Output:
The score of RF: 0.7638402242466713
The score of RF(with preprocessing): 0.7652417659425368
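To make the Bagging idea behind random forest concrete, here is a minimal sketch under simplifying assumptions: the base learner is a one-split stump on a single toy feature, each stump is trained on a bootstrap sample, and predictions are combined by majority vote. It leaves out the per-split feature subsampling that random forests add on top of Bagging, and all names and data are illustrative.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Pick the threshold and direction on a 1-D feature with the fewest training errors."""
    best = None
    for thr in sorted(set(xs)):
        for above in (0, 1):                       # label predicted when x > thr
            preds = [above if x > thr else 1 - above for x in xs]
            errors = sum(p != y for p, y in zip(preds, ys))
            if best is None or errors < best[0]:
                best = (errors, thr, above)
    _, thr, above = best
    return lambda x: above if x > thr else 1 - above


def fit_bagging(xs, ys, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict(models, x):
    """Majority vote over the ensemble."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]


xs = [0.1, 0.3, 0.4, 0.6, 1.5, 1.7, 1.9, 2.2]      # toy 1-D feature
ys = [0, 0, 0, 0, 1, 1, 1, 1]                      # toy binary labels
models = fit_bagging(xs, ys)
print([predict(models, x) for x in (0.2, 2.0)])
```

Individual stumps can be poor when their bootstrap sample is unlucky; the vote averages those errors out.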
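4. GBDT model
GBDT stands for Gradient Boosting Decision Tree.
Idea: trees are added one at a time, and each new tree is fitted to the negative gradient of the loss function evaluated at the current ensemble's predictions.
from sklearn.ensemble import GradientBoostingClassifier
# status is treated as a class label by the other models, so use the classifier;
# its score() returns accuracy
gbdt = GradientBoostingClassifier()
gbdt.fit(x_train, y_train)
gbdt_score = gbdt.score(x_test, y_test)
print("The score of GBDT:",gbdt_score)
Output with GradientBoostingRegressor in place of the classifier (a regressor's score() returns R², not accuracy, so this value is not comparable with the other models):
The score of GBDT: 0.18118075405980671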
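The negative-gradient idea can be sketched for the simplest case, squared loss, where the negative gradient is just the residual between the target and the current prediction. This is a toy sketch under stated assumptions (one feature, one-split regression stumps, made-up data, hypothetical function names), not how sklearn implements GBDT.

```python
def fit_regression_stump(xs, targets):
    """Best single split on a 1-D feature, predicting the mean target on each side."""
    best = None
    for thr in sorted(set(xs))[:-1]:               # keep both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lmean) ** 2 for t in left) + sum((t - rmean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x <= thr else rmean


def fit_gbdt(xs, ys, n_rounds=20, learning_rate=0.5):
    """Boost stumps on squared loss L = (y - F)^2 / 2, where -dL/dF = y - F."""
    base = sum(ys) / len(ys)                       # initial constant prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # negative gradient
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of the model on the given points."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]                # toy feature values
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]                # toy targets
print(mse(fit_gbdt(xs, ys, n_rounds=1), xs, ys))
print(mse(fit_gbdt(xs, ys, n_rounds=20), xs, ys))
```

Each round shrinks the training error a little; the learning rate controls how much of each new stump is added.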
5. XGBoost model
# use a separate variable name so the imported xgb module is not overwritten
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(x_train, y_train)
xgb_score = xgb_clf.score(x_test, y_test)
print("The score of XGBoost:", xgb_score)
Output:
The score of XGBoost: 0.7855641205325858
Problem encountered
DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
if diff:
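==> Searching online shows that this comes from NumPy deprecating truth-value checks on empty arrays. The problem has since been fixed upstream.
==> Solution: ignore the warning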
import warnings
warnings.filterwarnings(action ='ignore', category = DeprecationWarning)
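A global filter hides every DeprecationWarning for the rest of the script. A narrower alternative, sketched here with only the standard library, suppresses the warning around a single call and restores the previous filters afterwards. The helper call_quietly and the stand-in function noisy_add are hypothetical names used for illustration.

```python
import warnings


def call_quietly(func, *args, **kwargs):
    """Run func with DeprecationWarning suppressed, restoring the filters afterwards."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=DeprecationWarning)
        return func(*args, **kwargs)


def noisy_add(a, b):
    # Hypothetical stand-in for a library call that emits the warning.
    warnings.warn("old behaviour", DeprecationWarning)
    return a + b


print(call_quietly(noisy_add, 1, 2))
```

In the script above this would look like wrapping only the call that triggers the warning, for example the score() call, rather than silencing the whole run.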
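6. LightGBM
Idea: LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:
- faster training
- lower memory usage
- better accuracy
- support for parallel learning
- able to handle large-scale data
# status is treated as a class label by the other models, so use the classifier;
# its score() returns accuracy
gbm = lgb.LGBMClassifier()
gbm.fit(x_train, y_train)
gbm_score = gbm.score(x_test, y_test)
print("The score of LightGBM:", gbm_score)
Output as originally recorded (that run used lgb.LGBMRegressor and printed gbdt_score instead of gbm_score, so the value simply repeats the GBDT result and is not a LightGBM score):
The score of LightGBM: 0.18118075405980671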
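A score near 0.18 next to accuracies near 0.77 suggests that different metrics are being compared. Under the scikit-learn estimator convention, a classifier's score() returns mean accuracy while a regressor's score() returns R². The sketch below, with made-up labels and outputs and illustrative function names, shows how the two can diverge on the same 0/1 target.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0, 0, 0, 1]                         # toy 0/1 labels
y_cont = [0.2, 0.1, 0.4, 0.6]                 # toy continuous regressor outputs
y_label = [1 if v >= 0.5 else 0 for v in y_cont]
print(r2_score(y_true, y_cont), accuracy(y_true, y_label))
```

Here every thresholded prediction is correct, yet R² is only about 0.5, because R² penalizes the distance between the continuous outputs and the 0/1 targets.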