LightGBM
- Developed by Microsoft
- Advantages: builds on and optimizes XGBoost
  - Very fast training
  - Very low memory consumption
  - Very high accuracy
  - Supports parallel training and GPU acceleration
  - Handles missing values directly
  - Scales to very large datasets
- fit parameters
  - eval_set: validation data whose score is reported at each iteration
  - early_stopping_rounds=50: stop training if the validation score has not improved within 50 iterations
  - verbose=30: print the score every 30 iterations
- Important attributes
  - best_iteration_: the best iteration number over the whole training run
  - feature_importances_: returns the feature importances
  - feature_names: the feature names
  - num_iteration: the number of iterations
- Important model parameters
  - subsample: fraction of samples drawn for each tree
  - learning_rate: learning rate
  - boosting_type:
    - 'gbdt': traditional Gradient Boosting Decision Tree
    - 'dart': Dropouts meet Multiple Additive Regression Trees
    - 'rf': Random Forest
  - n_estimators: number of boosting iterations
- Other parameters
Example
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
cancer=load_breast_cancer()
feature=cancer.data
target=cancer.target
x_train,x_test,y_train,y_test=train_test_split(feature,target,random_state=2020)
lgb_model=lgb.LGBMClassifier(n_estimators=150)
lgb_model.fit(x_train,y_train)
y_pred=lgb_model.predict_proba(x_test)[:,1]
roc_auc_score(y_test,y_pred) #0.9940023990403839
# fit parameters:
# eval_set: validation data whose score is reported at each iteration
# early_stopping_rounds=50: stop if the validation score has not improved within 50 iterations
# verbose=30: print the score every 30 iterations
lgb_model = lgb.LGBMClassifier(n_estimators=150)
lgb_model.fit(x_train,y_train,eval_set=[(x_test,y_test)],eval_metric='auc')
# [1] valid_0's auc: 0.97451 valid_0's binary_logloss: 0.610631
# [2] valid_0's auc: 0.977109 valid_0's binary_logloss: 0.545753
#...
#early_stopping_rounds=50: stop training once the validation score has not improved within 50 iterations
lgb_model = lgb.LGBMClassifier(n_estimators=150)
lgb_model.fit(x_train,y_train,eval_set=[(x_test,y_test)],eval_metric='auc',early_stopping_rounds=50)
# [1] valid_0's auc: 0.97451 valid_0's binary_logloss: 0.610631
# Training until validation scores don't improve for 50 rounds
# [2] valid_0's auc: 0.977109 valid_0's binary_logloss: 0.545753
# [3] valid_0's auc: 0.978709 valid_0's binary_logloss: 0.488691
#...
#verbose=30: print the score every 30 iterations
lgb_model.fit(x_train,y_train,eval_set=[(x_test,y_test)],eval_metric='auc',early_stopping_rounds=50,verbose=30)
# Training until validation scores don't improve for 50 rounds
# [30] valid_0's auc: 0.992003 valid_0's binary_logloss: 0.116233
# [60] valid_0's auc: 0.989804 valid_0's binary_logloss: 0.0967597
# Early stopping, best iteration is:
# [37] valid_0's auc: 0.992603 valid_0's binary_logloss: 0.0995741
#best_iteration_: the best iteration number over the whole training run
lgb_model.best_iteration_ #37
#feature_importances_: returns the feature importances
lgb_model.feature_importances_
# array([ 24, 121, 29, 15, 25, 20, 26, 67, 18, 50, 16, 17, 10,
# 33, 31, 42, 24, 41, 41, 31, 51, 104, 64, 49, 35, 24,
# 43, 69, 42, 15], dtype=int32)
#feature names of the dataset
cancer.feature_names
# array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
# 'mean smoothness', 'mean compactness', 'mean concavity',
# 'mean concave points', 'mean symmetry', 'mean fractal dimension',
# 'radius error', 'texture error', 'perimeter error', 'area error',
# 'smoothness error', 'compactness error', 'concavity error',
# 'concave points error', 'symmetry error',
# 'fractal dimension error', 'worst radius', 'worst texture',
# 'worst perimeter', 'worst area', 'worst smoothness',
# 'worst compactness', 'worst concavity', 'worst concave points',
# 'worst symmetry', 'worst fractal dimension'], dtype='<U23')
#predict using the best iteration found by early stopping
lgb_model.predict(x_test,num_iteration=lgb_model.best_iteration_)