CSDN - Reading the LightGBM Source + Theory (Implementation Details of Categorical Feature and Missing-Value Handling)
Debug setup
Change -O3 to -O0; with instruction-level optimization on, some values cannot be inspected in the debugger.
Comment out the #pragma omp parallel for
preprocessor directives to turn the parallel loops serial; otherwise the debugger cannot step into them.
Start with binary classification as the example, using the official sample data.
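For reference, the build tweak above can be sketched as a CMake fragment. This is an assumed typical CMake setup, not copied from LightGBM's actual CMakeLists.txt; check the version you have for the exact option names:

```cmake
# Debug build: no optimization, keep symbols (-O0 -g instead of -O3)
set(CMAKE_BUILD_TYPE Debug)
set(CMAKE_CXX_FLAGS_DEBUG "-O0 -g")
# If the LightGBM version exposes an OpenMP switch, turning it off
# makes the parallel loops serial without editing the pragmas by hand.
# option(USE_OPENMP "Enable OpenMP" OFF)
```

Rebuilding from a clean build directory avoids stale optimized object files.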
From main to GBDT::Train
Execution path
- execution path of the main function
Application::Train
src/application/application.cpp:200
boosting_->Train(config_.snapshot_freq, config_.output_model);
Looking at boosting
include/LightGBM/boosting.h
Boosting is an abstract class; the boosting_type parameter selects one of its four concrete subclasses.
(TODO: draw the inheritance diagram sometime.)
Find the GBDT::Train function
src/boosting/gbdt.cpp:246
fun_timer appears to be a RAII object:
Common::FunctionTimer fun_timer("GBDT::Train", global_timer);
The training loop itself is plain iteration; look at TrainOneIter.
Note that the caller invokes it as is_finished = TrainOneIter(nullptr, nullptr); both gradient and hessian are null pointers.
GBDT has two member variables holding the per-iteration gradients and Hessians (backed by a custom aligned allocator).
Storing smart pointers in a vector is more robust than storing raw pointers (compare the sLSM source).
/*! \brief First order derivative of training data */
std::vector<score_t, Common::AlignmentAllocator<score_t, kAlignedSize>> gradients_;
/*! \brief Second order derivative of training data */
std::vector<score_t, Common::AlignmentAllocator<score_t, kAlignedSize>> hessians_;
/*! \brief Trained models(trees) */
std::vector<std::unique_ptr<Tree>> models_;
/*! \brief Tree learner, will use this class to learn trees */
std::unique_ptr<TreeLearner> tree_learner_;
/*! \brief Objective function */
const ObjectiveFunction* objective_function_;
/*! \brief Pointer to training data */
const Dataset* train_data_;
Focus on TrainOneIter
GBDT::BoostFromAverage
Focus on how the initial score is set
- BinaryLogLoss
double init_score = ObtainAutomaticInitialScore(objective_function_, class_id);
which immediately prints:
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.530877 -> initscore=0.123666
Look into ObtainAutomaticInitialScore
Jump to src/objective/binary_objective.hpp:134
Of the 7,000 training samples in total, 3,716 are positive, so pavg ≈ 0.53.
$initscore = \log\left(\frac{pavg}{1 - pavg}\right)$
The computed value is a logit (the inverse sigmoid of pavg).