Introduction to Boosted Trees

chiechie

Article link: https://blog.csdn.net/chiechie/article/details/49334437

Review of concepts of supervised learning

objective function

$Obj(\Theta) = L(\Theta)+\Omega(\Theta)\tag{2.1}$

where $L(\Theta)$ is the training loss, and $\Omega(\Theta)$ is the regularization term, which measures the complexity of the model.

loss function $L(\Theta)$

square loss: $l(y_i,\hat y_i) = \|y_i-\hat y_i\|^2$
logistic loss: $l(y_i,\hat y_i) = y_i \ln (1+e^{-\hat y_i})+ (1-y_i) \ln (1+e^{\hat y_i})$, for labels $y_i \in \{0,1\}$

regularization $\Omega(\Theta)$

l1 norm (lasso): $\Omega(w) = \lambda \| w\|_1$
l2 norm (ridge): $\Omega(w) = \lambda\|w\|_2^2$

lasso

$\sum\limits_{i=1}^{n}\|y_i-\hat y_i\|^2 +\lambda \| w\|_1$
- linear model, square loss, l1 regularization

ridge regression

$\sum\limits_{i=1}^{n}\|y_i-\hat y_i\|^2+ \lambda \| w\|_2^2$
- linear model, square loss, l2 regularization

logistic regression

$\sum\limits_{i=1}^{n}[y_i \ln (1+e^{-\hat y_i})+ (1-y_i) \ln (1+e^{\hat y_i})]+\lambda \| w\|_2^2$
- linear model, logistic loss, l2 regularization
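The three models above differ only in which loss and which regularizer are plugged into $Obj(\Theta) = L(\Theta)+\Omega(\Theta)$. The following is a minimal sketch of that idea for a linear model; the function names (`objective`, `l1`, `l2`, and so on) and the toy data are made up for illustration, and the logistic loss assumes labels in $\{0,1\}$ with $\hat y_i$ a raw score.

```python
import math

def square_loss(y, y_hat):
    return (y - y_hat) ** 2

def logistic_loss(y, y_hat):
    # y is a label in {0, 1}; y_hat is a raw score (logit)
    return y * math.log(1 + math.exp(-y_hat)) + (1 - y) * math.log(1 + math.exp(y_hat))

def l1(w, lam):
    return lam * sum(abs(v) for v in w)

def l2(w, lam):
    return lam * sum(v * v for v in w)

def objective(X, y, w, loss, reg, lam):
    """Obj = sum_i loss(y_i, w . x_i) + reg(w, lam) for a linear model."""
    preds = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]
    return sum(loss(yi, pi) for yi, pi in zip(y, preds)) + reg(w, lam)

# Toy data (hypothetical): two instances, two features
X = [[1.0, 2.0], [3.0, -1.0]]
y = [1.0, 0.0]
w = [0.5, -0.5]

lasso_obj = objective(X, y, w, square_loss, l1, lam=0.1)     # square loss + l1
ridge_obj = objective(X, y, w, square_loss, l2, lam=0.1)     # square loss + l2
logreg_obj = objective(X, y, w, logistic_loss, l2, lam=0.1)  # logistic loss + l2
```

Swapping the `loss` or `reg` argument changes the model without touching the rest of the code, which is the separation of model and objective that the rest of the article relies on.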

boosted trees

Regression tree(CART)

A regression tree, also known as a classification and regression tree (CART), contains one score in each leaf.

Regression tree ensemble

The prediction for each instance is the sum of the scores predicted by each of the trees.

tree ensemble methods

very widely used, e.g. GBM, random forest …
invariant to scaling of the inputs, so there is no need to worry about feature normalization
learn higher-order interactions between features

represent the RT in optimization method

model: assuming we have K trees

$\hat y_i = \sum\limits_{k=1}^K f_k(x_i), \quad f_k \in F$
- $F$ is the space of functions containing all regression trees.
- a regression tree is a function that maps the attributes to a score.
- each regression tree $f_k(x)$ is in charge of some part of the attributes.
Parameters: the structure of each tree $f_k(x)$, and the scores in its leaves
or simply $\Theta = \{f_1,\dots,f_K\}$
Objective

$Obj(\Theta) = \sum\limits_{i=1}^n l(y_i,\hat y_i)+ \sum\limits_{k=1}^K\Omega(f_k)\tag{2.3}$
how to define $\Omega$?
- number of nodes in the tree, depth
- l2 norm of the leaf weights
- …
how to define the loss function
- square loss results in the common gradient boosted machine
- logistic loss results in LogitBoost

how to solve the optimization problem

additive training (boosting)

at training round 0, start from a constant prediction, then add a new function in each round:
$\hat y_i^{(0)} = 0$
$\hat y_i^{(1)} = f_1{(x_i)} = \hat y_i^{(0)} + f_1{(x_i)}$
$\hat y_i^{(2)} = f_1{(x_i)} +f_2{(x_i)} = \hat y_i^{(1)} + f_2{(x_i)}$
$\dots$
at training round t, keep the functions added in previous rounds ( $\hat y_i^{(t-1)}$ ) and add a new function $f_t{(x_i)}$ :
$\hat y_i^{(t)} = \sum\limits_{k=1}^t f_{k}{(x_i)} = \hat y_i^{(t-1)} + f_t{(x_i)}$
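The recursion $\hat y_i^{(t)} = \hat y_i^{(t-1)} + f_t(x_i)$ can be sketched as a running sum. In this sketch `f1` and `f2` are hypothetical step functions standing in for fitted trees, and `additive_predictions` is an illustrative name, not part of any library.

```python
def additive_predictions(functions, x):
    """Return [y^(0), y^(1), ..., y^(t)] for a single input x."""
    y_hat = 0.0                 # round 0: constant prediction
    history = [y_hat]
    for f in functions:         # each round adds one function to the previous prediction
        y_hat = y_hat + f(x)
        history.append(y_hat)
    return history

# Hypothetical stand-ins for two fitted trees
f1 = lambda x: 2.0 if x < 5 else -1.0
f2 = lambda x: 0.5 if x < 3 else 0.1

history = additive_predictions([f1, f2], x=4)
```

Functions added in earlier rounds are never modified; each round only appends one more term to the sum.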

how we decide which f to add?

at round t,solve the optimization:

$\begin{aligned} \min _{f_t} Obj^{(t)}& = \sum\limits_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+ \sum\limits_{i=1}^t \Omega(f_i) \\ &= \sum\limits_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i)) +\Omega(f_t) +C \end{aligned}\tag{2.4}$
where $C$ collects the complexity of the trees added in previous rounds, which is constant at round t.
for the loss function:

square loss: $Obj^{(t)}$ becomes a quadratic function of $f_t$ ; set the derivative to 0, then…
other cases: we can use a numerical approach:
- define $g_i = \partial_{\hat y^{(t-1)}} l(y_i,\hat y_i^{(t-1)})$ , $h_i = \partial^2_{\hat y^{(t-1)}} l(y_i,\hat y_i^{(t-1)})$
- the second-order Taylor expansion of $Obj^{(t)}$ :
  $Obj^{(t)} \approx \sum\limits_{i=1}^n [l(y_i,\hat y_i^{(t-1)})+g_i f_t(x_i)+\frac{1}{2}h_i f_t^2(x_i)] +\Omega(f_t) +C$
- with constants removed, $Obj^{(t)}$ becomes:
  $\sum\limits_{i=1}^n [g_i f_t(x_i)+\frac{1}{2}h_i f_t^2(x_i)] +\Omega(f_t)$

so again we arrive at a convex quadratic optimization problem.
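To make $g_i$ and $h_i$ concrete, here is a sketch of the two statistics for the square loss and the logistic loss used earlier, together with the second-order Taylor approximation of a single loss term. The function names are illustrative, and the logistic case assumes labels in $\{0,1\}$ with $\hat y$ a raw score.

```python
import math

def square_loss(y, y_hat):
    return (y - y_hat) ** 2

def logistic_loss(y, y_hat):
    return y * math.log(1 + math.exp(-y_hat)) + (1 - y) * math.log(1 + math.exp(y_hat))

def grad_hess_square(y, y_hat):
    # l = (y - y_hat)^2  ->  g = 2 (y_hat - y),  h = 2
    return 2.0 * (y_hat - y), 2.0

def grad_hess_logistic(y, y_hat):
    # With p = sigmoid(y_hat):  g = p - y,  h = p (1 - p)
    p = 1.0 / (1.0 + math.exp(-y_hat))
    return p - y, p * (1.0 - p)

def taylor_approx(loss, grad_hess, y, y_hat, step):
    """l(y, y_hat + step) ~ l(y, y_hat) + g * step + 0.5 * h * step^2."""
    g, h = grad_hess(y, y_hat)
    return loss(y, y_hat) + g * step + 0.5 * h * step ** 2

# For the square loss the expansion is exact, because the loss is already quadratic.
exact = square_loss(3.0, 1.0 + 0.5)
approx = taylor_approx(square_loss, grad_hess_square, 3.0, 1.0, 0.5)
```

Here `step` plays the role of $f_t(x_i)$: the new tree only ever sees the loss through the pair $(g_i, h_i)$.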

how to define F: the space of functions containing all regression trees

map concept of tree to optimization

information gain -> training loss
pruning -> regularization defined by #nodes
max depth -> constraint on the function space
smoothing leaf values -> l2 regularization on leaf weights

refine the definition of tree

We define a tree by a vector of scores in the leaves, and a leaf index mapping function $q(x)$ that maps an instance to a leaf

$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^T, \quad q: \mathbb{R}^d \to \{1,2,\dots,T\}$
- $T$ is the number of leaves
- $q(x)$ is the index of the leaf that sample $x$ falls into, such as 1 or 3
- $w_j$ is the weight of leaf $j$
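The definition $f_t(x) = w_{q(x)}$ separates a tree into a structure ($q$) and a weight vector ($w$). A small sketch, with a hypothetical structure over two made-up attributes (`age`, `is_male`) and made-up leaf weights:

```python
def make_tree(q, w):
    """Return the function f(x) = w[q(x)] for a leaf-index map q and leaf weights w."""
    def f(x):
        return w[q(x) - 1]      # leaves are numbered 1..T, Python lists start at 0
    return f

# Hypothetical structure with T = 3 leaves
def q(x):
    age, is_male = x
    if age < 15:
        return 1 if is_male else 2
    return 3

w = [2.0, 0.1, -1.0]            # one score per leaf (illustrative values)
f = make_tree(q, w)

scores = [f((10, True)), f((10, False)), f((40, True))]
```

Changing `w` changes the scores without changing which leaf an instance falls into, which is what allows the leaf weights to be optimized for a fixed structure.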

Define the Complexity of Tree

Define complexity as (this is not the only possible definition)
$\Omega(f_t) = \gamma T +\frac{1}{2}\lambda\sum\limits_{j=1}^{T} w_j^2$

revisit the objective

Define the instance set in leaf j as
$I_j = \{ i \mid q(x_i) = j\}$
Regroup the objective by leaf
$\begin{aligned} Obj^{(t)} & \approx \sum\limits_{i=1}^n [g_i f_t(x_i)+\frac{1}{2}h_i f_t^2(x_i)] +\Omega(f_t) \\ & = \sum\limits_{i=1}^n [g_i w_{q(x_i)}+\frac{1}{2}h_i w_{q(x_i)}^2] +\gamma T +\frac{1}{2}\lambda\sum\limits_{j=1}^{T} w_j^2 \\ &= \sum\limits_{j=1}^{T} [(\sum\limits_{i \in I_j } g_i) w_j + \frac{1}{2}(\sum\limits_{i \in I_j } h_i+\lambda)w_j^2] +\gamma T \end{aligned}$
• This is a sum of T independent quadratic functions

The structure score

Two facts about single variable quadratic function:

$\arg\min_x\, Gx+\frac{1}{2} Hx^2 = -\frac{G}{H}, \quad \min_x\, Gx+\frac{1}{2} Hx^2 = -\frac{G^2}{2H}$ , for $H > 0$
define $G_j = \sum\limits_{i \in I_j } g_i$ , $H_j = \sum\limits_{i \in I_j } h_i$ , then the objective becomes

$\begin{aligned} Obj^{(t)} &=\sum\limits_{j=1}^{T} [(\sum\limits_{i \in I_j } g_i) w_j + \frac{1}{2}(\sum\limits_{i \in I_j } h_i+\lambda)w_j^2] +\gamma T\\ &=\sum\limits_{j=1}^{T} [G_jw_j + \frac{1}{2}(H_j+\lambda)w_j^2] +\gamma T \end{aligned}$

Assuming the structure of the tree $q(x)$ is fixed, the optimal weight in each leaf is
$w_j^{\ast} = -\frac{G_j}{H_j+\lambda}$
and the resulting objective value is
$Obj = -\frac{1}{2}\sum\limits_{j=1}^T \frac{G_j^2}{H_j+\lambda}+\gamma T$
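As a numeric illustration of the two formulas above, the sketch below sums $g_i$ and $h_i$ per leaf, computes $w_j^{\ast}$ and evaluates the structure score. The data (`g`, `h`, `leaf_of`) and the values of $\lambda$ and $\gamma$ are made up; `leaf_of[i]` plays the role of $q(x_i)$.

```python
def leaf_stats(g, h, leaf_of, T):
    """Sum gradients and Hessians per leaf: G_j and H_j for j = 1..T."""
    G = [0.0] * T
    H = [0.0] * T
    for gi, hi, j in zip(g, h, leaf_of):
        G[j - 1] += gi
        H[j - 1] += hi
    return G, H

def optimal_weights(G, H, lam):
    # w_j* = -G_j / (H_j + lambda)
    return [-Gj / (Hj + lam) for Gj, Hj in zip(G, H)]

def structure_score(G, H, lam, gamma):
    # Obj = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T
    return -0.5 * sum(Gj * Gj / (Hj + lam) for Gj, Hj in zip(G, H)) + gamma * len(G)

# Made-up statistics for 4 instances falling into T = 2 leaves
g = [-4.0, -2.0, 6.0, 2.0]
h = [2.0, 2.0, 2.0, 2.0]
leaf_of = [1, 1, 2, 2]
lam, gamma = 1.0, 0.5

G, H = leaf_stats(g, h, leaf_of, T=2)
w_star = optimal_weights(G, H, lam)
score = structure_score(G, H, lam, gamma)

# Plugging w* back into the per-leaf quadratic form gives the same value.
plugged = sum(Gj * wj + 0.5 * (Hj + lam) * wj * wj
              for Gj, Hj, wj in zip(G, H, w_star)) + gamma * len(G)
```

A lower score means a better structure, so the score can be used to compare candidate tree structures without fitting the weights separately.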

EXAMPLE

here comes the example

[Figure: example of the structure score calculation]

search algorithm for single tree

enumerate the possible tree structures $q(x)$
calculate the corresponding structure score for each $q(x)$, using the scoring function:
$Obj = -\frac{1}{2}\sum\limits_{j=1}^T \frac{G_j^2}{H_j+\lambda}+\gamma T$
find the best tree structure, and use the optimal leaf weights
$w_j^\ast =- \frac{G_j}{H_j+\lambda}$

But… there can be infinitely many possible tree structures.

so in practice we do not enumerate them but greedily grow the tree:

start from a tree with depth 0
For each leaf node of the tree, try to add a split. The reduction in the objective after adding the split is:

$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma$
- the scores of the left/right child: $\frac{G_L^2}{H_L+\lambda}$ and $\frac{G_R^2}{H_R+\lambda}$
- the score if we do not split: $\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}$
- the complexity cost of introducing an additional leaf: $\gamma$
- the factor $\frac{1}{2}$ comes from the $-\frac{1}{2}$ in the structure score
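The gain of a candidate split can be computed from the four sums $G_L, H_L, G_R, H_R$ alone. The sketch below includes the factor $\frac{1}{2}$ implied by the structure score $Obj = -\frac{1}{2}\sum_j \frac{G_j^2}{H_j+\lambda}+\gamma T$; presentations that drop this factor differ only by a rescaling of $\gamma$. The numbers are made up, and `split_gain` is an illustrative name.

```python
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Reduction in the objective when one leaf is split into a left and a right child."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# Made-up sums for the two children
gain = split_gain(G_L=-6.0, H_L=4.0, G_R=8.0, H_R=4.0, lam=1.0, gamma=0.5)

# The same quantity as a difference of structure scores (one leaf vs two leaves)
obj_before = -0.5 * (2.0 ** 2) / (8.0 + 1.0) + 0.5 * 1
obj_after = -0.5 * (36.0 / 5.0 + 64.0 / 5.0) + 0.5 * 2
```

A positive gain means the split lowers the objective by more than the cost $\gamma$ of the extra leaf.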
Remaining question: how do we find the best split?

Efficient Finding of the Best Split

first take a look at the Algorithm for Split Finding

Algorithm for Split Finding

For each node, enumerate over all features
For each feature, sort the instances by feature value
Use a linear scan to decide the best split along that feature
Take the best split solution along all the features

The time complexity of growing a tree of depth K is $O(ndK\log n)$: each level needs $O(n\log n)$ time to sort the instances for one feature, there are d features, and we need to do it for K levels.

This can be further optimized (e.g. by using approximation or caching the sorted features)
It can scale to very large datasets
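The split-finding steps above can be sketched as a sort followed by a linear scan with running sums. This is a simplified illustration for a single node, not an optimized implementation: placing the threshold at the midpoint between neighbouring values is an assumption of this sketch, the gain includes the factor $\frac{1}{2}$ implied by the structure score, and the data are made up.

```python
def best_split(X, g, h, lam, gamma):
    """Return (gain, feature, threshold) of the best split, or None if no split is possible."""
    n, d = len(X), len(X[0])
    G, H = sum(g), sum(h)

    def score(Gs, Hs):
        return Gs * Gs / (Hs + lam)

    best = None
    for k in range(d):                                       # enumerate all features
        order = sorted(range(n), key=lambda i: X[i][k])      # sort instances by feature value
        G_L = H_L = 0.0
        for pos in range(n - 1):                             # linear scan over split positions
            i, nxt = order[pos], order[pos + 1]
            G_L += g[i]
            H_L += h[i]
            if X[i][k] == X[nxt][k]:
                continue                                     # cannot split between equal values
            gain = 0.5 * (score(G_L, H_L) + score(G - G_L, H - H_L)
                          - score(G, H)) - gamma
            if best is None or gain > best[0]:
                best = (gain, k, 0.5 * (X[i][k] + X[nxt][k]))
    return best

# Made-up data: 4 instances, 2 features
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 9.0], [4.0, 1.0]]
g = [-4.0, -2.0, 6.0, 2.0]
h = [2.0, 2.0, 2.0, 2.0]

gain, feature, threshold = best_split(X, g, h, lam=1.0, gamma=0.5)
```

Because $G_R = G - G_L$ and $H_R = H - H_L$, each candidate split costs constant time once the instances are sorted, so the sort dominates the cost per feature.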

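Here $O(\cdot)$ is big-O notation. Let f and g be two functions defined on some subset of the real numbers. One writes $f(x) = O(g(x))$ as $x \to \infty$ if there exist constants $M > 0$ and $x_0$ such that $|f(x)| \le M|g(x)|$ for all $x \ge x_0$.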

What about Categorical Variables?

Some tree learning algorithms handle categorical variables and continuous variables separately. Actually this is not necessary:
we can easily use the scoring formula we derived to score splits based on categorical variables. We can encode a categorical variable as a numerical vector
using one-hot encoding. Allocate a vector whose length is the number of categories:
$z_j = \begin{cases} 1, & \text{if } x \text{ is in category } j \\ 0, & \text{otherwise} \end{cases}$
The vector will be sparse if there are lots of categories, so a learning algorithm that handles sparse data is preferred
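A minimal sketch of the encoding, in a dense form and in a sparse form that stores only the index of the non-zero entry. The category names are made up, and the dictionary-based sparse format is just one possible representation chosen for this example.

```python
def one_hot(value, categories):
    """Dense encoding: z_j = 1 if value is category j, else 0."""
    return [1 if value == c else 0 for c in categories]

def one_hot_sparse(value, categories):
    """Sparse encoding: keep only the position of the single non-zero entry."""
    return {categories.index(value): 1}

categories = ["red", "green", "blue"]           # hypothetical categories
dense_rows = [one_hot(v, categories) for v in ["green", "blue", "green"]]
sparse_rows = [one_hot_sparse(v, categories) for v in ["green", "blue", "green"]]
```

After encoding, each $z_j$ is an ordinary numeric feature, so a split on $z_j$ separates category $j$ from all the others and is scored with the same gain formula.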

Pruning and Regularization

Pre-stopping
- Stop splitting if the best split has negative gain
- But maybe a split can benefit future splits…
Post-pruning
- Grow a tree to maximum depth,
- then recursively prune all the leaf splits with negative gain
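Post-pruning can be sketched as a bottom-up pass over the grown tree. The dictionary representation below is hypothetical: an internal node stores the `gain` of its split and the `weight` it would have as a leaf, and a leaf stores only `weight`. All numbers are made up.

```python
def is_leaf(node):
    return "left" not in node

def prune(node):
    """Bottom-up: collapse a split when both children are leaves and its gain is negative."""
    if is_leaf(node):
        return node
    node = dict(node, left=prune(node["left"]), right=prune(node["right"]))
    if is_leaf(node["left"]) and is_leaf(node["right"]) and node["gain"] < 0:
        return {"weight": node["weight"]}
    return node

tree = {
    "gain": 3.0, "weight": 0.0,
    "left": {"gain": -0.2, "weight": 1.0,
             "left": {"weight": 1.1}, "right": {"weight": 0.9}},
    "right": {"gain": -0.5, "weight": -1.0,
              "left": {"gain": 1.0, "weight": -0.5,
                       "left": {"weight": -2.0}, "right": {"weight": 0.5}},
              "right": {"weight": -1.5}},
}

pruned = prune(tree)
```

In this example the left split (gain -0.2) is removed, while the right split (gain -0.5) is kept because a later split below it has positive gain, which is the case pre-stopping would have missed.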

Recap: Boosted Tree Algorithm

Add a new tree in each iteration
At the beginning of each iteration, calculate $g_i,h_i$
Use these statistics to greedily grow a tree $f_t(x)$
$Obj = -\frac{1}{2}\sum\limits_{j=1}^T \frac{G_j^2}{H_j+\lambda}+\gamma T$
Add $f_t(x)$ to the model: $\hat y_i^{(t)} =\hat y_i^{(t-1)}+ f_t(x_i)$
Usually, we instead do $\hat y_i^{(t)} =\hat y_i^{(t-1)}+ \epsilon f_t(x_i)$
$\epsilon$ is called the step size or shrinkage, usually set to around 0.1
This means we do not do full optimization in each step and leave room for future rounds, which helps prevent overfitting
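The recap can be put together in a small end-to-end sketch. To keep it short it makes several simplifying assumptions that are not part of the text: square loss $l = (y-\hat y)^2$, a single numeric feature, and trees of depth 1 (stumps). The names `fit_stump` and `boost` and the toy data are made up for illustration.

```python
def fit_stump(x, g, h, lam, gamma):
    """Greedily grow a depth-1 tree on one feature and return it as a function."""
    n = len(x)
    G, H = sum(g), sum(h)
    score = lambda Gs, Hs: Gs * Gs / (Hs + lam)
    order = sorted(range(n), key=lambda i: x[i])
    best_gain, best = 0.0, None
    G_L = H_L = 0.0
    for pos in range(n - 1):
        i, nxt = order[pos], order[pos + 1]
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[nxt]:
            continue
        gain = 0.5 * (score(G_L, H_L) + score(G - G_L, H - H_L) - score(G, H)) - gamma
        if gain > best_gain:
            best_gain = gain
            # threshold and the optimal weights of the two leaves
            best = (0.5 * (x[i] + x[nxt]),
                    -G_L / (H_L + lam),
                    -(G - G_L) / (H - H_L + lam))
    if best is None:                      # no split with positive gain: keep a single leaf
        w = -G / (H + lam)
        return lambda v: w
    thr, w_L, w_R = best
    return lambda v: w_L if v < thr else w_R

def boost(x, y, rounds, eps=0.1, lam=1.0, gamma=0.0):
    y_hat = [0.0] * len(x)                # round 0: constant prediction
    trees = []
    for _ in range(rounds):
        g = [2.0 * (p - t) for p, t in zip(y_hat, y)]   # g_i for square loss
        h = [2.0] * len(x)                              # h_i for square loss
        f = fit_stump(x, g, h, lam, gamma)
        trees.append(f)
        y_hat = [p + eps * f(v) for p, v in zip(y_hat, x)]   # shrinkage step
    return trees, y_hat

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
trees, y_hat = boost(x, y, rounds=50)
final_loss = sum((t - p) ** 2 for t, p in zip(y, y_hat))
```

With $\epsilon = 0.1$ each round only moves the predictions a fraction of the way towards the optimum of the current quadratic, so many small trees are needed, which is the trade-off described above.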

Questions 1

How can we build a boosted tree model for a weighted regression problem, in which each instance has an importance weight?

Define the objective, calculate $g_i, h_i$, and feed them to the tree learning algorithm we already have for the unweighted version
Again, think of the separation of model and objective, and how the theory can help to better organize a machine learning toolkit
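One way to read this answer, assuming the weighted objective is $\sum_i a_i\, l(y_i, \hat y_i)$ with importance weights $a_i$ (this form is an assumption of the sketch, the text does not spell it out): the weights simply scale $g_i$ and $h_i$, and the tree learner itself is unchanged. The names below are illustrative.

```python
def grad_hess_square(y, y_hat):
    # l = (y - y_hat)^2  ->  g = 2 (y_hat - y),  h = 2
    return 2.0 * (y_hat - y), 2.0

def weighted_grad_hess(grad_hess, y, y_hat, weights):
    """Scale the per-instance statistics g_i, h_i by importance weights a_i."""
    g, h = [], []
    for yi, pi, ai in zip(y, y_hat, weights):
        gi, hi = grad_hess(yi, pi)
        g.append(ai * gi)
        h.append(ai * hi)
    return g, h

y = [1.0, 2.0]
y_hat = [0.0, 0.0]
a = [1.0, 3.0]                              # made-up importance weights

g, h = weighted_grad_hess(grad_hess_square, y, y_hat, a)
leaf_weight = -sum(g) / (sum(h) + 0.0)      # optimal weight of a single leaf with lambda = 0
```

With $\lambda = 0$ the optimal leaf weight equals the weighted mean of the targets, which is what one would expect from weighted least squares.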

Questions 2

Back to the time series problem: if I want to learn step functions over time, are there other ways to learn the time splits, other than the top-down split approach?

All that is important is the structure score of the splits

$Obj = -\frac{1}{2}\sum\limits_{j=1}^T \frac{G_j^2}{H_j+\lambda}+\gamma T$
- Top-down greedy, same as trees
- Bottom-up greedy, start from individual points as each group, greedily merge neighbors
- Dynamic programming, can find optimal solution for this case
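The dynamic programming option can be sketched as follows, assuming the instances are ordered in time and each group must be a contiguous segment. Each segment contributes $-\frac{1}{2}\frac{G^2}{H+\lambda}+\gamma$ to the structure score, and the recursion tries every possible last segment. The function name and the data are made up; this simple version runs in $O(n^2)$ time.

```python
def optimal_segments(g, h, lam, gamma):
    """Return (best structure score, list of [start, end) segments) over all segmentations."""
    n = len(g)
    PG, PH = [0.0], [0.0]                 # prefix sums of g and h
    for gi, hi in zip(g, h):
        PG.append(PG[-1] + gi)
        PH.append(PH[-1] + hi)

    def cost(a, b):                       # score contribution of segment [a, b)
        G, H = PG[b] - PG[a], PH[b] - PH[a]
        return -0.5 * G * G / (H + lam) + gamma

    best = [0.0] + [float("inf")] * n     # best[i]: optimal score of the first i points
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):                # last segment is [j, i)
            c = best[j] + cost(j, i)
            if c < best[i]:
                best[i], prev[i] = c, j

    segments, i = [], n                   # backtrack the segment boundaries
    while i > 0:
        segments.append((prev[i], i))
        i = prev[i]
    segments.reverse()
    return best[n], segments

# Made-up series: square loss with y_hat = 0, so g_i = -2 * y_i and h_i = 2
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
g = [-2.0 * v for v in y]
h = [2.0] * len(y)

total, segments = optimal_segments(g, h, lam=1.0, gamma=1.0)
```

Unlike the greedy top-down and bottom-up approaches, this search is exhaustive over contiguous segmentations, which is possible here because the order of the points is fixed.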

Summary

• The separation between model, objective, parameters can be helpful for us to understand and customize learning models
• The bias-variance trade-off applies everywhere, including learning in functional space
• We can be formal about what we learn and how we learn. Clear understanding of theory can be used to guide cleaner implementation.