Boosted Tree
Definition:
$$\widehat{y}=\sum_{k=1}^{K}f_k(x)$$
In which each $f_k(x)$ is one of $K$ regression trees.
Loss:
$$Loss=\sum_{i=1}^{n}L(y_i,\widehat{y}_i)$$
Add some regularization:
$$Loss=\sum_{i=1}^{n}L(y_i,\widehat{y}_i)+\sum_{k=1}^{K}\Omega(f_k)$$
Additive Training:
$$\widehat{y}^{(0)}=0$$
$$\widehat{y}^{(t)}=\widehat{y}^{(t-1)}+f_t(x)$$
$$Loss^{(t)}=\sum_{i=1}^{n}L(y_i,\widehat{y}_i^{(t)})+\sum_{k=1}^{t}\Omega(f_k)$$
$$=\sum_{i=1}^{n}L(y_i,\widehat{y}_i^{(t-1)}+f_t(x_i))+\sum_{k=1}^{t-1}\Omega(f_k)+\Omega(f_t)$$
$$=\sum_{i=1}^{n}L(y_i,\widehat{y}_i^{(t-1)}+f_t(x_i))+\Omega(f_t)+C$$
Expanding $L$ to second order around $\widehat{y}_i^{(t-1)}$:
$$\approx\sum_{i=1}^{n}\left[L(y_i,\widehat{y}_i^{(t-1)})+f_t(x_i)\frac{\partial L}{\partial \widehat{y}_i^{(t-1)}}+\frac{1}{2}f_t^{2}(x_i)\frac{\partial^{2} L}{\partial (\widehat{y}_i^{(t-1)})^{2}}\right]+\Omega(f_t)+C$$
With $G_i=\frac{\partial L}{\partial \widehat{y}_i^{(t-1)}}$ and $H_i=\frac{\partial^{2} L}{\partial (\widehat{y}_i^{(t-1)})^{2}}$, and folding all terms that do not depend on $f_t$ into the constant $C'$:
$$=\sum_{i=1}^{n}\left[L(y_i,\widehat{y}_i^{(t-1)})+f_t(x_i)G_i+\frac{1}{2}f_t^{2}(x_i)H_i\right]+\Omega(f_t)+C$$
$$=\sum_{i=1}^{n}\left[f_t(x_i)G_i+\frac{1}{2}f_t^{2}(x_i)H_i\right]+\Omega(f_t)+C'$$
Loss at time t is:
$$Loss^{(t)}=\sum_{i=1}^{n}\left[f_t(x_i)G_i+\frac{1}{2}f_t^{2}(x_i)H_i\right]+\Omega(f_t)+C'$$
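As a concrete check of the expansion above: for the squared loss $L(y,\widehat{y})=\frac{1}{2}(y-\widehat{y})^{2}$ we get $G_i=\widehat{y}_i^{(t-1)}-y_i$ and $H_i=1$, and the second-order approximation is exact. A minimal NumPy sketch with made-up illustrative values:

```python
import numpy as np

# Squared loss L(y, yhat) = 0.5 * (y - yhat)^2
# => G_i = yhat - y (first derivative), H_i = 1 (second derivative)
y = np.array([1.0, 2.0, 3.0])
yhat_prev = np.array([0.5, 2.5, 2.0])   # predictions after t-1 trees
ft = np.array([0.3, -0.2, 0.6])         # outputs of the new tree f_t

G = yhat_prev - y
H = np.ones_like(y)

exact = 0.5 * (y - (yhat_prev + ft)) ** 2
approx = 0.5 * (y - yhat_prev) ** 2 + ft * G + 0.5 * ft ** 2 * H

# For squared loss the second-order Taylor expansion is exact.
print(np.allclose(exact, approx))  # True
```

For other losses (e.g. logistic) the expansion is only an approximation, but the same $G_i$, $H_i$ statistics drive the rest of the derivation.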
Parameterize the tree by its leaf weights:
$$f_t(x)=w_{q(x)},\quad q:\mathbb{R}^{d}\rightarrow\{1,2,\dots,M\},\quad w_j\in\mathbb{R}$$
where $q$ maps a sample to one of the tree's $M$ leaves and $w_j$ is the weight of leaf $j$.
$$\Omega(f)=\frac{1}{2}\lambda\sum_{j=1}^{M}w_j^{2}+\gamma M$$
We get:
$$Loss^{(t)}=\sum_{i=1}^{n}\left[f_t(x_i)G_i+\frac{1}{2}f_t^{2}(x_i)H_i\right]+\Omega(f_t)+C'$$
$$=\sum_{i=1}^{n}\left[w_{q(x_i)}G_i+\frac{1}{2}w_{q(x_i)}^{2}H_i\right]+\frac{1}{2}\lambda\sum_{j=1}^{M}w_j^{2}+\gamma M+C'$$
With $I_j=\{i\mid q(x_i)=j\}$, the set of samples assigned to leaf $j$:
$$\sum_{i=1}^{n}w_{q(x_i)}G_i=\sum_{j=1}^{M}\left[w_j\sum_{i\in I_j}G_i\right]$$
$$\sum_{i=1}^{n}\frac{1}{2}w_{q(x_i)}^{2}H_i=\sum_{j=1}^{M}w_j^{2}\sum_{i\in I_j}\frac{1}{2}H_i$$
So:
$$Loss^{(t)}=\sum_{j=1}^{M}\left[w_j\sum_{i\in I_j}G_i+w_j^{2}\sum_{i\in I_j}\frac{1}{2}H_i+\frac{1}{2}\lambda w_j^{2}\right]+\gamma M+C'$$
$$=\sum_{j=1}^{M}\left[w_j\sum_{i\in I_j}G_i+\frac{1}{2}w_j^{2}\Big(\lambda+\sum_{i\in I_j}H_i\Big)\right]+\gamma M+C'$$
With $G_j'=\sum_{i\in I_j}G_i$ and $H_j'=\sum_{i\in I_j}H_i$:
$$Loss^{(t)}=\sum_{j=1}^{M}\left[w_jG_j'+\frac{1}{2}w_j^{2}(\lambda+H_j')\right]+\gamma M+C'$$
Finally:
$$w_j^{*}=\operatorname*{arg\,min}_{w_j}\left(w_jG_j'+\frac{1}{2}w_j^{2}(\lambda+H_j')\right)=-\frac{G_j'}{\lambda+H_j'}$$
$$Obj^{(t)}=\min(Loss^{(t)})=-\frac{1}{2}\sum_{j=1}^{M}\frac{G_j'^{2}}{H_j'+\lambda}+\gamma M+C'$$
So at each iteration $t$ of training, we greedily search for a regression tree $f_t(x_i)=w_{q(x_i)}$ with leaf weights $w_j=-\frac{G_j'}{\lambda+H_j'}$ that minimizes $Obj^{(t)}$, and add it to the model.
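The per-leaf aggregation and closed-form leaf weights above can be sketched as follows (NumPy; the leaf assignments, gradients, and regularization values are made-up illustrative numbers):

```python
import numpy as np

def leaf_weights_and_obj(leaf, G, H, lam=1.0, gamma=0.1):
    """Given each sample's leaf index q(x_i), gradients G_i and hessians H_i,
    return the optimal leaf weights w_j* and the minimized objective
    (up to the constant C')."""
    M = leaf.max() + 1
    Gp = np.bincount(leaf, weights=G, minlength=M)  # G'_j = sum_{i in I_j} G_i
    Hp = np.bincount(leaf, weights=H, minlength=M)  # H'_j = sum_{i in I_j} H_i
    w = -Gp / (Hp + lam)                            # w_j* = -G'_j / (H'_j + lambda)
    obj = -0.5 * np.sum(Gp ** 2 / (Hp + lam)) + gamma * M
    return w, obj

leaf = np.array([0, 0, 1, 1, 1])          # q(x_i) for 5 samples, M = 2 leaves
G = np.array([-0.5, -0.3, 0.2, 0.4, 0.1])  # per-sample gradients
H = np.ones(5)                             # hessians (e.g. squared loss)
w, obj = leaf_weights_and_obj(leaf, G, H)
```

In practice the tree structure $q$ itself is grown greedily by evaluating, for each candidate split, how much it reduces this same objective.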
Factorization Machines
$$y=w_0+\sum_{i=1}^{n}w_ix_i+\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\langle\boldsymbol{v}_i,\boldsymbol{v}_j\rangle x_ix_j$$
In which
$$w_0\in\mathbb{R},\quad \boldsymbol{w}\in\mathbb{R}^{n},\quad \boldsymbol{v}\in\mathbb{R}^{n\times k}$$
$$\langle\boldsymbol{v}_i,\boldsymbol{v}_j\rangle=\sum_{l=1}^{k}v_{il}v_{jl}$$
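A minimal sketch of the FM prediction (NumPy, random illustrative parameters). The naive double sum over feature pairs is $O(kn^{2})$; it can equivalently be computed in $O(kn)$ via the identity $\sum_{i<j}\langle\boldsymbol{v}_i,\boldsymbol{v}_j\rangle x_ix_j=\frac{1}{2}\sum_{l=1}^{k}\big[(\sum_i v_{il}x_i)^{2}-\sum_i v_{il}^{2}x_i^{2}\big]$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3                      # n features, k-dimensional factors
w0 = 0.1
w = rng.normal(size=n)
V = rng.normal(size=(n, k))      # row i is v_i
x = rng.normal(size=n)

# Naive O(k n^2): explicit sum over feature pairs i < j
pairwise = sum(V[i] @ V[j] * x[i] * x[j]
               for i in range(n - 1) for j in range(i + 1, n))
y_naive = w0 + w @ x + pairwise

# Equivalent O(k n): 0.5 * sum_l [ (sum_i v_il x_i)^2 - sum_i v_il^2 x_i^2 ]
pairwise_fast = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ x ** 2)
y_fast = w0 + w @ x + pairwise_fast

print(np.isclose(y_naive, y_fast))  # True
```

The $O(kn)$ form is what makes FMs practical on sparse, high-dimensional inputs, since the inner sums only run over the nonzero entries of $x$.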