Similarities: both are ensemble models. Differences: 1. Bagging is an ensemble of low-bias, high-variance base learners that tend to overfit; it averages the models' predictions (e.g., random forests). 2. Boosting is an ensemble of high-bias, underfitting weak learners; it sums the models' predictions. A sketch of the bagging side follows; the boosting side is illustrated in the example after it.
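To make the contrast concrete, here is a minimal sketch of the bagging rule (assumed setup: synthetic 1-D regression data and scikit-learn's `DecisionTreeRegressor`; an illustration, not a full random forest): deep trees are fit on bootstrap resamples and their predictions are *averaged*.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))                 # assumed toy data
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

preds = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))        # bootstrap resample
    # deep tree: low bias, high variance -- the kind bagging stabilizes
    tree = DecisionTreeRegressor(max_depth=None).fit(X[idx], y[idx])
    preds.append(tree.predict(X))

y_bag = np.mean(preds, axis=0)                        # bagging: AVERAGE the predictions
print("bagged MSE:", np.mean((y - y_bag) ** 2))
```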
Example:
Train on the data $(x, y)$ to get a prediction $\hat{y}_{1}$; the residual $y-\hat{y}_{1}$ is large. Train on $(x, y-\hat{y}_{1})$ to get $\hat{y}_{2}$; the residual $y-\hat{y}_{1}-\hat{y}_{2}$ is still sizable. Train on $(x, y-\hat{y}_{1}-\hat{y}_{2})$ to get $\hat{y}_{3}$; the residual $y-\hat{y}_{1}-\hat{y}_{2}-\hat{y}_{3}$ is now small. So $\hat{y}_{1}+\hat{y}_{2}+\hat{y}_{3}$ predicts $y$ more accurately.
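A minimal sketch of this residual-fitting loop, under the same assumed synthetic setup (scikit-learn's `DecisionTreeRegressor` as the weak learner; kept shallow so each tree underfits), with predictions *summed*:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))                 # assumed toy data
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

y_hat = np.zeros_like(y)                              # \hat{y}^{(0)} = 0
for k in range(3):
    residual = y - y_hat                              # y - \hat{y}_1 - ... - \hat{y}_k
    # shallow tree: a weak, underfitting learner fit to the residual
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    y_hat += tree.predict(X)                          # boosting: SUM the predictions
    print(f"after tree {k+1}: residual MSE = {np.mean((y - y_hat) ** 2):.4f}")
```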
More generally, boosting builds an additive model:

$$\begin{aligned} \hat{y}^{(0)} &= f_{0}(x) = 0 \\ \hat{y}^{(1)} &= \hat{y}^{(0)} + f_{1}(x) = f_{1}(x)\\ &\;\;\vdots \\ {\color{red}\hat{y}} = \hat{y}^{(K)} &= \hat{y}^{(K-1)} + f_{K}(x) = \sum_{k=1}^{K}f_{k}(x) \end{aligned}$$
The objective to minimize is the regularized loss

$$L = \sum_{i=1}^{n}l(y_{i},\hat{y}_{i}) + \sum_{k=1}^{K}\Omega(f_{k})$$

where $l(y_{i},\hat{y}_{i})$ is the loss on the $i$-th sample and $\Omega(f_{k})$ is the complexity of the $k$-th tree. After the first $K-1$ trees have been built, to train the $K$-th tree we expand the loss in a second-order Taylor series in $f_{K}(x_{i})$ around $\hat{y}_{i}^{(K-1)}$:

$$\begin{aligned} L &=\sum_{i=1}^{n} l\left(y_{i},\hat{y}_{i}^{(K-1)}+f_{K}(x_{i})\right) + \sum_{k=1}^{K}\Omega(f_{k})\\ &\approx \sum_{i=1}^{n} \left\{ l\left(y_{i},\hat{y}_{i}^{(K-1)}\right) + \partial_{\hat{y}_{i}^{(K-1)}} l\left(y_{i},\hat{y}_{i}^{(K-1)}\right)f_{K}(x_{i}) + \frac{1}{2}\partial^{2}_{\hat{y}_{i}^{(K-1)}} l\left(y_{i},\hat{y}_{i}^{(K-1)}\right)f_{K}^{2}(x_{i}) \right\}+ \sum_{k=1}^{K}\Omega(f_{k}) \end{aligned}$$

Since the first $K-1$ trees are fixed, the terms $l\left(y_{i},\hat{y}_{i}^{(K-1)}\right)$ and $\sum_{k=1}^{K-1}\Omega(f_{k})$ are constants; dropping them, the objective simplifies to

$$L^{(K)} = \sum_{i=1}^{n} \left\{ g_{i}f_{K}(x_{i}) + \frac{1}{2}h_{i}f_{K}^{2}(x_{i}) \right\} + \Omega(f_{K}),$$

where

$$g_{i} = \partial_{\hat{y}_{i}^{(K-1)}} l\left(y_{i},\hat{y}_{i}^{(K-1)}\right), \qquad h_{i} = \partial^{2}_{\hat{y}_{i}^{(K-1)}} l\left(y_{i},\hat{y}_{i}^{(K-1)}\right).$$
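For concreteness, a small sketch of what $g_i$ and $h_i$ look like for two common choices of $l$ (assumed here: squared error $l = \frac{1}{2}(y-\hat{y})^2$, and logistic loss with labels $y \in \{0,1\}$ where $\hat{y}$ is a raw score):

```python
import numpy as np

def grad_hess_squared(y, y_hat):
    # l = 1/2 (y - y_hat)^2  =>  g = y_hat - y,  h = 1
    return y_hat - y, np.ones_like(y)

def grad_hess_logistic(y, y_hat):
    # l = -y log p - (1-y) log(1-p) with p = sigmoid(y_hat)
    # =>  g = p - y,  h = p (1 - p)
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)
```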
When the tree structure is fixed, let $w=(w_{1}, \cdots, w_{T})$ be the values of the leaves and let $q(x_{i})$ be the index of the leaf that sample $x_{i}$ falls into; write $I_{j} = \{ i \mid q(x_{i})=j \}$ for the set of samples in leaf $j$. Then

$$f_{K}(x_{i}) = w_{q(x_{i})}$$
$$\Omega(f_{K}) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}w_{j}^{2}$$

where $T$ is the number of leaves.
Substituting $f_{K}$ and $\Omega(f_{K})$ into the loss gives

$$\begin{aligned} obj = L^{(K)} &= \sum_{i=1}^{n} \left\{ g_{i}f_{K}(x_{i}) + \frac{1}{2}h_{i}f_{K}^{2}(x_{i}) \right\} + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}w_{j}^{2} \\ &= \sum_{i=1}^{n} \left\{ g_{i}w_{q(x_{i})}+ \frac{1}{2}h_{i}w_{q(x_{i})}^{2} \right\} + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}w_{j}^{2} \\ &= \sum_{j=1}^{T}\left[ \left(\sum_{i\in I_{j}}g_{i}\right)w_{j} + \frac{1}{2}\left(\sum_{i\in I_{j}}h_{i}+\lambda\right)w_{j}^{2}\right] + \gamma T \end{aligned}$$

Writing $G_{j} = \sum_{i\in I_{j}}g_{i}$ and $H_{j} = \sum_{i\in I_{j}}h_{i}$, each $w_{j}$ appears in its own independent quadratic, so minimizing each one gives

$$w_{j}^* = -\frac{G_{j}}{H_{j}+\lambda}, \qquad obj^*=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_{j}^{2}}{H_{j}+\lambda} + \gamma T.$$

So once the tree structure is known, the optimal parameters (the value of every leaf) can be computed in closed form.
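A short sketch of this closed-form solution (hypothetical helper names; `g` and `h` are the arrays of $g_i$, $h_i$ for the samples routed to a leaf):

```python
def leaf_weight(g, h, lam):
    # w_j* = -G_j / (H_j + lambda)
    return -g.sum() / (h.sum() + lam)

def structure_score(leaves, lam, gamma):
    # obj* = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T
    # `leaves` is a list of (g, h) array pairs, one pair per leaf
    score = -0.5 * sum(g.sum() ** 2 / (h.sum() + lam) for g, h in leaves)
    return score + gamma * len(leaves)
```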
Greedy algorithm: at each step, find the feature and split point that minimize $obj^{*(new)}$, i.e., maximize the gain $obj^{*(old)}-obj^{*(new)}$. If the split divides the samples $I$ of a leaf into $I_L$ and $I_R$ (adding one leaf, hence the $-\gamma$ term), then

$$obj^{*(old)}-obj^{*(new)} = \frac{1}{2}\left[\frac{\left(\sum_{i\in I_L}g_{i}\right)^{2}}{\sum_{i\in I_L}h_{i}+\lambda} + \frac{\left(\sum_{i\in I_R}g_{i}\right)^{2}}{\sum_{i\in I_R}h_{i}+\lambda} - \frac{\left(\sum_{i\in I}g_{i}\right)^{2}}{\sum_{i\in I}h_{i}+\lambda} \right] - \gamma$$
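A sketch of exact greedy split finding on a single feature under this gain formula (hypothetical function; ignores duplicate feature values and missing-value handling for brevity):

```python
import numpy as np

def best_split(x, g, h, lam, gamma):
    """Return (best_gain, threshold) for splitting on 1-D feature x."""
    order = np.argsort(x)                       # sort samples by feature value
    xs, gs, hs = x[order], g[order], h[order]
    G, H = gs.sum(), hs.sum()
    GL = HL = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(xs) - 1):                # candidate split between i and i+1
        GL += gs[i]; HL += hs[i]                # running left-side sums
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, (xs[i] + xs[i + 1]) / 2
    return best_gain, best_thr
```

Only splits with positive gain are kept; otherwise the extra leaf does not pay for its $\gamma$ penalty.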