本文共 10281 字,大约阅读时间需要 34 分钟。
aggregation type blending learning uniform voting/averaging Bagging non-uniform linear AdaBoost conditional stacking Decision Tree \begin{array} { c | | c | c } \text { aggregation type } & \text { blending } & \text { learning } \\ \hline \hline \text { uniform } & \text { voting/averaging } & \text { Bagging } \\ \hline \text { non-uniform } & \text { linear } & \text { AdaBoost } \\ \hline \text { conditional } & \text { stacking } & \text { Decision Tree } \end{array} aggregation type uniform non-uniform conditional blending voting/averaging linear stacking learning Bagging AdaBoost Decision Tree
在前文介绍过 blending ,这实际上就是 meta algorithm,只是将已获得的假设函数做一个融合。
而 bagging 属于 learing 的一种,也就是说在融合过程中获取假设函数。bagging 与 AdaBoost 区别于一个是 uniform 而另一个不是。而 stacking 实际上就是将已有的假设函数集的输出作为特征输入到一个机器学习模型中,在训练出一个融合模型,当然可以是 PLA 等算法。决策树与stacking不同的地方在于,不需要预先训练处 g t g_t gt,而是在学习过程中获得。下面以为是否观看视频做决策的决策树示意图如下:
G ( x ) = ∑ t = 1 T q t ( x ) ⋅ g t ( x ) G ( \mathbf { x } ) = \sum _ { t = 1 } ^ { T } q _ { t } ( \mathbf { x } ) \cdot g _ { t } ( \mathbf { x } ) G(x)=t=1∑Tqt(x)⋅gt(x)
其中 g t ( x ) g_t(\mathbf{x}) gt(x) 也是一个决策树, q t ( x ) q_t(\mathbf{x}) qt(x) 表示的是 x \mathbf{x} x 是否在 G G G 的路径 t t t 中。由此看来决策树是一种仿人模型(human-mimicking models)。
从递推(树)的角度看:
G ( x ) = ∑ c = 1 C [ [ b ( x ) = c ] ] ⋅ G c ( x ) G ( \mathbf { x } ) = \sum _ { c = 1 } ^ { C } \left[ \kern-0.15em \left[ b ( \mathbf { x } ) = c \right]\kern-0.15em \right]\cdot G _ { c } ( \mathbf { x } ) G(x)=c=1∑C[[b(x)=c]]⋅Gc(x)
其中
那么决策树的训练过程为:
function Decision Tree ( data D = { ( x n , y n ) } n = 1 N ) if termination criteria met return base hypothesis g t ( x ) else learn branching criteria b ( x ) split D to C parts D c = { ( x n , y n ) : b ( x n ) = c } build sub-tree G c ← Decision Tree ( D c ) return G ( x ) = ∑ c = 1 C [ [ b ( x ) = c ] ] ⋅ G c ( x ) \begin{array} { l } \text { function Decision Tree } \left( \text { data } \mathcal { D } = \left\{ \left( \mathbf { x } _ { n } , y _ { n } \right) \right\} _ { n = 1 } ^ { N } \right) \\ \text { if termination criteria met } \\ \text { return base hypothesis } g _ { t } ( \mathbf { x } ) \\ \text { else } \\ \qquad \begin{array} { l } \text { learn branching criteria } b ( \mathbf { x } ) \\ \text { split } \mathcal { D } \text { to } \text {C parts } \mathcal { D } _ { c } = \left\{ \left( \mathbf { x } _ { n } , y _ { n } \right) : b \left( \mathbf { x } _ { n } \right) = c \right\} \\ \text { build sub-tree } G _ { c } \leftarrow \text { Decision Tree } \left( \mathcal { D } _ { c } \right) \\ \text { return } G ( \mathbf { x } ) = \sum _ { c = 1 } ^ { C } \left[ \kern-0.15em \left[ b ( \mathbf { x } ) = c \right]\kern-0.15em \right]\cdot G _ { c } ( \mathbf { x } ) \end{array} \end{array} function Decision Tree ( data D={ (xn,yn)}n=1N) if termination criteria met return base hypothesis gt(x) else learn branching criteria b(x) split D to C parts Dc={ (xn,yn):b(xn)=c} build sub-tree Gc← Decision Tree (Dc) return G(x)=∑c=1C[[b(x)=c]]⋅Gc(x)
从直观训练过程中,可以得知现在需要确认四个问题:
由于有这么多可选条件,那么决策树模型有很多种实现方法,所以决策树模型有很多前人的巧思但是很有用(decision tree: mostly heuristic but useful on its own)。
下面介绍一个常用的决策树模型 —— Classification and Regression Tree(C&RT)
其在前文所提到的四个问题的是如何解决的呢:
C&RT 模型 的 具体实现:
function Decision Tree ( data D = { ( x n , y n ) } n = 1 N ) if cannot branch anymore return g t ( x ) = E in -optimal constant else learn branching criteria b ( x ) = argmin ∑ decision stumps h ( x ) ∑ c = 1 2 ∣ D c with h ∣ ⋅ impurity ( D c with h ) split D to 2 parts D c = { ( x n , y n ) : b ( x n ) = c } build sub-tree G c ← Decision Tree ( D c ) return G ( x ) = ∑ c = 1 C [ [ b ( x ) = c ] ] ⋅ G c ( x ) \begin{array} { l } \text { function Decision Tree } \left( \text { data } \mathcal { D } = \left\{ \left( \mathbf { x } _ { n } , y _ { n } \right) \right\} _ { n = 1 } ^ { N } \right) \\ \text { if cannot branch anymore } \\ \qquad \text { return } g _ { t } ( \mathbf { x } ) = E _ { \text {in } } \text { -optimal constant } \\ \text { else learn branching criteria } \\ \qquad \begin{aligned} & b ( \mathbf { x } ) = \operatorname { argmin } \sum _ { \text {decision stumps } h ( \mathbf { x } ) } \sum _ { c = 1 } ^ { 2 } | \mathcal { D } _ { c } \text { with } h | \cdot \text { impurity } \left( \mathcal { D } _ { c } \text { with } h \right) \\ & \text { split } \mathcal { D } \text { to } 2 \text { parts } \mathcal { D } _ { c } = \left\{ \left( \mathbf { x } _ { n } , y _ { n } \right) : b \left( \mathbf { x } _ { n } \right) = c \right\} \\ & \text { build sub-tree } G _ { c } \leftarrow \text { Decision Tree } \left( \mathcal { D } _ { c } \right) \\ & \text { return } G ( \mathbf { x } ) = \sum _ { c = 1 } ^ { C } \left[ \kern-0.15em \left[ b ( \mathbf { x } ) = c \right]\kern-0.15em \right]\cdot G _ { c } ( \mathbf { x } ) \end{aligned} \end{array} function Decision Tree ( data D={ (xn,yn)}n=1N) if cannot branch anymore return gt(x)=Ein -optimal constant else learn branching criteria b(x)=argmindecision stumps h(x)∑c=1∑2∣Dc with h∣⋅ impurity (Dc with h) split D to 2 parts Dc={ (xn,yn):b(xn)=c} build sub-tree Gc← Decision Tree (Dc) return G(x)=c=1∑C[[b(x)=c]]⋅Gc(x)
该决策树模型可以轻松驾驭回归,二分类和多分类。
用于回归,且是比较常用的方法
impurity ( D ) = 1 N ∑ n = 1 N ( y n − y ˉ ) 2 with y ˉ = average of { y n } \begin{array} { l } \text { impurity } ( \mathcal { D } ) = \frac { 1 } { N } \sum _ { n = 1 } ^ { N } \left( y _ { n } - \bar { y } \right) ^ { 2 } \\ \text { with } \bar { y } = \text { average of } \left\{ y _ { n } \right\} \end{array} impurity (D)=N1∑n=1N(yn−yˉ)2 with yˉ= average of { yn}
将平均值作为评判标准。
用于分类
impurity ( D ) = 1 N ∑ n = 1 N [ y n ≠ y ∗ ] with y ∗ = majority of { y n } \begin{array} { l } \text { impurity } ( \mathcal { D } ) = \frac { 1 } { N } \sum _ { n = 1 } ^ { N } \left[ y _ { n } \neq y ^ { * } \right] \\ \text { with } y ^ { * } = \text { majority of } \left\{ y _ { n } \right\} \end{array} impurity (D)=N1∑n=1N[yn=y∗] with y∗= majority of { yn}
将占比最多的类作为评判标准。
用于分类,且是比较常用的方法
1 − ∑ k = 1 K ( ∑ n = 1 N [ [ y n = k ] ] N ) 2 1 - \sum _ { k = 1 } ^ { K } \left( \frac { \sum _ { n = 1 } ^ { N } \left[ \kern-0.15em \left[ { y } _ { n } = k \right] \kern-0.15em \right] } { N } \right) ^ { 2 } 1−k=1∑K(N∑n=1N[[yn=k]])2
考虑全部的类,计算不纯度。
当全部的 x n \mathbf{x}_n xn 均不相同,那么 E i n ( G ) = 0 E_{in}(G) = 0 Ein(G)=0,也就是说这是一个完全长成的树,这样会导致过拟合,因为在低位的子树构建数据很少,所以很容导致过拟合。所以这里想到使用剪枝实现正则化,简单来说就是控制数的叶子数量。
将叶子数量用 Ω ( G ) \Omega ( G ) Ω(G) 表示:
Ω ( G ) = NumberOfLeaves ( G ) \Omega ( G ) = \text { NumberOfLeaves } ( G ) Ω(G)= NumberOfLeaves (G)
那么正则化后的优化目标则变为:
argmin all possible G E in ( G ) + λ Ω ( G ) \underset { \text { all possible } G } { \operatorname { argmin } } E _ { \text {in } } ( G ) + \lambda \Omega ( G ) all possible GargminEin (G)+λΩ(G)
做此操作后被叫做 被砍过的决策树(pruned decision tree)。
这里有一个比较困难的问题,那就是 all possible G \text { all possible } G all possible G,全部穷举是不可能的,在 C&RT 中,采用了以下策略:
G ( 0 ) = fully-grown tree G ( i ) = argmin G E in ( G ) such that G is one-leaf removed from G ( i − 1 ) \begin{array} { l } G ^ { ( 0 ) } = \text { fully-grown tree } \\ G ^ { ( i ) } = \operatorname { argmin } _ { G } E _ { \text {in } } ( G ) \text { such that } G \text { is one-leaf removed from } G ^ { ( i - 1 ) } \end{array} G(0)= fully-grown tree G(i)=argminGEin (G) such that G is one-leaf removed from G(i−1)
也就是说在 摘掉 i i i 片树叶的决策树中找出性能最优的一颗,该决策树由摘掉 i − 1 i -1 i−1 片树叶的决策树中最优的一颗再摘掉一片获得。
那么假设完全长成的决策树的叶子数量为 I I I,那么现在便可以获得:
G ( 0 ) , G ( 1 ) , ⋯ , G ( I − ) where I − ≤ I G ^ { ( 0 ) },G ^ { ( 1 ) },\cdots,G ^ { ( I^- ) } \quad \text{where } I^{-} \leq I G(0),G(1),⋯,G(I−)where I−≤I那么在从这一堆 G ( i ) G^{(i)} G(i) 中使用正则化的优化目标找出最优的那颗决策树。当然这里还有一个参数 λ \lambda λ ,其可以用 validation 获得。
在连续特征中,分支条件实现如下:
b ( x ) = [ [ x i ≤ θ ] ] + 1 with θ ∈ R \begin{aligned} & b ( \mathbf { x } ) = \left[ \kern-0.15em \left[ x _ { i } \leq \theta \right] \kern-0.15em \right] + 1 \\ \text{with } & \theta \in R \end{aligned} with b(x)=[[xi≤θ]]+1θ∈R
而在离散特征中,分支条件类似
b ( x ) = [ [ x i ∈ S ] ] + 1 with S ⊂ { 1 , 2 , … , K } \begin{aligned} & b ( \mathbf { x } ) = \left[ \kern-0.15em \left[ x _ { i } \in S \right] \kern-0.15em \right] + 1 \\ \text{with }& S \subset \{ 1,2 , \ldots , K \} \end{aligned} with b(x)=[[xi∈S]]+1S⊂{ 1,2,…,K}所以说决策树也很容易处理离散特征,或者说分类特征。
当你在做决策时,遇到特征值缺失你会怎么办呢,假如说这个特征是体重,我估计你会从身高和三维出发,估计这个人的大概的体重,因为你在决策过程中只是跟一个固定值做比较而已,所以不用太精准。
具体怎么做呢?数学表达如下:
threshold on height ≈ threshold on weight \text { threshold on height } \approx \text { threshold on weight } threshold on height ≈ threshold on weight
也就是说体重缺失的话那就按身高划分
那么在 C&RT 中是找出并存储替代分支应对预测过程中的特征值缺失问题:
surrogate branch b 1 ( x ) , b 2 ( x ) , … ≈ best branch b ( x ) \text { surrogate branch } b _ { 1 } ( \mathbf { x } ) , b _ { 2 } ( \mathbf { x } ) , \ldots \approx \text { best branch } b ( \mathbf { x } ) surrogate branch b1(x),b2(x),…≈ best branch b(x)
所以 C&RT 也很容易处理特征缺失问题。
简单的举例说明
以此可以看出 ,而 decision tree 则是在条件上在进行分割,如果不理解可以点击博文 观察AdaBoost-Stump 的学习过程。
值得注意的是另一个常用的决策树算法是 C4.5
,区别是巧思不同(different choices of heuristics)。
转载地址:http://nlqzk.baihongyu.com/