Quickly Boosting Decision Trees – Supplementary Material
Ron Appel
Thomas Fuchs
Piotr Dollár
Pietro Perona
In the main text of the paper, we prove a bound on the misclassification error given a preliminary $m$-subset of the data, stated as Proposition 1:

$$ m \le n \;\Rightarrow\; Z_m \varepsilon_m \le Z_n \varepsilon_n $$

Using this bound, we are able to prune features early and speed up the training of decision trees under the misclassification error criterion. In this document, we prove the same bound for other common types of stump splitting criteria: information gain, Gini impurity, and variance minimization, extending our method for use with regression trees as well as binary or multi-class classification trees.
In any decision tree, an input propagates down the tree until it reaches a single leaf node. Recall that an $n$-subset is the set of the $n$ heaviest datapoints, i.e. the $n$ samples with the largest weights. Let $\rho$ be the set of elements in the $n$-subset that are assigned to a particular leaf. The sum of the weights of the elements that belong to that leaf is $Z_\rho$, and the sum of the weights of the elements in $\rho$ with class $y$ is $Z_\rho^y$:

$$ Z_n \equiv \sum_{i \le n} w_i \qquad\quad Z_\rho \equiv \sum_{i \in \rho} w_i \qquad\quad Z_\rho^y \equiv \sum_{i \in \rho} w_i \, \mathbf{1}_{\{y_i = y\}} $$
In regression, elements have values $y_i \in \mathbb{R}^d$, and we define the weighted average regression value of a leaf as:

$$ \tilde{y}_\rho \equiv \frac{1}{Z_\rho} \sum_{i \in \rho} w_i \, y_i $$
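To make the bookkeeping concrete, the following minimal Python sketch (our own illustration; the arrays `w`, `y`, and `leaf` are hypothetical, not from the paper) computes $Z_n$, $Z_\rho$, $Z_\rho^y$, and $\tilde{y}_\rho$ for a toy weighted sample:

```python
import numpy as np

# Toy weighted sample; weights are sorted in decreasing order, so the
# n-subset is simply the first n indices.
w = np.array([0.30, 0.25, 0.20, 0.15, 0.10])  # sample weights
y = np.array([0, 1, 0, 1, 1])                 # class (or regression) values
leaf = np.array([0, 0, 1, 0, 1])              # leaf assignment of each sample

n = len(w)
Z_n = w[:n].sum()                             # Z_n = sum_{i <= n} w_i

rho = (leaf == 0)                             # elements of one particular leaf
Z_rho = w[rho].sum()                          # Z_rho = sum_{i in rho} w_i
Z_rho_y = {c: w[rho & (y == c)].sum()         # Z_rho^y, one entry per class
           for c in np.unique(y)}

# Weighted average regression value of the leaf (y treated as scalar here).
y_tilde_rho = (w[rho] * y[rho]).sum() / Z_rho

print(Z_n, Z_rho, Z_rho_y, y_tilde_rho)
```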
Given an $n$-subset of data (or possibly all of it), the total error is computed by summing the error in each leaf, proportionally weighted by the total mass of the samples belonging to that leaf:

$$ \varepsilon_n = \sum_j \frac{Z_\rho}{Z_n} \, \varepsilon_\rho $$

which we reformulate as:

$$ Z_n \varepsilon_n = \sum_j Z_\rho \, \varepsilon_\rho $$

[For brevity, we omit the subscript $j$ in what should be $\rho_j$.]
Hence, we only need to show that for each leaf $\rho$, the product of subset mass and error over the full leaf is at least that over a smaller subset. Let $u$ be the set of elements in $\rho$ that are also in an $m$-subset, and let $\bar{u}$ be the set of elements in $\rho$ that are not in the $m$-subset, where $m \le n$:

$$ u \equiv \{\, i \mid i \in \rho,\ i \le m \,\}, \qquad \bar{u} \equiv \{\, i \mid i \in \rho,\ i > m \,\} $$

Note that $u \cup \bar{u} = \rho$, and define:

$$ Z_u \equiv \sum_{i \in u} w_i \qquad Z_{\bar{u}} \equiv \sum_{i \in \bar{u}} w_i \qquad Z_u^y \equiv \sum_{i \in u} w_i \, \mathbf{1}_{\{y_i = y\}} \qquad Z_{\bar{u}}^y \equiv \sum_{i \in \bar{u}} w_i \, \mathbf{1}_{\{y_i = y\}} $$
Accordingly, in the following sections, it is enough to show that $Z_\rho \varepsilon_\rho \ge Z_u \varepsilon_u$, since:

$$ Z_\rho \varepsilon_\rho \ge Z_u \varepsilon_u \;\Rightarrow\; Z_n \varepsilon_n = \sum_j Z_\rho \varepsilon_\rho \;\ge\; \sum_j Z_u \varepsilon_u = Z_m \varepsilon_m $$
Each of the following proofs is based on an inequality. All inequalities are proven at the very end. We also note
that these proofs apply to trees of any depth, not just stumps.
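Before turning to the individual criteria, the bound itself is easy to spot-check numerically. The sketch below (our own code with hypothetical names, not part of the paper) uses the misclassification criterion from the main text, for which $Z_\rho \varepsilon_\rho = Z_\rho - \max_y Z_\rho^y$: with weights sorted in decreasing order and a fixed leaf assignment, $Z_m \varepsilon_m$ should be non-decreasing in $m$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_leaves, n_classes = 200, 8, 3

# Weights in decreasing order, so the m-subset is the first m samples.
w = np.sort(rng.random(n))[::-1]
y = rng.integers(n_classes, size=n)
leaf = rng.integers(n_leaves, size=n)

def Zm_eps_m(m):
    """Z_m * eps_m = sum over leaves of (Z_rho - max_y Z_rho^y),
    restricted to the m heaviest samples."""
    total = 0.0
    for j in range(n_leaves):
        in_leaf = (leaf[:m] == j)
        Z_rho = w[:m][in_leaf].sum()
        if Z_rho == 0:
            continue
        Z_rho_y = [w[:m][in_leaf & (y[:m] == c)].sum()
                   for c in range(n_classes)]
        total += Z_rho - max(Z_rho_y)
    return total

vals = [Zm_eps_m(m) for m in range(1, n + 1)]
assert all(v1 <= v2 + 1e-12 for v1, v2 in zip(vals, vals[1:]))
print("Z_m * eps_m is non-decreasing in m: OK")
```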
Information Gain
$$ \varepsilon_n \equiv \sum_j \frac{Z_\rho}{Z_n} \left( -\sum_y \frac{Z_\rho^y}{Z_\rho} \ln\!\left( \frac{Z_\rho^y}{Z_\rho} \right) \right) \quad\Rightarrow\quad Z_n \varepsilon_n = \sum_j \overbrace{ \left( -\sum_y Z_\rho^y \ln\!\left( \frac{Z_\rho^y}{Z_\rho} \right) \right) }^{Z_\rho \varepsilon_\rho} $$

$$ Z_\rho \varepsilon_\rho = -\sum_y Z_\rho^y \ln\!\left( \frac{Z_\rho^y}{Z_\rho} \right) = -\sum_y \left( Z_u^y + Z_{\bar{u}}^y \right) \ln\!\left( \frac{Z_u^y + Z_{\bar{u}}^y}{Z_u + Z_{\bar{u}}} \right) $$

$$ \ge \underbrace{ \left( -\sum_y Z_u^y \ln\!\left( \frac{Z_u^y}{Z_u} \right) \right) }_{Z_u \varepsilon_u} + \underbrace{ \left( -\sum_y Z_{\bar{u}}^y \ln\!\left( \frac{Z_{\bar{u}}^y}{Z_{\bar{u}}} \right) \right) }_{Z_{\bar{u}} \varepsilon_{\bar{u}} \,\ge\, 0} $$

The inequality is the log-sum inequality proven in the Inequalities section, applied per class with $a = Z_u^y$, $b = Z_{\bar{u}}^y$, $\alpha = Z_u$, $\beta = Z_{\bar{u}}$; since the second term $Z_{\bar{u}} \varepsilon_{\bar{u}}$ is non-negative, it follows that $Z_\rho \varepsilon_\rho \ge Z_u \varepsilon_u$.
The proof for Information Gain Ratio is a trivial adaptation of the proof above.
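The per-leaf inequality above can be probed numerically. The following sketch (our own code, not from the paper) draws random per-class masses for $u$ and $\bar{u}$ and verifies $Z_\rho \varepsilon_\rho \ge Z_u \varepsilon_u + Z_{\bar{u}} \varepsilon_{\bar{u}}$ for the entropy criterion:

```python
import numpy as np

rng = np.random.default_rng(1)

def Z_eps_entropy(Zy):
    """Z * eps = -sum_y Z^y ln(Z^y / Z), with the convention 0 ln 0 = 0."""
    Z = Zy.sum()
    nz = Zy > 0
    return -(Zy[nz] * np.log(Zy[nz] / Z)).sum()

for _ in range(10_000):
    n_classes = rng.integers(2, 6)
    Zy_u = rng.random(n_classes)      # per-class masses Z_u^y
    Zy_ubar = rng.random(n_classes)   # per-class masses Z_ubar^y
    lhs = Z_eps_entropy(Zy_u + Zy_ubar)                 # Z_rho * eps_rho
    rhs = Z_eps_entropy(Zy_u) + Z_eps_entropy(Zy_ubar)  # Z_u eps_u + Z_ubar eps_ubar
    assert lhs >= rhs - 1e-9
print("entropy: per-leaf bound held on all trials")
```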
Gini Impurity
$$ \varepsilon_n \equiv \sum_j \frac{Z_\rho}{Z_n} \sum_y \frac{Z_\rho^y}{Z_\rho} \left( 1 - \frac{Z_\rho^y}{Z_\rho} \right) \quad\Rightarrow\quad Z_n \varepsilon_n = \sum_j \overbrace{ \left( Z_\rho - \sum_y \frac{(Z_\rho^y)^2}{Z_\rho} \right) }^{Z_\rho \varepsilon_\rho} $$

$$ Z_\rho \varepsilon_\rho = Z_\rho - \sum_y \frac{(Z_\rho^y)^2}{Z_\rho} = \left( Z_u + Z_{\bar{u}} \right) - \sum_y \frac{\left( Z_u^y + Z_{\bar{u}}^y \right)^2}{Z_u + Z_{\bar{u}}} $$

$$ \ge \underbrace{ \left( Z_u - \sum_y \frac{(Z_u^y)^2}{Z_u} \right) }_{Z_u \varepsilon_u} + \underbrace{ \left( Z_{\bar{u}} - \sum_y \frac{(Z_{\bar{u}}^y)^2}{Z_{\bar{u}}} \right) }_{Z_{\bar{u}} \varepsilon_{\bar{u}} \,\ge\, 0} $$

The inequality is the second inequality proven in the Inequalities section, applied per class with scalars $a = Z_u^y$, $b = Z_{\bar{u}}^y$, $\alpha = Z_u$, $\beta = Z_{\bar{u}}$; since $Z_{\bar{u}} \varepsilon_{\bar{u}} \ge 0$, it follows that $Z_\rho \varepsilon_\rho \ge Z_u \varepsilon_u$.
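The same style of numerical check works for the Gini criterion, using $Z_\rho \varepsilon_\rho = Z_\rho - \sum_y (Z_\rho^y)^2 / Z_\rho$ (again a sketch of our own):

```python
import numpy as np

rng = np.random.default_rng(2)

def Z_eps_gini(Zy):
    """Z * eps = Z - sum_y (Z^y)^2 / Z for the Gini impurity."""
    Z = Zy.sum()
    return Z - (Zy ** 2).sum() / Z

for _ in range(10_000):
    n_classes = rng.integers(2, 6)
    Zy_u = rng.random(n_classes)      # per-class masses Z_u^y
    Zy_ubar = rng.random(n_classes)   # per-class masses Z_ubar^y
    lhs = Z_eps_gini(Zy_u + Zy_ubar)
    rhs = Z_eps_gini(Zy_u) + Z_eps_gini(Zy_ubar)
    assert lhs >= rhs - 1e-12
print("Gini: per-leaf bound held on all trials")
```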
Variance Minimization
$$ \varepsilon_n \equiv \sum_j \frac{Z_\rho}{Z_n} \sum_{i \in \rho} \frac{w_i}{Z_\rho} \left| y_i - \tilde{y}_\rho \right|^2 \quad\Rightarrow\quad Z_n \varepsilon_n = \sum_j \overbrace{ \left( \sum_{i \in \rho} w_i |y_i|^2 - \frac{|Z_\rho \tilde{y}_\rho|^2}{Z_\rho} \right) }^{Z_\rho \varepsilon_\rho} $$

$$ Z_\rho \varepsilon_\rho = \sum_{i \in \rho} w_i |y_i|^2 - \frac{|Z_\rho \tilde{y}_\rho|^2}{Z_\rho} = \sum_{i \in u} w_i |y_i|^2 + \sum_{i \in \bar{u}} w_i |y_i|^2 - \frac{\left| Z_u \tilde{y}_u + Z_{\bar{u}} \tilde{y}_{\bar{u}} \right|^2}{Z_u + Z_{\bar{u}}} $$

$$ \ge \underbrace{ \left( \sum_{i \in u} w_i |y_i|^2 - \frac{|Z_u \tilde{y}_u|^2}{Z_u} \right) }_{Z_u \varepsilon_u} + \underbrace{ \left( \sum_{i \in \bar{u}} w_i |y_i|^2 - \frac{|Z_{\bar{u}} \tilde{y}_{\bar{u}}|^2}{Z_{\bar{u}}} \right) }_{Z_{\bar{u}} \varepsilon_{\bar{u}} \,\ge\, 0} $$

The inequality is the second inequality proven in the Inequalities section, applied with $a = Z_u \tilde{y}_u$, $b = Z_{\bar{u}} \tilde{y}_{\bar{u}}$, $\alpha = Z_u$, $\beta = Z_{\bar{u}}$; since $Z_{\bar{u}} \varepsilon_{\bar{u}}$ is a (non-negative) weighted variance, it follows that $Z_\rho \varepsilon_\rho \ge Z_u \varepsilon_u$.
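And likewise for the variance criterion with vector-valued $y_i$, using $Z_\rho \varepsilon_\rho = \sum_{i \in \rho} w_i |y_i|^2 - |Z_\rho \tilde{y}_\rho|^2 / Z_\rho$ (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(3)

def Z_eps_var(w, y):
    """Z * eps = sum_i w_i |y_i|^2 - |sum_i w_i y_i|^2 / Z,
    i.e. the weighted variance of the leaf times its mass Z."""
    Z = w.sum()
    s = w @ y                                   # s = Z * y_tilde
    return (w * (y ** 2).sum(axis=1)).sum() - s @ s / Z

for _ in range(10_000):
    d = 3
    w_u, w_ubar = rng.random(5) + 1e-9, rng.random(7) + 1e-9
    y_u, y_ubar = rng.normal(size=(5, d)), rng.normal(size=(7, d))
    lhs = Z_eps_var(np.concatenate([w_u, w_ubar]),
                    np.concatenate([y_u, y_ubar]))
    rhs = Z_eps_var(w_u, y_u) + Z_eps_var(w_ubar, y_ubar)
    assert lhs >= rhs - 1e-9
print("variance: per-leaf bound held on all trials")
```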
Inequalities
For non-negative scalars $a, b \ge 0$ and positive scalars $\alpha, \beta > 0$, the following (log-sum) inequality holds, with the convention $0 \ln 0 = 0$:

$$ (a+b) \ln\!\left( \frac{a+b}{\alpha+\beta} \right) \;\le\; a \ln\!\left( \frac{a}{\alpha} \right) + b \ln\!\left( \frac{b}{\beta} \right) $$

To prove it, we use $\ln t \ge 1 - \frac{1}{t}$ for all $t > 0$:

$$ a \ln\!\left( \frac{a}{\alpha} \right) + b \ln\!\left( \frac{b}{\beta} \right) - (a+b) \ln\!\left( \frac{a+b}{\alpha+\beta} \right) \;=\; a \ln\!\left( \frac{a}{\alpha} \cdot \frac{\alpha+\beta}{a+b} \right) + b \ln\!\left( \frac{b}{\beta} \cdot \frac{\alpha+\beta}{a+b} \right) $$

$$ \ge\; a \left( 1 - \frac{\alpha}{\alpha+\beta} \cdot \frac{a+b}{a} \right) + b \left( 1 - \frac{\beta}{\alpha+\beta} \cdot \frac{a+b}{b} \right) \;=\; (a+b) - \frac{(\alpha+\beta)(a+b)}{\alpha+\beta} \;=\; 0 $$

$$ \Rightarrow\; (a+b) \ln\!\left( \frac{a+b}{\alpha+\beta} \right) \;\le\; a \ln\!\left( \frac{a}{\alpha} \right) + b \ln\!\left( \frac{b}{\beta} \right) $$
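A quick numerical check of this inequality over random draws (our own sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100_000
a, b = rng.random(N) + 1e-12, rng.random(N) + 1e-12       # a, b >= 0
alpha, beta = rng.random(N) + 1e-6, rng.random(N) + 1e-6  # alpha, beta > 0

lhs = (a + b) * np.log((a + b) / (alpha + beta))
rhs = a * np.log(a / alpha) + b * np.log(b / beta)
assert np.all(lhs <= rhs + 1e-9)
print("log-sum inequality held on all draws")
```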
For any vectors (or scalars) $a, b$ and positive scalars $\alpha, \beta > 0$, the following inequality holds:

$$ \frac{|a+b|^2}{\alpha+\beta} \;=\; \frac{|a|^2 + |b|^2}{\alpha+\beta} + \frac{2\,\langle a, b \rangle}{\alpha+\beta} \;=\; \frac{|a|^2}{\alpha}\left( 1 - \frac{\beta}{\alpha+\beta} \right) + \frac{|b|^2}{\beta}\left( 1 - \frac{\alpha}{\alpha+\beta} \right) + \frac{2\,\langle a, b \rangle}{\alpha+\beta} $$

$$ =\; \frac{|a|^2}{\alpha} + \frac{|b|^2}{\beta} - \underbrace{\frac{|\beta a - \alpha b|^2}{\alpha\beta(\alpha+\beta)}}_{\ge\, 0} $$

$$ \Rightarrow\; \frac{|a+b|^2}{\alpha+\beta} \;\le\; \frac{|a|^2}{\alpha} + \frac{|b|^2}{\beta} $$
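A corresponding numerical check for the vector inequality (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
for _ in range(10_000):
    d = rng.integers(1, 5)
    a, b = rng.normal(size=d), rng.normal(size=d)           # arbitrary vectors
    alpha, beta = rng.random() + 1e-6, rng.random() + 1e-6  # positive scalars
    lhs = (a + b) @ (a + b) / (alpha + beta)
    rhs = a @ a / alpha + b @ b / beta
    assert lhs <= rhs + 1e-9
print("|a+b|^2/(alpha+beta) <= |a|^2/alpha + |b|^2/beta held on all trials")
```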