Improved Multi-Class Cost-Sensitive Boosting
via Estimation of the Minimum-Risk Class

Ron Appel (Caltech) — appel@caltech.edu
Xavier Burgos-Artizzu (Transmural Biotech) — xpburgos@transmuralbiotech.com
Pietro Perona (Caltech) — perona@caltech.edu
Abstract

We present a simple unified framework for multi-class cost-sensitive boosting. The minimum-risk class is estimated directly, rather than via an approximation of the posterior distribution. Our method jointly optimizes binary weak learners and their corresponding output vectors, requiring classes to share features at each iteration. By training in a cost-sensitive manner, weak learners are invested in separating classes whose discrimination is important, at the expense of less relevant classification boundaries. Additional contributions are a family of loss functions along with proof that our algorithm is Boostable in the theoretical sense, as well as an efficient procedure for growing decision trees for use as weak learners. We evaluate our method on a variety of datasets: a collection of synthetic planar data, common UCI datasets, MNIST digits, SUN scenes, and CUB-200 birds. Results show state-of-the-art performance across all datasets against several strong baselines, including non-boosting multi-class approaches.
[Figure 1: two-panel illustration. Left, "Standard (2-Step) Approach": 1) Estimate Posteriors, 2) Choose Min-Risk Class; weak learners are spent on irrelevant boundaries, risking costly errors. Right, "REBEL (Our Approach)": Directly Estimate Min-Risk Class; relevant boundaries are learned first, leaving only forgivable errors. Inset: classification costs for three classes (Bottle, Phone, Knife); cost matrix (rows: true class, columns: prediction): [0 1 10; 1 0 10; 2 50 0].]
Figure 1: Example scenario of an airport security checkpoint with three classes and corresponding classification costs. With standard training, cost-sensitive classification is a two-step (potentially inefficient) process. Weak learners may be wasted on irrelevant (low-cost) boundaries. Using cost-sensitive training (e.g. REBEL), relevant boundaries are set first, giving less weight to the distinction between classes with forgivable errors.
1 Introduction

Boosting is one of the most popular off-the-shelf learning techniques in use today. Boosted classifiers are theoretically well understood [34, 20, 21], are easy and quick to train [2], and yield the fastest algorithms when computed as a cascade [6], enabling real-time big-data applications such as object detection and animal tracking [40, 16, 27].

In the case of binary classification, boosting is well understood. From a statistical standpoint, boosting can be seen as iteratively improving an estimator of the underlying posterior distribution of the data [22]. However, the multi-class case is not yet as well understood. Generalizing binary boosting to multi-class often requires various heuristics in order to work (see Sec. 2). Moreover, different errors may incur different costs, with such situations arising in a wide variety of domains such as computer vision, behavior analysis, and medical diagnosis, to name a few [12, 9, 32].
An interesting case in point is the categorization of objects naturally organized into a taxonomy, such as birds [7]. Misclassifying taxonomically close categories (e.g. confusing two types of duck) is usually more forgivable than misclassifying distant categories (e.g. mistaking a vulture for a duck) [13]. Accordingly, classifiers should be trained to minimize classification cost rather than error rates.

We present a novel approach to multi-class cost-sensitive boosting, offering both a clearer theoretical framework and improved performance over prior art. Instead of first approximating the class-wise posterior distributions and then taking costs into account at test time, our approach directly estimates the class with minimum risk (i.e. minimum expected cost) by using the costs during training. Since our proposed loss function estimates the underlying risk and, in particular, is based on a sum of exponentials, we call our method Risk Estimation Boosting using an Exponential Loss, or REBEL. We prove that REBEL is a true Boosting algorithm in the theoretical sense and outline a method for jointly optimizing decision trees as weak learners, facilitating feature sharing among classes.
In summary, the contributions of our work are:
1. a novel family of loss functions that facilitate multi-class cost-sensitive training (see Sec. 3)
2. a proof that REBEL satisfies the weak learning condition required for Boostability (see Sec. 3)
3. a greedy optimization procedure for growing deeper binary decision trees (see Sec. 4)
4. state-of-the-art results on several datasets for both neutral and cost-sensitive setups (see Sec. 5)
2 Related Work

There are two main approaches to multi-class classification: combining multiple binary classifiers, and training multi-class classifiers directly; both already exist in the context of boosting.

A multi-class prediction can be achieved by compounding the outputs of multiple binary classifiers. AdaBoost.M2 [21] and AdaBoost.MR [35] train binary classifiers that maximize the margins between pairs of classes. AdaBoost.MH [35] concurrently trains M one-vs-all classifiers, sharing the same weak learners for each binary classifier. Reduction to multiple one-vs-all or one-vs-one classifiers requires augmenting the given datasets, and often results in sub-optimal joint classifiers [33]. ECOC [1] trains multiple binary classifiers and uses error-correcting codes for output classification. Selecting error codes is often problem-dependent and not straightforward; AdaBoost.ERP [26] attempts to find a good trade-off between error-correcting and base learning. JointBoost [39] trains shared binary classifiers, optimizing over all combinations of binary groupings of classes. CD-MCBoost [33] and CW-Boost [36] perform coordinate descent on a multi-class loss function, thereby focusing each weak learner on a single class. HingeBoost.OC [23] uses output coding combined with a multi-class hinge loss, demonstrating improvements in performance on several datasets. In all of these approaches, the number of classifiers trained (and eventually evaluated) is super-linear in the number of classes, potentially slowing down classification [16].
On the other hand, strong boosted classifiers can be generated using a single chain of multi-class weak learners directly, avoiding the super-linear issue of the aforementioned methods. AdaBoost.M1 [21] is a simple adaptation of binary AdaBoost that makes use of multi-class learners, but has an unreasonably strong (often unachievable) weak-learning condition [29]. SAMME [44] uses a fixed set of codewords to additively generate a score for each class; however, [29] show that it does not satisfy the weak learning conditions required to be Boostable (i.e. it is unable to reduce the training error in some situations). GD-MCBoost [33] also uses a fixed set of codewords. AOSO-LogitBoost [38] improves upon previous methods by adaptively separating pairs of classes at each iteration. Struct-Boost [37] combines weak structured learners into a strong structured classifier, applicable to multi-class classification. GLL-Boost [4] exhibits guess-aversion, attempting to avoid outputting equal scores over all classes. However, multi-class weak learners are not as easy to generate and are inherently more complex than their binary counterparts, potentially leading to worse overfitting of the training data. Further, these methods require a sum-to-zero constraint on their output codes, adding to the complexity of computation as well as restricting the classifier's output [28].
As discussed above, many varieties of boosting have been proposed over the years. However, most approaches do not properly deal with the case of cost-sensitive classification. As a workaround, several post-hoc methods exist that interpret classifier output scores as posterior distributions, enabling the estimation of the minimum-risk class as a second step [18, 22, 31]. Approximating posterior distributions is often inaccurate [30], and we claim it is unnecessary. Our proposed method is distinguished from prior work, consolidating the following desirable properties into a single framework:
- a novel multi-class loss function that is easily implementable due to its simplicity (see Eq. 2)
- direct estimation of the minimum-risk class via cost-sensitive training, avoiding the need to (inaccurately) approximate a posterior distribution (see Sec. 3)
- a classifier composed of a single chain of binary weak learners that are less complex and share features among all classes, reducing the amount of overfitting [33]
- vector-valued outputs without sum-to-zero constraints, simplifying optimization and producing classifiers that can be more expressive
- a novel method for growing binary decision trees as weak learners for use with vector-valued outputs (see Sec. 4), leading to improved performance (see Table 1)
Unifying all the above features into a single framework translates into superior performance, as demonstrated in Sec. 5, compared against both prior boosting and non-boosting approaches. Considering that boosting is one of the most widely used supervised classification frameworks, we find that a simple, improved multi-class cost-sensitive classifier is an important contribution to the field. Although this work focuses on developing a better boosting algorithm, for completeness we also compare against several other state-of-the-art algorithms: 1-vs-All and multi-class SVMs [10], Structured SVMs [7], and Random Forests [8].
3 Approach

In this section, we introduce multi-class classification, formulate an appropriate upper-bounding function, and generalize to deal with cost-sensitive classification. We first define our notation.

Notation
- scalars (regular), vectors (bold), constant vectors: $x$, $\mathbf{x} \equiv [x_1, x_2, \ldots]$, $\mathbf{0} \equiv [0, 0, \ldots]$, $\mathbf{1} \equiv [1, 1, \ldots]$
- indicator vector, logical indicator function: $\boldsymbol{\delta}_k$ ($\mathbf{0}$ with a $1$ in the $k$th entry), $\mathbb{1}(\text{LOGICAL EXPRESSION}) \in \{0, 1\}$
- inner product, element-wise multiplication: $\langle \mathbf{x}, \mathbf{v} \rangle$, $\mathbf{x} \odot \mathbf{v}$
- element-wise function: $F[\mathbf{x}] \equiv [F(x_1), F(x_2), \ldots]$
The goal of multi-class classification is to obtain a function $h$ that correctly predicts the class of queried data points. A data point $\mathbf{x}$ is represented as a feature vector and is associated with a class label $y$. If there are a total of $d$ features and $K$ possible output classes, then:

$$\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^d, \quad y \in \mathcal{Y} \equiv \{1, 2, \ldots, K\}, \quad h : \mathcal{X} \to \mathcal{Y}$$
For a specific $h$, the expected misclassification error is:

$$\varepsilon \equiv \mathbb{E}\{\mathbb{1}(h(\mathbf{x}) \neq y)\} \equiv \mathbb{E}\{\langle \mathbf{1} - \boldsymbol{\delta}_y,\ \boldsymbol{\delta}_{h(\mathbf{x})} \rangle\}$$
Since $h$ outputs a discrete label, we construct a vector-valued score function $\mathbf{H} : \mathcal{X} \to \mathbb{R}^K$, where the index of the maximum value is assigned as the output class:

$$h(\mathbf{x}) \equiv \arg\max_k \{\langle \mathbf{H}(\mathbf{x}),\ \boldsymbol{\delta}_k \rangle\}$$
In practice, we do not know the underlying distribution of the data; hence, we empirically approximate the expected misclassification error. Given a stochastically sampled training set of $N$ points, the error is approximated as:

$$\varepsilon \approx \frac{1}{N} \sum_{n=1}^{N} \langle \mathbf{1} - \boldsymbol{\delta}_{y_n},\ \boldsymbol{\delta}_{h(\mathbf{x}_n)} \rangle$$
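As a concrete illustration (our own sketch, not the authors' code), the indicator-vector form of the empirical error can be computed in a few lines of NumPy; class labels are 0-indexed here for convenience:

```python
import numpy as np

def empirical_error(H_scores, y):
    """Empirical misclassification error via the indicator-vector form.

    H_scores : (N, K) array of score vectors H(x_n)
    y        : (N,)  array of true class indices (0-indexed)
    """
    N, K = H_scores.shape
    h = np.argmax(H_scores, axis=1)   # h(x_n) = argmax_k <H(x_n), delta_k>
    delta_h = np.eye(K)[h]            # one-hot delta_{h(x_n)}
    delta_y = np.eye(K)[y]            # one-hot delta_{y_n}
    # <1 - delta_y, delta_h> equals 1 exactly when h(x_n) != y_n
    return np.mean(np.sum((1.0 - delta_y) * delta_h, axis=1))

scores = np.array([[2.0, 0.5, -1.0],   # predicted class 0
                   [0.1, 0.3,  0.2],   # predicted class 1
                   [1.0, 0.0,  3.0]])  # predicted class 2
labels = np.array([0, 2, 2])
print(empirical_error(scores, labels))  # one mistake out of three
```

The inner product form is, of course, equivalent to simply counting `argmax` disagreements; it is written this way to match the notation used in the derivations that follow.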
Hence, we formulate our problem as: $\mathbf{H}^* = \arg\min_{\mathbf{H} \in \mathcal{H}} \{\varepsilon\}$, where $\mathcal{H}$ is a hypothesis space of possible functions. In this form, directly minimizing the error is infeasible as it is both discontinuous and non-convex [28]. Instead, we propose a family of surrogate loss functions that can tractably be minimized and reach the same optimum as the original error.

Consider any scalar function $F$ that is convex, smooth, upper-bounds the unit step function, and attains the same minimum value (i.e. $F(x) \equiv e^x$, $\log_2(1 + e^x)$, $(x+1)^2$, etc.); then the following coupled sum also possesses the above-mentioned properties (see supplement for proof):
$$\langle \mathbf{1} - \boldsymbol{\delta}_y,\ \boldsymbol{\delta}_{h(\mathbf{x})} \rangle \ \leq\ \frac{1}{2} \Big( \langle \mathbf{1} - \boldsymbol{\delta}_y,\ F[\mathbf{H}(\mathbf{x})] \rangle + \langle \boldsymbol{\delta}_y,\ F[-\mathbf{H}(\mathbf{x})] \rangle \Big) \quad (1)$$
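To see the bound in action, here is a small numerical check (our own illustration, using $F(x) = e^x$) that the coupled sum upper-bounds the 0/1 indicator for random score vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # number of classes (arbitrary choice for the demo)

def indicator_loss(H, y):
    # <1 - delta_y, delta_{h(x)}>: 1 iff the argmax class differs from y
    return float(np.argmax(H) != y)

def coupled_bound(H, y, F=np.exp):
    delta_y = np.eye(K)[y]
    # (1/2) * ( <1 - delta_y, F[H]> + <delta_y, F[-H]> )
    return 0.5 * (np.dot(1.0 - delta_y, F(H)) + np.dot(delta_y, F(-H)))

for _ in range(1000):
    H = rng.normal(scale=3.0, size=K)
    y = int(rng.integers(K))
    assert indicator_loss(H, y) <= coupled_bound(H, y) + 1e-12
print("bound held on all random trials")
```

Intuitively, when the prediction is wrong there is some class $j \neq y$ with $H_j \geq H_y$, and $\tfrac{1}{2}(e^{H_j} + e^{-H_j}) \geq 1$ by the AM-GM inequality, so the right-hand side cannot drop below the indicator.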
As previously discussed, different errors may incur different costs (i.e. letting a knife through a security checkpoint versus a water bottle). Costs can be encoded as a matrix $C$, where each entry $c_{yk} \geq 0$ is the cost of classifying a sample as class $k$ when its true class is $y$. We implicitly assume that correct classification incurs no cost: $c_{yy} \equiv 0$. A cost vector $\mathbf{c}_y$ is defined as the $y$th row of the cost matrix. Note that $\mathbf{c}_y$ can be uniquely decomposed as follows (see supplement for details):

$$\mathbf{c}_y = \beta_y \mathbf{1} + \sum_{k=1}^{K} b_{ky} [\mathbf{1} - \boldsymbol{\delta}_k] \quad \text{where: } b_{ky} \geq 0,\ \ b_{jy} = 0 \ \text{(for some } j\text{)}$$
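One way to realize this decomposition (our reading of the constraints; the supplement gives the formal derivation) is to set $b_{ky} = \max\{\mathbf{c}_y\} - c_{yk}$, which is nonnegative and vanishes at the arg-max entry $j$, and to let $\beta_y$ absorb the remainder:

```python
import numpy as np

def decompose_cost_row(c_y):
    """Decompose c_y = beta*1 + sum_k b[k]*(1 - delta_k), b >= 0, b[j] = 0 for some j.

    Assumed choice (consistent with b_jy = 0 at the max entry):
      b[k] = max(c_y) - c_y[k],   beta = max(c_y) - sum(b).
    """
    c_y = np.asarray(c_y, dtype=float)
    b = c_y.max() - c_y            # nonnegative, zero at the arg-max entry
    beta = c_y.max() - b.sum()     # may be negative; only the sum must reproduce c_y
    return beta, b

c_y = np.array([0.0, 1.0, 10.0])   # hypothetical cost row (correct class first)
beta, b = decompose_cost_row(c_y)
K = len(c_y)
recon = beta * np.ones(K) + sum(b[k] * (np.ones(K) - np.eye(K)[k]) for k in range(K))
print(np.allclose(recon, c_y))  # True
```

Checking one entry by hand: $\beta_y + \sum_k b_{ky} - b_{my} = \max\{\mathbf{c}_y\} - b_{my} = c_{ym}$, so the reconstruction matches every coordinate.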
Accordingly, the empirical risk (i.e. expected cost) can be expanded as:

$$\mathcal{R} \equiv \mathbb{E}\{\langle \mathbf{c}_y,\ \boldsymbol{\delta}_{h(\mathbf{x})} \rangle\} \approx \frac{1}{N} \sum_{n=1}^{N} \langle \mathbf{c}_{y_n},\ \boldsymbol{\delta}_{h(\mathbf{x}_n)} \rangle \equiv \frac{1}{N} \sum_{n=1}^{N} \beta_{y_n} + \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} b_{k y_n} \langle \mathbf{1} - \boldsymbol{\delta}_k,\ \boldsymbol{\delta}_{h(\mathbf{x}_n)} \rangle$$
Using the bound in (Eq. 1), the risk $\mathcal{R}$ is replaced with a convex upper-bounding surrogate loss:

$$\mathcal{L} \equiv \underbrace{\frac{1}{N} \sum_{n=1}^{N} \Big( \beta_{y_n} + \frac{\bar{c}_n}{2} \Big)}_{\bar{\mathcal{L}}} \;+\; \frac{1}{2N} \sum_{n=1}^{N} \Big( \langle \mathbf{c}^+_n,\ F[\mathbf{H}(\mathbf{x}_n)] \rangle + \langle \mathbf{c}^-_n,\ F[-\mathbf{H}(\mathbf{x}_n)] \rangle - \bar{c}_n \Big) \quad (2)$$

where:
$$\bar{c}_n \equiv \min_{\mathbf{H}} \{ \langle \mathbf{c}^+_n,\ F[\mathbf{H}] \rangle + \langle \mathbf{c}^-_n,\ F[-\mathbf{H}] \rangle \}$$
and:
$$\mathbf{c}^+_n \equiv \mathbf{c}_{y_n} - \beta_{y_n} \mathbf{1}, \qquad \mathbf{c}^-_n \equiv (\max\{\mathbf{c}_{y_n}\})\,\mathbf{1} - \mathbf{c}_{y_n}$$
Up to this point, we have derived a new multi-class cost-sensitive loss function. We now focus on a specific model for $\mathbf{H}$ and outline our corresponding optimization scheme.

Greedy Iterative Minimization

In this work, we model $\mathbf{H}$ as a weighted sum of binary weak learners. Recall that in standard binary boosting, a strong learner $h_T$ is the cumulative sum of weak learners, where $\alpha_t \in \mathbb{R}$ are scalar weights and $f_t : \mathbb{R}^d \to \{\pm 1\}$ are binary weak learners (chosen from a hypothesis set $\mathcal{F}$), and the final classification is just the sign of $h_T(\mathbf{x})$ [34]. In our model, the scalar weights $\alpha_t$ are extended into $K$-dimensional vectors $\mathbf{a}_t \in \mathbb{R}^K$ as follows:

$$h_T(\mathbf{x}) \equiv \alpha_0 + \sum_{t=1}^{T} f_t(\mathbf{x})\,\alpha_t \quad \longrightarrow \quad \mathbf{H}_T(\mathbf{x}) \equiv \mathbf{a}_0 + \sum_{t=1}^{T} f_t(\mathbf{x})\,\mathbf{a}_t \quad (3)$$

and the final classifier outputs the index of the maximum-valued element in $\mathbf{H}$. Hence, on the $(T{+}1)$th iteration, we have: $\mathbf{H}_{T+1}(\mathbf{x}) = \mathbf{H}_T(\mathbf{x}) + f_{T+1}(\mathbf{x})\,\mathbf{a}_{T+1}$. Greedy iterative optimization may be carried out by fixing all parameters from the previous $T$ iterations and minimizing the loss function with respect to the $(T{+}1)$th iteration parameters. From here on, we explicitly set our convex upper-bound as $F(x) \equiv e^x$ due to its simplicity and closed-form solution.
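The additive model of (Eq. 3) is straightforward to evaluate. The following sketch (our own, with hypothetical toy stumps and output vectors) accumulates the vector-valued votes and predicts the arg-max class:

```python
import numpy as np

def rebel_predict(x, a0, stumps, a_vecs):
    """H_T(x) = a0 + sum_t f_t(x) * a_t; predict the argmax entry.

    stumps : list of functions f_t(x) -> +1.0 or -1.0
    a_vecs : list of K-dimensional output vectors a_t
    """
    H = np.array(a0, dtype=float)
    for f_t, a_t in zip(stumps, a_vecs):
        H += f_t(x) * np.asarray(a_t)
    return int(np.argmax(H)), H

# Toy example: 2 features, 3 classes, 2 stumps (thresholds chosen arbitrarily)
stumps = [lambda x: 1.0 if x[0] > 0.5 else -1.0,
          lambda x: 1.0 if x[1] > 0.0 else -1.0]
a_vecs = [np.array([0.8, -0.2, -0.6]),
          np.array([-0.5, 0.9, -0.4])]
a0 = np.zeros(3)

label, H = rebel_predict(np.array([0.7, -1.0]), a0, stumps, a_vecs)
print(label, H)  # 0 [ 1.3 -1.1 -0.2]
```

Note that each binary stump contributes a full $K$-vector of (signed) class scores, which is exactly the feature-sharing mechanism discussed in Sec. 2.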
$$\mathcal{L}_f(\mathbf{a}) = \bar{\mathcal{L}} + \frac{1}{2N} \sum_{n=1}^{N} \Big( \big\langle \mathbf{c}^+_n,\ \exp[\mathbf{H}_T(\mathbf{x}_n) + f(\mathbf{x}_n)\,\mathbf{a}] \big\rangle + \big\langle \mathbf{c}^-_n,\ \exp[-[\mathbf{H}_T(\mathbf{x}_n) + f(\mathbf{x}_n)\,\mathbf{a}]] \big\rangle - \bar{c}_n \Big)$$
$$= \bar{\mathcal{L}} + \frac{1}{2N} \sum_{n=1}^{N} \Big( \langle \mathbf{w}^+_n,\ \exp[f(\mathbf{x}_n)\,\mathbf{a}] \rangle + \langle \mathbf{w}^-_n,\ \exp[-f(\mathbf{x}_n)\,\mathbf{a}] \rangle - \bar{c}_n \Big)$$

with multi-class weights $\mathbf{w}^{\pm}_n$ defined as:

$$\mathbf{w}^+_n \equiv \mathbf{c}^+_n \odot \exp[\mathbf{H}_T(\mathbf{x}_n)], \qquad \mathbf{w}^-_n \equiv \mathbf{c}^-_n \odot \exp[-\mathbf{H}_T(\mathbf{x}_n)]$$
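With $F(x) = e^x$, the inner minimization over $\mathbf{a}$ decouples per class and admits a closed form. Setting the element-wise derivative of $\sum_n \langle \mathbf{w}^+_n, \exp[f_n \mathbf{a}]\rangle + \langle \mathbf{w}^-_n, \exp[-f_n \mathbf{a}]\rangle$ to zero gives, for each class $k$, $a_k = \tfrac{1}{2}\ln\frac{B_k + C_k}{A_k + D_k}$, where $A_k, B_k$ sum $w^+_{nk}$ over samples with $f_n = +1$ and $f_n = -1$ respectively, and $C_k, D_k$ do the same for $w^-_{nk}$. This is our own derivation (the paper defers the details to its supplement); a sketch:

```python
import numpy as np

def optimal_output_vector(f_vals, w_pos, w_neg, eps=1e-12):
    """Closed-form minimizer of sum_n <w+_n, exp(f_n a)> + <w-_n, exp(-f_n a)>.

    f_vals : (N,) array of weak-learner outputs in {+1, -1}
    w_pos, w_neg : (N, K) nonnegative multi-class weight arrays
    """
    pos = f_vals > 0
    A = w_pos[pos].sum(axis=0)    # w+ over samples with f = +1
    B = w_pos[~pos].sum(axis=0)   # w+ over samples with f = -1
    C = w_neg[pos].sum(axis=0)    # w- over samples with f = +1
    D = w_neg[~pos].sum(axis=0)   # w- over samples with f = -1
    return 0.5 * np.log((B + C + eps) / (A + D + eps))

def loss(a, f_vals, w_pos, w_neg):
    E = np.exp(np.outer(f_vals, a))          # e^{f_n a_k}
    return np.sum(w_pos * E) + np.sum(w_neg / E)

rng = np.random.default_rng(1)
f = np.where(rng.random(50) < 0.5, 1.0, -1.0)
wp, wn = rng.random((50, 4)), rng.random((50, 4))
a_star = optimal_output_vector(f, wp, wn)
# a_star is the global minimum of this convex objective,
# so it should beat any perturbation:
print(loss(a_star, f, wp, wn) <= loss(a_star + 0.1, f, wp, wn))  # True
```

Since the objective is convex in $\mathbf{a}$, the stationary point is the global minimum, which is what makes the per-candidate loop over weak learners cheap.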
Joint optimization of $f$ and $\mathbf{a}$ is accomplished by looping over candidate weak learners $f \in \mathcal{F}$ (the hypothesis space of possible weak learners), optimizing the corresponding vector $\mathbf{a}$, and selecting the joint pair $\{f, \mathbf{a}\}$ that minimizes $\mathcal{L}_f(\mathbf{a})$. Note that every element of $\mathbf{a}$ is dynamically optimized at every iteration. It is neither a fixed code-word, nor has a sum-to-zero constraint, enhancing the expressiveness of the classifier and ultimately leading to better accuracy (see Sec. 5). Further, in the binary (2-class) case, this framework exactly reduces to binary AdaBoost, LogitBoost, etc. (depending on the choice of $F(x)$; see supplement for details).
Weak Learning Condition

Although REBEL trains a multi-class classifier, each of its weak learners is a binary classifier. For an algorithm to be Boostable, given adequate weak learners (i.e. having at least $50\% + \gamma$ accuracy, where $\gamma > 0$), it must drive the training error to zero [29]. REBEL's weak learning condition is the following (please see supplement for more details):

$$\exists\, \gamma > 0 \ \text{such that:} \ \ \forall\, \mathbf{w}^+_n, \mathbf{w}^-_n \geq \mathbf{0}, \ \ \exists f \ \text{satisfying:} \quad \Big\langle \Big| \sum_{n=1}^{N} [\mathbf{w}^+_n - \mathbf{w}^-_n]\, f(\mathbf{x}_n) \Big|,\ \mathbf{1} \Big\rangle \ \geq\ \gamma \sum_{n=1}^{N} \big\langle |\mathbf{w}^+_n - \mathbf{w}^-_n|,\ \mathbf{1} \big\rangle$$

For any positive multi-class weights $\mathbf{w}^+_n$ and $\mathbf{w}^-_n$, if there exists a weak learner $f$ such that the above inequality holds, REBEL is guaranteed to minimize the training loss exponentially quickly. REBEL's weak learning condition, although seemingly complex, is equivalent to the binary weak learning condition (please see supplement for proof), implying that REBEL is a true Boosting algorithm in the theoretical sense. The intuition behind this equivalence is that at each iteration, REBEL is actually only making a binary split between two sets of classes. These sets are dynamically chosen at each iteration to lead to optimal separation given the available pool of weak learners.
4 Decision Trees

In Section 3, we introduced our loss function and an optimization procedure given a hypothesis space of binary weak learners. In this section, we describe our approach specifically for growing useful binary decision trees. Decision trees are simple white-box models, and in particular, shallow trees (i.e. depth ≤ 4) generalize well and have proven robust in practice [17].

A decision stump $f : \mathcal{X} \to \{\pm 1\}$ can be written as:

$$f(\mathbf{x}) \equiv \rho \cdot \mathrm{sign}(\langle \mathbf{x},\ \boldsymbol{\delta}_j \rangle - \tau)$$

where $\rho \in \{\pm 1\}$ is a polarity, $\boldsymbol{\delta}_j$ selects the $j$th feature in $\mathbf{x}$ (out of $d$ possible features), and $\tau \in \mathbb{R}$ is a threshold. Let $\mathcal{H}_1$ denote the hypothesis space of all unique decision stumps on the training data. The input space $\mathcal{X} \subseteq \mathbb{R}^d$ has $d$ dimensions, at times on the order of several thousands (see Sec. 5). Each dimension has a finite number of possible thresholds (at most equaling the number of training samples $N$). In practice, we only consider a fixed number of thresholds $N_\tau$.¹ Finally, there are two possible polarities; hence, there are $d \times N_\tau \times 2$ unique stumps in $\mathcal{H}_1$. Let $\mathcal{H}_D$ denote the space of unique depth-$D$ trees. For each split in a depth-$D$ tree, both child nodes are depth-$(D{-}1)$ trees. This leads to an exponential number of possible trees, totaling $(d \times N_\tau \times 2)^D$ unique trees in $\mathcal{H}_D$. Even on a gigahertz computer, looping through each possible tree in $\mathcal{H}_D$ is only feasible for $D = 1$. To overcome this issue, we developed a greedy algorithm for using deeper trees as weak learners in our estimator.
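For concreteness, a single decision stump and the size of $\mathcal{H}_1$ can be written out as follows (our own sketch; the feature count is a hypothetical value, and thresholds are assumed to be placed uniformly over each feature's range, which is one natural reading of footnote 1):

```python
import numpy as np

def make_stump(j, tau, rho):
    """f(x) = rho * sign(x[j] - tau), with sign at the threshold taken as -1."""
    return lambda x: rho * (1.0 if x[j] > tau else -1.0)

def stump_space_size(d, n_tau):
    # d features * n_tau thresholds * 2 polarities = |H_1| unique stumps
    return d * n_tau * 2

f = make_stump(j=0, tau=0.5, rho=-1)
print(f(np.array([0.9, 0.0])))              # -1.0
# With d = 2500 (hypothetical) and N_tau = 200 (footnote 1):
print(stump_space_size(d=2500, n_tau=200))  # 1000000, i.e. ~10^6 stumps
```

The exhaustive depth-$D$ count $(d \times N_\tau \times 2)^D$ then explodes as $|\mathcal{H}_1|^D$, which motivates the greedy construction below.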
Greedily Growing Trees

By greedily adding one depth-layer at a time, we can grow trees to any depth in a computationally tractable manner. Using our proposed approach, we guarantee that with each added depth-layer, the tree leads to a more accurate strong classifier. Let us assume that we have already jointly trained a depth-$D$ tree $f^{(D)}$ and corresponding output vector $\mathbf{a}^{(D)}$; we now describe how to add the $(D{+}1)$th layer.

1. Replace each leaf node in $f^{(D)}$ with a stump having identical parameters as its parent node, thereby deepening the tree without changing its functionality.
2. Holding $\mathbf{a}^{(D)}$ fixed, optimize each of the $2^D$ newly added stumps. This operation is $O(2^D \times d \times N_\tau)$, resulting in a new tree $f^{(D+1)}$. In the worst case, all added stumps (and hence, overall accuracy) remain unchanged.
3. Fixing $f^{(D+1)}$, minimize $\mathcal{L}(f, \mathbf{a})$ with respect to $\mathbf{a}$, storing the optimal vector as $\mathbf{a}^{(D+1)}$.
Figure 2 illustrates our tree-growing technique, reducing the computational complexity from $\sim 10^{6D}$ to $2^D \times d \times N_\tau$, making our classifier fit for use with shallow binary decision trees as weak learners.
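The complexity gap that motivates the greedy procedure is easy to quantify. Below is our own back-of-the-envelope comparison, assuming $d \times N_\tau \times 2 = 10^6$ stumps as in footnote 1:

```python
# Exhaustive search over depth-D trees vs. the greedy layer-by-layer scheme.
# Assumes |H_1| = d * N_tau * 2 = 10**6 unique stumps (footnote 1).
stumps = 10**6
d_times_ntau = stumps // 2

for D in (1, 2, 3):
    exhaustive = stumps**D  # (d * N_tau * 2)^D candidate trees
    # Greedy: adding layer k costs 2^k * d * N_tau stump optimizations
    greedy = sum((2**depth) * d_times_ntau for depth in range(D))
    print(f"depth {D}: exhaustive {exhaustive:.1e} vs greedy {greedy:.1e}")
```

Already at depth 2 the exhaustive count reaches $10^{12}$ while the greedy scheme stays near $10^6$ stump evaluations, which is why only $D = 1$ is feasible exhaustively.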
¹ We find that $N_\tau \equiv 200$ (evenly spaced thresholds over the range of the training data) is sufficient to achieve maximal accuracy in practice. Hence: $d \times N_\tau \times 2 \approx 10^6$ unique stumps in $\mathcal{H}_1$.
[Figure 2: schematic of the three growing steps — (1) each leaf stump $f_j$ is duplicated one layer deeper with output vector $\mathbf{a}^{(D)}$ held fixed, (2) the new leaf stumps are optimized into $f_L$, $f_R$, and (3) the output vector is re-optimized into $\mathbf{a}^{(D+1)}$.]

Figure 2: Greedily growing a binary decision tree by one layer. Please see text above for details.
[Figure 3: scatter plot of cost using REBEL vs. cost using the standard method, log-log axes from 10^-1 to 10^0, with an equal-costs diagonal.]

Figure 3: REBEL compared against the standard (two-step) approach. Please see text below for details.
5 Experiments

We benchmark our approach on both synthetic and real data, exploring its performance vis-à-vis other methods and demonstrating its functionality. Experiments using synthetic data test REBEL's main properties; please see supplement for corresponding results. Fig. 3 compares REBEL's performance (using cost-sensitive training) to the standard method (i.e. training on just the data, estimating the posterior distributions, and using the costs only at test-time). 10 random datasets and 20 random cost matrices are generated for a total of 200 trials. Each dataset has 1000 training and 500 test points drawn from a mixture of multiple Gaussian clusters (datasets are plotted in the supplement). All cost matrices have zeros along the diagonals and (positive) normally-distributed off-diagonal entries, normalized such that random classification results in unit cost. For each trial, classifiers are trained using 100 stumps. REBEL outperforms the standard method in 90% of the trials, especially on the harder classification problems (i.e. when both methods incur relatively high costs).
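The cost-matrix normalization used in these trials can be reproduced as follows (our own sketch; we assume "random classification" means a uniformly random prediction under a uniform class prior, so the expected cost equals the mean of all $K^2$ entries):

```python
import numpy as np

def random_cost_matrix(K, rng):
    """Zero diagonal, positive normally-distributed off-diagonal entries,
    scaled so that uniformly random classification incurs unit expected cost."""
    C = np.abs(rng.normal(size=(K, K)))
    np.fill_diagonal(C, 0.0)
    # E[cost] under uniform y and uniform prediction = mean over all K*K entries
    C /= C.mean()
    return C

rng = np.random.default_rng(7)
C = random_cost_matrix(4, rng)
print(np.isclose(C.mean(), 1.0), bool(np.all(np.diag(C) == 0)))  # True True
```

This normalization makes costs comparable across trials: a classifier scoring below 1.0 is doing better than chance regardless of which random matrix was drawn.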
5.1 Standard multi-class classification on real data

We benchmark REBEL on several UCI datasets [3] and the MNIST handwritten digits [25], using a uniform (cost-neutral) metric. We compared our method against several competitive boosting methods: 1-vs-All AdaBoost, AdaBoost.MH [35], AdaBoost.ECC [15], Struct-Boost [37], CW-Boost [36], and AOSO-LogitBoost [38]. Based on the experimental setup in [37], we use stumps as weak learners, each method having a maximum of 200 weak learners. Five versions of each dataset are generated, randomly split with 50% of the samples for training, 25% for validation (i.e. for finding the optimal number of weak learners), and the remaining 25% for testing. All misclassifications are equally costly; hence, we report percent error with the resulting means and standard deviations in Fig. 4. REBEL is the best performing method on three of these five datasets, and is second best to CW-Boost on the two remaining datasets. Even in these commonly-used and saturated UCI datasets, we are able to show consistent accuracy compared to previous approaches.
5.2 Cost-sensitive classification

As discussed in the introduction, some scenarios require cost-sensitive classification. Among many existing datasets, we chose two current datasets that fall into this category: (1) SUN scene recognition [42] and (2) CUB200 fine-grained bird categorization [41]. Both of these datasets are organized into hierarchies; hence, an apt evaluation metric penalizes each misclassification according to its distance from the true class along the hierarchy.

We compare our method against state-of-the-art boosting methods as well as non-boosting baselines: 1vsAll AdaBoost, AdaBoost.MH [35], AdaBoost.ECC [1], Struct-Boost [37], CW-Boost [36], AOSO-LogitBoost [38], multi-class and 1vsAll SVM [10], Struct-SVM [7], and Random Forests [8]. For Struct-Boost [37], we report performance as published by their authors. For the remaining methods, we downloaded or re-implemented code, cross-validating using the training data to establish optimal parameters; specifically, the number of boosted weak learners (up to 200 on SUN and 2500 on CUB) and tree depth (to a maximum of 4). For the SVMs, we performed grid search through $C$ and $\gamma$ as recommended in [10]. Finally, for the random forests, we performed grid search for tree depth (to a maximum of 7) and number of trees (between 1 and 1000), with combinations not exceeding the number of feature splits allotted to the boosted methods to ensure a fair comparison. For completeness, we report the performance of each method on both the hierarchical misclassification cost and the uniform cost metric.
[Figure 4 data: percent error per method and dataset (asterisks mark the best method per dataset); the legend triples count 1st/2nd/3rd-place finishes:

Method         1st/2nd/3rd   GLASS   VOWEL   LANDSAT   PENDIGITS   MNIST
Ada 1vsAll     0/1/1         31.7    21.1    15.1       7.1        11.0
Ada.MH         0/0/2         32.3    18.8    12.7       7.4        13.4
Ada.ECC        0/0/0         32.7    20.6    12.8       8.4        15.8
Struct-Boost   0/1/2         35.8    17.5    12.1       6.9        12.5
CW-Boost       2/1/0         35.4    22.4    11.1       2.5*        9.3*
AOSO-Logit     0/0/0         34.2    20.6    15.4      12.8        12.5
REBEL          3/2/0         30.4*   17.4*   10.7*      3.2        10.5]

Figure 4: Popular boosting algorithms tested on UCI datasets. Caps on the bars indicate ±1 std. from the mean. The number of times each method places 1st, 2nd, and 3rd is indicated in the legend. REBEL consistently places in the top two on every dataset.
[Figure 5 data: misclassification cost per method and subset (asterisks mark the best method per subset; the bracketed legend values are the means over the four columns):

Method          SUN6 (Cost)   SUN6 (0-1)   SUN15 (Cost)   SUN15 (0-1)   Mean
AdaBoost.MH     .215          .315         .321           .418          [0.317]
AdaBoost.ECC    .219          .327         .341           .444          [0.333]
Struct-Boost    .209          .343         .294           .396          [0.311]
1vsAll SVM      .319          .389         .475           .523          [0.427]
Multi-SVM       .202          .315         .293*          .393          [0.301]
Random Forest   .222          .332         .332           .434          [0.330]
REBEL           .201*         .303*        .293           .360*         [0.289]]

Figure 5: SUN scene recognition datasets, with state-of-the-art classifiers evaluated on a six-class (SUN6) and a fifteen-class subset (SUN15). Each method was trained and evaluated in both a cost-sensitive (Cost) and a cost-neutral (0-1) way. REBEL is a consistent top-performing method across all of these subsets.
SUN scene recognition

We benchmark REBEL on the same two subsets of the SUN database as used in [37], containing six and fifteen classes respectively, and using HOG features as input [11]. For each subset, five random splits were generated, with 75% of the samples used for training and 25% for testing. In each of the SUN subsets, the corresponding cost matrices were normalized by the appropriate factor to ensure that random guessing incurred the same cost (83% in the 6-class subset and 93% in the 15-class subset). The means and standard deviations over the five splits are shown in Fig. 5.

In cross-validating for optimal parameters on the SUN subsets, REBEL performed best using depth-2 trees as weak learners; see Table 1. REBEL is consistently the top-performing method on all of the subsets, edging out multi-class SVM and the other boosting methods. Other than the one-vs-all SVM, which consistently under-performs, most of the methods lead to similar accuracies.
             Stumps   Depth-2   Depth-3   Depth-4
SUN6 (0-1)   32.8%    30.3%     34.2%     35.0%
SUN6 (cost)  0.363    0.323     0.327     0.325
CUB (0-1)    21.4%    20.9%     22.1%     22.9%
CUB (cost)   0.291    0.291     0.286     0.26

Table 1: REBEL misclassification errors on the SUN-6 and CUB datasets using different tree depths. Deeper trees consistently outperform stumps, giving credit to our tree-growing method introduced in Sec. 4.
[Figure 6 data: test performance vs. number of weak learners (150, 300, 600, 1200, 2400); lowest test values per method:
(a) percent error: 1vsAll-Ada 23.6%, Multi-SVM 21.9%, SVM-IS 24.4%, 1vsAll-SVM 28.9%, AOSO-Logit 23.1%, CW-Boost 25.9%, Random Forest 23.3%, REBEL 21.1%
(b) taxonomic cost: 1vsAll-Ada 0.339, Multi-SVM 0.276, SVM-IS 0.281, 1vsAll-SVM 0.709, AOSO-Logit 0.314, CW-Boost 0.387, Random Forest 0.277, REBEL 0.261]

(a) Misclassification error (0-1)   (b) Cost-sensitive (Cost)

Figure 6: CUB200 fine-grained bird categorization results. Several state-of-the-art classifiers are evaluated on a standard fourteen-class subset. All methods are trained and evaluated (a) cost-neutrally and (b) cost-sensitively; tree depths for each boosting method are chosen by cross-validation. The lowest cost of each method on the test set is indicated in square brackets. REBEL with decision trees is the best performing method in both cases.
CUB fine-grained bird categorization

We further benchmark REBEL on the CUB200 fine-grained bird categorization dataset, using a fourteen-class Woodpecker subset as in [19, 5, 43, 14]. Each bird in the subset is represented as a 4096-feature vector: the output of a pre-trained Convolutional Neural Network [24]. The subset consists of 392 training and 398 test samples.
Classification errors are shown in Fig. 6. Boosting methods are plotted as a function of the number of weak learners, and non-boosting methods are plotted as horizontal lines. REBEL is the best performing method in both cases. All other boosting methods saturated at between 20% and 40% higher costs than REBEL, significantly worse. REBEL also outperformed multi-class SVM, albeit by much smaller margins. Since this subset of birds has a taxonomy that is particularly shallow, the taxonomic cost matrix is not that different from that of uniform errors; hence, we do not expect drastic improvements in performance and are satisfied with these modest gains.

In cross-validating for optimal parameters on CUB, REBEL performed best using depth-2 trees as weak learners for the cost-neutral experiment and depth-4 trees for the cost-sensitive experiment (see Table 1), validating the benefit of our decision-tree growing approach introduced in Sec. 4.
6 Discussion

Our experimental results indicate that REBEL is as good as, and even outperforms, other state-of-the-art classifiers. But (1) why does it work so well? And (2) why does it also outperform existing classifiers in cost-insensitive situations (in which REBEL seemingly has no advantage)?

(1) We are pleased with REBEL's performance and seek a better understanding. Our current explanation relates to the form of our model and agrees with empirical results, summarized as follows:

(a) In each iteration, REBEL determines a single binary split $f_t$ (and the corresponding optimal vector $\mathbf{a}_t$). This forces feature sharing across classes. Experiments show superior performance against baselines that do not require feature sharing (see Figs. 4, 5: REBEL vs. 1vsAll, ECC, CW). Feature sharing is a form of regularization [39, 33].
(b) Avoiding sum-to-zero constraints on the associated vector $\mathbf{a}_t$ is empirically advantageous. The additional degree of freedom lets each weak learner and vector pair be more expressive, leading to a greater reduction in loss at each iteration (see Figs. 4, 5: REBEL vs. AOSO; Fig. 6: REBEL's quick initial descent in error/cost).

Thus, (a) and (b) act in opposite directions, and the balance of the two leads to beneficial results. This finding is also theoretically justified since REBEL satisfies the Binary Weak Learning Condition in [29], thereby neither overfitting nor underfitting.
(2) The proposed surrogate loss function (Eq. 2) is set up to incorporate classification costs during training. However, misclassification cost is equivalent to classification error when the cost matrix has uniform costs. In this case, the minimum-risk class is the max-posterior class. Hence, there is no reason to expect better or worse performance than other methods that minimize classification error. The fact that our method performs slightly better on the datasets we tested (see Fig. 4) is a further indication that our balance of regularization and expressiveness (point (1) above) is advantageous.
7 Conclusions

We presented a multi-class cost-sensitive boosting framework with a novel family of simple surrogate loss functions. This framework directly models the minimum-risk class without explicitly approximating a posterior distribution. Training is based on minimizing classification costs, as specified by the user (e.g. following taxonomic distance). Using an exponential-based loss function, we derived and implemented REBEL.

REBEL unifies the best qualities of a number of algorithms. It is conceptually simple and optimizes feature sharing among classes. We found that REBEL is able to focus on important distinctions (those with costly errors) in exchange for more forgivable ones. We prove that REBEL satisfies the weak learning condition required to be a true Boosting algorithm in the theoretical sense.

We compared REBEL to several state-of-the-art boosting and non-boosting methods, showing improvements on the various datasets used. Several images from the SUN and CUB-200 datasets with the highest and lowest corresponding risk scores (i.e. $\mathbf{H}(\mathbf{x})$) are plotted in the supplement. Being based on boosting, our method may be implemented as a cascade to yield very fast classification, useful in real-time applications. Our technique therefore holds the promise of producing classifiers that are as accurate as, and faster than, the state-of-the-art.
8 References
[1] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. JMLR, 2001.
[2] R. Appel, T. Fuchs, P. Dollár, and P. Perona. Quickly boosting decision trees - pruning underachieving features early. In ICML, 2013.
[3] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[4] O. Beijbom, M. Saberian, D. Kriegman, and N. Vasconcelos. Guess-averse loss functions for cost-sensitive multiclass boosting. In ICML, 2014.
[5] T. Berg and P. N. Belhumeur. POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR, 2013.
[6] L. Bourdev and J. Brandt. Robust object detection via soft cascade. In CVPR, 2005.
[7] S. Branson, O. Beijbom, and S. Belongie. Efficient large-scale structured learning. In CVPR, 2013.
[8] L. Breiman. Random forests. Machine Learning, 2001.
[9] X. Burgos-Artizzu, P. Dollár, D. Lin, D. Anderson, and P. Perona. Social behavior recognition in continuous videos. In CVPR, 2012.
[10] C. Chang and C. Lin. LIBSVM: A library for support vector machines. Transactions on Intelligent Systems and Technology, 2011.
[11] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[12] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.
[13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[14] J. Deng, J. Krause, and L. Fei-Fei. Fine-grained crowdsourcing for fine-grained recognition. In CVPR, 2013.
[15] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. arXiv, (9501101), 1995.
[16] P. Dollár, R. Appel, and W. Kienzle. Crosstalk cascades for frame-rate pedestrian detection. In ECCV, 2012.
[17] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 2011.
[18] P. Domingos. MetaCost: A general method for making classifiers cost-sensitive. In SIGKDD, 1999.
[19] R. Farrell, O. Oza, N. Zhang, V. I. Morariu, T. Darrell, and L. S. Davis. Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. In ICCV, 2011.
[20] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 1995.
[21] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Machine Learning International Workshop, 1996.
[22] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 2000.
[23] T. Gao and D. Koller. Multiclass boosting with hinge loss based on output coding. In ICML, 2011.
[24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv, (1408.5093), 2014.
[25] Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998.
[26] L. Li. Multiclass boosting with repartitioning. In ICML, 2006.
[27] M. Kabra, A. Robie, M. Rivera-Alba, S. Branson, and K. Branson. JAABA: An interactive machine-learning tool for automatic annotation of animal behavior. Submitted to Nature Methods, 2012.
[28] Y. Mroueh, T. Poggio, L. Rosasco, and J.-J. Slotine. Multiclass learning with simplex coding. NIPS, 2012.
[29] I. Mukherjee and R. E. Schapire. A theory of multiclass boosting. In NIPS, 2010.
[30] A. Niculescu-Mizil and R. Caruana. Obtaining calibrated probabilities from boosting. In NIPS, 2005.
[31] D. B. O'Brien, M. R. Gupta, and R. M. Gray. Cost-sensitive multi-class classification from probability estimates. 2008.
[32] S. Ramaswamy. Multiclass cancer diagnosis using tumor gene expression signatures. In National Academy of Sciences, 2001.
[33] M. Saberian and N. Vasconcelos. Multiclass boosting: Theory and algorithms. In NIPS, 2011.
[34] R. E. Schapire. The strength of weak learnability. Machine Learning, 1990.
[35] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. 1999.
[36] C. Shen and Z. Hao. A direct formulation for totally-corrective multi-class boosting. In CVPR, 2011.
[37] G. Shen, G. Lin, and A. van den Hengel. StructBoost: Boosting methods for predicting structured output variables. PAMI, 2014.
[38] P. Sun, M. D. Reid, and J. Zhou. AOSO-LogitBoost: Adaptive one-vs-one LogitBoost for multi-class problems. arXiv, (1110.3907), 2011.
[39] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing visual features for multiclass and multiview object detection. PAMI, 2007.
[40] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 2004.
[41] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. 2010.
[42] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[43] N. Zhang, R. Farrell, and T. Darrell. Pose pooling kernels for sub-category recognition. In CVPR, 2012.
[44] J. Zhu, H. Zou, S. Rosset, and T. Hastie. Multi-class AdaBoost. 2009.
Supplementary Materials
Notation
scalars (regular), vectors (bold), constant vectors: $x$, $\mathbf{x} \equiv [x_1, x_2, \ldots]$, $\mathbf{0} \equiv [0, 0, \ldots]$, $\mathbf{1} \equiv [1, 1, \ldots]$
indicator vector, logical indicator function: $\boldsymbol{\delta}_k$ ($\mathbf{0}$ with a $1$ in the $k^{\text{th}}$ entry), $\mathbb{1}(\text{logical expression}) \in \{0, 1\}$
inner product, element-wise multiplication: $\langle \mathbf{x}, \mathbf{v} \rangle$, $\mathbf{x} \circ \mathbf{v}$
element-wise function: $F[\mathbf{x}] \equiv [F(x_1), F(x_2), \ldots]$
S.1 Coupled Sum
Let the class of functions $F : \mathbb{R} \to \mathbb{R}$ be convex, upper-bounding of the unit step function, and having the same optimal value of $0$:
$$\frac{\partial^2 F}{\partial x^2} \geq 0, \qquad F(x) \geq \mathbb{1}(x \geq 0), \qquad \inf_x \{F(x)\} = 0$$
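One natural member of this family is $F(x) = e^x$ (consistent with the exponential-based loss mentioned in the conclusions). A quick numeric sanity check of the three conditions (an illustrative sketch, not part of the original derivation):

```python
import math

# Check that F(x) = exp(x) is convex, upper-bounds the unit step
# 1(x >= 0), and has infimum 0.
F = math.exp
xs = [x / 10.0 for x in range(-100, 101)]  # sample points in [-10, 10]

# Midpoint convexity: F((a+b)/2) <= (F(a) + F(b)) / 2 on sampled pairs.
assert all(F((a + b) / 2) <= (F(a) + F(b)) / 2 + 1e-12
           for a in xs for b in xs)

# Upper-bounds the unit step function.
assert all(F(x) >= (1.0 if x >= 0 else 0.0) for x in xs)

# Infimum is 0: F(x) -> 0 as x -> -infinity.
print(F(-50.0))  # a value vanishingly close to 0
```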
Define $\sigma$ as the following coupled sum:
$$\sigma(\mathbf{H}; y) \equiv \frac{\langle \mathbf{1} - \boldsymbol{\delta}_y,\, F[\mathbf{H}] \rangle + \langle \boldsymbol{\delta}_y,\, F[-\mathbf{H}] \rangle}{2}$$
Then $\sigma$ is: (1) convex, (2) tight at the optimum, and (3) upper-bounding.
Claim (1): $\sigma(\mathbf{H}; y)$ is convex in $\mathbf{H}$.
Proof: $\sigma(\mathbf{H}; y) = \frac{1}{2}\big(F(-H_y) + \sum_{k \neq y} F(H_k)\big)$; half of the sum of convex functions is also convex. $\blacksquare$
Claim (2): $\inf_{\mathbf{H}} \{\sigma(\mathbf{H}; y)\} = 0$
Proof: $\lim_{\theta \to \infty} \sigma\big(\theta\,[2\boldsymbol{\delta}_y - \mathbf{1}];\, y\big) = 0$ $\blacksquare$
Claim (3): $\sigma(\mathbf{H}(\mathbf{x}); y) \geq \mathbb{1}\big(y \neq \arg\max_k \{\langle \mathbf{H}(\mathbf{x}), \boldsymbol{\delta}_k \rangle\}\big)$
Proof: Let:
$$j = \arg\max_{k \neq y} \Big\{\frac{H_k - H_y}{2}\Big\}, \qquad z = \frac{H_j - H_y}{2}$$
Then:
$$\sigma(\mathbf{H}; y) = \frac{1}{2}\Big(F(-H_y) + \sum_{k \neq y} F(H_k)\Big) \;\geq\; \frac{1}{2}\big(F(-H_y) + F(H_j)\big) \;\geq\; \underbrace{F\Big(\frac{-H_y + H_j}{2}\Big)}_{\text{due to Jensen's inequality}} = F(z)$$
Case (a): correct classification (should upper-bound $0$)
$$\forall k \neq y,\; H_y > H_k \;\implies\; z < 0 \;\implies\; \sigma(\mathbf{H}; y) \geq F(z) \geq \mathbb{1}(z \geq 0) = 0$$
Case (b): incorrect classification (should upper-bound $1$)
$$\exists j \;\text{s.t.}\; H_j \geq H_y \;\implies\; z \geq 0 \;\implies\; \sigma(\mathbf{H}; y) \geq F(z) \geq \mathbb{1}(z \geq 0) = 1$$
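The claims can also be checked numerically. The sketch below instantiates the coupled sum with $F(x) = e^x$ (one admissible choice from the family above; the helper functions are illustrative, not the authors' code):

```python
import math
import random

def sigma(H, y, F=math.exp):
    """Coupled sum: ( sum_{k != y} F(H_k) + F(-H_y) ) / 2."""
    return (sum(F(h) for k, h in enumerate(H) if k != y) + F(-H[y])) / 2

def zero_one(H, y):
    """0-1 loss: 1 if y is not the argmax of H, else 0."""
    return 0 if y == max(range(len(H)), key=lambda k: H[k]) else 1

random.seed(0)
K = 4

# Claim (3): sigma upper-bounds the 0-1 classification loss.
for _ in range(1000):
    H = [random.uniform(-3, 3) for _ in range(K)]
    y = random.randrange(K)
    assert sigma(H, y) >= zero_one(H, y)

# Claim (2): along H = theta * (2*delta_y - 1), sigma tends to 0.
y = 1
for theta in (1.0, 10.0, 50.0):
    H = [theta if k == y else -theta for k in range(K)]
    print(sigma(H, y))  # decreases toward 0 as theta grows
```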