Pruning Training Sets for Learning of Object Categories
Anelia Angelova§    Yaser Abu-Mostafa    Pietro Perona
§Computer Science Department    Electrical Engineering Department
California Institute of Technology, Pasadena, CA 91125
Abstract
Training datasets for learning of object categories are often contaminated or imperfect. We explore an approach to automatically identify examples that are noisy or troublesome for learning and to exclude them from the training set. The problem is relevant to learning in semi-supervised or unsupervised settings, as well as to learning when the training data is contaminated with wrongly labeled examples or contains correctly labeled but hard-to-learn examples. We propose a fully automatic mechanism for noise cleaning, called "data pruning", and demonstrate its success on learning of human faces. We do not assume that the data or the noise can be modeled or that additional training examples are available. Our experiments show that data pruning can improve generalization performance for algorithms with varying robustness to noise. It outperforms methods with regularization properties and is superior to commonly applied aggregation methods, such as bagging.
1. Introduction
Learning an unknown target function from examples is a difficult problem in its own right. The task is further complicated if some of the examples in the dataset are mislabeled or are otherwise hard to learn for the chosen model, as is the case with real-life data. Finding troublesome examples is a 'chicken-and-egg' dilemma: good classifiers for the object category are needed in order to determine which examples are noisy, yet learning on noisy data may produce poor classifiers. We explore whether generalization performance can be improved by eliminating noisy examples, and how to identify them reliably.
Real-life training data can have various sources of contamination. For example, wrongly labeled examples may be present due to human mistakes during labeling, or as a result of data received in a semi-supervised fashion, figure 1 (top). Quite often, examples that are hard to learn also find their way into the dataset, figure 1 (bottom), because data collection and labeling are done independently of the learning process, by people unaware of its future use.
Figure 1: Face data with classification noise (non-face examples wrongly labeled as faces, top) and with within-sample outliers (hard-to-learn face examples, bottom). The training data we use contains both sources of noise. Noisy or hard-to-learn examples are manually marked with a red box. The goal of data pruning is to identify and eliminate them automatically in order to improve generalization performance.
Pruning of noisy examples is important to fully automate the process of data collection and learning for object recognition. Efforts in this direction have focused on creating complicated models or features [5], tailored specifically to the target domain. Our key idea is that automatic noise elimination can be done without incorporating very complicated machinery or domain-specific information; instead, it can rely only on weaker clues about the target, which can be retrieved even in very noisy cases.
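The paper has not yet specified its pruning mechanism at this point. As a rough illustration of the general idea (discarding training examples whose labels disagree with classifiers trained on the rest of the data), the following is a minimal sketch. The bootstrap-vote scheme, the one-dimensional threshold classifier, and all numeric settings are assumptions made for illustration, not the authors' method.

```python
import random

random.seed(0)

# Illustrative sketch only (not the authors' algorithm): flag training
# examples whose given label disagrees with the majority vote of simple
# classifiers trained on bootstrap resamples, then drop them.

def make_data(n, flip=0.0):
    """1-D two-class data; `flip` injects label noise.
    Each item is (x, label, was_flipped)."""
    data = []
    for _ in range(n):
        y = random.randint(0, 1)
        x = random.gauss(float(y), 0.35)
        flipped = random.random() < flip
        data.append((x, 1 - y if flipped else y, flipped))
    return data

def fit_threshold(data):
    # decision threshold: midpoint between the two class means
    c0 = [x for x, y, _ in data if y == 0]
    c1 = [x for x, y, _ in data if y == 1]
    return (sum(c0) / len(c0) + sum(c1) / len(c1)) / 2.0

def predict(th, x):
    return 1 if x > th else 0

train = make_data(200, flip=0.15)  # contaminated training set

# ensemble of thresholds fit on bootstrap resamples of the training set
thresholds = [fit_threshold(random.choices(train, k=len(train)))
              for _ in range(25)]

def majority_vote(x):
    votes = sum(predict(th, x) for th in thresholds)
    return 1 if 2 * votes > len(thresholds) else 0

# prune: keep only examples whose label agrees with the ensemble vote
pruned = [d for d in train if majority_vote(d[0]) == d[1]]
print("pruned %d of %d examples" % (len(train) - len(pruned), len(train)))
```

In this toy setting, most flipped examples fall on the wrong side of the ensemble's decision boundary and are removed, while a small number of correctly labeled but atypical examples (analogous to the within-sample outliers of figure 1) are removed as well.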
1.1. Previous Work
Robust methods are an important ingredient in most computer vision algorithms that work with real-life data and applications. A large body of research, a detailed survey of which is beyond the scope of this paper, addresses robust techniques in various machine vision applications. To name a few: short and wide baseline stereo, motion segmenta-
Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 1063-6919/05 $20.00 © 2005 IEEE