Active Learning under Label Shift
Eric Zhao
Anqi Liu
Anima Anandkumar
Yisong Yue
California Institute of Technology
Abstract

We address the problem of active learning under label shift: when the class proportions of source and target domains differ. We introduce a "medial distribution" to incorporate a tradeoff between importance weighting and class-balanced sampling and propose their combined usage in active learning. Our method is known as Mediated Active Learning under Label Shift (MALLS). It balances the bias from class-balanced sampling and the variance from importance weighting. We prove sample complexity and generalization guarantees for MALLS which show active learning reduces asymptotic sample complexity even under arbitrary label shift. We empirically demonstrate MALLS scales to high-dimensional datasets and can reduce the sample complexity of active learning by 60% in deep active learning tasks.
1 Introduction

Label Shift
In many real-world applications, the target (testing) distribution of a model can differ from the source (training) distribution. Label shift arises when class proportions differ between the source and target, but the feature distributions of each class do not. For example, the problems of bird identification in San Francisco (SF) versus New York (NY) exhibit label shift. While the likelihood of observing a snowy owl may differ, snowy owls should look similar in New York and San Francisco. The well-known class-imbalance problem is a specific form of label shift where the target label distribution is uniform but the source is not.
Active Learning under Label Shift
Label shift poses a problem for active learning in the real world.

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).
[Figure 1 graphic: paired bar charts of source/target class proportions (10/90 vs. 50/50, and 50/50 vs. 10/90) for Scarlet Tanager and Atlantic Puffin, annotated with the corrections: sample 80 additional Atlantic puffins or apply IW (5, 0.55) for the imbalanced source, versus sample 400 additional Atlantic puffins or apply IW (0.2, 1.8) for the imbalanced target.]
Figure 1: Extreme label shift examples of binary classification on 100 datapoints. The arrows in the example illustrate two ways of correcting imbalanced source and imbalanced target. It requires larger importance weights to correct imbalanced source than imbalanced target, but correcting imbalanced target requires more additional samples than imbalanced source. A uniform medial distribution can decompose any label shift into an imbalanced source and imbalanced target.
For example, we can train a bird classifier for New York by labeling bird images off Google. However, due to the label shift between New York and Google Images, naive active learning algorithms will fail to collect data on birds relevant in New York. The correction of minority underrepresentation in computer vision datasets [Yang et al., 2020] similarly poses an active learning under label shift problem. Proper label shift correction must be incorporated into active learning techniques to avoid inefficient and biased data collection.
Importance Weighting & Subsampling
There are two ways to correct label shift, as shown in the two extreme cases depicted in Figure 1. The arrows demonstrate the required additional samples and importance weights for the correction of imbalanced source and imbalanced target. Importance weighting can correct for label shift with rigorous theoretical guarantees [Lipton et al., 2018, Azizzadenesheli et al., 2019]. However, under large label shift, the estimation and use of importance weights result in high variance. Class-balanced sampling (subsampling) can also correct for label shift and, although lacking strong theoretical guarantees, is practical and effective [Aggarwal et al., 2020].
However, in active learning settings, subsampling is imprecise as only label predictions—not true labels—can be used to assign subsampling probabilities to unlabeled datapoints.
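The importance weights annotated in Figure 1 follow directly from the likelihood ratio P_trg(y)/P_src(y). A minimal sketch reproducing the figure's numbers (class proportions read off the figure; the helper function is ours, not from the paper):

```python
def importance_weights(p_src, p_trg):
    """Per-class importance weights P_trg(y) / P_src(y)."""
    return [t / s for s, t in zip(p_src, p_trg)]

# Imbalanced source: source 10/90, target 50/50 (Figure 1, left).
w_src = importance_weights([0.1, 0.9], [0.5, 0.5])   # ≈ [5.0, 0.56]
# Imbalanced target: source 50/50, target 10/90 (Figure 1, right).
w_trg = importance_weights([0.5, 0.5], [0.1, 0.9])   # ≈ [0.2, 1.8]
```

The first case needs large weights on the rare class; the second keeps all weights close to 1, matching the figure's annotations.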
[Figure 2 graphic: panels labeled Target Data, Source Data, Subsampled, Weighted.]
Figure 2: Noisy linear classification of stars and circles with a star-heavy source and circle-heavy target. Black lines depict the empirical risk minimizer (ERM). Ignored data are light grey. Our proposed algorithms first subsample a medial distribution—in this case, equal parts circle and star—then importance weighting produces a circle-dominant ERM. See sec. 5-6 for details.
In this paper, we answer the question: how can we use both importance weighting and subsampling for active learning—and how much should we use each? We answer this question by introducing a medial distribution (Figure 2). Rather than active learning on datapoints from the source distribution, datapoints are instead sampled from a medial distribution by subsampling. Importance weighting corrects the remaining label shift between the medial and target distributions.

Our contributions:
1. Introduction of a medial distribution to describe a bias-variance trade-off in label shift correction.
2. Mediated Active Learning under Label Shift (MALLS), a principled algorithm with theoretical guarantees even under label shift.
3. A batched variant of MALLS for practitioners which integrates best practices and uncertainty sampling.
Aggressive use of subsampling can reduce the need for, and thus variance of, importance weighting. However, subsampling also introduces bias from its use of proxy labels. We derive a bound that formalizes this bias-variance trade-off and guides algorithm design. In particular, we show subsampling can mitigate the effect of label shift on importance weighting variance and label complexity—but at the cost of introducing bias. We further propose a choice of uniform medial distribution, as we illustrate in Figure 1.
To the best of our knowledge, MALLS is the first active learning framework for general label shift settings. We also derive label complexity and generalization PAC bounds for MALLS, the first such guarantees for this setting. We present experiments of MALLS which corroborate our theoretical insights into the trade-off between importance weighting and subsampling. In particular, batched MALLS improves the sample efficiency of popular active learning algorithms by up to 60% in the CIFAR10, CIFAR100 [Krizhevsky, 2009], and NABirds datasets [Van Horn et al., 2015]. We share the source code for the implementation of our method in this repository: https://github.com/ericzhao28/alls.
2 Related Works

Active Learning
Active learning has been investigated extensively from both theoretical and practical perspectives. Disagreement-based active learning and its variants enjoy rigorous learning guarantees and focus on the stream-based active learning setting [Hanneke, 2007, Hanneke, 2011, Balcan et al., 2009, Hanneke, 2014, Beygelzimer et al., 2010, Krishnamurthy et al., 2019]. On the other hand, uncertainty sampling techniques are popular practical algorithms which have been successfully applied to natural language processing [Shen et al., 2018], computer vision [Yang et al., 2015], and even robotics [Choudhury and Srinivasa, 2020]. We can incorporate our medial distribution design principle to arrive at both a streaming disagreement-based MALLS approach, as well as a batched MALLS for uncertainty sampling.
Distribution Shift
General domain adaptation theory [Ben-David et al., 2007, Ben-David et al., 2010, Cortes et al., 2010, Cortes and Mohri, 2014] looks at joint distribution shift. Covariate shift is the most popular refinement of joint distribution shift [Shimodaira, 2000, Gretton et al., 2009, Sugiyama et al., 2007]. However, density estimation for joint distribution shift or covariate shift is challenging due to the high-dimensional nature of input features in many applications [Sugiyama et al., 2012, Tsuboi et al., 2009, Yamada et al., 2011]. The label shift setting is comparatively less popular, but has received increased attention in recent years [Lipton et al., 2018, Azizzadenesheli et al., 2019, Garg et al., 2020]. Density estimation under label shift is comparatively simpler than under covariate shift: label spaces are simpler and often finite [Lipton et al., 2018].
Active Learning under Distribution Shift
Active learning [Rai et al., 2010, Matasci et al., 2012, Deng et al., 2018, Su et al., 2019] has been studied under joint distribution shift and covariate shift. Existing literature, which sometimes terms the problem "active domain adaptation", relies on heuristics for correcting joint distribution shift [Chan and Ng, 2007, Rai et al., 2010] or builds on the assumption of covariate shift [Saha et al., 2011, Yan et al., 2018, Chattopadhyay et al., 2013]. While active learning with a covariate-shifted warm start guarantees label complexity bounds, it requires importance weights known a priori [Yan et al., 2018]. Label shift is a particularly difficult setting as, unlike covariate shift, label shift cannot be estimated from unlabeled data.

With few exceptions [Huang and Chen, 2016], existing literature assumes active learners can query datapoints from the test domain (our canonical label shift setting). To the best of our knowledge, MALLS provides the first guarantees for settings where test data is limited or labels cannot be queried in the test domain.
The closest existing work to active learning under label shift is active learning for imbalanced data [Aggarwal et al., 2020, Lin et al., 2018], which can be formalized as an instance of label shift with a uniform test distribution. While existing works have proposed useful heuristics like diverse sampling and class-balanced sampling, theoretical results are scarce.
3 Preliminaries

Active Learning under Distribution Shift
In an active learning problem, a learner L actively collects a labeled dataset S with the goal of maximizing the performance of the hypothesis h ∈ H learned from S. m labeled datapoints sampled from some distribution P_warm initially populate S and constitute the "warm start" dataset D_warm. L samples unlabeled datapoints D_ulb from some distribution P_ulb, and may select up to n for labeling and appending to S. The learned hypothesis h is evaluated on a test distribution P_test.
Traditional active learning assumes:

P_ulb = P_warm = P_test.    (1)
In contrast, active domain adaptation does not assume the warm start is sampled from the test distribution:

P_ulb = P_test and P_warm ≠ P_test.    (2)
This setting, which we term canonical label shift, is well-studied but assumes active learning occurs in the test domain. We address the more challenging general label shift setting (Figure 3) which drops this assumption.
In the worst case, all distributions could be different:

P_warm ≠ P_ulb,  P_warm ≠ P_test,  P_ulb ≠ P_test.    (3)
For instance, the problem of creating a bird classifier for New York by actively labeling data off Google is general label shift. This setting has received comparatively little attention despite its practical relevance [Huang and Chen, 2016]: there may be a scarcity of unlabeled target data or practical issues with labeling target data, such as patient privacy or ownership rights.
[Figure 3 graphic: warm-start data and a data stream/pool feeding a labeled set, evaluated on test, under the canonical and general label shift settings.]
Figure 3: Diagram of active learning under label shift settings. The canonical label shift setting actively samples from the test distribution. The general label shift setting actively samples from elsewhere. Identical colors indicate identical distributions.

Label Shift
The distribution shift problem concerns training and evaluating models on different distributions, termed the source (P_src) and target (P_trg) respectively. We refer to a source and target in the abstract. For instance, in the canonical label shift setting, the source is the warm start P_src = P_warm, and the target is the test P_trg = P_test. Unlike covariate shift, which assumes the underlying distribution shift arises solely from a change in the input distribution while conditional label probabilities are unaffected¹, label shift assumes distribution shift arises solely from a change in label marginals:

P_trg(Y) ≠ P_src(Y),  P_trg(X | Y) = P_src(X | Y).    (4)
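Condition (4) can be exercised in a small simulation: draw from the same class-conditionals under two different label marginals. The snippet below is an illustrative sketch only; the Gaussian class-conditionals and the marginals are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_label_shift(p_y, n, rng):
    """Draw (x, y) with a fixed P(x|y) but a configurable label marginal P(y).

    Class-conditionals are 1-D Gaussians: P(x|y) = N(mu[y], 1).
    """
    mu = np.array([-2.0, 2.0])          # fixed class-conditional means
    y = rng.choice(len(p_y), size=n, p=p_y)
    x = rng.normal(mu[y], 1.0)
    return x, y

# Source and target share P(x|y) but differ in P(y): pure label shift.
x_src, y_src = sample_label_shift([0.9, 0.1], 5000, rng)
x_trg, y_trg = sample_label_shift([0.5, 0.5], 5000, rng)
```

Per-class feature statistics match across the two samples while the class frequencies differ, which is exactly the structure (4) describes.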
Importance Weighting (IW)
Importance weighting is a straightforward solution to label shift. Weighting datapoints by the likelihood ratio produces asymptotically unbiased importance weighted estimators:

(1/n) Σ_{i=1}^{n} [P_trg(y_i) / P_src(y_i)] f(x_i, y_i)  →  E_{x,y∼P_src}[ (P_trg(y) / P_src(y)) f(x, y) ]  =  E_{x,y∼P_trg}[f(x, y)].    (5)
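Equation (5) says reweighting by P_trg(y)/P_src(y) turns a source-sample average into an unbiased estimate of a target expectation. A minimal numeric check (the marginals and the test function f are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
p_src = np.array([0.8, 0.2])    # source label marginal
p_trg = np.array([0.3, 0.7])    # target label marginal
f = lambda x, y: x + y           # any bounded test function

n = 200_000
y = rng.choice(2, size=n, p=p_src)
x = rng.normal(y, 1.0)           # shared class-conditional P(x|y) = N(y, 1)

w = (p_trg / p_src)[y]           # importance weights r(y_i)
iw_estimate = np.mean(w * f(x, y))

# Ground truth under the target: E[x|y] = y, so E_trg[x + y] = 2 * P_trg(1).
target_truth = 2 * p_trg[1]      # = 1.4
```

The weighted source average converges to the target expectation even though every sample was drawn from P_src.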
Following existing label shift literature, we restrict our learning problems to those with a finite k-class label space. We can estimate these importance weights with only labeled data from the source distribution, unlabeled data from the target distribution, and a blackbox hypothesis h_0 [Lipton et al., 2018]. Let C_h denote the confusion matrix for hypothesis h on P_src, where E[C_h[i, j]] := P_src(h(X) = y^(i), Y = y^(j)), and let q_h denote a k-vector with q_h[i] := P_trg(h(X) = y^(i)). Assuming for all labels ∀y: P_trg(y) > 0 ⇒ P_src(y) > 0, [Lipton et al., 2018] shows the importance weights r are:

r := P_trg(y) / P_src(y) = C_{h_0}^{-1} q_{h_0}.    (6)
For instance, Regularized Learning under Label Shift (RLLS) [Azizzadenesheli et al., 2019] finds r through convex optimization of:

C_{h_0}^{-1} q_{h_0} ≈ argmin_r ‖C_{h_0} r − q_{h_0}‖_2 + λ‖r − 1‖_2,    (7)

where λ is some regularization constant.

¹ We abuse notation and define P(·) as P(x) := P(X = x) or P(y) := P(Y = y) depending on the context.
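Objective (7) shrinks the estimated weights toward 1 when the data are noisy. As a rough illustration, the squared-norm analogue min_r ‖Cr − q‖² + λ‖r − 1‖² has the closed form r = (CᵀC + λI)⁻¹(Cᵀq + λ·1); this is a simplification for exposition, not RLLS itself, which uses unsquared norms and a calibrated λ:

```python
import numpy as np

def regularized_weights(C, q, lam):
    """Closed-form minimizer of ||C r - q||^2 + lam * ||r - 1||^2.

    A squared-norm simplification of the RLLS objective (7): lam -> 0
    recovers plain inversion r = C^{-1} q; large lam pulls r toward 1.
    """
    k = C.shape[0]
    A = C.T @ C + lam * np.eye(k)
    b = C.T @ q + lam * np.ones(k)
    return np.linalg.solve(A, b)

# Toy confusion matrix / target prediction marginals (invented numbers).
C = np.array([[0.6, 0.05],
              [0.1, 0.25]])    # C[i, j] = P_src(h = i, Y = j)
q = np.array([0.35, 0.65])     # q[i] = P_trg(h = i)

r_noreg = regularized_weights(C, q, lam=1e-9)   # ~ C^{-1} q
r_reg = regularized_weights(C, q, lam=10.0)     # shrunk toward 1
```

With little regularization the estimate matches direct inversion; with heavy regularization it stays near the no-shift weights 1.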
Class-balanced Sampling (Subsampling)
A popular heuristic for addressing class imbalance in active learning is adjusting the probability of labeling a datapoint by its predicted label [Yang and Ma, 2010, Park, 2011]. Traditionally, class-balanced sampling aims to ensure equal representation of each label and can be framed as a form of label shift with a uniform target label distribution. We generalize class-balanced sampling to general label shift problems with potentially non-uniform targets, a practice we term subsampling. We now describe two methods of subsampling. In these examples, we subsample a user-defined distribution P_med from a source P_src using predictor φ for predicting proxy labels.
1. Subsampling with a filter P_ss, where P_med(y) ∝ P_ss(y = φ(x)) P_src(y = φ(x)). Repeat until a sample is yielded: sample datapoint x from P_src and, with probability P_ss(Y = φ(x)), yield x.
2. Subsampling with the target P_med. Collect N datapoints from P_src into a buffer S′, where N is large. For each label y ∈ Y, randomly add N·P_med(y) datapoints from {x ∈ S′ | φ(x) = y} into a buffer S. To sample from P_med, draw from S.
While in finite settings only the former yields IID samples, the two are identical in the limit by the law of large numbers. Since subsampling strictly concerns proxy labels and thus does not require labeled samples, we assume subsampling occurs at the limit and use the two interchangeably.
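The first method is a rejection sampler. A minimal sketch under assumed components (the proxy-label predictor phi and both distributions below are placeholders, not from the paper):

```python
import random

def subsample(draw_source, phi, p_ss, rng):
    """Rejection-sample one point so that predicted labels follow
    P_med(y) ∝ P_ss(y) * P_src(phi(x) = y)  (method 1 in the text)."""
    while True:
        x = draw_source()                  # x ~ P_src
        if rng.random() < p_ss[phi(x)]:    # keep with prob P_ss(phi(x))
            return x

# Toy setup: points double as labels, so phi is the identity (placeholder).
rng = random.Random(3)
draw_source = lambda: 0 if rng.random() < 0.9 else 1   # P_src = (0.9, 0.1)
p_ss = {0: 1 / 9, 1: 1.0}   # chosen so P_med is uniform: 0.9 * (1/9) = 0.1

samples = [subsample(draw_source, lambda y: y, p_ss, rng)
           for _ in range(20_000)]
```

With these acceptance probabilities, a 90/10 source stream yields an approximately 50/50 subsampled stream, i.e., a uniform medial distribution.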
In expectation, subsampling is equivalent to importance weighting with proxy labels predicted by φ:

E_Q[ (1/n) Σ_{i=1}^{n} Q_i f(x_i, y_i) ] = (1/n) Σ_{i=1}^{n} [P_trg(y_i = φ(x_i)) / P_src(y_i = φ(x_i))] f(x_i, y_i),

where Q_i ∈ {0, 1} is an indicator variable for whether the i-th datapoint is subsampled and has conditional expectation E[Q_i | y_i] = P_ss(y_i).
4 Medial Distribution

In this section, we propose the concept of a medial distribution. We conceptually frame subsampling as the importance sampling of an alternative distribution from the source distribution. We term this alternative distribution the medial distribution P_med. As we will show, P_med mediates a trade-off between subsampling and importance weighting (IW).
IW-Subsampling Trade-off
In this section, we adopt domain adaptation notation and denote source, medial and target distributions as P_src, P_med, P_trg. Let r_{s→t} := P_trg(y)/P_src(y) denote the importance weights which shift the source to the target. Similarly, let r_{s→m} := P_med(y)/P_src(y) and r_{m→t} := P_trg(y)/P_med(y) denote the importance weights to and from the medial distribution. Note that P_med(y), P_src(y), P_trg(y) denote the likelihood of a ground-truth label y. Estimated weights are accented with a hat: r̂. We follow [Lipton et al., 2018] and formalize label shift magnitude as ‖θ‖: some norm of θ := r − 1, usually the L2 norm ‖·‖_2. A large ‖θ‖ corresponds to a larger label shift and harder learning problem. ‖θ_{s→m}‖ is the amount of label shift corrected by subsampling and ‖θ_{m→t}‖ is the amount corrected by importance weighting. We can analyze this medial distribution trade-off by introducing the following bound on the accuracy of empirical loss estimates under label shift where subsampling and importance weighting are used. This theorem is a modification of a common error bound for offline supervised learning under label shift.
Theorem 1. Let ∆ denote the subsampling and importance weighting (trained on n datapoints) estimation error of the empirical loss of N datapoints:

∆ := (1/N) Σ_{i=1}^{N} ( r_i P_ss(y_i) − r̂_i P_ss(h(x_i)) ) ℓ(h(x_i), y_i),    (8)

where ℓ : Y × Y → [0, 1] is a loss function. With probability 1 − 2δ, for all n ≥ 1:

|∆| ≤ O( (2/σ_min) ‖θ_{m→t}‖_2 √(log(nk/δ)/n) + √(log(n/δ)/n) + ‖θ_{s→m}‖_∞ err(h_0, r_{m→t}) ),    (9)

where σ_min denotes the smallest singular value of the confusion matrix and err(h_0, r) denotes the importance weighted 0/1-error of a blackbox predictor h_0 on P_src.
The error bound in Theorem 1 shows that the use of subsampling versus importance weighting results in different error bounds with different trade-offs. In particular, the trade-off lies between the first summand, ‖θ_{m→t}‖_2 √(log(nk/δ)/n), and the third summand, ‖θ_{s→m}‖_∞ err(h_0, r_{m→t}). The former term corresponds to the error introduced by the use of importance weights—in particular, the variance that arises from importance weight estimation. This variance is sensitive to the magnitude of the ground-truth importance weights, ‖r_{m→t}‖_2. Recall that in our medial distribution framework, importance weights correct the label shift between P_med and P_trg. The latter term corresponds to the subsampling estimation error—in particular, the bias introduced by the use of proxy labels for data weighting. This bias is sensitive to the magnitude of subsampling, ‖θ_{s→m}‖_∞, and the accuracy of the blackbox hypothesis err(h_0, r_{m→t}). Hence, subsampling mitigates sensitivity to label shift magnitude by splitting the norm of total label shift, ‖θ_{s→t}‖, into the sum of two factors which scale with ‖θ_{s→m}‖ and ‖θ_{m→t}‖.
The key to addressing this bias-variance trade-off is choosing a medial distribution which balances the quality of the blackbox hypothesis err(h_0, r_{m→t}) and the label shift magnitude ‖r_{s→t}‖. In effect, subsampling turns a difficult label shift problem, requiring large importance weights r_{s→t}, into an easier label shift problem, with smaller importance weights r_{m→t}.
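The weight decomposition behind this trade-off is simply r_{s→t}(y) = r_{s→m}(y) · r_{m→t}(y). A small numeric illustration with a uniform medial distribution (the marginals are invented): the medial step splits one large shift into two milder ones.

```python
import numpy as np

p_src = np.array([0.90, 0.10])
p_med = np.array([0.50, 0.50])   # uniform medial distribution
p_trg = np.array([0.20, 0.80])

r_st = p_trg / p_src             # full source -> target correction
r_sm = p_med / p_src             # portion corrected by subsampling
r_mt = p_trg / p_med             # portion corrected by importance weighting

# The two stages compose back into the full correction.
assert np.allclose(r_sm * r_mt, r_st)

theta = lambda r: np.linalg.norm(r - 1)   # label shift magnitude ||theta||_2
# IW after subsampling faces a much milder shift than direct IW:
assert theta(r_mt) < theta(r_st)
```

Here ‖θ_{s→t}‖_2 ≈ 7.0 while ‖θ_{m→t}‖_2 ≈ 0.85, so the importance weights applied after subsampling are far less extreme.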
Uniform Medial Distribution
We motivate a particular choice of medial distribution, a uniform label distribution, with an example. Figure 1 depicts two fundamental label shift regimes which we term imbalanced source and imbalanced target. Imbalanced target requires smaller importance weights to correct than imbalanced source and is hence more efficient for IW to correct. Imbalanced source requires fewer additional examples than imbalanced target and is more efficient for subsampling to correct. This holds more broadly. Consider a binary classification problem with n datapoints and two possible label distributions: balanced distribution D_1 with n/2 datapoints in each class, and imbalanced distribution D_2 with n − 1 datapoints in the majority class. Under imbalanced source, where P_src := D_2 and P_trg := D_1, n − 2 additional samples from the under-represented class are necessary for negating label shift. Under imbalanced target, where P_src := D_1 and P_trg := D_2, (n − 2)·(n/2) ∈ O(n²) additional samples are necessary.

This suggests subsampling under imbalanced source and importance weighting under imbalanced target. A uniform medial distribution decomposes every label shift problem into the two settings: subsample to a uniform label distribution (imbalanced source) then importance weight away from uniform (imbalanced target). As we will show, this affords convenient upper bounds on the sample complexity of active learning with a uniform medial distribution. We will also show, experimentally, that uniform distributions serve as a reliable choice for medial distributions and perform similarly to "square root" medial distributions that are optimal in simple cases, e.g., singleton X.
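The sample counts in this example can be verified mechanically. The helper below computes, for binary class counts, how many extra draws of each class are needed to reach a target ratio without discarding existing data; it is a sketch of the counting argument, not anything from the paper's proofs:

```python
def extra_samples(src_counts, trg_props):
    """Minimum additional per-class samples so that class proportions
    match trg_props while keeping every existing datapoint (binary case)."""
    # Scale the target proportions up until both existing counts are covered.
    scale = max(c / p for c, p in zip(src_counts, trg_props))
    return [round(scale * p - c) for c, p in zip(src_counts, trg_props)]

n = 100
# Imbalanced source (n-1 majority, 1 minority) -> balanced target: n - 2 extra.
assert extra_samples([n - 1, 1], [0.5, 0.5]) == [0, n - 2]
# Balanced source -> imbalanced target ((n-1):1): (n-2) * n/2 extra, i.e. O(n^2).
assert extra_samples([n // 2, n // 2], [(n - 1) / n, 1 / n]) == [(n - 2) * n // 2, 0]
```

For n = 100 this gives 98 extra samples in the first regime versus 4,900 in the second, matching the quadratic gap in the text.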
[Figure 4 graphic: target and source unlabeled data flow through a subsampler (inducing the medial distribution) and weight estimation (producing importance weights) into a learner with an active sampler that issues queried labels.]
Figure 4: The MALLS algorithm. MALLS consists of 3 routines: (1) class-balanced sampling from the unlabeled set, (2) actively querying for labels, and (3) importance weight estimation for correcting label shift.
5 Streaming MALLS
In this section, we present a streaming active learning
algorithm: Mediated Active Learning under Label Shift
(MALLS). We analyze the generalization error and
the label complexity of streaming MALLS and validate
the theory with experiments. We present a practical
batched MALLS approach in Sec. 6. We also open-
source an implementation of MALLS.
Algorithm 1 Mediated Active Learning under Label Shift
Input: Warm start set (D_warm), unlabeled set (D_ulb), test set (D_test), active learning budget n, label shift budget λ, blackbox predictor h_0, medial distribution P_med, hypothesis class H
Initialize the dataset S ← warm start set;
Subsample the unlabeled set using h_0 to induce P_med;
Estimate importance weights:
    Obtain r with RLLS [Azizzadenesheli et al., 2019] using h_0, unlabeled test data, and λn labeled datapoints from the unlabeled set;
While |S| < n:
    Calculate the IWAL-CAL [Beygelzimer et al., 2010] sampling probability P_t for x_t using S weighted by r;
    Label and append (x_t, y_t) to S with probability P_t;
Output: h_T := argmin_{h ∈ H} err(h, r), where err is estimated on S.
Proposed Algorithm
We build on a popular importance-weighted agnostic active learning algorithm, IWAL-CAL [Beygelzimer et al., 2010]. We refer to IWAL-CAL as a subprocedure and defer its details to the Appendix. IWAL-CAL takes as input a datapoint x_t and returns a sampling probability P_t. MALLS modifies the computation of P_t by applying importance weights to correct for label shift in empirical loss estimates. Specifically, IWAL-CAL depends on estimating hypothesis loss on the actively labeled dataset S:
\[
\mathrm{err}(h) = \frac{1}{|S|}\sum_{t=1}^{|S|} \ell(h(x_t),\, y_t), \tag{10}
\]
where x_t, y_t are drawn from S. MALLS instead computes empirical loss estimates as:
\[
\mathrm{err}(h, r) = \frac{1}{|S|}\sum_{t=1}^{|S|} r(y_t)\,\ell(h(x_t),\, y_t), \tag{11}
\]
where r(y_t) denotes an importance weight for datapoints of label y_t. MALLS computes these importance weights r by calling a blackbox label shift estimator (e.g., BBSE [Lipton et al., 2018]). Our derivations use Regularized Learning under Label Shift (RLLS) [Azizzadenesheli et al., 2019]. Since label shift estimation algorithms require an independent holdout set for estimating importance weights, MALLS estimates importance weights on a holdout set of λn labeled datapoints sampled from P_med through subsampling. MALLS also adds subsampling as a preprocessing step to IWAL-CAL, re-using the blackbox hypothesis used in label shift estimation as a predictor. Thus, instead of directly sampling points from D_ulb, IWAL-CAL interacts with datapoints subsampled from D_ulb and distributed according to P_med. We detail the high-level flow of MALLS in Figure 4 and provide pseudocode in Algorithm 1.
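A minimal sketch of this pipeline follows, using BBSE's unregularized confusion-matrix solve in place of RLLS (RLLS additionally regularizes the solve, which we omit) together with the importance-weighted loss of Eq. (11). Function names and signatures are illustrative.

```python
import numpy as np

def estimate_weights_bbse(y_holdout_pred, y_holdout_true, y_trg_pred, k):
    """Blackbox label shift estimation in the style of BBSE
    [Lipton et al., 2018]: solve C r = mu, where C is the joint
    confusion matrix of h_0 on labeled source data and mu holds the
    frequencies of h_0's predictions on unlabeled target data."""
    C = np.zeros((k, k))
    for yp, yt in zip(y_holdout_pred, y_holdout_true):
        C[yp, yt] += 1.0
    C /= len(y_holdout_true)
    mu = np.bincount(y_trg_pred, minlength=k) / len(y_trg_pred)
    return np.linalg.solve(C, mu)  # r(y) estimates P_trg(y) / P_src(y)

def weighted_err(losses, y, r):
    """Importance-weighted empirical loss of Eq. (11) over a labeled set."""
    return np.mean(np.asarray(r)[y] * np.asarray(losses))
```

With a perfect predictor, C is diagonal with the source label marginals, so the solve reduces to the exact ratio P_trg(y)/P_src(y).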
Figure 5: Average performance and 95% confidence
intervals on 10 runs of experiments on MNIST, 5 runs
on CIFAR in a
canonical label shift
setting (defined in
Preliminaries). Accuracy on (a) MNIST, (b) CIFAR.
MALLS leads to sample efficiency gains in both settings.
A “uniform” medial performs on par with a “square
root” medial.
Theoretical Analysis
We now analyze label complexity and generalization bounds for Algorithm 1. In the canonical label shift setting, label shift naturally disappears asymptotically as the warm start dataset is diluted. For the remainder of this section, we instead work in the more challenging general label shift setting. As the presence of warm start data is not particularly interesting in our analysis, we set the warm start budget m = 0 for reading convenience and defer the case where m > 0 to the Appendix for interested readers. We also defer the case where the quantity of unlabeled test data is bounded to the Appendix.
The derivation of theoretical guarantees for MALLS
builds off our Theorem 1 and existing results from
IWAL-CAL. The proof consists of two primary steps.
First, new deviation bounds are derived for IWAL-CAL
to compensate for the additional variance introduced
by subsampling and importance weighting. Second,
triangle inequalities plug in results from Section 3 on
the bias-variance tradeoff. The resulting deviation
bound (see Appendix) yields the following guarantees
for MALLS.
Theorem 2. With probability > 1 − δ, for all n ≥ 1,
\[
\mathrm{err}_Q(h_n) \le O\left( \left(1 + \frac{1}{\sigma_{\min}}\right) \|r_{st}\|_\infty\, \mathrm{err}(h_0, r_{mt}) \;+\; \mathrm{err}_Q(h^*) \;+\; \sqrt{\frac{2 C_0 \log n}{n-1}} \;+\; \frac{2 C_0 \log n}{n-1} \right), \tag{12}
\]
where err_Q denotes hypothesis error in the target domain, n denotes observed datapoints including those not labeled or subsampled, and the constant C_0 is
\[
C_0 \in O\left( \frac{2}{\lambda \sigma_{\min}} \left( \|\theta_{mt}\|_2^2 \log\frac{k}{\delta} + \log\frac{1}{\delta} \right) + \log\frac{|H|}{\delta}\left(1 + \|\theta_{st}\|_2^2\right) \right). \tag{13}
\]
Our generalization bound differs from the original IWAL-CAL bound in two key aspects. (1) The use of subsampling introduces a bias related to the performance of the blackbox hypothesis: (1/σ_min) err(h_0, r_{mt}). (2) In the original IWAL-CAL algorithm, C_0 ∈ O(log(|H|/δ)). However, label shift inevitably introduces, to the constant C_0, a dependence on the number of label classes k and the label shift magnitudes ‖θ_{st}‖_2^2 and ‖θ_{mt}‖_2^2. When the subsampling error is high, Theorem 2 shows importance weighting can be used alone to preserve a consistency guarantee even under general label shift.
Theorem 3. With high probability² at least 1 − δ, the number of labels queried is at most:
\[
O\left( 1 + \log\frac{1}{\delta} + \Theta \sqrt{C_0\, n\, \|r_{sm}\|_\infty \log n} + \Theta\, C_0 \log^3 n + \lambda n + \Theta \cdot (n-1) \cdot \left( \mathrm{err}_Q(h^*)\, \|r_{sm}\|_\infty + \left(1 + \frac{1}{\sigma_{\min}}\right) \mathrm{err}(h_0, r_{mt}) \right) \right), \tag{14}
\]
where Θ denotes the disagreement coefficient [Balcan et al., 2009].
Subsampling effectively increases the noise rate of the underlying problem. This increases the linear noise-rate term O(n) inevitable in agnostic active learning labeling complexities. However, subsampling also reduces sample complexity by a factor of 1/‖r_{sm}‖_∞. Importance
² Where δ < 2^{−2e−1}/‖r_{sm}‖_∞.
Figure 6: Average performance and 95% confidence intervals of 10 runs on
CIFAR100
, and 4 runs on NABirds.
Plots (a)-(c) demonstrate MALLS consistently improves accuracy and macro F1 scores. Plot (d) depicts the
learning dynamics of MALLS and verifies a suppression of the over-represented class (Class 21) during learning.
Figure 7: Average performance and 95% confidence intervals of 10 runs on
CIFAR10
and CIFAR100. MALLS
consistently improves accuracy, macro F1, and weighted F1 scores.
weighting introduces a new linear label complexity term O(λn). This is used to collect a holdout set for label shift estimation. Thus, when the blackbox hypothesis is bad and strong importance weighting is necessary, the sample complexity improvements of active learning are lost. However, given a good blackbox hypothesis, the medial distribution can be set closer to the target (small ‖θ_{mt}‖) and λ can be set small, so MALLS retains the sample complexity gains of active learning.
Experiments
We empirically validate MALLS with experiments on synthetic label shift problems on the MNIST and CIFAR benchmark datasets. These experiments employ a bootstrap approximation of IWAL-CAL recommended in [Beygelzimer et al., 2009], using a version space of 8 Resnet-18 models. The blackbox hypothesis is obtained by training a standalone model on the warm start data split. Random sampling and vanilla active learning (IWAL-CAL) are compared against MALLS for two choices of medial distribution:

1. A "square root" medial distribution where r_{sm} = r_{mt} = √(r_{st}). This directly optimizes the error tradeoff in Theorem 1.

2. A "uniform" medial distribution motivated by Figure 1 and the intuition of imbalanced sources/targets.

The results, shown in Figure 5, demonstrate significant sample efficiency gains due to MALLS, even when vanilla IWAL-CAL no longer beats random sampling. Despite its simplicity, the performance of the uniform medial distribution is indistinguishable from that of the theoretically motivated "square root" medial distribution.
6 Batched MALLS
We present a variant of MALLS for practitioners that integrates best practices for scaling label shift estimation. This variant, depicted in Algorithm 2, is a framework for batched active learning that supports any blackbox uncertainty sampling algorithm.
Best Practices
Batched MALLS incorporates five important techniques for scaling the real-world practice of label shift correction.

1. Forgo the use of independent holdout sets and instead learn importance weights r on the main dataset S.

2. Motivated by Theorem 1, Batched MALLS uses the current active learning predictor for subsampling.
3. Approximate subsampling with a generalization of class-balanced sampling that is compatible with batch-mode active learning [Aggarwal et al., 2020].

Figure 8: Average performance and 95% confidence intervals on 10 runs of experiments on CIFAR100 in the canonical label shift setting (defined in Preliminaries). In order of increasing label shift magnitude: (a), (b), (c), (d). MALLS performance gains scale with label shift magnitude.
4. Apply importance weights at inference time: Batched MALLS replaces the importance weighting of empirical loss estimates with posterior regularization, a practice closely related to the expectation-maximization algorithm of [Saerens et al., 2002].

5. Use hypotheses learned with importance weights as blackbox predictors to learn better importance weights. We term this iterative reweighting.
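Item 4 can be made concrete: instead of weighting the training loss, reweight predicted posteriors at inference. A minimal sketch, assuming class-probability outputs and precomputed importance weights r (this mirrors the prior-adjustment step in [Saerens et al., 2002], though the function name is ours):

```python
import numpy as np

def posterior_regularize(probs, r):
    """Reweight predicted posteriors by importance weights at inference
    time: p'(y|x) proportional to r(y) * p(y|x), renormalized per row.

    probs: (n, k) array of predicted class probabilities.
    r:     length-k array of importance weights."""
    adj = np.asarray(probs) * np.asarray(r)[None, :]
    return adj / adj.sum(axis=1, keepdims=True)
```

This leaves training untouched, which is why it tends to have lower variance than importance-weighted losses.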
Algorithm 2 Batched MALLS
Input: Warm start set, unlabeled pool D_ulb, test set, number of batches T, medial distribution P_med, uncertainty quantifier π, batch size B, hypothesis class H
Initialize the labeled dataset S_0 ← the warm start set;
Initialize hypothesis φ by training on the warm start set;
For t ∈ 1, ..., T:
    Find importance weights r_t ← RLLS(S_t, φ, P_med);
    Train hypothesis φ on S_t weighted by r_t;
    For y ∈ Y:
        Number of datapoints to collect k := B × P_med(y);
        Find the top-k most uncertain datapoints of label y: D_y := top-k(π, {x ∈ D_ulb \ S_{t−1} | φ(x) = y});
        Label and append the top-k datapoints, D_y, to S_t;
Output: h_T = argmin{err(h, S_T, r_T) : h ∈ H}
Experiments
We demonstrate the Batched MALLS framework on the ornithology dataset NABirds [Van Horn et al., 2015] and the benchmark datasets CIFAR10 & CIFAR100 [Krizhevsky, 2009]. Our experiments show MALLS improves active learning performance under a diverse range of label shift scenarios.
Methods
We evaluate our Batched MALLS framework on several uncertainty sampling algorithms: (1) Monte Carlo dropout (MC-D) [Gal and Ghahramani, 2016]; (2) maximum entropy sampling (MaxEnt); and (3) maximum margin (Margin). We compare against random sampling and active learning without MALLS (marked Vanilla). In ablation studies, we also compare against only importance weighting or subsampling. As in Section 5, the blackbox hypothesis is obtained by training a model on the warm start data split.
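For concreteness, the three uncertainty scores can be sketched as follows, with conventions chosen so that higher means more uncertain. The MC-dropout score shown (entropy of the mean posterior over stochastic forward passes) is one of several scores derived from [Gal and Ghahramani, 2016], not necessarily the exact variant used here.

```python
import numpy as np

def max_entropy(probs):
    """Entropy of the predictive distribution, per row of (n, k) probs."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def margin(probs):
    """Negative gap between the top two class probabilities, so that
    higher values again mean more uncertain."""
    part = np.sort(probs, axis=1)
    return -(part[:, -1] - part[:, -2])

def mc_dropout(prob_samples):
    """Given T stochastic forward passes of shape (T, n, k), score by
    the entropy of the mean posterior."""
    return max_entropy(prob_samples.mean(axis=0))
```

Any of these can serve as the `scores` input to the per-class selection step of Batched MALLS.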
Primary Results
We present our primary results in Figures 6-7. These experiments apply MALLS to the batch-mode pool-based active learning of Resnet18 models. The label shift in the NABirds dataset arises from a naturally occurring class imbalance where a dominant class constitutes a near majority of all data [Aggarwal et al., 2020]. We adopt this imbalance and assume a uniform test label distribution. We artificially induce canonical label shift in the CIFAR10 and CIFAR100 experiments by applying [Lipton et al., 2018]'s Dirichlet Shift procedure to the unlabeled D_ulb and test D_test datasets.
In all experiments, MALLS significantly improves both
accuracy and macro F1 scores. In synthetic shift ex-
periments, MALLS reduces sample complexity by up
to half an order of magnitude.
Learning Dynamics of MALLS
Figure 6(d) details
the learning evolution of MALLS by depicting a dom-
inant class’s accuracy over training time. The class’s
accuracy initially declines due to the class’s low im-
portance weights, but recovers as the label shift is
corrected and the dominant class’s importance weight
grows.
Uncertainty Measures
Figures 7(c)(d) and 9(c) demonstrate that the performance improvements from using Batched MALLS generalize to several popular uncertainty sampling algorithms. Importantly, the gains realized by using Batched MALLS are largely independent of the choice of uncertainty sampling.
Imbalanced Source vs. Imbalanced Target
Figures 9(a)(b) depict synthetic general label shift problems under imbalanced source and imbalanced target settings on CIFAR100. We compare MALLS against the use of subsampling or importance weighting alone to investigate the trade-off implied by theory. While Figure 9(a) demonstrates that subsampling accounts for
Figure 9: Average and 95% confidence intervals on 10 runs of experiments on CIFAR100. Figures (a)(b) depict general label shift problems; (c)(d) depict canonical label shift (defined in Preliminaries). (a) Top-5 accuracy under imbalanced source: subsampling outperforms importance weighting. (b) Macro F1 under imbalanced target: importance weighting outperforms subsampling. (c) Batched MALLS provides performance gains for multiple popular uncertainty sampling methods. (d) Batched MALLS's best practices provide significant performance gains.
MALLS's performance gains under imbalanced source, Figure 9(b) demonstrates that importance weighting accounts for MALLS's performance gains under imbalanced target. This corroborates our theoretical analysis.
Label Shift Magnitude
These experiments evaluate MALLS on different magnitudes of label shift, where label shift is induced according to Dirichlet distributions for varying choices of α. Note that shift magnitude is inversely correlated with α: smaller α denotes a larger shift. Figure 8 demonstrates that the performance gains introduced by RLLS scale with the magnitude of the label shift. The results also confirm that the effectiveness of active learning drops under strong label shift. Plot (a) confirms that even when label shift is negligible, MALLS does not perform significantly worse than vanilla active learning.
Best Practices
Figure 9(d) compares performance when Batched MALLS's heuristics of posterior regularization (PR) and iterative reweighting (ITIW) are not used. Posterior regularization lowers variance (versus importance weighting) and especially improves early-stage performance. Iterative reweighting similarly introduces consistent performance gains. Combining them provides additional gains.
7 Conclusion
In this paper, we propose an algorithm for active learn-
ing under label shift, MALLS, with strong label com-
plexity and generalization bounds. We also introduce
a framework, Batched MALLS, for practitioners to
address label shift in real world uncertainty sampling
applications. In many applications that require manual labeling of data, like natural language processing
and computer vision, an extension of the techniques
we explore in MALLS may help mitigate bias in the
data collection process. Many problems of theoreti-
cal importance—such as cost-sensitive, multi-domain,
and Neyman-Pearson settings—share a fundamental
connection with the label shift problem. We believe
MALLS can be extended to provide novel results in
these settings as well.
Acknowledgements
Anqi Liu is supported by the PIMCO Postdoctoral
Fellowship. Prof. Anandkumar is supported by Bren
endowed Chair, faculty awards from Microsoft, Google,
and Adobe, Beyond Limits, and LwLL grants. This
work is also supported by funding from Raytheon and
NASA TRISH.
References
[Aggarwal et al., 2020]
Aggarwal, U., Popescu, A., and Hudelot, C. (2020). Active Learning for Imbalanced Datasets. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1428–1437.
[Azizzadenesheli et al., 2019]
Azizzadenesheli, K., Liu,
A., Yang, F., and Anandkumar, A. (2019). Reg-
ularized Learning for Domain Adaptation under
Label Shifts.
arXiv:1903.09734 [cs, stat]
. arXiv:
1903.09734.
[Balcan et al., 2009]
Balcan, M.-F., Beygelzimer, A.,
and Langford, J. (2009). Agnostic active learning.
Journal of Computer and System Sciences
, 75(1):78–
89.
[Ben-David et al., 2010]
Ben-David, S., Blitzer, J.,
Crammer, K., Kulesza, A., Pereira, F., and Vaughan,
J. W. (2010). A theory of learning from different
domains.
Machine Learning
, 79(1-2):151–175.
Active Learning under Label Shift
[Ben-David et al., 2007]
Ben-David, S., Blitzer, J.,
Crammer, K., and Pereira, F. (2007). Analysis
of Representations for Domain Adaptation.
In Schölkopf, B., Platt, J. C., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 137–144. MIT Press.
[Beygelzimer et al., 2009]
Beygelzimer, A., Dasgupta,
S., and Langford, J. (2009). Importance Weighted
Active Learning.
arXiv:0812.4952 [cs]
.
arXiv:
0812.4952.
[Beygelzimer et al., 2010]
Beygelzimer, A., Hsu, D.,
Langford, J., and Zhang, T. (2010). Agnostic Active
Learning Without Constraints.
arXiv:1006.2588 [cs]
.
arXiv: 1006.2588.
[Chan and Ng, 2007]
Chan, Y. S. and Ng, H. T. (2007).
Domain Adaptation with Active Learning for Word
Sense Disambiguation. In
Proceedings of the 45th
Annual Meeting of the Association of Computational
Linguistics
, pages 49–56, Prague, Czech Republic.
Association for Computational Linguistics.
[Chattopadhyay et al., 2013]
Chattopadhyay, R., Fan,
W., Davidson, I., Panchanathan, S., and Ye, J.
(2013). Joint transfer and batch-mode active learn-
ing. In
30th International Conference on Machine
Learning, ICML 2013
, pages 1290–1298. Interna-
tional Machine Learning Society (IMLS).
[Choudhury and Srinivasa, 2020]
Choudhury, S. and
Srinivasa, S. S. (2020). A Bayesian Active Learning
Approach to Adaptive Motion Planning. In
Robotics
Research
, pages 33–40. Springer.
[Cortes et al., 2010]
Cortes, C., Mansour, Y., and
Mohri, M. (2010). Learning Bounds for Importance
Weighting. In Lafferty, J. D., Williams, C. K. I.,
Shawe-Taylor, J., Zemel, R. S., and Culotta, A., ed-
itors,
Advances in Neural Information Processing
Systems 23
, pages 442–450. Curran Associates, Inc.
[Cortes and Mohri, 2014]
Cortes, C. and Mohri, M.
(2014). Domain adaptation and sample bias correc-
tion theory and algorithm for regression.
Theoretical
Computer Science
, 519:103–126. Publisher: Elsevier.
[Deng et al., 2018]
Deng, C., Liu, X., Li, C., and Tao,
D. (2018). Active multi-kernel domain adaptation
for hyperspectral image classification.
Pattern Recog-
nition
, 77:306–315. Publisher: Elsevier.
[Gal and Ghahramani, 2016]
Gal, Y. and Ghahra-
mani, Z. (2016). Dropout as a Bayesian Approxima-
tion: Representing Model Uncertainty in Deep Learn-
ing.
arXiv:1506.02142 [cs, stat]
. arXiv: 1506.02142.
[Garg et al., 2020]
Garg, S., Wu, Y., Balakrishnan, S.,
and Lipton, Z. C. (2020). A Unified View of Label
Shift Estimation.
arXiv:2003.07554 [cs, stat]
. arXiv:
2003.07554.
[Gretton et al., 2009]
Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., Schölkopf, B., Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. (2009). Covariate Shift by Kernel Mean Matching. Dataset Shift in Machine Learning, pages 131–160.
[Hanneke, 2007]
Hanneke, S. (2007). A bound on the
label complexity of agnostic active learning. In
Pro-
ceedings of the 24th international conference on Ma-
chine learning
, ICML ’07, pages 353–360, Corvalis,
Oregon, USA. Association for Computing Machinery.
[Hanneke, 2011]
Hanneke, S. (2011). Activized Learn-
ing: Transforming Passive to Active with Improved
Label Complexity.
arXiv:1108.1766 [cs, math, stat]
.
arXiv: 1108.1766.
[Hanneke, 2014]
Hanneke, S. (2014).
Theory of
Disagreement-Based Active Learning.
Foundations
and Trends
®
in Machine Learning
, 7(2-3):131–309.
Publisher: Now Publishers, Inc.
[Huang and Chen, 2016]
Huang, S.-J. and Chen, S.
(2016). Transfer learning with active queries from
source domain. In
Proceedings of the Twenty-Fifth
International Joint Conference on Artificial Intelli-
gence
, IJCAI’16, pages 1592–1598, New York, New
York, USA. AAAI Press.
[Krishnamurthy et al., 2019]
Krishnamurthy,
A.,
Agarwal, A., Huang, T.-K., Daume III, H., and
Langford, J. (2019). Active Learning for Cost-
Sensitive Classification.
arXiv:1703.01014 [cs, stat]
.
arXiv: 1703.01014.
[Krizhevsky, 2009]
Krizhevsky, A. (2009). Learning
Multiple Layers of Features from Tiny Images.
[Lin et al., 2018]
Lin, C. H., Mausam, M., and Weld,
D. S. (2018). Active Learning with Unbalanced
Classes and Example-Generation Queries. In
Sixth
AAAI Conference on Human Computation and
Crowdsourcing
.
[Lipton et al., 2018]
Lipton, Z. C., Wang, Y.-X., and
Smola, A. (2018). Detecting and Correcting for Label
Shift with Black Box Predictors.
[Matasci et al., 2012]
Matasci, G., Tuia, D., and
Kanevski, M. (2012). SVM-based boosting of active
learning strategies for efficient domain adaptation.
IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing
, 5(5):1335–1343.
Publisher: IEEE.