Active Learning under Label Shift
Eric Zhao Anqi Liu Animashree Anandkumar Yisong Yue
{elzhao,anqiliu,anima,yyue}@caltech.edu
Department of Computing and Mathematical Sciences
California Institute of Technology
Pasadena, CA 91126, USA
Abstract
Distribution shift poses a challenge for active data collection in the real world.
We address the problem of active learning under label shift and propose ALLS, the first framework for active learning under label shift. ALLS builds on label shift estimation techniques to correct for label shift with a balance of importance weighting and class-balanced sampling. We show a bias-variance trade-off between these two techniques and prove error and sample complexity bounds for a disagreement-based algorithm under ALLS. Experiments across a range of label shift settings demonstrate ALLS consistently improves performance, often reducing sample complexity by more than half an order of magnitude. Ablation studies corroborate the bias-variance trade-off revealed by our theory.
1 Introduction
Distribution shift poses a significant challenge for traditional active learning techniques. We study how to effectively perform active learning under label shift, an important but often overlooked form of distribution shift. Label shift arises when class proportions differ between training and testing distributions but the feature distributions of each class are unchanged. For instance, the problem of training a bird classifier using data from a different geographical region poses a label shift problem: while the likelihood of observing a subspecies (i.e. $p(y)$) varies by region, members of a subspecies look the same (i.e. $p(x|y)$) regardless of location. The problem of active learning under label shift is particularly important for adapting existing machine learning models to new domains or addressing under-represented classes in imbalanced datasets [1, 2]. This problem is also relevant to the correction of societal bias in datasets, such as the important concern of minority under-representation in computer vision datasets [3]. Label shift is also a helpful heuristic for addressing general distribution shift. For instance, an importance weighting function for addressing data shift in classification problems can be approximated with a finite set of importance weights by reducing to a label shift problem [1].
Current techniques for active learning under distribution shift, sometimes termed "active domain adaptation", either rely on heuristics for correcting general forms of distribution shift [4, 5] or build on the assumption of covariate shift [6–8]. Unlike the covariate shift setting, active learning under label shift is complicated by the fact that the underlying distribution shift is associated with labels, which cannot be observed in unlabeled datapoints.
Algorithm contributions We build on (1) importance weight estimation methods from prior literature on learning under label shift [1, 9] and (2) subsampling heuristics from literature on active learning under class imbalance [10]. We pose and address an important question of whether to correct for label shift by importance weighting or by sampling from under-represented classes. We present a novel framework for active learning under label shift (ALLS) which unifies and balances the use of importance weighting and subsampling to adapt to label shifts of varying forms and strengths. To the
Preprint. Under review.
arXiv:2007.08479v1 [cs.LG] 16 Jul 2020
[Figure 1 diagram: unlabeled data passes through subsampling into weighted learning & active sampling; test data and queried labels feed weight estimation, which produces the medial distribution and importance weights used by the learner.]
Figure 1: The ALLS Framework. ALLS consists of 3 routines: (1) subsample from the unlabeled
set, (2) estimate importance weights for correcting label shift, and (3) actively query for labels and
update learner. Details are illustrated in Sec. 3 and 4.
best of our knowledge, ALLS is the first active learning framework for general label shift settings.
Figure 1 shows a flow chart of our framework.
Theoretical contributions
We derive label complexity and generalization PAC bounds for
ALLS, the first such guarantees for this setting, by instantiating our framework on a well-studied
disagreement-based active learning algorithm (IWAL-CAL). To this end, we formalize the practice
of subsampling through the concept of a
medial distribution
. Our analysis shows that label shift
estimation and importance weighting techniques preserve the provable consistency of IWAL-CAL
with similar asymptotic sample complexity bounds. Our analysis also reveals a bias-variance trade-off
between using importance weighting and subsampling. Specifically, we show that subsampling
introduces a bias which scales with the minimum achievable error while importance weighting
introduces variance which scales with label shift magnitude.
Empirical contributions
We further instantiate our framework with various uncertainty sampling
algorithms and empirically demonstrate that our framework outperforms both the original active
learning algorithm and random sampling—even when the original active learning algorithm fails
under strong label shift. We show the effectiveness of ALLS across both synthetic label-shift settings
in CIFAR10 and CIFAR100 [11] and natural label-shift settings in the NABirds dataset [12]. To
help close the gap between the theory and practice of learning under label shift, we present best
practices which scale label shift estimation techniques to deep learning settings. We also present
extensive ablation studies which corroborate our theoretical insights into the trade-off between
importance weighting and subsampling.
2 Preliminaries
Active Learning under Distribution Shift In an active learning problem, a learner $L$ actively queries an oracle for labels on strategically selected datapoints. The learner $L$ begins with a prelabeled pool $D_{warm}$ of $m$ "warm start" datapoints sampled IID from distribution $P_{warm}$. The learner seeks to maximize its performance on a test distribution, $P_{test}$. In a pool-based setting, $L$ accesses a pool $D_{ulb}$ of $n$ unlabeled datapoints drawn IID from distribution $P_{ulb}$. $L$ can view all unlabeled datapoints and progressively selects datapoints from $D_{ulb}$ to be labeled and added to a labeled set $S$. In an online setting, $L$ only observes one (or a batch) of datapoints from $D_{ulb}$ at a time and must immediately decide to discard or label.
Traditional active learning settings assume $P_{ulb} = P_{warm} = P_{test}$, which is rarely the case in practice. The setting where $P_{ulb} = P_{test}$ but $P_{warm} \neq P_{test}$ is well-studied and known as the active domain adaptation problem. We refer to this as the canonical label shift setting. The more general label shift setting, where the assumption that $P_{ulb} = P_{test}$ is lifted, has received comparatively little attention despite its practical relevance: soliciting labels from the test distribution is often impractical. We investigate both the canonical label shift and this more challenging general label shift setting, as illustrated in Figure 2 (a). In either shift setting, the task of adapting to the test distribution necessarily assumes access to an unlabeled pool of samples, $D_{test}$, sampled IID from $P_{test}$.
Figure 2: (a) Diagram of canonical label shift and general label shift settings; (b) Example of imbalanced source and imbalanced target settings in a binary classification problem.
Label Shift The distribution shift problem concerns training and evaluating models on different distributions, termed the source ($P_{src}$) and target ($P_{trg}$) respectively. Due to the difficulty of the distribution shift problem in a general label shift setting, assumptions on the nature of the distribution shift are necessary. Covariate shift, the most popular and widely analyzed form of distribution shift, assumes that the underlying distribution shift arises solely from a change in the input distribution, where $P_{trg}(x) \neq P_{src}(x)$, while conditional label probabilities are unaffected: $P_{trg}(y|x) = P_{src}(y|x)$. In contrast, label shift—the subject of this paper—assumes distribution shift arises solely from a change in label marginals, where $P_{trg}(y) \neq P_{src}(y)$, but the anti-causal conditionals are unaffected: $P_{trg}(x|y) = P_{src}(x|y)$. These shift assumptions are illustrated in Figure 3.
Importance weighting provides an important shift correction method for both covariate and label shift settings. Under label shift, weighting datapoints by their likelihood ratio $\frac{P_{trg}(y)}{P_{src}(y)}$ produces asymptotically unbiased importance weighted estimators:
$$\frac{1}{n}\sum_{i=1}^{n} \frac{P_{trg}(y_i)}{P_{src}(y_i)}\, f(x_i, y_i) \;\to\; \mathbb{E}_{x,y \sim P_{src}}\!\left[\frac{P_{trg}(y)}{P_{src}(y)}\, f(x, y)\right] = \mathbb{E}_{x,y \sim P_{trg}}\!\left[f(x, y)\right]$$
Following existing label shift literature, we restrict our learning problems to those with a finite $k$-class label space. We can estimate these importance weights $\frac{P_{trg}(y)}{P_{src}(y)}$ with only labeled data from the source distribution, unlabeled data from the target distribution, and a blackbox hypothesis $h$ [1]. Let $C_h$ denote a finite sample confusion matrix for $h$ on $P_{src}$ where $\mathbb{E}[C_h[i,j]] := P_{src}(h(X) = y_i, Y = y_j)$ and define vector $q_h$ where $q_h[i] := \hat{P}_{trg}(h(X) = y_i)$. Assuming $\forall y: P_{trg}(y) > 0 \Rightarrow P_{src}(y) > 0$, it holds that,
$$P_{trg}(h(X) = y_i) = \sum_{j=1}^{k} P_{src}(h(X) = y_i, Y = y_j)\, \frac{P_{trg}(y_j)}{P_{src}(y_j)} \qquad (1)$$
$$r := \frac{P_{trg}(y)}{P_{src}(y)} = C_h^{-1} q_h \qquad (2)$$
For instance, Regularized Learning under Label Shift (RLLS) finds importance weights $r$ through convex optimization of equation 3, where $\lambda$ is some regularization constant [9]:
$$C_h^{-1} q_h \approx \operatorname*{argmin}_{r} \left\lVert C_h\, r - q_h \right\rVert + \lambda \left\lVert r - \mathbf{1} \right\rVert, \qquad (3)$$
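The moment-matching idea behind Equations (1)–(3) can be sketched in a few lines of NumPy. As a hedge: the snippet below is not the authors' implementation; it solves a simplified ridge variant with squared norms, $\lVert C r - q\rVert^2 + \lambda\lVert r - \mathbf{1}\rVert^2$, which admits a closed form, rather than RLLS's exact unsquared objective, and `estimate_weights` is an illustrative name:

```python
import numpy as np

def estimate_weights(y_true_src, y_pred_src, y_pred_trg, k, lam=1e-3):
    """Estimate r(y) = P_trg(y)/P_src(y) from a blackbox's predictions.

    Simplified ridge variant of Eq. (3): minimizes the squared objective
    ||C r - q||^2 + lam * ||r - 1||^2 in closed form.
    """
    # Finite-sample confusion matrix C_h[i, j] ~ P_src(h(X) = y_i, Y = y_j)
    C = np.zeros((k, k))
    for yp, yt in zip(y_pred_src, y_true_src):
        C[yp, yt] += 1.0
    C /= len(y_true_src)
    # q_h[i] ~ P_trg(h(X) = y_i), computed from unlabeled target data only
    q = np.bincount(y_pred_trg, minlength=k) / len(y_pred_trg)
    ones = np.ones(k)
    # Closed-form ridge solution: (C^T C + lam I) r = C^T q + lam * 1
    return np.linalg.solve(C.T @ C + lam * np.eye(k), C.T @ q + lam * ones)
```

With a reasonably accurate blackbox, the recovered `r` tracks the true ratio $P_{trg}(y)/P_{src}(y)$ even though no target labels are observed.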
Subsampling by Class The class imbalance problem arises when the label distributions of a dataset are highly imbalanced. Prior literature on active learning under class imbalance prescribes a variety of class-based sampling techniques which adjust the sampling likelihood of datapoints associated with rare classes. We build on an intuitive and simple strategy for class-based sampling which filters out each datapoint $x_t$ according to its label with probability $1 - \frac{n}{k \sum_{i=1}^{n} \mathbf{1}[y_i = y_t]}$. For batch-mode pool-based settings, we build on an analogous—but lower variance—strategy of mandating that $c$ datapoints of each class are labeled from every batch. In practice, since labels are hidden, a classifier $\varphi$ is necessary to guess labels. To generalize these tools for our setting, we frame the class imbalance problem as a form of label shift with a target distribution known a priori. We can accordingly
Figure 3: (a) Binary classification data separated by optimal hypothesis (black line); (b) Covariate shift featuring dense top-right corner; (c) Label shift featuring higher density of the blue class.
Figure 4: (a) Target data, red class dominant; (b) Source data, blue class dominant; (c) Applying subsampling for a uniform medial distribution; (d) Importance weighting for red dominance. Black line denotes the ERM hypothesis. Larger markers indicate larger importance weights.
generalize class-balanced sampling techniques for arbitrary target distributions, a practice we term subsampling and depict in Algorithms 1 and 2. Here, the filter distribution $P_{ss}$ is defined as $P_{ss} := \frac{P_{tar}}{P_{src}}$ for some choice of target and source distributions $P_{tar}, P_{src}$. Note that a choice of filter distribution $P_{ss}[y] = \frac{n}{k \sum_{i=1}^{n} \mathbf{1}[y_i = y]}$ and target distribution $P_{tar}[y] := \frac{1}{k}$ in Algorithms 1 and 2 respectively exactly coincide with their class-imbalance counterparts.
Algorithm 1 Subsampling (General)
Input: Hypothesis $\varphi \in H$, unlabeled $x \in X$, filter distribution $P_{ss}$, policy $\pi : X \to [0, 1]$.
  Guess label $y \in Y$ for $x$ as $y := \varphi(x)$.
  With probability $1 - P_{ss}(y)$, exit.
  Otherwise, sample according to $\pi(x)$.

Algorithm 2 Subsampling (Batch-mode)
Input: Hypothesis $\varphi \in H$, unlabeled pool $D \subset X$, distribution $P_{tar}$, batch size $B$, policy $\pi_m : (m', X^m) \to [0, 1]^m$.
  For $y \in Y$:
    Create sub-batch $D_y := \{x \in D \mid \varphi(x) = y\}$.
    Sample according to $\pi_{|D_y|}(B\, P_{tar}(y), D_y)$.
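A minimal sketch of Algorithm 1's filter, assuming a uniform target distribution. Two details are our simplifications, not the paper's: $P_{ss}$ is rescaled by its maximum so every entry is a valid probability (only the ratios matter), and `phi`/`policy` are toy stand-ins for the guessing classifier and active-learning policy:

```python
import numpy as np

rng = np.random.default_rng(1)

def subsample_general(phi, x, p_ss, policy, rng):
    """Algorithm 1 sketch: keep x with probability P_ss(phi(x)), then
    hand the survivor to the active-learning policy."""
    y = phi(x)                             # guess the hidden label
    if rng.random() >= min(1.0, p_ss[y]):  # with prob. 1 - P_ss(y), exit
        return False
    return rng.random() < policy(x)        # policy's sampling decision

# P_ss := P_tar / P_src with a uniform target, rescaled to valid probabilities.
p_src = np.array([0.7, 0.2, 0.1])
p_tar = np.full(3, 1 / 3)
p_ss = (p_tar / p_src) / (p_tar / p_src).max()

phi = lambda x: int(x)    # stand-in classifier: x encodes its own class
policy = lambda x: 1.0    # always query anything that survives the filter

xs = rng.choice(3, size=60_000, p=p_src)
kept = np.array([x for x in xs if subsample_general(phi, x, p_ss, policy, rng)])
print(np.bincount(kept, minlength=3) / len(kept))  # roughly uniform
```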
3 The ALLS Framework
In this section, we present a new learning framework: Active Learning under Label Shift (ALLS). We
first present a unified view of subsampling and importance-weighting as shift-correction techniques
and pose a key question regarding the trade-off between the two strategies. We then detail our
proposed framework and discuss its online and pool-based versions.
Both importance-weighting and subsampling serve to "correct" label shift. As previously noted, importance weights $r$ as defined in Equation 2 provide for asymptotically unbiased estimators. Subsampling functions similarly; in expectation over the randomness of subsampling, subsampling (Algorithm 1) with a label oracle as $\varphi$ is equivalent to importance weighting. To see this, note that the effect of subsampling on estimators is equivalent to multiplying samples with a random bit. Then,
$$\mathbb{E}_Q\!\left[\frac{1}{n}\sum_{i=1}^{n} Q_i\, f(x_i, y_i)\right] = \frac{1}{n}\sum_{i=1}^{n} \frac{P_{trg}(y_i)}{P_{src}(y_i)}\, f(x_i, y_i) \;\to\; \mathbb{E}_{x,y \sim P_{trg}}\!\left[f(x, y)\right]$$
where $Q_i \in \{0, 1\}$ is an independent random variable with conditional expectation $\mathbb{E}[Q_i \mid y_i] = P_{ss}(y_i)$. Here, $Q_i$ captures the function of subsampling with label oracle $\varphi$. Due to their similarity, importance weighting and subsampling can be combined to correct label shift. Figure 4 depicts a label shift scenario where subsampling partially corrects for label shift and importance weighting corrects the remaining label shift.
Medial Distribution It is helpful to conceptually frame the use of subsampling as a form of intentional label shift applied to the source data which induces a new distribution. We refer to this implicit distribution as the medial distribution $P_{med}$, as the choice of this distribution mediates the balance between subsampling and importance weighting. We can then formalize the "amount" of subsampling versus importance weighting used by the distance of $P_{med}$ to $P_{ulb}$ and to $P_{test}$. To formalize an intuition for label shift distance, we follow [1] and set $\theta := r - \mathbf{1}$; then, $\|\theta\|$ corresponds to the shift magnitude between whatever source and target distribution $r$ is defined for. We define $\theta_{u \to m}$ and $\theta_{m \to t}$ accordingly for the shift between the unlabeled and medial, and the medial and test distributions respectively.
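The role of the medial distribution can be illustrated numerically; the three label marginals below are arbitrary. Splitting the unlabeled-to-test shift through a uniform $P_{med}$ leaves a much smaller $\theta_{m \to t}$ for importance weighting to absorb:

```python
import numpy as np

def shift_magnitude(p_from, p_to):
    """||theta|| with theta := r - 1 and r(y) = p_to(y) / p_from(y)."""
    theta = p_to / p_from - 1.0
    return np.linalg.norm(theta)

p_ulb  = np.array([0.7, 0.2, 0.1])   # unlabeled-pool label marginal
p_test = np.array([0.2, 0.3, 0.5])   # test label marginal
p_med  = np.full(3, 1 / 3)           # a uniform medial distribution

# Subsampling absorbs theta_{u->m}; importance weighting handles theta_{m->t}.
print(shift_magnitude(p_ulb, p_test))  # ||theta_{u->t}||, the full shift
print(shift_magnitude(p_ulb, p_med))   # ||theta_{u->m}||
print(shift_magnitude(p_med, p_test))  # ||theta_{m->t}||, much smaller
```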
Proposed Algorithm Our proposed framework, ALLS, is depicted in Figure 1 and detailed in Algorithm 3. In addition to a primary active learning loop, ALLS diverts a fraction ($\lambda$) of datapoints in $D_{ulb}$ to accumulate an independent holdout dataset $O_t$, which is used for estimating label shift weights $r$ and training a hypothesis $\varphi$ for subsampling. In the primary active learning loop, ALLS first subsamples according to $\varphi$ and then samples according to an active learning policy $\pi$, weighting any empirical risk or uncertainty estimates with the importance weights $r$. We use Regularized Learning under Label Shift (RLLS) [9] for label shift estimation, but any blackbox method (e.g. BBSE [1]) can plug into ALLS.
Holdout Set Label shift estimation techniques such as BBSE and RLLS [1, 9] require independence between the data used for estimating importance weights and the data used for the main learning task. This poses a challenge as we thus require a holdout dataset $O'_t$ drawn IID from $P_{med}$ but which is independent of $S_t$ and $\varphi$. To this end, we use a trick for mimicking IID draws from $P_{med}$ using a buffer $O_t$ of IID draws from $P_{ulb}$. Hence, the motivation for the use of holdout set $O_t$ for label shift estimation and training $\varphi$ is purely theoretical [9]. While necessary for our theoretical analysis, the use of $O_t$ renders our practically-motivated version of ALLS intractable due to a need for knowledge of both $P_{ss}$ and $P_{med}$, only one of which can be known at a time. Thus, in practice, this ALLS variant forgoes the use of a holdout set for label shift estimation, as suggested by [9], and learns $r$ and $\varphi$ on the main labeled set $S$ instead.
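The buffer trick amounts to rejection sampling: a point buffered from $P_{ulb}$ and tagged with $p_i = P_{ss}(y_i)$ survives with probability $p_i / \max_j p_j$, so the survivors are distributed as $P_{med}$. A minimal sketch with illustrative distributions:

```python
import numpy as np

rng = np.random.default_rng(2)

p_ulb = np.array([0.6, 0.3, 0.1])
p_med = np.array([0.2, 0.3, 0.5])
p_ss = p_med / p_ulb                  # filter distribution P_ss = P_med / P_ulb

# Buffer O_t: IID labeled draws from P_ulb, each tagged with p_i = P_ss(y_i)
ys = rng.choice(3, size=120_000, p=p_ulb)
ps = p_ss[ys]

# Keep each point w.p. p_i / max_j p_j; survivors are IID draws from P_med,
# since the kept label distribution is proportional to P_ulb(y) * P_ss(y).
keep = rng.random(len(ys)) < ps / ps.max()
print(np.bincount(ys[keep], minlength=3) / keep.sum())  # approximately P_med
```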
Algorithm 3 Active Learning under Label Shift (ALLS)
Input: $D_{warm}$, $D_{ulb}$, $D_{test}$, $P_{ss}$, $P_{med}$, $\lambda$, blackbox $h_0$, initial holdout set $O_0$, max timestep $T$, label oracle $C$, policy $\pi$
Initialize: $r_0 \leftarrow \mathrm{RLLS}(O_0, P_{test}, h_0)$, $S_0 \leftarrow \{(x_i, y_i, r_0(y_i))\}$ for $(x_i, y_i) \in P_{warm}$
For $(x_t, y_t) \in P_{warm} \setminus O_0$, append $S_0 \leftarrow S_0 \cup \{(x_t, y_t, r_0(y_t))\}$
For $t < T$:
  Label $\lambda n$ datapoints into holdout set: $O_t \leftarrow O_{t-1} \cup \{x_i, y_i, P_{ss}(y_i)\}_{i=1}^{\lambda n}$;
  Populate $O'_t$ with $(x_i, y_i, p_i) \in O_t$ sampled w.p. $\frac{p_i}{\max_{j \in [1, |O_t|]} p_j}$;
  Train $\varphi$ on $O_t$ and $r \leftarrow \mathrm{RLLS}(O'_t, h_0)$;
  $\{x_t, P_t\} \leftarrow \mathrm{Subsample}(\varphi, x_t, P_{ss}, \pi, r)$;  # Also pass weights $r$ to sampling subroutine.
  Label and add to set $S_t \leftarrow S_{t-1} \cup \{x_t, C(x_t), P_t\}$;
Output: $h_T = \operatorname{argmin}\{\mathrm{err}(h, S_T, r_T) : h \in H\}$
4 Theoretical Analysis
We now analyze label complexity and generalization bounds for Algorithm 3 by instantiating ALLS on IWAL-CAL [13], an agnostic active learning algorithm with rigorous guarantees. Let $\mathrm{err}(h, S_i) \to [0, 1]$ denote the error of $h \in H$ as estimated on $S_i$ while $\mathrm{err}(h)$ denotes the expected error of $h$ on $P_{test}$. We next define,
$$h^* := \operatorname*{argmin}_{h \in H} \mathrm{err}(h), \qquad h_k := \operatorname*{argmin}_{h \in H} \mathrm{err}(h, S_{k-1}),$$
$$h'_k := \operatorname{argmin}\left\{\mathrm{err}(h, S_{k-1}) \;\middle|\; h \in H \wedge h(D^{(k)}_{unlab}) \neq h_k(D^{(k)}_{unlab})\right\}$$
$$G_k := \mathrm{err}(h'_k, S_{k-1}) - \mathrm{err}(h_k, S_{k-1}) \qquad (4)$$
IWAL-CAL corresponds to selecting a sampling policy $\pi$ which outputs sampling probability $P_t = \min\{1, s\}$ for the $s \in (0, 1)$ which solves,
$$G_t = \left(\frac{c_1}{\sqrt{s}} - c_1 + 1\right)\sqrt{\frac{C_0 \log t}{t-1}} + \left(\frac{c_2}{s} - c_2 + 1\right)\frac{C_0 \log t}{t-1}$$
where $C_0$ is as defined in Theorem 1 and $c_1 := 5 + 2\sqrt{2}$, $c_2 := 5$. For the remainder of this section, we work in the challenging general label shift setting. As the presence of warm start data is not particularly interesting in our analysis, we set $m = 0$ for reading convenience and defer the case where $m > 0$ to Appendix C for interested readers.
4.1 Theoretical Guarantees
Let $\sigma_{\min}$ denote the smallest singular value of blackbox hypothesis $h_0$, and $P_{\min,n}(h) := \min_{h(x_i) \neq h^*(x_i)} P_i$ the minimum sampling probability in the disagreement region of $h$ and $h'$. We also denote the noise rate of the subsampling problem with
$$\mathrm{err}_W(h^*) := \min_{h \in H} \mathbb{E}_{x,y \sim P_{ulb}}\!\left[\frac{P_{med}(y)}{P_{ulb}(y)} - \frac{P_{med}(h(x))}{P_{ulb}(h(x))}\right].$$
We now present generalization and sample complexity bounds for IWAL-CAL on ALLS.
Theorem 1. With probability at least $1 - \delta$, for all $n \geq 1$,
$$\mathrm{err}(h_n) \leq \mathrm{err}(h^*) + \sqrt{\frac{2 C_0 \log n}{n-1}} + \frac{2 C_0 \log n}{n-1} + O\!\left(\left(\|\theta_{m \to t}\|^2 + 1\right) \mathrm{err}_W(h^*_{online})\right) \qquad (5)$$
where
$$C_0 \in O\!\left(\log\!\left(\frac{|H|}{\delta}\right)\left(d_\infty(P_{test} \| P_{ulb}) + d_2(P_{test} \| P_{ulb}) + 1 + \|\theta_{u \to t}\|_2^2\right) + \frac{\log(k/\delta)}{\sigma_{\min}^2}\, d_\infty(P_{test} \| P_{med})\, \|\theta_{m \to t}\|_2^2 \left(\mathrm{err}_W(h^*_{online}) + 1\right)\right) \qquad (6)$$
Our generalization bound differs from the original IWAL-CAL bound in two key aspects. (1) The use of subsampling introduces a new constant term which scales with the noise rate of the subsampling estimation task: $\mathrm{err}_W(h^*_{online})$. (2) Most terms are now scaled by the magnitude of the label shift; the largest such label shift terms arise from the variance of importance weighting. Aside from the constant noise rate term, however, ALLS preserves the $\log(n)/n + \sqrt{\log(n)/n}$ asymptotic bound of IWAL-CAL. In addition, when only importance weighting is used ($P_{med} = P_{ulb}$), the subsampling learning problem is trivial. Accordingly, the subsampling noise rate is zero: $\mathrm{err}_W(h^*_{online}) = 0$. In this case, ALLS preserves the consistency guarantee of IWAL-CAL even under general label shift.
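To get a feel for how the shift-dependent terms in $C_0$ scale, the divergences can be computed for a toy shift. We hedge that the paper's appendix fixes the exact definitions; below we take $d_\infty$ as the supremum likelihood ratio and $d_2$ as the exponentiated Rényi-2 divergence, a common convention in this literature:

```python
import numpy as np

def d_inf(p, q):
    """Sup likelihood ratio, one common convention for d_inf(P || Q)."""
    return np.max(p / q)

def d_2(p, q):
    """Exponentiated Renyi-2 divergence: E_Q[(P/Q)^2] = sum_y P(y)^2 / Q(y)."""
    return np.sum(p ** 2 / q)

p_test = np.array([0.2, 0.3, 0.5])
p_ulb  = np.array([0.6, 0.3, 0.1])
theta = p_test / p_ulb - 1.0

# Each factor in C_0 grows quickly with the amount of label shift.
print(d_inf(p_test, p_ulb), d_2(p_test, p_ulb), np.sum(theta ** 2))
```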
Theorem 2. With probability at least $1 - \delta$, the number of labels queried is at most:
$$1 + \left(\lambda + \Theta \cdot \left(2\, \mathrm{err}(h^*) + \|\theta_{m \to t}\|^2\, \mathrm{err}_W(h^*)\right)\right) \cdot (n - 1) + O\!\left(\Theta \sqrt{C_0\, n \log n} + \Theta\, C_0 \log^3 n\right), \qquad (7)$$
where $\Theta$ denotes the disagreement coefficient [14].
Besides the changes to $C_0$ noted in our discussion of the generalization bound, we note two differences from the sample complexity of traditional IWAL-CAL. First, we introduce two additional linear terms into the sample complexity: one corresponding to the bias of subsampling (again proportional to the noise rate of the subsampling problem) and one corresponding to the accumulation of the holdout set $O_t$ (proportional to $\lambda$). These accompany a linear term proportional to the noise rate of the original learning problem, which is also present in the original IWAL-CAL bounds and unavoidable in agnostic active learning.
4.2 Bias-Variance Trade-off
We note a key bias-variance trade-off in the use of importance weighting and subsampling. In agnostic learning problems, labels cannot be predicted with certainty and hence a non-trivial subsampling strategy will always incur errors. This introduces a constant bias term into both generalization and sample complexity bounds, a term which scales linearly with the subsampling noise rate $\mathrm{err}_W(h^*)$. In contrast, importance weighting suffers from high variance, as importance weights can easily grow to large values. Thus, importance weighting introduces a multiplicative factor into our bounds which scales quadratically with importance weight magnitudes: $\|\theta_{m \to t} + \mathbf{1}\|_\infty \|\theta_{m \to t}\|_2^2$. The key to addressing this trade-off is minimizing some combination of $\mathrm{err}_W(h^*)$ and $\|\theta_{m \to t}\|_2$. This requires striking a balance between the two as, assuming a reasonable choice of $P_{med}$, decreasing $\|\theta_{m \to t}\|$ increases $\|\theta_{u \to m}\|$ and thus $\mathrm{err}_W(h^*)$.
To inform a choice of medial distribution, we now analyze two common label shift regimes which we term imbalanced source and imbalanced target and depict in Figure 2 (b). Consider a binary classification problem with $n$ datapoints and two possible label distributions: balanced distribution $D_1$ with $n/2$ datapoints in each class, and imbalanced distribution $D_2$ with $n - 1$ datapoints in the majority class. Under imbalanced source, where $P_{src} := D_2$ and $P_{test} := D_1$, $n - 2$ additional samples from the under-represented class are necessary for negating label shift. Under imbalanced target, where $P_{src} := D_1$ and $P_{test} := D_2$, $(n-2)\frac{n}{2} \in O(n^2)$ additional samples from the under-represented class are necessary for negating label shift. While these two scenarios feature label shifts of identical magnitude, subsampling is more efficient under imbalanced source. This suggests a simple heuristic: subsample when under imbalanced source and importance weight when under imbalanced target. We can extend this heuristic to the generic label shift case by noting that a choice of uniform medial distribution precisely decomposes every label shift problem into the combination of an imbalanced source and an imbalanced target problem. Thus, as we verify experimentally, a uniform distribution serves as a reliable baseline choice for medial distributions.
5 Experiments
We now present empirical evaluations of our pool-based ALLS framework on the real-world species recognition dataset NABirds [12] and the benchmark datasets CIFAR10 & CIFAR100 [11]. We demonstrate that ALLS improves active learning performance under a diverse range of label shift scenarios.
Scaling Label Shift Estimation Due to the largely theoretical focus of prior literature on label shift estimation, existing label shift estimation techniques often fail to scale when used out-of-the-box. To scale these techniques for use with deep neural networks on high-dimensional datasets, we introduce two techniques: posterior regularization (PR) and iterative reweighting (ITIW). Posterior regularization avoids applying importance weights to the loss function by instead applying importance weights at inference time: $p'(y) = \frac{p(y)\, r(y)}{\sum_i p(y_i)\, r(y_i)}$. This reduces variance while preserving the use of the label shift information to correct uncertainty estimation. This technique bears relation to the expectation-maximization algorithm described in [15]. Iterative reweighting uses hypotheses trained with importance weights to estimate better importance weights, an iterative process which helps steer its finite sample confusion matrix estimates away from singularity.
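At inference, posterior regularization reduces to reweighting the predictive distribution by $r$ and renormalizing. A minimal sketch; the predictive probabilities and weights below are illustrative, not from our experiments:

```python
import numpy as np

def posterior_regularize(probs, r):
    """Apply importance weights at inference time: p'(y) proportional
    to p(y) * r(y), renormalized to a valid distribution."""
    reweighted = probs * r
    return reweighted / reweighted.sum(axis=-1, keepdims=True)

# A prediction that narrowly favors class 0 under the source-trained model
probs = np.array([0.45, 0.40, 0.15])
r = np.array([0.4, 1.0, 2.5])        # estimated P_trg(y) / P_src(y)

adjusted = posterior_regularize(probs, r)
print(adjusted.argmax())             # the target-aware prediction flips to class 1
```

Because the loss is never reweighted, training avoids the variance of large importance weights while predictions still reflect the estimated target marginal.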
Methods We evaluate our ALLS framework on instantiations of uncertainty sampling algorithms: (1) Monte Carlo dropout (MC-D), where uncertainty is given by disagreement between forward passes due to dropout [16]; (2) Maximum Entropy sampling (MaxEnt), where uncertainty is given by the entropy of predictive distributions; and (3) Maximum Margin (Margin), given by the gap in logits of the most and second-most likely classes. As baselines, we compare against the original active learning algorithm (marked Vanilla) and random sampling. In ablation studies, we also compare against partial applications of ALLS which use only importance weighting or only subsampling. Further results and experiment details can be found in the appendix, including a link to source code.
5.1 Primary Results
We present our primary experimental results in Figure 5. In these experiments, we apply ALLS to
training Resnet18 models with batch-mode pool-based active learning on three datasets: CIFAR10,
Figure 5: Average performance and 95% confidence intervals of 10 runs on CIFAR10 and CIFAR100 and 4 runs on NABirds. Plots (a)-(h) demonstrate ALLS consistently improves accuracy, macro F1, and weighted F1 scores on the CIFAR10, CIFAR100 and NABirds datasets, for both synthetic and natural label shift settings. Plot (c) demonstrates these gains generalize to other uncertainty sampling algorithms. Plot (i) depicts the learning dynamics of ALLS and verifies a suppression of the over-represented class during learning.
CIFAR100, and NABirds. In the NABirds experiment, we apply ALLS to a naturally occurring
class imbalance problem in the NABirds dataset. As noted by [17], the coarsest available set of 22
bird labels in NABirds features strong class imbalance with a dominant class constituting almost a
majority of available training data. We adopt this imbalance and evaluate on a uniformly sampled test
distribution; this is an imbalanced source problem. In the CIFAR10 and CIFAR100 experiments, we
artificially induce
canonical label shift
settings by applying [1]’s
Dirichlet Shift
procedure individually
to each of the source and target data splits. In all experiments, we observe a significant gain in
performance from the application of ALLS, both in accuracy and macro F1 scores. In the synthetic
shift experiments, ALLS reduces sample complexity by up to half an order of magnitude. In Figure
5(c), we demonstrate the gains introduced by ALLS are similarly realized on other uncertainty
sampling algorithms. In Figure 5(i), we can observe the learning dynamics of ALLS by tracking the
accuracy of the dominant class. Observe that the ALLS curve features a drop in accuracy followed by
a recovery in performance. The initial period of decline in accuracy can be attributed to a period of
improvement in label shift estimation. As label shift estimation improves, dominant class accuracy is
suppressed by shrinking importance weights. Accuracy then recovers as the label shift is diluted and
the importance weight on the dominant class increases.
5.2 Ablation Studies
Source vs Target Shift
To investigate the trade-off between subsampling and importance weighting
suggested by theory, we induce synthetic
imbalanced source
and
imbalanced target
scenarios on
CIFAR100 in an ablation study depicted in Figure 6. To compare the strengths of these strategies, we
compare ALLS against the use of subsampling or importance weighting alone. We again use [1]’s
Dirichlet Shift
procedure to induce synthetic shifts. While Figure 6(a) demonstrates that subsampling
accounts for ALLS’s performance gains under
imbalanced source
, Figure 6(b) demonstrates that
importance weighting leads to gains under
imbalanced target
. This corroborates our previous analysis.
Although the strengths of importance weighting and subsampling appear complementary, Figure 6(c) demonstrates that, when properly balanced under ALLS, the joint usage of importance weighting and subsampling outperforms the individual use of either technique.
Figure 6: Average performance and 95% confidence intervals on 10 runs in various general label shift settings. (a) Top-5 accuracy under imbalanced source, where subsampling outperforms importance weighting; (b) Macro F1 under imbalanced target, where importance weighting outperforms subsampling; (c) Macro F1 under generic label shift. Importance weighting and subsampling feature complementary strengths and yield additional gains when unified in ALLS.
Label Shift Estimation Heuristics We analyze the effects of the posterior regularization and iterative reweighting techniques on performance and learning behavior in an ablation study on the CIFAR100 dataset, as depicted in Figure 7. Figure 7 demonstrates that a combination of these techniques provides performance gains in both accuracy and macro F1 score. In addition, the study verifies the high variance associated with importance weighting and the cumulative gains afforded by iterative reweighting.
6 Related Work
Active Learning
Active learning has been investigated extensively from theoretical and practical
perspectives. Disagreement-based active learning and its variants [18
–
24] enjoy rigorous learning
Figure 7: Average performance and 95% confidence intervals on 10 runs of experiments on CIFAR100 in a canonical label shift setting. (a) Accuracy using MC-D; (b) Macro F1 using MC-D. Posterior regularization lowers variance (versus importance weighting) and especially improves early-stage performance. Iterative reweighting similarly introduces consistent performance gains. Combining the two provides additional gains in macro F1 scores.
guarantees and focus on the stream-based active learning setting. On the other hand, active learning has been widely studied in natural language processing [25], computer vision [26], and even robotics [27]. In practice, however, the most prevalent label solicitation method is uncertainty sampling rather than these theoretically derived strategies. In this paper, we provide both a theoretical analysis and a practical algorithmic framework that is easy to implement.
Distribution Shift General domain adaptation theory [28–31] looks at joint distribution shift. Covariate shift [32–34] and label shift [35] are two special cases of distribution shift in which more specific assumptions are made regarding which component of the joint data distribution varies and which is invariant. Importance weighting methods under covariate shift and label shift are asymptotically unbiased. Density ratio estimation [36–38] on the input distribution is challenging due to the high-dimensional nature of features in many applications. In contrast, label shift assumptions make weight estimation more tractable using black-box predictors [1, 39]. In our work, we utilize RLLS [9] for label shift correction and take advantage of its theoretical properties to prove our guarantees in active learning.
Active Learning under Distribution Shift
Active domain adaptation [40–43] has been studied under the general joint distribution shift assumption. Although these problem settings are similar to ours, existing methods focus on the practical side. Active learning from a covariate-shifted warm-start set [7] comes with guaranteed label complexity but requires the importance weights to be known beforehand; we focus on the label shift case rather than covariate shift. Separately, active learning for imbalanced data [10, 44] proposes useful sampling heuristics, such as diverse sampling and class-balanced sampling, without theoretical justification. Class-balanced sampling has also been investigated extensively in self-training for unsupervised domain adaptation [45]. We address general label shift and leverage subsampling to construct a medial distribution, which achieves both an empirical and a theoretical trade-off between importance weighting and subsampling.
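To make the trade-off concrete, one way to split a label shift correction between subsampling and residual reweighting can be sketched as below. This is an illustrative decomposition, not the paper's exact construction, and the interpolation knob `alpha` is hypothetical:

```python
import numpy as np

def medial_split(w, alpha=0.5):
    """Split importance weights w(y) = p_test(y) / p_src(y) into per-class
    subsampling acceptance probabilities q(y) and residual importance
    weights r(y), so that q(y) * r(y) == w(y) for every class.
    alpha = 0 recovers pure importance weighting (q = 1 everywhere);
    alpha = 1 recovers pure subsampling (constant residual weights)."""
    w = np.asarray(w, dtype=float)
    q = (w / w.max()) ** alpha  # acceptance probabilities in (0, 1]
    r = w / q                   # residual weights applied after subsampling
    return q, r
```

Flattening the residual weights lowers the variance of the weighted risk estimate, while discarding data through subsampling shrinks the effective sample size; a medial distribution sits between the two extremes and balances these effects.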
7 Conclusion
In this paper, we propose ALLS, a novel framework for active learning under label shift. Our framework combines importance weighting and subsampling to correct for label shift during active learning. We derive an online active learning algorithm with rigorous guarantees and prove its label complexity and generalization bounds. Our analysis sheds light on the trade-off between importance weighting and subsampling under label shift. We demonstrate the effectiveness of our method on both real-world data with inherent shift and large-scale benchmark data with synthetic shift.
Data distribution bias in training and testing has a large impact on model behavior in machine learning [3]. Our work tackles this problem by incorporating active data collection to correct distribution shift. In many applications that require manual labeling of data, such as natural language processing and computer vision, an extension of the techniques we explore in ALLS may help mitigate bias in the data collection process. We also believe this approach can be extended to new settings, including cost-sensitive, multi-domain, and Neyman-Pearson settings.
Acknowledgement
Anqi Liu is supported by a PIMCO Postdoctoral Fellowship at Caltech. Prof. Anandkumar is supported by the Bren endowed chair, faculty awards from Microsoft, Google, and Adobe, and LwLL grants.
References
[1] Zachary C. Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and Correcting for Label Shift with Black Box Predictors. February 2018.
[2] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14(1):21–41, January 2002.
[3] Kaiyu Yang, Klint Qinami, Li Fei-Fei, Jia Deng, and Olga Russakovsky. Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the ImageNet hierarchy. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 547–558, 2020.
[4] Yee Seng Chan and Hwee Tou Ng. Domain Adaptation with Active Learning for Word Sense Disambiguation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 49–56, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
[5] Piyush Rai, Avishek Saha, Hal Daumé, and Suresh Venkatasubramanian. Domain Adaptation meets Active Learning. In Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing, pages 27–32, Los Angeles, California, June 2010. Association for Computational Linguistics.
[6] Avishek Saha, Piyush Rai, Hal Daumé, Suresh Venkatasubramanian, and Scott L. DuVall. Active Supervised Domain Adaptation. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pages 97–112, Berlin, Heidelberg, 2011. Springer.
[7] Songbai Yan, Kamalika Chaudhuri, and Tara Javidi. Active Learning with Logged Data. arXiv:1802.09069 [cs, stat], June 2018.
[8] Rita Chattopadhyay, Wei Fan, Ian Davidson, Sethuraman Panchanathan, and Jieping Ye. Joint transfer and batch-mode active learning. In 30th International Conference on Machine Learning, ICML 2013, pages 1290–1298. International Machine Learning Society (IMLS), 2013.
[9] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized Learning for Domain Adaptation under Label Shifts. arXiv:1903.09734 [cs, stat], March 2019.
[10] U. Aggarwal, A. Popescu, and C. Hudelot. Active learning for imbalanced datasets. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1417–1426, 2020.
[11] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009.
[12] Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 595–604, 2015.
[13] Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Agnostic Active Learning Without Constraints. arXiv:1006.2588 [cs], June 2010.
[14] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, January 2009.
[15] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14(1):21–41, 2002.
[16] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv:1506.02142 [cs, stat], October 2016.
[17] Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, and Ahmed Elgammal. Link the head to the "beak": Zero shot learning from noisy text description at part precision, 2017.
[18] Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 353–360, Corvalis, Oregon, USA, June 2007. Association for Computing Machinery.
[19] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.
[20] Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 49–56, 2009.
[21] Steve Hanneke. Activized Learning: Transforming Passive to Active with Improved Label Complexity. arXiv:1108.1766 [cs, math, stat], August 2011.
[22] Alina Beygelzimer, Daniel J. Hsu, John Langford, and Tong Zhang. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems, pages 199–207, 2010.
[23] Steve Hanneke. Theory of Disagreement-Based Active Learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, June 2014. Now Publishers, Inc.
[24] Akshay Krishnamurthy, Alekh Agarwal, Tzu-Kuo Huang, Hal Daumé III, and John Langford. Active learning for cost-sensitive classification. Journal of Machine Learning Research, 20(65):1–50, 2019.
[25] Yanyao Shen, Hyokun Yun, Zachary C. Lipton, Yakov Kronrod, and Animashree Anandkumar. Deep active learning for named entity recognition. arXiv preprint arXiv:1707.05928, 2017.
[26] Yi Yang, Zhigang Ma, Feiping Nie, Xiaojun Chang, and Alexander G. Hauptmann. Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision, 113(2):113–127, 2015.
[27] Sanjiban Choudhury and Siddhartha S. Srinivasa. A Bayesian active learning approach to adaptive motion planning. In Robotics Research, pages 33–40. Springer, 2020.
[28] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of Representations for Domain Adaptation. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 137–144. MIT Press, 2007.
[29] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
[30] Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning Bounds for Importance Weighting. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 442–450. Curran Associates, Inc., 2010.
[31] Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
[32] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
[33] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, 3(4):5, 2009.
[34] Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005, 2007.
[35] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. arXiv preprint arXiv:1206.6471, 2012.
[36] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.
[37] Yuta Tsuboi, Hisashi Kashima, Shohei Hido, Steffen Bickel, and Masashi Sugiyama. Direct density ratio estimation for large-scale covariate shift adaptation. Journal of Information Processing, 17:138–155, 2009.
[38] Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Masashi Sugiyama. Relative density-ratio estimation for robust distribution comparison. In Advances in Neural Information Processing Systems, pages 594–602, 2011.
[39] Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, and Zachary C. Lipton. A unified view of label shift estimation. arXiv preprint arXiv:2003.07554, 2020.
[40] Piyush Rai, Avishek Saha, Hal Daumé III, and Suresh Venkatasubramanian. Domain adaptation meets active learning. In Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing, pages 27–32. Association for Computational Linguistics, 2010.
[41] Giona Matasci, Devis Tuia, and Mikhail Kanevski. SVM-based boosting of active learning strategies for efficient domain adaptation. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(5):1335–1343, 2012.
[42] Cheng Deng, Xianglong Liu, Chao Li, and Dacheng Tao. Active multi-kernel domain adaptation for hyperspectral image classification. Pattern Recognition, 77:306–315, 2018.
[43] Jong-Chyi Su, Yi-Hsuan Tsai, Kihyuk Sohn, Buyu Liu, Subhransu Maji, and Manmohan Chandraker. Active adversarial domain adaptation. In The IEEE Winter Conference on Applications of Computer Vision, pages 739–748, 2020.
[44] Christopher H. Lin, Mausam, and Daniel S. Weld. Active learning with unbalanced classes and example-generation queries. In Sixth AAAI Conference on Human Computation and Crowdsourcing, 2018.
[45] Yang Zou, Zhiding Yu, B.V.K. Kumar, and Jinsong Wang. Domain adaptation for semantic segmentation via class-balanced self-training. arXiv preprint arXiv:1810.07911, 2018.
[46] Tong Zhang. Data Dependent Concentration Bounds for Sequential Prediction Algorithms. pages 173–187, June 2005.
[47] Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance Weighted Active Learning. arXiv:0812.4952 [cs], May 2009.
A Theorem 1 and Theorem 2 Proofs
A.1 Deviation Bound
The most involved step in deriving generalization and sample complexity bounds for ALLS is
bounding the deviation of empirical risk estimates. This is done through the following theorem.
Theorem 3. Let $Z_i := (X_i, Y_i, Q_i)$ be our source data set, where $Q_i$ is the indicator of whether $(X_i, Y_i)$ is sampled as labeled data. The following holds for all $n \ge 1$ and all $h \in \mathcal{H}$ with probability $1 - \delta$:

$$\begin{aligned}
\big| \mathrm{err}&(h, Z_{1:n}) - \mathrm{err}(h^*, Z_{1:n}) - \mathrm{err}(h) + \mathrm{err}(h^*) \big| \\
\le O\Bigg(\;
& \frac{d_\infty(P_{\mathrm{test}}, P_{\mathrm{src}}) \log(n|\mathcal{H}|/\delta)}{n}
+ \sqrt{\frac{d_2(P_{\mathrm{test}}, P_{\mathrm{src}}) \log(n|\mathcal{H}|/\delta)}{n}}
+ \sqrt{\frac{\log(n|\mathcal{H}|/\delta)}{n P_{\min,n}(h)}}
+ \frac{\log(n|\mathcal{H}|/\delta)}{n P_{\min,n}(h)} \\
& + \Bigg( 1 + \mathrm{err}_W(h^*_{\mathrm{online}}) + \frac{\log(\lambda n/\delta)}{\lambda n}
+ \sqrt{\frac{\mathrm{err}_W(h^*_{\mathrm{online}}) \log(\lambda n/\delta)}{\lambda n}}
+ \|\theta_{\mathrm{src} \to \mathrm{med}}\| \sqrt{\frac{\log(n|\mathcal{H}|/\delta)}{n P_{\min,n}(h)}} \Bigg) \\
& \qquad \cdot \frac{\|\tilde{\theta}\|_2 + 1}{\sigma_{\min}}
\Bigg( \sqrt{\frac{d_\infty(P_{\mathrm{test}}, P_{\mathrm{med}}) \log(nk/\delta)}{\lambda n}}
- \sqrt{\frac{n\, d_\infty(P_{\mathrm{test}}, P_{\mathrm{med}}) \log(n/\delta)}{\lambda}} \Bigg) \\
& + \big( \|\tilde{\theta}\|_2 + 1 \big) \Bigg( \mathrm{err}_W(h^*_{\mathrm{online}}) + \frac{\log(\lambda n/\delta)}{\lambda n}
+ \sqrt{\frac{\mathrm{err}_W(h^*_{\mathrm{online}}) \log(\lambda n/\delta)}{\lambda n}} \Bigg)
+ \|\theta\|_2 \sqrt{\frac{\log(n|\mathcal{H}|/\delta)}{n P_{\min,n}(h)}}
\Bigg)
\end{aligned}$$
The corresponding bound for the case where only importance weighting is used can be recovered by setting $P_{\mathrm{med}} = P_{\mathrm{src}}$. Since we ignore warm starts in this section, we take $P_{\mathrm{src}} := P_{\mathrm{ulb}}$. This deviation bound plugs into IWAL-CAL to yield generalization and sample complexity bounds. In the remainder of this appendix, we detail our proof of Theorem 3. We proceed by expressing Theorem 3 in a more general form with a bounded function $f : \mathcal{X} \times \mathcal{Y} \to [-1, 1]$, which will eventually represent $\mathrm{err}(h) - \mathrm{err}(h^*)$.
We borrow the notation for the terms $W, Q$ from [13], where $Q_i$ is an indicator random variable denoting whether the $i$th datapoint is labeled and

$$W_i := \frac{Q_i \tilde{Q}_i \tilde{r}_i}{P_i} f(x_i, y_i).$$

Our notation convention for accented letters is to denote the estimated (from data) version with a hat and the medial distribution version with a tilde. For example, $\tilde{Q}_i$ denotes whether the $i$th data sample in the medial data set is labeled or not. We introduce the accented variants

$$\hat{W}_i := \frac{Q_i \hat{\tilde{Q}}_i \tilde{r}_i}{P_i} f(x_i, y_i)
\qquad \text{and} \qquad
\hat{\tilde{W}}_i := \frac{Q_i \hat{\tilde{Q}}_i \hat{\tilde{r}}_i}{P_i} f(x_i, y_i).$$

We also borrow the label shift notation of [9]: $k$ denotes the (finite) size of the output space, and estimated importance weights carry hats $\hat{\cdot}$. We set $\tilde{r} := r_{\mathrm{med} \to \mathrm{tar}}$ and apply these same semantics to accents on $\theta := r - 1$. Finally, we follow [30] and use $d_\alpha(P \| P')$ to denote $2^{D_\alpha(P \| P')}$, where $D_\alpha(P \| P') := \frac{1}{\alpha - 1} \log_2 \sum_i \frac{P_i^\alpha}{(P'_i)^{\alpha - 1}}$ is the Rényi divergence of $P$ and $P'$.
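For concreteness, the exponentiated divergences appearing in our bounds can be computed as follows for discrete label distributions (an illustrative helper, using the identity $2^{D_\alpha} = (\sum_i P_i^\alpha / (P'_i)^{\alpha-1})^{1/(\alpha-1)}$; $d_\infty$ reduces to the maximum density ratio):

```python
import numpy as np

def d_alpha(P, Q, alpha):
    """Exponentiated Renyi divergence d_alpha(P||Q) = 2^{D_alpha(P||Q)}
    between discrete distributions, following the convention of Cortes et al.:
    d_alpha(P||Q) = (sum_i P_i^alpha / Q_i^(alpha-1))^(1/(alpha-1))."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    if np.isinf(alpha):  # limiting case: the maximum density ratio
        return np.max(P / Q)
    return np.sum(P ** alpha / Q ** (alpha - 1)) ** (1.0 / (alpha - 1))
```

For identical distributions every $d_\alpha$ equals 1; larger values indicate a harsher worst-case penalty for importance weighting.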
We now arrive at a general form for the left-hand side of Theorem 3. To prove the theorem, we seek to bound, with high probability,

$$\Delta := \frac{1}{n} \left( \sum_{i=1}^{n} \hat{\tilde{W}}_i \right) - \mathbb{E}[r f(X, Y)] \qquad (8)$$
We eventually bound the following terms individually,

$$\begin{aligned}
\Delta_1 &:= \mathbb{E}[r f(X, Y)] - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \\
\Delta_2 &:= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] - \mathbb{E}_i[\hat{W}_i] \\
\Delta_3 &:= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[\hat{W}_i] - \mathbb{E}_i[\hat{\tilde{W}}_i] \\
\Delta_4 &:= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[\hat{\tilde{W}}_i] - \hat{\tilde{W}}_i
\end{aligned} \qquad (9)$$

where $\Delta_1$ corresponds to the variance associated with the inherent stochasticity of the datapoints, $\Delta_2$ corresponds to label inference error during subsampling, $\Delta_3$ corresponds to label shift estimation error, and $\Delta_4$ corresponds to the stochasticity of the IWAL-CAL sampling policy. Repeated applications of the triangle inequality give a bound on $\Delta$:

$$|\Delta| \le |\Delta_1| + |\Delta_2| + |\Delta_3| + |\Delta_4| \qquad (10)$$
A.2 Bounding Active Learning Stochasticity
We bound $\Delta_4$ using a martingale technique from [46], also adopted by [13]. We take Lemmas 1 and 2 from [46] as given and proceed in a fashion similar to the proof of Theorem 1 in [13], beginning with a generalization of Lemma 6 of [13].
Lemma 1. If $0 < \lambda < \frac{3 P_i}{\hat{\tilde{r}}_i}$, then

$$\log \mathbb{E}_i\big[\exp\big(\lambda (\hat{\tilde{W}}_i - \mathbb{E}_i[\hat{\tilde{W}}_i])\big)\big]
\le \frac{\hat{r}_i \hat{\tilde{r}}_i \lambda^2}{2 P_i \big(1 - \frac{\hat{\tilde{r}}_i \lambda}{3 P_i}\big)} \qquad (11)$$

If $\mathbb{E}_i[\hat{\tilde{W}}_i] = 0$, then

$$\log \mathbb{E}_i\big[\exp\big(\lambda (\hat{\tilde{W}}_i - \mathbb{E}_i[\hat{\tilde{W}}_i])\big)\big] = 0 \qquad (12)$$
Proof. First, we bound the range and variance of $\hat{\tilde{W}}_i$. The range is trivial:

$$|\hat{\tilde{W}}_i| \le \left| \frac{Q_i \hat{\tilde{Q}}_i \hat{\tilde{r}}_i}{P_i} \right| \le \frac{\hat{\tilde{r}}_i}{P_i} \qquad (13)$$

To bound the variance, note that $\hat{r}_i = \hat{\tilde{r}}_i\, \mathbb{E}_i[\hat{\tilde{Q}}_i]$ by definition. In other words, when combined, subsampling and importance weighting should fully correct for any (perceived) underlying label shift. Therefore,

$$\mathbb{E}_i\big[(\hat{\tilde{W}}_i - \mathbb{E}_i[\hat{\tilde{W}}_i])^2\big]
\le \frac{\hat{r}_i \hat{\tilde{r}}_i}{P_i} f(x_i, y_i)^2 - 2 \hat{r}_i^2 f(x_i, y_i)^2 + \hat{r}_i^2 f(x_i, y_i)^2
\le \frac{\hat{r}_i \hat{\tilde{r}}_i}{P_i} \qquad (14)$$
Following [13], we choose the function $g(x) := (\exp(x) - x - 1)/x^2$ for $x \neq 0$, so that $\exp(x) = 1 + x + x^2 g(x)$ holds. Note that $g(x)$ is non-decreasing. Thus,

$$\begin{aligned}
\mathbb{E}_i\big[\exp\big(\lambda (\hat{\tilde{W}}_i - \mathbb{E}_i[\hat{\tilde{W}}_i])\big)\big]
&= \mathbb{E}_i\big[1 + \lambda (\hat{\tilde{W}}_i - \mathbb{E}_i[\hat{\tilde{W}}_i]) + \lambda^2 (\hat{\tilde{W}}_i - \mathbb{E}_i[\hat{\tilde{W}}_i])^2\, g\big(\lambda (\hat{\tilde{W}}_i - \mathbb{E}_i[\hat{\tilde{W}}_i])\big)\big] \\
&= 1 + \lambda^2\, \mathbb{E}_i\big[(\hat{\tilde{W}}_i - \mathbb{E}_i[\hat{\tilde{W}}_i])^2\, g\big(\lambda (\hat{\tilde{W}}_i - \mathbb{E}_i[\hat{\tilde{W}}_i])\big)\big] \\
&\le 1 + \lambda^2\, \mathbb{E}_i\big[(\hat{\tilde{W}}_i - \mathbb{E}_i[\hat{\tilde{W}}_i])^2\, g\big(\lambda \hat{\tilde{r}}_i / P_i\big)\big] \\
&= 1 + \lambda^2\, \mathbb{E}_i\big[(\hat{\tilde{W}}_i - \mathbb{E}_i[\hat{\tilde{W}}_i])^2\big]\, g\big(\lambda \hat{\tilde{r}}_i / P_i\big) \\
&\le 1 + \lambda^2\, \frac{\hat{r}_i \hat{\tilde{r}}_i}{P_i}\, g\big(\hat{\tilde{r}}_i \lambda / P_i\big)
\end{aligned} \qquad (15)$$
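As a sanity check on Lemma 1, the moment-generating-function bound can be verified exactly for a small discrete instance (an illustrative check only, not part of the proof; the parameter values below are arbitrary):

```python
import math

# One step of the labeled-data process: Q ~ Bern(P) is the query indicator,
# Qt ~ Bern(r_hat / r_med) is the subsampling indicator, and the weighted
# estimate is W = Q * Qt * r_med * f / P, so that E[W] = r_hat * f.
P, r_med, r_hat, f = 0.5, 2.0, 1.0, 1.0
lam = 0.5  # must satisfy 0 < lam < 3 * P / r_med

# Exact expectation of exp(lam * (W - E[W])) over the four (Q, Qt) outcomes.
mean_W = r_hat * f
mgf = 0.0
for q, pq in [(0, 1 - P), (1, P)]:
    for qt, pqt in [(0, 1 - r_hat / r_med), (1, r_hat / r_med)]:
        W = q * qt * r_med * f / P
        mgf += pq * pqt * math.exp(lam * (W - mean_W))
log_mgf = math.log(mgf)

# Bernstein-style bound from Lemma 1 (inequality 11).
bound = (r_hat * r_med * lam ** 2) / (2 * P * (1 - r_med * lam / (3 * P)))
assert log_mgf <= bound
```

Here the exact log-MGF is about 0.45 while the bound evaluates to 1.5, so the inequality holds with room to spare at these parameter values.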