arXiv:1903.09734v1 [cs.LG] 22 Mar 2019
Published as a conference paper at ICLR 2019
REGULARIZED LEARNING FOR DOMAIN ADAPTATION UNDER LABEL SHIFTS
Kamyar Azizzadenesheli
University of California, Irvine
kazizzad@uci.edu
Anqi Liu
California Institute of Technology
anqiliu@caltech.edu
Fanny Yang
Institute of Theoretical Studies, ETH Zürich
fan.yang@stat.math.ethz.ch
Animashree Anandkumar
California Institute of Technology
anima@caltech.edu
ABSTRACT

We propose Regularized Learning under Label shifts (RLLS), a principled and practical domain-adaptation algorithm to correct for shifts in the label distribution between a source and a target domain. We first estimate importance weights using labeled source data and unlabeled target data, and then train a classifier on the weighted source samples. We derive a generalization bound for the classifier on the target domain which is independent of the (ambient) data dimension, and instead only depends on the complexity of the function class. To the best of our knowledge, this is the first generalization bound for the label-shift problem where the labels in the target domain are not available. Based on this bound, we propose a regularized estimator for the small-sample regime which accounts for the uncertainty in the estimated weights. Experiments on the CIFAR-10 and MNIST datasets show that RLLS improves classification accuracy, especially in the low-sample and large-shift regimes, compared to previous methods.
1 INTRODUCTION
When machine learning models are employed "in the wild", the distribution of the data of interest (target distribution) can be significantly shifted compared to the distribution of the data on which the model was trained (source distribution). In many cases, the publicly available large-scale datasets with which the models are trained do not represent and reflect the statistics of a particular dataset of interest. This is for example relevant in managed services on cloud providers used by clients in different domains and regions, or medical diagnostic tools trained on data collected in a small number of hospitals and deployed on previously unobserved populations and time frames.
Covariate Shift: $p(x) \neq q(x)$, $p(y|x) = q(y|x)$
Label Shift: $p(y) \neq q(y)$, $p(x|y) = q(x|y)$
There are various ways to approach distribution shifts between a source data distribution $P$ and a target data distribution $Q$. If we denote input variables as $x$ and output variables as $y$, we consider the two following settings: (i) covariate shift, which assumes that the conditional output distribution is invariant, $p(y|x) = q(y|x)$, between source and target distributions, but the marginal input distribution $p(x)$ changes; (ii) label shift, where the conditional input distribution is invariant, $p(x|y) = q(x|y)$, and $p(y)$ changes from source to target. In the following, we assume that both input and output variables are observed in the source distribution, whereas only input variables are available from the target distribution.
While covariate shift has been the focus of the literature on distribution shifts to date, label-shift scenarios appear in a variety of practical machine learning problems and warrant a separate discussion as well. In one setting, suppliers of machine-learning models such as cloud providers have large resources of diverse data sets (source set) to train the models, while during deployment, they have no control over the proportion of label categories.
In another setting, e.g. medical diagnostics, the disease distribution changes over locations and time. Consider the task of diagnosing a disease in a country with poor infrastructure and little data, based on reported symptoms. Can we use data from a different location with data abundance to diagnose the disease in the new target location in an efficient way? How many labeled source and unlabeled target data samples do we need to obtain good performance on the target data?
Apart from being relevant in practice, label shift is a computationally more tractable scenario than covariate shift. The reason is that the outputs $y$ typically have a much lower dimension than the inputs $x$. Labels are usually either categorical variables with a finite number of categories or have simple well-defined structures. Despite being an intuitively natural scenario in many real-world applications, even this simplified model has only been scarcely studied in the literature. Zhang et al. (2013) proposed a kernel mean matching method for label shift which is not computationally feasible for large-scale data. The approach in Lipton et al. (2018) is based on importance weights that are estimated using the confusion matrix (also used in the procedures of Saerens et al. (2002); McLachlan (2004)) and demonstrates promising performance on large-scale data. Using a black-box classifier, which can be biased, uncalibrated and inaccurate, they first estimate the importance weights $q(y)/p(y)$ for the source samples and then train a classifier on the weighted data. In the following we refer to this procedure as black box shift learning (BBSL), which the authors proved to be effective for large enough sample sizes.
However, three relevant questions remain unanswered by their work: How can the importance weights be estimated in the low sample setting? What are the generalization guarantees for the final predictor which uses the weighted samples? How do we deal with the uncertainty of the weight estimation when only few samples are available? This paper aims to fill the gap in terms of both theoretical understanding and practical methods for the label shift setting, and thereby to move a step closer towards a more complete understanding of the general topic of supervised learning for distributionally shifted data. In particular, our goal is to find an efficient method which is applicable to large-scale data and to establish generalization guarantees.
Our contribution in this work is threefold. Firstly, we propose an efficient weight estimator for which we can obtain good statistical guarantees without the problem-dependent minimum sample complexity that is necessary for BBSL; in the BBSL case, the estimation error can become arbitrarily large for small sample sizes. Secondly, we propose a novel regularization method to compensate for the high estimation error of the importance weights in low target sample settings. It explicitly controls the influence of our weight estimates when the target sample size is low (in the following referred to as the low sample regime). Finally, we derive a dimension-independent generalization bound for the final Regularized Learning under Label Shift (RLLS) classifier based on our weight estimator. In particular, our method improves the weight estimation error and excess risk of the classifier on reweighted samples by a factor of $k \log(k)$, where $k$ is the number of classes, i.e. the cardinality of $\mathcal{Y}$.
In order to demonstrate the benefit of the proposed method in practical situations, we empirically study the performance of RLLS and compare weight estimation as well as prediction accuracy for a variety of shifts, sample sizes and regularization parameters on the CIFAR-10 and MNIST datasets. For large target sample sizes and large shifts, when applying the regularized weights fully, we achieve an order of magnitude smaller weight estimation error than baseline methods and up to 20% higher accuracy and F-1 score in the corresponding predictive tasks. For low target sample sizes, applying the regularized weights partially also yields an accuracy improvement of at least 10% over fully weighted and unweighted methods.
2 REGULARIZED LEARNING OF LABEL SHIFTS (RLLS)
Formally, we use the shorthand $p, q : [k] \to [0,1]$ for the marginal probability mass functions of $Y$ on the finite label set $\mathcal{Y}$ with respect to $P$, $Q$, i.e. $p(i) = P(Y = i)$ and $q(i) = Q(Y = i)$ for all $i \in [k]$, representable by vectors in $\mathbb{R}^k_+$ which sum to one. In the label shift setting, we define the importance weight vector $w \in \mathbb{R}^k$ between these two domains as $w(i) = q(i)/p(i)$. We quantify the shift using the exponents of the infinite and second order Renyi divergences as follows:

$$d_\infty(q\|p) := \sup_i \frac{q(i)}{p(i)}, \qquad \text{and} \qquad d(q\|p) := \mathbb{E}_{Y \sim Q}\big[w(Y)^2\big] = \sum_{i=1}^k q(i)\,\frac{q(i)}{p(i)}.$$
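As a quick numerical illustration of these definitions, the following sketch (not from the paper; the two class-marginals are made-up three-class examples) computes the importance weight vector and the two divergence exponents:

```python
import numpy as np

# Hypothetical class-marginals: p is the source label distribution,
# q the target label distribution over k = 3 classes (illustrative values).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# Importance weight vector: w(i) = q(i) / p(i).
w = q / p

# Exponent of the infinite-order Renyi divergence: d_inf = sup_i q(i)/p(i).
d_inf = np.max(q / p)

# Exponent of the second-order Renyi divergence:
# d = E_{Y~Q}[w(Y)^2] = sum_i q(i) * q(i)/p(i).
d2 = np.sum(q * q / p)

print(w, d_inf, d2)  # w = [0.4, 1.0, 2.5], d_inf = 2.5, d2 = 1.63
```

Note that $d_\infty$ is driven entirely by the most over-represented target class, while $d$ averages the squared weights under $Q$.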
Given a hypothesis class $\mathcal{H}$ and a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0,1]$, our aim is to find the hypothesis $h \in \mathcal{H}$ which minimizes

$$\mathcal{L}(h) = \mathbb{E}_{X,Y \sim Q}[\ell(Y, h(X))] = \mathbb{E}_{X,Y \sim P}[w(Y)\,\ell(Y, h(X))].$$
In the usual finite sample setting, however, $\mathcal{L}$ is unknown and we observe samples $\{(x_j, y_j)\}_{j=1}^n$ from $P$ instead. If we were given the vector of importance weights $w$, we could then minimize the empirical loss with importance weighted samples, defined as

$$\mathcal{L}_n(h) = \frac{1}{n}\sum_{j=1}^n w(y_j)\,\ell(y_j, h(x_j)),$$
where $n$ is the number of available observations drawn from $P$ used to learn the classifier $h$. As $w$ is unknown in practice, we have to find the minimizer of the empirical loss with estimated importance weights,

$$\mathcal{L}_n(h; \hat{w}) = \frac{1}{n}\sum_{j=1}^n \hat{w}(y_j)\,\ell(y_j, h(x_j)), \qquad (1)$$
where $\hat{w}$ is an estimate of $w$. Given a set $D_p$ of $n_p$ samples from the source distribution $P$, we first divide it into two sets: we use $(1-\beta)n_p$ samples in the set $D_p^{\text{weight}}$ to compute the estimate $\hat{w}$, and the remaining $n = \beta n_p$ samples in the set $D_p^{\text{class}}$ to find the classifier which minimizes the loss (1), i.e. $\hat{h}_{\hat{w}} = \arg\min_{h \in \mathcal{H}} \mathcal{L}_n(h; \hat{w})$. In the following, we describe how to estimate the weights $\hat{w}$ and provide guarantees for the resulting estimator $\hat{h}_{\hat{w}}$.
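The sample split and the importance-weighted empirical loss of Eq. (1) can be sketched as follows. All quantities here (labels, per-sample losses, the weight estimate) are illustrative stand-ins, not outputs of the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy source sample: n_p points over k = 3 classes, plus a placeholder
# importance-weight estimate w_hat (both are illustrative stand-ins).
n_p, beta = 1000, 0.5
y = rng.integers(0, 3, size=n_p)
w_hat = np.array([0.4, 1.0, 2.5])

# Split D_p: (1 - beta) * n_p points go to D_p^weight for weight estimation,
# the remaining n = beta * n_p points to D_p^class for training the classifier.
n = int(beta * n_p)
idx = rng.permutation(n_p)
class_idx, weight_idx = idx[:n], idx[n:]

# Importance-weighted empirical loss L_n(h; w_hat) for a fixed hypothesis h,
# whose per-sample 0/1 losses are simulated here.
losses = rng.integers(0, 2, size=n_p).astype(float)
L_n = np.mean(w_hat[y[class_idx]] * losses[class_idx])
print(round(L_n, 3))
```

The split matters: reusing the same source points for both weight estimation and classifier training would couple the two errors.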
Plug-in weight estimation. The following simple correlation between the label distributions $p, q$ was noted in Lipton et al. (2018): for a fixed hypothesis $h$, if for all $y \in \mathcal{Y}$ it holds that $q(y) > 0 \implies p(y) > 0$, we have

$$q_h(i) := Q(h(X) = i) = \sum_{j=1}^k Q(h(X) = i \mid Y = j)\,q(j) = \sum_{j=1}^k P(h(X) = i \mid Y = j)\,q(j)$$
$$= \sum_{j=1}^k P(h(X) = i, Y = j)\,\frac{q(j)}{p(j)} = \sum_{j=1}^k P(h(X) = i, Y = j)\,w_j$$

for all $i, j \in \mathcal{Y}$. This can equivalently be written in matrix vector notation as

$$q_h = C_h w, \qquad (2)$$
where $C_h$ is the confusion matrix with $[C_h]_{i,j} = P(h(X) = i, Y = j)$ and $q_h$ is the vector which represents the probability mass function of $h(X)$ under the distribution $Q$. The requirement $q(y) > 0 \implies p(y) > 0$ is a reasonable condition, since without any prior knowledge there is no way to properly reason about a class in the target domain that is not represented in the source domain.
In reality, both $q_h$ and $C_h$ can only be estimated by the corresponding finite sample averages $\hat{q}_h, \hat{C}_h$. Lipton et al. (2018) simply compute the inverse of the estimated confusion matrix $\hat{C}_h$ in order to estimate the importance weights, i.e. $\hat{w} = \hat{C}_h^{-1}\hat{q}_h$. While $C_h^{-1}\hat{q}_h$ is a statistically efficient estimator, the estimate $\hat{w}$ based on $\hat{C}_h^{-1}$ can be arbitrarily bad, since $\hat{C}_h$ can be arbitrarily close to a singular matrix, especially for small sample sizes and a small minimum singular value of the confusion matrix. Intuitively, when there are very few samples, the weight estimation will have high variance, in which case it might be better to avoid importance weighting altogether. Furthermore, even when the sample complexity in Lipton et al. (2018), unknown in practice, is met, the resulting error of this estimator is linear in $k$, which is problematic for large $k$.
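The plug-in (BBSL-style) estimate can be sketched as follows; the black-box predictions below are simulated stand-ins for a classifier $h_0$ of roughly 80% accuracy, not data from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3

# Simulated labelled source predictions and unlabelled target predictions
# from a black-box classifier h0 (all arrays are illustrative stand-ins).
y_src = rng.integers(0, k, size=5000)
pred_src = np.where(rng.random(5000) < 0.8, y_src, rng.integers(0, k, size=5000))
pred_tgt = rng.choice(k, size=5000, p=[0.2, 0.3, 0.5])

# Empirical joint confusion matrix: C_hat[i, j] = P_hat(h0(X) = i, Y = j).
C_hat = np.zeros((k, k))
np.add.at(C_hat, (pred_src, y_src), 1.0 / len(y_src))

# Empirical target prediction distribution: q_hat[i] = Q_hat(h0(X) = i).
q_hat = np.bincount(pred_tgt, minlength=k) / len(pred_tgt)

# Plug-in weights w_hat = C_hat^{-1} q_hat -- this solve becomes unstable
# when C_hat is near singular (few samples, weak classifier).
w_hat = np.linalg.solve(C_hat, q_hat)
print(w_hat)
```

With many samples and a well-conditioned $\hat{C}$ this works well; the instability discussed above appears precisely when those conditions fail.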
We therefore aim to address these shortcomings by proposing the following two-step procedure to compute the importance weights. In the case of no shift we have $w = \mathbf{1}$, so we define the amount of weight shift as $\theta = w - \mathbf{1}$. Given a "decent" black box estimator, which we denote by $h_0$, we make the final classifier less sensitive to the estimation performance of $C$ (i.e. regularize the weight estimate) by
1. calculating the measurement error adjusted $\hat{\theta}$ (described in Section 2.1 for $h_0$), and
2. computing the regularized weight $\hat{w} = \mathbf{1} + \lambda\hat{\theta}$, where $\lambda$ depends on the sample size $(1-\beta)n_p$.
By "decent" we refer to a classifier $h_0$ which yields a full rank confusion matrix $C_{h_0}$. A trivial example of a non-"decent" classifier $h_0$ is one that always outputs a fixed class. As it does not capture any characteristics of the data, there is no hope of gaining any statistical information without prior information.
2.1 ESTIMATOR CORRECTING FOR FINITE SAMPLE ERRORS
Both the confusion matrix $C_{h_0}$ and the label distribution $q_{h_0}$ on the target for the black box hypothesis $h_0$ are unknown, and we are instead only given access to the finite sample estimates $\hat{C}_{h_0}, \hat{q}_{h_0}$. In what follows, all empirical and population confusion matrices, as well as label distributions, are defined with respect to the hypothesis $h = h_0$. For notational simplicity, we thus drop the subscript $h_0$ in what follows. The reparameterized linear model (2) with respect to $\theta$ then reads $b := q - C\mathbf{1} = C\theta$, with corresponding finite sample quantity $\hat{b} = \hat{q} - \hat{C}\mathbf{1}$. When $\hat{C}$ is near singular, the estimation of $\theta$ becomes unstable. Furthermore, large values in the true shift $\theta$ result in large variances. We address this problem by adding a regularizing $\ell_2$ penalty term to the usual loss, which pushes the estimated amount of shift towards $0$, a method that has been proposed in Pires & Szepesvári (2012). In particular, we compute

$$\hat{\theta} = \arg\min_\theta \|\hat{C}\theta - \hat{b}\|_2 + \Delta_C \|\theta\|_2. \qquad (3)$$

Here, $\Delta_C$ is a parameter which will eventually be a high probability upper bound for $\|\hat{C} - C\|_2$. Let $\Delta_b$ likewise denote a high probability upper bound for $\|\hat{b} - b\|_2$.
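Since the objective in Eq. (3) is a low ($k$)-dimensional convex problem, a generic solver suffices. A minimal sketch, with a made-up $3 \times 3$ empirical confusion matrix and regularization level, using `scipy.optimize.minimize`:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative quantities (not from the paper): an empirical confusion
# matrix C_hat, b_hat = q_hat - C_hat @ 1, and a regularization level delta_C.
C_hat = np.array([[0.30, 0.05, 0.02],
                  [0.04, 0.25, 0.03],
                  [0.02, 0.04, 0.25]])
q_hat = np.array([0.25, 0.30, 0.45])
b_hat = q_hat - C_hat @ np.ones(3)
delta_C = 0.05

# Eq. (3): theta_hat = argmin ||C_hat @ theta - b_hat||_2 + delta_C * ||theta||_2.
# Note the *norm* (not squared-norm) penalty, which shrinks large shifts less.
def objective(theta):
    return (np.linalg.norm(C_hat @ theta - b_hat)
            + delta_C * np.linalg.norm(theta))

# Nelder-Mead handles the nonsmooth norm terms without gradients.
theta_hat = minimize(objective, np.zeros(3), method="Nelder-Mead").x
w_hat = 1.0 + theta_hat  # lambda = 1, i.e. fully trusted weight estimate
print(w_hat)
```

In practice any conic solver (the problem is a second-order cone program) would also do; the derivative-free call above is just the shortest self-contained route.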
Lemma 1. For $\hat{\theta}$ as defined in equation (3), we have with probability at least $1-\delta$ that¹

$$\|\hat{\theta} - \theta\|_2 \le \epsilon_\theta(n_p, n_q, \|\theta\|_2, \delta),$$

where

$$\epsilon_\theta(n_p, n_q, \|\theta\|_2, \delta) := \mathcal{O}\!\left(\frac{1}{\sigma_{\min}}\left(\|\theta\|_2\sqrt{\frac{\log(k/\delta)}{(1-\beta)n_p}} + \sqrt{\frac{\log(1/\delta)}{(1-\beta)n_p}} + \sqrt{\frac{\log(1/\delta)}{n_q}}\right)\right).$$
The proof of this lemma can be found in Appendix B.1. A couple of remarks are in order at this point. First of all, notice that the weight estimation procedure (3) does not require a minimum sample complexity, on the order of $\sigma_{\min}^{-2}$, to obtain the guarantees, as is needed for BBSL. This is due to the fact that errors in the covariates are accounted for. In order to directly see the improvement in the upper bound of Lemma 1 compared to Theorem 3 in Lipton et al. (2018), first observe that in order to obtain their upper bound with probability at least $1-\delta$, it is necessary that $3k\,n_p^{-10} + 2k\,n_q^{-10} \le \delta$. As a consequence, the upper bound in Theorem 3 of Lipton et al. (2018) is bigger than

$$\frac{1}{3\sigma_{\min}}\left(\|\theta\|_2\sqrt{\frac{\log(3k/\delta)}{n_p}} + \sqrt{\frac{k\log(2k/\delta)}{n_q}}\right).$$

Thus Lemma 1 improves upon the previous upper bound by a factor of $k$.
Furthermore, as in Lipton et al. (2018), this result holds for any black box estimator $h_0$, which enters the bound via $\sigma_{\min}(C_{h_0})$. We can directly see how a good choice of $h_0$ helps to decrease the upper bound in Lemma 1. In particular, if $h_0$ is an ideal estimator and the source set is balanced, $C$ is the identity matrix scaled by $1/k$, with $\sigma_{\min} = 1/k$. In contrast, when the model $h_0$ is uncertain, the singular value $\sigma_{\min}$ is close to zero.
Moreover, for least squares problems with Gaussian measurement errors in both input and target variables, it is standard to use regularized total least squares approaches, which require a singular value decomposition. Finally, our choice of the alternative estimator in Eq. (3) with norm instead of squared-norm regularization is motivated by the case of large shifts $\theta$, where using the squared norm may shrink the estimate $\hat{\theta}$ too much, away from the true $\theta$.
¹ Throughout the paper, $\mathcal{O}$ hides universal constant factors. Furthermore, we use $\mathcal{O}(\cdot + \cdot)$ as shorthand for $\mathcal{O}(\cdot) + \mathcal{O}(\cdot)$.
Algorithm 1 Regularized Learning of Label Shift (RLLS)
1: Input: source set $D_p$, target set $D_q$, $\theta_{\max}$, estimate of $\sigma_{\min}$, black box estimator $h_0$, model class $\mathcal{H}$
2: Determine the optimal split ratio $\beta$ and regularizer $\lambda$ by minimizing the RHS of Eq. (6), using an estimate of $\sigma_{\min}$
3: Randomly partition the source set $D_p$ into $D_p^{\text{class}}, D_p^{\text{weight}}$ such that $|D_p^{\text{class}}| = \beta n_p =: n$
4: Compute $\hat{\theta}$ using Eq. (3) and $\hat{w} := \mathbf{1} + \lambda\hat{\theta}$
5: Minimize the importance weighted empirical loss to obtain the weighted estimator

$$\hat{h}_{\hat{w}} = \arg\min_{h \in \mathcal{H}} \mathcal{L}_n(h; \hat{w}), \quad \text{where } \mathcal{L}_n(h; \hat{w}) = \frac{1}{n}\sum_{(x,y) \in D_p^{\text{class}}} \hat{w}(y)\,\ell(y, h(x))$$

6: Deploy $\hat{h}_{\hat{w}}$ if the risk is acceptable
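Steps 2 and 4 of Algorithm 1 can be sketched as follows. The threshold rule for $\lambda$ is the Section 2.2 rule as reconstructed here (log factors neglected), and all numeric inputs ($\hat{\theta}$, $\theta_{\max}$, $\sigma_{\min}$, sample sizes) are made-up illustrations:

```python
import numpy as np

def choose_lambda(n_q, n_p, theta_max, sigma_min):
    # Section 2.2 rule (log factors neglected, as reconstructed here):
    # trust the estimated shift fully (lambda = 1) once n_q exceeds the
    # threshold, otherwise ignore it (lambda = 0).
    threshold = (1.0 / theta_max**2) * (sigma_min - 1.0 / np.sqrt(n_p)) ** -2
    return 1.0 if n_q >= threshold else 0.0

def regularized_weights(theta_hat, lam):
    # Eq. (4): w_hat = 1 + lambda * theta_hat, interpolating between
    # unweighted training (lambda = 0) and fully weighted (lambda = 1).
    return 1.0 + lam * theta_hat

theta_hat = np.array([-0.6, 0.0, 1.5])  # illustrative estimated shift
lam = choose_lambda(n_q=5000, n_p=10000, theta_max=1.0, sigma_min=0.2)
print(lam, regularized_weights(theta_hat, lam))
```

Intermediate values $\lambda \in (0,1)$ are also useful near the threshold, as Section 3.3 discusses.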
2.2 REGULARIZED ESTIMATOR AND GENERALIZATION BOUND
When only a few samples from the target set are available or the label shift is mild, the estimated weights might be too uncertain to be applied. We therefore propose a regularized estimator defined as follows:

$$\hat{w} = \mathbf{1} + \lambda\hat{\theta}. \qquad (4)$$

Note that $\hat{w}$ implicitly depends on $\lambda$ and $\beta$. By rewriting $\hat{w} = (1-\lambda)\mathbf{1} + \lambda(\mathbf{1} + \hat{\theta})$, we see that, intuitively, the closer $\lambda$ is to $1$, the more reason there is to believe that $\mathbf{1} + \hat{\theta}$ is in fact the true weight.
Define the set $\mathcal{G}(\ell, \mathcal{H}) = \{g_h(x, y) = w(y)\,\ell(h(x), y) : h \in \mathcal{H}\}$ and its Rademacher complexity measure

$$\mathcal{R}_n(\mathcal{G}) := \mathbb{E}_{(X_i,Y_i) \sim P:\, i \in [n]}\left[\mathbb{E}_{\xi_i:\, i \in [n]}\left[\frac{1}{n}\sup_{h \in \mathcal{H}}\sum_{i=1}^n \xi_i\, g_h(X_i, Y_i)\right]\right]$$

with $\xi_i$ the Rademacher random variables (see e.g. Bartlett & Mendelson (2002)). We can now state a generalization bound for the classifier $\hat{h}_{\hat{w}}$ in a general hypothesis class $\mathcal{H}$, which is trained on source data with the estimated weights defined in equation (4).
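For intuition, the empirical version of this quantity can be estimated by Monte Carlo when the class is finite. A minimal sketch (not from the paper), where the finite class $\mathcal{G}$ is represented simply as a matrix of per-sample values $g_h(x_i, y_i)$, all numbers made up:

```python
import numpy as np

rng = np.random.default_rng(2)

# Represent a *finite* class G as a (num_hypotheses, n) matrix whose entry
# [h, i] holds the per-sample value g_h(x_i, y_i); values are illustrative.
n, num_h = 200, 50
G = rng.random((num_h, n))

def empirical_rademacher(G, num_draws=2000):
    # Monte Carlo estimate of E_xi[ sup_h (1/n) sum_i xi_i g_h(x_i, y_i) ]
    # for fixed data, by averaging over random sign vectors.
    n = G.shape[1]
    total = 0.0
    for _ in range(num_draws):
        xi = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
        total += np.max(G @ xi) / n           # sup over the rows (hypotheses)
    return total / num_draws

print(round(empirical_rademacher(G), 3))
```

For infinite classes one would instead use structural bounds (e.g. via the VC dimension for 0/1 loss, as noted below Theorem 1).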
Theorem 1 (Generalization bound for $\hat{h}_{\hat{w}}$). Given $n_p$ samples from the source data set and $n_q$ samples from the target set, a hypothesis class $\mathcal{H}$ and a loss function $\ell$, the following generalization bound holds with probability at least $1 - 2\delta$, where $h^* = \arg\min_{h \in \mathcal{H}} \mathcal{L}(h)$:

$$\mathcal{L}(\hat{h}_{\hat{w}}) - \mathcal{L}(h^*) \le \epsilon_{\mathcal{G}}(n_p, \delta, \beta) + (1-\lambda)\|\theta\|_2 + \lambda\,\epsilon_\theta(n_p, n_q, \|\theta\|_2, \delta, \beta), \qquad (5)$$

where

$$\epsilon_{\mathcal{G}}(n_p, \delta) := 2\mathcal{R}_n(\mathcal{G}) + \min\left\{ d_\infty(q\|p)\sqrt{\frac{\log(2/\delta)}{\beta n_p}},\ \frac{2\,d_\infty(q\|p)\log(2/\delta)}{n} + \sqrt{\frac{2\,d(q\|p)\log(2/\delta)}{n}} \right\}.$$
The proof can be found in Appendix B.4. Additionally, we derive the analysis also for finite hypothesis classes in Appendix B.6, to provide more insight into the proof for general hypothesis classes. The size of $\mathcal{R}_n(\mathcal{G})$ is determined by the structure of the function class $\mathcal{H}$ and the loss $\ell$. For example, for the 0/1 loss, the VC dimension of $\mathcal{H}$ can be deployed to upper bound the Rademacher complexity.
The bound (5) in Theorem 1 holds for all choices of $\lambda$. In order to exploit the possibility of choosing $\lambda$ and $\beta$ to improve accuracy depending on the sample sizes, we first let the user define a set of shifts against which we want to be robust, i.e. all shifts with $\|\theta\|_2 \le \theta_{\max}$. For these shifts, we obtain the following upper bound:

$$\mathcal{L}(\hat{h}_{\hat{w}}) - \mathcal{L}(h^*) \le \epsilon_{\mathcal{G}}(n_p, \delta) + (1-\lambda)\theta_{\max} + \lambda\,\epsilon_\theta(n_p, n_q, \theta_{\max}, \delta). \qquad (6)$$
The bound in equation (6) suggests using Algorithm 1 as our ultimate label shift correction procedure, where for step 2 of the algorithm we choose $\lambda = 1$ whenever $n_q \ge \frac{1}{\theta_{\max}^2}\big(\sigma_{\min} - \frac{1}{\sqrt{n_p}}\big)^{-2}$ (hereby neglecting the log factors and thus dependencies on $k$), and $\lambda = 0$ otherwise. When using this rule, we obtain
$$\mathcal{L}(\hat{h}_{\hat{w}}) - \mathcal{L}(h^*) \le \epsilon_{\mathcal{G}}(n_p, \delta) + \min\{\theta_{\max},\ \epsilon_\theta(n_p, n_q, \theta_{\max}, \delta)\},$$

which is smaller than the unregularized bound for small $n_q, n_p$. Notice that in practice we do not know $\sigma_{\min}$ in advance, so that in Algorithm 1 we need to use an estimate of $\sigma_{\min}$, which could e.g. be the minimum singular value of the empirical confusion matrix $\hat{C}$, with an additional computational complexity of at most $\mathcal{O}(k^3)$.
Figure 1: Given $\sigma_{\min}$ and $\theta_{\max}$, $\lambda$ switches from $0$ to $1$ at a particular $n_q$; $n_p$ and $k$ are fixed.
Figure 1 shows how the oracle thresholds vary with $n_q$ and $\sigma_{\min}$ when $n_p$ is kept fixed. When the parameters are above the curves for fixed $n_p$, $\lambda$ should be chosen as $1$; otherwise the samples should be unweighted, i.e. $\lambda = 0$. This figure illustrates that when the confusion matrix has small singular values, the estimated weights should only be trusted for rather high $n_q$ and high believed shifts $\theta_{\max}$. Although the overall statistical rate of the excess risk of the classifier does not change as a function of the sample sizes, $\theta_{\max}$ could be significantly smaller than $\epsilon_\theta$ when $\sigma_{\min}$ is very small, and thus the accuracy in this regime could improve. Indeed, we observe this to be the case empirically in Section 3.3.
In the case of slight deviation from the label shift setting, we expect Algorithm 1 to perform reasonably well. With

$$d_e(q\|p) := \mathbb{E}_{(X,Y) \sim Q}\left[1 - \frac{p(X|Y)}{q(X|Y)}\right]$$

as the deviation from the label shift constraint (i.e., zero under the label shift assumption), we have:
Theorem 2 (Drift in the label shift assumption). In the presence of a $d_e(q\|p)$ deviation from the label shift assumption, with true importance weights $\omega(x, y) := \frac{q(x,y)}{p(x,y)}$, RLLS generalizes as

$$\mathcal{L}(\hat{h}_{\hat{w}}; \omega) - \mathcal{L}(h^*; \omega) \le \epsilon_{\mathcal{G}}(n_p, \delta) + (1-\lambda)\|\theta\|_2 + \lambda\,\epsilon_\theta(n_p, n_q, \|\theta\|_2, \delta) + 2(1+\lambda)\,d_e(q\|p)$$

with high probability. The proof is in Appendix B.7.
3 EXPERIMENTS
In this section we illustrate the theoretical analysis by running RLLS on a variety of artificially generated shifts on the MNIST (LeCun & Cortes, 2010) and CIFAR10 (Krizhevsky & Hinton, 2009) datasets. We first randomly separate the entire dataset into two sets (source and target pool) of the same size. Then we sample, unless specified otherwise, the same number of data points from each pool to form the source and target set respectively. We chose equal sample sizes to allow for fair comparisons across shifts.
There are various kinds of shifts which we consider in our experiments. In general, we assume one of the source or target datasets to have a uniform distribution over the labels. Within the non-uniform set, we consider three types of sampling strategies in the main text. The Tweak-One shift refers to the case where we set one class to have probability $p > 0.1$, while the distribution over the rest of the classes is uniform. The Minority-Class shift is a more general version of the Tweak-One shift, where a fixed number $m$ of classes have probability $p < 0.1$, while the distribution over the rest of the classes is uniform. For the Dirichlet shift, we draw a probability vector $p$ from the Dirichlet distribution with concentration parameter $\alpha$ for all classes, before including sample points according to the corresponding multinomial label distribution. Results for the tweak-one shift strategy as in Lipton et al. (2018) can be found in Section A.0.1.
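The three sampling strategies can be sketched as follows; $k$, $\rho$, $m$, $p_{\min}$ and $\alpha$ below are illustrative choices, not the exact experimental settings:

```python
import numpy as np

rng = np.random.default_rng(3)
k = 10  # number of classes, as in MNIST / CIFAR10

def tweak_one(rho):
    # Tweak-One shift: one class gets probability rho > 0.1,
    # the remaining mass is spread uniformly over the other classes.
    p = np.full(k, (1.0 - rho) / (k - 1))
    p[0] = rho
    return p

def minority_class(m, p_min):
    # Minority-Class shift: m classes each get probability p_min < 0.1,
    # the rest share the remaining mass uniformly.
    p = np.full(k, p_min)
    p[m:] = (1.0 - m * p_min) / (k - m)
    return p

def dirichlet_shift(alpha):
    # Dirichlet shift: draw the label distribution from Dirichlet(alpha);
    # a smaller alpha produces a more concentrated (larger) shift.
    return rng.dirichlet(np.full(k, alpha))

for dist in (tweak_one(0.5), minority_class(5, 0.01), dirichlet_shift(0.1)):
    assert np.isclose(dist.sum(), 1.0)
```

Labels for the shifted set are then sampled from the resulting probability vector (multinomially), matching the description above.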
After artificially shifting the label distribution in one of the source and target sets, we then follow Algorithm 1, where we choose the black box predictor $h_0$ to be a two-layer fully connected neural network trained on the (shifted) source dataset. Note that any black box predictor could be employed here, though the higher its accuracy, the more precise the weight estimation is likely to be. Therefore, we use differently shifted source data to obtain (corrupted) black box predictors across experiments. If not noted otherwise, $h_0$ is trained using uniform data.
In order to compute $\hat{\omega} = \mathbf{1} + \hat{\theta}$ in Eq. (3), we call a built-in solver to directly solve the low dimensional problem $\min_\theta \|\hat{C}\theta - \hat{b}\|_2 + \Delta_C\|\theta\|_2$, where we empirically observe that $0.01$ times the true $\Delta_C$ yields a better estimator on various levels of label shift pre-computed beforehand. It is worth noting that this choice makes the theoretical bound in Lemma 1 $\mathcal{O}(1/0.01)$ times bigger. We thus treat it as a hyperparameter that can be chosen using standard cross validation methods. Finally, we
train a classifier on the source samples weighted by $\hat{\omega}$, where we use a two-layer fully connected neural network for MNIST and a ResNet-18 (He et al., 2016) for CIFAR10.
We sample 20 datasets with the given label distributions for each shift parameter to evaluate the empirical mean squared estimation error (MSE) and variance of the estimated weights, $\mathbb{E}\|\hat{w} - w\|_2^2$, and the predictive accuracy on the target set. We use these measures to compare our procedure with the black box shift learning method (BBSL) of Lipton et al. (2018). Notice that although KMM methods (Zhang et al., 2013) would be another standard baseline to compare with, they are not scalable to large sample size regimes for $n_p, n_q$ above $n = 8000$, as mentioned by Lipton et al. (2018).
3.1 WEIGHT ESTIMATION AND PREDICTIVE PERFORMANCE FOR SOURCE SHIFT
In this set of experiments on the CIFAR10 dataset, we illustrate our weight estimation and prediction performance for Tweak-One source shifts and compare it with BBSL. For this set of experiments, we set the number of data points in both the source and target set to $10000$ and sample from the two pools without replacement.
Figure 2 illustrates the weight estimation alongside the final classification performance for Minority-Class source shift of CIFAR10. We created shifts with $\rho > 0.5$. We use a fixed black-box classifier that is trained on biased source data, with tweak-one $\rho = 0.5$. Observe that the MSE in the weight estimation is relatively large, and RLLS outperforms BBSL as the number of minority classes increases. As the shift increases, the performance of all methods deteriorates. Furthermore, Figure 2(b) illustrates how the advantage of RLLS over the unweighted classifier increases as the shift increases. Across all shifts, the RLLS based classifier yields higher accuracy than the one based on BBSL. Results for MNIST can be found in Section A.1.
Figure 2: (a) Mean squared error in estimated weights and (b) accuracy on CIFAR10 for tweak-one shifted source and uniform target, with $h_0$ trained using tweak-one shifted source data.
3.2 WEIGHT ESTIMATION AND PREDICTIVE PERFORMANCE FOR TARGET SHIFT
In this section, we compare the predictive performance of a classifier trained on unweighted source data with that of the classifiers trained on the weighted loss obtained by the RLLS and BBSL procedures on CIFAR10. The target set is shifted using the Dirichlet shift with parameters $\alpha = [0.01, 0.1, 1, 10]$. The number of data points in both the source and target set is $10000$.
In the case of target shifts, larger shifts actually make the predictive task easier, such that even a constant majority class vote would give high accuracy; however, it would have zero accuracy on all but one class. Therefore, in order to allow for a more comprehensive comparison between the methods, we also compute the macro-averaged F-1 score by averaging the per-class quantity $2(\text{precision} \cdot \text{recall})/(\text{precision} + \text{recall})$ over all classes. For a class $i$, precision is the percentage of correct predictions among all samples predicted to have label $i$, while recall is the proportion of correctly predicted labels over the number of samples with true label $i$. This measure gives higher weight to the accuracies of minority classes, which have little effect on the total accuracy.
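The macro-averaged F-1 score described above can be sketched directly from its definition; the small label arrays below are made-up examples:

```python
import numpy as np

def macro_f1(y_true, y_pred, k):
    # Macro-averaged F-1: mean over classes of 2*p*r/(p+r), so every class
    # counts equally, regardless of how many samples it has.
    f1s = []
    for i in range(k):
        tp = np.sum((y_pred == i) & (y_true == i))
        predicted = np.sum(y_pred == i)   # samples predicted as class i
        actual = np.sum(y_true == i)      # samples whose true label is i
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return float(np.mean(f1s))

y_true = np.array([0, 0, 1, 1, 2, 2])   # illustrative labels
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(round(macro_f1(y_true, y_pred, 3), 3))  # → 0.656
```

This matches the motivation above: a constant majority-class vote would score high accuracy under large target shift but near-zero macro F-1.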
Figure 3 depicts the MSE of the weight estimation (a), and the corresponding performance comparison in terms of accuracy (b) and F-1 score (c). Recall that the accuracy performance for low shifts is not comparable with standard CIFAR10 benchmark results, because of the overall lower sample size chosen for comparability between shifts. We can see that in the large target shift case, $\alpha = 0.01$, the F-1
Figure 3: (a) Mean squared error in estimated weights, (b) accuracy and (c) F-1 score on CIFAR10 for uniform source and Dirichlet shifted target. Smaller $\alpha$ corresponds to a bigger shift.
score for BBSL and the unweighted classifier is rather low compared to RLLS, while the accuracy is high. As mentioned before, the reason for this observation, and for why in Figure 3(b) the accuracy is higher when the shift is larger, is that the predictive task actually becomes easier with higher shift.
3.3 REGULARIZED WEIGHTS IN THE LOW SAMPLE REGIME FOR SOURCE SHIFT
In the following, we present the average accuracy of RLLS in Figure 4 as a function of the number of target samples $n_q$, for different values of $\lambda$ and small $n_q$. Here we fix the sample size in the source set to $n_p = 1000$ and investigate a Minority-Class source shift with fixed $p = 0.01$ and five minority classes.
A motivation for using an intermediate $\lambda$ is discussed in Section 2.2, as $\lambda$ in equation (4) may be chosen according to $\theta_{\max}$ and $\sigma_{\min}$. In practice, since $\theta_{\max}$ is just an upper bound on the true amount of shift $\|\theta\|_2$, in some cases $\lambda$ should in fact ideally be $0$, namely when

$$\frac{1}{\theta_{\max}^2}\left(\sigma_{\min} - \frac{1}{\sqrt{n_p}}\right)^{-2} \le n_q \le \frac{1}{\|\theta\|_2^2}\left(\sigma_{\min} - \frac{1}{\sqrt{n_p}}\right)^{-2}.$$
Thus, for target sample sizes $n_q$ a little above the threshold (depending on the certainty of the belief about how close the norm of the shift is to $\theta_{\max}$), it can be sensible to use an intermediate value $\lambda \in (0, 1)$.
Figure 4: Performance on MNIST for Minority-Class shifted source and uniform target with various target sample sizes and $\lambda$, using (a) a better predictor $h_0$ trained on tweak-one shifted source with $\rho = 0.2$, (b) a neutral predictor $h_0$ with $\rho = 0.5$, and (c) a corrupted predictor $h_0$ with $\rho = 0.8$.
Figure 4 suggests that unweighted samples (red) yield the best classifier for very few samples $n_q$, while for $10 \le n_q \le 500$ an intermediate $\lambda \in (0, 1)$ (purple) has the highest accuracy, and for $n_q > 1000$ the weight estimation is certain enough for the fully weighted classifier (yellow) to have the best performance (see also the corresponding data points in Figure 2). The unweighted BBSL classifier is also shown for completeness. We can conclude that regularizing the influence of the estimated weights allows us to adjust to the uncertainty in the importance weights and generalize well for a wide range of target sample sizes.
Furthermore, the different plots in Figure 4 correspond to black-box predictors $h_0$ for weight estimation which are trained on more or less corrupted data, i.e. which have a better or worse conditioned
confusion matrix. The fully weighted method with $\lambda = 1$ achieves the best performance faster with a better trained black-box classifier (a), while it takes longer to improve with a corrupted one (c). This also reflects the relation between the minimum singular value $\sigma_{\min}$ of the confusion matrix and the target sample size $n_q$ in Theorem 1. In other words, we need more samples from the target data to compensate for a bad predictor in weight estimation, so the generalization error decreases faster with an increasing number of samples for good predictors.
In summary, our RLLS method outperforms BBSL in all settings for the common image datasets MNIST and CIFAR10, to varying degrees. In general, significant improvements compared to BBSL can be observed for large shifts and in the low sample regime. A note of caution is in order: a comparison between the two methods alone might not always be meaningful. In particular, there are cases when the estimator trained on unweighted samples outperforms both RLLS and BBSL. Our extensive experiments for many different shifts, black box classifiers and sample sizes do not allow for a final conclusive statement about how weighting samples using our estimator affects predictive results for real-world data in general, as such data usually does not fulfill the label-shift assumptions.
4 RELATED WORK
The covariate and label shift assumptions follow naturally when viewing the data generating process as a causal or anti-causal model (Schölkopf et al., 2012): with label shift, the label $Y$ causes the input $X$ (that is, $X$ is not a causal parent of $Y$, hence "anti-causal"), and the causal mechanism that generates $X$ from $Y$ is independent of the distribution of $Y$. A long line of work has addressed the reverse, causal setting, where $X$ causes $Y$ and the conditional distribution of $Y$ given $X$ is assumed to be constant. This assumption is sensible when there is reason to believe that there is a true optimal mapping from $X$ to $Y$ which does not change if the distribution of $X$ changes. Mathematically, this scenario corresponds to the covariate shift assumption.
Among the various methods to correct for covariate shift, the majority uses the concept of importance weights q(x)/p(x) (Zadrozny, 2004; Cortes et al., 2010; Cortes & Mohri, 2014; Shimodaira, 2000), which are unknown but can be estimated, for example, via kernel embeddings (Huang et al., 2007; Gretton et al., 2009; 2012; Zhang et al., 2013; Zaremba et al., 2013) or by learning a binary discriminative classifier between source and target (Lopez-Paz & Oquab, 2016; Liu et al., 2017).
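As a concrete sketch of the discriminative-classifier route (a hedged illustration of the general idea, not the exact procedure of the cited works), one can train a probabilistic source-vs-target classifier and convert its output s(x) = P(target | x) into a density ratio via Bayes' rule, q(x)/p(x) ∝ s(x)/(1 − s(x)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio_via_classifier(x_src, x_tgt):
    """Estimate q(x)/p(x) on the source points by training a binary
    classifier to separate source (label 0) from target (label 1)."""
    X = np.vstack([x_src, x_tgt])
    d = np.concatenate([np.zeros(len(x_src)), np.ones(len(x_tgt))])
    clf = LogisticRegression().fit(X, d)
    s = clf.predict_proba(x_src)[:, 1]   # P(domain = target | x)
    prior = len(x_src) / len(x_tgt)      # corrects for unequal sample sizes
    return prior * s / (1.0 - s)         # q(x)/p(x) up to estimation error
```

When source and target are drawn from the same distribution, the estimated ratios concentrate around 1, as expected.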
A minimax approach that aims to be robust to the worst-case shared conditional label distribution between source and target has also been investigated (Liu & Ziebart, 2014; Chen et al., 2016). Sanderson & Scott (2014) and Ramaswamy et al. (2016) formulate the label shift problem as a mixture of the class-conditional covariate distributions with unknown mixture weights. Under the pairwise mutual irreducibility assumption (Scott et al., 2013) on the class-conditional covariate distributions, they deploy the Neyman-Pearson criterion (Blanchard et al., 2010) to estimate the class distribution q(y), which has also been investigated in the maximum mean discrepancy framework (Iyer et al., 2014).
Common issues shared by these methods are that they either result in a massive computational burden for large-sample problems or cannot be deployed for neural networks. Furthermore, importance weighting methods such as that of Shimodaira (2000) estimate the density (ratio) beforehand, which is a difficult task on its own when the data is high-dimensional. The resulting generalization bounds based on importance weighting require the second-order moments of the density ratio, (q(x)/p(x))^2, to be bounded, which means the bounds are extremely loose in most cases (Cortes et al., 2010).
Despite the wide applicability of label shift, approaches with global guarantees in high-dimensional data regimes remain under-explored. Correcting for label shift mainly requires estimating the importance weights q(y)/p(y) over the labels, which typically live in a very low-dimensional space. Bayesian and probabilistic approaches have been studied when a prior over the marginal label distribution is assumed (Storkey, 2009; Chan & Ng, 2005). These methods often need to explicitly compute the posterior distribution of y and suffer from the curse of dimensionality. Recent advances such as Lipton et al. (2018) have proposed solutions applicable to large-scale data. This approach is related to Buck et al. (1966); Forman (2008); Saerens et al. (2002) in the low-dimensional setting but lacks guarantees for the excess risk.
Existing generalization bounds have historically been developed mainly for the case when P = Q (see e.g. Vapnik (1999); Bartlett & Mendelson (2002); Kakade et al. (2009); Wainwright (2019)). Ben-David et al. (2010) provide theoretical analysis and generalization guarantees for distribution shifts when the H-divergence between joint distributions is considered, whereas Crammer et al. (2008) prove generalization bounds for learning from multiple sources. For the covariate shift setting, Cortes et al. (2010) provide a generalization bound when q(x)/p(x) is known, which, however, does not apply in practice. To the best of our knowledge, our work is the first to give generalization bounds for the label shift scenario.
5 DISCUSSION
In this work, we establish the first generalization guarantee for the label shift setting and propose an importance weighting procedure for which no prior knowledge of q(y)/p(y) is required. Although RLLS is inspired by BBSL, it leads to a more robust importance weight estimator as well as generalization guarantees, in particular for the small sample regime, which BBSL does not allow for. RLLS is also equipped with a sample-size-dependent regularization technique and further improves the classifier in both regimes.
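One way to picture this regularization (the interpolation form below is our paraphrase of the idea, not the paper's exact estimator or schedule): λ = 0 keeps uniform weights, λ = 1 uses the fully estimated weights, and λ is grown toward 1 as the target sample size, and hence the quality of the weight estimate, increases.

```python
import numpy as np

def regularized_weights(w_hat, lam):
    """Shrink estimated importance weights w_hat toward the uniform
    weight 1; lam in [0, 1] trades off the bias of under-correcting
    (small lam) against the variance of a noisy estimate (large lam
    with few target samples)."""
    return 1.0 + lam * (np.asarray(w_hat, dtype=float) - 1.0)
```

For example, with estimated weights (2.0, 0.5) and λ = 0.5, the classifier is trained with weights (1.5, 0.75), halfway between uniform and fully corrected.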
We consider this work a necessary step in the direction of solving shifts of this type, although the label shift assumption itself might be too simplistic for the real world. In future work, we plan to also study the setting when it is slightly violated. For instance, in practice x cannot be solely explained by the desired label y, but may also depend on attributes z which might not be observable. In the disease prediction task, for example, the symptoms might not only depend on the disease but also on the city and living conditions of its population. In such a case, the label shift assumption only holds in a slightly modified sense, i.e. P(X | Y = y, Z = z) = Q(X | Y = y, Z = z). If the attributes Z are observed, then our framework can readily be used to perform importance weighting.
Furthermore, it is not clear whether the final predictor is in fact "better" or more robust to shifts just because it achieves a better target accuracy than a vanilla unweighted estimator. In fact, there is reason to believe that under certain shift scenarios, the predictor might learn to use spurious correlations to boost accuracy. Finding a procedure which can both learn a robust model and achieve high accuracy on new target sets remains an ongoing challenge. Moreover, the current choice of regularization depends on the number of samples; a data-driven regularization would be more desirable.
An important direction is active learning for the same disease-symptoms scenario, where we additionally have an expert who can diagnose a limited number of patients in the target location. The question then becomes: which patients would be most "useful" to diagnose in order to obtain high accuracy on the entire target set? Furthermore, in the case of high risk, we might be able to choose some of the patients for further medical diagnosis or treatment, up to some varying cost. We plan to extend the current framework to the active learning setting, where we actively query the labels of certain x's (Beygelzimer et al., 2009), as well as to the cost-sensitive setting, where we also consider the cost of querying labels (Krishnamurthy et al., 2017).
Consider a realizable and over-parameterized setting where there exists a deterministic mapping from x to y, and suppose a perfect interpolation of the source data with a minimum proper norm is desired. In this case, weighting the samples in the empirical loss might not alter the trained classifier (Belkin et al., 2018). Therefore, our results might not directly help the design of better classifiers in this particular regime. However, for general over-parameterized settings, how importance weighting can improve generalization remains an open problem. We leave this study for future work.
6 ACKNOWLEDGEMENT
K. Azizzadenesheli is supported in part by NSF Career Award CCF-1254106 and Air Force FA9550-15-1-0221. This research was conducted while the first author was a visiting researcher at Caltech. Anqi Liu is supported in part by a DOLCIT Postdoctoral Fellowship at Caltech and Caltech's Center for Autonomous Systems and Technologies. Fanny Yang is supported by the Institute for Theoretical Studies ETH Zurich and the Dr. Max Rössler and the Walter Haefner Foundation. A. Anandkumar is supported in part by a Microsoft Faculty Fellowship, a Google faculty award, an Adobe grant, NSF Career Award CCF-1254106, and AFOSR YIP FA9550-15-1-0221.
REFERENCES
Animashree Anandkumar, Daniel Hsu, and Sham M Kakade. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory, pp. 33–1, 2012.
Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. Reinforcement learning of POMDPs using spectral methods. arXiv preprint arXiv:1602.07764, 2016.
Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 49–56. ACM, 2009.
Gilles Blanchard, Gyemin Lee, and Clayton Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11(Nov):2973–3009, 2010.
AA Buck, JJ Gart, et al. Comparison of a screening test and a reference test in epidemiologic studies. II. A probabilistic model for the comparison of diagnostic tests. American Journal of Epidemiology, 83(3):593–602, 1966.
Yee Seng Chan and Hwee Tou Ng. Word sense disambiguation with distribution estimation. In IJCAI, volume 5, pp. 1010–5, 2005.
Xiangli Chen, Mathew Monfort, Anqi Liu, and Brian D Ziebart. Robust covariate shift regression. In Artificial Intelligence and Statistics, pp. 1270–1279, 2016.
Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pp. 442–450, 2010.
Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple sources. Journal of Machine Learning Research, 9(Aug):1757–1774, 2008.
George Forman. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17(2):164–206, 2008.
David A Freedman. On tail probabilities for martingales. The Annals of Probability, pp. 100–118, 1975.
Arthur Gretton, Alexander J Smola, Jiayuan Huang, Marcel Schmittfull, Karsten M Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. 2009.
Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, pp. 601–608, 2007.
Arun Iyer, Saketha Nath, and Sunita Sarawagi. Maximum mean discrepancy for class ratio estimation: Convergence bounds and kernel selection. In International Conference on Machine Learning, pp. 530–538, 2014.
Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pp. 793–800, 2009.
Akshay Krishnamurthy, Alekh Agarwal, Tzu-Kuo Huang, Hal Daume III, and John Langford. Active learning for cost-sensitive classification. arXiv preprint arXiv:1703.01014, 2017.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
Zachary C Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and correcting for label shift with black box predictors. arXiv preprint arXiv:1802.03916, 2018.
Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in Neural Information Processing Systems, pp. 37–45, 2014.
Song Liu, Akiko Takeda, Taiji Suzuki, and Kenji Fukumizu. Trimmed density ratio estimation. In Advances in Neural Information Processing Systems, pp. 4518–4528, 2017.
David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.
Geoffrey McLachlan. Discriminant Analysis and Statistical Pattern Recognition, volume 544. John Wiley & Sons, 2004.
Bernardo Avila Pires and Csaba Szepesvári. Statistical linear estimation with penalized estimators: an application to reinforcement learning. arXiv preprint arXiv:1206.6444, 2012.
Harish Ramaswamy, Clayton Scott, and Ambuj Tewari. Mixture proportion estimation via kernel embeddings of distributions. In International Conference on Machine Learning, pp. 2052–2060, 2016.
Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14(1):21–41, 2002.
Tyler Sanderson and Clayton Scott. Class proportion estimation with application to multiclass anomaly rejection. In Artificial Intelligence and Statistics, pp. 850–858, 2014.
Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. arXiv preprint arXiv:1206.6471, 2012.
Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In Conference on Learning Theory, pp. 489–511, 2013.
Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
Amos Storkey. When training and test sets are different: characterizing learning transfer. Dataset Shift in Machine Learning, pp. 3–28, 2009.
Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
Vladimir Naumovich Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.
M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
Yiming Ying. McDiarmid's inequalities of Bernstein and Bennett forms. City University of Hong Kong, 2004.
Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 114. ACM, 2004.
Wojciech Zaremba, Arthur Gretton, and Matthew Blaschko. B-test: A non-parametric, low variance kernel two-sample test. In Advances in Neural Information Processing Systems, pp. 755–763, 2013.
Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pp. 819–827, 2013.
A MORE EXPERIMENTAL RESULTS
This section contains additional experiments that provide more insight into the settings in which the advantage of RLLS over BBSL is more or less pronounced.
A.0.1 CIFAR10 EXPERIMENTS UNDER TWEAK-ONE SHIFT AND DIRICHLET SHIFT
Here we compare the weight estimation performance of RLLS and BBSL for different types of shifts, including the Tweak-one Shift, for which we randomly choose one class, e.g. i, and set p(i) = ρ while all other classes are distributed evenly. Figure 5 depicts the weight estimation performance of RLLS compared to BBSL for a variety of values of ρ and α. Note that larger shifts correspond to smaller α and larger ρ. In general, one observes that our RLLS estimator has smaller MSE and that, as the shift increases, the error of both methods increases. For tweak-one shift we can additionally see that as the shift increases, RLLS outperforms BBSL more and more, both in terms of bias and variance.
Figure 5: Comparing MSE of estimated weights using BBSL and RLLS on CIFAR10 with (a) tweak-one shift on source and uniform target, and (b) Dirichlet shift on source and uniform target. h_0 is trained on the respective shifted source data.
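The two shift families used in this appendix can be generated along the following lines (a sketch of plausible generators consistent with the descriptions above; the function names are ours, and the exact sampling code is not given in the paper):

```python
import numpy as np

def tweak_one_shift(n_classes, rho, rng):
    """One randomly chosen class gets probability rho; the remaining
    classes share the leftover 1 - rho evenly."""
    p = np.full(n_classes, (1.0 - rho) / (n_classes - 1))
    p[rng.integers(n_classes)] = rho
    return p

def dirichlet_shift(n_classes, alpha, rng):
    """Label marginal drawn from Dirichlet(alpha); smaller alpha
    concentrates mass on fewer classes, i.e. a larger shift."""
    return rng.dirichlet(np.full(n_classes, alpha))
```

Sampling the source (or target) labels from such a marginal, while keeping the class-conditional inputs fixed, produces a pure label shift between the two domains.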
A.1 MNIST EXPERIMENTS UNDER MINORITY-CLASS SOURCE SHIFTS FOR DIFFERENT VALUES OF p
In order to show weight estimation and classification performance under different levels of label shift, we include several additional sets of experiments here in the appendix. Figure 6 shows the weight estimation error and accuracy comparison under a minority-class shift with p = 0.001. The training and testing sample size is 10000 examples in this case. We can see that whenever the weight estimation of RLLS is better, the accuracy is also better, except in the four-class case, where both methods are bad at weight estimation.
Figure 6: (a) Mean squared error in estimated weights and (b) accuracy on MNIST for minority-class shifted source and uniform target with p = 0.001.
Figure 7 demonstrates another case of minority-class shift, with p = 0.01. The black-box classifier is the same two-layer neural network, trained on a biased source data set with tweak-one ρ = 0.5. We observe that when the number of minority classes is small, such as 1 or 2, the weight estimation is similar between the two methods, as is the classification accuracy. But when the shift gets larger, the weight estimates get worse and the accuracy decreases, becoming even worse than that of the unweighted classifier.
Figure 7: (a) Mean squared error in estimated weights and (b) accuracy on MNIST for minority-class shifted source and uniform target with p = 0.01, with h_0 trained on tweak-one shifted source data.
Figure 8 illustrates the weight estimation alongside the final classification performance for minority-class source shift of MNIST. We use 1000 training and testing examples. We created large shifts of three or more minority classes with p = 0.005. We use a fixed black-box classifier that is trained on biased source data with tweak-one ρ = 0.5. Observe that the MSE in weight estimation is relatively large and that RLLS outperforms BBSL as the number of minority classes increases. As the shift increases, the performance of all methods deteriorates. Furthermore, Figure 8(b) illustrates how the advantage of RLLS over the unweighted classifier increases as the shift increases. Across all shifts, the RLLS-based classifier yields higher accuracy than the one based on BBSL.
Figure 8: (a) Mean squared error in estimated weights and (b) accuracy on MNIST for minority-class shifted source and uniform target with p = 0.005, with h_0 trained on tweak-one shifted source data.
A.2 CIFAR10 EXPERIMENT UNDER DIRICHLET SOURCE SHIFTS
Figure 9 illustrates the weight estimation alongside the final classification performance for Dirichlet source shift on the CIFAR10 dataset. We use 10000 training and testing examples in this experiment, following the way we generate shift on source data. We train h_0 with tweak-one shifted source data with ρ = 0.5. The results show that importance weighting in general does not help classification in this relatively large shift case, because the weighted methods, using either true or estimated weights, achieve accuracy similar to that of the unweighted method.
A.3 MNIST EXPERIMENT UNDER DIRICHLET SHIFT WITH LOW TARGET SAMPLE SIZE
We show the performance of classifiers with different regularization λ under a Dirichlet shift with α = 0.5 in Figure 10. The training set has 5000 examples in this case. We can see that in this low-target-sample case, λ = 1 only takes over after several hundred examples, while some λ value between 0 and 1 outperforms it at the beginning. As in the main text, we use black-box classifiers corrupted to different degrees to show the relation between the quality of the black-box predictor and the necessary target sample size. We use biased source data with tweak-one ρ = 0, 0.2, 0.6