arXiv:1903.09734v1 [cs.LG] 22 Mar 2019
Published as a conference paper at ICLR 2019
REGULARIZED LEARNING FOR DOMAIN ADAPTATION UNDER LABEL SHIFTS
Kamyar Azizzadenesheli
University of California, Irvine
kazizzad@uci.edu
Anqi Liu
California Institute of Technology
anqiliu@caltech.edu
Fanny Yang
Institute of Theoretical Studies, ETH Zürich
fan.yang@stat.math.ethz.ch
Animashree Anandkumar
California Institute of Technology
anima@caltech.edu
ABSTRACT
We propose Regularized Learning under Label shifts (RLLS), a principled and practical domain-adaptation algorithm to correct for shifts in the label distribution between a source and a target domain. We first estimate importance weights using labeled source data and unlabeled target data, and then train a classifier on the weighted source samples. We derive a generalization bound for the classifier on the target domain which is independent of the (ambient) data dimensions, and instead only depends on the complexity of the function class. To the best of our knowledge, this is the first generalization bound for the label-shift problem where the labels in the target domain are not available. Based on this bound, we propose a regularized estimator for the small-sample regime which accounts for the uncertainty in the estimated weights. Experiments on the CIFAR-10 and MNIST datasets show that RLLS improves classification accuracy, especially in the low-sample and large-shift regimes, compared to previous methods.
1 INTRODUCTION
When machine learning models are employed “in the wild”, the distribution of the data of interest (target distribution) can be significantly shifted compared to the distribution of the data on which the model was trained (source distribution). In many cases, the publicly available large-scale datasets with which the models are trained do not represent and reflect the statistics of a particular dataset of interest. This is for example relevant in managed services on cloud providers used by clients in different domains and regions, or medical diagnostic tools trained on data collected in a small number of hospitals and deployed on previously unobserved populations and time frames.
Covariate Shift: p(x) ≠ q(x), p(y|x) = q(y|x)
Label Shift: p(y) ≠ q(y), p(x|y) = q(x|y)
There are various ways to approach distribution shifts between a source data distribution P and a target data distribution Q. If we denote input variables as x and output variables as y, we consider the following two settings: (i) Covariate shift, which assumes that the conditional output distribution is invariant, p(y|x) = q(y|x), between source and target distributions, but the marginal input distribution p(x) changes. (ii) Label shift, where the conditional input distribution is invariant, p(x|y) = q(x|y), and p(y) changes from source to target. In the following, we assume that both input and output variables are observed in the source distribution whereas only input variables are available from the target distribution.
While covariate shift has been the focus of the literature on distribution shifts to date, label-shift scenarios appear in a variety of practical machine learning problems and warrant a separate discussion as well. In one setting, suppliers of machine-learning models such as cloud providers have large resources of diverse data sets (source set) to train the models, while during deployment, they have no control over the proportion of label categories.
In another setting, e.g. medical diagnostics, the disease distribution changes over locations and time. Consider the task of diagnosing a disease, based on reported symptoms, in a country with poor infrastructure and little data. Can we use data from a different location with data abundance to diagnose the disease in the new target location in an efficient way? How many labeled source and unlabeled target data samples do we need to obtain good performance on the target data?
Apart from being relevant in practice, label shift is a computationally more tractable scenario to mitigate than covariate shift. The reason is that the outputs y typically have a much lower dimension than the inputs x. Labels are usually either categorical variables with a finite number of categories or have simple well-defined structures. Despite being an intuitively natural scenario in many real-world applications, even this simplified model has only been scarcely studied in the literature. Zhang et al. (2013) proposed a kernel mean matching method for label shift which is not computationally feasible for large-scale data. The approach in Lipton et al. (2018) is based on importance weights that are estimated using the confusion matrix (also used in the procedures of Saerens et al. (2002); McLachlan (2004)) and demonstrates promising performance on large-scale data. Using a black-box classifier, which can be biased, uncalibrated and inaccurate, they first estimate the importance weights q(y)/p(y) for the source samples and then train a classifier on the weighted data. In the following we refer to this procedure as black box shift learning (BBSL), which the authors proved to be effective for large enough sample sizes.
However, three relevant questions remain unanswered by their work: How do we estimate the importance weights in the low-sample setting? What are the generalization guarantees for the final predictor which uses the weighted samples? How do we deal with the uncertainty of the weight estimation when only few samples are available? This paper aims to fill the gap in terms of both theoretical understanding and practical methods for the label shift setting, and thereby move a step closer towards a more complete understanding of the general topic of supervised learning for distributionally shifted data. In particular, our goal is to find an efficient method which is applicable to large-scale data and to establish generalization guarantees.
Our contribution in this work is threefold. Firstly, we propose an efficient weight estimator for which we can obtain good statistical guarantees without the problem-dependent minimum sample complexity required by BBSL. In the BBSL case, the estimation error can become arbitrarily large for small sample sizes. Secondly, we propose a novel regularization method to compensate for the high estimation error of the importance weights in low target sample settings. It explicitly controls the influence of our weight estimates when the target sample size is low (in the following referred to as the low-sample regime). Finally, we derive a dimension-independent generalization bound for the final Regularized Learning under Label Shift (RLLS) classifier based on our weight estimator. In particular, our method improves the weight estimation error and the excess risk of the classifier on reweighted samples by a factor of k log(k), where k is the number of classes, i.e. the cardinality of Y.
In order to demonstrate the benefit of the proposed method in practical situations, we empirically study the performance of RLLS and show weight estimation as well as prediction accuracy comparisons for a variety of shifts, sample sizes and regularization parameters on the CIFAR-10 and MNIST datasets. For large target sample sizes and large shifts, when applying the regularized weights fully, we achieve an order of magnitude smaller weight estimation error than baseline methods and up to 20% higher accuracy and F-1 score in the corresponding predictive tasks. For low target sample sizes, applying the regularized weights partially also yields an accuracy improvement of at least 10% over fully weighted and unweighted methods.
2 REGULARIZED LEARNING OF LABEL SHIFTS (RLLS)
Formally, let us use the shorthand for the marginal probability mass functions of Y on finite Y with respect to P, Q as p, q : [k] → [0, 1] with p(i) = P(Y = i) and q(i) = Q(Y = i) for all i ∈ [k], representable by vectors in R^k_+ which sum to one. In the label shift setting, we define the importance weight vector w ∈ R^k between these two domains as w(i) = q(i)/p(i). We quantify the shift using the exponents of the infinite and second order Rényi divergences as follows:
$$d_\infty(q\,\|\,p) := \sup_i \frac{q(i)}{p(i)}, \qquad\text{and}\qquad d(q\,\|\,p) := \mathbb{E}_{Y\sim Q}[w(Y)] = \sum_{i=1}^{k} q(i)\,\frac{q(i)}{p(i)}.$$
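As a concrete check of these definitions, both divergence exponents can be computed directly from the two label marginals. The following sketch uses hypothetical three-class marginals p and q chosen for illustration, not values from the paper's experiments:

```python
import numpy as np

# Hypothetical three-class label marginals (illustrative, not from the paper).
p = np.array([0.5, 0.3, 0.2])   # source label marginal p(y)
q = np.array([0.2, 0.3, 0.5])   # target label marginal q(y)

w = q / p                        # importance weights w(i) = q(i) / p(i)
d_inf = w.max()                  # d_inf(q||p) = sup_i q(i)/p(i)  -> 2.5
d_2 = np.sum(q * q / p)          # d(q||p) = sum_i q(i) * q(i)/p(i)  -> 1.63
```

Both quantities equal 1 when p = q and grow with the severity of the shift, which is why they appear as problem-dependent constants in the bounds below.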
Given a hypothesis class H and a loss function ℓ : Y × Y → [0, 1], our aim is to find the hypothesis h ∈ H which minimizes

$$L(h) = \mathbb{E}_{X,Y\sim Q}[\ell(Y, h(X))] = \mathbb{E}_{X,Y\sim P}[w(Y)\,\ell(Y, h(X))].$$
In the usual finite sample setting, however, L is unknown and we instead observe samples {(x_j, y_j)}_{j=1}^n from P. If we are given the vector of importance weights w, we could then minimize the empirical loss with importance weighted samples, defined as

$$L_n(h) = \frac{1}{n}\sum_{j=1}^{n} w(y_j)\,\ell(y_j, h(x_j)),$$
where n is the number of available observations drawn from P used to learn the classifier h. As w is unknown in practice, we have to find the minimizer of the empirical loss with estimated importance weights:

$$L_n(h; \hat{w}) = \frac{1}{n}\sum_{j=1}^{n} \hat{w}(y_j)\,\ell(y_j, h(x_j)), \qquad (1)$$
where ŵ is an estimate of w. Given a set D_p of n_p samples from the source distribution P, we first divide it into two sets: we use (1 − β) n_p samples in the set D_p^weight to compute the estimate ŵ, and the remaining n = β n_p samples in the set D_p^class to find the classifier which minimizes the loss (1), i.e. ĥ_ŵ = arg min_{h∈H} L_n(h; ŵ). In the following, we describe how to estimate the weights ŵ and provide guarantees for the resulting estimator ĥ_ŵ.
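The data split and the weighted objective above can be sketched numerically. The sample sizes, the random seed, and the stand-in per-sample losses below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n_p labeled source points over k = 3 classes, beta = 0.5.
n_p, k, beta = 1000, 3, 0.5
y = rng.integers(0, k, size=n_p)

# Split D_p: (1 - beta) * n_p samples for weight estimation, beta * n_p for training.
n_weight = int((1 - beta) * n_p)
y_weight, y_class = y[:n_weight], y[n_weight:]

# Weighted empirical loss of eq. (1), given per-sample losses ell_j = ell(y_j, h(x_j)).
def weighted_empirical_loss(w_hat, labels, ell):
    return np.mean(w_hat[labels] * ell)

ell = rng.uniform(0.0, 1.0, size=y_class.shape[0])  # stand-in per-sample losses
w_hat = np.ones(k)  # with no shift (w = 1) this reduces to the ordinary mean
assert np.isclose(weighted_empirical_loss(w_hat, y_class, ell), ell.mean())
```

Only the labels enter the weighting, which is what makes the label-shift correction cheap: each sample's loss is scaled by the weight of its class.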
Plug-in weight estimation. The following simple relation between the label distributions p, q was noted in Lipton et al. (2018): for a fixed hypothesis h, if for all y ∈ Y it holds that q(y) > 0 ⟹ p(y) > 0, we have
$$q_h(i) := Q(h(X) = i) = \sum_{j=1}^{k} Q(h(X) = i \mid Y = j)\, q(j) = \sum_{j=1}^{k} P(h(X) = i \mid Y = j)\, q(j)$$
$$= \sum_{j=1}^{k} P(h(X) = i, Y = j)\, \frac{q(j)}{p(j)} = \sum_{j=1}^{k} P(h(X) = i, Y = j)\, w_j$$
for all i, j ∈ Y. This can equivalently be written in matrix-vector notation as

$$q_h = C_h w, \qquad (2)$$
where C_h is the confusion matrix with [C_h]_{i,j} = P(h(X) = i, Y = j), and q_h is the vector which represents the probability mass function of h(X) under the distribution Q. The requirement q(y) > 0 ⟹ p(y) > 0 is a reasonable condition: without any prior knowledge, there is no way to properly reason about a class in the target domain that is not represented in the source domain.
In reality, both q_h and C_h can only be estimated by the corresponding finite sample averages q̂_h, Ĉ_h. Lipton et al. (2018) simply compute the inverse of the estimated confusion matrix Ĉ_h in order to estimate the importance weights, i.e. ŵ = Ĉ_h^{-1} q̂_h. While C_h^{-1} q̂_h is a statistically efficient estimator, ŵ with the estimated inverse Ĉ_h^{-1} can be arbitrarily bad, since Ĉ_h can be arbitrarily close to a singular matrix, especially for small sample sizes and a small minimum singular value of the confusion matrix. Intuitively, when there are very few samples, the weight estimation will have high variance, in which case it might be better to avoid importance weighting altogether. Furthermore, even when the sample complexity in Lipton et al. (2018), unknown in practice, is met, the resulting error of this estimator is linear in k, which is problematic for large k.
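To make the plug-in step concrete, the following sketch builds a population confusion matrix for a hypothetical noisy classifier h_0 and solves the linear system (2) for the weights. At the population level the recovery is exact; with finite-sample estimates Ĉ, q̂ the same solve is exactly the step that can become unstable. The marginals and the noise level are made-up stand-ins:

```python
import numpy as np

k = 3

# Hypothetical label marginals under label shift: p(x|y) is shared, p(y) changes.
p = np.array([0.5, 0.3, 0.2])          # source label marginal p(y)
q = np.array([0.2, 0.3, 0.5])          # target label marginal q(y)
w_true = q / p

# Model a black-box classifier h0 via its conditional confusion P(h0(X)=i | Y=j):
# 80% correct, remaining mass spread uniformly (each column sums to 1).
cond = 0.7 * np.eye(k) + 0.1 * np.ones((k, k))

C = cond * p          # joint confusion matrix [C]_{i,j} = P(h0(X)=i, Y=j)
q_h = cond @ q        # distribution of h0(X) under the target Q

# Plug-in (BBSL-style) weights: solve C w = q_h. Exact at the population level;
# with finite-sample estimates the same solve can be arbitrarily unstable.
w_hat = np.linalg.solve(C, q_h)
```

Replacing C and q_h with noisy finite-sample counts and re-running the solve reproduces the instability discussed above: the smaller the minimum singular value of Ĉ, the more the noise is amplified.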
We therefore aim to address these shortcomings by proposing the following two-step procedure to compute importance weights. In the case of no shift we have w = 1 (the all-ones vector), so we define the amount of weight shift as θ = w − 1. Given a “decent” black box estimator, which we denote by h_0, we make the final classifier less sensitive to the estimation performance of C (i.e. regularize the weight estimate) by
1. calculating the measurement error adjusted θ̂ (described in Section 2.1 for h_0), and
2. computing the regularized weight ŵ = 1 + λ θ̂, where λ depends on the sample size (1 − β) n_p.
By “decent” we refer to a classifier h_0 which yields a full rank confusion matrix C_{h_0}. A trivial example of a non-“decent” classifier h_0 is one that always outputs a fixed class. As it does not capture any characteristics of the data, there is no hope of gaining any statistical information without any prior information.
2.1 ESTIMATOR CORRECTING FOR FINITE SAMPLE ERRORS
Both the confusion matrix C_{h_0} and the label distribution q_{h_0} on the target for the black box hypothesis h_0 are unknown, and we are instead only given access to the finite sample estimates Ĉ_{h_0}, q̂_{h_0}. In what follows, all empirical and population confusion matrices, as well as label distributions, are defined with respect to the hypothesis h = h_0. For notational simplicity, we thus drop the subscript h_0 in what follows. The reparameterized linear model (2) with respect to θ then reads

$$b := q - C\mathbf{1} = C\theta,$$
with the corresponding finite sample quantity b̂ = q̂ − Ĉ1. When Ĉ is near singular, the estimation of θ becomes unstable. Furthermore, large values in the true shift θ result in large variances. We address this problem by adding a regularizing ℓ2 penalty term to the usual loss, thus pushing the amount of shift towards 0, a method that has been proposed in Pires & Szepesvári (2012). In particular, we compute

$$\hat{\theta} = \arg\min_{\theta} \|\hat{C}\theta - \hat{b}\|_2 + \Delta_C \|\theta\|_2. \qquad (3)$$
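A minimal sketch of solving the penalized problem (3) with a generic optimizer follows. The matrix Ĉ, the vector b̂, and the value of ∆_C below are made-up stand-ins for illustration; a derivative-free method is used because the norms in (3) are not squared, so the objective is non-smooth at θ = 0:

```python
import numpy as np
from scipy.optimize import minimize

k = 3

# Made-up finite-sample stand-ins for C_hat and b_hat = q_hat - C_hat @ 1,
# plus an illustrative value for the bound Delta_C on ||C_hat - C||_2.
C_hat = np.array([[0.40, 0.05, 0.03],
                  [0.06, 0.22, 0.04],
                  [0.04, 0.03, 0.13]])
b_hat = np.array([-0.12, 0.01, 0.11])
delta_C = 0.05

# Objective of eq. (3): unsquared norms, so the penalty can shrink theta
# exactly to 0 when the data term is weak (unlike ridge regression).
def objective(theta):
    return np.linalg.norm(C_hat @ theta - b_hat) + delta_C * np.linalg.norm(theta)

# Derivative-free solver, since the objective is non-smooth at theta = 0.
theta_hat = minimize(objective, np.zeros(k), method="Nelder-Mead").x

w_hat = 1.0 + theta_hat  # regularized weights, here with lambda = 1
```

The larger ∆_C (i.e. the less we trust Ĉ), the more the estimated shift is pulled towards 0 and the closer ŵ stays to the unweighted case w = 1, which matches the intuition that with very noisy estimates it is safer to avoid importance weighting.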
Here, ∆_C is a parameter which will eventually be a high-probability upper bound for ‖Ĉ − C‖_2. Let ∆_b also denote the high-probability upper bound for ‖b̂ − b‖_2.
Lemma 1. For θ̂ as defined in equation (3), we have with probability at least 1 − δ that

$$\|\hat{\theta} - \theta\|_2 \le \epsilon_\theta(n_p, n_q, \|\theta\|_2, \delta),$$

where ǫ_θ(n_p, n_q, ‖θ‖_2, δ) := O( (1/σ_min)