arXiv:2112.04161v1 [econ.TH] 8 Dec 2021
AGGREGATION OF PARETO OPTIMAL MODELS
HAMED HAMZE BAJGIRAN, HOUMAN OWHADI
Abstract.
Pareto efficiency is a concept commonly used in economics, statistics, and engineering. In the setting of statistical decision theory, a model is said to be Pareto efficient/optimal (or admissible) if no other model carries less risk for at least one state of nature while presenting no more risk for others. How can you rationally aggregate/combine a finite set of Pareto optimal models while preserving Pareto efficiency? This question is nontrivial because weighted model averaging does not, in general, preserve Pareto efficiency. This paper presents an answer in four logical steps: (1) A rational aggregation rule should preserve Pareto efficiency. (2) Due to the complete class theorem, Pareto optimal models must be Bayesian, i.e., they minimize a risk where the true state of nature is averaged with respect to some prior. Therefore each Pareto optimal model can be associated with a prior, and Pareto efficiency can be maintained by aggregating Pareto optimal models through their priors. (3) A prior can be interpreted as a preference ranking over models: prior π prefers model A over model B if the average risk of A is lower than the average risk of B (where the average is taken with respect to the prior π). (4) A rational/consistent aggregation rule should preserve this preference ranking: if both priors π and π₁ prefer model A over model B, then the prior obtained by aggregating π and π₁ must also prefer A over B. Under these four logical steps, we show that all rational/consistent aggregation rules are as follows: give each individual Pareto optimal model a weight, introduce a weak order/ranking over the set of Pareto optimal models, and aggregate a finite set of models S as the model associated with the prior obtained as the weighted average of the priors of the highest-ranked models in S. This result shows that all rational/consistent aggregation rules must follow a generalization of hierarchical Bayesian modeling. Following our main result, we present applications to kernel smoothing, time-depreciating models, social choice theory, and voting mechanisms.
1. Introduction
The purpose of this paper is to characterize rational/consistent aggregation rules for Pareto efficient/optimal (admissible) models, i.e., to answer the following question: how can a decision-maker consistently aggregate the opinions of different experts or different Pareto efficient models into one single/aggregate Pareto efficient model?

Affiliation: Division of Computing and Mathematical Sciences (CMS), California Institute of Technology. Email: hhamzeyi@caltech.edu and owhadi@caltech.edu
Date: December 9, 2021.
For example, the decision-maker can be a financial planner who can use different models or expert opinions to create a portfolio of assets to maximize the expected profit of her portfolio. Given access to a set of different experts or Pareto efficient models, she must design a plan/rule on how to aggregate the different models/opinions to form a single final Pareto efficient model/opinion.

More generally, employing Wald's decision-theoretic setting, the decision-maker may be interested in estimating some quantity of interest depending on some unknown parameter, given data sampled from a distribution depending on that parameter. In the process of selecting a decision rule (a plan/rule on how to use the observed data to estimate the quantity of interest), the decision-maker observes the opinions or characteristics of some experts (which, from now on, we simply refer to as experts). The decision maker's goal is then to aggregate all those experts into a single Pareto efficient model to form the final decision rule.

Our goal is to show, for all these examples, that the aggregation plan/rule and the final Pareto efficient model have a simple form under the following consistency/rationality requirements (derived from Hamze Bajgiran and Owhadi, 2021).
(1) Regardless of the set of observed experts, the decision-maker plays optimally. That is, she never plays a rule that, regardless of the true underlying parameter, leads to a higher loss than another rule.
(2) As a consequence of the complete class theorem, the decision-maker should find a minimizer of the loss function with respect to a single prior.
(3) By enabling comparisons between decision rules/models through their average loss, a prior can be interpreted as a preference ranking over the set of all decision rules/models. In this interpretation, the decision-maker should find the highest-ranked decision rule (which carries the lowest average risk).
(4) If the decision-maker interprets a prior as a ranking over decision rules in the risk set, she should consistently aggregate the observed experts. The form of consistency we use is the one introduced in Hamze Bajgiran and Owhadi, 2021, and it has also been mentioned under different names and for different purposes in the literature on case-based decision theory and social choice theory. That is, suppose the decision-maker observes a set of experts A and another disjoint set of experts B and forms their respective priors f(A) and f(B). Then, the aggregated ranking induced by f(A ∪ B) over the set of decision rules (risk set) should preserve the rankings of f(A) and f(B). That is, for every two decision rules d₁ and d₂, if both f(A) and f(B) prefer d₁ over d₂, then f(A ∪ B) should also prefer d₁ over d₂.
Our main result is to show that the consistency/rationality requirements described above can only be satisfied by the following simple aggregation rule (which can be interpreted as a generalization of Hierarchical Bayes).
(1) Select a weight function and a weak ordering over experts.
(2) Identify the prior associated with each expert.
(3) For every subset of observed experts, find the average prior (by averaging the priors of the highest-ranked individual experts in the subset with respect to the weight).
(4) Finally, select a minimizer of the loss function with respect to the average prior obtained in step 3.
Note that the weight used to form the average prior is independent of the subset of observed experts.
The organization of the paper is as follows. In Section 2, we present the decision-theoretic setting used to formalize our results. Then, we articulate the four main logical steps leading to our main results. Section 3 provides examples of the representation of the set of experts, with applications and connections to the literature in statistics and social choice theory.
2. Main Model and Results
Let E be a set of experts. Depending on the application, we may assume E to be a subset of a linear vector space. In that case, each expert e ∈ E may have different characteristics encoded in the coordinates of the vector e. The goal of the decision-maker is to identify a (modeling) rule for aggregating experts by mapping the set of finite subsets of the set E, which we denote by E*, to a set of models or decision rules M.
Definition 1. Let E be a set of experts and M be a set of models. A modeling rule on E is a function f : E* → M that maps any finite subset of experts A ∈ E* to a model f(A) ∈ M.
We will now use Wald's decision-theoretic setting to describe M.
2.1. Identification of M in Wald's decision-theoretic setting.
Let (X, Σ) be a measurable outcome space and (Θ, Σ_Θ) be a measurable space of the possible states of nature, with P(Θ) being the set of probability distributions on (Θ, Σ_Θ). Moreover, there is a class of probability measures {P_θ : θ ∈ Θ} such that whenever the true state is θ ∈ Θ, the distribution of observations X ∈ X is according to P_θ. In other words, (X, Σ, P_θ) is a probability space for every θ ∈ Θ.
Wald's decision-theoretic setting is concerned with the problem of estimating some quantity of interest q(θ) in a space Q given the observation of samples from the outcome space whose distribution depends on the true state of nature θ ∈ Θ (q : Θ → Q). Let l : Q × Q → R be a loss function such that l(x, y) ≥ 0 for all x, y ∈ Q, and l(x, y) = 0 whenever x = y.
Definition 2. A randomized model or a randomized decision rule is a function d : X × [0, 1] → Q such that l(q(θ), d) is a measurable function on the measurable space (X × [0, 1], σ(Σ × B[0, 1])) for every θ ∈ Θ, where B represents the Borel σ-algebra. We denote the set of all randomized decision rules by Δ(D).

For every randomized decision rule d, the decision-maker first samples u according to the uniform distribution on [0, 1] and then estimates q(θ) according to the non-randomized decision rule d(·, u) : X → Q.
Definition 3. The risk function R_q : Θ × Δ(D) → R is defined as the expected loss given the state θ and the decision rule d:

R_q(θ, d) = E_{X∼P_θ, u∼U[0,1]}[l(q(θ), d(X, u))].    (2.1)
From now on, we set M = Δ(D): we assume that the decision-maker selects a randomized decision rule, i.e., we consider the modeling rule f : E* → Δ(D).

We will now investigate four logical steps (rationality conditions) in the process of identifying a rule f. We will show that if all four steps are satisfied, then the modeling rule f must be a simple weighted average.
2.2. Step 1; Optimality/Admissibility. The goal of a decision-maker is to select a decision rule minimizing some risk function. As in Figure 1, regardless of the procedure employed to select a decision rule, a rational decision-maker should never select a rule that is worse than another rule for all states of nature. Otherwise, there is another estimator which provides less risk for at least one state of nature, and no more risk for the others.
Figure 1. Playing d_b is always better than playing d_a. [Figure: plot of R(θ₁, d) against R(θ₂, d) comparing the two rules.]
Definition 4. A decision rule d₁ ∈ Δ(D) is as good as a decision rule d₂ ∈ Δ(D) if R_q(θ, d₁) ≤ R_q(θ, d₂) for all θ ∈ Θ. A decision rule d₂ is Pareto dominated by d₁ if d₁ is as good as d₂ and there exists at least one state θ such that R_q(θ, d₁) < R_q(θ, d₂). An admissible rule is a decision rule that is not Pareto dominated. A class of estimators C ⊂ Δ(D) is said to be complete if it contains all admissible decision rules in Δ(D).
The idea is that if the goal is to minimize the risk, the decision-maker should only use the models in a complete class. Therefore, we assume that the range of a good modeling rule is a subset of the admissible (not Pareto dominated) decision rules.
Definition 5. A modeling rule f : E* → Δ(D) is admissible if the range of f is a subset of the set of admissible (not Pareto dominated) randomized decision rules.
2.3. Step 2; Complete Class Theorem. Admissible rules are related to the class of Bayes decision rules. To explore this relation, note that since the true state of nature θ is unknown, one may average the risk with respect to a distribution over the possible states of nature Θ. The following definition captures this idea.
Definition 6. The Bayes risk function R_q : P(Θ) × Δ(D) → R is the expectation of the risk function with respect to a prior distribution π ∈ P(Θ) and a randomized decision rule d ∈ Δ(D):

R_q(π, d) = E_{θ∼π}[R_q(θ, d)],    (2.2)

where, for ease of presentation, we have overloaded the notation of R_q.
Remark 1. The Bayes risk function is a multi-linear function in the following sense. If the prior π is a convex combination of two priors π₁, π₂, i.e., π = απ₁ + (1−α)π₂, then R_q(π, d) = αR_q(π₁, d) + (1−α)R_q(π₂, d) for all d ∈ Δ(D). Moreover, if the distribution of a randomized decision rule d is the same as the distribution of the randomization of two rules d₁ and d₂ which are selected with probability α and 1−α, then R_q(π, d) = αR_q(π, d₁) + (1−α)R_q(π, d₂).

One way of defining a randomized decision rule d having the distribution of the randomization of two randomized decision rules d₁, d₂ ∈ Δ(D) with probability α and 1−α is

d(x, u) = d₁(x, u/α) if u < α, and d(x, u) = d₂(x, (u−α)/(1−α)) otherwise.    (2.3)
Bayes decision rules are the minimizers of the Bayes risk function.

Definition 7. Let π be a prior on Θ. A Bayes decision rule for the prior π is a decision rule d_π ∈ Δ(D) that minimizes the Bayes risk function, i.e.,

d_π ∈ argmin_{d ∈ Δ(D)} R_q(π, d).
Wald, 1947 shows that in many cases, the class of Bayes decision rules forms a complete class. In other words, every admissible decision rule should minimize the loss function for some prior. Since the result and its geometrical understanding are important for the rest of the paper, we provide a geometrical overview and a simplified proof. To that end, it is helpful to consider the geometry of the risk set S ⊆ R^Θ (which we endow with a topology later), defined as

S = { s ∈ R^Θ | ∃ d ∈ Δ(D) s.t. s(θ) = R_q(θ, d) for all θ ∈ Θ }.    (2.4)
Essentially, for every risk profile s ∈ S there exists a randomized decision rule d ∈ Δ(D) such that the risk of playing the rule d is exactly s(θ) for every state θ. In other words, the risk set captures all possible attainable risk profiles. By the definition of the risk set,

inf_{d ∈ Δ(D)} R(π, d) = inf_{s ∈ S} ∫_Θ s(θ) dπ(θ).    (2.5)
Informally, the complete class theorem is supported by a simple geometric argument. Minimizing the risk function defined by a prior π over the set of randomized decision rules is the same as minimizing the linear function ∫_Θ s(θ) dπ(θ), defined by π, over the risk set S. By Remark 1, S is a convex set. Therefore, the minimizer is on the intersection of the hyperplane defined by the prior π and the boundary of S. As in Figure 2, we will show that the other direction works as well. That is, we show that if the risk set is closed, then risk profiles associated with admissible decision rules are on the lower boundary of the risk set. Moreover, for every point on the lower boundary of the risk set S, there exists a tangent hyperplane to the risk set at that point defined by a prior π. We show that the decision rule associated with that point on the boundary is a Bayes decision rule with respect to π.
Figure 2. For every admissible decision rule d* on the boundary of the risk set, one can find a prior π* such that the hyperplane defined by ⟨π*, ·⟩ that passes through d* separates the risk set and the set of negative functions. [Figure: plot of R(θ₁, d) against R(θ₂, d) showing the boundary of the risk set, an optimal rule d*, and the hyperplane defined by π*.]
To formalize this, we need the following definitions. Given a function r ∈ R^Θ, define the negative quadrant at r to be

Q_r = { f ∈ R^Θ | f(θ) ≤ r(θ) for all θ ∈ Θ }.

We define the lower boundary L(S) of S by

L(S) = { r ∈ R^Θ | Q_r ∩ S̄ = {r} },

where S̄ is the closure of the set S. The set S is said to be closed from below if L(S) ⊆ S.
The main connection between a prior in the minimization of the risk function and a tangent hyperplane to the risk set is through the Riesz–Markov–Kakutani representation theorem (see Aliprantis and Border, 2006, Chapter 13).
Theorem 1. Let X be a compact Hausdorff space and let C(X) denote the set of continuous functions on X equipped with the sup-norm. For any continuous linear function ψ on C(X), there is a unique signed Borel measure μ on X such that

ψ(f) = ∫_X f(x) dμ(x),  ∀f ∈ C(X).

The norm of ψ as a linear function is the total variation of μ, that is, ‖ψ‖ = |μ|(X). Finally, ψ is positive (ψ(f) ≥ 0 for every non-negative function f ∈ C(X)) if and only if the measure μ is non-negative.
We are now ready to establish a complete class theorem: under some conditions, the set of Bayes decision rules contains the set of admissible rules. There are three main geometrical components: the convexity of the risk set, an application of the separating hyperplane theorem to form a tangent hyperplane at any point of the lower boundary of the risk set, and an application of the Riesz–Markov–Kakutani representation theorem to obtain a representation of the tangent hyperplane in the form of an integral of the risk function with respect to a prior.

Let C(Θ) be the space of continuous functions on Θ equipped with the sup norm. In the following theorem, we assume that risk functions are continuous in their first argument and therefore S ⊂ C(Θ). Hence, we endow the risk set with the topology of C(Θ).
Theorem 2 (Complete Class Theorem). Let Θ be a compact subset of a Hausdorff topological space. If for every decision rule d ∈ Δ(D) the risk function R_q(θ, d) is a continuous function of θ, and the risk set is closed from below in C(Θ), then the Bayes decision rules form an essentially complete class.
Proof. Let d ∈ Δ(D) be an admissible rule and let r(·) = R(·, d) ∈ S be its associated risk profile. By the admissibility of d, we have (Q_r ∩ C(Θ)) ∩ S̄ = {r}, and therefore any risk profile associated with an admissible rule is on the lower boundary L(S) of the risk set.

Since S and Q_r ∩ C(Θ) are convex, Q_r ∩ C(Θ) has a nonempty interior, and (Q_r ∩ C(Θ)) ∩ S̄ = {r}, the separating hyperplane theorem (see Aliprantis et al., 2006, Section 5.13, or Luenberger, 1969, Thm. 2, Section 5.12) implies that there is a continuous linear function ψ separating Q_r ∩ C(Θ) and S and achieving its minimum on the set S at r. That is,

sup_{f ∈ Q_r ∩ C(Θ)} ψ(f) ≤ ψ(r) = min_{s ∈ S} ψ(s).    (2.6)
Since Θ is compact, the Riesz–Markov–Kakutani representation theorem assures us that there exists a finite signed measure μ_ψ on Θ representing the continuous linear function ψ as

ψ(f) = ∫_Θ f(θ) dμ_ψ(θ),  ∀f ∈ C(Θ).    (2.7)
To show that μ_ψ is a non-negative measure, by the second part of the Riesz–Markov–Kakutani representation theorem, it is enough to show that ψ(g) ≥ 0 for every positive function g ∈ C(Θ). Assume that this is not the case and there exists a positive function g ∈ C(Θ) with ψ(g) < 0. Let g_α = −αg + r for α > 0. By the positivity of g, g_α ∈ Q_r ∩ C(Θ) for every α > 0. Moreover, by the linearity of ψ, ψ(g_α) = ψ(−αg + r) = −αψ(g) + ψ(r) > ψ(r) for every α > 0. However, by the choice of ψ as in (2.6), we should have ψ(g_α) ≤ ψ(r), which is a contradiction. Therefore, the measure μ_ψ is a finite non-negative measure, and by normalizing it we can assume, without loss of generality, that it is a probability measure.
Finally, observe that (2.6) and (2.7) imply that

r ∈ argmin_{s ∈ S} ψ(s) = argmin_{s ∈ S} ∫_Θ s(θ) dμ_ψ(θ),

and (2.5) implies that

min_{α ∈ Δ(D)} ∫_Θ R(θ, α) dμ_ψ(θ) = min_{s ∈ S} ∫_Θ s(θ) dμ_ψ(θ).

Consequently, since R(·, d) = r, we have

min_{α ∈ Δ(D)} ∫_Θ R(θ, α) dμ_ψ(θ) = min_{s ∈ S} ∫_Θ s(θ) dμ_ψ(θ) = ∫_Θ r(θ) dμ_ψ(θ) = R(μ_ψ, d).

Hence, the decision rule d is a Bayes decision rule with respect to the probability measure μ_ψ on Θ. This completes the proof.
In more general cases, such as when Θ is not compact or the risk set is not closed from below, the Bayes decision rules do not necessarily form a complete class. However, similar geometrical arguments give us insight regarding the form of admissible rules. In many of the more general cases, admissible rules are limits of Bayes decision rules or are the minimizers of the risk function with respect to measures that are not necessarily finite measures.
Remark 2. Note that in many cases, such as when P_θ is an exponential family and the loss function is a squared loss, the assumptions of the theorem are satisfied. More generally, if the loss function is continuous in its first argument, the quantity of interest q is continuous, the P_θ are absolutely continuous with respect to the Lebesgue measure, and the density functions associated with P_θ are continuous in θ for every x, then the risk function is a continuous function of θ for every selected decision rule.
Discussion 1. By the complete class theorem, every admissible rule is the best response to some prior. However, it is not trivial to check whether a rule is admissible or not. To be precise, if the decision-maker has access to a set of rules, it is not trivial to check whether she is working with the admissible ones or not.

For example, in the case that x ∼ N_d(θ, I), where d is the dimension of the parameter space, one might consider the sample average as their rule. However, as the parameter space dimension becomes larger (d ≥ 3), a shrinkage-based estimator will beat the sample average for the mean square loss function.
To be precise, in the case of a single observation x₁ ∼ N_d(θ, I), the James–Stein estimator

θ̂_JS = (1 − (d−2)/‖x₁‖²) x₁    (2.8)

dominates x₁ with respect to the mean square loss function. The more interesting observation is that another class of rules can dominate the James–Stein estimator itself, and as a result it is not admissible either.

However, we are assuming that the knowledge of the complete class theorem makes us model as if we are minimizing a loss function with respect to a prior. This assumption might be incorrect in practice.
Generally speaking, for a given decision rule d, we cannot go over all the priors to check whether it is admissible or not. However, one might minimize the loss function with respect to a prior, check the Bayes risk of that prior, and compare it with the risk given by the decision rule d. Accepting a rule as approximately admissible is another question that is not our primary concern in this paper.
2.4. Step 3; Interpreting a prior as a preference ranking over the risk set. Let us elaborate on the consequences of the complete class theorem. Again, the interpretation is through the following geometrical picture. Every prior π ∈ P(Θ) induces a continuous linear functional ⟨π, s⟩ = ∫_Θ s(θ) dπ(θ) over the risk set S = { s ∈ R^Θ | ∃ d ∈ Δ(D) s.t. s(θ) = R(θ, d) for all θ ∈ Θ }. The induced linear functional ⟨π, ·⟩ : S → R ranks the elements of the risk set with respect to their risk associated with the prior π. As a result of the complete class theorem, every admissible model is associated with the highest-ranked point in the risk set with respect to a ranking associated with some prior π ∈ P(Θ).
Therefore, one may think of an admissible modeling rule f : E* → Δ(D) as a minimizer of the induced rankings of priors over the risk set S. That is, for every A ∈ E*, there exists a π_A ∈ P(Θ) that can be interpreted as a ranking over the risk set S. The final model, f(A), is the highest-ranked element of S with respect to the ranking induced by π_A. Formally, we can define the connection as follows.
Definition 8. Let f : E* → Δ(D) be an admissible modeling rule. A ranking rule is a function g_f : E* → P(Θ) such that

f(A) ∈ argmin_{d ∈ Δ(D)} R(g_f(A), d)

for every A ∈ E*.
To emphasize the view that each prior ranks the risk set linearly, and for simplicity of notation, we define for every prior a weak order ≿ as follows:
Definition 9. Let π ∈ P(Θ) be a prior over the set of states of nature. The weak order (reflexive, transitive, and complete binary relation) ≿_π over the risk set S = { s ∈ R^Θ | ∃ d ∈ Δ(D) s.t. s(θ) = R(θ, d) for all θ ∈ Θ } is defined as:

s₁ ≿_π s₂ ⇔ ⟨π, s₁⟩ ≤ ⟨π, s₂⟩.    (2.9)
As a consequence, for every A ∈ E* we may interpret the prior g_f(A) ∈ P(Θ) through its associated weak order ≿_{g_f(A)} over the elements of the risk set S. Suppose we accept this viewpoint as a viable one. In that case, we may interpret the ranking rule g_f as a ranking mechanism which, upon observing a subset of experts A ∈ E*, ranks the risk set S using the induced weak order ≿_{g_f(A)}.
Discussion 2. In practice, we might not have a flat indifference curve. In other words, the assumption that the decision-maker has a linear variety as her indifference set in the risk set might not be a viable one. There are approaches to handle these issues; however, they are not the primary concern of this paper.
Discussion 3. Generally, there is no bijection between the set of priors and the set of admissible rules. If the boundary of the risk set is smooth enough (such that the set of sub-differentials has a unique element), then we can form a bijection between the set of decision rules and priors. Otherwise, as in Figure 3, an admissible rule may be the best response to distinct priors. In those cases, there will be an ambiguity in the selection of the prior that best captures the main characteristics the modeler is interested in. However, in this paper, we will not deal with such situations, and we will only assume that the modeler always selects a prior (this prior leads to a ranking of the risk profiles in the risk set). The process used to select the prior can be arbitrary.
Figure 3. The decision rule d* is the best response with respect to both π₁ and π₂. [Figure: plot of R(θ₁, d) against R(θ₂, d) showing the boundary of the risk set and the hyperplanes defined by π₁ and π₂.]
2.5. Step 4; Consistency. As a consequence of the last three steps, we reduce the problem from the class of modeling rules f : E* → M to the class of ranking rules g_f : E* → P(Θ). In this step, our goal is to assess how to combine the results of two separate ranking orders g_f(A) and g_f(B) to form g_f(A ∪ B).
Let us consider the following simple example. Let A = {e₁, e₂} with g_f(e₁) = π₁ and g_f(e₂) = π₂. As a result of the discussion in Step 3, to form g_f(A) we may consider the aggregation of the corresponding weak orders ≿_{π₁}, ≿_{π₂} over the risk set S. One might assume that if both weak orders prefer a risk profile s₁ to another risk profile s₂, the aggregated ranking should also respect this order. If this is the case, we call the ranking consistent with respect to both priors π₁ and π₂. For a general ranking mechanism, we may generalize the definition as follows.
Definition 10. Let g_f : E* → P(Θ) be a ranking rule over the risk set S. We say that g_f is weakly consistent if for every two disjoint sets A, B ∈ E*, and for every two risk profiles s₁, s₂ ∈ S,

s₁ ≿_{g_f(A)} s₂, s₁ ≿_{g_f(B)} s₂ ⇒ s₁ ≿_{g_f(A∪B)} s₂.    (2.10)

Moreover, it is consistent if it also satisfies the following condition:

s₁ ≻_{g_f(A)} s₂, s₁ ≿_{g_f(B)} s₂ ⇒ s₁ ≻_{g_f(A∪B)} s₂.    (2.11)
To better understand the geometry of consistency, let g_f : E* → P(Θ) be a consistent ranking rule. Consider two disjoint subsets of experts A, B ∈ E* and two risk profiles s₁, s₂ ∈ S such that s₁ ≿_{g_f(A)} s₂ and s₁ ≿_{g_f(B)} s₂. Using the definition of ≿_{g_f(A)} and ≿_{g_f(B)}, we obtain ⟨g_f(A), s₂ − s₁⟩ ≥ 0 and ⟨g_f(B), s₂ − s₁⟩ ≥ 0. Consistency implies that s₁ ≿_{g_f(A∪B)} s₂; therefore, we should have ⟨g_f(A ∪ B), s₂ − s₁⟩ ≥ 0. Using duality (Farkas' lemma in the finite-dimensional case, or the Hahn–Banach theorem in general), the continuous linear function represented by g_f(A ∪ B) should be in the cone generated by g_f(A), g_f(B) in the dual space of C(Θ). However, since g_f(A ∪ B) is a probability distribution, it should be a convex combination of g_f(A) and g_f(B); that is, it is a randomization of g_f(A) and g_f(B) with some positive weights. Note that condition (2.11) in the definition of consistency guarantees that g_f(A ∪ B) should be in the interior of the line segment connecting g_f(A) and g_f(B) in the dual space of the risk set. Therefore, we may connect consistency to another condition that has been studied in different literatures under different names.
Definition 11. We say that a ranking rule g_f : E* → P(Θ) satisfies the weighted averaging property if for all A, B ∈ E* such that A ∩ B = ∅, it holds true that

g_f(A ∪ B) = λ g_f(A) + (1−λ) g_f(B)    (2.12)

for some λ ∈ [0, 1] (which may depend on A and B). We say that g_f satisfies the strict weighted averaging property if (2.12) holds true for λ ∈ (0, 1).
Therefore, as a result of the duality, the two conditions are the same.

Lemma 1. Let g_f : E* → P(Θ) be a ranking rule. Then, the following are equivalent:
(1) g_f is consistent.
(2) g_f satisfies the strict weighted averaging property.
Moreover, the following are also equivalent:
(1) g_f is weakly consistent.
(2) g_f satisfies the weighted averaging property.
To elaborate on the above observation, let g_f be a consistent ranking rule and A = {e₁, ..., e_n} ∈ E*. By applying Lemma 1, g_f(A) is in the convex hull of the probability measures g_f(e_i), i ∈ {1, ..., n}. That means there exists a randomization of the g_f(e_i) by a probability measure represented by (λ_{e₁}, ..., λ_{e_n}) ∈ P({e₁, ..., e_n}) such that g_f(A) = Σ_i λ_{e_i} g_f(e_i). Consequently, Lemma 1 results in g_f(A) ∈ ConvexHull(g_f(e_i)), with e_i ∈ E, a property which we call coordinate-wise Pareto.
Definition 12. We say that a ranking rule g_f : E* → P(Θ) is coordinate-wise Pareto if for all A ∈ E*,

g_f(A) ∈ ConvexHull{ g_f(e) | e ∈ A }.    (2.13)
We can gain a better understanding of the Pareto property through the lens of duality.
Lemma 2. The rule g_f : E* → P(Θ) is coordinate-wise Pareto if and only if for every set A = {e₁, ..., e_n} ∈ E* and for every two risk profiles s₁, s₂ ∈ S,

s₁ ≿_{g_f(e₁)} s₂, ..., s₁ ≿_{g_f(e_n)} s₂ ⇒ s₁ ≿_{g_f(A)} s₂.    (2.14)
A simple induction shows that all consistent ranking rules are coordinate-wise Pareto.

Corollary 1. Every consistent ranking rule g_f : E* → P(Θ) is a coordinate-wise Pareto ranking rule.
One might wonder whether the converse of the above observation is also true. In other words, is consistency only about g_f(A) being in the convex hull of the g_f(e_i), for e_i ∈ A? The answer is no: consistency is a stronger assumption. To better understand this, consider the following example.
Example 1. To elaborate on the above observation, let g_f be a coordinate-wise Pareto rule and A = {x, y, z}, B = {x, y, w} ∈ E* with z ≠ w. Hence, there exist two randomizations λ^A, λ^B on the elements of A, B such that

g_f(A) = λ^A_x g_f(x) + λ^A_y g_f(y) + λ^A_z g_f(z)

and

g_f(B) = λ^B_x g_f(x) + λ^B_y g_f(y) + λ^B_w g_f(w).

Without consistency, there is nothing more to be said. However, with consistency there is a connection between λ^A_x / λ^A_y and λ^B_x / λ^B_y; namely, it is always possible to make λ^A_x / λ^A_y = λ^B_x / λ^B_y.
To be more precise, consider Figure 4. We will show that, knowing g_f(x, y), g_f(y, z), g_f(z, w), we can deduce g_f(x, y, z) and g_f(x, y, w) uniquely.
Figure 4. We are assuming that the value of g_f is known at {x}, {y}, {z}, {w}, {x, y}, {y, z}, {z, w}. The goal is to find g_f(x, y, z) and g_f(x, y, w) in a unique way.
First, as in Figure 5a, by consistency $g_f(\{x,y,z\})$ is on the intersection of the line joining $g_f(x), g_f(\{y,z\})$ and the line joining $g_f(z), g_f(\{x,y\})$. Then, as in Figure 5b, consistency again shows that $g_f(\{x,y,z,w\})$ must be the intersection of the line joining $g_f(\{x,y,z\}), g_f(w)$ and the line joining $g_f(\{z,w\}), g_f(\{x,y\})$. Finally, as in Figure 5c, one last application of consistency shows that $g_f(\{x,y,w\})$ must be on the intersection of the line joining $g_f(z), g_f(\{x,y,z,w\})$ and the line joining $g_f(\{x,y\}), g_f(w)$. Therefore, by consistency, all three of $g_f(\{x,y,z\})$, $g_f(\{x,y,w\})$, and $g_f(\{x,y,z,w\})$ are uniquely determined.
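The first step of this geometric argument can be checked numerically: when $g_f$ is a weighted-average rule, $g_f(\{x,y,z\})$ is indeed the intersection of the line through $g_f(x)$ and $g_f(\{y,z\})$ with the line through $g_f(z)$ and $g_f(\{x,y\})$. The particular points and weights below are illustrative assumptions, chosen only so the lines are not parallel.

```python
import numpy as np

def line_intersection(p1, p2, p3, p4):
    """Intersection of the line through p1, p2 with the line through p3, p4."""
    d1, d2 = p2 - p1, p4 - p3
    A = np.array([d1, -d2]).T        # solve p1 + t*d1 = p3 + s*d2
    t, _ = np.linalg.solve(A, p3 - p1)
    return p1 + t * d1

def g(points, w, S):
    """Weighted-average rule: barycenter of the priors of the experts in S."""
    total = sum(w[e] for e in S)
    return sum(w[e] / total * points[e] for e in S)

pts = {'x': np.array([0.0, 0.0]), 'y': np.array([1.0, 0.0]),
       'z': np.array([0.5, 1.0]), 'w': np.array([1.5, 1.0])}
w = {'x': 1.0, 'y': 2.0, 'z': 3.0, 'w': 4.0}

# g_f({x,y,z}) lies on both lines, so their intersection recovers it.
recovered = line_intersection(pts['x'], g(pts, w, 'yz'),
                              pts['z'], g(pts, w, 'xy'))
assert np.allclose(recovered, g(pts, w, 'xyz'))
```

The same two-line construction applied twice more recovers $g_f(\{x,y,z,w\})$ and $g_f(\{x,y,w\})$, mirroring Figures 5b and 5c.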
[Figure 5: three panels showing the successive constructions. (a) $g_f(\{x, y, z\})$. (b) $g_f(\{x, y, z, w\})$. (c) $g_f(\{x, y, w\})$.]

Figure 5. By consistency, we can inductively obtain $g_f(\{x,y,z\})$, $g_f(\{x,y,z,w\})$, and $g_f(\{x,y,w\})$.
By inductively applying the same pair of arguments as in Example 1 (see Hamze Bajgiran and Owhadi, 2021, Thm. 1), we obtain the following more general result.
Corollary 2. Let $g_f : E^* \to \mathcal{P}(\Theta)$ be a consistent ranking rule. If the range of $g_f$ is not a subset of a one-dimensional linear variety, then there exists a weight function $w : E \to \mathbb{R}_{++}$ such that for every set of experts $A \in E^*$,
$$ g_f(A) = \sum_{e_i \in A} \left( \frac{w(e_i)}{\sum_{e_j \in A} w(e_j)} \right) g_f(e_i). \tag{2.15} $$
Moreover, the weight function is unique up to multiplication by a positive number.
As a consequence of the above corollary, a modeling rule $f : E^* \to \Delta(\Theta)$ is consistent (and non-degenerate) if and only if it can be constructed as follows.
(1) select a weight function $w : E \to \mathbb{R}_{++}$,
(2) figure out $g_f(e)$ for $e \in E$,
(3) for every $A \in E^*$, form $g_f(A)$ as in
$$ g_f(A) = \sum_{e_i \in A} \left( \frac{w(e_i)}{\sum_{e_j \in A} w(e_j)} \right) g_f(e_i), \tag{2.16} $$
(4) finally, the rule $f$ is
$$ f(A) \in \operatorname*{argmin}_{d \in \Delta(D)} R(g_f(A), d). \tag{2.17} $$
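Steps (1)–(3) of this recipe can be sketched for a finite parameter set $\Theta$, where each prior is a probability vector and (2.16) is a normalized weighted average. The expert names, priors, and weights below are illustrative assumptions.

```python
import numpy as np

def aggregate_prior(priors, w, A):
    """Normalized weighted average of the individual priors, as in (2.16)."""
    total = sum(w[e] for e in A)
    return sum((w[e] / total) * priors[e] for e in A)

# Step (2): the prior g_f(e) of each individual expert, over |Theta| = 3 states.
priors = {'alice': np.array([0.7, 0.2, 0.1]),
          'bob':   np.array([0.2, 0.5, 0.3])}
# Step (1): the weight function.
w = {'alice': 2.0, 'bob': 1.0}

# Step (3): the aggregated prior for the set A = {alice, bob}.
g_A = aggregate_prior(priors, w, ['alice', 'bob'])
# g_A is again a probability vector; step (4) would then minimize the
# Bayesian risk R(g_A, d) over decisions d with any standard optimizer.
```

Since the weights are normalized within each set $A$, rescaling $w$ by a positive constant leaves every $g_f(A)$ unchanged, matching the uniqueness statement of Corollary 2.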
We now present a more general version in which, instead of consistency, we impose weak consistency. Weakly consistent rules are characterized by both a weight function and a weak order over experts. They are obtained by averaging the priors of the highest-ranked experts rather than all of them.
Definition 13. A binary relation $\geq$ on $E$ is a weak order on $E$ if it is reflexive ($x \geq x$), transitive ($x \geq y$ and $y \geq z$ imply $x \geq z$), and complete (for all $x, y \in E$, $x \geq y$ or $y \geq x$). We say that $x$ is equivalent to $y$, and write $x \sim y$, if $x \geq y$ and $y \geq x$.
Consider a weak order $\geq$ on $E$. For $A \in E^*$, write $M(A, \geq)$ for the set of highest-ranked elements of $A$.
A more general result is as follows (see Hamze Bajgiran and Owhadi, 2021, Thm. 2).
Corollary 3. Let $g_f : E^* \to \mathcal{P}(\Theta)$ be a weakly consistent ranking rule. If $g_f$ satisfies the non-degeneracy (strong richness) condition of Hamze Bajgiran and Owhadi, 2021, then there exist a unique weak order $\geq$ on $E$ and a weight function $w : E \to \mathbb{R}_{++}$ such that for every set of experts $A \in E^*$,
$$ g_f(A) = \sum_{e_i \in M(A, \geq)} \left( \frac{w(e_i)}{\sum_{e_j \in M(A, \geq)} w(e_j)} \right) g_f(e_i). \tag{2.18} $$
Moreover, the weight function is unique up to multiplication by a positive number within each equivalence class of the weak order $\geq$.
As a consequence of the above corollary, a (non-degenerate) modeling rule $f : E^* \to \Delta(\Theta)$ is weakly consistent if and only if it can be constructed as follows.
(1) select a weight function $w : E \to \mathbb{R}_{++}$ and a weak order $\geq$ on $E$,
(2) figure out $g_f(e)$ for $e \in E$,
(3) for every $A \in E^*$, form $g_f(A)$ as in
$$ g_f(A) = \sum_{e_i \in M(A, \geq)} \left( \frac{w(e_i)}{\sum_{e_j \in M(A, \geq)} w(e_j)} \right) g_f(e_i), \tag{2.19} $$
(4) finally, the rule $f$ is
$$ f(A) \in \operatorname*{argmin}_{d \in \Delta(D)} R(g_f(A), d). \tag{2.20} $$
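The weakly consistent recipe differs from the consistent one only in restricting the average to $M(A, \geq)$. One simple way to sketch this is to encode the weak order by a rank function (higher rank = higher in the order); the rank values, names, and priors below are illustrative assumptions.

```python
import numpy as np

def aggregate_prior(priors, w, rank, A):
    """Weighted average over the highest-ranked experts only, as in (2.19)."""
    top = max(rank[e] for e in A)
    M = [e for e in A if rank[e] == top]      # M(A, >=)
    total = sum(w[e] for e in M)
    return sum((w[e] / total) * priors[e] for e in M)

priors = {'alice': np.array([0.7, 0.2, 0.1]),
          'bob':   np.array([0.2, 0.5, 0.3]),
          'carol': np.array([0.1, 0.1, 0.8])}
w = {'alice': 2.0, 'bob': 1.0, 'carol': 5.0}
rank = {'alice': 1, 'bob': 1, 'carol': 0}     # carol is ranked strictly lower

g_A = aggregate_prior(priors, w, rank, ['alice', 'bob', 'carol'])
# carol's prior (and her large weight) is ignored: only the top-ranked
# experts enter the average, exactly as in step (3).
```

Note that a weak order admits such a rank encoding whenever $E$ is finite; for infinite $E$ the order itself is the primitive object.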
The representation (2.19) has two components: one is captured by the weak order $\geq$, the other by the weight function $w$. The weak order partitions the set of experts into equivalence classes and ranks them from top to bottom. If all experts $e \in A$ have the same rank, then $g_f(A)$ is the weighted average of $g_f(e)$ for $e \in A$. However, if some experts are ranked higher than others, the rule ignores the lower-ranked experts. Hence, the rule's assessment proceeds in two steps: first, it considers only the highest-ranked priors; then it uses the weight function to form the weighted average of those priors.
3. More Examples
In the previous section, we interpreted the elements of the set $E$ as individual experts. From now on, we interpret the elements of $E$ as representing experts and their characteristics.
3.1. Kernel Smoother. Assume that the decision-maker herself is also an element of the set $E$. To be precise, let $e_0 \in E$ encode all the relevant characteristics and beliefs of the decision-maker before observing any other expert. Then, using the same language as before, $g_f(e_0, \cdot) : E^* \cup \{\emptyset\} \to \mathcal{P}(\Theta)$, and $g_f(e_0, \emptyset)$ is the prior that the decision-maker uses to select decision rules without observing any other experts' characteristics.
More generally, we can interpret the rule $g_f : E \times (E^* \cup \{\emptyset\}) \to \mathcal{P}(\Theta)$ as a ranking rule such that, for every characteristic profile $e \in E$ of the decision maker and every observed set of experts' characteristics $A \in E^*$, $g_f(e, A)$ is the prior that the decision maker uses to select her decision rule.
Under the conditions of the previous section, we have the following representation.
Corollary 4. Let $g_f : E \times (E^* \cup \{\emptyset\}) \to \mathcal{P}(\Theta)$ be a rule such that, for every $e \in E$, $g_f(e, \cdot)$ is a consistent ranking rule whose range is not a subset of a one-dimensional linear variety. Then there exists a weight function $w : E \times E \to \mathbb{R}_{++}$ such that for every decision maker's characteristics $e \in E$ and every set of experts' characteristics $A \in E^*$,
$$ g_f(e, A) = \sum_{e_i \in A} \left( \frac{w(e, e_i)}{\sum_{e_j \in A} w(e, e_j)} \right) g_f(e, e_i). \tag{3.1} $$
Moreover, the weight function is unique up to multiplication by a positive number.
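The "kernel smoother" reading of (3.1) can be sketched by taking characteristics to be points in $\mathbb{R}^d$ and $w(e, e_i) = K(e - e_i)$ for a kernel $K$, so the decision-maker's prior becomes a Nadaraya-Watson-style average of the experts' priors. The Gaussian kernel, bandwidth, and locations below are illustrative assumptions, not prescribed by the corollary.

```python
import numpy as np

def kernel_weight(e, ei, bandwidth=1.0):
    """Gaussian kernel weight w(e, e_i) = K(e - e_i); always positive."""
    return float(np.exp(-np.sum((e - ei) ** 2) / (2 * bandwidth ** 2)))

def smoothed_prior(e, experts, priors, bandwidth=1.0):
    """The decision-maker's prior g_f(e, A) as the kernel-weighted average (3.1)."""
    ws = np.array([kernel_weight(e, ei, bandwidth) for ei in experts])
    ws = ws / ws.sum()                    # the normalization in (3.1)
    return sum(wi * pi for wi, pi in zip(ws, priors))

experts = [np.array([0.0]), np.array([2.0])]
priors = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]

# A decision-maker located near the first expert weighs that prior more heavily.
g = smoothed_prior(np.array([0.2]), experts, priors)
```

Because the Gaussian kernel is strictly positive, the induced weight function maps into $\mathbb{R}_{++}$ as the corollary requires, and rescaling the kernel by a positive constant leaves the smoothed prior unchanged.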