Interpreting Latent Variables in Factor Models via Convex
Optimization

Armeen Taeb and Venkat Chandrasekaran

Department of Electrical Engineering
Department of Computing and Mathematical Sciences
California Institute of Technology
Pasadena, CA 91125
Email: ataeb@caltech.edu, venkatc@caltech.edu

November 4, 2016
Abstract
Latent or unobserved phenomena pose a significant difficulty in data analysis as they induce
complicated and confounding dependencies among a collection of observed variables. Factor
analysis is a prominent multivariate statistical modeling approach that addresses this challenge
by identifying the effects of (a small number of) latent variables on a set of observed variables.
However, the latent variables in a factor model are purely mathematical objects that are derived
from the observed phenomena, and they do not have any interpretation associated to them. A
natural approach for attributing semantic information to the latent variables in a factor model
is to obtain measurements of some additional plausibly useful covariates that may be related to
the original set of observed variables, and to associate these auxiliary covariates to the latent
variables. In this paper, we describe a systematic approach for identifying such associations.
Our method is based on solving computationally tractable convex optimization problems, and
it can be viewed as a generalization of the minimum-trace factor analysis procedure for fitting
factor models via convex optimization. We analyze the theoretical consistency of our approach
in a high-dimensional setting as well as its utility in practice via experimental demonstrations
with real data.
1 Introduction
A central goal in data analysis is to identify concisely described models that characterize the sta-
tistical dependencies among a collection of variables. Such concisely parametrized models avoid
problems associated with overfitting, and they are often useful in providing meaningful interpreta-
tions of the relationships inherent in the underlying variables. Latent or unobserved phenomena
complicate the task of determining concisely parametrized models as they induce confounding de-
pendencies among the observed variables that are not easily or succinctly described. Consequently,
significant efforts over many decades have been directed towards the problem of accounting for
the effects of latent phenomena in statistical modeling. A common shortcoming of approaches
to latent-variable modeling is that the latent variables are typically mathematical constructs that
are derived from the originally observed data, and these variables do not directly have semantic
information linked to them. Discovering interpretable meaning underlying latent variables would
clearly impact a range of contemporary problem domains throughout science and technology. For
example, in data-driven approaches to scientific discovery, the association of semantics to latent
variables would lead to the identification of new phenomena that are relevant to a scientific process,
or would guide data-gathering exercises by providing choices of variables for which to obtain new
measurements.
In this paper, we focus for the sake of concreteness on the challenge of interpreting the latent
variables in a factor model [20]. Factor analysis is perhaps the most widely used latent-variable
modeling technique in practice. The objective with this method is to fit observations of a collection
of random variables y ∈ R^p to the following linear model:

    y = Bζ + ε,    (1.1)

where B ∈ R^(p×k), k ≪ p. The random vectors ζ ∈ R^k, ε ∈ R^p are independent of each other,
and they are normally distributed as¹ ζ ∼ N(0, Σ_ζ), ε ∼ N(0, Σ_ε), with Σ_ζ ≻ 0, Σ_ε ≻ 0, and
Σ_ε being diagonal. Here the random vector ζ represents a small number of unobserved, latent
variables that impact all the observed variables y, and the matrix B specifies the effect that the
latent variables have on the observed variables. However, the latent variables ζ themselves do
not have any interpretable meaning, and they are essentially a mathematical abstraction employed
to fit a concisely parameterized model to the conditional distribution of y|ζ (which represents the
remaining uncertainty in y after accounting for the effects of the latent variables ζ) – this conditional
distribution is succinctly described as it is specified by a model consisting of independent variables
(as the covariance of the Gaussian random vector ε is diagonal).

¹The mean vector does not play a significant role in our development, and therefore we consider zero-mean random
variables throughout this paper.
A natural approach to attributing semantic information to the latent variables ζ in a factor
model is to obtain measurements of some additional plausibly useful covariates x ∈ R^q (the choice
of these variables is domain-specific), and to link these to the variables ζ. However, defining and
specifying such a link in a precise manner is challenging. Indeed, a fundamental difficulty that arises
in establishing this association is that the variables ζ in the factor model (1.1) are not identifiable.
In particular, for any non-singular matrix W ∈ R^(k×k), we have that Bζ = (BW^(-1))(Wζ). In
this paper, we describe a systematic and computationally tractable methodology based on convex
optimization that integrates factor analysis and the task of interpreting the latent variables. Our
convex relaxation approach generalizes the minimum-trace factor analysis technique, which has
received much attention in the mathematical programming community over the years [10, 17, 18, 19, 16].
1.1 A Composite Factor Model
We begin by making the observation that the column space of B – which specifies the k-dimensional
component of y that is influenced by the latent variables ζ – is invariant under transformations of
the form B → BW^(-1) for non-singular matrices W ∈ R^(k×k). Consequently, we approach the problem
of associating the covariates x to the latent variables ζ by linking the effects of x on y to the
column space of B. Conceptually, we seek a decomposition of the column space of B into transverse
subspaces H_x, H_u ⊂ R^p, H_x ∩ H_u = {0} so that column-space(B) = H_x ⊕ H_u – the subspace
H_x specifies those components of y that are influenced by the latent variables ζ and are also affected by
the covariates x, and the subspace H_u represents any unobserved residual effects on y due to ζ that
are not captured by x. To identify such a decomposition of the column space of B, our objective is
to split the term Bζ in the factor model (1.1) as

    Bζ ≈ Ax + B_u ζ_u,    (1.2)
where the column space of A ∈ R^(p×q) is the subspace H_x and the column space of B_u ∈ R^(p×dim(H_u))
is the subspace H_u, i.e., dim(column-space(A)) + dim(column-space(B_u)) = dim(column-space(B))
and column-space(A) ∩ column-space(B_u) = {0}. Since the number of latent variables ζ in the
factor model (1.1) is typically much smaller than p, the dimension of the column space of A is
also much smaller than p; as a result, if the dimension q of the additional covariates x is large,
the matrix A has small rank. Hence, the matrix A plays two important roles: its column space
(in R^p) identifies those components of the column space of B that are influenced by the covariates x,
and its row space (in R^q) specifies those components of (a potentially large number of) the covariates x
that influence y. Thus, the projection of the covariates x onto the row space of A represents the
interpretable component of the latent variables ζ. The term B_u ζ_u in (1.2) represents, in some sense,
the effects of those phenomena that continue to remain unobserved despite the incorporation of the
covariates x.
Motivated by this discussion, we fit observations of (y, x) ∈ R^p × R^q to the following composite
factor model that incorporates the effects of the covariates x as well as of additional unobserved
latent phenomena on y:

    y = Ax + B_u ζ_u + ε̄    (1.3)
where A ∈ R^(p×q) with rank(A) ≪ min{p, q}, B_u ∈ R^(p×k_u) with k_u ≪ p, and the variables ζ_u, ε̄ are
independent of each other (and of x) and normally distributed as ζ_u ∼ N(0, Σ_{ζ_u}), ε̄ ∼ N(0, Σ_ε̄),
with Σ_{ζ_u} ≻ 0, Σ_ε̄ ≻ 0, and Σ_ε̄ being a diagonal matrix. The matrix A may also be viewed as the
map specifying the best linear estimate of y based on x. In other words, the goal is to identify
a low-rank matrix A such that the conditional distribution of y|x (and equivalently of y|Ax) is
specified by a standard factor model of the form (1.1).
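The characterization of A as the best linear estimator of y from x can be checked numerically. The sketch below is our own illustration and not code from the paper; the dimensions, identity covariances, and random seed are arbitrary choices made only so that the snippet is self-contained.

```python
# Simulate the composite factor model (1.3) and compare A with the empirical
# best linear map Sigma_yx Sigma_x^{-1} (illustrative sketch; all parameters arbitrary).
import numpy as np

rng = np.random.default_rng(0)
p, q, k_x, k_u, n = 40, 10, 2, 2, 200000

A = rng.standard_normal((p, k_x)) @ rng.standard_normal((k_x, q))   # low-rank A
B_u = rng.standard_normal((p, k_u))

x = rng.standard_normal((n, q))          # x ~ N(0, I_q)
zeta_u = rng.standard_normal((n, k_u))   # residual latent variables
eps_bar = rng.standard_normal((n, p))    # noise with diagonal covariance (here I_p)
y = x @ A.T + zeta_u @ B_u.T + eps_bar   # y = A x + B_u zeta_u + eps_bar

Sigma_yx = (y.T @ x) / n                 # empirical cross-covariance
Sigma_x = (x.T @ x) / n
A_hat = Sigma_yx @ np.linalg.inv(Sigma_x)
print(np.linalg.norm(A_hat - A) / np.linalg.norm(A))   # small for large n
```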
1.2 Composite Factor Modeling via Convex Optimization
Next we describe techniques to fit observations of (y, x) ∈ R^p × R^q to the model (1.3). This method is a key
subroutine in our algorithmic approach for associating semantics to the latent variables in a factor
model (see Section 1.3 for a high-level discussion of our approach and Section 3 for a more detailed
experimental demonstration). Fitting observations of (y, x) ∈ R^p × R^q to the composite factor model
(1.3) is accomplished by identifying a Gaussian model over (y, x) with the covariance matrix of the
model satisfying certain algebraic properties. For background on multivariate Gaussian statistical
models, we refer the reader to [9].
The covariance matrix of y in the factor model is decomposable as the sum of a low-rank matrix
B Σ_ζ B^T (corresponding to the k ≪ p latent variables ζ) and a diagonal matrix Σ_ε. Based on this
algebraic structure, a natural approach to factor modeling is to find the smallest rank (positive
semidefinite) matrix such that the difference between this matrix and the empirical covariance of
the observations of y is close to being a diagonal matrix (according to some measure of closeness,
such as in the Frobenius norm). This problem is computationally intractable to solve in general
due to the rank minimization objective [13]. As a result, a common heuristic is to replace the
matrix rank by the trace functional, which results in the minimum-trace factor analysis problem
[10, 17, 18, 19]; this problem is convex and it can be solved efficiently. The use of the trace of a
positive semidefinite matrix as a surrogate for the matrix rank goes back many decades, and this
topic has received much renewed interest over the past several years [12, 7, 15, 3].
In attempting to generalize the minimum-trace factor analysis approach to the composite factor
model, one encounters a difficulty that arises due to the parametrization of the underlying Gaussian
model in terms of covariance matrices. Specifically, with the additional covariates x ∈ R^q in the
composite model (1.3), our objective is to identify a Gaussian model over (y, x) ∈ R^p × R^q with
the joint covariance Σ = [[Σ_y, Σ_yx], [Σ_yx^T, Σ_x]] ∈ S^(p+q) satisfying certain structural properties. One of these
properties is that the conditional distribution of y|x is specified by a factor model, which implies
that the conditional covariance of y|x must be decomposable as the sum of a low-rank matrix
and a diagonal matrix. However, this conditional covariance is given by the Schur complement
Σ_y − Σ_yx Σ_x^(-1) Σ_yx^T, and specifying a constraint on the conditional covariance matrix in terms of the
joint covariance matrix Σ presents an obstacle to obtaining computationally tractable optimization
formulations.
A more convenient approach to parameterizing conditional distributions in Gaussian models is
to consider models specified in terms of inverse covariance matrices, which are also called precision
matrices. Specifically, the algebraic properties that we desire in the joint covariance matrix Σ of
(y, x) in a composite factor model can also be stated in terms of the joint precision matrix Θ = Σ^(-1)
via conditions on the submatrices of Θ = [[Θ_y, Θ_yx], [Θ_yx^T, Θ_x]]. First, the precision matrix of the conditional
distribution of y|x is specified by the submatrix Θ_y; as the covariance matrix of the conditional
distribution of y|x is the sum of a diagonal matrix and a low-rank matrix, the Woodbury matrix
identity implies that the submatrix Θ_y is the difference of a diagonal matrix and a low-rank matrix.
Second, the rank of the submatrix Θ_yx ∈ R^(p×q) is equal to the rank of A ∈ R^(p×q) in non-degenerate
models (i.e., if Σ ≻ 0) because the relation between A and Θ is given by A = −[Θ_y]^(-1) Θ_yx. Based
on this algebraic structure desired in Θ, we propose the following natural convex relaxation for
fitting a collection of observations D^n_+ = {(y^(i), x^(i))}_{i=1}^n ⊂ R^(p+q) to the composite model (1.3):
    (Θ̂, D̂_y, L̂_y) =  argmin_{Θ ∈ S^(p+q), Θ ≻ 0;  D_y, L_y ∈ S^p}   −ℓ(Θ; D^n_+) + λ_n [ γ ‖Θ_yx‖_⋆ + trace(L_y) ]
                       s.t.  Θ_y = D_y − L_y,   L_y ⪰ 0,   D_y is diagonal.                                  (1.4)
The term ℓ(Θ; D^n_+) is the Gaussian log-likelihood function that enforces fidelity to the data, and it
is given as follows (up to some additive and multiplicative terms):

    ℓ(Θ; D^n_+) = log det(Θ) − trace[ Θ · (1/n) Σ_{i=1}^n (y^(i), x^(i)) (y^(i), x^(i))^T ].    (1.5)
This function is concave as a function of the joint precision matrix² Θ. The matrices D_y, L_y
represent the diagonal and low-rank components of Θ_y. As with the idea behind minimum-trace
factor analysis, the role of the trace norm penalty on L_y is to induce low-rank structure in this
matrix. Based on a more recent line of work originating with the thesis of Fazel [7, 15, 3], the
nuclear norm penalty ‖Θ_yx‖_⋆ on the submatrix Θ_yx (which is in general a non-square matrix) is
useful for promoting low-rank structure in that submatrix of Θ. The parameter γ provides a tradeoff
between the observed/interpretable and the unobserved parts of the composite factor model (1.3),
and the parameter λ_n provides a tradeoff between the fidelity of the model to the data and the
overall complexity of the model (the total number of observed and unobserved components in the
composite model (1.3)). In summary, for λ_n ≥ 0 the regularized maximum-likelihood problem
(1.4) is a convex program. From the optimal solution (Θ̂, D̂_y, L̂_y) of (1.4), we can obtain estimates
for the parameters of the composite factor model (1.3) as follows:
    Â = −[Θ̂_y]^(-1) Θ̂_yx,
    B̂_u = any square root of (D̂_y − L̂_y)^(-1) − D̂_y^(-1) such that B̂_u ∈ R^(p×rank(L̂_y)),    (1.6)

with the covariance of ζ_u being the identity matrix of appropriate dimensions and the covariance
of ε̄ being D̂_y^(-1). The convex program (1.4) is a log-determinant semidefinite program that can be
solved efficiently using existing numerical solvers such as the LogDetPPA package [21].

²An additional virtue of parameterizing our problem in terms of precision matrices rather than in terms of
covariance matrices is that the log-likelihood function in Gaussian models is not concave over the cone of positive
semidefinite matrices when viewed as a function of the covariance matrix.
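To make the formulation concrete, the following sketch expresses (1.4) in a generic convex modeling framework. This is our own illustration under stated assumptions: it uses the cvxpy package rather than LogDetPPA (which is what the paper actually employs), the function name and its inputs are hypothetical, and no attempt is made at the scalability of a specialized solver.

```python
# Illustrative sketch of the regularized maximum-likelihood problem (1.4) using cvxpy
# (an assumption on our part; the paper solves (1.4) with the LogDetPPA package).
import cvxpy as cp
import numpy as np

def composite_factor_fit(Y, X, lambda_n, gamma):
    """Y: n-by-p response samples, X: n-by-q covariate samples (hypothetical helper)."""
    n, p = Y.shape
    q = X.shape[1]
    Z = np.hstack([Y, X])
    S = Z.T @ Z / n                      # empirical second-moment matrix of (y, x)

    Theta = cp.Variable((p + q, p + q), PSD=True)
    d_y = cp.Variable(p)                 # diagonal entries of D_y
    L_y = cp.Variable((p, p), PSD=True)

    # Negative log-likelihood (up to constants), as in (1.5).
    neg_loglik = -cp.log_det(Theta) + cp.trace(Theta @ S)
    penalty = lambda_n * (gamma * cp.normNuc(Theta[:p, p:]) + cp.trace(L_y))

    constraints = [Theta[:p, :p] == cp.diag(d_y) - L_y]   # Theta_y = D_y - L_y
    prob = cp.Problem(cp.Minimize(neg_loglik + penalty), constraints)
    prob.solve()
    return Theta.value, np.diag(d_y.value), L_y.value
```

The estimates (1.6) can then be read off from the returned triplet.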
1.3 Algorithmic Approach for Interpreting Latent Variables in a Factor Model
Our discussion has led us to a natural (meta-) procedure for interpreting latent variables in a factor
model. Suppose that we are given a factor model underlying y ∈ R^p. The analyst proceeds
by obtaining simultaneous measurements of the variables y as well as some additional covariates
x ∈ R^q of plausibly relevant phenomena. Based on these joint observations, we identify a suitable
composite factor model (1.3) via the convex program (1.4). In particular, we sweep over the
parameters λ_n in (1.4) to identify composite models that achieve a suitable decomposition – in
terms of effects attributable to the additional covariates x and of effects corresponding to remaining
unobserved phenomena – of the effects of the latent variables in the factor model given as input.
To make this approach more formal, consider a composite factor model (1.3) y = Ax + B_u ζ_u + ε̄
underlying a pair of random vectors (y, x) ∈ R^p × R^q, with rank(A) = k_x, B_u ∈ R^(p×k_u), and
column-space(A) ∩ column-space(B_u) = {0}. As described in Section 1.2, the algebraic aspects of
the underlying composite factor model translate to algebraic properties of submatrices of Θ ∈ S^(p+q).
In particular, the submatrix Θ_yx has rank equal to k_x and the submatrix Θ_y is decomposable as
D_y − L_y with D_y being diagonal and L_y ⪰ 0 having rank equal to k_u. Finally, the transversality
of column-space(A) and column-space(B_u) translates to the fact that column-space(Θ_yx) and
column-space(L_y) have a transverse intersection, i.e., column-space(Θ_yx) ∩ column-space(L_y) = {0}.
One can check that the factor model underlying the random vector y ∈ R^p that is induced upon
marginalization of x is specified by the precision matrix of y given by Θ̃_y = D_y − [L_y + Θ_yx (Θ_x)^(-1) Θ_xy].
Here, the matrix L_y + Θ_yx (Θ_x)^(-1) Θ_xy is a rank k_x + k_u matrix that captures the effect of latent
variables in the factor model. This effect is decomposed into Θ_yx (Θ_x)^(-1) Θ_xy – a rank k_x matrix
representing the component of this effect attributed to x, and L_y – a matrix of rank k_u representing
the effect attributed to residual latent variables.
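This decomposition can be verified directly from the block structure of the joint precision matrix. The following short sketch is our own illustration (with arbitrary dimensions and identity covariances); it checks that the marginal precision of y equals the Schur complement Θ_y − Θ_yx Θ_x^(-1) Θ_xy, and that the component attributable to x has rank k_x.

```python
# Numerical check of the marginalization identity (illustrative, arbitrary dimensions).
import numpy as np

rng = np.random.default_rng(1)
p, q, k_x, k_u = 8, 3, 2, 2

A = rng.standard_normal((p, k_x)) @ rng.standard_normal((k_x, q))
B_u = rng.standard_normal((p, k_u))

# Joint covariance of (y, x) when x ~ N(0, I), zeta_u ~ N(0, I), eps_bar ~ N(0, I).
Sigma = np.block([[A @ A.T + B_u @ B_u.T + np.eye(p), A],
                  [A.T, np.eye(q)]])
Theta = np.linalg.inv(Sigma)
Theta_y, Theta_yx, Theta_x = Theta[:p, :p], Theta[:p, p:], Theta[p:, p:]

marginal_precision = np.linalg.inv(Sigma[:p, :p])                  # precision of y alone
schur = Theta_y - Theta_yx @ np.linalg.inv(Theta_x) @ Theta_yx.T
print(np.allclose(marginal_precision, schur))                      # True

low_rank_part = Theta_yx @ np.linalg.inv(Theta_x) @ Theta_yx.T     # part attributed to x
print(np.linalg.matrix_rank(low_rank_part))                        # k_x
```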
These observations motivate the following algorithmic procedure. Suppose we are given a factor
model that specifies the precision matrix of y as the difference D̃̂_y − L̃̂_y, where D̃̂_y is diagonal
and L̃̂_y is low rank. Then the composite factor model of (y, x) with estimates (Θ̂, D̂_y, L̂_y) offers
an interpretation of the latent variables of the given factor model if (i) rank(L̃̂_y) = rank(L̂_y +
Θ̂_yx Θ̂_x^(-1) Θ̂_xy), (ii) column-space(Θ̂_yx) ∩ column-space(L̂_y) = {0}, and
(iii) max{ ‖D̃̂_y − D̂_y‖_2 / ‖D̃̂_y‖_2, ‖L̃̂_y − [L̂_y + Θ̂_yx Θ̂_x^(-1) Θ̂_xy]‖_2 / ‖L̃̂_y‖_2 } is small. The full algorithmic
procedure for attributing meaning to latent variables of a factor model is outlined below:
Algorithm 1 Interpreting Latent Variables in a Factor Model
1: Input: A collection of observations D^n_+ = {(y^(i), x^(i))}_{i=1}^n ⊂ R^p × R^q of the variables y and of
   some auxiliary covariates x; a factor model with parameters (D̃̂_y, L̃̂_y).
2: Composite Factor Modeling: For each d = 1, ..., q, sweep over the parameters λ_n in
   the convex program (1.4) (with D^n_+ as input) to identify composite models with estimates
   (Θ̂, D̂_y, L̂_y) that satisfy the following three properties: (i) rank(Θ̂_yx) = d, (ii) rank(L̃̂_y) =
   rank(L̂_y) + rank(Θ̂_yx), and (iii) rank(L̃̂_y) = rank(L̂_y + Θ̂_yx Θ̂_x^(-1) Θ̂_xy).
3: Identifying Subspace: For each d = 1, ..., q and among the candidate composite models
   (from the previous step), choose the composite factor model that minimizes the quantity
   max{ ‖D̃̂_y − D̂_y‖_2 / ‖D̃̂_y‖_2, ‖L̃̂_y − [L̂_y + Θ̂_yx Θ̂_x^(-1) Θ̂_xy]‖_2 / ‖L̃̂_y‖_2 }.
4: Output: For each d = 1, ..., q, the d-dimensional projection of x onto the row-space of Θ̂_yx
   represents the interpretable component of the latent variables in the factor model.
The effectiveness of Algorithm 1 is dependent on the size of the quantity max{ ‖D̃̂_y − D̂_y‖_2 / ‖D̃̂_y‖_2,
‖L̃̂_y − [L̂_y + Θ̂_yx Θ̂_x^(-1) Θ̂_xy]‖_2 / ‖L̃̂_y‖_2 }. The smaller this quantity, the better the composite factor
model fits the given factor model. Finally, recall from Section 1.1 that the projection of the covariates x
onto the row-space of A (from the composite model (1.3)) represents the interpretable component of
the latent variables of the factor model. Because of the relation A = −[Θ_y]^(-1) Θ_yx, this interpretable
component is obtained by projecting the covariates x onto the row-space of Θ_yx. This observation
explains the final step of Algorithm 1.
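As a concrete illustration of how the checks in steps 2-3 of Algorithm 1 might be carried out, the sketch below is our own helper (the function name is hypothetical, and in practice the ranks would be computed with a problem-dependent numerical tolerance); it evaluates the rank conditions and the deviation measure for a single candidate estimate.

```python
# Rank checks (step 2, conditions (ii)-(iii)) and deviation measure (step 3) for one
# candidate composite-model estimate; condition (i), rank(Theta_yx) = d, depends on the
# sweep index d and is omitted here.  Illustrative sketch only.
import numpy as np

def deviation_and_checks(Theta_hat, D_hat, L_hat, D_tilde, L_tilde, p):
    """Theta_hat: estimated joint precision of (y, x); D_hat, L_hat: its diagonal/low-rank
    parts; D_tilde, L_tilde: parameters of the input factor model."""
    Theta_yx = Theta_hat[:p, p:]
    Theta_x = Theta_hat[p:, p:]
    M = L_hat + Theta_yx @ np.linalg.inv(Theta_x) @ Theta_yx.T

    rank = np.linalg.matrix_rank
    checks = (rank(L_tilde) == rank(L_hat) + rank(Theta_yx)) and (rank(L_tilde) == rank(M))

    deviation = max(np.linalg.norm(D_tilde - D_hat, 2) / np.linalg.norm(D_tilde, 2),
                    np.linalg.norm(L_tilde - M, 2) / np.linalg.norm(L_tilde, 2))
    return checks, deviation
```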
The input to Algorithm 1 is a factor model underlying a collection of variables y ∈ R^p, and the
algorithm proceeds to obtain a semantic interpretation of the latent variables of the factor model.
However, in many situations, a factor model underlying y ∈ R^p may not be available in advance,
and must be learned in a data-driven fashion based on observations of y ∈ R^p. In our experiments
(see Section 3), we learn a factor model using a specialization of the convex program (1.4). It
is reasonable to ask whether one might directly fit a composite model to the covariates and
responses jointly without reference to the underlying factor model based on the responses. However,
in our experience with applications, it is often the case that observations of the responses y are
much more plentiful than joint observations of the responses y and covariates x. As an example,
consider a setting in which the responses are a collection of financial asset prices (such as stock
return values); observations of these variables are available at a very fine time-resolution on the
order of seconds. On the other hand, some potentially useful covariates such as GDP, government
expenditures, federal debt, and consumer rate are available at a much coarser scale (usually on
the order of months or quarters). As another example, consider a setting in which the responses
are reservoir volumes of California; observations of these variables are available at a daily scale.
On the other hand, reasonable covariates that one may wish to associate to the latent variables
underlying California reservoir volumes such as agricultural production, crop yield rate, average
income, and population growth rate are available at a much coarser time scale (e.g. monthly).
In such settings, the analyst can utilize the more abundant set of observations of the responses
y to learn an accurate factor model first. Subsequently, one can employ our approach to associate
semantics to the latent variables in this factor model based on the potentially limited number of
observations of the responses y and the covariates x.
1.4 Our Results
In Section 2 we carry out a theoretical analysis to investigate whether the framework outlined in
Algorithm 1 can succeed. We discuss a model problem setup, which serves as the basis for the main
theoretical result in Section 2. Suppose we have Gaussian random vectors (y, x) ∈ R^p × R^q that are
related to each other via a composite factor model (1.3). Note that this composite factor model
induces a factor model underlying the variables y ∈ R^p upon marginalization of the covariates x. In
the subsequent discussion, we assume that the factor model that is supplied as input to Algorithm
1 is the factor model underlying the responses y.
Now we consider the following question: Given observations jointly of (y, x) ∈ R^(p+q), does the
convex relaxation (1.4) (for suitable choices of regularization parameters λ_n) estimate the composite
factor model underlying these two random vectors accurately? An affirmative answer to this
question demonstrates the success of Algorithm 1. In particular, a positive answer to this question
implies that we can decompose the effects of the latent variables in the factor model underlying y
using the convex relaxation (1.4), as the accurate estimation of the composite model underlying
(y, x) implies a successful decomposition of the effects of the latent variables in the factor model
underlying y. That is, steps 2-3 of Algorithm 1 are successful. In Section 2, we show that under
suitable identifiability conditions on the population model of the joint random vector (y, x),
the convex program (1.4) succeeds in this task. Our analysis is carried out in a high-
dimensional asymptotic scaling regime in which the dimensions p, q, the number of observations n,
and other model parameters may all grow simultaneously [2, 23].
We give a concrete demonstration of Algorithm 1 with experiments on synthetic data and real-
world financial data. For the financial asset problem, we consider as our variables y the monthly
averaged stock prices of 45 companies from the Standard and Poor index over the period March
1982 to March 2016, and we identify a factor model (1.1) over y with 10 latent variables (the
approach we use to fit a factor model is described in Section 3). We then obtain observations
of q = 13 covariates on quantities related to oil trade, GDP, government expenditures, etc. (see
Section 3 for the full list), as these plausibly influence stock returns. Following the steps outlined
in Algorithm 1, we use the convex program (1.4) to identify a two-dimensional projection of these
13 covariates that represents an interpretable component of the 10 latent variables in the factor
model, as well as a remaining set of 8 latent variables that constitute phenomena not observed
via the covariates x. In further analyzing the characteristics of the two-dimensional projection, we
find that the EUR to USD exchange rate and government expenditures are the most relevant of the 13
covariates considered in our experiment, while mortgage rate and oil imports are less useful. See
Section 3 for complete details.
1.5 Related Work
Elements of our approach bear some similarity with
canonical correlations analysis
[8], which is a
classical technique for identifying relationships between two sets of variables. In particular, for a
pair of jointly Gaussian random vectors (
y,x
)
R
p
×
q
, canonical correlations analysis may be used
as a technique for identifying the most relevant component(s) of
x
that influence
y
. However, the
composite factor model (1.3) allows for the effect of further unobserved phenomena not captured via
observations of the covariates
x
. Consequently, our approach in some sense incorporates elements
of both canonical correlations analysis and factor analysis. It is important to note that algorithms
for factor analysis and for canonical correlations analysis usually operate on covariance and cross-
covariance matrices. However, we parametrize our regularized maximum-likelihood problem (1.4)
in terms of precision matrices, which is a crucial ingredient in leading to a computationally tractable
convex program.
The nuclear-norm heuristic has been employed widely over the past several years in a range of
statistical modeling tasks involving rank minimization problems; see [23] and the references therein.
The proof of our main result in Section 2 incorporates some elements from the theoretical analyses
in these previous papers, along with the introduction of some new ingredients. We give specific
pointers to the relevant literature in Section 4.
1.6 Notation
Given a matrix U ∈ R^(p1×p2), the norm ‖U‖_2 denotes the spectral norm (the largest singular
value of U). We define the linear operator F : S^p × S^p × R^(p×q) × S^q → S^(p+q) and its adjoint
F† : S^(p+q) → S^p × S^p × R^(p×q) × S^q as follows:

    F(M, N, K, O) ≜ [[M − N, K], [K^T, O]],    F†([[Q, K], [K^T, O]]) ≜ (Q, −Q, K, O).    (1.7)

Similarly, we define the linear operator G : S^p × R^(p×q) → S^(p+q) and its adjoint G† : S^(p+q) →
S^p × R^(p×q) as follows:

    G(M, K) ≜ [[M, K], [K^T, 0]],    G†([[Q, K], [K^T, O]]) ≜ (Q, K).    (1.8)

Finally, for any subspace H, the projection onto the subspace is denoted by P_H.
2 Theoretical Results
In this section, we state a theorem establishing the consistency of the convex program (1.4). This theorem
requires assumptions on the population precision matrix, which are discussed in Section 2.2. We
provide examples of population composite factor models (1.3) that satisfy these conditions. The
theorem statement is given in Section 2.4 and the proof of the theorem is given in Section 4 with
some details deferred to the appendix.
2.1 Technical Setup
As discussed in Section 1.4, our theorems are premised on the existence of a population composite
factor model (1.3) y = A*x + B*_u ζ_u + ε̄ underlying a pair of random vectors (y, x) ∈ R^p × R^q, with
rank(A*) = k_x, B*_u ∈ R^(p×k_u), and column-space(A*) ∩ column-space(B*_u) = {0}. As the convex
relaxation (1.4) is solved in the precision matrix parametrization, the conditions for our theorems
are more naturally stated in terms of the joint precision matrix Θ* ∈ S^(p+q), Θ* ≻ 0 of (y, x). The
algebraic aspects of the parameters underlying the factor model translate to algebraic properties
of submatrices of Θ*. In particular, the submatrix Θ*_yx has rank equal to k_x, and the submatrix
Θ*_y is decomposable as D*_y − L*_y with D*_y being diagonal and L*_y ⪰ 0 having rank equal to k_u.
Finally, the transversality of column-space(A*) and column-space(B*_u) translates to the fact that
column-space(Θ*_yx) and column-space(L*_y) have a transverse intersection, i.e., column-space(Θ*_yx) ∩
column-space(L*_y) = {0}.
To address the requirements raised in Section 1.4, we seek an estimate (Θ̂, D̂_y, L̂_y) from the
convex relaxation (1.4) such that rank(Θ̂_yx) = rank(Θ*_yx), rank(L̂_y) = rank(L*_y), and that ‖Θ̂ − Θ*‖_2
is small. Building on both classical statistical estimation theory [1] as well as the recent literature
on high-dimensional statistical inference [2, 23], a natural set of conditions for obtaining accurate
parameter estimates is to assume that the curvature of the likelihood function at Θ* is bounded in
certain directions. This curvature is governed by the Fisher information at Θ*:

    I* ≜ Θ*^(-1) ⊗ Θ*^(-1) = Σ* ⊗ Σ*.

Here ⊗ denotes a tensor product between matrices and I* may be viewed as a map from S^(p+q)
to S^(p+q). We impose conditions requiring that I* is well-behaved when applied to matrices of the form

    Θ − Θ* = [[ (D_y − D*_y) − (L_y − L*_y),  Θ_yx − Θ*_yx ], [ (Θ_yx − Θ*_yx)^T,  Θ_x − Θ*_x ]],

where (L_y, Θ_yx) are in a neighborhood of (L*_y, Θ*_yx) restricted to sets of low-rank matrices.
These local properties of I* around Θ* are conveniently stated in terms of tangent spaces to the
algebraic varieties of low-rank matrices. In particular, the tangent space at a rank-r matrix
N ∈ R^(p1×p2) with respect to the algebraic variety of p1 × p2 matrices with rank less than or equal
to r is given by³:

    T(N) ≜ { N_R + N_C | N_R, N_C ∈ R^(p1×p2),
             row-space(N_R) ⊆ row-space(N), column-space(N_C) ⊆ column-space(N) }.

³We also consider the tangent space at a symmetric low-rank matrix with respect to the algebraic variety of
symmetric low-rank matrices. We use the same notation 'T' to denote tangent spaces in both the symmetric and
non-symmetric cases, and the appropriate tangent space is clear from the context.

In the next section, we describe conditions on the population Fisher information I* in terms of
the tangent spaces T(L*_y) and T(Θ*_yx); under these conditions, we present a theorem in Section 2.4
showing that the convex program (1.4) obtains accurate estimates.
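For numerical work involving these tangent spaces (for instance, when evaluating the quantities defined in the next section), it is convenient to have the orthogonal projection onto T(N) in closed form. The sketch below is our own illustration; the projection formula P_T(M) = P_U M + M P_V − P_U M P_V, with P_U and P_V the projectors onto the column and row spaces of N, is the standard one for tangent spaces to the low-rank variety and is not spelled out in the text.

```python
# Orthogonal projection onto the tangent space T(N) (illustrative sketch).
import numpy as np

def tangent_projection(N, M, r):
    """Project M onto T(N), where N has rank r."""
    U, _, Vt = np.linalg.svd(N)
    P_U = U[:, :r] @ U[:, :r].T          # projector onto column-space(N)
    P_V = Vt[:r, :].T @ Vt[:r, :]        # projector onto row-space(N)
    return P_U @ M + M @ P_V - P_U @ M @ P_V

rng = np.random.default_rng(2)
p1, p2, r = 6, 5, 2
N = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))    # rank-r matrix
M = rng.standard_normal((p1, p2))

PM = tangent_projection(N, M, r)
print(np.allclose(tangent_projection(N, PM, r), PM))   # idempotent
print(np.allclose(tangent_projection(N, N, r), N))     # N lies in its own tangent space
```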
2.2 Fisher Information Conditions
Given a norm ‖·‖_Υ on S^p × S^p × R^(p×q) × S^q, we first consider a classical condition in the statistical
estimation literature, which is to control the minimum gain of the Fisher information I* restricted
to a subspace H ⊂ S^p × S^p × R^(p×q) × S^q as follows:

    χ(H, ‖·‖_Υ) ≜ min_{Z ∈ H, ‖Z‖_Υ = 1} ‖P_H F† I* F P_H (Z)‖_Υ,    (2.1)

where P_H denotes the projection operator onto the subspace H and the linear maps F and F† are
defined in (1.7). The quantity χ(H, ‖·‖_Υ) being large ensures that the Fisher information I* is
well-conditioned restricted to the image F(H) ⊂ S^(p+q). The remaining conditions that we impose on I*
are in the spirit of irrepresentability-type conditions [11, 24, 22, 5, 2] that are frequently employed
in high-dimensional estimation. In the subsequent discussion, we employ the following notation to
denote restrictions of a subspace H = H_1 × H_2 × H_3 × H_4 ⊂ S^p × S^p × R^(p×q) × S^q (here H_1, H_2, H_3, H_4
are subspaces in S^p, S^p, R^(p×q), S^q, respectively) to its individual components. The restriction to the
second component of H is given by H[2] = H_2. The restriction to the second and third components
of H is given by H[2,3] = H_2 × H_3 ⊂ S^p × R^(p×q). Given a norm ‖·‖_Π on S^p × R^(p×q), we control the
gain of I* restricted to H[2,3]:

    Ξ(H, ‖·‖_Π) ≜ min_{Z ∈ H[2,3], ‖Z‖_Π = 1} ‖P_{H[2,3]} G† I* G P_{H[2,3]} (Z)‖_Π.    (2.2)

Here, the linear maps G and G† are defined in (1.8). In the spirit of irrepresentability conditions, we
control the inner-product between elements in G(H[2,3]) and G(H[2,3]^⊥), as quantified by the metric
induced by I*, via the following quantity:

    φ(H, ‖·‖_Π) ≜ max_{Z ∈ H[2,3], ‖Z‖_Π = 1} ‖P_{H[2,3]^⊥} G† I* G P_{H[2,3]} (P_{H[2,3]} G† I* G P_{H[2,3]})^(-1) (Z)‖_Π.    (2.3)

The operator (P_{H[2,3]} G† I* G P_{H[2,3]})^(-1) in (2.3) is well-defined if Ξ(H) > 0, since this latter condition
implies that I* is injective restricted to G(H[2,3]). The quantity φ(H, ‖·‖_Π) being small implies that
any element of G(H[2,3]) and any element of G(H[2,3]^⊥) have a small inner-product (in the metric
induced by I*). The reason that we restrict this inner product to the second and third components
of H in the quantity φ(H, ‖·‖_Π) is that the regularization terms in the convex program (1.4) are
only applied to the matrices L_y and Θ_yx.
A natural approach to controlling the conditioning of the Fisher information around Θ* is to
bound the quantities χ(H*, ‖·‖_Υ), Ξ(H*, ‖·‖_Π), and φ(H*, ‖·‖_Π) for H* = W × T(L*_y) × T(Θ*_yx) × S^q,
where W ⊂ S^p is the set of diagonal matrices. However, a complication that arises with this
approach is that the varieties of low-rank matrices are locally curved around L*_y and around Θ*_yx.
Consequently, the tangent spaces at points in neighborhoods around L*_y and around Θ*_yx are not
the same as T(L*_y) and T(Θ*_yx). In order to account for this curvature underlying the varieties of
low-rank matrices, we bound the distance between nearby tangent spaces via the following induced
norm:

    ρ(T_1, T_2) ≜ max_{‖N‖_2 ≤ 1} ‖(P_{T_1} − P_{T_2})(N)‖_2.

The quantity ρ(T_1, T_2) measures the largest angle between T_1 and T_2. Using this approach for
bounding nearby tangent spaces, we consider subspaces H = W × T_y × T_yx × S^q for all T_y close to
T(L*_y) and for all T_yx close to T(Θ*_yx), as measured by ρ [2]. For ω_y ∈ (0, 1) and ω_yx ∈ (0, 1), we
bound χ(H, ‖·‖_Υ), Ξ(H, ‖·‖_Π), and φ(H, ‖·‖_Π) in the sequel for all subspaces H in the following
set:

    U(ω_y, ω_yx) ≜ { W × T_y × T_yx × S^q | ρ(T_y, T(L*_y)) ≤ ω_y, ρ(T_yx, T(Θ*_yx)) ≤ ω_yx }.    (2.4)
We control the quantities Ξ(H, ‖·‖_Π) and φ(H, ‖·‖_Π) using the dual norm of the regularizer
trace(L_y) + γ ‖Θ_yx‖_⋆ in (1.4):

    Γ_γ(L_y, Θ_yx) ≜ max{ ‖L_y‖_2, ‖Θ_yx‖_2 / γ }.    (2.5)

Furthermore, we control the quantity χ(H, ‖·‖_Υ) using a slight variant of the dual norm:

    Φ_γ(D_y, L_y, Θ_yx, Θ_x) ≜ max{ ‖D_y‖_2, ‖L_y‖_2, ‖Θ_yx‖_2 / γ, ‖Θ_x‖_2 }.    (2.6)

As the dual norm max{ ‖L_y‖_2, ‖Θ_yx‖_2 / γ } of the regularizer in (1.4) plays a central role in the
optimality conditions of (1.4), controlling the quantities χ(H, Φ_γ), Ξ(H, Γ_γ), and φ(H, Γ_γ) leads to
a natural set of conditions that guarantee the consistency of the estimates produced by (1.4). In
summary, given a fixed set of parameters (γ, ω_y, ω_yx) ∈ R_+ × (0, 1) × (0, 1), we assume that I*
satisfies the following conditions:
    Assumption 1:  inf_{H ∈ U(ω_y, ω_yx)} χ(H, Φ_γ) ≥ α,  for some α > 0    (2.7)

    Assumption 2:  inf_{H ∈ U(ω_y, ω_yx)} Ξ(H, Γ_γ) > 0    (2.8)

    Assumption 3:  sup_{H ∈ U(ω_y, ω_yx)} φ(H, Γ_γ) ≤ 1 − 2/(β + 1),  for some β ≥ 2.    (2.9)

For fixed (γ, ω_y, ω_yx), a larger value of α and a smaller value of β in these assumptions lead to a
better conditioned I*.
Assumptions 1, 2, and 3 are analogous to conditions that play an important role in the analysis
of the Lasso for sparse linear regression, graphical model selection via the Graphical Lasso [5], and
in several other approaches for high-dimensional estimation. As a point of comparison with respect
to analyses of the Lasso, the role of the Fisher information I* is played by A^T A, where A is the
underlying design matrix. In analyses of both the Lasso and the Graphical Lasso in the papers
referenced above, the analog of the subspace H is the set of models with support contained inside
the support of the underlying sparse population model. Assumptions 1, 2, and 3 are also similar
in spirit to conditions employed in the analysis of convex relaxation methods for latent-variable
graphical model selection [2].
2.3 When Do the Fisher Information Assumptions Hold?
In this section, we provide examples of composite models (1.3) that satisfy Assumptions 1, 2, and
3 in (2.7), (2.8), and (2.9) for some choices of α > 0, β ≥ 2, ω_y ∈ (0, 1), ω_yx ∈ (0, 1), and γ > 0.
Specifically, consider a population composite factor model y = A*x + B*_u ζ_u + ε̄, where A* ∈ R^(p×q)
with rank(A*) = k_x, B*_u ∈ R^(p×k_u), column-space(A*) ∩ column-space(B*_u) = {0}, and the random
variables ζ_u, ε̄, x are independent of each other and normally distributed as ζ_u ∼ N(0, Σ_{ζ_u}),
ε̄ ∼ N(0, Σ_ε̄). As described in Section 1.2, the properties of the composite factor model translate to
algebraic properties of the underlying precision matrix Θ* ∈ S^(p+q). Namely, the submatrix Θ*_yx
has rank equal to k_x and the submatrix Θ*_y is decomposable as D*_y − L*_y with D*_y being diagonal
and L*_y ⪰ 0 having rank equal to k_u. Recall that the factor model underlying the random vector
y ∈ R^p that is induced upon marginalization of x is specified by the precision matrix of y given
by Θ̃*_y = D*_y − [L*_y + Θ*_yx (Θ*_x)^(-1) Θ*_xy]. Here, L*_y + Θ*_yx (Θ*_x)^(-1) Θ*_xy represents the effect of the latent
variables in the underlying factor model. When learning a composite factor model, this effect is
decomposed into: Θ*_yx (Θ*_x)^(-1) Θ*_xy – a rank k_x matrix representing the component of this effect
attributed to x – and L*_y – a matrix of rank k_u representing the effect of residual latent variables.
There are two identifiability concerns that arise when learning a composite factor model. First, the
low-rank matrices L*_y and Θ*_yx (Θ*_x)^(-1) Θ*_xy must be distinguishable from the diagonal matrix D*_y.
Following previous literature on diagonal and low-rank matrix decompositions [16, 2], this task can
be achieved by ensuring that the column/row spaces of L*_y and Θ*_yx (Θ*_x)^(-1) Θ*_xy are incoherent with
respect to the standard basis. Specifically, given a subspace U ⊂ R^p, the coherence of the subspace
U is defined as:

    μ(U) = max_{i=1,2,...,p} ‖P_U(e_i)‖²_{ℓ2},

where P denotes a projection operation and e_i ∈ R^p denotes the i-th standard basis vector. It is
not difficult to show that this incoherence parameter satisfies the following inequality:

    dim(U)/p ≤ μ(U) ≤ 1.

A subspace U with small coherence is necessarily of small dimension and far from containing stan-
dard basis elements. As such, a symmetric matrix with incoherent row and column spaces is low-
rank and quite different from being a diagonal matrix. Consequently, we require that the quantities
μ(column-space(L*_y)) and μ(column-space(Θ*_yx (Θ*_x)^(-1) Θ*_xy)) are small⁴. The second identifiability
issue that arises is distinguishing the low-rank matrices L*_y and Θ*_yx (Θ*_x)^(-1) Θ*_xy from one another.
This task is made difficult when the row/column spaces of these matrices are nearly aligned. Thus,
we must ensure that the row/column spaces of L*_y and Θ*_yx (Θ*_x)^(-1) Θ*_xy are sufficiently transverse
(i.e. have large angles).

⁴We only need to control the coherence of the column spaces since these matrices are symmetric.

These identifiability issues directly translate to conditions on the population composite factor
model. Specifically, μ(column-space(L*_y)) and μ(column-space(Θ*_yx (Θ*_x)^(-1) Θ*_xy)) being small translates
to μ(column-space(A*)) and μ(column-space(B*_u)) being small. Such a condition has another
interpretation. It states that the effect of x and ζ_u must not concentrate on any one variable of y;
otherwise, this effect can be absorbed by the random variable ε̄ in (1.3). The second identifiability
assumption that the row/column spaces of L*_y and Θ*_yx (Θ*_x)^(-1) Θ*_xy have a large angle translates to
the angle between the column spaces of A* and B*_u being large. This assumption ensures that the
effects of x and ζ_u on y can be distinguished.
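Both identifiability quantities can be computed directly from the model parameters. The helper functions below are our own illustration (not code from the paper, and the numerical tolerance is an arbitrary choice); they evaluate the coherence μ of a column space and the smallest principal angle between two column spaces, which are the quantities reported for the stylized model that follows.

```python
# Coherence of a column space and smallest principal angle between two column spaces
# (illustrative helpers).
import numpy as np

def orth(M, tol=1e-10):
    """Orthonormal basis for column-space(M)."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, s > tol * s.max()]

def coherence(M):
    """mu(column-space(M)) = max_i ||P_U e_i||_2^2 (max diagonal entry of the projector)."""
    Q = orth(M)
    return np.max(np.diag(Q @ Q.T))

def smallest_angle_deg(M1, M2):
    """Smallest principal angle (in degrees) between the two column spaces."""
    s = np.linalg.svd(orth(M1).T @ orth(M2), compute_uv=False)
    return np.degrees(np.arccos(np.clip(s.max(), -1.0, 1.0)))
```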
Having these identifiability concerns in mind, we give a stylized composite factor model (1.3)
and check that the Fisher information Assumptions 1, 2, and 3 in (2.7), (2.8), and (2.9) are satisfied
for appropriate choices of parameters. Specifically, we let p = 60, q = 2, k_x = 1, and k_u = 1. We
let the random variables x ∈ R^q, ζ_u ∈ R^(k_u), ε̄ ∈ R^p be distributed according to x ∼ N(0, I_(q×q)),
ζ_u ∼ N(0, I_(k_u×k_u)), and ε̄ ∼ N(0, I_(p×p)). We generate matrices J ∈ R^(p×k_x), K ∈ R^(q×k_x) with i.i.d.
Gaussian entries, and let A* = JK^T. Similarly, we generate B*_u ∈ R^(p×k_u) with i.i.d. Gaussian
entries. We scale the matrices A* and B*_u to have spectral norm equal to 0.2. With this selection,
the smallest angle between the column spaces of A* and B*_u is 87 degrees. Furthermore, the
quantities μ(column-space(A*)) and μ(column-space(B*_u)) are 0.072 and 0.074, respectively. Under
this stylized setting, we numerically evaluate Assumptions 1, 2, and 3 in (2.7), (2.8), and (2.9) with
a Fisher information I* that takes the form:

    I* = [[ I + A*A*^T + B*_u B*_u^T,  A* ], [ A*^T,  I ]] ⊗ [[ I + A*A*^T + B*_u B*_u^T,  A* ], [ A*^T,  I ]].

We let ω_y = 0.03, ω_yx = 0.03 so that the largest angle between the pair of tangent spaces T_y, T(L*_y)
and the pair of tangent spaces T_yx, T(Θ*_yx) is less than 1.8 degrees. Letting γ ∈ (1, 1.4), one can
numerically check that inf_{H ∈ U(ω_y, ω_yx)} χ(H, Φ_γ) > 0.2, inf_{H ∈ U(ω_y, ω_yx)} Ξ(H, Γ_γ) > 0.4, and
sup_{H ∈ U(ω_y, ω_yx)} φ(H, Γ_γ) < 0.8. Thus, for ω_y = 0.03, ω_yx = 0.03, α = 0.2, β = 9, and γ ∈ (1, 1.4),
the Fisher information Assumptions 1, 2, and 3 in (2.7), (2.8), and (2.9) are satisfied.
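The quantities reported above can be reproduced in a few lines. The sketch below is our own illustration (the random seed and rank tolerances are arbitrary): it constructs the covariance of the stylized model, provides the Fisher information I* = Σ* ⊗ Σ* through its standard action M ↦ Σ* M Σ*, and confirms the claimed structure of the induced precision matrix.

```python
# Stylized model of Section 2.3: build Sigma*, expose I* as a map, and check the
# diagonal-minus-low-rank structure of Theta*_y and the rank of Theta*_yx (illustrative).
import numpy as np

rng = np.random.default_rng(3)
p, q, k_x, k_u = 60, 2, 1, 1

A = rng.standard_normal((p, k_x)) @ rng.standard_normal((k_x, q))
B_u = rng.standard_normal((p, k_u))
A *= 0.2 / np.linalg.norm(A, 2)          # scale to spectral norm 0.2
B_u *= 0.2 / np.linalg.norm(B_u, 2)

Sigma = np.block([[np.eye(p) + A @ A.T + B_u @ B_u.T, A],
                  [A.T, np.eye(q)]])

def fisher_apply(M):
    """I* = Sigma (x) Sigma acts on a matrix M in S^(p+q) as Sigma M Sigma."""
    return Sigma @ M @ Sigma

Theta = np.linalg.inv(Sigma)
print(np.linalg.matrix_rank(np.eye(p) - Theta[:p, :p], tol=1e-8))  # k_u (here D*_y = I)
print(np.linalg.matrix_rank(Theta[:p, p:], tol=1e-8))              # k_x
```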
2.4 Theorem Statement
We now describe the performance of the regularized maximum-likelihood program (1.4) under
suitable conditions on the quantities introduced in the previous section. Before formally stating
our main result, we introduce some notation. Let σ_y denote the minimum nonzero singular value of
L*_y and let σ_yx denote the minimum nonzero singular value of Θ*_yx. We state the theorem based on
the essential aspects of the conditions required for the success of our convex relaxation (i.e. the Fisher
information conditions) and omit complicated constants. We specify these constants in Section 4.
Theorem 2.1. Suppose that there exist α > 0, β ≥ 2, ω_y ∈ (0, 1), ω_yx ∈ (0, 1), and a choice
of the parameter γ so that the population Fisher information I* satisfies Assumptions 1, 2, and 3 in
(2.7), (2.8), and (2.9). Let m ≜ max{1, 1/γ} and m̄ ≜ max{1, γ}. Furthermore, suppose that the
following conditions hold:

    1. n ≳ [β² m⁶ / α²] (p + q)

    2. λ_n ∼ [β m² / α] √((p + q)/n)

    3. σ_y ≳ [β m⁴ / (α⁵ ω_y)] λ_n

    4. σ_yx ≳ [β m⁵ m̄² / (α⁵ ω_yx)] λ_n

Then with probability greater than 1 − 2 exp{ −C̃_prob [α² / (β² m⁴)] n λ_n² }, the optimal solution
(Θ̂, D̂_y, L̂_y) of (1.4) with i.i.d. observations D^n_+ = {y^(i), x^(i)}_{i=1}^n of (y, x) satisfies the following
properties:

    1. rank(L̂_y) = rank(L*_y), rank(Θ̂_yx) = rank(Θ*_yx)

    2. ‖D̂_y − D*_y‖_2 ≲ [m/α²] λ_n,  ‖L̂_y − L*_y‖_2 ≲ [m/α²] λ_n,
       ‖Θ̂_yx − Θ*_yx‖_2 ≲ [m m̄/α²] λ_n,  ‖Θ̂_x − Θ*_x‖_2 ≲ [m/α²] λ_n.
We outline the proof of Theorem 2.1 in Section 4. The quantities α, β, ω_y, ω_yx as well as the
choice of the parameter γ play a prominent role in the results of Theorem 2.1. Indeed, larger values
of α, ω_y, ω_yx and smaller values of β (leading to a better conditioned Fisher information even for
large distortions around the tangent spaces T(L*_y) and T(Θ*_yx)) lead to less stringent requirements on
the sample complexity, on the minimum nonzero singular value σ_y of L*_y, and on the minimum
nonzero singular value σ_yx of Θ*_yx.
3 Experimental Results
In this section, we demonstrate the utility of Algorithm 1 for interpreting latent variables in factor
models both with synthetic and real financial asset data.
3.1 Synthetic Simulations
We give experimental evidence for the utility of Algorithm 1 on synthetic examples. Specifically,
we generate a composite factor model (1.3) y = A*x + B*_u ζ_u + ε̄ as follows: we fix p = 40 and q = 10.
We let the random variables x ∈ R^q, ζ_u ∈ R^(k_u), ε̄ ∈ R^p be distributed according to x ∼ N(0, I_(q×q)),
ζ_u ∼ N(0, I_(k_u×k_u)), and ε̄ ∼ N(0, I_(p×p)). We generate matrices J ∈ R^(p×k_x), K ∈ R^(q×k_x) with i.i.d.
Gaussian entries, and let A* = JK^T. Similarly, we generate B*_u ∈ R^(p×k_u) with i.i.d. Gaussian
entries. This approach generates a factor model (1.1) with k = k_x + k_u. The composite factor
model translates to a joint precision matrix Θ*, with the submatrix Θ*_y = D*_y − L*_y where D*_y is
diagonal, rank(L*_y) = k_u, and rank(Θ*_yx) = k_x. We scale the matrices A* and B*_u to have spectral
norm equal to τ. The value τ is chosen to be as large as possible without the condition number of Θ*
exceeding 10 (this is imposed for the purposes of numerical conditioning). We obtain four models
with (k_x, k_u) = (1, 1), (k_x, k_u) = (2, 2), (k_x, k_u) = (4, 4), and (k_x, k_u) = (6, 6).
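A sketch of this data-generation step is given below; it is our own illustration, and the grid search used to pick the largest admissible τ is simply one way to respect the condition-number cap, not necessarily the authors' procedure.

```python
# Generate a synthetic composite factor model with spectral-norm scaling tau chosen so
# that cond(Theta*) <= 10 (illustrative sketch).
import numpy as np

def make_composite_model(p, q, k_x, k_u, max_cond=10.0, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((p, k_x)) @ rng.standard_normal((q, k_x)).T   # A* = J K^T
    B_u = rng.standard_normal((p, k_u))
    A /= np.linalg.norm(A, 2)            # normalize to unit spectral norm
    B_u /= np.linalg.norm(B_u, 2)

    def joint_precision(tau):
        Sigma = np.block([[np.eye(p) + tau**2 * (A @ A.T + B_u @ B_u.T), tau * A],
                          [tau * A.T, np.eye(q)]])
        return np.linalg.inv(Sigma)

    taus = np.linspace(0.05, 5.0, 200)   # coarse grid; largest feasible tau is kept
    tau = max(t for t in taus if np.linalg.cond(joint_precision(t)) <= max_cond)
    return tau * A, tau * B_u, joint_precision(tau)

A_star, B_u_star, Theta_star = make_composite_model(p=40, q=10, k_x=2, k_u=2)
print(np.linalg.cond(Theta_star))        # <= 10 by construction
```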
For the purposes of this experiment, we assume that the input to Algorithm 1 is the oracle
factor model specified by the parameters (D*_y, L*_y + Θ*_yx (Θ*_x)^(-1) Θ*_xy), and demonstrate the success of
steps 2-3 of Algorithm 1. In particular, for each model, we generate n samples of responses y and
covariates x, and use these observations as input to the convex program (1.4). The regularization
parameters λ_n are chosen so that the estimates (Θ̂, L̂_y, D̂_y) satisfy (i) rank(L*_y + Θ*_yx (Θ*_x)^(-1) Θ*_xy) =
rank(L̂_y + Θ̂_yx Θ̂_x^(-1) Θ̂_xy), (ii) column-space(Θ̂_yx) ∩ column-space(L̂_y) = {0}, and the deviation
from the underlying factor model max{ ‖D*_y − D̂_y‖_2 / ‖D*_y‖_2, ‖L*_y − [L̂_y + Θ̂_yx Θ̂_x^(-1) Θ̂_xy]‖_2 / ‖L*_y‖_2 }
is minimized. Figure 1(a) shows the magnitude of the deviation for different values of n. Furthermore,
for each fixed n, we use the choice of regularization parameters λ_n to compute the probability of
obtaining structurally correct estimates of the composite model (i.e. rank(L̂_y) = rank(L*_y) and
rank(Θ̂_yx) = rank(Θ*_yx)). These probabilities are evaluated over 10 experiments and are shown in
Figure 1(b). These results support Theorem 2.1: given (sufficiently many) samples of responses/covariates,
the convex program (1.4) provides accurate estimates of the composite factor model (1.3).
[Figure 1: panel (a) plots the composite factor model error and panel (b) plots the probability of correct
structural recovery, each against n/(p+q), for the models (k_x, k_u) = (1,1), (2,2), (4,4), (6,6).]

Figure 1: Synthetic data: plot shows the error (defined in the main text) and probability of correct structure
recovery in composite factor models. The four models studied are (i) (k_x, k_u) = (1, 1), (ii) (k_x, k_u) = (2, 2),
(iii) (k_x, k_u) = (4, 4), and (iv) (k_x, k_u) = (6, 6). For each plotted point in (b), the probability of
structurally correct estimation is obtained over 10 trials.
3.2 Experimental Results on Financial Asset Data
We consider as our responses y the monthly stock returns of p = 45 companies from the Standard
and Poor index over the period March 1982 to March 2016, which leads to a total of n = 408
observations. We then obtain measurements of 13 covariates that can plausibly influence the values
of stock prices: consumer price index, producer price index, EUR to USD exchange rate, federal debt