Supplementary Material: Interpreting Latent Variables in Factor Models via Convex Optimization

Armeen Taeb† and Venkat Chandrasekaran†,‡ ∗

† Department of Electrical Engineering
‡ Department of Computing and Mathematical Sciences
California Institute of Technology
Pasadena, CA 91125

∗ Email: ataeb@caltech.edu, venkatc@caltech.edu
1 A Numerical Approach for Verifying Assumptions 1, 2, and 3 (main paper)
We begin by considering Assumption 1 in (15) (main paper). Let $f_1 \triangleq 2\max\{\sqrt{p},\, \sqrt{2k_u},\, \sqrt{2k_x}\,\gamma,\, \sqrt{q}\}$, $f_2 \triangleq \max\{\sqrt{2k_u},\, \sqrt{2k_x}\,\gamma\}$, and $\omega \triangleq \max\{\omega_y, \omega_{yx}\}$. Let $Z = (Z_1, Z_2, Z_3, Z_4) \in H'$ with $\Phi_\gamma(Z_1, Z_2, Z_3, Z_4) = 1$. It is straightforward to check that:
\[
\Phi_\gamma\big[\mathcal{P}_{H'}F^\dagger I^\star F\mathcal{P}_{H'}(Z_1,Z_2,Z_3,Z_4)\big] \;\ge\; f_1^{-1}\,\sigma_{\min}\big(\mathcal{P}_{H^\star}F^\dagger I^\star F\mathcal{P}_{H^\star}\big) \;-\; \max\Big\{1,\frac{1}{\gamma}\Big\}\big(\sqrt{3}\,\omega + \omega + \sqrt{3}\,\omega^2\big)f_2\psi^2 \;\triangleq\; T_1
\]
Notice that the quantity $\sigma_{\min}\big(\mathcal{P}_{H^\star}F^\dagger I^\star F\mathcal{P}_{H^\star}\big)$ (and hence the quantity $T_1$) is computable given the population model. Thus a simple lower bound for $\alpha$ is given by:
\[
\inf_{H' \in U(\omega_y,\,\omega_{yx})} \chi(H', \Phi_\gamma) \;\ge\; \alpha \;\ge\; T_1
\]
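Both $T_1$ and the certified lower bound on $\alpha$ are simple arithmetic functions of population quantities. The following is a minimal sketch of that arithmetic, assuming $\sigma_{\min}(\mathcal{P}_{H^\star}F^\dagger I^\star F\mathcal{P}_{H^\star})$, $\psi$, and the model dimensions have already been computed from the population model (the function name is illustrative):

```python
import numpy as np

def t1_lower_bound(sigma_min, p, q, k_u, k_x, gamma, omega_y, omega_yx, psi):
    """Evaluate T1 from the display above; sigma_min is the smallest
    eigenvalue of P_{H*} F' I* F P_{H*}, computed from the population model."""
    f1 = 2 * max(np.sqrt(p), np.sqrt(2 * k_u), np.sqrt(2 * k_x) * gamma, np.sqrt(q))
    f2 = max(np.sqrt(2 * k_u), np.sqrt(2 * k_x) * gamma)
    omega = max(omega_y, omega_yx)
    correction = (max(1.0, 1.0 / gamma)
                  * (np.sqrt(3) * omega + omega + np.sqrt(3) * omega**2)
                  * f2 * psi**2)
    return sigma_min / f1 - correction
```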
We now consider Assumption 2 in (16) (main paper). Let $Z = (Z_1, Z_2) \in H'_{[2,3]}$ with $\Gamma_\gamma(Z_1, Z_2) = 1$. Using the triangle inequality, it is straightforward to check the following bound:
\[
\Gamma_\gamma\big[\mathcal{P}_{H'_{[2,3]}}G^\dagger I^\star G\mathcal{P}_{H'_{[2,3]}}(Z_1,Z_2)\big] \;\ge\; \min\Big\{1,\frac{1}{\gamma}\Big\}\big(\sqrt{3}f_2\big)^{-1}\sigma_{\min}\big(\mathcal{P}_{H^\star_{[2,3]}}G^\dagger I^\star G\mathcal{P}_{H^\star_{[2,3]}}\big) \;-\; \max\Big\{1,\frac{1}{\gamma}\Big\}\big(\sqrt{3}\,\omega+\omega+\sqrt{3}\,\omega^2\big)f_2\psi^2 \;\triangleq\; T_2
\]
Notice that the quantity $T_2$ is computable given the population model. Then,
\[
\inf_{H' \in U(\omega_y,\,\omega_{yx})} \Xi(H', \Gamma_\gamma) \;\ge\; T_2
\]
Now we consider Assumption 3 in (17) (main paper). Using the triangle inequality, it is straightforward to check that:
\[
\Gamma_\gamma\big[\mathcal{P}_{H'^{\perp}_{[2,3]}}G^\dagger I^\star G\mathcal{P}_{H'_{[2,3]}}(Z_1,Z_2)\big] \;\le\; \sqrt{3}f_2\max\Big\{1,\frac{1}{\gamma}\Big\}\,\sigma_{\max}\big(\mathcal{P}_{H^{\star\perp}_{[2,3]}}G^\dagger I^\star G\mathcal{P}_{H^\star_{[2,3]}}\big) \;+\; \max\Big\{1,\frac{1}{\gamma}\Big\}\big(\sqrt{3}\,\omega+\omega+\sqrt{3}\,\omega^2\big)f_2\psi^2 \;\triangleq\; T_3
\]
Similarly, the quantity $T_3$ can be computed given the population model. Then, an upper bound for $\varphi(H', \Gamma_\gamma)$ is given by:
\[
\sup_{H' \in U(\omega_y,\,\omega_{yx})} \varphi(H', \Gamma_\gamma) \;\le\; 1 - \frac{2}{1+\beta} \;\le\; \frac{T_3}{T_2} \;\;\Longrightarrow\;\; \beta \;\le\; \frac{2}{1-\frac{T_3}{T_2}} - 1
\]
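Once $T_2$ and $T_3$ have been computed, the resulting bound on $\beta$ is immediate; a minimal sketch of this arithmetic (meaningful only when $T_3 < T_2$, so that the ratio is below one):

```python
def beta_bound(T2, T3):
    """Evaluate beta <= 2 / (1 - T3/T2) - 1 from the display above."""
    ratio = T3 / T2
    if ratio >= 1:
        raise ValueError("bound is vacuous unless T3 < T2")
    return 2.0 / (1.0 - ratio) - 1.0
```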
2 Proof of Proposition 1 (main paper)
Proof. We note that:
\[
\|\Delta\|_2 \;\le\; \|\Delta_{D_y}\|_2 + \|\Delta_{L_y}\|_2 + \|\Delta_{\Theta_{yx}}\|_2 + \|\Delta_{\Theta_x}\|_2 \;\le\; (3+\gamma)\,\Phi_\gamma(\Delta)
\]
Furthermore, recall that
\[
R_{\Sigma^\star}(F(\Delta)) \;=\; \Sigma^{\star\,-1}\Bigg[\sum_{k=2}^{\infty}\big(-F(\Delta)\Sigma^{\star\,-1}\big)^k\Bigg].
\]
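This series is the second-order Taylor remainder of the matrix inverse, so for a sufficiently small perturbation $E$ it also has the closed form $R_{\Sigma^\star}(E) = (\Sigma^\star + E)^{-1} - \Sigma^{\star-1} + \Sigma^{\star-1}E\,\Sigma^{\star-1}$. A quick numerical sanity check of this identity (a sketch with arbitrary placeholder data):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)   # a positive definite stand-in for Sigma*
B = rng.standard_normal((p, p))
E = 0.01 * (B + B.T) / 2          # a small symmetric perturbation F(Delta)

Sinv = np.linalg.inv(Sigma)
# partial sum of Sigma*^{-1} sum_{k>=2} (-E Sigma*^{-1})^k
R_series = sum(Sinv @ np.linalg.matrix_power(-E @ Sinv, k) for k in range(2, 60))
# closed form: second-order Taylor remainder of the matrix inverse
R_closed = np.linalg.inv(Sigma + E) - Sinv + Sinv @ E @ Sinv
assert np.allclose(R_series, R_closed)
```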
Using this observation, the bound $\|\Delta\|_2 \le (3+\gamma)\Phi_\gamma(\Delta)$, and summing the resulting geometric series, we have that:
\[
\Phi_\gamma\big[F^\dagger R_{\Sigma^\star}(F(\Delta))\big] \;\le\; m\psi\Bigg[\sum_{k=2}^{\infty}\big(\psi\|\Delta\|_2\big)^k\Bigg] \;\le\; \frac{m\psi^3(3+\gamma)^2\,\Phi_\gamma[\Delta]^2}{1-(3+\gamma)\Phi_\gamma[\Delta]\,\psi} \;\le\; 2m\psi C'^2\,\Phi_\gamma[\Delta]^2
\]
3 Proof of Proposition 2 (main paper)
Proof. The proof of this result uses Brouwer's fixed-point theorem, and is inspired by the proof of a similar result in [5, 2]. The optimality conditions of (20) (main paper) suggest that there exist Lagrange multipliers $Q_{D_y}\in\mathcal{W}$, $Q_{T_y}\in T_y'^{\perp}$, and $Q_{T_{yx}}\in T_{yx}'^{\perp}$ such that
\[
\big[\Sigma_n - \tilde{\Theta}^{-1}\big]_y + Q_{D_y} = 0; \qquad \big[\Sigma_n - \tilde{\Theta}^{-1}\big]_y + Q_{T_y} \in \lambda_n\,\partial\|\tilde{L}_y\|_\star
\]
\[
\big[\Sigma_n - \tilde{\Theta}^{-1}\big]_{yx} + Q_{T_{yx}} \in -\lambda_n\gamma\,\partial\|\tilde{\Theta}_{yx}\|_\star; \qquad \big[\Sigma_n - \tilde{\Theta}^{-1}\big]_x = 0
\]
Letting the SVDs of $\tilde{L}_y$ and $\tilde{\Theta}_{yx}$ be given by $\tilde{L}_y = \bar{U}\bar{D}\bar{V}'$ and $\tilde{\Theta}_{yx} = \breve{U}\breve{D}\breve{V}'$ respectively, and letting $Z \triangleq (0,\; \lambda_n\bar{U}\bar{V}',\; -\lambda_n\gamma\breve{U}\breve{V}',\; 0)$, we can restrict the optimality conditions of (20) (main paper) to the space $H'$ to obtain $\mathcal{P}_{H'}F^\dagger(\Sigma_n - \tilde{\Theta}^{-1}) = Z$. Further, by appealing to the matrix inversion lemma, this condition can be restated as $\mathcal{P}_{H'}F^\dagger\big(E_n - R_{\Sigma^\star}(F\Delta) + I^\star F(\Delta)\big) = Z$. Based on Fisher Information Assumption 1 in (15) (main paper), the optimum of (20) (main paper) is unique (this is because the Hessian of the negative log-likelihood term is positive definite when restricted to the tangent-space constraints). Moreover, using standard Lagrangian duality, one can show that the set of variables $(\tilde{\Theta}, \tilde{D}_y, \tilde{L}_y)$ satisfying the restricted optimality conditions is unique. Consider the following function $S(\delta)$, restricted to $\delta \in \mathcal{W}\times T_y'\times T_{yx}'\times\mathbb{S}^q$ with $\rho(T(L_y^\star), T_y') \le \omega_y$ and $\rho(T(\Theta_{yx}^\star), T_{yx}') \le \omega_{yx}$:
\[
S(\delta) \;=\; \delta - \big(\mathcal{P}_{H'}F^\dagger I^\star F\mathcal{P}_{H'}\big)^{-1}\Big(\mathcal{P}_{H'}F^\dagger\big[E_n - R_{\Sigma^\star}\big(F(\delta + C_T)\big) + I^\star F(\delta + C_T)\big] - Z\Big)
\]
The function $S(\delta)$ is well-defined since the operator $\mathcal{P}_{H'}F^\dagger I^\star F\mathcal{P}_{H'}$ is bijective by Fisher Information Assumption 1 in (15) (main paper). As a result, $\delta$ is a fixed point of $S(\delta)$ if and only if $\mathcal{P}_{H'}F^\dagger\big[E_n - R_{\Sigma^\star}\big(F(\delta + C_T)\big) + I^\star F(\delta + C_T)\big] = Z$. Since the variables $(\tilde{\Theta}, \tilde{D}_y, \tilde{L}_y)$ are the unique solution of (20) (main paper), the only fixed point of $S$ is $\mathcal{P}_{H'}[\Delta]$. Next we show that this unique fixed point lies inside the ball
\[
\mathbb{B}_{r_1^u, r_2^u} \;=\; \Big\{\delta \;\Big|\; \max\big\{\|\delta_2\|_2,\; \tfrac{1}{\gamma}\|\delta_3\|_2\big\} \le r_1^u,\;\; \max\big\{\|\delta_1\|_2,\; \|\delta_4\|_2\big\} \le r_2^u,\;\; \delta \in H'\Big\}.
\]
In particular, we show that under the map $S$, the image of $\mathbb{B}_{r_1^u,r_2^u}$ lies in $\mathbb{B}_{r_1^u,r_2^u}$, and appeal to Brouwer's fixed-point theorem to conclude that $\mathcal{P}_{H'}[\Delta] \in \mathbb{B}_{r_1^u,r_2^u}$. For $\delta \in \mathbb{B}_{r_1^u,r_2^u}$, the first component of $S(\delta)$, denoted by $S(\delta)_1$, can be bounded as follows:
\[
\begin{aligned}
\|S(\delta)_1\|_2 &= \Big\|\Big[\big(\mathcal{P}_{H'}F^\dagger I^\star F\mathcal{P}_{H'}\big)^{-1}\Big(\mathcal{P}_{H'}F^\dagger\big[E_n - R_{\Sigma^\star}\big(F(\delta+C_T)\big) + I^\star F C_T\big] + Z\Big)\Big]_1\Big\|_2\\[2pt]
&\le \frac{2}{\alpha}\Big[\Phi_\gamma\big[F^\dagger\big(E_n + I^\star F(C_T)\big)\big]\Big] + \frac{2}{\alpha}\Phi_\gamma\big[F^\dagger R_{\Sigma^\star}\big(F(\delta + C_T)\big)\big]\\[2pt]
&\le \frac{r_2^u}{2} + \frac{2}{\alpha}\Phi_\gamma\big[F^\dagger R_{\Sigma^\star}\big(F(\delta + C_T)\big)\big]
\end{aligned}
\]
The first inequality holds because of Fisher Information Assumption 1 in (15) (main paper), and the properties that $\Phi_\gamma[\mathcal{P}_{H'}(\cdot)] \le 2\Phi_\gamma(\cdot)$ (since projecting onto the tangent space of a low-rank matrix variety increases the spectral norm by a factor of at most two) and $\Phi_\gamma(Z) = \lambda_n$. Moreover, since $r_1^u \le \frac{1}{4C'}$, we have $\Phi_\gamma(\delta + C_T) \le \Phi_\gamma(\delta) + \Phi_\gamma(C_T) \le 2r_1^u \le \frac{1}{2C'}$. Furthermore, $r_1^u \le r_2^u\max\big\{1+\frac{\kappa}{2},\, \frac{\alpha}{8}\big\}$.
We can now appeal to Proposition 1 (main paper) to obtain:
\[
\frac{2}{\alpha}\Phi_\gamma\big[F^\dagger R_{\Sigma^\star}\big(F(\delta + C_T)\big)\big] \;\le\; \frac{4}{\alpha}m\psi C'^2\big[\Phi_\gamma(\delta + C_T)\big]^2 \;\le\; \frac{16}{\alpha}m\psi C'^2 (r_2^u)^2\max\Big\{1+\frac{\kappa}{2},\, \frac{\alpha}{8}\Big\}^2 \;\le\; \frac{r_2^u}{2}
\]
Thus, we conclude that $\|S(\delta)_1\|_2 \le r_2^u$. Similarly, we check that:
\[
\begin{aligned}
\|S(\delta)_2\|_2 &= \Big\|\Big[\big(\mathcal{P}_{H'}F^\dagger I^\star F\mathcal{P}_{H'}\big)^{-1}\Big(\mathcal{P}_{H'}F^\dagger\big[E_n - R_{\Sigma^\star}\big(F(\delta+C_T)\big) + I^\star F C_T\big] + Z\Big)\Big]_2\Big\|_2\\[2pt]
&\le \frac{2}{\alpha}\Big[\Phi_\gamma\big[F^\dagger\big(E_n + I^\star F(C_T)\big)\big] + \lambda_n\Big] + \frac{2}{\alpha}\Phi_\gamma\big[F^\dagger R_{\Sigma^\star}\big(F(\delta + C_T)\big)\big]\\[2pt]
&\le \frac{r_1^u}{2} + \frac{2}{\alpha}\Phi_\gamma\big[F^\dagger R_{\Sigma^\star}\big(F(\delta + C_T)\big)\big] \;\le\; r_1^u
\end{aligned}
\]
Using a similar approach, we can conclude that $\frac{1}{\gamma}\|S(\delta)_3\|_2 \le r_1^u$ and $\|S(\delta)_4\|_2 \le r_2^u$. Therefore, Brouwer's fixed-point theorem implies that $\mathcal{P}_{H'}(\Delta) \in \mathbb{B}_{r_1^u,r_2^u}$. Hence, $\|\Delta_1\|_2 \le r_2^u$, $\|\Delta_4\|_2 \le r_2^u$, $\|\Delta_2\|_2 \le \|\mathcal{P}_{H'[2]}(\Delta_2)\|_2 + \|\mathcal{P}_{H'[2]^\perp}(\Delta_2)\|_2 \le 2r_1^u$, and $\frac{1}{\gamma}\|\Delta_3\|_2 \le \frac{1}{\gamma}\|\mathcal{P}_{H'[3]}(\Delta_3)\|_2 + \frac{1}{\gamma}\|\mathcal{P}_{H'[3]^\perp}(\Delta_3)\|_2 \le 2r_1^u$.
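The argument above is a recurring pattern: establish that a continuous map sends a compact convex set into itself, invoke Brouwer's theorem to place a fixed point inside the set, and then identify that fixed point with the quantity of interest via uniqueness. The following toy sketch (an illustration only, unrelated to the specific operators above) demonstrates the pattern numerically on a disc in the plane:

```python
import numpy as np

r = 1.0  # radius of the ball B_r

def S(v):
    # smooth map with ||S(v)|| <= 0.5 + 0.4*||v|| <= 0.9*r on B_r, so S maps
    # the ball into itself and Brouwer guarantees a fixed point inside B_r
    return 0.5 * np.array([np.cos(v[1]), np.sin(v[0])]) + 0.4 * v

rng = np.random.default_rng(1)
pts = rng.uniform(-1, 1, size=(1000, 2))
pts = pts[np.linalg.norm(pts, axis=1) <= r]          # random points of the ball
assert all(np.linalg.norm(S(v)) <= r for v in pts)   # image stays in the ball

v = np.zeros(2)                                      # locate the fixed point
for _ in range(500):
    v = S(v)
assert np.linalg.norm(S(v) - v) < 1e-10 and np.linalg.norm(v) <= r
```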
4 Proof of Proposition 3 (main paper)
Below, we outline our proof strategy:

1. We proceed by analyzing (19) (main paper) with the additional constraints that the variables $L_y$ and $\Theta_{yx}$ belong to the algebraic varieties of low-rank matrices (specified by the ranks of $L_y^\star$ and $\Theta_{yx}^\star$), and that the tangent spaces $T(L_y)$, $T(\Theta_{yx})$ are close to the nominal tangent spaces $T(L_y^\star)$ and $T(\Theta_{yx}^\star)$, respectively. We prove that under suitable conditions on the minimum nonzero singular values of $L_y^\star$ and $\Theta_{yx}^\star$, any optimal set of variables $(\Theta, D_y, L_y)$ of this non-convex program consists of smooth points of the underlying varieties; that is, $\operatorname{rank}(L_y) = \operatorname{rank}(L_y^\star)$ and $\operatorname{rank}(\Theta_{yx}) = \operatorname{rank}(\Theta_{yx}^\star)$. Further, we show that $L_y$ has the same inertia as $L_y^\star$, so that $L_y \succeq 0$.

2. The conclusions of the previous step imply that the variety constraints can be “linearized” at the optimum of the non-convex program to obtain tangent-space constraints. Under the specified conditions on the regularization parameter $\lambda_n$, we prove that with high probability, the unique optimum of this “linearized” program coincides with the global optimum of the non-convex program.

3. Finally, we show that the tangent-space constraints of the linearized program are inactive at the optimum. Therefore the optimal solution of (19) (main paper) has the property that, with high probability, $\operatorname{rank}(\bar{L}_y) = \operatorname{rank}(L_y^\star)$ and $\operatorname{rank}(\bar{\Theta}_{yx}) = \operatorname{rank}(\Theta_{yx}^\star)$. Since $\bar{L}_y \succeq 0$, we conclude that the variables $(\bar{\Theta}, \bar{D}_y, \bar{L}_y)$ are the unique optimum of (4) (main paper).
4.1 Variety Constrained Program
We begin by considering a variety-constrained optimization program. For $(M,N,P,Q) \in \mathbb{S}^p\times\mathbb{S}^p\times\mathbb{R}^{p\times q}\times\mathbb{S}^q$, we denote $\mathcal{P}_{[2,3]}(M,N,P,Q) = (N,P) \in \mathbb{S}^p\times\mathbb{R}^{p\times q}$. The variety-constrained optimization program is given by:
\[
(\Theta_{\mathcal{M}},\, D_y^{\mathcal{M}},\, L_y^{\mathcal{M}}) \;=\; \operatorname*{argmin}_{\substack{\Theta\in\mathbb{S}^{p+q},\;\Theta\succ 0\\ D_y,\,L_y\in\mathbb{S}^p}} \; -\ell(\Theta;\mathcal{D}_+^n) + \lambda_n\big[\|L_y\|_\star + \gamma\|\Theta_{yx}\|_\star\big]
\qquad \text{s.t.}\;\; \Theta_y = D_y - L_y,\;\; (\Theta, D_y, L_y)\in\mathcal{M}. \tag{1}
\]
Here, the set $\mathcal{M} = \mathcal{M}_1\cap\mathcal{M}_2$, where the sets $\mathcal{M}_1$ and $\mathcal{M}_2$ are given by:
\[
\begin{aligned}
\mathcal{M}_1 \triangleq \Big\{(\Theta, D_y, L_y)\in\mathbb{S}^{(p+q)}\times\mathbb{S}^p\times\mathbb{S}^p \;\Big|\;\; & D_y \text{ is diagonal},\; \operatorname{rank}(L_y)\le\operatorname{rank}(L_y^\star),\; \operatorname{rank}(\Theta_{yx})\le\operatorname{rank}(\Theta_{yx}^\star);\\
& \big\|\mathcal{P}_{T(L_y^\star)^\perp}(L_y - L_y^\star)\big\|_2 \le \tfrac{\lambda_n}{2\psi^2},\;\; \big\|\mathcal{P}_{T(\Theta_{yx}^\star)^\perp}(\Theta_{yx}-\Theta_{yx}^\star)\big\|_2 \le \tfrac{\lambda_n}{2\psi^2}\Big\}
\end{aligned}
\]
\[
\mathcal{M}_2 \triangleq \Big\{(\Theta, D_y, L_y)\in\mathbb{S}^{(p+q)}\times\mathbb{S}^p\times\mathbb{S}^p \;\Big|\; \big\|I^\star F(\Delta)\big\|_2 \le 6\bar{m}\psi^2\lambda_n\Big(\tfrac{8}{\alpha\kappa}+\tfrac{4}{\alpha}+\tfrac{1}{\kappa}\Big)\Big\},
\]
where $\Delta = (D_y - D_y^\star,\, L_y - L_y^\star,\, \Theta_{yx} - \Theta_{yx}^\star,\, \Theta_x - \Theta_x^\star)$.
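The constraints along $T(L_y^\star)^\perp$ and $T(\Theta_{yx}^\star)^\perp$ are easy to evaluate numerically: for the rank-$k$ matrix variety, the tangent space at a matrix with top singular spaces $U, V$ consists of matrices of the form $UX' + YV'$, so projecting onto its orthogonal complement amounts to multiplying by $(I - UU')$ on the left and $(I - VV')$ on the right. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def tangent_perp_norm(M_star, M, rank):
    """Spectral norm of the projection of (M - M_star) onto T(M_star)^perp,
    where T(M_star) is the tangent space of the rank-`rank` matrix variety."""
    U, _, Vt = np.linalg.svd(M_star)
    Pu = np.eye(M_star.shape[0]) - U[:, :rank] @ U[:, :rank].T
    Pv = np.eye(M_star.shape[1]) - Vt[:rank].T @ Vt[:rank]
    return np.linalg.norm(Pu @ (M - M_star) @ Pv, 2)

# membership of (L_y, Theta_yx) in M_1 then reduces to two rank checks plus
# tangent_perp_norm(L_star, L_y, k_u) <= lambda_n / (2 * psi**2), and likewise
# for Theta_yx
```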
The optimization program (1) is non-convex due to the rank constraints $\operatorname{rank}(L_y)\le\operatorname{rank}(L_y^\star)$ and $\operatorname{rank}(\Theta_{yx})\le\operatorname{rank}(\Theta_{yx}^\star)$ in the set $\mathcal{M}$. These constraints ensure that the matrices $L_y$ and $\Theta_{yx}$ belong to the appropriate varieties. The constraints in $\mathcal{M}$ along $T(L_y^\star)^\perp$ and $T(\Theta_{yx}^\star)^\perp$ ensure that the tangent spaces $T(L_y)$ and $T(\Theta_{yx})$ are “close” to $T(L_y^\star)$ and $T(\Theta_{yx}^\star)$, respectively. Finally, the last condition roughly controls the error. We begin by proving the following useful proposition:
Proposition 4.1. Let $(\Theta, D_y, L_y)$ be a set of feasible variables of (1). Let $\Delta = (D_y - D_y^\star,\, L_y - L_y^\star,\, \Theta_{yx} - \Theta_{yx}^\star,\, \Theta_x - \Theta_x^\star)$ and recall that $C_1' = \frac{2\bar{m}m}{\kappa\alpha}\big(6\psi^2 + \frac{24}{\alpha}\psi^2 + \frac{48\psi^2\kappa}{\alpha} + \kappa\big) + \frac{1}{\psi^2}$. Then,
\[
\Phi_\gamma[\Delta] \;\le\; C_1'\lambda_n
\]
Proof. Let $H^\star = \mathcal{W}\times T(L_y^\star)\times T(\Theta_{yx}^\star)\times\mathbb{S}^q$. Then,
\[
\begin{aligned}
\Phi_\gamma\big[F^\dagger I^\star F\mathcal{P}_{H^\star}(\Delta)\big] &\le \Phi_\gamma\big[F^\dagger I^\star F(\Delta)\big] + \Phi_\gamma\big[F^\dagger I^\star F\mathcal{P}_{H^{\star\perp}}(\Delta)\big]\\[2pt]
&\le 6\bar{m}m\psi^2\lambda_n\Big(\frac{8}{\alpha\kappa}+\frac{4}{\alpha}+\frac{1}{\kappa}\Big) + m\psi^2\Big(\frac{\omega_y\lambda_n}{2\psi^2}+\frac{\omega_{yx}\lambda_n}{2\psi^2}\Big)\\[2pt]
&\le \frac{\bar{m}m\lambda_n}{\kappa}\Big(6\psi^2+\frac{24}{\alpha}\psi^2+\frac{48\psi^2\kappa}{\alpha}+\kappa\Big)
\end{aligned}
\]
Since $\Phi_\gamma[\mathcal{P}_{H^\star}(\cdot)]\le 2\Phi_\gamma(\cdot)$, we have that $\Phi_\gamma\big[\mathcal{P}_{H^\star}F^\dagger I^\star F\mathcal{P}_{H^\star}(\Delta)\big] \le \frac{2\bar{m}m\lambda_n}{\kappa}\big(6\psi^2+\frac{24}{\alpha}\psi^2+\frac{48\psi^2\kappa}{\alpha}+\kappa\big)$. Consequently, we apply Fisher Information Assumption 1 in (15) (main paper) to conclude that $\Phi_\gamma[\mathcal{P}_{H^\star}(\Delta)] \le \frac{2\bar{m}m\lambda_n}{\kappa\alpha}\big(6\psi^2+\frac{24}{\alpha}\psi^2+\frac{48\psi^2\kappa}{\alpha}+\kappa\big)$. Moreover:
\[
\Phi_\gamma[\Delta] \;\le\; \Phi_\gamma[\mathcal{P}_{H^\star}(\Delta)] + \Phi_\gamma[\mathcal{P}_{H^{\star\perp}}(\Delta)] \;\le\; \frac{2\bar{m}m\lambda_n}{\kappa\alpha}\Big(6\psi^2+\frac{24}{\alpha}\psi^2+\frac{48\psi^2\kappa}{\alpha}+\kappa\Big) + \frac{\lambda_n}{\psi^2} \;=\; C_1'\lambda_n
\]
Proposition 4.1 leads to powerful implications. In particular, under additional conditions on the minimum nonzero singular values of $L_y^\star$ and $\Theta_{yx}^\star$, any feasible set of variables $(\Theta, D_y, L_y)$ of (1) has two key properties: (a) the variables $(\Theta_{yx}, L_y)$ are smooth points of the underlying varieties, and (b) the constraints in $\mathcal{M}$ along $T(L_y^\star)^\perp$ and $T(\Theta_{yx}^\star)^\perp$ are locally inactive at $\Theta_{yx}$ and $L_y$. These properties, among others, are proved in the following corollary.
Corollary 4.2. Consider any feasible variables $(\Theta, D_y, L_y)$ of (1). Let $\sigma_y$ be the smallest nonzero singular value of $L_y^\star$ and $\sigma_{yx}$ be the smallest nonzero singular value of $\Theta_{yx}^\star$. Let $H' = \mathcal{W}\times T(L_y)\times T(\Theta_{yx})\times\mathbb{S}^q$ and $C_{T'} = \mathcal{P}_{H'^\perp}(0,\, L_y^\star,\, \Theta_{yx}^\star,\, 0)$. Furthermore, recall that $C_1' = \frac{2\bar{m}m}{\kappa\alpha}\big(6\psi^2+\frac{24}{\alpha}\psi^2+\frac{48\psi^2\kappa}{\alpha}+\kappa\big)+\frac{1}{\psi^2}$, $C_2' = \frac{4}{\alpha}\big(1+\frac{2}{\kappa}\big)$, $C_{\sigma_y}' = C_1'^2\psi^2\max\big\{2\kappa+1,\; 2C_2'\psi^2+1\big\}$, and $C_{\sigma_{yx}}' = C_1'^2\psi^2\max\big\{2\kappa+\frac{\kappa}{\gamma},\; 2C_2'\psi^2+\frac{\kappa}{\gamma}\big\}$. Suppose that the following inequalities are met: $\sigma_y \ge \frac{m}{\omega_y}C_{\sigma_y}'\lambda_n$ and $\sigma_{yx} \ge \frac{m\gamma^2}{\omega_{yx}}C_{\sigma_{yx}}'\lambda_n$. Then,

1. $L_y$ and $\Theta_{yx}$ are smooth points of their underlying varieties, i.e. $\operatorname{rank}(L_y) = \operatorname{rank}(L_y^\star)$ and $\operatorname{rank}(\Theta_{yx}) = \operatorname{rank}(\Theta_{yx}^\star)$; moreover, $L_y$ has the same inertia as $L_y^\star$.

2. $\big\|\mathcal{P}_{T(L_y^\star)^\perp}(L_y - L_y^\star)\big\|_2 \le \frac{\lambda_n}{48m\psi^2}$ and $\big\|\mathcal{P}_{T(\Theta_{yx}^\star)^\perp}(\Theta_{yx}-\Theta_{yx}^\star)\big\|_2 \le \frac{\lambda_n}{48m\psi^2}$.

3. $\rho(T(L_y), T(L_y^\star)) \le \omega_y$ and $\rho(T(\Theta_{yx}), T(\Theta_{yx}^\star)) \le \omega_{yx}$; that is, the tangent spaces at $L_y$ and $\Theta_{yx}$ are “close” to the tangent spaces at $L_y^\star$ and $\Theta_{yx}^\star$.

4. $\Phi_\gamma[C_{T'}] \le \min\big\{\frac{\lambda_n}{\kappa\psi^2},\; C_2'\lambda_n\big\}$.
Proof. We note the following relations before proving each step: $C_1' \ge \frac{1}{\psi^2} \ge \frac{1}{m\psi^2}$; $\omega_y, \omega_{yx} \in (0,1)$; and $\kappa \ge 6$. We also appeal to results regarding perturbation analysis of the low-rank matrix variety [1].
1. Based on the assumptions regarding the minimum nonzero singular values of $L_y^\star$ and $\Theta_{yx}^\star$, one can check that:
\[
\sigma_y \;\ge\; \frac{C_1'^2\lambda_n}{\omega_y}\,m\psi^2(2\kappa+1) \;\ge\; \frac{C_1'\lambda_n}{\omega_y}(2\kappa+1) \;\ge\; 8\|L_y - L_y^\star\|_2
\]
\[
\sigma_{yx} \;\ge\; \frac{C_1'^2\lambda_n}{\omega_{yx}}\,\gamma^2 m\psi^2\Big(\frac{6\beta}{\gamma}+2\kappa\Big) \;\ge\; 8\|\Theta_{yx} - \Theta_{yx}^\star\|_2
\]
Combining these results and Proposition 4.1, we conclude that $L_y$ and $\Theta_{yx}$ are smooth points of their respective varieties, i.e. $\operatorname{rank}(L_y)=\operatorname{rank}(L_y^\star)$ and $\operatorname{rank}(\Theta_{yx})=\operatorname{rank}(\Theta_{yx}^\star)$. Furthermore, $L_y$ has the same inertia as $L_y^\star$.
2. Since $\sigma_y \ge 8\|L_y - L_y^\star\|_2$ and $\sigma_{yx} \ge 8\|\Theta_{yx}-\Theta_{yx}^\star\|_2$, we can appeal to Proposition 2.2 of [2] to conclude that the constraints in $\mathcal{M}$ along $T(L_y^\star)^\perp$ and $T(\Theta_{yx}^\star)^\perp$ are strictly feasible:
\[
\big\|\mathcal{P}_{T(L_y^\star)^\perp}(L_y - L_y^\star)\big\|_2 \;\le\; \frac{\|L_y - L_y^\star\|_2^2}{\sigma_y} \;\le\; \frac{\lambda_n}{48m\psi^2}
\]
\[
\big\|\mathcal{P}_{T(\Theta_{yx}^\star)^\perp}(\Theta_{yx} - \Theta_{yx}^\star)\big\|_2 \;\le\; \frac{\|\Theta_{yx} - \Theta_{yx}^\star\|_2^2}{\sigma_{yx}} \;\le\; \frac{\lambda_n}{48m\psi^2}
\]
3. Appealing to Proposition 2.1 of [2], we prove that the tangent spaces $T(L_y)$ and $T(\Theta_{yx})$ are close to $T(L_y^\star)$ and $T(\Theta_{yx}^\star)$ respectively:
\[
\rho(T(L_y), T(L_y^\star)) \;\le\; \frac{2\|L_y - L_y^\star\|_2}{\sigma_y} \;\le\; \frac{2C_1'\lambda_n\,\omega_y}{C_1'^2\lambda_n\, m\psi^2(2\kappa+1)} \;\le\; \omega_y
\]
\[
\rho(T(\Theta_{yx}), T(\Theta_{yx}^\star)) \;\le\; \frac{2\|\Theta_{yx}-\Theta_{yx}^\star\|_2}{\sigma_{yx}} \;\le\; \frac{2C_1'\lambda_n\,\gamma\,\omega_{yx}}{C_1'^2\lambda_n\,\gamma^2 m\psi^2\big(\frac{\kappa}{\gamma}+2\kappa\big)} \;\le\; \omega_{yx}
\]
4. Letting $\sigma_y'$ and $\sigma_{yx}'$ be the minimum nonzero singular values of $L_y$ and $\Theta_{yx}$ respectively, one can check that:
\[
\sigma_y' \;\ge\; \sigma_y - \|L_y - L_y^\star\|_2 \;\ge\; 8C_1'\lambda_n \;\ge\; 8\|L_y - L_y^\star\|_2
\]
\[
\sigma_{yx}' \;\ge\; \sigma_{yx} - \|\Theta_{yx}-\Theta_{yx}^\star\|_2 \;\ge\; 8C_1'\lambda_n\gamma \;\ge\; 8\|\Theta_{yx}-\Theta_{yx}^\star\|_2
\]
Once again appealing to Proposition 2.2 of [2] and simple algebra, we have:
\[
\begin{aligned}
\Phi_\gamma(C_{T'}) &\le m\big\|\mathcal{P}_{T(L_y)^\perp}(L_y - L_y^\star)\big\|_2 + m\big\|\mathcal{P}_{T(\Theta_{yx})^\perp}(\Theta_{yx}-\Theta_{yx}^\star)\big\|_2\\[2pt]
&\le \frac{m\|L_y - L_y^\star\|_2^2}{\sigma_y'} + \frac{m\|\Theta_{yx}-\Theta_{yx}^\star\|_2^2}{\sigma_{yx}'} \;\le\; \min\Big\{\frac{\lambda_n}{\kappa\psi^2},\; C_2'\lambda_n\Big\}
\end{aligned}
\]
4.2 Variety Constrained Program to Tangent Space Constrained Program
Consider any optimal solution $(\Theta_{\mathcal{M}}, D_y^{\mathcal{M}}, L_y^{\mathcal{M}})$ of (1). In Corollary 4.2, we concluded that the variables $(\Theta_{yx}^{\mathcal{M}}, L_y^{\mathcal{M}})$ are smooth points of their respective varieties. As a result, the rank constraints $\operatorname{rank}(L_y)\le\operatorname{rank}(L_y^\star)$ and $\operatorname{rank}(\Theta_{yx})\le\operatorname{rank}(\Theta_{yx}^\star)$ can be “linearized” to $L_y \in T(L_y^{\mathcal{M}})$ and $\Theta_{yx}\in T(\Theta_{yx}^{\mathcal{M}})$ respectively. Since all the remaining constraints are convex, the optimum of this linearized program is also the optimum of (1). Moreover, we once more appeal to Corollary 4.2 to conclude that the constraints in $\mathcal{M}$ along $T(L_y^\star)^\perp$ and $T(\Theta_{yx}^\star)^\perp$ are strictly feasible at $(\Theta_{\mathcal{M}}, D_y^{\mathcal{M}}, L_y^{\mathcal{M}})$. As a result, these constraints are locally inactive and can be removed without changing the optimum. Therefore the constraint $(\Theta_{\mathcal{M}}, D_y^{\mathcal{M}}, L_y^{\mathcal{M}})\in\mathcal{M}_1$ is inactive and can be removed. We now argue that the constraint $(\Theta_{\mathcal{M}}, D_y^{\mathcal{M}}, L_y^{\mathcal{M}})\in\mathcal{M}_2$ in (1) can also be removed in this “linearized” convex program. In particular, letting $H_{\mathcal{M}} \triangleq \mathcal{W}\times T(L_y^{\mathcal{M}})\times T(\Theta_{yx}^{\mathcal{M}})\times\mathbb{S}^q$, consider the following convex optimization program:
\[
(\tilde{\Theta}, \tilde{D}_y, \tilde{L}_y) \;=\; \operatorname*{argmin}_{\substack{\Theta\in\mathbb{S}^{p+q},\;\Theta\succ 0\\ D_y,\,L_y\in\mathbb{S}^p}} \; -\ell(\Theta;\mathcal{D}_+^n) + \lambda_n\big[\|L_y\|_\star + \gamma\|\Theta_{yx}\|_\star\big]
\qquad \text{s.t.}\;\; \Theta_y = D_y - L_y,\;\; (D_y, L_y, \Theta_{yx}, \Theta_x)\in H_{\mathcal{M}} \tag{2}
\]
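For intuition, the convex core of (2) can be written down directly with an off-the-shelf solver. The sketch below (using cvxpy, with placeholder data, and dropping the subspace constraint $H_{\mathcal{M}}$ for readability) encodes the log-determinant loss, the equality $\Theta_y = D_y - L_y$, and the two nuclear-norm penalties; it is an illustration, not the authors' implementation:

```python
import cvxpy as cp
import numpy as np

p, q = 4, 3
Sigma_n = np.eye(p + q)        # placeholder sample covariance of (y, x)
lam_n, gamma = 0.1, 1.0

Theta = cp.Variable((p + q, p + q), PSD=True)
D_y = cp.Variable((p, p), diag=True)
L_y = cp.Variable((p, p), symmetric=True)

# -l(Theta; D_+^n) is, up to constants, -logdet(Theta) + <Theta, Sigma_n>
loss = -cp.log_det(Theta) + cp.trace(Theta @ Sigma_n)
penalty = lam_n * (cp.normNuc(L_y) + gamma * cp.normNuc(Theta[:p, p:]))
constraints = [Theta[:p, :p] == D_y - L_y]

cp.Problem(cp.Minimize(loss + penalty), constraints).solve()
print(np.linalg.matrix_rank(L_y.value, tol=1e-6))
```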
We prove that under the conditions imposed on the regularization parameter $\lambda_n$, the set of variables $(\Theta_{\mathcal{M}}, D_y^{\mathcal{M}}, L_y^{\mathcal{M}})$ is the unique optimum of (2). That is, we show that:

1. $\big\|I^\star F(\Delta)\big\|_2 \;<\; 6\bar{m}\psi^2\lambda_n\Big(\frac{8}{\alpha\kappa}+\frac{4}{\alpha}+\frac{1}{\kappa}\Big)$
Appealing to Corollary 4.2 and Proposition 4 (main paper), we have that $\Phi_\gamma\big[F^\dagger I^\star F C_{T_{\mathcal{M}}}\big] \le \frac{\lambda_n}{\kappa}$, $\Phi_\gamma[C_{T_{\mathcal{M}}}] \le C_2'\lambda_n$, and (with high probability) $\Phi_\gamma[F^\dagger E_n] \le \frac{\lambda_n}{\kappa}$. Consequently, based on the bound on $\lambda_n$ in the assumptions of Theorem 1 (main paper), it is straightforward to show that $r_1^u \le \min\Big\{\frac{1}{4C'},\; \frac{\alpha}{32\max\{1+\frac{\kappa}{2},\frac{\alpha}{8}\}^2 m\psi C'^2}\Big\}$, so that $\Phi_\gamma[\Delta] \le \frac{1}{2C'}$. Hence by Proposition 2 (main paper), we have that $\|\Delta_1\|_2, \|\Delta_4\|_2 \le r_2^u < r_1^u$, $\|\Delta_2\|_2 \le 2r_1^u$, and $\|\Delta_3\|_2 \le 2\gamma r_1^u$. Therefore:
\[
\big\|I^\star F(\Delta)\big\|_2 \;\le\; \psi^2\big(\|\Delta_1\|_2 + \|\Delta_2\|_2 + \|\Delta_3\|_2 + \|\Delta_4\|_2\big) \;<\; 6\bar{m}\psi^2 r_1^u \;\le\; 6\bar{m}\psi^2\lambda_n\Big(\frac{8}{\alpha\kappa}+\frac{4}{\alpha}+\frac{1}{\kappa}\Big)
\]
4.3 From Tangent Space Constraints to the Original Problem
The optimality conditions of (2) suggest that there exist Lagrange multipliers $Q_{D_y}\in\mathcal{W}$, $Q_{T_y}\in T(L_y^{\mathcal{M}})^\perp$, and $Q_{T_{yx}}\in T(\Theta_{yx}^{\mathcal{M}})^\perp$ such that
\[
\big[\Sigma_n - \tilde{\Theta}^{-1}\big]_y + Q_{D_y} = 0; \qquad \big[\Sigma_n - \tilde{\Theta}^{-1}\big]_y + Q_{T_y} \in \lambda_n\,\partial\|\tilde{L}_y\|_\star
\]
\[
\big[\Sigma_n - \tilde{\Theta}^{-1}\big]_{yx} + Q_{T_{yx}} \in -\lambda_n\gamma\,\partial\|\tilde{\Theta}_{yx}\|_\star; \qquad \big[\Sigma_n - \tilde{\Theta}^{-1}\big]_x = 0
\]
Letting the SVDs of $\tilde{L}_y$ and $\tilde{\Theta}_{yx}$ be given by $\tilde{L}_y = \bar{U}\bar{D}\bar{V}'$ and $\tilde{\Theta}_{yx} = \breve{U}\breve{D}\breve{V}'$ respectively, and letting $Z \triangleq (0,\; \lambda_n\bar{U}\bar{V}',\; -\lambda_n\gamma\breve{U}\breve{V}',\; 0)$, we can restrict the optimality conditions to the space $H_{\mathcal{M}}$ to obtain $\mathcal{P}_{H_{\mathcal{M}}}F^\dagger(\Sigma_n - \tilde{\Theta}^{-1}) = Z$. We proceed by proving that the variables $(\tilde{\Theta}, \tilde{D}_y, \tilde{L}_y)$ satisfy the optimality conditions of the original convex program (4) (main paper). That is:
1. $\mathcal{P}_{H_{\mathcal{M}}}F^\dagger(\Sigma_n - \tilde{\Theta}^{-1}) = Z$

2. $\max\Big\{\big\|\mathcal{P}_{T_y'^{\perp}}(\Sigma_n - \tilde{\Theta}^{-1})_y\big\|_2,\; \frac{1}{\gamma}\big\|\mathcal{P}_{T_{yx}'^{\perp}}(\Sigma_n - \tilde{\Theta}^{-1})_{yx}\big\|_2\Big\} < \lambda_n$
It is clear that the first condition is satisfied since the variables $(\tilde{\Theta}, \tilde{D}_y, \tilde{L}_y)$ are optimal for (2). To prove the second condition, we must show that $\Gamma_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}[2,3]^\perp}G^\dagger(\Sigma_n - \tilde{\Theta}^{-1})\big] < \lambda_n$. In particular, denoting $\Delta = (\tilde{D}_y - D_y^\star,\; \tilde{L}_y - L_y^\star,\; \tilde{\Theta}_{yx} - \Theta_{yx}^\star,\; \tilde{\Theta}_x - \Theta_x^\star)$, we show that:
\[
\begin{aligned}
\Gamma_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}[2,3]^\perp}G^\dagger I^\star G\mathcal{P}_{H_{\mathcal{M}}[2,3]}(\Delta)\big] \;<\; \lambda_n &- \Phi_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}^\perp}F^\dagger E_n\big] - \Phi_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}^\perp}F^\dagger R_{\Sigma^\star}(F(\Delta))\big]\\
&- \Phi_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}^\perp}F^\dagger I^\star F C_{T_{\mathcal{M}}}\big] - \Gamma_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}[2,3]^\perp}G^\dagger I^\star F(\Delta_1, 0, 0, \Delta_4)\big]
\end{aligned}
\tag{3}
\]
Using the fact that $\Gamma_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}[2,3]^\perp}G^\dagger(N)\big] \le \Phi_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}^\perp}F^\dagger(N)\big]$ for any matrix $N\in\mathbb{S}^{p+q}$, this would in turn imply that:
\[
\begin{aligned}
\Gamma_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}[2,3]^\perp}G^\dagger I^\star G\mathcal{P}_{H_{\mathcal{M}}[2,3]}(\Delta)\big] \;<\; \lambda_n &- \Gamma_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}[2,3]^\perp}G^\dagger E_n\big] - \Gamma_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}[2,3]^\perp}G^\dagger R_{\Sigma^\star}(F(\Delta))\big]\\
&- \Gamma_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}[2,3]^\perp}G^\dagger I^\star F C_{T_{\mathcal{M}}}\big] - \Gamma_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}[2,3]^\perp}G^\dagger I^\star F(\Delta_1, 0, 0, \Delta_4)\big]
\end{aligned}
\tag{4}
\]
Indeed, (4) implies that the second optimality condition is satisfied, so we focus on showing that (4) holds. Since $\Phi_\gamma[\Delta]\le\frac{1}{2C'}$, we can appeal to Proposition 1 (main paper) and the bound on $\lambda_n$ to conclude that $\Phi_\gamma\big[F^\dagger R_{\Sigma^\star}(F(\Delta))\big] \le 2m\psi C'^2\Phi_\gamma[\Delta]^2 \le 2m\psi C'^2 C_1'^2\lambda_n^2 \le \frac{\lambda_n}{\kappa}$. Using the first optimality condition, the fact that projecting onto tangent spaces with respect to the rank variety increases the spectral norm by at most a factor of two (i.e. $\Phi_\gamma[\mathcal{P}_{H_{\mathcal{M}}}(\cdot)]\le 2\Phi_\gamma(\cdot)$), the fact that $\Gamma_\gamma[G^\dagger(\cdot)]\le\Phi_\gamma[F^\dagger(\cdot)]$, and that $\kappa = \beta\big(6+\frac{16\psi^2 m}{\alpha}\big)$, we have that:
\[
\begin{aligned}
\Gamma_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}[2,3]}G^\dagger I^\star G\mathcal{P}_{H_{\mathcal{M}}[2,3]}(\Delta)\big] &\le \lambda_n + 2\Gamma_\gamma\big[G^\dagger R_{\Sigma^\star}(\Delta)\big] + 2\Gamma_\gamma\big[G^\dagger I^\star F C_{T_{\mathcal{M}}}\big] + 2\Gamma_\gamma\big[G^\dagger E_n\big] + \Gamma_\gamma\big[G^\dagger I^\star F(\Delta_1,0,0,\Delta_4)\big]\\[2pt]
&\le \lambda_n + 2\Phi_\gamma\big[F^\dagger R_{\Sigma^\star}(\Delta)\big] + 2\Phi_\gamma\big[F^\dagger I^\star F C_{T_{\mathcal{M}}}\big] + 2\Phi_\gamma\big[F^\dagger E_n\big] + \Phi_\gamma\big[F^\dagger I^\star F(\Delta_1,0,0,\Delta_4)\big]\\[2pt]
&\le \lambda_n + \frac{\lambda_n}{\beta}
\end{aligned}
\]
Applying Fisher Information Assumption 2 in (16) (main paper), we obtain:
\[
\begin{aligned}
\Gamma_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}[2,3]^\perp}G^\dagger I^\star G\mathcal{P}_{H_{\mathcal{M}}[2,3]}(\Delta)\big] &\le \frac{(\beta+1)\lambda_n}{\beta}\Big(1 - \frac{2}{\beta+1}\Big) \;=\; \lambda_n - \frac{\lambda_n}{\beta} \;<\; \lambda_n - \frac{\lambda_n}{2\beta}\\[2pt]
&\le \lambda_n - \Phi_\gamma\big[F^\dagger R_{\Sigma^\star}(F(\Delta))\big] - \Phi_\gamma\big[F^\dagger I^\star F C_{T_{\mathcal{M}}}\big] - \Phi_\gamma\big[F^\dagger E_n\big] - \Gamma_\gamma\big[G^\dagger I^\star F(\Delta_1,0,0,\Delta_4)\big]\\[2pt]
&\le \lambda_n - \Phi_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}^\perp}F^\dagger R_{\Sigma^\star}(F(\Delta))\big] - \Phi_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}^\perp}F^\dagger I^\star F C_{T_{\mathcal{M}}}\big] - \Phi_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}^\perp}F^\dagger E_n\big] - \Gamma_\gamma\big[\mathcal{P}_{H_{\mathcal{M}}[2,3]^\perp}G^\dagger I^\star F(\Delta_1,0,0,\Delta_4)\big]
\end{aligned}
\]
Here, we used the fact that $\|\mathcal{P}_{T^\perp}(\cdot)\|_2 \le \|\cdot\|_2$ for a tangent space $T$ of the low-rank matrix variety.
5 Proof of Proposition 4 (main paper)
We must study the rate of convergence of the sample covariance matrix to the population covariance
matrix. The following result from [3] plays a key role in obtaining this result.
Proposition 5.1. Given natural numbers $n, p$ with $p \le n$, let $\Gamma$ be a $p\times n$ matrix with i.i.d. Gaussian entries that have zero mean and variance $\frac{1}{n}$. Then the largest and smallest singular values $\sigma_1(\Gamma)$ and $\sigma_p(\Gamma)$ of $\Gamma$ are such that:
\[
\max\Big\{\mathrm{Prob}\Big[\sigma_1(\Gamma) \ge 1+\sqrt{\tfrac{p}{n}}+t\Big],\;\; \mathrm{Prob}\Big[\sigma_p(\Gamma) \le 1-\sqrt{\tfrac{p}{n}}-t\Big]\Big\} \;\le\; \exp\Big(-\frac{nt^2}{2}\Big).
\]
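A quick Monte Carlo illustration of this concentration (a sketch; with $n = 2000$, $p = 50$, $t = 0.1$, the bound $e^{-nt^2/2} = e^{-10}$ makes either deviation essentially never occur):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, t, trials = 2000, 50, 0.1, 200
hits_hi = hits_lo = 0
for _ in range(trials):
    G = rng.standard_normal((p, n)) / np.sqrt(n)   # i.i.d. N(0, 1/n) entries
    s = np.linalg.svd(G, compute_uv=False)
    hits_hi += s[0] >= 1 + np.sqrt(p / n) + t
    hits_lo += s[-1] <= 1 - np.sqrt(p / n) - t
print(hits_hi / trials, hits_lo / trials, np.exp(-n * t**2 / 2))
```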
We now proceed with proving Proposition 4 (main paper). First, note that $\Phi_\gamma[F^\dagger E_n] \le m\|\Sigma_n - \Sigma^\star\|_2$. Using Proposition 5.1 and the facts that $\frac{\lambda_n}{m\kappa}\le 8\psi$ and $n \ge \frac{64\kappa^2(p+q)m^2\psi^2}{\lambda_n^2}$, the following bound holds:
\[
\mathrm{Pr}\Big[m\|\Sigma_n - \Sigma^\star\|_2 \ge \frac{\lambda_n}{\kappa}\Big] \;\le\; 2\exp\Big\{-\frac{n\lambda_n^2}{128\kappa^2 m^2\psi^2}\Big\}.
\]
Thus, $\Phi_\gamma[F^\dagger E_n] \le \frac{\lambda_n}{\kappa}$ with probability greater than $1 - 2\exp\Big\{-\frac{n\lambda_n^2}{128\kappa^2 m^2\psi^2}\Big\}$.
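Both the sample-size requirement and the failure probability above are simple formulas; a minimal sketch of evaluating them (function names are illustrative):

```python
import numpy as np

def required_sample_size(kappa, p, q, m, psi, lam_n):
    # n >= 64 kappa^2 (p+q) m^2 psi^2 / lam_n^2
    return 64 * kappa**2 * (p + q) * m**2 * psi**2 / lam_n**2

def failure_probability(n, kappa, m, psi, lam_n):
    # 2 exp(-n lam_n^2 / (128 kappa^2 m^2 psi^2))
    return 2 * np.exp(-n * lam_n**2 / (128 * kappa**2 * m**2 * psi**2))
```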
6 Consistency of the Convex Program (18) (main paper)
In this section, we prove the consistency of the convex program (18) (main paper) for estimating a factor model. We first introduce some notation. We define the linear operator $\tilde{F}: \mathbb{S}^p\times\mathbb{S}^p\to\mathbb{S}^p$ and its adjoint $\tilde{F}^\dagger: \mathbb{S}^p\to\mathbb{S}^p\times\mathbb{S}^p$ as follows:
\[
\tilde{F}(M, K) \triangleq M - K, \qquad \tilde{F}^\dagger(Q) \triangleq (Q, Q). \tag{5}
\]
We consider a population composite factor model (3) (main paper) $y = A^\star x + B_u^\star\zeta_u + \varepsilon$ underlying a pair of random vectors $(y, x)\in\mathbb{R}^p\times\mathbb{R}^q$, with $\operatorname{rank}(A^\star) = k_x$, $B_u^\star\in\mathbb{R}^{p\times k_u}$, and $\text{column-space}(A^\star)\cap\text{column-space}(B_u^\star) = \{0\}$. As the convex relaxation (18) (main paper) is solved in the precision matrix parametrization, the conditions for our theorems are more naturally stated in terms of the joint precision matrix $\Theta^\star\in\mathbb{S}^{p+q}$, $\Theta^\star\succ 0$, of $(y,x)$. The algebraic aspects of the parameters underlying the factor model translate to algebraic properties of submatrices of $\Theta^\star$. In particular, the submatrix $\Theta_{yx}^\star$ has rank equal to $k_x$, and the submatrix $\Theta_y^\star$ is decomposable as $D_y^\star - L_y^\star$ with $D_y^\star$ being diagonal and $L_y^\star\succeq 0$ having rank equal to $k_u$. Finally, the transversality of $\text{column-space}(A^\star)$ and $\text{column-space}(B_u^\star)$ translates to the fact that $\text{column-space}(\Theta_{yx}^\star)$ and $\text{column-space}(L_y^\star)$ have a transverse intersection: $\text{column-space}(\Theta_{yx}^\star)\cap\text{column-space}(L_y^\star) = \{0\}$. We consider the factor model underlying the random vector $y\in\mathbb{R}^p$ that is induced upon marginalization of $x$. In particular, the precision matrix of $y$ is given by $\tilde{\Theta}_y^\star = D_y^\star - \big[L_y^\star + \Theta_{yx}^\star(\Theta_x^\star)^{-1}\Theta_{xy}^\star\big]$. To learn an accurate factor model, we seek an estimate $(\hat{\tilde{D}}_y, \hat{\tilde{L}}_y)$ from the convex program (18) (main paper) such that $\operatorname{rank}(\hat{\tilde{L}}_y) = \operatorname{rank}\big(L_y^\star + \Theta_{yx}^\star(\Theta_x^\star)^{-1}\Theta_{xy}^\star\big)$, and the errors $\big\|\hat{\tilde{D}}_y - D_y^\star\big\|_2$, $\big\|\hat{\tilde{L}}_y - \big[L_y^\star + \Theta_{yx}^\star(\Theta_x^\star)^{-1}\Theta_{xy}^\star\big]\big\|_2$ are small.
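The expression for $\tilde{\Theta}_y^\star$ is the standard Schur-complement identity for the precision matrix of a Gaussian marginal: since $\Theta_y^\star = D_y^\star - L_y^\star$, marginalizing $x$ subtracts the additional low-rank term $\Theta_{yx}^\star(\Theta_x^\star)^{-1}\Theta_{xy}^\star$. A quick numerical check of the identity (a sketch with arbitrary placeholder data):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 4, 3
A = rng.standard_normal((p + q, p + q))
Theta = A @ A.T + (p + q) * np.eye(p + q)   # joint precision of (y, x), posdef

Sigma = np.linalg.inv(Theta)                # joint covariance
prec_y = np.linalg.inv(Sigma[:p, :p])       # precision of y after marginalizing x

schur = Theta[:p, :p] - Theta[:p, p:] @ np.linalg.inv(Theta[p:, p:]) @ Theta[p:, :p]
assert np.allclose(prec_y, schur)
```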
Following the same reasoning as the Fisher information conditions for consistency of the convex program (4) (main paper), a natural set of conditions on the population Fisher information at $\tilde{\Theta}_y^\star$