Journal of Machine Learning Research 8 (2007) 203-226    Submitted 10/05; Revised 9/06; Published 2/07
A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
Sanjoy Dasgupta    DASGUPTA@CS.UCSD.EDU
University of California, San Diego
9500 Gilman Drive #0404
La Jolla, CA 92093-0404
Leonard Schulman    SCHULMAN@CALTECH.EDU
California Institute of Technology
1200 E. California Blvd., MC 256-80
Pasadena, CA 91125
Editor: Leslie Pack Kaelbling
Abstract
We show that, given data from a mixture of k well-separated spherical Gaussians in R^d, a simple two-round variant of EM will, with high probability, learn the parameters of the Gaussians to near-optimal precision, if the dimension is high (d ≫ ln k). We relate this to previous theoretical and empirical work on the EM algorithm.
Keywords: expectation maximization, mixtures of Gaussians, clustering, unsupervised learning, probabilistic analysis
1. Introduction
At present the expectation-maximization algorithm (Dempster, Laird, and Rubin, 1977; Wu, 1983) is the method of choice for learning mixtures of Gaussians. A series of theoretical and experimental studies over the past three decades have contributed to the collective intuition about this algorithm. We will reinterpret some of these results in the context of a new performance guarantee.
In brief, EM is a hill-climbing algorithm which starts with some initial estimate of the Gaussian mixture and then repeatedly changes this estimate so as to improve its likelihood, until it finally converges to a local optimum of the search space. It is well known to practitioners that the quality of the output can be influenced significantly by the manner in which EM is initialized. Another source of variation is that in practice it is quite common to add or remove Gaussians during the search process, according to various intuitively motivated criteria.
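For reference, here is a minimal sketch, in Python, of one EM iteration for a mixture of spherical Gaussians (our own illustration: the function name em_step and the variable names are ours, and the two-round variant analyzed in this paper adds the initialization and pruning steps discussed below).

```python
# Illustrative sketch (not the paper's exact procedure): one EM iteration for a
# mixture of k spherical Gaussians in R^d with parameters (means, variances, weights).
import numpy as np

def em_step(X, means, variances, weights):
    """X: (n, d) data; means: (k, d); variances: (k,); weights: (k,) numpy arrays."""
    n, d = X.shape

    # E-step: responsibilities r[i, j] = P(cluster j | point i), in log space for stability.
    sq_dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)        # (n, k)
    log_r = (np.log(weights)[None, :]
             - 0.5 * d * np.log(2 * np.pi * variances)[None, :]
             - sq_dists / (2 * variances[None, :]))
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: re-estimate mixing weights, means, and one spherical variance per cluster.
    cluster_mass = r.sum(axis=0)                                             # (k,)
    new_weights = cluster_mass / n
    new_means = (r.T @ X) / cluster_mass[:, None]
    new_sq_dists = ((X[:, None, :] - new_means[None, :, :]) ** 2).sum(axis=2)
    new_variances = (r * new_sq_dists).sum(axis=0) / (d * cluster_mass)
    return new_means, new_variances, new_weights
```

Each call to em_step never decreases the likelihood of the data, which is what makes EM a hill-climbing procedure in the sense described above.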
Given the enormous importance of Gaussian mixture models in applied statistics and machine learning, it is reasonable to wonder how good the EM algorithm is. Among other things, a rigorous analysis of its performance might shed light on some of the unresolved issues in its implementation, for instance initialization. It is in this spirit that we approach our analysis of EM.
A common form of theoretical study is worst-case analysis, which in this case would attempt to bound how far EM can deviate from the optimal log-likelihood, or perhaps the optimal set of mixture parameters. Such an analysis turns out to be trivial because, as is well known, EM's output can be arbitrarily far from optimal for either of the two criteria above (we will see such an
example in Section 2.3). Thus, this line of reasoning does not appear to yield any interesting insights into the algorithm's behavior.
In this paper, we perform what might be called a best-case analysis. We assume that the data are the best EM could possibly hope for: i.i.d. samples from a mixture of spherical Gaussians in R^d which are well separated from one another. At first glance, it seems that we are once again in danger of getting a trivial result, namely that EM will succeed without a hiccup. But this is not the case. In fact, we will see that even in this extremely optimistic setting, many common ways of initializing, and subsequently running, EM will make it fail dramatically. On the other hand, if EM is run in a particular manner which we specify, then within just two rounds, it will identify the correct mixture parameters with near-perfect accuracy.
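Concretely, the input this assumption describes looks like the following sketch (ours; the parameter values are arbitrary and for illustration only): i.i.d. draws from a mixture of k spherical Gaussians in R^d whose means are far apart on the scale of σ√d.

```python
# Sketch of the assumed input: i.i.d. samples from a mixture of k well-separated
# spherical Gaussians in R^d. All parameter values here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 5, 200, 10_000
sigma = 1.0                                    # common spherical standard deviation
scale = 4.0 * sigma * np.sqrt(d)               # separation scale for the means

weights = np.full(k, 1.0 / k)                  # mixing weights
means = rng.standard_normal((k, d))
means *= scale / np.linalg.norm(means, axis=1, keepdims=True)   # means far apart

labels = rng.choice(k, size=n, p=weights)      # latent cluster of each sample
X = means[labels] + sigma * rng.standard_normal((n, d))         # the data given to EM
```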
If the data is expected to have k clusters, it is common for practitioners to start EM with more than k centers, and to later prune some of these extra centers. We present a simple example to demonstrate exactly why this is necessary, and obtain an expression for the number of initial centers which should be used: at least (1/w_min) ln k, where w_min is a lower bound on the smallest mixing weight.
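A brief heuristic for this expression (our own sketch of the standard coupon-collector argument, not the paper's proof): if ℓ initial centers are drawn independently from the data, a cluster of mixing weight at least w_min receives none of them with probability at most (1 − w_min)^ℓ, so by a union bound over the k clusters,

```latex
% Heuristic sketch (ours): why on the order of (1/w_min) ln k initial centers are needed.
\Pr\bigl[\text{some cluster receives no initial center}\bigr]
  \;\le\; k\,(1 - w_{\min})^{\ell}
  \;\le\; k\,e^{-\ell\, w_{\min}}
  \;\le\; \delta
\qquad\text{whenever}\qquad
\ell \;\ge\; \frac{1}{w_{\min}}\,\ln\frac{k}{\delta}.
```

Taking the failure probability δ to be a small constant recovers the (1/w_min) ln k scale quoted above.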
The typical method of pruning is to remove Gaussian-estimates with very low mixing weight (known as starved clusters). Our theoretical analysis shows that this is not enough: there is another type of Gaussian-estimate, easy to detect, which also needs to be pruned. Specifically, it is possible (and frequently occurs in simulations) that two of EM's Gaussian-estimates share the same cluster, each with relatively high mixing weight. We present a very simple, provably correct method of detecting this situation and correcting it.
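As a rough illustration of these two pruning rules, consider the sketch below (ours: the starvation threshold and the closeness test are placeholders, not the provably correct criterion developed later in the paper).

```python
# Rough illustration (ours) of the two pruning rules discussed above: drop starved
# estimates, and collapse pairs of estimates that appear to share one cluster.
# The threshold and the closeness test are placeholders, not the paper's criterion.
import numpy as np

def prune_estimates(means, variances, weights, starvation_threshold, closeness_factor=0.5):
    """means: (k, d); variances: (k,); weights: (k,) numpy arrays."""
    k, d = means.shape
    keep = weights >= starvation_threshold          # rule 1: remove starved estimates

    # Rule 2: if two surviving estimates are much closer to each other than the
    # radius (~ sigma * sqrt(d)) of the clusters they model, keep only one of them.
    for i in range(k):
        for j in range(i + 1, k):
            if keep[i] and keep[j]:
                dist = np.linalg.norm(means[i] - means[j])
                radius = np.sqrt(d * max(variances[i], variances[j]))
                if dist < closeness_factor * radius:
                    keep[j if weights[j] <= weights[i] else i] = False

    return means[keep], variances[keep], weights[keep]
```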
It is widely recognized that a crucial issue in the performance of EM is the choice of initial parameters. For the means, we use the popular technique of selecting them randomly from the data set. This is shown to be adequate for the performance guarantee we derive. Our analysis also makes it clear that it is vitally important to pick good initial estimates of the covariances, a subject which has received somewhat less attention. We use an initializer whose origin we are unable to trace but which is mentioned in Bishop's text (1995).
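In code, the mean-initialization step amounts to sampling centers uniformly at random, without replacement, from the data set; the spherical-variance formula in the sketch below (half the squared distance to the nearest other center, divided by the dimension) is a stand-in of our own, since the precise initializer taken from Bishop's text is specified later in the paper.

```python
# Initialization sketch: initial means drawn uniformly at random from the data.
# The variance formula below is a stand-in, not necessarily the initializer used
# in the paper, which is specified in later sections.
import numpy as np

def initialize(X, num_centers, rng):
    n, d = X.shape
    means = X[rng.choice(n, size=num_centers, replace=False)]

    sq_dists = ((means[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq_dists, np.inf)
    variances = sq_dists.min(axis=1) / (2 * d)          # stand-in spherical variances

    weights = np.full(num_centers, 1.0 / num_centers)   # uniform initial mixing weights
    return means, variances, weights
```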
Our analysis is focused on the case when the data are high-dimensional (d ≫ ln k), and it brings out some interesting qualitative differences from the low-dimensional case. In particular, it is common to think of EM as assigning "soft" cluster memberships in which each data point is not definitively assigned to a single cluster but, rather, is split between clusters according to its relative proximities to them. Moreover, this is sometimes quoted as a source of EM's effectiveness: that these soft memberships allow cluster boundaries to shift in a smooth and stable manner. In the optimistic high-dimensional scenario we analyze, the behavior is quite different, in that cluster memberships are effectively always "hard". This is because the distances are so large that for any two clusters, there is only a small part of the space in which there is any ambiguity of assignment, and it is unlikely that data points would lie in these zones. Moreover, the phenomenon of smooth transitions between clusterings is nonexistent. Instead, EM quickly snaps into one of several trajectories (loosely defined), and thereafter heads to a nearby local optimum. Initialization is of supreme importance: once EM has snapped into the wrong trajectory, all is lost.
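A toy calculation of our own (not from the paper) makes the "effectively hard" behavior concrete: for a point drawn from one of two equally weighted, unit-variance spherical clusters whose means are c√d apart, the posterior weight of the other cluster is of order e^{−c²d/2}, which is vanishingly small even for modest c and d.

```python
# Toy illustration (ours): in high dimension, soft memberships are effectively hard.
# Two equally weighted unit-variance spherical clusters in R^d, means c*sqrt(d) apart;
# a point is drawn from the first cluster.
import numpy as np

rng = np.random.default_rng(1)
d, c = 100, 2.0
mu1 = np.zeros(d)
mu2 = np.zeros(d)
mu2[0] = c * np.sqrt(d)

x = mu1 + rng.standard_normal(d)       # sample from the first cluster

# log P(cluster 2 | x), computed in log space to avoid underflow (equal mixing weights).
log_odds = 0.5 * (np.sum((x - mu1) ** 2) - np.sum((x - mu2) ** 2))   # log N2(x) - log N1(x)
log_resp2 = log_odds - np.logaddexp(0.0, log_odds)
print(log_resp2)                       # roughly -c**2 * d / 2, i.e. about -200
```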
1.1 Relation to Previous Work on EM
A standard criticism of EM is that it converges slowly. Simulations performed by Redner and Walker (1984), and others, demonstrate this decisively for one-dimensional mixtures of two Gaussians. It is also known that given data from a mixture of Gaussians, when EM gets close to the
true solution, it exhibits first-order convergence (Titterington, Smith, and Makov, 1985). Roughly speaking, the idea is this: given m data points from a mixture with parameters (means, covariances, and mixing weights) θ, where m is very large, the likelihood has a local maximum at some set of parameters θ_m close to θ. Let θ^⟨t⟩ denote EM's parameter-estimates at time t. It can be shown (cf. Taylor expansion) that when θ^⟨t⟩ is near θ_m,

‖θ^⟨t+1⟩