Journal of Machine Learning Research 8 (2007) 203-226    Submitted 10/05; Revised 9/06; Published 2/07
A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
Sanjoy Dasgupta    DASGUPTA@CS.UCSD.EDU
University of California, San Diego
9500 Gilman Drive #0404
La Jolla, CA 92093-0404

Leonard Schulman    SCHULMAN@CALTECH.EDU
California Institute of Technology
1200 E. California Blvd., MC 256-80
Pasadena, CA 91125

Editor: Leslie Pack Kaelbling
Abstract

We show that, given data from a mixture of k well-separated spherical Gaussians in $\mathbb{R}^d$, a simple two-round variant of EM will, with high probability, learn the parameters of the Gaussians to near-optimal precision, if the dimension is high ($d \gg \ln k$). We relate this to previous theoretical and empirical work on the EM algorithm.

Keywords: expectation maximization, mixtures of Gaussians, clustering, unsupervised learning, probabilistic analysis
1. Introduction

At present the expectation-maximization algorithm (Dempster, Laird, and Rubin, 1977; Wu, 1983) is the method of choice for learning mixtures of Gaussians. A series of theoretical and experimental studies over the past three decades have contributed to the collective intuition about this algorithm. We will reinterpret some of these results in the context of a new performance guarantee.
In brief, EM is a hill-climbing algorithm which starts with some initial estimate of the Gaussian mixture and then repeatedly changes this estimate so as to improve its likelihood, until it finally converges to a local optimum of the search space. It is well known to practitioners that the quality of the output can be influenced significantly by the manner in which EM is initialized. Another source of variation is that in practice it is quite common to add or remove Gaussians during the search process, according to various intuitively motivated criteria.
Given the enormous importance of Gaussian mixture models in applied statistics and machine learning, it is reasonable to wonder how good the EM algorithm is. Among other things, a rigorous analysis of its performance might shed light on some of the unresolved issues in its implementation, such as initialization. It is in this spirit that we approach our analysis of EM.
A common form of theoretical study is worst-case analysis, which in this case would attempt to bound how far EM can deviate from the optimal log-likelihood, or perhaps from the optimal set of mixture parameters. Such an analysis turns out to be trivial because, as is well known, EM's output can be arbitrarily far from optimal for either of the two criteria above (we will see such an
example in Section 2.3). Thus, this line of reasoning does not appear to yield any interesting insights into the algorithm's behavior.
In this paper, we perform what might be called a best-case analysis. We assume that the data are the best EM could possibly hope for: i.i.d. samples from a mixture of spherical Gaussians in $\mathbb{R}^d$ which are well separated from one another. At first glance, it seems that we are once again in danger of getting a trivial result, namely that EM will succeed without a hiccup. But this is not the case. In fact, we will see that even in this extremely optimistic setting, many common ways of initializing, and subsequently running, EM will make it fail dramatically. On the other hand, if EM is run in a particular manner which we specify, then within just two rounds it will identify the correct mixture parameters with near-perfect accuracy.
If the data are expected to have k clusters, it is common for practitioners to start EM with more than k centers, and to later prune some of these extra centers. We present a simple example to demonstrate exactly why this is necessary, and obtain an expression for the number of initial centers which should be used: at least $\frac{1}{w_{\min}} \ln k$, where $w_{\min}$ is a lower bound on the smallest mixing weight.
The typical method of pruning is to remove Gaussian-estimates with very low mixing weight (known as starved clusters). Our theoretical analysis shows that this is not enough: there is another type of Gaussian-estimate, easy to detect, which also needs to be pruned. Specifically, it is possible (and frequently occurs in simulations) that two of EM's Gaussian-estimates share the same cluster, each with relatively high mixing weight. We present a very simple, provably correct method of detecting this situation and correcting it.
It is widely recognized that a crucial issue in the performance of EM is the choice of initial parameters. For the means, we use the popular technique of selecting them randomly from the data set. This is shown to be adequate for the performance guarantee we derive. Our analysis also makes it clear that it is vitally important to pick good initial estimates of the covariances, a subject which has received somewhat less attention. We use an initializer whose origin we are unable to trace but which is mentioned in Bishop's text (1995).
Our analysis is focused on the case when the data are high-dimensional ($d \gg \ln k$), and it brings out some interesting qualitative differences from the low-dimensional case. In particular, it is common to think of EM as assigning "soft" cluster memberships, in which each data point is not definitively assigned to a single cluster but, rather, is split between clusters according to its relative proximities to them. Moreover, this is sometimes quoted as a source of EM's effectiveness: these soft memberships allow cluster boundaries to shift in a smooth and stable manner. In the optimistic high-dimensional scenario we analyze, the behavior is quite different, in that cluster memberships are effectively always "hard". This is because the distances are so large that for any two clusters, there is only a small part of the space in which there is any ambiguity of assignment, and it is unlikely that data points would lie in these zones. Moreover, the phenomenon of smooth transitions between clusterings is nonexistent. Instead, EM quickly snaps into one of several trajectories (loosely defined), and thereafter heads to a nearby local optimum. Initialization is of supreme importance: once EM has snapped into the wrong trajectory, all is lost.
1.1 Relation to Previous Work on EM
A standard criticism of EM is that it converges slowly. Simulations performed by Redner and Walker (1984), and others, demonstrate this decisively for one-dimensional mixtures of two Gaussians. It is also known that given data from a mixture of Gaussians, when EM gets close to the
true solution, it exhibits first-order convergence (Titterington, Smith, and Makov, 1985). Roughly speaking, the idea is this: given $m$ data points from a mixture with parameters (means, covariances, and mixing weights) $\theta^*$, where $m$ is very large, the likelihood has a local maximum at some set of parameters $\theta_m$ close to $\theta^*$. Let $\theta^{\langle t \rangle}$ denote EM's parameter-estimates at time $t$. It can be shown (cf. Taylor expansion) that when $\theta^{\langle t \rangle}$ is near $\theta_m$,
$$\|\theta^{\langle t+1 \rangle} - \theta_m\| \;\le\; \lambda \cdot \|\theta^{\langle t \rangle} - \theta_m\|,$$
where $\lambda \in [0, 1)$ and $\|\cdot\|$ is some norm.[1] If the Gaussians are closely packed then $\lambda$ is close to one; if they are very far from one another then $\lambda$ is close to zero.

[1] This might not seem so bad, but contrast it with second-order convergence, in which $\|\theta^{\langle t+1 \rangle} - \theta_m\| \le \lambda\,\|\theta^{\langle t \rangle} - \theta_m\|^2$.
Xu and Jordan (1995) present theoretical results which mitigate some of the pessimism of first-order convergence, particularly in the case of well-separated mixtures, and they note that, moreover, near-optimal log-likelihood is typically reached in just a few iterations.
We also argue in favor of EM, but in a different way. We ask: how close does $\theta^{\langle t \rangle}$ have to be to $\theta_m$ for slow convergence to hold? Let $D(\theta_1, \theta_2)$ denote the maximum Euclidean distance between the respective means of $\theta_1$ and $\theta_2$. For one-dimensional data, it can be seen quite easily from canonical experiments (Redner and Walker, 1984) that convergence is slow even if $D(\theta^{\langle t \rangle}, \theta^*)$ is large. However, our results suggest that this no longer holds in higher dimension.
For reasonably well-separated spherical Gaussians in $\mathbb{R}^d$ (where separation is defined precisely in the next section), convergence is very fast until $D(\theta^{\langle t \rangle}, \theta^*) \le e^{-\Omega(d)}$. In fact, we can make EM attain this accuracy in just two rounds. The error $e^{-\Omega(d)}$ is so minuscule for large $d$ that subsequent improvements are not especially important.
At a high level, previous analyses of EM have typically adopted an optimization-based viewpoint: they have studied EM by studying the objective function (log-likelihood) that it is ostensibly optimizing. A typical tool in this kind of analysis is to perform a Taylor expansion of the log-likelihood in the vicinity of a local optimum, assuming the data are i.i.d. draws from a mixture of Gaussians, and thereby to get insight into the speed at which EM is likely to move when it gets close to this optimum.
A major drawback of this approach is that it only addresses what happens close to convergence. It cannot, for instance, give intuition about how to initialize EM, or about whether a local optimum of high quality is attained.
In contrast, we perform a probabilistic analysis. We also assume that the data are i.i.d. draws from a mixture of Gaussians, but we focus upon the actual algorithm and ignore the likelihood function altogether. We ask: what will the algorithm do in step one, with high probability over the choice of data? In step two? And so on. This enables us to address issues of initialization and global optimality.
1.2 Results
Performance guarantees for clustering will inevitably involve some notion of the separation between different clusters. There are at least two natural ways of defining this. Take for simplicity the case of two d-dimensional Gaussians $N(\mu_1, I_d)$ and $N(\mu_2, I_d)$. If each coordinate (attribute) provides a little bit of discriminative information between the two clusters, then on each coordinate $\mu_1$ and $\mu_2$ differ by at least some small amount, say $\delta$. The $L_2$ distance between $\mu_1$ and $\mu_2$ is then at least $\delta\sqrt{d}$. As further attributes are added, the distance between the centers grows, and the two clusters become more clearly distinguishable from one another. This is the usual rationale for using high-dimensional data: the higher the dimension, the easier (in an information-theoretic sense) clustering
should be. The only problem, then, is whether there are algorithms which can efficiently exploit the tradeoff between this high information content and the curse of dimensionality. This viewpoint suggests that the Euclidean distance between the centers of d-dimensional clusters can reasonably be measured in units of $\sqrt{d}$, and that it is most important to develop algorithms which work well under the assumption that this distance is some constant times $\sqrt{d}$.
On the other hand, it should be pointed out that if $\|\mu_1 - \mu_2\| = \delta\sqrt{d}$ for some constant $\delta > 0$, then for large $d$ the overlap in probability mass between the two Gaussians is minuscule, exponentially small in $d$. Therefore, it should not only be interesting but also possible to develop algorithms which work well when the $L_2$ distance between centers of clusters is some constant, regardless of the dimension (as opposed to a constant times $\sqrt{d}$).
Where does EM fall in this spectrum of separation? It lies somewhere in between: we show that it works well when the distance between d-dimensional clusters is bigger than $d^{1/4}$.
Our central performance guarantee requires that the clusters actually look spherical-Gaussian, more specifically that the data points are drawn i.i.d. from some (unknown) mixture of spherical Gaussians, which could potentially have different variances. We show that if the clusters are reasonably well-separated (in the sense we just defined), and if the dimension $d \gg \ln k$, then only two rounds of EM are required to learn the mixture to within near-optimal precision, with high probability $1 - k^{-\Omega(1)}$. Our measure of accuracy is the function $D(\cdot, \cdot)$ introduced above.
The precise statement of the theorem can be found in the next section, and it applies not only to EM but also to other similar schemes, including for instance some of the variants of EM and k-means introduced by Kearns, Mansour, and Ng (1997).
Several recent papers have suggested alternative algorithms for learning the parameters of a mixture of well-separated Gaussians (or other distributions with similar concentration properties), given data drawn i.i.d. from that distribution. The first in this series, by Dasgupta (1999), requires the Gaussians to be "sphere-like" and separated by a distance of $\Omega(\sqrt{d})$. Arora and Kannan (2004) handle more general Gaussians, and reduce the separation requirement to $\Omega(d^{1/4})$ in the spherical case (as in this paper). Vempala and Wang (2004) use spectral projection to bring the separation constraint for spherical Gaussians down to just $\Omega((k \log d)^{1/4})$, where k is the number of clusters. Vempala, Kannan, and Salmasian (2005) and Achlioptas and McSherry (2005) give extensions of these latter results to ellipsoidal clusters.
The last three results are especially relevant to the current paper because the amount of intercluster separation we require can likely be substantially reduced if spectral projection is used as a preprocessing step.
In the final section of the paper, we discuss a crucial issue: what features of our main assumption (that the clusters are high-dimensional Gaussians) make our guarantees for EM possible? This assumption is also the basis of the other theoretical results mentioned above, but can real data sets reasonably be expected to satisfy it? If not, in what way can it usefully be relaxed?
2. Statement of Results

To motivate our model, we start by examining some properties of high-dimensional Gaussians.
2.1 High-dimensional Gaussians

A spherical Gaussian $N(\mu, \sigma^2 I_d)$ assigns to point $x \in \mathbb{R}^d$ the density
$$p(x) = \frac{1}{(2\pi)^{d/2}\sigma^d} \exp\left(-\frac{\|x - \mu\|^2}{2\sigma^2}\right),$$
$\|\cdot\|$ being Euclidean distance. If $X = (X_1, \ldots, X_d)$ is randomly chosen from $N(0, \sigma^2 I_d)$ then its coordinates are i.i.d. $N(0, \sigma^2)$ random variables. Each coordinate has expected squared value $\sigma^2$, so $E\|X\|^2 = E(X_1^2 + \cdots + X_d^2) = \sigma^2 d$. It then follows by a large deviation bound that $\|X\|^2$ will be tightly concentrated around $\sigma^2 d$:
$$P\big(\big|\,\|X\|^2 - \sigma^2 d\,\big| > \varepsilon\sigma^2 d\big) \;\le\; e^{-d\varepsilon^2/8}.$$
This bound and others like it will be proved in Section 3. It means that almost the entire probability mass of $N(0, \sigma^2 I_d)$ lies in a thin shell at a radius of about $\sigma\sqrt{d}$ from the origin. The density of the Gaussian is highest at the origin; however, the surface area at distance $r$ from the origin, $0 \le r \le \sigma\sqrt{d}$, increases faster than the density at distance $r$ decreases (Bishop, 1995, exercise 1.4). It is natural therefore to think of a Gaussian $N(\mu, \sigma^2 I_d)$ as having radius $\sigma\sqrt{d}$. We say two Gaussians $N(\mu_1, \sigma_1^2 I_d)$ and $N(\mu_2, \sigma_2^2 I_d)$ in $\mathbb{R}^d$ are c-separated if
$$\|\mu_1 - \mu_2\| \;\ge\; c\,\max\{\sigma_1, \sigma_2\}\,\sqrt{d},$$
that is, if they are c radii apart. A mixture of Gaussians is c-separated if the Gaussians in it are pairwise c-separated.
In general we will let $c_{ij}$ denote the separation between the $i$th and $j$th Gaussians, and $c = \min_{i \ne j} c_{ij}$. We can reasonably expect that the difficulty of learning a mixture of Gaussians increases as c decreases.
A 2-separated mixture of Gaussians contains clusters with almost no overlap. For large $d$, this is true even of a $\frac{1}{100}$-separated mixture because, for instance, the two balls $B(0, \sqrt{d})$ and $B(\frac{1}{100}\sqrt{d}, \sqrt{d})$ in $\mathbb{R}^d$ share only a tiny fraction of their volume.
One useful way of thinking about a pair of c-separated Gaussians is to imagine that on each coordinate their means differ by c. If c is small, then the projection of the mixture onto any one coordinate will look unimodal. This might also be true of a projection onto a few coordinates. But for large d, when all coordinates are considered together, the distribution will cease to be unimodal. This is precisely the reason for using high-dimensional data.
What values of c can be expected of real-world data sets? This will vary from case to case. As an example, we analyzed a canonical data set consisting of handwritten digits collected by USPS. Each digit is represented as a vector in $[-1, 1]^{256}$. We fit a mixture of ten (non-spherical) Gaussians to this data set, by fitting each digit separately, and found that it was 0.63-separated.
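To make the definition of c-separation concrete, here is a minimal sketch (Python with NumPy; our illustration, not part of the paper) of how one might compute the pairwise separations $c_{ij}$ and the overall separation $c$ for a fitted spherical mixture. The function name and the example parameters are ours.

```python
import numpy as np

def pairwise_separation(means, sigmas):
    """Return c = min_{i != j} c_ij, where
    c_ij = ||mu_i - mu_j|| / (max(sigma_i, sigma_j) * sqrt(d))."""
    means = np.asarray(means, dtype=float)    # shape (k, d)
    sigmas = np.asarray(sigmas, dtype=float)  # shape (k,), spherical std devs
    k, d = means.shape
    c = np.inf
    for i in range(k):
        for j in range(i + 1, k):
            dist = np.linalg.norm(means[i] - means[j])
            c_ij = dist / (max(sigmas[i], sigmas[j]) * np.sqrt(d))
            c = min(c, c_ij)
    return c

# Example: two unit-variance Gaussians in d = 256 dimensions whose means
# differ by 0.1 on every coordinate are 0.1-separated.
d = 256
mu1 = np.zeros(d)
mu2 = 0.1 * np.ones(d)
print(pairwise_separation([mu1, mu2], [1.0, 1.0]))  # ~0.1
```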
2.2 The EM Algorithm

A mixture of k spherical Gaussians in $\mathbb{R}^d$ is specified by a set of mixing weights $w_i$ (which sum to one and represent the proportions in which the various Gaussians are present) and by the individual Gaussian means $\mu_i \in \mathbb{R}^d$ and variances $\sigma_i^2$.
Given a data set $S \subset \mathbb{R}^d$, the EM algorithm works by first choosing starting values $w_i^{\langle 0 \rangle}, \mu_i^{\langle 0 \rangle}, \sigma_i^{\langle 0 \rangle}$ for the parameters, and then updating them iteratively according to the following two steps (at time $t$).
E step. Let $\tau_i \sim N(\mu_i^{\langle t \rangle}, (\sigma_i^{\langle t \rangle})^2 I_d)$ denote the density of the $i$th Gaussian-estimate. For each data point $x \in S$, and each $1 \le i \le k$, compute
$$p_i^{\langle t+1 \rangle}(x) \;=\; \frac{w_i^{\langle t \rangle}\,\tau_i(x)}{\sum_j w_j^{\langle t \rangle}\,\tau_j(x)},$$
the conditional probability that $x$ comes from the $i$th Gaussian with respect to the current estimated parameters.
M step. Now update the various parameters in an intuitive way. Let $m$ be the size of $S$.
$$w_i^{\langle t+1 \rangle} = \frac{1}{m} \sum_{x \in S} p_i^{\langle t+1 \rangle}(x),$$
$$\mu_i^{\langle t+1 \rangle} = \frac{1}{m\,w_i^{\langle t+1 \rangle}} \sum_{x \in S} x\, p_i^{\langle t+1 \rangle}(x),$$
$$(\sigma_i^{\langle t+1 \rangle})^2 = \frac{1}{m\,w_i^{\langle t+1 \rangle}\,d} \sum_{x \in S} \|x - \mu_i^{\langle t+1 \rangle}\|^2\, p_i^{\langle t+1 \rangle}(x).$$
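For concreteness, the following sketch (Python with NumPy; ours, not part of the paper) implements the E step and M step above for spherical Gaussians, computing the soft assignments in log space for numerical stability. All function and variable names are ours.

```python
import numpy as np

def log_spherical_density(X, mu, sigma2):
    """Log density of N(mu, sigma2 * I_d) at each row of X."""
    d = X.shape[1]
    sq = np.sum((X - mu) ** 2, axis=1)
    return -0.5 * d * np.log(2 * np.pi * sigma2) - sq / (2 * sigma2)

def em_step(X, w, mu, sigma2):
    """One EM iteration for a mixture of spherical Gaussians.
    X: (m, d) data; w: (k,) weights; mu: (k, d) means; sigma2: (k,) variances."""
    m, d = X.shape
    k = len(w)
    # E step: soft assignments p_i(x), computed in log space for stability.
    log_p = np.stack([np.log(w[i]) + log_spherical_density(X, mu[i], sigma2[i])
                      for i in range(k)], axis=1)          # (m, k)
    log_p -= log_p.max(axis=1, keepdims=True)
    p = np.exp(log_p)
    p /= p.sum(axis=1, keepdims=True)                       # responsibilities
    # M step: re-estimate weights, means, and spherical variances.
    w_new = p.mean(axis=0)                                   # (k,)
    mu_new = (p.T @ X) / (m * w_new)[:, None]                # (k, d)
    sq = np.stack([np.sum((X - mu_new[i]) ** 2, axis=1) for i in range(k)], axis=1)
    sigma2_new = np.sum(p * sq, axis=0) / (m * w_new * d)    # (k,)
    return w_new, mu_new, sigma2_new
```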
2.3 The Main Issues

We now give a high-level description of some fundamental issues that arise in our analysis of EM, and that dictate our key design decisions.
2.3.1 Tight Concentration of Interpoint Distances

In a high-dimensional space $\mathbb{R}^d$, the distances between data points, whether sampled from the same Gaussian or from different Gaussians, are tightly concentrated around their expected values. In particular, if the Gaussians happen to have the same variance $\sigma^2 I_d$, and if the distances between their centers are $\Omega(\sigma d^{1/4})$, then the chance that two points from different Gaussians are closer together than two points from the same Gaussian is tiny, $e^{-\Omega(\mathrm{poly}(d))}$. Therefore, an examination of interpoint distances is enough to almost perfectly cluster the data. A variety of different algorithms will work well under these circumstances, and EM is no exception.
What if the Gaussians have different variances $\sigma_i$? Once again, interpoint distances are close to their expected values, but now a new complication is introduced. If a small-variance cluster B is close to the center of a larger-variance cluster A, then it is quite possible for points $x \in A$ to be closer to all of B than to any other point in A:
(Figure: a large cluster A containing a small cluster B near its center, with a point $x \in A$ that is closer to every point of B than to any other point of A.)
We expressly rule out this case by requiring the separation between any two clusters i and j to satisfy
$$\|\mu_i - \mu_j\|^2 \;\ge\; |\sigma_i^2 - \sigma_j^2|\, d.$$
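The claim about interpoint distances is easy to check numerically. The sketch below (Python with NumPy; our illustration, with arbitrary example parameters) draws two equal-variance spherical Gaussians whose centers are a few multiples of $\sigma d^{1/4}$ apart and estimates how often a between-cluster pair of points is closer than a within-cluster pair; the estimated probability shrinks rapidly as $d$ and the separation constant grow.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n = 500, 1.0, 10_000
sep = 4 * sigma * d ** 0.25          # center distance on the sigma * d^(1/4) scale

mu1 = np.zeros(d)
mu2 = np.zeros(d)
mu2[0] = sep

a1 = rng.normal(mu1, sigma, size=(n, d))   # two independent points from cluster 1
a2 = rng.normal(mu1, sigma, size=(n, d))
b = rng.normal(mu2, sigma, size=(n, d))    # one point from cluster 2

within = np.linalg.norm(a1 - a2, axis=1)
between = np.linalg.norm(a1 - b, axis=1)

print("typical within-cluster distance :", np.median(within))   # about sigma * sqrt(2d)
print("typical between-cluster distance:", np.median(between))  # about sqrt(2*d*sigma^2 + sep^2)
print("P(between < within), estimated  :", np.mean(between < within))
```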
2.3.2 Initialization

Suppose the true number of Gaussians, k, is known. Let $S$ denote the entire data set, with $S_i$ being the points drawn from the $i$th true Gaussian $N(\mu_i, \sigma^2 I_d)$. A common way to initialize EM is to pick $l$ data points at random from $S$, and to use these as initial center-estimates $\mu_i^{\langle 0 \rangle}$. How large should $l$ be? It turns out that if these $l$ points include at least one point from each $S_i$, then EM can be made to perform well. This suggests $l = \Omega(k \ln k)$. Conversely, if the initial centers miss some $S_i$, then EM might perform poorly.
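As a quick sanity check on this coverage requirement, here is a small simulation (Python with NumPy; ours, not from the paper) estimating the probability that $l$ points drawn from a mixture with $k$ equally weighted clusters include at least one point from every cluster.

```python
import numpy as np

def coverage_probability(k, l, trials=10_000, seed=0):
    """Estimate P(all k equally weighted clusters appear among l random draws)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        labels = rng.integers(0, k, size=l)
        if len(np.unique(labels)) == k:
            hits += 1
    return hits / trials

k = 20
for l in (k, int(k * np.log(k)), 2 * int(k * np.log(k))):
    print(l, coverage_probability(k, l))
# l = k almost always misses some cluster; l around k ln k already covers
# every cluster with substantial probability, matching l = Omega(k ln k).
```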
Here is a concrete example (Figure 1). Let $d$ denote some high dimension, and place the k true Gaussians $N(\mu_1, I_d), \ldots, N(\mu_k, I_d)$ side by side in a line, leaving a distance of at least $3\sqrt{d}$ between consecutive Gaussians. Assign them equal mixing weights. As before, let $S_i$ be the data points from the $i$th Gaussian, and choose EM's initial center-estimates from the data. Suppose the initial centers contain nothing from $S_1$, one point from $S_2$, and at least one point from $S_3$. The probability of this event is at least some constant. Then no matter how long EM is run, it will assign just one Gaussian-estimate to the first two true Gaussians. In the first round of EM, the point from $S_2$ (call it $\mu_1^{\langle 0 \rangle}$) will move between $\mu_1$ and $\mu_2$. It will stay there, right between the two true centers. None of the other center-estimates $\mu_i^{\langle t \rangle}$ will ever come closer to $\mu_2$; their distance from it is so large that their influence is overwhelmed by that of $\mu_1^{\langle t \rangle}$. This argument can be formalized easily using the large deviation bounds that we will introduce in the next section.
Figure 1: For this mixture, the positions of the center-estimates do not move much after the first step of EM. (The figure shows the true centers $\mu_1, \mu_2, \mu_3$ on a line, together with the initial center-estimates $\mu_1^{\langle 0 \rangle}, \mu_2^{\langle 0 \rangle}, \mu_3^{\langle 0 \rangle}$ and their positions $\mu_1^{\langle 1 \rangle}, \mu_2^{\langle 1 \rangle}, \mu_3^{\langle 1 \rangle}$ after one round.)
How about the initial choice of variance? In the case when the Gaussians have the same spherical covariance, this is not all that important, except that a huge overestimate might cause slower convergence. In the case when the Gaussians have different variances, however, the initial estimates are crucially important, and so we will use a fairly precise estimator, a variant of which is mentioned in Bishop's text (1995).
2.3.3 After the First Round of EM

After one round of EM, the center-estimates are pruned to leave exactly one per true Gaussian. This is accomplished in a simple manner. First, remove any center-estimates with very low mixing weight (this is often called "cluster starvation"). Any remaining center-estimate (originally chosen, say, from $S_i$) has relatively high mixing weight, and we can show that, as a result of the first EM iteration, it will have moved close to $\mu_i$ (Figure 2). A trivial clustering heuristic, due to Gonzalez (1985), is then good enough to select one center-estimate near each $\mu_i$.

Figure 2: The large circles are true clusters, while the dots (solid or hollow) are EM's center-estimates. Left: after initialization, we have at least one center-estimate per true cluster. Right: after the first round of EM, each center-estimate has either been "starved" (shown as hollow) or has moved closer to the corresponding true center.
With exactly one center-estimate per (true) Gaussian, a second iteration of EM will accurately retrieve the means, variances, and mixing weights. In fact, the clustering of the data (the fractional labels assigned by EM) will be almost perfect; that is to say, each fractional label will be close to zero or one, and will in almost all cases correctly identify the generating Gaussian. Therefore further iterations will not help much: these additional iterations will move the center-estimates around by at most $e^{-\Omega(d)}$.
2.4 A Two-round Variant of EM

Here is a summary of the modified algorithm, given $m$ data points in $\mathbb{R}^d$ which have been generated by a mixture of k Gaussians. The value of $l$ will be specified later; for the time being it can be thought of as $O(k \ln k)$.
Initialization. Pick $l$ data points at random as starting estimates $\mu_i^{\langle 0 \rangle}$ for the Gaussian centers. Assign them identical mixing weights $w_i^{\langle 0 \rangle} = 1/l$. For initial estimates of the variances use
$$(\sigma_i^{\langle 0 \rangle})^2 \;=\; \frac{1}{2d} \min_{j \ne i} \|\mu_i^{\langle 0 \rangle} - \mu_j^{\langle 0 \rangle}\|^2.$$
EM. Run one round of EM. This yields modified estimates $w_i^{\langle 1 \rangle}, \mu_i^{\langle 1 \rangle}, \sigma_i^{\langle 1 \rangle}$.
Pruning. Remove all center-estimates whose mixing weights are below $w_T = \frac{1}{4l}$. Then, prune the remaining center-estimates down to just k by using the following adaptation of an algorithm of Gonzalez (1985):

- Compute distances between center-estimates:
$$d(\mu_i^{\langle 1 \rangle}, \mu_j^{\langle 1 \rangle}) \;=\; \frac{\|\mu_i^{\langle 1 \rangle} - \mu_j^{\langle 1 \rangle}\|}{\sigma_i^{\langle 1 \rangle} + \sigma_j^{\langle 1 \rangle}}.$$

- Choose one of these centers arbitrarily.

- Pick the remaining $k - 1$ iteratively as follows: pick the center farthest from the ones picked so far. (The distance from a point $x$ to a set $S$ is $\min_{y \in S} d(x, y)$.)
Call the resulting center-estimates $\tilde\mu_i^{\langle 1 \rangle}$ (where $1 \le i \le k$), and let $(\tilde\sigma_i^{\langle 1 \rangle})^2$ be the corresponding variances. Set the mixing weights to $\tilde w_i^{\langle 1 \rangle} = 1/k$.
EM. Run one more step of EM, starting at the $\{\tilde w_i^{\langle 1 \rangle}, \tilde\mu_i^{\langle 1 \rangle}, \tilde\sigma_i^{\langle 1 \rangle}\}$ parameters and yielding the final estimates $w_i^{\langle 2 \rangle}, \mu_i^{\langle 2 \rangle}, \sigma_i^{\langle 2 \rangle}$.
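Putting the steps together, here is a compact sketch of the two-round procedure (Python with NumPy; our rendering of the summary above, not the authors' code). It reuses the em_step sketch from Section 2.2, assumes at least k center-estimates survive the starvation threshold, and leaves the choice of l to the caller.

```python
import numpy as np

# Assumes em_step(X, w, mu, sigma2) from the sketch in Section 2.2 is in scope.

def two_round_em(X, k, l, rng):
    """Sketch of the two-round variant: random initialization, one EM round,
    pruning (starvation threshold + Gonzalez-style farthest-point selection),
    then one final EM round."""
    m, d = X.shape

    # Initialization: l random data points as centers, equal weights 1/l,
    # variances sigma_i^2 = (1/2d) * min_{j != i} ||mu_i - mu_j||^2.
    mu = X[rng.choice(m, size=l, replace=False)]
    w = np.full(l, 1.0 / l)
    gaps = np.linalg.norm(mu[:, None, :] - mu[None, :, :], axis=-1)
    np.fill_diagonal(gaps, np.inf)
    sigma2 = gaps.min(axis=1) ** 2 / (2 * d)

    # First round of EM.
    w, mu, sigma2 = em_step(X, w, mu, sigma2)

    # Pruning, step 1: drop starved estimates (mixing weight below w_T = 1/(4l)).
    keep = w >= 1.0 / (4 * l)
    w, mu, sigma2 = w[keep], mu[keep], sigma2[keep]

    # Pruning, step 2: Gonzalez-style selection of k centers under the
    # weighted distance d(i, j) = ||mu_i - mu_j|| / (sigma_i + sigma_j).
    sigma = np.sqrt(sigma2)
    dist = np.linalg.norm(mu[:, None, :] - mu[None, :, :], axis=-1)
    dist /= sigma[:, None] + sigma[None, :]
    chosen = [0]                        # an arbitrary first center
    for _ in range(k - 1):
        d_to_chosen = dist[:, chosen].min(axis=1)
        chosen.append(int(np.argmax(d_to_chosen)))
    mu, sigma2 = mu[chosen], sigma2[chosen]
    w = np.full(k, 1.0 / k)

    # Second (final) round of EM.
    return em_step(X, w, mu, sigma2)
```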
2.5 The Main Result

Now that the notation and algorithm have been introduced, we can state the main theorem.
Theorem 1  Say m data points are generated from a mixture of k Gaussians in $\mathbb{R}^d$,
$$w_1 N(\mu_1, \sigma_1^2 I_d) + \cdots + w_k N(\mu_k, \sigma_k^2 I_d),$$
where the intercenter distances satisfy the inequality $\|\mu_i - \mu_j\|^2 \ge |\sigma_i^2 - \sigma_j^2|\, d$. Define $c_{ij}$ to be the separation between the ith and jth Gaussians, that is, $\|\mu_i - \mu_j\| = c_{ij}\max(\sigma_i, \sigma_j)\sqrt{d}$, and $c = \min_{i \ne j} c_{ij}$ to be the overall separation. Let $S_i$ denote the points from the ith Gaussian, and let $w_{\min} = \min_i w_i$. For any $\delta > 0$ and $\varepsilon > 0$, if

1. parameter $l = \Omega\big(\frac{1}{w_{\min}} \ln \frac{1}{\delta w_{\min}}\big)$,
2. dimension $d = \Omega\big(\max(1, c^{-4}) \ln \frac{\max(1, c^{-4})}{\delta w_{\min}}\big)$,
3. number of samples $m = \Omega(l \max(1, c^{-2}))$,
4. separation $c^2 d = \Omega\big(\ln \frac{1}{\varepsilon w_{\min}}\big)$,

then with probability at least $1 - \delta$, the variant of EM described above will produce final center-estimates which (appropriately permuted) satisfy
$$\|\mu_i^{\langle 2 \rangle} - \mu_i\| \;\le\; \|\mathrm{mean}(S_i) - \mu_i\| + \varepsilon \sigma_i \sqrt{d}.$$
This theorem will be proved over the remainder of the paper, but a few words about it are in order at this stage. First of all, the constants which have been left out of the theorem statement are given in Section 4. Second, the best that can be hoped for is that $\mu_i^{\langle 2 \rangle} = \mathrm{mean}(S_i)$; therefore, the final error bound on the center-estimates becomes very close to optimal as $c^2 d$ increases. Third, similarly strong bounds can be given for the final mixing weights and variances; see Theorem 17. Finally, notice that the bounds require $c \gg d^{-1/4}$; in other words, the distance between the centers of Gaussians i and j must be $\gg \max(\sigma_i, \sigma_j)\, d^{1/4}$.
$d$: dimension
$k$: true number of clusters
$w_i, \mu_i, \sigma_i^2$: true mixture parameters
$\sigma_{ij}$: shorthand for $\max(\sigma_i, \sigma_j)$
$w_{\min}$: lower bound on the mixing weights, $w_i \ge w_{\min}$
$l$: number of clusters with which EM is started; $l > k$
$w_i^{\langle t \rangle}, \mu_i^{\langle t \rangle}, (\sigma_i^{\langle t \rangle})^2$: EM's parameter-estimates at time $t$
$p_i^{\langle t \rangle}(x)$: EM's soft assignment (to point $x$, from cluster $i$) at time $t$
$w_T$: threshold used to prune "starved clusters", $w_T = 1/(4l)$
$c_{ij}$: separation between Gaussians $i$ and $j$, $\|\mu_i - \mu_j\| = c_{ij}\,\sigma_{ij}\sqrt{d}$
$c$: overall separation, $c = \min_{i \ne j} c_{ij}$
$S$: data points
$S_i$: data points sampled from the $i$th Gaussian
$m$: number of data points
$\varepsilon_o$: concentration of interpoint distances and dot products
$C_i$: center-estimates from $S_i$ which survived the first round, $C_i = \{\mu_{i'}^{\langle 1 \rangle} : \mu_{i'}^{\langle 0 \rangle} \in S_i,\ w_{i'}^{\langle 1 \rangle} \ge w_T\}$
$d(\mu_i^{\langle 1 \rangle}, \mu_j^{\langle 1 \rangle})$: weighted distance between centers, used by the pruning procedure

Figure 3: Notation.
3. Concentration Properties of Gaussian Samples

Our analysis hinges crucially upon concentration effects: specifically, that interpoint distances and means of subsets of points are likely to be close to their expected values. The most basic such property is that the squared length of a point drawn from a high-dimensional Gaussian is tightly concentrated. The proof of this well-known fact is repeated here for easy reference.
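Before the formal statement, here is a quick numerical illustration (Python with NumPy; ours, not from the paper) of how sharply $\|X\|^2/d$ concentrates around 1 as the dimension grows, which is the content of Lemma 2.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 100, 1000, 10_000):
    X = rng.standard_normal(size=(1000, d))      # 1000 draws from N(0, I_d)
    ratios = np.sum(X ** 2, axis=1) / d          # ||X||^2 / d for each draw
    # Lemma 2(b) predicts deviations of order 1/sqrt(d); the empirical spread
    # shrinks accordingly as d grows.
    print(d, round(float(ratios.mean()), 3), round(float(ratios.std()), 3))
```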
Lemma 2  Pick X from the distribution $N(0, I_d)$.

(a) (Very large deviations) For any $\lambda \ge 1$, we have $P(\|X\|^2 \ge \lambda d) \le \big(e^{-(\lambda - 1 - \ln \lambda)}\big)^{d/2}$.

(b) (Modest deviations) For any $\varepsilon \in (0, 1)$, we have $P(|\,\|X\|^2 - d\,| \ge \varepsilon d) \le 2e^{-d\varepsilon^2/8}$.
Proof. $\|X\|^2$ has a $\chi^2$ distribution with expectation $d$, variance $2d$, and moment-generating function $\varphi(t) = E\,e^{t\|X\|^2} = (1 - 2t)^{-d/2}$.

(a) By Markov's inequality, for $t \in [0, \frac{1}{2}]$,
$$P\big(\|X\|^2 \ge \lambda d\big) \;\le\; \frac{\varphi(t)}{e^{t\lambda d}};$$
the first assertion of the lemma follows by choosing $t = \frac{1}{2}\big(1 - \frac{1}{\lambda}\big)$.
(b) Fix any $\varepsilon \in (0, 1)$. Applying Markov's inequality tells us that for any $t \in [0, \frac{1}{2}]$,
$$P\big(\|X\|^2 \ge (1 + \varepsilon)d\big) \;=\; P\big(e^{t\|X\|^2} \ge e^{t(1+\varepsilon)d}\big) \;\le\; \frac{\varphi(t)}{e^{t(1+\varepsilon)d}}$$