Information Processing Letters 176 (2022) 106251
A refined approximation for Euclidean k-means

Fabrizio Grandoni (a,∗), Rafail Ostrovsky (b), Yuval Rabani (c), Leonard J. Schulman (d), Rakesh Venkat (e)

(a) IDSIA, USI-SUPSI, Switzerland
(b) UCLA, United States of America
(c) The Hebrew University of Jerusalem, Israel
(d) Caltech, United States of America
(e) IIT Hyderabad, India
Article info

Article history:
Received 15 July 2021
Accepted 23 January 2022
Available online 2 February 2022
Communicated by Leah Epstein

Keywords:
Approximation algorithms
Euclidean k-means
Euclidean facility location
Integrality gaps
Abstract

In the Euclidean k-Means problem we are given a collection of n points D in a Euclidean space and a positive integer k. Our goal is to identify a collection of k points in the same space (centers) so as to minimize the sum of the squared Euclidean distances between each point in D and the closest center. This problem is known to be APX-hard and the current best approximation ratio is a primal-dual 6.357 approximation based on a standard LP for the problem [Ahmadian et al. FOCS'17, SICOMP'20]. In this note we show how a minor modification of Ahmadian et al.'s analysis leads to a slightly improved 6.12903 approximation. As a related result, we also show that the mentioned LP has integrality gap at least $\frac{16+\sqrt{5}}{15} > 1.2157$.

© 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
Clustering is a central problem in Computer Science, with many applications in data science, machine learning, etc. One of the most famous and best-studied problems in this area is Euclidean k-Means: given a set D of n points (or demands) in $\mathbb{R}^{\ell}$ and an integer $k \in [1, n]$, select k points S (centers) so as to minimize
$$\sum_{j \in D} d^2(j, S).$$
Here $d(j, i)$ is the Euclidean distance between points j and i, and for a set of points I, $d(j, I) = \min_{i \in I} d(j, i)$. In other words, we wish to select k centers so as to minimize the sum of the squared Euclidean distances between each demand and the closest center. Equivalently, a feasible solution is given by a partition of the demands into k subsets (clusters). The cost $w_C$ of a cluster $C \subseteq D$ is $\sum_{j \in C} d^2(j, \mu)$, where $\mu$ is the center of mass of C. We recall that $w_C$ can also be expressed as $\frac{1}{2|C|} \sum_{j \in C} \sum_{j' \in C} d^2(j, j')$. Our goal is to minimize the total cost of these clusters.
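The equivalence between the two expressions for the cluster cost can be checked numerically; a minimal sketch (our own illustration, with helper names that are not from the paper):

```python
# Check the two equivalent expressions for the cost w_C of a cluster C:
#   w_C = sum_{j in C} d^2(j, mu)  =  (1 / (2|C|)) * sum_{j, j' in C} d^2(j, j').
def squared_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cost_via_centroid(C):
    n, dim = len(C), len(C[0])
    mu = tuple(sum(p[t] for p in C) / n for t in range(dim))  # center of mass
    return sum(squared_dist(p, mu) for p in C)

def cost_via_pairs(C):
    n = len(C)
    return sum(squared_dist(p, q) for p in C for q in C) / (2 * n)

C = [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0), (4.0, 4.0)]
assert abs(cost_via_centroid(C) - cost_via_pairs(C)) < 1e-9
```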
Euclidean k-Means is well-studied in terms of approximation algorithms. It is known to be APX-hard. More precisely, it is hard to approximate k-Means below a factor 1.0013 in polynomial time unless P = NP [6,16]. The hardness was improved to 1.07 under the Unique Games Conjecture [9]. Some heuristics are known to perform very well in practice; however, their approximation factor is $O(\log k)$ or worse on general instances [3,4,17,21].
Constant approximation algorithms are known. A local-search algorithm by Kanungo et al. [15] provides a $9+\varepsilon$ approximation.¹ The authors also show that natural local-search based algorithms cannot perform better than this. This ratio was improved to 6.357 by Ahmadian et al. [1,2]
∗ Corresponding author.
E-mail address: fabrizio@idsia.ch (F. Grandoni).
¹ Throughout this paper by ε we mean an arbitrarily small positive constant. W.l.o.g. we assume ε ≤ 1.
https://doi.org/10.1016/j.ipl.2022.106251
using a primal-dual approach. They also prove a $9+\varepsilon$ approximation for general (possibly non-Euclidean) metrics. Better approximation factors are known under reasonable restrictions on the input [5,7,10,20]. A PTAS is known for constant k [19] or for constant dimension [10,12].
Notice that the dimension $\ell$ can always be assumed to be $O(\log n)$ by a standard application of the Johnson-Lindenstrauss transform [14]. This was recently improved to $O(\log k + \log\log n)$ [8] and finally to $O(\log k)$ [18].
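To illustrate the dimension-reduction idea, here is a small sketch (ours, using a plain Gaussian random projection, the textbook Johnson-Lindenstrauss construction; the cited works use refined variants). It projects points down and checks that squared pairwise distances, and hence the k-Means objective, are roughly preserved:

```python
# Gaussian random projection from dimension d to m; squared pairwise
# distances are preserved up to a distortion that shrinks as m grows.
import math
import random

random.seed(0)

def project(points, m):
    d = len(points[0])
    # Entries scaled by 1/sqrt(m) so squared norms are preserved in expectation.
    G = [[random.gauss(0.0, 1.0) / math.sqrt(m) for _ in range(d)] for _ in range(m)]
    return [tuple(sum(g[t] * p[t] for t in range(d)) for g in G) for p in points]

def sq(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

d, m = 200, 50
points = [tuple(random.gauss(0, 1) for _ in range(d)) for _ in range(10)]
proj = project(points, m)
for a in range(10):
    for b in range(a + 1, 10):
        ratio = sq(proj[a], proj[b]) / sq(points[a], points[b])
        assert 0.3 < ratio < 3.0  # loose distortion bound for this seed
```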
In this paper we describe a simple modification of the analysis of Ahmadian et al. [2] which leads to a slightly improved approximation for Euclidean k-Means (see Section 2).
Theorem 1. There exists a deterministic polynomial-time algorithm for Euclidean k-Means with approximation ratio $\rho + \varepsilon$ for any positive constant $\varepsilon > 0$, where
$$\rho := \left(1 + \sqrt{\frac{1}{2}\left(2 + \sqrt[3]{3 - 2\sqrt{2}} + \sqrt[3]{3 + 2\sqrt{2}}\right)}\right)^2 < 6.12903.$$
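The closed form can be evaluated numerically as a sanity check; a minimal sketch (ours, evaluating the expression as reconstructed above):

```python
# Evaluate rho = (1 + sqrt((2 + cbrt(3 - 2*sqrt(2)) + cbrt(3 + 2*sqrt(2))) / 2))^2
# and confirm it lies strictly below the stated bound 6.12903.
import math

delta = (2 + (3 - 2 * math.sqrt(2)) ** (1 / 3) + (3 + 2 * math.sqrt(2)) ** (1 / 3)) / 2
rho = (1 + math.sqrt(delta)) ** 2
assert 6.12 < rho < 6.12903  # a small but strict improvement over 6.357
```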
The above approximation ratio is w.r.t. the optimal fractional solution to a standard LP relaxation LP_k-Means for the problem (defined later). As a side result (see Section 3), we prove a lower bound on the integrality gap of this relaxation (we are not aware of any explicit such lower bound in the literature).
Theorem 2. The integrality gap of LP_k-Means, even in the Euclidean plane (i.e., for $\ell = 2$), is at least $\frac{16+\sqrt{5}}{15} > 1.2157$.
1.1. Preliminaries
As mentioned earlier, one can formulate Euclidean k-Means in terms of the selection of k centers. In this case, it is convenient to discretize the possible choices for the centers, hence obtaining a polynomial-size set F of candidate centers, at the cost of an extra factor $1+\varepsilon$ in the approximation ratio (we will neglect this factor in the approximation ratios since it is absorbed by analogous factors in the rest of the analysis). In particular we will use the construction in [11] (Lemma 24) that chooses as F the centers of mass of any collection of up to $16/\varepsilon^2$ points with repetitions. In particular $|F| = O(n^{16/\varepsilon^2})$ in this case.
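A toy illustration of the discretization idea (ours, not the construction of [11] verbatim): for 1-means, the centroid of some multiset of at most m input points is already a near-optimal center, within factor $1 + 1/m$ of the optimum by the standard sampling bound.

```python
# Candidate centers = centroids of all multisets of up to m input points.
# For 1-means the best candidate is within factor (1 + 1/m) of optimal.
from itertools import combinations_with_replacement

def sq(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def one_means_cost(points, center):
    return sum(sq(p, center) for p in points)

def centroid(ps):
    return tuple(sum(c) / len(ps) for c in zip(*ps))

points = [(0.0, 0.0), (1.0, 0.0), (3.0, 2.0), (5.0, 1.0), (2.0, 4.0)]
opt = one_means_cost(points, centroid(points))  # true centroid is optimal for 1-means

m = 2  # in the construction above, m = 16 / eps^2
candidates = [centroid(S) for r in (1, 2)
              for S in combinations_with_replacement(points, r)]
best = min(one_means_cost(points, c) for c in candidates)
assert opt <= best <= (1 + 1 / m) * opt
```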
Let $c(j, i)$ be an abbreviation for $d^2(j, i)$. Then a standard LP relaxation for k-Means is as follows:
$$\min \sum_{i \in F,\, j \in D} x_{ij} \cdot c(j, i) \qquad (\mathrm{LP}_{k\text{-Means}})$$
$$\text{s.t.} \quad \sum_{i \in F} x_{ij} \ge 1 \quad \forall j \in D$$
$$x_{ij} \le y_i \quad \forall j \in D, \forall i \in F$$
$$\sum_{i \in F} y_i \le k$$
$$x_{ij}, y_i \ge 0 \quad \forall j \in D, \forall i \in F$$
In an integral solution, we interpret $y_i = 1$ as i being a selected center in S (i is open), and $x_{ij} = 1$ as demand j being assigned to center i.² The first family of constraints states that each demand has to be assigned to some center, the second one that a demand can only be assigned to an open center, and the third one that we can open at most k centers.
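A small sketch (our own illustration) encoding an integral solution of this LP for a toy instance and checking feasibility and the objective value:

```python
# Encode an integral solution (open centers S, nearest-center assignment x)
# of LP_k-Means and verify the three constraint families; the LP objective
# then equals the k-Means cost of S.
def sq(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

D = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]  # demands
F = D  # for this toy example, candidate centers = input points
k = 2

S = [(0.0, 0.0), (10.0, 0.0)]              # open centers (y_i = 1 for i in S)
y = {i: 1 if i in S else 0 for i in F}
x = {}                                      # x_ij = 1 iff i is j's closest open center
for j in D:
    closest = min(S, key=lambda i: sq(j, i))
    for i in F:
        x[i, j] = 1 if i == closest else 0

assert all(sum(x[i, j] for i in F) >= 1 for j in D)   # every demand assigned
assert all(x[i, j] <= y[i] for i in F for j in D)     # only to open centers
assert sum(y.values()) <= k                           # at most k centers open

objective = sum(x[i, j] * sq(j, i) for i in F for j in D)
assert objective == sum(min(sq(j, i) for i in S) for j in D)
```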
For any parameter $\lambda > 0$ (Lagrangian multiplier), the Lagrangian relaxation LP(λ) of LP_k-Means (w.r.t. the last matrix constraint) and its dual DP(λ) are as follows:
$$\min \sum_{i \in F,\, j \in D} x_{ij} \cdot c(j, i) + \lambda \cdot \sum_{i \in F} y_i - \lambda k \qquad (\mathrm{LP}(\lambda))$$
$$\text{s.t.} \quad \sum_{i \in F} x_{ij} \ge 1 \quad \forall j \in D$$
$$x_{ij} \le y_i \quad \forall j \in D, \forall i \in F$$
$$x_{ij}, y_i \ge 0 \quad \forall j \in D, \forall i \in F$$
² Technically each demand is automatically assigned to the closest open center. However it is convenient to also allow sub-optimal assignments in the LP relaxation.
$$\max \sum_{j \in D} \alpha_j - \lambda k \qquad (\mathrm{DP}(\lambda))$$
$$\text{s.t.} \quad \sum_{j \in D} \max\{0, \alpha_j - c(j, i)\} \le \lambda \quad \forall i \in F \qquad (1)$$
$$\alpha_j \ge 0 \quad \forall j \in D$$
Above, $\max\{0, \alpha_j - c(j, i)\}$ replaces the dual variable $\beta_{ij}$ corresponding to the second constraint in the primal in the standard formulation of the dual LP. Notice that, by removing the fixed term $-\lambda k$ in the objective functions of LP(λ) and DP(λ), one obtains the standard LP relaxation LP_FL(λ) for the Facility Location problem (FL) with uniform facility cost λ and its dual DP_FL(λ).
We say that a ρ-approximation algorithm for an FL instance of the above type is Lagrangian Multiplier Preserving (LMP) if it returns a set of facilities S that satisfies
$$\sum_{j \in D} c(j, S) \le \rho\,(\mathrm{OPT}(\lambda) - \lambda |S|),$$
where OPT(λ) is the value of the optimal solution to LP_FL(λ).
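To make the role of DP(λ) concrete, here is a small sketch (our own, using an ad-hoc feasible dual, not the one the algorithm constructs) checking the Lagrangian weak-duality bound: any feasible α yields the lower bound $\sum_j \alpha_j - \lambda k$ on the integral k-Means optimum.

```python
# Build a feasible solution of DP(lambda) for a tiny instance and check
#   sum_j alpha_j - lambda * k  <=  optimal integral k-Means cost.
from itertools import combinations

def sq(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

D = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
F = D
k = 2
lam = 3.0  # an arbitrary Lagrangian multiplier > 0

# Integral optimum by brute force over all k-subsets of candidate centers.
opt = min(sum(min(sq(j, i) for i in S) for j in D) for S in combinations(F, k))

# Feasible dual: alpha_j = c(j, F) + lam/n. Then for every facility i,
# sum_j max(0, alpha_j - c(j, i)) <= n * (lam / n) = lam, so (1) holds.
n = len(D)
alpha = {j: min(sq(j, i) for i in F) + lam / n for j in D}
assert all(sum(max(0.0, alpha[j] - sq(j, i)) for j in D) <= lam + 1e-9 for i in F)
assert sum(alpha.values()) - lam * k <= opt + 1e-9
```

The bound is loose for this ad-hoc α; the dual-growth phase described next constructs a much stronger feasible α.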
2. A refined approximation for Euclidean k-Means
In this section we present our refined approximation for Euclidean k-Means. We start by presenting the LMP approximation algorithm for the FL instances arising from k-Means described in [2] in Section 2.1. We then present the analysis of that algorithm as in [2] in Section 2.2. In Section 2.3 we describe our refined analysis of the same algorithm. Finally, in Section 2.4 we sketch how to use this to approximate k-Means.
2.1. A primal-dual LMP algorithm for Euclidean facility location
We consider an instance of Euclidean FL induced by a k-Means instance in the mentioned way, for a given Lagrangian multiplier $\lambda > 0$. We consider exactly the same Lagrangian Multiplier Preserving (LMP) primal-dual algorithm JV(δ) as in [2].
In more detail, let $\delta \ge 2$ be a parameter to be fixed later. The algorithm consists of a dual-growth phase and a pruning phase. The dual-growth phase is exactly as in the classical primal-dual algorithm JV by Jain and Vazirani [13]. We start with all the dual variables set to 0 and an empty set $O^t$ of tentatively open facilities. The clients such that $\alpha_j \ge c(j, i)$ for some $i \in O^t$ are frozen, and the other clients are active. We grow the dual variables of active clients at uniform rate until one of the following two events happens. The first event is that some constraint of type (1) becomes tight. At that point the corresponding facility i is added to $O^t$ and all clients j with $\alpha_j \ge c(j, i)$ are set to frozen. The second event is that $\alpha_j = c(j, i)$ for some $i \in O^t$. In that case j is set to frozen. In either case, the facility $w(j)$ that causes j to become frozen is called the witness of j. The phase halts when all clients are frozen.
In the pruning phase we will close some facilities in $O^t$, hence obtaining the final set of open facilities IS. Here JV(δ) deviates from JV. For each client $j \in D$, let $N(j) = \{i \in F : \alpha_j > c(j, i)\}$ be the set of facilities i such that j contributed with a positive amount to the opening of i. Symmetrically, for $i \in F$, let $N(i) = \{j \in D : \alpha_j > c(j, i)\}$ be the clients that contributed with a positive amount to the opening of i. For $i \in O^t$, we let $t_i = \max_{j \in N(i)} \alpha_j$, where the values $\alpha_j$ are considered at the end of the dual-growth phase. We conventionally set $t_i = 0$ for $N(i) = \emptyset$. Intuitively, $t_i$ is the "time" when facility i is tentatively open (at which point all the dual variables of contributing clients stop growing).
We define a conflict graph H over tentatively open facilities as follows. The node set of H is $O^t$. We place an edge between $i, i' \in O^t$ iff the following two conditions hold: (1) for some client j, $j \in N(i) \cap N(i')$ (in words, j contributes to the opening of both i and i'), and (2) one has $c(i, i') \le \delta \cdot \min\{t_i, t_{i'}\}$. In this graph we compute a maximal independent set IS, which provides the desired solution to the facility location problem (where each client is assigned to the closest facility in IS).
We remark that the pruning phase of JV differs from the one of JV(δ) only in the definition of H, where condition (2) is not required to hold (or, equivalently, JV behaves like JV(+∞) for λ > 0).
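To make the description above concrete, here is a compact, unoptimized Python sketch of JV(δ) (our own rendering; helper names such as `opening_time` are ours, and witness bookkeeping is omitted since the pruning phase does not use it). It simulates the uniform dual growth by jumping between the two event types, then greedily extracts a maximal independent set of the conflict graph:

```python
# Event-driven sketch of JV(delta) for small instances: connection costs are
# squared Euclidean distances, facilities have uniform opening cost lam.
def jv_delta(D, F, lam, delta):
    def cost(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    alpha = {j: 0.0 for j in D}
    active, frozen = set(D), set()
    opened = []   # tentatively open facilities O^t, in opening order
    theta = 0.0   # common dual value of all active clients (uniform growth)

    def opening_time(i):
        # Smallest time >= theta at which constraint (1) for facility i is tight.
        base = sum(max(0.0, alpha[j] - cost(j, i)) for j in frozen)
        bps = sorted(cost(j, i) for j in active)
        lo = theta
        k = sum(1 for b in bps if b <= lo)      # active clients already paying
        f = base + sum(max(0.0, lo - b) for b in bps)
        if f >= lam - 1e-12:
            return lo
        for b in bps:
            if b <= lo:
                continue
            if k > 0 and f + k * (b - lo) >= lam:
                return lo + (lam - f) / k       # tight inside this segment
            f += k * (b - lo)
            lo, k = b, k + 1
        return lo + (lam - f) / k if k > 0 else float("inf")

    while active:
        # Event (a): an active client reaches c(j, i) of a tentatively open i.
        t_a, ev_a = float("inf"), None
        for i in opened:
            for j in active:
                if theta <= cost(j, i) < t_a:
                    t_a, ev_a = cost(j, i), (j, i)
        # Event (b): constraint (1) of a closed facility becomes tight.
        t_b, ev_b = float("inf"), None
        for i in F:
            if i not in opened:
                t = opening_time(i)
                if t < t_b:
                    t_b, ev_b = t, i
        if t_a <= t_b:                 # freeze a single client
            theta = t_a
            j, _ = ev_a
            alpha[j] = theta
            active.discard(j)
            frozen.add(j)
        else:                          # open a facility, freeze its contributors
            theta = t_b
            opened.append(ev_b)
            for j in [q for q in active if cost(q, ev_b) <= theta]:
                alpha[j] = theta
                active.discard(j)
                frozen.add(j)

    # Pruning: greedy maximal independent set IS in the conflict graph H.
    N = {i: {j for j in D if alpha[j] > cost(j, i)} for i in opened}
    t = {i: max((alpha[j] for j in N[i]), default=0.0) for i in opened}
    IS = []
    for i in opened:
        if not any(N[i] & N[i2] and cost(i, i2) <= delta * min(t[i], t[i2])
                   for i2 in IS):
            IS.append(i)
    return IS, alpha

D = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
IS, alpha = jv_delta(D, D, lam=2.0, delta=2.3146)
assert len(IS) == 2                    # one facility per well-separated pair
assert all(a > 0 for a in alpha.values())
```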
2.2. The analysis in [2]
The general goal is to show that
$$\sum_{j \in D} c(j, IS) \le \rho \Big( \sum_{j \in D} \alpha_j - \lambda |IS| \Big),$$
for some $\rho \ge 1$ as small as possible. This shows that the algorithm is an LMP ρ-approximation for the problem. It is sufficient to prove that, for each client j, one has
$$\frac{c(j, IS)}{\rho} \le \alpha_j - \sum_{i \in N(j) \cap IS} (\alpha_j - c(j, i)) = \alpha_j - \sum_{i \in IS} \max\{0, \alpha_j - c(j, i)\}.$$
Let $S = N(j) \cap IS$ and $s = |S|$. We distinguish three cases depending on the value of s.
Case A: s = 1. Let $S = \{i^*\}$. Then for any $\rho \ge 1$,
$$\frac{c(j, IS)}{\rho} \le c(j, IS) \le c(j, i^*) = \alpha_j - (\alpha_j - c(j, i^*)).$$
Case B: s > 1. Here we use the properties of Euclidean metrics. The sum $\sum_{i \in S} c(j, i)$ is the sum of the squared distances from j to the facilities in S. This quantity is lower bounded by the sum of the squared distances from S to the centroid μ of S. Recall that
$$\sum_{i \in S} c(\mu, i) = \frac{1}{2s} \sum_{i \in S} \sum_{i' \in S} c(i, i').$$
We also observe that, by construction, for any two distinct $i, i' \in IS$ one has $c(i, i') > \delta \cdot \min\{t_i, t_{i'}\} \ge \delta \cdot \alpha_j$, where the last inequality follows from the fact that j is contributing to the opening of both i and i'. Altogether one obtains
$$\sum_{i \in S} c(j, i) \ge \sum_{i \in S} c(\mu, i) = \frac{1}{2s} \sum_{i \in S} \sum_{i' \in S} c(i, i') \ge \frac{(s-1)\delta \alpha_j}{2}.$$
Thus
$$\sum_{i \in S} (\alpha_j - c(j, i)) \le \Big( s - \frac{\delta(s-1)}{2} \Big) \alpha_j = \Big( s \Big(1 - \frac{\delta}{2}\Big) + \frac{\delta}{2} \Big) \alpha_j \overset{\delta \ge 2,\ s \ge 2}{\le} \Big( 2 - \frac{\delta}{2} \Big) \alpha_j.$$
Using the fact that $\alpha_j > c(j, i)$ for all $i \in S$, hence $\alpha_j > c(j, IS)$, one gets
$$\Big( \frac{\delta}{2} - 1 \Big) c(j, IS) \overset{\delta \ge 2}{\le} \Big( \frac{\delta}{2} - 1 \Big) \alpha_j.$$
We conclude that
$$\sum_{i \in S} (\alpha_j - c(j, i)) + \Big( \frac{\delta}{2} - 1 \Big) c(j, IS) \le \Big( 2 - \frac{\delta}{2} \Big) \alpha_j + \Big( \frac{\delta}{2} - 1 \Big) \alpha_j = \alpha_j.$$
This gives the desired inequality assuming that $\rho \ge \frac{1}{\delta/2 - 1}$.
Case C: s = 0. Consider the witness $i = w(j)$ of j. Notice that $\alpha_j \ge t_i$ and $\alpha_j \ge c(j, i) = d^2(j, i)$. Hence
$$d(j, i) + \sqrt{\delta t_i} \le (1 + \sqrt{\delta}) \sqrt{\alpha_j}.$$
If $i \in IS$, then $d(j, IS) \le d(j, i)$. Otherwise there exists $i' \in IS$ such that $d^2(i, i') \le \delta \min\{t_i, t_{i'}\} \le \delta t_i$. Thus
$$d(j, IS) \le d(j, i) + d(i, i') \le d(j, i) + \sqrt{\delta t_i}.$$
In both cases one has $d(j, IS) \le (1 + \sqrt{\delta}) \sqrt{\alpha_j}$, hence $c(j, IS) \le (1 + \sqrt{\delta})^2 \alpha_j$. This gives the desired inequality for $\rho \ge (1 + \sqrt{\delta})^2$.
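The inequality chain in Case C can be spot-checked numerically; a quick sketch (ours), sampling values satisfying the two premises $\alpha_j \ge t_i$ and $d(j, i) \le \sqrt{\alpha_j}$:

```python
# Spot-check of the Case C bound: if alpha >= t_i and d_ji <= sqrt(alpha),
# then d_ji + sqrt(delta * t_i) <= (1 + sqrt(delta)) * sqrt(alpha).
import math
import random

random.seed(1)
delta = 2.3146
for _ in range(1000):
    alpha = random.uniform(0.0, 10.0)
    t_i = random.uniform(0.0, alpha)               # alpha_j >= t_i
    d_ji = random.uniform(0.0, math.sqrt(alpha))   # d(j, i) <= sqrt(alpha_j)
    assert d_ji + math.sqrt(delta * t_i) <= (1 + math.sqrt(delta)) * math.sqrt(alpha) + 1e-9
```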
Fixing δ. Altogether we can set $\rho = \max\{\frac{1}{\delta/2 - 1}, (1 + \sqrt{\delta})^2\}$. The best choice for δ (namely, the one that minimizes ρ) is the solution of $\frac{1}{\delta/2 - 1} = (1 + \sqrt{\delta})^2$. This is achieved for $\delta \approx 2.3146$ and gives $\rho \approx 6.3574$.
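The equalizing δ can be recovered numerically; a minimal sketch (ours, by bisection, exploiting that the first bound is decreasing in δ and the second increasing):

```python
# Solve 1/(delta/2 - 1) = (1 + sqrt(delta))^2 by bisection and recover
# the approximation ratio rho ~= 6.3574 of the original analysis.
import math

def g(delta):
    return 1.0 / (delta / 2.0 - 1.0) - (1.0 + math.sqrt(delta)) ** 2

lo, hi = 2.05, 4.0   # g(lo) > 0 (first bound dominates), g(hi) < 0
for _ in range(100):
    mid = (lo + hi) / 2.0
    if g(mid) > 0:
        lo = mid
    else:
        hi = mid

delta = (lo + hi) / 2.0
rho = (1.0 + math.sqrt(delta)) ** 2
assert abs(delta - 2.3146) < 1e-3
assert abs(rho - 6.3574) < 1e-3
```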
2.3. A refined analysis
We refine the analysis in Case B as follows. Let $\Delta = \sum_{i \in S} c(j, i)$. Our goal is to upper bound
$$\frac{c(j, S)}{\alpha_j - \sum_{i \in S} (\alpha_j - c(j, i))} = \frac{c(j, S)}{\sum_{i \in S} c(j, i) - (s - 1) \alpha_j} = \frac{c(j, S)}{\Delta - (s - 1) \alpha_j}.$$
Instead of using the upper bound $c(j, S) \le \alpha_j$ we use the average $c(j, S) \le \frac{1}{s} \sum_{i \in S} c(j, i) = \frac{\Delta}{s}$.
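The averaging step (the nearest facility costs at most the mean over S) is elementary; a quick numerical sketch (ours):

```python
# Check the averaging bound used above: the cost to the nearest facility of S
# is at most the average cost over S, i.e. c(j, S) <= Delta / s.
import random

random.seed(2)

def sq(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

for _ in range(100):
    j = (random.uniform(-5, 5), random.uniform(-5, 5))
    S = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(4)]
    delta_sum = sum(sq(j, i) for i in S)    # Delta = sum of squared distances
    assert min(sq(j, i) for i in S) <= delta_sum / len(S) + 1e-12
```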