arXiv:2203.01418v1 [cs.IT] 2 Mar 2022
Third-order Analysis of Channel Coding in the
Moderate Deviations Regime
Recep Can Yavas, Victoria Kostina, and Michelle Effros
Abstract—The channel coding problem in the moderate deviations regime is studied; here, the error probability decays sub-exponentially to zero, and the rate approaches the capacity slower than $O(1/\sqrt{n})$. Our result refines Altuğ and Wagner's moderate deviations result by deriving lower and upper bounds on the third-order term in the asymptotic expansion of the maximum achievable message set size. The third-order term of our expansion employs a new quantity here called the channel skewness. For the binary symmetric channel and most practically important $(n, \epsilon)$ pairs, including $n \in [100, 500]$ and $\epsilon \in [10^{-10}, 10^{-1}]$, an approximation up to the channel skewness is the most accurate among several expansions in the literature.
I. INTRODUCTION
The fundamental limit of channel coding is the maximum achievable message set size $M^*(n, \epsilon)$ given a channel $P_{Y|X}$, a blocklength $n$, and an average error probability $\epsilon$. Since determining $M^*(n, \epsilon)$ exactly is difficult for arbitrary triples $(P_{Y|X}, n, \epsilon)$, the literature investigating the behavior of $M^*(n, \epsilon)$ studies two asymptotic regimes: the central limit theorem (CLT) and large deviations (LD) regimes. To make this precise, consider the sum of $n$ independent and identically distributed (i.i.d.) random variables. By the CLT, the probability that this sum deviates from the mean by order $\sqrt{n}$ is characterized by the Gaussian distribution, whose parameters are constant with respect to $n$. By Cramér's theorem [1], commonly known as the LD theorem, the probability that the sum of $n$ i.i.d. random variables deviates from the mean by order $n$ decays exponentially with $n$ if Cramér's condition is satisfied. Deviations from the mean by order strictly greater than $\sqrt{n}$ and strictly smaller than $n$ are called moderate deviations (MD). Bounds on the probability with which an i.i.d. sum deviates from the mean by an amount that falls in the MD regime are given in [2, Ch. 8]. This work focuses on channel coding in the MD regime.
Given a channel, error probability $\epsilon$, and blocklength $n$, we define the non-Gaussianity of a channel, which captures the third-order term in a rate characterization, as
$$\zeta(n, \epsilon) \triangleq \log M^*(n, \epsilon) - \left(nC - \sqrt{n V_\epsilon}\, Q^{-1}(\epsilon)\right), \qquad (1)$$
where $C$ is the capacity, and $V_\epsilon > 0$ is the $\epsilon$-dispersion of the channel [3, Sec. IV].
R. C. Yavas, V. Kostina, and M. Effros are with the Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125, USA (e-mail: ryavas, vkostina, effros@caltech.edu). This work was supported in part by the National Science Foundation (NSF) under grants CCF-1817241 and CCF-1956386.

Channel coding analyses in the CLT regime fix a target error probability $\epsilon \in (0, 1)$ and approximate $M^*(n, \epsilon)$ as the blocklength $n$ approaches infinity by the bracketed term in (1). Examples of such results include Strassen's result [4] for discrete memoryless channels (DMCs) under the maximal error probability constraint, with $\zeta(n, \epsilon) = O(\log n)$. Polyanskiy et al. [3] and Hayashi [5] revisit Strassen's result [4], showing that the same asymptotic expansion holds under the average error probability constraint, deriving lower and upper bounds on the coefficient of the $\log n$ term, and extending the result to the Gaussian channel with maximal and average power constraints. CLT-style analyses also exist for channels with feedback [6], lossy [7] and lossless data compression [4], [8], network information theory [9], random access channels [10], [11], and many other scenarios.
For channel coding in the LD regime, which is commonly known as the error exponent regime, one fixes a rate $R = \frac{\log M}{n}$ strictly below the channel capacity and seeks to characterize the minimum achievable error probability $\epsilon(n, R)$ as the blocklength $n$ approaches infinity. In this regime, $\epsilon(n, R)$ decays exponentially with $n$, and [12, Ch. 5] derives the optimal error exponent $\lim_{n \to \infty} -\frac{1}{n} \log \epsilon(n, R)$ for $R$ above the critical rate.
In [13], Altuğ and Wagner seek to bridge the gap between the CLT and LD regimes. In their asymptotic regime, the error probability $\epsilon_n$ decays sub-exponentially to zero, i.e., $\epsilon_n \to 0$ and $\frac{1}{n} \log \epsilon_n \to 0$, and the rate approaches the capacity with a gap of order strictly greater than $\frac{1}{\sqrt{n}}$. Altuğ and Wagner argue that this formulation is more practically relevant since it simultaneously considers low error probabilities and high achievable rates. For DMCs with positive $\epsilon_n$-dispersion $V_{\epsilon_n}$ and a sequence of sub-exponentially decaying $\epsilon_n$, they show that
$$\zeta(n, \epsilon_n) = o\!\left(\sqrt{n}\, Q^{-1}(\epsilon_n)\right). \qquad (2)$$
This result implies that the dispersion approximation to the maximum achievable message set size, $\log M^*(n, \epsilon_n) \approx nC - \sqrt{n V_{\epsilon_n}}\, Q^{-1}(\epsilon_n)$ as in (1), is still valid in the MD regime; however, they do not specify how accurate it is since they do not derive the leading term of $\zeta(n, \epsilon_n)$. Our goal is to determine the accuracy of the dispersion approximation in this regime by deriving tight bounds on the non-Gaussianity. In [14], Polyanskiy and Verdú give an alternative proof of (2) and extend the result to the Gaussian channel with a maximal power constraint. In [15], Chubb et al. extend the second-order result in (2) to quantum channels.
Emerging applications with tight delay constraints motivate interest in refining the asymptotic expansions of the maximum achievable channel coding rate. For small blocklengths $n$, the non-Gaussianity $\zeta(n, \epsilon)$ in (1) can be quite large when compared to the second-order term $O(\sqrt{n})$. For example, in the CLT regime, given a DMC with finite input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$, [3] bounds the non-Gaussianity as
$$O(1) \le \zeta(n, \epsilon) \le \frac{|\mathcal{X}| - 1}{2} \log n + O(1). \qquad (3)$$
A variety of refinements follow. The first considers nonsingular channels: singular channels are channels for which the channel transition probability from $x$ to $y$ is the same for every compatible $(x, y)$ pair, while nonsingular channels are channels that do not satisfy this property; see Section II-E for formal definitions. For nonsingular channels, the random coding union bound improves the lower bound to $\frac{1}{2} \log n + O(1)$ [16, Cor. 54]. For DMCs with positive $\epsilon$-dispersion, Tomamichel and Tan [17] improve the upper bound to $\frac{1}{2} \log n + O(1)$. A random variable is called lattice if it takes values on a lattice with probability 1, and is called non-lattice otherwise. For nonsingular channels with positive $\epsilon$-dispersion and non-lattice information density, Moulin [18] shows
$$\zeta(n, \epsilon) \ge \frac{1}{2} \log n + \underline{S}\, Q^{-1}(\epsilon)^2 + \underline{B} + o(1) \qquad (4)$$
$$\zeta(n, \epsilon) \le \frac{1}{2} \log n + \overline{S}\, Q^{-1}(\epsilon)^2 + \overline{B} + o(1), \qquad (5)$$
where $\underline{S}$, $\overline{S}$, $\underline{B}$, and $\overline{B}$ are constants depending on the channel parameters. Gallager-symmetric channels are channels for which the output alphabet can be partitioned into subsets such that, within each subset, the probability transition matrix (with inputs as rows and the outputs of the subset as columns) has the property that each row is a permutation of every other row and each column is a permutation of every other column [12, p. 94]. For Gallager-symmetric, singular channels, $\zeta(n, \epsilon) = O(1)$ [19]. Bounds on the sub-exponential factors in the LD regime for singular and nonsingular channels appear in [19]–[22].
In this work, we extend the definition of the MD error probability region in [13] to include sequences that sub-exponentially approach 1 or other constant values: we say that a sequence of error probabilities $\{\epsilon_n\}_{n=1}^{\infty}$ is an MD sequence if for all $c > 0$, there exists some $n_0 \in \mathbb{N}$ such that
$$\exp\{-cn\} \le \epsilon_n \le 1 - \exp\{-cn\} \qquad (6)$$
for all $n \ge n_0$. We refine the lower and upper bounds in (2) for nonsingular channels. Our result generalizes (4)–(5) to all error probability sequences $\epsilon_n$ satisfying (6), at the expense of not bounding the constant term. We show that for nonsingular channels with positive minimum dispersion and $\epsilon_n$ satisfying (6), $\zeta(n, \epsilon_n)$ in (2) is bounded from below and above as
$$\frac{1}{2} \log n + \underline{S}\, Q^{-1}(\epsilon_n)^2 + O\!\left(\frac{Q^{-1}(\epsilon_n)^3}{\sqrt{n}}\right) + O(1) \le \zeta(n, \epsilon_n) \qquad (7)$$
$$\le \frac{1}{2} \log n + \overline{S}\, Q^{-1}(\epsilon_n)^2 + O\!\left(\frac{Q^{-1}(\epsilon_n)^3}{\sqrt{n}}\right) + O(1), \qquad (8)$$
where the constants $\underline{S}$ and $\overline{S}$ are the same as in (4) and (5).
We show that the non-Gaussianity approaches $O(\sqrt{n})$ as $\epsilon_n$ approaches an exponential decay, rivaling the dispersion term in (1). Thus, refining the third-order term as we do in (7)–(8) is especially important in the MD regime. Our achievability bound applies the standard random coding bound used in both the CLT [3], [18] and LD [20] regimes. Our converse bound combines the result in [17, Prop. 6], which is a relaxation of the meta-converse bound [3, Th. 27], and a saddlepoint result for a minimax problem in [18, Lemma 14]. Neither the Berry-Esseen theorem used in [3] nor the refined Edgeworth expansion used in [18] to treat the constant-$\epsilon$ case is sharp enough for the $O(1)$ precision in (7)–(8), which applies to any sequence $\epsilon_n$ satisfying (6). We replace these tools with the moderate deviations bounds found in [2, Ch. 8].
We define the channel skewness operationally as
$$S \triangleq \lim_{\epsilon \to 0} \limsup_{n \to \infty} \frac{\zeta(n, \epsilon) - \frac{1}{2} \log n}{Q^{-1}(\epsilon)^2}. \qquad (9)$$
For the maximum achievable rate, the channel skewness characterizes third-order bounds much like the channel dispersion characterizes second-order bounds in [3, Sec. IV]. The quantities $\underline{S}$ and $\overline{S}$ provide informational lower and upper bounds on the channel skewness. They depend on the second and third moments of the information density under a dispersion-achieving input distribution.
For Cover-Thomas symmetric channels [23], where each row of the probability transition matrix is a permutation of every other row and each column is a permutation of every other column, $\underline{S} = \overline{S}$. For the binary symmetric channel (BSC) and a wide range of $(n, \epsilon)$ pairs, our asymptotic approximation for the maximum achievable rate using terms up to the channel skewness, i.e., $\zeta(n, \epsilon) \approx \frac{1}{2} \log n + S\, Q^{-1}(\epsilon)^2$, is more accurate than both Moulin's bounds with $\underline{B}$ and $\overline{B}$ in (4) and (5) included and the normal approximation, which takes $\zeta(n, \epsilon) \approx \frac{1}{2} \log n$; our approximation competes with the saddlepoint approximations in [21], [22]. Moreover, for the BSC with an $(n, \epsilon)$ pair satisfying $\epsilon \in [10^{-10}, 10^{-1}]$ and $n \in [100, 500]$, including the leading term of $O\!\left(\frac{Q^{-1}(\epsilon_n)^3}{\sqrt{n}}\right)$ in our approximation (see Section VI) yields a less accurate approximation than is obtained by stopping at the channel skewness. This highlights the importance of the channel skewness relative to the higher-order terms in characterizing the channel. Characterization of all $(P_{Y|X}, n, \epsilon)$ triples satisfying this property remains an open problem.
Using the moderate deviations results in [2, Ch. 8] and the strong large deviations results in [24], we derive the asymptotics of binary hypothesis testing in the MD regime (Theorem 6), characterizing the minimum achievable type-II error of a hypothesis test that chooses between two product distributions (i.e., the $\beta_\alpha$ function defined in [3]), given that the type-I error is an MD sequence (6). Binary hypothesis testing is known to be closely related to several information-theoretic problems. For instance, Blahut [25] derives a lower bound on the error exponent in channel coding in terms of the asymptotics of binary hypothesis testing in the LD regime. Polyanskiy et al. derive a converse result [3, Th. 27] in channel coding using the minimax of the $\beta_\alpha$ function; this converse is commonly known as the meta-converse bound. Kostina and Verdú prove a converse result for fixed-length lossy compression of stationary memoryless sources [7, Th. 8] using the $\beta_\alpha$ function. This result is extended to lossless joint source-channel coding in [26]. For lossless data compression, Kostina and Verdú give lower and upper bounds [7, eq. (64)] on the minimum achievable codebook size in terms of $\beta_\alpha$. For lossless multiple access source coding, also known as Slepian-Wolf coding, Chen et al. derive a converse result [27, Th. 19] in terms of the composite hypothesis testing version of the $\beta_\alpha$ function that considers a single null hypothesis and $k$ alternative hypotheses. Composite hypothesis testing is also used in a random access channel coding scenario to decide whether any transmitter is active [11]. The works in [3], [7], [11], [26], [27] derive second- or third-order asymptotic expansions for their respective problems by using the asymptotics of the $\beta_\alpha$ function in the CLT regime. As an application of Theorem 6, the results in [3], [7], [11], [26], [27] could be extended to the MD regime. For Cover-Thomas symmetric channels, we show a refined converse result in Theorem 8 using Theorem 6 and the meta-converse bound.
The paper is organized as follows. We define notation and give the preliminaries needed to present our results in Section II. Section III presents and discusses the main results. Proofs of the main results appear in Section IV. The asymptotics of binary hypothesis testing in the MD regime are derived in Section V. Refined bounds for Cover-Thomas symmetric channels appear in Section VI.
II. NOTATION AND PRELIMINARIES
A. Notation
For any $k \in \mathbb{N}$, we denote $[k] \triangleq \{1, \ldots, k\}$. We denote random variables by capital letters (e.g., $X$) and individual realizations of random variables by lowercase letters (e.g., $x$). We use boldface letters (e.g., $\mathbf{x}$) to denote vectors, calligraphic letters (e.g., $\mathcal{X}$) to denote alphabets and sets, and sans serif font (e.g., $\mathsf{A}$) to denote matrices. The $i$-th entry of a vector $\mathbf{x}$ is denoted by $x_i$, and the $(i,j)$-th entry of a matrix $\mathsf{A}$ is denoted by $\mathsf{A}_{i,j}$. The sets of real numbers and complex numbers are denoted by $\mathbb{R}$ and $\mathbb{C}$, respectively. All-zero and all-one vectors are denoted by $\mathbf{0}$ and $\mathbf{1}$, respectively. A vector inequality $\mathbf{x} \le \mathbf{y}$ for $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ is understood element-wise, i.e., $x_i \le y_i$ for all $i \in [d]$. We denote the inner product $\sum_{i=1}^d x_i y_i$ by $\langle \mathbf{x}, \mathbf{y} \rangle$. We use $\|\cdot\|_\infty$ to denote the $\ell_\infty$ norm, i.e., $\|\mathbf{x}\|_\infty \triangleq \max_{i \in [d]} |x_i|$.
The sets of all distributions on the channel input alphabet $\mathcal{X}$ and the channel output alphabet $\mathcal{Y}$ are denoted by $\mathcal{P}$ and $\mathcal{Q}$, respectively. We write $X \sim P_X$ to indicate that $X$ is distributed according to $P_X \in \mathcal{P}$. Given a distribution $P_X \in \mathcal{P}$ and a conditional distribution $P_{Y|X}$ from $\mathcal{X}$ to $\mathcal{Y}$, we write $P_X \times P_{Y|X}$ to denote the joint distribution of $(X, Y)$, and $P_Y$ to denote the marginal distribution of $Y$, i.e., $P_Y(y) = \sum_{x \in \mathcal{X}} P_X(x) P_{Y|X}(y|x)$ for all $y \in \mathcal{Y}$. Given a conditional distribution $P_{Y|X}$, the distribution of $Y$ given $X = x$ is denoted by $P_{Y|X=x}$. The skewness of a random variable $X$ is denoted by $\mathrm{Sk}(X) \triangleq \frac{\mathbb{E}[(X - \mathbb{E}[X])^3]}{\mathrm{Var}[X]^{3/2}}$. For a sequence $\mathbf{x} = (x_1, \ldots, x_n)$, the empirical distribution (or type) of $\mathbf{x}$ is denoted by
$$\hat{P}_{\mathbf{x}}(x) = \frac{1}{n} \sum_{i=1}^n 1\{x_i = x\}, \quad x \in \mathcal{X}. \qquad (10)$$
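For concreteness, a minimal numerical sketch of (10); the alphabet and the sequence below are made-up examples.

```python
# Sketch: empirical distribution (type) of a sequence, eq. (10).
import numpy as np

def empirical_distribution(x, alphabet):
    x = np.asarray(x)
    return np.array([np.mean(x == a) for a in alphabet])

print(empirical_distribution([0, 1, 1, 0, 1], alphabet=[0, 1]))  # [0.4 0.6]
```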
A lattice random variable is a random variable taking values in $\{a + kd : k \in \mathbb{Z}\}$, where $d$ is the span of the lattice. We say that a random vector $\mathbf{X} = (X_1, \ldots, X_n)$ is non-lattice if each of the $X_i$, $i \in [n]$, is non-lattice, and is lattice if each of the $X_i$, $i \in [n]$, is lattice. (The case where some of the coordinates of $\mathbf{X}$ are lattice and the rest are non-lattice is excluded in this paper.) We measure information in nats, and logarithms and exponents have base $e$. As is standard, $f(n) = O(g(n))$ means $\limsup_{n \to \infty} \left|\frac{f(n)}{g(n)}\right| < \infty$, and $f(n) = o(g(n))$ means $\lim_{n \to \infty} \frac{f(n)}{g(n)} = 0$. We use $Q(\cdot)$ to represent the complementary Gaussian cumulative distribution function $Q(x) \triangleq \frac{1}{\sqrt{2\pi}} \int_x^{\infty} \exp\left\{-\frac{t^2}{2}\right\} dt$, and $Q^{-1}(\cdot)$ to represent its functional inverse.
B. Definitions related to information density
The relative entropy, divergence variance, and divergence centralized third moment between distributions $P$ and $Q$ on a common alphabet are denoted by
$$D(P \| Q) \triangleq \mathbb{E}\left[\log \frac{P(X)}{Q(X)}\right] \qquad (11)$$
$$V(P \| Q) \triangleq \mathrm{Var}\left[\log \frac{P(X)}{Q(X)}\right] \qquad (12)$$
$$T(P \| Q) \triangleq \mathbb{E}\left[\left(\log \frac{P(X)}{Q(X)} - D(P \| Q)\right)^3\right], \qquad (13)$$
where $X \sim P$. Let $P_X \in \mathcal{P}$, $Q_Y \in \mathcal{Q}$, and $P_{Y|X}$ be a conditional distribution from $\mathcal{X}$ to $\mathcal{Y}$. The conditional relative entropy, conditional divergence variance, conditional divergence centralized third moment, and conditional divergence skewness are denoted by
$$D(P_{Y|X} \| Q_Y | P_X) \triangleq \sum_{x \in \mathcal{X}} P_X(x)\, D(P_{Y|X=x} \| Q_Y) \qquad (14)$$
$$V(P_{Y|X} \| Q_Y | P_X) \triangleq \sum_{x \in \mathcal{X}} P_X(x)\, V(P_{Y|X=x} \| Q_Y) \qquad (15)$$
$$T(P_{Y|X} \| Q_Y | P_X) \triangleq \sum_{x \in \mathcal{X}} P_X(x)\, T(P_{Y|X=x} \| Q_Y) \qquad (16)$$
$$\mathrm{Sk}(P_{Y|X} \| Q_Y | P_X) \triangleq \frac{T(P_{Y|X} \| Q_Y | P_X)}{V(P_{Y|X} \| Q_Y | P_X)^{3/2}}. \qquad (17)$$
Let $(X, Y) \sim P_X \times P_{Y|X}$. The information density is defined as
$$\imath(x; y) \triangleq \log \frac{P_{Y|X}(y|x)}{P_Y(y)}, \quad x \in \mathcal{X},\ y \in \mathcal{Y}. \qquad (18)$$
We define the following moments of the random variable $\imath(X; Y)$: the mutual information
$$I(P_X, P_{Y|X}) \triangleq \mathbb{E}[\imath(X; Y)] = D(P_{Y|X} \| P_Y | P_X), \qquad (19)$$
the unconditional information variance
$$V_u(P_X, P_{Y|X}) \triangleq V(P_X \times P_{Y|X} \| P_X \times P_Y) = \mathrm{Var}[\imath(X; Y)], \qquad (20)$$
the unconditional information third central moment
$$T_u(P_X, P_{Y|X}) \triangleq T(P_X \times P_{Y|X} \| P_X \times P_Y) \qquad (21)$$
$$= \mathbb{E}\left[\left(\imath(X; Y) - I(P_X, P_{Y|X})\right)^3\right], \qquad (22)$$
the unconditional information skewness
$$\mathrm{Sk}_u(P_X, P_{Y|X}) \triangleq \mathrm{Sk}(\imath(X; Y)) = \frac{T_u(P_X, P_{Y|X})}{V_u(P_X, P_{Y|X})^{3/2}}, \qquad (23)$$
the conditional information variance
$$V(P_X, P_{Y|X}) \triangleq V(P_{Y|X} \| P_Y | P_X) = \mathbb{E}[\mathrm{Var}[\imath(X; Y) \,|\, X]], \qquad (24)$$
the conditional information third central moment
$$T(P_X, P_{Y|X}) \triangleq T(P_{Y|X} \| P_Y | P_X), \qquad (25)$$
the conditional information skewness
$$S(P_X, P_{Y|X}) \triangleq \frac{T(P_{Y|X} \| P_Y | P_X)}{V(P_{Y|X} \| P_Y | P_X)^{3/2}}, \qquad (26)$$
and the reverse dispersion [16, Sec. 3.4.5]
$$V_r(P_X, P_{Y|X}) \triangleq \mathbb{E}[\mathrm{Var}[\imath(X; Y) \,|\, Y]]. \qquad (27)$$
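The moments in (19)–(27) are finite sums over the joint distribution $P_X \times P_{Y|X}$ and are easy to evaluate numerically. The sketch below (an illustration, not code from the paper) computes them for a channel specified by a transition matrix; the BSC(0.11) with the uniform input is used as an example, and output symbols of zero probability are assumed absent.

```python
# Sketch: numerical evaluation of the information-density moments
# (19)-(27) for a DMC with transition matrix W[x, y] = P_{Y|X}(y|x)
# and input distribution P_X.  The example channel/input are illustrative.
import numpy as np

def information_density_moments(P_X, W):
    P_X = np.asarray(P_X, dtype=float)
    W = np.asarray(W, dtype=float)                # shape (|X|, |Y|)
    P_Y = P_X @ W                                 # output marginal
    with np.errstate(divide="ignore", invalid="ignore"):
        info = np.where(W > 0, np.log(W / P_Y), 0.0)   # i(x; y), eq. (18)
    P_XY = P_X[:, None] * W                       # joint distribution
    I = np.sum(P_XY * info)                       # mutual information (19)
    V_u = np.sum(P_XY * (info - I) ** 2)          # unconditional variance (20)
    T_u = np.sum(P_XY * (info - I) ** 3)          # third central moment (21)-(22)
    Sk_u = T_u / V_u ** 1.5                       # unconditional skewness (23)
    # conditional variance (24): E_X[ Var[i(X;Y) | X] ]
    m_x = np.sum(W * info, axis=1)
    V_c = np.sum(P_X * np.sum(W * (info - m_x[:, None]) ** 2, axis=1))
    # reverse dispersion (27): E_Y[ Var[i(X;Y) | Y] ]
    P_XgY = P_XY / P_Y
    m_y = np.sum(P_XgY * info, axis=0)
    V_r = np.sum(P_Y * np.sum(P_XgY * (info - m_y) ** 2, axis=0))
    return I, V_u, T_u, Sk_u, V_c, V_r

W = np.array([[0.89, 0.11], [0.11, 0.89]])        # BSC(0.11)
print(information_density_moments([0.5, 0.5], W))
```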
C. Discrete memoryless channel
A discrete memoryless channel (DMC) is characterized by a finite input alphabet $\mathcal{X}$, a finite output alphabet $\mathcal{Y}$, and a probability transition matrix $P_{Y|X}$, where $P_{Y|X}(y|x)$ is the probability that the output of the channel is $y \in \mathcal{Y}$ given that the input to the channel is $x \in \mathcal{X}$. The input-output relation of a DMC is
$$P^n_{Y|X}(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^n P_{Y|X}(y_i|x_i). \qquad (28)$$
We proceed to define the channel code.
Definition 1: An $(n, M, \epsilon)$-code for a DMC $P_{Y|X}$ comprises an encoding function
$$\mathsf{f} \colon [M] \to \mathcal{X}^n, \qquad (29)$$
and a decoding function
$$\mathsf{g} \colon \mathcal{Y}^n \to [M], \qquad (30)$$
that satisfy an average error probability constraint
$$1 - \frac{1}{M} \sum_{m=1}^{M} P^n_{Y|X}\!\left(\mathsf{g}^{-1}(m) \,\middle|\, \mathsf{f}(m)\right) \le \epsilon. \qquad (31)$$
The maximum achievable message set size $M^*(n, \epsilon)$ under the average error probability criterion is defined as
$$M^*(n, \epsilon) \triangleq \max\{M : \exists\ \text{an}\ (n, M, \epsilon)\text{-code}\}. \qquad (32)$$
D. Definitions related to the optimal input distribution
The capacity of a DMC $P_{Y|X}$ is
$$C(P_{Y|X}) \triangleq \max_{P_X \in \mathcal{P}} I(P_X, P_{Y|X}). \qquad (33)$$
We denote the set of capacity-achieving input distributions by
$$\mathcal{P}^\star \triangleq \{P_X \in \mathcal{P} : I(P_X, P_{Y|X}) = C(P_{Y|X})\}. \qquad (34)$$
Even when the capacity-achieving input distribution is not unique ($|\mathcal{P}^\star| > 1$), the capacity-achieving output distribution is unique ($P_X, P'_X \in \mathcal{P}^\star$ implies $\sum_{x \in \mathcal{X}} P_X(x) P_{Y|X}(y|x) = \sum_{x \in \mathcal{X}} P'_X(x) P_{Y|X}(y|x)$ for all $y \in \mathcal{Y}$) [12, Cor. 2 to Th. 4.5.2]. We denote this unique capacity-achieving output distribution by $Q_Y^\star \in \mathcal{Q}$; $Q_Y^\star$ satisfies $Q_Y^\star(y) > 0$ for all $y \in \mathcal{Y}$ for which there exists $x \in \mathcal{X}$ with $P_{Y|X}(y|x) > 0$ [12, Cor. 1 to Th. 4.5.2]. For any $P_X \in \mathcal{P}^\star$, it holds that $V(P_X, P_{Y|X}) = V_u(P_X, P_{Y|X})$ [3, Lemma 62].
Define $V_{\min} \triangleq \min_{P_X \in \mathcal{P}^\star} V(P_X, P_{Y|X})$ and $V_{\max} \triangleq \max_{P_X \in \mathcal{P}^\star} V(P_X, P_{Y|X})$. The $\epsilon$-dispersion [3] of a channel is defined as
$$V_\epsilon \triangleq \begin{cases} V_{\min} & \text{if } \epsilon < \frac{1}{2} \\ V_{\max} & \text{if } \epsilon \ge \frac{1}{2}. \end{cases} \qquad (35)$$
The set of dispersion-achieving input distributions is defined as
$$\mathcal{P}^\dagger \triangleq \begin{cases} \left\{P_X \in \mathcal{P}^\star : V(P_X, P_{Y|X}) = V_\epsilon\right\} & \text{if } \epsilon \neq \frac{1}{2} \\ \mathcal{P}^\star & \text{if } \epsilon = \frac{1}{2}. \end{cases} \qquad (36)$$
Any $P_X \in \mathcal{P}^\star$ satisfies $D(P_{Y|X=x} \| Q_Y^\star) = C$ for any $x \in \mathcal{X}$ with $P_X(x) > 0$, and $D(P_{Y|X=x} \| Q_Y^\star) \le C$ for all $x \in \mathcal{X}$ [12, Th. 4.5.1]. Hence, the support of any capacity-achieving input distribution is a subset of
$$\mathcal{X}^\star \triangleq \{x \in \mathcal{X} : D(P_{Y|X=x} \| Q_Y^\star) = C\}. \qquad (37)$$
The support of any dispersion-achieving input distribution is a subset of
$$\mathcal{X}^\dagger \triangleq \bigcup_{P_X \in \mathcal{P}^\dagger} \mathrm{supp}(P_X) \subseteq \mathcal{X}^\star. \qquad (38)$$
The quantities below are used to describe the input distribution that achieves our lower bound $\underline{S}$ on the channel skewness $S$ in (9). The gradient and the Hessian of the mutual information $I(P_X, P_{Y|X})$ with respect to $P_X$ are given by [18]
$$\frac{\partial I(P_X, P_{Y|X})}{\partial P_X(x)} = D(P_{Y|X=x} \| P_Y) - 1 \qquad (39)$$
$$\frac{\partial^2 I(P_X, P_{Y|X})}{\partial P_X(x)\, \partial P_X(x')} = -\sum_{y \in \mathcal{Y}} \frac{P_{Y|X}(y|x)\, P_{Y|X}(y|x')}{P_Y(y)} \qquad (40)$$
for $(x, x') \in \mathcal{X}^2$. The matrix $-\nabla^2 I(P_X, P_{Y|X})$ is the same for all $P_X \in \mathcal{P}^\star$, and is positive semidefinite. See [18, Sec. II-D and II-E] for other properties of $-\nabla^2 I(P_X, P_{Y|X})$. Define the $|\mathcal{X}| \times |\mathcal{X}|$ matrix $\mathsf{J}$ as
$$\mathsf{J}_{x,x'} \triangleq \begin{cases} -\nabla^2 I(P_X, P_{Y|X})_{x,x'} & \text{if } x, x' \in \mathcal{X}^\star \\ 0 & \text{otherwise.} \end{cases} \qquad (41)$$
Define the set of vectors
$$\mathcal{L} \triangleq \left\{\mathbf{h} \in \mathbb{R}^{|\mathcal{X}|} : \sum_{x \in \mathcal{X}} h_x = 0,\ h_{x'} = 0 \text{ for } x' \notin \mathcal{X}^\star,\ h_{x''} \ge 0 \text{ for } x'' \in \mathcal{X}^\star \setminus \mathcal{X}^\dagger \right\}. \qquad (42)$$
The following convex optimization problem arises in the optimization of the input distribution achieving the lower bound $\underline{S}$:
$$\sup_{\mathbf{h} \in \mathcal{L} \cap \mathrm{row}(\mathsf{J})} \left\{\mathbf{g}^\top \mathbf{h} - \frac{1}{2} \mathbf{h}^\top \mathsf{J} \mathbf{h}\right\}, \qquad (43)$$
where $\mathrm{row}(\cdot)$ denotes the row space of a matrix. If $\mathcal{X}^\star = \mathcal{X}$, then the $\mathbf{h}$ that achieves (43) is given by [18, Lemma 1]
$$\mathbf{h} = \tilde{\mathsf{J}} \mathbf{g}, \qquad (44)$$
where
$$\tilde{\mathsf{J}} = \mathsf{J}^+ - \frac{1}{\mathbf{1}^\top \mathsf{J}^+ \mathbf{1}} (\mathsf{J}^+ \mathbf{1})(\mathsf{J}^+ \mathbf{1})^\top, \qquad (45)$$
$\mathsf{J}^+$ denotes the Moore-Penrose pseudo-inverse of $\mathsf{J}$ (given that $\mathsf{A} = \mathsf{U} \mathsf{\Sigma} \mathsf{V}^\top$ is the singular value decomposition of $\mathsf{A}$, $\mathsf{A}^+ = \mathsf{V} \mathsf{\Sigma}^{-1} \mathsf{U}^\top$), and the optimal value in (43) is given by $\frac{1}{2} \mathbf{g}^\top \tilde{\mathsf{J}} \mathbf{g}$.
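A small numerical sketch of (44)–(45) may help fix ideas. The $3 \times 3$ matrix $\mathsf{J}$ and vector $\mathbf{g}$ below are arbitrary placeholders (not quantities computed from a channel), chosen only so that $\mathsf{J}$ is symmetric positive semidefinite and $\mathbf{1}^\top \mathsf{J}^+ \mathbf{1} \neq 0$; by construction the resulting $\mathbf{h}$ sums to zero, which is the equality constraint in (42).

```python
# Sketch of the closed form (44)-(45):
#   Jt = J^+ - (J^+ 1)(J^+ 1)^T / (1^T J^+ 1),  h = Jt g,  value = g^T Jt g / 2.
# J and g are made-up placeholders, not channel quantities.
import numpy as np

def j_tilde(J):
    Jp = np.linalg.pinv(J)                   # Moore-Penrose pseudo-inverse
    one = np.ones(J.shape[0])
    Jp1 = Jp @ one
    return Jp - np.outer(Jp1, Jp1) / (one @ Jp1)

J = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])             # placeholder symmetric PSD matrix
g = np.array([0.3, -0.1, -0.2])

Jt = j_tilde(J)
h = Jt @ g
print("h =", h, "sum(h) =", h.sum())         # sums to zero
print("optimal value of (43):", 0.5 * g @ Jt @ g)
```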
In our main results, Theorem 1 and Theorem 2 below, we characterize the lower bound $\underline{S}$ and upper bound $\overline{S}$ on the skewness $S$ of the DMC. The following notation is used in those bounds. Define
$$\mathsf{v}(P_X)_x \triangleq \frac{\partial V(P_X, P_{Y|X})}{\partial P_X(x)} \qquad (46)$$
$$\tilde{\mathsf{v}}_x \triangleq V(P_{Y|X=x} \| Q_Y^\star) \qquad (47)$$
$$\bar{\mathsf{v}}(P_X)_x \triangleq \mathbb{E}\left[\frac{\partial V(P_{Y|X=X} \| P_Y)}{\partial P_X(x)}\right], \quad x \in \mathcal{X}, \text{ where } X \sim P_X, \qquad (48)$$
$$A_0(P_X) \triangleq \frac{1}{8 V_\epsilon}\, \mathsf{v}(P_X)^\top \tilde{\mathsf{J}}\, \mathsf{v}(P_X) \qquad (49)$$
$$A_1(P_X) \triangleq \frac{1}{8 V_\epsilon}\, \bar{\mathsf{v}}(P_X)^\top \tilde{\mathsf{J}}\, \bar{\mathsf{v}}(P_X). \qquad (50)$$
See [18, Lemma 2] for properties of these quantities.
E. Singularity of a DMC
The following definition divides DMCs into two groups, for which $M^*(n, \epsilon_n)$ behaves differently in the non-Gaussianity (even in the CLT regime). An input distribution-channel pair $(P_X, P_{Y|X})$ is singular [20, Def. 1] if for all $(x, \bar{x}, y)$ such that $P_X \times P_{Y|X}(x, y) > 0$ and $P_X \times P_{Y|X}(\bar{x}, y) > 0$, it holds that
$$P_{Y|X}(y|x) = P_{Y|X}(y|\bar{x}). \qquad (51)$$
We define the singularity parameter [18, eq. (2.25)]
$$\eta(P_X, P_{Y|X}) \triangleq 1 - \frac{V_r(P_X, P_{Y|X})}{V_u(P_X, P_{Y|X})}, \qquad (52)$$
which is a constant in $[0, 1]$. The pair $(P_X, P_{Y|X})$ is singular if and only if $\eta(P_X, P_{Y|X}) = 1$ [28, Remark 1]. A channel $P_{Y|X}$ is singular if $\eta(P_X, P_{Y|X}) = 1$ for all $P_X \in \mathcal{P}$, and nonsingular otherwise. An example singular channel is the binary erasure channel. Our focus in this paper is on nonsingular channels.
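The sketch below (illustrative only) evaluates the singularity parameter (52) under the uniform input for a BSC and a binary erasure channel (BEC); it returns $\eta = 0$ for the BSC and $\eta = 1$ for the BEC, matching the fact that the BEC is singular while the BSC is not.

```python
# Sketch: singularity parameter (52), eta = 1 - V_r / V_u, for a BSC and a BEC.
# Channel matrices and the uniform input are standard; values are illustrative.
import numpy as np

def eta(P_X, W):
    P_X, W = np.asarray(P_X, float), np.asarray(W, float)
    P_Y = P_X @ W
    P_XY = P_X[:, None] * W
    with np.errstate(divide="ignore", invalid="ignore"):
        info = np.where(W > 0, np.log(W / P_Y), 0.0)       # i(x; y)
    I = np.sum(P_XY * info)
    V_u = np.sum(P_XY * (info - I) ** 2)                    # unconditional variance (20)
    P_XgY = np.where(P_Y > 0, P_XY / P_Y, 0.0)              # P_{X|Y}
    m_y = np.sum(P_XgY * info, axis=0)
    V_r = np.sum(P_Y * np.sum(P_XgY * (info - m_y) ** 2, axis=0))  # reverse dispersion (27)
    return 1.0 - V_r / V_u

p, delta = 0.11, 0.3
bsc = np.array([[1 - p, p], [p, 1 - p]])
bec = np.array([[1 - delta, delta, 0.0],
                [0.0, delta, 1 - delta]])                   # outputs {0, e, 1}
print("BSC:", eta([0.5, 0.5], bsc))   # 0.0 -> nonsingular
print("BEC:", eta([0.5, 0.5], bec))   # 1.0 -> singular
```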
For brevity, if the channel is clear from the context, we drop $P_{Y|X}$ in the notation for the capacity, dispersion, skewness, and singularity parameter of the channel.
III. MAIN RESULT

Theorems 1 and 2 are our achievability and converse results, respectively.
Theorem 1: Suppose that $\epsilon_n$ satisfies (6) and that $P_{Y|X}$ is a nonsingular DMC with $V_{\epsilon_n} > 0$ for all $n$ and $\mathcal{X}^\star = \mathcal{X}$. It holds that
$$\zeta(n, \epsilon_n) \ge \frac{1}{2} \log n + \underline{S}\, Q^{-1}(\epsilon_n)^2 + O\!\left(\frac{Q^{-1}(\epsilon_n)^3}{\sqrt{n}}\right) + O(1), \qquad (53)$$
where
$$\underline{S} \triangleq \max_{P_X \in \mathcal{P}^\dagger} \left\{\frac{\mathrm{Sk}_u(P_X) \sqrt{V_{\epsilon_n}}}{6} + A_0(P_X) + \frac{1 - \eta(P_X)}{2(1 + \eta(P_X))}\right\}. \qquad (54)$$
Proof: The proof consists of two parts and extends the argument in [18] to include $\epsilon_n$ that decreases to 0 or increases to 1 as permitted by (6). The first part is a standard random coding bound. It is used in the CLT regime for a third-order analysis in [16] and a fourth-order analysis in [18]; it also comes up in the LD regime [20]. We set an arbitrary distribution $P_X \in \mathcal{P}$ to generate the i.i.d. random codebook and employ a maximum likelihood decoder. To bound the probability $\mathbb{P}[\imath(X; Y) \le \tau]$, we replace the Edgeworth expansion in [18, eq. (5.30)], which gives the refined asymptotics of the Berry-Esseen theorem, with its moderate deviations version from [2, Ch. 8, Th. 2]. Note that the Edgeworth expansion yields an additive remainder term $O(1/n)$ to the normal term; this remainder becomes too large for the entire range of probabilities in (6). Therefore, a moderate deviations result that yields a multiplicative remainder term $(1 + o(1))$ is desired. We apply the large deviations result in [24, Th. 3.4] to bound the probability $\mathbb{P}[\imath(\bar{X}; Y) \ge \imath(X; Y) \ge \tau]$, where $X$ and $\bar{X}$ denote the transmitted random codeword and an independent codeword drawn from the same distribution, respectively. This bound replaces the bounds in [18, eq. (7.25)-(7.27)] and refines the large deviations bound [3, Lemma 47] used in [16, Th. 53]. We show an achievability result as a function of $I(P_X)$, $V_u(P_X)$, and $\mathrm{Sk}_u(P_X)$. If $P_X = P_X^\star \in \mathcal{P}^\dagger$, the resulting bound is (53) with $A_0(P_X^\star)$ replaced by zero. We then optimize the bound over $P_X$ using the second-, first- and zeroth-order Taylor series expansions around $P_X^\star \in \mathcal{P}^\dagger$ of $I(P_X)$, $V_u(P_X)$, and $\mathrm{Sk}_u(P_X)$, respectively. Interestingly, the right-hand side of (53) is achieved using
$$P_X = P_X^\star - \frac{Q^{-1}(\epsilon_n)}{2\sqrt{n V_{\epsilon_n}}}\, \tilde{\mathsf{J}}\, \mathsf{v}(P_X^\star) \in \mathcal{P} \qquad (55)$$
instead of a dispersion-achieving input distribution $P_X^\star \in \mathcal{P}^\dagger$ to generate i.i.d. random codewords. Note that despite being in the neighborhood of a dispersion-achieving $P_X^\star$, $P_X$ in (55), itself, might not belong to $\mathcal{P}^\dagger$.
(There is a sign error in [18, eq. (3.1)-(3.2)], which then propagates through the rest of that paper: the sign of the terms with $S(P_X)$ should be positive rather than negative in both equations. The error in the achievability result originates in [18, eq. (7.15) and (7.19)], where it is missed that $\mathrm{Sk}(-X) = -\mathrm{Sk}(X)$ for any random variable $X$. The error in the converse result also stems from the sign error in [18, eq. (6.8)].)
In the second-order MD result in [13], Altuğ and Wagner apply the non-asymptotic bound in [12, Cor. 2 on p. 140], which turns out to be insufficiently sharp for the derivation of the third-order term. See Section IV-C for the details of the proof.
We require the condition $V_{\epsilon_n} > 0$ in Theorem 1 since the moderate (Theorem 3 in Section IV-A) and large deviations results (Theorems 4 and 5 in Section IV-B) apply only to random variables with positive variance. In the CLT regime, [3, Th. 45 and 48] and [17, Prop. 9-10] derive bounds on the non-Gaussianity for DMCs with $V_{\epsilon_n} = 0$. If $V_{\epsilon_n} = 0$, the scaling of the non-Gaussianity changes according to whether the DMC is exotic [3, p. 2331], which most DMCs are not, and whether $\epsilon_n$ is less than, equal to, or greater than $\frac{1}{2}$. A summary of the non-Gaussianity terms under these cases appears in [17, Fig. 1].
The condition $\mathcal{X}^\star = \mathcal{X}$ is a technical one that yields a closed-form solution (55) for the input distribution achieving the lower bound $\underline{S}$. If that condition is not satisfied, then the second term in (55) is replaced by the solution to the convex optimization problem (43), and $A_0(P_X)$ in (54) is replaced by the optimal value of (43).
Theorem 2: Under the conditions of Theorem 1,
$$\zeta(n, \epsilon_n) \le \frac{1}{2} \log n + \overline{S}\, Q^{-1}(\epsilon_n)^2 + O\!\left(\frac{Q^{-1}(\epsilon_n)^3}{\sqrt{n}}\right) + O(1), \qquad (56)$$
where
$$\overline{S} \triangleq \max_{P_X \in \mathcal{P}^\dagger} \left\{\frac{\mathrm{Sk}_u(P_X) \sqrt{V_{\epsilon_n}}}{6} + \frac{1}{2} + A_0(P_X) - A_1(P_X)\right\}. \qquad (57)$$
Proof: The proof of Theorem 2 combines the converse bound from [17, Prop. 6], which is derived from the meta-converse bound [3, Th. 27], and a saddlepoint result in [18, Lemma 14], which involves a maximization over an input distribution $P_X \in \mathcal{P}$ and a minimization over an output distribution $Q_Y \in \mathcal{Q}$. Combining these results and not deriving the $O(1)$ term in (56) yields a much simpler proof than that in [18]. While [18, proof of Th. 4] relies on the asymptotic expansion of the Neyman-Pearson lemma (the $\beta_{1-\epsilon}(P, Q)$ function defined in [3, eq. (100)]), the use of [17, Prop. 6] allows us to bypass this part. After carefully choosing the parameter $\delta$ in [17, Prop. 6], the problem reduces to a single-letter minimax problem involving the quantities $D(P_{Y|X} \| Q_Y | P_X)$ and $V(P_{Y|X} \| Q_Y | P_X)$, where the maximization is over $P_X \in \mathcal{P}$ and the minimization is over $Q_Y \in \mathcal{Q}$. Then, similar to the steps in [18, eq. (8.22)], for the maximization over $P_X$, we separate the cases where $\|P_X - P_X^\star\|_\infty \le c_0 \frac{Q^{-1}(\epsilon_n)}{\sqrt{n}}$ or not, where $P_X^\star \in \mathcal{P}^\dagger$, and $c_0 > 0$ is a constant. Applying [18, Lemmas 14 and 9-iii] completes the proof. See Section IV-D for the details.
The constant terms $\underline{B}$ and $\overline{B}$ in [18] depend on whether the information density random variable $\imath(X; Y)$ is a lattice or non-lattice random variable because both the Edgeworth expansion and the large deviations result used in [18] take distinct forms for lattice and non-lattice random variables. The BSC is analyzed separately in [18, Th. 7] since the information density of the BSC is lattice. A single proof holds for lattice and non-lattice cases if we do not attempt to bound the $O(1)$ term, as in this paper.

Fig. 1. The expansion from Theorems 1 and 2, excluding the $O(\cdot)$ terms, is shown for the BSC(0.11) with $\epsilon \in [10^{-10}, 10^{-1}]$ and $n \in \{100, 250, 500\}$. The error bars correspond to the non-asymptotic achievability and converse bounds from [3, Th. 33 and 35]; the normal approximation, which achieves $\zeta(n, \epsilon) = \frac{1}{2} \log n$, is from [16, Th. 53]; Moulin's results are from [18, Th. 7]; the saddlepoint approximation is from [21, Th. 1] and [22, Sec. III-D].
A. The tightness of Theorem 1 and Theorem 2
If the channel satisfies $|\mathcal{P}^\star| = 1$, $A_0(P_X^\star) = A_1(P_X^\star) = 0$, and $\eta(P_X^\star) = 0$, then the achievability (53) and converse (56) bounds yield the channel skewness (9)
$$S = \frac{\mathrm{Sk}_u(P_X^\star) \sqrt{V_{\min}}}{6} + \frac{1}{2}. \qquad (58)$$
Cover-Thomas symmetric channels [23, p. 190] satisfy all of these conditions (channels that are (i) Cover-Thomas weakly symmetric, (ii) have $|\mathcal{X}| = |\mathcal{Y}|$, and (iii) have a positive definite $\mathsf{J}$ satisfy the same conditions [18, Prop. 6]); the BSC is an example. Further, if $\epsilon_n$ satisfies $Q^{-1}(\epsilon_n) = O(n^{1/6})$, then the $O\!\left(\frac{Q^{-1}(\epsilon_n)^3}{\sqrt{n}}\right)$ term in (53) and (56) is dominated by the $O(1)$ term, giving that for Cover-Thomas symmetric channels, $\zeta(n, \epsilon_n) = \frac{1}{2} \log n + S\, Q^{-1}(\epsilon_n)^2 + O(1)$.
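As an illustration (not reproducing the paper's numerical section), the sketch below combines (1), (58), and the expansion above to approximate the BSC rate, dropping the remaining $O(1)$ and $O(Q^{-1}(\epsilon_n)^3/\sqrt{n})$ terms; the crossover probability and the $(n, \epsilon)$ values mirror the Fig. 1 setup but are otherwise arbitrary.

```python
# Sketch: skewness-refined approximation for the BSC,
#   log M* ~ n C - sqrt(n V) Q^{-1}(eps) + (1/2) log n + S Q^{-1}(eps)^2,
# with S from (58).  O(1) and higher-order terms are dropped.
import numpy as np
from scipy.stats import norm

def bsc_moments(p):
    """C, V, and information-density skewness of a BSC(p), uniform input (nats)."""
    w = np.array([1 - p, p])
    info = np.log(2 * w)                  # i(x;y) takes values log 2(1-p), log 2p
    C = np.sum(w * info)
    V = np.sum(w * (info - C) ** 2)
    T = np.sum(w * (info - C) ** 3)
    return C, V, T / V ** 1.5

def skewness_approx_log_M(n, eps, p):
    C, V, Sk = bsc_moments(p)
    S = Sk * np.sqrt(V) / 6 + 0.5         # channel skewness, eq. (58)
    q = norm.isf(eps)                     # Q^{-1}(eps)
    return n * C - np.sqrt(n * V) * q + 0.5 * np.log(n) + S * q ** 2

for n in (100, 250, 500):
    rate = skewness_approx_log_M(n, 1e-6, 0.11) / (n * np.log(2))
    print(n, round(rate, 4), "bits/channel use")
```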
For the BSC with crossover probability 0.11, Fig. 1 compares asymptotic expansions for the maximum achievable rate, $\frac{\log_2 M^*(n, \epsilon_n)}{n}$, dropping $o(\cdot)$ and $O(\cdot)$ terms except where noted otherwise. The curves plotted in Fig. 1 include Theorems 1 and 2 both with and without the leading term of $O\!\left(\frac{Q^{-1}(\epsilon_n)^3}{\sqrt{n}}\right)$ computed, various other asymptotic expansions in the CLT and LD regimes, and the non-asymptotic bounds from [3, Th. 33 and 35]. The leading terms of $O\!\left(\frac{Q^{-1}(\epsilon_n)^3}{\sqrt{n}}\right)$ in Theorems 1 and 2 are computed in Theorems 7 and 8 in Section VI below. Among these asymptotic expansions, Theorems 1 and 2 ignoring the $O(\cdot)$ terms are the closest to the non-asymptotic bounds for most $(n, \epsilon)$ pairs shown, which highlights the significance of the channel skewness.
In [19], Altuğ and Wagner show that in the LD regime, for Gallager-symmetric channels, the prefactors in the lower and upper bounds on the error probability have the same order; that order depends on whether the channel is singular or nonsingular. Extending the analysis in [18, Sec. III-C-2)] to any Gallager-symmetric channel shows that Gallager-symmetric channels satisfy $A_0(P_X^\star) = A_1(P_X^\star) = 0$, but $\eta(P_X^\star)$ is not necessarily zero (see [18, Sec. III-C-2)] for a counterexample), which means that (53) and (56) are not tight up to the $O(1)$ term for some Gallager-symmetric channels.
term for some Gallager-symmetric channels.
The findings in [
19
] suggest that Theorem
1
or Theorem
2
or both could be improved for some channels. The main
difference between the achievability bounds in [
19
], [
20
] and
ours is that [
20
] bounds the error probability by
P
[
D
] + (
M
1)
P

D
c
∩{
ı
(
X
;
Y
)
ı
(
X
;
Y
)
}

,
(59)
where
D
,
(
log
P
n
Y
|
X
(
Y
|
X
)
Q
n
Y
(
Y
)
< τ
)
(60)
Q
Y
(
y
)
,
c
X
x
∈X
P
X
(
x
)
P
Y
|
X
(
y
|
x
)
1
/
1+
ρ
!
1+
ρ
, y
∈Y
.
(61)
Here
Q
Y
is the tilted output distribution, and
ρ
[0
,
1]
,
τ
, and
c
are some constants. Our achievability bound uses a special
case of (
61
) with
ρ
= 0
, giving
Q
Y
=
P
Y
. Whether the more
general bound in (
61
) yields an improved bound in the MD
regime is a question for future work.
IV. PROOFS OF THEOREMS 1 AND 2

We begin by presenting the supporting lemmas used to bound the probability terms that appear in the proofs of Theorems 1 and 2.
A. Moderate deviations asymptotics
Theorem 3, stated next, is a moderate deviations result that bounds the probability that the sum of $n$ i.i.d. random variables normalized by a factor $\frac{1}{\sqrt{n}}$ deviates from the mean by $o(\sqrt{n})$. The resulting probability is an MD sequence (6).
Theorem 3 (Petrov [2, Ch. 8, Th. 4]): Let $X_1, \ldots, X_n$ be independent random variables. Let $\mathbb{E}[X_i] = 0$ for $i = 1, \ldots, n$, $\sigma^2 = \frac{1}{n} \sum_{i=1}^n \mathrm{Var}[X_i] > 0$, $\mu_k = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i^k]$ for $k \ge 3$, and $\mathrm{Sk} = \frac{\mu_3}{\sigma^3}$. Define
$$S_n \triangleq \frac{1}{\sigma \sqrt{n}} \sum_{i=1}^n X_i \qquad (62)$$
$$F_n(x) \triangleq \mathbb{P}[S_n < x]. \qquad (63)$$
Suppose that there exist some positive constants $t_0$ and $H$ such that the moment generating function (MGF) satisfies
$$\mathbb{E}[\exp\{t X_i\}] < H \qquad (64)$$
for all $|t| \le t_0$ and $i = 1, \ldots, n$. This condition is called Cramér's condition. Let $x > 1$ and $x = o(\sqrt{n})$. Then, it holds that
$$1 - F_n(x) = Q(x) \exp\left\{\frac{x^3}{\sqrt{n}} \lambda_n\!\left(\frac{x}{\sqrt{n}}\right)\right\} \left(1 + O\!\left(\frac{x}{\sqrt{n}}\right)\right) \qquad (65)$$
$$F_n(-x) = Q(x) \exp\left\{-\frac{x^3}{\sqrt{n}} \lambda_n\!\left(-\frac{x}{\sqrt{n}}\right)\right\} \left(1 + O\!\left(\frac{x}{\sqrt{n}}\right)\right), \qquad (66)$$
where
$$\lambda_n(x) \triangleq \sum_{i=0}^{\infty} a_i x^i \qquad (67)$$
is Cramér's series, whose first two coefficients are
$$a_0 = \frac{\mathrm{Sk}}{6} \qquad (68)$$
$$a_1 = \frac{(\mu_4 - 3\sigma^4)\sigma^2 - 3\mu_3^2}{24 \sigma^6}. \qquad (69)$$
The $O(\cdot)$ terms in (65)–(66) constitute a bottleneck in deriving the $O(1)$ terms in (53) and (56); that is, one needs to compute the leading term of the $O(\cdot)$ terms in (65)–(66) in order to compute the $O(1)$ terms in our achievability and converse bounds.
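As a rough numerical sanity check (not from the paper) of (65), the sketch below keeps only the first Cramér-series coefficient $a_0 = \mathrm{Sk}/6$ and compares the resulting approximation with the exact tail of a sum of centered Bernoulli variables; the choices of $q$, $n$, and $x$ are arbitrary, and since the Bernoulli sum is a lattice random variable the agreement is only approximate.

```python
# Sketch: first-order MD approximation P[S_n >= x] ~ Q(x) exp(x^3 Sk / (6 sqrt(n)))
# for centered Bernoulli(q) summands, versus the exact binomial tail.
# q, n, x are arbitrary illustrative values; lattice effects are ignored.
import numpy as np
from scipy.stats import norm, binom

q, n = 0.2, 2000
sigma = np.sqrt(q * (1 - q))
Sk = (1 - 2 * q) / sigma                     # skewness of a Bernoulli(q)

for x in (2.0, 3.0, 4.0):                    # x = o(sqrt(n)): moderate deviations range
    k = np.ceil(n * q + x * sigma * np.sqrt(n))       # threshold on the unnormalized sum
    exact = binom.sf(k - 1, n, q)                     # P[Bin(n, q) >= k]
    approx = norm.sf(x) * np.exp(x ** 3 * Sk / (6 * np.sqrt(n)))
    print(x, exact, approx)
```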
Inverting Theorem 3, namely, obtaining an expansion for $x$ in terms of $y$ where $F_n(x) = Q(y)$, is advantageous in many applications. Such an expansion gives the percentile value corresponding to a probability that is an MD sequence (6). In the CLT regime, i.e., when $F_n(x) \in (0, 1)$ is equal to a value independent of $n$, that expansion is known as the Cornish-Fisher theorem [29], which inverts the Edgeworth expansion. The following lemma is an extension of the Cornish-Fisher theorem to the MD regime.
Lemma 1: Let $X_1, \ldots, X_n$ satisfy the conditions in Theorem 3. Let $y \triangleq Q^{-1}(\epsilon_n) = o(\sqrt{n})$. Suppose that $F_n(-x) = Q(y) = \epsilon_n$; then
$$x = y - b_0 \frac{y^2}{\sqrt{n}} + b_1 \frac{y^3}{n} + O\!\left(\frac{y^4}{n^{3/2}}\right) + O\!\left(\frac{1}{\sqrt{n}}\right), \qquad (70)$$
where
$$b_0 \triangleq \frac{\mathrm{Sk}}{6} \qquad (71)$$
$$b_1 \triangleq \frac{3(\mu_4 - 3\sigma^4)\sigma^2 - 4\mu_3^2}{72 \sigma^6}. \qquad (72)$$
A weaker version of Lemma
1
with only the first two terms in
(
70
), and with
ǫ
n
decaying polynomially with
n
is proved in
[
30
, Lemma 7]. We use Theorem
3
and Lemma
1
to bound the
probability
P
[
ı
(
X
;
Y
)
τ
]
, where
τ
is a threshold satisfying
the condition in Theorem
3
, and the resulting probability is an
MD sequence (
6
).
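The expansion (70) is straightforward to evaluate once the first four moments of the summands are known; the sketch below does so for centered Bernoulli summands, an arbitrary illustrative choice.

```python
# Sketch: the inverse MD expansion (70)-(72),
#   x = y - b0 y^2/sqrt(n) + b1 y^3/n,  y = Q^{-1}(eps_n),
# evaluated for centered Bernoulli(q) summands (illustrative values).
import numpy as np
from scipy.stats import norm

def md_quantile(eps, n, sigma2, mu3, mu4):
    y = norm.isf(eps)                                   # y = Q^{-1}(eps_n)
    b0 = (mu3 / sigma2 ** 1.5) / 6                      # (71): Sk / 6
    b1 = (3 * (mu4 - 3 * sigma2 ** 2) * sigma2 - 4 * mu3 ** 2) / (72 * sigma2 ** 3)  # (72)
    return y - b0 * y ** 2 / np.sqrt(n) + b1 * y ** 3 / n

q = 0.2
sigma2 = q * (1 - q)                                    # sigma^2
mu3 = q * (1 - q) * (1 - 2 * q)                         # E[(B - q)^3]
mu4 = q * (1 - q) * (1 - 3 * q + 3 * q ** 2)            # E[(B - q)^4]
print(md_quantile(1e-4, 2000, sigma2, mu3, mu4))
```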
B. Strong large deviations asymptotics
Theorem 4, below, is a strong large deviations result for an arbitrary sequence of random vectors in $\mathbb{R}^d$.

Let $\{\mathbf{S}_n\}_{n=1}^{\infty}$ be a sequence of $d$-dimensional random vectors. Denote the MGF of the random vector $\mathbf{S}_n$ by
$$\phi_n(\mathbf{z}) = \mathbb{E}[\exp\{\langle \mathbf{z}, \mathbf{S}_n \rangle\}] \qquad (73)$$
for $\mathbf{z} \in \mathbb{C}^d$, and the normalized cumulant generating function of $\mathbf{S}_n$ by
$$\kappa_n(\mathbf{z}) = \frac{1}{n} \log \phi_n(\mathbf{z}). \qquad (74)$$
The Fenchel-Legendre transform of $\kappa_n$ is given by
$$\Lambda_n(\mathbf{x}) = \sup_{\mathbf{t} \in \mathbb{R}^d} \{\langle \mathbf{t}, \mathbf{x} \rangle - \kappa_n(\mathbf{t})\}, \qquad (75)$$
where $\mathbf{x} \in \mathbb{R}^d$ and $\langle \mathbf{t}, \mathbf{x} \rangle \triangleq \sum_{i=1}^d t_i x_i$. The quantity (75) is commonly known as the rate function in the large deviations literature [31, Ch. 2.2].
Theorem 4 (Chaganty and Sethuraman [24, Th. 3.4]): Let $\mathbf{S}_n = (S_{n,1}, \ldots, S_{n,d})$, $n = 1, 2, \ldots$, be a sequence of $d$-dimensional random vectors, and $\{\mathbf{a}_n\}_{n=1}^{\infty}$ be a bounded sequence of $d$-dimensional vectors. Assume that
(S) $\kappa_n$ is bounded below and above, and is analytic in $\mathcal{D}^d$, where $\mathcal{D} \triangleq \{z \in \mathbb{C} : |z| < c\}$ and $c$ is a finite constant;
(ND) there exists a sequence $\{\mathbf{s}_n\}_{n=1}^{\infty}$ and constants $c_0$ and $c_1$ that satisfy
$$\nabla \kappa_n(\mathbf{s}_n) = \mathbf{a}_n \qquad (76)$$
$$0 < c_0 < s_{n,j} < c_1 < c \quad \text{for all } j \in [d] \text{ and } n \ge 1, \qquad (77)$$
where $c$ is the constant given in condition (S), and that the Hessian matrix $\nabla^2 \kappa_n(\mathbf{s}_n)$, which is a covariance matrix of a tilted distribution obtained from $\mathbf{S}_n$, is positive definite with a minimum eigenvalue bounded away from zero for all $n$;
(NL) there exists $\delta_0 > 0$ such that
$$\sup_{\mathbf{t} \colon \delta_1 < \|\mathbf{t}\|_\infty \le \delta_2} \left|\frac{\phi_n(\mathbf{s}_n + \mathrm{i}\mathbf{t})}{\phi_n(\mathbf{s}_n)}\right| = o\!\left(n^{-d/2}\right) \qquad (78)$$
for any given $\delta_1$ and $\delta_2$ such that $0 < \delta_1 < \delta_0 < \delta_2$, where $\mathrm{i} = \sqrt{-1}$ is the imaginary unit.
Then,
$$\mathbb{P}[\mathbf{S}_n \ge n \mathbf{a}_n] = \frac{C_{\mathrm{nl}}}{n^{d/2}} \exp\{-n \Lambda_n(\mathbf{a}_n)\}\, (1 + o(1)), \qquad (79)$$
where
$$C_{\mathrm{nl}} \triangleq \frac{1}{(2\pi)^{d/2} \left(\prod_{j=1}^d s_{n,j}\right) \sqrt{\det(\nabla^2 \kappa_n(\mathbf{s}_n))}}. \qquad (80)$$
Condition (S) of Theorem 4 is a smoothness assumption on the normalized cumulant generating function $\kappa_n$, which is a generalization of Cramér's condition that appears in the large deviations theorem for the sum of i.i.d. random vectors [31, Th. 2.2.30]. Condition (S) implies that all moments of the tilted distribution obtained from $\mathbf{S}_n$ are finite. Condition (ND) is used to ensure that $\mathbf{S}_n$ is a non-degenerate random vector, meaning that it does not converge in distribution to a random vector with $\ell < d$ dimensions, and that the rate function $\Lambda_n(\mathbf{a}_n)$ is bounded and does not decay to zero. The latter follows from the boundedness condition in (77), and implies that the probability of interest is in the LD regime. The left-hand side of (78) is the characteristic function of a tilted distribution obtained from the distribution of $\mathbf{S}_n$ [24]; hence the supremum in (78) is strictly below 1 for any non-lattice random vector (the characteristic function of a lattice random variable is periodic, and therefore its absolute value equals 1 at some non-zero argument). Condition (NL) is used to guarantee that the absolute value of that characteristic function decays to zero fast enough outside a neighborhood of the origin, which makes the random vector $\mathbf{S}_n$ behave like a sum of $n$ non-lattice random vectors.
We apply Theorem 4 to the sequence of 2-dimensional non-lattice random vectors $(\imath(X; Y),\, \imath(\bar{X}; Y) - \imath(X; Y))$ to bound the probability $\mathbb{P}[\imath(\bar{X}; Y) \ge \imath(X; Y) \ge \tau_n]$ for some sequence $\tau_n$. Here, $\bar{X}$ is a random codeword that is independent of both $X$ and $Y$.
When applied to the sum of $n$ i.i.d. random variables $S_n = \sum_{i=1}^n X_i$, $\kappa_n$ in (74) reduces to
$$\kappa(z) = \log \mathbb{E}[\exp\{\langle z, X_1 \rangle\}]. \qquad (81)$$
If $X_1$ has a finite support, the expectation in (81) is bounded, and all moments of $X_1$ are finite; therefore, condition (S) of Theorem 4 is satisfied. Further, the characteristic function of the sum of $n$ i.i.d. random vectors is equal to the $n$-th power of the characteristic function of one of the summands. Therefore, the left-hand side of (78) decays to zero exponentially fast for the sum of i.i.d. non-lattice random vectors, satisfying condition (NL) of Theorem 4 with room to spare.
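To see the structure of (79)–(80) in the simplest setting, the sketch below specializes to $d = 1$ and a sum of $n$ i.i.d. Exponential(1) variables, for which the exact tail is a Gamma tail; the values of $n$ and the threshold $a$ are illustrative only.

```python
# Sketch: the strong large deviations approximation (79)-(80) for d = 1 and
# S_n a sum of n i.i.d. Exponential(1) variables, compared to the exact
# Gamma tail.  Here kappa(t) = -log(1 - t), s solves kappa'(s) = a, and
# Lambda(a) = s a - kappa(s) = a - 1 - log a.
import numpy as np
from scipy.stats import gamma

def strong_ld_tail(n, a):
    s = 1.0 - 1.0 / a                     # saddlepoint: kappa'(s) = 1/(1-s) = a
    kpp = 1.0 / (1.0 - s) ** 2            # kappa''(s) = a^2 (tilted variance)
    Lam = s * a + np.log(1.0 - s)         # rate function Lambda(a)
    prefactor = 1.0 / (s * np.sqrt(2 * np.pi * kpp))    # C_nl in (80) with d = 1
    return prefactor / np.sqrt(n) * np.exp(-n * Lam)

n, a = 200, 1.5
print("approximation:", strong_ld_tail(n, a))
print("exact tail   :", gamma.sf(n * a, n))   # P[S_n >= n a], S_n ~ Gamma(n, 1)
```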
We use the following strong large deviations result to bound the probability $\mathbb{P}[\imath(\bar{X}; Y) \ge \imath(X; Y) \ge \tau_n]$ with lattice $\imath(X; Y)$ and $\imath(\bar{X}; Y)$.
Theorem 5: Let $\{\mathbf{S}_n\}_{n=1}^{\infty}$ be a sequence of random vectors on $\mathbb{R}^d$. Suppose that $\mathbf{S}_n = (S_{n,1}, \ldots, S_{n,d})$, and $S_{n,j}$ is a lattice random variable with span $h_{n,j}$, i.e., $\mathbb{P}[S_{n,j} \in \{b_{n,j} + k h_{n,j} : k \in \mathbb{Z}\}] = 1$ for some $b_{n,j}$, such that there exist positive constants $\underline{h}_j$ and $\overline{h}_j$ satisfying $\underline{h}_j < h_{n,j} < \overline{h}_j$ for all $j \in [d]$ and $n \ge 1$. Assume that conditions (S) and (ND) in Theorem 4 hold, and replace condition (NL) by
(L) there exists $\lambda > 0$ such that for any given $\boldsymbol{\delta}$ satisfying $\mathbf{0} < \boldsymbol{\delta} < \lambda \mathbf{1}$,
$$\sup_{\mathbf{t} \colon \delta_j < |t_j| \le \frac{\pi}{h_{n,j}} \text{ for } j \in [d]} \left|\frac{\phi_n(\mathbf{s}_n + \mathrm{i}\mathbf{t})}{\phi_n(\mathbf{s}_n)}\right| = o\!\left(n^{-d/2}\right). \qquad (82)$$
Assume that $n \mathbf{a}_n$ is in the range of the random vector $\mathbf{S}_n$. Then,
$$\mathbb{P}[\mathbf{S}_n \ge n \mathbf{a}_n] = \frac{C_{\mathrm{l}}}{n^{d/2}} \exp\{-n \Lambda_n(\mathbf{a}_n)\}\, (1 + o(1)), \qquad (83)$$
where
$$C_{\mathrm{l}} \triangleq \frac{1}{(2\pi)^{d/2} \sqrt{\det(\nabla^2 \kappa_n(\mathbf{s}_n))}} \prod_{j=1}^d \frac{h_{n,j}}{1 - \exp\{-s_{n,j} h_{n,j}\}}. \qquad (84)$$
Proof: The one-dimensional lattice case, i.e., $d = 1$, is proved in [32, Th. 3.5]. The proof of the $d$-dimensional lattice case follows by inspecting the proofs for the $d$-dimensional non-lattice random vectors in [24, Th. 3.4] and the one-dimensional lattice random variables in [32, Th. 3.5]. Specifically, in the proof of [24, Th. 3.4], we replace [24, Th. 2.4] by [32, Th. 2.10]. The auxiliary result [32, Th. 2.10] gives the asymptotics of an expectation of a lattice random variable. The modification in the proof yields Theorem 5.
In the application of Theorem 5, if $\mathbf{S}_n = (S_{n,1}, \ldots, S_{n,d})$ is a sum of $n$ i.i.d. random vectors, where
$$S_{n,j} = \sum_{i=1}^n X_{i,j}, \quad j \in [d], \qquad (85)$$
and $X_{1,j}$ is lattice distributed with span $h_j$ for $j \in [d]$, then
$$\sup_{\delta_j < |t_j| \le \frac{\pi}{h_j}} \left|\frac{\phi^{(j)}(s_j + \mathrm{i} t_j)}{\phi^{(j)}(s_j)}\right| < 1, \quad j \in [d], \qquad (86)$$