THE ENERGY-STEPPING MONTE CARLO METHOD:
A HAMILTONIAN MONTE CARLO METHOD WITH A
100% ACCEPTANCE RATIO

I. ROMERO¹,² AND M. ORTIZ³,⁴
Abstract. We introduce the energy-stepping Monte Carlo (ESMC) method, a Markov chain Monte Carlo (MCMC) algorithm based on the conventional dynamical interpretation of the proposal stage but employing an energy-stepping integrator. The energy-stepping integrator is quasi-explicit, symplectic, energy-conserving, and symmetry-preserving. As a result of the exact energy conservation of energy-stepping integrators, ESMC has a 100% acceptance ratio of the proposal states. Numerical tests provide empirical evidence that ESMC affords a number of additional benefits: the Markov chains it generates have weak autocorrelation, it has the ability to explore distant characteristic sets of the sampled probability distribution, and it yields smaller errors than chains sampled with Hamiltonian Monte Carlo (HMC) and similar step sizes. Finally, ESMC benefits from the exact symmetry conservation properties of the energy-stepping integrator when sampling from potentials with built-in symmetries, whether explicitly known or not.
1. Introduction
Markov chain Monte Carlo (MCMC) methods are among the most impor-
tant algorithms in scientific computing [1, 2, 3]. Introduced during World
War II in the context of the Manhattan Project [4], they have become indis-
pensable in statistics, applied mathematics, statistical mechanics, chemistry,
machine learning, and other fields (see, e.g., [5, 6, 7, 8, 2]).
The goal of MCMC methods is to sample from an arbitrary probabil-
ity distribution or, more generally, calculate expectations on (possibly un-
bounded) probability spaces with known densities. There exist well-known
closed-form expressions that can perform this task when the distribution
is simple or, more generally, when the sample space is one-dimensional [9].
However, this is not the case for the complex, multidimensional sample distributions that are often of interest in practical applications. Many methods have been developed to alleviate the computational cost, and efforts still continue (see, for example, the monographs [9, 10, 11]).
arXiv:2312.07215v1 [math-ph] 12 Dec 2023

Importance sampling and rejection sampling are among the simplest sampling methods for general probability distributions [12]. It is possible to show that, under mild conditions, arbitrarily large samples can be obtained that are distributed according to a given probability distribution. However, in actual practice calculations can be exceedingly costly and inefficient, and the applicability of those methods is limited to small and simple distributions.
Instead, MCMC methods are the de facto choice in practical problems that require random sampling. The theory of these algorithms is grounded in the properties of Markov chains [10] and always involves a two-step process: given an existing sample, a new one is proposed, and it is either accepted or rejected before proceeding in a recursive fashion. The proposal step is key, and the success of an MCMC method mainly depends on its ability to efficiently generate samples that are accepted almost always, while simultaneously covering the sample space. Many variants of MCMC have been proposed in pursuit of these goals, starting from the original Metropolis method [13] and including the popular modification by Hastings [14].
MCMC methods explore the sample space of a distribution and produce a sequence of samples that should be concentrated where the probability density is highest. This characteristic set is often small and may consist of disconnected subsets that are far apart, with vast regions of low probability density separating them. The goal of MCMC methods is to explore the characteristic set as quickly as possible, covering all the regions where the probability measure is non-negligible. Indeed, when an MCMC method generates samples in regions of small probability, most proposals are rejected, which adds to the cost and renders the method inefficient. Thus, one of the goals of an MCMC method is to transition rapidly between high-probability regions of the characteristic set.
One particular class of MCMC algorithms that has proven particularly efficient is the class of Hamiltonian Monte Carlo (HMC) methods. Originally introduced in the context of molecular dynamics, and initially referred to as the hybrid MC method [15], HMC exploits an interpretation of the sampling process as the motion of a generalized particle of the type customarily considered in Hamiltonian mechanics [16, 17]. This identification opens the door to the use of techniques developed to integrate Hamiltonian systems in time. Specifically, the proposal step in HMC employs symplectic integrators, time-stepping algorithms designed to preserve some of the geometric structure of Hamiltonian systems. This strategy generates fast proposals that efficiently cover the characteristic set, as desired.
Symplectic integrators preserve the symplectic form of Hamiltonian systems and possibly other important invariants and symmetries. Their excellent properties have made them popular and have been exhaustively analyzed [18]. One well-known result is that, even in conservative problems, symplectic integrators cannot preserve the total energy [19] unless the time step size is added to the unknowns of the problem [20]. This fact is exploited in MCMC methods: when the symplectic integration yields a proposal with a large energy error, this sample is rejected. A delicate balance is then sought: if the time integration of the MCMC method is performed for a short period of time or with a small time step, the proposal will most likely be accepted, albeit at great computational cost and long time to traverse the characteristic set; conversely, if the time integration is performed for a long time or with a large time step, the method can potentially explore larger regions of the state space, albeit at the risk of proposing a sample that is likely to be rejected. Thus, the challenge in HMC is to use fast, structure-preserving integrators capable of generating samples that efficiently explore the characteristic set without incurring large energy errors that would entail their rejection.
The standard integrator for HMC is the leapfrog method [21]. Leapfrog is a second-order accurate, explicit, symplectic time integration scheme widely employed, e.g., in molecular dynamics [22]. Being explicit, the method is conditionally stable but has very low computational cost. In addition, as a result of its symplecticity it exactly preserves a shadow Hamiltonian [23], which limits the extent of energy drift from the exact value.
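The bounded energy error of leapfrog can be verified numerically. The following minimal Python sketch (our own illustration, not taken from the paper) integrates a harmonic oscillator with H(q, p) = (q² + p²)/2 and checks that the energy error oscillates without secular drift:

```python
def leapfrog(q, p, grad_V, dt, n_steps):
    """Leapfrog (velocity Verlet) for dq/dt = p, dp/dt = -grad_V(q), unit mass."""
    for _ in range(n_steps):
        p -= 0.5 * dt * grad_V(q)  # half kick
        q += dt * p                # full drift
        p -= 0.5 * dt * grad_V(q)  # half kick
    return q, p

# Harmonic oscillator: V(q) = q^2/2, so H(q, p) = (q^2 + p^2)/2 and grad_V(q) = q.
H = lambda q, p: 0.5 * (q * q + p * p)
q0, p0 = 1.0, 0.0

q, p = q0, p0
errors = []
for _ in range(1000):
    q, p = leapfrog(q, p, lambda x: x, dt=0.1, n_steps=1)
    errors.append(abs(H(q, p) - H(q0, p0)))

drift = max(errors)
print(f"max |energy error| over 1000 steps: {drift:.2e}")
```

With a stable step size the error stays of order dt² for arbitrarily long runs, which is the practical consequence of the shadow-Hamiltonian property.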
In this work, we propose a new class of MCMC methods of the HMC type, which we refer to as energy-stepping Monte Carlo (ESMC) methods, where the symplectic integrator (e.g., leapfrog) is replaced by an energy-stepping integrator [24, 25]. These integrators are designed for Hamiltonian problems and possess remarkable properties such as symplecticity, unconditional stability, exact conservation of energy, and preservation of all the symmetries of the Lagrangian, whether explicitly known or not. Energy-stepping integrators are essentially explicit, requiring the solution of one scalar equation per integration step, irrespective of the dimension of the system. Given their remarkable properties and competitive cost, these integrators suggest themselves as excellent candidates to replace other symplectic integrators in HMC: they can be expected to explore the characteristic set as efficiently as other symplectic integrators but, remarkably, to propose samples with absolutely no rejections. Because of its 100% acceptance rate, obtained at an almost negligible increment in computational cost, ESMC has the potential to significantly improve the performance of HMC.
The article is structured as follows. In Section 2, we review the concepts
of MCMC methods, and provide the framework for ESMC. The Hamiltonian
Monte Carlo method is presented in Section 3, and the role played by the
time integration step is carefully stressed. Next, Section 4 describes the
energy-stepping time integration scheme, without any reference to Monte
Carlo methods but, rather, as a method designed to integrate Hamiltonian
mechanics. Then, in Section 5, we introduce the main contribution of this
work, namely, the ESMC method. Numerical simulations that illustrate its
performance and the comparison with other standard MCMC methods are
presented in Section 6. We emphasize that the numerical tests presented in this work focus on exemplifying the fundamental properties of ESMC and are not intended to be representative of production-ready codes or libraries such as STAN and NIMBLE (e.g., [26]), which are the result of extensive development and fine-tuning. Finally, Section 7 closes the article with a summary of the main findings and an outlook for further work.
2. Markov chain Monte Carlo methods
By way of background, we start by briefly reviewing the fundamentals of
Markov chains and their link with sampling methods. In many applications
of statistics, it is necessary to evaluate expectations relative to probabil-
ity density functions that are complex and, therefore, impossible to obtain
analytically. These situations appear, for instance, when doing inference in
Bayesian models or, simply, when predicting the output of random processes.
In view of the difficulty in obtaining closed-form expectations, numerical ap-
proximation is required.
The standard procedure is as follows. Let Ω be an n-dimensional sample space, not necessarily compact, and μ a Lebesgue-continuous probability measure

(2.1)    μ(B) = ∫_B π(x) dΩ,

for all Lebesgue-measurable sets B ⊂ Ω and integrable probability density function π : Ω → R₊. The main problem is to calculate
(2.2)    E_μ[f] = ∫_Ω f(x) π(x) dΩ,

for all bounded continuous functions f. In order to approximate integral (2.2), assume that we have a collection {x_k}_{k=1}^N of independent, identically distributed samples with probability μ. Then, the corresponding empirical approximation of the expectation is
(2.3)    E_μ[f] ≈ S_N := (1/N) Σ_{k=1}^{N} f(x_k).
By the weak-* density of Diracs in the space of probability measures, possibly under moment constraints (e.g., [27, 28]), it is possible to choose (unbiased) sequences {x_k}_{k=1}^N such that

(2.4)    lim_{N→∞} S_N = E_μ[f],

for every bounded continuous function f.
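By way of illustration (our own sketch, not part of the original text), the estimator (2.3) and its convergence (2.4) can be checked in a few lines of Python for μ a standard normal and f(x) = x², for which E_μ[f] = 1 exactly:

```python
import random

def empirical_expectation(f, sampler, N):
    """S_N = (1/N) * sum_{k=1}^{N} f(x_k) over i.i.d. samples x_k, cf. Eq. (2.3)."""
    return sum(f(sampler()) for _ in range(N)) / N

random.seed(0)
sampler = lambda: random.gauss(0.0, 1.0)  # mu = standard normal
f = lambda x: x * x                       # E_mu[f] = 1 exactly

estimates = {N: empirical_expectation(f, sampler, N) for N in (10, 1_000, 100_000)}
for N, s_n in estimates.items():
    print(f"N = {N:>6}:  S_N = {s_n:.4f}")
```

As N grows, S_N approaches E_μ[f] = 1 at the usual Monte Carlo rate O(N^{-1/2}).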
Evidently, the key to calculating converging empirical expectations of the form (2.3) efficiently is the construction of the sample array {x_k}_{k=1}^N. Naive methods to generate samples using, for example, rejection sampling are extremely costly, especially in high-dimensional sample spaces [12].
The standard workhorse for this task is the Markov chain Monte Carlo
method (MCMC) and its variants [12]. MCMC methods are designed to
generate sequences of samples distributed according to the target probabil-
ity and spending most of the computational effort in those regions of high
probability density. To this end, special random walks in sample space are
generated with stochastic rules determining whether some of these states
should be discarded or not.
THE ENERGY-STEPPING MONTE CARLO METHOD
5
2.1. Markov chains. We summarize the basic concepts of Markov chains that are required to define MCMC methods and refer to specialized monographs for additional results and proofs (e.g., [10]).
A (discrete) Markov chain is a finite sequence {q_k}_{k=1}^N of random variables that possesses the Markov property: the conditional probability of q_{k+1} given all past values q_k, q_{k−1}, ..., q_1, q_0 is actually a function of q_k only. This probability is known as the transition kernel K and we write
(2.5)    q_{k+1} | q_k, q_{k−1}, ..., q_1, q_0 ∼ q_{k+1} | q_k ∼ K(q_k, q_{k+1}).
The Markov chains of interest for MCMC methods are those that are irreducible and possess a stationary distribution. The first property, irreducibility, requires that, given any initial value q_0 and an arbitrary non-empty set, the probability that, at some time, the Markov chain will generate a state belonging to the set is greater than zero. The second property, the existence of a stationary distribution, demands that there exist a probability density function π that is preserved by the transition kernel, i.e., if q_k ∼ π then q_{k+1} ∼ π, or, equivalently,
(2.6)    ∫ K(x, y) π(x) dx = π(y).
A necessary condition for a Markov chain to be stationary is that it is
irreducible.
In a recurrent Markov chain, the stationary distribution π is limiting, i.e., for almost any initial sample q_0 the sample q_k is distributed as π for large enough k. This property, also referred to as ergodicity, is exploited in the formulation of MCMC methods: the search for samples distributed according to π aims at building an ergodic Markov chain with a transition kernel K and stationary probability identical to π. If these properties are attained, the values of the chain will eventually be distributed as unbiased samples from π.
A stationary Markov chain is reversible if K(x, y) = K(y, x). Moreover, a Markov chain satisfies the detailed balance condition if there exists a probability π such that

(2.7)    K(y, x) π(y) = K(x, y) π(x).
If the transition kernel of a Markov chain verifies the detailed balance condition with function π, then the latter is a stationary probability associated with the transition kernel K and the chain is reversible. While this condition is sufficient but not necessary for the two properties to hold, detailed balance is easy to verify and is often employed in practice to analyze MCMC methods.
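As a concrete check (an illustration of ours, with a kernel and distribution chosen by hand), detailed balance (2.7) and its consequence (2.6) can be verified numerically for a two-state chain:

```python
# Transition matrix K[i][j] = probability of jumping from state i to state j,
# and a candidate stationary distribution pi (both chosen by hand for this check).
K = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2.0 / 3.0, 1.0 / 3.0]

# Detailed balance (2.7): K(y, x) pi(y) = K(x, y) pi(x) for all pairs (x, y).
for x in range(2):
    for y in range(2):
        assert abs(K[y][x] * pi[y] - K[x][y] * pi[x]) < 1e-12

# Consequence (2.6): pi is stationary under K.
for y in range(2):
    assert abs(sum(K[x][y] * pi[x] for x in range(2)) - pi[y]) < 1e-12

print("detailed balance and stationarity verified")
```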
2.2. Markov chains for Monte Carlo methods. The key idea behind MCMC methods is that independent, identically distributed unbiased samples with probability distribution π can be effectively obtained simply by collecting the states in an ergodic Markov chain defined by a transition kernel K designed to verify the detailed balance condition (2.7) with probability π. However, this strategy leaves considerable freedom to sample from π in many different ways, depending on the Markov chain employed.
Metropolis-Hastings MCMC, the most common type of MCMC method, employs a proposal or instrumental probability distribution p to define the transition kernel. Specifically, given a chain with current value q_k, the next state is obtained by a two-step procedure: first, a random sample q̃ is generated with probability density p(q̃ | q_k). Then, the state q_{k+1} is selected to be equal to q̃ with probability ρ(q_k, q̃), or equal to the previous state q_k with probability 1 − ρ(q_k, q̃), where

(2.8)    ρ(q_k, q̃) = min{ [π(q̃) p(q_k | q̃)] / [π(q_k) p(q̃ | q_k)], 1 }.
This transition map can be shown to satisfy the detailed balance condition for all probabilities p and π (cf. [10]). It bears emphasis that in the Metropolis-Hastings algorithm, once a proposal q̃ is generated, it may be rejected. Thus, the ratio of accepted to rejected proposal samples can dramatically affect the ability of the method to explore the characteristic set of the target probability distribution π, as well as its computational cost [17]. This is a delicate tradeoff for which there exist heuristic rules: too small an acceptance ratio leads to an inefficient sampling strategy, whereas too large a ratio might indicate that the Markov chain is covering the characteristic set of π too slowly.
The simplest variant of MCMC, commonly known as random walk MCMC, employs a proposal distribution that is a Gaussian centered at the current value of the Markov chain. This proposal is obviously symmetric, and its variance can be selected to optimize the acceptance ratio. Other proposals, such as the Gamma or Student-t distributions, may be exploited to bias the new states of the Markov chain away from previously explored regions.
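The random-walk variant can be sketched in a few lines of Python (our own illustration; the function names are ours). Because the Gaussian proposal is symmetric, p(q_k | q̃) = p(q̃ | q_k) and the Hastings ratio (2.8) reduces to π(q̃)/π(q_k):

```python
import math
import random

def metropolis(log_pi, q0, sigma, n_samples, rng):
    """Random-walk Metropolis with Gaussian proposal q~ = q_k + N(0, sigma^2)."""
    chain = [q0]
    q = q0
    for _ in range(n_samples - 1):
        q_prop = q + rng.gauss(0.0, sigma)                    # propose
        if math.log(rng.random()) < log_pi(q_prop) - log_pi(q):
            q = q_prop                                        # accept
        chain.append(q)                                       # on rejection, repeat q_k
    return chain

rng = random.Random(1)
log_pi = lambda q: -0.5 * q * q  # standard normal target, up to a constant
chain = metropolis(log_pi, q0=0.0, sigma=1.0, n_samples=50_000, rng=rng)

mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

For this target, the sample mean and variance approach 0 and 1; note that rejected proposals repeat the current state, which contributes to the autocorrelation of the chain.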
3. Hamiltonian Monte Carlo methods
Hamiltonian Monte Carlo (HMC) methods are yet another family of algorithms designed to sample complex target probability distributions. They have proven advantageous over other, more traditional MCMC methods, especially for the high-dimensional sample spaces that appear in problems of statistical mechanics. Originally introduced in the context of molecular dynamics and dubbed the hybrid Monte Carlo method [15], this family of algorithms exploits ideas and numerical methods used in the study of dynamical systems and, more specifically, Hamiltonian problems on symplectic or Poisson spaces [16].
Similarly to Metropolis-Hastings MCMC, HMC methods generate Markov chains of independent samples distributed according to a target probability distribution, but differ in that they take advantage of the geometric information provided by the gradient of the target probability density. By interpreting the state of the Markov chain as a particle moving conservatively under the action of a fictitious potential defined by the target probability, HMC methods generate chains that have been shown to efficiently explore the regions of high probability density, even when highly concentrated [17]. Thus, HMC methods essentially replace the proposal step of an MCMC method but keep the accept/reject step unchanged, albeit in a form more conveniently expressed in terms of the surrogate energy.
To describe the HMC approach, let q denote, as before, the random variable for which a target probability with density π is known. The goal of HMC is to construct a collection {q_k}_{k=1}^N of independent samples identically distributed according to π. Let us assume that

(3.1)    π(q) = (1/Z_q) e^{−V(q)},

where we refer to V : Ω → R as the potential function and Z_q is a normalizing factor. We additionally introduce an ancillary random variable p : Ω → R and postulate that the two random variables q and p, defining the coordinate z = (q, p) in phase space, satisfy

(3.2)    (q, p) ∼ π(q, p) = π(p | q) π(q),
with

(3.3)    π(p | q) = (1/Z_p) e^{−K(q,p)}

and K : Ω × Ω → R₊ referred to as the kinetic energy. Then, the joint distribution (3.2) follows as
(3.4)    π(q, p) = (1/(Z_q Z_p)) e^{−V(q)−K(q,p)} = (1/Z_H) e^{−H(q,p)},

where the normalizing constant is now Z_H = Z_p Z_q and
(3.5)    H(q, p) = V(q) + K(q, p)

is the Hamiltonian. Note that the marginal distribution of the pair (q, p) with respect to p satisfies

(3.6)    ∫_{R^n} π(q, p) dp = ∫_{R^n} π(p | q) π(q) dp = π(q).
Hence, if a sequence {(q_k, p_k)}_{k=1}^N is sampled with distribution π(q, p), the submersion (q_k, p_k) ↦ q_k is distributed with probability π(q).
3.1. Algorithmic details. Similarly to other MCMC methods, HMC proceeds recursively, generating a Markov chain of states in the process. On each iteration, a state is proposed and then either accepted or rejected. Specifically, let q_k be the last accepted state of the chain. Then, to compute the next state, HMC methods calculate the following:
(1) In the first step, a random value for the momentum p_k is sampled with probability distribution π(p_k | q_k) as defined in Eq. (3.3). To that end, a kinetic potential is selected. A simple choice is

(3.7)    K(q, p) = (1/2) p · M⁻¹ p + log |M|,

with M a constant metric. Other, more sophisticated kinetic energies make use of metrics M that depend on the configuration in an attempt to better capture geometrical detail, but such extensions are not considered here.
(2) In the second step, the pair z_k = (q_k, p_k) evolves under the Hamiltonian flow

(3.8)    q̇(t) = ∂H/∂p (q(t), p(t)),    ṗ(t) = −∂H/∂q (q(t), p(t)),

for a time interval t ∈ [0, T], and with initial conditions

(3.9)    q(0) = q_k,    p(0) = p_k.

Since the potential may be an arbitrary function, an exact solution to this evolution problem is not available in general and a time integrator, such as leapfrog, needs to be used instead. A delicate tradeoff concerns the choice of the length T of the integration interval and its relationship with the time step size required for stability.
(3) The proposal state z̃ = (q̃, p̃) = (q(T), p(T)) is accepted with probability

(3.10)    ρ(z_k, z̃) = min{1, exp[H(z_k) − H(z̃)]},

and rejected with probability 1 − ρ(z_k, z̃) otherwise. If accepted, we set z_{k+1} = (q_{k+1}, p_{k+1}) = z̃ and the state q_{k+1} is added to the Markov chain.
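As an illustration (a sketch of ours, not the paper's implementation), the three steps above can be combined into a one-dimensional HMC sampler with unit mass matrix M = 1, for which the kinetic energy (3.7) reduces to p²/2 (up to a constant) and the momentum in step (1) is a standard normal:

```python
import math
import random

def hmc_step(q, V, grad_V, dt, n_leap, rng):
    """One HMC iteration: momentum refreshment, leapfrog, accept/reject (3.10)."""
    p = rng.gauss(0.0, 1.0)                  # step (1): p ~ pi(p | q) = N(0, 1)
    q_new, p_new = q, p
    p_new -= 0.5 * dt * grad_V(q_new)        # step (2): leapfrog for Eqs. (3.8)
    for _ in range(n_leap):
        q_new += dt * p_new
        p_new -= dt * grad_V(q_new)
    p_new += 0.5 * dt * grad_V(q_new)        # restore the trailing half kick
    # Step (3): accept with probability min{1, exp[H(z_k) - H(z~)]}.
    H_old = V(q) + 0.5 * p * p
    H_new = V(q_new) + 0.5 * p_new * p_new
    return q_new if math.log(rng.random()) < H_old - H_new else q

rng = random.Random(7)
V = lambda q: 0.5 * q * q        # target pi(q) proportional to exp(-V(q)): standard normal
grad_V = lambda q: q

q, chain = 0.0, []
for _ in range(20_000):
    q = hmc_step(q, V, grad_V, dt=0.2, n_leap=10, rng=rng)
    chain.append(q)

mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

Because leapfrog conserves the energy only approximately, the exponent H(z_k) − H(z̃) is nonzero and a fraction of proposals is rejected; this is precisely the step that ESMC eliminates.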
4. Energy-stepping integrators
We proceed to review energy-stepping integrators, a paradigm that differs fundamentally from classical integrators such as Runge-Kutta, multistep, or one-leg methods [29, 30], but shares all the advantageous features of symplectic methods while possessing some unique ones [24, 25].
We begin by focusing on systems characterized by Lagrangians L : R^N × R^N → R of the form

(4.1)    L(q, q̇) = (1/2) q̇^T M q̇ − V(q),

where M is a mass matrix and V(q) is the potential energy function. The classical motivation for systems of this type comes from celestial, structural, solid, and molecular mechanics [31]. This framework fits within HMC since the Lagrangian (4.1) is equivalent to the Hamiltonian (3.5) through the Legendre transform

(4.2)    L(q, q̇) = sup_p ( q̇ · p − H(q, p) ).
The trajectories of the Lagrangian (4.1) render Hamilton's action

(4.3)    I = ∫ L(q(t), q̇(t)) dt

stationary. Since such trajectories are not easy to find, the classical approximation paradigm is to discretize the action in time, leading to variational time integrators.
The energy-stepping paradigm is at variance with time-stepping in that it employs exact solutions of an approximate Lagrangian that can be solved exactly. For Lagrangians of the form (4.1), [24, 25] propose the approximate Lagrangians

(4.4)    L_h(q, q̇) = (1/2) q̇^T M q̇ − V_h(q),

where V_h is some exactly solvable approximation of the potential energy.
The energy-stepping method specifically considers piecewise constant approximations of the potential energy, i.e., terraced approximations V_h of constant energy-step height h defined as

(4.5)    V_h(q) = h ⌊h⁻¹ V(q)⌋,

where ⌊·⌋ is the floor function, i.e., ⌊x⌋ = max{n ∈ Z : n ≤ x}. Based on this definition, V_h is the largest piecewise-constant function with values in hZ majorized by V.
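The terraced approximation (4.5) is straightforward to implement. The following sketch (ours, with an arbitrarily chosen step height h = 0.25) evaluates V_h for a Kepler-like potential V(q) = −1/q on q > 0 and checks the majorization property:

```python
import math

def terraced(V, h):
    """V_h(q) = h * floor(V(q) / h): the terraced approximation of Eq. (4.5)."""
    return lambda q: h * math.floor(V(q) / h)

V = lambda q: -1.0 / q          # Kepler-like potential on q > 0 (cf. Fig. 4.1)
h = 0.25
Vh = terraced(V, h)

for q in (0.7, 1.3, 2.6):
    v, vh = V(q), Vh(q)
    assert vh <= v               # V_h is majorized by V
    assert v - vh < h            # and lies within one energy step below it
    assert abs(vh / h - round(vh / h)) < 1e-9  # values lie in h * Z
    print(f"q = {q}:  V = {v:+.4f},  V_h = {vh:+.4f}")
```

As h decreases, V_h converges to V, producing the approximating sequence of Lagrangians described above.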
An approximating sequence of potential energies, and by extension Lagrangians, can be built by selecting energy steps of decreasing height. Other types of approximations, such as piecewise linear interpolations of the potential energy, also result in exactly integrable approximating systems [24] but will not be considered here for definiteness. See Fig. 4.1 for an illustration of a piecewise constant and a piecewise linear approximation of the Kepler potential.
4.1. Computation of the exact trajectories of the approximating Lagrangian. Following [25], we describe next the calculation of the exact trajectories generated by the terraced Lagrangians L_h(q, q̇).
Suppose that a mechanical system is in configuration q_0 at time t_0, in configuration q_2 at time t_2, and that during the time interval [t_0, t_2] it intersects one single jump surface Γ_1 separating two regions of constant energies V_0 and V_2 (see Figure 4.2). Based on the form of V_h, Γ_1 is the level surface V = V_2 for an uphill step, V_2 = V_0 + h, or the level surface V = V_0 for a downhill step, V_2 = V_0 − h. For simplicity, we shall further assume that V is differentiable and that all energy-level crossings are transversal, i.e.,

(4.6)    n(q_1) · q̇_1^− ≠ 0,

where q̇_1^− = q̇