THE ENERGY-STEPPING MONTE CARLO METHOD:
A HAMILTONIAN MONTE CARLO METHOD WITH A
100% ACCEPTANCE RATIO

I. ROMERO^{1,2} AND M. ORTIZ^{3,4}
Abstract. We introduce the energy-stepping Monte Carlo (ESMC) method, a
Markov chain Monte Carlo (MCMC) algorithm based on the conventional dynamical
interpretation of the proposal stage but employing an energy-stepping
integrator. The energy-stepping integrator is quasi-explicit, symplectic,
energy-conserving, and symmetry-preserving. As a result of the exact energy
conservation of energy-stepping integrators, ESMC has a 100% acceptance ratio
of the proposal states. Numerical tests provide empirical evidence that ESMC
affords a number of additional benefits: the Markov chains it generates have
weak autocorrelation, it has the ability to explore distant characteristic
sets of the sampled probability distribution, and it yields smaller errors
than chains sampled with Hamiltonian Monte Carlo (HMC) and similar step
sizes. Finally, ESMC benefits from the exact symmetry conservation properties
of the energy-stepping integrator when sampling from potentials with built-in
symmetries, whether explicitly known or not.
1. Introduction
Markov chain Monte Carlo (MCMC) methods are among the most impor-
tant algorithms in scientific computing [1, 2, 3]. Introduced during World
War II in the context of the Manhattan Project [4], they have become indis-
pensable in statistics, applied mathematics, statistical mechanics, chemistry,
machine learning, and other fields (see, e.g., [5, 6, 7, 8, 2]).
The goal of MCMC methods is to sample from an arbitrary probabil-
ity distribution or, more generally, calculate expectations on (possibly un-
bounded) probability spaces with known densities. There exist well-known
closed-form expressions that can perform this task when the distribution
is simple or, more generally, when the sample space is one-dimensional [9].
However, this is not the case for complex and multidimensional sample dis-
tributions that are often of interest in practical applications. Many methods
have been developed to alleviate these computational costs, and efforts still
continue (see, for example, the monographs [9, 10, 11]).
Importance sampling and rejection sampling are among the simplest sampling
methods for general probability distributions [12]. It is possible to show
that, under mild conditions, arbitrarily large samples can be obtained that
are distributed according to a given probability distribution. However, in
arXiv:2312.07215v1 [math-ph] 12 Dec 2023
actual practice such calculations can be exceedingly costly and inefficient,
and the applicability of those methods is limited to small and simple
distributions.
Instead, MCMC methods are the de facto choice in practical problems that
require random sampling. The theory of these algorithms is grounded on the
properties of Markov chains [10] and always involves a two-step process:
given an existing sample, a new one is proposed, and it is either accepted or
rejected before proceeding in a recursive fashion. The proposal step is key,
and the success of an MCMC method mainly depends on its ability to
efficiently generate samples that are accepted almost always, while
simultaneously covering the sample space. Many variants of MCMC have been
proposed in the pursuit of these goals, starting from the original Metropolis
method [13] and including the popular modification by Hastings [14].
MCMC methods explore the sample space of a distribution and produce a
sequence of samples that should be concentrated where the probability density
is highest. This characteristic set is often small and may consist of
disconnected subsets that are far apart, with vast regions of low probability
density separating them. The goal of MCMC methods is to explore the
characteristic set as quickly as possible, covering all the regions where the
probability measure is non-negligible. Indeed, when an MCMC method generates
samples in regions with small probability, most proposals are rejected, which
adds to cost and renders the method inefficient. Thus, one of the goals of an
MCMC method is to transition rapidly between high-probability regions of the
characteristic set.
One particular class of MCMC algorithms that has proven particularly
efficient is the class of Hamiltonian Monte Carlo (HMC) methods. Originally
introduced in the context of molecular dynamics, and initially referred to as
the hybrid MC method [15], it exploits an interpretation of the sampling
process as the motion of a generalized particle, of the type customarily
considered in Hamiltonian mechanics [16, 17]. Such identification opens the
door to the use of techniques developed to integrate Hamiltonian systems in
time. Specifically, the proposal state in HMC employs symplectic integrators,
time-stepping algorithms designed to preserve some of the geometric structure
of Hamiltonian systems. This strategy generates fast proposals that
efficiently cover the characteristic set, as desired.
Symplectic integrators preserve the symplectic form of Hamiltonian sys-
tems and possibly other important invariants and symmetries. Their excel-
lent properties have made them popular and have been exhaustively ana-
lyzed [18]. One well-known result is that, even in conservative problems,
symplectic integrators cannot preserve the total energy [19] unless the time
step size is added to the unknowns of the problem [20]. This fact is exploited
in MCMC methods: when the symplectic integration yields a proposal with
a large energy error, this sample is rejected. A delicate balance is then
sought: if the time integration of the MCMC method is performed for a
small period of time or with a small time step, the proposal will most likely
be accepted, albeit at great computational cost and time to traverse the
characteristic set; conversely, if the time integration is performed for a
long time or with a large time step, the method can potentially explore
larger regions of the state space, albeit at the risk of proposing a sample
that is likely to be rejected. Thus, the challenge in HMC is to use fast,
structure-preserving integrators capable of generating samples that
efficiently explore the characteristic set while incurring energy errors
small enough to avoid rejection.
The standard integrator for HMC is the leapfrog method [21]. Leapfrog is a
second-order accurate, explicit, symplectic time integration scheme widely
employed, e.g., in molecular dynamics [22]. Being explicit, the method is
conditionally stable but has very low computational cost. In addition, as a
result of its symplecticity it exactly preserves a shadow Hamiltonian [23],
which limits the extent of energy drift from the exact value.
In this work, we propose a new class of MCMC methods of the HMC type, which
we refer to as energy-stepping Monte Carlo (ESMC) methods, where the
symplectic integrator (e.g., leapfrog) is replaced by an energy-stepping
integrator [24, 25]. These integrators are designed for Hamiltonian problems
and possess remarkable properties such as symplecticity, unconditional
stability, exact conservation of energy, and preservation of all the
symmetries of the Lagrangian, whether explicitly known or not. Energy-
stepping integrators are essentially explicit, requiring the solution of one
scalar equation per integration step, irrespective of the dimension of the
system. Given their remarkable properties and competitive cost, these
integrators suggest themselves as excellent candidates to replace other
symplectic integrators in HMC: they can be expected to explore the
characteristic set as efficiently as other symplectic integrators but,
remarkably, propose samples with absolutely no rejections. For an almost
negligible increment in computational cost per step, the 100% acceptance rate
of ESMC has the potential to significantly improve the performance of HMC.
The article is structured as follows. In Section 2, we review the concepts
of MCMC methods, and provide the framework for ESMC. The Hamiltonian
Monte Carlo method is presented in Section 3, and the role played by the
time integration step is carefully stressed. Next, Section 4 describes the
energy-stepping time integration scheme, without any reference to Monte
Carlo methods but, rather, as a method designed to integrate Hamiltonian
mechanics. Then, in Section 5, we introduce the main contribution of this
work, namely, the ESMC method. Numerical simulations that illustrate its
performance and the comparison with other standard MCMC methods are
presented in Section 6. We emphasize that the numerical tests presented in
this work focus on exemplifying the fundamental properties of ESMC and
are not intended to be representative of production-ready codes or libraries
such as STAN and NIMBLE (e. g. [26]), which are the result of extensive
development and fine tuning. Finally, Section 7 closes the article with a
summary of the main findings and outlook for further work.
2. Markov chain Monte Carlo methods
By way of background, we start by briefly reviewing the fundamentals of
Markov chains and their link with sampling methods. In many applications
of statistics, it is necessary to evaluate expectations relative to probabil-
ity density functions that are complex and, therefore, impossible to obtain
analytically. These situations appear, for instance, when doing inference in
Bayesian models or, simply, when predicting the output of random processes.
In view of the difficulty in obtaining closed-form expectations, numerical ap-
proximation is required.
The standard procedure is as follows. Let \Omega be an n-dimensional sample
space, not necessarily compact, and \mu a Lebesgue-continuous probability
measure

(2.1)    \mu(B) = \int_B \pi(x) \, \mathrm{d}\Omega,

for all Lebesgue-measurable sets B \subseteq \Omega and integrable
probability density function \pi : \Omega \to \mathbb{R}_+. The main problem
is to calculate

(2.2)    \mathbb{E}_\mu[f] = \int_\Omega f(x) \, \pi(x) \, \mathrm{d}\Omega,

for all bounded continuous functions f. In order to approximate integral
(2.2), assume that we have a collection \{x_k\}_{k=1}^N of independent,
identically distributed samples with probability \pi. Then, the corresponding
empirical approximation of the expectation is

(2.3)    \mathbb{E}_\mu[f] \approx S_N := \frac{1}{N} \sum_{k=1}^N f(x_k).

By the weak-* density of Diracs in the space of probability measures,
possibly under moment constraints (e.g., [27, 28]), it is possible to choose
(unbiased) sequences \{x_k\}_{k=1}^N such that

(2.4)    \lim_{N \to \infty} S_N = \mathbb{E}_\mu[f],

for every bounded continuous function f.
Evidently, the key to calculating converging empirical expectations of the
form (2.3) efficiently is the construction of the sample array
\{x_k\}_{k=1}^N. Naive methods to generate samples using, for example,
rejection sampling are extremely costly, especially in high-dimensional
sample spaces [12].
The standard workhorse for this task is the Markov chain Monte Carlo
method (MCMC) and its variants [12]. MCMC methods are designed to
generate sequences of samples distributed according to the target probabil-
ity and spending most of the computational effort in those regions of high
probability density. To this end, special random walks in sample space are
generated with stochastic rules determining whether some of these states
should be discarded or not.
2.1. Markov chains. We summarize the basic concepts of Markov chains that
are required to define MCMC methods and we refer to specialized monographs
for additional results and proofs (e.g., [10]).
A (discrete) Markov chain is a finite sequence \{q_k\}_{k=1}^N of random
variables that possesses the Markov property: the conditional probability of
q_{k+1} given all past values q_k, q_{k-1}, \ldots, q_1, q_0 is actually a
function of q_k only. This probability is known as the transition kernel K
and we write

(2.5)    q_{k+1} \mid q_k, q_{k-1}, \ldots, q_1, q_0 \sim q_{k+1} \mid q_k
         \sim K(q_k, q_{k+1}).
The Markov chains of interest for MCMC methods are those that are irreducible
and possess a stationary distribution. The first property, irreducibility,
requires that, given any initial value q_0 and an arbitrary non-empty set,
the probability that the Markov chain will at some time generate a state
belonging to the set is greater than zero. The second property, the existence
of a stationary distribution, demands that there exists a probability density
function \pi that is preserved by the transition kernel, i.e., if
q_k \sim \pi then q_{k+1} \sim \pi, or, equivalently,

(2.6)    \int K(x, y) \, \pi(x) \, \mathrm{d}x = \pi(y).

A necessary condition for a Markov chain to be stationary is that it is
irreducible.
In a recurrent Markov chain, the stationary distribution \pi is limiting,
i.e., for almost any initial sample q_0 the sample q_k is distributed as \pi
for large enough k. This property, also referred to as ergodicity, is
exploited in the formulation of MCMC methods: the search for samples
distributed according to \pi aims at building an ergodic Markov chain with a
transition kernel K and stationary probability identical to \pi. If these
properties are attained, the values of the chain will eventually be
distributed as unbiased samples from \pi.
A stationary Markov chain is reversible if K(x, y) = K(y, x). Moreover, a
Markov chain satisfies the detailed balance condition if there exists a
probability \pi such that

(2.7)    K(y, x) \, \pi(y) = K(x, y) \, \pi(x).

If the transition kernel of a Markov chain verifies the detailed balance
condition with function \pi, then the latter is a stationary probability
associated with the transition kernel K and the chain is reversible. While
this condition is sufficient but not necessary for the two properties to
hold, detailed balance is easy to verify and is often employed in practice to
analyze MCMC methods.
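The detailed balance condition (2.7) is easy to check numerically for a
finite-state chain. The following sketch is a hypothetical three-state
example of our own: it builds a Metropolis kernel for a target pi and
verifies both detailed balance and the stationarity relation (2.6).

```python
# Hypothetical 3-state target distribution (ours, for illustration only).
pi = [0.5, 0.3, 0.2]
n = len(pi)

# Symmetric proposal: jump to either of the other two states with prob. 1/2.
prop = [[0.0 if i == j else 0.5 for j in range(n)] for i in range(n)]

# Metropolis kernel: accept a proposed move with probability min(1, pi_y/pi_x).
K = [[prop[i][j] * min(1.0, pi[j] / pi[i]) for j in range(n)] for i in range(n)]
for i in range(n):  # rejected proposals leave the chain in place
    K[i][i] = 1.0 - sum(K[i][j] for j in range(n) if j != i)

# Detailed balance (2.7): K(y,x) pi(y) = K(x,y) pi(x) for all pairs.
balanced = all(abs(pi[i] * K[i][j] - pi[j] * K[j][i]) < 1e-12
               for i in range(n) for j in range(n))
# Stationarity (2.6): sum_x pi(x) K(x,y) = pi(y) for all y.
stationary = all(abs(sum(pi[i] * K[i][j] for i in range(n)) - pi[j]) < 1e-12
                 for j in range(n))
```

Both checks pass, illustrating that detailed balance of the Metropolis
kernel implies stationarity of pi, as stated in the text.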
2.2. Markov chains for Monte Carlo methods. The key idea behind MCMC methods
is that independent, identically distributed unbiased samples with
probability distribution \pi can be effectively obtained simply by collecting
the states in an ergodic Markov chain defined by a transition kernel K
designed to verify the detailed balance condition (2.7) with probability \pi.
However, this strategy leaves considerable freedom, since one can sample from
\pi in many different ways, depending on the Markov chain employed.
Metropolis-Hastings MCMC, the most common type of MCMC method, employs a
proposal or instrumental probability distribution p to define the transition
kernel. Specifically, given a chain with current value q_k, the next state is
obtained by a two-step procedure: first, a random sample \tilde{q} is
generated with probability density p(\tilde{q} \mid q_k). Then, the state
q_{k+1} is selected to be equal to \tilde{q} with probability
\rho(q_k, \tilde{q}), or equal to the previous state q_k with probability
1 - \rho(q_k, \tilde{q}), where

(2.8)    \rho(q_k, \tilde{q}) = \min\left\{
         \frac{\pi(\tilde{q})}{\pi(q_k)} \,
         \frac{p(q_k \mid \tilde{q})}{p(\tilde{q} \mid q_k)}, \; 1 \right\}.
This transition map can be shown to satisfy the detailed balance condition
for all probabilities p and \pi (cf. [10]). It bears emphasis that in the
Metropolis-Hastings algorithm, once a proposal \tilde{q} is generated, it may
be rejected. Thus, the ratio of accepted to rejected proposal samples can
dramatically affect the ability of the method to explore the characteristic
set of the target probability distribution \pi, as well as its computational
cost [17]. This is a delicate tradeoff for which there exist heuristic rules:
too small an acceptance ratio leads to an inefficient sampling strategy,
whereas too large a one might indicate that the Markov chain is covering the
characteristic set of \pi too slowly.
The simplest variant of MCMC, commonly known as random walk MCMC, employs a
proposal distribution that is a Gaussian centered at the current value of the
Markov chain. This proposal is obviously symmetric, and its variance can be
selected to optimize the acceptance ratio. Other proposals, such as the Gamma
or Student-t distributions, may be exploited to bias the new states of the
Markov chain away from previously explored regions.
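A minimal sketch of random-walk Metropolis for a one-dimensional target (our
own function names, not from the paper): because the Gaussian proposal is
symmetric, the Hastings factor in (2.8) cancels and the acceptance
probability reduces to min(1, pi(q_tilde)/pi(q_k)).

```python
import math
import random

def rw_metropolis(log_pi, q0, sigma, n_samples):
    """Random-walk Metropolis: Gaussian proposal centered at the current
    state; acceptance test uses log densities for numerical safety."""
    q, chain, accepted = q0, [], 0
    for _ in range(n_samples):
        q_tilde = q + random.gauss(0.0, sigma)          # proposal step
        if random.random() < math.exp(min(0.0, log_pi(q_tilde) - log_pi(q))):
            q, accepted = q_tilde, accepted + 1         # accept
        chain.append(q)                                 # on rejection keep q_k
    return chain, accepted / n_samples

random.seed(1)
# Target: standard normal, log pi(q) = -q^2/2 up to an additive constant.
chain, acc = rw_metropolis(lambda q: -0.5 * q * q, 0.0, 2.5, 50_000)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
```

The chain's empirical mean and variance approach 0 and 1, while the
acceptance ratio (well below 100%) illustrates the tradeoff discussed above.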
3. Hamiltonian Monte Carlo methods
Hamiltonian Monte Carlo (HMC) methods are yet another family of algorithms
designed to sample complex target probability distributions. They have proven
advantageous over other, more traditional, MCMC methods, especially for
high-dimensional sample spaces such as appear in problems of statistical
mechanics. Originally introduced in the context of molecular dynamics and
dubbed the hybrid Monte Carlo method [15], this family of algorithms exploits
ideas and numerical methods used in the study of dynamical systems, and more
specifically, Hamiltonian problems on symplectic or Poisson spaces [16].
Similarly to Metropolis-Hastings MCMC, HMC methods generate Markov chains of
independent samples distributed according to a target probability
distribution, but differ from it in that they take advantage of the
geometrical information provided by the gradient of the target probability
density. By interpreting the state of the Markov chain as a particle moving
conservatively under the action of a fictitious potential defined by the
target probability, HMC methods generate chains that have been shown to
efficiently explore the regions of high probability density, even when highly
concentrated [17]. Thus, HMC methods essentially replace the proposal step of
any MCMC method but keep the accept/reject step unchanged, albeit in a form
more conveniently expressed in terms of the surrogate energy.
To describe the HMC approach, let q denote as before the random variable for
which a target probability with density \pi is known. The goal of HMC is to
construct a collection \{q_k\}_{k=1}^N of independent samples identically
distributed according to \pi. Let us assume that

(3.1)    \pi(q) = \frac{1}{Z_q} \, e^{-V(q)},

where we refer to V : \Omega \to \mathbb{R} as the potential function and Z_q
is a normalizing factor. We additionally introduce an ancillary random
variable p : \Omega \to \mathbb{R}^n and postulate that the two random
variables q and p, defining the coordinate z = (q, p) in phase space, satisfy

(3.2)    (q, p) \sim \pi(q, p) = \pi(p \mid q) \, \pi(q),

with

(3.3)    \pi(p \mid q) = \frac{1}{Z_p} \, e^{-K(q, p)}

and K : \Omega \times \mathbb{R}^n \to \mathbb{R}_+ referred to as the
kinetic energy. Then, the joint distribution (3.2) follows as

(3.4)    \pi(q, p) = \frac{1}{Z_q Z_p} \, e^{-V(q) - K(q, p)}
                   = \frac{1}{Z_H} \, e^{-H(q, p)},

where the normalizing constant is now Z_H = Z_p Z_q and

(3.5)    H(q, p) = V(q) + K(q, p)

is the Hamiltonian. Note that the marginal distribution of the pair (q, p)
with respect to p satisfies

(3.6)    \int_{\mathbb{R}^n} \pi(q, p) \, \mathrm{d}p
         = \int_{\mathbb{R}^n} \pi(p \mid q) \, \pi(q) \, \mathrm{d}p
         = \pi(q).

Hence, if a sequence \{(q_k, p_k)\}_{k=1}^N is sampled with distribution
\pi(q, p), the submersion (q_k, p_k) \mapsto q_k is distributed with
probability \pi(q).
3.1. Algorithmic details. Similarly to other MCMC methods, HMC proceeds
recursively, in the process generating a Markov chain of states. On each
iteration, a state is proposed and then either accepted or rejected.
Specifically, let q_k be the last accepted state of the chain. Then, to
compute the next state, HMC methods calculate the following:

(1) In the first step, a random value for the momentum p_k is sampled with
probability distribution \pi(p_k \mid q_k) as defined in Eq. (3.3). To that
end, a kinetic potential is selected. A simple choice is

(3.7)    K(q, p) = \frac{1}{2} \, p \cdot M^{-1} p + \log |M|,

with M a constant metric. Other more sophisticated kinetic energies make use
of metrics M that depend on the configuration in an attempt to better capture
geometrical detail, but such extensions are not considered here.

(2) In the second step, the pair z_k = (q_k, p_k) evolves under a Hamiltonian
flow

(3.8)    \dot{q}(t) = \frac{\partial H}{\partial p}(q(t), p(t)), \qquad
         \dot{p}(t) = -\frac{\partial H}{\partial q}(q(t), p(t)),

for a time interval t \in [0, T], and with initial conditions

(3.9)    q(0) = q_k, \qquad p(0) = p_k.

Since the potential may be an arbitrary function, an exact solution to this
evolution problem is not available in general and a time integrator, such as
leapfrog, needs to be used instead. A delicate tradeoff concerns the choice
of the length T of the integration interval and its relationship with the
time step size required for stability.

(3) The proposal state \tilde{z} = (\tilde{q}, \tilde{p}) = (q(T), p(T)) is
accepted with probability

(3.10)    \rho(z_k, \tilde{z}) = \min\left\{ 1,
          \exp\left[ H(z_k) - H(\tilde{z}) \right] \right\},

and rejected with probability 1 - \rho(z_k, \tilde{z}) otherwise. If
accepted, we set z_{k+1} = (q_{k+1}, p_{k+1}) = \tilde{z} and the state
q_{k+1} is added to the Markov chain.
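The three steps above can be sketched in a minimal one-dimensional
implementation. This is our own illustration, assuming an identity mass
matrix M = I (so the log|M| term in (3.7) is a constant and cancels in the
acceptance test); all function names are ours.

```python
import math
import random

def leapfrog(q, p, grad_V, dt, n_steps):
    """Leapfrog integration of Hamilton's equations (3.8) with M = I."""
    p -= 0.5 * dt * grad_V(q)               # initial half kick
    for _ in range(n_steps - 1):
        q += dt * p                         # drift
        p -= dt * grad_V(q)                 # full kick
    q += dt * p                             # final drift
    p -= 0.5 * dt * grad_V(q)               # final half kick
    return q, p

def hmc_step(q, V, grad_V, dt, n_steps):
    """One HMC transition: momentum refresh (step 1), Hamiltonian flow
    (step 2), and the accept/reject test (3.10) (step 3)."""
    p = random.gauss(0.0, 1.0)              # K(q, p) = p^2 / 2
    h0 = V(q) + 0.5 * p * p
    q_new, p_new = leapfrog(q, p, grad_V, dt, n_steps)
    h1 = V(q_new) + 0.5 * p_new * p_new
    if random.random() < math.exp(min(0.0, h0 - h1)):
        return q_new, True
    return q, False

random.seed(2)
V = lambda q: 0.5 * q * q                   # standard normal target
grad_V = lambda q: q
q, chain, accepted = 0.0, [], 0
for _ in range(20_000):
    q, ok = hmc_step(q, V, grad_V, dt=0.2, n_steps=10)
    chain.append(q)
    accepted += ok
```

For this smooth target, leapfrog's energy error is tiny and nearly all
proposals are accepted; a rougher potential or a larger time step would lower
the acceptance rate, which is exactly the tradeoff ESMC removes.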
4. Energy-stepping integrators
We proceed to review energy-stepping integrators, a paradigm that differs
fundamentally from classical integrators such as Runge-Kutta, multistep, or
one-leg methods [29, 30], but shares all the advantageous features of
symplectic methods as well as possessing some unique ones [24, 25].
We begin by focusing on systems characterized by Lagrangians
L : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R} of the form

(4.1)    L(q, \dot{q}) = \frac{1}{2} \dot{q}^T M \dot{q} - V(q),

where M is a mass matrix and V(q) is the potential energy function. The
classical motivation for systems of this type comes from celestial,
structural, solid, and molecular mechanics [31]. This framework fits within
HMC since the Lagrangian (4.1) is equivalent to the Hamiltonian (3.5) through
the Legendre transform

(4.2)    L(q, \dot{q}) = \sup_p \left( \dot{q} \cdot p - H(q, p) \right).
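For the quadratic kinetic energy of Section 3, the supremum in (4.2) can be
evaluated in closed form and recovers (4.1); the following short verification
is ours, not part of the original text.

```latex
% For H(q,p) = \tfrac12 p^T M^{-1} p + V(q), stationarity of
% \dot q \cdot p - H(q,p) in p gives \dot q = M^{-1} p, i.e. p = M \dot q.
% Substituting back:
L(q,\dot q) = \sup_p \bigl( \dot q \cdot p - H(q,p) \bigr)
            = \dot q \cdot M \dot q - \tfrac{1}{2} \dot q^T M \dot q - V(q)
            = \tfrac{1}{2} \dot q^T M \dot q - V(q),
% which is exactly the Lagrangian (4.1).
```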
The trajectories of the Lagrangian (4.1) render stationary Hamilton's action

(4.3)    I = \int L(q(t), \dot{q}(t)) \, \mathrm{d}t.

Since such trajectories are not easy to find, the classical approximation
paradigm is to discretize the action in time, leading to variational time
integrators.
The energy-stepping paradigm is at variance with time-stepping in that it
employs exact solutions of an approximate Lagrangian that can be solved
exactly. For Lagrangians of the form (4.1), [24, 25] propose the approximate
Lagrangians

(4.4)    L_h(q, \dot{q}) = \frac{1}{2} \dot{q}^T M \dot{q} - V_h(q),

where V_h is some exactly solvable approximation of the potential energy. The
energy-stepping method specifically considers piecewise constant
approximations of the potential energy, i.e., terraced approximations V_h of
constant energy-step height h defined as

(4.5)    V_h(q) = h \left\lfloor h^{-1} V(q) \right\rfloor,

where \lfloor \cdot \rfloor is the floor function, i.e.,
\lfloor x \rfloor = \max\{ n \in \mathbb{Z} : n \le x \}. Based on this
definition, V_h is the largest piecewise-constant function with values in
h\mathbb{Z} majorized by V.
An approximating sequence of potential energies, and by extension La-
grangians, can be built by selecting energy steps of decreasing height. Other
types of approximations, such as piecewise linear interpolations of the po-
tential energy, also result in exactly integrable approximating systems [24]
but will not be considered here for definiteness. See Fig. 4.1 for an illus-
tration of a piecewise constant and a piecewise linear approximation of the
Kepler potential.
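As a quick illustration (our own minimal sketch; the names `terraced`, `V`,
and `V_h` are ours), the terraced approximation (4.5) is a one-liner using
the floor function:

```python
import math

def terraced(V, h):
    """Terraced approximation (4.5): V_h(q) = h * floor(V(q) / h), the
    largest piecewise-constant function with values in h*Z majorized by V."""
    return lambda q: h * math.floor(V(q) / h)

V = lambda q: 0.5 * q * q        # example potential (ours)
V_h = terraced(V, 0.1)           # energy-step height h = 0.1
```

By construction V_h(q) <= V(q) everywhere, and V_h changes only in jumps of
height h, which is what makes the trajectories of (4.4) piecewise rectilinear.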
4.1. Computation of the exact trajectories of the approximating Lagrangian.
Following [25], we describe next the calculation of the exact trajectories
generated by the terraced Lagrangians L_h(q, \dot{q}).
Suppose that a mechanical system is in configuration q_0 at time t_0, in
configuration q_2 at time t_2, and that during the time interval [t_0, t_2]
it intersects one single jump surface \Gamma_1 separating two regions of
constant energies V_0 and V_2 (see Figure 4.2). Based on the form of V_h,
\Gamma_1 is the level surface V = V_2 for an uphill step, V_2 = V_0 + h, or
the level surface V = V_0 for a downhill step, V_2 = V_0 - h. For simplicity,
we shall further assume that V is differentiable and that all energy-level
crossings are transversal, i.e.,

(4.6)    n(q_1) \cdot \dot{q}_1^- \neq 0,

where \dot{q}_1^- = \dot{q}(t_1^-) and n(q_1) is a vector normal to \Gamma_1
pointing in the direction of advance.
Figure 4.1.
Kepler problem. Exact, piecewise constant and
piecewise linear continuous approximate potential energies.
Under these assumptions, the action integral (4.3) over the time interval
[t_0, t_2] follows as

(4.7)    I_h = \int_{t_0}^{t_2} L_h(q, \dot{q}) \, \mathrm{d}t
             = \int_{t_0}^{t_1} L_h(q, \dot{q}) \, \mathrm{d}t
             + \int_{t_1}^{t_2} L_h(q, \dot{q}) \, \mathrm{d}t,

where t_1 is the time at which the trajectory intersects \Gamma_1. In regions
where V_h(q) is constant the trajectory q(t) is linear in time. Therefore,
the action of the system can be computed exactly and reduces to

(4.8)    I_h = (t_1 - t_0) \left( \frac{1}{2}
         \left( \frac{q_1 - q_0}{t_1 - t_0} \right)^T M
         \left( \frac{q_1 - q_0}{t_1 - t_0} \right) - V_0 \right)
         + (t_2 - t_1) \left( \frac{1}{2}
         \left( \frac{q_2 - q_1}{t_2 - t_1} \right)^T M
         \left( \frac{q_2 - q_1}{t_2 - t_1} \right) - V_2 \right),
Figure 4.2. Trajectories of a system with a piecewise constant potential
energy. Left: uphill diffraction step; Center: uphill reflection step;
Right: downhill diffraction step.
where q_1 = q(t_1) is constrained to be on the jump surface \Gamma_1.
Assuming differentiability of \Gamma_1, stationarity of the action I_h with
respect to (t_1, q_1) additionally gives the energy conservation equation

(4.9)    \left( \frac{q_1 - q_0}{t_1 - t_0} \right)^T M
         \left( \frac{q_1 - q_0}{t_1 - t_0} \right) + 2 V_0
         = \left( \frac{q_2 - q_1}{t_2 - t_1} \right)^T M
         \left( \frac{q_2 - q_1}{t_2 - t_1} \right) + 2 V_2,

and the linear momentum balance equation

(4.10)    M \, \frac{q_1 - q_0}{t_1 - t_0}
          - M \, \frac{q_2 - q_1}{t_2 - t_1} + \lambda n(q_1) = 0,

where \lambda is a Lagrange multiplier.
In order to make contact with time-integration schemes, we reformulate the
problem slightly by assuming that t_0 and q_0 (the latter on a jump surface
\Gamma_0 except, possibly, at the initial time) and the initial velocity

(4.11)    \dot{q}_0^+ = \dot{q}(t_0^+) = \frac{q_1 - q_0}{t_1 - t_0}

are known. Let t_1 and q_1 be the time and point at which the trajectory
intersects the next jump surface \Gamma_1. We then seek to determine

(4.12)    \dot{q}_1^+ = \dot{q}(t_1^+).

A reformulation of Eqs. (4.9) and (4.10) in terms of \dot{q}_1^+ gives

(4.13)    \left( \dot{q}_1^+ \right)^T M \dot{q}_1^+
          = \left( \dot{q}_1^- \right)^T M \dot{q}_1^- - 2 \Delta V,

(4.14)    \dot{q}_1^+ = \dot{q}_1^- + \lambda M^{-1} n(q_1),

where \dot{q}_1^- = \dot{q}_0^+ and the potential energy jump is
\Delta V = V_h\left( q(t_1^+) \right) - V_h\left( q(t_1^-) \right). Next, we
proceed to examine the various alternatives that can arise in the solution of
(4.13) and (4.14).
4.1.1. Diffraction by downhill energy step. Suppose that \Delta V = -h, i.e.,
the system decreases its potential energy as the trajectory crosses
\Gamma_1. Then (4.13) becomes

(4.15)    \left( \dot{q}_1^+ \right)^T M \dot{q}_1^+
          = \left( \dot{q}_1^- \right)^T M \dot{q}_1^- + 2h.

In this case, the system of equations (4.14)-(4.15) has a real solution

(4.16)    \dot{q}_1^+ = \dot{q}_1^- +
          \frac{ - \dot{q}_1^- \cdot n_1
          + \sqrt{ \left( \dot{q}_1^- \cdot n_1 \right)^2
          + 2 h \, n_1^T M^{-1} n_1 } }{ n_1^T M^{-1} n_1 } \, M^{-1} n_1,

with n_1 = n(q_1). This solution represents the diffraction, or change of
direction, of the trajectory by a downhill energy step.
4.1.2. Diffraction by uphill energy step. Suppose now that \Delta V = h,
i.e., the system increases its potential energy as the trajectory crosses
\Gamma_1. Then, (4.13) becomes

(4.17)    \left( \dot{q}_1^+ \right)^T M \dot{q}_1^+
          = \left( \dot{q}_1^- \right)^T M \dot{q}_1^- - 2h.

Additionally, suppose that

(4.18)    \left( \dot{q}_1^- \cdot n_1 \right)^2 > 2 h \, n_1^T M^{-1} n_1.

Then, Eqs. (4.14) and (4.17) again have a real solution, namely,

(4.19)    \dot{q}_1^+ = \dot{q}_1^- +
          \frac{ - \dot{q}_1^- \cdot n_1
          + \sqrt{ \left( \dot{q}_1^- \cdot n_1 \right)^2
          - 2 h \, n_1^T M^{-1} n_1 } }{ n_1^T M^{-1} n_1 } \, M^{-1} n_1.

This solution represents the diffraction of the trajectory by an uphill
energy step when the system has sufficient initial energy to overcome the
energy barrier.
4.1.3. Reflection by uphill energy step. Suppose now that \Delta V = h,
i.e., the system increases its potential energy as the trajectory crosses
\Gamma_1, but, contrary to the preceding case,

(4.20)    \left( \dot{q}_1^- \cdot n_1 \right)^2 < 2 h \, n_1^T M^{-1} n_1.

Then, the system (4.14)-(4.17) has no real solutions, indicating that the
mechanical system does not have sufficient energy to overcome the energy
barrier. Instead, the trajectory remains within the same potential energy
level and equation (4.13) becomes

(4.21)    \left( \dot{q}_1^+ \right)^T M \dot{q}_1^+
          = \left( \dot{q}_1^- \right)^T M \dot{q}_1^-.

In this situation, the system of equations (4.14)-(4.21) has the solution

(4.22)    \dot{q}_1^+ = \dot{q}_1^- -
          \frac{ 2 \, \dot{q}_1^- \cdot n_1 }{ n_1^T M^{-1} n_1 }
          \, M^{-1} n_1.

This solution represents the reflection of the trajectory by an uphill
potential energy step when the system does not have sufficient initial energy
to overcome the energy barrier.
4.2.
Summary of the energy-stepping scheme.
We proceed to summa-
rize the relations obtained in the foregoing and define the
energy-stepping
integrator resulting from a piecewise-constant approximation of the poten-
tial energy.
Suppose that $(t_k, q_k, \dot q_k^{+})$ and a piecewise-constant approximation of the potential energy $V_h$ are given. Let $t_{k+1}$ and $q_{k+1}$ be the time and point of exit of the rectilinear trajectory $q_k + (t - t_k)\,\dot q_k^{+}$ from the set $\{V \in h\mathbb{Z}\}$. Let $\Delta V$ be the jump of the potential energy at $q_{k+1}$ in the direction of advance. Then, the updated velocity is
(4.23) $\dot q_{k+1}^{+} = \dot q_k^{+} + \lambda_{k+1}\, M^{-1} n_{k+1},$
where $n_{k+1} = n(q_{k+1})$ and
(4.24)
\[
\lambda_{k+1} =
\begin{cases}
-\dfrac{2\,\dot q_k^{+}\cdot n_{k+1}}{n_{k+1}^{\mathsf T} M^{-1} n_{k+1}}, & \text{if } \big(\dot q_k^{+}\cdot n_{k+1}\big)^2 < 2\Delta V \big(n_{k+1}^{\mathsf T} M^{-1} n_{k+1}\big),\\[1.5ex]
\dfrac{-\dot q_k^{+}\cdot n_{k+1} + \operatorname{sign}(\Delta V)\sqrt{\big(\dot q_k^{+}\cdot n_{k+1}\big)^2 - 2\Delta V\big(n_{k+1}^{\mathsf T} M^{-1} n_{k+1}\big)}}{n_{k+1}^{\mathsf T} M^{-1} n_{k+1}}, & \text{otherwise.}
\end{cases}
\]
These relations define a discrete propagator
(4.25) $\Phi_h : (t_k, q_k, \dot q_k^{+}) \mapsto (t_{k+1}, q_{k+1}, \dot q_{k+1}^{+})$
that can be applied recursively to generate a discrete trajectory.
Algorithm 1 Energy-stepping scheme
Require: $V(q)$, $q_0$, $\dot q_0$, $t_0$, $t_f$ and the energy step $h$
1: $i \leftarrow 0$
2: while $t_i < t_f$ do
3:   $t_{i+1} \leftarrow$ Smallest-Root$\big(V(q_i + (t_{i+1} - t_i)\,\dot q_i) - V(q_i) + \Delta V = 0\big)$
4:   $q_{i+1} \leftarrow q_i + (t_{i+1} - t_i)\,\dot q_i$
5:   $n_{i+1} \leftarrow \nabla V(q_{i+1})$
6:   $\dot q_{i+1} \leftarrow$ Update-Velocities$(\dot q_i, n_{i+1}, h)$
7:   $i \leftarrow i + 1$
8: end while
9: if $t_i > t_f$ then
10:  $q_i \leftarrow \Big(1 - \dfrac{t_f - t_{i-1}}{t_i - t_{i-1}}\Big)\, q_{i-1} + \dfrac{t_f - t_{i-1}}{t_i - t_{i-1}}\, q_i$
11:  $t_i \leftarrow t_f$
12: end if
The computational workflow of the energy-stepping scheme is summarized in Algorithm 1. The algorithm combines two methods. The first method, Smallest-Root, determines the root $t_{i+1} > t_i$ of the equation
(4.26) $V(q_i + (t_{i+1} - t_i)\,\dot q_i) - V(q_i) + \Delta V = 0,$
where $\Delta V$ can take values in $\{0, h\}$. The second method, Update-Velocities, updates the velocities and reduces to only two situations. The method is defined in Algorithm 2.
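Smallest-Root can be realized with any safeguarded scalar root finder. A minimal sketch (our own; the marching increment `dt`, the `skip` offset, and the tolerances are illustrative choices, not prescribed by the paper) brackets the first sign change of the residual and then bisects:

```python
def smallest_root(f, t0, dt=1e-3, skip=1e-9, t_max=1e3, tol=1e-12):
    """Return the smallest root t > t0 of the scalar equation f(t) = 0.

    March forward in increments dt until f changes sign, then bisect.
    `skip` nudges the start slightly past t0 so that a root exactly at
    the entry point (the current jump surface) is not returned again.
    """
    a = t0 + skip
    fa = f(a)
    while a < t_max:
        b = a + dt
        fb = f(b)
        if fa * fb <= 0.0:            # sign change: root bracketed in [a, b]
            while b - a > tol:
                m = 0.5 * (a + b)
                fm = f(m)
                if fa * fm <= 0.0:
                    b = m
                else:
                    a, fa = m, fm
            return 0.5 * (a + b)
        a, fa = b, fb
    raise RuntimeError("no sign change found before t_max")

# Example: V(q) = q^2, exit of the ray q(t) = 1 + t from the level
# V = V(1) + 0.5, i.e. the root of (1 + t)^2 - 1 - 0.5 = 0.
t_exit = smallest_root(lambda t: (1.0 + t) ** 2 - 1.0 - 0.5, 0.0)
# t_exit ~ sqrt(1.5) - 1 ~ 0.2247
```

In practice a production code would replace the crude marching phase with a problem-specific bracketing strategy, since the marching cost dominates when terraces are wide.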
Algorithm 2 Update-Velocities$(\dot q_0, n_1, h)$
1: if $\dot q_0 \cdot n_1 \ge 0$ then
2:   $\Delta V \leftarrow h$
3: else
4:   $\Delta V \leftarrow -h$
5: end if
6: if $(\dot q_0 \cdot n_1)^2 < 2\Delta V \big(n_1^{\mathsf T} M^{-1} n_1\big)$ then
7:   $\lambda \leftarrow -\dfrac{2\,\dot q_0 \cdot n_1}{n_1^{\mathsf T} M^{-1} n_1}$
8: else
9:   $\lambda \leftarrow \dfrac{-\dot q_0 \cdot n_1 + \operatorname{sign}(\Delta V)\sqrt{(\dot q_0 \cdot n_1)^2 - 2\Delta V\,(n_1^{\mathsf T} M^{-1} n_1)}}{n_1^{\mathsf T} M^{-1} n_1}$
10: end if
11: return $\dot q_0 + \lambda\, M^{-1} n_1$
Remarkably, the energy-stepping method does not require the solution of
a system of equations and, therefore, its complexity is comparable to that
of explicit methods. However, the need to compute the root of a nonlinear
scalar function per step adds somewhat to the overhead of the algorithm.
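For concreteness, Algorithm 2 can be transcribed almost line by line. The sketch below (illustrative Python with NumPy; variable names are ours) implements the velocity update for a general mass matrix:

```python
import numpy as np

def update_velocities(qdot0, n1, h, Minv):
    """Algorithm 2: velocity jump across an energy step of height h.

    qdot0: incoming velocity; n1: gradient of V at the crossing point;
    Minv: inverse mass matrix.  Returns the outgoing velocity.
    """
    a = qdot0 @ n1                  # qdot0 . n1
    b = n1 @ Minv @ n1              # n1^T M^-1 n1
    dV = h if a >= 0.0 else -h      # uphill if moving along the gradient
    if a * a < 2.0 * dV * b:        # insufficient energy: reflection
        lam = -2.0 * a / b
    else:                           # diffraction (uphill or downhill)
        lam = (-a + np.sign(dV) * np.sqrt(a * a - 2.0 * dV * b)) / b
    return qdot0 + lam * (Minv @ n1)

# Quick check with arbitrary test values: an uphill diffraction removes
# exactly 2h from the kinetic form, while a reflection preserves it.
M = np.diag([1.0, 2.0, 0.5])
Minv = np.diag([1.0, 0.5, 2.0])
n1 = np.array([1.0, -1.0, 2.0])
qdot = np.array([2.0, 0.5, 1.0])
v1 = update_velocities(qdot, n1, 0.5, Minv)  # (qdot.n1)^2 = 12.25 > 2hb = 9.5
v2 = update_velocities(qdot, n1, 1.0, Minv)  # 12.25 < 2hb = 19: reflection
assert np.isclose(v1 @ M @ v1, qdot @ M @ qdot - 2 * 0.5)
assert np.isclose(v2 @ M @ v2, qdot @ M @ qdot)
```

Note that for a downhill crossing ($\Delta V = -h$) the reflection branch can never be taken, since then $2\Delta V\, n_1^{\mathsf T} M^{-1} n_1 < 0 \le (\dot q_0\cdot n_1)^2$; this is why the update reduces to only two situations.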
4.3. Properties of energy-stepping integrators. We summarize from [24, 25] the main properties of energy-stepping integrators.
The terraced approximation of the potential energy (4.5) preserves all the
symmetries of the system. By Noether’s theorem, and since the discrete tra-
jectories are exact trajectories of a Lagrangian system, the energy-stepping
method conserves all the momentum maps and the symplectic structure of
the original Lagrangian system. In particular, energy can be viewed as the
momentum map associated with the time-reparametrization of a Lagrangian
defined in space-time. The energy-stepping method must also preserve this
symmetry, and thus, it is exactly energy-conserving for all step heights.
In summary, the energy-stepping integrator combines the following prop-
erties:
(1) It exactly preserves the symplectic form.
(2) It exactly preserves all the momentum maps that originate from
symmetries of the Lagrangian, including the linear and angular mo-
menta, and the energy.
(3) It is time-reversible.
(4) Its solution converges, as $h \to 0$, to the trajectory that renders the exact action (4.3) stationary.
We refer to [25] for the complete proofs of these statements.
5.
Energy-stepping Monte Carlo
With reference to Section 3, the energy-stepping Monte Carlo (ESMC)
method can now be set forth as a
Hamiltonian Monte Carlo method that
uses the energy-stepping integrator
for generating the proposal distribution.
Equivalently, ESMC is a Markov chain Monte Carlo method that proposes
new states by integrating
exactly
the approximate Hamiltonian
(5.1) $H_h(q,p) = V_h(q) + \tfrac{1}{2}\, p^{\mathsf T} M^{-1} p,$
corresponding to the Legendre transform of the approximate Lagrangian (4.4).
In the same way that the energy-stepping integrator defines a Hamiltonian propagator $\Phi_h$ at every step, ESMC defines a Hamiltonian propagator $\Psi_T : \Omega \times \mathbb{R}^n \to \Omega \times \mathbb{R}^n$ that maps the $k$-th sample of the Markov chain and a random momentum to the $(k+1)$-th sample and a momentum that is discarded. This propagator is the composition map
(5.2) $\Psi_T = \Phi_h \circ \Phi_h \circ \cdots \circ \Phi_h,$
with as many updates $\Phi_h$ as steps in the integration and, therefore, it is indeed a Hamiltonian propagator. Given a state $q_k$, the energy-stepping Monte Carlo method advances $q_k$ by sampling a random momentum $p_k$ and defining $q_{k+1} = \psi_T(q_k)$, with
(5.3) $\psi_T(q_k) = (\Pi_1 \circ \Psi_T)(q_k, p_k),$
$\Pi_1$ being the projection onto the first phase coordinate, and with no rejection whatsoever.
The ESMC method, by construction, shares the advantageous properties of HMC. In addition, a remarkable benefit accrues from its zero rejection ratio: at the small computational cost of solving one scalar nonlinear equation per step, which is negligible in high-dimensional sample spaces, the effort expended in the proposal stage is never wasted.
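To make the construction concrete, the following self-contained one-dimensional sketch (our own illustration, not the authors' implementation) runs ESMC on the potential $V(q) = q^2/2$ with unit mass, so it samples approximately from a standard normal. Terrace crossings are located here by simple marching with a small increment `dt`, a simplification of this sketch; a faithful code would use the exact Smallest-Root construction. Every proposal is accepted.

```python
import math, random

def esmc_gaussian(n_samples, T=1.0, h=0.1, dt=1e-3, seed=1):
    """1-D energy-stepping Monte Carlo for V(q) = q^2/2 (unit mass).

    Between terrace crossings the motion is a free flight; at a crossing
    the velocity is updated by the energy-stepping jump rule.  Crossings
    are detected by marching with increment dt (a simplification of this
    sketch).  All proposals are accepted.
    """
    V, dV = (lambda q: 0.5 * q * q), (lambda q: q)
    rng = random.Random(seed)
    q, samples = 0.0, []
    for _ in range(n_samples):
        v = rng.gauss(0.0, 1.0)               # fresh random momentum
        t, level = 0.0, math.floor(V(q) / h)
        while t < T:
            q_new = q + v * dt                # free flight
            new_level = math.floor(V(q_new) / h)
            if new_level == level:
                q = q_new
            else:                             # crossed a terrace boundary
                n = dV(q_new)
                a, b = v * n, n * n
                jump = h if a >= 0.0 else -h
                if a * a < 2.0 * jump * b:    # reflection (1-D form of (4.22))
                    v = -v
                else:                         # diffraction
                    lam = (-a + math.copysign(1.0, jump)
                           * math.sqrt(a * a - 2.0 * jump * b)) / b
                    v += lam * n
                    q, level = q_new, new_level
            t += dt
        samples.append(q)                     # accepted with probability 1
    return samples

chain = esmc_gaussian(800)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
# mean ~ 0 and variance ~ 1, up to O(h) terracing bias and Monte Carlo error
```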
5.1.
Some remarks on the step size.
As explained in Section 4, the
energy-stepping method integrates Hamilton’s equations in time with a gran-
ularity dictated by the energy step
h
. This is in contrast with classical time
integration schemes, such as leapfrog, that use the time step size to control
the resolution of the incremental updates. In explicit methods, moreover,
this time step size is also bound by the Courant–Friedrichs–Lewy (CFL)
stability condition. In practical terms, the time step size and hence the
computational cost of the proposal stage are constrained by the variation
of the potential energy. Again, in contrast, the energy-stepping integrator
is exact for the terraced potential and thus stable for any choice of energy
step.
In HMC methods, the accuracy of the proposal step is controlled by the
length
T
of the dynamic excursion that every state undergoes once the
momentum is randomly selected (see Section 3.1). If this motion is to be
stably integrated by the leapfrog scheme, the time step size must be chosen
to be smaller than a certain global bound. Alternatively, this step size
can be adapted by estimating the curvature of the potential energy at every
integration step. By contrast, once the energy step size is selected for ESMC,
every proposal motion is integrated — without the need for any adaptivity
— in a number of steps that depends on the gradient of the potential energy.
Hence, regions with steep energy and, thus, steep probability density, are
integrated with higher resolution by default.
Owing to the different granularity control of the leapfrog and the energy-
stepping methods, it is not possible to compare directly their relative cost
and discretization error. However, for every integration interval of length $T$ an effective time step can be calculated for the energy-stepping integrator and used as a basis for comparison, namely,
(5.4) $\Delta t_e = T/i,$
where $i$ is the number of rectilinear trajectories in state space resolved by the algorithm in the interval $[0, T]$; see Section 4. However, it should be carefully
noted that the effective time step is an average measure for purposes of
comparison only, since the integration of a time interval might encompass
long and short rectilinear trajectories precisely due to the intrinsic adaptivity
of the method.
6.
Numerical examples
This section investigates the properties of ESMC and compares its per-
formance with other MCMC methods. The goal is to assess — by means of
numerical tests — the salient properties of the method. Some of these prop-
erties are expected to be comparable to other HMC methods but with the
added benefit of a 100% acceptance ratio. In addition, we explore the conse-
quences of the
exact
preservation of all symmetries by the energy-stepping
integrator. We select examples that range from one-dimensional functions, for which we have closed-form expressions, to high-dimensional cases with and without symmetries.
6.1. One-dimensional sampling.
As a first test, we compare ESMC with
standard MCMC methods: the random-walk MCMC — to be referred to as
RWMC — and HMC. We seek to sample from the bi-modal, one-dimensional
probability function
(6.1) $p(x) \propto \dfrac{3}{\sqrt{2\pi\,3^2}}\,\exp\!\left[-\dfrac{(x+2)^2}{18}\right] + \dfrac{1}{4\sqrt{2\pi}}\,\exp\!\left[-\dfrac{(x-4)^2}{2}\right].$
For RWMC we employ as proposal distribution a Gaussian with zero mean and unit variance. The HMC implementation employs a leapfrog symplectic integrator with a fixed time step of length 1 and a total integration time $T = 10$ per proposal. The ESMC method also solves for time integration periods of length $T = 10$ and uses a terrace height $h = 0.35$, which results in an average time step size $\Delta t_e = 0.85$, similar to the one employed in HMC.
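The target (6.1) is a mixture of a broad Gaussian centered at $x = -2$ (with $\sigma = 3$) and a narrow one centered at $x = 4$ (with $\sigma = 1$). A minimal evaluation of the unnormalized density, and of the potential $V(x) = -\log p(x)$ that the dynamical proposals integrate, might read as follows (illustrative only; the coefficients are those of (6.1)):

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def p(x):
    """Unnormalized bi-modal target density of Eq. (6.1)."""
    broad = 3.0 / (3.0 * SQRT_2PI) * math.exp(-(x + 2.0) ** 2 / 18.0)
    narrow = 1.0 / (4.0 * SQRT_2PI) * math.exp(-(x - 4.0) ** 2 / 2.0)
    return broad + narrow

def potential(x):
    """Potential energy seen by HMC/ESMC: V(x) = -log p(x)."""
    return -math.log(p(x))

# The broad mode at x = -2 carries more mass than the narrow one at x = 4:
# p(-2) ~ 0.399, p(4) ~ 0.154
```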
To sample the probability
p
, we use Markov chains of length 5000 and a
burn-in subset of size 500. For each of the compared methods, we generate