Transferability in Machine Learning for Electronic Structure
via the Molecular Orbital Basis
Matthew Welborn, Lixue Cheng, and Thomas F. Miller III
∗
Division of Chemistry and Chemical Engineering,
California Institute of Technology, Pasadena, CA 91125, USA
(Dated: July 30, 2018)
We present a machine learning (ML) method for predicting electronic structure correlation en-
ergies using Hartree-Fock input. The total correlation energy is expressed in terms of individual
and pair contributions from occupied molecular orbitals, and Gaussian process regression is used to
predict these contributions from a feature set that is based on molecular orbital properties, such as
Fock, Coulomb, and exchange matrix elements. With the aim of maximizing transferability across
chemical systems and compactness of the feature set, we avoid the usual specification of ML fea-
tures in terms of atom- or geometry-specific information, such atom/element-types, bond-types, or
local molecular structure. ML predictions of MP2 and CCSD energies are presented for a range of
systems, demonstrating that the method maintains accuracy while providing transferability both
within and across chemical families; this includes predictions for molecules with atom-types and el-
ements that are not included in the training set. The method holds promise both in its current form
and as a proof-of-principle for the use of ML in the design of generalized density-matrix functionals.
Keywords: machine learning, electron correlation, Gaussian processes, density-matrix functional theory
I. INTRODUCTION
Recent interest in the use of machine learning (ML) for
electronic structure has focused on models that are for-
mulated in terms of atom- and geometry-specific features,
such as atom-types and bonding connectivities. The ad-
vantage of this approach is that it can yield excellent
accuracy with computational cost that is comparable to
classical force fields.
1–16
However, a disadvantage of this
approach is that building a ML model to describe a di-
verse set of elements and chemistries requires training
with respect to a number of features that grows quickly
with the number of atom- and bond-types, and also re-
quires vast amounts of reference data for the selection and
training of those features; these issues have hindered the
degree of chemical transferability of existing ML mod-
els for electronic structure. For example, previous meth-
ods have not demonstrated predictions for molecules with
chemical elements that are not included in the training
data.
In this work, we focus on the more modest goal of us-
ing ML to describe the post-Hartree-Fock correlation en-
ergy. Assuming willingness to incur the cost of a Hartree-
Fock self-consistent field (SCF) calculation, we aim to
describe the correlation energy associated with pertur-
bation theory,
17
coupled-cluster theory,
18
or other post-
Hartree-Fock methods. Our approach focuses on training
not with respect to atom-based features, but instead us-
ing features based on the Hartree-Fock molecular orbitals
(MOs), which have no explicit dependence on the under-
lying atom-types and may thus be expected to provide
greater chemical transferability.
For a general post-Hartree-Fock electronic structure
method, the correlation energy may be expressed via Nes-
bet’s theorem as a sum over occupied MOs
19
E
c
=
occ
∑
ij
ij
.
(1)
Our strategy is to use ML to describe the diagonal and
off-diagonal contributions to this sum,
ii
=
d
(
f
i
)
and
ij
=
o
(
f
ij
)
,
(2)
respectively, where
f
i
is a vector of features associated
with the
i
th
occupied MO, and
f
ij
is a vector of features
associated with the
i,j
pair of occupied MOs. Employ-
ing this strategy in the representation of localized MOs
(LMOs), for which Eq. 1 also holds, leads to a ML model
that is compact with respect to the number of features
and that is both chemically accurate and encouragingly
transferable across chemical systems.
II. FEATURE DESIGN AND SELECTION
All ML features used in this study are elements of the
Fock matrix
F
, Coulomb matrix
J
, or exchange matrix
K
. With the aim of maximizing transferability of the fea-
tures, we represent the matrices in the LMO basis. Only
matrix elements associated with the subset of valence
occupied and virtual LMOs are included as ML features;
occupied core orbitals are excluded, as the post-Hartree-
Fock calculations employ the frozen core approximation,
and the valence virtual orbitals are defined by projection
onto a minimal basis (details in Sec. III).
20
For a given
i,j
pair of occupied LMOs, the total feature
vector
f
ij
is comprised of feature vectors associated with
elements of the Fock, Coulomb, and exchange matrices,
f
ij
=
(
f
(
F
)
ij
,
f
(
J
)
ij
,
f
(
K
)
ij
)
.
(3)
arXiv:1806.00133v2 [physics.chem-ph] 26 Jul 2018
2
These composite vectors involve matrix elements from
the occupied-occupied, occupied-virtual, and virtual-
virtual blocks of the matrices, such that
f
(
F
)
ij
=
(
F
ii
,F
ij
,F
jj
,
F
vv
ij
)
(4)
f
(
J
)
ij
=
(
J
ii
,J
ij
,J
jj
,
J
v
i
,
J
v
j
,
J
vv
ij
)
f
(
K
)
ij
=
(
K
ij
,
K
v
i
,
K
v
j
,
K
vv
ij
)
,
where terms are sorted with respect to
i
and
j
such that
F
ii
< F
jj
. This sorting guarantees that
ij
=
ji
. The
vectors
J
v
i
,
J
v
j
,
K
v
i
, and
K
v
j
include matrix elements as-
sociated with localized valence virtual orbitals (indexed
a,b,c,...
) such that
J
v
i
= (
J
ia
,J
ib
,J
ic
,...
)
(5)
K
v
i
= (
K
ia
,K
ib
,K
ic
,...
)
and likewise for
J
v
j
and
K
v
j
. The localized valence virtual
orbitals associated with the matrix elements in
J
v
i
and
K
v
i
are selected and sorted on the basis of having the
largest off-diagonal Coulomb matrix elements, such that
J
ia
> J
ib
> J
ic
, etc.; likewise for
J
v
j
and
K
v
j
. Note that
the valence virtual LMO associated with
J
ia
is the same
as for
K
ia
, but it need not be the same as that associated
with
J
ja
. Finally, the matrices
F
vv
ij
,
J
vv
ij
,
and
K
vv
ij
in Eq. 4
contain virtual-virtual matrix elements corresponding to
localized valence virtual orbitals that are selected and
sorted such that
J
ia
+
J
ja
> J
ib
+
J
jb
, etc.; only the upper
diagonal of these matrices comprise independent features
and are included. Because they appear in the cluster
amplitude equations of MP2 and CCSD, virtual-virtual
matrix elements are potentially informative features for
the prediction of pair correlation energies.
Appropriate sorting of the virtual LMOs was found to
be important for achieving transferability as the GP re-
gression is sensitive to permutations of elements within
the feature vector.
J
ia
acts as a proxy for spatial
distance. As dynamical electron correlation is “near-
sighted,”
21
the spatially closest valence virtual LMOs are
also likely to be most important to the pair correlation
energy. In a large system, the number of included va-
lence virtual LMOs must be limited, and sorting ensures
that these most important elements are included in the
feature vector. Any distance-based cutoff procedure is
subject to discontinuities in the energy if valence vir-
tual LMOs move in or out of the cutoff region. As in
local correlation methods, sufficiently large cutoffs must
be chosen to ensure that the energy surface is acceptably
smooth.
22
The near-sighted nature of dynamical corre-
lation also leads to the expectation that the number of
needed features based on valence virtual LMOs quickly
saturates with system size, which is confirmed in the re-
sults presented below.
The resulting features are invariant with respect to ro-
tation and translation of the system, invariant to rota-
tions among the occupied MOs that precede localization,
smooth with respect to molecular geometry, and unique
for each geometry – to the extent that the employed or-
bital localization method has these properties. In this
work, we employ the Intrinsic Bond Orbital method
which has been shown to yield unique LMOs which vary
smoothly with geometry.
23,24
By construction, the fea-
tures yield a model with sufficient flexibility to describe
dissociation into two closed-shell fragments. In the dis-
sociated limit, features corresponding to occupied pairs
with both
i
and
j
on one fragment contain no informa-
tion about the other fragment. For occupied pairs
i
and
j
that span fragments, by including dissociated fragments
in the training data, the ML model is trained to predict
that
ij
vanishes as features involving both
i
and
j
(e.g.
J
ij
) go to zero.
For each occupied LMO used to describe the diagonal
contributions to the correlation energy,
d
(
f
i
) in Eq. 2,
the total feature vector
f
i
is obtained by keeping only
the unique terms in
f
ii
.
III. CALCULATION DETAILS
All Hartree-Fock, second-order Møller-Plessett pertur-
bation theory (MP2),
17
and coupled-cluster with singles
and doubles (CCSD)
18
calculations are performed using
the
Molpro
2018.0 software package.
25
Unless otherwise
stated, calculations employ the cc-pVTZ basis set.
26
The
frozen-core approximation is employed for correlated cal-
culations.
Valence occupied and virtual LMOs are generated us-
ing the Intrinsic Bond Orbital method
23
with a local-
ization threshold of 10
−
12
; core orbitals are excluded
from localization. This method is detailed in Ref. 23
and summarized here. A set of Intrinsic Atomic Or-
bitals (IAOs) is formed by polarizing a minimal basis of
free-atom atomic orbitals to form a set of the same size
that can exactly represent the occupied MOs of a given
Slater determinant. The IAOs are then partitioned into
an occupied subset, whose span is the occupied MOs,
and a virtual subset, whose span defines the valence vir-
tual MOs. These two sets are localized using the Pipek-
Mezey criterion
27
to form the occupied and valence vir-
tual LMOs. The subset of valence virtual MOs are read-
ily localized.
20,28,29
For the selected features, Gaussian process regres-
sion (GPR)
30
of
d
and
o
in Eq. 2 is separately per-
formed with the
GPy
software package.
31
The Mat ́ern
5/2 kernel
30
is employed with white-noise regularization.
A single length scale is used for all features, resulting in a
total of three kernel hyperparameters. The scaled conju-
gate gradient method
32
is used to minimize the negative
log marginal likelihood objective with respect to the ker-
nel hyperparamters. Kernel ridge regression
33
was also
explored but was not found to lead to more accurate pre-
dictions than GPR.
In all cases, training and test geometries are gener-
ated from an
ab initio
molecular dynamics trajectory per-
formed with the
Q-Chem
5.0 software package,
34
using
3
the B3LYP
35–38
/6-31g*
39
level of theory and a Langevin
thermostat
40
at 350 K. Geometries are sampled from the
trajectories at 50 fs intervals. For each training geome-
try, data associated with all occupied orbitals is employed
for training, although results are unchanged if a consis-
tent number of orbital pairs is randomly selected from
training geometries.
To avoid overfitting, the total number of features
should be reduced prior to training. We prioritize fea-
tures based on the intuition that features involving two
occupied LMOs (e.g.
J
ij
) are more important than fea-
tures involving one occupied and one valence virtual
LMO (e.g.
J
ia
), which are in turn more important
than features involving two valence virtual LMOs (e.g.
J
aa
). This intuition largely agrees with feature Gini
importance rankings determined automatically via De-
cision Tree Regression (DTR),
41
while avoiding patholo-
gies found using naive application of the latter for some
cases. Such pathologies can arise from the fact that DTR
Gini importance ranks features by how well they lead to
separate clusters in feature space, with less regard for
variability within those clusters.
41
Optimal features for
ML in our application should describe variability both
within and between these clusters. This leads to prob-
lems for the DTR method in cases such as alkanes that
have only one type of occupied LMO (i.e., sigma bonds)
and thus yield no distinct clusters; in these cases, naive
application of DTR fails to select any features. Nonethe-
less, we acknowledge that more sophisticated automatic
feature selection methods are available and will be inves-
tigated in future work. For the purposes of this work, we
monitor potential overfitting using out-of-sample testing;
during training, we hold out a subset of the training set
and confirm that the errors from this subset are similar
to those from the training set. Employed features sets
used in this study are listed in Tab. I.
TABLE I. Employed feature sets, and the number of features
for the diagonal (#
f
i
) and off-diagonal (#
f
ij
) pairs.
Set
Description
#
f
i
#
f
ij
A
Features corresponding to the occupied-
occupied and occupied-virtual blocks of
F
,
J
, and
K
, including only the first four
localized valence virtual orbitals.
10
23
B
Feature Set A, with
F
aa
,
J
aa
, and
K
ab
also included in
f
i
.
13
23
C
f
i
= (
F
ii
,F
aa
,J
ii
,J
ia
,J
aa
,K
ia
)
6
7
f
ij
= (
F
ii
,F
ab
,J
ii
,J
ij
,J
jj
,K
ij
,K
ja
)
IV. RESULTS
A. Transferability among geometries
For the example of a single water molecule, we begin by
training the ML model on a subset of geometries to pre-
dict the correlation energy at other geometries. For both
the MP2 and CCSD levels of theory, the diagonal (
d
)
and off-diagonal (
o
) contributions to the correlation en-
ergy are separately trained using Feature Set A (Tab. I)
with 200 geometries, and the resulting ML predictions
for a superset of 1000 geometries are presented in Fig. 1.
Errors are summarized in terms of mean absolute error
(Mean Error), maximum absolute error (Max Error), and
Mean Error as a percentage of the mean total correla-
tion energy (Rel. Mean Error); energies are reported in
milliHartrees (mH) throughout the paper. The Pearson
correlation coefficient (
r
) is also reported as a measure
of correlation between the ML predictions and the true
values;
42
a value of
r
= 1 indicates perfect correlation,
r
= 0 indicates no correlation, and
r
=
−
1 indicates per-
fect anticorrelation. Note that a value of
r
= 1 does not
imply that the slope of the relationship is unity.
As illustrated for the diagonal contributions in Fig. 1a,
the individual contributions to the correlation energy ex-
hibit clusters associated with common physical origins
(i.e.,
σ
-bonding vs. lone-pair orbitals). For both the di-
agonal and off-diagonal contributions, the agreement be-
tween the ML prediction and the reference result is ex-
cellent, leading to predictions for the total correlation
energy that are well within chemical accuracy. For all
examples studied in this work, we find the quality of ML
predictions for MP2 and CCSD to be qualitatively simi-
lar (as in Fig. 1); MP2 results are thus presented in the
SI for the remainder.
Table II summarizes the corresponding results for other
small molecules, with
d
and
o
trained on a subset of
geometries and used to predict the CCSD correlation en-
ergy for other geometries. The molecules range in size
from H
2
to benzene. Feature Set A is used in all cases,
except for ethane, for which Feature Set B was needed to
achieve comparable accuracy. The number of geometries
included in the training set and testing superset are in-
dicated in the table. In general, the Mean Error for the
correlation energy is much less than 1 mH, and the Max
Error is also in the range of chemical accuracy. Note
that we are predicting the correlation energy for these
molecules with a Rel. Mean Error that is 0.1% or less for
all cases.
Table II also illustrates the sensitivity of the ML
predictions to changing the number of geometries in
the training set (for ethane, formic acid, and difluo-
romethane) or the employed basis set (for water). Al-
though the additional geometries for these cases lead to
better ML prediction accuracy, further improvement with
additional geometries eventually becomes limited by the
baseline self-training error of the employed GPR method.
The water results for basis sets ranging from double-zeta
to quintuple-zeta make clear that the ML prediction is
not sensitive to the employed basis set.