1806.00133.pdf

Transferability in Machine Learning for Electronic Structure

via the Molecular Orbital Basis

Matthew Welborn, Lixue Cheng, and Thomas F. Miller III

∗

Division of Chemistry and Chemical Engineering,

California Institute of Technology, Pasadena, CA 91125, USA

(Dated: July 30, 2018)

We present a machine learning (ML) method for predicting electronic structure correlation en-

ergies using Hartree-Fock input. The total correlation energy is expressed in terms of individual

and pair contributions from occupied molecular orbitals, and Gaussian process regression is used to

predict these contributions from a feature set that is based on molecular orbital properties, such as

Fock, Coulomb, and exchange matrix elements. With the aim of maximizing transferability across

chemical systems and compactness of the feature set, we avoid the usual specification of ML fea-

tures in terms of atom- or geometry-specific information, such atom/element-types, bond-types, or

local molecular structure. ML predictions of MP2 and CCSD energies are presented for a range of

systems, demonstrating that the method maintains accuracy while providing transferability both

within and across chemical families; this includes predictions for molecules with atom-types and el-

ements that are not included in the training set. The method holds promise both in its current form

and as a proof-of-principle for the use of ML in the design of generalized density-matrix functionals.

Keywords: machine learning, electron correlation, Gaussian processes, density-matrix functional theory

I. INTRODUCTION

Recent interest in the use of machine learning (ML) for

electronic structure has focused on models that are for-

mulated in terms of atom- and geometry-specific features,

such as atom-types and bonding connectivities. The ad-

vantage of this approach is that it can yield excellent

accuracy with computational cost that is comparable to

classical force fields.

1–16

However, a disadvantage of this

approach is that building a ML model to describe a di-

verse set of elements and chemistries requires training

with respect to a number of features that grows quickly

with the number of atom- and bond-types, and also re-

quires vast amounts of reference data for the selection and

training of those features; these issues have hindered the

degree of chemical transferability of existing ML mod-

els for electronic structure. For example, previous meth-

ods have not demonstrated predictions for molecules with

chemical elements that are not included in the training

data.

In this work, we focus on the more modest goal of us-

ing ML to describe the post-Hartree-Fock correlation en-

ergy. Assuming willingness to incur the cost of a Hartree-

Fock self-consistent field (SCF) calculation, we aim to

describe the correlation energy associated with pertur-

bation theory,

coupled-cluster theory,

or other post-

Hartree-Fock methods. Our approach focuses on training

not with respect to atom-based features, but instead us-

ing features based on the Hartree-Fock molecular orbitals

(MOs), which have no explicit dependence on the under-

lying atom-types and may thus be expected to provide

greater chemical transferability.

For a general post-Hartree-Fock electronic structure

method, the correlation energy may be expressed via Nes-

bet’s theorem as a sum over occupied MOs

occ

∑

(1)

Our strategy is to use ML to describe the diagonal and

off-diagonal contributions to this sum,

(

)

and

(

)

(2)

respectively, where

is a vector of features associated

with the

occupied MO, and

is a vector of features

associated with the

i,j

pair of occupied MOs. Employ-

ing this strategy in the representation of localized MOs

(LMOs), for which Eq. 1 also holds, leads to a ML model

that is compact with respect to the number of features

and that is both chemically accurate and encouragingly

transferable across chemical systems.

II. FEATURE DESIGN AND SELECTION

All ML features used in this study are elements of the

Fock matrix

, Coulomb matrix

, or exchange matrix

. With the aim of maximizing transferability of the fea-

tures, we represent the matrices in the LMO basis. Only

matrix elements associated with the subset of valence

occupied and virtual LMOs are included as ML features;

occupied core orbitals are excluded, as the post-Hartree-

Fock calculations employ the frozen core approximation,

and the valence virtual orbitals are defined by projection

onto a minimal basis (details in Sec. III).

For a given

i,j

pair of occupied LMOs, the total feature

vector

is comprised of feature vectors associated with

elements of the Fock, Coulomb, and exchange matrices,

(

)

(

)

(

)

(3)

arXiv:1806.00133v2 [physics.chem-ph] 26 Jul 2018

These composite vectors involve matrix elements from

the occupied-occupied, occupied-virtual, and virtual-

virtual blocks of the matrices, such that

(

)

(

)

(4)

(

)

(

)

(

)

(

)

where terms are sorted with respect to

and

such that

< F

. This sorting guarantees that

. The

vectors

, and

include matrix elements as-

sociated with localized valence virtual orbitals (indexed

a,b,c,...

) such that

= (

,...

)

(5)

= (

,...

)

and likewise for

and

. The localized valence virtual

orbitals associated with the matrix elements in

and

are selected and sorted on the basis of having the

largest off-diagonal Coulomb matrix elements, such that

> J

, etc.; likewise for

and

. Note that

the valence virtual LMO associated with

is the same

as for

, but it need not be the same as that associated

with

. Finally, the matrices

and

in Eq. 4

contain virtual-virtual matrix elements corresponding to

localized valence virtual orbitals that are selected and

sorted such that

> J

, etc.; only the upper

diagonal of these matrices comprise independent features

and are included. Because they appear in the cluster

amplitude equations of MP2 and CCSD, virtual-virtual

matrix elements are potentially informative features for

the prediction of pair correlation energies.

Appropriate sorting of the virtual LMOs was found to

be important for achieving transferability as the GP re-

gression is sensitive to permutations of elements within

the feature vector.

acts as a proxy for spatial

distance. As dynamical electron correlation is “near-

sighted,”

the spatially closest valence virtual LMOs are

also likely to be most important to the pair correlation

energy. In a large system, the number of included va-

lence virtual LMOs must be limited, and sorting ensures

that these most important elements are included in the

feature vector. Any distance-based cutoff procedure is

subject to discontinuities in the energy if valence vir-

tual LMOs move in or out of the cutoff region. As in

local correlation methods, sufficiently large cutoffs must

be chosen to ensure that the energy surface is acceptably

smooth.

The near-sighted nature of dynamical corre-

lation also leads to the expectation that the number of

needed features based on valence virtual LMOs quickly

saturates with system size, which is confirmed in the re-

sults presented below.

The resulting features are invariant with respect to ro-

tation and translation of the system, invariant to rota-

tions among the occupied MOs that precede localization,

smooth with respect to molecular geometry, and unique

for each geometry – to the extent that the employed or-

bital localization method has these properties. In this

work, we employ the Intrinsic Bond Orbital method

which has been shown to yield unique LMOs which vary

smoothly with geometry.

23,24

By construction, the fea-

tures yield a model with sufficient flexibility to describe

dissociation into two closed-shell fragments. In the dis-

sociated limit, features corresponding to occupied pairs

with both

and

on one fragment contain no informa-

tion about the other fragment. For occupied pairs

and

that span fragments, by including dissociated fragments

in the training data, the ML model is trained to predict

that

vanishes as features involving both

and

(e.g.

) go to zero.

For each occupied LMO used to describe the diagonal

contributions to the correlation energy,

(

) in Eq. 2,

the total feature vector

is obtained by keeping only

the unique terms in

III. CALCULATION DETAILS

All Hartree-Fock, second-order Møller-Plessett pertur-

bation theory (MP2),

and coupled-cluster with singles

and doubles (CCSD)

calculations are performed using

the

Molpro

2018.0 software package.

Unless otherwise

stated, calculations employ the cc-pVTZ basis set.

The

frozen-core approximation is employed for correlated cal-

culations.

Valence occupied and virtual LMOs are generated us-

ing the Intrinsic Bond Orbital method

with a local-

ization threshold of 10

−

; core orbitals are excluded

from localization. This method is detailed in Ref. 23

and summarized here. A set of Intrinsic Atomic Or-

bitals (IAOs) is formed by polarizing a minimal basis of

free-atom atomic orbitals to form a set of the same size

that can exactly represent the occupied MOs of a given

Slater determinant. The IAOs are then partitioned into

an occupied subset, whose span is the occupied MOs,

and a virtual subset, whose span defines the valence vir-

tual MOs. These two sets are localized using the Pipek-

Mezey criterion

to form the occupied and valence vir-

tual LMOs. The subset of valence virtual MOs are read-

ily localized.

20,28,29

For the selected features, Gaussian process regres-

sion (GPR)

and

in Eq. 2 is separately per-

formed with the

GPy

software package.

The Mat ́ern

5/2 kernel

is employed with white-noise regularization.

A single length scale is used for all features, resulting in a

total of three kernel hyperparameters. The scaled conju-

gate gradient method

is used to minimize the negative

log marginal likelihood objective with respect to the ker-

nel hyperparamters. Kernel ridge regression

was also

explored but was not found to lead to more accurate pre-

dictions than GPR.

In all cases, training and test geometries are gener-

ated from an

ab initio

molecular dynamics trajectory per-

formed with the

Q-Chem

5.0 software package,

using

the B3LYP

35–38

/6-31g*

level of theory and a Langevin

thermostat

at 350 K. Geometries are sampled from the

trajectories at 50 fs intervals. For each training geome-

try, data associated with all occupied orbitals is employed

for training, although results are unchanged if a consis-

tent number of orbital pairs is randomly selected from

training geometries.

To avoid overfitting, the total number of features

should be reduced prior to training. We prioritize fea-

tures based on the intuition that features involving two

occupied LMOs (e.g.

) are more important than fea-

tures involving one occupied and one valence virtual

LMO (e.g.

), which are in turn more important

than features involving two valence virtual LMOs (e.g.

). This intuition largely agrees with feature Gini

importance rankings determined automatically via De-

cision Tree Regression (DTR),

while avoiding patholo-

gies found using naive application of the latter for some

cases. Such pathologies can arise from the fact that DTR

Gini importance ranks features by how well they lead to

separate clusters in feature space, with less regard for

variability within those clusters.

Optimal features for

ML in our application should describe variability both

within and between these clusters. This leads to prob-

lems for the DTR method in cases such as alkanes that

have only one type of occupied LMO (i.e., sigma bonds)

and thus yield no distinct clusters; in these cases, naive

application of DTR fails to select any features. Nonethe-

less, we acknowledge that more sophisticated automatic

feature selection methods are available and will be inves-

tigated in future work. For the purposes of this work, we

monitor potential overfitting using out-of-sample testing;

during training, we hold out a subset of the training set

and confirm that the errors from this subset are similar

to those from the training set. Employed features sets

used in this study are listed in Tab. I.

TABLE I. Employed feature sets, and the number of features

for the diagonal (#

) and off-diagonal (#

) pairs.

Set

Description

Features corresponding to the occupied-

occupied and occupied-virtual blocks of

, and

, including only the first four

localized valence virtual orbitals.

Feature Set A, with

, and

also included in

= (

)

= (

)

IV. RESULTS

A. Transferability among geometries

For the example of a single water molecule, we begin by

training the ML model on a subset of geometries to pre-

dict the correlation energy at other geometries. For both

the MP2 and CCSD levels of theory, the diagonal (

)

and off-diagonal (

) contributions to the correlation en-

ergy are separately trained using Feature Set A (Tab. I)

with 200 geometries, and the resulting ML predictions

for a superset of 1000 geometries are presented in Fig. 1.

Errors are summarized in terms of mean absolute error

(Mean Error), maximum absolute error (Max Error), and

Mean Error as a percentage of the mean total correla-

tion energy (Rel. Mean Error); energies are reported in

milliHartrees (mH) throughout the paper. The Pearson

correlation coefficient (

) is also reported as a measure

of correlation between the ML predictions and the true

values;

a value of

= 1 indicates perfect correlation,

= 0 indicates no correlation, and

−

1 indicates per-

fect anticorrelation. Note that a value of

= 1 does not

imply that the slope of the relationship is unity.

As illustrated for the diagonal contributions in Fig. 1a,

the individual contributions to the correlation energy ex-

hibit clusters associated with common physical origins

(i.e.,

-bonding vs. lone-pair orbitals). For both the di-

agonal and off-diagonal contributions, the agreement be-

tween the ML prediction and the reference result is ex-

cellent, leading to predictions for the total correlation

energy that are well within chemical accuracy. For all

examples studied in this work, we find the quality of ML

predictions for MP2 and CCSD to be qualitatively simi-

lar (as in Fig. 1); MP2 results are thus presented in the

SI for the remainder.

Table II summarizes the corresponding results for other

small molecules, with

and

trained on a subset of

geometries and used to predict the CCSD correlation en-

ergy for other geometries. The molecules range in size

from H

to benzene. Feature Set A is used in all cases,

except for ethane, for which Feature Set B was needed to

achieve comparable accuracy. The number of geometries

included in the training set and testing superset are in-

dicated in the table. In general, the Mean Error for the

correlation energy is much less than 1 mH, and the Max

Error is also in the range of chemical accuracy. Note

that we are predicting the correlation energy for these

molecules with a Rel. Mean Error that is 0.1% or less for

all cases.

Table II also illustrates the sensitivity of the ML

predictions to changing the number of geometries in

the training set (for ethane, formic acid, and difluo-

romethane) or the employed basis set (for water). Al-

though the additional geometries for these cases lead to

better ML prediction accuracy, further improvement with

additional geometries eventually becomes limited by the

baseline self-training error of the employed GPR method.

The water results for basis sets ranging from double-zeta

to quintuple-zeta make clear that the ML prediction is

not sensitive to the employed basis set.