Regression-clustering for Improved Accuracy and Training Cost with Molecular-Orbital-Based Machine Learning

Lixue Cheng, Nikola B. Kovachki, Matthew Welborn, and Thomas F. Miller III

Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125, USA
Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA

E-mail: tfm@caltech.edu
Abstract

Machine learning (ML) in the representation of molecular-orbital-based (MOB) features has been shown to be an accurate and transferable approach to the prediction of post-Hartree-Fock correlation energies. Previous applications of MOB-ML employed Gaussian Process Regression (GPR), which provides good prediction accuracy with small training sets; however, the cost of GPR training scales cubically with the amount of data and becomes a computational bottleneck for large training sets. In the current work, we address this problem by introducing a clustering/regression/classification implementation of MOB-ML. In a first step, regression clustering (RC) is used to partition the training data to best fit an ensemble of linear regression (LR) models; in a second step, each cluster is regressed independently, using either LR or GPR; and in a third step, a random forest classifier (RFC) is trained for the prediction of cluster assignments based on MOB feature values. Upon inspection, RC is
found to recapitulate chemically intuitive groupings of the frontier molecular orbitals, and the combined RC/LR/RFC and RC/GPR/RFC implementations of MOB-ML are found to provide good prediction accuracy with greatly reduced wall-clock training times. For a dataset of thermalized (350 K) geometries of 7211 organic molecules of up to seven heavy atoms (QM7b-T), both RC/LR/RFC and RC/GPR/RFC reach chemical accuracy (1 kcal/mol prediction error) with only 300 training molecules, while providing 35000-fold and 4500-fold reductions in the wall-clock training time, respectively, compared to MOB-ML without clustering. The resulting models are also demonstrated to retain transferability for the prediction of large-molecule energies with only small-molecule training data. Finally, it is shown that capping the number of training datapoints per cluster leads to further improvements in prediction accuracy with negligible increases in wall-clock training time.
1 Introduction
Machine learning (ML) continues to emerge as a versatile strategy in the chemical sciences, with applications to drug discovery,^{1-5} materials design,^{5-9} and reaction prediction.^{5,10-14} An increasing number of ML methods have focused on the prediction of molecular properties, including quantum mechanical electronic energies,^{15-31} densities,^{26,32-36} and spectra.^{37-41} Most of this work has focused on ML in the representation of atom- or geometry-specific features, although more abstract representations are gaining increased attention.^{42-48}
We recently introduced a rigorous factorization of the post-Hartree-Fock correlation energy into contributions from pairs of occupied molecular orbitals and showed that these pair contributions could be compactly represented in the space of molecular-orbital-based (MOB) features to allow for straightforward ML regression.^{47,48} This MOB-ML method was demonstrated to accurately predict second-order Møller-Plesset perturbation theory (MP2)^{49,50} and coupled cluster with singles, doubles, and perturbative triples (CCSD(T))^{51,52} energies of different benchmark systems, including the QM7b-T and GDB-13-T datasets of thermalized drug-like organic molecules. While providing good accuracy with a modest amount of training data, the accuracy of MOB-ML in these initial studies was limited by the high computational cost (O(N³)) of applying Gaussian Process Regression (GPR) to the full set of training data.^{48}
In this work, we combine MOB-ML with regression clustering (RC) to overcome this bottleneck in computational cost and accuracy. The training data are clustered via RC to discover locally linear structures. By independently regressing these subsets of the data, we obtain MOB-ML models with greatly reduced training costs while preserving prediction accuracy and transferability.
2 Theory
2.1 Molecular-orbital-based machine learning (MOB-ML)
The MOB-ML method is based on the observation that the correlation energy for any post-Hartree-Fock wavefunction theory can be exactly decomposed as a sum over occupied molecular orbitals (MOs) via Nesbet's theorem,^{53,54}

E_c = \sum_{ij}^{occ} ε_{ij},    (1)
where E_c is the correlation energy and ε_{ij} is the pair correlation energy corresponding to occupied MOs i and j. The pair correlation energies can be expressed as a functional of the set of (occupied and unoccupied) MOs, appropriately indexed by i and j, such that

ε_{ij} = ε[{φ_p}_{ij}].    (2)
The functional ε maps the Hartree-Fock MOs to the pair correlation energy, regardless of the molecular composition or geometry, such that it is a universal functional for all chemical systems. To bypass the expensive post-Hartree-Fock evaluation procedure, MOB-ML approximates ε_{ij} by machine learning two functionals, ε^{ML}_d and ε^{ML}_o, which correspond to diagonal and off-diagonal terms of the sum in Eq. 1:

ε_{ij} ≈ \begin{cases} ε^{ML}_d[f_i] & \text{if } i = j \\ ε^{ML}_o[f_{ij}] & \text{if } i ≠ j \end{cases}    (3)
The MOB-ML feature vectors f_i and f_{ij} are comprised of unique elements of the Fock, Coulomb, and exchange matrices between φ_i, φ_j, and the set of virtual orbitals. Without loss of generality, we perform MOB-ML using localized MOs (LMOs) to improve transferability across chemical systems.^{47} Detailed descriptions of the feature design are provided in our previous work,^{47,48} and the features employed here are unchanged from those detailed in Ref. 48.
2.2 Local linearity of MOB feature space
It has been previously emphasized that MOB-ML facilitates transferability across chemical systems, even allowing for predictions involving molecules with elements that do not appear in the training set,^{47} due to the fact that MOB features provide a compact and highly abstracted representation of the electronic structure. However, it is worth additionally emphasizing that this transferability benefits from the smooth variation and local linearity of the pair correlation energies as a function of MOB feature values associated with different molecular geometries and even different molecules.
Figure 1 illustrates these latter properties for a σ-bonding orbital in a series of simple molecules. On the y-axis, we plot the diagonal contribution to the correlation energy associated with this orbital (ε_{ii}), computed at the MP2/cc-pVTZ level of theory. On the x-axis, we plot the value of a particular MOB feature, the Fock matrix element for that localized orbital, F_{ii}. For each molecule, a range of geometries is sampled from the Boltzmann distribution at 350 K, with each plotted point corresponding to a different sampled geometry.
It is immediately clear from the figure that the pair correlation energy varies smoothly and linearly as a function of the MOB feature value. Moreover, the slope of the linear curve is remarkably consistent across molecules. This illustration suggests that MOB features may lead to accurate regression of correlation energies using simple machine learning models (even linear models), and it also indicates the basis for the robust transferability of MOB-ML across diverse chemical systems, including those with elements that do not appear in the training set.
Figure 1: The diagonal pair correlation energy (ε_{ii}) for a localized σ-bond in four different molecules at thermally sampled geometries (at 350 K), computed at the MP2/cc-pVTZ level of theory. The diagonal pair correlation energies for H2O, NH3, and CH4 are shifted vertically downward relative to those of HF by 3.407, 6.289, and 7.772 kcal/mol, respectively. Illustrative σ-bond LMOs are shown for each molecule.
2.3 Regression clustering with a greedy algorithm
To take advantage of the local linearity of pair correlation energies as a function of MOB features, we propose a strategy to discover optimally linear clusters using regression clustering (RC).^{55} Consider the set of M datapoints {f_t, ε_t} ⊂ ℝ^d × ℝ, where d is the length of the MOB feature vector and where each datapoint is indexed by t and corresponds to a MOB feature vector and the associated reference value (i.e., label) for the pair correlation energy. To separate these datapoints into locally linear clusters, S_1, ..., S_N, we seek a solution to the optimization problem
\min_{S_1,...,S_N} \sum_{k=1}^{N} \sum_{t \in S_k} |A(S_k) · f_t + b(S_k) − ε_t|²    (4)
where A(S_k) ∈ ℝ^d and b(S_k) ∈ ℝ are obtained via the ordinary least squares (OLS) solution,

\begin{bmatrix} f_{t_1}^T & 1 \\ ⋮ & ⋮ \\ f_{t_{|S_k|}}^T & 1 \end{bmatrix} \begin{bmatrix} A(S_k) \\ b(S_k) \end{bmatrix} = \begin{bmatrix} ε_{t_1} \\ ⋮ \\ ε_{t_{|S_k|}} \end{bmatrix}.    (5)
Each resulting S_k is the set of indices t assigned to cluster k, comprised of |S_k| datapoints. To perform the optimization in Eq. 4, we employ a modified version of the greedy algorithm proposed in Ref. 56 (Algorithm 1). In general, solutions to Eq. 4 may overlap, such that S_k ∩ S_l ≠ ∅ for k ≠ l; however, the proposed algorithm enforces that clusters remain pairwise-disjoint.
Algorithm 1 Greedy algorithm for the solution of Eq. 4.
Input: Initial clusters S_1, ..., S_N
Output: Data clusters S_1, ..., S_N
1:  for k = 1 to N do
2:      A(S_k), b(S_k) ← OLS solution of Eq. 5
3:  end for
4:  while not converged do
5:      for k = 1 to N do
6:          S_k ← {t ∈ {1,...,M} : argmin_{n ∈ {1,...,N}} |A(S_n) · f_t + b(S_n) − ε_t|² = k}
7:      end for
8:      for k = 1 to N do
9:          A(S_k), b(S_k) ← OLS solution of Eq. 5
10:     end for
11: end while
Algorithm 1 has a per-iteration runtime of O(Md²), since we compute N OLS solutions, each with runtime O(|S_k|d²), and since \sum_{k=1}^{N} |S_k| = M. However, the algorithm can be trivially parallelized to reach a runtime of O(max_k(|S_k|)d²). A key operational step in this algorithm is line 6, which can be explained in simple terms as follows: we assign each datapoint, indexed by t, to the cluster to which it is closest, as measured by the squared linear regression distance metric,

|D_{n,t}|² = |A(S_n) · f_t + b(S_n) − ε_t|²,    (6)
where D_{n,t} is the distance of this point to cluster n. In principle, a datapoint could be equidistant to two or more different clusters by this metric; in such cases, we randomly assign the datapoint to only one of those equidistant clusters to enforce the pairwise-disjointness of the resulting clusters. Convergence of the greedy algorithm is measured by the decrease in the objective function of Eq. 4.
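For concreteness, the following is a minimal NumPy sketch of Algorithm 1 under the conventions above. The names `F` (the M × d feature matrix), `eps` (the M energy labels), and the initial `labels` assignment are inputs assumed for this illustration; ties in the argmin are broken by lowest index rather than randomly, and empty clusters are not handled.

```python
import numpy as np

def ols_fit(F_k, eps_k):
    """OLS solution of Eq. 5 for one cluster: returns (A, b)."""
    X = np.hstack([F_k, np.ones((F_k.shape[0], 1))])  # append a column of ones
    coef, *_ = np.linalg.lstsq(X, eps_k, rcond=None)
    return coef[:-1], coef[-1]                        # A in R^d, b scalar

def greedy_rc(F, eps, labels, n_clusters, max_iter=100, tol=1e-8):
    """Greedy minimization of Eq. 4; `labels` is the initial assignment."""
    prev_loss = np.inf
    models = []
    for _ in range(max_iter):
        # Lines 1-3 / 8-10 of Algorithm 1: refit the per-cluster OLS models.
        models = [ols_fit(F[labels == k], eps[labels == k])
                  for k in range(n_clusters)]
        # Line 6: squared linear-regression distance (Eq. 6) of every point
        # to every cluster; each point is assigned to its nearest cluster.
        D = np.stack([(F @ A + b - eps) ** 2 for A, b in models], axis=1)
        labels = D.argmin(axis=1)
        # Convergence is measured by the decrease in the Eq. 4 objective.
        loss = D.min(axis=1).sum()
        if prev_loss - loss < tol:
            break
        prev_loss = loss
    return labels, models
```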
Figure 2 illustrates RC in a simple one-dimensional example for which unsupervised clustering approaches will fail to reveal the underlying linear structure. To create two clusters of nearly linear data that overlap in feature space, the interval of feature values on [0, 1] is uniformly discretized, such that f_t = (t − 1)/(M − 1) for t = 1, ..., M. Then, M/2 of the feature values are randomly chosen without replacement for cluster S_1, while the remainder are placed in S_2; the energy labels associated with each feature value are then generated using

ε_t = f_t + ξ_{t,1},  t ∈ S_1    and    ε_t = −f_t + 1 + ξ_{t,2},  t ∈ S_2,

where ξ_{t,k} ∼ N(0, 0.1²) is an i.i.d. sequence. The resulting dataset is shown in Fig. 2a.
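The toy dataset itself can be generated in a few lines; in the sketch below, the dataset size M = 200 and the random seed are arbitrary choices for illustration, and the final two lines reuse the `greedy_rc` sketch above.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 200                                  # illustrative dataset size (assumed)
f = np.arange(M) / (M - 1)               # f_t = (t - 1)/(M - 1) on [0, 1]
perm = rng.permutation(M)
S1, S2 = perm[: M // 2], perm[M // 2:]   # random split without replacement
eps = np.empty(M)
eps[S1] = f[S1] + rng.normal(0.0, 0.1, S1.size)          # labels for cluster S1
eps[S2] = -f[S2] + 1.0 + rng.normal(0.0, 0.1, S2.size)   # labels for cluster S2

# Recover the two lines from a random initial assignment (cf. Fig. 2b-d),
# reusing the greedy_rc sketch above:
labels = rng.integers(0, 2, M)
labels, models = greedy_rc(f.reshape(-1, 1), eps, labels, n_clusters=2)
```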
Application of the RC method to this example is illustrated in Figs. 2(b-d). The greedy algorithm is initialized by randomly assigning each datapoint to either S_1 or S_2 (Fig. 2b). Then, with only a small number of iterations (Figs. 2c and d), the algorithm converges to clusters that reflect the underlying linear character. For comparison, Fig. 2e shows the clustering that is obtained upon convergence of the standard K-means algorithm,^{57} initialized with random cluster assignments. Unlike RC, the K-means algorithm prioritizes the compactness of clusters, resulting in a final clustering that is far less amenable to simple regression. While we recognize that the correct clustering could potentially be obtained using K-means when the dimensions of f_t and ε_t are comparable, this is not the case for MOB-ML applications, since f_t is typically at least 10-dimensional and ε_t is a scalar; the RC approach does not suffer from this issue. Finally, we have confirmed that initialization of RC from the clustering in Fig. 2e rapidly returns to the results in Fig. 2d, requiring only a couple of iterations of the greedy algorithm.

Figure 2: Comparison of clustering algorithms for (a) a dataset comprised of two clusters of nearly linear data that overlap in feature space, using (b-d) RC and (e) standard K-means clustering. (b) Random initialization of the clusters for the greedy algorithm, with datapoint color indicating cluster assignment. (c) Cluster assignments after one iteration of the greedy algorithm. (d) Converged cluster assignments after four iterations of the greedy algorithm. For panels (b-d), the two linear regression lines at each iteration are shown in black. (e) Converged cluster assignments obtained using K-means clustering, which fails to reveal the underlying linear structure of the clusters.
Figure 3: The MOB-ML clustering/regression/classification workflow. (a) Clustering of the training dataset of MOB-ML feature vectors and energy labels using RC to obtain optimized linear clusters and to provide the cluster labels for the feature vectors. (b) Regression of each cluster of training data (using LR or GPR), to obtain the ensemble of cluster-specific regression models. (c) Training a classifier (RFC) from the MOB-ML feature vectors and cluster labels for the training data. (d) Evaluating the predicted MOB-ML pair correlation energy from a test feature vector, performed by first classifying the feature vector into one of the clusters and then evaluating the cluster-specific regression model. In each panel, blue boxes indicate input quantities, orange boxes indicate training intermediates, and green boxes indicate the resulting labels, models, and pair correlation energy predictions.
3 Calculation Details

Results are presented for QM7b-T,^{48} a thermalized version of the QM7b set^{58} of 7211 molecules with up to seven C, O, N, S, and Cl heavy atoms, as well as for GDB-13-T,^{48} a thermalized version of the GDB-13 set^{59} of molecules with thirteen C, O, N, S, and Cl heavy atoms. The MOB-ML features employed in the current study are identical to those previously provided.^{48} Reference pair correlation energies are computed using MP2^{49} and CCSD(T).^{51,52} The MP2 reference data were obtained with the cc-pVTZ basis set,^{60} whereas the CCSD(T) data were obtained using the cc-pVDZ basis set.^{60} All employed training and test datasets are provided in Ref. 48.
3.1 Regression Clustering (RC)

RC is performed using the ordinary least squares linear regression implementation in the scikit-learn package.^{61} Unless otherwise specified, we initialize the greedy algorithm from the results of K-means clustering, also implemented in scikit-learn; K-means initialization was found to improve the subsequent training of the random forest classifier (RFC) in comparison to random initialization. Neither L1 nor L2 regularization is found to have a significant effect on the rate of convergence of the greedy algorithm, so neither is employed in the results presented here. A convergence threshold of 1 × 10⁻⁸ kcal²/mol² for the loss function of the greedy algorithm (Eq. 4) is found to cause no degradation in the final MOB-ML regression accuracy (Fig. S2); this value is employed throughout.
3.2 Regression

Two different regression models are employed in the current work. The first is ordinary least-squares linear regression (LR), as implemented in scikit-learn. The second is Gaussian Process Regression (GPR), as implemented in the GPy 1.9.6 software package.^{62} Regression is independently performed for the training data associated with each cluster, yielding a local regression model for each cluster. Also, as in our previous work,^{47,48} regression is independently performed for the diagonal and off-diagonal pair correlation energies (ε^{ML}_d and ε^{ML}_o), yielding independent regression models for each (Eq. 3).

GPR is performed using a negative log marginal likelihood objective. As in our previous work,^{48} the Matérn 5/2 kernel is used for regression of the diagonal pair correlation energies and the Matérn 3/2 kernel is used for the off-diagonal pair correlation energies; in both cases, white-noise regularization^{63} is employed, and the GPR is initialized with unit lengthscale and variance.
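A per-cluster GPR fit along these lines might be sketched as follows with GPy; the arrays `F_k` and `eps_k`, holding the feature vectors and energy labels of a single cluster, are assumptions of this example.

```python
import GPy

def train_cluster_gpr(F_k, eps_k, diagonal=True):
    """Fit one cluster-specific GPR model (sketch of the Sec. 3.2 setup)."""
    d = F_k.shape[1]
    # Matern 5/2 kernel for diagonal pairs, Matern 3/2 for off-diagonal pairs;
    # GPy initializes the lengthscale and variance to 1 by default.
    base = GPy.kern.Matern52(d) if diagonal else GPy.kern.Matern32(d)
    kernel = base + GPy.kern.White(d)    # white-noise regularization
    model = GPy.models.GPRegression(F_k, eps_k.reshape(-1, 1), kernel)
    model.optimize()                     # minimizes the negative log marginal likelihood
    return model
```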
3.3 Classification

An RFC is trained on the MOB-ML features and cluster labels for a training set and then used to predict the cluster assignments of test datapoints in MOB-ML feature space. We employ the RFC implementation in scikit-learn, using 200 trees, the entropy split criterion,^{64} and balanced class weights.^{64} Alternative classifiers were also tested in this work, including K-means, linear SVM,^{65} and AdaBoost;^{66} however, these schemes were generally found to yield less accurate MOB-ML energy predictions than RFC.
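In scikit-learn, the classifier described above amounts to only a few lines; in this sketch, `F_train` and `cluster_labels` are assumed to hold the MOB-ML feature vectors and the converged RC cluster labels.

```python
from sklearn.ensemble import RandomForestClassifier

def train_classifier(F_train, cluster_labels):
    """Train the RFC of Sec. 3.3 on MOB-ML features and RC cluster labels."""
    rfc = RandomForestClassifier(n_estimators=200,        # 200 trees
                                 criterion="entropy",     # entropy split criterion
                                 class_weight="balanced") # balanced class weights
    return rfc.fit(F_train, cluster_labels)
```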
For comparison, a "perfect" classifier is obtained by simply including the test data within the RC training set. While useful for the analysis of prediction errors due to classification, this scheme is not generally practical, because it assumes prior knowledge of the reference energy labels for the test molecules. Since the perfect classifier avoids mis-classification of the test data by construction, it should be regarded as a best-case scenario for the performance of the clustering/regression/classification approach.
3.4 The clustering/regression/classification workflow

Fig. 3 summarizes the combined workflow for training and evaluating a MOB-ML model with clustering. The training involves three steps. First, the training dataset of MOB-ML feature vectors and energy labels is assigned to clusters using the RC method (panel a). Second, for each cluster of training data, a regression model (LR or GPR) is trained, to enable the prediction of pair correlation energies from the MOB-ML feature vector. Third, a classifier is trained from the MOB-ML feature vectors and cluster labels for the training data, to enable the prediction of the cluster assignment from a MOB-ML feature vector.

The resulting MOB-ML model is specified in terms of the method of clustering (RC, for all results presented here), the method of regression (either LR or GPR), and the method of classification (either RFC or the perfect classifier). In referring to a given MOB-ML model, we employ a notation that specifies these options (e.g., RC/LR/RFC or RC/GPR/perfect).

Evaluation of the trained MOB-ML model is illustrated in Fig. 3d. A given molecule is first decomposed into a set of test feature vectors associated with the pairs of occupied MOs. The classifier is then used to assign each feature vector to an associated cluster. The cluster-specific regression model is then used to predict the pair correlation energy from each MOB feature vector. Finally, the pair correlation energies are summed to yield the total correlation energy for the molecule.
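Putting the pieces together, the evaluation step might be sketched as follows. Here `mob_features` is a hypothetical helper returning one feature vector per occupied orbital pair, `models` is assumed to be a list of cluster-specific scikit-learn LR models indexed by cluster label, and, for brevity, the sketch uses a single set of models rather than separate diagonal and off-diagonal ensembles.

```python
import numpy as np

def predict_correlation_energy(molecule, rfc, models):
    """Sketch of the Fig. 3d evaluation step."""
    F = mob_features(molecule)     # hypothetical helper: one row per MO pair
    clusters = rfc.predict(F)      # classify each feature vector into a cluster
    pair_energies = [models[k].predict(F[t:t + 1])[0]   # cluster-specific model
                     for t, k in enumerate(clusters)]
    return float(np.sum(pair_energies))  # E_c = sum of pair energies (Eq. 1)
```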
To improve the accuracy and reduce the uncertainty of the MOB-ML predictions, ensembles of 10 independent models using the clustering/regression/classification workflow are trained, and the predictive mean and the corresponding standard error of the mean (SEM) are computed by averaging over the 10 models; a comparison between the learning curves^{67} from a single run and from averaging over the 10 independent models is included in Supporting Information Fig. S1. As described here, the predicted correlation energies may exhibit discontinuities as a function of nuclear position, due to changes in the assignment of feature vectors among the clusters; moving forward, this may be avoided with the use of soft (or fuzzy) clustering algorithms.^{68}
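The ensemble statistics reduce to a mean and a standard error over the model outputs; a minimal sketch, assuming `predictions` stacks the per-model predicted energies into an (n_models, n_test) array:

```python
import numpy as np

def ensemble_stats(predictions):
    """Predictive mean and SEM over an ensemble of model predictions.

    `predictions` is assumed to have shape (n_models, n_test), e.g. (10, n_test)
    for the 10 independent clustering/regression/classification models."""
    P = np.asarray(predictions)
    mean = P.mean(axis=0)                              # predictive mean
    sem = P.std(axis=0, ddof=1) / np.sqrt(P.shape[0])  # standard error of the mean
    return mean, sem
```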
4 Results

4.1 Clustering and classification in MOB feature space

We begin by showing that the situation explored in Fig. 2, in which locally linear clusters overlap, also arises in realistic chemical applications of MOB-ML. We consider the QM7b-T set of drug-like molecules with thermalized geometries, using the diagonal pair correlation energies ε^{ML}_d computed at the MP2/cc-pVTZ level. Randomly selecting 1000 molecules for training, we perform RC on the dataset comprised of these energy labels and feature vectors, using N = 20 optimized clusters; the sensitivity of RC to the choice of N is examined later.

In many cases, the resulting clusters are well separated, such that the datapoints for one cluster have small distances (as measured by the linear regression distance metric, Eq. 6) to the cluster to which they belong and large distances to all other clusters. However, the clusters can also overlap. Fig. 4a illustrates this overlap for two particular clusters (labeled 1 and 2) obtained from the QM7b-T diagonal-pair training data.
Each datapoint assigned to cluster 1 (blue) is plotted according to its distance to both cluster 1 and cluster 2; likewise for the datapoints in cluster 2 (red). The datapoints for which the distances to both clusters approach zero correspond to regions of overlap between the clusters in the high-dimensional space of MOB-ML features, akin to the case shown in Fig. 2.

Finally, in Fig. 4b, we illustrate the classification of the feature vectors into clusters. An RFC is trained on the feature vectors and cluster labels for the diagonal pairs of the 1000 QM7b-T molecules in the training set, and the classifier is then used to predict the cluster assignments for the feature vectors associated with the diagonal pairs of the remaining 6211 molecules in QM7b-T. For clusters 1 and 2, we then analyze the accuracy of the RFC by plotting the linear regression distance of each datapoint to the two clusters, as well as indicating the RFC classification of the feature vector. Each red datapoint in Fig. 4b that lies above the diagonal line of reflection is mis-classified into cluster 2, and similarly, each blue datapoint that lies below the line of reflection is mis-classified into cluster 1. The figure illustrates that while RFC is not a perfect means of classification, it is at least qualitatively correct. Later in the results section, we analyze the sources of MOB-ML prediction errors due to mis-classification by comparing energy predictions obtained with perfect classification versus RFC.
Figure 4: (a) Illustration of the overlap of clusters obtained via RC for the training set molecules from QM7b-T. (b) Classification of the datapoints for the remaining test molecules from QM7b-T, using RFC. Distances correspond to the linear regression metric defined in Eq. 6.
Figure 5: Analyzing the results of clustering/classification in terms of chemical intuition. Using a training set of 500 randomly selected molecules from QM7b-T, RC is performed for the diagonal pair correlation energies, ε^{ML}_d, with a range of cluster numbers, N, and for each clustering, an RFC is trained. Then, the trained classifier is applied to a set of test molecules (CH4, C2H6, C2H4, C3H8, CH3CH2OH, CH3OCH3, CH3CH2CH2CH3, CH3CH(CH3)CH3, CH3CH2CH2CH2CH2CH2CH3, (CH3)3CCH2OH, and CH3CH2CH2CH2CH2CH2OH) which have chemically intuitive LMO types, as indicated in the legend. The LMOs are successfully resolved according to type by the classifier as N increases. Empty boxes correspond to clusters into which none of the LMOs from the test set is classified; these are expected since the training set is more diverse than the test set.
4.2 Chemically intuitive clusters

To examine whether the optimized clusters correspond to chemically intuitive groupings of the molecular orbitals, we employ a training set of 500 randomly selected molecules from QM7b-T, and we perform regression clustering for the diagonal pair correlation energies ε^{ML}_d with a range of total cluster numbers, up to N = 20. For each clustering, we then train an RFC. Finally, each trained RFC is independently applied to a set of test molecules with easily characterized valence molecular orbitals (listed in the caption of Fig. 5), to see how the feature vectors associated with occupied valence LMOs are classified among the optimized clusters.

Figure 5 presents the results of this exercise, clearly indicating the agreement between chemical intuition and the predictions of the RFC. As the number of clusters increases, the feature vectors associated with different valence LMO types are resolved into different clusters; with a sufficiently large number of clusters (15 or 20), each cluster is dominated by a single type of LMO, while each LMO type is assigned to a small number of different clusters. The empty boxes in Fig. 5 reflect that the training set contains a larger diversity of LMO types than the 11 test molecules, which is expected. The observed consistency of the clustering/classification method presented here with chemical intuition is of course promising for the accurate local regression of pair correlation energies, which is the focus of the current work; however, the results of Fig. 5 also suggest that the clustering/classification of chemical systems in MOB-ML feature space provides a powerful and highly general way of mapping the structure of chemical space for other applications, including explorative or active ML applications.^{69}
4.3 Sensitivity to the number of clusters

We now explore the sensitivity of the MOB-ML clustering/regression/classification implementation to the number of employed clusters. In particular, we investigate the mean absolute error (MAE) of the MOB-ML predictions for the diagonal (\sum_i ε_{ii}) and off-diagonal (\sum_{i≠j} ε_{ij}) contributions to the total correlation energy, as a function of the number of clusters, N, used in the RC. The MOB-ML models employ linear regression and RFC classification (i.e., the RC/LR/RFC protocol); the training set is comprised of 1000 randomly chosen molecules from QM7b-T, and the test set contains the remaining molecules in QM7b-T.

Figure 6 presents the results of this calibration study, plotting the prediction MAE as a function of the number of clusters. Not surprisingly, the prediction accuracy for both the diagonal and off-diagonal contributions improves with N, although it eventually plateaus in both cases. For the diagonal contributions, the accuracy improves most rapidly up to approximately 20 clusters, in accord with the observations in Fig. 5; for the off-diagonal contributions, a larger number of clusters is useful for reducing the MAE, which is sensible given the greater variety of feature vectors that can be created from pairs of LMOs rather than only individual LMOs. Appealingly, there does not seem to be a strong indication of MAE increases due to "over-clustering". While recognizing that the optimal number of clusters will, in general, depend somewhat on the application and the regression method (i.e., LR versus GPR), the results in Fig. 6 nonetheless provide useful guidance with regard to appropriate values of N. Throughout the remainder of the study, we employ a value of N = 20 for the MOB-ML prediction of diagonal contributions to the correlation energy and a value of N = 70 for the off-diagonal contributions; however, we recognize that these choices could be further optimized.
Figure 6: Illustration of the sensitivity of MOB-ML predictions for the diagonal and off-diagonal contributions to the correlation energy (prediction MAE, in kcal/mol, versus the number of clusters) for the QM7b-T set of molecules, using a subset of 1000 molecules for training and the RC/LR/RFC protocol. The standard error of the mean (SEM) for the predictions is smaller than the size of the plotted points.
4.4 Performance and training cost of MOB-ML with RC

We now investigate the effect of clustering on the accuracy and training costs of MOB-ML for applications to sets of drug-like molecules. Figure 7a presents learning curves (on a linear-linear scale) for various implementations of MOB-ML applied to MP2/cc-pVTZ correlation energies, with the training and test sets corresponding to non-overlapping subsets of QM7b-T. In addition to the new results obtained using RC, we include the MOB-ML results from our previous work (GPR without clustering).^{48}

Figure 7a yields three clear observations. The first is that the use of RC with RFC (i.e., RC/GPR/RFC and RC/LR/RFC) leads to slightly less efficient learning curves than our previous implementation without clustering, at least when efficiency is measured in terms of the number of training molecules. Both the RC/GPR/RFC and RC/LR/RFC protocols require approximately 300 training molecules to reach the 1 kcal/mol per seven heavy atoms threshold for chemical accuracy employed here, whereas MOB-ML without clustering requires approximately half as many train-