Improved accuracy and transferability of molecular-orbital-based machine learning: Organics, transition-metal complexes, non-covalent interactions, and transition states

Molecular-orbital-based machine learning (MOB-ML) provides a general framework for the prediction of accurate correlation energies at the cost of obtaining molecular orbitals. We demonstrate the importance of preserving physical constraints, including invariance conditions and size consistency, when generating the input for the machine learning model. Numerical improvements are demonstrated for different data sets covering total and relative energies for thermally accessible organic and transition-metal containing molecules, non-covalent interactions, and transition-state energies. MOB-ML requires training data from only 1% of the QM7b-T data set (i.e., only 70 organic molecules with seven and fewer heavy atoms) to predict the total energy of the remaining 99% of this data set with sub-kcal/mol accuracy. This MOB-ML model is significantly more accurate than other methods when transferred to a data set composed of molecules with thirteen heavy atoms, exhibiting no loss of accuracy on a size-intensive (i.e., per-electron) basis. It is shown that MOB-ML also works well for extrapolating to transition-state structures, predicting the barrier region for malonaldehyde intramolecular proton transfer to within 0.35 kcal/mol when trained only on reactant/product-like structures. Finally, the use of the Gaussian process variance enables an active learning strategy for extending a MOB-ML model to new regions of chemical space with minimal effort. We demonstrate this active learning strategy by extending a QM7b-T model to describe non-covalent interactions in the protein backbone-backbone interaction data set to an accuracy of 0.28 kcal/mol.


I. INTRODUCTION
The calculation of accurate potential energies of molecules and materials at affordable cost is at the heart of computational chemistry. While state-of-the-art ab initio electronic structure theories can yield highly accurate results, they are computationally too expensive for routine applications. Density functional theory (DFT) is computationally cheaper and has thus enjoyed widespread applicability. However, DFT is hindered by a lack of systematic improvability and by an uncertain quality for many applications.
In recent years, a variety of machine learning approaches have emerged that promise to mitigate the cost of highly accurate electronic structure methods while preserving accuracy. 1-30 While these machine learning methods share similar goals, they differ in the representation of the molecules and in the machine learning methodology itself. Here, we will focus on the molecular-orbital-based machine learning (MOB-ML) approach. 15,18,19 The defining feature of MOB-ML is its framing of learning highly accurate correlation energies as learning a sum of orbital pair correlation energies. These orbital pair correlation energies can be individually regressed with respect to a feature vector representing the interaction of the molecular orbital pairs. Without approximation, it can be shown that such pair correlation energies add up to the correct total correlation energy for single-reference wave function methods. Phrasing the learning problem in this manner has the advantage that a given pair correlation energy, and, hence, a given feature vector, is independent of molecular size (after a certain size threshold has been reached) because of the inherent spatial locality of dynamic electron correlation. Consequently, operating in such an orbital pair interaction framework converts the general extrapolation task of training on small molecules and predicting on large molecules into an interpolation task of training on orbital pairs in a small molecule and predicting on the same orbital pairs in a large molecule.
a) Electronic mail: tfm@caltech.edu
In this work, we address challenges introduced by operating in a vectorized molecular orbital pair interaction framework (Section II). We show how changes to the feature design affect the performance and transferability of MOB-ML models within the same molecular family (Section IV A) and across molecular families (Sections IV B-IV C). We probe these effects on relative- and total-energy predictions for organic and transition-metal containing molecules, and we investigate the applicability of MOB-ML to transition-state structures and non-covalent interactions.

II. THEORY
MOB-ML predicts correlation energies based on information from the molecular orbitals. 15,18,19 The correlation energy $E_{\mathrm{corr}}$ in the current study is defined as the difference between the true total electronic energy and the Hartree-Fock (HF) energy for a given basis set. Without approximation, the correlation energy is expressed as a sum over correlation energy contributions from pairs of occupied orbitals $i$ and $j$, 31

$$E_{\mathrm{corr}} = \sum_{ij} \varepsilon_{ij}. \tag{1}$$

Electronic structure theories offer different ways of approximating these pair correlation energies. For example, the second-order Møller-Plesset perturbation theory (MP2) correlation energy is 32

$$\varepsilon_{ij}^{\mathrm{MP2}} = \frac{1}{4} \sum_{ab} \frac{|\langle ia \| jb \rangle|^2}{F_{ii} + F_{jj} - F_{aa} - F_{bb}}, \tag{2}$$

where $a$, $b$ denote virtual orbitals, $F$ the Fock matrix in the molecular orbital basis, and $\langle ia \| jb \rangle$ the anti-symmetrized exchange integral. We denote a general repulsion integral over the spatial coordinates $\mathbf{x}_1$, $\mathbf{x}_2$ of molecular orbitals $p$, $q$, $m$, $n$ following the chemist's notation as

$$[\kappa_{pq}]_{mn} = \int \mathrm{d}\mathbf{x}_1 \, \mathrm{d}\mathbf{x}_2 \, p(\mathbf{x}_1)^{*} q(\mathbf{x}_1) \frac{1}{|\mathbf{x}_1 - \mathbf{x}_2|} m(\mathbf{x}_2)^{*} n(\mathbf{x}_2). \tag{3}$$

The evaluation of correlation energies with post-HF methods like MP2 or coupled-cluster theory (including CCSD(T)) involves computations that exceed the cost of HF theory by orders of magnitude. By contrast, MOB-ML predicts the correlation energy at negligible cost by machine-learning the map

$$\varepsilon_{ij} \approx \varepsilon^{\mathrm{ML}}(\mathbf{f}_{ij}), \tag{4}$$

where $\mathbf{f}_{ij}$ denotes the feature vector into which information on the molecular orbitals is compiled. Following our previous work, 18 we define a canonical order of the orbitals $i$ and $j$ by rotating them into gerade and ungerade combinations (see Eq. (7) in Ref. 18), creating the rotated orbitals $\tilde{i}$ and $\tilde{j}$.
The feature vector $\mathbf{f}_{ij}$ assembles information on the molecular orbital interactions: (i) orbital energies of the valence-occupied and valence-virtual orbitals, $F_{pp}$; (ii) mean-field interaction energies of valence-occupied with valence-occupied orbitals and of valence-virtual with valence-virtual orbitals, $F_{pq}$; (iii) Coulomb interactions of valence-occupied with valence-occupied orbitals, of valence-occupied with valence-virtual orbitals, and of valence-virtual with valence-virtual orbitals, $[\kappa_{pp}]_{qq}$; and (iv) exchange interactions of valence-occupied with valence-occupied orbitals, of valence-occupied with valence-virtual orbitals, and of valence-virtual with valence-virtual orbitals, $[\kappa_{pq}]_{pq}$. We note that all of these pieces of information enter either the MP2 or the MP3 correlation energy expressions, which helps to motivate their value within our machine learning framework. We remove repetitive information from the feature vector and separate the learning problem into the cases where (i) $i \neq j$, where we employ the off-diagonal feature vector $\mathbf{f}_{ij}$ as defined in Eq. (5), and (ii) $i = j$, where we employ the diagonal feature vector $\mathbf{f}_{i}$ as defined in Eq. (6). Here, the index $k$ denotes an occupied orbital other than $i$ and $j$. For blocks in the feature vector that include more than one element, we specify a canonical order of the feature vector elements. In our previous work, 18 this order was given by the sum of the Euclidean distances between the centroids of orbitals $\tilde{i}$ and $p$ and between the centroids of orbitals $\tilde{j}$ and $p$. In the current work, we introduce a different strategy to sort the feature vector elements (Section II A), we modify the protocol with which we obtain the feature vector elements associated with $\tilde{i}$, $\tilde{j}$ (Section II B), and we revise our feature vector elements to ensure size consistency (Section II C). We provide a conceptual description of the changes to the feature set below, and we give the full definition of the feature vector elements and the criteria according to which the feature elements are ordered in Tables S3-S6 in the Supporting Information.
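The decomposition into diagonal and off-diagonal pair contributions can be illustrated with a minimal Python sketch. The function and type names are hypothetical and not taken from any MOB-ML code base, and the two models are stand-ins for whatever regressors are trained on the diagonal and off-diagonal feature vectors. The caller is assumed to supply one feature vector per pair in the same pair convention that is used in the sum over pair energies.

```python
from typing import Callable, Iterable, Sequence

Features = Sequence[float]
PairModel = Callable[[Features], float]  # maps a feature vector to a pair energy


def total_correlation_energy(
    diagonal_features: Iterable[Features],
    offdiagonal_features: Iterable[Features],
    diagonal_model: PairModel,
    offdiagonal_model: PairModel,
) -> float:
    """Sum predicted pair correlation energies into a total correlation energy.

    Hypothetical helper: diagonal (i = j) and off-diagonal (i != j) pairs are
    predicted by separate models, and the predictions are simply added up.
    """
    e_diag = sum(diagonal_model(f) for f in diagonal_features)
    e_off = sum(offdiagonal_model(f) for f in offdiagonal_features)
    return e_diag + e_off


# Toy stand-in models (assumptions for illustration only).
def toy_diag(f: Features) -> float:
    return -sum(f)


def toy_off(f: Features) -> float:
    return -0.5 * sum(f)


e_corr = total_correlation_energy([[1.0], [2.0]], [[0.5, 0.5]], toy_diag, toy_off)
```

Because the total is a plain sum over pairs, a model trained on pairs from small molecules can be applied unchanged to a larger molecule that simply contributes more terms.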
A. Defining importance of feature vector elements
Careful ordering of the elements of the feature vector blocks is necessary in the current work because Gaussian process regression (GPR) is sensitive to permutation of the feature vector elements. Furthermore, the application of a Gaussian process requires that the feature vectors be of fixed length. 33 Given the near-sighted nature of dynamical electron correlation, it is expected that only a limited number of orbital-pair interactions are important to predict the pair correlation energy with MOB-ML. To construct the fixed-length feature vector, a cutoff criterion must be introduced. 15 For some feature vector elements, a robust definition of importance is straightforward. The spatial distance between the centroids of orbitals $i$ and $a$ is, for example, a reliable proxy for the importance of the feature vector elements $\{[\kappa_{ii}]_{aa}\}$ of the feature vector $\mathbf{f}_{i}$. However, the definition of importance is less straightforward for feature vector elements that involve more than two indices. The most prominent example is the $\{[\kappa_{ab}]_{ab}\}$ feature vector block of $\mathbf{f}_{ij}$, which contains the exchange integrals between the valence-virtual orbitals $a$ and $b$ and which should be sorted with respect to the importance of these integrals for the prediction of the pair correlation energy $\varepsilon_{ij}$. It is non-trivial to define a spatial metric which defines the importance of the feature vector elements $\{[\kappa_{ab}]_{ab}\}$ to predict the pair correlation energy $\varepsilon_{ij}$; instead, we employ the MP3 approximation for the pair correlation energy [Eq. (7)], where $t_{ij}^{ab}$ denotes the T-amplitude. Although we operate in a local molecular orbital basis, the canonical formulae are used to define the importance criterion; if we consider orbital localization as a perturbation (as in Kapuy-Møller-Plesset theory 34 ), the canonical expression is the leading-order term. The term we seek to attach an importance to, $\{[\kappa_{ab}]_{ab}\}$, appears in the first term of Eq. (7), and all integrals necessary to compute this term are readily available as (a combination of) other feature elements, i.e., we do not incur any significant additional computational cost to obtain the importance of the feature vector elements.
The way in which we determine the importance of the $\{[\kappa_{ab}]_{ab}\}$ elements here is an example of a more general strategy that we employ, in which the importance is assigned according to the lowest order of perturbation theory in which the features first appear. Similar considerations have to be made for each feature vector block, all of which are specified in detail in Tables S3 and S4 in the Supporting Information.
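The combination of importance-based ordering and a cutoff that yields fixed-length blocks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_block` is hypothetical, the importance values are taken as given (for example, from an orbital distance or a perturbation-theory estimate), and padding short blocks with zeros is an assumption made here rather than a detail confirmed by the text.

```python
from typing import List, Sequence


def build_block(
    values: Sequence[float],
    importance: Sequence[float],
    length: int,
) -> List[float]:
    """Return a fixed-length feature block ordered by decreasing importance.

    Elements beyond `length` are cut off; blocks with fewer than `length`
    elements are padded with zeros (an assumption made for this sketch).
    """
    if len(values) != len(importance):
        raise ValueError("values and importance must have the same length")
    # Indices of the elements, most important first.
    order = sorted(range(len(values)), key=lambda k: importance[k], reverse=True)
    block = [values[k] for k in order[:length]]
    block.extend([0.0] * (length - len(block)))
    return block


# The second element is the most important one, so it comes first.
truncated = build_block([0.1, 0.5, 0.3], [1.0, 3.0, 2.0], length=2)
padded = build_block([0.1], [1.0], length=3)
```

Sorting by importance gives every position in the block a consistent meaning across molecules, which matters because the regression is not invariant to permutations of the feature elements.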

B. Orbital-index permutation invariance
The Fock, Coulomb, and exchange matrix elements that comprise MOB features are naturally invariant to rotation and translation of the molecule. However, some care is needed to ensure that these invariances are not lost in the construction of symmetrized MOB features. In particular, rotating the valence-occupied orbitals into gerade and ungerade combinations leads to an orbital-index permutation variance for energetically degenerate orbitals $i$, $j$ because the sign of the feature vector elements $M_{\tilde{j}p}$ depends on the arbitrary assignment of the indices $i$ and $j$. To rectify this issue, we include the absolute value of the generic feature vector element $M_{\tilde{j}p}$ in the feature vector instead of the signed value, where $M_{\tilde{j}p}$ may be $F_{\tilde{j}p}$, $[\kappa_{\tilde{j}\tilde{j}}]_{pp}$, or $[\kappa_{\tilde{j}p}]_{\tilde{j}p}$. The corresponding expression for $M_{\tilde{i}p}$ is already orbital-index permutation invariant because we chose $M_{pq}$ ($p \neq q$) to be positive. 18
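The sign problem and its remedy can be made concrete with a small Python sketch for a matrix element that is linear in the occupied orbital (such as a Fock-type element). The sketch assumes the rotated orbitals are the normalized sum and difference of orbitals i and j; this form and the function name are assumptions for illustration and are not quoted from the text.

```python
import math
from typing import Tuple


def rotated_elements(m_ip: float, m_jp: float, use_abs: bool = True) -> Tuple[float, float]:
    """Elements of the gerade and ungerade combinations of orbitals i and j.

    Assumes i~ = (i + j)/sqrt(2) and j~ = (i - j)/sqrt(2) and a matrix element
    that is linear in the occupied orbital (illustrative assumption).
    """
    m_gerade = (m_ip + m_jp) / math.sqrt(2.0)
    m_ungerade = (m_ip - m_jp) / math.sqrt(2.0)
    if use_abs:
        # Taking the absolute value removes the dependence on index assignment.
        m_ungerade = abs(m_ungerade)
    return m_gerade, m_ungerade


# Swapping the labels i and j flips the sign of the ungerade element ...
signed_ij = rotated_elements(0.3, 0.1, use_abs=False)
signed_ji = rotated_elements(0.1, 0.3, use_abs=False)
# ... while the absolute value gives the same feature for both assignments.
invariant_ij = rotated_elements(0.3, 0.1)
invariant_ji = rotated_elements(0.1, 0.3)
```

The gerade element is symmetric in i and j by construction, so only the ungerade element needs the absolute value.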

C. Size consistency
Size consistency is the formal property by which the sum of the energies of two isolated molecules equals the energy of their dimer at infinite separation. 35,36 In the context of MOB-ML, satisfaction of this property requires that the contributions from the diagonal feature vectors are not affected by distant, non-interacting molecules and that the contributions from the off-diagonal feature vectors vanish for orbitals localized on different, non-interacting molecules. To ensure that MOB-ML exhibits size consistency without the need for explicit training on the dimeric species, the following modifications to the feature vectors are made.
a. Diagonal feature vector. The feature vector as defined in Eq. (6) contains three blocks whose elements are independent of orbital $i$, namely the blocks built solely from valence-virtual orbitals, $\{F_{ab}\}$, $\{[\kappa_{aa}]_{bb}\}$, and $\{[\kappa_{ab}]_{ab}\}$. The magnitude of these feature vector elements does not decay with an increasing distance between orbital $i$ localized on molecule $I$ and an orbital (for example, $a$) localized on molecule $J$. To address this issue, we multiply these feature vector elements by their estimated importance (see Section II A) so that they decay smoothly to zero. The other feature vector elements decay to zero when the involved orbitals are non-interacting, albeit at different rates; we take the cube of feature vector elements of the type $\{[\kappa_{pp}]_{qq}\}$ to achieve a similar decay rate for all feature vector elements in the short- to medium-range, which facilitates machine learning.
b. Off-diagonal feature vector. We modify the off-diagonal feature vector such that $\mathbf{f}_{ij} = \mathbf{0}$ for $r_{ij} = \infty$ by first applying the changes newly introduced for $\mathbf{f}_{i}$ to $\mathbf{f}_{ij}$ as well. Further action is needed for the off-diagonal case because many feature vector elements do not decay to zero when the distance between $i$ and $j$ is large, owing to the rotation of the orbitals into a gerade and an ungerade combination. As a remedy, we apply a damping function of the form $\frac{1}{1 + \frac{1}{6}(r_{ij}/r_0)^6}$ to each feature vector element. The form of this damping function is inspired by the semi-classical limit of the MP2 expression, as it is also used for semi-classical dispersion corrections. 37 The damping radius, $r_0$, needs to be sufficiently large so as not to interfere with machine learning at small $r_{ij}$. If a damping radius close to zero were chosen, all off-diagonal feature vectors would be zero, which would nullify the information content; however, the damping radius $r_0$ should also not be too large, because size consistency has to be learned by the model up to the distance at which the off-diagonal feature vector is fully damped to zero. Therefore, we employ a damping radius in the intermediate-distance regime, and we empirically found $r_0 = 5.0$ Bohr to work well.
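The damping function quoted above is simple enough to write down directly. The following Python sketch evaluates it and applies it to every element of an off-diagonal feature vector; the function names are hypothetical, and the default radius of 5.0 Bohr is the empirical value stated in the text.

```python
import math
from typing import List, Sequence


def damping(r_ij: float, r0: float = 5.0) -> float:
    """Damping factor 1 / (1 + (r_ij / r0)**6 / 6), with distances in Bohr."""
    return 1.0 / (1.0 + (r_ij / r0) ** 6 / 6.0)


def damp_features(features: Sequence[float], r_ij: float, r0: float = 5.0) -> List[float]:
    """Scale every element of an off-diagonal feature vector by the damping factor."""
    factor = damping(r_ij, r0)
    return [factor * x for x in features]


# Short distances are left almost untouched; large distances are damped to zero.
near = damping(0.0)
at_radius = damping(5.0)
far = damping(50.0)
separated = damp_features([0.4, -0.2], r_ij=math.inf)
```

At the damping radius the factor equals 6/7, so the features are still largely intact there, while the sixth-power decay drives them to zero quickly at larger separations.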
Lastly, we enforce that $\varepsilon^{\mathrm{ML}}(\mathbf{0}) = 0$. The MOB features are engineered to respect this limit and would, for example, in a linear regression with a zero intercept trivially predict a zero-valued pair correlation energy without any additional training. However, the Gaussian process regression we apply in this work does not trivially yield a zero-valued pair correlation energy for a zero-valued feature vector. In the case that a training set does not include examples of zero-valued feature vectors, we need to include zero-valued feature vectors and zero-valued pair correlation energies in training to ensure that $\varepsilon^{\mathrm{ML}}(\mathbf{0}) = 0$. No model trained in the current study included more than 5% zero-valued feature vectors.
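Why a Gaussian process needs an explicit zero-valued training example can be seen in a toy one-dimensional regression. The sketch below is a self-contained illustration in plain Python, not the production setup: it uses a Matérn 5/2 kernel with fixed, arbitrarily chosen hyperparameters, a zero prior mean, and made-up training data, all of which are assumptions for the example.

```python
import math
from typing import List, Sequence


def matern52(x: float, y: float, length: float = 1.0) -> float:
    """Matern 5/2 kernel with unit variance (hyperparameters fixed by assumption)."""
    s = math.sqrt(5.0) * abs(x - y) / length
    return (1.0 + s + s * s / 3.0) * math.exp(-s)


def solve(a: List[List[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def gp_mean(x_train: Sequence[float], y_train: Sequence[float],
            x_new: float, noise: float = 1e-8) -> float:
    """Posterior mean of a zero-mean GP at x_new."""
    k = [[matern52(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(x_train)] for i, a in enumerate(x_train)]
    alpha = solve(k, y_train)
    return sum(w * matern52(x_new, a) for w, a in zip(alpha, x_train))


# Made-up 1D data: "pair energies" that vanish linearly with the feature.
x_train = [0.5, 1.0, 1.5]
y_train = [-1.0, -2.0, -3.0]

without_anchor = gp_mean(x_train, y_train, 0.0)
with_anchor = gp_mean([0.0] + x_train, [0.0] + y_train, 0.0)
```

In this toy setting the prediction at the zero feature is clearly nonzero unless the zero-valued example is part of the training data, which mirrors the reason given above for adding zero-valued feature vectors to the training set.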
The resulting MOB-ML model leads to size consistent energy predictions to the degree to which the underlying MO generation is. It is not required that the dimer is explicitly part of training the MOB-ML model to obtain this result. The detailed definition of each feature vector block is summarized in Tables S5 and S6. We apply the feature set defined in Tables S5 and S6 consistently in this work.

III. COMPUTATIONAL DETAILS
We present results for five different data sets: (i) a series of alkane molecules, (ii) the potential energy surface of the malonaldehyde molecule, (iii) a thermalized version of the QM7b and the GDB13 data set (i.e., QM7b-T and GDB13-T), 38 (iv) a set of backbone-backbone interactions (BBI), 39 and (v) a thermalized version of a subset of mononuclear, octahedral transition-metal complexes put forward by Kulik and co-workers. 40 We refer to Section II of the Supporting Information for a description of how the structures were obtained or generated. All generated structures are available in Ref. 41.
The features for all structures were generated with the ENTOS QCORE 42 package. The feature generation is based on a HF calculation applying a cc-pVTZ 43 basis for the elements H, C, N, O, S, and Cl. We apply a def2-TZVP basis set 44 for all transition metals. The HF calculations were accelerated with density fitting, for which we applied the corresponding cc-pVTZ-JKFIT 45 and def2-TZVP-JKFIT 46 density fitting bases. Subsequently, we localized the valence-occupied and the valence-virtual molecular orbitals with the Boys-Foster localization scheme 47,48 or with the intrinsic bond orbital (IBO) localization scheme. 49 We implemented a scheme to localize the valence-virtual orbitals with respect to the Boys-Foster function (for details on this implementation, see Section II in the Supporting Information). We applied the Boys-Foster localization scheme for the data sets (i), (iii), (iv), and (v) for valence-occupied and valence-virtual molecular orbitals. IBO localization for valence-occupied and valence-virtual molecular orbitals led to better results for data set (ii).
The resulting orbitals are imported into the Molpro 2018.0 50,51 package via the matrop functionality to generate the non-canonical MP2 52 or CCSD(T) 53-55 pair correlation energies with the same orbitals we applied for the feature generation. These calculations are accelerated with the resolution of the identity approximation. The frozen-core approximation is invoked for all correlated calculations.
We follow the machine learning protocol outlined in previous work 18 to train the MOB-ML models. In a first step, we perform MOB feature selection by evaluating the mean decrease of accuracy in a random forest regression in the SCIKIT-LEARN v0.22.0 package. 56 We then regress the diagonal and off-diagonal pair correlation energies separately with respect to the selected features in the GPY 1.9.6 software package. 57 We employ the Matérn 5/2 kernel with white noise regularization. 33 We minimize the negative log marginal likelihood objective with respect to the kernel hyperparameters with a scaled conjugate gradient scheme for 100 steps and then apply the BFGS algorithm until full convergence. As indicated in the results, both random-sampling and active-learning strategies were employed for the selection of molecules in the training data sets. In the active-learning strategy, we use a previously trained MOB-ML model to evaluate the Gaussian process variance for each molecule, and then include the points with the highest variance in the training data set, as outlined in Ref. 58. To estimate the Gaussian process variance for each molecule, it was assumed that the variances per molecular orbital pair are mutually independent.
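The selection step of the active-learning strategy can be sketched as follows. Under the stated independence assumption, the variance of a molecular prediction is the sum of its pair variances, and the molecules with the largest summed variance are selected next. The function names and the dictionary layout are hypothetical, and the pair variances are assumed to come from an already trained Gaussian process.

```python
from typing import Dict, List, Sequence


def molecule_variance(pair_variances: Sequence[float]) -> float:
    """Variance of a molecular energy prediction.

    Assumes the pair variances are mutually independent, so they simply add.
    """
    return sum(pair_variances)


def select_for_training(
    candidates: Dict[str, Sequence[float]],
    n_select: int,
) -> List[str]:
    """Return the names of the n_select molecules with the largest variance."""
    ranked = sorted(
        candidates,
        key=lambda name: molecule_variance(candidates[name]),
        reverse=True,
    )
    return ranked[:n_select]


# Made-up pair variances for three candidate molecules.
pool = {
    "mol_a": [0.1, 0.2],
    "mol_b": [0.5],
    "mol_c": [0.05, 0.05, 0.05],
}
chosen = select_for_training(pool, n_select=2)
```

Summing variances means that larger molecules with many moderately uncertain pairs can outrank small molecules with a single uncertain pair, which is a consequence of the independence assumption rather than a tuned choice.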

A. Transferability within a molecular family
We first examine the effect of the feature vector generation strategy on the transferability of MOB-ML models within a molecular family. To this end, we revisit our alkane data set, 18 which contains 1000 ethane and 1000 propane geometries as well as 100 butane and 100 isobutane geometries. We perform the transferability test outlined in Ref. 18, i.e., training a MOB-ML model on correlation energies for 50 randomly chosen ethane geometries and 20 randomly chosen propane geometries to predict the correlation energies for the 100 butane and 100 isobutane geometries (see Figure 1). This transferability test was repeated with 10000 different training data sets (each consisting of data for 50 ethane molecules and 20 propane molecules) to assess the training set dependence of the MOB-ML models. As suggested in Ref. 25, we consider various performance metrics to assess the prediction accuracy of the MOB-ML models: (i) the mean error (ME, Eq. (S3)), (ii) the mean absolute error (MAE, Eq. (S4)), (iii) the maximum absolute error (MaxAE, Eq. (S5)), and (iv) the mean absolute relative error (MARE, Eq. (S6)), which applies a global shift setting the mean error to zero. We report the minimum, peak, and maximum encountered MAREs in Table I alongside literature values obtained in our previous work, 18 by Dick et al., 27 and by Chen et al. 25 The MEs, MAEs, and MaxAEs are reported in Figure S1.
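The four error metrics can be computed as in the following Python sketch. Because the defining equations (S3)-(S6) are not reproduced here, the formulas are an interpretation of the verbal description: in particular, the MARE is taken to be the mean absolute error after subtracting the mean error from every individual error. The function name and the example numbers are hypothetical.

```python
from typing import Dict, Sequence


def error_metrics(predicted: Sequence[float], reference: Sequence[float]) -> Dict[str, float]:
    """ME, MAE, MaxAE, and MARE for a set of energy predictions (e.g., in kcal/mol).

    MARE is interpreted here as the MAE after a global shift that sets the
    mean error to zero (an assumption based on the verbal description).
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [p - r for p, r in zip(predicted, reference)]
    n = len(errors)
    me = sum(errors) / n
    return {
        "ME": me,
        "MAE": sum(abs(e) for e in errors) / n,
        "MaxAE": max(abs(e) for e in errors),
        "MARE": sum(abs(e - me) for e in errors) / n,
    }


# Made-up predictions with a systematic shift relative to the reference.
metrics = error_metrics([1.0, 2.0, 4.0], [0.0, 1.0, 2.0])
```

A large ME combined with a small MARE, as in this toy example, indicates that the predictions are shifted almost uniformly, which leaves relative energies largely unaffected.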
In general, MOB-ML as well as NeuralXC 27 and DeepHF 25 produce MAREs well below chemical accuracy for correlation energies of butane and isobutane when trained on correlation energies of ethane and propane. Updating the feature vector generation strategy for MOB-ML results in the best peak MAREs for butane as well as for isobutane, which are 0.11 kcal/mol and 0.10 kcal/mol, respectively. As in our previous work, 18 we note that the total correlation energy predictions may be shifted with respect to the reference data so that the MEs for MOB-ML range from −0.92 to 2.70 kcal/mol for butane and from −0.18 to 1.02 kcal/mol for isobutane (see also Figure S1). This shift is strongly training-set dependent, which was also observed for results obtained with DeepHF. 25 The results highlight that this is an extrapolative transferability test. A considerable advantage of applying GPR in practice is that each prediction is accompanied by a Gaussian process variance which, in this case, indicates that we are in an extrapolative regime (see Figure 1). Extrapolations might be associated with quality degradation, which we see, most prominently, for the mean error in butane. By contrast, other machine learning approaches like neural networks are less clear in terms of whether the predictions are in an interpolative or extrapolative regime. 59 By including the butane molecule with the largest variance in the training set (which then consists of 50 ethane, 20 propane, and 1 butane geometries), we reduce the ME from 0.78 to 0.25, the MAE from 0.78 to 0.26, the MaxAE from 1.11 to 0.51, and the MARE from 0.11 to 0.09 kcal/mol for butane (see Figure S2). These results directly illustrate that MOB-ML can be systematically improved by including training data that is more similar to the test data; the improved confidence of the prediction is then also directly reflected in the associated Gaussian process variances.
As a second example, we examine the transferability of a MOB-ML model trained within a basin of a potential energy surface to the transition-state region of the same potential energy surface. We chose malonaldehyde for this case study as it has also been explored in previous machine learning studies. 6 We train a MOB-ML model on 50 thermalized malonaldehyde structures which all have the property that $d(\mathrm{O}_1\text{-}\mathrm{H}) + d(\mathrm{O}_2\text{-}\mathrm{H}) > 0.4$ Å (where $d$ denotes the distance between the two nuclei), which ensures that we are sampling from the basins. We then apply this trained model to predict the correlation energies for a potential energy surface mapping out the hydrogen transfer between the two oxygen atoms (see Figure 2). MOB-ML produces an accurate potential energy surface for the hydrogen transfer in malonaldehyde only from information about the basins. The highest errors are encountered in the high-energy regime, and this region is also associated with the highest Gaussian process variance, indicating low confidence in the predictions (compare middle right and right panel of Figure 2). The Gaussian process variance reflects the range of structures the MOB-ML model has been trained on and highlights again that we did not include transition-state-like structures in the training.

B. Transferability across organic chemistry space
The Chemical Space Project 60 computationally enumerated all possible organic molecules up to a certain number of atoms, resulting in the GDB databases. 61 In this work, we examine thermalized subsets 18 of the GDB13 data set 61 to investigate the transferability of MOB-ML models across organic chemistry space. The application of thermalized sets of molecules has the advantage that we can study the transferability of our models for chemical and conformational degrees of freedom at the same time. To test the transferability of MOB-ML across chemical space, we train our models on a thermalized set of seven and fewer heavy-atom molecules (also known as QM7b-T 18 ) and then we test the prediction accuracy on a QM7b-T test set and on a thermalized set of molecules with thirteen heavy atoms (GDB13-T; 18 see also Section V in the Supporting Information), as also outlined in our previous work. 18,19 We first investigate the effect of changing the feature vector generation protocol on the QM7b-T→QM7b-T prediction task (see Figure 3). In Ref. 18, training on 180 structures was necessary to achieve a model with an MAE below 1 kcal/mol. The FHCL method yields an MAE below 1 kcal/mol when training on about 800 structures, 26 and the DeepHF method already exhibits an MAE below 1 kcal/mol when training on their smallest chosen training set, which consists of 300 structures (MAE=0.79 kcal/mol). 25 The refinements in the current work reduce the number of training structures required to reach chemical accuracy to about 100 structures when sampling randomly. This number is, however, strongly training-set dependent. We can remove the training-set dependence by switching to an active learning strategy with which we reliably achieve an MAE below 1 kcal/mol with about 70 structures. In general, the MAE obtained with the active learning strategy is comparable to the smallest MAEs obtained with random sampling strategies. This has the advantage that a small amount of reference data can be generated in a targeted manner.
In general, our aim is to obtain a machine learning model which reliably predicts broad swathes of chemical space. For an ML model to be of practical use, it has to be able to describe out-of-set molecules of different sizes to a similar accuracy when accuracy is measured size-intensively. 36 We probe the ability of MOB-ML to describe out-of-set molecules with a different number of electron pairs by applying a model trained on correlation energies for QM7b-T molecules to predict correlation energies for GDB13-T. We collect the best results published for this transfer test in the literature in Figure 4. Our previous best single GPR model achieved an MAE of 2.27 kcal/mol when trained on 220 randomly chosen structures. 18 The modifications in the current work now yield a single GPR model which achieves an MAE of 1.47-1.62 kcal/mol for GDB13-T when trained on 220 randomly chosen QM7b-T structures. Strikingly, MOB-ML outperforms machine learning models trained on thousands of molecules like our RCR/GPR model and FHCL18. 26 The current MOB-ML results are of an accuracy that is similar to the best reported results from DeepHF (an MAE of 1.49 kcal/mol); 25 however, MOB-ML only needs to be trained on about 3% of the molecules in the QM7b data set while DeepHF is trained on 42% to obtain comparable results (MAE of 1.52 kcal/mol for 3000 training structures). The best reported result for DeepHF (MAE of 1.49 kcal/mol) was obtained by training on 97% of the molecules of the QM7b data set. We attribute the excellent transferability of MOB-ML to the fact that it focuses on the prediction of orbital-pair contributions, thereby reframing an extrapolation problem into an interpolation problem when training machine learning models on small molecules and testing them on large molecules. 
The pair correlation energies predicted for QM7b-T and for GDB13-T span a very similar range (0 to −20 kcal/mol), and they are predicted with a similar Gaussian process variance (see Figure S5), which we would expect in an interpolation task. The final errors for GDB13-T are larger than for QM7b-T because the total correlation energy is size-extensive; however, the size-intensive error per electron pair spans a comparable range for QM7b-T and for GDB13-T (see Figure S4). This presents a significant advantage of MOB-ML over machine learning models which rely on a whole-molecule representation and creates the opportunity to study molecules of a size that is beyond the reach of accurate correlated wave function methods.

FIG. 4. Comparison of the prediction mean absolute errors of total correlation energies for GDB13-T molecules as a function of the number of QM7b-T molecules chosen for model training for different machine learning models: MOB-ML as outlined in this work with random sampling (green circles), MOB-ML with a single GPR 18 (orange circles), MOB-ML with RCR/GPR 19 (brown circles), DeepHF 25 (red squares), FHCL18 26 (purple squares). The green shaded area corresponds to the 90% confidence interval for the predictions obtained from 50 random samples of the training data.
Most studies in computational chemistry require accurate relative energies rather than accurate total energies. Therefore, we also assess the errors in the relative energies for the sets of conformers for each molecule in the QM7b-T and in the GDB13-T data sets obtained with MOB-ML with respect to the reference energies (see Figure 5). We emphasize that MOB-ML is not explicitly trained to predict conformer energies, and we include at most one conformer for each molecule in the training set. Nevertheless, MOB-ML produces on average chemically accurate relative conformer energies for QM7b-T when trained on correlation energies for only 30 randomly chosen molecules (or 0.4% of the molecules) in the QM7b set. We obtain chemically accurate relative energies for the GDB13-T data set when training on about 100 QM7b-T molecules. The prediction accuracy improves steadily when training on more QM7b-T molecules reaching a mean MAE of 0.43 kcal/mol for the relative energies of the rest of the QM7b-T set and of 0.77 kcal/mol for the GDB13-T set.
We now present the first reported test of MOB-ML for non-covalent interactions in large molecules. To this end, we examine the backbone-backbone interaction (BBI) data set, 39 which was designed to benchmark methods for the prediction of interaction energies encountered within protein fragments. Using the implementation of MOB-ML described here and using only 20 randomly selected QM7b-T molecules for training, the method achieves a mean absolute error of 0.98 kcal/mol for the BBI data set (see Figure 6). However, these predictions are uncertain, as indicated by the large Gaussian process variances associated with these data points, which strongly suggest that we are now, as expected, in an extrapolative regime. We further improve the predictive capability of MOB-ML by augmenting the MOB-ML model with data from the BBI set. Specifically, we can draw on an active learning strategy and consecutively include data points until all uncertainties are below 1 kcal/mol, which in this case corresponds to only two data points. This reduces the MAE to 0.28 kcal/mol for the remaining 98 data points in the BBI set. Including more reference data points would further improve the performance for this specific data set. However, this is not the focus of this work. Instead, we simply emphasize that MOB-ML is a clearly extensible strategy to accurately predict energies for large molecules and non-covalent intermolecular interactions while providing a useful estimation of confidence.

FIG. 5. Prediction mean absolute errors for relative correlation energies as a function of the number of QM7b-T molecules chosen for model training for QM7b-T (blue circles) and for GDB13-T (orange crosses). The blue and orange shaded areas correspond to the 90% confidence interval for the predictions obtained from 50 random samples of the training data. The gray shaded area corresponds to the region where the error is smaller than chemical accuracy (1 kcal/mol).
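The procedure of consecutively adding data points until all uncertainties fall below a threshold can be written as a short loop. The sketch below is generic: `train_fn` and `uncertainty_fn` are hypothetical callbacks standing in for model training and for the predicted uncertainty of a candidate, and the toy example replaces both with a nearest-neighbour distance on a line purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Point = TypeVar("Point")
Model = TypeVar("Model")


def active_learning(
    pool: Sequence[Point],
    initial: Sequence[Point],
    train_fn: Callable[[List[Point]], Model],
    uncertainty_fn: Callable[[Model, Point], float],
    threshold: float,
) -> Tuple[List[Point], Model]:
    """Add the most uncertain point until every uncertainty is below threshold."""
    training = list(initial)
    remaining = [p for p in pool if p not in training]
    model = train_fn(training)
    while remaining:
        scores = [uncertainty_fn(model, p) for p in remaining]
        worst = max(range(len(remaining)), key=scores.__getitem__)
        if scores[worst] < threshold:
            break  # all remaining points are predicted confidently enough
        training.append(remaining.pop(worst))
        model = train_fn(training)  # retrain with the augmented training set
    return training, model


# Toy stand-ins: the "model" is the training set itself, and the uncertainty
# of a point is its distance to the nearest training point.
def toy_train(points: List[float]) -> List[float]:
    return list(points)


def toy_uncertainty(model: List[float], x: float) -> float:
    return min(abs(x - t) for t in model)


selected, final_model = active_learning(
    pool=[0.0, 0.5, 3.0, 3.2, 6.0],
    initial=[0.0],
    train_fn=toy_train,
    uncertainty_fn=toy_uncertainty,
    threshold=1.0,
)
```

Retraining after every added point keeps the uncertainty estimates current, so points that become well described by an earlier addition are not selected unnecessarily.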

C. Transition-metal complexes
We finally present the first application of MOB-ML to transition-metal complexes. To this end, we train a MOB-ML model on a thermalized subset of mononuclear, octahedral transition-metal complexes introduced by Kulik and co-workers, 40 which we denote as TM-T. The chosen closed-shell transition-metal complexes feature different transition metals (Fe, Co, Ni) and ligands. The ligands span the spectrochemical series from weak-field (e.g., thiocyanate) to strong-field (e.g., carbonyl) ligands. We see in Figure 7 that the learning behaviour for TM-T and QM7b-T is similar when the error is measured per valence-occupied orbital. These results demonstrate that the MOB-ML formalism can be straightforwardly applied outside of the organic chemistry universe without additional modifications. It is particularly notable that the learning efficiency for TM-T is comparable to that seen for the relatively simple organic molecules in QM7b-T (Fig. 7). We note that whereas MP2 theory is not expected to be fully quantitative for transition-metal complexes, 62,63 it provides a demonstration of the learning efficiency of MOB-ML for transition-metal complexes in the current example; and as previously demonstrated, MOB-ML learns other correlated wave function methods with similar efficiency. 15,18

FIG. 6. Top panel: Errors in predictions made with a MOB-ML model trained on 20 randomly selected QM7b-T molecules with FS 3 with respect to reference MP2/cc-pVTZ interaction energies for the BBI data set. Bottom panel: Errors in predictions made with a MOB-ML model trained on 20 randomly selected QM7b-T molecules and augmented with the 2 BBI data points with the largest variance (orange circles) with respect to reference MP2/cc-pVTZ interaction energies. The bar attached to each prediction error indicates the associated Gaussian process variance. The gray shaded area corresponds to the region where the error is smaller than chemical accuracy (1 kcal/mol).

V. CONCLUSIONS
Molecular-orbital-based machine learning (MOB-ML) provides a general framework to learn correlation energies at the cost of molecular orbital generation. In this work, we demonstrate that preservation of physical symmetries and constraints leads to machine-learning methods with greater learning efficiency and transferability. Exploiting physical principles like size consistency and energy invariances not only leads to a conceptually more satisfying method, but it also leads to substantial improvements in prediction errors for different data sets covering total and relative energies for thermally accessible organic and transition-metal containing molecules, non-covalent interactions, and transition-state energies. With the modifications presented in the current work, MOB-ML is shown to be highly data efficient, which is important due to the high computational cost of generating reference correlation energies. Only 1% of the QM7b-T data set (containing organic molecules with seven and fewer heavy atoms) needs to be drawn on to train a MOB-ML model which produces on average chemically accurate total energies for the remaining 99% of the data set.
Without ever being trained to predict relative energies, MOB-ML provides chemically accurate relative energies for QM7b-T when training on only 0.4% of the QM7b-T molecules. Furthermore, we have demonstrated that MOB-ML is not restricted to the organic chemistry space and that we are able to apply our framework out of the box to describe a diverse set of transition-metal complexes when training on correlation energies for tens of molecules.
Beyond data efficiency, MOB-ML models are shown to be very transferable across chemical space. We demonstrate this transferability by training a MOB-ML model on QM7b-T and predicting energies for a set of molecules with thirteen heavy atoms (GDB13-T). We obtain the best result for GDB13-T reported to date despite only training on 3% of QM7b-T. The successful transferability of MOB-ML is shown to result from its recasting of a typical extrapolation task (i.e., larger molecules) into an interpolation task (i.e., by predicting on the basis of size-intensive orbital-pair contributions). Even when MOB-ML enters an extrapolative regime as identified by a large Gaussian process variance, accurate results can be obtained; for example, we predict the transition-state energy for the proton transfer in malonaldehyde and interaction energies in the protein backbone-backbone interaction data set to chemical accuracy without training on transition-state-like data or non-covalent interactions, respectively. In these cases, the uncertainty estimates also offer a clear avenue for active learning strategies which can further improve the model performance. Active learning offers an attractive way to further reduce the number of expensive reference calculations by picking the most informative molecules to be included in the training set. This provides a general recipe for evolving a MOB-ML model to describe new regions of chemical space with minimal effort.
Future work will focus on the expansion of MOB-ML to cover more of chemical space. Particular areas of focus include open-shell systems and electronically excited states. Physical insight from exact conditions in electronic structure theory 64 will continue to guide the development of the method, with the aim of providing a machine-learning approach for energies and properties of arbitrary molecules with controlled error.

SUPPORTING INFORMATION
Details on feature generation for all data sets used in this work, definitions of error metrics, expanded results for the alkane transferability test, and expanded results for transferability within the organic chemistry space. Features and labels for all data sets used in this work.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available within the article and its supplementary material. Additional data that support the findings of this study are openly available in the Caltech Data Repository. 41