Analytical Gradients for Molecular-Orbital-Based Machine Learning

Molecular-orbital-based machine learning (MOB-ML) enables the prediction of accurate correlation energies at the cost of obtaining molecular orbitals. Here, we present the derivation, implementation, and numerical demonstration of MOB-ML analytical nuclear gradients which are formulated in a general Lagrangian framework to enforce orthogonality, localization, and Brillouin constraints on the molecular orbitals. The MOB-ML gradient framework is general with respect to the regression technique (e.g., Gaussian process regression or neural networks) and the MOB feature design. We show that MOB-ML gradients are highly accurate compared to other ML methods on the ISO17 data set while only being trained on energies for hundreds of molecules compared to energies and gradients for hundreds of thousands of molecules for the other ML methods. The MOB-ML gradients are also shown to yield accurate optimized structures, at a computational cost for the gradient evaluation that is comparable to Hartree-Fock theory or hybrid DFT.


I. INTRODUCTION
Analytical nuclear gradients are the foundation of the quantum chemical elucidation of complex reaction mechanisms via molecular dynamics simulations and minimum-energy and transition-state structure optimization. However, the routine calculation of ab initio energies and forces with accurate wave function methods is prohibited by their steep cost; e.g., coupled-cluster singles, doubles, and perturbative triples [CCSD(T)] scales as N^7 and full configuration interaction scales as N!, where N is a measure of system size. In recent years, machine learning has opened up a new way of mitigating the cost of quantum chemical calculations. A particular approach that has proven to be highly data efficient and transferable across chemical space is the molecular-orbital-based machine learning (MOB-ML) method. [27][28][29] MOB-ML relies on information from local molecular orbitals to predict the pair-wise sum of a post-Hartree-Fock correlation energy at drastically reduced cost. [27][28][29] The gradient theory for MOB-ML is comparable to that of non-canonical wave-function-based correlation methods, due to factors that include orbital localization and the non-variational energy expression. There exist only a handful of local wave-function-based correlation methods for which this effort has been performed. 30,31 In this work, we establish a general Lagrangian framework to obtain the analytical nuclear gradients of the MOB-ML energy. The framework enforces orthogonality, localization, and Brillouin constraints on the molecular orbitals (section II B). A noteworthy aspect of this framework is that it is agnostic to the training data used for the MOB-ML model, thereby yielding accurate gradient predictions for wave function theory methods for which analytical gradients have not yet been derived or implemented.
Furthermore, the computational cost of evaluating the MOB-ML energy gradient is comparable to that of a Hartree-Fock (HF) gradient or a hybrid density functional theory (DFT) gradient, such that it is orders of magnitude faster than evaluating the gradients of ab initio wave function theories.
We numerically validate the MOB-ML gradient theory by comparison to energy finite differences in section IV. Furthermore, we show that using only training data based on energy calculations (not gradients), MOB-ML efficiently and accurately yields gradients for diverse sets of molecules (section IV). A comparison of MOB-ML to other ML methods on the ISO17 data set highlights the data efficiency and high transferability of MOB-ML for gradient predictions. We show that MOB-ML optimized structures for molecules in the ISO17 set are systematically improved with respect to the reference HF method, and we compare the performance to that of a standard DFT functional.

II. MOB-ML ANALYTICAL NUCLEAR GRADIENTS
A. MOB-ML Energy Theory

MOB-ML relies on molecular orbital information from a HF calculation to predict a wave function correlation energy. The working equation for the MOB-ML energy is [27][28][29]

E_MOB-ML = E_HF + E_corr[f],  (1)

where E_HF is the HF energy and E_corr[f] is the machine-learned correlation energy,

E_corr[f] = Σ_i ε_ii[f_i] + Σ_{i≠j} ε_ij[f_ij].  (2)

The matrix of feature vectors, f, is divided into two sub-classes. The first sub-class is made up of the diagonal components of f, f_i, which represent the valence-occupied orbital i. The second sub-class is made up of the off-diagonal components of f, f_ij, which represent the interaction between the valence-occupied orbitals i and j. Both diagonal and off-diagonal feature vectors are composed of elements from the HF Fock matrix in the MO basis, F, and the MO repulsion integrals, κ (Eqs. 3 and 4). Here, (µν|κσ) are the four-center atomic orbital integrals, with µ, ν, κ, and σ representing atomic orbital basis functions.

B. MOB-ML Gradient Theory

The MOB-ML analytical nuclear gradient is obtained as the total derivative of a Lagrangian, where q refers to a nuclear coordinate. The calculation of the nuclear response of the HF MOs, ∂C/∂q, is avoided because the Lagrangian L is made stationary with respect to all of its variational parameters, which are the MO coefficients, C. The MOB-ML energy Lagrangian is

L = E_MOB-ML + Σ_{mn} x_mn [C†SC − 1]_mn + Σ_{ai} z_ai F_ai + Σ_{ri} z^core_ri F_ri + Σ_{ij} z^val-occ_ij r_ij + Σ_{ab} z^val-vir_ab r_ab + Σ_{wa} λ_wa P_wa,  (5)

where x, z, z^core, z^val-occ, z^val-vir, and λ are the Lagrange multipliers. We refer to the core MOs with column indices r, s; to the valence-occupied localized MOs (LMOs) with column indices i, j, k, l; to the valence-virtual LMOs with column indices a, b; and to the non-valence-virtual MOs with column indices w, x. The indices m, n, p, q are used to index generic molecular orbitals. The first term on the right-hand side (RHS) of Eq. 5 is the MOB-ML energy described by Eq. 1. The second term on the RHS constrains the HF MOs, C, to be orthonormal; its gradient contribution is commonly referred to as the Pulay force. 33 The third term on the RHS enforces the Brillouin conditions, F_ai = 0, which account for the dependence of the correlation energy on the HF-optimized molecular orbitals.
The frozen-core conditions, F_ri = 0, account for neglecting the correlation energy contributions from the core orbitals. The localization conditions, r_ij = 0 and r_ab = 0, account for how the valence-occupied and valence-virtual MOs are localized, respectively. In this work, we employ Foster-Boys localization 34 and intrinsic bond orbital (IBO) localization, 35 but it is straightforward to generalize to other localization methods. The valence-virtual conditions, P_wa = 0, reflect how the valence-virtual MOs are obtained through a unitary transformation of the virtual MOs. This unitary transformation corresponds to the column space of a projection matrix formed by projecting the virtual MOs onto the IAOs; the complementary null space of this projection matrix corresponds to the non-valence-virtual orbitals. This projection matrix is defined in Eqs. 6 and 7, where C_vir is the virtual MO coefficient matrix. The matrix X_IAO transforms between the original AO and IAO basis sets and is expanded in Appendix B. Altogether, this yields the analytical nuclear gradient of Eq. 8, where the superscript (q) denotes the explicit derivative of the quantity with respect to a nuclear coordinate. Eq. 8 is the general MOB-ML analytical nuclear gradient, and we will now outline how to determine the Lagrange multipliers for our particular use case to arrive at a final working equation.
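The structure of the MOB-ML energy expression (Eq. 1) can be sketched as follows. This is a minimal schematic, not the production implementation: `eps_diag` and `eps_offdiag` are hypothetical stand-ins for trained regressors (e.g., the GPR models discussed below), and the feature vectors are assumed to be precomputed.

```python
import numpy as np

def mob_ml_energy(e_hf, diag_feats, offdiag_feats, eps_diag, eps_offdiag):
    """Schematic MOB-ML energy, E = E_HF + E_corr[f], where E_corr is a sum
    of machine-learned diagonal (i) and off-diagonal (i,j) pair contributions
    predicted from MO-based feature vectors."""
    e_corr = sum(eps_diag(f_i) for f_i in diag_feats)
    e_corr += sum(eps_offdiag(f_ij) for f_ij in offdiag_feats)
    return e_hf + e_corr
```

With trained models substituted for the toy callables, this assembly reproduces the additive pair structure of Eq. 1.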

Minimizing with respect to MO coefficients
All of the Lagrange multipliers (x, z, z^core, z^val-occ, z^val-vir, and λ) are determined by making the MOB-ML Lagrangian stationary with respect to its variational parameters, which are the MO coefficients, C. Differentiating the Lagrangian with respect to these parameters yields the stationary conditions beginning with Eq. 9; Eqs. 13 and 14 are expanded in Appendices A and B, respectively. Here, F is the HF Fock matrix, g includes all of the usual HF two-electron terms, R^n is expanded in Appendix A, the condition q ∈ loc restricts the sum to valence-occupied and valence-virtual MOs, z̄ = z + z†, z̄^core = z^core + (z^core)†, and D̄_F = D_F + D_F†. The matrices D_F, D_J, and D_K are calculated according to Eqs. 16 and 17, where M_pq refers to F_pq, [κ_pp]_qq, and [κ_pq]_pq, respectively; the matrix D_{R,n} is defined analogously in terms of R^n. The partial derivatives ∂ε_ii[f_i]/∂f_i and ∂ε_ij[f_ij]/∂f_ij on the RHS of Eqs. 16 and 17 are the derivatives of the machine learning prediction with respect to the feature vectors.
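The reason the Lagrangian avoids the response ∂C/∂q can be illustrated with a small toy model (this is an illustration of the general Z-vector trick, not the MOB-ML equations themselves): inner parameters p(q) are fixed by a stationarity condition g(p, q) = 0, the energy E(p, q) is non-variational in p, and a single multiplier solve replaces the parameter response dp/dq.

```python
import numpy as np

# Toy stand-ins for the fixed quantities of the problem (all assumed, for
# illustration only): A plays the role of the inner Hessian, B and M couple
# the parameters p to the "nuclear" variables q.
rng = np.random.default_rng(7)
n = 4
A = rng.normal(size=(n, n)); A = A @ A.T + n * np.eye(n)   # symmetric positive definite
B = rng.normal(size=(n, n))
M = rng.normal(size=(n, n))

def solve_p(q):
    """Inner stationarity condition g(p, q) = A p - B q = 0 (the analogue
    of the HF/localization conditions that determine the orbitals)."""
    return np.linalg.solve(A, B @ q)

def energy(p, q):
    """Stand-in for a non-variational energy E(p, q)."""
    return np.sum(p**3) + p @ M @ q

def lagrangian_gradient(q):
    """dE/dq via a Lagrange multiplier z: solve (dg/dp)^T z = -dE/dp
    (the Z-vector step), then dE/dq = dE/dq|_explicit + (dg/dq)^T z,
    with no dp/dq ever formed."""
    p = solve_p(q)
    dE_dp = 3.0 * p**2 + M @ q
    z = -np.linalg.solve(A.T, dE_dp)   # Z-vector equation
    return M.T @ p + (-B).T @ z        # dg/dq = -B here
```

A finite-difference check of the total energy E(p(q), q) confirms that this gradient is exact, mirroring the validation strategy of section IV.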
We emphasize that any machine learning method (e.g., Gaussian process regression, regression clustering, neural networks, etc.) can be readily used in this gradient framework without modification, given ∂ε_ii[f_i]/∂f_i and ∂ε_ij[f_ij]/∂f_ij. Furthermore, we note that the following analytical nuclear gradient derivation generalizes to any type of feature-vector design and construction, so long as the feature-vector elements are obtained from F and κ. The partial derivatives ∂f_i/∂M_pq, ∂f_ij/∂M_pq, and ∂f_ij/∂R^n_pq are expanded in the supplementary material.
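Because the framework only requires the regressor's derivative with respect to the features, any differentiable model can be plugged in. As one concrete instance (a minimal numpy sketch, not the GPy implementation used in this work), the derivative of a GPR predictive mean with a Matern 5/2 kernel has a closed form:

```python
import numpy as np

def matern52(r, ell=1.0, sig2=1.0):
    """Matern 5/2 kernel as a function of Euclidean distance r."""
    s = np.sqrt(5.0) * r / ell
    return sig2 * (1.0 + s + s**2 / 3.0) * np.exp(-s)

def gpr_mean_and_grad(fstar, F_train, alpha, ell=1.0, sig2=1.0):
    """Predictive mean mu(f*) = sum_n k(f*, f_n) alpha_n and its gradient
    d mu / d f*, using the radial derivative of the Matern 5/2 kernel,
    dk/dr = -(5 sig2 / (3 ell^2)) r (1 + sqrt(5) r / ell) exp(-sqrt(5) r / ell)."""
    diff = fstar - F_train                          # shape (n, d)
    r = np.linalg.norm(diff, axis=1)                # shape (n,)
    mu = matern52(r, ell, sig2) @ alpha
    dk_dr = -(5.0 * sig2 / (3.0 * ell**2)) * r * (1.0 + np.sqrt(5.0) * r / ell) \
        * np.exp(-np.sqrt(5.0) * r / ell)
    unit = diff / np.clip(r, 1e-30, None)[:, None]  # safe as r -> 0 (dk_dr -> 0)
    return mu, (dk_dr[:, None] * unit).T @ alpha
```

The returned gradient is exactly the quantity ∂ε/∂f that enters the working equations.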
We now proceed to solve for each of the Lagrange multipliers. First, combining the stationary conditions described by Eq. 9 with the auxiliary conditions x = x† yields the linear Z-vector equations of Eq. 18, where P_pq permutes the indices p and q; these are used to solve for z, z^core, z^val-occ, z^val-vir, and λ. The matrix x is then obtained from Eq. 19. The Lagrange multipliers z^val-occ are solved for by considering the (valence-occupied)-(valence-occupied) part of Eq. 18, yielding Eq. 20. Eq. 20 can be further simplified by showing that z^val-occ decouples from the remaining multipliers (Eq. 21). As a result, z^val-occ is independent of all other Lagrange multipliers, which simplifies Eq. 20 to Eq. 22, where the 4-dimensional tensor B is expanded in Appendix A. The linear system of equations defined by Eq. 22 comprises the Z-vector coupled-perturbed localization (Z-CPL) equations, which are used to solve for z^val-occ. Subsequently, Eq. 13 can be used to compute D[z^val-occ].
The Lagrange multipliers z^core are solved for by considering the core-(valence-occupied) part of Eq. 18; the resulting expression further simplifies to the Z-vector equations used to solve for z^core. Subsequently, Eq. 12 can be used to calculate D[z^core]. The Lagrange multipliers z^val-vir are solved for by considering the (valence-virtual)-(valence-virtual) part of Eq. 18; the resulting expression further simplifies to Eq. 26, where the 4-dimensional tensor C is expanded in Appendix B. These are the Z-CPL equations used to solve for z^val-vir. Subsequently, Eq. 14 can be used to compute D[z^val-vir]. The Lagrange multipliers λ are solved for by considering the (non-valence-virtual)-(valence-virtual) part of Eq. 18; the resulting expression further simplifies to the Z-vector equations used to solve for λ. Subsequently, Eq. 15 can be used to compute D[λ]. Finally, the Lagrange multipliers z are solved for by considering the virtual-occupied part of Eq. 18, which further simplifies to the Z-vector coupled-perturbed Hartree-Fock (Z-CPHF) equations. Here, the MO indices a and i refer to the full virtual and occupied spaces, respectively. With the solutions to all Z-vector equations, we can return to Eq. 19 to solve for x.
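In practice, Z-vector systems of this kind are solved iteratively using only matrix-vector products, since the full coefficient matrix is never built explicitly. A generic sketch (assuming, for illustration, a symmetric positive-definite or symmetrized system; production codes typically use preconditioned Krylov variants):

```python
import numpy as np

def cg_solve(matvec, b, tol=1e-10, maxiter=500):
    """Conjugate-gradient solve of A z = b given only the product z -> A z,
    as is typical for Z-CPHF-type equations."""
    z = np.zeros_like(b)
    r = b - matvec(z)
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = matvec(p)
        a = rs / (p @ Ap)
        z += a * p
        r -= a * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:   # converged residual norm
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return z
```

The same convergence threshold idea (1 × 10−8 a.u. in section III) maps onto the residual-norm test above.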

Incorporating molecular-orbital localization
To provide the working expression of Eq. 8 in terms of derivative AO integrals, we must specify the molecular-orbital localization method. For this derivation, we choose the Foster-Boys and IBO localization methods to localize the valence-occupied and valence-virtual orbitals, respectively. In the resulting expression, h is the standard one-electron Hamiltonian; µ, ν, κ, and σ label AO basis functions in the original basis; (µν|κσ) are the two-electron repulsion integrals; S_2 is the overlap matrix of the minimal AO basis (MINAO) used in the IBO procedure; and S_12 is the overlap matrix between the original AO and MINAO basis sets. The effective one-particle density d_a is defined in terms of γ, the full-system HF density. The effective two-particle density D is defined in terms of the effective one-particle density d_b. The matrices X_1, X_2, X_12, and W^n complete the expression; Eq. 35 is expanded in Appendix B and Eq. 36 is expanded in Appendix A.
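The role of the Foster-Boys localization conditions can be illustrated numerically: for a pairwise rotation of orbitals i and j, the derivative of the Boys objective Σ_n Σ_i (R^n_ii)² at zero rotation angle is 4 r_ij, with r_ij = Σ_n R^n_ij (R^n_ii − R^n_jj), so r_ij = 0 is exactly the stationarity of the localization. A minimal sketch with random symmetric stand-ins for the MO dipole matrices (not actual quantum-chemistry integrals):

```python
import numpy as np

def fb_objective(R):
    """Foster-Boys objective: sum over components n and orbitals i of (R^n_ii)^2."""
    return sum(float(np.sum(np.diag(Rn) ** 2)) for Rn in R)

def fb_condition(R, i, j):
    """Localization condition r_ij = sum_n R^n_ij (R^n_ii - R^n_jj)."""
    return sum(Rn[i, j] * (Rn[i, i] - Rn[j, j]) for Rn in R)

def rotate_pair(R, i, j, theta):
    """Apply a 2x2 Jacobi rotation of orbitals i and j to all dipole matrices."""
    m = R[0].shape[0]
    U = np.eye(m)
    c, s = np.cos(theta), np.sin(theta)
    U[i, i] = U[j, j] = c
    U[i, j], U[j, i] = -s, s
    return [U.T @ Rn @ U for Rn in R]
```

At a Foster-Boys maximum every pairwise derivative vanishes, which is the condition r_ij = 0 entering the Lagrangian.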

III. COMPUTATIONAL DETAILS
In this work, we perform calculations on three different data sets: (i) the thermalized water data set published in Ref. 28, (ii) a thermalized set of organic molecules featuring up to seven heavy atoms (QM7b-T), 28 and (iii) the ISO17 data set of conformers taken from molecular dynamics (MD) trajectories for constitutional isomers with the chemical formula C7O2H10. 14 All MOB-ML energy and analytical gradient calculations are implemented in and performed with entos qcore. 36 The DF-HF calculations for the QM7b-T set 28 and the ISO17 set 14 are performed with a cc-pVTZ 37 basis set and a cc-pVTZ-JKFIT 38 density fitting basis. The DF-HF calculations for water are performed with an aug-cc-pVTZ 39 basis set and an aug-cc-pVTZ-JKFIT 38 density fitting basis. We employ a molecular orbital gradient convergence threshold of 1 × 10−8 a.u. In all MOB-ML calculations, the Foster-Boys 34 localization method is used to localize the valence-occupied MOs. The valence-virtual space is localized either with Foster-Boys localization (QM7b-T, ISO17) or with the IBO localization method 35 (water). The diagonal and off-diagonal feature vectors are constructed following the procedure outlined in Ref. 32. For all Z-CPHF calculations, a convergence threshold of 1 × 10−8 a.u. is specified.
All WF calculations are performed in Molpro 40 with the frozen-core approximation and with density fitting. All WF pair energy calculations employ the non-canonical MP2 [41][42][43][44] or non-canonical coupled-cluster singles, doubles, and perturbative triples [CCSD(T)] 1,45-50 correlation treatments.

Table I. Mean absolute error (MAE) of the MOB-ML analytical nuclear gradient with respect to the MOB-ML numerical nuclear gradient for a non-equilibrium geometry of water. The numerical nuclear gradients were obtained with a two-step central difference formula with a step size of 5 × 10−4 bohr. The non-equilibrium geometry of water has bond lengths of 0.986 Å and 0.958 Å, and a bond angle of 94.5°. All MOB-ML models are trained on data for 100 water geometries.
The MOB-ML models for water are trained on non-canonical CCSD(T)/aug-cc-pVTZ pair correlation energies. When constructing the feature vectors, all non-zero elements from the Fock and κ matrices are used. All linear regression (LR) models are trained using Scikit-Learn. 52 All Gaussian process regression (GPR) 53 models use the Matern 5/2 kernel 53,54 and are optimized using the scaled conjugate gradient option in GPy. 55 All regression clustering models are trained following the framework outlined in Ref. 29, using a GPR within each cluster.
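The GPR weights underlying such models are obtained from the kernel matrix of the training features. A minimal numpy sketch of the fit and predict steps (standing in for the GPy models used in this work; hyperparameters are illustrative, not optimized):

```python
import numpy as np

def matern52_kernel(X1, X2, ell=1.0, sig2=1.0):
    """Matern 5/2 kernel matrix between two sets of feature vectors."""
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    s = np.sqrt(5.0) * d / ell
    return sig2 * (1.0 + s + s**2 / 3.0) * np.exp(-s)

def gpr_fit(X, y, ell=1.0, sig2=1.0, noise=1e-10):
    """Return the weight vector alpha = (K + noise*I)^{-1} y."""
    K = matern52_kernel(X, X, ell, sig2) + noise * np.eye(len(X))
    return np.linalg.solve(K, y)

def gpr_predict(Xstar, X, alpha, ell=1.0, sig2=1.0):
    """Predictive mean at the test features Xstar."""
    return matern52_kernel(Xstar, X, ell, sig2) @ alpha
```

The weight vector alpha is precisely the quantity contracted with the kernel derivative when the feature-vector derivatives of section II are evaluated.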
The MOB-ML models for the QM7b-T and ISO17 data sets are trained on non-canonical MP2/cc-pVTZ pair correlation energies. Feature selection is performed using random forest regression 56 with the mean decrease of accuracy criterion, which is sometimes referred to as permutation importance. 57 All GPR models use the Matern 5/2 kernel and are optimized using the scaled conjugate gradient option in GPy.
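The mean-decrease-of-accuracy criterion can be sketched generically: shuffle one feature column at a time and measure how much the model's error grows. The sketch below uses a simple linear model in place of the random forest (an assumption for illustration; the idea is identical):

```python
import numpy as np

def permutation_importance(predict, X, y, rng, n_repeats=10):
    """Mean increase in MAE when each feature column is shuffled;
    informative features produce a large increase (permutation importance,
    i.e., the mean decrease of accuracy criterion)."""
    base = np.mean(np.abs(predict(X) - y))
    imp = np.zeros(X.shape[1])
    for k in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, k] = rng.permutation(Xp[:, k])
            imp[k] += np.mean(np.abs(predict(Xp) - y)) - base
    return imp / n_repeats
```

Features whose permutation leaves the error unchanged carry no predictive information and can be dropped from the feature vector.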

IV. RESULTS AND DISCUSSION
First, we compare the MOB-ML analytical gradient to the numerical gradient for an exemplary molecule to illustrate the correctness of our derivation and implementation in Table I. Table I shows that the mean absolute errors (MAE) of the analytical MOB-ML gradients of a distorted water molecule with respect to the numerical ones are on the order of 10−8 hartree/bohr for all MOB-ML models. A similar MAE is commonly found when comparing analytical and numerical gradients of pure electronic structure methods. 30,31,47,58 Additionally, Table I shows that the difference between the numerical and analytical gradients is largely independent of the regression technique (linear regression, Gaussian process regression, or clustered Gaussian process regression) applied within the MOB-ML model. More generally, this illustrates (as also pointed out in Section II B 2) that any desired regression technique can be applied within MOB-ML without changes to the gradient framework, provided that the regression prediction is differentiable with respect to the features.
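The two-step central-difference validation used for Table I can be sketched for any energy function (a minimal sketch; the quadratic test energy in the check below is a stand-in, not a MOB-ML energy):

```python
import numpy as np

def numerical_gradient(energy_fn, coords, step=5e-4):
    """Two-step central finite-difference gradient, matching the
    5e-4 bohr step size used for the Table I comparison."""
    flat = coords.astype(float).ravel().copy()
    grad = np.zeros_like(flat)
    for k in range(flat.size):
        x0 = flat[k]
        flat[k] = x0 + step
        e_plus = energy_fn(flat.reshape(coords.shape))
        flat[k] = x0 - step
        e_minus = energy_fn(flat.reshape(coords.shape))
        flat[k] = x0
        grad[k] = (e_plus - e_minus) / (2.0 * step)
    return grad.reshape(coords.shape)
```

Comparing this numerical gradient against an analytical one, coordinate by coordinate, yields the MAE reported in Table I.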
As a second demonstration, we consider the thermally accessible potential energy surface of a single water molecule, following our previous work. 28 Fig. 1 shows the MAE for the energy predictions and for the associated analytical gradients obtained with MOB-ML models trained on CCSD(T) energies for thermalized water geometries. As already highlighted in Ref. 29, the MAE for the energy prediction decreases steeply with the number of training geometries, and we reach an MAE of 2 × 10−4 kcal/mol when training on correlation energies of 100 training geometries. Additionally, we see that the MAE of the analytical MOB-ML gradients with respect to the analytical CCSD(T) gradients strictly decreases with an increasing amount of training data, even though the training data in this context are correlation energy labels and not gradients. The MAE of the MOB-ML analytical gradient is 9 × 10−3 kcal/mol/Å when training on correlation energies for 100 water geometries. We can contextualize this result by considering that the threshold commonly used to determine whether a structure optimization is converged is 0.36 kcal/mol/Å. The MAE for the gradient drops below this threshold when training on as few as three to nine water geometries. This demonstrates that MOB-ML is able to describe potential energy surfaces with high accuracy and high data efficiency.
In Fig. 2, we show that this result generalizes to a diverse set of molecules. To this end, we first study the QM7b-T data set, which is comprised of a thermalized set of 7211 organic molecules with seven or fewer heavy atoms. 59 The MAE for the energy decreases steeply, and we obtain an MAE of 1.0 kcal/mol when training on about 70 structures. The decrease in the MAE for the energy prediction is accompanied by a decrease in the MAE for the analytical MOB-ML gradient with respect to the analytical MP2 gradient. We reach an MAE of 2.08 kcal/mol/Å when training on 220 structures.
To compare MOB-ML for gradient predictions with other machine learning methods, we now also examine the ISO17 data set. 14 The ISO17 data set consists of conformers taken from MD trajectories for constitutional isomers with the chemical formula C7O2H10. Table II shows the performance of two MOB-ML models, one trained on 220 QM7b-T structures and one trained on 100 ISO17 structures, and summarizes the MAEs obtained with other ML models in the literature, i.e., SchNet, 14 FCHL, 21 PhysNet, 22 the shared-weight neural network (SWNN), 23 GM-sNN, 25 and GNNFF. 26 The MOB-ML models are the only ML models that are on average chemically accurate, even though the MOB-ML models were only trained on energies for 100 ISO17 molecules and 220 QM7b-T molecules, respectively. The fact that our model trained on a small set of seven-heavy-atom molecules, which are smaller than the ISO17 molecules and chemically more diverse (QM7b-T additionally contains the elements N, S, and Cl), performs comparably showcases again how transferable and data efficient MOB-ML models are. The next best model in terms of the energy MAE is GM-sNN, which was trained on energies and gradients for 400k ISO17 structures and achieves an MAE of 1.97 kcal/mol. The force MAE of the MOB-ML models (1.63 and 1.64 kcal/mol/Å, respectively) is comparable to that of GM-sNN (1.66 kcal/mol/Å) while employing only 0.025% of the training data. MOB-ML is significantly more accurate in the forces than other models trained on energies alone, i.e., SchNet, which obtained an MAE of 5.71 kcal/mol/Å, and SWNN, which obtained an MAE of 6.61 kcal/mol/Å. The only model that is more accurate in terms of the force MAE is PhysNet, which is trained on energies and forces for 400k ISO17 structures and obtains a force MAE of 1.38 kcal/mol/Å. Given the demonstrated learnability of forces, it is very likely that MOB-ML could be trained to be more accurate by including more training data.
Furthermore, analytical gradients have not been derived for all reference theories, which considerably limits the scope of these machine learning methodologies. For example, the popular local coupled cluster methods 60-62 do not currently have analytical gradient theories.

Table II. Comparison of the mean absolute error for the prediction of energies and atomic forces for the unknown test set of the ISO17 data set obtained with different ML methods. The ML methods were trained with different training set sizes and on different labels. Energy and force errors are reported in kcal/mol and kcal/mol/Å, respectively.

Despite comparing favorably to other ML methods, it remains to be shown whether the MOB-ML gradients are sufficiently accurate for practical applications. Therefore, we now use the MOB-ML gradients to perform the common quantum-chemical task of optimizing molecular structures. We optimize the constitutional isomers in ISO17 with MP2 and with MOB-ML and compare the resulting structures via the root mean square deviation (RMSD) of the atomic positions in Figure 3.
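Comparing two optimized structures via RMSD requires superimposing them first. A minimal sketch using the Kabsch algorithm for the optimal rotation (our choice for illustration; the text does not specify the alignment procedure used for Figure 3):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two conformations (n_atoms x 3 arrays) after optimal
    superposition: remove centroids, then find the least-squares rotation
    via the Kabsch algorithm (SVD with a reflection correction)."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    V, S, Wt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(V @ Wt))   # avoid improper rotations
    R = V @ np.diag([1.0, 1.0, d]) @ Wt
    return np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1)))
```

Identical structures related by rotation and translation give an RMSD of zero, so the reported mean RMSD isolates genuine geometric differences between the MOB-ML and reference structures.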

V. CONCLUSIONS
In this work, we have presented the derivation and implementation of the formally complete MOB-ML analytical nuclear gradient theory within a general Lagrangian framework. We have validated our derivation and implementation by comparison of numerical and analytical gradients. The MOB-ML gradient framework can be applied in conjunction with any desired fitting technique (e.g., Gaussian process regression or neural networks) and any desired recipe for assembling the MOB-ML feature information. Furthermore, the framework for evaluating the gradient of a predicted high-accuracy wave function energy is independent of the wave function method MOB-ML was trained to predict. Hence, we can take the analytical gradient of a MOB-ML method trained to predict an arbitrarily accurate wave function theory. MOB-ML was previously shown to predict high-accuracy wave function energies at the cost of a molecular orbital evaluation. We have now shown that a MOB-ML model trained on correlation energies alone also yields highly accurate gradients for the potential energy surface of a single molecule and for sets of diverse molecules. Specifically, we presented a MOB-ML model that obtains a force MAE of 1.64 kcal/mol/Å for the ISO17 set when trained only on reference energies for 100 molecules, beating the next best literature model trained only on energies, SchNet (5.71 kcal/mol/Å), which was trained on 400k molecules. 14 The transferability and data efficiency become even clearer when considering that we obtain an MAE of 1.63 kcal/mol/Å for the ISO17 set when training on 220 QM7b-T molecules, which are smaller in size (seven versus nine heavy atoms) and more diverse in chemical composition. The accuracy of the forces from a MOB-ML model trained on energies for 220 QM7b-T molecules is on par with some of the best ML models trained on energies and forces for hundreds of thousands of ISO17 molecules.
Furthermore, we have demonstrated that a force MAE of this magnitude translates into structures that are very close to reference structures. Specifically, we obtain a mean RMSD of 0.01 Å with respect to MP2 optimized structures for the ISO17 data set, which is three times smaller than for HF or B3LYP-D3 optimized structures. Natural objectives for future work include (i) the inclusion of gradients in the training process to boost the performance in the very low data regime; (ii) the extension to an open-shell framework; and (iii) the adaptation of the Lagrangian framework to derive analytical gradients of the MOB-ML energy with respect to other quantities, such as electric and magnetic fields.

SUPPLEMENTARY MATERIAL
The Supplementary Material contains the partial derivatives of the feature vector elements.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available within the article and its supplementary material. The data set used in Table I and Fig. 1 is available from Ref. 59. The data set used in Fig. 2 is available from Ref. 59. The data set used in Table II is the ISO17 set. 14

Appendix A: Foster-Boys Localization

The index n corresponds to the x, y, z-components of the position operator. The matrices R^n are defined as R^n_ij = (i|r_n|j), where |i) and |j) are valence-occupied MOs. The orbital derivative contributions from the Foster-Boys localization conditions shown in Eq. 13 involve the 4-dimensional tensor

B_pqkl = Σ_n [2R^n_pk δ_kq − 2R^n_pl δ_lq] R^n_kl + [R^n_kk − R^n_ll][R^n_pl δ_kq + R^n_pk δ_lq]. (A4)

Next, the dipole derivative contribution from the localization conditions enters through the first term on the RHS of Eq. 36. For a full derivation of the orbital and dipole derivatives of the Foster-Boys localization conditions, please refer to Ref. 31.

Appendix B: IBO Localization
This appendix provides additional details for the IBO-related terms in Eqs. 7, 14, 26, and 35 of the main text. The localization conditions for IBO 63 involve the matrices Q^A, where A corresponds to an atom in the molecule and the summation over µ is restricted to basis functions centered on atom A. The matrix C_IAO is the MO coefficient matrix represented in the intrinsic atomic orbital (IAO) basis, where S_1 is the overlap matrix in the original atomic orbital (AO) basis and C is the MO coefficient matrix in the original AO basis. The matrix X_IAO transforms from the AOs to the IAOs, as shown in Eqs. B3 and 7. The matrix L is the subset of the MO coefficient matrix being localized, and the matrix L̃ is the corresponding coefficient matrix in the IAO basis. The orbital derivative contribution from the IBO localization conditions shown in Eq. 14 corresponds to Eq. 60 in Ref. 63.

Appendix D: Density Fitting

This appendix provides details on how the density fitting (DF) approximation can be used to approximate the four-center AO integral derivatives in Eq. 31. The AO integral derivatives are approximated by

(µν|κσ)^(q) ≈ Σ_P [(µν|P)^(q) c^P_κσ + c^P_µν (P|κσ)^(q)] − Σ_PQ c^P_µν J^(q)_PQ c^Q_κσ,

where P and Q label density fitting basis functions, (µν|P)^(q) are three-center AO integral derivatives, and J_PQ are two-center AO integrals. The matrix c^P is

c^P_µν = Σ_Q (µν|Q) [J^{-1}]_QP. (D6)
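The DF factorization of Eq. D6 can be sketched with toy arrays standing in for the real two- and three-center integrals (an illustration of the tensor contractions only, not of real integral evaluation):

```python
import numpy as np

def df_coefficients(eri3, J):
    """DF coefficients c^P_{mu nu} = sum_Q (mu nu|Q) [J^{-1}]_{QP},
    given three-center integrals eri3 (shape m, m, naux) and the
    two-center metric J (shape naux, naux)."""
    return np.einsum('mnq,qp->mnp', eri3, np.linalg.inv(J))

def df_four_center(eri3, J):
    """Approximate four-center integrals (mu nu|ka si) from the DF
    factorization: sum_P (mu nu|P) c^P_{ka si}."""
    return np.einsum('mnp,ksp->mnks', eri3, df_coefficients(eri3, J))
```

In production codes the inverse of J is never formed explicitly (Cholesky solves are used instead), but the resulting contraction is the same.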