Quantum harmonic free energies for biomolecules and nanomaterials

Obtaining the free energy of large molecules from quantum mechanical energy functions is a long-standing challenge. We describe a method that allows us to estimate, at the quantum mechanical level, the harmonic contributions to the thermodynamics of molecular systems of large size, with modest cost. Using this approach, we compute the vibrational thermodynamics of a series of diamond nanocrystals, and show that the error per atom decreases with system size in the limit of large systems. We further show that we can obtain the vibrational contributions to the binding free energies of prototypical protein–ligand complexes where exact computation is too expensive to be practical. Our work raises the possibility of routine quantum mechanical estimates of thermodynamic quantities in complex systems. A method to compute the quantum harmonic free energy contributions in large materials and biomolecular simulations at a reasonable cost is proposed, making quantum mechanical estimates of thermodynamic quantities possible for complex systems.

The contributions to the free energy from atomic motion are critically important to the thermodynamics and kinetics of biological, chemical, and materials systems. Changes in such contributions govern processes ranging from the affinity of drug binding to structural phase transitions in crystals. When the internal energy is computed at the quantum mechanical level, a harmonic approximation is often the only feasible option to describe atomic motion. However, for large systems such as nanostructures and biomolecules, computing free energy contributions is expensive even within the harmonic approximation. For a system of N atoms, the Hessian matrix which describes the vibrations requires O(3N ) gradient calculations, or O(3N ) times the cost of computing the internal energy. This is clearly prohibitive when N is large, making free-energy computation in large systems with quantum mechanical methods a major contemporary challenge [1].
There are many possible strategies to speed up harmonic vibrational analysis, including methods based on a partial Hessian [2,3], iterative diagonalization [4], and Hessian-free methods that use molecular dynamics to approximate the the harmonic problem [5,6]. Here we describe a different strategy where we estimate vibrational thermodynamic quantities directly without ever computing the full Hessian or taking advantage of any local structure.
The starting point is to express each harmonic thermodynamic quantity as a matrix function trace. Then, our technique contains three elements. First, we sample the matrix trace operation using random vectors and stochastic Lanczos quadrature [7]. Second, we compute the Hessian-vector product at the same cost as the gradient from the difference of gradients at displaced geometries, bypassing the Hessian construction entirely. Third, we ameliorate the stochastic error, especially for free energy differences, through a form of correlated sampling. Related stochastic methods have been used for anharmonic corrections to the harmonic free energy [8,9], as well as in stochastic electronic structure [10], but to our knowledge this is the first time these ideas have been brought to bear on the harmonic thermodynamic quantities themselves. As we demonstrate, this allows us to compute at the quantum mechanical level and with modest cost, vibrational free energy contributions for nanocrystals with more than 600 atoms, and free energy differences in protein-ligand complexes with more than 3000 atoms.

Theory
We first express the harmonic thermochemical quantities as traces of matrix functions. In particular, we are interested in the zero-point energy (ZPE), the thermal contribution to the enthalpy, and the entropy, Here β is the inverse temperature, {ω I } is the set of normal mode frequencies, and D is the mass-weighted Hessian matrix. We refer to ZPE as a non-thermal quantity as it has no temperature dependence. The above expressions have the form Trf (D). We now employ a stochastic estimator of the trace. The simplest version writes Trf (D) ≈ M n n l=1 v T l f (D)v l where v l are a set of n random vectors with zero mean and unit covariance, and M = 3N − 6 is the dimension of D. This direct stochastic evaluation requires a polynomial approximation of f (D), which is typically carried out using a Chebyshev expansion [11]. A closely related idea, which we use in this work, is stochastic Lanczos quadrature [7]. In this technique, the polynomial approximation is generated for the scalar v T l f (D)v l rather than globally for the function f (D). We have found the stochastic Lanczos method to be slightly superior to the Chebyshev polynomial approach for the quantities in this work. Within the polynomial expansion, the main operation is the matrix-vector product Dx, where x is in the stochastic Lanczos space. This can be computed from the difference of gradients, displaced by δx in mass-weighted coordinates, for small δ [12]. Thus no Hessian is needed at all in this approach (further details are provided in the Methods section). We note the sampling itself does not depend on f , and thus intermediate information, such as Lanczos quadrature weights and positions can be cached to estimate Trf (D) for any other f , and as such thermostatistical quantities at different temperatures can be computed without the need for repeating the sampling procedure.
Within the above scheme, there are two sources of error. The first is from the order of the Lanczos quadrature, m. This vanishes when m is greater than or equal to the matrix dimension. The second is the sampling error, which decreases like 1/ √ n for n random vectors. For a quadrature order of m and n samples, the cost of the method is equal to O(mn) gradient calculations. To reduce the statistical error, we use a form of correlated sampling. For the absolute thermodynamic quantities computed for the diamond nanocrystals we employ a high-level quantum mechanical approach as well as a cheaper low-level method (for example, a force-field, or semi-empirical quantum-mechanical approach) where the exact computation of the harmonic thermodynamic quantity X low is possible. Then, the free energy contribution for the high-level method is obtained as where ∆ is computed by applying the stochastic Lanczos quadrature to the difference of high-level and low-level methods. In the case of protein(P)-ligand(L) binding, we are interested in the difference between the holo (ligand-bound) state and the apo (ligand-free) state, i.e.
where X bind represents the ligand binding free energy, enthalpy, entropy, etc. In this case, we perform correlated sampling by using the same random vectors in the stochastic Lanczos treatment of the P+L, P, L systems (zeroing out elements for P and L respectively). No additional low-level method is involved in ligand-binding calculations.

Results
As a first application, we take diamond nanocrystals ( Figure 1a) as prototypical nanomaterials, and compute the free energies as a function of size. We employ the highand low-level correlated sampling approach described above, with Kohn-Sham density functional theory (DFT) with the PBE functional [13] as the high-level method and the semi-empirical extended tight-binding (xTB) method [14] as the low-level method. For the smallest system (C 54 H 54 ), we can compute the Hessian explicitly to provide an exact reference. Figure 1b shows ∆ quantities for the zero-point energy, the thermal enthalpy, and the entropy respectively, as a function of quadrature order m. The error bar indicates the statistical error for 50 samples (see SI for details of error analysis). A stochastic quadrature level of m = 8 does not provide sufficient accuracy, so we choose m = 16 for further calculations. Additional calculations on three transition-metal complexes using m = 16 are presented in the SI, although the choice of m in general is system and accuracy specific. Figure 1c shows the value and stochastic error per atom (estimated as one standard error) for the ZPE, thermal enthalpy, and entropy respectively. We note the error decreases with the size of the system, faster than the decay of the quantities themselves, which is evidence of "self-averaging" due to the large system size. A more detailed discussion of the "self-averaging" behavior is provided in the SI. Thus if one is interested in peratom quantities, as is often the case for thermodynamics, for example to locate phase transitions, our stochastic approach becomes increasingly more efficient in a large system. The difference between our largest simulation and the extrapolated thermodynamic limits for the per-atom ZPE, enthalpy, and entropy is only 0.2 kcal/mol, 0.004 kcal/mol, and 0.006 kcal/mol respectively; statistical errors with 50 samples are about 0.001 kcal/mol or less. In fact, in the largest diamond system, with a single sample, one can estimate the per-atom quantities with a statistical error of less than 0.01 kcal/mol (this error was estimated from 50 samples; see SI for details) at a 120-fold speedup relative to the exact Hessian calculation. If one is instead interested in the absolute values, the method can still be cheaper than the full computation of the Hessian. Figure 1d shows the stochastic error for the largest carbon system (C 432 H 216 ) as a function of computational cost. At less than 20% of the exact calculation's computational effort, the error estimate for ZPE is well within 1 kcal/mol, red corresponding to 0.03% relative error with respect to the total ZPE or 7.7% relative error with respect to the ∆ZPE between DFT and xTB. Less precise estimates can be obtained even more cheaply; a 20 times speedup is possible if 2 kcal/mol of error in the absolute quantities is tolerable.
Vibrational contributions to the free energy are also central to the study of large molecule and biomolecular interactions. For protein-ligand interactions in particular, where the aggregate binding is often only 5-15 kcal/mol, the vibrational contribution to binding free energies can be significant. Although the harmonic approximation is not necessarily a faithful approximation in these systems, the harmonic contributions nonetheless provide a useful first estimate of the thermal and entropic contributions [15,16]. The task is challenging for the stochastic Lanczos approach as binding is the difference of large absolute quantities, requiring tight convergence of the statistical error. To reduce statistical error, we use the fixed random vector correlated per atom (kcal/mol) sampling approach described above. We do not include explicit water molecules in the simulation: in principle, these could be included at additional cost, or the desolvation contribution to the free energy can be separately estimated by standard continuum methods [15]. We first study a system which is just small enough that exact results at the xTB level can be obtained at a high computational cost: a cutout (∼ 1600 atoms) of the human tankyrase 2 (TNKS2) protein with a bound ligand shown in Figure 2a (see Methods for more information). In Figure 2b we show the thermal contributions to the binding enthalpy, entropy and free energy for the stochastic Lanczos quadrature orders m = 8, 16, 32 using xTB. We present a detailed check of the Lanczos convergence for a larger set of m values, and across a set of different systems, in the SI (see

Figures 3S and 4S
). Following these convergence checks, we choose m = 16 for further calculations, where the error due to the Lanczos order is estimated to be less than 1 kcal/mol. Table 1 summarizes the data using up to 100 samples for all rovibrational free energy contributions (the rotational contribution is obtained following Ref. [17]). From comparison to the exact results, the total thermal contribution ∆G vib − ∆ZPE can be estimated with a statistical error of less than 1 kcal/mol with a cost of roughly 10% of the exact Hessian calculation. The non-thermal contribution ∆ZPE (not plotted) has larger statistical error, but can still be estimated to better than 2 kcal/mol with roughly 200 samples, or 67% cost of the exact Hessian calculation. We next evaluate the thermal quantities at the Kohn-Sham DFT level using the PBE functional for the TNKS2 complex using up to 35 samples, as summarized in Table 1. Interestingly, we find the thermal vibrational contributions at the DFT level to be quite similar to those from xTB. A single sample using our DFT implementation takes roughly two days on 1 node (32 CPU cores), compared to 600 days on 1 node for the exact Hessian calculation.
Finally, we apply our approach to HIV protease bound to a small molecule (JE-2147)[19] (Figure 3a). This system contains over 3000 atoms including hydrogens. The number of gradient calculations required to compute the full Hessian in this case (∼ 10,000) is so large that the exact computation is expensive even at the level of semi-empirical quantum mechanics, thus we do not compute exact data here. In Table 1 Table 1 Contributions to the binding free energy at 298.15 K for the TNKS2 and JE-2147-HIV protease system. The rotational free energies ∆G rot are computed at the optimized structures, the thermal enthalpy, entropy and free energy are computed from stochastic sampling. The error estimate for TNKS using xTB is one standard error from 100 random samples (corresponding to 33% of exact cost), 35 samples (corresponding to 11% of exact cost) for TNKS using DFT, and 50 random samples (corresponding to 10% of exact cost) for HIV (xTB). The experimental binding affinity is estimated as either k B T ln IC 50 or k B T ln K i , with IC 50 = 9.2 nM for TNKS2 from Ref.
[20] and K i = 41 pM for HIV protease from Ref. [19]. ∆G tot bind includes the non-thermal contribution from the ZPE and is computed as described in the Methods section. All values are given in kcal/mol. energy are both similar, small, and of opposite sign, meaning that the total thermal free energy contribution is almost zero. The small size of the thermo-statistical harmonic contributions may be due to the known rigidity of the HIV protease binding pocket, which means that many of the normal modes in the free protein and protein-ligand complex may be very similar. Nonetheless, estimating ∆G vib −ZPE to an accuracy of 1 kcal/mol is clearly feasible within our scheme at roughly 10% of the estimated cost of the exact calculation (Figure 3b). Table 2 Summary of computational cost to achieve an accuracy of 1 kcal/mol compared to an exact Hessian calculation. The number of samples n required to achieve the desired accuracy is estimated with 50 samples for the diamond systems (n = 50), n = 100 for TNKS2 (xTB), n = 35 for TNKS (DFT), and n = 50 for HIV (xTB). See SI for details about error analysis.

Discussion
We have presented results which demonstrate the feasibility of computing harmonic contributions to the free energy at the quantum mechanical level for systems of more than a thousand atoms. The cost is greatly reduced from that needed to compute the Hessian of the system. This is particularly true when one is interested in intensive (or "per-atom") quantities, where self-averaging behavior shows that in large systems, we may estimate the quantities at a cost comparable to that of a few energy evaluations. This holds promise in evaluating thermodynamic transitions in materials involving large unit cells, for example, those associated with alloys and disorder. In the case of free energy differences, the correlated sampling technique employed here makes the evaluation of even small thermal free energy differences, as found in protein-ligand complexes, feasible at the level of 1-2 kcal/mol. The estimated speedups for all systems considered, in order to reach a given accuracy, are summarized in Table 2.
An additional advantage of the current approach is that the cost may be continually tuned. This is relevant to new applications, for example in the computational screening for therapeutics [1,21], where less precise estimates are an acceptable tradeoff for speed. Also, as we see from Table 1, while the computed binding affinity using the harmonic approximation is not always highly accurate in biomolecular systems, the increased facility to obtain harmonic estimates further raises the possibility for new approaches to compute anharmonic contributions to free energies with a variety of techniques, such as the minimum-mining technique [22], which samples multiple minima in an anharmonic potential and combines harmonic contributions from each of them. While applications which require intensive quantities benefit most from the stochastic approach, converging absolute thermal quantities to sufficiently high precision may require improved statistical estimators [23]. In addition, while we have estimated harmonic thermal contributions using quantum mechanical energy functions, the same algorithm accelerates harmonic free energy computation using any energy function, including classical force-fields, and can be combined with other cost reduction techniques, such as partial Hessians.
In summary, the technique presented here suggests that the estimation of harmonic free energy effects at the quantum mechanical level for systems with hundreds or even more than a thousand of atoms need not be considered a future challenge [1], but one which can be begin to be addressed today.

Stochastic Lanczos quadrature
The stochastic Lanczos method is a numerical method which has been employed in different contexts (see e.g. Ref. [24] for an early application in quantum manybody systems). We follow the general mathematical formulation in Ref. [7]. The Lanczos iterations were performed starting from a vector randomly selected from a Rademacher distribution. The Lanczos iterations require the action of the massweighted Hessian matrix on this random vector. We compute this matrix-vector product from finite difference gradient calculations: Here, D is the mass-weighted Hessian matrix, g is the gradient, M is the diagonal matrix of masses. The displacement is given by where the factor of M −1/2 accounts for the mass-weighting. The value of δ is chosen based on the norm of the random vector so that the average displacement per atom is 0.0012Å. The number of Lanczos iterations m is a parameter of the method; m should be increased until convergence is reached. In chemical systems, the maximum eigenvalue of the Hessian does not scale with the system size (it is the maximum vibrational frequency, for example a C-H stretch), while the minimum eigenvalue is bounded from below by 0. For many functions of the Hessian, this means that the maximum and minimum eigenvalues also do not scale with system size. Under this assumption, we expect a fixed m to yield a constant relative error in the trace of the function, and furthermore, due to exponential convergence in m for well-behaved functions of the Hessian, m needs to only increase logarithmically with system size for constant absolute error. A numerical study of convergence with m is presented in the SI.
Additionally, we also implemented and tested a Chebyshev fitting method as an alternative to the stochastic Lanczos quadrature. We found that often a higher order Chebyshev fit was required making it a slightly more expensive alternative to Lanczos quadrature.

Calculations on diamond nanocrystals
Diamond nanocrystals were constructed by creating supercells of the bulk diamond unit cell and then capping with hydrogens. The resulting structure was optimized using the PBE functional[13] and the def2-SV(P) basis set [25]. All DFT calculations were performed with the ORCA program [26,27]. The structure was also optimized using the second generation extended tight-binding (GFN2-xTB) method [28] as implemented in the Semiempirical Extended Tight-Binding (xTB) program package [29].

Calculations on protein-ligand systems
All calculations on protein-ligand systems used the second generation extended tightbinding (GFN2-xTB) method [28] as implemented in the Semiempirical Extended Tight-Binding (xTB) program package [29]. Generalized Born, solvent-accesible area (GBSA) solvation was used to mimic an aqueous environment for all calculations. For the TNKS2 system, we additionally performed the calculations using density functional theory. We used the PBE functional with the GTH-DZV basis and GTH pseudo-potential [30] for PBE. The system was placed in a 45.55Å × 41.78Å × 37.27Å periodic box, allowing a 5-6Å vacuum around the atoms. A 0.15 Hartree level shift was applied to the virtual orbitals to help the SCF convergence. The Gaussian and Plane Waves method [31,32] was employed and the plane wave cutoff was 200 Hartree.
The truncated TNKS2 protein was constructed from the the ligands/proteinstructure obtained from Ref. [33]. The entire protein was minimized using Amber-Tools, using the Generalized Born implicit, igb=5), the Amber 14 force field [34], and the general AMBER force field (GAFF) [35] for ligands, assigned using Antechamber from AmberTools [36]. Following minimization, truncation and capping of the terminals were carried out using PyMol [37]. Truncation was performed to remove all protein atoms beyond 3-4Å around the ligand. Truncated ends were capped using ACE/NME terminal patches. The ligand bound to the protein is one of the many inhibitors identified in Ref. [38] whose structure is available in the Protein Data Bank [39](PDB: JKN).
The structure of the JE-2147-HIV protease complex was obtained from PDB 1KZK. Hydrogens were added using UCSF Chimera [40] and the structure was optimized first using the GFN-FF force field as implemented in the xTB program package [29] and finally with the GFN2-xTB method [28] ultimately used for harmonic vibrational analysis.
The total binding free energy is estimated as ∆G rot + ∆H vib − T ∆S vib + ∆E + ∆G solv , where ∆E is the single point energy difference and ∆G solv is estimated via GBSA within xTB.
• Availability of data and materials: Data is available from the authors upon reasonable request. • Authors' contributions: AFW and GKC conceived the project. AFW, CL carried out the work. All authors contributed to the writing of the paper.
density-functional theory. Physical Review Letters 111(10), 1-5 (2013  where m p is the p-th central moment of the samples, such that 1 n (h 4 − n−3 n−1 h 2,2 ) forms an unbiased estimator for Var(S 2 n ), and the error of error is computed from its square root.
We next consider error estimation fors n from n samples where n > n. A special case is to estimate the error of a single-sample estimators 1 = s l from n samples. Such an error is just the variance of the underlying distribution of s l , i.e., E[(s l − E[s l ]) 2 ], and it is straightforward to show that n S 2 n forms an unbiased estimator. Similarly, one can show that n S 2 n /n is an unbiased estimator from n samples for Var(s n ).

Performance on transition-metal complexes
Transition metal complexes provide an example of systems where empirical force fields are often inadequate and quantum mechanical energy functions are especially valuable. We considered three transition-metal complexes, namely a Grubbs catalyst, a cobalamin model, and a heme model, as challenging cases. The initial structure of the Grubbs catalyst was taken from Ref. [1], optimized with both xTB and B3LYP [2] functional theory. The 6-311G basis set [3,4] was used for main group elements while the LanL2DZ basis [5] was used for the transition metal Ru. The stochastic sampling was performed using the same level of theory. The cobalamin structure was taken from Ref. [6], and BP86[7, 8]/6-31G(d) [9][10][11] was used for geometry optimization and sampling. The heme structure was taken from Ref. [12], and we used the B3LYP/6-311G(d) [3] level of theory. The DFT methods were chosen to be similar to the ones employed in the references from which the initial structures were obtained. In all the calculations, a Lanczos order of 16 was used, which we see to be sufficient to reproduce the exact results to within the small statistical error bars.

Self-averaging in diamond nanocrystals
In order to confirm that the observed small statistical error in large diamond crystals is the result of "self-averaging", instead of due to under-sampling of the larger coordinate space, we show in Supplementary Table 1 the exact absolute free energy of the two diamond systems, computed from the exact xTB Hessian and from our method using 50 samples (numbers in kcal/mol). We see that the exact values all lie within one standard error of the sampled values, validating the error bars at both system sizes, despite the fixed number of samples as a function of system size. Moving from C 54 H 54 to C 128 H 96 , the system size more than doubles, but the statistical error increases by much less (and in fact stays nearly constant), which is the self-averaging effect referred to in the text.

Convergence with fitting order
Here, we check the convergence with respect to the fitting order m. To obtain a basic understanding, we first examine the accuracy of Chebyshev polynomial fitting for the thermodynamic quantities, which can be assessed without using any stochastic sampling. We do so by examining the accuracy of the Chebyshev expansion over a range of frequencies (see Supplementary Figure 2). The Chebyshev error for a specific system can be estimated by averaging this fitting error over the frequency domain, weighted by the density of the system's vibrational states. The lowest vibrational frequency is 0 cm −1 , while the highest frequency of a chemical system is usually the bond vibration between a hydrogen and a heavy atom (typically < 4000 cm −1 ). Thus the only system dependence comes from the density of states. This is in contrast to the case of Chebyshev fitting in electronic structure, where the maximum frequency usually grows with system size. We see in both plots that the maximum pointwise deviation at m = 16 is less than 1 kcal/mol over the range plotted (although we note that the entropy diverges at 0 cm −1 where the density of states also vanishes). The above convergence check cannot be directly carried out for stochastic Lanczos quadrature due to the need to specify some initial stochastic vector. We have therefore checked the convergence of the Lanczos method as a function of m for several systems by brute-force stochastic sampling, as shown in Supplementary Figures 3 and 4.
A Lanczos order of m = 16 is generally seen to be sufficient to obtain a systematic error of 1 kcal/mol or less.