A Caltech Library Service

Regression-clustering for Improved Accuracy and Training Cost with Molecular-Orbital-Based Machine Learning

Cheng, Lixue and Kovachki, Nikola B. and Welborn, Matthew and Miller, Thomas F., III (2019) Regression-clustering for Improved Accuracy and Training Cost with Molecular-Orbital-Based Machine Learning. Journal of Chemical Theory and Computation, 15 (12). pp. 6668-6677. ISSN 1549-9618. doi:10.1021/acs.jctc.9b00884.


Machine learning (ML) in the representation of molecular-orbital-based (MOB) features has been shown to be an accurate and transferable approach to the prediction of post-Hartree-Fock correlation energies. Previous applications of MOB-ML employed Gaussian Process Regression (GPR), which provides good prediction accuracy with small training sets; however, the cost of GPR training scales cubically with the amount of data and becomes a computational bottleneck for large training sets. In the current work, we address this problem by introducing a clustering/regression/classification implementation of MOB-ML. In a first step, regression clustering (RC) is used to partition the training data to best fit an ensemble of linear regression (LR) models; in a second step, each cluster is regressed independently, using either LR or GPR; and in a third step, a random forest classifier (RFC) is trained for the prediction of cluster assignments based on MOB feature values. Upon inspection, RC is found to recapitulate chemically intuitive groupings of the frontier molecular orbitals, and the combined RC/LR/RFC and RC/GPR/RFC implementations of MOB-ML are found to provide good prediction accuracy with greatly reduced wall-clock training times. For a dataset of thermalized (350 K) geometries of 7211 organic molecules of up to seven heavy atoms (QM7b-T), both RC/LR/RFC and RC/GPR/RFC reach chemical accuracy (1 kcal/mol prediction error) with only 300 training molecules, while providing 35000-fold and 4500-fold reductions in the wall-clock training time, respectively, compared to MOB-ML without clustering. The resulting models are also demonstrated to retain transferability for the prediction of large-molecule energies with only small-molecule training data. Finally, it is shown that capping the number of training datapoints per cluster leads to further improvements in prediction accuracy with negligible increases in wall-clock training time.
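The first step described in the abstract, regression clustering, can be illustrated with a minimal sketch. It alternates between fitting one linear model per cluster and reassigning each point to the model with the smallest squared residual. Everything here is an assumption made for illustration: a single scalar feature instead of the MOB feature vector, a round-robin initial partition, and the hypothetical names `fit_line`, `total_sse`, and `regression_clustering`. It is not the authors' implementation, and it omits the per-cluster LR/GPR regression and the random forest classifier of steps two and three.

```python
def fit_line(points):
    """Closed-form least-squares fit y = a*x + b for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx if sxx > 0 else 0.0  # all x equal: fall back to a constant fit
    return a, my - a * mx


def residual_sq(point, model):
    """Squared residual of one (x, y) point under a model (a, b)."""
    x, y = point
    a, b = model
    return (y - (a * x + b)) ** 2


def total_sse(points, labels, models):
    """Sum of squared residuals, each point scored by its own cluster's model."""
    return sum(residual_sq(p, models[l]) for p, l in zip(points, labels))


def regression_clustering(points, k, n_iter=50):
    """Alternate between refitting k lines and reassigning points (illustrative)."""
    # Assumed initialization: deal points round-robin into k clusters.
    labels = [i % k for i in range(len(points))]
    models = [(0.0, 0.0)] * k
    for _ in range(n_iter):
        # Regression step: refit each cluster's line on its current members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if len(members) >= 2:
                models[c] = fit_line(members)
        # Assignment step: move each point to the line that fits it best.
        new_labels = [
            min(range(k), key=lambda c: residual_sq(p, models[c])) for p in points
        ]
        if new_labels == labels:
            break  # partition is stable
        labels = new_labels
    return labels, models


if __name__ == "__main__":
    # Synthetic data drawn from two different lines (made up for this example).
    data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
    data += [(float(x), -1.0 * x + 30.0) for x in range(10, 20)]
    labels, models = regression_clustering(data, k=2)
    print("clustered SSE:", total_sse(data, labels, models))
    print("single-line SSE:", total_sse(data, [0] * len(data), [fit_line(data)]))
```

Like K-means, this alternating scheme only reaches a local minimum, so the partition it returns depends on the initialization. Because cluster membership is defined by the regression residual, which needs the label to be known, a new point cannot be assigned this way at prediction time; this is why the abstract introduces a separate classifier trained on the feature values.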

Item Type: Article
Cheng, Lixue (ORCID 0000-0002-7329-0585)
Kovachki, Nikola B. (ORCID 0000-0002-3650-2972)
Welborn, Matthew (ORCID 0000-0001-8659-6535)
Miller, Thomas F., III (ORCID 0000-0002-1882-5380)
Additional Information: © 2019 American Chemical Society. Received: September 4, 2019; Published: October 22, 2019. This work emerged from a CMS 273 class project at Caltech that also involved Dmitry Burov, Jialin Song, Ying Shi Teh, and Dr. Tamara Husch, as well as Professors Kaushik Bhattacharya and Richard Murray; we thank these individuals for their ideas and contributions. This work is supported by the US Air Force Office of Scientific Research (AFOSR) grant FA9550-17-1-0102. M.W. acknowledges a postdoctoral fellowship from the Resnick Sustainability Institute. N.B.K. is supported, in part, by the US National Science Foundation (NSF) grant DMS 1818977, the US Office of Naval Research (ONR) grant N00014-17-1-2079, and the US Army Research Office (ARO) grant W911NF-12-2-0022. Computational resources were provided by the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the DOE Office of Science under contract DE-AC02-05CH11231. The authors declare no competing financial interest.
Group: Resnick Sustainability Institute
Funding Agency: Grant Number
Air Force Office of Scientific Research (AFOSR): FA9550-17-1-0102
Resnick Sustainability Institute: UNSPECIFIED
Office of Naval Research (ONR): N00014-17-1-2079
Army Research Laboratory: W911NF-12-2-0022
Department of Energy (DOE): DE-AC02-05CH11231
Issue or Number: 12
Record Number: CaltechAUTHORS:20191023-150600399
Official Citation: Regression Clustering for Improved Accuracy and Training Costs with Molecular-Orbital-Based Machine Learning. Lixue Cheng, Nikola B. Kovachki, Matthew Welborn, and Thomas F. Miller, III. Journal of Chemical Theory and Computation 2019 15 (12), 6668-6677. DOI: 10.1021/acs.jctc.9b00884
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 99417
Deposited By: Tony Diaz
Deposited On: 23 Oct 2019 23:19
Last Modified: 16 Nov 2021 17:46