ESAMP: event-sourced architecture for materials provenance management and application to accelerated materials discovery
Abstract
While the vision of accelerating materials discovery using data-driven methods is well-founded, practical realization has been throttled by challenges in data generation, ingestion, and materials state-aware machine learning. High-throughput experiments and automated computational workflows are addressing the challenge of data generation, and capitalizing on these emerging data resources requires ingestion of data into an architecture that captures the complex provenance of experiments and simulations. In this manuscript, we describe an event-sourced architecture for materials provenance (ESAMP) that encodes the sequence and interrelationships among events occurring in a simulation or experiment. We use this architecture to ingest a large and varied dataset (MEAD) that contains raw data and metadata from millions of materials synthesis and characterization experiments performed using various modalities such as serial, parallel, and multi-modal experimentation. Our data architecture tracks the evolution of a material's state, enabling a demonstration of how state-equivalency rules can be used to generate datasets that significantly enhance data-driven materials discovery. Specifically, using state-equivalency rules and parameters associated with state-changing processes in addition to the typically used composition data, we demonstrate a marked reduction of uncertainty in the prediction of overpotential for oxygen evolution reaction (OER) catalysts. Finally, we discuss the importance of the ESAMP architecture in enabling several aspects of accelerated materials discovery such as dynamic workflow design, generation of knowledge graphs, and efficient integration of simulation and experiment.
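As an illustration of the idea of event sourcing and state equivalency described in the abstract, the following is a minimal sketch and not the ESAMP schema itself. The class `ProcessEvent`, the `state_changing` flag, the sample identifiers, and the particular equivalency rule (two samples are in equivalent states when their ordered state-changing processes and parameters match) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Process parameters stored as a tuple of (name, value) pairs so events stay hashable.
Params = Tuple[Tuple[str, float], ...]


@dataclass(frozen=True)
class ProcessEvent:
    """One event in a sample's history (hypothetical structure, not the ESAMP schema)."""
    sample_id: str
    timestamp: int
    process_type: str
    parameters: Params = ()
    state_changing: bool = True  # assumed flag: does this process alter the material?


def state_history(events: List[ProcessEvent], sample_id: str, before: int):
    """Replay the event log and return the ordered state-changing steps for one sample."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    return tuple(
        (e.process_type, e.parameters)
        for e in ordered
        if e.sample_id == sample_id and e.state_changing and e.timestamp < before
    )


def states_equivalent(events: List[ProcessEvent], a: str, b: str, before: int) -> bool:
    """Assumed rule: equal state-changing histories imply equivalent material states."""
    return state_history(events, a, before) == state_history(events, b, before)


# Illustrative event log: three samples, two annealed at 400 and one at 500.
EVENTS = [
    ProcessEvent("s1", 1, "print"),
    ProcessEvent("s1", 2, "anneal", (("temperature_C", 400.0),)),
    ProcessEvent("s1", 3, "optical_image", state_changing=False),
    ProcessEvent("s2", 1, "print"),
    ProcessEvent("s2", 2, "anneal", (("temperature_C", 400.0),)),
    ProcessEvent("s3", 1, "print"),
    ProcessEvent("s3", 2, "anneal", (("temperature_C", 500.0),)),
]

same_12 = states_equivalent(EVENTS, "s1", "s2", before=4)
same_13 = states_equivalent(EVENTS, "s1", "s3", before=4)
```

Because the state is derived by replaying events rather than stored as a mutable field, a different equivalency rule can be applied later to the same log without re-ingesting any data.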
Copyright and License
This article is licensed under a Creative Commons Attribution 3.0 Unported Licence.
Acknowledgement
The development and implementation of the architecture were supported by the Toyota Research Institute through the Accelerated Materials Design and Discovery program. Generation of all experimental data was supported by the Joint Center for Artificial Photosynthesis, a US Department of Energy (DOE) Energy Innovation Hub, supported through the Office of Science of the DOE under Award Number DE-SC0004993. The development of the catalyst discovery use case was supported by the US Department of Energy, Office of Science, Office of Basic Energy Sciences, under Award Number DE-SC0020383. The authors thank Dr Edwin Soedarmadji for stewardship of MEAD and all members of the JCAP High Throughput Experimentation group for the generation of the data. The authors thank Daniel Schweigert for providing insights into standard database management practices. The authors thank Thomas E. Morell for facilitating implementation of DOI-based linkages between MPS and CaltechDATA.
Data Availability
The entire MEAD dataset stored in the ESAMP provenance architecture is available as a PostgreSQL database. Using it requires four steps: download the compressed SQL database dump file (.tar.gz format) from https://data.caltech.edu/records/hjfx4-a8r81; install PostgreSQL by following the official PostgreSQL installation instructions; extract the .tar.gz file, which will yield a .sql file; and follow the PostgreSQL documentation to create a new database from the .sql file. This will create a local copy of the database that we present in this work. The data can be browsed using the DBeaver database client. Our Docker container scripts to set up the database are provided here: https://github.com/modelyst/mps-docker. Database generation code: the database discussed in this manuscript was generated using the custom-built DBgen tool: https://github.com/modelyst/dbgen/. Code to generate Fig. 5: all the scripts used to generate this figure are available at https://github.com/TRI-AMDD/ESAMP-usecase. The notebook 'query_and_modeling.ipynb' was used to generate the results and visualizations. The associated database queries are made available in eche_forms_query.sql and eche_pets_query.sql. In addition, helper scripts such as myquaternaryulitity.py, myternaryutility.py, and quaternary_faces_shells.py are provided to aid in visualization.
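The download-and-restore steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the archive is assumed to have been downloaded already, the database name `mead` and the helper names `extract_dump` and `restore_commands` are hypothetical, and the dump is assumed to be a plain SQL file that the standard `createdb` and `psql` command-line tools can load. The commands are only constructed here, not executed.

```python
import tarfile
from pathlib import Path
from typing import List


def extract_dump(archive: Path, dest: Path) -> List[Path]:
    """Extract a .tar.gz archive into dest and return any .sql files it contained."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(dest.rglob("*.sql"))


def restore_commands(sql_file: Path, dbname: str = "mead") -> List[List[str]]:
    """Build (but do not run) the commands that create a database and load the dump.

    'mead' is a placeholder database name chosen for this example.
    """
    return [
        ["createdb", dbname],                          # create an empty database
        ["psql", "-d", dbname, "-f", str(sql_file)],   # replay the SQL dump into it
    ]
```

Each returned command could then be passed to `subprocess.run(cmd, check=True)` on a machine where PostgreSQL is installed and a server is running. The Docker scripts linked above are an alternative that avoids a local installation.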
Conflict of Interest
Modelyst LLC implements custom data management systems in a professional context.
Files
Name | Size |
---|---|
md5:0722916a2b2c641a07e0c6fb60a17136 | 180.0 kB |
md5:1c36de029c2ce30f774b70a74ceb93ac | 516.9 kB |
md5:1a652b62a83baf54a9824d28539d7253 | 283.5 kB |
md5:1acf11fd89dd411e57422d25c352d869 | 1.3 MB |
md5:549dad80284811fc6c347d2a5af22406 | 321.0 kB |
Additional details
- Toyota Research Institute: Accelerated Materials Design and Discovery Program
- United States Department of Energy: DE-SC0004993
- United States Department of Energy: DE-SC0020383
- Caltech groups: Liquid Sunlight Alliance, JCAP