ESAMP: Event-Sourced Architecture for Materials Provenance management and application to accelerated materials discovery

Michael J. Statt,a Brian A. Rohr,a Kris Brown,a Dan Guevarra,c Jens Hummelshoej,b Linda Hung,b Abraham Anapolsky,b John M. Gregoire,c and Santosh K. Suram b

a Modelyst LLC, Palo Alto, CA 94303, United States
b Accelerated Materials Design and Discovery, Toyota Research Institute, Los Altos, CA 94040, United States
c Division of Engineering and Applied Science, California Institute of Technology, Pasadena, CA 91125, United States
Corresponding authors: Michael J. Statt <michael.statt@modelyst.io>, John M. Gregoire <gregoire@caltech.edu>, Santosh K. Suram <santosh.suram@tri.global>
† Electronic Supplementary Information (ESI) available: detailed schema discussion for the relational database implementation of ESAMP.
While the vision of accelerating materials discovery using data driven methods is well-founded, prac-
tical realization has been throttled due to challenges in data generation, ingestion, and materials
state-aware machine learning. High-throughput experiments and automated computational work-
flows are addressing the challenge of data generation, and capitalizing on these emerging data
resources requires ingestion of data into an architecture that captures the complex provenance of
experiments and simulations. In this manuscript, we describe an event-sourced architecture for mate-
rials provenance (ESAMP) that encodes the sequence and interrelationships among events occurring
in a simulation or experiment. We use this architecture to ingest a large and varied dataset (MEAD)
that contains raw data and metadata from millions of materials synthesis and characterization exper-
iments performed using serial, parallel, and multi-modal experimentation. Our
data architecture tracks the evolution of a material's state, enabling a demonstration of how state-
equivalency rules can be used to generate datasets that significantly enhance data-driven materials
discovery. Specifically, using state-equivalency rules and parameters associated with state-changing
processes in addition to the typically used composition data, we demonstrate a marked reduction in
the uncertainty of predicting the overpotential of oxygen evolution reaction (OER) catalysts. Finally, we
discuss the importance of the ESAMP architecture in enabling several aspects of accelerated materials
discovery, such as dynamic workflow design, generation of knowledge graphs, and efficient integration
of theory and experiment.
Introduction
Accelerating materials discovery is critical for a sustainable fu-
ture and the practical realization of emergent technologies. Data-
driven methods are anticipated to play an increasingly significant
role in enabling this desired acceleration, which would be greatly
facilitated by the establishment and community adoption of data
structures and databases that capture data from the broad range
of materials experiments. In computational materials science, au-
tomated workflows have been established to produce large and
diverse materials datasets. While these workflows and their associated
data management tools could be improved to better capture a material's
state and re-configurable analysis methods, their current implementations
have facilitated a host of materials discoveries,1–4 emphasizing the
importance of continued development of materials data architectures. In the
case of experimental materials science, the majority of the data remains in
human-readable format and is not ingested into a database. In
cases where databases do exist, they are either large with limited
scope (e.g., ICSD and ICDD, which contain hundreds of thousands of x-ray
diffraction patterns) or diverse but limited in size.5–7
This has limited application of machine learning for acceleration
of experimental materials discovery to specific datasets such as
microstructure data, x-ray diffraction spectra, x-ray absorption
spectra, or Raman spectra.8–11
Recent applications of high-throughput experimental techniques
have resulted in two large, diverse experimental datasets: a) High
Throughput Experimental Materials (HTEM) dataset, which con-
tains synthesis conditions, chemical composition, crystal struc-
ture, and optoelectronic property measurements (> 150,000
entries), and b) Materials Experiment and Analysis Database
(MEAD) that contains raw data and metadata from millions of
materials synthesis and characterization experiments, as well as
the corresponding property and performance metrics.12,13 These
datasets contain thousands to millions of data entries for a given
type of experimental process, but the experimental conditions or
prior processing of the materials leading up to the process of in-
terest can vary substantially. The multitude of process parameters
and provenances results in datasets whose richness can only be fully
realized and utilized if the context and provenance of each exper-
iment were appropriately modeled. In contrast to computational
data, where the programmatic workflows facilitate provenance
tracking, experimental workflows generally experience more vari-
ability from many on-the-fly decisions as well as environmental
factors and evolution of the instrumentation. Sensitivity to his-
torical measurements is generally higher in experiments since
any measurement could conceivably alter the material, making
any materials experiment a type of "processing." Factors rang-
ing from instrument contamination to drifting detector calibra-
tion also may play a role. Therefore, a piece of experimental data
must be considered in the context of the parameters used for its
generation and the entire experimental provenance.
The importance of sample and process history of the experi-
mental data makes it challenging to identify which measurement
data can be aggregated to enable data-driven discovery. The stan-
dard practice for generating a shareable dataset is to choose data
that match a set of process and provenance parameters and con-
sider most or all other parameters to be inconsequential. This
selection depends heavily on the judgment of the individual researcher. For both
human and machine users of the resulting dataset, the ground
truth of the sample-process provenance is partially or fully miss-
ing. In addition, an injection of assumptions prior to ingestion
into a database creates datasets that do not adhere to the Find-
ability, Accessibility, Interoperability, and Reusability (FAIR) guid-
ing principles,14 resulting in a lack of interoperability and in
data silos that cannot be analyzed efficiently to generate new in-
sights or accelerate materials discovery. As a result, the data's
value is never fully realized, motivating the development of data
management practices that closely link data ingestion to data ac-
quisition.
Given the complexity and variability in materials experimen-
tation, several tailored approaches such as ARES, AIR-Chem,
and Chem-OS have been developed to enable integration be-
tween data ingestion and acquisition for specific types of exper-
iments.15–17 Recently, a more generalizable solution for facili-
tating experiment specification, capture, and automation called
ESCALATE was developed.18 Such approaches aim to streamline
experimental workflows and minimize the information loss that occurs in an
experimental laboratory. We focus on modeling the complete ground truth of
materials provenances that could operate on structured data re-
sulting either from specialized in-house data management soft-
ware or a more general framework such as ESCALATE. We use an
event-sourced architecture for materials provenances (ESAMP)
to capture the ground truth of materials experimentation. This
architecture is inspired by event-sourced architectures used in
software design wherein the whole application state is stored as
a sequence of events. This architecture maintains relationships
among experimental processes, their metadata, and their result-
ing primary data to strive for comprehensive representation of
the experiments. We believe that these attributes make ESAMP
broadly applicable for materials experiments and beyond. We
discuss database architecture decisions that enable deployment
for a range of experiment throughput and automation levels. We
also discuss the applicability of ESAMP to primary data acquisi-
tion modes such as serial, parallel, and multimodal experimenta-
tion. Finally, we present a specific instantiation of ESAMP for one
of the largest experimental materials databases (MEAD) consist-
ing of more than 6 million measurements on 1.5 million samples.
We demonstrate facile information retrieval, analysis, and knowl-
edge generation from this database. The primary use case de-
scribed herein involves training machine learning models for cat-
alyst discovery, where different definitions of provenance equiva-
lence yield different datasets for model training that profoundly
impact the ability to predict catalytic activity in new composition
spaces. We also discuss the universality of our approach for ma-
terials data management and its opportunities for the adoption of
machine learning in many different aspects of materials research.
ESAMP Description
Overview
ESAMP is a database architecture designed to store experimental
materials science data. It aims to capture three types of data:
1) information about the samples in the database, including the
provenance of how they were created and what processes they
have undergone, 2) raw data
from processes run on the samples, and 3) information derived
from analyses of these raw data.
Altogether, this architecture enables users to use simple SQL
queries to answer questions like:
• What is the complete history of a given sample and any other
samples used to create this one?
• How many samples have had XRD run on them both before
and after an electrochemistry experiment?
• What is the figure of merit resulting from a given set of raw
data analyzed using different methods?
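For instance, under an illustrative schema (the table and column names here are invented for illustration and need not match the published implementation), the first question above might be answered with a query of the following form:

-- Sketch: the full process history of one sample, in chronological order.
-- Table, column, and label names are illustrative.
SELECT p.process_id,
       pd.process_type,
       p.performed_at
FROM sample          s
JOIN sample_process  sp ON sp.sample_id  = s.sample_id
JOIN process         p  ON p.process_id  = sp.process_id
JOIN process_details pd ON pd.process_id = p.process_id
WHERE s.label = 'plate_4832_sample_17'     -- hypothetical sample label
ORDER BY p.performed_at;

Ancestor samples could be folded in by additionally joining through the sample_ancestor table introduced below.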
Identification of data to evaluate any scientific question re-
quires consideration of the context of the data, motivating our
design of the ESAMP structure to intuitively specify contextual re-
quirements of the data. For example, if a researcher wishes to be-
gin a machine learning project, creating a custom dataset for their
project can be done by querying data in the ESAMP architecture.
For instance, training data for machine learning prediction of the
overpotential in chronopotentiometry (CP) experiments from cat-
alyst composition can be obtained via a query to answer questions
such as
• Which samples have undergone XPS then CP?
• How diverse are the sample compositions in a dataset?
The researcher may further restrict the results to create a bal-
anced dataset or a dataset with specified heterogeneity with re-
spect to provenance and experiment parameters. The query pro-
vides transparent self-documentation of the origins of such a
dataset; any other researcher wondering how the dataset was cre-
ated can look at the WHERE clause in the SQL query to see what
data was included and excluded.
To enable these benefits, we must first track the state of samples
and instruments involved in a laboratory to capture the ground
truth completely. In this article, we focus mainly on the state of
samples and note that the architecture could capture the state of
instruments or other research entities. A sample's provenance can
be tracked by considering three key entities: Sample, Process,
and Process_data, which are designed to provide intuitive inges-
tion of data from both traditional manual experiments and their
automated or robotic analogues.
Sample: A sample is a label that specifies a physically-
identifiable representation of an entity that can undergo many
processes (e.g. the liquid in that vial or the thin film on that
substrate). Samples can be combined or split to form complex
lineages, such as an anode and a cathode being joined in a bat-
tery or a vial of precursor used in multiple catalyst preparations.
The only fundamental assumption placed on a sample is that it
has a unique identifier so that its lineage and process history can
be tracked.
Process: A process is an event that occurs to one or more sam-
ples. It is associated with an experiment in a laboratory, such as
annealing in a sample furnace or performing spectroscopic char-
acterization. Processes have input parameters and are identified
by the machine (or human) that performed them at a specific
time.
Process_data: Process data is data generated by a process that
applies to one or more samples that underwent that process.
Since the process, but not the specific data it produces, is central to sample prove-
nance, management of process_data can occur in a connected but distinct part
of the framework. As many raw outputs from scientific processes
are difficult to interpret without many additional steps of analy-
sis, process_data is connected to a section of the framework devoted to itera-
tive steps of analysis where it is transformed and combined to form
higher-level figures of merit (FOM).
These three entities connected via a sample_process table form
the framework’s central structure. Figure 1 shows these entities
and their relationships. The three shaded boxes indicate the sec-
ondary tables that support the central tables by storing process
details, sample details, and analyses. Each region is expanded
upon below.
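A minimal relational sketch of this central structure is given below in PostgreSQL syntax; the column names are illustrative choices rather than the exact published schema (the ESI contains the detailed schema discussion).

-- Illustrative PostgreSQL sketch of the central entities.
CREATE TABLE sample (
    sample_id    BIGSERIAL PRIMARY KEY,
    label        TEXT NOT NULL UNIQUE        -- the physically identifiable label
);

CREATE TABLE process (
    process_id   BIGSERIAL PRIMARY KEY,
    machine_name TEXT NOT NULL,              -- machine or human that performed it
    performed_at TIMESTAMPTZ NOT NULL,       -- timestamp of the event
    ordering     INTEGER NOT NULL DEFAULT 0, -- tie-breaker when timestamps are coarse
    UNIQUE (machine_name, performed_at, ordering)
);

-- mapping table: one row per (sample, process) pair
CREATE TABLE sample_process (
    sample_process_id BIGSERIAL PRIMARY KEY,
    sample_id    BIGINT NOT NULL REFERENCES sample (sample_id),
    process_id   BIGINT NOT NULL REFERENCES process (process_id),
    UNIQUE (sample_id, process_id)
);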
Fig. 1
An overview of the framework showing the central location of
the sample_process entity and its relationship to the three major areas
of the framework.
Samples, Collections, and Lineage
The trinity of sample, process, and process_data enables
a generalized framework that captures the ground truth associ-
ated with any given sample in an experimental dataset. How-
ever, interpretation of experimental data requires us to capture
the provenance of a sample completely. That is, throughout the
sample’s lifetime, it is important to track three key things:
Fig. 2
An overview of the three major areas of the framework as shown in
Figure 1. Each region is centered on one of the three entities connected
to the central entity: (a) sample, (b) process, and (c) process_data.
• How was the sample created?
• What processes occurred to the sample?
• If the sample no longer exists, how was it consumed?
The middle question is directly answered by the sequence of en-
tries in the sample_process table wherein each record in sam-
ple_process specifies the time that a sample underwent a pro-
cess. This concept is complicated by processes that merge, split,
or otherwise alter physical identification of samples. Such pro-
cesses are often responsible for the creation and consumption
of samples, for example the deposition of a catalyst onto an
electrode or the use of the same precursor in many different
molecule formulations. In these cases, the process history of the
“parent” catalyst or precursor is an inherent part of the prove-
nance of the “child” catalyst electrode or molecular material.
These potentially-complex lineages are tracked through the sam-
ple_ancestor and sample_parent entities as shown in Figure 2(a).
Both the sample_parent and sample_ancestor entities are defined by their connec-
tion to two sample entities, indicating a parent/ancestor and
child/descendant relationship, respectively. The sample_parent entity indicates
that the child sample was created from the parent sample and
should inherit its process history lineage. Each sample_parent record can be decorated
with additional attributes to indicate its role in the parent-child
relationship, such as labeling the anode and cathode when creat-
ing a battery. The sample_ancestor entity is nearly identical to sample_parent, with an additional
attribute called “rank” that indicates the number of generations
between the ancestor and the descendant. A rank of 0 indicates
a parent-child relationship, while a rank of 2 indicates a great-
grandparent type relationship. These two entities can capture the
complex lineages produced by experimental workflows.
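As an illustration of how the two entities relate, the sample_ancestor table (with its rank attribute) could in principle be derived from sample_parent with a recursive query; the column names below are again illustrative.

-- Derive ancestor/descendant pairs and their rank from direct parent links.
-- rank 0 = parent, rank 1 = grandparent, rank 2 = great-grandparent, ...
WITH RECURSIVE lineage AS (
    SELECT parent_id, child_id, 0 AS rank
    FROM sample_parent
    UNION ALL
    SELECT sp.parent_id, l.child_id, l.rank + 1
    FROM sample_parent sp
    JOIN lineage l ON sp.child_id = l.parent_id   -- step one generation up
)
SELECT parent_id AS ancestor_id,
       child_id  AS descendant_id,
       rank
FROM lineage;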
The final entity connected to a sample is the collection. It is
common for researchers to group samples. For example, in high
throughput experiments many samples may exist on the same
chip or plate, or researchers may include in a collection all sam-
ples synthesized for a single project. In these cases, researchers
need to be able to keep track of and make queries based on that
information. It is clear from the previously-mentioned example
that many samples can (and almost always do) belong to at least
one collection. It is also important that we allow for the same
sample to exist in many collections. For example, a researcher
may want to group samples by which plate or wafer they are on,
which high-level project they are a part of, and which account
they should be billed to all at the same time. The corresponding
many-to-many relationships are supported by ESAMP.
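One illustrative way to express this (names are ours, not the published schema) is a collection table plus a mapping table, so that the same sample can sit in a plate collection, a project collection, and a billing collection simultaneously:

CREATE TABLE collection (
    collection_id BIGSERIAL PRIMARY KEY,
    name          TEXT NOT NULL UNIQUE       -- e.g. 'plate_4832' or 'OER_screening'
);

CREATE TABLE sample_collection (
    sample_id     BIGINT NOT NULL REFERENCES sample (sample_id),
    collection_id BIGINT NOT NULL REFERENCES collection (collection_id),
    PRIMARY KEY (sample_id, collection_id)   -- a sample may appear in many collections
);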
Processes & Process Details
A process represents one experimental procedure (e.g. a synthesis
or characterization) that is applied to a sample. The only require-
ment imposed on processes is that it must be possible to sort them
chronologically. Chronological sorting is essential for accurately
representing a sample's process history. Therefore, each process is
uniquely associated with a timestamp and a machine/user. There
is an underlying assumption that at a given process time on
a given machine (or by a given user), only one process is occurring, although that
process may involve multiple samples.
While single-step experiments on machine-based workflows
can easily provide a precise timestamp for each process, it is cum-
bersome and error-prone for researchers to provide these at the
timescale of seconds or even hours. Additionally, some multi-step
processes may reuse the initial timestamp throughout each step,
associating an initiation timestamp with a closely-coupled series
of experiments whose ordering is known but whose individual
timestamps are not tracked. It is important to add a simple order-
ing parameter to represent the chronology when the timestamp
alone is insufficient. For tracking manual experiments, this order-
ing parameter allows researchers to record the date and a counter
for the number of experiments they have completed that day. In
multi-step processes, each step can be associated with an index to
record the order of steps.
Processes indicate that an experimental event has occurred to
one or more samples. However, it is important to track informa-
tion describing the type of process that occurred and the process
parameters used, or generally any information that would be re-
quired to reproduce the experiment. A given research workflow
may comprise many different types of experiments, such as elec-
trochemical, XPS, or deposition processes. Each of these types of
processes will also be associated with a set of input parameters.
The process_details entity and its associated process-specific tables are used to
track this important metadata for each process. A more com-
prehensive discussion on the representation of process details for
various relational database management system (RDMS) imple-
mentations is provided in the SI.
Process Data & Analysis
While process_details tracks the inputs to a process, process_data tracks its outputs. For repro-
ducibility, transparency, and the ability to continue experiments with-
out reliance on an active database connection, it is prudent to
store process outputs as raw files independent of the data man-
agement framework. Therefore, while process_data may include relevant data
parsed from the raw files, it should always include a raw file
path. Additionally, attributes can be added to specify the location
to search for the file, such as an Amazon S3 bucket or local stor-
age drive. A single file may also contain multiple pieces of data
that each refer to different samples. This complexity motivates
the inclusion of start and end line numbers as file-identifying
information for process_data. If an entire file should be consumed as a
single piece of process data, null values can be provided for those
attributes. As a significant amount of scientific data is stored as
comma-separated values (CSV) files, it can also be beneficial to
parse these files directly into values in the database utilizing flexi-
ble column data types, such as JavaScript Object Notation (JSON)
that is supported by modern RDMS’s. For large datasets, stor-
ing data using efficient binary serializations such as MessagePack
could be beneficial.19
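These storage conventions might translate into a process_data table along the following lines (the column names are illustrative; the exact schema is discussed in the ESI):

CREATE TABLE process_data (
    process_data_id  BIGSERIAL PRIMARY KEY,
    file_path        TEXT NOT NULL,   -- raw file kept outside the database
    storage_location TEXT,            -- e.g. an S3 bucket or a local drive
    line_start       INTEGER,         -- NULL when the whole file is one datum
    line_end         INTEGER,
    parsed_values    JSONB            -- optional values parsed from CSV-like files
);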
The relationship between process outputs and their associated
processes and samples can be complex. The most straightforward
relationship is one piece of process data is generated for a sin-
gle sample, which is typically the case for serial experimentation
and traditional experimentation performed without automation.
In parallel experimentation, a single process involves many sam-
ples, and if the resulting data is relevant to all samples, has a
many-to-one relationship to . In multi-modal experiments, mul-
tiple detectors can generate multiple pieces of data for a single
sample in a single process, where has a one-to-many relationship
to . Parallel, multi-model experimentation can result in many-
to-many relationships. To model these different types of experi-
mentation in a uniform manner, ESAMP manages many-to-many
relationships between and .
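A single mapping table (again with illustrative names) covers all of these cases uniformly:

-- serial: one row; parallel: many sample_process rows per process_data;
-- multi-modal: many process_data rows per sample_process.
CREATE TABLE sample_process_data (
    sample_process_id BIGINT NOT NULL REFERENCES sample_process (sample_process_id),
    process_data_id   BIGINT NOT NULL REFERENCES process_data (process_data_id),
    PRIMARY KEY (sample_process_id, process_data_id)
);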
The raw output of scientific processes may require several iter-
ative analytical steps before the desired results can be obtained.
As the core tenet of this framework design is tracking the full
provenance of scientific data, analytical steps must have their lin-
eage tracked similarly to that of samples and processes. This is
achieved by the analysis, analysis_details, and analysis_parent ta-
bles. The analysis table represents a single analytical step and,
similar to process, is identified by its inputs, outputs, and associated pa-
rameters. Just as process has a many-to-many relationship with sample,
analysis has a many-to-many relationship with process_data; a
piece of process data can be used as an input to multiple analyses,
and a single analysis can have multiple pieces of process data as
inputs. The type of analysis and its input parameters are stored in
the analysis_details entity. The analysis type should define the an-
alytical transformation function applied to the inputs, while the
parameters are fed into the function alongside the data inputs.
An important difference between analysis and process is that an analy-
sis can use the outputs of multiple process_data and analysis entities as inputs.
This is analogous to the parent-child relationship mod-
eled by sample_parent. The introduction of the analysis_parent table allows this
complex lineage to be modeled, so that even the most
complex analytical outputs can be traced back to the raw process_data entities
and the intermediate analyses on which they are based.
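A minimal sketch of these analysis tables (names again illustrative) might look as follows:

CREATE TABLE analysis (
    analysis_id   BIGSERIAL PRIMARY KEY,
    output        JSONB                      -- e.g. a figure of merit
);

CREATE TABLE analysis_details (
    analysis_id   BIGINT PRIMARY KEY REFERENCES analysis (analysis_id),
    analysis_type TEXT NOT NULL,             -- names the transformation function
    parameters    JSONB                      -- parameters fed in alongside the data
);

-- many-to-many: which process_data served as inputs to an analysis
CREATE TABLE analysis_process_data (
    analysis_id     BIGINT NOT NULL REFERENCES analysis (analysis_id),
    process_data_id BIGINT NOT NULL REFERENCES process_data (process_data_id),
    PRIMARY KEY (analysis_id, process_data_id)
);

-- analysis-to-analysis lineage, analogous to sample_parent
CREATE TABLE analysis_parent (
    parent_id BIGINT NOT NULL REFERENCES analysis (analysis_id),
    child_id  BIGINT NOT NULL REFERENCES analysis (analysis_id),
    PRIMARY KEY (parent_id, child_id)
);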
State
During experiments a sample may be intentionally or uninten-
tionally altered. For example, a researcher could measure the
composition of a sample, perform an electrochemical process
that unknowingly changes the composition, and finally perform
a spectroscopic characterization. Even though the sample label is
preserved throughout these three processes, directly associating
the composition measurement with the spectroscopic measure-
ment can lead to incorrect analysis because the intervening pro-
cess altered the link between the two. This example motivates
the need for the final entity in the framework, state. The ESAMP
model for state assumes that every process irreversibly changes
the sample. A state is defined by two sample_process entities that
share the same sample and have no sample_process chronologi-
cally between them. By managing state under the most conserva-
tive assumption that every process alters the sample's state, any
state equivalency rules (SERs), i.e. whether a certain type of pro-
cess alters the state or not, can be applied in a transparent man-
ner. A new state table can be constructed from these SERs, which
may be easily modified either by a human or a machine.
Fig. 3
An example of a sample state graph. Sample 1 is shown under-
going five processes with types P1, P2, or P3. A state is defined between
every process. The right boxes show how different sets of rules governing
whether a process is state-changing or not can change the equivalency
between the states. Without any rules, all processes are assumed to be
state-changing, and no states are equivalent. This constraint can be fully
relaxed to make all states equivalent. It can also be partially relaxed based
on process type or process details, such as γ, as shown in the lower two
rule sets.
As state essentially provides a link between the input and out-
put of a process, it is best visualized as a graph. Figure 3 shows
an example state graph. Sample 1 undergoes a series of five pro-
cesses that involve three distinct types of processes. A new state
is created after each process. If no relaxation assumptions are ap-
plied, all processes are assumed to be state-changing, and since
all states are non-equivalent, it might be invalid to share process
data or derived analysis amongst them. Under the most relaxed
constraint, no processes are state-changing. However, the utility
of state is the ability to apply domain- and use-specific rules to
model SERs. For example, consider process 3 (P3) to be a de-
structive electrochemical experiment that changes the sample's
composition, while the other processes are innocuous character-
ization experiments. By designating only P3 as state-changing,
the sample can be considered to have only two unique states. SERs
can be further parameterized by utilizing the process_details of the process to
determine state-changing behavior. For example, if P2 is an an-
neal step, we might only consider it state-changing if the tempera-
ture rises above a certain level. By defining simple rules, merging
equivalent states yields simpler state graphs that serve as the ba-
sis for dataset curation. This powerful concept of state is enabled
by the core framework's ability to track the process provenance of
samples throughout their lifetime.
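As a hedged sketch of how such an SER could be expressed over the tables sketched earlier (the process types, temperature column, and threshold are illustrative), a state-changing flag can be computed per sample_process row and equivalent states merged downstream:

-- Rule set: electrochemistry always changes the state; an anneal changes it
-- only above an illustrative threshold; everything else is state-preserving.
SELECT sp.sample_process_id,
       CASE
           WHEN pd.process_type = 'electrochemistry'                      THEN TRUE
           WHEN pd.process_type = 'anneal' AND pd.max_temperature_c > 300 THEN TRUE
           ELSE FALSE
       END AS state_changing
FROM sample_process  sp
JOIN process         p  ON p.process_id  = sp.process_id
JOIN process_details pd ON pd.process_id = p.process_id;

Consecutive sample_process rows for the same sample whose intervening processes are all flagged FALSE then collapse into a single equivalent state.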
Database Implementation
The framework so far has been defined using standard entity rela-
tionship language. It is important to note that this framework can
be instantiated in most or all RDMS’s and is not tied to a specific
implementation. However, the specific implementation details of
the framework may change slightly depending on the RDMS used.
These changes are vital in deciding which RDMS is ap-
propriate for a particular use case.
Figure S1 shows the framework in its entirety. All double-sided
arrows indicate a many-to-many relationship. The implementa-
tion of many-to-many relationships differs between SQL, NoSQL,
and graph databases. In a SQL RDMS such as PostgreSQL, the
standard practice uses a "mapping" table where a row is defined
simply by its relationship to the two tables with the many-to-
many relationship. In graph databases, many-to-many relation-
ships can be represented simply as an edge between two nodes.
Additionally, entities that track lineages, such as sample_parent, state, and anal-
ysis_parent, can also be represented simply as edges between two
nodes of the same type. The cost of this simplicity is the reduced
constraints on column datatypes as well as a less standardized
and, in some cases, less powerful query functionality.
Fig. 4
A full graphical representation of the framework described in
Figures 1 and 2. Single headed arrows indicate a many-to-one relationship
in the direction of the arrow. Double-headed arrows indicate a many-to-
many relationship.
If complicated process provenance and lineages are expected
to exist along with a need to query those lineages, then a graph
database may be the right choice. However, if simpler lineages
with large amounts of well-structured data are used, a standard
SQL RDMS would be more advantageous. Data can even be mi-
grated quite easily between implementations of this framework
in two RDMS’s if the slight differences noted above are carefully
considered.
Results
Implementation of the ESAMP framework is demonstrated via in-
gestion and modeling of MEAD, the database resulting from high
throughput experimental investigation of solar fuels materials in
the Joint Center for Artificial Photosynthesis (JCAP).20 MEAD
contains a breadth and depth of experiments that make it repre-
sentative of a broad range of materials experiments. For example,
the 51 types of processes include serial, parallel, and multi-modal
experiments.
Using the most conservative rule that every process is state-
changing, the database contains approximately 17 million mate-
rial states. This dataset contains many compositions in high-order
composition spaces, particularly metal oxides with three or more
cation elements. For electrocatalysis of the oxygen evolution reac-
tion (OER), the high throughput experiments underlying MEAD
have led to the discovery of catalysts with nanostructured mix-
tures of metal oxides in such high-order composition spaces.21–23
Given the vast number of unique compositions in these high-
dimensional search spaces, a critical capability for accelerating
catalyst discovery is the generation of machine learning models
that can predict composition-activity trends in high-order compo-
sition spaces, motivating illustration of ESAMP for this use case.
Catalyst discovery use case
To demonstrate the importance and utility of the management
of process provenance and parameters, we consider a use case
where data is curated to train a machine learning model and pre-
dict the catalytic activity of new catalyst compositions. We com-
mence by considering all MEAD measurements of metal oxides
synthesized by inkjet printing and evaluated as OER electrocat-
alysts, particularly the OER overpotential for an anodic electro-
chemical current density of 3 mA cm−2. This overpotential is the
electrochemical potential above 1.23 V vs RHE required to obtain
the current density, so smaller values correspond to higher, desir-
able catalytic activity. Measurement of this overpotential can be
made by cyclic voltammogram (CV) or chronopotentiometry (CP)
measurements.
Querying MEAD for all measurements of this overpotential and
identifying the synthesis composition for each sample produces
a dataset of composition and activity regardless of each sample’s
history prior to the CP experiment and the electrochemical con-
ditions of the measurement. This dataset is referred to as dataset
A in Figure 5a and contains 262,087 measurements of overpo-
tential. Considering a provenance to be the ordered set of pro-
cess types that occurred up to the overpotential measurement,
this dataset contains 460 unique provenances. To increase the
homogeneity in provenance and materials processing, the SERs
can require that the catalyst samples have been annealed at 400 °C.
Additionally, to generate a single activity metric for each sam-
ple, the SERs can also require only the most recent or “latest”
measurement of activity, which results in a dataset B contain-
ing 43,860 measurements, corresponding to 113 unique prove-
nances. To further increase the homogeneity, the SERs can also
require the electrolyte pH to be within 0.5 of pH 13 and require
those catalysts to have been operated for at least 100 minutes be-
fore catalyst activity measurement, resulting in dataset C contain-
ing 10,885 measurements. This dataset contains only 26 unique
provenances that differ in their sequence of electrochemical ex-
periments that preceded the overpotential measurement.
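Because the SERs above are just predicates over process provenance and parameters, a dataset-C-style curation can be written directly as a query. All table, column, and process-type names below are illustrative, and the real schema would differ in detail; joining to the derived overpotential would go through process_data and analysis.

-- Latest overpotential measurement per sample, restricted to samples that were
-- annealed at 400 °C, measured near pH 13, and operated >= 100 min beforehand.
SELECT DISTINCT ON (sp.sample_id)
       sp.sample_id,
       p.process_id AS measurement_process
FROM sample_process  sp
JOIN process         p  ON p.process_id  = sp.process_id
JOIN process_details pd ON pd.process_id = p.process_id
WHERE pd.process_type IN ('CP', 'CV')
  AND abs(pd.electrolyte_ph - 13) <= 0.5
  AND EXISTS (                                   -- an earlier 400 °C anneal
        SELECT 1
        FROM sample_process sp2
        JOIN process p2          ON p2.process_id  = sp2.process_id
        JOIN process_details pd2 ON pd2.process_id = p2.process_id
        WHERE sp2.sample_id = sp.sample_id
          AND pd2.process_type = 'anneal'
          AND pd2.temperature_c = 400
          AND p2.performed_at < p.performed_at)
  AND (                                          -- >= 100 min of prior operation
        SELECT coalesce(sum(pd3.duration_min), 0)
        FROM sample_process sp3
        JOIN process p3          ON p3.process_id  = sp3.process_id
        JOIN process_details pd3 ON pd3.process_id = p3.process_id
        WHERE sp3.sample_id = sp.sample_id
          AND pd3.process_type IN ('CP', 'CV')
          AND p3.performed_at < p.performed_at) >= 100
ORDER BY sp.sample_id, p.performed_at DESC;      -- keep only the latest measurement

As noted earlier, the WHERE clause itself documents how the dataset was curated.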
Dataset C contains 54 unique 4-cation composition spaces. To
demonstrate machine learning prediction of catalyst activity in
new composition spaces, each of these 54 combinations of 4-
cation elements is treated as an independent data instance in
which the test set is taken to be all catalyst measurements from
dataset C where the catalyst composition contains three or all four
of the respective 4-cation elements. Keeping the test set consis-
tent, three independent eXtreme Gradient Boosting (XGB) ran-
dom forest regression models, one for each of the three datasets,
were trained to predict overpotential from composition, where in
each case the composition spaces that comprise the test set are held
out from training. Repeating these exercises for all 54 data in-
stances enables calculation of the aggregate mean absolute er-
ror (MAE) for predicting catalyst activity, as shown in Figure 5a
for the three different datasets. The MAE improves consider-
ably when increasing the homogeneity of provenance and exper-
imental parameters from dataset A to B and from dataset B to
C, demonstrating the value of using appropriate SERs to curate
materials databases with specific provenance and property con-
ditions to generate suitable training data for a specific prediction
task.
The parameters used for creating the SERs can also be con-
sidered as properties of the catalyst measurements, enabling the
training of machine learning models that not only use composi-
tion as input but also additional parameters, in the present case
the maximum annealing temperature, the number of previous
measurements of the catalyst activity, the electrolyte pH, the du-
ration of prior catalyst stability measurements, and whether the
measurement occurred by CV or CP. Figure 5a shows the corre-
sponding results for the same exercise described above wherein
the aggregate MAE is calculated for each dataset A, B, and C. This
more expressive input space enables a substantial decrease in the
MAE when using the less homogeneous datasets, A and B.
For the Ce-Fe-Mn-Ni data instance, Figure 5b shows the pre-
diction using dataset B and only composition as model input,
resulting in an MAE of 216 mV. Using the same dataset but ex-
panding the model input to include the experiment and cata-
lyst parameters lowers the MAE to 19 mV, which is the approx-
imate measurement uncertainty (Figure 5c). Comparison to the
ground truth values in Figure 5d reveals that the prediction in Fig-
ure 5c captures the broad range in activity and the composition-
activity trends in each of the four 3-cation and 4-cation compo-
sition spaces. Overall, these results demonstrate that curation of
data to accelerate materials discovery via machine learning re-
quires management of experiment provenance and parameters.
Fig. 5
Machine learning for catalyst discovery use case: prediction of OER overpotential for 3 mA cm−2 in 3-cation and 4-cation composition spaces. Datasets for model training are A: all measurements of this performance metric, B: only the most recent measurement of activity for catalysts annealed at 400 °C, and C: the measurements from B made in pH 13 electrolyte and succeeding at least 100 minutes of catalyst operation. a) The dataset size in terms of the number of overpotential measurements and the number of unique provenances (right axes) and MAE (left axis) for the three datasets, where the MAE is aggregated over 54 data instances of machine learning prediction, both from prediction using only composition and from prediction using composition and experiment parameters. b) The overpotential predicted from composition for the Ce-Fe-Mn-Ni data instance using dataset B, resulting in an MAE of 216 mV. c) The analogous result using composition and experiment parameters, which lowers the MAE to 19 mV. d) The ground truth data, where the element labels for the composition graph as well as the overpotential color scale apply to b) and c) as well.
Discussion
Automated ML pipelines
In the catalyst discovery use case described above, we identi-
fied that the choice of state-changing processes had a signifi-
cant effect on predicting OER overpotential. To avoid making
such decisions a priori, which is often not possible in experimen-
tal research, all distinguishable processes should be reflected in
the data management. For instance, a sample storage event is
typically assumed to be non-state-changing, which may not be
the case. The simplest example is air-sensitive materials, whose
handling between experiments should be documented as
"sample handling" processes. The ESAMP framework allows
every event in the laboratory to be defined as a process. However,
in practice, capturing every event is infeasible in a
typical laboratory setting. There may always exist "hidden" pro-
cesses that altered a material’s state but were not tracked, which
compounds the issues discussed above with human-made deci-
sions about what processes are state-changing and whether that
designation varies with either sample or process parameters. By
liberally defining what constitutes a process and aggregating data
from many experimental workflows, ESAMP will ultimately en-
able machine learning to identify hidden processes and which
tracked processes are indeed state-changing.
Recently, several research works have focused on developing
closed-loop methods to identify optimal materials and processing
conditions for several applications such as carbon nanotube syn-
thesis,24 halide perovskite synthesis,25 and organic thin-film syn-
thesis.26 The workflows of these experiments are typically static.
Similarly, several high-throughput experimental systems deploy
static workflows or utilize simple if-then logic to choose amongst
a set of pre-defined workflows. Machine learning on data defined
using ESAMP that contains various process provenances along
with definitions of state-changing processes will enable dynamic
identification of workflows that maximize knowledge extraction.
Generality for modeling other experimental workflows
While the breadth of process provenances and the dynamic
range of depth within each type of provenance make the MEAD
database an excellent demonstrator of ESAMP, the provenance
management and the database schema are intended to be general
to all experimental and computational workflows. A given type
of experiment may be considered equivalent when performed in
two different labs, although differences in the process parameters
and data management have created hurdles to universal materi-
als data management. Such differences may require lab-specific
ingestion scripts and process_details tables, but custom development of these
components of ESAMP comprises a low-overhead expansion of
the database to accept data from new labs as well as new types of
processes. One of the most widely used experimental inorganic
crystal structural and diffraction databases (ICDD) was generated
by manual curation and aggregation over several decades of x-ray
diffraction data generated in many laboratories. We anticipate
that ESAMP’s universal data management will result in a more
facile generation of several large experimental datasets with full
provenance that enables data-driven accelerated materials discov-
eries.
In addition to the generation of new insights from provenance
management and acceleration of research via more effective in-
corporation of machine learning, we envision materials prove-
nance management to profoundly impact the integrity of experi-
mental science. In the physical sciences, the complexity of mod-
ern experimentation contributes to issues with reproducing pub-
lished results.27 However, the complexity itself is not the issue,
but rather the inability of the Methods sections in journal articles
to adequately describe the materials provenance, for example, via
exclusion of parameters or processing steps that were assumed to
be unimportant, which is exacerbated by complex, many-process
workflows. Provided an architecture for provenance manage-
ment such as ESAMP, data can ultimately determine what pa-
rameters and processes are essential for reproducible materials
experiments.
Generation of knowledge graphs and data networks
As discussed above, we anticipate ESAMP to provide the frame-
work that enables the curation of large and diverse datasets with
full provenance. Such large datasets are a great starting point for
machine learning applications. However, ESAMP is quite gen-
eral, and adapting a more specific data framework to one’s use
case can make knowledge extraction easier. These frameworks
may extract subsets of the data stored in the main framework and
apply simplifying assumptions suited to the specific use case.
However, as long as a link exists between the higher-level frame-
work and ESAMP, then the complete provenance information will
still be preserved and queryable. Machine learning datasets, such
as those described in datasets A, B, and C in the above use case,
are examples of a practical higher-level extraction.
One example of a higher-level framework enabled by ESAMP is
that of knowledge graphs. Knowledge graphs are a powerful
abstraction for storing, accessing, and interpreting data about
entities interlinked by an ontology of classes and relations.28
This allows for formal reasoning, with reasoning engines de-
signed for queries like "Return all triples (x1, x2, x3) where
φ(x1, x2) and φ(x1, x3) and (ψ(x2, x3) if and only if θ(x3))". Beyond di-
rect queries which produce tabular results suited for traditional
machine learning applications, machine learning models can
be applied directly to relational databases29,30 and knowledge
graphs.31 Applications involving knowledge graphs and ontolo-
gies have been explored in the space of chemistry and materials
science research.32,33
The population of knowledge graphs is mainly facilitated by
ESAMP in two ways. Firstly, data within a relational database
structure is straightforwardly mappable into the data structure
of knowledge graph triples.34 Secondly, a solid grasp of how
to resolve distinct entities can be achieved through ESAMP be-
fore populating the nodes of the knowledge graph. Alternative
approaches of merging all samples with the same label or con-
sidering every possibly-distinct sample to be a unique material
are too coarse- and fine-grained, respectively. Beyond knowledge
graphs, other high-level frameworks specialize in the migration
and merging of data between research groups that structure their
experiments and analyses differently,35 and these demand struc-
tured data such as ESAMP data as their initial input.
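As a minimal illustration of the first of these two points (the entity prefixes and predicate names are invented for illustration), each mapping-table row already reads as a subject-predicate-object triple:

-- Emit (sample, underwent, process) and (process, has_type, ...) triples.
SELECT 'sample:'  || sp.sample_id  AS subject,
       'underwent'                 AS predicate,
       'process:' || sp.process_id AS object
FROM sample_process sp
UNION ALL
SELECT 'process:' || pd.process_id,
       'has_type',
       pd.process_type
FROM process_details pd;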
Better theory-experiment integration
While the initial focus on experimental materials data is moti-
vated by the historical lack of concerted effort to manage and
track provenance compared to computational materials science,
we also envision that ESAMP has the inherent flexibility and
expressiveness to model computational materials provenance as
well and assist in the holy grail of combining theory, experiment,
and machine learning systematically.
In the computational context, simulation details are recorded
from automated workflows, input and output files, and code doc-
umentation, so the provenance and parameters involved in com-
putations are simpler to track and ingest than experiments. To
ingest computational data into ESAMP, we would consider the
physically relevant aspects of a simulation, such as atomic po-
sitions, composition, microstructure, or device components and
dimensions, to comprise a sample. The simulation itself would be
the process, with numerical parameters, theoretical approxima-
tions, and the compute hardware and software, potentially being
relevant process details. Output files and logs would be the pro-
cess data. Just as samples in experiments can undergo multiple
processes, a simulation "sample" can start in a specific configura-
tion, undergo optimization during a "process," and the new con-
figuration, still associated with the same "sample," can be passed
on for further processing. Computational samples could be com-
bined – results of a simulation are mixed in a new simulation – or
partitioned into new samples. The ESAMP framework's ability to build
analyses on multiple process data components is also
relevant when post-processing simulation data.
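As a hedged sketch (the sample label, machine name, and other values are invented for illustration), ingesting a single geometry relaxation could reuse exactly the entities defined above:

-- One relaxation run: the structure is the sample, the simulation the process.
WITH new_sample AS (
    INSERT INTO sample (label)
    VALUES ('FeOOH_slab_run_0421')            -- hypothetical structure identifier
    RETURNING sample_id
), new_process AS (
    INSERT INTO process (machine_name, performed_at)
    VALUES ('hpc-cluster-01', now())          -- compute resource and submission time
    RETURNING process_id
)
INSERT INTO sample_process (sample_id, process_id)
SELECT sample_id, process_id FROM new_sample, new_process;
-- process_details (code, functional, cutoffs) and process_data (output file
-- paths, parsed energies) would be populated analogously.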
Integrating theoretical and experimental workflows has long
been pursued in materials research. If a computational simula-
tion indicates a material has desirable properties, it is advanta-
geous to directly query all of the experimental data associated
with that material to validate the prediction. Similarly, connect-
ing a physical material to its computational counterpart can pro-
vide key insight into the fundamental source of its properties.
ESAMP provides a framework for storing both computational and
experimental workflows using the same relationships and entities,
which creates a common language for querying the complex in-
formation associated with computational and experimental sam-
ples needed to complete this mapping. Not only will mapping
experiments to theory be made easier through ESAMP, but more
valid comparisons will also be enabled. Computational models
are often benchmarked against experimentally obtained values.
However, this mapping relies upon the composition and structure
information from a sample being valid for the state associated
with the value. If an intervening process changed the material’s
composition or structure, the mapping between the theoretical
and experimental dataset would be incorrect. Therefore, it is ad-
vantageous to use ESAMP to define state equivalency rules simi-
lar to those described above, to ensure state equivalency
between the synthesis processes and those generating the properties of
interest.
Conclusions
In this work, we present a database architecture, called ESAMP,
designed for storing materials science research data. We demon-
strate that the database architecture captures each material sam-
ple’s provenance, the data derived from experiments run using
each sample, and high-level results derived from the raw data.