ESAMP: event-sourced architecture for materials provenance management and application to accelerated materials discovery

Michael J. Statt,*a Brian A. Rohr,a Kris Brown,a Dan Guevarra,c Jens Hummelshøj,b Linda Hung,b Abraham Anapolsky,b John M. Gregoire*c and Santosh K. Suram*b

a Modelyst LLC, Palo Alto, CA, 94303, USA. E-mail: michael.statt@modelyst.io
b Accelerated Materials Design and Discovery, Toyota Research Institute, Los Altos, CA, 94040, USA. E-mail: santosh.suram@tri.global
c Division of Engineering and Applied Science, California Institute of Technology, Pasadena, CA 91125, USA. E-mail: gregoire@caltech.edu

Electronic supplementary information (ESI) available: Detailed schema discussion for relational database implementation of ESAMP. See DOI: https://doi.org/10.1039/d3dd00054k

Cite this: Digital Discovery, 2023, 2, 1078. Received 30th March 2023; accepted 14th June 2023. DOI: 10.1039/d3dd00054k
While the vision of accelerating materials discovery using data-driven methods is well-founded, practical realization has been throttled by challenges in data generation, ingestion, and materials state-aware machine learning. High-throughput experiments and automated computational workflows are addressing the challenge of data generation, and capitalizing on these emerging data resources requires ingestion of data into an architecture that captures the complex provenance of experiments and simulations. In this manuscript, we describe an event-sourced architecture for materials provenance (ESAMP) that encodes the sequence and interrelationships among events occurring in a simulation or experiment. We use this architecture to ingest a large and varied dataset (MEAD) that contains raw data and metadata from millions of materials synthesis and characterization experiments performed using various modalities such as serial, parallel, and multimodal experimentation. Our data architecture tracks the evolution of a material's state, enabling a demonstration of how state-equivalency rules can be used to generate datasets that significantly enhance data-driven materials discovery. Specifically, using state-equivalency rules and parameters associated with state-changing processes in addition to the typically used composition data, we demonstrate a marked reduction of uncertainty in the prediction of overpotential for oxygen evolution reaction (OER) catalysts. Finally, we discuss the importance of the ESAMP architecture in enabling several aspects of accelerated materials discovery such as dynamic workflow design, generation of knowledge graphs, and efficient integration of simulation and experiment.
Introduction
Accelerating materials discovery is critical for a sustainable future and the practical realization of emergent technologies. Data-driven methods are anticipated to play an increasingly significant role in enabling this desired acceleration, which would be greatly facilitated by the establishment and community adoption of data structures and databases that capture data from the broad range of materials experiments. In computational materials science, automated workflows have been established to produce large and diverse materials datasets. While these workflows and associated data management tools can be improved to facilitate capture of a material's state and enable easy capture of re-configurable analysis methods, their current implementations have facilitated a host of materials discoveries,1–4 emphasizing the importance of continued development of materials data architectures. In experimental materials science, the majority of the data remains in human-readable format and is not ingested into a database. Where databases do exist, they are either large but limited in scope (e.g., ICSD and ICDD, which contain hundreds of thousands of X-ray diffraction patterns) or diverse but limited in size.5–7 This has limited the application of machine learning for acceleration of experimental materials discovery to specific datasets such as microstructure data, X-ray diffraction spectra, X-ray absorption spectra, or Raman spectra.8–11
Recent application of high-throughput experimental techniques has resulted in two large, diverse experimental datasets: (a) the High Throughput Experimental Materials (HTEM) dataset, which contains synthesis conditions, chemical composition, crystal structure, and optoelectronic property measurements (>150 000 entries), and (b) the Materials Experiment and Analysis Database (MEAD), which contains raw data and metadata from millions of materials synthesis and characterization experiments, as well as the corresponding property and performance metrics.12,13
These datasets contain thousands to millions of data entries for a given type of experimental process, but the experimental conditions or prior processing of the materials leading up to the process of interest can vary substantially. The multitude of process parameters and provenances results in datasets whose richness can be fully realized and utilized only if the context and provenance of each experiment are appropriately modeled. In contrast to computational data, where programmatic workflows facilitate provenance tracking, experimental workflows generally experience more variability from many on-the-fly decisions as well as environmental factors and evolution of the instrumentation. Sensitivity to historical measurements is generally higher in experiments since any measurement could conceivably alter the material, making any materials experiment a type of processing. Factors ranging from instrument contamination to drifting detector calibration may also play a role. Therefore, a piece of experimental data must be considered in the context of the parameters used for its generation and the entire experimental provenance.
The importance of sample and process history makes it challenging to identify which measurement data can be aggregated to enable data-driven discovery. The standard practice for generating a shareable dataset is to choose data that match a set of process and provenance parameters and consider most or all other parameters to be inconsequential. This method is highly subjective to the individual researcher. For both human and machine users of the resulting dataset, the ground truth of the sample-process provenance is partially or fully missing. In addition, injecting assumptions prior to ingestion into a database creates datasets that do not adhere to the Findability, Accessibility, Interoperability, and Reusability (FAIR) guiding principles,14 resulting in a lack of interoperability and the creation of data silos that cannot be analyzed efficiently to generate new insights and accelerate materials discovery. As a result, the data's value is never fully realized, motivating the development of data management practices that closely link data ingestion to data acquisition.
Given the complexity and variability in materials experimentation, several tailored approaches such as ARES, AIR-Chem, and Chem-OS have been developed to enable integration between data ingestion and acquisition for specific types of experiments.15–17 Recently, a more generalizable solution for facilitating experiment specification, capture, and automation called ESCALATE was developed.18 Such approaches aim to streamline experimentation and minimize the information loss that occurs in an experimental laboratory. We focus on modeling the complete ground truth of materials provenance in a way that can operate on structured data produced either by specialized in-house data management software or by a more general framework such as ESCALATE.
Prior efforts such as The Materials Commons,19 GEMD,20 and PolyDAT21 have also focused on modeling materials provenance. GEMD uses a construction based on specs and runs for materials, ingredients, processes, and measurements. However, it makes no explicit distinction between measurements and processes; notably, in the case of in operando or in situ experiments, a single experiment corresponds to both a process and a measurement. PolyDAT focuses on capturing transformations and characterizations of polymer species. Materials Commons focuses on the creation of samples, data files, and measurements by processes. While acknowledging these earlier works, here we aim to further simplify the data architecture so that it is easily generalizable to various data sources. We consolidate the various terminologies such as materials, ingredients, processes, measurements, characterizations, and transformations into three main entities: sample, process, and process data. We also introduce a concept called state that enables dynamic sample/process data mapping and demonstrate its value for machine learning.
We use an event-sourced architecture for materials provenance (ESAMP) to capture the ground truth of materials experimentation. This architecture is inspired by event-sourced architectures used in software design, wherein the whole application state is stored as a sequence of events. It maintains relationships among experimental processes, their metadata, and their resulting primary data to strive for a comprehensive representation of the experiments. We believe that these attributes make ESAMP broadly applicable for materials experiments and beyond. We discuss database architecture decisions that enable deployment for a range of experiment throughput and automation levels. We also discuss the applicability of ESAMP to primary data acquisition modes such as serial, parallel, and multimodal experimentation. Finally, we present a specific instantiation of ESAMP, the Materials Provenance Store22 (MPS), for one of the largest experimental materials databases (MEAD); MPS consists of more than 6 million measurements on 1.5 million samples. We demonstrate facile information retrieval, analysis, and knowledge generation from this database. The primary use case described herein involves training machine learning models for catalyst discovery, where different definitions of provenance equivalence yield different datasets for model training that profoundly impact the ability to predict catalytic activity in new composition spaces. We also discuss the universality of our approach for materials data management and its opportunities for the adoption of machine learning in many different aspects of materials research.
ESAMP description
Overview
ESAMP is a database architecture designed to store experimental materials science data. It aims to capture all three of the aforementioned types of data: (1) information about the samples in the database, including provenance regarding how they were created and what processes they have undergone, (2) raw data from processes run on the samples, and (3) information derived from analyses of these raw data. Altogether, this architecture enables users to use simple SQL queries to answer questions like:
• What is the complete history of a given sample and any other samples used to create this one?
• How many samples have had XRD run on them both before and after an electrochemistry experiment?
• What is the figure of merit resulting from a given set of raw data analyzed using different methods?

Identification of data to evaluate any scientific question requires consideration of the context of the data, motivating our design of the ESAMP structure to intuitively specify contextual requirements of the data. For example, if a researcher wishes to begin a machine learning project, a custom dataset for their project can be created by querying data in the ESAMP architecture. For instance, training data for machine learning prediction of the overpotential in chronopotentiometry (CP) experiments from catalyst composition can be obtained via a query that answers questions such as:
• Which samples have undergone XPS then CP?
• How diverse are the sample compositions in a dataset?

The researcher may further restrict the results to create a balanced dataset or a dataset with specified heterogeneity with respect to provenance and experiment parameters. The query provides transparent self-documentation of the origins of such a dataset; any other researcher wondering how the dataset was created can look at the WHERE clause in the SQL query to see what data was included and excluded.
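As a minimal sketch, the "XPS then CP" question above can be expressed as a self-join on the sample_process table; the table and column names here are illustrative assumptions rather than the exact MPS schema.

-- Samples that underwent an XPS process followed later by a CP process.
-- Table and column names are illustrative, not the exact MPS schema.
SELECT DISTINCT sp1.sample_id
FROM sample_process sp1
JOIN process p1 ON p1.id = sp1.process_id
JOIN sample_process sp2 ON sp2.sample_id = sp1.sample_id
JOIN process p2 ON p2.id = sp2.process_id
WHERE p1.process_type = 'XPS'
  AND p2.process_type = 'CP'
  AND p2.timestamp > p1.timestamp;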
To enable these benefits, we must first track the state of samples and instruments involved in a laboratory to capture the ground truth completely. In this article, we focus mainly on the state of samples and note that the architecture could also capture the state of instruments or other research entities. A sample's provenance can be tracked by considering three key entities: sample, process, and process_data, which are designed to provide intuitive ingestion of data from both traditional manual experiments and their automated or robotic analogues.
Sample. A sample is a label that specifies a physically-identifiable representation of an entity that can undergo many processes (e.g. the liquid in that vial or the thin film on that substrate). Samples can be combined or split to form complex lineages, such as an anode and a cathode being joined in a battery or a vial of precursor used in multiple catalyst preparations. The only fundamental assumption placed on a sample is that it has a unique identifier so that its lineage and process history can be tracked.
Process. A process is an event that occurs to one or more samples. It is associated with an experiment in a laboratory, such as annealing in a sample furnace or performing spectroscopic characterization. Processes have input parameters and are identified by the machine (or human) that performed them at a specific time.
Process_data. Process data is data generated by a process that applies to one or more samples that underwent that process. Since the process, but not the specific ProcData, is central to sample provenance, management of ProcData can occur in a connected but distinct part of the framework. As many raw outputs from scientific processes are difficult to interpret without many additional steps of analysis, ProcData is connected to a section of the framework devoted to iterative steps of analysis where ProcData is transformed and combined to form higher-level figures of merit (FOMs).
These three entities, connected via a sample_process table, form the framework's central structure. Fig. 1 shows these entities and their relationships. The three shaded boxes indicate the secondary tables that support the central tables by storing process details, sample details, and analyses. Each region is expanded upon below.
Samples, collections, and lineage
The trinity of sample, process, and process_data enables us to have a generalized framework that captures the ground truth associated with any given sample in an experimental dataset. However, interpretation of experimental data requires us to capture the provenance of a sample completely. That is, throughout the sample's lifetime, it is important to track three key things:
• How was the sample created?
• What processes occurred to the sample?
• If the sample no longer exists, how was it consumed?

The middle question is directly answered by the sequence of entries in the sample_process table, wherein each record in sample_process specifies the time that a sample underwent a process. This concept is complicated by processes that merge, split, or otherwise alter the physical identification of samples. Such processes are often responsible for the creation and consumption of samples, for example the deposition of a catalyst onto an electrode or the use of the same precursor in many different molecule formulations. In these cases, the process history of the "parent" catalyst or precursor is an inherent part of the provenance of the "child" catalyst electrode or molecular material. These potentially-complex lineages are tracked through the sample_ancestor and sample_parent entities as shown in Fig. 2a.
Fig. 1 An overview of the framework showing the central location of the sample_process entity and its relationship to the three major areas of the framework.

Both the SampParent and SampAnc entities are defined by their connection to two sample entities, indicating a parent/ancestor and child/descendant relationship, respectively. The SampParent entity indicates that the child sample was created from the parent sample and should inherit its process history lineage. Each SampParent can be decorated with additional attributes to indicate its role in the parent-child relationship, such as labeling the anode and cathode when creating a battery. The SampAnc entity is nearly identical to SampParent, with an additional attribute called rank that indicates the number of generations between the ancestor and the descendant. A rank of 0 indicates a parent-child relationship, while a rank of 2 indicates a great-grandparent type relationship. The parent and ancestor tables are not essential to the database, as they can be derived from the materials provenance. However, these derived tables are extremely valuable for simplifying complex queries that depend on sample lineages.
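One way to derive the ancestor table, sketched below, is a recursive common table expression over sample_parent; the column names (parent_id, child_id) are assumptions for illustration.

-- Derive sample_ancestor from sample_parent with a recursive CTE.
WITH RECURSIVE ancestry AS (
    -- direct parent-child links have rank 0, as defined above
    SELECT parent_id AS ancestor_id, child_id AS descendant_id, 0 AS rank
    FROM sample_parent
  UNION ALL
    -- walk one generation further up on each iteration
    SELECT sp.parent_id, a.descendant_id, a.rank + 1
    FROM sample_parent sp
    JOIN ancestry a ON a.ancestor_id = sp.child_id
)
INSERT INTO sample_ancestor (ancestor_id, descendant_id, rank)
SELECT ancestor_id, descendant_id, rank
FROM ancestry;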
The final entity connected to a sample is the collection. It is common for researchers to group samples. For example, in high-throughput experiments many samples may exist on the same chip or plate, or researchers may include in a collection all samples synthesized for a single project. In these cases, researchers need to be able to keep track of and make queries based on that information. It is clear from the previously-mentioned example that many samples can (and almost always do) belong to at least one collection. It is also important that we allow the same sample to exist in many collections. For example, a researcher may want to group samples by which plate or wafer they are on, which high-level project they are a part of, and which account they should be billed to, all at the same time. The corresponding many-to-many relationships are supported by ESAMP, as sketched below.
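A minimal sketch of that many-to-many membership, again with assumed names:

CREATE TABLE collection (
    id   BIGSERIAL PRIMARY KEY,
    name TEXT NOT NULL        -- e.g. a plate, wafer, project, or billing account
);

-- mapping table: one row per (sample, collection) membership
CREATE TABLE sample_collection (
    sample_id     BIGINT NOT NULL REFERENCES sample(id),
    collection_id BIGINT NOT NULL REFERENCES collection(id),
    PRIMARY KEY (sample_id, collection_id)
);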
Processes & process details
A process represents one experimental procedure (e.g. a synthesis or characterization) that is applied to a sample. The only requirement imposed on processes is that it must be possible to sort them chronologically. Chronological sorting is essential for accurately representing a sample's process history. Therefore, each process is uniquely associated with a timestamp and machine/user. There is an underlying assumption that for a given timestamp and a given machine/user, only one process is occurring, although that process may involve multiple samples.
While single-step experiments on machine-based workflows can easily provide a precise timestamp for each process, it is cumbersome and error-prone for researchers to provide these at the timescale of seconds or even hours. Additionally, some multi-step processes may reuse the initial timestamp throughout each step, associating an initiation timestamp with a closely-coupled series of experiments whose ordering is known but whose individual timestamps are not tracked. It is important to add a simple ordering parameter to represent the chronology when the timestamp alone is insufficient. For tracking manual experiments, this ordering parameter allows researchers to record the date and a counter for the number of experiments they have completed that day. In multi-step processes, each step can be associated with an index to record the order of steps.
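One possible encoding of this convention, assuming the process table sketched earlier, is an ordering column with a uniqueness constraint; a sample's history is then recovered by sorting on (timestamp, ordering).

-- Assumed encoding of the "one process per machine per timestamp" rule,
-- with a counter for multi-step or manually recorded experiments.
ALTER TABLE process ADD COLUMN ordering INTEGER NOT NULL DEFAULT 0;
ALTER TABLE process ADD CONSTRAINT one_process_per_machine_time
    UNIQUE (machine, timestamp, ordering);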
Processes indicate that an experimental event has occurred to one or more samples. However, it is also important to track information describing the type of process that occurred and the process parameters used, or generally any information that would be required to reproduce the experiment. A given research workflow may comprise many different types of experiments, such as electrochemical, XPS, or deposition processes. Each of these types of processes will also be associated with a set of input parameters. The ProcDet entity and its associated process-specific tables are used to track this important metadata for each process. A more comprehensive discussion of the representation of process details for various relational database management system (RDBMS) implementations is provided in the ESI.
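As one illustration of the pattern (the ESI discusses alternatives), each process type can point to its own detail table; the names and columns below are assumptions for a hypothetical anneal process.

-- Generic detail row, plus one type-specific table per process type.
CREATE TABLE process_details (
    id           BIGSERIAL PRIMARY KEY,
    process_type TEXT NOT NULL
);

CREATE TABLE anneal_details (
    id           BIGINT PRIMARY KEY REFERENCES process_details(id),
    max_temp_c   NUMERIC NOT NULL,      -- maximum annealing temperature
    duration_min NUMERIC,
    atmosphere   TEXT
);

ALTER TABLE process ADD COLUMN details_id BIGINT REFERENCES process_details(id);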
Process data & analysis
Fig. 2 An overview of the three major areas of the framework as shown in Fig. 1. Each region is centered on one of the three entities connected to the central SampProc entity: (a) Samp (b) ProcData (c) Proc.

While ProcDet tracks the inputs to a Proc, ProcData tracks the outputs of a Proc. For reproducibility, transparency, and the ability to continue experiments without reliance on an active database connection, it is prudent to store process outputs as raw files independent of the data management framework. Therefore, while ProcData may include relevant data parsed from the raw files, it should also always include a raw file path. Additionally, attributes can be added to specify the location to search for the file, such as an Amazon S3 bucket or local storage drive. A single file may also contain multiple pieces of data that each refer to different samples. This complexity motivates the inclusion of start and end line numbers within a file as identifying information for ProcData. If an entire file should be consumed as a single piece of process data, null values can be provided for those attributes. As a significant amount of scientific data is stored as comma-separated values (CSV) files, it can also be beneficial to parse these files directly into values in the database utilizing flexible column data types, such as JavaScript Object Notation (JSON), which is supported by modern RDBMSs. For large datasets, storing data using efficient binary serializations such as Messagepack could be beneficial.23
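A sketch of how these attributes might attach to process_data, extending the earlier illustrative schema:

-- Raw-file bookkeeping and parsed contents (illustrative column names).
ALTER TABLE process_data
    ADD COLUMN storage_location TEXT,   -- e.g. an S3 bucket or a local drive
    ADD COLUMN line_start INTEGER,      -- NULL means consume the whole file
    ADD COLUMN line_end   INTEGER,
    ADD COLUMN parsed     JSONB;        -- parsed CSV rows, when useful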
The relationship between process outputs and their associated processes and samples can be complex. The most straightforward relationship is one in which a single piece of process data is generated for a single sample, which is typically the case for serial experimentation and traditional experimentation performed without automation. In parallel experimentation, a single process involves many samples, and if the resulting data is relevant to all samples, SampProc has a many-to-one relationship to ProcData. In multimodal experiments, multiple detectors can generate multiple pieces of data for a single sample in a single process, where SampProc has a one-to-many relationship to ProcData. Parallel, multimodal experimentation can result in many-to-many relationships. To model these different types of experimentation in a uniform manner, ESAMP manages many-to-many relationships between SampProc and ProcData.
The raw output of scientific processes may require several iterative analytical steps before the desired results can be obtained. As the core tenet of this framework design is tracking the full provenance of scientific data, analytical steps must have their lineage tracked similarly to that of samples and processes. This is achieved by the analysis, analysis_details, and analysis_parent tables. The analysis table represents a single analytical step and, similar to Proc, is identified by inputs, outputs, and associated parameters. Just as Proc has a many-to-many relationship with sample, analysis has a many-to-many relationship with process_data; a piece of process data can be used as an input to multiple analyses, and a single analysis can have multiple pieces of process data as inputs. The type of analysis and its input parameters are stored in the analysis_detail entity. The analysis type should define the analytical transformation function applied to the inputs, while the parameters are fed into the function alongside the data inputs. An important difference between analysis and Proc is that an analysis can use the output of multiple ProcData and analysis entities as inputs. This is analogous to the parent-child relationship modeled by SampParent. The introduction of the analysis_parent table allows this complex lineage to be modeled, so that even the most complex analytical outputs can be traced back to the raw ProcData entities and the intermediate analyses on which they are based.
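For instance, a recursive query of the kind sketched below (assuming an analysis_process_data mapping table and the column names shown) traces a figure of merit back to every raw ProcData record it rests on.

-- All raw process_data upstream of one analysis (id 42 is a placeholder).
WITH RECURSIVE upstream AS (
    SELECT 42::bigint AS analysis_id      -- the figure-of-merit analysis
  UNION
    SELECT ap.parent_id
    FROM analysis_parent ap
    JOIN upstream u ON ap.child_id = u.analysis_id
)
SELECT DISTINCT apd.process_data_id
FROM analysis_process_data apd
JOIN upstream u USING (analysis_id);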
State
During experiments a sample may be intentionally or unintentionally altered. For example, a researcher could measure the composition of a sample, perform an electrochemical process that unknowingly changes the composition, and finally perform a spectroscopic characterization. Even though the sample label is preserved throughout these three processes, directly associating the composition measurement with the spectroscopic measurement can lead to incorrect analysis because the intervening process altered the link between the two. This example motivates the need for the final entity in the framework, state. The ESAMP model for state assumes that every process irreversibly changes the sample. A state is defined by two sample_process entities that share the same sample and have no sample_process chronologically between them. By managing state under the most conservative assumption that every process alters the sample's state, any state equivalency rules (SERs), i.e. whether a certain type of process alters the state or not, can be applied in a transparent manner. A new state table can be constructed from these SERs, which may be easily modified either by a human or a machine.
Fig. 3 An example of a sample state graph. Sample 1 is shown undergoing five processes with types P1, P2, or P3. A state is defined between every pair of consecutive processes. The right boxes show how different sets of rules governing whether a process is state-changing or not can change the equivalency between the states. Without any rules, all processes are assumed to be state-changing, and no states are equivalent. This constraint can be fully relaxed to make all states equivalent. It can also be partially relaxed based on process type or process details, as shown in the lower two rule sets.

As state essentially provides a link between the input and output of a process, it is best visualized as a graph. Fig. 3 shows an example state graph. Sample 1 undergoes a series of five processes that involve three distinct types of processes. A new state is created after each process. If no relaxation assumptions are applied, all processes are assumed to be state-changing, and since all states are non-equivalent, it might be invalid to share process data or derived analysis amongst them. Under the most relaxed constraint, no processes are state-changing. However, the utility of state is the ability to apply domain- and use-specific rules to model SERs. For example, consider process 3 (P3) to be a destructive electrochemical experiment that changes the sample's composition, while the other processes are innocuous characterization experiments. By designating only P3 as state-changing, the sample can be considered to have only two unique states. SERs can be further parameterized by utilizing the ProcDets of the process to determine state-changing behavior. For example, if P2 is an anneal step, we might only consider it state-changing if the temperature rises above a certain level. By defining simple rules, merging equivalent states yields simpler state graphs that serve as the basis for dataset curation. This powerful concept of state is enabled by the core framework's ability to track the process provenance of samples throughout their lifetime.
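To make the mechanics concrete, the sketch below rebuilds a state table under an SER in which only electrochemical processes are state-changing; the process type, names, and the inclusive counting convention are all illustrative assumptions.

-- Each sample_process row is labeled with the running count of
-- state-changing processes for that sample; rows sharing
-- (sample_id, state_index) are treated as observations of one state.
CREATE TABLE state AS
SELECT sp.id AS sample_process_id,
       sp.sample_id,
       COUNT(*) FILTER (WHERE p.process_type = 'electrochemistry')
           OVER (PARTITION BY sp.sample_id
                 ORDER BY p.timestamp, p.ordering) AS state_index
FROM sample_process sp
JOIN process p ON p.id = sp.process_id;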
Database implementation
The framework (Fig. 4) has so far been defined using standard entity-relationship language. It is important to note that this framework can be instantiated in most or all RDBMSs and is not tied to a specific implementation. However, the specific implementation details of the framework may change slightly depending on the RDBMS used. These differences are vital in deciding which RDBMS is appropriate for a particular use case.

Fig. 4 A full graphical representation of the framework described in Fig. 1 and 2. Single-headed arrows indicate a many-to-one relationship in the direction of the arrow. Double-headed arrows indicate a many-to-many relationship.

Fig. S1 shows the framework in its entirety. All double-sided arrows indicate a many-to-many relationship. The implementation of many-to-many relationships differs between SQL, NoSQL, and graph databases. In a SQL RDBMS such as PostgreSQL, the standard practice uses a mapping table in which a row is defined simply by its relationship to the two tables with the many-to-many relationship. In graph databases, many-to-many relationships can be represented simply as an edge between two nodes. Additionally, entities that track lineages, such as SampParent, state, and analysis_parent, can also be represented simply as edges between two nodes of the same type. The cost of this simplicity is reduced constraints on column datatypes as well as less standardized query functionality.
If complicated process provenances and lineages are expected to exist along with a need to query those lineages, then a graph database may be the right choice. However, if simpler lineages with large amounts of well-structured data are used, a standard SQL RDBMS would be more advantageous. Data can even be migrated quite easily between implementations of this framework in two RDBMSs if the slight differences noted above are carefully considered. In this implementation we used a PostgreSQL database due to the presence of a large amount of reasonably well-structured data. In addition, PostgreSQL allows us to build a graph database on top of it, which can be used for complex provenance queries.
Results
Implementation of the ESAMP framework is demonstrated via ingestion and modeling of MEAD, the database resulting from high-throughput experimental investigation of solar fuels materials in the Joint Center for Artificial Photosynthesis (JCAP).24 MEAD contains a breadth and depth of experiments that make it representative of a broad range of materials experiments. For example, the 51 types of processes include serial, parallel, and multimodal experiments.

Using the most conservative rule that every process is state-changing, the database contains approximately 17 million material states. This dataset contains many compositions in high-order composition spaces, particularly metal oxides with three or more cation elements. For electrocatalysis of the oxygen evolution reaction (OER), the high-throughput experiments underlying MEAD have led to the discovery of catalysts with nanostructured mixtures of metal oxides in such high-order composition spaces.25–27 Given the vast number of unique compositions in these high-dimensional search spaces, a critical capability for accelerating catalyst discovery is the generation of machine learning models that can predict composition-activity trends in high-order composition spaces, motivating illustration of ESAMP for this use case.
Catalyst discovery use case
To demonstrate the importance and utility of the management of process provenance and parameters, we consider a use case where data is curated to train a machine learning model and predict the catalytic activity of new catalyst compositions. We commence by considering all MEAD measurements of metal oxides synthesized by inkjet printing and evaluated as OER electrocatalysts, particularly the OER overpotential for an anodic electrochemical current density of 3 mA cm−2. This overpotential is the electrochemical potential above 1.23 V vs. RHE required to obtain the current density, so smaller values correspond to higher, desirable catalytic activity. Measurement of this overpotential can be made by cyclic voltammetry (CV) or chronopotentiometry (CP).

Querying MEAD for all measurements of this overpotential and identifying the synthesis composition for each sample produces a dataset of composition and activity regardless of each sample's history prior to the CP experiment and the electrochemical conditions of the measurement. This dataset is referred to as dataset A in Fig. 5a and contains 660 260 measurements of overpotential. Considering a provenance to be the ordered set of process types that occurred up to the overpotential measurement, this dataset contains 19 129 unique provenances. To increase the homogeneity in provenance and materials processing, the SERs can require that the catalyst samples have been annealed at 400 °C. Additionally, to generate a single activity metric for each sample, the SERs can also require only the most recent or "latest" measurement of activity, which results in a dataset B containing 66 653 measurements, corresponding to 304 unique provenances. To further increase the homogeneity, the SERs can also require the electrolyte pH to be within 0.5 of pH 13 and require those catalysts to have been operated for at least 100 minutes before the catalyst activity measurement, resulting in dataset C containing 20 012 measurements. Dataset C contains only 29 unique provenances, which differ in their sequence of electrochemical experiments that preceded the overpotential measurement.
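In query form, this curation chain amounts to successively stricter WHERE clauses. The sketch below assumes a hypothetical flattened view, oer_measurements, with one row per overpotential measurement and its provenance-derived parameters; the real MPS queries join the underlying tables instead.

-- Dataset C: latest measurement per sample, 400 °C anneal,
-- pH within 0.5 of 13, and at least 100 min of prior operation.
SELECT DISTINCT ON (sample_id)
       sample_id,
       overpotential_mv
FROM oer_measurements
WHERE max_anneal_temp_c = 400
  AND ABS(electrolyte_ph - 13) <= 0.5
  AND prior_operation_min >= 100
ORDER BY sample_id, measured_at DESC;   -- DISTINCT ON keeps the latest row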
Dataset C contains 63 unique 4-cation composition spaces. To demonstrate machine learning prediction of catalyst activity in new composition spaces, each of these 63 combinations of 4-cation elements is treated as an independent data instance in which the test set is taken to be all catalyst measurements from dataset C where the catalyst composition contains three or all four of the respective 4-cation elements. Keeping the test set consistent, three independent eXtreme Gradient Boosting (XGB) random forest regression models, one for each of the three datasets, were trained to predict overpotential from composition, where in each case the composition spaces that comprise the test set are held out from training. Repeating this exercise for all 63 data instances enables calculation of the aggregate mean absolute error (MAE) for predicting catalyst activity, as shown in Fig. 5a for the three different datasets. The MAE improves considerably when increasing the homogeneity of provenance and experimental parameters from dataset A to B and from dataset B to C, demonstrating the value of using appropriate SERs to curate materials databases with specific provenance and property conditions to generate suitable training data for a specific prediction task.
The parameters used for creating the SERs can also be considered as properties of the catalyst measurements, enabling the training of machine learning models that use not only composition as input but also additional parameters, in the present case the maximum annealing temperature, the number of previous measurements of the catalyst activity, the electrolyte pH, the duration of prior catalyst stability measurements, and whether the measurement occurred by CV or CP. Fig. 5a shows the corresponding results for the same exercise described above, wherein the aggregate MAE is calculated for each dataset A, B, and C. This more expressive input space enables a substantial decrease in the MAE when using dataset B. For dataset A, by contrast, the more expressive input space marginally increased the MAE, highlighting the importance of combining SER-based data classification with regression using richer expressions of the input space.

For the Ce-Fe-Mn-Ni data instance, Fig. 5b shows the prediction using dataset B and only composition as model input, resulting in an MAE of 143 mV. Using the same dataset but expanding the model input to include the experiment and catalyst parameters lowers the MAE to 25 mV, which is the approximate measurement uncertainty (Fig. 5c). Comparison to the ground truth values in Fig. 5d reveals that the prediction in Fig. 5c captures the broad range in activity and the composition-activity trends in each of the four 3-cation and 4-cation composition spaces. Overall, these results demonstrate that curation of data to accelerate materials discovery via machine learning requires management of experiment provenance and parameters.
Fig. 5 Machine learning for catalyst discovery use case: prediction of OER overpotential for 3 mA cm−2 in 3-cation and 4-cation composition spaces. Datasets for model training are A: all measurements of this performance metric, B: only the most recent measurement of activity for catalysts annealed at 400 °C, and C: the measurements from B made in pH 13 electrolyte and succeeding at least 100 minutes of catalyst operation. (a) The dataset size in terms of the number of overpotential measurements and the number of unique provenances (right axes) and MAE (left axis) for the three datasets, where the MAE is aggregated over 63 data instances of machine learning prediction both from prediction using only composition and from prediction using composition and experiment parameters. (b) The overpotential predicted from the composition for the Ce-Fe-Mn-Ni data instance using dataset B, resulting in an MAE of 143 mV. (c) The analogous result using composition and experiment parameters, which lowers the MAE to 25 mV. (d) The ground truth data, where the element labels for the composition graph as well as the overpotential color scale apply to (b) and (c) as well.
Discussion
Automated ML pipelines
In the catalyst discovery use case described above, we identified that the choice of state-changing processes had a significant effect on the prediction of OER overpotential. To avoid making such decisions a priori, which is often not possible in experimental research, all distinguishable processes should be reflected in the data management. For instance, a sample storage event is typically assumed to be non-state-changing, which may not be the case. The simplest example is air-sensitive materials, whose sample handling between experiments should be documented as sample handling processes. The ESAMP framework allows every event in the laboratory to be defined as a process. However, in practice, capturing every event is infeasible in a typical laboratory setting. There may always exist "hidden" processes that altered a material's state but were not tracked, which compounds the issues discussed above with human-made decisions about which processes are state-changing and whether that designation varies with either sample or process parameters. By liberally defining what constitutes a process and aggregating data from many experimental workflows, ESAMP will ultimately enable machine learning to identify hidden processes and to determine which tracked processes are indeed state-changing.
Recently, several research works have focused on developing closed-loop methods to identify optimal materials and processing conditions for applications such as carbon nanotube synthesis,28 halide perovskite synthesis,29 and organic thin film synthesis.30 The workflows of these experiments are typically static. Similarly, several high-throughput experimental systems deploy static workflows or utilize simple if-then logic to choose amongst a set of pre-defined workflows. Machine learning on data defined using ESAMP, which contains various process provenances along with definitions of state-changing processes, will enable dynamic identification of workflows that maximize knowledge extraction.
Generality for modeling other experimental workflows
While the breadth of process provenances and the dynamic range of depth within each type of provenance make the MEAD database an excellent demonstrator of ESAMP, the provenance management and the database schema are intended to be general to all experimental and computational workflows. A given type of experiment may be considered equivalent when performed in two different labs, although differences in process parameters and data management have created hurdles to universal materials data management. Such differences may require lab-specific ingestion scripts and tables, but custom development of these components of ESAMP comprises a low-overhead expansion of the database to accept data from new labs as well as new types of processes. One of the most widely used experimental inorganic crystal structure and diffraction databases (ICDD) was generated by manual curation and aggregation over several decades of X-ray diffraction data generated in many laboratories. We anticipate that ESAMP's universal data management will result in a more facile generation of several large experimental datasets with full provenance that enable data-driven accelerated materials discoveries.
In addition to the generation of new insights from provenance management and the acceleration of research via more effective incorporation of machine learning, we envision materials provenance management to profoundly impact the integrity of experimental science. In the physical sciences, the complexity of modern experimentation contributes to issues with reproducing published results.31 However, the complexity itself is not the issue, but rather the inability of the Methods sections in journal articles to adequately describe the materials provenance, for example via exclusion of parameters or processing steps that were assumed to be unimportant, which is exacerbated by complex, many-process workflows. Provided an architecture for provenance management such as ESAMP, data can ultimately determine what parameters and processes are essential for reproducible materials experiments.
Generation of knowledge graphs and data networks
As discussed above, we anticipate ESAMP to provide the framework that enables the curation of large and diverse datasets with full provenance. Such large datasets are a great starting point for machine learning applications. However, ESAMP is quite general, and adapting a more specific data framework to one's use case can make knowledge extraction easier. These frameworks may extract subsets of the data stored in the main framework and apply simplifying assumptions appropriate to the specific use case. However, as long as a link exists between the higher-level framework and ESAMP, the complete provenance information will still be preserved and queryable. Machine learning datasets, such as datasets A, B, and C in the above use case, are examples of a practical higher-level extraction. See Fig. 6 for extraction of datasets based on process provenance constraints.
One example of a higher-level framework enabled by ESAMP is that of knowledge graphs. Knowledge graphs are a powerful abstraction for storing, accessing, and interpreting data about entities interlinked by an ontology of classes and relations.32 This allows for formal reasoning, with reasoning engines designed for queries like "Return all triples (x1, x2, x3) where f(x1, x2) and f(x1, x3) and (j(x2, x3) if and only if q(x3))". Beyond direct queries, which produce tabular results suited for traditional machine learning applications, machine learning models can be applied directly to relational databases33,34 and knowledge graphs.35 Applications involving knowledge graphs and ontologies have been explored in the space of chemistry and materials science research.36,37
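As a small illustration of the first point, rows of the relational schema can be emitted directly as subject-predicate-object triples; the predicate names below are invented for the sketch, and the table names follow the earlier illustrative schema.

-- Sample-underwent-process and parent-of triples from the mapping tables.
SELECT sp.sample_id::text AS subject,
       'underwent'        AS predicate,
       p.process_type     AS object
FROM sample_process sp
JOIN process p ON p.id = sp.process_id
UNION ALL
SELECT parent_id::text, 'parent_of', child_id::text
FROM sample_parent;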
The population of knowledge graphs is facilitated by ESAMP mainly in two ways. Firstly, data within a relational database structure is straightforwardly mappable into the data structure of knowledge graph triples.38 Secondly, a solid grasp of how to resolve distinct entities can be achieved through ESAMP before populating the nodes of the knowledge graph. Alternative approaches of merging all samples with the same label or considering every possibly-distinct sample to be a unique material