Dataset Design for Building Models of Chemical Reactivity

Priyanka Raghavan, Brittany C. Haas,⊥ Madeline E. Ruos,⊥ Jules Schleinitz,⊥ Abigail G. Doyle, Sarah E. Reisman, Matthew S. Sigman, and Connor W. Coley*

Cite This: ACS Cent. Sci. 2023, 9, 2196−2204
Received: September 20, 2023; Revised: November 6, 2023; Accepted: November 15, 2023; Published: December 8, 2023
This article is licensed under CC BY 4.0.
ABSTRACT: Models can codify our understanding of chemical reactivity and serve a useful purpose in the development of new synthetic processes, for example, by evaluating hypothetical reaction conditions or in silico substrate tolerance. Perhaps the most important factor in a model's success is the composition of the training data and whether it is sufficient to train a model that can make accurate predictions over the full domain of interest. Here, we discuss the design of reaction datasets in ways that are conducive to data-driven modeling, emphasizing the idea that training set diversity and model generalizability rely on the choice of molecular or reaction representation. We additionally discuss the experimental constraints associated with generating common types of chemistry datasets and how these considerations should influence dataset design and model building.
■ INTRODUCTION
Data-driven modeling in organic chemistry dates back almost a century.1 Since then, researchers have explored various approaches to correlate molecular properties with reaction performance using a broad range of techniques, from linear free energy relationships (LFERs) to multivariate linear regression to deep learning. Besides the type of model itself, approaches have varied with respect to their application domain, diversity of inputs, and performance measure or prediction target. Here, we focus on models that are trained on experimental data to anticipate quantitative performance metrics, such as reaction yields, selectivities, or even rates. The major themes and trends in building such structure−property relationships2,3 and the broader landscape of predictive chemistry4 have been the subject of recent reviews.
However, in addition to the many publicized success stories using models to predict the performance of chemical reactions, we have witnessed many cases where modeling has been less successful. Our ability to train models that support chemistry objectives is dependent on data in ways that may be underappreciated and underreported. In this Outlook, we discuss the concept of dataset design (Figure 1), that is, the construction of experimental datasets with modeling applications in mind, and some of the pitfalls that we have encountered when learning from datasets that have not been intentionally designed for machine learning. We have organized our discussion around the primary considerations when the aim is model building, and we describe at each stage how those considerations should directly influence dataset design.
■ DEFINING THE DESIRED DOMAIN OF APPLICABILITY
A primary consideration of model building is the desired domain of applicability: the range of inputs over which we would like a model to make accurate predictions. Do we want to be able to query the model with any set of reactants, conditions, and products and have it estimate the yield? Or are there specific combinations of known substrates that we want to study? Is it acceptable to assume a constant, unvarying temperature and reaction time, or do we also want to understand how those factors influence the reaction performance? Here, we can draw a distinction between “global” and “local” models. The former might involve using a corpus of literature data (for example, the Chemical Abstracts Service (CAS) Content Collection or the Pistachio, USPTO, or Reaxys datasets) containing millions of examples and spanning thousands of reaction types. The latter might involve focusing on a single reaction type and a well-defined set of substrates and reaction conditions; in most substrate scope studies, the reaction conditions are not varied. While a globally useful model is appealing in its scope, it is generally advantageous to have a sufficiently narrow domain of applicability to minimize
underlying mechanism changes, reactivity cliffs, or interaction effects in the dataset. These are factors that not only increase modeling difficulty but also are seldom accounted for in model inputs. This perhaps explains why predicting selectivity has seen more consistent success than predicting yield, as is discussed later. Furthermore, some literature-derived datasets are algorithmically extracted from text and have not undergone extensive manual curation or validation, so certain fields may be omitted or incorrect.
The datasets we can use for model training exhibit diversity along different axes (Figure 2A). Data derived from the published literature span a wide range of substrates and reaction types, but each reactant−product combination might be reported only once or twice. In contrast, public datasets from high-throughput experimentation (HTE) exist only for a few reaction types so far (Buchwald−Hartwig amination7 and Suzuki coupling8 being the most popular datasets), although more varied datasets, both in terms of reaction types and design workflow, are emerging.9,10 Most HTE datasets are generated through parallel plate-based chemistry in 24-, 96-, 384-, or even higher density well formats. In these experimental campaigns, some reaction variables are easy to vary via automated liquid handling capabilities (e.g., the diversity of concentrations and the combinations of additives), while other aspects (e.g., heterogeneous reactants and the diversity of solvents) are harder to vary given the practical challenges of stock solution preparation.
Acquiring and screening a large number of diverse substrates is the most salient challenge that tends to limit the number of distinct components used in HTE campaigns, which often leverage the combinatorial nature of discrete variable selection. For example, the C−N coupling dataset from Ahneman et al.7 covers 4140 reactions defined by the combination of 15 choices for the aryl halide, 23 additives, 4 Pd catalysts, 3 bases, 1 amine, and 1 solvent, at fixed time, temperature, and concentrations. Similarly, the dataset from Perera et al. of 5760 Suzuki reactions8 was defined by combinations of 5 electrophiles, 6 nucleophiles, 11 ligands, 7 bases, and 4 solvents. Even a few choices for each component can quickly represent a large experimental space, for which there tends to be a higher cost associated with the HTE campaigns and, particularly with significant numbers of distinct products, a higher analytical burden.
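To make the scale of such combinatorial designs concrete, the short sketch below enumerates a design space with the same dimensions as the C−N coupling dataset described above; the component names are generic placeholders rather than the actual reagents.

```python
from itertools import product

# Placeholder component lists with the same cardinalities as the C-N coupling
# dataset discussed above (15 aryl halides x 23 additives x 4 catalysts x 3 bases).
aryl_halides = [f"aryl_halide_{i}" for i in range(1, 16)]
additives = [f"additive_{i}" for i in range(1, 24)]
catalysts = [f"pd_catalyst_{i}" for i in range(1, 5)]
bases = [f"base_{i}" for i in range(1, 4)]

# Full factorial enumeration of the discrete variables at fixed time,
# temperature, and concentration.
design_space = [
    {"aryl_halide": ah, "additive": ad, "catalyst": cat, "base": b}
    for ah, ad, cat, b in product(aryl_halides, additives, catalysts, bases)
]
print(len(design_space))  # 15 * 23 * 4 * 3 = 4140 reactions
```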
Figure 1. Recommended conceptual workflow for dataset design. From top to bottom: (1) task definition with respect to the modeling space, setting, and target; (2) experimental constraints, including the number of reactions and the throughput of the analytical method; and (3) intentional dataset design, emphasizing feature-based reaction component selection.5,6 These steps culminate in (4) data acquisition and modeling, with an optional active learning loop for iterative dataset expansion.

The variation of individual components or aspects of the reaction conditions is directly tied to the applicability domain, as a model should not be expected to generalize to a new
molecule or input that is too dissimilar from what it has been trained on. As an extreme example, a model that has only seen reactions performed at room temperature cannot understand the influence of temperature on the reaction outcome. In the Ahneman et al. study,7 the component with the greatest variation in the dataset was the additive species, with 23 total choices, which justifies the evaluation of model generalization to unseen additives in the original paper. With only three bases explored, it is unrealistic to expect the model to anticipate the performance of a fourth, unseen base. At the same time, a model trained on a narrow subset of reaction space cannot, in general, be expected to generalize well to other areas of that space, making it vital to select an appropriate set of representative examples.
Mathematically, extrapolation is generally thought of as making predictions on data that fall outside the convex hull of the training data; however, high-dimensional datasets almost always represent extrapolative tasks by this definition.12 The notion of similarity between training and testing points, and of what constitutes extrapolation in a chemical sense, has no strict definition, but distance in chemical feature space (e.g., using descriptors or molecular/reaction fingerprints) is a natural approach. Structural similarity has been used to estimate the domain of applicability and uncertainty of predictive models.13,14
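As one illustrative (not prescriptive) way to quantify this, the sketch below uses RDKit Morgan fingerprints and nearest-neighbor Tanimoto similarity to the training set as a rough proxy for whether a query substrate lies within a model's domain of applicability; the SMILES strings are arbitrary examples.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles, radius=2, n_bits=2048):
    """Morgan (ECFP-like) bit-vector fingerprint for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)

def max_similarity_to_training(query_smiles, training_smiles):
    """Nearest-neighbor Tanimoto similarity of a query to the training set.
    Low values suggest the query lies outside the model's applicability domain."""
    query_fp = morgan_fp(query_smiles)
    train_fps = [morgan_fp(s) for s in training_smiles]
    return max(DataStructs.BulkTanimotoSimilarity(query_fp, train_fps))

# Hypothetical usage: flag predictions for substrates that are too dissimilar.
train = ["c1ccc(Br)cc1", "c1ccc(Cl)cc1C", "c1ccncc1Br"]
print(max_similarity_to_training("c1ccc2ccccc2c1Br", train))
```

The choice of fingerprint, descriptor set, and similarity threshold is itself a modeling decision; this is only one convenient baseline.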
■ SELECTING A REACTION PERFORMANCE METRIC AS AN OUTPUT VARIABLE
There are many commonly reported reaction performance metrics that can be used as the prediction target (output variable) in data-driven models. The two most common are yield, bounded between 0 and 100, and selectivity (e.g., the enantiomeric ratio, regioselectivity, etc.), which is a continuous scalar metric. Other metrics such as the reaction rate or rate constants are less common15,16 but are of high interest to process chemists in particular. Rate is a time- and resource-intensive measurement to collect, requiring yields/conversions at many time points. However, rate can be reliably assayed across orders of magnitude and provides insight for practical experimental considerations, such as the reaction concentration, temperature, and time. Enantioselectivity, as reflected by ΔΔG‡, is a compelling choice for an output variable and has been used in a significant number of successful workflows:3,17,18 it is a scalar metric that is centered at 0 when unselective and, due to the relative precision of measuring the enantiomeric ratio (e.r.), does not tend to have a long-tailed distribution. Furthermore, the e.r. most often corresponds to the difference between enantio-determining transition states, with the general reaction mechanism otherwise being the same, allowing one to neglect factors that confound modeling yield as an output, such as side reactions or differences in the turnover rates of a catalyst. Likewise, regioselectivity is an internally consistent metric that relies only on direct comparisons between candidate atom sites.19−22
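For reference, the relationship between a measured enantiomeric ratio and ΔΔG‡ is the standard transition-state expression ΔΔG‡ = RT ln(e.r.); a minimal conversion utility (standard constants, illustrative values) is sketched below.

```python
import math

R = 1.987204e-3  # gas constant in kcal mol^-1 K^-1

def ddg_from_er(er, temperature_k=298.15):
    """Convert an enantiomeric ratio (major:minor) to the free-energy difference
    between competing enantio-determining transition states, in kcal/mol."""
    return R * temperature_k * math.log(er)

def er_from_ee(ee_percent):
    """Enantiomeric ratio from % ee (e.g., 90% ee -> 95:5 -> 19.0)."""
    return (50 + ee_percent / 2) / (50 - ee_percent / 2)

print(round(ddg_from_er(er_from_ee(90.0)), 2))  # ~1.74 kcal/mol at 25 °C
```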
While selectivity is a useful metric for a subset of reactions, the more universal and widely reported metric in synthetic organic chemistry is yield. Generally, yield prediction has only been successful within large, high-throughput datasets in single/narrow reaction classes. Similar attempts to model diverse literature or “real-world” electronic laboratory notebook (ELN) data produce poorer results given the abundance of confounding variables (e.g., concentrations, time, scale, experimental hardware, the experimentalist) that may be unaccounted for in the reaction description.23,24
Different data sources tend to exhibit different distributions of reported reaction yields (Figure 2B).

Figure 2. Common types of reaction datasets and their attributes: HTE datasets, literature databases, and substrate scope studies. (A) Each dataset type qualitatively placed within axes of size, substrate diversity, unique conditions per substrate, and reaction type diversity. (B) Yield distribution histograms for a sample dataset of each type: Suzuki HTE data from Pfizer,8 a subset of the CAS Content Collection covering published single-step reactions from 2010 to 2015, and a reported reaction scope for the preparation of benzamides.11
Yield is a particularly challenging target to predict. It quantifies the efficiency of several successive microscopic steps and is implicitly affected by changes in the reaction conditions that may prompt different mechanistic pathways. It is an inherently noisier value that may include issues related to isolation of the product, which challenges modeling efforts, as the reported yield incorporates both reactivity and purification. Importantly, this is also a time-dependent process, wherein the relative yields across conditions are sensitive to the choice of when the reaction is assayed. For example, the ability to distinguish the efficacy of two catalysts (one fast and one slow) can be lost if the reactions are performed on long time scales. Most datasets are acquired using a single time point without regard for the rate dependence of yield; furthermore, researchers may intentionally choose longer reaction times to achieve higher yields, not realizing that this might be obfuscating differences in the reaction rate.
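A toy first-order kinetics calculation illustrates this point: two hypothetical catalysts with different rate constants are easy to distinguish at an early time point but give nearly identical yields once both reactions approach completion.

```python
import math

def first_order_yield(k_per_hour, t_hours):
    """Idealized single-step, first-order conversion: yield(t) = 100 * (1 - exp(-k*t))."""
    return 100.0 * (1.0 - math.exp(-k_per_hour * t_hours))

fast_k, slow_k = 1.0, 0.25  # hypothetical rate constants in h^-1
for t in (1, 4, 24):  # assay times in hours
    print(f"t = {t:>2} h: fast {first_order_yield(fast_k, t):5.1f}%  "
          f"slow {first_order_yield(slow_k, t):5.1f}%")
# t =  1 h: the catalysts are clearly distinguishable (~63% vs ~22%)
# t = 24 h: both reactions are essentially complete (~100% vs ~99.8%)
```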
Reported yields can also be of several types, such as isolated yields, assay yields, or even LCAP (liquid chromatography area percent) values, further increasing modeling complexity (Figure 3). Selectivity can suffer from the same ambiguity, but it is more consistently assayed without isolation. LCAP is a common output from HTE campaigns, where it is unrealistic to calibrate yields using product standards for every example.
The well-defined range of yield values (0−100%) additionally presents a modeling complication, as many architectures from linear regression to neural networks are able to make predictions outside of this physical range. Compressing or truncating predictions using techniques such as logistic regression or sigmoid activation functions does not tend to improve modeling in our experience. Simplifying the task to a binary (0% versus >0% yield) or categorical (binned yield intervals) classification rather than a regression lowers the analytical burden for data acquisition and mitigates the impact of noise, but it still does not guarantee the ability to train a useful model.
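As a concrete sketch of the two simplifications mentioned above, the snippet below (scikit-learn; the features and yields are synthetic placeholders) regresses on a logit-transformed yield so that predictions map back into 0−100%, and alternatively bins yields for a coarse classification task. As noted, neither strategy reliably improves modeling on its own.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                                        # placeholder reaction descriptors
y = np.clip(50 + 20 * X[:, 0] + 10 * rng.normal(size=200), 0, 100)   # synthetic yields (%)

# Option 1: regress on a logit-transformed yield so predictions stay within 0-100%.
eps = 1e-3
def to_logit(y):
    p = np.clip(y / 100.0, eps, 1 - eps)
    return np.log(p / (1 - p))
def from_logit(z):
    return 100.0 / (1.0 + np.exp(-z))

bounded = TransformedTargetRegressor(regressor=Ridge(), func=to_logit,
                                     inverse_func=from_logit, check_inverse=False)
bounded.fit(X, y)

# Option 2: simplify to a binned-yield classification (low / moderate / good / high).
bin_edges = np.array([20.0, 50.0, 80.0])
y_binned = np.digitize(y, bin_edges)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y_binned)
```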
The range of reaction output values represented in a particular dataset will influence the range of output values in its predictions. This consideration is similar to the domain of applicability, where it is necessary to see sufficient diversity during training if one expects it when making predictions. If, for example, the training set has its outputs within a narrow interval (e.g., yields within 70−95%), it is unlikely that the model will be able to make accurate predictions outside of that interval. Common types of models such as random forests (RFs) and Gaussian processes (GPs) are fundamentally incapable of doing so. Multivariate linear models, neural networks, and others can in principle, but their extrapolations will have a higher degree of uncertainty than their interpolations. Nevertheless, studies have shown successful (at times, retrospective) extrapolation during e.r. prediction to select catalysts that achieve selectivity better than anything observed during training.27−29 To simplify matters, models that are meant to guide experimental design (e.g., optimize reaction conditions30) need not make accurate predictions extrapolating to output values beyond the training set in order to be useful, as evidenced by the success of GPs for Bayesian optimization in chemistry31 and beyond.
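The limitation on tree-based models is easy to demonstrate: in the toy example below, a random forest trained on yields spanning roughly 70−95% cannot predict values above that range, even when the underlying trend continues.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_train = rng.uniform(0.0, 1.0, size=(200, 1))
y_train = 70.0 + 25.0 * X_train[:, 0]          # training yields span ~70-95%

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
X_new = np.array([[1.1], [1.2]])               # inputs beyond the training range
print(rf.predict(X_new))                       # predictions remain capped near ~95%
```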
■ IDENTIFYING A MOLECULAR/REACTION REPRESENTATION TO HELP DEFINE “DIVERSITY”
Supervised learning of complex input/output relationships is the basis of most modeling for chemical reactivity; thus, this goal should guide dataset design. A model’s ability to generalize depends heavily on the representation we use; for example, a categorical (one-hot) encoding of bases does not allow a model to predict the performance of unseen bases, but a representation based on the pKa values of the conjugate acids potentially could. If we intend to train a model to understand the impact of base strength, we might plan to run experiments using diverse bases, wherein diversity is defined in terms of base strength, as reflected by the pKa of the conjugate acid.
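A minimal sketch of this distinction is shown below; the base names and pKa values are placeholders, but the structural point holds: a one-hot encoding gives an unseen base no meaningful representation, whereas a conjugate-acid pKa feature does.

```python
import numpy as np

bases = ["base_A", "base_B", "base_C"]

# Categorical (one-hot) encoding: there is no sensible vector for a fourth, unseen base.
one_hot = {b: np.eye(len(bases))[i] for i, b in enumerate(bases)}

# Descriptor-based encoding: any base with a known conjugate-acid pKa can be featurized,
# giving the model a handle on base strength even for bases absent from training.
conjugate_acid_pka = {"base_A": 24.3, "base_B": 26.5, "base_C": 18.6, "base_new": 23.0}
pka_features = {name: np.array([pka]) for name, pka in conjugate_acid_pka.items()}

print(pka_features["base_new"])   # featurizable
print("base_new" in one_hot)      # False: the one-hot scheme cannot represent it
```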
Our ability to design a dataset that leads to a useful, generalizable model relies on our definition of molecular diversity, whether that be based on descriptors, functional group fingerprints, or more general notions of chemical structure. Whenever we are designing a dataset for the purpose of model training, we should be intentional about aligning the goals of generalization with the diversity of the data points.
If we hypothesize that there are certain molecular features relevant for modeling, those features should form the basis for defining a diverse set of experiments. This may include using density functional theory (DFT)-based descriptors, which directly capture the electronic and structural properties of molecules that often greatly influence reactivity, or simple physicochemical features such as Mordred descriptors.32,33
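One possible sketch of such feature-based selection is shown below: a handful of cheap RDKit physicochemical descriptors (stand-ins for a fuller Mordred or DFT feature set) define the feature space, and candidate substrates are clustered so that one representative per cluster is carried into the experimental design.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def featurize(smiles):
    """A few cheap physicochemical descriptors used here to define 'diversity'."""
    mol = Chem.MolFromSmiles(smiles)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.NumHDonors(mol)]

def pick_diverse_subset(smiles_list, n_select):
    """Cluster candidates in descriptor space and take the member closest to each
    centroid, spreading the selected experiments across the chosen feature space."""
    X = StandardScaler().fit_transform([featurize(s) for s in smiles_list])
    km = KMeans(n_clusters=n_select, n_init=10, random_state=0).fit(X)
    chosen = []
    for c in range(n_select):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        chosen.append(smiles_list[members[np.argmin(dists)]])
    return chosen
```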
While the latter type of descriptor is readily calculable with cheminformatics packages in milliseconds, the computational cost of deriving descriptors from DFT calculations can be significant and render these workflows inaccessible or
Figure 3. Isolated versus analytical yields for (A) literature-extracted Ni-catalyzed C−O couplings25 and (B) a reported photocatalytic C−H activation substrate scope.26 Common reasons for the discrepancies between the yields are given.