The Materials Experiment Knowledge Graph:
Supporting Information

Michael J. Statt¹ *, Brian A. Rohr¹ *, Dan Guevarra² ³, Ja’Nya Breeden³, Santosh K. Suram⁴, and John M. Gregoire² ³ *

¹ Modelyst LLC, Palo Alto, CA 94306
² Division of Engineering and Applied Science, California Institute of Technology, Pasadena, CA 91125
³ Liquid Sunlight Alliance, California Institute of Technology, Pasadena, CA 91125
⁴ Toyota Research Institute, Los Altos, CA 94022
* E-mail: brian.rohr@modelyst.io, michael.statt@modelyst.io, gregoire@caltech.edu
Node and edge types

Entity Type      Count
Analysis         3,980,832
AnalysisDetails  7
Collection       3,612
Element          46
Process          23,591
ProcessData      5,425,961
ProcessDetail    13,434
Sample           12,159,999
SampleProcess    30,656,471
pH               15

Table S1. Summary of entity counts in the Neo4j graph database. Each row represents the count of a unique entity type.
Relationship Type         Count
ANALYSIS_DETAILS          3,980,832
CONTAINS                  3,527,845
CollectionSample          12,159,990
NEXT                      18,590,083
PH                        6,267
PROCESS                   30,656,471
PROCESS_DETAIL            23,488
ProcessDataAnalysis       4,050,703
SAMPLE                    30,656,471
SampleProcessProcessData  7,777,908

Table S2. Summary of relationship counts in the Neo4j graph database. Each row represents the count of a unique relationship type.
Computational Methods

The computational resources used to execute the use cases are as follows. The extract, transform, load (ETL) process was carried out using a Python library called DBgen (https://github.com/modelyst/dbgen), which was specifically designed to instantiate complicated scientific data pipelines. PostgreSQL (https://www.postgresql.org/) was used to create the SQL database, and the Neo4j community edition (https://neo4j.com/) was used to create the graph database. The data was migrated from the SQL database to the graph database using a Python library called PG4J (https://github.com/modelyst/pg4j), which is capable of migrating any PostgreSQL database to Neo4j. For query timing, the SQL database and the graph database were run in Docker containers on AWS EC2. Specifically, the EC2 instance was a t2.xlarge, and the Docker images were postgres:14 and neo4j:5.5 for the SQL and graph databases, respectively. Cypher queries and data processing for the design of experiments use cases were executed in Jupyter notebooks running on a local JupyterHub server (Intel i9-11900K, 64 GB RAM). The computational methods are summarized in Table S3.

Table S3. Resources used for each task

Task                                                                                   Resource Used
Extracting, transforming, and loading (ETL) the filesystem data into the SQL database  DBgen
Hosting the SQL database                                                               PostgreSQL, AWS EC2, Docker
Migrating the SQL database to the graph database                                       PG4J
Hosting the graph database                                                             Neo4j, AWS EC2, Docker
Querying the graph database and processing data for design of experiments use cases    Python, Jupyter notebook server (local)

Electronic Supplementary Material (ESI) for Digital Discovery. This journal is © The Royal Society of Chemistry 2023.
Design of experiments use case

Figure S1. A summary of OER activity with 4982 measurements in pH 14 electrolyte (left) and 6524 measurements in pH 7 electrolyte (right). The catalysts with at least 70% concentration of the given element (Ce, Co, Fe, or Ni) are then grouped by 5 bins of total duration of electrochemical operation prior to the activity measurement. The catalyst overpotential at 3 mA/cm² is shown on an inverted vertical axis so that higher activity appears higher in the figure. The consistent upward trend with increasing duration demonstrates a universal trend in pH 14 electrolyte, which is not observed in pH 7 electrolyte.
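The grouping behind Figure S1 can be sketched in a few lines of Python. The records, values, and bin construction below are illustrative placeholders, not the measured dataset:

```python
# Sketch of the grouping used for Figure S1: filter catalysts by a
# minimum concentration of one element, then bin their activity
# measurements by total prior duration of electrochemical operation.
# All records below are hypothetical placeholders.
records = [
    # (fraction of element of interest, prior duration / s, overpotential / V)
    (0.85, 120.0, 0.42),
    (0.72, 900.0, 0.38),
    (0.95, 3600.0, 0.35),
    (0.40, 200.0, 0.50),   # below the 70% threshold, excluded
    (0.78, 7200.0, 0.33),
]

def bin_by_duration(records, min_frac=0.70, n_bins=5):
    """Group overpotentials of sufficiently concentrated catalysts
    into equal-width bins of prior operation duration."""
    kept = [(d, eta) for frac, d, eta in records if frac >= min_frac]
    lo = min(d for d, _ in kept)
    hi = max(d for d, _ in kept)
    width = (hi - lo) / n_bins or 1.0
    bins = [[] for _ in range(n_bins)]
    for d, eta in kept:
        idx = min(int((d - lo) / width), n_bins - 1)
        bins[idx].append(eta)
    return bins

bins = bin_by_duration(records)
print([len(b) for b in bins])  # number of measurements per duration bin
```

Plotting the per-bin overpotential distributions on an inverted vertical axis then yields panels of the kind shown in the figure.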
Table S4. Candidate composition spaces proposed by automated design of experiments for measurement in pH 3 electrolyte, sorted by 5th percentile overpotential at 10 mA/cm² from pH 7 measurements.

Composition space   Eta (V) 3 mA/cm²  Eta (V) 10 mA/cm²  plate ID  sample count
Co-Fe-La-Ni         0.118             0.142              1740      1365
Co-Fe-La-Ni         0.118             0.142              1723      1365
Ce-Co-Fe-Ni         0.191             0.175              1749      30
Ce-Co-Fe-Ni         0.191             0.175              1751      30
Ce-Co-Fe-Ni         0.191             0.175              1754      30
Ce-Co-Fe-Ni         0.191             0.175              1755      30
Ce-Co-Fe-Ni         0.191             0.175              1756      30
Ce-Co-Fe-Ni         0.191             0.175              1762      388
Ce-Co-Fe-Ni         0.191             0.175              1763      388
Ce-Co-Fe-Ni         0.191             0.175              1774      30
Ce-Co-Fe-Ni         0.191             0.175              2486      388
Ce-Co-Fe-Ni         0.191             0.175              2487      388
Ce-Co-Fe-Ni         0.191             0.175              2488      388
Ce-Co-Fe-La-Ni      0.163             0.184              1757      1683
Ce-Co-La-Ni         0.435             0.483              1721      1398
Ce-Co-La-Ni         0.435             0.483              1720      1398
Ce-Co-La-Ni         0.435             0.483              2369      30
Ce-Co-La-Ni         0.435             0.483              2367      30
Ce-Co-La-Ni         0.435             0.483              1722      1398
Ce-Co-La-Ni         0.435             0.483              1750      30
Ce-Co-La-Ni         0.435             0.483              1719      1398
Ce-Co-La-Ni         0.435             0.483              1829      1398
Co-Cu-Fe-Mn-Sn-Ta   0.556             0.631              3673      3
Co-Cu-Fe-Mn-Sn-Ta   0.556             0.631              3859      63
Co-Cu-Fe-Mn-Sn-Ta   0.556             0.631              3865      63
Co-Cu-Fe-Mn-Sn-Ta   0.556             0.631              3866      63
Co-Cu-Fe-Mn-Sn-Ta   0.556             0.631              3867      63
Co-Cu-Fe-Mn-Sn-Ta   0.556             0.631              3870      63
Ce-Co-Ni-Zn         0.521             0.634              2532      1597
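The sort key used for this table, the 5th percentile of overpotential within each composition space, can be sketched as follows; the measurement values below are hypothetical placeholders:

```python
from statistics import quantiles

# Sketch of the table's sort criterion: for each candidate composition
# space, take the 5th percentile of its measured overpotentials at
# 10 mA/cm^2, then rank spaces from lowest (most active) upward.
# The measurements below are hypothetical placeholders.
measurements = {
    "Co-Fe-La-Ni": [0.14, 0.15, 0.16, 0.18, 0.21],
    "Ce-Co-La-Ni": [0.48, 0.49, 0.51, 0.53, 0.60],
    "Ce-Co-Fe-Ni": [0.17, 0.18, 0.19, 0.20, 0.25],
}

def fifth_percentile(values):
    # quantiles with n=20 returns 19 cut points; the first one is the
    # 5th percentile ("inclusive" treats the data as the population).
    return quantiles(values, n=20, method="inclusive")[0]

ranked = sorted(measurements, key=lambda s: fifth_percentile(measurements[s]))
print(ranked)
```

Using a low percentile rather than the minimum makes the ranking less sensitive to a single anomalously active sample within a composition space.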
Creation of Database Subsets

To investigate the scaling of query times in both the SQL and graph databases, we created three smaller versions of the original database. The first database fragment was created by removing the last half of the rows in the sample-process table, ordered by their process timestamps. We then deleted all rows in other tables that were no longer linked to a sample-process. This process was repeated two more times to create two additional database fragments, with 3/4 and 7/8 of the rows in the sample-process table deleted. Each fragment was migrated to Neo4j using the tools described above, resulting in a series of MPS-style and MEKG-style databases that share the same information and contain 1/8, 1/4, and 1/2 of the number of Sample-Processes in the full MPS and MEKG databases.
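The fragment-creation procedure can be sketched as follows, shown here with SQLite and a hypothetical two-table schema for illustration (the actual database is PostgreSQL with many more linked tables):

```python
import sqlite3

# Sketch of creating a half-size fragment: drop the later half of
# sample_process rows by timestamp, then delete orphaned rows in a
# dependent table. Schema and data are hypothetical placeholders.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sample_process (id INTEGER PRIMARY KEY, ts REAL);
    CREATE TABLE process_data (id INTEGER PRIMARY KEY,
                               sample_process_id INTEGER);
""")
con.executemany("INSERT INTO sample_process VALUES (?, ?)",
                [(i, float(i)) for i in range(8)])
con.executemany("INSERT INTO process_data VALUES (?, ?)",
                [(i, i) for i in range(8)])

# Remove the last half of the sample-processes, ordered by timestamp.
(n,) = con.execute("SELECT COUNT(*) FROM sample_process").fetchone()
con.execute("""
    DELETE FROM sample_process WHERE id IN (
        SELECT id FROM sample_process ORDER BY ts DESC LIMIT ?)
""", (n // 2,))

# Delete rows that no longer link to a surviving sample-process.
con.execute("""
    DELETE FROM process_data WHERE sample_process_id NOT IN (
        SELECT id FROM sample_process)
""")

remaining = con.execute("SELECT COUNT(*) FROM sample_process").fetchone()[0]
print(remaining)
```

Repeating the first DELETE on the already-halved table yields the 1/4 and 1/8 fragments.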
Query 4 in Cypher and SQL

Below is the code for Query 4 in Cypher:

MATCH
    path=(sp1:SampleProcess)-[:NEXT]->(sp2)-[:NEXT]->(sp3)-[:NEXT]->(sp4)-[:NEXT]->(sp5),
    (a1:Analysis)<--(pda1:ProcessData)<--(sp1)-->(p1:Process)-->(pd1:ProcessDetail),
    (a2:Analysis)<--(pda2:ProcessData)<--(sp2)-->(p2:Process)-->(pd2:ProcessDetail),
    (a3:Analysis)<--(pda3:ProcessData)<--(sp3)-->(p3:Process)-->(pd3:ProcessDetail),
    (a4:Analysis)<--(pda4:ProcessData)<--(sp4)-->(p4:Process)-->(pd4:ProcessDetail),
    (a5:Analysis)<--(pda5:ProcessData)<--(sp5)-->(p5:Process)-->(pd5:ProcessDetail)
WHERE
    pd1.technique STARTS WITH 'CA'
    AND pd2.technique STARTS WITH 'CA'
    AND pd3.technique STARTS WITH 'CA'
    AND pd4.technique STARTS WITH 'CA'
    AND pd5.technique STARTS WITH 'CV'
    AND a5.name = 'CV_FOMS_standard'
    AND apoc.convert.fromJsonMap(a1.output)['I.A_ave'] > 1e-7
    AND apoc.convert.fromJsonMap(a2.output)['I.A_ave'] > 1e-8
    AND apoc.convert.fromJsonMap(a3.output)['I.A_ave'] > 1e-9
    AND apoc.convert.fromJsonMap(a4.output)['I.A_ave'] > 1e-10
    AND apoc.convert.fromJsonMap(a5.output)['I.A_max'] > 1e-6
    AND apoc.convert.fromJsonMap(pd1.parameters)['electrolyte'] CONTAINS 'NaOH'
    AND apoc.convert.fromJsonMap(pd2.parameters)['electrolyte'] CONTAINS 'NaOH'
    AND apoc.convert.fromJsonMap(pd3.parameters)['electrolyte'] CONTAINS 'NaOH'
    AND apoc.convert.fromJsonMap(pd4.parameters)['electrolyte'] CONTAINS 'NaOH'
    AND apoc.convert.fromJsonMap(pd5.parameters)['electrolyte'] CONTAINS 'NaOH'
RETURN count(path)
Below is the code for Query 4 in SQL:

with your_table as (
    select
        sp.sample_id,
        p1."timestamp",
        p1."ordering",
        pd1.technique,
        pd1.parameters,
        a."name",
        a."output"
    from
        sample_process sp
    join process p1 on
        sp.process_id = p1.id
    left join process_detail pd1 on
        p1.process_detail_id = pd1.id
    left join sample_process_process_data sppd on
        sppd.sample_process_id = sp.id
    left join process_data pd on
        sppd.process_data_id = pd.id
    left join process_data_analysis pda on
        pda.process_data_id = pd.id
    left join analysis a on
        pda.analysis_id = a.id
    where
        sp.sample_id in (
            select
                sp3.sample_id
            from
                sample_process sp3
            join process p3 on
                sp3.process_id = p3.id
            join process_detail pd3 on
                p3.process_detail_id = pd3.id
            where
                pd3.technique like 'CV%'
        )
        and sp.sample_id in (
            select
                sp2.sample_id
            from
                sample_process sp2
            join process p2 on
                sp2.process_id = p2.id
            join process_detail pd2 on
                p2.process_detail_id = pd2.id
            where
                pd2.technique like 'CA%'
            group by
                sp2.sample_id
            having
                count(*) >= 4)
),
filtered_labels as (
    select
        sample_id
    from
        your_table
    group by
        sample_id
    having
        count(*) >= 5
),
sequenced_data as (
    select
        t1.sample_id,
        t1."timestamp",
        t1.ordering,
        t1.technique,
        t1.parameters,
        t1."name",
        t1."output",
        row_number() over (partition by t1.sample_id
            order by
                t1."timestamp",
                t1.ordering) as RowNum
    from
        your_table t1
    inner join filtered_labels fl on
        t1.sample_id = fl.sample_id
),
json_agg_data as (
    select
        sd1.sample_id,
        json_agg(sd1.technique) over (partition by sd1.sample_id
            order by
                sd1.RowNum rows between current row and 4 following) as technique_seq,
        json_agg(sd1.parameters) over (partition by sd1.sample_id
            order by
                sd1.RowNum rows between current row and 4 following) as parameters_seq,
        json_agg(sd1."name") over (partition by sd1.sample_id
            order by
                sd1.RowNum rows between current row and 4 following) as name_seq,
        json_agg(sd1."output") over (partition by sd1.sample_id
            order by
                sd1.RowNum rows between current row and 4 following) as output_seq,
        count(*) over (partition by sd1.sample_id
            order by
                sd1.RowNum rows between current row and 4 following) as technique_seq_count
    from
        sequenced_data sd1
)
select
    count(*)
    -- sample_id,
    -- technique_seq,
    -- parameters_seq,
    -- name_seq,
    -- output_seq
from
    json_agg_data
where
    technique_seq_count = 5
    and technique_seq->>0 like 'CA%'
    and technique_seq->>1 like 'CA%'
    and technique_seq->>2 like 'CA%'
    and technique_seq->>3 like 'CA%'
    and technique_seq->>4 like 'CV%'
    and name_seq->>4 = 'CV_FOMS_standard'
    and (output_seq->0->>'I.A_ave')::float > 1e-7
    and (output_seq->1->>'I.A_ave')::float > 1e-8
    and (output_seq->2->>'I.A_ave')::float > 1e-9
    and (output_seq->3->>'I.A_ave')::float > 1e-10
    and (output_seq->4->>'I.A_max')::float > 1e-6
    and parameters_seq->0->>'electrolyte' like '%NaOH%'
    and parameters_seq->1->>'electrolyte' like '%NaOH%'
    and parameters_seq->2->>'electrolyte' like '%NaOH%'
    and parameters_seq->3->>'electrolyte' like '%NaOH%'
    and parameters_seq->4->>'electrolyte' like '%NaOH%'
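Both queries express the same constraint: four consecutive CA steps followed by a CV step on the same sample, with a threshold on each step's output. That sliding-window logic can be sketched in plain Python; the field names and records below are hypothetical placeholders:

```python
# Sketch of the CA,CA,CA,CA,CV pattern match that Query 4 performs:
# slide a 5-step window over one sample's time-ordered process steps
# and count windows whose techniques and outputs satisfy the
# per-position thresholds. Records are hypothetical placeholders.
steps = [
    {"technique": "CA1", "I_A_ave": 1e-5},
    {"technique": "CA2", "I_A_ave": 1e-5},
    {"technique": "CA3", "I_A_ave": 1e-5},
    {"technique": "CA4", "I_A_ave": 1e-5},
    {"technique": "CV",  "I_A_max": 1e-4},
    {"technique": "OCV"},
]

# Per-position predicates mirroring the WHERE clauses of Query 4.
predicates = [
    lambda s: s["technique"].startswith("CA") and s.get("I_A_ave", 0) > 1e-7,
    lambda s: s["technique"].startswith("CA") and s.get("I_A_ave", 0) > 1e-8,
    lambda s: s["technique"].startswith("CA") and s.get("I_A_ave", 0) > 1e-9,
    lambda s: s["technique"].startswith("CA") and s.get("I_A_ave", 0) > 1e-10,
    lambda s: s["technique"].startswith("CV") and s.get("I_A_max", 0) > 1e-6,
]

def count_matches(steps):
    """Count 5-step windows satisfying all positional predicates."""
    return sum(
        all(p(s) for p, s in zip(predicates, steps[i:i + 5]))
        for i in range(len(steps) - 4)
    )

print(count_matches(steps))
```

The Cypher query finds such windows by traversing NEXT relationships directly, whereas the SQL query must first reconstruct each sample's step order with row_number() and then aggregate 5-row windows, which is one source of the difference in query complexity between the two databases.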