WWW.NATURE.COM/NATURE | 1
SUPPLEMENTARY INFORMATION
doi:10.1038/nature24047
Supplementary Information for “Solving a Higgs
1
optimization problem with quantum annealing for
2
machine learning”
3
AlexMott
1
◦
,JoshuaJob
2
,Jean-RochVlimant
1
,DanielLidar
3
,andMariaSpiropulu
1
∗
4
1
DepartmentofPhysics,CaliforniaInstituteofTechnology,Pasadena,91125,USA
5
2
DepartmentofPhysics,andCenterforQuantumInformationScience&Technology,UniversityofSouthern
6
California,LosAngeles,California90089,USA
7
3
DepartmentsofElectricalEngineering,ChemistryandPhysics,andCenterforQuantumInformationScience&
8
Technology,UniversityofSouthernCalifornia,LosAngeles,California90089,USA
9
◦
NowatDeepMind
10
*
smaria@caltech.edu
11
ABSTRACT
12
We provide additional details in support of “On a Higgs optimization problem with quantum annealing”
13
1 The quantum annealer approach to the Higgs optimization problem
14
Our problem, toward which we apply quantum annealing for machine learning (QAML), is that of constructing a
15
binary classifier that can detect the “signal” of the decay of a Higgs boson into two photons in a “background” of noise
16
from other Standard Model processes. The classifiers are trained on a set of simulated collision events (synthetic
17
data sets) where the signal sample contains events with a Higgs boson and the background sample contains a
18
cocktail of background physics processes that mimic Higgs events. The classification is achieved by exploiting deep
19
correlations in various physical properties of the signal and background events. Classifiers such as boosted decision
20
trees [e.g., XGBoost (XGB)], or deep neural networks (DNN) have seen great success in many other contexts, from
21
speech and image recognition,
1
to marketing, finance and manufacturing.
2
In the high energy physics context there
22
are challenges and limitations of these techniques often related to the level of agreement between the synthetic and
23
observed data. Supervised learning requires an accurately labeled training data set, and the simulation procedure
24
requires calculations of the matrix elements of the physics processes of interest,
3
modeling of the hadronization of
25
colored particles
?
and simulation of the interaction of the final state particles with the detector.
4
The complexity
26
of the end-to-end simulation operation is encapsulated in the uncertainties associated with the level of agreement
27
with the observations.
28
The binary classifier proposed and studied in this work, is trained with a “quantum annealing for machine
29
learning” (QAML) algorithm
5, 6
and takes the form of a linear neural network (LNN) that relies on explicitly
30
linearized correlations. This reduces sensitivity to errors in the model of the detector, and due to the binary
31
weights it guards against overtraining. In this model it is simple to control and correct the correlations between
32
the kinematical observables in the Monte Carlo simulations. Additionally the model provides a straightforward
33
interpretation of the criteria used to classify the events. This comes at the price of a provably NP-hard training
34
problem, with a training time that grows exponentially with the number of variables. This is a price sometimes
35
worth paying in return for robustness in the presence of label noise, a fact that has become increasingly recognized
36
in the machine learning community.
7–9
37
Heuristic optimization techniques such as classical simulated annealing (SA)
10, 11
and quantum annealing (QA)
12, 13
38
may reduce the training time sufficiently to solve problems of practical interest with this linear model. QA and
39
the closely related quantum adiabatic algorithm,
14
hold the potential for significant improvements in performance
40
over classical techniques, though the delineation of the improvement boundary remains an active area of research.
15
41
Here we use both QA and SA to train a classifier and examine its performance compared to traditional methods.
42
To implement QA we use a programmable quantum annealer
16
built by D-Wave Systems Inc.,
17, 18
the D-Wave
43
Two X (DW) model housed at the University of Southern California’s Information Sciences Institute, comprising
44
1098
superconducting flux qubits. Such QA devices have been employed to study, e.g., graph isomorphism,
19
45
1
Bayesian network structure,
20
operational planning,
21
DNNs,
22
quantum Boltzmann machines,
23–25
and tree cover
46
detection in aerial imagery.
26
Both the quantumness
27–31
and speedup
32–35
in these devices are intensely scrutinized
47
topics of ongoing research.
48
2 DNN and XGB optimization procedure
49
We benchmark the performance of QAML against DNN and XGB.
50
We train a DNN using Keras
36
with the Theano backend,
37
a standard tool in deep learning and increasingly
51
popular in high energy physics. Our network has two fully connected hidden layers with
1000
nodes each. The
52
model is optimized using the Adam algorithm
38
with a learning rate of
0
.
001
and a mini-batch size of
10
. We
53
find that network performance is not affected by small changes in the number of nodes or the initial guesses for
54
the weights. The model hyperparameters, regularization terms, and optimization parameters for our deep neural
55
net are selected using the
Spearmint
Bayesian optimization software.
39, 40
Early stopping is used (with patience
56
parameter
10
) to avoid overtraining and have sufficient generalization.
57
We also train an ensemble of boosted decision trees using XGB
41
with a maximum depth of
10
, a learning rate
58
of
0
.
3
, and L2-regularization parameter
λ
=
2000
.
59
To train and optimize XGB, we use
100
rounds of training and start with the default choices for the various
60
parameters. We evaluate values of the learning rate
61
η
∈{
0
.
001
,
0
.
002
,
0
.
003
,
0
.
005
,
0
.
008
,
0
.
01
,
0
.
02
,
0
.
03
,
0
.
05
,
0
.
08
,
0
.
1
,
0
.
2
,
0
.
3
,
0
.
5
,
0
.
8
}
at tree depths of
5
,
8
,
10
,
12
,
15
,
62
and
20
. Some of these parameters give small improvements in AUC over the defaults at value of the L2-regularization
63
parameter
λ
=
1
. Far larger improvements are found when
λ
is increased. Hence we hold the other parameters
64
fixed and evaluate
λ
∈{
5
,
10
,
20
,
50
,
100
,
200
,
500
,
1000
,
1500
,
1800
,
2000
,
2200
,
2500
}
, finding the approximate opti-
65
mum AUC on the test set at
2000
. Testing again, the tree depth and
η
are found to have minimal effect on the
66
AUC (significantly smaller than the error), and
η
=
0
.
3
and tree depth
10
are chosen as the approximate optimum.
67
We note that the DNN and XGB settings are selected so as to prevent overtraining.
68
3 Mapping weak classifier selection to the Ising problem
69
In this section we closely follow Ref. [
6], with slight changes of notation. Let
V
be the event space, consisting of
70
vectors
{
⃗
x
}
that are either signal or background. We define a weak classifier
c
i
(
⃗
x
)
:
V
�→
R
,
i
=
1
,...
N
, as classifying
71
event
⃗
x
as signal (background) if
c
i
(
⃗
x
)
>
0
(
c
i
(
⃗
x
)
<
0
). We normalize each weak classifier so that
|
c
i
|≤
1
/
N
. We
72
introduce a binary weights vector
⃗
w
∈{
0
,
1
}
N
and construct a strong classifier
R
⃗
w
(
⃗
x
)=
∑
i
w
i
c
i
(
⃗
x
)
∈
[
−
∥
⃗
w
∥
/
N
,
∥
⃗
w
∥
/
N
]
.
73
The event
⃗
x
is correspondingly classified as signal (background) if
R
⃗
w
(
⃗
x
)
>
0
(
R
⃗
w
(
⃗
x
)
<
0
). The weights
⃗
w
are to be
74
determined; they are the target of the solution of the Ising problem.
75
Let
T
=
{
⃗
x
τ
,
y
τ
}
denote a given set of training events, where
⃗
x
τ
is an event vector collecting the values of each
of the variables we use, and
y
τ
=
±
1
is a binary label for whether
⃗
x
τ
is signal (
+
1
) or background (
−
1
). Let
Q
⃗
w
(
⃗
x
)=
sign
[
R
⃗
w
(
⃗
x
)]
, so that
Q
⃗
w
(
⃗
x
)=+
1
(
−
1
) denotes signal (background) event classification. Thus
y
τ
Q
⃗
w
(
⃗
x
τ
)=+
1
if
⃗
x
τ
is correctly classified as signal or background (
y
τ
and
Q
⃗
w
(
⃗
x
τ
)
agree), and
y
τ
Q
⃗
w
(
⃗
x
τ
)=
−
1
if
⃗
x
τ
is incorrectly
classified (
y
τ
and
Q
⃗
w
(
⃗
x
τ
)
disagree). The cost function
L
(
⃗
w
)=
∑
τ
[
1
−
y
τ
Q
⃗
w
(
⃗
x
τ
)]
/
2
thus counts the number of incorrectly
classified training events, and minimizing it over all possible weight vectors returns the optimal set of weights, and
hence the optimal strong classifier given the training set
T
. To avoid overtraining and economize on the number
of weak classifiers used, we can introduce a penalty term proportional to the number of weights, i.e.,
λ
∥
⃗
w
∥
, where
λ
>
0
is the penalty strength. Thus the optimal set of weights for given
λ
is
⃗
w
opt
=
argmin
{
⃗
w
}
[
L
(
⃗
w
)+
λ
∥
⃗
w
∥
]
.
(1)
This optimization problem cannot be directly mapped onto a quantum annealer, due to the appearance of the sign
function. Instead we next introduce a relaxation to a quadratic form that is implementable on the current generation
of D-Wave devices. Namely, using the training set we form the vector of strong classifier results
⃗
R
⃗
w
=
{
R
⃗
w
(
⃗
x
τ
)
}
|
T
|
τ
=
1
,
the Euclidean distance measure
δ
(
⃗
w
)=
∥
⃗
y
−
⃗
R
⃗
w
∥
2
between the strong classifier and the set of training labels, and
replace Eq. (
1) by
⃗
w
min
=
argmin
{
⃗
w
}
δ
(
⃗
w
)
.
(2)
Finding
⃗
w
opt
in this way is equivalent to solving a quadratic unconstrained binary optimization (QUBO) problem:
⃗
w
min
=
argmin
{
⃗
w
}
[
∑
τ
R
⃗
w
(
⃗
x
τ
)
2
−
2
y
τ
R
⃗
w
(
⃗
x
τ
)+
y
2
τ
]
=
argmin
{
⃗
w
}
[
∑
τ
(
∑
i
,
j
w
i
w
j
c
i
(
⃗
x
τ
)
c
j
(
⃗
x
τ
)
−
2
y
τ
∑
i
w
i
c
i
(
⃗
x
τ
)
)
+
|
T
|
]
.
(3)
2/25
SUPPLEMENTARY INFORMATION
2
| WWW.NATURE.COM/NATURE
RESEARCH
Bayesian network structure,
20
operational planning,
21
DNNs,
22
quantum Boltzmann machines,
23–25
and tree cover
46
detection in aerial imagery.
26
Both the quantumness
27–31
and speedup
32–35
in these devices are intensely scrutinized
47
topics of ongoing research.
48
2 DNN and XGB optimization procedure
49
We benchmark the performance of QAML against DNN and XGB.
50
We train a DNN using Keras
36
with the Theano backend,
37
a standard tool in deep learning and increasingly
51
popular in high energy physics. Our network has two fully connected hidden layers with
1000
nodes each. The
52
model is optimized using the Adam algorithm
38
with a learning rate of
0
.
001
and a mini-batch size of
10
. We
53
find that network performance is not affected by small changes in the number of nodes or the initial guesses for
54
the weights. The model hyperparameters, regularization terms, and optimization parameters for our deep neural
55
net are selected using the
Spearmint
Bayesian optimization software.
39, 40
Early stopping is used (with patience
56
parameter
10
) to avoid overtraining and have sufficient generalization.
57
We also train an ensemble of boosted decision trees using XGB
41
with a maximum depth of
10
, a learning rate
58
of
0
.
3
, and L2-regularization parameter
λ
=
2000
.
59
To train and optimize XGB, we use
100
rounds of training and start with the default choices for the various
60
parameters. We evaluate values of the learning rate
61
η
∈{
0
.
001
,
0
.
002
,
0
.
003
,
0
.
005
,
0
.
008
,
0
.
01
,
0
.
02
,
0
.
03
,
0
.
05
,
0
.
08
,
0
.
1
,
0
.
2
,
0
.
3
,
0
.
5
,
0
.
8
}
at tree depths of
5
,
8
,
10
,
12
,
15
,
62
and
20
. Some of these parameters give small improvements in AUC over the defaults at value of the L2-regularization
63
parameter
λ
=
1
. Far larger improvements are found when
λ
is increased. Hence we hold the other parameters
64
fixed and evaluate
λ
∈{
5
,
10
,
20
,
50
,
100
,
200
,
500
,
1000
,
1500
,
1800
,
2000
,
2200
,
2500
}
, finding the approximate opti-
65
mum AUC on the test set at
2000
. Testing again, the tree depth and
η
are found to have minimal effect on the
66
AUC (significantly smaller than the error), and
η
=
0
.
3
and tree depth
10
are chosen as the approximate optimum.
67
We note that the DNN and XGB settings are selected so as to prevent overtraining.
68
3 Mapping weak classifier selection to the Ising problem
69
In this section we closely follow Ref. [
6], with slight changes of notation. Let
V
be the event space, consisting of
70
vectors
{
⃗
x
}
that are either signal or background. We define a weak classifier
c
i
(
⃗
x
)
:
V
�→
R
,
i
=
1
,...
N
, as classifying
71
event
⃗
x
as signal (background) if
c
i
(
⃗
x
)
>
0
(
c
i
(
⃗
x
)
<
0
). We normalize each weak classifier so that
|
c
i
|≤
1
/
N
. We
72
introduce a binary weights vector
⃗
w
∈{
0
,
1
}
N
and construct a strong classifier
R
⃗
w
(
⃗
x
)=
∑
i
w
i
c
i
(
⃗
x
)
∈
[
−
∥
⃗
w
∥
/
N
,
∥
⃗
w
∥
/
N
]
.
73
The event
⃗
x
is correspondingly classified as signal (background) if
R
⃗
w
(
⃗
x
)
>
0
(
R
⃗
w
(
⃗
x
)
<
0
). The weights
⃗
w
are to be
74
determined; they are the target of the solution of the Ising problem.
75
Let
T
=
{
⃗
x
τ
,
y
τ
}
denote a given set of training events, where
⃗
x
τ
is an event vector collecting the values of each
of the variables we use, and
y
τ
=
±
1
is a binary label for whether
⃗
x
τ
is signal (
+
1
) or background (
−
1
). Let
Q
⃗
w
(
⃗
x
)=
sign
[
R
⃗
w
(
⃗
x
)]
, so that
Q
⃗
w
(
⃗
x
)=+
1
(
−
1
) denotes signal (background) event classification. Thus
y
τ
Q
⃗
w
(
⃗
x
τ
)=+
1
if
⃗
x
τ
is correctly classified as signal or background (
y
τ
and
Q
⃗
w
(
⃗
x
τ
)
agree), and
y
τ
Q
⃗
w
(
⃗
x
τ
)=
−
1
if
⃗
x
τ
is incorrectly
classified (
y
τ
and
Q
⃗
w
(
⃗
x
τ
)
disagree). The cost function
L
(
⃗
w
)=
∑
τ
[
1
−
y
τ
Q
⃗
w
(
⃗
x
τ
)]
/
2
thus counts the number of incorrectly
classified training events, and minimizing it over all possible weight vectors returns the optimal set of weights, and
hence the optimal strong classifier given the training set
T
. To avoid overtraining and economize on the number
of weak classifiers used, we can introduce a penalty term proportional to the number of weights, i.e.,
λ
∥
⃗
w
∥
, where
λ
>
0
is the penalty strength. Thus the optimal set of weights for given
λ
is
⃗
w
opt
=
argmin
{
⃗
w
}
[
L
(
⃗
w
)+
λ
∥
⃗
w
∥
]
.
(1)
This optimization problem cannot be directly mapped onto a quantum annealer, due to the appearance of the sign
function. Instead we next introduce a relaxation to a quadratic form that is implementable on the current generation
of D-Wave devices. Namely, using the training set we form the vector of strong classifier results
⃗
R
⃗
w
=
{
R
⃗
w
(
⃗
x
τ
)
}
|
T
|
τ
=
1
,
the Euclidean distance measure
δ
(
⃗
w
)=
∥
⃗
y
−
⃗
R
⃗
w
∥
2
between the strong classifier and the set of training labels, and
replace Eq. (
1) by
⃗
w
min
=
argmin
{
⃗
w
}
δ
(
⃗
w
)
.
(2)
Finding
⃗
w
opt
in this way is equivalent to solving a quadratic unconstrained binary optimization (QUBO) problem:
⃗
w
min
=
argmin
{
⃗
w
}
[
∑
τ
R
⃗
w
(
⃗
x
τ
)
2
−
2
y
τ
R
⃗
w
(
⃗
x
τ
)+
y
2
τ
]
=
argmin
{
⃗
w
}
[
∑
τ
(
∑
i
,
j
w
i
w
j
c
i
(
⃗
x
τ
)
c
j
(
⃗
x
τ
)
−
2
y
τ
∑
i
w
i
c
i
(
⃗
x
τ
)
)
+
|
T
|
]
.
(3)
2/25
WWW.NATURE.COM/NATURE | 3
SUPPLEMENTARY INFORMATION
RESEARCH
Regrouping the terms in the sum and dropping the constant we find:
⃗
w
min
=
argmin
{
⃗
w
}
[
∑
i
,
j
w
i
w
j
(
∑
τ
c
i
(
⃗
x
τ
)
c
j
(
⃗
x
τ
)
)
−
2
∑
i
w
i
(
∑
τ
c
i
(
⃗
x
τ
)
y
τ
)
]
=
argmin
{
⃗
w
}
[
∑
i
,
j
C
ij
w
i
w
j
−
2
∑
i
C
i
w
i
]
,
(4)
where
C
ij
=
∑
τ
c
i
(
⃗
x
τ
)
c
j
(
⃗
x
τ
)=
C
ji
and
C
i
=
∑
τ
c
i
(
⃗
x
τ
)
y
τ
.
76
This has a tendency to overtrain. The reason is that
|
R
⃗
w
(
⃗
x
τ
)
|≤
∥
⃗
w
∥
/
N
, so that
|
y
τ
−
R
⃗
w
(
⃗
x
τ
)
|
2
≥
(
1
−
∥
⃗
w
∥
/
N
)
2
, and
hence
δ
(
⃗
w
)=
∑
τ
|
y
τ
−
R
⃗
w
(
⃗
x
τ
)
|
2
≥|
T
|
(
1
−
∥
⃗
w
∥
/
N
)
2
. To minimize
δ
(
⃗
w
)
the solution will be biased toward making
∥
⃗
w
∥
as large as possible, i.e., to include as many weak classifiers as possible. To counteract this overtraining tendency
we add a penalty term that makes the distance larger in proportion to
∥
⃗
w
∥
, i.e.,
λ
∥
⃗
w
∥
with
λ
>
0
, just as in Eq. (
1).
Thus we replace Eq. (4) by
⃗
w
min
=
argmin
{
⃗
w
}
[
∑
i
,
j
C
ij
w
i
w
j
+
∑
i
(
λ
−
2
C
i
)
w
i
]
,
(5)
The last step is to convert this QUBO into an Ising problem by changing the binary
w
i
into spin variables
s
i
=
±
1
,
i.e.,
w
i
=(
s
i
+
1
)
/
2
, resulting in:
⃗
s
min
=
argmin
{
⃗
s
}
[
1
4
∑
i
,
j
C
ij
s
i
s
j
+
1
2
∑
i
,
j
C
ij
s
i
+
1
2
∑
i
(
λ
−
2
C
i
)
s
i
]
,
(6)
where we use the symmetry of
C
ij
to write the middle term in the second line, and we drop the constant terms
1
4
∑
i
,
j
C
ij
and
1
2
∑
i
(
λ
−
2
C
i
)
. We now define the couplings
J
ij
=
1
4
C
ij
and the local fields
h
i
=
1
2
(
λ
−
2
C
i
+
∑
j
C
ij
)
. The
optimization problem is then equivalent to finding the ground state
⃗
s
min
=
argmin
{
⃗
s
}
H
of the Ising Hamiltonian
H
Ising
=
N
∑
i
<
j
J
ij
s
i
s
j
+
N
∑
i
=
1
h
i
s
i
.
(7)
In the main text and hereafter, when we refer to
λ
it is measured in units of
max
i
(
C
i
)
(e.g.,
λ
=
0
.
05
is shorthand
77
for
λ
=
0
.
05max
i
(
C
i
)
).
78
4 Robustness of QAML to MCMC mismodelling
79
Two essential steps are involved in the construction of the weak classifiers in our approach. First, we remove
80
information about the tails of the distributions of each variable and use the corresponding truncated single-variable
81
distributions to construct weak classifiers. Second, since the single-variable classifiers do not include any correlations
82
between variables, we include additional weak classifiers built from the products/ratios of the variables, where after
83
taking the products/ratios we again apply the same truncation and remove tails. That is, our weak classifiers
84
account only for one and two-point correlations and ignore all higher order correlations in the kinematic variable
85
distribution. The particular truncation choice to define the weak classifiers as a piecewise linear function defined
86
only by a central percentile (30th or 70th, chosen during construction) and two percentiles in the tails (10th and
87
90th) means that the MC simulations only have to approximately estimate those four percentiles of the marginals
88
and the correlations between the variables. Any MC simulation which is unable to approximate the 10th, 30th,
89
70th, and 90th percentiles of the marginal distribution for each dimension of the dataset and the products between
90
them would surely not be considered acceptably similar to the target distribution for use in HEP data analyses, as
91
it is effectively guaranteed to be wrong in the higher order correlations and thus in its approximation of the true
92
distribution. Meanwhile, typical machine learning approaches for this problem use arbitrary relationships across
93
the entire training dataset, including the tails and high-order correlations, and so are likely to be more sensitive to
94
any mismodelling.
95
5 Receiver operating characteristic (ROC)
96
Any classifier may be characterized by two numbers: the true positive and true negative rates, in our case corre-
97
sponding to the fraction of events successfully classified as signal or background, respectively. Since our classifiers
98
all return floating point values in
[
−
1
,
1
]
, to construct a binary classifier we introduce a cut in this range, above and
99
below which we classify as signal and background, respectively. Since this cut is a free parameter, we vary it across
100
3/25
SUPPLEMENTARY INFORMATION
4
| WWW.NATURE.COM/NATURE
RESEARCH
Figure 1.
The ROC curves for the annealer-trained networks (DW and SA) at
f
=
0
.
05
, DNN, and XGB. Error
bars are defined by the variation over the training sets and statistical error. Both panels show all four ROC
curves. Panel (a) [(b)] includes
1
σ
error bars only for DW and DNN [SA and XGB], in light blue and pale yellow,
respectively. Results shown are for the
36
variable networks at
λ
=
0
.
05
trained on
100
events. The annealer
trained networks have a larger area under the ROC curve
the entire range and plot the resulting parametric curve of signal acceptance (true positive,
ε
S
) and background
101
rejection (true negative,
r
B
), producing a receiver operating characteristic (ROC) curve.
42
102
More explicitly, consider a labeled set of validation events,
V
=
{
⃗
x
v
,
y
v
}
, with
y
v
=
0
or
1
if
⃗
x
v
is background or
103
signal, respectively, and a strong classifier
R
⃗
w
(
⃗
x
)
. The latter is constructed from a given set of weak classifiers and
104
a vector of weights
⃗
w
previously obtained from training over a training set
T
. The strong classifier outputs a real
105
number
R
⃗
w
(
⃗
x
v
)=
∑
i
w
i
c
i
(
⃗
x
v
)
. To complete the classifier, one introduces a cut,
O
c
, such that we classify event
⃗
x
v
as
106
signal if
R
⃗
w
(
⃗
x
v
)
>
O
c
and background if
R
⃗
w
(
⃗
x
v
)
<
O
c
. If we evaluate the strong classifier on each of the events in
107
our validation set
V
, we obtain a binary vector of classifications,
⃗
C
=
{
C
v
}
, with entries
0
denoting classification as
108
background and
1
denoting classification as signal. By comparing
C
v
to
y
v
for all
v
we can then evaluate the fraction
109
of the events which are correctly classified as background, called the true negative rate or “background rejection”
110
r
B
(equal to the number of times
C
v
=
y
v
=
0
, divided by the total number of actual background events ), and the
111
fraction of events correctly classified as signal or “signal efficiency”
ε
S
(equal to the number of times
C
v
=
y
v
=
1
,
112
divided by the total number of actual signal events). For a given strong classifier, these values will be a function of
113
the cutoff
O
c
. Plotting
r
B
(
O
c
)
against
ε
S
(
O
c
)
yields a parametric curve, dubbed the “receiver operator characteristic”
114
(ROC) curve, as shown in Fig.
1. Note that the cutoffs are trivial to adjust while all the computational effort goes
115
into forming the networks, so one can vary
O
c
essentially for free to tune the performance of the network to suit
116
one’s purposes.
117
In other words, for a given strong classifier, i.e., solution/state, we can evaluate its output as a floating point
118
number on each of the values in our data set, and for any value of a cut on
[
−
1
,
1
]
this results in a single classification
119
of the test data,
⃗
C
. One can then evaluate the true positive and true negative rates by computing
⃗
C
v
·
⃗
y
k
where
120
k
∈
[
S
,
B
]
(signal, background),
y
i
k
=
1
if datum
i
is in ensemble
k
and is
0
otherwise.
121
When we take
f
>
0
and accept excited states with energy
E
<
(
1
−
f
)
E
GS
as “successes”, we have a set of
122
networks (labeled by
f
) for each training set. We simply take the supremum over the
f
-labeled set of values of
r
B
123
at each value of
ε
S
, to form the ROC curve for the classifier formed by pasting together different classifiers over
124
various ranges of
ε
S
.
125
To estimate the error due to limited test sample statistics, we reweight each element of the test set with weights
126
⃗
w
drawn from a Poisson distribution with mean
1
, effectively computing
∑
i
w
i
p
i
y
i
k
. The weights on the elements of
127
4/25
WWW.NATURE.COM/NATURE | 5
SUPPLEMENTARY INFORMATION
RESEARCH
the test set are determined for all elements at once and we evaluate all strong classifiers using the same weights.
128
For a single weight vector, we evaluate many values of the cut, and use linear interpolation to evaluate it in steps
129
of 0.01 in the region
[
0
,
1
]
. This gives us the true negative rate as a function of the true positive rate for a single
130
weighting corresponding to a single estimated ROC curve. When constructing a composite classifier from multiple
131
states, we are identifying regions of signal efficiency in which one should use one of the states rather than the others,
132
namely, we take the maximum background rejection rate over the states for each value of signal efficiency.
133
Repeating for many reweightings we get many ROC curves all of which are consistent with our data, and thus
134
the standard deviation across weights on a single training set at each value of signal efficiency serves as an estimate
135
of the statistical uncertainty in our ROC curves.
136
To estimate variation due to the choice of the training set towards reproductions of the procedure and results, for
137
a given training size we generate multiple disjoint training sets and use the standard deviation in mean performance
138
across training sets as our estimate of the error on the model resulting from the particular choice of training set.
139
When we compute the difference between two ROCs or AUROCs, we hold the training set and weight vector fixed,
140
take the difference, and then perform statistics over the weights and training sets in the same manner as above.
141
Errors in the AUROC were estimated similarly, taking the AUROC for each Poisson weight vector and training set
142
(fold) instead of
r
B
(
ε
S
)
. This is the procedure leading to Fig. 3 in the main text. An example of the ROC curves is
143
given in Fig.
1. At the scale of that plot, it is virtually impossible to tell the detailed differences between SA and
144
DW or the various values of
f
, so we use plots of differences of AUCs to extract more detailed information about
145
the ROC curves. This leads to Fig. 4 in the main text. Additional difference plots are given in Sec.
10 below.
146
6 Quantum annealing and D-Wave
147
In QA, we interpret the Ising spins
s
i
in the Ising Hamiltonian (7) as Pauli operators
σ
z
i
on the
i
th
qubit in a
148
system of
N
qubits. QA is inspired by the adiabatic theorem,
43
namely if the Hamiltonian is interpolated from
149
an initial Hamiltonian
H
(
t
=
0
)
to a final Hamiltonian
H
(
t
=
t
a
)
sufficiently slowly compared to the minimum
150
ground-to-first-excited state gap of
H
(
t
)
, the system will be in the ground state of
H
(
t
=
t
a
)
with high probability,
151
provided it was initialized in the ground state of
H
(
t
=
0
)
. Thus, one can evolve from a simple, easy to initialize
152
Hamiltonian at
t
=
0
to a complicated Hamiltonian with an unknown ground state at
t
=
t
a
, where
t
a
is known as
153
the annealing time. In QA, the initial Hamiltonian is a transverse field
H
X
=
∑
i
σ
x
i
, and the final Hamiltonian is
154
the Ising Hamiltonian (7), with the time-dependent Hamiltonian taking the form
H
(
t
)=
A
(
t
)
H
X
+
B
(
t
)
H
Ising
, where
155
A
(
t
)
is monotonically decreasing to
0
and
B
(
t
)
is monotonically increasing from
0
; these functions are known as the
156
annealing schedule. QA can be seen as both a generalization and a restriction of adiabatic quantum computation
44
157
(for a review see
15
): as a restriction, QA typically requires the initial Hamiltonian to be a sum of
σ
x
s and the
158
final Hamiltonian be diagonal in the computational basis (i.e., a sum of
σ
z
terms), while, as a generalization, it
159
undergoes open-system dynamics and need not remain in the ground state for the entire computation.
160
Current and near-generation quantum annealers are naturally run in a batch mode in which one draws many
161
samples from a single Hamiltonian. Repeated draws for QA are fast. The DW averages approximately
5000
samples
162
per second under optimal conditions. We take advantage of this by keeping all the trial strong classifiers returned
163
and not restricting to the one with minimum energy.
1
The DW has
1098
superconducting Josephson junction flux
164
qubits arranged into a grid, with couplers between the qubits in the form shown in Fig.
3, known as the Chimera
165
graph. The annealing schedule used in the DW processor is given in Fig.
4. The Chimera graph is not fully
166
connected, a recognized limitation as the Ising Hamiltonian (7) is fully connected, in general. To address this,
167
we perform a
minor embedding operation
.
45, 46
Minor embedding is the process whereby we map a single logical
168
qubit in
H
Ising
into a physical ferromagnetic (
J
ij
=
−
1
) chain of qubits on DW. For each instance we use a heuristic
169
embedding found via the D-Wave API, that is as regular and space-efficient as possible for our problem sizes.
170
Given a minor embedding map of logical qubits into a chain of physical qubits, we divide the local fields
h
i
171
equally among all the qubits making up the chain for logical qubit
i
, and divide
J
ij
equally among all the physical
172
couplings between the chains making up logical qubits
i
and
j
. After this procedure, there remains a final degree of
173
freedom: the chain strength
J
F
. If the strength of the couplers in the ferromagnetic chains making up logical qubits
174
is defined to be
1
, then the maximum magnitude of any other coupler is
max
(
max
i
(
{
|
h
i
|
}
)
,
max
i
,
j
(
{
�
�
J
ij
�
�
}
)
)
=
1
J
F
.
175
There is an optimal value of
J
F
, generally. This is due to a competition between the chain needing to behave as
176
a single large qubit and the problem Hamiltonian needing to drive the dynamics.
47
If
J
F
is very large, the chains
177
will “freeze out” long before the logical problem, i.e., the chains will be far stronger than the problem early on, and
178
1
The energy is effectively a function of error on the training set of the weak classifiers, hence is distinct from the measures used to
directly judge classifier performance, such as the area under the ROC curve.
5/25
SUPPLEMENTARY INFORMATION
6
| WWW.NATURE.COM/NATURE
RESEARCH
the transverse field terms will be unable to induce the large, multi-qubit flipping events necessary to explore the
179
logical problem space. Similarly, if
J
F
is very weak, the chains will be broken (i.e., develop a kink or domain wall)
180
by tension induced by the problem, or by thermal excitations, and so the system will generally not find very good
181
solutions. Ideally, one wants the chains and the logical problem to freeze at the same time, so that at the critical
182
moment in the evolution both constraints act simultaneously to determine the dynamics. For the results shown
183
here, we used
J
F
=
6
with an annealing time
t
a
=
5
μ
s. To deal with broken chains, we use majority vote on the
184
chain with a coin-toss tie-breaker for even length chains. Detailed analysis of the performance of this strategy in
185
the context of error correction can be found in the literature.
48, 49
186
Figure
5 shows the average minimum energy returned by the DW, rescaled by the training size (to remove a
187
linear scaling), as a function of the chain strength and training size. We see that the smallest training size (
N
=
100
)
188
has a smaller average minimum energy than the rest of the training sizes, and that there is only a very slight
189
downward tendency as the chain strength
J
F
increases for the larger training sizes.
190
Figure
6 plots the fractional deviation of the minimum energy returned by the DW relative to the true ground
191
state energy, averaged over the training sets. While the DW’s minimum energy returned approaches the true ground
192
state, it seems to converge to
≈
5%
(i.e.
f
≈
0
.
05
) above the ground state as we increase the chain strength, for
193
all training sizes
≥
1000
. In this case, we were not able to find the optimal chain strength in a reasonable range of
194
chain strengths, and instead simply took the best we found,
J
F
=
6
. As discussed in Sec.
8, the DW processor suffers
195
from noise sources on the couplers and thermal fluctuations, and it seems that this poses significant challenges for
196
the performance of the quantum annealer. It is possible that even larger chain strengths may resolve the issue, but
197
given the convergence visible in Fig.
6, it seems likely that
J
F
=
6
is already near the optimum.
198
7 Simulated annealing
199
In simulated annealing,
10
we initialize the vector
⃗
s
in a random state. At each time step, we create a trial vector
200
⃗
s
′
by flipping one of the spins in
⃗
s
, selected at random. We accept the trial vector using the Metropolis update
201
rule:
50
the new state is accepted with probability
1
if
H
(
⃗
s
′
)
<
H
(
⃗
s
)
(i.e., lower energy states are accepted deter-
202
ministically), whereas if
H
(
⃗
s
′
)
>
H
(
⃗
s
)
, we accept the trial vector with probability
exp
{
−
β
(
H
(
⃗
s
′
)
−
H
(
⃗
s
))
}
for some
203
inverse temperature
β
. After we have attempted
N
spin flips, which amounts to one sweep, we then increase the
204
inverse temperature
β
according to some schedule. At first,
β
≪
1
and the system quickly drifts through the space
205
of possible states. As
β
grows the system settles into lower lying valleys in the energy landscape, and ultimately
206
ceases to evolve entirely in the limit of infinite
β
(zero temperature). Our simulations used
β
init
=
0
.
1
and
β
final
=
5
,
207
and used a linear annealing schedule (i.e., if we perform
S
sweeps, we increase
β
after each sweep by
β
final
−
β
init
S
). These
208
parameters have generally performed well in other studies.
33
All SA data in the main text and presented here is at
209
1000 sweeps, however we also tested SA at 100 sweeps, and found a negligible difference in overall performance, as
210
seen in Fig.
7, where the integrated difference of the ROC curves is found to be statistically indistinguishable from
211
0
.
212
8 Effect of noise on processor
213
Internal control error (ICE) on the current generation of D-Wave processors is effectively modeled as a Gaussian
214
centered on the problem-specified value of each coupler and local field, with standard deviation
0
.
025
, i.e., a coupler
215
J
ij
is realized as a value drawn from the distribution
N
(
J
ij
,
0
.
025
)
when one programs a Hamiltonian. Figure
8
216
contains a histogram of the ideal values of the embedded couplers corresponding to connections between logical
217
qubits across all
20
problem instances of
36
variables at 20000 training events. One can see that the ideal distribution
218
has some structure, with two peaks. However, if one resamples values from the Gaussian distribution induced by
219
ICE, one finds that many of the features are washed out completely. This suggests that the explanation for the
220
flattening out of the performance of QA as a function of training size (recall Fig. 3 in the main text) is due to this
221
noise issue. Thus we investigate this next.
222
Figures
9-11 tell the story of the scaling of the couplers with training size. Figure
9 shows linear scaling
223
of the maximum Hamiltonian coefficient with training size. We observe wider variation at the smallest training
224
sizes, but overall the precision scales linearly with training size. This is confirmed in Figure
10, which shows the
225
maximum coefficient normalized by the training size. Since this value is constant for sufficiently large training
226
sizes, the maximum value scales linearly with size. At first glance, this indeed suggests an explanation for why the
227
performance of QA using the DW levels off as a function of training size: the coupling values pass
20
(half the scale
228
of the errors which is
≈
1
/
0
.
025
=
40
). However, absolute numbers are not necessarily informative, and Fig.
11
229
dispels this explanation.
230
6/25
WWW.NATURE.COM/NATURE | 7
SUPPLEMENTARY INFORMATION
RESEARCH
Figure
11 shows the ratio of the median coefficient to the maximum coefficient, thereby showing the scale of
231
typical
Hamiltonian coefficients on the DW prior to rescaling for chain strength (which for the most of the data
232
here would reduce the magnitude by a further factor of
6
).
233
Since all the different types of coefficient ratios are constant with system size, we have effectively no scaling with
234
training size of the precision of the couplers. This means that the scaling of precision with training size cannot
235
explain the saturation of performance with increasing training size.
236
However, the magnitudes here are quite small, and so once one accounts for rescaling the energies, typical
237
couplers are expected to be subject to a significant amount of noise, even causing them to change sign. This effect
238
likely explains, at least in part, the difficulties the DW has in finding the true ground state, as discussed above and
239
seen in Fig.
6, where even at the largest chain strength we still find that the DW’s typical minimum energy is
∼
5%
240
above the ground state energy.
241
9 Sensitivity to variation of the parameters of weak classifier construction
242
When constructing the weak classifiers, we choose to define
v
cut
as the
70
th percentile of the signal distribution.
243
This choice is arbitrary. To test the effect of this value on the classifier performance we use identical training sets
244
and values of both
60%
and
80%
and compare them to our primary estimate of
70%
. The results for both the
245
minimum energy returned (
f
=
0
) and
f
=
0
.
05
for each are shown in Fig.
12.
246
Note that every training set has the same ground state configuration at
70%
and
80%
. The ROCs and AUROC
247
are then invariant across a wide range of
v
cut
values.
248
Figure
2 reproduces Fig. 3 from the main text, but also shows the AUROC for SA’s optimal classifier (by energy)
249
for various values of the regularization parameter
λ
. We find no significant variation, with the major features of
250
SA being stable, namely the advantage at small training size and the saturation at around an AUROC of
≈
0
.
64
.
251
10 Difference between ROC curves plots
252
We show differences between ROC curves for various algorithms in Figs.
13-18. These form the basis for Fig. 4 in
253
the main text, which gives the integral of the difference over signal efficiency. Figures
13 and 14 show the difference
254
in background rejection
r
DW
B
−
r
SA
B
as a function of the signal efficiency for
f
=
0
and
f
=
0
.
05
, respectively. For
255
f
=
0
DW and SA are indistinguishable to within experimental error. For
f
=
0
.
05
SA slightly outperforms DW in
256
the range of low signal efficiencies for training sizes
≥
5000
. The primary conclusion to draw from these plots is
257
that SA differs from DW by roughly one standard deviation or less across the whole range, even though DW for
258
training sizes larger than
100
struggles to find states within less than
5%
of the ground state energy. This suggests
259
a robustness of QAML, which (if it generalizes to other problems) significantly improves the potential to exploit
260
physical quantum annealers to solve machine learning problems and achieve close-to-optimal classifier performance,
261
even in the presence of significant processor noise.
262
Figures
15 and 16 show the ROC difference between DW and DNN and DW and XGB at
f
=
0
, respectively.
263
The two cases have broadly similar shapes. One clearly sees that QAML on DW outperforms DNN and XGB at
264
the smallest training size in a statistically significant manner, but that the trend reverses for sizes
≥
5000
. Note,
265
that at the scale of these diagrams, the gap between
f
=
0
and
f
=
0
.
05
is negligible.
266
Figures
17 (SA) and
18 (DW) show the difference between
f
=
0
and
f
=
0
.
05
. SA and DW exhibit broadly
267
similar behavior, with an improvement with excited states of
≈
0
.
4%
in background rejection for SA and of
≈
0
.
2%
268
for DW. The improvement increases with training size and is slightly larger for SA than DW (though this difference
269
is likely simply noise, as it is less than half the standard deviation of each distribution). It should be noted that
270
since QAML’s comparative advantage against other techniques appears to be in the realm of small training sizes.
271
However, this is the same range where including excited states has no benefit.
272
7/25
SUPPLEMENTARY INFORMATION
8
| WWW.NATURE.COM/NATURE
RESEARCH
Figure 2.
A reproduction of Fig. 3 from the main text, now including the optimal strong classifier found by SA at
f
=
0
for various values of the regularization parameter
λ
=
0
.,
0
.
1
,
0
.
2
. We find that this parameter has negligible
impact on the shape of the AUROC curve, and that performance for SA always saturates at
≈
0
.
64
, with an
advantage for QAML (DW) and SA over XGB and DNNs for small training sizes.
8/25