SciPost Phys. 18, 040 (2025)
Learning the simplicity of scattering amplitudes
Clifford Cheung¹, Aurélien Dersy²,³ and Matthew D. Schwartz²,³

¹ Walter Burke Institute for Theoretical Physics, California Institute of Technology, 91125 Pasadena, CA, USA
² Department of Physics, Harvard University, 02138 Cambridge, MA, USA
³ NSF Institute for Artificial Intelligence and Fundamental Interactions

clifford.cheung@caltech.edu, † adersy@g.harvard.edu, ‡ schwartz@g.harvard.edu
Abstract
The simplification and reorganization of complex expressions lies at the core of scientific
progress, particularly in theoretical high-energy physics. This work explores the appli-
cation of machine learning to a particular facet of this challenge: the task of simplifying
scattering amplitudes expressed in terms of spinor-helicity variables. We demonstrate
that an encoder-decoder transformer architecture achieves impressive simplification ca-
pabilities for expressions composed of handfuls of terms. Lengthier expressions are handled
by an additional embedding network, trained using contrastive learning, which isolates
subexpressions that are more likely to simplify. The resulting framework
is capable of reducing expressions with hundreds of terms—a regular occurrence in
quantum field theory calculations—to vastly simpler equivalent expressions. Starting
from lengthy input expressions, our networks can generate the Parke-Taylor formula
for five-point gluon scattering, as well as new compact expressions for five-point am-
plitudes involving scalars and gravitons. An interactive demonstration can be found at
https://spinorhelicity.streamlit.app.
Copyright C. Cheung et al. This work is licensed under the Creative Commons Attribution 4.0 International License. Published by the SciPost Foundation.

Received 2024-09-06, Accepted 2024-12-19, Published 2025-02-03.
doi:10.21468/SciPostPhys.18.2.040
Contents

1 Introduction
2 Notation and training data
   2.1 Spinor-helicity formalism
   2.2 Target data set
   2.3 Input data set
   2.4 Analytic simplification
3 One-shot learning
   3.1 Network architecture
   3.2 Results
   3.3 Embedding analysis
4 Sequential simplification
   4.1 Contrastive learning
   4.2 Grouping terms
   4.3 Simplifying long expressions
   4.4 Physical amplitudes
5 Conclusion
A Parsing amplitudes
B Training data composition
C Nucleus sampling calibration
D Training on intricate amplitudes
E Integer embeddings
F Cosine similarity for dissimilar terms
G Physical amplitudes
References
1 Introduction
The modern scattering amplitude program involves both the computation of amplitudes as well
as the study of their physical properties. Are there better, more efficient, or more transparent
ways to compute these objects? The dual efforts to devise powerful techniques for practical
calculation and to then use those results to glean new theoretical structures have led to sus-
tained progress over the last few decades. An archetype of this approach appears in the context
of QCD, whose Feynman diagrams yield famously cumbersome and lengthy expressions. For
example, even for the relatively simple process of tree-level, five-point gluon scattering, Feyn-
man diagrams produce hundreds of terms. However, in the much-celebrated work of Parke and
Taylor [1], it was realized that this apparent complexity is illusory. These hundreds of terms at
five-point—and more generally, for any maximally helicity violating configuration—simplify
to a shockingly compact monomial formula,
$$A\big(1^+ 2^+ 3^+ \cdots i^- \cdots j^- \cdots n^+\big) = \frac{\langle i j\rangle^4}{\langle 12\rangle\langle 23\rangle \cdots \langle n1\rangle}\,, \tag{1}$$
shown here in its color-ordered form. The simplicity of the Parke-Taylor formula strongly
suggests an alternative theoretical framework that directly generates expressions like Eq. (1)
without the unnecessarily complicated intermediate steps of Feynman diagrams.
This essential fact—that on-shell scattering amplitudes are simple and can illuminate hid-
den structures in theories—has led to new physical insights. Indeed, shortly after [1] it was
realized that Eq. (1) also describes the correlators of a two-dimensional conformal field
theory [2], which is a pillar of the modern-day celestial holography program [3]. Much later,
Witten deduced from Eq. (1) that Yang-Mills theory is equivalent to a certain topological string
theory in twistor space [4], laying the groundwork for a vigorous research program that eventually
led to the twistor Grassmannian formulation [5,6] and the amplituhedron [7]. Examples like
this abound in the amplitudes program—structures like the double copy [8,9] and the scattering
equations [10–12] were all derived from staring directly at amplitudes, rather than from the
top-down principles of quantum field theory.
Progress here has hinged on the existence of simple expressions for on-shell scattering
amplitudes. We are thus motivated to ask whether there is a more systematic way to recast
a given expression from its raw form into its most compact representation. For example, a
complicated spinor-helicity expression can often be simplified through repeated application of
Schouten identities,
$$|1\rangle\langle 23\rangle + |2\rangle\langle 31\rangle + |3\rangle\langle 12\rangle = 0\,, \tag{2}$$
together with total momentum conservation of $n$-point scattering,
$$|1\rangle[1| + |2\rangle[2| + \cdots + |n\rangle[n| = 0\,. \tag{3}$$
However, the search space for these operations is expansive and difficult to navigate even
with the help of existing computer packages [13,14], and, to our knowledge, there is no
canonical algorithm for determining which operations will simplify a complicated expression
analytically. This is where recent advances in machine learning (ML) offer a natural advantage.
The role of ML in high-energy physics has grown dramatically in recent years [15]. In the
field of scattering amplitudes, much of the work to date has focused on reproducing the numerical
output of these amplitudes using neural networks [16–19]. However, recent advances
in ML have led to the development of powerful architectures, capable of handling increasingly
complex datasets, including those that are purely symbolic. In particular, the transformer
architecture [20] has allowed for practical applications across a wide range of topics, including
jet tagging [21], density estimation for simulation [22,23], and anomaly detection [24].
The appeal of transformers comes from their ability to create embeddings for long sequences
which take into account all of the objects composing that sequence. In natural language pro-
cessing, where transformers first originated, this approach encodes a sentence by mixing the
embeddings of all of the words in the sentence. These powerful representations have been a
key driver for progress in automatic summarization, translation tasks, and natural language
generation [25–27]. Since mathematical expressions can also be understood as a form of
language, the transformer architecture has been successfully repurposed to solve certain
interesting mathematical problems. For those problems, the validity of a model's output can
often be confirmed through explicit numerical evaluation of the symbolic result, allowing one
to easily discard any model hallucinations. From symbolic regression [28] to function
integration [29], theorem proving [30], and the amplitudes bootstrap [31], transformers have proven
to be effective in answering questions that are intrinsically analytical rather than numerical. In
particular, transformers have been adapted to simplify short polylogarithmic expressions [32]
and it is natural to expect that the same methodology can be extended to our present task,
which is the simplification of spinor-helicity expressions.
A common bottleneck for transformer-based approaches is the length of the mathemati-
cal expression that can be fed through the network. Typical amplitude expressions can easily
have thousands of distinct terms and processing the whole expression at once quickly be-
comes intractable. The self-attention operation in a transformer scales quadratically in time
and memory with the sequence length and it is therefore most efficiently applied to shorter ex-
pressions. For instance, the Longformer and BigBird architectures [33,34] implement reduced
self-attention patterns, using a sliding window view on the input sequence and resorting to
global attention only for a few select tokens. In the context of simplifying mathematical
expressions, it is quite clear that humans proceed similarly: we start by identifying a handful
of terms that are likely to combine and then we attempt simplification on this subset. In this
Figure 1: Spinor-helicity expressions are simplified in several steps. To start, individual
terms are projected into an embedding space (grey sphere). Using contrastive
learning, we train a "projection" transformer encoder to learn a mapping that groups
similar terms close to one another in the embedding space. After identifying similar
terms we use a "simplify" transformer encoder-decoder to predict the corresponding
simple form. After simplifying all distinct groups, this procedure is repeated with the
resulting expression, iterating until no further simplification is possible.
paper, we mimic this procedure by leveraging contrastive learning [35–39]. As illustrated in
Fig. 1 we train a network to learn a representation for spinor-helicity expressions in which
terms that are likely to simplify are close together in the learned embedding space. Grouping
nearby terms, we then form a subset of the original expression which is input into yet another
transformer network trained to simplify more moderately-sized expressions. By repeating the
steps of grouping and simplification we are then able to reduce spinor-helicity expressions with
enormous numbers of distinct terms.
Our paper is organized as follows. We begin in Section 2 with a brief review of the spinor-
helicity formalism and its role in scattering amplitude calculations. We describe the physical
constraints that amplitudes must satisfy, as well as the various mathematical identities that
can relate equivalent expressions. In Section 3 we introduce a transformer encoder-decoder
architecture adapted to the simplification of moderately-sized spinor-helicity expressions. We
describe our procedure for generating training data and discuss the performance of our net-
works. Afterwards, in Section 4 we present the concept of contrastive learning and describe
how it arrives at a representative embedding space. We present an algorithm for grouping
subsets of terms that are likely to simplify in lengthier amplitude expressions. We then show-
case the performance of our full simplification pipeline on actual physical amplitudes, in many
cases composed of hundreds of terms.¹ Finally, we conclude with a brief perspective on the
prospects for ML in this area.
2 Notation and training data
In this section, we review the mechanics of the spinor-helicity formalism and then describe the
generation of training data for our models. Our notation follows [40], though a more detailed
exposition can also be found in [41–43] and references within.
¹ Our implementation, datasets and trained models are available at https://github.com/aureliendersy/spinorhelicity. This repository also contains a faster local download of our online interactive demonstration, hosted at https://spinorhelicity.streamlit.app. This application reduces amplitudes following the procedure described in Fig. 1 and has the ability to simplify the amplitude expressions quoted in this paper.
2.1 Spinor-helicity formalism
The basic building blocks of spinor-helicity expressions are helicity spinors, which are
two-component objects whose elements are complex numbers. Left-handed spinors transform in
the $(\tfrac{1}{2}, 0)$ representation of the Lorentz group and are written as $\lambda_\alpha$. Right-handed spinors
transform in the $(0, \tfrac{1}{2})$ representation of the Lorentz group and are written as $\tilde{\lambda}_{\dot{\alpha}}$. A general
four-momentum transforms in the $\left(\tfrac{1}{2}, \tfrac{1}{2}\right)$ representation of the Lorentz group and is written
as the two-by-two matrices $p_{\alpha\dot{\alpha}}$ or $p^{\dot{\alpha}\alpha}$. When the four-momentum corresponds to a massless
particle, it satisfies the on-shell condition, $p \cdot p = \det(p_{\alpha\dot{\alpha}}) = 0$, and can be written as the outer
product of helicity spinors, so $p_{\alpha\dot{\alpha}} = \lambda_\alpha \tilde{\lambda}_{\dot{\alpha}}$. As usual in the study of scattering amplitudes, we
generalize to complex four-momenta, so $\lambda_\alpha$ and $\tilde{\lambda}_{\dot{\alpha}}$ are independent objects.
Helicity spinors of the same chirality can be dotted into each other to form the Lorentz
invariant, antisymmetric products,
$$\begin{aligned}
\text{Angle brackets}:&\quad \langle \lambda\chi\rangle = -\langle \chi\lambda\rangle = \lambda_\alpha\, \chi_\beta\, \varepsilon^{\alpha\beta}\,,\\
\text{Square brackets}:&\quad [\lambda\chi] = -[\chi\lambda] = \tilde{\lambda}_{\dot{\alpha}}\, \tilde{\chi}_{\dot{\beta}}\, \varepsilon^{\dot{\alpha}\dot{\beta}}\,.
\end{aligned} \tag{4}$$
Here all indices are raised and lowered with the antisymmetric two-index tensors, $\varepsilon^{\alpha\beta}$ or $\varepsilon^{\dot{\alpha}\dot{\beta}}$.
The Lorentz invariant product of a pair of four-momenta is
$$p \cdot q = \tfrac{1}{2}\, \langle \lambda\chi\rangle [\chi\lambda]\,, \tag{5}$$
where we have also defined $q_{\alpha\dot{\alpha}} = \chi_\alpha \tilde{\chi}_{\dot{\alpha}}$.
For physical processes, we are typically interested in the $n$-point amplitude, which describes
a scattering process involving $n$ external particles, here taken to be all incoming for convenience.
This object depends on $n$ external massless momenta, which we write as $p^{\alpha\dot{\alpha}}_i = \lambda^\alpha_i \tilde{\lambda}^{\dot{\alpha}}_i$.
We use the standard shorthand in which angle and square brackets are labelled by their corresponding
external states, so
$$p_i \cdot p_j = \tfrac{1}{2}\, \langle i j\rangle [j i]\,. \tag{6}$$
Note that the antisymmetry of the angle and square brackets implies that $\langle ii\rangle = [ii] = 0$.
The $n$-point scattering amplitude is strongly constrained by the little group, which by
definition acts trivially on four-momenta but nontrivially on helicity spinors,
$$\lambda^\alpha_i \to z_i\, \lambda^\alpha_i\,, \qquad \text{and} \qquad \tilde{\lambda}^{\dot{\alpha}}_i \to z^{-1}_i\, \tilde{\lambda}^{\dot{\alpha}}_i\,, \tag{7}$$
where $z_i$ is an arbitrary complex number. The little group defines the spin representation of
each external state. Consequently, the $n$-point scattering amplitude transforms under the little
group as
$$M\!\left(1^{h_1} 2^{h_2} \cdots n^{h_n}\right) \to \left(\prod_i z^{-2 h_i}_i\right) M\!\left(1^{h_1} 2^{h_2} \cdots n^{h_n}\right), \tag{8}$$
where $h_i$ is the helicity of leg $i$. Hence, the little group strongly constrains the number of
powers of each helicity spinor that can appear in every term in the amplitude. Note also that
the mass dimension of each helicity spinor is one-half, so each angle or square bracket is mass
dimension one.
A general $n$-point amplitude is highly constrained by the little group and dimensional analysis.
As we have seen, Eq. (8) restricts the allowed powers of left- and right-handed spinors.
Assuming that there is a single coupling constant of fixed mass dimension, then the mass
dimension of each term in the amplitude must also be the same. These constraints, together
with information from singular kinematic limits, can sometimes be exploited to extract analytic
results from numerical calculations [44–47]. Within our ML framework, we will assume
that all amplitudes have fixed mass dimension and little group scaling.
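To make these constraints concrete, the short sketch below computes the mass dimension and little group weights of a single monomial of angle and square brackets. The tuple-based representation of a term and the helper names are our own illustrative choices, not the paper's released code.

```python
# Minimal sketch: mass dimension and little group weights of a bracket monomial.
# A term is a list of (kind, i, j, power) tuples, where "ab" = angle bracket <ij>
# and "sb" = square bracket [ij]; denominator brackets carry negative powers.
# This representation is illustrative, not the paper's implementation.

def mass_dimension(term):
    # Each angle or square bracket has mass dimension one.
    return sum(p for (_, _, _, p) in term)

def little_group_weights(term, n):
    # Under lambda_i -> z_i lambda_i, <ij> picks up z_i z_j while [ij] picks up
    # z_i^-1 z_j^-1, so angle brackets contribute +power and square brackets
    # contribute -power to the weights of legs i and j.
    w = {k: 0 for k in range(1, n + 1)}
    for kind, i, j, p in term:
        sign = +1 if kind == "ab" else -1
        w[i] += sign * p
        w[j] += sign * p
    return w

if __name__ == "__main__":
    # Example: <12>^4 / (<12><23><34><45><15>), the five-point MHV-like monomial.
    term = [("ab", 1, 2, 4), ("ab", 1, 2, -1), ("ab", 2, 3, -1),
            ("ab", 3, 4, -1), ("ab", 4, 5, -1), ("ab", 1, 5, -1)]
    print(mass_dimension(term))           # -1
    print(little_group_weights(term, 5))  # {1: 2, 2: 2, 3: -2, 4: -2, 5: -2}
```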
2.2 Target data set
The goal of this work is to train a computer program that takes as input a complicated spinor-helicity
expression $M$ and then outputs a more minimal form $\overline{M}$. The ML approach to this
problem requires a training set composed of multiple instances of such pairs, $\{M, \overline{M}\}$. To
build such a set, we randomly generate a simple target expression $\overline{M}$ and then scramble
it using various spinor-helicity identities to obtain a more complicated but mathematically
equivalent form $M$. We then iterate this procedure many times to generate a list of many
such pairs $\{M, \overline{M}\}$. Following the terminology of [29], this data generation procedure is
called backward generation, where one starts by generating the target $\overline{M}$, rather than the
input $M$. In the alternative approach, forward generation, one would instead generate $M$
and simplify it with external software to generate the target $\overline{M}$. Since we lack a clear
algorithmic way to maximally simplify an amplitude, we cannot use this approach, and our
datasets will be constructed only from backward generation.
As noted earlier, for $\overline{M}$ to describe a physical scattering amplitude, its terms must all
exhibit the same little group scaling and mass dimension. However, to craft a general
simplification algorithm, our network will need to be able to simplify subexpressions whose little
group scaling and mass dimension differ from the final target. For this reason, the various
pairs $\{M, \overline{M}\}$ in the same training set will in general exhibit different mass dimensions or
external helicity choices.
An efficient mathematical representation of spinor-helicity expressions should be free of
notational degeneracies. To eliminate the intrinsic redundancy of antisymmetry in the angle
and square brackets, we choose a convention in which all brackets are written with their first
entry smaller than the second. Concretely, we send $\langle ji\rangle \to -\langle ij\rangle$ and $[ji] \to -[ij]$ for $i < j$
whenever possible. Furthermore, we rationalize all of our amplitudes and write the numerator
in a fully expanded form, yielding
$$\overline{M} = \frac{1}{D} \sum_{\ell=1}^{N_{\rm terms}} N_\ell\,, \tag{9}$$
where each $N_\ell$ is a monomial product of angle and square brackets and $D$ is a common
denominator. Since the target amplitudes should be compact, the number of distinct terms $N_{\rm terms}$
should not be too large. For concreteness, we restrict $0 \leq N_{\rm terms} \leq 3$, with $\overline{M} = 0$ an allowed
possibility, so this is our operational definition of "simple".
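As a small illustration of this convention, a helper along the following lines (our own sketch, not the paper's implementation) canonicalizes a single bracket by reordering its labels and tracking the sign flip:

```python
def canonicalize_bracket(kind, i, j):
    """Return (sign, kind, i, j) with the smaller label first.

    Both <ji> -> -<ij> and [ji] -> -[ij] follow from the antisymmetry of the
    brackets; <ii> = [ii] = 0 is flagged with sign 0.
    """
    if i == j:
        return 0, kind, i, j
    if i < j:
        return +1, kind, i, j
    return -1, kind, j, i

# Example: <31> is rewritten as -<13>.
print(canonicalize_bracket("ab", 3, 1))  # (-1, 'ab', 1, 3)
```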
The precise algorithm for creating a target amplitude $\overline{M}$ for the training set is as follows.
To begin, we randomly generate its first term, i.e., $N_1/D$ corresponding to $\ell = 1$, in several
steps:
1. We fix the number $n$ of external momenta.

2. We fix the number of numerator terms $n_N$ and denominator terms $n_D$, which are chosen in
the ranges $n_N \in [0, 2n]$ and $n_D \in [1, 2n]$.
3. For each numerator or denominator term $r$ we

   (a) Randomly choose $[ij]$ or $\langle ij\rangle$, where $i < j$ and $i, j \in [1, n]$.

   (b) Raise this bracket to the power $p = \max(1, \tilde{p})$, where $\tilde{p}$ is drawn from the Poisson
   distribution ${\rm Poisson}(\lambda)$, where $\lambda = 0.75$ in our analysis. The resulting expression
   for the term is then $r = \langle ij\rangle^p$ or $r = [ij]^p$.
4. We combine the brackets and randomize the overall sign to obtain the first term,
$$\frac{N_1}{D} = \pm\, \frac{\prod_{a=1}^{n_N} r_a}{\prod_{b=1}^{n_D} r_b}\,. \tag{10}$$
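The four steps above can be condensed into a short sketch. The sampling ranges follow the text, but the data layout and function names are our own illustrative choices rather than the released implementation:

```python
import math
import random

def poisson(lam):
    # Poisson sampler (Knuth's method), to keep this sketch dependency-free.
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

def sample_bracket(n):
    # Step 3(a): pick <ij> or [ij] with i < j and i, j in [1, n].
    i, j = sorted(random.sample(range(1, n + 1), 2))
    kind = random.choice(["ab", "sb"])
    # Step 3(b): raise the bracket to p = max(1, ptilde), ptilde ~ Poisson(0.75).
    return (kind, i, j, max(1, poisson(0.75)))

def sample_first_term(n):
    # Steps 1-2: draw the number of numerator and denominator brackets.
    n_num = random.randint(0, 2 * n)
    n_den = random.randint(1, 2 * n)
    # Step 4: combine the brackets and randomize the overall sign.
    sign = random.choice([+1, -1])
    numerator = [sample_bracket(n) for _ in range(n_num)]
    denominator = [sample_bracket(n) for _ in range(n_den)]
    return sign, numerator, denominator

print(sample_first_term(5))
```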
Next, we add additional terms to the target amplitude that have the same little group scaling
and mass dimension as the first term. These additional terms are of the form
$$N_{\ell>1} = \prod_{j=2}^{n} \prod_{i=1}^{j-1} \langle ij\rangle^{p_{ij}}\, [ij]^{\tilde{p}_{ij}}\,, \tag{11}$$
where $p_{ij}$ and $\tilde{p}_{ij}$ are the solutions to the system of $n+1$ equations,
$$m(N_1) = \sum_{j=2}^{n} \sum_{i=1}^{j-1} \left( p_{ij} + \tilde{p}_{ij} \right), \tag{12}$$
$$h_k(N_1) = \sum_{j=k+1}^{n} \left( p_{kj} - \tilde{p}_{kj} \right) + \sum_{i=1}^{k-1} \left( p_{ik} - \tilde{p}_{ik} \right), \tag{13}$$
where we have defined $N_1$ to have mass dimension $m(N_1)$ and little group scalings $h_k(N_1)$
for each external momentum $1 \leq k \leq n$. Here a solution is deemed acceptable only if the
coefficients $p_{ij}, \tilde{p}_{ij}$ are non-negative, so the common denominator is unchanged. We then
repeat this procedure until we have generated all terms $N_\ell$ in Eq. (9), thus yielding our final
form for $\overline{M}$.

Note that when adding numerator terms we do not multiply them by random rational numbers.
Rather, we instead consider expressions where each numerator term has $\pm 1$ as a relative
coefficient. This will be mostly sufficient for the physical amplitudes under consideration.
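As a rough illustration of how the additional terms can be found, the sketch below enumerates all non-negative integer exponents $p_{ij}, \tilde{p}_{ij}$ satisfying Eqs. (12)-(13) by listing weak compositions of the mass dimension; the weights dictionary follows the conventions of the earlier sketch. This brute-force search is our own simplified stand-in for the solver based on [48] used in the paper, and it is practical only for small mass dimensions.

```python
from itertools import combinations

def weak_compositions(total, parts):
    # All tuples of `parts` non-negative integers summing to `total` (stars and bars).
    for bars in combinations(range(total + parts - 1), parts - 1):
        prev, comp = -1, []
        for b in bars:
            comp.append(b - prev - 1)
            prev = b
        comp.append(total + parts - 2 - prev)
        yield tuple(comp)

def additional_terms(mass_dim, weights, n):
    """Yield exponent dictionaries {(i, j): (p_ij, ptilde_ij)} solving
    Eqs. (12)-(13) with non-negative exponents."""
    pairs = [(i, j) for i in range(1, n + 1) for j in range(i + 1, n + 1)]
    for exps in weak_compositions(mass_dim, 2 * len(pairs)):
        p = dict(zip(pairs, exps[: len(pairs)]))    # angle-bracket powers
        pt = dict(zip(pairs, exps[len(pairs):]))    # square-bracket powers
        ok = all(
            sum(p[q] - pt[q] for q in pairs if k in q) == weights[k]
            for k in range(1, n + 1)
        )
        if ok:
            yield {q: (p[q], pt[q]) for q in pairs}

# Example: n = 4 and N_1 = <12>[34], which has mass dimension 2 and little group
# weights {1: +1, 2: +1, 3: -1, 4: -1}; the only non-negative solution is <12>[34].
sols = list(additional_terms(2, {1: 1, 2: 1, 3: -1, 4: -1}, 4))
print(len(sols), {q: v for q, v in sols[0].items() if v != (0, 0)})
```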
2.3 Input data set
With a set of target amplitudes $\overline{M}$ in hand, we can now scramble them into more complicated
input amplitudes $M$ so that the inverse map can be learned by the network. This reshuffling
is achieved using various mathematical identities that relate equivalent spinor-helicity
expressions.
The first mathematical identity that we will use for scrambling is the Schouten identity,
which is a consequence of the two-dimensional nature of spinors:
$$\text{Schouten identity}: \quad
\begin{cases}
\langle ij\rangle\langle kl\rangle = \langle il\rangle\langle kj\rangle + \langle ik\rangle\langle jl\rangle\,,\\[2pt]
[ij][kl] = [il][kj] + [ik][jl]\,.
\end{cases} \tag{14}$$
These relations obviously generate more terms from fewer terms, and they are independent
of the number of external legs $n$.
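Since the Schouten identity holds for arbitrary two-component spinors, it is easy to spot-check numerically. The snippet below (an illustrative check of the angle-bracket relation in Eq. (14), not part of the paper's pipeline) draws random complex spinors and verifies that the two sides agree to machine precision:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random complex two-component helicity spinors for legs 1..4 (index 0 unused).
lam = rng.normal(size=(5, 2)) + 1j * rng.normal(size=(5, 2))

def ab(i, j):
    # <ij> = lambda_i^1 lambda_j^2 - lambda_i^2 lambda_j^1 (epsilon contraction).
    return lam[i, 0] * lam[j, 1] - lam[i, 1] * lam[j, 0]

# Schouten identity, Eq. (14): <12><34> = <14><32> + <13><24>.
lhs = ab(1, 2) * ab(3, 4)
rhs = ab(1, 4) * ab(3, 2) + ab(1, 3) * ab(2, 4)
print(abs(lhs - rhs))  # ~1e-16, i.e. zero to machine precision
```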
The second identity that we use arises as a consequence of the total momentum conservation:
$\sum_{i=1}^{n} p^{\alpha\dot{\alpha}}_i = \sum_{i=1}^{n} \lambda^\alpha_i \tilde{\lambda}^{\dot{\alpha}}_i = 0$. Sandwiching this identity between any of the $n$ helicity
spinors yields the $n^2$ equations
$$\text{momentum conservation}: \quad \sum_{j=1}^{n} \langle ij\rangle [jk] = 0\,, \qquad \forall\, i, k\,. \tag{15}$$
When $i \neq k$, this is a linear relation on $n-2$ non-vanishing terms, whereas for $i = k$ it constrains
$n-1$ non-vanishing terms. Here we can also take the square of total momentum conservation,
$\left(\sum_{i=1}^{n} p^{\alpha\dot{\alpha}}_i\right)^2 = 0$, to obtain
$$\text{momentum squared}: \quad \sum_{\substack{i<j \\ (i,j)\,\in\, S^1_n}} \langle ij\rangle [ji] = \sum_{\substack{k<l \\ (k,l)\,\in\, S^2_n}} \langle kl\rangle [lk]\,, \tag{16}$$
where $S^1_n$ and $S^2_n$ are two disjoint subsets forming a partition of the momenta set $\{p_1, \cdots, p_n\}$.
Of course, total momentum conservation and its square are not independent identities. In
fact, one can often simplify amplitudes in different ways using different identities. For
instance, to simplify the expression $\langle 14\rangle[14] - \langle 23\rangle[23]$ in four-point scattering, one can
use the squared version of momentum conservation, which reads $(p_1 + p_4)^2 = (p_2 + p_3)^2$
and implies $\langle 14\rangle[14] = \langle 23\rangle[23]$. Alternatively, one can use momentum conservation,
$\langle 12\rangle[23] = -\langle 14\rangle[43]$, multiply both sides by $[14]$, and then apply another momentum
conservation identity, $\langle 21\rangle[14] = -\langle 23\rangle[34]$. So while the various identities are not independent,
having some redundancy in operations can often expedite multiple intermediate simplification
steps.
To proceed, we allow for the scrambling identities of Eqs. (14)-(16) to be applied in two
slightly different ways. The first method involves selecting a random bracket in the numerator
of a spinor-helicity expression and then choosing whether to apply momentum conservation,
its squared counterpart, or the Schouten identity. There is no technical reason that requires us
to only scramble terms in the numerator, but we do so for the sake of simplicity. Note that the
amplitudes we consider will have denominators that are simple products of square and angle
brackets. Once a numerator bracket has been selected, we then randomly pick an identity
and craft the appropriate replacement rule. For instance, if $\langle 12\rangle$ is selected in the five-point
amplitude, we can generate a substitution following the Schouten identity as
$$\langle 12\rangle \to \frac{\langle 13\rangle\langle 25\rangle - \langle 15\rangle\langle 23\rangle}{\langle 35\rangle}\,. \tag{17}$$
To apply Eq. (17) we must randomly choose two additional external momenta, as required by
the form of Eq. (14). Similarly, when applying momentum conservation or its squared cousin,
one must randomly select reference helicity spinors, as in Eq. (15), or the subsets $S^1_n$ and $S^2_n$,
as in Eq. (16). After applying the substitution in Eq. (17) to all relevant bracket terms in the
numerator, we then say that the amplitude is one identity away from its simple form, i.e., it
has been scrambled once.
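A minimal sketch of how such a replacement rule can be assembled with SymPy, representing each angle bracket $\langle ij\rangle$ as an opaque symbol `abij`; the naming and structure are our own illustration rather than the released code:

```python
import random
import sympy as sp

def ab(i, j):
    # Angle bracket <ij> as an opaque SymPy symbol, with <ji> = -<ij> and <ii> = 0.
    if i == j:
        return sp.Integer(0)
    return sp.Symbol(f"ab{i}{j}") if i < j else -sp.Symbol(f"ab{j}{i}")

def schouten_rule(i, j, n, k=None, l=None):
    # Replacement <ij> -> (<il><kj> + <ik><jl>) / <kl>, following Eq. (14).
    # If k, l are not supplied, two other external momenta are drawn at random.
    if k is None or l is None:
        k, l = random.sample([m for m in range(1, n + 1) if m not in (i, j)], 2)
    return ab(i, j), (ab(i, l) * ab(k, j) + ab(i, k) * ab(j, l)) / ab(k, l)

# Reproduce Eq. (17): scramble <12> in a five-point expression using legs 3 and 5.
old, new = schouten_rule(1, 2, 5, k=3, l=5)
print(old, "->", new)  # ab12 -> (ab13*ab25 - ab15*ab23)/ab35
```

The scrambled input is then obtained by substituting such a rule into the numerator of the amplitude, for instance with SymPy's `subs`.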
To increase the diversity of the generated expressions we implement a second method for
applying the scrambling identities. Rather than substituting an existing bracket, we instead
allow for multiplication by unity or addition of zero. To multiply by unity we write a trivial
fraction using a randomly chosen bracket and scramble its corresponding numerator following
the aforementioned procedure. For instance, if we are using the Schouten identity in this
scrambling step we send
$$\text{multiplication by unity}: \quad M \;\to\; M\, \frac{\langle 12\rangle}{\langle 12\rangle} \;\to\; M\, \frac{\langle 13\rangle\langle 25\rangle - \langle 15\rangle\langle 23\rangle}{\langle 12\rangle\langle 35\rangle}\,. \tag{18}$$
For the addition of zero we proceed similarly: randomly select a bracket, write $0 = \langle ij\rangle - \langle ij\rangle$,
and then scramble one of the two terms. This step is necessary when scrambling target amplitudes
that vanish, so $\overline{M} = 0$. Alternatively, we can also insert this identity into the numerator
of a spinor-helicity expression, where we need to multiply it by an appropriate bracket expression
so that the little group and mass dimension scalings are preserved. For instance, we can
apply the replacement
$$\text{addition of zero}: \quad M \;\to\; M + \frac{[34] - [34]}{D}\, F \;\to\; M + \frac{[14][35] - [13][45] - [15][34]}{D\, [15]}\, F\,, \tag{19}$$
where we have scrambled $[34]$ with the Schouten identity. Here $D$ is the denominator of
the original amplitude and $F$ is a factor required to ensure that the scaling behaviour of $M$
is respected. This factor is sampled from the solution set obtained by solving the system² of
Eqs. (12)-(13) using [48]. In our analysis, multiplication by unity and addition of zero will count
as a single scrambling step since a single identity is sufficient to undo them.

² We allow for negative power coefficients in our solution set, so generically $F$ is not a simple monomial of
square and angle brackets. Instead, it can also have brackets raised to a negative power.
One could be concerned that the backward generation procedure introduces some bias
in the training samples. While the targets $\overline{M}$ are chosen to match simple amplitude
expressions (or at least parts of them), it is not immediately clear whether all possible $M$
can be reached from this generation process. It will thus be important to test our models on
amplitude expressions that one encounters in practical settings, as we do in Sec. 4.4, to ensure
that we have adequate generalization beyond the training set.
2.4 Analytic simplification
Before moving on, it is worthwhile to comment briefly on various analytic approaches to
amplitude simplification that have been developed over the years. Since the ambiguities in
representing a given spinor-helicity expression stem from the Schouten identity and momentum
conservation, it is natural to try to devise kinematic variables which trivialize these identities.
For example, in projective coordinates, we have that $\lambda^\alpha_i = (1, z_i)$ and $\langle ij\rangle = z_i - z_j$, so the
Schouten identity is algebraically satisfied. However, momentum conservation is not manifest,
so in these variables, a given spinor-helicity expression can still be expressed in many
distinct ways.
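The statement that the Schouten identity becomes automatic in these variables can be verified symbolically; a minimal check with SymPy, assuming the parametrization $\langle ij\rangle = z_i - z_j$ quoted above:

```python
import sympy as sp

z1, z2, z3, z4 = sp.symbols("z1 z2 z3 z4")
ab = lambda zi, zj: zi - zj  # <ij> = z_i - z_j in projective coordinates

# The Schouten combination <12><34> - <14><32> - <13><24> vanishes identically.
expr = ab(z1, z2) * ab(z3, z4) - ab(z1, z4) * ab(z3, z2) - ab(z1, z3) * ab(z2, z4)
print(sp.expand(expr))  # 0
```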
Alternatively, one can trivially manifest momentum conservation using momentum twistor
variables [49,50]. Here one defines twistors residing in dual spacetime coordinates for which
$x^{\alpha\dot{\alpha}}_i - x^{\alpha\dot{\alpha}}_{i+1} = \lambda^\alpha_i \tilde{\lambda}^{\dot{\alpha}}_i$. These variables are natural for planar amplitudes exhibiting dual
conformal invariance, as in maximally supersymmetric Yang-Mills theory. However, such
simplifications are certainly not generic. More importantly, since momentum twistors are
four-component objects, five or more of them are linearly dependent due to the higher-dimensional
analogue of the Schouten identity. Indeed, any finite-dimensional representation of the kinematics
will exhibit this ambiguity. Thus, irrespective of the analytic approach taken, there is
no general way to trivialize the simplification task.
3 One-shot learning
Armed with a training set of spinor-helicity expressions $M$ and their simplified target counterparts
$\overline{M}$, we can now apply ML. Concretely, we are interested in reducing complicated input
expressions like
$$M = \frac{-\langle 12\rangle^2 [12][15] - \langle 13\rangle\langle 24\rangle [13][45] + \langle 13\rangle\langle 24\rangle [14][35] - \langle 13\rangle\langle 24\rangle [15][34]}{\langle 12\rangle\langle 15\rangle\langle 23\rangle\langle 34\rangle\langle 45\rangle\, [12][15]}\,, \tag{20}$$
down to simplified target expressions like
$$\overline{M} = -\frac{\langle 12\rangle}{\langle 15\rangle\langle 23\rangle\langle 34\rangle\langle 45\rangle}\,. \tag{21}$$
By hand, the simplification of spinor-helicity expressions proceeds by successive applications
of well-chosen identities. For instance, the jump from Eq. (20) to Eq. (21) would be achieved
using a single Schouten identity, $[13][45] = [15][43] + [14][35]$. We would instead like to
create an ML algorithm that performs this simplification automatically.

The task of simplifying expressions is similar to theorem proving, where a program learns
to apply a set of axioms or tactics to reach a desired goal. In this context, reinforcement
learning [51,52] and Monte Carlo tree search [30] have already successfully reconstructed lengthy
proofs. However, these approaches become increasingly difficult to implement for larger
expressions and more mathematical identities. In this paper, rather than trying to train models
Table 1: Transformer architecture and training hyperparameters used for the one-shot simplification of spinor-helicity amplitudes.

Network architecture:
    Encoder layers          3
    Decoder layers          3
    Attention heads         8
    Embedding dimension     512
    Maximum input length    2560
Training parameters:
    Batch size              16
    Epoch size              50000
    Epoch number            1500
    Learning rate           $10^{-4}$
to learn a sequence of simplification steps, we instead focus on models that can generate a list
of guesses for what the simple form can be without explicitly listing the intermediate steps in
between. In this way, going from $M \to \overline{M}$ can be viewed as a one-shot translation task for
which transformer networks have demonstrated excellent performance on analogous tasks.

From their conception, transformer models have excelled at translation tasks [20]. More
recently, these architectures have been deployed to integrate functions [29] and simplify
polylogarithmic expressions [32], which are mathematical problems that share common features
with language translation. In particular, one exploits the fact that any mathematical expression
possesses a tree-like structure. Using prefix notation, where operators precede operands,
this tree-like structure is represented as an ordered set of tokens, akin to a regular sentence,
yielding an input that can be passed through a transformer. See Appendix A for a detailed
description of this structure in a concrete example.
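As a concrete illustration of this encoding, the sketch below flattens a small expression tree into prefix tokens. The token names follow the example quoted in Sec. 3.1, while the tuple-based tree is our own simplification of the parsing described in Appendix A.

```python
# A tiny expression tree: each node is either a leaf token (string) or a tuple
# (operator, child, child). The expression <12>[34] + <13>[24] becomes:
tree = ("add",
        ("mul", "ab12", "sb34"),
        ("mul", "ab13", "sb24"))

def to_prefix(node):
    # Prefix (Polish) notation: the operator is emitted before its operands,
    # so the tree can be flattened into an unambiguous sequence of tokens.
    if isinstance(node, str):
        return [node]
    op, left, right = node
    return [op] + to_prefix(left) + to_prefix(right)

print(to_prefix(tree))
# ['add', 'mul', 'ab12', 'sb34', 'mul', 'ab13', 'sb24']
```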
As an initial experiment, we restrict our training data to expressions composed of at most
1k tokens, guaranteeing a reasonable memory requirement and training time. The structure
of the training data is discussed in detail in Appendix B. Since scrambling identities will swiftly
increase the size of an expression, we initially restrict to $M$ that are related to $\overline{M}$ by at most
three scrambling steps. With more scrambling steps, the typical size of an amplitude can
exceed thousands of tokens and one-shot simplification is no longer suitable. We discuss the
simplification of such more complicated expressions in the subsequent section.
3.1 Network architecture
In this paper we closely follow the implementation of [29] and employ an encoder-decoder
transformer architecture³ defined with the hyperparameters detailed in Tab. 1. As detailed
in Appendix A, an input expression like $M = \langle 12\rangle [34]$ is first converted to a prefix notation
$P(M) = [\text{'mul', 'ab12', 'sb34'}]$, where the tokens are either binary operators, integers, or angle
and square brackets. This set of ordered tokens is fed through an embedding layer with positional
encoding before passing through a set of self-attention layers. These layers ensure that
the encoding of each token is conditioned on the embedding of all of the other tokens making
up the input sentence. The resulting embedded sentence is then passed to the decoder, which
is composed of self-attention layers and a final projection layer that together with a softmax
function is responsible for assigning a probability distribution over the set of allowed tokens.
³ Our implementation and analysis use the PyTorch [53], SymPy [54], Scikit-learn [55], NumPy [56] and Matplotlib [57] libraries.
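For orientation, a bare-bones PyTorch version of such an encoder-decoder with the dimensions of Tab. 1 might look as follows. This is a schematic sketch under our own simplifying assumptions (placeholder vocabulary size, learned positional embeddings, no generation loop), not the released implementation:

```python
import torch
import torch.nn as nn

class Seq2SeqSimplifier(nn.Module):
    # Encoder-decoder transformer with the dimensions of Tab. 1:
    # 3 encoder layers, 3 decoder layers, 8 attention heads, embedding dim 512.
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3, max_len=2560):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positional encoding
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.proj = nn.Linear(d_model, vocab_size)  # final projection over tokens

    def embed(self, tokens):
        pos = torch.arange(tokens.size(1), device=tokens.device)
        return self.token_emb(tokens) + self.pos_emb(pos)[None, :, :]

    def forward(self, src, tgt):
        # Causal mask so the decoder only attends to previously generated tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(self.embed(src), self.embed(tgt), tgt_mask=tgt_mask)
        return self.proj(out)  # logits; a softmax turns these into token probabilities

# Example: a batch of 16 scrambled inputs (length 100) and target prefixes (length 20).
model = Seq2SeqSimplifier(vocab_size=1000)
logits = model(torch.randint(0, 1000, (16, 100)), torch.randint(0, 1000, (16, 20)))
print(logits.shape)  # torch.Size([16, 20, 1000])
```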
Following the procedure outlined in Sec. 2.2, we generate training data for spinor-helicity
amplitudes with four, five, or six external momenta. In each case, we train a separate transformer
network on a single A100 GPU using the parameters summarized in Tab. 1 and the
Adam optimizer [58]. To reach 1500 epochs we have training times of 45h, 65h and 80h for
four-, five- and six-point amplitudes respectively. Each transformer is honed on training data
for which three or fewer scrambling steps have been applied to produce the input amplitude,
corresponding to a total of 10M unique amplitude pairs. We additionally retain 10k examples
for testing our trained models, where the associated input amplitudes have not been encountered
previously during training. To characterize the performance of our networks, we do not
rely on the naive in-training measure of accuracy. Indeed, since our models are trained using
a cross-entropy loss on the predicted tokens, the measure of accuracy that one has access to
during training is based on whether the model can exactly reproduce the ordered set of tokens
that corresponds to the target amplitude. However, in some cases, the same target amplitude
can be written in different equivalent ways. For example, this can easily occur in four-point
amplitudes, where the target expression defined in Eq. (9) can often be further simplified or
expressed more compactly.⁴ With the help of the tools developed in [13], we instead verify
whether the numerical evaluation of the input amplitude matches that of the predicted output.
We ask for a numerical equivalence at 9 digits of precision using two independent sets of phase
space points, which proves to be sufficient for the amplitudes considered during training.
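A sketch of this kind of numerical check for the pair in Eqs. (20)-(21): draw random complex spinors, impose total momentum conservation by solving for two of the square-bracket spinors, evaluate both expressions, and compare at 9 digits. The kinematics trick is standard, but the code and variable names below are our own illustration rather than the tools of [13]:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5

# Random complex two-component spinors; index 0 is unused so legs run over 1..n.
lam = rng.normal(size=(n + 1, 2)) + 1j * rng.normal(size=(n + 1, 2))
lamt = rng.normal(size=(n + 1, 2)) + 1j * rng.normal(size=(n + 1, 2))

def ab(i, j):  # angle bracket <ij>
    return lam[i, 0] * lam[j, 1] - lam[i, 1] * lam[j, 0]

def sb(i, j):  # square bracket [ij]
    return lamt[i, 0] * lamt[j, 1] - lamt[i, 1] * lamt[j, 0]

# Impose sum_i lambda_i lambdatilde_i = 0 by solving for lambdatilde_1,2:
# contracting the constraint with lambda_2 and lambda_1 gives
#   lambdatilde_1 = -sum_{i>2} <2i>/<21> lambdatilde_i,
#   lambdatilde_2 = -sum_{i>2} <1i>/<12> lambdatilde_i.
lamt[1] = -sum(ab(2, i) / ab(2, 1) * lamt[i] for i in range(3, n + 1))
lamt[2] = -sum(ab(1, i) / ab(1, 2) * lamt[i] for i in range(3, n + 1))

# Momentum conservation check: sum_i lambda_i^a lambdatilde_i^adot ~ 0.
print(np.abs(sum(np.outer(lam[i], lamt[i]) for i in range(1, n + 1))).max())

# Evaluate the complicated input, Eq. (20), and the simple target, Eq. (21).
num = (-ab(1, 2) ** 2 * sb(1, 2) * sb(1, 5) - ab(1, 3) * ab(2, 4) * sb(1, 3) * sb(4, 5)
       + ab(1, 3) * ab(2, 4) * sb(1, 4) * sb(3, 5) - ab(1, 3) * ab(2, 4) * sb(1, 5) * sb(3, 4))
den = ab(1, 2) * ab(1, 5) * ab(2, 3) * ab(3, 4) * ab(4, 5) * sb(1, 2) * sb(1, 5)
target = -ab(1, 2) / (ab(1, 5) * ab(2, 3) * ab(3, 4) * ab(4, 5))
print(round(abs(num / den - target), 9))  # 0.0
```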
At inference time we implement a beam search [60] that automatically generates a multiplicity
of distinct predictions for the simple form of the original spinor-helicity expression.
Rather than restricting ourselves to greedy decoding, which simply outputs the tokens with the
highest probability, this accommodates candidate amplitudes with lower probability tokens.
For a beam search of size $N$, we retain the $N$ best-scoring candidate amplitudes,
where the scores are calculated by summing the log-likelihood of each token and
normalizing by the sequence length. Since the numerical evaluation of the candidate amplitudes
provides us with an unambiguous criterion for identifying valid solutions, we can use
large beam sizes at inference time in the hope that at least one candidate proves to be correct.
To boost the performance at inference time we also consider an alternative to beam search
known as nucleus sampling [61]. Nucleus sampling is a stochastic decoding technique whereby
the model output is constructed by sampling subsequent tokens according to their probability
distributions, whereas, in contrast, beam search selects subsequent tokens based on their highest
probabilities. To avoid sampling over irrelevant tokens, nucleus sampling only considers
the most promising tokens by selecting the minimal set whose combined probability exceeds
a threshold $p_n$. This guarantees that only promising expressions are sampled and
that tokens with low scores are generically ignored. Our implementation is further detailed in
Appendix C.
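For concreteness, a minimal sketch of a single nucleus-sampling step acting on a vector of token logits; the threshold name $p_n$ follows the text, and everything else is our own illustration (the calibration of $p_n$ is described in Appendix C):

```python
import torch

def nucleus_sample(logits, p_n=0.9):
    # One decoding step: keep the smallest set of tokens whose cumulative
    # probability exceeds p_n, renormalize, and sample from that set.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Number of tokens needed for the cumulative probability to reach p_n.
    cutoff = int((cumulative < p_n).sum()) + 1
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(nucleus_probs, num_samples=1)
    return sorted_idx[choice].item()

# Example: a peaked distribution over a 6-token vocabulary.
logits = torch.tensor([4.0, 3.5, 1.0, 0.5, -1.0, -2.0])
print([nucleus_sample(logits) for _ in range(5)])  # mostly tokens 0 and 1
```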
3.2 Results
To characterize the performance of our trained models we compare evaluation results for a
beam search of size 1, 5, 10, 20, and 50 in Fig. 2. In each case, we compare the complexity of
the target amplitudes $\overline{M}$ to the complexity of the best hypothesis in the beam. As described in
Appendix B, we define the complexity to be the number of distinct square and angle brackets
that compose the amplitude, which is a proxy for the compactness of the resulting expression.
For beams of size greater than five, our models perform well, recovering a valid simplified form

⁴ When manipulating four-point amplitudes, the momentum conservation identities typically only involve two
distinct groups of terms. This implies that factorization is not unique and that the least common denominator
in a four-point amplitude is not uniquely fixed [59]. Therefore, one generically has many ways of rewriting the
same amplitude without increasing its complexity. For five- and six-point amplitudes this is much less likely, as
factorization is conjectured to be unique.