Hyper-optimized approximate contraction of tensor networks with arbitrary geometry: supplementary information

Johnnie Gray¹ and Garnet Kin-Lic Chan¹

¹Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, USA 91125
The following material contains detailed information about the implementation and application of the hyper-optimized approximate contraction method. Sec. A details the generation of spanning trees. Sec. B derives the explicit form of the 'compression' projectors. In Sec. C we give details and pseudo-code for each of the tree building algorithms. Sec. D defines the estimated cost and size metrics and compares them. In Sec. E we define the various hand-coded approximate contraction schemes we compare to. Sec. F contains details about each model we apply approximate contraction to. In Sec. G we provide various details and extra results regarding the comparison to CATN [1]. Finally, in Sec. H we perform a brief study of the corner double line model.
A Tree spans and gauging

In this section we detail the simple method of generating a (possibly $r$-local) spanning tree, for use with the tree gauge and also in the Span contraction tree building algorithm (note the spanning tree is a different object from the contraction tree). A pseudocode outline is given in Algorithm 1. We first take an arbitrary initial connected subgraph, $S_0$, of the graph, $G$, for example the two tensors sharing the bonds that are about to be compressed. We then greedily select a pair of nodes, one inside and one outside this region, which expands the spanning tree, $\tau$, and region, $S$, until all nodes within graph distance $r$ of $S_0$ are in $S$. Since the nodes can share multiple edges, the spanning tree $\tau$ is best described using an ordered set of node pairs rather than edges. If the graph $G$ has cycles, then nodes outside $S$ may have multiple connections to it; this degeneracy is broken by choosing a scoring function.
Algorithm 1: r-local spanning tree

Input: graph G, initial region S_0, max distance r
  τ ← {}                                   ▷ ordered set of pairs forming spanning tree
  S ← S_0                                  ▷ set of nodes spanned by the tree
  c ← {}                                   ▷ candidates to add to tree
  for u ∈ S do                             ▷ each node in original region
      for v ∈ NEIGHBORS(G, u) \ S do       ▷ connected nodes not in region
          r_uv ← 1                         ▷ distance to original region
          c ← c ∪ {(u, v, r_uv)}
      end for
  end for
  while |c| > 0 do
      (u, v, r_uv) ← BEST(c)               ▷ pop the best candidate edge
      c ← c \ {(u, v, r_uv)}
      if (v ∉ S) ∧ (r_uv ≤ r) then         ▷ node is new and close enough
          S ← S ∪ {v}                      ▷ add v to region
          τ ← τ ∪ {(u, v)}                 ▷ add edge to tree
          for w ∈ NEIGHBORS(G, v) \ S do   ▷ add new neighboring candidates
              r_vw ← r_uv + 1
              c ← c ∪ {(v, w, r_vw)}
          end for
      end if
  end while
Return: τ, S
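As a concrete illustration, a minimal Python sketch of Algorithm 1 follows. The graph representation (an adjacency mapping) and the default scoring function are assumptions made for the example, with BEST(c) realized as a priority queue.

import heapq

def local_spanning_tree(adj, s0, r=float("inf"), score=None):
    """Return (tau, S): ordered node pairs spanning all nodes within graph
    distance r of the initial region s0, and the spanned node set."""
    if score is None:
        # default tie-break: prefer candidates closest to the initial region
        score = lambda u, v, d: d
    tau = []            # ordered set of pairs forming the spanning tree
    S = set(s0)         # nodes spanned so far
    heap = []           # candidate edges, keyed by score
    for u in S:
        for v in adj[u]:
            if v not in S:
                heapq.heappush(heap, (score(u, v, 1), u, v, 1))
    while heap:
        _, u, v, d = heapq.heappop(heap)    # pop the best candidate edge
        if v in S or d > r:
            continue                        # already spanned or too far away
        S.add(v)
        tau.append((u, v))                  # add edge to tree
        for w in adj[v]:
            if w not in S:
                heapq.heappush(heap, (score(v, w, d + 1), v, w, d + 1))
    return tau, S

# example: 2x3 grid graph, starting from the two leftmost nodes
nodes = {(i, j) for i in range(2) for j in range(3)}
adj = {n: [(n[0] + di, n[1] + dj)
           for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
           if (n[0] + di, n[1] + dj) in nodes] for n in nodes}
tau, S = local_spanning_tree(adj, {(0, 0), (1, 0)}, r=2)
print(tau)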
For the tree gauge, we choose the scoring function such that the closest node with the highest connectivity (product of sizes of the connecting edge dimensions) is preferred. The gauging proceeds by taking the pairs in $\tau$ in reverse order, gauging bonds from the outer to the inner tensor (see main text Fig. 3E). If the graph $G$ is a tree and we take $r = \infty$, this corresponds exactly to canonicalization of the region $S_0$. In order to perform a compression of a bond in the tree gauge, we just need to perform QR decompositions inwards on a 'virtual copy' of the tree (as shown in the main text Fig. 4B), until we have the central 'reduced factors' $R_A$ and $R_B$. Performing a truncated SVD on the contraction of these two, $R_A R_B = R_{AB} \approx U \sigma V^\dagger$, allows us to compute the locally optimal projectors to insert on the bond as $P_L = R_B V \sigma^{-1/2}$ and $P_R = \sigma^{-1/2} U^\dagger R_A$, such that $AB \approx A P_L P_R B$. The form of these projectors, which is the same as in CTMRG and HOTRG but including information up to distance $r$ away, is explicitly derived in Sec. B. One further restriction we place is to exclude from the span any tensors that are input rather than intermediate tensors.

One obvious alternative to the tree gauge is to introduce an initial 'Simple Update' style gauge [2] on each of the bonds and update these after compressing a bond, including in the vicinity of the adjacent tensors. A similar scheme was employed for 3D contractions in [3]. In our experience this performs similarly to the tree gauge (and indeed the underlying operations are very similar) but is more susceptible to numerical issues due to the direct inversion of potentially small singular values.
B Explicit projector form

Performing a bond compression such as in the main text Fig. 3D can be equated to the insertion of two approximate projectors that truncate the target bond to size $\chi$. The projector form allows us to perform the tree gauge compression 'virtually', i.e. without having to modify tensors anywhere else in the original tensor network.

Figure 1: An example of transforming a tensor network, $T$, into an exactly contractable tensor network, $T_P$, using the explicit projector form of an approximate contraction tree. Here we take a $6 \times 6 \times 6$ cube with $D = 2$ and use an optimized contraction tree from the Span generator, for $\chi = 8$. Each node in the original tensor network is colored uniquely. The grey square nodes in the right-hand-side diagram represent the inserted projectors, with thicker edges the compressed bonds of size $\chi$. Arrows indicate the orientation of the projectors (i.e. the order of the compressions).

We begin by considering the product $AB$, where $A$ and $B$ might represent collections of tensors
such as a local tree. Assuming we can decompose each into an orthogonal and 'reduced' factor, we write:
$$AB = (Q_A R_A)(R_B Q_B).$$
If we resolve the identity on either side, we can form the product $R_A R_B$ in the middle and perform a truncated SVD on this combined reduced factor, yielding $U \sigma V^\dagger$:
\begin{align*}
AB &= Q_A R_A (R_A^{-1} R_A)(R_B R_B^{-1}) R_B Q_B \\
   &= Q_A R_A R_A^{-1} (U \sigma V^\dagger) R_B^{-1} R_B Q_B \\
   &= Q_A R_A (R_A^{-1} U \sqrt{\sigma})(\sqrt{\sigma} V^\dagger R_B^{-1}) R_B Q_B
\end{align*}
from which we can read off the projectors that we need to insert into the original tensor network in order to realize the optimal truncation as:
$$P_L = R_A^{-1} U \sqrt{\sigma}, \qquad P_R = \sqrt{\sigma}\, V^\dagger R_B^{-1}.$$
Finally, in order to avoid performing the inversion of the reduced factors, we can simplify:
\begin{align*}
P_L = R_A^{-1} U \sqrt{\sigma}
    &= R_A^{-1} (U \sqrt{\sigma} \sqrt{\sigma} V^\dagger) V \sigma^{-1/2} \\
    &= R_A^{-1} (R_A R_B) V \sigma^{-1/2} \\
    &= R_B V \sigma^{-1/2}
\end{align*}
and likewise:
\begin{align*}
P_R = \sqrt{\sigma}\, V^\dagger R_B^{-1}
    &= \sigma^{-1/2} U^\dagger (U \sqrt{\sigma} \sqrt{\sigma} V^\dagger) R_B^{-1} \\
    &= \sigma^{-1/2} U^\dagger (R_A R_B) R_B^{-1} \\
    &= \sigma^{-1/2} U^\dagger R_A .
\end{align*}
This form of the projectors makes explicit the equivalence to CTMRG and HOTRG [4, 5, 6], for which $R_A$ and $R_B$ contain only information about the local plaquette. Note that in general we just need to know $R_A$ and $R_B$ (not $Q_A$ or $Q_B$) to compute $P_L$ and $P_R$, but we can include in these the effects of the distance-$r$ tree gauge in order to perform the truncation locally without modifying any tensors but $A$ and $B$.
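As an illustration of these relations, the following minimal numpy sketch (an assumption for this example, not the authors' implementation) computes $P_L$ and $P_R$ from the reduced factors of two random matrices and checks that the resulting truncation matches the optimal rank-$\chi$ truncation of $AB$.

import numpy as np

rng = np.random.default_rng(42)
D, chi = 16, 8                       # original and truncated bond dimension
A = rng.normal(size=(50, D))         # 'A' side, matricized onto the bond
B = rng.normal(size=(D, 50))         # 'B' side

QA, RA = np.linalg.qr(A)             # A = QA RA
Qt, Rt = np.linalg.qr(B.T)           # B = RB QB, with RB = Rt.T, QB = Qt.T
RB = Rt.T

U, s, Vh = np.linalg.svd(RA @ RB)    # SVD of the combined reduced factor
U, s, Vh = U[:, :chi], s[:chi], Vh[:chi, :]        # truncate to chi

PL = RB @ Vh.conj().T / np.sqrt(s)                 # P_L = R_B V sigma^{-1/2}
PR = (U.conj().T / np.sqrt(s)[:, None]) @ RA       # P_R = sigma^{-1/2} U^dag R_A

approx = (A @ PL) @ (PR @ B)

# compare with the optimal rank-chi truncation of A @ B computed directly
U2, s2, Vh2 = np.linalg.svd(A @ B)
optimal = (U2[:, :chi] * s2[:chi]) @ Vh2[:chi, :]

err = lambda X: np.linalg.norm(A @ B - X) / np.linalg.norm(A @ B)
print(err(approx), err(optimal))     # the two errors coincide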
Rather than dynamically performing the approximate contraction algorithm using the ordered contraction tree, one can also use it to statically map the original tensor network, $T$, to another tensor network, $T_P$, which has the sequence of projectors lazily inserted into it (i.e. each $A P_L P_R B$ is left uncontracted). Exact contraction of $T_P$ then gives the approximate contracted value of $T$. Such a mapping may be useful for relating the approximate contraction to other tensor network forms [7], or for performing some operations such as optimization [8]. Here we describe the process.
To understand where the projectors should be inserted we just need to consider the sub-graphs that the intermediate tensors correspond to. At the beginning of the contraction, each node corresponds to a sub-graph of size 1, containing only itself. We can define the sub-graph map $S(i) = \{i\}$ for $i = 1 \ldots N$. When we contract two nodes $i, j$ to form a new node $k$, the new sub-graph is simply $S(k) = S(i) \cup S(j)$. When we compress between two intermediate tensors $i$ and $j$, we find all bonds connecting $S(i)$ to $S(j)$, and insert the projectors $P_L$ and $P_R$, effectively replacing the identity linking the two regions with the rank-$\chi$ operator $P_L P_R$. Finally we add the tensor $P_L$ to the sub-graph $S(i)$ and $P_R$ to the sub-graph $S(j)$. This can be visualized like so.
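A minimal Python sketch of this bookkeeping (the function and variable names here are hypothetical) might look as follows.

def make_subgraph_map(nodes):
    # every original tensor starts as its own sub-graph of size 1
    return {i: {i} for i in nodes}

def contract(subgraph, i, j, k):
    # contracting i and j into the new node k merges their sub-graphs
    subgraph[k] = subgraph.pop(i) | subgraph.pop(j)

def bonds_between(edges, subgraph, i, j):
    # all original-network bonds connecting the two sub-graph regions;
    # these are the bonds on which P_L and P_R are inserted
    si, sj = subgraph[i], subgraph[j]
    return [(u, v) for (u, v) in edges
            if (u in si and v in sj) or (u in sj and v in si)]

# example on a 1x3 chain 0-1-2: contract (0, 1) -> 3, then compress between 3 and 2
edges = [(0, 1), (1, 2)]
sg = make_subgraph_map([0, 1, 2])
contract(sg, 0, 1, 3)
print(bonds_between(edges, sg, 3, 2))   # [(1, 2)] -- insert P_L, P_R here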
Grouping all the neighboring tensors on one side of the bonds as an effective matrix $A$ and those on the other side as $B$ (note that these might generally include projectors from previous steps), the form of $P_L$ and $P_R$ can be computed as above.

An example of the overall geometry change from performing this explicit projection transformation for the full set of compressions on a cubic tensor network approximate contraction is shown in Fig. 1. Note that the dynamic nature of the projectors, which depend on both the input tensors and the contraction tree, is what differentiates a tensor network which you contract using approximate contraction from, for instance, directly using a tree- or fractal-like ansatz such as $T_P$.
C Tree builder details

In this section we provide extended details of each of the heuristic ordered contraction tree generators. First we outline the hyper-optimization approach. Each tree builder $B$ takes as input the graph $G$, with edges weighted according to the tensor network bond sizes, as well as a set of heuristic hyper-parameters, $\bar{\theta}$, that control how it generates an ordered contraction tree $\Upsilon$. The builder is run inside a hyper-optimization loop that uses a generic optimizer, $O$, to sample and tune the parameters. We use the nevergrad [9] optimizer for this purpose. A scoring function computes some metric $y$ for each tree (see Sec. D for possible functions), which is used to train the optimizer and to track the best score and tree sampled so far, $y_{\mathrm{best}}$ and $\Upsilon_{\mathrm{best}}$ respectively. The result, outlined in Algorithm 2, is an anytime algorithm (i.e. it can be terminated at any point) that samples trees from a space that progressively improves. Note that while the optimization targets a specific $\chi$, the tree produced exists separately from $\chi$ and can be used for a range of values of $\chi$ (in which case one would likely optimize for the maximum value).
Algorithm 2: Hyper-optimization loop

Input: graph G, max bond χ, builder B, optimizer O
  y_best ← ∞
  while optimizing do
      θ̄ ← SAMPLE_PARAMETERS(O)           ▷ get new hyper-parameters
      Υ ← GENERATE_TREE(B, G, θ̄)         ▷ build tree with new parameters
      y ← SCORE_TREE(Υ, χ)               ▷ score the tree
      if y < y_best then
          y_best ← y
          Υ_best ← Υ
      end if
      REPORT_PARAMETERS(O, θ̄, y)         ▷ update optimizer with score
  end while
Return: Υ_best
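For concreteness, a minimal sketch of this loop using nevergrad's ask/tell interface is given below; the builder, the score function and the particular parameter space are placeholders for illustration only, not the actual implementation.

import math
import nevergrad as ng

def generate_tree(graph, temperature, chi_greedy):
    # placeholder builder: here it just records its parameters
    return ("dummy-tree", temperature, chi_greedy)

def score_tree(tree, chi):
    # placeholder score: stands in for e.g. log2 of the estimated peak memory
    _, temperature, chi_greedy = tree
    return (math.log2(chi_greedy) - math.log2(chi)) ** 2 + temperature

graph, chi = None, 32
params = ng.p.Dict(
    temperature=ng.p.Log(lower=1e-3, upper=1.0),
    chi_greedy=ng.p.Scalar(lower=2, upper=128).set_integer_casting(),
)
opt = ng.optimizers.NGOpt(parametrization=params, budget=64)

y_best, tree_best = float("inf"), None
for _ in range(opt.budget):
    cand = opt.ask()                           # sample new hyper-parameters
    tree = generate_tree(graph, **cand.value)  # build tree with them
    y = score_tree(tree, chi)                  # score the tree
    if y < y_best:
        y_best, tree_best = y, tree
    opt.tell(cand, y)                          # report score back to the optimizer

print(y_best, tree_best)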
In the following subsections we outline the specific hyper-parameter choices, $\bar{\theta}$, for each tree builder. However, one useful recurring quantity is a measure of centrality, similar to the harmonic closeness [10, 11], that assigns to each node a value according to how central it is in the network. This can be computed very efficiently as
$$c[v] = \frac{1}{Z} \sum_{u \neq v} \frac{1}{\sqrt{d(u, v) + 1}},$$
where $d(u, v)$ is the shortest distance between nodes $u$ and $v$. The normalization constant $Z$ is chosen such that $c[v] \in [0, 1]\ \forall v$.
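A short sketch of computing this quantity with networkx follows (an assumed implementation; normalizing by the maximum value is one choice that satisfies $c[v] \in [0, 1]$).

import networkx as nx

def centrality(G):
    c = {}
    for v in G.nodes:
        # BFS shortest-path lengths from v to every other node
        dist = nx.single_source_shortest_path_length(G, v)
        c[v] = sum(1.0 / (d + 1) ** 0.5 for u, d in dist.items() if u != v)
    Z = max(c.values())                  # one possible choice of normalization
    return {v: cv / Z for v, cv in c.items()}

G = nx.grid_2d_graph(4, 4)               # example: 4x4 square lattice
c = centrality(G)
print(max(c, key=c.get))                 # one of the four most central sites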
C.1 Greedy

The Greedy algorithm builds an ordered contraction tree by taking the graph at step $\alpha$ of the contraction, $G_\alpha$, greedily selecting a pair of tensors to contract, $(i, j) \to k$, simulating the contraction and compression of those tensors, and then repeating the process with the newly updated graph, $G_{\alpha+1}$, until only a single tensor remains. The pair of tensors chosen at each step are those that minimize a local scoring function, and it is the parameters within this that are hyper-optimized. The local score is a sum of the following components (a minimal sketch of such a score is given at the end of this subsection):

• The $\log_2$ size of the new tensor after compression, with weight $\theta_{\text{new size}}$.

• The $\log_2$ size of the new tensor before compression, with weight $\theta_{\text{old size}}$.

• The minimum, maximum, sum, mean or difference (the choice of which is a hyper-parameter) of the two input tensor $\log_2$ sizes, with weight $\theta_{\text{inputs}}$.

• The minimum, maximum, sum, mean or difference (the choice of which is a hyper-parameter) of the sub-graph sizes of each input (when viewed as sub-trees), with weight $\theta_{\text{subgraph}}$.

• The minimum, maximum, mean or difference (the choice of which is a hyper-parameter) of the centralities of each input tensor, with weight $\theta_{\text{centrality}}$. Centrality is propagated to newly contracted nodes as the minimum, maximum or average of the inputs (the choice of which is a hyper-parameter).

• A random variable sampled from the Gumbel distribution, multiplied by a temperature (which is a hyper-parameter).

The final hyper-parameter is a value of $\chi_{\text{greedy}}$ to simulate the contraction with, which can thus deviate from the real value of $\chi$ used to finally score the tree. The overall space defined is 11-dimensional, which is small enough to be tuned by, for example, Bayesian optimization. In our experience it is not crucial to understand how each hyper-parameter affects the tree generated, other than that they are each chosen to carry some meaningful information from which the optimizer can conjure a local contraction strategy; the approach is more in the spirit of high-dimensional learning than of a physics-inspired optimization.
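The following sketch illustrates such a local score in Python; the weight names, helper functions and example values are hypothetical, chosen only to mirror the components listed above.

import math
import random

def gumbel():
    # sample from the standard Gumbel distribution
    return -math.log(-math.log(random.random()))

def combine(x, y, how):
    return {"min": min(x, y), "max": max(x, y), "sum": x + y,
            "mean": 0.5 * (x + y), "diff": abs(x - y)}[how]

def local_score(size_new_compressed, size_new, size_i, size_j,
                subgraph_i, subgraph_j, centrality_i, centrality_j, theta):
    """Weighted sum of the components listed above; theta holds the
    hyper-parameters sampled by the optimizer of Algorithm 2."""
    return (
        theta["new_size"] * math.log2(size_new_compressed)
        + theta["old_size"] * math.log2(size_new)
        + theta["inputs"] * combine(math.log2(size_i), math.log2(size_j),
                                    theta["inputs_how"])
        + theta["subgraph"] * combine(subgraph_i, subgraph_j,
                                      theta["subgraph_how"])
        + theta["centrality"] * combine(centrality_i, centrality_j,
                                        theta["centrality_how"])
        + theta["temperature"] * gumbel()
    )

theta = {"new_size": 1.0, "old_size": 0.5, "inputs": 0.1, "inputs_how": "max",
         "subgraph": 0.1, "subgraph_how": "sum",
         "centrality": -0.5, "centrality_how": "mean", "temperature": 0.05}
print(local_score(2**12, 2**16, 2**10, 2**9, 4, 7, 0.8, 0.6, theta))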
C.2 Span

The Span algorithm builds an ordered contraction tree using a modified, tunable version of the spanning tree generator in Algorithm 1 with $r = \infty$. The basic idea is to interpret the ordered sequence of node pairs in the spanning tree, $\tau$, as the reversed series of contractions to perform. The initial region $S_0$ is taken as one of the nodes with the highest or lowest centrality (the choice being a hyper-parameter). The remaining hyper-parameters are used to tune the local scoring function (BEST(c) in Algorithm 1), which decides which pair of nodes should be added to the tree at each step. These are:

• The connectivity of the candidate node to the current region, with weight $\theta_{\text{connectivity}}$.

• The dimensionality of the candidate tensor, with weight $\theta_{\text{ndim}}$.

• The distance of the candidate node from the initial region, with weight $\theta_{\text{distance}}$.

• The centrality of the candidate node, with weight $\theta_{\text{centrality}}$.

• A random variable sampled from the Gumbel distribution, multiplied by a temperature (which is a hyper-parameter).

The final hyper-parameter is a permutation controlling which of these scores to prioritize over others.
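One possible reading of this prioritization, sketched below with hypothetical names and values, is that each weighted component forms one entry of a tuple whose ordering is fixed by the permutation, so that candidates are ranked lexicographically; this interpretation is an assumption for illustration.

import math
import random

def span_score(connectivity, ndim, distance, centrality, theta, perm):
    noise = -math.log(-math.log(random.random()))    # Gumbel sample
    components = {
        "connectivity": theta["connectivity"] * math.log2(connectivity),
        "ndim": theta["ndim"] * ndim,
        "distance": theta["distance"] * distance,
        "centrality": theta["centrality"] * centrality,
        "noise": theta["temperature"] * noise,
    }
    # lower tuples win; perm fixes which components are compared first
    return tuple(components[name] for name in perm)

theta = {"connectivity": -1.0, "ndim": 0.5, "distance": 1.0,
         "centrality": -0.5, "temperature": 0.01}
perm = ("distance", "connectivity", "centrality", "ndim", "noise")
print(span_score(connectivity=8, ndim=3, distance=2, centrality=0.7,
                 theta=theta, perm=perm))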
C.3 Agglom

The Agglom algorithm builds the contraction tree by repeated graph partitioning using the library KaHyPar [12, 13]. We first partition the graph, $G$, into $\sim |V|/K$ parts, with the target subgraph size $K$ being a tunable hyper-parameter. Another hyper-parameter is the imbalance, $\theta_{\text{imbalance}}$, which controls how much the sub-graph sizes are allowed to deviate from $K$. Other hyper-parameters at this stage pertain to KaHyPar:

• $\theta_{\text{mode}}$: either 'direct' or 'recursive',

• $\theta_{\text{objective}}$: either 'cut' or 'km1',

• $\theta_{\text{weight}}$: whether to weight the edges constantly or logarithmically according to bond size.

Once a partition has been formed, the graph is transformed by simulating the contraction of all the tensors in each group, and then compressing between the new intermediates to create a new graph with $\sim |V|/K$ nodes and bonds of size no more than $\chi_{\text{agglom}}$ (itself a hyper-parameter which can deviate from the real $\chi$ used to score the tree). The contractions within each partition are chosen according to the Greedy algorithm. Finally, the tree generated in this way is not ordered; to fix an ordering, the contractions are sorted by sub-graph size and average centrality.
Algorithm 3: Branch and bound tree search

Input: graph G, maximum bond dimension χ
  y_best ← ∞
  c ← {}                                          ▷ candidate contractions
  for (i, j) ∈ EDGES(G) do                        ▷ populate with every pair of tensors
      y ← 0                                       ▷ initial score
      p ← [ ]                                     ▷ the contraction 'path'
      c ← c ∪ {(i, j, G, y, p)}
  end for
  while |c| > 0 do
      (i, j, G, y, p) ← REMOVE_BEST(c)
      if INVALID(i, j, G) or y ≥ y_best then      ▷ no need to explore further
          continue
      end if
      p ← APPEND(p, (i, j))                       ▷ continue exploring
      (k, G, y) ← SIMULATE_CONTRACTION(i, j, G, χ)      ▷ k is the new node
      if |G| = 1 and y < y_best then              ▷ finished contraction with best score
          y_best ← y
          p_best ← p
          continue
      end if
      for l ∈ NEIGHBORS(G, k) do                  ▷ add new possible contractions
          c ← c ∪ {(k, l, G, y, p)}
      end for
  end while
  Υ_best ← BUILD_TREE_FROM_PATH(G, p_best)
Return: Υ_best
C.4 Branch & bound approximate contraction tree

The hyper-optimized approach produces heavily optimized trees but with no guarantee that they are an optimal solution. For small graphs, a depth-first branch and bound approach can be used to find an optimal tree exhaustively, or to refine an existing tree if terminated early. The general idea is to run the greedy algorithm whilst tracking a score, but to keep and explore every candidate contraction at each step (a 'branch') in order to 'rewind' and improve upon it. The depth-first aspect refers to prioritizing exploring branches to completion so as to establish an upper bound on the score; the upper bound can then be improved and used to terminate bad branches early.
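A compact recursive sketch of such a search in Python is given below; it minimizes the largest intermediate tensor and, for brevity, ignores compression, so the names, data model and scoring choice are illustrative assumptions rather than the scheme of Algorithm 3.

from math import prod

def branch_and_bound(tensors, dims):
    """tensors: list of frozensets of index labels (each index is assumed to
    appear on exactly two tensors); dims: {index: size}."""
    best = {"score": float("inf"), "path": None}

    def recurse(ts, score, path):
        if score >= best["score"]:
            return                      # bound: cannot beat the current best
        if len(ts) == 1:
            best["score"], best["path"] = score, path
            return
        for a in range(len(ts)):
            for b in range(a + 1, len(ts)):
                if not ts[a] & ts[b]:
                    continue            # only contract connected tensors
                new = ts[a] ^ ts[b]     # indices shared by the pair are summed
                new_score = max(score, prod(dims[i] for i in new))
                rest = [t for n, t in enumerate(ts) if n not in (a, b)]
                recurse(rest + [new], new_score, path + [(a, b)])

    recurse(list(tensors), 0, [])
    return best

# example: a ring of four bond-dimension-4 tensors
dims = {i: 4 for i in "abcd"}
tensors = [frozenset("ab"), frozenset("bc"), frozenset("cd"), frozenset("da")]
print(branch_and_bound(tensors, dims))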
D Tree cost functions

There are various cost functions one can assign to an approximate contraction tree and then optimize against. Broadly these correspond to either space (memory) or time (FLOPs) estimates. Three cost functions that we have considered that only depend on the tree and $\chi$ (but not, for example, the gauging scheme) are the estimated peak memory, $M$, the largest intermediate tensor, $W$, and the number of FLOPs involved in the contractions only. Specifically, given the set of tensors, $\{v_\alpha\}$, present at stage $\alpha$ of the contraction, the peak memory is given by:
$$M = \max_\alpha \sum_{v \in \{v_\alpha\}} \mathrm{size}(T[v]) . \qquad (1)$$

Figure 2: Relationship between various tree cost functions for randomly sampled approximate contraction trees for two geometries: a 2D square of size $16 \times 16$ and a 3D cube of size $6 \times 6 \times 6$ (pictured in insets). $D$, $\chi$, the algorithm and its hyper-parameters are all uniformly sampled.
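A small sketch of how $M$ can be estimated from an ordered contraction path follows; the data representation and the crude model of compression (simply capping kept bonds at $\chi$) are assumptions for illustration.

from math import prod

def peak_memory(shapes, path, chi):
    """shapes: one {index: dimension} dict per input tensor; path: ordered list
    of (i, j) positions to contract (positions refer to the current list, with
    each new tensor appended at the end); chi: bond cap after compression."""
    live = [dict(s) for s in shapes]
    peak = sum(prod(s.values()) for s in live)
    for i, j in path:
        a, b = live[i], live[j]
        shared = set(a) & set(b)
        new = {ix: min(d, chi) for ix, d in {**a, **b}.items() if ix not in shared}
        live = [s for n, s in enumerate(live) if n not in (i, j)]
        for s in live:                      # compression also shrinks the
            for ix in s:                    # matching bonds on the neighbors
                if ix in new:
                    s[ix] = min(s[ix], chi)
        live.append(new)
        peak = max(peak, sum(prod(s.values()) for s in live))
    return peak

# example: a 1x3 chain of bond-dimension-4 tensors with physical legs of size 2
shapes = [{"p0": 2, "a": 4}, {"p1": 2, "a": 4, "b": 4}, {"p2": 2, "b": 4}]
print(peak_memory(shapes, path=[(0, 1), (0, 1)], chi=4))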
Given a compression and gauging scheme, one can also trace through the full computation, yielding a more accurate peak memory usage, $\tilde{M}$, as well as an estimate of the FLOPs associated with all QR and SVD decompositions too; we call this the full computational 'cost', $C$. Included in this we consider only the dominant contributions:

• contraction of two tensors with effective dimensions $(m, n)$ and $(n, k)$: $mnk$;

• QR of a tensor with effective dimensions $(m, n)$, with $m \geq n$: $2mn^2 - \frac{2}{3}n^3$;

• SVD of a tensor with effective dimensions $(m, n)$, with $m \geq n$: $4mn^2 - \frac{4}{3}n^3$.

Of these, the first two dominate since the SVD is only ever performed on the reduced bond matrix. Note the actual FLOPs will be a constant factor higher depending on the data type of the tensors.
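The leading-order counts above translate directly into small helper functions; the following sketch (the names and the example usage are illustrative) simply accumulates them.

def matmul_flops(m, n, k):
    # contraction of an (m, n) matrix with an (n, k) matrix
    return m * n * k

def qr_flops(m, n):
    # thin QR of an (m, n) matrix with m >= n
    m, n = max(m, n), min(m, n)
    return 2 * m * n**2 - (2 * n**3) // 3

def svd_flops(m, n):
    # thin SVD of an (m, n) matrix with m >= n
    m, n = max(m, n), min(m, n)
    return 4 * m * n**2 - (4 * n**3) // 3

# example: QR two (chi**2, chi) reduced factors, then SVD and recontract the
# (chi, chi) bond matrix
chi = 32
C = 2 * qr_flops(chi**2, chi) + svd_flops(chi, chi) + matmul_flops(chi, chi, chi)
print(C)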
Figure 3: Overview of a single step of the manual 2D boundary contraction method, which uses an MPS to sweep across the square.
In Fig. 2 we plot the relationship between the various metrics mentioned above for several thousand randomly sampled contraction trees on both a square and a cubic geometry, for varying $D$, $\chi$ and algorithm. We note that $M$, $\tilde{M}$ and $W$ are all tightly correlated. The full cost $C$ is slightly less correlated with these, and only slightly more so with the 'contractions only' cost. Importantly however, the best contractions largely appear to simultaneously minimize all the metrics.
E Hand-coded contraction schemes

In Figs. 3-7 we illustrate the various hand-coded contraction schemes used as comparisons in the text: 2D boundary contraction, 2D corner transfer matrix RG [14], 2D higher-order TRG [15], 3D PEPS boundary contraction, and 3D higher-order TRG [15]. Note that in the case of CTMRG and HOTRG, the algorithms are usually iterated to treat infinite, translationally invariant lattices, but here we simply apply a finite number of CTMRG or HOTRG steps and also generate the projectors locally to handle inhomogeneous tensor networks. For both CTMRG and HOTRG we use the cheaper, 'lazy' method [6] of computing the reduced factors $R_A$ and $R_B$, which avoids needing to form and compute a QR on each pair of tensors on either side of a plaquette. We then use the projector form as given in Sec. B to compress the plaquette. The 3D PEPS boundary contraction algorithm has not previously been implemented to our knowledge, but is formulated in a way analogous to 2D boundary contraction. Notably, if any dimension is of size 1 it reduces exactly to 2D boundary contraction, including canonicalization. For further details, we refer to the lecture notes [7] and the original references.
F Models

F.1 Ising Model

We consider computing the free energy per spin of a system of $N$ classical spins at inverse temperature $\beta$,
$$f = -\frac{\log Z}{N\beta}, \qquad (2)$$
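For reference, a minimal sketch (an assumption for illustration, not taken from the paper) of the standard construction of the corresponding partition-function tensor network follows: each bond carries the Boltzmann matrix, whose symmetric square root is absorbed into delta tensors on the sites.

import numpy as np

beta = 0.4
# Boltzmann weights for a single bond between spins s, s' = ±1 (zero field)
W = np.array([[np.exp(beta), np.exp(-beta)],
              [np.exp(-beta), np.exp(beta)]])
# symmetric square root of W, one factor absorbed into each adjacent site
w, U = np.linalg.eigh(W)
W_half = U @ np.diag(np.sqrt(w)) @ U.T

def ising_site(z):
    # site tensor for a node of degree z: a delta tensor with W_half on each leg
    delta = np.zeros((2,) * z)
    for s in range(2):
        delta[(s,) * z] = 1.0
    T = delta
    for leg in range(z):
        T = np.tensordot(T, W_half, axes=[[leg], [0]])
        T = np.moveaxis(T, -1, leg)
    return T

# sanity check: two spins joined by a single bond
Z2 = np.einsum("a,a->", ising_site(1), ising_site(1))
print(Z2, 2 * np.exp(beta) + 2 * np.exp(-beta))   # should agree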