A randomized quantum algorithm for statistical phase estimation

Phase estimation is a quantum algorithm for measuring the eigenvalues of a Hamiltonian. We propose and rigorously analyse a randomized phase estimation algorithm with two distinctive features. First, our algorithm has complexity independent of the number of terms L in the Hamiltonian. Second, unlike previous L-independent approaches, such as those based on qDRIFT, all sources of error in our algorithm can be suppressed by collecting more data samples, without increasing the circuit depth.


I. INTRODUCTION
Quantum computers can be used to simulate dynamics and learn the spectra of quantum systems, such as interacting particles comprising complex molecules or materials, described by some Hamiltonian H. Phase estimation [1] on the unitary U = e^{iHt} efficiently solves the common spectral problem of computing ground state energies, whenever we can efficiently prepare a trial state with non-trivial (not exponentially small) overlap η with the ground state [2] (see also [3]). Each run of standard phase estimation returns a single eigenvalue, with precision and success probability dependent on the number of times U is used.
Recently, statistical approaches to phase estimation have been proposed [4][5][6], where each run uses only a few ancillae and shorter circuits than standard phase estimation. As such, statistical phase estimation may be better suited to early fault-tolerant quantum computers that are qubit- and depth-limited. However, in these approaches, a single run gives one sample of an estimator involving U^j for some runtime j, which alone is not enough to infer spectral properties. Multiple runs with different values of j are needed, and statistical analysis gives spectral information with a confidence that increases with the amount of data obtained. These runs could be massively parallelized across multiple quantum computers. Interestingly, the approach of Lin & Tong [6] is not only statistical in its analysis, but also generates the runtimes j, and therefore the circuits, from a random ensemble.
The cost of phase estimation, statistical or standard, typically depends on the Hamiltonian sparsity L, the number of terms in the Hamiltonian when decomposed in a suitable basis, such as the Pauli basis. Simple schemes based on implementing U using Trotter formulae have O(L) gate complexity [7][8][9][10][11]. This can be prohibitive for the electronic structure problem in chemistry and materials science, where typically L = O(N^4) for an N-orbital problem [12]. This increases to L = O(N^6) when using transcorrelated orbitals [13,14] to better resolve electron-electron interactions. Interestingly, sub-linear non-Clifford complexity O(√L + N) is possible by employing an efficient data-lookup oracle [15,16] in qubitization-based implementations of phase estimation [17][18][19][20]. However, these approaches require O(√L) ancillae, which increases the qubit cost from O(N) to O(N^2), or even O(N^3) in the transcorrelated setting.
Heuristic truncation and low-rank factorisations have been proposed to decrease the sparsity L of the electronic structure Hamiltonian [18][19][20]. As an alternative approach, randomized compilation [21][22][23] has been rigorously shown to enable phase estimation with gate complexity that is independent of L for any Hamiltonian. A weakness of these randomized algorithms is a systematic error in energy estimates that can only be suppressed by increasing gate complexity, leading to high gate counts per run (cf. [20, Appendix D]).
Here, we overcome this difficulty by combining the statistical approach of Lin & Tong [6] with a novel random compilation of each U^j instance, which has parallels to, but is distinct from, both the qDRIFT random compiler [21] and the linear combinations of unitaries (LCU) method [24,25]. Our algorithm for phase estimation is doubly randomized in that we randomly sample j, then approximate U^j using a random gate sequence. Unlike in any previous approach, all approximation and compilation errors can be expressed in terms of statistical noise that is suppressed by collecting more data samples. This allows for a trade-off between the gate complexity per sample and the number of samples required. We explore this trade-off and show how to efficiently find the algorithmic parameters that minimise the total complexity. In contrast, qDRIFT approximates U up to some systematic error (measured by the diamond norm) that cannot be mitigated by increasing the number of samples.
Applied to ground state energy estimation, we can tune the gate vs. sample trade-off to yield the following complexities. Given a Hamiltonian as a linear combination of Pauli operators with total weight λ, and an ansatz state with overlap at least η with the ground space, we can choose to sample from Õ(η^{-2}) randomly compiled quantum circuits, where Õ(·) hides polylogarithmic factors. Each circuit uses one ancilla and at most Õ(λ^2 ∆^{-2}) single-qubit Pauli rotations to estimate the ground state energy to within additive error ∆.
In Section II, we start by constructing a subroutine that we refer to as eigenvalue thresholding, which we then apply to ground state energy estimation in Sec. III. We discuss examples from quantum chemistry in Sec. IV.

II. EIGENVALUE THRESHOLDING
Problem setting. We assume that the Hamiltonian H is specified as a linear combination of n-qubit Pauli operators P_ℓ,

H = Σ_ℓ α_ℓ P_ℓ, with total weight λ := Σ_ℓ |α_ℓ|. (1)

This form can always be achieved, and is particularly natural for many physical systems of interest, such as fermionic Hamiltonians [26][27][28][29][30]. Note that the spectral norm ∥H∥ obeys the generally loose bound ∥H∥ ≤ λ.

We consider the following problem of coarsely determining whether an ansatz state ρ has overlap with eigenstates of H with eigenvalues below some threshold: given a threshold X, precision ∆ > 0, and overlap parameter η > 0, we seek to decide if (A) tr[ρΠ_{≤X−∆}] < η or (B) tr[ρΠ_{≤X+∆}] > 0, where Π_{≤x} denotes the projector onto the eigenspaces of H with eigenvalues at most x. Both of these statements can simultaneously be true, in which case it suffices to output either A or B. We refer to this problem as eigenvalue thresholding, and its solution will later allow us to estimate the ground state energy, given a suitable ansatz ρ.

Cumulative distribution function. Similarly to [6], we define the cumulative distribution function (CDF) associated with the Hamiltonian H and ansatz state ρ as

C(x) := tr[ρ Θ(x − τH)], (2)

where τ := π/(2λ + ∆) is a normalisation factor and Θ(·) is the Heaviside step function, applied here to the eigenvalues of x − τH. The jump discontinuities in C(x) occur at eigenvalues of τH, so appropriately characterising the CDF would enable us to estimate the spectrum of the Hamiltonian. We can write C(x) as the convolution (Θ * p)(x) of the Heaviside function Θ(·) and the probability density function p(·) corresponding to τH and ρ,

p(y) := Σ_k tr[ρΠ_k] δ(y − τE_k), (3)

where the sum runs over the eigenvalues E_k of H with eigenspace projectors Π_k, noting that p is supported within (−π/2, π/2) since ∥τH∥ ≤ τλ < π/2. Eigenvalue thresholding then reduces to the following problem regarding the CDF.
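As a purely classical illustration of these definitions, the following sketch evaluates the exact (unsmoothed) CDF for a toy spectrum and checks the thresholding condition; the eigenvalues, overlaps, and parameter values are invented for the example.

```python
import math

def cdf(x, eigvals, overlaps, tau):
    # C(x) = sum_k p_k * Theta(x - tau*E_k): spectral weight of rho on
    # eigenvalues E_k of H with tau*E_k <= x (a right-continuous step function)
    return sum(p for E, p in zip(eigvals, overlaps) if tau * E <= x)

# toy spectrum: ground energy -1.0 carries overlap 0.3
eigvals = [-1.0, -0.2, 0.8]
overlaps = [0.3, 0.5, 0.2]
lam, delta = 1.0, 0.1                # total Pauli weight and precision
tau = math.pi / (2 * lam + delta)    # normalisation, so tau * |E| < pi/2

# thresholding at X: statement (B) holds iff weight lies at or below X + delta
X = -0.95
below = cdf(tau * (X + delta), eigvals, overlaps, tau)
print(below)   # 0.3, so statement (B) holds
```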
Problem 1: For given x ∈ [−τλ, τλ] and δ > 0, determine whether

(A) C(x − δ) < η or (B) C(x + δ) > 0, (4)

outputting either statement if both are true. In particular, solving Problem 1 for x = τX and δ = τ∆ solves eigenvalue thresholding.

Algorithm overview. To solve Problem 1, we will construct an approximation C̃(·) to the CDF C(·) satisfying

C(x − δ) − ε ≤ C̃(x) ≤ C(x + δ) + ε (5)

for the relevant values of x, δ, and ε. Observe that for ε ∈ (0, η/2), C̃(x) < η − ε would imply the first case of Eq. (4), while C̃(x) > ε would imply the second case. Hence, it suffices to estimate C̃(x). Our algorithm is based on expressing C̃(x) as a linear combination of computationally simple unitaries, obtained via a two-step construction. First, we develop an improved Fourier series approximation to the Heaviside function (Lemma 1). Second, we combine this with a novel decomposition of the time evolution operators (Lemma 2) appearing in the relevant Fourier series. Randomly sampling unitaries from our decomposition and estimating their expectation values using Hadamard tests (Fig. 1(a)) gives estimates of C̃(x), allowing us to solve Problem 1 with high probability.
As such, Lemma 1 also improves the asymptotic complexity of Lin & Tong's phase estimation algorithm [6]. In Appendix A, we prove a stronger version of Lemma 1 with explicit constants, by converting suitable Chebyshev approximations to the error function into Fourier series.
LCU decomposition of time evolution operators. Instead of directly implementing the time evolution operators e^{iĤt_j} from Eq. (6) in Hadamard tests, as considered by [6], we further decompose each of these operators into a specific linear combination of unitaries.
Lemma 2. Let Ĥ = Σ_ℓ p_ℓ P_ℓ be a Hermitian operator that is specified as a convex combination of Pauli operators. For any t ∈ R and r ∈ N := {1, 2, . . .}, there exists a linear decomposition e^{iĤt} = Σ_{k∈S_2} b_k U_k such that the total weight satisfies Σ_{k∈S_2} |b_k| ≤ exp(t^2/r) and, for all k ∈ S_2, the non-Clifford cost of controlled-U_k is that of r controlled single-qubit Pauli rotations.
This decomposition is conceptually different from previous LCU methods, cf. [25] and references therein. The purpose of Lemma 2 is to allow for a trade-off between the sample complexity and gate complexity of our algorithm. Specifically, as shown later, the sample complexity depends on the total weight Σ_{k∈S_2} |b_k| of the coefficients in our decomposition. Since this is bounded by exp(t^2/r), we can reduce the sample complexity by increasing r, at the cost of increasing the gate complexity per sample, and vice versa.
To prove Lemma 2, we write e^{iĤt} = (e^{iĤt/r})^r and Taylor-expand each e^{iĤt/r} = 1 + iĤt/r + O((t/r)^2). We then pair up consecutive terms in this expansion, which differ in phase by i. Since Ĥ is a convex combination of Pauli operators, this gives rise to convex combinations of multi-qubit Pauli rotations; e.g., the leading term is

1 + iĤt/r = √(1 + (t/r)^2) Σ_ℓ p_ℓ e^{iθP_ℓ},

with θ := arccos(1/√(1 + (t/r)^2)). The higher-order terms contain additional Pauli operators, as illustrated in Fig. 1(b). The controlled version of each Pauli rotation can be implemented using a controlled single-qubit rotation, along with Clifford gates. Hence, each controlled-U_k requires r controlled single-qubit rotations in total. Explicit forms for the higher-order terms and proof details are given in Appendix C, where we also show, via Algorithm 2, that one can efficiently sample U_k according to the distribution given by {b_k}_k.

Our algorithm for Problem 1. Putting together the above results, we apply Lemma 2 to decompose each e^{iĤt_j} in Eq. (6) as e^{iĤt_j} = Σ_{k∈S_2} b_k^{(j)} U_k^{(j)}. We choose a positive integer r_j for each j ∈ S_1, and define the corresponding "runtime vector" r = (r_j)_j ∈ N^{S_1}. This leads to the final decomposition

C̃(x) = Σ_{(j,k)∈S_1×S_2} a_{jk} tr[ρ U_k^{(j)}], (8)

with total weight A(r) := Σ_{jk} |a_{jk}|.

Recall that we can solve Problem 1 by determining if C̃(x) < η − ε or C̃(x) > ε. To estimate C̃(x), we sample (j, k) from S_1 × S_2 with probability proportional to |a_{jk}|, and perform a Hadamard test on ρ and U_k^{(j)}, obtaining an estimate m_{jk} for tr[ρ U_k^{(j)}]. Then, z_{jk} := A(r) e^{i arg(a_{jk})} m_{jk} is an unbiased estimate of C̃(x). Letting Z denote the random variable obtained by taking the average of C_sample such estimates, it follows from Hoeffding's inequality that guessing C̃(x) < η − ε if Re[Z] < η/2, and C̃(x) > ε otherwise, gives a correct answer with probability at least 1 − ϑ, provided that C_sample ≥ 4A(r)^2 (η/2 − ε)^{-2} ln(ϑ^{-1}) (cf. Appendix E). Thus, we arrive at Algorithm 1, our algorithm for solving Problem 1, and hence eigenvalue thresholding.
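The leading-order pairing in this proof can be checked numerically. The following sketch verifies the identity 1 + ixĤ = √(1 + x^2) Σ_ℓ p_ℓ e^{iθP_ℓ} for a single-qubit Ĥ built from the Pauli matrices X and Z; the weights and the value of x are arbitrary choices for the test.

```python
import cmath, math

# 2x2 matrices as nested lists; enough to check the identity on one qubit
I2 = [[1, 0], [0, 1]]
X  = [[0, 1], [1, 0]]
Z  = [[1, 0], [0, -1]]

def add(A, B, ca=1, cb=1):
    # entrywise linear combination ca*A + cb*B
    return [[ca * A[i][j] + cb * B[i][j] for j in range(2)] for i in range(2)]

def expi_pauli(theta, P):
    # e^{i*theta*P} = cos(theta) I + i sin(theta) P, since P^2 = I
    return add(I2, P, math.cos(theta), 1j * math.sin(theta))

# H_hat = 0.6 X + 0.4 Z (convex combination); x stands for t/r
p, paulis, x = [0.6, 0.4], [X, Z], 0.3
theta = math.atan(x)     # equivalently arccos(1 / sqrt(1 + x^2))

lhs = add(I2, add(paulis[0], paulis[1], p[0], p[1]), 1, 1j * x)  # 1 + i x H
scale = math.sqrt(1 + x * x)
rhs = add(expi_pauli(theta, paulis[0]), expi_pauli(theta, paulis[1]),
          scale * p[0], scale * p[1])

err = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(err < 1e-12)
```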

5: Sample a unitary U_k^{(j)}:
   a. Sample j ∈ S_1 with probability proportional to |F_j| µ_j.
   b. Sample a unitary using Algorithm 2 with inputs H/λ, t_j, r_j.
6: Perform a Hadamard test with inputs ρ and U_k^{(j)}, obtaining an outcome m_i.
7: z_i ← A(r) e^{i(arg(F_j) + jx)} m_i.
8: z ← Σ_i z_i / C_sample(r). If Re(z) < η/2, return 0. Else, return 1.

Complexity. The Hadamard test in Step 6 is the only quantum step and involves two circuits on n + 1 qubits, for an n-qubit Hamiltonian H. The expected number of controlled Pauli rotations per circuit is C_gate(r), given in Eq. (12). Step 6 is repeated C_sample(r) times, so the expected total non-Clifford complexity is 2C_sample(r) · C_gate(r). It remains to specify how to choose the runtime vector r ∈ N^{S_1}. For example, we could aim to minimise the total complexity arg min_r C_sample(r) · C_gate(r).
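The reweighting in Steps 5-8 is a standard importance-sampling estimator. The sketch below illustrates its unbiasedness classically, with fixed complex numbers standing in for the Hadamard-test expectation values tr[ρU_k]; the coefficients a_k are invented for the example.

```python
import cmath, random

random.seed(7)

# stand-ins for tr[rho U_k] (in practice estimated via Hadamard tests)
m = [0.9 + 0.1j, -0.4 + 0.2j, 0.1 - 0.3j]
a = [0.5, -0.3 + 0.1j, 0.2j]            # LCU coefficients a_k

A = sum(abs(c) for c in a)               # total weight A
probs = [abs(c) / A for c in a]
target = sum(c * v for c, v in zip(a, m))

def one_sample():
    # sample k with probability |a_k|/A; reweight by A * e^{i arg(a_k)}
    u, acc = random.random(), 0.0
    for k, pk in enumerate(probs):
        acc += pk
        if u <= acc:
            break
    return A * cmath.exp(1j * cmath.phase(a[k])) * m[k]

est = sum(one_sample() for _ in range(200000)) / 200000
print(abs(est - target))   # small: the estimator is unbiased
```

The per-sample magnitude is bounded by A · max_k |m_k|, which is why the sample count in the main text scales with A(r)^2.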
Prima facie, this is a high-dimensional optimisation problem, as |S_1| = O(δ^{-1} log(ε^{-1})) from Lemma 1. However, differentiating with respect to r, one sees that the argmin is effectively described by a single free parameter. Therefore, optimising r reduces to an efficiently solvable one-dimensional problem, and the same holds when minimising C_sample subject to constraints on C_gate; see Appendix D for details. Moreover, if one is exclusively interested in asymptotic complexities, the simple choice for r in Eq. (10) already suffices, since A(r) ≤ √e F and C_gate(r) ≤ max_{j∈S_1} r_j = 2[(2d + 1)τλ]^2 for this choice, with F and d given by Lemma 1, and picking ε = const. × η in Algorithm 1. Note that the worst-case gate complexity thus has the same scaling as that in Eq. (15) for the expected gate complexity C_gate(r). Hence, we arrive at a total complexity Õ(δ^{-2}η^{-2}). For eigenvalue thresholding, one would choose δ = τ∆, in which case δ^{-1} = O(λ/∆).
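The one-parameter family referred to here can be explored numerically. The sketch below is a hedged illustration, not the paper's exact expressions: it assumes the per-term weight bound u_j = exp(t_j^2/r_j) from Lemma 2, uses toy values for |F_j| and t_j, and scans runtime vectors of the form r_j = ⌈c t_j^2⌉ over the single parameter c.

```python
import math

# toy Fourier data |F_j| and runtimes t_j (not the series of Lemma 1)
F = [0.5, 0.25, 0.12, 0.06]
t = [1.0, 3.0, 5.0, 7.0]

def costs(c):
    # one-parameter family r_j = ceil(c * t_j^2); assumed weight bound
    # u_j = exp(t_j^2 / r_j) per Lemma 2
    r = [max(1, math.ceil(c * tj * tj)) for tj in t]
    u = [math.exp(tj * tj / rj) for tj, rj in zip(t, r)]
    A = sum(Fj * uj for Fj, uj in zip(F, u))             # total weight A(r)
    c_sample = A * A                                      # ∝ samples needed
    c_gate = sum(Fj * uj * rj for Fj, uj, rj in zip(F, u, r)) / A
    return c_sample * c_gate

# 1-D grid search over c: small c means short circuits but huge weight,
# large c means heavy circuits; the product is minimised in between
best_c = min((c / 10 for c in range(1, 100)), key=costs)
print(best_c, costs(best_c))
```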

III. GROUND STATE ENERGY ESTIMATION
Under appropriate assumptions on the Hamiltonian H and ansatz state ρ, our method for estimating the CDF can be adapted to perform phase estimation. Specifically, eigenvalues of H coincide with the locations of jump discontinuities in C(x), and we can estimate these locations given sufficient knowledge about the overlap of ρ with the relevant eigenspaces. For simplicity, we restrict ourselves to the problem of estimating the ground state energy [H]_min, which only requires the standard promise that tr[ρΠ_min] ≥ η for some η > 0, where Π_min denotes the projector onto the ground space of H.
The analysis in [6, Section 5] shows that by solving Problem 1 for s = O(log(δ^{-1})) different values of x determined in a fashion similar to binary search, one can find an x* such that C(x* − δ) < η and C(x* + δ) > 0, which implies that |x* − τ[H]_min| ≤ δ. Hence, if we take δ = τ∆, then x*/τ gives an estimate of the ground state energy to within additive error ∆. We use Algorithm 1 to solve Problem 1, noting that we can reuse the samples collected in Step 6 for all of the different x values, with only a small overhead in the sample complexity. Namely, since Algorithm 1 errs with probability at most ϑ for any x, choosing ϑ = ξ/s ensures, by the union bound, that the ground state energy is successfully estimated with probability at least 1 − ξ.
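The binary-search reduction can be sketched as follows, using an idealised, noiseless Problem-1 oracle built from an exact CDF; the spectrum and parameter values are invented for the example.

```python
def cdf_factory(eigvals, overlaps):
    def C(x):
        # exact (unsmoothed) CDF: weight on eigenvalues at most x
        return sum(p for E, p in zip(eigvals, overlaps) if E <= x)
    return C

def oracle(x, delta, eta, C):
    # idealised, noiseless Problem-1 solver
    if C(x - delta) < eta:
        return 0          # case (A)
    return 1              # C(x+delta) >= C(x-delta) >= eta > 0: case (B)

def find_ground_energy(C, lo, hi, delta, eta):
    # invariant: C(lo - delta) < eta and C(hi + delta) > 0,
    # so the ground energy stays within ~delta of the final interval
    while hi - lo > delta:
        mid = (lo + hi) / 2
        if oracle(mid, delta, eta, C) == 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# ground energy -1.0 with overlap 0.3 >= eta = 0.2
C = cdf_factory([-1.0, -0.2, 0.8], [0.3, 0.5, 0.2])
est = find_ground_energy(C, -2.0, 2.0, 0.01, 0.2)
print(est)
```

In the quantum algorithm, each oracle call is answered by Algorithm 1 with failure probability ϑ, and the union bound over the s ≈ log((hi − lo)/δ) calls yields the overall success probability.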
Theorem 1. For any n-qubit Hamiltonian H of the form in Eq. (1), let ρ be a state that has overlap tr[ρΠ_min] ≥ η with the ground space of H. Then, the ground state energy of H can be estimated to within additive error ∆ with probability at least 1 − ξ using

O( η^{-2} log^2(λ/∆) log(η^{-1}) log( ξ^{-1} log(λ/∆) ) )

quantum circuits on n + 1 qubits. Each circuit uses one copy of ρ and at most O( λ^2 ∆^{-2} log^2(η^{-1}) ) single-qubit Pauli rotations.
Thus, our quantum complexities are independent of the Hamiltonian sparsity L, at the price of the quadratic dependence O(λ^2 ∆^{-2} η^{-2}) for the total gate count. This is in contrast to standard results on phase estimation (see e.g., [20, Table I]). Additionally, note that Theorem 1 is derived using the specific choice of runtime vector in Eq. (10). By tuning r (using for instance the optimisation procedures in Appendix D), we can reduce the gate complexity per circuit by running more circuits, for a given set of problem parameters.

IV. EXAMPLES IN QUANTUM CHEMISTRY
Comparisons. Conventional phase estimation algorithms depend on the sparsity L, which is especially prohibitive for chemistry Hamiltonians. Several algorithms [18][19][20] have used heuristic truncation policies to justify eliminating certain terms from the Hamiltonians, thereby reducing L. While supporting numerics were presented, these truncations are not rigorous. Moreover, it was also assumed that only a single run of the algorithm suffices. In practice, a single sample might return an incorrect result due to imperfect overlap with the ground state (η < 1), inherent failure probabilities of phase estimation, or quantum error correction failure events. In contrast, our algorithm is rigorously analysed: we use no Hamiltonian truncation, and upper-bound the number of samples needed in terms of η and the target success probability.

Hydrogen chains. As a benchmark system for assessing the scaling of quantum algorithms applied to quantum chemistry, we discuss hydrogen chains [20,36]. Using the best value λ ∼ O(N^{1.34}) given in [36], our algorithm scales as O(N^{2.68}/∆^2). For comparison, the scaling of qubitization is O(N^{3.34}/∆) without truncation, and with heuristic truncations, O(N^{2.3}/∆) for the sparse method of [18] and O(N^{2.1}/∆) for the tensor hypercontraction approach of [20]. Hence, for constant ∆, qubitization gives a better scaling than our algorithm if the proposed truncation schemes are accurate. However, we emphasize that our rigorous analysis does not make use of heuristic strategies for truncating Hamiltonian terms [18][19][20], and that qubitization uses considerably more logical ancillae. Finally, if we are interested in extensive properties, where ∆ ∝ N, then our approach scales as O(N^{0.68}), outperforming all qubitization algorithms.
To calculate the number of samples needed to guarantee an overall failure probability of at most ξ for ground state energy estimation (Theorem 1), one would multiply the y-axis by ln(ϑ^{-1}) = O(log(ξ^{-1}) + log log(δ^{-1})) (see [6] for the explicit constants). As an example, ξ = 0.1 gives a multiplier of approximately 6.
FeMoco. We estimate the costs of our algorithm applied to the Li et al. FeMoco Hamiltonian [37], another popular benchmark for which there have been several state-of-the-art resource studies [18][19][20]. We consider chemical accuracy ∆ = 0.0016 Hartree, and use λ = 1511 Hartree, obtained using the bounds in [36]. We present our results in Fig. 2, illustrating the trade-off between the expected number of gates per circuit and the number of samples required. Since the Hamiltonian from [37] has 152 spin orbitals, each circuit uses 153 qubits.
We have presented our gate counts as C_gate controlled Pauli rotations, but asymptotically our circuits can typically be realised using ∼2C_gate Toffoli gates. For modest system sizes and a modest number of logical ancillae (∼40), the Toffoli count is ∼6C_gate (see Appendix F). The FeMoco resource estimate for the qDRIFT random compiler combined with phase estimation in [20, Appendix D] arrived at 10^16 Toffoli gates per sample, which is ∼10^4 times larger than the 2C_gate implied by the results in Fig. 2. Moreover, our rigorous analysis is likely loose and overestimates resources; for instance, more aggressive, though heuristic, Hamiltonian rescaling is justifiable and can further reduce costs (see Appendix G).
In this appendix, we work toward a non-asymptotic version of Lemma 1. Along the way, we provide various related approximation theory results with explicit constants. The main result is Theorem 3, which shows that the Fourier series F_{β,d}(·) can be made an arbitrarily good approximation to the Heaviside function Θ(x) on x ∈ [−π, π] by choosing appropriate values for the parameters β ∈ R_{>0} and d ∈ N. Here and throughout, I_n(·) denotes the n-th modified Bessel function of the first kind.

Chebyshev approximation
First, we construct an approximation P_{β,d}(·) to the Heaviside function in terms of Chebyshev polynomials; the properties of P_{β,d}(·) are characterised in Theorem 2 below. The following development strengthens the results in [38, Appendix A], featuring more direct proofs as well as tighter constants. Note that the methods and bounds from [38, Appendix A] were subsequently employed in, e.g., [32] and other works on quantum algorithms.
We start with the following Chebyshev approximation to the scaled error function erf(√(2β) x), which in turn approximates the sign function for large β (cf. Lemma 10). For β ∈ R_{>0} and d ∈ N, we define Q_{β,d}(·) as the degree-d truncation of the Chebyshev expansion of erf(√(2β) x), where T_n denotes the n-th Chebyshev polynomial of the first kind.
Proposition 3. For any β ∈ R_{>0} and d ∈ N, the truncation error max_{x∈[−1,1]} |Q_{β,d}(x) − erf(√(2β) x)| is bounded by the tail of the Chebyshev expansion.

Proof. From the Chebyshev expansion of the error function (Proposition 7), we see that taking d → ∞ in Q_{β,d} gives an exact expression for erf(√(2β) x). The claim then follows using max_{x∈[−1,1]} |T_n(x)| = 1 and the triangle inequality.

Proposition 3 shows that the error in using Q_{β,d}(·) to approximate the scaled error function depends on the infinite sum Σ_{j=d+1}^∞ I_j(β) of modified Bessel functions. In the next proposition, we bound this sum directly in order to obtain tighter results than those given by [38, Appendix A], which used loose bounds from the survey [39].
Proposition 4. For any β ∈ R_{>0}, d ∈ N, and integer t ≥ β, the tail sum Σ_{j=d+1}^∞ I_j(β) obeys the stated bound.

Proof. Starting from the expression for Σ_{j=d+1}^∞ I_j(β) given by Proposition 8, we bound the first term using a Chernoff bound, and the second term using the fact that the inner sum runs over fewer than half of the binomial coefficients. The remaining sum is an upper tail of the Poisson distribution with mean β, provided that t + 1 > β. In particular, it follows from [40, Corollary 6] that this tail is at most (eβ/t)^t e^{−β} for t ≥ β, and the claim follows.
We now define the function f(β, ε) for β ∈ R_{≥0} and ε ∈ (0, 1), where W(·) denotes the principal branch of the Lambert-W function. As shown in Proposition 9, f(β, ε) is the solution t to the equation (eβ/t)^t e^{−β} = ε under the constraint t > β.
Lemma 5 (Chebyshev approximation to the error function). For any β, ε_1, ε_2 ∈ R_{>0}, the stated approximation guarantee holds whenever d ≥ √(t w_{ε_1}) and t is any integer satisfying the constraints in Eq. (A7), with the function f(·, ·) defined as in Eq. (A5).

Proof. Combining Propositions 3 and 4, we have
for any integer t ≥ β. If we require that t ≥ β, then to bound the first term on the right-hand side of Eq. (A8) by ε_1, it suffices that d ≥ √(t W(8/(πε_1^2))) =: √(t w_{ε_1}). This in turn implies that √β/d ≤ √t/d ≤ 1/√(w_{ε_1}), so the second term on the right-hand side of Eq. (A8) is at most the claimed bound. The constraints on t in Eq. (A7) then follow from Proposition 9, together with the fact that Eq. (A8) holds only for t ≥ β.
For β ∈ R_{>0} and d ∈ N, we define P_{β,d}(x) as a linear combination of Chebyshev polynomials that approximates the Heaviside function Θ(x), with approximation error determined by β and d on a domain determined by β, as captured by the following theorem.

Fourier approximation
Next, we transform our polynomial approximation P_{β,d}(·) to the Heaviside function Θ(·) into a Fourier series F_{β,d}(·), using the observation that Θ(x) = Θ(sin x) for all x ∈ [−π, π]. For β ∈ R_{>0} and d ∈ N, we define F_{β,d}(x) := P_{β,d}(sin x). The following proposition allows us to extract the Fourier coefficients of F_{β,d}(·).
Proof. Using the trigonometric identity sin x = cos(π/2 − x) in conjunction with T_k(cos θ) = cos(kθ), we obtain the stated expansion; reorganising the sum in Eq. (A3) then gives the claim. It then follows from Lemma 6 that F_{β,d}(x) := P_{β,d}(sin x) has the form given in Eq. (A1), and we arrive at the following theorem.
Proof. Eq. (A12) follows from Eq. (A10) of Theorem 2. Specifically, since Θ(x) = Θ(sin x) for x ∈ [−π, π] and |sin x| ≥ sin δ for δ ≤ |x| ≤ π − δ, the claim follows.

Lemma 1 from the main text is a special case of Theorem 3, obtained by making the simple choice ε_1 = ε_2 = ε_3 = 2ε/3. For completeness, we provide the proof below. Note that in practice, one can numerically minimise d over the choice of ε_1, ε_2, ε_3, as we do in our numerical estimates in Sec. IV of the main text.

Technical lemmas
The following results are used in the approximation theory proofs above.
Proposition 7 (Chebyshev expansion of the error function). For any β ∈ R_{>0}, the stated expansion holds.

Proof. By definition of the error function, we rewrite erf(√(2β) x) as a Gaussian integral, changing variables in the second equality and substituting T_2(z) = 2z^2 − 1 to obtain the third equality. Next, we use the fact that for any x ∈ [−1, 1], the Jacobi-Anger identity gives an expansion in Chebyshev polynomials, where we use the composition identity T_m(T_n(z)) = T_{mn}(z) in the second line and the integration formula ∫ T_n(x) dx = (1/2)[T_{n+1}(x)/(n+1) − T_{n−1}(x)/(n−1)] in the third line, noting that T_n(0) = 0 for odd n.
Proposition 8. For any β ∈ R and d ∈ Z_{≥0}, the tail sum Σ_{j=d+1}^∞ I_j(β) can be rewritten as stated, where the first line follows from the definition of I_ℓ(·), the second from the change of variables j = 2m + ℓ, the third from exchanging the summations, and the fourth from re-indexing the inner sum with k = (j − ℓ)/2.

Proposition 9. For any β, ε ∈ R_{>0}, we have (eβ/t)^t e^{−β} ≤ ε for all t ≥ f(β, ε), where f(·, ·) is defined in Eq. (A5). Moreover, f(β, ε) > β for all ε < 1, and f(β, ε) = O(β + log(1/ε)).

Proof. For any β, the function g_β(t) := (eβ/t)^t e^{−β} reaches its global maximum of 1 at t = β. Therefore, if ε ≥ 1, Eq. (A15) holds for all t. The function g_β decreases monotonically past t = β, limiting to 0 as t → ∞. Hence, if ε < 1, we look for the t > β such that g_β(t) = ε. Taking logarithms and rearranging brings the equation into Lambert-W form; note that since ε < 1, the right-hand side of the resulting expression is always > −1/e. The solution such that t > β is then given by the principal branch of the Lambert-W function, which rearranges to give t = f(β, ε). Thus, f(β, ε) > β. Eq. (A17) follows from noting that since β > 0, Eq. (A15) holds for any t such that (eβ/t)^t ≤ ε, and then applying a loosened version of [32, Lemma 59].
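The function f(β, ε) can also be computed without special-function libraries by exploiting the monotonicity of g_β(t) for t > β established in Proposition 9. The following sketch solves g_β(t) = ε by bisection rather than via the Lambert-W closed form; the values of β and ε are arbitrary.

```python
import math

def g(beta, t):
    # g_beta(t) = (e*beta/t)^t * e^{-beta}, evaluated in log-space;
    # it peaks at 1 when t = beta and decreases monotonically for t > beta
    return math.exp(t * (1 + math.log(beta) - math.log(t)) - beta)

def f(beta, eps):
    # solve g_beta(t) = eps for t > beta by bisection
    lo, hi = beta, beta + 1.0
    while g(beta, hi) > eps:      # grow the bracket until g drops below eps
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if g(beta, mid) > eps:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

beta, eps = 5.0, 1e-3
t = f(beta, eps)
print(t > beta, abs(g(beta, t) - eps) < 1e-6)
```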
When we defined the CDF C(·) in Eq. (2), we chose the normalisation factor τ = π/(2λ + ∆), which implies that the corresponding probability density function p(·) is supported within the interval [−τλ, τλ]. Hence, we can apply Proposition 12 for any admissible δ. This shows that the approximate CDF C̃(x) = (p * F)(x) from Eq. (6) satisfies the guarantees in Eq. (5) for all ε ∈ R_{>0} and δ ∈ (0, λτ], provided that we use the appropriate Fourier series F(·) from Lemma 1 (with β and d in Eq. (A1) chosen appropriately in terms of δ and ε), as claimed in the main text.

In this appendix, we prove Lemma 2 by constructing a particular decomposition of the time evolution operator e^{iĤt} into a linear combination of unitaries (LCU). We then provide an algorithm, Algorithm 2, for efficiently sampling a unitary from this decomposition with probability proportional to its coefficient.
Proof of Lemma 2. By assumption, we have Ĥ = Σ_ℓ p_ℓ P_ℓ, where p_ℓ > 0 for all ℓ, Σ_ℓ p_ℓ = 1, and each P_ℓ is a Pauli operator. We write e^{iĤt} = (e^{iĤt/r})^r, and observe that if each e^{iĤt/r} has an LCU decomposition Σ_m c_m W_m, then the r-fold product of this decomposition is an LCU decomposition for e^{iĤt}, with total weight (Σ_m |c_m|)^r. Furthermore, note that we can sample a unitary U_k with probability proportional to |b_k| by independently sampling r unitaries W_{m_1}, . . . , W_{m_r} according to the distribution given by {|c_m|}_m, and implementing their product W_{m_1} · · · W_{m_r}. We construct the following decomposition for e^{iĤt/r}. Letting x := t/r, we have
The non-Clifford gate complexity r can in fact be chosen to be arbitrarily small, at the cost of an exponentially large sample complexity. Specifically, note that in the proof, Σ_m |c_m| can also be bounded as Σ_m |c_m| ≤ exp(x) = exp(t/r), which gives Σ_k |b_k| ≤ exp(t). An exponentially large sample complexity in the limit of zero non-Clifford gate complexity is consistent with the fact that Clifford circuits can be efficiently simulated classically [43,44].
The proof immediately gives the following algorithm for sampling from the decomposition from Lemma 2.
Algorithm 2 Efficiently sample from the LCU decomposition of e^{iĤt} from Lemma 2. Input: A Hamiltonian Ĥ = Σ_{ℓ=1}^L p_ℓ P_ℓ specified as a convex combination of Pauli operators, a real number t, a positive integer r. Output: Description of a random unitary U, such that E[U] ∝ e^{iĤt} and U can be implemented using r controlled single-qubit Pauli rotations.
As presented, Step 3 of Algorithm 2 samples from an infinite distribution, over all even positive integers. However, the probability of sampling an integer n decreases super-exponentially with n, so a very precise approximation can be made by truncating to only the first few terms in the distribution, as analysed in Appendix E 2.
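To illustrate why truncation is benign, the sketch below uses weights decaying like x^{2n}/(2n)!, an invented stand-in with the same super-exponential decay (the exact distribution of Algorithm 2 is given in Appendix C), and reports the tail mass left beyond the first few terms.

```python
import math

# illustrative weights over the even-order terms n = 1, 2, 3, ...;
# super-exponential decay means almost all mass sits in the first few terms
x = 0.5
raw = [x ** (2 * n) / math.factorial(2 * n) for n in range(1, 20)]
total = sum(raw)
probs = [w / total for w in raw]

# mass NOT captured when truncating to the first k terms
for k in (1, 2, 3):
    print(k, 1 - sum(probs[:k]))
```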
Proof. For |x| ≤ 1, the sum of the first three terms in the series on the left-hand side is at most 1 + x^2 + (5/72)x^4 + (1/1200)x^6 ≤ 1 + x^2 + x^4/2, using √(1 + a^2) ≤ 1 + a^2/2, x^6 ≤ x^4, and 5/72 + 1/1200 < 1/2, while for all n ≥ 3, the n-th term is at most x^{2n}/n!. Hence, we find that the left-hand side is at most Σ_n x^{2n}/n! = e^{x^2} for all |x| ≤ 1. For |x| > 1, the left-hand side is bounded by Σ_n |x|^n/n! = e^{|x|} < e^{x^2}, using √(1 + a^2) ≤ 1 + |a| to obtain the first inequality.

In this appendix, we show how to optimise the complexity of our algorithm with respect to the runtime vector r. Recall that in Algorithm 1, we use the LCU decomposition constructed in the proof of Lemma 2 as our decomposition for each e^{iHt_j}, leading to Eq. (8). Importantly, for each j ∈ S_1, we are free to choose any positive integer r_j when applying Lemma 2. The total weight of the coefficients of the resulting decomposition then follows from Lemma 2, and the complexity of the controlled version of each unitary in the decomposition of e^{iHt_j} is that of r_j controlled single-qubit Pauli rotations. Specifically, from Eqs. (9), (11), and (12), we have the expressions for A(r), C_sample(r), and C_gate(r). Note that for each j, the value of µ_j implicitly depends on r_j. Also, observe that we can in fact exclude all indices j with t_j = 0 from the sums in Eqs. (D1) and (D3): tr[ρ e^{i·0·H}] = tr[ρ] = 1 does not have to be estimated, so we never need to sample these indices when implementing Algorithm 1. Therefore, throughout this section, all sums are implicitly over S_1 \ {j : t_j = 0}.

When finding the optimal runtime vector r = (r_j)_{j∈S_1} ∈ N^{S_1}, which determines the sample and gate complexities, we consider two different goals: 1. finding r that minimises the expected total gate complexity C_total(r) = 2C_sample(r) · C_gate(r) (Appendix D 1); 2. minimising the sample complexity given an upper bound on the expected gate complexity per sample, i.e., given a number g ∈ R_{>0}, finding r that minimises C_sample(r) subject to the constraint C_gate(r) ≤ g (Appendix D 2).
Our Algorithm 1 uses the specific Fourier series of Lemma 1, whose coefficients are specified in Appendix A, but we note that the results presented in this section apply to any set {F_j}_{j∈S_1} of Fourier coefficients. We solve both problems approximately, where the sources of approximation are as follows.
• We use approximate expressions for the complexities, replacing µ_j with its analytic upper bound u_j (Eq. (D5)) in the formulae for C_sample(r), C_gate(r), and C_total(r).
• We ignore the fact that C sample ( r) must be an integer, i.e., we remove the ceiling in Eq. (D2).
• In general, the r_j's that minimise the approximate expressions are real numbers, whereas in the context of our algorithm, the r_j's are required to be integers. We simply round the approximate r_j's to the nearest integer.
After determining the r that optimises these approximate expressions, we round each entry and substitute the resulting vector into the exact expressions, Eqs. (D2) and (D3), in our numerical calculations. This gives complexities that are valid and exact, but only near-optimal in general. However, the upper bound u_j for µ_j becomes tight for small t_j/r_j, and the r_j's are typically large enough that the rounding error is negligible, so we expect these complexities to be close to the optimal solutions. Note that since C_gate(r) does not include the cost of preparing the ansatz state ρ, in this section we consider only the case where the state preparation complexity is small relative to the cost of implementing the random unitaries from Lemma 2. However, the same techniques can be straightforwardly adapted to incorporate state preparation costs into the optimisation.

Minimising the total complexity
Making the approximations discussed above, we obtain an approximate objective c(r) for the total complexity, where we have used µ_j ≤ u_j for all j (cf. Eq. (D4)).
The following proposition reduces the minimisation of c( r), and hence C total ( r), to a simple one-dimensional optimisation problem.
2. Minimising the sample complexity given constraints on gate complexity

Above, we considered minimising C_total(r) = C_sample(r) · C_gate(r), with no constraints on C_sample or C_gate. In some scenarios, it may be useful to consider the following more constrained problem: impose an upper bound on C_gate(r), the expected gate complexity per Hadamard test; what choice of runtime vector r then gives the minimum sample complexity C_sample(r)? This problem is well motivated in, e.g., the context of "early fault-tolerance", where it might be advantageous to run a larger number of shorter circuits, even if this increases the total complexity [6].
We approximate C_sample(r) and C_gate(r) in terms of u_j, defined in Eq. (D5), and S(·), defined in Eq. (D7). It is clear that allowing for a larger C_gate decreases the minimum C_sample required, so we replace our upper limit on C_gate (an inequality constraint) by an equality constraint.
Proposition 15. For any F_j ∈ C \ {0} and t_j ∈ R \ {0} for j ∈ S_1 \ {0}, let u_j and S(·) be defined as in Eqs. (D5) and (D7). For any g > 0, define the function R^(g). Then, the minimum of f(r) := Σ_j |F_j| u_j subject to S(r) = g and r > 0 is attained at r^(g)_* with entries r^(g)_{*,j} = R^(g)_j(λ*), where λ* is the solution to Eq. (D10).

Proof. The constraint S(r) = g is equivalent to Σ_j |F_j| u_j r_j = g Σ_j |F_j| u_j, so using the method of Lagrange multipliers, we solve ∇L(r^(g)_*, λ*) = 0 for r^(g)_* and λ*, where L is the corresponding Lagrangian. This gives r^(g)_{*,j} = R^(g)_j(λ*) for all j (under the requirement that r^(g)_* > 0) and S(r^(g)_*) = g, which, when rewritten in terms of the function R^(g), gives the result.
Thus, we can find all of the r^(g)_{*,j}'s by solving a one-dimensional problem (namely, solving Eq. (D10) for λ*) and then simply evaluating R^(g)_j(λ*) for each j. An analogous strategy can be used to find the r that minimises the expected gate complexity given a fixed upper bound on the sample complexity.
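To illustrate this strategy, the sketch below uses hypothetical stand-ins for R^(g)_j and the constraint functional S (the true expressions are those of Eqs. (D5)-(D10), which are not reproduced here): solve a one-dimensional equation for λ* by bisection, evaluate each coordinate at λ*, and round to integers as described earlier in this appendix.

```python
import math

# Hypothetical per-term data: weights |F_j| and stand-ins w_j for the u_j's.
F = [0.5, 0.3, 0.2]
w = [2.0, 1.0, 4.0]
g = 10.0  # target value of the constraint S(r) = g

def r_of_lam(lam):
    # Stand-in for R^(g)_j(lambda): each coordinate decreases with the
    # Lagrange multiplier lambda (illustrative functional form only).
    return [math.sqrt(Fj * wj / lam) for Fj, wj in zip(F, w)]

def S(r):
    # Stand-in constraint functional (illustrative only):
    # sum_j |F_j| w_j r_j / sum_j |F_j| w_j.
    num = sum(Fj * wj * rj for Fj, wj, rj in zip(F, w, r))
    den = sum(Fj * wj for Fj, wj in zip(F, w))
    return num / den

# One-dimensional problem: solve S(r(lambda)) = g for lambda by bisection
# (S decreases as lambda grows, so the wide bracket below suffices).
lo, hi = 1e-9, 1e9
for _ in range(200):
    mid = math.sqrt(lo * hi)
    if S(r_of_lam(mid)) > g:
        lo = mid
    else:
        hi = mid
lam_star = math.sqrt(lo * hi)

# Evaluate each coordinate at lambda*, then round to the nearest integer.
r_star = [round(rj) for rj in r_of_lam(lam_star)]
```

The same pattern (bisection on the multiplier, then coordinate-wise evaluation) applies whatever the true functional forms are, as long as S is monotone in λ along the curve R^(g)(λ).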
In this section, we fill in the details for the proof that Algorithm 1 outputs an incorrect answer with probability at most ϑ, and analyse the effect of truncating the infinite distribution in Algorithm 2 to a small number of terms.

Failure probability of Algorithm 1
Let all quantities be defined as in Algorithm 1. Recall that by Eq. (5), C(x) < η − ε would imply that C(x − δ) < η, while C(x) > ε would imply that C(x + δ) > 0, so we can solve Problem 1 by deciding between C(x) < η − ε and C(x) > ε, outputting either if both are true. To estimate C(x), Steps 5-7 of Algorithm 1 sample from a decomposition of the form C(x) = Σ_{j,k} a_{jk} tr[ρ U_{jk}], yielding a random variable Z with E[Z] = C(x). Step 8 of Algorithm 1 computes the average Z̄ of C_sample(r) independent samples of Z, and compares its real part to η/2, guessing that C(x) < η − ε if Re(Z̄) < η/2 and that C(x) > ε otherwise. (Here, we write e.g. Z̄ < a as shorthand for Re(Z̄) < a.) Thus, the probability of error is bounded as

P(error) ≤ max{ P(Z̄ < η/2 | C(x) = η − ε), P(Z̄ ≥ η/2 | C(x) = ε) }. (E1)

Since E[Z] = C(x), we see that conditioned on C(x) = η − ε, the event Z̄ < η/2 requires Z̄ to deviate below its mean by at least η/2 − ε. Hence, using the fact that |Z| = √2 A(r), so that Re(Z) is contained in the interval [−√2 A(r), √2 A(r)], Hoeffding's inequality gives

P(Z̄ < η/2 | C(x) = η − ε) ≤ exp(−C_sample(r) (η/2 − ε)² / (4 A(r)²)),

and the right-hand side is at most ϑ when C_sample(r) is chosen as in Eq. (11) of Algorithm 1. Likewise, P(Z̄ ≥ η/2 | C(x) = ε) ≤ ϑ, as claimed.
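As a concrete sketch, the function below returns the smallest number of samples for which the Hoeffding bound is at most ϑ, given that Re(Z) lies in [−√2 A, √2 A]. This is our reading of the role of Eq. (11); the exact constant there may differ.

```python
import math

def sample_complexity(A, eta, eps, theta):
    # Re(Z) lies in [-sqrt(2)A, sqrt(2)A], so Hoeffding's inequality gives
    # P(deviation >= eta/2 - eps) <= exp(-N (eta/2 - eps)^2 / (4 A^2)).
    # Return the smallest N making this probability at most theta.
    assert eta / 2 > eps, "need eta/2 > eps to separate the two cases"
    return math.ceil(4 * A**2 * math.log(1 / theta) / (eta / 2 - eps) ** 2)
```

Note the quadratic dependence on A(r): reducing the total LCU weight directly reduces the number of Hadamard tests required.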

Truncating the infinite distribution
The linear decomposition of e^{iĤt} into unitaries constructed in the proof of Lemma 2 in Appendix C contains infinitely many unitaries, because the index n in the sum in Eq. (C4) ranges over all even, non-negative integers. In practice, instead of sampling from a distribution over infinitely many integers in Step 3 of Algorithm 2, one could truncate the sum in Eq. (C4) at some order M and modify Step 3 accordingly. Since the coefficients in Eq. (C4) decay very rapidly with n, the effect of this truncation is negligible for modest values of M.
To make this rigorous, we consider the random variable Z′ that results from sampling from the truncated distribution, and examine how this changes the Hoeffding's-inequality analysis of the previous subsection. The total weight of the truncated LCU will be less than A(r), so Re(Z′) is still bounded in the interval [−√2 A(r), √2 A(r)]. Unlike Z, however, Z′ is not in general an unbiased estimator for C(x); instead, E[Z′] = C(x) + B for some bias B. Consequently, defining Z̄′ to be the average of C_sample(r) independent samples of Z′, the analogue of Eq. (E1) holds with the available deviation reduced by |B| (using Z̄′ as shorthand for Re(Z̄′) where it is clear from context), leading to

P(Z̄′ < η/2 | C(x) = η − ε) ≤ exp(−C_sample(r) (η/2 − ε − |B|)² / (4 A(r)²))

by Hoeffding's inequality, and similarly for P(Z̄′ ≥ η/2 | C(x) = ε). Therefore, it suffices to replace the (η/2 − ε)^{−2} factor in the definition of C_sample(r) in Eq. (11) by (η/2 − ε − |B|)^{−2}. Hence, it remains to bound the bias B. We show in Theorem 4 below that |B| decreases superexponentially with the truncation order M. For this, we introduce some extra notation for convenience. From the proof of Lemma 2, we have, for each j ∈ S_1, an LCU expansion of e^{iĤt_j/r_j} in terms of operators A^(j)_n. Since Ĥ = Σ_ℓ p_ℓ P_ℓ is assumed to be a convex combination of Pauli operators, each A^(j)_n is a convex combination of unitaries. Note also that in this notation, the total weight µ_j of the coefficients in the LCU decomposition for e^{iĤt_j} = (e^{iĤt_j/r_j})^{r_j} can be expressed in terms of the coefficients of the A^(j)_n's for each j. Then, we have

Theorem 4. For any F_j ∈ C, t_j ∈ R, and r ∈ N^{S_1} such that r_j ≥ |t_j| for all j, let Z and Z′ be defined as above, and let A(r) and C_gate(r) be defined as in Eqs. (D1) and (D3). Then, for any γ > 0, taking the truncation order M sufficiently large, with the threshold expressed via W(·), the principal branch of the Lambert-W function, ensures that |B| ≤ γ.
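The bias adjustment described above can be sketched as follows, assuming (our reading, not reproduced here) that Eq. (11) sets the sample count via Hoeffding's inequality with range [−√2 A, √2 A]: the deviation available to the concentration bound shrinks from η/2 − ε to η/2 − ε − |B|.

```python
import math

def sample_complexity_biased(A, eta, eps, B, theta):
    # With truncation bias |B|, the deviation available to Hoeffding's
    # inequality shrinks from eta/2 - eps to eta/2 - eps - |B|.
    gap = eta / 2 - eps - abs(B)
    assert gap > 0, "truncation bias too large for the chosen eta, eps"
    return math.ceil(4 * A**2 * math.log(1 / theta) / gap**2)
```

Since |B| decays superexponentially with M (Theorem 4), a modest truncation order makes this adjustment essentially free.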
where Z_0 is a Pauli Z acting on the control qubit. Since {1 ⊗ P, −Z_0 ⊗ P} is a commuting and independent set of Pauli operators, there exists a Clifford C such that, under conjugation by C, we have {1 ⊗ P, −Z_0 ⊗ P} → {Z_1, Z_2}. Thus, we find that each controlled-Pauli rotation can be implemented, up to Clifford conjugation, by two single-qubit Z rotations, leading to an extra factor of 2 in non-Clifford complexity. Second, each single-qubit Z rotation of the form exp(i(θ/2)Z_j) can be compiled into the Clifford+T gate set [45][46][47][48][49], though this typically increases gate counts by a factor O(log(1/ϵ)) to achieve synthesis precision ϵ. For instance, the Ross-Selinger synthesis algorithm has an overhead of 3 log_2(1/ϵ) + O(log log(1/ϵ)). For ϵ = 10^{−10}, this gives a ∼100× overhead, which can be reduced to ∼50× using randomised compilation of the Ross-Selinger algorithm [50]. Therefore, the expected T-count per sample is upper bounded by ∼100 C_gate(r).
However, this large ∼100× constant factor can be reduced by using smarter compilation strategies. In our algorithm, the vast majority of gates are sampled from the leading-order terms in Eq. (7), which all have the same rotation angle. Let us assume that in our algorithm we have a subset of controlled Pauli rotations {P_1, P_2, P_3, . . . , P_w}, all by the same angle. Furthermore, for chemistry problems these will typically be independent (there are no combinations that multiply to form the identity), and we assume they all commute within this subset (we address the validity of this assumption later). Then, the set of w controlled Pauli rotations becomes a sequence of Pauli rotations with respect to the set {1 ⊗ P_1, −Z_0 ⊗ P_1, . . . , 1 ⊗ P_w, −Z_0 ⊗ P_w}, which is a set of 2w commuting and independent Pauli operators. For such a set of operators, there will exist a Clifford rotation C that maps this set to {Z_1, Z_2, . . . , Z_{2w}}. Therefore, the w controlled-Pauli rotations can be realised by C [∏_{j=1}^{2w} exp(i(θ/2)Z_j)] C†. The special structure of ∏_{j=1}^{2w} exp(i(θ/2)Z_j) allows the use of Hamming weight phasing [10,51,52]. Notice that ∏_{j=1}^{2w} exp(i(θ/2)Z_j) |x⟩ = exp(iθ(w − |x|)) |x⟩, where |x| is the Hamming weight of the bit-string x over the indices 1 to 2w. Therefore, the key idea of Hamming weight phasing is to produce the phase exp(iθ(w − |x|)) by first mapping |x⟩ → |x⟩||x|⟩, where ||x|⟩ is a register of ⌈log_2(2w)⌉ qubits storing a binary representation of the integer |x|. We then need only ⌈log_2(2w)⌉ Pauli-Z rotations acting on the ||x|⟩ register. The cost of calculating the Hamming weight of 2w bits is upper bounded by 2w Toffoli gates (and ancilla qubits). For example, costing each Pauli rotation at 50 T gates, equivalently 25 Toffoli gates using catalysis, the total Toffoli cost for the w rotations is then C_{w−Rot} = 2w + 25⌈log_2(2w)⌉.
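The phase identity underlying Hamming weight phasing can be checked numerically on computational basis states: a product of 2w equal-angle Z rotations applies the phase e^{iθ(w−|x|)} to |x⟩.

```python
import cmath
import itertools

theta, w = 0.7, 3  # illustrative angle and block size (2w = 6 qubits)
n_checked = 0
for x in itertools.product([0, 1], repeat=2 * w):
    # Z|0> = +|0> and Z|1> = -|1>, so exp(i(theta/2)Z_j) contributes the
    # phase exp(+i theta/2) for bit 0 and exp(-i theta/2) for bit 1.
    phase = 1.0
    for bit in x:
        phase *= cmath.exp(1j * (theta / 2) * (1 if bit == 0 else -1))
    target = cmath.exp(1j * theta * (w - sum(x)))  # exp(i theta (w - |x|))
    assert abs(phase - target) < 1e-12
    n_checked += 1
```

The check exhausts all 2^{2w} basis states; since both sides are diagonal in the computational basis, this verifies the operator identity.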
We plot the Toffoli cost per gate, C_{w−Rot}/w, in Fig. 3. For large w, the Toffoli cost per gate approaches 2. For finite w, at w = 100 we need ∼4 Toffoli per gate, and at w = 40 we need ∼6 Toffoli per gate. Assuming the randomly sampled unitaries are dominated by long commuting sequences of the form {P_1, P_2, . . . , P_w} with large w (e.g., w ≳ 100) and identical rotation angles, this supports the claim in the main text that, roughly, the Toffoli cost scales as 2 C_gate. Using a more modest number of ancillas (e.g., w ∼ 40), the Toffoli cost scales as 6 C_gate.
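For concreteness, a small sketch of this cost estimate, using 2w Toffolis for the Hamming weight computation and 25 Toffolis (50 T gates, via catalysis) per residual rotation on the weight register; under this reading, the numbers reproduce the ∼4 and ∼6 Toffoli-per-gate figures quoted above.

```python
import math

def toffoli_cost(w):
    # 2w Toffolis to compute the Hamming weight of 2w bits, plus
    # ceil(log2(2w)) rotations on the weight register at 25 Toffolis each.
    return 2 * w + 25 * math.ceil(math.log2(2 * w))

def toffoli_per_gate(w):
    # Cost amortised over the w controlled rotations in the block.
    return toffoli_cost(w) / w
```

As w grows, the 2w Hamming-weight term dominates the logarithmic rotation term, so the per-gate cost tends to 2 Toffolis.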
Of course, there is a small but nonzero probability that we sample higher-order terms (n > 0) with rotation angles θ_n ≠ θ_0 (cf. Eq. (C5)). We can perform these controlled rotations using standard, though expensive, circuit synthesis instead of Hamming weight phasing. However, these rotations occur far less often than one in every 100 gates, so this extra expense is relatively negligible. There is also a finite probability that a Pauli rotation exp(iθP_j) is followed by a sample P_k that does not commute with P_j. However, there are O(N^4) Pauli operators in the Hamiltonian, and for any given P_j there are only O(N^3) non-commuting Pauli terms. With each Pauli equally weighted, the probability of selecting a non-commuting operator is O(1/N). Therefore, for large enough N, we expect to encounter many long sequences of commuting rotations that enable the use of Hamming weight phasing. For these reasons, we stress that the 2C_gate to 6C_gate Toffoli count claims are rough, asymptotic estimates. A more detailed analysis of the finite-N statistics and pre-asymptotics is beyond the scope of this work.

So far, we have chosen the value of τ in a rigorous manner. However, reduced resource overheads can be obtained via heuristic modifications of the value used for τ. The CDF is inferred from expectations of Tr[ρe^{ijHτ}], and so, to avoid ambiguity due to the periodicity of this exponential, Proposition 12 required that τ be chosen small enough that p(x) is supported within the interval x ∈ [−(π−δ)/2, (π−δ)/2]. Recall that if the state ρ is supported on eigenstates with eigenvalues in the range [−E, +E] for some E, then p(x) is supported on x ∈ [−τE, τE]. It follows that Proposition 12 can be employed whenever

τE ≤ (π − δ)/2. (G1)

Using δ = τ∆ and simplifying, this equates to

τ ≤ π/(2E + ∆). (G2)

In Appendix B and throughout the main text, we used that E ≤ ‖H‖ ≤ λ. However, this analysis is overly pessimistic, and we could often set τ = π/(2λ/b + ∆)
for some b > 1 without any significant problems, as we explain below. First, the bound ‖H‖ ≤ λ is typically very loose for frustrated systems. If we know ‖H‖, then we can determine a new range of safe values for τ and therefore b. However, calculating ‖H‖ is computationally hard, and so its value is typically unknown; hence the use of b = 1 for our rigorous theorem statements.
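A minimal sketch of this time-step choice, with `lam` standing for λ and `delta_gap` for ∆ (names are ours): b = 1 recovers the rigorous bound of Eq. (G2) with E replaced by λ, while b > 1 is the heuristic relaxation.

```python
import math

def safe_tau(lam, delta_gap, b=1.0):
    # tau = pi / (2*lambda/b + Delta); b = 1 reproduces the rigorous choice
    # tau <= pi / (2*lambda + Delta), while b > 1 relaxes it heuristically,
    # allowing a larger time step and hence shorter circuits.
    return math.pi / (2 * lam / b + delta_gap)
```

Larger b gives larger τ, and therefore fewer repetitions of the short-time evolution per Hadamard test, at the risk of aliasing if p(x) has weight outside the safe interval.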
Second, the assumptions of Proposition 12 can be relaxed, with a similar proof going through. That is, let us assume that p(x) is mostly supported on the interval x ∈ [−(π − δ)/2, (π − δ)/2], so that the support outside this interval has total weight no more than ϵ. Then one could derive a similar result to Proposition 12 at the price of an extra ϵ in the additive error bounds on the CDF. Provided ϵ is small compared to η, this extra error could be accommodated by a slight tuning of the algorithm parameters. When will this assumption on p(x) hold? The initial state ρ is typically taken to be an approximation of the ground state, so it will have very low energy Tr[ρH], close to the ground state energy. Indeed, the ground state energy, and also Tr[ρH], could be several orders of magnitude smaller than ‖H‖. When Tr[ρH] ≪ ‖H‖, ρ cannot have large overlap with high-energy eigenstates. Thus, in practice, it will often be safe to set b such that b > λ/‖H‖ (but not very much larger), since then any support of p(x) outside [−(π−δ)/2, (π−δ)/2] will be relatively small.