Information-theoretic bounds on quantum advantage in machine learning

We study the performance of classical and quantum machine learning (ML) models in predicting outcomes of physical experiments. The experiments depend on an input parameter $x$ and involve execution of a (possibly unknown) quantum process $\mathcal{E}$. Our figure of merit is the number of runs of $\mathcal{E}$ required to achieve a desired prediction performance. We consider classical ML models that perform a measurement and record the classical outcome after each run of $\mathcal{E}$, and quantum ML models that can access $\mathcal{E}$ coherently to acquire quantum data; the classical or quantum data is then used to predict outcomes of future experiments. We prove that for any input distribution $\mathcal{D}(x)$, a classical ML model can provide accurate predictions on average by accessing $\mathcal{E}$ a number of times comparable to the optimal quantum ML model. In contrast, for achieving accurate prediction on all inputs, we prove that exponential quantum advantage is possible. For example, to predict expectations of all Pauli observables in an $n$-qubit system $\rho$, classical ML models require $2^{\Omega(n)}$ copies of $\rho$, but we present a quantum ML model using only $\mathcal{O}(n)$ copies. Our results clarify where quantum advantage is possible and highlight the potential for classical ML models to address challenging quantum problems in physics and chemistry.


I. INTRODUCTION
The widespread applications of machine learning (ML) to problems of practical interest have fueled interest in machine learning using quantum platforms [20,47,79]. Though many potential applications of quantum ML have been proposed, so far the prospect for quantum advantage in solving purely classical problems remains unclear [10, 41,85,86]. On the other hand, it seems plausible that quantum ML can be fruitfully applied to problems faced by quantum scientists, such as characterizing the properties of quantum systems and predicting the outcomes of quantum experiments [6, 28,30,40,67,81,88].
Here we focus on an important class of learning problems motivated by quantum mechanics. Namely, we are interested in predicting functions of the form

f(x) = tr(O E(|x⟩⟨x|)),    (1)

where x is a classical input, E is an arbitrary (possibly unknown) completely positive and trace-preserving (CPTP) map, and O is a known observable. Equation (1) encompasses any physical process that takes a classical input and produces a real number as output.
The goal is to construct a function h(x) that accurately approximates f(x) after accessing the physical process E as few times as possible.
A particularly important special case of setup (1) is training an ML model to predict what would happen in physical experiments [67]. Such experiments might explore, for instance, the outcome of a reaction in quantum chemistry [96], ground state properties of a novel molecule or material [17,27,40,59,73,74,92], or the behavior of neutral atoms in an analog quantum simulator [19,25,63]. In these cases, the input x subsumes parameters that characterize the process, e.g., chemicals involved in the reaction, a description of the molecule, or the intensity of lasers that control the neutral atoms. The map E characterizes a quantum evolution happening in the lab. Depending on the parameter x, it produces the quantum state E(|x⟩⟨x|). Finally, the experimentalist measures a certain observable O at the end of the experiment. The goal is to predict the measurement outcome for new physical experiments, with new values of x that have not been encountered during the training process.
Motivated by these concrete applications, we want to understand the power of classical and quantum ML models in predicting functions of the form given in Equation (1). On the one hand, we consider classical ML models that can gather classical measurement data of the form {(x_i, o_i)}_{i=1}^{N_C}, where o_i is the outcome when we perform a POVM measurement on the state E(|x_i⟩⟨x_i|). We denote by N_C the number of such experiments performed during training in the classical ML setting. On the other hand, we consider quantum ML models in which multiple runs of the CPTP map E can be composed coherently to collect quantum data, and predictions are produced by a quantum computer with access to the quantum data. We denote by N_Q the number of times E is used during training in the quantum setting. The classical and quantum ML settings are illustrated in Figure 1.
We focus on the question of whether quantum ML can have a large advantage over classical ML: to achieve a small prediction error, can the optimal N Q in the quantum ML setting be much less than the optimal N C in the classical ML setting? For the purpose of this comparison, we disregard the runtime of the classical or quantum ML models that generate the predictions; we are only interested in how many times the process E must run during the learning phase in the quantum and classical settings.
Our first main result addresses small average prediction error, i.e., the prediction error |h(x) − f(x)|^2 averaged over some specified input distribution D(x). We rigorously show that, for any E, O, and D, and for any quantum ML model, one can always design a classical ML model achieving a similar average prediction error such that N_C is larger than N_Q by at worst a small polynomial factor. Hence, there is no exponential advantage of quantum ML over classical ML if the goal is to achieve a small average prediction error, and if the efficiency is quantified by the number of times E is used in the learning process. This statement holds for existing quantum ML models running on near-term devices [47,53,79] and future quantum ML models yet to be conceived. We note, though, that while there is no large advantage in query complexity, a substantial quantum advantage in computational complexity is possible [80].

Figure 1: (Left) In the learning phase of the classical ML setting, a measurement is performed after each query to E; the classical measurement outcomes collected during the learning phase are consulted during the prediction phase. (Right) In the learning phase of the quantum ML setting, multiple queries to E may be included in a single coherent quantum circuit, yielding an output state stored in a quantum memory; this stored quantum state is consulted during the prediction phase.

However, the situation changes if the goal is to achieve a small worst-case prediction error rather than a small average prediction error: an exponential separation between N_C and N_Q becomes possible if we insist on predicting f(x) = tr(O E(|x⟩⟨x|)) accurately for every input x. We illustrate this point with an example: accurately predicting expectation values of Pauli observables in an unknown n-qubit quantum state ρ. This is a crucial subroutine in many quantum computing applications; see, e.g., [34,52,54,56-58,61,74]. We present a quantum ML model that uses N_Q = O(n) copies of ρ to predict expectation values of all n-qubit Pauli observables. In contrast, we prove that any classical ML model requires N_C = 2^{Ω(n)} copies of ρ to achieve the same task, even if the ML model can perform arbitrary adaptive single-copy POVM measurements.

II. MACHINE LEARNING SETTINGS
We assume that the observable O (with ‖O‖ ≤ 1) is known and that the physical experiment E is an unknown CPTP map belonging to a set of CPTP maps F. Apart from the promise E ∈ F, the process can be arbitrary, a common assumption in statistical learning theory [12,15,21,87,89]. For the sake of concreteness, we assume that E is a CPTP map from a Hilbert space of n qubits to a Hilbert space of m qubits. Regarding inputs, we consider bit-strings of size n: x ∈ {0,1}^n. This is not a severe restriction, since floating-point representations of continuous parameters can always be truncated to a finite number of digits. We now give precise definitions for the classical and quantum ML settings; see Fig. 1 for an illustration.
a. Classical (C) ML: The ML model consists of two phases: learning and prediction. During the learning phase, a randomized algorithm selects classical inputs x_i, and we perform a (quantum) experiment that yields an outcome o_i from a POVM measurement on E(|x_i⟩⟨x_i|). A total of N_C experiments give rise to the classical training data {(x_i, o_i)}_{i=1}^{N_C}. After obtaining this training data, the ML model executes a randomized algorithm A to learn a prediction model specified by classical information s_C = A({(x_i, o_i)}_{i=1}^{N_C}), where s_C is stored in the classical memory. In the prediction phase, a sequence of new inputs x̃_1, x̃_2, … ∈ {0,1}^n is provided. The ML model uses s_C to evaluate predictions h_C(x̃_1), h_C(x̃_2), … that approximate f(x̃_1), f(x̃_2), … up to small errors.

b. Restricted classical ML: We also consider a restricted version of the classical setting. Rather than performing arbitrary POVM measurements, the ML model is restricted to measuring the target observable O on the output state E(|x_i⟩⟨x_i|) to obtain the measurement outcome o_i. In this case, we always have o_i ∈ R and E[o_i] = tr(O E(|x_i⟩⟨x_i|)).
c. Quantum (Q) ML: During the learning phase, the model starts with an initial state ρ_0 in a Hilbert space of arbitrarily high dimension. Subsequently, the quantum ML model accesses the unknown CPTP map E a total of N_Q times. These queries are interleaved with quantum data processing steps:

ρ_E = C_{N_Q}((E ⊗ I)(⋯ C_1((E ⊗ I)(ρ_0)) ⋯)),

where each C_i is an arbitrary but known CPTP map, and we write E ⊗ I to emphasize that E acts on an n-qubit subsystem of a larger quantum system. The final state ρ_E, encoding the prediction model learned from the queries to the unknown CPTP map E, is stored in a quantum memory. In the prediction phase, a sequence of new inputs x̃_1, x̃_2, … ∈ {0,1}^n is provided. A quantum computer with access to the stored quantum state ρ_E executes a computation to produce prediction values h_Q(x̃_1), h_Q(x̃_2), … that approximate f(x̃_1), f(x̃_2), … up to small errors.¹

The quantum ML setting is strictly more powerful than the classical ML setting. During the prediction phase, classical ML models are restricted to processing classical data, albeit data obtained by measuring a quantum system during the learning phase. In contrast, quantum ML models can work directly with the quantum data and perform quantum data processing. A quantum ML model can have an exponential advantage over classical ML models for some tasks, as we demonstrate in Sec. IV.

III. AVERAGE-CASE PREDICTION ERROR
For a prediction model h(x), we consider the average-case prediction error

∑_{x∈{0,1}^n} D(x) |h(x) − tr(O E(|x⟩⟨x|))|^2,

with respect to a fixed distribution D over inputs. This could, for instance, be the uniform distribution. Although learning from quantum data is strictly more powerful than learning from classical data, there are fundamental limitations. The following rigorous statement limits the potential for quantum advantage.
Theorem 1. Fix a probability distribution D over n-bit strings, an m-qubit observable O (‖O‖ ≤ 1), and a set F of CPTP maps with n input qubits and m output qubits. Suppose there is a quantum ML model which accesses the map E ∈ F a total of N_Q times, producing with high probability a function h_Q(x) that achieves

∑_{x∈{0,1}^n} D(x) |h_Q(x) − tr(O E(|x⟩⟨x|))|^2 ≤ ε/4.

[Footnote 1: Due to the non-commutativity of quantum measurements, the ordering of new inputs matters. For instance, the two lists x̃_1, x̃_2 and x̃_2, x̃_1 can lead to different outcome predictions h_Q(x̃_i). Our main results do not depend on this subtlety; they are valid irrespective of the ordering of prediction inputs.]
Then there is an ML model in the restricted classical setting which accesses E a total of N_C = O(m N_Q / ε) times and produces, with high probability, a function h_C that achieves

∑_{x∈{0,1}^n} D(x) |h_C(x) − tr(O E(|x⟩⟨x|))|^2 ≤ ε.

Proof sketch. The proof consists of two parts. First, we cover the entire set of CPTP maps F with a maximal packing net, i.e., the largest subset S ⊆ F such that any two distinct maps in S give rise to prediction functions that differ noticeably on average over D. An information-theoretic argument shows that a successful quantum ML model can distinguish the members of S, which bounds log(|S|) = O(m N_Q). In the second part, we explicitly construct an ML model in the restricted classical setting that achieves a small average-case prediction error using a modest number of experiments. In this ML model, an input x_i is selected by sampling from the probability distribution D, and an experiment is performed in which the observable O is measured in the output quantum state E(|x_i⟩⟨x_i|), yielding a measurement outcome o_i with expectation value tr(O E(|x_i⟩⟨x_i|)). A total of N_C such experiments are conducted. Then, the ML model minimizes the least-squares error to find the best fit within the aforementioned maximal packing net S:

h_C(x) = tr(O Ẽ*(|x⟩⟨x|)),  where  Ẽ* = argmin_{Ẽ∈S} (1/N_C) ∑_{i=1}^{N_C} |tr(O Ẽ(|x_i⟩⟨x_i|)) − o_i|^2.

Because the measurement outcome o_i fluctuates about the expectation value of O, it may be impossible to achieve zero training error. Yet it is still possible for h_C to achieve a small average-case prediction error, potentially even smaller than the training error. We use properties of maximal packing nets and of quantum fluctuations of measurement outcomes to perform a tight statistical analysis of the average-case prediction error, finding that, with high probability, the error is at most ε provided that N_C is of order log(|S|)/ε. Finally, we combine the two parts to conclude N_C = O(m N_Q / ε). The full proof is in Appendix C.
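The second part of the proof sketch can be illustrated with a toy simulation. The snippet below is not the construction from Appendix C; the hypothesis class is a small random stand-in for the packing net S, and all numbers are invented. It shows that least-squares fitting over a finite net recovers the true prediction function from noisy ±1 measurement outcomes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the packing net S: ten candidate prediction functions over
# 3-bit inputs, with values in [-1, 1].
S = [rng.uniform(-1, 1, size=8) for _ in range(10)]
true_idx = 3
f = S[true_idx]          # "true" expectation values tr(O E(|x><x|))

# Restricted classical data: sample x_i ~ D (uniform here) and record a +/-1
# outcome o_i whose mean is f(x_i), as when measuring O on E(|x_i><x_i|).
N_C = 2000
x_i = rng.integers(0, 8, size=N_C)
o_i = np.where(rng.random(N_C) < (1 + f[x_i]) / 2, 1.0, -1.0)

# Least-squares fit over the net: pick the hypothesis minimizing training error.
errs = [np.mean((h[x_i] - o_i) ** 2) for h in S]
best = int(np.argmin(errs))
assert best == true_idx  # ERM recovers the true map despite shot noise
```

Note that the empirical squared error of hypothesis h decomposes into the distance of h from f plus a noise term that is the same for every hypothesis, which is why minimizing the training error selects the right element of the net once N_C is of order log(|S|)/ε.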
Theorem 1 shows that any problem approximately learnable by a quantum ML model is also approximately learnable by some restricted classical ML model that executes the quantum process E a comparable number of times. This applies, in particular, to predicting outputs of quantum-mechanical processes. The relation N_C = O(m N_Q / ε) is tight; we give an example in Appendix D that saturates this scaling. For the task of learning classical Boolean circuits, fundamental limits on quantum advantage have been established in previous work [11-13, 31, 80, 94]. Theorem 1 generalizes these existing results to the task of learning outcomes of quantum processes.

IV. WORST-CASE PREDICTION ERROR
Rather than achieving a small average prediction error, one may be interested in obtaining a prediction model that is accurate for all inputs x ∈ {0,1}^n. For a prediction model h(x), we consider the worst-case prediction error

max_{x∈{0,1}^n} |h(x) − tr(O E(|x⟩⟨x|))|.

Under such a stricter performance requirement, exponential quantum advantage becomes possible. We highlight this potential by means of an illustrative and practically relevant example: predicting expectation values of Pauli operators in an unknown n-qubit quantum state ρ. This is a central task for many quantum computing applications [34,52,54,56-58,61,74]. To formulate this problem in our framework, suppose the 2n-bit input x specifies one of the 4^n n-qubit Pauli operators P_x ∈ {I, X, Y, Z}^⊗n, and suppose that E_ρ(|x⟩⟨x|) prepares the unknown state ρ and maps P_x to the fixed observable O, which is then measured; hence

f(x) = tr(O E_ρ(|x⟩⟨x|)) = tr(P_x ρ).

In this setting, according to Theorem 1, there is no large quantum advantage if our goal is to estimate the Pauli operator expectation values with a small average prediction error. However, an exponential quantum advantage is possible if we insist on accurately predicting every one of the 4^n Pauli observables.

First, we show that there is an efficient quantum ML model achieving a small worst-case prediction error. Details are in Appendix E 2; here we just sketch the main ideas. The procedure for predicting tr(P_x ρ) has two stages. The goal of the first stage is to predict the absolute value |tr(P_x ρ)| for each x, and the goal of the second stage is to determine the sign of tr(P_x ρ). The key idea used in the first stage is that, although two different Pauli operators P_x and P_y may either commute or anticommute, the tensor products P_x ⊗ P_x and P_y ⊗ P_y are mutually commuting for all x and y. Therefore, although it is not possible to measure anticommuting Pauli operators simultaneously using a single copy of the state ρ, it is possible to measure P_x ⊗ P_x simultaneously for all x using two copies of ρ.
Indeed, all 4 n expectation values tr((P x ⊗ P x )(ρ ⊗ ρ)) = tr(P x ρ) 2 can be determined by measuring pairs of qubits in the Bell basis, which is highly efficient. This completes the first stage.
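The identity underlying the first stage can be checked directly in a small simulation. For a single qubit, all operators W ⊗ W with W ∈ {I, X, Y, Z} are diagonal in the Bell basis, so the Bell-basis outcome distribution of ρ ⊗ ρ determines tr(Wρ)^2 for every Pauli W at once. The numpy sketch below verifies this identity for a random state; it is not the full protocol of Appendix E 2, just the fact it rests on:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]])
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0])

# The four Bell states: Phi+, Phi-, Psi+, Psi-.
bell = [np.array([1, 0, 0, 1]) / np.sqrt(2),
        np.array([1, 0, 0, -1]) / np.sqrt(2),
        np.array([0, 1, 1, 0]) / np.sqrt(2),
        np.array([0, 1, -1, 0]) / np.sqrt(2)]

# Eigenvalue of W (x) W on each Bell state; all four operators are diagonal
# in the Bell basis, which is why they can be measured simultaneously.
eig = {"I": [1, 1, 1, 1], "X": [1, -1, 1, -1],
       "Y": [-1, 1, 1, -1], "Z": [1, 1, -1, -1]}

# Random single-qubit density matrix rho.
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
rho = A @ A.conj().T
rho = rho / np.real(np.trace(rho))

# Bell-basis outcome probabilities of rho (x) rho.
rr = np.kron(rho, rho)
p = np.array([np.real(np.vdot(b, rr @ b)) for b in bell])

for name, W in [("X", X), ("Y", Y), ("Z", Z)]:
    est = np.dot(p, eig[name])                  # E[eigenvalue of W (x) W]
    exact = np.real(np.trace(W @ rho)) ** 2     # tr(W rho)^2
    assert abs(est - exact) < 1e-10
```

For n qubits, the same check applies qubit pair by qubit pair: the product of the per-pair eigenvalues is an unbiased estimator of tr(P_x ρ)^2 for every x simultaneously.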
If |tr(P_x ρ)| is found to be small in the first stage, we may predict h(x) = 0 and be assured that the prediction error is small. Therefore, in the second stage, we need only determine the sign when |tr(P_x ρ)| was found to be reasonably large in the first stage. In that case, we can perform a coherent measurement across several copies of ρ that implements a majority vote and yields the correct sign with high probability. Because this measurement is strongly biased in favor of one of the two possible outcomes, it introduces only a very "gentle" disturbance of the pre-measurement state. Therefore, by performing many such measurements in succession on the same quantum memory register, we can determine the sign of tr(P_x ρ) for many different values of x. The second stage can also be made more amenable to near-term implementation using a heuristic that groups commuting observables [56,57]; see Appendix E 2 d for further discussion. Each of the two stages requires only a small number of copies of ρ; a careful analysis yields the following theorem.
Theorem 2. The quantum ML model needs only N_Q = O(log(M/δ)/ε^4) copies of ρ to predict the expectation values of any M Pauli observables to error ε with probability at least 1 − δ.
More details regarding the quantum ML model, as well as a rigorous proof, are provided in Appendix E 2. The sample complexity stated in Theorem 2 improves upon previously known shadow tomography protocols [4,5,26,54] for the special case of predicting Pauli observables; see Appendix A. Because each access to E_ρ allows us to obtain one copy of ρ, we only need N_Q = O(n) to predict the expectation values of all 4^n Pauli observables up to constant error.
For classical ML models, we prove the following fundamental lower bound; see Appendix E 4.
Theorem 3. Any classical ML model must use N_C ≥ 2^{Ω(n)} copies of ρ to predict the expectation values of all Pauli observables up to a small error with constant success probability.
This theorem holds even when the POVM measurements performed by the classical ML model are chosen adaptively, depending on all previous measurement outcomes. Combined with Theorem 2, Theorem 3 establishes an exponential gap separating classical ML models from fully quantum ML models. Table 2 summarizes the upper and lower bounds on the sample complexity for predicting expectation values of Pauli observables.

Model                       Upp. bd.     Low. bd.
Quantum ML                  O(n)         Ω(n)
Classical ML                O(n 2^n)     2^{Ω(n)}
Restricted classical ML     O(4^n)       Ω(4^n)

Table 2: Sample complexity for predicting the expectations of all 4^n Pauli observables (worst-case prediction error) in an n-qubit quantum state. Upp. bd. is the achievable sample complexity of a specific algorithm; Low. bd. is the lower bound for any algorithm. The classical ML upper bound can be achieved using classical shadows based on random Clifford measurements [54]. The remaining bounds are obtained in Appendix E.

V. NUMERICAL EXPERIMENTS
We support our theoretical findings with numerical experiments, focusing on the task of predicting the expectation values of all 4^n Pauli observables in an unknown n-qubit quantum state ρ with small worst-case prediction error. In this case, the function is f(x) = tr(O E_ρ(|x⟩⟨x|)) = tr(P_x ρ), where x ∈ {I, X, Y, Z}^n indexes the Pauli observables, and E_ρ prepares the unknown state ρ and then maps P_x to the fixed observable O. This is the task we considered in Section IV. Note that average-case prediction of Pauli observables is a much easier task, because most of the 4^n expectation values are exponentially small in n.
We consider two classes of underlying states ρ: (i) Mixed states: ρ = (I + P)/2^n, where P is a tensor product of n Pauli operators. States in this class have rank 2^{n−1}. (ii) Product states: ρ = ⊗_{i=1}^{n} |s_i⟩⟨s_i|, where each |s_i⟩ is one of the six single-qubit stabilizer states. We consider stabilizer states to ensure that classical simulation of the quantum ML model is tractable for reasonably large system sizes.
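Class (i) can be checked quickly in numpy. The hidden Pauli string "XZY" below is an illustrative choice (any non-identity string works); the snippet verifies the unit trace, the rank 2^{n−1}, and the point-function property of f(x) = tr(P_x ρ) on non-identity Pauli strings:

```python
import numpy as np
from functools import reduce

pauli = {"I": np.eye(2),
         "X": np.array([[0, 1], [1, 0]]),
         "Y": np.array([[0, -1j], [1j, 0]]),
         "Z": np.diag([1.0, -1.0])}

def pauli_string(s):
    """Tensor product P_s for a string s over {I, X, Y, Z}."""
    return reduce(np.kron, [pauli[c] for c in s])

n, x_star = 3, "XZY"                                     # illustrative choice
rho = (np.eye(2 ** n) + pauli_string(x_star)) / 2 ** n   # class (i)

assert abs(np.trace(rho) - 1) < 1e-12                    # unit trace
eigs = np.linalg.eigvalsh(rho)
assert int(np.sum(eigs > 1e-12)) == 2 ** (n - 1)         # rank 2^{n-1}

# f(x) = tr(P_x rho) is a point function on non-identity Pauli strings:
assert abs(np.trace(pauli_string(x_star) @ rho) - 1) < 1e-12
assert abs(np.trace(pauli_string("ZZX") @ rho)) < 1e-12
```

The rank follows because P has eigenvalues ±1, each with multiplicity 2^{n−1}, so ρ has eigenvalue 2/2^n on half the spectrum and 0 on the other half.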
The numerical experiment in Figure 3 implements the best-known ML procedures. There is a clear exponential separation between the number of copies of ρ required by classical and quantum ML to predict expectation values when ρ belongs to the class of mixed states. For the class of product states, however, the separation is much less pronounced. Restricted classical ML can only obtain outcomes o_i ∈ {±1} with E[o_i] = tr(P_{x_i} ρ). Hence each copy of ρ provides at most one bit of information, so Ω(n) copies are needed to predict the expectation values of all 4^n Pauli observables. In contrast, standard classical ML can perform arbitrary POVM measurements on the state ρ, so each copy can provide up to n bits of information. The separation between classical ML and quantum ML is marginal for product states.

VI. CONCLUSION AND OUTLOOK
We have studied the task of learning functions of the form of Equation (1), using as a figure of merit the number of runs of E. Our main result, Theorem 1, shows that, when the objective is to achieve a specified average prediction error, a classical ML model can perform as well as a quantum ML model using a comparable number of runs of E. This result establishes a fundamental limit on quantum advantage in machine learning that holds for any quantum ML model [47,53,79].
From a different perspective, Theorem 1 means that the classical ML setting, in which a measurement is performed after each query to E, can be surprisingly effective. The quantum ML setting, in which multiple queries to E can be included in a single coherent quantum circuit, is far more challenging and may be infeasible until far in the future. Therefore finding that classical and quantum ML have comparable power (for average-case prediction) boosts our hopes that the combination of classical ML and near-term quantum algorithms [47,52,53,76,79] may fruitfully address challenging quantum problems in physics, chemistry, and materials science.
On the other hand, Theorems 2 and 3 rigorously establish that quantum ML can have an exponential advantage over classical ML for certain problems where the objective is achieving a specified worst-case prediction error. This exponential advantage of quantum ML over classical ML may be viewed as an exponential separation between coherent measurements (in which a measurement apparatus interacts coherently multiple times with a measured system, storing quantum data which is then processed by a quantum computer) and incoherent measurements (in which a POVM measurement is performed and the outcome recorded after each interaction between system and apparatus, and the classical measurement outcomes are then processed by a classical computer). Such a separation has been challenging to establish because incoherent measurements are difficult to analyze in the adaptive setting, where each measurement performed may depend on the outcomes of all previous measurements. Our proof technique overcomes this challenge, enabling us to identify tasks which allow substantial quantum advantage. An important future direction will be identifying further learning problems which allow substantial quantum advantage, pointing toward potential practical applications of quantum technology.
inspiring discussions. We would also like to thank anonymous reviewers for in-depth comments and suggestions. HH is supported by the J.

Appendix
Roadmap: Appendix A provides additional context and discusses relevant existing work. Details regarding the numerical experiments can be found in Appendix B. The remaining portions are devoted to theory and mathematical proofs. Appendix C provides a thorough treatment of average prediction errors, including bounds on query complexity (the number of times the quantum process E is accessed), culminating in a proof of Theorem 1. Appendix D provides a stylized example demonstrating that the bound is tight. Finally, Appendix E provides sample complexity upper and lower bounds for predicting many Pauli expectation values with small worst-case error (leading to an exponential separation in query complexity).
Appendix A: Related works

a. Quantum PAC learning: It is instructive to relate our main result on achieving small average-case prediction error (Theorem 1) to quantum probably approximately correct (PAC) learning [11-13, 29, 31, 80, 94]. The latter rigorously established the absence of an information-theoretic quantum advantage for learning classical Boolean functions h : {0,1}^n → {0,1}. This is a special case of predicting functions of the form f(x) = tr(O E(|x⟩⟨x|)). To see this, we reversibly encode every n-bit Boolean function h in a (unitary) CPTP map that acts on (n + 1) qubits:

U_h |x⟩|b⟩ = |x⟩|b ⊕ h(x)⟩,

where ⊕ is addition in Z_2. This is the quantum oracle for implementing the Boolean function h.
Applying this oracle to the superposition ∑_x √(D(x)) |x⟩|0⟩ produces the state ∑_x √(D(x)) |x⟩|h(x)⟩, which is the quantum sample considered in quantum PAC learning. For related works on PAC learning a distribution rather than a function, see [71,84]. While existing results [11-13, 29, 31, 80, 94] have shown that no large quantum advantage in sample complexity is possible, substantial quantum speedups in computational complexity remain possible. In Ref. [80], for instance, a contrived learning problem is constructed based on factoring, showcasing the possibility of a quantum advantage in computational complexity even when no quantum advantage in sample complexity is available. On the other hand, an exponential separation in sample complexity is possible if one considers worst-case prediction errors.
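A direct way to see the encoding is to build U_h as a permutation matrix for a small example. The Boolean function h below is arbitrary, and basis states |x, b⟩ are packed into the index 2x + b; the snippet checks unitarity and the quantum-sample property:

```python
import numpy as np

# Reversible oracle U_h |x, b> = |x, b XOR h(x)> for a 2-bit Boolean function.
n = 2
h = {0: 0, 1: 1, 2: 1, 3: 0}          # example Boolean function on {0,1}^2

dim = 2 ** (n + 1)
U = np.zeros((dim, dim))
for x in range(2 ** n):
    for b in range(2):
        U[2 * x + (b ^ h[x]), 2 * x + b] = 1.0

# U is a permutation matrix, hence unitary:
assert np.allclose(U @ U.T, np.eye(dim))

# Applying U to sum_x sqrt(D(x)) |x, 0> yields the quantum sample
# sum_x sqrt(D(x)) |x, h(x)>:
D = np.full(2 ** n, 1 / 2 ** n)       # uniform input distribution
psi = np.zeros(dim)
for x in range(2 ** n):
    psi[2 * x] = np.sqrt(D[x])
out = U @ psi
for x in range(2 ** n):
    assert abs(out[2 * x + h[x]] - np.sqrt(D[x])) < 1e-12
```

This makes the reduction explicit: a Boolean-function learning problem is a special case of setup (1) with E the unitary channel of U_h.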
b. Quantum machine learning: Quantum computers have the potential to improve existing machine learning models based on classical computers. A series of works assume that an exponential amount of data is stored in a quantum random access memory [20,35,46,52,64,65,77,93]; thanks to this quantum data structure, one can obtain exponential speed-ups for certain machine learning tasks. However, storing the exponential amount of data in the first place takes exponential time, and if a similar data structure is granted to classical machines, the exponential advantage may vanish altogether [3,10,41,85,86]. Another recent line of work focuses on using quantum computers to represent classically intractable function classes [32,38,47,79]. However, the computational power provided by data in a machine learning task can also make a classical ML model stronger [53], which may challenge some of these proposals in practice. When learning an arbitrary unitary, the inherent complexity of the unitary group forces the sample complexity to scale with the Hilbert space dimension, which is exponential in the system size [75,82].
c. Classical learning theory: For a general introduction to classical learning theory, see the comprehensive book on the foundations of machine learning [68]. The classical learning strategy is the same as learning probabilistic real-valued functions. Existing works on learning real-valued functions [7,14], or probabilistic Boolean functions [60], mainly focus on the worst-case input distribution that maximizes sample complexity. Geometric quantities, such as the fat-shattering dimension [7], then characterize the sample complexity. For example, [14] proves a (worst-case) sample complexity upper bound of Õ(fat(ε/5)/ε^2), where Õ(·) suppresses logarithmic factors. In contrast, our proof of Theorem 1 yields a sample complexity bound tailored to a fixed input distribution, and the underlying geometric quantity is also different: it measures the cardinality of a maximal packing net that depends on the input distribution.
d. Shadow tomography: Shadow tomography is the task of simultaneously estimating the outcome probabilities associated with M 2-outcome measurements up to accuracy ε: p_i(ρ) = tr(E_i ρ), where each E_i is a positive semi-definite matrix with operator norm at most one [4,5,23,26]. The best existing result is given by [26], which shows that N = O(log(M)^2 log(d)/ε^4) copies of the unknown state suffice to achieve this task. Their protocol is based on an improved quantum threshold search: finding an observable E_i whose expectation value tr(E_i ρ) exceeds a certain threshold. They combine this with online learning of quantum states, following Aaronson's original protocol [4]. A more experimentally friendly shadow tomography protocol has been proposed in [54]. It only yields competitive scaling for a restricted set of observables, but can be implemented on state-of-the-art quantum platforms [36,83].
e. PAC learning quantum systems: A precursor to shadow tomography is PAC learning of quantum states [2,78], where the goal is to accurately predict the expectation values of different observables in an unknown quantum state ρ up to a small average error. This fits nicely into the scope of Theorem 1: the optimal fully quantum ML model, which can perform quantum data analysis on many copies stored in quantum memory, will not yield a large advantage in sample complexity over ML models that make predictions based solely on classical measurement data from randomized measurements on single copies of ρ. This separation between learning from classical measurement data and learning coherently from quantum states was not discussed in [2]. The result of [2] can be seen as establishing an upper bound on the size of a maximal packing net for the set of CPTP maps F, which then translates into a sample complexity upper bound sufficient for good prediction performance.
f. Measuring expectation values of Pauli observables: Due to the importance of measuring Pauli observables in near-term applications of quantum computers, a series of methods [22,34,44,56-58,91] have been proposed to reduce the number of measurements needed to estimate Pauli expectation values. All of them are based on one basic yet powerful observation: commuting observables can be measured simultaneously. For example, quantum chemistry applications often require measuring O(n^4) Pauli observables [74]. The aforementioned Pauli estimation protocols group these observables into O(n^3) or even only O(n) commuting groups; in turn, O(n^3) or O(n) copies of the underlying state suffice to obtain expectation values for all O(n^4) relevant Pauli observables by exploiting the ability to simultaneously measure commuting observables. The technique proposed in Appendix E gets by with even fewer state preparations: a total of O(log(n^4)) = O(log(n)) copies suffice. Restricting to few-body Pauli observables can yield additional improvements; several protocols are known for this special case, see e.g. [22,33,37,54,58,72].
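The grouping idea can be sketched with a greedy qubit-wise-commutation grouping, a simple stand-in for the heuristics of [56,57] (the observable list below is illustrative, not from the paper):

```python
def qubitwise_commute(p, q):
    """Two Pauli strings commute qubit-wise if, at every position, the
    single-qubit factors are equal or at least one is the identity."""
    return all(a == "I" or b == "I" or a == b for a, b in zip(p, q))

def greedy_group(paulis):
    """Greedily place each Pauli string into the first compatible group."""
    groups = []
    for p in paulis:
        for g in groups:
            if all(qubitwise_commute(p, q) for q in g):
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

# Illustrative 2-qubit observable list: seven observables, three settings.
obs = ["XX", "XI", "IX", "ZZ", "ZI", "IZ", "YY"]
groups = greedy_group(obs)
assert groups == [["XX", "XI", "IX"], ["ZZ", "ZI", "IZ"], ["YY"]]
```

All members of a group can be estimated from one single-qubit-basis measurement setting, so the number of state preparations scales with the number of groups rather than the number of observables.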
g. Incoherent versus coherent measurements: The exponential advantage of quantum ML over classical ML established by our work may be viewed as an exponential separation between coherent measurements (in which a measurement apparatus interacts coherently multiple times with a measured system, storing quantum data which is then processed by a quantum computer) and incoherent measurements (in which a POVM measurement is performed and the outcome recorded after each interaction between system and apparatus, and the classical measurement outcomes are then processed by a classical computer). Existing work has shown an advantage of coherent measurements over independent incoherent measurements, a special case in which the POVM measurements do not depend on the results of previous measurements; see e.g., [43] on quantum state tomography and [54] on shadow tomography. However, few prior results limit the power of incoherent measurements in the adaptive setting, in which each measurement performed may depend on the outcomes of all the previous measurements.
A prior result that we were aware of before obtaining our result is [24], which shows a mild polynomial advantage of coherent over incoherent measurements for the task of distinguishing whether an unknown quantum state is close to the completely mixed state. Our work establishes an exponential separation in sample complexity (the number of copies of a quantum state needed to perform the task) between incoherent and coherent measurements for shadow tomography. After our work was complete, [6] also presented a detailed analysis of the power of coherent and incoherent measurements, finding an exponential separation between incoherent and coherent measurements for the task of distinguishing between different types of quantum channels.

Mixed states

For the case of mixed states given by ρ = (I + P_{x*})/2^n, we have f(x) = tr(P_x ρ) = δ_{x,x*} for all non-identity x ∈ {I, X, Y, Z}^n. That is, f(x) is a point function: f(x*) = 1 for exactly one x*, and all other non-identity Pauli strings evaluate to zero.
a. Restricted classical ML: We consider a restricted classical ML model that implements an exhaustive search over all Pauli observables $P_x$ with $x \in \{I, X, Y, Z\}^n$. For each $x$, we repeatedly measure the observable $P_x$ and check whether the outcome $-1$ ever occurs. If it does, we know $f(x) = 0$ with certainty (recall that $f(x) \in \{0,1\}$ is a point function). After looping through all observables, we are left with a single input $x^*$ obeying $f(x^*) = 1$. There are $4^n$ inputs in total, and whenever $x \neq x^*$ the measurement outcomes are uniformly distributed over $\pm 1$, so the expected number of measurements required to obtain the outcome $-1$ is $1/\Pr_x[-1] = 2$. In contrast, for $x = x^*$ the outcome $-1$ can never occur. This results in a sample complexity of $\mathcal{O}(4^n)$, a scaling that is confirmed by our numerical experiments and matches the lower bound $\Omega(4^n)$.
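This expected-shot count is easy to check with a small simulation. The helper below is our own hypothetical sketch (not from the paper); it models each measurement of $P_x$ with $x \neq x^*$ as a fair $\pm 1$ coin and counts shots until every such $x$ is ruled out:

```python
import random

def expected_shots_exhaustive(n, trials=200, seed=0):
    """Simulate the exhaustive Pauli search for a point function.

    For every x != x*, outcomes of measuring P_x on rho = (I + P_{x*})/2^n
    are uniform over {-1, +1}, so each such x is ruled out after a
    geometrically distributed number of shots with mean 2.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        shots = 0
        for _x in range(4**n - 1):      # loop over all inputs x != x*
            while True:                  # measure P_x until a -1 appears
                shots += 1
                if rng.random() < 0.5:   # Pr[outcome = -1] = 1/2
                    break
        total += shots                   # x* itself is found by elimination
    return total / trials

avg = expected_shots_exhaustive(n=3)     # expect about 2 * (4**3 - 1) = 126
```

The average hovers around $2 \cdot (4^n - 1)$ shots, reproducing the $\mathcal{O}(4^n)$ scaling discussed above.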
b. Classical ML: For classical ML, we implement property prediction with classical shadows based on random Clifford measurements [54]. The sample complexity to predict $M$ Pauli observables up to small constant accuracy is known to be $\mathcal{O}(\max_x \operatorname{tr}(P_x^2) \log(M))$. Using $\operatorname{tr}(P_x^2) = \operatorname{tr}(I^{\otimes n}) = 2^n$ and $M = 4^n$, this upper bound simplifies to $\mathcal{O}(n 2^n)$. The numerical experiments confirm this theoretical prediction.
c. Quantum ML: For quantum ML, we use the procedure introduced in Appendix E 2. We first perform several repetitions of two-copy Bell basis measurements to estimate the absolute values $|\operatorname{tr}(P_x \rho)|^2$ for all $4^n$ possible inputs $x$. This allows us to immediately identify $x^*$. Alternatively, one could solve for $x^*$ by performing Gaussian elimination over $\mathrm{GF}(2)$. A similar strategy has also been used for learning quantum channels [45]. The required sample complexity is $\mathcal{O}(\log(4^n)) = \mathcal{O}(n)$, and the numerical experiments confirm this linear scaling.
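The two-copy estimates rest on a standard identity for Bell-basis measurements, which we sketch here (the presentation, not taken verbatim from the paper, is ours):

```latex
% Bell-basis measurement of two copies: every tensor square P_x \otimes P_x
% of a Pauli string is diagonal in the n-fold Bell basis, so one round of
% Bell measurements on \rho \otimes \rho yields a +-1 sample of
% P_x \otimes P_x for all 4^n strings x simultaneously, with expectation
\operatorname{tr}\!\big[(P_x \otimes P_x)(\rho \otimes \rho)\big]
    = \operatorname{tr}(P_x \rho)^2
    = |\operatorname{tr}(P_x \rho)|^2 .
% For \rho = (I + P_{x^*})/2^n this expectation equals \delta_{x,x^*};
% averaging O(n) rounds and taking a union bound over all 4^n strings
% singles out x^* with high probability.
```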

Product states
We consider the unknown $n$-qubit state to be a tensor product of $n$ single-qubit stabilizer states (there are only six choices: $|0\rangle, |1\rangle, |+\rangle, |-\rangle, |y,+\rangle, |y,-\rangle$). We only need to obtain measurement outcomes for single-qubit Pauli observables to completely determine such product states.
a. Restricted classical ML: Recall that the goal of this task is to predict $f(x) = \operatorname{tr}(P_x \rho)$ for an unknown quantum state $\rho$. The restricted classical ML model can collect measurement data of the form $\{(x_i, o_i)\}$, where $o_i$ is the measurement outcome when we measure $P_{x_i}$ on $\rho$. We simply collect measurement outcomes for the $3n$ single-qubit Pauli observables by choosing the appropriate $x_i$. For each qubit, we sample $\pm 1$ outcomes in the $X$-, $Y$- and $Z$-basis. Once two of the three Pauli observables have each produced both outcomes $\pm 1$, we can determine the single-qubit state of that particular qubit. For two of the three bases the underlying distribution is uniform, while deterministic outcomes are produced in the third basis. In turn, we expect to require about $6n$ Pauli measurements to unambiguously characterize the underlying stabilizer state. This scaling is confirmed by the numerical experiments.
b. Classical ML: We associate classical ML models with reconstruction procedures that can perform arbitrary POVM measurements on individual copies of the unknown state $\rho$. Here, we consider a sequence of POVMs where we measure first in the all-$X$ basis, second in the all-$Y$ basis, and then in the all-$Z$ basis (and repeat). Similarly to the restricted classical ML model, we check whether two of the three possible single-qubit observables $X, Y, Z$ have produced both outcomes $\pm 1$; once they have, we can perfectly identify the corresponding single-qubit stabilizer state. To complete the assignment, however, we need to do so for all $n$ qubits simultaneously. This incurs an additional logarithmic factor in the expected number of measurements: we expect to require $\mathcal{O}(\log(n))$ measurement repetitions to unambiguously determine the underlying product state. Numerical experiments confirm this scaling behavior.
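The $\mathcal{O}(\log n)$ scaling can be illustrated with a small simulation. The code below is our own hypothetical sketch (not from the paper); it tracks, per qubit, which single-qubit bases have already shown both outcomes:

```python
import random

def cycles_to_identify(n, rng):
    """One run: cycles of all-X, all-Y, all-Z measurements until every
    qubit's stabilizer basis is determined (two bases show both outcomes)."""
    basis = [rng.randrange(3) for _ in range(n)]       # hidden basis per qubit
    seen = [[set(), set(), set()] for _ in range(n)]   # outcomes seen per basis
    cycles = 0
    while True:
        cycles += 1
        for b in range(3):
            for q in range(n):
                # deterministic in the stabilizer basis (the sign is irrelevant
                # for the stopping rule, so we fix it to +1), uniform otherwise
                out = 1 if b == basis[q] else rng.choice((-1, 1))
                seen[q][b].add(out)
        if all(sum(len(s) == 2 for s in seen[q]) >= 2 for q in range(n)):
            return cycles

rng = random.Random(1)
avg = sum(cycles_to_identify(64, rng) for _ in range(300)) / 300
# a given qubit stays unresolved after c cycles with probability ~ 2^(2-c),
# so all 64 qubits resolve after roughly log2(64) + O(1) cycles
```

Doubling $n$ adds roughly one extra cycle on average, consistent with the logarithmic overhead discussed above.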
c. Quantum ML: Quantum ML models can perform quantum data processing on multiple copies of quantum states. For product states, we perform the following procedure. In each repetition, we perform two-copy Bell basis measurements on $\rho \otimes \rho$ to simultaneously measure $X \otimes X$, $Y \otimes Y$, $Z \otimes Z$ on each of the $n$ qubit pairs. We can determine the eigenbasis ($X$-, $Y$-, or $Z$-basis) of each qubit once two of the observables $X \otimes X$, $Y \otimes Y$, $Z \otimes Z$ have produced both $\pm 1$ outcomes. For two of the three observables the underlying distribution is uniform over $\pm 1$, while deterministic outcomes are produced by the third. Once the stabilizer basis of each qubit is known after a few two-copy measurements, we simply measure the $n$ qubits of a final copy in the corresponding stabilizer bases to recover the full state. We refer to [42,69,70,95] for results on efficient procedures for testing and learning stabilizer states in general.

Appendix C: Proof of Theorem 1
This section contains a thorough treatment of average prediction errors. We consider related setups for the classical and quantum learning settings.
The learning problem is defined by a set of CPTP maps $\mathcal{F}$, an input distribution $\mathcal{D}$, and an observable $O$ with $\|O\| \leq 1$. Each CPTP map $\mathcal{E} \in \mathcal{F}$ maps an $n$-qubit quantum state to an $m$-qubit state. This collection defines a function

$f_{\mathcal{E}}(x) = \operatorname{tr}\big( O \, \mathcal{E}(|x\rangle\langle x|) \big), \quad x \in \{0,1\}^n.$

The goal is to learn a function $h : \{0,1\}^n \to \mathbb{R}$ such that, with high probability,

$\mathbb{E}_{x \sim \mathcal{D}} \, |h(x) - f_{\mathcal{E}}(x)|^2 \leq \epsilon.$

A bit of additional context is appropriate here: we are studying the existence of learning algorithms for a fixed learning problem defined by the input distribution $\mathcal{D}$, the observable $O$, and the set of CPTP maps $\mathcal{F}$. The actual learning algorithms may, and in general will, depend on these mathematical objects.
One of our main technical contributions, Theorem 1, shows that a substantial quantum advantage is impossible in this setting (small average-case prediction error). This is in stark contrast to the setting of small worst-case prediction error. The proof consists of two parts. Section C 1 establishes a lower bound on the query complexity of any quantum ML model. Subsequently, Section C 2 provides an upper bound on the query complexity achieved by certain classical ML models. A combination of these two results establishes Theorem 1; see Section C 3.

1. Information-theoretic lower bound for quantum machine learning models
The quantum machine learning model consists of a learning phase and a prediction phase. In the learning phase, the quantum ML model accesses the quantum experiment characterized by the CPTP map $\mathcal{E}$ a total of $N_Q$ times to learn a model. We consider the quantum ML model to be a mixed-state quantum computation (a generalization of unitary quantum computation). The starting point is an initial state $\rho_0$ on any number of qubits. Subsequently, arbitrary quantum operations $\mathcal{C}_t$ (CPTP maps) are interleaved with a total of $N_Q$ invocations of $\mathcal{E} \otimes \mathcal{I}$, where $\mathcal{E}$ is the unknown (black-box) CPTP map, producing a final state

$\rho_{\mathcal{E}} = \mathcal{C}_{N_Q} \Big( (\mathcal{E} \otimes \mathcal{I}) \, \mathcal{C}_{N_Q - 1} \cdots \mathcal{C}_1 \big( (\mathcal{E} \otimes \mathcal{I}) \, \mathcal{C}_0 (\rho_0) \big) \Big). \quad (C3)$

In this model we can assume, without loss of generality, that $\mathcal{E}$ always acts on the first $n$ qubits, because the quantum operations $\mathcal{C}_t$ are unrestricted; in particular, they can contain SWAP operations that permute the qubits. The final state $\rho_{\mathcal{E}}$ is the quantum memory that stores the prediction model learned from the CPTP map $\mathcal{E}$ using the quantum ML algorithm. Obtaining $\rho_{\mathcal{E}}$ concludes the quantum learning phase.
In the prediction phase, we assume that new inputs are provided as part of a sequence $x_1, x_2, \ldots$. For each sequence member $x_i$, the quantum ML model accesses the input $x_i$ as well as the current quantum memory, and produces an outcome by performing a POVM measurement on the quantum memory $\rho_{\mathcal{E}}$. We emphasize that this can, and in general will, affect the quantum memory nontrivially. The quantum ML outputs $h_Q(x_i)$ depend on the entire sequence $x_1, \ldots, x_i$, and different orderings of the same inputs will in general produce slightly different predictions. Also note that $h_Q(x_i)$ can be randomized, because a quantum measurement is performed to produce the prediction outcome. The ordering does not affect the theorem we want to prove; in the following, we fix an arbitrary input ordering, for example the ordering for which the quantum ML model has the smallest prediction error. After fixing an input ordering, we can treat the entire prediction phase (taking a sequence of inputs $x_1, x_2, \ldots$ and producing $h_Q(x_1), h_Q(x_2), \ldots$) as one enormous POVM measurement on the output state $\rho_{\mathcal{E}}$ obtained from the learning phase. Each outcome $a$ of this enormous POVM measurement corresponds to a function $h_{Q,a}(x) : \{0,1\}^n \to \mathbb{R}$. By Naimark's dilation theorem, every POVM measurement is a projective measurement on a larger Hilbert space. Since the quantum memory that the quantum ML model operates on may contain an arbitrary number of qubits, we can use Naimark's dilation theorem to replace the enormous POVM measurement by a projective measurement $\{P_a\}_a$. Hence, for any CPTP map $\mathcal{E} \in \mathcal{F}$, when we ask the quantum ML model to produce predictions for an ordering of inputs $x_1, x_2, \ldots$, the output function $h_{Q,a}$ is determined by the outcome $a$ of a projective measurement $\{P_a\}_a$ with $\sum_a P_a = I$.
Finally, we assume that the produced function $h_Q(x)$ achieves a small prediction error for any CPTP map $\mathcal{E} \in \mathcal{F}$. This assumption asserts that

$\mathbb{E}_{x \sim \mathcal{D}} \, |h_Q(x) - \operatorname{tr}(O \, \mathcal{E}(|x\rangle\langle x|))|^2 \leq \epsilon \quad \text{with probability at least } 2/3. \quad (C8)$

a. Maximal packing net
We emphasize that Rel. (C9) must be valid for any $\mathcal{E} \in \mathcal{F}$. Because we only need to output a function $h_Q(x)$ that approximates $f_{\mathcal{E}}(x) = \operatorname{tr}(O \, \mathcal{E}(|x\rangle\langle x|))$ on average, the task is not hard when there are only a few qualitatively different CPTP maps in $\mathcal{F}$. However, the problem can become harder when $\mathcal{F}$ contains a large number of very different CPTP maps. The task is now to transform this requirement into a stringent lower bound on $N_Q$, the number of black-box uses of the unknown CPTP map $\mathcal{E} \otimes \mathcal{I}$ within the quantum computation (C3). As a starting point, we equip the set of target functions $\mathcal{F}_f = \{ f_{\mathcal{E}}(x) = \operatorname{tr}(O \mathcal{E}(|x\rangle\langle x|)) \,|\, \mathcal{E} \in \mathcal{F} \}$ with a packing net. Packing nets are discrete subsets whose elements are guaranteed to have a certain minimal pairwise distance (think of spheres that must not overlap with each other). We choose points (functions) $f_{\mathcal{E}_i} \in \mathcal{F}_f$ and demand

$\mathbb{E}_{x \sim \mathcal{D}} \, |f_{\mathcal{E}_i}(x) - f_{\mathcal{E}_j}(x)|^2 > 4\epsilon \quad \text{for all } i \neq j. \quad (C10)$

We denote the resulting $4\epsilon$-packing net of $\mathcal{F}_f$ by $\mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ and note that every such set has finitely many elements ($\mathcal{F}_f$ is a compact set). We also assume that $\mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ is maximal in the sense that no other $4\epsilon$-packing net can contain more points (functions).
It is possible to utilize packing nets to derive a query complexity lower bound for the quantum machine learning model. In fact, we present two different proof strategies. The first proof is inspired by [39,43,54] and analyzes a communication protocol. The second proof is based on an analysis of polynomials similar to [16]. While it is somewhat weaker than the information-theoretic bound obtained in the first proof, we include the derivation for completeness, as we believe it may be insightful for the interested reader.

b. Proof strategy I: mutual information analysis
Let us define a communication protocol between two parties, Alice and Bob. They use the packing net $\mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ as a dictionary to communicate randomly selected classical messages. More precisely, Alice samples an integer $X$ uniformly at random from $1, 2, \ldots, |\mathcal{M}_{4\epsilon}(\mathcal{F}_f)|$ and chooses the corresponding CPTP map $\mathcal{E}_X$ with $f_{\mathcal{E}_X} \in \mathcal{M}_{4\epsilon}(\mathcal{F}_f)$. Whenever Bob wants to access the unknown CPTP map, he asks Alice to apply $\mathcal{E}_X$. Bob then executes the quantum machine learning model (C3) to obtain a prediction model $h_{Q,a}(x)$, where $a$ parameterizes the prediction model. Subsequently, Bob solves the following optimization problem to obtain an integer $\hat{X}$:

$\hat{X} = \operatorname{arg\,min}_{1 \leq i \leq |\mathcal{M}_{4\epsilon}(\mathcal{F}_f)|} \, \mathbb{E}_{x \sim \mathcal{D}} \, |h_{Q,a}(x) - f_{\mathcal{E}_i}(x)|^2. \quad (C11)$

This decoding procedure is adequate provided that the prediction model $h_{Q,a}$ approximately reproduces the true underlying function. More precisely, assumption (C8) asserts $\mathbb{E}_{x \sim \mathcal{D}} |h_{Q,a}(x) - \operatorname{tr}(O \mathcal{E}_X(|x\rangle\langle x|))|^2 \leq \epsilon$ with probability at least $2/3$.
Here is where the choice of dictionary matters: $\mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ is a packing net; see Equation (C10). For every $i \neq X$, the triangle inequality necessarily implies

$\big( \mathbb{E}_{x \sim \mathcal{D}} |h_{Q,a}(x) - f_{\mathcal{E}_i}(x)|^2 \big)^{1/2} \geq \big( \mathbb{E}_{x \sim \mathcal{D}} |f_{\mathcal{E}_X}(x) - f_{\mathcal{E}_i}(x)|^2 \big)^{1/2} - \big( \mathbb{E}_{x \sim \mathcal{D}} |h_{Q,a}(x) - f_{\mathcal{E}_X}(x)|^2 \big)^{1/2} > 2\sqrt{\epsilon} - \sqrt{\epsilon} = \sqrt{\epsilon}.$

This allows us to conclude that Bob's decoding strategy (C11) succeeds perfectly whenever $\mathbb{E}_{x \sim \mathcal{D}} |h_{Q,a}(x) - f_{\mathcal{E}_X}(x)|^2 \leq \epsilon$. In turn, assumption (C8) ensures $\hat{X} = X$ (perfect decoding) with probability at least $2/3$. Now, we use the fact that Alice samples her message $X$ uniformly at random from a total of $|\mathcal{M}_{4\epsilon}(\mathcal{F}_f)|$ integers. Because $\hat{X} = X$ with probability at least $2/3$, Fano's inequality implies

$H(X \,|\, \hat{X}) \leq H_2(1/3) + \tfrac{1}{3} \log |\mathcal{M}_{4\epsilon}(\mathcal{F}_f)|,$

where $H_2(p) = -p \log p - (1-p) \log(1-p)$ is the binary entropy. This gives a lower bound on the mutual information between sent and decoded message, namely

$I(X : \hat{X}) = H(X) - H(X \,|\, \hat{X}) \geq \tfrac{2}{3} \log |\mathcal{M}_{4\epsilon}(\mathcal{F}_f)| - H_2(1/3).$

Next, note that $\hat{X}$ is obtained by classically processing a measurement outcome $a$ of the quantum state $\rho_{\mathcal{E}_X}$. The data processing inequality and Holevo's theorem [9, 18, 49, 51] then imply

$I(X : \hat{X}) \leq \chi(X : \rho_{\mathcal{E}_X}).$

The Holevo $\chi$ quantity between the classical random variable $X$ and the quantum state $\rho_{\mathcal{E}_X}$ is

$\chi(X : \rho_{\mathcal{E}_X}) = S\big( \mathbb{E}_X [\rho_{\mathcal{E}_X}] \big) - \mathbb{E}_X \big[ S(\rho_{\mathcal{E}_X}) \big],$

where $S(\rho) = \operatorname{tr}(-\rho \log \rho)$ is the von Neumann entropy. Throughout this work, $\log$ denotes the logarithm with base $e$.
Recall that Bob produces $\rho_{\mathcal{E}_X}$ by utilizing a total of $N_Q$ channel uses obtained from Alice. We can use the specific layout (C3) of Bob's quantum computation to produce an upper bound on the Holevo $\chi$:

$\chi(X : \rho_{\mathcal{E}_X}) \leq 2 m N_Q \log 2. \quad (C20)$

This bound follows from induction over a sample-resolved variant of Bob's quantum computation. Set $\rho^0_{\mathcal{E}_X} = \mathcal{C}_0(\rho_0)$ and $\rho^t_{\mathcal{E}_X} = \mathcal{C}_t \big( (\mathcal{E}_X \otimes \mathcal{I}) \rho^{t-1}_{\mathcal{E}_X} \big)$. For $t = 0, 1, \ldots, N_Q$, we will show that

$\chi(X : \rho^t_{\mathcal{E}_X}) \leq 2 t m \log 2.$

Bound (C20) then follows from recognizing that setting $t = N_Q$ reproduces Bob's complete computation; see Equation (C3). The base case ($t = 0$) is simple, because $\rho^0_{\mathcal{E}_X} = \mathcal{C}_0(\rho_0)$ does not depend on $X$ at all. This ensures $\chi(X : \rho^0_{\mathcal{E}_X}) = 0$. Now, let us move to the induction step ($t > 0$). The induction hypothesis provides us with $\chi(X : \rho^{t-1}_{\mathcal{E}_X}) \leq 2(t-1) m \log 2$, and we must relate $\chi(X : \rho^t_{\mathcal{E}_X})$ to $\chi(X : \rho^{t-1}_{\mathcal{E}_X})$. To achieve this goal, we use the fact that the Holevo $\chi$ is closely related to the quantum relative entropy $D(\rho \| \sigma) = \operatorname{tr}(\rho (\log \rho - \log \sigma))$ [9, 18, 51]. Indeed,

$\chi(X : \rho_{\mathcal{E}_X}) = \mathbb{E}_X \Big[ D\big( \rho_{\mathcal{E}_X} \,\big\|\, \mathbb{E}_{X'}[\rho_{\mathcal{E}_{X'}}] \big) \Big],$

and monotonicity of the quantum relative entropy under the CPTP map $\mathcal{C}_t$ asserts

$\chi(X : \rho^t_{\mathcal{E}_X}) \leq \chi\big( X : (\mathcal{E}_X \otimes \mathcal{I}) \rho^{t-1}_{\mathcal{E}_X} \big).$

This effectively allows us to ignore the $t$-th quantum operation $\mathcal{C}_t$ and instead exposes the $t$-th invocation of $\mathcal{E} \otimes \mathcal{I}$. We analyze the two remaining terms of the Holevo $\chi$ separately. Let us define the notation $\operatorname{tr}_{\leq m}$ as the partial trace over the first $m$ qubits, and $\operatorname{tr}_{> m}$ as the partial trace over the rest of the qubits. Subadditivity of the von Neumann entropy $S(\rho)$ [9, 18, 51] implies

$S\big( \mathbb{E}_X [(\mathcal{E}_X \otimes \mathcal{I}) \rho^{t-1}_{\mathcal{E}_X}] \big) \leq S\big( \operatorname{tr}_{\leq m} \mathbb{E}_X [(\mathcal{E}_X \otimes \mathcal{I}) \rho^{t-1}_{\mathcal{E}_X}] \big) + m \log 2 = S\big( \mathbb{E}_X [\operatorname{tr}_{\leq n} \rho^{t-1}_{\mathcal{E}_X}] \big) + m \log 2.$

The inequality uses the fact that the maximum entropy of an $m$-qubit system is at most $m \log 2$. The last equality is due to the following technical observation (the action of a CPTP map can be traced out).
Lemma 1. Fix a CPTP map $\mathcal{E}$ from $n$ qubits to $m$ qubits and let $\mathcal{I}$ denote the identity map on $n' \geq 0$ qubits. Then, $\operatorname{tr}_{\leq m}[(\mathcal{E} \otimes \mathcal{I}) \rho] = \operatorname{tr}_{\leq n}[\rho]$ for any $(n + n')$-qubit state $\rho$.
Proof. Let $\mathcal{E}(\rho) = \sum_i K_i \rho K_i^\dagger$ be a Kraus representation of the CP map $\mathcal{E}$. Trace preservation moreover implies $\sum_i K_i^\dagger K_i = I$. For any input state $\rho$, linearity and (partial) cyclicity of the partial trace then ensure

$\operatorname{tr}_{\leq m}[(\mathcal{E} \otimes \mathcal{I}) \rho] = \sum_i \operatorname{tr}_{\leq m}\big[ (K_i \otimes I) \rho (K_i^\dagger \otimes I) \big] = \operatorname{tr}_{\leq n}\Big[ \Big( \sum_i K_i^\dagger K_i \otimes I \Big) \rho \Big] = \operatorname{tr}_{\leq n}[\rho].$

This concludes the proof of the lemma.
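Lemma 1 is easy to verify numerically. The sketch below is our own illustration (the helper names are ours, not the paper's); it draws a random CPTP map from a random isometry and compares the two partial traces:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_channel(n_in, n_out, n_kraus, rng):
    """Random CPTP map: split a random isometry into Kraus blocks."""
    d_in, d_out = 2**n_in, 2**n_out
    g = rng.normal(size=(n_kraus * d_out, d_in)) \
        + 1j * rng.normal(size=(n_kraus * d_out, d_in))
    v, _ = np.linalg.qr(g)                       # v^dag v = I  =>  sum K^dag K = I
    return [v[i * d_out:(i + 1) * d_out, :] for i in range(n_kraus)]

def ptrace_first(rho, k, total):
    """Trace out the first k of `total` qubits."""
    d1, d2 = 2**k, 2**(total - k)
    return np.trace(rho.reshape(d1, d2, d1, d2), axis1=0, axis2=2)

n, m, n_anc = 2, 3, 2                            # E: n qubits -> m qubits, plus ancillas
kraus = random_channel(n, m, 4, rng)
d = 2**(n + n_anc)
psi = rng.normal(size=d) + 1j * rng.normal(size=d)
rho = np.outer(psi, psi.conj()) / np.vdot(psi, psi)   # random pure input state

# apply E (x) I and compare the two partial traces of Lemma 1
out = sum(np.kron(K, np.eye(2**n_anc)) @ rho @ np.kron(K, np.eye(2**n_anc)).conj().T
          for K in kraus)
lhs = ptrace_first(out, m, m + n_anc)            # tr_{<=m}[(E (x) I) rho]
rhs = ptrace_first(rho, n, n + n_anc)            # tr_{<=n}[rho]
assert np.allclose(lhs, rhs)
```

The assertion holds for every random seed, reflecting that the identity is an exact algebraic statement rather than a statistical one.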
Similarly, the second term can be lower bounded by the Araki-Lieb inequality:

$S\big( (\mathcal{E}_X \otimes \mathcal{I}) \rho^{t-1}_{\mathcal{E}_X} \big) \geq S\big( \operatorname{tr}_{\leq m} [(\mathcal{E}_X \otimes \mathcal{I}) \rho^{t-1}_{\mathcal{E}_X}] \big) - m \log 2 = S\big( \operatorname{tr}_{\leq n} \rho^{t-1}_{\mathcal{E}_X} \big) - m \log 2.$

We can combine these two bounds with the monotonicity of the quantum relative entropy (under the partial trace $\operatorname{tr}_{\leq n}$, which is itself a CPTP map) to obtain

$\chi(X : \rho^t_{\mathcal{E}_X}) \leq \chi\big( X : \operatorname{tr}_{\leq n} \rho^{t-1}_{\mathcal{E}_X} \big) + 2 m \log 2 \leq \chi(X : \rho^{t-1}_{\mathcal{E}_X}) + 2 m \log 2.$

Plugging in the induction hypothesis completes the argument:

$\chi(X : \rho^t_{\mathcal{E}_X}) \leq 2(t-1) m \log 2 + 2 m \log 2 = 2 t m \log 2.$

Combining Bound (C20) with the mutual information bounds from the previous paragraphs yields a lower bound on the minimal query complexity in terms of the packing net size:

$N_Q \geq \frac{\tfrac{2}{3} \log |\mathcal{M}_{4\epsilon}(\mathcal{F}_f)| - H_2(1/3)}{2 m \log 2} = \Omega\big( \log |\mathcal{M}_{4\epsilon}(\mathcal{F}_f)| / m \big). \quad (C37)$

c. Proof strategy II: polynomial method

The second proof is based on an analysis of polynomials [16]. It leads to somewhat weaker results that only apply if $n \leq m$. We include this derivation for completeness, as we believe it may be insightful for the interested reader.
Let us start by recalling that we may embed a $4\epsilon$-packing net $\mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ within the set of target functions $\mathcal{F}_f$. Geometrically, this means that each $f_{\mathcal{E}} \in \mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ is the center of a ball of squared radius $\epsilon$ (defined with respect to the average squared prediction error). According to the defining property (C10), these balls do not overlap. We can use these disjoint balls to cluster different quantum machine learning solutions. Define

$F^Q_{\mathcal{E}} = \big\{ a \in A : \mathbb{E}_{x \sim \mathcal{D}} \, |h_{Q,a}(x) - f_{\mathcal{E}}(x)|^2 \leq \epsilon \big\},$

where $A$ is a placeholder for all possible answers the quantum machine learning model can provide; see the definition given in Equation (C7). The packing net condition (C10) ensures that different clusters are completely disjoint: for distinct $\mathcal{E}_1, \mathcal{E}_2 \in \mathcal{F}$ and $a_1 \in F^Q_{\mathcal{E}_1}$, $a_2 \in F^Q_{\mathcal{E}_2}$, two triangle inequalities and Equation (C10) yield $a_1 \neq a_2$, i.e., $F^Q_{\mathcal{E}_1} \cap F^Q_{\mathcal{E}_2} = \emptyset$. We will use this insight to reason about an auxiliary matrix $P$ of size $|\mathcal{M}_{4\epsilon}(\mathcal{F}_f)| \times |\mathcal{M}_{4\epsilon}(\mathcal{F}_f)|$. We label rows and columns by packing net elements $f_{\mathcal{E}_i}$ with $i = 1$ (rows) or $i = 2$ (columns). For each pair $f_{\mathcal{E}_1}, f_{\mathcal{E}_2}$, let $P_{\mathcal{E}_1, \mathcal{E}_2}$ denote the probability of a mix-up between $\mathcal{E}_1$ and $\mathcal{E}_2$. Such mix-ups occur if the underlying CPTP map is $\mathcal{E}_2$, but the quantum ML model outputs an answer $a \in F^Q_{\mathcal{E}_1}$ that belongs to the cluster associated with $\mathcal{E}_1$:

$P_{\mathcal{E}_1, \mathcal{E}_2} = \sum_{a \in F^Q_{\mathcal{E}_1}} \operatorname{tr}(P_a \, \rho_{\mathcal{E}_2}). \quad (C43)$

Here, $\rho_{\mathcal{E}_2}$ is the outcome state of the quantum ML model (trained on CPTP map $\mathcal{E}_2$) and $P_a$ is the POVM element associated with predicting $a$. Recall that the main assumption on the quantum ML model is that it predicts accurately with probability at least $2/3$. This implies $P_{\mathcal{E}, \mathcal{E}} \geq 2/3$ for every diagonal element, while each column sum over off-diagonal matrix elements is strictly smaller:

$\sum_{\mathcal{E}_1 \neq \mathcal{E}_2} P_{\mathcal{E}_1, \mathcal{E}_2} = \sum_{\mathcal{E}_1 \neq \mathcal{E}_2} \, \sum_{a \in F^Q_{\mathcal{E}_1}} \operatorname{tr}(P_a \, \rho_{\mathcal{E}_2}) \leq 1 - \sum_{a \in F^Q_{\mathcal{E}_2}} \operatorname{tr}(P_a \, \rho_{\mathcal{E}_2}) = 1 - P_{\mathcal{E}_2, \mathcal{E}_2} \leq \frac{1}{3}.$
These bounds use the definition of the matrix $P$ together with the observation that distinct clusters are disjoint ($F^Q_{\mathcal{E}_1} \cap F^Q_{\mathcal{E}_2} = \emptyset$) and $\sum_a P_a = I$. We conclude that the $|\mathcal{M}_{4\epsilon}(\mathcal{F}_f)| \times |\mathcal{M}_{4\epsilon}(\mathcal{F}_f)|$ matrix $P$ is diagonally dominant. Such matrices are guaranteed to be non-singular, i.e., they have full rank. This is a suitable starting point for analyzing the probabilities $\operatorname{tr}(P_a \rho_{\mathcal{E}})$ via a polynomial method [16]. Let $\mathcal{E}(\rho) = \sum_i K_i^{\mathcal{E}} \rho (K_i^{\mathcal{E}})^\dagger$ with $K_i^{\mathcal{E}} \in \mathbb{C}^{2^m \times 2^n}$ be a Kraus representation of a fixed CPTP map $\mathcal{E} \in \mathcal{F}$. This representation is parametrized by (at most) $2^{n+m} \times 2^{n+m} = 2^{2(n+m)}$ complex parameters, which we collect in a vector $z_{\mathcal{E}} \in \mathbb{C}^{2^{2(n+m)}}$. On a high level, we parametrize inputs to the quantum ML model by such vectors. After training, the probability of obtaining answer $a \in A$ corresponds to a polynomial that is homogeneous of degree $N_Q$ in $z_{\mathcal{E}}$ and of degree $N_Q$ in $\bar{z}_{\mathcal{E}}$:

$\operatorname{tr}(P_a \, \rho_{\mathcal{E}}) = w_a^\dagger \, (z_{\mathcal{E}} \otimes \bar{z}_{\mathcal{E}})^{\otimes N_Q},$

where $w_a^\dagger$ is a dual tensor product vector of compatible dimension $N = \big( 2^{2(n+m)} \big)^{2 N_Q} = 2^{4(n+m) N_Q}$. Every matrix element $P_{\mathcal{E}_1, \mathcal{E}_2}$ of $P$ defined in Equation (C43) can hence be expressed as a sum of homogeneous polynomials in $(z_{\mathcal{E}_2} \otimes \bar{z}_{\mathcal{E}_2})^{\otimes N_Q}$. Collecting all $M = |\mathcal{M}_{4\epsilon}(\mathcal{F}_f)|$ possible tensor products allows us to present the multilinear characterization of all entries of $P$ in a single display:

$P = U^\dagger V, \quad \text{where } U = [u_1, \ldots, u_M], \; u_i = \sum_{a \in F^Q_{\mathcal{E}_i}} w_a, \quad V = [v_1, \ldots, v_M], \; v_j = (z_{\mathcal{E}_j} \otimes \bar{z}_{\mathcal{E}_j})^{\otimes N_Q},$

with $U, V \in \mathbb{C}^{N \times M}$. Above, we have shown that the $M \times M$ matrix $P$ has full rank. Since $\operatorname{rank}(P) \leq \min(M, N)$, this is only possible if $|\mathcal{M}_{4\epsilon}(\mathcal{F}_f)| = M \leq N = 2^{4(n+m) N_Q}$.
Rearranging these terms and assuming $n \leq m$ implies the following lower bound on the quantum query complexity $N_Q$:

$N_Q \geq \frac{\log |\mathcal{M}_{4\epsilon}(\mathcal{F}_f)|}{4 (n+m) \log 2} \geq \frac{\log |\mathcal{M}_{4\epsilon}(\mathcal{F}_f)|}{8 m \log 2}.$

2. Information-theoretic upper bound for restricted classical machine learning models

We will focus on restricted classical ML models that can select inputs $x_i \in \{0,1\}^n$ and obtain the corresponding outcomes $o_i \in \mathbb{R}$. Each outcome is obtained by performing a single-shot measurement of the observable $O$ (the projective measurement given by the eigenbasis of $O$) on the output quantum state $\mathcal{E}(|x_i\rangle\langle x_i|)$. This ensures

$\mathbb{E}[o_i] = \operatorname{tr}\big( O \, \mathcal{E}(|x_i\rangle\langle x_i|) \big) = f_{\mathcal{E}}(x_i) \quad \text{and} \quad o_i \in [-1, 1].$

The restricted classical ML model outputs the function in $\mathcal{M}_{4\epsilon}(\mathcal{F}_f)$, the maximal packing net defined in Section C 1 a, that minimizes the empirical training error. The packing net is a subset of the set $\mathcal{F}_f$ that contains functions that are sufficiently different from one another. By "empirical training error" we mean the deviation of the function $f(x_i)$ from the actual measurement outcome $o_i$, averaged over $N$ data points:

$\frac{1}{N} \sum_{i=1}^{N} |f(x_i) - o_i|^2.$

In the later discussion, we will also refer to the ideal training error, meaning the average deviation of the function $f(x_i)$ from the expectation value $f_{\mathcal{E}}(x_i)$:

$\frac{1}{N} \sum_{i=1}^{N} |f(x_i) - f_{\mathcal{E}}(x_i)|^2.$

This distinction is important: the ideal training error of the best packing net element can be close to zero as long as the maximal packing net $\mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ is closely packed, but because of the statistical fluctuations in the quantum measurements, the outcomes $\{o_i\}$ can deviate substantially from the expectation values $f_{\mathcal{E}}(x_i)$. Therefore we might not be able to achieve a small empirical training error even if $f = f_{\mathcal{E}}$.
In the following, we provide a tight statistical analysis bounding the prediction error $\mathbb{E}_{x \sim \mathcal{D}} |f^*(x) - f_{\mathcal{E}}(x)|^2$ of the packing net element $f^*$ selected by the restricted classical ML model. The statistical analysis relies crucially on the distance measure used to define the packing net $\mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ (recall that this is the average squared distance over the input distribution $\mathcal{D}$; see Equation (C10)), and on the statistical fluctuations incurred by performing quantum measurements to obtain the $o_i$. In particular, we will show that a data size of $N = \Theta(\log(|\mathcal{M}_{4\epsilon}(\mathcal{F}_f)|)/\epsilon)$ suffices to achieve prediction errors of order $\mathcal{O}(\epsilon)$. We find it worthwhile to point out that this scaling is better than one might expect: standard results in statistical learning theory [15, 68] usually yield a data size of order $\log(|\mathcal{M}_{4\epsilon}(\mathcal{F}_f)|)/\epsilon^2$, which is worse than our result by an additional $1/\epsilon$ factor.

b. Concentration results I: Ideal training error
We begin by considering the concentration of the ideal training error

$\frac{1}{N} \sum_{i=1}^{N} |f(x_i) - f_{\mathcal{E}}(x_i)|^2$

for an arbitrary function $f$ from the maximal packing net $\mathcal{M}_{4\epsilon}(\mathcal{F}_f)$. This quantity only depends on the inputs $x_1, \ldots, x_N$ and is independent of the observable measurement outcomes $o_i$. We use the qualifier ideal because we compare directly with the expectation value $f_{\mathcal{E}}(x_i)$ rather than the measurement outcome $o_i$. As a first step, view $|f(x_i) - f_{\mathcal{E}}(x_i)|^2$ as a random variable and check that it is bounded: $|f(x_i) - f_{\mathcal{E}}(x_i)|^2 \leq 4$, because both functions take values in $[-1, 1]$. This implies the following bound on the variance:

$\operatorname{Var}\big[ |f(x_i) - f_{\mathcal{E}}(x_i)|^2 \big] \leq \mathbb{E}_{x \sim \mathcal{D}} \, |f(x) - f_{\mathcal{E}}(x)|^4 \leq 4 \, \mathbb{E}_{x \sim \mathcal{D}} \, |f(x) - f_{\mathcal{E}}(x)|^2.$

We see that the ideal training error is a sum of independent random variables with bounded variance. Bernstein's inequality then provides, for every $t > 0$, a tail bound of the form

$\Pr\Big[ \Big| \frac{1}{N} \sum_{i=1}^N |f(x_i) - f_{\mathcal{E}}(x_i)|^2 - \mathbb{E}_{x \sim \mathcal{D}} |f(x) - f_{\mathcal{E}}(x)|^2 \Big| > t \Big] \leq 2 \exp\Big( \frac{-N t^2 / 2}{4 \, \mathbb{E}_{x \sim \mathcal{D}} |f(x) - f_{\mathcal{E}}(x)|^2 + 4t/3} \Big).$

Tail bounds of this form (covering different regimes) and a union bound over the packing net imply that, for a training data size as in Equation (C66), with probability at least $1 - \delta$, the ideal training error of every $f \in \mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ simultaneously deviates from its expectation $\mathbb{E}_{x \sim \mathcal{D}} |f(x) - f_{\mathcal{E}}(x)|^2$ by at most an amount of order $\epsilon$ plus a constant fraction of that expectation; this is the content of Equation (C67). Intuitively, this can be understood as follows: the ideal training error of functions $f$ close to $f_{\mathcal{E}}$ is distorted by at most $\mathcal{O}(\epsilon)$, while functions $f$ that are further away from $f_{\mathcal{E}}$ are distorted by a value proportional to the distance. We note that the training data size $N$ in Equation (C66) scales as $1/\epsilon$ rather than $1/\epsilon^2$, an improvement over the standard scaling typically encountered in statistical learning theory [15,68]. The $1/\epsilon^2$ scaling arises naturally when we sample over the different inputs $x_i$ and apply a naive concentration inequality to the ideal training error. The main reason for the improved scaling is that any function $f$ with a small prediction error $\mathbb{E}_{x \sim \mathcal{D}} |f(x) - f_{\mathcal{E}}(x)|^2 \leq 4\epsilon$ also has a small variance (much as a highly biased coin that almost always comes up heads has a variance close to zero, far smaller than that of an unbiased coin), so we only need $N = \mathcal{O}(1/\epsilon)$ samples to achieve $\mathcal{O}(\epsilon)$ statistical fluctuations. Furthermore, by examining Equation (C67), we see that if the function $f$ has a large prediction error, then the statistical fluctuations in the training data may also be large.
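The variance-based gain can be made explicit. The following short computation (with generic constants of our own choosing) shows why $N \propto 1/\epsilon$ suffices:

```latex
% Let Y_i = |f(x_i) - f_E(x_i)|^2 \in [0,4], let \mu = E[Y_i], and recall
% Var[Y_i] \le 4\mu. Bernstein's inequality gives, for t > 0,
\Pr\Big[\Big|\tfrac{1}{N}\textstyle\sum_{i} Y_i - \mu\Big| > t\Big]
    \le 2\exp\Big(-\frac{N t^2/2}{4\mu + 4t/3}\Big).
% Choosing the deviation t = (\mu + \epsilon)/2, the exponent satisfies
%   \frac{N t^2/2}{4\mu + 4t/3} \ge c\,N\epsilon
% for an absolute constant c > 0, in both regimes \mu \le \epsilon and
% \mu > \epsilon. Hence N = O(\log(|\mathcal{M}_{4\epsilon}(F_f)|/\delta)/\epsilon)
% samples keep the ideal training error of every packing-net element
% within (\mu + \epsilon)/2 of its mean simultaneously, with probability 1-\delta.
```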
This is the price we pay to avoid a training data size $N$ scaling as $1/\epsilon^2$. The increased statistical fluctuations for a function $f$ with a large prediction error are not problematic, because the statistical fluctuations are still smaller than the prediction error in that case. In summary, functions with small prediction error have small ideal training error, while functions with large prediction error have large ideal training error, which is adequate for our purposes.
We condition on the event that the display in Equation (C67) holds true and proceed to the second step.

c. Concentration results II: Shifted empirical training error

In the second step, we condition on a set of inputs $x_1, \ldots, x_N$ and study the concentration of the statistical fluctuations in the observable measurement outcomes $o_i$. Let us define a new quantity, which we call the shifted empirical training error:

$\frac{1}{N} \sum_{i=1}^{N} \big( |f(x_i) - o_i|^2 - |f_{\mathcal{E}}(x_i) - o_i|^2 \big),$

where $f$ can be any function in the packing net $\mathcal{M}_{4\epsilon}(\mathcal{F}_f)$. The expectation value of the shifted empirical training error can be computed by means of direct expansion. Use

$|f(x_i) - o_i|^2 - |f_{\mathcal{E}}(x_i) - o_i|^2 = \big( f(x_i) - f_{\mathcal{E}}(x_i) \big) \big( f(x_i) + f_{\mathcal{E}}(x_i) - 2 o_i \big)$

and $\mathbb{E}[o_i] = f_{\mathcal{E}}(x_i)$ to conclude that the expectation over the measurement outcomes equals the ideal training error $\tfrac{1}{N} \sum_i |f(x_i) - f_{\mathcal{E}}(x_i)|^2$. For fixed input $x_i$, we can also bound the variance:

$\operatorname{Var}\big[ |f(x_i) - o_i|^2 - |f_{\mathcal{E}}(x_i) - o_i|^2 \big] \leq \mathbb{E}\Big[ |f(x_i) - f_{\mathcal{E}}(x_i)|^2 \, \big| f(x_i) + f_{\mathcal{E}}(x_i) - 2 o_i \big|^2 \Big] \leq 16 \, |f(x_i) - f_{\mathcal{E}}(x_i)|^2.$

The last inequality is contingent on $\|O\| \leq 1$, which implies $o_i \in [-1, 1]$ with probability one. Now, we apply Bernstein's inequality again: the $o_i$'s are independent, bounded random variables with small variance, so the shifted empirical training error concentrates sharply around the ideal training error, as stated in Equation (C76). These results are conditioned on $x_1, \ldots, x_N$ already being sampled (the result of the first step) such that the event given in Equation (C67) holds.

d. Prediction error for functions in the maximal packing net
Before bounding the prediction error of the function $f^*$ obtained by the restricted classical ML model, we need to show that the following events happen simultaneously with high probability.
• Event 1: There exists a function $\hat{f} \in \mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ with small prediction error whose empirical training error is upper bounded by a certain threshold; this is the content of Equation (C78).

• Event 2: All functions $f \in \mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ with a large prediction error (at least $12\epsilon$) have an empirical training error lower bounded by a certain threshold; this is the content of Equation (C79).

Then, we can combine these statements to obtain a bound on the prediction error of $f^*$. Let us first relate the packing net to another useful concept.
Lemma 2 (maximal packing nets are covering nets). For every $f_{\mathcal{E}} \in \mathcal{F}_f$ there exists $\hat{f} \in \mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ such that $\mathbb{E}_{x \sim \mathcal{D}} |\hat{f}(x) - f_{\mathcal{E}}(x)|^2 \leq 4\epsilon$.

The proof is standard, see e.g. [90], and is based on contradicting the assumption that $\mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ is a maximal packing net. Since it is short and insightful, we include the full proof for completeness.
Proof of Lemma 2. Suppose there exists $f_{\mathcal{E}} \in \mathcal{F}_f$ such that for all $\hat{f} \in \mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ we have

$\mathbb{E}_{x \sim \mathcal{D}} \, |\hat{f}(x) - f_{\mathcal{E}}(x)|^2 > 4\epsilon.$

Then we can add $f_{\mathcal{E}}$ to the packing net $\mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ without violating the packing condition (C10). Hence $\mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ is not a maximal packing net, a contradiction.
For Event 1 in Equation (C78), we want to show the existence of a function $\hat{f}$ that has a small prediction error as well as an empirical training error upper bounded by a threshold. Because $\mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ is a maximal packing net, Lemma 2 supplies a function $\hat{f}$ with prediction error $\mathbb{E}_{x \sim \mathcal{D}} |\hat{f}(x) - f_{\mathcal{E}}(x)|^2 \leq 4\epsilon$. We now condition on Equation (C67) being true, which happens with probability at least $1 - \delta$; this yields a corresponding upper bound on the ideal training error of $\hat{f}$. We can then use this insight to control the shifted empirical training error: a combination of Equations (C76) and (C77) bounds it from above. The first inequality in this chain follows from separately analyzing the two cases of small and large ideal training error and taking the looser statement; the second inequality arises from inserting the lower bound on the training data size $N$ from Equation (C66). Therefore, if the display in Equation (C67) is true (which happens with probability at least $1 - \delta$), then the empirical training error of $\hat{f}$ is upper bounded by the threshold in Equation (C78). This is Event 1, which we set out to establish, and it is guaranteed to happen with high probability.

We now move on to Event 2, given in Equation (C79). For any $f \in \mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ with a large prediction error, we want to show that the empirical training error will also be large. We again condition on the event displayed in Equation (C67) (which happens with probability at least $1 - \delta$); this relation implies a corresponding lower bound on the ideal training error. Using the concentration result from Equation (C76) together with the training data size bound from Equation (C66), the empirical training error is lower bounded by the threshold in Equation (C79). We combine these insights by applying a union bound, obtaining that the desired events given in Equations (C78) and (C79) happen simultaneously with probability at least $1 - \delta$, conditioned on Equation (C67) being true.
Furthermore, because the event in display (C67) itself happens with probability at least $1 - \delta$, we can guarantee that the desired events given in Equations (C78) and (C79) hold unconditionally with probability at least $1 - 2\delta$. To conclude, we have shown that using training data of size $N \geq 38 \log(2 |\mathcal{M}_{4\epsilon}(\mathcal{F}_f)| / \delta) / \epsilon$ guarantees that the relations given in Equations (C78) and (C79) hold with probability at least $1 - 2\delta$.

e. Prediction error for functions produced by restricted classical ML
Let us choose a training data size $N \geq 38 \log(4 |\mathcal{M}_{4\epsilon}(\mathcal{F}_f)| / \delta) / \epsilon$, such that the two relations in Equations (C78) and (C79) are both true with probability at least $1 - \delta$. We now combine these with two other concepts from the previous subsections. Let $f_{\mathcal{E}}(x)$ be the actual target function and recall that (at least) one packing net element $\hat{f} \in \mathcal{M}_{4\epsilon}(\mathcal{F}_f)$ is guaranteed to be close: $\mathbb{E}_{x \sim \mathcal{D}} |\hat{f}(x) - f_{\mathcal{E}}(x)|^2 \leq 4\epsilon$, according to Lemma 2. The restricted classical ML model tries to identify such a packing net element by minimizing the empirical training error:

$f^* = \operatorname{arg\,min}_{f \in \mathcal{M}_{4\epsilon}(\mathcal{F}_f)} \; \frac{1}{N} \sum_{i=1}^{N} |f(x_i) - o_i|^2.$

The first relation, Equation (C78), allows us to take it from there: the minimum achieved by $f^*$ is trivially at most the empirical training error of $\hat{f}$, and hence at most the threshold of Equation (C78). Applying the second relation, Equation (C79), completes the chain of arguments. In words, the empirical training error achieved by $f^*$, the output of the restricted classical ML model, is strictly smaller than the empirical training error achieved by any packing net function with a comparatively large prediction error (at least $12\epsilon$). Therefore, if $f^*$ had a prediction error of at least $12\epsilon$, its empirical training error would simultaneously lie below and above the respective thresholds, a contradiction. This implies that the prediction error achieved by $f^*$ is smaller than $12\epsilon$.
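Empirical risk minimization over a finite dictionary, as used here, is easy to prototype. The following self-contained sketch (with made-up stand-ins for the packing net and the target function, not taken from the paper) illustrates that the empirical minimizer lands on a function with small prediction error:

```python
import random

rng = random.Random(7)
n_points = 64
xs = range(n_points)

# stand-in "packing net": 20 random functions into [-1, 1];
# pretend the target function itself sits in the net
net = [[rng.uniform(-1, 1) for _ in xs] for _ in range(20)]
f_true = net[0]

def sample(x):
    """single-shot +/-1 outcome with expectation f_true[x]"""
    return 1 if rng.random() < (1 + f_true[x]) / 2 else -1

# training data: uniformly random inputs with single-shot outcomes
N = 2000
data = []
for _ in range(N):
    x = rng.randrange(n_points)
    data.append((x, sample(x)))

def empirical_err(f):
    return sum((f[x] - o) ** 2 for x, o in data) / N

def prediction_err(f):
    return sum((f[x] - f_true[x]) ** 2 for x in xs) / n_points

f_star = min(net, key=empirical_err)   # empirical risk minimization
```

Even though every empirical training error is inflated by the shot noise of the $\pm 1$ outcomes, the inflation is (in expectation) the same constant for every candidate, so the minimizer still singles out a function close to the target.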

3. Combining the upper and lower bound
If a quantum ML model produces a prediction $h_Q$ achieving average prediction error $\mathbb{E}_{x \sim \mathcal{D}} |h_Q(x) - f_{\mathcal{E}}(x)|^2 \leq \epsilon$ with probability at least $2/3$ for any CPTP map $\mathcal{E} \in \mathcal{F}$, then, as proven in Equation (C37), the quantum ML model must access the map $\mathcal{E}$ at least $N_Q$ times, where

$N_Q = \Omega\big( \log |\mathcal{M}_{4\epsilon}(\mathcal{F}_f)| / m \big).$

On the other hand, from Proposition 1, we know there is a restricted classical ML model producing a prediction $h_C$ achieving small average prediction error with high probability for any CPTP map $\mathcal{E} \in \mathcal{F}$, such that the restricted classical ML model accesses the map $N_C$ times, where

$N_C = \mathcal{O}\big( \log(|\mathcal{M}_{4\epsilon}(\mathcal{F}_f)|) / \epsilon \big).$

This concludes the proof of Theorem 1.

Appendix D: Proof of Proposition 2

Proposition 2 follows from constructing a stylized learning problem that admits the largest possible separation (albeit only a small polynomial factor). We first introduce the problem, then discuss quantum and classical strategies (and their limitations). We focus on restricted classical ML models, because Theorem 1 also considers restricted classical ML models. We leave open the question of whether the separation between unrestricted classical ML and quantum ML is tight.
a. Learning problem formulation: Fix $\epsilon \in (0, 1/3)$, let $m$ be the integer in the statement of Proposition 2, and set $n = m - 1$. We consider a set of CPTP maps $\mathcal{F} = \{ \mathcal{E}_a : a \in \{0,1\}^n \}$ containing $2^n$ elements, where each map in the set takes an $n$-qubit input to an $(n+1)$-qubit output. The map $\mathcal{E}_a$, labeled by the bit string $a \in \{0,1\}^n$, is comprised of $2 \times 2^n$ Kraus operators:

$K_{z,b} = \sqrt{ \tfrac{1}{2} \big( 1 + (-1)^{b + a \cdot z} \sqrt{3\epsilon} \big) } \; |b\rangle \otimes X^{a} |z\rangle\langle z|, \quad b \in \{0,1\}, \; z \in \{0,1\}^n. \quad (D2)$

Here, $X^a = X^{a_1} \otimes \cdots \otimes X^{a_n}$ is a tensor product of single-qubit bit flips, and $a \cdot z \in \{0,1\}$ denotes the inner product of bit strings in $\mathbb{Z}_2$. We also choose the $(n+1)$-qubit observable $O = Z \otimes I^{\otimes n}$, i.e., we measure the first qubit in the $Z$-basis and trace out the rest of the system. By construction, the resulting function admits a closed-form expression:

$f_a(x) = \operatorname{tr}\big( O \, \mathcal{E}_a(|x\rangle\langle x|) \big) = \sqrt{3\epsilon} \, (-1)^{a \cdot x}. \quad (D3)$

We take $\mathcal{D}$ to be the uniform distribution over the $n$-bit inputs.
b. Upper bound on the quantum query complexity: The above learning problem is easy to solve in the quantum realm. Since the set of CPTP maps $\mathcal{F} = \{ \mathcal{E}_a : a \in \{0,1\}^n \}$ is known, it suffices to extract the label $a \in \{0,1\}^n$ of the underlying CPTP map. Once $a$ is known, the closed-form expression (D3) allows us to predict future function values $f_a(x)$ efficiently and with perfect accuracy, regardless of the input $x \in \{0,1\}^n$. Quantum computers are well equipped to extract the label $a$. In fact, a single query to the unknown CPTP map $\mathcal{E}_a$ suffices, by executing the following simple procedure:

1. prepare the all-zero state on $n$ qubits: $\rho_0 = |0, \ldots, 0\rangle\langle 0, \ldots, 0|$;
2. query $\mathcal{E}_a$ and apply it to $\rho_0$: $\rho_1 = \tfrac{1}{2} (I + \sqrt{3\epsilon} \, Z) \otimes |a\rangle\langle a|$, according to Equation (D2);
3. throw away (trace out) the first qubit to obtain the $n$ remaining ones: $\rho_2 = |a\rangle\langle a|$;
4. perform a computational basis measurement to extract $a \in \{0,1\}^n$ with probability one.
We see that a single quantum query ($N_Q = 1$) suffices to extract the label $a$ with certainty. Subsequently, we can make efficient and perfect predictions via the closed-form expression (D3): $h_Q(x) = f_a(x) = \sqrt{3\epsilon} \, (-1)^{a \cdot x}$. In words, $N_Q = 1$ allows for training a quantum ML model $h_Q(x)$ that achieves zero prediction error for all input distributions (perfect prediction). This concrete ML model is also optimal, because $N_Q = 1$ is the smallest number of queries conceivable ($N_Q = 0$ would not reveal any information about the underlying CPTP map).

c. Lower bound on the classical query complexity: Let us now turn to potential classical strategies for solving the above learning problem. In contrast to the previous paragraph, we will not construct an explicit strategy. Instead, we will use ideas similar to Appendix C 1 b to establish a fundamental lower bound.
Recall that the input distribution D is taken to be the uniform distribution. Also, for each E_a ∈ F, the underlying function f_a(x) = tr(O E_a(|x⟩⟨x|)) = √(3ε)(−1)^{a·x} admits a closed-form expression, see Equation (D3). For any pair of distinct labels a ≠ b, this implies

E_{x∼D} |f_a(x) − f_b(x)|² = 3ε E_{x∼D} |(−1)^{a·x} − (−1)^{b·x}|² = 6ε,

because 2^{−n} Σ_{x∈{0,1}^n} |c·x|² = 1/2 for all n-bit strings c = a ⊕ b ≠ (0,…,0). Now, suppose that a restricted classical ML model can utilize training data to achieve E_{x∼D} |h_C(x) − f_a(x)|² ≤ ε with high probability for any label a ∈ {0,1}^n. Then, this model would also allow us to identify the underlying label.
By checking whether E_{x∼D} |h_C(x) − f_b(x)|² ≤ ε for every possible value b ∈ {0,1}^n, the restricted classical ML model allows us to recover the underlying bit-string label a ∈ {0,1}^n with high probability: for any b ≠ a, the triangle inequality implies

E_{x∼D} |h_C(x) − f_b(x)|² ≥ (√(6ε) − √ε)² = (√6 − 1)² ε. (D6)

For this part of the argument, what is essential is that the right-hand side of the inequality in Equation (D6) is greater than ε. If we replace 3ε in Equation (D2) by αε, where α is a constant, we require √(2α) − 1 > 1, i.e. α > 2. We chose α = 3 merely for convenience.
For a uniformly random hidden bit string a ∈ {0,1}^n, we can use the restricted classical ML model to obtain the training data {(x_i, o_i)}_{i=1}^{N_C} and determine the underlying bit string a. We assume that the restricted classical ML model first queries x₁ and obtains o₁, then queries x₂ and obtains o₂, and so on. Each o_i ∈ {+1, −1} is a single-shot outcome for measuring the observable O on the state E_a(|x_i⟩⟨x_i|) in the eigenbasis of O = Z ⊗ I^{⊗n}:

Pr[o_i = ±1] = (1 ± (−1)^{a·x_i} √(3ε))/2. (D7)

Because we can use the training data to determine a with high probability (by the assumption on the restricted classical ML model), Fano's inequality and the data processing inequality imply a lower bound on the mutual information between the training data and the CPTP-map label a:

I(a : {(x_i, o_i)}_{i=1}^{N_C}) ≥ Ω(n). (D8)

Next, using the chain rule of mutual information based on conditional mutual information, we have

I(a : {(x_i, o_i)}_{i=1}^{N_C}) = Σ_{i=1}^{N_C} [ I(a : x_i | {(x_j, o_j)}_{j=1}^{i−1}) + I(a : o_i | x_i, {(x_j, o_j)}_{j=1}^{i−1}) ] = Σ_{i=1}^{N_C} I(a : o_i | x_i, {(x_j, o_j)}_{j=1}^{i−1}). (D9)

The second equality follows from the fact that x_i is chosen by the restricted classical ML model using only the information of {(x_j, o_j)}_{j=1}^{i−1}, hence the input x_i does not provide any additional information about a, i.e., I(a : x_i | {(x_j, o_j)}_{j=1}^{i−1}) = 0. We now upper bound each remaining term. Because o_i is a two-outcome random variable, I(a : o_i | x_i, {(x_j, o_j)}_{j=1}^{i−1}) ≤ 1 − H(o_i | a, x_i) in bits. A closer inspection of Equation (D7) reveals that the probability of one outcome is p = (1 + √(3ε))/2 and the other is 1 − p (the value a·x_i ∈ {0,1} only ever permutes the two outcomes). This ensures

I(a : o_i | x_i, {(x_j, o_j)}_{j=1}^{i−1}) ≤ 1 − H₂(p) ≤ 3ε, (D12)

where H₂ denotes the binary entropy, and we conclude I(a : {(x_i, o_i)}_{i=1}^{N_C}) ≤ 3 N_C ε. Finally, we combine Eqs. (D8), (D9) and (D12) to conclude N_C ≥ Ω(n/ε). Therefore, recalling that the output size of our set of maps is m = n + 1, we have N_C ≥ Ω(m/ε) for a restricted classical ML model with small average prediction error. Proposition 2 follows from combining this assertion with the fact that the underlying learning problem does admit a perfect quantum solution with N_Q = 1, see Equation (D4).
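The per-query information bound used above is easy to check numerically. The following sketch evaluates 1 − H₂(p) for an arbitrary choice ε = 0.1 and compares it with the coarse bound 3ε (the constant in the Fano step is taken to be 1 for illustration):

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

eps = 0.1                                # arbitrary choice in (0, 1/3)
p = 0.5 * (1 + np.sqrt(3 * eps))         # outcome probability from (D7)
per_query = 1 - h2(p)                    # bits of information about a per query
assert per_query <= 3 * eps              # 1 - H2((1+x)/2) <= x^2 for x in [0,1]

# Fano-style counting: recovering the n-bit label requires ~n bits in total,
# so the number of classical queries obeys N_C >= n / per_query = Omega(n/eps).
n = 50
N_C_lower = int(np.ceil(n / per_query))
assert N_C_lower >= n / (3 * eps)
```

Each query thus supplies O(ε) bits about a, which is the source of the N_C = Ω(n/ε) scaling.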

Appendix E: Exponential separation for predicting expectation values of Pauli operators
In this section, we present an example demonstrating that an exponential information-theoretic quantum advantage is possible when we want to achieve a small worst-case prediction error.

Task description
The learning task is to train an ML model that allows accurate prediction of

tr(P_x ρ) for all x ∈ {I, X, Y, Z}^n,

where ρ is an unknown n-qubit state, X, Y, Z are the single-qubit Pauli operators, and P_x is the tensor product of Pauli operators given by x. This task does fit into the framework described in the main text. Every unknown quantum state ρ defines an unknown quantum channel E_ρ. The unknown quantum channel E_ρ takes a classical input x ∈ {I, X, Y, Z}^n, prepares the unknown quantum state ρ, and rotates the quantum state ρ according to the input x such that a Pauli-Z measurement on the first qubit is equivalent to measuring P_x on ρ. More precisely, we define

E_ρ(|x⟩⟨x|) = (C_x ρ C_x†) ⊗ |x⟩⟨x|,

where |x⟩ is an encoding of the classical input x (e.g., as a 2n-qubit computational basis state), and C_x is a Clifford unitary that satisfies

C_x† (Z ⊗ I^{⊗(n−1)}) C_x = P_x.

We can extend this definition linearly to all of quantum state space. The goal of the machine learning model is to produce f(x) such that

|f(x) − tr(P_x ρ)| ≤ ε for all x ∈ {I, X, Y, Z}^n.

We consider restricted classical ML models that can only obtain {(x_i, o_i)}, where the x_i are inputs denoting the Pauli operators, and o_i is the measurement outcome when measuring E_ρ(|x_i⟩⟨x_i|) with Z₁, which is equivalent to the measurement outcome of measuring ρ with P_{x_i}. Classical ML models can obtain {(x_i, o_i)}, where o_i is now the measurement outcome of an arbitrary POVM on the output state E_ρ(|x_i⟩⟨x_i|). Note that for restricted classical ML, o_i ∈ {+1, −1} ⊂ ℝ, but for classical ML, o_i lies in the set indexing the POVM elements. Quantum ML models can access the unknown quantum channel E_ρ at will. Recall that, according to Theorem 1, for achieving a small average prediction error with respect to any input distribution, no large quantum advantage in sample complexity can be found. In the following, we will show that to achieve a small worst-case prediction error, an exponential quantum advantage in sample complexity is possible.
In Section E 2, we give a simple quantum ML algorithm that can accurately predict the expectation values of all 4^n Pauli observables using only O(n) samples. In Sections E 3 and E 4, we will show that any classical ML algorithm requires at least an exponential number of samples to accurately predict the expectation values of all 4^n Pauli observables. In Section E 5, we will give a matching lower bound Ω(n) for quantum ML.

Sample complexity of a quantum ML algorithm
The quantum ML algorithm accesses the quantum channel E_ρ multiple times to obtain multiple copies of the underlying quantum state ρ. Each access to E_ρ allows us to obtain one copy of ρ. Then, the quantum ML algorithm performs a sequence of measurements on the copies of ρ to accurately predict tr(P_x ρ) for all x ∈ {I, X, Y, Z}^n. To this end, we will give a detailed proof for Theorem 4 given below. From Theorem 4, we only need O(log(100 × 4^n)/ε⁴) = O(n/ε⁴) copies to accurately predict all 4^n Pauli operators with probability at least 0.99. Hence we only need to access the quantum channel E_ρ O(n/ε⁴) times.
Theorem 4. For any M Pauli operators P₁,…,P_M, there is a procedure that produces estimates p̂₁,…,p̂_M with

|p̂_i − tr(P_i ρ)| ≤ ε for all i = 1,…,M,

with probability at least 1 − δ, by performing POVM measurements on O(log(M/δ)/ε⁴) copies of the unknown quantum state ρ.
The procedure takes in the Pauli operators one by one and produces the estimate p̂_i sequentially. Throughout the prediction process, the procedure maintains two blocks of memory: 1. Classical memory: We perform N₁ repetitions of Bell measurements on two copies of the quantum state, ρ ⊗ ρ. For repetition t, we go through every qubit, and measure the k-th qubit from the first and second copies in the Bell basis to obtain the outcome

S_k^{(t)} ∈ { |Ω_σ⟩⟨Ω_σ| : σ ∈ {I, X, Y, Z} },

where the Bell basis encompasses the four maximally entangled 2-qubit states: setting |Ω⟩ = (1/√2)(|00⟩ + |11⟩) (Bell state), we define |Ω_σ⟩ = (σ ⊗ I)|Ω⟩. We then efficiently store the measurement data S_k^{(t)}, ∀k = 1,…,n, ∀t = 1,…,N₁ in a classical memory with 2nN₁ bits. We use this block of memory to estimate |tr(Pρ)|² for any Pauli operator P.
2. Quantum memory: We store N 2 copies of the unknown quantum state ρ. We use this block of memory to estimate sign(tr(P ρ)) for any Pauli operator P .
Consider any tensor product of Pauli operators P = σ 1 ⊗ . . . ⊗ σ n , where σ k ∈ {I, X, Y, Z}, ∀k = 1, . . . , n. The classical memory is used to predict the absolute value of tr(P ρ), while the quantum memory is used to predict the sign of tr(P ρ). This allows us to obtain an accurate estimate for tr(P ρ). The following remark is central to the procedure.
Remark 1. If we find that the absolute value of tr(Pρ) is close to zero, then we need not use the quantum memory to predict the sign of tr(Pρ). By measuring the sign only for the Pauli observables whose expectation values have appreciable absolute value, we can find the sign for many Pauli observables without badly disturbing the copies of ρ stored in the quantum memory.
We proceed to give a detailed procedure for estimating the absolute value and the sign of tr(P i ρ).

a. Measuring absolute values
To understand how the absolute value of the expectation of a Pauli operator is estimated, first consider the case where ρ is the density operator of a single qubit, and suppose that ρ ⊗ ρ is measured in the Bell basis. The outcome S is the projector onto one of the four Bell states, as in Equation (E7). If σ ∈ {I, X, Y, Z} is any Pauli matrix, then each Bell state is an eigenstate of σ ⊗ σ with eigenvalue ±1. In the state S, the +1 eigenvalue of σ ⊗ σ occurs with probability Prob(+) = (1/2) tr((I ⊗ I + σ ⊗ σ)(ρ ⊗ ρ)), and the −1 eigenvalue occurs with probability Prob(−) = (1/2) tr((I ⊗ I − σ ⊗ σ)(ρ ⊗ ρ)). Therefore,

E[tr((σ ⊗ σ) S)] = Prob(+) − Prob(−) = tr((σ ⊗ σ)(ρ ⊗ ρ)) = tr(σρ)².

This observation can be generalized to the case where ρ is an n-qubit state, and each pair of qubits in ρ ⊗ ρ is measured in the Bell basis, yielding the outcomes {S_k : k = 1, 2,…, n}. If P = σ₁ ⊗ … ⊗ σ_n is a Pauli observable, then S_k is an eigenstate of σ_k ⊗ σ_k with eigenvalue ±1 for each k, just as in the n = 1 case; in particular, Π_{k=1}^n tr((σ_k ⊗ σ_k) S_k) ∈ {+1, −1}. This product is +1 when ⊗_{k=1}^n S_k is an eigenstate of P ⊗ P with eigenvalue +1, and it is −1 when ⊗_{k=1}^n S_k is an eigenstate of P ⊗ P with eigenvalue −1. Therefore,

E[ Π_{k=1}^n tr((σ_k ⊗ σ_k) S_k) ] = tr((P ⊗ P)(ρ ⊗ ρ)) = tr(Pρ)². (E10)

Because Equation (E10) relates the distribution of Bell measurement outcomes to |tr(Pρ)|, we can estimate |tr(Pρ)| accurately by repeating the Bell measurement on ρ ⊗ ρ sufficiently many times. Suppose we have altogether 2N₁ copies of ρ, and perform the Bell measurement on N₁ pairs of copies. We collect the measurement data {S_k^{(t)}} in the classical memory, where k = 1, 2,…, n labels the qubit pairs, and t = 1, 2,…, N₁ labels the repeated measurements.
For each Pauli observable P = σ₁ ⊗ … ⊗ σ_n, we define the corresponding empirical average

â(P) = (1/N₁) Σ_{t=1}^{N₁} Π_{k=1}^{n} tr((σ_k ⊗ σ_k) S_k^{(t)}),

which can be computed efficiently in time O(nN₁).
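The estimator â(P) can be simulated directly. The sketch below restricts to product states, so that the pairwise Bell-measurement outcomes are independent and easy to sample; the state, the observable P = X ⊗ Z ⊗ Y, and the shot count N₁ are all arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
paulis = {"I": I2, "X": X, "Y": Y, "Z": Z}

# Bell basis: |Omega_sigma> = (sigma x I)|Omega>, |Omega> = (|00>+|11>)/sqrt(2).
omega = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
bell = [np.kron(paulis[s], I2) @ omega for s in "IXYZ"]

def random_qubit_state(rng):
    # Random single-qubit density matrix (Bloch vector strictly inside the ball).
    v = rng.normal(size=3)
    v *= rng.random() / np.linalg.norm(v)
    return 0.5 * (I2 + v[0] * X + v[1] * Y + v[2] * Z)

n, N1 = 3, 20000
rhos = [random_qubit_state(rng) for _ in range(n)]
P = "XZY"  # hypothetical Pauli observable sigma_1 x ... x sigma_n

# Sample Bell outcomes per qubit pair (independent for a product state), then
# average prod_k tr((sigma_k x sigma_k) S_k^{(t)}) over the N1 repetitions.
est = np.ones(N1)
for k in range(n):
    pair = np.kron(rhos[k], rhos[k])
    probs = np.clip(np.real([b.conj() @ pair @ b for b in bell]), 0, None)
    outcomes = rng.choice(4, size=N1, p=probs / probs.sum())
    sig = np.kron(paulis[P[k]], paulis[P[k]])
    eigs = np.real([b.conj() @ sig @ b for b in bell])  # eigenvalues +-1
    est *= eigs[outcomes]
a_hat = est.mean()  # estimates tr(P rho)^2

exact = np.prod([np.real(np.trace(paulis[P[k]] @ rhos[k])) for k in range(n)]) ** 2
assert abs(a_hat - exact) < 0.05
```

Each shot contributes a ±1 value, so Hoeffding's inequality controls the deviation of â(P) from tr(Pρ)², exactly as used in Lemma 3 below.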
Using the statistical property given in Equation (E10), we can apply Hoeffding's inequality to show that, with high probability, the estimate â(P) is close to the expectation value |tr(Pρ)|²: Lemma 3. Given N₁ = Θ(log(1/δ)/ε²), for any Pauli operator P, we have |â(P) − tr(Pρ)²| ≤ ε with probability at least 1 − δ.
To obtain an estimate for the absolute value |tr(Pρ)|, we consider the following truncated estimate b̂ = max(0, â).
It is not hard to show the following implication, using the fact that tr(Pρ)² ≥ 0: if |â − tr(Pρ)²| ≤ ε, then |b̂ − tr(Pρ)²| ≤ ε as well, since the truncation can only move the estimate closer to the nonnegative target. Therefore, Lemma 3 gives the following corollary: with high probability, we can estimate the absolute value of tr(Pρ) for any Pauli operator P accurately.

b. Measuring signs
Given a Pauli operator P, we check whether |tr(Pρ)| is large enough with the previous procedure, using Bell measurements and the classical memory. If |tr(Pρ)| is large enough, we proceed to measure the sign of tr(Pρ) as well, using the quantum memory. Suppose that N₂ copies of the state ρ are stored in the quantum memory. The sign is determined by measuring the two-outcome observable

E = Σ_{z ∈ {+1,−1}^{N₂}} MAJ(z₁,…,z_{N₂}) Π^{(z₁)} ⊗ … ⊗ Π^{(z_{N₂})},

where Π^{(z_k)} projects the k-th copy onto the eigenspace of the Pauli operator P with eigenvalue z_k ∈ {+1, −1} and MAJ(z) is the majority vote. In effect, the observable E measures P on each of the N₂ copies of ρ, obtaining either +1 or −1 each time, and takes a majority vote on these outcomes, yielding +1 if more than half of the outcomes are +1, and yielding −1 if more than half of the outcomes are −1. Intuitively, if we are guaranteed that |tr(Pρ)| > ε̃ and N₂ is sufficiently large, then the majority vote will concentrate on the correct answer. Hence, the quantum state ρ^{⊗N₂} is approximately contained in one of the two eigenspaces of the observable E. As a result, after measuring E, the quantum state ρ^{⊗N₂} remains approximately unchanged. This allows us to keep measuring the sign of tr(Pρ) for many different Pauli operators. This strategy is a key element in the original protocol for shadow tomography [4]. The rigorous guarantee is given by Lemma 5 below, which relies on the quantum union bound [1].
Lemma 4 (Quantum union bound [1]). Given any quantum state ρ, consider a sequence of M two-outcome projective measurements {K_i, I − K_i}, where each K_i is a projector with eigenvalues 0 and 1. Assume tr(K_i ρ) ≥ 1 − ε for all i. When we measure K₁,…,K_M sequentially on ρ, the probability that all of them yield the outcome 1 is at least 1 − M√ε.
Lemma 5. Suppose that |tr(P_i ρ)| > ε̃ for all i = 1,…,M, and let N₂ = 4 log(2M/δ)/ε̃². Measuring the observables E₁,…,E_M sequentially on ρ^{⊗N₂} then yields the correct signs sign(tr(P_i ρ)) for all i = 1,…,M with probability at least 1 − δ.

Proof. Let us first consider the probability that E_i outputs the sign of tr(P_i ρ) when measured on ρ^{⊗N₂}. Using Hoeffding's inequality, this probability is lower bounded by 1 − δ²/M² if we take N₂ = 4 log(2M/δ)/ε̃². Let us define a related observable K_i which has outcome 1 if the outcome of E_i is equal to the sign of tr(P_i ρ) and 0 otherwise. We have the following bound on the expectation value: tr(K_i ρ^{⊗N₂}) ≥ 1 − δ²/M². Note that measuring K_i is the same as measuring E_i. The only difference is in the eigenvalue associated with the outcome (E_i has outcomes ±1, while K_i has outcomes 0, 1). Using the quantum union bound in Lemma 4, we will obtain outcome 1 when we measure K_i for all i = 1,…,M with probability at least 1 − M√(δ²/M²) = 1 − δ.
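The Hoeffding step in this proof can be checked empirically. The sketch below simulates the majority vote as iid ±1 coin flips with bias equal to the gap (an idealization that ignores the measurement back-action handled by the quantum union bound); the values of the gap, M, and δ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
gap, M, delta = 0.2, 1000, 0.01                   # gap plays the role of eps-tilde
N2 = int(np.ceil(4 * np.log(2 * M / delta) / gap**2))

# Simulate the majority vote over N2 single-copy +-1 outcomes with bias
# tr(P rho) = gap, i.e. Pr[+1] = (1 + gap)/2, for many independent trials.
trials = 2000
votes = rng.random((trials, N2)) < (1 + gap) / 2  # True = outcome +1
wrong = np.mean(votes.sum(axis=1) <= N2 / 2)      # majority vote not +1

# Hoeffding: failure probability <= exp(-N2 * gap^2 / 2) <= delta^2 / M^2.
assert wrong <= np.exp(-N2 * gap**2 / 2) + 0.01
```

With these parameters the failure probability is of order δ²/M², so a union bound over all M sign measurements still leaves an overall success probability close to one.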
Because measuring outcome 1 when we measure K i is equivalent to obtaining the correct sign when we measure E i , this concludes the proof.

c. Sample complexity analysis
We can combine the previous results to obtain the sample complexities N₁, N₂ that guarantee accurate prediction of tr(P_i ρ), ∀i = 1,…,M. Here by "sample complexity" we mean the number of copies of ρ consumed by the protocol. First, following Corollary 1 and the union bound, we choose N₁ = Θ(log(M/δ)/ε⁴) such that, with probability at least 1 − (δ/2), we have

|b̂_i − tr(P_i ρ)²| ≤ ε²/3, ∀i = 1,…,M. (E22)

This allows accurate prediction of the absolute values. In the next step, we only obtain the sign of tr(P_i ρ) if b̂_i > (2/3)ε². Let R denote the number of Pauli operators that achieve b̂_i > (2/3)ε². Conditioned on Event (E22), for all P_i with b̂_i > (2/3)ε², we have tr(P_i ρ)² > (1/3)ε², and hence |tr(P_i ρ)| > ε/√3. Let us denote the measured signs by ŝ_i, ∀i = 1,…,M. If we do not measure the sign for P_i, then we set ŝ_i = 0. Using Lemma 5 with ε̃ = ε/√3, we can choose a number of samples N₂ = Θ(log(M/δ)/ε²) to guarantee that, with probability at least 1 − (δ/2), the measured signs are all correct for all P_i with b̂_i > (2/3)ε². (E24) Together, with probability at least (1 − (δ/2))² ≥ 1 − δ, Events (E22) and (E24) both hold. Finally, we produce the following estimate for tr(P_i ρ): p̂_i = ŝ_i √(b̂_i). For a Pauli operator P_i with b̂_i > (2/3)ε², the sign is correct and

|p̂_i − tr(P_i ρ)| = | √(b̂_i) − |tr(P_i ρ)| | ≤ √( |b̂_i − tr(P_i ρ)²| ) ≤ ε/√3 ≤ ε.

For a Pauli operator P_i with b̂_i ≤ (2/3)ε², we have

|p̂_i − tr(P_i ρ)| = |tr(P_i ρ)| ≤ √( b̂_i + ε²/3 ) ≤ ε.

The first inequality uses Event (E22), and the second inequality uses the assumption that b̂_i ≤ (2/3)ε² together with the fact that b̂_i ≥ 0. Hence, we successfully predict the expectation values tr(P_i ρ), ∀i = 1,…,M. The number of copies we used is 2N₁ + N₂ = O(log(M/δ)/ε⁴). This concludes the proof of Theorem 4.

d. Heuristics for near-term implementations
The procedure for measuring the absolute value of tr(P ρ) requires only two-copy Bell basis measurements, which can be performed quite easily on current quantum devices [33,55]. On the other hand, the procedure for measuring signs of tr(P ρ) can be rather difficult to implement on a near-term quantum device. However, for measuring signs, we only need to investigate the Pauli operator expectation values found to be relatively large in absolute value. For any P such that | tr(P ρ)| is small, rather than determine the sign we simply set our predicted expectation value of P to zero.
Therefore we focus on the Pauli observables whose expectation values are (comparatively) large in absolute value. When two Pauli observables P₁ and P₂ anti-commute, the expectation values of P₁ and P₂ cannot both be large in absolute value. Hence, it is often the case that these remaining observables can be sorted into a collection of just a few sets, where the operators in each set are mutually commuting. Various sorting strategies are known [22,34,44,56-58,91]. Once such a commuting set is identified, all the operators in the set can be simultaneously measured using just a single copy of ρ. As a simple illustrative example, suppose that the underlying state is a stabilizer state. Even if we consider all 4^n Pauli observables, when we filter out all Pauli observables with tr(Pρ) = 0, the rest of the Pauli observables form a single commuting set. Hence, we only need to measure in one appropriately chosen basis to estimate all non-zero Pauli observables simultaneously.
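One simple sorting strategy of the kind referenced above is greedy grouping by qubit-wise commutation (a sufficient but not necessary condition for joint measurability in a single product basis); the sketch below is a generic illustration, not the specific method of any one cited work:

```python
from itertools import product

def qubitwise_commute(p: str, q: str) -> bool:
    # Sufficient condition for joint measurability in one product basis.
    return all(a == "I" or b == "I" or a == b for a, b in zip(p, q))

def greedy_grouping(paulis):
    """Greedily sort Pauli strings into qubit-wise commuting sets."""
    groups = []
    for p in paulis:
        for g in groups:
            if all(qubitwise_commute(p, q) for q in g):
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

# Example: all Pauli strings of weight <= 1 on 3 qubits.
paulis = ["".join(s) for s in product("IXYZ", repeat=3)
          if sum(c != "I" for c in s) <= 1]
groups = greedy_grouping(paulis)
assert all(qubitwise_commute(p, q) for g in groups for p in g for q in g)
```

For this example the ten strings collapse into three groups (one per single-qubit basis choice), so three measurement settings suffice.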

Sample complexity lower bound for restricted classical ML algorithms
Recall that the restricted classical ML algorithm can only choose a certain input x and obtain the measurement outcome o when we measure the fixed observable O = Z₁ on E_ρ(|x⟩⟨x|). The sample complexity lower bound can be proved by reducing the problem to a well-known classical problem: learning point functions. This section establishes a sample complexity lower bound for this basic problem.

Lemma 6 (Learning point functions). For a ∈ {0,1}^n, define the point function f_a(x) = 1 if x = a and f_a(x) = 0 otherwise. Any classical randomized algorithm that queries f_a at N inputs and, with probability at least 2/3, outputs f̂ with

max_{x∈{0,1}^n} |f̂(x) − f_a(x)| < 1/2 (E30)

for every a ∈ {0,1}^n must use N ≥ (1/4)2^n = Ω(2^n) queries.
Proof. Suppose a is selected uniformly at random. By assumption, the classical randomized algorithm outputs f̂ obeying Equation (E30) with probability at least 2/3. Because the worst-case prediction error is smaller than 1/2, the classical randomized algorithm can then correctly identify a with probability at least 2/3. We now use a simple calculation to upper bound the probability that any classical randomized algorithm identifies a. Because f_a(x) = 0 for all but one input, a uniformly random input x₁ will result in f_a(x₁) = 0 with probability (2^n − 1)/2^n. Conditioned on f_a(x₁) = 0, the second query x₂ will result in f_a(x₂) = 0 with probability (2^n − 2)/(2^n − 1). By induction, the probability that the first k queries all result in function values equal to 0 is

Π_{i=1}^{k} (2^n − i)/(2^n − i + 1) = (2^n − k)/2^n.

Hence, with k = (1/4)2^n, the probability that the first k queries all return zero is 3/4. In such an event, the classical algorithm has no information that helps it distinguish the remaining (3/4)2^n point functions.
Because the conditional distribution of a under this event is uniform over the (3/4)2^n remaining point functions, no matter what the classical algorithm outputs, the probability of correctly identifying a is equal to (4/3)2^{−n}. Thus the probability of correctly identifying a uniformly random label a ∈ {0,1}^n from k uniformly random queries f_a(x₁),…,f_a(x_k) is at most

(1/4) + (3/4) × (4/3)2^{−n} = 1/4 + 2^{−n} ≤ 1/2 < 2/3,

under the extra assumption n ≥ 2 (2^n ≥ 4). The number of queries N must therefore be greater than k = (1/4)2^n to achieve a success probability of at least 2/3, which reveals N ≥ (1/4)2^n = Ω(2^n).
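The counting at the heart of this proof is easy to verify empirically. The sketch below (with an arbitrary choice n = 10) samples k = 2^n/4 distinct uniformly random queries, matching the induction above, and estimates the probability that all of them return 0:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10                     # arbitrary input size
k = 2**n // 4              # number of queries from the argument above
trials = 2000

# Fraction of runs in which k distinct uniformly random queries to the point
# function f_a all return 0, leaving the learner with no information about a.
all_zero = 0
for _ in range(trials):
    a = rng.integers(2**n)
    queries = rng.choice(2**n, size=k, replace=False)
    all_zero += int(not np.any(queries == a))
frac = all_zero / trials   # concentrates around (2^n - k)/2^n = 3/4
assert 0.7 < frac < 0.8
```

The empirical fraction concentrates around the exact value (2^n − k)/2^n = 3/4 derived above.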
It is not hard to construct a subset of quantum states that maps this problem onto the basic problem of learning point functions. We can identify x ∈ {I, X, Y, Z}^n (the input) and a ∈ {I, X, Y, Z}^n with bit strings of size 2n (two bits unambiguously characterize each of the 4 single-qubit Paulis). We recall the definition that P_x ∈ {I, X, Y, Z}^{⊗n} is the tensor product of Pauli operators specified by x ∈ {I, X, Y, Z}^n. For each a, we define a mixed state

ρ_a = (I + P_a)/2^n,

so that tr(P_x ρ_a) = 1 if x = a and tr(P_x ρ_a) = 0 for every other non-identity P_x. This is now exactly the same as the problem of learning point functions given in Lemma 6 with input size 2n.
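The point-function structure of the states ρ_a can be verified directly for a small instance (n = 2 and the label a = "XY" are arbitrary choices in this sketch):

```python
import numpy as np
from itertools import product

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
paulis = {"I": I2, "X": X, "Y": Y, "Z": Z}

def pauli(s):
    m = np.array([[1]], dtype=complex)
    for c in s:
        m = np.kron(m, paulis[c])
    return m

n, a = 2, "XY"                                 # arbitrary small instance
rho_a = (np.eye(2**n) + pauli(a)) / 2**n       # rho_a = (I + P_a)/2^n

# tr(P_x rho_a) = 1 if x = a (or x is the identity) and 0 otherwise:
# querying rho_a with Pauli inputs is exactly learning a point function.
for x in ("".join(t) for t in product("IXYZ", repeat=n)):
    val = np.real(np.trace(pauli(x) @ rho_a))
    expected = 1.0 if x in (a, "I" * n) else 0.0
    assert np.isclose(val, expected)
```

Every query with x ≠ a therefore carries no signal in expectation, which is what powers the reduction to Lemma 6.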

Sample complexity lower bound for any classical ML algorithm
While restricted classical ML models can only obtain the measurement outcome o for a fixed observable O, one may wonder what the sample complexity lower bound would be for a standard classical ML model that can perform an arbitrary POVM measurement (which is equivalent to performing a quantum computation with ancilla qubits followed by a computational basis measurement) on the output quantum state E_ρ(|x⟩⟨x|). While this is much more powerful than restricted classical ML models, we will show that the sample complexity is still exponential. A quantum ML model that can process the quantum data in an entangled fashion thus has an exponential advantage over classical ML models that can only process each piece of quantum data separately.
We will first focus on non-adaptive measurements, where the POVM measurement for each copy is fixed and does not change throughout the training process, and show that any such procedure needs to measure Ω(n2^n) copies of the state ρ. A matching upper bound can be obtained using classical shadows with random Clifford measurements [54]. Classical shadows with random Clifford measurements can predict M observables O₁,…,O_M using only N_C = O(max_i tr(O_i²) log(M)) copies. For the set of all n-qubit Pauli observables, we have tr(P_i²) = 2^n and M = 4^n, so N_C = O(n2^n). This matches the lower bound for non-adaptive measurements.
We also give a lower bound of Ω(2 n/3 ) for adaptive measurements, where each POVM measurement can depend on the outcomes of previous POVM measurements. This sample complexity lower bound could be further improved using a more sophisticated analysis, which we leave for future work.
a. Non-adaptive measurements When one can perform an arbitrary POVM measurement on E_ρ(|x⟩⟨x|), the input x is no longer useful, since x only rotates the quantum state ρ, and this rotation can be absorbed into the POVM measurement. Let us denote the POVM applied to the i-th copy of ρ by F_i. We can thus reduce classical ML models with non-adaptive measurements to the following setup: for each i = 1,…,N, measure the i-th copy of ρ with the POVM F_i and record the single-shot outcome o_i. Hence, o_i is a random variable that depends on ρ and F_i. This setup is known as single-copy independent measurements in the quantum state tomography literature [43,54]. Without loss of generality, we can further restrict our attention to POVM measurements comprised of rank-one projectors; such measurements always reveal at least as much information. We write F_i = {w_{i,o_i} 2^n |ψ_{i,o_i}⟩⟨ψ_{i,o_i}|}_{o_i}, where the weights satisfy Σ_{o_i} w_{i,o_i} = 1. The classical ML model then uses the classical measurement outcomes o₁,…,o_N to learn a function f̂(x) such that

|f̂(x) − tr(P_x ρ)| ≤ ε for all x ∈ {I, X, Y, Z}^n

holds with high probability. We will show that this necessarily requires N ≥ Ω(n2^n). Recall that the sample complexity for achieving a constant worst-case prediction error using a quantum ML model is N = O(n).
The proof uses a mutual information analysis similar to the sample complexity lower bound for quantum ML models given in Section C 1 b. We consider a communication protocol between Alice and Bob. First, we define a codebook that Alice will use to encode classical information in quantum states:

ρ_{(a,s)} = (I + s P_a)/2^n,

where P_a runs through all Pauli matrices {I, X, Y, Z}^{⊗n} \ {I^{⊗n}} that are not the global identity and s ∈ {+1, −1} is an additional sign. There are 2(4^n − 1) combinations in total and Alice will sample one of them uniformly at random. Then, Alice prepares N copies of ρ_{(a,s)} and sends them to Bob. Bob performs the POVM measurements F₁,…,F_N on the N copies of ρ_{(a,s)} he receives to obtain classical measurement outcomes o₁,…,o_N. He uses them to train the classical ML model to produce a function f̂(x) that is guaranteed to obey

|f̂(x) − tr(P_x ρ_{(a,s)})| ≤ ε for all x ∈ {I, X, Y, Z}^n (E38)

with high probability. Because tr(P_x ρ_{(a,s)}) is either +1, 0, or −1, it is not hard to see that Bob can use f̂(x) to determine Alice's original message (a, s), as long as Equation (E38) holds with ε < 1/2. Using Fano's inequality and the data processing inequality, the mutual information between (a, s) and the measurement outcomes o₁,…,o_N must be lower bounded by Ω(log(2(4^n − 1))) = Ω(n). On the other hand, a direct calculation shows that each rank-one measurement outcome o_i can only carry O(2^{−n}) bits of information about (a, s), because the w_{i,o_i}'s are expansion coefficients of a POVM (Σ_{o_i} w_{i,o_i} = 1) and the quantity ⟨ψ_{i,o_i}| P_a |ψ_{i,o_i}⟩², averaged over a, is of order 2^{−n}. Combining the two bounds yields N ≥ Ω(n2^n). This is the advertised result.

b. Adaptive measurements
In the last section, we derived a sample complexity lower bound for independent single-copy quantum measurements. Several results in the literature address this restricted setting, see e.g. [43,54]. In stark contrast, very little is known about the more realistic setting of adaptive single-copy measurements; see [6,24] for lower bounds on quantum mixedness testing and the task of distinguishing between physical experiments. Here, we give an elementary proof that provides an exponential sample complexity lower bound for predicting Pauli expectation values based on single-copy adaptive measurements. Such an extension to adaptive measurement strategies is nontrivial; very few results are known for this setting. However, the actual result is not (yet) tight. We believe that further improvements are possible using a more sophisticated analysis, and we leave this as future work. When adaptive measurements are allowed, each POVM measurement can depend on previous POVM measurement outcomes, so the i-th measurement takes the form

F_i^{o_{<i}} = { w_{ij}^{o_{<i}} 2^n |ψ_{ij}^{o_{<i}}⟩⟨ψ_{ij}^{o_{<i}}| }_j, where o_{<i} = (o₁,…,o_{i−1}).

Using the conditions Σ_j w_{ij}^{o_{<i}} 2^n |ψ_{ij}^{o_{<i}}⟩⟨ψ_{ij}^{o_{<i}}| = I and ⟨ψ_{ij}^{o_{<i}}|ψ_{ij}^{o_{<i}}⟩ = 1, we have Σ_j w_{ij}^{o_{<i}} = 1. Suppose Alice randomly chooses one of 4^n possible n-qubit states based on a non-uniform probability distribution:

ρ = ρ_mm = I/2^n, with probability 1/2,
ρ = ρ_a = (I + P_a)/2^n, with probability 1/(2(4^n − 1)),

where P_a is chosen among all 4^n − 1 nontrivial Pauli operators {I, X, Y, Z}^{⊗n} \ {I^{⊗n}}. (In contrast to the previous subsection, we only include positive signs, that is, ρ_a = ρ_{(a,+1)}.) Alice then sends ρ^{⊗N} to Bob. Hence, Bob receives

ρ_mm^{⊗N}, with probability 1/2,
ρ_a^{⊗N}, with probability 1/(2(4^n − 1)). (E57)
Bob uses the classical ML model with adaptive measurement outcomes o₁,…,o_N to infer all Pauli expectation values tr(P_b ρ) with small error (the assumption on the classical ML model). Because tr(P_b ρ_a) = δ_{ab}, but tr(P_b ρ_mm) = 0 for all b, Bob can successfully distinguish the quantum state chosen by Alice once he knows the expectation values of all Pauli operators. This implies that the probability distribution of Bob's measurement outcomes o₁,…,o_N must be able to distinguish the two events: 1. ρ = ρ_mm, which happens with probability 1/2; 2. ρ = ρ_a for some nontrivial Pauli P_a, which happens with total probability 1/2.
We have thus reduced a multiple-hypothesis testing problem (distinguishing among 4^n possible states) to a two-hypothesis testing problem (distinguishing between the completely mixed state and a randomly chosen ρ_a). We will use this observation to derive an information-theoretic lower bound on N. Importantly, the two hypotheses give rise to different joint probability distributions over all N outcomes. Under the first hypothesis (ρ = ρ_mm),

p₁(o₁,…,o_N) = Π_{i=1}^{N} w_{i,o_i}^{o_{<i}},

while the second hypothesis (ρ = ρ_a, where a is itself uniformly random) implies

p₂(o₁,…,o_N) = (1/(4^n − 1)) Σ_a Π_{i=1}^{N} w_{i,o_i}^{o_{<i}} (1 + ⟨ψ_{i,o_i}^{o_{<i}}| P_a |ψ_{i,o_i}^{o_{<i}}⟩).

When Bob performs his measurement strategy, he obtains exactly one sample from such a joint probability distribution, and based on this sample he must distinguish between the two hypotheses. A well-known fact from statistics states that the optimal decision strategy is the maximum likelihood rule (pick the joint probability distribution that is most likely, given the observed event). This strategy succeeds with a probability that is determined by the total variation (TV) distance:

Pr[successful discrimination] = 1/2 + (1/2) TV(p₁, p₂), with TV(p₁, p₂) = (1/2) Σ_{o₁,…,o_N} |p₁(o₁,…,o_N) − p₂(o₁,…,o_N)|. (E60)

We note in passing that this classical observation is actually the starting point for the celebrated Holevo-Helstrom theorem [48,50]. We refer to [8,66] and also [62, Lecture 1] for a modern discussion from a quantum information perspective. Importantly, the TV distance between p₁(o₁,…,o_N) and p₂(o₁,…,o_N) remains tiny until N becomes exponentially large:

TV(p₁(o₁,…,o_N), p₂(o₁,…,o_N)) ≤ 2N/(2^n + 1)^{1/3}.

This is the content of Lemma 9 below. This TV distance upper bounds Bob's bias for successful discrimination of the two possibilities. Because Bob can successfully discriminate between the two hypotheses, we must have TV(p₁, p₂) = Ω(1), which gives the desired result N = Ω(2^{n/3}).
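The classical discrimination fact in Equation (E60) is elementary and can be verified directly. In the sketch below, a random pair of small distributions stands in for p₁ and p₂ (the alphabet size d = 6 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6                                   # arbitrary outcome-alphabet size
p1 = rng.random(d); p1 /= p1.sum()      # stand-in for p1(o_1,...,o_N)
p2 = rng.random(d); p2 /= p2.sum()      # stand-in for p2(o_1,...,o_N)
tv = 0.5 * np.abs(p1 - p2).sum()

# Maximum-likelihood rule for two equiprobable hypotheses: guess hypothesis 1
# iff p1(o) >= p2(o).  Its average success probability is 1/2 + TV(p1,p2)/2.
success = 0.5 * np.sum(np.where(p1 >= p2, p1, p2))
assert np.isclose(success, 0.5 + 0.5 * tv)
```

The identity follows from Σ_o max(p₁(o), p₂(o)) = 1 + TV(p₁, p₂), which the assertion confirms numerically.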
We emphasize that an exponential lower bound can be proven using the same method whenever we want to predict a class of observables {O₁,…,O_M} such that (1/M) Σ_{i=1}^{M} ⟨ψ| O_i |ψ⟩² is exponentially small for all pure states |ψ⟩. Pauli observables are one example of such a class of observables.

Proof.
The key insight is that, regardless of the actual choice of measurements, single-copy rank-one POVMs are ill equipped to distinguish the maximally mixed state ρ_mm = I/2^n from ρ_a = (I + P_a)/2^n, where a is a uniformly random index. To see this, let us first re-use Lemma 7 to obtain

Σ_a ⟨ψ_{i,o_i}^{o_{<i}}| P_a |ψ_{i,o_i}^{o_{<i}}⟩² = (4^n − 1) E_a[ ⟨ψ_{i,o_i}^{o_{<i}}| P_a |ψ_{i,o_i}^{o_{<i}}⟩² ] = (4^n − 1)/(2^n + 1) = 2^n − 1

for any state |ψ_{i,o_i}^{o_{<i}}⟩.
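This identity can be checked numerically for a small system. The sketch below evaluates the sum over all non-identity Paulis for a random pure state (n = 2 is an arbitrary choice):

```python
import numpy as np
from itertools import product

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
paulis = {"I": I2, "X": X, "Y": Y, "Z": Z}

n = 2
rng = np.random.default_rng(5)
psi = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
psi /= np.linalg.norm(psi)              # random pure state

# Sum of <psi|P_a|psi>^2 over the 4^n - 1 non-identity Paulis equals 2^n - 1,
# so the average over a is 1/(2^n + 1), exponentially small in n.
total = 0.0
for s in product("IXYZ", repeat=n):
    if all(c == "I" for c in s):
        continue
    op = np.array([[1]], dtype=complex)
    for c in s:
        op = np.kron(op, paulis[c])
    total += np.real(psi.conj() @ op @ psi) ** 2
assert np.isclose(total, 2**n - 1)
assert np.isclose(total / (4**n - 1), 1 / (2**n + 1))
```

Because the average of ⟨ψ|P_a|ψ⟩² over a is only 1/(2^n + 1), any single rank-one outcome is nearly insensitive to which ρ_a was sent, which is exactly what the TV-distance bound quantifies.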