Expressivity of Quantum Neural Networks

In this work, we address the question of whether a sufficiently deep quantum neural network can approximate a target function to arbitrary accuracy. We start with simple but typical physical situations in which the target functions are physical observables, and then extend our discussion to situations in which the learning targets are not directly physical observables but can be expressed as physical observables in an enlarged Hilbert space with multiple replicas, such as the Loschmidt echo and the Rényi entropy. The main finding is that an accurate approximation is possible only when the input wave functions in the dataset do not exhaust the entire Hilbert space that the quantum circuit acts on; more precisely, the Hilbert space dimension of the former has to be no more than half of the Hilbert space dimension of the latter. In some cases, this requirement is satisfied automatically by intrinsic properties of the dataset, for instance, when the input wave function has to be symmetric between different replicas. If the requirement cannot be satisfied by the dataset itself, we show that the expressive capability can be restored by adding one ancillary qubit whose wave function is always fixed at the input. Our studies point toward establishing a quantum neural network analog of the universal approximation theorem that lays the foundation for the expressivity of classical neural networks.


I. INTRODUCTION
Neural networks lie at the center of the recent third wave of artificial intelligence. The universal approximation theorem plays an essential role in the development of neural networks; it states that a sufficiently wide or sufficiently deep neural network can approximate a well-behaved function on the $d$-dimensional Euclidean space $\mathbb{R}^d$ with arbitrary accuracy [1]. This theorem lays the foundation of the expressive capability of neural networks and underpins the successes of neural network applications. Quantum neural networks (QNNs) are quantum generalizations of classical feedforward neural networks on future quantum computers [2][3][4][5], and they lie at the center of the recent development of quantum machine learning, including quantum unsupervised learning [6][7][8][9][10][11], quantum generalizations of neural networks [12][13][14][15][16][17][18][19][20][21], quantum circuit structures designed by classical neural networks [22][23][24][25][26][27][28][29], and information theory in quantum neural networks [30,31]. However, the expressivity of QNNs has not been fully explored, and there are only a few works in this direction [13,32].
Here we consider the quantum generalization of fully connected neural networks, which contains quantum wave functions of $n$-qubit states as inputs, parameterized quantum circuits made of local quantum gates, and measurements on readout qubits that yield labels. The parameters in the quantum circuit are optimized during training to give the best approximation of the learning target. Since the Hilbert space dimension of an $n$-qubit state is $2^n$, the wave function can encode the information of $d = 2^n$ complex numbers, up to a normalization condition and a global phase. In the following, concerning the effects of the depth and the width on the expressive capability of a QNN, we address the question of whether a sufficiently deep and sufficiently wide QNN can express any well-behaved function on the space $\mathbb{C}^d$.
To be concrete, we consider a number of typical learning tasks in quantum physics, where the learning targets include (i) physical observables, (ii) the Loschmidt echo, and (iii) the Rényi entropy. We point out that, in contrast to the universal approximation theorem for classical neural networks, a QNN cannot express a general well-behaved function with arbitrary accuracy, even when the QNN is made sufficiently deep. However, we show in this work that, by enlarging the Hilbert space dimension of the input state, this problem can be solved and the expressivity can be significantly improved, or even made arbitrarily accurate. Enlarging the Hilbert space dimension effectively increases the width of the QNN. This can be achieved either by adding an ancillary qubit to the input or by duplicating replicas of the input wave function, or both. These results point toward an analog of the universal approximation theorem for QNNs.

II. RESULTS
We consider a dataset denoted as $\{(|\psi_l\rangle, y_l)\}$, $l = 1, \ldots, N_D$, where $l$ labels the data and $N_D$ is the total number of data in the dataset. Each input quantum state can be written as $|\psi_l\rangle = \sum_m c^l_m |m\rangle$, where $\{|m\rangle\}$ is a complete set of $2^n$ bases of the $n$-qubit Hilbert space, and $\{c^l_m\}$ are $2^n$ normalized complex numbers with a fixed total phase. Usually the information in the label is much more condensed than the information in the entire input; therefore, here we consider the label to be simply a number $y_l \in [-1, 1]$. To motivate this result, let us first start with a simpler situation in which the label is a physical observable $y_l = \langle\psi_l|\hat{O}|\psi_l\rangle$, where $\hat{O}$ is a Hermitian operator on the $n$-qubit quantum state [33]. This is equivalent to saying that $y_l$ is a quadratic function of these complex numbers as

$$ y_l = \sum_{m m'} (c^l_m)^* \, O_{m m'} \, c^l_{m'}, $$

where $O_{m m'} = \langle m|\hat{O}|m'\rangle$. The QNN we consider is shown in Fig. 1(a).
The unitary $\hat{U}$ is made of a number of two-qubit gates. We use $\hat{u}_{ij}$ to denote a two-qubit gate acting on qubit-$i$ and qubit-$j$. Each $\hat{u}_{ij}$ is parameterized as

$$ \hat{u}_{ij} = \exp\Big( i \sum_k \alpha^k_{ij}\, \hat{g}_k \Big), $$

where the $\hat{g}_k$ are SU(4) generators and the $\alpha^k_{ij}$ are parameters. In a QNN, these parameters are determined by training. We use the brick-wall architecture to arrange the $\hat{u}_{ij}$ to form $\hat{U}$. To make sure that the QNN can realize any unitary transformation, we set the circuit depth sufficiently large. The entire quantum circuit $\hat{U}$ acts on the input wave function $|\psi_l\rangle$, and then we perform a measurement, say $\hat{\sigma}_x$, on the readout qubit-$r$.
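As a minimal numerical sketch (not the implementation used in this work), such a parameterized two-qubit gate can be built by exponentiating a linear combination of su(4) generators; here tensor products of Pauli matrices are used as one concrete choice of the fifteen generators:

```python
import numpy as np
from scipy.linalg import expm

# Pauli matrices, with the identity as sigma_0
paulis = [np.eye(2), np.array([[0, 1], [1, 0]]),
          np.array([[0, -1j], [1j, 0]]), np.array([[1, 0], [0, -1]])]

# 15 traceless su(4) generators: sigma_a (x) sigma_b, excluding I (x) I
generators = [np.kron(paulis[a], paulis[b])
              for a in range(4) for b in range(4) if (a, b) != (0, 0)]

def two_qubit_gate(alpha):
    """u = exp(i * sum_k alpha_k g_k): a generic SU(4) element."""
    herm = sum(a * g for a, g in zip(alpha, generators))
    return expm(1j * herm)

rng = np.random.default_rng(0)
u = two_qubit_gate(rng.normal(size=15))
assert np.allclose(u.conj().T @ u, np.eye(4))   # unitary, as required
```

Since the generators are traceless, the resulting gate has unit determinant, i.e., it lies in SU(4) rather than merely U(4).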
The measurement operator is therefore denoted by

$$ \hat{M} = \hat{\sigma}^r_x \otimes \prod_{i \neq r} \hat{\sigma}^i_0, $$

where the superscript $i = 1, \ldots, n$ denotes the qubits and $\hat{\sigma}^i_0$ denotes the identity matrix. The measurement of the quantum circuit leads to

$$ \tilde{y}_l = \langle\psi_l|\hat{U}^\dagger \hat{M} \hat{U}|\psi_l\rangle. $$

[FIG. 1. Here $|\psi\rangle$ is an input wave function in the dataset, $\hat{U}$ denotes the unitary rotation by the quantum circuit, and the detector denotes the readout qubit. The red qubit with label $a$ denotes the ancillary qubit at which the wave function is always fixed.]

The loss function is taken as $L = \frac{1}{N_D} \sum_l |\tilde{y}_l - y_l|^2$, which enforces $\tilde{y}_l$ to approach $y_l$ for all $|\psi_l\rangle$. We use the Adam method to optimize the coefficients of the generators of each two-qubit gate during training [31].
One thing that should be noticed here is whether the input wave functions span the entire Hilbert space, which requires $N_D > 2^n$, where $N_D$ denotes the total number of data in the training set and $2^n$ is the dimension of the Hilbert space of $n$ qubits. If $N_D < 2^n$, the input wave functions in the dataset only occupy a subset of the entire Hilbert space. In some cases, even when $N_D > 2^n$, if the wave functions have certain structures, for instance, if they are taken as ground states of certain Hamiltonians [30,31], they also do not span the entire Hilbert space. However, if $N_D > 2^n$ and the input wave functions are general enough, they span the entire Hilbert space. In this case, in order for all $\tilde{y}_l$ to faithfully represent $y_l$, one requires $\hat{O} = \hat{U}^\dagger \hat{M} \hat{U}$. However, this is not possible for a general operator $\hat{O}$. Because $\hat{M}$ is a direct product of the $\hat{\sigma}_x$ operator on the readout qubit-$r$ and $n-1$ identity operators on the remaining qubits, the eigenvalues of $\hat{M}$ consist of $2^{n-1}$ eigenvalues equal to $-1$ and an equal number equal to $+1$, and any unitary transformation keeps these eigenvalues invariant. That is to say, even though one can make the QNN deep enough to represent a generic unitary $\hat{U}$ in the SU($2^n$) group, it still cannot satisfy $\hat{O} = \hat{U}^\dagger \hat{M} \hat{U}$. This argument can be easily generalized to situations where measurements are performed on more than one readout qubit.
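This spectral obstruction can be checked numerically. A small sketch (illustrative only, with the total magnetization as a hypothetical target observable) compares the eigenvalues of $\hat{M}$ with those of a generic $\hat{O}$:

```python
import numpy as np

n = 3                                    # number of qubits
sx = np.array([[0, 1], [1, 0]])
sz = np.array([[1, 0], [0, -1]])

# Measurement operator: sigma_x on the readout qubit, identity elsewhere
M = np.kron(sx, np.eye(2 ** (n - 1)))
evals_M = np.sort(np.linalg.eigvalsh(M))
# Spectrum of M: 2^{n-1} eigenvalues -1 and 2^{n-1} eigenvalues +1
assert np.allclose(evals_M, [-1] * 4 + [1] * 4)

# A generic observable, e.g. total magnetization sum_i sigma_z^i / n,
# has a different spectrum, so no unitary can satisfy O = U^dag M U.
O = sum(np.kron(np.kron(np.eye(2 ** i), sz), np.eye(2 ** (n - 1 - i)))
        for i in range(n)) / n
evals_O = np.sort(np.linalg.eigvalsh(O))
assert not np.allclose(evals_O, evals_M)  # the spectra differ
```

Because unitary conjugation preserves eigenvalues, the mismatch of the two sorted spectra is exactly what forbids $\hat{O} = \hat{U}^\dagger \hat{M} \hat{U}$ here.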

A. Ancillary qubit
Now we show that this problem can be solved by adding one ancillary qubit. Instead of $|\psi_l\rangle$, the input wave function is set as $|\alpha\rangle \otimes |\psi_l\rangle$, where the input state of the ancillary qubit is always fixed as $|\alpha\rangle$. The unitary $\hat{U}$ now acts on the entire $2^{n+1}$-dimensional Hilbert space, and the measurement is still performed on the readout qubit, with

$$ \hat{M} = \hat{\sigma}^a_0 \otimes \hat{\sigma}^r_x \otimes \prod_{i \neq r} \hat{\sigma}^i_0, $$

where the superscript $a$ denotes the ancillary qubit. The structure is shown in Fig. 1(b). We will now show that, for any given $|\alpha\rangle$, we can always construct an operator $\hat{O}'$ acting on the $2^{n+1}$-dimensional Hilbert space that satisfies the following two requirements. The first is that $\hat{O}'$ generates the observables as $\langle\alpha| \otimes \langle\psi_l|\, \hat{O}' \,|\alpha\rangle \otimes |\psi_l\rangle = \langle\psi_l|\hat{O}|\psi_l\rangle = y_l$, and the second is that the eigenvalues of $\hat{O}'$ consist of $2^n$ eigenvalues equal to $+1$ and an equal number equal to $-1$, consistent with those of the measurement operator $\hat{M}$. Without loss of generality, choosing $|\alpha\rangle = |\uparrow\rangle$, it can be shown that

$$ \hat{O}' = \hat{\sigma}^a_z \otimes \hat{O} + \hat{\sigma}^a_x \otimes \sqrt{\hat{I} - \hat{O}^2} $$

satisfies these two conditions (note that $y_l \in [-1, 1]$ ensures $\hat{I} - \hat{O}^2 \geq 0$). First, since $\langle\uparrow|\hat{\sigma}_z|\uparrow\rangle = 1$ and $\langle\uparrow|\hat{\sigma}_x|\uparrow\rangle = 0$,

$$ \langle\uparrow| \otimes \langle\psi_l|\, \hat{O}' \,|\uparrow\rangle \otimes |\psi_l\rangle = \langle\psi_l|\hat{O}|\psi_l\rangle = y_l. $$

[FIG. 2. In the legends, S, D, and T denote, respectively, single, double, and triple replicas as input for the quantum circuit, and +A means that one ancillary qubit is added. Here we use the brick-wall architecture made of two-qubit gates for the quantum circuit.]
Second, suppose $\{|m\rangle\}$ is a set of eigenbases in the $2^n$-dimensional Hilbert space (without the ancillary qubit) with $\hat{O}|m\rangle = O_m |m\rangle$. In the bases $\{|\uparrow\rangle \otimes |m\rangle, |\downarrow\rangle \otimes |m\rangle\}$, the enlarged operator $\hat{O}'$ is block diagonal, with a $2\times 2$ block for each $m$:

$$ \hat{O}'_m = \begin{pmatrix} O_m & \sqrt{1 - O_m^2} \\ \sqrt{1 - O_m^2} & -O_m \end{pmatrix}. $$

Each block has eigenvalues $\pm 1$; therefore, the spectrum of $\hat{O}'$ consists of $2^n$ eigenvalues equal to $+1$ and an equal number equal to $-1$, which equals the spectrum of the operator $\hat{M}$. Once such an operator $\hat{O}'$ is found, it is possible to find a unitary $\hat{U}$ in the $2^{n+1}$-dimensional space such that $\hat{U}^\dagger \hat{M} \hat{U} = \hat{O}'$, and then

$$ \tilde{y}_l = \langle\alpha| \otimes \langle\psi_l|\, \hat{U}^\dagger \hat{M} \hat{U} \,|\alpha\rangle \otimes |\psi_l\rangle = y_l. $$

This shows that, with the help of one ancillary qubit, the QNN can accurately express the functional mapping $y_l = \langle\psi_l|\hat{O}|\psi_l\rangle$ for all generic quantum states $|\psi_l\rangle$.
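This construction can be verified numerically. A minimal sketch with a randomly drawn Hermitian $\hat{O}$, rescaled so that its spectrum lies in $[-1, 1]$ as the labels require:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                    # 2^n, here n = 3

# Random Hermitian O with spectrum rescaled into [-1, 1]
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
O = (A + A.conj().T) / 2
O /= np.abs(np.linalg.eigvalsh(O)).max()

# Matrix square root of I - O^2 via the eigendecomposition of O
w, v = np.linalg.eigh(np.eye(d) - O @ O)
sq = v @ np.diag(np.sqrt(np.clip(w, 0, None))) @ v.conj().T

sz = np.diag([1.0, -1.0])
sx = np.array([[0.0, 1.0], [1.0, 0.0]])
# O' = sigma_z^a (x) O + sigma_x^a (x) sqrt(I - O^2)
Op = np.kron(sz, O) + np.kron(sx, sq)

# Requirement (ii): spectrum is 2^n copies each of +1 and -1
assert np.allclose(np.sort(np.linalg.eigvalsh(Op)), [-1] * d + [1] * d)

# Requirement (i): with the ancilla fixed to |up>, expectations match
psi = rng.normal(size=d) + 1j * rng.normal(size=d)
psi /= np.linalg.norm(psi)
full = np.kron(np.array([1.0, 0.0]), psi)        # |up> (x) |psi>
assert np.isclose(full.conj() @ Op @ full, (psi.conj() @ O @ psi).real)
```

Both defining properties of $\hat{O}'$ hold for any such $\hat{O}$, which is exactly what makes the ancillary-qubit trick work.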
In Fig. 2(a), we show the loss function for learning the total magnetization $\hat{O} = \sum_i \hat{\sigma}^i_z / n$ of generic wave functions in a three-qubit quantum state. The red lines and the purple lines are results for QNNs with the structures shown in Figs. 1(a) and 1(b), respectively; the structure shown in Fig. 1(b) has one more ancillary qubit compared with the structure shown in Fig. 1(a). One can see that, without the ancillary qubit, the loss clearly saturates at a finite value even for sufficiently large training epochs. By adding the ancillary qubit, the loss is significantly reduced and approaches zero.

B. General rule
The lesson from the above example is that the learning accuracy can be significantly improved by enlarging the Hilbert space dimension of the input of the quantum circuit. Here we generalize this lesson to a general statement. Suppose $\mathcal{H}$ is the total Hilbert space of the input of the quantum circuit, and $\mathrm{Dim}(\mathcal{H})$ denotes its dimension. Let us consider $\mathcal{H} = \mathcal{H}_0 \oplus \mathcal{H}_1$, and suppose all input wave functions in the dataset reside only in $\mathcal{H}_0$. The statement is that, when $\mathrm{Dim}(\mathcal{H}_1) = \mathrm{Dim}(\mathcal{H}_0)$, we can always find an operator $\hat{O}'$ acting on the entire Hilbert space $\mathcal{H}$ such that (i) for any wave function $|\psi\rangle$ in $\mathcal{H}_0$, $\langle\psi|\hat{O}'|\psi\rangle = \langle\psi|\hat{O}|\psi\rangle$, and (ii) the eigenvalues of $\hat{O}'$ consist of equal numbers of $+1$ and $-1$, the same as the eigenvalues of the measurement operator. Then it is possible to find a proper $\hat{U}$ such that $\hat{O}' = \hat{U}^\dagger \hat{M} \hat{U}$.
It is easy to see that an $\hat{O}'$ constructed in the same way as in the ancillary-qubit case satisfies the above two requirements. This can be extended to situations with $\mathrm{Dim}(\mathcal{H}_1) > \mathrm{Dim}(\mathcal{H}_0)$. In this case, $\mathcal{H}_1$ is larger than the space spanned by $\{|m\rangle\}$, and we choose $\hat{O}'$ to be diagonal with equal numbers of $+1$ and $-1$ eigenvalues in the residual Hilbert space. The ancillary qubit is a specific example of this general statement, where $\mathcal{H}_0$ consists of the states $|\uparrow\rangle \otimes |\psi\rangle$ and $\mathcal{H}_1$ consists of the states $|\downarrow\rangle \otimes |\psi\rangle$, with $|\psi\rangle$ denoting the input wave functions in the dataset. We have proved that if the condition $\mathrm{Dim}(\mathcal{H}_0) \leq \mathrm{Dim}(\mathcal{H})/2$ is fulfilled, then a learnable observable can be constructed. However, we should also note that this condition is sufficient but not necessary. In certain special cases, if a specific target observable $\hat{O}$ happens to share the same set of eigenvalues as the measurement operator $\hat{M}$, the measurement operator can be rotated to the target observable, such that $\hat{U}^\dagger \hat{M} \hat{U} = \hat{O}$, even without the condition $\mathrm{Dim}(\mathcal{H}_0) \leq \mathrm{Dim}(\mathcal{H})/2$.
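The final step, finding $\hat{U}$ with $\hat{U}^\dagger \hat{M} \hat{U} = \hat{O}'$, is explicit once both operators are diagonalized: when they share the same spectrum, the product of their eigenbases gives $\hat{U}$. A sketch with random Hermitian operators sharing the $\pm 1$ spectrum (illustrative, not the training procedure, which finds $\hat{U}$ by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 8
spec = np.array([1.0] * 4 + [-1.0] * 4)   # shared spectrum {+1 x4, -1 x4}

def random_with_spectrum(s):
    """Hermitian matrix q diag(s) q^dag with a Haar-ish random eigenbasis."""
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim))
                        + 1j * rng.normal(size=(dim, dim)))
    return q @ np.diag(s) @ q.conj().T, q

M, VM = random_with_spectrum(spec)        # plays the role of M-hat
Op, VO = random_with_spectrum(spec)       # plays the role of O'-hat

# U = VM VO^dag maps the eigenbasis of Op onto that of M,
# so U^dag M U = VO diag(spec) VO^dag = Op.
U = VM @ VO.conj().T
assert np.allclose(U.conj().T @ M @ U, Op)
```

This is why matching spectra is not only necessary (eigenvalues are unitary invariants) but also sufficient for the rotation to exist.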

C. Replica
Now we move on to learning tasks such as the Loschmidt echo and the Rényi entropy. The Loschmidt echo is an interference between two wave functions, starting from the same input wave function $|\psi_l\rangle$ and evolved by two different Hamiltonians $\hat{H}_a$ and $\hat{H}_b$ for a time duration $t$, that is, $y_l = |\langle\psi_l| e^{i\hat{H}_a t} e^{-i\hat{H}_b t} |\psi_l\rangle|^2$. Here we denote $\hat{W} = e^{i\hat{H}_a t} e^{-i\hat{H}_b t}$, and for most Hamiltonians, $\hat{W}$ is a sufficiently chaotic operator for long enough $t$ [34]. In Fig. 2(b), adopting the QNNs in Figs. 1(a) and 1(b) as before, we show the loss function for learning the Loschmidt echo. We can see that, even with an ancillary qubit, the loss still saturates at a finite nonzero value after sufficiently long training. The reason is quite obvious: for the Loschmidt echo, the label $y_l$ is a quartic function of the $\{c_m\}$, while the QNN output $\tilde{y}_l = \langle\psi_l|\hat{U}^\dagger\hat{M}\hat{U}|\psi_l\rangle$ is only a quadratic function of the $\{c_m\}$. Thus, to accurately capture a learning target such as the Loschmidt echo, nonlinearity is necessary.
There are various discussions on adding nonlinearity to QNNs. Here we show that duplicating replicas of the input state is another way to incorporate nonlinearity. In fact, it is quite an efficient way in this case, which can easily be seen from

$$ y_l = |\langle\psi_l|\hat{W}|\psi_l\rangle|^2 = \big(\langle\psi_l| \otimes \langle\psi_l|\big)\, \hat{W}^\dagger \otimes \hat{W}\, \big(|\psi_l\rangle \otimes |\psi_l\rangle\big). $$

Suppose the input wave function is an $n$-qubit state; when we double the input to a $2n$-qubit state, the Loschmidt echo becomes a quadratic function in the enlarged Hilbert space. In the doubled space, the Loschmidt echo is a physical observable with $\hat{O} = (\hat{W}^\dagger \otimes \hat{W} + \hat{W} \otimes \hat{W}^\dagger)/2$ being a Hermitian operator. The fact that an $m$th-degree polynomial function of the elements of the density matrix $\rho$ can be written as the expectation value of an observable on $m$ copies of $\rho$ has also been discussed in Ref. [36].
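A quick numerical check of this doubled-space identity, with random Hermitian matrices standing in for $\hat{H}_a$ and $\hat{H}_b$:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
d = 4                                    # 2^n with n = 2

def rand_herm():
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (A + A.conj().T) / 2

# Echo operator W = exp(i H_a t) exp(-i H_b t)
t = 1.3
W = expm(1j * rand_herm() * t) @ expm(-1j * rand_herm() * t)

psi = rng.normal(size=d) + 1j * rng.normal(size=d)
psi /= np.linalg.norm(psi)

# Quartic in psi: the Loschmidt echo |<psi|W|psi>|^2 ...
echo = np.abs(psi.conj() @ W @ psi) ** 2
# ... equals a quadratic observable on the doubled state |psi> (x) |psi>
O = (np.kron(W.conj().T, W) + np.kron(W, W.conj().T)) / 2
doubled = np.kron(psi, psi)
assert np.isclose(doubled.conj() @ O @ doubled, echo)
```

The symmetrization $(\hat{W}^\dagger \otimes \hat{W} + \hat{W} \otimes \hat{W}^\dagger)/2$ makes the doubled-space operator manifestly Hermitian while leaving the expectation value unchanged on replica-symmetric states.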
We can also consider another example, the Rényi entropy. For an input wave function $|\psi_l\rangle$, by partially tracing out a subsystem $B$, the reduced density matrix of the remaining subsystem $A$ is given by $\rho_A = \mathrm{Tr}_B |\psi_l\rangle\langle\psi_l|$. The output of the QNN is in the range $[-1, 1]$, while the $m$th-order Rényi entropy is defined as $S^{(m)}_A = (\log \mathrm{Tr}\rho_A^m)/(1-m)$. Consequently, we choose the label $y_l$ as $y_l = \exp((1-m) S^{(m)}_A) = \mathrm{Tr}\rho_A^m$. We show in Fig. 2(c) the loss function for learning the second Rényi entropy. Here we consider a two-qubit system, with one qubit taken as $A$ and the other taken as $B$. Similar to the Loschmidt echo case, without replicas the loss saturates at a finite nonzero value after sufficiently long training, even with an ancillary qubit. For the second Rényi entropy, let us first study

$$ y_l = \mathrm{Tr}\rho_A^2 = \big(\langle\psi_l| \otimes \langle\psi_l|\big)\, \hat{X}_A \otimes \hat{I}_B\, \big(|\psi_l\rangle \otimes |\psi_l\rangle\big), $$

where $\hat{X}_A$ is the swap operator acting on subsystem $A$ between the two replicas, and $\hat{I}_B$ is the identity operator acting on subsystem $B$ of the two replicas. Then the second Rényi entropy, defined as $-\log y_l$, is related to a physical observable in the doubled Hilbert space. This can be generalized to higher-order Rényi entropies: for the $m$th-order Rényi entropy, we study

$$ y_l = \mathrm{Tr}\rho_A^m = \big(\langle\psi_l|^{\otimes m}\big)\, \hat{X}^{(m)}_A \otimes \hat{I}_B\, \big(|\psi_l\rangle^{\otimes m}\big), $$

where $\hat{X}^{(m)}_A$ cyclically permutes subsystem $A$ among the replicas, for which we need $m$ replicas, and the $m$th-order Rényi entropy is given by $(\log y_l)/(1-m)$. Hence, we double the size of the input of the quantum circuit and feed two replicas of the input wave function as the input. The unitary $\hat{U}$ then acts on the total Hilbert space with $\mathrm{Dim}(\mathcal{H}) = 2^{2n}$, with the same measurement on the readout qubit as discussed above. The structure is shown in Fig. 1(c). The question now is whether, with the enlarged Hilbert space, we still need an ancillary qubit, as shown in Fig. 1(d). Note that the input wave functions are all subject to the constraint that they have to be symmetric between the two replicas; therefore, these wave functions do not span the whole $2^{2n}$-dimensional Hilbert space.
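The swap-operator identity for the purity can also be checked numerically. A small sketch, assuming an $(A, B)$ index ordering inside each replica:

```python
import numpy as np

rng = np.random.default_rng(4)
dA, dB = 2, 2                      # one qubit each for subsystems A and B

psi = rng.normal(size=dA * dB) + 1j * rng.normal(size=dA * dB)
psi /= np.linalg.norm(psi)

# Purity Tr(rho_A^2) computed directly from the reduced density matrix
m = psi.reshape(dA, dB)            # index layout assumed as (A, B)
rho_A = m @ m.conj().T
purity = np.trace(rho_A @ rho_A).real

# Swap operator X_A (x) I_B on two replicas, each laid out as (A, B):
# it sends |a1 b1> (x) |a2 b2>  to  |a2 b1> (x) |a1 b2>
dim = dA * dB
S = np.zeros((dim ** 2, dim ** 2))
for a1 in range(dA):
    for b1 in range(dB):
        for a2 in range(dA):
            for b2 in range(dB):
                row = (a2 * dB + b1) * dim + (a1 * dB + b2)
                col = (a1 * dB + b1) * dim + (a2 * dB + b2)
                S[row, col] = 1.0

doubled = np.kron(psi, psi)
assert np.isclose((doubled.conj() @ S @ doubled).real, purity)
```

Summing over the swapped indices contracts two copies of $\rho_A$ against each other, which is exactly how the quartic quantity $\mathrm{Tr}\rho_A^2$ becomes quadratic in the doubled space.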
Let us denote this symmetric Hilbert space by $\mathcal{H}_0$. With the general statement discussed above, it is important to analyze whether $\mathrm{Dim}(\mathcal{H}_0)$ is larger than half of $\mathrm{Dim}(\mathcal{H})$. It can be shown that

$$ \mathrm{Dim}(\mathcal{H}_0) = 2^n + \frac{2^n(2^n-1)}{2} = \frac{2^n(2^n+1)}{2}, \qquad (11) $$

where the first term counts the dimension of the sub-Hilbert space spanned by $|m\rangle \otimes |m\rangle$, and the second term counts the dimension of the sub-Hilbert space spanned by $(|m\rangle \otimes |m'\rangle + |m'\rangle \otimes |m\rangle)/\sqrt{2}$ with $m \neq m'$. It is obvious that $\mathrm{Dim}(\mathcal{H}_0)$ is larger than $\mathrm{Dim}(\mathcal{H})/2$. In other words, because $\mathrm{Dim}(\mathcal{H}_1) < \mathrm{Dim}(\mathcal{H}_0)$, one still needs the ancillary qubit in order to construct a proper $\hat{O}'$. This can be seen in Figs. 2(b) and 2(c) for learning the Loschmidt echo and the second Rényi entropy, respectively. In particular, for learning the second Rényi entropy, one can see that the loss can still be reduced by adding an ancillary qubit even with the doubled input.
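The two-replica counting can be cross-checked by taking the trace of the projector onto the symmetric subspace, $(\hat{I} + \hat{S})/2$ with $\hat{S}$ the replica swap:

```python
import numpy as np

n = 2
d = 2 ** n                               # single-replica dimension

# Swap operator on the doubled space: S |m> (x) |m'> = |m'> (x) |m>
S = np.zeros((d * d, d * d))
for i in range(d):
    for j in range(d):
        S[j * d + i, i * d + j] = 1.0

# Projector onto the symmetric subspace H_0; its trace is Dim(H_0)
P = (np.eye(d * d) + S) / 2
dim_H0 = int(round(np.trace(P)))
assert dim_H0 == d * (d + 1) // 2        # = 2^n (2^n + 1) / 2
assert dim_H0 > (d * d) // 2             # exceeds Dim(H)/2: ancilla needed
```

For $n = 2$ this gives $\mathrm{Dim}(\mathcal{H}_0) = 10 > 8 = \mathrm{Dim}(\mathcal{H})/2$, matching the conclusion that two replicas alone are not enough.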
The situation is different when one considers the tripled Hilbert space, for instance, for the third Rényi entropy. For the tripled Hilbert space, if we still require the wave functions to be symmetric among the three replicas, similar to the analysis leading to Eq. (11), we obtain the Hilbert space dimension

$$ \mathrm{Dim}(\mathcal{H}_0) = 2^n + 2^n(2^n-1) + \frac{2^n(2^n-1)(2^n-2)}{6} = \frac{2^n(2^n+1)(2^n+2)}{6}, \qquad (12) $$

which satisfies $\mathrm{Dim}(\mathcal{H}_0) \leq \mathrm{Dim}(\mathcal{H})/2$ with $\mathrm{Dim}(\mathcal{H}) = 2^{3n}$. Therefore, in this case, the requirement for finding a proper $\hat{O}'$ can be satisfied without adding an ancillary qubit. This can be seen in Fig. 2(d). With two replicas, the loss cannot be reduced to a sufficiently small value, in both cases without and with the ancillary qubit. However, when there are three replicas, even without an ancillary qubit, the loss can already drop sufficiently close to zero. And for sufficiently long training, the losses for the QNN with and without the ancillary qubit approach the same value. This shows that the ancillary qubit is not necessary when there are three replicas. The same conclusion generalizes to situations with more than three replicas.
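The three-replica counting, and the fact that it stays within half of the total dimension, can be checked for small $n$; the three terms of Eq. (12) sum to the standard symmetric-subspace dimension $\binom{2^n+2}{3}$:

```python
from math import comb

for n in range(1, 6):
    d = 2 ** n
    # Eq. (12): diagonal + pairwise-symmetric + fully distinct combinations
    dim_H0 = d + d * (d - 1) + d * (d - 1) * (d - 2) // 6
    assert dim_H0 == comb(d + 2, 3)      # symmetric-subspace dimension
    # For three replicas the condition Dim(H_0) <= Dim(H)/2 holds
    assert dim_H0 <= d ** 3 // 2
```

Note that at $n = 1$ the bound is saturated ($4 = 4$), while for $n \geq 2$ the symmetric subspace falls strictly below half of the tripled Hilbert space.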

III. DISCUSSION
In this work, we considered the expressivity of QNNs for learning targets that are observables (i.e., expectation values of a Hermitian operator) of the input wave functions. These also include situations in which the learning targets are not observables of the input wave functions but can be expressed as observables in an enlarged Hilbert space with multiple replicas of the input wave functions, such as the Loschmidt echo and the Rényi entropy. The main finding of this work is that such targets can be expressed accurately only when the input wave functions of the entire dataset occupy a subset $\mathcal{H}_0$ of the entire Hilbert space $\mathcal{H}$ that the quantum circuit acts on; specifically, we require the condition $\mathrm{Dim}(\mathcal{H}_0) \leq \mathrm{Dim}(\mathcal{H})/2$. An accurate approximation of the learning target is possible for a sufficiently deep QNN either when this condition is satisfied naturally by the dataset or when the condition is enforced by artificially adding an ancillary qubit.
Our discussion also provides a general recipe for improving the learning accuracy when no prior knowledge of the learning task is available. First, one can try to add an ancillary qubit. If this is not satisfactory, one can duplicate two replicas with an ancillary qubit. Finally, if still not satisfactory, one can add more replicas; when the number of replicas equals or exceeds three, the ancillary qubit is no longer needed. However, we should also note that increasing the number of replicas demands substantial computational resources, so this method cannot be extended to a large number of replicas. This limits this way of adding nonlinearity, and for situations in which the nonlinearity is too strong, it should be combined with other ways of adding nonlinearity.
In the future, we can consider a number of generalizations of this study. First, here we focused on learning targets that are observables or generalized observables; one can consider more sophisticated learning targets. Second, here we focused on regression tasks; one can consider classification tasks. Third, here we focused on fully connected architectures; one can consider other architectures of QNNs, such as convolutional QNNs [25][26][27] and recurrent QNNs [28,29]. We hope such studies will lead to an analog of the universal approximation theorem for QNNs and lay the foundation of the expressive power of QNNs.