Using a single circuit to compute the gradients with respect to all parameters of a quantum neural network

Guang Ping He

doi:10.15302/frontphys.2026.083201

Front. Phys. ›› 2026, Vol. 21 ›› Issue (8) :083201 DOI: 10.15302/frontphys.2026.083201

RESEARCH ARTICLE

Using a single circuit to compute the gradients with respect to all parameters of a quantum neural network

Guang Ping He

Author information +

History +

PDF (2058KB)

Abstract

Finding gradients is a crucial step in training machine learning models. For quantum neural networks, computing gradients using the parameter-shift rule requires calculating the cost function twice for each adjustable parameter in the network. When the total number of parameters is large, the quantum circuit must be repeatedly adjusted and executed, leading to significant computational overhead. Here, we propose an approach to compute all gradients using only a single circuit, significantly reducing both the circuit depth and the number of classical registers required. Although the theoretical complexity of gradient calculation and the total number of measurements are not reduced by this approach, a considerable reduction in the number of unique circuit compilations and job submissions is achieved. We experimentally validate our approach on both quantum simulators and IBM’s real quantum hardware, demonstrating that our method significantly reduces circuit compilation time compared to the conventional approach, resulting in a substantial speedup in total runtime.

Graphical abstract

Keywords

quantum machine learning / variational quantum circuit / parameter-shift rule / gradient descent / quantum algorithm

Cite this article

Download citation ▾

Guang Ping He. Using a single circuit to compute the gradients with respect to all parameters of a quantum neural network. Front. Phys., 2026, 21(8): 083201 DOI:10.15302/frontphys.2026.083201

登录浏览全文

4963

注册一个新账户忘记密码

1 Introduction

Artificial intelligence technology is making incredible progress nowadays. One of the main reasons is that classical artificial neural networks are made very feasible with the advance of computer science in recent years. On the other hand, though it is widely believed that quantum neural networks may breathe new life into the researches [1], they are still not as practical as their classical counterparts due to the scale and performance of currently available quantum computers. Especially, highly effective methods have been developed for computing the gradients of the cost function with respect to the adjustable parameters of classical neural networks, e.g., the backpropagation algorithm [2, 3]. A distinct feature is that the cost function needs to be calculated only once, and all the gradients will be readily deduced, so that classical neural networks can be trained fast and efficiently. But for quantum neural networks, computing the gradients is considerably less convenient [4]. Effective algorithms were found for certain network architectures only [5]. For other general architectures, the parameter-shift rule [6, 7] seems to be the best algorithm found for this task so far. But it needs to calculate the cost function twice for computing the gradient with respect to a single parameter of the network (which will be further elaborated below). The values of some parameters for each calculation need adjustment too. Consequently, the quantum circuit for calculating the cost function has to be modified many times in order to obtain the gradients with respect to all parameters in each single round of training of the quantum network.

In this paper, we will propose an approach in which the gradients with respect to all parameters of a quantum neural network can be computed simultaneously using a single circuit only. The main purpose is to break the limit on the scale of computation on real quantum hardware. As it is well-known, the number of adjustable parameters of a powerful neural network is generally very high. If we have to run two quantum circuits to compute the gradient with respect to a single parameter, then the total number of unique executed circuits could easily exceed the capacity of any real quantum computer in existence. Of course, the execution of these circuits can be broken down into many “jobs”. But it is obviously inconvenient when we are using quantum cloud computation platforms. Especially, if the jobs are submitted separately, then there could be a long waiting time between the execution of the jobs. For example, for free users on the IBM Quantum Experience online platform, this waiting time could vary from 10 minutes to 4 hours. For paying users, as also pointed out in Ref. [7], “quantum hardware devices are typically billed and queued per unique circuit”. According to Amazon Braket, the quantum computation platforms Rigetti, IonQ, OQC, and Quera all charge

$

0.3 USD per unique circuit in addition to other fees [8]. A conventional method for reducing the number of unique circuits is to “stack” many circuits into a single job. In the Qiskit software development kit for running IBM quantum computers and simulators with Python programs, this can be done using the “append” function [9]. But there is also a restriction on its usage on real quantum computers. For example, on the “ibmq_quito” backend, the number of circuits allowed to be appended is limited to 100. To break this limit, the Qiskit’s “compose” function can serve as an alternative. According to our experience, it runs much slower than the “append” method does. On platforms that charge by the runtime, slower programs mean higher cost. As of now, the rate in IBM Cloud’s Pay-As-You-Go Plan is

$

96 USD per minute [10]. Nevertheless, it can indeed stack more circuits into one single job so that the gradients with respect to more parameters can be computed together.

In the following, we will use the direct stacking method based on the “compose” function (referred to as the “conventional approach”) as a baseline to compare with the performance of our proposed method that uses a single circuit only (referred to as the “improved approach”). In the next section, we will briefly introduce the typical structure of quantum neural networks. The parameter-shift rule will be reviewed in Section 3. Subsequently, our improved approach will be presented in Section 4, with its theoretical advantages — namely, reduced circuit depth and lower classical memory requirements — elaborated in Section 5. To demonstrate these advantages, we conduct experiments and the results are reported in Section 6. These results show substantial savings in the number of unique circuit compilations and job submissions while achieving a significant speedup in total runtime, particularly as the number of input data points and the scale of the quantum circuit increase, despite that the theoretical complexity of gradient calculation and the total number of measurements required are not reduced. However, this approach also carries some theoretical disadvantages: the circuit needs two additional qubits, and requires more shots to run compared to the conventional approach, as explained in Section 7. Finally, in Section 8, we discuss the generalization of our approach, and suggest some improvements on real quantum hardware which can make our approach even faster.

2 The variational quantum circuit

There are various types of quantum neural networks. Among them, variational quantum circuits (VQCs) [11-21] have the advantage that it can be implemented on current or near-term noisy intermediate-scale quantum computers, as demonstrated experimentally [22-25]. Applications within this framework also include variational quantum simulators [26, 27], quantum approximate optimization algorithm [28-30], etc. Research domains extend across high-energy physics [31, 32], cybersecurity [33, 34], and finance [35, 36]. In this paper, we will use VQCs as an example to illustrate how our improved approach can be applied to compute the gradients with respect to their parameters.

As shown in Fig. 1(a), the general structure of a VQC can be divided into three blocks: the feature map, the ansatz, and the measurement of the observable, which gives the value of the cost function. Note that the goal of this paper is to provide a method for computing the gradients, instead of finding a VQC for a specific application. To this end, the details of the feature map and the measurement are less relevant, and our approach does not need to modify them. Thus, we will leave their description in Section 6. The ansatz is the most important block, because it contains all the adjustable parameters with respect to whom the gradients need to be computed. Our example uses the RealAmplitudes ansatz [37], which is the default ansatz in Qiskit’s VQC implementation. As pointed out in Ref. [17], this ansatz has also been used in a VQC with proven advantages over traditional feedforward neural networks in terms of both capacity and trainability [16]. The typical structure of the RealAmplitudes ansatz is illustrated in Fig. 1(b).

3 The parameter-shift rule

Finding the gradients is a must for optimizing neural networks for certain applications. In our example, the original parameter-shift rule proposed in Ref. [6] will be used, where the exact gradient of the cost function

f = f (θ)

with respect to the parameters

θ = (θ 1, . . ., θ n)

of the quantum gates is obtained as

(1)

∂ f ∂ θ i = r (f (θ + s e i) − f (θ − s e i))

for

i = 1, . . ., n

, where

r = 1 / 2

and

s = π / 2

for all the rotation and phase gates available in the Qiskit library.

This equation shows that computing the gradient with respect to a single parameter

θ i

needs to calculate the cost function

f

twice, each time with a different set of parameters, i.e.,

θ + s e i = (θ 1, . . ., θ i + s, . . ., θ n)

and

θ − s e i = (θ 1, . . ., θ i − s, . . ., θ n)

. Therefore, when the total number of parameters of a VQC is

n

, computing all the gradients will have to run

2 n

circuits with different values of the parameters. To stack these circuits into a single job, the aforementioned conventional approach is simply to connect them one after another using Qiskit’s “compose” function, as illustrated in Fig. 2. Each circuit is independently executed and measured before the next one is re-initialized and parameterized with new values, and the measurement results are stored in different classical registers.

4 Our improved approach

Now it will be shown how to compute all the

2 n

shifted cost functions as well as the original unshifted cost function

f (θ)

using a single circuit. The main idea is to introduce additional gates for realizing the shifts to the existing parameterized gates of the ansatz. Each of these additional gates is activated with a certain probability, so that each shift stands a chance to take effect. When none of these gates activate, the circuit acts exactly like the original ansatz so that the unshifted cost function will be computed. The key part is to ensure that these gates will be activated only one at a time at the most. That is, in each run (i.e., “shot”) of the circuit, there should not be two (or more) additional gates activated simultaneously. Otherwise, the circuit will output the cost function with the form like

f (θ + s e i + s e j)

(

i ≠ j

) which could be useful for calculating higher-order derivatives but is useless for computing the gradient via Eq. (1), while significantly lowers the efficiency of the circuit.

For this purpose, we add two additional control qubits

q a

and

q b

to the circuit, regardless the number of qubits and parameters in the original ansatz. Then for each existing single-parameter rotation gate

R Y (θ i)

(

i = 1, . . ., n

), we added two blocks acting on the qubit it rotates and the above two additional control qubits, to turn the parameter from

θ i

θ i + s

and

θ i − s

, respectively. Each block contains 2 controlled-RY gates, 1 CX (two-qubit controlled-NOT) gate, 1 measurement and 1 reset operation. Figure 3 showcases an example of the resultant circuit. For simplicity, only three RY gates in the RealAmplitudes ansatz of the original VQC [corresponding to the three RY gates before any of the red dashed boxes in Fig. 1(b)] are studied, and shown with the two additional blocks added to each of them. We also uploaded the diagram of the complete quantum circuit for a VQC with 10 qubits (which will be used in Section 6 for the classification task of the MNIST dataset) to Github (available at: github.com/gphehub/grad2210/blob/main/Fig.8_Complete circuit for MNIST classification.emf).

Let us track the circuit in Fig. 3 from left to right to see how it works. The role of the first additional control qubit

q a

is to record whether any of the additional blocks has been activated. It can also be considered as a switch. If it is in the state

| ψ ⟩ q a = | 1 ⟩ q a

, then it means that no block has ever been activated yet. Or if it is flipped to the state

| ψ ⟩ q a = | 0 ⟩ q a

, then it implies that one of the blocks was already activated. The second additional control qubit

q b

acts like a quantum dice, which controls the probability for each block to be activated. At the very beginning of the whole VQC,

q a

and

q b

are initialized in the states

| 1 ⟩ q a

and

| 0 ⟩ q b

, respectively.

Next, in the first green block right behind the first RY gate

R Y (θ 1)

, the first gate

C R Y (γ 0)

is a controlled-RY gate with

q a

(

q b

) serving as the control (target) qubit. It introduces a rotation angle

(2)

γ 0 = 2 arcsin ⁡ 1 N − 0

about the

y

-axis to

q b

only when

| ψ ⟩ q a = | 1 ⟩ q a

. Here

N = 2 n + 1

. Then the state of

q b

becomes

(3)

| ψ ⟩ q b = cos ⁡ γ 0 2 | 0 ⟩ q b + sin ⁡ γ 0 2 | 1 ⟩ q b = 1 − 1 N − 0 | 0 ⟩ q b + 1 N − 0 | 1 ⟩ q b .

A measurement in the computational basis

{| 0 ⟩, | 1 ⟩}

is then made on

q b

, and the result is sent to a classical register. With probability

1 / N

the measurement result will be

| 1 ⟩ q b

. In this case, the controlled-RY gate

C R Y (s)

next to the measurement will be activated, introducing a rotation angle

s = π / 2

q 1

(i.e., the qubit on which the gate

R Y (θ 1)

of the original ansatz is applied). As proven in Appendix A, the RY gates has the property

(4)

R Y (s) R Y (θ 1) = R Y (θ 1 + s),

we can see that a shift

s

is successfully introduced to the parameter

θ 1

. Meanwhile, with the next CX gate where

q b

(

q a

) serves as the control (target) qubit, the state of

q a

is turned from

| 1 ⟩ q a

into

| 0 ⟩ q a

, indicating that the current block was activated. Also, at the end of this block, the state of

q b

is reset to

| 0 ⟩ q b

regardless the measurement result. Consequently, all the rest blocks that follows will be bypassed. After executing the entire circuit, the final result will give the cost function

f (θ + s e 1)

On the other hand, with probability

1 − 1 N

the measurement result on

q b

in this block will be

| 0 ⟩ q b

, as indicated by Eq. (3). Then the two aforementioned controlled gates in the same block following this measurement will not take effect at all, so that the states of

q a

and

q 1

remain unchanged, allowing the second green block to be activated.

Similarly, in the second green block, since the state of

q b

was reset to

| 0 ⟩ q b

, the first gate

C R Y (γ 1)

introduces a rotation angle

(5)

γ 1 = 2 arcsin ⁡ 1 N − 1

q b

when

| ψ ⟩ q a = | 1 ⟩ q a

, turning the state of

q b

into

(6)

| ψ ⟩ q b = cos ⁡ γ 1 2 | 0 ⟩ q b + sin ⁡ γ 1 2 | 1 ⟩ q b = 1 − 1 N − 1 | 0 ⟩ q b + 1 N − 1 | 1 ⟩ q b .

Then

q b

is also measured in the computational basis, and the result is stored in another classical register. With probability

1 / (N − 1)

the result will be

| 1 ⟩ q b

. Note that this case occurs only when the first block was not activated, i.e., the measurement result of

q b

in the first block was

| 0 ⟩ q b

, which occurred with probability

1 − 1 N

. Thus, the total probability for finding

q b

| 1 ⟩ q b

at this stage (i.e., the second block is activated) is

(7)

p r o b 2 = (1 − 1 N) 1 N − 1 = 1 N,

which equals to the probability for activating the first block. When this result indeed occurs, the next gate

C R Y (− s)

will introduce the rotation angle

− s = − π / 2

q 1

, and the CX gate next to it will turn the state of

q a

into

| 0 ⟩ q a

, making the rest blocks bypassed. Then the final result of the entire circuit will be the cost function

f (θ − s e 1)

For the same reasons, by setting the rotation angle of the first controlled-RY gate

C R Y (γ j)

in the

(j + 1)

th additional block as

(8)

γ j = 2 arcsin ⁡ 1 N − j

(

j = 0, . . ., N − 2

), the probability that the

(j + 1)

th additional block is activated while all the

j

additional blocks before it are not activated is

(9)

p r o b j + 1 = [∏ i = 0 j − 1 (1 − 1 N − i)] 1 N − j = [N − 1 N − 0 N − 2 N − 1 N − 3 N − 2 . . . N − j N − (j − 1)] 1 N − j = 1 N .

That is, every block stands the same probability

1 / N

to be activated. The CX gates on

q b

and

q a

also ensure that only one block at the most will be activated. As a result, executing the whole circuit for one shot will give the output corresponding to the shifted cost function

f (θ + s e j)

f (θ − s e j)

for a single

j

only. When none of the block was activated, which also occurs with probability

(10)

p r o b u n s h i f t e d = ∏ i = 0 N − 2 (1 − 1 N − i) = N − 1 N − 0 N − 2 N − 1 N − 3 N − 2 . . . N − (N − 1) N − (N − 2) = 1 N,

the circuit outputs the unshifted cost function

f (θ)

. By reading the classical registers

c

that store the measurement results of

q b

in every block to see which block was activated, we can learn which cost function was calculated. Running the circuit for many shots will then provide enough shots for each of these cost functions, so that all gradients in Eq. (1) can be obtained with the desired precision.

After finding the gradients, the standard training routine can be applied for optimizing the corresponding VQC. That is, the value of each adjustable parameter

θ i

can be updated as

(11)

θ i → θ i ′ = θ i − η ∂ f ∂ θ i,

where

η

is the learning rate chosen by the user. Then the VQC should be reconstructed by using

θ ′ = (θ 1 ′, . . ., θ n ′)

as the new adjustable parameters of the ansatz. This completes one epoch of training. Computing the gradients using the new VQC and repeating this updating procedure for many epochs will eventually minimize the unshifted cost function, which means that the VQC is optimized for the given task.

5 Theoretical advantages

5.1 Circuit depth

Comparing with the conventional approach, a significant advantage of our improved approach is that the depth of the entire circuit is reduced, as estimated below.

For each single input data, let

Λ

denote the depth of the circuit for calculating the cost function once using the conventional approach, and

n

denote the number of the adjustable parameters. Define

(12)

λ = Λ n .

In many ansatz,

λ

remains non-trivial even for high

n

(which will be proven in Appendix B by using the RealAmplitudes ansatz as an example). Recall that computing the gradient with respect to a single parameter using the parameter-shift rule needs to calculate the cost function twice. Thus, the total depth of the stacked circuit for computing all the

n

gradients and the unshifted cost function using the conventional approach is

(13)

D c o n v = Λ (2 n + 1) = λ n (2 n + 1) ∝ O (n 2) .

On the other hand, when using our improved approach, from Fig. 3 we can see that a total of 10 operations were added to each existing parameterized RY gate of the original ansatz. Therefore, to compute all the

n

gradients and the unshifted cost function, the total depth of our circuit is simply

(14)

D i m p r = Λ + 10 n = λ n + 10 n ∝ O (n) .

This result shows that our improved approach has the advantage that it takes a lower circuit depth when the number of parameters

n

is high, and the improvement will be more significant with the increase of

n

5.2 Number of classical bits in the stacked circuits

In the conventional approach, for each single input data, running the circuit for calculating the cost function once (either shifted or unshifted) takes

Q

classical bits to store the final measurement result of the

Q

qubits. When all the

2 n + 1

circuits for computing all the

n

gradients and the unshifted cost function are stacked together so that it can be submitted as a single job on the real hardware, the total number of required classical bits is

(15)

N c o n v = Q (2 n + 1) .

In our improved approach, the

Q

qubits of the original ansatz are measured only once at the very end of the circuit. The additional qubit

q b

is measured twice in the two additional blocks added to each of the

n

parameterized gates of the original ansatz, which takes

2 n

classical bits to store the results. At the end of the circuit, both the additional qubits

q a

and

q b

do not need to be measured. But for simplicity, we added these two measurements in our code too (not shown in Fig. 3), which may also serve as an additional monitoring of the running of the circuit. Consequently, the total number of classical bits required in our approach is

(16)

N i m p r = Q + 2 n + 2,

which is smaller than that of the conventional approach since generally

Q ≥ 2

. This is another advantage of our approach.

6 Experiments

To compare the performance of our improved approach and the conventional one, we tested them experimentally on both real quantum hardware and simulator.

6.1 The input datasets

Three classical input datasets were used in our experiments, as summarized in Table 1. The 8-dimensional dataset is generated by us, where each data point

x

contains 8 random numbers

{x 1, . . ., x 8}

uniformly distributed over the interval

[0, 1)

. The intention of using such a random dataset is to ensure that our experimental results could stand less chances to be biased by the structure of the dataset.

The 784-dimensional Modified National Institute of Standards and Technology (MNIST) dataset [38] is a widely-used resource for machine learning research [39]. It contains

70 000

greyscale

28 × 28 = 784

pixel images of handwritten digits

0 − 9

The Iris dataset [40] (available as a built-in dataset in the Scikit-learn Python library) consists of 150 samples of Iris flowers in 3 categories. Each sample contains

4

feature values.

6.2 The feature map

Our feature map uses the amplitude encoding method [14], where each

d

-dimensional classical input data point

x = {x 1, . . ., x d}

is encoded as the amplitudes of a

Q

-qubit quantum state

(17)

| Φ x ⟩ = C n o r m ∑ i = 1 d x i | i ⟩ .

Here

d = 2 Q

| i ⟩

is the

i

th computational basis state, and

C n o r m

is the normalization constant.

6.3 The cost function

For the MNIST and 8-dimensional datasets, the cost function of our networks (either using the improved approach or the conventional one) is taken as

(18)

C o s t = 1 m ∑ x | a (x) − y (x) |,

where

m

is the number of input data,

a (x)

is the vector of outputs from the network when

x

is input, and

y (x)

is the desired output, i.e., the label of the input

x

(as contained in the MNIST dataset) in the form of one-hot encoding.

Like the MNIST dataset, we also give each data in our 8-dimensional input dataset a label, so that our program can serve as a classifier if needed. But since our current data are generated randomly, we simply label all data as “2” without actual meaning.

For the Iris dataset, we choose the cost function as

(19)

C o s t = − 1 m Q ∑ x log ⁡ a y (x),

where

Q

is the number of qubits, and

a y (x)

is the

y

th component of the vector

a (x)

, with

y

being the label of the input

x

in its original decimal form.

6.4 Results

6.4.1 Gradient

To test the validity of our improved approach, we used quantum simulator to run a VQC with the 8-dimensional classical input data, which takes

Q = 3

qubits to encode when using the amplitude encoding method. The RealAmplitudes ansatz in this VQC contains 1 repetition with full entanglement. Thus, it has

n = 6

adjustable parameters. We first calculated the exact amplitude distribution of the output states numerically, and used the result of the gradients as reference. Then we ran the conventional approach for

s = 500

shots. For comparison, our improved approach for the same VQC was run for

s ′ = 500 × (2 n + 1) = 6500

shots, so that each cost function could be calculated for about

s = 500

shots in average (for simplicity, in the following when we say that the average number of shots for our improved approach is

s

, we mean that the actual number of shots is

s ′ = s (2 n + 1)

). The numbers of input data were all taken to be

m = 20

in these experiments. The results of the gradients are shown in Table 2.

From the results we can see that when the (average) number of shots is

s = 500

, both approaches resulted in similar precision when comparing with the exact values. [The precision for the gradient

g r a d (θ 2)

with respect to the second adjustable parameter

θ 2

was poor for both approaches, probably because the point happened to meet the barren plateau [41-44].] Thus, it is proven that the modified quantum circuit (i.e., Fig. 3) in our improved approach works correctly as desired.

6.4.2 Individual shots

We further study the individual number of shots for each single cost function that was actually calculated in the above experiment using our improved approach. The result is shown in Fig. 4. It is found that the fluctuation of the individual number of shots felt within the range

[445, 587]

, with a standard deviation

22.6

, which is only

4.52 %

of the average value

s = 500

, indicating that all the cost functions were calculated to almost the same precision.

6.4.3 Runtimes

We compare the runtimes of both approaches using 3 experiments:

Exp. 1: The VQC used in Table 2 and Fig. 4, run on quantum simulator.

Exp. 2: The same VQC run on real quantum hardware (the “ibmq_quito” backend).

Exp. 3: A VQC with 784-dimensional classical input data, with the RealAmplitudes ansatz containing

r = 2

repetitions and full entanglement, run on quantum simulator. Note that this VQC takes

Q = 10

qubits to encode each input data with the amplitude encoding method, so that the total number of adjustable parameters is

n = 30

as given by Eq. (B.1).

Let

t c

(

t r

) denote the time spent on compiling (running) the circuit. The sum

t s = t c + t r

can serve as a good measure of the performance of the approaches, as the other parts of the computer programs (e.g., reading the input data and initial parameters, calculating the cost functions and gradients from the raw counting result of quantum simulator or real hardware, and exporting the results to the output files) run very fast and take almost the same amount of time for both approaches. The results are shown in Fig. 5. The number of shots for each data point was fixed as

s = 500

(i.e., the actual number of shots was

s ′ = 500 (2 n + 1)

for our improved approach).

The following observations can be found in all these experiments.

1) The values of

t r

show that our improved approach always takes more time to run than the conventional approach does, either on simulator or real hardware. The difference varies from

3 − 4

times (in Exp. 1 and Exp. 2) to 24 times (in Exp. 3). This is not surprising, because the improved one needs to be run for more shots to reach the same precision.

2) On the contrary, the values of

t c

show that our improved approach saves a tremendous amount of time for compiling the circuit, which is only about

4 %

(in Exp. 1 and Exp. 2) to

0.1 %

(in Exp. 3) of that of the conventional approach. This is owed to the significantly reduced circuit depth, as shown in Eq. (14).

3) As the result of the competition between

t c

and

t r

, the total runtime

t s

of the conventional approach is shorter when

m

(the number of input data) is low. But as

m

increases, our improved approach will eventually become faster. In Exp. 1, the turning point occurred at around

m = 15

where both approaches took about 8 seconds to complete. After that,

t s

of our approach increased approximately in a linear way, while

t s

of the conventional one rose dramatically, mostly due to the long compiling time

t c

. In both Exp. 2 and Exp. 3, the same behavior can be observed. The turning points occurred relatively later though. This is because the real quantum hardware used in Exp. 2 runs about

10 − 30

times slower than the simulator used in Exp. 1, and Exp. 3 has a more complicated circuit than that of Exp. 1 though it is also run on simulator. Consequently, both Exp. 2 and Exp. 3 have a longer

t r

than that of Exp. 1. Meanwhile, all the three experiments have similar

t c

values because the compiling of the circuits are always done on classical computers. Therefore, the slower the quantum hardware or simulator is, the later the turning point of the total runtime

t s

occurs.

In the age of big data, for real applications of neural networks, the number of input data is generally much higher than what was used here. As a result, we can see that our improved approach has the advantage that it can save the total runtime. Currently, this advantage is most pronounced on quantum platforms where compilation overhead or job queuing time dominates the total runtime and cost. Its benefit might be reduced or even negative when the number of data is low, or in scenarios where pure quantum execution time is the sole bottleneck or cost metric. Nevertheless, we can expect the advantage to be even more obvious with the advance of real quantum computers, because they will surely run faster in the future than they do today, resulting in a shorter

t r

so that the turning point of

t s

could occur even earlier.

It is also worth noting that if Qiskit’s “append” function were used for stacking the circuits, the “ibmq_quito” real quantum backend only allows

100

circuits to be appended in a single job. Since the VQC in Exp. 2 has 6 adjustable parameters, there are totally

13

shifted and unshifted cost functions to be calculated for each single input data. It means that using the “append” function can handle only

m = [100 / 13] = 7

input data at the most. But Exp. 2 [i.e., Fig. 5(b)] shows that both the conventional approach (using the “compose” function) and our improve approach can accomplish the computation for at least

m = 140

input data in a single job with no problem. Thus, they managed to break the limit on real quantum hardware.

6.4.4 Noise

The impact of noise in the circuits was studied and shown in Fig. 6, where the value of the average deviation

Δ ¯

was calculated as follows. The exact values

f i e x a

(

i = 1, . . ., 2 n + 1

) of all the shifted and unshifted cost functions were calculated numerically beforehand. Then the values

f i s i m

of these cost functions were computed using the noise model in the simulator, where the error rate

ϵ

of all quantum gates were taken to be equal for simplicity, and ranged from 0.001 to 0.01. The number of shots for each cost function was

s = 1000

(i.e.,

s ′ = 1000 (2 n + 1)

for our improved approach). Define

(20)

Δ i = | f i s i m − f i e x a f i e x a |

as the absolute value of the relative deviation between the results of the simulator and the exact values, and the average deviation is obtained as

(21)

Δ ¯ = 1 2 n + 1 ∑ i = 1 2 n + 1 Δ i .

We first studied the same VQC used in Exp. 1 of Fig. 5(a), which takes 8-dimensional classical input data, with the RealAmplitudes ansatz containing

r = 1

repetition and full entanglement. The result is shown in Fig. 6(a), where the number of input data was

m = 20

. We can see that the effect of noise in our improved approach is about

2 − 3

times more significant than that in the conventional approach. Theoretically, this is not unexpected, because it takes 3 qubits to encode 8-dimensional classical input data in the conventional approach, while our improved approach adds 2 more qubits to control the shifts to the parameterized gates. The same error rate surely brings more errors to 5 qubits than it does to 3 qubits. But we can expect that the difference should become less significant with higher-dimensional classical input data, which means an increase on the number of total qubits in the VQC. This is because in our improved approach, the number of additional qubits is always 2, regardless the number of qubits in the original VQC. Indeed, a VQC with 784-dimensional classical input data was studied in Fig. 6(b). It is similar to the VQC in Exp. 3 of Fig. 5(c), but since the noise model in the simulator runs much slower than the noiseless one, here we only took

r = 1

repetition in the RealAmplitudes ansatz, so that the total number of adjustable parameters dropped to

n = 20

. The number of input data was taken as

m = 1

. For 784-dimensional input, our improved approach takes

12

qubits in total, while the conventional approach takes 10 qubits. The result in Fig. 6(b) verified the conjecture that the noise should display less difference between the two approaches when the number of qubits becomes close. In fact, when

ϵ = 0.001

, 0.003 and 0.005, the average deviation of our improved approach is even slightly lower than that of the conventional approach (which is believed to be the consequence of the randomness in quantum simulation). The link to the source data of all figures is provided in the Data and Code Availability section. More details of the computer programs and the environment is given in Appendix C.

6.4.5 The final model

Last but not the least, we studied the performance of our approach using a complete training process for optimizing a classifier on the Iris dataset. For this classification task, we used a VQC with

Q = 3

qubits and the RealAmplitudes ansatz contains

r = 4

repetitions, which has

n = 15

adjustable parameters. We trained the VQC on quantum simulator for 50 epochs with the learning rate fixed at 2. In each epoch, 40 samples were randomly chosen from the Iris dataset and stored in a data file. Then the computer programs of the conventional approach and our improved approach used the same file as the input data, so that the training will not be biased by the randomized process. The cost function in each epoch was computed on these 40 samples, and the classification accuracy was computed using all the 150 samples as the verification data. To avoid the runtime of the training being messed up with that of the computation of the accuracy, we computed the accuracy only when the whole training is completed, by using the values of the adjustable parameters of the VQC generated and recorded in each epoch. Both approaches also use the same set of initial values of the adjustable parameters randomly chosen at epoch 0.

The comparison between the runtimes of the two approaches as a function of the training epochs is shown in Fig. 7(a). It is not surprising to see that the runtimes grow linearly with the number of epochs. And we found once again that our improved approach ran much faster than the conventional approach did. The total runtimes

t s

for 50 epochs are 2819 seconds (our improved approach) vs. 21 672 seconds (the conventional approach).

Figures 7(b) and (c) show the cost function and classification accuracy of the VQC trained by the two approaches as a function of the training epochs. We can see that, although the fluctuation of the individual number of shots implies that our improved approach cannot calculate all the cost functions (and therefore all the gradients) to exactly the same precision, the training process is less affected. Especially, the accuracy of the VQC reached its maximum

98 %

at epoch 46 in both approaches. We also listed the trained values of the adjustable parameters of the VQC at epoch 46 in Table 3, from which we can conclude that the final machine learning model trained using our improved approach is very close to that of the conventional approach. It is reasonable to expect that they can be even closer if we run both approaches with more shots.

7 Theoretical disadvantages

Our improved approach requires two more qubits than the conventional approach does. It is considerable on free quantum computation platforms, which generally offers a total of

5 − 6

qubits only. But on near-term quantum devices that could really play a role in practical applications, the number of qubits has to be much higher, so that taking two more qubits should not make much difference.

What really matters is that our approach takes more shots than the conventional approach does in order to compute the gradients up to approximately the same precision. This is because in the conventional approach, when the stacked circuit is executed for

s

shots, each of the shifted and unshifted cost functions is calculated for exactly

s

shots too. But in our approach, when executing the circuit once, only one of the cost functions is calculated, depending on which of the additional blocks was activated by chance. To ensure that each cost function will be calculated for approximately

s

shots, our circuit needs to be executed for about

s ′ = s (2 n + 1)

shots in total. This surely increases the runtime of the program. But recall that our circuit has a lower depth, so that it takes less time to compile. Therefore, whether our approach can result in a speedup on the total runtime depends on the competition between these two factors, as manifested in Figs. 5 and 7.

Also, in our approach whether an additional block will be activated is controlled by the measurement result of the qubit

q b

, where quantum uncertainty takes effect. Consequently, executing the circuit for

s ′ = s (2 n + 1)

shots does not mean that each cost function will be calculated for

s

shots exactly. Some statistical fluctuations are inevitable. Nevertheless, the relationship between the number of shots and the precision of the results is also statistical and subjected to quantum uncertainty. A fluctuation of the precision cannot be avoided even if all individual shot numbers are strictly aligned to the same value. Thus, we do not see the need to take extra efforts trying to level off the individual shot numbers of all the cost functions so that they equal exactly to each other. (In fact, this can simply be accomplished by discarding the data of some of the shots. But like we said, it seems unnecessary so that our experiments do not include this treatment.)

8 Discussion

In summary, we demonstrated experimentally that while our improved approach does not reduce the theoretical complexity of gradient calculation or the total number of measurements required, it manages to increase the number of gradients that can be computed in each single job, breaking the limit that can be reached using Qiskit’s “append” function. More importantly, it has a smaller circuit depth and requires less classical registers for storing the measurement results when comparing with the conventional approach using Qiskit’s “compose” function. This reduces the time spent for compiling the quantum circuit significantly. Therefore, the total runtime of the program can also be saved, especially when the number of input data is high and the structure of the quantum circuit is complicated.

Though we only demonstrated our improved approach for VQCs where each parameterized quantum gate

U (θ)

satisfies

(22)

U (θ + s) = U (θ) U (s)

(i.e., the gates covered by the original parameter-shift rule [6]), it can also be modified to compute the gradients for any other quantum gate that requires the general parameter-shift rules [7]. For example, suppose that the first parameterized gate

R Y (θ 1)

on the qubit

q 1

in Fig. 3 is replaced by a more general gate

U ′ (θ 1)

that does not satisfy Eq. (22). To compute

U ′ (θ + s)

, all we need is to replace the controlled-RY gate

C R Y (s)

(controlled by

q b

and acting on

q 1

) in the first block right behind

R Y (θ 1)

in Fig. 3 by two controlled-gates

(U ′ (θ 1)) − 1

and

U ′ (θ 1 + s)

in serial, both of which are also controlled by

q b

and acting on

q 1

. Here

(U ′ (θ 1)) − 1

is the reverse operation of

U ′ (θ 1)

such that

(U ′ (θ 1)) − 1 U ′ (θ 1) = I

. We can see that once the measurement on

q b

before them (i.e., the measurement operator in the first block of Fig. 3) results in

| 1 ⟩ q b

, then these two gates will be activated. Combining with the original

U ′ (θ 1)

gate, the complete operation on the qubit

q 1

will be

U ′ (θ 1 + s) (U ′ (θ 1)) − 1 U ′ (θ 1) = U ′ (θ 1 + s)

. Thus, the shift of such a general parameterized quantum gate can also be computed with our approach.

Our approach also suggests the following improvement on real quantum computers. From Fig. 3 we can see that the purpose of the first two operations in each block (the controlled-RY gate

C R Y (γ j)

and the measurement) is to turn the state of

q b

into

| 1 ⟩ q b

with a certain probability. If there is a classical random number generator (even if it generates pseudorandom numbers only), it can accomplish the same task while further reducing the circuit depth. Also, the next two quantum controlled gates make use of the states

| 0 ⟩

and

| 1 ⟩

q a

and

q b

only, without needing their quantum superpositions. They can be replaced by classical IF gates (i.e., controlled gates that take classical control bits instead of control qubits, while the target registers can be either quantum or classical) and both

q a

and

q b

can be classical registers too, so that less quantum resource is required. Unfortunately, classical random number generator and the IF gate are currently unavailable on real quantum hardware (the latter is available on the simulator though). Our approach shows that there is the need for adding these operations in the future, so that the performance of real quantum computers for certain tasks can be further improved.

9 Appendix A: Additional proof on the correctness of our approach

As illustrated in Fig. 3, each of the additional block added in the quantum circuit of our improved approach starts with a controlled-RY gate which activates only if the qubit

q a

is in the state

| 1 ⟩ q a

, and ends with a CX gate before the reset operation, which flips the state of the qubit

q a

| 0 ⟩ q a

. Consequently, only one of the additional blocks can be activated in each shot at the most. Therefore, to prove that our circuit is equivalent to the conventional one for computing the shifted cost function with regard to a specific adjustable parameter

θ i

, the key component is to prove that the

C R Y (± s)

gate indeed introduces the correct shift to

θ i

, i.e., the validity of Eq. (4), which is proven below.

When using the original RealAmplitudes ansatz, the operations on any qubit are either CX gates or RY gates. As a result, the amplitudes of the state of any qubit initialized as

| 0 ⟩

will always remain real. That is, the general form of a qubit

q i

at any stage of the ansatz can be written as

(A.1)

| ψ ⟩ = cos ⁡ α | 0 ⟩ + sin ⁡ α | 1 ⟩ = (cos ⁡ α sin ⁡ α) .

With this notation, the matrix form of the

R Y (θ i)

gate is

(A.2)

R Y (θ i) = (cos ⁡ θ i 2 − sin ⁡ θ i 2 sin ⁡ θ i 2 cos ⁡ θ i 2) .

This is because

(A.3)

R Y (θ i) | ψ ⟩ = (cos ⁡ θ i 2 − sin ⁡ θ i 2 sin ⁡ θ i 2 cos ⁡ θ i 2) (cos ⁡ α sin ⁡ α) = (cos ⁡ θ i 2 cos ⁡ α − sin ⁡ θ i 2 sin ⁡ α sin ⁡ θ i 2 cos ⁡ α + cos ⁡ θ i 2 sin ⁡ α) = (cos ⁡ (α + θ i 2) sin ⁡ (α + θ i 2)) = cos ⁡ (α + θ i 2) | 0 ⟩ + sin ⁡ (α + θ i 2) | 1 ⟩,

i.e., it rotates the original angle

α

of the qubit by

θ i / 2

In our improved circuit, once the block corresponding to the parameter

θ i

is activated, the

C R Y (± s)

gate serves as an

R Y (± s)

gate applied on the qubit

q i

with the matrix form

(A.4)

R Y (± s) = (cos ⁡ ± s 2 − sin ⁡ ± s 2 sin ⁡ ± s 2 cos ⁡ ± s 2) .

Combining with the existing

R Y (θ i)

gate in the original circuit, the total effect of the operations on the qubit

q i

(A.5)

R Y (± s) R Y (θ i) = (cos ⁡ ± s 2 − sin ⁡ ± s 2 sin ⁡ ± s 2 cos ⁡ ± s 2) ⋅ (cos ⁡ θ i 2 − sin ⁡ θ i 2 sin ⁡ θ i 2 cos ⁡ θ i 2) = (cos ⁡ θ i ± s 2 − sin ⁡ θ i ± s 2 sin ⁡ θ i ± s 2 cos ⁡ θ i ± s 2) = R Y (θ i ± s) .

Thus, Eq. (4) is proven. Applying them on the state

| ψ ⟩

in Eq. (A.1) yields

(A.6)

R Y (± s) R Y (θ i) | ψ ⟩ = (cos ⁡ θ i ± s 2 − sin ⁡ θ i ± s 2 sin ⁡ θ i ± s 2 cos ⁡ θ i ± s 2) (cos ⁡ α sin ⁡ α) = cos ⁡ (α + θ i ± s 2) | 0 ⟩ + sin ⁡ (α + θ i ± s 2) | 1 ⟩ .

That is, the one and only activated block in our circuit in effect turns the existing

R Y (θ i)

gate in the original circuit into an

R Y (θ i ± s)

gate, so that it is equivalent to the conventional circuit for computing the shifted cost function where the adjustable parameter

θ i

is shifted to

θ i ± s

10 Appendix B: Proving that $λ$ in Eq. (12) is non-trivial

In the RealAmplitudes ansatz, let

Q

denote the total number of the qubits. When

Q

is fixed, increasing

n

means increasing the number of repetition

r

in this ansatz. This is because the ansatz starts with a parameterized RY gate for each of the

Q

qubits, then each repetition adds another parameterized RY gate to each qubit. Thus, the total number of the parameters satisfies

(B.1)

n = Q (r + 1) .

Suppose that the circuit depth of each additional repetition (including the RY gates and the CX gates in the entanglement section) is

Δ

. Then

(B.2)

Λ = Λ 0 + r Δ = Λ 0 + (n Q − 1) Δ = (Λ 0 − Δ) + n Δ Q,

where

Λ 0

denotes the depth of the other parts of the circuit which is not included in

Δ

, e.g., the feature map, the observable, the final measurement, and the very first RY gate on each qubit within the ansatz. In this case, we yield

(B.3)

λ = (Λ 0 − Δ) + n Δ Q n = Λ 0 − Δ n + Δ Q .

A neural network that is useful in practice generally has a large number of parameters. When

n → ∞

we have

(B.4)

lim n → ∞ λ = Δ / Q .

Thus it is proven that

λ

stays non-trivial.

11 Appendix C: Program details

The programs for the 8-dimensional input data and the Iris dataset using quantum simulator were run on a personal laptop with a

2.8

GHz quad-core Intel i7-3840QM processor and

16

1600

MHz DDR3 memory.

The programs for the 784-dimensional input data using quantum simulator were run on a Tianhe-2 supercomputer node with a

2.2

GHz Intel Xeon E5-2692 v2 processor and

64

GB memory.

In these simulations, Qiskit’s “qasm_simulator” backend was used as the quantum simulator.

The programs using real hardware were run on the “ibmq_quito” 5-qubit backend of IBM Quantum Experience online platform.

The operating environment for all programs is Python 3.8.10 with Qiskit 0.21.2.

When running on quantum simulator, both

t c

(the time spent on compiling the quantum circuit) and

t r

(the time spent on running the circuit) are recorded automatically by the programs. But when running on real quantum hardware,

t r

is obtained from the job output files downloaded from IBM Quantum Experience platform instead. This is because the value of

t r

recorded by the programs in this case also includes the time spent on waiting in the queue for the job to be run, which is not the correct measure for the performance of the approaches being used.

In both approaches, the circuits for different input data are stacked together using Qiskit’s “compose” function.

The initial values of all adjustable parameters of the VQCs were generated randomly (uniformly distributed over the interval

[0, π)

) and stored beforehand. Then we always loaded these same values into every computer programs in every run.

More details can be found in the file “file_description.pdf” accompanies with the source codes (see Data and Code Availability section).

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]	M. Cerezo , G. Verdon , H. Y. Huang , L. Cincio , and P. J. Coles , Challenges and opportunities in quantum machine learning, Nat. Comput. Sci. 2(9), 567 (2022)

[2]	D. E. Rumelhart , G. E. Hinton , and R. J. Williams , Learning representations by back-propagating errors, Nature 323(6088), 533 (1986)

[3]	M. A. Nielson, Neural Networks and Deep Learning, Determination Press, 2015

[4]	A. Abbas,R. King,H. Y. Huang,W. J. Huggins,R. Movassagh,D. Gilboa,J. R. McClean, On quantum backpropagation, information reuse, and cheating measurement collapse, arXiv: 2023)

[5]	J. Bowles,D. Wierichs,C. Y. Park, Backpropagation scaling in parameterised quantum circuits, arXiv: 2023)

[6]	M. Schuld , V. Bergholm , C. Gogolin , J. Izaac , and N. Killoran , Evaluating analytic gradients on quantum hardware, Phys. Rev. A 99(3), 032331 (2019)

[7]	D. Wierichs , J. Izaac , C. Wang , and C. Y. Y. Lin , General parameter-shift rules for quantum gradients, Quantum 6, 677 (2022)

[8]

[9]

[10]

[11]	A. Peruzzo , J. McClean , P. Shadbolt , M. H. Yung , X. Q. Zhou , P. J. Love , A. Aspuru-Guzik , and J. L. O’Brien , A variational eigenvalue solver on a photonic quantum processor, Nat. Commun. 5(1), 4213 (2014)

[12]	E. Grant , M. Benedetti , S. Cao , A. Hallam , J. Lockhart , V. Stojevic , A. G. Green , and S. Severini , Hierarchical quantum classifiers, npj Quantum Inf. 4, 65 (2018)

[13]	M. Benedetti , E. Lloyd , S. Sack , and M. Fiorentini , Parameterized quantum circuits as machine learning models, Quantum Sci. Technol. 4(4), 043001 (2019)

[14]	R. LaRose and B. Coyle , Robust data encodings for quantum classifiers, Phys. Rev. A 102(3), 032420 (2020)

[15]	M. Cerezo , A. Arrasmith , R. Babbush , S. C. Benjamin , S. Endo , K. Fujii , J. R. McClean , K. Mitarai , X. Yuan , L. Cincio , and P. J. Coles , Variational quantum algorithms, Nat. Rev. Phys. 3(9), 625 (2021)

[16]	A. Abbas , D. Sutter , C. Zoufal , A. Lucchi , A. Figalli , and S. Woerner , The power of quantum neural networks, Nat. Comput. Sci. 1(6), 403 (2021)

[17]	D. Arthur,P. Date, A hybrid quantum-classical neural network architecture for binary classification, arXiv: 2022)

[18]	Y. Du , T. Huang , S. You , M. H. Hsieh , and D. Tao , Quantum circuit architecture search for variational quantum algorithms, npj Quantum Inf. 8, 62 (2022)

[19]	R. Chen , Z. Guang , C. Guo , G. Feng , and S. Y. Hou , Pure quantum gradient descent algorithm and full quantum variational eigensolver, Front. Phys. (Beijing) 19(2), 21202 (2024)

[20]	X. D. Xie , Z. Y. Xue , and D. B. Zhang , Variational quantum algorithms for scanning the complex spectrum of non-Hermitian systems, Front. Phys. (Beijing) 19(4), 41202 (2024)

[21]	X. Li and Z. Q. Yin , Improve variational quantum eigensolver by many-body localization, Front. Phys. (Beijing) 20(2), 023202 (2025)

[22]	P. J. J. O’Malley , R. Babbush , I. D. Kivlichan , J. Romero , J. R. McClean , et al. Scalable quantum simulation of molecular energies, Phys. Rev. X 6(3), 031007 (2016)

[23]	A. Kandala , A. Mezzacapo , K. Temme , M. Takita , M. Brink , J. M. Chow , and J. M. Gambetta , Hardware-eﬃcient variational quantum eigensolver for small molecules and quantum magnets, Nature 549(7671), 242 (2017)

[24]	C. Cîrstoiu , Z. Holmes , J. Iosue , L. Cincio , P. J. Coles , and A. Sornborger , Variational fast forwarding for quantum simulation beyond the coherence time, npj Quantum Inf. 6, 82 (2020)

[25]	Y. Kim , A. Eddins , S. Anand , K. X. Wei , E. van den Berg , S. Rosenblatt , H. Nayfeh , Y. Wu , M. Zaletel , K. Temme , and A. Kandala , Evidence for the utility of quantum computing before fault tolerance, Nature 618(7965), 500 (2023)

[26]	Y. Li and S. C. Benjamin , Eﬃcient variational quantum simulator incorporating active error minimization, Phys. Rev. X 2, 021050 (2017)

[27]	S. Endo , J. Sun , Y. Li , S. C. Benjamin , and X. Yuan , Variational quantum simulation of general processes, Phys. Rev. Lett. 125(1), 010501 (2020)

[28]	E. Farhi,J. Goldstone,S. Gutmann, A quantum approximate optimization algorithm, arXiv: 2014)

[29]	L. Zhou , S. T. Wang , S. Choi , H. Pichler , and M. D. Lukin , Quantum approximate optimization algorithm: Performance, mechanism, and implementation on near-term devices, Phys. Rev. X 10(2), 021067 (2020)

[30]	E. Farhi , J. Goldstone , S. Gutmann , and L. Zhou , The quantum approximate optimization algorithm and the Sherrington−Kirkpatrick model at infinite size, Quantum 6, 759 (2022)

[31]	K. Terashi , M. Kaneda , T. Kishimoto , M. Saito , R. Sawada , and J. Tanaka , Event classification with quantum machine learning in high-energy physics, Comput. Softw. Big Sci. 5(1), 2 (2021)

[32]	A. Gianelle , P. Koppenburg , D. Lucchesi , D. Nicotra , E. Rodrigues , L. Sestini , J. de Vries , and D. Zuliani , Quantum machine learning for b-jet charge identification, J. High Energy Phys. 2022(8), 14 (2022)

[33]	M. Islam , M. Chowdhury , Z. Khan , and S. M. Khan , Hybrid quantum-classical neural network for cloud-supported in-vehicle cyberattack detection, IEEE Sens. Lett. 6(4), 6001204 (2022)

[34]	M. Y. Küçükkara , F. Atban , and C. Bayilmiş , Quantum-neural network model for platform independent DDoS attack classification in cyber security, Adv. Quantum Technol. 7(10), 2400084 (2024)

[35]	D. Emmanoulopoulos,S. Dimoska, Quantum machine learning in finance: Time series forecasting, arXiv: 2022)

[36]	E. A. Cherrat , S. Raj , I. Kerenidis , A. Shekhar , B. Wood , J. Dee , S. Chakrabarti , R. Chen , D. Herman , S. Hu , P. Minssen , R. Shaydulin , Y. Sun , R. Yalovetzky , and M. Pistoia , Quantum deep hedging, Quantum 7, 1191 (2023)

[37]

[38]

1998)

[39]	L. Deng , The MNIST database of handwritten digit images for machine learning research, IEEE Signal Process. Mag. 29(6), 141 (2012)

[40]	R. A. Fisher , The use of multiple measurements in taxonomic problems, Ann. Hum. Genet. 7, 179 (1936)

[41]	J. R. McClean , S. Boixo , V. N. Smelyanskiy , R. Babbush , and H. Neven , Barren plateaus in quantum neural network training landscapes, Nat. Commun. 9(1), 4812 (2018)

[42]	E. Grant , L. Wossnig , M. Ostaszewski , and M. Benedetti , An initialization strategy for addressing barren plateaus in parametrized quantum circuits, Quantum 3, 214 (2019)

[43]	S. Wang , E. Fontana , M. Cerezo , K. Sharma , A. Sone , L. Cincio , and P. J. Coles , Noise-induced barren plateaus in variational quantum algorithms, Nat. Commun. 12(1), 6961 (2021)

[44]	S. H. Sack , R. A. Medina , A. A. Michailidis , R. Kueng , and M. Serbyn , Avoiding barren plateaus using classical shadows, PRX Quantum 3(2), 020365 (2022)