To work on trying to find novel quantum algorithms is to take a long lonely trip where progress is measured in failures. Whenever one of the brave souls (or group of souls) embarks on such a journey, and comes back with treasure that looks like a new quantum algorithm, we should all do the quantum algorithm dance and celebrate the returning heroes. It looks to me like we have recently had just such an event with the appearance of the preprint “Optimization by Decoded Quantum Interferometry” by Stephen P. Jordan, Noah Shutty, Mary Wootters, Adam Zalcman, Alexander Schmidhuber, Robbie King, Sergei V. Isakov, and Ryan Babbush [arXiv:2408.08292]. They call this algorithm Decoded Quantum Interferometry, or DQI (not to be confused with the name of the quantum computing division of the American Physical Society, but yes it will be confused). I was lucky enough to sit in the same office as Stephen, so I got to watch as he and the team made the discovery of the algorithm. The Optimizer has blogged about the results of this algorithm, but I'm not a theoretical computer scientist, so what I'm most interested in is “how does it work”. In other words, what is the actual quantum algorithm, and what does it actually do?
The general problem space that DQI works on is optimization. In optimization problems one wants to maximize some function over a combinatorially large domain. That is, for a function $f$ one wants to find an $x$ such that $f(x) \geq f(y)$ for all $y \neq x$. Generally finding the maximum is very hard. If I give you $f$ as a black box, i.e. I don't tell you how $f$ was generated, it's very clear that in the worst case one might have to search over the entire space of inputs to the function. More practically, however, we are often given some succinct description of $f$, for example it might be a mathematical function like a polynomial, and we can ask how hard it is to find the maximum given this description. It turns out that even then the problem is hard; this leads us to complexity classes like NP and friends. Even more practically, one could also loosen the requirement a bit: what if one wants to find a value $x$ such that $f(x)$ is as close to the maximum as possible? It turns out that when one does this, there are certain regimes where it wouldn't be crazy for quantum algorithms to outperform classical algorithms. This is the target of the DQI algorithm.
Let’s “derive” DQI.
If you have a function $f$ you want to optimize, one thing you might think to do on a quantum computer is to prepare the state $$|f\rangle := \frac{1}{\sqrt{\mathcal F}} \sum_x f(x) |x\rangle$$The sum is over some large space of inputs to the function, for example we could imagine that the sum is over bitstrings of length $n$ (so the sum is over $2^n$ terms). For simplicity we assume the values $f$ produces are real numbers. Here ${\mathcal F}$ is a normalization factor. If one could produce this state, then the probability of observing $x$ when measuring this state is $${\rm prob}(x) = \frac{ |f(x)|^2}{\mathcal F}$$This doesn't seem too impressive, but let's stick with it. In particular, we do notice that the probability of producing $x$ is higher where $|f(x)|$ is higher, so producing such a state gives us the ability to bias towards the values that maximize or minimize $f$.
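To make this concrete, here is a tiny classical sketch (plain numpy, a made-up toy $f$ on 3-bit strings, nothing from the paper): it just builds the probabilities $|f(x)|^2/{\mathcal F}$ and samples from them, showing the bias towards large $|f(x)|$.

```python
import numpy as np

# A tiny classical illustration (toy f on 3-bit strings): measuring |f>
# samples x with probability |f(x)|^2 / F, which biases the samples towards
# the x where |f(x)| is large.
n = 3
xs = np.arange(2 ** n)
f = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 0.0, 2.5, 1.5])   # toy values of f(x)

F = np.sum(np.abs(f) ** 2)            # the normalization factor F
probs = np.abs(f) ** 2 / F            # measurement probabilities
samples = np.random.choice(xs, size=10_000, p=probs)

# x = 3 maximizes |f| and shows up most often in the samples.
print(probs)
print(np.bincount(samples, minlength=2 ** n) / 10_000)
```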
We want to produce a state like $|f\rangle$ and we are using a quantum computer, so somehow we want to arrange it so that the quantum computer directs positive interference-like effects to the places where $f$ is higher and negative interference-like effects to the places where $f$ is lower. We don't have a lot of interfering-type matrices in quantum computing (or rather all generic unitaries are like this, but structured unitaries that we can get a handle on are rarer), so let's just assume that the unitary which produces this is the $n$-qubit Hadamard transform. It is its own inverse, so we can say that if we start in the state $$|\hat{f}\rangle := \frac{1}{\sqrt{\mathcal F} 2^{n/2}} \sum_{x,y \in {\mathbb Z}_2^n} (-1)^{x \cdot y} f(x) |y\rangle$$where $x \cdot y = x_1 y_1 + \dots + x_n y_n ~{\rm mod}~2$, then applying $H^{\otimes n}$ to this state will produce the desired $|f\rangle$. We've changed the problem of preparing $|f\rangle$ to preparing another state, $|\hat{f}\rangle$.
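If you want to convince yourself of this Hadamard relation numerically, here is a small check in numpy (a classical linear-algebra sanity check with the same arbitrary toy $f$ as above, not a quantum circuit):

```python
import numpy as np
from itertools import product

# Classical sanity check: applying the n-qubit Hadamard to |f_hat> as defined
# above reproduces |f>, for an arbitrary toy f.
n = 3
H1 = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
H = H1
for _ in range(n - 1):
    H = np.kron(H, H1)                               # H^{otimes n}

f = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 0.0, 2.5, 1.5])
f_state = f / np.linalg.norm(f)                      # |f>

bits = np.array(list(product([0, 1], repeat=n)))     # all x in {0,1}^n
signs = (-1) ** (bits @ bits.T % 2)                  # matrix of (-1)^{x.y}
f_hat = signs @ f / (np.linalg.norm(f) * 2 ** (n / 2))   # |f_hat>

print(np.allclose(H @ f_hat, f_state))               # True
```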
What to do next? First stare at this state. What could the $f(x)$ possibly be that would make $|\hat{f}\rangle$ easy to prepare? Well one obvious thing would be if we had $$f(x) = (-1)^{b \cdot x}$$where $b \in {\mathbb Z}_2^n$. In that case, $|\hat{f}\rangle = |b\rangle$. In other words, if we can prepare $|b\rangle$ (which is trivial), then we can create $|f\rangle$ with $f(x) = (-1)^{b \cdot x}$. Now this isn't too exciting a function to maximize. And even worse, since all the values of $f$ are $\pm 1$, when we prepare $|f\rangle$ and measure it we get all bit strings with equal probability. Boo.
But like all things in algorithm research, let's keep chugging along. Since $|b\rangle$ gave us an $f$ we could analyze, a natural thing to think about is what happens if you could produce a superposition of inputs. If we start with $\frac{1}{\sqrt{2}}(|b_1\rangle + |b_2\rangle)$, where $b_i \in {\mathbb Z}_2^n$, then we see that we can produce $|f\rangle$ with $$f(x) = (-1)^{b_1 \cdot x} + (-1)^{b_2 \cdot x}$$Now this state is a bit more interesting, in that it doesn't have equal probabilities when we measure in the computational basis. Sometimes $b_1 \cdot x = b_2 \cdot x$, in which case the amplitudes add up, and other times $b_1 \cdot x \neq b_2 \cdot x$, in which case the amplitudes cancel.
Another observation is that we could change which amplitudes cancel and which add by preparing $\frac{1}{\sqrt{2}}((-1)^{v_1}|b_1\rangle + (-1)^{v_2}|b_2\rangle)$ instead of $\frac{1}{\sqrt{2}}(|b_1\rangle + |b_2\rangle)$, where $v_i \in {\mathbb Z}_2$. Now the function is $$f(x) = (-1)^{v_1 + b_1 \cdot x} + (-1)^{v_2+b_2 \cdot x}.$$
Generalizing a bit more, if we can prepare the state $$\frac{1}{\sqrt{m}} \sum_{j=1}^{m} (-1)^{v_j} |b_j\rangle$$ then we produce an $|f\rangle$ state with $$f(x) = \sum_{j=1}^m (-1)^{b_j \cdot x + v_j}.$$ Note that the way we have written things we can essentially ignore normalization factors, because the final normalization factor ${\mathcal F}$ in our definition takes care of things. For instance we could have written $f(x) = \alpha \sum_{j=1}^m (-1)^{b_j \cdot x + v_j}$ for some non-zero $\alpha$ and this would also correspond to the same state; ${\mathcal F}$ takes care of the proper overall normalization.
Great! But where have we gotten ourselves? Well one thing we can note is that the $f$ we have produced has a nice interpretation. Consider the set of $m$ linear equations mod $2$ $$\begin{align}b_1 \cdot x = v_1 \\ b_2 \cdot x = v_2 \\ \dots \quad \\ b_m \cdot x = v_m \end{align}$$Then $f(x) = \sum_{j=1}^m (-1)^{b_j \cdot x + v_j}$ counts the number of these simultaneous mod-$2$ linear equations that are satisfied minus the number that are violated (if $s$ of the $m$ equations are satisfied, $f(x) = 2s - m$), so maximizing $f$ is the same as maximizing the number of satisfied equations. This should feel to you more like a traditional optimization problem. In particular it might remind you of $3$SAT. In $3$SAT one is given a set of $m$ clauses, each clause being a boolean expression made up of an OR of 3 boolean variables or their negations. The goal of the $3$SAT problem is to determine if there is a satisfying assignment where each clause evaluates to TRUE. The optimization version of this problem is to find a value of the boolean variables that maximizes the number of clauses that evaluate to TRUE. This is called max-$3$SAT. The problem we have reduced to is like this, except the clauses are now mod-$2$ equations. For this reason this problem is called max-XORSAT.
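Here is a quick numerical check of that interpretation, on a small random toy instance (the instance and all names are made up for illustration): $f(x)$ equals twice the number of satisfied equations minus $m$.

```python
import numpy as np

# Toy max-XORSAT instance: m mod-2 equations b_j . x = v_j.  The objective
# f(x) = sum_j (-1)^{b_j . x + v_j} equals (#satisfied) - (#unsatisfied)
# = 2*(#satisfied) - m, so maximizing f maximizes the satisfied count.
rng = np.random.default_rng(1)
m, n = 6, 4
B = rng.integers(0, 2, size=(m, n))   # rows are the b_j
v = rng.integers(0, 2, size=m)

def f(x):
    return np.sum((-1) ** ((B @ x + v) % 2))

def num_satisfied(x):
    return np.sum((B @ x) % 2 == v)

x = rng.integers(0, 2, size=n)
assert f(x) == 2 * num_satisfied(x) - m
print(f(x), num_satisfied(x))
```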
But ok, now what have we really shown? There are objections. The first objection is that we haven't talked about the complexity of preparing the state $\frac{1}{\sqrt{m}} \sum_{j=1}^{m} (-1)^{v_j} |b_j\rangle$. The second objection is that we only see each $x$ with probability proportional to $|f(x)|^2$; is this enough to find the $x$ that maximizes $f(x)$, or at least one that gives an $f(x)$ that is a good approximation to the maximal value?
Let's talk about the second problem first. What if, instead of preparing $|f\rangle$, we try to prepare $$|f^2\rangle := \frac{1}{\sqrt{{\mathcal F}_2}} \sum_x f(x)^2 |x\rangle.$$ Here ${\mathcal F}_2$ is a normalization factor. This will result in $x$ with probability proportional to $|f(x)|^4$, i.e. we will have an even greater probability of seeing the maximal value.
Going with this idea, let's check what happens when we apply $H^{\otimes n}$ to $|f^2\rangle$ just as we did for $|f\rangle$ (so we can figure out what to prepare in order to produce $|f^2\rangle$ by $n$-qubit Hadamard-ing). Let's use the $f(x)$ we described above for max-XORSAT. Then $$f(x)^2 = \left ( \sum_{j=1}^m (-1)^{v_j + b_j \cdot x} \right)^2= \sum_{j_1,j_2=1}^m (-1)^{v_{j_1} +v_{j_2} + (b_{j_1} + b_{j_2}) \cdot x}.$$ We can split this sum into the diagonal terms ($j_1 = j_2$) and the off-diagonal terms. Each diagonal term has an even exponent, $2(v_j + b_j \cdot x) = 0~{\rm mod}~2$, so it contributes $1$, and we can express this as $$f(x)^2= m + \sum_{j_1 \neq j_2 = 1}^m (-1)^{v_{j_1} +v_{j_2} + (b_{j_1} + b_{j_2}) \cdot x}$$
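A small numerical sanity check of this identity on a random toy instance (illustrative only):

```python
import numpy as np
from itertools import product

# Check of the identity above:
# f(x)^2 = m + sum_{j1 != j2} (-1)^{v_{j1} + v_{j2} + (b_{j1} + b_{j2}) . x}.
rng = np.random.default_rng(2)
m, n = 5, 4
B = rng.integers(0, 2, size=(m, n))   # rows are the b_j
v = rng.integers(0, 2, size=m)

for x in product([0, 1], repeat=n):
    x = np.array(x)
    f = np.sum((-1) ** ((B @ x + v) % 2))
    off_diag = sum(
        (-1) ** ((v[j1] + v[j2] + (B[j1] + B[j2]) @ x) % 2)
        for j1 in range(m) for j2 in range(m) if j1 != j2
    )
    assert f ** 2 == m + off_diag
print("identity holds for every x")
```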
From this we can now calculate the state which, if we Hadamard it, gives us $|f^2\rangle$. This is $$|\hat{f^2}\rangle=\frac{1}{ 2^{n/2} \sqrt{ {\mathcal F}_2}}\sum_{x, y \in {\mathbb Z}_2^n} f(x)^2 (-1)^{x \cdot y} | y\rangle.$$ We can individually calculate the Fourier transform of the two terms in our expression for $f(x)^2$ above, but we do need to be careful to note the relative normalization between these terms. Up to an overall constant $\alpha$, the result is $$\alpha \left(m|0\rangle + \sum_{j_1 \neq j_2=1}^{m} (-1)^{v_{j_1}+v_{j_2}} |b_{j_1}+b_{j_2}\rangle \right)$$where the addition in the ket is done bitwise modulo $2$.
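And here is an amplitude-level check, again purely classical and on a made-up toy instance, that Hadamard-ing the (unnormalized) state above really does give amplitudes proportional to $f(x)^2$:

```python
import numpy as np
from itertools import product

# Classical simulation of the target state (not a circuit): build the
# unnormalized state  m|0> + sum_{j1 != j2} (-1)^{v_{j1}+v_{j2}} |b_{j1}+b_{j2}>,
# apply H^{otimes n}, and confirm the amplitudes are proportional to f(x)^2.
rng = np.random.default_rng(6)
m, n = 5, 4
B = rng.integers(0, 2, size=(m, n))
v = rng.integers(0, 2, size=m)

def idx(bits):                           # bit string -> integer index
    return int("".join(map(str, bits)), 2)

state = np.zeros(2 ** n)
state[0] += m                            # the |0> piece, with weight m
for j1 in range(m):
    for j2 in range(m):
        if j1 != j2:
            s = (B[j1] + B[j2]) % 2      # b_{j1} + b_{j2} mod 2
            state[idx(s)] += (-1) ** ((v[j1] + v[j2]) % 2)

H1 = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
H = H1
for _ in range(n - 1):
    H = np.kron(H, H1)
out = H @ state

f_sq = np.array([np.sum((-1) ** ((B @ np.array(x) + v) % 2)) ** 2
                 for x in product([0, 1], repeat=n)])
print(np.allclose(out * 2 ** (n / 2), f_sq))   # True (up to normalization)
```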
Great, so we've….gotten ourselves a new set of states we want to be able to produce. We can be a great lyric writer and make a list $$\begin{align} &|0\rangle \\ \sum_{j=1}^{m} &(-1)^{v_j} |b_j\rangle \\ \sum_{j_1 \neq j_2=1}^{m} & (-1)^{v_{j_1}+v_{j_2}}|b_{j_1}+b_{j_2}\rangle \end{align}$$ Our list is of non-normalized states. How might we prepare such states (and superpositions of such states)? To do this we recall the inverse trick. Suppose you can compute the function $g(x)$ from $k$ bits to $l$ bits and also the inverse $g^{-1}(x)$ from $l$ bits to $k$ bits. We further assume that we are working over a domain where the inverse exists. In other words suppose you have quantum circuits that act as $U_g|x\rangle |y\rangle = |x\rangle |y\oplus g(x)\rangle$ and $U_{g^{-1}} |x\rangle|y\rangle = |x\rangle |y \oplus g^{-1}(x)\rangle$ where $\oplus$ is bitwise addition and the first register has $k$ bits and the second $l$ bits. Then if one starts with a superposition over $|x\rangle$ one can convert it to a superposition over $g$ evaluated at the points in the superposition, i.e. $$U_{g^{-1}}^\updownarrow U_g \left( \frac{1}{\sqrt{N}} \sum_x |x\rangle \right) |0\rangle = |0\rangle \left( \frac{1}{\sqrt{N}} \sum_x |g(x)\rangle \right)$$ Here $\updownarrow$ means that we apply the $U$ with the registers flipped. This is the inverse trick: if we can efficiently compute the function and its inverse, then we can easily go from a superposition over inputs to a superposition over the function applied to those inputs.
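Here is a toy amplitude-level illustration of the inverse trick, for an arbitrary made-up bijection $g$ on 2-bit strings (a classical simulation of the state vectors, not a circuit construction):

```python
import numpy as np

# Inverse trick on 2-bit registers (k = l = 2) for a toy bijection g.
# U_g : |x>|y> -> |x>|y xor g(x)>; the "flipped" U_{g^{-1}} : |x>|y> ->
# |x xor g^{-1}(y)>|y>.  Starting from a uniform superposition over x with
# the second register at |0>, we end with |0> in the first register and a
# superposition over g(x) in the second.
N = 4
g = [2, 0, 3, 1]                       # an arbitrary bijection on {0,1,2,3}
g_inv = [g.index(i) for i in range(N)]

def U_g(state):                        # |x>|y> -> |x>|y xor g(x)>
    out = np.zeros_like(state)
    for x in range(N):
        for y in range(N):
            out[x * N + (y ^ g[x])] = state[x * N + y]
    return out

def U_ginv_flipped(state):             # |x>|y> -> |x xor g_inv(y)>|y>
    out = np.zeros_like(state)
    for x in range(N):
        for y in range(N):
            out[(x ^ g_inv[y]) * N + y] = state[x * N + y]
    return out

psi = np.zeros(N * N)
for x in range(N):
    psi[x * N + 0] = 0.5               # (1/2) sum_x |x>|0>

psi = U_ginv_flipped(U_g(psi))

target = np.zeros(N * N)
for x in range(N):
    target[0 * N + g[x]] = 0.5         # |0> (x) (1/2) sum_x |g(x)>
print(np.allclose(psi, target))        # True
```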
Let's now return to our list (good thing we made a list). The first state $|0\rangle$ is pretty easy to make. Let's look at the second and third states without the phase terms, $\sum_{j=1}^m |b_j\rangle$ and $\sum_{j_1 \neq j_2=1}^m |b_{j_1}+b_{j_2}\rangle$. Now comes the trick that gives us the title. Take the $b_j$s and put them each as rows into a matrix. This matrix has $m$ rows and $n$ columns. We unimaginatively call this $m$ by $n$ matrix $B$. We can think of $B^T$ as the parity check matrix for a linear error correcting code. In particular we can think about this as a code whose codewords are $m$ bits long. We haven't said anything about the rate of this code, i.e. how many bits we are encoding, or about the distance of the code. But given an $m$-bit string we can apply $B^T$ to it and produce a syndrome. The syndrome is the set of bits we can read out which, for a good code, will allow us to determine exactly what error occurred to the information we encoded into the code.
Under this microscope, that $B$ defines the syndromes for a linear error correcting code, what are the states in our list? First of all, $|0\rangle$ is the syndrome when no error has occurred. But now look at $\sum_{j=1}^m |b_j\rangle$. Each term is the syndrome for the code if there were exactly one bit-flip error. And if we look at $\sum_{j_1 \neq j_2=1}^m |b_{j_1}+b_{j_2}\rangle$, each term is the syndrome for the code if there were exactly two bit-flip errors. Return now to the inverse trick. Suppose that we prepare the superposition over all single bit-flip errors, $$\frac{1}{\sqrt{m}} (|100\dots00\rangle + |010\dots00\rangle+ \cdots + |000\dots01\rangle)$$or written another way $\sum_{x \in {\mathbb Z}_2^m, |x|=1} |x\rangle$ where $|x|$ here is the Hamming weight (count of the number of $1$s in $x$). And further suppose that $B$ defines a code which can correctly and efficiently correct single bit errors. Then we can use the inverse trick to turn this state into $\sum_{j=1}^m |b_j\rangle$. Similarly if $B$ defines a code that can correct two bit-flip errors, then we can take a superposition over all binary strings with two $1$s and, using the inverse trick, convert that into $\sum_{j_1 \neq j_2=1}^m |b_{j_1}+b_{j_2}\rangle$, the superposition over syndromes for all two-bit errors on the code.
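A quick classical check of the syndrome bookkeeping, with a random toy $B$ (here the syndrome of an $m$-bit error pattern $e$ is computed as $eB \bmod 2$, equivalently $B^T e$):

```python
import numpy as np
from itertools import combinations

# Treating the rows b_j of B as defining syndromes: a weight-1 error on bit j
# has syndrome b_j, and a weight-2 error on bits j1, j2 has syndrome
# b_{j1} + b_{j2} (mod 2).  Toy illustration, not any particular code.
rng = np.random.default_rng(3)
m, n = 6, 4
B = rng.integers(0, 2, size=(m, n))     # rows are the b_j

def syndrome(e):
    return (e @ B) % 2                   # e is a length-m error pattern

for j in range(m):
    e = np.zeros(m, dtype=int); e[j] = 1
    assert np.array_equal(syndrome(e), B[j])

for j1, j2 in combinations(range(m), 2):
    e = np.zeros(m, dtype=int); e[j1] = e[j2] = 1
    assert np.array_equal(syndrome(e), (B[j1] + B[j2]) % 2)
print("weight-1 and weight-2 syndromes match")
```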
OK so what do we have? Suppose $B$ is the syndrome matrix for an $m$-bit code with $n$-bit syndromes. Further suppose that this code has a distance $d$, so that it can correct $t=\lfloor (d-1)/2 \rfloor$ bit flips, and that we can classically decode $t$ errors in polynomial time. Then we can take the superposition of bit strings with $t$ $1$s and $m-t$ $0$s and, using the inverse trick, produce the superposition over error syndromes corresponding to $t$ errors. Ignoring normalization this is $$ \sum_{x \in {\mathbb Z}_2^m, |x|=t}|x\rangle \rightarrow \sum_{j_1 < j_2 < \cdots < j_t} |b_{j_1}+b_{j_2}+\dots+b_{j_t}\rangle.$$
OK so now we need to prepare the states $\sum_{x \in {\mathbb Z}_2^m, |x|=t}|x\rangle$. These are so-called Dicke states, which were first studied in quantum optics and have since shown up in a variety of other settings in quantum computing. It turns out that they are easy to prepare quantum mechanically; for a recent paper on how to do this see “Short-Depth Circuits for Dicke State Preparation” by Andreas Bärtschi and Stephan Eidenbenz [arXiv:2207.09998] (a not so great way to do this is to use the inverse Schur transform; do we get to finally say that the Schur transform is useful for quantum algorithms now? If yes, I think John Preskill owes me a beer)
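Just to pin down what these states are, here is a sketch that writes down the amplitudes of a Dicke state directly (a state vector, not the short-depth preparation circuit from the cited paper):

```python
import numpy as np
from itertools import product
from math import comb

# The Dicke state |D_k> on m qubits is the uniform superposition over all
# m-bit strings of Hamming weight k.  This just writes down its amplitudes.
def dicke_state(m, k):
    amps = np.zeros(2 ** m)
    for i, bits in enumerate(product([0, 1], repeat=m)):
        if sum(bits) == k:
            amps[i] = 1.0
    return amps / np.sqrt(comb(m, k))

psi = dicke_state(4, 2)
print(np.isclose(np.linalg.norm(psi), 1.0))   # normalized
print(int(np.count_nonzero(psi)))             # C(4,2) = 6 nonzero amplitudes
```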
So we've seen that, if we interpret $B$ as a syndrome matrix, then we can take Dicke states and convert them to superpositions over syndromes corresponding to a given number of errors. It's pretty easy to also see that we can apply the phases corresponding to the $(-1)^{v_j}$ terms for the states we want to prepare as well; these are just $Z$ gates applied to the Dicke states, where we only apply $Z$ on the bits where $v_j=1$. Finally we notice that the state we needed to prepare for $|\hat{f^2}\rangle$ was a superposition over two different numbers of errors, $0$ errors or $2$ errors. To handle this we note that when decoding the syndrome, we can also calculate the number of errors. So if we start with $ \alpha |0\rangle + \beta|2\rangle$, and then conditionally create the appropriate Dicke states in a separate register, $$\alpha|0\rangle |D_0\rangle + \beta |2\rangle |D_2\rangle$$ where $|D_k\rangle$ is the appropriate Dicke state, then when we perform the uncompute step we can also uncompute that first register, and end up in the appropriate superposition of syndromes with the correct normalization (the $\alpha$ and $\beta$ we want).
Pretty cool! While we've just worked out the simple case of $|f^2\rangle$, we can see how this generalizes. In particular, a given $B$ will have some distance $d$ and can correct $t=\lfloor (d-1)/2 \rfloor$ bit flips. Then we can produce the state $|f^l\rangle$ for $l \leq t$ by generalizing the above steps, assuming that we can efficiently decode up to $l$ bit errors in the code. The way we do this is to first analytically express $f^l$ as a sum of terms which are sums over syndromes with $0, 1, \dots, l$ errors. This in general will give us some weights $w_k$ that we explicitly know. We then start by preparing $\sum_{k=0}^l w_k |k\rangle$. We then append an $m$-bit register and conditionally create the Dicke state whose number of $1$s is given by the first register, $$\sum_{k=0}^{l} w_k |k\rangle |D_k\rangle.$$ Then we apply the appropriate $Z$s to the bits where $v_j=1$. Finally we use the inverse trick to clear the first two registers: compute the syndrome from the error register into a fresh register, then run the decoder, which takes a syndrome and produces the error along with its weight, to uncompute the error and weight registers.
The above is a general description of what the DQI algorithm does, i.e. the quantum part. But how well does it work, and does it solve max-XORSAT more efficiently than the best classical algorithms? At this point I should tell you that I've lied. There are a couple of ways I have lied. The first is that in general, instead of preparing a state with amplitudes $f^l$, the DQI paper describes preparing a state whose amplitudes are an arbitrary degree-$l$ polynomial in $f$. This is useful because it allows one to identify an “optimal” polynomial. The second lie is a bit bigger, in that there is another problem that is central in the DQI paper, and that is the max-LINSAT problem. While max-XORSAT has clauses that are mod-$2$ equations, the max-LINSAT problem generalizes this. Instead of using ${\mathbb Z}_2$ this problem uses ${\mathbb Z}_p$ where $p$ is a prime (not exponentially big in the problem size, so small). One can then look at codes over ${\mathbb Z}_p$. If we let $f_j$ denote a function from ${\mathbb Z}_p$ to $\pm 1$, then the function you want to maximize in the max-LINSAT problem is $$f(x)=\sum_{j=1}^m f_j( b_j \cdot x)$$ where $b_j \in {\mathbb Z}_p^n$ and the expression inside the parentheses is evaluated with mod-$p$ arithmetic. The paper describes how to produce the $|f^l\rangle$ state (and similarly for an arbitrary polynomial). Again the main trick is to consider the decoding problem for the code with syndrome matrix made up of the $b_j$s, but now over ${\mathbb Z}_p$ instead of ${\mathbb Z}_2$.
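To make the max-LINSAT objective concrete, here is a toy evaluation of it in numpy (the sizes, the instance, and the choice of which values map to $+1$ are all made up for illustration):

```python
import numpy as np

# Toy max-LINSAT objective over Z_p.  Each clause j has a vector b_j in Z_p^n
# and a function f_j : Z_p -> {+1,-1}; the objective is
# f(x) = sum_j f_j(b_j . x mod p).
rng = np.random.default_rng(4)
p, m, n, r = 5, 6, 3, 2
B = rng.integers(0, p, size=(m, n))
# For each clause, pick r of the p values to map to +1 (the rest map to -1).
plus_sets = [set(rng.choice(p, size=r, replace=False)) for _ in range(m)]

def objective(x):
    vals = (B @ x) % p
    return sum(1 if vals[j] in plus_sets[j] else -1 for j in range(m))

x = rng.integers(0, p, size=n)
print(objective(x))
```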
Given that these two lies have been corrected, what can we say about how well this algorithm works? If in the max-LINSAT problem $r$ of the $p$ possible values for each $f_j$ map to $+1$ (and the other $p-r$ map to $-1$), then the authors are able to show a very nice asymptotic formula. Recall that $m$ is the number of clauses, $p$ is the size of the base field, $r$ is the just-defined amount of bias in the $f_j$ towards $+1$, and $l$ is the number of errors up to which you can efficiently decode the code. Then the expected fraction of satisfied clauses when $m \rightarrow \infty$ follows a beautiful formula, which the authors call the semicircle law $$ \frac{\langle s \rangle}{m} = \left( \sqrt{\frac{l}{m} \left( 1 - \frac{r}{p} \right)} + \sqrt{\frac{r}{p} \left( 1 - \frac{l}{m} \right) } \right)^2$$
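The semicircle law is easy to play with as a function of the two ratios; here is a tiny helper (just the formula above, nothing more):

```python
import numpy as np

# The semicircle law quoted above, as a function of the ratios l/m and r/p.
def dqi_satisfied_fraction(l_over_m, r_over_p):
    return (np.sqrt(l_over_m * (1 - r_over_p))
            + np.sqrt(r_over_p * (1 - l_over_m))) ** 2

# Sanity checks: with no decoding power (l/m -> 0) you get the trivial r/p
# fraction, and the formula caps out at 1.
print(dqi_satisfied_fraction(0.0, 0.5))   # 0.5
print(dqi_satisfied_fraction(0.5, 0.5))   # 1.0
```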
Great, so there is a good understanding of the performance of the algorithm; how does it compare to the classical world? One needs to define the parameters one is looking at and how the problem instances are generated, and there are two arenas the authors look at. The first is the max-XORSAT case. In that case, the authors are able to show that the quantum algorithm satisfies a higher number of clauses than simulated annealing for random sparse instances of the problem. That's great, simulated annealing is a top performing optimization strategy and even getting in the same ballpark is very hard. But it turns out that the authors were also too clever and found a classical heuristic that can outperform DQI in the same regime. Note that this isn't the end of the story; it may be that there are regimes where DQI does outperform all classical polynomial time algorithms. And in all fairness, the DQI result is a provable expected number of satisfied clauses, whereas simulated annealing is a heuristic, so the authors have set a high bar for themselves!
Instead of looking at random instances, one could also look at more structured cases. Here the authors show that if one uses the Reed-Solomon code, then the max-LINSAT problem becomes equivalent to another problem, the Optimal Polynomial Intersection problem, and in this case DQI beats all known classical algorithms. The Optimal Polynomial Intersection problem is as follows. Suppose you are given $p-1$ subsets of ${\mathbb Z}_p$, $F_1, F_2, \dots, F_{p-1}$. Your goal is to find a degree-$(n-1)$ polynomial $Q$ over ${\mathbb Z}_p$ that maximizes the number of intersections with these subsets, i.e. the number of times $Q(1) \in F_1, Q(2) \in F_2, \dots$. This problem has been studied before in the cryptography literature, and if one converts DQI for Reed-Solomon codes to this problem it seems that the quantum algorithm can produce more intersections than the best known classical algorithm. The margin is actually quite high: for instance, if the ratio of $m$ to $n$ is $10$, the quantum algorithm achieves an expected fraction $0.7179\dots$ of intersections, whereas the best known classical algorithm achieves $0.55$. Exciting!
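And here is a toy evaluation of the Optimal Polynomial Intersection objective (illustrative instance only), just to make the problem statement concrete:

```python
import numpy as np

# Toy Optimal Polynomial Intersection objective: given subsets F_1,...,F_{p-1}
# of Z_p, count how many points i have Q(i) in F_i for a degree-(n-1)
# polynomial Q with coefficients in Z_p.
rng = np.random.default_rng(5)
p, n = 7, 3
F = [set(rng.choice(p, size=rng.integers(1, p), replace=False))
     for _ in range(p - 1)]

def intersections(coeffs):              # coefficients of Q, lowest degree first
    count = 0
    for i in range(1, p):
        Qi = sum(c * pow(i, k, p) for k, c in enumerate(coeffs)) % p
        count += Qi in F[i - 1]
    return count

Q = rng.integers(0, p, size=n)          # a random degree-(n-1) polynomial
print(intersections(Q))
```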
Hey mom, thanks for reading this far! Everyone stand up and do the quantum algorithms dance!
But more seriously, definitely check out the paper. There are nice connections between this work and prior work; two that stand out are the work on lattices by Regev, Aharonov, and Ta-Shma and the more recent work by Yamakawa, Zhandry, Tillich, and Chailloux. What I like most about the work is that it feels very different from the more traditional places where we have seen “exponential” speedups (those quotes are because we do not know how to prove many real separations, quantum or classical; for all we know factoring could be in classical polynomial time. The quotes don't detract from what is shown here, which is the gold standard of what people these days call “provable” exponential advantage. I personally find that nomenclature off because it puts the word “provable” too close to “exponential”. YMMV if you aren't a grumpy old dude like me). Another part I like is that it shows the power of creating a superposition over structured objects. We know, for instance, that being able to prepare the superposition over all of the permutations of a graph would lead to a polynomial time quantum algorithm for graph isomorphism. We don't know how to do that preparation, whereas here we do know how to produce the superposition.
And finally I like this problem because it has a nice sort of structure for thinking of new algorithms. One can first look at other codes, and there is the nice semicircle law to guide you. Another thing one can do is think about how this generalizes to other unitaries beyond the Fourier transform: what are these optimizing for? And finally one can also try the general method of taking an objective function, performing work on this function (squaring, convolving, etc.), and thinking about whether that can yield useful new quantum algorithms. Lots to work with, and I'm excited to see where this all goes (and to see if the classical world can catch up to the quantum for the Optimal Polynomial Intersection problem. Go quantum!)
p.s. Stephen Jordan is talking at the excellent Simons Institute Quantum Colloquium tomorrow about DQI. If you aren’t following these colloquiums, you should! They are all most excellent and they include a panel afterwards with experts who then talk about the talk. This captures the feeling of a real conference, where a lot of the best part is the follow up conversation in the hallways.