A Quantum distance-based classifier (part 1)

Robert Wezeman, TNO


$$ \newcommand{\ket}[1]{\left|{#1}\right\rangle} $$

Introduction

Consider the following scatter plot of the first two features for two species in the famous Iris flower data set:

Notice that just two features, the sepal width and the sepal length, already divide the two Iris species into different regions of the plot. This gives rise to the question: given only the sepal length and sepal width of a flower, can we classify the flower by its correct species? This type of problem, also known as statistical classification, is a common problem in machine learning. In general, a classifier is constructed by letting it learn a function which gives the desired output based on a sufficient amount of data. This is called supervised learning, as the desired outputs (the labels of the data points) are known. After learning, the classifier can classify an unlabeled data point based on the learned function. The quality of a classifier improves with the size of the training data set it can learn from. The true power of this quantum classifier becomes clear when it is used on extremely large data sets.

In this notebook we describe how to build a distance-based classifier on the Quantum Inspire using amplitude encoding. It turns out that, once the system is initialized in the desired state, the actual algorithm consists of only three operations, one Hadamard gate and two measurements, regardless of the size of the training data. This has huge implications for the scalability of the algorithm to large data sets. Using only 4 qubits, we show how to encode two data points of different classes and predict the label for a third data point. We demonstrate how to use the Quantum Inspire SDK with QASM code, and we also provide the code to obtain the same results in the ProjectQ framework.


Problem

We define the following binary classification problem: Given the data set $$\mathcal{D} = \Big\{ ({\bf x}_1, y_1), \ldots ({\bf x}_M , y_M) \Big\},$$

consisting of $M$ data points ${\bf x}_i\in\mathbb{R}^n$ and corresponding labels $y_i\in \{-1, 1\}$, give a prediction for the label $\tilde{y}$ corresponding to an unlabeled data point $\bf\tilde{x}$. The classifier we shall implement with our quantum circuit is a distance-based classifier and is given by \begin{equation}\newcommand{\sgn}{{\rm sgn}}\newcommand{\abs}[1]{\left\lvert#1\right\rvert}\label{eq:classifier} \tilde{y} = \sgn\left(\sum_{m=0}^{M-1} y_m \left[1-\frac{1}{4M}\abs{{\bf\tilde{x}}-{\bf x}_m}^2\right]\right). \hspace{3cm} (1)\end{equation}

This is a typical $M$-nearest-neighbor model, where each data point is given a weight related to its distance from the point to be classified. To implement this classifier on a quantum computer, we need a way to encode the information of the training data set in a quantum state. We do this by first encoding the training data in the amplitudes of a quantum system; the amplitudes are then manipulated by quantum gates such that we obtain a result representing the above classifier. Encoding input features in the amplitudes of a quantum system is known as amplitude encoding.
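To make the classification rule concrete before turning to the quantum implementation, here is a minimal classical sketch of equation (1) in NumPy. The names (`classify`, `train_x`, `train_y`, `test_x`) are illustrative choices, not code from this notebook; the data points are the ones used later on, with the class labels as assigned later in this notebook.

```python
import numpy as np

def classify(test_x, train_x, train_y):
    """Evaluate the distance-based classifier of equation (1).

    All points are assumed to be normalised to unit length, as
    required later by the amplitude encoding.
    """
    M = len(train_y)
    weights = [1 - np.sum((test_x - x_m) ** 2) / (4 * M) for x_m in train_x]
    return int(np.sign(np.sum(np.asarray(train_y) * np.asarray(weights))))

# The two training points and the test point used later in this notebook:
train_x = np.array([[0.9193, 0.3937], [0.1411, 0.9899]])
train_y = np.array([-1, 1])
test_x = np.array([0.8670, 0.4984])
print(classify(test_x, train_x, train_y))  # -1: the test point is closest to x_1
```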


Amplitude encoding

Suppose we want to encode a classical vector ${\bf x}\in\mathbb{R}^N$ in the amplitudes of a quantum system. We assume $N=2^n$ and that ${\bf x}$ is normalised to unit length, meaning ${\bf{x}^T{x}}=1$. We can encode ${\bf x}$ in the amplitudes of an $n$-qubit system in the following way \begin{equation} {\bf x} = \begin{pmatrix}x^0 \\ \vdots \\ x^{N-1}\end{pmatrix} \Longleftrightarrow{} \ket{\psi_{{\bf x}}} = \sum_{i=0}^{N-1}x^i\ket{i}, \end{equation} where $\ket{i}$ is the $i^{th}$ state of the computational basis $\left\{\ket{0\ldots0},\ldots,\ket{1\ldots1}\right\}$.

By applying an efficient quantum algorithm (resources growing polynomially in the number of qubits $n$), one can manipulate the $2^n$ amplitudes super-efficiently, that is, with resources of order $\mathcal{O}\left(\log N\right)$. This follows as manipulating all amplitudes requires operations on only the $n = \mathcal{O}\left(\log N\right)$ qubits. For an algorithm to be truly super-efficient, the phase in which the data is encoded must also require at most polynomially many operations in the number of qubits. The idea of quantum memory, sometimes referred to as quantum RAM (QRAM), is a particularly interesting one in this respect. Suppose we first run some quantum algorithm, for example in quantum chemistry, with some quantum states as output. If these states could be fed directly into a quantum classifier, the encoding phase would no longer be needed. Finding efficient data-encoding schemes is still a topic of active research. We restrict ourselves here to the implementation of the algorithm; more details can be found in the references.
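As a small illustration of amplitude encoding, the NumPy sketch below interprets a normalised vector of $N = 2^n$ features directly as the amplitudes of an $n$-qubit state. The vector is an arbitrary example, not data from this notebook.

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 4.0])   # N = 4 features, so n = 2 qubits
x = x / np.linalg.norm(x)            # normalise to unit length, x^T x = 1

# Each entry x^i becomes the amplitude of computational-basis state |i>.
for i, amplitude in enumerate(x):
    print(f"|{i:02b}>: {amplitude:.4f}")

print(np.sum(x ** 2))  # squared amplitudes sum to 1, a valid quantum state
```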

The algorithm requires the $n$-qubit quantum system to be in the following state \begin{equation}\label{eq:prepstate} \ket{\mathcal{D}} = \frac{1}{\sqrt{2M}} \sum_{m=0}^{M-1} \ket{m}\Big(\ket{0}\ket{\psi_{\bf\tilde{{x}}}} + \ket{1}\ket{\psi_{\bf{x}_m}}\Big)\ket{y_m}.\hspace{3cm} (2) \end{equation} Here $\ket{m}$ is the $m^{th}$ state of the computational basis, used to keep track of the $m^{th}$ training input. The second register is a single ancillary qubit entangled with the third register: the excited state of the ancillary qubit is entangled with the $m^{th}$ training state $\ket{\psi_{{\bf x}_m}}$, while the ground state is entangled with the new input state $\ket{\psi_{\bf\tilde{x}}}$. The last register encodes the label of the $m^{th}$ training data point by \begin{equation} \begin{split} y_m = -1 \Longleftrightarrow& \ket{y_m} = \ket{0},\\ y_m = 1 \Longleftrightarrow& \ket{y_m} = \ket{1}. \end{split} \end{equation} Once in this state, the algorithm consists of only the following three operations:

  1. Apply a Hadamard gate on the second register to obtain

    $$\frac{1}{2\sqrt{M}} \sum_{m=0}^{M-1} \ket{m}\Big(\ket{0}\ket{\psi_{\bf\tilde{x}+x_m}} + \ket{1}\ket{\psi_{\bf\tilde{x}-x_m}}\Big)\ket{y_m},$$

    where $\ket{\psi_{\bf\tilde{{x}}\pm{x}_m}} = \ket{\psi_{\tilde{\bf{x}}}}\pm \ket{\psi_{\bf{x}_m}}$.

  2. Measure the second qubit. If we measure $\ket{1}$ we restart the algorithm; we only continue if we are in the $\ket{0}$ branch. We continue the algorithm with probability $p_{acc} = \frac{1}{4M}\sum_{m=0}^{M-1}\abs{{\bf\tilde{x}}+{\bf x}_m}^2$; for standardised random data this is usually around $0.5$. The resulting state is given by

    \begin{equation} \frac{1}{2\sqrt{Mp_{acc}}}\sum_{m=0}^{M-1}\sum_{i=0}^{N-1} \ket{m}\ket{0}\left({\tilde{x}}^i + x_m^i\right)\ket{i}\ket{y_m}. \end{equation}

  3. Measure the last qubit $\ket{y_m}$. The probability that we measure outcome zero is given by \begin{equation} p(q_4=0) = \frac{1}{4Mp_{acc}}\sum_{m\mid y_m=-1}\abs{\bf{\tilde{{x}}+{x}_m}}^2. \end{equation}

In the special case where the amount of training data is equal for both labels, this last measurement relates to the classifier described in the previous section by \begin{equation} \tilde{y} = \left\{ \begin{array}{lr} -1 & : p(q_4 = 0 ) > p(q_4 = 1)\\ +1 & : p(q_4 = 0 ) < p(q_4 = 1) \end{array} \right. \end{equation} By setting $\tilde{y}$ to the most likely outcome over many measurement shots, we obtain the desired distance-based classifier.
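The three steps can be checked with a plain state-vector simulation. Below is a minimal NumPy sketch, assuming the system has already been prepared in state (2); the qubit ordering (index, ancilla, data, label) is an illustrative choice for the simulation, not necessarily the ordering used on the hardware.

```python
import numpy as np

x_tilde = np.array([0.8670, 0.4984])
training = [(np.array([0.9193, 0.3937]), 0),   # x_1, label qubit |0>, y = -1
            (np.array([0.1411, 0.9899]), 1)]   # x_2, label qubit |1>, y = +1
M = len(training)

# Build state (2); axes are (index m, ancilla, data, label).
psi = np.zeros((2, 2, 2, 2))
for m, (x_m, y_m) in enumerate(training):
    psi[m, 0, :, y_m] += x_tilde / np.sqrt(2 * M)
    psi[m, 1, :, y_m] += x_m / np.sqrt(2 * M)

# Step 1: Hadamard on the ancilla (axis 1).
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
psi = np.einsum('ab,mbdl->madl', H, psi)

# Step 2: measure the ancilla and keep only the |0> branch.
branch = psi[:, 0, :, :]
p_acc = np.sum(branch ** 2)
branch = branch / np.sqrt(p_acc)

# Step 3: measure the label qubit; outcome 0 corresponds to y = -1.
p0 = np.sum(branch[:, :, 0] ** 2)
print(f"p_acc = {p_acc:.3f}, p(q_4 = 0) = {p0:.3f}, prediction: {-1 if p0 > 0.5 else 1}")
```

For the data points used later in this notebook this yields the prediction $-1$, in agreement with a direct evaluation of equation (1).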


Data preprocessing

In the previous section we saw that amplitude encoding requires a normalised data set. Luckily, it is always possible to bring data to this desired form with some transformations. First, we standardise the data to have zero mean and unit variance; then we normalise each data point to have unit length. Both steps are common methods in machine learning. Effectively, the classification is then based only on the angle between data points.

To illustrate this procedure we apply it to the first two features of the famous Iris data set:
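A minimal sketch of these two steps, assuming scikit-learn is available for loading the Iris data set and for the standardisation:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features = iris.data[:100, :2]       # sepal length and width of two species
labels = 2 * iris.target[:100] - 1   # map the classes {0, 1} to {-1, +1}

# Step 1: standardise to zero mean and unit variance per feature.
features = StandardScaler().fit_transform(features)

# Step 2: normalise every data point to unit length.
features = features / np.linalg.norm(features, axis=1, keepdims=True)

print(features[:3])  # each row now lies on the unit circle
```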


Quantum algorithm

Now we can start with our quantum algorithm on the Quantum Inspire. We describe how to build the algorithm for the simplest case with only two data points, each with two features, that is, $M=N=2$. For this algorithm we need 4 qubits: one for the index register $\ket{m}$, one ancillary qubit, one to encode a data point, and one to encode the label $\ket{y_m}$.

From the data set described in the previous section we pick the following data set $\mathcal{D} = \big\{({\bf x}_1,y_1), ({\bf x}_2, y_2) \big\}$, where ${\bf x}_1 = (0.9193, 0.3937)$ with $y_1 = -1$ and ${\bf x}_2 = (0.1411, 0.9899)$ with $y_2 = 1$.

We are interested in the label $\tilde{y}$ for the data point ${\bf \tilde{x}} = (0.8670, 0.4984)$. The amplitude encodings of these data points are \begin{equation} \begin{split} \ket{\psi_{\bf\tilde{x}}} & = 0.8670 \ket{0} + 0.4984\ket{1}, \\ \ket{\psi_{\bf x_1}} & = 0.9193 \ket{0} + 0.3937\ket{1},\\ \ket{\psi_{\bf x_2}} & = 0.1411 \ket{0} + 0.9899\ket{1}. \end{split} \end{equation}

Before we can run the actual algorithm we need to bring the system into the desired initial state (equation 2), which can be obtained by applying the following combination of gates, starting from $\ket{0000}$.

First, Hadamard gates are applied to the first two qubits (the index register and the ancilla). After this step the system is in the state $$\ket{\mathcal{D}_A} = \frac{1}{2}\Big(\ket{0}+\ket{1}\Big)\Big(\ket{0}+\ket{1}\Big)\ket{0}\ket{0} $$

The data points are encoded using controlled $R_y$ rotations, where a rotation acts on the ground state as $$ R_y(\theta)\ket{0} = \cos\left(\frac{\theta}{2}\right)\ket{0} + \sin\left(\frac{\theta}{2}\right)\ket{1}.$$

Therefore, the angle needed to rotate to the state $\psi=a\ket{0} + b\ket{1}$ is given by $\theta = 2\sgn(b)\cos^{-1}(a)$. Quantum Inspire does not directly support controlled-$R_y$ gates; however, we can construct them from other gates, as shown in the figure below. In these pictures $k$ stands for the angle used in the $R_y$ rotation.
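As a quick check of this formula, the sketch below computes the rotation angles for the three states used in this notebook. The comments also show one standard textbook construction of a controlled $R_y(\theta)$ from two CNOTs and two rotations over $\pm\theta/2$; this is a generic decomposition, not necessarily the exact circuit used by Quantum Inspire.

```python
import numpy as np

def ry_angle(a, b):
    """Angle theta such that Ry(theta)|0> = a|0> + b|1> (real a, b)."""
    return 2 * np.sign(b) * np.arccos(a)

for name, (a, b) in [("x_tilde", (0.8670, 0.4984)),
                     ("x_1    ", (0.9193, 0.3937)),
                     ("x_2    ", (0.1411, 0.9899))]:
    print(f"theta for {name}: {ry_angle(a, b):+.4f} rad")

# A controlled Ry(theta) with control c and target t can be built as
#   Ry(theta/2) on t; CNOT c,t; Ry(-theta/2) on t; CNOT c,t
# since X Ry(theta) X = Ry(-theta): the rotations cancel when c = |0>
# and compose to Ry(theta) when c = |1>.
```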

Next, the test point ${\bf\tilde{x}}$ is encoded in the third qubit, conditioned on the ancilla being in state $\ket{0}$. After this step the system is in the state $$\ket{\mathcal{D}_B} = \frac{1}{2} \Big(\ket{0}+\ket{1}\Big)\Big(\ket{0}\ket{\tilde{{x}}}+\ket{1}\ket{0}\Big)\ket{0}$$

Then the first training point ${\bf x}_1$ is encoded, conditioned on the ancilla being $\ket{1}$ and the index qubit being $\ket{0}$. After this step the system is in the state $$\ket{\mathcal{D}_C} = \frac{1}{2}\Bigg(\ket{0}\Big(\ket{0}\ket{\tilde{{x}}} + \ket{1}\ket{{x_1}}\Big) + \ket{1}\Big(\ket{0}\ket{\tilde{{x}}} + \ket{1}\ket{0}\Big)\Bigg) \ket{0}$$

Similarly, ${\bf x}_2$ is encoded, conditioned on the ancilla being $\ket{1}$ and the index qubit being $\ket{1}$. After this step the system is in the state $$\ket{\mathcal{D}_D} = \frac{1}{2}\Bigg(\ket{0}\Big(\ket{0}\ket{\tilde{{x}}} + \ket{1}\ket{{x_1}}\Big) + \ket{1}\Big(\ket{0}\ket{\tilde{{x}}} + \ket{1}\ket{{x}_2}\Big)\Bigg) \ket{0}$$

Finally, flipping the label qubit conditioned on the index qubit being $\ket{1}$ (a CNOT) encodes the labels $y_1 = -1$ and $y_2 = 1$, and the system is in exactly the desired initial state of equation (2).

The actual algorithm

Once the system is in this initial state, the algorithm itself consists of only one Hadamard gate and two measurements. If the first measurement gives the result $\ket{1}$, we have to abort the algorithm and start over again. However, these results can also easily be filtered out in a post-processing step.

The circuit for the whole algorithm now looks like:

We can send our QASM code to the Quantum Inspire with the following data points:

\begin{equation} \begin{split} \ket{\psi_{\tilde{x}}} & = 0.8670 \ket{0} + 0.4984\ket{1},\\ \ket{\psi_{x_1}} & = 0.9193 \ket{0} + 0.3937\ket{1},\\ \ket{\psi_{x_2}} & = 0.1411 \ket{0} + 0.9899\ket{1}. \end{split} \end{equation}

We only consider the events where the second qubit equals 0, that is, we only consider the events in the set $$\{0000, 0001, 0100, 0101, 1000, 1001, 1100, 1101\}$$

The label $\tilde{y}$ is now given by

\begin{equation} \tilde{y} = \left\{ \begin{array}{lr} -1 & : \#\{0000, 0001, 0100, 0101\} > \#\{1000, 1001, 1100, 1101\}\\ +1 & : \#\{1000, 1001, 1100, 1101\} > \#\{0000, 0001, 0100, 0101\} \end{array} \right. \end{equation}
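A sketch of this post-processing, assuming a histogram of raw measurement outcomes in the bit ordering of the event strings above (label qubit leftmost, ancilla second from the right). The `counts` dictionary is made-up example data, not actual Quantum Inspire output.

```python
# Made-up example histogram: bit string -> number of shots.
counts = {'0000': 130, '0001': 125, '0100': 28, '0101': 25,
          '1000': 51, '1001': 48, '1100': 55, '1101': 50,
          '0010': 60, '1110': 38}

# Keep only events where the ancilla reads 0.
accepted = {k: v for k, v in counts.items() if k[2] == '0'}

n_minus = sum(v for k, v in accepted.items() if k[0] == '0')  # y = -1 events
n_plus = sum(v for k, v in accepted.items() if k[0] == '1')   # y = +1 events
print(-1 if n_minus > n_plus else 1)
```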

The following code randomly picks two training data points and a random test point for the algorithm. We can then compare the prediction of the label by the Quantum Inspire with the true label.
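A classical stand-in for this experiment is sketched below; it reuses the preprocessed `features` and `labels` arrays and the `classify` function from the sketches above, and replaces the call to Quantum Inspire by the classical evaluation of equation (1).

```python
import numpy as np

rng = np.random.default_rng()

# Pick one random training point of each class and a random test point.
idx_minus = rng.choice(np.where(labels == -1)[0])
idx_plus = rng.choice(np.where(labels == 1)[0])
idx_test = rng.choice(len(labels))

train_x = features[[idx_minus, idx_plus]]
train_y = labels[[idx_minus, idx_plus]]

prediction = classify(features[idx_test], train_x, train_y)
print(f"predicted: {prediction}, true label: {labels[idx_test]}")
```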

To get a better idea of how well this quantum classifier works, we can compare the predicted label to the true label of the test data point. Errors in the prediction can have two causes: either the quantum algorithm does not reproduce the prediction of classifier (1), or classifier (1) itself gives the wrong label for the selected data. In general, errors of the first type can be reduced by increasing the number of times we run the algorithm. In our case, as we work with the simulator and our gates are deterministic (no conditional gates), we do not have to deal with this first type of error if we use the true probability distribution. This can be done by using only a single shot without measurements.

Conclusion and further work

How well the quantum classifier performs hugely depends on the chosen data points. If the test data point is significantly closer to one of the two training data points, the classifier gives a clear one-sided prediction. In the other case, where the test data point has a similar distance to both training points, the classifier struggles to give a one-sided prediction. Repeating the algorithm on the same data points may then give different measurement outcomes. This type of error can be reduced by running the algorithm with more shots. In the examples above we only used the true probability distribution (as if we had used an infinite number of shots). By running the algorithm with 512 or 1024 shots instead, this erroneous behaviour can be observed. In the case of an infinite number of shots, we see that the quantum classifier gives the same prediction as classically expected.

The results of this toy example already show the potential of a quantum computer in machine learning. Because the actual algorithm consists of only three operations, independent of the size of the data set, it can become extremely useful for tasks such as pattern recognition on large data sets. The next step is to extend this toy model to more data features and a larger training data set to improve the prediction. As not all data sets are best classified by a distance-based classifier, implementations of other types of classifiers might also be interesting. For more information on this particular classifier, see the references below.


References

The same algorithm for the ProjectQ framework