I’ve been reading about adversarial attacks and wanted to implement something very simple.
*The simplest adversarial attack: attacking a linear model* by Pierre Ablin helped me do just that and this blog post adapts a lot of material from there—check it out!

## Setup

Suppose we have a binary classifier that has been trained on input-output pairs $(\boldsymbol{x}_1, \boldsymbol{y}_1), \ldots, (\boldsymbol{x}_n, \boldsymbol{y}_n)$, with $\boldsymbol{x}_i \in \mathbb{R}^p$ and $\boldsymbol{y}_i \in \lbrace -1, 1 \rbrace$.
Our goal is to take *base* input $\boldsymbol{x}_\mathrm{b}$ and transform it to *poison* input $\boldsymbol{x}_\mathrm{p}$ such that

- its class would be guessed incorrectly by the classifier
- $\boldsymbol{x}_\mathrm{p}$ would remain similar to $\boldsymbol{x}_\mathrm{b}$

If $\boldsymbol{x}$’s represent images, then “similar” could mean that any introduced changes are visually imperceptible. For example, if we are dealing with two classes of digits, say 3’s and 7’s, from the MNIST database ($p = 28 \times 28 = 784$), then our goal would be to take an image of a ‘7’ and make it look like a ‘3’ to the classifier but keep it looking like a ‘7’ to the human, and vice versa.

We will assume a classifier that makes predictions based on the sign of a linear regression model: $$ y = \text{sign}(\boldsymbol{w}^\top \boldsymbol{x} + \boldsymbol{b}) $$ with $\boldsymbol{w} \in \mathbb{R}^p$ and $\boldsymbol{b} \in \mathbb{R}$.

Thus, the decision boundary of the classifier is a hyperplane $\Pi$ with equation $$ \boldsymbol{w}^\top \boldsymbol{x} + \boldsymbol{b} = 0 $$ i.e. any point $\boldsymbol{x} \in \Pi$ must satisfy the above equation.

## Strategy

To have an input $\boldsymbol{x}_\mathrm{p}$ classified differently than $\boldsymbol{x}_\mathrm{b}$, it must lie on the opposite side of the hyperplane^{1}.
To ensure visually imperceptible changes, a sensible strategy would be to move from $\boldsymbol{x}_\mathrm{b}$ as little as possible, i.e. perpendicularly towards the plane.

It might be useful to think of going from $\boldsymbol{x}_\mathrm{b}$ to $\boldsymbol{x}_\mathrm{p}$ in terms of the shortest distance from the hyperplane $\Pi$ to $\boldsymbol{x}_\mathrm{p}$. We can use vector $\boldsymbol{d}$ to denote that:

We can then express $\boldsymbol{x}_\mathrm{p}$ in the following way: $$ \boldsymbol{x}_\mathrm{p} = \boldsymbol{x}_\mathrm{b} - \alpha \boldsymbol{d} $$ with $\alpha \in \mathbb{R}$.

This is useful because

- $\alpha < 1$ means $\boldsymbol{x}_\mathrm{p}$ is on the same side as $\boldsymbol{x}_\mathrm{b}$, and the attack is unsuccessful
- $\alpha = 1$ means $\boldsymbol{x}_\mathrm{p}$ is on hyperplane $\Pi$
- $\alpha > 1$ means $\boldsymbol{x}_\mathrm{p}$ is on the opposite side to $\boldsymbol{x}_\mathrm{b}$, and the attack is successful

However, we should express $\boldsymbol{d}$ in terms of quantities we already know.

Notice that $\boldsymbol{w}$ is perpendicular to the hyperplane and so is in the same (or opposite) direction as $\boldsymbol{d}$. We can thus say that $\boldsymbol{d} = c \boldsymbol{w}$ (with $c \in \mathbb{R}$) and so $$ \boldsymbol{x}_\mathrm{p} = \boldsymbol{x}_\mathrm{b} - \alpha c \boldsymbol{w} $$

Further, we can find $c$ by considering the case of $\alpha = 1$, when $\boldsymbol{x}_\mathrm{p}$ is on the hyperplane $\Pi$ and thus must satisfy its equation: $$ \boldsymbol{w}^\top (\boldsymbol{x}_\mathrm{b} - 1 c \boldsymbol{w}) + \boldsymbol{b} = 0 $$

Solving for $c$ gives $$ c = \frac{\boldsymbol{w}^\top \boldsymbol{x}_\mathrm{b} + \boldsymbol{b}}{\boldsymbol{w}^\top \boldsymbol{w} } $$

and so $$ \boldsymbol{x}_\mathrm{p} = \boldsymbol{x}_\mathrm{b} - \alpha \frac{\boldsymbol{w}^\top \boldsymbol{x}_\mathrm{b} + \boldsymbol{b}}{\boldsymbol{w}^\top \boldsymbol{w} } \boldsymbol{w} $$

## Results

After achieving 95% test accuracy with the linear classifier, I attempted to evaluate the attack’s effectiveness. Such a simple strategy for manipulating the inputs yielded great results—I was definitely surprised. Here is how an image of a digit ‘7’ changes as we increase $\alpha$ (remember, $\alpha > 1$ is already enough for misclassification, though the higher the $\alpha$, the more confident the classifier will be of its incorrect decision):

Personally, I didn’t notice any changes until $\alpha = 5$ (more than enough for misclassification!). But if I hadn’t observed the original image, I wouldn’t even know that something is wrong. Only at around $\alpha = 50$ do we see some funny stuff (mostly in the background pixels), where one may suspect that the images are being manipulated.

The strategy of moving perpendicularly towards the decision boundary made sense, but it wasn’t at all obvious to me that the resulting poisoned image wouldn’t look drastically different.
Of course, this whole “attack” assumed a lot of knowledge and power to change things on the attacker’s side, but it is still interesting why something like this could potentially work so well.
It could be that my genuine surprise comes from looking at the 2D representations of the hyperplane in the diagrams above and interpreting movement in any direction as incredibly consequential.
Indeed, some existing works speculate that the effectiveness of adversarial attacks is partly due to high dimensionality of the models (even a linear classifier for MNIST has $785$ parameters), but, so far, I haven’t found *truly* satisfactory explanations.
I’ll have to keep looking, but please let me know if you are aware of any relevant literature!

## Bonus chicanery

Setting $\alpha$ to negative values means we are moving *away* from the hyperplane.
That makes a ‘7’ look even more like a ‘7’ to the classifier.
Does it to *you*?

I will consider only the base inputs that were classified correctly in the first place. ↩︎