Introduction
Regular ML: build a model from sample data $x$ in order to make predictions
Attack: add invisible noise $\eta$ to clean sample data to fool the model (make it produce a wrong prediction)
Adversarial Machine Learning: attempts to fool models by supplying deceptive input
A malfunctioning ML model can cause damage in many application areas.
- Why can ML models be fooled?
The precision $\epsilon$ of an individual input feature is limited, so noise below that precision ($\|\eta\|_\infty < \epsilon$) is both invisible to humans and indistinguishable by the hardware.
When the input passes through the model, the slight perturbation is processed by the activations, where it may be magnified and thus lead the model to malfunction.
Consider an adversarial example $\tilde{x} = x + \eta$. For a linear activation, $w^T\tilde{x} = w^Tx + w^T\eta$, so the adversarial perturbation causes the activation to grow by $w^T\eta$: the attacker wants to maximize $w^T\eta$ to fool the model,
while keeping $\|\eta\|_\infty \le \epsilon$ so that the change is hard to detect.
So we assign $\eta = \epsilon\,\mathrm{sign}(w)$: the activation grows by $\epsilon m n$, where $m$ is the average magnitude of the elements of the $n$-dimensional weight vector $w$.
Since $\|\eta\|_\infty$ stays fixed while the activation change grows linearly with $n$, even a simple linear model can have adversarial examples if its input has sufficient dimensionality.
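The sketch below (a minimal NumPy illustration, not part of the original notes; the dimensions and the value of $\epsilon$ are assumed) shows this effect numerically: $\eta = \epsilon\,\mathrm{sign}(w)$ keeps $\|\eta\|_\infty = \epsilon$ fixed, yet the activation shift $w^T\eta = \epsilon m n$ grows linearly with the input dimension $n$.

```python
# Minimal sketch: the worst-case perturbation eta = epsilon * sign(w)
# shifts the activation by epsilon * ||w||_1 ~ epsilon * m * n,
# which grows with dimensionality even though ||eta||_inf stays at epsilon.
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.01  # assumed per-feature precision limit

for n in (10, 1_000, 100_000):          # input dimensionality
    w = rng.normal(size=n)              # weights of a linear model
    eta = epsilon * np.sign(w)          # worst-case perturbation, ||eta||_inf = epsilon
    shift = w @ eta                     # activation growth: w^T(x + eta) - w^T x
    m = np.abs(w).mean()                # average magnitude of the elements of w
    print(f"n={n:>7}  shift={shift:9.3f}  epsilon*m*n={epsilon * m * n:9.3f}")
```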
How to attack an ML model
White-box Attack
An attacker has complete access to the ML model.
The main idea of a white-box attack is to generate adversarial examples.
Fast Gradient Sign Method (FGSM)
$J(w,x,y^{true})$: loss function
SGD: update weights to decrease the loss: $w \leftarrow w - \alpha\,\nabla_w J(w,x,y^{true})$
Attack: update the sample to increase the loss: $x^{adv} = x + \eta$
Perturbation: $\eta = \epsilon\,\mathrm{sign}\big(\nabla_x J(w,x,y^{true})\big)$
This algorithm is cheap because it needs only one gradient computation with respect to the input and never updates the weights or parameters $w$.
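A minimal FGSM sketch in PyTorch (an assumed framework and setup; `model`, the loss choice, and the $[0,1]$ input range are not specified in the notes):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y_true, epsilon):
    """Return x_adv = x + epsilon * sign(grad_x J(w, x, y_true))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_true)   # J(w, x, y_true)
    loss.backward()                            # gradient w.r.t. the input, not the weights
    x_adv = x + epsilon * x.grad.sign()        # one step in the gradient-sign direction
    return x_adv.clamp(0.0, 1.0).detach()      # keep features in an assumed [0, 1] range
```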
One-step target class method
In FGSM, the perturbation follows the gradient sign to increase the loss for the true class $y^{true}$.
Using a target class $y^{target}$ that is unlikely to be the correct class of $x$, we instead move against the gradient to decrease the loss for that class: $x^{adv} = x - \epsilon\,\mathrm{sign}\big(\nabla_x J(w,x,y^{target})\big)$.
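A sketch of the one-step target class variant, under the same assumptions as the FGSM sketch above:

```python
import torch
import torch.nn.functional as F

def one_step_target_class(model, x, y_target, epsilon):
    """Return x_adv = x - epsilon * sign(grad_x J(w, x, y_target))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_target)  # J(w, x, y_target)
    loss.backward()
    x_adv = x - epsilon * x.grad.sign()         # descend the loss toward the target class
    return x_adv.clamp(0.0, 1.0).detach()
```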
Black-box Attack
The attacker has no access to the model's parameters and only has some idea of the domain of interest (e.g., image classification). The number of queries to the system is limited.