So far we've been using gradient descent to modify model parameters to increase accuracy. But we can also use gradient descent to modify model inputs—for example, to answer questions like "what would I have to change about this input to obtain a different classifier output?" Here's an extremely simple example. Suppose we take two inputs from section 1 and assign them labels:
xA = np.array([[1, 0, 1, 0, 0, 0]]).T
yA = 0

xB = np.array([[0, 1, 0, 1, 0, 0]]).T
yB = 1

We train a logistic regression model on this dataset, and obtain the following weight vector:
theta = np.array([[-1, 1, -1, 1, 0, 0]]).T

Remember that the output of a logistic regression classifier has the form $f(x) = \sigma(\theta^\top x)$. 2.A) What is $\nabla_\theta f(x)$? (Remember that $\frac{\partial}{\partial z} \sigma(z) = \sigma(z) (1 - \sigma(z))$. Don't forget the chain rule!) In which direction does the gradient point? 2.B) What is $\nabla_x f(x)$? In which direction does the gradient point? 2.C) The code below implements gradient descent on $x_A$ or $x_B$ for this problem. In other words, we start by providing $x_A$ or $x_B$ as input to the model, then change the input in order to increase the probability of a different target label. What do the modified inputs look like? If you provide a large enough value to iters to change the classification decision, do inputs look more like examples from the input class or the target class? 2.D) Next, we'll apply the same technique to a digit classification problem. Instead of a logistic regression model, using a two-layer neural network with a softmax output to perform 10-way classification (digits range from 0 to 9). This model should get about 95% accuracy on the test set. Just like before, we'll start with an input from one class, and modify the image to increase the probability that our model labels it as some other class. See what happens for different values (between 0 and 9) of input_class and target_class. (i) Can you find examples where the classifier output changes? Do the modified images look more like examples from the original class or the target class? (ii) Is this classifier reliable? How do you reconcile these results with the reported test accuracy? 2.E) Modified inputs designed to trick machine learning models, like the ones you produced in the previous section, are known as adversarial examples. They exist for text, images, and even real-world objects: (Ian Goodfellow et al., 2014) (Robin Jia et al., 2017) (Shang-Tse Chen et al., 2018) Making machine learning models robust to these kinds of attacks is a major open area of research. What are some real-world consequences of the existence of adversarial examples like the ones above?

### Discussion Guide

• 2.A) sigma(theta.dot(x)) * (1 - sigma(theta.dot(x))) * x. Points in the direction of x.
• 2.B) sigma(theta.dot(x)) * (1 - sigma(theta.dot(x))) * theta. Points in the direction of theta.
• 2.C) If you increase iters enough to change the classification decision, inputs look more like the target class than the input class.
• 2.D) (i) Classifier output usually changes with very high confidence, but the images look almost indistinguishable from the original versions. (ii) The modified images aren't from the training distribution, so the 95% number can still be true. But it should be extremely surprising that images so similar to training images can confuse the classifier so much.
• 2.E) Some examples: stickers on stop signs that cause cars to crash; undetectable spam emails and comments, violent / adult images bypassing automatic content filtres.