Deeply Confused

by Guy Jacobson and Matt Gruskin

Answer: MICROWAVE
Problem: Arbor Day Town/Bloomsday Town

This puzzle is presented as a sequence of 23 square photographs, each with a resolution of 299 × 299 pixels. The solver can visually identify the contents of each image, the first letters of which form an acrostic:

IMPALA
NAIL
CASTLE
ENVELOPE
PEACOCK
TOASTER
IPOD
ORANGE
NECKLACE
VOLCANO
TROMBONE
HIPPOPOTAMUS
REFRIGERATOR
EEL
EGGNOG
IRON
MAGPIE
ABACUS
GONDOLA
ECHIDNA
NIPPLE
ESPRESSO
TRICYCLE

INCEPTION V THREE IMAGENET

The acrostic is a clue for the Inception v3 deep convolutional neural network, trained for the ImageNet Large Scale Visual Recognition Challenge. The input of this network is a 299 × 299 pixel image, and the output is a classification of the image into a set of 1000 classes.

The images presented in the puzzle can all be classified into one of the 1000 ImageNet classes. (This fact may help the solver to narrow down the possible identifications for each image, if they’re able to recognize that the identifications they’ve found so far are all present in the list of ImageNet classifications, before solving the complete acrostic.)

With a list of 299 × 299 images, and a clue suggesting the use of Inception v3, the next logical step is to see how Inception v3 classifies each image. There are several pre-trained copies of Inception v3 available online, and several frameworks with which they can be used:

https://transcranial.github.io/keras-js/#/inception-v3 is a website where the user can run Inception v3 classification on an image URL they supply.
https://www.tensorflow.org/tutorials/images/image_recognition is a TensorFlow tutorial for running Inception v3 classification locally, using either Python or C++.
https://keras.io/applications/ contains examples for running image classification locally using the Keras framework.
and many more . . .

When the solver tries to classify the images, they will find that Inception v3 believes the most likely class for each image is completely different than the image’s visual appearance! Classification will detect the following classes for the images:

SOMBRERO
THIMBLE
AXOLOTL
CHURCH
KIMONO
UMBRELLA
PELICAN
TEAPOT
HOURGLASS
IBEX
RIFLE
TOUCAN
EFT
EARTHSTAR
NEWFOUNDLAND
RADIATOR
OCARINA
WOMBAT
BINOCULARS
ARTICHOKE
NEMATODE
DRAGONFLY
SPATULA

Which spells the acrostic: STACK UP THIRTEEN ROW BANDS

A 13-pixel-tall band can be extracted from each of the 23 299-pixel-tall images using the nth band from the nth image, and these bands can be stacked up in order, to form a new 299 × 299 pixel image. This stacked-up image can be assembled either programmatically or manually using image editing tools.

Running Inception v3 classification on the stacked-up image will yield the classification MICROWAVE, which is the answer to this puzzle.

Here’s a sample Python 3 program (using the Keras framework) for running Inception v3 classification over the given images and the final stacked image:

import numpy as np
from keras.preprocessing import image
from keras.applications import inception_v3

imagecount = 23
imagedimension = 299

# Number of rows in a band (Surprise! It's 13)
nrows = imagedimension // imagecount

# Directory with the image files
imagedir = 'images'

# Load the Inception V3 model
model = inception_v3.InceptionV3()


# Load an image from file imfile and scale into a numpy array from -1 to +1
def loadimage(imfile):
   img = image.load_img(imfile, target_size=(imagedimension, imagedimension))
   imgarray = image.img_to_array(img) / (255.0 / 2.0)
   imgarray -= 1.0
   return np.expand_dims(imgarray, axis=0)   # Keras needs an extra dimension for batch


# Return top prediction per Inception v3 for image as array
def predict(imgarray):
   pred = model.predict(imgarray)
   classes = inception_v3.decode_predictions(pred, top=1)
   return classes[0][0]


images = []

# Predict class for each image
for n in range(imagecount):
   im = loadimage('{}/{:02d}.png'.format(imagedir, n + 1))
   images.append(im)
   classid, name, confidence = predict(im)
   print('{}: {:.4} confidence'.format(name, confidence))

# Make stacked image out of thirteen-row bands from images using nth band from nth image
stacked = np.empty(shape=images[0].shape)
for n in range(imagecount):
   stacked[:, nrows * n:nrows * (n + 1)] = images[n][:, nrows * n:nrows * (n + 1)]

# Predict class for stacked image
classid, name, confidence = predict(stacked)
print('Stacked image is {}: {:.4} confidence'.format(name, confidence))

Construction Notes

This puzzle was constructed using a technique for producing images that “fool” a neural network into producing a specific classification. This technique works by iteratively modifying an input image using gradients obtained from back-propagation through the neural network that bring the final prediction closer to the desired prediction. Refer to the article Machine Learning is Fun Part 8: How to Intentionally Trick Neural Networks for more details about this technique.

The resulting “hacked” image is very sensitive to the exact network weights used to produce it—different independently-trained copies of Inception v3 will not be fooled by it. Luckily, there aren’t actually that many independently-trained copies of Inception v3 available on the Internet—most copies of the weights that are easily available and usable are derived from a couple of different snapshots supplied by TensorFlow.

To produce the images for this puzzle, we set up a program that applied the image modification technique using three different copies of Inception v3 simultaneously:

TensorFlow checkpoint from 2016-08-28 (http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz)
TensorFlow checkpoint from 2015-12-05 (http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz)
CNTK model (https://www.cntk.ai/Models/CNTK_Pretrained/InceptionV3_ImageNet_CNTK.model)

The other copies of Inception v3 that we were able to locate and get running were all derived from one of these three copies, and therefore had the same weights and were also fooled by our images.

The images were produced using GPU-accelerated TensorFlow with an Nvidia GeForce GTX 750 Ti graphics card. Each image took anywhere from a few minutes to a few hours to converge on our desired classifications with a high confidence for all three networks.