cnn-vis

Inspired by Google's recent Inceptionism blog post, cnn-vis is an open-source tool that lets you use convolutional neural networks to generate images. Here's an example:

<img src="http://cs.stanford.edu/people/jcjohns/cnn-vis-examples/example12.png" width=800px>

You can find many more examples, along with scripts used to generate them, in the example gallery.

Convolutional neural networks (CNNs) have become very popular in recent years for many computer vision tasks, especially image classification. A CNN takes an image (in the form of a pixel grid) as input, and transforms the image through several layers of nonlinear functions. In a classification setup, the final layer encodes the contents of the image in the form of a probability distribution over a set of classes. The lower layers tend to capture low-level image features such as oriented edges or corners, while the higher layers are thought to encode more semantically meaningful features such as object parts.

In order to use a CNN for a classification task, it needs to be trained. We initialize the weights of the network randomly, then show it many examples of images whose labels are known. Based on the errors that the network makes in classifying these known images, we gradually adjust the weights of the network so that it correctly classifies these images. Two popular datasets for training CNNs are ImageNet [4] and MIT Places [10]. ImageNet contains 1000 categories of objects, such as dogs, birds, and other animals, while MIT Places contains 205 types of scenes such as bedrooms, kitchens, and forests.

Although CNNs perform well on a variety of tasks, it can be difficult to understand exactly what types of image features a CNN is using to work its magic. One trick for demystifying a CNN is to choose a neuron in a trained CNN, and attempt to generate an image that causes the neuron to activate strongly. We initialize the image with random noise, propagate the image forward through the network to compute the activation of the target neuron, then propagate the activation of the neuron backward through the network to compute an update direction for the image. We use this information to update the image, and repeat the process until convergence. This general strategy has been used to visualize the activations of individual neurons [8, 9], to generate images of particular object classes [5], to invert CNN features [1, 2], and to generate images to fool CNNs [3, 6].
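The loop described above can be sketched with a toy network. This is a minimal illustration, not cnn-vis code: the two-layer ReLU network, its random weights (`W1`, `w2`), the step size, and the iteration count are all assumptions standing in for a trained CNN and Caffe's forward and backward passes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained CNN (assumption): one hidden ReLU layer
# followed by a single target neuron.
W1 = rng.standard_normal((16, 64))   # hidden-layer weights
w2 = rng.standard_normal(16)         # weights of the target neuron


def forward(x):
    """Return hidden pre-activations and the target neuron's activation."""
    h = W1 @ x
    return h, w2 @ np.maximum(h, 0.0)


def backward(h):
    """Gradient of the target activation with respect to the input image."""
    return W1.T @ (w2 * (h > 0))


x = 0.1 * rng.standard_normal(64)    # initialize the "image" with random noise
_, initial_activation = forward(x)

step_size = 0.01                     # hypothetical step size
for _ in range(100):
    h, activation = forward(x)       # forward pass to the target neuron
    x += step_size * backward(h)     # backward pass gives the update direction

_, final_activation = forward(x)
print(initial_activation, final_activation)
```

In a real setting the forward and backward calls would run through the trained CNN, and the update would include the regularizers described later in this document.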

Inceptionism builds on this line of work, adding three unique twists:

Setup

Caffe

cnn-vis is built on top of Caffe, an excellent open-source CNN implementation from Berkeley. You'll need to do the following:

./scripts/download_model_binary.py models/bvlc_googlenet/

cnn-vis

Clone the repo, create a virtual environment, install requirements, and add the Caffe Python library to the virtualenv:

git clone https://github.com/jcjohnson/cnn-vis.git
cd cnn-vis
virtualenv .env
source .env/bin/activate
pip install -r requirements.txt
echo $CAFFE_ROOT/python > .env/lib/python2.7/site-packages/caffe.pth

Usage

cnn-vis is a standalone Python script; you can control its behavior by passing various command-line arguments.

Many options affect the final generated image. To help you get started, we provide scripts in the example gallery that use cnn-vis to generate a variety of example images. For completeness, we also document all options here.

CNN options

These options control the CNN that will be used to generate images.

Image options

These options define the objective that will be optimized to generate an image.

Initialization options

Options for setting the initial image. You can either seed the initial image from an existing image, or use random noise. In the case of random noise, we generate Gaussian white noise, then smooth it using Gaussian blur to prevent TV regularization from dominating the first few steps of optimization.
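As a rough sketch of the noise initialization, the snippet below draws Gaussian white noise and smooths it with a separable Gaussian blur written in plain NumPy. The function names, `noise_scale`, `blur_sigma`, and the kernel radius of three standard deviations are assumptions for illustration; cnn-vis may implement the blur differently.

```python
import numpy as np


def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at about three standard deviations."""
    radius = int(3 * sigma + 0.5)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()


def smoothed_noise(height, width, noise_scale=1.0, blur_sigma=2.0, seed=0):
    """Return (white_noise, blurred_noise) for a single-channel image."""
    rng = np.random.default_rng(seed)
    noise = noise_scale * rng.standard_normal((height, width))
    kernel = gaussian_kernel(blur_sigma)
    # Separable blur: filter along rows, then along columns.
    blurred = np.apply_along_axis(np.convolve, 1, noise, kernel, mode="same")
    blurred = np.apply_along_axis(np.convolve, 0, blurred, kernel, mode="same")
    return noise, blurred


noise, blurred = smoothed_noise(32, 32)
```

The blurred image has much smaller differences between neighboring pixels than the raw noise, so a TV penalty on it starts out small.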

Resize options

Options for configuring multiscale zooming used to generate high-resolution images. To generate nice images, we want to start with a small initial size that is ideally not much bigger than the base resolution of the CNN, then gradually grow to larger images.

Sizes may be specified as multiples of a base size; for noise initializations the base size is the input size of the CNN, and for image initializations the base size is the original size of the initial image.
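One way to picture the multiscale schedule is as a list of sizes, each a multiple of the base size. The helper below is hypothetical: its name, its parameters, and the choice of geometric growth between the initial and final multiples are assumptions, and the 224x224 base size is only an example of a CNN input size.

```python
def size_schedule(base_size, initial_multiple, final_multiple, num_sizes):
    """Return (height, width) pairs growing geometrically between two multiples."""
    base_h, base_w = base_size
    sizes = []
    for i in range(num_sizes):
        # Interpolation position in [0, 1]; a single size uses the initial multiple.
        t = i / (num_sizes - 1) if num_sizes > 1 else 0.0
        multiple = initial_multiple * (final_multiple / initial_multiple) ** t
        sizes.append((int(round(base_h * multiple)), int(round(base_w * multiple))))
    return sizes


# Example: start at the base size and grow to four times the base size.
sizes = size_schedule((224, 224), 1.0, 4.0, 3)
print(sizes)
```

For an image initialization, `base_size` would instead be the original size of the initial image.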

Optimization options

We optimize using gradient descent, and use RMSProp to compute per-parameter adaptive learning rates.
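For readers unfamiliar with RMSProp, here is a minimal sketch of one update step. The parameter names and default values are assumptions and do not necessarily match the cnn-vis options. The toy objective has two coordinates whose gradients differ by a factor of 100, to show that the per-parameter scaling makes both converge at a similar rate.

```python
import numpy as np


def rmsprop_step(x, grad, cache, learning_rate=1.0, decay_rate=0.9, epsilon=1e-8):
    """Apply one RMSProp update and return the new parameters and cache."""
    # Running average of squared gradients, kept separately per parameter.
    cache = decay_rate * cache + (1.0 - decay_rate) * grad ** 2
    # Dividing by the root of the average gives each parameter its own step size.
    x = x - learning_rate * grad / (np.sqrt(cache) + epsilon)
    return x, cache


# Toy objective: 0.5 * sum(scale * x**2), with very different curvatures.
scale = np.array([1.0, 100.0])
x = np.array([1.0, 1.0])
cache = np.zeros_like(x)
for _ in range(300):
    grad = scale * x
    x, cache = rmsprop_step(x, grad, cache, learning_rate=0.01)
print(x)
```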

Layer amplification objective options

These options allow you to configure the objective that is used for layer amplification. During backpropagation, we set the gradient of the target layer to -l1_weight * abs(a) - l2_weight * clip(a, -g, g), where a are the activations of the target layer. This corresponds to maximizing the (weighted) sum of the absolute values and thresholded squares of the activations at the target layer. The generated image tends not to be very sensitive to the values of these parameters, so the defaults should work fine.
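The gradient expression above can be written directly in NumPy. This sketch only evaluates the formula as stated in the text; the function name is hypothetical, and in cnn-vis the result would be placed in the target layer's gradient before running Caffe's backward pass.

```python
import numpy as np


def layer_amplification_grad(a, l1_weight, l2_weight, g):
    """Evaluate -l1_weight * abs(a) - l2_weight * clip(a, -g, g)."""
    return -l1_weight * np.abs(a) - l2_weight * np.clip(a, -g, g)


# Example activations, including values outside the clipping range [-g, g].
a = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
grad = layer_amplification_grad(a, l1_weight=1.0, l2_weight=1.0, g=1.0)
print(grad)
```

Clipping bounds the contribution of the second term, so a few very large activations cannot dominate it.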

P-norm regularization options

P-norm regularization prevents individual pixels from getting too large. For noise initializations, p-norm regularization pulls each pixel toward zero (corresponding to the mean ImageNet color) and for image initializations, p-norm regularization will pull each pixel toward the value of that pixel in the initial image. For noise initializations, relatively weak p-norm regularization tends to work well; for image initializations, p-norm regularization is the only term enforcing visual consistency with the initial image, so p-norm regularization should be stronger.
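As a sketch of how such a regularizer pulls pixels toward a reference, assume the penalty has the form `strength * sum(|image - reference| ** p)`. This form, the function name, and the parameter names are assumptions; the exact normalization used by cnn-vis is not stated here. The reference is zero for noise initializations and the initial image for image initializations.

```python
import numpy as np


def p_norm_grad(image, reference, p, strength):
    """Gradient of strength * sum(|image - reference| ** p) with respect to image."""
    diff = image - reference
    return strength * p * np.abs(diff) ** (p - 1) * np.sign(diff)


image = np.array([[1.0, -2.0]])

# Noise initialization: the reference is zero, so pixels are pulled toward zero.
grad_noise_init = p_norm_grad(image, np.zeros_like(image), p=2, strength=0.5)

# Image initialization: no pull while the image still equals the initial image.
grad_image_init = p_norm_grad(image, image.copy(), p=2, strength=0.5)
```

A gradient descent step subtracts this gradient, which moves each pixel back toward its reference value.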

Auxiliary p-norm regularization options

Parameters for a second p-norm regularizer. Unlike the first p-norm regularizer, which pulls toward the initial image if one is given, the second always pulls toward zero. If the initial image contains very saturated regions (either very white or very black), then even small deviations around the initial value can result in pixel values outside the [0, 255] range. One trick for getting around this problem is to add a second p-norm regularizer with a high exponent (maybe 11) and a very low regularization constant (maybe 1e-11). This regularizer will have little effect on pixels near the center of the [0, 255] range, but will push pixels outside this range back toward zero.

Total Variation regularization options

Total Variation (TV) regularization encourages neighboring pixels to have similar values. For noise initializations this regularizer is critical; without it the generated image will exhibit large amounts of high-frequency noise. For image initializations it is less critical; strong p-norm regularization will keep the pixels close to the initial image, and this will be sufficient to prevent high-frequency noise.

As defined in [2], we compute the TV-norm of an image by approximating the magnitude of the image gradient using neighboring pixels, raising the image gradient to the power of beta, and summing over the image.
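The description above can be sketched for a single-channel image as follows. Using forward differences to the right and lower neighbors is an assumption about the approximation, and the function name is hypothetical.

```python
import numpy as np


def tv_norm(image, beta=2.0):
    """Sum over pixels of the approximate gradient magnitude raised to beta."""
    dx = image[:-1, 1:] - image[:-1, :-1]   # difference to the right neighbor
    dy = image[1:, :-1] - image[:-1, :-1]   # difference to the lower neighbor
    # (dx^2 + dy^2) ** (beta / 2) is the gradient magnitude to the power beta.
    return np.sum((dx ** 2 + dy ** 2) ** (beta / 2.0))


flat = np.full((3, 4), 7.0)                  # constant image: no variation
ramp = np.tile(np.arange(4.0), (3, 1))       # each row is 0, 1, 2, 3
print(tv_norm(flat), tv_norm(ramp))
```

A constant image has a TV-norm of zero, and any variation between neighboring pixels increases it.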

[2] suggests that starting with a low TV-norm regularization strength and increasing it over time gives good results. In cnn-vis we implement this idea by increasing the TV-norm regularization strength by a constant amount after a fixed number of iterations.
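A stepwise schedule of this kind might look like the sketch below. The function and parameter names are hypothetical, and it assumes the increase is applied every `step_iter` iterations rather than only once.

```python
def tv_strength_at(iteration, initial_strength, strength_step, step_iter):
    """TV regularization strength in effect at a given iteration."""
    # The strength grows by strength_step once per completed block of step_iter iterations.
    return initial_strength + strength_step * (iteration // step_iter)


# Example: start at 0.5 and add 0.25 every 100 iterations.
schedule = [tv_strength_at(t, 0.5, 0.25, 100) for t in (0, 99, 100, 250)]
print(schedule)
```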

Output options

Options for controlling the output.

--output_file: Filename where the final image will be saved. Default is out.png.

--rescale_image: If this flag is given, the image colors are rescaled linearly to [0, 255]: the minimum value of the image is mapped to 0 and the maximum value is mapped to 255. If this flag is not given, the image will be clipped to the range [0, 255] for output. Rescaling the image values can reveal detail in highly saturated or desaturated image regions, but can lead to color distortion.

--output_iter: After every output_iter steps of optimization, some outputs will be produced. Exactly what is produced is controlled by iter_behavior.

--iter_behavior: What should happen every output_iter steps. The allowed options are shown below. Options can be combined with + to have multiple types of behavior; for example show+print+save will do all three every output_iter steps.
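To make the difference between --rescale_image and the default clipping concrete, here is a small sketch. The function name is hypothetical, and rounding to the nearest integer before converting to 8-bit values is an assumption.

```python
import numpy as np


def to_output_range(image, rescale_image=False):
    """Map an image to 8-bit values by linear rescaling or by clipping."""
    if rescale_image:
        low, high = image.min(), image.max()
        if high == low:
            # Degenerate case: a constant image has no range to stretch.
            return np.zeros(image.shape, dtype=np.uint8)
        image = 255.0 * (image - low) / (high - low)
    else:
        image = np.clip(image, 0.0, 255.0)
    return np.round(image).astype(np.uint8)


pixels = np.array([-50.0, 100.0, 350.0])
clipped = to_output_range(pixels)                        # out-of-range values saturate
rescaled = to_output_range(pixels, rescale_image=True)   # all values are shifted and scaled
print(clipped, rescaled)
```

Clipping leaves in-range pixels unchanged, while rescaling changes every pixel, which is where the color distortion comes from.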

References

[1] A. Dosovitskiy and T. Brox. "Inverting Convolutional Networks with Convolutional Networks", arXiv preprint arXiv:1506.02753 (2015).

[2] A. Mahendran and A. Vedaldi. "Understanding Deep Image Representations by Inverting Them", CVPR 2015.

[3] A. Nguyen, J. Yosinski, and J. Clune. "Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images", CVPR 2015.

[4] O. Russakovsky, et al. "ImageNet Large Scale Visual Recognition Challenge", IJCV 2014.

[5] K. Simonyan, A. Vedaldi, and A. Zisserman. "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps", ICLR 2014.

[6] C. Szegedy, et al. "Intriguing properties of neural networks", arXiv preprint arXiv:1312.6199 (2013).

[7] C. Szegedy, et al. "Going Deeper with Convolutions", CVPR 2015.

[8] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. "Understanding Neural Networks Through Deep Visualization", ICML 2015 Deep Learning Workshop.

[9] M. D. Zeiler and R. Fergus. "Visualizing and understanding convolutional networks", ECCV 2014.

[10] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. "Learning Deep Features for Scene Recognition using Places Database", NIPS 2014.