# Deep Learning & Neural Networks
## Project 3 - Handwriting Recognition with ConvNets

Essentially the TensorFlow tutorial at https://www.tensorflow.org/versions/master/tutorials/mnist/pros/index.html, but with inline comments.

### Setup

As usual, start by loading all the libraries. These are the same as in the last exercise.

In [None]:
# Load TensorFlow
import tensorflow as tf
# Load numpy - adds MATLAB/Julia-style math to Python
import numpy as np
# Load matplotlib for plotting
%matplotlib inline
import matplotlib.pyplot as plt

Load the MNIST data set. The files are downloaded into the ``MNIST_data`` directory under the working directory if they're not already there.

The files consist of sets of images and labels for
  - training
  - validation
  - testing

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

### Helper functions

Before we build the graph we will write helper functions.

Below we define a function for creating tensor variables. It takes a shape (list of dimensions) as an argument and initializes the tensor with random values drawn from a truncated Gaussian with a standard deviation of $0.1$.

In [None]:
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)
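
To make the truncation concrete, here is a minimal NumPy sketch of the idea: draw from a Gaussian and redraw any value that lands more than two standard deviations from the mean, which is the behaviour documented for ``tf.truncated_normal``. The name ``truncated_normal_np`` and the fixed seed are illustrative assumptions, not part of TensorFlow.

```python
import numpy as np

def truncated_normal_np(shape, stddev=0.1, seed=0):
    """Sample N(0, stddev^2), redrawing values beyond 2 standard deviations."""
    rng = np.random.RandomState(seed)
    samples = rng.normal(0.0, stddev, size=shape)
    # Keep redrawing the out-of-range entries until none remain.
    out_of_range = np.abs(samples) > 2 * stddev
    while out_of_range.any():
        samples[out_of_range] = rng.normal(0.0, stddev, size=out_of_range.sum())
        out_of_range = np.abs(samples) > 2 * stddev
    return samples

w = truncated_normal_np([5, 5, 1, 32])
print(w.shape, np.abs(w).max() <= 0.2)
```

Every sampled weight therefore lies within $\pm 0.2$ when the standard deviation is $0.1$.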

Do the same but for the bias term:

In [None]:
def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

The function below specifies the kind of convolution operation we'll be using throughout. It's one where the filter moves with a stride of 1 pixel to the right and 1 pixel down.

In general, 'strides' represents the number of pixels you move down and right at a time (the two middle values in ``[1, 1, 1, 1]``) during the convolution. The first and fourth values are the strides over the batch and channel dimensions and are usually just set to 1. See [this link](https://stackoverflow.com/questions/34619177/what-does-tf-nn-conv2d-do-in-tensorflow) for more details.

In [None]:
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
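
As a rough illustration of what a stride-1, ``'SAME'``-padded convolution does, here is a naive NumPy sketch for a single-channel image and a single filter. It assumes an odd filter size so the zero padding splits evenly, and, like ``tf.nn.conv2d``, it slides the filter without flipping it (strictly a cross-correlation). The name ``conv2d_same_np`` is hypothetical; the real op also handles batches and multiple channels.

```python
import numpy as np

def conv2d_same_np(image, kernel):
    """Stride-1 sliding-window filter with zero padding; output matches input size."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2  # padding on each side (odd kernel assumed)
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):      # move down one pixel at a time
        for j in range(image.shape[1]):  # move right one pixel at a time
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

image = np.ones((28, 28))
kernel = np.ones((5, 5))
out = conv2d_same_np(image, kernel)
print(out.shape)               # (28, 28)
print(out[14, 14], out[0, 0])  # 25.0 in the interior, 9.0 in the corner
```

The corner value is smaller because most of the filter overlaps the zero padding there.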

Similarly for our pooling, we'll define a helper function. We'll use a simple 2x2 max pooling with no overlaps - so the output will be half the height and width of the input. Also the stride has to be ``[1,2,2,1]`` (why?)

In [None]:
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                            strides=[1, 2, 2, 1], padding='SAME')
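
A small NumPy sketch of non-overlapping 2x2 max pooling on a single 2D array may help picture the operation. It assumes even height and width, and the name ``max_pool_2x2_np`` is made up for this example.

```python
import numpy as np

def max_pool_2x2_np(image):
    """Take the maximum of each non-overlapping 2x2 block."""
    h, w = image.shape
    # Axes 1 and 3 index the position inside each 2x2 block.
    blocks = image.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

image = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2_np(image)
print(pooled)  # rows [5, 7] and [13, 15]
```

Each output pixel summarizes exactly one 2x2 block of the input, so a 4x4 input becomes 2x2.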

### Creating the computation graph

Let's get the easy part out of the way: we create a session just like in any other TensorFlow-based code.

In [None]:
sess = tf.Session()

Now we will immediately define the inputs to our graph. These will be the data ``x`` and labels ``y_``. Notice how the data is the image flattened (since this is the format the data is in to begin with). Remember that the first dimension of each tensor being ``None`` signifies that the number of rows can be anything at ``session`` runtime.

In [None]:
x  = tf.placeholder("float", shape=[None, 784], name="x")
y_ = tf.placeholder("float", shape=[None, 10], name="y_")

As we noted, the images are already flattened, which is normally good! However, convolutions work on the 2D image rather than on the flattened vector. Therefore, let's use the ``reshape`` operator to mangle them back into matrices.

In [None]:
x_image = tf.reshape(x, [-1,28,28,1])
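
The same reshape can be tried in NumPy, where ``-1`` likewise means "infer this dimension from the rest". The batch of three fake images below is an assumption made purely for illustration.

```python
import numpy as np

# Three fake flattened "images", 784 values each
batch = np.arange(3 * 784, dtype=np.float32).reshape(3, 784)
images = batch.reshape(-1, 28, 28, 1)
print(images.shape)  # (3, 28, 28, 1)

# Row-major layout: pixel (row 1, column 0) is flat index 28
print(images[0, 1, 0, 0] == batch[0, 28])  # True
```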

That's an unknown number of 28x28 images with 1 channel. Let's now define the weight and bias Tensor variables. The bias variable is 32-dimensional because it is simply an offset for each of the 32 filters (activation maps).

In [None]:
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])

Now we can define the first convolution layer. This is a convolution using 32 5x5 filters.

In [None]:
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

**Question** What is the dimension of ``h_conv1``?

Next we will add a max-pool layer to our network:

In [None]:
h_pool1 = max_pool_2x2(h_conv1)

We do pretty much the same thing (convolution + maxpool), adding another 2 layers, but with 64 5x5 filters for the convolution piece:

In [None]:
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

Now we'll implement 2 fully connected layers with a softmax layer after the second and *dropout* between the two.

We first need to mangle the $64$ $7 \times 7$ images into one $7^2 \times 64$ dimensional vector. Then we add a fully connected layer with ReLU to the result:

In [None]:
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
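
To see where $7 \times 7 \times 64$ comes from, the sketch below traces the spatial size through the four layers. It relies on the rule that ``'SAME'`` padding gives an output size of $\lceil \text{input} / \text{stride} \rceil$; the helper name ``same_output_size`` is an assumption for this example.

```python
import math

def same_output_size(size, stride):
    """Output height/width under 'SAME' padding: ceil(size / stride)."""
    return int(math.ceil(size / float(stride)))

size = 28
trace = [size]
for layer in ["conv1", "pool1", "conv2", "pool2"]:
    stride = 1 if layer.startswith("conv") else 2  # conv: stride 1, pool: stride 2
    size = same_output_size(size, stride)
    trace.append(size)

print(trace)        # [28, 28, 14, 14, 7]
flat_length = size * size * 64  # 64 filters in the second convolution
print(flat_length)  # 3136
```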

The next thing we'll do is add a layer that randomly drops certain units from ``h_fc1`` during training. See the slides for details about the dropout technique.

Note that we leave the probability of *keeping* a unit (one minus the dropout probability) as an input to be set at runtime.

In [None]:
keep_prob = tf.placeholder(tf.float32, name="keep_prob")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
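
For intuition, here is a minimal NumPy sketch of what dropout with a keep probability does: each unit survives with probability ``keep_prob`` and survivors are scaled by ``1 / keep_prob``, so the expected sum stays the same. The name ``dropout_np`` and the explicit random generator argument are assumptions for this example.

```python
import numpy as np

def dropout_np(h, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob; rescale the survivors."""
    mask = rng.uniform(size=h.shape) < keep_prob
    return np.where(mask, h / keep_prob, 0.0)

rng = np.random.RandomState(0)
h = np.ones((2, 8))
dropped = dropout_np(h, 0.5, rng)  # training: entries are either 0.0 or 2.0
same = dropout_np(h, 1.0, rng)     # evaluation: nothing is dropped
print(dropped)
print(np.array_equal(same, h))     # True
```

This is why the accuracy checks later feed ``keep_prob: 1.0`` while training feeds ``0.5``.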

We're almost done! We just need to add a fully connected layer on top of ``h_fc1_drop``, that is, the neurons that remain after dropout. The output of this last layer is a vector of length 10. We "softmax" this vector at the end to get a probability distribution over the 10 digits.

In [None]:
W_fc2  = weight_variable([1024, 10])
b_fc2  = bias_variable([10])
y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
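
As a reminder of what "softmax" computes, here is a small NumPy sketch that exponentiates each row and normalizes it to sum to 1. Subtracting the row maximum first is a common stability trick and does not change the result; the name ``softmax_np`` is an assumption for this example.

```python
import numpy as np

def softmax_np(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])
probs = softmax_np(logits)
print(probs.sum(axis=1))  # each row sums to 1
print(probs.argmax(axis=1))
```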

This wasn't so scary apart from maybe the convolution parts!

### Training the net

Like in Project 1A, we use cross-entropy as the loss. The only other difference from previous setups is that we will use the [Adam optimizer](https://arxiv.org/abs/1412.6980) (fancy!) instead of the usual gradient descent.
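
A NumPy sketch of the cross-entropy between one-hot labels and predicted probabilities, summed over the batch, looks roughly like this. The clipping constant ``eps`` is an assumption added here to avoid ``log(0)``; the TensorFlow cell that follows takes the logarithm directly.

```python
import numpy as np

def cross_entropy_np(y_true, y_pred, eps=1e-10):
    """Summed cross-entropy: -sum(y_true * log(y_pred))."""
    clipped = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    return -np.sum(y_true * np.log(clipped))

y_true = np.array([[0.0, 1.0, 0.0]])
y_pred = np.array([[0.25, 0.5, 0.25]])
loss = cross_entropy_np(y_true, y_pred)
print(loss)  # equals -log(0.5)
```

Only the probability assigned to the true class contributes, because the other label entries are zero.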

In [None]:
# Time for the training!
# We'll use a "cross entropy" loss function instead
# of square loss
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))
# We'll use ADAM instead of SGD (fancy!)
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
# We can use TF to track the accuracy
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
# Initialize weights
sess.run(tf.initialize_all_variables())
# Time to train this thing
# Warning, may melt laptop
ces, accs = [], []
for i in range(1000):
    # Use the helper functions to get a batch of
    # 50 digits
    batch = mnist.train.next_batch(50)
    # Every 100 steps
    if i%100 == 0:
        # Accuracy is measured with dropout off
        train_accuracy = sess.run(accuracy, 
                                  feed_dict={x:batch[0],
                                             y_: batch[1],
                                             keep_prob: 1.0})
        print("step %d, training accuracy %g"%(i, train_accuracy))
    # Train it
    ce, acc, _ = sess.run((cross_entropy,accuracy,train_step),
             feed_dict={x: batch[0],
                        y_: batch[1],
                        keep_prob: 0.5})
    ces.append(ce)
    accs.append(acc)
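
For the curious, a single Adam update can be sketched in NumPy as below. This follows Algorithm 1 of the linked paper with its default hyperparameters and the learning rate used above; it is not a transcription of TensorFlow's ``AdamOptimizer``, and the name ``adam_step`` and the toy objective are assumptions.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
grad = 2 * theta  # gradient of the toy objective sum(theta ** 2)
theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)  # each entry moves about lr toward zero
```

On the first step the update has magnitude close to the learning rate regardless of the gradient's scale.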

We plot the cross-entropy (top) and the training accuracy (bottom) over the training run that just happened, then print the accuracy on the test set.

In [None]:
f, axarr = plt.subplots(2, sharex=True)
axarr[0].plot(range(1000), ces)
axarr[1].plot(range(1000), accs)
print("test accuracy %g"%sess.run(accuracy, feed_dict={
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

### Examining the convolutional network

Let's have a look at one test image and what the network thinks of it, shall we?

In [None]:
idx = 10
plt.matshow(np.reshape(mnist.test.images[idx], (28,28)))
net_opinion = sess.run(y_conv, feed_dict={
    x: mnist.test.images[idx:idx+1], keep_prob: 1.0})
# Rounded class probabilities, then the predicted digit
print(np.round(net_opinion))
print(np.argmax(net_opinion))

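Looking for mistakes made by the network is simple. The cell below scans test images 300 through 999 and stops at the first one the network gets wrong. Looking at the image, we can usually see why it was tricky for the computer to classify.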

In [None]:
for idx in range(300, 1000):
    net_opinion = sess.run(y_conv, feed_dict={
        x: mnist.test.images[idx:idx+1], keep_prob: 1.0})
    net_digit = np.argmax(net_opinion)
    true_digit = np.argmax(mnist.test.labels[idx])
    if net_digit != true_digit:
        plt.matshow(np.reshape(mnist.test.images[idx], (28,28)))
        print(idx)
        print(net_opinion)
        print(net_digit)
        print(true_digit)
        break
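
If you want every mistake rather than just the first, the comparison can be vectorized. The sketch below uses small made-up arrays in place of the network output and the test labels; ``find_mistakes`` is a hypothetical helper name.

```python
import numpy as np

def find_mistakes(probs, labels_one_hot):
    """Return indices where the predicted class differs from the true class."""
    predicted = np.argmax(probs, axis=1)
    actual = np.argmax(labels_one_hot, axis=1)
    return np.flatnonzero(predicted != actual)

# Toy stand-ins for network outputs and one-hot labels
probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])
labels = np.array([[1, 0], [1, 0], [0, 1]])
mistakes = find_mistakes(probs, labels)
print(mistakes)  # [1 2]
```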

Let's look at the activations in the first convolutional layer:

In [None]:
idx = 10
plt.matshow(np.reshape(mnist.test.images[idx], (28,28)))
value_h_conv1 = sess.run(h_conv1, feed_dict={
    x: mnist.test.images[idx:idx+1], keep_prob: 1.0})
plt.matshow(value_h_conv1[0,:,:,0])
plt.matshow(value_h_conv1[0,:,:,1])
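
To view all 32 activation maps at once instead of one ``matshow`` call per filter, the maps can be tiled into a single 2D array first. The sketch below uses random data in place of ``value_h_conv1[0]``, and ``tile_activations`` is a hypothetical helper.

```python
import numpy as np

def tile_activations(act, rows, cols):
    """Arrange an (H, W, C) activation volume into a rows x cols grid image."""
    h, w, c = act.shape
    if rows * cols != c:
        raise ValueError("rows * cols must equal the number of channels")
    grid = act.transpose(2, 0, 1).reshape(rows, cols, h, w)
    # Interleave so each tile occupies its own h x w block of the output.
    return grid.transpose(0, 2, 1, 3).reshape(rows * h, cols * w)

act = np.random.RandomState(0).rand(28, 28, 32)  # stand-in for value_h_conv1[0]
grid = tile_activations(act, 4, 8)
print(grid.shape)  # (112, 224)
```

The resulting array can be passed to ``plt.matshow`` like any other image.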