# Image Classification - Softmax Regression

## Introduction

In this exercise you will learn how to classify images of handwritten digits. For classification you will implement logistic regression, or rather, since we have more than two classes, softmax regression. Furthermore, you will learn about stochastic gradient descent (as opposed to gradient descent) and, for the evaluation of your model, about accuracy and the F1 score.

## Requirements

TODO

### Python Modules

Following the deep.TEACHING convention, all Python modules needed to run the notebook are loaded centrally at the beginning.

# All necessary imports at the beginning
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import os
import shutil
import gzip
import urllib.request
import pandas as pd

## Load, Explore and Prepare Dataset

The MNIST dataset is a classic machine learning dataset; you can get it, and more information about it, from the website of Yann LeCun. MNIST contains handwritten digits and is split into a training set of 60,000 examples and a test set of 10,000 examples. Alternatively, you can use the module sklearn to load the MNIST dataset in a convenient way.

Note:

If the cells below throw an error, the problem might be a broken download link. In that case, download the dataset from another source, e.g., from https://www.kaggle.com/avnishnish/mnist-original, unzip it and place it under BASE_DATA_DIR.

BASE_DATA_DIR = os.path.expanduser('~/deep.TEACHING/data')

class Mnist:
    """Downloads, loads into numpy arrays and reshapes the common machine learning dataset: MNIST

    The MNIST dataset contains handwritten digits, available from http://yann.lecun.com/exdb/mnist/,
    and has a training set of 60,000 examples and a test set of 10,000 examples. With this class you can
    download the dataset and prepare it for usage, e.g., flatten the images or one-hot-encode the labels.
    """

    def __init__(self, data_dir=None, auto_download=True, verbose=True):
        """Downloads and moves the MNIST dataset into a given folder.

        MNIST will be downloaded from http://yann.lecun.com/exdb/mnist/ and moved
        into the folder 'data_dir', if a path is given by the user, else files
        will be moved into a folder specified by the deep-teaching-commons
        config file.

        Args:
            data_dir: A string representing a path where you want to store MNIST.
            auto_download: A boolean, if True and the given 'data_dir' does not exist, MNIST will be downloaded.
            verbose: A boolean indicating more user feedback or not.
        """
        self.data_dir = data_dir
        self.verbose = verbose

        if self.data_dir is None:
            self.data_dir = os.path.join(BASE_DATA_DIR, 'MNIST')

        self.data_url = 'http://yann.lecun.com/exdb/mnist/'
        self.files = ['train-images-idx3-ubyte.gz','train-labels-idx1-ubyte.gz','t10k-images-idx3-ubyte.gz', 't10k-labels-idx1-ubyte.gz']

        if auto_download:
            self.download()

    def download(self):
        """Creates a directory, downloads MNIST and moves the data into the
        directory. MNIST source and target directory are defined by
        class initialization (__init__).

        TODO:
            Maybe redesign so it can be used as standalone method.
        """
        if os.path.exists(self.data_dir):
            if self.verbose:
                print('data directory already exists, skipping download')
        else:
            if self.verbose:
                print('data directory does not exist, starting download...')
            # Create directories
            os.makedirs(self.data_dir)
            # Download each file and move it to given self.data_dir
            for file in self.files:
                urllib.request.urlretrieve(self.data_url + file, file)
                shutil.move(file, os.path.join(self.data_dir, file))
                if self.verbose:
                    print(file, 'downloaded')
            if self.verbose:
                print('download complete')
    def get_all_data(self, one_hot_enc=None, flatten=True, normalized=None):
        """Loads the MNIST dataset into four numpy arrays.

        The default setup will return training and test images in a flat representation,
        meaning each image is a row of 28*28 (784) pixel values. Labels are encoded as
        a digit between 0 and 9. You can change both representations using the arguments,
        e.g., to preserve the image dimensions.

        Args:
            one_hot_enc (boolean): Indicates if labels are returned in standard (0-9) or one-hot-encoded form
            flatten (boolean): Images will be returned as vectors (flattened) or as matrices
            normalized (boolean): Indicates if pixel values (0-255) will be normalized

        Returns:
            train_data (ndarray): A matrix containing training images
            train_labels (ndarray): A vector containing training labels
            test_data (ndarray): A matrix containing test images
            test_labels (ndarray): A vector containing test labels
        """
        train_images = self.get_images(os.path.join(self.data_dir, self.files[0]))
        train_labels = self.get_labels(os.path.join(self.data_dir, self.files[1]))
        test_images = self.get_images(os.path.join(self.data_dir, self.files[2]))
        test_labels = self.get_labels(os.path.join(self.data_dir, self.files[3]))

        if one_hot_enc:
            train_labels, test_labels = [self.to_one_hot_enc(labels) for labels in (train_labels, test_labels)]

        if flatten is False:
            train_images, test_images = [images.reshape(-1, 28, 28) for images in (train_images, test_images)]

        if normalized:
            train_images, test_images = [images / np.float32(256) for images in (train_images, test_images)]

        return train_images, train_labels, test_images, test_labels

    def get_images(self, file_path):
        """Unzips, reads and reshapes an MNIST image file.

        Args:
            file_path (string): MNIST image data file

        Returns:
            ndarray: A matrix containing flattened images
        """
        with gzip.open(file_path, 'rb') as file:
            images = np.frombuffer(file.read(), np.uint8, offset=16)
        return images.reshape(-1, 28 * 28)

    def get_labels(self, file_path):
        """Unzips and reads an MNIST label file.

        Args:
            file_path (string): MNIST label data file

        Returns:
            ndarray: A vector containing labels
        """
        with gzip.open(file_path, 'rb') as file:
            labels = np.frombuffer(file.read(), np.uint8, offset=8)
        return labels

    def to_one_hot_enc(self, labels):
        """Converts the standard MNIST label representation into a one-hot encoding.

        The conversion is done by fancy indexing into an identity matrix.

        Args:
            labels (ndarray): Array of MNIST labels

        Returns:
            ndarray: A matrix containing one one-hot-encoded label per row

        Example:
            [2,9,0] --> [[0,0,1,0,0,0,0,0,0,0],
                         [0,0,0,0,0,0,0,0,0,1],
                         [1,0,0,0,0,0,0,0,0,0]]
        """
        return np.eye(10)[labels]

mnist = Mnist()
X_train, y_train, X_test, y_test = mnist.get_all_data(one_hot_enc=True, flatten=True)

To get a visualization of MNIST we will plot a digit. Each row represents an image in flattened form (all pixels in one row). We have to change the shape from a vector back to a matrix of the original shape to plot the image. In the case of MNIST this means converting 784 pixels back into a 28x28 image. In addition we will check the label of that digit to verify it corresponds to the image.

def plot_mnist_digit(digit):
    image = digit.reshape(28, 28)
    plt.imshow(image, cmap='binary', interpolation='bicubic')

# choose a random index, plot the digit and check its label
random_number = np.random.randint(0, len(X_train))
print('label:', y_train[random_number])
plot_mnist_digit(X_train[random_number])

## Exercises

### Plot Digits

After a glimpse into MNIST let us explore it a bit further.

Write a function plot_mnist_digits(data, examples_each_row) that plots a configurable number of examples for each class, like:

def plot_mnist_digits(data, examples_each_row):
    ############################################
    #TODO: Write a function that plots as many #
    #      examples of each class as defined   #
    #      by 'examples_each_row'              #
    ############################################
    raise NotImplementedError()
    ############################################
    #             END OF YOUR CODE             #
    ############################################

plot_mnist_digits(X_train, examples_each_row=11)

### Preparation of Train- and Test-Split

After exploring MNIST let us prepare the data for our linear classifier. First we will shuffle the training data to get a random ordering.

# shuffle training data
shuffle_index = np.random.permutation(len(X_train))
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

### Define a Linear Classifier Using Softmax

We will train a model to classify the MNIST dataset with the following equation:

$L = \frac{1}{M} \sum_{i=1}^{M} -\log \left( \frac{e^{h(x_i,\Theta)_{y_i}}}{\sum_{k=1}^{K} e^{h(x_i,\Theta)_k}} \right) + \frac{\lambda}{2} \sum \Theta^2, \quad \text{with} \quad h(X,\Theta) = X \cdot \Theta$

Using the universal form of a loss function we can identify the separate parts of that huge equation.

$L = \frac{1}{N} \sum_i L_i(h(x_i,\Theta),y_i) + \lambda R(\Theta)$

We will implement each part on its own and then put them together. That way it is much easier to understand what is going on.

Let us start with the score function or hypothesis:

$h(X,\Theta) = X * \Theta$

It is possible to calculate all score values with one matrix multiplication (dot product), so we can use the whole training data $X$ instead of one digit $x_i$. This is much faster than using loops.
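As a minimal sketch of this vectorized scoring (the names `X_demo` and `theta_demo` are illustrative, not part of the exercise code), a single dot product yields one row of class scores per image:

```python
import numpy as np

# Illustrative shapes: 5 flattened 28x28 images, 10 classes
X_demo = np.random.rand(5, 28 * 28)
theta_demo = np.zeros((28 * 28, 10))

scores = X_demo @ theta_demo   # same as np.dot(X_demo, theta_demo)
print(scores.shape)            # (5, 10): one row of class scores per image
```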

def class_scores(X, theta):
    ############################################
    #TODO: Implement the hypothesis and return #
    #      the score values for each class of  #
    #      every digit.                        #
    ############################################
    raise NotImplementedError()
    ############################################
    #             END OF YOUR CODE             #
    ############################################


Next we can define the data loss function $L_i$. We assume the score values are unnormalized log probabilities and use the softmax function to turn them into probabilities:

$P(Y=j\mid X=x_i) = \frac{e^{s_j}}{\sum_{k=1}^{K}e^{s_k}}$

$L_i = -\log\;P(Y=j\mid X=x_i)$, where $j$ is the correct class of $x_i$.

Implement the functions softmax and data_loss

Hint:

The correct classes (labels) are in a one-hot-encoded form, so you can use a matrix multiplication to extract the probability of the correct class.
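To illustrate the hint, here is a small sketch (with made-up probabilities `P` and labels `Y`) of how a one-hot matrix picks out the probability of the correct class:

```python
import numpy as np

# Made-up class probabilities for 3 examples over 4 classes (rows sum to 1)
P = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.20, 0.50, 0.20, 0.10],
              [0.25, 0.25, 0.25, 0.25]])
Y = np.eye(4)[[0, 1, 3]]      # one-hot labels for classes 0, 1 and 3

# Element-wise product plus row sum extracts P[i, y_i]
correct = np.sum(P * Y, axis=1)   # probabilities of the correct classes: 0.7, 0.5, 0.25
losses = -np.log(correct)         # per-example data loss L_i
print(correct)
```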

# Calculate class probability distribution for each digit from given class scores
def softmax(class_scores):
    ############################################
    #TODO: Use the softmax function to compute #
    #      class probabilities                 #
    ############################################
    raise NotImplementedError()
    ############################################
    #             END OF YOUR CODE             #
    ############################################

# Compute data_loss L_i for the correct class
def data_loss(class_probabilities, onehot_encode_label):
    ############################################
    #TODO: With one-hot encoded labels and     #
    #      class probabilities calculate the   #
    #      data loss L_i                       #
    ############################################
    raise NotImplementedError()
    ############################################
    #             END OF YOUR CODE             #
    ############################################


Now we will calculate the loss $L$ using the defined functions.

$L = \frac{1}{M} \sum_i L_i(h(x_i,\Theta),y_i) + \lambda R(\Theta)$

Besides the loss $L$ we will have to calculate the gradient of the loss function, which we need in order to minimize it. For more information about the gradient you can consult additional sources.

${\displaystyle \operatorname {grad} (L)={\frac {\partial L}{\partial \theta_{1}}}{\hat {e}}_{1}+\cdots +{\frac {\partial L}{\partial \theta_{n}}}{\hat {e}}_{n}={\begin{pmatrix}{\frac {\partial L}{\partial \theta_{1}}}\\\vdots \\{\frac {\partial L}{\partial \theta_{n}}}\end{pmatrix}}.}$
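For the softmax loss with L2 regularization as defined above, the gradient works out to the compact form $\frac{1}{M} X^T (P - Y) + \lambda \Theta$, where $P$ are the softmax probabilities and $Y$ the one-hot labels. A sketch of this closed form (the helper names are illustrative, not the notebook's functions), checked against a numerical gradient:

```python
import numpy as np

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))   # numerically stable softmax
    return E / E.sum(axis=1, keepdims=True)

def loss_and_grad(X, Y, theta, lam):
    m = X.shape[0]
    P = softmax_rows(X @ theta)
    L = -np.log(np.sum(P * Y, axis=1)).mean() + (lam / 2) * np.sum(theta * theta)
    grad = X.T @ (P - Y) / m + lam * theta         # closed-form gradient
    return L, grad

# Sanity check on tiny random data: compare one entry of the analytic
# gradient with a forward-difference approximation
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Y = np.eye(3)[rng.integers(0, 3, size=6)]
theta = rng.normal(size=(4, 3))
L, G = loss_and_grad(X, Y, theta, lam=0.1)

eps = 1e-6
t = theta.copy()
t[0, 0] += eps
numeric = (loss_and_grad(X, Y, t, lam=0.1)[0] - L) / eps
print(abs(numeric - G[0, 0]) < 1e-4)   # analytic and numeric gradients agree
```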

Implement the rest of the loss function.

def loss(X, y, theta, lam):

    encoded_labels = y                                # also needed for the gradient, therefore computed separately
    probabilities = softmax(class_scores(X, theta))   # also needed for the gradient, therefore computed separately
    loss_Li = data_loss(probabilities, encoded_labels)

    m = X.shape[0]                                    # number of training examples for normalization
    l2_regularization = (lam/2) * np.sum(theta*theta) # regularization loss

    ############################################
    #TODO: Put everything together and compute #
    #      loss L and gradient dL with the     #
    #      given variables.                    #
    ############################################
    raise NotImplementedError()

    ############################################
    #             END OF YOUR CODE             #
    ############################################

    return loss, gradient

### Reduce the Cost Using Gradient Descent

Gradient descent is a way to minimize our loss function. It iteratively moves toward a set of parameter values that minimize the loss. This iterative minimization is achieved using calculus, taking steps in the negative direction of the gradient (a vector that points in the direction of steepest ascent).

${\displaystyle \mathbf {\theta} _{new}=\mathbf {\theta} _{old}-\gamma \times grad(L)} \space , \space \gamma = learning \space rate$

$\Leftrightarrow {\displaystyle \mathbf {\theta} _{new}=\mathbf {\theta} _{old}-\gamma \space {\begin{pmatrix}{\frac {\partial L}{\partial \theta_{1}}}\\\vdots \\{\frac {\partial L}{\partial \theta_{n}}}\end{pmatrix}}.}$
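Before applying the update rule to the full model, it can help to watch it on a one-dimensional toy problem. A small sketch, minimizing $f(\theta) = (\theta - 3)^2$ with gradient $2(\theta - 3)$:

```python
# Gradient descent on f(theta) = (theta - 3)^2 with learning rate gamma
theta = 0.0
gamma = 0.1
for _ in range(100):
    gradient = 2 * (theta - 3)   # derivative of f at the current theta
    theta -= gamma * gradient    # update rule: theta_new = theta_old - gamma * grad
print(round(theta, 4))           # approaches the minimum at theta = 3
```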

def gradient_descent(training_data, training_label, theta, lam=0.5, iterations=100, learning_rate=1e-5):
    losses = []
    ############################################
    #TODO: Optimize loss with gradient descent #
    #      update rule. Return a final model   #
    #      and a history of loss values.       #
    ############################################
    for i in range(iterations):
        loss_L, gradient = loss(training_data, training_label, theta, lam)
        losses.append(loss_L)
        theta -= (learning_rate * gradient)
        print('epoch ', i, ' - cost ', loss_L)
    print('final loss:', loss_L)
    ############################################
    #             END OF YOUR CODE             #
    ############################################
    return theta, losses

# Initialize learnable parameters theta
theta = np.zeros([X_train.shape[1],len(y_train[0])])
# Start optimization with training data, theta and optional hyperparameters
opt_model, loss_history = gradient_descent(X_train,y_train,theta,iterations=250)

### Stochastic Gradient Descent

In stochastic gradient descent (SGD) the gradient is computed from one or a few training examples (a so-called minibatch) as opposed to the whole dataset (gradient descent). When the dataset is very large, SGD converges much faster, as more updates of the weights (thetas) are done in the same time.

A typical minibatch size is 256, although the optimal size of the minibatch can vary for different applications and architectures.

def sgd(training_data, training_label, theta, lam=0.5, iterations=100, learning_rate=1e-5, batch_size=256):
    losses = []
    for i in range(iterations):
        # draw a random minibatch from the training data
        shuffle_index = np.random.permutation(training_data.shape[0])
        data, label = training_data[shuffle_index], training_label[shuffle_index]
        data, label = data[:batch_size], label[:batch_size]

        l, grad = loss(data, label, theta, lam)
        losses.append(l)
        theta -= learning_rate * grad
    return theta, losses

# Initialize learnable parameters theta
theta = np.zeros([X_train.shape[1],len(y_train[0])])
# Start optimization with training data, theta and optional hyperparameters
opt_model_sgd, loss_history_sgd = sgd(X_train,y_train,theta,iterations=250)

### Evaluate the Vanilla Gradient Descent Model

Let us look at the optimization results. The final loss tells us how far we could reduce the cost during training. Further, we can use the first loss value as a sanity check to validate that our implementation of the loss function works as intended. Recall that the loss after the first iteration should be $\log c$, with $c$ being the number of classes. To visualize the whole training process we can plot the loss values of each iteration as a loss curve.

# check loss after last iteration
print('last iteration loss:',loss_history[-1])
# Sanity check: first loss should be ln(10)
print('first iteration loss:',loss_history[0])
print('Is the first loss equal to ln(10)?', np.abs(np.log(10) - loss_history[0]) < 1e-6)
# Plot a loss curve
plt.plot(loss_history)
plt.ylabel('loss')
plt.xlabel('iterations')

The evaluation above gave us some insight into the optimization process but did not quantify our final model. One possibility is to calculate the model accuracy.

def modelAccuracy(X, y, theta):
    # calculate the class probabilities for each digit
    probabilities = softmax(np.dot(X, theta))
    # the class with the highest probability is predicted
    prediction = np.argmax(probabilities, axis=1)
    # sum all correct predictions and divide by the number of examples
    accuracy = (sum(prediction == np.argmax(y, axis=1))) / X.shape[0]
    return accuracy

print('Training accuracy: ', modelAccuracy(X_train,y_train,opt_model))
print('Test accuracy: ', modelAccuracy(X_test,y_test,opt_model))

### Evaluate the Stochastic Gradient Descent Model

# Plot a loss curve
plt.plot(loss_history_sgd)
plt.ylabel('loss')
plt.xlabel('iterations')
print('Training accuracy: ', modelAccuracy(X_train,y_train,opt_model_sgd))
print('Test accuracy: ', modelAccuracy(X_test,y_test,opt_model_sgd))

But that quantification is limited. A more general approach is to calculate a confusion matrix and derive different model measurements from it. A good overview of model measurements is provided by the Wikipedia article on precision and recall. We implement a confusion matrix for our model, calculate the F1 score and print it.

### Confusion Matrix and F1 score

The confusion matrix $m$ should look like this:

$\quad\quad (actual \space classes)\\ m = \begin{bmatrix} \space & 1 & 2 & \cdot \cdot \cdot & \cdot \cdot \cdot & 8 & 9 \\ 1 & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot \\ 2 & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot &\cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot \\ \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot& \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot \\ \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot \\ 8 & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot \\9 & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot \end{bmatrix} \space (predicted \space classes) \space$

with all correct predictions located on the diagonal of the table, so it is easy to visually inspect the table for prediction errors, as they will be represented by values outside the diagonal.

To understand the confusion matrix better, let us look at $m_{example}$. $m_{example}$ predicted 100 times 1, 200 times 2 and 300 times 3 while the actual class of the image was 1. It also predicted 200 times 1, 500 times 2 and 400 times 3 with 2 being the actual class.

$\quad\quad (actual \space classes)\\ m_{example} = \begin{bmatrix} \space & 1 & 2 & \cdot \cdot \cdot \\ 1 & 100 & 200 & \cdot \cdot \cdot \\ 2 & 200 & 500 & \cdot \cdot \cdot \\ 3 &300 & 400 & \cdot \cdot \cdot \\ \cdot \cdot \cdot & \cdot \cdot \cdot & \cdot \cdot \cdot& \cdot \cdot \cdot \\ \end{bmatrix} \space (predicted \space classes) \space$

After calculating the confusion matrix we can calculate the $F1$ score. Note that the $F1$ score is defined as

$F1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}}= \frac{2}{\frac{precision\space+\space recall}{precision \space \times \space recall}} = \frac{2\times precision \times recall}{precision + recall}$

$recall = \frac{d}{c}, \quad precision = \frac{d}{r}$

with $d$ being the elements on the diagonal of the confusion matrix, $r$ the sum of each row of the confusion matrix and $c$ the sum of each column of the confusion matrix.

Note that the score is calculated for each class, which results in a vector $scores$:

$scores = \left(\begin{array}{c}F1_{0}\\ \cdot \\ \cdot \\ \cdot \\ F1_{9} \end{array}\right)$
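With the $m_{example}$ numbers from above (restricted to the two complete columns), a sketch of how $d$, $r$ and $c$ combine into per-class F1 values:

```python
import numpy as np

# Confusion matrix: rows = predicted class, columns = actual class
m = np.array([[100, 200],
              [200, 500]])

d = np.diag(m)         # correct predictions per class
r = m.sum(axis=1)      # row sums: how often each class was predicted
c = m.sum(axis=0)      # column sums: how often each class actually occurred

precision = d / r
recall = d / c
f1 = 2 * precision * recall / (precision + recall)
print(np.round(f1, 3))   # one F1 value per class
```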

Implement the functions confusionMatrix and f1Score.

Hint:

• When y is not one-hot encoded, use np.eye(10)[y] to transform it into a one-hot encoding
• When y is one-hot encoded, use np.argmax(y, axis=1) to transform it into a label vector [0,3,4,...]
def confusionMatrix(X, y, theta):
    ############################################
    #TODO: Calculate the model predictions for #
    #      given X and theta.                  #
    #      Then compare predictions with y,    #
    #      build a confusion matrix and        #
    #      return it.                          #
    ############################################
    raise NotImplementedError()
    ############################################
    #             END OF YOUR CODE             #
    ############################################

matrix = confusionMatrix(X_test,y_test,opt_model)
plt.figure(figsize=(10, 10))
plt.imshow(matrix)
plt.colorbar()
plt.show()

# display the confusion matrix of the training data as a readable table
pd.DataFrame(confusionMatrix(X_train, y_train, opt_model))

def f1Score(confMatrix):
    ############################################
    #TODO: Calculate the F1 score from a given #
    #      confusion matrix.                   #
    ############################################
    raise NotImplementedError()
    ############################################
    #             END OF YOUR CODE             #
    ############################################

score = f1Score(matrix)
for i, j in enumerate(score):
    print("f1 score of", i, "is", j)

### Visualize theta

It is interesting to plot part of $\theta$, because you can visualize the learned template for each class.

One of the benefits of a simple model like softmax regression is that we can visualize the weights $\theta$ for each of the classes and see what it prefers. Here we look at the weight image of each class.

plt.figure(figsize=(20, 20))
num_classes = 10

for c in range(num_classes):
    f = plt.subplot(10, num_classes, 1 * num_classes + c + 1)
    f.axis('off')
    plt.imshow(np.reshape(opt_model[:, c], [28, 28]))
plt.show()

## Appendix

The numerical range of the floating-point numbers used by NumPy is limited. For $float64$, the maximal representable number is on the order of $10^{308}$. Exponentiation in the softmax function makes it easy to overshoot this number, even for fairly modest-sized inputs.

A nice way to avoid this problem is to normalize the inputs so that they are neither too large nor too small, by observing that we can multiply the fraction by an arbitrary constant $C$ in both numerator and denominator:

$\frac{e^{\log(C)}}{e^{\log(C)}}$

So you get the following function:

$P(Y=j\mid X=x_i) = \frac{e^{s_j + \log(C)}}{\sum_{k=1}^{K}e^{s_k + \log(C)}}$

You can choose the value of $\log(C)$ freely, because with $\log(C)=x$ we have $x \in \mathbb{R}$ for all $C \in \mathbb{R}^+$. A common choice is $\log(C) = -\max_k s_k$, so that the largest exponent becomes zero.
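A minimal sketch of this trick with the choice $\log(C) = -\max_k s_k$; a naive `np.exp(1000)` would overflow float64:

```python
import numpy as np

def softmax_stable(s):
    shifted = s - np.max(s)   # log(C) = -max(s): largest exponent is now 0
    e = np.exp(shifted)
    return e / e.sum()

s = np.array([1000.0, 1001.0, 1002.0])   # exp(1000) alone overflows float64
p = softmax_stable(s)
print(p.sum())                           # probabilities still sum to 1
```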

### Preparation

Before we start we need to import the pictures into our notebook. For that purpose we can use the imread() function from matplotlib.image.

file = 'pics/own_test_images/0.png'
temp = mpimg.imread(file)
plt.imshow(temp, cmap='binary', interpolation='bicubic')

### Creating an Own Batch

Since our function imread() works perfectly fine, we can create a whole matrix with all of our examples.

images = []
for i in range(10):
    file = 'pics/own_test_images/' + str(i) + '.png'
    img = mpimg.imread(file)
    img = img[:, :, 0]    # removing 3rd dimension
    img = img.reshape(1, -1)

    if len(images) == 0:
        images = img
    else:
        images = np.append(images, img, axis=0)

print(images.shape)
print(images.dtype)

Let's see whether we were successful by displaying all elements of our matrix.

plt.figure(figsize=(20, 20))
num_classes = 10

for c in range(num_classes):
    f = plt.subplot(10, num_classes, 1 * num_classes + c + 1)
    f.axis('off')
    plt.imshow(np.reshape(images.T[:, c], [28, 28]), cmap='binary', interpolation='bicubic')
plt.show()

### Evaluation

probabilities = softmax(np.dot(images,opt_model_sgd))
prediction = np.argmax(probabilities,axis=1)

for i, j in enumerate(prediction):
    print('Our model predicted:', i, 'as', j)

accuracy = (sum(prediction == np.array(range(10))))/10 * 100
print('Our accuracy was: ', accuracy,'%')

### Normalize our Images

We can see that our matrix $images$ has values between 0 and 1, which could be a reason our model sees every picture as a 5. So we are going to change that by normalizing every pixel of our images so that every value is between 0 and 255.

for i in range(images.shape[0]):
    for j in range(images[i].shape[0]):
        if images[i][j] <= 0.5:
            images[i][j] *= 255
        else:
            images[i][j] = 0

images = np.array(images, dtype=np.uint8)
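The element-wise loop above can also be written as a single vectorized expression; a sketch on a toy array (the pixel values are made up):

```python
import numpy as np

imgs = np.array([[0.2, 0.6, 0.4, 0.9]])   # toy pixel values in [0, 1]

# Scale dark pixels to 0-255 and zero out bright ones, all at once
out = np.where(imgs <= 0.5, imgs * 255, 0).astype(np.uint8)
print(out)   # values: 51, 0, 102, 0
```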

### Evaluation

probabilities = softmax(np.dot(images,opt_model_sgd))
prediction = np.argmax(probabilities,axis=1)

for i, j in enumerate(prediction):
    print('Our model predicted:', i, 'as', j)

accuracy = (sum(prediction == np.array(range(10))))/10 * 100
print('Our accuracy was: ', accuracy,'%')

### Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Image Classification - Softmax Regression
by Benjamin Voigt, Klaus Strohmenger
Based on a work at https://gitlab.com/deep.TEACHING.

### Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Benjamin Voigt, Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.