# HTW Berlin - Angewandte Informatik - Advanced Topics - Exercise - Multiclass Logistic Regression (Softmax) with TensorFlow

## Introduction

In this exercise notebook you will implement a multiclass logistic regression model using TensorFlow. To do so, one would normally use TensorFlow's predefined functions for the softmax prediction, the cross-entropy costs and an optimizer based on the gradient descent update algorithm.

Here you will not use any of them, but implement them yourself only using basic TensorFlow functions like tf.matmul, tf.transpose, etc. An exception is the tf.gradients function, which returns the gradient of a function with respect to a variable / list of variables. This gradient can then be used to define the update algorithm.

Besides consolidating your theoretical knowledge of gradient descent, knowing how to use TensorFlow's automatic differentiation is useful whenever you need something that can be computed from a gradient but is not covered by the standard built-ins, e.g. defining your own cost and update function.

In order to detect errors in your own code, execute the notebook cells containing assert or assert_almost_equal. In this notebook, however, these cells will only detect a small portion of possible errors, e.g. your implemented function returning a wrong shape.

## Requirements

### Knowledge

To complete this exercise notebook, you should possess knowledge about the following topics.

• Logistic regression
• Softmax function
• Cross-entropy
• Basic TensorFlow dataflow (see below)

### Python Modules

# External Modules
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from numpy.testing import assert_almost_equal

%matplotlib inline

tf.reset_default_graph()
sess = tf.InteractiveSession()

## Exercise - Multiclass Logistic Regression (Softmax) with TensorFlow

### Training Data

We are given $m$ examples in our training data $\mathcal D = \{(\vec x^{(1)}, y^{(1)}),(\vec x^{(2)},y^{(2)}), \dots (\vec x^{(m)},y^{(m)})\}$, where $\vec x^{(1)}$ denotes the first feature vector and $y^{(1)}$ the corresponding class.

We will create our own training data by drawing samples from different Gaussian distributions, one per class, and our model should be able to generalize from these samples. To make things concrete we will be using:

• two features $\vec x = (x_1, x_2)^T$
• three classes: $y \in \{ 0, 1, 2\}$
• 100 examples for each class

# class 0:
# covariance matrix and mean
cov0 = np.array([[5,-4],[-4,4]])
mean0 = np.array([2.,3])
# number of data points
m0 = 100

# class 1
# covariance matrix and mean
cov1 = np.array([[5,-3],[-3,3]])
mean1 = np.array([0.5,0.5])
m1 = 100

# class 2
# covariance matrix and mean
cov2 = np.array([[2,0],[0,2]])
mean2 = np.array([8.,-5])
m2 = 100

# generate m0 gaussian distributed data points with
# mean0 and cov0.
r0 = np.random.multivariate_normal(mean0, cov0, m0)
r1 = np.random.multivariate_normal(mean1, cov1, m1)
r2 = np.random.multivariate_normal(mean2, cov2, m2)

def plot_data(r0, r1, r2):
    plt.figure(figsize=(7.,7.))
    plt.scatter(r0[...,0], r0[...,1], c='r', marker='o', label="Class 0")
    plt.scatter(r1[...,0], r1[...,1], c='y', marker='o', label="Class 1")
    plt.scatter(r2[...,0], r2[...,1], c='b', marker='o', label="Class 2")
    plt.xlabel("$x_1$")
    plt.ylabel("$x_2$")
    # show the class labels defined above
    plt.legend()

# Let's visualize our training data

plot_data(r0, r1, r2)
X = np.concatenate((r0, r1, r2), axis=0)
X.shape
y = np.concatenate((np.zeros(m0), np.ones(m1), 2 * np.ones(m2)))
y.shape
# shuffle the data
assert X.shape[0] == y.shape[0]
perm = np.random.permutation(np.arange(X.shape[0]))
#print(perm)
X = X[perm]
y = y[perm]

### Implement the Model

Since we have discrete classes rather than continuous values, we have to implement logistic regression (as opposed to linear regression). Logistic regression implies the use of the logistic function. But as the number of classes exceeds two, we have to use its generalized form, the softmax function.

Implement softmax regression. This can be split into three subtasks: 1. Implement the softmax function for prediction. 2. Implement the computation of the cross-entropy loss. 3. Implement vanilla gradient descent.

#### Softmax

Implement the softmax prediction $h_i$, defined for each class $i$ as:

$h_i = \frac{\exp(z_i)}{\sum_{k=1}^c\exp (z_k)}$

with $c$ denoting the number of classes and $z_i$ the net output for class $i$, where the whole vector $\vec z$ is defined as:

$\vec z = W \vec{x} + \vec b$

Hint:

Remember that your functions should be able to handle multiple or even all $\vec x$s.
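The hint can be made concrete with a small NumPy sketch (this is not the TensorFlow solution). It assumes the examples are stacked as rows of a matrix with shape (n_examples, n_features) and that the weight matrix has shape (n_features, n_classes), matching the shapes used for x and W in this notebook. Under that layout, row $j$ of the result equals $W^T \vec x^{(j)} + \vec b$. The names `X_batch`, `W_demo`, and `b_demo` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: 5 examples with 2 features each
X_batch = rng.normal(size=(5, 2))
# Weights with shape (n_features, n_classes) and a bias per class
W_demo = rng.uniform(size=(2, 3))
b_demo = np.zeros(3)

# One matrix product handles all examples at once;
# the bias is broadcast over the rows.
Z = X_batch @ W_demo + b_demo          # shape (5, 3)

# The same computation for a single example, written as W^T x + b
z_single = W_demo.T @ X_batch[0] + b_demo

print(Z.shape)
print(np.allclose(Z[0], z_single))
```

Because the first dimension is never hard-coded, the same expression works for one example, a minibatch, or the whole data set.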

Evaluating softmax should look like:

    in> h.eval(feed_dict={x: X})

    out> array([[1.62411915e-08, 1.70372473e-03, 9.98296261e-01],
                [3.72431863e-08, 3.27572320e-03, 9.96724248e-01],
                [9.83378708e-01, 1.66097078e-02, 1.15793373e-05],
                .....
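For reference, the softmax formula can be sketched in plain NumPy. This is an illustration, not the TensorFlow implementation asked for in the exercise. The helper name `softmax_np` is hypothetical, and subtracting the row maximum before exponentiating is an added assumption for numerical stability: it leaves the result unchanged, because the common factor $\exp(-\max_k z_k)$ cancels in numerator and denominator.

```python
import numpy as np

def softmax_np(z):
    """Row-wise softmax for an array of shape (n_examples, n_classes)."""
    z = np.asarray(z, dtype=np.float64)
    # Shift each row by its maximum so that exp() cannot overflow
    shifted = z - z.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    # Normalize each row so that it sums to 1
    return e / e.sum(axis=1, keepdims=True)

z_demo = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])
h_demo = softmax_np(z_demo)
print(h_demo)
print(h_demo.sum(axis=1))
```

Each row of the output is a probability distribution over the classes, which is exactly what the assert cells further down check for the TensorFlow version.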

### First we define Variables for the weights W and bias b.
### From Docstring:
### "A variable maintains state in the graph across calls to run() ...
### ... constructor requires an initial value ..."
NUM_LABELS = 3
NUM_FEATURES = 2
D_TYPE = tf.float32
I_TYPE = tf.int32
W = tf.Variable(tf.random_uniform([NUM_FEATURES, NUM_LABELS], dtype=D_TYPE))
b = tf.Variable(tf.zeros([NUM_LABELS], dtype=D_TYPE))
### And placeholders for the training data.
### From Docstring:
### "This tensor will produce an error if evaluated. Its value must
### be fed using the feed_dict optional argument to Session.run() ..."
###
### Using None in the first dimension allows us to feed a variable number
### of examples.
x = tf.placeholder(shape=[None, NUM_FEATURES], dtype=D_TYPE, name="features")
t = tf.placeholder(shape=[None], dtype=I_TYPE, name="targets")
### Variables must be initialized by running an init Op after having
### launched the graph.  We first have to add the init Op to the graph.

init_op= tf.global_variables_initializer()
sess.run(init_op)
### Implement this function

def net_output(x, W, b):
    """
    Calculates the net output z = W * x + b for every example (row) of x.

    :x: Input features.
    :x type: 2D-Tensor of type float32 with
        shape (n_examples, n_features).
    :W: Weight matrix.
    :W type: 2D-Tensor of type float32 with
        shape (n_features, n_classes).
    :b: Bias vector.
    :b type: 1D-Tensor of type float32 with
        shape (n_classes).

    :returns: The net output
    :r type: 2D-Tensor of type float32
        with shape (n_examples, n_classes).
    """
    raise NotImplementedError()
### Implement this function

def softmax(z):
    """
    Returns the softmax-normalized predictions for the net outputs z.

    :z: Net outputs.
    :z type: 2D-Tensor of type float32 with
        shape (n_examples, n_classes).

    :returns: softmax prediction.
    :r type: Tensor with same type and shape as z.
    """
    raise NotImplementedError()
z = net_output(x, W, b)
h = softmax(z)

some_predictions = h.eval(feed_dict={x: X[0:2]})
print(some_predictions)

assert_almost_equal(some_predictions[0].sum(), 1.0)
assert_almost_equal(some_predictions[1].sum(), 1.0)

#### Cross-Entropy

Implement the computation of the cross-entropy loss. Don't use any built-in function of TensorFlow for the cross-entropy.

Reminder:

$$\begin{split} H(p, q) & = \sum_{i=1}^c p_i(x) \cdot \log \frac{1}{q_i(x)} \\ & = -\sum_{i=1}^c p_i(x) \cdot \log q_i(x) \end{split}$$

with

• the number of classes $c$
• the correct class distribution $p(x)$
• and the predictions of our net $q(x)$ (softmax output)

Hint:

Return the cross-entropy average: $J(W,b) = \frac{1}{m} \sum_{j=1}^m H\left(p(\vec x^{(j)}),q(\vec x^{(j)})\right)$
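To make the averaged cross-entropy concrete, here is a NumPy sketch (again not the TensorFlow solution). It assumes the targets are integer class indices that are turned into one-hot distributions $p$, and it adds a small constant `eps` inside the logarithm to avoid $\log 0$. Both the helper name `cross_entropy_np` and `eps` are assumptions made for this illustration.

```python
import numpy as np

def cross_entropy_np(targets, predictions, eps=1e-12):
    """Average cross-entropy for integer targets and softmax predictions."""
    targets = np.asarray(targets, dtype=np.int64)
    predictions = np.asarray(predictions, dtype=np.float64)
    n_classes = predictions.shape[1]
    # One-hot encode: row j of the identity matrix selects class targets[j]
    one_hot = np.eye(n_classes)[targets]
    # H(p, q) per example: only the term of the true class is non-zero
    per_example = -np.sum(one_hot * np.log(predictions + eps), axis=1)
    # J: mean over all examples
    return per_example.mean()

q_demo = np.array([[0.5, 0.25, 0.25],
                   [0.1, 0.8, 0.1]])
t_demo = np.array([0, 1])
J_demo = cross_entropy_np(t_demo, q_demo)
print(J_demo)
```

With one-hot targets the sum over classes reduces to $-\log q_y(x)$ for the true class $y$, so here the cost is the mean of $-\log 0.5$ and $-\log 0.8$.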

### Implement this function

def cross_entropy(targets, predictions):
    """
    Computes the cross-entropy average.

    :targets: True classes as scalars.
    :targets type: tf.Tensor with the shape (n_examples).
    :predictions: predictions as softmax output
    :predictions type: tf.Tensor with shape (n_examples, n_classes).

    :returns: cross-entropy average.
    :r type: Tensor of type float32
    """
    raise NotImplementedError()
# t is the tensorflow placeholder for the targets (class labels)
cost = cross_entropy(t, h)

some_cost = cost.eval(feed_dict={x: X, t: y})
print(some_cost)

assert some_cost.dtype == np.float32

Implement gradient descent and train the model:

• Implement the gradient descent update rule. Don't use any built-in TensorFlow optimizer!
• Use tf.gradients for computing the gradient.
• Use tf.assign for updating.
• Iteratively apply the update rule to minimize the loss.
• Train for 100 epochs.
• Use minibatches of size 50.
• Keep track of the cost after each epoch.
• Decide on an appropriate learning rate.

Reminder:

Equation for the update rule:

$$\begin{aligned} W' & = W - \alpha \cdot \frac{\partial}{\partial W} J(W, b)\\ b' & = b - \alpha \cdot \frac{\partial}{\partial b} J(W, b) \end{aligned}$$
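The update rule and the minibatch loop can be illustrated outside TensorFlow. The following NumPy sketch is not the exercise solution: instead of tf.gradients it uses the known closed-form gradient of the averaged cross-entropy for softmax regression, $\frac{\partial J}{\partial W} = \frac{1}{m} X^T (H - P)$ and $\frac{\partial J}{\partial b} = \frac{1}{m} \sum_j (\vec h^{(j)} - \vec p^{(j)})$, where $H$ holds the softmax outputs and $P$ the one-hot targets. The toy clusters (`X_toy`, `y_toy`), the helper names, and the learning rate of 0.1 are assumptions chosen for this illustration only.

```python
import numpy as np

def softmax_np(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cost_and_grads(X_b, y_b, W, b):
    """Average cross-entropy and its gradients w.r.t. W and b."""
    m = X_b.shape[0]
    h = softmax_np(X_b @ W + b)
    one_hot = np.eye(W.shape[1])[y_b]
    cost = -np.mean(np.sum(one_hot * np.log(h + 1e-12), axis=1))
    delta = (h - one_hot) / m              # dJ/dz for every example
    return cost, X_b.T @ delta, delta.sum(axis=0)

rng = np.random.default_rng(0)

# Hypothetical toy data: three well-separated 2D clusters, 50 points each
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
y_toy = np.repeat(np.arange(3), 50)
X_toy = centers[y_toy] + rng.normal(size=(150, 2))

W_np = rng.uniform(size=(2, 3))
b_np = np.zeros(3)
alpha = 0.1            # assumed learning rate for the toy data
batch_size = 50
costs = []

for epoch in range(100):
    # Visit the examples in a new random order every epoch
    perm = rng.permutation(len(X_toy))
    for start in range(0, len(X_toy), batch_size):
        idx = perm[start:start + batch_size]
        _, dW, db = cost_and_grads(X_toy[idx], y_toy[idx], W_np, b_np)
        # Update rule: parameter' = parameter - alpha * gradient
        W_np -= alpha * dW
        b_np -= alpha * db
    # Track the cost on the full data set after each epoch
    epoch_cost, _, _ = cost_and_grads(X_toy, y_toy, W_np, b_np)
    costs.append(epoch_cost)

print(costs[0], costs[-1])
```

In the notebook itself the gradients come from tf.gradients and the updates are applied with tf.assign, but the loop structure (epochs, shuffled minibatches, one cost value per epoch) stays the same.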
### Complete this cell

nb_epochs = 100
minibatch_size = 50
learning_rate = 1337 ### Decide about an appropriate learning rate

cost_per_epoch = []



### Plot

#### Cost (Loss) over Iterations

Plot of the cost progress vs. iterations.

plt.plot(range(len(cost_per_epoch)), cost_per_epoch)
plt.xlabel('# of iterations')
plt.ylabel('cost')
plt.title('Learning Progress')

#### Decision Boundary After Training

The following function plots the data together with the decision boundaries after training. The model should be trained well enough to separate most (roughly 95%) of the data correctly. Use the following code for plotting.

def plot_decision_boundary(iteration=None, x_min=-10, x_max=14, y_min=-10, y_max=10):
    fig = plt.figure(figsize=(8,8))

    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)

    delta = 0.1
    a = np.arange(x_min, x_max+delta, delta)
    b = np.arange(y_min, y_max+delta, delta)
    A, B = np.meshgrid(a, b)

    x_ = np.dstack((A, B)).reshape(-1, 2)

    out = h.eval(feed_dict={x: x_})

    ns = list()
    ns.append(3)
    ns.extend(A.shape)
    out = out.T.reshape(ns)

    plt.pcolor(A, B, out[0], cmap="Blues", alpha=0.2)
    plt.pcolor(A, B, out[1], cmap=('Oranges'), alpha=0.2)
    plt.pcolor(A, B, out[2], cmap=('Greens'), alpha=0.2)
    # let's visualize the data:
    plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)

    plt.title("Decision boundaries in data space.")

plot_decision_boundary()
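Whether roughly 95% of the data is separated correctly can also be checked numerically instead of by eye. A minimal NumPy sketch, assuming the predicted class is the one with the highest softmax probability; the helper name `accuracy_np` and the demo arrays are hypothetical.

```python
import numpy as np

def accuracy_np(probabilities, labels):
    """Fraction of examples whose most probable class equals the label."""
    predicted = np.argmax(probabilities, axis=1)
    return np.mean(predicted == np.asarray(labels))

probs_demo = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.3, 0.4],
                       [0.6, 0.3, 0.1]])
labels_demo = np.array([0, 1, 2, 1])
acc_demo = accuracy_np(probs_demo, labels_demo)
print(acc_demo)
```

In the notebook, the same helper could be applied to the trained model with `accuracy_np(h.eval(feed_dict={x: X}), y)`.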

#### Decision Boundary Before Training

Now we reinitialize our model's variables to visualize how the decision boundaries might have looked before training. Since we initialize our weights with tf.random_uniform, this will look different for every execution.

sess.run(init_op)
plot_decision_boundary()

## Literature

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

HTW Berlin - Angewandte Informatik - Advanced Topics - Exercise - Multiclass Logistic Regression (Softmax) with tensorflow
by Christian Herta, Klaus Strohmenger