# Activation Functions (Differentiable Programming)

## Introduction

Ever heard of the vanishing gradient problem? This notebook gives a brief introduction to activation functions and how they relate to it. In the exercises you will train a small yet deep network and visualize the gradients during training.

## Requirements

### Knowledge

• Backpropagation

### Prerequisites

To solve this notebook you should either:

• be familiar with the neural net framework you've been building in the 'Differentiable Programming' course so far. If dp.py is located in the same folder as this notebook, you can access it as a module with import dp
• or be familiar with any other deep learning framework (e.g. PyTorch, which is also introduced in this course) where you can manually access the gradients

### Python Modules

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.decomposition import PCA

import dp
from dp import Model,Node

## Theory

### Recap Backpropagation

Learning of (deep) neural networks relies on the backpropagation algorithm.

Imagine a network with three hidden layers, each consisting of only one neuron:

If the output of our last activation function $a_3$ is not equal to the desired output given by our training data, we adjust the weights $w_{1:3}$ and the biases $b_{1:3}$ with the following rules:

• First, we calculate the error between our output and the true label: $error = cost(a_3, y_{true})$
• Second, we calculate the partial derivatives for the weights $\frac{\partial error}{\partial w_j}$ and the biases $\frac{\partial error}{\partial b_j}$ to find out in which direction we have to adjust them in order to lower the cost (with $\alpha$ the learning rate), e.g.: $w_1 \leftarrow w_1 - \alpha \frac{\partial error}{\partial w_1}$

To calculate $\frac{\partial error}{\partial w_1}$ we use the chain rule:

$\frac{\partial error}{\partial w_1} = \frac{\partial error}{\partial a_3} \cdot \frac{\partial a_3}{\partial a_2} \cdot \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial w_1}$

We see that $\frac{\partial error}{\partial w_1}$ heavily depends on the derivatives of the activation functions.
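To make the chain rule concrete, here is a minimal NumPy sketch of the three-neuron network. It assumes sigmoid activations and a squared-error cost, and the input, weights and biases are hypothetical values chosen only for illustration. The product of the four factors is compared against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, b):
    """Forward pass through three one-neuron layers; returns all activations."""
    a = [x]
    for w_j, b_j in zip(w, b):
        a.append(sigmoid(w_j * a[-1] + b_j))
    return a  # a[0] = x, a[1..3] = layer outputs

def cost(a3, y_true):
    # Assumption: squared error; the notebook leaves the cost unspecified here.
    return 0.5 * (a3 - y_true) ** 2

# Hypothetical values, chosen only for illustration.
x, y_true = 0.5, 1.0
w = np.array([0.8, -1.2, 0.4])
b = np.array([0.1, 0.0, -0.3])

a = forward(x, w, b)
# For a sigmoid neuron a_j = sigmoid(z_j): d a_j / d z_j = a_j * (1 - a_j).
d_error_d_a3 = a[3] - y_true
d_a3_d_a2 = a[3] * (1 - a[3]) * w[2]
d_a2_d_a1 = a[2] * (1 - a[2]) * w[1]
d_a1_d_w1 = a[1] * (1 - a[1]) * x
grad_w1 = d_error_d_a3 * d_a3_d_a2 * d_a2_d_a1 * d_a1_d_w1

# Central finite difference as an independent check.
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0] += eps
w_minus[0] -= eps
numeric = (cost(forward(x, w_plus, b)[3], y_true)
           - cost(forward(x, w_minus, b)[3], y_true)) / (2 * eps)

print(grad_w1, numeric)
```

Note that every factor in the middle of the chain contains both an activation derivative and a weight, so small activation derivatives shrink the whole product.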

### Activation functions

Non-linear functions are a core component of neural networks. Only by applying a non-linear function (e.g. $sigmoid$, $tanh$, $relu$) to a linear combination of the features do we obtain another (and hopefully linearly separable) representation of the features.

#### Sigmoid

For a long time, the sigmoid was used not only in the last layer for classification, but also as the activation function in the hidden layers. Let us have a look at it and its derivative:

$sigmoid(z) = \frac{1}{1 + exp(-z)}$

and

$\frac{\partial sigmoid(z)}{\partial z} = sigmoid(z) \cdot (1- sigmoid(z))$

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10,+10,100)
plt.plot(z, sigmoid(z), linewidth=3, label='sigmoid(z)')
plt.plot(z, sigmoid(z)*(1-sigmoid(z)), linestyle='--', linewidth=2, label='derivative sigmoid(z)')
plt.legend()

As we can see, $\frac{\partial sigmoid(z)}{\partial z} \in ]0, \frac{1}{4}]$: the derivative is at most $\frac{1}{4}$. Each of the three layers in our example network contributes one such factor. So, if we ignore the weights and the input, which also appear in the chain rule, the highest value we can get is:

$\frac{\partial error}{\partial w_1} = \frac{\partial error}{\partial a_3} \cdot \frac{1}{4} \cdot \frac{1}{4} \cdot \frac{1}{4}$

In short: the deeper the network, the more the error vanishes as we propagate it back to the first layers.
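As a quick sanity check of this statement, the following sketch prints the upper bound $(\frac{1}{4})^n$ on the product of $n$ sigmoid derivatives. The chosen depths are arbitrary examples, and the weights are again ignored.

```python
# Upper bound on the product of n sigmoid derivatives: (1/4)^n.
max_sigmoid_grad = 0.25  # maximum of sigmoid'(z), reached at z = 0

bounds = {n: max_sigmoid_grad ** n for n in (1, 3, 5, 10)}
for n, bound in bounds.items():
    print(f"{n:2d} sigmoid layers: factor <= {bound:.2e}")
```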

#### Hyperbolic Tangent (tanh)

In 1998, $tanh(z)$ was proposed as superior to $sigmoid(z)$ [LEC98]. Let us have a look at it and its derivative:

$tanh(z) = \frac{exp(z) - exp(-z)}{exp(z) + exp(-z)}$

and

$\frac{\partial tanh(z)}{\partial z} = 1 - tanh(z)^2$

z = np.linspace(-10,+10,100)
plt.plot(z, np.tanh(z), linewidth=3, label='tanh')
plt.plot(z, 1-np.tanh(z)**2, linestyle='--', linewidth=2, label='derivative tanh')
plt.legend()

Here the derivative lies in the range $]0, 1]$ and reaches $1$ at $z = 0$. So if we are lucky, our error does not vanish when we propagate it back to the first layers.
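The two ranges can be checked numerically. The sketch below evaluates both derivatives on a grid (the grid itself is an arbitrary choice) and also prints the tanh derivative at $|z| = 3$ to show how quickly it drops away from $0$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 2001)  # odd number of points so that z = 0 is on the grid
d_sigmoid = sigmoid(z) * (1 - sigmoid(z))
d_tanh = 1 - np.tanh(z) ** 2
d_tanh_at_3 = 1 - np.tanh(3.0) ** 2

print("max sigmoid':", d_sigmoid.max())
print("max tanh':   ", d_tanh.max())
print("tanh' at |z| = 3:", d_tanh_at_3)
```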

#### Rectified Linear Units (ReLU)

Despite being superior to the $sigmoid$ function, $tanh$ still shares another problem with it: if $z$ is not very close to $0$, the function saturates and the derivative becomes almost $0$. To counter this problem, the $ReLU$ function can be used:

$ReLU(z) = max(0,z)$

and

$\frac{\partial ReLU(z)}{\partial z} = 0 \text{ if } z < 0; 1 \text{ if } z > 0$

def relu(z):
    return np.maximum(z, np.zeros_like(z))

z = np.linspace(-2,+2,100)
plt.plot(z, relu(z), linewidth=3, label='relu(z)')
plt.plot(z, np.sign(relu(z)), linestyle='--', linewidth=2, label='derivative relu(z)')
plt.legend()

The range where $\frac{\partial ReLU(z)}{\partial z} = 1$ (infinitely many points, all $z > 0$) is a lot bigger than for $\frac{\partial tanh(z)}{\partial z}$ (only the single point $z = 0$).
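To illustrate what saturation means for a deep network, the following sketch multiplies the activation derivatives of 10 layers. It assumes, purely for illustration, that every layer sees the same positive pre-activation $z = 2$; weights are ignored.

```python
import numpy as np

def d_sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

def d_tanh(z):
    return 1 - np.tanh(z) ** 2

def d_relu(z):
    return (z > 0).astype(float)

# Hypothetical setting: the same pre-activation z = 2 in each of 10 layers.
z = np.full(10, 2.0)
products = {
    name: float(np.prod(d(z)))
    for name, d in (("sigmoid", d_sigmoid), ("tanh", d_tanh), ("relu", d_relu))
}
for name, value in products.items():
    print(f"{name:8s} product of 10 derivatives: {value:.3e}")
```

For positive inputs the ReLU factors are all exactly $1$, whereas both saturating functions shrink the product by many orders of magnitude.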

#### Problems to Note when using ReLU

ReLU can result in dead neurons: if the input $z$ of a neuron is negative, the partial derivative is $0$ and no error is propagated through it. This can be a problem especially when using only a few neurons in a layer. On the other hand, one can argue that these dead neurons enforce sparsity of the network and lead to less overfitting, as long as they are not dead for every training example.
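A minimal sketch of a neuron that is dead for every training example: the inputs, the weights and the large negative bias are hypothetical values, chosen so that $z < 0$ for all examples.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # hypothetical standardized inputs
w = np.array([0.1, -0.2, 0.05, 0.1])
b = -10.0                            # large negative bias pushes z below 0

z = X @ w + b
a = np.maximum(z, 0.0)               # ReLU output, 0 for every example here
d_relu = (z > 0).astype(float)       # derivative of ReLU with respect to z

# Gradient of the neuron's output with respect to its weights, per example.
grad_w = d_relu[:, None] * X

print("active examples:", int(d_relu.sum()))
print("mean |grad|:", np.abs(grad_w).mean())
```

Because the gradient is zero for all examples, gradient descent can never move this neuron's weights back into the active region.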

Another problem is that you can quickly run into overflow when computing the forward pass, because the output of ReLU is unbounded. This is not a problem for $tanh$ and $sigmoid$, as their outputs always lie in $]-1, +1[$ and $]0, +1[$, respectively.

To tackle overflow, there exist several techniques, such as restricting the output of ReLU to a fixed maximum value, e.g. $2$. This is also called clipping. It is also a good idea to normalize the inputs to have mean $\mu = 0$ and standard deviation $\sigma=1$. The latter is also necessary to avoid saturation when using $tanh$ or $sigmoid$, so it is not a step specific to $ReLU$.
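Both techniques fit in a few lines of NumPy. In this sketch the function name `clipped_relu`, its parameter `max_value` and the synthetic data `X_raw` are hypothetical and only serve as an illustration.

```python
import numpy as np

def clipped_relu(z, max_value=2.0):
    """ReLU whose output is restricted to the interval [0, max_value]."""
    return np.clip(z, 0.0, max_value)

z = np.array([-3.0, 0.5, 1.9, 50.0, 1e6])
clipped = clipped_relu(z)
print(clipped)

# Standardize hypothetical raw features to mean 0 and standard deviation 1.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
print(X_std.mean(axis=0), X_std.std(axis=0))
```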

## Exercises

In order to show the effect of vanishing gradients, we will build a deep yet simple neural network.

### Data

We use the iris dataset, but only its first 100 examples. These contain only classes 0 and 1, so for simplicity we have a binary classification problem.

### Load iris dataset
iris = datasets.load_iris()

### 0-100 only contains class 0 and 1 for simplicity
X = iris.data[:100]
y = iris.target[:100]

### Feature scaling
X = (X - X.mean(axis=0)) / X.std(axis=0)

### The Model Class

Below you can see a starting point for the implementation of a network class with the following limitations:

• only 1 hidden layer
• hidden layer has fixed size of 10 neurons
• fixed activation function $tanh$ for the hidden layer

For now, leave this part as it is and complete the rest of the notebook first, so you can check if you successfully solved the other exercises.

After everything runs fine with this static network, come back here to enhance the model class.

class Net(Model):
    def __init__(self, n_features, n_hidden_neurons, n_layers, act_func):
        super(Net, self).__init__()
        self.hidden = self.Linear_Layer(n_features, 10, "h0")
        self.out = self.Linear_Layer(10, 1, "h1")

    def loss(self, x, y):
        if not type(y) == Node:
            y = Node(y)
        out = self.forward(x)
        loss = -1 * (y * out.log() + (1 - y) * (1 - out).log())
        return loss.sum()

    def forward(self, x):
        if not type(x) == Node:
            x = Node(x)
        x = self.hidden(x).tanh()
        x = self.out(x).sigmoid()
        return x
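The `loss` method above is the summed binary cross-entropy. As a framework-independent illustration, here is a plain NumPy sketch of the same formula. The function name `binary_cross_entropy` and the clipping by `eps` are additions made for this sketch and are not part of dp.py.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Summed binary cross-entropy; eps guards against log(0) (added safeguard)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.sum(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident_right = np.array([0.99, 0.01, 0.99, 0.01])
confident_wrong = np.array([0.01, 0.99, 0.01, 0.99])

loss_right = binary_cross_entropy(y_true, confident_right)
loss_wrong = binary_cross_entropy(y_true, confident_wrong)
print(loss_right, loss_wrong)
```

Confident correct predictions give a loss close to $0$, while confident wrong predictions are penalized heavily.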

### The Optimizer Class

Below you can see the Optimizer and SGD classes from dp.py. We copied them here because we are going to modify them later to save the gradients, so we can access them after the training.

class Optimizer(object):

    def __init__(self, model, x_train=None, y_train=None, hyperparam=dict(), batch_size=128):
        self.model = model
        self.x_train = x_train
        self.y_train = y_train
        self.batch_size=batch_size
        self.hyperparam = hyperparam
        self._set_param()
        self.grad_stores = [] # list of dicts for momentum, etc.

    def random_batch(self):
        n = self.x_train.shape[0]
        indices = np.random.randint(0, n, size=self.batch_size)
        return Node(self.x_train[indices]), Node(self.y_train[indices])

    def train(self, steps=1000, print_each=100):
        raise NotImplementedError()

        model = self.model
        x, y = self.random_batch()

        param = model.get_param()
        print("iteration\tloss")
        for i in range(1, steps+1):
            x, y = self.random_batch()
            if i%print_each==0 or i==1:
                print(i, "\t",loss.value[0,0])

        model.set_param(param)

        return loss.value

class SGD(Optimizer):

    def __init__(self, model, x_train=None, y_train=None, hyperparam=dict(), batch_size=128):
        super(SGD, self).__init__(model, x_train, y_train, hyperparam, batch_size)

    def _set_param(self):
        self.alpha = self.hyperparam.get("alpha", 0.001)

    def _update(self, param, grad, g, i):

    def train(self, steps=1000, print_each=100):
        return self._train(steps, num_grad_stores=1, print_each=print_each)
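To illustrate the idea behind stochastic gradient descent independently of dp.py, here is a minimal NumPy sketch for logistic regression. Everything in it is an assumption made for illustration: the function name `sgd_logistic_regression`, the toy data, the learning rate and the use of the mean instead of the summed loss. Mini-batches are sampled with replacement, as in `random_batch` above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, alpha=0.1, batch_size=32, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        # Sample a mini-batch with replacement.
        idx = rng.integers(0, X.shape[0], size=batch_size)
        x_batch, y_batch = X[idx], y[idx]
        p = sigmoid(x_batch @ w + b)
        # Gradient of the mean cross-entropy loss with respect to w and b.
        grad_w = x_batch.T @ (p - y_batch) / batch_size
        grad_b = np.mean(p - y_batch)
        # Plain SGD step: parameter <- parameter - alpha * gradient.
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

# Hypothetical, well separated toy data with two classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b = sgd_logistic_regression(X, y)
accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print("accuracy:", accuracy)
```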

### Training

Now start the training. Note that we already pass the arguments n_hidden_layers, n_hidden_neurons and act_func to the Net class when initializing our network, although they have no effect yet, as the number of layers, the number of neurons and the activation function inside Net are hard-coded for now.

The training should succeed and yield 100 % accuracy after 100 epochs.

n_epochs = 100
n_hidden_layers = 1 # number of hidden layers
n_hidden_neurons = 1 # number of neurons per hidden layer
n_features = X.shape[1]
act_func = 'sigmoid'

grad_w = np.zeros(shape=(n_hidden_layers+1, n_epochs)) ### n_hidden_layers + 1 (for output)
net = Net(n_features, n_hidden_neurons, n_hidden_layers, act_func)

optimiser = SGD(
    net,
    x_train=X,
    y_train=y
)
optimiser.train(steps=n_epochs, print_each=n_epochs//10);

y_pred = net.forward(X).value

acc = (len(y) - np.sum(np.abs(y - y_pred.round().flatten()))) / len(y)
print('accuracy on the training data after training: ', acc)
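The accuracy expression above counts the mismatches between the labels and the rounded predictions. The following sketch, with hypothetical sigmoid outputs, shows that for labels in {0, 1} it is equivalent to the mean of the element-wise comparison.

```python
import numpy as np

y = np.array([0, 0, 1, 1, 1])
# Hypothetical sigmoid outputs of shape (n, 1), as returned by a forward pass.
y_pred = np.array([[0.1], [0.7], [0.8], [0.4], [0.9]])

labels = y_pred.round().flatten()
acc_notebook = (len(y) - np.sum(np.abs(y - labels))) / len(y)
acc_mean = np.mean(labels == y)
print(acc_notebook, acc_mean)
```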

### Exercise - Modify the Optimizer Class

Modify the Optimizer and/or SGD class so that the mean of the absolute values of the partial derivatives of all weights in one layer is saved in the variable grad_w (defined in the cell above):

• The first dimension of grad_w corresponds to the layer-number
• The second dimension corresponds to the epoch

Hint:

If a quick and dirty solution is ok for you, directly access grad_w as a global variable inside Optimizer._train.

With the new implementation in place, execute the training cell again to populate the log of gradients grad_w.
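The bookkeeping itself can be sketched without dp.py. In the snippet below, random arrays stand in for the per-layer weight gradients of one training step. How the real gradients are obtained from the framework is not shown here, so `layer_grads` and the layer shapes are purely hypothetical.

```python
import numpy as np

n_layers, n_epochs = 2, 3
grad_w = np.zeros(shape=(n_layers, n_epochs))

rng = np.random.default_rng(0)
for epoch in range(n_epochs):
    # Hypothetical stand-in for the weight gradients of each layer in this step.
    layer_grads = [rng.normal(size=(4, 10)), rng.normal(size=(10, 1))]
    for layer, grad in enumerate(layer_grads):
        # One number per layer and epoch: mean of the absolute gradient values.
        grad_w[layer, epoch] = np.mean(np.abs(grad))

print(grad_w)
```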

If everything is correct, executing the cell below should plot 2 graphs like the following picture:

n_epochs = 100
n_hidden_layers = 1 # number of hidden layers
n_hidden_neurons = 1 # number of neurons per hidden layer
n_features = X.shape[1]
act_func = 'sigmoid'

grad_w = np.zeros(shape=(n_hidden_layers+1, n_epochs)) ### n_hidden_layers + 1 (for output)
net = Net(n_features, n_hidden_neurons, n_hidden_layers, act_func)

optimiser = SGD(
    net,
    x_train=X,
    y_train=y
)
optimiser.train(steps=n_epochs, print_each=n_epochs//10);

y_pred = net.forward(X).value

acc = (len(y) - np.sum(np.abs(y - y_pred.round().flatten()))) / len(y)
print('accuracy on the training data after training: ', acc)
fig, axs = plt.subplots(nrows=n_hidden_layers+1, ncols=1, figsize=(16,16))
for i in range(n_hidden_layers):
    axs[i].set_title('absolute mean gradients of hidden layer number [{}]'.format(i))
    axs[i].plot(np.linspace(0,n_epochs-0, n_epochs-0), grad_w[i,0:])
axs[-1].set_title('absolute mean gradients of the output layer')
axs[-1].plot(np.linspace(0,n_epochs-0, n_epochs-0), grad_w[-1,0:])

### Exercise - Modify the Net Class

Modify the Net class:

• Your class should be able to produce nets with different numbers of hidden layers according to the parameter n_layers.
• The number of neurons in each layer should depend on the parameter n_hidden_neurons.
• Depending on the string-parameter act_func, your network should either use tanh, sigmoid or relu.

• Remember that the activation function of the very last layer should still always be sigmoid

When you are finished with the changes, try networks with more layers. For example, you can expect a network with 10 layers each using the sigmoid activation function to not learn at all. Plotting the gradients should show vanishing gradients towards the first layers like in the following plot:

Sample Plot:

Plot showing the mean absolute gradients of each layer, here with sigmoid as the activation function and 10 layers. Note the y-axis scale:
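The effect can also be reproduced outside the framework. The following plain NumPy sketch computes the mean absolute weight gradient per layer for a single backward pass through an untrained network. It does not use dp.py, and the names `ACTIVATIONS` and `mean_abs_gradients`, the weight initialization, the omitted biases and the random data are all assumptions made for illustration.

```python
import numpy as np

ACTIVATIONS = {
    # name: (function, derivative expressed in terms of the activation output a)
    "sigmoid": (lambda z: 1.0 / (1.0 + np.exp(-z)), lambda a: a * (1.0 - a)),
    "tanh": (np.tanh, lambda a: 1.0 - a ** 2),
    "relu": (lambda z: np.maximum(z, 0.0), lambda a: (a > 0).astype(float)),
}

def mean_abs_gradients(X, y, n_layers, n_hidden_neurons, act_func, seed=0):
    """Return mean |dLoss/dW| per layer (hidden layers first, output layer last)."""
    rng = np.random.default_rng(seed)
    act, d_act = ACTIVATIONS[act_func]
    sizes = [X.shape[1]] + [n_hidden_neurons] * n_layers + [1]
    weights = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
               for m, n in zip(sizes[:-1], sizes[1:])]

    # Forward pass, keeping every layer's output (biases omitted for brevity).
    activations = [X]
    for W in weights[:-1]:
        activations.append(act(activations[-1] @ W))
    out = 1.0 / (1.0 + np.exp(-(activations[-1] @ weights[-1])))  # sigmoid output

    # Backward pass; for sigmoid + cross-entropy, dLoss/dz_out = out - y.
    delta = out - y.reshape(-1, 1)
    grads = []
    for i in range(len(weights) - 1, -1, -1):
        grads.append(np.mean(np.abs(activations[i].T @ delta)))
        if i > 0:
            delta = (delta @ weights[i].T) * d_act(activations[i])
    return grads[::-1]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))        # hypothetical standardized features
y = (X[:, 0] > 0).astype(float)      # hypothetical binary labels

results = {
    name: mean_abs_gradients(X, y, n_layers=10, n_hidden_neurons=10, act_func=name)
    for name in ("sigmoid", "tanh", "relu")
}
for name, g in results.items():
    print(f"{name:8s} first layer: {g[0]:.2e}  output layer: {g[-1]:.2e}")
```

With 10 sigmoid layers, the gradient of the first layer should come out several orders of magnitude smaller than that of the output layer, which is the same pattern the plot is meant to show.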

### Freestyle Exercise

Experiment with the number of layers and the activation functions, and see whether your results support the statements made in the theory section of this notebook.

## Literature

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Activation Functions (Differentiable Programming)
by Klaus Strohmenger