# Activation Functions (Differentiable Programming)

## Introduction

Ever heard about the *vanishing gradient* problem? This notebook gives a brief introduction to activation functions and how they are connected with it. During the exercises you will train a small, yet deep network and visualize the gradients during training.

## Requirements

### Knowledge

- Gradient Descent
- Backpropagation

### Prerequisites

To solve this notebook you should either:

- be familiar with the neural net framework you've been building in the 'Differentiable Programming' course so far. If `dp.py` is located in the same folder as this notebook, you can access it as a module with `import dp`.
- or be familiar with any other deep learning framework (e.g. PyTorch, which is also introduced in this course) where you can manually access the gradients.

### Python Modules

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import dp
from dp import Model,Node
```

## Theory

### Recap Backpropagation

Learning of (deep) neural networks relies on the backpropagation algorithm.

Imagine a network with three hidden layers, each consisting of only one neuron:

If the output of our last activation function $ a_3 $ is not equal to the desired output given by our training data, we adjust the weights $ w_{1:3} $ and the biases $ b_{1:3} $ with the following rules:

- First, we calculate the error between our output and the true label: $ error = cost(a_3, y_{true}) $
- Second, we calculate the partial derivatives with respect to the weights $ \frac{\partial error}{\partial w_j} $ and the biases $ \frac{\partial error}{\partial b_j} $ to find out in which direction we have to adjust them in order to lower the cost (with $ \alpha $ the learning rate), e.g.: $ w_1 \leftarrow w_1 - \alpha \frac{\partial error}{\partial w_1} $

To calculate $ \frac{\partial error}{\partial w_1} $ we use the chain rule:

$ \frac{\partial error}{\partial w_1} = \frac{\partial error}{\partial a_3} \cdot \frac{\partial a_3}{\partial a_2} \cdot \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial w_1} $

We see that $ \frac{\partial error}{\partial w_1} $ depends heavily on the derivatives of the activation functions.
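The chain rule above can be checked numerically. The following is a minimal sketch, assuming sigmoid activations in all three layers and a squared-error cost; the input, weights and biases are hypothetical example values, not taken from the notebook. It multiplies the four factors of the chain rule and compares the result with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, b):
    """Forward pass through three one-neuron sigmoid layers; returns [a_1, a_2, a_3]."""
    a = x
    activations = []
    for w_j, b_j in zip(w, b):
        a = sigmoid(w_j * a + b_j)
        activations.append(a)
    return activations

# Hypothetical example values (assumptions for illustration only).
x, y_true = 0.5, 1.0
w = np.array([0.8, -1.2, 0.6])
b = np.array([0.1, 0.0, -0.3])

a1, a2, a3 = forward(x, w, b)

# The four factors of the chain rule, with the assumed cost
# error = 0.5 * (a_3 - y_true)^2.
d_error_d_a3 = a3 - y_true
d_a3_d_a2 = a3 * (1 - a3) * w[2]
d_a2_d_a1 = a2 * (1 - a2) * w[1]
d_a1_d_w1 = a1 * (1 - a1) * x
grad_chain = d_error_d_a3 * d_a3_d_a2 * d_a2_d_a1 * d_a1_d_w1

def error_for_w1(w1):
    """Cost as a function of w_1 only, all other parameters fixed."""
    w_mod = w.copy()
    w_mod[0] = w1
    return 0.5 * (forward(x, w_mod, b)[-1] - y_true) ** 2

# Central finite difference as an independent check.
eps = 1e-6
grad_numeric = (error_for_w1(w[0] + eps) - error_for_w1(w[0] - eps)) / (2 * eps)

print("chain rule:       ", grad_chain)
print("finite difference:", grad_numeric)
```

Note that every factor between $ a_3 $ and $ w_1 $ contains a derivative of the activation function, which is what the following sections look at.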

### Activation functions

Non-linear functions are one of the core components of neural networks. Only by applying a non-linear function (e.g. $ sigmoid $, $ tanh $, $ relu $) to a linear combination of the features do we obtain another (and hopefully linearly separable) representation of the features.
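To illustrate why the non-linearity is needed, here is a small sketch with randomly generated matrices (hypothetical values, biases omitted for brevity). Two linear layers without an activation function collapse into a single linear layer, while inserting $ tanh $ between them does not.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))    # 5 examples with 4 features (made-up data)
W1 = rng.normal(size=(4, 3))   # weights of a first linear layer
W2 = rng.normal(size=(3, 2))   # weights of a second linear layer

# Two stacked linear layers ...
two_linear_layers = (x @ W1) @ W2
# ... are equivalent to one linear layer with weights W1 @ W2.
one_linear_layer = x @ (W1 @ W2)
collapses = np.allclose(two_linear_layers, one_linear_layer)

# With a non-linearity in between, the result is no longer the same linear map.
with_tanh = np.tanh(x @ W1) @ W2
differs = not np.allclose(with_tanh, one_linear_layer)

print("linear layers collapse into one:", collapses)
print("tanh in between changes the map:", differs)
```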

#### Sigmoid

For a long time, the sigmoid was used, not only in the last layer for classification, but also in the hidden layers as activation function. Let us have a look at it and its derivative:

$ sigmoid(z) = \frac{1}{1 + exp(-z)} $

and

$ \frac{\partial sigmoid(z)}{\partial z} = sigmoid(z) \cdot (1- sigmoid(z)) $

```
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10,+10,100)
plt.plot(z, sigmoid(z), linewidth=3, label='sigmoid(z)')
plt.plot(z, sigmoid(z)*(1-sigmoid(z)), linestyle='--', linewidth=2, label='derivative sigmoid(z)')
plt.legend()
```

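As we can see, $ \frac{\partial sigmoid(z)}{\partial z} \in ]0, \frac{1}{4}] $. The maximum it can be is $ \frac{1}{4} $, which is reached at $ z = 0 $. So for our example network, if we set the weight and input factors aside (i.e. treat them as $ 1 $), the highest value we can get is: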

$ \frac{\partial error}{\partial w_1} = \frac{\partial error}{\partial a_3} \cdot \frac{1}{4} \cdot \frac{1}{4} \cdot \frac{1}{4} $

In short: the deeper the network, the more the error vanishes as we propagate it back to the first layers.
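As a rough illustration of how fast this bound shrinks with depth, the sketch below raises the maximum of the sigmoid derivative to the power of the number of layers. The helper names are hypothetical, and weights are again ignored, so this is only the contribution of the activation derivatives.

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def sigmoid_chain_bound(n_layers):
    """Largest possible product of n_layers sigmoid derivatives (maximum is at z = 0)."""
    return sigmoid_derivative(0.0) ** n_layers

for n_layers in (1, 3, 5, 10):
    print(f"{n_layers:2d} sigmoid layers: factor <= {sigmoid_chain_bound(n_layers):.2e}")
```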

#### Hyperbolic Tangent (tanh)

In 1998, $ tanh(z) $ was proposed as superior to $ sigmoid(z) $ [LEC98]. Let us have a look at it and its derivative:

$ tanh(z) = \frac{exp(z) - exp(-z)}{exp(z) + exp(-z)} $

and

$ \frac{\partial tanh(z)}{\partial z} = 1 - tanh(z)^2 $

```
z = np.linspace(-10,+10,100)
plt.plot(z, np.tanh(z), linewidth=3, label='tanh')
plt.plot(z, 1-np.tanh(z)**2, linestyle='--', linewidth=2, label='derivative tanh')
plt.legend()
```

Here our derivative is in the range $ ]0, 1] $, with the maximum of $ 1 $ reached only at $ z = 0 $. So if we are lucky, our error does not vanish when we propagate it back to the first layers.

#### Rectified Linear Units (ReLU)

Despite being superior to the $ sigmoid $ function, $ tanh $ still shares another problem with it: if we are not very close to $ 0 $, the function saturates and the derivatives become almost $ 0 $. To counter this problem, the $ ReLU $ function can be used:

$ ReLU(z) = max(0,z) $

and

$ \frac{\partial ReLU(z)}{\partial z} = 0 \text{ if } z < 0; 1 \text{ if } z > 0 $

```
def relu(z):
    return np.maximum(z, np.zeros_like(z))

z = np.linspace(-2,+2,100)
plt.plot(z, relu(z), linewidth=3, label='relu(z)')
plt.plot(z, np.sign(relu(z)), linestyle='--', linewidth=2, label='derivative relu(z)')
plt.legend()
```

The range where $ \frac{\partial ReLU(z)}{\partial z} = 1 $ (infinitely many points, namely all $ z > 0 $) is a lot bigger than for $ \frac{\partial tanh(z)}{\partial z} $ (only at the single point $ z = 0 $).
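To compare the three activation functions side by side, the following sketch multiplies their derivatives along a chain of three layers. The pre-activation values `z` are hypothetical, weights are ignored, and the ReLU derivative at exactly $ z = 0 $ is set to $ 0 $ by convention here.

```python
import numpy as np

def d_sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    # Derivative of max(0, z); the value at z == 0 is a convention (0 here).
    return (z > 0).astype(float)

# Hypothetical positive pre-activations of three consecutive layers.
z = np.array([0.5, 1.0, 2.0])

derivatives = {"sigmoid": d_sigmoid, "tanh": d_tanh, "relu": d_relu}
products = {name: float(np.prod(d(z))) for name, d in derivatives.items()}

for name, value in products.items():
    print(f"product of {name} derivatives: {value:.4f}")
```

For these positive inputs the ReLU product stays at $ 1 $, while the products for $ tanh $ and $ sigmoid $ shrink with every layer.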

#### Problems to Note when using ReLU

ReLU can result in dead neurons when the partial derivative is $ 0 $. This can especially be a problem when using only a few neurons in a layer. On the other hand, one can argue that these dead neurons enforce sparsity of the network and lead to less overfitting, as long as they are not dead for every training example.

Another problem is that you can quickly run into overflow when computing the forward pass, because the output of ReLU is unbounded for positive inputs. This is not a problem for $ tanh $ and $ sigmoid $, as their outputs always lie in $ ]-1, +1[ $ and $ ]0, +1[ $, respectively.

To tackle overflow there exist several techniques, like restricting the output of ReLU to a fixed maximum value, e.g. $ 2 $. This is also called *clipping*. It is also a good idea to normalize the inputs to have mean $ \mu = 0 $ and standard deviation $ \sigma = 1 $. The latter is also necessary to avoid saturation when using $ tanh $ or $ sigmoid $, so it is not a step exclusive to $ ReLU $.
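Both techniques can be sketched in a few lines of NumPy. The function names `clipped_relu` and `standardize` and the parameter `max_value` are hypothetical names chosen for this illustration, and the input data is randomly generated.

```python
import numpy as np

def clipped_relu(z, max_value=2.0):
    """ReLU whose output is restricted to the interval [0, max_value]."""
    return np.clip(z, 0.0, max_value)

def standardize(X):
    """Scale every feature (column) to mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Clipping: negative inputs become 0, large inputs are capped at max_value.
clipped = clipped_relu(np.array([-1.0, 0.5, 5.0]))
print(clipped)

# Standardization on made-up data with a large mean and spread.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_std = standardize(X_raw)
print("mean per feature:", X_std.mean(axis=0))
print("std per feature: ", X_std.std(axis=0))
```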

## Exercises

In order to show the effect of vanishing gradient, we will build a deep, yet simple neural network.

### Data

We use the iris dataset, but only its first 100 examples. These contain only the classes 0 and 1, so for simplicity we have a binary classification problem.

```
### Load iris dataset
iris = datasets.load_iris()
### 0-100 only contains class 0 and 1 for simplicity
X = iris.data[:100]
y = iris.target[:100]
### Feature scaling
X = (X - X.mean(axis=0)) / X.std(axis=0)
```

### The Model Class

Below you can see a starting point for an implementation for a network class with the following limitations:

- only 1 hidden layer
- hidden layer has fixed size of 10 neurons
- fixed activation function $ tanh $ for the hidden layer

For now, leave this part as it is and complete the rest of the notebook first, so you can check if you successfully solved the other exercises.

After everything runs fine with this static network, come back here to enhance the model class.

```
class Net(Model):
    def __init__(self, n_features, n_hidden_neurons, n_layers, act_func):
        super(Net, self).__init__()
        # Static starting point: one hidden layer with 10 neurons.
        self.hidden = self.Linear_Layer(n_features, 10, "h0")
        self.out = self.Linear_Layer(10, 1, "h1")

    def loss(self, x, y):
        if not isinstance(y, Node):
            y = Node(y)
        out = self.forward(x)
        # Binary cross-entropy, summed over the batch.
        loss = -1 * (y * out.log() + (1 - y) * (1 - out).log())
        return loss.sum()

    def forward(self, x):
        if not isinstance(x, Node):
            x = Node(x)
        x = self.hidden(x).tanh()
        x = self.out(x).sigmoid()
        return x
```

### The Optimizer Class

Below you can see the `Optimizer` and `SGD` classes from `dp.py`. We copied them here because we are going to modify them later to save the gradients, so we can access them after the training.

```
class Optimizer(object):
    def __init__(self, model, x_train=None, y_train=None, hyperparam=dict(), batch_size=128):
        self.model = model
        self.x_train = x_train
        self.y_train = y_train
        self.batch_size=batch_size
        self.hyperparam = hyperparam
        self._set_param()
        self.grad_stores = [] # list of dicts for momentum, etc.

    def random_batch(self):
        n = self.x_train.shape[0]
        indices = np.random.randint(0, n, size=self.batch_size)
        return Node(self.x_train[indices]), Node(self.y_train[indices])

    def train(self, steps=1000, print_each=100):
        raise NotImplementedError()

    def _train(self, steps=1000, num_grad_stores=0, print_each=100):
        assert num_grad_stores in (0,1,2)
        model = self.model
        if num_grad_stores>0:
            x, y = self.random_batch()
            grad, loss = model.get_grad(x, y)
            self.grad_stores = [dict() for _ in range(num_grad_stores)]
            for grad_store in self.grad_stores:
                for g in grad:
                    grad_store[g] = np.zeros_like(grad[g])
        param = model.get_param()
        print("iteration\tloss")
        for i in range(1, steps+1):
            x, y = self.random_batch()
            grad, loss = model.get_grad(x, y)
            if i%print_each==0 or i==1:
                print(i, "\t",loss.value[0,0])
            for g in grad:
                self._update(param, grad, g, i)
            model.set_param(param)
        return loss.value


class SGD(Optimizer):
    def __init__(self, model, x_train=None, y_train=None, hyperparam=dict(), batch_size=128):
        super(SGD, self).__init__(model, x_train, y_train, hyperparam, batch_size)

    def _set_param(self):
        self.alpha = self.hyperparam.get("alpha", 0.001)

    def _update(self, param, grad, g, i):
        param[g] -= self.alpha * grad[g]

    def train(self, steps=1000, print_each=100):
        return self._train(steps, num_grad_stores=1, print_each=print_each)
```

### Training

Now start the training. Note that we already pass the arguments `n_hidden_layers`, `n_hidden_neurons` and `act_func` to the `Net` class when initializing our network, although they have no effect yet, as the number of layers, the number of neurons and the type of activation function inside `Net` are statically coded for now.

The training should succeed and yield 100 % accuracy after 100 epochs.

```
n_epochs = 100
n_hidden_layers = 1 # number of hidden layers
n_hidden_neurons = 1 # number of neurons per hidden layer
n_features = X.shape[1]
act_func = 'sigmoid'
grad_w = np.zeros(shape=(n_hidden_layers+1, n_epochs)) ### n_hidden_layers + 1 (for output)
net = Net(n_features, n_hidden_neurons, n_hidden_layers, act_func)
optimiser = SGD(
net,
x_train=X,
y_train=y
)
optimiser.train(steps=n_epochs,print_each=n_epochs//10);
y_pred = net.forward(X).value
acc = (len(y) - np.sum(np.abs(y - y_pred.round().flatten()))) / len(y)
print('accuracy on the training data after training: ', acc)
```

### Exercise - Modify the Optimizer Class

**Task:**

Modify the `Optimizer` and/or `SGD` class so that the **mean** of the **absolute** values of the partial derivatives of all weights in one layer is saved in the variable `grad_w` (defined in the cell above):

- The first dimension of `grad_w` corresponds to the layer number
- The second dimension corresponds to the epoch

**Hint:**

If a quick and dirty solution is ok for you, directly access `grad_w` as a global variable inside `Optimizer._train`.
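The quantity to be logged can be sketched independently of `dp.py`. The example below is only an illustration with hypothetical gradient matrices and a hypothetical log array `grad_log`; how the gradients of each layer are named and accessed inside `Optimizer._train` depends on your framework and is left to the exercise.

```python
import numpy as np

def mean_abs_gradient(grad_matrix):
    """Mean of the absolute values of all partial derivatives in one weight matrix."""
    return float(np.mean(np.abs(grad_matrix)))

# Hypothetical weight gradients of a hidden layer and an output layer
# for a single training step (made-up numbers).
step_grads = [
    np.array([[0.2, -0.4], [0.1, -0.1]]),  # hidden layer weights
    np.array([[-1.0], [3.0]]),             # output layer weights
]

n_layers, n_steps = len(step_grads), 3
grad_log = np.zeros(shape=(n_layers, n_steps))  # same layout as grad_w

step = 0  # index of the current training step
for layer, grad_matrix in enumerate(step_grads):
    grad_log[layer, step] = mean_abs_gradient(grad_matrix)

print(grad_log)
```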

### Plot Gradients

With the new implementation in place, execute the training cell again to populate the log of gradients `grad_w`.

If everything is correct, executing the plotting cell below should produce 2 graphs, one for the hidden layer and one for the output layer:

```
fig, axs = plt.subplots(nrows=n_hidden_layers+1, ncols=1, figsize=(16,16))
epochs = np.arange(n_epochs)
for i in range(n_hidden_layers):
    axs[i].set_title('absolute mean gradients of hidden layer number [{}]'.format(i))
    axs[i].plot(epochs, grad_w[i])
axs[-1].set_title('absolute mean gradients of the output layer')
# The output layer is stored in the last row of grad_w.
axs[-1].plot(epochs, grad_w[-1])
```

### Exercise - Modify the Net Class

**Task:**

Modify the `Net` class:

- Your class should be able to produce nets with different numbers of hidden layers according to the parameter `n_layers`.
- The number of neurons in each layer should depend on the parameter `n_hidden_neurons`.
- Depending on the string parameter `act_func`, your network should use either `tanh`, `sigmoid` or `relu`.
- Remember that the activation function of the very last layer should still always be `sigmoid`.

When you are finished with the changes, try networks with more layers. For example, you can expect a network with 10 layers, each using the `sigmoid` activation function, to not learn at all. Plotting the gradients should show vanishing gradients towards the first layers, like in the following plot:

**Sample Plot:**

Plot showing the absolute gradients mean of each layer. Here with sigmoid as the activation function and 10 layers. Note the y-axis scale:

### Freestyle Exercise

Experiment with the number of layers and the activation functions, and see if your results support the statements made in the theory chapter of this notebook.
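If you want a quick, framework-independent reference point for your experiments, the following sketch computes the mean absolute weight gradient per layer for a single forward and backward pass in plain NumPy. It is not the `dp.py` implementation and makes several assumptions: no biases, randomly initialised weights with standard deviation $ 1 / \sqrt{n_{in}} $, a sigmoid output with cross-entropy loss, and synthetic data. All names are hypothetical.

```python
import numpy as np

# Each entry: (activation function, its derivative expressed via the activation a).
ACTIVATIONS = {
    "sigmoid": (lambda z: 1.0 / (1.0 + np.exp(-z)), lambda a: a * (1.0 - a)),
    "tanh": (np.tanh, lambda a: 1.0 - a ** 2),
    "relu": (lambda z: np.maximum(z, 0.0), lambda a: (a > 0).astype(float)),
}

def layer_gradient_magnitudes(X, y, n_hidden_layers, n_hidden_neurons, act_func, seed=0):
    """Mean |dLoss/dW| per layer (hidden layers first, output layer last)
    for one pass through a randomly initialised network without biases."""
    rng = np.random.default_rng(seed)
    act, d_act = ACTIVATIONS[act_func]
    sizes = [X.shape[1]] + [n_hidden_neurons] * n_hidden_layers + [1]
    weights = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
               for m, n in zip(sizes[:-1], sizes[1:])]

    # Forward pass: hidden layers use act_func, the output layer uses sigmoid.
    activations = [X]
    for W in weights[:-1]:
        activations.append(act(activations[-1] @ W))
    out = 1.0 / (1.0 + np.exp(-(activations[-1] @ weights[-1])))

    # Backward pass: for sigmoid + cross-entropy, dLoss/dz_out = out - y.
    delta = (out - y.reshape(-1, 1)) / len(X)
    magnitudes = []
    for i in range(len(weights) - 1, -1, -1):
        grad_W = activations[i].T @ delta
        magnitudes.append(float(np.mean(np.abs(grad_W))))
        if i > 0:
            delta = (delta @ weights[i].T) * d_act(activations[i])
    return magnitudes[::-1]

# Synthetic binary classification data (made up for this sketch).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] > 0).astype(float)

for name in ("sigmoid", "tanh", "relu"):
    mags = layer_gradient_magnitudes(X_demo, y_demo, 10, 10, name)
    print(f"{name:8s} first layer: {mags[0]:.2e}   output layer: {mags[-1]:.2e}")

sigmoid_mags = layer_gradient_magnitudes(X_demo, y_demo, 10, 10, "sigmoid")
```

Comparing the first and the last entry of the returned list for different depths and activation functions gives a first impression of how strongly the gradients shrink towards the input.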

## Literature

## Licenses

### Notebook License (CC-BY-SA 4.0)

*The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).*

*Activation Functions (Differentiable Programming)*

by Klaus Strohmenger

is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Based on a work at https://gitlab.com/deep.TEACHING.

### Code License (MIT)

*The following license only applies to code cells of the notebook.*

Copyright 2019 Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.