# Dropout (Differentiable Programming)

## Introduction

This notebook walks you through an implementation of a regularization technique called dropout. In each forward pass during training, we randomly select units to 'drop out', i.e. temporarily remove them from the network. This forces the surviving units to learn without depending too heavily on the cooperation of other units, so that each unit produces more useful features on its own.

## Requirements

### Knowledge

These are useful resources on the topic, though it's not required to read them entirely before tackling this notebook.

- The original Dropout paper by Srivastava et al. [SRI14]
- This blog post on dropout by Agustinus Kristiadi [KRI16]

### Python Modules

```
import numpy as np
from dp import NeuralNode,Node
from sklearn import datasets,preprocessing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
```

### Data

This cell loads the breast cancer dataset that ships with sklearn, standardizes the features, and splits the data into a training set and a test set.

```
x,y = datasets.load_breast_cancer(return_X_y=True)
x = preprocessing.scale(x)
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
```

## Dropout

Figure 1 from 'Dropout: A Simple Way to Prevent Neural Networks from Overfitting' [SRI14]

The authors of the Dropout paper propose that a good way to reduce overfitting is to average the predictions of many separately trained networks, but this is too computationally expensive to do in practice.

Enter dropout: on the left of the figure, you see a network with all its units and their connections. On the right, the crossed-out units have been dropped from the network along with all their connections. This creates a new, 'thinned' version of the neural net.

When we send training samples through the network in the forward pass, we randomly sample units to drop from the network. So for each sample we train a 'thinned' version of the net. This approximates training and averaging many different neural nets with shared parameters.
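To make the idea of 'thinned' networks concrete, here is a minimal NumPy sketch that samples one keep/drop pattern per training sample. The layer width `n_hidden`, the seed, and the variable names are assumptions made for illustration and are not part of the notebook's `dp` module.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only to make the sketch reproducible

n_hidden = 6      # hypothetical width of one hidden layer
keep_prob = 0.5   # probability that a unit survives a forward pass

# One row per training sample: 1 = unit kept, 0 = unit dropped.
# Every row describes one 'thinned' version of the same layer.
masks = (rng.random((4, n_hidden)) < keep_prob).astype(int)

# A layer with n_hidden units admits 2**n_hidden keep/drop patterns,
# and all of these thinned networks share the same weights.
n_thinned = 2 ** n_hidden
```

Printing `masks` shows that different samples generally see different subsets of units, which is why training with dropout resembles training many weight-sharing networks at once.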

The paper presents the following motivation ([SRI14], p. 1932; p. 4 in the PDF):

"Similarly, each hidden unit in a neural network trained with dropout must learn to work with a randomly chosen sample of other units. This should make each hidden unit more robust and drive it towards creating useful features on its own without relying on other hidden units to correct its mistakes."

The following exercises walk you through an implementation of a dropout layer for a neural net.

## Exercises

### Mask operator

**Task:**

Implement a `mask` operator for the `Node` autodiff class.

**Note:** Remember to implement the partial derivative of the mask operator since it's crucial for backprop. If a unit is killed through dropout, it doesn't contribute anything to the network. So the gradient that flows back into it should be 0.
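The gradient rule from the note can be checked independently of the `Node` class. The following is a plain NumPy sketch, not the notebook's solution: the helper names `mask_forward` and `mask_backward` are hypothetical, and the example mirrors the mask-then-square computation used in the verification cell.

```python
import numpy as np

def mask_forward(x, mask):
    # Element-wise product: entries where mask == 0 are zeroed out.
    return x * mask

def mask_backward(upstream_grad, mask):
    # d(x * mask)/dx = mask, so the chain rule gives upstream_grad * mask.
    # Dropped units therefore receive a gradient of exactly 0.
    return upstream_grad * mask

x = np.arange(1, 11, dtype=float)[None, :]   # numbers 1..10 as a row vector
m = np.array([0, 1] * 5)                     # drop every other unit

masked = mask_forward(x, m)
y = masked ** 2                    # mask, then square
dy_dmasked = 2 * masked            # derivative of the square
dx = mask_backward(dy_dmasked, m)  # gradient with respect to x
```

Here `dx` is zero wherever the mask is zero and equals `2 * x` elsewhere, which is the behaviour the autodiff operator should reproduce.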

```
def mask(self, mask : np.ndarray):
    # TODO: implement the masking and its partial derivative
    raise NotImplementedError()
    return

Node.mask = mask
```

Verify your solution.

```
# mask numbers 1..10 and square
a = Node(np.arange(1,11)[None,:], 'A')
mask = np.array([0, 1] * 5)
b = a.mask(mask).square()
# check gradients
grads = b.grad(np.ones(b.shape))['A']
assert grads[0,2] == 0
assert grads[0,3] == 8
assert grads[0,4] == 0
```

We'll again use the autodiff class to create a model for the breast cancer dataset. The method `linear` adds a linear layer to the network; your task will be to implement a `dropout` layer.

```
class Model():
    # define layers of the model
    def __init__(self):
        self.params = dict()
        self.fc1 = self.linear(30,20,'fc1')
        self.do1 = self.dropout(keep_prob=0.5)
        self.fc2 = self.linear(20,10,'fc2')
        self.do2 = self.dropout(keep_prob=0.5)
        self.fc3 = self.linear(10,1,'fc3')

    # define forward pass
    def forward(self,x,train=True):
        if not type(x) == Node:
            x = Node(x)
        # TODO: implement forward pass
        raise NotImplementedError()
        return out

    # define loss function (binary cross-entropy, summed over the batch)
    def loss(self,x,y,train=True):
        out = self.forward(x,train)
        if not type(y) == Node:
            y = Node(y)
        loss = -1 * (y * out.log() + (1 - y) * (1 - out).log())
        return loss.sum()

    # add a linear layer to the model
    def linear(self, fan_in,fan_out,name):
        W_name, W_value = f'weight_{name}', np.random.randn(fan_in,fan_out)
        b_name, b_value = f'bias_{name}', np.random.randn(1,fan_out)
        self.params[W_name] = W_value
        self.params[b_name] = b_value
        def forward(x):
            return x.dot(Node(self.params[W_name], W_name)) + Node(self.params[b_name], b_name)
        return forward

    # TODO: add dropout method
```

### Forward pass

**Task:**

Implement the forward pass, e.g.

```
x -> linear -> tanh -> dropout
-> linear -> tanh -> dropout
-> linear -> sigmoid
```

**Note:** Mind the `train` parameter, which indicates whether we're forwarding train or test data. On test data, you do not apply dropout.
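The role of the `train` flag can be illustrated without the autodiff class. The sketch below uses plain NumPy arrays in place of `Node`, and the names `forward_sketch`, `sigmoid`, and `weights` are assumptions made for this example. It applies the mask without any rescaling and is not meant as the solution to the task.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_sketch(x, weights, keep_prob=0.5, train=True):
    """x -> (linear -> tanh -> dropout) twice -> linear -> sigmoid."""
    h = x
    for W, b in weights[:-1]:
        h = np.tanh(h @ W + b)
        if train:
            # A fresh mask is sampled on every call; test data skips this step.
            h = h * (rng.random(h.shape) < keep_prob)
    W, b = weights[-1]
    return sigmoid(h @ W + b)

# Hypothetical weights using the layer sizes of Model (30 -> 20 -> 10 -> 1).
sizes = [(30, 20), (20, 10), (10, 1)]
weights = [(rng.standard_normal(s), rng.standard_normal((1, s[1]))) for s in sizes]

x = rng.standard_normal((5, 30))
out_test_a = forward_sketch(x, weights, train=False)
out_test_b = forward_sketch(x, weights, train=False)
out_train = forward_sketch(x, weights, train=True)
```

With `train=False` the output is deterministic, so repeated calls agree; with `train=True` each call sees a different thinned network.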

### Dropout layer

To apply dropout, we multiply the activations of a layer with a boolean/binary matrix of `0`s and `1`s (masking).

The hyperparameter $p$ controls the fraction of units to keep: each unit survives with probability $p$. To make things more explicit, this parameter is also called `keep_prob`.

Each dropout layer can have a different setting for the `keep_prob` parameter $\in [0, 1]$.
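As a rough illustration of how `keep_prob` shapes the mask, the following NumPy sketch draws Bernoulli masks for a few settings and measures the fraction of surviving activations. The helper `sample_mask` and the example values are assumptions for this sketch; the exercise itself should go through the `Node.mask` operator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Strictly positive activations, so that a zero can only come from the mask.
activations = rng.integers(1, 10, size=(100, 100)).astype(float)

def sample_mask(shape, keep_prob):
    # Each entry is 1 with probability keep_prob and 0 otherwise.
    return rng.binomial(1, keep_prob, size=shape)

kept_fraction = {}
for keep_prob in (1.0, 0.8, 0.5):
    dropped = activations * sample_mask(activations.shape, keep_prob)
    kept_fraction[keep_prob] = np.count_nonzero(dropped) / dropped.size
```

The measured fractions in `kept_fraction` land close to the respective `keep_prob`, with `keep_prob=1.0` keeping every activation.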

**Task:**

Implement the `dropout` layer.

```
def dropout(self,keep_prob=0.5):
    # TODO: return a function that applies a random mask to its input
    raise NotImplementedError()

Model.dropout = dropout
```

Verify your implementation.

The first dropout layer has a `keep_prob` of 1.0, so all activations should survive.

The second dropout layer has a `keep_prob` of 0.8, so approximately 80% of the activations should survive.

```
data = Node(np.random.randint(1,10,size=(10,10)))
out0 = dropout(None,keep_prob=1.0)(data)
out1 = dropout(None,keep_prob=0.8)(data)
# all units should survive
assert np.all(out0.value == data.value)
# roughly 80% of units should survive
np.testing.assert_almost_equal(np.count_nonzero(out1.value)/out1.value.size, 0.8, decimal=1)
```

### Expected value

Say we have a dropout layer with a `keep_prob` of 0.8, so only about 80% of the inputs survive. The expected value of the output is then about 80% of that of the input.

At test time, however, we don't apply dropout, so there's a scaling problem: the units receive test data with a greater expected value than the training data they learned on.

To remedy this, the dropout layer applies the dropout mask, then multiplies the values by $\frac{1}{\text{keep\_prob}}$ to correct the expected value. Equivalently, multiply the mask itself by $\frac{1}{\text{keep\_prob}}$.
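The effect of the rescaling can be checked numerically. This is a small NumPy sketch under assumed values (array size, seed, and variable names are illustrative); it compares the mean of the masked activations with and without the $\frac{1}{\text{keep\_prob}}$ factor.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
x = rng.integers(1, 100, size=(200, 200)).astype(float)

# Boolean mask: True with probability keep_prob.
mask = rng.random(x.shape) < keep_prob

unscaled = x * mask               # mean shrinks to roughly keep_prob * mean(x)
scaled = x * (mask / keep_prob)   # rescaled mask keeps the mean roughly unchanged

ratio_unscaled = unscaled.mean() / x.mean()
ratio_scaled = scaled.mean() / x.mean()
```

Because each mask entry has expectation `keep_prob`, dividing by `keep_prob` makes the expected output equal to the input, so `ratio_scaled` stays near 1 while `ratio_unscaled` stays near 0.8.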

**Task:**

Update your implementation to fix the expected value. Verify your implementation below.

```
net = Model()
data = Node(np.random.randint(1,100,size=(100,100)))
out = net.dropout(keep_prob=0.8)(data)
# mean of input and output should be similar
np.testing.assert_almost_equal(data.value.mean(), out.value.mean(), decimal=0)
```

The cell below runs a training loop that you can use to check whether your model learns appropriately.

```
# training
net = Model()
lrate = 0.002
batch_size = 75
test_losses = []
steps = 100
for i in range(steps):
    # draw a random mini-batch without replacement
    minis = np.random.choice(np.arange(len(x_train)), size=batch_size, replace=False)
    x_mini = x_train[minis,:]
    y_mini = y_train[minis]
    loss = net.loss(x_mini,y_mini,train=True)
    grads = loss.grad(1)
    # plain gradient descent step on all parameters
    new_params = { k : net.params[k] - lrate * grads[k]
                   for k in grads.keys() }
    net.params.update(new_params)
    # track the test loss with dropout switched off
    test_losses.append(net.loss(x_test,y_test,train=False).value.item())

# testing
pred = np.round(net.forward(x_test,train=False).value.squeeze())
print('test accuracy:', np.mean(pred == y_test))

plt.plot(test_losses)
plt.ylabel('loss on test set')
plt.xlabel('iterations');
```

## Literature

## Licenses

### Notebook License (CC-BY-SA 4.0)

*The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).*

*Dropout* by *Diyar Oktay* is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Based on a work at https://gitlab.com/deep.TEACHING.

### Code License (MIT)

*The following license only applies to code cells of the notebook.*

Copyright 2018 *Diyar Oktay*

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.