# Dropout (Differentiable Programming)

## Introduction

This notebook walks you through an implementation of a regularization technique called dropout. The idea is that in each forward pass during training, we randomly select units to 'drop out' of the network, i.e. temporarily remove them together with their connections. This forces the surviving units to learn useful features on their own, without depending too heavily on the cooperation of other units.

## Requirements

### Knowledge

These are useful resources on the topic, though it's not required to read them entirely before tackling this notebook.

### Python Modules

import numpy as np

from dp import NeuralNode,Node

from sklearn import datasets,preprocessing
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

### Data

x,y = datasets.load_breast_cancer(return_X_y=True)
x = preprocessing.scale(x)
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)

## Dropout

Figure 1 from 'Dropout: A Simple Way to Prevent Neural Networks from Overfitting' #SRI14

The authors of the dropout paper observe that a good way to reduce overfitting is to average the predictions of many separately trained networks, but this is too computationally expensive to do in practice.

Enter dropout. On the left of Figure 1, you see a network with all its units and their connections. On the right, the crossed-out units have been dropped from the network along with all their connections, which yields a new, 'thinned' version of the neural net.

When we send training samples through the network in the forward pass, we randomly sample units to drop from the network. So for each sample we train a 'thinned' version of the net. This approximates training and averaging many different neural nets with shared parameters.
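As a rough illustration of the 'thinned network' idea, the sketch below uses plain NumPy rather than the notebook's `Node` class. The layer sizes, the `thinned_forward` helper and the seed are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden layer: 4 inputs -> 6 hidden units.
# The same weights W are shared by every thinned network.
W = rng.standard_normal((4, 6))
x = rng.standard_normal((1, 4))

def thinned_forward(x, W, keep_prob, rng):
    """One forward pass through a randomly thinned version of the layer."""
    h = np.tanh(x @ W)
    # Each hidden unit is kept independently with probability keep_prob.
    keep = rng.random(h.shape) < keep_prob
    return h * keep, keep

# Two passes over the same sample reuse W but, in general, different sub-networks.
h1, keep1 = thinned_forward(x, W, keep_prob=0.5, rng=rng)
h2, keep2 = thinned_forward(x, W, keep_prob=0.5, rng=rng)
print(keep1.astype(int))
print(keep2.astype(int))
```

Each printed row marks which hidden units took part in that pass. Only the mask changes between passes; the weights stay the same.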

The paper presents the following motivation (#SRI14, p. 1932 / p. 4 in the PDF):

"Similarly, each hidden unit in a neural network trained with dropout must learn to work with a randomly chosen sample of other units. This should make each hidden unit more robust and drive it towards creating useful features on its own without relying on other hidden units to correct its mistakes."

The following exercises walk you through an implementation of a dropout layer for a neural net.

## Exercises

Implement a mask operator for the Node autodiff class.

Note: Remember to implement the partial derivative of the mask operator since it's crucial for backprop. If a unit is killed through dropout, it doesn't contribute anything to the network. So the gradient that flows back into it should be 0.
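To make the derivative concrete, here is a minimal sketch of masking and its gradient in plain NumPy. It does not use the `Node` API, and the function names `mask_forward` and `mask_backward` are hypothetical. The underlying fact is that for `y = x * m` the partial derivative with respect to `x` is `m`, so the upstream gradient is passed on only where the mask is 1.

```python
import numpy as np

def mask_forward(x, m):
    # Elementwise masking: entries where m == 0 are zeroed.
    return x * m

def mask_backward(upstream, m):
    # d(x * m)/dx = m, so the upstream gradient passes only where m == 1.
    return upstream * m

x = np.arange(1, 11, dtype=float)[None, :]   # numbers 1..10, shape (1, 10)
m = np.array([0, 1] * 5)                     # drop every other entry

# Example function: y = sum((x * m) ** 2)
out = mask_forward(x, m)
upstream = 2 * out                   # gradient of the square w.r.t. the masked values
grad_x = mask_backward(upstream, m)  # chain rule back to x
print(grad_x)
```

The gradient is zero at every dropped position and `2 * x` at every kept position, which is the behaviour the exercise asks for.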

def mask(self, mask: np.ndarray):
    raise NotImplementedError()
    return

Node.mask = mask

# mask numbers 1..10 and square
a = Node(np.arange(1,11)[None,:], 'A')
mask = np.array([0, 1] * 5)

assert grads[0,4] == 0

We'll again use the autodiff class to create a model for the breast cancer dataset. The method linear adds a linear layer to the network; your task is to implement a dropout layer.

class Model():
    # define layers of the model
    def __init__(self):
        self.params = dict()
        self.fc1 = self.linear(30, 20, 'fc1')
        self.do1 = self.dropout(keep_prob=0.5)
        self.fc2 = self.linear(20, 10, 'fc2')
        self.do2 = self.dropout(keep_prob=0.5)
        self.fc3 = self.linear(10, 1, 'fc3')

    # define forward pass
    def forward(self, x, train=True):
        if not isinstance(x, Node):
            x = Node(x)
        # TODO: implement forward pass
        raise NotImplementedError()
        return out

    # define loss function (binary cross-entropy, summed over the batch)
    def loss(self, x, y, train=True):
        out = self.forward(x, train)
        if not isinstance(y, Node):
            y = Node(y)
        loss = -1 * (y * out.log() + (1 - y) * (1 - out).log())
        return loss.sum()

    # add a linear layer to the model
    def linear(self, fan_in, fan_out, name):
        W_name, W_value = f'weight_{name}', np.random.randn(fan_in, fan_out)
        b_name, b_value = f'bias_{name}', np.random.randn(1, fan_out)
        self.params[W_name] = W_value
        self.params[b_name] = b_value

        def forward(x):
            return x.dot(Node(self.params[W_name], W_name)) + Node(self.params[b_name], b_name)

        return forward

    # TODO: add dropout method

### Forward pass

Implement the forward pass, e.g.

x -> linear -> tanh -> dropout
-> linear -> tanh -> dropout
-> linear -> sigmoid


Note: Mind the train parameter which indicates whether we're forwarding train or test data. On test data, you do not apply dropout.
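The role of the train flag can be sketched in plain NumPy as follows. This is not the `Node`-based solution: the helper names `drop` and `forward`, the parameter dictionary and the random inputs are assumptions for illustration. Rescaling of the kept activations is deliberately left out here, since it is the subject of the Expected value section.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def drop(h, keep_prob, train, rng):
    # Dropout is only active on training data; test data passes through unchanged.
    if not train:
        return h
    return h * (rng.random(h.shape) < keep_prob)

def forward(x, params, keep_prob=0.5, train=True, rng=rng):
    h = np.tanh(x @ params["W1"] + params["b1"])
    h = drop(h, keep_prob, train, rng)
    h = np.tanh(h @ params["W2"] + params["b2"])
    h = drop(h, keep_prob, train, rng)
    return sigmoid(h @ params["W3"] + params["b3"])

# Hypothetical parameters with the layer sizes used by Model (30 -> 20 -> 10 -> 1).
sizes = [(30, 20), (20, 10), (10, 1)]
params = {}
for i, (fan_in, fan_out) in enumerate(sizes, start=1):
    params[f"W{i}"] = rng.standard_normal((fan_in, fan_out))
    params[f"b{i}"] = rng.standard_normal((1, fan_out))

x = rng.standard_normal((5, 30))
p_test_a = forward(x, params, train=False)
p_test_b = forward(x, params, train=False)
p_train = forward(x, params, train=True)
```

With `train=False` the output is deterministic, so repeated calls agree. With `train=True` each call samples fresh masks.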

### Dropout layer

To apply dropout, we multiply the activations of a layer with a boolean/binary matrix of 0s and 1s (masking).

The hyperparameter $p$ is the probability with which each unit is kept, so it controls the fraction of units that survive on average. To make this explicit, the parameter is also called keep_prob.

Each dropout layer can have its own setting for keep_prob $\in [0, 1]$.
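One way to picture such a layer is as a function factory, similar in spirit to how `linear` returns a closure. The following NumPy sketch assumes a hypothetical helper `make_dropout` and is not the `Node`-based implementation the exercise asks for.

```python
import numpy as np

def make_dropout(keep_prob, rng):
    """Return a function that masks its input with a fresh random mask per call."""
    if not 0.0 <= keep_prob <= 1.0:
        raise ValueError("keep_prob must lie in [0, 1]")

    def apply(activations):
        # One Bernoulli(keep_prob) draw per entry: 1 keeps the unit, 0 drops it.
        m = rng.binomial(1, keep_prob, size=activations.shape)
        return activations * m

    return apply

rng = np.random.default_rng(0)
acts = np.ones((200, 200))

keep_all = make_dropout(1.0, rng)(acts)
keep_most = make_dropout(0.8, rng)(acts)
surviving = np.count_nonzero(keep_most) / keep_most.size
print(f"surviving fraction: {surviving:.3f}")
```

Because every layer gets its own `keep_prob`, different layers can drop units at different rates.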

Implement the dropout layer.

def dropout(self, keep_prob=0.5):
    raise NotImplementedError()

Model.dropout = dropout

The first dropout layer below has a keep_prob of 1.0, so all activations should survive.

The second dropout layer has a keep_prob of 0.8, so approximately 20% of the activations should be dropped.

data = Node(np.random.randint(1,10,size=(10,10)))
out0 = dropout(None,keep_prob=1.0)(data)
out1 = dropout(None,keep_prob=0.8)(data)

# all units should survive
assert np.all(out0.value == data.value)

# roughly 80% of units should survive
np.testing.assert_almost_equal(np.count_nonzero(out1.value)/out1.value.size, 0.8, decimal=1)

### Expected value

Say we have a dropout layer with a keep_prob of 0.8, so only about 80% of the inputs survive. The expected value of the output is about 80% of that of the input.

At test time, however, we don't apply dropout, so there's a scaling problem: the units receive inputs with a greater expected value than the ones they were trained on.

To remedy this, during training the dropout layer applies the dropout mask and then multiplies the values by $\frac{1}{keep\_prob}$ to correct the expected value. Equivalently, multiply the mask itself by $\frac{1}{keep\_prob}$.
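The effect of the rescaling can be checked numerically. Each mask entry $m$ is 1 with probability keep_prob, so $E[m \cdot x] = keep\_prob \cdot x$ and $E[m \cdot x / keep\_prob] = x$. The sketch below is plain NumPy with made-up input data, not the `Node`-based layer.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8

# Made-up positive activations, similar in spirit to the verification cell.
x = rng.integers(1, 100, size=(500, 500)).astype(float)

# Binary keep mask: True with probability keep_prob.
m = rng.random(x.shape) < keep_prob

plain = x * m                    # mean shrinks to roughly keep_prob * mean(x)
rescaled = x * (m / keep_prob)   # scaling the mask restores the mean

print(x.mean(), plain.mean(), rescaled.mean())
```

The first and third printed means should be close to each other, while the second is roughly 80% of them.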

Update your implementation to fix the expected value. Verify your implementation below.

net = Model()
data = Node(np.random.randint(1,100,size=(100,100)))
out = net.dropout(keep_prob=0.8)(data)

# mean of input and output should be similar
np.testing.assert_almost_equal(data.value.mean(), out.value.mean(), decimal=0)

The cell below executes a training loop which you can use to verify if your model learns appropriately.

# training
net = Model()
lrate = 0.002
batch_size = 75
test_losses = []
steps = 100

for i in range(steps):
    # draw a random mini-batch without replacement
    minis = np.random.choice(np.arange(len(x_train)), size=batch_size, replace=False)
    x_mini = x_train[minis, :]
    y_mini = y_train[minis]
    loss = net.loss(x_mini, y_mini, train=True)
    # `grads` maps parameter names to the gradients of `loss`; obtain it from
    # the Node backward pass (the exact call depends on the dp API)
    new_params = {k: net.params[k] - lrate * grads[k] for k in net.params}
    net.params.update(new_params)
    test_losses.append(net.loss(x_test, y_test, train=False).value.item())

# testing
pred = np.round(net.forward(x_test, train=False).value.squeeze())
print('test accuracy:', np.mean(pred == y_test))
plt.plot(test_losses)
plt.ylabel('loss on test set')
plt.xlabel('iterations');
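The update rule in the loop above is plain mini-batch gradient descent. As a self-contained sketch of the same pattern, the example below swaps the notebook's model for a logistic regression with analytic gradients and uses synthetic data. Both are assumptions made so the snippet runs without the `dp` module, and `loss_and_grads` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, linearly separable stand-in for the real dataset (assumption).
X = rng.standard_normal((400, 5))
y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) > 0).astype(float)

params = {"w": np.zeros((5, 1)), "b": np.zeros((1, 1))}
lrate, batch_size, steps = 0.002, 75, 100

def loss_and_grads(params, xb, yb):
    # Summed binary cross-entropy of a logistic regression and its analytic gradients.
    p = 1.0 / (1.0 + np.exp(-(xb @ params["w"] + params["b"])))
    yb = yb[:, None]
    err = p - yb                          # gradient of the loss w.r.t. the logits
    p = np.clip(p, 1e-12, 1 - 1e-12)      # avoid log(0) in the loss value
    loss = -np.sum(yb * np.log(p) + (1 - yb) * np.log(1 - p))
    return loss, {"w": xb.T @ err, "b": err.sum(axis=0, keepdims=True)}

losses = []
for _ in range(steps):
    # draw a random mini-batch without replacement
    idx = rng.choice(len(X), size=batch_size, replace=False)
    _, grads = loss_and_grads(params, X[idx], y[idx])
    # Same update rule as in the notebook's loop: theta <- theta - lrate * grad.
    params = {k: params[k] - lrate * grads[k] for k in params}
    losses.append(loss_and_grads(params, X, y)[0])

print(losses[0], losses[-1])
```

In the notebook, the gradients come from the autodiff class instead of a hand-written formula, but the parameter update itself has the same shape.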

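## Literature

#SRI14: Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 1929-1958.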

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Dropout
by Diyar Oktay