# Batch Norm (Differentiable Programming)

## Introduction

This notebook covers batch normalization. You're likely familiar with feature scaling: transforming all features of the input data to roughly the same range before feeding them into the network.

**Batch normalization** takes this a step further and performs normalization on the activations at each layer.

## Requirements

### Knowledge

It's not required to study these resources before tackling this notebook, but they cover the topic well.

- The original Batch Norm paper by S. Ioffe/C. Szegedy [IOF15]
- The write-up Batch Norm layer by Leonardo Araujo dos Santos [ARA18]

### Python Modules

```
import dp
from dp import NeuralNode,Node
import numpy as np
from sklearn import datasets,preprocessing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
```

### Data

This cell loads the breast cancer dataset that ships with sklearn and standardizes its features.

```
x,y = datasets.load_breast_cancer(return_X_y=True)
x = preprocessing.scale(x)
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
```

## Batch normalization

Imagine a deep neural network attempting to learn. Before we feed our data into the network, we normalize each dimension to have a mean of 0 and a standard deviation of 1. To do so, we subtract the expected value (mean) $ E[x] $ and divide by the square root of the variance $ Var(x) $ of each dimension:

$$ x^{norm} = \frac{x - E[x]}{\sqrt{Var(x)}} $$

It's common to add a small number $ \epsilon $ to the variance to prevent division by zero. In Python code:

```
# data
foo = np.random.randn(1000,20)
# mean and variance per feature
mean = foo.mean(axis=0,keepdims=True)
var = foo.var(axis=0,keepdims=True)
epsilon = 1e-8
# normalization
foo_norm = (foo - mean)/np.sqrt(var + epsilon)
foo_norm.mean(), foo_norm.std()
```

So the first layer is happy and content: its input is always normalized. But what about the deeper layers? After the data has gone through multiple matrix multiplications in the network, its mean and standard deviation have likely shifted.

On top of that, in each iteration of learning, the parameters of the first layer change, so the succeeding layers try to learn on data with constantly shifting mean and variance.
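The drift described above can be observed with a small NumPy-only sketch. The layer sizes, the unscaled random weights, and the helper name `layer_stats` are illustrative assumptions and not part of the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalized input: every feature has roughly zero mean and unit standard deviation
x = rng.standard_normal((1000, 30))

def layer_stats(a):
    """Return the overall mean and standard deviation of an activation matrix."""
    return float(a.mean()), float(a.std())

# Two hypothetical linear layers with unscaled random weights and no batch norm
W1 = rng.standard_normal((30, 20))
W2 = rng.standard_normal((20, 10))

h1 = x @ W1              # pre-activations of the first layer
h2 = np.tanh(h1) @ W2    # pre-activations of the second layer

print("input  (mean, std):", layer_stats(x))
print("layer 1 (mean, std):", layer_stats(h1))
print("layer 2 (mean, std):", layer_stats(h2))
```

Although the input is normalized, the pre-activations of both layers end up with a standard deviation well away from 1, and these statistics change again whenever `W1` or `W2` are updated.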

To make things easier for the deeper layers, batch normalization is applied. Each layer calculates the mean and variance of each feature over the mini-batch of samples `x`. Each sample is then normalized as

```
x_norm = (x - batch_mean) / np.sqrt(batch_variance + epsilon)
```

Then we essentially give the layer a customizable standard deviation and mean through the **parameters** $ \gamma $ (gamma) and $ \beta $ (beta):

```
out = gamma * x_norm + beta
```

The parameters $ \gamma $ and $ \beta $ are learned during training. You update them as you would any other parameter, such as the weights of a linear layer, e.g. with SGD, Momentum or Adam.
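The two steps above (normalize, then scale and shift) can be sketched in plain NumPy. This is only an illustration under assumed names (`batch_norm_forward`, `eps`) and made-up values for gamma and beta; it does not use the `Node` class, so it is not a drop-in answer to the exercises below.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-8):
    """Normalize each feature over the batch axis, then scale by gamma and shift by beta."""
    batch_mean = x.mean(axis=0, keepdims=True)
    batch_var = x.var(axis=0, keepdims=True)
    x_norm = (x - batch_mean) / np.sqrt(batch_var + eps)
    return gamma * x_norm + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(256, 4))   # un-normalized activations

# Hypothetical parameter values, one per feature
gamma = np.array([[1.0, 2.0, 0.5, 3.0]])
beta = np.array([[0.0, -1.0, 4.0, 10.0]])

out = batch_norm_forward(x, gamma, beta)
print(out.mean(axis=0))   # close to beta
print(out.std(axis=0))    # close to |gamma|
```

Whatever the statistics of `x` are, the per-feature mean of the output is set by beta and its standard deviation by the magnitude of gamma.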

## Exercises

### Bias

**Task:**

A linear layer generally performs the forward pass $ x \cdot W + b $. Show that if we apply batch norm after a linear layer, we can omit the bias term $ b $.
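Before working out the argument on paper, the claim can be checked numerically. The snippet below is a sanity check under assumed shapes and a hypothetical helper `normalize`; it is not the proof the task asks for.

```python
import numpy as np

def normalize(h, eps=1e-8):
    """Per-feature normalization over the batch axis (batch norm without gamma and beta)."""
    return (h - h.mean(axis=0, keepdims=True)) / np.sqrt(h.var(axis=0, keepdims=True) + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 30))   # a mini-batch of 64 samples
W = rng.standard_normal((30, 20))   # weights of a linear layer
b = rng.standard_normal((1, 20))    # bias of the same layer

without_bias = normalize(x @ W)
with_bias = normalize(x @ W + b)

# Largest absolute difference between the two normalized outputs
print(np.abs(with_bias - without_bias).max())
```

The difference is at the level of floating-point rounding, which hints at what happens to a constant per-feature offset when the batch mean is subtracted.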

### Implement batch norm

We'll use the `Node` class in dp.py for automatic differentiation and create a model for the breast cancer dataset. The architecture should look as follows:

```
data(30 features) -> linear(30,20) -> batch norm -> tanh -> linear(20,10) -> batch norm -> tanh -> linear(10,1) -> sigmoid
```

**Task:**

Implement the method `batch_norm` to add a batch norm layer to the network.

```
class Model():
    # define layers of the model
    def __init__(self):
        self.params = dict()
        self.fc1 = self.linear(30, 20, 'fc1')
        self.bn1 = self.batch_norm(20, 'bn1')
        self.fc2 = self.linear(20, 10, 'fc2')
        self.bn2 = self.batch_norm(10, 'bn2')
        self.fc3 = self.linear(10, 1, 'fc3')

    # define forward pass
    def forward(self, x, train=True):
        if not type(x) == Node:
            x = Node(x)
        x = self.fc1(x)
        x = self.bn1(x).tanh()
        x = self.fc2(x)
        x = self.bn2(x).tanh()
        x = self.fc3(x)
        out = x.sigmoid()
        return out

    # define loss function (binary cross-entropy, summed over the batch)
    def loss(self, x, y):
        out = self.forward(x, train=True)
        if not type(y) == Node:
            y = Node(y)
        loss = -1 * (y * out.log() + (1 - y) * (1 - out).log())
        return loss.sum()

    # add a linear layer (without bias) to the model
    def linear(self, fan_in, fan_out, name):
        W_name, W_value = f'weight_{name}', np.random.randn(fan_in, fan_out)
        self.params[W_name] = W_value

        def forward(x):
            return x.dot(Node(self.params[W_name], W_name))
        return forward
```

```
def batch_norm(self, fan_in, name):
    # TODO: add gamma and beta of this layer to self.params
    # TODO: define and return the forward pass, i.e. a function that
    # applies batch norm to x
    raise NotImplementedError()

Model.batch_norm = batch_norm
```

Verify that your implementation works properly. For each feature, the output of the batch norm layer should have a mean of **beta** and a standard deviation of **|gamma|**.

```
net = Model()
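# check each key separately; `assert 'a', 'b' in d` would always pass
assert 'gamma_bn1' in net.params and 'beta_bn1' in net.params
assert 'gamma_bn2' in net.params and 'beta_bn2' in net.params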
data = np.random.randn(100,10)
out = net.bn2(Node(data)).value
print(out.mean(axis=0))
print(out.std(axis=0))
assert np.allclose(out.mean(axis=0), net.params['beta_bn2'], atol=1e-5)
assert np.allclose(out.std(axis=0), np.abs(net.params['gamma_bn2']), atol=1e-5)
```

### New data

Recall: During the training process, you calculate the mean and variance of each feature **over the mini-batch** of samples, then normalize each sample as

```
x_norm = (x - batch_mean) / np.sqrt(batch_variance + epsilon)
```

After training is completed, you may want to classify a single sample or a whole dataset, so there are no mini-batches.
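A short NumPy sketch shows why batch statistics break down for a single sample. The feature values are made up for illustration.

```python
import numpy as np

eps = 1e-8
sample = np.array([[3.0, -1.5, 42.0]])   # a "batch" that holds only one sample

batch_mean = sample.mean(axis=0, keepdims=True)   # identical to the sample itself
batch_var = sample.var(axis=0, keepdims=True)     # zero for every feature

x_norm = (sample - batch_mean) / np.sqrt(batch_var + eps)
print(x_norm)
```

Every feature is mapped to zero, so all information about the sample is lost before it reaches the next layer.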

To account for this, during the learning process you keep track of the **moving average** of the batch mean and batch variance. These moving averages are then used to normalize non-training data. If moving averages are new to you, you may want to check out the notebook on optimizers.
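One common choice is an exponential moving average. The sketch below uses plain NumPy and assumes a decay rate `momentum = 0.9` and zero-initialized averages; the notebook does not prescribe either, so treat both as illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
momentum = 0.9   # assumed decay rate of the moving average
eps = 1e-8

# Running statistics, one entry per feature, initialized to zero
avg_mean = np.zeros((1, 20))
avg_var = np.zeros((1, 20))

# "Training": update the running statistics with every mini-batch
for _ in range(100):
    batch = rng.normal(loc=42, scale=10, size=(1000, 20))
    avg_mean = momentum * avg_mean + (1 - momentum) * batch.mean(axis=0, keepdims=True)
    avg_var = momentum * avg_var + (1 - momentum) * batch.var(axis=0, keepdims=True)

# "Test time": normalize a single sample with the running statistics
sample = rng.normal(loc=42, scale=10, size=(1, 20))
sample_norm = (sample - avg_mean) / np.sqrt(avg_var + eps)

print(avg_mean.round(1))
print(avg_var.round(1))
```

After enough mini-batches the running mean approaches 42 and the running variance approaches 100, so a single sample can be normalized without any batch statistics of its own.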

**Task:**

Change your implementation of the `batch_norm` method to add `avg_mean_{layer_name}` and `avg_variance_{layer_name}` to the parameters of the model. For each mini-batch that the network sees during training, update these parameters as the moving average of the batch mean and batch variance, respectively.

**Note:** The forward function returned by the `batch_norm` method needs a parameter such as `train` to distinguish between training batches and test samples.

```
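def batch_norm(self, fan_in, name):
    # TODO: add gamma, beta, avg_mean and avg_variance of this layer
    # to self.params, then remove the following line
    raise NotImplementedError()

    def forward(x, train=True):
        # TODO: use batch statistics if train is True, moving averages otherwise
        raise NotImplementedError()
    return forward

Model.batch_norm = batch_norm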
```

Verify your implementation: This test feeds data with a mean of 42 and a standard deviation of 10 through the batch norm layer many times and checks the moving averages the layer has learned.

```
net = Model()
assert np.all(net.params['avg_mean_bn1'] == 0)
assert np.all(net.params['avg_variance_bn2'] == 0)
for i in range(100):
    data = Node(np.random.normal(loc=42, scale=10, size=(1000, 20)))
    net.bn1(data)
np.testing.assert_allclose(42, net.params['avg_mean_bn1'], atol=1)
np.testing.assert_allclose(10**2, net.params['avg_variance_bn1'], atol=5)
```

### Gradient descent

This cell creates a simple training loop to train and then test the model.

```
net = Model()
lrate = 0.01
batch_size = 50
steps = 100
# training
for i in range(steps):
    # sample a random mini-batch without replacement
    minis = np.random.choice(np.arange(len(x_train)), size=batch_size, replace=False)
    x_mini = x_train[minis, :]
    y_mini = y_train[minis]
    loss = net.loss(x_mini, y_mini)
    grads = loss.grad(1)
    # plain gradient descent step for every parameter that received a gradient
    new_params = {k: net.params[k] - lrate * grads[k]
                  for k in grads.keys()}
    net.params.update(new_params)
# testing
pred = np.round(net.forward(x_test).value).squeeze()
np.mean(pred == y_test)
```

## Literature

## Licenses

### Notebook License (CC-BY-SA 4.0)

*The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).*

*Batch Normalization* by *Diyar Oktay* is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Based on a work at https://gitlab.com/deep.TEACHING.

### Code License (MIT)

*The following license only applies to code cells of the notebook.*

Copyright 2018 *Diyar Oktay*

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.