# Batch Norm (Differentiable Programming)

## Introduction

This notebook deals with batch normalization. You're likely familiar with feature scaling: transforming all features of the input data to roughly the same range before feeding them into the network.

Batch normalization takes this a step further and performs normalization on the activations at each layer.

## Requirements

### Knowledge

It's not required to study these resources before tackling this notebook, but they provide excellent coverage of the topic.

### Python Modules

import dp
from dp import NeuralNode,Node
import numpy as np
from sklearn import datasets,preprocessing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

### Data

x,y = datasets.load_breast_cancer(return_X_y=True)
x = preprocessing.scale(x)
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)

### Batch normalization

Imagine a deep neural network attempting to learn. Before we feed our data into the network, we normalize each dimension to have a mean of 0 and a standard deviation of 1. To do so, we subtract the expected value (mean) $E$ and divide by the square root of the variance $Var$ of each dimension:

$x^{norm} = \frac{x - E[x]}{\sqrt{Var(x)}}$

It's common to add a small number $\epsilon$ to the variance to prevent division by zero. In Python code:

# data
foo = np.random.randn(1000,20)

# mean and variance per feature
mean = foo.mean(axis=0,keepdims=True)
var = foo.var(axis=0,keepdims=True)
epsilon = 1e-8

# normalization
foo_norm = (foo - mean)/np.sqrt(var + epsilon)
foo_norm.mean(), foo_norm.std()

So the first layer is happy and content, its input is always normalized. But what about the deeper layers? After the data has gone through multiple matrix multiplications in the network its mean and standard deviation have likely shifted.

On top of that, in each iteration of learning, the parameters of the first layer change, so the succeeding layers try to learn on data with constantly shifting mean and variance.
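The drift described above can be observed numerically. The following is a minimal sketch, assuming a hypothetical stack of five linear layers with randomly drawn weights; the layer sizes, the seed and the variable names are illustrative and not taken from the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# normalized input: 1000 samples, 20 features, mean 0 and std 1
h = rng.standard_normal((1000, 20))
stds = [h.std()]

# hypothetical stack of five linear layers with random weights
for layer in range(5):
    W = rng.standard_normal((20, 20))
    h = h @ W
    stds.append(h.std())
    print(f"layer {layer + 1}: mean={h.mean():.2f}, std={h.std():.2f}")
```

Although the input starts with a standard deviation of about 1, the scale of the activations changes at every layer, so deeper layers see inputs that are far from normalized.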

To make things easier for the deeper layers, batch normalization is applied. Each layer calculates the mean and variance of each feature over the mini-batch of samples x. Each sample is then normalized as

x_norm = (x - batch_mean) / np.sqrt(batch_variance + epsilon)


Then we let the layer choose its own output scale and offset through the parameters $\gamma$ (gamma) and $\beta$ (beta): $\gamma$ sets the standard deviation of the output and $\beta$ sets its mean.

out = gamma * x_norm + beta


These parameters $\gamma$ and $\beta$ are learnable in the training process. You update them as you would any other parameter such as the weight of a linear layer, e.g. with SGD, Momentum or Adam.
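The normalization and the scale-and-shift step can be sketched in plain NumPy. This is an illustration under assumptions, not the `Node`-based layer the exercises ask for: the function name `batch_norm_forward`, the example values for `gamma` and `beta`, and the default `epsilon` are all hypothetical.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, epsilon=1e-8):
    """Normalize each feature over the batch axis, then scale and shift."""
    batch_mean = x.mean(axis=0, keepdims=True)
    batch_var = x.var(axis=0, keepdims=True)
    x_norm = (x - batch_mean) / np.sqrt(batch_var + epsilon)
    return gamma * x_norm + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # mini-batch of 64 samples
gamma = np.array([[1.0, 2.0, 0.5, 3.0]])          # one scale per feature
beta = np.array([[0.0, -1.0, 4.0, 10.0]])         # one shift per feature

out = batch_norm_forward(x, gamma, beta)
print(out.mean(axis=0))  # close to beta
print(out.std(axis=0))   # close to gamma
```

Whatever mean and spread the incoming mini-batch has, the output statistics are determined by `beta` and `gamma` alone, which is what makes them useful as learnable parameters.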

## Exercises

### Bias

A linear layer generally performs the forward pass $x \cdot W + b$. Show that if we apply batch norm after a linear layer, we can omit the bias term $b$.
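Before working out the argument, it can help to check the claim numerically. The sketch below is only a sanity check on random data, not a substitute for the derivation; the helper `normalize` and all shapes are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 5))  # mini-batch of 32 samples
W = rng.standard_normal((5, 3))   # weights of a linear layer
b = rng.standard_normal((1, 3))   # bias of the same layer

def normalize(z, epsilon=1e-8):
    """Per-feature normalization over the batch axis."""
    mean = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True)
    return (z - mean) / np.sqrt(var + epsilon)

with_bias = normalize(x @ W + b)
without_bias = normalize(x @ W)
print(np.allclose(with_bias, without_bias))
```

If the two results agree, the remaining task is to explain which step of the normalization removes $b$.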

### Implement batch norm

We'll use the Node class in dp.py for automatic differentiation and create a model for the breast cancer dataset. The architecture should look as follows

data(30 features) -> linear(30,20) -> batch norm -> tanh -> linear(20,10) -> batch norm -> tanh -> linear(10,1) -> sigmoid


Implement the method batch_norm to add a batch norm layer to the network.

class Model():
    # define layers of the model
    def __init__(self):
        self.params = dict()
        self.fc1 = self.linear(30, 20, 'fc1')
        self.bn1 = self.batch_norm(20, 'bn1')
        self.fc2 = self.linear(20, 10, 'fc2')
        self.bn2 = self.batch_norm(10, 'bn2')
        self.fc3 = self.linear(10, 1, 'fc3')

    # define forward pass
    def forward(self, x, train=True):
        if not isinstance(x, Node):
            x = Node(x)
        x = self.fc1(x)
        x = self.bn1(x).tanh()
        x = self.fc2(x)
        x = self.bn2(x).tanh()
        x = self.fc3(x)
        out = x.sigmoid()
        return out

    # define loss function (binary cross-entropy, summed over the batch)
    def loss(self, x, y):
        out = self.forward(x, train=True)
        if not isinstance(y, Node):
            y = Node(y)
        loss = -1 * (y * out.log() + (1 - y) * (1 - out).log())
        return loss.sum()

    # add a linear layer to the model
    def linear(self, fan_in, fan_out, name):
        W_name, W_value = f'weight_{name}', np.random.randn(fan_in, fan_out)
        self.params[W_name] = W_value

        def forward(x):
            return x.dot(Node(self.params[W_name], W_name))

        return forward

def batch_norm(self, fan_in, name):
    # TODO: add gamma and beta of this layer to self.params
    # TODO: define and return the forward pass, i.e. a function that
    #       applies batch norm to x
    raise NotImplementedError()

# attach the function to the class so that __init__ can call self.batch_norm
Model.batch_norm = batch_norm

Verify that your implementation works properly. For each feature, the output of the batch norm layer should have a mean of beta and a standard deviation of |gamma|.

net = Model()
assert 'gamma_bn1' in net.params and 'beta_bn1' in net.params
assert 'gamma_bn2' in net.params and 'beta_bn2' in net.params

data = np.random.randn(100,10)
out = net.bn2(Node(data)).value
print(out.mean(axis=0))
print(out.std(axis=0))
assert np.allclose(out.mean(axis=0), net.params['beta_bn2'], atol=1e-5)
assert np.allclose(out.std(axis=0), np.abs(net.params['gamma_bn2']), atol=1e-5)

### New data

Recall: During the training process, you calculate the mean and variance of each feature over the mini batch of samples, then normalize each sample as

x_norm = (x - batch_mean) / np.sqrt(batch_variance + epsilon)


After training is completed, you may want to classify a single sample or a whole dataset, so there are no mini-batches.

To account for this, during the learning process you keep track of a moving average of the batch mean and of the batch variance. These moving averages are then used in place of the batch statistics to normalize non-training data. If moving averages are new to you, you may want to check out this notebook on optimizers.
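The bookkeeping can be sketched in plain NumPy as an exponential moving average. This is a minimal illustration under assumptions, not the `Node`-based solution: the decay rate `momentum = 0.9` is not specified in the notebook, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
momentum = 0.9          # assumed decay rate, not specified in the notebook
epsilon = 1e-8
avg_mean = np.zeros(3)  # moving average of the batch mean, one per feature
avg_var = np.zeros(3)   # moving average of the batch variance

# "training": update the averages with the statistics of every mini-batch
for step in range(200):
    batch = rng.normal(loc=42.0, scale=10.0, size=(1000, 3))
    avg_mean = momentum * avg_mean + (1 - momentum) * batch.mean(axis=0)
    avg_var = momentum * avg_var + (1 - momentum) * batch.var(axis=0)

# "inference": normalize a single sample with the stored averages
sample = np.array([52.0, 42.0, 32.0])
sample_norm = (sample - avg_mean) / np.sqrt(avg_var + epsilon)
print(avg_mean, avg_var, sample_norm)
```

Because the averages start at zero, they need a number of updates before they approach the true statistics; after that, a single sample can be normalized without any batch at all.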

Change your implementation of the batch_norm method to add avg_mean_{layer_name} and avg_variance_{layer_name} to the parameters of the model. For each mini-batch that the network sees during training, update these parameters as the moving average of the batch mean and batch variance, respectively.

Note: The forward function returned by the batch_norm method needs a parameter such as train to distinguish between train batches and test samples.

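def batch_norm(self, fan_in, name):
    # TODO: add gamma, beta and the moving averages to self.params
    raise NotImplementedError()

    def forward(x, train=True):
        # TODO: use batch statistics if train, moving averages otherwise
        raise NotImplementedError()

    return forward

Model.batch_norm = batch_norm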

Verify your implementation: This test feeds data with a mean of 42 and a standard deviation of 10 through the batch norm layer many times and checks the moving averages the layer has learned.

net = Model()
assert np.all(net.params['avg_mean_bn1'] == 0)
assert np.all(net.params['avg_variance_bn2'] == 0)
for i in range(100):
    data = Node(np.random.normal(loc=42, scale=10, size=(1000, 20)))
    net.bn1(data)
np.testing.assert_allclose(42, net.params['avg_mean_bn1'], atol=1)
np.testing.assert_allclose(10**2, net.params['avg_variance_bn1'], atol=5)

This cell creates a simple training loop to train and then test the model.

net = Model()

lrate = 0.01
batch_size = 50
steps = 100

# training
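for i in range(steps):
    # sample a mini-batch without replacement
    minis = np.random.choice(np.arange(len(x_train)), size=batch_size, replace=False)
    x_mini = x_train[minis, :]
    y_mini = y_train[minis]
    loss = net.loss(x_mini, y_mini)
    # NOTE: the line computing `grads` is missing from the source; the call
    # below assumes that dp's Node offers grad(...) returning a dict of
    # gradients keyed by parameter name
    grads = loss.grad(1)
    # plain SGD step on every parameter that received a gradient
    new_params = {k: net.params[k] - lrate * grads[k] for k in grads}
    net.params.update(new_params)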

# testing
pred = np.round(net.forward(x_test).value).squeeze()
np.mean(pred == y_test)

## Literature

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Batch Normalization
by Diyar Oktay