# Weight Initialization (Differentiable Programming)

## Introduction

This notebook deals with parameter initialization in neural nets. Weights that start off too small or too large can cause gradients to vanish or explode, which is detrimental to the learning process. Xavier initialization aims to keep the variance of activations and gradients roughly constant across layers, in both the forward and the backward pass.

In this notebook, you'll compare different initialization techniques and study their effect on the network. Finally, you'll implement a mechanism for custom weight initialization for the neural net library you've been building in this course.

## Requirements

### Knowledge

A recommended read on network initialization is the blog post "Initialization of deep networks" by Gustav Larsson [LAR15].

### Prerequisites

This notebook uses the neural net framework you've been building in the 'Differentiable Programming' course. Alternatively, you can use the implementation in dp.py. If dp.py is located in the same folder as this notebook, you can import it as a module with import dp.

### Python Modules

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets,preprocessing
from dp import Model,Node,SGD, Adam

### Variance

The variance of the product of two independent random variables $X$ and $Y$ is:

${\rm Var}(XY) = E(X^2Y^2) - (E(XY))^2={\rm Var}(X){\rm Var}(Y)+{\rm Var}(X)(E(Y))^2+{\rm Var}(Y)(E(X))^2$

Goodman, Leo A., "On the exact variance of products," Journal of the American Statistical Association, December 1960, 708–713.

For zero-mean variables, $E(X) = E(Y) = 0$, this simplifies to

${\rm Var}(XY) = {\rm Var}(X){\rm Var}(Y)$
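
The product rule can be checked numerically. The following is a minimal sketch, assuming Gaussian samples with arbitrarily chosen means and standard deviations (these values and all variable names are illustrative only). It compares the empirical variance of the product with the value predicted by the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

# Two independent variables with non-zero means (values chosen arbitrarily)
x = rng.normal(loc=1.5, scale=2.0, size=n_samples)
y = rng.normal(loc=-0.5, scale=3.0, size=n_samples)

# Left-hand side: variance of the product, estimated from the samples
var_empirical = np.var(x * y)

# Right-hand side: Var(X)Var(Y) + Var(X)E(Y)^2 + Var(Y)E(X)^2
var_predicted = (np.var(x) * np.var(y)
                 + np.var(x) * np.mean(y) ** 2
                 + np.var(y) * np.mean(x) ** 2)

# Zero-mean case: only the first term remains
x0 = rng.normal(loc=0.0, scale=2.0, size=n_samples)
y0 = rng.normal(loc=0.0, scale=3.0, size=n_samples)
var_empirical_zero = np.var(x0 * y0)
var_predicted_zero = np.var(x0) * np.var(y0)

print(var_empirical, var_predicted)
print(var_empirical_zero, var_predicted_zero)
```

Both pairs of numbers should agree up to sampling noise.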

### Weight initialization by considering only the forward pass

• The weight matrix $W$ consists of $m$ column vectors $\vec w_i$ (the weights of hidden neuron $i$; the layer has $m$ hidden neurons in total).
• Each element is drawn i.i.d. from a Gaussian with variance $var(W)$.
• The input vector of one example (or the hidden vector of the previous layer) $\vec x^T$ has $n$ elements with expected variance $var(X)$.
• With random initialization there is no correlation between the inputs and the weights.
• Both should be approximately zero-mean (through initialization for $W$ and data preprocessing for $\vec x$).

So the zero-mean product rule from above applies to each term $x_j w_{ji}$. In the following:

• $n$ is also called the "fan in" of a layer
• $m$ is the "fan out" of a layer

We want the variance to remain constant, i.e. the input and the output of a layer should have the same variance in the linear regime. So the following expression should equal 1:

$\frac{\text{var}(\vec x^T \cdot \vec w_i)}{\text{var}(X)} = \frac{\text{var} (\sum_{j=1}^n x_j w_{ji})}{\text{var}(X)}= \frac{n \, \text{var}(X) \, \text{var}(W)}{\text{var}(X)} = n \, \text{var}(W) = 1$

The second step uses that the $n$ terms of the sum are independent, so their variances add up.

i.e.:

• $\text{var}(W) = 1/n$
resp.
• $\text{std}(W) = 1/\sqrt n$
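
The effect of this choice can be observed in a small experiment. The sketch below assumes a stack of ten purely linear layers of equal width (the width, depth and sample count are illustrative choices, not prescribed by the text) and records the variance of the layer outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                   # number of inputs per neuron (and layer width)
h = rng.standard_normal((1000, n))        # zero-mean, unit-variance inputs

variances = []
for _ in range(10):
    # std(W) = 1/sqrt(n), i.e. var(W) = 1/n
    W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
    h = h @ W                             # linear layer without activation function
    variances.append(h.var())

print(np.round(variances, 3))
```

With this standard deviation the printed variances should stay close to 1 in every layer. Scaling the weights up or down makes them grow or shrink geometrically with depth.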

With ReLU units, only about half of the units are in the active regime. So the variance of $W$ must be doubled to yield the same effect, i.e.:

• $\text{var}(W) = 2/n$
resp.
• $\text{std}(W) = \sqrt{2/n}$
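
The factor of two can be illustrated with a short simulation. This is a sketch under assumed settings (ten ReLU layers of width 500, Gaussian inputs). The helper name preactivation_variances is hypothetical and not part of the course framework.

```python
import numpy as np

def preactivation_variances(std_w, n=500, num_layers=10, seed=0):
    """Variance of the pre-activations in each layer of a ReLU network."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((1000, n))        # zero-mean, unit-variance inputs
    variances = []
    for _ in range(num_layers):
        W = rng.normal(0.0, std_w, size=(n, n))
        z = h @ W                             # pre-activation of the layer
        variances.append(z.var())
        h = np.maximum(z, 0.0)                # ReLU: negative values are set to 0
    return variances

n = 500
var_plain = preactivation_variances(1.0 / np.sqrt(n), n=n)   # var(W) = 1/n
var_scaled = preactivation_variances(np.sqrt(2.0 / n), n=n)  # var(W) = 2/n

print(np.round(var_plain, 4))
print(np.round(var_scaled, 4))
```

With var(W) = 1/n the variance should roughly halve from layer to layer. With var(W) = 2/n it should stay at roughly the same level after the first layer, whose input is not yet rectified.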

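For training we do a forward pass and a backward pass. In the backward pass the error signal is "linearly" backpropagated. The same argument as above, applied to the backward pass where each error signal is a sum over the $m$ neurons of the layer, gives $\text{var}(W) = 1/m$.

Glorot et al. suggest a compromise between forward and backward pass for initialization that uses the average $(n+m)/2$, i.e.:

• $\text{var}(W) = 2/(n + m)$
resp.
• $\text{std}(W) = \sqrt{2/(n+m)}$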

For ReLUs:

• $\text{var}(W) = 4/(n + m)$
resp.
• $\text{std}(W) = \sqrt{4/(n+m)}$
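
The four variants listed above can be collected in one small helper. This is only a sketch: the function name init_std and its keyword arguments are hypothetical, $n$ is the number of terms in the forward sum (inputs per neuron) and $m$ the number of neurons in the layer.

```python
import numpy as np

def init_std(n, m, relu=False, average=True):
    """Standard deviation of W for a layer with n inputs per neuron and m neurons.

    average=False: forward pass only, var(W) = 1/n (2/n for ReLU)
    average=True:  compromise of forward and backward pass,
                   var(W) = 2/(n+m) (4/(n+m) for ReLU)
    """
    factor = 2.0 if relu else 1.0                 # ReLU doubles the variance
    terms = (n + m) / 2.0 if average else float(n)
    return np.sqrt(factor / terms)

# Example: a layer with 30 inputs and 20 neurons
print(init_std(30, 20, average=False))            # sqrt(1/30)
print(init_std(30, 20, relu=True, average=False)) # sqrt(2/30)
print(init_std(30, 20))                           # sqrt(2/50) = 0.2
print(init_std(30, 20, relu=True))                # sqrt(4/50)
```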

## Exercises

### Forward pass

In this exercise we have some contrived train_data (1000 samples with 500 features). Implement the forward pass.

• In each layer, the number of input and output features should remain the same.
• The weights in each layer are drawn from the uniform distribution on $[-1, 1]$.
• Each layer uses the tanh activation function.
• Return the activations across all layers.

(We'll focus on the parameters xavier and gain in the next step)

train_data = np.random.randn(1000,500)

def feed_forward(x,num_layers,xavier=False,gain=np.sqrt(2)):
    # returns the activations of all layers
    raise NotImplementedError()

### Visualise the activations

Feed your train_data through the forward pass with 10 layers. Then plot the distribution of the activation values of each layer in a histogram. What does this tell you about the saturation of the network?

# Some sample code that plots multiple histograms

def plot(activations):
    plt.figure(figsize=(40,20))
    for i in range(10):
        plt.subplot(3,4,i+1)
        # placeholder data; replace with the activation values of layer i
        plt.hist(np.geomspace(0.1,2))

plot(feed_forward(train_data,num_layers=10))
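
In addition to the histograms, saturation can be expressed as a single number per layer. The sketch below assumes that the activations are available as a sequence of NumPy arrays, one per layer. The helper saturation_fraction and the threshold of 0.99 are hypothetical choices, and the demo data is synthetic rather than the output of feed_forward.

```python
import numpy as np

def saturation_fraction(activations, threshold=0.99):
    """Fraction of tanh outputs per layer whose absolute value exceeds threshold."""
    return [float(np.mean(np.abs(a) > threshold)) for a in activations]

# Synthetic stand-in for two layers: small and large pre-activations
rng = np.random.default_rng(0)
demo_activations = [
    np.tanh(rng.normal(0.0, 0.5, size=(1000, 500))),   # mostly in the linear regime
    np.tanh(rng.normal(0.0, 5.0, size=(1000, 500))),   # mostly saturated
]

fractions = saturation_fraction(demo_activations)
print(fractions)
```

A value close to 0 means that hardly any unit is saturated. A value well above 0.5 means that most units sit in the flat parts of tanh, where gradients are close to zero.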

### Xavier Initialization

Update your implementation of feed_forward. If the parameter xavier is set to True, initialize all weights with Xavier initialization. Glorot et al. suggest the following normalized initialization:

$W \sim U \left[ - \frac{\sqrt6}{\sqrt{(fan\_in+fan\_out)}} , \frac{\sqrt6}{\sqrt{(fan\_in+fan\_out)}} \right]$

In words: fan_in and fan_out are the numbers of input and output features of a layer. Weights are drawn from the uniform distribution on [-sqrt(6/(fan_in+fan_out)), sqrt(6/(fan_in+fan_out))] and then multiplied by a constant gain.

Repeat the forward pass and plot the activations using Xavier initialization. Does this solve the problem of saturation?
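
The interval bound can be related to the variances from the section on the forward pass. The following worked derivation uses $a$ as a shorthand for the interval bound and $g$ for the gain (both symbols are introduced here for illustration only). A uniform distribution on $[-a, a]$ has zero mean and variance

$\text{var}(W) = \frac{1}{2a}\int_{-a}^{a} w^2 \, dw = \frac{a^2}{3}$

With $a = \sqrt{6/(fan\_in+fan\_out)}$ this gives

$\text{var}(W) = \frac{2}{fan\_in+fan\_out}$

Multiplying the weights by a constant gain $g$ scales the variance by $g^2$, so a gain of $\sqrt 2$ yields $4/(fan\_in+fan\_out)$, the value listed for ReLUs.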

plot(feed_forward(train_data,10,xavier=True))

### Using the neural net framework

Now we turn towards implementing custom initializers in the neural net framework.

First, create a model for a classification problem. We'll use the breast cancer dataset. Define the following architecture:

• First layer: Linear, 30 input features, 20 output features, tanh activation
• Second layer: Linear, 20 input features, 10 output features, tanh activation
• Third layer: Linear, 10 input features, 1 output feature, sigmoid activation
• For the loss function, use cross-entropy.
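
For reference, the cross-entropy for a single sigmoid output can be written in plain NumPy. This is a sketch outside the neural net framework: the function name binary_cross_entropy and the clipping constant eps are assumptions, and your implementation has to express the same computation with Node operations.

```python
import numpy as np

def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    """Cross-entropy for binary labels, summed over all samples."""
    # clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.sum(per_sample))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])   # sigmoid outputs of the last layer
loss = binary_cross_entropy(y_pred, y_true)
print(loss)                          # -(ln 0.9 + ln 0.8 + ln 0.6)
```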

Note: Your neural net implementation may not have a Tanh_Layer function that returns a layer performing a matrix multiplication followed by a tanh function. Equivalently, you can use a linear layer and apply the activation function in the forward pass, for example:

def __init__(self):
    self.hidden0 = self.Linear_Layer(...)

def forward(self,x):
    return self.hidden0(x).tanh()

x_train,y_train = datasets.load_breast_cancer(return_X_y=True)
# standardize each feature to zero mean and unit variance
x_train = preprocessing.scale(x_train)
#print(x_train)
#print(y_train)
print(x_train.shape)

class Net(Model):
    def __init__(self):
        super(Net,self).__init__()
        # create layers
        raise NotImplementedError()

    def loss(self,x,y):
        if not isinstance(y, Node):
            y = Node(y)
        # compute and return cross entropy loss, accumulated over all samples
        raise NotImplementedError()

    def forward(self, x):
        if not isinstance(x, Node):
            x = Node(x)
        # implement the forward pass
        # hidden_0 -> tanh -> hidden_1 -> tanh -> hidden_2 -> sigmoid
        raise NotImplementedError()

### Implement Initializer

Initializer is an abstract class. Its method initialize iterates over all weights and biases in the network and sets their values.

Any subclass represents a specific initialization method, e.g. Xavier. A subclass implements the methods initial_weights(self, fan_in, fan_out) and initial_bias(self, fan_in, fan_out). The arguments fan_in and fan_out are the number of input and output features of the layer. The functions return initialized weights and bias suited for the layer, respectively.

class Initializer():
    def __init__(self):
        pass

    def initialize(self,net):
        for k,v in net.get_param().items():
            fan_in,fan_out = v.shape
            if 'weight' in k:
                W = self.initial_weights(fan_in,fan_out)
                np.copyto(v, W)
            elif 'bias' in k:
                b = self.initial_bias(fan_in, fan_out)
                np.copyto(v, b)

    def initial_weights(self, fan_in, fan_out):
        raise NotImplementedError('Must be implemented by subclass')

    def initial_bias(self, fan_in, fan_out):
        raise NotImplementedError('Must be implemented by subclass')
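
To illustrate the interface, here is a sketch of a subclass that is not part of the task. ConstantInitializer and ToyNet are hypothetical names: ToyNet only mimics get_param() with two parameter arrays, and the parameter key names and the two-dimensional bias shape are assumptions. The base class is repeated in compact form so that the example runs on its own.

```python
import numpy as np

class Initializer:
    """Compact copy of the base class, repeated so the sketch is self-contained."""
    def initialize(self, net):
        for k, v in net.get_param().items():
            fan_in, fan_out = v.shape
            if 'weight' in k:
                np.copyto(v, self.initial_weights(fan_in, fan_out))
            elif 'bias' in k:
                np.copyto(v, self.initial_bias(fan_in, fan_out))

class ConstantInitializer(Initializer):
    """Hypothetical subclass: constant weights, zero biases."""
    def __init__(self, value=0.5):
        self.value = value

    def initial_weights(self, fan_in, fan_out):
        return np.full((fan_in, fan_out), self.value)

    def initial_bias(self, fan_in, fan_out):
        # for a bias, the two arguments are simply the shape of the bias array
        return np.zeros((fan_in, fan_out))

class ToyNet:
    """Hypothetical stand-in for a model; only get_param() is mimicked."""
    def __init__(self):
        self.params = {
            'hidden0_weight': np.empty((30, 20)),
            'hidden0_bias': np.empty((1, 20)),   # assumed two-dimensional bias
        }

    def get_param(self):
        return self.params

net = ToyNet()
ConstantInitializer(0.5).initialize(net)
print(net.params['hidden0_weight'][0, :3], net.params['hidden0_bias'][0, :3])
```

Because initialize writes into the existing arrays with np.copyto, the returned arrays must have exactly the shape of the parameter they replace.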

Task: Implement a few different initializers.

• LowInitializer: initializes all parameters close to 0
• LargeInitializer: initializes parameters at a large value e.g. random numbers drawn from the uniform distribution [-100..100]
• NormalInitializer: initializes parameters with values drawn from a normal distribution (as opposed to a uniform distribution)
• XavierInitializer: initializes parameters using Xavier initialization.

class LowInitializer(Initializer):
    def initial_weights(self, fan_in, fan_out):
        pass
    def initial_bias(self, fan_in, fan_out):
        pass

class LargeInitializer(Initializer):
    def initial_weights(self, fan_in, fan_out):
        pass
    def initial_bias(self, fan_in, fan_out):
        pass

class NormalInitializer(Initializer):
    def initial_weights(self, fan_in, fan_out):
        pass
    def initial_bias(self, fan_in, fan_out):
        pass

class XavierInitializer(Initializer):
    def initial_weights(self, fan_in, fan_out):
        pass
    def initial_bias(self, fan_in, fan_out):
        pass

Repeat the training process with different initializers applied to the network. Compare how well the network learns.

net = Net()
#LowInitializer().initialize(net)
LargeInitializer().initialize(net)
#XavierInitializer().initialize(net)
#NormalInitializer().initialize(net)
# assumption: plain SGD is used as the optimiser here
optimiser = SGD(
    net,
    x_train=x_train,
    y_train=y_train,
    hyperparam = {"alpha": 0.01}
)
optimiser.train(steps=100,print_each=10);

## Literature

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Notebook title
by Benjamin Voigt, Diyar Oktay