Weight Initialization (Differentiable Programming)
Introduction
This notebook covers parameter initialization in neural nets. Weights that start off too small or too large can cause gradients to vanish or explode, which slows down or prevents learning. Xavier initialization aims to keep the variance of activations and gradients roughly constant across layers in the forward and backward pass.
In this notebook, you'll compare different initialization techniques and study their effect on the network. Finally you'll implement a mechanism for custom weight initialization for the neural net library you've been building in this course.
Requirements
Knowledge
A recommended read on network initialization is the blog post Initialization of deep networks by Gustav Larsson (#LAR15)
Prerequisites
This notebook uses the neural net framework you've been building in the 'Differentiable Programming' course, but you can also use the implementation in dp.py. If dp.py is located in the same folder as this notebook, you can access it as a module with import dp.
Python Modules
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, preprocessing
from dp import Model, Node, SGD, Adam
Variance
The variance of the product of two independent random variables $ X $ and $ Y $ is:
$ {\rm Var}(XY) = E(X^2Y^2) - (E(XY))^2={\rm Var}(X){\rm Var}(Y)+{\rm Var}(X)(E(Y))^2+{\rm Var}(Y)(E(X))^2 $
Goodman, Leo A., "On the exact variance of products," Journal of the American Statistical Association, December 1960, 708–713.
For zero-mean variables, $ E(X) = E(Y) = 0 $, this reduces to
$ {\rm Var}(XY) = {\rm Var}(X){\rm Var}(Y) $
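The zero-mean case can be checked numerically. The following is a minimal sketch, assuming two independent Gaussian samples; the sample size, the seed and the two standard deviations are arbitrary choices and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the check is reproducible

# Two independent, zero-mean samples with different variances (arbitrary values).
x = rng.normal(loc=0.0, scale=2.0, size=1_000_000)   # Var(X) = 4
y = rng.normal(loc=0.0, scale=0.5, size=1_000_000)   # Var(Y) = 0.25

var_product = np.var(x * y)              # empirical Var(XY)
product_of_vars = np.var(x) * np.var(y)  # empirical Var(X) * Var(Y)

# Both numbers should be close to 4 * 0.25 = 1.
print(var_product, product_of_vars)
```

If either sample is shifted away from zero, the two numbers drift apart, which corresponds to the extra terms in the general formula.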
Weight initialization by considering only the forward pass

 The weight matrix $ W $ consists of $ m $ column vectors $ \vec w_i $ (the weights of hidden neuron $ i $; the layer has $ m $ hidden neurons in total).
 Each element is drawn i.i.d. from a Gaussian with variance $ \text{var}(W) $.
 The input vector of one example (or the hidden vector of the previous layer) $ \vec x^T $ has $ n $ elements with expected variance $ \text{var}(X) $.
 With random initialization there is no correlation between the input and the weights.
 Both should be approximately zero-mean (through initialization for $ W $ and data preprocessing for $ \vec x $).
In the following, $ n $ is the number of inputs of a layer, also called its "fan in", and $ m $ is the number of outputs, its "fan out".
We want the variance to stay constant from layer to layer, i.e. the input and the output of a layer should have the same variance in the linear regime. The following ratio should therefore be 1:
$ \frac{\text{var}(\vec x^T \cdot \vec w_i)}{\text{var}(X)} = \frac{\text{var} (\sum_{j=1}^n x_j w_{ji})}{\text{var}(X)}= \frac{n \, \text{var}(X) \text{var}(W)}{\text{var}(X)} = n \, \text{var}(W) = 1 $
i.e.:
 $ \text{var}(W) = 1/n $
resp. $ \text{std}(W) = 1/\sqrt{n} $
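A quick numerical check of this condition for a single linear layer is sketched below. The sizes, the seed and the variable names are assumptions made for the illustration; n denotes the number of inputs and m the number of hidden units.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 500                       # inputs and hidden units (arbitrary sizes)
x = rng.normal(size=(2000, n))        # zero-mean inputs with var(X) = 1

w_scaled = rng.normal(scale=1.0 / np.sqrt(n), size=(n, m))  # std(W) = 1/sqrt(n)
w_unit = rng.normal(scale=1.0, size=(n, m))                 # std(W) = 1, for comparison

ratio_scaled = np.var(x @ w_scaled) / np.var(x)  # expected near n * var(W) = 1
ratio_unit = np.var(x @ w_unit) / np.var(x)      # expected near n * var(W) = n

print(ratio_scaled, ratio_unit)
```

With unit-variance weights the output variance grows by a factor of roughly n per layer, which is what drives a saturating activation into its flat regions.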
With ReLU units only about half of the units are in the active regime, so the variance of $ W $ must be doubled to yield the same effect, i.e.:
 $ \text{var}(W) = 2/n $
resp. $ \text{std}(W) = \sqrt{2/n} $
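The effect of the factor 2 can be observed in a small experiment. The sketch below stacks ten ReLU layers of equal width and records the variance of the pre-activations in each layer; the width, batch size, seed and the helper name preactivation_variances are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512            # layer width (arbitrary example size)
num_layers = 10

def preactivation_variances(weight_std):
    """Return the variance of the pre-activations of each layer in a ReLU stack."""
    h = rng.normal(size=(1000, n))     # zero-mean, unit-variance input
    variances = []
    for _ in range(num_layers):
        w = rng.normal(scale=weight_std, size=(n, n))
        z = h @ w                      # pre-activation of this layer
        variances.append(np.var(z))
        h = np.maximum(z, 0.0)         # ReLU
    return variances

var_doubled = preactivation_variances(np.sqrt(2.0 / n))  # var(W) = 2/n
var_plain = preactivation_variances(np.sqrt(1.0 / n))    # var(W) = 1/n

# Compare each list with its own first entry: with var(W) = 2/n the variance
# stays roughly level, with var(W) = 1/n it roughly halves in every layer.
print(np.round(var_doubled, 3))
print(np.round(var_plain, 5))
```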
For training we do a forward pass and a backward pass. In the backward pass the error signal is "linearly" backpropagated, so the same argument applied to the backward pass gives the condition $ m \, \text{var}(W) = 1 $.
Glorot et al. suggest a compromise between the forward and the backward condition for initialization, i.e.:
 $ \text{var}(W) = 2/(n + m) $
resp. $ \text{std}(W) = \sqrt{2/(n+m)} $
For ReLUs:
 $ \text{var}(W) = 4/(n + m) $
resp. $ \text{std}(W) = \sqrt{4/(n+m)} $
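The four variants above can be collected in one small helper. This is only a sketch: the function name init_std and its parameters relu and average are made up for illustration, with n taken as the number of inputs and m as the number of hidden units of the layer.

```python
import numpy as np

def init_std(n, m, relu=False, average=True):
    """Standard deviation of the weights for a layer with n inputs and m hidden units.

    average=False uses only the forward-pass condition, var(W) = 1/n.
    average=True uses the forward/backward compromise, var(W) = 2/(n + m).
    relu=True doubles the variance in either case.
    """
    var = 2.0 / (n + m) if average else 1.0 / n
    if relu:
        var *= 2.0
    return np.sqrt(var)

# Example: a layer with 30 inputs and 20 hidden units.
std_forward = init_std(30, 20, average=False)   # sqrt(1/30)
std_glorot = init_std(30, 20)                   # sqrt(2/50)
std_glorot_relu = init_std(30, 20, relu=True)   # sqrt(4/50)
print(std_forward, std_glorot, std_glorot_relu)
```

For a square layer (n equal to m) the forward-only and the averaged variant coincide.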
Exercises
Forward pass
In this exercise we have some contrived train_data (1000 samples with 500 features). Implement the forward pass.
 In each layer, the number of input and output features should remain the same.
 The weights in each layer are drawn from the uniform distribution $ [-1 .. 1] $.
 Each layer uses the tanh activation function.
 Return the activations across all layers.
(We'll focus on the parameters xavier and gain in the next step.)
train_data = np.random.randn(1000,500)
def feed_forward(x,num_layers,xavier=False,gain=np.sqrt(2)):
    raise NotImplementedError()
Visualise the activations
Feed your train_data through the forward pass with 10 layers. Then plot the distribution of the activation values of each layer in a histogram. What does this tell you about the saturation of the network?
# Some sample code that plots multiple histograms
def plot(activations):
    plt.figure(figsize=(40,20))
    for i in range(10):
        plt.subplot(3,4,i+1)
        # placeholder data; replace it with the activations of layer i
        plt.hist(np.geomspace(0.1,2))
plot(feed_forward(train_data,num_layers=10))
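Besides the histograms, saturation can also be summarized as a single number per layer. The sketch below is not the exercise solution; it runs its own small tanh stack with weights drawn uniformly between -1 and 1 and reports the share of activations with an absolute value above 0.99. The threshold, the reduced batch size and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 500))    # small stand-in for train_data

saturated_fraction = []
h = x
for _ in range(10):
    w = rng.uniform(-1.0, 1.0, size=(500, 500))   # weights drawn uniformly from -1 to 1
    h = np.tanh(h @ w)
    # share of activations that sit in the flat tails of tanh
    saturated_fraction.append(np.mean(np.abs(h) > 0.99))

print(np.round(saturated_fraction, 2))
```

A large share in every layer means that most units operate where the derivative of tanh is close to zero, so very little gradient passes through them.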
Xavier Initialization
Update your implementation of feed_forward. If the parameter xavier is set to True, initialize all weights with Xavier initialization. Glorot et al. suggest the following normalized initialization:
$ W \sim U \left[ -\frac{\sqrt6}{\sqrt{(fan\_in+fan\_out)}} , \frac{\sqrt6}{\sqrt{(fan\_in+fan\_out)}} \right] $
To put it into words: fan_in and fan_out are the number of input features and output features of the layer. Weights are drawn from the uniform distribution over -sqrt(6/(fan_in+fan_out)) to sqrt(6/(fan_in+fan_out)) and then multiplied by a constant gain.
Repeat the forward pass and plot the activations using Xavier initialization. Does this solve the problem of saturation?
plot(feed_forward(train_data,10,xavier=True))
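The sampling step on its own can be sketched as follows. The helper name xavier_uniform and its rng parameter are assumptions, and the default gain of 1.0 is chosen here only so that the plain formula is visible; the feed_forward signature above uses a different default.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, gain=1.0, rng=None):
    """Sample a (fan_in, fan_out) weight matrix from the normalized uniform distribution."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return gain * rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
w = xavier_uniform(500, 500, rng=rng)

# A uniform distribution on [-a, a] has variance a^2 / 3, so with gain 1
# var(W) = 6 / (3 * (fan_in + fan_out)) = 2 / (fan_in + fan_out).
expected_var = 2.0 / (500 + 500)
print(np.var(w), expected_var)
```

The bound sqrt(6/(fan_in+fan_out)) is therefore the uniform counterpart of a variance of 2/(fan_in+fan_out); a gain g scales that variance by g squared.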
Using the neural net framework
Now we turn towards implementing custom initializers in the neural net framework.
First, create a model for a classification problem. We'll use the breast cancer dataset. Define the following architecture:
 First layer: Linear, 30 input features, 20 output features, tanh activation
 Second layer: Linear, 20 input features, 10 output features, tanh activation
 Third layer: Linear, 10 input features, 1 output feature, sigmoid activation
 For the loss function, use cross-entropy.
Note: Your neural net implementation may not have a Tanh_Layer function that returns a layer performing a matrix multiplication followed by a tanh function. Equivalently, you can use a linear layer and apply the activation function in the forward pass, for example:
def __init__(self):
    self.hidden0 = self.Linear_Layer(...)
def forward(self,x):
    return self.hidden0(x).tanh()
x_train,y_train = datasets.load_breast_cancer(return_X_y=True)
x_train = preprocessing.scale(x_train)
#print(x_train)
#print(y_train)
print(x_train.shape)
class Net(Model):
    def __init__(self):
        super(Net,self).__init__()
        # create layers
        raise NotImplementedError()
    def loss(self,x,y):
        if not type(y) == Node:
            y = Node(y)
        # compute and return cross entropy loss, accumulated over all samples
        raise NotImplementedError()
    def forward(self, x):
        if not type(x) == Node:
            x = Node(x)
        # implement the forward pass
        # hidden_0 -> tanh -> hidden_1 -> tanh -> hidden_2 -> sigmoid
        raise NotImplementedError()
Implement Initializer
Initializer is an abstract class. Its method initialize iterates over all weights and biases in the network and sets their values.
Any subclass represents a specific initialization method, e.g. Xavier. A subclass implements the methods initial_weights(self, fan_in, fan_out) and initial_bias(self, fan_in, fan_out). The arguments fan_in and fan_out are the number of input and output features of the layer. The methods return initialized weights and biases suited for the layer, respectively.
class Initializer():
    def __init__(self):
        pass
    def initialize(self,net):
        for k,v in net.get_param().items():
            fan_in,fan_out = v.shape
            if 'weight' in k:
                W = self.initial_weights(fan_in,fan_out)
                np.copyto(v, W)
            elif 'bias' in k:
                b = self.initial_bias(fan_in, fan_out)
                np.copyto(v, b)
    def initial_weights(self, fan_in, fan_out):
        raise NotImplementedError('Must be implemented by subclass')
    def initial_bias(self, fan_in, fan_out):
        raise NotImplementedError('Must be implemented by subclass')
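To illustrate how a subclass plugs into initialize, here is a minimal sketch with a deliberately trivial initializer. ConstantInitializer and FakeNet are hypothetical names that are not part of dp.py; the parameter names and shapes in FakeNet are assumptions about what get_param returns, and the base class is repeated only so that the example runs on its own.

```python
import numpy as np

class Initializer:
    """Copy of the base class, repeated so that this sketch is self-contained."""
    def initialize(self, net):
        for k, v in net.get_param().items():
            fan_in, fan_out = v.shape
            if 'weight' in k:
                np.copyto(v, self.initial_weights(fan_in, fan_out))
            elif 'bias' in k:
                np.copyto(v, self.initial_bias(fan_in, fan_out))

    def initial_weights(self, fan_in, fan_out):
        raise NotImplementedError('Must be implemented by subclass')

    def initial_bias(self, fan_in, fan_out):
        raise NotImplementedError('Must be implemented by subclass')

class ConstantInitializer(Initializer):
    """Hypothetical example: every weight gets the same value, biases are zero."""
    def __init__(self, value=0.5):
        self.value = value

    def initial_weights(self, fan_in, fan_out):
        return np.full((fan_in, fan_out), self.value)

    def initial_bias(self, fan_in, fan_out):
        # initialize() unpacks the shape of the bias array as well,
        # so the bias is returned with exactly that shape
        return np.zeros((fan_in, fan_out))

class FakeNet:
    """Stand-in for a dp model; parameter names and shapes are assumptions."""
    def __init__(self):
        self.params = {
            'hidden0_weight': np.ones((30, 20)),
            'hidden0_bias': np.ones((1, 20)),
        }

    def get_param(self):
        return self.params

net = FakeNet()
ConstantInitializer(0.5).initialize(net)
print(net.params['hidden0_weight'][0, :3], net.params['hidden0_bias'][0, :3])
```

Because np.copyto writes into the existing arrays, the network keeps its parameter objects and only their values change. A constant initializer is of course a poor choice for training, since all units of a layer then start out identical.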
Task: Implement a few different initializers.
 LowInitializer: initializes all parameters close to 0
 LargeInitializer: initializes parameters at a large value, e.g. random numbers drawn from the uniform distribution [-100..100]
 NormalInitializer: initializes parameters with values drawn from a normal distribution (as opposed to a uniform distribution)
 XavierInitializer: initializes parameters using Xavier initialization.
class LowInitializer(Initializer):
    def initial_weights(self, fan_in, fan_out):
        pass
    def initial_bias(self, fan_in, fan_out):
        pass
class LargeInitializer(Initializer):
    def initial_weights(self, fan_in, fan_out):
        pass
    def initial_bias(self, fan_in, fan_out):
        pass
class NormalInitializer(Initializer):
    def initial_weights(self, fan_in, fan_out):
        pass
    def initial_bias(self, fan_in, fan_out):
        pass
class XavierInitializer(Initializer):
    def initial_weights(self, fan_in, fan_out):
        pass
    def initial_bias(self, fan_in, fan_out):
        pass
Repeat the training process with different initializers applied to the network. Compare how well the network learns.
net = Net()
#LowInitializer().initialize(net)
LargeInitializer().initialize(net)
#XavierInitializer().initialize(net)
#NormalInitializer().initialize(net)
optimiser = Adam(
    net,
    x_train=x_train,
    y_train=y_train,
    hyperparam={"alpha": 0.01}
)
optimiser.train(steps=100,print_each=10);
Literature
Licenses
Notebook License (CC-BY-SA 4.0)
The following license applies to the complete notebook, including code cells. It does not, however, apply to any referenced external media (e.g., images).
Weight Initialization (Differentiable Programming) by Benjamin Voigt, Diyar Oktay is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.
Code License (MIT)
The following license only applies to code cells of the notebook.
Copyright 2019 Benjamin Voigt, Diyar Oktay
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.