# Optimizers (Differentiable Programming)

## Introduction

At this point, you are probably very familiar with vanilla gradient descent optimization and its update rule

weights = weights - alpha * gradients.


In this Notebook, you will implement three types of optmization algorithms that aim to improve upon classic gradient descent: SGD + Momentum, RMSProp and the Adam optimizers.

## Requirements

### Prerequisites

This notebook works with the neural net framework you've been building in the 'Differentiable Programming' course.

### Knowledege

The algorithms you'll implement in this notebooks are described in the following videos by deeplearning.ai/Andrew NG:

### Python Modules

import numpy as np
import matplotlib.pyplot as plt
from dp import Optimizer,Model,Node,SGD
from sklearn import datasets,preprocessing

### Data

This cell downloads the breast cancer dataset provided by sklearn and defines a model for the classification task.

x_train,y_train = datasets.load_breast_cancer(return_X_y=True)
x_train = preprocessing.scale(x_train)

class Net(Model):
def __init__(self):
super(Net, self).__init__()
self.h = self.Linear_Layer(30, 20, "h0")
self.h2 = self.Linear_Layer(20, 10, "h1")
self.h3 = self.Linear_Layer(10, 1, "h2")

def loss(self, x, y):
if not type(y) == Node:
y = Node(y)
out = self.forward(x)
loss = -1 * (y * out.log() + (1 - y) * (1 - out).log())
return loss.sum()

def forward(self, x):
if not type(x) == Node:
x = Node(x)
x = self.h(x).tanh()
x = self.h2(x).leaky_relu()
out = self.h3(x).sigmoid()

return out

def train_breast_cancer(optimizer):
net = Net()
optim = optimizer(net,x_train,y_train)
loss,loss_hist,para_hist = optim.train(steps=1000,print_each=100)
plt.plot(np.arange(len(loss_hist)),loss_hist, label=optimizer.__name__)
plt.xlabel('iterations')
plt.ylabel('loss')
plt.legend()
plt.show()

## Exercises

### Motivation: Exponential moving average

The moving average is a way to smooth out a noisy signal by averaging over the history of values. The further a data point lies in the past, the lower its influence on the current moving average.

The example below shows a noisy signal. The x values are the interval$[-\pi \dots \pi)$. The y-values are$cos(x)$ plus random noise.

x = np.linspace(-np.pi,np.pi,200)
y = np.cos(x)
target = y + np.random.uniform(low=-2,high=2,size=x.shape) * 0.2
plt.plot(x,target,'rx',label='true noisy data')
plt.legend()
plt.show()

If you were to draw a line through each point precisely, you would create a line with many oscillations in the vertical direction.

plt.plot(x,target,'r')
plt.show()

The series of exponential moving averages can be computed recursively:

• original signal:$\theta_{1:T} = [ \theta_1, \theta_2 \dots \theta_T ]$

• exponential moving average:

$s_0$ = 0

$s_t = \beta * s_{t-1} + (1 - \beta) * \theta_t$

Beta$\beta$ is a hyperparameter in the interval [0..1]. Note that the initial value$s_0$ is not part of the generated series, it just provides an initial value to plug into the calculation for$s_1$.

The aim of the moving average is to smoothen out the original noisy signal. You can also say the moving average dampens the oscillations in the horizontal direction.

Implement a function to return the smoothed series of moving averages for a signal. (You can ignore the parameter bias_correction for now) Try different values for beta, e.g. 0.1, 0.9 and 0.999 and note the changes in the curve.

def gen_smooth(signal,beta,bias_correction=False):
raise NotImplementedError()

Plot the original noisy series and the series of exponentially moving averages.

beta = 0.9
smooth_target = list(gen_smooth(target,beta))
plt.plot(x,target,'rx',label='true noisy data')
plt.plot(x,smooth_target,label='smoothed data')
plt.legend()
plt.show()

### Bias correction

With the graph we can spot a flaw with exponential moving averages: The smoothed series starts at 0 while the true series starts around -1.This error propagates through the series and its effect is especially visible early in the series.

With bias correction we can get better estimates early on. Instead of returning the value$s_t$ directly, we return$s_t^{corrected}$ as

s_corrected = s/(1 - beta**t)


Update your implementation to perform bias correction.

def gen_smooth(signal,beta,bias_correction=False):
raise NotImplementedError()
beta = 0.9
smooth_target = list(gen_smooth(target,beta,bias_correction=True))
plt.plot(x,target,'rx',label='true noisy data')
plt.plot(x,smooth_target,label='smoothed data')
plt.legend()
plt.show()

### Momentum

We can apply the concept of moving averages to gradient descent optimization. In each iteration, we calculate the gradient over a mini-batch of samples. The idea is to treat these gradients as noisy and flatten out the oscillations. In vanilla gradient descent, you use these gradients for the update directly e.g.

w = w - alpha * dW.


In momentum, you compute the gradient dW as usual. Then, you update the moving average of the gradient. In this algorithm the moving average is called velocity. Then you shift the parameters by the negative velocity scaled by the learning rate, as opposed to the gradient itself.

In pseudo-code:

vdW = np.zeros_like(w) # velocity of weight gradients, starts at 0
vdB = np.zeros_like(b) # velocity of bias gradients, starts at 0
for i in range(steps):
vdW = beta * vdW + (1 - beta) * dW # update velocities
vdb = beta * vdb + (1 - beta) * db
w = w - alpha * vdW # update weights
b = b - alpha * vdb


So now there are two hyperparameters:

• alpha, the learning rate
• beta, the hyperparameter used for velocity

Reference implementation: For reference, this is the vanilla SGD optimizer implemented in the neural net framework. Refer to the comments about how the update is performed.

class SGD(Optimizer):

def __init__(self, model, x_train=None, y_train=None, hyperparam=dict(), batch_size=128):
# call parent constructor
super(SGD, self).__init__(model, x_train, y_train, hyperparam, batch_size)

def _set_param(self):
# set hyperparameters
self.alpha = self.hyperparam.get("alpha", 0.001)

def _update(self, param, grad, g, i):
# param - A dictionary of parameters in the network
# param - dict; parameters of the network
# g - name of the parameters

# update the network:
update = param[g] - self.alpha * grad[g]
np.copyto(param[g], update)

def train(self, steps=1000, print_each=100):
# num_grad_stores creates copies of dictionaries { k : v }
#   k : str -  is the name of the parameter in the network.
#   v : np.ndarray - has the same shape as the parameter
#       and is initialized at zeros.
# for example in momentum, you can use self.grad_stores
# to keep track of the velocity.


Implement the Momentum optimizer. You can use self.grad_stores to keep track of the velocity.

class SGD_Momentum(Optimizer):

def __init__(self, model, x_train=None, y_train=None, hyperparam=dict(), batch_size=128):
super(SGD_Momentum, self).__init__(model, x_train, y_train, hyperparam, batch_size)

def _set_param(self):
raise NotImplementedError() #TODO: set hyperparameters self.alpha and self.beta

def _update(self, param, grad, g, i):
v = np.zeros_like(grad[g]) # TODO: update velocity
np.copyto(velocity[g], v)

# TODO: update network
np.copyto(param[g], update)

def train(self, steps=1000, print_each=100):
return self._train(steps, num_grad_stores=1, print_each=print_each)

Run the cell below to train the breast cancer model with this optimizer and plot the loss.

train_breast_cancer(SGD_Momentum)

### RMSProp

RMSProp keeps track of the moving average of the squared gradients, rather than the gradients themselves.

sdW is the running average of the squared weight gradients, the bias is left out from this example for brevity.

In the update, the gradients are divided by the square root of the moving average of squared gradients.

sdW = np.zeros_like(w)
for i in range(steps):
sdW = beta * sdW + (1 - beta) * np.square(dW) # update running avg. of square
w = w - alpha * dW/np.sqrt(sdW + epsilon) # update rule



Why is this done?

• sdW is expected to be small. So you divide dW by a small number and provide a boost to the weight update.
• sdb is expected to be large. So you divide db by a large number and dampen the bias update.

So now there are three hyperparameters:

• alpha, the learning rate. There is no default value, experiment with it and see what works.
• beta2, used to update the moving average of the squared gradients. A common default value is 0.99. We call this beta2 to distinguish it from the parameter beta used in momentum.
• epsilon - An arbitrary small number added to the root in the denominator. It prevents that we divide by zero or a number so small that the fraction explodes. A common default value is 10e-8

Implement the RMSProp optimizer.

class RMS_Prop(Optimizer):
def __init__(self, model, x_train=None, y_train=None, hyperparam=dict(), batch_size=128):
super(RMS_Prop, self).__init__(model, x_train, y_train, hyperparam, batch_size)

def _set_param(self):
# TODO: set self.alpha, self.beta2, self.epsilon
raise NotImplementedError()

def _update(self, param, grad, g, i):
# TODO: update the moving avg. of squared gradients and the network
raise NotImplementedError()

def train(self, steps=1000, print_each=100):
return self._train(steps, num_grad_stores=1, print_each=print_each)
train_breast_cancer(RMS_Prop)

Adam combines the properties of both Momentum and RMSProp. It keeps track of the velocity as the moving average of the gradients. It also keeps track of the moving average of the squared gradients. Additionally, we apply bias correction to the velocity and avg. of squared gradients.

In the update rule, we shift the parameter by the negative velocity divided by the square root of the avg. of squared gradients. Using the variable names from the previous exercies:

w = w - alpha * vdW_corrected/np.sqrt(sdW_corrected)


So now there are 4 hyperparameters:

• alpha, the learning rate

• beta, the hyperparameter for the moving average of the velocity. A common default value is 0.9

• beta2, the hyperparameter for the moving average of the squared gradients. A common default value is 0.99

• epsilon : a small value added to the denominator. Common default value: 10e-8

class Adam(Optimizer):
def __init__(self, model, x_train=None, y_train=None, hyperparam=dict(), batch_size=128):
super(Adam, self).__init__(model, x_train, y_train, hyperparam, batch_size)

def _set_param(self):
# TODO: set alpha, beta, beta2, epsilon
raise NotImplementedError()

def _update(self, param, grad, g, i):
# TODO: update the velocity, moving avg. of squared gradients and network
raise NotImplementedError()

def train(self, steps=1000, print_each=100):
# note that num_grad_stores is 2 since you need to keep track
# of both the velocity and moving avg. of squared gradients
return self._train(steps, num_grad_stores=2, print_each=print_each)
train_breast_cancer(Adam)

## Literature

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Optimizers
by Diyar Oktay