Bayes by Backprop

Status of the notebook:

  • A working implementation of a Bayesian NN from scratch, incl. training by Bayes by Backprop [Blu15]

Motivation for uncertainty in the weights

According to [Blu15]

  • Regularization via compression cost on the weights
  • Richer representation and predictions from cheap model averaging
  • Exploration in simple reinforcement learning

From [XX]:

  • Pruning of neural networks for reduced storage and computational cost, e.g. for mobile devices.

Bayesian Neural Networks

  • Instead of point estimates (MAP), we describe the weights by probability distributions $ p({\bf w}) $ (see Figure 1 of [Blu15]).
  • e.g. mean-field approximation: $ p({\bf w}) = \prod_i p(w_i) $
  • with Gaussians: $ p(w_i) = \mathcal N(\mu_i, \sigma_i^2) $ (a small sampling sketch follows this list)
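
To make the mean-field Gaussian concrete, here is a small NumPy sampling sketch (an illustration only, not part of the dp-based implementation below): every weight gets its own mean and standard deviation, and drawing one concrete network amounts to sampling each weight independently.

import numpy as np

rng = np.random.default_rng(0)

# mean-field Gaussian over a 3x2 weight matrix:
# p(w) = prod_i N(w_i | mu_i, sigma_i^2), one (mu_i, sigma_i) pair per weight
mu = rng.normal(scale=0.1, size=(3, 2))
sigma = 0.05 * np.ones((3, 2))

# one sampled weight matrix = one concrete network drawn from the distribution
w_sample = mu + sigma * rng.standard_normal(size=(3, 2))
print(w_sample)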

Bayesian Inference

  • Posterior: $ p({\bf w}\mid \mathcal D) $
  • Prediction: $ p({\bf \hat y} \mid {\bf \hat x}) = \mathbb E_{p({\bf w}\mid \mathcal D)}\left[p({\bf \hat y} \mid {\bf \hat x}, {\bf w})\right] $
  • equivalent to using an ensemble of uncountably many neural networks (see the small Monte-Carlo sketch after this list)
  • intractable for larger NNs
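
Since the exact expectation over $ p({\bf w}\mid \mathcal D) $ is intractable, predictions are approximated in practice by averaging over a finite number of sampled weight sets, exactly as done for `bayes_model` at the end of this notebook. A minimal NumPy sketch with a hypothetical `predict(x, w)` function (assumed here for illustration only):

import numpy as np

def mc_predict(predict, x, weight_samples):
    """Monte-Carlo approximation of p(y | x) = E_w[p(y | x, w)].

    predict(x, w) is assumed to return class probabilities for weights w;
    weight_samples is a list of weight sets drawn from the (approximate) posterior.
    """
    probs = np.array([predict(x, w) for w in weight_samples])
    return probs.mean(axis=0)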

Variational Approximation

We approximate the intractable posterior $ p({\bf w}\mid \mathcal D) $ by a variational distribution $ q({\bf w}\mid \theta) $ with parameters $ \theta $ (here the means and scales of the weight Gaussians).

Minimization of the KL divergence to the posterior (equivalently, of the negative ELBO):

$$
\begin{align}
\theta^* &= \text{arg}\min_\theta \mathcal D_{KL} \left[ q({\bf w}\mid \theta) \mid \mid p({\bf w}\mid \mathcal D)\right] \\
&= \text{arg}\min_\theta \mathcal D_{KL} \left[ q({\bf w}\mid \theta) \mid \mid p({\bf w})\, p(\mathcal D \mid {\bf w})\right] \\
&= \text{arg}\min_\theta \left( \mathcal D_{KL} \left[ q({\bf w}\mid \theta) \mid \mid p({\bf w}) \right] - \mathbb E_{q({\bf w}\mid \theta)}\left[\log p(\mathcal D \mid {\bf w})\right]\right)
\end{align}
$$

(the second line uses $ p({\bf w}\mid \mathcal D) \propto p({\bf w})\, p(\mathcal D \mid {\bf w}) $; the dropped constant $ \log p(\mathcal D) $ does not depend on $ \theta $)
  • prior-dependent part / complexity cost: $ \mathcal D_{KL} \left[ q({\bf w}\mid \theta) \mid \mid p({\bf w}) \right] $
  • data-dependent part / likelihood cost: $ -\mathbb E_{q({\bf w}\mid \theta)}\left[\log p(\mathcal D \mid {\bf w})\right] $ (a minimal numerical sketch of both terms follows this list)
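
The following is a minimal NumPy sketch of these two terms for a single weight (an illustration only, not the notebook's implementation): it assumes $ q(w) = \mathcal N(\mu, \sigma^2) $, a standard normal prior, and a toy Gaussian likelihood, and uses the closed-form Gaussian KL for the complexity cost, whereas [Blu15] and the code below estimate both terms by Monte-Carlo.

import numpy as np

rng = np.random.default_rng(0)

# toy data: y = 2*x + noise, a single weight w, likelihood y ~ N(w*x, 0.1^2)
x = rng.normal(size=20)
y = 2.0 * x + 0.1 * rng.normal(size=20)

mu, sigma = 1.5, 0.3  # variational parameters of q(w) = N(mu, sigma^2)

# complexity cost: KL[q || p] with prior p(w) = N(0, 1), closed form for Gaussians
complexity_cost = np.log(1.0 / sigma) + (sigma**2 + mu**2) / 2.0 - 0.5

# likelihood cost: Monte-Carlo estimate of -E_q[log p(D | w)]
w_samples = mu + sigma * rng.standard_normal(1000)
log_lik = np.array([np.sum(-0.5 * np.log(2 * np.pi * 0.1**2)
                           - (y - w * x)**2 / (2 * 0.1**2)) for w in w_samples])
likelihood_cost = -log_lik.mean()

print(complexity_cost, likelihood_cost, complexity_cost + likelihood_cost)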
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline


# we use the lightweight deep learning library dp
# you have to put dp.py in the same folder as this notebook
# you can download it from
# https://gitlab.com/deep.TEACHING/educational-materials/-/blob/master/notebooks/differentiable-programming/dp.py
# if you have downloaded the complete deep teaching git repo you find it in the folder ../differentiable-programming/

import dp # differentiable programming module
from dp import Node
# get the csv data from
#   https://pjreddie.com/projects/mnist-in-csv/
#data_path = "/Users/chris/data/mnist_csv/"
data_path = "/home/chris/Downloads/" 

try:
    train_data_mnist
except NameError:
    train_data_mnist = np.loadtxt(data_path + "mnist_train.csv", delimiter=",")
    test_data_mnist  = np.loadtxt(data_path + "mnist_test.csv", delimiter=",")

# TODO use standard MNIST data for faster loading
original_dim = 784 # Number of pixels in MNIST images.
output_dim = nb_classes = 10 # number of classes (digits 0-9)
intermediate_dim = 128 # Size of the hidden layer.
epochs = 30
image_dim = int(np.sqrt(original_dim))
image_dim
def get_xy_mnist(data_mnist=train_data_mnist):
    x = data_mnist[:,1:]
    x = x / 255. 
    y = data_mnist[:,0]
    return x, y
x_train, y_train_ = get_xy_mnist()
x_test, y_test_ = get_xy_mnist(test_data_mnist)
# one hot encoding for y
def one_hot_encoding(y, nb_classes=nb_classes):
    m = y.shape[0]
    y_ = np.zeros((m, nb_classes), dtype=int)
    y_[np.arange(m), y.astype(int)] = 1
    return y_
y_train = one_hot_encoding(y_train_)
y_test = one_hot_encoding(y_test_)
# Standard NN
class FeedForwardNN(dp.Model):
    
    def __init__(self, original_dim=original_dim, 
                 intermediate_dim=intermediate_dim, 
                 output_dim=output_dim):
        super(FeedForwardNN, self).__init__()
        
        # param
        self.fc1 = self.ReLu_Layer(original_dim, intermediate_dim, "fc1")
        self.fc2 = self.Linear_Layer(intermediate_dim, output_dim, "fc2")
        
    def forward(self, x):
        assert isinstance(x, Node)
        h1 = self.fc1(x)
        return self.fc2(h1).softmax()
             
    def loss(self, x, y=None):
        n = x.shape[0]
        y_pred = self.forward(x)
        loss = - y * y_pred.log()
        return loss.sum()/n
        
batch_size=64
def random_batch(x_train=x_train, y_train=y_train, batch_size=batch_size):
    n = x_train.shape[0]
    indices = np.random.randint(0,n, size=batch_size)
    return x_train[indices], y_train[indices]
model = FeedForwardNN()
x_, y_ = random_batch()
y_pred = model.forward(x=Node(x_))
loss = model.loss(x=Node(x_), y=Node(y_))
optimizer = dp.optimizer.RMS_Prop(model, x_train, y_train, {"alpha":0.001, "beta2":0.9}, batch_size=batch_size)
optimizer.train(steps=3000, print_each=100)
iteration	loss
1 	 2.44259090021864
100 	 0.3563105105925153
200 	 0.13585114737789328
300 	 0.3465467162522951
400 	 0.12276877525842722
500 	 0.24951843636899185
600 	 0.09596878225649089
700 	 0.23344820516729475
800 	 0.27883557784488594
900 	 0.1192808788147787
1000 	 0.10031648282381328
1100 	 0.09347440575806035
1200 	 0.11510715133199337
1300 	 0.06816317026451826
1400 	 0.142378640435843
1500 	 0.11166325522516689
1600 	 0.023504936999704743
1700 	 0.14019522352918665
1800 	 0.05362989091061093
1900 	 0.1336230729464381
2000 	 0.031242733851440187
2100 	 0.03128991045292066
2200 	 0.0401203391344262
2300 	 0.05551679313985189
2400 	 0.12693988665392572
2500 	 0.11841020484944927
2600 	 0.04701073373726761
2700 	 0.052503179447690604
2800 	 0.06533828229224387
2900 	 0.09233361711525746
3000 	 0.006093851614539158
# accuracy on train set
y_pred = model.forward(Node(x_train)).value.argmax(axis=1)
print ((y_pred==y_train_).sum()/y_train.shape[0]) 
0.9783
# accuracy on test set
y_pred = model.forward(Node(x_test)).value.argmax(axis=1)
print ((y_pred==y_test_).sum()/y_test.shape[0]) 
0.9702

Bayesian Neural Network

init_rho = -4.  # initial rho: softplus(-4) is about 0.018, i.e. small initial standard deviations

# TODO: global variable
sample_store = dict()

softplus = lambda x: (1. + x.exp()).log()  # sigma = softplus(rho) stays positive
NeuralNode = dp.NeuralNode


def _Linear_Bayes_Layer(fan_in, fan_out, name=None, param=dict(), initializer=None):
   
    assert isinstance(name, str)    
    assert initializer is None # not implemented
    
    mu_weight_name, mu_weight_value = NeuralNode.get_name_and_set_param(param, name, "mu_weight")
    rho_weight_name, rho_weight_value = NeuralNode.get_name_and_set_param(param, name, "rho_weight")
    mu_bias_name, mu_bias_value = NeuralNode.get_name_and_set_param(param, name, "mu_bias")
    rho_bias_name, rho_bias_value = NeuralNode.get_name_and_set_param(param, name, "rho_bias")
    
    if mu_weight_value is None:
        mu_weight_value = NeuralNode._initialize_W(fan_in, fan_out)
        mu_bias_value = NeuralNode._initialize_b(fan_in, fan_out)
        param[mu_weight_name] = mu_weight_value
        param[mu_bias_name] = mu_bias_value
  
        rho_weight_value = np.ones_like(mu_weight_value) * init_rho
        rho_bias_value = np.ones_like(mu_bias_value) * init_rho
        param[rho_weight_name] = rho_weight_value
        param[rho_bias_name] = rho_bias_value

    nodes = dict()  # type: Dict[str, Node]
    def set_node(nodes, value, name):
        node = Node(value, name)
        nodes[name] = node
        return node
    W_mu = set_node(nodes, mu_weight_value, mu_weight_name)
    b_mu = set_node(nodes, mu_bias_value, mu_bias_name)    
    W_rho = set_node(nodes, rho_weight_value, rho_weight_name)
    b_rho = set_node(nodes, rho_bias_value, rho_bias_name) 
    
    def _layer(X):
        # reparameterization trick: w = mu + softplus(rho) * eps with eps ~ N(0, I),
        # so gradients w.r.t. mu and rho flow through the sampled weights
        epsilon_w = np.random.normal(size=(fan_in, fan_out))
        epsilon_b = np.random.normal(size=(1, fan_out))
        W_ = W_mu + Node(epsilon_w) * softplus(W_rho)
        b_ = b_mu + Node(epsilon_b) * softplus(b_rho)
        
        # we need the samples for our loss
        weight_sample_name = NeuralNode._get_fullname(name, "weight_sample")
        bias_sample_name = NeuralNode._get_fullname(name, "bias_sample")
        sample_store[weight_sample_name] = W_
        sample_store[bias_sample_name] = b_
        return X.dot(W_) + b_

    return _layer, param, nodes


def Linear_Bayes_Layer(self, fan_in, fan_out, name=None):
    return self._set_layer(fan_in, fan_out, name, _Linear_Bayes_Layer)

setattr(dp.Model, "Linear_Bayes_Layer", Linear_Bayes_Layer)
class BayesFeedForwardNN(dp.Model):
    
    def __init__(self, original_dim=original_dim, 
                 intermediate_dim=intermediate_dim, 
                 output_dim=output_dim):
        super(BayesFeedForwardNN, self).__init__()
        
        # param
        self.fc1 = self.Linear_Bayes_Layer(original_dim, intermediate_dim, "fc1")
        self.fc2 = self.Linear_Bayes_Layer(intermediate_dim, output_dim, "fc2")
        
    def forward(self, x):
        assert isinstance(x, Node)
        h1 = self.fc1(x).relu()
        return self.fc2(h1).softmax()
             
    def loss(self, x, y=None):
        raise NotImplementedError()
    
    

$$
\begin{align}
\theta^* &= \text{arg}\min_\theta \left( \mathcal D_{KL} \left[ q({\bf w}\mid \theta) \mid \mid p({\bf w}) \right] - \mathbb E_{q({\bf w}\mid \theta)}\left[\log p(\mathcal D \mid {\bf w})\right]\right) \\
&= \text{arg}\min_\theta \left( \mathbb E_{q({\bf w}\mid \theta)}\left[ \log q({\bf w}\mid \theta)\right] - \mathbb E_{q({\bf w}\mid \theta)}\left[\log p({\bf w})\right] - \mathbb E_{q({\bf w}\mid \theta)}\left[\log p(\mathcal D \mid {\bf w})\right]\right)
\end{align}
$$

So equivalently we want to minimize the following loss:

$$
loss = \mathbb E_{q({\bf w}\mid \theta)}\left[\log q({\bf w}\mid \theta)\right] - \mathbb E_{q({\bf w}\mid \theta)}\left[ \log p({\bf w})\right] - \mathbb E_{q({\bf w}\mid \theta)}\left[\log p(\mathcal D \mid {\bf w})\right]
$$

Entropy

$ \mathbb E_{q({\bf w}\mid \theta)}\left[ \log q({\bf w}\mid \theta)\right] $

Note: this is the negative entropy of $ q({\bf w}\mid \theta) $, so minimizing the loss pushes $ q $ towards higher entropy.
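
For the fully factorized Gaussian $ q $ used here this term even has a closed form (a standard fact, stated only for intuition; the implementation below estimates it by Monte-Carlo as described next): $ \mathbb E_{q({\bf w}\mid \theta)}\left[\log q({\bf w}\mid \theta)\right] = -\frac{1}{2}\sum_i \log\left(2\pi e\,\sigma_i^2\right) $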

We approximate the expectations by Monte-Carlo estimates.

For each data point $ l $ of a mini-batch we draw a weight sample from the variational distribution:

$ {\bf w}^{(l)} \sim q({\bf w}\mid \theta) $

  • $ l \in [1, 2, \dots, m] $
  • $ m $: number of data points in the mini-batch
  • $ n $: total number of data points
  • $ M = n/m $: number of mini-batches

Then the approximate loss for a mini-batch is: $ loss_{mb} \approx \frac{1}{M}\left(\frac{1}{m}\sum_{l=1}^m \log q({\bf w}^{(l)}\mid \theta) - \frac{1}{m}\sum_{l=1}^m \log p({\bf w}^{(l)})\right) - \sum_{l=1}^m\log p({\bf y}^{(l)} \mid {\bf w}^{(l)}, {\bf x}^{(l)}) $
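
The samples $ {\bf w}^{(l)} $ are drawn with the reparameterization trick already used in `_Linear_Bayes_Layer` above: $ {\bf w} = \mu + \log(1 + e^{\rho}) \cdot \epsilon $ with $ \epsilon \sim \mathcal N(0, I) $, so that gradients can flow to $ \mu $ and $ \rho $. Note that the implementation below shares one weight sample per layer across all data points of a mini-batch (a single Monte-Carlo sample per forward pass). A minimal NumPy sketch of one such draw:

import numpy as np

def softplus_np(x):
    # plain NumPy counterpart of the softplus used above; keeps sigma positive
    return np.log1p(np.exp(x))

rng = np.random.default_rng(0)
mu = rng.normal(scale=0.1, size=5)  # variational means
rho = -4.0 * np.ones(5)             # softplus(-4) ~ 0.018: small initial sigma

eps = rng.standard_normal(5)
w_sample = mu + softplus_np(rho) * eps  # deterministic in (mu, rho) given eps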

Negative Log-Likelihood

$$
- \sum_{l=1}^m\log p({\bf y}^{(l)} \mid {\bf w}^{(l)}, {\bf x}^{(l)})
$$

Log-Posterior

$ \frac{1}{M} \cdot \frac{1}{m}\sum_{l=1}^m \log q({\bf w}^{(l)}\mid \theta) $

PI_2 = 2.0 * np.pi
LOG_PI_2 = float(np.log(PI_2))

def log_normal(sample, mu, sigma):
    return - 0.5 * LOG_PI_2 - sigma.log() - (sample-mu).square() / (2.*sigma.square())
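
As a quick sanity check (using plain floats instead of `dp.Node` objects, and SciPy only for the comparison; it is not otherwise needed by this notebook), the formula agrees with the Gaussian log-density:

import numpy as np
from scipy.stats import norm

sample, mu, sigma = 0.3, 0.1, 0.5
manual = -0.5 * LOG_PI_2 - np.log(sigma) - (sample - mu)**2 / (2. * sigma**2)
print(np.isclose(manual, norm.logpdf(sample, loc=mu, scale=sigma)))  # True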

Log-Prior

$$
- \frac{1}{M} \cdot \frac{1}{m}\sum_{l=1}^m \log p({\bf w}^{(l)})
$$

Mean-Field approximation: $ \log p({\bf w}^{(l)}) = \sum_i\log p(w_i^{(l)}) $

with Gaussian priors:

$ p(weight_i) = \mathcal N(weight_i \mid 0,\sigma_{weight}^2) $

$ p(bias_j) = \mathcal N(bias_j \mid 0,\sigma_{bias}^2) $

Alternatively, [Blu15] (Section 3.3) uses a scale mixture of two zero-mean Gaussians as prior; this is what the code below implements.

# scale mixture prior
def normal(w, var, mu=0):
    s = 1.0 / np.sqrt(PI_2 * var)
    g = (- (w - mu).square() / (2 * float(var))).exp()
    return s * g

var_g1 = np.exp(-1.)**2  # broad component:  sigma_1 = e^-1
var_g2 = np.exp(-7.)**2  # narrow component: sigma_2 = e^-7
p = 0.25                 # mixture weight of the broad component
def log_scale_mixture_prior(w, var_g1=var_g1, var_g2=var_g2):
    first = p * normal(w, var_g1)
    second = (1 - p) * normal(w, var_g2)
    return (first + second).log()
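As a small illustration (hypothetical values, relying only on the element-wise `Node` operations already used above): weights close to zero receive a much higher log prior than large weights, because the narrow component ($ \sigma_2 = e^{-7} $) puts a sharp spike at zero while the broad component ($ \sigma_1 = e^{-1} $) provides heavier tails.

# evaluate the log prior at a few example weight values
w_demo = Node(np.array([0.0, 0.01, 0.1, 1.0]))
print(log_scale_mixture_prior(w_demo).value)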
# Hyperparameter
sigma = Node(0.5)

M = x_train.shape[0] / batch_size  # number of mini-batches
def loss(self, x, y):  # TODO: nb_batches should be given as argument
    m = x.shape[0]
    y_pred = self.forward(x)
    
    # y must be one hot encoded
    assert (y.shape == y_pred.shape)
    
    # data likelihood - here cross-entropy
    neg_log_likelihood = - (y * y_pred.log()).sum()
        
    # prior
    log_prior = 0
    for sample_name, sample_value in sample_store.items():
        if sample_name.endswith("_sample"):
                #log_prior += log_normal(sample_value, 0., sigma).sum()
                log_prior += log_scale_mixture_prior(sample_value).sum()
      
    #variational posterior
    log_posterior = 0.
    for nnode_name, nnode_value in self.neural_nodes.items():
        nn = nnode_name.split("_")
        sample_weight = sample_store[nn[0]+"_weight_sample"]
        sample_bias = sample_store[nn[0]+"_bias_sample"]
        for node_name, node_value in nnode_value.nodes.items():
            if node_name.endswith("mu_weight"):
                mu_weight = node_value
            if node_name.endswith("mu_bias"):
                mu_bias = node_value
            if node_name.endswith("rho_weight"):
                rho_weight = node_value
            if node_name.endswith("rho_bias"):
                rho_bias = node_value
        log_posterior += log_normal(sample_weight, mu_weight, softplus(rho_weight)).sum()
        log_posterior += log_normal(sample_bias, mu_bias, softplus(rho_bias)).sum()
        
    # TODO: different weighting of complexity vs. data cost during training
    # see Section 3.4 of [Blu15]
    loss_ = neg_log_likelihood + 1/M*(log_posterior - log_prior)
    return loss_

BayesFeedForwardNN.loss = loss
bayes_model = BayesFeedForwardNN()
optimizer = dp.optimizer.RMS_Prop(bayes_model, x_train, y_train, 
                                  {"alpha":0.001, "beta2":0.9}, 
                                  batch_size=batch_size)
optimizer.train(steps=10000, print_each=100)
iteration	loss
1 	 555.8524743339495
100 	 432.19328154564505
200 	 425.5282834927088
300 	 399.7855085185947
400 	 388.1594144420811
500 	 381.9412499799676
600 	 385.74467175502974
700 	 386.5106035931867
800 	 375.8686212840836
900 	 382.0504437684173
1000 	 376.279215690231
1100 	 365.2414078904296
1200 	 368.46763266460977
1300 	 373.29237033984776
1400 	 366.15632683246065
1500 	 359.5466638302649
1600 	 359.9014343186206
1700 	 355.3065088551375
1800 	 350.5585995305178
1900 	 350.9459805287687
2000 	 343.48580990549965
2100 	 346.8447042158543
2200 	 341.12262461199634
2300 	 346.77652392774763
2400 	 336.1143906573853
2500 	 341.75459864389205
2600 	 344.6684848241343
2700 	 338.0187913673604
2800 	 330.91542750706253
2900 	 330.79186976649885
3000 	 336.3696240521043
3100 	 323.4057671181288
3200 	 337.49009462579403
3300 	 324.85056015522173
3400 	 324.4397445491907
3500 	 319.50658121097536
3600 	 325.44405122044145
3700 	 312.13655154264427
3800 	 311.7469325216087
3900 	 309.16729680392643
4000 	 308.5556604191104
4100 	 306.05650332097207
4200 	 309.97855270890955
4300 	 303.230174740776
4400 	 313.0212910641261
4500 	 296.3730682049053
4600 	 294.1913345613057
4700 	 293.3870182234443
4800 	 291.00676672249574
4900 	 292.9634531771089
5000 	 290.7470633403903
5100 	 289.5400624775794
5200 	 294.2945232297104
5300 	 282.7186033579018
5400 	 277.9272590949305
5500 	 280.47359964227235
5600 	 275.6599734576633
5700 	 272.6786374603832
5800 	 274.1234111397373
5900 	 270.58702263454455
6000 	 270.3045527272958
6100 	 270.80812968372027
6200 	 263.52604545639235
6300 	 265.99153401652933
6400 	 257.5574445410981
6500 	 256.4176346100624
6600 	 252.3497648491807
6700 	 255.1209298509782
6800 	 250.6626107619079
6900 	 253.66572698749567
7000 	 246.2654606405229
7100 	 246.43371278441495
7200 	 244.71832271550056
7300 	 243.38126570585783
7400 	 237.67298392804992
7500 	 236.48316562078
7600 	 234.29633576071845
7700 	 239.84568010804318
7800 	 231.40301653931675
7900 	 230.3234800528206
8000 	 243.0108008006537
8100 	 227.80015415791308
8200 	 232.81771053313645
8300 	 228.7706050735224
8400 	 233.4289043358385
8500 	 227.81839664238365
8600 	 223.07728741898333
8700 	 221.9306857638958
8800 	 220.76133921376731
8900 	 227.0592752719502
9000 	 218.94043683919875
9100 	 224.61937781453324
9200 	 216.39602376307548
9300 	 215.2404214852406
9400 	 216.42226419684224
9500 	 213.5575270615868
9600 	 210.940764736489
9700 	 215.79251639547186
9800 	 212.38041028005614
9900 	 215.0095979811585
10000 	 209.60373750918436
y_pred = bayes_model.forward(Node(x_test)).value
y_pred = y_pred.argmax(axis=1)
(y_pred == y_test_).sum()/y_test_.shape[0]

The expectation over the posterior can be approximated by replacing $ p({\bf w}\mid \mathcal D) $ with the learned variational distribution: $ p({\bf \hat y} \mid {\bf \hat x}) \approx \mathbb E_{q({\bf w}\mid \theta)}\left[p({\bf \hat y} \mid {\bf \hat x}, {\bf w})\right] $

MC-Estimate (the code below averages over $ T = 10 $ forward passes with freshly sampled weights):

$ p({\bf \hat y} \mid {\bf \hat x}) \approx \frac{1}{T}\sum_{t=1}^{T} p({\bf \hat y} \mid {\bf \hat x}, {\bf w}^{(t)}), \qquad {\bf w}^{(t)} \sim q({\bf w}\mid \theta) $

y_preds = []
for i in range(10):
    y_preds.append(bayes_model.forward(Node(x_test)).value)
y_preds = np.array(y_preds)
y_pred = y_preds.mean(axis=0)
y_pred = y_pred.argmax(axis=1)
(y_pred == y_test_).sum()/y_test_.shape[0]
# per-class standard deviation of the predictions across the MC samples
y_uncertainty = y_preds.std(axis=0)
# the test example whose predicted class has the highest standard deviation
i = y_uncertainty[range(len(y_uncertainty)), y_pred].argmax()
plt.figure()
plt.imshow(x_test[i].reshape(28, 28), cmap='gray')
plt.show()
print(y_preds.mean(axis=0)[i].max())
[figure: x_test[i] shown as a 28x28 grayscale image]
0.5789877528566475
i = y_preds.mean(axis=0)[range(len(y_pred)), y_pred].argmin()
print(y_pred[i])
5
# the test example with the lowest maximum class probability
plt.figure()
plt.imshow(x_test[i].reshape(28, 28), cmap='gray')
plt.show()
print(y_preds.mean(axis=0)[i].max())
[figure: x_test[i] shown as a 28x28 grayscale image]
0.23103863529292376

Literature