Exercises: Variance Reduction by Reparametrization

import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm

%matplotlib inline

Score Function

Show that

$ \nabla_\phi \mathbb E_{q_\phi(z)} \left[ f(z) \right] = \mathbb E_{q_\phi(z)} \left[ f(z) \nabla_\phi \log q_\phi(z)\right] $

Assume that we can use "Leibniz's rule for differentiation under the integral sign", i.e. that we may swap the integral and the nabla operator (partial derivatives).
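A sketch of the argument: write the expectation as an integral and use the log-derivative identity $ \nabla_\phi q_\phi(z) = q_\phi(z) \nabla_\phi \log q_\phi(z) $:

$ \begin{align} \nabla_\phi \mathbb E_{q_\phi(z)} \left[ f(z) \right] &= \nabla_\phi \int f(z)\, q_\phi(z)\, dz = \int f(z)\, \nabla_\phi q_\phi(z)\, dz \\ &= \int f(z)\, q_\phi(z)\, \nabla_\phi \log q_\phi(z)\, dz = \mathbb E_{q_\phi(z)} \left[ f(z) \nabla_\phi \log q_\phi(z) \right] \end{align} $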


The crude MC estimator has high variance:

$ \begin{align} \nabla_\phi \mathbb E_{q_\phi(z)} \left[ f(z) \right] &= \mathbb E_{q_\phi(z)} \left[ f(z) \nabla_\phi \log q_\phi(z)\right]\\ & \approx \frac{1}{n} \sum_{i=1}^n f(z^{(i)})\nabla_\phi \log q_\phi(z^{(i)}) \end{align} $

with sampling of the hidden variables $ z^{(i)} $ from $ q_\phi(z) $. Notation: $ z^{(i)} \sim q_\phi(z) $

As test function we use $ f(z) = z^2 $.

# assumed example parameters (not given in the extracted notebook;
# mu = 10, sigma = 3 is consistent with the plots and constants below)
mu = loc = 10.
scale = 3.
size = 100
sigma2 = scale**2

def f(z):
    return z**2

def sample_z_from_q(loc=loc, scale=scale, size=size):
    # draw z^(i) ~ q_phi(z) = N(loc, scale^2)
    return np.random.normal(loc=loc, scale=scale, size=size)

Let's keep things simple: we use a Gaussian for our distribution $ q_\phi(z) $

$ \mathcal N (z \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp(-\frac{(z-\mu)^2}{2 \sigma^2}) $

Show that the derivatives of the log-density w.r.t. $ \mu $ and $ \sigma^2 $ are:

$ \frac{\partial}{\partial \mu} \log \mathcal N (z \mid \mu, \sigma^2) = \frac{(z-\mu)}{\sigma^2}\\ $

$ \frac{\partial}{\partial \sigma^2} \log \mathcal N (z \mid \mu, \sigma^2) = - \frac{1}{2 \sigma^2}+ \frac{(z-\mu)^2}{2 (\sigma^2)^2}\\ $
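Both follow directly from the log-density:

$ \log \mathcal N (z \mid \mu, \sigma^2) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(z-\mu)^2}{2\sigma^2} $

Differentiating w.r.t. $ \mu $ gives the first expression; differentiating w.r.t. $ \sigma^2 $ (treated as a single variable) gives the second.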

def dq_dmu(z, mu=mu, sigma2=sigma2):
    # d/dmu log N(z | mu, sigma^2)
    return (z - mu)/sigma2

def dq_dsigma2(z, mu=mu, sigma2=sigma2):
    # d/dsigma^2 log N(z | mu, sigma^2)
    return -1/(2*sigma2) + (z - mu)**2/(2*sigma2**2)

Exercise: Implement the Crude MC estimator:

$ \begin{align} \frac{\partial}{\partial \mu} \mathbb E_{\mathcal N (z \mid \mu, \sigma^2) } \left[ f(z) \right] &= \mathbb E_{\mathcal N (z \mid \mu, \sigma^2) } \left[ f(z) \frac{\partial}{\partial \mu} \log \mathcal N (z \mid \mu, \sigma^2)\right]\\ & \approx \frac{1}{n} \sum_{i=1}^n f(z^{(i)}) \frac{\partial}{\partial \mu} \log \mathcal N (z \mid \mu, \sigma^2) = \frac{1}{n} \sum_{i=1}^n f(z^{(i)}) \frac{(z^{(i)}-\mu)}{\sigma^2} \end{align} $
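One possible implementation (a sketch, not the reference solution: the helper names and the repetition count are choices of mine, and `loc = 10`, `scale = 3` are assumed parameter values consistent with the histograms and variance constants below):

```python
import numpy as np

# assumed example parameters: q(z) = N(mu, sigma^2) with mu = 10, sigma = 3
loc, scale = 10., 3.
sigma2 = scale**2

def f(z):
    return z**2

def crude_mc_estimate(size, mu=loc, sigma2=sigma2):
    # draw z^(i) ~ N(mu, sigma^2), then average f(z) * d/dmu log N(z | mu, sigma^2)
    z = np.random.normal(loc=mu, scale=np.sqrt(sigma2), size=size)
    return np.mean(f(z) * (z - mu) / sigma2)

def get_crude_mc_estimates(size, n_repetitions=10000):
    # repeat the estimator many times to inspect its distribution
    return np.array([crude_mc_estimate(size) for _ in range(n_repetitions)])
```

Since $ \mathbb E[z^2] = \mu^2 + \sigma^2 $, the true gradient is $ \frac{\partial}{\partial \mu}(\mu^2 + \sigma^2) = 2\mu = 20 $, so the estimates should scatter around 20.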

estimates = get_crude_mc_estimates(size=100)
plt.hist(estimates, bins=50)  # histogram of the estimates
#plt.xlim(15, 25)

estimates = get_crude_mc_estimates(size=10000)
plt.hist(estimates, bins=50)
plt.xlim(15, 25)

Reparametrization for variance reduction

$ z = g_\phi(\epsilon) $

$ \nabla_\phi \mathbb E_{q_\phi(z)} \left[ f(z) \right] = \mathbb E_{p(\epsilon)} [\nabla_\phi f (g_\phi(\epsilon))] $

where $ p(\epsilon) $ is a fixed distribution with no dependence on $ \phi $.

Here:

  • $ p(\epsilon) = \mathcal N(0, 1) $
  • $ g_{\mu, \sigma^2}(\epsilon) = \mu + \epsilon \sigma $
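A quick numerical sanity check (with example values $ \mu = 10 $, $ \sigma = 3 $, assumed as above) that the reparametrization indeed produces samples from $ \mathcal N(\mu, \sigma^2) $:

```python
import numpy as np

mu, sigma = 10., 3.                  # assumed example values
rng = np.random.default_rng(0)

eps = rng.standard_normal(500_000)   # eps ~ N(0, 1), independent of mu and sigma
z = mu + eps * sigma                 # z = g_{mu, sigma^2}(eps)

print(z.mean(), z.std())             # close to mu = 10 and sigma = 3
```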


  • Implementing the reparametrized MC estimator

$ \frac{\partial}{\partial \mu} \mathbb E_{\mathcal N (z \mid \mu, \sigma^2) } \left[ f(z) \right] = \mathbb E_{p(\epsilon)} \left[\frac{\partial}{\partial \mu} f (g_{\mu, \sigma^2}(\epsilon))\right] $

  • Find a mathematical expression for $ \frac{\partial f}{\partial \mu} $ and use it in the estimator.
  • Sample $ \epsilon \sim \mathcal N(0, 1) $

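With $ f(z) = z^2 $ and $ z = g_{\mu, \sigma^2}(\epsilon) = \mu + \epsilon \sigma $ we get $ \frac{\partial}{\partial \mu} f(g_{\mu, \sigma^2}(\epsilon)) = 2(\mu + \epsilon \sigma) $. A possible sketch of the single-estimate helper (again with the assumed values $ \mu = 10 $, $ \sigma = 3 $):

```python
import numpy as np

loc, scale = 10., 3.  # assumed example parameters

def reparametrized_mc_estimate(size, mu=loc, sigma=scale):
    eps = np.random.normal(size=size)  # eps ~ N(0, 1)
    # d/dmu f(g(eps)) = d/dmu (mu + eps*sigma)^2 = 2 * (mu + eps*sigma)
    return np.mean(2 * (mu + eps * sigma))
```

A single sample of $ 2(\mu + \epsilon \sigma) $ has variance $ 4\sigma^2 = 36 $, far below the $ \approx 2648 $ of the crude estimator, which is where the variance reduction below comes from.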
def get_reparametrized_mc_estimates(size):
    estimates = []
    for i in range(10000):
        estimates.append(reparametrized_mc_estimate(size))
    return np.array(estimates)
# here we need just 100 samples to get a similar result as for 10000 crude MC samples
estimates = get_reparametrized_mc_estimates(100)
plt.hist(estimates, bins=50)
plt.xlim(15, 25)

# for 10000 samples
estimates = get_reparametrized_mc_estimates(10000)
plt.hist(estimates, bins=50)
plt.xlim(15, 25)

Comparison between crude MC and reparametrized MC

crude_mc_var = []
reparam_mc_var = []
ns = np.array([10, 30, 90, 300, 600, 900, 1500, 3000])
for n in ns:
    crude_mc_estimates = get_crude_mc_estimates(size=n)
    reparam_mc_estimates = get_reparametrized_mc_estimates(size=n)
    crude_mc_var.append(np.var(crude_mc_estimates))
    reparam_mc_var.append(np.var(reparam_mc_estimates))
crude_mc_var = np.array(crude_mc_var)
reparam_mc_var = np.array(reparam_mc_var)
plt.figure(figsize=(10, 8))
plt.loglog(ns, crude_mc_var, "bo", label="crude mc")
plt.loglog(ns, crude_mc_var, "b-")
plt.loglog(ns, reparam_mc_var, "go", label="reparametrized mc")
plt.loglog(ns, reparam_mc_var, "g-")
plt.xlabel("number of samples")
plt.ylabel("variance of the estimators")
plt.legend()

As you can see from the plot, both curves are straight lines in a log-log plot:

$ \log y = a + b \log x $

therefore (with$ c = \exp(a) $): $ y = c x^b $

We know:

$ var(estimate) = \frac{c}{n} $

so $ b $ must be $ -1 $. Let's check this:

b_crude = (np.log(crude_mc_var[-1]) - np.log(crude_mc_var[0]))/(np.log(ns[-1]) - np.log(ns[0]))
b_reparam = (np.log(reparam_mc_var[-1]) - np.log(reparam_mc_var[0]))/(np.log(ns[-1]) - np.log(ns[0]))

$ a = \log y - b \log x $

# with b ≈ -1: a = log(y) + log(x)
as_reparam = np.log(reparam_mc_var) + np.log(ns)
cs_reparam = np.exp(as_reparam)

# with b ≈ -1
as_crude = np.log(crude_mc_var) + np.log(ns)
cs_crude = np.exp(as_crude)

$ var(estimate\_reparam) = \frac{35.8}{n} $

$ var(estimate\_crude) = \frac{2648}{n} $

$ \eta_{VR} = \frac{\text{var}\left(\hat H_n\right) - \text{var}\left(\bar H_n\right)}{\text{var}\left(\hat H_n\right)} $

where $ \hat H_n $ denotes the crude MC estimator and $ \bar H_n $ the reparametrized one.

eta = (2648 - 35.8)/2648
# eta ≈ 0.9865, i.e. with the reparametrized estimator we need only
# approx. 1.35% of the samples to reach the same variance


  • Diederik P. Kingma, Max Welling: Auto-Encoding Variational Bayes, ICLR 2014.
  • Diederik P. Kingma, Max Welling: Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets, ICML 2014.