Exercises: Variance Reduction by Reparametrization

import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm

%matplotlib inline

Score Function

Show that

$\nabla_\phi \mathbb E_{q_\phi(z)} \left[ f(z) \right] = \mathbb E_{q_\phi(z)} \left[ f(z) \nabla_\phi \log q_\phi(z)\right]$

Assume that we can use Leibniz's rule for differentiation under the integral sign, i.e. that we can swap the integral and the nabla operator (partial derivatives).
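A sketch of the derivation (assuming, in addition, that the support of $q_\phi$ does not depend on $\phi$): using the log-derivative identity $\nabla_\phi q_\phi(z) = q_\phi(z) \nabla_\phi \log q_\phi(z)$,

\begin{align} \nabla_\phi \mathbb E_{q_\phi(z)} \left[ f(z) \right] &= \nabla_\phi \int f(z)\, q_\phi(z)\, dz = \int f(z)\, \nabla_\phi q_\phi(z)\, dz \\ &= \int f(z)\, q_\phi(z)\, \nabla_\phi \log q_\phi(z)\, dz = \mathbb E_{q_\phi(z)} \left[ f(z)\, \nabla_\phi \log q_\phi(z) \right] \end{align}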

Problem:

The crude MC estimator has high variance:

\begin{align} \nabla_\phi \mathbb E_{q_\phi(z)} \left[ f(z) \right] &= \mathbb E_{q_\phi(z)} \left[ f(z) \nabla_\phi \log q_\phi(z)\right]\\ & \approx \frac{1}{n} \sum_{i=1}^n f(z^{(i)})\nabla_\phi \log q_\phi(z^{(i)}) \end{align}

where the hidden variables $z^{(i)}$ are sampled from $q_\phi(z)$. Notation: $z^{(i)} \sim q_\phi(z)$.

As the function we use $f(z) = z^2$:

def f(z):
    return z**2

loc = mu = 10.
scale = 3.
sigma2 = scale**2
size = 100

def sample_z_from_q(loc=loc, scale=scale, size=size):
    return np.random.normal(loc=loc, scale=scale, size=size)

Let's keep things simple: we use a Gaussian for our distribution $q_\phi(z)$:

$\mathcal N (z \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp(-\frac{(z-\mu)^2}{2 \sigma^2})$

Show that the derivatives w.r.t. $\mu$ and $\sigma^2$ are:

$\frac{\partial}{\partial \mu} \log \mathcal N (z \mid \mu, \sigma^2) = \frac{z-\mu}{\sigma^2}$

$\frac{\partial}{\partial \sigma^2} \log \mathcal N (z \mid \mu, \sigma^2) = - \frac{1}{2 \sigma^2} + \frac{(z-\mu)^2}{2 (\sigma^2)^2}$
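For reference, both follow from the log-density $\log \mathcal N (z \mid \mu, \sigma^2) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(z-\mu)^2}{2\sigma^2}$: differentiating w.r.t. $\mu$ gives $\frac{z-\mu}{\sigma^2}$, and differentiating w.r.t. $\sigma^2$ (treated as a variable in its own right) gives $-\frac{1}{2\sigma^2} + \frac{(z-\mu)^2}{2(\sigma^2)^2}$.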

def dq_dmu(z, mu=mu, sigma2=sigma2):
    return (z - mu)/sigma2

def dq_dsigma2(z, mu=mu, sigma2=sigma2):
    # note the factor 2 in the second denominator, matching the formula above
    return -1/(2*sigma2) + (z - mu)**2/(2*sigma2**2)

Exercise: Implement the Crude MC estimator:

\begin{align} \frac{\partial}{\partial \mu} \mathbb E_{\mathcal N (z \mid \mu, \sigma^2) } \left[ f(z) \right] &= \mathbb E_{\mathcal N (z \mid \mu, \sigma^2) } \left[ f(z) \frac{\partial}{\partial \mu} \log \mathcal N (z \mid \mu, \sigma^2)\right]\\ & \approx \frac{1}{n} \sum_{i=1}^n f(z^{(i)}) \frac{\partial}{\partial \mu} \log \mathcal N (z \mid \mu, \sigma^2) = \frac{1}{n} \sum_{i=1}^n f(z^{(i)}) \frac{(z^{(i)}-\mu)}{\sigma^2} \end{align}
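One possible solution (a sketch; `crude_mc_estimate` and `get_crude_mc_estimates` are the names assumed by the cells below, and the first lines repeat the notebook's earlier definitions so the cell runs on its own):

```python
import numpy as np

# setup as defined earlier in the notebook
mu, scale = 10., 3.
sigma2 = scale**2

def f(z):
    return z**2

def sample_z_from_q(size):
    return np.random.normal(loc=mu, scale=scale, size=size)

def dq_dmu(z, mu=mu, sigma2=sigma2):
    return (z - mu)/sigma2

def crude_mc_estimate(size):
    # score-function estimator: average of f(z) * d/dmu log q(z)
    z = sample_z_from_q(size)
    return (f(z) * dq_dmu(z)).mean()

def get_crude_mc_estimates(size, repetitions=10000):
    # repeat the estimate many times to study its spread
    return np.array([crude_mc_estimate(size) for _ in range(repetitions)])
```

The true value is $\frac{\partial}{\partial \mu} \mathbb E[z^2] = \frac{\partial}{\partial \mu}(\mu^2 + \sigma^2) = 2\mu = 20$, so the histograms below should be centered around 20.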

estimates = get_crude_mc_estimates(size=100)
plt.hist(estimates);
#plt.xlim(15, 25)
estimates = get_crude_mc_estimates(size=10000)
plt.hist(estimates);
plt.xlim(15, 25)

Reparametrization for variance reduction

$z = g_\phi(\epsilon)$

$\nabla_\phi \mathbb E_{q_\phi(z)} \left[ f(z) \right] = \mathbb E_{p(\epsilon)} [\nabla_\phi f (g_\phi(\epsilon))]$ where $p(\epsilon)$ is a fixed distribution with no dependence on $\phi$.

Here:

• $p(\epsilon) = \mathcal N(0, 1)$
• $g_{\mu, \sigma^2}(\epsilon) = \mu + \epsilon \sigma$

Exercise

• Implement the reparametrized MC estimator:

$\frac{\partial}{\partial \mu} \mathbb E_{\mathcal N (z \mid \mu, \sigma^2) } \left[ f(z) \right] = \mathbb E_{p(\epsilon)} \left[\frac{\partial}{\partial \mu} f (g_{\mu, \sigma^2}(\epsilon))\right]$

• Find a mathematical expression for $\frac{\partial f}{\partial \mu}$ and use it in the estimator.
• Sample $\epsilon \sim \mathcal N(0, 1)$
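A possible implementation (a sketch): with $f(g_{\mu, \sigma^2}(\epsilon)) = (\mu + \epsilon \sigma)^2$ we get $\frac{\partial f}{\partial \mu} = 2(\mu + \epsilon \sigma)$, which the estimator averages over samples $\epsilon \sim \mathcal N(0, 1)$. `reparametrized_mc_estimate` is the name assumed by the cell below; `mu` and `scale` repeat the notebook's values so the cell runs on its own:

```python
import numpy as np

# parameters as defined earlier in the notebook
mu, scale = 10., 3.

def reparametrized_mc_estimate(size):
    # sample eps ~ N(0, 1), push it through g(eps) = mu + eps*sigma,
    # and average the pathwise derivative df/dmu = 2*(mu + eps*sigma)
    eps = np.random.normal(size=size)
    return (2 * (mu + eps * scale)).mean()
```

Each summand has variance $4\sigma^2 = 36$, so the estimator's variance should be roughly $36/n$, in line with the empirical fit below.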
def get_reparametrized_mc_estimates(size):
    estimates = []
    for i in range(10000):
        estimate = reparametrized_mc_estimate(size)
        estimates.append(estimate)
    return np.array(estimates)
# here we need just 100 samples to get a similar result as for 10000 simple_mc samples
estimates = get_reparametrized_mc_estimates(100)
plt.hist(estimates);
plt.xlim(15, 25)
# for 10000 samples
estimates = get_reparametrized_mc_estimates(10000)
plt.hist(estimates);
plt.xlim(15, 25)

Comparison between crude MC and reparametrized MC

crude_mc_var = []
reparam_mc_var = []
ns = np.array([10,30,90,300,600,900, 1500, 3000])
for i in ns:
    crude_mc_estimates = get_crude_mc_estimates(size=i)
    reparam_mc_estimates = get_reparametrized_mc_estimates(size=i)
    crude_mc_var.append(crude_mc_estimates.var(ddof=1))
    reparam_mc_var.append(reparam_mc_estimates.var(ddof=1))
crude_mc_var = np.array(crude_mc_var)
reparam_mc_var = np.array(reparam_mc_var)
plt.figure(figsize=(10, 8))
plt.loglog(ns, crude_mc_var, "bo", label="crude mc")
plt.loglog(ns, crude_mc_var, "b-")
plt.loglog(ns, reparam_mc_var, "go", label="reparametrized mc")
plt.loglog(ns, reparam_mc_var, "g-")
plt.xlabel("number of samples")
plt.ylabel("variance of the estimators")
plt.legend();

As you can see, both curves are straight lines in the log-log plot:

$\log y = a + b \log x$

therefore (with$c = \exp(a)$): $y = c x^b$

We know:

$var(estimate) = \frac{c}{n}$

so $b$ must be $-1$. Let's check this:

b_crude = (np.log(crude_mc_var[-1]) - np.log(crude_mc_var[0]))/(np.log(ns[-1])-np.log(ns[0]))
b_crude
b_reparm = (np.log(reparam_mc_var[-1]) - np.log(reparam_mc_var[0]))/(np.log(ns[-1])-np.log(ns[0]))
b_reparm

$a = \log y - b \log x$

# with b ≈ -1: a = log(y) - b*log(n) = log(y) + log(n)
as_reparam = np.log(reparam_mc_var) + np.log(ns)
cs_reparam = np.exp(as_reparam)
cs_reparam.mean()
# with b ≈ -1
as_crude = np.log(crude_mc_var) + np.log(ns)
cs_crude = np.exp(as_crude)
cs_crude.mean()

$var(estimate\_reparam) = \frac{35.8}{n}$

$var(estimate\_crude) = \frac{2648}{n}$

The variance reduction efficiency, with $\hat H_n$ the crude MC estimator and $\bar H_n$ the reparametrized one:

$\eta_{VR} =\frac{\text{var}\left(\hat H_n\right)-\text{var}\left(\bar H_n\right)}{\text{var}\left(\hat H_n\right)}$

eta = (2648-35.8)/2648
eta
# Here we need approx. 1.35% of the samples to get the same
# variance with our variance reduction technique
35.8/2648

Literature

• Diederik P. Kingma, Max Welling: Auto-Encoding Variational Bayes
• Diederik P. Kingma, Max Welling: Efficient Gradient-Based Inference through Transformations between Bayes Nets and Neural Nets