# Exercise - Multivariate Gaussian

## Introduction

To define a univariate gaussian distribution, we only need the mean and the variance. For a multivariate gaussian we need the mean vector and the covariance matrix. In this notebook you will implement the functions to calculate both: a vectorized mean function and a covariance_matrix function.

In a two-dimensional vector space, the multivariate gaussian is called a bivariate gaussian. We use it throughout this notebook so that we can still visualize our data.
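For reference, the two densities can be written as follows. The notation is the standard textbook one and is introduced here only as a reminder: $\mu$ and $\sigma^2$ are the mean and variance, $\boldsymbol{\mu}$ is the mean vector, $\Sigma$ the covariance matrix, and $p$ the number of dimensions ($p = 2$ in the bivariate case).

```latex
\mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

\mathcal{N}({\bf x} \mid \boldsymbol{\mu}, \Sigma)
  = \frac{1}{\sqrt{(2\pi)^p \det\Sigma}}
    \exp\left(-\frac{1}{2}({\bf x}-\boldsymbol{\mu})^\top \Sigma^{-1} ({\bf x}-\boldsymbol{\mu})\right)
```

For $p = 1$ the second expression reduces to the first, with $\Sigma = \sigma^2$.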

To detect errors in your own code, execute the notebook cells containing assert or assert_almost_equal. These statements raise an exception as long as the calculated result is not yet correct.
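A minimal sketch of how such a check behaves, using made-up values (`expected` and `computed` are hypothetical names, not part of the exercise): a correct result passes silently, a wrong one raises an AssertionError.

```python
import numpy as np

# Hypothetical values, only for illustration.
expected = 0.5
computed = 1.0 / 2.0

# Passes silently: the values agree to the default 7 decimal places.
np.testing.assert_almost_equal(computed, expected)

# A wrong result raises an AssertionError, which is caught here to show it.
try:
    np.testing.assert_almost_equal(computed + 0.1, expected)
    raised = False
except AssertionError:
    raised = True

print(raised)
```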

## Requirements

### Knowledge

To complete this exercise notebook, you should possess knowledge about the following topics.

• Univariate gaussian
• Multivariate gaussian
• Variance
• Covariance
• Covariance matrix

The following material can help you to acquire this knowledge:

• Gaussian, variance, covariance, covariance matrix:
  • Chapter 3 of the Deep Learning Book [GOO16]
  • Chapter 1 of the book Pattern Recognition and Machine Learning by Christopher M. Bishop [BIS07]
• Univariate gaussian:
  • Video and the following of Khan Academy [KHA18]
• Multivariate gaussian:
  • Video PP 6.1 and following in the playlist Probability Primer of the YouTube user mathematicalmonk [MAT18]

### Python Modules

# External Modules
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from scipy.stats import multivariate_normal as mvn
from mpl_toolkits.mplot3d import Axes3D
%%html
<!-- prevents output cells from receiving a height limit and a scroll bar -->
<style>
.output_wrapper, .output {
height:auto !important;
max-height:1000px;  /* your desired max-height here */
}
.output_scroll {
}
</style>

## Exercises

From an experiment we obtain a random sample of size $N$, ${\bf x}_1, \dots, {\bf x}_N$, from a population $f(x)$, where each sample ${\bf x}=[x_1, \dots, x_p]$ contains $p$ variables. Each variable $x_1, \dots, x_p$ is normally distributed, and the variables are independent of each other ($x_i \perp x_j$ for $i \neq j$).

Sidenote:

In most literature, bold variables are used to denote different vectors (examples), whereas non-bold variables are used for the elements of one vector.

Usually the independence sign has two vertical lines, but unfortunately it is not part of the standard LaTeX package.
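In array terms, the notation could map to indexing as in the following sketch. `X_toy` and the other names are hypothetical and only serve as illustration: rows are samples (bold symbols), columns are variables (non-bold symbols).

```python
import numpy as np

# Hypothetical toy data: N = 3 samples (rows), p = 2 variables (columns).
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])

N_toy, p_toy = X_toy.shape

x_2 = X_toy[1]                # bold x_2: the second sample, a vector of length p
x_2_1 = X_toy[1, 0]           # non-bold x_1 of that sample: a single number
first_variable = X_toy[:, 0]  # all N observations of the variable x_1

print(x_2, x_2_1, first_variable)
```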

N = 1000
p = 2
X = np.zeros((N,p))

mean_x1 = -1
mean_x2 = 2
std_dev_x1 = np.sqrt(0.6)
std_dev_x2 = np.sqrt(2.0)

X[:,0] = np.random.normal(size=N, loc=mean_x1, scale=std_dev_x1)
X[:,1] = np.random.normal(size=N, loc=mean_x2, scale=std_dev_x2)
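Because the two variables are drawn independently, their covariance is zero, so the same kind of sample could also be described by a mean vector and a diagonal covariance matrix. The following is only a sketch of that view; the seeded generator and the names `X_demo`, `mean_vec` and `cov_diag` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, assumed here for reproducibility

N_demo = 1000
mean_vec = np.array([-1.0, 2.0])  # same means as above
cov_diag = np.diag([0.6, 2.0])    # variances on the diagonal, zero covariance

# One call draws all N_demo samples; each row is one sample with 2 variables.
X_demo = rng.multivariate_normal(mean_vec, cov_diag, size=N_demo)

print(X_demo.shape)
```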

The following plot visualizes the drawn sample.

fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(111)  # axes object used by the calls below
ax.scatter(X[...,0], X[...,1], c='b', marker='x', label="data")
Y_lim = -1,5
X_lim = -4,2

ax.set_xlim(X_lim[0],X_lim[1])
ax.set_ylim(Y_lim[0],Y_lim[1])
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
plt.legend()
plt.draw()
plt.show()

### Exercise - Mean and Covariance Matrix

Write a vectorized mean function and a covariance_matrix function using only the following numpy methods, functions and operators: +, -, *, /, np.sum(..), np.dot(..), np.shape(..), X.T

Hint:

You don't need all of them (depending on your implementation). You can use your implemented mean function inside covariance_matrix.
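As a reminder, the standard sample estimators are shown below. The hat notation is introduced only for this note. The covariance test further down compares against np.cov, which by default normalizes by $N-1$ rather than $N$, so this is the normalization your implementation needs in order to pass.

```latex
\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{n=1}^{N} {\bf x}_n

\hat{\Sigma} = \frac{1}{N-1} \sum_{n=1}^{N}
  ({\bf x}_n - \hat{\boldsymbol{\mu}}) ({\bf x}_n - \hat{\boldsymbol{\mu}})^\top
```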

# Implement this function

def mean(X):
    """
    Calculates the vectorized mean over the first axis

    :X: input-matrix.
    :X type: numpy ndarray with n dimensions (n >= 1) of type float.

    :return: vectorized mean of X.
    :return type: numpy ndarray with n-1 dimensions of type float.
    """
    raise NotImplementedError

# Your implementations should pass this test:

mu = mean(X)
np.testing.assert_almost_equal(mu, X.mean(axis=0))

def covariance_matrix(X):
    """
    Calculates the covariance matrix of X

    :X: input-matrix.
    :X type: numpy ndarray with shape (n,m) of type float.

    :return: covariance matrix of X.
    :return type: numpy ndarray with shape (m,m) of type float.
    """
    raise NotImplementedError

# Your implementations should pass this test:

cov = covariance_matrix(X)
cov_np = np.cov(X, rowvar=0)
np.testing.assert_array_almost_equal(cov, cov_np)
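The rowvar=0 argument in the test matters because np.cov treats rows as variables by default. A small sketch with made-up data (`X_toy` is a hypothetical array) shows the difference in the resulting shapes:

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows), 2 variables (columns).
X_toy = np.array([[1.0, 2.0],
                  [2.0, 4.5],
                  [3.0, 5.5],
                  [4.0, 8.0]])

cov_cols = np.cov(X_toy, rowvar=False)  # columns are variables -> shape (2, 2)
cov_rows = np.cov(X_toy)                # default: rows are variables -> shape (4, 4)

# The diagonal holds the per-variable variances with N-1 normalization (ddof=1).
variances = X_toy.var(axis=0, ddof=1)

print(cov_cols.shape, cov_rows.shape)
```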

### Plots

To examine the underlying distribution of our drawn sample, we could, for example, plot the individual dimensions as histograms:

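fig = plt.figure(figsize=(12,5), dpi=80)
bins = 15

# left panel: histogram of the first variable
ax = fig.add_subplot(121)
ax.hist(X[:,0], bins=bins)
ax.set_xlabel(r'$x_1$', fontsize=20)
ax.set_ylabel(r'$p$', fontsize=20)

# right panel: histogram of the second variable
ax2 = fig.add_subplot(122)
ax2.hist(X[:,1], bins=bins)
ax2.set_xlabel(r'$x_2$', fontsize=20)
ax2.set_ylabel(r'$p$', fontsize=20)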

plt.draw()
plt.show()

Since the sample we use is only two-dimensional, we can even visualize both dimensions together:

%matplotlib notebook

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
hist, xedges, yedges = np.histogram2d(X[:,0], X[:,1], bins=10, range=[[X_lim[0], X_lim[1]], [Y_lim[0], Y_lim[1]]])

xpos, ypos = np.meshgrid(xedges[:-1], yedges[:-1])
xpos = xpos.flatten('F')
ypos = ypos.flatten('F')
zpos = np.zeros_like(xpos)

dx = 0.5 * np.ones_like(zpos)
dy = dx.copy()
dz = hist.flatten()

ax.bar3d(xpos, ypos, zpos, dx, dy, dz, color='cyan', zsort='average')
ax.view_init(45, 45) # 90,90 = topdown

plt.draw()
plt.show()

Considering the plots of the individual distributions and the joint distribution, it is not far-fetched to assume a bivariate normal distribution.

Using the mean and the covariance matrix of the sample, it is possible to estimate the underlying gaussian probability density function:

%matplotlib notebook

M=60
A = np.linspace(X_lim[0],X_lim[1], M)
B = np.linspace(Y_lim[0],Y_lim[1], M)
A_, B_ = np.meshgrid(A, B)
xy = np.column_stack([A_.flat, B_.flat])

# density values at the grid points
Z = mvn.pdf(xy, mu, cov).reshape(A_.shape)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel(r'$x_1$', fontsize=20)
ax.set_ylabel(r'$x_2$', fontsize=20)
ax.set_zlabel(r'$p$', fontsize=20)
ax.plot_surface(A_,B_,Z,cmap=cm.coolwarm)
ax.view_init(45, 45) # 90,90 = topdown
plt.draw()
plt.show()
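As a quick sanity check of the surface, the peak of a bivariate gaussian density lies at the mean and has the closed form $1 / (2\pi\sqrt{\det\Sigma})$. The sketch below assumes the true parameters used to generate the data (`mu_true` and `cov_true` are names chosen for this illustration); the estimated mu and cov should give a similar value.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Assumed parameters: the true values used to generate the data above.
mu_true = np.array([-1.0, 2.0])
cov_true = np.diag([0.6, 2.0])

# Density at the mean, i.e. the peak of the surface.
peak = mvn.pdf(mu_true, mean=mu_true, cov=cov_true)

# Closed form for p = 2: 1 / (2 * pi * sqrt(det(cov))).
peak_closed_form = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov_true)))

print(peak, peak_closed_form)
```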

One way to visualize both the drawn sample and the estimated bivariate gaussian is to combine a 2D scatter plot with a contour plot. In a contour plot, all points on a line of the same colour have the same value in the third dimension, here the same density:

%matplotlib inline

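# arbitrary contour levels
max_level = 0.12  # avoids shadowing the built-in max
contour_level = [max_level * 1/3, max_level * 2/3, max_level]
fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(111)  # new 2D axes, not the 3D axes from the previous cell

ax.contour(A, B, Z, levels = contour_level)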
ax.scatter(X[...,0], X[...,1], c='b', marker='x', label="data")
ax.set_xlim(X_lim[0],X_lim[1])
ax.set_ylim(Y_lim[0],Y_lim[1])
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
plt.draw()
plt.show()
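Instead of a hard-coded maximum of 0.12, the contour levels could also be derived from the peak of the estimated density. This is only a sketch: `mu_est` and `cov_est` are assumed stand-ins for the estimated mean and covariance matrix, and the chosen fractions are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Assumed stand-ins for the estimated mean and covariance matrix.
mu_est = np.array([-1.0, 2.0])
cov_est = np.diag([0.6, 2.0])

# The density is largest at the mean, so this bounds every useful level.
peak = mvn.pdf(mu_est, mean=mu_est, cov=cov_est)

# Three levels at 25 %, 50 % and 75 % of the peak.
# matplotlib's contour expects the levels in increasing order.
levels = peak * np.array([0.25, 0.5, 0.75])

print(levels)
```

Levels chosen this way always lie below the peak, so every contour line is actually drawn.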

### Exercise - Joint Normality

If $x_1, \dots, x_p$ are all normally distributed, does this always imply that they are jointly normally distributed, i.e. does the vector $(x_1, \dots, x_p)$ have a multivariate normal distribution?

## Literature

### Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does not, however, apply to any referenced external media (e.g., images).

Exercise - Multivariate Gaussian
by Christian Herta, Klaus Strohmenger