MLFundamentals  Lineare Regression  Exercise: Multivariate Linear Regression
Table of Contents
Introduction
In this exercise you will implement the multiivariate linear regression, a model with two or more predictors and one response variable (opposed to one predictor using univariate linear regression). The whole exercise consists of the following steps:
 Generate values for two predictors/features $(x_1, x_2)$
 Implement a linear function as hypothesis (model)
 Generate values for the response (Y / target values)
 Plot the $((x_1, x_2), y)$ values in a 3D plot.
 Write a function to quantify your model (cost function)
 Implement the gradient descent algorithm to train your model (optimizer)
 Visualize your training process and results
 Apply feature scaling (pen & paper)
Requirements
Knowledge
You should have a basic knowledge of:
 Univariate lineare regression
 Multivariate linear regression
 Squared error
 Gradient descent
 numpy
 matplotlib
Suitable sources for acquiring this knowledge are:
 Multivariate Linear Regression Notebook by Christian Herta and his lecture slides (German)
 Chapter 2 of the open classroom Machine Learning by Andrew Ng
 Chapter 5.1 of Deep Learning by Ian Goodfellow
 Some parts of chapter 1 and 3 of Pattern Recognition and Machine Learning by Christopher M. Bishop
 numpy quickstart
 Matplotlib tutorials
Python Modules
By deep.TEACHING convention, all python modules needed to run the notebook are loaded centrally at the beginning.
# External Modules
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D
from matplotlib import cm
%matplotlib notebook
Exercise  Multivariate Linear Regression
We will only use two features in this notebook, so we are still able to plot them together with the target in a 3D plot. But your implementation should also be capable of handling more (except the plots).
Create Features
First we will create some features. The features should be in a 2D numpy array, the rows seperating the different feature vectors, the columns containing the features. Each feature should be uniformly distributed in a specifiable range.
Task:
Implement the function to generate a feature matrix (numpy array).
def create_feature_matrix(sample_size, n_features, x_min, x_max):
'''creates random feature vectors based on a lienar function in a given interval
Args:
sample_size: number feature vectors
n_features: number of features for each vector
x_min: lower bound value ranges
x_max: upper bound value ranges
Returns:
x: 2D array containing feature vecotrs with shape (sample_size, n_features)
'''
raise NotImplementedError("You should implement this!")
sample_size = 100
n_features = 2
x_min = [1.5,0.5]
x_max = [11.,5.0]
X = create_feature_matrix(sample_size, n_features, x_min, x_max)
assert len(X[:,0]) == sample_size
assert len(X[0,:]) == n_features
for i in range(n_features):
assert np.max(X[:,i]) <= x_max[i]
assert np.min(X[:,i]) >= x_min[i]
Linear Hypothesis
A short recap, a hypothesis $h_\theta(x)$ is a certain function that we believe is similar to a target function that we like to model. A hypothesis $h_\theta(x)$ is a function of $x$ with fixed parameters $\theta$.
Here we have $n$ features $x = [x_1, \ldots, x_n ]$ and $n+1$ $\theta$s:
adding an extra element to $x$ for convenience, this could also be rewritten as:
with $x_0 = 1$ for all feature vectors.
Or treating $x$ and $\theta$ as vectors:
with:
Or for the whole data set at once:
with:
Task:
Implement hypothesis $h_\theta(x)$ in the method linear_hypothesis
and return it as a function. Implement it the computationally efficient (pythonic) way by not using any loops and handling all data at once (use $X$ respectively $X'$).
Hint:
Of course you are free to implement as many helper functions as you like, e.g. for transforming $X$ to $X'$, though you do not have to. Up to you.
def linear_hypothesis(thetas):
''' Combines given list argument in a linear equation and returns it as a function
Args:
thetas: list of coefficients
Returns:
lambda that models a linear function based on thetas and x
'''
raise NotImplementedError("You should implement this!")
(len(X[:,0]), 1)
assert len(linear_hypothesis([.1,.2,.3])(X)) == sample_size
Generate Target Values
Task:
Use your implemented linear_hypothesis
inside the next function to generate some target values $Y$. Additionally add some Gaussian noise.
def generate_targets(X, theta, sigma):
''' Combines given arguments in a linear equation with X,
adds some Gaussian noise and returns the result
Args:
X: 2D numpy feature matrix
theta: list of coefficients
sigma: standard deviation of the gaussian noise
Returns:
target values for X
'''
raise NotImplementedError("You should implement this!")
theta = (2., 3., 4.)
sigma = 3.
y = generate_targets(X, theta, sigma)
assert len(y) == sample_size
Plot The Data
Task:
Plot the data $D = \{(x^{(1)}_1,x^{(1)}_2,y^{(1)}), \ldots, (x^{(n)}_1,x^{(n)}_2,y^{(n)})\}$ in a 3D scatter plot. The plot should look like the following:
Sidenote:
The command %matplotlib notebook
(instead of %matplotlib inline
) creates an interactive (e.g. rotatable) plot.
%matplotlib notebook
def plot_data_scatter(features, targets):
""" Plots the features and the targets in a 3D scatter plot
Args:
features: 2D numpyarray features
targets: ltargets
"""
raise NotImplementedError("You should implement this!")
plot_data_scatter(X, y)
Cost Function
A cost function $J$ depends on the given training data $D$ and hypothesis $h_\theta(x)$. In the context of the linear regression, the cost function measures how wrong a model is regarding its ability to estimate the relationship between $x$ and $y$ for specific $\theta$ values. Later we will treat this as an optimization problem and try to minimize the cost function $J_D(\theta)$ to find optimal $\theta$ values for our hypothesis $h_\theta(x)$. The cost function we use in this exercise is the MeanSquaredError cost function:
$J_D(\theta)=\frac{1}{2m}\sum_{i=1}^{m}{(h_\theta(x_i)y_i)^2}$
Implement the cost function $J_D(\theta)$ in the method mse_cost_function
. The method should return a function that takes the values of $\theta_0$ and $\theta_1$ as an argument.
Sidenote, the terms loss function or error function are often used interchangeably in the field of Machine Learning.
def mse_cost_function(x, y):
''' Implements MSE cost function as a function J(theta) on given traning data
Args:
x: vector of x values
y: vector of ground truth values y
Returns:
lambda J(theta) that models the cost function
'''
raise NotImplementedError("You should implement this!")
J = mse_cost_function(X, y)
print(J(theta))
Gradient Descent
A short recap, the gradient descent algorithm is a firstorder iterative optimization for finding a minimum of a function. From the current position in a (cost) function, the algorithm steps proportional to the negative of the gradient and repeats this until it reaches a local or global minimum and determines. Stepping proportional means that it does not go entirely in the direction of the negative gradient, but scaled by a fixed value $\alpha$ also called the learning rate. Implementing the following formalized update rule is the core of the optimization process:
$\theta_{j_{new}} \leftarrow \theta_{j_{old}}  \alpha * \frac{\delta}{\delta\theta_{j_{old}}} J(\theta_{old})$
Task:
Implement the function to update all theta values.
def update_theta(x, y, theta, learning_rate):
''' Updates learnable parameters theta
The update is done by calculating the partial derivities of
the cost function including the linear hypothesis. The
gradients scaled by a scalar are subtracted from the given
theta values.
Args:
x: 2D numpy array of x values
y: array of y values corresponding to x
theta: current theta values
learning_rate: value to scale the negative gradient
Returns:
t0: Updated theta_0
t1: Updated theta_1
'''
raise NotImplementedError("You should implement this!")
Using the update_theta
method, you can now implement the gradient descent algorithm. Iterate over the update rule to find the values for $\theta$ that minimize our cost function $J_D(\theta)$. This process is often called training of a machine learning model.
Task:
 Implement the function for the gradient descent.
 Create a history of all theta and cost values and return them.
def gradient_descent(learning_rate, theta, iterations, x, y):
''' Minimize theta values of a linear model based on MSE cost function
Args:
learning_rate: scalar, scales the negative gradient
theta: initial theta values
x: vector, x values from the data set
y: vector, y values from the data set
iterations: scalar, number of theta updates
Returns:
history_cost: cost after each iteration
history_theta: Updated theta values after each iteration
'''
raise NotImplementedError("You should implement this!")
Training and Evaluation
Task:
Choose an appropriate learning rate, number of iterations and initial theta values and start the training
# Your implementation:
alpha = 42.42 # assign an appropriate value
nb_iterations = 1337 # assign an appropriate value
start_values_theta = [42., 42., 42.] # assign appropriate values
history_cost, history_theta = gradient_descent(alpha, start_values_theta, nb_iterations, X, y)
Now that the training has finished we can visualize our results.
Task:
Plot the costs over the iterations. If you have used fig = plt.figure()
and ax = fig.add_subplot(111)
in the last plot, use it again here, else the plot will be added to the last plot instead of a new one.
Your plot should look similar to this one:
def plot_progress(costs):
""" Plots the costs over the iterations
Args:
costs: history of costs
"""
raise NotImplementedError("You should implement this!")
plot_progress(history_cost)
print("costs before the training:\t ", history_cost[0])
print("costs after the training:\t ", history_cost[1])
Task:
Finally plot the decision hyperplane (just a plain plane here though) together with the data in a 3D plot.
Your plot should look similar to this one:
def evaluation_plt(x, y, final_theta):
''' Plots a cost curve and the decision boundary
The Method plots a cost curve from a given training process (cost_hist).
It also plots the data set (x,y) and draws a linear decision boundary
with the parameters theta_0 and theta_1 into the plotted data set.
Args:
cost_hist: vector, history of all cost values from a opitmization
theta_0: scalar, model parameter for boundary
theta_1: scalar, model parameter for boundary
x: vector, x values from the data set
y: vector, y values from the data set
'''
raise NotImplementedError("You should implement this!")
evaluation_plt(X, y, history_theta[1])
print("thetas before the training:\t", history_theta[0])
print("thetas after the training:\t", history_theta[1])
Feature Scaling
Now suppose the following features $X$:
X = np.array([[0.0001, 2000],
[0.0002, 1800],
[0.0003, 1600]], dtype=np.float32)
sample_size = len(X[:,0])
X.std(axis=0)
v = ((0.0001  0.0002)**2 + (0.0003  0.0002)**2)/3
s = np.sqrt(v)
s
Optional:
You can even execute the cell above and start running your notebook again from top (all except executing the cell to generate your features, which would overwrite these new features).
When you start training you should notice that your costs do not decrease, maybe even increase, if you have not adjusted your learning rate (training might also throw an overflow warning).
Task:
This task can be done via pen & paper or by inserting some code below. Either way, you should be able to solve both tasks below on paper only using a calculator.

Apply feature scaling onto
X
using the mean and the standard deviation. What values do the scaled features have? 
After the training with scaled features your new $\theta'$ values will be very high, something like: $\theta'=[7197, 326, 326]$ (you can try it but you do not have to). Suppose $\theta'=[7197, 326, 326]$. What are the corresponding $\theta$ values for the unscaled data?

Answer the question: What would happen if we did not apply feature scaling and we would use polynomial regression for
X
?
X_scaled = (X  mu) / std
X_scaled
Summary and Outlook
During this exercise, the linear regression was extended to multidimensional feature space and feature scaling was practiced. You should be able to answer the following questions:
 How does the implementation of the multivariate regression differ from the univariate one?
 Why do we apply feature scaling?
 Why does feature scaling help?
Licenses
Notebook License (CCBYSA 4.0)
The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).
Exercise: Multivariate Linear Regression
by Christian Herta, Klaus Strohmenger
is licensed under a Creative Commons AttributionShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.
Code License (MIT)
The following license only applies to code cells of the notebook.
Copyright 2018 Christian Herta, Klaus Strohmenger
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.