Exercise  Simple Linear Regression with PyTorch
Table of Contents
Introduction
Linear Regression is the Hello World of Machine Learning. In this exercise you will implement a simple linear regression (univariate linear regression), a model with one predictor and one response variable. The goal is to recap and practice fundamental concepts of Machine Learning aswell as practice the usage of the deep learning framework PyTorch. Steps in this exercise:
 Implement a linear function as hypothesis (model)
 Write a function to quantify your model (cost function)
 Learn to visualize the cost function
 Rescale your input features
 Implement the gradient descent algorithm to train your model (optimizer)
 Visualize your training process and results
 Replace self written functions for linear hypothesis, costs and optimizer with PyTorch BuiltIns
Requirements
Knowledge
You should have a basic knowledge of Machine Learning models, cost functions, optimization algorithms and also numpy and matplotlib. We will only recap these concepts for a better understanding and do not explain them in great detail. Suitable sources for acquiring this knowledge are:
 Simple Linear Regression Notebook by Christian Herta and his lecture slides (German)
 Chapter 5.1 of Deep Learning by Ian Goodfellow
 Some parts of chapter 1 and 3 of Pattern Recognition and Machine Learning by Christopher M. Bishop
 numpy quickstart
 Matplotlib tutorials
 PyTorch quickstart tutorial
Python Modules
By deep.TEACHING convention, all python modules needed to run the notebook are loaded centrally at the beginning.
# Import Numpy & PyTorch
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from mpl_toolkits.mplot3d.axes3d import Axes3D
Trianing Data
First, let us generate some training data:
$ m $Training data$ \mathcal D = \{(\vec x^{(1)}, y^{(1)}),(\vec x^{(2)},y^{(2)}), \dots (\vec x^{(m)},y^{(m)})\} $
here with $ m=100 $ training data

one feature$ x \in [0, 10] $

one target value (float):$ y = 5 + 2 \cdot x + \eta $

gaussian noise$ \eta $
Note that in reality, we do not know the true target function. If we did, we already knew the perfect model and did not have to find one.
### seed for reproducability
np.random.seed(42)
m = 100
X = np.random.uniform(low=0, high=10, size=(m))
### convert from float64 to float32 for various reasons:
### speedup, less memory usage, precision is enough.
### when using GPU, fp16, fp32 or fp64 depends on
### type of GPU (consumer/workstation/server).
X = np.array(X, dtype=np.float32)
print(X.shape)
print(X.dtype)
def true_target_function(x):
bias = 5
slope = 2
noise = np.random.randn(len(x))
return bias + slope * x + noise
y = true_target_function(X)
y = np.array(y, dtype=np.float32)
print(y.shape)
print(y.dtype)
Plot the data
First, let us plot the data, so we know what we have to deal with:
plt.scatter(X,y, label='$D_{Train}$')
plt.title('Training Data')
plt.xlabel('Feature $x$')
plt.ylabel('Target $y$')
plt.legend()
Exercise  Simple Linear Regression
A short recap, a hypothesis$ h_\theta(x) $ is a certain function that we believe is similar to a target function that we like to model.
Model
As we practice linear regression with one feature per training data our model (or hypothesis) will be a line:
Consisting of a bias $ \theta_0 $) and a slope $ \theta_1 $). So our hypothesis can be defined with:
$ h_{\theta}(x) = \hat y = \theta_0 + \theta_1 \cdot x = $
$ h_{\theta}(x) = \hat y = \theta_0 \cdot 1 + \theta_1 \cdot x_1 $
$ \hat y $ is often used for the prediction of our model (oppose to just$ y $ for the true target value).
Visualizing our model as graph:
Converting Data to Tensors
For PyTorch to be able to work with the data, we need to convert the numpy arrays.
### for pytorch, shape (m,1), 100x1 matrix, is crucial.
### shape (100), just a vector, can mess up calculations.
X_tensor = torch.from_numpy(X.reshape(m,1))
y_tensor = torch.from_numpy(y.reshape(m,1))
print(X.shape)
print(y.shape)
print(X_tensor.shape)
print(y_tensor.shape)
Model Parameters
For our linear model, here a line, we need a bias (theta_0
) and a slope (theta_1
). We just draw both from a normal distribution (torch.randn
).
Notice the argument requires_grad=True
. This will be of use later on to be able to get the partial derivatives of these variables.
torch.manual_seed(42)
### bias and slope
theta_0 = torch.randn(1, 1, requires_grad=True, dtype=torch.float32)
theta_1 = torch.randn(1, 1, requires_grad=True, dtype=torch.float32)
print(theta_0, theta_0.shape)
print(theta_1, theta_1.shape)
Define the Model
Task:
Implement the function for our hypothesis:
$ h_{\theta}(x) = \theta_0 + \theta_1 \cdot x $
just as you would do using plain python / numpy. Do not wonder, it is really that simple. Here we do not even have to specify if the parameters are simple python scalars, numpyarrays, or pytorchtensors.
def hypothesis(x, theta_0, theta_1):
###############################
##### YOUR SOLUTION START #####
###############################
raise NotImplementedError()
###############################
##### YOUR SOLUTION End #####
###############################
Now we can already make some preditions with our model and plot them. As we have not trained our model yet (adjusted our parameter$ \theta_0 $ and$ \theta_1 $), the result will most porbably be awful.
### generate predictions
y_predicted = hypothesis(X_tensor, theta_0, theta_1)
### output should look like:
### torch.Size([100, 1])
print(y_predicted.shape)
plt.scatter(X,y, label='$D_{Train}$')
plt.title('Training Data and Prediction')
plt.xlabel('Feature $x$')
plt.ylabel('Target $y$')
plt.plot(X_tensor.numpy(), y_predicted.detach().numpy(), 'r', label='model')
plt.legend()
Cost Function
Before we can tell our algorithm how to impove our hypothesis (here line), we first need a measure to quantify how good a line is. For this purpose we compute the distances of our predictions$ h_\theta(x^{(i)}) = \hat y^{(i)} $ and the true$ y^{(i)} $. And since we do not want to have negative distances we just square them. This leads use to the error measure, called ...
MeanSquaredError
Task:
Implement the computation of the MeanSquaredError loss. Don't use any buildin function of PyTorch.
Reminder:
$ MSE = \frac{1}{2m} \cdot \sum_{i=1}^m (h_{\theta}(x^{(i)})  y^{(i)})^2 $
with:
 the number of training examples$ m $
 the predictions of our current model$ h_{\theta}(x^{(i)}) $
 the true target values$ y^{(i)} $
Hint:
Avoid loops in your implementation. Use toch.mean()
or torch.sum()
instead.
def mse(predictions, targets):
###############################
##### YOUR SOLUTION START #####
###############################
raise NotImplementedError()
###############################
##### YOUR SOLUTION End #####
###############################
### test your implementation, compute the loss
loss = mse(y_predicted, y_tensor)
### should output something like:
### tensor(92.4921, dtype=torch.float64, grad_fn=<MulBackward>)
print(loss)
Visulization of Error Surface
Since our model is not very complex and depends only on two parameters$ \theta_0 $ and$ \theta_1 $, we can compute the MeanSquaredError loss for certrain values of$ \theta_0 $ and$ \theta_1 $ and visualize them in 3Dplot, resp. contourplot.
Note this visualization is just for the sake of learning purpose. With a more complex model, consisting of thousands (or millions) of parameters$ \theta $s, we could not only not visualize it, the computation effort would also be too high as we have to compute the error for every possible combination of our parameters$ \theta $s in a given interval.
If your implementation of the MeanSquaredError loss is correct, the plot should look like the following:
def create_cost_plt_data(x, y, hypothesis_func, cost_func, interval, num_samples, x_offset=0., y_offset=0.):
t0 = np.linspace(interval+x_offset, interval+x_offset, num=num_samples)
t1 = np.linspace(interval+y_offset, interval+y_offset, num=num_samples)
T0, T1 = np.meshgrid(t0, t1)
### initilize cost matrix and cost function
C = np.zeros([len(t0),len(t1)])
### calculate cost for each pair of the grid
for i, t_0 in enumerate(t0):
for j, t_1 in enumerate(t1):
t_0 = torch.tensor([[t_0]])
t_1 = torch.tensor([[t_1]])
preds = hypothesis_func(x, t_0, t_1)
loss = mse(preds, y)
C[j][i] = loss.detach().numpy()
return T0, T1, C
def create_cost_plt(T0, T1, Costs):
### contour plot
fig = plt.figure(figsize=(16,6))
fig.suptitle('Visualization of MSE Cost Function')
ax = fig.add_subplot(1, 2, 1)
ax.contour(T0, T1, C)
ax.set_xlabel('Theta 0')
ax.set_ylabel('Theta 1')
### surface plot
ax = fig.add_subplot(1, 2, 2, projection='3d')
surf = ax.plot_surface(T0, T1, C, rstride=1, cstride=1, cmap='viridis', linewidth=0, antialiased=False)
ax.set_xlabel('Theta 0', labelpad=9)
ax.set_ylabel('Theta 1', labelpad=9)
ax.set_zlabel('Costs', labelpad=9)
fig.colorbar(surf, shrink=0.5)
T0, T1, C = create_cost_plt_data(X_tensor, y_tensor, hypothesis, mse, 100, 100)
create_cost_plt(T0, T1, C)
Better do Feature Scaling
It seems, the error surface is not very balanced between$ \theta_0 $ and$ \theta_1 $. Scaling our features should help here.
Task:
Scale the features, so they are in the range of$ 1 $ to$ +1 $:
$ X_{scaled} = \frac{X  \mu}{std(X)} $
with:
 the mean of our features$ \mu $
 the standard deviation of our features$ std(X) $
If everything is correct, the new plot should look more balanced:
def scale_features(x):
###############################
##### YOUR SOLUTION START #####
###############################
raise NotImplementedError()
###############################
##### YOUR SOLUTION End #####
###############################
X_tensor = scale_features(X_tensor)
T0, T1, C = create_cost_plt_data(X_tensor, y_tensor, hypothesis, mse, 100, 100)
create_cost_plt(T0, T1, C)
Computing the Gradient
Remember the argument requires_grad=True
, when initializing our tensors theta_0
and theta_1
?
This means we can call backward()
on tensors, where theta_0
and theta_1
were used in the computation.
After calling backward()
on loss
we can access the partial derivatives of theta_0
and theta_1
.
Recap:
Why computing the gradient? Short: theta_1.grad
and theta_0.grad
tell us the steepness of the cost surface plot above for current values of$ \theta_0 $ and$ \theta_1 $, so we know in which direction we have to adjust$ \theta_0 $ and$ \theta_1 $ in order to reduce the cost and make our model fit better to the data.
### make preictions for current thetas
predictions = hypothesis(X_tensor, theta_0, theta_1)
### calculate the loss forwards and backwards
loss = mse(predictions, y_tensor)
loss.backward()
### computation of new gradients disabled inside the context
with torch.no_grad():
print(theta_1.grad)
print(theta_0.grad)
print(X_tensor.grad) # prints None, because we did not specify requires_grad=True
print(y_tensor.grad) # see above
### reset the gradient, else the gradient accumulates
theta_1.grad.zero_()
theta_0.grad.zero_()
Training the Model
Gradient Descent
Task:
Implement the funtion to train the model with gradient descent by making use of the code snippet above:
 (1) Make prediction, calculate the loss forwards and backwards.
 (2) Implement the gradient descent update rule.
Reminder:
Equation for the update rule:
$ \theta_j' = \theta_j  \alpha \cdot \frac{\partial}{\partial \theta_j} J(\vec\theta) $
with
 the cost$ J(\vec\theta) $ (here our MeanSquaredError)
 the partial derivative of one theta$ \frac{\partial}{\partial \theta_j} J(\vec\theta) $
 a learning rate$ \alpha $
Note
In your implementation for the update rule only use$ x = update $ instead of$ x = x  update $ to avoid PyTorch errors.
def train_model(X, y, hypothesis_func, cost_func, theta_0, theta_1, epochs=100, learning_rate=0.1):
### History to keep track of everything for later evaluation
cost_hist = []
theta_1_hist = []
theta_0_hist = []
theta_1_grad_hist = []
theta_0_grad_hist = []
### Iteratively apply gradient descent
for i in range(epochs):
###############################
##### YOUR SOLUTION START #####
###############################
##
## Task (1): Make prediction, calculate the loss forwards and backwards.
##
raise NotImplementedError()
###############################
##### YOUR SOLUTION End #####
###############################
### computation of new gradients disabled inside the context
with torch.no_grad():
### Keep track of everything for later purpose, don't care about this
cost_hist.append(cost_func(hypothesis_func(X, theta_0, theta_1), y))
theta_1_hist.append(np.copy(theta_1.detach().numpy()))
theta_0_hist.append(np.copy(theta_0.detach().numpy()))
theta_1_grad_hist.append(np.copy(theta_1.grad.detach().numpy()))
theta_0_grad_hist.append(np.copy(theta_0.grad.detach().numpy()))
###############################
##### YOUR SOLUTION START #####
###############################
##
## Task (2): Update theta_0 and theta_1 and reset the gradient
##
raise NotImplementedError()
###############################
##### YOUR SOLUTION End #####
###############################
return cost_hist, theta_0_hist, theta_1_hist, theta_0_grad_hist, theta_1_grad_hist
### hyperparameters
learning_rate = 1e1
epochs = 100
### let us just recreate new thetas each time.
### so while testing your implementation you have always fresh parameters
theta_0 = torch.randn(1, requires_grad=True, dtype=torch.float32)
theta_1 = torch.randn(1, 1, requires_grad=True, dtype=torch.float32)
eval_data = train_model(X_tensor, y_tensor, hypothesis, mse, theta_0, theta_1,
epochs=epochs, learning_rate=learning_rate)
cost_hist, theta_0_hist, theta_1_hist, theta_0_grad_hist, theta_1_grad_hist = eval_data
Evaluation
Plot Model Progession over Epochs
As we kept track of our model parameters$ \theta_0 $ and$ \theta_1 $, we can now plot how our model got adjusted during training. If learning was successfull, the plot could look like the following:
Light red hypothesis means early in training. The darker the red, the later the model.
plt.scatter(X_tensor,y_tensor)
n_steps = 100
step = epochs // n_steps
for i in range(len(cost_hist[0::step])):
y_pred = hypothesis(X_tensor.numpy(), theta_0_hist[i*step], theta_1_hist[i*step])
a = i / (n_steps*1.5) + 0.33
plt.plot(X_tensor.numpy(), y_pred, color='r', alpha=a)
plt.title('Training Data (blue) and learned Hypothesis (red)')
plt.xlabel('Feature x')
plt.ylabel('Target y')
Plot Cost and Gradient over Epochs
Let us also plot Costs and the Gradient$ \nabla J(\theta) $, i.e.$ \frac{\partial}{\partial \theta_0} J(\theta) $ and$ \frac{\partial}{\partial \theta_1} J(\theta) $ and how it evolved during the training process.
If learning was successfull, all three should get closer to zero at the end of the training. The plot could look like the following:
plt.figure(figsize=(16,3))
ax1 = plt.subplot(131)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Partial_Derivative theta_1')
ax2 = plt.subplot(132)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Partial_Derivative tehta_0')
ax3 = plt.subplot(133)
ax3.set_xlabel('Epoch')
ax3.set_ylabel('Costs')
ax1.plot(np.arange(0, len(cost_hist)), np.array(theta_1_grad_hist).flatten())
ax2.plot(np.arange(0, len(cost_hist)), np.array(theta_0_grad_hist).flatten())
ax3.plot(np.arange(0, len(cost_hist)), cost_hist)
print("smallest_costs: {}".format(np.array(cost_hist).min()))
Using PyTorchs BuiltIns
Until now we only used tensors and the autograd feature of PyTorch and implemented our own functions for the linear hypothesis, the costs and the update rule.
This is extremely valuable for learning the concepts of algorithms aswell as testing out own ideas or implementations of stateoftheart papers, which are not covered by the well established algorithms.
On the other side, when building complex models, like very deep neural networks, this can become tedious. For this purpose PyTorch provides builtins for a variety of functions. One of them is torch.nn.Linear
, which can be used to implement our linear hypothesis:
$ h_{\theta}(x) = \theta_0 + \theta_1 \cdot x $
Task:
 Implement the class
LinearRegression
, which usestorch.nn.Linear
for our hypothesis.  Add
torch.nn.Linear
as a class member in the constructor, e.g.self.linear = torch.nn.Linear(...)
 Apply the linear function onto
x
in the forward method and return the resutl.
#######################################
### Using builtin linear function ####
#######################################
class LinearRegression(nn.Module): # inheriting from nn.Module!
def __init__(self):
super(LinearRegression, self).__init__()
###############################
##### YOUR SOLUTION START #####
###############################
##
## Task (1): add torch.nn.Linear as a classmember
##
raise NotImplementedError()
###############################
##### YOUR SOLUTION End #####
###############################
def forward(self, x):
###############################
##### YOUR SOLUTION START #####
###############################
##
## Task (2): call the member function on x and return the result
##
raise NotImplementedError()
###############################
##### YOUR SOLUTION End #####
###############################
Computing the Gradient
To get the gradient, the process is almost the same. The only difference is, that we did not explicitly create our model parameters. So in order to get them, we call parameters()
of our model object.
### we can now create our model.
### and then gon, almost as beore
model = LinearRegression()
preds = model(X_tensor)
loss = mse(preds, y_tensor)
loss.backward()
### computation of new gradients disabled inside the context
with torch.no_grad():
for param in model.parameters():
print('param: ', param)
p_grad = param.grad.data
print('param grad: ', p_grad)
print()
p_grad.zero_() ### reset the gradient, else the gradient accumulates
Training the new model
Task:
Implement the function to train our new model. Again, just make use of the code snippet above and insert the code at the correct positions.
def train_model_new(X, y, model, cost_func, epochs=100, learning_rate=0.1):
### History of costs
cost_hist = []
### Iteratively apply gradient descent
for i in range(epochs):
###############################
##### YOUR SOLUTION START #####
###############################
##
## Task (1): Make predition, calculate the loss forwards an backwards.
##
return NotImplementedError()
###############################
##### YOUR SOLUTION End #####
###############################
### Keep track of everything for later purpose, don't care about this
cost_hist.append(cost_func(model(X), y))
### computation of new gradients disabled inside the context
with torch.no_grad():
for param in model.parameters():
###############################
##### YOUR SOLUTION START #####
###############################
##
## Task (2): Update theta_0 and theta_1 and reset the gradient
##
return NotImplementedError()
###############################
##### YOUR SOLUTION End #####
###############################
return cost_hist
# Hyperparameters
learning_rate = 1e1
epochs = 100
### create a new model only once before training
model = LinearRegression()
### start training
cost_hist = train_model_new(X_tensor, y_tensor, model, mse, epochs=epochs, learning_rate=learning_rate)
### plot the costs over epochs
plt.plot(np.linspace(0,epochs, epochs), cost_hist)
plt.xlabel('Epoch')
plt.ylabel('Costs')
More PyTorch BuiltIns
After making use of torch.nn.Linear
, there is more we can exchange for PyTorch builtins. The cost and the updates of our parameters.
Task:
Write the code to train a new model:
 Create a new model object before the training.

Use the builtin for the MeanSquaredError loss
torch.nn.MSELoss()
: Create an object once before the training, e.g.
criterion = torch.nn.MSELoss()
.  Calling it will give you a tensor, e.g.
loss = criterion(y_tensor, predictions)
.  You an call
backward()
on that tensor too.
 Create an object once before the training, e.g.

Also use a builtin optimizer, e.g.
torch.optim.Adam
: Create an object once before the training providing your
model.parameters()
and a learning rate as argument.  Use its functions
step()
andzero_grad()
to update the parameters, resp. to reset the gradient.
 Create an object once before the training providing your
 Keep track of the costs in each epoch to plot them afterwards.
Optionally:
 Also plot the data together with the hypothesis of epoch like seen before.
 Hint: You do not have to store the models parameters. You can just plot the models during training.
General Hint:
 Of couse you also do not have to implement all of this in a function plain code is ok if it does the job.
###############################
##### YOUR SOLUTION START #####
###############################
##
## Task (1): Train your model using the classes and functions mentioned
##
## Task (2): Plot the history of the costs afterwards
##
## Task (3): Optionally plot the data with the hypothesis of each epoch like seen before.
###############################
##### YOUR SOLUTION End #####
###############################
Summary and Outlook
During this exercise, fundamental elements of Machine Learning were covered. You should be able to answer the following questions:
 What is a model using the example of a linear function as a hypothesis?
 How do you quantify a model?
 What is the gradient descent algorithm and what is its used for in the context of Machine Learning?
 Can you explain the concept of hyperparameters and name some?
Also you should have learned how to implement your own functions with PyTorch, how to make use of PyTorchs autograd feature and how to use PyTorchs builtins for functions, costs and optimizers.
Licenses
Notebook License (CCBYSA 4.0)
The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).
Exercise: Simple Linear Regression with PyTorch
by Christian Herta, Benjamin Voigt, Klaus Strohmenger
is licensed under a Creative Commons AttributionShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.
Code License (MIT)
The following license only applies to code cells of the notebook.
Copyright 2018 Christian Herta, Benjamin Voigt, , Klaus Strohmenger
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.