# Exercise - Simple Linear Regression with PyTorch

## Introduction

Linear Regression is the Hello World of Machine Learning. In this exercise you will implement a simple linear regression (univariate linear regression), a model with one predictor and one response variable. The goal is to recap and practice fundamental concepts of Machine Learning aswell as practice the usage of the deep learning framework PyTorch. Steps in this exercise:

1. Implement a linear function as hypothesis (model)
2. Write a function to quantify your model (cost function)
3. Learn to visualize the cost function
6. Visualize your training process and results
7. Replace self written functions for linear hypothesis, costs and optimizer with PyTorch Built-Ins

## Requirements

### Knowledge

You should have a basic knowledge of Machine Learning models, cost functions, optimization algorithms and also numpy and matplotlib. We will only recap these concepts for a better understanding and do not explain them in great detail. Suitable sources for acquiring this knowledge are:

### Python Modules

By deep.TEACHING convention, all python modules needed to run the notebook are loaded centrally at the beginning.

# Import Numpy & PyTorch
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from mpl_toolkits.mplot3d.axes3d import Axes3D

## Trianing Data

First, let us generate some training data:

-$m$-Training data$\mathcal D = \{(\vec x^{(1)}, y^{(1)}),(\vec x^{(2)},y^{(2)}), \dots (\vec x^{(m)},y^{(m)})\}$

here with -$m=100$ training data

• one feature$x \in [0, 10]$

• one target value (float):$y = 5 + 2 \cdot x + \eta$

• gaussian noise$\eta$

Note that in reality, we do not know the true target function. If we did, we already knew the perfect model and did not have to find one.

### seed for reproducability
np.random.seed(42)

m = 100
X = np.random.uniform(low=0, high=10, size=(m))

### convert from float64 to float32 for various reasons:
### speedup, less memory usage, precision is enough.
### when using GPU, fp16, fp32 or fp64 depends on
### type of GPU (consumer/workstation/server).
X = np.array(X, dtype=np.float32)

print(X.shape)
print(X.dtype)
def true_target_function(x):
bias = 5
slope = 2
noise = np.random.randn(len(x))
return bias + slope * x + noise
y = true_target_function(X)
y = np.array(y, dtype=np.float32)

print(y.shape)
print(y.dtype)

### Plot the data

First, let us plot the data, so we know what we have to deal with:

plt.scatter(X,y, label='$D_{Train}$')
plt.title('Training Data')
plt.xlabel('Feature $x$')
plt.ylabel('Target $y$')
plt.legend()

## Exercise - Simple Linear Regression

A short recap, a hypothesis$h_\theta(x)$ is a certain function that we believe is similar to a target function that we like to model.

## Model

As we practice linear regression with one feature per training data our model (or hypothesis) will be a line: Consisting of a bias $\theta_0$) and a slope $\theta_1$). So our hypothesis can be defined with:

$h_{\theta}(x) = \hat y = \theta_0 + \theta_1 \cdot x =$

$h_{\theta}(x) = \hat y = \theta_0 \cdot 1 + \theta_1 \cdot x_1$

$\hat y$ is often used for the prediction of our model (oppose to just$y$ for the true target value).

Visualizing our model as graph: ### Converting Data to Tensors

For PyTorch to be able to work with the data, we need to convert the numpy arrays.

### for pytorch, shape (m,1), 100x1 matrix, is crucial.
### shape (100), just a vector, can mess up calculations.
X_tensor = torch.from_numpy(X.reshape(m,1))
y_tensor = torch.from_numpy(y.reshape(m,1))

print(X.shape)
print(y.shape)
print(X_tensor.shape)
print(y_tensor.shape)

### Model Parameters

For our linear model, here a line, we need a bias (theta_0) and a slope (theta_1). We just draw both from a normal distribution (torch.randn).

Notice the argument requires_grad=True. This will be of use later on to be able to get the partial derivatives of these variables.

torch.manual_seed(42)

### bias and slope
theta_0 = torch.randn(1, 1, requires_grad=True, dtype=torch.float32)
theta_1 = torch.randn(1, 1, requires_grad=True, dtype=torch.float32)

print(theta_0, theta_0.shape)
print(theta_1, theta_1.shape)

### Define the Model

Implement the function for our hypothesis:

$h_{\theta}(x) = \theta_0 + \theta_1 \cdot x$

just as you would do using plain python / numpy. Do not wonder, it is really that simple. Here we do not even have to specify if the parameters are simple python scalars, numpy-arrays, or pytorch-tensors.

def hypothesis(x, theta_0, theta_1):

###############################
###############################

raise NotImplementedError()

###############################
###############################        

Now we can already make some preditions with our model and plot them. As we have not trained our model yet (adjusted our parameter$\theta_0$ and$\theta_1$), the result will most porbably be awful.

### generate predictions
y_predicted = hypothesis(X_tensor, theta_0, theta_1)

### output should look like:
### torch.Size([100, 1])
print(y_predicted.shape)
plt.scatter(X,y, label='$D_{Train}$')
plt.title('Training Data and Prediction')
plt.xlabel('Feature $x$')
plt.ylabel('Target $y$')
plt.plot(X_tensor.numpy(), y_predicted.detach().numpy(), 'r', label='model')
plt.legend()

## Cost Function

Before we can tell our algorithm how to impove our hypothesis (here line), we first need a measure to quantify how good a line is. For this purpose we compute the distances of our predictions$h_\theta(x^{(i)}) = \hat y^{(i)}$ and the true$y^{(i)}$. And since we do not want to have negative distances we just square them. This leads use to the error measure, called ...

### Mean-Squared-Error

Implement the computation of the Mean-Squared-Error loss. Don't use any build-in function of PyTorch.

Reminder:

$MSE = \frac{1}{2m} \cdot \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2$

with:

• the number of training examples$m$
• the predictions of our current model$h_{\theta}(x^{(i)})$
• the true target values$y^{(i)}$

Hint:

Avoid loops in your implementation. Use toch.mean() or torch.sum() instead.

def mse(predictions, targets):

###############################
###############################

raise NotImplementedError()

###############################
###############################  
### test your implementation, compute the loss
loss = mse(y_predicted, y_tensor)

### should output something like:
print(loss)

### Visulization of Error Surface

Since our model is not very complex and depends only on two parameters$\theta_0$ and$\theta_1$, we can compute the Mean-Squared-Error loss for certrain values of$\theta_0$ and$\theta_1$ and visualize them in 3D-plot, resp. contour-plot.

Note this visualization is just for the sake of learning purpose. With a more complex model, consisting of thousands (or millions) of parameters$\theta$s, we could not only not visualize it, the computation effort would also be too high as we have to compute the error for every possible combination of our parameters$\theta$s in a given interval.

If your implementation of the Mean-Squared-Error loss is correct, the plot should look like the following: def create_cost_plt_data(x, y, hypothesis_func, cost_func, interval, num_samples, x_offset=0., y_offset=0.):
t0 = np.linspace(-interval+x_offset, interval+x_offset, num=num_samples)
t1 = np.linspace(-interval+y_offset, interval+y_offset, num=num_samples)
T0, T1 = np.meshgrid(t0, t1)

### initilize cost matrix and cost function
C = np.zeros([len(t0),len(t1)])

### calculate cost for each pair of the grid
for i, t_0 in enumerate(t0):
for j, t_1 in enumerate(t1):
t_0 = torch.tensor([[t_0]])
t_1 = torch.tensor([[t_1]])
preds = hypothesis_func(x, t_0, t_1)
loss = mse(preds, y)
C[j][i] = loss.detach().numpy()

return T0, T1, C

def create_cost_plt(T0, T1, Costs):
### contour plot
fig = plt.figure(figsize=(16,6))
fig.suptitle('Visualization of MSE Cost Function')
ax.contour(T0, T1, C)
ax.set_xlabel('Theta 0')
ax.set_ylabel('Theta 1')

### surface plot
ax = fig.add_subplot(1, 2, 2, projection='3d')
surf = ax.plot_surface(T0, T1, C, rstride=1, cstride=1, cmap='viridis', linewidth=0, antialiased=False)
fig.colorbar(surf, shrink=0.5)
T0, T1, C = create_cost_plt_data(X_tensor, y_tensor, hypothesis, mse, 100, 100)
create_cost_plt(T0, T1, C)

### Better do Feature Scaling

It seems, the error surface is not very balanced between$\theta_0$ and$\theta_1$. Scaling our features should help here.

Scale the features, so they are in the range of$-1$ to$+1$:

$X_{scaled} = \frac{X - \mu}{std(X)}$

with:

• the mean of our features$\mu$
• the standard deviation of our features$std(X)$

If everything is correct, the new plot should look more balanced: def scale_features(x):

###############################
###############################

raise NotImplementedError()

###############################
###############################

X_tensor = scale_features(X_tensor)

T0, T1, C = create_cost_plt_data(X_tensor, y_tensor, hypothesis, mse, 100, 100)
create_cost_plt(T0, T1, C)

Remember the argument requires_grad=True, when initializing our tensors theta_0 and theta_1?

This means we can call backward() on tensors, where theta_0 and theta_1 were used in the computation. After calling backward() on loss we can access the partial derivatives of theta_0 and theta_1.

Recap:

Why computing the gradient? Short: theta_1.grad and theta_0.grad tell us the steepness of the cost surface plot above for current values of$\theta_0$ and$\theta_1$, so we know in which direction we have to adjust$\theta_0$ and$\theta_1$ in order to reduce the cost and make our model fit better to the data.

### make preictions for current thetas
predictions  = hypothesis(X_tensor, theta_0, theta_1)
### calculate the loss forwards and backwards
loss = mse(predictions, y_tensor)
loss.backward()

### computation of new gradients disabled inside the context

theta_0.grad.zero_()

## Training the Model

Implement the funtion to train the model with gradient descent by making use of the code snippet above:

• (1) Make prediction, calculate the loss forwards and backwards.
• (2) Implement the gradient descent update rule.

Reminder:

Equation for the update rule:

$\theta_j' = \theta_j - \alpha \cdot \frac{\partial}{\partial \theta_j} J(\vec\theta)$

with

• the cost$J(\vec\theta)$ (here our Mean-Squared-Error)
• the partial derivative of one theta$\frac{\partial}{\partial \theta_j} J(\vec\theta)$
• a learning rate$\alpha$

Note

In your implementation for the update rule only use$x -= update$ instead of$x = x - update$ to avoid PyTorch errors.

def train_model(X, y, hypothesis_func, cost_func, theta_0, theta_1, epochs=100, learning_rate=0.1):

### History to keep track of everything for later evaluation
cost_hist = []
theta_1_hist = []
theta_0_hist = []

for i in range(epochs):

###############################
###############################
##
## Task (1): Make prediction, calculate the loss forwards and backwards.
##

raise NotImplementedError()

###############################
###############################

### computation of new gradients disabled inside the context

cost_hist.append(cost_func(hypothesis_func(X, theta_0, theta_1), y))
theta_1_hist.append(np.copy(theta_1.detach().numpy()))
theta_0_hist.append(np.copy(theta_0.detach().numpy()))

###############################
###############################
##
##

raise NotImplementedError()

###############################
###############################

return cost_hist, theta_0_hist, theta_1_hist, theta_0_grad_hist, theta_1_grad_hist
### hyperparameters
learning_rate = 1e-1
epochs = 100

### let us just recreate new thetas each time.
### so while testing your implementation you have always fresh parameters
theta_1 = torch.randn(1, 1, requires_grad=True, dtype=torch.float32)

eval_data = train_model(X_tensor, y_tensor, hypothesis, mse, theta_0, theta_1,
epochs=epochs, learning_rate=learning_rate)

cost_hist, theta_0_hist, theta_1_hist, theta_0_grad_hist, theta_1_grad_hist = eval_data

## Evaluation

### Plot Model Progession over Epochs

As we kept track of our model parameters$\theta_0$ and$\theta_1$, we can now plot how our model got adjusted during training. If learning was successfull, the plot could look like the following: Light red hypothesis means early in training. The darker the red, the later the model.

plt.scatter(X_tensor,y_tensor)

n_steps = 100
step = epochs // n_steps
for i in range(len(cost_hist[0::step])):
y_pred = hypothesis(X_tensor.numpy(), theta_0_hist[i*step], theta_1_hist[i*step])
a = i / (n_steps*1.5) + 0.33
plt.plot(X_tensor.numpy(), y_pred, color='r', alpha=a)

plt.title('Training Data (blue) and learned Hypothesis (red)')
plt.xlabel('Feature x')
plt.ylabel('Target y')

### Plot Cost and Gradient over Epochs

Let us also plot Costs and the Gradient$\nabla J(\theta)$, i.e.$\frac{\partial}{\partial \theta_0} J(\theta)$ and$\frac{\partial}{\partial \theta_1} J(\theta)$ and how it evolved during the training process.

If learning was successfull, all three should get closer to zero at the end of the training. The plot could look like the following: plt.figure(figsize=(16,3))

ax1 = plt.subplot(131)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Partial_Derivative theta_1')
ax2 = plt.subplot(132)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Partial_Derivative tehta_0')
ax3 = plt.subplot(133)
ax3.set_xlabel('Epoch')
ax3.set_ylabel('Costs')

ax3.plot(np.arange(0, len(cost_hist)), cost_hist)

print("smallest_costs: {}".format(np.array(cost_hist).min()))

### Using PyTorchs Built-Ins

Until now we only used tensors and the autograd feature of PyTorch and implemented our own functions for the linear hypothesis, the costs and the update rule.

This is extremely valuable for learning the concepts of algorithms aswell as testing out own ideas or implementations of state-of-the-art papers, which are not covered by the well established algorithms.

On the other side, when building complex models, like very deep neural networks, this can become tedious. For this purpose PyTorch provides built-ins for a variety of functions. One of them is torch.nn.Linear, which can be used to implement our linear hypothesis:

$h_{\theta}(x) = \theta_0 + \theta_1 \cdot x$

• Implement the class LinearRegression, which uses torch.nn.Linear for our hypothesis.
• Add torch.nn.Linear as a class member in the constructor, e.g. self.linear = torch.nn.Linear(...)
• Apply the linear function onto x in the forward method and return the resutl.
#######################################
### Using built-in linear function ####
#######################################

class LinearRegression(nn.Module):  # inheriting from nn.Module!

def __init__(self):
super(LinearRegression, self).__init__()

###############################
###############################
##
##

raise NotImplementedError()

###############################
###############################

def forward(self, x):
###############################
###############################
##
## Task (2): call the member function on x and return the result
##

raise NotImplementedError()

###############################
############################### 

To get the gradient, the process is almost the same. The only difference is, that we did not explicitly create our model parameters. So in order to get them, we call parameters() of our model object.

### we can now create our model.
### and then gon, almost as beore
model = LinearRegression()
preds = model(X_tensor)
loss = mse(preds, y_tensor)
loss.backward()

### computation of new gradients disabled inside the context
for param in model.parameters():
print('param: ', param)
print()


#### Training the new model

Implement the function to train our new model. Again, just make use of the code snippet above and insert the code at the correct positions.

def train_model_new(X, y, model, cost_func, epochs=100, learning_rate=0.1):

### History of costs
cost_hist = []

for i in range(epochs):

###############################
###############################
##
## Task (1): Make predition, calculate the loss forwards an backwards.
##

return NotImplementedError()

###############################
###############################

cost_hist.append(cost_func(model(X), y))

### computation of new gradients disabled inside the context
for param in model.parameters():

###############################
###############################
##
##

return NotImplementedError()

###############################
###############################

return cost_hist
# Hyperparameters
learning_rate = 1e-1
epochs = 100

### create a new model only once before training
model = LinearRegression()

### start training
cost_hist = train_model_new(X_tensor, y_tensor, model, mse, epochs=epochs, learning_rate=learning_rate)

### plot the costs over epochs
plt.plot(np.linspace(0,epochs, epochs), cost_hist)
plt.xlabel('Epoch')
plt.ylabel('Costs')

### More PyTorch Built-Ins

After making use of torch.nn.Linear, there is more we can exchange for PyTorch built-ins. The cost and the updates of our parameters.

Write the code to train a new model:

• Create a new model object before the training.
• Use the built-in for the Mean-Squared-Error loss torch.nn.MSELoss():

• Create an object once before the training, e.g. criterion = torch.nn.MSELoss().
• Calling it will give you a tensor, e.g. loss = criterion(y_tensor, predictions).
• You an call backward() on that tensor too.
• Also use a built-in optimizer, e.g. torch.optim.Adam:

• Create an object once before the training providing your model.parameters() and a learning rate as argument.
• Use its functions step() and zero_grad() to update the parameters, resp. to reset the gradient.
• Keep track of the costs in each epoch to plot them afterwards.

Optionally:

• Also plot the data together with the hypothesis of epoch like seen before.
• Hint: You do not have to store the models parameters. You can just plot the models during training.

General Hint:

• Of couse you also do not have to implement all of this in a function plain code is ok if it does the job.
###############################
###############################
##
## Task (1): Train your model using the classes and functions mentioned
##
## Task (2): Plot the history of the costs afterwards
##
## Task (3): Optionally plot the data with the hypothesis of each epoch like seen before.

###############################
###############################    

## Summary and Outlook

During this exercise, fundamental elements of Machine Learning were covered. You should be able to answer the following questions:

• What is a model using the example of a linear function as a hypothesis?
• How do you quantify a model?
• What is the gradient descent algorithm and what is its used for in the context of Machine Learning?
• Can you explain the concept of hyperparameters and name some?

Also you should have learned how to implement your own functions with PyTorch, how to make use of PyTorchs autograd feature and how to use PyTorchs built-ins for functions, costs and optimizers.

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Exercise: Simple Linear Regression with PyTorch
by Christian Herta, Benjamin Voigt, Klaus Strohmenger