# ML-Fundamentals - Logistic Regression and Regularization

## Introduction

In this exercise you will implement *logistic regression*. In contrast to *linear regression*, the purpose of this model is not to predict a continuous value (e.g. tomorrow's temperature) but to predict a class, for example whether it will rain tomorrow or not. During this exercise you will:

- Implement the logistic function and plot it
- Implement the hypothesis using the logistic function
- Write a function to calculate the cross-entropy cost
- Implement the loss function using the hypothesis and cost
- Implement the gradient descent algorithm to train your model (optimizer)
- Visualize the decision boundary together with the data
- Calculate the accuracy of your model
- Extend your model with regularization
- Calculate the gradient for the loss function with cross-entropy cost (pen&paper)

## Requirements

### Knowledge

You should have a basic knowledge of:

- Logistic regression
- Cross-entropy loss
- Gradient descent
- numpy
- matplotlib

Suitable sources for acquiring this knowledge are:

- Logistic Regression Notebook by Christian Herta and corresponding lecture slides (German)
- Regularization Notebook by Christian Herta and corresponding lecture slides (German)
- Chapter 5.1 of Deep Learning by Ian Goodfellow
- Some parts of chapter 1 and 3 of Pattern Recognition and Machine Learning by Christopher M. Bishop
- numpy quickstart
- Matplotlib tutorials

### Python Modules

By deep.TEACHING convention, all python modules needed to run the notebook are loaded centrally at the beginning.

```
# External Modules
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

## Exercise - Logistic Regression

For convenience and visualization, we will only use two features in this notebook, so we are still able to plot them together with the target class. But your implementation should also be capable of handling more (except the plots).

### Data Generation

First we will create some artificial data. For each class, we will generate the features from a bivariate (2D) normal distribution:

```
# class 0:
# covariance matrix and mean
cov0 = np.array([[5,-4],[-4,4]])
mean0 = np.array([2.,3])
# number of data points
m0 = 1000
# class 1
# covariance matrix
cov1 = np.array([[5,-3],[-3,3]])
mean1 = np.array([1.,1])
# number of data points
m1 = 1000
# generate m gaussian distributed data points with
# mean and cov.
r0 = np.random.multivariate_normal(mean0, cov0, m0)
r1 = np.random.multivariate_normal(mean1, cov1, m1)
```

```
plt.scatter(r0[...,0], r0[...,1], c='b', marker='o', label="class 0")
plt.scatter(r1[...,0], r1[...,1], c='r', marker='x', label="class 1")
plt.xlabel("x0")
plt.ylabel("x1")
plt.legend()
plt.show()
# stack both classes and create the labels: 0 for r0 (class 0), 1 for r1 (class 1)
X = np.concatenate((r0,r1))
y = np.zeros(len(r0)+len(r1))
y[len(r0):] = 1
```

### Logistic Function

For logistic regression, we want the output of the hypothesis to be in the interval $ ]0, 1[ $. This is done using the *logistic function*. The logistic function is a special case of the *sigmoid function*, though in machine learning the term *sigmoid function* is often used as a synonym for *logistic function*:

$ \sigma(z) = \frac{1}{1+\exp(-z)} $

**Task:**

Implement the *logistic function* and plot it for 1000 points in the interval $ [-10,10] $.

```
def logistic_function(x):
    """ Applies the logistic function to x, element-wise. """
    raise NotImplementedError("You should implement this function")

### Insert code to plot the logistic function below
```
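As a point of comparison for your own solution, here is a minimal sketch of an element-wise logistic function evaluated on the grid from the task. The name `logistic_sketch` is hypothetical (chosen so that it does not shadow your `logistic_function`), and the plot itself is left out.

```python
import numpy as np

def logistic_sketch(x):
    """Hypothetical helper: element-wise logistic function 1 / (1 + exp(-x))."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))

# 1000 evenly spaced points in [-10, 10], as in the task above
xs = np.linspace(-10, 10, 1000)
ys = logistic_sketch(xs)

print(ys.min(), ys.max())      # both strictly between 0 and 1
print(logistic_sketch(0.0))    # 0.5
```

The arrays `xs` and `ys` could then be handed to `plt.plot` to obtain the characteristic S-shaped curve.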

### Logistic Hypothesis

The logistic hypothesis is defined as:

$ h_\theta(\vec x) = sigmoid(\vec \theta^T \vec x') $

with:

$ \vec x = \begin{pmatrix} x_1 & x_2 & \ldots & x_n \\ \end{pmatrix} \text{ and } \vec x' = \begin{pmatrix} 1 & x_1 & x_2 & \ldots & x_n \\ \end{pmatrix} $

or for the whole data set $ X $ and $ X' $:

$ X = \begin{pmatrix} x_1^1 & \ldots & x_n^1 \\ x_1^2 & \ldots & x_n^2 \\ \vdots &\vdots &\vdots \\ x_1^m & \ldots & x_n^m \\ \end{pmatrix} \text{ and } X' = \begin{pmatrix} 1 & x_1^1 & \ldots & x_n^1 \\ 1 & x_1^2 & \ldots & x_n^2 \\ \vdots &\vdots &\vdots &\vdots \\ 1 & x_1^m & \ldots & x_n^m \\ \end{pmatrix} $
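The step from $ X $ to $ X' $ can be sketched in numpy as follows. The helper name `add_bias_column` and the toy matrix are assumptions made purely for illustration.

```python
import numpy as np

def add_bias_column(X):
    """Hypothetical helper: prepend a column of ones to X, giving X'."""
    X = np.asarray(X, dtype=float)
    ones = np.ones((X.shape[0], 1))
    return np.hstack((ones, X))

# toy data: m = 3 examples, n = 2 features
X_toy = np.array([[0.5, 1.5],
                  [2.0, -1.0],
                  [3.0, 4.0]])
X_toy_prime = add_bias_column(X_toy)
print(X_toy_prime.shape)  # (3, 3)

# with theta of length n + 1, X' @ theta gives theta^T x' for every row at once
theta_toy = np.array([1.0, 2.0, -1.0])
print(X_toy_prime @ theta_toy)
```

Computing $ \vec \theta^T \vec x' $ for all rows with a single matrix product is what lets the same code handle more than two features.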

**Task:**

Implement the logistic hypothesis using your implementation of the logistic function. `logistic_hypothesis` should return a function which accepts the training data $ X $. Example usage:

`>> theta = np.array([1.1, 2.0, -.9])`

`>> h = logistic_hypothesis(theta)`

`>> print(h(X))`

**Note:** The training data was sampled with random noise, so the actual values of your h(X) may differ.

`array([0.03587382, 0.0299963 , 0.97389774, ...,`

**Hint:**

You may of course also implement a helper function for transforming $ X $ into $ X' $ and use it inside the `lambda` function of `logistic_hypothesis`.
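The pattern of a function that returns another function (a closure) can be sketched with a plain linear hypothesis, which is deliberately not the logistic one asked for above. The names `linear_hypothesis_sketch`, `h_lin` and `X_toy` are made up for this illustration.

```python
import numpy as np

def linear_hypothesis_sketch(theta):
    """Return a function h(X) that computes X' @ theta for a fixed theta."""
    theta = np.asarray(theta, dtype=float)

    def h(X):
        X = np.asarray(X, dtype=float)
        # prepend the bias column of ones to turn X into X'
        X_prime = np.hstack((np.ones((X.shape[0], 1)), X))
        return X_prime @ theta

    return h

X_toy = np.array([[1.0, 2.0],
                  [0.0, -1.0]])
h_lin = linear_hypothesis_sketch([1.0, 2.0, 3.0])
print(h_lin(X_toy))
```

The returned `h` remembers `theta` from the enclosing call, so it only needs the data as its argument. A `lambda` expression would work the same way.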

```
def logistic_hypothesis(theta):
    ''' Combines given list argument in a logistic equation and returns it as a function
    Args:
        theta: list of coefficients
    Returns:
        lambda that models a logistic function based on theta and x
    '''
    raise NotImplementedError("You should implement this function")

### Uncomment to test your implementation
#theta = np.array([1.,2.,3.])
#h = logistic_hypothesis(theta)
#print(h(X))
```

### Cross-entropy

The cross-entropy costs are defined as:

\begin{equation} loss(h_\theta (x^i), y^i) = -y^i \cdot log(h_\theta (x^i)) - (1-y^i) \cdot log(1-h_\theta(x^i)) \end{equation}
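To get a feeling for this cost, the formula can be evaluated for a few hand-picked predictions. This is only a sketch: the arrays `h_toy` and `y_toy` are made-up values standing in for $ h_\theta(x^i) $ and $ y^i $.

```python
import numpy as np

# made-up predictions h_theta(x^i) and labels y^i
h_toy = np.array([0.9, 0.5, 0.1, 0.9])
y_toy = np.array([1.0, 1.0, 1.0, 0.0])

# per-example cross-entropy, term by term as in the equation above
costs_toy = -y_toy * np.log(h_toy) - (1 - y_toy) * np.log(1 - h_toy)
print(costs_toy)
```

For the label 1, the cost grows as the prediction moves from 0.9 to 0.1, and predicting 0.9 for an example with label 0 is penalized as heavily as predicting 0.1 for an example with label 1.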

**Task:**

Implement the cross-entropy cost.

Your python function should return a function which accepts the vector $ \vec \theta $. The returned function should return the cost for each feature vector $ \vec x^i $. The returned array of costs therefore has to have the same length as the number of feature vectors (and labels $ y $). Example usage:

`>> costs = cross_entropy_costs(logistic_hypothesis, X, y)`

`>> print(costs(theta))`

**Note:** The training data was sampled with random noise, so the actual values of your costs may differ.

`array([ 7.3, 9.5, ....`

```
def cross_entropy_costs(h, X, y):
    ''' Implements cross-entropy as a function costs(theta) on given training data
    Args:
        h: the hypothesis as function
        X: features as 2D array with shape (m_examples, n_features)
        y: ground truth labels for given features with shape (m_examples)
    Returns:
        lambda costs(theta) that models the cross-entropy for each x^i
    '''
    raise NotImplementedError("You should implement this function")

### Uncomment to test your implementation
#theta = np.array([1.,2.,3.])
#costs = cross_entropy_costs(logistic_hypothesis, X, y)
#print(costs(theta))
```

### Loss Function

\begin{equation} J_D(\theta)=\frac{1}{m}\sum_{i=1}^{m}\left(loss(h_\theta (x^i), y^i)\right) \end{equation}

**Task:**

Now implement the loss function $ J $, which calculates the mean costs for the whole training data $ X $. Your python function should return a function which accepts the vector $ \vec \theta $.

**Note:** You can ignore the parameter `lambda_reg` for now; it is a hyperparameter for regularization. In a later exercise, you may revisit your implementation and implement regularization if you wish.

```
def mean_cross_entropy_costs(X, y, hypothesis, cost_func, lambda_reg=0.1):
    ''' Implements mean cross-entropy as a function J(theta) on given training data
    Args:
        X: features as 2D array with shape (m_examples, n_features)
        y: ground truth labels for given features with shape (m_examples)
        hypothesis: the hypothesis as function
        cost_func: cost function
        lambda_reg: regularization hyperparameter (can be ignored for now)
    Returns:
        lambda J(theta) that models the mean cross-entropy
    '''
    raise NotImplementedError("You should implement this")

### Uncomment to test your implementation
#theta = np.array([1.,2.,3.])
#J = mean_cross_entropy_costs(X,y, logistic_hypothesis, cross_entropy_costs, 0.1)
#print(J(theta))
```

### Gradient Descent

As a short recap, gradient descent is a first-order iterative optimization algorithm for finding a minimum of a function. From the current position on a (cost) function, the algorithm takes a step proportional to the negative of the gradient and repeats this until it reaches a local or global minimum and terminates. Stepping proportionally means that the step is not the full negative gradient but the negative gradient scaled by a fixed value $ \alpha $, also called the learning rate. Implementing the following update rule is the core of the optimization process:

\begin{equation} \theta_{j_{new}} \leftarrow \theta_{j_{old}} - \alpha \cdot \frac{\partial}{\partial\theta_{j_{old}}} J(\theta_{old}) \end{equation}
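The update rule can be tried out on a simpler problem first. The following sketch minimizes the one-dimensional function $ f(\theta) = (\theta - 3)^2 $, which is an assumption chosen purely for illustration and not the logistic regression loss. The learning rate, starting point and number of steps are arbitrary.

```python
def f(theta):
    """Toy cost function with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def grad_f(theta):
    """Derivative of the toy cost function."""
    return 2.0 * (theta - 3.0)

alpha = 0.1       # learning rate (arbitrary choice)
theta_1d = 0.0    # starting point (arbitrary choice)
history = [theta_1d]

for _ in range(100):
    # step against the gradient, scaled by the learning rate
    theta_1d = theta_1d - alpha * grad_f(theta_1d)
    history.append(theta_1d)

print(theta_1d, f(theta_1d))
```

Each step shrinks the distance to the minimum by the constant factor $ 1 - 2\alpha = 0.8 $, so `theta_1d` approaches 3. With a learning rate above 1.0 the same loop would diverge for this function, which is why the choice of $ \alpha $ matters.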

**Task:**

Implement the function to update all theta values.

**Note:** You can ignore the parameter `lambda_reg` for now; it is a hyperparameter for regularization. In a later exercise, you may revisit your implementation and implement regularization if you wish.

```
def compute_new_theta(X, y, theta, learning_rate, hypothesis, lambda_reg=0.1):
    ''' Updates learnable parameters theta
    The update is done by calculating the partial derivatives of
    the cost function including the hypothesis. The
    gradients scaled by a scalar are subtracted from the given
    theta values.
    Args:
        X: 2D numpy array of x values
        y: array of y values corresponding to x
        theta: current theta values
        learning_rate: value to scale the negative gradient
        hypothesis: the hypothesis as function
        lambda_reg: regularization hyperparameter (can be ignored for now)
    Returns:
        theta: updated theta values
    '''
    raise NotImplementedError("You should implement this")
```

Using the `compute_new_theta` function, you can now implement the gradient descent algorithm. Iterate over the update rule to find the values for $ \theta $ that minimize our cost function $ J_D(\theta) $. This process is often called training of a machine learning model.

**Task:**

- Implement the function for the gradient descent.
- Create a history of all theta and cost values and return them.

```
def gradient_descent(X, y, theta, learning_rate, num_iters, lambda_reg=0.1):
    ''' Minimizes the cross-entropy cost of a logistic model by iteratively updating theta
    Args:
        X: 2D numpy array of x values
        y: array of y values corresponding to x
        theta: current theta values
        learning_rate: value to scale the negative gradient
        num_iters: number of iterations updating thetas
        lambda_reg: regularization hyperparameter (can be ignored for now)
    Returns:
        history_cost: cost after each iteration
        history_theta: updated theta values after each iteration
    '''
    raise NotImplementedError("You should implement this")
```

### Training and Evaluation

**Task:**

Choose an appropriate learning rate, number of iterations and initial theta values and start the training.

```
# Insert your code below
```

Now that the training has finished we can visualize our results.

**Task:**

Plot the costs over the iterations. Your plot should look similar to this one:

```
def plot_progress(costs):
    """ Plots the costs over the iterations
    Args:
        costs: history of costs
    """
    raise NotImplementedError("You should implement this!")
```

```
plot_progress(history_cost)
print("costs before the training:\t ", history_cost[0])
print("costs after the training:\t ", history_cost[-1])
```

#### Plot Data and Decision Boundary

**Task:**

Now plot the decision boundary (a straight line in this case) together with the data.

```
# Insert your code to plot below
history_theta[-1]
```
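For two features, and assuming the usual threshold of 0.5 on the hypothesis, the decision boundary is the set of points where $ \theta_0 + \theta_1 x_0 + \theta_2 x_1 = 0 $ (using the axis names x0 and x1 from the scatter plot). Solving for $ x_1 $ gives the line to plot. A minimal sketch, with a made-up `theta_example` standing in for your trained parameters and assuming $ \theta_2 \neq 0 $:

```python
import numpy as np

theta_example = np.array([4.0, -1.0, -2.0])  # made-up values, not a trained result

def boundary_x1(x0, theta):
    """x1 coordinate of the decision boundary for given x0 (requires theta[2] != 0)."""
    return -(theta[0] + theta[1] * x0) / theta[2]

x0_line = np.linspace(-6, 10, 50)
x1_line = boundary_x1(x0_line, theta_example)

# on the boundary theta^T x' is zero, so the logistic hypothesis gives 0.5
z = theta_example[0] + theta_example[1] * x0_line + theta_example[2] * x1_line
print(np.allclose(z, 0.0))  # True
```

The pair (`x0_line`, `x1_line`) could then be passed to `plt.plot` on top of the scatter plot of the two classes.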

#### Accuracy

**Task:**

- Calculate the accuracy of your final classifier. The accuracy is the proportion of the correctly classified data.
- Why will the accuracy never reach 100% using this model and this data set?

```
# Insert your code below
```
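Accuracy can be computed by thresholding the predicted probabilities and comparing the result with the labels. The following is a sketch on made-up arrays: the threshold of 0.5 is the usual convention and is assumed here, and `probs_toy` and `labels_toy` are illustrative stand-ins for `h(X)` and `y`.

```python
import numpy as np

probs_toy = np.array([0.92, 0.35, 0.61, 0.08, 0.50])
labels_toy = np.array([1, 0, 0, 0, 1])

# predict class 1 whenever the probability reaches the threshold
predictions = (probs_toy >= 0.5).astype(int)

# proportion of correctly classified examples
accuracy = np.mean(predictions == labels_toy)
print(accuracy)  # 0.8
```

Four of the five toy examples are classified correctly; only the third one (probability 0.61, label 0) is not.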

### Regularization

**Task:**

Extend your implementation with a regularization term weighted by $ \lambda $, passed as the argument `lambda_reg` to the functions `mean_cross_entropy_costs`, `compute_new_theta` and `gradient_descent`.
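The task does not fix the form of the penalty. One common choice, assumed here, is an L2 penalty $ \frac{\lambda}{2m}\sum_{j \geq 1}\theta_j^2 $ that leaves the bias $ \theta_0 $ unregularized. The following sketch only shows this penalty term and its gradient on a made-up `theta_toy`; the function names are hypothetical.

```python
import numpy as np

def l2_penalty(theta, lambda_reg, m):
    """Assumed L2 penalty: lambda / (2 m) * sum of squared weights, bias excluded."""
    return lambda_reg / (2.0 * m) * np.sum(theta[1:] ** 2)

def l2_penalty_gradient(theta, lambda_reg, m):
    """Gradient of the penalty above; the entry for the bias theta_0 is zero."""
    grad = lambda_reg / m * theta   # new array, theta itself is not modified
    grad[0] = 0.0
    return grad

theta_toy = np.array([1.0, 2.0, -3.0])
print(l2_penalty(theta_toy, lambda_reg=0.1, m=10))
print(l2_penalty_gradient(theta_toy, lambda_reg=0.1, m=10))
```

Under this assumption, the penalty would be added to the mean cross-entropy in `mean_cross_entropy_costs`, and its gradient to the gradient used in `compute_new_theta`.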

### Proof - Pen&Paper

The sigmoid activation function is defined as $ \sigma (z) = \frac{1}{1+\exp(-z)} $.

**Task:**

Show that: $ \frac{d \sigma(z)}{d z} = \sigma(z)(1-\sigma(z)) $
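Before writing the proof, the identity can be checked numerically. The sketch below compares a central finite difference with $ \sigma(z)(1-\sigma(z)) $. The step size `eps` and the grid are arbitrary choices, and this is a sanity check, not a substitute for the derivation.

```python
import numpy as np

def sigma(z):
    """Sigmoid as defined above."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
eps = 1e-6

# central difference approximation of d sigma / d z
numeric = (sigma(z + eps) - sigma(z - eps)) / (2 * eps)
# right-hand side of the identity to be shown
analytic = sigma(z) * (1 - sigma(z))

print(np.max(np.abs(numeric - analytic)))  # small, dominated by rounding error
```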

**Task:**

Now show that: $ \frac{\partial \sigma(z)}{\partial \theta_1} = \sigma(z)(1-\sigma(z)) \cdot x_1 $ with: $ z = \theta_0 x_0 + \theta_1 x_1 $

Note that, because of symmetry, the following holds in general:

$ z = \theta_0 x_0 + \theta_1 x_1 + \dots $

$ \frac{\partial \sigma(z)}{\partial \theta_j} = \sigma(z)(1-\sigma(z)) \cdot x_j $

**Task:**

Show from
$ \frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \left( - \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta({\vec x}^{(i)})+ (1 - y^{(i)}) \log \left( 1- h_\theta({\vec x}^{(i)})\right) \right] \right) $

that
$ \frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta({\vec x}^{(i)})- y^{(i)}\right) x_j^{(i)} $

with the sigmoid function as hypothesis $ h_\theta(\vec x^{(i)}) $.
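The target formula can likewise be checked numerically before deriving it. The sketch uses a tiny random data set and hypothetical helper names (`J_toy`, `grad_toy`); the seed, sizes and step size are arbitrary.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n = 20, 2
X_prime = np.hstack((np.ones((m, 1)), rng.normal(size=(m, n))))  # X' with bias column
y_toy = rng.integers(0, 2, size=m).astype(float)
theta_toy = np.array([0.3, -0.5, 0.8])

def J_toy(theta):
    """Mean cross-entropy for the toy data."""
    h = sigma(X_prime @ theta)
    return -np.mean(y_toy * np.log(h) + (1 - y_toy) * np.log(1 - h))

def grad_toy(theta):
    """Claimed gradient: (1/m) * sum_i (h(x^i) - y^i) * x_j^i for every j."""
    h = sigma(X_prime @ theta)
    return X_prime.T @ (h - y_toy) / m

eps = 1e-6
# central difference along each coordinate direction of theta
numeric = np.array([
    (J_toy(theta_toy + eps * e) - J_toy(theta_toy - eps * e)) / (2 * eps)
    for e in np.eye(len(theta_toy))
])
print(np.allclose(numeric, grad_toy(theta_toy), atol=1e-6))  # True
```

A check of this kind is also a convenient way to test your own `compute_new_theta` implementation against the loss it is supposed to minimize.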

**Hint:**

Make use of your knowledge that: $ \frac{\partial \sigma(z)}{\partial \theta_j} = \sigma(z)(1-\sigma(z)) \cdot x_j $

## Summary and Outlook

[TODO]

## Licenses

### Notebook License (CC-BY-SA 4.0)

*The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).*

Exercise: Logistic Regression and Regularization

by Christian Herta, Klaus Strohmenger

is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Based on a work at https://gitlab.com/deep.TEACHING.

### Code License (MIT)

*The following license only applies to code cells of the notebook.*

Copyright 2018 Christian Herta, Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.