# ML-Fundamentals - Logistic Regression and Regularization

## Introduction

In this exercise you will implement logistic regression. In contrast to linear regression, the purpose of this model is not to predict a continuous value (e.g. the temperature tomorrow), but to predict a class: for example, whether it will rain tomorrow or not. During this exercise you will:

1. Implement the logistic function and plot it
2. Implement the hypothesis using the logistic function
3. Write a function to calculate the cross-entropy cost
4. Implement the loss function using the hypothesis and cost
5. Implement the gradient descent algorithm to train your model
6. Visualize the decision boundary together with the data
7. Calculate the accuracy of your model
8. Extend your model with regularization
9. Calculate the gradient for the loss function with cross-entropy cost (pen&paper)

## Requirements

### Knowledge

You should have a basic knowledge of:

• Logistic regression
• Cross-entropy loss
• numpy
• matplotlib

Suitable sources for acquiring this knowledge are:

### Python Modules

By deep.TEACHING convention, all python modules needed to run the notebook are loaded centrally at the beginning.

# External Modules
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## Exercise - Logistic Regression

For convenience and visualization, we will only use two features in this notebook, so we are still able to plot them together with the target class. But your implementation should also be capable of handling more (except the plots).

### Data Generation

First we will create some artificial data. For each class, we generate the features from a bivariate (2D) normal distribution:

# class 0:
# covariance matrix and mean
cov0 = np.array([[5,-4],[-4,4]])
mean0 = np.array([2.,3])
# number of data points
m0 = 1000

# class 1
# covariance matrix
cov1 = np.array([[5,-3],[-3,3]])
mean1 = np.array([1.,1])
# number of data points
m1 = 1000

# generate m gaussian distributed data points with
# mean and cov.
r0 = np.random.multivariate_normal(mean0, cov0, m0)
r1 = np.random.multivariate_normal(mean1, cov1, m1)
plt.scatter(r0[...,0], r0[...,1], c='b', marker='o', label="class 0")
plt.scatter(r1[...,0], r1[...,1], c='r', marker='x', label="class 1")
plt.xlabel("x0")
plt.ylabel("x1")
plt.legend()
plt.show()

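# stack both classes into one feature matrix;
# rows from r0 get label 0, rows from r1 get label 1 (matching the plot legend)
X = np.concatenate((r0,r1))
y = np.zeros(len(r0)+len(r1))
y[len(r0):] = 1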

### Logistic Function

For logistic regression, we want the output of the hypothesis to be in the interval $]0, 1[$. This is achieved with the logistic function. The logistic function is a special case of the sigmoid function, though in machine learning the term sigmoid function is often used as a synonym for the logistic function:

$\sigma(z) = \frac{1}{1+\exp(-z)}$

Implement the logistic function and plot it for 1000 points in the interval $[-10,10]$.

def logistic_function(x):
    """ Applies the logistic function to x, element-wise. """
    raise NotImplementedError("You should implement this function")

### Insert code to plot the logistic function below
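As a point of comparison, here is one possible sketch of an element-wise logistic function evaluated on 1000 points in $[-10, 10]$. The name `logistic_sketch` is illustrative and not part of the notebook; the sketch assumes the standard definition $\sigma(x) = \frac{1}{1+\exp(-x)}$.

```python
import numpy as np

def logistic_sketch(x):
    """Element-wise logistic function 1 / (1 + exp(-x)) (illustrative name)."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))

# 1000 evenly spaced points in [-10, 10], as asked for in the task
xs = np.linspace(-10, 10, 1000)
ys = logistic_sketch(xs)

print(logistic_sketch(0.0))        # 0.5
print(ys.min() > 0, ys.max() < 1)  # all values stay strictly inside ]0, 1[
```

The arrays `xs` and `ys` can be passed directly to `plt.plot(xs, ys)` to obtain the familiar S-shaped curve.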


### Logistic Hypothesis

The logistic hypothesis is defined as:

$h_\theta(\vec x) = sigmoid(\vec \theta^T \vec x')$

with:

$\vec x = \begin{pmatrix} x_1 & x_2 & \ldots & x_n \\ \end{pmatrix} \text{ and } \vec x' = \begin{pmatrix} 1 & x_1 & x_2 & \ldots & x_n \\ \end{pmatrix}$

or for the whole data set $X$ and $X'$

$X = \begin{pmatrix} x_1^1 & \ldots & x_n^1 \\ x_1^2 & \ldots & x_n^2 \\ \vdots &\vdots &\vdots \\ x_1^m & \ldots & x_n^m \\ \end{pmatrix} \text{ and } X' = \begin{pmatrix} 1 & x_1^1 & \ldots & x_n^1 \\ 1 & x_1^2 & \ldots & x_n^2 \\ \vdots &\vdots &\vdots &\vdots \\ 1 & x_1^m & \ldots & x_n^m \\ \end{pmatrix}$

Implement the logistic hypothesis using your implementation of the logistic function. logistic_hypothesis should return a function which accepts the training data $X$. Example usage:

>> theta = np.array([1.1, 2.0, -.9])

>> h = logistic_hypothesis(theta)

>> print(h(X))

Note: The training data was sampled with random noise, so the actual values of your h(X) may differ.

array([0.03587382, 0.0299963 , 0.97389774, ...,

Hint:

You may of course also implement a helper function for transforming $X$ into $X'$ and use it inside the lambda function of logistic_hypothesis.
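The helper mentioned in the hint could look roughly like the following sketch. The names `add_bias_column` and `logistic_hypothesis_sketch` are hypothetical, and the sketch assumes that $X$ is a 2D array of shape (m, n) and that $\theta$ has n + 1 entries.

```python
import numpy as np

def add_bias_column(X):
    """Turn X of shape (m, n) into X' of shape (m, n + 1) with a leading column of ones."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def logistic_hypothesis_sketch(theta):
    """Return h(X) = sigmoid(X' @ theta) as a function of X."""
    theta = np.asarray(theta, dtype=float)
    return lambda X: 1.0 / (1.0 + np.exp(-(add_bias_column(X) @ theta)))

X_demo = np.array([[0.0, 0.0], [1.0, 2.0]])
h = logistic_hypothesis_sketch(np.array([0.0, 1.0, -0.5]))
print(add_bias_column(X_demo))
print(h(X_demo))   # both rows give z = 0, so both outputs are 0.5
```

Returning a closure keeps $\theta$ fixed while the data can vary, which matches the `h = logistic_hypothesis(theta)` usage shown above.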

def logistic_hypothesis(theta):
    ''' Combines the given coefficients into a logistic hypothesis and returns it as a function

    Args:
        theta: list of coefficients

    Returns:
        lambda that models a logistic function based on theta and x
    '''
    raise NotImplementedError("You should implement this function")

### Uncomment to test your implementation
#theta = np.array([1.,2.,3.])
#h = logistic_hypothesis(theta)
#print(h(X))

### Cross-entropy

The cross-entropy cost is defined as:

\begin{equation} loss(h_\theta (x^i), y^i) = -y^i \cdot log(h_\theta (x^i)) - (1-y^i) \cdot log(1-h_\theta(x^i)) \end{equation}

Implement the cross-entropy cost.

Your python function should return a function, which accepts the vector $\vec \theta$. The returned function should return the cost for each feature vector $\vec x^i$. The returned array of costs must therefore have the same length as the number of feature vectors (and labels $y$). Example usage:

>> J = cross_entropy_costs(logistic_hypothesis, X, y)

>> print(J(theta))

Note: The training data was sampled with random noise, so the actual values of your J(theta) may differ.

array([ 7.3, 9.5, ....
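To get a feeling for the formula, the following sketch evaluates the cross-entropy cost directly on a few made-up probabilities and labels. The function name `cross_entropy_per_example` and the demo values are assumptions for illustration only; the function takes predicted probabilities rather than a hypothesis and $\theta$.

```python
import numpy as np

def cross_entropy_per_example(p, y):
    """Cross-entropy cost for predicted probabilities p and labels y, element-wise."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    return -y * np.log(p) - (1.0 - y) * np.log(1.0 - p)

p_demo = np.array([0.9, 0.9, 0.5])   # made-up predictions
y_demo = np.array([1.0, 0.0, 1.0])   # made-up labels
costs = cross_entropy_per_example(p_demo, y_demo)
print(costs)   # one cost per example
```

A confident correct prediction (0.9 for label 1) is cheap, while the same confidence for the wrong label is penalized heavily, because $-\log(0.1)$ is much larger than $-\log(0.9)$.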

def cross_entropy_costs(h, X, y):
    ''' Implements cross-entropy as a function costs(theta) on given training data

    Args:
        h: the hypothesis as function
        X: features as 2D array with shape (m_examples, n_features)
        y: ground truth labels for given features with shape (m_examples)

    Returns:
        lambda costs(theta) that models the cross-entropy for each x^i
    '''
    raise NotImplementedError("You should implement this function")

### Uncomment to test your implementation
#theta = np.array([1.,2.,3.])
#costs = cross_entropy_costs(logistic_hypothesis, X, y)
#print(costs(theta))

### Loss Function

\begin{equation} J_D(\theta)=\frac{1}{m}\sum_{i=1}^{m}\left(loss(h_\theta (x^i), y^i)\right) \end{equation}

Now implement the loss function $J$, which calculates the mean cost over the whole training data $X$. Your python function should return a function, which accepts the vector $\vec \theta$.

Note: You can ignore the parameter lambda_reg for now, it is a hyperparameter for regularization. In a later exercise, you may revisit your implementation and implement regularization if you wish.
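The "function that returns a function" pattern can be sketched as follows, without regularization. The names `hypothesis_sketch` and `mean_cost_sketch` are hypothetical and the signature is simplified compared to the stub below; this is one possible arrangement, not the reference solution.

```python
import numpy as np

def hypothesis_sketch(theta):
    """h(X) = sigmoid(X' @ theta), where X' is X with a leading column of ones."""
    def h(X):
        X_prime = np.hstack([np.ones((X.shape[0], 1)), X])
        return 1.0 / (1.0 + np.exp(-(X_prime @ theta)))
    return h

def mean_cost_sketch(X, y, hypothesis):
    """Return J(theta), the mean cross-entropy cost over all m examples."""
    def J(theta):
        p = hypothesis(theta)(X)
        per_example = -y * np.log(p) - (1.0 - y) * np.log(1.0 - p)
        return per_example.mean()
    return J

X_demo = np.array([[0.5, 1.0], [-1.0, 2.0], [3.0, -0.5]])
y_demo = np.array([1.0, 0.0, 1.0])
J = mean_cost_sketch(X_demo, y_demo, hypothesis_sketch)
print(J(np.zeros(3)))   # with theta = 0 every prediction is 0.5, so J = log(2)
```

The value at $\theta = 0$ is a handy sanity check: it equals $\log 2 \approx 0.693$ regardless of the data.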


def mean_cross_entropy_costs(X, y, hypothesis, cost_func, lambda_reg=0.1):
    ''' Implements mean cross-entropy as a function J(theta) on given training data

    Args:
        X: features as 2D array with shape (m_examples, n_features)
        y: ground truth labels for given features with shape (m_examples)
        hypothesis: the hypothesis as function
        cost_func: cost function
        lambda_reg: regularization strength

    Returns:
        lambda J(theta) that models the mean cross-entropy
    '''
    raise NotImplementedError("You should implement this")

### Uncomment to test your implementation
#theta = np.array([1.,2.,3.])
#J = mean_cross_entropy_costs(X,y, logistic_hypothesis, cross_entropy_costs, 0.1)
#print(J(theta))

### Gradient Descent

A short recap: the gradient descent algorithm is a first-order iterative optimization method for finding a minimum of a function. From the current position on the (cost) function, the algorithm takes a step proportional to the negative of the gradient and repeats this until it reaches a local or global minimum and terminates. Stepping proportionally means that the step is not the full negative gradient, but the negative gradient scaled by a fixed value $\alpha$, also called the learning rate. Implementing the following update rule is the core of the optimization process:

\begin{equation} \theta_{j_{new}} \leftarrow \theta_{j_{old}} - \alpha \cdot \frac{\partial}{\partial\theta_{j_{old}}} J(\theta_{old}) \end{equation}

Implement the function to update all theta values.

Note: You can ignore the parameter lambda_reg for now, it is a hyperparameter for regularization. In a later exercise, you may revisit your implementation and implement regularization if you wish.
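A single update step can be sketched like this, assuming the unregularized gradient $\frac{1}{m} \sum_{i=1}^{m} (h_\theta(\vec x^{(i)}) - y^{(i)}) x_j^{(i)}$ that is derived in the pen&paper section. The name `update_step_sketch` and its simplified signature (no hypothesis argument, no lambda_reg) are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_step_sketch(X, y, theta, learning_rate):
    """One gradient descent step for unregularized logistic regression."""
    m = X.shape[0]
    X_prime = np.hstack([np.ones((m, 1)), X])   # prepend the bias column
    error = sigmoid(X_prime @ theta) - y        # h_theta(x^i) - y^i
    gradient = X_prime.T @ error / m            # one entry per theta_j
    return theta - learning_rate * gradient     # all theta_j updated at once

X_demo = np.array([[1.0, 2.0], [-1.0, 0.5]])
y_demo = np.array([1.0, 0.0])
theta_new = update_step_sketch(X_demo, y_demo, np.zeros(3), 0.1)
print(theta_new)
```

Computing the whole gradient from the old $\theta$ before subtracting ensures that all components are updated simultaneously, as the update rule requires.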

def compute_new_theta(X, y, theta, learning_rate, hypothesis, lambda_reg=0.1):
    ''' Updates the theta values with one gradient descent step

    The update is done by calculating the partial derivatives of
    the cost function including the logistic hypothesis. The
    gradients scaled by a scalar are subtracted from the given
    theta values.

    Args:
        X: 2D numpy array of x values
        y: array of y values corresponding to x
        theta: current theta values
        learning_rate: value to scale the negative gradient
        hypothesis: the hypothesis as function
        lambda_reg: regularization strength

    Returns:
        theta: updated theta values
    '''
    raise NotImplementedError("You should implement this")

### Uncomment to test your implementation
#theta = np.array([1.,2.,3.])
#theta = compute_new_theta(X, y, theta, .1, logistic_hypothesis, .1)
#print(theta)

Using the compute_new_theta method, you can now implement the gradient descent algorithm. Iterate over the update rule to find the values for $\theta$ that minimize our cost function $J_D(\theta)$. This process is often called training of a machine learning model.

• Implement the function for the gradient descent.
• Create a history of all theta and cost values and return them.

def gradient_descent(X, y, theta, learning_rate, num_iters, lambda_reg=0.1):
    ''' Finds theta values of a logistic model that minimize the cross-entropy cost function

    Args:
        X: 2D numpy array of x values
        y: array of y values corresponding to x
        theta: current theta values
        learning_rate: value to scale the negative gradient
        num_iters: number of iterations updating thetas
        lambda_reg: regularization strength

    Returns:
        history_cost: cost after each iteration
        history_theta: Updated theta values after each iteration
    '''
    raise NotImplementedError("You should implement this")
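The training loop with its two histories might be organized as in the following sketch. It is self-contained, ignores regularization, and uses a small synthetic data set generated with a fixed seed; the name `gradient_descent_sketch` and all demo values are assumptions, not the notebook's reference solution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_sketch(X, y, theta, learning_rate, num_iters):
    """Run num_iters update steps and record cost and theta after each one."""
    m = X.shape[0]
    X_prime = np.hstack([np.ones((m, 1)), X])
    theta = np.asarray(theta, dtype=float).copy()
    history_cost, history_theta = [], []
    for _ in range(num_iters):
        p = sigmoid(X_prime @ theta)
        theta = theta - learning_rate * (X_prime.T @ (p - y)) / m
        # cost evaluated with the freshly updated theta
        p = sigmoid(X_prime @ theta)
        cost = np.mean(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p))
        history_cost.append(cost)
        history_theta.append(theta.copy())
    return np.array(history_cost), np.array(history_theta)

# small synthetic two-class data set (made up for this sketch)
rng = np.random.default_rng(0)
X_demo = np.concatenate([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y_demo = np.concatenate([np.zeros(50), np.ones(50)])
costs, thetas = gradient_descent_sketch(X_demo, y_demo, np.zeros(3), 0.1, 200)
print(costs[0], costs[-1])
```

Storing a copy of $\theta$ in every iteration avoids all history entries pointing to the same array.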


### Training and Evaluation

Choose an appropriate learning rate, number of iterations and initial theta values, then start the training.

# TODO: Assign sensible values
alpha = 42
theta = np.array([42, -100, 10e5])
num_iters = 1234
history_cost, history_theta = gradient_descent(X, y, theta, alpha, num_iters)

Now that the training has finished we can visualize our results.

Plot the costs over the iterations. With a suitable learning rate, the costs should decrease steadily and level off.

def plot_progress(costs):
    """ Plots the costs over the iterations

    Args:
        costs: history of costs
    """
    raise NotImplementedError("You should implement this!")

plot_progress(history_cost)
print("costs before the training:\t ", history_cost[0])
print("costs after the training:\t ", history_cost[-1])

#### Plot Data and Decision Boundary

Now plot the decision boundary (a straight line in this case) together with the data.

# Insert your code to plot below
history_theta[-1]
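With two features, the boundary is the set of points where $\theta_0 + \theta_1 x_0 + \theta_2 x_1 = 0$, i.e. where the hypothesis equals 0.5. Solving for $x_1$ gives a line that can be drawn over the scatter plot. The sketch below uses made-up theta values and a hypothetical helper name, and assumes $\theta_2 \neq 0$.

```python
import numpy as np

def decision_boundary_x1(theta, x0):
    """Solve theta_0 + theta_1 * x0 + theta_2 * x1 = 0 for x1 (requires theta_2 != 0)."""
    theta = np.asarray(theta, dtype=float)
    return -(theta[0] + theta[1] * np.asarray(x0, dtype=float)) / theta[2]

theta_demo = np.array([1.0, 2.0, -4.0])   # made-up values, not a trained model
x0_line = np.array([-2.0, 0.0, 2.0])
x1_line = decision_boundary_x1(theta_demo, x0_line)
print(x1_line)
```

Passing `x0_line` and `x1_line` to `plt.plot` after the two `plt.scatter` calls draws the line on top of the data.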

#### Accuracy

The logistic hypothesis outputs a value in the interval $]0, 1[$. We want to map this value to one specific class, i.e. $0$ or $1$, so we apply a threshold known as the decision boundary: if the predicted value is < 0.5, the class is 0, otherwise it is 1.

1. Calculate the accuracy of your final classifier. The accuracy is the proportion of the correctly classified data.
2. Why will the accuracy never reach 100% using this model and this data set?

# Insert your code below
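The thresholding rule and the accuracy computation can be sketched on a handful of made-up values. The function name `accuracy_sketch` is hypothetical; in the notebook, the probabilities would come from your trained hypothesis applied to $X$.

```python
import numpy as np

def accuracy_sketch(probabilities, y, threshold=0.5):
    """Fraction of examples whose thresholded prediction equals the label."""
    predictions = (np.asarray(probabilities) >= threshold).astype(int)
    return np.mean(predictions == np.asarray(y))

p_demo = np.array([0.1, 0.4, 0.5, 0.8])   # made-up hypothesis outputs
y_demo = np.array([0, 1, 1, 1])           # made-up labels
print(accuracy_sketch(p_demo, y_demo))    # 3 of 4 correct -> 0.75
```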


### Regularization

Extend your implementation with a regularization term weighted by $\lambda$, passed as the argument lambda_reg to the functions mean_cross_entropy_costs, compute_new_theta and gradient_descent.
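The text does not fix the form of the regularization term. A common choice, assumed here, is an L2 penalty $\frac{\lambda}{2m} \sum_{j \geq 1} \theta_j^2$ that leaves the bias $\theta_0$ unpenalized. Under that assumption, the extra cost and the extra gradient could be sketched as follows (function names are hypothetical):

```python
import numpy as np

def l2_penalty_sketch(theta, lambda_reg, m):
    """L2 penalty lambda / (2m) * sum_j theta_j^2 for j >= 1 (bias theta_0 excluded)."""
    theta = np.asarray(theta, dtype=float)
    return lambda_reg / (2.0 * m) * np.sum(theta[1:] ** 2)

def l2_penalty_gradient_sketch(theta, lambda_reg, m):
    """Gradient of the penalty: (lambda / m) * theta_j, with 0 for the bias term."""
    grad = lambda_reg / m * np.asarray(theta, dtype=float)
    grad[0] = 0.0
    return grad

theta_demo = np.array([5.0, 1.0, -2.0])   # made-up values
print(l2_penalty_sketch(theta_demo, lambda_reg=0.1, m=10))
print(l2_penalty_gradient_sketch(theta_demo, lambda_reg=0.1, m=10))
```

The penalty would be added to the mean cross-entropy cost, and its gradient to the gradient used in the update step. Setting lambda_reg to 0 recovers the unregularized model.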

### Proof - Pen&Paper

The sigmoid activation function is defined as $\sigma (z) = \frac{1}{1+\exp(-z)}$

Show that: $\frac{d \sigma(z)}{d z} = \sigma(z)(1-\sigma(z))$

Now show that: $\frac{\partial \sigma(z)}{\partial \theta_1} = \sigma(z)(1-\sigma(z)) \cdot x_1$ with: $z = \theta_0 x_0 + \theta_1 x_1$

Note that, by symmetry, the following holds in general:

$z = \theta_0 x_0 + \theta_1 x_1 + \dots$

$\frac{\partial \sigma(z)}{\partial \theta_j} = \sigma(z)(1-\sigma(z)) \cdot x_j$

Show from $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \left( - \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta({\vec x}^{(i)})+ (1 - y^{(i)}) \log \left( 1- h_\theta({\vec x}^{(i)})\right) \right] \right)$
that $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta({\vec x}^{(i)})- y^{(i)}\right) x_j^{(i)}$

with the sigmoid function as hypothesis $h_\theta(\vec x^{(i)})$

Hint:

Make use of your knowledge that: $\frac{\partial \sigma(z)}{\partial \theta_j} = \sigma(z)(1-\sigma(z)) \cdot x_j$
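A numerical check does not replace the proof, but it can confirm that the target formula is plausible before you start. The sketch below compares the closed-form gradient with central finite differences on random made-up data; all names and values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X_prime, y):
    """Mean cross-entropy J(theta); X_prime already contains the bias column."""
    p = sigmoid(X_prime @ theta)
    return np.mean(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p))

def analytic_gradient(theta, X_prime, y):
    """Closed form to be shown: 1/m * sum_i (h(x^i) - y^i) * x_j^i."""
    return X_prime.T @ (sigmoid(X_prime @ theta) - y) / len(y)

def numeric_gradient(theta, X_prime, y, eps=1e-6):
    """Central finite differences, one coordinate of theta at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X_prime, y) - cost(theta - step, X_prime, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
X_prime = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y_demo = rng.integers(0, 2, size=20).astype(float)
theta_demo = np.array([0.3, -0.7, 1.2])

g_analytic = analytic_gradient(theta_demo, X_prime, y_demo)
g_numeric = numeric_gradient(theta_demo, X_prime, y_demo)
print(np.max(np.abs(g_analytic - g_numeric)))   # should be very close to zero
```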

## Summary and Outlook

During this exercise you learned about logistic regression and used it to perform binary classification on multidimensional data. You should be able to answer the following questions:

• How can you interpret the output of the logistic function?
• For which type of problem do you use linear regression and for which type of problem do you use logistic regression?

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Exercise: Logistic Regression and Regularization
by Christian Herta, Klaus Strohmenger