ML-Fundamentals - Logistic Regression and Regularization

Introduction

In this exercise you will implement logistic regression. As opposed to linear regression, the purpose of this model is not to predict a continuous value (e.g. the temperature tomorrow), but to predict a class: for example, whether it will rain tomorrow or not. During this exercise you will:

  1. Implement the logistic function and plot it
  2. Implement the hypothesis using the logistic function
  3. Write a function to calculate the cross-entropy cost
  4. Implement the loss function using the hypothesis and cost
  5. Implement the gradient descent algorithm to train your model (optimizer)
  6. Visualize the decision boundary together with the data
  7. Calculate the accuracy of your model
  8. Extend your model with regularization
  9. Calculate the gradient for the loss function with cross-entropy cost (pen&paper)

Requirements

Knowledge

You should have a basic knowledge of:

  • Logistic regression
  • Cross-entropy loss
  • Gradient descent
  • numpy
  • matplotlib

Suitable sources for acquiring this knowledge are:

Python Modules

By deep.TEACHING convention, all python modules needed to run the notebook are loaded centrally at the beginning.

# External Modules
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

Exercise - Logistic Regression

For convenience and visualization, we will only use two features in this notebook, so we can still plot them together with the target class. However, your implementation should also be capable of handling more features (except for the plots).

Data Generation

First we will create some artificial data. For each class, we will generate the features with a bivariate (2D) normal distribution:

# class 0:
# covariance matrix and mean
cov0 = np.array([[5,-4],[-4,4]])
mean0 = np.array([2.,3])
# number of data points
m0 = 1000

# class 1
# covariance matrix
cov1 = np.array([[5,-3],[-3,3]])
mean1 = np.array([1.,1])
# number of data points
m1 = 1000

# generate m gaussian distributed data points with
# mean and cov.
r0 = np.random.multivariate_normal(mean0, cov0, m0)
r1 = np.random.multivariate_normal(mean1, cov1, m1)
plt.scatter(r0[...,0], r0[...,1], c='b', marker='o', label="class 0")
plt.scatter(r1[...,0], r1[...,1], c='r', marker='x', label="class 1")
plt.xlabel("x0")
plt.ylabel("x1")
plt.legend()
plt.show()

X = np.concatenate((r0,r1))
y = np.zeros(len(r0)+len(r1))
y[len(r0):] = 1  # the points drawn from r1 belong to class 1

Logistic Function

For the logistic regression, we want the output of the hypothesis to be in the open interval $]0, 1[$. This is done using the logistic function. The logistic function is a special case of the sigmoid function, though in the domain of machine learning, the term sigmoid function is often used as a synonym for the logistic function:

$logistic(x) = \frac{1}{1+\exp(-x)}$

Task:

Implement the logistic function and plot it for 1000 points in the interval $[-10, 10]$.

def logistic_function(x):
    """ Returns f(x) with f beeing the logistic function.
    """
    raise NotImplementedError("You should implement this function")

### Insert code to plot the logistic function below

Logistic Hypothesis

The logistic hypothesis is defined as:

$h_\theta(\vec x) = sigmoid(\vec \theta^T \vec x')$

with:

$\vec x = \begin{pmatrix} x_1 & x_2 & \ldots & x_n \end{pmatrix} \text{ and } \vec x' = \begin{pmatrix} 1 & x_1 & x_2 & \ldots & x_n \end{pmatrix}$

or for the whole data set $X$ and $X'$:

$X = \begin{pmatrix} x_1^1 & \ldots & x_n^1 \\ x_1^2 & \ldots & x_n^2 \\ \vdots & \vdots & \vdots \\ x_1^m & \ldots & x_n^m \end{pmatrix} \text{ and } X' = \begin{pmatrix} 1 & x_1^1 & \ldots & x_n^1 \\ 1 & x_1^2 & \ldots & x_n^2 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_1^m & \ldots & x_n^m \end{pmatrix}$
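Transforming $X$ into $X'$ amounts to prepending a column of ones. A minimal sketch, assuming `X` is a 2D array of shape `(m, n)`; the helper name `add_bias_column` is an illustration, not part of the exercise:

```python
import numpy as np

def add_bias_column(X):
    """Prepend a column of ones to X, turning X into X' as defined above."""
    m = X.shape[0]
    return np.hstack((np.ones((m, 1)), X))

X_demo = np.array([[2., 3.],
                   [1., 1.]])
print(add_bias_column(X_demo))
# first column is all ones, the remaining columns are the original features
```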

Task:

Implement the logistic hypothesis using your implementation of the logistic function. logistic_hypothesis should return a function which accepts the training data $X$:

>> theta = np.array([1.1, 2.0, -.9])

>> h = logistic_hypothesis(theta)

>> print(h(X))

array([ 0.89896965, 0.71147926, ....

Hint:

You may of course also implement a helper function for transforming $X$ into $X'$ and use it inside the lambda function of logistic_hypothesis.

def logistic_hypothesis(theta):
    ''' Combines the given coefficients theta into a logistic hypothesis and returns it as a function
    
    Args:
        theta: 1D array of coefficients
        
    Returns:
        lambda that models a logistic function based on theta and X
    '''
    raise NotImplementedError("You should implement this function")

    
### Uncomment to test your implementation
#theta = np.array([1.,2.,3.])
#h = logistic_hypothesis(theta)
#print(h(X))

Cross-entropy

The cross-entropy cost is defined as:

$loss(h_\theta (x^i), y^i) = -y^i \cdot \log(h_\theta (x^i)) - (1-y^i) \cdot \log(1-h_\theta(x^i))$
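To get a feeling for the formula, here is a sketch that evaluates the per-example cost for a few hand-picked predictions. Clipping the probabilities is an implementation detail added here to avoid `log(0)`, not part of the formula itself:

```python
import numpy as np

y_true = np.array([1., 0., 1.])
h_pred = np.array([0.9, 0.2, 0.6])   # hypothetical outputs of h_theta(x^i)

# clip to avoid log(0); epsilon is an implementation choice
eps = 1e-12
h_clipped = np.clip(h_pred, eps, 1 - eps)

cost = -y_true * np.log(h_clipped) - (1 - y_true) * np.log(1 - h_clipped)
print(cost)  # confident correct predictions yield small costs
```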

Task:

Implement the cross-entropy cost.

Your python function should return a function, which accepts the vector $\vec \theta$. The returned function should return the cost for each feature vector $\vec x^i$. The length of the returned array of costs therefore has to be the same as the number of feature vectors (and labels $y$):

>> J = cross_entropy_costs(logistic_hypothesis, X, y)

>> print(J(theta))

array([ 7.3, 9.5, ....

def cross_entropy_costs(h, X, y):
    ''' Implements cross-entropy as a function costs(theta) on given training data 
    
    Args:
        h: the hypothesis as function
        X: features as 2D array with shape (m_examples, n_features)  
        y: ground truth labels for given features with shape (m_examples)
        
    Returns:
        lambda costs(theta) that models the cross-entropy for each x^i
    '''
    raise NotImplementedError("You should implement this function")

### Uncomment to test your implementation
#theta = np.array([1.,2.,3.])
#costs = cross_entropy_costs(logistic_hypothesis, X, y)
#print(costs(theta))

Loss Function

$J_D(\theta) = \frac{1}{m}\sum_{i=1}^{m} loss\left(h_\theta (x^i), y^i\right)$

Task:

Now implement the loss function $J$, which calculates the mean cost over the whole training data $X$. Your python function should return a function, which accepts the vector $\vec \theta$.


def mean_cross_entropy_costs(X, y, hypothesis, cost_func):
    ''' Implements mean cross-entropy as a function J(theta) on given training data 
    
    Args:
        X: features as 2D array with shape (m_examples, n_features)  
        y: ground truth labels for given features with shape (m_examples)
        hypothesis: the hypothesis as function
        cost_func: cost function
        
    Returns:
        lambda J(theta) that models the mean cross-entropy
    '''
    raise NotImplementedError("You should implement this")
    
### Uncomment to test your implementation
#theta = np.array([1.,2.,3.])
#J = mean_cross_entropy_costs(X, y, logistic_hypothesis, cross_entropy_costs)
#print(J(theta))

Gradient Descent

A short recap: the gradient descent algorithm is a first-order iterative optimization method for finding a minimum of a function. From the current position in a (cost) function, the algorithm steps proportionally to the negative of the gradient and repeats this until it reaches a local or global minimum and terminates. Stepping proportionally means that it does not go entirely in the direction of the negative gradient, but is scaled by a fixed value $\alpha$, also called the learning rate. Implementing the following formalized update rule is the core of the optimization process:

$\theta_{j_{new}} \leftarrow \theta_{j_{old}} - \alpha \cdot \frac{\partial}{\partial \theta_{j_{old}}} J(\theta_{old})$
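The update rule itself is independent of the logistic model. A minimal sketch on the one-dimensional cost $J(\theta) = (\theta - 3)^2$ (a toy function chosen purely for illustration) shows the mechanics:

```python
# gradient descent on J(theta) = (theta - 3)^2, which has its minimum at theta = 3
theta = 0.0
alpha = 0.1                  # learning rate
for _ in range(100):
    grad = 2 * (theta - 3)   # dJ/dtheta
    theta = theta - alpha * grad
print(theta)  # converges close to 3.0
```

Note that the same loop diverges for too-large learning rates (here alpha > 1), which is why the choice of alpha matters in the training task below.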

Task:

Implement the function to update all theta values.

def compute_new_theta(X, y, theta, learning_rate, hypothesis):
    ''' Updates learnable parameters theta 
    
    The update is done by calculating the partial derivatives of 
    the cost function including the logistic hypothesis. The 
    gradients scaled by a scalar are subtracted from the given 
    theta values.
    
    Args:
        X: 2D numpy array of x values
        y: array of y values corresponding to x
        theta: current theta values
        learning_rate: value to scale the negative gradient  
        hypothesis: the hypothesis as function

        
    Returns:
        theta: Updated theta values
    '''
    raise NotImplementedError("You should implement this")

Using the compute_new_theta method, you can now implement the gradient descent algorithm. Iterate over the update rule to find the values for $\theta$ that minimize our cost function $J_D(\theta)$. This process is often called training of a machine learning model.

Task:

  • Implement the function for the gradient descent.
  • Create a history of all theta and cost values and return them.

def gradient_descent(X, y, theta, learning_rate, num_iters):
    ''' Minimize theta values of a logistic model based on cross-entropy cost function
    
    Args:
        X: 2D numpy array of x values
        y: array of y values corresponding to x
        theta: current theta values
        learning_rate: value to scale the negative gradient  
        num_iters: number of iterations updating thetas
        
    Returns:
        history_cost: cost after each iteration
        history_theta: Updated theta values after each iteration
    '''
    raise NotImplementedError("You should implement this")

Training and Evaluation

Task:

Choose an appropriate learning rate, number of iterations and initial theta values, and start the training.

# Insert your code below

Now that the training has finished we can visualize our results.

Task:

Plot the costs over the iterations. Your plot should look similar to this one:

def plot_progress(costs):
    """ Plots the costs over the iterations
    
    Args:
        costs: history of costs
    """
    raise NotImplementedError("You should implement this!")

plot_progress(history_cost)
print("costs before the training:\t ", history_cost[0])
print("costs after the training:\t ", history_cost[-1])

Plot Data and Decision Boundary

Task:

Now plot the decision boundary (a straight line in this case) together with the data.

# Insert your code to plot below

Accuracy

Task:

  1. Calculate the accuracy of your final classifier. The accuracy is the proportion of correctly classified data points.
  2. Why will this value never reach 1.0 (100 %) using this model and this data set?

# Insert your code below
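As a reference for what the accuracy computation involves, here is a sketch on hypothetical predicted probabilities (toy values, not the trained model): threshold at 0.5 and compare against the labels.

```python
import numpy as np

probs = np.array([0.9, 0.4, 0.7, 0.2])   # hypothetical model outputs
labels = np.array([1., 0., 0., 0.])

predictions = (probs >= 0.5).astype(float)  # threshold at 0.5
accuracy = np.mean(predictions == labels)
print(accuracy)  # 0.75: three of four examples classified correctly
```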

Solution 2:

Because the data is not linearly separable and our model is only capable of drawing a straight line as the decision boundary.

Regularization

Task:

Extend your implementation with a regularization term $\lambda$ by adding it as an argument to the functions mean_cross_entropy_costs, compute_new_theta and gradient_descent.
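One common choice (an assumption here, since the exercise does not fix the exact form of the penalty) is L2 regularization, which adds a weight penalty, conventionally excluding the bias term $\theta_0$, to the loss:

$J_D(\theta) = \frac{1}{m}\sum_{i=1}^{m} loss\left(h_\theta(x^i), y^i\right) + \frac{\lambda}{2m}\sum_{j=1}^{n} \theta_j^2$

The corresponding contribution to the gradient with respect to $\theta_j$ (for $j \geq 1$) is $\frac{\lambda}{m}\theta_j$, which is simply added to the existing gradient in compute_new_theta.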

Proof - Pen&Paper

The sigmoid activation function is defined as $\sigma(z) = \frac{1}{1+\exp(-z)}$

Task:

Show that:

$\frac{d \sigma(z)}{d z} = \sigma(z)(1-\sigma(z))$
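Before (or after) doing the proof, the identity can be checked numerically with a central finite difference. A sketch, not a substitute for the pen-and-paper derivation:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigma(z + eps) - sigma(z - eps)) / (2 * eps)   # central difference
analytic = sigma(z) * (1 - sigma(z))
print(np.max(np.abs(numeric - analytic)))  # tiny: the identity holds numerically
```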

Task:

Now show that:

$\frac{\partial \sigma(z)}{\partial \theta_1} = \sigma(z)(1-\sigma(z)) \cdot x_1$

with:

$z = \theta_0 x_0 + \theta_1 x_1$

Note that in general (because of symmetry) holds:

$z = \theta_0 x_0 + \theta_1 x_1 + \dots$

$\frac{\partial \sigma(z)}{\partial \theta_j} = \sigma(z)(1-\sigma(z)) \cdot x_j$

Task:

Show from

$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \left( - \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta({\vec x}^{(i)}) + (1 - y^{(i)}) \log \left( 1 - h_\theta({\vec x}^{(i)})\right) \right] \right)$

that

$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta({\vec x}^{(i)}) - y^{(i)}\right) x_j^{(i)}$

with the sigmoid function as hypothesis $h_\theta(\vec x^{(i)})$.
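Once derived, the result translates directly into a vectorized computation. A sketch on toy values, assuming `Xp` already carries the bias column as $X'$ above:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: 3 examples, bias column already prepended
Xp = np.array([[1., 2.], [1., 0.], [1., -1.]])
y = np.array([1., 0., 0.])
theta = np.array([0.5, -0.5])

m = Xp.shape[0]
grad = Xp.T @ (sigma(Xp @ theta) - y) / m   # the derived formula, vectorized
print(grad)  # one partial derivative per theta_j
```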

Hint:

Make use of your knowledge that:

$\frac{\partial \sigma(z)}{\partial \theta_j} = \sigma(z)(1-\sigma(z)) \cdot x_j$

Summary and Outlook

[TODO]

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Exercise: Logistic Regression and Regularization
by Christian Herta, Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Christian Herta, Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.