Exercise - Natural Pairing

Introduction

The cross-entropy is a widely used loss function when we use logistic regression (and also neural networks) for classification tasks.

The squared error is commonly used as the loss function when we perform linear regression to predict continuous values.

By completing this exercise, you will see why the squared error does not work for classification tasks:

  • by looking at the math behind logistic regression
  • and visually by plotting the individual functions and their derivatives

In order to detect errors in your own code, execute the notebook cells containing assert or assert_almost_equal.

Requirements

Knowledge

To complete this exercise notebook, you should possess knowledge about the following topics.

  • Logistic function
  • Cost functions:
      • Cross-entropy
      • Squared error
  • Computational graph
  • Backpropagation

The following material can help you to acquire this knowledge:

  • Squared error, cross-entropy, computational graph, backpropagation:
      • Chapters 5 and 6 of the Deep Learning Book [GOO16]
      • Chapter 5 of the book Pattern Recognition and Machine Learning by Christopher M. Bishop [BIS07]
  • Logistic Regression (binary):
      • Video 15.3 and following in the playlist Machine Learning of the YouTube user mathematicalmonk [MAT18]

Python Modules

# External Modules
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline 

Theory - Background

Binary Classification

For binary classification the logistic activation function $\sigma(z) = \frac{1}{1+\exp(-z)}$ is used, hence the name logistic regression. Sigmoid function is often used as a synonym for the logistic function, even though the logistic function is just a special case of a sigmoid function. The hyperbolic tangent, for example, is also a sigmoid function.

When we talk about the sigmoid function in this notebook, we use it as a synonym for the logistic function.
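
As a side remark (not part of the exercise), the relation between the two functions can be checked numerically: the hyperbolic tangent is a shifted and rescaled logistic function, $\tanh(z) = 2 \sigma(2z) - 1$. A minimal sketch (the values in z_check are arbitrary illustration values):

# Side-note sketch: tanh is a rescaled logistic function, tanh(z) = 2*sigma(2z) - 1
import numpy as np

z_check = np.linspace(-4, 4, 9)                                   # arbitrary grid of values
logistic_of_2z = 1.0 / (1.0 + np.exp(-2.0 * z_check))
print(np.allclose(np.tanh(z_check), 2.0 * logistic_of_2z - 1.0))  # prints True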

Backpropagation

The training process of logistic regression consists of optimizing the weights $\theta$ so that the prediction $\sigma(z) = \sigma(x; \theta)$ gets closer to the desired target (or class) $t$.

This is accomplished using the update rule:

$$\theta_i^{new} = \theta_i^{old} - \lambda \frac{\partial J}{\partial \theta_i^{old}}$$

$\lambda$ is the learning rate ($\lambda > 0$) and the term $\frac{\partial J}{\partial \theta_i}$ is computed using the backpropagation algorithm. The partial derivative of the loss function $J$ is derived by applying the chain rule:

$$\frac{\partial J}{\partial \theta_i} = \frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z} \frac{\partial z}{\partial \theta_i}$$

The factors are propagated backwards so that the gradient of the loss (pointwise cost) can be computed efficiently with backpropagation.

The first factor $\frac{\partial J}{\partial \sigma}$ depends on the cost function $J$ and the second factor $\frac{\partial \sigma}{\partial z}$ on the activation function $\sigma$.

The product of the first two factors is backpropagated through the network.
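
To make this factorization concrete, here is a minimal numerical sketch (not part of the exercise; $\theta$, $x$ and $t$ are arbitrary illustration values and the helper names are ours): each factor of the chain rule is approximated by a central difference, and their product is compared with the directly computed gradient with respect to $\theta_0$, using the cross-entropy loss introduced later in this notebook.

# Sketch: numerically checking dJ/dtheta_0 = dJ/dsigma * dsigma/dz * dz/dtheta_0
# for one example with z = theta @ x, sigmoid activation and cross-entropy loss.
# theta, x, t and the helper names are arbitrary illustration values.
import numpy as np

theta = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
t = 1.0
eps = 1e-6

def _sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def _loss(s):
    return -t * np.log(s) - (1.0 - t) * np.log(1.0 - s)

z_ex = theta @ x
dJ_dsigma = (_loss(_sigma(z_ex) + eps) - _loss(_sigma(z_ex) - eps)) / (2 * eps)
dsigma_dz = (_sigma(z_ex + eps) - _sigma(z_ex - eps)) / (2 * eps)
dz_dtheta0 = x[0]  # z = theta_0*x_0 + theta_1*x_1, so dz/dtheta_0 = x_0

# direct numerical gradient with respect to theta_0 for comparison
d = np.array([eps, 0.0])
dJ_dtheta0 = (_loss(_sigma((theta + d) @ x)) - _loss(_sigma((theta - d) @ x))) / (2 * eps)

print(dJ_dsigma * dsigma_dz * dz_dtheta0, dJ_dtheta0)  # both values agree (approx. -1.0)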

Exercises

Exercise - Partial Derivative of the Activation Function

Pen & Paper exercise

Since we want to compare the sigmoid function with the squared error to the sigmoid function with the cross-entropy, the second factor $\frac{\partial \sigma}{\partial z}$ stays the same in both cases.

Task:

Compute $\frac{\partial \sigma(z)}{\partial z}$ for the sigmoid activation function $\sigma(z) = \frac{1}{1+\exp(-z)}$ and express the result as a function of $\sigma(z)$:

Hint:

The final solution is $\frac{\partial \sigma(z)}{\partial z} = \ldots = \sigma(z)(1-\sigma(z))$. The task is to find out what is behind the "$\ldots$".
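
If you want to sanity-check the hinted identity numerically before deriving it, a small sketch (z_val and eps are arbitrary illustration values) compares a central-difference approximation of $\frac{\partial \sigma(z)}{\partial z}$ with $\sigma(z)(1-\sigma(z))$:

# Sketch: numerical check of the hinted identity sigma'(z) = sigma(z) * (1 - sigma(z))
import numpy as np

z_val, eps = 0.8, 1e-6                       # arbitrary illustration values
s = lambda u: 1.0 / (1.0 + np.exp(-u))       # logistic function
numerical = (s(z_val + eps) - s(z_val - eps)) / (2 * eps)
print(numerical, s(z_val) * (1 - s(z_val)))  # both values agree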

Exercise - Sigmoid Function with Squared Error

Pen & Paper exercise

Task:

  1. What is the product of the first two factors ($\frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z}$) if we use the squared error as loss (cost of one example):
$$J(\sigma) = \frac{1}{2} (\sigma(z)-t)^2$$

with

  • the target $t \in \{0,1\}$ (classification).

  • $\sigma(z)$ is interpreted as the predicted probability of class 1: $p(y=1 \mid \vec x)$.

  2. Why is the squared error problematic?

Hint:

  1. First compute $\frac{\partial J}{\partial \sigma}$ as done in the first exercise. Then compute the product $\frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z}$.
  2. If you do not know how to solve this task, insert some valid values for $t$ and $z$. If you still cannot answer the question, skip it, proceed with this notebook until you have finished all of the plotting exercises (which visualize this problem), and then come back to answer it.

Exercise - Sigmoid Function with Cross-entropy

Pen & Paper exercise

Task:

  1. What is the product of the first two factors ($\frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z}$) if we use the cross-entropy as loss (cost of one example):
$$J(\sigma) = - t \log(\sigma(z)) - (1-t) \log(1-\sigma(z))$$

with

  • the target $t \in \{0,1\}$ (classification).
  • $\sigma(z)$ is interpreted as the predicted probability of class 1: $p(y=1 \mid \vec x)$.

  2. Why is there no such problem for the cross-entropy loss?

Exercise - Concrete Example

To visualize the theory of this exercise we will now plot the functions for concrete values. Use the results of the pen & paper exercises to implement your methods.

Exercise - Plot Sigmoid and its Derivative

Task:

Implement both functions and execute the code cell for the plot.

  • the sigmoid function $\sigma(z)$
  • the derivative of the sigmoid $\frac{\partial \sigma(z)}{\partial z}$ to see the steepest point of the sigmoid
z = np.linspace(-8,8,200)
# Implement these functions

def sigmoid(z):
    """
    Returns the sigmoid of z.
    
    :z: concrete values.
    :z type: 1D numpy array of type float32.
    
    :returns: sigmoid of z.
    :r type: 1D numpy array of type float32
            with the same length as z.
    """
    raise NotImplementedError()
    
def derivative_sigmoid(z):
    """
    Returns the derivative of sigmoid of z.
    
    :z: concrete values.
    :z type: 1D numpy array of type float32.
    
    :returns: derivative of sigmoid of z.
    :r type: 1D numpy array of type float32
            with the same length as z.
    """
    raise NotImplementedError()
# Execute to verify your implementation

np.testing.assert_almost_equal(sigmoid(-100), 0)
np.testing.assert_almost_equal(sigmoid(100), 1)
assert derivative_sigmoid(0) == 0.25
# Execute this if you have implemented the functions above

plt.plot(z, sigmoid(z), label="sigmoid", color="blue")
plt.plot(z, derivative_sigmoid(z), label="derivative_sigmoid", color="black")
plt.legend()
plt.xlabel("z")
plt.ylabel("$\sigma(z)$ resp. $\sigma'(z)$")
_ = plt.title("sigmoid")

Exercise - Plot Squared Error and Cross-entropy

Task:

Now implement the squared error and the cross-entropy for concrete $\sigma$ and $t=1$.

sigma = np.linspace(0.01,1,200)
# Implement these functions

def squared_error(sigma):
    """
    Returns the squared error of sigma for class t=1.
    
    :sigma: concrete value between 0 and 1
    :sigma type: float32.
    
    :returns: squared error of sigma for class t=1.
    :r type: float32.
    """
    raise NotImplementedError()

def cross_entropy(sigma):
    """
    Returns the cross-entropy of sigma for class t=1.
    
    :sigma: concrete value between 0 and 1
    :sigma type: float32.
    
    :returns: cross-entropy of sigma for class t=1.
    :r type: float32.
    """
    raise NotImplementedError()
# Execute to verify your implementation

assert squared_error(1) == 0
assert cross_entropy(1) == 0
np.testing.assert_almost_equal(squared_error(0.1), 0.405)
np.testing.assert_almost_equal(cross_entropy(0.1), 2.3025850929940455)
# Execute this if you have implemented the functions above

plt.plot(sigma, squared_error(sigma), label="squared_error", color="red")
plt.plot(sigma, cross_entropy(sigma), label="cross_entropy", color="green")
plt.xlabel("sigma")
plt.ylabel("loss")
plt.legend()
_ = plt.title("loss for $t=1$")

Exercise - Plot Derivatives of Squared Error and Cross-entropy

Task:

Now implement the derivative of the loss $\frac{\partial J(\sigma)}{\partial \sigma}$ for

  • the squared error
  • the cross-entropy

for concrete $\sigma$ and $t=1$.

# Implement these functions

def derivative_squared_error(sigma):
    """
    Returns the derivative of the squared error of 
    sigma for class t=1.
    
    :sigma: concrete value between 0 and 1
    :sigma type: float32.
    
    :returns: derivative of squared error of sigma for class t=1.
    :r type: float32.
    """
    raise NotImplementedError()

def derivative_cross_entropy(sigma):
    """
    Returns the derivative of the cross-entropy of 
    sigma for class t=1.
    
    :sigma: concrete value between 0 and 1
    :sigma type: float32.
    
    :returns: derivative of cross-entropy of sigma for class t=1.
    :r type: float32.
    """
    raise NotImplementedError()
assert derivative_squared_error(0) == -1
assert derivative_cross_entropy(0.01) == -100
assert derivative_squared_error(1) == 0
assert derivative_cross_entropy(1) == -1
plt.plot(sigma, derivative_squared_error(sigma), label="derivative squared_error", color="red")
plt.plot(sigma, derivative_cross_entropy(sigma), label="derivative cross_entropy", color="green") 
plt.xlabel("sigma")
plt.ylabel("derivative loss")
plt.ylim(-10,1)
plt.legend()
_ = plt.title("derivative of the loss for $t=1$")

Exercise - Plot the Final Product

Task:

Finally, implement the functions for the product $\frac{\partial J}{\partial \sigma}\frac{\partial \sigma}{\partial z}$ for

  • the squared error
  • the cross-entropy

for concrete $\sigma$ and $t=1$.

def product_derivative_squared_error_and_derivative_sigmoid(sigma):
    """
    Returns the product of the derivative of 
    the squared error and the derivative of 
    the sigmoid for concrete sigma and class t=1.
    
    :sigma: concrete value between 0 and 1
    :sigma type: float32.
    
    :returns: see above.
    :r type: float32.
    """
    raise NotImplementedError()

def product_derivative_cross_entropy_and_derivative_sigmoid(sigma):
    """
    Returns the product of the derivative of 
    the cross-entropy and the derivative of 
    the sigmoid for concrete sigma and class t=1.
    
    :sigma: concrete value between 0 and 1
    :sigma type: float32.
    
    :returns: see above.
    :r type: float32.
    """
    raise NotImplementedError()
assert product_derivative_cross_entropy_and_derivative_sigmoid(0) == -1
assert product_derivative_squared_error_and_derivative_sigmoid(0) == 0
assert product_derivative_cross_entropy_and_derivative_sigmoid(1) == 0
assert product_derivative_squared_error_and_derivative_sigmoid(1) == 0
plt.plot(sigma, product_derivative_squared_error_and_derivative_sigmoid(sigma), label="product (squared_error)", color="red")
plt.plot(sigma, product_derivative_cross_entropy_and_derivative_sigmoid(sigma), label="product (cross_entropy)", color="green")
plt.xlabel("sigma")
plt.ylabel("product of derivatives")
plt.legend()
_ = plt.title("product of loss derivative and sigmoid derivative for $t=1$")

Sidenote - Why the Squared Error works for Linear Regression

For linear regression our activation function $a$ is the identity function. In other words, there is no activation function.

This means $a(z) = z$ and therefore $\frac{\partial a(z)}{\partial z} = 1$, which results in:

$$\frac{\partial J}{\partial a}\frac{\partial a}{\partial z} = \frac{\partial J}{\partial a} \cdot 1 = \frac{\partial J}{\partial a}$$

Once you have completed the exercises above, you will see that the term for linear regression $\frac{\partial J}{\partial a}$ (with squared error) exactly equals the product term for logistic regression (with cross-entropy):

$$\frac{\partial J}{\partial a} = \frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z}$$
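
A quick numerical sketch of this reduction (z_reg and t_reg are arbitrary illustration values, not part of the exercise) shows that the factor $\frac{\partial a}{\partial z}$ contributes nothing when the activation is the identity:

# Sketch: with the identity activation a(z) = z, the backpropagated term
# dJ/da * da/dz reduces to dJ/da (squared error, arbitrary regression values).
import numpy as np

z_reg, t_reg, eps = 1.3, 0.7, 1e-6
a = lambda u: u                                         # identity activation
J_reg = lambda a_val: 0.5 * (a_val - t_reg) ** 2        # squared error of one example

dJ_da = (J_reg(a(z_reg) + eps) - J_reg(a(z_reg) - eps)) / (2 * eps)
da_dz = (a(z_reg + eps) - a(z_reg - eps)) / (2 * eps)   # equals 1 for the identity
print(dJ_da * da_dz, dJ_da)                             # both values agree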

Literature

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Exercise - Natural Pairing
by Christian Herta, Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Christian Herta, Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.