Exercise - Natural Pairing
Table of Contents
Introduction
The crossentropy is a widely used loss function when we use logistic regression (and also neural networks) for classification tasks.
The squared error is commonly used as loss function when we perform linear regression to predict continuous values.
By completing this exercise, you will see why the squared error does not work for classification tasks:
- by looking at the math behind logistic regression
- and visually, by plotting the individual functions and their derivatives
In order to detect errors in your own code, execute the notebook cells containing assert or assert_almost_equal.
Requirements
Knowledge
To complete this exercise notebook, you should possess knowledge about the following topics.
- Logistic function
- Cost functions:
  - Crossentropy
  - Squared error
- Computational graph
- Backpropagation
The following material can help you to acquire this knowledge:
- Squared error, crossentropy, computational graph, backpropagation:
  - Chapters 5 and 6 of the Deep Learning Book [GOO16]
  - Chapter 5 of the book Pattern Recognition and Machine Learning by Christopher M. Bishop [BIS07]
- Logistic Regression (binary):
  - Video 15.3 and following in the playlist Machine Learning of the YouTube user mathematicalmonk [MAT18]
Python Modules
# External Modules
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Theory - Background
Binary Classification
For binary classification the logistic activation function $\sigma (z) = \frac{1}{1+\exp(-z)}$ is used, hence the name logistic regression. Sigmoid function is often used as a synonym for the logistic function, even though the logistic function is just a special case of a sigmoid function. The hyperbolic tangent, for example, is also a sigmoid function.
When we talk about the sigmoid function in this notebook, we use it as a synonym for the logistic function.
Backpropagation
The training process of logistic regression consists of optimizing the weights $\theta$, so the prediction $\sigma(z) = \sigma(x; \theta)$ gets closer to the desired target (or class) $t$.
This is accomplished using the update rule:

$$\theta_i \leftarrow \theta_i - \lambda \frac{\partial J}{\partial \theta_i}$$

$\lambda$ is the learning rate ($\lambda > 0$) and the term $\frac{\partial J}{\partial \theta_i}$ is computed using the backpropagation algorithm. The partial derivative of the loss function $J$ is derived by applying the chain rule:

$$\frac{\partial J}{\partial \theta_i} = \frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z} \frac{\partial z}{\partial \theta_i}$$

The factors are propagated back to efficiently compute the gradient of the loss (pointwise cost) with backpropagation.
The first factor $\frac{\partial J}{\partial \sigma}$ depends on the cost function $J$ and the second factor $\frac{\partial \sigma}{\partial z}$ on the activation function $\sigma$.
The product of the first two factors is backpropagated in the network.
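As an aside, the update rule above can be sketched with a purely numerical gradient, so nothing from the later pen & paper exercises is given away. This is only an illustration; all helper names (`loss`, `numerical_gradient`, `sigmoid_`) are our own and not part of the exercise code:

```python
import numpy as np

def sigmoid_(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, x, t):
    # crossentropy of a single example with target t in {0, 1}
    s = sigmoid_(np.dot(x, theta))
    return -(t * np.log(s) + (1 - t) * np.log(1 - s))

def numerical_gradient(f, theta, eps=1e-6):
    # central finite differences, one coordinate at a time
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        grad[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return grad

theta = np.zeros(2)                 # weights to optimize
x, t = np.array([1.0, 2.0]), 1.0    # one training example with target t
lam = 0.1                           # learning rate lambda

# repeatedly apply theta_i <- theta_i - lambda * dJ/dtheta_i
for _ in range(100):
    theta = theta - lam * numerical_gradient(lambda th: loss(th, x, t), theta)
```

After these updates the prediction $\sigma(z)$ has moved towards the target $t = 1$; backpropagation computes exactly this gradient, only analytically and far more efficiently.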
Exercises
Exercise - Partial Derivative of the Activation Function
Pen & Paper exercise
Since we want to compare the combination sigmoid function + squared error with the combination sigmoid function + crossentropy, the second factor $\frac{\partial \sigma}{\partial z}$ stays the same for both.
Task:
Compute $\frac{\partial \sigma(z)}{\partial z}$ for the sigmoid activation function $\sigma (z) = \frac{1}{1+\exp(-z)}$ and express the result as a function of $\sigma(z)$.
Hint:
The final solution is $\frac{\partial \sigma(z)}{\partial z} = \ldots = \sigma(z) (1-\sigma(z))$. The task is to find out what is behind the "$\ldots$".
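If you want to double-check your pen & paper result, a central finite difference can be compared against the stated solution. This is only a sanity check; the helper `sigmoid_` is illustrative and not the exercise function you implement later:

```python
import numpy as np

def sigmoid_(z):
    return 1.0 / (1.0 + np.exp(-z))

z_check = np.linspace(-5, 5, 11)
eps = 1e-6
# central finite difference vs. the closed form sigma * (1 - sigma)
numeric = (sigmoid_(z_check + eps) - sigmoid_(z_check - eps)) / (2 * eps)
analytic = sigmoid_(z_check) * (1 - sigmoid_(z_check))
print(np.max(np.abs(numeric - analytic)))
```

The maximum deviation should be tiny (dominated by floating-point roundoff), confirming $\sigma'(z) = \sigma(z)(1-\sigma(z))$.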
Exercise - Sigmoid Function with Squared Error
Pen & Paper exercise
Task:
- What is the product of the first two factors ($\frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z}$) if we use the squared error as loss (cost of one example):

  $$J(\sigma) = \frac{1}{2} \left( \sigma(z) - t \right)^2$$

  with
  - the target $t \in \{0,1\}$ (classification).
  - $\sigma(z)$ interpreted as the predicted probability of class 1: $p(y=1 \mid \vec x)$.
- Why is the squared error problematic?
Hint:
- First compute $\frac{\partial J}{\partial \sigma}$ as done in the first exercise. Then compute the product $\frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z}$.
- If you do not know how to solve this task, insert some valid values for $t$ and $z$. If you still cannot answer this question, skip it, proceed with this notebook until you have finished all of the plotting exercises (which visualize this problem), then come back and answer this question.
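Following the hint, one way to insert concrete values is to estimate $\frac{\partial J}{\partial z}$ numerically for a confidently wrong prediction, without deriving the closed form. We assume the squared error $J = \frac{1}{2}(\sigma(z) - t)^2$ from above; all helper names are illustrative:

```python
import numpy as np

def sigmoid_(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_error_of_z(z, t=1):
    # assumes J = 1/2 * (sigma(z) - t)^2 as defined in this exercise
    return 0.5 * (sigmoid_(z) - t) ** 2

eps = 1e-6
# z = -6 is a confidently *wrong* prediction for t = 1,
# z = 0 is an undecided prediction
for z_val in (-6.0, 0.0):
    dJ_dz = (squared_error_of_z(z_val + eps)
             - squared_error_of_z(z_val - eps)) / (2 * eps)
    print("z =", z_val, " dJ/dz ~", dJ_dz)
```

Note that at $z=-6$ the loss is large (the prediction is badly wrong), yet the numerically estimated gradient is almost zero, so a gradient-descent update would barely move the weights.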
Exercise - Sigmoid Function with Crossentropy
Pen & Paper exercise
Task:
- What is the product of the first two factors ($\frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z}$) if we use the crossentropy as loss (cost of one example):

  $$J(\sigma) = -\left( t \ln \sigma(z) + (1-t) \ln (1-\sigma(z)) \right)$$

  with
  - the target $t \in \{0,1\}$ (classification).
  - $\sigma(z)$ interpreted as the predicted probability of class 1: $p(y=1 \mid \vec x)$.
- Why is there no such problem for the crossentropy loss?
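For comparison, here is the same numerical probe for the crossentropy (with $t=1$ the loss reduces to $J = -\ln \sigma(z)$); again, all helper names are illustrative and nothing about the closed form of the product is given away:

```python
import numpy as np

def sigmoid_(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_of_z(z, t=1):
    # crossentropy of a single example with target t in {0, 1}
    s = sigmoid_(z)
    return -(t * np.log(s) + (1 - t) * np.log(1 - s))

eps = 1e-6
for z_val in (-6.0, 0.0):
    dJ_dz = (cross_entropy_of_z(z_val + eps)
             - cross_entropy_of_z(z_val - eps)) / (2 * eps)
    print("z =", z_val, " dJ/dz ~", dJ_dz)
```

In contrast to the squared error, the gradient at the confidently wrong prediction $z=-6$ stays large (close in magnitude to 1), so learning does not stall.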
Exercise - Concrete Example
To visualize the theory of this exercise we will now plot the functions for concrete values. Use the results of the pen & paper exercises to implement your methods.
Exercise - Plot Sigmoid and its Derivative
Task:
Implement both functions and execute the code cell for the plot:
- the sigmoid function $\sigma(z)$
- the derivative of the sigmoid $\frac{\partial \sigma(z)}{\partial z}$, to see the steepest point of the sigmoid
z = np.linspace(-8,8,200)
# Implement these functions
def sigmoid(z):
    """
    Returns the sigmoid of z.

    :z: concrete values.
    :z type: 1D numpy array of type float32.
    :returns: sigmoid of z.
    :r type: 1D numpy array of type float32
        with the same length as z.
    """
    raise NotImplementedError()

def derivative_sigmoid(z):
    """
    Returns the derivative of the sigmoid of z.

    :z: concrete values.
    :z type: 1D numpy array of type float32.
    :returns: derivative of the sigmoid of z.
    :r type: 1D numpy array of type float32
        with the same length as z.
    """
    raise NotImplementedError()
# Execute to verify your implementation
np.testing.assert_almost_equal(sigmoid(-100), 0)
np.testing.assert_almost_equal(sigmoid(100), 1)
assert derivative_sigmoid(0) == 0.25
# Execute this if you have implemented the functions above
plt.plot(z, sigmoid(z), label="sigmoid", color="blue")
plt.plot(z, derivative_sigmoid(z), label="derivative_sigmoid", color="black")
plt.legend()
plt.xlabel("z")
plt.ylabel(r"$\sigma(z)$ resp. $\sigma'(z)$")
_ = plt.title("sigmoid")
Exercise - Plot Squared Error and Crossentropy
Task:
Now implement the squared error and the crossentropy for concrete $\sigma$ and $t=1$.
sigma = np.linspace(0.01,1,200)
# Implement these functions
def squared_error(sigma):
    """
    Returns the squared error of sigma for class t=1.

    :sigma: concrete value between 0 and 1.
    :sigma type: float32.
    :returns: squared error of sigma for class t=1.
    :r type: float32.
    """
    raise NotImplementedError()

def cross_entropy(sigma):
    """
    Returns the crossentropy of sigma for class t=1.

    :sigma: concrete value between 0 and 1.
    :sigma type: float32.
    :returns: crossentropy of sigma for class t=1.
    :r type: float32.
    """
    raise NotImplementedError()
# Execute to verify your implementation
assert squared_error(1) == 0
assert cross_entropy(1) == 0
np.testing.assert_almost_equal(squared_error(0.1), 0.405)
np.testing.assert_almost_equal(cross_entropy(0.1), 2.3025850929940455)
# Execute this if you have implemented the functions above
plt.plot(sigma, squared_error(sigma), label="squared_error", color="red")
plt.plot(sigma, cross_entropy(sigma), label="cross_entropy", color="green")
plt.xlabel("sigma")
plt.ylabel("loss")
plt.legend()
_ = plt.title("loss for $t=1$")
Exercise - Plot Derivatives of Squared Error and Crossentropy
Task:
Now implement the derivative of the loss $\frac{\partial J(\sigma)}{\partial \sigma}$ for
- the squared error
- the crossentropy

for concrete $\sigma$ and $t=1$.
# Implement these functions
def derivative_squared_error(sigma):
    """
    Returns the derivative of the squared error of
    sigma for class t=1.

    :sigma: concrete value between 0 and 1.
    :sigma type: float32.
    :returns: derivative of the squared error of sigma for class t=1.
    :r type: float32.
    """
    raise NotImplementedError()

def derivative_cross_entropy(sigma):
    """
    Returns the derivative of the crossentropy of
    sigma for class t=1.

    :sigma: concrete value between 0 and 1.
    :sigma type: float32.
    :returns: derivative of the crossentropy of sigma for class t=1.
    :r type: float32.
    """
    raise NotImplementedError()
assert derivative_squared_error(0) == -1
assert derivative_cross_entropy(0.01) == -100
assert derivative_squared_error(1) == 0
assert derivative_cross_entropy(1) == -1
plt.plot(sigma, derivative_squared_error(sigma), label="derivative squared_error", color="red")
plt.plot(sigma, derivative_cross_entropy(sigma), label="derivative cross_entropy", color="green")
plt.xlabel("sigma")
plt.ylabel("derivative loss")
plt.ylim(-10,1)
plt.legend()
_ = plt.title("derivative of the loss for $t=1$")
Exercise - Plot the Final Product
Task:
And finally, implement the functions for the product $\frac{\partial J}{\partial \sigma}\frac{\partial \sigma}{\partial z}$ for
- the squared error
- the crossentropy

for concrete $\sigma$ and $t=1$.
def product_derivative_squared_error_and_derivative_sigmoid(sigma):
    """
    Returns the product of the derivative of
    the squared error and the derivative of
    the sigmoid for concrete sigma and class t=1.

    :sigma: concrete value between 0 and 1.
    :sigma type: float32.
    :returns: see above.
    :r type: float32.
    """
    raise NotImplementedError()

def product_derivative_cross_entropy_and_derivative_sigmoid(sigma):
    """
    Returns the product of the derivative of
    the crossentropy and the derivative of
    the sigmoid for concrete sigma and class t=1.

    :sigma: concrete value between 0 and 1.
    :sigma type: float32.
    :returns: see above.
    :r type: float32.
    """
    raise NotImplementedError()
assert product_derivative_cross_entropy_and_derivative_sigmoid(0) == -1
assert product_derivative_squared_error_and_derivative_sigmoid(0) == 0
assert product_derivative_cross_entropy_and_derivative_sigmoid(1) == 0
assert product_derivative_squared_error_and_derivative_sigmoid(1) == 0
plt.plot(sigma, product_derivative_squared_error_and_derivative_sigmoid(sigma), label="derivative squared_error", color="red")
plt.plot(sigma, product_derivative_cross_entropy_and_derivative_sigmoid(sigma), label="derivative cross_entropy", color="green")
plt.xlabel("sigma")
plt.ylabel("derivative loss")
plt.legend()
_ = plt.title("derivative of the loss for $t=1$")
Sidenote - Why the Squared Error Works for Linear Regression
For linear regression our activation function $a$ is the identity function. Or in other words, there is no activation function.
This means $a(z) = z$ and therefore $\frac{\partial a(z)}{\partial z} = 1$, which results in:

$$\frac{\partial J}{\partial a} \frac{\partial a}{\partial z} = \frac{\partial J}{\partial a} \cdot 1 = \frac{\partial J}{\partial a}$$

When you have completed the exercises above, you will see that the term for linear regression $\frac{\partial J}{\partial a}$ (with squared error) exactly equals the product term for logistic regression (with crossentropy):

$$\frac{\partial J}{\partial a} = a(z) - t \qquad \text{resp.} \qquad \frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z} = \sigma(z) - t$$
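This equality can also be checked numerically with central differences, assuming the squared error $J = \frac{1}{2}(a(z)-t)^2$ for linear regression and the crossentropy $J = -\ln \sigma(z)$ (for $t=1$) for logistic regression; all helper names are illustrative:

```python
import numpy as np

def sigmoid_(z):
    return 1.0 / (1.0 + np.exp(-z))

eps, t = 1e-6, 1.0

def linreg_grad(z):
    # squared error with identity activation: J = 1/2 * (a(z) - t)^2, a(z) = z
    J = lambda z_: 0.5 * (z_ - t) ** 2
    return (J(z + eps) - J(z - eps)) / (2 * eps)

def logreg_grad(z):
    # crossentropy with sigmoid activation, differentiated w.r.t. z (t = 1)
    J = lambda z_: -np.log(sigmoid_(z_))
    return (J(z + eps) - J(z - eps)) / (2 * eps)

# both numerical gradients match a(z) - t resp. sigma(z) - t
for z_val in (-2.0, 0.5, 3.0):
    print(z_val, linreg_grad(z_val) - (z_val - t),
          logreg_grad(z_val) - (sigmoid_(z_val) - t))
```

Both differences are numerically zero, i.e. in both cases backpropagation starts from the simple error signal "prediction minus target".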
Literature
[BIS07] Christopher M. Bishop: Pattern Recognition and Machine Learning. Springer, 2007.
[GOO16] Ian Goodfellow, Yoshua Bengio, Aaron Courville: Deep Learning. MIT Press, 2016.
[MAT18] mathematicalmonk: Machine Learning (YouTube playlist).
Licenses
Notebook License (CC-BY-SA 4.0)
The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).
Exercise - Natural Pairing
by Christian Herta, Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.
Code License (MIT)
The following license only applies to code cells of the notebook.
Copyright 2018 Christian Herta, Klaus Strohmenger
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.