# Exercise - Natural Pairing

## Introduction

Cross-entropy is a widely used loss function when we use logistic regression (and also neural networks) for classification tasks.

The squared error is commonly used as a loss function when we perform linear regression to predict continuous values.

By completing this exercise, you will see why the squared error does not work for classification tasks:

• by looking at the math behind logistic regression
• and visually by plotting the individual functions and their derivatives

In order to detect errors in your own code, execute the notebook cells containing assert or assert_almost_equal.
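For example, the following (purely illustrative) check passes silently for values that are numerically almost equal and raises an AssertionError otherwise:

# illustrative example of such a check (not part of the exercise)
import numpy as np
np.testing.assert_almost_equal(0.1 + 0.2, 0.3)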

## Requirements

### Knowledge

To complete this exercise notebook, you should possess knowledge about the following topics.

• Logistic function
• Cost functions:
  • Cross-entropy
  • Squared error
• Computational graph
• Backpropagation

You can refresh your knowledge with the following resources:

• Squared error, cross-entropy, computational graph, backpropagation:
  • Chapters 5 and 6 of the Deep Learning Book [GOO16]
  • Chapter 5 of the book Pattern Recognition and Machine Learning by Christopher M. Bishop [BIS07]
• Logistic Regression (binary):
  • Video 15.3 and following in the playlist Machine Learning by the YouTube user mathematicalmonk [MAT18]

### Python Modules

# External Modules
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline 

## Theory - Background

### Binary Classification

For binary classification the logistic activation function $\sigma (z) = \frac{1}{1+\exp(-z)}$ is used, hence the name logistic regression. Sigmoid function is often used as a synonym for the logistic function, even though the logistic function is just a special case of a sigmoid function. The hyperbolic tangent, for example, is also a sigmoid function.

When we talk about the sigmoid function in this notebook, we use it as a synonym for the logistic function.
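For instance, the hyperbolic tangent is just a scaled and shifted logistic function (a standard identity, stated here only for context):

$\tanh(z/2) = 2 \sigma(z) - 1$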

### Backpropagation

The training process of logistic regression consists of optimizing the weights $\theta$ so that the prediction $\sigma(z) = \sigma(\vec x; \theta)$ gets closer to the desired target (or class) $t$.

This is accomplished using the update rule:

$\theta_i^{new} = \theta_i^{old} - \lambda \frac{\partial J}{\partial \theta_i^{old}}$

$\lambda$ is the learning rate ($\lambda > 0$) and the term $\frac{\partial J}{\partial \theta_i}$ is computed using the backpropagation algorithm. The partial derivative of the loss function $J$ is derived by applying the chain rule:

$\frac{\partial J}{\partial \theta_i} = \frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z} \frac{\partial z}{\partial \theta_i}$

With backpropagation, these factors are propagated back to compute the gradient of the loss (the pointwise cost) efficiently.

The first factor $\frac{\partial J}{\partial \sigma}$ depends on the cost function $J$ and the second factor $\frac{\partial \sigma}{\partial z}$ on the activation function $\sigma$.

The product of the first two factors is backpropagated through the network.
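To make the update rule concrete, here is a minimal numerical sketch for a single training example (not part of the original notebook): the gradient is estimated with central finite differences instead of backpropagation, and the values of theta, x, t and learning_rate are arbitrary illustrative assumptions.

# Minimal numerical illustration of the update rule (illustrative values only).
# The gradient dJ/dtheta_i is estimated with central finite differences here,
# not with backpropagation.
import numpy as np

def example_loss(theta, x, t):
    # cross-entropy of a single example with sigmoid activation
    z = np.dot(theta, x)
    sigma = 1.0 / (1.0 + np.exp(-z))
    return -t * np.log(sigma) - (1 - t) * np.log(1 - sigma)

def numerical_gradient(theta, x, t, eps=1e-6):
    # central finite-difference estimate of dJ/dtheta
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (example_loss(theta + step, x, t) - example_loss(theta - step, x, t)) / (2 * eps)
    return grad

theta = np.array([0.5, -0.3])   # current weights (illustrative)
x = np.array([1.0, 2.0])        # one training example (illustrative)
t = 1                           # its target class
learning_rate = 0.1
theta_new = theta - learning_rate * numerical_gradient(theta, x, t)
print(theta_new)                # slightly updated weights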

## Exercises

### Exercise - Partial Derivative of the Activation Function

Pen & Paper exercise

Since we want to compare the sigmoid function combined with the squared error to the sigmoid function combined with the cross-entropy, the second factor $\frac{\partial \sigma}{\partial z}$ stays the same in both cases.

Compute $\frac{\partial \sigma(z)}{\partial z}$ for the sigmoid activation function $\sigma (z) = \frac{1}{1+\exp(-z)}$ and express the result as a function of $\sigma(z)$:

Hint:

The final solution is $\frac{\partial \sigma(z)}{\partial z} = \ldots = \sigma(z) (1-\sigma(z))$. The task is to find out what is behind the "$\ldots$".
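If you want to sanity check your pen & paper result numerically, the following short sketch (not part of the original exercise) compares the stated formula with a central finite-difference approximation of the derivative:

# optional numerical sanity check of the stated solution (illustrative, not part of the exercise)
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

eps = 1e-6
for z in (-2.0, 0.0, 3.0):
    finite_diff = (sigma(z + eps) - sigma(z - eps)) / (2 * eps)
    np.testing.assert_almost_equal(finite_diff, sigma(z) * (1 - sigma(z)), decimal=6)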

### Exercise - Sigmoid Function with Squared Error

Pen & Paper exercise

1. What is the product of the first two factors ($\frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z}$) if we use the squared error as loss (cost of one example):
$J(\sigma) = \frac{1}{2} (\sigma(z)-t)^2$

with

• the target $t \in \{0,1\}$ (classification).

• $\sigma(z)$ is interpreted as the predicted probability of class 1: $p(y=1 \mid \vec x)$.

2. Why is the squared error problematic?

Hint:

1. First compute $\frac{\partial J}{\partial \sigma}$ as done in the first exercise. Then compute the product $\frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z}$.
2. If you do not know how to solve this task, insert some valid values for $t$ and $z$ (see the sketch below). If you still cannot answer this question, skip it, proceed with this notebook until you have finished all of the plotting exercises (which visualize this problem), then come back and answer this question.
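The following sketch (not part of the original notebook) follows the hint with concrete values: it estimates $\frac{\partial J}{\partial z}$ for the squared error through the sigmoid by central finite differences at a few values of $z$ with target $t=1$.

# optional numerical illustration of the hint (illustrative values only)
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_error_of_z(z, t=1):
    return 0.5 * (sigma(z) - t) ** 2

eps = 1e-6
for z in (-8.0, -4.0, 0.0):
    grad = (squared_error_of_z(z + eps) - squared_error_of_z(z - eps)) / (2 * eps)
    print(f"z = {z:5.1f}  ->  dJ/dz approx {grad:.6f}")
# Although sigma(-8) is far away from the target t=1, the gradient there is almost zero.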

### Exercise - Sigmoid Function with Cross-entropy

Pen & Paper exercise

1. What is the product of the first two factors ($\frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z}$) if we use the cross-entropy as loss (cost of one example):
$J(\sigma) = - t \log(\sigma(z)) - (1-t) \log (1-\sigma(z))$

with

• the target $t \in \{0,1\}$ (classification).
• $\sigma(z)$ is interpreted as the predicted probability of class 1: $p(y=1 \mid \vec x)$.
2. Why is there no such problem for the cross-entropy loss?

### Exercise - Concrete Example

To visualize the theory of this exercise we will now plot the functions for concrete values. Use the results of the pen & paper exercises to implement your methods.

#### Exercise - Plot Sigmoid and its Derivative

Implement both functions and execute the code cell for the plot.

• the sigmoid function $\sigma(z)$
• the derivative of the sigmoid $\frac{\partial \sigma(z)}{\partial z}$ to see the steepest point of the sigmoid
z = np.linspace(-8,8,200)
# Implement these functions

def sigmoid(z):
    """
    Returns the sigmoid of z.

    :z: concrete values.
    :z type: 1D numpy array of type float32.

    :returns: sigmoid of z.
    :r type: 1D numpy array of type float32
             with the same length as z.
    """
    raise NotImplementedError()

def derivative_sigmoid(z):
    """
    Returns the derivative of sigmoid of z.

    :z: concrete values.
    :z type: 1D numpy array of type float32.

    :returns: derivative of sigmoid of z.
    :r type: 1D numpy array of type float32
             with the same length as z.
    """
    raise NotImplementedError()
# Execute to verify your implementation

np.testing.assert_almost_equal(sigmoid(-100), 0)
np.testing.assert_almost_equal(sigmoid(100), 1)
assert derivative_sigmoid(0) == 0.25
# Execute this if you have implemented the functions above

plt.plot(z, sigmoid(z), label="sigmoid", color="blue")
plt.plot(z, derivative_sigmoid(z), label="derivative_sigmoid", color="black")
plt.legend()
plt.xlabel("z")
plt.ylabel(r"$\sigma(z)$ resp. $\sigma'(z)$")
_ = plt.title("sigmoid")

#### Exercise - Plot Squared Error and Cross-entropy

Now implement the squared error and the cross-entropy for concrete $\sigma$ and $t=1$.

# start at 0.01 because the cross-entropy for t=1 diverges as sigma approaches 0
sigma = np.linspace(0.01, 1, 200)
# Implement these functions

def squared_error(sigma):
    """
    Returns the squared error of sigma for class t=1.

    :sigma: concrete value between 0 and 1
    :sigma type: float32.

    :returns: squared error of sigma for class t=1.
    :r type: float32.
    """
    raise NotImplementedError()

def cross_entropy(sigma):
    """
    Returns the cross-entropy of sigma for class t=1.

    :sigma: concrete value between 0 and 1
    :sigma type: float32.

    :returns: cross-entropy of sigma for class t=1.
    :r type: float32.
    """
    raise NotImplementedError()
# Execute to verify your implementation

assert squared_error(1) == 0
assert cross_entropy(1) == 0
np.testing.assert_almost_equal(squared_error(0.1), 0.405)
np.testing.assert_almost_equal(cross_entropy(0.1), 2.3025850929940455)
# Execute this if you have implemented the functions above

plt.plot(sigma, squared_error(sigma), label="squared_error", color="red")
plt.plot(sigma, cross_entropy(sigma), label="cross_entropy", color="green")
plt.xlabel("sigma")
plt.ylabel("loss")
plt.legend()
_ = plt.title("loss for $t=1$")

#### Exercise - Plot Derivatives of Squared Error and Cross-entropy

Now implement the derivative of the loss $\frac{\partial J(\sigma)}{\partial \sigma}$ for

• the squared error
• the cross-entropy

for concrete $\sigma$ and $t=1$.

# Implement these functions

def derivative_squared_error(sigma):
    """
    Returns the derivative of the squared error of
    sigma for class t=1.

    :sigma: concrete value between 0 and 1
    :sigma type: float32.

    :returns: derivative of the squared error of sigma for class t=1.
    :r type: float32.
    """
    raise NotImplementedError()

def derivative_cross_entropy(sigma):
    """
    Returns the derivative of the cross-entropy of
    sigma for class t=1.

    :sigma: concrete value between 0 and 1
    :sigma type: float32.

    :returns: derivative of the cross-entropy of sigma for class t=1.
    :r type: float32.
    """
    raise NotImplementedError()
assert derivative_squared_error(0) == -1
assert derivative_cross_entropy(0.01) == -100
assert derivative_squared_error(1) == 0
assert derivative_cross_entropy(1) == -1
plt.plot(sigma, derivative_squared_error(sigma), label="derivative squared_error", color="red")
plt.plot(sigma, derivative_cross_entropy(sigma), label="derivative cross_entropy", color="green")
plt.xlabel("sigma")
plt.ylabel("derivative loss")
plt.ylim(-10,1)
plt.legend()
_ = plt.title("derivative of the loss for $t=1$")

#### Exercise - Plot the Final Product

And finally implement the functions for the product $\frac{\partial J}{\partial \sigma}\frac{\partial \sigma}{\partial z}$ for

• the squared error
• the cross-entropy

for concrete $\sigma$ and $t=1$.

def product_derivative_squared_error_and_derivative_sigmoid(sigma):
    """
    Returns the product of the derivative of
    the squared error and the derivative of
    the sigmoid for concrete sigma and class t=1.

    :sigma: concrete value between 0 and 1
    :sigma type: float32.

    :returns: see above.
    :r type: float32.
    """
    raise NotImplementedError()

def product_derivative_cross_entropy_and_derivative_sigmoid(sigma):
    """
    Returns the product of the derivative of
    the cross-entropy and the derivative of
    the sigmoid for concrete sigma and class t=1.

    :sigma: concrete value between 0 and 1
    :sigma type: float32.

    :returns: see above.
    :r type: float32.
    """
    raise NotImplementedError()
assert product_derivative_cross_entropy_and_derivative_sigmoid(0) == -1
assert product_derivative_squared_error_and_derivative_sigmoid(0) == 0
assert product_derivative_cross_entropy_and_derivative_sigmoid(1) == 0
assert product_derivative_squared_error_and_derivative_sigmoid(1) == 0
plt.plot(sigma, product_derivative_squared_error_and_derivative_sigmoid(sigma), label="product for squared_error", color="red")
plt.plot(sigma, product_derivative_cross_entropy_and_derivative_sigmoid(sigma), label="product for cross_entropy", color="green")
plt.xlabel("sigma")
plt.ylabel("product of the derivatives")
plt.legend()
_ = plt.title(r"product $\partial J / \partial \sigma \cdot \partial \sigma / \partial z$ for $t=1$")

### Sidenote - Why the Squared Error works for Linear Regression

For linear regression, our activation function $a$ is the identity function. In other words, there is effectively no activation function.

This means $a(z) = z$ and therefore $\frac{\partial a(z)}{\partial z} = 1$, which results in:

$\frac{\partial J}{\partial a}\frac{\partial a}{\partial z} = \frac{\partial J}{\partial a} 1 = \frac{\partial J}{\partial a}$

Once you have completed the exercises above, you will see that the term for linear regression $\frac{\partial J}{\partial a}$ (with the squared error) exactly equals the product term for logistic regression (with the cross-entropy):

$\frac{\partial J}{\partial a}= \frac{\partial J}{\partial \sigma} \frac{\partial \sigma}{\partial z}$
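As a quick check of this claim (using the same squared error $J(a) = \frac{1}{2}(a-t)^2$ as above, now with a continuous target $t$):

$\frac{\partial J}{\partial a} = \frac{\partial}{\partial a} \frac{1}{2}(a-t)^2 = a - t$

which has exactly the form $\sigma(z) - t$ of the product term obtained for the sigmoid with cross-entropy.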

## Literature

The following license applies to the complete notebook, including code cells. It does not, however, apply to any referenced external media (e.g., images).

Exercise - Natural Pairing
by Christian Herta, Klaus Strohmenger