# HTW Berlin - Angewandte Informatik - Advanced Topics - Exercise - Meaning of Softmax Output Probability

## Introduction

When classifying data into discrete classes, a core concept is the logistic function (binary classification) and its generalization, the softmax function (more than two classes). Both functions are used in 'simple' logistic / softmax regression (no hidden layers) and are still being used in complex neural networks with dozens of layers, be it a simple fully connected network, a convolutional neural network or a recurrent neural network.

Therefore, a solid understanding of the softmax function is worth a lot. This pen & paper exercise is intended to help you build up that understanding.

## Requirements

### Knowledge

To complete this exercise notebook, you should be familiar with the following topics:

• Expected Value (Exercise Expected Value)
• Bayes rule (Exercise Bayes Rule)
• Softmax function
• Cross-entropy
• Constrained Optimization with Lagrange multiplier

Suitable sources for acquiring this knowledge are:

• Chapters 5 and 6 of the Deep Learning Book [GOO16]
• Chapter 5 of the book Pattern Recognition and Machine Learning by Christopher M. Bishop [BIS07]
• Logistic Regression (binary):
  • Video 15.3 and following in the playlist Machine Learning
• Lagrange multiplier:
  • Videos from Khan Academy, starting here and following.

## Exercise - Meaning of Softmax Output Probability

Expected cost for the train data using softmax and cross-entropy:

$$
\begin{split}
\mathbb E[J_{train}(\theta)] & = - \int_{\mathcal X} \sum_{i=1}^c \log o_i({\bf x}; \theta) \cdot p_{train}({\bf x}, y=i) \text{ } d{\bf x} \\
& = - \int_{\mathcal X} \left(\sum_{i=1}^c \log o_i({\bf x}; \theta) \cdot p_{train}(y=i\mid{\bf x}) \right) p_{train}({\bf x}) \text{ } d{\bf x}
\end{split}
$$

with

• The input vector ${\bf x}$
• The target value $y \in \{1, \dots, c\}$ is encoded with the class index
• $i$-th output $o_i$ (one hot encoded output neurons)
• $c$ number of different (exclusive) classes
• $p_{train}(y=i, {\bf x})$ joint probability function for the training data (true labels, also one hot)
• $p_{train}({\bf x})$ probability density function

Sidenote

First, do not let this equation scare you off. It is just the combination of the expected value formula and the cross-entropy cost.

General equation for the expected value of a function $f$ with respect to a probability density function $p$:

$\mathbb E_{x \sim p} [f(x)] = \int_{\mathcal X} p(x) \cdot f(x) \text{ } dx$
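As a small illustration (not part of the exercise), the integral can be approximated by a Monte Carlo average over samples drawn from $p$. The sketch below uses numpy with an arbitrary example: $x \sim \mathcal N(0,1)$ and $f(x) = x^2$, for which the expected value is the variance, $1$.

```python
import numpy as np

# Monte Carlo sketch of E_{x~p}[f(x)]: averaging f over samples from p
# approximates the integral of p(x) * f(x) dx.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # samples from p = N(0, 1)
estimate = (x ** 2).mean()           # f(x) = x**2, so E[f(x)] = Var(x) = 1
print(estimate)                      # close to 1.0
```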

And the general equation for the cross-entropy with the computed predictions (net output) $q$:

$H(p, q) = -\sum_{i=1}^c p_i(x) \cdot \log q_i(x)$

The second line in the equation for the expected value is just the application of the product rule of probability (the identity underlying Bayes' rule):

$p_{train}({\bf x}, y=i) = p_{train}(y=i\mid{\bf x}) \cdot p_{train}({\bf x})$
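The cross-entropy formula from the sidenote can also be evaluated numerically. The sketch below (not part of the exercise, distributions are arbitrary example values) shows that a wrong prediction $q$ incurs a higher cross-entropy than the true distribution $p$ itself:

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log(q_i); p is the true class
    # distribution, q the predicted one.
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])   # example "true" distribution
q = np.array([0.5, 0.3, 0.2])   # example prediction
print(cross_entropy(p, q))      # larger than ...
print(cross_entropy(p, p))      # ... H(p, p), the entropy of p
```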

### Exercise - Proof

So for all $\bf x$ with $p_{train}({\bf x})>0$ the expected cost is minimal if

$-\sum_{i=1}^c p_{train}(y=i\mid{\bf x}) \cdot \log o_i({\bf x}; \theta)$

is minimal, i.e., if the sum without the minus sign is maximal.

For all exercises we assume that we have a neural network with sufficient complexity to fit any possible function, i.e., the outputs $o_i({\bf x}; \theta)$ can take arbitrary values for each $\bf x$.

Under the (softmax) constraint that $\sum_i o_i({\bf x}; \theta)=1$, show that the outputs $o_i$ that minimize the cost for all $\bf x$ have to fulfill:

$o_i({\bf x}; \theta) = p_{train}(y=i\mid{\bf x})$

Hint:

Minimization / maximization problem with constraint $\rightarrow$ Lagrange multiplier.
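Before (or after) doing the proof on paper, you can check the claim numerically. The sketch below (numpy, arbitrary example values, no substitute for the Lagrange derivation) samples many distributions $q$ from the probability simplex and looks for the one with the smallest cross-entropy against a fixed $p$; the minimizer lands close to $p$ itself:

```python
import numpy as np

# Numerical sanity check: among random distributions q on the simplex,
# the cross-entropy -sum_i p_i * log(q_i) is smallest for q closest to p.
rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])                # stands in for p_train(y=i | x)
q = rng.dirichlet(np.ones(3), size=100_000)  # uniform samples on the simplex
costs = -(p * np.log(q)).sum(axis=1)         # cross-entropy of each candidate
best = q[costs.argmin()]
print(best)                                  # close to p = [0.6, 0.3, 0.1]
```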

### Exercise - Discussion

• Discuss the first exercise. What is the meaning of $o_i$ (the softmax-output)?
• How reliable is such an interpretation of $o_i$ for regions with low probability density $p_{train}({\bf x})$?

### Exercise - Logistic Function as Special Case

Softmax

The softmax function is defined as:

$o_i({\bf x}; \theta) = \frac{\exp({\bf x} \theta_i)}{\sum_{k=1}^c\exp ({\bf x}\theta_k )}$

Logistic

For binary classification, we normally use the logistic function, defined as:

$h({\bf x}; {\bf w}) = \frac{1}{1 + \exp(-{\bf x}{\bf w} )}$

in conjunction with a threshold for $h$ (e.g. $0.5$) to decide if $\bf x$ belongs to class 1 or not.

with:

• $\theta$ being a weight matrix:
  • e.g. 4x3, for 4 features and 3 classes (softmax)
• $\theta_i$ being a column weight vector (e.g. 4x1)
• $\bf x$ being a row feature vector (e.g. 1x4)
• ${\bf w}$ being a column weight vector, e.g. 4x1, for 4 features (logistic)
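The shapes above can be made concrete with a small implementation (a sketch with arbitrary example weights, not part of the exercise). The usual shift by the maximum before exponentiating does not change the result but avoids overflow:

```python
import numpy as np

def softmax(x, theta):
    # x: (1, 4) row feature vector, theta: (4, 3) weight matrix,
    # matching the shapes given above.
    z = x @ theta
    z = z - z.max()        # shift for numerical stability (result unchanged)
    e = np.exp(z)
    return e / e.sum()

def logistic(x, w):
    # x: (1, 4) row feature vector, w: (4, 1) column weight vector.
    return 1.0 / (1.0 + np.exp(-(x @ w)))

x = np.array([[1.0, 2.0, -1.0, 0.5]])        # example features
theta = np.arange(12.0).reshape(4, 3) / 10.0  # example weights
o = softmax(x, theta)
print(o, o.sum())                             # the outputs sum to 1
```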

• Show that the logistic function is the same as the softmax function for $c = 2$
• Why is it better to use the logistic function for a binary classification problem?
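Once you have the pen & paper result, you can sanity-check it numerically. The sketch below (arbitrary random inputs) builds a two-class softmax whose second weight column is zero and compares its first output with the logistic function of ${\bf x}{\bf w}$:

```python
import numpy as np

# Check: a two-class softmax with columns theta_1 = w and theta_2 = 0
# reproduces the logistic function, since
# exp(xw) / (exp(xw) + exp(0)) = 1 / (1 + exp(-xw)).
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4))              # example row feature vector
w = rng.standard_normal((4, 1))              # example column weight vector
theta = np.hstack([w, np.zeros((4, 1))])     # theta_2 = 0

z = x @ theta
softmax_out = np.exp(z) / np.exp(z).sum()
logistic_out = 1.0 / (1.0 + np.exp(-(x @ w)))
print(softmax_out[0, 0], logistic_out[0, 0])  # identical values
```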

## Literature

• [GOO16] Ian Goodfellow, Yoshua Bengio, Aaron Courville: Deep Learning. MIT Press, 2016.
• [BIS07] Christopher M. Bishop: Pattern Recognition and Machine Learning. Springer.

## License

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

HTW Berlin - Angewandte Informatik - Advanced Topics - Exercise - Meaning of Softmax Output Probability
by Christian Herta, Klaus Strohmenger