HTW Berlin - Angewandte Informatik - Advanced Topics - Exercise - Meaning of Softmax Output Probability
Introduction
When classifying inputs into discrete classes, two core concepts are the logistic function (for binary classification) and the softmax function (for more than two classes). Both functions are used in 'simple' logistic / softmax regression (no hidden layers) and are still used in complex neural networks with dozens of layers, be it a fully connected network, a convolutional neural network, or a recurrent neural network.
A solid understanding of the softmax function is therefore valuable. This pen & paper exercise is intended to help you build that understanding.
Requirements
Knowledge
To complete this exercise notebook, you should possess knowledge about the following topics.
 Expected Value (Exercise Expected Value)
 Bayes rule (Exercise Bayes Rule)
 Softmax function
 Cross-entropy
 Constrained optimization with Lagrange multipliers
The following material can help you to acquire this knowledge:
 Softmax, cross-entropy, gradient descent:
 Chapter 5 and 6 of the Deep Learning Book [GOO16]
 Chapter 5 of the book Pattern Recognition and Machine Learning by Christopher M. Bishop [BIS07]
 Logistic Regression (binary):
 Video 15.3 and following in the playlist Machine Learning

Lagrange multiplier:
 Videos from Khan Academy, starting here and following.
Exercise - Meaning of Softmax Output Probability
Expected cost over the training data when using softmax outputs and the cross-entropy cost:
$$
\begin{split}
\mathbb E[J_{train}(\theta)] & = - \int_{\mathcal X} \sum_{i=1}^c \log o_i({\bf x}; \theta) \cdot p_{train}({\bf x}, y=i) \text{ } d{\bf x} \\
& = - \int_{\mathcal X} \left(\sum_{i=1}^c \log o_i({\bf x}; \theta) \cdot p_{train}(y=i\mid{\bf x}) \right)p_{train}({\bf x}) \text{ } d{\bf x}
\end{split}
$$
with
 The input vector ${\bf x}$
 The target value $y \in \{1, \dots, c\}$, encoded as the class index
 The $i$-th output $o_i$ of the network (one output neuron per class, matching a one-hot encoding)
 $c$, the number of different (mutually exclusive) classes
 $p_{train}({\bf x}, y=i)$, the joint probability function of the training data (true labels, also one-hot encoded)
 $p_{train}({\bf x})$, the probability density function of the inputs
Sidenote
For further reading: [BIS94]
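First, do not let this equation scare you off. It is just the combination of the expected value formula and the cross-entropy cost.
General equation for the expected value of a function $f$ with respect to a probability density function $p$:
$$
\mathbb E_{{\bf x} \sim p}[f({\bf x})] = \int_{\mathcal X} f({\bf x}) \, p({\bf x}) \text{ } d{\bf x}
$$
And the general equation for the cross-entropy between the true distribution $p$ and the computed predictions (net output) $q$:
$$
H(p, q) = - \sum_{i=1}^c p_i \log q_i
$$
The second line in the equation for the expected cost follows from the product rule $p_{train}({\bf x}, y=i) = p_{train}(y=i\mid{\bf x}) \, p_{train}({\bf x})$, the factorization that also underlies Bayes' rule.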
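The two forms of the expected cost can be compared numerically. The following is a minimal sketch, not part of the original exercise: the toy data distribution (`A_TRUE`, `B_TRUE`, one-dimensional Gaussian inputs, three classes) and the helper names are assumptions made purely for illustration. It estimates the expected cost by Monte Carlo sampling, once from sampled pairs $({\bf x}, y)$ (joint form) and once by averaging the cross-entropy against the known conditional $p_{train}(y\mid{\bf x})$ (conditional form).

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


# Hypothetical toy distribution (an assumption, not from the exercise):
# x ~ N(0, 1) in one dimension, c = 3 classes,
# p_train(y | x) = softmax(x * A_TRUE + B_TRUE).
A_TRUE = np.array([[-2.0, 0.0, 2.0]])  # shape (1, 3)
B_TRUE = np.array([0.0, 0.5, 0.0])     # shape (3,)


def p_y_given_x(x):
    """True conditional class probabilities, shape (n, 3)."""
    return softmax(x @ A_TRUE + B_TRUE)


def other_model(x):
    """A deliberately mismatched model output, for comparison."""
    return softmax(x @ (0.5 * A_TRUE) + B_TRUE)


def sample(n):
    """Draw n pairs (x, y) from the toy joint distribution."""
    x = rng.normal(size=(n, 1))
    p = p_y_given_x(x)
    # Inverse-CDF sampling of one class index per row.
    u = rng.random((n, 1))
    y = (u > p.cumsum(axis=1)).sum(axis=1)
    return x, np.minimum(y, p.shape[1] - 1)  # guard against rounding in cumsum


def cost_joint(outputs, x, y):
    """Joint form: average of -log o_y(x) over sampled pairs (x, y)."""
    o = outputs(x)
    return -np.mean(np.log(o[np.arange(len(y)), y]))


def cost_conditional(outputs, x):
    """Conditional form: average over x of -sum_i p(y=i|x) log o_i(x)."""
    o = outputs(x)
    return -np.mean(np.sum(p_y_given_x(x) * np.log(o), axis=1))


x, y = sample(200_000)
for name, model in [("true conditional", p_y_given_x), ("other model", other_model)]:
    print(name, cost_joint(model, x, y), cost_conditional(model, x))
```

Under these assumptions, both forms should agree up to sampling noise for each model, and the model whose outputs equal the true conditional should show the lower cost.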
Exercise - Proof
So for all ${\bf x}$ with $p_{train}({\bf x})>0$, the cost function is minimal if
$$
- \sum_{i=1}^c \log o_i({\bf x}; \theta) \cdot p_{train}(y=i\mid{\bf x})
$$
is minimal.
For all exercises, we assume a neural network with sufficient capacity to represent any function, so that the outputs can be chosen independently for each ${\bf x}$.
Task:
Under the (softmax) constraint that $\sum_i o_i({\bf x}; \theta)=1$, show that the outputs $o_i$ that minimize the cost for all ${\bf x}$ have to fulfill:
$$
o_i({\bf x}; \theta) = p_{train}(y=i\mid{\bf x})
$$
Hint:
Minimization / maximization problem with constraint $\rightarrow$ Lagrange multiplier.
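Before attempting the proof, the claim can be sanity-checked numerically at a single fixed ${\bf x}$. The sketch below is an illustration only and does not replace the Lagrange derivation: instead of a multiplier, it enforces the constraint by writing the outputs as a softmax of free logits `z` and running plain gradient descent. The vector `p` is a hypothetical stand-in for $p_{train}(y=i\mid{\bf x})$, and the step size and iteration count are arbitrary choices.

```python
import numpy as np


def softmax(z):
    """Softmax of a 1-D logit vector, shifted for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()


# Hypothetical conditional class probabilities at one fixed x (assumption).
p = np.array([0.1, 0.6, 0.3])

# Free logits; the softmax guarantees that the outputs sum to 1.
z = np.zeros_like(p)

for _ in range(5000):
    o = softmax(z)
    # Gradient of -sum_i p_i * log(softmax(z)_i) with respect to z is (o - p),
    # which holds because the entries of p sum to 1.
    z -= 0.5 * (o - p)

o = softmax(z)
cost = -np.sum(p * np.log(o))
print("outputs:", np.round(o, 4))
print("cost at the minimizer:", cost)
```

If the claim holds, the printed outputs should match `p`, and the remaining cost should equal the entropy of `p`, which is the lowest value the cross-entropy can take for this target.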
Exercise - Discussion
Task:
Answer the following questions:
 Discuss the first exercise. What is the meaning of $o_i$ (the softmax output)?
 How reliable is such an interpretation of $o_i$ for regions with low probability density $p_{train}({\bf x})$?
Exercise - Logistic Function as Special Case
Softmax
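The softmax function for $k$ classes is defined as:
$$
o_i({\bf x}; \theta) = \frac{\exp({\bf x}\theta_i)}{\sum_{j=1}^{k} \exp({\bf x}\theta_j)}
$$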
Logistic
For binary classification, we normally use the logistic function, defined as:
$$
h({\bf x}; {\bf w}) = \frac{1}{1 + \exp(-{\bf x}{\bf w})}
$$
in conjunction with a threshold for $h$ (e.g. $0.5$) to decide whether ${\bf x}$ belongs to class 1 or not.
with:
 $\theta$ being a weight matrix, e.g. 4x3 for 4 features and 3 classes (softmax)
 $\theta_i$ being a column weight vector, the $i$-th column of $\theta$ (e.g. 4x1)
 ${\bf x}$ being a row feature vector (e.g. 1x4)
 ${\bf w}$ being a column weight vector, e.g. 4x1 for 4 features (logistic)
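The shape conventions listed above can be made concrete with a short NumPy sketch. It assumes the standard forms of both functions, $\exp({\bf x}\theta_i) / \sum_j \exp({\bf x}\theta_j)$ for softmax and $1 / (1 + \exp(-{\bf x}{\bf w}))$ for the logistic function. The function names and the random example values are hypothetical.

```python
import numpy as np


def softmax_outputs(x, theta):
    """Softmax over the class scores x @ theta.

    x:     row feature vector, shape (1, n_features)
    theta: weight matrix, shape (n_features, k), one column per class
    Returns an array of shape (1, k) whose entries sum to 1.
    """
    z = x @ theta
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def logistic(x, w):
    """Logistic function of the score x @ w.

    x: row feature vector, shape (1, n_features)
    w: column weight vector, shape (n_features, 1)
    Returns an array of shape (1, 1) with a value in (0, 1).
    """
    return 1.0 / (1.0 + np.exp(-(x @ w)))


rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))      # 4 features
theta = rng.normal(size=(4, 3))  # 4 features, 3 classes
w = rng.normal(size=(4, 1))      # 4 features, binary case

o = softmax_outputs(x, theta)
h = logistic(x, w)

print("softmax output shape:", o.shape, "sum:", o.sum())
print("logistic output shape:", h.shape)
print("assigned to class 1 with threshold 0.5:", bool(h.item() > 0.5))
```

Note the difference in parameter count: the softmax model here carries one weight column per class, while the logistic model needs only a single weight vector.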
Task:
 Show that the logistic function is the same as the softmax function for $k = 2$.
 Why is it better to use the logistic function for a binary classification problem?
Literature
Licenses
Notebook License (CC BY-SA 4.0)
The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).
HTW Berlin - Angewandte Informatik - Advanced Topics - Exercise - Meaning of Softmax Output Probability
by Christian Herta, Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.
Code License (MIT)
The following license only applies to code cells of the notebook.
Copyright 2018 Christian Herta, Klaus Strohmenger
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.