HTW Berlin - Angewandte Informatik - Advanced Topics - Exercise - Meaning of Softmax Output Probability

Introduction

When dealing with classification tasks, two core concepts are the logistic function (for binary classification) and the softmax function (for more than two classes). Both functions are used in 'simple' logistic / softmax regression (no hidden layers) and are still used in complex neural networks with dozens of layers, be it a fully connected network, a convolutional neural network or a recurrent neural network.

Therefore, a solid understanding of the softmax function is worth a lot. This pen & paper exercise is intended to help you build this understanding.

Requirements

Knowledge

To complete this exercise notebook, you should possess knowledge about the following topics.

  • Expected Value (Exercise Expected Value)
  • Bayes rule (Exercise Bayes Rule)
  • Softmax function
  • Cross-entropy
  • Constrained Optimization with Lagrange multiplier

The following material can help you to acquire this knowledge:

  • Softmax, cross-entropy, gradient descent:
    • Chapters 5 and 6 of the Deep Learning Book [GOO16]
    • Chapter 5 of the book Pattern Recognition and Machine Learning by Christopher M. Bishop [BIS07]
  • Logistic Regression (binary):
    • Video 15.3 and following in the playlist Machine Learning
  • Lagrange multiplier:
    • Videos from Khan Academy, starting here and following.

Exercise - Meaning of Softmax Output Probability

Expected cost for the train data using softmax and cross-entropy:

\begin{split} \mathbb E[J_{train}(\theta)] & = - \int_{\mathcal X} \sum_{i=1}^c \log o_i({\bf x}; \theta) \cdot p_{train}({\bf x}, y=i) \text{ } d{\bf x} \\ & \\ & = - \int_{\mathcal X} \left(\sum_{i=1}^c \log o_i({\bf x}; \theta) \cdot p_{train}(y=i\mid{\bf x}) \right)p_{train}({\bf x}) \text{ } d{\bf x} \\ \end{split}

with

  • The input vector ${\bf x}$
  • The target value $y \in \{1, \dots, c\}$, encoded with the class index
  • $o_i$: the $i$-th output (one-hot encoded output neurons)
  • $c$: the number of different (exclusive) classes
  • $p_{train}(y=i, {\bf x})$: joint probability function for the training data (true labels, also one-hot)
  • $p_{train}({\bf x})$: probability density function of the training data

Sidenote

For further reading: [BIS94]

First, do not let this equation scare you off. It is just the combination of the expected value formula and the cross-entropy cost.

General equation for the expected value of a function ff with respect to a probability density function pp:

\mathbb E_{x \sim p} [f(x)] = \int_{\mathcal X} p(x) \cdot f(x) \text{ } dx
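As a side note (not part of the original exercise), this integral is what a sample average approximates: for samples drawn from $p$, the mean of $f$ over the samples converges to $\mathbb E_{x \sim p}[f(x)]$. A minimal NumPy sketch, with a hypothetical $f$ and a standard normal $p$:

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        return x ** 2  # hypothetical function, only for illustration

    # x drawn from p = N(0, 1); the sample mean approximates E_{x~p}[f(x)]
    samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
    print(np.mean(f(samples)))  # close to 1.0, the exact E[x^2] under N(0, 1)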

And the general equation for the cross-entropy with the computed predictions (net output) qq:

H(p, q) = -\sum_{i=1}^c p_i(x) \cdot \log q_i(x)

The second line in the equation for the expected value is just the factorization of the joint probability into the conditional times the marginal (the product rule underlying Bayes' rule):

p_{train}({\bf x}, y=i) = p_{train}(y=i\mid{\bf x}) \cdot p_{train}({\bf x})
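Putting the pieces together for a finite training set: the expectation over $p_{train}$ becomes an average over the training examples, and the inner sum is the cross-entropy between the one-hot label and the network output. A minimal NumPy sketch with hypothetical toy values (not part of the original exercise):

    import numpy as np

    # Hypothetical toy data: 4 training examples, c = 3 classes.
    # y_true holds the one-hot encoded labels, i.e. p_train(y=i | x) for each example.
    y_true = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [0, 0, 1],
                       [1, 0, 0]], dtype=float)

    # o holds softmax outputs o_i(x; theta) of some network; each row sums to 1.
    o = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5],
                  [0.6, 0.3, 0.1]])

    # Empirical training cost: average cross-entropy over the examples,
    # the finite-sample counterpart of E[J_train(theta)] above.
    J_train = -np.mean(np.sum(y_true * np.log(o), axis=1))
    print(J_train)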

Exercise - Proof

So for all ${\bf x}$ with $p_{train}({\bf x})>0$ the cost function is minimal if

-\sum_{i=1}^c p_{train}(y=i\mid{\bf x}) \cdot \log o_i({\bf x}; \theta)

is minimal.

For all exercises we assume that we have a neural network with sufficient complexity to fit any possible function for each ${\bf x}$.

Task:

Under the (softmax) constraint that $\sum_i o_i({\bf x}; \theta)=1$, show that the outputs $o_i$ that minimize the cost for all ${\bf x}$ have to fulfill:

o_i({\bf x}; \theta) = p_{train}(y=i\mid{\bf x})

Hint:

Minimization / maximization problem with constraint \rightarrow Lagrange multiplier.
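As a reminder of the general recipe (not the worked solution): to extremize a function $f(o_1, \dots, o_c)$ under an equality constraint $g(o_1, \dots, o_c) = 0$, one searches for stationary points of the Lagrangian

\mathcal L(o_1, \dots, o_c, \lambda) = f(o_1, \dots, o_c) + \lambda \cdot g(o_1, \dots, o_c)

with respect to all $o_i$ and $\lambda$. For this task the constraint is $g = \sum_i o_i({\bf x}; \theta) - 1$.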

Exercise - Discussion

Task:

Answer the following questions:

  • Discuss the first exercise. What is the meaning of $o_i$ (the softmax output)?
  • How reliable is such an interpretation of $o_i$ in regions with low probability density $p_{train}({\bf x})$?

Exercise - Logistic Function as Special Case

Softmax

The softmax function is defined as:

o_i({\bf x}; \theta) = \frac{\exp({\bf x} \theta_i)}{\sum_{k=1}^c \exp({\bf x} \theta_k)}
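A minimal NumPy sketch of this definition (not part of the original exercise; the shift by the maximum is only for numerical stability and does not change the result):

    import numpy as np

    def softmax(logits):
        """Softmax over the last axis; here logits = x @ Theta with shape (..., c)."""
        shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
        exp = np.exp(shifted)
        return exp / np.sum(exp, axis=-1, keepdims=True)

    x = np.array([1.0, 2.0, 0.5, -1.0])                   # 1x4 feature vector (4 features)
    Theta = np.random.default_rng(0).normal(size=(4, 3))  # hypothetical 4x3 weight matrix (3 classes)
    o = softmax(x @ Theta)
    print(o, o.sum())  # the outputs are positive and sum to 1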

Logistic

For binary classification, we normally use the logistic function, defined as:

h({\bf x}; {\bf w}) = \frac{1}{1 + \exp(-{\bf x}{\bf w})}

in conjunction with a threshold for $h$ (e.g. $0.5$) to decide if ${\bf x}$ belongs to class 1 or not.

with:

  • $\theta$ being a weight matrix:
    • e.g. 4x3, for 4 features and 3 classes (softmax)
  • $\theta_i$ being a column weight vector (e.g. 4x1)
  • ${\bf x}$ being a row feature vector (e.g. 1x4)
  • ${\bf w}$ being a column weight vector, e.g. 4x1, for 4 features (logistic)
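A small numerical sanity check (assuming NumPy; this is not the requested algebraic proof): a two-class softmax with weight vectors $\theta_1, \theta_2$ yields the same class-1 probability as the logistic function with ${\bf w} = \theta_1 - \theta_2$.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=4)            # 1x4 feature vector
    theta = rng.normal(size=(4, 2))   # hypothetical weights of a 2-class softmax

    # class-1 probability of the two-class softmax
    logits = x @ theta
    p_softmax = np.exp(logits[0]) / np.sum(np.exp(logits))

    # logistic function with w = theta_1 - theta_2
    w = theta[:, 0] - theta[:, 1]
    p_logistic = 1.0 / (1.0 + np.exp(-(x @ w)))

    print(p_softmax, p_logistic)  # both values agree up to floating point error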

Task:

  • Show that the logistic function is the same as the softmax function for $c = 2$ (two classes)
  • Why is it better to use the logistic function for a binary classification problem?

Literature

  • [GOO16] Ian Goodfellow, Yoshua Bengio, Aaron Courville: Deep Learning. MIT Press, 2016.
  • [BIS07] Christopher M. Bishop: Pattern Recognition and Machine Learning. Springer, 2006.

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

HTW Berlin - Angewandte Informatik - Advanced Topics - Exercise - Meaning of Softmax Output Probability
by Christian Herta, Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Christian Herta, Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.