ML-Fundamentals - Probability Theory - Bayes' Theorem

Why is Bayes' Theorem of Interest

Bayesian inference lets you draw stronger conclusions from your data by folding in what you already know about the answer, and Bayes' Theorem is the fundamental mathematical rule it relies on. That is the long-term goal; first, let us look at some example problems where Bayes' Theorem can provide an answer.

  • You took a newly developed test for a serious disease and it came back positive. The doctor tells you that the prevalence of the disease is 0.01% and that the test indicates the presence of the disease with 95% probability if the disease is indeed present. What is the probability that you have the disease?

  • At the "Südkreuz" S-Bahn station in Berlin, an automatic face recognition system was installed to identify wanted terrorists. The accuracy of the system is 99.9%, and security experts estimate that of the 4,200,000 people using the station in one year, about 13 have a terrorist background. In 0.0009% of cases, a person who is not wanted is identified as a terrorist. If the system reports a detection, what is the probability that the person really is a wanted terrorist?
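As a preview, the face recognition scenario can be worked out in a few lines of Python once Bayes' Theorem is in hand. Note one assumption on our part: we read the quoted 99.9% "accuracy" as the probability that the system flags a person who really is wanted.

```python
# Hypothetical numbers from the Südkreuz scenario above.
p_terrorist = 13 / 4_200_000           # prior: wanted person among passengers
p_alarm_given_terrorist = 0.999        # assumed true-positive rate ("accuracy")
p_alarm_given_innocent = 0.0009 / 100  # false-positive rate: 0.0009%

# Total probability of the system raising an alarm (law of total probability).
p_alarm = (p_alarm_given_terrorist * p_terrorist
           + p_alarm_given_innocent * (1 - p_terrorist))

# Bayes' Theorem: probability that a reported detection is a wanted person.
p_terrorist_given_alarm = p_alarm_given_terrorist * p_terrorist / p_alarm
print(f"{p_terrorist_given_alarm:.1%}")  # ≈ 25.6%
```

Despite the impressive-sounding accuracy, only about a quarter of the alarms would concern an actual wanted person, because the prior probability is so tiny.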

In this notebook you will learn to solve such problems with Bayes' Theorem. You will learn the prerequisites needed to derive Bayes' Theorem and see how it can be visualized. As a side note, Bayes' Theorem is often called Bayes' rule, Bayes' law or Bayesian probability, so do not be confused if you read one of these terms in another source.

Bayes' Theorem

We start with the formula you will find most often when you search for Bayes' Theorem, to make our goal clearer. Then we will set it aside for a moment and look at some fundamentals of probability theory from a visual perspective.

$$P(A \mid B) = \frac{P(B \mid A) * P(A)}{P(B)}$$

Visual derivation

Suppose we have a universe $\Omega$ that represents all possible outcomes, e.g. of an experiment or a prediction. We call one possible outcome an event $A$ in our universe $\Omega$. In a concrete example, the universe could consist of the people participating in a scientific study on chocolate addiction. Event $A$ could represent all people who are addicted to chocolate. All other individuals make up the remaining population of $\Omega$ (formally $\neg A$).

So what is the probability that a person chosen at random is addicted to chocolate? Probability theory tells us it is the number of elements in our observed event $A$ (the cardinality of $A$) divided by the number of elements in our universe $\Omega$ (the cardinality of $\Omega$).

$$P(A) = \frac{\left| A \right|}{\left| \Omega \right|}$$

In the same way, further events can be added. In our study, a new test that detects chocolate addiction is to be evaluated. Let us call it event $B$. Event $B$ includes all people with a 'positive' test result, which indicates an addiction. We can visualize it and calculate the probability $P(B)$ just as we did for event $A$: the probability $P(B)$ that a person randomly selected from our population has a positive test result is the cardinality of event $B$ divided by the cardinality of the universe $\Omega$.

So far we have considered the events independently, one after another. Let us change that. As you can see, event $A$ and event $B$ share some people of our study population. From here it becomes more interesting, because more complex problem statements are possible.

One simple question is: what is the probability that both events occur, $P(A \cap B)$? In words: the probability that a randomly selected person is addicted to chocolate and has a positive test. $P(A \cap B)$ is the cardinality of the intersection of the two events divided by the cardinality of the universe.

$$P(A \cap B) = \frac{\left| A \cap B \right|}{\left| \Omega \right|}$$
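The cardinality formulas above can be checked directly with Python sets. The small study population below is made up purely for illustration:

```python
# A toy universe of 10 study participants, identified by number.
omega = set(range(10))
A = {0, 1, 2, 3}        # addicted to chocolate
B = {2, 3, 4, 5, 6}     # tested positive

p_A = len(A) / len(omega)            # P(A) = |A| / |Omega|
p_B = len(B) / len(omega)            # P(B) = |B| / |Omega|
p_A_and_B = len(A & B) / len(omega)  # P(A ∩ B) = |A ∩ B| / |Omega|

print(p_A, p_B, p_A_and_B)  # 0.4 0.5 0.2
```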

But there is more information in the diagram. Another question, one that leads us to Bayes' Theorem, is: 'if a person got a positive test, what is the probability that he or she is addicted?'. The mathematical notation for such a question is a conditional probability $P(A \mid B)$, the probability of event $A$ given event $B$. You can visualize it by restricting the area of our diagram to the region of event $B$ and asking for the intersection $A \cap B$.

From the visualization it becomes clear that the following applies:

$$P(A \mid B) = \frac{\left| A \cap B \right|}{\left| B \right|}$$

But here we changed our perspective from the universe $\Omega$ to our event $B$. To get back to probabilities over the universe, we divide the numerator and the denominator by the cardinality of $\Omega$, i.e. $|\Omega|$.

$$P(A \mid B) = \frac{\frac{\left| A \cap B \right|}{|\Omega|}}{\frac{\left| B \right|}{|\Omega|}}$$
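Both views of the conditional probability, counting within $B$ and dividing probabilities over $\Omega$, give the same number. A quick check with toy sets (made-up data) confirms this:

```python
# Toy universe of 10 study participants.
omega = set(range(10))
A = {0, 1, 2, 3}        # addicted to chocolate
B = {2, 3, 4, 5, 6}     # tested positive

# View 1: restrict the universe to B and count, |A ∩ B| / |B|.
p_A_given_B_counts = len(A & B) / len(B)

# View 2: divide both cardinalities by |Omega| first.
p_A_given_B_probs = (len(A & B) / len(omega)) / (len(B) / len(omega))

print(p_A_given_B_counts, p_A_given_B_probs)  # 0.4 0.4
```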

We can do the same for the conditional probability $P(B \mid A)$ to get a similar equation and diagram. $P(B \mid A)$ answers the question: 'given that a randomly selected person is addicted, what is the probability that the person gets a positive test result?'.

We can transform the equation a bit to get $P(A \cap B)$:

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$
$$\Leftrightarrow P(A \cap B) = P(B \mid A)*P(A)$$

Another aspect recognizable from the combination of both diagrams is that $P(A \cap B)$ and $P(B \cap A)$ have the same probability and are therefore interchangeable. Now we have all the components (joint and conditional probabilities) needed to derive Bayes' Theorem.

$$P(A \cap B) = P(B \cap A)$$

Insert the equations we found during our visual derivation:

$$P(A \mid B) * P(B) = P(B \mid A) * P(A)$$

Solve the equation for the conditional probability $P(A \mid B)$.

$$P(A \mid B) = \frac{P(B \mid A) * P(A)}{P(B)}$$

Here it is: Bayes' Theorem. Let us look at one simple example to see how we can apply the equation in practice.

This problem is from the book 'Think Bayes' by Allen B. Downey DOW13 and is based on an example from Wikipedia. Suppose you have two bowls, and both contain two types of cookies but with a different distribution. Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 cookies of each kind.

Now you draw a cookie at random without looking at the bowls. It is a vanilla cookie. But what is the probability that the cookie came from bowl 1? Mathematically, we can express that question as a conditional probability:

$$P(bowl\;1 \mid vanilla\;cookie)$$

Well, a conditional probability is one term of our Bayes' Theorem equation. Let us take a look at the whole equation again and see if it helps us solve our cookie problem.

We can derive all probabilities we need to solve the equation from the description above.

  • $P(vanilla\;cookie \mid bowl\;1)$: the probability that we get a vanilla cookie from bowl 1, which is exactly $3/4$.

  • $P(bowl\;1)$: the probability of choosing bowl 1. Since we grabbed a cookie without looking at the bowls, we chose the bowl at random. Therefore we can assume both bowls are equally likely: $1/2 = P(bowl\;1) = P(bowl\;2)$.

  • $P(vanilla\;cookie)$: the probability of drawing a vanilla cookie from either bowl. Since we assumed that both bowls are equally likely and both contain the same number of cookies, we had the same chance of drawing any cookie. In total, there are 80 cookies and 50 of them are vanilla, so $P(vanilla\;cookie) = 50/80 = 5/8$.

Given all this information, we can answer our initial question:

$$P(bowl\;1 \mid vanilla\;cookie) = \frac{P(vanilla\;cookie \mid bowl\;1) * P(bowl\;1)}{P(vanilla\;cookie)}$$

$$P(bowl\;1 \mid vanilla\;cookie) = \frac{(3/4) * (1/2)}{(5/8)}$$

$$P(bowl\;1 \mid vanilla\;cookie) = 3/5$$
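Plugging the numbers into Python reproduces this result; treat the sketch below as a spoiler for the exercise mentioned next, where $P(vanilla\;cookie)$ is additionally computed via the law of total probability rather than by counting cookies:

```python
# Probabilities read off from the cookie problem description.
p_vanilla_given_bowl1 = 30 / 40   # bowl 1: 30 vanilla out of 40 cookies
p_vanilla_given_bowl2 = 20 / 40   # bowl 2: 20 vanilla out of 40 cookies
p_bowl1 = p_bowl2 = 1 / 2         # both bowls equally likely

# P(vanilla cookie) via the law of total probability (equals 50/80 = 5/8).
p_vanilla = (p_vanilla_given_bowl1 * p_bowl1
             + p_vanilla_given_bowl2 * p_bowl2)

# Bayes' Theorem.
p_bowl1_given_vanilla = p_vanilla_given_bowl1 * p_bowl1 / p_vanilla
print(p_bowl1_given_vanilla)  # 0.6
```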

As an exercise, you can implement the cookie problem in the deep.TEACHING notebook Machine Learning Fundamentals - Probability Theory - Exercise: Cookie Problem.

Plain mathematical derivation

Various textbooks give the mathematical derivation of Bayes' Theorem. We follow the derivation of FAH11 and recommend MUR12 as an additional international source.

For two given events $A, B \subset \Omega$ with $P(B) > 0$, the conditional probability is defined as

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

From the definition of the conditional probability and the symmetry of joint probabilities, the product rule can be derived and written as

$$P(A \cap B) = P(B \mid A) * P(A)$$

Combining the definition of the conditional probability with the product rule yields Bayes' Theorem:

$$P(A \mid B) = \frac{P(B \mid A) * P(A)}{P(B)}$$

Summary and Outlook

In this notebook you got an introduction to Bayes' Theorem from a visual perspective. You became familiar with probabilities, joint probabilities and conditional probabilities via Venn diagrams, and finally derived the equation of Bayes' Theorem.

You can deepen your knowledge of Bayes' Theorem in the offered exercises, e.g., Machine Learning Fundamentals - Probability Theory - Exercise: Cookie Problem. To learn more about how Bayes' Theorem is used in machine learning, take a look at the notebook Machine Learning Fundamentals - Probability Theory - Bayesian inference.



Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

ML-Fundamentals - Probability - Bayes' Theorem
by Benjamin Voigt
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Benjamin Voigt

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.



The deep.TEACHING notebooks are developed at CBMI, a research institute of HTW Berlin - University of Applied Sciences. The work is supported by the German Ministry of Education and Research (BMBF).