Machine Learning Fundamentals - Probability Theory - Exercise: Cookie Problem

Introduction

The Cookie Problem is a basic exercise in Bayesian statistics. The problem description text in this notebook is a direct quote from Allen B. Downey's book Think Bayes [DOW13].

This exercise requires working with probability tables. We have chosen to implement these tables as Pandas DataFrames, which provide a nicely formatted representation and allow columns to be accessed by name. The necessary DataFrame operations are briefly introduced in this notebook.
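
As a minimal sketch of what such a probability table looks like, the following cell builds one for a fair coin toss. The coin example, the variable name `coin` and the column name `X` are assumptions chosen for illustration and are not part of the Cookie Problem.

```python
import pandas as pd

# Hypothetical example: probability table of a fair coin toss.
# Each row holds one outcome and its probability.
coin = pd.DataFrame(
    [
        ['heads', 0.5],
        ['tails', 0.5],
    ],
    columns=['X', 'p'],
)

# Columns are accessed by name; the probabilities of a distribution sum to 1.
total = coin['p'].sum()
print(coin)
print(total)
```

The tables in the steps below follow the same layout: one column per random variable and a column `p` for the probability.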

To detect errors in your own code, execute the notebook cells containing assert or assert_almost_equal. These statements raise an exception as long as the calculated result is not yet correct.
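
The behaviour of these checks can be sketched with two arbitrary numbers. The values below are illustrative assumptions; only `assert_almost_equal` itself is taken from this notebook.

```python
from numpy.testing import assert_almost_equal

# Passes silently: the values agree to the default precision of 7 decimals,
# even though 0.1 + 0.2 is not exactly 0.3 in floating point arithmetic.
assert_almost_equal(0.1 + 0.2, 0.3)

# Fails: 0.25 is not close to 0.3, so an AssertionError is raised.
try:
    assert_almost_equal(0.25, 0.3)
    raised = False
except AssertionError:
    raised = True

print(raised)
```

This is why the checks in this notebook compare floating point results with assert_almost_equal rather than with `==`.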

Required Knowledge

  • Bayesian Statistics

    • Chain Rule
    • Bayes Rule
  • (Optional) Pandas Python Module

Required Python Modules

# External Modules
import pandas as pd
from numpy.testing import assert_almost_equal

"Suppose there are two bowls of cookies. Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of each.

Now suppose you choose one of the bowls at random and, without looking, select a cookie at random. The cookie is vanilla. What is the probability that it came from Bowl 1?" [DOW13]

We are looking for the probability value $P(B=1 \mid C=vanilla)$, where $B$ and $C$ are random variables for bowl and cookie. The following steps break down the problem and will guide you to a solution.

Step 1

What is the marginal distribution $P(B)$ for the random variable $B$ (bowl)?

# update functions B_1 and B_2 to return the correct probability values for B=1 and B=2 respectively
def B_1():
    # your code goes here
    # return P(B=1)
    raise NotImplementedError()

def B_2():
    # your code goes here
    # return P(B=2)
    raise NotImplementedError()
# run this cell to print table
B = pd.DataFrame(
    [
        ['1', B_1()],
        ['2', B_2()]
    ],
    columns=['B', 'p'],
)
B
   B    p
0  1  0.5
1  2  0.5
# raise exception if sum is not 1.0
assert_almost_equal(B['p'].sum(), 1.0)

Step 2

What is the marginal distribution $P(C)$ for the random variable $C$ (cookie)?

# update functions C_vanilla and C_chocolate to return the correct probability values for C=vanilla and C=chocolate respectively
def C_vanilla():
    # your code goes here
    # return P(C=vanilla)
    raise NotImplementedError()

def C_chocolate():
    # your code goes here
    # return P(C=chocolate)
    raise NotImplementedError()
# run this cell to print table
C = pd.DataFrame(
    [
        ['vanilla', C_vanilla()],
        ['chocolate',  C_chocolate()]
    ],
    columns=['C', 'p']
)
C
           C      p
0    vanilla  0.625
1  chocolate  0.375
# raise exception if sum is not 1.0
assert_almost_equal(C['p'].sum(), 1.0)
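
The expected values shown above can be derived by summing over both bowls (law of total probability). The following worked derivation is one possible route, using only the bowl contents from the problem statement and assuming that both bowls are equally likely to be chosen:

```latex
\begin{aligned}
P(C=\text{vanilla})
  &= P(B=1)\,P(C=\text{vanilla} \mid B=1) + P(B=2)\,P(C=\text{vanilla} \mid B=2) \\
  &= \tfrac{1}{2}\cdot\tfrac{30}{40} + \tfrac{1}{2}\cdot\tfrac{20}{40}
   = \tfrac{5}{8} = 0.625 \\
P(C=\text{chocolate})
  &= 1 - P(C=\text{vanilla}) = \tfrac{3}{8} = 0.375
\end{aligned}
```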

Step 3

What is the conditional probability distribution $P(C \mid B)$?

# update the following functions to return the correct probability values
def C_vanilla_given_B_1():
    # your code goes here
    # return P(C=vanilla | B=1)
    raise NotImplementedError()

def C_vanilla_given_B_2():
    # your code goes here
    # return P(C=vanilla | B=2)
    raise NotImplementedError()

def C_chocolate_given_B_1():
    # your code goes here
    # return P(C=chocolate | B=1)
    raise NotImplementedError()

def C_chocolate_given_B_2():
    # your code goes here
    # return P(C=chocolate | B=2)
    raise NotImplementedError()
# run this cell to print table
C_given_B = pd.DataFrame(
    [
        ['1', 'vanilla', C_vanilla_given_B_1()],
        ['1', 'chocolate', C_chocolate_given_B_1()],
        ['2', 'vanilla', C_vanilla_given_B_2()],
        ['2', 'chocolate', C_chocolate_given_B_2()]
    ],
    columns=['B', 'C', 'p'],
)
C_given_B
   B          C     p
0  1    vanilla  0.75
1  1  chocolate  0.25
2  2    vanilla  0.50
3  2  chocolate  0.50
# raise exception if sum is not 1.0
assert_almost_equal(C_given_B.query('B == "1"')['p'].sum(), 1.0)
assert_almost_equal(C_given_B.query('B == "2"')['p'].sum(), 1.0)
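
As a cross-check, the conditional table can also be computed from the raw cookie counts by normalizing within each bowl. This is only a sketch of one possible approach; the names `counts`, `bowl_totals` and `from_counts` and the column `n` are assumptions introduced for illustration.

```python
import pandas as pd

# Cookie counts taken from the problem statement.
counts = pd.DataFrame(
    [
        ['1', 'vanilla', 30],
        ['1', 'chocolate', 10],
        ['2', 'vanilla', 20],
        ['2', 'chocolate', 20],
    ],
    columns=['B', 'C', 'n'],
)

# Total number of cookies in the bowl of each row.
bowl_totals = counts.groupby('B')['n'].transform('sum')

# Normalize within each bowl: P(C | B) = count / total count in that bowl.
from_counts = counts[['B', 'C']].copy()
from_counts['p'] = counts['n'] / bowl_totals
print(from_counts)
```

Because the normalization happens per bowl, the probabilities sum to 1 for each value of B, which is exactly what the two assertions above check.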

Reminder - The following code shows how to access a single probability value in a Pandas DataFrame:

# query for rows where B=2 and C=vanilla
# a single row should be selected
response = C_given_B.query('B == "2" & C == "vanilla"')
response
   B        C    p
2  2  vanilla  0.5
# access probability value
response['p'].values[0]
# raise exception if the number of selected rows is not equal to 1
assert response.shape[0] == 1
# we define a convenient function to access a probability value in a DataFrame
def P(df, query):
    response = df.query(query)
    assert response.shape[0] == 1
    return response['p'].values[0]
P(C_given_B, 'B == "2" & C == "vanilla"')
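
The assertion inside the helper guards against queries that do not pin down exactly one row. The sketch below illustrates this on a standalone copy of the table. The names `table` and `lookup` are hypothetical stand-ins for `C_given_B` and `P`, so that the cell runs on its own.

```python
import pandas as pd

# Illustrative copy of the conditional table P(C | B) with the expected values.
table = pd.DataFrame(
    [
        ['1', 'vanilla', 0.75],
        ['1', 'chocolate', 0.25],
        ['2', 'vanilla', 0.50],
        ['2', 'chocolate', 0.50],
    ],
    columns=['B', 'C', 'p'],
)

def lookup(df, query):
    # Same logic as P above: exactly one row must match the query.
    response = df.query(query)
    assert response.shape[0] == 1
    return response['p'].values[0]

# A fully specified query selects one row and returns its probability.
value = lookup(table, 'B == "1" & C == "vanilla"')

# An underspecified query matches two rows, so the assertion fails.
try:
    lookup(table, 'B == "1"')
    ambiguous = False
except AssertionError:
    ambiguous = True

print(value, ambiguous)
```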

Step 4

What is the joint probability distribution $P(C, B)$?

Reminder - Chain rule formula:

$$P(C, B) = P(B) \cdot P(C \mid B)$$
# update the following functions to return the correct probability values
# use values already available in DataFrames B and C_given_B
def C_vanilla_B_1():
    # your code goes here
    # return P(C=vanilla, B=1)
    raise NotImplementedError()

def C_vanilla_B_2():
    # your code goes here
    # return P(C=vanilla, B=2)
    raise NotImplementedError()

def C_chocolate_B_1():
    # your code goes here
    # return P(C=chocolate, B=1)
    raise NotImplementedError()

def C_chocolate_B_2():
    # your code goes here
    # return P(C=chocolate, B=2)
    raise NotImplementedError()
# run this cell to print table
C_B = pd.DataFrame(
    [
        ['1', 'vanilla', C_vanilla_B_1()],
        ['1', 'chocolate', C_chocolate_B_1()],
        ['2', 'vanilla', C_vanilla_B_2()],
        ['2', 'chocolate', C_chocolate_B_2()]
    ],
    columns=['B', 'C', 'p']
)
C_B
   B          C      p
0  1    vanilla  0.375
1  1  chocolate  0.125
2  2    vanilla  0.250
3  2  chocolate  0.250
# raise exception if sum is not 1.0
assert_almost_equal(C_B['p'].sum(), 1.0)
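
Instead of multiplying value by value, the chain rule can also be applied to whole tables at once. The following is a sketch of one possible approach using a merge on the column B. The names `prior`, `likelihood`, `merged` and `joint` are assumptions for illustration, and the tables are rebuilt from the expected values so that the cell runs on its own.

```python
import pandas as pd

# Expected tables from Steps 1 and 3, rebuilt here for a self-contained cell.
prior = pd.DataFrame([['1', 0.5], ['2', 0.5]], columns=['B', 'p'])
likelihood = pd.DataFrame(
    [
        ['1', 'vanilla', 0.75],
        ['1', 'chocolate', 0.25],
        ['2', 'vanilla', 0.50],
        ['2', 'chocolate', 0.50],
    ],
    columns=['B', 'C', 'p'],
)

# Join on B so that every row of P(C | B) sits next to its P(B).
# The suffixes keep the two 'p' columns apart after the merge.
merged = likelihood.merge(prior, on='B', suffixes=('_c_given_b', '_b'))

# Chain rule: P(C, B) = P(B) * P(C | B).
joint = merged[['B', 'C']].copy()
joint['p'] = merged['p_b'] * merged['p_c_given_b']
print(joint)
```

The merge-based version scales to larger tables without writing one function per cell, at the cost of being less explicit than the step-by-step exercise.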

Step 5

What is the conditional probability distribution $P(B \mid C)$?

Reminder - Bayes rule formula:

$$P(B \mid C) = \frac{P(C, B)}{P(C)}$$
# update the following functions to return the correct probability values
# use values already available in DataFrames C and C_B
def B_1_given_C_vanilla():
    # your code goes here
    # return P(B=1 | C=vanilla)
    raise NotImplementedError()

def B_2_given_C_vanilla():
    # your code goes here
    # return P(B=2 | C=vanilla)
    raise NotImplementedError()

def B_1_given_C_chocolate():
    # your code goes here
    # return P(B=1 | C=chocolate)
    raise NotImplementedError()

def B_2_given_C_chocolate():
    # your code goes here
    # return P(B=2 | C=chocolate)
    raise NotImplementedError()
# run this cell to print table
B_given_C = pd.DataFrame(
    [
        ['1', 'vanilla', B_1_given_C_vanilla()],
        ['1', 'chocolate', B_1_given_C_chocolate()],
        ['2', 'vanilla', B_2_given_C_vanilla()],
        ['2', 'chocolate', B_2_given_C_chocolate()]
    ],
    columns=['B', 'C', 'p']
)
B_given_C
   B          C         p
0  1    vanilla  0.600000
1  1  chocolate  0.333333
2  2    vanilla  0.400000
3  2  chocolate  0.666667
# raise exception if sum is not 1.0
assert_almost_equal(B_given_C.query('C == "vanilla"')['p'].sum(), 1.0)
assert_almost_equal(B_given_C.query('C == "chocolate"')['p'].sum(), 1.0)
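
Bayes rule can likewise be applied to the whole joint table in one pass. The sketch below is one possible approach: it recomputes the marginal P(C) by summing the joint over B and then divides row by row. The names `joint`, `evidence` and `posterior` are assumptions for illustration, and the joint table is rebuilt from the expected values of Step 4.

```python
import pandas as pd

# Expected joint table P(C, B) from Step 4, rebuilt for a self-contained cell.
joint = pd.DataFrame(
    [
        ['1', 'vanilla', 0.375],
        ['1', 'chocolate', 0.125],
        ['2', 'vanilla', 0.250],
        ['2', 'chocolate', 0.250],
    ],
    columns=['B', 'C', 'p'],
)

# Marginal P(C) for the cookie type of each row: sum the joint over B.
evidence = joint.groupby('C')['p'].transform('sum')

# Bayes rule: P(B | C) = P(C, B) / P(C).
posterior = joint[['B', 'C']].copy()
posterior['p'] = joint['p'] / evidence
print(posterior)
```

Dividing by the per-cookie sum guarantees that the probabilities for each value of C sum to 1, matching the two assertions above.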

Original Question

What is $P(B=1 \mid C=vanilla)$?

# Answer
P(B_given_C, 'B == "1" & C == "vanilla"')
# raise exception if answer is not 0.6
assert_almost_equal(P(B_given_C, 'B == "1" & C == "vanilla"'), 0.6)
# also check the remaining values of P(B | C) for correctness
assert_almost_equal(P(B_given_C, 'B == "1" & C == "chocolate"'), 0.3333333)
assert_almost_equal(P(B_given_C, 'B == "2" & C == "vanilla"'), 0.4)
assert_almost_equal(P(B_given_C, 'B == "2" & C == "chocolate"'), 0.6666666)
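
The final answer can also be verified independently of the DataFrames with exact rational arithmetic. This is a sketch under the assumptions of the problem statement (both bowls equally likely, cookies drawn uniformly at random); all variable names are chosen for illustration.

```python
from fractions import Fraction

# Prior and bowl contents from the problem statement.
prior_b1 = Fraction(1, 2)
prior_b2 = Fraction(1, 2)
vanilla_given_b1 = Fraction(30, 40)
vanilla_given_b2 = Fraction(20, 40)

# Bayes rule with the law of total probability in the denominator.
numerator = prior_b1 * vanilla_given_b1
denominator = numerator + prior_b2 * vanilla_given_b2
posterior_b1 = numerator / denominator

print(posterior_b1)         # 3/5
print(float(posterior_b1))  # 0.6
```

Working with fractions avoids floating point rounding, so the result 3/5 can be compared with `==` instead of assert_almost_equal.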

Summary and Outlook

You have learned how to work with probability tables and how to apply the chain rule (also known as the product rule) and Bayes rule. With the knowledge gained from this exercise, you can dive deeper into Bayesian statistics or learn about graphical models.

Literature

[DOW13] Allen B. Downey. Think Bayes. O'Reilly Media, 2013.

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g. images).

Machine Learning Fundamentals - Probability Theory - Exercise: Cookie Problem
by Christoph Jansen (deep.TEACHING - HTW Berlin)
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Christoph Jansen

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.