# Machine Learning Fundamentals - Probability Theory - Exercise: Cookie Problem

## Introduction

The Cookie Problem is a basic exercise in Bayesian statistics. The problem description text in this notebook is a direct quote from Allen B. Downey's book Think Bayes [DOW13].

This exercise requires working with probability tables. We implement these tables as Pandas DataFrames, which give us a nicely formatted representation and let us access columns by name. The necessary DataFrame operations are briefly introduced in this notebook.

To detect errors in your own code, execute the notebook cells containing assert or assert_almost_equal. These statements raise an exception until the calculated result is correct.
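
As a brief illustration of how these checks behave, the following sketch compares floating-point values with assert_almost_equal. The numbers are made up for illustration and are not part of the exercise. By default the function compares values to 7 decimal places.

```python
from numpy.testing import assert_almost_equal

# passes: 0.1 + 0.2 differs from 0.3 only by floating-point rounding error
assert_almost_equal(0.1 + 0.2, 0.3)

# fails: the values differ in the first decimal place,
# so an AssertionError is raised
try:
    assert_almost_equal(0.5, 0.6)
    check_failed = False
except AssertionError:
    check_failed = True

print(check_failed)
```

A passing check produces no output, so a silent cell means the result is within tolerance.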

### Required Knowledge

• Bayesian Statistics

• Chain Rule
• Bayes Rule
• (Optional) Pandas Python Module

### Required Python Modules

# External Modules
import pandas as pd
from numpy.testing import assert_almost_equal

"Suppose there are two bowls of cookies. Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of each.

Now suppose you choose one of the bowls at random and, without looking, select a cookie at random. The cookie is vanilla. What is the probability that it came from Bowl 1?" [DOW13]

We are looking for the probability value $P(B=1 \mid C=vanilla)$, where $B$ and $C$ are random variables for bowl and cookie. The following steps break down the problem and will guide you to a solution.

### Step 1

What is the marginal distribution $P(B)$ for the random variable $B$ (bowl)?

# update functions B_1 and B_2 to return the correct probability values for B=1 and B=2 respectively
def B_1():
    # your code goes here
    # return P(B=1)
    raise NotImplementedError()

def B_2():
    # your code goes here
    # return P(B=2)
    raise NotImplementedError()

# run this cell to print table
B = pd.DataFrame(
    [
        ['1', B_1()],
        ['2', B_2()]
    ],
    columns=['B', 'p'],
)
B

# raise exception if sum is not 1.0
assert_almost_equal(B['p'].sum(), 1.0)

### Step 2

What is the marginal distribution $P(C)$ for the random variable $C$ (cookie)?
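
One way to reason about this step, offered as a hint rather than the required approach, is the law of total probability: the marginal of $C$ is obtained by summing over all bowls. Here $b$ and $c$ are placeholder symbols for the possible values of $B$ and $C$:

$P(C=c) = \sum_{b} P(B=b) \cdot P(C=c \mid B=b)$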

# update functions C_vanilla and C_chocolate to return the correct probability values for C=vanilla and C=chocolate respectively
def C_vanilla():
    # your code goes here
    # return P(C=vanilla)
    raise NotImplementedError()

def C_chocolate():
    # your code goes here
    # return P(C=chocolate)
    raise NotImplementedError()

# run this cell to print table
C = pd.DataFrame(
    [
        ['vanilla', C_vanilla()],
        ['chocolate', C_chocolate()]
    ],
    columns=['C', 'p']
)
C

# raise exception if sum is not 1.0
assert_almost_equal(C['p'].sum(), 1.0)

### Step 3

What is the conditional probability distribution $P(C \mid B)$?

# update the following functions to return the correct probability values
def C_vanilla_given_B_1():
    # your code goes here
    # return P(C=vanilla | B=1)
    raise NotImplementedError()

def C_vanilla_given_B_2():
    # your code goes here
    # return P(C=vanilla | B=2)
    raise NotImplementedError()

def C_chocolate_given_B_1():
    # your code goes here
    # return P(C=chocolate | B=1)
    raise NotImplementedError()

def C_chocolate_given_B_2():
    # your code goes here
    # return P(C=chocolate | B=2)
    raise NotImplementedError()

# run this cell to print table
C_given_B = pd.DataFrame(
    [
        ['1', 'vanilla', C_vanilla_given_B_1()],
        ['1', 'chocolate', C_chocolate_given_B_1()],
        ['2', 'vanilla', C_vanilla_given_B_2()],
        ['2', 'chocolate', C_chocolate_given_B_2()]
    ],
    columns=['B', 'C', 'p'],
)
C_given_B

# raise exception if the probabilities for a given bowl do not sum to 1.0
assert_almost_equal(C_given_B.query('B == "1"')['p'].sum(), 1.0)
assert_almost_equal(C_given_B.query('B == "2"')['p'].sum(), 1.0)

Reminder - The following code shows how to access a single probability value in a Pandas DataFrame:

# query for rows where B=2 and C=vanilla
# a single row should be selected
response = C_given_B.query('B == "2" & C == "vanilla"')
response

# access probability value
response['p'].values[0]

# raise exception if the number of selected rows is not equal to 1
assert response.shape[0] == 1

# we define a convenient function to access a probability value in a DataFrame
def P(df, query):
    response = df.query(query)
    assert response.shape[0] == 1
    return response['p'].values[0]

P(C_given_B, 'B == "2" & C == "vanilla"')

### Step 4

What is the joint probability distribution $P(C, B)$?

Reminder - Chain rule formula:

$P(C, B) = P(B) \cdot P(C \mid B)$
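
To illustrate the chain rule on a whole table, here is a minimal sketch using a made-up weather and umbrella example. The variables W and U, their values, and all probabilities are hypothetical and unrelated to the cookie problem. It joins the two tables with merge, which is only one possible approach; the exercise itself asks for a single value per function.

```python
import pandas as pd

# hypothetical marginal distribution P(W) for the weather
W = pd.DataFrame(
    [
        ['sun', 0.7],
        ['rain', 0.3]
    ],
    columns=['W', 'p']
)

# hypothetical conditional distribution P(U | W) for carrying an umbrella
U_given_W = pd.DataFrame(
    [
        ['sun', 'yes', 0.1],
        ['sun', 'no', 0.9],
        ['rain', 'yes', 0.8],
        ['rain', 'no', 0.2]
    ],
    columns=['W', 'U', 'p']
)

# chain rule: P(U, W) = P(W) * P(U | W), computed row by row
U_W = U_given_W.merge(W, on='W', suffixes=('_cond', '_marg'))
U_W['p'] = U_W['p_marg'] * U_W['p_cond']
U_W = U_W[['W', 'U', 'p']]
print(U_W)
```

As with every joint distribution, the values in the resulting p column sum to 1.0.
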

# update the following functions to return the correct probability values
# use values already available in DataFrames B and C_given_B
def C_vanilla_B_1():
    # your code goes here
    # return P(C=vanilla, B=1)
    raise NotImplementedError()

def C_vanilla_B_2():
    # your code goes here
    # return P(C=vanilla, B=2)
    raise NotImplementedError()

def C_chocolate_B_1():
    # your code goes here
    # return P(C=chocolate, B=1)
    raise NotImplementedError()

def C_chocolate_B_2():
    # your code goes here
    # return P(C=chocolate, B=2)
    raise NotImplementedError()

# run this cell to print table
C_B = pd.DataFrame(
    [
        ['1', 'vanilla', C_vanilla_B_1()],
        ['1', 'chocolate', C_chocolate_B_1()],
        ['2', 'vanilla', C_vanilla_B_2()],
        ['2', 'chocolate', C_chocolate_B_2()]
    ],
    columns=['B', 'C', 'p']
)
C_B

# raise exception if sum is not 1.0
assert_almost_equal(C_B['p'].sum(), 1.0)

### Step 5

What is the conditional probability distribution $P(B \mid C)$?

Reminder - Bayes rule formula:

$P(B \mid C) = \frac{P(C, B)}{P(C)} = \frac{P(B) \cdot P(C \mid B)}{P(C)}$
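
The following sketch applies the same formula to a made-up weather and umbrella table. The variables W and U and all probabilities are hypothetical and unrelated to the cookie problem. It divides one joint probability by the matching marginal, which is obtained by summing the joint table.

```python
import pandas as pd

# hypothetical joint distribution P(U, W) for umbrella and weather
U_W = pd.DataFrame(
    [
        ['sun', 'yes', 0.07],
        ['sun', 'no', 0.63],
        ['rain', 'yes', 0.24],
        ['rain', 'no', 0.06]
    ],
    columns=['W', 'U', 'p']
)

# marginal P(U=yes): sum the joint probabilities over all values of W
p_U_yes = U_W.query('U == "yes"')['p'].sum()

# joint probability P(U=yes, W=rain)
p_U_yes_W_rain = U_W.query('U == "yes" & W == "rain"')['p'].values[0]

# P(W=rain | U=yes) = P(U=yes, W=rain) / P(U=yes)
p_W_rain_given_U_yes = p_U_yes_W_rain / p_U_yes
print(round(p_W_rain_given_U_yes, 4))
```

In this toy table, seeing an umbrella raises the probability of rain from the prior 0.3 to roughly 0.77.
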

# update the following functions to return the correct probability values
# use values already available in DataFrames C and C_B
def B_1_given_C_vanilla():
    # your code goes here
    # return P(B=1 | C=vanilla)
    raise NotImplementedError()

def B_2_given_C_vanilla():
    # your code goes here
    # return P(B=2 | C=vanilla)
    raise NotImplementedError()

def B_1_given_C_chocolate():
    # your code goes here
    # return P(B=1 | C=chocolate)
    raise NotImplementedError()

def B_2_given_C_chocolate():
    # your code goes here
    # return P(B=2 | C=chocolate)
    raise NotImplementedError()

# run this cell to print table
B_given_C = pd.DataFrame(
    [
        ['1', 'vanilla', B_1_given_C_vanilla()],
        ['1', 'chocolate', B_1_given_C_chocolate()],
        ['2', 'vanilla', B_2_given_C_vanilla()],
        ['2', 'chocolate', B_2_given_C_chocolate()]
    ],
    columns=['B', 'C', 'p']
)
B_given_C

# raise exception if the probabilities for a given cookie do not sum to 1.0
assert_almost_equal(B_given_C.query('C == "vanilla"')['p'].sum(), 1.0)
assert_almost_equal(B_given_C.query('C == "chocolate"')['p'].sum(), 1.0)

### Original Question

What is $P(B=1 \mid C=vanilla)$?

# Answer
P(B_given_C, 'B == "1" & C == "vanilla"')

# raise exception if answer is not 0.6
assert_almost_equal(P(B_given_C, 'B == "1" & C == "vanilla"'), 0.6)

# also check the remaining values of P(B | C) for correctness
assert_almost_equal(P(B_given_C, 'B == "1" & C == "chocolate"'), 0.3333333)
assert_almost_equal(P(B_given_C, 'B == "2" & C == "vanilla"'), 0.4)
assert_almost_equal(P(B_given_C, 'B == "2" & C == "chocolate"'), 0.6666666)

## Summary and Outlook

You have learned how to work with probability tables and how to apply the chain rule (also known as the product rule) and Bayes rule. With the knowledge gained from this exercise, you can dive deeper into Bayesian statistics or learn about graphical models.

## Literature

### Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does not, however, apply to any referenced external media (e.g. images).

Machine Learning Fundamentals - Probability Theory - Exercise: Cookie Problem
by Christoph Jansen (deep.TEACHING - HTW Berlin)