Machine Learning Fundamentals - Probability Theory - Bayesian Networks by Example

Introduction

Bayesian Networks are probabilistic models which describe interactions between random variables as a directed acyclic graph (DAG). A Bayesian Network can only be an approximation of the real world it tries to model. On the other hand, the model can encode useful assumptions about the problem domain, which enables us to reason about the problem and to reduce the computational complexity.

This notebook uses probability distributions of the "student" example [KOL09] as a minimal introduction to Bayesian Networks and how to represent them graphically. Bayesian Networks are themselves the foundation of sequence learning algorithms like Markov Models and Hidden Markov Models, which are used in Natural Language Processing (NLP) for tasks like language modeling and tagging, respectively.

In this notebook we use the Python package networkx as a generic way of representing a DAG and pandas to store probability tables assigned to graph nodes.

Requirements

Knowledge

The following Jupyter notebooks are preliminary material:

The fundamentals of Bayesian Networks are described in "Probabilistic Graphical Models - Principles and Techniques", Chapter 3 [KOL09].

As an additional resource, the online course "CS 188 - Artificial Intelligence" by D. Klein and P. Abbeel, University of California, Berkeley, USA, is recommended:

Python Modules

# External Modules
import pandas as pd
import networkx as nx
from numpy.testing import assert_almost_equal
from IPython.display import HTML
from deep_teaching_commons.graphs.visualization import draw, highlight_path

Student Example

Take a look at the following joint probability distribution with two variables $I$ and $S$:

$P(I, S)$

By applying the chain rule we can calculate $P(I, S)$ in two different ways.

1.

$P(I, S) = P(S) \cdot P(I \mid S)$

2.

$P(I, S) = P(I) \cdot P(S \mid I)$

It is not possible to decide for one way or the other just by looking at the formulas, because we have not yet given the variables a meaning.

In this example, variable $I$ describes the intelligence of a student, where $I$ is either $high$ or $normal$.

$Val(I) = \{high, normal\}$

Variable $S$ describes a person's score in an admission test for US colleges called the SAT, where the score value for $S$ is either $high$ or $low$.

$Val(S) = \{high, low\}$

We can reason about the given variables and conclude that a student with a $high$ IQ is more likely to have a $high$ SAT score. Therefore we assume that the SAT score $S$ depends on the intelligence $I$ and not the other way round. To frame it differently, $I$ causes $S$.

This assumption can be encoded in a Bayesian Network with two nodes, one for each variable, where an edge from node $I$ to node $S$ encodes the causality.

g = nx.DiGraph()
g.add_node('I', pos=(100, 0))
g.add_node('S', pos=(150, -70))
g.add_edge('I', 'S')
HTML(draw(g))

In the Bayesian Network above, the node $I$ represents the probability distribution $P(I)$ and node $S$ represents $P(S \mid I)$. The full joint distribution represented by the Bayesian Network is $P(I, S)$ and can be calculated using the chain rule.

$P(I, S) = P(I) \cdot P(S \mid I)$

We can define a marginal probability distribution $P(I)$ to express our belief that $30\%$ of all students have a high IQ. DataFrames from the pandas package can be used to create the table.

I = pd.DataFrame(
    [
        ['high', 0.3],
        ['normal', 0.7]
    ],
    columns=['I', 'p']
)
I
I p
0 high 0.3
1 normal 0.7

Let's also define a conditional distribution for $P(S \mid I)$.

S_given_I = pd.DataFrame(
    [
        ['high', 'high', 0.8],
        ['high', 'low', 0.2],
        ['normal', 'high', 0.05],
        ['normal', 'low', 0.95]
    ],
    columns=['I', 'S', 'p']
)
S_given_I
I S p
0 high high 0.80
1 high low 0.20
2 normal high 0.05
3 normal low 0.95

For example, given the table, a student with a $normal$ IQ only has a $5\%$ chance to have a $high$ SAT score.

With networkx, we can associate the given probability tables with their corresponding graph nodes. We use a node attribute named df to store the DataFrame reference.

g.nodes['I']['df'] = I
g.nodes['S']['df'] = S_given_I
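Before wiring more tables into the graph, a quick sanity check is useful: in a conditional table, the probabilities must sum to 1 for every assignment of the parent variable. A minimal sketch with pandas `groupby`, repeating the table from above so the cell is self-contained:

```python
import pandas as pd
from numpy.testing import assert_almost_equal

S_given_I = pd.DataFrame(
    [
        ['high', 'high', 0.8],
        ['high', 'low', 0.2],
        ['normal', 'high', 0.05],
        ['normal', 'low', 0.95]
    ],
    columns=['I', 'S', 'p']
)

# for each value of the parent I, the probabilities over S must sum to 1
for parent_value, p_sum in S_given_I.groupby('I')['p'].sum().items():
    assert_almost_equal(p_sum, 1.0)
```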

Individual values of a node probability table can be accessed as follows.

def P(nx_graph, node_name, query):
    """Look up a single probability value from the table stored at a graph node."""
    node = nx_graph.nodes[node_name]
    df = node['df']
    response = df.query(query)
    # the query must match exactly one row of the probability table
    assert response.shape[0] == 1
    return response['p'].values[0]
P(g, 'S', 'I == "normal" & S == "high"')

We can now calculate the joint distribution $P(I, S)$ represented by the Bayesian Network.

$P(I, S) = P(I) \cdot P(S \mid I)$
I_S = pd.DataFrame(
    [
        ['high', 'high', P(g, 'I', 'I == "high"') * P(g, 'S', 'I == "high" & S == "high"')],
        ['high', 'low', P(g, 'I', 'I == "high"') * P(g, 'S', 'I == "high" & S == "low"')],
        ['normal', 'high', P(g, 'I', 'I == "normal"') * P(g, 'S', 'I == "normal" & S == "high"')],
        ['normal', 'low', P(g, 'I', 'I == "normal"') * P(g, 'S', 'I == "normal" & S == "low"')]
    ],
    columns=['I', 'S', 'p']
)
I_S
I S p
0 high high 0.240
1 high low 0.060
2 normal high 0.035
3 normal low 0.665
# verify that probabilities sum to 1.0
assert_almost_equal(I_S['p'].sum(), 1.0)

Example: Common Cause

Besides $I$ and $S$, a third variable $G$ is introduced to the model. $G$ represents the probabilities of three different grades $good$, $mediocre$, and $bad$.

$Val(G) = \{good, mediocre, bad\}$

The joint distribution of the model is $P(I, S, G)$.

We could apply the chain rule as follows.

$P(I, S, G) = P(I) \cdot P(S \mid I) \cdot P(G \mid I, S)$

We can extend the existing Bayesian Network to include the additional node and edges.

g.add_node('G', pos=(50, -70))
g.add_edge('I', 'G')
g.add_edge('S', 'G')
HTML(draw(g))

For our given problem domain, we can assume that the SAT score does not influence the grade of a person. They might be correlated in the sense that a person with a $good$ grade is more likely to score $high$ in the SAT test, but this correlation only stems from a common cause, the intelligence.

We can therefore assume independence of $S$ and $G$ given $I$.

$S \perp G \mid I$

With this independence, $P(G \mid I, S) = P(G \mid I)$, so the factorization of $P(I, S, G)$ becomes less complex.

$P(I, S, G) = P(I) \cdot P(S \mid I) \cdot P(G \mid I)$

The Bayesian Network does not include the edge from $S$ to $G$ anymore.

g.remove_edge('S', 'G')
HTML(draw(g))

We can define a probability table for $P(G \mid I)$ and assign the tables to the Bayesian Network nodes.

G_given_I = pd.DataFrame(
    [
        ['high', 'good', 0.74],
        ['high', 'mediocre', 0.17],
        ['high', 'bad', 0.09],
        ['normal', 'good', 0.2],
        ['normal', 'mediocre', 0.34],
        ['normal', 'bad', 0.46]
    ],
    columns=['I', 'G', 'p']
)
G_given_I
I G p
0 high good 0.74
1 high mediocre 0.17
2 high bad 0.09
3 normal good 0.20
4 normal mediocre 0.34
5 normal bad 0.46
g.nodes['G']['df'] = G_given_I

Again, it is possible to calculate the joint distribution $P(I, S, G)$ from the given Bayesian Network.

rows = []

for v_i in ['high', 'normal']:
    for v_s in ['high', 'low']:
        for v_g in ['good', 'mediocre', 'bad']:
            rows.append([
                v_i, v_s, v_g,
                P(g, 'I', f'I == "{v_i}"') * P(g, 'S', f'I == "{v_i}" & S == "{v_s}"') * P(g, 'G', f'I == "{v_i}" & G == "{v_g}"')
            ])
            
I_S_G = pd.DataFrame(
    rows,
    columns=['I', 'S', 'G', 'p']
)
I_S_G
I S G p
0 high high good 0.1776
1 high high mediocre 0.0408
2 high high bad 0.0216
3 high low good 0.0444
4 high low mediocre 0.0102
5 high low bad 0.0054
6 normal high good 0.0070
7 normal high mediocre 0.0119
8 normal high bad 0.0161
9 normal low good 0.1330
10 normal low mediocre 0.2261
11 normal low bad 0.3059
# verify that probabilities sum to 1.0
assert_almost_equal(I_S_G['p'].sum(), 1.0)
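We can also verify the assumed independence $S \perp G \mid I$ numerically: it holds exactly when $P(I, S, G) = P(I, S) \cdot P(I, G) \,/\, P(I)$ for every assignment. A minimal check (the joint table from above is repeated here so the cell is self-contained):

```python
import pandas as pd
from numpy.testing import assert_almost_equal

# joint table P(I, S, G) as computed above
I_S_G = pd.DataFrame(
    [
        ['high', 'high', 'good', 0.1776], ['high', 'high', 'mediocre', 0.0408],
        ['high', 'high', 'bad', 0.0216], ['high', 'low', 'good', 0.0444],
        ['high', 'low', 'mediocre', 0.0102], ['high', 'low', 'bad', 0.0054],
        ['normal', 'high', 'good', 0.0070], ['normal', 'high', 'mediocre', 0.0119],
        ['normal', 'high', 'bad', 0.0161], ['normal', 'low', 'good', 0.1330],
        ['normal', 'low', 'mediocre', 0.2261], ['normal', 'low', 'bad', 0.3059]
    ],
    columns=['I', 'S', 'G', 'p']
)

p_I = I_S_G.groupby('I')['p'].sum()            # marginal P(I)
p_IS = I_S_G.groupby(['I', 'S'])['p'].sum()    # marginal P(I, S)
p_IG = I_S_G.groupby(['I', 'G'])['p'].sum()    # marginal P(I, G)

# S ⊥ G | I  ⇔  P(I, S, G) = P(I, S) * P(I, G) / P(I) for all assignments
for _, row in I_S_G.iterrows():
    expected = p_IS[(row['I'], row['S'])] * p_IG[(row['I'], row['G'])] / p_I[row['I']]
    assert_almost_equal(row['p'], expected)
```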

Example: Indirect Causal Effect

The variable $L$ describes the probability that a recommendation letter is issued for the student.

$Val(L) = \{issued, denied\}$

The probability of a letter being issued directly depends on the grade of a student and indirectly depends on the intelligence.

$P(I, G, L, S) = P(I) \cdot P(G \mid I) \cdot P(L \mid G) \cdot P(S \mid I)$

As can be seen in the formula for $P(I, G, L, S)$, we have created an indirect causal effect: $L$ depends on $G$ and $G$ depends on $I$. The graphical representation is shown below. The path $I \rightarrow G \rightarrow L$ is highlighted in green.

g.add_node('L', pos=(50, -140))
g.add_edge('G', 'L')
highlight_path(g, ['I', 'G', 'L'], reset=True)
HTML(draw(g))
L_given_G = pd.DataFrame(
    [
        ['good', 'issued', 0.9],
        ['good', 'denied', 0.1],
        ['mediocre', 'issued', 0.6],
        ['mediocre', 'denied', 0.4],
        ['bad', 'issued', 0.01],
        ['bad', 'denied', 0.99]
    ],
    columns=['G', 'L', 'p']
)
L_given_G
G L p
0 good issued 0.90
1 good denied 0.10
2 mediocre issued 0.60
3 mediocre denied 0.40
4 bad issued 0.01
5 bad denied 0.99
g.nodes['L']['df'] = L_given_G

Example: Common Effect

The variable $D$ describes the probability that it was difficult to get a certain grade.

$Val(D) = \{hard, easy\}$

The probability of getting a $good$, $mediocre$ or $bad$ grade now depends on both intelligence and difficulty.

$P(D, I, G, L, S) = P(I) \cdot P(D) \cdot P(G \mid I, D) \cdot P(S \mid I) \cdot P(L \mid G)$

As can be seen in the factor $P(G \mid I, D)$, we have created a common effect: $G$ depends on both $I$ and $D$. The graphical representation is shown below. The path $D \rightarrow G \leftarrow I$ is highlighted in green.

g.add_node('D', pos=(0, 0))
g.add_edge('D', 'G')
highlight_path(g, ['I', 'G', 'D'], reset=True)
HTML(draw(g))
D = pd.DataFrame(
    [
        ['hard', 0.4],
        ['easy', 0.6]
    ],
    columns=['D', 'p']
)
g.nodes['D']['df'] = D

Due to these changes, the probability distribution assigned to node $G$ needs to be updated.

G_given_I_D = pd.DataFrame(
    [
        ['high', 'hard', 'good', 0.5],
        ['high', 'hard', 'mediocre', 0.3],
        ['high', 'hard', 'bad', 0.2],
        ['high', 'easy', 'good', 0.9],
        ['high', 'easy', 'mediocre', 0.08],
        ['high', 'easy', 'bad', 0.02],
        ['normal', 'hard', 'good', 0.05],
        ['normal', 'hard', 'mediocre', 0.25],
        ['normal', 'hard', 'bad', 0.7],
        ['normal', 'easy', 'good', 0.3],
        ['normal', 'easy', 'mediocre', 0.4],
        ['normal', 'easy', 'bad', 0.3]
    ],
    columns=['I', 'D', 'G', 'p']
)
G_given_I_D
I D G p
0 high hard good 0.50
1 high hard mediocre 0.30
2 high hard bad 0.20
3 high easy good 0.90
4 high easy mediocre 0.08
5 high easy bad 0.02
6 normal hard good 0.05
7 normal hard mediocre 0.25
8 normal hard bad 0.70
9 normal easy good 0.30
10 normal easy mediocre 0.40
11 normal easy bad 0.30
g.nodes['G']['df'] = G_given_I_D

Theory

As has been shown with the student example, every node in the graph depends on its predecessors (its parents in the DAG). The full joint distribution of an arbitrary Bayesian Network is defined as follows.

$P(X_1, ..., X_n) = \prod_{i=1}^n P(X_i \mid predecessors(X_i))$
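This factorization can be implemented generically for any graph that follows the conventions of this notebook (a `df` attribute with a `p` column per node). The function name `joint_probability` is our own; this is a sketch, not part of the original example:

```python
import pandas as pd
import networkx as nx

def joint_probability(nx_graph, assignment):
    """P(X_1=x_1, ..., X_n=x_n) as the product over all node tables.

    assignment: dict mapping every node name to one of its values.
    """
    result = 1.0
    for node_name in nx_graph.nodes:
        df = nx_graph.nodes[node_name]['df']
        # select the row matching this node's value and its predecessors' values
        relevant = [node_name] + list(nx_graph.predecessors(node_name))
        query = ' & '.join(f'{v} == "{assignment[v]}"' for v in relevant)
        result *= df.query(query)['p'].values[0]
    return result

# example: rebuild the two-node network I -> S from the beginning of the notebook
I = pd.DataFrame([['high', 0.3], ['normal', 0.7]], columns=['I', 'p'])
S_given_I = pd.DataFrame(
    [['high', 'high', 0.8], ['high', 'low', 0.2],
     ['normal', 'high', 0.05], ['normal', 'low', 0.95]],
    columns=['I', 'S', 'p']
)
g2 = nx.DiGraph()
g2.add_node('I', df=I)
g2.add_node('S', df=S_given_I)
g2.add_edge('I', 'S')

joint_probability(g2, {'I': 'high', 'S': 'high'})  # 0.3 * 0.8 = 0.24
```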

Observed Variables

In the student example, we are using a predetermined probability distribution over multiple random variables. In a real-world scenario, we would need to determine these values on our own. For example, if we have all the data (intelligence, grade, etc.) of a population of 1000 students, we can use this data to calculate marginal and conditional probability distributions to construct a Bayesian Network.

Let's now think of another student, who was not present in the training data. If we want to know how likely it is that this other student has a $high$ intelligence, we can calculate the value from our existing Bayesian Network. If we don't know anything about the student, all random variables in the Bayesian Network are hidden (unobserved), such that all possible outcomes have to be taken into account in the calculation.

Now consider that we can collect evidential data about this mysterious student, like the grade and SAT score. In this case the random variables $G$ and $S$ become observed, and instead of taking all possibilities into account, we use a fixed value (e.g. $G = mediocre$ and $S = high$) for the observed variables.

As an example, consider the variables $I$ (intelligence) and $S$ (SAT score). We might not be able to measure the intelligence of a student with a sensor, which means that we cannot observe it. But if we observe a student's SAT score, we can infer whether the student is likely to be intelligent or not. The observed evidence will give us a more accurate probability of the student having a $high$ intelligence.

tmp = nx.DiGraph()
tmp.add_node('I')
tmp.add_node('S', observed=True)
tmp.add_edge('I', 'S')
HTML(draw(tmp))

As shown above, observed variables are depicted as nodes filled with a background color.

A variable's property of being observed or not influences the independence of other variables in the Bayesian Network and therefore changes the way certain probability values are calculated. The examples in the following sections provide detailed information about different scenarios where a single random variable in a small Bayesian Network is observed. In general, however, we can collect as much evidence as we want and the same rules still apply. It is even possible to not collect any evidence at all or to observe every single random variable in the Bayesian Network.

Trails

A trail is a path between two nodes, which may span other nodes connected by edges. The direction of the edges does not matter for a trail to exist.

Consider the trail $S \leftarrow I \rightarrow G \rightarrow L$ in the student example graph. The direction of the edges is not consistent, as shown below.

highlight_path(g, ['S', 'I', 'G', 'L'], reset=True)
HTML(draw(g))

A trail between two variables (e.g. $L$ and $S$) may either be active or blocked. If a trail is active, both variables are dependent. They are independent if the trail is blocked.

The property of a trail being active or blocked depends on certain variables in the Bayesian Network being observed or not.

In order to determine if a trail of arbitrary length is active or not, we need an algorithm called D-Separation. This algorithm will be discussed in the next notebook Exercise: D-Separation. The foundation of the D-Separation algorithm, however, consists of four basic cases called indirect causal effect, indirect evidential effect, common cause and common effect. These basic cases can be explained with triples of nodes, as shown in the sections below.
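The four basic cases can be sketched as a small helper that decides whether a single triple is active, given a set of observed variables; the full algorithm for longer trails follows in the next notebook. The function name is our own, and note that the common-effect rule also checks the descendants of the middle node:

```python
import networkx as nx

def is_triple_active(g, x, m, y, observed):
    """Return True if the trail x - m - y is active given the observed set."""
    observed = set(observed)
    if g.has_edge(x, m) and g.has_edge(m, y):    # causal: x -> m -> y
        return m not in observed
    if g.has_edge(m, x) and g.has_edge(y, m):    # evidential: x <- m <- y
        return m not in observed
    if g.has_edge(m, x) and g.has_edge(m, y):    # common cause: x <- m -> y
        return m not in observed
    if g.has_edge(x, m) and g.has_edge(y, m):    # common effect: x -> m <- y
        # active if m itself or any descendant of m is observed
        return m in observed or bool(nx.descendants(g, m) & observed)
    return False

# student example graph
g = nx.DiGraph([('I', 'S'), ('I', 'G'), ('G', 'L'), ('D', 'G')])

is_triple_active(g, 'I', 'G', 'L', observed=set())   # True  (causal chain, G hidden)
is_triple_active(g, 'I', 'G', 'L', observed={'G'})   # False (blocked by observing G)
is_triple_active(g, 'D', 'G', 'I', observed=set())   # False (common effect, G hidden)
is_triple_active(g, 'D', 'G', 'I', observed={'L'})   # True  (descendant of G observed)
```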

Inference

Inference: Indirect Causal Effect

Consider the trail $I \rightarrow G \rightarrow L$ as shown in the Bayesian Network below.

tmp = nx.DiGraph()
tmp.add_node('I', df=I)
tmp.add_node('G', df=G_given_I)
tmp.add_node('L', df=L_given_G)
tmp.add_edge('I', 'G')
tmp.add_edge('G', 'L')
HTML(draw(tmp))

To calculate $P(L)$, it is required to include the variables $I$ and $G$ in the calculation, because $I$ indirectly affects $L$ via $G$. First we calculate the joint distribution $P(I, G, L)$ and then sum over the hidden variables $I$ and $G$.

$\begin{aligned} P(L) &= \sum_{I}\sum_{G} P(I) \cdot P(G \mid I) \cdot P(L \mid G) \\ &= \sum_{I}\sum_{G} P(I, G, L) \end{aligned}$

In a more verbose notation this can also be understood as

$P(L) = \sum_{i \in Val(I)} \sum_{g \in Val(G)} P(i, g, L).$
rows = []

for v_i in ['high', 'normal']:
    for v_g in ['good', 'mediocre', 'bad']:
        for v_l in ['issued', 'denied']:
            rows.append([
                v_i, v_g, v_l,
                P(tmp, 'I', f'I == "{v_i}"') * P(tmp, 'G', f'I == "{v_i}" & G == "{v_g}"') * P(tmp, 'L', f'G == "{v_g}" & L == "{v_l}"')
            ])
            
I_G_L = pd.DataFrame(
    rows,
    columns=['I', 'G', 'L', 'p']
)
I_G_L
I G L p
0 high good issued 0.19980
1 high good denied 0.02220
2 high mediocre issued 0.03060
3 high mediocre denied 0.02040
4 high bad issued 0.00027
5 high bad denied 0.02673
6 normal good issued 0.12600
7 normal good denied 0.01400
8 normal mediocre issued 0.14280
9 normal mediocre denied 0.09520
10 normal bad issued 0.00322
11 normal bad denied 0.31878
L = pd.DataFrame(
    [
        ['issued', I_G_L.query('L == "issued"')['p'].sum()],
        ['denied', I_G_L.query('L == "denied"')['p'].sum()]
    ],
    columns=['L', 'p']
)
L
L p
0 issued 0.50269
1 denied 0.49731
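The two `query(...).sum()` calls above can also be written as a single `groupby` over the remaining variable, which scales better when a variable has more than two values. A sketch, repeating the joint table so the cell is self-contained; the name `L_via_groupby` is our own:

```python
import pandas as pd

# joint table P(I, G, L) as computed above
I_G_L = pd.DataFrame(
    [
        ['high', 'good', 'issued', 0.19980], ['high', 'good', 'denied', 0.02220],
        ['high', 'mediocre', 'issued', 0.03060], ['high', 'mediocre', 'denied', 0.02040],
        ['high', 'bad', 'issued', 0.00027], ['high', 'bad', 'denied', 0.02673],
        ['normal', 'good', 'issued', 0.12600], ['normal', 'good', 'denied', 0.01400],
        ['normal', 'mediocre', 'issued', 0.14280], ['normal', 'mediocre', 'denied', 0.09520],
        ['normal', 'bad', 'issued', 0.00322], ['normal', 'bad', 'denied', 0.31878]
    ],
    columns=['I', 'G', 'L', 'p']
)

# marginalize out I and G by summing p for each value of L
L_via_groupby = I_G_L.groupby('L', as_index=False)['p'].sum()
L_via_groupby  # one row per value of L; matches the table computed above
```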

In a hypothetical scenario where we just know the intelligence of a student, $I$ becomes an observed variable.

We can set a fixed value for $I$ and use this evidence to infer $P(L \mid i),\ i \in Val(I)$.

$P(L \mid i) = \sum_{G} P(G \mid i) \cdot P(L \mid G),\ i \in Val(I)$
tmp.nodes['I']['observed'] = True
HTML(draw(tmp))

Let's assume we observe $I = high$.

rows = []

for v_l in ['issued', 'denied']:
    for v_g in ['good', 'mediocre', 'bad']:
        # L (v_l) and G (v_g) are unobserved variables, while I == high is fixed
        rows.append([
            v_l, v_g,
            P(tmp, 'G', f'I == "high" & G == "{v_g}"') * P(tmp, 'L', f'G == "{v_g}" & L == "{v_l}"')
        ])
            
L_G_I_high = pd.DataFrame(
    rows,
    columns=['L', 'G', 'p']
)
L_G_I_high
L G p
0 issued good 0.6660
1 issued mediocre 0.1020
2 issued bad 0.0009
3 denied good 0.0740
4 denied mediocre 0.0680
5 denied bad 0.0891
L_I_high = pd.DataFrame(
    [
        ['issued', L_G_I_high.query('L == "issued"')['p'].sum()],
        ['denied', L_G_I_high.query('L == "denied"')['p'].sum()]
    ],
    columns=['L', 'p']
)
L_I_high
L p
0 issued 0.7689
1 denied 0.2311

Let's now assume a similar scenario, but this time only the second variable $G$ in the trail $I \rightarrow G \rightarrow L$ is observed, as shown in the Bayesian Network below.

tmp.nodes['I']['observed'] = False
tmp.nodes['G']['observed'] = True
HTML(draw(tmp))

If we observe that the grade $G$ of a student is $good$, we can take this evidence to answer a query for $P(L \mid g),\ g \in Val(G)$, without using $I$ in the calculation. In this case $L$ is independent of $I$ given $G$. The trail between $I$ and $L$ is blocked by the observation of $G$.

$L \perp I \mid g,\ g \in Val(G)$

The intelligence $I$ does not provide any additional information in this situation. $P(L \mid G = good)$ is just a table lookup.

tmp.nodes['L']['df'].query('G == "good"')
G L p
0 good issued 0.9
1 good denied 0.1

Inference: Indirect Evidential Effect

Consider the trail $L \leftarrow G \leftarrow I$, which is equivalent to the trail $I \rightarrow G \rightarrow L$ used in the last example, but this time we would like to answer a query about $I$. Let's assume $L$ is observed and we want to calculate $P(I \mid l),\ l \in Val(L)$.

tmp.nodes['G']['observed'] = False
tmp.nodes['L']['observed'] = True
HTML(draw(tmp))
$P(I \mid l) = \frac{1}{Z} \cdot \sum_{G} P(I) \cdot P(G \mid I) \cdot P(l \mid G),\ l \in Val(L)$

Normalization by a factor $\frac{1}{Z}$ is required because without it the values in the table for $P(I \mid l)$ would not sum to $1$.

To gain some intuition about the normalization factor, we can look at the joint distribution of our small Bayesian Network.

rows = []

for v_i in ['high', 'normal']:
    for v_g in ['good', 'mediocre', 'bad']:
        for v_l in ['issued', 'denied']:
            rows.append([
                v_i, v_g, v_l,
                P(tmp, 'I', f'I == "{v_i}"') * P(tmp, 'G', f'I == "{v_i}" & G == "{v_g}"') * P(tmp, 'L', f'G == "{v_g}" & L == "{v_l}"')
            ])
            
I_G_L = pd.DataFrame(
    rows,
    columns=['I', 'G', 'L', 'p']
)
I_G_L
I G L p
0 high good issued 0.19980
1 high good denied 0.02220
2 high mediocre issued 0.03060
3 high mediocre denied 0.02040
4 high bad issued 0.00027
5 high bad denied 0.02673
6 normal good issued 0.12600
7 normal good denied 0.01400
8 normal mediocre issued 0.14280
9 normal mediocre denied 0.09520
10 normal bad issued 0.00322
11 normal bad denied 0.31878
# probabilities in a full joint distribution sum to 1
assert_almost_equal(I_G_L['p'].sum(), 1.0)

If $L$ is observed to be $issued$, we only account for the subset of the joint distribution where $L = issued$.

I_G_L_issued = I_G_L.query('L == "issued"')
I_G_L_issued
I G L p
0 high good issued 0.19980
2 high mediocre issued 0.03060
4 high bad issued 0.00027
6 normal good issued 0.12600
8 normal mediocre issued 0.14280
10 normal bad issued 0.00322
# the sum of the probabilities does not equal 1.0
I_G_L_issued['p'].sum()

To calculate $P(I \mid L = issued)$ we sum over $G$.

I_L_issued = pd.DataFrame(
    [
        ['high', I_G_L_issued.query('I == "high"')['p'].sum()],
        ['normal', I_G_L_issued.query('I == "normal"')['p'].sum()]
    ],
    columns=['I', 'p']
)

I_L_issued
I p
0 high 0.23067
1 normal 0.27202

Since the probabilities in the table I_G_L_issued did not sum to one, the new table I_L_issued does not sum to one either.

I_L_issued['p'].sum()

To get proper probabilities back, we can normalize the resulting values by their sum $Z$.

# normalization
Z = I_L_issued['p'].sum()
I_L_issued['p'] = I_L_issued['p'] / Z
I_L_issued
I p
0 high 0.458871
1 normal 0.541129
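Since this normalize-by-the-sum step recurs in every query with evidence, it can be factored into a small helper. A sketch (the function name `normalize` is our own):

```python
import pandas as pd

def normalize(df):
    """Return a copy of df with the 'p' column rescaled to sum to 1."""
    out = df.copy()
    out['p'] = out['p'] / out['p'].sum()
    return out

# unnormalized values from the query above
I_L_issued = pd.DataFrame(
    [['high', 0.23067], ['normal', 0.27202]],
    columns=['I', 'p']
)
normalize(I_L_issued)  # 0.458871 / 0.541129, as in the table above
```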

Now, if $G$ is observed instead of $L$, the path between $I$ and $L$ is blocked.

$I \perp L \mid g,\ g \in Val(G)$
tmp.nodes['G']['observed'] = True
tmp.nodes['L']['observed'] = False
HTML(draw(tmp))
$P(I \mid g) = \frac{1}{Z} \cdot P(I) \cdot P(g \mid I),\ g \in Val(G)$

For example, we can calculate $P(I \mid G = good)$.

rows = []

for v_i in ['high', 'normal']:
    rows.append([
        v_i,
        P(tmp, 'I', f'I == "{v_i}"') * P(tmp, 'G', f'I == "{v_i}" & G == "good"')
    ])
            
I_G_good = pd.DataFrame(
    rows,
    columns=['I', 'p']
)

# normalization
Z = I_G_good['p'].sum()
I_G_good['p'] = I_G_good['p'] / Z

I_G_good
I p
0 high 0.61326
1 normal 0.38674

Inference: Common Cause

In the student example graph a common cause is given by the structure $G \leftarrow I \rightarrow S$, as shown below. In this example $G$ and $S$ both depend on $I$.

tmp = nx.DiGraph()
tmp.add_node('I', df=I, pos=(100, 0))
tmp.add_node('S', df=S_given_I, pos=(150, -70))
tmp.add_node('G', df=G_given_I, pos=(50, -70))
tmp.add_edge('I', 'S')
tmp.add_edge('I', 'G')
HTML(draw(tmp))

If $G$ is observed, we can use this evidence to calculate $P(S \mid g),\ g \in Val(G)$. We sum over $I$ to remove the hidden variable.

$P(S \mid g) = \frac{1}{Z} \cdot \sum_{I} P(I) \cdot P(g \mid I) \cdot P(S \mid I),\ g \in Val(G)$
tmp.nodes['G']['observed'] = True
HTML(draw(tmp))

For $P(S \mid G = good)$, the following code shows the exact calculation.

rows = []

for v_i in ['high', 'normal']:
    for v_s in ['high', 'low']:
        rows.append([
            v_i, v_s,
            P(tmp, 'I', f'I == "{v_i}"') * P(tmp, 'G', f'I == "{v_i}" & G == "good"') * P(tmp, 'S', f'I == "{v_i}" & S == "{v_s}"')
        ])
            
G_good_I_S = pd.DataFrame(
    rows,
    columns=['I', 'S', 'p']
)

G_good_I_S
I S p
0 high high 0.1776
1 high low 0.0444
2 normal high 0.0070
3 normal low 0.1330
S_G_good = pd.DataFrame(
    [
        ['high', G_good_I_S.query('S == "high"')['p'].sum()],
        ['low', G_good_I_S.query('S == "low"')['p'].sum()]
    ],
    columns=['S', 'p']
)

# normalization
Z = S_G_good['p'].sum()
S_G_good['p'] = S_G_good['p'] / Z

S_G_good
S p
0 high 0.509945
1 low 0.490055

In the case that $I$ is observed, the path between $G$ and $S$ is blocked.

tmp.nodes['G']['observed'] = False
tmp.nodes['I']['observed'] = True
HTML(draw(tmp))

$P(S \mid I = high)$ is just a table lookup.

tmp.nodes['S']['df'].query('I == "high"')
I S p
0 high high 0.8
1 high low 0.2

Inference: Common Effect

In the student example graph, the trail $D \rightarrow G \leftarrow I$ describes a common effect.

tmp = nx.DiGraph()
tmp.add_node('D', df=D, pos=(0, 0))
tmp.add_node('I', df=I, pos=(100, 0))
tmp.add_node('G', df=G_given_I_D, pos=(50, -70))
tmp.add_edge('D', 'G')
tmp.add_edge('I', 'G')
HTML(draw(tmp))

In this case the trail from $D$ to $I$ is blocked if $G$ is unobserved (hidden). This means that observing $D$ does not give us any more information about $I$, as long as $G$ is unobserved.

On the other hand, if $G$ is observed, we have to include $D$ in the calculation of $P(I \mid g),\ g \in Val(G)$.

$P(I \mid g) = \frac{1}{Z} \cdot \sum_{D} P(I) \cdot P(D) \cdot P(g \mid I, D)$
tmp.nodes['G']['observed'] = True
HTML(draw(tmp))

The calculation of $P(I \mid G = good)$ is shown below.

rows = []

for v_i in ['high', 'normal']:
    for v_d in ['hard', 'easy']:
        rows.append([
            v_i, v_d,
            P(tmp, 'I', f'I == "{v_i}"') * P(tmp, 'D', f'D == "{v_d}"') * P(tmp, 'G', f'G == "good" & I == "{v_i}" & D == "{v_d}"')
        ])
            
I_D_G_good = pd.DataFrame(
    rows,
    columns=['I', 'D', 'p']
)

I_D_G_good
I D p
0 high hard 0.060
1 high easy 0.162
2 normal hard 0.014
3 normal easy 0.126
I_G_good = pd.DataFrame(
    [
        ['high', I_D_G_good.query('I == "high"')['p'].sum()],
        ['normal', I_D_G_good.query('I == "normal"')['p'].sum()]
    ],
    columns=['I', 'p']
)

# normalization
Z = I_G_good['p'].sum()
I_G_good['p'] = I_G_good['p'] / Z

I_G_good
I p
0 high 0.61326
1 normal 0.38674

The trail $D \rightarrow G \leftarrow I$ is also active if some other node depending on $G$ is observed, as shown in the graph below.

tmp.nodes['G']['observed'] = False
tmp.add_node('L', df=L_given_G, observed=True, pos=(50,  -140))
tmp.add_edge('G', 'L')
HTML(draw(tmp))

Summary and Outlook

This was an introduction to the basic mechanics of Bayesian Networks. The student example has been used to demonstrate the way probability distributions are encoded in such a network and how to run inference queries when observed variables are involved.

We recommend learning about sampling in Bayesian Networks and D-Separation (see notebook Exercise: D-Separation), as well as special forms like Naive Bayes, Markov Models and Hidden Markov Models. Hidden Markov Models are especially useful for sequential applications like Named Entity Recognition tagging of sentences in the Natural Language Processing domain.

Literature

[KOL09] Daphne Koller, Nir Friedman: Probabilistic Graphical Models - Principles and Techniques. MIT Press, 2009.

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Machine Learning Fundamentals - Probability Theory - Bayesian Networks by Example
by Christoph Jansen (deep.TEACHING - HTW Berlin)
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Author

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.