This notebook introduces Markov Models as a way to represent Bi-Gram language models. It is advised, but not strictly required, to learn about Bayesian Networks first (see Bayesian Networks by Example). Studying Bayesian Networks provides you with a basic understanding of probability theory and graphical models, which is useful in this context.

The comprehensive exercises in this notebook will guide you through the steps of implementing a Bi-Gram language model. Due to the general data sparsity in the Natural Language Processing (NLP) domain, the implementation requires various adaptations concerning numerical stability, which will be addressed accordingly.

As a data source, the GermEval2014 text corpus is used.

Required Python Modules

# Python Standard Library
from collections import Counter

# External Modules
import networkx as nx
import numpy as np
from numpy.testing import assert_almost_equal
from IPython.display import HTML
from deep_teaching_commons.graphs.visualization import draw
from deep_teaching_commons.data.text.germ_eval_2014 import GermEval2014  # module path assumed; adjust to your deep_teaching_commons version

Required Data

# data manager for Germ Eval 2014 corpus
# default base_data_dir is ~/deep.TEACHING/data
base_data_dir = None
dm = GermEval2014(base_data_dir=base_data_dir)
auto download is active, attempting download
data directory already exists, no download required
# only terms, no NER tags, are required in this exercise
def terms(sequences):
    for sequence in sequences:
        yield [term for term, _, _ in sequence]
# two sentences
for i, s in enumerate(terms(dm.train_sequences())):
    if i == 2: break
    print(s)
['Schartau', 'sagte', 'dem', '"', 'Tagesspiegel', '"', 'vom', 'Freitag', ',', 'Fischer', 'sei', '"', 'in', 'einer', 'Weise', 'aufgetreten', ',', 'die', 'alles', 'andere', 'als', 'überzeugend', 'war', '"', '.']
['Firmengründer', 'Wolf', 'Peter', 'Bree', 'arbeitete', 'Anfang', 'der', 'siebziger', 'Jahre', 'als', 'Möbelvertreter', ',', 'als', 'er', 'einen', 'fliegenden', 'Händler', 'aus', 'dem', 'Libanon', 'traf', '.']

Markov Models

In one of the preliminary notebooks Bayesian Networks by Example, we have introduced graphical models. Each node in a Bayesian Network corresponds to a random variable, which can take on different values with a certain probability given the predecessors of the variable.

A Markov Model is very similar, but represents a sequence of states. For example, $W_1, \ldots, W_n$ could represent a temporal sequence of weather states, where $W_1, \ldots, W_n$ are random variables. These random variables share a common probability table $W$. Since $W$ is shared, the length $n$ of a sequence $S_W$ can vary without adjusting the model. Possible values of $W$ are defined as

$$Val(W) = \{Sunny, Cloudy, Rainy\}.$$

For example, the following graphical representation shows a weather model with 5 subsequent states.

def mm(sequence, x_offsets=None, observed=None):
    x_offset = 0
    x_offsets = x_offsets if x_offsets else [100 for _ in range(len(sequence)-1)]
    observed = observed if observed else [False for _ in range(len(sequence))]
    g = nx.DiGraph()
    g.add_node('O0', label=sequence[0], pos=(0, 0), observed=observed[0])
    for i in range(len(sequence)-1):
        j = i+1
        x_offset += x_offsets[i]
        g.add_node(f'O{j}', label=sequence[j], pos=(x_offset, 0), observed=observed[j])
        g.add_edge(f'O{i}', f'O{j}')
    return g

HTML(draw(mm(['W{}'.format(chr(0x2081 + i)) for i in range(5)])))
(Graph: W₁ → W₂ → W₃ → W₄ → W₅)

Assume we wanted to calculate the probability of the last weather event $P(W_5)$. By the rules of probability (see Bayesian Networks by Example) this would result in the following calculation.

$$\begin{aligned} P(W_5) &= \sum_{W_1}\sum_{W_2}\sum_{W_3}\sum_{W_4} P(W_1) \cdot P(W_2 \mid W_1) \cdot P(W_3 \mid W_2) \cdot P(W_4 \mid W_3) \cdot P(W_5 \mid W_4) \\ &= \sum_{W_1}\sum_{W_2}\sum_{W_3}\sum_{W_4} P(W_1, W_2, W_3, W_4, W_5) \end{aligned}$$

Instead of calculating the probability of a certain element $W_t \in \{W_1, \ldots, W_n\}$ of our sequence $S_W$, we can also use this model to calculate the probability of a complete sequence $P(S_W)$ with $n$ elements as follows:

$$P(S_W) = \prod_{t=1}^{n} P(W_t)$$

For every element $W_t$, a complex probability calculation $P(W_t)$, as shown for $P(W_5)$, would be required. Fortunately, by applying the Markov Assumption, we can reduce this calculation to a simple lookup in our probability table $W$.

Markov Assumption

Usually, we observe a specific sequence of events, for example $W_1 = Sunny, W_2 = Sunny, W_3 = Cloudy, W_4 = Rainy$. This means that $W_5$ becomes independent of $W_1$, $W_2$ and $W_3$ given $W_4$.

$$W_5 \perp W_1, W_2, W_3 \mid W_4 = Rainy$$

Calculating $P(W_5)$ can therefore be reduced to

$$P(W_5) = P(W_5 \mid W_4 = Rainy),$$

which is only a table lookup. If you are familiar with Bayesian Networks, you can see that the Markov Assumption is a direct consequence of observing random variables.

The following graphical representation shows the observed variables.

HTML(draw(mm(
    ['W\u2081 = Sunny', 'W\u2082 = Sunny', 'W\u2083 = Cloudy', 'W\u2084 = Rainy', 'W\u2085'],
    x_offsets=[160, 160, 160, 115],
    observed=[True, True, True, True, False]
)))
(Graph: W₁ = Sunny → W₂ = Sunny → W₃ = Cloudy → W₄ = Rainy → W₅)

We can now predict the next state in a sequence by just looking at the current state. Assuming that we already know all transition probabilities from each possible state to the next, we can express them in a simple table as follows.

index = {
    'Sunny': 0,
    'Cloudy': 1,
    'Rainy': 2
}

W = np.array([
    [0.8, 0.15, 0.05],
    [0.2, 0.5, 0.3],
    [0.2, 0.2, 0.6]
])

If it is $Sunny$ at a given point in a sequence, the probability that it will be $Cloudy$ afterwards can be expressed as $P(W_t = Cloudy \mid W_{t-1} = Sunny)$. This value can be accessed in the probability table $W$ as follows.

W[index['Sunny'], index['Cloudy']]
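The nested sums in the calculation of $P(W_5)$ above can be collapsed into repeated matrix multiplication, because each sum just propagates the current state distribution through the transition table. A minimal sketch, assuming a uniform distribution over the first state (the notebook never fixes $P(W_1)$):

```python
import numpy as np

# transition table from the weather example above
W = np.array([
    [0.8, 0.15, 0.05],
    [0.2, 0.5, 0.3],
    [0.2, 0.2, 0.6]
])

# each row is a conditional distribution P(W_t | W_{t-1}) and sums to 1
assert np.allclose(W.sum(axis=1), 1.0)

# assumed uniform distribution over W_1 (not specified in the notebook)
dist = np.full(3, 1 / 3)

# four transitions turn P(W_1) into the marginal P(W_5);
# this is the nested sum from above, collapsed into matrix multiplications
for _ in range(4):
    dist = dist @ W

print(dist)  # marginal distribution over (Sunny, Cloudy, Rainy)
```

The result is still a valid probability distribution over the three weather states.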

Markov Chains

As shown in the last section Markov Assumption, we can greatly reduce the complexity of a Markov Model when we are only dealing with observed variables. In this case we can also refer to our model as a Markov Chain. Graphical representations of a Markov Chain differ from Bayesian Networks and Markov Models, because their nodes represent a specific state from $Val(W) = \{Sunny, Cloudy, Rainy\}$ instead of a random variable $W_t$. The edges between nodes represent the transition probability from one state to another, as shown in the example below.

g = nx.DiGraph()

for x in index:
    g.add_node(x)

for x in g.nodes:
    for y in g.nodes:
        g.add_edge(x, y, label=W[index[x], index[y]])

HTML(draw(g))

(Graph: fully connected states Sunny, Cloudy and Rainy, edges labeled with the transition probabilities from W)

Language Models

A Bi-Gram language model can be represented as a Markov Model. Consider the following sentence s, which is contained in the GermEval2014 text corpus. The sentence is a sequence of terms.

for i, s in enumerate(terms(dm.train_sequences())):
    if i == 2041:
        break

Prepending a START tag and appending an END tag allows us to assign probabilities to a term being at the beginning or the end of a sentence, respectively.

sse = ['START'] + s + ['END']
HTML(draw(mm(sse, x_offsets=[100, 118, 123, 116, 108, 85], observed=[True for _ in sse])))
(Graph: START → Das → Lebensbild → einer → Dichterin → . → END)

Bi-Grams are all possible combinations of two subsequent words in the sentence. See the following code for a list of all Bi-Grams.

list(zip(sse[:-1], sse[1:]))

In order to calculate probabilities for all possible terms and their successors, we have to count all Bi-Grams in the training corpus. Since the number of possible terms in a text corpus is large, the resulting transition matrix requires a lot of memory and is sparse.

raw_terms = set()

for s in terms(dm.train_sequences()):
    raw_terms.update(s)

len(raw_terms)

The exercises in this notebook will guide you through the implementation of a language model and show how to apply some tricks to deal with sparsity.

Pen-and-Paper Exercises

Exercise: Terminology

What is the difference between Word, Token, Type and Term?

To answer the question, read the beginning of 2.2.1. Tokenization of "Introduction to information retrieval" [SCH08].

Exercise: Language Models

Study the fundamentals of N-Gram language models in Chapter 4 (accessed online 05.04.2018) of "Speech and Language Processing" [JUR09].

Further Readings: Language Modeling (Lecture Notes for NLP) [COL13], accessed online 05.04.2018

Solve the following exercises in Chapter 4 [JUR09] (see p. 25 in the linked PDF).

  • 4.1
  • 4.2
  • 4.3
  • 4.4
  • 4.5

Programming Exercises

Exercise: Normalization

Write a Python function which normalizes the corpus as follows:

  • convert all uppercase letters to lower case
  • replace all digits with the letter $D$
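For a single term, the two rules can be sketched with str.lower and re.sub; this is only a one-string illustration (the hypothetical helper normalize_term is not part of the exercise interface, which asks for a generator over whole sentences):

```python
import re

def normalize_term(term):
    """Lowercase a term and replace every digit with the letter D."""
    return re.sub(r'\d', 'D', term.lower())

print(normalize_term('R2D2'))  # rDdD
print(normalize_term('1977'))  # DDDD
```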
# fill in your normalization code in the function body
def normalize(corpus):
    for s in corpus:
        normalized_sentence = None
        raise NotImplementedError()
        yield normalized_sentence
# two normalized sentences
for i, s in enumerate(normalize(terms(dm.train_sequences()))):
    if i == 2: break
    print(s)
['schartau', 'sagte', 'dem', '"', 'tagesspiegel', '"', 'vom', 'freitag', ',', 'fischer', 'sei', '"', 'in', 'einer', 'weise', 'aufgetreten', ',', 'die', 'alles', 'andere', 'als', 'überzeugend', 'war', '"', '.']
['firmengründer', 'wolf', 'peter', 'bree', 'arbeitete', 'anfang', 'der', 'siebziger', 'jahre', 'als', 'möbelvertreter', ',', 'als', 'er', 'einen', 'fliegenden', 'händler', 'aus', 'dem', 'libanon', 'traf', '.']
def test_normalize():
    input_corpus = [['R2D2', 'ist', 'ein', 'Roboter', 'aus', 'dem', 'Jahr', '1977', '.']]
    output_corpus = [['rDdD', 'ist', 'ein', 'roboter', 'aus', 'dem', 'jahr', 'DDDD', '.']]
    normalized_corpus = list(normalize(input_corpus))
    assert normalized_corpus == output_corpus

Exercise: Allowed Terms

Create a set of allowed terms. In this case we allow only the 8190 most common terms, to artificially reduce the size of our vocabulary. This is an important step to reduce the memory consumption of our Bi-Gram matrix, which we will construct in a later step.

  • Create a Python set, containing only the most common terms.
  • Hint: for the initial counting you could use the Counter class from collections.
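As a sketch of the hint, Counter.most_common(n) returns the n most frequent items, from which a set can be built. Shown here on a made-up toy corpus, not on GermEval2014:

```python
from collections import Counter

toy_corpus = [
    ['der', 'hund', 'und', 'der', 'ball'],
    ['der', 'ball', 'und', 'die', 'katze'],
]

# count every term across all sentences
counts = Counter(term for sentence in toy_corpus for term in sentence)

# keep only the 3 most frequent terms as a set
most_common = {term for term, _ in counts.most_common(3)}
print(most_common)
```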
# fill in code to create a set containing allowed terms
allowed_terms = None
def test_allowed_terms():
    assert isinstance(allowed_terms, set)
    assert 'immer' in allowed_terms
    assert 'mehr' in allowed_terms
    assert 'schartau' not in allowed_terms
    assert len(allowed_terms) == NUM_ALLOWED

The allowed terms are used in an additional filter function allowed, which is applied as a generator. The filter function replaces every term not contained in the vocabulary with UNK.

# we create a function, which filters infrequent terms
def allowed_func(allowed_terms):
    # this is a closure
    def allowed(corpus):
        for s in corpus:
            yield [term if term in allowed_terms else UNKNOWN for term in s]
    # the returned function will have access to allowed_terms from its definition context
    return allowed
allowed = allowed_func(allowed_terms)
# filter infrequent terms from two normalized sentences
for i, s in enumerate(allowed(normalize(terms(dm.train_sequences())))):
    if i == 2: break
    print(s)
['UNK', 'sagte', 'dem', '"', 'tagesspiegel', '"', 'vom', 'freitag', ',', 'fischer', 'sei', '"', 'in', 'einer', 'weise', 'aufgetreten', ',', 'die', 'alles', 'andere', 'als', 'UNK', 'war', '"', '.']
['UNK', 'wolf', 'peter', 'UNK', 'arbeitete', 'anfang', 'der', 'siebziger', 'jahre', 'als', 'UNK', ',', 'als', 'er', 'einen', 'fliegenden', 'händler', 'aus', 'dem', 'libanon', 'traf', '.']

Exercise: Index

Now that we have reduced the allowed vocabulary to a certain number of terms $N$, we can build an index. This index does not only include each $TERM$, but also the special symbols $UNK$ and $START\_END$. We can use $START\_END$ as a shared index for $START$ and $END$, because $START$ can only appear on the left side of a Bi-Gram and $END$ can only appear on the right side of a Bi-Gram. Exercise: Count Bi-Grams will explain this in detail.

  • Create a Python dictionary index, where terms and special symbols are keys and the assigned values are integers.
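On a toy vocabulary, such an index can be sketched with a dictionary comprehension over enumerate (the real exercise additionally has to add the UNK and START_END symbols):

```python
toy_vocab = {'immer', 'mehr', 'haus'}

# map each term to a unique integer position; sorting makes the mapping deterministic
toy_index = {term: i for i, term in enumerate(sorted(toy_vocab))}
print(toy_index)  # {'haus': 0, 'immer': 1, 'mehr': 2}
```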
# fill in code to create a dictionary containing a term index
index = None
def test_index():
    assert isinstance(index, dict)
    assert len(index) == len(allowed_terms) + 2
    assert UNKNOWN in index
    assert START_END in index
    for key, val in index.items():
        assert isinstance(key, str)
        assert isinstance(val, int)
    assert list(sorted(index.values())) == list(range(len(allowed_terms) + 2))

Exercise: Count Bi-Grams

In this exercise you will create a numpy matrix, which stores the number of occurrences of each Bi-Gram from our training corpus. We can think of a Bi-Gram as a combination of two subsequent elements $a$ and $b$, where $a$ can be $START$, $UNK$ or a $TERM$ and where $b$ can be $END$, $UNK$ or a $TERM$. We can store a count for each possible combination of $a$ and $b$ in a matrix, by assigning the first dimension to $a$ and the second dimension to $b$.

The index, which lets us map $a$ or $b$ to a fixed position in the array, has already been defined. For $START$ we only need a row, but no column, while for $END$ we only need a column, but no row. Therefore the shared index $START\_END$ is a useful trick.

  • Initialize a numpy array $C$, which stores Bi-Gram counts for all combinations of $a$ and $b$.

    • Use the first dimension for $a$ and the second dimension for $b$.
  • For lower memory consumption, use the numpy datatype (dtype) np.uint16 instead of np.int64.
  • Fill the array with Bi-Gram counts from the training corpus.

    • Use the index to refer to the correct indices in the matrix.
    • $START \rightarrow END$ is a valid Bi-Gram for empty sequences.
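At toy scale, counting Bi-Grams is just tallying adjacent pairs; a sketch with Counter on a single made-up sentence (the exercise stores the same counts in the numpy matrix $C$ instead):

```python
from collections import Counter

sentence = ['START', 'das', 'haus', '.', 'END']

# adjacent pairs of the sentence are its bi-grams
bigram_counts = Counter(zip(sentence[:-1], sentence[1:]))
print(bigram_counts[('START', 'das')])  # 1
```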
# fill in code to initialize numpy data structures and fill them with bi-gram counts
C = None

for s in allowed(normalize(terms(dm.train_sequences()))):
    raise NotImplementedError()

C[index['immer'], index['mehr']]
def test_C():
    assert C.shape == (len(index), len(index))
    assert C.dtype == np.uint16
    assert C[index[START_END], index[START_END]] == 2

Exercise: Bi-Gram Probability

Implement a function p(a, b), which returns the probability $P(b \mid a)$ for a given Bi-Gram $a$, $b$. The probability of a Bi-Gram is defined as

$$P(b \mid a) = \frac{C(a, b)}{\sum\limits_{t \in I} C(a, t)}$$

where $I$ is the set of index entries, including the $START\_END$ tag, and where $C$ is the Bi-Gram count matrix.
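To make the formula concrete: the conditional probability is one count divided by a row sum. A toy check with a hypothetical 2×2 count matrix (the name C_toy and its counts are made up for illustration):

```python
import numpy as np

# hypothetical counts for a two-entry index {a: 0, b: 1}
C_toy = np.array([
    [2, 6],
    [1, 1]
], dtype=np.uint16)

# P(b | a) = C(a, b) / sum over all t of C(a, t)
p_b_given_a = C_toy[0, 1] / C_toy[0].sum()
print(p_b_given_a)  # 0.75
```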

  • p(a, b) must have access to C and the index.

  • Raise an exception if a == END.
  • Raise an exception if b == START.
# fill in code to implement a closure p_func, which returns p
def p_func(C, index):
    raise NotImplementedError()
    # variables which are defined here, will be available in p
    def p(a, b):
        raise NotImplementedError()
    return p
p = p_func(C, index)
p('immer', 'mehr')
def test_probability(p):
    test_terms = set(allowed_terms)
    test_terms.update([UNKNOWN, END])
    assert_almost_equal(sum(p(START, term) for term in test_terms), 1.0)
    assert_almost_equal(sum(p(UNKNOWN, term) for term in test_terms), 1.0)
    assert_almost_equal(sum(p('immer', term) for term in test_terms), 1.0)
    assert_almost_equal(sum(p('mehr', term) for term in test_terms), 1.0)
    def exception_raised(a, b):
        raised = False
        try:
            p(a, b)
        except Exception:
            raised = True
        return raised
    assert exception_raised(END, UNKNOWN)
    assert exception_raised(UNKNOWN, START)


Exercise: Sentence Probability

First implement a function sentence_probability(s, p), which calculates the probability of a full sentence $P(S)$ by multiplying the probabilities of all Bi-Grams in the sequence.

$$P(S) = \prod_{t=1}^{n} P(v_t \mid v_{t-1})$$

  • $v_0$ refers to START
  • $v_n$ refers to END
  • Parameter s is a sentence, without START or END tags.
  • Parameter p takes a function to calculate Bi-Gram probabilities, for example p(a, b).

Then implement a second function sentence_log_probability(s, p), which calculates the probability of a full sentence $P(S)$ by adding the logarithmic probabilities of all Bi-Grams in the sentence.

$$\log(P(S)) = \sum_{t=1}^{n} \log(P(v_t \mid v_{t-1}))$$

Hint: Adding in logarithmic space is equivalent to multiplying without logarithms (see example below), but prevents numerical instabilities due to floating-point arithmetic.

$$\log(P(v_1 \mid START) \cdot P(v_2 \mid v_1) \cdot P(END \mid v_2)) = \log(P(v_1 \mid START)) + \log(P(v_2 \mid v_1)) + \log(P(END \mid v_2))$$
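The instability is easy to provoke: multiplying a few hundred small Bi-Gram probabilities underflows 64-bit floats to exactly 0.0, while the sum of their logarithms stays perfectly representable. A minimal demonstration with made-up probabilities:

```python
import numpy as np

probs = np.full(500, 1e-4)  # 500 bi-gram probabilities of 1e-4 each

product = np.prod(probs)         # underflows: 1e-2000 is below the smallest float64
log_sum = np.sum(np.log(probs))  # stays representable

print(product)  # 0.0
print(log_sum)  # about -4605.17
```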
# fill in code to implement functions which return a sequence probability
def sentence_probability(s, p):
    raise NotImplementedError()
def sentence_log_probability(s, p):
    raise NotImplementedError()
s1 = ['sie', 'nehmen', '.']
s2 = ['sie', 'nimmt', '.']
sentence_probability(s1, p), sentence_probability(s2, p)
sentence_log_probability(s1, p), sentence_log_probability(s2, p)
np.log(sentence_probability(s1, p)), sentence_log_probability(s1, p)
def test_sentence_probability(p):
    for s in allowed(normalize(terms(dm.train_sequences()))):
        assert_almost_equal(np.log(sentence_probability(s, p)), sentence_log_probability(s, p))

Exercise: Sampling

Implement a function sample(max_length=50), which returns a random sentence generated from the model.

  • sample(max_length=50) must have access to C and the index.

  • You can use np.random.choice to draw from a random distribution of the model.
  • Sampled sentences should not include START or END tags.
  • Remember to use START Bi-Grams to determine the probabilities of possible sentence beginnings.
  • Return a sentence as soon as an END tag has been sampled or if max_length is reached.
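The core draw can be sketched with np.random.choice on one row of the weather table from earlier in this notebook; the exercise does the same thing with a normalized row of the Bi-Gram count matrix:

```python
import numpy as np

states = ['Sunny', 'Cloudy', 'Rainy']
W = np.array([
    [0.8, 0.15, 0.05],
    [0.2, 0.5, 0.3],
    [0.2, 0.2, 0.6]
])

current_state = 'Sunny'
# draw the next state from the conditional distribution P(. | Sunny)
next_state = np.random.choice(states, p=W[states.index(current_state)])
print(next_state)
```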
# fill in code to implement a closure sample_func, which returns sample
def sample_func(C, index):
    raise NotImplementedError()
    # variables which are defined here, will be available in sample
    def sample(max_length=50):
        raise NotImplementedError()
    return sample
sample = sample_func(C, index)
for _ in range(10):
    print(' '.join(sample()))
hier fleisch , seiner krönung der UNK .
der union UNK , und staat UNK trasse UNK bisher keine veränderung wurde die österreichische UNK ich UNK und der sich der landschaft ist man zu haben .
um eine UNK .
die UNK verstärkt nannte der italiener - bei UNK UNK anknüpfen .
von UNK .
das ist die polizei in der mit UNK und UNK , um UNK .
sieht die pro tag zu vermitteln .
so richtig gas " von seiten der frankfurter allgemeinen weniger unfälle , ohne menschlichen UNK politische heimat .
karriere , ein kunde schützen .

Exercise: Laplace Smoothing

Implement a function p_laplace(a, b), which returns the probability $\hat{P}(b \mid a)$ for a given Bi-Gram $a$, $b$ with Laplace smoothing applied. Laplace smoothing, also known as Add-One smoothing, is defined as

$$\begin{aligned} \hat{P}(b \mid a) &= \frac{C(a, b) + 1}{\sum\limits_{t \in I} (C(a, t) + 1)} \\ &= \frac{C(a, b) + 1}{\sum\limits_{t \in I} C(a, t) + |I|} \end{aligned}$$

where $|I|$ is the size of the index, including the $START\_END$ tag, and where $C$ is the Bi-Gram count matrix.
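Numerically, add-one smoothing lifts zero counts to a small nonzero probability while keeping the row a valid distribution. A toy example with a hypothetical count row:

```python
import numpy as np

counts = np.array([3, 1, 0])  # hypothetical bi-gram counts for one term a

p_unsmoothed = counts / counts.sum()
p_laplace_row = (counts + 1) / (counts.sum() + len(counts))

print(p_unsmoothed)   # the unseen bi-gram has probability 0
print(p_laplace_row)  # every bi-gram now has probability > 0
```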

  • p_laplace(a, b) must have access to C and the index.

  • The implementation is similar to the existing function p(a, b) and must pass the same tests.
  • Raise an exception if a == END.
  • Raise an exception if b == START.
# fill in code to implement a closure p_laplace_func, which returns p_laplace
def p_laplace_func(C, index):
    raise NotImplementedError()
    # variables which are defined here, will be available in p_laplace
    def p_laplace(a, b):
        raise NotImplementedError()
    return p_laplace
p_laplace = p_laplace_func(C, index)
p_laplace('immer', 'mehr'), p('immer', 'mehr')

We can use p_laplace as replacement for p in our existing functions.

sentence_log_probability(s1, p), sentence_log_probability(s1, p_laplace)

We can also use p_laplace in our existing tests.

test_probability(p_laplace)
Exercise: Perplexity

Implement a function perplexity(corpus), which returns the perplexity score of the model on a complete test corpus. Perplexity is defined as

$$\begin{aligned} Perplexity(T) &= \sqrt[N]{ \frac{1}{\prod\limits_{S \in T} P(S)} } \\ &= \left(\frac{1}{\prod\limits_{S \in T} P(S)}\right)^{\frac{1}{N}} \\ &= \exp\left(\log\left(\left(\frac{1}{\prod\limits_{S \in T} P(S)}\right)^{\frac{1}{N}}\right)\right) \\ &= \exp\left(\frac{1}{N} \cdot \log\left(\frac{1}{\prod\limits_{S \in T} P(S)}\right)\right) \\ &= \exp\left(\frac{1}{N} \cdot \left(\log(1) - \log\left(\prod\limits_{S \in T} P(S)\right)\right)\right) \\ &= \exp\left(-\frac{1}{N} \cdot \log\left(\prod\limits_{S \in T} P(S)\right)\right) \\ &= \exp\left(-\frac{1}{N} \cdot \sum\limits_{S \in T} \log(P(S))\right) \end{aligned}$$

where TT is a test corpus and NN is the number of all Bi-Grams in the corpus. The formula above gives you a clean mathematical definition at the top, and a numerically stable conversion using logarithmic probabilities at the bottom. For more intuition on perplexity watch Nlp - 2.3 - Evaluation and Perplexity by Daniel Jurafsky.

  • Use the numerically stable formula at the bottom as a reference for your implementation.
  • Use your existing functions sentence_log_probability and p_laplace for Bi-Gram probabilities.
  • Use np.exp.
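With made-up numbers, the numerically stable formula at the bottom is a one-liner. Assuming a toy corpus of two sentences with known log probabilities and 10 Bi-Grams in total:

```python
import numpy as np

# hypothetical sentence log probabilities and total bi-gram count
log_probs = [-12.5, -8.3]
N = 10

# exp(-1/N * sum of log sentence probabilities)
perplexity_toy = np.exp(-sum(log_probs) / N)
print(perplexity_toy)  # about 8.0
```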
# fill in your perplexity code in the function body
def perplexity(corpus):
    # your code goes here
    raise NotImplementedError()

# use a test corpus, instead of the training corpus
perplexity(allowed(normalize(terms(dm.test_sequences()))))

Summary and Outlook

You have learned about Markov Models and how to implement a language model. With your acquired knowledge you can move on to Hidden Markov Models and apply them to tagging problems like Named Entity Recognition.



Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g. images).

Text Information Extraction - Sequences - Exercise: Bi-Gram Language Model
by Christoph Jansen and Christian Herta (deep.TEACHING - HTW Berlin)
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Christoph Jansen, Christian Herta

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.