Text Information Extraction  Sequences  Exercise: BiGram Language Model
Table of Contents
Introduction
This notebook introduces Markov Models as a way to represent BiGram language models. It is advised, but not strictly required, to learn about Bayesian Networks first (see Bayesian Networks by Example). Studying Bayesian Networks provides you with a basic understanding of probability theory and graphical models, which is useful in this context.
The comprehensive exercises in this notebook will guide you through the steps of implementing a BiGram language model. Due to the general data sparsity in the Natural Language Processing (NLP) domain, the implementation requires various adaptions concerning numerical stability, which will be adressed accordingly.
As a data source, the GermEval2014 text corpus is used.
Required Knowledge
Required Python Modules
# Python Standard Library
from collections import Counter
# External Modules
import networkx as nx
import numpy as np
from numpy.testing import assert_almost_equal
from IPython.display import HTML
from deep_teaching_commons.graphs.visualization import draw
from deep_teaching_commons.data.text.germ_eval_2014 import GermEval2014
Required Data
# data manager for Germ Eval 2014 corpus
# default base_data_dir is ~/deep.TEACHING/data
base_data_dir = None
dm = GermEval2014(base_data_dir=base_data_dir)
auto download is active, attempting download
data directory already exists, no download required
# only termns, no NER tags, are required in this exercise
def terms(sequences):
for sequence in sequences:
yield [term for term, _, _ in sequence]
# two sentences
for i, s in enumerate(terms(dm.train_sequences())):
if i == 2: break
print(s)
['Schartau', 'sagte', 'dem', '"', 'Tagesspiegel', '"', 'vom', 'Freitag', ',', 'Fischer', 'sei', '"', 'in', 'einer', 'Weise', 'aufgetreten', ',', 'die', 'alles', 'andere', 'als', 'überzeugend', 'war', '"', '.']
['Firmengründer', 'Wolf', 'Peter', 'Bree', 'arbeitete', 'Anfang', 'der', 'siebziger', 'Jahre', 'als', 'Möbelvertreter', ',', 'als', 'er', 'einen', 'fliegenden', 'Händler', 'aus', 'dem', 'Libanon', 'traf', '.']
Markov Models
In one of the preliminary notebooks Bayesian Networks by Example, we have introduced graphical models. Each node in a Bayesian Network corresponds to a random variable, which can take on different values with a certain probability given the predecessors of the variable.
A Markov Model is very similar, but represents a sequence of states. For example $W_1, ..., W_n$ could represent a temporal sequence of weather states, where $W_1, ..., W_n$ are random variables. These random variables share a common probability table $W$. Since $W$ is shared, the length $n$ of a sequence $S_W$ can vary without adjusting the model. Possible values of W are defined as
For example, the following graphical representation shows a weather model with 5 subsequent states.
def mm(sequence, x_offsets=None, observed=None):
x_offset = 0
x_offsets = x_offsets if x_offsets else [100 for _ in range(len(sequence)1)]
observed = observed if observed else [False for _ in range(len(sequence))]
g = nx.DiGraph()
g.add_node('O0', label=sequence[0], pos=(0, 0), observed=observed[0])
for i in range(len(sequence)1):
j = i+1
x_offset += x_offsets[i]
g.add_node(f'O{j}', label=sequence[j], pos=(x_offset, 0), observed=observed[j])
g.add_edge(f'O{i}', f'O{j}')
return g
HTML(draw(mm(['W{}'.format(chr(0x2081 + i)) for i in range(5)])))
Assume we wanted to calculate the probability of the last weather event $P(W_5)$. By the rules of probability (see Bayesian Networks by Example) this would result in the following calculation.
Instead of calculating the probability of a certain element $W_t \in {W_1, ..., W_n}$ of our sequence $S_W$, we can also use this model to calculate the probability of a complete sequence $P(S_W)$ with $n$ elements as follows:
For every element $W_t$, a complex probability calculation $P(W_t)$, as shown for $P(W_5)$, would be required. Fortunately, by applying the Markov Assumption, we can reduce this calculation to a simple lookup in our probability table $W$.
Markov Assumption
Usually, we observe a specific sequence of events. For example $W_1 = Sunny, W_2 = Sunny, W_3 = Cloudy, W_4 = Rainy$. This means that $W_5$ becomes independent of $W_1$, $W_2$ and $W_3$ given $W_4$.
Calculating $P(W_5)$ can therefore be reduced to
which is only a table lookup. If you are familiar with Bayesian Networks, you can see that the Markov Assumption is a direct consequence of observing random variables.
The following graphical representation shows the observed variables.
HTML(draw(mm(
['W\u2081 = Sunny', 'W\u2082 = Sunny', 'W\u2083 = Cloudy', 'W\u2084 = Rainy', 'W\u2085'],
x_offsets=[160, 160, 160, 115],
observed=[True, True, True, True, False]
)))
We can now predict the next observed state in a sequence, by just looking at the current state. Asume, that we already know all transition probabilities from each possible state to the next state, we can express them in a simple table as follows.
index = {
'Sunny': 0,
'Cloudy': 1,
'Rainy': 2
}
W = np.array([
[0.8, 0.15, 0.05],
[0.2, 0.5, 0.3],
[0.2, 0.2, 0.6]
])
If it is $Sunny$ at a given point in a sequence, the probability that it will be $Cloudy$ afterwards can be expressed as $P(W_t = Cloudy \mid W_{t1} = Sunny)$. The value of $P(W_t = Cloudy \mid W_{t1} = Sunny)$ can be accessed in the probability table $W$ as follows.
W[index['Sunny'], index['Cloudy']]
Markov Chains
As has been shown in the last section Markov Assumption, the we can greatly reduce the complexity of a Markov Model, when we are only dealing with observed variables. In this case we can also refer to our model as Markov Chain. Graphical representations of a Markov Chain differ from Bayesian Networks and Markov Models, because their nodes represent a specific state $Val(W) = \{Sunny, Cloudy, Rainy\}$ instead of a random variable $W_t$. The edges between nodes represent the transition probability from state to another, as shown in the example below.
g = nx.DiGraph()
for x in index:
g.add_node(x)
for x in g.nodes:
for y in g.nodes:
g.add_edge(x, y, label=W[index[x], index[y]])
HTML(draw(g))
Language Models
A BiGram language model can be represented as a Markov Model. Consider the following sentence s
, which is contained in the GermEval2014 text corpus. The sentence is a sequence of terms.
for i, s in enumerate(terms(dm.train_sequences())):
if i == 2041:
break
s
Prepending a START
tag and appending an END
tag allows us to assign probabilities to a term being at the beginning or the end of sentence respectively.
sse = ['START'] + s + ['END']
sse
HTML(draw(mm(sse, x_offsets=[100, 118, 123, 116, 108, 85], observed=[True for _ in sse])))
BiGrams are all possible combinations of two subsequent words in the sentence. See the following code, for a list of all BiGrams.
list(zip(sse[:1], sse[1:]))
In order to calculate probabilities for all possible terms and their successors, we have to count all BiGrams in the training corpus. Since the number of possible terms in a text corpus is large, the resulting transition matrix requires a lot of memory and is sparse.
raw_terms = set()
for s in terms(dm.train_sequences()):
raw_terms.update(s)
len(raw_terms)
The exercises in this notebook will guide through the implementation of a language model and how to apply some tricks to reduce sparsity.
PenandPaper Exercises
Exercise: Terminology
What is the difference between Word, Token, Type and Term?
To answer the question, read the beginning of 2.2.1. Tokenization of "Introduction to information retrieval" [SCH08].
Exercise: Language Models
Study the fundamentals of NGram language models in Chapter 4 (accessed online 05.04.2018) of "Speech and Language Processing" [JUR09].
Further Readings: Language Modeling (Lecture Notes for NLP) [COL13], accessed online 05.04.2018
Solve the following exercises in Chapter 4 [JUR09] (see p. 25 in the linked PDF).
 4.1
 4.2
 4.3
 4.4
 4.5
Programming Exercises
Exercise: Normalization
Write a python function which normalizes the corpus as follows:
 convert all uppercase letters to lower case
 replace all digits by letter $D$
DIGIT = 'D'
# fill in your normalization code in the function body
def normalize(corpus):
for s in corpus:
normalized_sentence = None
raise NotImplementedError()
yield normalized_sentence
# two normalized sentences
for i, s in enumerate(normalize(terms(dm.train_sequences()))):
if i == 2: break
print(s)
['schartau', 'sagte', 'dem', '"', 'tagesspiegel', '"', 'vom', 'freitag', ',', 'fischer', 'sei', '"', 'in', 'einer', 'weise', 'aufgetreten', ',', 'die', 'alles', 'andere', 'als', 'überzeugend', 'war', '"', '.']
['firmengründer', 'wolf', 'peter', 'bree', 'arbeitete', 'anfang', 'der', 'siebziger', 'jahre', 'als', 'möbelvertreter', ',', 'als', 'er', 'einen', 'fliegenden', 'händler', 'aus', 'dem', 'libanon', 'traf', '.']
def test_normalize():
input_corpus = [['R2D2', 'ist', 'ein', 'Roboter', 'aus', 'dem', 'Jahr', '1977', '.']]
output_corpus = [['rDdD', 'ist', 'ein', 'roboter', 'aus', 'dem', 'jahr', 'DDDD', '.']]
normalized_corpus = list(normalize(input_corpus))
assert normalized_corpus == output_corpus
test_normalize()
Exercise: Allowed Terms
Create a set of allowed terms. In this case we allow only the 8190 most common terms, to artificially reduce the size of our vocabulary. This is an important step to reduce the memory consumption of our BiGram matrix, which we will construct in a later step.
 Create a Python set, containing only the most common terms.
 Hint: for the initial counting you could use the Counter class from collections.
NUM_ALLOWED = 8190
# fill in code to create a list containing allowed terms
allowed_terms = None
def test_allowed_terms():
assert isinstance(allowed_terms, set)
assert 'immer' in allowed_terms
assert 'mehr' in allowed_terms
assert 'schartau' not in allowed_terms
assert len(allowed_terms) == NUM_ALLOWED
test_allowed_terms()
The allowed terms are used in an additional filter function allowed
which is applied as a generator. The filter function will replace every unknown term, not contained in the vocabulary by UNK
.
UNKNOWN = 'UNK'
# we create a function, which filters infrequent terms
def allowed_func(allowed_terms):
# this is a closure
def allowed(corpus):
for s in corpus:
yield [term if term in allowed_terms else UNKNOWN for term in s]
# the returned function will have access to allowed_terms from its defintion context
return allowed
allowed = allowed_func(allowed_terms)
# filter infrequent terms from two normalized sentences
for i, s in enumerate(allowed(normalize(terms(dm.train_sequences())))):
if i == 2: break
print(s)
['UNK', 'sagte', 'dem', '"', 'tagesspiegel', '"', 'vom', 'freitag', ',', 'fischer', 'sei', '"', 'in', 'einer', 'weise', 'aufgetreten', ',', 'die', 'alles', 'andere', 'als', 'UNK', 'war', '"', '.']
['UNK', 'wolf', 'peter', 'UNK', 'arbeitete', 'anfang', 'der', 'siebziger', 'jahre', 'als', 'UNK', ',', 'als', 'er', 'einen', 'fliegenden', 'händler', 'aus', 'dem', 'libanon', 'traf', '.']
Exercise: Index
Now that we have reduced the allowed vocabulary to a certain number of terms $N$, we can build an index. This index does no only include each $TERM$, but also special symbols $UNK$ and $START\_END$. We can use $START\_END$ as a shared index for $START$ and $END$, because $START$ can only be on the left side of a BiGram and $END$ can only be on the right side of a BiGram. Exercise: Count BiGrams will explain this in detail.
 Create Python dictionary
index
where terms and special symbols are keys, while assigned values are integers.
START_END = 'START_END'
# fill in code to create a dictionary containing a term index
index = None
len(index)
index['immer']
def test_index():
assert isinstance(index, dict)
assert len(index) == len(allowed_terms) + 2
assert UNKNOWN in index
assert START_END in index
for key, val in index.items():
assert isinstance(key, str)
assert isinstance(val, int)
assert list(sorted(index.values())) == list(range(len(allowed_terms) + 2))
test_index()
Exercise: Count BiGrams
In this exercise you will create a numpy matrix, which stores the number of occurrences of each BiGram from our training corpus. We can think of a BiGram as a combination of two subsequent elements $a$ and $b$, where $a$ can be $START$, $UNK$ or a $TERM$ and where $b$ can be $END$, $UNK$ or a $TERM$. We can store a count for each possible combination of $a$ and $b$ in a matrix, by assigning the first dimension to $a$ and the second dimension to $b$.
The index
, which lets us map $a$ or $b$ to a fixed position in the array, has already been defined. For $START$ we only need a row, but no column, while for $END$ we only need a column, but not row. Therefore the shared index $START\_END$ is a useful trick.

Initialize a numpy array $C$, which stores BiGram counts for all combinations of $a$ and $b$.
 Use the first dimension for $a$ and the second dimension for $b$.
 For lower memory consumption, use the numpy datatype (dtype)
np.uint16
, instead ofnp.int64
. 
Fill the array with BiGram counts from the training corpus.
 Use the
index
to refer to the correct indices in the matrix.  START $\rightarrow$ END is a valid BiGram for empty sequences.
 Use the
# fill in code to initialize numpy data structures and fill them with bigram counts
C = None
for s in allowed(normalize(terms(dm.train_sequences()))):
pass
C[index['immer'], index['mehr']]
def test_C():
assert C.shape == (len(index), len(index))
assert C.dtype == np.uint16
assert C[index[START_END], index[START_END]] == 2
test_C()
Exercise: BiGram Probability
Implement a function p(a, b)
, which returns the probability $P(b \mid a)$ for a given BiGram $a$, $b$. The probability of a BiGram, is defined as
where $I$ is the size of the index
, including the START_END
tag, and where $C$ is the BiGram count.

p(a, b)
must have access toC
and theindex
. Use a closure as demonstrated in Exercise: Allowed Terms.
 Raise exception if
a == END
.  Raise exception if
b == START
.
START = 'START'
END = 'END'
# fill in code to implement a closure p_func, which returns p
def p_func(C, index):
raise NotImplementedError()
# variables which are defined here, will be available in p
def p(a, b):
pass
return p
p = p_func(C, index)
p('immer', 'mehr')
def test_probability(p):
test_terms = set(allowed_terms)
test_terms.update([UNKNOWN, END])
assert_almost_equal(sum(p(START, term) for term in test_terms), 1.0)
assert_almost_equal(sum(p(UNKNOWN, term) for term in test_terms), 1.0)
assert_almost_equal(sum(p('immer', term) for term in test_terms), 1.0)
assert_almost_equal(sum(p('mehr', term) for term in test_terms), 1.0)
def exception_raised(a, b):
raised = False
try:
p(a, b)
except:
raised = True
return raised
assert exception_raised(END, UNKNOWN)
assert exception_raised(UNKNOWN, START)
test_probability(p)
Exercise: Sentence Probability
First implement a function sentence_probability(s, p)
which calculates the probability of a full sentence $P(S)$, by multiplying the probabilities of all BiGrams in the sequence.
 $v_0$ refers to
START
 $v_n$ refers to
END
 Parameter
s
is a sentence, withoutSTART
orEND
tags.  Parameter
p
takes a function to calculate BiGram probabilities, for examplep(a, b)
.
Then implement a second function sentence_log_probability(s, p)
, which calculates the probability of a full sentence $P(S)$, by adding the logarithmic probabilites of all BiGrams in sentence.
 Use np.log.
Hint: Adding in logarithmic space is equivalent to multiplying without logarithms (see example below), but prevents numerical instabilities due to floatingpoint arithmetic.
# fill in code to implement a function which return a sequence probabiliy
def sentence_probability(s, p):
raise NotImplementedError()
def sentence_log_probability(s, p):
raise NotImplementedError()
s1 = ['sie', 'nehmen', '.']
s2 = ['sie', 'nimmt', '.']
sentence_probability(s1, p), sentence_probability(s2, p)
sentence_log_probability(s1, p), sentence_log_probability(s2, p)
np.log(sentence_probability(s1, p)), sentence_log_probability(s1, p)
def test_sentence_probability(p):
for s in allowed(normalize(terms(dm.train_sequences()))):
break
assert_almost_equal(np.log(sentence_probability(s, p)), sentence_log_probability(s, p))
test_sentence_probability(p)
Exercise: Sampling
Implement a function sample(max_length=50)
, which returns a random sentence generated from the model.

sample(max_length=50)
must have access toC
and theindex
. Use a closure as demonstrated in Exercise: Allowed Terms.
 You can use np.random.choice to draw from a random distribution of the model.
 Sampled sentences should not include
START
orEND
tags.  Remember to use
START
BiGrams to determine the probabilities of possible sentence beginnings.  Return a sentence as soon as an
END
tag has been sampled or ifmax_length
is reached.
# fill in code to implement a closure sample_func, which returns sample
def sample_func(C, index):
raise NotImplementedError()
# variables which are defined here, will be available in sample
def sample(max_length=50):
pass
return sample
sample = sample_func(C, index)
for _ in range(10):
print(' '.join(sample()))
hier fleisch , seiner krönung der UNK .
der union UNK , und staat UNK trasse UNK bisher keine veränderung wurde die österreichische UNK ich UNK und der sich der landschaft ist man zu haben .
um eine UNK .
UNK UNK .
die UNK verstärkt nannte der italiener  bei UNK UNK anknüpfen .
von UNK .
das ist die polizei in der mit UNK und UNK , um UNK .
sieht die pro tag zu vermitteln .
so richtig gas " von seiten der frankfurter allgemeinen weniger unfälle , ohne menschlichen UNK politische heimat .
karriere , ein kunde schützen .
Exercise: Laplace Smoothing
Implement a function p_laplace(a, b)
, which returns the probability $\hat{P}(b \mid a)$ for a given BiGram $a$, $b$ with Laplace smoothing applied. Laplacing smoothing, also known as AddOne smoothing, is defined as
where $I$ is the size of the index
, including the START_END
tag, and where $C$ is the BiGram count.

p_laplace(a, b)
must have access toC
and theindex
. Use a closure as demonstrated in Exercise: Allowed Terms.
 The implementation is similar to the existing function
p(a, b)
and must pass the same tests.  Raise exception if
a == END
.  Raise exception if
b == START
.
# fill in code to implement a closure p_laplace_func, which returns p_laplace
def p_laplace_func(C, index):
raise NotImplementedError()
# variables which are defined here, will be available in sample
def p_laplace(a, b):
pass
return sample
p_laplace = p_laplace_func(C, index)
p_laplace('immer', 'mehr'), p('immer', 'mehr')
We can use p_laplace
as replacement for p
in our existing functions.
sentence_log_probability(s1, p), sentence_log_probability(s1, p_laplace)
We can also use p_laplace
in our existing tests.
test_probability(p_laplace)
test_sentence_probability(p_laplace)
Exercise: Perplexity
Implement a function perplexity(corpus)
, which returns the perplexity score of the model on a complete test corpus. Perplexity is defined as
where $T$ is a test corpus and $N$ is the number of all BiGrams in the corpus. The formula above gives you a clean mathematical definition at the top, and a numerically stable conversion using logarithmic probabilities at the bottom. For more intuition on perplexity watch Nlp  2.3  Evaluation and Perplexity by Daniel Jurafsky.
 Use the numerically stable formula at the bottom as a reference for your implementation.
 Use your exiting functions
sentence_log_probabilities
andp_laplace
for BiGram probabilities.  Use np.exp.
# use a test corpus, instead of the training corpus
for s in allowed(normalize(terms(dm.test_sequences()))):
break
s
# fill in your perplexity code in the function body
def perplexity(corpus):
# your code goes here
raise NotImplementedError()
perplexity(allowed(normalize(terms(dm.test_sequences()))))
Summary and Outlook
You have learned about Markov Models and how to implement a language model. With your acquired knowledge you can move on to Hidden Markov Models and apply them to tagging problems like Named Entity Recognition.
Literature
Licenses
Notebook License (CCBYSA 4.0)
The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g. images).
Text Information Extraction  Sequences  Exercise: BiGram Language Model
by Christoph Jansen and Christian Herta (deep.TEACHING  HTW Berlin)
is licensed under a Creative Commons AttributionShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.
Code License (MIT)
The following license only applies to code cells of the notebook.
Copyright 2018 Christoph Jansen, Christian Herta
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.