# Text Information Extraction - Sequences - Exercise: Bi-Gram Language Model

## Introduction

This notebook introduces Markov Models as a way to represent Bi-Gram language models. It is advised, but not strictly required, to learn about Bayesian Networks first (see Bayesian Networks by Example). Studying Bayesian Networks provides you with a basic understanding of probability theory and graphical models, which is useful in this context.

The comprehensive exercises in this notebook will guide you through the steps of implementing a Bi-Gram language model. Due to the general data sparsity in the Natural Language Processing (NLP) domain, the implementation requires various adaptions concerning numerical stability, which will be adressed accordingly.

As a data source, the GermEval2014 text corpus is used.

### Required Python Modules

# Python Standard Library
from collections import Counter

# External Modules
import networkx as nx
import numpy as np
from numpy.testing import assert_almost_equal
from IPython.display import HTML
from deep_teaching_commons.graphs.visualization import draw
from deep_teaching_commons.data.text.germ_eval_2014 import GermEval2014

### Required Data

# data manager for Germ Eval 2014 corpus
# default base_data_dir is ~/deep.TEACHING/data
base_data_dir = None
dm = GermEval2014(base_data_dir=base_data_dir)
auto download is active, attempting download

# only termns, no NER tags, are required in this exercise
def terms(sequences):
for sequence in sequences:
yield [term for term, _, _ in sequence]
# two sentences
for i, s in enumerate(terms(dm.train_sequences())):
if i == 2: break
print(s)
['Schartau', 'sagte', 'dem', '"', 'Tagesspiegel', '"', 'vom', 'Freitag', ',', 'Fischer', 'sei', '"', 'in', 'einer', 'Weise', 'aufgetreten', ',', 'die', 'alles', 'andere', 'als', 'überzeugend', 'war', '"', '.']
['Firmengründer', 'Wolf', 'Peter', 'Bree', 'arbeitete', 'Anfang', 'der', 'siebziger', 'Jahre', 'als', 'Möbelvertreter', ',', 'als', 'er', 'einen', 'fliegenden', 'Händler', 'aus', 'dem', 'Libanon', 'traf', '.']


## Markov Models

In one of the preliminary notebooks Bayesian Networks by Example, we have introduced graphical models. Each node in a Bayesian Network corresponds to a random variable, which can take on different values with a certain probability given the predecessors of the variable.

A Markov Model is very similar, but represents a sequence of states. For example $W_1, ..., W_n$ could represent a temporal sequence of weather states, where $W_1, ..., W_n$ are random variables. These random variables share a common probability table $W$. Since $W$ is shared, the length $n$ of a sequence $S_W$ can vary without adjusting the model. Possible values of W are defined as

$Val(W) = \{Sunny, Cloudy, Rainy\}.$

For example, the following graphical representation shows a weather model with 5 subsequent states.

def mm(sequence, x_offsets=None, observed=None):
x_offset = 0
x_offsets = x_offsets if x_offsets else [100 for _ in range(len(sequence)-1)]
observed = observed if observed else [False for _ in range(len(sequence))]

g = nx.DiGraph()

for i in range(len(sequence)-1):
j = i+1
x_offset += x_offsets[i]

return g

HTML(draw(mm(['W{}'.format(chr(0x2081 + i)) for i in range(5)])))

Assume we wanted to calculate the probability of the last weather event $P(W_5)$. By the rules of probability (see Bayesian Networks by Example) this would result in the following calculation.

\begin{aligned} P(W_5) &= \sum_{W_1}\sum_{W_2}\sum_{W_3}\sum_{W_4} P(W_1) \cdot P(W_2 \mid W_1) \cdot P(W_3 \mid W_2) \cdot P(W_4 \mid W_3) \cdot P(W_5 \mid W_4) \\ &= \sum_{W_1}\sum_{W_2}\sum_{W_3}\sum_{W_4} P(W_1, W_2, W_3, W_4, W_5) \end{aligned}

Instead of calculating the probability of a certain element $W_t \in {W_1, ..., W_n}$ of our sequence $S_W$, we can also use this model to calculate the probability of a complete sequence $P(S_W)$ with $n$ elements as follows:

$P(S_W) = \prod_{t=1}^{n} P(W_t)$

For every element $W_t$, a complex probability calculation $P(W_t)$, as shown for $P(W_5)$, would be required. Fortunately, by applying the Markov Assumption, we can reduce this calculation to a simple lookup in our probability table $W$.

### Markov Assumption

Usually, we observe a specific sequence of events. For example $W_1 = Sunny, W_2 = Sunny, W_3 = Cloudy, W_4 = Rainy$. This means that $W_5$ becomes independent of $W_1$, $W_2$ and $W_3$ given $W_4$.

$W_5 \perp W_1, W_2, W_3 \mid W_4 = Rainy$

Calculating $P(W_5)$ can therefore be reduced to

$P(W_5) = P(W_5 \mid W_4 = Rainy),$

which is only a table lookup. If you are familiar with Bayesian Networks, you can see that the Markov Assumption is a direct consequence of observing random variables.

The following graphical representation shows the observed variables.

HTML(draw(mm(
['W\u2081 = Sunny', 'W\u2082 = Sunny', 'W\u2083 = Cloudy', 'W\u2084 = Rainy', 'W\u2085'],
x_offsets=[160, 160, 160, 115],
observed=[True, True, True, True, False]
)))

We can now predict the next observed state in a sequence, by just looking at the current state. Asume, that we already know all transition probabilities from each possible state to the next state, we can express them in a simple table as follows.

index = {
'Sunny': 0,
'Cloudy': 1,
'Rainy': 2
}

W = np.array([
[0.8, 0.15, 0.05],
[0.2, 0.5, 0.3],
[0.2, 0.2, 0.6]
])

If it is $Sunny$ at a given point in a sequence, the probability that it will be $Cloudy$ afterwards can be expressed as $P(W_t = Cloudy \mid W_{t-1} = Sunny)$. The value of $P(W_t = Cloudy \mid W_{t-1} = Sunny)$ can be accessed in the probability table $W$ as follows.

W[index['Sunny'], index['Cloudy']]

### Markov Chains

As has been shown in the last section Markov Assumption, the we can greatly reduce the complexity of a Markov Model, when we are only dealing with observed variables. In this case we can also refer to our model as Markov Chain. Graphical representations of a Markov Chain differ from Bayesian Networks and Markov Models, because their nodes represent a specific state $Val(W) = \{Sunny, Cloudy, Rainy\}$ instead of a random variable $W_t$. The edges between nodes represent the transition probability from state to another, as shown in the example below.

g = nx.DiGraph()

for x in index:

for x in g.nodes:
for y in g.nodes:

HTML(draw(g))

## Language Models

A Bi-Gram language model can be represented as a Markov Model. Consider the following sentence s, which is contained in the GermEval2014 text corpus. The sentence is a sequence of terms.

for i, s in enumerate(terms(dm.train_sequences())):
if i == 2041:
break
s

Prepending a START tag and appending an END tag allows us to assign probabilities to a term being at the beginning or the end of sentence respectively.

sse = ['START'] + s + ['END']
sse
HTML(draw(mm(sse, x_offsets=[100, 118, 123, 116, 108, 85], observed=[True for _ in sse])))

Bi-Grams are all possible combinations of two subsequent words in the sentence. See the following code, for a list of all Bi-Grams.

list(zip(sse[:-1], sse[1:]))

In order to calculate probabilities for all possible terms and their successors, we have to count all Bi-Grams in the training corpus. Since the number of possible terms in a text corpus is large, the resulting transition matrix requires a lot of memory and is sparse.

raw_terms = set()

for s in terms(dm.train_sequences()):
raw_terms.update(s)

len(raw_terms)

The exercises in this notebook will guide through the implementation of a language model and how to apply some tricks to reduce sparsity.

## Pen-and-Paper Exercises

### Exercise: Terminology

What is the difference between Word, Token, Type and Term?

To answer the question, read the beginning of 2.2.1. Tokenization of "Introduction to information retrieval" [SCH08].

### Exercise: Language Models

Study the fundamentals of N-Gram language models in Chapter 4 (accessed online 05.04.2018) of "Speech and Language Processing" [JUR09].

Further Readings: Language Modeling (Lecture Notes for NLP) [COL13], accessed online 05.04.2018

Solve the following exercises in Chapter 4 [JUR09] (see p. 25 in the linked PDF).

• 4.1
• 4.2
• 4.3
• 4.4
• 4.5

## Programming Exercises

### Exercise: Normalization

Write a python function which normalizes the corpus as follows:

• convert all uppercase letters to lower case
• replace all digits by letter $D$
DIGIT = 'D'
# fill in your normalization code in the function body
def normalize(corpus):
for s in corpus:
normalized_sentence = None
raise NotImplementedError()
yield normalized_sentence
# two normalized sentences
for i, s in enumerate(normalize(terms(dm.train_sequences()))):
if i == 2: break
print(s)
['schartau', 'sagte', 'dem', '"', 'tagesspiegel', '"', 'vom', 'freitag', ',', 'fischer', 'sei', '"', 'in', 'einer', 'weise', 'aufgetreten', ',', 'die', 'alles', 'andere', 'als', 'überzeugend', 'war', '"', '.']
['firmengründer', 'wolf', 'peter', 'bree', 'arbeitete', 'anfang', 'der', 'siebziger', 'jahre', 'als', 'möbelvertreter', ',', 'als', 'er', 'einen', 'fliegenden', 'händler', 'aus', 'dem', 'libanon', 'traf', '.']

def test_normalize():
input_corpus = [['R2D2', 'ist', 'ein', 'Roboter', 'aus', 'dem', 'Jahr', '1977', '.']]
output_corpus = [['rDdD', 'ist', 'ein', 'roboter', 'aus', 'dem', 'jahr', 'DDDD', '.']]
normalized_corpus = list(normalize(input_corpus))
assert normalized_corpus == output_corpus

test_normalize()

### Exercise: Allowed Terms

Create a set of allowed terms. In this case we allow only the 8190 most common terms, to artificially reduce the size of our vocabulary. This is an important step to reduce the memory consumption of our Bi-Gram matrix, which we will construct in a later step.

• Create a Python set, containing only the most common terms.
• Hint: for the initial counting you could use the Counter class from collections.
NUM_ALLOWED = 8190
# fill in code to create a list containing allowed terms
allowed_terms = None
def test_allowed_terms():
assert isinstance(allowed_terms, set)
assert 'immer' in allowed_terms
assert 'mehr' in allowed_terms
assert 'schartau' not in allowed_terms
assert len(allowed_terms) == NUM_ALLOWED

test_allowed_terms()

The allowed terms are used in an additional filter function allowed which is applied as a generator. The filter function will replace every unknown term, not contained in the vocabulary by UNK.

UNKNOWN = 'UNK'
# we create a function, which filters infrequent terms
def allowed_func(allowed_terms):
# this is a closure
def allowed(corpus):
for s in corpus:
yield [term if term in allowed_terms else UNKNOWN for term in s]

# the returned function will have access to allowed_terms from its defintion context
return allowed
allowed = allowed_func(allowed_terms)
# filter infrequent terms from two normalized sentences
for i, s in enumerate(allowed(normalize(terms(dm.train_sequences())))):
if i == 2: break
print(s)
['UNK', 'sagte', 'dem', '"', 'tagesspiegel', '"', 'vom', 'freitag', ',', 'fischer', 'sei', '"', 'in', 'einer', 'weise', 'aufgetreten', ',', 'die', 'alles', 'andere', 'als', 'UNK', 'war', '"', '.']
['UNK', 'wolf', 'peter', 'UNK', 'arbeitete', 'anfang', 'der', 'siebziger', 'jahre', 'als', 'UNK', ',', 'als', 'er', 'einen', 'fliegenden', 'händler', 'aus', 'dem', 'libanon', 'traf', '.']


### Exercise: Index

Now that we have reduced the allowed vocabulary to a certain number of terms $N$, we can build an index. This index does no only include each $TERM$, but also special symbols $UNK$ and $START\_END$. We can use $START\_END$ as a shared index for $START$ and $END$, because $START$ can only be on the left side of a Bi-Gram and $END$ can only be on the right side of a Bi-Gram. Exercise: Count Bi-Grams will explain this in detail.

• Create Python dictionary index where terms and special symbols are keys, while assigned values are integers.
START_END = 'START_END'
# fill in code to create a dictionary containing a term index
index = None
len(index)
index['immer']
def test_index():
assert isinstance(index, dict)
assert len(index) == len(allowed_terms) + 2
assert UNKNOWN in index
assert START_END in index

for key, val in index.items():
assert isinstance(key, str)
assert isinstance(val, int)

assert list(sorted(index.values())) == list(range(len(allowed_terms) + 2))

test_index()

### Exercise: Count Bi-Grams

In this exercise you will create a numpy matrix, which stores the number of occurrences of each Bi-Gram from our training corpus. We can think of a Bi-Gram as a combination of two subsequent elements $a$ and $b$, where $a$ can be $START$, $UNK$ or a $TERM$ and where $b$ can be $END$, $UNK$ or a $TERM$. We can store a count for each possible combination of $a$ and $b$ in a matrix, by assigning the first dimension to $a$ and the second dimension to $b$.

The index, which lets us map $a$ or $b$ to a fixed position in the array, has already been defined. For $START$ we only need a row, but no column, while for $END$ we only need a column, but not row. Therefore the shared index $START\_END$ is a useful trick.

• Initialize a numpy array $C$, which stores Bi-Gram counts for all combinations of $a$ and $b$.

• Use the first dimension for $a$ and the second dimension for $b$.
• For lower memory consumption, use the numpy datatype (dtype) np.uint16, instead of np.int64.
• Fill the array with Bi-Gram counts from the training corpus.

• Use the index to refer to the correct indices in the matrix.
• START $\rightarrow$ END is a valid Bi-Gram for empty sequences.
# fill in code to initialize numpy data structures and fill them with bi-gram counts
C = None

for s in allowed(normalize(terms(dm.train_sequences()))):
pass
C[index['immer'], index['mehr']]
def test_C():
assert C.shape == (len(index), len(index))
assert C.dtype == np.uint16
assert C[index[START_END], index[START_END]] == 2

test_C()

### Exercise: Bi-Gram Probability

Implement a function p(a, b), which returns the probability $P(b \mid a)$ for a given Bi-Gram $a$, $b$. The probability of a Bi-Gram, is defined as

\begin{aligned} P(a, b) &= \frac{C(a, b)}{\sum\limits_{t \in I} C(a, t)} \end{aligned}

where $|I|$ is the size of the index, including the START_END tag, and where $C$ is the Bi-Gram count.

• p(a, b) must have access to C and the index.

• Raise exception if a == END.
• Raise exception if b == START.
START = 'START'
END = 'END'
# fill in code to implement a closure p_func, which returns p
def p_func(C, index):
raise NotImplementedError()
# variables which are defined here, will be available in p

def p(a, b):
pass

return p
p = p_func(C, index)
p('immer', 'mehr')
def test_probability(p):
test_terms = set(allowed_terms)
test_terms.update([UNKNOWN, END])

assert_almost_equal(sum(p(START, term) for term in test_terms), 1.0)
assert_almost_equal(sum(p(UNKNOWN, term) for term in test_terms), 1.0)
assert_almost_equal(sum(p('immer', term) for term in test_terms), 1.0)
assert_almost_equal(sum(p('mehr', term) for term in test_terms), 1.0)

def exception_raised(a, b):
raised = False
try:
p(a, b)
except:
raised = True

return raised

assert exception_raised(END, UNKNOWN)
assert exception_raised(UNKNOWN, START)

test_probability(p)

### Exercise: Sentence Probability

First implement a function sentence_probability(s, p) which calculates the probability of a full sentence $P(S)$, by multiplying the probabilities of all Bi-Grams in the sequence.

$P(S) = \prod_{t=1}^{n} P(v_t \mid v_{t-1})$
• $v_0$ refers to START
• $v_n$ refers to END
• Parameter s is a sentence, without START or END tags.
• Parameter p takes a function to calculate Bi-Gram probabilities, for example p(a, b).

Then implement a second function sentence_log_probability(s, p), which calculates the probability of a full sentence $P(S)$, by adding the logarithmic probabilites of all Bi-Grams in sentence.

$log(P(S)) = \sum_{t=1}^{n} log(P(v_t \mid v_{t-1}))$

Hint: Adding in logarithmic space is equivalent to multiplying without logarithms (see example below), but prevents numerical instabilities due to floating-point arithmetic.

$log(P(v_1 \mid START) \cdot P(v_2 \mid v_1) \cdot P(END \mid v_2)) = log(P(v_1 \mid START)) + log(P(v_2 \mid v_1)) + log(P(END \mid v_2))$
# fill in code to implement a function which return a sequence probabiliy
def sentence_probability(s, p):
raise NotImplementedError()

def sentence_log_probability(s, p):
raise NotImplementedError()
s1 = ['sie', 'nehmen', '.']
s2 = ['sie', 'nimmt', '.']
sentence_probability(s1, p), sentence_probability(s2, p)
sentence_log_probability(s1, p), sentence_log_probability(s2, p)
np.log(sentence_probability(s1, p)), sentence_log_probability(s1, p)
def test_sentence_probability(p):
for s in allowed(normalize(terms(dm.train_sequences()))):
break

assert_almost_equal(np.log(sentence_probability(s, p)), sentence_log_probability(s, p))

test_sentence_probability(p)

### Exercise: Sampling

Implement a function sample(max_length=50), which returns a random sentence generated from the model.

• sample(max_length=50) must have access to C and the index.

• You can use np.random.choice to draw from a random distribution of the model.
• Sampled sentences should not include START or END tags.
• Remember to use START Bi-Grams to determine the probabilities of possible sentence beginnings.
• Return a sentence as soon as an END tag has been sampled or if max_length is reached.
# fill in code to implement a closure sample_func, which returns sample
def sample_func(C, index):
raise NotImplementedError()
# variables which are defined here, will be available in sample

def sample(max_length=50):
pass

return sample
sample = sample_func(C, index)
for _ in range(10):
print(' '.join(sample()))
hier fleisch , seiner krönung der UNK .
der union UNK , und staat UNK trasse UNK bisher keine veränderung wurde die österreichische UNK ich UNK und der sich der landschaft ist man zu haben .
um eine UNK .
UNK UNK .
die UNK verstärkt nannte der italiener - bei UNK UNK anknüpfen .
von UNK .
das ist die polizei in der mit UNK und UNK , um UNK .
sieht die pro tag zu vermitteln .
so richtig gas " von seiten der frankfurter allgemeinen weniger unfälle , ohne menschlichen UNK politische heimat .
karriere , ein kunde schützen .


### Exercise: Laplace Smoothing

Implement a function p_laplace(a, b), which returns the probability $\hat{P}(b \mid a)$ for a given Bi-Gram $a$, $b$ with Laplace smoothing applied. Laplacing smoothing, also known as Add-One smoothing, is defined as

\begin{aligned} \hat{P}(a, b) &= \frac{C(a, b) + 1}{\sum\limits_{t \in I} (C(a, t) + 1)} \\ &= \frac{C(a, b) + 1}{\sum\limits_{t \in I} C(a, t) + |I|} \end{aligned}

where $|I|$ is the size of the index, including the START_END tag, and where $C$ is the Bi-Gram count.

• p_laplace(a, b) must have access to C and the index.

• The implementation is similar to the existing function p(a, b) and must pass the same tests.
• Raise exception if a == END.
• Raise exception if b == START.
# fill in code to implement a closure p_laplace_func, which returns p_laplace
def p_laplace_func(C, index):
raise NotImplementedError()
# variables which are defined here, will be available in sample

def p_laplace(a, b):
pass

return sample
p_laplace = p_laplace_func(C, index)
p_laplace('immer', 'mehr'), p('immer', 'mehr')

We can use p_laplace as replacement for p in our existing functions.

sentence_log_probability(s1, p), sentence_log_probability(s1, p_laplace)

We can also use p_laplace in our existing tests.

test_probability(p_laplace)
test_sentence_probability(p_laplace)

### Exercise: Perplexity

Implement a function perplexity(corpus), which returns the perplexity score of the model on a complete test corpus. Perplexity is defined as

\begin{aligned} Perplexity(T) &= \sqrt[N]{ \frac{1}{\prod\limits_{S \in T} P(S)} } \\ &= (\frac{1}{\prod\limits_{S \in T} P(S)})^{\frac{1}{N}} \\ &= exp(log((\frac{1}{\prod\limits_{S \in T} P(S)})^{\frac{1}{N}})) \\ &= exp(\frac{1}{N} \cdot log((\frac{1}{\prod\limits_{S \in T} P(S)}))) \\ &= exp(\frac{1}{N} \cdot (log(1) - log(\prod\limits_{S \in T} P(S)))) \\ &= exp(-\frac{1}{N} \cdot log(\prod\limits_{S \in T} P(S))) \\ &= exp(-\frac{1}{N} \cdot \sum\limits_{S \in T} log(P(S))) \end{aligned}

where $T$ is a test corpus and $N$ is the number of all Bi-Grams in the corpus. The formula above gives you a clean mathematical definition at the top, and a numerically stable conversion using logarithmic probabilities at the bottom. For more intuition on perplexity watch Nlp - 2.3 - Evaluation and Perplexity by Daniel Jurafsky.

• Use the numerically stable formula at the bottom as a reference for your implementation.
• Use your exiting functions sentence_log_probabilities and p_laplace for Bi-Gram probabilities.
• Use np.exp.
# use a test corpus, instead of the training corpus
for s in allowed(normalize(terms(dm.test_sequences()))):
break
s
# fill in your perplexity code in the function body
def perplexity(corpus):
raise NotImplementedError()
perplexity(allowed(normalize(terms(dm.test_sequences()))))

## Summary and Outlook

You have learned about Markov Models and how to implement a language model. With your acquired knowledge you can move on to Hidden Markov Models and apply them to tagging problems like Named Entity Recognition.

## Literature

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g. images).

Text Information Extraction - Sequences - Exercise: Bi-Gram Language Model
by Christoph Jansen and Christian Herta (deep.TEACHING - HTW Berlin)