# Text Information Extraction - Exercise - Linear Chain Conditional Random Fields

## Introduction

This notebook aims to provide a gentle introduction to Linear Chain Conditional Random Fields.

First, we'll introduce an example that uses two hidden states and short sequences of length 3. This allows us to

• model a CRF of manageable size
• find features and assign weights by hand
• calculate probabilities through brute force

Then, you'll implement more efficient algorithms that can handle longer sequences. Your findings from the initial approach can help you debug those algorithms.

At the end of the notebook you'll have a good grasp of the CRF model. We won't cover training or classifying with CRFs, but you'll have implemented a collection of algorithms that prepares you for those tasks.

## Requirements

### Knowledge

Linear Chain CRFs are an advanced method of sequence classification. If you're new to the topic, I recommend you start with the more common approach of Hidden Markov Models.

### Python Modules

import numpy as np

## Linear Chain CRFs

### Model of the conditional probability

We're interested in the conditional probability of the entire sequence of labels $z$ given the input sequence $x$ . In this notation there are $T$ positions (or time points) in the sequence.

$p(z_{1:T} \mid x_{1:T}, \vec \theta) = \frac{1}{Z}\prod_{t=1}^T F(z_{t-1}, z_t, x_{1:T}, t)$

$F(z_{t-1}, z_t, x_{1:T}, t)$ are the (unnormalized) factors with

• weight vector $\vec{\theta}$
• feature vector $\vec{f}$, which we get by applying a feature function to the input.

This gives

$p(z_{1:T} \mid x_{1:T}, \vec \theta) = \frac{1}{Z}\prod_{t=1}^T \exp\left (\vec \theta \cdot \vec f(z_{t-1}, z_t, x_{1:T}, t)\right)$

For normalisation we divide by the partition function $Z$ (German: Zustandssumme), which is the sum of the unnormalized factors over all possible label sequences $\vec z'$:

$Z = \sum_{\vec z'}\prod_{t=1}^T \exp\left (\vec \theta \cdot \vec f(z'_{t-1}, z'_t, x_{1:T}, t)\right)$
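As an illustrative sketch (not part of the exercises), the unnormalized part of this model can be written directly in Python. Here `feature_fn` is a hypothetical stand-in for the feature function $\vec f$; the label sequence `z` is assumed to include the start state at index 0:

```python
import numpy as np

def unnormalized_score(z, x, theta, feature_fn):
    """Product over t of exp(theta . f), computed as exp of the summed dot products.

    z includes the start state at index 0, so positions run t = 1 .. T.
    feature_fn(z_prev, z_curr, x, t) is assumed to return a NumPy vector.
    """
    total = 0.0
    for t in range(1, len(z)):
        total += theta @ feature_fn(z[t - 1], z[t], x, t)
    return np.exp(total)
```

Dividing this score by the sum of scores over all possible label sequences $\vec z'$ yields the conditional probability above.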

### Feature Functions

$f_j(z_{t-1}, z_t, x_{1:T}, t)$

The (indicator) features can depend on the previous hidden state, the current hidden state, all the observations and the time index.

We introduce a simple example for two purposes:

• for didactic reasons
• to have a simple example we can solve by enumeration, so we can test and debug our algorithms more easily.

### Simple Example

Let's assume we have a small spin system of 6 spins:

• Three hidden spins $s_t$ and three observed spins $x_t$ with $t \in\{1,2,3\}$.

• The possible spin states are: $val(s_t)$ = $val(x_t)$ = {UP, DOWN}.

• The first (hidden) spin $s_1$ sees a small magnetic field, so the energy between UP and DOWN is slightly different: $E(s_1=\text{UP})<E(s_1=\text{DOWN})$ .

• The hidden spins $s_1, s_2, s_3$ are coupled such that adjacent spins with equal state values have a lower contribution to the energy, e.g. $E(s_1=\text{UP}, s_2=\text{UP})<E(s_1=\text{UP}, s_2=\text{DOWN})$ .

• We observe 3 spins $\{x_1, x_2, x_3\}$. Each observed spin is coupled with exactly one hidden spin, e.g. $E(s_t=\text{UP}, x_t=\text{UP})<E(s_t=\text{UP}, x_t=\text{DOWN})$ .

i.e. the coupling coefficients (see below) $c_1, c_{sx}, c_{ss}$ are all positive.

### CRF

Graphical representation of a linear chain CRF with two possible types of factors $F(z_{t-1}, z_t, t)$ and $F(z_t, x_t)$ and $t \in \{1,2,3\}$. The probability of a state of the system depends on its total energy $E$ and is given by the Boltzmann factor: $P(E) = \exp \left( - \frac{E}{kT}\right)$

We set $kT = 1$ for simplicity:

$P(E) = \exp \left( - E \right)$

The higher the energy, the (exponentially) lower the probability of a state.

### Negative energy of the system

with

• DOWN $s_i=-1$ and
• UP $s_i= +1$

$-E(s_{1:T}, x_{1:T}) = c_1 s_1 + c_{sx} \sum_{t=1}^3 s_t x_t + c_{ss} \sum_{t=2}^3 s_{t-1} s_t$

with

• $s_{1:T} = (s_1, s_2, s_3)$ (hidden spins)
• $x_{1:T} = (x_1, x_2, x_3)$ (observed spins)

here $T=3$ .

The total energy of the system is a sum of local contributions.

\begin{align} p(s_{1:T}, x_{1:T}) & = \frac{1}{Z} \exp\left(-E(s_{1:T}, x_{1:T})\right) \\ & = \frac{1}{Z} \exp\left( c_1 s_1+ c_{sx} \sum_{t=1}^3 s_t x_t + c_{ss} \sum_{t=2}^3 s_{t-1} s_t\right) \end{align}

with the partition function $Z$ (Zustandssumme) for the normalization:

$Z = \sum_{s'_{1:T}, x'_{1:T}} \exp \left(-E(s'_{1:T}, x'_{1:T}) \right)$

We get the conditional probability just by using another $Z$ : \begin{align} p(s_{1:T} \mid x_{1:T}) & = \frac{1}{Z(x_{1:T})} \exp\left(-E(s_{1:T}, x_{1:T})\right) \\ & = \frac{1}{Z(x_{1:T})} \exp\left( c_1 s_1+ c_{sx}\sum_{t=1}^3 s_t x_t + c_{ss} \sum_{t=2}^3 s_{t-1} s_t\right) \end{align}

Here the partition function $Z$ is given by the sum over all possible hidden states $s'_{1:T}$ :

$Z(x_{1:T})= \sum_{s'_{1:T}} \exp \left(-E(s'_{1:T}, x_{1:T}) \right)$
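For this small system the sum is cheap to enumerate: with $T = 3$ binary hidden spins there are only $2^3 = 8$ terms. A quick sketch of the enumeration (using `itertools`; the spin values $\pm 1$ follow the convention above):

```python
from itertools import product

# All 2**3 hidden-spin configurations that Z(x) sums over
hidden_configs = list(product([+1, -1], repeat=3))
print(len(hidden_configs))  # 8
```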

## Feature function templates

As for HMMs we introduce a special start state $\text{BOS}$ (beginning of sequence) for a probability factor that depends only on the first state ($t=1$).

We write $z_t$ for the random variable of the hidden states with $z_t \in \{\text{BOS, DOWN, UP}\}$

So: $z_0 = \text{BOS}$

We introduce general feature function templates of the following form: $f_k(z_{t-1}, z_t, \vec x, t)$. Here we use only indicator functions ($\mathbb 1$) as features (as for MEMMs). There are four different templates of features:

• Start-Feature Template: $f_i(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = v_n\land z_t=v_m)$
• Bias-Feature Template: $f_j(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = v_n \land z_t=v_m)$
• Observation=DOWN Template: $f_k(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = v_n\land z_t=v_m)$
• Observation=UP Template: $f_l(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = v_n\land z_t=v_m)$

Here each feature template has $3 \cdot 3$ different features (3 possible hidden states for each of $z_{t-1}$ and $z_t$). In total we have $4\cdot3\cdot3 = 36$ features.

Note: The hidden states $z_{t-1}$ and $z_t$ are arguments of the feature functions. With the possible values for the hidden states: $v_n\in \{ \text{BOS, DOWN, UP}\}$

Note: The Start-Feature is not necessary (with our BOS state we could handle the start in our bias feature).

Note that some of the features are never used. But to keep the notation clear and to generate the features (from the feature templates) automatically we use this structure. In a more efficient implementation we would eliminate never used features.
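To make the bookkeeping concrete, the $4 \cdot 3 \cdot 3$ features can be generated mechanically from the four templates. This is only an illustrative sketch, not part of the exercises: the helper `make_indicator` and the convention that `t` is 1-based (so `x[t - 1]` is the observation at position `t`) are assumptions of this snippet.

```python
from itertools import product

STATES = ["BOS", "DOWN", "UP"]  # possible values for z_{t-1} and z_t

def make_indicator(gate, v_prev, v_curr):
    # gate(z_prev, z_curr, x, t) encodes the template-specific condition
    def f(z_prev, z_curr, x, t):
        return 1.0 if (gate(z_prev, z_curr, x, t)
                       and z_prev == v_prev and z_curr == v_curr) else 0.0
    return f

templates = [
    lambda zp, zc, x, t: t == 1,              # start feature
    lambda zp, zc, x, t: True,                # bias feature
    lambda zp, zc, x, t: x[t - 1] == "DOWN",  # observation = DOWN
    lambda zp, zc, x, t: x[t - 1] == "UP",    # observation = UP
]

features = [make_indicator(g, vp, vc)
            for g in templates
            for vp, vc in product(STATES, repeat=2)]
print(len(features))  # 36
```

Generating the features from templates like this is what keeps the notation uniform, at the cost of carrying some never-used features (e.g. any feature with $z_t = \text{BOS}$).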

## Exercises

### Exercise: Weight vector $\vec \theta$

We put all $3\cdot 3\cdot 4 = 36$ features in a feature vector $\vec f(z_{t-1}, z_t, x_{1:T}, t)$ .

If we compare our CRF Model \begin{align} p(z_{1:T} \mid x_{1:T}, \vec \theta) & = \frac{1}{Z(x_{1:T})}\prod_{t=1}^T \exp\left (\vec \theta \cdot \vec f(z_{t-1}, z_t, x_{1:T}, t)\right)\\ & = \frac{1}{Z(x_{1:T})} \exp\left( \sum_{t=1}^T (\vec \theta \cdot \vec f(z_{t-1}, z_t, x_{1:T}, t))\right) \end{align}

with

\begin{align} p(s_{1:T} \mid x_{1:T}) & = \frac{1}{Z(x_{1:T})} \exp\left(-E(s_{1:T} , x_{1:T})\right) \\ & = \frac{1}{Z(x_{1:T})} \exp\left( c_1 s_1 + c_{sx} \sum_{t=1}^3 s_t x_t + c_{ss} \sum_{t=2}^3 s_{t-1} s_t\right) \end{align}

we can construct the parameter vector $\vec \theta$ by hand.

What are the corresponding weights $\theta_i$ for the features? Express the weights in terms of the coupling constants $c_1, c_{sx}, c_{ss}$.

If a feature is never used, set the corresponding weight to 0.

Start-Features:
- $f_0(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = START \land z_t=START)$
- $f_1(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = START \land z_t=DOWN)$
- $f_2(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = START \land z_t=UP)$
- $f_3(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = DOWN \land z_t=START)$
- $f_4(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = DOWN \land z_t=DOWN)$
- $f_5(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = DOWN \land z_t=UP)$
- $f_6(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = UP \land z_t=START)$
- $f_7(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = UP \land z_t=DOWN)$
- $f_8(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = UP \land z_t=UP)$

Corresponding weights:
- $\theta_0 =$
- $\theta_1 =$
- $\theta_2 =$
- $\theta_3 =$
- $\theta_4 =$
- $\theta_5 =$
- $\theta_6 =$
- $\theta_7 =$
- $\theta_8 =$

Bias-Features:
- $f_9(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = START \land z_t=START)$
- $f_{10}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = START \land z_t=DOWN)$
- $f_{11}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = START \land z_t=UP)$
- $f_{12}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = DOWN \land z_t=START)$
- $f_{13}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = DOWN \land z_t=DOWN)$
- $f_{14}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = DOWN \land z_t=UP)$
- $f_{15}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = UP \land z_t=START)$
- $f_{16}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = UP \land z_t=DOWN)$
- $f_{17}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = UP \land z_t=UP)$

Corresponding weights:
- $\theta_9 =$
- $\theta_{10} =$
- $\theta_{11} =$
- $\theta_{12} =$
- $\theta_{13} =$
- $\theta_{14} =$
- $\theta_{15} =$
- $\theta_{16} =$
- $\theta_{17} =$

Observation=DOWN Features:
- $f_{18}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = START \land z_t=START)$
- $f_{19}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = START \land z_t=DOWN)$
- $f_{20}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = START \land z_t=UP)$
- $f_{21}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = DOWN \land z_t=START)$
- $f_{22}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = DOWN \land z_t=DOWN)$
- $f_{23}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = DOWN \land z_t=UP)$
- $f_{24}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = UP \land z_t=START)$
- $f_{25}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = UP \land z_t=DOWN)$
- $f_{26}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = UP \land z_t=UP)$

Corresponding weights:
- $\theta_{18} =$
- $\theta_{19} =$
- $\theta_{20} =$
- $\theta_{21} =$
- $\theta_{22} =$
- $\theta_{23} =$
- $\theta_{24} =$
- $\theta_{25} =$
- $\theta_{26} =$

Observation=UP Features:
- $f_{27}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = START \land z_t=START)$
- $f_{28}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = START \land z_t=DOWN)$
- $f_{29}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = START \land z_t=UP)$
- $f_{30}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = DOWN \land z_t=START)$
- $f_{31}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = DOWN \land z_t=DOWN)$
- $f_{32}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = DOWN \land z_t=UP)$
- $f_{33}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = UP \land z_t=START)$
- $f_{34}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = UP \land z_t=DOWN)$
- $f_{35}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = UP \land z_t=UP)$

Corresponding weights:
- $\theta_{27} =$
- $\theta_{28} =$
- $\theta_{29} =$
- $\theta_{30} =$
- $\theta_{31} =$
- $\theta_{32} =$
- $\theta_{33} =$
- $\theta_{34} =$
- $\theta_{35} =$

### Setup for programming exercises

This snippet defines a few useful constants regarding each state and their respective energy contributions.

# Names for each state
UP = "UP"
DOWN = "DOWN"
START = "<BOS>"

# Energy contribution from the state
E_UP = 1.
E_DOWN = -1.
E_START = 1.5

# Map name of the state to its energy contribution
STATE_DICT = {
    START: E_START,
    DOWN: E_DOWN,
    UP: E_UP,
}

# States without START
INNER_STATES = [UP, DOWN]

# Length of the example sequence
LENGTH = 3

# Values of the coupling constants
smn = 5 # influence of the first spin energy due to outer magnetic field
c_1 = 1/smn

c_ss = .5
c_sx = .75

### Exercise: Total negative energy

Implement a Python function to calculate the total negative energy of a spin system. In Negative energy of the system we defined this value as the sum of its local contributions:

$-E(s_{1:T}, x_{1:T}) = c_1 s_1 + c_{sx} \sum_{t=1}^3 s_t x_t + c_{ss} \sum_{t=2}^3 s_{t-1} s_t$
def negative_sequence_energy(s,x):
    # s : hidden spins
    # x : observed spins
    # returns : the total negative energy of the sequence -E(s,x)
    raise NotImplementedError()

s = [UP,UP,UP]
x = [UP,UP,UP]
np.testing.assert_almost_equal(negative_sequence_energy(s,x), 3.45)
s = [DOWN,DOWN,UP]
x = [UP,UP,DOWN]
np.testing.assert_almost_equal(negative_sequence_energy(s,x), -2.45)
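The first test value can be verified by hand. For the all-UP configuration every spin is $+1$, so the three local contributions reduce to simple sums (a quick arithmetic check, not a replacement for your implementation):

```python
c_1, c_sx, c_ss = 0.2, 0.75, 0.5  # coupling constants from the setup cell

# -E = c_1 * s_1 + c_sx * (s_1 x_1 + s_2 x_2 + s_3 x_3) + c_ss * (s_1 s_2 + s_2 s_3)
neg_E = c_1 * 1 + c_sx * (1 + 1 + 1) + c_ss * (1 + 1)
print(neg_E)  # 3.45
```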

### Exercise: Probabilities for the Spin System

Implement a python function to compute the probability by (brute force) enumeration of the Spin-System for $p(s_{1:T}, x_{1:T})$

def prob(s,x):
    # Returns the probability of the hidden spins s and observed spins x.
    # Don't confuse this with the conditional probability p(s|x)!
    raise NotImplementedError()

s = [UP,UP,UP]
x = [UP,UP,UP]
np.testing.assert_almost_equal(prob(s, x), 0.1748584549)
s = [DOWN,DOWN,DOWN]
x = [DOWN,DOWN,DOWN]
np.testing.assert_almost_equal(prob(s, x), 0.117211127549)
s = [DOWN,UP,DOWN]
x = [UP,DOWN,UP]
np.testing.assert_almost_equal(prob(s, x), 0.0001762198)

### Exercise: Conditional Probability

Implement a function cond_prob(s, x) which computes the probability of a given hidden sequence conditioned on an observed sequence, i.e. $p(s_{1:T} \mid x_{1:T})$

How many terms must we add to compute $Z(x_{1:T})$ (for $p(s_{1:T} \mid x_{1:T})$ )?

• For our Spin System
• In general, with $|\mathcal S|$ possible hidden states (not counting the start state)
def cond_prob(s,x):
    # s : sequence of hidden spins
    # x : sequence of observed spins
    # returns p(s|x) the conditional probability of hidden spins given observed spins
    raise NotImplementedError()

s = [UP,UP,DOWN]
x = [UP,UP,DOWN]
np.testing.assert_approx_equal(cond_prob(s, x), 0.49151220231149634)
s = [UP,UP,UP]
x = [UP,UP,UP]
cp = cond_prob(s, x)
np.testing.assert_approx_equal(cp, 0.826540721870722)

### Exercise: Data

Note: It's not possible to sample from a CRF that models only $p(z_{1:T}\mid x_{1:T})$: it's a discriminative model, not a generative one.

But here we can sample, because we also have a model of the joint $p(z_{1:T}, x_{1:T})$

• Sample data by "naive brute force" sampling, i.e. enumerate all possible sequences and compute the probability of each sequence. Then you can sample by np.random.choice(a=possible_seq,size=size,p=pp)
• Explain why (brute force) enumeration doesn't work with larger sequences.
def sample(size):
    # size: int, number of requested samples
    # returns: list of (s,x) tuples
    # where each s is a hidden sequence and each x is its observation sequence
    raise NotImplementedError()
sample(4)
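A quick count shows why brute-force enumeration collapses for longer sequences: for binary hidden and observed spins, the number of joint $(s, x)$ configurations grows as $2^T \cdot 2^T = 4^T$.

```python
# Joint configurations for binary hidden and observed spins of length T
for T in (3, 10, 20, 30):
    print(T, 4 ** T)
```

Already at $T = 30$ there are more than $10^{18}$ configurations to enumerate, which is far beyond what we can sum over explicitly.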

### Exercise: Dynamic Programming for the partition function

The following exercises are based on the course on Neural Networks by Hugo Larochelle of Université de Sherbrooke. Weeks 3 and 4 focus on Conditional Random Fields.

So far, we've calculated the partition function with a brute force approach that enumerates all possible sequences. Now we want to move on to a dynamic programming algorithm, which allows us to work with longer sequences.

Study Larochelle's lecture Neural networks [3.3] : Conditional random fields - context window (timestamp 8:59 onwards). It introduces the notation of unary log-factors, which depend on the current state, and pairwise log-factors, which depend on the current and the next state. Which of the features from our spin system example are unary and which are pairwise log-factors?

• start feature
• bias feature
• observation-up feature
• observation-down feature

In other words, does the local energy contribution by this feature depend on the state of one hidden spin, or two?

### Exercise: Brute force comparison

Implement a function that computes the partition function for a given input sequence $x$ through brute force, like you did in the initial example. We can use the value of the partition function returned by the naive approach to verify the results of the more complex algorithms we're about to implement.

def partition_fn(x):
    raise NotImplementedError()

np.testing.assert_almost_equal(partition_fn([UP,UP,UP]), 38.111119603947195)
np.testing.assert_almost_equal(partition_fn([UP,DOWN,UP]), 17.65747233953043)
np.testing.assert_almost_equal(partition_fn([DOWN,DOWN,DOWN]), 27.597811808118507)

### Exercise: Alpha values

Study Hugo Larochelle's lecture Neural networks [3.4] : Conditional random fields - computing the partition function. In the lecture, he covers the dynamic program to compute the partition function using alpha/beta tables and wraps up with notes on numeric stability.

Implement a function to compute the alpha values. It should return both the alpha table (which will be useful later) and the partition function.
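As a reminder before you implement it, the forward recursion from the lecture can be restated in this notebook's notation (this is a restatement, not an additional requirement):

\begin{align} \alpha_1(z) & = \exp\left(\vec \theta \cdot \vec f(\text{BOS}, z, x_{1:T}, 1)\right) \\ \alpha_t(z) & = \sum_{z'} \alpha_{t-1}(z') \exp\left(\vec \theta \cdot \vec f(z', z, x_{1:T}, t)\right) \qquad t = 2, \dots, T \\ Z(x_{1:T}) & = \sum_{z} \alpha_T(z) \end{align}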

def alpha_table(x):
    raise NotImplementedError()
    return alpha, Z

x = [UP, UP, UP]
naive_Z = partition_fn(x)
alpha, alpha_Z = alpha_table(x)
np.testing.assert_almost_equal(naive_Z, alpha_Z)

### Exercise: Beta values

Implement a function to compute the beta values. It should return both the beta table (which will be useful later) and the partition function. We can verify our algorithms by

• checking if the alpha and beta values return the same value for the partition function
• checking if the beta function returns the same partition function as the brute force approach
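Analogously to the forward recursion, the backward recursion can be restated in this notebook's notation (again only a reminder):

\begin{align} \beta_T(z) & = 1 \\ \beta_t(z) & = \sum_{z'} \exp\left(\vec \theta \cdot \vec f(z, z', x_{1:T}, t+1)\right) \beta_{t+1}(z') \qquad t = T-1, \dots, 1 \\ Z(x_{1:T}) & = \sum_{z} \exp\left(\vec \theta \cdot \vec f(\text{BOS}, z, x_{1:T}, 1)\right) \beta_1(z) \end{align}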
def beta_table(x):
    raise NotImplementedError()
    return beta, Z

x = [UP, UP, UP]
naive_Z = partition_fn(x)
alpha, alpha_Z = alpha_table(x)
beta, beta_Z = beta_table(x)
np.testing.assert_almost_equal(naive_Z, beta_Z)
np.testing.assert_almost_equal(beta_Z, alpha_Z)

## Summary and Outlook

In this notebook, you have

• modeled a spin system as a linear chain CRF
• engineered feature functions to transform input into feature vectors
• assigned weights to features by hand
• implemented brute force algorithms for the probability and conditional probability of sequences
• implemented dynamic programming algorithms to calculate the alpha values, beta values and the partition function

In the next step, you can use your alpha and beta algorithms to compute the marginals, i.e. the likelihood of a label at a given position. Then you can turn to the problem of decoding, i.e. finding the best label sequence for a given input sequence.

This notebook ends here, but if you're motivated to continue working with CRFs, I recommend you check out the playlist on Neural Networks by Hugo Larochelle.


### Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does not, however, apply to any referenced external media (e.g., images).

Exercise - Linear Chain Conditional Random Fields
by Christian Herta, Diyar Oktay