Text Information Extraction - Exercise - Linear Chain Conditional Random Fields
Table of Contents
Introduction
This notebook aims to provide a gentle introduction to Linear Chain Conditional Random Fields.
First, we'll introduce an example that uses two hidden states and short sequences of length 3. This allows us to
 model a CRF with a manageable size
 find features and assign weights by hand
 calculate probabilities through brute force
Then, you'll implement more efficient algorithms that can handle longer sequences. Your findings from the initial approach can help you debug those algorithms.
At the end of the notebook you'll have a good grasp of the CRF model. We won't be dealing with training/classifying CRFs, but you'll have implemented a collection of algorithms to prepare you for the task.
Requirements
Knowledge
Linear Chain CRFs are an advanced method of sequence classification. If you're new to the topic, I recommend you start with the more common approach of Hidden Markov Models.
Python Modules
import numpy as np
Data
Linear Chain CRFs
Model of the conditional probability
We're interested in the conditional probability of the entire sequence of labels $ z $ given the input sequence $ x $ . In this notation there are $ T $ positions (or time points) in the sequence.
$ p(z_{1:T} \mid x_{1:T}, \vec \theta) = \frac{1}{Z}\prod_{t=1}^T F(z_{t-1}, z_t, x_{1:T}, t) $ $ F(z_{t-1}, z_t, x_{1:T}, t) $ are the (unnormalized) factors with
 weight vector $ \vec{\theta} $
 feature vector $ \vec{f} $ , which we get by applying a feature function to the input: $ = \frac{1}{Z}\prod_{t=1}^T \exp\left (\vec \theta \cdot \vec f(z_{t-1}, z_t, x_{1:T}, t)\right) $
For normalisation we divide by the partition function $ Z $ (Zustandssumme), which is the sum of unnormalized factors over all possible sequences $ \vec z' $ : $ Z = \sum_{\vec z'}\prod_{t=1}^T \exp\left (\vec \theta \cdot \vec f(z'_{t-1}, z'_t, x_{1:T}, t)\right) $
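To make the role of $ Z $ concrete, here is a minimal brute-force sketch. The states, the length, and the `score` function (a stand-in for $ \vec \theta \cdot \vec f $) are all made up for illustration:

```python
import itertools
import numpy as np

# Toy stand-in for the model above (all numbers hypothetical): 2 hidden
# states, T = 3, and score(...) playing the role of theta . f(z_{t-1}, z_t, x, t).
STATES = [0, 1]
T = 3

def score(z_prev, z_t, x, t):
    # Hypothetical log-factor: rewards staying in the same state and
    # matching the observation. (z_prev at t=0 is fixed to a dummy start state 0.)
    return 0.5 * (z_prev == z_t) + 0.3 * (z_t == x[t])

def partition(x):
    # Z = sum over all |S|^T label sequences of the product of exp-factors
    Z = 0.0
    for z in itertools.product(STATES, repeat=T):
        log_score = sum(score(z[t - 1] if t > 0 else 0, z[t], x, t)
                        for t in range(T))
        Z += np.exp(log_score)
    return Z

print(partition([1, 0, 1]))
```

Since the sum runs over all $ |S|^T $ label sequences, this only works for tiny examples; the dynamic programming exercises later in the notebook address exactly this.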
Feature Functions
$ f_j(z_{t-1}, z_t, x_{1:T}, t) $
The (indicator) features can depend on the previous hidden state, the current hidden state, all the observations and the time index.
We introduce a simple example for two purposes:
 for didactic reasons
 to have a simple example we can solve by enumeration. So we can test and debug our algorithms much better.
Simple Example
Let's assume we have a small spin system of 6 spins:

Three hidden spins $ s_t $ and three observed spins $ x_t $, with $ t \in\{1,2,3\} $.

The possible spin states are: $ val(s_t) $ = $ val(x_t) $ = {UP, DOWN}.

The first (hidden) spin $ s_1 $ sees a small magnetic field, so the energy between UP and DOWN is slightly different: $ E(s_1=\text{UP})<E(s_1=\text{DOWN}) $ .

The hidden spins $ s_1, s_2, s_3 $ are coupled such that adjacent spins with equal state values have a lower contribution to the energy, e.g. $ E(s_1=\text{UP}, s_2=\text{UP})<E(s_1=\text{UP}, s_2=\text{DOWN}) $ .

We observe 3 spins $ {x_1, x_2, x_3} $ . Each observed spin is coupled with exactly one hidden spin. e.g. $ E(s_t=\text{UP}, x_t=\text{UP})<E(s_t=\text{UP}, x_t=\text{DOWN}) $ .
I.e., the coupling coefficients (see below) $ c_1, c_{sx}, c_{ss} $ are all positive.
CRF
Graphical representation of a linear chain CRF with two possible types of factors $ F(z_{t-1}, z_t, t) $ and $ F(z_t, x_t) $ and $ t \in \{1,2,3\} $ .
The probability of the state of the system depends on its total energy $ E $ and is given by the Boltzmann factor: $ P(E) = \exp \left( - \frac{E}{kT}\right) $
We set $ kT = 1 $ for simplicity:
$ P(E) = \exp \left( - E \right) $
The higher the energy, the (exponentially) lower is the probability of a state.
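A two-state toy computation (with hypothetical energies) makes this concrete: the probability ratio of two states is $ \exp(E_2 - E_1) $.

```python
import numpy as np

# Two hypothetical energies (kT = 1); the higher-energy state is
# exponentially less probable under the Boltzmann factor P(E) = exp(-E).
E_low, E_high = 1.0, 2.0
weights = np.exp([-E_low, -E_high])
p_low, p_high = weights / weights.sum()
print(p_low / p_high)  # exp(E_high - E_low) = e, about 2.718
```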
Negative energy of the system
with
 DOWN $ s_i = -1 $ and
 UP $ s_i = +1 $
$ E(s_{1:T}, x_{1:T}) = c_1 s_1 + c_{sx} \sum_{t=1}^3 s_t x_t + c_{ss} \sum_{t=2}^3 s_{t-1} s_t $
with  $ s_{1:T} = (s_1, s_2, s_3) $ (hidden spins) and $ x_{1:T} = (x_1, x_2, x_3) $ (observed spins); here $ T=3 $.
The total energy of the system is a sum of local contributions.
$ \begin{align} p(s_{1:T}, x_{1:T}) & = \frac{1}{Z} \exp\left(E(s_{1:T}, x_{1:T})\right) \\ & = \frac{1}{Z} \exp\left( c_1 s_1+ c_{sx} \sum_{t=1}^3 s_t x_t + c_{ss} \sum_{t=2}^3 s_{t-1} s_t\right) \end{align} $
with the partition function $ Z $ (Zustandssumme) for the normalization: $ Z = \sum_{s'_{1:T}, x'_{1:T}} \exp \left(E(s'_{1:T}, x'_{1:T}) \right) $
We get the conditional probability just by using another $ Z $ : $ \begin{align} p(s_{1:T} \mid x_{1:T}) & = \frac{1}{Z(x_{1:T})} \exp\left(E(s_{1:T}, x_{1:T})\right) \\ & = \frac{1}{Z(x_{1:T})} \exp\left( c_1 s_1+ c_{sx}\sum_{t=1}^3 s_t x_t + c_{ss} \sum_{t=2}^3 s_{t-1} s_t\right) \end{align} $
Here the partition function $ Z $ is given by the sum over all possible hidden states $ s'_{1:T} $ :
$ Z(x_{1:T})= \sum_{s'_{1:T}} \exp \left(E(s'_{1:T}, x_{1:T}) \right) $
Exercises
Feature function templates
As for HMMs, we introduce a special start state $ \text{BOS} $ (beginning of sequence) for a probability factor which depends only on the first state ($ t=1 $).
We write $ z_t $ for the random variable of the hidden states with $ z_t \in \{\text{BOS, DOWN, UP}\} $
So: $ z_0 = \text{BOS} $
We introduce general feature function templates of the following form: $ f_k(z_{t-1}, z_t, \vec x, t) $ . Here we use only indicator functions ($ \mathbb 1 $) as features (as for MEMMs). There are four different templates of features:
 Start-Feature Template: $ f_i(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = v_n \land z_t=v_m) $
 Bias-Feature Template: $ f_j(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = v_n \land z_t=v_m) $
 Observation=DOWN Template: $ f_k(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = v_n \land z_t=v_m) $
 Observation=UP Template: $ f_l(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = v_n \land z_t=v_m) $
Here each feature template has $ 3 \cdot 3 $ different features (3 possible hidden states for each of $ z_{t-1} $ and $ z_t $). In total we have $ 4\cdot3\cdot3 = 36 $ features.
Note: The hidden states $ z_{t-1} $ and $ z_t $ are arguments of the feature functions, with the possible values for the hidden states: $ v_n, v_m \in \{ \text{BOS, DOWN, UP}\} $
Note: The Start-Feature is not strictly necessary (with our BOS state we could handle the start in our bias feature).
Note that some of the features are never used. But to keep the notation clear and to generate the features (from the feature templates) automatically we use this structure. In a more efficient implementation we would eliminate never used features.
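The automatic generation of features from the templates can be sketched in a few lines; the template names here are just labels chosen for this illustration:

```python
import itertools

# Enumerate all (template, z_prev, z_t) combinations; each one is a feature,
# mirroring the hand-written lists f_0 ... f_35.
STATES = ["BOS", "DOWN", "UP"]
TEMPLATES = ["start", "bias", "obs=DOWN", "obs=UP"]

features = [(template, z_prev, z_t)
            for template in TEMPLATES
            for z_prev, z_t in itertools.product(STATES, repeat=2)]

print(len(features))  # 4 * 3 * 3 = 36
```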
Exercises
Exercise: Weight vector $ \vec \theta $
We put all $ 3\cdot 3\cdot 4 $ features in a feature vector $ \vec f(z_{t-1}, z_t, x_{1:T}, t) $ .
If we compare our CRF Model $ \begin{align} p(z_{1:T} \mid x_{1:T}, \vec \theta) & = \frac{1}{Z(x_{1:T})}\prod_{t=1}^T \exp\left (\vec \theta \cdot \vec f(z_{t-1}, z_t, x_{1:T}, t)\right)\\ & = \frac{1}{Z(x_{1:T})} \exp\left( \sum_{t=1}^T (\vec \theta \cdot \vec f(z_{t-1}, z_t, x_{1:T}, t))\right) \end{align} $
with
$ \begin{align} p(s_{1:T} \mid x_{1:T}) & = \frac{1}{Z(x_{1:T})} \exp\left(E(s_{1:T} , x_{1:T})\right) \\ & = \frac{1}{Z(x_{1:T})} \exp\left( c_1 s_1 + c_{sx} \sum_{t=1}^3 s_t x_t + c_{ss} \sum_{t=2}^3 s_{t-1} s_t\right) \end{align} $
we can construct the parameter vector $ \vec \theta $ by hand.
What are the corresponding weights $ \theta_i $ for the features? Relate the weights to the coupling constants $ c_1, c_{sx}, c_{ss} $ .
If a feature is never used, set the corresponding weight to 0.
StartFeatures:
 $ f_0(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = START \land z_t=START) $
 $ f_1(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = START \land z_t=DOWN) $
 $ f_2(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = START \land z_t=UP) $
 $ f_3(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = DOWN \land z_t=START) $
 $ f_4(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = DOWN \land z_t=DOWN) $
 $ f_5(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = DOWN \land z_t=UP) $
 $ f_6(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = UP \land z_t=START) $
 $ f_7(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = UP \land z_t=DOWN) $
 $ f_8(z_{t-1}, z_t, \vec x, t) = \mathbb 1(t=1 \land z_{t-1} = UP \land z_t=UP) $
Corresponding weights:  $ \theta_0 = $  $ \theta_1 = $  $ \theta_2 = $  $ \theta_3 = $  $ \theta_4 = $  $ \theta_5 = $  $ \theta_6 = $  $ \theta_7 = $  $ \theta_8 = $
BiasFeatures:
 $ f_9(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = START \land z_t=START) $
 $ f_{10}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = START \land z_t=DOWN) $
 $ f_{11}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = START \land z_t=UP) $
 $ f_{12}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = DOWN \land z_t=START) $
 $ f_{13}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = DOWN \land z_t=DOWN) $
 $ f_{14}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = DOWN \land z_t=UP) $
 $ f_{15}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = UP \land z_t=START) $
 $ f_{16}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = UP \land z_t=DOWN) $
 $ f_{17}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(z_{t-1} = UP \land z_t=UP) $
Corresponding weights:  $ \theta_9 = $  $ \theta_{10} = $  $ \theta_{11} = $  $ \theta_{12} = $  $ \theta_{13} = $  $ \theta_{14} = $  $ \theta_{15} = $  $ \theta_{16} = $  $ \theta_{17} = $
Observation=DOWN Features:
 $ f_{18}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = START \land z_t=START) $
 $ f_{19}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = START \land z_t=DOWN) $
 $ f_{20}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = START \land z_t=UP) $
 $ f_{21}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = DOWN \land z_t=START) $
 $ f_{22}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = DOWN \land z_t=DOWN) $
 $ f_{23}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = DOWN \land z_t=UP) $
 $ f_{24}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = UP \land z_t=START) $
 $ f_{25}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = UP \land z_t=DOWN) $
 $ f_{26}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{DOWN} \land z_{t-1} = UP \land z_t=UP) $
Corresponding weights:  $ \theta_{18} = $  $ \theta_{19} = $  $ \theta_{20} = $  $ \theta_{21} = $  $ \theta_{22} = $  $ \theta_{23} = $  $ \theta_{24} = $  $ \theta_{25} = $  $ \theta_{26} = $
Observation=UP Features:
 $ f_{27}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = START \land z_t=START) $
 $ f_{28}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = START \land z_t=DOWN) $
 $ f_{29}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = START \land z_t=UP) $
 $ f_{30}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = DOWN \land z_t=START) $
 $ f_{31}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = DOWN \land z_t=DOWN) $
 $ f_{32}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = DOWN \land z_t=UP) $
 $ f_{33}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = UP \land z_t=START) $
 $ f_{34}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = UP \land z_t=DOWN) $
 $ f_{35}(z_{t-1}, z_t, \vec x, t) = \mathbb 1(x_t=\text{UP} \land z_{t-1} = UP \land z_t=UP) $
Corresponding weights:  $ \theta_{27} = $  $ \theta_{28} = $  $ \theta_{29} = $  $ \theta_{30} = $  $ \theta_{31} = $  $ \theta_{32} = $  $ \theta_{33} = $  $ \theta_{34} = $  $ \theta_{35} = $
Setup for programming exercises
This snippet defines a few useful constants regarding each state and their respective energy contributions.
# Names for each state
UP = "UP"
DOWN = "DOWN"
START = "<BOS>"
# Energy contribution from the state
E_UP = 1.
E_DOWN = -1.
E_START = 1.5
# Map name of the state to its energy contribution
STATE_DICT = {
    START: E_START,
    DOWN: E_DOWN,
    UP: E_UP
}
# States without START
INNER_STATES = [UP, DOWN]
# Length of the example sequence
LENGTH = 3
# Values of the coupling constants
smn = 5 # influence of the first spin energy due to outer magnetic field
c_1 = 1/smn
c_ss = .5
c_sx = .75
Exercise: Total negative energy
Implement a python function to calculate the total negative energy for a spin system. In Negative energy of the system we defined this value as the sum of its local contributions: $ E(s_{1:T}, x_{1:T}) = c_1 s_1 + c_{sx} \sum_{t=1}^3 s_t x_t + c_{ss} \sum_{t=2}^3 s_{t-1} s_t $
def negative_sequence_energy(s, x):
    # s : hidden spins
    # x : observed spins
    # returns : the total negative energy of the sequence E(s,x)
    raise NotImplementedError()
Verify your result:
s = [UP,UP,UP]
x = [UP,UP,UP]
np.testing.assert_almost_equal(negative_sequence_energy(s,x), 3.45)
s = [DOWN,DOWN,UP]
x = [UP,UP,DOWN]
np.testing.assert_almost_equal(negative_sequence_energy(s,x), -2.45)
Exercise: Probabilities for the Spin System
Implement a python function that computes the joint probability $ p(s_{1:T}, x_{1:T}) $ of the spin system by (brute force) enumeration.
def prob(s, x):
    # Returns the joint probability of the hidden spins s and observed spins x.
    # Don't confuse this with the conditional probability p(s|x)!
    raise NotImplementedError()
Verify your result:
s = [UP,UP,UP]
x = [UP,UP,UP]
np.testing.assert_almost_equal(prob(s, x), 0.1748584549)
s = [DOWN,DOWN,DOWN]
x = [DOWN,DOWN,DOWN]
np.testing.assert_almost_equal(prob(s, x), 0.117211127549)
s = [DOWN,UP,DOWN]
x = [UP,DOWN,UP]
np.testing.assert_almost_equal(prob(s, x), 0.0001762198)
Exercise: Conditional Probability
Implement a function cond_prob(s, x) which computes the probability of a given hidden sequence conditioned on an observed sequence, i.e. $ p(s_{1:T} \mid x_{1:T}) $ .
How many terms must we add to compute $ Z(x_{1:T}) $ (for $ p(s_{1:T} \mid x_{1:T}) $ )?
 For our Spin System
 In general, with $ |\mathcal S| $ possible hidden states (without the start state)
def cond_prob(s, x):
    # s : sequence of hidden spins
    # x : sequence of observed spins
    # returns p(s|x), the conditional probability of the hidden spins given the observed spins
    raise NotImplementedError()
Verify your result:
s = [UP,UP,DOWN]
x = [UP,UP,DOWN]
np.testing.assert_approx_equal(cond_prob(s, x), 0.49151220231149634)
s = [UP,UP,UP]
x = [UP,UP,UP]
cp = cond_prob(s, x)
np.testing.assert_approx_equal(cp, 0.826540721870722)
Exercise: Data
Note: It's not possible to sample from a CRF, which models $ p(z_{1:T}\mid x_{1:T}) $ ; it's a discriminative model, not a generative model.
But here we can sample because we also have a model of the joint probability $ p(z_{1:T}, x_{1:T}) $ .
 Sample data by "naive brute force" sampling, i.e. enumerate all possible sequences and compute the probability of each sequence. Then you can sample by
np.random.choice(a=possible_seq,size=size,p=pp)
 Explain why (brute force) enumeration doesn't work with larger sequences.
def sample(size):
    # size: int, number of requested samples
    # returns: list of (s, x) tuples
    # where each s is a hidden sequence and each x is its observation sequence
    raise NotImplementedError()
sample(4)
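As a hint for the second question, a quick count shows how fast joint enumeration grows: with 2 hidden and 2 observed values per position there are $ (2 \cdot 2)^T $ joint configurations.

```python
# Number of joint (s, x) configurations for a chain of length T with
# 2 hidden and 2 observed spin values per position: 4^T grows exponentially.
for T in (3, 10, 30, 100):
    print(T, 4 ** T)
```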
Exercise: Dynamic Programming for the partition function
The following exercises are based on the course on Neural Networks by Hugo Larochelle of Université de Sherbrooke. Weeks 3 and 4 focus on Conditional Random Fields.
So far, we've calculated the partition function with a brute force approach that enumerates all possible sequences. Now we want to move on to a dynamic programming algorithm, which allows us to work with longer sequences.
Study Larochelle's lecture Neural networks [3.3] : Conditional random fields - context window (timestamp 8:59 onwards). It introduces the notion of unary log-factors, which depend on the current state, and pairwise log-factors, which depend on the current and the next state. Which of the features from our spin system example are unary and which are pairwise log-factors?
 start feature
 bias feature
 observation-up feature
 observation-down feature
In other words, does the local energy contribution by this feature depend on the state of one hidden spin, or two?
Exercise: Brute force comparison
Implement a function that computes the partition function for a given input sequence $ x $ through brute force, like you did in the initial example. We can use the value of the partition function returned by the naive approach to verify the results of the more complex algorithms we're about to implement.
def partition_fn(x):
    # x : sequence of observed spins
    # returns : the partition function Z(x)
    raise NotImplementedError()
Verify your result:
np.testing.assert_almost_equal(partition_fn([UP,UP,UP]), 38.111119603947195)
np.testing.assert_almost_equal(partition_fn([UP,DOWN,UP]), 17.65747233953043)
np.testing.assert_almost_equal(partition_fn([DOWN,DOWN,DOWN]), 27.597811808118507)
Exercise: Alpha values
Study Hugo Larochelle's lecture Neural networks [3.4] : Conditional random fields - computing the partition function. In the lecture, he covers the dynamic program to compute the partition function using alpha/beta tables and wraps up with notes on numeric stability.
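As a reference point, the forward (alpha) recursion can be sketched for generic unary and pairwise log-factors. Note this is not the solution for the spin system: all factor values below are random placeholders, and the sketch works directly in probability space (no log-sum-exp stabilization).

```python
import itertools
import numpy as np

# Hypothetical log-factors for a chain with K states and T positions:
# unary[t, k]    : log-factor of state k at position t
# pairwise[j, k] : log-factor of transition j -> k (position-independent here)
rng = np.random.default_rng(0)
K, T = 2, 4
unary = rng.normal(size=(T, K))
pairwise = rng.normal(size=(K, K))

def forward_Z(unary, pairwise):
    # alpha[t, k] sums the exp-factors of all prefixes ending in state k at t
    T, K = unary.shape
    alpha = np.zeros((T, K))
    alpha[0] = np.exp(unary[0])
    for t in range(1, T):
        alpha[t] = np.exp(unary[t]) * (alpha[t - 1] @ np.exp(pairwise))
    return alpha[-1].sum()

def brute_Z(unary, pairwise):
    # Reference implementation: enumerate all K^T sequences
    total = 0.0
    for z in itertools.product(range(K), repeat=T):
        log_p = sum(unary[t, z[t]] for t in range(T))
        log_p += sum(pairwise[z[t - 1], z[t]] for t in range(1, T))
        total += np.exp(log_p)
    return total

print(np.isclose(forward_Z(unary, pairwise), brute_Z(unary, pairwise)))  # True
```

The brute-force check confirms the recursion: both sums agree, but the forward pass costs $ O(T K^2) $ instead of $ O(K^T) $.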
Implement a function to compute the alpha values. It should return both the alpha table (which will be useful later) and the partition function.
def alpha_table(x):
    # x : sequence of observed spins
    # returns : (alpha, Z), the alpha table and the partition function
    raise NotImplementedError()
Verify your result:
x = [UP, UP, UP]
naive_Z = partition_fn(x)
alpha, alpha_Z = alpha_table(x)
np.testing.assert_almost_equal(naive_Z, alpha_Z)
Exercise: Beta values
Implement a function to compute the beta values. It should return both the beta table (which will be useful later) and the partition function. We can verify our algorithms by
 checking if the alpha and beta values return the same value for the partition function
 checking if the beta function returns the same partition function as the brute force approach
def beta_table(x):
    # x : sequence of observed spins
    # returns : (beta, Z), the beta table and the partition function
    raise NotImplementedError()
Verify your result:
x = [UP, UP, UP]
naive_Z = partition_fn(x)
alpha, alpha_Z = alpha_table(x)
beta, beta_Z = beta_table(x)
np.testing.assert_almost_equal(naive_Z, beta_Z)
np.testing.assert_almost_equal(beta_Z, alpha_Z)
Summary and Outlook
In this notebook, you have
 modeled a spin system as a linear chain CRF
 engineered feature functions to transform input into feature vectors
 assigned weights to features by hand
 implemented brute force algorithms for the probability and conditional probability of sequences
 implemented dynamic programming algorithms to calculate the alpha values, beta values and the partition function
In the next step, you can use your alpha and beta algorithms to compute the marginals, i.e. the likelihood of a label at a given position. Then you can turn to the problem of decoding, i.e. finding the best label sequence for a given input sequence.
This notebook ends here, but if you're motivated to continue working with CRFs, I recommend you check out the playlist on Neural Networks by Hugo Larochelle.
Literature
Licenses
Notebook License (CC-BY-SA 4.0)
The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).
Exercise - Linear Chain Conditional Random Fields
by Christian Herta, Diyar Oktay
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.
Code License (MIT)
The following license only applies to code cells of the notebook.
Copyright 2019 Exercise  Linear Chain Conditional Random Fields
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.