Text Information Extraction - Sequences - Maximum Entropy Markov Model


In this notebook you will implement a Parts-of-Speech (POS) tagger using a Maximum Entropy Markov Model (MEMM).

You will extract features from words, using properties of the word (e.g. prefixes) and its context (e.g. the previous word in the sentence).

In the main part, you'll implement a POS tagger so it can fit to training to data, learn from the data using gradient descent and decode sentences, i.e. predict a best POS tag for each word of the sentence. You can then run your tagger on the test data.

We'll wrap up with an example of the label-bias problem present in MEMMs.



Python Modules

import numpy as np
import nltk
from nltk.corpus import treebank
from sklearn.model_selection import train_test_split


We'll be using a fragment from the Penn Treebank dataset provided in the NLTK toolkit.

tagged_sentences = treebank.tagged_sents(tagset='universal')
train_set, test_set = train_test_split(tagged_sentences,test_size=0.2,random_state=12)
print('{} train sentences'.format(len(train_set)))
print('{} test sentences'.format(len(test_set)))

The data is a list of list of (string, string). Each list is a sentence, each (string, string) tuple represents a (word, POS tag)

Example usages

for sentence in tagged_sentences:
    for word, tag in sentence:
        print(word.ljust(15), tag)

Feature Functions

In a Hidden Markov Model, we used words as observations directly. For example, if we observed the sentence "Hello lovely world" the observation sequence would be

hello → lovely → world

We might make some minor generalisations like turning all words lowercase or removing diacritics from letters like ê, é and è.

A Maximum Entropy Markov Model on the other hand offers more liberty - We extract features from words in the context of the word sequence, i.e. the sentence.

A feature function$ \phi{}(x_{1:i}, s_i, s_{i-1}, i) $ is a function of $ i $ the token/position in the range from 1 to the length of the sequence $ x_{1:i} $ The word sequence from 1 to and including$ i $ $ s_i $ the tag at the$ i $th position $ s_{i-1} $ the tag at the$ i-1 $th position

This allows us to condition not just on the word itself, but also properties of the word, the current and previous tag and the word sequence so far. Each feature function can return

  • A 1 or 0, e.g. a function that answers 'is the word capitalised?'
  • A vector of 1s and 0s, e.g. a function that returns the previous tag. The returned vector has the length of the tag vocabulary. At the index of the previous tag, it contains a 1, otherwise 0s.

We can apply a whole host of feature functions on the input to extract meaningful information on the word and some of its context. When we concatenate all our findings, we get a binary feature vector representing the input. The observation sequence "Hello lovely world" may look like this once piped through our functions:

$ [0,0,1,0] $ →$ [0,1,1,1] $ →$ [1,1,0,1] $


Build a tag vocabulary

Extract all tags that occur in the training set train_set and store them in a list tag_vocabulary. Then we can use the list index as a unique ID for each tag.

tag_vocabulary = None

You can run this test to verify your result.

def test_tag_vocabulary(vocab):
    expected_vocabulary = ['CONJ', 'NOUN', 'ADJ', 'PRT', 'DET', 'VERB', 'NUM', 'X', '.', 'ADV', 'PRON', 'ADP']
    assert type(vocab) == type([]), "Not a list"
    assert len(vocab) == len(set(vocab)), "Duplicate tag found"
    assert set(vocab) == set(expected_vocabulary), "Does not match expected vocabulary"

Most frequent tag

This exercise is the first step towards constructing feature functions. The same word can have different tags depending on the context, for example 'book' can be a noun in

She is reading a book.

or a verb in

Could you book the flight?

The challenge is to find features in the word itself and the context to assess whether a specific instance of 'book' is a noun or a verb. Before we worry about the context though, we start with a strong indicator of what the POS tag for a term may be: Its most frequent tag in the training data.

For each term, find the tag it was associated with most frequently in the training data. Store your result as a dict mapping each term to its most frequent tag.

most_frequent_tag = dict()

Verify your solution.

def test_most_frequent_tag(words_and_tags):
    assert 'say' in words_and_tags.keys()
    assert words_and_tags['be'] == 'VERB'
    assert words_and_tags['book'] == 'NOUN'

Suffixes and Prefixes

Suffixes and prefixes of a word can be a meaningful indicator of their tag. This is especially helpful in dealing with unfamiliar words. Even if our classifier has never encountered the word serendipity it could make an educated guess that it's a noun because of the suffix -ity.

Research common suffixes and prefixes of various parts-of-speech categories and store them. Here is a small example. Expand the dictionary common_suffixes.

common_suffixes = {
    'able' : 'ADJ',
    'ship' : 'NOUN',
    'ity' : 'NOUN',
    'ing' : 'VERB'

Using those suffixes and prefixes, implement a feature function that returns a binary vector. Each cell in the returned vector represents a tag (NOUN, VERB, ...). The value of a cell is 1 if the word contains a suffix/prefix that hints towards the correspoding tag.


contains_suffix('progamming') should return:

array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])

when tag_vocabulary[5] returns 'VERB'.

def contains_suffix(word):
    raise NotImplementedError()

If you're using a suffix-tag dict like above, you can run this test to verify your result.

def test_contains_suffix():
    noun_idx = tag_vocabulary.index('NOUN')
    msg = 'Only the noun suffix in "divinity" should be 1'
    divinity = contains_suffix('divinity')
    assert divinity[noun_idx] == 1 and divinity.sum() == 1, msg
    msg = 'The suffix hints for "qwerty" should be all zero'
    qwerty = contains_suffix('qwerty')
    assert np.all(qwerty == np.zeros_like(qwerty))


Features of the word

Implement a function that checks whether or not certain criteria are true or false for a given word. Some criteria you may want to test are (you should implement more).

  • Is the word uppercase?
  • Is it a valid number (12, -31.31, thousand)?
  • Is it a short word?


For only the proposed features, word_features("Yes") should return:

array([True, False, True])

def word_features(word):
    raise NotImplementedError()

Try and see if the features you extracted from the word itself match your expectations.

for word in ('Italian', 'aims', 'R2-D2'):
    print('{} {}'.format(word.ljust(10), word_features(word)))

Constructing feature vectors

Implement a function that transforms each word in a sentence into a feature vector. As mentioned before, you can stack together as many functions of the word, the word sequence so far, the previous tag and the current tag as you please. You may want to include the following features in your result:

  • If the term is known, its most frequent tag
  • Likely tags given any prefixes/suffixes of the term
  • Features of the word itself (is it uppercase, does it contain digits, etc.)
  • Whether or not the word follows certain keywords like will or would
  • The previous tag in the sequence

Concatenating all these findings should produce a binary vector encoding an observation as features of the term and its context, which our classifier conditions on to predict POS tags.


word_to_feature_vector( word_sequence = ['Hello', 'lovely', 'world'], nth = 0, previous_tag = -1, current_tag = None )

could for exmaple return (depending on the features you use):

array([False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False])

def word_to_feature_vector(word_sequence, nth, previous_tag, current_tag):
    word_sequence : list of string
         a sentence
    nth : int
         the function returns the n-th word i.e. word_sequence[nth]
         as a feature vector
    previous_tag : int
         id of the tag of word_sequence[nth-1], or -1 if the word is
         at the beginning of the sequence
    current_tag : int
         id of the tag of word_sequence[nth]

    feature_vector : np.array of bool
         a binary vector encoding information about the nth-word and the context
         passed to this function
    raise NotImplementedError()
# 'Hello' in the sentence 'Hello lovely world'
# as a feature vector
    word_sequence = ['Hello', 'lovely', 'world'],
    nth = 0,
    previous_tag = -1,
    current_tag = None

Transform the train sentences

Currently the tagged sentences is a list of list of (word : string, tag: string). We'll now

  • transform each word into its feature vector representation
  • transform each tag into its index in the tag vocabulary
  • flatten the tagged sentences into a single list of (word, previous tag)

The training examples$ X $ will be a list of tuples$ (w_i,t_{i-1}) $ where$ w_i $ is the i-th word transformed into a feature vector and$ t_{i-1} $ is the index of the tag associated with the previous word in the sentence. If$ w_i $ is the first word of a sentence,$ t_{i-1} $ will take on the special START index -1.

Note that the previous tag$ t_{i-1} $ in the tuple is not intended as a feature, we've already taken care of that in the word_to_feature function. In the learning process, our MEMM classifier uses a different set of weights for each previous tag. So$ t_{i-1} $ determines which set of weights to train.

The true labels$ y $ is a list of int$ [t_0 \dots t_m] $ indicating the tag at the$ i $th token.

def transform_train_corpus(tagged_sentences):
    raise NotImplementedError()
    return X,y

If your implementation of transform_train_corpus is correct, the next cell should print some output similar to this:

First word feature vector and index to previous toke (here start = -1):
(array([False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False]), -1)

First word tag:
X_train, y_train = transform_train_corpus(train_set)
print('First word feature vector and index to previous toke (here start = -1):')
print('\nFirst word tag:')
# Number of elements in X_train with the previous tag -1 is equal
# to the number of sentences in the training set
assert sum(1 for w,t_prev in X_train if t_prev==-1) == len(train_set)

A MEMM skeleton

You'll implement the functions fit, gradient_descent and decode in the following exercises. __init__ can remain blank.

class MEMM():
    def __init__(self):
    def fit(self,X,y):
        raise NotImplementedError()
    def gradient_descent(self,X,y,epochs,learning_rate):
        raise NotImplementedError()
    def decode_sentence(self,X):
        raise NotImplementedError()

Here is an example of how the class will be used.

memm = MEMM()
y_pred = memm.decode_sentence(X_test[0])
accuracy = np.mean(np.array(y_pred) == np.array(y_test))
print('accuracy for the first sentence: {}'.format(accuracy))

Fit function

In logistic regression, we compute scores over the output classes as the dot product of weights and the input. We pipe the scores through the softmax function to obtain a probability distribution over the output classes. $ p(y) = softmax(W \cdot x) $ Our MEMM classifier trains a different weights matrix$ W^s $ for each possible previous state$ s $. This allows you to compute a probability distribution over tags for a given input word, conditioned on the previous tag in the sequence. $ p(s_i \mid s_{i-1}=s) = softmax(W^s \cdot x) $

Task: Implement the fit function which fits the instance of the MEMM to the training examples X and ground truth y.

n_classes is the number of possible tags. n_features is the number of elements per feature vector (word). We can infer these parameters from the training examples$ X $ and$ y $.

W_given_prev is an array of weights matrices, one for each possible previous tag in the sequence. W_given_start is a single weights matrix for the special case when the input is at the beginning of a sequence. Their shape depends on the number of features and classes.

Initilise W_given_start and W_given_prev with normal distribution, e.g. np.random.randn.

def fit(self,X,y):
        self.n_classes = None
        self.n_features = None
        self.W_given_start = None
        self.W_given_prev = []
MEMM.fit = fit

Verify your solution.

memm = MEMM()
assert memm.n_classes == len(tag_vocabulary)
assert len(memm.W_given_prev) == len(tag_vocabulary)

Learning with Gradient Descent

Recall that our input$ X $ is tuples of words (as feature vectors) and previous tags. We'll split the input$ X $ into$ K+1 $ subsets, separated by their previous tag. Our classifier has$ K+1 $ weight matrices, given each previous state.The necessary steps in detail:

Split X into$ K+1 $ subsets$ x^i $ with$ K $ the number of our classes (tags).

Then calculate: $ z^i_j = x_1^i \cdot w_{1,j} + x_2^i \cdot w_{2,j} + \ldots + x_n^i \cdot w_{n,j} $ with $ j \in \{ 1, 2,\ldots, K\} $, $ n $ the number of features

Implement the softmax function and use it inside of gradient_descent to calculate the predicted probabilies of$ x^i $ belonging to class$ j $:

$ \hat y^i_j = \sigma(z^i)_j = \frac{e^{z^i_j}}{\sum_{k=1}^K e^{z^i_k}} $

For one epoch (or mini-batch) do:

  • Calculate the gradient of the cross entropy cost:

$ \nabla cost_j = \begin{pmatrix} \frac{\partial cost_j}{\partial w_{1,j}} \\ \frac{\partial cost_j}{\partial w_{2,j}} \\ \vdots \\ \frac{\partial cost_j}{\partial w_{n,j}} \\ \end{pmatrix} =(\hat y_j - y_j) \cdot X^T $

  • Update the wheight matrices with the update rule$ \forall e \in \{1,2,\ldots,n\} $ : $ w_{e,j}^{new} = w_{e,j}^{old} - \alpha \frac{\partial cost_j}{\partial w_{e,j}} $


  • Avoid loops when possible, e.g.$ Z $ or all$ X $ (or minibatch) can efficiently be computed with the matrix dot product:$ Z = W \cdot X^T $.
  • Of course you are free to implement as many helper functions as you like, e.g. for the softmax function or the update rule Each subset of$ X $ is used to train the corresponding weights using gradient descent. If you like, you can use mini-batch gradient descent.
def gradient_descent(self,X,y,epochs,learning_rate):
    raise NotImplementedError()

MEMM.gradient_descent = gradient_descent

Decoding with Viterbi

We want to decode whole sequences at a time. We'll pass a sentence as a list of untransformed words to the decoding function. It computes and returns the most likely sequence of tags for the sentence.

Remember that the decoding function deals with unfamiliar sentences, which means the tag sequence is unkown. For each word, the Viterbi algorithm considers each possible previous tag, which affects the value passed to the previous_tag parameter in the word_to_feature function.

Implement the viterbi algorithm:

1. Initialisation

For the first word of a sentence we multiply our trained W_given_start $ W_0 $) probabilities with the feature vector of our first word$ x^0 $ and store the previous tag (start = -1) in the backpointer. In order to also receive probability values for our unknown sentences, also apply the sotmax function $ \sigma $) here. $ \vec{viterbi_{0}} = \sigma(W_{0} \vec x^{0^T})\\ \vec{backpointer}_{0} = -1 $

2. Recursion

In the recursion step the computation of the next viterbi vector includes the provious one.

We calculate the likelihood of moving from times step (or token postions)$ t-1 $ to$ t $ and observing the different tags, producing a Matrix of likelihood vectors$ {V}_t = (\vec v_1, \vec v_2, \dots \vec v_K) $. In the viterbi table, we store the maximum vector, in the backpointer its index, which equals the tag index.

$ V_{t} = (\vec v_1, \vec v_2, \dots \vec v_K) = \sigma(viterbi_{t-1}W_{b} \vec x^{t^T})\\ \vec{viterbi_{t}} = \max_{k=1}^K \vec{v}_k \\ \vec{backpointer}_{t} = \text{arg} \max_{k=1}^K \vec{v}_k $

3. Finalisation

To get the most likely tag sequence, trace back the indizes in$ backpointer $ and keep track of them:

The index (scalar) of the most likely last word is: $ backpointer_{Last, Max} = \text{arg} \max_{k=1}^K \vec{viterbi}_{Last, k} $ $ backpointer_{Last, Max} $ then points to the index of the most likely word in the vector$ \vec{backpointer}_{Last-1} $ and so on.

def decode_sentence(self, sentence):
    # sentence : list of string, the word sequence to tag
    # returns : list of int, the most likely tag sequence
    raise NotImplementedError()
MEMM.decode_sentence = decode_sentence

Trying it out

def transform_test_sentences(tagged_sentences):
    X,y = [],[]
    for sent in tagged_sentences:
        words,tags = list(zip(*sent))
        y.append([tag_vocabulary.index(t) for t in tags])
    return X,y
X_test,y_test = transform_test_sentences(test_set)
memm = MEMM()
memm.fit(X_train, y_train)
predicted_sentences = [memm.decode_sentence(x) for x in X_test]

from itertools import chain
from sklearn.metrics import confusion_matrix
y_true = list(chain.from_iterable(y_test))
y_pred = list(chain.from_iterable(predicted_sentences))
### compare predicted tags with true tags for irst sentence:

for i in range(len(X_test[0])):
    print('{}\t\t{}\t{}'.format(X_test[0][i], tag_vocabulary[y_true[i]], tag_vocabulary[y_pred[i]]))

If your implementations are correct, the next cell should print something similar to the following. The numbers on the diagonal of the confusion matrix mark the correctly predicted words:

[[ 641    0    0    7    0    4    9    0    0    0    0    2]
 [   0 1610    0   63    1    1    1    0    1    0    0   16]
 [   0    2 1325    1    0    1    1    9    1    1    0    2]
 [   9    0    0 1883    0    2   11    0    5    0    1    2]
 [   0    0    0    0  530    8    0    2    1    2    0    4]
 [   4    3    0   11    1 2451    6    0   28    9    2  178]
 [   9    6    0   33    1    8  529    0   64    2    2    3]
 [   1    3    2    4    0    3    0  711    3    0    1   11]
 [   1    6   52    3    0   38   20    0  966    7    0   97]
 [   1    0    0    0    0    0    0    0    0 2355    0    0]
 [   0    7    0    0    0    3    8    0    0    0  403    3]
 [   2   23   15   15    0   78   32    1  145   18    7 5274]]
PRT     641 out of 663   (96.7 %)
DET    1610 out of 1693  (95.1 %)
X      1325 out of 1343  (98.7 %)
ADP    1883 out of 1913  (98.4 %)
PRON    530 out of 547   (96.9 %)
VERB   2451 out of 2693  (91.0 %)
ADV     529 out of 657   (80.5 %)
NUM     711 out of 739   (96.2 %)
ADJ     966 out of 1190  (81.2 %)
.      2355 out of 2356  (100.0 %)
CONJ    403 out of 424   (95.0 %)
NOUN   5274 out of 5610  (94.0 %)
Average accuracy:  0.9420012104095219 %
conf_mat = confusion_matrix(y_true,y_pred)
for i in range(conf_mat.shape[0]):
    print('{:<5} {:5} out of {:<5} ({:2.1f} %)'.format(tag_vocabulary[i], conf_mat[i,i], conf_mat[i].sum(), 100 * conf_mat[i,i]/conf_mat[i].sum()))
print('Average accuracy: ', np.trace(conf_mat)/np.sum(conf_mat), '%')

Label-Bias Problem

MEMMs suffer from a problem known as the label bias problem. Essentially,

  • states with few succeeding states are preferred
  • states with a single outgoing state ignore their observation entirely.

Read up on the label bias problem in Chapter 2 of Lafferty et. al's paper on Conditional Random Fields [LAF01]. Then walk through the example below.

Given is the following train data set$ (x_{1:T}, h_{1:T}) $:

$ \mathcal D_{train} = \{(rib,123), (rib,123), (rib,123), (rob,456), (rob,456)\} $

-$ h_{1:T} $: Hidden sequence -$ x_{1:T} $: Observed Sequence

  • Start state:$ h_0 = * $
  • Hidden state vocabulary:$ \{*,1,2,3,4,5,6\} $
  • Observation vocabulary:$ \{r,i,o,b\} $

Preparation: Give the following maximum likelihood estimates of$ p(h_t \mid x_t, h_{t-1}) $:

-$ P(1 \mid r, *) = ? $ -$ P(4 \mid r, *) = ? $ -$ P(2 \mid i, 1) = ? $ -$ P(3 \mid b, 2) = ? $ -$ P(5 \mid o, 1) = ? $ -$ P(6 \mid b, 5) = ? $

Note: In the training data, state 2 is the only successor of state 1. Therefore$ P(2 \mid 1) = 1 $, so$ P(2 \mid x, 1) = 1 $ for all observations$ x $

Task What are the probabilities of the hidden state sequences 123 and 456 given the observation sequence 'rob'? -$ P(123 \mid rob) = ? $ -$ P(456 \mid rob) = ? $

Discuss the solution. Is that what you have expected? Relate the solution to the label-bias problem.

Summary and Outlook

Parts-of-speech tagging is a disambiguation task. Sometimes a term unambiguously belongs to a certain part of speech. In those cases assigning a tag is trivial. With unfamiliar words, features of the word can provide a reliable hint, for example their suffix or if the word shape resembles a date. At other times, we have to look at the context to predict the part of speech. Recall the example from the introduction: Which feature could reveal whether an book is a noun or a verb? If the previous word is 'a' or 'the', this could indicate it's a noun in this specific sentence.

Have a look at the confusion matrix and see which classification your MEMM classifier struggles with. I'll leave you with this highlight_errors function, which prints senteces in which a word in the category actual was incorrectly classified as predicted. Can you identify patterns in the sentence or word that hint, or even imply a certain parts of speech? You could expand your feature functions to extract those patterns and make more well-informed predicitions.

def highlight_errors(actual,predicted):
    for i,sent in enumerate(y_test):
        for j,tag in enumerate(sent):
            if tag == tag_vocabulary.index(actual):
                prediction = predicted_sentences[i][j]
                if prediction == tag_vocabulary.index(predicted):
                    sentence = X_test[i]
                    word = sentence[j]
                    print('Actual: {}, Predicted: {}'.format(actual,predicted))
highlight_errors(actual='NOUN', predicted='ADJ')



Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Text Information Extraction - Sequences - Maximum Entropy Markov Model
by Christian Herta, Diyar Oktay
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Christian Herta, Diyar Oktay

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.