Text Information Extraction - Sequences - Maximum Entropy Markov Model


In this notebook you will implement a Parts-of-Speech (POS) tagger using a Maximum Entropy Markov Model (MEMM).

You will extract features from words, using properties of the word (e.g. prefixes) and its context (e.g. the previous word in the sentence).

In the main part, you'll implement a POS tagger tagger so it can fit to training to data, learn from the data using gradient descent and decode sentences, i.e. predict a best POS tag for each word of the sentence. You can then run your tagger on the test data.

We'll wrap up with an example of the label-bias problem present in MEMMs.



Python Modules

import numpy as np
from nltk.corpus import treebank
from sklearn.model_selection import train_test_split


We'll be using a fragment from the Penn Treebank dataset provided in the NLTK toolkit.

tagged_sentences = treebank.tagged_sents(tagset='universal')
train_set, test_set = train_test_split(tagged_sentences,test_size=0.2,random_state=12)
print('{} train sentences'.format(len(train_set)))
print('{} test sentences'.format(len(test_set)))

The data is a list of list of (string, string). Each list is a sentence, each (string, string) tuple represents a (word, POS tag)

Example usages

for sentence in tagged_sentences:
    for word, tag in sentence:
        print(word.ljust(15), tag)

Feature Functions

In a Hidden Markov Model, we used words as observations directly. For example, if we observed the sentence "Hello lovely world" the observation sequence would be

hello → lovely → world

We might make some minor generalisations like turning all words lowercase or removing diacritics from letters like ê, é and è.

A Maximum Entropy Markov Model on the other hand offers more liberty - We extract features from words in the context of the word sequence, i.e. the sentence.

A feature function ϕ(x1:i,si,si1,i)\phi{}(x_{1:i}, s_i, s_{i-1}, i) is a function of

  • ii the token/position i, in the range from 1 to the length of the sequence
  • x1:ix_{1:i} The word sequence from 1 to and including ii
  • sis_i the tag at the iith position
  • si1s_{i-1} the tag at the i1i-1th position

This allows us to condition on not just on the word itself, but also properties of the word, the current and previous tag and the word sequence so far. Each feature function can return

  • A 1 or 0, e.g. a function that answers 'is the word capitalised?'
  • A vector of 1s and 0s, e.g. a function that returns the previous tag. The returned vector has the length of the tag vocabulary. At the index of the previous tag, it contains a 1, otherwise 0s.

We can apply a whole host of feature functions on the input to extract meaningful information on the word and some of its context. When we concatenate all our findings, we get a binary feature vector representing the input. The observation sequence "Hello lovely world" may look like this once piped through our functions:



Build a tag vocabulary

Extract all tags that occur in the training set and assign them a unique ID. We'll store each unique tag in a list and use its index in the list as the ID.

tag_vocabulary = None

You can run this test to verify your result.

def test_tag_vocabulary(vocab):
    expected_vocabulary = ['CONJ', 'NOUN', 'ADJ', 'PRT', 'DET', 'VERB', 'NUM', 'X', '.', 'ADV', 'PRON', 'ADP']
    assert type(vocab) == type([]), "Not a list"
    assert len(vocab) == len(set(vocab)), "Duplicate tag found"
    assert set(vocab) == set(expected_vocabulary), "Does not match expected vocabulary"

Most frequent tag

This exercise is the first step towards constructing feature functions. The same word can have different tags depending on the context, for example 'book' can be a noun in

She is reading a book.

or a verb in

Could you book the flight?

The challenge is to find features in the word itself and the context to assess whether a specific instance of 'book' is a noun or a verb. Before we worry about the context though, we start with a strong indicator of what the POS tag for a term may be: Its most frequent tag in the training data.

For each term, find the tag it was associated with most frequently in the training data. Store your result as a dict mapping each term to its most frequent tag.

most_frequent_tag = dict()

Verify your solution.

def test_most_frequent_tag(words_and_tags):
    assert 'say' in words_and_tags.keys()
    assert words_and_tags['be'] == 'VERB'
    assert words_and_tags['book'] == 'NOUN'

Suffixes and Prefixes

Suffixes and prefixes of a word can be a meaningful indicator of their tag. This is especially helpful in dealing with unfamiliar words. Even if our classifier has never encountered the word serendipity it could make an educated guess that it's a noun because of the suffix -ity.

Research common suffixes and prefixes of various parts-of-speech categories and store them. Here is a small sample:

common_suffixes = {
    'able' : 'ADJ',
    'ship' : 'NOUN',
    'ing' : 'VERB'

Using those suffixes and prefixes, implement a feature function that returns a binary vector. Each cell in the returned vector represents a tag (NOUN, VERB, ...). The value of a cell is 1 if the word contains a suffix/prefix that hints towards the correspoding tag.

def contains_suffix(word):
    raise NotImplementedError()

If you're using a suffix-tag dict like above, you can run this test to verify your result.

def test_contains_suffix():
    noun_idx = tag_vocabulary.index('NOUN')
    msg = 'Only the noun suffix in "divinity" should be 1'
    divinity = contains_suffix('divinity')
    assert divinity[noun_idx] == 1 and divinity.sum() == 1, msg
    msg = 'The suffix hints for "qwerty" should be all zero'
    qwerty = contains_suffix('qwerty')
    assert np.all(qwerty == np.zeros_like(qwerty))


Features of the word

Implement a function that checks whether or not certain criteria are true or false for a given word. Some criteria you may want to test are

  • Is the word uppercase?
  • Is it a valid number (12, -31.31, thousand)?
  • Is it a short word?
def word_features(word):
    raise NotImplementedError()

Try and see if the features you extracted from the word itself match your expectations.

for word in ('Italian', 'aims', 'R2-D2'):
    print('{} {}'.format(word.ljust(10), word_features(word)))

Constructing feature vectors

Implement a function that transforms each word in a sentence into a feature vector. As mentioned before, you can stack together as many functions of the word, the word sequence so far, the previous tag and the current tag as you please. You may want to include the following features in your result:

  • If the term is known, its most frequent tag
  • Likely tags given any prefixes/suffixes of the term
  • Features of the word itself (is it uppercase, does it contain digits, etc.)
  • Whether or not the word follows certain keywords like will or would
  • The previous tag in the sequence

Concatenating all these findings should produce a binary vector encoding an observation as features of the term and its context, which our classifier conditions on to predict POS tags.

def word_to_feature_vector(word_sequence, nth, previous_tag, current_tag):
    # Parameters:
    # word_sequence : list of string
    #     a sentence
    # nth : int
    #     the function returns the n-th word i.e. word_sequence[nth]
    #     as a feature vector
    # previous_tag : int
    #     id of the tag of word_sequence[nth-1], or -1 if the word is
    #     at the beginning of the sequence
    # current_tag : int
    #     id of the tag of word_sequence[nth]
    # Returns:
    # feature_vector : np.array of bool
    #     a binary vector encoding information about the nth-word and the context
    #     passed to this function
    raise NotImplementedError()
# 'Hello' in the sentence 'Hello lovely world'
# as a feature vector
    word_sequence = ['Hello', 'lovely', 'world'],
    nth = 0,
    previous_tag = -1,
    current_tag = None

Transform the train sentences

Currently the tagged sentences is a list of list of (word : string, tag: string). We'll now

  • transform each word into its feature vector representation
  • transform each tag into its index in the tag vocabulary
  • flatten the tagged sentences into a single list of (word, previous tag)

The training examples XX will be a list of tuples (wi,ti1)(w_i,t_{i-1}) where wiw_i is the i-th word transformed into a feature vector and ti1t_{i-1} is the index of the tag associated with the previous word in the sentence. If wiw_i is the first word of a sentence, ti1t_{i-1} will take on the special START index -1.

Note that the previous tag ti1t_{i-1} in the tuple is not intended as a feature, we've already taken care of that in the word_to_feature function. In the learning process, our MEMM classifier uses a different set of weights for each previous tag. So ti1t_{i-1} determines which set of weights to train.

The true labels yy is a list of int [t0tm][t_0 \dots t_m] indicating the tag at the iith token.

def transform_train_corpus(tagged_sentences):
    raise NotImplementedError()
    return X,y
X_train, y_train = transform_train_corpus(train_set)
# Number of elements in X_train with the previous tag -1 is equal
# to the number of sentences in the training set
assert sum(1 for w,t_prev in X_train if t_prev==-1) == len(train_set)

A MEMM skeleton

You'll implement the functions fit, gradient_descent and decode in the following exercises. __init__ can remain blank.

class MEMM():
    def __init__(self):
    def fit(self,X,y):
        raise NotImplementedError()
    def gradient_descent(self,X,y,epochs,learning_rate):
        raise NotImplementedError()
    def decode_sentence(self,X):
        raise NotImplementedError()

Here is an example of how the class will be used.

memm = MEMM()
y_pred = memm.decode_sentence(X_test[0])
accuracy = np.mean(np.array(y_pred) == np.array(y_test))
print('accuracy for the first sentence: {}'.format(accuracy))

Fit function

In logistic regression, we compute scores over the output classes as the dot product of weights and the input. We pipe the scores through the softmax function to obtain a probability distribution over the output classes.

p(y)=softmax(Wx)p(y) = softmax(W \cdot x)

Our MEMM classifier trains a different weights matrix WsW^s for each possible previous state ss. This allows you to compute a probability distribution over tags for a given input word, conditioned on the previous tag in the sequence.

p(sisi1=s)=softmax(Wsx)p(s_i \mid s_{i-1}=s) = softmax(W^s \cdot x)

Task: Implement the fit function which fits the instance of the MEMM to the training examples X and ground truth y.

n_classes is the number of possible tags. n_features is the number of elements per feature vector (word). We can infer these parameters from the training examples XX and yy.

W_given_prev is an array of weights matrices, one for each possible previous tag in the sequence. W_given_start is a single weights matrix for the special case the input is at the beginning of a sequence. Their shape depends on the number of features and classes.

def fit(self,X,y):
        self.n_classes = None
        self.n_features = None
        self.W_given_start = None
        self.W_given_prev = []
MEMM.fit = fit

Verify your solution.

memm = MEMM()
assert memm.n_classes == len(tag_vocabulary)
assert len(memm.W_given_prev) == len(tag_vocabulary)


Recall that our input XX is tuples of words (as feature vectors) and previous tags. We'll split the input XX into i+1i+1 subsets, separated by their previous tag. Fittingly, our classifier has i+1i+1 weight matrices, given each previous state. Each subset of XX is used to train the corresponding weights using gradient descent. If you like, you can use mini-batch gradient descent.

def gradient_descent(self,X,y,epochs,learning_rate):
    raise NotImplementedError()

MEMM.gradient_descent = gradient_descent


We want to decode whole sequences at a time. We'll pass a sentence as a list of untransformed words to the decoding function. It computes and returns the most likely sequence of tags for the sentence.

Remember that the decoding function deals with unfamiliar sentences, which means the tag sequence is unkown. For each word, the Viterbi algorithm considers each possible previous tag, which affects the value passed to the previous_tag parameter in the word_to_feature function.

def decode_sentence(self, sentence):
    # sentence : list of string, the word sequence to tag
    # returns : list of int, the assigned tag sequence
    raise NotImplementedError()
MEMM.decode_sentence = decode_sentence

Trying it out

def transform_test_sentences(tagged_sentences):
    X,y = [],[]
    for sent in tagged_sentences:
        words,tags = list(zip(*sent))
        y.append([tag_vocabulary.index(t) for t in tags])
    return X,y
X_test,y_test = transform_test_sentences(test_set)
memm = MEMM()
memm.fit(X_train, y_train)
predicted_sentences = [memm.decode_sentence(x) for x in X_test]

from itertools import chain
from sklearn.metrics import confusion_matrix
y_true = list(chain.from_iterable(y_test))
y_pred = list(chain.from_iterable(predicted_sentences))
conf_mat = confusion_matrix(y_true,y_pred)
for i in range(conf_mat.shape[0]):
    print('{:<5} {:5} out of {:<5} ({:2.1f})'.format(tag_vocabulary[i], conf_mat[i,i], conf_mat[i].sum(), 100 * conf_mat[i,i]/conf_mat[i].sum()))

Label-Bias Problem

MEMMs suffer from a problem known as the label bias problem. Essentially,

  • states with few succeeding states are preferred
  • states with a single outgoing state ignore their observation entirely.

Read up on the label bias problem in Chapter 2 of Lafferty et. al's paper on Conditional Random Fields [LAF01]. Then walk through the example below.

Given is the following train data set (x1:T,h1:T)(x_{1:T}, h_{1:T}):

Dtrain={(rib,123),(rib,123),(rib,123),(rob,456),(rob,456)}\mathcal D_{train} = \{(rib,123), (rib,123), (rib,123), (rob,456), (rob,456)\}
  • h1:Th_{1:T}: Hidden sequence
  • x1:Tx_{1:T}: Observed Sequence
  • Start state: h0=h_0 = *
  • Hidden state vocabulary: {,1,2,3,4,5,6}\{*,1,2,3,4,5,6\}
  • Observation vocabulary: {r,i,o,b}\{r,i,o,b\}

Preparation: Give the following maximum likelihood estimates of p(htxt,ht1)p(h_t \mid x_t, h_{t-1}):

  • P(1r,)=?P(1 \mid r, *) = ?
  • P(4r,)=?P(4 \mid r, *) = ?
  • P(2i,1)=?P(2 \mid i, 1) = ?
  • P(3b,2)=?P(3 \mid b, 2) = ?
  • P(5o,1)=?P(5 \mid o, 1) = ?
  • P(6b,5)=?P(6 \mid b, 5) = ?

Note: In the training data, state 2 is the only successor of state 1. Therefore P(21)=1P(2 \mid 1) = 1, so P(2x,1)=1P(2 \mid x, 1) = 1 for all observations xx

Task What are the probabilities of the hidden state sequences 123 and 456 given the observation sequence 'rob'?

  • P(123rob)=?P(123 \mid rob) = ?
  • P(456rob)=?P(456 \mid rob) = ?

Discuss the solution. Is that what you have expected? Relate the solution to the label-bias problem.

Summary and Outlook

Parts-of-speech tagging is a disambiguation task. Sometimes a term unambiguously belongs to a certain part of speech. In those cases assigning a tag is trivial. With unfamiliar words, features of the word can provide a reliable hint, for example their suffix or if the word shape resembles a date. At other times, we have to look at the context to predict the part of speech. Recall the example from the introduction: Which feature could reveal whether an book is a noun or a verb? If the previous word is 'a' or 'the', this could indicate it's a noun in this specific sentence.

Have a look at the confusion matrix and see which classification your MEMM classifier struggles with. I'll leave you with this highlight_errors function, which prints senteces in which a word in the category actual was incorrectly classified as predicted. Can you identify patterns in the sentence or word that hint, or even imply a certain parts of speech? You could expand your feature functions to extract those patterns and make more well-informed predicitions.

def highlight_errors(actual,predicted):
    for i,sent in enumerate(y_test):
        for j,tag in enumerate(sent):
            if tag == tag_vocabulary.index(actual):
                prediction = predicted_sentences[i][j]
                if prediction == tag_vocabulary.index(predicted):
                    sentence = X_test[i]
                    word = sentence[j]
                    print('Actual: {}, Predicted: {}'.format(actual,predicted))
highlight_errors(actual='NOUN', predicted='ADJ')



Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Text Information Extraction - Sequences - Maximum Entropy Markov Model
by Christian Herta, Diyar Oktay
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Christian Herta, Diyar Oktay

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.