Natural Language Processing - Text Classification - Naive Bayes

Introduction

In this notebook, you will implement Naive Bayes learning algorithms for text classification.

You will work with the 20 Newsgroup dataset and explore how Bayes Theorem coupled with naive assumptions uses the features of a document to find a most likely class. You'll also cover using log-probabilities and smoothing for a more stable implementation.

At the end of this notebook you will have implemented two types of Naive Bayes classifiers, using the multinomial and Bernoulli model. The models differ in what types of features they extract from documents.

Requirements

Knowledge

  • This notebook is based on Chapter 13 "Text classification and Naive Bayes" of Introduction to Information Retrieval. The book is published over at https://nlp.stanford.edu/IR-book/information-retrieval-book.html, see MAN09 for a full citation. Study the chapter, in particular pp. 253 - 270 of the book, which corresponds to pp. 1 - 18 in the PDF.

  • We'll be using the same corpus and some of the text preparation methods presented in sklearn's Working With Text Data, so you may want to keep the article handy.

Python Modules

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import (CountVectorizer)
import numpy as np

Data

We're using the "20 Newsgroups" dataset as our corpus. It is a collection of over 10,000 newsgroup articles, each of which belongs to 1 of 20 classes.

basedir = '~/deep.TEACHING/data/newsgroups_dataset'
twenty_train = fetch_20newsgroups(basedir,
                                  subset='train',
                                  shuffle=True,
                                  random_state=42)
twenty_test = fetch_20newsgroups(basedir,
                                  subset='test',
                                  shuffle=True,
                                  random_state=42)

Teaching Content

Token vs. Term

Make sure you distinguish between token and term (MAN09 ch. 2.2.1 'Tokenization'). A token is a position in a document. A term is an item in the vocabulary. So we can take the sentence The good, the bad and the ugly, split it into the tokens the good the bad and the ugly and extract the terms the good bad and ugly as a vocabulary.

Bayes Theorem

The conditional probability $ P(A|B) $ tells us "What are the odds of A, given our knowledge of B?" In text classification, we have a class $ c $ from a set of classes $ C $ and a document $ d $ . We're interested in $ P(c|d) $ , the likelihood $ c $ is the correct class for the document, given our knowledge of the document. Bayes' Theorem tells us that this probability can be computed as DOUBLE P(c|d) = \frac{P(c)P(d|c)}{P(d)} DOUBLE``SINGLE P(c) SINGLE is the prior - The probability $ c $ is the correct class before we even consider the particular document. $ P(d|c) $ is the evidence. We've extracted features features from the document and they now provide evidence of how likely or unlikely the class is.

Maximum a posteriori $ P(c|d) $ gives us a score of how likely class $ c $ is for the document. We'll compute this score for all classes $ c $ in our set of classes. The class that yields the highest score is picked as the most likely, or best class to assign to the document. The process is called maximum a posteriori estimation or map for short.

$ c_{map} = \underset{c \in C}{argmax}[ P(c|d) ] $

Because we don't care about the score itself, we drop the denominator $ P(d) $ from the equation. It is constant across all classes, so dividing by it in every class won't change the argmax among classes. $ P(c|d) \propto P(c) * P(d|c) $

Naivety

First, we'll choose a vector representation for the document $ d $ . In multionomial Naive Bayes, we represent a document $ d $ with length $ n_d $ as a vector of its tokens $ \left< t_1,t_2,...t_{n_d} \right> $ . We assume the tokens to be conditionally independent given the class. For example, seeing hong in a document doesn't provide information on whether we'll see kong, given the class. Additionally we assume the position in the document provides no additional information on the likelihood of a term occuring, given the class. These assumptions are naive, but they allow us to compute the likelihood of document $ d $ given the class $ c $ as:

$ P(d|c) = P(\left< t_1,...,t_{n_d} \right>|c) = \underset{1 \leq k \leq n_d }{\Pi} P(t_k|c) $

$ 1 \leq k \leq n_d $ iterates over the $ k $ tokens in the document $ P(t_k|c) $ is the frequency of the term at the k-th token in class $ c $ .

Exercises

Bag of Words representation

Our documents are currently stored as lists of strings.

# The first document in plain text
twenty_train.data[0]

In order to perform classification, we'll extract features from documents and transform them into feature vectors, as described in the following paragraph on SKLearn:

  1. Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
  2. For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j where j is the index of word w in the dictionary.

In the following snippet, a CountVectorizer performs tasks 1. and 2. on our twenty_train corpus.

count_vectorizer = CountVectorizer()
X_train, y_train = count_vectorizer.fit_transform(twenty_train.data), twenty_train.target

Here's a brief summary of the data layout and some of the members you'll have to access in the upcoming sections. For a full reference, check out the sklearn article Working With Text Data.

X_train : sparse matrix, shape = (n_documents, n_terms) A document-term feature matrix that stores the count of each term in each document. Each row represents a document, each column represents a term. X_train[i,j] is the number of times term#j occurs in document#i.

num_documents, num_terms = X_train.shape
first_document = X_train[0,:]

y_train : array, shape = (n_documents,) The target vector which indicates the class of each document. Each element in y_train is an integer {0,1, ... ,n_classes-1}. y_train[i] is the class of document#i.

# Classes of the first few documents
y_train[0:10]
# The names of each class
', '.join(twenty_train.target_names)
# The learned vocabulary, index in the list corresponds to index of the term
', '.join(count_vectorizer.get_feature_names()[0:10])

Task: Implement a function that recovers a document from its feature vector, i.e. its row in X_train. You can recover the terms used and how many times they occur, but other information is lost, like the order of words, capitalisation or punctuation.

def recover_document(document):
    # Parameters:
    # document:
    #    sparse matrix ; dtype = np.int64 ; shape = (1,length_vocabulary)
    #    A transformed document i.e. a row from X_train
    #
    # Returns:
    #   A list of words used in the document. words that occur multiple
    #   times in the document should occur repeatedly in the list.
    return words

Test your implementation:

recovered = recover_document(X_train[0,:])
assert recovered.count('car') == 5
assert all([recovered.count(term) == 1 for term in ('college', 'enlighten', 'neighborhood')])
assert all([recovered.count(term) == 0 for term in ('sheep', 'notepad')])

Frequencies of classes and terms

The relative frequency of a class $ c $ is the number of documents in class $ c $ divided by the total number of documents. $ P(c) = \frac{N_c}{N} $

The conditional probability of a term given the class $ P(t|c) $ is the relative frequency of the term within documents of that class. $ P(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct}} $

  • The numerator $ T_{ct} $ is the number of times term $ t $ occurs in documents of class $ c $ ,
  • The denominator is the total number of tokens in class $ c $ . Here it is computed as the sum of $ T_{ct} $ over all terms in the vocabulary $ t' \in V $

Task: Implement a function to learn the (unsmoothed) priors $ P(c) $ and conditional probabilities $ P(t|c) $ for all classes and terms.

Hint: As an approch, Manning suggests you concatenate all documents in a class $ c $ so you can more easily figure out the count of each term in $ c $ as well as the total number of tokens in $ c $

Think about how you can obtain such a concatenated document from our transformed document-term matrix.

def get_class_term_frequencies(X, y):
    # Learning process.
    # Copmutes the (unsmoothed) frequency of each class P(c)
    # and the conditional probability of each term given the class
    # for all classes and terms P(t|c)
    # Documents in X are generated with the multinomial model.
    #
    # Parameters:
    # X : sparse matrix ; shape = (num_documents, num_terms)
    #     a document-term matrix e.g. X_train
    # y : np.ndarray ; shape = (num_classes)
    #     the target vector
    #
    # Returns:
    # P(c) : np.ndarray ; shape = (num_classes)
    #     frequency of each class [0..1]
    # P(t|c) : np.ndarray ; shape = (num_classes, num_terms)
    #     conditional probability of each term given the class
    #     for all terms and classes
    priors, cond_probs = None,None
    raise NotImplementedError()
    return priors, cond_probs
priors, cond_probs = get_class_term_frequencies(X_train, y_train)

Verify your implementation:

num_classes = len(twenty_train.target_names)
num_terms = len(count_vectorizer.vocabulary_)

# test shapes
assert priors.shape == (num_classes,)
assert cond_probs.shape == (num_classes, num_terms)

# test probability distribution
np.testing.assert_almost_equal(priors.sum(), 1.0)
np.testing.assert_allclose(cond_probs.sum(axis=1), 1.0)

# test some vlaues
np.testing.assert_allclose(priors[[0,4,17]], [0.04242531, 0.05108715, 0.04984974])
np.testing.assert_allclose(cond_probs[[3,10]][:,[0,828,56442,130065]], np.array([[2.55697518e-04, 0.00000000e+00, 0.00000000e+00, 8.24830703e-06],
        [6.74787183e-04, 2.30696473e-05, 5.76741182e-06, 0.00000000e+00]]))

Classification

Recall the "score" of a class given the document is $ \begin{align} P(c|d) \propto & P(c) * P(d|c) \\ = & P(c) * \prod_{1 \leq k \leq n_d}P(t_k|c) \\ \end{align} $

And the best class to assign to a document is the one with the highest score among all classes. $ c_{map} = \underset{c \in C}{argmax} P(c|d) $

Let's set up a class for Naive Bayes classification. With this setup it's particular to the twenty_train dataset so we can more easily assess the target names and other members particular to this dataset. But of course you can create a generalised classifier.

class NaiveBayesMultinomial():
    def __init__(self, twenty_train):
        # transform text documents into feature vectors
        cv = CountVectorizer()
        X_train = cv.fit_transform(twenty_train.data)
        y_train = twenty_train.target
        # learn frequency of classes and terms given classes
        self.priors, self.cond_probs = get_class_term_frequencies(X_train, y_train)
        self.count_vectorizer = cv
        self.twenty_train = twenty_train

Task: Implement a function predict_sentence in the NaiveBayesMultinomial class. It should take a single sentence (string) as its input and return the name (string) of the most likely class to assign to that sentence.

Hint: You'll have to transform the sentence into its feature vector representation first. Relevant methods:

def predict_sentence(self, sentence):
    # sentence: str, A sentence to classificy
    # returns: str, name of the most likely class. names are listed in
    #    twenty_train.target_names
    raise NotImplementedError()
    return self.twenty_train.target_names[0]

NaiveBayesMultinomial.predict_sentence = predict_sentence

Verify your implementation.

clf = NaiveBayesMultinomial(twenty_train)

On short sentences:

assert 'sci.med' == clf.predict_sentence('patient hospital doctor')
assert 'sci.space' == clf.predict_sentence('planet jupiter moon')
assert 'alt.atheism' == clf.predict_sentence('god atheist')

On long sentences:

msg = "This test is expected to fail for a numerically unstable solution. We'll address this in a later section."
assert 'sci.med' == clf.predict_sentence('patient hospital doctor ' * 100), msg
assert 'sci.space' == clf.predict_sentence('planet jupiter moon ' * 100), msg
assert 'alt.atheism' == clf.predict_sentence('god atheist ' * 100), msg

Smoothing

Let's classify a sentence hospital cure doctor doctor [...], with doctor repeated 30 times. Given 30 occurences of the word doctor, it's hardly surprising the most likely class is sci.med. But what happens if we add the single word agreed to the sentence?

clf.predict_sentence('hospital cure' + ' doctor' * 30)
clf.predict_sentence('hospital cure' + ' doctor' * 30 + ' agreed')

Suddenly, talk.politics.misc is more likely. Our training data is limited, and it just so happens our train documents in sci.med have never seen the term 'agreed'. So the relative frequency of the term 'agreed' in 'sci.med' $ P(t=agreed \mid c=sci.med) = 0 $ .

Task: Review the equation for $ P(c|d) $ . Why is $ P(c=sci.med|d) $ zero for any document containing the term 'agreed'?

Task: Avoid zero-frequencies by applying add-one smoothing in your function that computes the conditional probabilities of each term given the class. Add-one smoothing is covered in Manning pp. 259 MAN09. We assume each term appears in each class at least once, then add the true number of occurences.

$ \hat{P}(t|c) = \frac{T_{ct} + 1}{\sum_{t' \in V} T_{ct} + B'} $

  • The hat $ \hat{P} $ means we're using an estimated/smoothed frequency
  • The denominator is $ T_{ct} $ , the number of times $ t $ occurs in $ c $ , plus one
  • The numerator is the total number of tokens in $ c $ plus a constant, in this model the length of the vocabulary $ B' $
def get_class_term_frequencies(X, y):
    # Uses smoothing
    priors, cond_probs = None,None
    raise NotImplementedError()
    return priors, cond_probs

Update your implementation and make sure your classifier uses the smoothed values.

priors, cond_probs = get_class_term_frequencies(X_train,y_train)
clf = NaiveBayesMultinomial(twenty_train)
num_classes = len(twenty_train.target_names)
num_terms = len(count_vectorizer.vocabulary_)

# test shapes
assert priors.shape == (num_classes,)
assert cond_probs.shape == (num_classes, num_terms)

# test probability distribution
np.testing.assert_almost_equal(priors.sum(), 1.0)
np.testing.assert_allclose(cond_probs.sum(axis=1), 1.0)

# no zero-frequencies
assert np.all(priors > 0)
assert np.all(cond_probs > 0)

# test some vlaues
np.testing.assert_allclose(cond_probs[[3,10]][:,[0,828,56442,130065]], np.array([[1.27315552e-04, 3.97861099e-06, 3.97861099e-06, 7.95722197e-06],
        [3.88803769e-04, 1.64747360e-05, 6.58989440e-06, 3.29494720e-06]]))

# ensure classifier uses the same probabilities
assert np.array_equal(priors, clf.priors)
assert np.array_equal(cond_probs, clf.cond_probs)

Numeric stability

Let's classify the sentence doctor patient. We would expect the most likely class to be sci.med.

clf.predict_sentence('doctor patient')

Now, let's classify the same sentence, but repeated many times. So the sentence is doctor patient, then doctor patient doctor patient doctor patient and so on.

[clf.predict_sentence('doctor patient  ' * repeat_count) for repeat_count in (1,10,20,50,100,1000)]

Given enough repititions, the prediction turns from sci.med to alt.atheism, which incidentally is the first class in our set of classes. Recall the equation for the score of a class

$ \begin{align} \hat{P}(c|d) \propto & P(c) * \hat{P}(d|c) \\ = & P(c) * \prod_{1 \leq k \leq n_d}\hat{P}(t_k|c) \\ \end{align} $ We multiply the probabilities $ P(doctor|c) $ and $ P(patient|c) $ many times over. Probabilites are values $ [0..1] $ , so the product of many probabilites will grow closer and closer to 0 and ultimately round down to 0 for any long enough sentence. A more numerically stable version uses the sum of logarithms of probabilities, rather than the product of probabilities:

$ \begin{align} \hat{P}(c|d) \propto & log(P(c)) + log(P(d|c)) \\ = & log(P(c)) + \sum_{1 \leq k \leq n_d} log(P(t_k|c)) \\ \end{align} $

This doesn't skew the result because we only care about the maximum among the classes - The log function is monotonous so the maximum doesn't change whether we use the product of probabilities or sum of log-probabilities.

Task: Update your predict_sentence function to be numerically stable and ensure your classifier uses the updated function.

def log_predict_sentence(self, sentence):
    raise NotImplementedError()

NaiveBayesMultinomial.predict_sentence = log_predict_sentence

Let's try the failed test from before again:

clf = NaiveBayesMultinomial(twenty_train)
msg = "This test is expected to fail for unsmoothed probabilities "
assert 'sci.med' == clf.predict_sentence('patient hospital doctor ' * 100), msg
assert 'sci.space' == clf.predict_sentence('planet jupiter moon ' * 100), msg
assert 'alt.atheism' == clf.predict_sentence('god atheist ' * 100), msg

As another test, compare your implementation against sklearn's multinomial Naive Bayes classifier:

from sklearn.naive_bayes import MultinomialNB
X_test, y_test = clf.count_vectorizer.transform(twenty_test.data), twenty_test.target
sk_nb = MultinomialNB().fit(X_train, y_train)
# use a subset of samples
samples = list(range(50))

sk_pred = sk_nb.predict(X_test[samples,:])
my_pred = [clf.predict_sentence(twenty_test.data[i]) for i in samples]
my_pred = np.array(list(twenty_train.target_names.index(c) for c in my_pred))
print('Our predictions: {}'.format(my_pred))
print('Sklearns predictions: {}'.format(sk_pred))
print('Match: {}'.format(np.mean(my_pred == sk_pred)))

The Bernoulli model

So far we've dealt with the multinomial model which models a document as the vector of its tokens $ \left< t_1, t_2, \dots t_{n_d} \right>, t_i \in V $ .

Now we turn to the Bernoulli model which chooses a binary representation: $ d = \left< e_1, e_2, ... e_M \right> , e_i \in \{0,1\} $ Each indicator $ e_i $ represents a term of the vocabulary. $ e_i $ is 1 if the $ i $ th term is present in the document, or 0 if it is absent.

We can again use a CountVectorizer to transform documents into this binary feature vector when we pass the argument binary=True to its constructor.

class NaiveBayesBernoulli():
    def __init__(self, twenty_train):
        cv = CountVectorizer(binary=True)
        X_train = cv.fit_transform(twenty_train.data)
        y_train = twenty_train.target
        self.priors, self.cond_probs = get_bernoulli_class_term_frequencies(X_train, y_train)
        self.count_vectorizer = cv
        self.twenty_train = twenty_train

Bernoulli - frequencies of classes and terms

The frequency of classes is computed the same way as before, with $ N_c $ as the number of documents in class $ c $ and $ N $ as the total number of documents. $ P(c) = \frac{N_c}{N} $

The smoothed conditional probability of a term given the class is different. We use the fraction of documents in $ c $ where the term $ t $ is present. $ \hat{P}(t|c) = \frac{N_{ct} + 1}{N_c + 2} $ where

  • the numerator is $ N_{ct} $ , the count of documents in $ c $ where $ t $ is present + 1 for smoothing
  • the denominator is the total count of documents in $ c $ , + 2 for smoothing

Task: Implement a function to learn the smoothed priors $ \hat{P}(c) $ and conditional probabilities $ \hat{P}(t|c) $ for all classes and terms using the Bernoulli model.

def get_bernoulli_class_term_frequencies(X,y):
    # Learning process of P(c) and P(t|c) for all terms and classes.
    # Documents in X are generated with the Bernoulli model.
    raise NotImplementedError()
    return priors, cond_probs

Verify your implementation.

clf_bernoulli = NaiveBayesBernoulli(twenty_train)
pc,ptc = clf_bernoulli.priors, clf_bernoulli.cond_probs

num_classes = len(twenty_train.target_names)
num_terms = len(count_vectorizer.vocabulary_)

# test shapes
assert pc.shape == (num_classes,)
assert ptc.shape == (num_classes, num_terms)

# test probability distribution
np.testing.assert_almost_equal(pc.sum(), 1.0)

# test smoothing
assert np.all(ptc > 0), "Not all conditional probabilites are greater than zero. Did you apply smooothing?"

# test some values
np.testing.assert_almost_equal(ptc[[3,10]][:,[0,828,56442,130065]].sum(), 0.10036926461345066)

Bernoulli - Classification

In the Bernoulli model we explicitly factor in the absence of a term in the score for a class. So the equation for the evidence $ P(d|c) $ becomes $ P(d|c) = \prod_{t_i \in V} \hat{P}(U_i = e_i \mid c) $ where $ t_i \in V $ iterates over all terms in the vocabulary $ \hat{P}(U_i = e_i \mid c) $ is $ \hat{P}(t_i|c) $ if the $ i $ th term is present in the document or $ 1 - \hat{P}(t_i \mid c) $ if the term is absent.

Plugging this into the equation for the best class $ c_{map} $ to assign to a document we get

$ c_{map} = \underset{c \epsilon C}{argmax} \left[ log(\hat{P}(c)) + \sum_{1 \leq i \leq M} log(\hat{P}(U = e_i|c)) \right] $

Task: Implement a function that assigns the most likely class to a sentence in the Bernoulli model.

def predict_sentence_bernoulli(sentence):
    # sentence : str, A sentence to classify.
    # Transforms the sentence according to the Bernoulli model and
    # returns the name of the best class.
    raise NotImplementedError()
    return twetny_train.target_names[0]

NaiveBayesBernoulli.predict_sentence = predict_sentence_bernoulli
clf_bernoulli.predict_sentence(twenty_test.data[5])

As a test, compare your implementation against SKLearn's Naive Bayes Bernoulli classifier:

from sklearn.naive_bayes import BernoulliNB
X_test, y_test = clf_bernoulli.count_vectorizer.transform(twenty_test.data), twenty_test.target
sk_nb = BernoulliNB().fit(X_train, y_train)
# use a subset of samples
samples = list(range(50))

sk_pred = sk_nb.predict(X_test[samples,:])
my_pred = [clf_bernoulli.predict_sentence(twenty_test.data[i]) for i in samples]
my_pred = np.array(list(twenty_train.target_names.index(c) for c in my_pred))
print('Our predictions: {}'.format(my_pred))
print('Sklearns predictions: {}'.format(sk_pred))
print('Match: {}'.format(np.mean(my_pred == sk_pred)))

Summary and Outlook

In this notebook you've implement classifiers for Naive Bayes classification, one using the multinomial model and another using the Bernoulli model.

Manning[MAN09] presents the implications of the model differences on p. 268, so perhaps you may want to construct some example documents to exhibit those effects.

Literature

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Naive Bayes
by Diyar Oktay
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Diyar Oktay

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.