Text Classification - Naive Bayes (Part 1) - Multinomial Naive Bayes

Introduction

This notebook deals with text classification using the Naive Bayes algorithm. Through gradual adjustments, we will transform Bayes' Theorem into an algorithm suited to finding the most likely class for a document. You'll then implement a classifier in Python that employs this learning method.

Naive Bayes classifiers come in a few variations which differ in how they represent documents. This notebook focuses on the multinomial model, which describes documents through the total occurrence count of each term.

Requirements

Knowledge

This notebook is based on Chapter 13 of Introduction to Information Retrieval [MAN09]. We will cover the essential aspects, but the book is recommended for deeper reading.

Familiarity with basic probability calculations is also helpful.

Python Modules

We use sklearn to fetch and preprocess the documents, as well as the ubiquitous numpy module.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

Data

First, we load the 20 Newsgroups dataset split into a training and a test set. It is a collection of roughly 18,000 newsgroup articles, each of which belongs to one of 20 classes.

Then, a CountVectorizer is fit to the training data. It learns a vocabulary that comprises all unique terms it encounters in the training documents.

Lastly, the CountVectorizer transforms the training and test documents into feature vectors which record the total occurrence count of each vocabulary term in the document.

twenty_train = fetch_20newsgroups(data_home='~/deep.TEACHING/data/newsgroups_dataset',
                                  subset='train',
                                  shuffle=True,
                                  random_state=42)
twenty_test = fetch_20newsgroups(data_home='~/deep.TEACHING/data/newsgroups_dataset',
                                 subset='test',
                                 shuffle=True,
                                 random_state=42)

count_vectorizer = CountVectorizer()
X_train = count_vectorizer.fit_transform(twenty_train.data)
y_train = twenty_train.target
X_test = count_vectorizer.transform(twenty_test.data)
y_test = twenty_test.target

Data Layout

  • X_train : sparse document-term matrix, shape = (n_documents, n_terms)

  • y_train : array of class indices, shape = (n_documents,)

  • X_test, y_test : the analogous matrix and target vector for the test documents (X_test is built with the vocabulary learned from the training data); see the short inspection sketch below
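
To make the document-term representation concrete, here is a minimal sketch on a made-up toy corpus (the sentences and variable names are purely illustrative; get_feature_names_out assumes a recent scikit-learn version, older releases use get_feature_names instead):

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented purely for illustration
toy_docs = ["Chinese Beijing Chinese",
            "Chinese Chinese Shanghai",
            "Tokyo Japan Chinese"]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)

print(toy_vectorizer.get_feature_names_out())  # the learned vocabulary (terms)
print(toy_X.toarray())                         # per-document term counts
print(toy_X.shape)                             # (n_documents, n_terms)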

Multinomial Naive Bayes

Conditional Probability

The conditional probability $P(A|B)$ tells us "What are the odds of A, knowing that B has already occurred?" In text classification, we have a class $c$ from a predefined set of classes $C$ and a document $d$. We want to know $P(c|d)$ - what are the odds that $c$ is the correct class for the document, given our knowledge of $d$? Bayes' Theorem tells us that this probability is

$$P(c|d) = \frac{P(c)\,P(d|c)}{P(d)}$$

where

  • $P(c)$ is the relative frequency of class $c$, also called the prior
  • $P(d|c)$ is the likelihood. It can be read as "What is the probability of this document, given the class $c$?" We'll elaborate on what that means.

We'll also omit $P(d)$ from the equation and justify this in an upcoming section.

$$P(c|d) \propto P(c)\,P(d|c)$$

Since we drop the denominator, we write $\propto$ ("is proportional to") instead of $=$.

Naivety

Let's take a closer look at the likelihood part of the equation, $P(d|c)$. We need to express our document as a vector of features. The kind of features we consider is up to us, and it is where the flavours of Naive Bayes classifiers differ. In Multinomial Naive Bayes, we represent a document as the feature vector $\left< t_1, t_2, \ldots, t_{n_d} \right>$. The length of the document is $n_d$. The first token (or position) is $t_1$ and so on.

Token vs. term: A token is a position in a document. A term is a single element of the vocabulary. Each token in a document is occupied by a term. For example, a document that is 50 words long contains 50 tokens. When we transform an unfamiliar document into a feature vector after the vocabulary has been established, we skip all tokens whose terms are not in the vocabulary.
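
A minimal sketch of this skipping behaviour, using a made-up two-word vocabulary (the sentences are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# The vocabulary is learned from a single toy training sentence: {'beijing', 'chinese'}
vec = CountVectorizer().fit(["Chinese Beijing Chinese"])

# 'tokyo' is not in the vocabulary, so that token is simply ignored
print(vec.transform(["Chinese Tokyo Chinese"]).toarray())  # [[0 2]]  (columns: 'beijing', 'chinese')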

So, what is the probability of this sequence of tokens, given the class? To make this question feasible to answer, we'll assume the terms to be conditionally independent given the class. For example, we dismiss that encountering 'hong' affects how likely we are to encounter 'kong' when we already know the class. We also assume positional independence, so the conditional probability distribution of the terms is the same for all positions in the document. These are naive assumptions to make for text documents. But they allow us to calculate the conditional probability $P(d|c)$ by multiplying the conditional probabilities of its terms given $c$.

$$P(d|c) = P(\left< t_1, \ldots, t_{n_d} \right> | c) = \underset{1 \leq k \leq n_d}{\Pi} P(t_k|c)$$

where $P(t_k|c)$ is the probability of the term at token $k$, given class $c$ (estimated from how frequently the term occurs in $c$).

Example: Assume we have a one-sentence document "Chinese Beijing Chinese". How likely is 'china' to be the correct class for the document?

$c = china$
$d = \left< 'chinese', 'beijing', 'chinese' \right>$
$P(c|d) \propto P(c) \cdot P(d|c)$
$\propto P(c) \cdot P('chinese'|c) \cdot P('beijing'|c) \cdot P('chinese'|c)$
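
To see the factorization in numbers, here is a tiny sketch with invented probabilities (the values are not estimated from any data, they only illustrate the product):

# Invented example values, purely for illustration
prior_china = 0.6                          # P(c)
p_term = {'chinese': 0.5, 'beijing': 0.1}  # P(t|c) for the two terms

# P(c|d) is proportional to P(c) * P('chinese'|c) * P('beijing'|c) * P('chinese'|c)
score = prior_china * p_term['chinese'] * p_term['beijing'] * p_term['chinese']
print(score)  # 0.015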

Prior and likelihood

Now we'll review how the classifier determines how suited a class is for a document.

$$P(c|d) \propto \overset{1}{P(c)} \cdot \overset{2}{\underset{1 \leq k \leq n_d}{\Pi} P(t_k|c)}$$
  1. prior: Without touching the document at all, we assume that an unfamiliar document is more likely to belong to a common class than a rare class.
  2. likelihood: We iterate over all the tokens in the document. At each token we find a term and consider how frequent it is in the class. When we encounter a term rare to the class, $P(t_k|c)$ is close to zero. Multiplying by it drags the probability of the entire document being in the class down considerably. Conversely, if the term at that token is very common in $c$, the overall probability of the document being in $c$ doesn't drop as much.

Maximum a posteriori (MAP)

Assume we try to classify a document $d$ as either china or japan. On its own, $P(china|d)$ isn't very interesting at all. However, comparing $P(china|d)$ to $P(japan|d)$ tells us whether the document is more likely to belong to the class china or the class japan.

Therefore, we perform maximum a posteriori estimation. We calculate $P(c|d)$ for all classes and pick the single $c$ that produces the highest value.

$$c_{map} = \underset{c \in C}{\operatorname{argmax}} \left[ P(c|d) \right]$$

This is good news. We don't care about the numeric probability of each class being correct, we only want to know which class is the most likely. That explains why we were allowed to drop $P(d)$ from the equation earlier - it is constant across all classes, so dividing by it has no effect on which class attains the maximum. Furthermore, this allows us to fix the problem of floating point underflow.
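
A quick sketch of that argument, using made-up unnormalized class scores:

import numpy as np

# Made-up unnormalized scores P(c) * P(d|c) for three classes
scores = np.array([3e-7, 8e-7, 1e-7])
evidence = scores.sum()              # a stand-in for P(d), identical for every class

print(np.argmax(scores))             # 1
print(np.argmax(scores / evidence))  # still 1 -> dividing by P(d) changes nothing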

Floating Point Underflow

Notice how we're multiplying $P(t_k|c)$ at each token in the document. Since this probability lies in $[0, 1]$, the product will grow smaller and smaller, but our machines can only handle precision to an extent. For any document that is long enough, the product will eventually be so small it rounds down to 0, completely nullifying our efforts at classification.

We can work around this thanks to a useful property of logarithms: $\log(xy) = \log(x) + \log(y)$. So instead of multiplying the probabilities, we'll sum the logarithms of the probabilities. Like before, we don't care about the probability of each class being correct but rather the maximum. Since the log function is monotonically increasing, multiplying the probabilities versus summing their logs will produce the same winning class.

$$c_{map} = \underset{c \in C}{\operatorname{argmax}} \left[ \log(P(c)) + \sum_{1 \leq k \leq n_d} \log(P(t_k|c)) \right]$$
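
A minimal demonstration of the underflow and of the log fix, using made-up term probabilities:

import numpy as np

# 2000 made-up term probabilities, each fairly small
probs = np.full(2000, 0.01)

print(np.prod(probs))         # 0.0 -> the product underflows to zero
print(np.sum(np.log(probs)))  # about -9210.34 -> easily representable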

There's just one more issue we have to take care of.

Smoothing

$P(t_k|c)$ answers "What is the probability of the term, given the class $c$?" Let's simulate a document $text_c$ which simply concatenates all the documents in $c$. Intuitively, $P(t_k|c)$ is the occurrence count of the term in $text_c$ divided by the length of $text_c$. What happens if a term doesn't occur in a class at all? If the training documents only feature 'nippon' in the class japan, $P('nippon'|china)$ is 0. In our initial formula we would multiply by 0, which is non-recoverable. So even the contrived document "China Chinese nippon Beijing China" would inevitably yield a score of 0 for the class 'china'.

To account for zero frequencies, we apply smoothing. By default, we'll assume each term occurs in each class once. To that, we'll add the true number of occurrences of $t$ in $c$. We'll divide by the length of the super-document plus the size of the vocabulary. A small numeric sketch follows the symbol definitions below.

$$\hat{P}(t|c) = \frac{T_{ct}+1}{\left(\sum_{t' \in V}{T_{ct'}}\right) + B'}$$

where

  • $\hat{P}(t|c)$ is the estimated probability of the term $t$ in class $c$
  • $T_{ct}$ is the number of times $t$ occurs in $c$
  • $\sum_{t' \in V} T_{ct'}$ is the number of tokens in $c$, equivalent to the length of our $text_c$
  • $B'$ is a constant for smoothing. In the multinomial model it is the size of the vocabulary.
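
Here is the promised sketch: smoothed estimates for one class with made-up term counts (the numbers are purely illustrative):

import numpy as np

# Made-up term counts T_ct for one class over a vocabulary of 4 terms;
# the term at index 1 never occurs in this class
T_c = np.array([5, 0, 2, 1])
B = len(T_c)                          # B' = size of the vocabulary

p_hat = (T_c + 1) / (T_c.sum() + B)   # smoothed estimates ^P(t|c)
print(p_hat)                          # no zero entries anymore
print(p_hat.sum())                    # ~1.0 -> the estimates still form a probability distribution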

Final Recap

We're now ready to calculate the best class $c_{map}$ to assign to a document $d$. A small end-to-end example follows the symbol definitions below.

$$c_{map} = \underset{c \in C}{\operatorname{argmax}} \left[ \log(\hat{P}(c)) + \sum_{1 \leq k \leq n_d} \log(\hat{P}(t_k|c)) \right]$$

where

  • $\underset{c \in C}{\operatorname{argmax}}$ means we perform the calculation for all classes $c$ and pick the $c$ that produces the maximum
  • $\hat{P}(c) = \frac{N_c}{N}$ is the estimated prior of the class, given by the number of documents in $c$ divided by the total number of documents
  • $n_d$ is the length of the document
  • $1 \leq k \leq n_d$ iterates over all tokens in $d$
  • $t_k$ is the k-th position in the document and contains a term. $\hat{P}(t_k|c)$ is the estimated probability of that term in $c$ (see Smoothing).
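
Here is the promised end-to-end sketch. The training sentences, the test sentence, and all variable names are invented for illustration (in the spirit of the china/japan example above), and the tokenization is a simple lowercase split rather than CountVectorizer's:

import numpy as np

# Toy training data, invented for illustration: 4 labelled one-sentence documents
docs = [("Chinese Beijing Chinese", "china"),
        ("Chinese Chinese Shanghai", "china"),
        ("Chinese Macao", "china"),
        ("Tokyo Japan Chinese", "japan")]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

classes = ["china", "japan"]
vocab = sorted({t for text, _ in docs for t in text.lower().split()})
B = len(vocab)                        # vocabulary size
N = len(docs)                         # total number of training documents

scores = {}
for c in classes:
    class_docs = [text.lower().split() for text, label in docs if label == c]
    N_c = len(class_docs)
    tokens_c = [t for tokens in class_docs for t in tokens]   # the "super-document" text_c
    T_c = {t: tokens_c.count(t) for t in vocab}               # term counts in class c

    score = np.log(N_c / N)                                   # log ^P(c)
    for t in test_doc.lower().split():
        if t in vocab:                                        # unknown terms are skipped
            score += np.log((T_c[t] + 1) / (len(tokens_c) + B))
    scores[c] = score

print(scores)
print(max(scores, key=scores.get))    # the winning class: 'china'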

Exercises

Training

Here is a setup for a Multinomial Naive Bayes classifier with two methods, learn and predict. The docstrings explain what those methods are supposed to do. In the Testing section you'll find an example of how the class will be used.

Hint: You may find it helpful to simulate a document for each class which concatenates all documents in that class. This makes it straightforward to count the occurrences of each term in a class as well as the number of tokens in the class.
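
One possible way to realize that hint, sketched with illustrative names that are not part of the exercise skeleton: summing the count vectors of all documents belonging to a class yields the per-class term counts in one step.

import numpy as np

def per_class_term_counts(X, y):
    """Illustrative helper: T[i, j] = how often term j occurs in class classes[i]."""
    classes = np.unique(y)
    # Sum the rows of the sparse document-term matrix class by class
    T = np.vstack([np.asarray(X[y == c].sum(axis=0)).ravel() for c in classes])
    return classes, T

T[i].sum() then gives the number of tokens in class classes[i], which is the denominator needed for smoothing.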

class NaiveBayesMultinomial:
    def __init__(self):
        pass

    def learn(self, X, y):
        """Learns the priors ^P(c) for each class as well as the
        conditional probabilities ^P(t|c) for each term and each
        class from X and y.
        Infers the vocabulary and set of predefined classes from
        X and y.
        :param X: A document-term matrix e.g. X_train
        :param y: A target vector e.g. y_train
        :return: A self-reference
        """
        return self

    def predict(self, X):
        """
        Predicts a class for each document in X
        :param X: A document-term matrix e.g X_test
        :return: A target vector containing the predicted categories
        for each document in X
        """
        # Placeholder: predict class -1 for every document until implemented
        return np.full(X.shape[0], -1)

Testing

classifier = NaiveBayesMultinomial()
classifier.learn(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print('accuracy: {}'.format(accuracy))

Comparing against sklearn's Multinomial NB

# Comparing against SKlearn's Multinomial NB
from sklearn.naive_bayes import MultinomialNB
sk_nb = MultinomialNB()
sk_pred = sk_nb.fit(X_train, y_train).predict(X_test)
sk_accuracy = np.mean(sk_pred == y_test)
print('sklearn accuracy: {}'.format(sk_accuracy))
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test, y_pred)  # confusion matrix of our classifier's predictions
np.set_printoptions(linewidth=float('inf'))
print(conf_mat)

Summary and Outlook

This has been an introduction to Naive Bayes classifiers using the multinomial model.

As foreshadowed, other models exist for different representations of documents. The Bernoulli model represents a document in a binary fashion, stating for each known term whether it is present or absent in the document. We'll implement a Bernoulli Naive Bayes classifier in Part 2.

Literature

[MAN09] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to Information Retrieval. Cambridge University Press, 2009.

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Text Classification - Naive Bayes (Part 1) - Multinomial Naive Bayes
by Diyar Oktay
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Diyar Oktay

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.