# Text Classification - Naive Bayes (Part 1) - Multinomial Naive Bayes

## Introduction

This notebook deals with text classification using the Naive Bayes algorithm. Through gradual adjustments, we will transform Bayes' Theorem into an algorithm for finding the most likely class for a document. You'll then implement a classifier in Python that employs this learning method.

Naive Bayes classifiers come in a few variations which differ in how they represent documents. This notebook focuses on the multinomial model, which describes documents through the total occurrence count of each term.

## Requirements

### Knowledge

This notebook is based on Chapter 13 of Introduction to Information Retrieval [MAN09]. We will cover the essential aspects, but the literature is recommended for deeper reading.

It may also be helpful if you're comfortable with calculating probabilities.

### Python Modules

We use sklearn to fetch and preprocess documents as well as the ubiquitous numpy module.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

### Data

First, we load the 20 Newsgroups dataset, split into a training and a testing batch. It is a collection of roughly 18,000 newsgroup posts, each of which belongs to 1 of 20 classes.

Then, a CountVectorizer is fit to the training data. It learns a vocabulary that comprises all unique terms it encounters in the training documents.

Lastly, the CountVectorizer transforms the training and test documents into feature vectors which show the total occurrence count of each vocabulary term in the document.

twenty_train = fetch_20newsgroups('~/deep.TEACHING/data/newsgroups_dataset',
                                  subset='train',
                                  shuffle=True,
                                  random_state=42)
twenty_test = fetch_20newsgroups('~/deep.TEACHING/data/newsgroups_dataset',
                                 subset='test',
                                 shuffle=True,
                                 random_state=42)

count_vectorizer = CountVectorizer()
X_train = count_vectorizer.fit_transform(twenty_train.data)
y_train = twenty_train.target
X_test = count_vectorizer.transform(twenty_test.data)
y_test = twenty_test.target

### Data Layout

• X_train : sparse matrix, shape = (n_documents, n_terms)

• y_train : array, shape = (n_documents,)

• X_test, y_test : analogous to X_train and y_train, built from the test documents

## Multinomial Naive Bayes

### Conditional Probability

The conditional probability $P(A|B)$ tells us "What are the odds of A, knowing that B has already occurred?" In text classification, we have a class $c$ from a predefined set of classes $C$ and a document $d$. We want to know $P(c|d)$ - What are the odds $c$ is the correct class for the document, given our knowledge of $d$? Bayes' Theorem tells us that this probability is

$P(c|d) = \frac{P(c)P(d|c)}{P(d)}$

where

• $P(c)$ is the relative frequency of class $c$, also called the prior
• $P(d|c)$ is the evidence. It can be read as "What is the probability of this document, given the class $c$?" We'll elaborate on what that means.

We'll also omit $P(d)$ from the equation and justify this in an upcoming section.

$P(c|d) \propto P(c)P(d|c)$

Since we drop the denominator, we write $\propto$ (is proportional to).

### Naivety

Let's take a closer look at the evidence part of the equation, $P(d|c)$. We need to express our document as a vector of features. The kind of features we consider is up to us, and it is where the flavours of Naive Bayes classifiers differ. In Multinomial Naive Bayes, we represent a document as the feature vector $\left< t_1, t_2, \dots, t_{n_d} \right>$. The length of the document is $n_d$. The first token (or position) is $t_1$ and so on.

Token vs. term: A token is a position in a document. A term is a single element of the vocabulary. Each token in a document is occupied by a term. For example, a document that is 50 words long contains 50 tokens assuming all its words are known. When we transform an unfamiliar document into a feature vector after the vocabulary has been established, we skip all tokens that contain unfamiliar terms.
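The skipping of unfamiliar terms can be observed directly with CountVectorizer: a term missing from the learned vocabulary contributes nothing to the feature vector (a toy sketch):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["chinese beijing chinese"])   # vocabulary: {'beijing', 'chinese'}

# 'macao' is not in the vocabulary, so its token is dropped entirely
vec = vectorizer.transform(["chinese macao chinese"]).toarray()[0]
print(vec)  # counts only the two known terms
```

The transformed document counts 'beijing' zero times and 'chinese' twice; the unfamiliar token 'macao' simply disappears.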

So, what is the probability of this sequence of tokens, given the class? To make this question feasible to answer we'll assume the terms to be conditionally independent given the class. For example, we dismiss that encountering 'hong' affects how likely we are to encounter 'kong' when we already know the class. We also assume positional independence, so the conditional probability distribution of the terms is the same for all positions in the document. These are naive assumptions to make for text documents. But they allow us to calculate the conditional probability $P(d|c)$ by multiplying the conditional probabilities of its terms given $c$.

$P(d|c) = P(\left< t_1,...,t_{n_d} \right>|c) = \underset{1 \leq k \leq n_d }{\Pi} P(t_k|c)$

where $P(t_k|c)$ is the frequency of the term in $c$.

Example: Assume we have a one-sentence document "Chinese Beijing Chinese". How likely is 'china' to be the correct class for the document?

$c = china$
$d = \left< 'chinese', 'beijing', 'chinese' \right>$
$P(c|d) \propto P(c) * P(d|c)$
$= P(c) * P('chinese'|c) * P('beijing'|c) * P('chinese'|c)$
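With made-up numbers, the comparison between two classes might look like the sketch below. All probabilities here are assumptions chosen purely for illustration, not estimates from real data:

```python
# Hypothetical priors and conditional probabilities (illustrative only)
p = {
    'china': {'prior': 0.6, 'chinese': 0.5, 'beijing': 0.3},
    'japan': {'prior': 0.4, 'chinese': 0.2, 'beijing': 0.1},
}

# d = <'chinese', 'beijing', 'chinese'>
for c in ('china', 'japan'):
    score = p[c]['prior'] * p[c]['chinese'] * p[c]['beijing'] * p[c]['chinese']
    print(c, score)
```

With these numbers the class 'china' scores 0.045 against 0.0016 for 'japan', so 'china' would win the comparison.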

### Prior and evidence

Now we'll review how the classifier determines how suited a class is for a document.

$P(c|d) \propto \overset{1}{P(c)} * \overset{2}{\underset{1 \leq k \leq n_d}{\Pi} P(t_k|c)}$
1. prior: Without touching the document at all, we assume that an unfamiliar document is more likely to belong to a common class than a rare class.
2. evidence: We iterate over all the tokens in the document. At each token we find a term and consider how frequent it is in the class. When we encounter a term rare to the class, $P(t_k|c)$ is close to zero. Multiplying by it drags the probability of the entire document being in the class down considerably. Conversely, if the term at that token is very common in $c$, the overall probability of the document being in $c$ doesn't drop as much.

### Maximum a posteriori (MAP)

Assume we try to classify a document $d$ as either china or japan. On its own, $P(china|d)$ isn't very interesting at all. However, comparing $P(china|d)$ to $P(japan|d)$ tells us whether the document is more likely to belong to the class china or the class japan.

Therefore, we perform maximum a posteriori estimation. We calculate $P(c|d)$ for all classes and pick the single $c$ that produces the highest value.

$c_{map} = \underset{c \in C}{argmax} \left[ P(c|d) \right]$

This is good news. We don't care about the numeric probability of each class being correct; we only want to know which class is the most likely. That explains why we were allowed to drop $P(d)$ from the equation earlier: it is constant across all classes, so dividing by it has no effect on which class attains the maximum. Furthermore, this allows us to fix the problem of floating point underflow.

### Floating Point Underflow

Notice how we're multiplying $P(t_{k}|c)$ at each token in the document. Since each factor lies in $[0, 1]$, the product grows smaller and smaller, but our machines can only handle precision to an extent. For any sufficiently long document the product will eventually be so small it rounds down to 0, completely nullifying our efforts at classification.

We can work around this thanks to a useful property of logarithms: $log(xy) = log(x) + log(y)$. So instead of multiplying the probabilities, we'll sum the logarithms of the probabilities. Like before, we don't care about the probability of each class being correct but rather the maximum. Since the log function is monotonically increasing, multiplying the probabilities and summing their logs produce the same winning class.
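The underflow and the log fix can be demonstrated in a few lines (the per-token probability and document length below are arbitrary):

```python
import numpy as np

prob_per_token = 0.001   # arbitrary small per-token probability
n_tokens = 200           # a modest document length

product = prob_per_token ** n_tokens          # naive repeated multiplication
log_sum = n_tokens * np.log(prob_per_token)   # equivalent sum of logs

print(product)   # underflows to 0.0 in 64-bit floats
print(log_sum)   # a perfectly representable negative number
```

The product 0.001^200 = 10^-600 is far below the smallest positive 64-bit float (around 10^-308), so it collapses to exactly 0.0, while the log-sum stays finite and comparable across classes.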

$c_{map} = \underset{c \in C}{argmax} \left[ log(P(c)) + \sum_{1 \leq k \leq n_d} log(P(t_k|c)) \right]$

There's just one more issue we have to take care of.

### Smoothing

$P(t_k|c)$ answers "What is the probability of the term, given the class $c$?" Let's simulate a document $text_c$ which simply concatenates all the documents in $c$. Intuitively, $P(t_k|c)$ is the occurrence count of the term in $text_c$ divided by the length of $text_c$. What happens if a term doesn't occur in a class at all? If the training documents only feature 'nippon' in the class japan, $P('nippon'|china)$ is 0. In our initial formula we would multiply by 0, which is non-recoverable. So even the contrived document "China Chinese nippon Beijing China" would inevitably yield a score of 0 for the class 'china'.

To account for zero frequencies, we apply smoothing. By default, we'll assume each term occurs in each class once. To that, we'll add the true number of occurrences of $t$ in $c$. We'll divide by the length of the super-document plus the size of the vocabulary.

$\hat{P}(t|c) = \frac{T_{ct}+1}{(\sum_{t' \in V}{T_{ct'}}) + B'}$

where

• $\hat{P}(t|c)$ is the estimated frequency of the term $t$ in class $c$
• $T_{ct}$ is the number of times $t$ occurs in $c$
• $\sum_{t' \in V} T_{ct'}$ is the number of tokens in $c$, equivalent to the length of our $text_c$
• $B'$ is a smoothing constant. In the multinomial model it is the size of the vocabulary.
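As a small sketch, the smoothed estimate can be computed directly from raw counts. The counts below are made up for illustration:

```python
def smoothed_prob(term_count, tokens_in_class, vocab_size):
    """Laplace-smoothed estimate ^P(t|c) = (T_ct + 1) / (sum_t' T_ct' + B')."""
    return (term_count + 1) / (tokens_in_class + vocab_size)

# Made-up counts: the term occurs 5 times among the 8 tokens of the class,
# with a vocabulary of 6 distinct terms
print(smoothed_prob(5, 8, 6))   # (5 + 1) / (8 + 6) = 3/7

# A term that never occurs in the class still gets a non-zero probability
print(smoothed_prob(0, 8, 6))   # 1/14
```

The second call shows the point of smoothing: an unseen term yields a small but strictly positive probability instead of zeroing out the whole product.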

### Final Recap

We're now ready to calculate the best class $c_{map}$ to assign to a document $d$.

$c_{map} = \underset{c \in C}{argmax} \left[ log(\hat{P}(c)) + \sum_{1 \leq k \leq n_d} log(\hat{P}(t_k|c)) \right]$

where

• $\underset{c \in C}{argmax}$ means we perform the calculation for all classes $c$ and pick the $c$ that produces the maximum
• $\hat{P}(c) = \frac{N_c}{N}$ is the estimated frequency of the class, given by the number of documents in $c$ divided by the total number of documents
• $n_d$ is the length of the document
• $1 \leq k \leq n_d$ iterates over all tokens in $d$
• $t_k$ is the k-th position in the document and contains a term. $\hat{P}(t_k|c)$ is the estimated frequency of that term in $c$. (see Smoothing)

## Exercises

### Training

Below is a skeleton for a Multinomial Naive Bayes classifier with two methods, learn and predict. The docstrings describe what those methods are supposed to do. In the Testing section you'll find an example of how the class will be used.

Hint: You may find it helpful to simulate a document for each class which concatenates all documents in that class. This makes it straightforward to count the occurrences of each term in a class as well as the number of tokens in the class.
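One way to follow this hint (a sketch on a tiny made-up matrix; variable names are placeholders, not part of the exercise) is to sum the rows of the document-term matrix per class:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny stand-in for X_train / y_train: 4 documents, 3 terms, 2 classes
X = csr_matrix(np.array([[2, 1, 0],
                         [1, 0, 0],
                         [0, 1, 3],
                         [0, 0, 1]]))
y = np.array([0, 0, 1, 1])

for c in np.unique(y):
    # Summing the rows of class c yields the term counts of the
    # concatenated super-document text_c
    term_counts = np.asarray(X[y == c].sum(axis=0)).ravel()
    n_tokens = term_counts.sum()
    print(c, term_counts, n_tokens)
```

Each row sum gives $T_{ct}$ for every term at once, and its total gives the number of tokens in the class, which are exactly the quantities needed for the smoothed estimates.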

class NaiveBayesMultinomial:
    def __init__(self):
        pass

    def learn(self, X, y):
        """Learns the priors ^P(c) for each class as well as the
        conditional probabilities ^P(t|c) for each term and each
        class from X and y.
        Infers the vocabulary and set of predefined classes from
        X and y.
        :param X: A document-term matrix e.g. X_train
        :param y: A target vector e.g. y_train
        :return: A self-reference
        """
        return self

    def predict(self, X):
        """
        Predicts a class for each document in X
        :param X: A document-term matrix e.g X_test
        :return: A target vector containing the predicted categories
        for each document in X
        """
        return np.full(X.shape[0], -1)

### Testing

classifier = NaiveBayesMultinomial()
classifier.learn(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print('accuracy: {}'.format(accuracy))

### Comparing against sklearn's Multinomial NB

from sklearn.naive_bayes import MultinomialNB
sk_nb = MultinomialNB()
sk_pred = sk_nb.fit(X_train, y_train).predict(X_test)
sk_accuracy = np.mean(sk_pred == y_test)
print('sklearn accuracy: {}'.format(sk_accuracy))
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test, y_pred)
np.set_printoptions(linewidth=200)  # keep each matrix row on one line
print(conf_mat)

## Summary and Outlook

This has been an introduction to Naive Bayes classifiers using the multinomial model.

As foreshadowed, other models exist for different representations of documents. The Bernoulli model represents a document in a binary fashion, stating for each known term whether it is present or absent in the document. We'll implement a Bernoulli Naive Bayes classifier in Part 2.

## Literature

• [MAN09] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008.

## Licenses

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Text Classification - Naive Bayes (Part 1) - Multinomial Naive Bayes
by Diyar Oktay