# ML-Fundamentals - Evaluation Metrics

## Introduction

Building a model to predict continuous values (e.g. the temperature) or membership of a class is only useful for real-world problems when we can measure how good (or bad) the model's predictions are. In this notebook, you will therefore get to know (and implement) the most basic concepts of evaluating machine-learning models.

## Requirements

### Knowledge

You should have a basic knowledge of:

• numpy

Suitable sources for acquiring this knowledge are:

### Python Modules

By deep.TEACHING convention, all python modules needed to run the notebook are loaded centrally at the beginning.

# External Modules
import numpy as np
import hashlib
import matplotlib.pyplot as plt
def round_and_hash(value, precision=4, dtype=np.float32):
    """
    Function to round and hash a scalar or numpy array of scalars.
    Used to compare results with true solutions without spoiling the solution.
    """
    rounded = np.array([value], dtype=dtype).round(decimals=precision)
    hashed = hashlib.md5(rounded).hexdigest()
    return hashed

## Exercises

### Confusion Matrix

A confusion matrix is often used in binary classification tasks where we only have 2 classes (positive, negative), but it can also be constructed when we have more classes. The green elements mark the correct classifications. In some cases, some classes are more similar to one another than others (e.g. C1 might be less different from C2 than from C3), which is indicated here by the intensity of the red color.

The variable y_pred holds the predicted labels for 20 examples. The variable y_true contains the true labels. A 0 means positive, -1 means negative. If the first element in y_pred and the first element in y_true are both 0, it is a true positive, and so on.

Implement the function to calculate the confusion matrix for given data. Your function should return a 2D matrix:

• confmatrix[0,0] should contain the number of true positives
• confmatrix[1,1] should contain the number of true negatives
• confmatrix[1,0] should contain the number of false negatives
• confmatrix[0,1] should contain the number of false positives
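One possible way to fill such a matrix is sketched below on a small made-up data set, so the exercise data stays untouched. The function name `confusion_matrix_sketch`, its `pos_label` / `neg_label` parameters and the toy arrays are assumptions introduced only for this illustration; they are not part of the notebook.

```python
import numpy as np

def confusion_matrix_sketch(y_pred, y_true, pos_label=0, neg_label=-1):
    """Count the four outcomes and arrange them as [[tp, fp], [fn, tn]]."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    # Each count is the number of positions where both conditions hold.
    tp = np.sum((y_pred == pos_label) & (y_true == pos_label))
    fp = np.sum((y_pred == pos_label) & (y_true == neg_label))
    fn = np.sum((y_pred == neg_label) & (y_true == pos_label))
    tn = np.sum((y_pred == neg_label) & (y_true == neg_label))
    return np.array([[tp, fp], [fn, tn]], dtype=np.int32)

# Made-up toy data: 0 = positive, -1 = negative (same convention as above).
toy_pred = np.array([0, 0, -1, -1, 0, -1])
toy_true = np.array([0, -1, -1, 0, 0, -1])
toy_conf = confusion_matrix_sketch(toy_pred, toy_true)
print(toy_conf)  # [[2 1]
                 #  [1 2]]
```

Boolean masks combined with `&` keep the counting free of explicit loops; a loop over the label pairs would work just as well.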
dataset_size = 20

### use this for random data
# (np.random.random_integers is deprecated; np.random.randint excludes the upper bound)
#y_pred = np.random.randint(low=0, high=2, size=dataset_size)
#y_true = np.random.randint(low=0, high=2, size=dataset_size)

### use this for fixed data (can be automatically evaluated with assertions)
y_pred = np.array([+0, +0, -1, -1, +0, +0, -1, +0, -1, +0, -1, -1, +0, +0, +0, -1, +0, -1, -1, -1])
y_true = np.array([-1, +0, +0, -1, -1, -1, -1, -1, -1, +0, +0, -1, -1, -1, +0, +0, -1, -1, -1, +0])

print(y_pred)
print(y_true)
def get_confusion_matrix_2_classes(y_pred, y_true):
    # Layout: [[tp, fp], [fn, tn]]; np.zeros avoids the uninitialised memory of np.ndarray
    conf = np.zeros([2, 2], dtype=np.int32)

    return conf
conf_matrix = get_confusion_matrix_2_classes(y_pred, y_true)
print(conf_matrix)
print(np.array([['tp', 'fp'], ['fn', 'tn']]))
assert round_and_hash(conf_matrix) == '52ca17a7de673a7e78903f6a8ea91a0c'

#### Sidenote

A confusion matrix contains nearly all relevant information about the predictions our model has made. However, directly comparing two or more confusion matrices of different models is hardly practical. Especially for human readability and interpretability, it is necessary to discard unnecessary information. The rest of this notebook is about metrics that are directly comparable.

### Accuracy

Probably the most intuitive metric is the accuracy. The accuracy specifies what percentage of our predictions are correct.

For two classes:

$accuracy = \frac{tp + tn}{tp + tn + fp + fn}$

And in the general case with $n$ classes, where $h_{ij}$ is the entry in row $i$ and column $j$ of the confusion matrix:

$accuracy = \frac{\sum_{i=1}^n h_{ii}}{\sum_{i=1}^n\sum_{j=1}^n h_{ij}}$

Implement the function to calculate the accuracy for the two class case, based on your confusion matrix.

Optional:

Implement your function to be able to process confusion matrices with $n$ classes, with $n \ge 2$.
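The general formula is the sum of the diagonal divided by the sum of all entries. A minimal sketch, using a made-up three-class matrix and the hypothetical helper name `accuracy_sketch` (neither is part of the notebook), could look like this:

```python
import numpy as np

def accuracy_sketch(conf_matrix):
    """Sum of the diagonal (correct predictions) over the sum of all entries."""
    conf_matrix = np.asarray(conf_matrix, dtype=float)
    return np.trace(conf_matrix) / conf_matrix.sum()

# Made-up confusion matrix for three classes.
toy_3_classes = np.array([[5, 1, 0],
                          [2, 6, 1],
                          [0, 1, 4]])
toy_accuracy = accuracy_sketch(toy_3_classes)
print(toy_accuracy)  # 15 / 20 = 0.75
```

Because only the diagonal and the total are used, the same function covers the two-class case.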

def calc_accuracy(conf_matrix):
    raise NotImplementedError()

#### Sidenote

Although the accuracy is very intuitive it might not be suitable for all classification tasks. Problems are:

• Unbalanced classes. Suppose class A has 1,000 times more examples than class B. A model that classifies everything as class A, without looking at any features, reaches 99.9% accuracy, but you can hardly call it a machine-learning model.
• Misclassifying a true class-B sample as class A may have severe consequences. Let class A be fish in the water and class B naval mines. When you are in a submarine, you want a naval mine detector to identify as many of the true mines as possible. As a trade-off, it may predict more fish as mines and achieve only 95% accuracy.
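The first point can be checked with a few lines of arithmetic. The sketch below uses made-up labels (0 for class A, 1 for class B) and a "model" that always answers class A; all names are assumptions chosen for this illustration.

```python
import numpy as np

# Hypothetical data: 1000 examples of class A (label 0) and 1 example of class B (label 1).
imbalanced_true = np.array([0] * 1000 + [1])
# A "model" that ignores all features and always predicts class A.
always_a_pred = np.zeros_like(imbalanced_true)

baseline_accuracy = np.mean(always_a_pred == imbalanced_true)
class_b_found = np.sum((always_a_pred == 1) & (imbalanced_true == 1))
print(round(baseline_accuracy, 4))  # 0.999
print(class_b_found)                # 0
```

The accuracy looks excellent even though not a single class-B example is found.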
accuracy = calc_accuracy(conf_matrix=conf_matrix)
print(accuracy)

assert round_and_hash(accuracy) == 'eaff9b69a66af6d38e881cfcce709153'

### Recall, Precision, Specificity and Fall-Out

The following metrics set a specific field of a binary class confusion matrix in relation to another field. They can be hard to memorize, but the picture might help with that.

#### Acronyms:

• Recall $\frac{TP}{TP + FN}$: also called sensitivity, hit rate, or true positive rate (TPR)
• Precision $\frac{TP}{TP + FP}$: also called positive predictive value (PPV)
• Specificity $\frac{TN}{TN + FP}$: also called selectivity or true negative rate (TNR)
• Fall-out $\frac{FP}{FP + TN}$: also called false positive rate (FPR)

The other combinations, negative predictive value (NPV) $\frac{TN}{TN + FN}$, false negative rate (FNR) $\frac{FN}{FN + TP}$, false discovery rate (FDR) $\frac{FP}{FP + TP}$ and false omission rate (FOR) $\frac{FN}{FN + TN}$, also exist, but the four mentioned in the heading, and especially recall and precision, are the ones most commonly used.
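As an illustration, the four rates from the heading can be read off a binary confusion matrix laid out as [[tp, fp], [fn, tn]]. The helper name `basic_rates_sketch` and the toy matrix are assumptions made for this sketch, and it does not guard against empty denominators.

```python
import numpy as np

def basic_rates_sketch(conf_matrix):
    """Return recall, precision, specificity and fall-out for [[tp, fp], [fn, tn]]."""
    (tp, fp), (fn, tn) = np.asarray(conf_matrix, dtype=float)
    return {
        "recall": tp / (tp + fn),         # TPR
        "precision": tp / (tp + fp),      # PPV
        "specificity": tn / (tn + fp),    # TNR
        "fall_out": fp / (fp + tn),       # FPR
    }

# Made-up matrix: tp=8, fp=2, fn=4, tn=6.
toy_rates = basic_rates_sketch(np.array([[8, 2], [4, 6]]))
for name, value in toy_rates.items():
    print(name, round(float(value), 4))
```

Note that specificity and fall-out always add up to 1, since both share the denominator $TN + FP$.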

### Balanced Accuracy

Balanced accuracy should be favoured, when classes are unbalanced. It is calculated with:

$accuracy_{balanced} = \frac{TPR + TNR}{2}$

Consider the confusion matrix fish_and_mines_prediction, in which the model classifies almost all objects as fish (negative) and misses one of the two mines (positive), producing a false negative.

Implement the method to calculate $accuracy_{balanced}$.

• What is the value for the accuracy?
• What is the value for the balanced accuracy?
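To see how the two numbers can diverge without giving away the answers above, here is a sketch on a different, made-up imbalanced matrix. The names `balanced_accuracy_sketch` and `toy_imbalanced` are assumptions for this illustration only.

```python
import numpy as np

def balanced_accuracy_sketch(conf_matrix):
    """Mean of TPR and TNR for a matrix laid out as [[tp, fp], [fn, tn]]."""
    (tp, fp), (fn, tn) = np.asarray(conf_matrix, dtype=float)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2

# Made-up matrix: 10 true positives in total, of which only 1 is found.
toy_imbalanced = np.array([[1, 10], [9, 980]])
plain_accuracy = np.trace(toy_imbalanced) / toy_imbalanced.sum()
toy_balanced = balanced_accuracy_sketch(toy_imbalanced)
print(plain_accuracy)          # 0.981
print(round(toy_balanced, 4))  # 0.5449
```

The plain accuracy is dominated by the large negative class, while the balanced accuracy exposes the poor recall on the positive class.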
fish_and_mines_prediction = np.array([[1,0],[1,999]])
print(fish_and_mines_prediction)
def calc_balanced_accuracy(conf_matrix):
    raise NotImplementedError()
acc = calc_accuracy(fish_and_mines_prediction)
print('accuracy: ', acc)
balanced_acc = calc_balanced_accuracy(fish_and_mines_prediction)
print('balanced accuracy: ', balanced_acc)

assert round_and_hash(balanced_acc) == '028aa23bdcb0575befa15321df88425e'

### F1-Score

Another well known measure is the F1-Score (also F-Score / F-Measure). It is defined as:

$F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

• Implement the method to calculate the F1-Score
• What is the F1-Score for fish_and_mines_prediction for class mine?
• What is the F1-Score if we consider fish as positive class?

• For this task also implement the function to reverse the positives and negatives of a confusion matrix

• Do not reverse the matrix in-place. Create a new one and return it.
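The following sketch shows one way the two pieces could fit together, again on a made-up matrix rather than on fish_and_mines_prediction. The names `f1_sketch` and `swap_classes_sketch` are hypothetical. Swapping the roles of the classes turns [[tp, fp], [fn, tn]] into [[tn, fn], [fp, tp]], which is the same as reversing both axes.

```python
import numpy as np

def f1_sketch(conf_matrix):
    """Harmonic mean of precision and recall for [[tp, fp], [fn, tn]]."""
    (tp, fp), (fn, tn) = np.asarray(conf_matrix, dtype=float)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def swap_classes_sketch(conf_matrix):
    """Return a new matrix in which the old negative class is the positive class."""
    # Reversing rows and columns maps [[tp, fp], [fn, tn]] to [[tn, fn], [fp, tp]];
    # copy() makes sure the input is not modified through a shared view.
    return np.asarray(conf_matrix)[::-1, ::-1].copy()

toy_matrix = np.array([[8, 2], [4, 6]])
toy_swapped = swap_classes_sketch(toy_matrix)
print(toy_swapped)                       # [[6 4]
                                         #  [2 8]]
print(round(f1_sketch(toy_matrix), 4))   # 0.7273
print(round(f1_sketch(toy_swapped), 4))  # 0.6667
```

The F1-Score ignores true negatives, which is why it changes when the positive class is swapped.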
def calc_f1_score(conf_matrix):
    raise NotImplementedError()

def reverse_pos_and_neg_of_conf_matrix(conf_matrix):
    raise NotImplementedError()
f1score_mine = calc_f1_score(fish_and_mines_prediction)
print(f1score_mine)

assert round_and_hash(f1score_mine) == '6602c18aac262cc65614adb27ed43b2d'
fish_and_mines_reverse = reverse_pos_and_neg_of_conf_matrix(fish_and_mines_prediction)
f1score_fish = calc_f1_score(fish_and_mines_reverse)
print(f1score_fish)

assert round_and_hash(f1score_fish) == 'd4d4c5275731712b629968e3e8d1951e'

### ROC

In most models, there are one or more parameters that allow you to adjust the model to predict more true positives, at the cost of also predicting more false positives. In our example with the fish and mines, adjusting a parameter to find more of the true mines will most likely also cause some fish to be predicted as mines.

The receiver operating characteristic (ROC) curve is a way to visualize this trade-off. To draw it, you set your parameter(s) to, say, 10 different values and predict your data with each setting. For each of the 10 experiments you calculate the FPR and the TPR and plot the results in a 2D diagram, with the FPR on the x-axis and the TPR on the y-axis. If the resulting curve is a straight line from the bottom left to the top right, your model predicts completely at random, which is the worst case. The better your model is, the steeper the slope at the beginning.
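A common example of such a parameter is a decision threshold on a classifier score. The sketch below assumes exactly that: the scores, labels (1 = positive, 0 = negative), thresholds and the helper name `roc_points_sketch` are all made up for illustration and do not come from the notebook.

```python
import numpy as np

# Hypothetical scores (higher = more likely positive) and true labels.
toy_scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
toy_labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])

def roc_points_sketch(scores, labels, thresholds):
    """Return one (FPR, TPR) pair per threshold."""
    points = []
    for threshold in thresholds:
        pred = scores >= threshold  # predict positive at or above the threshold
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        tn = np.sum(~pred & (labels == 0))
        points.append((float(fp / (fp + tn)), float(tp / (tp + fn))))
    return points

roc_points = roc_points_sketch(toy_scores, toy_labels,
                               thresholds=[1.0, 0.75, 0.5, 0.25, 0.0])
print(roc_points)
# [(0.0, 0.0), (0.0, 0.5), (0.5, 0.75), (0.75, 1.0), (1.0, 1.0)]
```

Lowering the threshold moves the point up and to the right: more true positives are found, but more false positives are produced as well.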

Picture inspired by [SHA15]

#### AUC

To summarize the ROC curve in a single number (in order to compare models), you calculate the area under the curve (AUC), which is a value between 0.0 and 1.0. Note, however, that 0.5 corresponds to the worst model (random predictions). A value less than 0.5 most likely means that you accidentally switched some numbers.
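Given a handful of ROC points, the area can be approximated with the trapezoidal rule. The sketch below uses made-up points and the hypothetical helper name `auc_sketch`; the rule is written out by hand instead of relying on a library routine.

```python
import numpy as np

def auc_sketch(fprs, tprs):
    """Area under a piecewise-linear ROC curve via the trapezoidal rule."""
    fprs = np.asarray(fprs, dtype=float)
    tprs = np.asarray(tprs, dtype=float)
    order = np.argsort(fprs, kind="stable")  # integrate from left to right
    fprs, tprs = fprs[order], tprs[order]
    widths = np.diff(fprs)
    mean_heights = (tprs[1:] + tprs[:-1]) / 2
    return float(np.sum(widths * mean_heights))

# Made-up ROC points, including the corners (0, 0) and (1, 1).
toy_fprs = [0.0, 0.0, 0.5, 0.75, 1.0]
toy_tprs = [0.0, 0.5, 0.75, 1.0, 1.0]
toy_auc = auc_sketch(toy_fprs, toy_tprs)
print(toy_auc)  # 0.78125

# The diagonal of a random classifier gives exactly 0.5.
print(auc_sketch([0.0, 1.0], [0.0, 1.0]))  # 0.5
```

With only a few points the result is a coarse approximation; more parameter settings give a smoother curve and a more reliable area.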

Implement the method to calculate the ROC value (TPR and FPR) for each of the three given matrices. The resulting plot should look similar to the picture above.

def calc_roc_value(conf_matrix):
    raise NotImplementedError()
    return (tpr, fpr)
cm1 = np.array([[1,1],[99,99]])
cm2 = np.array([[60,40],[40,60]])
cm3 = np.array([[99,99],[1,1]])
cm1_roc = calc_roc_value(cm1)
cm2_roc = calc_roc_value(cm2)
cm3_roc = calc_roc_value(cm3)

tprs = [cm1_roc[0], cm2_roc[0], cm3_roc[0]]
fprs = [cm1_roc[1], cm2_roc[1], cm3_roc[1]]
print(tprs)
print(fprs)
plt.plot(np.linspace(0,1),np.linspace(0,1), label='random guess')
plt.plot(fprs, tprs, label='ROC')
plt.legend()
tpr, fpr = calc_roc_value(fish_and_mines_prediction)
print(tpr, fpr)

assert round_and_hash(tpr) == '38fc9331271e16f0e5586a0fc993be00'
assert round_and_hash(fpr) == 'f1d3ff8443297732862df21dc4e57262'

### Kappa Score

The kappa score takes into consideration that some correct predictions were made by 'accident':

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$ with $p_o$ being the accuracy and $p_e$ the proportion of examples that are expected to be classified correctly by 'accident'.

For the binary classification task, $p_e$ is calculated with:

$$p_e = \frac{(TP + FN) \cdot (TP + FP)}{b^2} + \frac{(FN + TN) \cdot (FP + TN)}{b^2}$$ with $b$ the total number of examples.

And in general for $n$ different classes:

$$p_e = \frac{1}{b^2} \cdot \sum_{i=1}^{n} h_{i+} \cdot h_{+i}$$

with $h_{i+}$ the sum of row $i$ and $h_{+i}$ the sum of column $i$.
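The general formula translates almost directly into NumPy. The following is a sketch under the assumption that the confusion matrix is square with predictions along one axis and true labels along the other; the name `kappa_sketch` and the toy matrix are made up for this illustration.

```python
import numpy as np

def kappa_sketch(conf_matrix):
    """Kappa score from an n x n confusion matrix of counts."""
    h = np.asarray(conf_matrix, dtype=float)
    b = h.sum()                                            # total number of examples
    p_o = np.trace(h) / b                                  # observed agreement (accuracy)
    p_e = np.sum(h.sum(axis=1) * h.sum(axis=0)) / b ** 2   # sum_i h_{i+} * h_{+i} / b^2
    return (p_o - p_e) / (1.0 - p_e)

# Made-up matrix: p_o = 35/50 = 0.7 and p_e = (25*30 + 25*20) / 50^2 = 0.5.
toy_kappa = kappa_sketch(np.array([[20, 5], [10, 15]]))
print(round(toy_kappa, 4))  # 0.4
```

A kappa of 0 means the model agrees with the true labels no more often than expected by chance, and 1 means perfect agreement.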

Implement a function to compute the kappa score from the confusion matrix. Your implementation should definitely handle a confusion matrix for two classes. Optionally, make your implementation able to handle multiclass problems.

def kappa_score(conf_mat):
    raise NotImplementedError()
np.testing.assert_almost_equal(kappa_score(fish_and_mines_prediction), 0.666222074024692)

#### Weighted Kappa Score

If some misclassifications are worse than others (C1 classified as C3 is worse than C1 classified as C2), it is possible to include weights in the calculation. In this case we assign weights $w_{11}$ to $w_{nn}$ to the entries of the confusion matrix, typically with zero weights on the diagonal so that correct predictions incur no penalty. For the weighted kappa score we then have:

$$\kappa_w = 1 - \frac{\sum_{i=1}^n \sum_{j=1}^n w_{ij} \cdot h_{ij}}{\sum_{i=1}^n \sum_{j=1}^n w_{ij} \cdot \frac{h_{i+} \cdot h_{+j}}{b}}$$
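A sketch of this formula is given below. The linear weights $w_{ij} = |i - j|$ are only one possible choice and are an assumption here, as are the function name `weighted_kappa_sketch` and the made-up three-class matrix.

```python
import numpy as np

def weighted_kappa_sketch(conf_matrix, weights):
    """Weighted kappa: 1 - (weighted observed counts) / (weighted expected counts)."""
    h = np.asarray(conf_matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    b = h.sum()
    # expected[i, j] = h_{i+} * h_{+j} / b
    expected = np.outer(h.sum(axis=1), h.sum(axis=0)) / b
    return 1.0 - (w * h).sum() / (w * expected).sum()

# Made-up three-class confusion matrix.
toy_counts = np.array([[10, 2, 0],
                       [3, 8, 1],
                       [1, 2, 9]])
class_index = np.arange(toy_counts.shape[0])
# Assumed linear weights: zero on the diagonal, growing with the distance |i - j|.
linear_weights = np.abs(class_index[:, None] - class_index[None, :])
toy_weighted_kappa = weighted_kappa_sketch(toy_counts, linear_weights)
print(round(toy_weighted_kappa, 4))  # 0.6875
```

With weights of 0 on the diagonal and 1 everywhere else, the expression reduces to the unweighted kappa score.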

## Summary and Outlook

In this notebook, you learned about a whole host of metrics to evaluate the quality of your predictions.

• You learned how to set up a confusion matrix and distinguish between true positives, false positives, false negatives and true negatives
• You learned how to compute metrics such as precision and recall from the confusion matrix
• You learned how to address the problem of imbalanced classes
• You learned to visualize the quality of your model at different settings using a ROC curve

A key takeaway is to always think about which metrics make for a meaningful evaluation of each task. For example, in the fish-and-mines problem, you would gladly adjust your model to catch all the mines at the expense of classifying more fish as mines.

## Literature

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Exercise: Evaluation Metrics
by Klaus Strohmenger