# Extracting Features from Heatmaps

## Introduction

This is the first of three notebooks about the CAMELYON17 challenge. In this challenge, the task is to predict a patients pN-stage (pN0, pn0(i+), pNmi, pN1 and pN2). The pN-stage is determined by the lables of the five slides corresponding to the patient. Once the lables of the individual slides are known (negative, itc, micro, macro), the patient's pN-stage can be looked up in a table (see the official CAMELYON17 website). So the real difficulty of the challenge is to predict the labels of the slides.

Unfortunately the training data here is very limited as we have only 500 slides from the CAMELYON17 training set and 130 labled slides from the CAMELYON16 test set (labled with negative, itc, micro, macro). So opposed to the task of the CAMELYON16 challenged where we had thousands of tiles and only two different labels (normal and tumor), we will not be able to supply another CNN model with sufficient data. Even worse, for the itc class, we only have 35 examples in total.

To tackle this problem, our approach is to make use of domain specific knowledge and extract geoemtrical features from the heatmaps, which can be used to train a less complex model, e.g, a decision tree.

## Requirements

### Python-Modules

# External Modules
import numpy as np
import pandas as pd
import os

from matplotlib import pyplot as plt
from skimage import data
from scipy import ndimage
from skimage.measure import regionprops

%matplotlib inline

### Data

In this notebook you will be provided with heatmaps of slides of the CAMELYON16 test set and the CAMELYON17 set. Using them has some advantages over the heatmaps you might have created in the last notebook:

• You will not have to download the CAMELYON17 data set (~2.7 tera bytes).
• The heatmaps you will be provided with are a lot more accurate:

• They were created with a far superior model.
• The model did not only use zoom level 2. Instead two CNNs were trained. One on level 0 and one on level 1.
• As a consequence of the higher zoom, they also have a higher resolution.
• The CNNs' architecture was a more sophisticated one, the Inception v4 architecture (for details see [SZE17]).
• Algorithm for advanced (domain specific) color normalization was applied [BEJ16].
• All kinds of data augmentation were used.
• The CNNs were trained on two Geforce Titan X GPUs for over 2 weeks.

### Background

Short:

The heatmaps of the sldies were created with a CNN. Each heatmap represents a Whole-Slide-Image (WSI) of tissue. A WSI is approximately 200.000 x 100.000 pixels big. Each slide was cut into 256 x 256 pixel tiles. The CNN was trained to predict these tiles whether they contain tumorous tissue or not. One pixel of the provided heatmaps represents the prediction of the CNN of one 256 x 256 tile.

Long:

For more information have a look at the deepTEACHING website, or the official homepage of the CAMEYLON Challenge.

• HEATMAP_DIR
• PATH_C16_LABELS
• PATH_C17_LABELS
• HEATMAP_DIR
### EDIT THIS CELL:
### Assign the path to your CAMELYON16 data and create the directories
### if they do not exist yet.
CAM_BASE_DIR = '/path/to/CAMELYON/data/'
GENERATED_DATA = CAM_BASE_DIR + 'tutorial/'
HEATMAP_DIR = CAM_BASE_DIR + 'c16traintest_c17traintest_heatmaps_grey/'

# CAMELYON16 and 17 ground truth labels
PATH_C16_LABELS = CAM_BASE_DIR + 'CAMELYON16/test/Ground_Truth/reference.csv'
PATH_C17_LABELS = CAM_BASE_DIR + 'CAMELYON17/training/stage_labels.csv'

# Just one test slide for testing
HEATMAP_Test_001 = HEATMAP_DIR + 'Test_001.png'

## Teaching Content

To get a first impression of the heatmaps, we will first load and view one. A pixels value describes the "degree of believe" of the CNNs, the region contains tumorous tissue (1.0) or not (0.0).

# Load 'Test_001.png'

# View the dimensions
print(example_heatmap_grey.shape)

# View min / max values
print(example_heatmap_grey.min())
print(example_heatmap_grey.max())
# View as greyscale

axs = plt.subplot()
plt.imshow(example_heatmap_grey,cmap=plt.get_cmap('Greys_r'))
# View as heatmap (just for visualization purpose)

plt.imshow(example_heatmap_grey,cmap=plt.get_cmap('jet'))

## Exercises

In total we will extract 6 features of every heatmap:

1. Highest probability (value) on the heatmap (red)
2. Average probability on the heatmp. Sum all values and divide by the number of values$\gt 0.0$ (green)
3. Number of pixels after thresholding (pink)
4. Length of the larger side of the biggest object after thresholding (orange)
5. Length of the smaller side of the biggest object after thresholding (yellow)
6. Number of pixels of the biggest object after thresholding (blue)

feature_names = ['highest_probability',
'average_porbability',
'sum_pixel_after_threshold',
'biggest_object_sum_pixels',
'biggest_object_large_side',
'biggest_object_small_side']

### Extract First Two Features

The first to features can easily be extracted without further processing:

• highest probability
• average probability (not of all pixels, only pixels$\gt 0.0$)

Implement the functions to extract the features.

### Exercise

def extract_feature_highest_probability(greyscale_img):
raise NotImplementedError()

def extract_feature_avg_probability(greyscale_img):
raise NotImplementedError()
f1 = extract_feature_highest_probability(example_heatmap_grey)
f2 = extract_feature_avg_probability(example_heatmap_grey)
print(feature_names[0], f1)
print(feature_names[1], f2)

assert 1.0 > f1 > 0.97
assert 0.24 > f2 > 0.21

### Binarize Greyscale Heatmaps

In order to extract the next four features, we must clearly define which value between 0.0 and 1.0 (and above) we believe is tumorous tissue.

A logical value would be 0.5, since that is the value the CNN used as decision boundary when it was trained to predict if a tile of 256x256 pixels contains tumor or not.

But lower or higher values might be beneficial as well. Lower values will yield bigger regions of connected pixel. Higher values will contain less noise.

Implement the function to transform a greyscale image into a binary image.

### Exercise: Apply global threshold (e.g. 0.5)

def get_binary_of_greyscale_heatmap(greyscale_img, threshold=0.5):
raise NotImplementedError()
heatmap_binary = get_binary_of_greyscale_heatmap(example_heatmap_grey, 0.5)
plt.imshow(heatmap_binary,cmap=plt.get_cmap('Greys_r'))

### Extract Third Feature

Now that we have a binarized heatmap we can extract the feature

Implement the function to extract the feature sum_pixel_after_threshold. In other words, the area that is left.

def extract_feature_sum_after_binarization(binary_image):
raise NotImplementedError()
f3 = extract_feature_sum_after_binarization(heatmap_binary)
print(feature_names[2], f3)

assert f3 == 1865 ### only for threshold 0.5

### Find Biggest Object

Now the last three feature can be a bit tricky. First we need to find the biggest object in the image. The biggest object means, the largest area of connected pixel. A pixel is connected to another if it is the very next pixel to the left / right / up / down.

After we identified the individual objects (areas of connected pixels), we need to extract their features:

• area of the connected region (= biggest object) in sum of the pixels
• large side of the connected region
• small side of the connected region

Implement the function to identify objects (connected regions) in a binary image and to extract their features.

Note:

If you cannot manage to extract these three features you can proceed with this notebook and the others following only using the first three features, though classification results at the end of the series of the notebooks will not be as accurate.

def find_objects_features_in_binary_image(binary_image):
raise NotImplementedError()

If your implementation is correct, the output of the cell below could look similar to the following:

object number   |sum_pixels |large_side |small_side
0               |1          |1.0        |1.0
1               |1          |1.0        |1.0
2               |1          |1.0        |1.0
3               |3          |2.0        |2.0
4               |1359        |61.0      |39.0
5               |1          |1.0        |1.0
6               |2          |2.0        |1.0
...
...

ftrs_objects = find_objects_features_in_binary_image(heatmap_binary)

print('object number\t|%s\t|%s\t|%s' %('sum_pixels', 'large_side', 'small_side'))
print()
for i in range(len(ftrs_objects)):
print('%d\t\t|%d \t\t|%.1f\t\t|%.1f' % (i, ftrs_objects[i,0], ftrs_objects[i,1], ftrs_objects[i,2]))

Then we can just sort the objects according to the feature object_sum_pixels to get the features of the biggest object:

biggest_obj_idx = np.argmax(ftrs_objects[:,0])
f4 = ftrs_objects[biggest_obj_idx][0]
f5 = ftrs_objects[biggest_obj_idx][1]
f6 = ftrs_objects[biggest_obj_idx][2]
print('features of biggest object:')
print('%s: %.2f' %(feature_names[3], f4))
print('%s: %.2f' %(feature_names[4], f5))
print('%s: %.2f' %(feature_names[5], f6))

### Extract features of all Heatmaps

Now we have everything we need to extract the six features from the heatmaps.

Extract all six features from all heatmaps and combine them with their labels (negative, itc, micro, macro) in a pandas DataFrame

For the next notebook, it is crucial to add the data in different pandas DataFrame objects:

• one for CAMEYLON16 test data set (heatmaps named Test_xxx.png)
• one for CAMELYON17 train data set (named pantient_0xx.png)
• one for CAMELYON17 test data set (named _patient_1xx.png)

Executing the three cells below should show output similar to the one in this picture:

c16_test[0:3]
c17_train[0:3]
c17_test[0:3]
c16_test.to_csv(GENERATED_DATA + 'features_c16_test.csv')
c17_train.to_csv(GENERATED_DATA +'features_c17_train.csv')
c17_test.to_csv(GENERATED_DATA +'features_c17_test.csv')

## Summary and Outlook

Next we will train a simple classifier with the features we just extracted.

Possible improvements for this notebook:

• Extract more features (e.g. last three features also of the second biggest object)
• More advanced measuring of the length of the biggest object (instead of just using a bounding box)
• One or more preprocessing steps after binarization to connect near regions or to reduce noise e.g. using:
• Morphological image operations (morphological opening / closing)
• etc...
• etc...

Note:

If you want to add complete new features, it is always a good advice to know about the domain, so you do not extract arbitrary features, which have nothing to do with the classification decision. Read the corresponding section on the CAMELYON website, how real pathologists decide about the label (negative, itc, micro, macro) of a slide.

## Literature

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

XXX
by Klaus Strohmenger