Extracting Features from Heatmaps
Table of Contents
This is the first of three notebooks about the CAMELYON17 challenge. In this challenge, the task is to predict a patients pN-stage (pN0, pn0(i+), pNmi, pN1 and pN2). The pN-stage is determined by the lables of the five slides corresponding to the patient. Once the lables of the individual slides are known (negative, itc, micro, macro), the patient's pN-stage can be looked up in a table (see the official CAMELYON17 website). So the real difficulty of the challenge is to predict the labels of the slides.
Unfortunately the training data here is very limited as we have only 500 slides from the CAMELYON17 training set and 130 labled slides from the CAMELYON16 test set (labled with negative, itc, micro, macro). So opposed to the task of the CAMELYON16 challenged where we had thousands of tiles and only two different labels (normal and tumor), we will not be able to supply another CNN model with sufficient data. Even worse, for the itc class, we only have 35 examples in total.
To tackle this problem, our approach is to make use of domain specific knowledge and extract geoemtrical features from the heatmaps, which can be used to train a less complex model, e.g, a decision tree.
# External Modules import numpy as np import pandas as pd import os import skimage from matplotlib import pyplot as plt from scipy import ndimage from skimage.measure import regionprops %matplotlib inline
In this notebook you will be provided with heatmaps of slides of the CAMELYON16 test set and the CAMELYON17 set. Using them has some advantages over the heatmaps you might have created in the last notebook:
- You will not have to download the CAMELYON17 data set (~2.7 tera bytes).
The heatmaps you will be provided with are a lot more accurate:
- They were created with a far superior model.
- The model did not only use zoom level 2. Instead two CNNs were trained. One on level 0 and one on level 1.
- As a consequence of the higher zoom, they also have a higher resolution.
- The CNNs' architecture was a more sophisticated one, the Inception v4 architecture (for details see [SZE17]).
- Algorithm for advanced (domain specific) color normalization was applied [[BEJ16]].
- All kinds of data augmentation were used.
- The CNNs were trained on two Geforce Titan X GPUs for over 2 weeks.
- You can read more about the techniques used in [STR18].
The heatmaps of the sldies were created with a CNN. Each heatmap represents a Whole-Slide-Image (WSI) of tissue. A WSI is approximately 200.000 x 100.000 pixels big. Each slide was cut into 256 x 256 pixel tiles. The CNN was trained to predict these tiles whether they contain tumorous tissue or not. One pixel of the provided heatmaps represents the prediction of the CNN of one 256 x 256 tile.
- Download the heatmaps from here and unpack them.
- Adjust the path variables:
### EDIT THIS CELL: ### Assign the path to your CAMELYON16 data and create the directories ### if they do not exist yet. CAM_BASE_DIR = '/path/to/CAMELYON/data/' # exmple: CAM_BASE_DIR = '/media/klaus/2612FE3171F55111/'
GENERATED_DATA = CAM_BASE_DIR + 'tutorial/' HEATMAP_DIR = CAM_BASE_DIR + 'c16traintest_c17traintest_heatmaps_grey/' # CAMELYON16 and 17 ground truth labels PATH_C16_LABELS = CAM_BASE_DIR + 'CAMELYON16/test/reference.csv' PATH_C17_LABELS = CAM_BASE_DIR + 'CAMELYON17/training/stage_labels.csv' # Just one test slide for testing HEATMAP_Test_001 = HEATMAP_DIR + 'Test_001.png'
Load and View Heatmaps
To get a first impression of the heatmaps, we will first load and view one. A pixels value describes the "degree of believe" of the CNNs, the region contains tumorous tissue (1.0) or not (0.0).
Executing the next two cells should show the following images:
# Load 'Test_001.png' example_heatmap_grey = skimage.io.imread(HEATMAP_Test_001, as_gray=True) # View the dimensions print(example_heatmap_grey.shape) # View min / max values print(example_heatmap_grey.min()) print(example_heatmap_grey.max())
# View as greyscale axs = plt.subplot() plt.imshow(example_heatmap_grey,cmap=plt.get_cmap('Greys_r'))
# View as heatmap (just for visualization purpose) plt.imshow(example_heatmap_grey,cmap=plt.get_cmap('jet'))
In total we will extract 6 features of every heatmap:
- Highest probability (value) on the heatmap (red)
- Average probability on the heatmp. Sum all values and divide by the number of values$ \gt 0.0 $ (green)
- Number of pixels after thresholding (pink)
- Length of the larger side of the biggest object after thresholding (orange)
- Length of the smaller side of the biggest object after thresholding (yellow)
- Number of pixels of the biggest object after thresholding (blue)
feature_names = ['highest_probability', 'average_porbability', 'sum_pixel_after_threshold', 'biggest_object_sum_pixels', 'biggest_object_large_side', 'biggest_object_small_side']
Extract First Two Features
The first to features can easily be extracted without further processing:
- highest probability
- average probability (not of all pixels, only pixels$ \gt 0.0 $)
Implement the functions to extract the features.
### Exercise def extract_feature_highest_probability(greyscale_img): raise NotImplementedError() def extract_feature_avg_probability(greyscale_img): raise NotImplementedError()
f1 = extract_feature_highest_probability(example_heatmap_grey) f2 = extract_feature_avg_probability(example_heatmap_grey) print(feature_names, f1) print(feature_names, f2) ### NOTE: These tests can only pass when using the provided heatmaps! ### Comment these lines out if you use your own heatmaps assert 1.0 > f1 > 0.97 assert 0.24 > f2 > 0.21
Binarize Greyscale Heatmaps
In order to extract the next four features, we must clearly define which value between 0.0 and 1.0 (and above) we believe is tumorous tissue.
A logical value would be 0.5, since that is the value the CNN used as decision boundary when it was trained to predict if a tile of 256x256 pixels contains tumor or not.
But lower or higher values might be beneficial as well. Lower values will yield bigger regions of connected pixel. Higher values will contain less noise.
Implement the function to transform a greyscale image into a binary image.
### Exercise: Apply global threshold (e.g. 0.5) def get_binary_of_greyscale_heatmap(greyscale_img, threshold=0.5): raise NotImplementedError()
heatmap_binary = get_binary_of_greyscale_heatmap(example_heatmap_grey, 0.5) plt.imshow(heatmap_binary,cmap=plt.get_cmap('Greys_r'))
Extract Third Feature
Now that we have a binarized heatmap we can extract the feature
Implement the function to extract the feature sum_pixel_after_threshold. In other words, the area that is left.
def extract_feature_sum_after_binarization(binary_image): raise NotImplementedError()
f3 = extract_feature_sum_after_binarization(heatmap_binary) print(feature_names, f3) ### NOTE: These tests can only pass when using the provided heatmaps! ### Comment these lines out if you use your own heatmaps assert f3 == 1865 ### only for threshold 0.5
Find Biggest Object
Now the last three feature can be a bit tricky. First we need to find the biggest object in the image. The biggest object means, the largest area of connected pixel. A pixel is connected to another if it is the very next pixel to the left / right / up / down.
After we identified the individual objects (areas of connected pixels), we need to extract their features:
- area of the connected region (= biggest object) in sum of the pixels
- large side of the connected region
- small side of the connected region
Implement the function to identify objects (connected regions) in a binary image and to extract their features.
If you cannot manage to extract these three features you can proceed with this notebook and the others following only using the first three features, though classification results at the end of the series of the notebooks will not be as accurate.
def find_objects_features_in_binary_image(binary_image): raise NotImplementedError()
If your implementation is correct, the output of the cell below could look similar to the following:
object number |sum_pixels |large_side |small_side 0 |1 |1.0 |1.0 1 |1 |1.0 |1.0 2 |1 |1.0 |1.0 3 |3 |2.0 |2.0 4 |1359 |61.0 |39.0 5 |1 |1.0 |1.0 6 |2 |2.0 |1.0 ... ...
ftrs_objects = find_objects_features_in_binary_image(heatmap_binary) print('object number\t|%s\t|%s\t|%s' %('sum_pixels', 'large_side', 'small_side')) print() for i in range(len(ftrs_objects)): print('%d\t\t|%d \t\t|%.1f\t\t|%.1f' % (i, ftrs_objects[i,0], ftrs_objects[i,1], ftrs_objects[i,2]))
Then we can just sort the objects according to the feature object_sum_pixels to get the features of the biggest object:
biggest_obj_idx = np.argmax(ftrs_objects[:,0]) f4 = ftrs_objects[biggest_obj_idx] f5 = ftrs_objects[biggest_obj_idx] f6 = ftrs_objects[biggest_obj_idx] print('features of biggest object:') print('%s: %.2f' %(feature_names, f4)) print('%s: %.2f' %(feature_names, f5)) print('%s: %.2f' %(feature_names, f6))
Extract features of all Heatmaps
Now we have everything we need to extract the six features from the heatmaps.
Extract all six features from all heatmaps and combine them with their labels (negative, itc, micro, macro) in a pandas
For the next notebook, it is crucial to add the data in different pandas
- one for CAMEYLON16 test data set (heatmaps named Test_xxx.png)
- one for CAMELYON17 train data set (named pantient_0xx.png)
- one for CAMELYON17 test data set (named _patient_1xx.png)
Executing the three cells below should show output similar to the one in this picture:
c16_test.to_csv(GENERATED_DATA + 'features_c16_test.csv') c17_train.to_csv(GENERATED_DATA +'features_c17_train.csv') c17_test.to_csv(GENERATED_DATA +'features_c17_test.csv')
Summary and Outlook
Next we will train a simple classifier with the features we just extracted.
Possible improvements for this notebook:
- Extract more features (e.g. last three features also of the second biggest object)
- More advanced measuring of the length of the biggest object (instead of just using a bounding box)
- One or more preprocessing steps after binarization to connect near regions or to reduce noise e.g. using:
- Morphological image operations (morphological opening / closing)
- More advanced: clustering algorithms, e.g. DBSCAN [LEE19]
- Extract Features for multiple thesholds and use them all together, e.g. 0.1, 0.2, ..., 0.9 [PIN19][zha19]
If you want to add complete new features, it is always a good advice to know about the domain, so you do not extract arbitrary features, which have nothing to do with the classification decision. Read the corresponding section on the CAMELYON website, how real pathologists decide about the label (negative, itc, micro, macro) of a slide.
Notebook License (CC-BY-SA 4.0)
The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).
by Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.
Code License (MIT)
The following license only applies to code cells of the notebook.
Copyright 2018 Klaus Strohmenger
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.