Prediction and Heatmap Generation
Now that we have a trained (and saved) model, we can use it to predict the slides of the CAMELYON16 test dataset. From the predictions for the individual tiles, we can build a heatmap of the whole slide that shows the regions predicted to be metastatic. The steps in this notebook can be broken down into:
- Load the trained model
- Load the CAMELYON16 test dataset with SlideManager
- Get slides with
- Get tiles with
- Predict the tiles and build the heatmaps
- Visually compare your heatmaps with the tumor masks (if test slides have metastatic regions)
Chances are high that your model will not produce good enough heatmaps. Therefore, in the next notebook you will be offered high-quality heatmaps produced by a far superior CNN.
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
import random
import h5py
import math

from skimage.filters import threshold_otsu
from skimage.transform import resize

from preprocessing.datamodel import SlideManager
from preprocessing.processing import split_negative_slide, split_positive_slide, create_tumor_mask, rgb2gray, create_otsu_mask_by_threshold
from preprocessing.util import TileMap
Evaluation of the CAMELYON16 Challenge
Following the original CAMELYON16 challenge, the task would now be to predict the CAMELYON16 test dataset. Back in 2016, the labels were not published. The metrics used to evaluate the model were:
1) Receiver operating characteristic (ROC) at slide level, from which the area under the ROC curve (AUC) is calculated.
2) Free-response receiver operating characteristic (FROC) for lesion-based evaluation. Briefly, this metric measures how well the predicted regions in a tumorous slide match the true regions. In addition, a confidence score had to be submitted for each coordinate in the metastatic region.
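To make the first metric more concrete, here is a minimal sketch of a slide-level AUC. It assumes (this is not stated in this notebook) that each slide is reduced to a single score, here the maximum value of its heatmap, and it computes the AUC through the rank-based formulation (the probability that a positive slide scores higher than a negative one). It is not the official challenge evaluation script; the function names and the toy data are hypothetical.

```python
import numpy as np


def slide_score(heatmap):
    """Reduce a heatmap to one slide-level score (assumption: its maximum)."""
    return float(np.max(heatmap))


def roc_auc(labels, scores):
    """AUC as the probability that a positive slide outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative slide")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))


# Toy example: three hypothetical heatmaps with known slide labels.
heatmaps = [np.array([[0.1, 0.9]]), np.array([[0.2, 0.3]]), np.array([[0.05, 0.6]])]
labels = [1, 0, 1]
scores = [slide_score(h) for h in heatmaps]
auc = roc_auc(labels, scores)
```

In the toy data both positive slides score higher than the negative one, so the AUC is 1.0; a model that ranks slides at random would end up near 0.5.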
If you are interested in evaluating your model to see how it would have performed in the CAMELYON16 challenge, you can read more about the evaluation and the scoring on the official CAMELYON16 website.
Since the labels of the CAMELYON16 challenge have already been published, it is no longer possible to hand in any results. Therefore, we will not go into detail about evaluating the model for the CAMELYON16 challenge.
Instead, we will head straight towards the CAMELYON17 challenge. The second goal of CAMELYON16 (lesion-based) also prepares for this. From the confidence scores it is straightforward to create a heatmap as the prediction for a slide (similar to the tumor mask). These heatmaps can then be used to achieve the goals of the CAMELYON17 challenge, which are:
- Predict whether a slide contains no tumor regions, only isolated tumor cells (ITCs), micro-metastases, or macro-metastases.
- To be able to achieve this, the CAMELYON17 dataset is labeled with 4 different classes.
In the next notebooks, we will use the heatmaps, created with our model, to accomplish this. So the task in this notebook is to create the heatmaps first.
Setting the Paths
Set the paths according to the location where you store the data:
### EDIT THIS CELL:
### Assign the path to your CAMELYON16 data
CAM_BASE_DIR = '/path/to/CAMELYON/data/'

# example: absolute path for linux
CAM_BASE_DIR = '/media/klaus/2612FE3171F55111/'

CAM16_DIR = CAM_BASE_DIR + 'CAMELYON16/'
GENERATED_DATA = CAM_BASE_DIR + 'output/'

# example: if path is different (option A)
GENERATED_DATA = '/home/klaus/Documents/datasets/PN-STAGE/level0/'

# example: if path is different (option B)
GENERATED_DATA = '/home/klaus/Documents/datasets/PN-STAGE/level3/'

MODEL_FINAL = GENERATED_DATA + 'model_final.hdf5'

# Destination to store the heatmaps which we will create in this notebook
HEATMAPS_CAM16_TESTSET = GENERATED_DATA + 'test_set_predictions/'
Loading the Model
First we will load our trained and saved model. Since we did not train the model with an optimizer from the tf.keras package, we will have to recompile it.
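# Recreate the exact same model, including weights
model = tf.keras.models.load_model(MODEL_FINAL)

# Recompile with the optimizer used for training.
# Note: tf.train.AdamOptimizer is the TensorFlow 1.x API; in TensorFlow 2.x
# it is available as tf.compat.v1.train.AdamOptimizer.
model.compile(optimizer=tf.train.AdamOptimizer(learning_rate=0.0005),
              loss='binary_crossentropy',
              metrics=['accuracy'])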
Reading CAMELYON16 Test Dataset
The main purpose of creating the training dataset as a single HDF5 file was to reduce the time spent reading the data. This was crucial for training, because we needed to read the same data over and over again. For the test dataset this is less important, because after training is finished we only need to read every slide once and predict each tile once.
So to read the CAMELYON16 test dataset, we can just use the
SlideManager.test_slides attribute and the
- We must use the same tile_size we trained our CNN on.
- We must use the same poi we used to separate tissue from background.
The higher the overlap, the higher the resolution of our heatmap will be.
- Higher overlap dramatically increases prediction time.
- An overlap of at least half the tile_size is suggested to reduce the chance of splitting smaller tumorous regions across tiles and thereby misclassifying them.
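The effect of the overlap on the heatmap resolution can be illustrated with a little arithmetic. The sketch below assumes that tiles are placed on a regular grid with a stride of tile_size - overlap and that each tile contributes one heatmap pixel; the actual tiling logic of the preprocessing package may differ, for example at the slide borders. The helper name heatmap_shape and the slide dimensions are hypothetical.

```python
def heatmap_shape(level_dimensions, tile_size, overlap):
    """Number of tile positions (columns, rows) on a regular grid.

    Assumption: stride = tile_size - overlap, partial border tiles are dropped.
    """
    if not 0 <= overlap < tile_size:
        raise ValueError("overlap must satisfy 0 <= overlap < tile_size")
    stride = tile_size - overlap
    width, height = level_dimensions
    cols = max((width - tile_size) // stride + 1, 0)
    rows = max((height - tile_size) // stride + 1, 0)
    return cols, rows


dims = (12288, 27648)  # hypothetical (width, height) at the chosen level
no_overlap = heatmap_shape(dims, tile_size=256, overlap=0)
half_overlap = heatmap_shape(dims, tile_size=256, overlap=128)
print(no_overlap, half_overlap)
```

With an overlap of half the tile size, the heatmap roughly doubles along each axis, so about four times as many tiles have to be predicted.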
mgr = SlideManager(cam16_dir=CAM16_DIR)

### Depending on option chosen in "create-custom-dataset" (option A = level 0) (option B = level 3)
level = 3

### 256 for either option as we trained our CNN on 256 tiles
tile_size = 256

### 20% of a tile must contain tissue (in contrast to slide background)
poi = 0.20

### more overlap, higher resolution but increased processing time
overlap = tile_size // 2
When we pass a test slide as a parameter to the method create_tumor_mask, a mask will always be returned. If no annotation xml file exists (because the slide has no metastatic regions), the mask will just contain nans. This method can be used to manually compare your generated heatmaps with the true tumor area.
slide = mgr.get_slide('test_001')

### some general slide information
print(slide)
print(slide.dimensions)
print(slide.level_dimensions[level])

### create tumor mask and show it
mask = create_tumor_mask(slide, level=8)
print(mask.shape)
plt.imshow(mask, cmap='gray')
Since we trained our model with normalized images, we will also need the mean and the standard deviation of the color channels we used.
Create both variables std_pixel and assign the values by looking them up in the last notebook.
### Exercise: Look up the corresponding values and save them into variables

### Assign the correct values
mean_red = 0.
mean_green = 0.
mean_blue = 0.

### Assign the correct values
std_red = 1.
std_green = 1.
std_blue = 1.

mean_pixel = np.array([mean_red, mean_green, mean_blue])
std_pixel = np.array([std_red, std_green, std_blue])
Use your trained model to predict the individual tiles of each slide in the test dataset. From the predictions of your model (values from 0.0 to 1.0), build a heatmap for each slide. It should have the same aspect ratio as the original slide, but at a smaller scale.
Use split_negative_slides on the test slides to receive the tiles (you do not know whether it is a tumor or a normal slide). For the usage, refer to data-handling-usage-guide.ipynb.
split_negative_slides yields images with pixel values in the range [0, 255], but you trained your CNN with values in [0, 1].
- Scale each tile to [0, 1] first.
- Then normalize the color channels of the tile (see TissueDataset of the last notebook).
- When you use overlapping tiles, the resolution of your heatmap will be higher, e.g. an overlap of 128 doubles the resolution.
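The two preprocessing steps can be sketched as follows. This assumes that a tile is an RGB array of shape (height, width, 3) with values in [0, 255] and that normalization means subtracting the per-channel mean and dividing by the per-channel standard deviation after scaling; check TissueDataset in the last notebook for the exact procedure you trained with. The helper name normalize_tile and the statistics used here are placeholders.

```python
import numpy as np


def normalize_tile(tile, mean_pixel, std_pixel):
    """Scale an RGB tile from [0, 255] to [0, 1], then standardize per channel."""
    tile = np.asarray(tile, dtype=np.float32) / 255.0
    # mean_pixel and std_pixel have shape (3,) and broadcast over the last axis.
    return (tile - mean_pixel) / std_pixel


# Placeholder statistics; use the values from your own training data instead.
mean_pixel = np.array([0.5, 0.5, 0.5], dtype=np.float32)
std_pixel = np.array([0.25, 0.25, 0.25], dtype=np.float32)

tile = np.full((256, 256, 3), 255, dtype=np.uint8)  # a fake all-white tile
normalized = normalize_tile(tile, mean_pixel, std_pixel)
batch = normalized[np.newaxis, ...]  # shape (1, 256, 256, 3), one tile per batch
```

The extra leading axis is added because Keras models predict on batches of images rather than on a single image.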
Predict each tile yielded by split_negative_slides and position it in the heatmap at the appropriate location.
- You can calculate the position with the bounds variable, which the iterator yields together with the tile.
- Save your created heatmaps as png files (e.g. test_001.png)
- Save the original (*.xml files) masks as images so you can compare them with your heatmaps.
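The positioning step can be sketched as below. This is only an illustration under stated assumptions, not the reference solution: it assumes that bounds has the form ((x, y), (width, height)) in pixels of the chosen level (verify this in data-handling-usage-guide.ipynb), and it replaces the real tile iterator and the model with the hypothetical stand-ins fake_tile_iterator and fake_predict so that it runs without the dataset.

```python
import numpy as np


def fake_tile_iterator(width, height, tile_size, overlap):
    """Stand-in for the real tile iterator: yields (tile, bounds) pairs.

    Assumption: bounds = ((x, y), (tile_width, tile_height)) in level pixels.
    """
    stride = tile_size - overlap
    for y in range(0, height - tile_size + 1, stride):
        for x in range(0, width - tile_size + 1, stride):
            tile = np.zeros((tile_size, tile_size, 3), dtype=np.uint8)
            yield tile, ((x, y), (tile_size, tile_size))


def fake_predict(tile, bounds):
    """Stand-in for the model: pretends the left part of the slide is tumor."""
    (x, _), _ = bounds
    return 1.0 if x < 512 else 0.0


def build_heatmap(tiles, width, height, tile_size, overlap, predict):
    """Write one prediction per tile into a grid indexed by tile position."""
    stride = tile_size - overlap
    cols = (width - tile_size) // stride + 1
    rows = (height - tile_size) // stride + 1
    # Positions never visited (e.g. skipped background tiles) stay 0.
    heatmap = np.zeros((rows, cols), dtype=np.float32)
    for tile, bounds in tiles:
        (x, y), _ = bounds
        heatmap[y // stride, x // stride] = predict(tile, bounds)
    return heatmap


width, height, tile_size, overlap = 1024, 768, 256, 128
tiles = fake_tile_iterator(width, height, tile_size, overlap)
heatmap = build_heatmap(tiles, width, height, tile_size, overlap, fake_predict)
print(heatmap.shape)
```

With a real model, predict would normalize the tile and return the model output for it. The finished array can then be written to disk, for example with plt.imsave('test_001.png', heatmap, cmap='gray', vmin=0, vmax=1).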
Here are examples of some created heatmaps (top: heatmaps, bottom: true masks from the xml files in CAMELYON16/test/lesion_annotations/). These heatmaps were generated with Option B (level 3), so they are relatively small (~100x50 pixels). For visualisation purposes, the picture below is scaled up:
This will take a lot of time.
If working on level 0 (Option A):
- Even properly implemented you will need about 1-2 hours per WSI
- We have for example 700x700 tiles for slide 'test_001'
- Fetching and predicting one row of tiles takes 5 seconds
- We have 700 rows, so we still need 58 minutes for the whole slide
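The level 0 estimate follows directly from the numbers above. The small sketch below only reproduces that arithmetic; the function name is hypothetical.

```python
def minutes_per_slide(rows, seconds_per_row):
    """Total prediction time in minutes when rows are processed one after another."""
    return rows * seconds_per_row / 60.0


level0_minutes = minutes_per_slide(rows=700, seconds_per_row=5)
print(round(level0_minutes, 1))  # 58.3
```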
If working on level 3 (Option B):
- Properly implemented you will need about 10 minutes per WSI
- 16 GB DDR3@1666MHz
- Xeon1231 CPU
- Camelyon Dataset stored on magnetic hard drive
If you do not have the time to classify all tiles of all slides, you can just implement the code, run it to produce the first 5-10 heatmaps and proceed with the next notebook.
In the next notebook you will be provided with some high quality heatmaps, produced with all the missing things, which were mentioned here.
### Exercise. Your code below
If everything went fine, repeat the process with all WSIs of the CAMELYON17 dataset. Note, however, that this will take at least 150 hours, even when working only on level 3 (option B).
So if you do not have the time, you can continue with the exercises and use the provided heatmaps in the next notebook.
Summary and Outlook
So far we have managed to ...
- ... divide our huge data set into smaller pieces (tiles) to be able to handle it at all and use it to train a model.
- ... build and train a CNN to predict whether a single tile contains metastatic or normal tissue.
- ... use our CNN to predict the individual tiles of the slides of the test set.
- ... put the predictions of a slide together in order to generate a heatmap (or mask), which looks similar to the masks provided.
In the next notebook we will extract geometric features from these heatmaps to train another classifier, which will then be able to predict the tumor class of the slides (negative, ITC, micro, macro).
Notebook License (CC-BY-SA 4.0)
The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).
by Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.
Code License (MIT)
The following license only applies to code cells of the notebook.
Copyright 2018 Klaus Strohmenger
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.