Classification of Heatmaps

Introduction

In the last notebook we extracted geometrical features from our heatmaps and saved them in CSV files. Now we will use these features to train a simple classifier that predicts the labels of the slides (negative, itc, micro, macro).

Requirements

Python-Modules

import numpy as np
import pandas as pd

from sklearn.model_selection import cross_validate
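# SimpleImputer supersedes sklearn.preprocessing.Imputer (removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
from sklearn import tree, naive_bayes, ensemble
# sklearn.externals.six was removed in scikit-learn 0.23; StringIO is in the standard library
from io import StringIO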
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

from graphviz import Source

Exercises

Before we start, adjust the path of CAM_BASE_DIR (and also other variables as needed).

### EDIT THIS CELL:
### Assign the path to your CAMELYON16 data and create the directories
### if they do not exist yet.
CAM_BASE_DIR = '/path/to/CAMELYON/data/'
GENERATED_DATA = CAM_BASE_DIR + 'tutorial/' 
HEATMAP_DIR = CAM_BASE_DIR + 'c16traintest_c17traintest_heatmaps_grey/'

# CAMELYON16 and 17 ground truth labels
PATH_C16_LABELS = CAM_BASE_DIR + 'CAMELYON16/test/Ground_Truth/reference.csv'
PATH_C17_LABELS = CAM_BASE_DIR + 'CAMELYON17/training/stage_labels.csv'

FEATURES_C16TEST = GENERATED_DATA + 'features_c16_test.csv'
FEATURES_C17TRAIN = GENERATED_DATA + 'features_c17_train.csv'
FEATURES_C17TEST = GENERATED_DATA + 'features_c17_test.csv'

# Here we will save our predictions
PATH_C17TRAIN_PREDICITONS = CAM_BASE_DIR + 'CAMELYON17/c17_train_predictions.csv'
PATH_C17TEST_PREDICITONS = CAM_BASE_DIR + 'CAMELYON17/c17_test_predictions.csv'

Load the Data

Now we read the CSV files into pandas DataFrame objects.
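The loading step can be sketched as follows, assuming the feature files are plain comma-separated files with a header row. The helper name load_features, the inline two-row CSV and the column feature_b are made up for illustration; only patient, stage and highest_probability are names taken from this notebook.

```python
import io
import pandas as pd

def load_features(path_or_buffer):
    """Read one feature file into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path_or_buffer)

# Toy stand-in for a feature file. 'feature_b' and all values are invented.
toy_csv = io.StringIO(
    "patient,stage,highest_probability,feature_b\n"
    "slide_a,negative,0.12,3.0\n"
    "slide_b,macro,0.98,41.5\n"
)
toy_df = load_features(toy_csv)
print(toy_df.shape)
print(toy_df.dtypes)

# In the notebook, the paths from the configuration cell would be used:
# c16_test = load_features(FEATURES_C16TEST)
# c17_train = load_features(FEATURES_C17TRAIN)
# c17_test = load_features(FEATURES_C17TEST)
```

Calling head() on each loaded DataFrame is a quick way to check that the columns match what the previous notebook wrote.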

Prepare the Data

Task:

Concatenate c16_test and c17_train into a new DataFrame variable c1617_train. This is the dataset we will use for hyperparameter optimization.

c1617_train = None ### Exercise
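One possible way to do the concatenation, shown on toy frames rather than on the real data. The frame names c16_toy and c17_toy and all values are invented; the sketch assumes both frames share the same columns.

```python
import pandas as pd

# Toy frames standing in for c16_test and c17_train (values invented).
c16_toy = pd.DataFrame({"patient": ["a", "b"],
                        "stage": ["negative", "macro"],
                        "highest_probability": [0.1, 0.9]})
c17_toy = pd.DataFrame({"patient": ["c"],
                        "stage": ["micro"],
                        "highest_probability": [0.7]})

# ignore_index=True gives the result a fresh 0..n-1 index instead of
# repeating the row labels of the two inputs.
c1617_toy = pd.concat([c16_toy, c17_toy], ignore_index=True)
print(c1617_toy)
```

If the two inputs do not share exactly the same columns, pd.concat keeps the union of the columns and fills the gaps with NaN, so it is worth checking the column names first.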

Now we split the data into labels and features.

Task:

  • Load the stage column into a variable y (1D-numpy array)
  • Load the six features highest_probability, ... into a variable x
# Exercise

x = None # features
y = None # labels
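A minimal sketch of the split on a toy DataFrame. The list feature_columns is an assumption: only highest_probability is named in this notebook, feature_b is an invented placeholder for the remaining features.

```python
import pandas as pd

toy = pd.DataFrame({
    "patient": ["a", "b", "c"],
    "stage": ["negative", "macro", "micro"],
    "highest_probability": [0.1, 0.9, 0.7],
    "feature_b": [2.0, 40.0, 11.0],  # invented second feature
})

# Assumption: replace this list with the six feature column names.
feature_columns = ["highest_probability", "feature_b"]

x_toy = toy[feature_columns].to_numpy()  # shape (n_slides, n_features)
y_toy = toy["stage"].to_numpy()          # shape (n_slides,)
print(x_toy.shape, y_toy.shape)
```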

Remove Invalid Values

A few heatmaps (2 or 3) could not be created by the CNN, so the feature values for those slides are missing.

Task:

Replace the missing values using the sklearn.impute.SimpleImputer class. It supersedes sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22.

Hint:

For better results, look at the labels of the slides whose heatmaps are missing and replace each missing feature value with the mean of that feature over all slides with the same label.

### Exercise

# imp = 
# x = 
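Both imputation variants can be sketched on toy data as follows. The sketch uses sklearn.impute.SimpleImputer, the class that replaced sklearn.preprocessing.Imputer in newer scikit-learn releases. The DataFrame toy and its values are invented; the per-label variant is one possible reading of the hint, not necessarily the intended solution.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "stage": ["negative", "negative", "macro", "macro", "macro"],
    "highest_probability": [0.1, np.nan, 0.9, 0.7, np.nan],
})
features = ["highest_probability"]

# Variant 1: one mean per feature, computed over all slides.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
x_global = imp.fit_transform(toy[features])

# Variant 2 (the hint): mean per feature within each label.
# transform("mean") returns a frame aligned with toy, so fillna can
# pick the matching label mean for every missing cell.
label_means = toy.groupby("stage")[features].transform("mean")
x_by_label = toy[features].fillna(label_means).to_numpy()

print(x_global.ravel())
print(x_by_label.ravel())
```

The per-label variant needs the labels, so it can only be applied to data whose stage is known.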

Train and Visualize a Simple Decision Tree

Now we are ready to define and train a decision tree. We use the scikit-learn decision tree module.

Task:

  • Define and train a decision tree for visualization first
  • Define and train a decision tree for validation with cross validation
      • Hint: Search for good hyperparameters using the CAMELYON16 test set and the CAMELYON17 training set
  • Define and train a decision tree for predicting the CAMELYON17 training set
      • Hint1: Use cross_val_predict
      • Hint2: For optimal results always use the complete CAMELYON16 test set and all but one slide of the CAMELYON17 training set for training the classifier and only predict the one slide left.
clf = None # Exercise
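The hyperparameter search and the scheme from Hint2 can be sketched on synthetic data as shown below. Everything about the data is invented (two features, class means that grow with the class index), and the parameter grid is only an example. Since cross_val_predict expects the test folds to cover every sample exactly once, the Hint2 scheme, in which CAMELYON16 slides are never predicted, is written here as an explicit loop.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
classes = np.array(["negative", "itc", "micro", "macro"])

def make_toy(n_per_class):
    """Invented data: 2 features whose mean grows with the class index."""
    x = np.vstack([rng.normal(loc=i, scale=0.3, size=(n_per_class, 2))
                   for i in range(len(classes))])
    y = np.repeat(classes, n_per_class)
    return x, y

x_c16, y_c16 = make_toy(10)  # stands in for the CAMELYON16 test set
x_c17, y_c17 = make_toy(5)   # stands in for the CAMELYON17 training set

# 1) Hyperparameter search on the combined data with 5-fold cross validation.
x_all = np.vstack([x_c16, x_c17])
y_all = np.concatenate([y_c16, y_c17])
search = GridSearchCV(tree.DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 3, 5],
                                  "min_samples_leaf": [1, 3]},
                      cv=5)
search.fit(x_all, y_all)
print(search.best_params_, search.best_score_)

# 2) Hint2: predict each CAMELYON17 slide with a tree trained on all
#    CAMELYON16 slides plus the remaining CAMELYON17 slides.
pred_list = []
for i in range(len(y_c17)):
    keep = np.arange(len(y_c17)) != i
    x_fit = np.vstack([x_c16, x_c17[keep]])
    y_fit = np.concatenate([y_c16, y_c17[keep]])
    clf_i = tree.DecisionTreeClassifier(random_state=0, **search.best_params_)
    clf_i.fit(x_fit, y_fit)
    pred_list.append(clf_i.predict(x_c17[i:i + 1])[0])
preds = np.array(pred_list)

# Fraction of held-out slides predicted correctly.
print((preds == y_c17).mean())
```

The loop trains one tree per slide, which is cheap for a few hundred slides and a handful of features but would not scale to large datasets.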

clf is an instance of a trained decision tree classifier.

The decision tree can be visualized. For this we must write a graphviz dot file.

It should look like:

# columns: list with the names of the feature columns used for x
graph = Source(tree.export_graphviz(clf, out_file=None,
                                    feature_names=columns,
                                    filled=True))
graph

### To open in a separate window and save it
#graph.format = 'png'
#graph.render('dtree_render',view=True)
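If graphviz is not installed, the learned rules can also be inspected as plain text. The sketch below uses tree.export_text, which is available in scikit-learn 0.21 and later, on an invented four-sample data set; clf_toy is a stand-in for your own trained clf.

```python
from sklearn import tree

# Invented toy data: a single feature separates the two classes.
x_toy = [[0.1], [0.2], [0.8], [0.9]]
y_toy = ["negative", "negative", "macro", "macro"]

clf_toy = tree.DecisionTreeClassifier(max_depth=1, random_state=0)
clf_toy.fit(x_toy, y_toy)

# One line per split and per leaf, indented by tree depth.
rules = tree.export_text(clf_toy, feature_names=["highest_probability"])
print(rules)
```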

Save as CSV

First prepare the DataFrame:

  • make a deep copy of c17_train
  • replace the stage values with your predictions
  • remove all columns except the patient and stage columns
  • save as CSV
# names_preds: dict mapping each slide name (patient column) to its predicted stage
print(len(names_preds))
c17_train_copy = c17_train.copy(deep=True)
print(c17_train_copy.loc[0, 'stage'])

# Replace the ground truth stage of every slide with its prediction
c17_train_copy['stage'] = c17_train_copy['patient'].map(names_preds)

# Fraction of slides whose prediction equals the ground truth (accuracy)
print((c17_train_copy['stage'] == c17_train['stage']).mean())

# Keep only the patient and stage columns and save them
c17_train_copy = c17_train_copy[['patient', 'stage']]
c17_train_copy.to_csv(PATH_C17TRAIN_PREDICITONS)

Summary and Outlook

Congratulations! If you worked through all notebooks from the beginning, you have just completed 99% of the CAMELYON challenges (16 and 17).

If you want to improve your classifier, you can try replacing the decision tree with another algorithm, e.g. naive Bayes, a support vector machine or a random forest. Depending on the classification algorithm, you might also want to try feature selection beforehand.
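A rough sketch of such a comparison, using 5-fold cross validation on invented data. The hyperparameters are arbitrary starting points, and the scores printed for this toy data say nothing about how the models behave on the real heatmap features.

```python
import numpy as np
from sklearn import tree, naive_bayes, ensemble, svm
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
classes = np.array(["negative", "itc", "micro", "macro"])

# Invented data: 15 slides per class, 2 features.
x_toy = np.vstack([rng.normal(loc=i, scale=0.3, size=(15, 2))
                   for i in range(len(classes))])
y_toy = np.repeat(classes, 15)

candidates = {
    "decision tree": tree.DecisionTreeClassifier(max_depth=3, random_state=0),
    "naive bayes": naive_bayes.GaussianNB(),
    "random forest": ensemble.RandomForestClassifier(n_estimators=50,
                                                     random_state=0),
    "svm": svm.SVC(gamma="scale"),
}

# Mean accuracy over the 5 folds for every candidate model.
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, x_toy, y_toy, cv=5).mean()
    print(name, round(scores[name], 3))
```

Because every candidate exposes the same fit and predict interface, the rest of the notebook (cross validation, prediction, saving) works unchanged after swapping the model.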

In the next notebook you will determine the patient's pN stage based on your predictions for the slides, to finally calculate the kappa score and compare your results with others on the CAMELYON website.

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

XXX
by Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.