Classification of Heatmaps
In the last notebook we extracted geometrical features from our heatmaps and saved them in CSV files. Now we will use these features to train a simple classifier that predicts the labels of the slides (negative, itc, micro, macro).
import numpy as np
import pandas as pd
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
from graphviz import Source
from sklearn import tree, naive_bayes, ensemble
from sklearn.impute import SimpleImputer  # successor of the removed sklearn.preprocessing.Imputer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate, cross_val_score, cross_val_predict
Before we start, adjust the path of CAM_BASE_DIR (and also other variables as needed).
### EDIT THIS CELL:
### Assign the path to your CAMELYON16 data and create the directories
### if they do not exist yet.
CAM_BASE_DIR = '/path/to/CAMELYON/data/'
# example: CAM_BASE_DIR = '/media/klaus/2612FE3171F55111/'
GENERATED_DATA = CAM_BASE_DIR + 'tutorial/'
HEATMAP_DIR = CAM_BASE_DIR + 'c16traintest_c17traintest_heatmaps_grey/'

# CAMELYON16 and 17 ground truth labels
PATH_C16_LABELS = CAM_BASE_DIR + 'CAMELYON16/test/Ground_Truth/reference.csv'
PATH_C17_LABELS = CAM_BASE_DIR + 'CAMELYON17/training/stage_labels.csv'

# Feature files written by the previous notebook
FEATURES_C16TEST = GENERATED_DATA + 'features_c16_test.csv'
FEATURES_C17TRAIN = GENERATED_DATA + 'features_c17_train.csv'
FEATURES_C17TEST = GENERATED_DATA + 'features_c17_test.csv'

# Here we will save our predictions
PATH_C17TRAIN_PREDICITONS = CAM_BASE_DIR + 'CAMELYON17/c17_train_predictions.csv'
# Separate file name so the test predictions do not overwrite the training predictions
PATH_C17TEST_PREDICITONS = CAM_BASE_DIR + 'CAMELYON17/c17_test_predictions.csv'
Load the Data
Now we read in the CSV files as pandas DataFrames.
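A minimal sketch of this step, assuming the feature files from the previous notebook are ordinary comma-separated files with a header row. The helper `load_features` and the demo file are hypothetical; in the notebook you would pass FEATURES_C16TEST, FEATURES_C17TRAIN and FEATURES_C17TEST instead.

```python
import os
import tempfile

import pandas as pd


def load_features(path):
    """Read one feature CSV into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path)


# Throwaway demo file so the sketch runs without the CAMELYON data.
# The rows are made up; only the column names appear in this notebook.
demo_path = os.path.join(tempfile.mkdtemp(), 'features_demo.csv')
pd.DataFrame({
    'patient': ['slide_a.tif', 'slide_b.tif'],
    'stage': ['negative', 'macro'],
    'highest_probability': [0.12, 0.98],
}).to_csv(demo_path, index=False)

demo = load_features(demo_path)
print(demo.shape)
print(demo.head())
```

Printing the shape and the first rows right after loading is a cheap way to catch a wrong path or an unexpected extra index column.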
Prepare the Data
Combine the CAMELYON16 test data and c17_train in a new DataFrame, c1617_train. This is the dataset we will use for hyperparameter optimization.
c1617_train = None ### Exercise
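One possible way to build such a combined frame is `pandas.concat`. The sketch below uses two small placeholder frames (`c16_test_demo`, `c17_train_demo`, both hypothetical) and assumes the real frames share the same columns.

```python
import pandas as pd

# Placeholder frames standing in for the CAMELYON16 test and CAMELYON17 training features.
c16_test_demo = pd.DataFrame({
    'stage': ['negative', 'macro'],
    'highest_probability': [0.1, 0.9],
})
c17_train_demo = pd.DataFrame({
    'stage': ['itc', 'micro', 'negative'],
    'highest_probability': [0.6, 0.8, 0.2],
})

# ignore_index=True renumbers the rows 0..n-1, so label-based lookups
# such as .loc[i] stay unambiguous after stacking the two frames.
c1617_demo = pd.concat([c16_test_demo, c17_train_demo], ignore_index=True)
print(len(c1617_demo))
```

Without `ignore_index=True` both source frames would keep their own row labels, and the combined frame would contain duplicate index values.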
Now we split the data into labels and features.
- Load the stage column into a variable (y)
- Load the six features (highest_probability, ...) into a variable (x)
# Exercise
x = None  # features
y = None  # labels
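The split could look roughly like the sketch below. Only `stage` and `highest_probability` are names taken from this notebook; `feature_b` and `feature_c` are hypothetical stand-ins for the remaining feature columns, and the demo frame is made up.

```python
import pandas as pd

# Replace the hypothetical names with the real feature columns of your CSV files.
feature_columns = ['highest_probability', 'feature_b', 'feature_c']

demo = pd.DataFrame({
    'patient': ['a', 'b', 'c'],
    'stage': ['negative', 'itc', 'macro'],
    'highest_probability': [0.1, 0.6, 0.9],
    'feature_b': [1.0, 2.0, 3.0],
    'feature_c': [0.0, 5.0, 50.0],
})

x_demo = demo[feature_columns].to_numpy()  # shape (n_slides, n_features)
y_demo = demo['stage'].to_numpy()          # shape (n_slides,)
print(x_demo.shape, y_demo.shape)
```

Keeping the feature names in one list is convenient later, because the same list can be passed as `feature_names` when the tree is visualized.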
Remove Invalid Values
Some heatmaps (~2 or 3) could not be created by the CNN, so the feature values for those slides are missing.
Replace the missing values using an imputer from scikit-learn.
For better results, look up the labels of the slides with missing heatmaps and replace each missing feature value with the mean of that feature over all slides with the same label.
### Exercise
# imp =
# x =
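Both strategies can be sketched on a toy frame. This is an illustration under assumptions, not the reference solution: it uses `SimpleImputer` from `sklearn.impute` (available in scikit-learn 0.20 and later) for the global mean and a pandas `groupby` for the per-label mean, and the demo data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

demo = pd.DataFrame({
    'stage': ['negative', 'negative', 'macro', 'macro'],
    'highest_probability': [0.1, np.nan, 0.9, 0.7],
})

# Option 1: replace NaN with the mean of the whole column.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
x_global = imp.fit_transform(demo[['highest_probability']])

# Option 2: replace NaN with the mean over slides that share the same label.
label_mean = demo.groupby('stage')['highest_probability'].transform('mean')
x_by_label = demo['highest_probability'].fillna(label_mean).to_numpy()

print(x_global.ravel())
print(x_by_label)
```

The per-label variant needs the true labels, so it is only applicable to data with known ground truth; for unlabeled test slides the global mean (or another label-free strategy) is the fallback.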
Train and Visualize Simple Decision Tree
Now we are ready to define and train a decision tree. We use the scikit-learn decision tree module.
- Define and train a decision tree for visualization first
- Define and train a decision tree for validation with cross validation
- Hint: Search for good hyperparameters using the CAMELYON16 test set and the CAMELYON17 training set
- Define and train a decision tree for predicting the CAMELYON17 training set
- Hint1: Use
- Hint2: For optimal results always use the complete CAMELYON16 test set and all but one slide of the CAMELYON17 training set for training the classifier and only predict the one slide left.
clf = None # Exercise
clf is an instance of a trained decision tree classifier.
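The two hints can be sketched as follows, on synthetic stand-in data. The arrays `x_c16`, `y_c16`, `x_c17`, `y_c17`, the parameter grid and the two-class labels are all assumptions for illustration; the real exercise uses the four slide labels and the features loaded above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Synthetic stand-ins: 40 "CAMELYON16" slides and 20 "CAMELYON17" slides, 6 features each.
x_c16, y_c16 = rng.rand(40, 6), rng.choice(['negative', 'macro'], size=40)
x_c17, y_c17 = rng.rand(20, 6), rng.choice(['negative', 'macro'], size=20)

# 1) Hyperparameter search with cross validation on the combined data.
param_grid = {'max_depth': [2, 3, 5], 'min_samples_leaf': [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(np.vstack([x_c16, x_c17]), np.concatenate([y_c16, y_c17]))
print(search.best_params_)

# 2) Predict each CAMELYON17 slide with a tree trained on the complete
#    CAMELYON16 set plus all other CAMELYON17 slides.
preds = []
for i in range(len(x_c17)):
    keep = np.arange(len(x_c17)) != i  # mask that leaves out slide i
    x_fit = np.vstack([x_c16, x_c17[keep]])
    y_fit = np.concatenate([y_c16, y_c17[keep]])
    clf_i = DecisionTreeClassifier(random_state=0, **search.best_params_)
    clf_i.fit(x_fit, y_fit)
    preds.append(clf_i.predict(x_c17[i:i + 1])[0])

print(len(preds))
```

Training one tree per left-out slide is affordable here because decision trees on a few hundred rows and six features fit in milliseconds.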
The decision tree can be visualized. To do so, we export it in Graphviz dot format.
It should look like:
# columns: list with the names of the features used to build x
graph = Source(tree.export_graphviz(clf, out_file=None, feature_names=columns, filled=True))
graph
### To open in a separate window and save it:
#graph.format = 'png'
#graph.render('dtree_render', view=True)
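If Graphviz is not installed, a plain-text dump of the tree can serve as a quick check. The sketch below uses `sklearn.tree.export_text` (available in scikit-learn 0.21 and later) on synthetic data; the feature names other than `highest_probability` and the toy labelling rule are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.RandomState(0)
feature_names = ['highest_probability', 'feature_b', 'feature_c']

# Toy data: the label depends only on the first feature.
x_demo = rng.rand(50, 3)
y_demo = np.where(x_demo[:, 0] > 0.5, 'macro', 'negative')

clf_demo = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x_demo, y_demo)

# One line per node: split conditions for inner nodes, predicted class for leaves.
rules = export_text(clf_demo, feature_names=feature_names)
print(rules)
```

The text output carries the same split thresholds as the rendered graph, so it is also handy for pasting into notes or diffs.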
Save as CSV
First prepare the DataFrame:
- make a deep copy of c17_train
- replace the stage values with your predictions
- remove all columns except the
- save as csv
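# names_preds maps each slide name (column 'patient') to its predicted stage
print(len(names_preds))
c17_train_copy = c17_train.copy(deep=True)
print(c17_train_copy.loc[0].values)  # first row as a sanity check
n_slides = len(c17_train_copy)

# Overwrite the ground-truth stage with the prediction for each slide
for i in range(n_slides):
    c17_train_copy.loc[i, 'stage'] = names_preds[c17_train_copy.loc[i, 'patient']]

# Fraction of slides whose predicted stage matches the true stage
count = 0
for i in range(n_slides):
    if c17_train_copy.loc[i, 'stage'] == c17_train.loc[i, 'stage']:
        count += 1
print(count / n_slides)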
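# Keep only the first three columns
c17_train_copy = c17_train_copy.drop(c17_train_copy.columns[3:], axis=1)
# Assumption: column 0 is a leftover index column, so only the next two columns are kept
c17_train_copy = c17_train_copy.drop(c17_train_copy.columns[0], axis=1)
c17_train_copy.to_csv(PATH_C17TRAIN_PREDICITONS)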
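The row-by-row replacement and the accuracy check can also be written in vectorized form. A small sketch, assuming a dictionary from slide name to predicted stage like `names_preds`; the frame `truth` and the dictionary `names_preds_demo` are made-up stand-ins.

```python
import pandas as pd

truth = pd.DataFrame({
    'patient': ['slide_0.tif', 'slide_1.tif', 'slide_2.tif', 'slide_3.tif'],
    'stage': ['negative', 'macro', 'itc', 'micro'],
})

# Hypothetical predictions keyed by slide name.
names_preds_demo = {
    'slide_0.tif': 'negative',
    'slide_1.tif': 'macro',
    'slide_2.tif': 'negative',
    'slide_3.tif': 'micro',
}

pred = truth.copy(deep=True)
# Look up the prediction for every slide name in one step.
pred['stage'] = pred['patient'].map(names_preds_demo)

# Element-wise comparison gives a boolean Series; its mean is the accuracy.
accuracy = (pred['stage'] == truth['stage']).mean()
print(accuracy)  # 0.75
```

`Series.map` leaves NaN for slide names missing from the dictionary, which makes incomplete predictions easy to spot before saving.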
Summary and Outlook
Congratulations! If you worked through all notebooks from the beginning, you have just completed 99% of the CAMELYON challenge (16 and 17).
If you want to improve your classifier, you can try replacing the decision tree with another algorithm, e.g. naive Bayes, a support vector machine or a random forest. Depending on the classification algorithm, you might also want to try feature selection beforehand.
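Swapping the algorithm mostly means swapping the estimator object, since all scikit-learn classifiers share the same fit/predict interface. The comparison below is a sketch on synthetic data with made-up hyperparameters; the scores it prints say nothing about performance on the CAMELYON features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in for the six slide features and their labels.
x_demo = rng.rand(80, 6)
y_demo = np.where(x_demo[:, 0] > 0.5, 'macro', 'negative')

candidates = {
    'decision tree': DecisionTreeClassifier(max_depth=3, random_state=0),
    'naive bayes': GaussianNB(),
    'random forest': RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each candidate.
scores = {
    name: cross_val_score(model, x_demo, y_demo, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f'{name}: {score:.2f}')
```

Evaluating every candidate with the same folds and the same data keeps the comparison fair.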
In the next notebook you will determine the patient's pN stage based on your predictions for the slides, to finally calculate the kappa score and compare your results with others on the CAMELYON website.
Notebook License (CC-BY-SA 4.0)
The following license applies to the complete notebook, including code cells. It does not, however, apply to any referenced external media (e.g., images).
This notebook by Klaus Strohmenger is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.
Code License (MIT)
The following license only applies to code cells of the notebook.
Copyright 2018 Klaus Strohmenger
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.