Medical Image Classification Scenario

This project contains Python Jupyter notebooks to teach machine learning content in the context of medical data, e.g., automated tumor detection. The material focuses primarily on teaching basic knowledge of convolutional neural networks but also contains portions of fundamental machine learning knowledge. As an advanced topic, you get an introduction to the concept of attention models.

Scenario Description

The usual way to diagnose breast cancer is to analyze a patient’s tissue samples under a microscope. Examining such tissue slides is a complex task that requires a pathologist with years of training and expertise in a specific area. Yet scientific studies show that even among highly experienced experts there can be substantial variability in the diagnoses for the same patient, which indicates the possibility of misdiagnosis [ELM15][ROB95]. This result is not surprising given the enormous amount of detail in a tissue slide at 40x magnification. To get a sense of the amount of data, imagine that an average digitized tissue slide has a size of 200,000 x 100,000 pixels and that you have to inspect every one of them to reach an accurate diagnosis. If you have to examine multiple slides per patient and have several patients, this is a lot of data to cover in a usually limited amount of diagnosis time. The following image depicts a scanned tissue slide (whole slide image, WSI for short) at different magnification levels.

WSI example
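To make the data volume of a single slide concrete, the following minimal sketch works through the arithmetic for the slide size quoted above. The bytes-per-pixel value (uncompressed 8-bit RGB) and the 256-pixel tile side are assumptions chosen for illustration, not properties of the dataset.

```python
# Rough size estimate for one digitized slide at full resolution.
WIDTH_PX = 200_000
HEIGHT_PX = 100_000
BYTES_PER_PIXEL = 3  # assumption: 8-bit RGB, no alpha channel, no compression
TILE_SIDE_PX = 256   # assumption: hypothetical square tile size


def slide_stats(width, height, bytes_per_pixel, tile_side):
    """Return pixel count, raw byte count and number of tiles for a slide."""
    pixels = width * height
    raw_bytes = pixels * bytes_per_pixel
    # Ceiling division so that partial tiles at the border are counted too.
    tiles = -(-width // tile_side) * -(-height // tile_side)
    return pixels, raw_bytes, tiles


pixels, raw_bytes, tiles = slide_stats(
    WIDTH_PX, HEIGHT_PX, BYTES_PER_PIXEL, TILE_SIDE_PX
)
print(f"{pixels / 1e9:.0f} gigapixels")
print(f"{raw_bytes / 1e9:.0f} GB uncompressed")
print(f"{tiles:,} tiles of {TILE_SIDE_PX}x{TILE_SIDE_PX} pixels")
```

Under these assumptions a single slide amounts to 20 gigapixels, which is why the notebooks work on small tiles rather than on whole slides.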

Under such circumstances, an automated detection algorithm can complement the pathologists’ work process and increase the likelihood of an accurate diagnosis. Such algorithms have been developed successfully in recent years, in particular models based on Convolutional Neural Networks [WAN16][LIU17]. Getting enough data to train machine learning algorithms is still a challenge in the medical context. However, the Radboud University Medical Center (Nijmegen, the Netherlands) and the University Medical Center Utrecht (Utrecht, the Netherlands) provide an extensive dataset containing sentinel lymph nodes of breast cancer patients in the context of their CAMELYON16 challenge. These data provide a good starting point for further scientific investigations and are therefore the main dataset used in this scenario. You can get the raw data as a registered user of the CAMELYON17 challenge (GoogleDrive/Baidu) or create a custom dataset for your own needs with the CVEDIA platform.

In the context of this medical scenario, you will develop a custom classifier for the given medical dataset that will decide:

  1. Whether a lymph node tissue contains metastases.
  2. What kinds of metastases are present, e.g., micro- or macro-metastases.
  3. Which pN stage the patient is in, based on the TNM staging system.
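The second and third decisions can be pictured as simple rules applied on top of per-node findings. The sketch below is only an illustration: the size thresholds and the staging rules are assumptions modeled on the simplified pN rules commonly associated with the CAMELYON17 challenge and should be checked against the official challenge description. The cell-count criterion is ignored, and the function names are hypothetical.

```python
from collections import Counter

# Assumed size thresholds in millimeters; verify against the challenge rules.
ITC_MAX_MM = 0.2    # up to this size: isolated tumor cells (ITC)
MICRO_MAX_MM = 2.0  # up to this size: micro-metastasis, above: macro-metastasis


def node_label(largest_lesion_mm):
    """Map the largest tumor extent found in one lymph node to a label."""
    if largest_lesion_mm <= 0:
        return "negative"
    if largest_lesion_mm <= ITC_MAX_MM:
        return "itc"
    if largest_lesion_mm <= MICRO_MAX_MM:
        return "micro"
    return "macro"


def pn_stage(node_labels):
    """Derive a patient-level pN stage from per-node labels (assumed rules)."""
    counts = Counter(node_labels)
    # Assumption: ITC-only nodes do not count as metastatic nodes.
    metastatic_nodes = counts["micro"] + counts["macro"]
    if counts["macro"] == 0:
        if counts["micro"] > 0:
            return "pN1mi"
        return "pN0(i+)" if counts["itc"] > 0 else "pN0"
    return "pN1" if metastatic_nodes <= 3 else "pN2"


labels = [node_label(size) for size in (0.0, 0.1, 1.2, 3.5, 0.0)]
print(labels, "->", pn_stage(labels))
```

In the actual scenario these labels would come from trained classifiers operating on features extracted from the slides, not from a direct size measurement.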

Teaching Material

Detection of metastases is a classification problem. To solve this first issue, you will implement a classification pipeline based on Neural Networks (NN) and develop it into a Convolutional Neural Network (CNN) [LEC98]. Further, you will extend that pipeline with classical machine learning approaches, such as decision trees, to address the second and third issues.
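The step from a fully connected network to a CNN rests on one operation: a small kernel of shared weights slid across the image. The following is a minimal pure-Python sketch of that operation on a toy image. The image, the kernel values and the function name are made up for illustration; real pipelines would use a deep learning library instead.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both lists of rows) without padding.

    As in most CNN libraries, this is technically a cross-correlation:
    the kernel is not flipped before it is applied.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 4x4 "image" with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked 2x2 kernel that responds to a dark-to-bright step.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)

# Weight count for mapping 16 inputs to 9 outputs (biases ignored).
dense_weights = (len(image) * len(image[0])) * (len(feature_map) * len(feature_map[0]))
conv_weights = len(kernel) * len(kernel[0])
print(feature_map)
print(dense_weights, "dense weights vs.", conv_weights, "shared kernel weights")
```

The feature map is non-zero only where the kernel overlaps the edge, and the kernel needs far fewer weights than a dense layer with the same input and output size.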

To keep the teaching material in this scenario modular and reusable, we divide it into WSI preprocessing (global operations, like I/O handling of WSI data), WSI postprocessing (building heatmaps, extracting features, preparing for further classification) and classification (machine learning approaches to solve our scenario issues). In addition, you will examine the concept of soft attention [MIN14][XU15] on WSIs to see whether the amount of processed data can be reduced that way.
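The interplay of the three parts can be sketched without any imaging library: a slide is cut into tiles, a classifier scores each tile, the scores form a heatmap, and summary features of the heatmap feed a further classifier. All function names, the tile size and the stand-in predictor below are hypothetical and only illustrate the data flow.

```python
def tile_origins(width, height, tile_side):
    """Yield (col, row, x, y) for non-overlapping tiles covering the slide."""
    for row, y in enumerate(range(0, height, tile_side)):
        for col, x in enumerate(range(0, width, tile_side)):
            yield col, row, x, y


def build_heatmap(width, height, tile_side, predict):
    """Score every tile with `predict` and arrange the scores as a 2D grid."""
    cols = -(-width // tile_side)  # ceiling division
    rows = -(-height // tile_side)
    heatmap = [[0.0] * cols for _ in range(rows)]
    for col, row, x, y in tile_origins(width, height, tile_side):
        heatmap[row][col] = predict(x, y, tile_side)
    return heatmap


def heatmap_features(heatmap, threshold=0.5):
    """Condense a heatmap into a few slide-level features."""
    values = [p for row in heatmap for p in row]
    positive = [p for p in values if p >= threshold]
    return {
        "max_probability": max(values),
        "mean_probability": sum(values) / len(values),
        "positive_fraction": len(positive) / len(values),
    }


def dummy_predict(x, y, tile_side):
    """Stand-in for a trained tile classifier: flags the top-left region."""
    return 0.9 if x < 512 and y < 256 else 0.1


heatmap = build_heatmap(1024, 512, 256, dummy_predict)
features = heatmap_features(heatmap)
print(heatmap)
print(features)
```

Keeping tiling, scoring and feature extraction in separate functions mirrors the modular split described above: each stage can be replaced without touching the others.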

The deep.TEACHING project provides educational material for students to gain basic knowledge about the problem domain, the programming, math and statistics requirements, as well as the mentioned algorithms and their evaluation. Students will also learn how to construct complex machine learning systems, which can incorporate several algorithms at once.



To run the notebooks that process the WSIs, OpenSlide is required.

Manual installation on Ubuntu 18.04 (tested):

  • Go to the openslide download page and download the tar.gz

    • This tutorial uses openslide version 3.4.1, which is confirmed to work (2018-11-02).
  • Install the following system packages. It is suggested to install the newest versions with:

    sudo apt-get install libjpeg-dev libtiff-dev libglib2.0-dev libcairo2-dev libgdk-pixbuf2.0-dev libxml2-dev libsqlite3-dev valgrind zlib1g-dev libopenjp2-tools libopenjp2-7-dev
  • However, if installation fails, the following versions are confirmed to work with openslide version 3.4.1:

    sudo apt-get install libjpeg-dev=8c-2ubuntu8 libtiff-dev=4.0.9-5 libglib2.0-dev=2.56.2-0ubuntu0.18.04.2 libcairo2-dev=1.15.10-2 libgdk-pixbuf2.0-dev=2.36.11-2 libxml2-dev=2.9.4+dfsg1-6.1ubuntu1.2 libsqlite3-dev=3.22.0-1 valgrind=1:3.13.0-2ubuntu2.1 zlib1g-dev=1:1.2.11.dfsg-0ubuntu2 libopenjp2-tools=2.3.0-1 libopenjp2-7-dev=2.3.0-1
  • Unpack the openslide tar.gz file and, inside the unpacked folder, execute the following (excerpt from the README.txt; the install step may require root privileges):

    ./configure
    make
    make install
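  • Finally, add the following to the end of your ~/.bashrc:

    ########## OpenSlide START ############
    # Assumes the default install prefix /usr/local; adjust the path otherwise
    export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/lib"
    ########## OpenSlide END ##############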
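After the installation, and in a freshly opened shell, a quick check can tell whether the dynamic loader is able to locate the OpenSlide shared library. The following sketch uses only the Python standard library. The helper names are hypothetical, and a result of "not found" only means the loader search failed in the current environment, not necessarily that the build itself failed.

```python
import ctypes.util
import os


def find_openslide():
    """Return the library name resolved for OpenSlide, or None if not found."""
    return ctypes.util.find_library("openslide")


def library_path_entries():
    """Return the non-empty entries of LD_LIBRARY_PATH as a list."""
    raw = os.environ.get("LD_LIBRARY_PATH", "")
    return [entry for entry in raw.split(":") if entry]


library = find_openslide()
print("OpenSlide library:", library or "not found")
print("LD_LIBRARY_PATH entries:", library_path_entries())
```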

Python Packages

To run the notebooks navigate to “educational-materials/medical-image-classification” and execute the following commands:

# Create a new virtual environment for this course and install dependencies from Pipfile.lock
pipenv install
# Create an ipython kernel for the virtual environment
pipenv run ipython kernel install --user --name medical_image_classification
# When opening a notebook with Jupyter Lab, select medical_image_classification (upper right corner)

It is possible that some packages do not work properly when loaded within an ipython-kernel referring to a virtual environment (medical_image_classification). In this case, install these packages for the user via pip:

pip3 install --user progress
pip3 install --user scikit-image


Topics are tentative and subject to change.

  • WSI Preprocessing

    • Reading different WSI formats
    • Extract tissue from WSIs and create a custom dataset
    • Apply data augmentation and further task-specific data transformations
    • Exploring different color models in context of medical data analysis
  • WSI Postprocessing

    • WSI confidence maps as classification result
    • Morphological operations to improve classification results
    • Extract useful features from confidence maps to train further classifiers
  • Classification

    • Neural Networks as image classifier
    • Convolutional Neural Networks as image classifier
    • Recurrent Neural Networks and Convolutional Neural Networks for image segmentation
    • Decision tree (random forest, mlp)
    • Cluster analysis (k-means, Expectation-Maximization (EM))
  • Soft-Attention

    • Understand and rebuild [MIN14]
    • Apply soft attention concept to WSIs
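As a small taste of the preprocessing topics on tissue extraction and color models listed above: one common heuristic separates stained tissue from the bright, colorless slide background by thresholding the saturation channel of the HSV color model. The sketch below applies it to a handful of made-up pixel values. The threshold and the function names are assumptions for illustration, not the method used in the notebooks.

```python
import colorsys


def is_tissue(rgb, min_saturation=0.1):
    """Treat a pixel as tissue if its HSV saturation reaches a threshold."""
    r, g, b = (channel / 255.0 for channel in rgb)
    _, saturation, _ = colorsys.rgb_to_hsv(r, g, b)
    return saturation >= min_saturation


def tissue_fraction(pixels, min_saturation=0.1):
    """Return the share of pixels classified as tissue."""
    flags = [is_tissue(pixel, min_saturation) for pixel in pixels]
    return sum(flags) / len(flags)


# Made-up pixel values standing in for one tile of a slide.
tile = [
    (255, 255, 255),  # white background
    (240, 240, 240),  # light gray background
    (200, 120, 180),  # pinkish purple, as stained tissue might appear
    (120, 60, 140),   # darker purple
]
fraction = tissue_fraction(tile)
print(f"tissue fraction: {fraction:.2f}")
```

Tiles whose tissue fraction falls below some chosen cutoff could then be skipped when building a custom dataset, which reduces the amount of background the classifier has to process.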

References (ISO 690)

[ELM15] ELMORE, Joann G., et al. Diagnostic concordance among pathologists interpreting breast biopsy specimens. JAMA, 2015, vol. 313, no. 11, pp. 1122-1132.
[LEC98] LECUN, Yann, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, vol. 86, no. 11, pp. 2278-2324.
[LIU17] LIU, Yun, et al. Detecting cancer metastases on gigapixel pathology images. arXiv preprint arXiv:1703.02442, 2017.
[MIN14] MNIH, Volodymyr, et al. Recurrent models of visual attention. In: Advances in Neural Information Processing Systems. 2014. pp. 2204-2212.
[ROB95] ROBBINS, P., et al. Histological grading of breast carcinomas: a study of interobserver agreement. Human Pathology, 1995, vol. 26, no. 8, pp. 873-879.
[WAN16] WANG, Dayong, et al. Deep learning for identifying metastatic breast cancer. arXiv preprint arXiv:1606.05718, 2016.
[XU15] XU, Kelvin, et al. Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning. 2015. pp. 2048-2057.