Data Handling Usage Guide

Introduction

Handling huge slide images in TIFF format is very specific to the use case of histology and requires knowledge of specialized packages such as OpenSlide. This series of notebooks offers some predefined classes for accessing the data, which are presented here.

These predefined classes and functions remove the need to become familiar with OpenSlide and other specialized packages and allow easy access to the data, while still showing which steps are necessary when dealing with huge images that cannot be loaded into RAM at once.

This notebook demonstrates how to use the SlideManager class to view Whole-Slide Images (WSIs), how to divide them into smaller pieces, and how to store them for faster access.

For demonstration purposes we suggest downloading only one tumor image (and the corresponding annotation) and one normal image of the CAMELYON16 data set. Detailed instructions on how to do this are given in this notebook.

The only thing you have to edit in this notebook is the path where these files are stored on your local machine.

Just execute each cell and try to understand what happens. You do not have to memorize everything now; when you have to use the SlideManager class yourself in another notebook, you can refer back to this notebook as a usage guide.

Requirements

Python-Modules

import random

import numpy as np
import h5py

from skimage.filters import threshold_otsu
from matplotlib import pyplot as plt

from preprocessing.datamodel import SlideManager
from preprocessing.processing import split_negative_slide, split_positive_slide, create_tumor_mask, rgb2gray
from preprocessing.util import TileMap

%matplotlib inline

Data

The data used in this notebook are from the CAMELYON data sets, which are freely available on the CAMELYON data page.

The whole data sets have the following sizes:

  • CAMELYON16 (~715 GiB)
  • CAMELYON17 (~2.8 TiB)

For this notebook to work the following file structure inside the data set folders must be given:

data
├── CAMELYON16
│   └── training
│       ├── lesion_annotations
│       │   └── tumor_001.xml - tumor_110.xml
│       ├── normal
│       │   └── normal_001.tif - normal_160.tif
│       └── tumor
│           └── tumor_001.tif - tumor_110.tif
│
└── CAMELYON17
    └── training
        ├── center_0
        │   └── patient_000_node_0.tif - patient_019_node_4.tif
        ├── center_1
        │   └── patient_020_node_0.tif - patient_039_node_4.tif
        ├── center_2
        │   └── patient_040_node_0.tif - patient_059_node_4.tif
        ├── center_3
        │   └── patient_060_node_0.tif - patient_079_node_4.tif
        ├── center_4
        │   └── patient_080_node_0.tif - patient_099_node_4.tif
        ├── lesion_annotations
        │   └── patient_004_node_4.xml - patient_099_node_4.xml
        └── stage_labels.csv

IMPORTANT

Not all of the files are needed to run this notebook. If you have not already downloaded the CAMELYON16 data set, download the following three files from the official Google Drive:

  • normal/normal_101.tif (2 GiB)
  • tumor/tumor_110.tif (1.4 GiB)
  • tumor_110.xml (in lesion_annotations.zip)

Note that you will need the whole CAMELYON16 data set in the next notebook. So after you have downloaded the three mentioned files, you might already want to start the download of the complete CAMELYON16 data set.

Teaching Content

CAMELYON Data Set

The CAMELYON challenges are two grand challenges, held in 2016 and 2017, organized by the Diagnostic Image Analysis Group (DIAG) with the goal of evaluating algorithms for automated detection and classification of breast cancer metastases in whole-slide images (WSIs) of histological lymph node sections.

Each of the challenges provides a big data set of Whole-Slide Images (WSIs). Each slide contains one or more hematoxylin and eosin (H&E) stained sections of one lymph node. The CAMELYON17 data set alone consists of 100 patients for training and 100 patients for testing, with 5 slides each. Additionally, the data sets provide lesion-level annotations for tumorous tissue: tumor masks for CAMELYON16 and annotations in XML files for CAMELYON17.

The WSIs are provided in TIFF format. Each file consists of multiple layers, with every layer providing a different resolution of the image, allowing fast access to and viewing of multiple resolutions. The lowest layer (layer 0) has the original resolution, and with every layer above it the resolution is halved, effectively stacking up the layers like a pyramid.

The images on the lowest layer of a slide can be huge, with over 200,000 pixels per axis, which makes it difficult to load a whole image into memory at once.
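
The layer geometry can be sketched with a few lines of plain Python (an illustration only; real OpenSlide files store the actual layer dimensions, and the rounding may differ from file to file):

```python
def pyramid_dimensions(width, height, levels):
    """Return the (width, height) of each layer, halving per step (rounding up)."""
    dims = []
    for _ in range(levels):
        dims.append((width, height))
        width = (width + 1) // 2    # halve the resolution for the next layer
        height = (height + 1) // 2
    return dims

# a hypothetical slide of 97,792 x 221,184 pixel on layer 0 shrinks quickly:
for lvl, dim in enumerate(pyramid_dimensions(97792, 221184, 4)):
    print('layer {}: {:,} x {:,}'.format(lvl, *dim))
```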

The Preprocessing Package

The preprocessing package provides classes and functions that allow preprocessing the CAMELYON data set in a comfortable way.

SlideManager

The SlideManager class provides access to the whole-slide images of the CAMELYON data set. On object creation, it reads in the files in the specified path and creates Slide objects for every WSI.

The following example loads only the CAMELYON16 data set, but loading the CAMELYON17 data set is possible as well.

### EDIT THIS CELL:
### Assign the path to your CAMELYON16 data and create the directories
### if they do not exist yet.
CAM_BASE_DIR = '/path/to/CAMELYON/data/'
### Do not edit this cell
CAM16_DIR = CAM_BASE_DIR + 'CAMELYON16/'
# for minimal data set:
mgr = SlideManager(cam16_dir=CAM16_DIR)

# for whole train set (CAMELYON16 and 17):
# mgr = SlideManager(cam16_dir='some/path/CAMELYON16', cam17_dir='some/path/CAMELYON17')

print(mgr)

SlideManager objects offer multiple ways to access slides:

# annotated slides (slides with tumor annotation)
slides_anno = mgr.annotated_slides
print('Number of annotated slides: ', len(slides_anno))

# negative slides
slides_negative = mgr.negative_slides
print('Number of negative slides: ', len(slides_negative))

# get a random slide
slide = random.choice(mgr.slides)
print('\nA random slide:')
print(slide)

# list two slide names
print('\nTwo random slide names:')
for name in random.choices(mgr.slide_names, k=2):
    print(' ', name)

# choose a specific slide by name
print('\nSlide by name:')
print(mgr.get_slide('Normal_101'))

Slides

The Slide class describes a single WSI. It is based on the OpenSlide class but also provides information about stage (only for CAMELYON17 slides) and possible tumor annotations:

slide = random.choice(mgr.negative_slides)
print('Name: {}\n'
      'Size on layer 0: {:,} × {:,} pixel\n'
      'Layers: {}\n'
      'State: {}\n'
      'Annotations: {}'.format(slide.name,
                               *slide.level_dimensions[0], # openslide method
                               slide.level_count,          # openslide method
                               slide.stage,
                               len(slide.annotations)))

slide.get_thumbnail((300, 300))  # openslide method

slide = random.choice(mgr.annotated_slides)
print('Name: {}\n'
      'Size on layer 0: {:,} × {:,} pixel\n'
      'Layers: {}\n'
      'State: {}\n'
      'Annotations: {}'.format(slide.name,
                               *slide.level_dimensions[0],  # openslide method
                               slide.level_count,           # openslide method
                               slide.stage,
                               len(slide.annotations)))

slide.get_thumbnail((300, 300))  # openslide method

The Slide method get_full_slide() allows direct access to the slide as an image on a specific layer. For all further operations we define one layer:

level = 5
print('Slide dimensions on layer %i: ' % level, slide.level_dimensions[level])
img = slide.get_full_slide(level=level) # we cannot show it on layer 0, that would require ~100+ GByte of RAM
print('Image size:', img.size)
plt.imshow(img)

Annotations

Tumor annotations are represented by the Annotation class. Every Annotation object has a name and a polygon encompassing the tumor region. Annotated slides may have multiple Annotation objects.

The Annotation object can create an image of a tumor region with the polygon drawn over. Here the first 3 annotations of the first annotated slide:

slide = mgr.annotated_slides[0]
print('Slide name: {}\n'.format(slide.name))
for annotation in slide.annotations[:3]:
    bounds = annotation.get_boundaries(0)
    print('Annotation name: "{}"\n'
          'Position on layer 0: {}\n'
          'Size on layer 0: {:,} × {:,} pixel\n'.format(annotation.name,
                                                        bounds[0],
                                                        *bounds[1]))
    plt.figure()
    plt.imshow(annotation.get_image())
plt.show()

Tiling

In general, the file size and dimensions of a WSI are much larger than what a convolutional neural network can handle. To be able to predict the cancer status of tissue samples, we need to cut each slide into small images, so-called tiles, which we can feed into a network as training data.

Negative slides

Negative tissue slides are slides of healthy tissue without any tumorous cells. From these negative slides we want to extract as many tiles with tissue as possible and ignore those with only background.

To separate the background from the actual tissue data we can use an image-processing algorithm based on Otsu's method, which is provided in the scikit-image package. This method calculates a threshold that divides a gray scale image into foreground and background.
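
The principle behind Otsu's method can be sketched in a few lines of plain Python: try every candidate threshold and keep the one that maximizes the variance between the background and foreground classes. This is only an illustrative toy version; in the notebook we rely on the vectorized skimage.filters.threshold_otsu.

```python
def otsu_threshold(pixels, bins=256):
    """Exhaustively search the threshold that maximizes between-class variance."""
    hist = [0] * bins
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_b = 0.0                        # weighted sum of the background class
    w_b = 0                            # background pixel count
    best_t, best_var = 0, -1.0
    for t in range(bins):
        w_b += hist[t]                 # pixels <= t belong to the background
        if w_b == 0:
            continue
        w_f = total - w_b              # remaining pixels form the foreground
        if w_f == 0:
            break
        sum_b += t * hist[t]
        mean_b = sum_b / w_b
        mean_f = (sum_all - sum_b) / w_f
        var_between = w_b * w_f * (mean_b - mean_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

pixels = [10] * 50 + [200] * 50        # a perfectly bimodal toy image
print(otsu_threshold(pixels))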

As an example we can get a negative slide from the SlideManager:

negative_slide = mgr.negative_slides[0]
print('Slide name: "{}"\n'
      'Tumorous: {}'.format(negative_slide.name,
                            negative_slide.has_tumor))
negative_slide.get_thumbnail((300, 300))

Now we can load the image into a NumPy array, convert it to gray scale and calculate the Otsu threshold. Note that we use another zoom layer (layer 6) here.

# load the slide into numpy array
arr = np.asarray(negative_slide.get_full_slide(level=6))
print('array shape:', arr.shape)

# convert it to gray scale
arr_gray = rgb2gray(arr)
print('gray array shape:', arr_gray.shape)

# calculate otsu threshold
threshold = threshold_otsu(arr_gray)
threshold

The threshold splits the image into background and foreground: pixels with a value below the threshold are considered background and pixels with a higher value are considered foreground. To visualize this we can create a binary mask:

# create an otsu mask
mask = arr_gray > threshold
plt.imshow(mask)

If we compare the mask with the slide thumbnail above we can see the algorithm successfully separated the pixels with tissue data from the background. The binary mask also allows us to count how many pixels of actual tissue data we have by summing it up:

np.sum(mask)

To double check we can slice the mask into three vertical strips and sum them up:

print('Points of interest between 0 and 1700:    {:9,}\n'
      'Points of interest between 1700 and 3400: {:9,}\n'
      'Points of interest between 3400 and 5120: {:9,}'.format(np.sum(mask[..., 0:1700]),
                                                               np.sum(mask[..., 1700:3400]),
                                                               np.sum(mask[..., 3400:])))

We can see most of the tissue is in the first third and barely any in the last. This matches exactly what we see in the images above.

This method of separating background from foreground by creating one binary mask works very well for smaller images, and so far we have only been working on layer 6 of the WSI pyramid. But for lower layers with much higher dimensions this step can be very memory intensive. To avoid this problem we can calculate the Otsu threshold once and then create a mask based on that threshold for every tile of the slide individually.

The function split_negative_slide of the preprocessing module splits a slide into tiles of the specified dimensions. For each tile a binary mask is created based on the Otsu threshold. If the fraction of foreground pixels in this mask is higher than the provided poi_threshold, the tile and its boundaries are returned.

tile_iter = split_negative_slide(
    negative_slide, level=3,
    otsu_threshold=threshold,  # otsu threshold calculated earlier
    tile_size=256,
    overlap=0,                 # no overlap
    poi_threshold=0.9)         # only select tiles with at least 90% tissue

# plot the first 3 tiles
plt.figure(figsize=(20, 6))
for i, (tile, bounds) in enumerate(tile_iter):
    print(bounds)          # tile boundaries on layer 0
    plt.subplot(1, 3, i+1)
    plt.imshow(tile)       # plot the tile

    if i >= 2:
        break
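
Internally, such a splitting function can be thought of as a generator that walks over the slide in tile-sized steps, masks each tile against the precomputed Otsu threshold, and only yields tiles with enough tissue. The hypothetical split_tiles below is a simplified in-memory sketch over a plain 2D list of gray values; the real function reads each region lazily from disk:

```python
def split_tiles(gray, tile_size, otsu_threshold, poi_threshold):
    """Yield (tile, (x, y)) for every tile whose tissue fraction reaches poi_threshold."""
    height, width = len(gray), len(gray[0])
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            # cut out one tile; only this tile needs to be in memory
            tile = [row[x:x + tile_size] for row in gray[y:y + tile_size]]
            # pixels above the Otsu threshold count as tissue (foreground)
            poi = sum(v > otsu_threshold for row in tile for v in row)
            if poi / tile_size ** 2 >= poi_threshold:
                yield tile, (x, y)

# toy 4x4 "slide": tissue (200) in two corners, background (0) elsewhere
gray = [[0, 0, 200, 200],
        [0, 0, 200, 200],
        [200, 200, 0, 0],
        [200, 200, 0, 0]]
for tile, bounds in split_tiles(gray, tile_size=2, otsu_threshold=100, poi_threshold=0.9):
    print('tile at', bounds)
```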

The TileMap class can be used to display the tiles and see which areas of the slide are selected. Its image attribute is an image of the WSI with all tiles marked:

# initialize the map with the slide itself
tm = TileMap(negative_slide)

# create a new and unconsumed tile iterator
tile_iter = split_negative_slide(negative_slide, level=0,
                                 otsu_threshold=threshold,
                                 tile_size=512,
                                 overlap=0,
                                 poi_threshold=0.9)

for _, bounds in tile_iter:
    # add the boundaries of every tile to the map
    tm.add_tile(bounds)
tm.image

As you can see, the green rectangles, each representing a tile, only cover the area with tissue. With the poi_threshold parameter we set the minimum fraction of tissue each tile must contain (here 90 %), and with the overlap parameter we can let the tiles overlap each other.

Positive slides

From positive slides we only want tiles with actual tumorous tissue to train our network; we can ignore the background and the remaining tissue. We can use the same process as for negative slides, but instead of a binary mask based on the Otsu threshold, we create a tumor mask based on the tumor annotations of the CAMELYON data set.

positive_slide = mgr.get_slide('tumor_110')
print('Slide name: "{}"'.format(positive_slide.name))
positive_slide.get_thumbnail((300, 300))
mask = create_tumor_mask(positive_slide, level=level)
plt.imshow(mask)

The function create_tumor_mask takes the polygons from all annotations and creates a binary mask where 0 represents healthy and 1 tumorous tissue.
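
Conceptually, turning annotation polygons into such a mask is a rasterization problem: test for every pixel whether its center lies inside one of the polygons. Below is a minimal even-odd ray-casting sketch; the helpers are hypothetical illustrations, whereas the real create_tumor_mask additionally handles layer scaling and multiple annotations per slide:

```python
def point_in_polygon(x, y, polygon):
    """Even-odd rule: count how often a ray to the right crosses polygon edges."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                       # edge crosses the scanline
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:                            # crossing lies to the right
                inside = not inside
    return inside

def rasterize(polygon, width, height):
    """Binary mask: 1 where the pixel center lies inside the polygon, else 0."""
    return [[1 if point_in_polygon(x + 0.5, y + 0.5, polygon) else 0
             for x in range(width)]
            for y in range(height)]

# a small square "tumor annotation" on a 6x6 toy slide
mask = rasterize([(1, 1), (4, 1), (4, 4), (1, 4)], 6, 6)
for row in mask:
    print(row)
```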

Note: Micro and macro metastases in the CAMELYON data set are annotated exhaustively. However, there might be isolated tumor cells (ITCs) that are not annotated at all. For this reason we only want to use the annotated areas from positive slides as positive tiles and we don't want to use the remaining tissue as negative tiles.

As with the Otsu mask for negative slides, creating a tumor mask of a whole slide on a low layer might lead to memory issues. To prevent that, we will only generate a partial tumor mask for each tile we process.

The split_positive_slide function provides this. As an example we can again split a slide into 128×128 tiles on layer 5 and use the TileMap to display the result:

tile_size = 128
tile_iter = split_positive_slide(
    positive_slide, level=level,
    tile_size=tile_size,  # tile size of 128×128 pixel
    overlap=0)            # no overlap

tm = TileMap(positive_slide)
for _, bounds in tile_iter:
    # add the boundaries of every tile to the map
    tm.add_tile(bounds)

print('Created {} tiles.'.format(len(tm.tiles)))
tm.image

Storing

Finally, we need to be able to store the preprocessed data in a way that we can easily read it back and feed it into the CNN.

The HDF5 format (Hierarchical Data Format version 5) is a data format designed to store and organize large amounts of data on the hard drive. For Python, the package h5py offers a rich API around HDF5 that allows us to store and load sets of data very similar to the way we handle NumPy ndarrays.

As an example, let us store one example slide into HDF5:

First, we need to create a new tile iterator. This time we want all the tiles with at least 5% of tissue and a small overlap of 5 pixels. To be able to see which tiles are written we will make a new TileMap as well.

# create a new and unconsumed tile iterator
tile_iter = split_negative_slide(negative_slide, level=5,
                                 otsu_threshold=threshold,
                                 tile_size=256, overlap=5,
                                 poi_threshold=0.05)

# initialize map
tm = TileMap(negative_slide)
filename = '{}/just_a_testfile_can_be_deleted.hdf5'.format(CAM16_DIR)
label = 0                                              #  1

with h5py.File(filename, "w", libver='latest') as h5:  #  2
    # create a data set in the file
    dset = h5.create_dataset('data',                   #  3
        (0, 256, 256, 3),                              #  4
        dtype=np.uint8,                                #  5
        maxshape=(None, 256, 256, 3),                  #  6
        compression="lzf")                             #  7
    
    lset = h5.create_dataset('label', (0,),            #  8
                             maxshape=(None,),
                             dtype=np.uint8)
    cur = 0
    for tile, bounds in tile_iter:                     #  9
        dset.resize(cur + 1, axis=0)                   # 10 
        dset[cur:cur + 1] = tile                       # 11 

        lset.resize(cur + 1, axis=0)                   # 12
        lset[cur:cur + 1] = label

        cur += 1
        tm.add_tile(bounds)                            # 13
        print('.', end='')

print('\n{} tiles written'.format(cur))
tm.image

Issue a shell command to see how many bytes have been written:

!ls -l {filename} | cut -d' ' -f5,9

Now we can see what has been written to the file:

with h5py.File(filename, 'r') as h5:
    for key in h5.keys():
        print('Datasets "{}", shape: {}'.format(key, h5[key].shape))

The numbers match. Now we can read out five random tiles and their labels:

classes = ('negative', 'positive')

plt.figure(figsize=(20, 6))
with h5py.File(filename, 'r') as h5:
    rand_indexes = random.sample(range(len(h5['data'])), k=5)
    for i, idx in enumerate(rand_indexes):
        plt.subplot(1, 5, i+1)
        plt.imshow(h5['data'][idx])
        plt.title(classes[h5['label'][idx]])

We have successfully saved and loaded the tiles.

Summary and Outlook

We explored the CAMELYON data set and utilized different tools to extract interesting data from Whole-Slide Images.

The next step is to process all CAMELYON16 slides and save them into HDF5 format as a single file.

It is also possible to add further preprocessing steps, such as color normalization to counter the different hues produced by different scanners, or data augmentation to increase the volume of data to feed into the CNN even further.

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

XXX
by Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.