Preprocessing the Dataset

Introduction

In this notebook you will create your own dataset based on the CAMELYON16 data set alone (~700 GiB), so you do not have to download the full 3.5 TiB of both data sets (CAMELYON16 and CAMELYON17). Subsequent notebooks will use the data set you create now. Once you have finished this series of notebooks, you can extend your implementation to the complete data, including the CAMELYON17 data set.

The purpose of the preprocessing is the following:

If we had enough RAM to hold the whole data set, we would just load it once at the beginning of the training. But this is not the case. Reading the different WSI files in their compressed TIFF format every single time we assemble a new training batch is very time consuming. Storing tiles with a fixed zoom level and a fixed size, already cropped and labeled, in one single file will therefore save us a lot of time.

Requirements

Python-Modules

# Python Standard Library
import os
import random
from datetime import datetime

# External Modules
import numpy as np
import h5py
from skimage.filters import threshold_otsu
from matplotlib import pyplot as plt

# Furcifar Modules
from preprocessing.datamodel import SlideManager
from preprocessing.processing import split_negative_slide, split_positive_slide, create_tumor_mask, rgb2gray
from preprocessing.util import TileMap

%matplotlib inline

Data

The data used in this notebook are from the CAMELYON data sets, which are freely available on the CAMELYON data page.

The whole data sets have the following sizes:

  • CAMELYON16 (~715 GiB)
  • CAMELYON17 (~2.8 TiB)

For this notebook to work, the following file structure (for CAMELYON16) must exist inside the data folder:

data
├── CAMELYON16
│   ├── training
│   │   ├── lesion_annotations
│   │   │   └── tumor_001.xml - tumor_110.xml
│   │   ├── normal
│   │   │   └── normal_001.tif - normal_160.tif
│   │   └── tumor
│   │       └── tumor_001.tif - tumor_110.tif
│   └── test
│       ├── lesion_annotations
│       │   └── test_001.xml - test_130.xml
│       └── images
│           └── test_001.tif - test_130.tif
│
└── CAMELYON17
    └── training
        ├── center_0
        │   └── patient_000_node_0.tif - patient_019_node_4.tif
        ├── center_1
        │   └── patient_020_node_0.tif - patient_039_node_4.tif
        ├── center_2
        │   └── patient_040_node_0.tif - patient_059_node_4.tif
        ├── center_3
        │   └── patient_060_node_0.tif - patient_079_node_4.tif
        ├── center_4
        │   └── patient_080_node_0.tif - patient_099_node_4.tif
        ├── lesion_annotations
        │   └── patient_004_node_4.xml - patient_099_node_4.xml
        └── stage_labels.csv

Task:

If you have not done so already, download the remaining data of the CAMELYON16 data set and store it in the folder structure shown above.

Dataset Generation

In this notebook, we will use parts of the data-handling-usage-guide.ipynb to create our own dataset. You have two options:

Option A

  • Process all files from the CAMELYON16 data set
  • Slide zoom level 2 (0-9, 0 being the highest zoom)
  • Tile_size of 256x256
  • No overlap for negative tiles
  • 128 pixel overlap for tumorous (positive) tiles, since they are scarce
  • Minimum of 20% tissue in tiles for normal slides
  • Minimum of 20% tumorous tissue for positive slides
  • We will get meaningful results in Part 1 of the tutorial (CAMELYON16)
  • Processing will take approximately 20 hours [*]
  • Training of the CNN in the next notebook will take approximately 10 hours [*]

Option B

  • Process only part of the CAMELYON16 data set
  • Slide zoom level 3 (0-9, 0 being the highest zoom)
  • Tile_size of 256x256
  • No overlap for negative tiles
  • No overlap for positive tiles
  • Minimum of 20% tissue in tiles for normal slides
  • Minimum of 20% tumorous tissue for positive slides
  • We will probably not get meaningful results in Part 1 of the tutorial (CAMELYON16)
  • Processing will take approximately 5 hours [*]
  • Training of the CNN in the next notebook will take approximately 2 hours [*]

Remark:

  • [*] [Tested on Xeon 1231v3 @ 3.8 GHz, 16 GB DDR3 @ 1666 MHz]

Most importantly, we will save all tiles from all WSIs into a single HDF5 file. This is crucial because, when accessing the data later for training, most time is spent opening files. Additionally, training works better when a single batch (e.g. 100 tiles) is as heterogeneous as the original data. So when we want to read 100 random tiles, we ideally want to read 100 tiles from 100 different slides, without having to open 100 different files to do so.
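To make this concrete, here is a minimal sketch (not part of the preprocessing itself) of how such a heterogeneous batch could later be drawn from a single HDF5 file; the function name and the sampling strategy are illustrative assumptions:

### Illustrative sketch: draw one random tile from each of `batch_size`
### randomly chosen slides stored in one HDF5 file (assumes the file
### generated below contains at least `batch_size` slides)
def sample_heterogeneous_batch(h5_path, batch_size=100):
    with h5py.File(h5_path, 'r') as f:
        slide_names = random.sample(list(f.keys()), batch_size)
        batch = [f[name][random.randrange(f[name].shape[0])]
                 for name in slide_names]
    return np.stack(batch)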

Background Information:

Depending on the staining process and the slide scanner, the slides can differ quite a lot in color. Therefore a batch containing 100 tiles from only one slide will most likely prevent the CNN from generalizing well. A quick way to see this variation is to plot thumbnails of a few slides side by side, as sketched after the next cell.

### EDIT THIS CELL:
### Assign the path to your CAMELYON16 data and create the directories
### if they do not exist yet.
CAM_BASE_DIR = '/path/to/CAMELYON/data/'
### Do not edit this cell
CAM16_DIR = CAM_BASE_DIR + 'CAMELYON16/'
GENERATED_DATA = CAM_BASE_DIR + 'tutorial/'
os.makedirs(GENERATED_DATA, exist_ok=True)  # create the output folder if missing

mgr = SlideManager(cam16_dir=CAM16_DIR)
n_slides = len(mgr.slides)
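As noted in the background information above, staining colors vary considerably between slides. The following optional cell is a small sketch to visualize this; zoom level 6 is an assumption here, pick any coarse level available in your slides:

### Optional sketch: visualize stain variation by plotting thumbnails
### of a few slides side by side (zoom level 6 is an assumption)
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for j, ax in enumerate(axes):
    slide = mgr.slides[j]
    ax.imshow(np.asarray(slide.get_full_slide(level=6)))
    ax.set_title(slide.name)
    ax.axis('off')
plt.show()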
### Execute this cell for option A
level = 2
tile_size = 256

poi_percent = 20
poi = poi_percent / 100.   # minimum fraction of tissue per negative tile
poi_tumor = poi            # minimum fraction of tumorous tissue per positive tile

overlap = 0                      # no overlap for negative tiles
overlap_tumor = tile_size // 2   # 128 pixel overlap for positive tiles
### Execute this cell for option B
level = 3
tile_size = 256

poi_percent = 20
poi = poi_percent / 100.   # minimum fraction of tissue per negative tile
poi_tumor = poi            # minimum fraction of tumorous tissue per positive tile

overlap = 0                # no overlap for negative tiles
overlap_tumor = 0          # no overlap for positive tiles either
filename = '{}{}x{}_poi{}_l{}.hdf5'.format(GENERATED_DATA, tile_size, tile_size,
                                           poi_percent, level)

h5 = h5py.File(filename, "w", libver='latest')
tiles_neg = 0

for i in range(len(mgr.negative_slides)):
    # load a downsampled version of the whole slide into a numpy array
    # (a coarse level is sufficient to estimate the tissue threshold)
    arr = np.asarray(mgr.negative_slides[i].get_full_slide(level=3))
    # convert it to gray scale
    arr_gray = rgb2gray(arr)

    # calculate the Otsu threshold separating tissue from background
    threshold = threshold_otsu(arr_gray)

    # create a new and unconsumed tile iterator;
    # because we have so many negative slides we do not use overlap
    tile_iter = split_negative_slide(mgr.negative_slides[i], level=level,
                                     otsu_threshold=threshold,
                                     tile_size=tile_size, overlap=overlap,
                                     poi_threshold=poi)

    # create a dataset in the HDF5 file, one per slide
    dset = h5.create_dataset(mgr.negative_slides[i].name,
                             (0, tile_size, tile_size, 3),
                             dtype=np.uint8,
                             maxshape=(None, tile_size, tile_size, 3),
                             compression=0)

    cur = 0
    for tile, bounds in tile_iter:
        # the cap of ~100 tiles per slide keeps the runtime manageable;
        # remove this break to process whole slides
        if cur > 100: break
        dset.resize(cur + 1, axis=0)
        dset[cur:cur + 1] = tile
        tiles_neg += 1
        cur += 1

    print(datetime.now(), i, '/', len(mgr.negative_slides), '  tiles  ', cur)
    print('neg tiles total: ', tiles_neg)
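Before processing the positive slides, it can help to sanity-check the Otsu-based tissue detection. The following optional cell is a sketch that shows one negative slide next to its tissue mask at a coarse zoom level (level 5 is an assumption; any coarse level works):

### Optional sanity check (sketch): show one negative slide next to its
### Otsu tissue mask; stained tissue is darker than the white background
slide = mgr.negative_slides[0]
arr = np.asarray(slide.get_full_slide(level=5))   # coarse level, assumed to exist
arr_gray = rgb2gray(arr)
threshold = threshold_otsu(arr_gray)

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].imshow(arr)
axes[0].set_title(slide.name)
axes[1].imshow(arr_gray < threshold, cmap='gray')
axes[1].set_title('tissue mask (gray value < Otsu threshold)')
plt.show()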
tiles_pos = 0

for i in range(len(mgr.annotated_slides)):
    # create a new and unconsumed tile iterator;
    # because we have so few positive slides, we use the overlap
    # configured above (half the tile size in option A)
    tile_iter = split_positive_slide(mgr.annotated_slides[i], level=level,
                                     tile_size=tile_size, overlap=overlap_tumor,
                                     poi_threshold=poi_tumor)

    # create a dataset in the HDF5 file, one per slide
    dset = h5.create_dataset(mgr.annotated_slides[i].name,
                             (0, tile_size, tile_size, 3),
                             dtype=np.uint8,
                             maxshape=(None, tile_size, tile_size, 3),
                             compression=0)

    cur = 0
    for tile, bounds in tile_iter:
        # same cap as above; remove this break to process whole slides
        if cur > 100: break
        dset.resize(cur + 1, axis=0)
        dset[cur:cur + 1] = tile
        cur += 1
        tiles_pos += 1
        
    print(datetime.now(), i, '/', len(mgr.annotated_slides), '  tiles  ', cur)
    print('pos tiles total: ', tiles_pos)
# remove datasets of tumor slides that did not yield a single tile
# ('umor' matches both 'tumor' and 'Tumor' in the slide names); this can
# happen when no tile of a slide reaches the poi_tumor threshold
for k in list(h5.keys()):
    if 'umor' in k:
        if h5[k].shape[0] == 0:
            print(k)
            print(h5[k].shape)
            del h5[k]
h5.close()
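To verify the result, the following sketch reopens the generated file read-only, lists a few of the stored datasets and displays one stored tile:

### Sketch: verify the generated HDF5 file by listing some datasets
### and displaying the first stored tile of the first slide
with h5py.File(filename, 'r') as h5_check:
    keys = list(h5_check.keys())
    print('slides stored:', len(keys))
    for k in keys[:5]:
        print(k, h5_check[k].shape)
    tile = h5_check[keys[0]][0]

plt.imshow(tile)
plt.title(keys[0])
plt.show()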

Summary and Outlook

The next step is to train a neural network on the preprocessed data so that it can classify unseen tiles as normal or tumorous.

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. However, it does not apply to any referenced external media (e.g., images).

XXX
by Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.