Do Tissue Source Sites leave identifiable Signatures in Whole Slide Images beyond staining?

Piotr Keller, Muhammad Dawood and Fayyaz ul Amir Afsar Minhas

Tissue Image Analytics Center, University of Warwick, United Kingdom

This repository contains the code for the following manuscript:

Do Tissue Source Sites leave identifiable Signatures in Whole Slide Images beyond staining?, accepted by ICLR 2023 Workshop TML4H.

Introduction

Why can deep learning predictors trained on Whole Slide Images fail to generalize? It is a common theme in Computational Pathology (CPath) to see a high performing model developed in a research setting experience a large drop in performance when it is eventually deployed to a new clinical environment. One of the major reasons for this is the batch effect that is introduced during the creation of whole slide images resulting in a domain shift. CPath pipelines try to reduce this effect via stain normalization techniques. However, in this paper, we provide empirical evidence that stain normalization methods do not result in any significant reduction of the batch effect. This is done via clustering analysis of the dataset as well as training weakly-supervised models to predict source sites. This study aims to open up avenues for further research for effective handling of batch effects for improving trustworthiness and generalization of predictive modelling in the Computational Pathology domain.

Dependencies

scipy 1.7.3

numpy 1.21.6

matplotlib 3.2.2

geomloss 0.2.4

pandas 1.3.5

torch 1.12.1+cu113

sksurv 0.17.2

lifelines 0.27.4

sklearn 1.0.2

seaborn 0.11.2

tqdm 4.64.1

Usage

Step 1. Data download

Download the FFPE whole slide images from GDC portal (https://portal.gdc.cancer.gov/) for breast carcinoma (TCGA-BRCA).

Download corresponding gene point mutation and Disease Specific Survival from cBioPortal (https://www.cbioportal.org/).

Step 2. Data processing

Slide selection: select high quality WSIs from the original dataset
Tile extraction: extract 512x512 tiles from the large WSI at a spatial resolution of 0.50 microns-per-pixel
Patches capturing less that 40% of informative tissue are discarded
Stain normalization
Feature extraction: extract a feature vector for each tile using SuffleNet pretrained on ImageNet

Details can be found in the paper.

Step 3. Training and evaluation of CLAM model with 1024-dimensional feature representations of 1052 TCGA-BRCA slides

Used the code from Clustering-constrained attention multiple instance learning (CLAM) Github to train a neural network for predicting source site labels.

Details about the model can be found in the orginal paper.

Step 4. MMD Kernel generation for 1024-dimensional feature representations of 1052 TCGA-BRCA slides

Using the code from Maximum Mean Discrepancy Kernels For Predictive And Prognostic Modeling Of Whole Slide Images Github to generate an $N \times N$ distance matrix using MMD where $N$ is the number of WSIs in a dataset.

Details can be found in the original paper.

Step 5. Source site prediction using MMD kernels

Using the code from source_site_prediction.py to generate a Support Vector Machine (SVM) with a predefined kernel (generated from Step 4.) to predict source site origin of a given Whole Slide Image.

Details can be found in the paper.

Note

Some intermediate data are put into the folder data.

* Joint first authorship

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
README.md		README.md
Staining.png		Staining.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Do Tissue Source Sites leave identifiable Signatures in Whole Slide Images beyond staining?

Piotr Keller, Muhammad Dawood and Fayyaz ul Amir Afsar Minhas

Tissue Image Analytics Center, University of Warwick, United Kingdom

Introduction

Dependencies

Usage

Step 1. Data download

Step 2. Data processing

Step 3. Training and evaluation of CLAM model with 1024-dimensional feature representations of 1052 TCGA-BRCA slides

Step 4. MMD Kernel generation for 1024-dimensional feature representations of 1052 TCGA-BRCA slides

Step 5. Source site prediction using MMD kernels

Note

About

Releases

Packages

pkeller00/Src-Site-Pred

Folders and files

Latest commit

History

Repository files navigation

Do Tissue Source Sites leave identifiable Signatures in Whole Slide Images beyond staining?

Piotr Keller*, Muhammad Dawood* and Fayyaz ul Amir Afsar Minhas

Tissue Image Analytics Center, University of Warwick, United Kingdom

Introduction

Dependencies

Usage

Step 1. Data download

Step 2. Data processing

Step 3. Training and evaluation of CLAM model with 1024-dimensional feature representations of 1052 TCGA-BRCA slides

Step 4. MMD Kernel generation for 1024-dimensional feature representations of 1052 TCGA-BRCA slides

Step 5. Source site prediction using MMD kernels

Note

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Piotr Keller, Muhammad Dawood and Fayyaz ul Amir Afsar Minhas

Packages