(This repository is still under construction and the README.md file will be updated)
In this work, we test the generalizability of a convolutional neural network (a UNet with residual units) trained on PET/CT images of one cancer type to other cancer types. We used three oncological PET/CT datasets (provided by the autoPET 2022 challenge) of different cancer types, collected from two institutions: lymphoma (n=145), lung cancer (n=168), and melanoma (n=188). The dataset also contained PET/CT images from healthy control patients (n=513), but those were not used in this work. The dataset is publicly available and can be downloaded from the TCIA website here.
The original CT images and annotations were resampled to the resolution of the original PET images, and CT intensities (in Hounsfield units) were clipped to the range (-1024, 1024). Both PET (in SUV) and CT intensities were then normalized to the range (0, 1). All images were resampled to a voxel spacing of 2.0 mm × 2.0 mm × 2.0 mm. During training, randomly cropped patches of size 192 × 192 × 192 were extracted, centered on a foreground or a background voxel with probabilities 5/6 and 1/6, respectively. Spatial augmentations such as random affine transforms and 3D elastic deformations were applied to the cropped patches. The input to the network was created by concatenating the PET and CT patches along the channel dimension. The annotation masks contained two labels: 0 for the background and 1 for the lesion class.
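These preprocessing and augmentation steps can be expressed as a MONAI transform chain. The following is a minimal sketch, not our exact training pipeline: the dictionary keys (`PT`, `CT`, `GT`), the min-max normalization of SUV, and the affine/elastic parameters are assumptions; the CT clip range, voxel spacing, patch size, and 5:1 foreground/background sampling ratio come from the text above.

```python
from monai.transforms import (
    Compose, LoadImaged, EnsureChannelFirstd, ScaleIntensityd,
    ScaleIntensityRanged, Spacingd, RandCropByPosNegLabeld,
    RandAffined, Rand3DElasticd, ConcatItemsd,
)

keys = ["PT", "CT", "GT"]  # hypothetical keys: PET, CT, ground-truth mask
train_transforms = Compose([
    LoadImaged(keys=keys),
    EnsureChannelFirstd(keys=keys),
    # Clip CT (Hounsfield units) to (-1024, 1024) and rescale to (0, 1)
    ScaleIntensityRanged(keys=["CT"], a_min=-1024, a_max=1024,
                         b_min=0.0, b_max=1.0, clip=True),
    # Min-max normalize PET (SUV) to (0, 1) -- assumed normalization scheme
    ScaleIntensityd(keys=["PT"], minv=0.0, maxv=1.0),
    # Resample everything to 2.0 mm x 2.0 mm x 2.0 mm voxel spacing
    Spacingd(keys=keys, pixdim=(2.0, 2.0, 2.0),
             mode=("bilinear", "bilinear", "nearest")),
    # Random 192^3 patches centered on a foreground vs. background voxel
    # with probabilities 5/6 and 1/6 (pos:neg = 5:1)
    RandCropByPosNegLabeld(keys=keys, label_key="GT",
                           spatial_size=(192, 192, 192),
                           pos=5, neg=1, num_samples=1),
    # Spatial augmentations; probabilities and ranges are assumed values
    RandAffined(keys=keys, prob=0.5, rotate_range=(0.26, 0.26, 0.26),
                scale_range=(0.1, 0.1, 0.1),
                mode=("bilinear", "bilinear", "nearest")),
    Rand3DElasticd(keys=keys, prob=0.2, sigma_range=(5, 8),
                   magnitude_range=(100, 200),
                   mode=("bilinear", "bilinear", "nearest")),
    # Stack PET and CT along the channel dimension as the network input
    ConcatItemsd(keys=["PT", "CT"], name="PET_CT", dim=0),
])
```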
All our networks were trained with the nn.DataParallel(.) wrapper on Standard_NC24s_v3 Azure Virtual Machines from Microsoft, each consisting of 4 NVIDIA GPUs (16 GiB RAM each), 24 vCPUs, and 448 GiB of overall RAM.
A UNet with residual units, adapted from MONAI [1], was used in this work. This network architecture is shown in Figure 1 below, and it can be created using the monai.networks.nets.UNet class of MONAI as follows:
```python
import torch
from monai.networks.layers import Norm
from monai.networks.nets import UNet

device = torch.device("cuda:0")
model = UNet(
    spatial_dims=3,          # 3D inputs
    in_channels=2,           # PET and CT stacked along the channel dimension
    out_channels=2,          # background and lesion classes
    channels=(16, 32, 64, 128, 256, 512),
    strides=(2, 2, 2, 2, 2),
    num_res_units=2,         # residual units per block
    norm=Norm.BATCH,
).to(device)
```
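Since training used the nn.DataParallel(.) wrapper mentioned above, the model can then be wrapped so that each batch is split across the four GPUs; a minimal sketch:

```python
# Split each training batch across all visible GPUs; outputs are
# gathered back on cuda:0 (the device the model was moved to above).
model = torch.nn.DataParallel(model)
```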
The networks were trained using the Dice loss, monai.losses.DiceLoss(.), given by the following equation:

$$\mathcal{L}_{\text{Dice}} = 1 - \frac{1}{N} \sum_{n=1}^{N} \frac{2 \sum_{v=1}^{V} p_{nv}\, g_{nv} + \epsilon}{\sum_{v=1}^{V} p_{nv} + \sum_{v=1}^{V} g_{nv} + \epsilon},$$

where $p_{nv}$ and $g_{nv}$ denote the predicted lesion probability and the ground-truth label for voxel $v$ of sample $n$, $\epsilon$ is a small smoothing term, and $N$ and $V$ are the batch size and the number of voxels in a patch, respectively.
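As a minimal sketch of this loss in code (the `to_onehot_y`/`softmax` settings are assumptions consistent with the two-channel output and the {0, 1} labels):

```python
import torch
from monai.losses import DiceLoss

# One-hot encode the single-channel target and apply softmax to the
# two-channel logits before computing the Dice loss.
loss_fn = DiceLoss(to_onehot_y=True, softmax=True)

logits = torch.randn(1, 2, 192, 192, 192)            # network output (N, C, D, H, W)
target = torch.randint(0, 2, (1, 1, 192, 192, 192))  # mask with labels {0, 1}
loss = loss_fn(logits, target.float())
```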
The Dice similarity coefficient (DSC) metric, adapted from monai.metrics.DiceMetric(.), was used to evaluate the overlap between the ground-truth and predicted masks for the lesion class. Inference on the test-set images was performed using a sliding-window method with a window size of 192 × 192 × 192.
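A minimal sketch of this test-time evaluation, reusing the `model` and `device` defined above; the window overlap, sliding-window batch size, and post-processing transforms are assumptions:

```python
import torch
from monai.inferers import sliding_window_inference
from monai.metrics import DiceMetric
from monai.transforms import AsDiscrete

dice_metric = DiceMetric(include_background=False, reduction="mean")
post_pred = AsDiscrete(argmax=True, to_onehot=2)  # logits -> one-hot prediction
post_label = AsDiscrete(to_onehot=2)              # labels -> one-hot ground truth

pet_ct = torch.randn(1, 2, 256, 256, 256).to(device)  # placeholder test volume
label = torch.randint(0, 2, (1, 1, 256, 256, 256))    # placeholder mask

model.eval()
with torch.no_grad():
    # Tile the full volume with 192^3 windows and stitch the predictions
    logits = sliding_window_inference(
        inputs=pet_ct, roi_size=(192, 192, 192),
        sw_batch_size=4, predictor=model,
    )

pred = post_pred(logits[0].cpu()).unsqueeze(0)
gt = post_label(label[0]).unsqueeze(0)
dice_metric(y_pred=pred, y=gt)  # accumulates the per-case lesion DSC
print(dice_metric.aggregate().item())
```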
The PET/CT data for each of the three cancer types were randomly split into training (80%) and test (20%) sets. For each cancer type, the networks were trained to segment that single cancer type under 5-fold cross-validation (CV). We evaluated each model on the internal test set of the same cancer type as the training set and then assessed the transferability of the model's lesion-segmentation ability to a different cancer type. We further explored different ensembling techniques - Average (Avg), Weighted Average (WtAvg) (with weights equal to the mean DSC on the corresponding validation fold), Majority Voting (Vote), and STAPLE [2] - to combine the five models trained in 5-fold CV, as a possible route towards improving model generalizability to new cancer types. The details of the 5-fold training/validation and test splits for the three cancer types can be found in the three .csv files containing the metadata here.
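A minimal sketch of the four ensembling strategies, assuming `probs` holds the co-registered lesion-probability maps from the five CV models and `val_dsc` their mean validation-fold DSCs (both names hypothetical); the STAPLE step uses SimpleITK:

```python
import numpy as np
import SimpleITK as sitk

probs = np.random.rand(5, 64, 64, 64)               # placeholder: 5 models' lesion probabilities
val_dsc = np.array([0.60, 0.62, 0.58, 0.61, 0.59])  # placeholder: per-fold validation DSCs

# Average (Avg): mean probability map, thresholded at 0.5
avg_mask = (probs.mean(axis=0) > 0.5).astype(np.uint8)

# Weighted Average (WtAvg): weights proportional to the validation-fold DSCs
w = (val_dsc / val_dsc.sum())[:, None, None, None]
wtavg_mask = ((w * probs).sum(axis=0) > 0.5).astype(np.uint8)

# Majority Voting (Vote): binarize each model, keep voxels where >= 3 of 5 agree
votes = (probs > 0.5).astype(np.uint8)
vote_mask = (votes.sum(axis=0) >= 3).astype(np.uint8)

# STAPLE [2]: estimate a consensus probability map from the five binary masks
staple_prob = sitk.STAPLE([sitk.GetImageFromArray(v) for v in votes], 1.0)
staple_mask = (sitk.GetArrayFromImage(staple_prob) > 0.5).astype(np.uint8)
```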
Our network performance, in terms of mean and median DSC for the different training and test set pairs, is summarized in the figure and table below.
*(Figure: ensemble model performance on the lymphoma, lung cancer, and melanoma test sets.)*
A short description of the results is as follows:
| Training data | Ensemble type | Lymphoma (test): mean DSC | Lymphoma (test): median DSC | Lung cancer (test): mean DSC | Lung cancer (test): median DSC | Melanoma (test): mean DSC | Melanoma (test): median DSC |
|---|---|---|---|---|---|---|---|
| Lymphoma | Average DSC over folds [01234] | 0.5541±0.2774 | 0.6791 | 0.4021±0.2412 | 0.4265 | 0.3686±0.2860 | 0.3194 |
| Lymphoma | Average | 0.5832±0.2772 | 0.7196 | 0.4161±0.2514 | 0.4473 | 0.4330±0.3138 | 0.4255 |
| Lymphoma | Weighted Average | 0.5838±0.2761 | 0.7194 | 0.4161±0.2519 | 0.4462 | 0.4337±0.3139 | 0.4249 |
| Lymphoma | Vote | 0.5691±0.2787 | 0.7070 | 0.4031±0.2491 | 0.4190 | 0.4253±0.3133 | 0.4282 |
| Lymphoma | STAPLE | 0.5766±0.2839 | 0.7057 | 0.4374±0.2530 | 0.4527 | 0.4063±0.2933 | 0.3914 |
| Lung cancer | Average DSC over folds [01234] | 0.3886±0.2497 | 0.4234 | 0.6909±0.2092 | 0.7339 | 0.3729±0.2465 | 0.3783 |
| Lung cancer | Average | 0.4062±0.2775 | 0.4765 | 0.7147±0.2023 | 0.7626 | 0.4206±0.2651 | 0.4754 |
| Lung cancer | Weighted Average | 0.4063±0.2775 | 0.4753 | 0.7148±0.2023 | 0.7630 | 0.4207±0.2650 | 0.4768 |
| Lung cancer | Vote | 0.3992±0.2789 | 0.4663 | 0.7134±0.2026 | 0.7704 | 0.4248±0.2667 | 0.4760 |
| Lung cancer | STAPLE | 0.4132±0.2597 | 0.4583 | 0.7080±0.2062 | 0.7600 | 0.3810±0.2512 | 0.3887 |
| Melanoma | Average DSC over folds [01234] | 0.4026±0.2342 | 0.4516 | 0.4033±0.2283 | 0.4237 | 0.4737±0.2877 | 0.5186 |
| Melanoma | Average | 0.4136±0.2347 | 0.4419 | 0.4119±0.2337 | 0.4419 | 0.5175±0.2831 | 0.6038 |
| Melanoma | Weighted Average | 0.4118±0.2365 | 0.4411 | 0.4119±0.2336 | 0.4395 | 0.5191±0.2822 | 0.6067 |
| Melanoma | Vote | 0.3495±0.2450 | 0.3591 | 0.3823±0.2350 | 0.4173 | 0.4736±0.2822 | 0.5283 |
| Melanoma | STAPLE | 0.4575±0.2459 | 0.5192 | 0.4316±0.2303 | 0.4612 | 0.5154±0.2938 | 0.5887 |
[1] MONAI: Medical Open Network for AI, AI Toolkit for Healthcare Imaging
[2] Simon K. Warfield, Kelly H. Zou, and William M. Wells, Simultaneous Truth and Performance Level Estimation (STAPLE): An Algorithm for the Validation of Image Segmentation, IEEE Trans Med Imaging, 2004.