This directory hosts sample scripts to launch multi-GPU distributed training using `torch.distributed.DistributedDataParallel` and MONAI Core on UF HiperGator's AI partition, a SLURM cluster that uses Singularity as its container runtime.
- This tutorial assumes you have downloaded the repository `monai_uf_tutorials` following this section.
- If you have not run MONAI Core with Singularity as the container runtime on HiperGator before, I strongly recommend going through the tutorial `monaicore_singlegpu` and making sure it works before moving on to this tutorial.
- If you have not run distributed training with MONAI Core on HiperGator before, I strongly recommend trying out the following examples in order, i.e., from simple to complex, which will make debugging easier.
- In all following commands, replace `hju` with your HiperGator username, and change file paths according to your own settings on HiperGator.
- In all following SLURM job scripts, adjust the `#SBATCH` settings to your own needs.
- Please read the comments at the beginning of each script to get a better understanding of how to tune the scripts to your own needs.
- The training Python script we're using here is adapted from a MONAI Core tutorial script; a minimal sketch of the general pattern it follows is shown below.
- Synthetic data is generated on the fly, so you don't need to download any data or have your own data to use this script.
- Validation is not implemented within the training loop; see the following examples for a validation implementation.
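If you haven't used `torch.distributed.DistributedDataParallel` before, the minimal sketch below shows the general pattern the training scripts in this tutorial build on: each process initializes a process group from the environment variables set by the launcher, pins itself to one GPU, wraps the model in DDP, and trains on synthetic data. This is an illustration only, not the tutorial script; the toy model, tensor shapes, and hyperparameters are made up.

```python
# Minimal DistributedDataParallel sketch (illustration only, not the tutorial script).
# It assumes the launcher sets the standard torch.distributed environment variables
# (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), one process per GPU.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")        # reads the env:// variables
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    model = DDP(torch.nn.Linear(32, 2).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(5):
        # synthetic data generated on the fly, different on every rank
        images = torch.randn(16, 32, device=device)
        labels = torch.randint(0, 2, (16,), device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()                            # DDP all-reduces the gradients here
        optimizer.step()
        if dist.get_rank() == 0:
            print(f"epoch {epoch}: loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

On HiperGator you don't start such a script by hand; the SLURM job scripts below use the launch utilities to start one process per GPU, on a single node or across multiple nodes.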
- Go to directory `unet_ddp/`:
  ```
  cd ~/monai_uf_tutorials/monaicore_multigpu/unet_ddp/
  ```
- Submit the SLURM job script `launch.sh` to launch a distributed training on a single node (see sample script `/unet_ddp/launch.sh`):
  ```
  sbatch launch.sh
  ```
  Alternatively, submit the SLURM job script `launch_multinode.sh` to launch a distributed training on multiple nodes (see sample script `/unet_ddp/launch_multinode.sh`):
  ```
  sbatch launch_multinode.sh
  ```
  Note: the differences between `launch.sh` (for training on a single node) and `launch_multinode.sh` (for training on multiple nodes) are:
  - the `#SBATCH` settings;
  - the use of `run_on_node.sh` or `run_on_multinode.sh` in line 70:
    ```
    PT_LAUNCH_SCRIPT=$(realpath "${PT_LAUNCH_UTILS_PATH}/run_on_node.sh")
    ```
- Check the SLURM output file. Its name follows the format `launch.sh.job_id.out` or `launch_multinode.sh.job_id.out` (see sample files `/unet_ddp/launch.sh.job_id.out` and `/unet_ddp/launch_multinode.sh.job_id.out`):
  ```
  cat launch.sh.job_id.out
  ```
- The training Python script is adapted from a MONAI Core tutorial script.
- This example is a real-world task based on the Decathlon challenge Task01: Brain Tumor segmentation, so it's more complicated than Example 1.
- Steps to get the required data:
  - Go to http://medicaldecathlon.com/, click on `Get Data`, and download `Task01_BrainTumour.tar` to your local computer.
  - Upload it to your storage partition (e.g., blue or red partition) on HiperGator; sample command:
    ```
    scp path_to_Task01_BrainTumour.tar hju@hpg.rc.ufl.edu:path_to_storage_directory
    ```
  - Extract the data to directory `/Task01_BrainTumour`:
    ```
    tar xvf path_to_Task01_BrainTumour.tar
    ```
- To make the data visible to the MONAI Core Singularity container, we need to bind the data directory into the container, see line 44 in `/brats_ddp/launch.sh`:
  ```
  PYTHON_PATH="singularity exec --nv --bind /blue/vendor-nvidia/hju/data/brats_data:/mnt \
  /blue/vendor-nvidia/hju/monaicore0.8.1 python3"
  ```
  Note the use of the `--bind` flag: `--bind path_to_data_directory_on_hipergator:directory_name_seen_by_container`. You can name `directory_name_seen_by_container` whatever you like, i.e., it doesn't have to be `/mnt`. See the Singularity doc on bind paths for more details.
  Note that for this example I'm binding `/brats_data`, the parent directory of `/brats_data/Task01_BrainTumour`, to the container, for the sake of the next point. See Example 3 for binding the data directory itself (not its parent directory) to the container.
- For this training script, we also need to provide the parent directory of `/mnt/Task01_BrainTumour` as an input argument. See line 39 in `/brats_ddp/launch.sh`:
  ```
  TRAINING_CMD="$TRAINING_SCRIPT -d=/mnt --epochs=20"
  ```
- Multiple fast model training techniques are used: the Novograd optimizer, caching intermediate data in GPU memory, `ThreadDataLoader`, and Automatic Mixed Precision (AMP). See the Fast Model Training guide to learn more. A sketch combining these techniques with the `-d` argument above is shown after these notes.
- The dataset is split and cached on each GPU before training; see the implementation of `BratsCacheDataset`. This avoids caching duplicated content on each GPU, but does not do a global shuffle before every epoch. If you want a global shuffle while caching on GPUs, you can replace the `BratsCacheDataset` object with a `CacheDataset` object plus a `DistributedSampler` object, where each GPU will cache the whole dataset; see the discussion. A sketch of this alternative is also shown after these notes.
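To make the last few notes more concrete, here is a rough sketch (not the actual tutorial script) of how a training script can combine the `-d`/`--epochs` arguments from `TRAINING_CMD` with the fast-training techniques listed above: preprocessed data cached in GPU memory via `ToDeviced` inside a `CacheDataset`, iteration with `ThreadDataLoader`, the Novograd optimizer, and an AMP training loop. The transform chain, network, and hyperparameters are simplified placeholders, and the DDP setup and per-rank data sharding are omitted (see the sketch in Example 1).

```python
# Rough sketch only: fast-training techniques wired together for Task01_BrainTumour.
# DDP setup and per-rank data sharding are omitted; transforms and hyperparameters
# are placeholders, not those of the tutorial script.
import argparse
import glob
import os

import torch
from monai.data import CacheDataset, ThreadDataLoader
from monai.losses import DiceLoss
from monai.networks.nets import UNet
from monai.optimizers import Novograd
from monai.transforms import (
    Compose, EnsureChannelFirstd, EnsureTyped, LoadImaged,
    NormalizeIntensityd, RandSpatialCropd, ToDeviced,
)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--dir", default="/mnt", help="parent directory of Task01_BrainTumour")
    parser.add_argument("--epochs", type=int, default=20)
    args = parser.parse_args()
    device = torch.device("cuda:0")  # in the real DDP script this would be the local rank's GPU

    images = sorted(glob.glob(os.path.join(args.dir, "Task01_BrainTumour", "imagesTr", "*.nii.gz")))
    labels = sorted(glob.glob(os.path.join(args.dir, "Task01_BrainTumour", "labelsTr", "*.nii.gz")))
    files = [{"image": img, "label": lbl} for img, lbl in zip(images, labels)]

    # deterministic transforms up to ToDeviced are cached once, in GPU memory;
    # the random crop then runs on GPU every epoch
    transforms = Compose([
        LoadImaged(keys=["image", "label"]),
        EnsureChannelFirstd(keys=["image", "label"]),
        NormalizeIntensityd(keys="image", nonzero=True, channel_wise=True),
        EnsureTyped(keys=["image", "label"]),
        ToDeviced(keys=["image", "label"], device=device),
        RandSpatialCropd(keys=["image", "label"], roi_size=(224, 224, 144), random_size=False),
    ])
    # lower cache_rate if the cached tensors do not fit in GPU memory
    dataset = CacheDataset(data=files, transform=transforms, cache_rate=1.0, num_workers=4)
    # data already lives on GPU, so no loader worker processes are needed
    loader = ThreadDataLoader(dataset, batch_size=1, shuffle=True, num_workers=0)

    model = UNet(spatial_dims=3, in_channels=4, out_channels=4,
                 channels=(16, 32, 64, 128), strides=(2, 2, 2)).to(device)
    loss_fn = DiceLoss(to_onehot_y=True, softmax=True)
    optimizer = Novograd(model.parameters(), lr=1e-3)   # Novograd from monai.optimizers
    scaler = torch.cuda.amp.GradScaler()                # AMP

    model.train()
    for epoch in range(args.epochs):
        for batch in loader:
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():
                loss = loss_fn(model(batch["image"]), batch["label"])
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()


if __name__ == "__main__":
    main()
```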
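And here is a rough sketch of the `CacheDataset` + `DistributedSampler` alternative mentioned in the last note: every rank caches the whole dataset, and the sampler re-shuffles globally at every epoch. `files` and `transforms` are the same placeholders as in the previous sketch.

```python
# Sketch of the CacheDataset + DistributedSampler alternative (illustration only).
# Assumes torch.distributed has already been initialized, as in the DDP sketch above.
import torch.distributed as dist
from monai.data import CacheDataset, ThreadDataLoader
from torch.utils.data.distributed import DistributedSampler


def build_loader(files, transforms, batch_size=1):
    # every rank caches the *whole* dataset (more memory than BratsCacheDataset,
    # which caches only this rank's shard)
    dataset = CacheDataset(data=files, transform=transforms, cache_rate=1.0, num_workers=4)
    # the sampler hands each rank a different, globally shuffled subset every epoch
    sampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(),
                                 rank=dist.get_rank(), shuffle=True)
    loader = ThreadDataLoader(dataset, batch_size=batch_size, sampler=sampler, num_workers=0)
    return loader, sampler


# call sampler.set_epoch(epoch) before iterating, so the shuffle changes per epoch:
#
#   loader, sampler = build_loader(files, transforms)
#   for epoch in range(epochs):
#       sampler.set_epoch(epoch)
#       for batch in loader:
#           ...
```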
Steps are similar to Example 1, except that the sample scripts and output files are in directory `brats_ddp/`.
- The training Python script is adapted from a MONAI Core tutorial script.
- data, bind
- single gpu
- fast techs