Step-by-step Tutorial: use MONAI Core by running a Singularity container on a single GPU on HiperGator
- Clean development environment: easy to maintain dependencies of specific versions, easy to debug.
- Portable (reproducible).
- Scalable.
- MONAI Core containers are based on NGC (NVIDIA GPU Cloud) PyTorch containers: performance optimized for NVIDIA GPUs and regularly updated.
- This tutorial assumes you have downloaded the repository monai_uf_tutorials following this section.
- In all following commands, change hju to your HiperGator username; change the path to files according to your settings on HiperGator.
- In all following SLURM job scripts, alter the "#SBATCH" settings for your needs.
- Download MONAI Core tutorial repository.
- Build a MONAI Core singularity container.
- Run tutorial scripts within the container.
Log in to HiperGator, change hju to your username
ssh hju@hpg.rc.ufl.edu
Go to your home directory
cd ~
Download the MONAI Core tutorial repository into a local directory. The default name of the directory is tutorials.
git clone https://github.com/Project-MONAI/tutorials.git
Make all files in the directory executable
chmod -R +x tutorials/
Go to directory monaicore_singlegpu/.
cd ~/monai_uf_tutorials/monaicore_singlegpu/
Submit a SLURM job script (see sample script build_container.sh) to build a singularity sandbox container (a writable directory) in a directory of your choice (e.g., a directory in your blue storage; don't store the container in your home directory, see HiperGator Storage). See useful SLURM Commands on HiperGator to learn more about commands like sbatch.
sbatch build_container.sh
Sample console output
Submitted batch job 41779619 # 41779619 is the job id
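For reference, build_container.sh essentially wraps a singularity build command in a SLURM job. Below is a minimal sketch, not the exact sample script; the #SBATCH values, target path, and image tag (here a sandbox named monaicore0.9.1 built from the Docker Hub image docker://projectmonai/monai:0.9.1, which is itself based on an NGC PyTorch image) are assumptions to adapt to your own setup.
#!/bin/bash
#SBATCH --job-name=build_container
#SBATCH --output=build_container.sh.%j.out
#SBATCH --partition=gpu        # the sample output above used the gpu partition
#SBATCH --cpus-per-task=4
#SBATCH --mem=32gb
#SBATCH --time=02:00:00

module load singularity
# Build a writable sandbox directory from the MONAI image on Docker Hub;
# store it in blue storage, not in your home directory.
singularity build --sandbox /blue/vendor-nvidia/hju/monaicore0.9.1 docker://projectmonai/monai:0.9.1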
This job might take a long time. Check whether the job is still running (the watch command will update the job status every 2 seconds).
watch squeue -u $USER
Sample console output
Every 2.0s: squeue -u hju Mon Jul 11 20:23:29 2022
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
41784715 gpu build_co hju R 0:05 1 c38a-s5
(check useful SLURM Commands on HiperGator to learn more about commands like squeue.)
Once the job is done, check the SLURM output file; its name is in the format build_container.sh.job_id.out (see sample output file). If there is no error, the container was built successfully.
cat build_container.sh.job_id.out
Tutorial scripts come in 2 formats: Python scripts (.py) and Jupyter notebooks (.ipynb). How to run each format is shown separately below.
Go to directory monaicore_singlegpu/.
cd ~/monai_uf_tutorials/monaicore_singlegpu
Request resources for an interactive session and start a bash shell (learn more about interactive sessions on HiperGator and about GPU access on HiperGator).
To request an A100 GPU:
srun --nodes=1 --ntasks=1 --partition=gpu --gpus=a100:1 --cpus-per-task=4 --mem-per-cpu 64gb --time=03:00:00 --pty -u bash -i
To request a GeForce GPU:
srun --nodes=1 --ntasks=1 --partition=gpu --gpus=geforce:1 --cpus-per-task=4 --mem-per-cpu 64gb --time=03:00:00 --pty -u bash -i
When you see the prompt change from something like [hju@login5 monaicore_singlegpu]$ to something like [hju@c39a-s39 monaicore_singlegpu]$, you have successfully hopped from the login node to a compute node. Sample console output:
srun: job 41777769 queued and waiting for resources
srun: job 41777769 has been allocated resources
[hju@c39a-s39 monaicore_singlegpu]$
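Optionally, once on the compute node you can confirm the GPU allocation (nvidia-smi is available on HiperGator GPU nodes):
nvidia-smi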
Load singularity.
module load singularity
Spawn a new shell within your container and interact with it. Alter the path to your container.
singularity shell --nv /blue/vendor-nvidia/hju/monaicore0.9.1
It might hang for a while before any console output. The change in prompt (from hju@... to Singularity>) indicates that you have entered the container. Sample console output:
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (484) bind mounts # can be ignored
"uname": executable file not found in $PATH # can be ignored
Singularity>
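Optionally, before running a full tutorial script, you can check that MONAI imports correctly and that the GPU is visible from inside the container. This one-liner uses only PyTorch's torch.cuda.is_available() and monai.config.print_config(), both shipped in the MONAI image:
python3 -c "import torch, monai; print('CUDA available:', torch.cuda.is_available()); monai.config.print_config()"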
Run a python script.
python3 /home/hju/tutorials/2d_segmentation/torch/unet_training_array.py
The first time you run a script within a newly spun-up container, it might hang for quite a while before you see any console output; subsequent runs will produce output much faster. See sample console output.
Exit the container.
Ctrl + D
Exit the compute node and go back to the login node.
Ctrl + D
Submit a SLURM job script. See sample job script run_container.sh.
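The job script simply runs the same training script non-interactively via singularity exec. Below is a minimal sketch rather than the exact sample script; the #SBATCH values and paths are assumptions to adapt:
#!/bin/bash
#SBATCH --job-name=run_container
#SBATCH --output=run_container.sh.%j.out
#SBATCH --partition=gpu
#SBATCH --gpus=a100:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=64gb
#SBATCH --time=03:00:00

module load singularity
# --nv exposes the host NVIDIA driver and the allocated GPU to the container
singularity exec --nv /blue/vendor-nvidia/hju/monaicore0.9.1 \
    python3 /home/hju/tutorials/2d_segmentation/torch/unet_training_array.py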
sbatch run_container.sh
Check the SLURM output file; its name is in the format run_container.sh.job_id.out (see sample output file).
cat run_container.sh.job_id.out
Alternatively, you can use the tail command to follow the SLURM output as it is being written.
tail -f run_container.sh.job_id.out
Submit a SLURM job script to launch a JupyterLab server (see sample script launch_jupyter_lab.sh). Note: in the SLURM job script, do not set the CPU memory too small; otherwise some cells (e.g., cells for training) in the Jupyter notebooks cannot execute.
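Below is a minimal sketch of such a launch script, not the exact sample script; it assumes JupyterLab is available inside the container (it is included in the NGC PyTorch base image), and the #SBATCH values, container path, and port are placeholders to adapt:
#!/bin/bash
#SBATCH --job-name=launch_jupyter_lab
#SBATCH --output=launch_jupyter_lab.sh.%j.out
#SBATCH --partition=gpu
#SBATCH --gpus=a100:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=64gb               # do not set this too small (see note above)
#SBATCH --time=08:00:00

module load singularity
hostname                         # written to the SLURM output file; used for the SSH tunnel below
# Start JupyterLab inside the container, listening on port 8888 on all interfaces
singularity exec --nv /blue/vendor-nvidia/hju/monaicore0.9.1 \
    jupyter lab --no-browser --port=8888 --ip=0.0.0.0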
sbatch launch_jupyter_lab.sh
Check the SLURM output file; its name is in the format launch_jupyter_lab.sh.job_id.out. It might take a while before you see the hostname (e.g., c38a-s5) and http://hostname:8888/?token= in the sample output file, which we will use in the next steps.
cat launch_jupyter_lab.sh.job_id.out
Open another local terminal window and SSH to the server host. Alter the hostname (e.g., c1000a-s17, between the 2 colons) according to the SLURM output above, and change hju to your username.
ssh -NL 8888:c38a-s5:8888 hju@hpg.rc.ufl.edu
You will be prompted to enter your password. If the password is correct, there will be no console output; if wrong, you will be prompted to enter it again.
In a web browser, go to http://localhost:8888/. You might be prompted to authenticate as shown in the screenshot below. Copy & paste the token from the SLURM output above into the prompt box. Note: if you do not want to copy and paste the token for every Jupyter job, you can set a default password; see remote jupyter notebook on HiperGator.
Open any Jupyter notebook in the directory tutorials/ on the left pane (e.g., ~/tutorials/2d_classification/mednist_tutorial.ipynb) and you're ready to run.