I am reaching out about a CUDA out-of-memory issue I've encountered during multi-GPU training on a setup of 8 A800 GPUs, each with 80GB of memory. The problem appears even with the default batch size of 16, and I have to reduce the batch size to 8 for training to proceed. All other script configurations are left unchanged, yet the memory limitation persists. I would appreciate any thoughts on the possible causes and any immediate recommendations for better memory management.
Thank you in advance for any guidance or shared knowledge on this matter.
It runs fine on an A100 80GB even with BS=32, so it would be strange for an A800 80GB to be unable to host the model without OOM.
BS=32 should yield a memory usage of around 65GB, and BS=16 about 40GB.
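If it helps, here is a minimal sketch of how you could log peak per-GPU memory during training to compare against those numbers; the `log_peak_memory` helper and where you call it are just illustrative, not part of this repo's training script:

```python
import torch

def log_peak_memory(step: int) -> None:
    # Peak memory allocated by tensors on the current GPU since the last
    # reset, converted to GiB for easy comparison with the numbers above.
    device = torch.cuda.current_device()
    peak_gib = torch.cuda.max_memory_allocated(device) / (1024 ** 3)
    print(f"device {device} | step {step}: peak allocated {peak_gib:.1f} GiB")
    # Reset so the next interval's peak is measured independently.
    torch.cuda.reset_peak_memory_stats(device)
```

Calling something like this every few hundred steps makes it easy to see whether usage sits near ~40GB at BS=16 or balloons well past it, which is what a broken attention/xformers setup tends to look like.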
Maybe you should make sure you installed xformers and the other dependencies correctly. The current README does have some version mismatches, and you need to settle them manually to make sure all the packages match.
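As a quick sanity check, `python -m xformers.info` should print the build info and which kernels are available. Below is also a minimal smoke test (the shapes are arbitrary, just to exercise the kernel) to confirm that the import and the memory-efficient attention op actually run on your build:

```python
# Minimal smoke test: confirm xformers imports and its memory-efficient
# attention kernel runs on the current GPU.
import torch
import xformers
import xformers.ops as xops

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("xformers:", xformers.__version__)

# Arbitrary small shapes: (batch, seq_len, heads, head_dim).
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = xops.memory_efficient_attention(q, k, v)
print("memory_efficient_attention OK, output shape:", tuple(out.shape))
```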
Thank you very much for your prompt and insightful response. We have resolved the issue. It turns out xformers had been inadvertently removed while switching between different versions of our environment, which led to excessive memory usage. Following your guidance, we reinstalled the environment and can now train smoothly with a larger batch size. Thank you once again for your invaluable support!