CUDA OOM Error with 8xA800(80G) GPUs at Default BS=16 #33

Open

annaoooo opened this issue Jun 4, 2024 · 3 comments

@annaoooo
annaoooo commented Jun 4, 2024

I am running into a CUDA out-of-memory error during multi-GPU training on 8 A800 GPUs, each with 80 GB of memory. The error occurs even with the default batch size of 16, and I have to reduce the batch size to 8 for training to proceed. All other script settings are unchanged, yet the memory limit is still hit. Any ideas about what might cause this, or suggestions for better memory management, would be greatly appreciated.

Thank you in advance for any guidance or shared knowledge on this matter.

@TimandXiyu

It runs fine on an A100 80G even with BS=32, so it would be strange for an A800 80G not to be able to host the model without OOM.
BS=32 should use around 65 GB of memory, and BS=16 about 40 GB.
Make sure you installed xformers and the other dependencies correctly; the current readme does have some version mismatches, and you need to resolve them manually so that all the packages match.
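For reference, a quick sanity check along these lines can confirm that xformers is importable and that its memory-efficient attention kernel actually runs on your GPU (a minimal sketch, not taken from this repo's scripts):

```python
# Minimal sanity check (illustrative, not part of this repo): confirm xformers
# is installed and its memory-efficient attention kernel runs on the GPU.
import torch
import xformers
import xformers.ops

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("xformers:", xformers.__version__)

# Tiny fp16 tensors shaped (batch, seq_len, heads, head_dim).
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
out = xformers.ops.memory_efficient_attention(q, q, q)
print("memory_efficient_attention OK, output shape:", tuple(out.shape))
```

If the import or the kernel call fails, the training code may silently fall back to standard attention, which would explain the much higher memory usage.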

@annaoooo
Author

annaoooo commented Jun 6, 2024

Thank you very much for your prompt and insightful response. We have resolved the issue: it turns out xformers had been inadvertently removed while switching between different versions of our environment, which led to excessive memory usage. With your guidance, we reinstalled the environment and can now train smoothly with a larger batch size. Thank you once again for your invaluable support!

@xilanhua12138

@TimandXiyu Why does num_layers become zero when I enable xformers? Can you share the xformers version and CUDA version?
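For comparison, the exact versions in a given environment can be printed with a generic check like this (not tied to this repo):

```python
# Print the package and CUDA versions in the current environment
# (a generic check, not specific to this repo).
import torch
import xformers

print("xformers:", xformers.__version__)
print("torch:", torch.__version__)
print("CUDA (built into torch):", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no GPU visible")
```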
