CUDA OOM Error with 8xA800(80G) GPUs at Default BS=16 #33

Open

annaoooo opened this issue Jun 4, 2024 · 3 comments

@annaoooo
annaoooo commented Jun 4, 2024

I am running into a CUDA out-of-memory error during multi-GPU training on 8 A800 GPUs, each with 80 GB of memory. The error occurs even with the default batch size of 16, and I have to reduce the batch size to 8 for training to proceed. All other script settings are unchanged, yet the memory limit is still hit. Any ideas about what might cause this, or suggestions for better memory management, would be greatly appreciated.

Thank you in advance for any guidance or shared knowledge on this matter.

@TimandXiyu

It runs fine on an A100 80G even with BS=32, so it would be strange for an A800 80G not to be able to host the model without OOM.
BS=32 should use around 65 GB of memory, and BS=16 about 40 GB.
Make sure you installed xformers and the other dependencies correctly; the current readme does have some version mismatches, and you need to resolve them manually so that all the packages match.
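For reference, a quick sanity check along these lines can confirm that xformers is importable and that its memory-efficient attention kernel actually runs on your GPU (a minimal sketch, not taken from this repo's scripts):

```python
# Minimal sanity check (illustrative, not part of this repo): confirm xformers
# is installed and its memory-efficient attention kernel runs on the GPU.
import torch
import xformers
import xformers.ops

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("xformers:", xformers.__version__)

# Tiny fp16 tensors shaped (batch, seq_len, heads, head_dim).
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
out = xformers.ops.memory_efficient_attention(q, q, q)
print("memory_efficient_attention OK, output shape:", tuple(out.shape))
```

If the import or the kernel call fails, the training code may silently fall back to standard attention, which would explain the much higher memory usage.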

@annaoooo
Author

annaoooo commented Jun 6, 2024

Thank you very much for your prompt and insightful response. We have resolved the issue: it turns out xformers had been inadvertently removed while switching between different versions of our environment, which led to excessive memory usage. With your guidance, we reinstalled the environment and can now train smoothly with a larger batch size. Thank you once again for your invaluable support!

@xilanhua12138

@TimandXiyu Why does num_layers become zero when I enable xformers? Can you share the xformers version and CUDA version?
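For comparison, the exact versions in a given environment can be printed with a generic check like this (not tied to this repo):

```python
# Print the package and CUDA versions in the current environment
# (a generic check, not specific to this repo).
import torch
import xformers

print("xformers:", xformers.__version__)
print("torch:", torch.__version__)
print("CUDA (built into torch):", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no GPU visible")
```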
