Question about torch.compile has better throughput with 128-GPUs than 8-GPUs #619
Thank you for the question. This is a great observation! After some initial investigation, I think the difference might be caused by underlying hardware differences. We will post an update once we have the new result.
Thank you for your answer.
That's another good question. We haven't done any specific studies on the weak scaling you asked about -- I need to understand more about where this slowdown comes from before answering your question with confidence. Are you aware of any studies on this topic? I can think of several slowdowns when scaling from 8 GPUs (single node) to 128 GPUs for eager/pure FSDP:
On the other hand, there should be some (although maybe not much) savings when scaling up, e.g. each rank now holds a smaller fraction of the parameters to be updated by the optimizer. More can be said if we look at the profile traces and compare. In general, this seems to be a complex topic, and the "speedup ratio" can be task-specific -- e.g. if we train a 70B model, would the ratio be higher or lower? That is not to mention the variation each run/iteration could have.
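For comparing the two setups trace-by-trace, something like the sketch below could be used to capture a few steps from each run. It assumes a standard PyTorch training loop; `train_step`, `data_iter`, and the trace directory are illustrative placeholders, not anything from torchtitan.

```python
# Minimal sketch: capture profiler traces over the same phase of training on
# both the 8-GPU and 128-GPU runs, so the traces can be compared side by side.
import torch
from torch.profiler import (
    profile, schedule, tensorboard_trace_handler, ProfilerActivity,
)

def train_step(model, optimizer, batch):
    # Hypothetical single training step: forward, backward, optimizer update.
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

def profile_run(model, optimizer, data_iter, trace_dir="./traces"):
    # Skip a few warmup iterations, then record a handful of active steps.
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=2, warmup=2, active=3),
        on_trace_ready=tensorboard_trace_handler(trace_dir),
    ) as prof:
        for _ in range(7):  # 2 wait + 2 warmup + 3 active steps
            train_step(model, optimizer, next(data_iter))
            prof.step()
```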
Thank you for publishing the paper. I hope to get your answers to the following questions:
Normally, training speed declines as the number of GPUs increases. However, in the paper, with torch.compile, the throughput with 128 GPUs is better than with 8 GPUs.
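For context, a minimal sketch of the kind of setup under discussion (torch.compile applied to an FSDP-wrapped model) is below. The model and wrapping policy are placeholders, not torchtitan's actual configuration, which applies sharding and compilation at a finer granularity.

```python
# Minimal sketch of combining FSDP with torch.compile; assumes launch via
# torchrun so that RANK/WORLD_SIZE/MASTER_ADDR are set in the environment.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_model():
    # Placeholder model; in practice this would be the Transformer under test.
    return torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    )

def setup():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = build_model().cuda()
    model = FSDP(model)            # shard parameters/grads/optimizer state across ranks
    model = torch.compile(model)   # compile the sharded model on each rank
    return model
```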