
Forward order differs across ranks: rank 0 is all-gathering 1 parameters while rank 4 is all-gathering -2142209024 parameters #44

Open
viyjy opened this issue Jun 26, 2023 · 2 comments

viyjy commented Jun 26, 2023

Hi, I am trying to use autoresume to continue training my failed jobs, but I get the following error:

File "/opt/conda/lib/python3.9/site-packages/torch/distributed/fsdp/_exec_order_utils.py", line 243, in _check_order
RuntimeError: Forward order differs across ranks: rank 0 is all-gathering 1 parameters while rank 4 is all-gathering -2142209024 parameters

When I train a model on a single node, save a checkpoint, and set autoresume=True to continue training on a single node, it works.
However, when I train a model on 16 nodes, save a checkpoint, and then use 1 or 16 nodes to autoresume, I get the error above.
I googled it but only found this Stack Overflow question, which reports the same error and has no answer yet.
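
For reference, a minimal sketch of the kind of setup described above, assuming Composer's Trainer API (the save_folder, autoresume, and fsdp_config arguments) with placeholder model, dataloader, and checkpoint-folder names:

```python
from composer import Trainer

# Placeholder model and dataloader; stand-ins for the real training objects.
trainer = Trainer(
    model=composer_model,                  # a ComposerModel to be sharded with FSDP
    train_dataloader=train_dataloader,
    max_duration='10ep',
    save_folder='s3://my-bucket/checkpoints',          # assumed checkpoint location
    save_interval='1ep',
    fsdp_config={'sharding_strategy': 'FULL_SHARD'},   # FSDP enabled across the nodes
    autoresume=True,                       # resume from the latest checkpoint in save_folder
)
trainer.fit()
```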

Landanjs (Contributor) commented

Apologies for the delay! Are you able to specify the checkpoint you want to load using load_path instead of autoresume=True? Or do you hit the same error?

viyjy (Author) commented Dec 13, 2023

@Landanjs Yes, I am able to use load_path. However, the job gets stuck at the very beginning if I use load_path=/path/of/checkpoint and set load_weights_only=False.
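
For clarity, a sketch of that configuration, again assuming Composer's Trainer arguments and the same placeholder names as above:

```python
from composer import Trainer

trainer = Trainer(
    model=composer_model,                  # same placeholder model as above
    train_dataloader=train_dataloader,
    max_duration='10ep',
    fsdp_config={'sharding_strategy': 'FULL_SHARD'},
    load_path='/path/of/checkpoint',       # explicit checkpoint instead of autoresume=True
    load_weights_only=False,               # also restore optimizer/timestamp state; the reported hang occurs here
)
trainer.fit()
```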
