ensure reproducible deterministic numerics #597
base: main
Conversation
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
@@ -48,6 +48,10 @@ def set_determinism(seed: Optional[int]) -> None:
     torch.backends.cudnn.benchmark = False
     # set Python seed
     os.environ["PYTHONHASHSEED"] = str(seed)
+    torch.use_deterministic_algorithms(True)
I was going to add this earlier, but I found it crashes compile when used with fp8:
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: "fill_out" not implemented for 'Float8_e4m3fn'
Thus, I don't think we want to add this atm.
It works (compiles) if you don't use fp8, but a lot of the need for determinism is to better showcase fp8.
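One possible interim workaround would be to gate the call so that non-fp8 runs still get the stricter checks. A minimal sketch, not part of this PR; the enable_fp8 flag is hypothetical and would need to be threaded through from the job config:

```python
import os
from typing import Optional

import torch


def set_determinism(seed: Optional[int], enable_fp8: bool = False) -> None:
    # existing settings from this function
    torch.backends.cudnn.benchmark = False
    os.environ["PYTHONHASHSEED"] = str(seed)
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    if not enable_fp8:
        # torch.use_deterministic_algorithms currently breaks torch.compile with
        # Float8_e4m3fn ("fill_out" not implemented), so skip it on fp8 runs.
        torch.use_deterministic_algorithms(True)
```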
I filed an issue to start work on getting this resolved:
pytorch/pytorch#137160
thanks for explaining the context
We should leave a TODO here reminding us to enable it after the issue gets fixed.
Also, tbh I don't know exactly what torch.use_deterministic_algorithms does. Does it cover everything else we are doing here?
+    torch.use_deterministic_algorithms(True)
+    # env var for deterministic CuBLAS
+    # https://github.com/pytorch/pytorch/blob/18525e185e211b3eab44c67a688e5df8396f6f97/torch/__init__.py#L1300
+    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
not sure how you found this arcane variable, but nice job finding it!
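For context, a sketch of how the two documented values differ (based on the cuBLAS reproducibility notes; not part of this diff):

```python
import os

# ":4096:8" reserves a larger fixed cuBLAS workspace (~24 MiB extra GPU memory);
# ":16:8" is the smaller, potentially slower alternative.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

torch.use_deterministic_algorithms(True)
if torch.cuda.is_available():
    a = torch.randn(64, 64, device="cuda")
    # Without the env var, deterministic mode makes cuBLAS matmuls on CUDA >= 10.2
    # raise a RuntimeError; with it set, they are run-to-run reproducible.
    b = a @ a
```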
Approved, but expressly on the condition of removing torch.use_deterministic_algorithms(True), as it crashes during compile with fp8.
The cuBLAS setting, though, is imo worth landing asap, and I verified there are no issues with compile/fp8.
AWS is doing runs now to redo the loss curves and I've pinged them about this change, but it's easier if it lands in main.
Thanks for finding this obscure cuBLAS setting to resolve the grad norm disparity!
@@ -48,6 +48,10 @@ def set_determinism(seed: Optional[int]) -> None:
     torch.backends.cudnn.benchmark = False
     # set Python seed
     os.environ["PYTHONHASHSEED"] = str(seed)
+    torch.use_deterministic_algorithms(True)
+    # env var for deterministic CuBLAS
+    # https://github.com/pytorch/pytorch/blob/18525e185e211b3eab44c67a688e5df8396f6f97/torch/__init__.py#L1300
can we cite https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html
instead of the permanent link?
@@ -48,6 +48,10 @@ def set_determinism(seed: Optional[int]) -> None:
     torch.backends.cudnn.benchmark = False
     # set Python seed
     os.environ["PYTHONHASHSEED"] = str(seed)
+    torch.use_deterministic_algorithms(True)
We should leave a TODO here reminding us to enable it after the issue gets fixed.
Also, tbh I don't know exactly what torch.use_deterministic_algorithms does. Does it cover everything else we are doing here?
torch.use_deterministic_algorithms only covers a subset of operations, so my understanding is we will still need all the other settings. It will also error on some ops that don't have a deterministic impl (e.g. torch.nn.ReplicationPad2d), but I think the main takeaway is that this is additive rather than replacing the other settings.
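A rough illustration of that additive behavior (a sketch; the exact set of ops that error depends on the PyTorch version, and ReplicationPad2d is just the example mentioned above):

```python
import torch

torch.use_deterministic_algorithms(True)
assert torch.are_deterministic_algorithms_enabled()

if torch.cuda.is_available():
    x = torch.randn(1, 3, 8, 8, device="cuda", requires_grad=True)
    out = torch.nn.ReplicationPad2d(1)(x)
    try:
        # Ops whose CUDA backward has no deterministic implementation raise a
        # RuntimeError here instead of silently producing nondeterministic grads.
        out.sum().backward()
    except RuntimeError as e:
        print(f"nondeterministic op rejected: {e}")
```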
Resolves #593.
Before: grad norms are quite different when running the same config twice.
After: grad norms are exactly the same across repeated runs.
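A minimal way to sanity-check that claim (a sketch using a toy model, not the actual test plan from this PR):

```python
import os

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch


def grad_norm_after_one_step(seed: int) -> float:
    # Re-seed so both runs see identical weights and data.
    torch.manual_seed(seed)
    model = torch.nn.Linear(256, 256).cuda()
    x = torch.randn(32, 256, device="cuda")
    model(x).sum().backward()
    grads = torch.stack([p.grad.norm() for p in model.parameters()])
    return torch.linalg.vector_norm(grads).item()


if torch.cuda.is_available():
    n1 = grad_norm_after_one_step(0)
    n2 = grad_norm_after_one_step(0)
    # With deterministic settings, repeated runs should match bitwise.
    print(f"run 1: {n1!r}  run 2: {n2!r}  identical: {n1 == n2}")
```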