
ensure reproducible deterministic numerics #597

Open
wants to merge 2 commits into main

Conversation

weifengpy (Contributor) commented on Oct 2, 2024

Resolves #593.

Grad norms differ noticeably when running the same config twice:
Screenshot 2024-10-01 at 8 34 40 PM

With this change, grad norms are exactly the same across repeated runs:
Screenshot 2024-10-01 at 8 53 19 PM
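For context, a minimal sketch of what the patched helper looks like, assembled from the diff below; the earlier part of the function body is elided.

# Sketch of set_determinism after this PR's additions (assembled from the diff below);
# the earlier RNG-seeding part of the function is elided.
import os
from typing import Optional

import torch


def set_determinism(seed: Optional[int]) -> None:
    # ... existing seeding logic elided ...
    torch.backends.cudnn.benchmark = False
    # set Python seed
    os.environ["PYTHONHASHSEED"] = str(seed)
    # force deterministic kernels where PyTorch provides them
    torch.use_deterministic_algorithms(True)
    # env var for deterministic cuBLAS matmuls
    # https://github.com/pytorch/pytorch/blob/18525e185e211b3eab44c67a688e5df8396f6f97/torch/__init__.py#L1300
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"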

facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Oct 2, 2024
@@ -48,6 +48,10 @@ def set_determinism(seed: Optional[int]) -> None:
     torch.backends.cudnn.benchmark = False
     # set Python seed
     os.environ["PYTHONHASHSEED"] = str(seed)
+    torch.use_deterministic_algorithms(True)
Contributor:
I was going to add this earlier, but I found it crashes compile when used with fp8:

torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: "fill_out" not implemented for 'Float8_e4m3fn'

Thus, I don't think we want to add this at the moment. It works (compiles) if you don't use fp8, but a lot of the need for determinism is to better showcase fp8.
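If strict determinism is still wanted for non-fp8 runs, one possible workaround (purely a sketch, not part of this PR; the flag name is illustrative) is to gate the call on whether fp8 is enabled:

# Hypothetical sketch only, not this PR's code: skip the strict-determinism
# switch when fp8 is enabled, since it currently crashes torch.compile/inductor
# ("fill_out" not implemented for 'Float8_e4m3fn').
import torch


def maybe_use_deterministic_algorithms(fp8_enabled: bool) -> None:
    if not fp8_enabled:
        torch.use_deterministic_algorithms(True)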

Contributor:

I filed an issue to start work on getting this resolved:
pytorch/pytorch#137160

weifengpy (Contributor, author):

Thanks for explaining the context.

Contributor:

We should leave a TODO here reminding us to enable it after the issue gets fixed.

Also, to be honest, I don't know exactly what torch.use_deterministic_algorithms does. Does it cover everything else we are doing here?

+    torch.use_deterministic_algorithms(True)
+    # env var for deterministic CuBLAS
+    # https://github.com/pytorch/pytorch/blob/18525e185e211b3eab44c67a688e5df8396f6f97/torch/__init__.py#L1300
+    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
Contributor:
Not sure how you found this arcane variable, but nice job finding it!

lessw2020 (Contributor) left a review:

Approved, but expressly on the condition of removing torch.use_deterministic_algorithms(True), as it crashes during compile with fp8.

The cuBLAS setting, though, is in my opinion worth landing ASAP; I verified there are no issues with compile/fp8.

AWS is doing runs now to redo the loss curves, and I've pinged them about this change, but it's easier if it lands in main. Thanks for finding this obscure cuBLAS setting to resolve the grad norm disparity!
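Concretely, a hedged sketch of the landing shape being asked for here (keep the cuBLAS env var, hold the strict-determinism call behind a TODO referencing pytorch/pytorch#137160); this is illustrative, not the merged code:

# Illustrative only, not the merged code: keep the cuBLAS setting, hold the
# strict-determinism switch behind a TODO until the fp8/compile issue is fixed.
import os

# TODO: re-enable once pytorch/pytorch#137160 is resolved
# torch.use_deterministic_algorithms(True)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # deterministic cuBLAS workspace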


lessw2020 (Contributor):

Re: "also tbh I don't know what exactly torch.use_deterministic_algorithms does. Does it cover everything else we are doing here?"

use_deterministic_algorithms only covers a subset of operations; my understanding is we will still need all the other settings. In summary, it affects convolutions generically (1D, 2D, etc.) and these specific operations:

torch.nn.ReplicationPad2d
torch.bmm()
torch.Tensor.__getitem__() when attempting to differentiate a CPU tensor and the index is a list of tensors
torch.Tensor.index_put() with accumulate=False
torch.Tensor.index_put() with accumulate=True when called on a CPU tensor
torch.Tensor.put_() with accumulate=True when called on a CPU tensor
torch.Tensor.scatter_add_() when called on a CUDA tensor
torch.gather() when called on a CUDA tensor that requires grad
torch.index_add() when called on a CUDA tensor
torch.index_select() when attempting to differentiate a CUDA tensor
torch.repeat_interleave() when attempting to differentiate a CUDA tensor
torch.Tensor.index_copy() when called on a CPU or CUDA tensor
torch.Tensor.scatter() when src type is Tensor and called on a CUDA tensor
torch.Tensor.scatter_reduce() when reduce='sum' or reduce='mean' and called on a CUDA tensor

It will error out on some other ops that don't have a deterministic implementation, but I think the main takeaway is that this is additive rather than a replacement for the other settings.
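To make the "additive" point concrete, here is a small self-contained check (a sketch only, not torchtitan's actual test plan) that enables both settings and confirms gradients are bitwise identical across two identically seeded runs:

# Minimal repro sketch (not the torchtitan test plan): enable both settings,
# run the same seeded forward/backward twice, and compare gradients bitwise.
import os

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # must be set before cuBLAS is used

import torch

torch.use_deterministic_algorithms(True)


def one_run() -> torch.Tensor:
    torch.manual_seed(0)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(256, 256).to(device)
    x = torch.randn(32, 256, device=device)
    model(x).sum().backward()
    return model.weight.grad.detach().clone()


print(torch.equal(one_run(), one_run()))  # expect: True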

Labels: CLA Signed (managed by the Meta Open Source bot)

Successfully merging this pull request may close these issues:
reproducible numerics for loss, weights and gradients for single node (8 GPUs)

4 participants