
reproducible numerics for loss, weights and gradients for single node (8 GPUs) #593

Open
weifengpy opened this issue Oct 1, 2024 · 2 comments · May be fixed by #597
Labels
enhancement New feature or request

Comments

@weifengpy (Contributor) commented Oct 1, 2024

By default, torchtitan uses FSDP2 mixed precision (param_dtype=bfloat16, reduce_dtype=float32).
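For reference, a minimal sketch of that default policy with the FSDP2 APIs (not torchtitan's actual code; assumes a PyTorch version where fully_shard lives under torch.distributed._composable.fsdp and that torch.distributed is already initialized, e.g., via torchrun):

```python
import torch
import torch.nn as nn
# Import path may differ across PyTorch versions (assumption for this sketch).
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy

model = nn.Linear(1024, 1024)  # stand-in for a transformer block
mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,  # params/activations cast to bf16 for compute
    reduce_dtype=torch.float32,  # gradient reduce-scatter/all-reduce in fp32
)
fully_shard(model, mp_policy=mp_policy)  # requires an initialized process group
```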

For low-precision dtypes (float8 and int8), it's natural to compare the loss curve against bfloat16 and see how well they match (it's also a good idea to compare weight norms and gradient norms).

For bfloat16 itself, multiple runs will yield different loss curves, and that non-determinism should be understood and documented (e.g., NCCL gradient reduction, attention, seeding). Otherwise it's hard to tell whether numeric differences are coming from the low-precision dtypes.
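As a starting point, here is a sketch of the standard single-process PyTorch determinism knobs; note these do not by themselves address cross-rank effects like NCCL reduction order, which is one of the suspects above:

```python
import os
import torch

torch.manual_seed(0)                      # seed CPU and all CUDA RNGs
torch.use_deterministic_algorithms(True)  # error out on known-nondeterministic ops
torch.backends.cudnn.benchmark = False    # disable nondeterministic autotuning
# Some deterministic cuBLAS paths additionally require a fixed workspace config:
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```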

I plotted gradient norms (loss = sum(p.grad) over model.parameters()) using llama3-8b on 8 GPUs, with deterministic model init and a deterministic data loader.
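A sketch of that diagnostic (the helper name is mine): after backward(), collapse every parameter gradient into one scalar so two runs can be compared exactly:

```python
import torch

def grad_scalar(model: torch.nn.Module) -> torch.Tensor:
    # Sum all gradient elements in float64; any run-to-run numeric
    # divergence shows up directly in this one value.
    return sum(
        p.grad.double().sum()
        for p in model.parameters()
        if p.grad is not None
    )
```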

For bfloat16, gradients are quite different across repeated runs:
[Screenshot: gradient norms diverging across repeated bfloat16 runs]

Turning off gradient norm clipping helps a lot, but does not explain all of the divergence:
[Screenshot: gradient norms with clipping disabled, still partially diverging]
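One plausible reason clipping amplifies divergence (my reading, not confirmed in the thread): clip_grad_norm_ rescales every gradient by the same factor derived from a single global norm, so a tiny difference anywhere perturbs all parameters on the next optimizer step. A sketch:

```python
import torch

model = torch.nn.Linear(8, 8)
model(torch.randn(4, 8)).sum().backward()
# All gradients are divided by the same global-norm-derived factor; a small
# run-to-run difference in total_norm therefore rescales every gradient.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```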

Filing the issue here; hopefully it can be a good candidate for what's next.

weifengpy changed the title from "reproducible numerics for loss, weights and gradients" to "reproducible numerics for loss, weights and gradients for single node (8 GPUs)" on Oct 1, 2024
@awgu (Contributor) commented Oct 1, 2024

IIUC, the default SDPA backend for us is flash, and flash backward is non-deterministic?

I think we can try enabling deterministic SDPA: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
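A sketch of one way to test this: force SDPA onto the math backend (deterministic but slower) to rule out flash's nondeterministic backward, assuming a PyTorch version that provides torch.nn.attention.sdpa_kernel:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q = k = v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
with sdpa_kernel(SDPBackend.MATH):  # bypass flash's nondeterministic backward
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```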

@weifengpy (Contributor, Author) replied:

> IIUC, the default SDPA backend for us is flash, and flash backward is non-deterministic?
>
> I think we can try enabling deterministic SDPA: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html

good call out!

weifengpy linked a pull request (#597) on Oct 2, 2024 that will close this issue
yf225 added the enhancement (New feature or request) label on Oct 4, 2024