By default, torchtitan uses FSDP2 mixed precision (`param_dtype=bfloat16`, `reduce_dtype=float32`).
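For reference, a minimal sketch of that kind of mixed-precision setup with the FSDP2 `fully_shard` API (the toy module is illustrative and a process group is assumed to already be initialized; torchtitan wires this up in its own parallelization code):

```python
import torch
from torch.distributed._composable.fsdp import MixedPrecisionPolicy, fully_shard

# Assumes torch.distributed is already initialized (e.g. launched via torchrun).
model = torch.nn.Linear(1024, 1024)  # toy stand-in for a transformer block

# Parameters/compute run in bfloat16; gradient reduce-scatter happens in float32.
mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
fully_shard(model, mp_policy=mp_policy)
```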
For low-precision dtypes (float8 and int8), it's natural to compare the loss curve against bfloat16 and see how well they match (it's also a good idea to compare weight norms and gradient norms).
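As a rough illustration, the per-step scalars one might log for such a comparison could look like the helper below (the function name is hypothetical, and with FSDP2 the sharded DTensor parameters would need their full-tensor norms rather than this single-device version):

```python
import torch

def weight_and_grad_norms(model: torch.nn.Module) -> tuple[float, float]:
    """Global L2 norms of weights and gradients, logged per step for curve comparison."""
    weight_norm = torch.linalg.vector_norm(
        torch.stack([p.detach().norm() for p in model.parameters()])
    ).item()
    grad_norm = torch.linalg.vector_norm(
        torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])
    ).item()
    return weight_norm, grad_norm
```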
For bfloat16 itself, multiple runs yield different loss curves, and that nondeterminism should be understood and documented (e.g. NCCL gradient reduction, attention backend selection, seeding). Otherwise it's hard to tell whether numeric differences are coming from the low-precision dtypes.
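A sketch of the single-process knobs involved (generic PyTorch determinism setup, not torchtitan's actual configuration, and it does not address NCCL reduction order across ranks):

```python
import os
import random

import numpy as np
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

def make_deterministic(seed: int = 0) -> None:
    # Seed every RNG that can affect init, dropout, and data order.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Required by cuBLAS when deterministic algorithms are enforced.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # Raises if an op without a deterministic implementation is used.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False

# Pinning one SDPA backend removes run-to-run variation from backend selection;
# the math backend is deterministic, while flash/mem-efficient backwards typically are not.
with sdpa_kernel(SDPBackend.MATH):
    pass  # forward/backward would run here
```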
I plotted gradient norms, `loss = sum(model.parameters.grad)` (pseudocode for a scalar summary over all gradients), using Llama3-8B on 8 GPUs with deterministic model init and a deterministic data loader.
For bfloat16, gradients are quite different across repeated runs.
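("Deterministic data loader" above can be realized, for example, with a seeded shuffle generator; the toy dataset below is illustrative and not torchtitan's actual data pipeline.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def build_deterministic_loader(seed: int = 0) -> DataLoader:
    dataset = TensorDataset(torch.arange(1024).unsqueeze(1))  # toy stand-in
    generator = torch.Generator().manual_seed(seed)  # fixes the shuffle order across runs
    return DataLoader(dataset, batch_size=8, shuffle=True, generator=generator)
```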
Turning off gradient norm clipping helps a lot, but it does not explain all of the divergence.
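One plausible reason clipping widens the gap between runs, sketched below assuming standard global-norm clipping with `torch.nn.utils.clip_grad_norm_` (whether torchtitan uses exactly this utility is an assumption here):

```python
import torch

def clip_and_report(model: torch.nn.Module, max_norm: float = 1.0) -> float:
    # The clip coefficient max_norm / total_norm depends on the measured total norm,
    # so tiny cross-run differences in gradients rescale *every* gradient slightly
    # differently, which can compound across steps and amplify divergence.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    return float(total_norm)
```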
Filing the issue here in the hope that it can be a good candidate for what's next.
weifengpy changed the title from "reproducable numerics for loss, weights and gradients" to "reproducable numerics for loss, weights and gradients for single node (8 GPUs)" on Oct 1, 2024.