ALLGATHER_BASE timeout error #1165

Closed

aknvictor opened this issue Jul 11, 2024 · 10 comments

@aknvictor

I keep getting this error when I run with

CUDA_LAUNCH_BLOCKING=1; tune run --nproc_per_node 4 lora_finetune_distributed --config scripts/2B_lora.yaml

Any thoughts on what I might be doing wrong? I'm running the latest version (from GitHub).

1|16|Loss: 2.572175979614258: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:25<00:00,  1.48s/it]
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15037, OpType=_ALLGATHER_BASE, NumelIn=131072512, NumelOut=524290048, Timeout(ms)=600000) ran for 600055 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15037, OpType=_ALLGATHER_BASE, NumelIn=131072512, NumelOut=524290048, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15037, OpType=_ALLGATHER_BASE, NumelIn=131072512, NumelOut=524290048, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 15037, last enqueued NCCL work: 15042, last completed NCCL work: 15036.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15037, OpType=_ALLGATHER_BASE, NumelIn=131072512, NumelOut=524290048, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f71e1c81897 in /miniconda/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f71e2f5ac62 in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f71e2f5fa80 in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f71e2f60dcc in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f722ea18bf4 in /miniconda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x81cf (0x7f72304551cf in /lib64/libpthread.so.0)
frame #6: clone + 0x43 (0x7f722f937dd3 in /lib64/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 15037, last enqueued NCCL work: 15042, last completed NCCL work: 15036.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15037, OpType=_ALLGATHER_BASE, NumelIn=131072512, NumelOut=524290048, Timeout(ms)=600000) ran for 600059 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb024756897 in /miniconda/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb025a2fc62 in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fb025a34a80 in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb025a35dcc in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7fb0714edbf4 in /miniconda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x81cf (0x7fb072f2a1cf in /lib64/libpthread.so.0)
frame #6: clone + 0x43 (0x7fb07240cdd3 in /lib64/libc.so.6)

[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 15037, last enqueued NCCL work: 15042, last completed NCCL work: 15036.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15037, OpType=_ALLGATHER_BASE, NumelIn=131072512, NumelOut=524290048, Timeout(ms)=600000) ran for 600055 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f654644a897 in /miniconda/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f6547723c62 in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f6547728a80 in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6547729dcc in /miniconda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f65931e1bf4 in /miniconda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x81cf (0x7f6594c1e1cf in /lib64/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6594100dd3 in /lib64/libc.so.6)
@ebsmothers
Contributor

Hi @aknvictor, thanks for creating the issue. With NCCL timeouts it can be hard to pinpoint the cause. It looks like this is occurring at the end of an epoch? Also, I'm curious what's in the scripts/2B_lora.yaml config -- is it just a copy of gemma/2B_lora.yaml, or have you made other customizations?

@aknvictor
Author

Yes, the error occurs at the end of an epoch. And yes, 2B_lora.yaml is a copy of the original gemma/2B_lora.yaml config, with only the file paths and batch_size modified.

@ebsmothers
Contributor

@aknvictor I'm not sure what type of GPU you're on, but if possible, can you try running on a single device instead? Something like tune run lora_finetune_single_device --config scripts/2B_lora_single_device.yaml, where scripts/2B_lora_single_device.yaml is an analogous copy of torchtune's corresponding single-device config, gemma/2B_lora_single_device.yaml.

I tried to repro on my end and I see the same error as in #1122, so I'm wondering if that's the underlying cause here and the distributed run is masking the real source of the error.

@aknvictor
Author

Yes, I did run it on a single device (A100). It works fine after I skipped the erroneous key in the checkpoint save (as a temporary hack):

if key == 'lm_head.weight':
    continue

Admittedly, the bug/issue may be broader than that (especially when the run is distributed).
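
For reference, a minimal sketch of where such a skip could live, assuming a hypothetical filtering pass over the state dict before saving (the function and variable names here are illustrative, not torchtune's actual checkpointing code):

import torch

def save_filtered_checkpoint(state_dict, path):
    # Illustrative only: drop the problematic key before writing the checkpoint,
    # mirroring the temporary hack described above.
    filtered = {}
    for key, value in state_dict.items():
        if key == 'lm_head.weight':
            continue
        filtered[key] = value
    torch.save(filtered, path)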

@ebsmothers
Contributor

It works fine after I skipped the erroneous key in the checkpoint save (as a temporary hack).

@aknvictor just to clarify, does skipping the key in the distributed case resolve the original timeout error? Or do you still see it even after removing that line?

@aknvictor
Author

The timeout error is still there when the run is distributed.

@pbontrager
Contributor

We fixed the Gemma checkpoint issue (Issue #1190). Could you try running your script again without the code below?

if key == 'lm_head.weight':
    continue

@aknvictor
Author

Yes, I'm still getting the error.

@ebsmothers
Contributor

Hi @aknvictor, sorry for the delay here. If you're still seeing the timeout error on distributed runs after pulling the latest main, would you be able to (a) provide more details of your environment (pip list, what hardware you're running on) and (b) help pinpoint where exactly the hang is occurring? I'm assuming it's somewhere in checkpoint save, maybe when trying to gather parameters from different GPUs, but I'm not sure. (One hacky way to narrow down (b) is to add a call to torch.distributed.barrier() and then raise an error immediately afterwards; you can then bisect where in the code the hang is occurring based on whether or not you hit that error.)
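
A minimal sketch of that bisection probe, assuming you drop it at a candidate point inside the distributed recipe (the exact placement and the message string are illustrative):

import torch.distributed as dist

# Hypothetical probe: if every rank reaches this point, the barrier completes and the
# RuntimeError fires on all ranks; if the run hangs here instead, the stall is earlier.
dist.barrier()
raise RuntimeError("All ranks reached this probe; the hang is further down.")

Moving the probe earlier or later and re-running narrows down where the ranks diverge.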

@aknvictor
Author

The issue has been resolved in the latest main. Thanks!
