Pipeline parallelism example with Pippy fails #3151

Open
1 of 4 tasks
goelayu opened this issue Oct 9, 2024 · 2 comments
goelayu commented Oct 9, 2024

System Info

- `Accelerate` version: 0.35.0.dev0
- Platform: Linux-5.15.0-121-generic-x86_64-with-glibc2.35
- `accelerate` bash location: redacted
- Python version: 3.10.14
- Numpy version: 1.23.5
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1007.59 GB
- GPU type: NVIDIA H100 PCIe

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

  1. Run the llama.py example in the distributed inference folder (examples/inference/pippy/llama.py).
  2. The run fails with the following error:
torch._dynamo.exc.UserError: Dynamic control flow is not supported at the moment. Please use functorch.experimental.control_flow.cond to explicitly capture the control flow. 
For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#cond-operands
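
For reference, the message points at `functorch.experimental.control_flow.cond`, which is how `torch.export` expects data-dependent branches to be written. Below is a minimal, self-contained sketch of that pattern (not taken from the failing llama.py, and assuming PyTorch 2.4); it is context for the error rather than a drop-in fix, since here the branch lives inside the model/autocast code rather than in user code:

```python
import torch
from functorch.experimental.control_flow import cond


class Gate(torch.nn.Module):
    def forward(self, x):
        def true_fn(x):
            return x.cos()

        def false_fn(x):
            return x.sin()

        # Both branches are handed to cond() explicitly, so torch.export can
        # capture a graph for each side instead of raising the UserError above.
        return cond(x.sum() > 0, true_fn, false_fn, [x])


# Exports cleanly because the data-dependent branch is captured by cond().
ep = torch.export.export(Gate(), (torch.randn(4),))
```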

Expected behavior

I have tried different `accelerate launch` flags, such as `--dynamo_use_dynamic`; however, I am not sure how to fix the above error.

@muellerzr muellerzr self-assigned this Oct 10, 2024
muellerzr (Collaborator) commented

@goelayu can you try upgrading your Python version? IIRC that can play a role (3.12 ideally).

goelayu (Author) commented Oct 14, 2024

@muellerzr Still the same error after upgrading to Python 3.12.

- `Accelerate` version: 0.35.0.dev0
- Platform: Linux-5.15.0-121-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/goelayus/miniforge3/envs/myenv/bin/accelerate
- Python version: 3.12.7
- Numpy version: 2.1.2
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1007.59 GB
- GPU type: NVIDIA H100 PCIe
- `Accelerate` default config:
  Not found

Here's the full stack trace, in case that helps. It looks like some kind of version mismatch.

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1006, in _trace_with_export
[rank0]:     ep = torch.export.export(
[rank0]:          ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/__init__.py", line 174, in export
[rank0]:     return _export(
[rank0]:            ^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 945, in wrapper
[rank0]:     raise e
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 928, in wrapper
[rank0]:     ep = fn(*args, **kwargs)
[rank0]:          ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/exported_program.py", line 89, in wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 1533, in _export
[rank0]:     exported_program = ExportedProgram(
[rank0]:                        ^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/exported_program.py", line 246, in __init__
[rank0]:     self.verifier().check(self)
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 154, in check
[rank0]:     self._check_graph_module(ep.graph_module)
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 220, in _check_graph_module
[rank0]:     _check_val(node)
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 62, in _check_val
[rank0]:     raise SpecViolationError(f"Node.meta {node.name} is missing val field.")
[rank0]: torch._export.verifier.SpecViolationError: Node.meta _enter_autocast is missing val field.

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/goelayus/Research/inference/LLMInfer/gpu-gpu/accelerate/examples/inference/pippy/llama.py", line 38, in <module>
[rank0]:     model = prepare_pippy(model, split_points="auto", example_kwargs=inputs)
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/inference.py", line 170, in prepare_pippy
[rank0]:     stage = build_pipeline(model, split_points, example_args, example_kwargs, num_chunks)
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/inference.py", line 87, in build_pipeline
[rank0]:     pipe = pipeline(
[rank0]:            ^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1229, in pipeline
[rank0]:     return Pipe.from_tracing(
[rank0]:            ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1044, in from_tracing
[rank0]:     exported_program = Pipe._trace_with_export(
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1012, in _trace_with_export
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: It seems that we cannot capture your model as a full graph. Typical reasons include graph breaks, data/shape-dependent control flow, or missing meta kernels for custom operators. You can use our manual pipeline interfaces, or try to fix the graph breaks, see https://pytorch.org/docs/stable/export.html
[rank0]:[W1014 14:23:47.244784990 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1006, in _trace_with_export
[rank1]:     ep = torch.export.export(
[rank1]:          ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/__init__.py", line 174, in export
[rank1]:     return _export(
[rank1]:            ^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 945, in wrapper
[rank1]:     raise e
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 928, in wrapper
[rank1]:     ep = fn(*args, **kwargs)
[rank1]:          ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/exported_program.py", line 89, in wrapper
[rank1]:     return fn(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 1533, in _export
[rank1]:     exported_program = ExportedProgram(
[rank1]:                        ^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/exported_program.py", line 246, in __init__
[rank1]:     self.verifier().check(self)
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 154, in check
[rank1]:     self._check_graph_module(ep.graph_module)
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 220, in _check_graph_module
[rank1]:     _check_val(node)
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 62, in _check_val
[rank1]:     raise SpecViolationError(f"Node.meta {node.name} is missing val field.")
[rank1]: torch._export.verifier.SpecViolationError: Node.meta _enter_autocast is missing val field.

[rank1]: The above exception was the direct cause of the following exception:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/goelayus/Research/inference/LLMInfer/gpu-gpu/accelerate/examples/inference/pippy/llama.py", line 38, in <module>
[rank1]:     model = prepare_pippy(model, split_points="auto", example_kwargs=inputs)
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/inference.py", line 170, in prepare_pippy
[rank1]:     stage = build_pipeline(model, split_points, example_args, example_kwargs, num_chunks)
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/inference.py", line 87, in build_pipeline
[rank1]:     pipe = pipeline(
[rank1]:            ^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1229, in pipeline
[rank1]:     return Pipe.from_tracing(
[rank1]:            ^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1044, in from_tracing
[rank1]:     exported_program = Pipe._trace_with_export(
[rank1]:                        ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1012, in _trace_with_export
[rank1]:     raise RuntimeError(
[rank1]: RuntimeError: It seems that we cannot capture your model as a full graph. Typical reasons include graph breaks, data/shape-dependent control flow, or missing meta kernels for custom operators. You can use our manual pipeline interfaces, or try to fix the graph breaks, see https://pytorch.org/docs/stable/export.html
W1014 14:23:48.161000 140104898389824 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3229053 closing signal SIGTERM
E1014 14:23:48.777000 140104898389824 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3229052) of binary: /home/goelayus/miniforge3/envs/myenv/bin/python3.12
Traceback (most recent call last):
  File "/home/goelayus/miniforge3/envs/myenv/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
llama.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-14_14:23:48
  host      : syrax-41
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3229052)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
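
For completeness, the call that fails (line 38 of llama.py in the traceback above) reduces to something like the sketch below; the checkpoint name and prompt are placeholders, not necessarily the exact values used by the example script:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate.inference import prepare_pippy  # module path taken from the traceback

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt")  # placeholder prompt

# prepare_pippy traces the model with torch.export in order to split it into
# pipeline stages; this is the call that raises the RuntimeError in the log above.
model = prepare_pippy(model, split_points="auto", example_kwargs=inputs)
```

The script is started with `accelerate launch` across two GPUs, which is where the [rank0]/[rank1] prefixes in the log come from.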
