Pipeline parallelism example with Pippy fails #3151

Open
1 of 4 tasks
goelayu opened this issue Oct 9, 2024 · 2 comments
goelayu commented Oct 9, 2024

System Info

- `Accelerate` version: 0.35.0.dev0
- Platform: Linux-5.15.0-121-generic-x86_64-with-glibc2.35
- `accelerate` bash location: redacted
- Python version: 3.10.14
- Numpy version: 1.23.5
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1007.59 GB
- GPU type: NVIDIA H100 PCIe

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

  1. Run the llama.py example in the distributed inference folder (examples/inference/pippy/llama.py).
  2. The run fails with the following error:
torch._dynamo.exc.UserError: Dynamic control flow is not supported at the moment. Please use functorch.experimental.control_flow.cond to explicitly capture the control flow. 
For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#cond-operands
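
For reference, the message points at `functorch.experimental.control_flow.cond`, which is how `torch.export` expects data-dependent branches to be written. Below is a minimal, self-contained sketch of that pattern (not taken from the failing llama.py, and assuming PyTorch 2.4); it is context for the error rather than a drop-in fix, since here the branch lives inside the model/autocast code rather than in user code:

```python
import torch
from functorch.experimental.control_flow import cond


class Gate(torch.nn.Module):
    def forward(self, x):
        def true_fn(x):
            return x.cos()

        def false_fn(x):
            return x.sin()

        # Both branches are handed to cond() explicitly, so torch.export can
        # capture a graph for each side instead of raising the UserError above.
        return cond(x.sum() > 0, true_fn, false_fn, [x])


# Exports cleanly because the data-dependent branch is captured by cond().
ep = torch.export.export(Gate(), (torch.randn(4),))
```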

Expected behavior

I have tried different `accelerate launch` flags, such as `--dynamo_use_dynamic`; however, I am not sure how to fix the above error.

@muellerzr muellerzr self-assigned this Oct 10, 2024
muellerzr (Collaborator) commented

@goelayu can you try upgrading your Python version? IIRC that can play a role (3.12 ideally).

goelayu (Author) commented Oct 14, 2024

@muellerzr Still the same error after upgrading to Python 3.12.

- `Accelerate` version: 0.35.0.dev0
- Platform: Linux-5.15.0-121-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/goelayus/miniforge3/envs/myenv/bin/accelerate
- Python version: 3.12.7
- Numpy version: 2.1.2
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1007.59 GB
- GPU type: NVIDIA H100 PCIe
- `Accelerate` default config:
  Not found

Here's the full stack trace, in case that helps. It looks like some kind of version mismatch.

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1006, in _trace_with_export
[rank0]:     ep = torch.export.export(
[rank0]:          ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/__init__.py", line 174, in export
[rank0]:     return _export(
[rank0]:            ^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 945, in wrapper
[rank0]:     raise e
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 928, in wrapper
[rank0]:     ep = fn(*args, **kwargs)
[rank0]:          ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/exported_program.py", line 89, in wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 1533, in _export
[rank0]:     exported_program = ExportedProgram(
[rank0]:                        ^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/exported_program.py", line 246, in __init__
[rank0]:     self.verifier().check(self)
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 154, in check
[rank0]:     self._check_graph_module(ep.graph_module)
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 220, in _check_graph_module
[rank0]:     _check_val(node)
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 62, in _check_val
[rank0]:     raise SpecViolationError(f"Node.meta {node.name} is missing val field.")
[rank0]: torch._export.verifier.SpecViolationError: Node.meta _enter_autocast is missing val field.

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/goelayus/Research/inference/LLMInfer/gpu-gpu/accelerate/examples/inference/pippy/llama.py", line 38, in <module>
[rank0]:     model = prepare_pippy(model, split_points="auto", example_kwargs=inputs)
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/inference.py", line 170, in prepare_pippy
[rank0]:     stage = build_pipeline(model, split_points, example_args, example_kwargs, num_chunks)
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/inference.py", line 87, in build_pipeline
[rank0]:     pipe = pipeline(
[rank0]:            ^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1229, in pipeline
[rank0]:     return Pipe.from_tracing(
[rank0]:            ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1044, in from_tracing
[rank0]:     exported_program = Pipe._trace_with_export(
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1012, in _trace_with_export
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: It seems that we cannot capture your model as a full graph. Typical reasons include graph breaks, data/shape-dependent control flow, or missing meta kernels for custom operators. You can use our manual pipeline interfaces, or try to fix the graph breaks, see https://pytorch.org/docs/stable/export.html
[rank0]:[W1014 14:23:47.244784990 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1006, in _trace_with_export
[rank1]:     ep = torch.export.export(
[rank1]:          ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/__init__.py", line 174, in export
[rank1]:     return _export(
[rank1]:            ^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 945, in wrapper
[rank1]:     raise e
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 928, in wrapper
[rank1]:     ep = fn(*args, **kwargs)
[rank1]:          ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/exported_program.py", line 89, in wrapper
[rank1]:     return fn(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/_trace.py", line 1533, in _export
[rank1]:     exported_program = ExportedProgram(
[rank1]:                        ^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/export/exported_program.py", line 246, in __init__
[rank1]:     self.verifier().check(self)
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 154, in check
[rank1]:     self._check_graph_module(ep.graph_module)
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 220, in _check_graph_module
[rank1]:     _check_val(node)
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/_export/verifier.py", line 62, in _check_val
[rank1]:     raise SpecViolationError(f"Node.meta {node.name} is missing val field.")
[rank1]: torch._export.verifier.SpecViolationError: Node.meta _enter_autocast is missing val field.

[rank1]: The above exception was the direct cause of the following exception:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/goelayus/Research/inference/LLMInfer/gpu-gpu/accelerate/examples/inference/pippy/llama.py", line 38, in <module>
[rank1]:     model = prepare_pippy(model, split_points="auto", example_kwargs=inputs)
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/inference.py", line 170, in prepare_pippy
[rank1]:     stage = build_pipeline(model, split_points, example_args, example_kwargs, num_chunks)
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/inference.py", line 87, in build_pipeline
[rank1]:     pipe = pipeline(
[rank1]:            ^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1229, in pipeline
[rank1]:     return Pipe.from_tracing(
[rank1]:            ^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1044, in from_tracing
[rank1]:     exported_program = Pipe._trace_with_export(
[rank1]:                        ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/pipelining/_IR.py", line 1012, in _trace_with_export
[rank1]:     raise RuntimeError(
[rank1]: RuntimeError: It seems that we cannot capture your model as a full graph. Typical reasons include graph breaks, data/shape-dependent control flow, or missing meta kernels for custom operators. You can use our manual pipeline interfaces, or try to fix the graph breaks, see https://pytorch.org/docs/stable/export.html
W1014 14:23:48.161000 140104898389824 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3229053 closing signal SIGTERM
E1014 14:23:48.777000 140104898389824 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3229052) of binary: /home/goelayus/miniforge3/envs/myenv/bin/python3.12
Traceback (most recent call last):
  File "/home/goelayus/miniforge3/envs/myenv/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/goelayus/miniforge3/envs/myenv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
llama.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-14_14:23:48
  host      : syrax-41
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3229052)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
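
For completeness, the call that fails (line 38 of llama.py in the traceback above) reduces to something like the sketch below; the checkpoint name and prompt are placeholders, not necessarily the exact values used by the example script:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate.inference import prepare_pippy  # module path taken from the traceback

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt")  # placeholder prompt

# prepare_pippy traces the model with torch.export in order to split it into
# pipeline stages; this is the call that raises the RuntimeError in the log above.
model = prepare_pippy(model, split_points="auto", example_kwargs=inputs)
```

The script is started with `accelerate launch` across two GPUs, which is where the [rank0]/[rank1] prefixes in the log come from.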
