accelerator.prepare() gets OOM, but works on a single GPU #3182

Open · 2 of 4 tasks
lqf0624 opened this issue Oct 21, 2024 · 1 comment

@lqf0624 commented Oct 21, 2024

System Info

- `Accelerate` version: 1.0.1
- Platform: Linux-5.4.0-169-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /opt/conda/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 2015.00 GB
- GPU type: NVIDIA A800-SXM4-40GB
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

import accelerate
from accelerate import DistributedDataParallelKwargs
from peft import LoraConfig, get_peft_model  # imports not shown in the original snippet
from transformers import GPT2Model

# `ddp_kwargs` is used but never defined in the snippet; a typical definition would be:
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)  # exact settings unknown

accelerator = accelerate.Accelerator(kwargs_handlers=[ddp_kwargs])

# `args` comes from the script's argument parser (not shown)
model = GPT2Model.from_pretrained(args.model_dir, output_hidden_states=True)

if args.pretrain == 1 and args.freeze == 1:
    peft_config = LoraConfig(
        r=128,
        lora_alpha=256,
        lora_dropout=0.1,
    )
model = get_peft_model(model, peft_config)  # note: runs even if the branch above was skipped
model = accelerator.prepare(model)
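
To see what each process is actually given before prepare() runs, a small diagnostic like the one below can be launched the same way as the training script (e.g. via accelerate launch); the file name check_devices.py is just a placeholder. It prints the device Accelerate assigned to each rank and the free memory reported for that device:

# check_devices.py (placeholder name): print each process's assigned device and its free memory
import torch
from accelerate import Accelerator

accelerator = Accelerator()
free, total = torch.cuda.mem_get_info(accelerator.device)
print(
    f"[rank {accelerator.process_index}/{accelerator.num_processes}] "
    f"device={accelerator.device}, free={free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB"
)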

Expected behavior

Here is the traceback:

Traceback (most recent call last):
  File "/workspace/Graph-Network/main.py", line 174, in <module>
    model = accelerator.prepare(model)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1350, in prepare
    result = tuple(
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1351, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1226, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1460, in prepare_model
    model = model.to(self.device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1173, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1159, in convert
    return t.to(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

It's confusing that CUDA raises OOM but, unlike a typical OOM, it did not even report trying to allocate any GPU memory. In fact, my GPUs are empty according to nvidia-smi.
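
For what it's worth, PyTorch's caching-allocator OOM usually reads "CUDA out of memory. Tried to allocate ...", while a bare "CUDA error: out of memory" like the one above tends to come from the CUDA driver itself, e.g. when a context cannot be created on the device. A rough probe over the visible GPUs (sketch below) can show whether a context can be opened on each of them at all:

# rough probe: try to create a CUDA context and a tiny tensor on every visible GPU
import torch

for i in range(torch.cuda.device_count()):
    try:
        torch.zeros(1, device=f"cuda:{i}")
        free, total = torch.cuda.mem_get_info(i)
        print(f"cuda:{i}: context OK, {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
    except RuntimeError as err:
        print(f"cuda:{i}: failed ({err})")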

@BenjaminBossan (Member)

Thanks for reporting. Could you please:

  1. Share the output of accelerate env
  2. Tell us how you run the script
  3. Tell us what PEFT version you're using
  4. What is the model in args.model_dir?
  5. If you comment out model = get_peft_model(model, peft_config), do you get the same error?
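
For point 5, a stripped-down check along these lines (the model path is a placeholder for args.model_dir, launched the same way as the full script) would show whether the error still occurs without PEFT:

# minimal check without PEFT; "path/to/model_dir" is a placeholder for args.model_dir
import accelerate
from transformers import GPT2Model

accelerator = accelerate.Accelerator()
model = GPT2Model.from_pretrained("path/to/model_dir", output_hidden_states=True)
model = accelerator.prepare(model)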
