[Bug] accelerate ignores TPU #3169

steveepreston opened this issue Oct 14, 2024 · 1 comment
steveepreston commented Oct 14, 2024

System Info

Latest version; tested via both `pip install -U accelerate` and `pip install git+https://github.com/huggingface/accelerate`.

Information

  • My own modified scripts
  • The official example scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

While trying to fine-tune LLMs via torch/transformers on a Kaggle TPU v3-8, I get an error saying that accelerate does not count the TPU as a device:

Error: RuntimeError: There are currently no available devices found, must be one of 'XPU', 'CUDA', or 'NPU'.

To make sure, I also tested the GoogleCloudPlatform example (a torch TPU fine-tune) and got the exact same error.
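For context, a minimal way to check whether PyTorch/XLA can see the TPU at all (a sketch assuming `torch_xla` is installed, as it is on Kaggle TPU VM images):

```python
# Minimal sketch: check that PyTorch/XLA can see the TPU that accelerate rejects.
# Assumes torch_xla is installed (preinstalled on Kaggle TPU VM images).
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # e.g. an "xla:0" device on a TPU v3-8 VM
print(device)
```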

The error is thrown at `trainer = SFTTrainer(...)`. The full traceback is below:


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[48], line 4
      1 from trl import SFTTrainer
      2 from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
----> 4 trainer = SFTTrainer(
      5     model=base_model,
      6     train_dataset=data,
      7     args=TrainingArguments(
      8         per_device_train_batch_size=BATCH_SIZE,  # This is actually the global batch size for SPMD.
      9         num_train_epochs=1,
     10         max_steps=-1,
     11         output_dir="/output_dir",
     12         optim="adafactor",
     13         logging_steps=1,
     14         dataloader_drop_last = True,  # Required for SPMD.
     15         fsdp="full_shard",
     16         fsdp_config=fsdp_config,
     17     ),
     18     peft_config=lora_config,
     19     dataset_text_field="quote",
     20     max_seq_length=max_seq_length,
     21     packing=True,
     22 )

File /usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py:101, in _deprecate_arguments.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
     99         message += "\n\n" + custom_message
    100     warnings.warn(message, FutureWarning)
--> 101 return f(*args, **kwargs)

File /usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:401, in SFTTrainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics, peft_config, dataset_text_field, packing, formatting_func, max_seq_length, infinite, num_of_sequences, chars_per_token, dataset_num_proc, dataset_batch_size, neftune_noise_alpha, model_init_kwargs, dataset_kwargs, eval_packing)
    395 if tokenizer.padding_side is not None and tokenizer.padding_side != "right":
    396     warnings.warn(
    397         "You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to "
    398         "overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code."
    399     )
--> 401 super().__init__(
    402     model=model,
    403     args=args,
    404     data_collator=data_collator,
    405     train_dataset=train_dataset,
    406     eval_dataset=eval_dataset,
    407     tokenizer=tokenizer,
    408     model_init=model_init,
    409     compute_metrics=compute_metrics,
    410     callbacks=callbacks,
    411     optimizers=optimizers,
    412     preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    413 )
    415 # Add tags for models that have been loaded with the correct transformers version
    416 if hasattr(self.model, "add_model_tags"):

File /usr/local/lib/python3.10/site-packages/transformers/trainer.py:411, in Trainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
    408 self.deepspeed = None
    409 self.is_in_train = False
--> 411 self.create_accelerator_and_postprocess()
    413 # memory metrics - must set up as early as possible
    414 self._memory_tracker = TrainerMemoryTracker(self.args.skip_memory_metrics)

File /usr/local/lib/python3.10/site-packages/transformers/trainer.py:4858, in Trainer.create_accelerator_and_postprocess(self)
   4855     args.update(accelerator_config)
   4857 # create accelerator object
-> 4858 self.accelerator = Accelerator(**args)
   4859 # some Trainer classes need to use `gather` instead of `gather_for_metrics`, thus we store a flag
   4860 self.gather_function = self.accelerator.gather_for_metrics

File /usr/local/lib/python3.10/site-packages/accelerate/accelerator.py:349, in Accelerator.__init__(self, device_placement, split_batches, mixed_precision, gradient_accumulation_steps, cpu, dataloader_config, deepspeed_plugin, fsdp_plugin, megatron_lm_plugin, rng_types, log_with, project_dir, project_config, gradient_accumulation_plugin, step_scheduler_with_optimizer, kwargs_handlers, dynamo_backend, deepspeed_plugins)
    345         raise ValueError(f"FSDP requires PyTorch >= {FSDP_PYTORCH_VERSION}")
    347 if fsdp_plugin is None:  # init from env variables
    348     fsdp_plugin = (
--> 349         FullyShardedDataParallelPlugin() if os.environ.get("ACCELERATE_USE_FSDP", "false") == "true" else None
    350     )
    351 else:
    352     if not isinstance(fsdp_plugin, FullyShardedDataParallelPlugin):

File <string>:21, in __init__(self, sharding_strategy, backward_prefetch, mixed_precision_policy, auto_wrap_policy, cpu_offload, ignored_modules, state_dict_type, state_dict_config, optim_state_dict_config, limit_all_gathers, use_orig_params, param_init_fn, sync_module_states, forward_prefetch, activation_checkpointing, cpu_ram_efficient_loading, transformer_cls_names_to_wrap, min_num_params)

File /usr/local/lib/python3.10/site-packages/accelerate/utils/dataclasses.py:1684, in FullyShardedDataParallelPlugin.__post_init__(self)
   1682     device = torch.xpu.current_device()
   1683 else:
-> 1684     raise RuntimeError(
   1685         "There are currently no available devices found, must be one of 'XPU', 'CUDA', or 'NPU'."
   1686     )
   1687 # Create a function that will be used to initialize the parameters of the model
   1688 # when using `sync_module_states`
   1689 self.param_init_fn = lambda x: x.to_empty(device=device, recurse=False)

RuntimeError: There are currently no available devices found, must be one of 'XPU', 'CUDA', or 'NPU'.

I upgraded transformers, peft, and trl to the latest versions, but got the same error.

Expected behavior

accelerate should detect the TPU and not throw this error from accelerate/utils/dataclasses.py.
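The traceback ends in `FullyShardedDataParallelPlugin.__post_init__`, which only probes CUDA, NPU, and XPU before raising. As a hedged sketch (an assumption about where a fix could go, not the actual upstream code), the device probe could grow an XLA branch along these lines:

```python
# Hedged sketch: a device probe like the one in accelerate/utils/dataclasses.py,
# extended with an XLA/TPU branch. The XLA branch and the helper name are
# assumptions about a possible fix, not the real upstream patch.
import torch

def pick_param_init_device():
    if torch.cuda.is_available():
        return torch.device("cuda", torch.cuda.current_device())
    try:
        import torch_xla.core.xla_model as xm  # present on TPU VMs
        return xm.xla_device()  # e.g. "xla:0" on a Kaggle TPU v3-8
    except ImportError:
        raise RuntimeError(
            "There are currently no available devices found, "
            "must be one of 'XPU', 'CUDA', 'NPU', or 'XLA'."
        )
```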

steveepreston (Author) commented:

Update:

The error `RuntimeError: There are currently no available devices found, must be one of 'XPU', 'CUDA', or 'NPU'` is not thrown on transformers==4.38.2, and the Llama-3 fine-tune completed successfully on the TPU VM.

However, Llama-3.1 requires a newer version of transformers, so it's a dead end.
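For anyone hitting the same error with Llama-3 (not 3.1), pinning the older release works as a temporary workaround: `pip install "transformers==4.38.2"`.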
