One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)
Reproduction
Please note that the system info above does not reflect the actual environment accelerate runs in on SageMaker. The above config was generated in an official SageMaker container.
To reproduce the bug:
Create any training script that invokes accelerator.gather()
Configure accelerate to run on a SageMaker multi-GPU machine using accelerate config, and use 209479262201.dkr.ecr.us-west-2.amazonaws.com/1xgpt-from-sagemaker:2.3.0 as your Docker image
Create a training job using accelerate launch and run the training script
Expected behavior
Sagemaker will return an error somewhere along the lines of this:
If accelerate launch is invoked inside of SageMaker instead of being used to create the SageMaker job, the script works fine. I suspect this is because accelerate launch uses MPI, which is not well supported by SageMaker.
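One way to check that suspicion from inside the container is to look at the environment variables the launcher exports. The sketch below is a rough diagnostic, not part of accelerate: detect_launcher is a hypothetical helper name, and it assumes Open MPI (which sets OMPI_COMM_WORLD_RANK for each process) and torchrun-style launchers (which set RANK, LOCAL_RANK, and WORLD_SIZE). Other MPI implementations or launchers would need their own checks.

```python
import os


def detect_launcher(env=None):
    """Guess which launcher started this process from its environment.

    Hypothetical helper: only recognizes Open MPI and torchrun-style launchers.
    """
    env = os.environ if env is None else env

    # Open MPI's mpirun exports OMPI_COMM_WORLD_* for every rank it starts.
    if "OMPI_COMM_WORLD_RANK" in env:
        return "mpi"

    # torchrun-style launchers export these three variables per worker.
    if all(key in env for key in ("RANK", "LOCAL_RANK", "WORLD_SIZE")):
        return "torchrun"

    return "unknown"


if __name__ == "__main__":
    print(detect_launcher())
```

Calling detect_launcher() at the top of the training script and logging the result would show whether the job created by accelerate launch was actually started through MPI.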
Yes, I'd recommend invoking it inside of SageMaker instead in this case. (Though MPI should only be run on CPU, not GPU.)
Sorry if I wasn't clear in my original report. This is more of a complaint about the default behavior of accelerate launch when configured to run on SageMaker. When I followed this guide to configure and run accelerate with SageMaker, it defaulted to MPI, which doesn't work with distributed training on SageMaker. accelerate launch should default to NCCL when configured to run distributed training on SageMaker.