Introduction to Distributed Training with DeepSpeed

Introduction

DeepSpeed is an open-source library, built on top of PyTorch, that facilitates the training of large deep learning models. With minimal code changes, a developer can train a model on a single-GPU machine, on a single machine with multiple GPUs, or on multiple machines in a distributed fashion.

One of its main advantages is that it enables training massive models. When the library was first released, it was able to train models of 200B parameters; by the end of 2021, the team was able to train Megatron-Turing NLG 530B, at the time the largest generative language model ever trained. The team is also working on support for models of 1 trillion parameters.

The other important feature is its speed. According to the team's experiments, DeepSpeed trains 2–7x faster than other solutions by reducing the communication volume during distributed training.

Last, but not least, the library requires only minimal code changes. In contrast to other distributed training libraries, DeepSpeed does not require a code redesign or model refactoring.

Installation

The installation is very simple; for a basic test of the library we can install DeepSpeed, PyTorch, and Transformers.

conda create -n deepspeed python=3.7 -y
conda activate deepspeed
conda install pytorch torchvision cudatoolkit=11.3 -c pytorch
pip install deepspeed transformers datasets fire loguru sh pytz
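
Once installed, DeepSpeed ships a small diagnostic utility that reports the detected PyTorch/CUDA setup and which DeepSpeed ops can be built on the machine (the exact output varies by environment):

ds_report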

The versions installed come from the requirements.txt file and can be checked from Python:

print(f"Numpy version: {np.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"DeepSpeed version: {deepspeed.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"Datasets version: {datasets.__version__}")
Numpy version: 1.21.2
PyTorch version: 1.10.2
DeepSpeed version: 0.5.10
Transformers version: 4.16.0
Datasets version: 1.18.1

Implementation of DeepSpeed in a PyTorch model

One of the first tutorials that can be found in the repository explains how to create and train a Transformer encoder on the Masked Language Modeling (MLM) task. It also shows the code changes that need to be made to transform a PyTorch solution into a DeepSpeed one.

In the file train_bert.py we can see how to train a Transformer encoder on the MLM task using standard PyTorch, and in the file train_bert_ds.py we can see how to train the same model using DeepSpeed.
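
As a point of reference, the plain-PyTorch script follows the usual training pattern sketched below. This is only an orientation sketch, not the exact code from train_bert.py; helper names such as create_model and create_data_iterator are hypothetical. The next three sections show the places where DeepSpeed changes it.

import torch

def train(exp_dir, num_iterations=1000, batch_size=8, checkpoint_every=100):
    # exp_dir is a pathlib.Path where checkpoints are written
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = create_model()                            # hypothetical: builds the Transformer encoder
    data_iterator = create_data_iterator(batch_size)  # hypothetical: yields MLM batches on `device`

    # (1) Initialization
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # (2) Training loop
    for step, batch in enumerate(data_iterator):
        optimizer.zero_grad()
        loss = model(**batch)        # forward pass returns the MLM loss
        loss.backward()
        optimizer.step()

        # (3) Model checkpointing
        if step % checkpoint_every == 0:
            state_dict = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}
            torch.save(obj=state_dict, f=str(exp_dir / f"checkpoint.iter_{step}.pt"))
        if step >= num_iterations:
            break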

Initialization

Replace the original PyTorch code:

model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

with the DeepSpeed alternative:

ds_config = {
  "train_micro_batch_size_per_gpu": batch_size,
  "optimizer": {
      "type": "Adam",
      "params": {
          "lr": 1e-4
      }
  },
  # train in mixed precision (FP16)
  "fp16": {
      "enabled": True
  },
  # ZeRO stage 1: partition the optimizer states across data-parallel GPUs,
  # and additionally offload them to CPU memory
  "zero_optimization": {
      "stage": 1,
      "offload_optimizer": {
          "device": "cpu"
      }
  }
}
model, _, _, _ = deepspeed.initialize(model=model, model_parameters=model.parameters(), config=ds_config)
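
deepspeed.initialize returns a tuple of (engine, optimizer, dataloader, lr_scheduler); here only the engine is kept, and it replaces the original model object in the rest of the script. As a variation, the same configuration can be kept in a JSON file and passed by path instead of as a Python dictionary (a minimal sketch, assuming the dictionary above has been saved to a hypothetical ds_config.json):

# Equivalent initialization that reads the configuration from a JSON file.
# ds_config.json is assumed to contain the dictionary shown above; in recent
# DeepSpeed versions the config argument accepts either a dict or a file path.
model, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)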

Training

Replace the original PyTorch code:

for step, batch in enumerate(data_iterator, start=start_step):
    loss.backward()
    optimizer.step()

with the DeepSpeed alternative:

for step, batch in enumerate(data_iterator, start=start_step):
    model.backward(loss)
    model.step()
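
Both snippets omit the forward pass for brevity. Put back in context, a full DeepSpeed training step looks roughly like the sketch below (assuming the batch has already been moved to the right device and that the forward pass returns the MLM loss):

for step, batch in enumerate(data_iterator, start=start_step):
    # Forward pass through the DeepSpeed engine, which wraps the model.
    loss = model(**batch)

    # The engine replaces loss.backward() and optimizer.step(): it applies FP16
    # loss scaling, averages gradients across data-parallel workers, and zeroes
    # the gradients itself, so no optimizer.zero_grad() call is needed.
    model.backward(loss)
    model.step()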

Model Checkpointing

Replace the original PyTorch code:

if step % checkpoint_every == 0:
    state_dict = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    torch.save(obj=state_dict, f=str(exp_dir / f"checkpoint.iter_{step}.pt"))

with the DeepSpeed alternative:

if step % checkpoint_every == 0:
    model.save_checkpoint(save_dir=exp_dir, client_state={'checkpoint_step': step})
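
To resume training, the checkpoint can be loaded back through the same engine. A minimal sketch (assuming the same exp_dir and the client_state key used above):

# load_checkpoint restores the model, optimizer and FP16/ZeRO state and returns
# the path of the loaded checkpoint plus the client_state dict that was saved.
load_path, client_state = model.load_checkpoint(load_dir=exp_dir)
if load_path is not None:
    start_step = client_state['checkpoint_step'] + 1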

Execution

To train the standard PyTorch model, and assuming we are on a machine with at least one GPU, we can run the following command:

python train_bert.py --checkpoint_dir experiments --num_iterations 1000 --local_rank 0 --log_every 500

To run the same model with DeepSpeed, we can use the command below. By default, the deepspeed launcher uses all the GPUs available on the machine.

deepspeed train_bert_ds.py --checkpoint_dir experiments --num_iterations 1000 --log_every 500
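
If we want to restrict training to a subset of GPUs, the deepspeed launcher accepts flags for that; for example (a sketch of the same run limited to two local GPUs):

# Run the same training on only 2 GPUs of the local machine
deepspeed --num_gpus=2 train_bert_ds.py --checkpoint_dir experiments --num_iterations 1000 --log_every 500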

References