-
As mentioned in Interpretation 7, LoRA should be faster, but experiment 5 took 0.69 min and experiment 10 took 0.75 min.

I also reproduced it on an L40S GPU as below:

```
(base) ubuntu@l40s-instance:/opt/repository/LLMs-from-scratch/ch06/02_bonus_additional-experiments$ python additional-experiments.py --trainable_layers all
/home/ubuntu/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
2024-08-15 11:55:09.175906: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-15 11:55:09.210939: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-15 11:55:09.222097: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-15 11:55:09.253434: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-15 11:55:10.475634: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
checkpoint: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 77.0/77.0 [00:00<00:00, 44.3kiB/s]
encoder.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:01<00:00, 608kiB/s]
hparams.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 90.0/90.0 [00:00<00:00, 53.8kiB/s]
model.ckpt.data-00000-of-00001: 100%|█████████████████████████████████████████████████████████████████████████████████████| 498M/498M [00:59<00:00, 8.39MiB/s]
model.ckpt.index: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5.21k/5.21k [00:00<00:00, 3.05MiB/s]
model.ckpt.meta: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 471k/471k [00:01<00:00, 339kiB/s]
vocab.bpe: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:01<00:00, 308kiB/s]
File downloaded and saved as sms_spam_collection/SMSSpamCollection.tsv
Ep 1 (Step 000000): Train loss 2.230, Val loss 2.499
Ep 1 (Step 000050): Train loss 0.247, Val loss 0.136
Ep 1 (Step 000100): Train loss 0.188, Val loss 0.194
Training accuracy: 97.50% | Validation accuracy: 95.00%
Ep 2 (Step 000150): Train loss 0.454, Val loss 0.117
Ep 2 (Step 000200): Train loss 0.165, Val loss 0.126
Ep 2 (Step 000250): Train loss 0.101, Val loss 0.085
Training accuracy: 100.00% | Validation accuracy: 95.00%
Ep 3 (Step 000300): Train loss 0.057, Val loss 0.111
Ep 3 (Step 000350): Train loss 0.022, Val loss 0.096
Training accuracy: 100.00% | Validation accuracy: 97.50%
Ep 4 (Step 000400): Train loss 0.010, Val loss 0.086
Ep 4 (Step 000450): Train loss 0.004, Val loss 0.100
Ep 4 (Step 000500): Train loss 0.001, Val loss 0.131
Training accuracy: 100.00% | Validation accuracy: 97.50%
Ep 5 (Step 000550): Train loss 0.001, Val loss 0.175
Ep 5 (Step 000600): Train loss 0.012, Val loss 0.096
Training accuracy: 97.50% | Validation accuracy: 97.50%
Training completed in 0.74 minutes.
Training accuracy: 99.42%
Validation accuracy: 97.99%
Test accuracy: 97.67%
(base) ubuntu@l40s-instance:/opt/repository/LLMs-from-scratch/ch06/02_bonus_additional-experiments$ python additional-experiments.py --trainable_layers lora --lora_rank 16 --lora_alpha 16
/home/ubuntu/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
2024-08-15 11:57:28.516293: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-15 11:57:28.534235: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-15 11:57:28.539821: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-15 11:57:28.552504: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-15 11:57:29.439871: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
File already exists and is up-to-date: gpt2/124M/checkpoint
File already exists and is up-to-date: gpt2/124M/encoder.json
File already exists and is up-to-date: gpt2/124M/hparams.json
File already exists and is up-to-date: gpt2/124M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/124M/model.ckpt.index
File already exists and is up-to-date: gpt2/124M/model.ckpt.meta
File already exists and is up-to-date: gpt2/124M/vocab.bpe
Ep 1 (Step 000000): Train loss 2.819, Val loss 3.164
Ep 1 (Step 000050): Train loss 0.296, Val loss 0.178
Ep 1 (Step 000100): Train loss 0.124, Val loss 0.148
Training accuracy: 95.00% | Validation accuracy: 97.50%
Ep 2 (Step 000150): Train loss 0.124, Val loss 0.086
Ep 2 (Step 000200): Train loss 0.141, Val loss 0.102
Ep 2 (Step 000250): Train loss 0.043, Val loss 0.097
Training accuracy: 100.00% | Validation accuracy: 92.50%
Ep 3 (Step 000300): Train loss 0.035, Val loss 0.105
Ep 3 (Step 000350): Train loss 0.100, Val loss 0.180
Training accuracy: 100.00% | Validation accuracy: 97.50%
Ep 4 (Step 000400): Train loss 0.089, Val loss 0.061
Ep 4 (Step 000450): Train loss 0.080, Val loss 0.125
Ep 4 (Step 000500): Train loss 0.055, Val loss 0.082
Training accuracy: 100.00% | Validation accuracy: 97.50%
Ep 5 (Step 000550): Train loss 0.006, Val loss 0.073
Ep 5 (Step 000600): Train loss 0.012, Val loss 0.058
Training accuracy: 100.00% | Validation accuracy: 97.50%
Training completed in 0.88 minutes.
Training accuracy: 99.52%
Validation accuracy: 98.66%
Test accuracy: 97.67%
```
It seems that LoRA is more time-consuming in this ablation experiment, but I don't know the reason.
-
That's an interesting observation, and I would say that this is because the models are quite small. So the additional overhead during the forward pass may outweigh the performance gains in the backward pass. I remember running this with larger models but, to be honest, I forgot the results -- whether it was faster or not. The other thing is that you could try my alternative implementation with merged weights from here. I.e., in the book, I used

```python
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        x = self.alpha * (x @ self.A @ self.B)
        return x


class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        # Apply the frozen linear layer and add the low-rank update
        # as a separate matrix multiplication
        return self.linear(x) + self.lora(x)
```

because it is easier to explain and perhaps more intuitive when looking at the LoRA figures. However, you can reformulate it as follows, which may be faster:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        x = self.alpha * (x @ self.A @ self.B)
        return x


# This LoRA code is equivalent to LinearWithLoRA
class LinearWithLoRAMerged(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        # Merge the low-rank update into the weight matrix so that the
        # forward pass needs only a single F.linear call
        lora = self.lora.A @ self.lora.B
        combined_weight = self.linear.weight + self.lora.alpha * lora.T
        return F.linear(x, combined_weight, self.linear.bias)
```
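As a rough way to check which variant has less forward-pass overhead, here is a minimal timing sketch, assuming the `LinearWithLoRA` and `LinearWithLoRAMerged` classes above are defined. The layer sizes, batch shape, and iteration count are illustrative assumptions, not the settings used in the experiments above:

```python
import time

import torch
import torch.nn as nn

# Illustrative sizes (assumption): roughly the GPT-2 124M hidden dimension
in_dim, out_dim, batch_size, seq_len = 768, 768, 8, 128

base = nn.Linear(in_dim, out_dim)
plain = LinearWithLoRA(base, rank=16, alpha=16)
merged = LinearWithLoRAMerged(base, rank=16, alpha=16)
x = torch.randn(batch_size, seq_len, in_dim)


def avg_forward_time(layer, n_iters=200):
    # Warm up, then average the forward-pass wall-clock time
    with torch.no_grad():
        for _ in range(10):
            layer(x)
        start = time.perf_counter()
        for _ in range(n_iters):
            layer(x)
    return (time.perf_counter() - start) / n_iters


print(f"LinearWithLoRA:       {avg_forward_time(plain) * 1e3:.3f} ms")
print(f"LinearWithLoRAMerged: {avg_forward_time(merged) * 1e3:.3f} ms")
```

Whether the merged version actually wins will depend on layer sizes and hardware, so it is worth measuring rather than assuming.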
@TITC I tested this with the larger 1558M model, and LoRA seems to be faster with that one: 5.79 min instead of 8.12 min. I updated the table (row 9 vs 12):