-
As mentioned in Interpretation 7, LoRA should be faster, but experiment 5 took 0.69 min and experiment 10 took 0.75 min.

I also reproduced it on an L40S GPU as below:

```
(base) ubuntu@l40s-instance:/opt/repository/LLMs-from-scratch/ch06/02_bonus_additional-experiments$ python additional-experiments.py --trainable_layers all
/home/ubuntu/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
2024-08-15 11:55:09.175906: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-15 11:55:09.210939: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-15 11:55:09.222097: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-15 11:55:09.253434: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-15 11:55:10.475634: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
checkpoint: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 77.0/77.0 [00:00<00:00, 44.3kiB/s]
encoder.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:01<00:00, 608kiB/s]
hparams.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 90.0/90.0 [00:00<00:00, 53.8kiB/s]
model.ckpt.data-00000-of-00001: 100%|█████████████████████████████████████████████████████████████████████████████████████| 498M/498M [00:59<00:00, 8.39MiB/s]
model.ckpt.index: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 5.21k/5.21k [00:00<00:00, 3.05MiB/s]
model.ckpt.meta: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 471k/471k [00:01<00:00, 339kiB/s]
vocab.bpe: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:01<00:00, 308kiB/s]
File downloaded and saved as sms_spam_collection/SMSSpamCollection.tsv
Ep 1 (Step 000000): Train loss 2.230, Val loss 2.499
Ep 1 (Step 000050): Train loss 0.247, Val loss 0.136
Ep 1 (Step 000100): Train loss 0.188, Val loss 0.194
Training accuracy: 97.50% | Validation accuracy: 95.00%
Ep 2 (Step 000150): Train loss 0.454, Val loss 0.117
Ep 2 (Step 000200): Train loss 0.165, Val loss 0.126
Ep 2 (Step 000250): Train loss 0.101, Val loss 0.085
Training accuracy: 100.00% | Validation accuracy: 95.00%
Ep 3 (Step 000300): Train loss 0.057, Val loss 0.111
Ep 3 (Step 000350): Train loss 0.022, Val loss 0.096
Training accuracy: 100.00% | Validation accuracy: 97.50%
Ep 4 (Step 000400): Train loss 0.010, Val loss 0.086
Ep 4 (Step 000450): Train loss 0.004, Val loss 0.100
Ep 4 (Step 000500): Train loss 0.001, Val loss 0.131
Training accuracy: 100.00% | Validation accuracy: 97.50%
Ep 5 (Step 000550): Train loss 0.001, Val loss 0.175
Ep 5 (Step 000600): Train loss 0.012, Val loss 0.096
Training accuracy: 97.50% | Validation accuracy: 97.50%
Training completed in 0.74 minutes.
Training accuracy: 99.42%
Validation accuracy: 97.99%
Test accuracy: 97.67%
(base) ubuntu@l40s-instance:/opt/repository/LLMs-from-scratch/ch06/02_bonus_additional-experiments$ python additional-experiments.py --trainable_layers lora --lora_rank 16 --lora_alpha 16
/home/ubuntu/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed).
from pandas.core import (
2024-08-15 11:57:28.516293: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-15 11:57:28.534235: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-15 11:57:28.539821: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-15 11:57:28.552504: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-15 11:57:29.439871: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
File already exists and is up-to-date: gpt2/124M/checkpoint
File already exists and is up-to-date: gpt2/124M/encoder.json
File already exists and is up-to-date: gpt2/124M/hparams.json
File already exists and is up-to-date: gpt2/124M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/124M/model.ckpt.index
File already exists and is up-to-date: gpt2/124M/model.ckpt.meta
File already exists and is up-to-date: gpt2/124M/vocab.bpe
Ep 1 (Step 000000): Train loss 2.819, Val loss 3.164
Ep 1 (Step 000050): Train loss 0.296, Val loss 0.178
Ep 1 (Step 000100): Train loss 0.124, Val loss 0.148
Training accuracy: 95.00% | Validation accuracy: 97.50%
Ep 2 (Step 000150): Train loss 0.124, Val loss 0.086
Ep 2 (Step 000200): Train loss 0.141, Val loss 0.102
Ep 2 (Step 000250): Train loss 0.043, Val loss 0.097
Training accuracy: 100.00% | Validation accuracy: 92.50%
Ep 3 (Step 000300): Train loss 0.035, Val loss 0.105
Ep 3 (Step 000350): Train loss 0.100, Val loss 0.180
Training accuracy: 100.00% | Validation accuracy: 97.50%
Ep 4 (Step 000400): Train loss 0.089, Val loss 0.061
Ep 4 (Step 000450): Train loss 0.080, Val loss 0.125
Ep 4 (Step 000500): Train loss 0.055, Val loss 0.082
Training accuracy: 100.00% | Validation accuracy: 97.50%
Ep 5 (Step 000550): Train loss 0.006, Val loss 0.073
Ep 5 (Step 000600): Train loss 0.012, Val loss 0.058
Training accuracy: 100.00% | Validation accuracy: 97.50%
Training completed in 0.88 minutes.
Training accuracy: 99.52%
Validation accuracy: 98.66%
Test accuracy: 97.67%
```
It seems that LoRA is more time-consuming in this ablation experiment, but I don't know the reason.
-
That's an interesting observation, and I would say that this is because the models are quite small. So the additional overhead during the forward pass may outweigh the performance gains in the backward pass. I remember running this with larger models but, to be honest, I forgot the results -- whether it was faster or not. The other thing is that you could try my alternative implementation with merged weights from here. I.e., in the book, I used

```python
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        x = self.alpha * (x @ self.A @ self.B)
        return x


class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        # Apply the frozen linear layer and add the low-rank update
        # as a separate matrix multiplication
        return self.linear(x) + self.lora(x)
```

because it is easier to explain and perhaps more intuitive when looking at the LoRA figures. However, you can reformulate it as follows, which may be faster:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        x = self.alpha * (x @ self.A @ self.B)
        return x


# This LoRA code is equivalent to LinearWithLoRA
class LinearWithLoRAMerged(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        # Merge the low-rank update into the weight matrix so that the
        # forward pass needs only a single F.linear call
        lora = self.lora.A @ self.lora.B
        combined_weight = self.linear.weight + self.lora.alpha * lora.T
        return F.linear(x, combined_weight, self.linear.bias)
```
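As a rough way to check which variant has less forward-pass overhead, here is a minimal timing sketch, assuming the `LinearWithLoRA` and `LinearWithLoRAMerged` classes above are defined. The layer sizes, batch shape, and iteration count are illustrative assumptions, not the settings used in the experiments above:

```python
import time

import torch
import torch.nn as nn

# Illustrative sizes (assumption): roughly the GPT-2 124M hidden dimension
in_dim, out_dim, batch_size, seq_len = 768, 768, 8, 128

base = nn.Linear(in_dim, out_dim)
plain = LinearWithLoRA(base, rank=16, alpha=16)
merged = LinearWithLoRAMerged(base, rank=16, alpha=16)
x = torch.randn(batch_size, seq_len, in_dim)


def avg_forward_time(layer, n_iters=200):
    # Warm up, then average the forward-pass wall-clock time
    with torch.no_grad():
        for _ in range(10):
            layer(x)
        start = time.perf_counter()
        for _ in range(n_iters):
            layer(x)
    return (time.perf_counter() - start) / n_iters


print(f"LinearWithLoRA:       {avg_forward_time(plain) * 1e3:.3f} ms")
print(f"LinearWithLoRAMerged: {avg_forward_time(merged) * 1e3:.3f} ms")
```

Whether the merged version actually wins will depend on layer sizes and hardware, so it is worth measuring rather than assuming.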
@TITC I tested this with the larger 1558M model, and LoRA seems to be faster with that one: 5.79 min instead of 8.12 min. I updated the table (row 9 vs 12):