
0.43.2: finetune Llama 405B on 4x GPUs with improved QLoRA+FSDP, CUDA 12.5 support

@Titus-von-Koeller released this 23 Jul 18:42 · 46 commits to main since this release

0.43.2

This release is quite significant: the QLoRA memory bug fix has big implications for larger sequence lengths and batch sizes.

For each sequence (i.e., a batch size increase of one), we expect memory savings of:

  • 405B: 39GB for seqlen=1024, and 4888GB for seqlen=128,000
  • 70B: 10.1GB for seqlen=1024, and 1258GB for seqlen=128,000

These savings come from a now-fixed bug: activations are unnecessary for frozen parameters, yet memory for them was still being erroneously allocated.
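As a rough sanity check (an illustrative sketch, not taken from the release itself), the per-sequence savings scale approximately linearly with sequence length, which lines up with the figures above:

```python
# Illustrative: extrapolate the per-sequence activation-memory savings
# from the seqlen=1024 figures in the release notes, assuming linear
# scaling of activation memory with sequence length.

def extrapolate_savings(savings_at_1024_gb: float, seqlen: int) -> float:
    """Linearly scale the per-sequence savings measured at seqlen=1024."""
    return savings_at_1024_gb * seqlen / 1024

# 405B: 39GB at seqlen=1024 -> ~4875GB at seqlen=128,000
# (release notes report 4888GB, in the same ballpark)
print(extrapolate_savings(39.0, 128_000))   # -> 4875.0

# 70B: 10.1GB at seqlen=1024 -> ~1262GB at seqlen=128,000
# (release notes report 1258GB)
print(extrapolate_savings(10.1, 128_000))   # -> 1262.5
```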

Bug Fixes

  • 4bit getstate and 8bit deepcopy (#1230 #1231, thanks @BenjaminBossan)
  • missing optimizers in str2optimizer32bit (#1222, thanks @EtienneDosSantos)
  • CUDA 12.5 build issue (#1273, thanks @HennerM)
  • min_8bit_size functionality in Optimizer base classes (#1286, thanks @Edenzzzz)
  • QLoRA mem bug (#1270, thanks @Ther-nullptr)
  • tests for cpu only platforms (#1259, thanks @galqiwi)
  • restoration of quant_storage for CPU offloading (#1279)
  • optim update error with non-contiguous grads/params (deepspeed) (#1187)
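For context on the quant_storage restoration (#1279) and the QLoRA+FSDP headline feature, a minimal config sketch via the Hugging Face transformers integration (an illustration assuming that integration, not part of this release's changelog) looks like:

```python
# Config-only sketch: 4-bit QLoRA setup compatible with FSDP sharding,
# assuming the Hugging Face `transformers` bitsandbytes integration.
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # For FSDP (and CPU offloading, per #1279), the packed 4-bit weights are
    # stored in a shardable dtype; this should match FSDP's parameter dtype.
    bnb_4bit_quant_storage=torch.bfloat16,
)
```

The key detail is bnb_4bit_quant_storage: FSDP shards flat parameters of a single dtype, so the quantized storage dtype must match the dtype the rest of the model is wrapped in.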