
Releases: bitsandbytes-foundation/bitsandbytes

4-bit Inference

12 Jul 00:25

Efficient 4-bit Inference (NF4, FP4)

This release adds efficient inference routines for batch size 1. Expected speedups vs 16-bit precision (fp16/bf16) for matrix multiplications with an inner product dimension of at least 4096 (LLaMA 7B) are:

  • 2.2x for Turing (T4, RTX 2080, etc.)
  • 3.4x for Ampere (A100, A40, RTX 3090, etc.)
  • 4.0x for Ada/Hopper (H100, L40, RTX 4090, etc.)

The inference kernels for batch size 1 are about 8x faster than the 4-bit training kernels used for QLoRA. This means you can take advantage of the new kernels by splitting a multi-batch 4-bit query into multiple requests with batch size 1.

No code changes are needed to take advantage of the new kernels as long as a batch size of 1 is used.
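
For example, a single 4-bit linear layer run at batch size 1 will use the new kernels (a minimal sketch with randomly initialized weights; the layer sizes are illustrative):

import torch
import bitsandbytes as bnb

# NF4-quantized linear layer; the weights are quantized when the module is moved to the GPU
layer = bnb.nn.Linear4bit(4096, 4096, bias=False,
                          compute_dtype=torch.float16, quant_type="nf4").cuda()

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")  # batch size 1 triggers the fast inference path
with torch.no_grad():
    out = layer(x)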

Big thanks to @crowsonkb, @Birch-san, and @sekstini for some beta testing and helping to debug some early errors.

Changelog

Features:

  • Added 4-bit inference kernels for batch size=1. Currently supported are the NF4 and FP4 data types.
  • Added support for quantization of bfloat16 input data.

Bug fixes:

  • Added a device variable to bitsandbytes layers to make them compatible with PyTorch layers.

Deprecated:

  • Binaries for CUDA 11.2 and 11.6 no longer ship with pip install bitsandbytes and need to be compiled from source.

4-bit QLoRA, Paged Optimizers, and 8-bit Memory Leak Bugfix

20 Jun 02:50

This release brings 4-bit quantization support for QLoRA fine-tuning and a critical fix for a bug that doubled the memory cost of 8-bit models when they were serialized. Furthermore, paged optimizers are introduced, including 8-bit Lion.
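
The new 4-bit routines can be called directly (a minimal sketch, assuming the quantize_4bit / dequantize_4bit helpers in bitsandbytes.functional; the tensor shape is illustrative):

import torch
from bitsandbytes import functional as F

W = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)  # example weight matrix

# quantize to NormalFloat4; use quant_type="fp4" for Float4
q, state = F.quantize_4bit(W, quant_type="nf4")
W_deq = F.dequantize_4bit(q, state, quant_type="nf4")  # reconstruct a 16-bit approximation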

0.39.1

Features:

  • 4-bit matrix multiplication for Float4 and NormalFloat4 data types.
  • Added 4-bit quantization routines.
  • Double quantization routines for 4-bit quantization.
  • Paged optimizers for Adam and Lion (see the sketch after this list).
  • bfloat16 gradient / weight support for Adam and Lion with 8 or 32-bit states.
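
A paged optimizer is a drop-in replacement for its regular counterpart; its state is allocated in paged memory and evicted to CPU RAM when the GPU runs out of memory (a minimal sketch, assuming the PagedAdam8bit class in bnb.optim; the model and hyperparameters are illustrative):

import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()
# paged 8-bit Adam: optimizer state lives in paged memory
optimizer = bnb.optim.PagedAdam8bit(model.parameters(), lr=1e-4)

out = model(torch.randn(8, 4096, device="cuda"))
out.sum().backward()
optimizer.step()
optimizer.zero_grad()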

Bug fixes:

  • Fixed a bug where 8-bit models consumed twice the expected memory after serialization (thank you @mryab)

Deprecated:

  • Kepler binaries (GTX 700s and Tesla K40/K80) are no longer provided via pip and need to be compiled from source. Kepler support might be fully removed in the future.

8-bit Lion, 8-bit Load/Store from HF Hub

12 Apr 15:13

8-bit Lion, Load/Store 8-bit Models directly from/to HF Hub

This release brings 8-bit Lion to bitsandbytes. Compared to standard 32-bit Adam, it is 8x more memory efficient.
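
Usage mirrors the other 8-bit optimizers (a minimal sketch; the model and learning rate are illustrative):

import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()
# 8-bit Lion: optimizer state is stored in 8 bits instead of 32
optimizer = bnb.optim.Lion8bit(model.parameters(), lr=1e-4)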

Furthermore, models can now be serialized in 8-bit and pushed to the HuggingFace Hub. This means you can also load them from the Hub in 8-bit, making big models much easier to download and load into CPU memory.

To use this feature, you need the newest transformers release (this will likely be integrated into the HF transformers release tomorrow).
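
With a transformers version that includes this integration, loading, serializing, and re-loading an 8-bit model looks roughly like this (a sketch; the model name and Hub repo are illustrative):

from transformers import AutoModelForCausalLM

# load the weights directly in 8-bit (requires the accelerate package for device_map)
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m",
                                             load_in_8bit=True, device_map="auto")

# serialize the 8-bit weights and push them to the HuggingFace Hub (hypothetical repo name)
model.push_to_hub("my-username/bloom-560m-8bit")

# later, load the 8-bit checkpoint back the same way
model_8bit = AutoModelForCausalLM.from_pretrained("my-username/bloom-560m-8bit",
                                                  load_in_8bit=True, device_map="auto")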

In this release, CUDA 10.2 and GTX 700/K10 GPUs are deprecated in order to allow for broad support of bfloat16 in release 0.39.0.

Features:

  • Support for 32-bit and 8-bit Lion has been added. Thank you @lucidrains
  • Support for serialization of Linear8bitLt layers (LLM.int8()). This allows storing and loading 8-bit weights directly from the HuggingFace Hub. Thank you @mryab
  • New bug report feature: python -m bitsandbytes now gives extensive debugging details to help diagnose CUDA setup failures.

Bug fixes:

  • Fixed a bug where some bitsandbytes methods failed in a model-parallel setup on multiple GPUs. Thank you @tonylins
  • Fixed a bug where cudart.so libraries could not be found in newer PyTorch releases.

Improvements:

  • Improved the CUDA Setup procedure by doing a more extensive search for CUDA libraries

Deprecated:

  • Devices with compute capability 3.0 (GTX 700s, K10) and 3.2 (Tegra K1, Jetson TK1) are now deprecated and support will be removed in 0.39.0.
  • Support for CUDA 10.0 and 10.2 will be removed in bitsandbytes 0.39.0

Int8 Matmul backward for all GPUs

02 Feb 14:51

This release changed the default bitsandbytes matrix multiplication (bnb.matmul) to support memory-efficient backward by default. Additionally, matrix multiplication with 8-bit weights is now supported for all GPUs.

During backprop, the Int8 weights are converted back to a row-major layout through an inverse index. The general matmul for all GPUs with Int8 weights works by casting the weights from Int8 to the input's data type (TF32/FP32/BF16/FP16) and then performing a standard matrix multiplication. As such, the matrix multiplication during backprop and on non-tensor-core devices is memory efficient, but slow.
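
In practice this means an Int8 layer now runs on any GPU, just more slowly on devices without Int8 tensor cores (a minimal sketch; the layer sizes are illustrative):

import torch
import bitsandbytes as bnb

# Int8 layer with frozen 8-bit weights; works on all GPUs, including pre-Turing devices
layer = bnb.nn.Linear8bitLt(4096, 4096, bias=False,
                            has_fp16_weights=False, threshold=6.0).cuda()

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
out = layer(x)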

These contributions were the work of Alexander Borzunov and Yozh, thank you!

Features:

  • Int8 MatmulLt now supports backward through inversion of the ColTuring/ColAmpere format. Slow, but memory efficient. Big thanks to @borzunov
  • Int8 now supported on all GPUs. On devices with compute capability < 7.5, the Int8 weights are cast to 16/32-bit for the matrix multiplication. Contributed by @borzunov

Improvements:

  • Improved logging for the CUDA detection mechanism.

Ada/Hopper+fake k-bit quantization

04 Jan 11:57

The 0.36.0 release brings a lot of bug fixes, improvements, and new features:

  • better automatic CUDA detection & setup
  • better automatic compilation instruction generation in the case of failures
  • CUDA 11.8 and 12.0 support
  • Ada (RTX 40s series) and Hopper (H100) support
  • Added fake k-bit float, int, and quantile quantization (2 <= k <= 8, Int8 storage)

Additional features include fake k-bit quantization and smaller block sizes for block-wise quantization, which are used in our k-bit Inference Scaling Laws work. Fake k-bit quantization is useful for simulating k-bit data types, but it does not provide memory or runtime benefits. Here is how you use these features.

Faster block-wise quantization that now allows for very small block sizes down to 64:

import torch
from bitsandbytes import functional as F

X = torch.randn(1024, 1024, device="cuda")          # example input tensor
q, state = F.quantize_blockwise(X, blocksize=64)    # quantize in blocks of 64 values
X = F.dequantize_blockwise(q, state, blocksize=64)  # reconstruct the original tensor

k-bit fake quantization via block-wise quantization:

# fake 4-bit float quantization, stored as Int8
import torch
from bitsandbytes import functional as F

X = torch.randn(1024, 1024, device="cuda")  # example input tensor
# codebook for a signed 4-bit float with 2 exponent bits and 1 precision bit
code = F.create_fp8_map(signed=True, exponent_bits=2, precision_bits=1, total_bits=4).cuda()
q, state = F.quantize_blockwise(X, code=code)  # q stores 4-bit codebook indices as Int8
X = F.dequantize_blockwise(q, state)

0.36.0: Improvements, Ada/Hopper support, fake k-bit quantization.

Features:

  • CUDA 11.8 and 12.0 support added
  • support for Ada and Hopper GPUs added (compute capability 8.9 and 9.0)
  • support for fake k-bit block-wise quantization for Int, Float, quantile quantization, and dynamic exponent data types added
  • Added CUDA instruction generator to fix some installations.
  • Added additional block sizes for quantization {64, 128, 256, 512, 1024}
  • Added the SRAM Quantile algorithm to quickly estimate fewer than 256 quantiles
  • Added option to suppress the bitsandbytes welcome message (@Cyberes)

Regression:

  • Compute capability 3.0 support removed: the GTX 600 and 700 series are no longer supported (except the GTX 780 and GTX 780 Ti)

Bug fixes:

  • fixed a bug where too long directory names would crash the CUDA SETUP #35 (@tomaarsen)
  • fixed a bug where CPU installations on Colab would run into an error #34 (@tomaarsen)
  • fixed an issue where the default CUDA version with fast-DreamBooth was not supported #52
  • fixed a bug where the CUDA setup failed due to a wrong function call.
  • fixed a bug in the CUDA Setup which led to an incomprehensible error if no GPU was detected.
  • fixed a bug where the CUDA setup failed when the CUDA runtime was found but the CUDA library was not.
  • fixed a bug where not finding the CUDA runtime led to an incomprehensible error.
  • fixed a bug where a missing CUDA installation resulted in an error instead of falling back to the CPU library
  • fixed a bug where the CC version of the GPU was not detected appropriately (@BlackHC)
  • fixed a bug in CPU quantization which led to errors when the input buffer exceeded 2^31 elements

Improvements:

  • multiple improvements in formatting, removal of unused imports, and slight performance improvements (@tomaarsen)
  • the StableEmbedding layer now has device and dtype parameters to make it a 1:1 replacement for regular Embedding layers (see the sketch after this list) (@lostmsu)
  • runtime performance of block-wise quantization slightly improved
  • added an error message for the case where multiple libcudart.so libraries are installed and bitsandbytes picks the wrong one
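
For example, a StableEmbedding can now be constructed with the same device/dtype arguments as torch.nn.Embedding (a minimal sketch; the vocabulary and embedding sizes are illustrative):

import torch
import bitsandbytes as bnb

# StableEmbedding accepts device and dtype, just like torch.nn.Embedding
emb = bnb.nn.StableEmbedding(32000, 1024, device="cuda", dtype=torch.float32)
ids = torch.randint(0, 32000, (1, 16), device="cuda")
vectors = emb(ids)  # shape (1, 16, 1024)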

CUDA 11.8 Support for Dreambooth finetuning

10 Oct 03:16

0.35.0

CUDA 11.8 support and bug fixes

Features:

  • CUDA 11.8 support added and binaries added to the PyPI release.

Bug fixes:

  • fixed a bug where too long directory names would crash the CUDA SETUP #35 (thank you @tomaarsen)
  • fixed a bug where CPU installations on Colab would run into an error #34 (thank you @tomaarsen)
  • fixed an issue where the default CUDA version with fast-DreamBooth was not supported #52

Memory efficient backprop

20 Sep 04:54

This release introduces memory-efficient backprop through frozen weights, where the gradient is calculated from the 8-bit weights but computed in fp16. This is useful for creating Low-Rank Adapters (LoRA) for fine-tuning large models.

This is a feature contributed by @dbaranchuk and @justheuristic.
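
A rough sketch of the pattern this enables, with a trainable fp16 layer feeding into a frozen 8-bit layer (names and sizes are illustrative):

import torch
import bitsandbytes as bnb

# frozen Int8 layer that lets gradients flow through it
frozen = bnb.nn.Linear8bitLt(4096, 4096, bias=False, has_fp16_weights=False,
                             memory_efficient_backward=True).cuda()
adapter = torch.nn.Linear(4096, 4096, bias=False).half().cuda()  # trainable fp16 layer

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
out = frozen(adapter(x))
out.sum().backward()  # the gradient reaches adapter.weight by backpropagating through the frozen 8-bit weights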

0.34.0

Bug fixes and memory-efficient backprop

Features:

  • Linear8bitLt layer now supports memory_efficient_backward=True which enables backprop of gradients through frozen weights.

Bug fixes:

  • fixed an issue where too many threads were created in blockwise quantization on the CPU for large tensors

0.33.0: Various bug fixes

11 Sep 23:15

0.33.0

Various bug fixes

Features:

  • CPU quantization now supports a variable blocksize to enhance quantization speed or precision. 19a7adc

Bug fixes:

  • fixed an issue in CPU quantization where tensors with more than 2^31 elements would fail 19a7adc
  • fixed a bug where CPU binaries would fail if no GPU was detected eab4d82
  • fixed an issue where CPU binaries caused additional stdout messages 92a3363
  • fixed a broken import of bnb.utils 2e630b5

We thank @mryab, @mbrukman, @chessgecko, and @dbaranchuk for pull requests with bug fixes and new features.