Quantization support #12

show981111 · 2024-04-06T01:38:32Z

Hi, thank you for the awesome work. I was wondering if there is a quantized version of prismatic, or if I can quantize the LLM backbone at least. I saw that for inference, it is loading the weights using load_state_dict, so I am not sure how to approach quantization. Any insight would be helpful. Thanks!


siddk · 2024-04-15T15:18:47Z

This is a good question -- I would love to support this, but don't have too much experience loading LLMs in 4-bit/8-bit precision. If you can link me to some code for loading e.g., LLaMa-2 in 8-bit precision, I can see what would make sense!

djghosh13 · 2024-04-23T18:44:35Z

If I understand correctly, LlamaForCausalLM already supports easy quantization. Something like

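import torch
import transformers
from transformers import LlamaForCausalLM

# 8-bit loading goes through bitsandbytes, which must be installed.
quantization_config = transformers.BitsAndBytesConfig(load_in_8bit=True)
model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)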

works for me to load LLaMA-2 in 8-bit (or 4-bit if you specify in the BitsAndBytesConfig parameters).
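
For the 4-bit case, here is a rough sketch of what the BitsAndBytesConfig parameters might look like. The NF4 quantization type, double quantization, and bfloat16 compute dtype are assumptions chosen for illustration, not settings tested with prismatic. The helper names `four_bit_config_kwargs` and `load_llama_4bit` are hypothetical. The heavy imports are kept inside the loader so the config can be inspected without a GPU.

```python
def four_bit_config_kwargs(compute_dtype: str = "bfloat16") -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig in 4-bit mode.

    Hypothetical helper; the values below are illustrative assumptions.
    """
    return {
        "load_in_4bit": True,                     # used instead of load_in_8bit=True
        "bnb_4bit_quant_type": "nf4",             # "nf4" or "fp4"
        "bnb_4bit_use_double_quant": True,        # also quantize the quantization constants
        "bnb_4bit_compute_dtype": compute_dtype,  # dtype used for computation at run time
    }


def load_llama_4bit(model_id: str = "meta-llama/Llama-2-7b-hf"):
    """Load a LLaMA checkpoint in 4-bit (needs torch, transformers, bitsandbytes)."""
    # Imported lazily so that building the kwargs above has no heavy dependencies.
    import torch
    from transformers import BitsAndBytesConfig, LlamaForCausalLM

    quantization_config = BitsAndBytesConfig(**four_bit_config_kwargs())
    return LlamaForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        quantization_config=quantization_config,
    )
```

Calling `load_llama_4bit()` should behave like the 8-bit snippet above, only with the 4-bit settings. Whether the same config can be applied to the LLM backbone inside prismatic is not something this sketch addresses.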

The docs for BitsAndBytesConfig are here: https://huggingface.co/docs/transformers/en/main_classes/quantization#transformers.BitsAndBytesConfig

show981111 changed the title ~~Inference speed using pre-trained backbone, not from prismatic vlm checkpoint.~~ Quantization support Apr 6, 2024
