Have you ever tried to run a powerful AI model and hit an out-of-memory error? You’re not alone. Today’s models are massive, often requiring expensive GPUs with tens of gigabytes of VRAM. Quantization is the technique that changes that.
It reduces model size by changing how numbers are stored: instead of high-precision floats, you use simpler, lower-bit formats that need far less memory. Think of it like compressing a photo — you trade a small amount of quality for a much smaller file size.
Why Do We Need Quantization?
Consumer hardware is unlikely to keep pace with state-of-the-art model sizes, but quantization narrows the gap. A 32B-parameter model whose weights take ~64 GB at 16-bit precision shrinks to roughly 16 GB at 4 bits, so it can fit on a 24 GB GPU with room left for activations and the KV cache, at a modest cost in precision.
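The arithmetic behind those numbers is just parameters times bits per weight. Here is a minimal sketch; the helper name `weight_memory_gb` is made up for illustration, and it counts the weights only (activations, KV cache, and quantization metadata come on top).

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 32e9  # the 32B-parameter model from the text

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(n_params, bits):6.1f} GB")
```

At 16 bits this gives 64 GB, and at 4 bits it gives 16 GB, which is why the same model becomes usable on a 24 GB card.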
How Does Quantization Work?
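Under the hood, quantization converts high-precision floating-point numbers (like fp32) to lower-bit formats like bf16, int8, or int4. For reduced floating-point formats such as bf16, the precision loss comes from dropping mantissa bits. Integer formats such as int8 and int4 instead map each value onto a small grid of evenly spaced levels using a stored scale factor, and the loss comes from rounding to the nearest level.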
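For the integer formats, one common and simple scheme is absmax quantization: divide every value by a shared scale so the largest magnitude lands on 127, round to the nearest integer, and multiply back by the scale when the value is needed. The sketch below is an illustration of that idea in plain Python, not the exact algorithm any particular library uses; the function names and sample weights are made up.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.031, 1.27, -0.9]  # toy example values
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

print(q)         # [12, -50, 3, 127, -90]
print(restored)  # close to the originals, e.g. 0.031 comes back as about 0.03
```

Each restored value is off by at most half a quantization step (`scale / 2`), which is the rounding error the text refers to.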
An IEEE 754-style floating-point number has three parts:
[sign bit] [exponent bits] [mantissa bits]
- FP32 — 1 sign bit, 8 exponent bits, 23 mantissa bits
- BF16 — 1 sign bit, 8 exponent bits, 7 mantissa bits
- INT8 — 8 bits total, no exponent/mantissa split
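Shrinking the mantissa from 23 bits to 7 (FP32 to BF16) halves the memory footprint while keeping the dynamic range intact, because the 8 exponent bits are unchanged. Going to a quarter of the size or less means switching to INT8 or 4-bit formats, which give up the exponent and rely on a stored scale factor to cover the range of the weights. Either way, the core idea is the same: spend fewer bits per number.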
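The FP32-to-BF16 step can be reproduced by hand with a few bit operations. The sketch below uses only the standard library and simply truncates the low 16 bits; that is a simplification, since real hardware and frameworks typically round to nearest instead. The helper names are made up for illustration.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def split_fields(x: float):
    """Split the FP32 pattern into (sign, exponent, mantissa)."""
    bits = float32_bits(x)
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def to_bf16_truncate(x: float) -> float:
    """Keep the top 16 bits (1 sign, 8 exponent, 7 mantissa); zero the rest."""
    bits = float32_bits(x) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(split_fields(1.0))             # (0, 127, 0)
print(to_bf16_truncate(3.14159265))  # 3.140625
print(to_bf16_truncate(1e30))        # still about 1e30: the range survives
```

The value of pi loses its later digits, but a number as large as 1e30 remains representable, which is the trade BF16 makes.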
Practical Implementation with BitsAndBytes
BitsAndBytes is one of the most straightforward libraries for on-the-fly quantization: no calibration data, pre-quantized checkpoint, or complex setup required.
pip install bitsandbytes accelerate transformers
4-bit quantization
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
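bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the matrix multiplications
    bnb_4bit_use_double_quant=True,        # also quantize the per-block scale factors
    bnb_4bit_quant_type="nf4",             # NormalFloat4 instead of the default fp4
)
model = AutoModelForCausalLM.from_pretrained(
    "your-model-name",                     # placeholder: replace with a real model ID
    quantization_config=bnb_config,
    device_map="auto",                     # let accelerate place layers on available devices
)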
nf4 (NormalFloat4) is the recommended quant type for LLMs: its 16 levels are spaced for weights that follow a normal distribution, and it generally preserves accuracy better than the default fp4 type or plain INT4.
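The `bnb_4bit_use_double_quant=True` flag in the config above is about the scale factors rather than the weights themselves. Block-wise 4-bit quantization stores one scale per block of weights, and double quantization compresses those scales too. The sketch below works through the overhead using the figures reported in the QLoRA paper (blocks of 64 weights, 32-bit scales, and a second level of 8-bit scales in groups of 256); treat these block sizes as assumptions rather than guaranteed library defaults, and the function names as made up for illustration.

```python
def scale_overhead_bits(block_size=64, scale_bits=32):
    """Extra bits per weight when each block stores one full-precision scale."""
    return scale_bits / block_size


def double_quant_overhead_bits(block_size=64, scale_bits=8,
                               second_block_size=256, second_scale_bits=32):
    """Extra bits per weight when the scales are themselves quantized."""
    first_level = scale_bits / block_size
    second_level = second_scale_bits / (block_size * second_block_size)
    return first_level + second_level


plain = scale_overhead_bits()            # 0.5 bits per weight
double = double_quant_overhead_bits()    # about 0.127 bits per weight

n_params = 32e9  # the 32B model used earlier in the article
saved_gb = (plain - double) * n_params / 8 / 1e9
print(f"overhead: {plain} -> {double:.3f} bits/weight, saving ~{saved_gb:.2f} GB")
```

Under these assumptions the saving is about 0.37 bits per weight, or roughly 1.5 GB on a 32B model, which can matter when you are close to the VRAM limit.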
8-bit quantization
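# 8-bit loading needs only this flag; BitsAndBytesConfig has no
# bnb_8bit_compute_dtype option (compute dtype is a 4-bit setting)
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "your-model-name",                     # placeholder: replace with a real model ID
    quantization_config=bnb_config,
    device_map="auto",
)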
8-bit is a safer choice when you need higher accuracy and can afford roughly twice the weight memory of 4-bit.
Comparing Quantization Methods
| Method | Bits | Memory savings | Accuracy hit | Best for |
|---|---|---|---|---|
| FP16 | 16 | ~50% vs FP32 | Minimal | General inference |
| INT8 | 8 | ~75% vs FP32 | Small | Production deployment |
| 4-bit (NF4) | 4 | ~87.5% vs FP32 | Moderate | Consumer hardware |
What About GGUF, GPTQ, AWQ?
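GPTQ and AWQ are quantization methods that use calibration data, and GGUF is the file format llama.cpp uses for pre-quantized models. In all three cases the model is quantized once and saved as a smaller file. They often give better accuracy at a given bit-width but require running a quantization process upfront (or downloading a checkpoint someone else has already quantized).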
BitsAndBytes is different: it quantizes at load time, which means you can try any Hugging Face model with quantization in a few lines of code — no preprocessing required. That’s why it’s the go-to for experimentation.