Quantization: Making Models Fit
Here is a problem you will hit immediately when trying to run LLMs on consumer hardware: a 7-billion parameter model in FP16 takes 14 GB of memory. A base M2 MacBook Air has 8 GB of unified memory. A 70B model needs 140 GB in FP16 – more than any single consumer GPU on the market. And even if you have enough memory, the entire model must stream over the memory bus during inference, and memory bandwidth is the bottleneck for token generation.
Quantization is the solution. By representing weights with fewer bits – 8, 4, 3, or even 2 bits instead of 16 – we shrink the model dramatically. A 7B model at 4-bit quantization fits in about 4 GB. The 70B model fits in 35-40 GB. And because there is less data to read from memory, inference gets faster too.
But quantization is not free. Fewer bits means less precision, which means some degradation in output quality. The art of quantization is finding the sweet spot: aggressive enough to fit in memory and hit target speeds, but gentle enough to preserve model quality.
In this chapter, we will cover the major quantization schemes you will encounter in practice, understand the math behind them, work through concrete examples of quantizing and dequantizing, and analyze the tradeoffs between size, speed, and quality.
Why Quantize? The Numbers
Let us start with the raw arithmetic that makes quantization essential:
Model Sizes at Different Precisions
=====================================
Parameters FP32 FP16 Q8_0 Q4_0 Q4_K_M Q2_K
---------- ---- ---- ---- ---- ------ ----
1.5B 6 GB 3 GB 1.6 GB 0.9 GB 1.0 GB 0.6 GB
7B 28 GB 14 GB 7.5 GB 4.0 GB 4.4 GB 2.7 GB
13B 52 GB 26 GB 13.8 GB 7.4 GB 8.0 GB 4.9 GB
34B 136 GB 68 GB 36.1 GB 19.2 GB 21.1 GB 12.7 GB
70B 280 GB 140 GB 74.4 GB 39.6 GB 43.4 GB 26.2 GB
Apple Silicon Unified Memory:
M1/M2/M3 (base): 8 GB -> Q4_0 7B fits, FP16 does not
M1/M2/M3 Pro: 16-18 GB -> Q4_0 13B fits
M1/M2/M3 Max: up to 64 GB -> Q4_0 70B fits
M2/M3 Ultra: up to 192 GB -> FP16 70B fits
But memory capacity is only half the story. The other half is bandwidth:
The Bandwidth Equation
=======================
During autoregressive decoding (generating one token at a time),
EVERY weight in the model is read from memory exactly once per token.
decode_speed (tokens/sec) = memory_bandwidth / model_size_in_bytes
Apple M2 Pro: ~200 GB/s bandwidth
FP16 7B: 200 / 14.0 = ~14 tokens/sec
Q8_0 7B: 200 / 7.5 = ~27 tokens/sec
Q4_0 7B: 200 / 4.0 = ~50 tokens/sec
Q4_K_M 7B: 200 / 4.4 = ~45 tokens/sec
Q2_K 7B: 200 / 2.7 = ~74 tokens/sec
This is a theoretical upper bound -- real speeds are 60-80% of this
due to attention computation, KV cache reads, and other overhead.
But the proportionality holds: half the bytes = double the speed.
Decode Speed vs Quantization
=============================
Speed Model: 7B on M2 Pro (200 GB/s)
(tok/s)
80 | * Q2_K
|
70 |
|
60 |
| * Q4_0
50 | * Q4_K_M
|
40 |
|
30 | * Q8_0
|
20 |
| * FP16
10 |
|
0 +-----+------+------+------+------+------+----->
16 8 6 4 3 2 bits/weight
The Fundamentals of Quantization
At its core, quantization maps a continuous range of floating-point values to a discrete set of integer values. The simplest form is uniform affine quantization:
Uniform Quantization
=====================
Given: a set of FP16 weights in range [w_min, w_max]
Goal: represent each weight using b bits (0 to 2^b - 1)
Quantize:
scale = (w_max - w_min) / (2^b - 1)
zero_point = round(-w_min / scale)
q[i] = round(w[i] / scale) + zero_point
Dequantize:
w[i] = scale * (q[i] - zero_point)
Example: b=4 bits (range 0..15)
Weights: [-0.8, 0.3, -0.1, 0.5, -0.6, 0.7]
w_min = -0.8, w_max = 0.7
scale = (0.7 - (-0.8)) / 15 = 1.5 / 15 = 0.1
zero_point = round(0.8 / 0.1) = 8
Quantize:
q[-0.8] = round(-0.8/0.1) + 8 = -8 + 8 = 0
q[0.3] = round(0.3/0.1) + 8 = 3 + 8 = 11
q[-0.1] = round(-0.1/0.1) + 8 = -1 + 8 = 7
q[0.5] = round(0.5/0.1) + 8 = 5 + 8 = 13
q[-0.6] = round(-0.6/0.1) + 8 = -6 + 8 = 2
q[0.7] = round(0.7/0.1) + 8 = 7 + 8 = 15
Stored: [0, 11, 7, 13, 2, 15] (each is 4 bits)
Dequantize:
q=0  -> w = 0.1 * (0 - 8)  = -0.8 (exact)
q=11 -> w = 0.1 * (11 - 8) = 0.3 (exact)
q=7  -> w = 0.1 * (7 - 8)  = -0.1 (exact)
q=13 -> w = 0.1 * (13 - 8) = 0.5 (exact)
q=2  -> w = 0.1 * (2 - 8)  = -0.6 (exact)
q=15 -> w = 0.1 * (15 - 8) = 0.7 (exact)
In this lucky example, all values hit exactly.
In reality, most values get rounded, introducing error.
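To make the procedure concrete, here is a minimal Python sketch of uniform affine quantization that reproduces the 4-bit example above (the function names are illustrative):

```python
# Uniform affine quantization over a whole tensor, as described above.
# Real quantizers apply this per block; this is the global version.

def quantize_affine(weights, bits=4):
    """Map floats in [w_min, w_max] to integers in 0 .. 2^bits - 1."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / (2 ** bits - 1)
    zero_point = round(-w_min / scale)
    q = [round(w / scale) + zero_point for w in weights]
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    return [scale * (qi - zero_point) for qi in q]

weights = [-0.8, 0.3, -0.1, 0.5, -0.6, 0.7]
q, scale, zp = quantize_affine(weights)
print(q, zp)                             # [0, 11, 7, 13, 2, 15] 8
print(dequantize_affine(q, scale, zp))   # recovers the inputs (up to fp noise)
```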
But this global quantization (one scale and zero-point for the entire weight matrix) is too crude for neural networks. The weight distribution varies significantly across different parts of the matrix. Modern quantization uses block quantization: the weights are divided into small blocks, each with its own scale (and possibly zero-point).
Block Quantization: GGUF Format (Q4_0)
The GGUF format (used by llama.cpp and many Metal inference engines) uses block quantization. The simplest variant, Q4_0, works as follows:
Q4_0 Block Structure
=====================
Block size: 32 weights
Storage per block:
- 1 x FP16 scale factor (2 bytes)
- 32 x 4-bit signed integers packed into 16 bytes
Total: 18 bytes for 32 weights = 4.5 bits/weight
Quantization (symmetric, no zero point):
For each block of 32 weights:
abs_max = max(|w[0]|, |w[1]|, ..., |w[31]|)
scale = abs_max / 8 (maps to range [-8, 7])
For each weight w[i]:
q[i] = clamp(round(w[i] / scale), -8, 7)
Store as (q[i] + 8), giving range [0, 15] (4 bits)
Dequantization:
w[i] = scale * (stored[i] - 8)
Memory layout (18 bytes per block):
+--------+--------+--------+---+--------+
| scale | byte 0 | byte 1 |...| byte15 |
| (FP16) | w0|w1 | w2|w3 | |w30|w31 |
+--------+--------+--------+---+--------+
2B 1B 1B 1B
(2 nibbles packed per byte)
Worked Example: Quantizing a Block with Q4_0
Example: Quantize 32 weights with Q4_0
========================================
Weights (first 8 of 32 shown):
[ 0.23, -0.41, 0.67, -0.12, 0.55, -0.89, 0.34, 0.08, ... ]
Step 1: Find absolute maximum
abs_max = max(|all 32 weights|) = 0.89 (from -0.89)
Step 2: Compute scale
scale = 0.89 / 8 = 0.11125
Step 3: Quantize each weight
q[0] = round(0.23 / 0.11125) = round(2.067) = 2 -> stored: 2+8 = 10
q[1] = round(-0.41 / 0.11125) = round(-3.685) = -4 -> stored: -4+8 = 4
q[2] = round(0.67 / 0.11125) = round(6.022) = 6 -> stored: 6+8 = 14
q[3] = round(-0.12 / 0.11125) = round(-1.079) = -1 -> stored: -1+8 = 7
q[4] = round(0.55 / 0.11125) = round(4.944) = 5 -> stored: 5+8 = 13
q[5] = round(-0.89 / 0.11125) = round(-8.000) = -8 -> stored: -8+8 = 0
q[6] = round(0.34 / 0.11125) = round(3.056) = 3 -> stored: 3+8 = 11
q[7] = round(0.08 / 0.11125) = round(0.719) = 1 -> stored: 1+8 = 9
Step 4: Pack into bytes (two 4-bit values per byte)
byte[0] = (stored[1] << 4) | stored[0] = (4 << 4) | 10 = 0x4A
byte[1] = (stored[3] << 4) | stored[2] = (7 << 4) | 14 = 0x7E
byte[2] = (stored[5] << 4) | stored[4] = (0 << 4) | 13 = 0x0D
byte[3] = (stored[7] << 4) | stored[6] = (9 << 4) | 11 = 0x9B
Step 5: Dequantize (to verify)
dq[0] = 0.11125 * (10 - 8) = 0.11125 * 2 = 0.2225 (was 0.23, error: 0.008)
dq[1] = 0.11125 * (4 - 8) = 0.11125 * -4 = -0.4450 (was -0.41, error: 0.035)
dq[2] = 0.11125 * (14 - 8) = 0.11125 * 6 = 0.6675 (was 0.67, error: 0.003)
dq[3] = 0.11125 * (7 - 8) = 0.11125 * -1 = -0.1113 (was -0.12, error: 0.009)
dq[4] = 0.11125 * (13 - 8) = 0.11125 * 5 = 0.5563 (was 0.55, error: 0.006)
dq[5] = 0.11125 * (0 - 8) = 0.11125 * -8 = -0.8900 (was -0.89, error: 0.000)
dq[6] = 0.11125 * (11 - 8) = 0.11125 * 3 = 0.3338 (was 0.34, error: 0.006)
dq[7] = 0.11125 * (9 - 8) = 0.11125 * 1 = 0.1113 (was 0.08, error: 0.031)
Average absolute error: ~0.012
Relative error: ~3-4% on average
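The Q4_0 procedure can be sketched in Python. This reproduces the first eight weights of the worked example; a real block holds 32 weights, so we assume here that -0.89 is the absolute maximum of the whole block, as in the example:

```python
# Q4_0-style symmetric quantization of one (partial) block, reproducing
# the worked example above.

def quantize_q4_0_block(weights):
    abs_max = max(abs(w) for w in weights)
    scale = abs_max / 8                       # maps weights to [-8, 7]
    stored = [max(0, min(15, round(w / scale) + 8)) for w in weights]
    return stored, scale

def pack_nibbles(stored):
    """Two 4-bit values per byte: even index -> low nibble."""
    return bytes((stored[i + 1] << 4) | stored[i]
                 for i in range(0, len(stored), 2))

weights = [0.23, -0.41, 0.67, -0.12, 0.55, -0.89, 0.34, 0.08]
stored, scale = quantize_q4_0_block(weights)
print(stored)                         # [10, 4, 14, 7, 13, 0, 11, 9]
print(pack_nibbles(stored).hex())     # '4a7e0d9b'
print([scale * (s - 8) for s in stored])   # approximately the original weights
```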
Bit Packing Details
Two 4-bit values are packed into each byte. This is a critical operation for both quantization (packing) and dequantization (unpacking):
Bit Packing: Two Nibbles Per Byte
===================================
Stored values: a=10 (0b1010), b=4 (0b0100)
Pack: byte = (b << 4) | a
byte = (0b0100 << 4) | 0b1010
byte = 0b01001010
byte = 0x4A
Unpack:
a = byte & 0x0F = 0x4A & 0x0F = 0x0A = 10
b = (byte >> 4) & 0x0F = (0x4A >> 4) & 0x0F = 0x04 = 4
Visual:
+---+---+---+---+---+---+---+---+
| b3| b2| b1| b0| a3| a2| a1| a0| <-- one byte
+---+---+---+---+---+---+---+---+
| high nibble | low nibble |
| value b | value a |
In Metal shader code:
// Unpacking 4-bit values from bytes
uchar packed_byte = block.packed[j];
// Extract two 4-bit values
int8_t val_low = (packed_byte & 0x0F) - 8; // Low nibble, subtract offset
int8_t val_high = (packed_byte >> 4) - 8; // High nibble, subtract offset
// Dequantize
float w0 = float(block.scale) * float(val_low);
float w1 = float(block.scale) * float(val_high);
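For checking the arithmetic outside the shader, here is the same pack/unpack logic in Python (plain values stand in for `block.packed` and `block.scale`):

```python
# Two-nibbles-per-byte packing, mirroring the Metal unpack above.

def pack(a, b):
    """a -> low nibble, b -> high nibble."""
    return ((b & 0x0F) << 4) | (a & 0x0F)

def unpack(byte):
    return byte & 0x0F, (byte >> 4) & 0x0F

byte = pack(10, 4)
print(hex(byte))       # 0x4a
print(unpack(byte))    # (10, 4)

# Dequantize both values, as the shader does with block.scale:
scale = 0.11125
w0, w1 = (scale * (v - 8) for v in unpack(byte))
```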
K-Quant Family: Super-Blocks of 256
The basic Q4_0 format uses a single scale per 32 weights. The K-quant family (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K) introduced by llama.cpp uses a two-level hierarchy: super-blocks of 256 weights, each containing 8 sub-blocks of 32 weights.
K-Quant Super-Block Structure (Q4_K)
======================================
Super-block: 256 weights
+---------------------------------------------------+
| FP16 scale_of_scales | FP16 scale_of_mins | 4 bytes
+---------------------------------------------------+
| 8 x 6-bit sub-block scales (packed) | 6 bytes
+---------------------------------------------------+
| 8 x 6-bit sub-block mins (packed) | 6 bytes
+---------------------------------------------------+
| 256 x 4-bit quantized values (packed) | 128 bytes
+---------------------------------------------------+
Total: 144 bytes for 256 weights = 4.5 bits/weight
Each sub-block of 32 weights has its own:
- 6-bit scale (quantized, relative to super-block scale_of_scales)
- 6-bit minimum (quantized, relative to super-block scale_of_mins)
Dequantization for weight i in sub-block b:
sub_scale = scale_of_scales * sub_scales_q6[b]
sub_min = scale_of_mins * sub_mins_q6[b]
w[i] = sub_scale * q4[i] - sub_min
Why the two-level structure? It improves quantization accuracy by allowing each sub-block of 32 weights to have a different range, while keeping the overhead (scale and min metadata) small by quantizing the per-sub-block parameters themselves.
Q4_0 vs Q4_K: Quantization Granularity
========================================
Q4_0: One FP16 scale per 32 weights
+---------+---------+---------+---------+
| scale_0 | scale_1 | scale_2 | scale_3 | ... (256 weights = 8 blocks)
| 32 wts | 32 wts | 32 wts | 32 wts |
+---------+---------+---------+---------+
Each scale is independent FP16 (full precision).
Overhead: 2 bytes / 32 weights = 0.5 bits/weight
Total: 4.0 + 0.5 = 4.5 bits/weight
Q4_K: Two-level hierarchy for 256 weights
+-------------------------------------------+
| super-block: scale_of_scales, scale_of_mins| 4 bytes for 256 weights
| sub-block 0: 6-bit scale, 6-bit min |
| sub-block 1: 6-bit scale, 6-bit min |
| ... |
| sub-block 7: 6-bit scale, 6-bit min |
+-------------------------------------------+
Sub-block parameters: 12 bytes / 256 = 0.375 bits/weight
Super-block parameters: 4 bytes / 256 = 0.125 bits/weight
Total: 4.0 + 0.375 + 0.125 = 4.5 bits/weight
Same bits/weight, but Q4_K has finer-grained adaptation
and asymmetric ranges (scale + min instead of just scale).
Result: measurably lower perplexity.
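The overhead arithmetic above is easy to verify with a small helper (illustrative, not part of any library): effective bits/weight is the payload bits plus the metadata amortized over the block.

```python
# Effective bits/weight = payload bits + metadata bits amortized per weight.

def bits_per_weight(payload_bits, metadata_bytes, n_weights):
    return payload_bits + metadata_bytes * 8 / n_weights

print(bits_per_weight(4, 2, 32))        # Q4_0: one FP16 scale per 32 -> 4.5
print(bits_per_weight(4, 4 + 12, 256))  # Q4_K: 16 metadata bytes per 256 -> 4.5
```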
The K-Quant Zoo
Here is the full family of K-quant formats:
K-Quant Formats Overview
=========================
Format Bits/wt Quant Sub-block params Quality
------ ------- ----- ---------------- -------
Q2_K 2.56 2-bit 4-bit scale+min Poor (emergency use)
Q3_K_S 3.44 3-bit 6-bit scale Acceptable
Q3_K_M 3.91 3-bit 6-bit scale+min Fair
Q3_K_L 4.28 3-bit 6-bit scale+min Fair+
Q4_K_S 4.50 4-bit 6-bit scale+min Good
Q4_K_M 4.84 4-bit 6-bit scale+min Good+
Q5_K_S 5.50 5-bit 6-bit scale+min Very Good
Q5_K_M 5.69 5-bit 6-bit scale+min Very Good+
Q6_K 6.56 6-bit 8-bit scale Excellent
The S/M/L suffixes mean Small/Medium/Large and refer to which
layers get higher precision. "M" uses higher precision for the
attention layers and output layers (most sensitive to quantization).
The _M variants are particularly clever: they use mixed precision, with more bits for
the layers that matter most:
Q4_K_M: Mixed Precision by Layer
==================================
Layer Type Quantization Why
---------- ------------ ---
Attention Q,K,V,O Q4_K Important for quality
FFN gate, up Q4_K Standard
FFN down Q6_K Output projection, sensitive
Output (vocab) layer Q6_K Final prediction, very sensitive
Embedding Q4_K Large but less sensitive
The "M" variant is ~8% larger than "S" but measurably better.
Per-Channel and Per-Group Quantization (MLX Format)
The MLX framework (Apple’s machine learning framework) uses a different quantization approach: per-group affine quantization with configurable group sizes.
MLX Quantization
=================
Group size: typically 64 or 128
Per group: one FP16 scale and one FP16 bias (zero-point)
Quantize (per group of G weights):
w_min = min(weights in group)
w_max = max(weights in group)
scale = (w_max - w_min) / (2^bits - 1)
bias = w_min
q[i] = round((w[i] - bias) / scale)
q[i] = clamp(q[i], 0, 2^bits - 1)
Dequantize:
w[i] = scale * q[i] + bias
Example with group_size=64, bits=4:
+------------------------------------+
| scale (FP16) | bias (FP16) | 64x4b |
| 2 bytes | 2 bytes | 32 B |
+------------------------------------+
Total: 36 bytes for 64 weights = 4.5 bits/weight
With group_size=128:
+-------------------------------------+
| scale (FP16) | bias (FP16) | 128x4b |
| 2 bytes | 2 bytes | 64 B |
+-------------------------------------+
Total: 68 bytes for 128 weights = 4.25 bits/weight
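Here is a Python sketch of the per-group affine scheme above (function names are illustrative, not the MLX API). The group is all-positive, so the bias keeps every quantization level useful:

```python
# Per-group affine quantization, MLX-style (sketch).

def quantize_group(weights, bits=4):
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / (2 ** bits - 1)
    bias = w_min
    q = [max(0, min(2 ** bits - 1, round((w - bias) / scale)))
         for w in weights]
    return q, scale, bias

def dequantize_group(q, scale, bias):
    return [scale * qi + bias for qi in q]

weights = [0.1, 0.32, 0.55, 0.71, 0.9]
q, scale, bias = quantize_group(weights)
print(q)     # [0, 4, 8, 11, 15] -- the full 0..15 range is in use
print(dequantize_group(q, scale, bias))   # each within scale/2 of its input
```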
The key difference from GGUF quantization: MLX uses affine quantization (scale + bias) rather than symmetric quantization (scale only). This better handles weight distributions that are not centered around zero:
Symmetric vs Affine Quantization
==================================
Symmetric (Q4_0):
Maps [-abs_max, +abs_max] to [-8, +7]
scale = abs_max / 8
w = scale * (q - 8)
Problem: if weights are [0.1, 0.3, 0.5, 0.7, 0.9]
abs_max = 0.9, scale = 0.1125
Half the quantization levels (-8 to -1) are wasted!
Wasted range
<---------->
-0.9 -0.7 -0.5 -0.3 -0.1  0.1  0.3  0.5  0.7  0.9
  |    |    |    |    |    |    |    |    |    |
 q=0  q=2  q=4  q=5  q=7  q=9 q=11 q=12 q=14 q=15
 NO WEIGHTS HERE          ALL WEIGHTS HERE
Affine (MLX):
Maps [w_min, w_max] to [0, 15]
scale = (0.9 - 0.1) / 15 = 0.0533
bias = 0.1
w = scale * q + bias
0.1 0.15 0.21 0.26 ... 0.84 0.9
| | | | | |
q=0 q=1 q=2 q=3 ... q=14 q=15
FULL RANGE UTILIZED!
Affine quantization uses all levels for the actual data range.
Better for asymmetric distributions (common in practice).
MLX Quantization Code Example
// MLX-style dequantization in Metal
kernel void dequantize_mlx_q4(
device const uint32_t* packed_weights [[buffer(0)]], // Packed 4-bit
device const half* scales [[buffer(1)]], // Per-group scales
device const half* biases [[buffer(2)]], // Per-group biases
device half* output [[buffer(3)]], // Dequantized
constant uint& group_size [[buffer(4)]],
uint tid [[thread_position_in_grid]])
{
uint group_id = tid / group_size;
uint in_group = tid % group_size;
half scale = scales[group_id];
half bias = biases[group_id];
// Each uint32 holds 8 x 4-bit values
uint word_idx = tid / 8;
uint bit_offset = (tid % 8) * 4;
uint32_t word = packed_weights[word_idx];
uint8_t q = (word >> bit_offset) & 0xF;
output[tid] = scale * half(q) + bias;
}
Impact on Quality: Perplexity Analysis
Perplexity is the standard metric for measuring quantization quality. Lower is better – it measures how “surprised” the model is by test data. A perplexity increase of more than 1-2% is generally noticeable in output quality.
Perplexity Comparison (Llama 2 7B, WikiText-2)
================================================
Format Bits/wt Perplexity Delta Quality Assessment
------ ------- ---------- ----- ------------------
FP16 16.00 5.796 --- Reference (baseline)
Q8_0 8.50 5.799 +0.003 Indistinguishable
Q6_K 6.56 5.804 +0.008 Excellent
Q5_K_M 5.69 5.812 +0.016 Very good
Q5_K_S 5.50 5.819 +0.023 Very good
Q4_K_M 4.84 5.882 +0.086 Good
Q4_K_S 4.50 5.912 +0.116 Good
Q4_0 4.50 5.946 +0.150 Acceptable
Q3_K_M 3.91 6.145 +0.349 Fair (noticeable)
Q3_K_S 3.44 6.351 +0.555 Degraded
Q2_K 2.56 6.981 +1.185 Poor
Rule of thumb:
- Q4_K_M and above: quality loss rarely noticeable in practice
- Q3_K: quality loss sometimes noticeable, especially on hard tasks
- Q2_K: clear quality degradation, only for extreme memory constraints
Perplexity vs Bits per Weight
==============================
Perplexity
|
7.0 |* Q2_K
|
6.5 | * Q3_K_S
| * Q3_K_M
6.0 | * * Q4_0, Q4_K_S
| * * Q4_K_M, Q5_K_S
5.8 | * * Q5_K_M, Q6_K
| * Q8_0
5.6 | * FP16
|
5.4 +---+---+---+---+---+---+---+---+---+---+--->
2 3 4 5 6 7 8 ... 16 bits/weight
The curve has a "knee" around 4-5 bits/weight.
Below 4 bits, quality drops rapidly.
Above 5 bits, returns diminish.
4-5 bits is the sweet spot for most applications.
The Bandwidth Equation: Predicting Decode Speed
We mentioned this earlier, but let us formalize it. During autoregressive decoding:
Decode Speed Model
====================
Given:
P = number of parameters
b = bits per weight (effective, including quantization overhead)
BW = memory bandwidth (bytes/sec)
overhead = non-matmul time (attention, normalization, etc.)
model_bytes = P * b / 8
time_per_token = model_bytes / BW + overhead
tokens_per_second = 1 / time_per_token
Example: Llama 2 7B on M2 Pro
P = 6.74 billion (actual parameter count)
BW = 200 GB/s
Q4_K_M (b = 4.84):
model_bytes = 6.74e9 * 4.84 / 8 = 4.077 GB
matmul_time = 4.077 / 200 = 0.0204 sec = 20.4 ms
overhead ≈ 5 ms (attention + KV cache + norms)
time_per_token ≈ 25.4 ms
speed ≈ 39 tokens/sec
Observed: ~35-40 tokens/sec (matches!)
This equation is remarkably accurate because GEMV (the matmul during decode) is almost perfectly bandwidth-bound. The only thing that breaks the model is when the KV cache becomes very large (long sequences) and the attention computation starts contributing significantly.
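The speed model above, as a small Python function (the 5 ms overhead is the rough assumption from the example, not a measured constant):

```python
# The decode-speed model as a function.

def decode_tokens_per_sec(n_params, bits_per_weight, bandwidth_gb_s,
                          overhead_s=0.005):
    model_bytes = n_params * bits_per_weight / 8
    time_per_token = model_bytes / (bandwidth_gb_s * 1e9) + overhead_s
    return 1.0 / time_per_token

# Llama 2 7B (6.74B params), Q4_K_M (4.84 bits/weight), M2 Pro (200 GB/s)
print(decode_tokens_per_sec(6.74e9, 4.84, 200))   # ~39 tokens/sec
```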
When Does Attention Start to Matter?
======================================
KV cache read per step (32 layers, GQA, 8 KV heads, d=128, FP16):
kv_bytes = 32 layers * 2 (K and V) * 8 heads * seq_len * 128 * 2 bytes
         = 131072 * seq_len = 128 KB per cached token
Total read per token = model_bytes + kv_bytes
seq_len   kv_bytes   model(Q4_K_M)   kv/model   Impact
--------  --------   -------------   --------   ------
512       64 MB      4.08 GB         1.6%       Negligible
2048      256 MB     4.08 GB         6.6%       Small
8192      1.07 GB    4.08 GB         26%        Noticeable (~21% slower)
32768     4.3 GB     4.08 GB         105%       Severe (~2x slower)
KV cache reads are minor for short conversations (< 2K tokens), but they
grow linearly with context and rival the weights themselves around 32K
tokens -- which is why quantizing the KV cache (covered later in this
chapter) becomes important for long contexts.
Comparison Table
Here is a comprehensive comparison of all the major quantization formats:
+----------+--------+--------+----------+---------+--------+-------+
| Format | Bits/ | Block | Params | Asym- | Mixed | Used |
| | weight | size | per block| metric? | prec? | by |
+----------+--------+--------+----------+---------+--------+-------+
| FP16 | 16.00 | N/A | N/A | N/A | No | Base |
| Q8_0 | 8.50 | 32 | 1 scale | No | No | GGUF |
| Q6_K | 6.56 | 256 | 8-bit sc | No | No | GGUF |
| Q5_K_M | 5.69 | 256 | 6-bit s+m| Yes | Yes | GGUF |
| Q5_K_S | 5.50 | 256 | 6-bit s+m| Yes | No | GGUF |
| Q4_K_M | 4.84 | 256 | 6-bit s+m| Yes | Yes | GGUF |
| Q4_K_S | 4.50 | 256 | 6-bit s+m| Yes | No | GGUF |
| Q4_0 | 4.50 | 32 | 1 scale | No | No | GGUF |
| Q3_K_M | 3.91 | 256 | 6-bit s+m| Yes | Yes | GGUF |
| Q3_K_S | 3.44 | 256 | 6-bit sc | No | No | GGUF |
| Q2_K | 2.56 | 256 | 4-bit s+m| Yes | No | GGUF |
| MLX-4 | 4.25 | 64-128| scale+bias| Yes | No | MLX |
| MLX-8 | 8.25 | 64-128| scale+bias| Yes | No | MLX |
| MLX-2 | 2.25 | 64-128| scale+bias| Yes | No | MLX |
+----------+--------+--------+----------+---------+--------+-------+
Legend: sc = scale only, s+m = scale + minimum, Asym = asymmetric range
Mixed prec = different quantization levels for different layers
Worked Example: Full Dequantization Pipeline
Let us walk through dequantizing a complete Q4_K block on the GPU, showing exactly what happens in a Metal shader:
Q4_K Dequantization Pipeline
==============================
Input: One super-block of 256 quantized weights
Step 1: Read super-block header
d = FP16 scale_of_scales (e.g., 0.0156)
dmin = FP16 scale_of_mins (e.g., 0.0078)
Step 2: Unpack sub-block parameters (6-bit values, packed)
The 8 sub-block scales are packed as 6-bit values in 6 bytes:
+--------+--------+--------+--------+--------+--------+
| byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 |
+--------+--------+--------+--------+--------+--------+
sub_scale[0] = byte[0] & 0x3F = 23
sub_scale[1] = ((byte[0]>>6) | (byte[1]<<2)) & 0x3F = 18
sub_scale[2] = ((byte[1]>>4) | (byte[2]<<4)) & 0x3F = 31
...
Similarly for sub_mins.
Step 3: Compute actual sub-block scale and min
For sub-block b:
actual_scale[b] = d * sub_scale[b]
actual_min[b] = dmin * sub_min[b]
actual_scale[0] = 0.0156 * 23 = 0.3588
actual_min[0] = 0.0078 * 15 = 0.1170 (example)
Step 4: Dequantize weights in sub-block
For weight i in sub-block b:
q = unpack_4bit(packed_data, b*32 + i) (0..15)
w = actual_scale[b] * q - actual_min[b]
q[0] = 7: w = 0.3588 * 7 - 0.1170 = 2.3946
q[1] = 12: w = 0.3588 * 12 - 0.1170 = 4.1886
q[2] = 3: w = 0.3588 * 3 - 0.1170 = 0.9594
q[3] = 0: w = 0.3588 * 0 - 0.1170 = -0.1170 (min value)
q[4] = 15: w = 0.3588 * 15 - 0.1170 = 5.2650 (max value)
...
And here is how this looks in Metal shader code:
// Q4_K super-block structure
struct block_q4_K {
half d; // scale of scales
half dmin; // scale of mins
uchar scales[12]; // 8 x 6-bit scales + 8 x 6-bit mins, packed
uchar qs[128]; // 256 x 4-bit quantized values
};
// Dequantize one Q4_K block
inline void dequantize_q4_K(
device const block_q4_K& block,
uint sub_block, // 0..7
uint index_in_sub, // 0..31
thread float& result)
{
// Step 1: Read super-block scales
float d = float(block.d);
float dmin = float(block.dmin);
// Step 2: Unpack 6-bit sub-block scale and min
// (Simplified -- actual packing is more complex)
uint8_t raw_scale, raw_min;
if (sub_block < 4) {
raw_scale = block.scales[sub_block] & 0x3F;
raw_min = block.scales[sub_block + 4] & 0x3F;
} else {
// Higher sub-blocks use bits from multiple bytes
raw_scale = ((block.scales[sub_block + 4] & 0xF) |
((block.scales[sub_block - 4] >> 6) << 4));
raw_min = ((block.scales[sub_block + 4] >> 4) |
((block.scales[sub_block] >> 6) << 4));
}
float scale = d * float(raw_scale);
float min = dmin * float(raw_min);
// Step 3: Unpack 4-bit weight value
uint byte_idx = (sub_block * 32 + index_in_sub) / 2;
uchar packed = block.qs[byte_idx];
uint8_t q;
if (index_in_sub % 2 == 0) {
q = packed & 0x0F; // Low nibble
} else {
q = (packed >> 4) & 0x0F; // High nibble
}
// Step 4: Dequantize
result = scale * float(q) - min;
}
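The 6-bit scale/min unpacking is the easiest part to get wrong, so here is the same logic in Python for CPU-side checking (mirroring the simplified Metal version above; the example byte values are arbitrary):

```python
# The 6-bit scale/min unpacking from the Metal code above, in Python.
# scales is the 12-byte Q4_K metadata block.

def unpack_scale_min(scales, sub_block):
    """Return the 6-bit (scale, min) pair for sub_block 0..7."""
    if sub_block < 4:
        sc = scales[sub_block] & 0x3F
        mn = scales[sub_block + 4] & 0x3F
    else:
        sc = (scales[sub_block + 4] & 0x0F) | ((scales[sub_block - 4] >> 6) << 4)
        mn = (scales[sub_block + 4] >> 4) | ((scales[sub_block] >> 6) << 4)
    return sc, mn

scales = bytes([0x57, 0, 0, 0, 0x2F, 0, 0, 0, 0xC3, 0, 0, 0])
print(unpack_scale_min(scales, 0))   # (23, 47)
print(unpack_scale_min(scales, 4))   # (19, 12)
```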
Quantizing the KV Cache
So far we have discussed quantizing the model weights. But there is another large memory consumer during inference: the KV cache. For long context lengths, the KV cache can rival or exceed the model size.
KV Cache Quantization
======================
Standard KV cache (FP16):
Per layer: 2 * n_kv_heads * seq_len * d_k * 2 bytes
32 layers, 8 KV heads, d_k=128, seq_len=32768:
= 32 * 2 * 8 * 32768 * 128 * 2 = 4 GB
Q8 KV cache:
Same but 1 byte per element + small scale overhead
≈ 2 GB (50% reduction)
Q4 KV cache:
≈ 1 GB (75% reduction)
KV cache quantization is trickier than weight quantization because:
1. Values are computed dynamically (not known ahead of time)
2. Must quantize on-the-fly as new K/V vectors are computed
3. Range can change as new tokens arrive
4. Quality impact is harder to predict
Approach: per-head, per-position quantization
For each new (key, value) vector being added to the cache:
1. Compute the vector in FP16
2. Quantize to Q8 or Q4 with per-vector scale
3. Store quantized vector in cache
4. Dequantize on-the-fly during attention computation
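The on-the-fly approach can be sketched as a per-vector symmetric Q8 quantizer (illustrative Python; real implementations operate on FP16 vectors in GPU memory):

```python
# Per-vector symmetric Q8 quantization for new KV cache entries.

def quantize_kv_vector(v):
    """Quantize one K or V vector to int8 with a per-vector scale."""
    abs_max = max(abs(x) for x in v)
    scale = abs_max / 127 if abs_max > 0 else 1.0
    return [round(x / scale) for x in v], scale

def dequantize_kv_vector(q, scale):
    return [scale * qi for qi in q]

k_vec = [0.5, -1.1, 0.03, 2.4]      # one new key vector (illustrative)
q, scale = quantize_kv_vector(k_vec)
dq = dequantize_kv_vector(q, scale)
# Rounding error is bounded by scale/2 = abs_max / 254 per element.
```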
Practical Considerations
Choosing the Right Quantization
Here is a practical decision flowchart:
Quantization Selection Flowchart
==================================
How much memory do you have?
|
+-- Very limited (8 GB): Use Q4_K_M for 7B, Q2_K for 13B
|
+-- Moderate (16-18 GB): Use Q4_K_M for 13B, Q6_K for 7B
|
+-- Generous (32-64 GB): Use Q6_K or Q8_0 for 13B-34B
|
+-- Abundant (96+ GB): Consider FP16 for best quality
What is your speed target?
|
+-- Maximum speed: Use lowest quantization that maintains acceptable quality
| (usually Q4_K_M or Q4_K_S)
|
+-- Quality priority: Use highest quantization that meets speed requirements
| (Q6_K or Q8_0)
|
+-- Balanced: Q4_K_M is almost always the right answer
Task sensitivity?
|
+-- Creative writing, coding: Q4_K_M usually fine
|
+-- Math, reasoning: Consider Q5_K_M or Q6_K
|
+-- Simple Q&A, summarization: Q3_K_M can work
Quantization and GEMV Performance
Remember from Chapter 14: GEMV is bandwidth-bound. Quantization directly reduces the bytes read, so the speedup is nearly linear with compression ratio:
GEMV Performance with Quantization
====================================
Single matmul: [1, 4096] * [4096, 11008]
M2 Pro, 200 GB/s
Format Weight Size Read Time Speedup
------ ----------- --------- -------
FP16 90.1 MB 0.451 ms 1.0x
Q8_0 47.8 MB 0.239 ms 1.9x
Q6_K 37.0 MB 0.185 ms 2.4x
Q4_K_M 24.5 MB 0.123 ms 3.7x
Q4_0 23.6 MB 0.118 ms 3.8x
Q2_K 14.4 MB 0.072 ms 6.3x
Note: Dequantization adds ~5-10% compute overhead.
Net speedup is slightly less than the raw bandwidth reduction,
but still very close.
GEMM (Prefill) with Quantization
For prefill (GEMM), quantization does not directly speed things up because GEMM is compute-bound. However, it does reduce memory footprint, which matters for:
- Fitting larger models in memory
- Keeping more of the model in cache
- Reducing memory pressure from concurrent operations
Prefill Performance: Less Clear-Cut
=====================================
Matmul: [512, 4096] * [4096, 11008]
M2 Pro
Format Approach Time
------ -------- ----
FP16 Native FP16 MMA ~2.0 ms
Q4_K_M Dequant to FP16 + MMA ~2.5 ms (slower!)
Q4_K_M Optimized mixed-precision ~2.2 ms (close to FP16)
During prefill, quantization can actually be SLOWER because:
1. Extra dequantization compute
2. MMA hardware is optimized for FP16/BF16, not mixed precision
3. GEMM is compute-bound, not memory-bound
But for most LLM use cases, decode time dominates total latency,
so optimizing decode (where quantization helps enormously) is
more important than optimizing prefill.
Summary
Quantization is what makes LLM inference practical on consumer hardware. The key points:
- Why quantize: A 7B FP16 model needs 14 GB; Q4 needs ~4 GB. Decode speed is proportional to 1/model_size because GEMV is bandwidth-bound.
- Block quantization (Q4_0): Groups of 32 weights share one FP16 scale. 4 bits per weight + scale overhead = 4.5 bits/weight effective. Simple and fast.
- K-quant family: Super-blocks of 256 with 8 sub-blocks of 32. Two-level scale hierarchy. Mixed precision variants (_M) use more bits for sensitive layers. Q4_K_M is the most popular choice.
- Per-group affine (MLX): Groups of 64-128 with per-group scale and bias. Asymmetric quantization handles non-centered distributions better.
- Bit packing: Two 4-bit values per byte. Unpacking is a shift and mask operation, done on the fly during dequantization in the GPU shader.
- Quality impact: The perplexity curve has a knee at ~4-5 bits/weight. Q4_K_M is the sweet spot for most applications. Below 3 bits, quality degrades noticeably.
- The bandwidth equation: decode_speed ≈ bandwidth / model_bytes. This simple formula predicts real-world performance with surprising accuracy.
- Dequantize on the fly: During GEMV, weights are dequantized in registers as they are loaded. The compute cost of dequantization is negligible compared to the bandwidth savings.
With quantization, a $1200 MacBook Air with 16 GB of RAM (roughly 100 GB/s of bandwidth) can run a 7B parameter model at around 20 tokens per second – fast enough for interactive use. That is the power of trading a tiny bit of precision for an enormous reduction in memory and bandwidth requirements.