Quantization Formats In Depth
This chapter is the definitive reference for every quantization format Akunu supports. We will go byte by byte through the memory layouts, work through dequantization by hand, and build the mental model you need to write or debug quantized GEMV kernels on Metal. If you have ever wondered what exactly lives inside a Q4_0 block, or how K-quant super-blocks manage to pack 256 values with mixed bit widths, this is the chapter.
Why Quantization Matters
A 7B parameter model in F16 requires 14 GB of memory. That exceeds the unified memory of every base-model MacBook Air and most MacBook Pros. Quantize those weights to 4 bits and you are down to 3.5 GB – comfortably fitting on a machine with 8 GB of RAM, with plenty left over for KV cache and activations.
But quantization is not free. You are trading precision for memory, and the format you choose determines both the quality of that trade-off and the computational cost of dequantizing at inference time. GGUF alone defines over a dozen quantization formats. MLX adds its own family. Understanding them is essential for anyone working on inference engines.
GGUF Legacy Formats
These are the original formats from GGML/llama.cpp. They operate on fixed-size blocks of elements, each block containing packed quantized values plus per-block scale factors.
Q4_0: The Simplest Quantized Format
Q4_0 is where most people should start understanding quantization. It is symmetric 4-bit quantization with a single F16 scale per block of 32 elements.
Block layout (18 bytes total):
+------------------+----------------------------------+
| F16 scale (d) | 16 bytes of nibbles |
| 2 bytes | (32 x 4-bit values) |
+------------------+----------------------------------+
  Bytes 0-1           Bytes 2-17
Each byte in the nibble section holds two 4-bit values:
Byte layout (MSB to LSB): [hi_nibble : lo_nibble]
                          [ q[2i+1]  :  q[2i]   ]
Bit positions: 7 6 5 4 3 2 1 0
               ^^^^^^^         hi nibble (bits 4-7) = q[2i+1]
                       ^^^^^^^ lo nibble (bits 0-3) = q[2i]
Dequantization formula:
x[i] = d * (q[i] - 8)
The subtraction of 8 centers the 4-bit range [0, 15] around zero, giving an effective range of [-8, 7]. This is symmetric quantization – the zero point is fixed at 8, not learned.
Worked example: suppose we have a block with scale d = 0.5 (as F16) and the first data byte is 0xA3.
Byte 0xA3 = 1010 0011 in binary
lo nibble = 0011 = 3 -> q[0] = 3
hi nibble = 1010 = 10 -> q[1] = 10
x[0] = 0.5 * (3 - 8) = 0.5 * (-5) = -2.5
x[1] = 0.5 * (10 - 8) = 0.5 * (2) = 1.0
Bit extraction in Metal:
// Extract two Q4_0 values from a byte
uint8_t byte = block_data[j];
int q_lo = (byte & 0x0F); // bits 0-3
int q_hi = (byte >> 4) & 0x0F; // bits 4-7
float x_lo = d * ((float)q_lo - 8.0f);
float x_hi = d * ((float)q_hi - 8.0f);
Size calculation:
| Parameter | Value |
|---|---|
| Block size | 32 elements |
| Scale overhead | 2 bytes (F16) |
| Data | 16 bytes (32 nibbles) |
| Total per block | 18 bytes |
| Bits per weight | 18 * 8 / 32 = 4.5 bpw |
The overhead of the scale factor means Q4_0 is not exactly 4 bits per weight – it is 4.5 bpw. This is a common source of confusion. The “4” in Q4_0 refers to the quantized value width, not the effective bits per weight.
Q4_1: Asymmetric 4-bit
Q4_1 adds a minimum value (zero-point) per block, enabling asymmetric quantization:
Block layout (20 bytes):
+------------------+------------------+------------------+
| F16 scale (d) | F16 minimum (m) | 16 bytes nibbles |
| 2 bytes | 2 bytes | (32 x 4-bit) |
+------------------+------------------+------------------+
Bytes 0-1 Bytes 2-3 Bytes 4-19
Dequantization:
x[i] = d * q[i] + m
No subtraction of 8 here – the minimum m handles the offset. The range [0, 15] maps to [m, m + 15*d]. This can better represent distributions that are not centered at zero, at the cost of 2 extra bytes per block.
| Parameter | Value |
|---|---|
| Block size | 32 elements |
| Scale + min overhead | 4 bytes |
| Data | 16 bytes |
| Total per block | 20 bytes |
| Bits per weight | 20 * 8 / 32 = 5.0 bpw |
Q8_0: 8-bit Symmetric
Q8_0 stores each value as a signed 8-bit integer with a single F16 scale per block of 32:
Block layout (34 bytes):
+------------------+----------------------------------+
| F16 scale (d) | 32 bytes of int8 values |
| 2 bytes | (32 x 8-bit signed) |
+------------------+----------------------------------+
Bytes 0-1 Bytes 2-33
Dequantization:
x[i] = d * q[i] // q[i] is int8, range [-128, 127]
No offset subtraction needed because int8 is already signed.
| Parameter | Value |
|---|---|
| Block size | 32 elements |
| Scale overhead | 2 bytes |
| Data | 32 bytes |
| Total per block | 34 bytes |
| Bits per weight | 34 * 8 / 32 = 8.5 bpw |
Q8_0 is primarily used for activations in mixed-precision inference, not for weight storage (since 8.5 bpw barely saves memory over F16’s 16 bpw). Its main advantage is that int8 dot products can use SIMD integer multiply-accumulate instructions, which are faster than F16 multiply-accumulate on some hardware.
K-Quants: Super-Block Architecture
The K-quant family (Q2_K through Q6_K) was introduced by @ikawrakow in llama.cpp to improve quantization quality at low bit widths.[1] The key insight is that a single scale factor per 32 elements is too coarse for 2-3 bit quantization – the approximation error is unacceptably high. K-quants solve this with a two-level hierarchy: super-blocks of 256 elements containing multiple sub-blocks, each with its own scale.
Super-block (256 elements)
+-----------------------------------------------------------+
| Sub-block scales (quantized to fewer bits themselves) |
| Super-block scale (F16) + super-block min (F16) |
+-----------------------------------------------------------+
| Sub-block 0 (32 elements, quantized data) |
| Sub-block 1 (32 elements, quantized data) |
| ... |
| Sub-block 7 (32 elements, quantized data) |
+-----------------------------------------------------------+
The sub-block scales are themselves quantized (usually to 4 or 6 bits), and the super-block scale converts them back to floating point. This is nested quantization – you quantize the quantization parameters. The sub-block size varies by format: Q4_K and Q5_K use 8 sub-blocks of 32 elements (as drawn above), while Q2_K, Q3_K, and Q6_K use 16 sub-blocks of 16 elements.
Q2_K: 2-bit with 4-bit Sub-Block Scales
The most aggressive K-quant. Each element gets only 2 bits, but the hierarchical scales keep quality surprisingly usable.
Super-block layout (256 elements, 84 bytes):
+--------------------------------------------------+
| F16 super-scale (d) | 2 bytes |
| F16 super-minimum (dmin) | 2 bytes |
| 16 bytes: sub-block scale/min pairs              |
|   each byte: [4-bit min (hi) : 4-bit scale (lo)] |
| 64 bytes: quantized data | (256 x 2-bit) |
+--------------------------------------------------+
The 16 bytes of sub-block scales encode 16 pairs of (scale, minimum) values, one for each sub-block of 16 elements. Each pair is packed into a single byte as two 4-bit values.
Dequantization (for element i in sub-block j):
sub_scale = (scales_byte[j] & 0x0F)
sub_min = (scales_byte[j] >> 4)
x[i] = d * sub_scale * q[i] - dmin * sub_min
| Parameter | Value |
|---|---|
| Super-block size | 256 elements |
| Overhead | 2 (d) + 2 (dmin) + 16 (sub-scales) = 20 bytes |
| Data | 64 bytes (256 x 2-bit) |
| Total | 84 bytes |
| Bits per weight | 84 * 8 / 256 = 2.625 bpw |
Q3_K: 3-bit with Mixed Sub-Block Scales
Q3_K uses 3 bits per value with 256-element super-blocks.
Super-block layout (256 elements, 110 bytes):
+--------------------------------------------------+
| F16 super-scale (d) | 2 bytes |
| 12 bytes: quantized sub-scales | |
| (16 scales, 6-bit each, packed) | |
| 32 bytes: high-bits of quants | |
| (256 bits, 1 per element) | |
| 64 bytes: low-bits of quants | |
| (256 x 2-bit) | |
+--------------------------------------------------+
The 3-bit values are split across two regions: the low 2 bits are packed into the 64-byte “quants low” section (like Q2_K), and the high bit is stored separately in the 32-byte “high bits” section. This split layout simplifies SIMD extraction.
Dequantization:
q_lo = (quants_lo[i/4] >> (2 * (i%4))) & 0x03 // 2 low bits
q_hi = (hmask[i/8] >> (i%8)) & 1 // 1 high bit
q = q_lo | (q_hi << 2) // 3-bit value [0..7]
x[i] = d * sub_scale * (q - 4) // center at 4
| Parameter | Value |
|---|---|
| Super-block size | 256 elements |
| Total | 110 bytes |
| Bits per weight | 110 * 8 / 256 = 3.4375 bpw |
Q4_K: 4-bit K-Quant
Q4_K is the workhorse of the K-quant family. It provides a good balance of quality and compression that works well for most models.
Super-block layout (256 elements, 144 bytes):
+--------------------------------------------------+
| F16 super-scale (d) | 2 bytes |
| F16 super-minimum (dmin) | 2 bytes |
| 12 bytes: sub-block scales+mins | |
| (8 sub-blocks, 6-bit scale + | |
| 6-bit min, packed) | |
| 128 bytes: quantized data | |
| (256 x 4-bit nibbles) | |
+--------------------------------------------------+
Each of the 8 sub-blocks has a 6-bit scale and a 6-bit minimum, packed into the 12-byte scale section. The packing is non-trivial – the 6-bit values are split across multiple bytes.
Scale packing detail (12 bytes for 8 sub-blocks):
Bytes 0-3: low 4 bits of scales[0..7] (4 bits each, 2 per byte)
Bytes 4-7: low 4 bits of mins[0..7] (4 bits each, 2 per byte)
Bytes 8-9: high 2 bits of scales[0..7] (2 bits each, packed)
Bytes 10-11: high 2 bits of mins[0..7] (2 bits each, packed)
Dequantization:
scale_6bit = low4(scales, j) | (high2(scales, j) << 4)
min_6bit = low4(mins, j) | (high2(mins, j) << 4)
x[i] = d * scale_6bit * q[i] - dmin * min_6bit
| Parameter | Value |
|---|---|
| Super-block size | 256 elements |
| Total | 144 bytes |
| Bits per weight | 144 * 8 / 256 = 4.5 bpw |
Q5_K: 5-bit K-Quant
Q5_K extends Q4_K with an extra bit per value:
Super-block layout (256 elements, 176 bytes):
+--------------------------------------------------+
| F16 super-scale (d) | 2 bytes |
| F16 super-minimum (dmin) | 2 bytes |
| 12 bytes: sub-block scales+mins | |
| 128 bytes: low nibbles | |
| (256 x 4-bit) | |
| 32 bytes: high bits | |
| (256 x 1-bit) | |
+--------------------------------------------------+
Like Q3_K, the 5th bit is stored separately from the low 4 bits. This allows the low-nibble extraction to use the same SIMD patterns as Q4_K.
| Parameter | Value |
|---|---|
| Super-block size | 256 elements |
| Total | 176 bytes |
| Bits per weight | 176 * 8 / 256 = 5.5 bpw |
Q6_K: 6-bit K-Quant
Q6_K is the highest-quality K-quant, approaching F16 accuracy for most models.
Super-block layout (256 elements, 210 bytes):
+--------------------------------------------------+
| F16 super-scale (d) | 2 bytes |
| 16 bytes: sub-block scales (int8) | |
| 128 bytes: low nibbles | |
| (256 x 4-bit) | |
| 64 bytes: high dibits | |
| (256 x 2-bit) | |
+--------------------------------------------------+
Q6_K simplifies the scale storage: each sub-block scale is a full int8 value (not quantized further). There is no separate minimum – Q6_K uses symmetric quantization like Q4_0.
Dequantization:
q_lo = (quants_lo[i/2] >> (4*(i%2))) & 0x0F // 4 low bits
q_hi = (quants_hi[i/4] >> (2*(i%4))) & 0x03 // 2 high bits
q = q_lo | (q_hi << 4) // 6-bit value [0..63]
x[i] = d * sub_scale_int8 * (q - 32) // center at 32
| Parameter | Value |
|---|---|
| Super-block size | 256 elements |
| Total | 210 bytes |
| Bits per weight | 210 * 8 / 256 = 6.5625 bpw |
K-Quant Summary Table
| Format | Bits/value | Block size | Bytes/block | Effective bpw | Scale type | Symmetry |
|---|---|---|---|---|---|---|
| Q2_K | 2 | 256 | 84 | 2.63 | 4-bit nested | Asymmetric |
| Q3_K | 3 | 256 | 110 | 3.44 | 6-bit nested | Symmetric |
| Q4_K | 4 | 256 | 144 | 4.50 | 6-bit nested | Asymmetric |
| Q5_K | 5 | 256 | 176 | 5.50 | 6-bit nested | Asymmetric |
| Q6_K | 6 | 256 | 210 | 6.56 | int8 | Symmetric |
And for comparison, the legacy formats:
| Format | Bits/value | Block size | Bytes/block | Effective bpw | Scale type | Symmetry |
|---|---|---|---|---|---|---|
| Q4_0 | 4 | 32 | 18 | 4.50 | F16 | Symmetric |
| Q4_1 | 4 | 32 | 20 | 5.00 | F16 + F16 min | Asymmetric |
| Q5_0 | 5 | 32 | 22 | 5.50 | F16 | Symmetric |
| Q8_0 | 8 | 32 | 34 | 8.50 | F16 | Symmetric |
MLX Per-Group Quantization
MLX takes a different approach. Rather than defining custom block layouts with packed scales, MLX uses a straightforward per-group scheme with separate tensors for weights, scales, and biases.
Layout
For a weight matrix of shape [N, K] quantized to B bits with group size G:
Weight tensor: shape [N, K*B/32], dtype U32
Scales tensor: shape [N, K/G], dtype F16 or BF16
Biases tensor: shape [N, K/G], dtype F16 or BF16
Each U32 word packs 32/B quantized values. The values within a U32 are stored contiguously from LSB to MSB.
Bit Extraction
For B-bit quantization, extracting the j-th value from a U32:
uint32_t word = packed_weights[word_index];
uint32_t mask = (1u << B) - 1; // B ones
int shift = (j % (32 / B)) * B;
uint32_t q = (word >> shift) & mask;
Example: 4-bit extraction from U32 word 0xFEDCBA98:
Binary: 1111 1110 1101 1100 1011 1010 1001 1000
Value 0 (bits 0-3): 1000 = 8
Value 1 (bits 4-7): 1001 = 9
Value 2 (bits 8-11): 1010 = 10
Value 3 (bits 12-15): 1011 = 11
Value 4 (bits 16-19): 1100 = 12
Value 5 (bits 20-23): 1101 = 13
Value 6 (bits 24-27): 1110 = 14
Value 7 (bits 28-31): 1111 = 15
Dequantization
group_index = j / G
x[i][j] = scales[i][group_index] * q[i][j] + biases[i][group_index]
This is asymmetric affine quantization. The bias acts as a zero-point, allowing the quantization grid to cover any range, not just one centered at zero.
MLX Bit Width Variants
Akunu supports four MLX quantization widths, each mapped to a dtype code:
| Bit width | Dtype code | Values per U32 | Typical group size | Effective bpw |
|---|---|---|---|---|
| 3-bit | 99 | 10 (+ 2 bits padding) | 64 | ~3.5 |
| 4-bit | 100 | 8 | 64 | ~4.5 |
| 6-bit | 102 | 5 (+ 2 bits padding) | 64 | ~6.5 |
| 8-bit | 101 | 4 | 64 | ~8.5 |
The effective bpw includes the overhead of scale and bias storage. For a [4096, 4096] matrix with group size 64:
Weight data: 4096 * 4096 * B / 8 bytes
Scale data: 4096 * (4096/64) * 2 bytes = 4096 * 64 * 2 = 524,288 bytes
Bias data: same as scale = 524,288 bytes
Total overhead: 1,048,576 bytes (~1 MB)
This overhead is constant regardless of bit width, and is small relative to the weight data for large matrices.
3-bit Packing Detail
3-bit is the most irregular case because 32 is not evenly divisible by 3. MLX packs 10 three-bit values into each U32 (10 * 3 = 30 bits), leaving 2 bits unused:
U32 word: [unused:2][q9:3][q8:3][q7:3][q6:3][q5:3][q4:3][q3:3][q2:3][q1:3][q0:3]
Bits:       31-30   29-27 26-24 23-21 20-18 17-15 14-12 11-9   8-6   5-3   2-0
The extraction code:
uint32_t word = packed[word_index];
int pos_in_word = j % 10;
int shift = pos_in_word * 3;
uint32_t q = (word >> shift) & 0x7; // mask = 0b111
GPU Buffer Layout (Packed)
As discussed in the previous chapter, Akunu packs the three MLX tensors into a single GPU buffer for each weight matrix:
Offset 0: Packed U32 weights
Offset weight_bytes: F16 scales
Offset weight_bytes + scale_bytes: F16 biases
The Metal kernel receives the buffer pointer and a weight_bytes parameter. It computes scale and bias offsets arithmetically:
device const half *scales = (device const half *)
((device const char *)weights + params.weight_bytes);
device const half *biases = scales + (params.N * params.K / params.group_size);
Metal Kernel Dequantization Patterns
Each format requires a different dequantization strategy in the GEMV kernel. Here are the common patterns:
Q4_0 GEMV Inner Loop
// Each thread processes a chunk of the K dimension
for (int k = tid; k < K; k += stride) {
int block_idx = k / 32;
int block_off = k % 32;
// Load block header
half d = block_scales[block_idx];
// Load and extract nibble
int byte_idx = block_off / 2;
uint8_t byte = block_data[block_idx * 16 + byte_idx];
int nibble = (block_off & 1) ? (byte >> 4) : (byte & 0x0F);
// Dequantize and accumulate
float w = float(d) * (float(nibble) - 8.0f);
sum += w * float(input[k]);
}
K-Quant GEMV Pattern (Q4_K)
// Process one super-block (256 elements) at a time
for (int sb = ...; sb < n_superblocks; sb++) {
half d = super_scales[sb];
half dmin = super_mins[sb];
// Decode sub-block scales (6-bit from packed bytes)
for (int sub = 0; sub < 8; sub++) {
int sc = decode_6bit_scale(scale_bytes, sub);
int mn = decode_6bit_min(scale_bytes, sub);
float sub_scale = float(d) * sc;
float sub_min = float(dmin) * mn;
for (int k = 0; k < 32; k++) {
int q = extract_nibble(data, sub*32 + k);
float w = sub_scale * q - sub_min;
sum += w * float(input[sb*256 + sub*32 + k]);
}
}
}
MLX GEMV Pattern
// Process one group at a time
for (int g = 0; g < K / group_size; g++) {
half scale = scales[row * n_groups + g];
half bias = biases[row * n_groups + g];
for (int k = 0; k < group_size; k++) {
int global_k = g * group_size + k;
uint32_t word = packed[row * K_packed + global_k / values_per_word];
int pos = global_k % values_per_word;
uint32_t q = (word >> (pos * bits)) & bit_mask;
float w = float(scale) * float(q) + float(bias);
sum += w * float(input[global_k]);
}
}
Quality vs Size Comparison
The following table summarizes quality-size trade-offs. Perplexity numbers are approximate and vary by model, but the relative ordering is consistent.[2]
| Format | Effective bpw | Model size (7B) | Perplexity impact | Best use case |
|---|---|---|---|---|
| F16 | 16.0 | 14 GB | Baseline | Reference / debugging |
| Q8_0 | 8.5 | 7.4 GB | Negligible | Activation quantization |
| Q6_K | 6.56 | 5.7 GB | Very small | Quality-sensitive apps |
| Q5_K | 5.50 | 4.8 GB | Small | Good quality/size balance |
| Q4_K | 4.50 | 3.9 GB | Moderate | Best general-purpose |
| Q4_0 | 4.50 | 3.9 GB | Moderate+ | Fastest decode (simple format) |
| Q3_K | 3.44 | 3.0 GB | Noticeable | Memory-constrained |
| Q2_K | 2.63 | 2.3 GB | Significant | Extreme compression |
| MLX Q4 | ~4.5 | ~3.9 GB | Moderate | MLX ecosystem models |
| MLX Q3 | ~3.5 | ~3.1 GB | Noticeable | MLX ecosystem, low memory |
| MLX Q8 | ~8.5 | ~7.4 GB | Negligible | High quality MLX |
How Akunu Selects Kernels
The dtype code embedded in (or derived from) the weight file determines which kernels are used. Akunu’s DTypeDescriptor table maps each dtype to a complete set of kernel names:
| Dtype | Code | GEMV kernel | GEMM kernel | Fused SiLU | Embedding |
|---|---|---|---|---|---|
| F16 | 1 | gemv_f16 | simd_gemm_f16 | – | embedding_lookup_f16 |
| Q4_0 | 2 | gemv_q4_0 | simd_gemm_q4_0 | gemv_q4_0_silu | embedding_lookup_q4_0 |
| Q4_1 | 3 | gemv_q4_1 | simd_gemm_q4_1 | – | embedding_lookup_q4_1 |
| Q8_0 | 8 | gemv_q8_0 | simd_gemm_q8_0 | – | embedding_lookup_q8_0 |
| Q2_K | 10 | gemv_q2_k | simd_gemm_q2_k | – | – |
| Q3_K | 11 | gemv_q3_k | simd_gemm_q3_k | – | – |
| Q4_K | 12 | gemv_q4_k | simd_gemm_q4_k | – | embedding_lookup_q4_k |
| Q5_K | 13 | gemv_q5_k | simd_gemm_q5_k | – | – |
| Q6_K | 14 | gemv_q6_k | simd_gemm_q6_k | – | embedding_lookup_q6_k |
| BF16 | 31 | gemv_bf16 | simd_gemm_bf16 | – | embedding_lookup_bf16 |
| MLX Q3 | 99 | gemv_mlx_q3 | simd_gemm_mlx_q3 | gemv_mlx_q3_silu | embedding_lookup_mlx_generic |
| MLX Q4 | 100 | gemv_mlx_q4 | simd_gemm_mlx_q4 | gemv_mlx_q4_silu | embedding_lookup_mlx_q4 |
| MLX Q6 | 102 | gemv_mlx_q6 | simd_gemm_mlx_q6 | gemv_mlx_q6_silu | embedding_lookup_mlx_generic |
| MLX Q8 | 101 | gemv_mlx_q8 | simd_gemm_mlx_q8 | gemv_mlx_q8_silu | embedding_lookup_mlx_generic |
Note the pattern: GGUF formats have dtype codes below 32 (matching GGML’s enum), while MLX formats use codes 99-102. This avoids any collision between the two namespaces.
Each descriptor also includes dispatch geometry – the number of rows per threadgroup and the threadgroup size. These are tuned per format because different formats have different computational density:
| Format family | Rows/threadgroup | Threadgroup size | Rationale |
|---|---|---|---|
| F16 | 16 | 128 | Simple dequant, high arithmetic density |
| Q4_0/Q4_1 | 16 | 128 | Simple block format, fast extraction |
| Q8_0 | 32 | 256 | Larger data per block, needs more threads |
| K-quants | 16 | 256 | Complex nested dequant, more ALU work |
| MLX all | 16 | 128 | Group-based, moderate complexity |
Mixed Quantization
Many GGUF models use different quantization levels for different layers. For example, a Q4_K_M quantization (the “M” stands for “medium”, distinguishing it from the smaller “S” and larger “L” variants; all of them mix formats across tensors) might use:
- Q6_K for the attention norms and output norm (small tensors, quality-sensitive)
- Q4_K for most weight matrices
- Q5_K for the first and last few layers
Akunu handles this transparently because get_dtype() returns the per-tensor dtype, and build_dispatch_table() selects the kernel for each weight individually:
snprintf(name, sizeof(name), "layers.%d.attention.q.weight", layer);
uint32_t q_dtype = weights.get_dtype(name); // might be Q4_K
snprintf(name, sizeof(name), "layers.%d.attention.k.weight", layer);
uint32_t k_dtype = weights.get_dtype(name); // might be Q6_K
// Each gets the correct kernel
gemv(input, q_weight, output_q, 0, q_dtype, q_dim, dim);
gemv(input, k_weight, output_k, 0, k_dtype, kv_dim, dim);
Weight fusion (QKV or gate+up) requires matching dtypes – you cannot fuse a Q4_K weight with a Q6_K weight because they have different block layouts. The fusion check verifies this:
bool fuse_qkv = q_dtype == k_dtype && k_dtype == v_dtype;
If the dtypes do not match, Akunu falls back to separate GEMV dispatches.
Practical Guidance
For users choosing a quantization format:
- Q4_K_M is the sweet spot for most use cases. It provides good quality at ~4.5 bpw with the K-quant’s hierarchical scales.
- MLX Q4 is comparable in quality and works well with models from the MLX ecosystem.
- Q4_0 is slightly lower quality than Q4_K but uses simpler block structure, which can be faster for decode (where GEMV is the bottleneck).
- Q6_K or MLX Q8 if you can afford the memory and want near-lossless quality.
- Q2_K and MLX Q3 should be reserved for cases where memory is truly scarce. Quality degradation is noticeable.
For kernel developers:
- The block-of-32 formats (Q4_0, Q4_1, Q8_0) are the easiest to implement. Start there.
- K-quants require careful handling of the nested scale packing. Get the bit extraction right by testing against a reference implementation before optimizing.
- MLX formats are conceptually simpler (uniform group structure, no nested quantization) but require handling the three-tensor buffer layout and function constants for group size and K dimension.
- Always profile with real models. The format with the least memory usage is not always the fastest – simpler dequantization (Q4_0) can outperform complex dequantization (Q4_K) even at the same bit width, because the kernel spends less time on scale lookups.[3]
Notes

1. @ikawrakow, “k-quants: 2, 3, 4, 5, and 6-bit quantization for llama.cpp,” llama.cpp PR #1684, 2023. The key contribution was the super-block architecture that enables usable 2-3 bit quantization. See https://github.com/ggerganov/llama.cpp/pull/1684.
2. Perplexity numbers are from the llama.cpp quantization benchmarks. Exact values depend on the model and evaluation dataset. The relative ordering (F16 > Q6_K > Q5_K > Q4_K > Q4_0 > Q3_K > Q2_K) is consistent across models.
3. On Apple M2 Pro, Q4_0 GEMV for a 4096x4096 matrix runs at approximately 92% of memory bandwidth, while Q4_K achieves about 85%, despite both being ~4.5 bpw. The difference is the 6-bit sub-scale decoding overhead in Q4_K.