Quantization Formats In Depth
This chapter is the definitive reference for every quantization format Akunu supports. We will go byte by byte through the memory layouts, work through dequantization by hand, and build the mental model you need to write or debug quantized GEMV kernels on Metal. If you have ever wondered what exactly lives inside a Q4_0 block, or how K-quant super-blocks manage to pack 256 values with mixed bit widths, this is the chapter.
Why Quantization Matters
A 7B parameter model in F16 requires 14 GB of memory. That exceeds the unified memory of every base-model MacBook Air and most MacBook Pros. Quantize those weights to 4 bits and you are down to 3.5 GB – comfortably fitting on a machine with 8 GB of RAM, with plenty left over for KV cache and activations.
But quantization is not free. You are trading precision for memory, and the format you choose determines both the quality of that trade-off and the computational cost of dequantizing at inference time. GGUF alone defines over a dozen quantization formats. MLX adds its own family. Understanding them is essential for anyone working on inference engines.
GGUF Legacy Formats
These are the original formats from GGML/llama.cpp. They operate on fixed-size blocks of elements, each block containing packed quantized values plus per-block scale factors.
Q4_0: The Simplest Quantized Format
Q4_0 is where most people should start understanding quantization. It is symmetric 4-bit quantization with a single F16 scale per block of 32 elements.
Block layout (18 bytes total):
+------------------+----------------------------------+
| F16 scale (d) | 16 bytes of nibbles |
| 2 bytes | (32 x 4-bit values) |
+------------------+----------------------------------+
  Bytes 0-1           Bytes 2-17
Each byte in the nibble section holds two 4-bit values:
Byte layout (MSB to LSB): [hi_nibble : lo_nibble]
                          [ q[2i+1]  :  q[2i]   ]
Bit positions: 7 6 5 4 3 2 1 0
               ^^^^^^^         hi nibble (bits 4-7) = q[2i+1]
                       ^^^^^^^ lo nibble (bits 0-3) = q[2i]
Dequantization formula:
x[i] = d * (q[i] - 8)
The subtraction of 8 centers the 4-bit range [0, 15] around zero, giving an effective range of [-8, 7]. This is symmetric quantization – the zero point is fixed at 8, not learned.
Worked example: suppose we have a block with scale d = 0.5 (as F16) and the first data byte is 0xA3.
Byte 0xA3 = 1010 0011 in binary
lo nibble = 0011 = 3 -> q[0] = 3
hi nibble = 1010 = 10 -> q[1] = 10
x[0] = 0.5 * (3 - 8) = 0.5 * (-5) = -2.5
x[1] = 0.5 * (10 - 8) = 0.5 * (2) = 1.0
Bit extraction in Metal:
// Extract two Q4_0 values from a byte
uint8_t byte = block_data[j];
int q_lo = (byte & 0x0F); // bits 0-3
int q_hi = (byte >> 4) & 0x0F; // bits 4-7
float x_lo = d * ((float)q_lo - 8.0f);
float x_hi = d * ((float)q_hi - 8.0f);
Size calculation:
| Parameter | Value |
|---|---|
| Block size | 32 elements |
| Scale overhead | 2 bytes (F16) |
| Data | 16 bytes (32 nibbles) |
| Total per block | 18 bytes |
| Bits per weight | 18 * 8 / 32 = 4.5 bpw |
The overhead of the scale factor means Q4_0 is not exactly 4 bits per weight – it is 4.5 bpw. This is a common source of confusion. The “4” in Q4_0 refers to the quantized value width, not the effective bits per weight.
Q4_1: Asymmetric 4-bit
Q4_1 adds a minimum value (zero-point) per block, enabling asymmetric quantization:
Block layout (20 bytes):
+------------------+------------------+------------------+
| F16 scale (d) | F16 minimum (m) | 16 bytes nibbles |
| 2 bytes | 2 bytes | (32 x 4-bit) |
+------------------+------------------+------------------+
Bytes 0-1 Bytes 2-3 Bytes 4-19
Dequantization:
x[i] = d * q[i] + m
No subtraction of 8 here – the minimum m handles the offset. The range [0, 15] maps to [m, m + 15*d]. This can better represent distributions that are not centered at zero, at the cost of 2 extra bytes per block.
| Parameter | Value |
|---|---|
| Block size | 32 elements |
| Scale + min overhead | 4 bytes |
| Data | 16 bytes |
| Total per block | 20 bytes |
| Bits per weight | 20 * 8 / 32 = 5.0 bpw |
Q8_0: 8-bit Symmetric
Q8_0 stores each value as a signed 8-bit integer with a single F16 scale per block of 32:
Block layout (34 bytes):
+------------------+----------------------------------+
| F16 scale (d) | 32 bytes of int8 values |
| 2 bytes | (32 x 8-bit signed) |
+------------------+----------------------------------+
Bytes 0-1 Bytes 2-33
Dequantization:
x[i] = d * q[i] // q[i] is int8, range [-128, 127]
No offset subtraction needed because int8 is already signed.
| Parameter | Value |
|---|---|
| Block size | 32 elements |
| Scale overhead | 2 bytes |
| Data | 32 bytes |
| Total per block | 34 bytes |
| Bits per weight | 34 * 8 / 32 = 8.5 bpw |
Q8_0 is primarily used for activations in mixed-precision inference, not for weight storage (since 8.5 bpw barely saves memory over F16’s 16 bpw). Its main advantage is that int8 dot products can use SIMD integer multiply-accumulate instructions, which are faster than F16 multiply-accumulate on some hardware.
K-Quants: Super-Block Architecture
The K-quant family (Q2_K through Q6_K) was introduced by @ikawrakow in llama.cpp to improve quantization quality at low bit widths.[1] The key insight is that a single scale factor per 32 elements is too coarse for 2-3 bit quantization – the approximation error is unacceptably high. K-quants solve this with a two-level hierarchy: super-blocks of 256 elements containing multiple sub-blocks, each with its own scale.
Super-block (256 elements)
+-----------------------------------------------------------+
| Sub-block scales (quantized to fewer bits themselves) |
| Super-block scale (F16) + super-block min (F16) |
+-----------------------------------------------------------+
| Sub-block 0 (32 elements, quantized data) |
| Sub-block 1 (32 elements, quantized data) |
| ... |
| Sub-block 7 (32 elements, quantized data) |
+-----------------------------------------------------------+
The sub-block scales are themselves quantized (usually to 4 or 6 bits), and the super-block scale converts them back to floating point. This is nested quantization – you quantize the quantization parameters. The sub-block size varies by format: Q4_K and Q5_K use 8 sub-blocks of 32 elements (as drawn above), while Q2_K, Q3_K, and Q6_K use 16 sub-blocks of 16 elements.
Q2_K: 2-bit with 4-bit Sub-Block Scales
The most aggressive K-quant. Each element gets only 2 bits, but the hierarchical scales keep quality surprisingly usable.
Super-block layout (256 elements, 84 bytes):
+--------------------------------------------------+
| F16 super-scale (d) | 2 bytes |
| F16 super-minimum (dmin) | 2 bytes |
| 16 bytes: sub-block scale/min pairs              |
|   each byte: [4-bit min (hi) : 4-bit scale (lo)] |
| 64 bytes: quantized data | (256 x 2-bit) |
+--------------------------------------------------+
The 16 bytes of sub-block scales encode 16 pairs of (scale, minimum) values, one for each sub-block of 16 elements. Each pair is packed into a single byte as two 4-bit values.
Dequantization (for element i in sub-block j):
sub_scale = (scales_byte[j] & 0x0F)
sub_min = (scales_byte[j] >> 4)
x[i] = d * sub_scale * q[i] - dmin * sub_min
| Parameter | Value |
|---|---|
| Super-block size | 256 elements |
| Overhead | 2 (d) + 2 (dmin) + 16 (sub-scales) = 20 bytes |
| Data | 64 bytes (256 x 2-bit) |
| Total | 84 bytes |
| Bits per weight | 84 * 8 / 256 = 2.625 bpw |
Q3_K: 3-bit with Mixed Sub-Block Scales
Q3_K uses 3 bits per value with 256-element super-blocks.
Super-block layout (256 elements, 110 bytes):
+--------------------------------------------------+
| F16 super-scale (d) | 2 bytes |
| 12 bytes: quantized sub-scales | |
| (16 scales, 6-bit each, packed) | |
| 32 bytes: high-bits of quants | |
| (256 bits, 1 per element) | |
| 64 bytes: low-bits of quants | |
| (256 x 2-bit) | |
+--------------------------------------------------+
The 3-bit values are split across two regions: the low 2 bits are packed into the 64-byte “quants low” section (like Q2_K), and the high bit is stored separately in the 32-byte “high bits” section. This split layout simplifies SIMD extraction.
Dequantization:
q_lo = (quants_lo[i/4] >> (2 * (i%4))) & 0x03 // 2 low bits
q_hi = (hmask[i/8] >> (i%8)) & 1 // 1 high bit
q = q_lo | (q_hi << 2) // 3-bit value [0..7]
x[i] = d * sub_scale * (q - 4) // center at 4
| Parameter | Value |
|---|---|
| Super-block size | 256 elements |
| Total | 110 bytes |
| Bits per weight | 110 * 8 / 256 = 3.4375 bpw |
Q4_K: 4-bit K-Quant
Q4_K is the workhorse of the K-quant family. It provides a good balance of quality and compression that works well for most models.
Super-block layout (256 elements, 144 bytes):
+--------------------------------------------------+
| F16 super-scale (d) | 2 bytes |
| F16 super-minimum (dmin) | 2 bytes |
| 12 bytes: sub-block scales+mins | |
| (8 sub-blocks, 6-bit scale + | |
| 6-bit min, packed) | |
| 128 bytes: quantized data | |
| (256 x 4-bit nibbles) | |
+--------------------------------------------------+
Each of the 8 sub-blocks has a 6-bit scale and a 6-bit minimum, packed into the 12-byte scale section. The packing is non-trivial – the 6-bit values are split across multiple bytes.
Scale packing detail (12 bytes for 8 sub-blocks):
Bytes 0-3: low 4 bits of scales[0..7] (4 bits each, 2 per byte)
Bytes 4-7: low 4 bits of mins[0..7] (4 bits each, 2 per byte)
Bytes 8-9: high 2 bits of scales[0..7] (2 bits each, packed)
Bytes 10-11: high 2 bits of mins[0..7] (2 bits each, packed)
Dequantization:
scale_6bit = low4(scales, j) | (high2(scales, j) << 4)
min_6bit = low4(mins, j) | (high2(mins, j) << 4)
x[i] = d * scale_6bit * q[i] - dmin * min_6bit
| Parameter | Value |
|---|---|
| Super-block size | 256 elements |
| Total | 144 bytes |
| Bits per weight | 144 * 8 / 256 = 4.5 bpw |
Q5_K: 5-bit K-Quant
Q5_K extends Q4_K with an extra bit per value:
Super-block layout (256 elements, 176 bytes):
+--------------------------------------------------+
| F16 super-scale (d) | 2 bytes |
| F16 super-minimum (dmin) | 2 bytes |
| 12 bytes: sub-block scales+mins | |
| 128 bytes: low nibbles | |
| (256 x 4-bit) | |
| 32 bytes: high bits | |
| (256 x 1-bit) | |
+--------------------------------------------------+
Like Q3_K, the 5th bit is stored separately from the low 4 bits. This allows the low-nibble extraction to use the same SIMD patterns as Q4_K.
| Parameter | Value |
|---|---|
| Super-block size | 256 elements |
| Total | 176 bytes |
| Bits per weight | 176 * 8 / 256 = 5.5 bpw |
Q6_K: 6-bit K-Quant
Q6_K is the highest-quality K-quant, approaching F16 accuracy for most models.
Super-block layout (256 elements, 210 bytes):
+--------------------------------------------------+
| F16 super-scale (d) | 2 bytes |
| 16 bytes: sub-block scales (int8) | |
| 128 bytes: low nibbles | |
| (256 x 4-bit) | |
| 64 bytes: high dibits | |
| (256 x 2-bit) | |
+--------------------------------------------------+
Q6_K simplifies the scale storage: each sub-block scale is a full int8 value (not quantized further). There is no separate minimum – Q6_K uses symmetric quantization like Q4_0.
Dequantization:
q_lo = (quants_lo[i/2] >> (4*(i%2))) & 0x0F // 4 low bits
q_hi = (quants_hi[i/4] >> (2*(i%4))) & 0x03 // 2 high bits
q = q_lo | (q_hi << 4) // 6-bit value [0..63]
x[i] = d * sub_scale_int8 * (q - 32) // center at 32
| Parameter | Value |
|---|---|
| Super-block size | 256 elements |
| Total | 210 bytes |
| Bits per weight | 210 * 8 / 256 = 6.5625 bpw |
K-Quant Summary Table
| Format | Bits/value | Block size | Bytes/block | Effective bpw | Scale type | Symmetry |
|---|---|---|---|---|---|---|
| Q2_K | 2 | 256 | 84 | 2.63 | 4-bit nested | Asymmetric |
| Q3_K | 3 | 256 | 110 | 3.44 | 6-bit nested | Symmetric |
| Q4_K | 4 | 256 | 144 | 4.50 | 6-bit nested | Asymmetric |
| Q5_K | 5 | 256 | 176 | 5.50 | 6-bit nested | Asymmetric |
| Q6_K | 6 | 256 | 210 | 6.56 | int8 | Symmetric |
And for comparison, the legacy formats:
| Format | Bits/value | Block size | Bytes/block | Effective bpw | Scale type | Symmetry |
|---|---|---|---|---|---|---|
| Q4_0 | 4 | 32 | 18 | 4.50 | F16 | Symmetric |
| Q4_1 | 4 | 32 | 20 | 5.00 | F16 + F16 min | Asymmetric |
| Q5_0 | 5 | 32 | 22 | 5.50 | F16 | Symmetric |
| Q8_0 | 8 | 32 | 34 | 8.50 | F16 | Symmetric |
MLX Per-Group Quantization
MLX takes a different approach. Rather than defining custom block layouts with packed scales, MLX uses a straightforward per-group scheme with separate tensors for weights, scales, and biases.
Layout
For a weight matrix of shape [N, K] quantized to B bits with group size G:
Weight tensor: shape [N, K*B/32], dtype U32
Scales tensor: shape [N, K/G], dtype F16 or BF16
Biases tensor: shape [N, K/G], dtype F16 or BF16
Each U32 word packs 32/B quantized values. The values within a U32 are stored contiguously from LSB to MSB.
Bit Extraction
For B-bit quantization, extracting the j-th value from a U32:
uint32_t word = packed_weights[word_index];
uint32_t mask = (1u << B) - 1; // B ones
int shift = (j % (32 / B)) * B;
uint32_t q = (word >> shift) & mask;
Example: 4-bit extraction from U32 word 0xFEDCBA98:
Binary: 1111 1110 1101 1100 1011 1010 1001 1000
Value 0 (bits 0-3): 1000 = 8
Value 1 (bits 4-7): 1001 = 9
Value 2 (bits 8-11): 1010 = 10
Value 3 (bits 12-15): 1011 = 11
Value 4 (bits 16-19): 1100 = 12
Value 5 (bits 20-23): 1101 = 13
Value 6 (bits 24-27): 1110 = 14
Value 7 (bits 28-31): 1111 = 15
Dequantization
group_index = j / G
x[i][j] = scales[i][group_index] * q[i][j] + biases[i][group_index]
This is asymmetric affine quantization. The bias acts as a zero-point, allowing the quantization grid to cover any range, not just one centered at zero.
MLX Bit Width Variants
Akunu supports four MLX quantization widths, each mapped to a dtype code:
| Bit width | Dtype code | Values per U32 | Typical group size | Effective bpw |
|---|---|---|---|---|
| 3-bit | 99 | 10 (+ 2 bits padding) | 64 | ~3.5 |
| 4-bit | 100 | 8 | 64 | ~4.5 |
| 6-bit | 102 | 5 (+ 2 bits padding) | 64 | ~6.5 |
| 8-bit | 101 | 4 | 64 | ~8.5 |
The effective bpw includes the overhead of scale and bias storage. For a [4096, 4096] matrix with group size 64:
Weight data: 4096 * 4096 * B / 8 bytes
Scale data: 4096 * (4096/64) * 2 bytes = 4096 * 64 * 2 = 524,288 bytes
Bias data: same as scale = 524,288 bytes
Total overhead: 1,048,576 bytes (~1 MB)
This overhead is constant regardless of bit width, and is small relative to the weight data for large matrices.
3-bit Packing Detail
3-bit is the most irregular case because 32 is not evenly divisible by 3. MLX packs 10 three-bit values into each U32 (10 * 3 = 30 bits), leaving 2 bits unused:
U32 word: [unused:2][q9:3][q8:3][q7:3][q6:3][q5:3][q4:3][q3:3][q2:3][q1:3][q0:3]
Bits:       31-30   29-27 26-24 23-21 20-18 17-15 14-12 11-9   8-6   5-3   2-0
The extraction code:
uint32_t word = packed[word_index];
int pos_in_word = j % 10;
int shift = pos_in_word * 3;
uint32_t q = (word >> shift) & 0x7; // mask = 0b111
GPU Buffer Layout (Packed)
As discussed in the previous chapter, Akunu packs the three MLX tensors into a single GPU buffer for each weight matrix:
Offset 0: Packed U32 weights
Offset weight_bytes: F16 scales
Offset weight_bytes + scale_bytes: F16 biases
The Metal kernel receives the buffer pointer and a weight_bytes parameter. It computes scale and bias offsets arithmetically:
device const half *scales = (device const half *)
((device const char *)weights + params.weight_bytes);
device const half *biases = scales + (params.N * params.K / params.group_size);
Metal Kernel Dequantization Patterns
Each format requires a different dequantization strategy in the GEMV kernel. Here are the common patterns:
Q4_0 GEMV Inner Loop
// Each thread processes a chunk of the K dimension
for (int k = tid; k < K; k += stride) {
int block_idx = k / 32;
int block_off = k % 32;
// Load block header
half d = block_scales[block_idx];
// Load and extract nibble
int byte_idx = block_off / 2;
uint8_t byte = block_data[block_idx * 16 + byte_idx];
int nibble = (block_off & 1) ? (byte >> 4) : (byte & 0x0F);
// Dequantize and accumulate
float w = float(d) * (float(nibble) - 8.0f);
sum += w * float(input[k]);
}
K-Quant GEMV Pattern (Q4_K)
// Process one super-block (256 elements) at a time
for (int sb = ...; sb < n_superblocks; sb++) {
half d = super_scales[sb];
half dmin = super_mins[sb];
// Decode sub-block scales (6-bit from packed bytes)
for (int sub = 0; sub < 8; sub++) {
int sc = decode_6bit_scale(scale_bytes, sub);
int mn = decode_6bit_min(scale_bytes, sub);
float sub_scale = float(d) * sc;
float sub_min = float(dmin) * mn;
for (int k = 0; k < 32; k++) {
int q = extract_nibble(data, sub*32 + k);
float w = sub_scale * q - sub_min;
sum += w * float(input[sb*256 + sub*32 + k]);
}
}
}
MLX GEMV Pattern
// Process one group at a time
for (int g = 0; g < K / group_size; g++) {
half scale = scales[row * n_groups + g];
half bias = biases[row * n_groups + g];
for (int k = 0; k < group_size; k++) {
int global_k = g * group_size + k;
uint32_t word = packed[row * K_packed + global_k / values_per_word];
int pos = global_k % values_per_word;
uint32_t q = (word >> (pos * bits)) & bit_mask;
float w = float(scale) * float(q) + float(bias);
sum += w * float(input[global_k]);
}
}
Quality vs Size Comparison
The following table summarizes quality-size trade-offs. Perplexity numbers are approximate and vary by model, but the relative ordering is consistent.[2]
| Format | Effective bpw | Model size (7B) | Perplexity impact | Best use case |
|---|---|---|---|---|
| F16 | 16.0 | 14 GB | Baseline | Reference / debugging |
| Q8_0 | 8.5 | 7.4 GB | Negligible | Activation quantization |
| Q6_K | 6.56 | 5.7 GB | Very small | Quality-sensitive apps |
| Q5_K | 5.50 | 4.8 GB | Small | Good quality/size balance |
| Q4_K | 4.50 | 3.9 GB | Moderate | Best general-purpose |
| Q4_0 | 4.50 | 3.9 GB | Moderate+ | Fastest decode (simple format) |
| Q3_K | 3.44 | 3.0 GB | Noticeable | Memory-constrained |
| Q2_K | 2.63 | 2.3 GB | Significant | Extreme compression |
| MLX Q4 | ~4.5 | ~3.9 GB | Moderate | MLX ecosystem models |
| MLX Q3 | ~3.5 | ~3.1 GB | Noticeable | MLX ecosystem, low memory |
| MLX Q8 | ~8.5 | ~7.4 GB | Negligible | High quality MLX |
How Akunu Selects Kernels
The dtype code embedded in (or derived from) the weight file determines which kernels are used. Akunu’s DTypeDescriptor table maps each dtype to a complete set of kernel names:
| Dtype | Code | GEMV kernel | GEMM kernel | Fused SiLU | Embedding |
|---|---|---|---|---|---|
| F16 | 1 | gemv_f16 | simd_gemm_f16 | – | embedding_lookup_f16 |
| Q4_0 | 2 | gemv_q4_0 | simd_gemm_q4_0 | gemv_q4_0_silu | embedding_lookup_q4_0 |
| Q4_1 | 3 | gemv_q4_1 | simd_gemm_q4_1 | – | embedding_lookup_q4_1 |
| Q8_0 | 8 | gemv_q8_0 | simd_gemm_q8_0 | – | embedding_lookup_q8_0 |
| Q2_K | 10 | gemv_q2_k | simd_gemm_q2_k | – | – |
| Q3_K | 11 | gemv_q3_k | simd_gemm_q3_k | – | – |
| Q4_K | 12 | gemv_q4_k | simd_gemm_q4_k | – | embedding_lookup_q4_k |
| Q5_K | 13 | gemv_q5_k | simd_gemm_q5_k | – | – |
| Q6_K | 14 | gemv_q6_k | simd_gemm_q6_k | – | embedding_lookup_q6_k |
| BF16 | 31 | gemv_bf16 | simd_gemm_bf16 | – | embedding_lookup_bf16 |
| MLX Q3 | 99 | gemv_mlx_q3 | simd_gemm_mlx_q3 | gemv_mlx_q3_silu | embedding_lookup_mlx_generic |
| MLX Q4 | 100 | gemv_mlx_q4 | simd_gemm_mlx_q4 | gemv_mlx_q4_silu | embedding_lookup_mlx_q4 |
| MLX Q6 | 102 | gemv_mlx_q6 | simd_gemm_mlx_q6 | gemv_mlx_q6_silu | embedding_lookup_mlx_generic |
| MLX Q8 | 101 | gemv_mlx_q8 | simd_gemm_mlx_q8 | gemv_mlx_q8_silu | embedding_lookup_mlx_generic |
Note the pattern: GGUF formats have dtype codes below 32 (matching GGML’s enum), while MLX formats use codes 99-102. This avoids any collision between the two namespaces.
Each descriptor also includes dispatch geometry – the number of rows per threadgroup and the threadgroup size. These are tuned per format because different formats have different computational density:
| Format family | Rows/threadgroup | Threadgroup size | Rationale |
|---|---|---|---|
| F16 | 16 | 128 | Simple dequant, high arithmetic density |
| Q4_0/Q4_1 | 16 | 128 | Simple block format, fast extraction |
| Q8_0 | 32 | 256 | Larger data per block, needs more threads |
| K-quants | 16 | 256 | Complex nested dequant, more ALU work |
| MLX all | 16 | 128 | Group-based, moderate complexity |
Mixed Quantization
Many GGUF models use different quantization levels for different layers. For example, a Q4_K_M quantization (the “M” stands for “medium”, distinguishing it from the smaller “S” and larger “L” variants; all of them mix formats across tensors) might use:
- Q6_K for the attention norms and output norm (small tensors, quality-sensitive)
- Q4_K for most weight matrices
- Q5_K for the first and last few layers
Akunu handles this transparently because get_dtype() returns the per-tensor dtype, and build_dispatch_table() selects the kernel for each weight individually:
snprintf(name, sizeof(name), "layers.%d.attention.q.weight", layer);
uint32_t q_dtype = weights.get_dtype(name); // might be Q4_K
snprintf(name, sizeof(name), "layers.%d.attention.k.weight", layer);
uint32_t k_dtype = weights.get_dtype(name); // might be Q6_K
// Each gets the correct kernel
gemv(input, q_weight, output_q, 0, q_dtype, q_dim, dim);
gemv(input, k_weight, output_k, 0, k_dtype, kv_dim, dim);
Weight fusion (QKV or gate+up) requires matching dtypes – you cannot fuse a Q4_K weight with a Q6_K weight because they have different block layouts. The fusion check verifies this:
bool fuse_qkv = q_dtype == k_dtype && k_dtype == v_dtype;
If the dtypes do not match, Akunu falls back to separate GEMV dispatches.
Practical Guidance
For users choosing a quantization format:
- Q4_K_M is the sweet spot for most use cases. It provides good quality at ~4.5 bpw with the K-quant’s hierarchical scales.
- MLX Q4 is comparable in quality and works well with models from the MLX ecosystem.
- Q4_0 is slightly lower quality than Q4_K but uses simpler block structure, which can be faster for decode (where GEMV is the bottleneck).
- Q6_K or MLX Q8 if you can afford the memory and want near-lossless quality.
- Q2_K and MLX Q3 should be reserved for cases where memory is truly scarce. Quality degradation is noticeable.
For kernel developers:
- The block-of-32 formats (Q4_0, Q4_1, Q8_0) are the easiest to implement. Start there.
- K-quants require careful handling of the nested scale packing. Get the bit extraction right by testing against a reference implementation before optimizing.
- MLX formats are conceptually simpler (uniform group structure, no nested quantization) but require handling the three-tensor buffer layout and function constants for group size and K dimension.
- Always profile with real models. The format with the least memory usage is not always the fastest – simpler dequantization (Q4_0) can outperform complex dequantization (Q4_K) even at the same bit width, because the kernel spends less time on scale lookups.[3]
Notes

1. @ikawrakow, “k-quants: 2, 3, 4, 5, and 6-bit quantization for llama.cpp,” llama.cpp PR #1684, 2023. The key contribution was the super-block architecture that enables usable 2-3 bit quantization. See https://github.com/ggerganov/llama.cpp/pull/1684.
2. Perplexity numbers are from the llama.cpp quantization benchmarks. Exact values depend on the model and evaluation dataset. The relative ordering (F16 > Q6_K > Q5_K > Q4_K > Q4_0 > Q3_K > Q2_K) is consistent across models.
3. On Apple M2 Pro, Q4_0 GEMV for a 4096x4096 matrix runs at approximately 92% of memory bandwidth, while Q4_K achieves about 85%, despite both being ~4.5 bpw. The difference is the 6-bit sub-scale decoding overhead in Q4_K.