Quantization Formats In Depth

This chapter is the definitive reference for every quantization format Akunu supports. We will go byte by byte through the memory layouts, work through dequantization by hand, and build the mental model you need to write or debug quantized GEMV kernels on Metal. If you have ever wondered what exactly lives inside a Q4_0 block, or how K-quant super-blocks manage to pack 256 values with mixed bit widths, this is the chapter.

Why Quantization Matters

A 7B parameter model in F16 requires 14 GB of memory. That exceeds the unified memory of every base-model MacBook Air and most MacBook Pros. Quantize those weights to 4 bits and you are down to 3.5 GB – comfortably fitting on a machine with 8 GB of RAM, with plenty left over for KV cache and activations.

But quantization is not free. You are trading precision for memory, and the format you choose determines both the quality of that trade-off and the computational cost of dequantizing at inference time. GGUF alone defines over a dozen quantization formats. MLX adds its own family. Understanding them is essential for anyone working on inference engines.

GGUF Legacy Formats

These are the original formats from GGML/llama.cpp. They operate on fixed-size blocks of elements, each block containing packed quantized values plus per-block scale factors.

Q4_0: The Simplest Quantized Format

Q4_0 is where most people should start understanding quantization. It is symmetric 4-bit quantization with a single F16 scale per block of 32 elements.

Block layout (18 bytes total):

+------------------+----------------------------------+
| F16 scale (d)    | 16 bytes of nibbles              |
| 2 bytes          | (32 x 4-bit values)              |
+------------------+----------------------------------+
  Byte 0-1           Bytes 2-17

Each byte in the nibble section holds two 4-bit values:

Byte layout:  [lo_nibble : hi_nibble]
              [  q[2i]   :  q[2i+1] ]

Bit positions:  7 6 5 4 3 2 1 0
                ^^^^^^^           hi nibble (bits 4-7) = q[2i+1]
                        ^^^^^^^   lo nibble (bits 0-3) = q[2i]

Dequantization formula:

x[i] = d * (q[i] - 8)

The subtraction of 8 centers the 4-bit range [0, 15] around zero, giving an effective range of [-8, 7]. This is symmetric quantization – the zero point is fixed at 8, not learned.

Worked example: suppose we have a block with scale d = 0.5 (as F16) and the first data byte is 0xA3.

Byte 0xA3 = 1010 0011 in binary

lo nibble = 0011 = 3  -> q[0] = 3
hi nibble = 1010 = 10 -> q[1] = 10

x[0] = 0.5 * (3 - 8)  = 0.5 * (-5) = -2.5
x[1] = 0.5 * (10 - 8) = 0.5 * (2)  =  1.0
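The worked example can be checked in a few lines of host-side C. This is a sketch for verification only – `dequant_q4_0_byte` is a hypothetical helper, not part of Akunu:

```c
#include <assert.h>

/* Decode the two 4-bit values packed in one Q4_0 data byte.
   out[0] gets the low nibble (q[2i]), out[1] the high nibble (q[2i+1]). */
static void dequant_q4_0_byte(float d, unsigned char byte, float out[2]) {
    int q_lo = byte & 0x0F;          /* bits 0-3 */
    int q_hi = (byte >> 4) & 0x0F;   /* bits 4-7 */
    out[0] = d * (float)(q_lo - 8);  /* symmetric: zero point fixed at 8 */
    out[1] = d * (float)(q_hi - 8);
}
```

For byte 0xA3 with d = 0.5, this reproduces the values computed by hand above: -2.5 and 1.0.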

Bit extraction in Metal:

// Extract two Q4_0 values from a byte
uint8_t byte = block_data[j];
int q_lo = (byte & 0x0F);       // bits 0-3
int q_hi = (byte >> 4) & 0x0F;  // bits 4-7

float x_lo = d * ((float)q_lo - 8.0f);
float x_hi = d * ((float)q_hi - 8.0f);

Size calculation:

Parameter          Value
----------------   ---------------------
Block size         32 elements
Scale overhead     2 bytes (F16)
Data               16 bytes (32 nibbles)
Total per block    18 bytes
Bits per weight    18 * 8 / 32 = 4.5 bpw

The overhead of the scale factor means Q4_0 is not exactly 4 bits per weight – it is 4.5 bpw. This is a common source of confusion. The “4” in Q4_0 refers to the quantized value width, not the effective bits per weight.
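The size arithmetic used throughout this chapter can be captured in one small helper (a sketch for checking the tables, not Akunu code):

```c
#include <assert.h>

/* Effective bits per weight for a block format:
   (bytes per block * 8) / (elements per block). */
static double bits_per_weight(int block_bytes, int block_elems) {
    return (double)(block_bytes * 8) / (double)block_elems;
}
```

For Q4_0 this gives bits_per_weight(18, 32) = 4.5, matching the table above.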

Q4_1: Asymmetric 4-bit

Q4_1 adds a minimum value (zero-point) per block, enabling asymmetric quantization:

Block layout (20 bytes):

+------------------+------------------+------------------+
| F16 scale (d)    | F16 minimum (m)  | 16 bytes nibbles |
| 2 bytes          | 2 bytes          | (32 x 4-bit)     |
+------------------+------------------+------------------+
  Bytes 0-1          Bytes 2-3          Bytes 4-19

Dequantization:

x[i] = d * q[i] + m

No subtraction of 8 here – the minimum m handles the offset. The range [0, 15] maps to [m, m + 15*d]. This can better represent distributions that are not centered at zero, at the cost of 2 extra bytes per block.

Parameter              Value
--------------------   ---------------------
Block size             32 elements
Scale + min overhead   4 bytes
Data                   16 bytes
Total per block        20 bytes
Bits per weight        20 * 8 / 32 = 5.0 bpw
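The asymmetric formula can be sketched in host-side C (hypothetical helper; the nibble layout is the same as Q4_0):

```c
#include <assert.h>

/* Q4_1 dequantization for one packed data byte: x = d * q + m. */
static void dequant_q4_1_byte(float d, float m, unsigned char byte,
                              float out[2]) {
    out[0] = d * (float)(byte & 0x0F) + m;         /* low nibble  */
    out[1] = d * (float)((byte >> 4) & 0x0F) + m;  /* high nibble */
}
```

Note that setting m = -8d reduces this to the Q4_0 formula d * (q - 8): Q4_0 is exactly the symmetric special case of Q4_1.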

Q8_0: 8-bit Symmetric

Q8_0 stores each value as a signed 8-bit integer with a single F16 scale per block of 32:

Block layout (34 bytes):

+------------------+----------------------------------+
| F16 scale (d)    | 32 bytes of int8 values          |
| 2 bytes          | (32 x 8-bit signed)              |
+------------------+----------------------------------+
  Bytes 0-1          Bytes 2-33

Dequantization:

x[i] = d * q[i]    // q[i] is int8, range [-128, 127]

No offset subtraction needed because int8 is already signed.

Parameter          Value
----------------   ---------------------
Block size         32 elements
Scale overhead     2 bytes
Data               32 bytes
Total per block    34 bytes
Bits per weight    34 * 8 / 32 = 8.5 bpw

Q8_0 is primarily used for activations in mixed-precision inference, not for weight storage (since 8.5 bpw barely saves memory over F16’s 16 bpw). Its main advantage is that int8 dot products can use SIMD integer multiply-accumulate instructions, which are faster than F16 multiply-accumulate on some hardware.
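The integer multiply-accumulate advantage is easiest to see in a scalar sketch of a per-block dot product (hypothetical helper, not Akunu's kernel): the inner loop is pure int8/int32 arithmetic, and the two block scales are applied once at the end.

```c
#include <assert.h>
#include <stdint.h>

/* Q8_0-style dot product of one 32-element weight block against Q8_0
   activations. The loop stays in integer arithmetic; one float multiply
   per block applies both scales. */
static float q8_dot_block(float d_w, const int8_t *w,
                          float d_x, const int8_t *x) {
    int32_t acc = 0;
    for (int i = 0; i < 32; i++)
        acc += (int32_t)w[i] * (int32_t)x[i];  /* integer MAC */
    return d_w * d_x * (float)acc;
}
```

On hardware with SIMD int8 dot-product instructions, the loop body maps onto a handful of vector instructions, which is where the speedup over F16 multiply-accumulate comes from.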

K-Quants: Super-Block Architecture

The K-quant family (Q2_K through Q6_K) was introduced by @ikawrakow in llama.cpp to improve quantization quality at low bit widths.1 The key insight is that a single scale factor per 32 elements is too coarse for 2-3 bit quantization – the approximation error is unacceptably high. K-quants solve this with a two-level hierarchy: super-blocks of 256 elements containing multiple sub-blocks, each with its own scale.

Super-block (256 elements)
+-----------------------------------------------------------+
|  Sub-block scales (quantized to fewer bits themselves)     |
|  Super-block scale (F16) + super-block min (F16)          |
+-----------------------------------------------------------+
|  Sub-block 0 (32 elements, quantized data)                |
|  Sub-block 1 (32 elements, quantized data)                |
|  ...                                                      |
|  Sub-block 7 (32 elements, quantized data)                |
+-----------------------------------------------------------+

The sub-block scales are themselves quantized (usually to 4 or 6 bits), and the super-block scale converts them back to floating point. This is nested quantization – you quantize the quantization parameters.

Q2_K: 2-bit with 4-bit Sub-Block Scales

The most aggressive K-quant. Each element gets only 2 bits, but the hierarchical scales keep quality surprisingly usable.

Super-block layout (256 elements, 84 bytes):

+--------------------------------------------------+
| F16 super-scale (d)        | 2 bytes             |
| F16 super-minimum (dmin)   | 2 bytes             |
| 16 bytes: sub-block scales | (16 x 4-bit pairs)  |
|   each byte: [scale_hi:scale_lo]                 |
| 64 bytes: quantized data   | (256 x 2-bit)       |
+--------------------------------------------------+

The 16 bytes of sub-block scales encode 16 pairs of (scale, minimum) values, one for each sub-block of 16 elements. Each pair is packed into a single byte as two 4-bit values.

Dequantization (for element i in sub-block j):

sub_scale = (scales_byte[j] & 0x0F)
sub_min   = (scales_byte[j] >> 4)

x[i] = d * sub_scale * q[i] - dmin * sub_min

Parameter          Value
----------------   --------------------------------------------
Super-block size   256 elements
Overhead           2 (d) + 2 (dmin) + 16 (sub-scales) = 20 bytes
Data               64 bytes (256 x 2-bit)
Total              84 bytes
Bits per weight    84 * 8 / 256 = 2.625 bpw
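The sub-scale decoding can be sketched in host-side C, given an already-extracted 2-bit value q (hypothetical helper following the byte layout described above):

```c
#include <assert.h>
#include <stdint.h>

/* Q2_K dequantization for one element in sub-block `sub`: each scale byte
   packs a 4-bit scale (low nibble) and 4-bit min (high nibble). */
static float dequant_q2_k(float d, float dmin,
                          const uint8_t *scale_bytes, int sub, int q) {
    int sub_scale = scale_bytes[sub] & 0x0F;
    int sub_min   = scale_bytes[sub] >> 4;
    return d * (float)sub_scale * (float)q - dmin * (float)sub_min;
}
```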

Q3_K: 3-bit with Mixed Sub-Block Scales

Q3_K uses 3 bits per value with 256-element super-blocks.

Super-block layout (256 elements, 110 bytes):

+--------------------------------------------------+
| F16 super-scale (d)                | 2 bytes     |
| 12 bytes: quantized sub-scales     |             |
|   (16 scales, 6-bit each, packed)  |             |
| 32 bytes: high-bits of quants      |             |
|   (256 bits, 1 per element)        |             |
| 64 bytes: low-bits of quants       |             |
|   (256 x 2-bit)                    |             |
+--------------------------------------------------+

The 3-bit values are split across two regions: the low 2 bits are packed into the 64-byte “quants low” section (like Q2_K), and the high bit is stored separately in the 32-byte “high bits” section. This split layout simplifies SIMD extraction.

Dequantization:

q_lo = (quants_lo[i/4] >> (2 * (i%4))) & 0x03  // 2 low bits
q_hi = (hmask[i/8] >> (i%8)) & 1                // 1 high bit
q = q_lo | (q_hi << 2)                          // 3-bit value [0..7]

x[i] = d * sub_scale * (q - 4)                  // center at 4

Parameter          Value
----------------   ---------------------------
Super-block size   256 elements
Total              110 bytes
Bits per weight    110 * 8 / 256 = 3.4375 bpw
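The split-bit reconstruction above can be written as a runnable host-side sketch (hypothetical helper, not Akunu's kernel):

```c
#include <assert.h>
#include <stdint.h>

/* Reconstruct the i-th 3-bit Q3_K value: 2 low bits in quants_lo
   (4 values per byte), 1 high bit in hmask (8 values per byte). */
static int q3_k_value(const uint8_t *quants_lo, const uint8_t *hmask, int i) {
    int q_lo = (quants_lo[i / 4] >> (2 * (i % 4))) & 0x03;
    int q_hi = (hmask[i / 8] >> (i % 8)) & 1;
    return (q_lo | (q_hi << 2)) - 4;  /* center the [0..7] range at 4 */
}
```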

Q4_K: 4-bit K-Quant

Q4_K is the workhorse of the K-quant family. It provides a good balance of quality and compression that works well for most models.

Super-block layout (256 elements, 144 bytes):

+--------------------------------------------------+
| F16 super-scale (d)                | 2 bytes     |
| F16 super-minimum (dmin)           | 2 bytes     |
| 12 bytes: sub-block scales+mins    |             |
|   (8 sub-blocks, 6-bit scale +     |             |
|    6-bit min, packed)               |             |
| 128 bytes: quantized data          |             |
|   (256 x 4-bit nibbles)            |             |
+--------------------------------------------------+

Each of the 8 sub-blocks has a 6-bit scale and a 6-bit minimum, packed into the 12-byte scale section. The packing is non-trivial – the 6-bit values are split across multiple bytes.

Scale packing detail (12 bytes for 8 sub-blocks):

Bytes 0-3:  low 4 bits of scales[0..7]  (4 bits each, 2 per byte)
Bytes 4-7:  low 4 bits of mins[0..7]    (4 bits each, 2 per byte)
Bytes 8-9:  high 2 bits of scales[0..7] (2 bits each, packed)
Bytes 10-11: high 2 bits of mins[0..7]  (2 bits each, packed)
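A host-side sketch of this unpacking, following the layout just described (the helper names and the LSB-first bit order within each byte are assumptions of this sketch):

```c
#include <assert.h>
#include <stdint.h>

/* Reconstruct the 6-bit scale and min for sub-block j from the 12-byte
   Q4_K scale section. */
static int q4_k_scale(const uint8_t *s, int j) {
    int low4  = (s[j / 2] >> (4 * (j % 2))) & 0x0F;      /* bytes 0-3 */
    int high2 = (s[8 + j / 4] >> (2 * (j % 4))) & 0x03;  /* bytes 8-9 */
    return low4 | (high2 << 4);
}

static int q4_k_min(const uint8_t *s, int j) {
    int low4  = (s[4 + j / 2] >> (4 * (j % 2))) & 0x0F;   /* bytes 4-7   */
    int high2 = (s[10 + j / 4] >> (2 * (j % 4))) & 0x03;  /* bytes 10-11 */
    return low4 | (high2 << 4);
}
```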

Dequantization:

scale_6bit = low4(scales, j) | (high2(scales, j) << 4)
min_6bit   = low4(mins, j)   | (high2(mins, j) << 4)

x[i] = d * scale_6bit * q[i] - dmin * min_6bit

Parameter          Value
----------------   ------------------------
Super-block size   256 elements
Total              144 bytes
Bits per weight    144 * 8 / 256 = 4.5 bpw

Q5_K: 5-bit K-Quant

Q5_K extends Q4_K with an extra bit per value:

Super-block layout (256 elements, 176 bytes):

+--------------------------------------------------+
| F16 super-scale (d)                | 2 bytes     |
| F16 super-minimum (dmin)           | 2 bytes     |
| 12 bytes: sub-block scales+mins    |             |
| 128 bytes: low nibbles             |             |
|   (256 x 4-bit)                    |             |
| 32 bytes: high bits                |             |
|   (256 x 1-bit)                    |             |
+--------------------------------------------------+

Like Q3_K, the 5th bit is stored separately from the low 4 bits. This allows the low-nibble extraction to use the same SIMD patterns as Q4_K.

Parameter          Value
----------------   ------------------------
Super-block size   256 elements
Total              176 bytes
Bits per weight    176 * 8 / 256 = 5.5 bpw

Q6_K: 6-bit K-Quant

Q6_K is the highest-quality K-quant, approaching F16 accuracy for most models.

Super-block layout (256 elements, 210 bytes):

+--------------------------------------------------+
| F16 super-scale (d)                | 2 bytes     |
| 16 bytes: sub-block scales (int8)  |             |
| 128 bytes: low nibbles             |             |
|   (256 x 4-bit)                    |             |
| 64 bytes: high dibits              |             |
|   (256 x 2-bit)                    |             |
+--------------------------------------------------+

Q6_K simplifies the scale storage: each sub-block scale is a full int8 value (not quantized further). There is no separate minimum – Q6_K uses symmetric quantization like Q4_0.

Dequantization:

q_lo = (quants_lo[i/2] >> (4*(i%2))) & 0x0F     // 4 low bits
q_hi = (quants_hi[i/4] >> (2*(i%4))) & 0x03     // 2 high bits
q = q_lo | (q_hi << 4)                           // 6-bit value [0..63]

x[i] = d * sub_scale_int8 * (q - 32)            // center at 32

Parameter          Value
----------------   ---------------------------
Super-block size   256 elements
Total              210 bytes
Bits per weight    210 * 8 / 256 = 6.5625 bpw
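As with Q3_K, the split storage can be sketched in host-side C (hypothetical helper):

```c
#include <assert.h>
#include <stdint.h>

/* Reconstruct the i-th 6-bit Q6_K value: 4 low bits in quants_lo
   (2 values per byte), 2 high bits in quants_hi (4 values per byte). */
static int q6_k_value(const uint8_t *quants_lo, const uint8_t *quants_hi,
                      int i) {
    int q_lo = (quants_lo[i / 2] >> (4 * (i % 2))) & 0x0F;
    int q_hi = (quants_hi[i / 4] >> (2 * (i % 4))) & 0x03;
    return (q_lo | (q_hi << 4)) - 32;  /* center the [0..63] range at 32 */
}
```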

K-Quant Summary Table

Format   Bits/value   Block size   Bytes/block   Effective bpw   Scale type     Symmetry
------   ----------   ----------   -----------   -------------   ------------   ----------
Q2_K     2            256          84            2.63            4-bit nested   Asymmetric
Q3_K     3            256          110           3.44            6-bit nested   Symmetric
Q4_K     4            256          144           4.50            6-bit nested   Asymmetric
Q5_K     5            256          176           5.50            6-bit nested   Asymmetric
Q6_K     6            256          210           6.56            int8           Symmetric

And for comparison, the legacy formats:

Format   Bits/value   Block size   Bytes/block   Effective bpw   Scale type      Symmetry
------   ----------   ----------   -----------   -------------   -------------   ----------
Q4_0     4            32           18            4.50            F16             Symmetric
Q4_1     4            32           20            5.00            F16 + F16 min   Asymmetric
Q5_0     5            32           22            5.50            F16             Symmetric
Q8_0     8            32           34            8.50            F16             Symmetric

MLX Per-Group Quantization

MLX takes a different approach. Rather than defining custom block layouts with packed scales, MLX uses a straightforward per-group scheme with separate tensors for weights, scales, and biases.

Layout

For a weight matrix of shape [N, K] quantized to B bits with group size G:

Weight tensor:  shape [N, K*B/32], dtype U32
Scales tensor:  shape [N, K/G],    dtype F16 or BF16
Biases tensor:  shape [N, K/G],    dtype F16 or BF16

Each U32 word packs 32/B quantized values. The values within a U32 are stored contiguously from LSB to MSB.

Bit Extraction

For B-bit quantization, extracting the j-th value from a U32:

uint32_t word = packed_weights[word_index];
uint32_t mask = (1u << B) - 1;          // B ones
int shift = (j % (32 / B)) * B;
uint32_t q = (word >> shift) & mask;

Example: 4-bit extraction from U32 word 0xFEDCBA98:

Binary: 1111 1110 1101 1100 1011 1010 1001 1000

Value 0 (bits 0-3):   1000 = 8
Value 1 (bits 4-7):   1001 = 9
Value 2 (bits 8-11):  1010 = 10
Value 3 (bits 12-15): 1011 = 11
Value 4 (bits 16-19): 1100 = 12
Value 5 (bits 20-23): 1101 = 13
Value 6 (bits 24-27): 1110 = 14
Value 7 (bits 28-31): 1111 = 15
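The extraction table can be verified with a short C sketch (hypothetical helper; j here is the position within one U32 word):

```c
#include <assert.h>
#include <stdint.h>

/* Extract the j-th B-bit value from a packed U32, LSB-first. */
static uint32_t mlx_extract(uint32_t word, int bits, int j) {
    uint32_t mask = (1u << bits) - 1;  /* `bits` ones */
    return (word >> (j * bits)) & mask;
}
```

Running this over 0xFEDCBA98 with bits = 4 yields 8, 9, ..., 15 for j = 0..7, matching the table above.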

Dequantization

group_index = j / G
x[i][j] = scales[i][group_index] * q[i][j] + biases[i][group_index]

This is asymmetric affine quantization. The bias acts as a zero-point, allowing the quantization grid to cover any range, not just one centered at zero.

MLX Bit Width Variants

Akunu supports four MLX quantization widths, each mapped to a dtype code:

Bit width   Dtype code   Values per U32          Typical group size   Effective bpw
---------   ----------   ---------------------   ------------------   -------------
3-bit       99           10 (+ 2 bits padding)   64                   ~3.5
4-bit       100          8                       64                   ~4.5
6-bit       102          5 (+ 2 bits padding)    64                   ~6.5
8-bit       101          4                       64                   ~8.5

The effective bpw includes the overhead of scale and bias storage. For a [4096, 4096] matrix with group size 64:

Weight data:    4096 * 4096 * B / 8 bytes
Scale data:     4096 * (4096/64) * 2 bytes = 4096 * 64 * 2 = 524,288 bytes
Bias data:      same as scale = 524,288 bytes
Total overhead: 1,048,576 bytes (~1 MB)

This overhead is constant regardless of bit width, and is small relative to the weight data for large matrices.

3-bit Packing Detail

3-bit is the most irregular case because 32 is not evenly divisible by 3. MLX packs 10 three-bit values into each U32 (10 * 3 = 30 bits), leaving 2 bits unused:

U32 word: [unused:2][q9:3][q8:3][q7:3][q6:3][q5:3][q4:3][q3:3][q2:3][q1:3][q0:3]
Bits:      31-30    29-27  26-24  23-21  20-18  17-15  14-12  11-9   8-6    5-3    2-0

The extraction code:

uint32_t word = packed[word_index];
int pos_in_word = j % 10;
int shift = pos_in_word * 3;
uint32_t q = (word >> shift) & 0x7;  // mask = 0b111

GPU Buffer Layout (Packed)

As discussed in the previous chapter, Akunu packs the three MLX tensors into a single GPU buffer for each weight matrix:

Offset 0:                           Packed U32 weights
Offset weight_bytes:                F16 scales
Offset weight_bytes + scale_bytes:  F16 biases

The Metal kernel receives the buffer pointer and a weight_bytes parameter. It computes scale and bias offsets arithmetically:

device const half *scales = (device const half *)
    ((device const char *)weights + params.weight_bytes);
device const half *biases = scales + (params.N * params.K / params.group_size);

Metal Kernel Dequantization Patterns

Each format requires a different dequantization strategy in the GEMV kernel. Here are the common patterns:

Q4_0 GEMV Inner Loop

// Each thread processes a chunk of the K dimension
for (int k = tid; k < K; k += stride) {
    int block_idx = k / 32;
    int block_off = k % 32;

    // Load block header
    half d = block_scales[block_idx];

    // Load and extract nibble
    int byte_idx = block_off / 2;
    uint8_t byte = block_data[block_idx * 16 + byte_idx];
    int nibble = (block_off & 1) ? (byte >> 4) : (byte & 0x0F);

    // Dequantize and accumulate
    float w = float(d) * (float(nibble) - 8.0f);
    sum += w * float(input[k]);
}

K-Quant GEMV Pattern (Q4_K)

// Process one super-block (256 elements) at a time
for (int sb = ...; sb < n_superblocks; sb++) {
    half d = super_scales[sb];
    half dmin = super_mins[sb];

    // Decode sub-block scales (6-bit from packed bytes)
    for (int sub = 0; sub < 8; sub++) {
        int sc = decode_6bit_scale(scale_bytes, sub);
        int mn = decode_6bit_min(scale_bytes, sub);

        float sub_scale = float(d) * sc;
        float sub_min = float(dmin) * mn;

        for (int k = 0; k < 32; k++) {
            int q = extract_nibble(data, sub*32 + k);
            float w = sub_scale * q - sub_min;
            sum += w * float(input[sb*256 + sub*32 + k]);
        }
    }
}

MLX GEMV Pattern

// Process one group at a time
for (int g = 0; g < K / group_size; g++) {
    half scale = scales[row * n_groups + g];
    half bias = biases[row * n_groups + g];

    for (int k = 0; k < group_size; k++) {
        int global_k = g * group_size + k;
        uint32_t word = packed[row * K_packed + global_k / values_per_word];
        int pos = global_k % values_per_word;
        uint32_t q = (word >> (pos * bits)) & bit_mask;

        float w = float(scale) * float(q) + float(bias);
        sum += w * float(input[global_k]);
    }
}

Quality vs Size Comparison

The following table summarizes quality-size trade-offs. Perplexity numbers are approximate and vary by model, but the relative ordering is consistent.2

Format   Effective bpw   Model size (7B)   Perplexity impact   Best use case
------   -------------   ---------------   -----------------   ------------------------------
F16      16.0            14 GB             Baseline            Reference / debugging
Q8_0     8.5             7.4 GB            Negligible          Activation quantization
Q6_K     6.56            5.7 GB            Very small          Quality-sensitive apps
Q5_K     5.50            4.8 GB            Small               Good quality/size balance
Q4_K     4.50            3.9 GB            Moderate            Best general-purpose
Q4_0     4.50            3.9 GB            Moderate+           Fastest decode (simple format)
Q3_K     3.44            3.0 GB            Noticeable          Memory-constrained
Q2_K     2.63            2.3 GB            Significant         Extreme compression
MLX Q4   ~4.5            ~3.9 GB           Moderate            MLX ecosystem models
MLX Q3   ~3.5            ~3.1 GB           Noticeable          MLX ecosystem, low memory
MLX Q8   ~8.5            ~7.4 GB           Negligible          High quality MLX

How Akunu Selects Kernels

The dtype code embedded in (or derived from) the weight file determines which kernels are used. Akunu’s DTypeDescriptor table maps each dtype to a complete set of kernel names:

Dtype    Code   GEMV kernel   GEMM kernel        Fused SiLU         Embedding
------   ----   -----------   ----------------   ----------------   ----------------------------
F16      1      gemv_f16      simd_gemm_f16      -                  embedding_lookup_f16
Q4_0     2      gemv_q4_0     simd_gemm_q4_0     gemv_q4_0_silu     embedding_lookup_q4_0
Q4_1     3      gemv_q4_1     simd_gemm_q4_1     -                  embedding_lookup_q4_1
Q8_0     8      gemv_q8_0     simd_gemm_q8_0     -                  embedding_lookup_q8_0
Q2_K     10     gemv_q2_k     simd_gemm_q2_k     -                  -
Q3_K     11     gemv_q3_k     simd_gemm_q3_k     -                  -
Q4_K     12     gemv_q4_k     simd_gemm_q4_k     -                  embedding_lookup_q4_k
Q5_K     13     gemv_q5_k     simd_gemm_q5_k     -                  -
Q6_K     14     gemv_q6_k     simd_gemm_q6_k     -                  embedding_lookup_q6_k
BF16     31     gemv_bf16     simd_gemm_bf16     -                  embedding_lookup_bf16
MLX Q3   99     gemv_mlx_q3   simd_gemm_mlx_q3   gemv_mlx_q3_silu   embedding_lookup_mlx_generic
MLX Q4   100    gemv_mlx_q4   simd_gemm_mlx_q4   gemv_mlx_q4_silu   embedding_lookup_mlx_q4
MLX Q6   102    gemv_mlx_q6   simd_gemm_mlx_q6   gemv_mlx_q6_silu   embedding_lookup_mlx_generic
MLX Q8   101    gemv_mlx_q8   simd_gemm_mlx_q8   gemv_mlx_q8_silu   embedding_lookup_mlx_generic

Note the pattern: GGUF formats have dtype codes below 32 (matching GGML’s enum), while MLX formats use codes 99-102. This avoids any collision between the two namespaces.

Each descriptor also includes dispatch geometry – the number of rows per threadgroup and the threadgroup size. These are tuned per format because different formats have different computational density:

Format family   Rows/threadgroup   Threadgroup size   Rationale
-------------   ----------------   ----------------   -----------------------------------------
F16             16                 128                Simple dequant, high arithmetic density
Q4_0/Q4_1       16                 128                Simple block format, fast extraction
Q8_0            32                 256                Larger data per block, needs more threads
K-quants        16                 256                Complex nested dequant, more ALU work
MLX all         16                 128                Group-based, moderate complexity

Mixed Quantization

Many GGUF models use different quantization levels for different layers. For example, a Q4_K_M quantization (the “M” stands for “medium”, distinguishing it from the smaller “S” mix) might use:

  • Q6_K for the attention norms and output norm (small tensors, quality-sensitive)
  • Q4_K for most weight matrices
  • Q5_K for the first and last few layers

Akunu handles this transparently because get_dtype() returns the per-tensor dtype, and build_dispatch_table() selects the kernel for each weight individually:

snprintf(name, sizeof(name), "layers.%d.attention.q.weight", layer);
uint32_t q_dtype = weights.get_dtype(name);  // might be Q4_K

snprintf(name, sizeof(name), "layers.%d.attention.k.weight", layer);
uint32_t k_dtype = weights.get_dtype(name);  // might be Q6_K

// Each gets the correct kernel
gemv(input, q_weight, output_q, 0, q_dtype, q_dim, dim);
gemv(input, k_weight, output_k, 0, k_dtype, kv_dim, dim);

Weight fusion (QKV or gate+up) requires matching dtypes – you cannot fuse a Q4_K weight with a Q6_K weight because they have different block layouts. The fusion check verifies this:

bool fuse_qkv = q_dtype == k_dtype && k_dtype == v_dtype;

If the dtypes do not match, Akunu falls back to separate GEMV dispatches.

Practical Guidance

For users choosing a quantization format:

  • Q4_K_M is the sweet spot for most use cases. It provides good quality at ~4.5 bpw with the K-quant’s hierarchical scales.
  • MLX Q4 is comparable in quality and works well with models from the MLX ecosystem.
  • Q4_0 is slightly lower quality than Q4_K but uses a simpler block structure, which can be faster for decode (where GEMV is the bottleneck).
  • Q6_K or MLX Q8 if you can afford the memory and want near-lossless quality.
  • Q2_K and MLX Q3 should be reserved for cases where memory is truly scarce. Quality degradation is noticeable.

For kernel developers:

  • The block-of-32 formats (Q4_0, Q4_1, Q8_0) are the easiest to implement. Start there.
  • K-quants require careful handling of the nested scale packing. Get the bit extraction right by testing against a reference implementation before optimizing.
  • MLX formats are conceptually simpler (uniform group structure, no nested quantization) but require handling the three-tensor buffer layout and function constants for group size and K dimension.
  • Always profile with real models. The format with the least memory usage is not always the fastest – simpler dequantization (Q4_0) can outperform complex dequantization (Q4_K) even at the same bit width, because the kernel spends less time on scale lookups.3

  1. @ikawrakow, “k-quants: 2, 3, 4, 5, and 6-bit quantization for llama.cpp,” llama.cpp PR #1684, 2023. The key contribution was the super-block architecture that enables usable 2-3 bit quantization. See https://github.com/ggerganov/llama.cpp/pull/1684.

  2. Perplexity numbers are from the llama.cpp quantization benchmarks. Exact values depend on the model and evaluation dataset. The relative ordering (F16 > Q6_K > Q5_K > Q4_K > Q4_0 > Q3_K > Q2_K) is consistent across models.

  3. On Apple M2 Pro, Q4_0 GEMV for a 4096x4096 matrix runs at approximately 92% of memory bandwidth, while Q4_K achieves about 85%, despite both being ~4.5 bpw. The difference is the 6-bit sub-scale decoding overhead in Q4_K.