Quantization: Making Models Fit

Here is a problem you will hit immediately when trying to run LLMs on consumer hardware: a 7-billion-parameter model in FP16 takes 14 GB of memory. A base M2 MacBook Air has 8 GB of unified memory. A 70B model needs 140 GB in FP16 – more than almost any single GPU on the market. Even if you have enough memory, the model must stream through the memory bus during inference, and memory bandwidth is the bottleneck for token generation.

Quantization is the solution. By representing weights with fewer bits – 8, 4, 3, or even 2 bits instead of 16 – we shrink the model dramatically. A 7B model at 4-bit quantization fits in about 4 GB. The 70B model fits in 35-40 GB. And because there is less data to read from memory, inference gets faster too.

But quantization is not free. Fewer bits means less precision, which means some degradation in output quality. The art of quantization is finding the sweet spot: aggressive enough to fit in memory and hit target speeds, but gentle enough to preserve model quality.

In this chapter, we will cover the major quantization schemes you will encounter in practice, understand the math behind them, work through concrete examples of quantizing and dequantizing, and analyze the tradeoffs between size, speed, and quality.

Why Quantize? The Numbers

Let us start with the raw arithmetic that makes quantization essential:

  Model Sizes at Different Precisions
  =====================================

  Parameters    FP32     FP16     Q8_0     Q4_0     Q4_K_M    Q2_K
  ----------    ----     ----     ----     ----     ------    ----
  1.5B          6 GB     3 GB    1.6 GB   0.9 GB   1.0 GB   0.6 GB
  7B           28 GB    14 GB    7.5 GB   4.0 GB   4.4 GB   2.7 GB
  13B          52 GB    26 GB   13.8 GB   7.4 GB   8.0 GB   4.9 GB
  34B         136 GB    68 GB   36.1 GB  19.2 GB  21.1 GB  12.7 GB
  70B         280 GB   140 GB   74.4 GB  39.6 GB  43.4 GB  26.2 GB

  Apple Silicon Unified Memory:
  M1/M2/M3 (base):      8 GB   -> Q4_0 7B fits, FP16 does not
  M1/M2/M3 Pro:        18 GB   -> Q4_0 13B fits
  M1/M2/M3 Max:        64 GB   -> Q4_0 70B fits
  M2/M3 Ultra:        192 GB   -> FP16 70B fits
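
These sizes follow directly from parameter count times bits per weight; a one-line Python helper makes the arithmetic explicit (a sketch – the table entries also fold in each model's exact parameter count, so they differ slightly):

```python
def model_size_gb(params_billion, bits_per_weight):
    """Approximate model size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8

print(model_size_gb(7, 16))    # 14.0  (FP16 7B)
print(model_size_gb(70, 4.5))  # ~39.4 (Q4_0 70B, 4.5 effective bits/weight)
```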

But memory capacity is only half the story. The other half is bandwidth:

  The Bandwidth Equation
  =======================

  During autoregressive decoding (generating one token at a time),
  EVERY weight in the model is read from memory exactly once per token.

  decode_speed (tokens/sec) = memory_bandwidth / model_size_in_bytes

  Apple M2 Pro: ~200 GB/s bandwidth

  FP16 7B:   200 / 14.0 = ~14 tokens/sec
  Q8_0 7B:   200 / 7.5  = ~27 tokens/sec
  Q4_0 7B:   200 / 4.0  = ~50 tokens/sec
  Q4_K_M 7B: 200 / 4.4  = ~45 tokens/sec
  Q2_K 7B:   200 / 2.7  = ~74 tokens/sec

  This is a theoretical upper bound -- real speeds are 60-80% of this
  due to attention computation, KV cache reads, and other overhead.
  But the proportionality holds: half the bytes = double the speed.

  Decode Speed vs Quantization
  =============================

  Speed                Model: 7B on M2 Pro (200 GB/s)
  (tok/s)
  80 |                                              *  Q2_K
     |
  70 |
     |
  60 |
     |                                 *  Q4_0
  50 |                              *  Q4_K_M
     |
  40 |
     |
  30 |                 *  Q8_0
     |
  20 |
     |     *  FP16
  10 |
     |
   0 +-----+------+------+------+------+------+----->
         16     8      6      4      3      2    bits/weight

The Fundamentals of Quantization

At its core, quantization maps a continuous range of floating-point values to a discrete set of integer values. The simplest form is uniform affine quantization:

  Uniform Quantization
  =====================

  Given: a set of FP16 weights in range [w_min, w_max]
  Goal:  represent each weight using b bits (0 to 2^b - 1)

  Quantize:
    scale = (w_max - w_min) / (2^b - 1)
    zero_point = round(-w_min / scale)
    q[i] = round(w[i] / scale) + zero_point

  Dequantize:
    w[i] = scale * (q[i] - zero_point)

  Example: b=4 bits (range 0..15)
    Weights: [-0.8, 0.3, -0.1, 0.5, -0.6, 0.7]
    w_min = -0.8, w_max = 0.7
    scale = (0.7 - (-0.8)) / 15 = 0.1
    zero_point = round(0.8 / 0.1) = 8

    Quantize:
    q[-0.8] = round(-0.8/0.1) + 8 = -8 + 8 = 0
    q[0.3]  = round(0.3/0.1) + 8  =  3 + 8 = 11
    q[-0.1] = round(-0.1/0.1) + 8 = -1 + 8 = 7
    q[0.5]  = round(0.5/0.1) + 8  =  5 + 8 = 13
    q[-0.6] = round(-0.6/0.1) + 8 = -6 + 8 = 2
    q[0.7]  = round(0.7/0.1) + 8  =  7 + 8 = 15

    Stored: [0, 11, 7, 13, 2, 15]  (each is 4 bits)

    Dequantize:
    w[0]  = 0.1 * (0 - 8)  = -0.8   (exact)
    w[11] = 0.1 * (11 - 8) =  0.3   (exact)
    w[7]  = 0.1 * (7 - 8)  = -0.1   (exact)
    w[13] = 0.1 * (13 - 8) =  0.5   (exact)
    w[2]  = 0.1 * (2 - 8)  = -0.6   (exact)
    w[15] = 0.1 * (15 - 8) =  0.7   (exact)

  In this lucky example, all values hit exactly.
  In reality, most values get rounded, introducing error.
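
The scheme above translates directly into a few lines of Python (a sketch – the function names are mine, and real implementations also clamp to the representable range):

```python
def quantize_uniform(weights, bits=4):
    """Uniform affine quantization: map [w_min, w_max] onto 0 .. 2^bits - 1."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / (2**bits - 1)
    zero_point = round(-w_min / scale)
    q = [round(w / scale) + zero_point for w in weights]
    return q, scale, zero_point

def dequantize_uniform(q, scale, zero_point):
    return [scale * (qi - zero_point) for qi in q]

weights = [-0.8, 0.3, -0.1, 0.5, -0.6, 0.7]
q, scale, zp = quantize_uniform(weights)
print(q)    # [0, 11, 7, 13, 2, 15]  -- matches the worked example
print(zp)   # 8
```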

But this global quantization (one scale and zero-point for the entire weight matrix) is too crude for neural networks. The weight distribution varies significantly across different parts of the matrix. Modern quantization uses block quantization: the weights are divided into small blocks, each with its own scale (and possibly zero-point).

Block Quantization: GGUF Format (Q4_0)

The GGUF format (used by llama.cpp and many Metal inference engines) uses block quantization. The simplest variant, Q4_0, works as follows:

  Q4_0 Block Structure
  =====================

  Block size: 32 weights
  Storage per block:
    - 1 x FP16 scale factor (2 bytes)
    - 32 x 4-bit signed integers packed into 16 bytes
  Total: 18 bytes for 32 weights = 4.5 bits/weight

  Quantization (symmetric, no zero point):
    For each block of 32 weights:
      abs_max = max(|w[0]|, |w[1]|, ..., |w[31]|)
      scale = abs_max / 8     (maps to range [-8, 7])

      For each weight w[i]:
        q[i] = clamp(round(w[i] / scale), -8, 7)
        Store as (q[i] + 8), giving range [0, 15] (4 bits)

  Dequantization:
    w[i] = scale * (stored[i] - 8)

  Memory layout (18 bytes per block):
  +--------+--------+--------+---+--------+
  | scale  | byte 0 | byte 1 |...| byte15 |
  | (FP16) | w0|w1  | w2|w3  |   |w30|w31 |
  +--------+--------+--------+---+--------+
     2B        1B       1B          1B
             (2 nibbles packed per byte)

Worked Example: Quantizing a Block with Q4_0

  Example: Quantize 32 weights with Q4_0
  ========================================

  Weights (first 8 of 32 shown):
  [ 0.23, -0.41, 0.67, -0.12, 0.55, -0.89, 0.34, 0.08, ... ]

  Step 1: Find absolute maximum
    abs_max = max(|all 32 weights|) = 0.89  (from -0.89)

  Step 2: Compute scale
    scale = 0.89 / 8 = 0.11125

  Step 3: Quantize each weight
    q[0] = round(0.23 / 0.11125)  = round(2.067)  =  2  -> stored: 2+8 = 10
    q[1] = round(-0.41 / 0.11125) = round(-3.685)  = -4  -> stored: -4+8 = 4
    q[2] = round(0.67 / 0.11125)  = round(6.022)  =  6  -> stored: 6+8 = 14
    q[3] = round(-0.12 / 0.11125) = round(-1.079)  = -1  -> stored: -1+8 = 7
    q[4] = round(0.55 / 0.11125)  = round(4.944)  =  5  -> stored: 5+8 = 13
    q[5] = round(-0.89 / 0.11125) = round(-8.000)  = -8  -> stored: -8+8 = 0
    q[6] = round(0.34 / 0.11125)  = round(3.056)  =  3  -> stored: 3+8 = 11
    q[7] = round(0.08 / 0.11125)  = round(0.719)  =  1  -> stored: 1+8 = 9

  Step 4: Pack into bytes (two 4-bit values per byte)
    byte[0] = (stored[1] << 4) | stored[0] = (4 << 4) | 10 = 0x4A
    byte[1] = (stored[3] << 4) | stored[2] = (7 << 4) | 14 = 0x7E
    byte[2] = (stored[5] << 4) | stored[4] = (0 << 4) | 13 = 0x0D
    byte[3] = (stored[7] << 4) | stored[6] = (9 << 4) | 11 = 0x9B

  Step 5: Dequantize (to verify)
    dq[0] = 0.11125 * (10 - 8) = 0.11125 * 2  =  0.2225  (was 0.23, error: 0.008)
    dq[1] = 0.11125 * (4 - 8)  = 0.11125 * -4 = -0.4450  (was -0.41, error: 0.035)
    dq[2] = 0.11125 * (14 - 8) = 0.11125 * 6  =  0.6675  (was 0.67, error: 0.003)
    dq[3] = 0.11125 * (7 - 8)  = 0.11125 * -1 = -0.1113  (was -0.12, error: 0.009)
    dq[4] = 0.11125 * (13 - 8) = 0.11125 * 5  =  0.5563  (was 0.55, error: 0.006)
    dq[5] = 0.11125 * (0 - 8)  = 0.11125 * -8 = -0.8900  (was -0.89, error: 0.000)
    dq[6] = 0.11125 * (11 - 8) = 0.11125 * 3  =  0.3338  (was 0.34, error: 0.006)
    dq[7] = 0.11125 * (9 - 8)  = 0.11125 * 1  =  0.1113  (was 0.08, error: 0.031)

  Average absolute error: ~0.012
  Relative error: ~3-4% on average
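
The same steps, as a minimal Python sketch of Q4_0-style block quantization (helper names are mine; real implementations operate on full 32-weight blocks and pack the nibbles as shown in Step 4):

```python
def quantize_q4_0_block(block):
    """Symmetric 4-bit quantization: one scale per block, levels in [-8, 7]."""
    scale = max(abs(w) for w in block) / 8
    stored = [max(-8, min(7, round(w / scale))) + 8 for w in block]
    return scale, stored

def dequantize_q4_0_block(scale, stored):
    return [scale * (s - 8) for s in stored]

weights = [0.23, -0.41, 0.67, -0.12, 0.55, -0.89, 0.34, 0.08]
scale, stored = quantize_q4_0_block(weights)
print(stored)   # [10, 4, 14, 7, 13, 0, 11, 9]  -- matches Step 3 above
```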

Bit Packing Details

Two 4-bit values are packed into each byte. This is a critical operation for both quantization (packing) and dequantization (unpacking):

  Bit Packing: Two Nibbles Per Byte
  ===================================

  Stored values: a=10 (0b1010), b=4 (0b0100)

  Pack: byte = (b << 4) | a
        byte = (0b0100 << 4) | 0b1010
        byte = 0b01001010
        byte = 0x4A

  Unpack:
    a = byte & 0x0F       = 0x4A & 0x0F = 0x0A = 10
    b = (byte >> 4) & 0x0F = (0x4A >> 4) & 0x0F = 0x04 = 4

  Visual:
  +---+---+---+---+---+---+---+---+
  | b3| b2| b1| b0| a3| a2| a1| a0|  <-- one byte
  +---+---+---+---+---+---+---+---+
  |    high nibble |   low nibble  |
  |    value b     |   value a     |
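
A quick Python round-trip of the same packing (helper names are mine; this mirrors the shift-and-mask logic exactly):

```python
def pack_nibbles(low, high):
    """Pack two 4-bit values into one byte: high in the high nibble, low in the low."""
    return ((high & 0x0F) << 4) | (low & 0x0F)

def unpack_nibbles(byte):
    """Return (low, high) 4-bit values from one byte."""
    return byte & 0x0F, (byte >> 4) & 0x0F

b = pack_nibbles(10, 4)   # a=10 low, b=4 high
print(hex(b))             # 0x4a
print(unpack_nibbles(b))  # (10, 4)
```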

In Metal shader code:

// Unpacking 4-bit values from bytes
uchar packed_byte = block.packed[j];

// Extract two 4-bit values
int8_t val_low  = (packed_byte & 0x0F) - 8;    // Low nibble, subtract offset
int8_t val_high = (packed_byte >> 4) - 8;       // High nibble, subtract offset

// Dequantize
float w0 = float(block.scale) * float(val_low);
float w1 = float(block.scale) * float(val_high);

K-Quant Family: Super-Blocks of 256

The basic Q4_0 format uses a single scale per 32 weights. The K-quant family (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K) introduced by llama.cpp uses a two-level hierarchy: super-blocks of 256 weights, each containing 8 sub-blocks of 32 weights.

  K-Quant Super-Block Structure (Q4_K)
  ======================================

  Super-block: 256 weights

  +---------------------------------------------------+
  | FP16 scale_of_scales  |  FP16 scale_of_mins       |   4 bytes
  +---------------------------------------------------+
  | 8 x 6-bit sub-block scales (packed)               |  6 bytes
  +---------------------------------------------------+
  | 8 x 6-bit sub-block mins (packed)                 |  6 bytes
  +---------------------------------------------------+
  | 256 x 4-bit quantized values (packed)             | 128 bytes
  +---------------------------------------------------+
  Total: 144 bytes for 256 weights = 4.5 bits/weight

  Each sub-block of 32 weights has its own:
    - 6-bit scale (quantized, relative to super-block scale_of_scales)
    - 6-bit minimum (quantized, relative to super-block scale_of_mins)

  Dequantization for weight i in sub-block b:
    sub_scale = scale_of_scales * sub_scales_q6[b]
    sub_min   = scale_of_mins * sub_mins_q6[b]
    w[i] = sub_scale * q4[i] - sub_min

Why the two-level structure? It improves quantization accuracy by allowing each sub-block of 32 weights to have a different range, while keeping the overhead (scale and min metadata) small by quantizing the per-sub-block parameters themselves.

  Q4_0 vs Q4_K: Quantization Granularity
  ========================================

  Q4_0: One FP16 scale per 32 weights
  +---------+---------+---------+---------+
  | scale_0 | scale_1 | scale_2 | scale_3 |  ... (256 weights = 8 blocks)
  |  32 wts |  32 wts |  32 wts |  32 wts |
  +---------+---------+---------+---------+
  Each scale is independent FP16 (full precision).
  Overhead: 2 bytes / 32 weights = 0.5 bits/weight
  Total: 4.0 + 0.5 = 4.5 bits/weight

  Q4_K: Two-level hierarchy for 256 weights
  +---------------------------------------------+
  | super-block: scale_of_scales, scale_of_mins |  4 bytes for 256 weights
  |   sub-block 0: 6-bit scale, 6-bit min       |
  |   sub-block 1: 6-bit scale, 6-bit min       |
  |   ...                                       |
  |   sub-block 7: 6-bit scale, 6-bit min       |
  +---------------------------------------------+
  Sub-block parameters: 12 bytes / 256 = 0.375 bits/weight
  Super-block parameters: 4 bytes / 256 = 0.125 bits/weight
  Total: 4.0 + 0.375 + 0.125 = 4.5 bits/weight

  Same bits/weight, but Q4_K has finer-grained adaptation
  and asymmetric ranges (scale + min instead of just scale).
  Result: measurably lower perplexity.
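
The effective-bits arithmetic above amounts to amortizing metadata bytes over the weights they describe; a tiny helper (illustrative, names mine) confirms both layouts land on 4.5 bits/weight:

```python
def effective_bits(weight_bits, overhead_bytes, n_weights):
    """Payload bits per weight plus amortized metadata overhead."""
    return weight_bits + overhead_bytes * 8 / n_weights

print(effective_bits(4, 2, 32))        # Q4_0: 1 FP16 scale per 32 weights -> 4.5
print(effective_bits(4, 4 + 12, 256))  # Q4_K: 4 + 12 bytes per 256 weights -> 4.5
```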

The K-Quant Zoo

Here is the full family of K-quant formats:

  K-Quant Formats Overview
  =========================

  Format   Bits/wt  Quant    Sub-block params     Quality
  ------   -------  -----    ----------------     -------
  Q2_K     2.56     2-bit    4-bit scale+min      Poor (emergency use)
  Q3_K_S   3.44     3-bit    6-bit scale          Acceptable
  Q3_K_M   3.91     3-bit    6-bit scale+min      Fair
  Q3_K_L   4.28     3-bit    6-bit scale+min      Fair+
  Q4_K_S   4.50     4-bit    6-bit scale+min      Good
  Q4_K_M   4.84     4-bit    6-bit scale+min      Good+
  Q5_K_S   5.50     5-bit    6-bit scale+min      Very Good
  Q5_K_M   5.69     5-bit    6-bit scale+min      Very Good+
  Q6_K     6.56     6-bit    8-bit scale          Excellent

  The S/M/L suffixes mean Small/Medium/Large and refer to which
  layers get higher precision. "M" uses higher precision for the
  attention layers and output layers (most sensitive to quantization).

The _M variants are particularly clever: they use mixed precision, with more bits for the layers that matter most:

  Q4_K_M: Mixed Precision by Layer
  ==================================

  Layer Type              Quantization    Why
  ----------              ------------    ---
  Attention Q,K,V,O       Q4_K           Important for quality
  FFN gate, up            Q4_K           Standard
  FFN down                Q6_K           Output projection, sensitive
  Output (vocab) layer    Q6_K           Final prediction, very sensitive
  Embedding               Q4_K           Large but less sensitive

  The "M" variant is ~8% larger than "S" but measurably better.

Per-Channel and Per-Group Quantization (MLX Format)

The MLX framework (Apple’s machine learning framework) uses a different quantization approach: per-group affine quantization with configurable group sizes.

  MLX Quantization
  =================

  Group size: typically 64 or 128
  Per group: one FP16 scale and one FP16 bias (zero-point)

  Quantize (per group of G weights):
    w_min = min(weights in group)
    w_max = max(weights in group)
    scale = (w_max - w_min) / (2^bits - 1)
    bias = w_min

    q[i] = round((w[i] - bias) / scale)
    q[i] = clamp(q[i], 0, 2^bits - 1)

  Dequantize:
    w[i] = scale * q[i] + bias

  Example with group_size=64, bits=4:
  +------------------------------------+
  | scale (FP16) | bias (FP16) | 64x4b |
  |    2 bytes   |   2 bytes   | 32 B  |
  +------------------------------------+
  Total: 36 bytes for 64 weights = 4.5 bits/weight

  With group_size=128:
  +-------------------------------------+
  | scale (FP16) | bias (FP16) | 128x4b |
  |    2 bytes   |   2 bytes   |  64 B  |
  +-------------------------------------+
  Total: 68 bytes for 128 weights = 4.25 bits/weight

The key difference from GGUF quantization: MLX uses affine quantization (scale + bias) rather than symmetric quantization (scale only). This better handles weight distributions that are not centered around zero:

  Symmetric vs Affine Quantization
  ==================================

  Symmetric (Q4_0):
    Maps [-abs_max, +abs_max] to [-8, +7]
    scale = abs_max / 8
    w = scale * (q - 8)

    Problem: if weights are [0.1, 0.3, 0.5, 0.7, 0.9]
    abs_max = 0.9, scale = 0.1125
    Half the quantization levels (-8 to -1) are wasted!

                            Wasted range
                     <---------->
    -0.9  -0.7  -0.5  -0.3  -0.1  0.1  0.3  0.5  0.7  0.9
     |     |     |     |     |     |    |    |    |    |
     q=0   q=2   q=4   q=5   q=7   q=9  q=11 q=12 q=14 q=15
                 NO WEIGHTS HERE           ALL WEIGHTS HERE


  Affine (MLX):
    Maps [w_min, w_max] to [0, 15]
    scale = (0.9 - 0.1) / 15 = 0.0533
    bias = 0.1
    w = scale * q + bias

    0.1    0.15   0.21   0.26   ...   0.84   0.9
     |      |      |      |           |      |
     q=0    q=1    q=2    q=3  ...   q=14   q=15
              FULL RANGE UTILIZED!

  Affine quantization uses all levels for the actual data range.
  Better for asymmetric distributions (common in practice).
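
A quick numeric check of this claim, using the all-positive example weights above (a Python sketch; function names are mine):

```python
def sym_roundtrip(w, scale):
    """Quantize w to 4-bit symmetric levels [-8, 7] and back."""
    q = max(-8, min(7, round(w / scale)))
    return scale * q

def affine_roundtrip(w, scale, bias):
    """Quantize w to 4-bit affine levels [0, 15] with scale + bias and back."""
    q = max(0, min(15, round((w - bias) / scale)))
    return scale * q + bias

weights = [0.1, 0.3, 0.5, 0.7, 0.9]              # all positive, as above
sym_scale = max(abs(w) for w in weights) / 8     # 0.1125
aff_scale = (max(weights) - min(weights)) / 15   # ~0.0533
bias = min(weights)

sym_err = sum(abs(w - sym_roundtrip(w, sym_scale)) for w in weights)
aff_err = sum(abs(w - affine_roundtrip(w, aff_scale, bias)) for w in weights)
print(aff_err < sym_err)   # True: affine wastes no levels on this distribution
```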

MLX Quantization Code Example

// MLX-style dequantization in Metal
kernel void dequantize_mlx_q4(
    device const uint32_t* packed_weights [[buffer(0)]],  // Packed 4-bit
    device const half*     scales         [[buffer(1)]],  // Per-group scales
    device const half*     biases         [[buffer(2)]],  // Per-group biases
    device half*           output         [[buffer(3)]],  // Dequantized
    constant uint&         group_size     [[buffer(4)]],
    uint                   tid            [[thread_position_in_grid]])
{
    uint group_id = tid / group_size;

    half scale = scales[group_id];
    half bias  = biases[group_id];

    // Each uint32 holds 8 x 4-bit values
    uint word_idx = tid / 8;
    uint bit_offset = (tid % 8) * 4;
    uint32_t word = packed_weights[word_idx];
    uint8_t q = (word >> bit_offset) & 0xF;

    output[tid] = scale * half(q) + bias;
}

Impact on Quality: Perplexity Analysis

Perplexity is the standard metric for measuring quantization quality. Lower is better – it measures how “surprised” the model is by test data. A perplexity increase of more than 1-2% is generally noticeable in output quality.

  Perplexity Comparison (Llama 2 7B, WikiText-2)
  ================================================

  Format      Bits/wt    Perplexity    Delta     Quality Assessment
  ------      -------    ----------    -----     ------------------
  FP16        16.00      5.796         ---       Reference (baseline)
  Q8_0         8.50      5.799        +0.003     Indistinguishable
  Q6_K         6.56      5.804        +0.008     Excellent
  Q5_K_M       5.69      5.812        +0.016     Very good
  Q5_K_S       5.50      5.819        +0.023     Very good
  Q4_K_M       4.84      5.882        +0.086     Good
  Q4_K_S       4.50      5.912        +0.116     Good
  Q4_0         4.50      5.946        +0.150     Acceptable
  Q3_K_M       3.91      6.145        +0.349     Fair (noticeable)
  Q3_K_S       3.44      6.351        +0.555     Degraded
  Q2_K         2.56      6.981        +1.185     Poor

  Rule of thumb:
  - Q4_K_M and above: quality loss rarely noticeable in practice
  - Q3_K: quality loss sometimes noticeable, especially on hard tasks
  - Q2_K: clear quality degradation, only for extreme memory constraints

  Perplexity vs Bits per Weight
  ==============================

  Perplexity
      |
  7.0 |*                                              Q2_K
      |
  6.5 |    *                                          Q3_K_S
      |       *                                       Q3_K_M
  6.0 |            *  *                               Q4_0, Q4_K_S
      |               *  *                            Q4_K_M, Q5_K_S
  5.8 |                     * *                       Q5_K_M, Q6_K
      |                          *                    Q8_0
  5.6 |                                         *    FP16
      |
  5.4 +---+---+---+---+---+---+---+---+---+---+--->
         2   3   4   5   6   7   8  ...  16     bits/weight

  The curve has a "knee" around 4-5 bits/weight.
  Below 4 bits, quality drops rapidly.
  Above 5 bits, returns diminish.
  4-5 bits is the sweet spot for most applications.

The Bandwidth Equation: Predicting Decode Speed

We mentioned this earlier, but let us formalize it. During autoregressive decoding:

  Decode Speed Model
  ====================

  Given:
    P = number of parameters
    b = bits per weight (effective, including quantization overhead)
    BW = memory bandwidth (bytes/sec)
    overhead = non-matmul time (attention, normalization, etc.)

  model_bytes = P * b / 8

  time_per_token = model_bytes / BW + overhead

  tokens_per_second = 1 / time_per_token

  Example: Llama 2 7B on M2 Pro
  P = 6.74 billion (actual parameter count)
  BW = 200 GB/s

  Q4_K_M (b = 4.84):
    model_bytes = 6.74e9 * 4.84 / 8 = 4.077 GB
    matmul_time = 4.077 / 200 = 0.0204 sec = 20.4 ms
    overhead ≈ 5 ms (attention + KV cache + norms)
    time_per_token ≈ 25.4 ms
    speed ≈ 39 tokens/sec

  Observed: ~35-40 tokens/sec (matches!)
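
The model above fits in a few lines of Python (the 5 ms default is the rough attention/KV-cache/norm figure assumed in the example, not a measured constant):

```python
def decode_speed(params, bits_per_weight, bandwidth_gb_s, overhead_ms=5.0):
    """Estimated decode speed in tokens/sec for a bandwidth-bound GEMV workload."""
    model_gb = params * bits_per_weight / 8 / 1e9
    time_per_token_ms = model_gb / bandwidth_gb_s * 1000 + overhead_ms
    return 1000 / time_per_token_ms

print(round(decode_speed(6.74e9, 4.84, 200)))   # 39 tokens/sec
```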

This equation is remarkably accurate because GEMV (the matmul during decode) is almost perfectly bandwidth-bound. The only thing that breaks the model is when the KV cache becomes very large (long sequences) and the attention computation starts contributing significantly.

  When Does Attention Start to Matter?
  ======================================

  KV cache read per decode step (32 layers, GQA, 8 KV heads, d=128):
    kv_bytes = 32 * 2 * 8 * seq_len * 128 * 2 = 131072 * seq_len bytes
    (every layer reads its entire K and V cache for each new token)

  Total per token = model_bytes + kv_bytes

  seq_len      kv_bytes    model(Q4_K_M)    kv/model    Impact
  --------     --------    -------------    --------    ------
  512          64 MB       4.08 GB          1.6%        Negligible
  2048         256 MB      4.08 GB          6.3%        Small
  8192         1.0 GB      4.08 GB          25%         Noticeable (~20% slower)
  32768        4.0 GB      4.08 GB          ~100%       Large (~2x slower)
  131072       16 GB       4.08 GB          ~4x         Dominant

  KV cache reads cost little for short conversations (< 2K tokens)
  but come to dominate decode time at very long contexts (> 32K tokens).

Comparison Table

Here is a comprehensive comparison of all the major quantization formats:

  +----------+--------+--------+----------+---------+--------+-------+
  | Format   | Bits/  | Block  | Params   | Asym-   | Mixed  | Used  |
  |          | weight | size   | per block| metric? | prec?  | by    |
  +----------+--------+--------+----------+---------+--------+-------+
  | FP16     | 16.00  | N/A    | N/A      | N/A     | No     | Base  |
  | Q8_0     |  8.50  |   32   | 1 scale  | No      | No     | GGUF  |
  | Q6_K     |  6.56  |  256   | 8-bit sc | No      | No     | GGUF  |
  | Q5_K_M   |  5.69  |  256   | 6-bit s+m| Yes     | Yes    | GGUF  |
  | Q5_K_S   |  5.50  |  256   | 6-bit s+m| Yes     | No     | GGUF  |
  | Q4_K_M   |  4.84  |  256   | 6-bit s+m| Yes     | Yes    | GGUF  |
  | Q4_K_S   |  4.50  |  256   | 6-bit s+m| Yes     | No     | GGUF  |
  | Q4_0     |  4.50  |   32   | 1 scale  | No      | No     | GGUF  |
  | Q3_K_M   |  3.91  |  256   | 6-bit s+m| Yes     | Yes    | GGUF  |
  | Q3_K_S   |  3.44  |  256   | 6-bit sc | No      | No     | GGUF  |
  | Q2_K     |  2.56  |  256   | 4-bit s+m| Yes     | No     | GGUF  |
  | MLX-4    |  4.25  | 64-128 |scale+bias| Yes     | No     | MLX   |
  | MLX-8    |  8.25  | 64-128 |scale+bias| Yes     | No     | MLX   |
  | MLX-2    |  2.25  | 64-128 |scale+bias| Yes     | No     | MLX   |
  +----------+--------+--------+----------+---------+--------+-------+

  Legend: sc = scale only, s+m = scale + minimum, Asym = asymmetric range
  Mixed prec = different quantization levels for different layers

Worked Example: Full Dequantization Pipeline

Let us walk through dequantizing a complete Q4_K block on the GPU, showing exactly what happens in a Metal shader:

  Q4_K Dequantization Pipeline
  ==============================

  Input: One super-block of 256 quantized weights

  Step 1: Read super-block header
    d = FP16 scale_of_scales       (e.g., 0.0156)
    dmin = FP16 scale_of_mins      (e.g., 0.0078)

  Step 2: Unpack sub-block parameters (6-bit values, packed)
    The 8 sub-block scales are packed as 6-bit values in 6 bytes:
    +--------+--------+--------+--------+--------+--------+
    | byte 0 | byte 1 | byte 2 | byte 3 | byte 4 | byte 5 |
    +--------+--------+--------+--------+--------+--------+

    sub_scale[0] = byte[0] & 0x3F                    = 23
    sub_scale[1] = ((byte[0]>>6) | (byte[1]<<2)) & 0x3F = 18
    sub_scale[2] = ((byte[1]>>4) | (byte[2]<<4)) & 0x3F = 31
    ...

    Similarly for sub_mins.

  Step 3: Compute actual sub-block scale and min
    For sub-block b:
      actual_scale[b] = d * sub_scale[b]
      actual_min[b]   = dmin * sub_min[b]

    actual_scale[0] = 0.0156 * 23 = 0.3588
    actual_min[0]   = 0.0078 * 15 = 0.1170  (example)

  Step 4: Dequantize weights in sub-block
    For weight i in sub-block b:
      q = unpack_4bit(packed_data, b*32 + i)   (0..15)
      w = actual_scale[b] * q - actual_min[b]

    q[0] = 7:   w = 0.3588 * 7 - 0.1170 = 2.3946
    q[1] = 12:  w = 0.3588 * 12 - 0.1170 = 4.1886
    q[2] = 3:   w = 0.3588 * 3 - 0.1170 = 0.9594
    q[3] = 0:   w = 0.3588 * 0 - 0.1170 = -0.1170  (min value)
    q[4] = 15:  w = 0.3588 * 15 - 0.1170 = 5.2650  (max value)
    ...

And here is how this looks in Metal shader code:

// Q4_K super-block structure
struct block_q4_K {
    half    d;              // scale of scales
    half    dmin;           // scale of mins
    uchar   scales[12];    // 8 x 6-bit scales + 8 x 6-bit mins, packed
    uchar   qs[128];       // 256 x 4-bit quantized values
};

// Dequantize one Q4_K block
inline void dequantize_q4_K(
    device const block_q4_K& block,
    uint sub_block,     // 0..7
    uint index_in_sub,  // 0..31
    thread float& result)
{
    // Step 1: Read super-block scales
    float d = float(block.d);
    float dmin = float(block.dmin);

    // Step 2: Unpack 6-bit sub-block scale and min
    // (Simplified -- actual packing is more complex)
    uint8_t raw_scale, raw_min;
    if (sub_block < 4) {
        raw_scale = block.scales[sub_block] & 0x3F;
        raw_min   = block.scales[sub_block + 4] & 0x3F;
    } else {
        // Higher sub-blocks use bits from multiple bytes
        raw_scale = ((block.scales[sub_block + 4] & 0xF) |
                    ((block.scales[sub_block - 4] >> 6) << 4));
        raw_min   = ((block.scales[sub_block + 4] >> 4) |
                    ((block.scales[sub_block]     >> 6) << 4));
    }

    float scale = d * float(raw_scale);
    float min   = dmin * float(raw_min);

    // Step 3: Unpack 4-bit weight value
    uint byte_idx = (sub_block * 32 + index_in_sub) / 2;
    uchar packed = block.qs[byte_idx];
    uint8_t q;
    if (index_in_sub % 2 == 0) {
        q = packed & 0x0F;        // Low nibble
    } else {
        q = (packed >> 4) & 0x0F; // High nibble
    }

    // Step 4: Dequantize
    result = scale * float(q) - min;
}
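
The 6-bit unpacking in Step 2 can be sanity-checked with a small Python round-trip (the pack/unpack helpers are mine; they implement the same LSB-first bit layout, so the first three unpacked values match the byte-level expressions shown above):

```python
def pack_6bit(values):
    """Pack 6-bit values LSB-first into a little-endian byte string."""
    acc = 0
    for i, v in enumerate(values):
        acc |= (v & 0x3F) << (6 * i)
    return acc.to_bytes(len(values) * 6 // 8, "little")

def unpack_6bit(data):
    """Inverse of pack_6bit: stream bits LSB-first, emit 6 bits at a time."""
    out, acc, nbits = [], 0, 0
    for byte in data:
        acc |= byte << nbits
        nbits += 8
        while nbits >= 6:
            out.append(acc & 0x3F)
            acc >>= 6
            nbits -= 6
    return out

scales = [23, 18, 31, 7, 0, 63, 42, 5]   # 8 x 6-bit sub-block scales
packed = pack_6bit(scales)               # 6 bytes, as in the Q4_K layout
print(unpack_6bit(packed))               # [23, 18, 31, 7, 0, 63, 42, 5]
```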

Quantizing the KV Cache

So far we have discussed quantizing the model weights. But there is another large memory consumer during inference: the KV cache. For long context lengths, the KV cache can rival or exceed the model size.

  KV Cache Quantization
  ======================

  Standard KV cache (FP16):
    Per layer: 2 * n_kv_heads * seq_len * d_k * 2 bytes
    32 layers, 8 KV heads, d_k=128, seq_len=32768:
    = 32 * 2 * 8 * 32768 * 128 * 2 = 4 GB

  Q8 KV cache:
    Same but 1 byte per element + small scale overhead
    ≈ 2 GB  (50% reduction)

  Q4 KV cache:
    ≈ 1 GB  (75% reduction)

  KV cache quantization is trickier than weight quantization because:
  1. Values are computed dynamically (not known ahead of time)
  2. Must quantize on-the-fly as new K/V vectors are computed
  3. Range can change as new tokens arrive
  4. Quality impact is harder to predict

  Approach: per-head, per-position quantization
    For each new (key, value) vector being added to the cache:
    1. Compute the vector in FP16
    2. Quantize to Q8 or Q4 with per-vector scale
    3. Store quantized vector in cache
    4. Dequantize on-the-fly during attention computation
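
Steps 1-3 of the approach above can be sketched in a few lines of Python (illustrative only – function names are mine, and a real implementation would quantize on the GPU and pack the int8 values into the cache buffer):

```python
def quantize_kv_q8(vec):
    """Symmetric Q8 with one scale per K or V vector, computed on the fly."""
    scale = max(abs(x) for x in vec) / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in vec]
    return q, scale

def dequantize_kv_q8(q, scale):
    return [scale * qi for qi in q]

k_vec = [0.8, -1.3, 0.05, 2.54, -0.6]   # a toy 5-dim "key" vector
q, s = quantize_kv_q8(k_vec)
deq = dequantize_kv_q8(q, s)
max_err = max(abs(a - b) for a, b in zip(deq, k_vec))
print(max_err <= s / 2 + 1e-9)   # True: rounding error is bounded by scale/2
```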

Practical Considerations

Choosing the Right Quantization

Here is a practical decision flowchart:

  Quantization Selection Flowchart
  ==================================

  How much memory do you have?
  |
  +-- Very limited (8 GB): Use Q4_K_M for 7B, Q2_K for 13B
  |
  +-- Moderate (16-18 GB): Use Q4_K_M for 13B, Q6_K for 7B
  |
  +-- Generous (32-64 GB): Use Q6_K or Q8_0 for 13B-34B
  |
  +-- Abundant (96+ GB): Consider FP16 for best quality

  What is your speed target?
  |
  +-- Maximum speed: Use lowest quantization that maintains acceptable quality
  |                  (usually Q4_K_M or Q4_K_S)
  |
  +-- Quality priority: Use highest quantization that meets speed requirements
  |                     (Q6_K or Q8_0)
  |
  +-- Balanced: Q4_K_M is almost always the right answer

  Task sensitivity?
  |
  +-- Creative writing, coding: Q4_K_M usually fine
  |
  +-- Math, reasoning: Consider Q5_K_M or Q6_K
  |
  +-- Simple Q&A, summarization: Q3_K_M can work

Quantization and GEMV Performance

Remember from Chapter 14: GEMV is bandwidth-bound. Quantization directly reduces the bytes read, so the speedup is nearly linear with compression ratio:

  GEMV Performance with Quantization
  ====================================

  Single matmul: [1, 4096] * [4096, 11008]
  M2 Pro, 200 GB/s

  Format    Weight Size    Read Time    Speedup
  ------    -----------    ---------    -------
  FP16      90.2 MB        0.451 ms    1.0x
  Q8_0      47.9 MB        0.240 ms    1.9x
  Q6_K      37.0 MB        0.185 ms    2.4x
  Q4_K_M    27.3 MB        0.136 ms    3.3x
  Q4_0      25.4 MB        0.127 ms    3.6x
  Q2_K      14.4 MB        0.072 ms    6.3x

  Note: Dequantization adds ~5-10% compute overhead.
  Net speedup is slightly less than the raw bandwidth reduction,
  but still very close.

GEMM (Prefill) with Quantization

For prefill (GEMM), quantization does not directly speed things up because GEMM is compute-bound. However, it does reduce memory footprint, which matters for:

  1. Fitting larger models in memory
  2. Keeping more of the model in cache
  3. Reducing memory pressure from concurrent operations

  Prefill Performance: Less Clear-Cut
  =====================================

  Matmul: [512, 4096] * [4096, 11008]
  M2 Pro

  Format    Approach                     Time
  ------    --------                     ----
  FP16      Native FP16 MMA              ~2.0 ms
  Q4_K_M    Dequant to FP16 + MMA        ~2.5 ms  (slower!)
  Q4_K_M    Optimized mixed-precision    ~2.2 ms  (close to FP16)

  During prefill, quantization can actually be SLOWER because:
  1. Extra dequantization compute
  2. MMA hardware is optimized for FP16/BF16, not mixed precision
  3. GEMM is compute-bound, not memory-bound

  But for most LLM use cases, decode time dominates total latency,
  so optimizing decode (where quantization helps enormously) is
  more important than optimizing prefill.

Summary

Quantization is what makes LLM inference practical on consumer hardware. The key points:

  1. Why quantize: A 7B FP16 model needs 14 GB; Q4 needs ~4 GB. Decode speed is proportional to 1/model_size because GEMV is bandwidth-bound.

  2. Block quantization (Q4_0): Groups of 32 weights share one FP16 scale. 4 bits per weight + scale overhead = 4.5 bits/weight effective. Simple and fast.

  3. K-quant family: Super-blocks of 256 with 8 sub-blocks of 32. Two-level scale hierarchy. Mixed precision variants (_M) use more bits for sensitive layers. Q4_K_M is the most popular choice.

  4. Per-group affine (MLX): Groups of 64-128 with per-group scale and bias. Asymmetric quantization handles non-centered distributions better.

  5. Bit packing: Two 4-bit values per byte. Unpacking is a shift and mask operation, done on-the-fly during dequantization in the GPU shader.

  6. Quality impact: The perplexity curve has a knee at ~4-5 bits/weight. Q4_K_M is the sweet spot for most applications. Below 3 bits, quality degrades noticeably.

  7. The bandwidth equation: decode_speed ≈ bandwidth / model_bytes. This simple formula predicts real-world performance with surprising accuracy.

  8. Dequantize on the fly: During GEMV, weights are dequantized in registers as they are loaded. The compute cost of dequantization is negligible compared to the bandwidth savings.

With quantization, a $1200 MacBook Air with 16 GB of RAM can run a 7B parameter model at 40+ tokens per second – fast enough for interactive use. That is the power of trading a tiny bit of precision for an enormous reduction in memory and bandwidth requirements.