
Appendix B: Quantization Format Reference

This appendix is a quick-reference for every quantization format supported by akunu. It covers the GGUF block-quantized formats (from the ggml/llama.cpp ecosystem) and the MLX group-quantized formats (from Apple’s MLX framework). For each format, you get the block/group size, bytes per block, effective bits per weight, and the dequantization formula.

If you want the why behind these formats, see Chapter 7 (Quantization). This appendix is the what – a lookup table you can keep open while reading kernel code or debugging weight loading.

How to Read the Tables

  • Block size: Number of weights packed together as a unit. GGUF formats use fixed block sizes (32 or 256). MLX formats use configurable group sizes (typically 64).
  • Bytes per block: Total storage for one block, including quantized values, scales, and any auxiliary data.
  • Bits per weight (bpw): Effective bits per weight element, computed as 8 * bytes_per_block / block_size. This is the number that determines model file size.
  • Dequant formula: How to reconstruct the floating-point value from the quantized representation.
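The bpw arithmetic can be sanity-checked against any row of the tables below. A minimal sketch in plain C (the helper name is invented for illustration, not from the akunu codebase):

```c
#include <assert.h>

/* Effective bits per weight: 8 * bytes_per_block / block_size.
   Illustrative helper, not akunu code. */
double bpw(int bytes_per_block, int block_size) {
    return 8.0 * bytes_per_block / block_size;
}
```

For example, `bpw(18, 32)` gives 4.5 for Q4_0 and `bpw(210, 256)` gives 6.5625 for Q6_K, matching the tables.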

GGUF Formats: Basic Quantization

These formats use a simple scheme: each block of weights shares one or two floating-point parameters (scale and optional minimum/zero-point).

| Format | GGUF Code | Block Size | Bytes/Block | bpw | Dequant Formula |
|--------|-----------|------------|-------------|-----|-----------------|
| F32 | 0 | 1 | 4 | 32.0 | value = raw_f32 |
| F16 | 1 | 1 | 2 | 16.0 | value = raw_f16 |
| Q4_0 | 2 | 32 | 18 | 4.5 | value = d * (q[i] - 8), where q[i] is a 4-bit unsigned int and d is an FP16 scale |
| Q4_1 | 3 | 32 | 20 | 5.0 | value = d * q[i] + m, where d is an FP16 scale and m is an FP16 minimum |
| Q5_0 | 6 | 32 | 22 | 5.5 | value = d * (q[i] - 16), where q[i] is a 5-bit unsigned int (4 low bits packed + 1 high bit) and d is an FP16 scale |
| Q8_0 | 8 | 32 | 34 | 8.5 | value = d * q[i], where q[i] is a signed 8-bit int and d is an FP16 scale |
| BF16 | 30 | 1 | 2 | 16.0 | value = raw_bf16 (Brain Float 16: 8-bit exponent, 7-bit mantissa) |

Block Layout Details

Q4_0 (most common quantization in the GGUF ecosystem):

struct block_q4_0 {       // 18 bytes total
    half d;               //  2 bytes: scale factor
    uint8_t qs[16];       // 16 bytes: 32 x 4-bit values, packed in pairs
};                        // bpw = 18*8/32 = 4.5

Each byte in qs holds two 4-bit values: the low nibble is element 2i, the high nibble is element 2i+1. Dequantization extracts the nibble, subtracts 8 (to center around zero), and multiplies by the scale d.
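The nibble extraction can be sketched in plain C (akunu's real kernels do this in Metal; the function name and float scale are illustrative simplifications of the FP16 on-disk format):

```c
#include <assert.h>
#include <stdint.h>

/* Dequantize element i (0..31) of a Q4_0 block. Illustrative only:
   the on-disk scale d is FP16, taken here as a float. */
float dequant_q4_0(const uint8_t qs[16], float d, int i) {
    uint8_t byte = qs[i / 2];
    /* low nibble = element 2i, high nibble = element 2i+1 */
    uint8_t q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    return d * ((int)q - 8);
}
```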

Q4_1 (asymmetric variant of Q4_0):

struct block_q4_1 {       // 20 bytes total
    half d;               //  2 bytes: delta (scale)
    half m;               //  2 bytes: minimum
    uint8_t qs[16];       // 16 bytes: 32 x 4-bit values, packed in pairs
};                        // bpw = 20*8/32 = 5.0

The extra m (minimum) parameter means values are dequantized as d * q + m instead of d * (q - 8). This gives better accuracy when the weight distribution is not symmetric around zero.

Q5_0 (5-bit with high-bit extension):

struct block_q5_0 {       // 22 bytes total
    half d;               //  2 bytes: scale
    uint8_t qh[4];        //  4 bytes: 5th bit for each of 32 elements
    uint8_t qs[16];       // 16 bytes: lower 4 bits, packed in pairs
};                        // bpw = 22*8/32 = 5.5

The 5th bit for each element is stored separately in qh (packed as a uint32). To dequantize element i: extract the 4-bit value from qs, extract bit i from qh, combine to get a 5-bit unsigned int, subtract 16, multiply by d.
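Combining the two pieces looks roughly like this (a sketch, assuming little-endian bit order within qh; function name and float scale are illustrative, not akunu's Metal code):

```c
#include <assert.h>
#include <stdint.h>

/* Dequantize element i (0..31) of a Q5_0 block. qh supplies bit i of
   each element's 5-bit value. Illustrative sketch only. */
float dequant_q5_0(const uint8_t qs[16], const uint8_t qh[4], float d, int i) {
    uint8_t lo = (i % 2 == 0) ? (qs[i / 2] & 0x0F) : (qs[i / 2] >> 4);
    uint8_t hi = (qh[i / 8] >> (i % 8)) & 1;  /* bit i of the packed 32 bits */
    int q = (hi << 4) | lo;                   /* 5-bit unsigned, 0..31 */
    return d * (q - 16);
}
```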

Q8_0 (8-bit, highest quality block quant):

struct block_q8_0 {       // 34 bytes total
    half d;               //  2 bytes: scale
    int8_t qs[32];        // 32 bytes: signed 8-bit values
};                        // bpw = 34*8/32 = 8.5

Simple and fast to dequantize: value = d * qs[i]. The 0.5 extra bpw overhead comes from the FP16 scale shared across 32 elements.

GGUF Formats: K-Quantization

K-quant formats use a two-level quantization scheme with super-blocks of 256 elements. Each super-block contains sub-blocks with their own scales, plus a super-block-level scale that controls the magnitude of the sub-block scales. This hierarchical approach gives better accuracy at the same bit width compared to basic formats.1

| Format | GGUF Code | Block Size | Bytes/Block | bpw | Description |
|--------|-----------|------------|-------------|-----|-------------|
| Q2_K | 10 | 256 | 84 | 2.625 | 2-bit values + 4-bit scale/min per 16-element sub-block + super-block scale |
| Q3_K | 11 | 256 | 110 | 3.4375 | 2-bit base + 1 high bit + 6-bit packed scales + super-block scale |
| Q4_K | 12 | 256 | 144 | 4.5 | 4-bit values + 6-bit scales/mins + super-block scale |
| Q5_K | 13 | 256 | 176 | 5.5 | 4-bit base + 1 high bit + 6-bit scales/mins + super-block scale |
| Q6_K | 14 | 256 | 210 | 6.5625 | 4-bit low + 2-bit high (6-bit total) + 8-bit scales + super-block scale |

K-Quant Block Layouts

Q4_K (the most popular K-quant for production use):

struct block_q4_K {              // 144 bytes total
    half d;                      //   2 bytes: super-block scale for quants
    half dmin;                   //   2 bytes: super-block scale for mins
    uint8_t scales[12];          //  12 bytes: 8 x 6-bit scales + 8 x 6-bit mins
    uint8_t qs[128];             // 128 bytes: 256 x 4-bit values, nibble-packed
};                               // bpw = 144*8/256 = 4.5

The 256-element super-block is divided into 8 sub-blocks of 32 elements each. Each sub-block has a 6-bit scale and a 6-bit minimum, packed into 12 bytes. Dequantization for element i in sub-block j:

value = d * scale_j * q[i] - dmin * min_j

where q[i] is the unsigned 4-bit value. Note that the per-sub-block minimum term is subtracted, and there is no -8 centering offset as in Q4_0: the minimum plays that role.

The get_scale_min_k4() helper in KernelCommon.h unpacks the 6-bit scale and minimum from the packed 12-byte scales array.

Q3_K (aggressive 3-bit quantization):

struct block_q3_K {              // 110 bytes total
    uint8_t hmask[32];           //  32 bytes: high bit for each of 256 elements
    uint8_t qs[64];              //  64 bytes: lower 2 bits packed (4 per byte)
    uint8_t scales[12];          //  12 bytes: 16 x signed 6-bit scales
    half d;                      //   2 bytes: super-block scale
};                               // bpw = 110*8/256 = 3.4375

Each element has 3 bits: 2 bits from qs and 1 bit from hmask. The 16 sub-blocks (16 elements each) have signed 6-bit scales packed into 12 bytes. The get_scale_q3_k() helper unpacks these.

Q2_K (extreme 2-bit quantization):

struct block_q2_K {              // 84 bytes total
    uint8_t scales[16];          // 16 bytes: 4-bit scale + 4-bit min per sub-block
    uint8_t qs[64];              // 64 bytes: 2-bit values (4 per byte)
    half d;                      //  2 bytes: super-block scale
    half dmin;                   //  2 bytes: super-block min scale
};                               // bpw = 84*8/256 = 2.625

Q6_K (high-quality 6-bit):

struct block_q6_K {              // 210 bytes total
    uint8_t ql[128];             // 128 bytes: lower 4 bits of 6-bit quants
    uint8_t qh[64];              //  64 bytes: upper 2 bits of 6-bit quants
    int8_t scales[16];           //  16 bytes: signed 8-bit sub-block scales
    half d;                      //   2 bytes: super-block scale
};                               // bpw = 210*8/256 = 6.5625

MLX Formats: Group Quantization

MLX uses a simpler group quantization scheme. Weights are divided into groups (typically 64 elements), and each group has an FP16 scale and FP16 bias (zero-point). The packed weight buffer layout is:

[packed_weights | scales | biases]

The MLXParams.weight_bytes field gives the byte offset where scales begin. Biases follow immediately.

| Format | Internal Code | Group Size | Bits | bpw | Dequant Formula |
|--------|---------------|------------|------|-----|-----------------|
| MLX Q3 | 99 | 64 | 3 | ~3.5 | value = scale * (packed_3bit_int) + bias |
| MLX Q4 | 100 | 64 | 4 | ~4.5 | value = scale * (packed_4bit_int) + bias |
| MLX Q6 | 102 | 64 | 6 | ~6.5 | value = scale * (packed_6bit_int) + bias |
| MLX Q8 | 101 | 64 | 8 | ~8.5 | value = scale * (packed_8bit_int) + bias |

Notes on bpw for MLX: The effective bpw includes the overhead of the FP16 scale and bias per group. For group_size=64 with 4-bit values: (64 * 4 + 16 + 16) / 64 = 4.5 bpw. The exact overhead is 32 / group_size bits per weight for the scale+bias pair.
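That overhead formula fits in one line of C (helper name invented for illustration):

```c
#include <assert.h>

/* Effective bpw for MLX group quantization: the raw bit width plus
   16 scale bits and 16 bias bits amortized over each group. */
double mlx_bpw(int bits, int group_size) {
    return bits + 32.0 / group_size;
}
```

`mlx_bpw(4, 64)` reproduces the ~4.5 bpw figure above; smaller groups raise the overhead (e.g. group_size=32 would give 5.0 bpw for 4-bit values).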

MLX Packing Details

MLX Q4: Each uint32 holds 8 x 4-bit values. The low 4 bits are element 0, bits 4-7 are element 1, and so on. The group_size determines how many packed uint32s share a single scale/bias pair: for group_size=64, that is 8 uint32s per group.
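The unpacking of one uint32 can be sketched as follows (a simplification with float scale/bias instead of FP16; function name is illustrative, not MLX or akunu source):

```c
#include <assert.h>
#include <stdint.h>

/* Dequantize element i (0..7) of one packed uint32 in an MLX Q4 tensor.
   scale and bias are the enclosing group's parameters. Sketch only. */
float mlx_dequant_q4(uint32_t packed, float scale, float bias, int i) {
    uint32_t q = (packed >> (4 * i)) & 0xF;  /* low nibble is element 0 */
    return scale * (float)q + bias;
}
```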

MLX Q3: Packing is more complex. Three bits per value means values do not align neatly to byte boundaries. MLX packs 32 x 3-bit values into 3 uint32s (96 bits for 32 values). The remaining 32 values in a 64-element group use another 3 uint32s.

MLX Q8: The simplest MLX format. Each byte holds one 8-bit quantized value. Dequantization is a simple multiply-add: value = scale * q[i] + bias.

Internal Dtype Codes

Akunu uses uint32_t dtype codes internally. GGUF dtypes 0-30 map directly to the GGUF specification. MLX formats use synthetic codes 99-102 that are assigned during weight loading by MLXWeightStore. The full mapping in dtype_descriptor.h:

| Code | Format | Origin |
|------|--------|--------|
| 0 | F32 | GGUF |
| 1 | F16 | GGUF |
| 2 | Q4_0 | GGUF |
| 3 | Q4_1 | GGUF |
| 6 | Q5_0 | GGUF |
| 8 | Q8_0 | GGUF |
| 10 | Q2_K | GGUF |
| 11 | Q3_K | GGUF |
| 12 | Q4_K | GGUF |
| 13 | Q5_K | GGUF |
| 14 | Q6_K | GGUF |
| 30 | BF16 | GGUF |
| 31 | BF16 (native) | GGUF, M4+ only |
| 99 | MLX Q3 | MLX SafeTensors |
| 100 | MLX Q4 | MLX SafeTensors |
| 101 | MLX Q8 | MLX SafeTensors |
| 102 | MLX Q6 | MLX SafeTensors |

Note that codes 4-5, 7, 9, 15-29 are defined in the GGUF specification (for types like Q5_1, Q8_1, IQ2_XXS, etc.) but are not currently supported by akunu’s Metal kernels. If you attempt to load a GGUF file using an unsupported dtype, the dtype_lookup() function falls back to the F16 descriptor, which will produce incorrect results. Check the dtype before loading.

Kernel Support Matrix

Not every format has every kernel variant. This table shows which Metal kernel types are available for each supported dtype:

| Format | GEMV | GEMV Wide | GEMM | GEMM Small | Embedding | Fused SiLU |
|--------|------|-----------|------|------------|-----------|------------|
| F16 | yes | yes | yes | yes | yes (generic) | no |
| Q4_0 | yes | yes | yes | yes | yes | yes |
| Q4_1 | yes | yes | yes | yes | yes | no |
| Q5_0 | yes | yes | yes | yes | no | no |
| Q8_0 | yes | yes | yes | yes | yes | no |
| Q2_K | yes | no | yes | yes | no | no |
| Q3_K | yes | no | yes | yes | no | no |
| Q4_K | yes | yes | yes | yes | yes | no |
| Q5_K | yes | no | yes | yes | no | no |
| Q6_K | yes | no | yes | yes | yes | no |
| BF16 | yes | no | yes | yes | yes | no |
| MLX Q3 | yes | no | yes | yes | yes | yes |
| MLX Q4 | yes | yes | yes | yes | yes | yes |
| MLX Q6 | yes | no | yes | yes | yes | yes |
| MLX Q8 | yes | yes | yes | yes | yes | yes |

Key observations:

  • GEMV Wide kernels exist only for formats with wide enough adoption to justify the implementation effort. Q4_0 and Q4_K are the most common GGUF formats; MLX Q4 and Q8 are the most common MLX formats.
  • Fused SiLU kernels exist for Q4_0 and all MLX formats. These fuse the gate+up GEMV with the SiLU activation to eliminate an intermediate buffer write. Other GGUF K-quant formats do not have fused SiLU variants.
  • Embedding kernels that dequantize on the fly exist for the most common formats. Formats without a specialized embedding kernel use the generic F16 embedding lookup, which requires the embedding weights to be stored in FP16 (or converted during loading).

Model Size Estimation

To estimate the file size of a model in a given format:

file_size_bytes = n_parameters * bpw / 8 + metadata_overhead

The metadata overhead (GGUF header, tensor info, tokenizer data) is typically 1-10 MB for GGUF files, which is negligible for large models.

| Parameters | Q4_0 (4.5 bpw) | Q4_K (4.5 bpw) | Q8_0 (8.5 bpw) | MLX Q4 (~4.5 bpw) | F16 (16 bpw) |
|------------|----------------|----------------|----------------|-------------------|--------------|
| 1B | 0.56 GB | 0.56 GB | 1.06 GB | 0.56 GB | 2.0 GB |
| 4B | 2.25 GB | 2.25 GB | 4.25 GB | 2.25 GB | 8.0 GB |
| 8B | 4.50 GB | 4.50 GB | 8.50 GB | 4.50 GB | 16.0 GB |
| 14B | 7.88 GB | 7.88 GB | 14.88 GB | 7.88 GB | 28.0 GB |
| 32B | 18.0 GB | 18.0 GB | 34.0 GB | 18.0 GB | 64.0 GB |
| 70B | 39.4 GB | 39.4 GB | 74.4 GB | 39.4 GB | 140.0 GB |

Note that Q4_0 and Q4_K have the same effective bpw (4.5) but Q4_K generally provides better accuracy due to the hierarchical scale structure. The file sizes are identical; the quality difference is in how those bits are allocated.
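The table entries follow directly from the size formula. A quick numerical check, ignoring metadata overhead (helper name invented for illustration):

```c
#include <assert.h>

/* Approximate model file size in GB: n_parameters * bpw / 8 bytes,
   metadata overhead ignored. Illustrative helper. */
double model_gb(double n_params, double bits_per_weight) {
    return n_params * bits_per_weight / 8.0 / 1e9;
}
```

`model_gb(8e9, 4.5)` reproduces the 4.50 GB entry for an 8B model at Q4_0/Q4_K.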

Memory Bandwidth and Decode Throughput

Since single-token decode is memory-bound, the quantization format directly determines the maximum achievable decode throughput. The relationship is:

theoretical_max_tok_s = memory_bandwidth_bytes_per_sec / model_weight_bytes

This means halving the bits per weight (e.g., Q8_0 to Q4_0) roughly doubles the theoretical decode speed. Here is a reference table for a 4B parameter model on various chips:

| Chip | BW (GB/s) | Q4_0 (2.25 GB) | Q8_0 (4.25 GB) | F16 (8.0 GB) |
|------|-----------|----------------|----------------|--------------|
| M1 | 68.25 | 30.3 tok/s | 16.1 tok/s | 8.5 tok/s |
| M2 | 100 | 44.4 tok/s | 23.5 tok/s | 12.5 tok/s |
| M2 Pro | 200 | 88.9 tok/s | 47.1 tok/s | 25.0 tok/s |
| M3 Max | 400 | 177.8 tok/s | 94.1 tok/s | 50.0 tok/s |
| M4 Pro | 273 | 121.3 tok/s | 64.2 tok/s | 34.1 tok/s |
| M4 Max | 546 | 242.7 tok/s | 128.5 tok/s | 68.3 tok/s |

These are theoretical maximums assuming 100% bandwidth utilization. In practice, akunu achieves 70-85% of these numbers due to overhead from attention, normalization, RoPE, and kernel dispatch. The system-level cache (SLC) can push effective bandwidth above the raw DRAM bandwidth for chain decode workloads, sometimes exceeding the theoretical maximum.
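The throughput ceiling is a one-line division (helper name invented for illustration):

```c
#include <assert.h>
#include <math.h>

/* Theoretical decode ceiling: each generated token requires one full
   pass over the model weights. Illustrative helper. */
double max_tok_s(double bandwidth_gb_s, double weight_gb) {
    return bandwidth_gb_s / weight_gb;
}
```

`max_tok_s(400.0, 2.25)` reproduces the 177.8 tok/s entry for an M3 Max running a Q4_0 4B model.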

Choosing a Quantization Format

Here is a decision guide based on common use cases:

| Use Case | Recommended Format | Rationale |
|----------|--------------------|-----------|
| Maximum speed, acceptable quality | Q4_0 (GGUF) or MLX Q4 | Lowest bpw with good quality. Best decode throughput. |
| Best quality/speed tradeoff | Q4_K (GGUF) | Same bpw as Q4_0 but better accuracy from hierarchical scales. |
| Quality-sensitive applications | Q6_K (GGUF) or MLX Q6 | 6.5 bpw gives near-FP16 quality with 2.5x less memory. |
| Near-lossless | Q8_0 (GGUF) or MLX Q8 | 8.5 bpw is essentially indistinguishable from FP16 for most tasks. |
| Research / debugging | F16 or BF16 | Full precision. Useful as a reference for measuring quantization error. |
| Tiny models (<1B params) | Q8_0 or F16 | Small models are already fast; use higher precision to preserve quality. |
| Large models on limited RAM | Q2_K or Q3_K | Aggressive quantization to fit models that would otherwise not fit in memory. Quality degrades noticeably. |

Unsupported GGUF Types

The GGUF specification defines several additional quantization types that akunu does not currently support with Metal kernels. These are listed in gguf_parser.h but will fall back to the F16 dtype descriptor (producing incorrect results) if encountered:

| GGUF Code | Name | Reason Not Supported |
|-----------|------|----------------------|
| 7 | Q5_1 | Asymmetric 5-bit; rare in practice, superseded by Q5_K |
| 9 | Q8_1 | Asymmetric 8-bit; rarely used for distribution |
| 15 | Q8_K | 8-bit K-quant; used internally by llama.cpp during quantization, not for inference |
| 16 | IQ2_XXS | Importance-matrix quantized 2-bit; complex lookup-table dequantization |
| 17 | IQ2_XS | Importance-matrix 2-bit variant |
| 18 | IQ3_XXS | Importance-matrix 3-bit |
| 19 | IQ1_S | Importance-matrix 1-bit |
| 20 | IQ4_NL | Non-linear 4-bit with lookup table |
| 21 | IQ3_S | Importance-matrix 3-bit variant |
| 22 | IQ2_S | Importance-matrix 2-bit variant |
| 23 | IQ4_XS | Importance-matrix 4-bit |
| 29 | IQ1_M | Importance-matrix 1-bit variant |

The IQ (importance-matrix quantized) formats use lookup tables for dequantization, which adds implementation complexity. They offer slightly better perplexity than the standard formats at the same bit width, but their adoption in the ecosystem is limited compared to Q4_0, Q4_K, and Q6_K. If there is demand, these could be added as Metal kernels in the future.

Comparing GGUF and MLX Quantization

Even at the same nominal bit width, GGUF and MLX quantization are not identical:

| Aspect | GGUF (e.g., Q4_0) | MLX (e.g., Q4) |
|--------|-------------------|----------------|
| Block/group structure | Fixed 32-element blocks | Configurable groups (typically 64) |
| Scale storage | FP16 scale per block (Q4_0) or hierarchical 6-bit scales (Q4_K) | FP16 scale + FP16 bias per group |
| Dequant | d * (q - zero_point) | scale * q + bias |
| Symmetry | Symmetric (Q4_0) or asymmetric (Q4_1) | Always asymmetric (has bias) |
| Overhead | 2 bytes per 32 elements (Q4_0) = 0.5 bpw | 4 bytes per 64 elements = 0.5 bpw |
| Quantization method | Post-training quantization by llama.cpp | Post-training quantization by MLX |

The practical quality difference between GGUF Q4_0 and MLX Q4 is small for most models. The larger group size in MLX (64 vs 32) means each scale covers more elements, which can be slightly worse for weight distributions with high local variance, but the asymmetric bias term partially compensates.

For weights that are uniformly distributed around zero (which is common after training), symmetric quantization (GGUF Q4_0) is slightly more efficient because it does not waste bits on the bias term. For weights with non-zero mean per group, asymmetric quantization (MLX Q4, GGUF Q4_1) is more accurate.


  1. K-quantization was introduced by ikawrakow in llama.cpp (2023). The “K” originally stood for “k-quant” with no specific expansion. The key insight is that using more bits for scale parameters (6-bit sub-block scales + FP16 super-block scale) reduces the quantization error budget allocated to the scale itself.