
Appendix B: Quantization Format Reference

This appendix is a quick-reference for every quantization format supported by akunu. It covers the GGUF block-quantized formats (from the ggml/llama.cpp ecosystem) and the MLX group-quantized formats (from Apple’s MLX framework). For each format, you get the block/group size, bytes per block, effective bits per weight, and the dequantization formula.

If you want the why behind these formats, see Chapter 7 (Quantization). This appendix is the what – a lookup table you can keep open while reading kernel code or debugging weight loading.

How to Read the Tables

  • Block size: Number of weights packed together as a unit. GGUF formats use fixed block sizes (32 or 256). MLX formats use configurable group sizes (typically 64).
  • Bytes per block: Total storage for one block, including quantized values, scales, and any auxiliary data.
  • Bits per weight (bpw): Effective bits per weight element, computed as 8 * bytes_per_block / block_size. This is the number that determines model file size.
  • Dequant formula: How to reconstruct the floating-point value from the quantized representation.
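The bpw arithmetic can be sanity-checked against any row of the tables below. A minimal sketch in plain C (the helper name is invented for illustration, not from the akunu codebase):

```c
#include <assert.h>

/* Effective bits per weight: 8 * bytes_per_block / block_size.
   Illustrative helper, not akunu code. */
double bpw(int bytes_per_block, int block_size) {
    return 8.0 * bytes_per_block / block_size;
}
```

For example, `bpw(18, 32)` gives 4.5 for Q4_0 and `bpw(210, 256)` gives 6.5625 for Q6_K, matching the tables.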

GGUF Formats: Basic Quantization

These formats use a simple scheme: each block of weights shares one or two floating-point parameters (scale and optional minimum/zero-point).

| Format | GGUF Code | Block Size | Bytes/Block | bpw | Dequant Formula |
|--------|-----------|------------|-------------|-----|-----------------|
| F32 | 0 | 1 | 4 | 32.0 | value = raw_f32 |
| F16 | 1 | 1 | 2 | 16.0 | value = raw_f16 |
| Q4_0 | 2 | 32 | 18 | 4.5 | value = d * (q[i] - 8), where q[i] is a 4-bit unsigned int and d is an FP16 scale |
| Q4_1 | 3 | 32 | 20 | 5.0 | value = d * q[i] + m, where d is an FP16 scale and m is an FP16 minimum |
| Q5_0 | 6 | 32 | 22 | 5.5 | value = d * (q[i] - 16), where q[i] is a 5-bit unsigned int (4 low bits packed + 1 high bit) and d is an FP16 scale |
| Q8_0 | 8 | 32 | 34 | 8.5 | value = d * q[i], where q[i] is a signed 8-bit int and d is an FP16 scale |
| BF16 | 30 | 1 | 2 | 16.0 | value = raw_bf16 (Brain Float 16: 8-bit exponent, 7-bit mantissa) |

Block Layout Details

Q4_0 (most common quantization in the GGUF ecosystem):

struct block_q4_0 {       // 18 bytes total
    half d;               //  2 bytes: scale factor
    uint8_t qs[16];       // 16 bytes: 32 x 4-bit values, packed in pairs
};                        // bpw = 18*8/32 = 4.5

Each byte in qs holds two 4-bit values: the low nibble is element 2i, the high nibble is element 2i+1. Dequantization extracts the nibble, subtracts 8 (to center around zero), and multiplies by the scale d.
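The nibble extraction can be sketched in plain C (akunu's real kernels do this in Metal; the function name and float scale are illustrative simplifications of the FP16 on-disk format):

```c
#include <assert.h>
#include <stdint.h>

/* Dequantize element i (0..31) of a Q4_0 block. Illustrative only:
   the on-disk scale d is FP16, taken here as a float. */
float dequant_q4_0(const uint8_t qs[16], float d, int i) {
    uint8_t byte = qs[i / 2];
    /* low nibble = element 2i, high nibble = element 2i+1 */
    uint8_t q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    return d * ((int)q - 8);
}
```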

Q4_1 (asymmetric variant of Q4_0):

struct block_q4_1 {       // 20 bytes total
    half d;               //  2 bytes: delta (scale)
    half m;               //  2 bytes: minimum
    uint8_t qs[16];       // 16 bytes: 32 x 4-bit values, packed in pairs
};                        // bpw = 20*8/32 = 5.0

The extra m (minimum) parameter means values are dequantized as d * q + m instead of d * (q - 8). This gives better accuracy when the weight distribution is not symmetric around zero.

Q5_0 (5-bit with high-bit extension):

struct block_q5_0 {       // 22 bytes total
    half d;               //  2 bytes: scale
    uint8_t qh[4];        //  4 bytes: 5th bit for each of 32 elements
    uint8_t qs[16];       // 16 bytes: lower 4 bits, packed in pairs
};                        // bpw = 22*8/32 = 5.5

The 5th bit for each element is stored separately in qh (packed as a uint32). To dequantize element i: extract the 4-bit value from qs, extract bit i from qh, combine to get a 5-bit unsigned int, subtract 16, multiply by d.
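Combining the two pieces looks roughly like this (a sketch, assuming little-endian bit order within qh; function name and float scale are illustrative, not akunu's Metal code):

```c
#include <assert.h>
#include <stdint.h>

/* Dequantize element i (0..31) of a Q5_0 block. qh supplies bit i of
   each element's 5-bit value. Illustrative sketch only. */
float dequant_q5_0(const uint8_t qs[16], const uint8_t qh[4], float d, int i) {
    uint8_t lo = (i % 2 == 0) ? (qs[i / 2] & 0x0F) : (qs[i / 2] >> 4);
    uint8_t hi = (qh[i / 8] >> (i % 8)) & 1;  /* bit i of the packed 32 bits */
    int q = (hi << 4) | lo;                   /* 5-bit unsigned, 0..31 */
    return d * (q - 16);
}
```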

Q8_0 (8-bit, highest quality block quant):

struct block_q8_0 {       // 34 bytes total
    half d;               //  2 bytes: scale
    int8_t qs[32];        // 32 bytes: signed 8-bit values
};                        // bpw = 34*8/32 = 8.5

Simple and fast to dequantize: value = d * qs[i]. The 0.5 extra bpw overhead comes from the FP16 scale shared across 32 elements.

GGUF Formats: K-Quantization

K-quant formats use a two-level quantization scheme with super-blocks of 256 elements. Each super-block contains sub-blocks with their own scales, plus a super-block-level scale that controls the magnitude of the sub-block scales. This hierarchical approach gives better accuracy at the same bit width compared to basic formats.1

| Format | GGUF Code | Block Size | Bytes/Block | bpw | Description |
|--------|-----------|------------|-------------|-----|-------------|
| Q2_K | 10 | 256 | 84 | 2.625 | 2-bit values + 4-bit scale/min per 16-element sub-block + super-block scale |
| Q3_K | 11 | 256 | 110 | 3.4375 | 2-bit base + 1 high bit + 6-bit packed scales + super-block scale |
| Q4_K | 12 | 256 | 144 | 4.5 | 4-bit values + 6-bit scales/mins + super-block scale |
| Q5_K | 13 | 256 | 176 | 5.5 | 4-bit base + 1 high bit + 6-bit scales/mins + super-block scale |
| Q6_K | 14 | 256 | 210 | 6.5625 | 4-bit low + 2-bit high (6-bit total) + 8-bit scales + super-block scale |

K-Quant Block Layouts

Q4_K (the most popular K-quant for production use):

struct block_q4_K {              // 144 bytes total
    half d;                      //   2 bytes: super-block scale for quants
    half dmin;                   //   2 bytes: super-block scale for mins
    uint8_t scales[12];          //  12 bytes: 8 x 6-bit scales + 8 x 6-bit mins
    uint8_t qs[128];             // 128 bytes: 256 x 4-bit values, nibble-packed
};                               // bpw = 144*8/256 = 4.5

The 256-element super-block is divided into 8 sub-blocks of 32 elements each. Each sub-block has a 6-bit scale and a 6-bit minimum, packed into 12 bytes. Dequantization for element i in sub-block j:

value = d * scale_j * q[i] - dmin * min_j

where q[i] is the unsigned 4-bit value. Note that the per-sub-block minimum term is subtracted, and there is no -8 centering offset as in Q4_0: the minimum plays that role.

The get_scale_min_k4() helper in KernelCommon.h unpacks the 6-bit scale and minimum from the packed 12-byte scales array.

Q3_K (aggressive 3-bit quantization):

struct block_q3_K {              // 110 bytes total
    uint8_t hmask[32];           //  32 bytes: high bit for each of 256 elements
    uint8_t qs[64];              //  64 bytes: lower 2 bits packed (4 per byte)
    uint8_t scales[12];          //  12 bytes: 16 x signed 6-bit scales
    half d;                      //   2 bytes: super-block scale
};                               // bpw = 110*8/256 = 3.4375

Each element has 3 bits: 2 bits from qs and 1 bit from hmask. The 16 sub-blocks (16 elements each) have signed 6-bit scales packed into 12 bytes. The get_scale_q3_k() helper unpacks these.

Q2_K (extreme 2-bit quantization):

struct block_q2_K {              // 84 bytes total
    uint8_t scales[16];          // 16 bytes: 4-bit scale + 4-bit min per sub-block
    uint8_t qs[64];              // 64 bytes: 2-bit values (4 per byte)
    half d;                      //  2 bytes: super-block scale
    half dmin;                   //  2 bytes: super-block min scale
};                               // bpw = 84*8/256 = 2.625

Q6_K (high-quality 6-bit):

struct block_q6_K {              // 210 bytes total
    uint8_t ql[128];             // 128 bytes: lower 4 bits of 6-bit quants
    uint8_t qh[64];              //  64 bytes: upper 2 bits of 6-bit quants
    int8_t scales[16];           //  16 bytes: signed 8-bit sub-block scales
    half d;                      //   2 bytes: super-block scale
};                               // bpw = 210*8/256 = 6.5625

MLX Formats: Group Quantization

MLX uses a simpler group quantization scheme. Weights are divided into groups (typically 64 elements), and each group has an FP16 scale and FP16 bias (zero-point). The packed weight buffer layout is:

[packed_weights | scales | biases]

The MLXParams.weight_bytes field gives the byte offset where scales begin. Biases follow immediately.

| Format | Internal Code | Group Size | Bits | bpw | Dequant Formula |
|--------|---------------|------------|------|-----|-----------------|
| MLX Q3 | 99 | 64 | 3 | ~3.5 | value = scale * (packed_3bit_int) + bias |
| MLX Q4 | 100 | 64 | 4 | ~4.5 | value = scale * (packed_4bit_int) + bias |
| MLX Q6 | 102 | 64 | 6 | ~6.5 | value = scale * (packed_6bit_int) + bias |
| MLX Q8 | 101 | 64 | 8 | ~8.5 | value = scale * (packed_8bit_int) + bias |

Notes on bpw for MLX: The effective bpw includes the overhead of the FP16 scale and bias per group. For group_size=64 with 4-bit values: (64 * 4 + 16 + 16) / 64 = 4.5 bpw. The exact overhead is 32 / group_size bits per weight for the scale+bias pair.
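That overhead formula fits in one line of C (helper name invented for illustration):

```c
#include <assert.h>

/* Effective bpw for MLX group quantization: the raw bit width plus
   16 scale bits and 16 bias bits amortized over each group. */
double mlx_bpw(int bits, int group_size) {
    return bits + 32.0 / group_size;
}
```

`mlx_bpw(4, 64)` reproduces the ~4.5 bpw figure above; smaller groups raise the overhead (e.g. group_size=32 would give 5.0 bpw for 4-bit values).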

MLX Packing Details

MLX Q4: Each uint32 holds 8 x 4-bit values. The low 4 bits are element 0, bits 4-7 are element 1, and so on. The group_size determines how many packed uint32s share a single scale/bias pair: for group_size=64, that is 8 uint32s per group.
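The unpacking of one uint32 can be sketched as follows (a simplification with float scale/bias instead of FP16; function name is illustrative, not MLX or akunu source):

```c
#include <assert.h>
#include <stdint.h>

/* Dequantize element i (0..7) of one packed uint32 in an MLX Q4 tensor.
   scale and bias are the enclosing group's parameters. Sketch only. */
float mlx_dequant_q4(uint32_t packed, float scale, float bias, int i) {
    uint32_t q = (packed >> (4 * i)) & 0xF;  /* low nibble is element 0 */
    return scale * (float)q + bias;
}
```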

MLX Q3: Packing is more complex. Three bits per value means values do not align neatly to byte boundaries. MLX packs 32 x 3-bit values into 3 uint32s (96 bits for 32 values). The remaining 32 values in a 64-element group use another 3 uint32s.

MLX Q8: The simplest MLX format. Each byte holds one 8-bit quantized value. Dequantization is a simple multiply-add: value = scale * q[i] + bias.

Internal Dtype Codes

Akunu uses uint32_t dtype codes internally. GGUF dtypes 0-30 map directly to the GGUF specification. MLX formats use synthetic codes 99-102 that are assigned during weight loading by MLXWeightStore. The full mapping in dtype_descriptor.h:

| Code | Format | Origin |
|------|--------|--------|
| 0 | F32 | GGUF |
| 1 | F16 | GGUF |
| 2 | Q4_0 | GGUF |
| 3 | Q4_1 | GGUF |
| 6 | Q5_0 | GGUF |
| 8 | Q8_0 | GGUF |
| 10 | Q2_K | GGUF |
| 11 | Q3_K | GGUF |
| 12 | Q4_K | GGUF |
| 13 | Q5_K | GGUF |
| 14 | Q6_K | GGUF |
| 30 | BF16 | GGUF |
| 31 | BF16 (native) | GGUF, M4+ only |
| 99 | MLX Q3 | MLX SafeTensors |
| 100 | MLX Q4 | MLX SafeTensors |
| 101 | MLX Q8 | MLX SafeTensors |
| 102 | MLX Q6 | MLX SafeTensors |

Note that codes 4-5, 7, 9, 15-29 are defined in the GGUF specification (for types like Q5_1, Q8_1, IQ2_XXS, etc.) but are not currently supported by akunu’s Metal kernels. If you attempt to load a GGUF file using an unsupported dtype, the dtype_lookup() function falls back to the F16 descriptor, which will produce incorrect results. Check the dtype before loading.

Kernel Support Matrix

Not every format has every kernel variant. This table shows which Metal kernel types are available for each supported dtype:

| Format | GEMV | GEMV Wide | GEMM | GEMM Small | Embedding | Fused SiLU |
|--------|------|-----------|------|------------|-----------|------------|
| F16 | yes | yes | yes | yes | yes (generic) | no |
| Q4_0 | yes | yes | yes | yes | yes | yes |
| Q4_1 | yes | yes | yes | yes | yes | no |
| Q5_0 | yes | yes | yes | yes | no | no |
| Q8_0 | yes | yes | yes | yes | yes | no |
| Q2_K | yes | no | yes | yes | no | no |
| Q3_K | yes | no | yes | yes | no | no |
| Q4_K | yes | yes | yes | yes | yes | no |
| Q5_K | yes | no | yes | yes | no | no |
| Q6_K | yes | no | yes | yes | yes | no |
| BF16 | yes | no | yes | yes | yes | no |
| MLX Q3 | yes | no | yes | yes | yes | yes |
| MLX Q4 | yes | yes | yes | yes | yes | yes |
| MLX Q6 | yes | no | yes | yes | yes | yes |
| MLX Q8 | yes | yes | yes | yes | yes | yes |

Key observations:

  • GEMV Wide kernels exist only for formats with wide enough adoption to justify the implementation effort. Q4_0 and Q4_K are the most common GGUF formats; MLX Q4 and Q8 are the most common MLX formats.
  • Fused SiLU kernels exist for Q4_0 and all MLX formats. These fuse the gate+up GEMV with the SiLU activation to eliminate an intermediate buffer write. Other GGUF K-quant formats do not have fused SiLU variants.
  • Embedding kernels that dequantize on the fly exist for the most common formats. Formats without a specialized embedding kernel use the generic F16 embedding lookup, which requires the embedding weights to be stored in FP16 (or converted during loading).

Model Size Estimation

To estimate the file size of a model in a given format:

file_size_bytes = n_parameters * bpw / 8 + metadata_overhead

The metadata overhead (GGUF header, tensor info, tokenizer data) is typically 1-10 MB for GGUF files, which is negligible for large models.

| Parameters | Q4_0 (4.5 bpw) | Q4_K (4.5 bpw) | Q8_0 (8.5 bpw) | MLX Q4 (~4.5 bpw) | F16 (16 bpw) |
|------------|----------------|----------------|----------------|-------------------|--------------|
| 1B | 0.56 GB | 0.56 GB | 1.06 GB | 0.56 GB | 2.0 GB |
| 4B | 2.25 GB | 2.25 GB | 4.25 GB | 2.25 GB | 8.0 GB |
| 8B | 4.50 GB | 4.50 GB | 8.50 GB | 4.50 GB | 16.0 GB |
| 14B | 7.88 GB | 7.88 GB | 14.88 GB | 7.88 GB | 28.0 GB |
| 32B | 18.0 GB | 18.0 GB | 34.0 GB | 18.0 GB | 64.0 GB |
| 70B | 39.4 GB | 39.4 GB | 74.4 GB | 39.4 GB | 140.0 GB |

Note that Q4_0 and Q4_K have the same effective bpw (4.5) but Q4_K generally provides better accuracy due to the hierarchical scale structure. The file sizes are identical; the quality difference is in how those bits are allocated.
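The table entries follow directly from the size formula. A quick numerical check, ignoring metadata overhead (helper name invented for illustration):

```c
#include <assert.h>

/* Approximate model file size in GB: n_parameters * bpw / 8 bytes,
   metadata overhead ignored. Illustrative helper. */
double model_gb(double n_params, double bits_per_weight) {
    return n_params * bits_per_weight / 8.0 / 1e9;
}
```

`model_gb(8e9, 4.5)` reproduces the 4.50 GB entry for an 8B model at Q4_0/Q4_K.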

Memory Bandwidth and Decode Throughput

Since single-token decode is memory-bound, the quantization format directly determines the maximum achievable decode throughput. The relationship is:

theoretical_max_tok_s = memory_bandwidth_bytes_per_sec / model_weight_bytes

This means halving the bits per weight (e.g., Q8_0 to Q4_0) roughly doubles the theoretical decode speed. Here is a reference table for a 4B parameter model on various chips:

| Chip | BW (GB/s) | Q4_0 (2.25 GB) | Q8_0 (4.25 GB) | F16 (8.0 GB) |
|------|-----------|----------------|----------------|--------------|
| M1 | 68.25 | 30.3 tok/s | 16.1 tok/s | 8.5 tok/s |
| M2 | 100 | 44.4 tok/s | 23.5 tok/s | 12.5 tok/s |
| M2 Pro | 200 | 88.9 tok/s | 47.1 tok/s | 25.0 tok/s |
| M3 Max | 400 | 177.8 tok/s | 94.1 tok/s | 50.0 tok/s |
| M4 Pro | 273 | 121.3 tok/s | 64.2 tok/s | 34.1 tok/s |
| M4 Max | 546 | 242.7 tok/s | 128.5 tok/s | 68.3 tok/s |

These are theoretical maximums assuming 100% bandwidth utilization. In practice, akunu achieves 70-85% of these numbers due to overhead from attention, normalization, RoPE, and kernel dispatch. The system-level cache (SLC) can push effective bandwidth above the raw DRAM bandwidth for chain decode workloads, sometimes exceeding the theoretical maximum.
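The throughput ceiling is a one-line division (helper name invented for illustration):

```c
#include <assert.h>
#include <math.h>

/* Theoretical decode ceiling: each generated token requires one full
   pass over the model weights. Illustrative helper. */
double max_tok_s(double bandwidth_gb_s, double weight_gb) {
    return bandwidth_gb_s / weight_gb;
}
```

`max_tok_s(400.0, 2.25)` reproduces the 177.8 tok/s entry for an M3 Max running a Q4_0 4B model.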

Choosing a Quantization Format

Here is a decision guide based on common use cases:

| Use Case | Recommended Format | Rationale |
|----------|--------------------|-----------|
| Maximum speed, acceptable quality | Q4_0 (GGUF) or MLX Q4 | Lowest bpw with good quality. Best decode throughput. |
| Best quality/speed tradeoff | Q4_K (GGUF) | Same bpw as Q4_0 but better accuracy from hierarchical scales. |
| Quality-sensitive applications | Q6_K (GGUF) or MLX Q6 | 6.5 bpw gives near-FP16 quality with 2.5x less memory. |
| Near-lossless | Q8_0 (GGUF) or MLX Q8 | 8.5 bpw is essentially indistinguishable from FP16 for most tasks. |
| Research / debugging | F16 or BF16 | Full precision. Useful as a reference for measuring quantization error. |
| Tiny models (<1B params) | Q8_0 or F16 | Small models are already fast; use higher precision to preserve quality. |
| Large models on limited RAM | Q2_K or Q3_K | Aggressive quantization to fit models that would otherwise not fit in memory. Quality degrades noticeably. |

Unsupported GGUF Types

The GGUF specification defines several additional quantization types that akunu does not currently support with Metal kernels. These are listed in gguf_parser.h but will fall back to the F16 dtype descriptor (producing incorrect results) if encountered:

| GGUF Code | Name | Reason Not Supported |
|-----------|------|----------------------|
| 7 | Q5_1 | Asymmetric 5-bit; rare in practice, superseded by Q5_K |
| 9 | Q8_1 | Asymmetric 8-bit; rarely used for distribution |
| 15 | Q8_K | 8-bit K-quant; used internally by llama.cpp during quantization, not for inference |
| 16 | IQ2_XXS | Importance-matrix quantized 2-bit; complex lookup-table dequantization |
| 17 | IQ2_XS | Importance-matrix 2-bit variant |
| 18 | IQ3_XXS | Importance-matrix 3-bit |
| 19 | IQ1_S | Importance-matrix 1-bit |
| 20 | IQ4_NL | Non-linear 4-bit with lookup table |
| 21 | IQ3_S | Importance-matrix 3-bit variant |
| 22 | IQ2_S | Importance-matrix 2-bit variant |
| 23 | IQ4_XS | Importance-matrix 4-bit |
| 29 | IQ1_M | Importance-matrix 1-bit variant |

The IQ (importance-matrix quantized) formats use lookup tables for dequantization, which adds implementation complexity. They offer slightly better perplexity than the standard formats at the same bit width, but their adoption in the ecosystem is limited compared to Q4_0, Q4_K, and Q6_K. If there is demand, these could be added as Metal kernels in the future.

Comparing GGUF and MLX Quantization

Even at the same nominal bit width, GGUF and MLX quantization are not identical:

| Aspect | GGUF (e.g., Q4_0) | MLX (e.g., Q4) |
|--------|-------------------|----------------|
| Block/group structure | Fixed 32-element blocks | Configurable groups (typically 64) |
| Scale storage | FP16 scale per block (Q4_0) or hierarchical 6-bit scales (Q4_K) | FP16 scale + FP16 bias per group |
| Dequant | d * (q - zero_point) | scale * q + bias |
| Symmetry | Symmetric (Q4_0) or asymmetric (Q4_1) | Always asymmetric (has bias) |
| Overhead | 2 bytes per 32 elements (Q4_0) = 0.5 bpw | 4 bytes per 64 elements = 0.5 bpw |
| Quantization method | Post-training quantization by llama.cpp | Post-training quantization by MLX |

The practical quality difference between GGUF Q4_0 and MLX Q4 is small for most models. The larger group size in MLX (64 vs 32) means each scale covers more elements, which can be slightly worse for weight distributions with high local variance, but the asymmetric bias term partially compensates.

For weights that are uniformly distributed around zero (which is common after training), symmetric quantization (GGUF Q4_0) is slightly more efficient because it does not waste bits on the bias term. For weights with non-zero mean per group, asymmetric quantization (MLX Q4, GGUF Q4_1) is more accurate.


  1. K-quantization was introduced by ikawrakow in llama.cpp (2023). The “K” originally stood for “k-quant” with no specific expansion. The key insight is that using more bits for scale parameters (6-bit sub-block scales + FP16 super-block scale) reduces the quantization error budget allocated to the scale itself.