Appendix B: Quantization Format Reference
This appendix is a quick-reference for every quantization format supported by akunu. It covers the GGUF block-quantized formats (from the ggml/llama.cpp ecosystem) and the MLX group-quantized formats (from Apple’s MLX framework). For each format, you get the block/group size, bytes per block, effective bits per weight, and the dequantization formula.
If you want the why behind these formats, see Chapter 7 (Quantization). This appendix is the what – a lookup table you can keep open while reading kernel code or debugging weight loading.
How to Read the Tables
- Block size: Number of weights packed together as a unit. GGUF formats use fixed block sizes (32 or 256). MLX formats use configurable group sizes (typically 64).
- Bytes per block: Total storage for one block, including quantized values, scales, and any auxiliary data.
- Bits per weight (bpw): Effective bits per weight element, computed as 8 * bytes_per_block / block_size. This is the number that determines model file size.
- Dequant formula: How to reconstruct the floating-point value from the quantized representation.
GGUF Formats: Basic Quantization
These formats use a simple scheme: each block of weights shares one or two floating-point parameters (scale and optional minimum/zero-point).
| Format | GGUF Code | Block Size | Bytes/Block | bpw | Dequant Formula |
|---|---|---|---|---|---|
| F32 | 0 | 1 | 4 | 32.0 | value = raw_f32 |
| F16 | 1 | 1 | 2 | 16.0 | value = raw_f16 |
| Q4_0 | 2 | 32 | 18 | 4.5 | value = d * (q[i] - 8) where q[i] is a 4-bit unsigned int, d is FP16 scale |
| Q4_1 | 3 | 32 | 20 | 5.0 | value = d * q[i] + m where d is FP16 scale, m is FP16 minimum |
| Q5_0 | 6 | 32 | 22 | 5.5 | value = d * (q[i] - 16) where q[i] is a 5-bit unsigned int (4 low bits packed + 1 high bit), d is FP16 scale |
| Q8_0 | 8 | 32 | 34 | 8.5 | value = d * q[i] where q[i] is a signed 8-bit int, d is FP16 scale |
| BF16 | 30 | 1 | 2 | 16.0 | value = raw_bf16 (Brain Float 16: 8-bit exponent, 7-bit mantissa) |
Block Layout Details
Q4_0 (most common quantization in the GGUF ecosystem):
struct block_q4_0 { // 18 bytes total
half d; // 2 bytes: scale factor
uint8_t qs[16]; // 16 bytes: 32 x 4-bit values, packed in pairs
}; // bpw = 18*8/32 = 4.5
Each byte in qs holds two 4-bit values: the low nibble of byte j is element j and the high nibble is element j + 16, so the block is stored as two halves of 16 elements. Dequantization extracts the nibble, subtracts 8 (to center around zero), and multiplies by the scale d.
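As a concrete illustration, here is a minimal CPU-side sketch of Q4_0 block dequantization. It is not akunu's Metal kernel; the FP16 scale is assumed to have been converted to float already.
#include <cstdint>
// Dequantize one 32-element Q4_0 block (illustrative CPU sketch).
void dequant_q4_0_block(const uint8_t qs[16], float d, float out[32]) {
    for (int j = 0; j < 16; ++j) {
        int lo = qs[j] & 0x0F;          // low nibble  -> element j
        int hi = (qs[j] >> 4) & 0x0F;   // high nibble -> element j + 16
        out[j]      = d * (float)(lo - 8);
        out[j + 16] = d * (float)(hi - 8);
    }
}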
Q4_1 (asymmetric variant of Q4_0):
struct block_q4_1 { // 20 bytes total
half d; // 2 bytes: delta (scale)
half m; // 2 bytes: minimum
uint8_t qs[16]; // 16 bytes: 32 x 4-bit values, packed in pairs
}; // bpw = 20*8/32 = 5.0
The extra m (minimum) parameter means values are dequantized as d * q + m instead of d * (q - 8). This gives better accuracy when the weight distribution is not symmetric around zero.
Q5_0 (5-bit with high-bit extension):
struct block_q5_0 { // 22 bytes total
half d; // 2 bytes: scale
uint8_t qh[4]; // 4 bytes: 5th bit for each of 32 elements
uint8_t qs[16]; // 16 bytes: lower 4 bits, packed in pairs
}; // bpw = 22*8/32 = 5.5
The 5th bit for each element is stored separately in qh (packed as a uint32). To dequantize element i: extract the 4-bit value from qs, extract bit i from qh, combine to get a 5-bit unsigned int, subtract 16, multiply by d.
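A minimal sketch of that reconstruction for a single element (CPU-side, illustrative only; the FP16 scale is assumed to be converted to float, and the lower 4 bits are taken from the nibble-packed qs array):
#include <cstdint>
#include <cstring>
// Dequantize element i (0..31) of a Q5_0 block (illustrative CPU sketch).
float dequant_q5_0_element(const uint8_t qs[16], const uint8_t qh[4], float d, int i) {
    uint32_t high_bits;
    std::memcpy(&high_bits, qh, 4);                             // packed 5th bits, little-endian
    int lo4  = (i < 16) ? (qs[i] & 0x0F) : (qs[i - 16] >> 4);   // lower 4 bits
    int bit5 = (high_bits >> i) & 1;                            // this element's 5th bit
    int q    = lo4 | (bit5 << 4);                               // 5-bit unsigned value, 0..31
    return d * (float)(q - 16);
}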
Q8_0 (8-bit, highest quality block quant):
struct block_q8_0 { // 34 bytes total
half d; // 2 bytes: scale
int8_t qs[32]; // 32 bytes: signed 8-bit values
}; // bpw = 34*8/32 = 8.5
Simple and fast to dequantize: value = d * qs[i]. The 0.5 extra bpw overhead comes from the FP16 scale shared across 32 elements.
GGUF Formats: K-Quantization
K-quant formats use a two-level quantization scheme with super-blocks of 256 elements. Each super-block contains sub-blocks with their own scales, plus a super-block-level scale that controls the magnitude of the sub-block scales. This hierarchical approach gives better accuracy at the same bit width compared to basic formats.[^1]
| Format | GGUF Code | Block Size | Bytes/Block | bpw | Description |
|---|---|---|---|---|---|
| Q2_K | 10 | 256 | 84 | 2.625 | 2-bit values + 4-bit scale/min per 16-element sub-block + super-block scale |
| Q3_K | 11 | 256 | 110 | 3.4375 | 2-bit base + 1 high bit + 6-bit packed scales + super-block scale |
| Q4_K | 12 | 256 | 144 | 4.5 | 4-bit values + 6-bit scales/mins + super-block scale |
| Q5_K | 13 | 256 | 176 | 5.5 | 4-bit base + 1 high bit + 6-bit scales/mins + super-block scale |
| Q6_K | 14 | 256 | 210 | 6.5625 | 4-bit low + 2-bit high (6-bit total) + 8-bit scales + super-block scale |
K-Quant Block Layouts
Q4_K (the most popular K-quant for production use):
struct block_q4_K { // 144 bytes total
half d; // 2 bytes: super-block scale for quants
half dmin; // 2 bytes: super-block scale for mins
uint8_t scales[12]; // 12 bytes: 8 x 6-bit scales + 8 x 6-bit mins
uint8_t qs[128]; // 128 bytes: 256 x 4-bit values, nibble-packed
}; // bpw = 144*8/256 = 4.5
The 256-element super-block is divided into 8 sub-blocks of 32 elements each. Each sub-block has a 6-bit scale and a 6-bit minimum, packed into 12 bytes. Dequantization for element i in sub-block j:
value = d * scale_j * q[i] - dmin * min_j
The get_scale_min_k4() helper in KernelCommon.h unpacks the 6-bit scale and minimum from the packed 12-byte scales array.
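A sketch of that two-level dequantization for one 32-element sub-block, assuming the 6-bit scale and min have already been unpacked (as get_scale_min_k4() does) and the FP16 super-block scales converted to float. The 4-bit values are shown pre-extracted, one per byte, to keep the sketch short:
#include <cstdint>
// Dequantize one 32-element Q4_K sub-block (illustrative CPU sketch).
void dequant_q4_K_subblock(const uint8_t q[32],    // extracted 4-bit values, 0..15
                           float d, float dmin,    // super-block scales
                           uint8_t sc, uint8_t mn, // unpacked 6-bit scale and min
                           float out[32]) {
    float dl = d * (float)sc;      // effective scale for this sub-block
    float ml = dmin * (float)mn;   // effective minimum for this sub-block
    for (int i = 0; i < 32; ++i) {
        out[i] = dl * (float)q[i] - ml;
    }
}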
Q3_K (aggressive 3-bit quantization):
struct block_q3_K { // 110 bytes total
uint8_t hmask[32]; // 32 bytes: high bit for each of 256 elements
uint8_t qs[64]; // 64 bytes: lower 2 bits packed (4 per byte)
uint8_t scales[12]; // 12 bytes: 16 x signed 6-bit scales
half d; // 2 bytes: super-block scale
}; // bpw = 110*8/256 = 3.4375
Each element has 3 bits: 2 bits from qs and 1 bit from hmask. The 16 sub-blocks (16 elements each) have signed 6-bit scales packed into 12 bytes. The get_scale_q3_k() helper unpacks these.
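The per-element arithmetic, shown as a sketch with the bit extraction and scale unpacking already done (the 2 low bits from qs, the high bit from hmask, and the signed 6-bit sub-block scale as get_scale_q3_k() would return it; the super-block scale d is converted to float):
// Dequantize one Q3_K element (illustrative CPU sketch).
float dequant_q3_K_element(int low2, int high_bit, int scale6, float d) {
    int q3 = low2 | (high_bit << 2);              // reassemble the 3-bit value, 0..7
    return d * (float)scale6 * (float)(q3 - 4);   // subtract 4 to center around zero
}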
Q2_K (extreme 2-bit quantization):
struct block_q2_K { // 84 bytes total
uint8_t scales[16]; // 16 bytes: 4-bit scale + 4-bit min per sub-block
uint8_t qs[64]; // 64 bytes: 2-bit values (4 per byte)
half d; // 2 bytes: super-block scale
half dmin; // 2 bytes: super-block min scale
}; // bpw = 84*8/256 = 2.625
Q6_K (high-quality 6-bit):
struct block_q6_K { // 210 bytes total
uint8_t ql[128]; // 128 bytes: lower 4 bits of 6-bit quants
uint8_t qh[64]; // 64 bytes: upper 2 bits of 6-bit quants
int8_t scales[16]; // 16 bytes: signed 8-bit sub-block scales
half d; // 2 bytes: super-block scale
}; // bpw = 210*8/256 = 6.5625
MLX Formats: Group Quantization
MLX uses a simpler group quantization scheme. Weights are divided into groups (typically 64 elements), and each group has an FP16 scale and FP16 bias (zero-point). The packed weight buffer layout is:
[packed_weights | scales | biases]
The MLXParams.weight_bytes field gives the byte offset where scales begin. Biases follow immediately.
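A sketch of locating the three regions under that layout. The struct and function names here are illustrative, not akunu's actual API; scales and biases are assumed to be FP16 stored as raw uint16_t bits:
#include <cstddef>
#include <cstdint>
// Views into an MLX-quantized weight buffer: [packed_weights | scales | biases].
// Assumes the buffer is suitably aligned for the pointer casts.
struct MlxViews {
    const uint32_t* packed;   // packed quantized values
    const uint16_t* scales;   // one FP16 scale per group (raw bits)
    const uint16_t* biases;   // one FP16 bias per group (raw bits)
};
MlxViews mlx_views(const uint8_t* buf, size_t weight_bytes, size_t n_groups) {
    MlxViews v;
    v.packed = reinterpret_cast<const uint32_t*>(buf);
    v.scales = reinterpret_cast<const uint16_t*>(buf + weight_bytes);
    v.biases = v.scales + n_groups;   // biases follow the scales immediately
    return v;
}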
| Format | Internal Code | Group Size | Bits | bpw | Dequant Formula |
|---|---|---|---|---|---|
| MLX Q3 | 99 | 64 | 3 | ~3.5 | value = scale * q + bias, q = unpacked 3-bit unsigned int |
| MLX Q4 | 100 | 64 | 4 | ~4.5 | value = scale * q + bias, q = unpacked 4-bit unsigned int |
| MLX Q6 | 102 | 64 | 6 | ~6.5 | value = scale * q + bias, q = unpacked 6-bit unsigned int |
| MLX Q8 | 101 | 64 | 8 | ~8.5 | value = scale * q + bias, q = 8-bit unsigned int |
Notes on bpw for MLX: The effective bpw includes the overhead of the FP16 scale and bias per group. For group_size=64 with 4-bit values: (64 * 4 + 16 + 16) / 64 = 4.5 bpw. The exact overhead is 32 / group_size bits per weight for the scale+bias pair.
MLX Packing Details
MLX Q4: Each uint32 holds 8 x 4-bit values. The low 4 bits are element 0, bits 4-7 are element 1, and so on. The group_size determines how many packed uint32s share a single scale/bias pair: for group_size=64, that is 8 uint32s per group.
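A sketch of unpacking one such uint32 and applying the group's scale and bias (illustrative only; scale and bias are assumed to be converted from FP16 to float):
#include <cstdint>
// Dequantize the 8 values packed into one MLX Q4 uint32 (illustrative CPU sketch).
void dequant_mlx_q4_word(uint32_t w, float scale, float bias, float out[8]) {
    for (int i = 0; i < 8; ++i) {
        uint32_t q = (w >> (4 * i)) & 0xF;   // element i sits in bits 4i..4i+3
        out[i] = scale * (float)q + bias;    // asymmetric dequant
    }
}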
MLX Q3: Packing is more complex. Three bits per value means values do not align neatly to byte boundaries. MLX packs 32 x 3-bit values into 3 uint32s (96 bits for 32 values). The remaining 32 values in a 64-element group use another 3 uint32s.
MLX Q8: The simplest MLX format. Each byte holds one 8-bit quantized value. Dequantization is a simple multiply-add: value = scale * q[i] + bias.
Internal Dtype Codes
Akunu uses uint32_t dtype codes internally. GGUF dtypes 0-30 map directly to the GGUF specification. MLX formats use synthetic codes 99-102 that are assigned during weight loading by MLXWeightStore. The full mapping, defined in dtype_descriptor.h:
| Code | Format | Origin |
|---|---|---|
| 0 | F32 | GGUF |
| 1 | F16 | GGUF |
| 2 | Q4_0 | GGUF |
| 3 | Q4_1 | GGUF |
| 6 | Q5_0 | GGUF |
| 8 | Q8_0 | GGUF |
| 10 | Q2_K | GGUF |
| 11 | Q3_K | GGUF |
| 12 | Q4_K | GGUF |
| 13 | Q5_K | GGUF |
| 14 | Q6_K | GGUF |
| 30 | BF16 | GGUF |
| 31 | BF16 (native) | GGUF, M4+ only |
| 99 | MLX Q3 | MLX SafeTensors |
| 100 | MLX Q4 | MLX SafeTensors |
| 101 | MLX Q8 | MLX SafeTensors |
| 102 | MLX Q6 | MLX SafeTensors |
Note that codes 4-5, 7, 9, 15-29 are defined in the GGUF specification (for types like Q5_1, Q8_1, IQ2_XXS, etc.) but are not currently supported by akunu’s Metal kernels. If you attempt to load a GGUF file using an unsupported dtype, the dtype_lookup() function falls back to the F16 descriptor, which will produce incorrect results. Check the dtype before loading.
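A sketch of such a pre-load guard, using the codes from the mapping table above (the function name is illustrative, not akunu's API):
#include <cstdint>
// Reject dtype codes that have no Metal kernel support instead of silently
// falling back to the F16 descriptor.
bool is_supported_dtype(uint32_t code) {
    switch (code) {
        case 0: case 1:                              // F32, F16
        case 2: case 3: case 6: case 8:              // Q4_0, Q4_1, Q5_0, Q8_0
        case 10: case 11: case 12: case 13: case 14: // Q2_K, Q3_K, Q4_K, Q5_K, Q6_K
        case 30: case 31:                            // BF16, BF16 (native)
        case 99: case 100: case 101: case 102:       // MLX Q3, Q4, Q8, Q6
            return true;
        default:
            return false;
    }
}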
Kernel Support Matrix
Not every format has every kernel variant. This table shows which Metal kernel types are available for each supported dtype:
| Format | GEMV | GEMV Wide | GEMM | GEMM Small | Embedding | Fused SiLU |
|---|---|---|---|---|---|---|
| F16 | yes | yes | yes | yes | yes (generic) | no |
| Q4_0 | yes | yes | yes | yes | yes | yes |
| Q4_1 | yes | yes | yes | yes | yes | no |
| Q5_0 | yes | yes | yes | yes | no | no |
| Q8_0 | yes | yes | yes | yes | yes | no |
| Q2_K | yes | no | yes | yes | no | no |
| Q3_K | yes | no | yes | yes | no | no |
| Q4_K | yes | yes | yes | yes | yes | no |
| Q5_K | yes | no | yes | yes | no | no |
| Q6_K | yes | no | yes | yes | yes | no |
| BF16 | yes | no | yes | yes | yes | no |
| MLX Q3 | yes | no | yes | yes | yes | yes |
| MLX Q4 | yes | yes | yes | yes | yes | yes |
| MLX Q6 | yes | no | yes | yes | yes | yes |
| MLX Q8 | yes | yes | yes | yes | yes | yes |
Key observations:
- GEMV Wide kernels exist only for formats with wide enough adoption to justify the implementation effort. Q4_0 and Q4_K are the most common GGUF formats; MLX Q4 and Q8 are the most common MLX formats.
- Fused SiLU kernels exist for Q4_0 and all MLX formats. These fuse the gate+up GEMV with the SiLU activation to eliminate an intermediate buffer write. Other GGUF K-quant formats do not have fused SiLU variants.
- Embedding kernels that dequantize on the fly exist for the most common formats. Formats without a specialized embedding kernel use the generic F16 embedding lookup, which requires the embedding weights to be stored in FP16 (or converted during loading).
Model Size Estimation
To estimate the file size of a model in a given format:
file_size_bytes = n_parameters * bpw / 8 + metadata_overhead
The metadata overhead (GGUF header, tensor info, tokenizer data) is typically 1-10 MB for GGUF files, negligible for large models.
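A minimal helper that reproduces the sizes in the table below (decimal gigabytes, metadata ignored):
// Estimate model weight size in decimal GB from parameter count and bpw.
double model_size_gb(double n_params, double bpw) {
    return n_params * bpw / 8.0 / 1e9;
}
// Example: model_size_gb(8e9, 4.5) == 4.5  -> an 8B model at Q4_0 or Q4_K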
| Parameters | Q4_0 (4.5 bpw) | Q4_K (4.5 bpw) | Q8_0 (8.5 bpw) | MLX Q4 (~4.5 bpw) | F16 (16 bpw) |
|---|---|---|---|---|---|
| 1B | 0.56 GB | 0.56 GB | 1.06 GB | 0.56 GB | 2.0 GB |
| 4B | 2.25 GB | 2.25 GB | 4.25 GB | 2.25 GB | 8.0 GB |
| 8B | 4.50 GB | 4.50 GB | 8.50 GB | 4.50 GB | 16.0 GB |
| 14B | 7.88 GB | 7.88 GB | 14.88 GB | 7.88 GB | 28.0 GB |
| 32B | 18.0 GB | 18.0 GB | 34.0 GB | 18.0 GB | 64.0 GB |
| 70B | 39.4 GB | 39.4 GB | 74.4 GB | 39.4 GB | 140.0 GB |
Note that Q4_0 and Q4_K have the same effective bpw (4.5) but Q4_K generally provides better accuracy due to the hierarchical scale structure. The file sizes are identical; the quality difference is in how those bits are allocated.
Memory Bandwidth and Decode Throughput
Since single-token decode is memory-bound, the quantization format directly determines the maximum achievable decode throughput. The relationship is:
theoretical_max_tok_s = memory_bandwidth_bytes_per_sec / model_weight_bytes
This means halving the bits per weight (e.g., Q8_0 to Q4_0) roughly doubles the theoretical decode speed. Here is a reference table for a 4B parameter model on various chips:
| Chip | BW (GB/s) | Q4_0 (2.25 GB) | Q8_0 (4.25 GB) | F16 (8.0 GB) |
|---|---|---|---|---|
| M1 | 68.25 | 30.3 tok/s | 16.1 tok/s | 8.5 tok/s |
| M2 | 100 | 44.4 tok/s | 23.5 tok/s | 12.5 tok/s |
| M2 Pro | 200 | 88.9 tok/s | 47.1 tok/s | 25.0 tok/s |
| M3 Max | 400 | 177.8 tok/s | 94.1 tok/s | 50.0 tok/s |
| M4 Pro | 273 | 121.3 tok/s | 64.2 tok/s | 34.1 tok/s |
| M4 Max | 546 | 242.7 tok/s | 128.5 tok/s | 68.3 tok/s |
These are theoretical maximums assuming 100% bandwidth utilization. In practice, akunu achieves 70-85% of these numbers due to overhead from attention, normalization, RoPE, and kernel dispatch. The SLC (system-level cache) can push effective bandwidth above the raw DRAM bandwidth for chain decode workloads, so measured throughput can occasionally exceed the theoretical maximum in the table.
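The ceiling itself is a one-line calculation; a sketch that reproduces the table entries:
// Theoretical decode ceiling: one full pass over the weights per generated token.
double max_tok_per_s(double bandwidth_gb_s, double weight_gb) {
    return bandwidth_gb_s / weight_gb;
}
// Example: max_tok_per_s(273.0, 2.25) ~= 121.3 tok/s (M4 Pro, 4B model at Q4_0)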
Choosing a Quantization Format
Here is a decision guide based on common use cases:
| Use Case | Recommended Format | Rationale |
|---|---|---|
| Maximum speed, acceptable quality | Q4_0 (GGUF) or MLX Q4 | Lowest bpw with good quality. Best decode throughput. |
| Best quality/speed tradeoff | Q4_K (GGUF) | Same bpw as Q4_0 but better accuracy from hierarchical scales. |
| Quality-sensitive applications | Q6_K (GGUF) or MLX Q6 | 6.5 bpw gives near-FP16 quality with 2.5x less memory. |
| Near-lossless | Q8_0 (GGUF) or MLX Q8 | 8.5 bpw is essentially indistinguishable from FP16 for most tasks. |
| Research / debugging | F16 or BF16 | Full precision. Useful as a reference for measuring quantization error. |
| Tiny models (<1B params) | Q8_0 or F16 | Small models are already fast; use higher precision to preserve quality. |
| Large models on limited RAM | Q2_K or Q3_K | Aggressive quantization to fit models that would otherwise not fit in memory. Quality degrades noticeably. |
Unsupported GGUF Types
The GGUF specification defines several additional quantization types that akunu does not currently support with Metal kernels. These are listed in gguf_parser.h but will fall back to the F16 dtype descriptor (producing incorrect results) if encountered:
| GGUF Code | Name | Reason Not Supported |
|---|---|---|
| 7 | Q5_1 | Asymmetric 5-bit; rare in practice, superseded by Q5_K |
| 9 | Q8_1 | Asymmetric 8-bit; rarely used for distribution |
| 15 | Q8_K | 8-bit K-quant; used internally by llama.cpp during quantization, not for inference |
| 16 | IQ2_XXS | Importance-matrix quantized 2-bit; complex lookup table dequantization |
| 17 | IQ2_XS | Importance-matrix 2-bit variant |
| 18 | IQ3_XXS | Importance-matrix 3-bit |
| 19 | IQ1_S | Importance-matrix 1-bit |
| 20 | IQ4_NL | Non-linear 4-bit with lookup table |
| 21 | IQ3_S | Importance-matrix 3-bit variant |
| 22 | IQ2_S | Importance-matrix 2-bit variant |
| 23 | IQ4_XS | Importance-matrix 4-bit |
| 29 | IQ1_M | Importance-matrix 1-bit variant |
The IQ (importance-matrix quantized) formats use lookup tables for dequantization, which adds implementation complexity. They offer slightly better perplexity than the standard formats at the same bit width, but their adoption in the ecosystem is limited compared to Q4_0, Q4_K, and Q6_K. If there is demand, these could be added as Metal kernels in the future.
Comparing GGUF and MLX Quantization
Even at the same nominal bit width, GGUF and MLX quantization are not identical:
| Aspect | GGUF (e.g., Q4_0) | MLX (e.g., Q4) |
|---|---|---|
| Block/group structure | Fixed 32-element blocks | Configurable groups (typically 64) |
| Scale storage | FP16 scale per block (Q4_0) or hierarchical 6-bit scales (Q4_K) | FP16 scale + FP16 bias per group |
| Dequant | d * (q - zero_point) | scale * q + bias |
| Symmetry | Symmetric (Q4_0) or asymmetric (Q4_1) | Always asymmetric (has bias) |
| Overhead | 2 bytes per 32 elements (Q4_0) = 0.5 bpw | 4 bytes per 64 elements = 0.5 bpw |
| Quantization method | Post-training quantization by llama.cpp | Post-training quantization by MLX |
The practical quality difference between GGUF Q4_0 and MLX Q4 is small for most models. The larger group size in MLX (64 vs 32) means each scale covers more elements, which can be slightly worse for weight distributions with high local variance, but the asymmetric bias term partially compensates.
For weights that are uniformly distributed around zero (which is common after training), symmetric quantization (GGUF Q4_0) is slightly more efficient because it does not waste bits on the bias term. For weights with non-zero mean per group, asymmetric quantization (MLX Q4, GGUF Q4_1) is more accurate.
[^1]: K-quantization was introduced by ikawrakow in llama.cpp (2023). The “K” originally stood for “k-quant” with no specific expansion. The key insight is that spending a few extra bits on the scale parameters (6-bit sub-block scales plus an FP16 super-block scale) reduces the quantization error that would otherwise come from coarsely quantized scales.