SafeTensors and MLX Formats
GGUF is not the only game in town. The Hugging Face ecosystem overwhelmingly distributes models in the SafeTensors format, and Apple’s MLX framework has made SafeTensors the default for its quantized model exports. Akunu supports both GGUF and MLX-quantized SafeTensors as first-class citizens through its WeightProvider abstraction, which auto-detects the format and presents a unified interface to the rest of the engine.
This chapter digs into how SafeTensors files are structured, how MLX layers its quantization scheme on top of SafeTensors, and how Akunu’s SafeTensorsParser and MLXWeightStore classes handle the entire pipeline from raw bytes on disk to GPU-resident weight buffers.
The SafeTensors File Format
SafeTensors was designed by Hugging Face as a secure, zero-copy alternative to Python pickle-based formats like PyTorch’s .bin files.1 The design is deliberately simple: a file consists of exactly two parts.
+--------------------------------------------------+
| 8 bytes: header_len (little-endian uint64) |
+--------------------------------------------------+
| header_len bytes: UTF-8 JSON header |
| { |
| "tensor_name": { |
| "dtype": "F16", |
| "shape": [4096, 4096], |
| "data_offsets": [0, 33554432] |
| }, |
| "__metadata__": { |
| "format": "mlx", |
| "quantization_config": "..." |
| } |
| } |
+--------------------------------------------------+
| Tensor data (contiguous, aligned) |
| [tensor_0 bytes] [tensor_1 bytes] ... |
+--------------------------------------------------+
That is it. No nested containers, no variable-length integer encodings, no type-length-value gymnastics. The entire schema lives in a single JSON object that you can parse with any JSON library. Let us walk through each section.
The 8-Byte Length Prefix
The first 8 bytes of the file encode the size of the JSON header as a little-endian unsigned 64-bit integer. This tells you exactly how many bytes to read before the tensor data begins. In Akunu’s SafeTensorsParser::open():
// Read header length (little-endian u64)
uint64_t header_len = 0;
memcpy(&header_len, data_, 8);
data_offset_ = 8 + header_len;
The data_offset_ field marks where the raw tensor bytes start. Every tensor’s data_offsets field in the header is relative to this point – not to the start of the file. This is a common source of confusion when manually inspecting SafeTensors files with a hex editor.
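To make the prefix concrete, here is a minimal round-trip sketch of the 8-byte length field. The helper names (encode_prefix, data_offset) are illustrative, not Akunu's API; the decode side assumes a little-endian host, as Akunu's memcpy does.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Encode a header length as the 8-byte little-endian prefix.
std::vector<uint8_t> encode_prefix(uint64_t header_len) {
    std::vector<uint8_t> out(8);
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(header_len >> (8 * i));  // little-endian byte order
    return out;
}

// Decode the prefix and return the offset where tensor data begins.
// Assumes a little-endian host, so a plain memcpy suffices.
uint64_t data_offset(const uint8_t *file_start) {
    uint64_t header_len = 0;
    memcpy(&header_len, file_start, 8);
    return 8 + header_len;
}
```

A 122-byte header, for example, puts the first tensor byte at offset 130.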
The JSON Header
The header is a flat JSON object where each key is a tensor name and each value is a small descriptor:
| Field | Type | Description |
|---|---|---|
| dtype | string | Data type: "F16", "BF16", "F32", "U32", "I8" |
| shape | int array | Dimensions, e.g. [4096, 4096] |
| data_offsets | int array | [start, end] byte offsets from data section start |
There is also an optional __metadata__ key that carries arbitrary string key-value pairs. MLX uses this section to store quantization configuration, and Akunu reads it to detect quantization parameters.
Zero-Copy Access via mmap
Akunu never copies tensor data into heap-allocated buffers during parsing. The entire file is memory-mapped:
data_ = (const uint8_t *)mmap(nullptr, file_size_, PROT_READ,
                              MAP_PRIVATE, fd_, 0);
When you call tensor_data(), you get a direct pointer into the mmap’d region:
const void *tensor_data(const SafeTensorInfo& info) const {
    size_t offset = data_offset_ + info.data_start;
    return data_ + offset;
}
This means the kernel’s page fault handler brings tensor data into physical memory on demand, one page at a time. For large models with hundreds of tensors, this is significantly faster than read() calls, because you only touch the pages you actually need. On Apple Silicon with unified memory, these pages can be directly referenced by the GPU without any additional copy – though in practice Akunu does allocate Metal buffers and memcpy into them for format conversion reasons we will discuss shortly.
Akunu’s Minimal JSON Parser
You might notice that SafeTensorsParser includes a hand-rolled JSON parser rather than pulling in a library like nlohmann/json or simdjson. This is a deliberate choice. The SafeTensors header JSON is structurally simple – it is a flat object of objects, each with three well-known fields. The parser only needs to handle strings, integers, arrays of integers, and nested objects. Akunu’s parser handles this in about 100 lines of code:
parse_header()
|
+-- for each key in top-level object:
| if key == "__metadata__" -> parse_metadata()
| else -> parse dtype, shape, data_offsets into SafeTensorInfo
|
+-- skip_value() for any unknown fields
This avoids a dependency, keeps compile times low, and is fast enough that header parsing is never a bottleneck (even a 10MB header parses in under a millisecond).
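As a flavor of what such a parser needs, here is a sketch of one of its simplest jobs: reading a JSON integer array like [4096, 4096] for the shape field. The function name is illustrative, not Akunu's actual code.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>
#include <vector>

// Parse a JSON integer array such as "[4096, 4096]" starting at `pos`.
// This is the only array form the SafeTensors header requires, so no
// general-purpose JSON machinery is needed.
std::vector<long> parse_int_array(const std::string& s, size_t pos) {
    std::vector<long> out;
    while (pos < s.size() && s[pos] != ']') {
        if (isdigit((unsigned char)s[pos]) || s[pos] == '-') {
            char *end = nullptr;
            out.push_back(strtol(s.c_str() + pos, &end, 10));
            pos = end - s.c_str();   // jump past the number just parsed
        } else {
            pos++;                   // skip '[', ',' and whitespace
        }
    }
    return out;
}
```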
The MLX Weight Store
While SafeTensorsParser handles the raw file format, MLXWeightStore adds the intelligence layer: name mapping, config extraction, quantization detection, and data type conversion.
Opening an MLX Model
An MLX model is typically a directory containing:
model_directory/
model.safetensors <-- weights
config.json <-- architecture + quantization config
tokenizer.json <-- tokenizer (handled separately)
tokenizer_config.json <-- tokenizer settings
When you call MLXWeightStore::open(), it:
- Detects whether the path is a directory or a single .safetensors file
- Opens the SafeTensors file via the parser
- Parses config.json for model architecture and quantization info
- Builds the name mapping from MLX tensor names to Akunu’s canonical names
bool MLXWeightStore::open(const std::string& path) {
    struct stat st;
    if (stat(path.c_str(), &st) != 0)
        return false;
    std::string safetensors_path;
    if (S_ISDIR(st.st_mode)) {
        model_dir_ = path;
        safetensors_path = path + "/model.safetensors";
    } else {
        safetensors_path = path;
        model_dir_ = extract_directory(path);
    }
    if (!parser_.open(safetensors_path))
        return false;
    parse_config_json(model_dir_);
    build_name_mapping(config_.n_layers);
    return true;
}
Config Extraction from config.json
MLX models store their architecture configuration in a standard Hugging Face config.json. Akunu extracts the fields it needs using minimal JSON string search functions:
| config.json key | AkunuModelConfig field | Example value |
|---|---|---|
| model_type | architecture | "llama" |
| hidden_size | dim | 4096 |
| num_hidden_layers | n_layers | 32 |
| num_attention_heads | n_heads | 32 |
| num_key_value_heads | n_kv_heads | 8 |
| intermediate_size | ffn_dim | 11008 |
| vocab_size | vocab_size | 32000 |
| max_position_embeddings | max_seq_len | 4096 |
| rms_norm_eps | norm_eps | 1e-5 |
| rope_theta | rope_theta | 10000.0 |
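The string-search style of extraction described above might look like the following hedged sketch; find_int_field is an illustrative name, not Akunu's actual helper.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Extract the integer value following `"key":` from a JSON string, or
// return `fallback` if the key is absent. No full JSON parse is needed
// because config.json keys are unique at the top level.
long find_int_field(const std::string& json, const std::string& key,
                    long fallback) {
    std::string needle = "\"" + key + "\"";
    size_t pos = json.find(needle);
    if (pos == std::string::npos) return fallback;
    size_t colon = json.find(':', pos + needle.size());
    if (colon == std::string::npos) return fallback;
    // strtol skips leading whitespace after the colon
    return strtol(json.c_str() + colon + 1, nullptr, 10);
}
```

One caveat of this approach: a key that appears inside a nested object (such as quantization_config) will match too, so the caller must tolerate or avoid ambiguous key names.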
The quantization configuration is nested inside quantization_config or quantization:
{
"quantization_config": {
"bits": 4,
"group_size": 64
}
}
Akunu tries both key names because different MLX exporters use different conventions:
for (const char *qkey : {"\"quantization_config\"", "\"quantization\""}) {
    auto qpos = json.find(qkey);
    if (qpos != std::string::npos) {
        // extract bits and group_size
    }
}
Name Mapping: MLX to Canonical
MLX models use Hugging Face tensor naming conventions, while Akunu internally uses a simplified canonical naming scheme. The mapping is defined as a static table of rules:
| MLX Name Pattern | Akunu Canonical Name |
|---|---|
| model.embed_tokens.weight | token_embedding.weight |
| model.norm.weight | output_norm.weight |
| lm_head.weight | output.weight |
| model.layers.{n}.self_attn.q_proj.weight | layers.{n}.attention.q.weight |
| model.layers.{n}.self_attn.k_proj.weight | layers.{n}.attention.k.weight |
| model.layers.{n}.self_attn.v_proj.weight | layers.{n}.attention.v.weight |
| model.layers.{n}.self_attn.o_proj.weight | layers.{n}.attention.output.weight |
| model.layers.{n}.mlp.gate_proj.weight | layers.{n}.ffn.gate.weight |
| model.layers.{n}.mlp.up_proj.weight | layers.{n}.ffn.up.weight |
| model.layers.{n}.mlp.down_proj.weight | layers.{n}.ffn.down.weight |
| model.layers.{n}.input_layernorm.weight | layers.{n}.attention_norm.weight |
| model.layers.{n}.post_attention_layernorm.weight | layers.{n}.ffn_norm.weight |
The {n} placeholder is expanded for each layer during build_name_mapping(). The function iterates over all rules, expanding layer-indexed patterns for layers 0 through n_layers - 1, and only records a mapping if the tensor actually exists in the SafeTensors file:
void MLXWeightStore::build_name_mapping(int n_layers) {
    for (int r = 0; r < kNumMLXRules; r++) {
        // pattern and canonical are the MLX / Akunu name pair
        // from rule r of the static mapping table
        if (strstr(pattern, "{n}")) {
            for (int layer = 0; layer < n_layers; layer++) {
                std::string mlx_name = expand_rule(pattern, layer);
                if (parser_.find(mlx_name)) {
                    std::string can = expand_rule(canonical, layer);
                    name_map_[can] = mlx_name;
                }
            }
        } else {
            if (parser_.find(pattern))
                name_map_[canonical] = pattern;
        }
    }
}
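The expand_rule helper referenced above might look like this minimal sketch; Akunu's actual implementation may differ.

```cpp
#include <cassert>
#include <string>

// Substitute the layer index for the "{n}" placeholder in a rule pattern.
// Each pattern contains at most one placeholder, so a single replace is enough.
std::string expand_rule(const std::string& pattern, int layer) {
    std::string out = pattern;
    size_t pos = out.find("{n}");
    if (pos != std::string::npos)
        out.replace(pos, 3, std::to_string(layer));
    return out;
}
```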
This existence check is important because not all architectures have all tensors. For example, Qwen3 has QK-norm weights (q_norm.weight, k_norm.weight) that LLaMA does not. The mapping table includes rules for both, but only the ones that actually exist in the file get registered.
MLX Quantization: The Three-Tensor Pack
This is where things get interesting. When MLX exports a quantized model, it does not pack everything into a single blob like GGUF does. Instead, each quantized linear layer produces three separate tensors in the SafeTensors file:
model.layers.0.self_attn.q_proj.weight <- packed U32 integers
model.layers.0.self_attn.q_proj.scales <- F16 or BF16 scale factors
model.layers.0.self_attn.q_proj.biases <- F16 or BF16 zero-points
The Packed Weight Tensor
The weight tensor stores quantized values packed into 32-bit unsigned integers. For 4-bit quantization, each U32 holds 8 values (32 / 4 = 8). The tensor shape is [N, K_packed] where K_packed = K * bits / 32.
For a [4096, 4096] weight matrix quantized to 4-bit:
K_packed = 4096 * 4 / 32 = 512
weight tensor shape: [4096, 512] of U32
weight tensor size: 4096 * 512 * 4 = 8,388,608 bytes (8 MB)
vs. F16 original: 4096 * 4096 * 2 = 33,554,432 bytes (32 MB)
compression: 4x
Scales and Biases
The scale and bias tensors have shape [N, K / group_size]. With a typical group size of 64 and K = 4096, that is [4096, 64]. Each group of group_size quantized values shares one scale and one bias (zero-point).
The dequantization formula for a single value is:
value = scale * quantized_int + bias
This is an asymmetric affine quantization scheme – the bias term allows the quantization grid to be offset from zero, which can better represent distributions that are not centered at zero.2
GPU Buffer Layout
When Akunu loads a quantized tensor, it packs all three components into a single contiguous GPU buffer:
+---------------------------------------------+
| Packed U32 weights |
| (N * K * bits / 8 bytes) |
+---------------------------------------------+
| F16 scales |
| (N * K / group_size * 2 bytes) |
+---------------------------------------------+
| F16 biases |
| (N * K / group_size * 2 bytes) |
+---------------------------------------------+
This layout is what the MLX dequantization kernels expect. The kernel receives the total buffer and uses the weight_bytes parameter to find where the scales section begins:
scales_offset = weight_bytes
biases_offset = weight_bytes + n_scale_elements * 2
The loading code in load_quantized_tensor() handles this packing:
Buffer MLXWeightStore::load_quantized_tensor(const std::string& mlx_name) {
    // ... find weight, scales, biases tensors ...
    size_t total = w_bytes + s_elements * 2 + b_elements * 2;
    Buffer buf = device_.allocate(total);
    // Copy weights
    memcpy(buf.contents, w_data, w_bytes);
    // Copy/convert scales (BF16 -> F16 if needed)
    uint8_t *dst = (uint8_t *)buf.contents + w_bytes;
    // ... copy or convert scales ...
    // Copy/convert biases
    dst = (uint8_t *)buf.contents + w_bytes + s_elements * 2;
    // ... copy or convert biases ...
    return buf;
}
BF16 to F16 Conversion
A particularly tricky detail: MLX often stores scales and biases in BF16 (bfloat16) format, but Apple’s Metal shading language only gained native BF16 support with the M4 GPU family.3 For older GPUs (M1/M2/M3), Akunu must convert BF16 values to F16 on the fly during loading.
The conversion goes through F32 as an intermediate:
BF16 bits: [s][eeeeeeee][mmmmmmm] (1+8+7 = 16 bits)
F32 bits: [s][eeeeeeee][mmmmmmm 0000...] (1+8+23 = 32 bits)
Step 1: BF16 -> F32 (left-shift by 16)
Step 2: F32 -> F16 (hardware cast via __fp16)
In code:
const uint16_t *src = (const uint16_t *)s_data;
uint16_t *f16_dst = (uint16_t *)dst;
for (size_t i = 0; i < s_elements; i++) {
    uint32_t f32_bits = (uint32_t)src[i] << 16;  // BF16 -> F32
    float val;
    memcpy(&val, &f32_bits, 4);
    __fp16 h = (__fp16)val;                      // F32 -> F16
    memcpy(&f16_dst[i], &h, 2);
}
This is a lossy conversion. BF16 has 8 exponent bits and 7 mantissa bits, while F16 has 5 exponent bits and 10 mantissa bits. BF16 has wider dynamic range but less precision; F16 has narrower range but more precision. For scale/bias values in quantized models, this conversion is perfectly acceptable since these values are themselves approximations.4
The same conversion applies to raw (non-quantized) tensors. If a SafeTensors file contains BF16 tensors (common in models exported from PyTorch), Akunu converts them to F16 for Metal compatibility:
if (info->dtype == "BF16") {
    std::vector<uint16_t> f16(n_elements);
    const uint16_t *bf16 = (const uint16_t *)data;
    for (size_t i = 0; i < n_elements; i++) {
        // BF16 -> F32 (shift) -> F16 (hardware cast), exactly as above
        uint32_t f32_bits = (uint32_t)bf16[i] << 16;
        float val;
        memcpy(&val, &f32_bits, 4);
        __fp16 h = (__fp16)val;
        memcpy(&f16[i], &h, 2);
    }
    return device_.allocate(f16.data(), n_elements * 2);
}
Dynamic Quantization Detection
Akunu does not require the user to specify whether a model is quantized or what bit width it uses. The get_tensor() method dynamically detects quantization by looking for companion .scales tensors:
Buffer MLXWeightStore::get_tensor(const std::string& canonical_name) {
    const std::string& mlx_name = name_map_[canonical_name];
    // Check for a .scales companion tensor
    std::string scales_name = mlx_name;
    auto wpos = scales_name.rfind(".weight");
    if (wpos != std::string::npos)
        scales_name.replace(wpos, 7, ".scales");
    Buffer buf;
    if (quant_bits_ > 0 && parser_.find(scales_name)) {
        buf = load_quantized_tensor(mlx_name);
        // Set effective dtype: 99=Q3, 100=Q4, 102=Q6, 101=Q8
    } else {
        buf = load_raw_tensor(mlx_name);
        // Set effective dtype: 1 (F16)
    }
    return buf;
}
The effective dtype codes (99, 100, 101, 102) correspond to the MLX entries in Akunu’s DTypeDescriptor table, which maps each quantization format to the correct kernel names and dispatch geometry. This allows the same build_dispatch_table() code to work transparently with both GGUF and MLX models.
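As an illustration of this routing, the lookup might be sketched as below. The dtype codes match the text, but the struct fields and kernel names here are hypothetical, not Akunu's actual table.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// A DTypeDescriptor-style entry: one row per supported dtype code.
struct DTypeDescriptor {
    uint32_t code;              // effective dtype code
    const char *matvec_kernel;  // kernel name to dispatch for this dtype
};

static const DTypeDescriptor kDescriptors[] = {
    {1,   "matvec_f16"},    // raw F16 (hypothetical kernel name)
    {100, "mlx_qmv_4bit"},  // MLX 4-bit (hypothetical kernel name)
    {101, "mlx_qmv_8bit"},  // MLX 8-bit (hypothetical kernel name)
};

// Return the kernel name for a dtype code, or nullptr if unknown.
const char *kernel_for(uint32_t code) {
    for (const auto& d : kDescriptors)
        if (d.code == code) return d.matvec_kernel;
    return nullptr;
}
```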
Weight Fusion
For performance-critical paths like the fused QKV projection, Akunu can fuse multiple quantized weight matrices into a single buffer. The challenge with MLX’s three-tensor layout is that you cannot simply concatenate the raw buffers – the scales and biases from different tensors need to be grouped together:
Unfused (3 separate tensors, each with [W|S|B]):
Q: [Wq | Sq | Bq]
K: [Wk | Sk | Bk]
V: [Wv | Sv | Bv]
Fused (1 buffer):
[Wq | Wk | Wv | Sq | Sk | Sv | Bq | Bk | Bv]
The kernel expects all weights contiguous, then all scales contiguous, then all biases contiguous. The fuse_mlx_packed() function handles this rearrangement:
// Copy weights section
for each tensor:
    memcpy(fused + w_off, src, sec.w_bytes)
    w_off += sec.w_bytes
// Copy scales section
for each tensor:
    memcpy(fused + total_w + s_off*2, src + sec.w_bytes, sec.s_elements*2)
    s_off += sec.s_elements
// Copy biases section
for each tensor:
    memcpy(fused + total_w + total_s*2 + b_off*2,
           src + sec.w_bytes + sec.s_elements*2, sec.s_elements*2)
    b_off += sec.s_elements
This fusion happens once during model initialization. The fused buffer is cached and reused for every forward pass, so the cost is amortized.
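The rearrangement above can be sketched in full as a CPU version. The struct and function names are illustrative; Akunu's fuse_mlx_packed() operates on GPU buffers rather than std::vector.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// One tensor's packed [W|S|B] buffer plus its section sizes.
struct PackedSection {
    const uint8_t *data;
    size_t w_bytes;     // packed U32 weight bytes
    size_t s_elements;  // scale count (== bias count)
};

// Rearrange per-tensor [W|S|B] buffers into the fused [W..|S..|B..] layout.
std::vector<uint8_t> fuse_packed(const std::vector<PackedSection>& secs) {
    size_t total_w = 0, total_s = 0;
    for (const auto& s : secs) { total_w += s.w_bytes; total_s += s.s_elements; }
    // scales and biases are 2 bytes each, hence 4 * total_s
    std::vector<uint8_t> fused(total_w + 4 * total_s);
    size_t w_off = 0, s_off = 0, b_off = 0;
    for (const auto& s : secs) {
        memcpy(fused.data() + w_off, s.data, s.w_bytes);
        memcpy(fused.data() + total_w + s_off * 2,
               s.data + s.w_bytes, s.s_elements * 2);
        memcpy(fused.data() + total_w + total_s * 2 + b_off * 2,
               s.data + s.w_bytes + s.s_elements * 2, s.s_elements * 2);
        w_off += s.w_bytes;
        s_off += s.s_elements;
        b_off += s.s_elements;
    }
    return fused;
}
```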
The WeightProvider Abstraction
Above both WeightStore (GGUF) and MLXWeightStore sits the WeightProvider class, which provides a unified interface:
WeightProvider
|
+-- detect_format(path)
| directory or .safetensors -> MLX_SAFETENSORS
| otherwise -> GGUF
|
+-- get_tensor(canonical_name) -> Buffer
+-- get_dtype(canonical_name) -> uint32_t
+-- has_tensor(canonical_name) -> bool
+-- fuse_weights(a, b) -> Buffer
+-- fuse_weights(a, b, c) -> Buffer
+-- get_config() -> AkunuModelConfig
+-- get_metadata_string(key) -> string
The rest of the engine – build_dispatch_table(), the chain decoder, prefill – never touches MLXWeightStore or WeightStore directly. They go through WeightProvider, which delegates to the appropriate backend. This is what makes format support transparent: adding a new weight format (say, ONNX or TensorFlow SavedModel) would only require implementing a new backend class and adding a case to detect_format().
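A minimal sketch of the detection rule described above (illustrative, not Akunu's exact code):

```cpp
#include <cassert>
#include <string>
#include <sys/stat.h>

enum class WeightFormat { GGUF, MLX_SAFETENSORS };

// A directory or a .safetensors file is treated as an MLX model;
// anything else falls through to GGUF.
WeightFormat detect_format(const std::string& path) {
    struct stat st;
    if (stat(path.c_str(), &st) == 0 && S_ISDIR(st.st_mode))
        return WeightFormat::MLX_SAFETENSORS;
    const std::string ext = ".safetensors";
    if (path.size() >= ext.size() &&
        path.compare(path.size() - ext.size(), ext.size(), ext) == 0)
        return WeightFormat::MLX_SAFETENSORS;
    return WeightFormat::GGUF;
}
```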
Comparison: SafeTensors vs GGUF
Both formats serve the same purpose – storing model weights efficiently for inference – but they make fundamentally different trade-offs:
| Feature | SafeTensors | GGUF |
|---|---|---|
| Header format | JSON | Binary (type-length-value) |
| Metadata | JSON key-value in __metadata__ | Typed KV pairs (string, int, float, array) |
| Tensor descriptor | dtype + shape + byte offsets | dtype + dimensions + offset |
| Quantization | External (separate scales/biases tensors) | Internal (packed blocks with embedded scales) |
| Tokenizer | Separate tokenizer.json file | Embedded in GGUF metadata |
| Architecture config | Separate config.json | Embedded in GGUF metadata |
| Single file | No (directory of files) | Yes (everything in one .gguf) |
| Ecosystem | Hugging Face, MLX, PyTorch | llama.cpp, whisper.cpp, Akunu |
| Parse complexity | Very low (JSON + mmap) | Medium (binary format, many tensor types) |
| Zero-copy possible | Yes (mmap) | Yes (mmap) |
The key philosophical difference: GGUF is self-contained (one file has everything including tokenizer), while SafeTensors is modular (weights, config, and tokenizer are separate files in a directory). GGUF bakes quantization into the tensor format itself, while MLX/SafeTensors keeps quantization as a layer on top of standard data types.5
For Akunu, both approaches work. The WeightProvider abstraction means the choice is purely a matter of where you got your model from – Hugging Face models come as SafeTensors directories, llama.cpp quantized models come as GGUF files, and Akunu handles both identically from the engine’s perspective.
Buffer Caching
Both WeightStore and MLXWeightStore cache GPU buffers after first load. When get_tensor() is called for a tensor that has already been loaded, it returns the cached buffer immediately:
Buffer MLXWeightStore::get_tensor(const std::string& canonical_name) {
    auto cache_it = buffer_cache_.find(canonical_name);
    if (cache_it != buffer_cache_.end())
        return cache_it->second;
    // ... load and cache ...
    buffer_cache_[canonical_name] = buf;
    return buf;
}
Fused weight buffers are also cached under composite keys like "layers.0.attention.q.weight+layers.0.attention.k.weight+layers.0.attention.v.weight". This means the fusion rearrangement only happens once, and subsequent calls return the pre-fused buffer.
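Building such a composite key is straightforward; the helper name below is illustrative.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Join canonical tensor names with '+' to form a fused-buffer cache key,
// matching the composite-key convention described above.
std::string fuse_cache_key(const std::vector<std::string>& names) {
    std::string key;
    for (size_t i = 0; i < names.size(); i++) {
        if (i) key += "+";
        key += names[i];
    }
    return key;
}
```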
On close(), all cached buffers are freed through the Device:
void MLXWeightStore::close() {
    for (auto& [name, buf] : buffer_cache_)
        device_.free_buffer(buf);
    buffer_cache_.clear();
}
Summary
The SafeTensors/MLX weight pipeline in Akunu is a clean layered design:
SafeTensorsParser (raw file format)
|
MLXWeightStore (name mapping, quant detection,
| BF16 conversion, 3-tensor packing)
|
WeightProvider (unified GGUF/MLX interface)
|
build_dispatch_table() (format-agnostic)
Each layer has a single responsibility, and the abstractions are tight enough that the dispatch table builder genuinely does not know or care whether it is working with a GGUF Q4_0 model or an MLX 4-bit SafeTensors model. The dtype code (2 for Q4_0, 100 for MLX Q4) routes to the correct kernel through the DTypeDescriptor table, and the weight data arrives in the buffer layout that each kernel expects.
1. Hugging Face, “SafeTensors: A simple and safe way to store and distribute tensors,” 2023. The format was designed specifically to prevent arbitrary code execution vulnerabilities inherent in Python pickle deserialization. See https://github.com/huggingface/safetensors. ↩
2. Asymmetric quantization with a zero-point (bias) can represent the range [min, max] directly, while symmetric quantization forces the range to be [-max, max]. For activations and weights that are not centered at zero, asymmetric quantization wastes fewer quantization levels. ↩
3. Apple’s Metal Shading Language gained bfloat support with the Apple GPU Family 9 (M4). On M1–M3, BF16 textures and buffer reads are not supported in Metal shaders. ↩
4. The precision loss from BF16-to-F16 conversion in scale values is typically on the order of 0.1% relative error, which is negligible compared to the quantization error from the 4-bit weight compression itself. ↩
5. This difference has practical implications for tooling. GGUF files can be inspected with gguf-dump to see everything about a model. SafeTensors models require reading multiple files to get the full picture. On the other hand, SafeTensors files are trivially inspectable with any JSON-aware tool. ↩