The Weight Provider Abstraction
Alright, let’s talk about one of those pieces of engineering that doesn’t get enough credit: the weight provider. In most inference engines, you’ll find a hard coupling between the model loading code and whatever file format the weights come in. GGUF models go through one path, PyTorch checkpoints through another, SafeTensors through yet another. Each path has its own quirks, its own name mangling, its own way of handing you a tensor.
Akunu takes a different approach. It puts a clean abstraction layer – WeightProvider
– between the model code and the file format. The model doesn’t know or care whether
its weights came from a GGUF file or an MLX-formatted SafeTensors directory. It asks
for layers.3.attention.q.weight, and it gets back a GPU buffer. Period.
This chapter is about that abstraction: how it works, why it exists, and what makes it interesting from a systems design perspective.
The Problem: Two Worlds, One Interface
Let’s set the stage. Akunu needs to load weights from two very different ecosystems:
- GGUF files – The format popularized by llama.cpp. A single monolithic file containing all tensors, metadata, and quantization parameters. Tensor names follow the blk.{n}.attn_q.weight convention.
- MLX SafeTensors – Apple’s MLX framework exports models as a directory containing config.json plus one or more .safetensors files. Tensor names follow the HuggingFace model.layers.{n}.self_attn.q_proj.weight convention. Quantized models pack weights, scales, and biases as three separate tensors.
These formats differ in almost every dimension:
+-------------------+----------------------------+----------------------------+
| Dimension | GGUF | MLX SafeTensors |
+-------------------+----------------------------+----------------------------+
| File structure | Single .gguf file | Directory with config.json |
| | | + model.safetensors |
+-------------------+----------------------------+----------------------------+
| Metadata | KV pairs in binary header | JSON config.json |
+-------------------+----------------------------+----------------------------+
| Tensor names | blk.0.attn_q.weight | model.layers.0.self_attn. |
| | | q_proj.weight |
+-------------------+----------------------------+----------------------------+
| Quantization | Block-level (Q4_0, Q6_K) | Per-group with separate |
| | embedded in tensor data | .scales + .biases |
+-------------------+----------------------------+----------------------------+
| Data types | 30+ GGML types | F16, BF16, F32, U32, I8 |
+-------------------+----------------------------+----------------------------+
| Tensor data | Contiguous in data section | Contiguous after header |
+-------------------+----------------------------+----------------------------+
The model code doesn’t want to deal with any of this. It wants a canonical name, and
it wants bytes on the GPU. The WeightProvider is the bridge.
The Strategy Pattern in Action
Let’s look at the actual class definition from weight_provider.h:
class WeightProvider {
public:
enum Format { GGUF, MLX_SAFETENSORS };
WeightProvider(Device& device) : device_(device) {}
~WeightProvider() { close(); }
bool open(const std::string& path);
void close();
Format format() const { return format_; }
AkunuModelConfig get_config() const;
Buffer get_tensor(const std::string& name);
uint32_t get_dtype(const std::string& name) const;
bool has_tensor(const std::string& name) const;
// Metadata access
std::string get_metadata_string(const std::string& key) const;
int64_t get_metadata_int(const std::string& key, int64_t def = 0) const;
float get_metadata_float(const std::string& key, float def = 0.0f) const;
std::vector<std::string> get_string_array(const std::string& key) const;
std::vector<float> get_float_array(const std::string& key) const;
// Tensor listing
int tensor_count() const;
std::string tensor_name_at(int index) const;
// MLX quantization info
int quant_bits() const;
int quant_group_size() const;
// Weight fusion
Buffer fuse_weights(const std::string& a, const std::string& b);
Buffer fuse_weights(const std::string& a, const std::string& b,
const std::string& c);
private:
Device& device_;
Format format_ = GGUF;
std::unique_ptr<WeightStore> gguf_;
std::unique_ptr<MLXWeightStore> mlx_;
static Format detect_format(const std::string& path);
};
If you squint, this is a textbook Strategy pattern. The WeightProvider holds a
pointer to one of two concrete implementations – WeightStore (for GGUF) or
MLXWeightStore (for SafeTensors) – and delegates every operation to whichever
one is active. But it’s not using virtual functions and inheritance. Instead, it
uses a simpler discriminated-union approach: an enum plus two unique_ptrs.
Why not virtual dispatch? Probably because there are only two backends and the
method set is well-defined. A vtable adds indirection for no real benefit here.
The explicit delegation is clear, debuggable, and has zero overhead beyond a
branch prediction that will always be correct (since the format doesn’t change
after open()).
Here’s the delegation pattern for get_tensor():
Buffer get_tensor(const std::string& name) {
return (format_ == MLX_SAFETENSORS)
? mlx_->get_tensor(name)
: gguf_->get_tensor(name);
}
Every method follows this exact pattern. Clean, predictable, no surprises.
Format Detection: Simpler Than You’d Think
When you call open(), the first thing that happens is format detection. And it’s
refreshingly simple:
static Format detect_format(const std::string& path) {
// Directory or .safetensors -> MLX
struct stat st;
if (stat(path.c_str(), &st) == 0 && S_ISDIR(st.st_mode))
return MLX_SAFETENSORS;
if (path.size() > 12 &&
path.substr(path.size() - 12) == ".safetensors")
return MLX_SAFETENSORS;
return GGUF;
}
That’s it. Two checks (plus a default):
- Is it a directory? Then it’s an MLX model directory (containing config.json and model.safetensors).
- Does the filename end in .safetensors? Same conclusion.
- Everything else? GGUF.
No magic number sniffing, no content-based detection. This works because in
practice, users either point at a .gguf file or an MLX model directory. The
heuristic is simple, fast, and correct for the actual use cases.
The flow looks like this:
open("/path/to/model")
|
v
+--------------------+
| detect_format() |
+--------------------+
/ \
Directory or Everything
.safetensors? else
| |
v v
+----------------+ +----------------+
| MLXWeightStore | | WeightStore |
| .open() | | .open() |
+----------------+ +----------------+
| |
v v
Parse config.json Parse GGUF header
Open .safetensors mmap entire file
Build name map Build name map
Once the backend is created and opened, the WeightProvider is ready. All
subsequent calls go through the chosen backend.
The Canonical Name System
This is one of the most important design decisions in the weight system, and it’s worth understanding in detail. Different model formats use different naming conventions for the same tensors:
GGUF Convention MLX Convention
---------------- ----------------
token_embd.weight <---> model.embed_tokens.weight
blk.0.attn_q.weight <---> model.layers.0.self_attn.q_proj.weight
blk.0.ffn_gate.weight <---> model.layers.0.mlp.gate_proj.weight
output_norm.weight <---> model.norm.weight
Neither of these is what akunu’s model code uses. Instead, akunu defines its own canonical naming scheme:
Canonical Name Purpose
--------------------------------- ---------------------------
token_embedding.weight Token embedding matrix
layers.{n}.attention.q.weight Q projection, layer n
layers.{n}.attention.k.weight K projection, layer n
layers.{n}.attention.v.weight V projection, layer n
layers.{n}.attention.output.weight Output projection, layer n
layers.{n}.ffn.gate.weight SwiGLU gate projection
layers.{n}.ffn.up.weight SwiGLU up projection
layers.{n}.ffn.down.weight Down projection
layers.{n}.attention_norm.weight Pre-attention RMSNorm
layers.{n}.ffn_norm.weight Pre-FFN RMSNorm
output_norm.weight Final RMSNorm
output.weight LM head
Each backend maintains a bidirectional mapping between its format-specific names
and these canonical names. When the model code asks for
layers.3.attention.q.weight, the GGUF backend translates that to
blk.3.attn_q.weight, and the MLX backend translates it to
model.layers.3.self_attn.q_proj.weight.
The mapping is built at load time. Here’s the GGUF side, from weight_store.cpp:
static const struct {
const char *gguf;
const char *canonical;
} kBaseRules[] = {
{"token_embd.weight", "token_embedding.weight"},
{"blk.{n}.attn_q.weight", "layers.{n}.attention.q.weight"},
{"blk.{n}.attn_k.weight", "layers.{n}.attention.k.weight"},
{"blk.{n}.attn_v.weight", "layers.{n}.attention.v.weight"},
{"blk.{n}.attn_output.weight", "layers.{n}.attention.output.weight"},
{"blk.{n}.attn_norm.weight", "layers.{n}.attention_norm.weight"},
{"blk.{n}.ffn_gate.weight", "layers.{n}.ffn.gate.weight"},
{"blk.{n}.ffn_up.weight", "layers.{n}.ffn.up.weight"},
{"blk.{n}.ffn_down.weight", "layers.{n}.ffn.down.weight"},
{"blk.{n}.ffn_norm.weight", "layers.{n}.ffn_norm.weight"},
// ... plus bias tensors, QK-norm, Gemma post-norms, etc.
};
And the MLX side, from mlx_weight_store.h:
static const MLXNameRule kMLXRules[] = {
{"model.embed_tokens.weight", "token_embedding.weight"},
{"model.norm.weight", "output_norm.weight"},
{"lm_head.weight", "output.weight"},
{"model.layers.{n}.self_attn.q_proj.weight",
"layers.{n}.attention.q.weight"},
{"model.layers.{n}.mlp.gate_proj.weight",
"layers.{n}.ffn.gate.weight"},
// ... and so on
};
The {n} placeholder is a neat trick. During build_name_mapping(), each rule
is expanded for every layer in the model:
void WeightStore::build_name_mapping() {
    int n_layers = get_metadata_int(arch + ".block_count", 0);
    for (int r = 0; r < kNumBaseRules; r++) {
        const char* pattern   = kBaseRules[r].gguf;
        const char* canonical = kBaseRules[r].canonical;
        if (strstr(pattern, "{n}")) {
            // Layer-indexed rule: expand once per layer
            for (int layer = 0; layer < n_layers; layer++) {
                std::string gguf_name = expand_rule(pattern, layer);
                if (gguf_get_tensor(gguf_, gguf_name.c_str()))
                    name_map_[expand_rule(canonical, layer)] = gguf_name;
            }
        } else if (gguf_get_tensor(gguf_, pattern)) {
            name_map_[canonical] = pattern;
        }
    }
}
The existence check (gguf_get_tensor()) is important. Not every model has every
tensor. Some models have QK-norm weights, some don’t. Some have bias tensors, most
don’t. By checking for existence, the mapping only includes tensors that are
actually present in the file.
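The mapping code above leans on an expand_rule() helper that isn’t shown. A minimal sketch of what it plausibly does – the real implementation in weight_store.cpp may differ – is a straight {n} substitution:

```cpp
#include <string>

// Hedged sketch: replace the "{n}" placeholder in a rule pattern with a
// concrete layer index. Patterns without "{n}" pass through unchanged.
static std::string expand_rule(const std::string& pattern, int layer) {
    std::string out = pattern;
    size_t pos = out.find("{n}");
    if (pos != std::string::npos)
        out.replace(pos, 3, std::to_string(layer));
    return out;
}
```

So expand_rule("blk.{n}.attn_q.weight", 3) yields "blk.3.attn_q.weight", which is exactly the lookup key the existence check needs.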
The Data Flow: From File to GPU
Let’s trace the complete path of a weight tensor from disk to GPU. We’ll use the GGUF path since it’s more straightforward, but the MLX path follows the same high-level structure.
Model code calls:
provider.get_tensor("layers.5.ffn.gate.weight")
|
v
WeightProvider delegates to WeightStore
|
v
WeightStore::get_tensor("layers.5.ffn.gate.weight")
|
+-- Check buffer_cache_ (hit? return cached buffer)
|
+-- Lookup in name_map_:
| "layers.5.ffn.gate.weight" -> "blk.5.ffn_gate.weight"
|
+-- load_tensor_raw("blk.5.ffn_gate.weight")
|
+-- gguf_get_tensor(gguf_, "blk.5.ffn_gate.weight")
| Returns: GGUFTensorInfo { dtype=Q4_0, offset=0x1234,
| n_elements=14336*4096 }
|
+-- gguf_tensor_data(gguf_, info)
| Returns: pointer into mmap'd region (zero-copy!)
|
+-- Compute byte size from dtype:
| Q4_0: n_elements / 32 * 18 bytes
|
+-- dtype == F32 or BF16? Convert to F16
| Otherwise: direct copy to GPU
|
+-- device_.allocate(data, bytes)
Returns: Buffer { handle, size, contents }
A few things to note:
mmap is doing the heavy lifting. The GGUF parser memory-maps the entire file. When we need tensor data, we just compute a pointer into the mapped region. There’s no explicit read, no buffer allocation for the raw data. The OS handles paging in the data on demand. This means opening a 4GB model file is nearly instantaneous – the actual I/O happens lazily when we first touch each tensor’s bytes.
Lazy loading with caching. Tensors are loaded on first access and cached in
buffer_cache_. Once a tensor is on the GPU, subsequent requests for the same
tensor return the cached buffer. This is important because during inference, the
same weights are used on every forward pass.
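The cache-on-first-access pattern is simple enough to sketch in a few lines. This is an illustrative stand-in, not akunu’s actual code – Buffer here is simplified and load_and_upload() is a hypothetical placeholder for the mmap-read-plus-GPU-upload path:

```cpp
#include <cstddef>
#include <map>
#include <string>

// Illustrative sketch of lazy loading with caching. Buffer and
// load_and_upload() are stand-ins, not akunu's real types.
struct Buffer { void* handle; size_t size; };

struct TensorCache {
    std::map<std::string, Buffer> buffer_cache_;
    int uploads = 0;  // instrumentation for this sketch only

    Buffer load_and_upload(const std::string&) {
        uploads++;                       // pretend: mmap read + GPU upload
        return Buffer{nullptr, 0};
    }

    Buffer get_tensor(const std::string& name) {
        auto it = buffer_cache_.find(name);
        if (it != buffer_cache_.end())
            return it->second;           // hit: buffer already on the GPU
        Buffer b = load_and_upload(name);
        buffer_cache_[name] = b;         // first access: load, upload, cache
        return b;
    }
};
```

Request the same tensor a thousand times during decoding and the expensive path runs exactly once.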
Format conversion at load time. F32 and BF16 tensors are converted to F16 during loading. The GPU kernels expect F16, so this conversion happens exactly once. For quantized types (Q4_0, Q6_K, etc.), the data is copied as-is – the dequantization happens in the compute kernels.
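As a concrete illustration of that conversion step: BF16 widens to F32 trivially, because BF16 is just the top 16 bits of an IEEE-754 binary32; the F32 value is then narrowed to F16. This is a generic sketch of the widening half, not akunu’s actual routine:

```cpp
#include <cstdint>
#include <cstring>

// Generic sketch (not akunu's code): a BF16 value is the high 16 bits of
// an IEEE-754 float, so widening is a shift plus a bit-cast.
static float bf16_to_f32(uint16_t bf16) {
    uint32_t bits = static_cast<uint32_t>(bf16) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```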
The byte-size computation for quantized types is a lookup that maps dtype to a formula based on block size:
Type Block Size Bytes/Block Formula
------ ---------- ----------- -------------------------
Q4_0 32 elements 18 bytes n / 32 * 18
Q4_1 32 elements 20 bytes n / 32 * 20
Q5_0 32 elements 22 bytes n / 32 * 22
Q8_0 32 elements 34 bytes n / 32 * 34
Q2_K 256 elements 84 bytes n / 256 * 84
Q3_K 256 elements 110 bytes n / 256 * 110
Q4_K 256 elements 144 bytes n / 256 * 144
Q5_K 256 elements 176 bytes n / 256 * 176
Q6_K 256 elements 210 bytes n / 256 * 210
Q8_K 256 elements 292 bytes n / 256 * 292
We’ll cover the details of these formats in the quantization chapter.
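The whole table reduces to a single formula once you know the block geometry. A hedged sketch – the values mirror the table above, but the struct and function names are illustrative, not akunu’s:

```cpp
#include <cstddef>

// Sketch: byte size of a quantized tensor from its block geometry.
// Names are illustrative; values come from the table above.
struct BlockSpec { size_t block_elems; size_t block_bytes; };

constexpr BlockSpec kQ4_0{32, 18};
constexpr BlockSpec kQ6_K{256, 210};

static size_t quantized_bytes(BlockSpec spec, size_t n_elements) {
    // n_elements is a multiple of the block size for any valid tensor
    return n_elements / spec.block_elems * spec.block_bytes;
}
```

For example, a 4096-element Q4_0 tensor occupies 4096 / 32 * 18 = 2304 bytes.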
Weight Fusion: Gate+Up and Q+K+V
One of the most performance-critical operations in the weight provider is weight fusion. The idea is simple: instead of doing two (or three) separate matrix multiplications and then combining the results, we concatenate the weight matrices and do a single, larger matmul.
For SwiGLU-based FFN layers, the gate and up projections can be fused:
Before fusion (2 matmuls):
gate_out = x @ gate_weight (dim -> ffn_dim)
up_out = x @ up_weight (dim -> ffn_dim)
After fusion (1 matmul):
fused_out = x @ [gate_weight; up_weight] (dim -> 2*ffn_dim)
gate_out = fused_out[:ffn_dim]
up_out = fused_out[ffn_dim:]
Similarly, Q, K, and V projections can be fused when they share the same input:
Before fusion (3 matmuls):
q = x @ q_weight (dim -> q_dim)
k = x @ k_weight (dim -> kv_dim)
v = x @ v_weight (dim -> kv_dim)
After fusion (1 matmul):
fused = x @ [q_weight; k_weight; v_weight] (dim -> q_dim+2*kv_dim)
The fuse_weights() methods handle this concatenation. For GGUF, it’s
straightforward – just concatenate the raw bytes:
Buffer WeightStore::fuse_weights(const std::string& name_a,
                                 const std::string& name_b) {
    std::string key = name_a + "+" + name_b;
    auto it = fused_cache_.find(key);
    if (it != fused_cache_.end()) return it->second;
    Buffer a = get_tensor(name_a);
    Buffer b = get_tensor(name_b);
    size_t total = a.size + b.size;
    Buffer fused = device_.allocate(total);
    memcpy(fused.contents, a.contents, a.size);
    memcpy((char*)fused.contents + a.size, b.contents, b.size);
    fused_cache_[key] = fused;
    return fused;
}
For MLX quantized models, fusion is more involved because each tensor is actually
a packed triple of [weights | scales | biases]. You can’t just concatenate the
whole buffers – you need to concatenate each section separately:
Input buffers (each is a packed triple):
Tensor A: [ A_weights | A_scales | A_biases ]
Tensor B: [ B_weights | B_scales | B_biases ]
Fused output:
[ A_weights | B_weights | A_scales | B_scales | A_biases | B_biases ]
|<--- all weights --->|<--- all scales --->|<--- all biases --->|
This layout is critical for the GPU kernel, which expects to find all weights
contiguous, then all scales contiguous, then all biases contiguous. The
fuse_mlx_packed() helper function handles this three-way interleaving.
Here’s the ASCII diagram of the full fusion pipeline for three tensors (Q+K+V):
Q buffer: [ Q_w (N_q*K_packed*4 bytes) | Q_s (N_q*K/gs*2) | Q_b (N_q*K/gs*2) ]
K buffer: [ K_w (N_k*K_packed*4 bytes) | K_s (N_k*K/gs*2) | K_b (N_k*K/gs*2) ]
V buffer: [ V_w (N_v*K_packed*4 bytes) | V_s (N_v*K/gs*2) | V_b (N_v*K/gs*2) ]
|
v
fuse_mlx_packed()
|
v
Fused: [ Q_w | K_w | V_w | Q_s | K_s | V_s | Q_b | K_b | V_b ]
|<-- total_w --->| |<-- total_s*2 -->| |<-- total_s*2 -->|
The fusion result is also cached (keyed by the concatenation of canonical names), so subsequent forward passes reuse the fused buffer.
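To make the section-wise concatenation concrete, here is a layout-only sketch of the idea behind fuse_mlx_packed() for two tensors. Byte vectors stand in for the real GPU buffers, and the names are illustrative – the actual helper works directly on packed device buffers:

```cpp
#include <cstdint>
#include <vector>

// Layout-only sketch of section-wise fusion: all weights first, then all
// scales, then all biases. Stand-in types, not akunu's real buffers.
struct Packed { std::vector<uint8_t> w, s, b; };

static std::vector<uint8_t> fuse_packed(const Packed& a, const Packed& b) {
    std::vector<uint8_t> out;
    out.reserve(a.w.size() + b.w.size() + a.s.size() + b.s.size() +
                a.b.size() + b.b.size());
    auto append = [&out](const std::vector<uint8_t>& v) {
        out.insert(out.end(), v.begin(), v.end());
    };
    append(a.w); append(b.w);  // all weights, contiguous
    append(a.s); append(b.s);  // all scales, contiguous
    append(a.b); append(b.b);  // all biases, contiguous
    return out;
}
```

Note what this is not: a simple end-to-end concatenation of the two buffers, which would leave A’s scales sitting between the two weight sections.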
Config Extraction: Two Paths to the Same Struct
The get_config() method returns an AkunuModelConfig struct regardless of the
source format. But the two backends extract this config very differently.
GGUF path: Config lives in the binary metadata KV pairs. Keys are prefixed with the architecture name:
Key Example Value
----------------------------------------- ------------
general.architecture "llama"
llama.embedding_length 4096
llama.block_count 32
llama.attention.head_count 32
llama.attention.head_count_kv 8
llama.feed_forward_length 14336
llama.context_length 8192
llama.rope.freq_base 500000.0
llama.attention.layer_norm_rms_epsilon 1e-5
The WeightStore::get_config() method tries architecture-prefixed keys first,
then falls back to unqualified keys. This is because some GGUF files use
llama.block_count while others use just block_count.
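The fallback is a two-probe lookup. A hedged sketch, with a plain std::map standing in for the GGUF metadata KV store:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Sketch of the prefixed-then-unqualified key fallback described above.
// A std::map stands in for the GGUF metadata KV store.
static int64_t config_int(const std::map<std::string, int64_t>& kv,
                          const std::string& arch, const std::string& key,
                          int64_t def) {
    auto it = kv.find(arch + "." + key);   // try "llama.block_count" first
    if (it == kv.end())
        it = kv.find(key);                 // fall back to "block_count"
    return it != kv.end() ? it->second : def;
}
```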
MLX path: Config lives in config.json, a standard HuggingFace config file:
{
"model_type": "llama",
"hidden_size": 4096,
"num_hidden_layers": 32,
"num_attention_heads": 32,
"num_key_value_heads": 8,
"intermediate_size": 14336,
"max_position_embeddings": 8192,
"rope_theta": 500000.0,
"rms_norm_eps": 1e-5,
"quantization_config": {
"bits": 4,
"group_size": 64
}
}
The MLXWeightStore::parse_config_json() method uses a minimal JSON parser
(hand-rolled, no dependencies) to extract these values. Note the different key
names: HuggingFace uses hidden_size where GGUF uses embedding_length, and
num_hidden_layers where GGUF uses block_count.
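For flavor, here is the kind of dependency-free extraction a hand-rolled parser can get away with on a well-formed config.json. This is an illustrative sketch only – not the actual parse_config_json() code, and not robust general-purpose JSON parsing:

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

// Illustrative sketch only: locate "key" in the JSON text and read the
// number after the colon. Adequate for well-formed configs, nothing more.
static int64_t json_int(const std::string& json, const std::string& key,
                        int64_t def) {
    std::string needle = "\"" + key + "\"";
    size_t pos = json.find(needle);
    if (pos == std::string::npos) return def;
    pos = json.find(':', pos + needle.size());
    if (pos == std::string::npos) return def;
    return std::strtoll(json.c_str() + pos + 1, nullptr, 10);
}
```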
Both paths populate the same AkunuModelConfig struct:
typedef struct {
uint32_t dim; // embedding_length / hidden_size
uint32_t n_layers; // block_count / num_hidden_layers
uint32_t n_heads; // head_count / num_attention_heads
uint32_t n_kv_heads; // head_count_kv / num_key_value_heads
uint32_t head_dim; // explicit or dim/n_heads
uint32_t q_dim; // n_heads * head_dim
uint32_t kv_dim; // n_kv_heads * head_dim
uint32_t ffn_dim; // feed_forward_length / intermediate_size
uint32_t vocab_size;
uint32_t max_seq_len;
float norm_eps;
float rope_theta;
uint32_t sliding_window_pattern;
float rope_local_theta;
char architecture[32];
// ... encoder fields for Whisper
} AkunuModelConfig;
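The struct comments hint at how the derived fields fall out of the primary ones. Sketched below, with a worked Llama-style example; the helper name and signature are hypothetical:

```cpp
#include <cstdint>

// Worked sketch of the derived-dimension arithmetic implied by the struct
// comments: head_dim defaults to dim / n_heads when not given explicitly.
struct Dims { uint32_t head_dim, q_dim, kv_dim; };

static Dims derive_dims(uint32_t dim, uint32_t n_heads, uint32_t n_kv_heads,
                        uint32_t explicit_head_dim /* 0 if absent */) {
    Dims d;
    d.head_dim = explicit_head_dim ? explicit_head_dim : dim / n_heads;
    d.q_dim    = n_heads    * d.head_dim;
    d.kv_dim   = n_kv_heads * d.head_dim;
    return d;
}
```

With dim=4096, 32 heads, and 8 KV heads, this gives head_dim=128, q_dim=4096, and kv_dim=1024 – the grouped-query-attention shape the fusion code relies on.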
Architecture-Specific Handling
The weight system isn’t just a dumb loader. It has architecture-specific logic for several model families.
Whisper
Whisper is an encoder-decoder model, which means it has two sets of layers with
completely different tensor names. The GGUF backend has a separate rule table
(kWhisperRules) with entries for both encoder and decoder:
GGUF Name Canonical Name
---------------------------------------- --------------------------------
encoder.conv1.weight enc.conv1.weight
encoder.blocks.0.attn.query.weight enc.layers.0.attn.q.weight
decoder.blocks.0.attn.query.weight layers.0.attention.q.weight
decoder.blocks.0.cross_attn.query.weight layers.0.cross_attn.q.weight
The config extraction also handles Whisper specially, populating the encoder-specific fields (enc_n_layers, enc_n_heads, n_mels, etc.) from whisper.encoder.* metadata keys.
Gemma 3
Gemma 3 uses a sliding window attention pattern where every 6th layer has
global attention and the rest use local/sliding-window attention. The config
stores this as sliding_window_pattern = 6 and rope_local_theta = 10000.0.
Both GGUF and MLX backends detect this pattern when they see the gemma
architecture with a non-zero sliding window size.
QK-Norm (Qwen3)
Some newer models like Qwen3 add separate RMSNorm layers for Q and K projections before the attention computation. Both rule tables include mappings for these:
GGUF: blk.{n}.attn_q_norm.weight -> layers.{n}.attention.q_norm.weight
MLX: model.layers.{n}.self_attn.q_norm.weight -> (same canonical)
The Caching Architecture
Let’s look at the complete caching picture across the system:
+--------------------------------------------------+
| WeightProvider |
+--------------------------------------------------+
| |
v v
+----------------------+ +----------------------+
| WeightStore | | MLXWeightStore |
| (GGUF backend) | | (SafeTensors) |
+----------------------+ +----------------------+
| buffer_cache_: | | buffer_cache_: |
| canonical -> GPU | | canonical -> GPU |
| fused_cache_: | | "a+b" -> GPU |
| "a+b" -> GPU | | "a+b+c" -> GPU |
+----------------------+ +----------------------+
| |
v v
+-------------------+ +-------------------+
| mmap'd GGUF | | mmap'd .safe- |
| (OS page cache) | | tensors file |
+-------------------+ +-------------------+
There are effectively three levels of caching:
- OS page cache: The mmap’d files are backed by the OS page cache. First access to a tensor’s bytes triggers a page fault and disk read. Subsequent accesses are served from RAM.
- GPU buffer cache: Once a tensor is uploaded to the GPU (via device_.allocate()), the result is cached in buffer_cache_. The model never re-uploads a tensor.
- Fused buffer cache: Fused weight combinations are cached separately in fused_cache_ (GGUF) or in buffer_cache_ with composite keys like "layers.0.ffn.gate.weight+layers.0.ffn.up.weight" (MLX).
This means the steady-state memory picture during inference is: the original file is mmap’d but mostly paged out (the OS will reclaim those pages under memory pressure), and the weights live in GPU-accessible buffers.
Metadata Access: A Leaky Abstraction?
One area where the abstraction gets a bit leaky is metadata access. The
get_metadata_string(), get_metadata_int(), and get_metadata_float()
methods are primarily used for GGUF metadata (which is rich and structured).
The MLX backend’s implementations are stubs that return defaults:
// mlx_weight_store.cpp
std::string MLXWeightStore::get_metadata_string(const std::string&) const {
return "";
}
int64_t MLXWeightStore::get_metadata_int(const std::string&, int64_t def) const {
return def;
}
This makes sense when you think about it. GGUF metadata contains everything:
model architecture, tokenizer vocabulary, RoPE parameters, you name it. MLX
models store their config in config.json (parsed at open time into
AkunuModelConfig) and their tokenizer in a separate tokenizer.json file.
The metadata methods exist because the tokenizer system needs access to GGUF’s
embedded vocabulary arrays (tokenizer.ggml.tokens, tokenizer.ggml.scores).
For MLX models, the tokenizer is loaded from tokenizer.json through a
completely separate path, so these methods are never called.
Is this a code smell? Maybe. But it’s a pragmatic choice. Adding a separate tokenizer abstraction just to avoid empty stubs would be over-engineering.
Tensor Inspection: The Debug Interface
The tensor_count(), tensor_name_at(), and tensor_raw_dtype() methods
form a debug/inspection interface. These are used by akunu’s model inspection
tool to list all tensors in a weight file without loading them onto the GPU:
Index Name DType Elements
----- ---------------------------------------- --------- -----------
0 token_embd.weight Q8_0 524288000
1 blk.0.attn_q.weight Q4_K 16777216
2 blk.0.attn_k.weight Q4_K 4194304
...
199 output.weight Q8_0 524288000
This is purely for human consumption. The model code never uses these methods.
The Buffer Type
Throughout this chapter, we’ve been passing around Buffer objects. Let’s
clarify what this actually is:
struct Buffer {
void* handle; // Opaque GPU handle (MTLBuffer* on Metal)
size_t size; // Buffer size in bytes
void* contents; // CPU-accessible pointer (shared memory on Apple Silicon)
};
On Apple Silicon, GPU and CPU share the same physical memory, so contents
points directly to the GPU buffer’s storage. This means memcpy into
contents is the same as uploading to the GPU – there’s no separate DMA
transfer step. This is what makes the weight loading so fast: mmap the file,
memcpy from the mmap into the GPU buffer, done.
The {nullptr, 0, nullptr} triple is used as a sentinel for “not found” or
“error”. You’ll see this returned throughout the code as the failure case.
Putting It All Together
Let’s trace a complete model load from the application’s perspective:
Application WeightProvider Backend
----------- -------------- -------
provider.open("/path/to/model")
|
+----> detect_format() -----> "Dir or .safetensors?" --> MLX
|                             "Everything else"      --> GGUF
|
+----> backend.open("/path/to/model")
| |
| +-- mmap file / parse header
| +-- extract metadata / config.json
| +-- build_name_mapping()
|
+----> provider.get_config()
| |
| +-- backend.get_config() -> AkunuModelConfig
|
+----> For each layer:
| provider.fuse_weights(
| "layers.N.attention.q.weight",
| "layers.N.attention.k.weight",
| "layers.N.attention.v.weight")
| |
| +-- get_tensor() x3 (load each, cache)
| +-- concatenate (format-specific)
| +-- cache fused result
|
+----> provider.fuse_weights(
| "layers.N.ffn.gate.weight",
| "layers.N.ffn.up.weight")
|
+----> provider.get_tensor("layers.N.ffn.down.weight")
|
+----> (inference begins, all weights cached on GPU)
After the initial load, no further disk I/O happens. Every weight access is a hash table lookup returning a cached GPU buffer. The model code is completely format-agnostic – it works with canonical names and doesn’t know whether the underlying data came from a quantized GGUF, a full-precision SafeTensors file, or an MLX 4-bit quantized directory.
That’s the weight provider. Not flashy, not complicated, but it cleanly decouples two very different ecosystems from the model code that uses them. In the next chapters, we’ll dive deep into the formats themselves – starting with GGUF.