
The Weight Provider Abstraction

Alright, let’s talk about one of those pieces of engineering that doesn’t get enough credit: the weight provider. In most inference engines, you’ll find a hard coupling between the model loading code and whatever file format the weights come in. GGUF models go through one path, PyTorch checkpoints through another, SafeTensors through yet another. Each path has its own quirks, its own name mangling, its own way of handing you a tensor.

Akunu takes a different approach. It puts a clean abstraction layer – WeightProvider – between the model code and the file format. The model doesn’t know or care whether its weights came from a GGUF file or an MLX-formatted SafeTensors directory. It asks for layers.3.attention.q.weight, and it gets back a GPU buffer. Period.

This chapter is about that abstraction: how it works, why it exists, and what makes it interesting from a systems design perspective.

The Problem: Two Worlds, One Interface

Let’s set the stage. Akunu needs to load weights from two very different ecosystems:

  1. GGUF files – The format popularized by llama.cpp. A single monolithic file containing all tensors, metadata, and quantization parameters. Tensor names follow the blk.{n}.attn_q.weight convention.

  2. MLX SafeTensors – Apple’s MLX framework exports models as a directory containing config.json plus one or more .safetensors files. Tensor names follow the HuggingFace model.layers.{n}.self_attn.q_proj.weight convention. Quantized models pack weights, scales, and biases as three separate tensors.

These formats differ in almost every dimension:

+-------------------+----------------------------+----------------------------+
|   Dimension       |       GGUF                 |     MLX SafeTensors        |
+-------------------+----------------------------+----------------------------+
| File structure    | Single .gguf file          | Directory with config.json |
|                   |                            | + model.safetensors        |
+-------------------+----------------------------+----------------------------+
| Metadata          | KV pairs in binary header  | JSON config.json           |
+-------------------+----------------------------+----------------------------+
| Tensor names      | blk.0.attn_q.weight        | model.layers.0.self_attn.  |
|                   |                            |   q_proj.weight            |
+-------------------+----------------------------+----------------------------+
| Quantization      | Block-level (Q4_0, Q6_K)   | Per-group with separate    |
|                   | embedded in tensor data    |   .scales + .biases        |
+-------------------+----------------------------+----------------------------+
| Data types        | 30+ GGML types             | F16, BF16, F32, U32, I8    |
+-------------------+----------------------------+----------------------------+
| Tensor data       | Contiguous in data section | Contiguous after header    |
+-------------------+----------------------------+----------------------------+

The model code doesn’t want to deal with any of this. It wants a canonical name, and it wants bytes on the GPU. The WeightProvider is the bridge.

The Strategy Pattern in Action

Let’s look at the actual class definition from weight_provider.h:

class WeightProvider {
public:
    enum Format { GGUF, MLX_SAFETENSORS };

    WeightProvider(Device& device) : device_(device) {}
    ~WeightProvider() { close(); }

    bool open(const std::string& path);
    void close();

    Format format() const { return format_; }

    AkunuModelConfig get_config() const;
    Buffer get_tensor(const std::string& name);
    uint32_t get_dtype(const std::string& name) const;
    bool has_tensor(const std::string& name) const;

    // Metadata access
    std::string get_metadata_string(const std::string& key) const;
    int64_t get_metadata_int(const std::string& key, int64_t def = 0) const;
    float get_metadata_float(const std::string& key, float def = 0.0f) const;
    std::vector<std::string> get_string_array(const std::string& key) const;
    std::vector<float> get_float_array(const std::string& key) const;

    // Tensor listing
    int tensor_count() const;
    std::string tensor_name_at(int index) const;

    // MLX quantization info
    int quant_bits() const;
    int quant_group_size() const;

    // Weight fusion
    Buffer fuse_weights(const std::string& a, const std::string& b);
    Buffer fuse_weights(const std::string& a, const std::string& b,
                        const std::string& c);

private:
    Device& device_;
    Format format_ = GGUF;
    std::unique_ptr<WeightStore> gguf_;
    std::unique_ptr<MLXWeightStore> mlx_;

    static Format detect_format(const std::string& path);
};

If you squint, this is a textbook Strategy pattern. The WeightProvider holds a pointer to one of two concrete implementations – WeightStore (for GGUF) or MLXWeightStore (for SafeTensors) – and delegates every operation to whichever one is active. But it’s not using virtual functions and inheritance. Instead, it uses a simpler discriminated-union approach: an enum plus two unique_ptrs.

Why not virtual dispatch? Probably because there are only two backends and the method set is well-defined. A vtable adds indirection for no real benefit here. The explicit delegation is clear, debuggable, and costs nothing beyond a branch that will always be predicted correctly (since the format doesn’t change after open()).

Here’s the delegation pattern for get_tensor():

Buffer get_tensor(const std::string& name) {
    return (format_ == MLX_SAFETENSORS)
        ? mlx_->get_tensor(name)
        : gguf_->get_tensor(name);
}

Every method follows this exact pattern. Clean, predictable, no surprises.

Format Detection: Simpler Than You’d Think

When you call open(), the first thing that happens is format detection. And it’s refreshingly simple:

static Format detect_format(const std::string& path) {
    // Directory or .safetensors -> MLX
    struct stat st;
    if (stat(path.c_str(), &st) == 0 && S_ISDIR(st.st_mode))
        return MLX_SAFETENSORS;
    if (path.size() > 12 &&
        path.substr(path.size() - 12) == ".safetensors")
        return MLX_SAFETENSORS;
    return GGUF;
}

That’s it. Three cases:

  1. Is it a directory? Then it’s an MLX model directory (containing config.json and model.safetensors).
  2. Does the filename end in .safetensors? Same conclusion.
  3. Everything else? GGUF.

No magic number sniffing, no content-based detection. This works because in practice, users either point at a .gguf file or an MLX model directory. The heuristic is simple, fast, and correct for the actual use cases.

The flow looks like this:

            open("/path/to/model")
                     |
                     v
          +--------------------+
          |  detect_format()   |
          +--------------------+
           /                  \
    Directory or             Everything
    .safetensors?              else
         |                      |
         v                      v
  +----------------+    +----------------+
  | MLXWeightStore |    |   WeightStore  |
  |    .open()     |    |     .open()    |
  +----------------+    +----------------+
         |                      |
         v                      v
  Parse config.json     Parse GGUF header
  Open .safetensors     mmap entire file
  Build name map        Build name map

Once the backend is created and opened, the WeightProvider is ready. All subsequent calls go through the chosen backend.

The Canonical Name System

This is one of the most important design decisions in the weight system, and it’s worth understanding in detail. Different model formats use different naming conventions for the same tensors:

  GGUF Convention                    MLX Convention
  ----------------                   ----------------
  token_embd.weight          <--->   model.embed_tokens.weight
  blk.0.attn_q.weight        <--->   model.layers.0.self_attn.q_proj.weight
  blk.0.ffn_gate.weight      <--->   model.layers.0.mlp.gate_proj.weight
  output_norm.weight         <--->   model.norm.weight

Neither of these is what akunu’s model code uses. Instead, akunu defines its own canonical naming scheme:

  Canonical Name                     Purpose
  ---------------------------------  ---------------------------
  token_embedding.weight             Token embedding matrix
  layers.{n}.attention.q.weight      Q projection, layer n
  layers.{n}.attention.k.weight      K projection, layer n
  layers.{n}.attention.v.weight      V projection, layer n
  layers.{n}.attention.output.weight Output projection, layer n
  layers.{n}.ffn.gate.weight         SwiGLU gate projection
  layers.{n}.ffn.up.weight           SwiGLU up projection
  layers.{n}.ffn.down.weight         Down projection
  layers.{n}.attention_norm.weight   Pre-attention RMSNorm
  layers.{n}.ffn_norm.weight         Pre-FFN RMSNorm
  output_norm.weight                 Final RMSNorm
  output.weight                      LM head

Each backend maintains a bidirectional mapping between its format-specific names and these canonical names. When the model code asks for layers.3.attention.q.weight, the GGUF backend translates that to blk.3.attn_q.weight, and the MLX backend translates it to model.layers.3.self_attn.q_proj.weight.

The mapping is built at load time. Here’s the GGUF side, from weight_store.cpp:

static const struct {
    const char *gguf;
    const char *canonical;
} kBaseRules[] = {
    {"token_embd.weight",              "token_embedding.weight"},
    {"blk.{n}.attn_q.weight",          "layers.{n}.attention.q.weight"},
    {"blk.{n}.attn_k.weight",          "layers.{n}.attention.k.weight"},
    {"blk.{n}.attn_v.weight",          "layers.{n}.attention.v.weight"},
    {"blk.{n}.attn_output.weight",     "layers.{n}.attention.output.weight"},
    {"blk.{n}.attn_norm.weight",       "layers.{n}.attention_norm.weight"},
    {"blk.{n}.ffn_gate.weight",        "layers.{n}.ffn.gate.weight"},
    {"blk.{n}.ffn_up.weight",          "layers.{n}.ffn.up.weight"},
    {"blk.{n}.ffn_down.weight",        "layers.{n}.ffn.down.weight"},
    {"blk.{n}.ffn_norm.weight",        "layers.{n}.ffn_norm.weight"},
    // ... plus bias tensors, QK-norm, Gemma post-norms, etc.
};

And the MLX side, from mlx_weight_store.h:

static const MLXNameRule kMLXRules[] = {
    {"model.embed_tokens.weight",      "token_embedding.weight"},
    {"model.norm.weight",              "output_norm.weight"},
    {"lm_head.weight",                 "output.weight"},
    {"model.layers.{n}.self_attn.q_proj.weight",
                                       "layers.{n}.attention.q.weight"},
    {"model.layers.{n}.mlp.gate_proj.weight",
                                       "layers.{n}.ffn.gate.weight"},
    // ... and so on
};

The {n} placeholder is a neat trick. During build_name_mapping(), each rule is expanded for every layer in the model:

void WeightStore::build_name_mapping() {
    std::string arch = get_metadata_string("general.architecture");  // e.g. "llama"
    int n_layers = get_metadata_int(arch + ".block_count", 0);
    for (int r = 0; r < kNumBaseRules; r++) {
        const char* pattern   = kBaseRules[r].gguf;
        const char* canonical = kBaseRules[r].canonical;
        if (strstr(pattern, "{n}")) {
            for (int layer = 0; layer < n_layers; layer++) {
                std::string gguf_name = expand_rule(pattern, layer);
                if (gguf_get_tensor(gguf_, gguf_name.c_str())) {
                    name_map_[expand_rule(canonical, layer)]
                        = gguf_name;
                }
            }
        } else {
            if (gguf_get_tensor(gguf_, pattern))
                name_map_[canonical] = pattern;
        }
    }
}

The existence check (gguf_get_tensor()) is important. Not every model has every tensor. Some models have QK-norm weights, some don’t. Some have bias tensors, most don’t. By checking for existence, the mapping only includes tensors that are actually present in the file.

The Data Flow: From File to GPU

Let’s trace the complete path of a weight tensor from disk to GPU. We’ll use the GGUF path since it’s more straightforward, but the MLX path follows the same high-level structure.

  Model code calls:
    provider.get_tensor("layers.5.ffn.gate.weight")
        |
        v
  WeightProvider delegates to WeightStore
        |
        v
  WeightStore::get_tensor("layers.5.ffn.gate.weight")
        |
        +-- Check buffer_cache_ (hit? return cached buffer)
        |
        +-- Lookup in name_map_:
        |   "layers.5.ffn.gate.weight" -> "blk.5.ffn_gate.weight"
        |
        +-- load_tensor_raw("blk.5.ffn_gate.weight")
              |
              +-- gguf_get_tensor(gguf_, "blk.5.ffn_gate.weight")
              |   Returns: GGUFTensorInfo { dtype=Q4_0, offset=0x1234,
              |                             n_elements=14336*4096 }
              |
              +-- gguf_tensor_data(gguf_, info)
              |   Returns: pointer into mmap'd region (zero-copy!)
              |
              +-- Compute byte size from dtype:
              |   Q4_0: n_elements / 32 * 18 bytes
              |
              +-- dtype == F32 or BF16? Convert to F16
              |   Otherwise: direct copy to GPU
              |
              +-- device_.allocate(data, bytes)
                  Returns: Buffer { handle, size, contents }

A few things to note:

mmap is doing the heavy lifting. The GGUF parser memory-maps the entire file. When we need tensor data, we just compute a pointer into the mapped region. There’s no explicit read, no buffer allocation for the raw data. The OS handles paging in the data on demand. This means opening a 4GB model file is nearly instantaneous – the actual I/O happens lazily when we first touch each tensor’s bytes.

Lazy loading with caching. Tensors are loaded on first access and cached in buffer_cache_. Once a tensor is on the GPU, subsequent requests for the same tensor return the cached buffer. This is important because during inference, the same weights are used on every forward pass.

Format conversion at load time. F32 and BF16 tensors are converted to F16 during loading. The GPU kernels expect F16, so this conversion happens exactly once. For quantized types (Q4_0, Q6_K, etc.), the data is copied as-is – the dequantization happens in the compute kernels.

The byte-size computation for quantized types is a lookup that maps dtype to a formula based on block size:

  Type    Block Size    Bytes/Block    Formula
  ------  ----------    -----------    -------------------------
  Q4_0    32 elements   18 bytes       n / 32 * 18
  Q4_1    32 elements   20 bytes       n / 32 * 20
  Q5_0    32 elements   22 bytes       n / 32 * 22
  Q8_0    32 elements   34 bytes       n / 32 * 34
  Q2_K    256 elements  84 bytes       n / 256 * 84
  Q3_K    256 elements  110 bytes      n / 256 * 110
  Q4_K    256 elements  144 bytes      n / 256 * 144
  Q5_K    256 elements  176 bytes      n / 256 * 176
  Q6_K    256 elements  210 bytes      n / 256 * 210
  Q8_K    256 elements  292 bytes      n / 256 * 292

We’ll cover the details of these formats in the quantization chapter.

Weight Fusion: Gate+Up and Q+K+V

One of the most performance-critical operations in the weight provider is weight fusion. The idea is simple: instead of doing two (or three) separate matrix multiplications and then combining the results, we concatenate the weight matrices and do a single, larger matmul.

For SwiGLU-based FFN layers, the gate and up projections can be fused:

  Before fusion (2 matmuls):
    gate_out = x @ gate_weight      (dim -> ffn_dim)
    up_out   = x @ up_weight        (dim -> ffn_dim)

  After fusion (1 matmul):
    fused_out = x @ [gate_weight; up_weight]   (dim -> 2*ffn_dim)
    gate_out  = fused_out[:ffn_dim]
    up_out    = fused_out[ffn_dim:]

Similarly, Q, K, and V projections can be fused when they share the same input:

  Before fusion (3 matmuls):
    q = x @ q_weight     (dim -> q_dim)
    k = x @ k_weight     (dim -> kv_dim)
    v = x @ v_weight     (dim -> kv_dim)

  After fusion (1 matmul):
    fused = x @ [q_weight; k_weight; v_weight]  (dim -> q_dim+2*kv_dim)

The fuse_weights() methods handle this concatenation. For GGUF, it’s straightforward – just concatenate the raw bytes:

Buffer WeightStore::fuse_weights(const std::string& name_a,
                                  const std::string& name_b) {
    std::string key = name_a + "+" + name_b;
    auto it = fused_cache_.find(key);
    if (it != fused_cache_.end()) return it->second;

    Buffer a = get_tensor(name_a);
    Buffer b = get_tensor(name_b);
    size_t total = a.size + b.size;
    Buffer fused = device_.allocate(total);
    memcpy(fused.contents, a.contents, a.size);
    memcpy((char*)fused.contents + a.size, b.contents, b.size);
    fused_cache_[key] = fused;
    return fused;
}

For MLX quantized models, fusion is more involved because each tensor is actually a packed triple of [weights | scales | biases]. You can’t just concatenate the whole buffers – you need to concatenate each section separately:

  Input buffers (each is a packed triple):

  Tensor A: [  A_weights  |  A_scales  |  A_biases  ]
  Tensor B: [  B_weights  |  B_scales  |  B_biases  ]

  Fused output:

  [ A_weights | B_weights | A_scales | B_scales | A_biases | B_biases ]
  |<--- all weights --->|<--- all scales --->|<--- all biases --->|

This layout is critical for the GPU kernel, which expects to find all weights contiguous, then all scales contiguous, then all biases contiguous. The fuse_mlx_packed() helper function handles this three-way interleaving.

Here’s the ASCII diagram of the full fusion pipeline for three tensors (Q+K+V):

  Q buffer:  [ Q_w (N_q*K_packed*4 bytes) | Q_s (N_q*K/gs*2) | Q_b (N_q*K/gs*2) ]
  K buffer:  [ K_w (N_k*K_packed*4 bytes) | K_s (N_k*K/gs*2) | K_b (N_k*K/gs*2) ]
  V buffer:  [ V_w (N_v*K_packed*4 bytes) | V_s (N_v*K/gs*2) | V_b (N_v*K/gs*2) ]
                               |
                               v
                     fuse_mlx_packed()
                               |
                               v
  Fused: [ Q_w | K_w | V_w | Q_s | K_s | V_s | Q_b | K_b | V_b ]
          |<-- total_w --->| |<-- total_s*2 -->| |<-- total_s*2 -->|

The fusion result is also cached (keyed by the concatenation of canonical names), so subsequent forward passes reuse the fused buffer.

Config Extraction: Two Paths to the Same Struct

The get_config() method returns an AkunuModelConfig struct regardless of the source format. But the two backends extract this config very differently.

GGUF path: Config lives in the binary metadata KV pairs. Keys are prefixed with the architecture name:

  Key                                      Example Value
  -----------------------------------------  ------------
  general.architecture                       "llama"
  llama.embedding_length                     4096
  llama.block_count                          32
  llama.attention.head_count                 32
  llama.attention.head_count_kv              8
  llama.feed_forward_length                  14336
  llama.context_length                       8192
  llama.rope.freq_base                       500000.0
  llama.attention.layer_norm_rms_epsilon     1e-5

The WeightStore::get_config() method tries architecture-prefixed keys first, then falls back to unqualified keys. This is because some GGUF files use llama.block_count while others use just block_count.

MLX path: Config lives in config.json, a standard HuggingFace config file:

{
    "model_type": "llama",
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "intermediate_size": 14336,
    "max_position_embeddings": 8192,
    "rope_theta": 500000.0,
    "rms_norm_eps": 1e-5,
    "quantization_config": {
        "bits": 4,
        "group_size": 64
    }
}

The MLXWeightStore::parse_config_json() method uses a minimal JSON parser (hand-rolled, no dependencies) to extract these values. Note the different key names: HuggingFace uses hidden_size where GGUF uses embedding_length, and num_hidden_layers where GGUF uses block_count.

Both paths populate the same AkunuModelConfig struct:

typedef struct {
    uint32_t dim;           // embedding_length / hidden_size
    uint32_t n_layers;      // block_count / num_hidden_layers
    uint32_t n_heads;       // head_count / num_attention_heads
    uint32_t n_kv_heads;    // head_count_kv / num_key_value_heads
    uint32_t head_dim;      // explicit or dim/n_heads
    uint32_t q_dim;         // n_heads * head_dim
    uint32_t kv_dim;        // n_kv_heads * head_dim
    uint32_t ffn_dim;       // feed_forward_length / intermediate_size
    uint32_t vocab_size;
    uint32_t max_seq_len;
    float norm_eps;
    float rope_theta;
    uint32_t sliding_window_pattern;
    float rope_local_theta;
    char architecture[32];
    // ... encoder fields for Whisper
} AkunuModelConfig;

Architecture-Specific Handling

The weight system isn’t just a dumb loader. It has architecture-specific logic for several model families.

Whisper

Whisper is an encoder-decoder model, which means it has two sets of layers with completely different tensor names. The GGUF backend has a separate rule table (kWhisperRules) with entries for both encoder and decoder:

  GGUF Name                               Canonical Name
  ----------------------------------------  --------------------------------
  encoder.conv1.weight                      enc.conv1.weight
  encoder.blocks.0.attn.query.weight        enc.layers.0.attn.q.weight
  decoder.blocks.0.attn.query.weight        layers.0.attention.q.weight
  decoder.blocks.0.cross_attn.query.weight  layers.0.cross_attn.q.weight

The config extraction also handles Whisper specially, populating the encoder-specific fields (enc_n_layers, enc_n_heads, n_mels, etc.) from whisper.encoder.* metadata keys.

Gemma 3

Gemma 3 uses a sliding window attention pattern where every 6th layer has global attention and the rest use local/sliding-window attention. The config stores this as sliding_window_pattern = 6 and rope_local_theta = 10000.0. Both GGUF and MLX backends detect this pattern when they see the gemma architecture with a non-zero sliding window size.

QK-Norm (Qwen3)

Some newer models like Qwen3 add separate RMSNorm layers for Q and K projections before the attention computation. Both rule tables include mappings for these:

  GGUF:  blk.{n}.attn_q_norm.weight  ->  layers.{n}.attention.q_norm.weight
  MLX:   model.layers.{n}.self_attn.q_norm.weight  ->  (same canonical)

The Caching Architecture

Let’s look at the complete caching picture across the system:

  +--------------------------------------------------+
  |                  WeightProvider                   |
  +--------------------------------------------------+
              |                         |
              v                         v
  +----------------------+    +----------------------+
  |    WeightStore       |    |   MLXWeightStore     |
  |  (GGUF backend)      |    |  (SafeTensors)       |
  +----------------------+    +----------------------+
  | buffer_cache_:       |    | buffer_cache_:       |
  |   canonical -> GPU   |    |   canonical -> GPU   |
  | fused_cache_:        |    |   "a+b" -> GPU       |
  |   "a+b" -> GPU       |    |   "a+b+c" -> GPU     |
  +----------------------+    +----------------------+
         |                           |
         v                           v
  +-------------------+     +-------------------+
  |   mmap'd GGUF     |     |  mmap'd .safe-    |
  |   (OS page cache) |     |  tensors file     |
  +-------------------+     +-------------------+

There are effectively three levels of caching:

  1. OS page cache: The mmap’d files are backed by the OS page cache. First access to a tensor’s bytes triggers a page fault and disk read. Subsequent accesses are served from RAM.

  2. GPU buffer cache: Once a tensor is uploaded to the GPU (via device_.allocate()), the result is cached in buffer_cache_. The model never re-uploads a tensor.

  3. Fused buffer cache: Fused weight combinations are cached separately in fused_cache_ (GGUF) or in buffer_cache_ with composite keys like "layers.0.ffn.gate.weight+layers.0.ffn.up.weight" (MLX).

This means the steady-state memory picture during inference is: the original file is mmap’d but mostly paged out (the OS will reclaim those pages under memory pressure), and the weights live in GPU-accessible buffers.

Metadata Access: A Leaky Abstraction?

One area where the abstraction gets a bit leaky is metadata access. The get_metadata_string(), get_metadata_int(), and get_metadata_float() methods are primarily used for GGUF metadata (which is rich and structured). The MLX backend’s implementations are stubs that return defaults:

// mlx_weight_store.cpp
std::string MLXWeightStore::get_metadata_string(const std::string&) const {
    return "";
}
int64_t MLXWeightStore::get_metadata_int(const std::string&, int64_t def) const {
    return def;
}

This makes sense when you think about it. GGUF metadata contains everything: model architecture, tokenizer vocabulary, RoPE parameters, you name it. MLX models store their config in config.json (parsed at open time into AkunuModelConfig) and their tokenizer in a separate tokenizer.json file.

The metadata methods exist because the tokenizer system needs access to GGUF’s embedded vocabulary arrays (tokenizer.ggml.tokens, tokenizer.ggml.scores). For MLX models, the tokenizer is loaded from tokenizer.json through a completely separate path, so these methods are never called.

Is this a code smell? Maybe. But it’s a pragmatic choice. Adding a separate tokenizer abstraction just to avoid empty stubs would be over-engineering.

Tensor Inspection: The Debug Interface

The tensor_count(), tensor_name_at(), and tensor_raw_dtype() methods form a debug/inspection interface. These are used by akunu’s model inspection tool to list all tensors in a weight file without loading them onto the GPU:

  Index   Name                                     DType     Elements
  -----   ---------------------------------------- --------- -----------
  0       token_embd.weight                        Q8_0      524288000
  1       blk.0.attn_q.weight                      Q4_K      16777216
  2       blk.0.attn_k.weight                      Q4_K      4194304
  ...
  199     output.weight                            Q8_0      524288000

This is purely for human consumption. The model code never uses these methods.

The Buffer Type

Throughout this chapter, we’ve been passing around Buffer objects. Let’s clarify what this actually is:

struct Buffer {
    void* handle;     // Opaque GPU handle (MTLBuffer* on Metal)
    size_t size;      // Buffer size in bytes
    void* contents;   // CPU-accessible pointer (shared memory on Apple Silicon)
};

On Apple Silicon, GPU and CPU share the same physical memory, so contents points directly to the GPU buffer’s storage. This means memcpy into contents is the same as uploading to the GPU – there’s no separate DMA transfer step. This is what makes the weight loading so fast: mmap the file, memcpy from the mmap into the GPU buffer, done.

The {nullptr, 0, nullptr} triple is used as a sentinel for “not found” or “error”. You’ll see this returned throughout the code as the failure case.

Putting It All Together

Let’s trace a complete model load from the application’s perspective:

  Application                      WeightProvider              Backend
  -----------                      --------------              -------
  provider.open("/path/to/model")
      |
      +----> detect_format()  -----> "Is it a dir?" -----> MLX
      |                              "Is it .gguf?" -----> GGUF
      |
      +----> backend.open("/path/to/model")
      |          |
      |          +-- mmap file / parse header
      |          +-- extract metadata / config.json
      |          +-- build_name_mapping()
      |
      +----> provider.get_config()
      |          |
      |          +-- backend.get_config() -> AkunuModelConfig
      |
      +----> For each layer:
      |        provider.fuse_weights(
      |          "layers.N.attention.q.weight",
      |          "layers.N.attention.k.weight",
      |          "layers.N.attention.v.weight")
      |            |
      |            +-- get_tensor() x3 (load each, cache)
      |            +-- concatenate (format-specific)
      |            +-- cache fused result
      |
      +----> provider.fuse_weights(
      |          "layers.N.ffn.gate.weight",
      |          "layers.N.ffn.up.weight")
      |
      +----> provider.get_tensor("layers.N.ffn.down.weight")
      |
      +----> (inference begins, all weights cached on GPU)

After the initial load, no further disk I/O happens. Every weight access is a hash table lookup returning a cached GPU buffer. The model code is completely format-agnostic – it works with canonical names and doesn’t know whether the underlying data came from a quantized GGUF, a full-precision SafeTensors file, or an MLX 4-bit quantized directory.

That’s the weight provider. Not flashy, not complicated, but it cleanly decouples two very different ecosystems from the model code that uses them. In the next chapters, we’ll dive deep into the formats themselves – starting with GGUF.