GGUF: Format Specification and Parser
If you’ve been anywhere near the local LLM scene, you’ve encountered GGUF files. They’re the de facto distribution format for quantized models, popularized by llama.cpp and now supported by pretty much every serious inference engine. GGUF stands for “GPT-Generated Unified Format” (though nobody actually calls it that), and it replaced the older GGML format back in 2023.
This chapter is a deep dive into the GGUF format – the binary layout, the parsing
strategy, and how akunu’s gguf_parser.cpp turns a flat file into structured data.
If you’ve ever been curious about what’s actually inside those multi-gigabyte files
you download from HuggingFace, this is for you.
The Big Picture: File Layout
A GGUF file is a single, self-contained binary blob. Everything the inference engine needs – model architecture, hyperparameters, tokenizer vocabulary, and all the weight tensors – lives in one file. No companion JSON, no directory structure, no index files.
The high-level layout is dead simple:
+=============================================+ offset 0
| HEADER |
| +---------------------------------------+ |
| | Magic number (4 bytes, LE) | |
| | Version (4 bytes, LE) | |
| | Tensor count (8 bytes, LE) | |
| | KV pair count (8 bytes, LE) | |
| +---------------------------------------+ |
| |
| +---------------------------------------+ |
| | Metadata KV pairs | |
| | (variable length, kv_count entries) | |
| +---------------------------------------+ |
| |
| +---------------------------------------+ |
| | Tensor info entries | |
| | (variable length, tensor_count) | |
| +---------------------------------------+ |
+=============================================+
| PADDING to 32-byte alignment |
+=============================================+ data_base
| |
| TENSOR DATA |
| |
| (raw bytes, tensors at their declared |
| offsets from data_base) |
| |
+=============================================+ EOF
Let’s zoom in on each section.
The Header: 24 Bytes of Truth
The header is the first 24 bytes of every GGUF file:
Byte offset Size Field Description
----------- ---- ----- -----------
0 4 magic 0x46554747 ("GGUF" in little-endian)
4 4 version Format version (currently 3)
8 8 tensor_count Number of tensors in the file
16 8 kv_count Number of metadata key-value pairs
Let’s break down the magic number. In ASCII, G=0x47, G=0x47, U=0x55,
F=0x46. Stored in little-endian as a 32-bit integer, that’s 0x46554747.
Here’s the byte-level view:
Address: 00 01 02 03 04 05 06 07
Bytes: 47 47 55 46 03 00 00 00
G G U F version=3
Akunu’s parser checks both the magic and the version:
static constexpr uint32_t GGUF_MAGIC = 0x46554747;
static constexpr uint32_t GGUF_VERSION = 3;
uint32_t magic = read_u32(f);
if (magic != GGUF_MAGIC) {
fprintf(stderr, "bad magic 0x%08x\n", magic);
return nullptr;
}
uint32_t version = read_u32(f);
if (version != GGUF_VERSION) {
fprintf(stderr, "unsupported version %u\n", version);
return nullptr;
}
Version 3 is the current version and the only one akunu supports. Version 1 never saw wide use, and version 2 had a brief life before version 3 stabilized the format. The v1-to-v2 bump widened string lengths and array counts from uint32 to uint64; the v2-to-v3 change was smaller still, mainly adding support for big-endian encodings.
Metadata Key-Value Pairs
Immediately after the 24-byte header, you find the metadata section. This is where the model’s configuration lives – architecture type, embedding dimensions, number of layers, RoPE parameters, tokenizer vocabulary, and more.
Each KV pair has this structure:
+---------------------------+
| Key string |
| +---------------------+ |
| | length (8 bytes) | |
| | chars (N bytes) | |
| +---------------------+ |
+---------------------------+
| Value type (4 bytes) |
+---------------------------+
| Value data (variable) |
+---------------------------+
The key is a GGUF string: a uint64 length prefix followed by that many raw bytes (no null terminator in the file, though the parser adds one for convenience).
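As a concrete illustration, here's a minimal sketch of pulling a GGUF string out of a byte buffer. This is not akunu's `read_string` (which works through a file cursor, as shown later); it assumes a little-endian host and a buffer known to be large enough.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Read a GGUF string (uint64 little-endian length prefix followed by raw
// bytes) starting at *p, advancing *p past it.
std::string read_gguf_string(const uint8_t **p) {
    uint64_t len;
    std::memcpy(&len, *p, 8);  // 8-byte length prefix
    *p += 8;
    std::string s(reinterpret_cast<const char *>(*p),
                  static_cast<size_t>(len));
    *p += len;
    return s;  // std::string supplies the null terminator the file omits
}
```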
The value type is one of 13 possible types:
Code Type Size Description
---- ---- ---- -----------
0 UINT8 1 byte Unsigned 8-bit integer
1 INT8 1 byte Signed 8-bit integer
2 UINT16 2 bytes Unsigned 16-bit integer
3 INT16 2 bytes Signed 16-bit integer
4 UINT32 4 bytes Unsigned 32-bit integer
5 INT32 4 bytes Signed 32-bit integer
6 FLOAT32 4 bytes IEEE 754 single-precision float
7 BOOL 1 byte Boolean (0 or 1)
8 STRING variable GGUF string (uint64 length + chars)
9 ARRAY variable Typed array (see below)
10 UINT64 8 bytes Unsigned 64-bit integer
11 INT64 8 bytes Signed 64-bit integer
12 FLOAT64 8 bytes IEEE 754 double-precision float
Most metadata values are simple scalars. For example, the embedding dimension might be stored as:
Key: "llama.embedding_length"
Type: UINT32 (4)
Value: 00 10 00 00 (4096 in little-endian)
Or a float parameter:
Key: "llama.rope.freq_base"
Type: FLOAT32 (6)
Value:  00 24 F4 48              (500000.0 in IEEE 754)
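Decoding such scalars is just a memcpy of the value bytes; here's a sketch with hypothetical helper names (little-endian host assumed):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Decode little-endian scalar KV values from their raw bytes.
uint32_t decode_u32(const uint8_t *p) {
    uint32_t v;
    std::memcpy(&v, p, 4);
    return v;
}

float decode_f32(const uint8_t *p) {
    float v;
    std::memcpy(&v, p, 4);  // IEEE 754 single-precision, little-endian
    return v;
}
```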
Array Values
Arrays are the most complex metadata type. They have this layout:
+-----------------------------+
| Element type (4 bytes) |
| Element count (8 bytes) |
| Element 0 (variable) |
| Element 1 (variable) |
| ... |
| Element N-1 (variable) |
+-----------------------------+
The two most important array uses are the tokenizer vocabulary (an array of strings) and the tokenizer scores (an array of float32). A vocabulary array for a 32K-token model would look something like:
Element type: STRING (8)
Element count: 32000
Element 0:     [len=5] "<unk>"     (the unknown token)
Element 1:     [len=3] "<s>"       (beginning of sequence)
Element 2: [len=4] "</s>" (end of sequence)
...
Element 31999: [len=7] "zoology"
Each string element is a full GGUF string (uint64 length + chars), so parsing an array of strings requires walking through variable-length elements sequentially. There’s no random access – you can’t jump to element 15000 without parsing all preceding elements.
How Akunu Parses Metadata
The parser reads metadata values with a big switch statement. Let’s look at the interesting parts:
static void read_metadata_value(GGUFFileImpl *f, GGUFMetadataKV *kv,
uint32_t type) {
kv->type = type;
kv->array_len = 0;
kv->array_data = nullptr;
switch (type) {
case GGUF_TYPE_UINT8:
kv->value.u32 = read_u8(f);
break;
case GGUF_TYPE_INT32:
kv->value.i32 = read_i32(f);
break;
case GGUF_TYPE_FLOAT32:
kv->value.f32 = read_f32(f);
break;
case GGUF_TYPE_STRING:
kv->value.str = read_string(f);
break;
case GGUF_TYPE_ARRAY: {
uint32_t elem_type = read_u32(f);
uint64_t count = read_u64(f);
kv->array_len = count;
kv->array_data = f->cursor; // <-- raw pointer into mmap!
// Skip past all elements
for (uint64_t i = 0; i < count; i++) {
skip_value(f, elem_type);
}
kv->value.u32 = elem_type; // store element type
break;
}
// ... other types ...
}
}
Notice the array handling. The parser doesn’t decode array elements during the
initial parse. Instead, it records a raw pointer (array_data) into the mmap’d
region and the element type. The actual element decoding happens lazily when
someone calls gguf_get_string_array() or gguf_get_float_array(). This is
a nice optimization – the tokenizer vocabulary can have 100K+ entries, and
parsing all of them upfront would waste time if the caller only needs the
model architecture.
But the parser still has to skip past all elements to find where the next KV
pair starts. The skip_value() function handles this:
static void skip_value(GGUFFileImpl *f, uint32_t type) {
switch (type) {
case GGUF_TYPE_UINT8:
case GGUF_TYPE_INT8:
case GGUF_TYPE_BOOL:
f->cursor += 1;
break;
case GGUF_TYPE_UINT16:
case GGUF_TYPE_INT16:
f->cursor += 2;
break;
case GGUF_TYPE_UINT32:
case GGUF_TYPE_INT32:
case GGUF_TYPE_FLOAT32:
f->cursor += 4;
break;
case GGUF_TYPE_UINT64:
case GGUF_TYPE_INT64:
case GGUF_TYPE_FLOAT64:
f->cursor += 8;
break;
case GGUF_TYPE_STRING: {
uint64_t len = read_u64(f);
f->cursor += len;
break;
}
case GGUF_TYPE_ARRAY: {
uint32_t elem_type = read_u32(f);
uint64_t count = read_u64(f);
for (uint64_t i = 0; i < count; i++) {
skip_value(f, elem_type);
}
break;
}
}
}
This is recursive for nested arrays (arrays of arrays), though in practice GGUF files don’t use nested arrays. The common case is arrays of strings or arrays of floats, which skip efficiently.
Tensor Info Entries
After all metadata KV pairs, the file contains tensor info entries. Each entry describes one tensor: its name, shape, data type, and offset into the data section.
+-------------------------------+
| Tensor name (GGUF string) |
+-------------------------------+
| Number of dimensions (u32) |
+-------------------------------+
| Dimension 0 (u64) |
| Dimension 1 (u64) |
| ...up to 4 dims |
+-------------------------------+
| Data type (u32) |
+-------------------------------+
| Offset (u64) |
+-------------------------------+
Let’s work through a concrete example. A Q4_0-quantized attention Q projection weight for a 4096-dim model with 32 heads might look like:
Name: "blk.0.attn_q.weight"
n_dims: 2
dims[0]: 4096 (output dimension = dim)
dims[1]: 4096 (input dimension = dim)
dtype: 2 (GGUF_DTYPE_Q4_0)
offset: 0x00000000 (first tensor, starts at data_base)
n_elements: 4096 * 4096 = 16,777,216
bytes: 16,777,216 / 32 * 18 = 9,437,184 (about 9 MB)
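That byte count generalizes to any block-quantized type: divide the element count by the block size, multiply by bytes per block. A sketch (the helper name is mine, not akunu's):

```cpp
#include <cassert>
#include <cstdint>

// Size on disk of a block-quantized tensor. Assumes n_elements is a
// multiple of block_elems, which holds for GGUF tensor row sizes.
uint64_t tensor_bytes(uint64_t n_elements, uint64_t block_elems,
                      uint64_t block_bytes) {
    return n_elements / block_elems * block_bytes;
}
```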
The parser stores this in a GGUFTensorInfo struct:
typedef struct {
const char *name;
uint64_t n_elements;
uint32_t n_dims;
uint64_t dims[4];
uint64_t offset;
uint32_t dtype;
} GGUFTensorInfo;
Note that n_elements is computed during parsing by multiplying all dimensions
together. The parser also checks for overflow:
ti.n_elements = 1;
for (uint32_t d = 0; d < ti.n_dims; d++) {
ti.dims[d] = read_u64(f);
if (ti.dims[d] > 0 &&
ti.n_elements > UINT64_MAX / ti.dims[d]) {
fprintf(stderr, "dimension overflow\n");
return nullptr;
}
ti.n_elements *= ti.dims[d];
}
That overflow check matters. A malicious GGUF file could set dimensions to
absurd values, and without the check, n_elements would wrap around to a
small number, leading to undersized buffer allocations and memory corruption.
The Data Section: Alignment Matters
After all tensor info entries, there’s a padding gap to align the data section to a 32-byte boundary. This alignment matters for efficient memory access, especially on GPUs, where misaligned reads can be catastrophically slow. (The GGUF spec lets a file override the default via a general.alignment metadata key; akunu assumes the default of 32.)
static constexpr size_t GGUF_ALIGNMENT = 32;
size_t header_bytes = (size_t)(f->cursor - (const uint8_t*)f->mmap_addr);
size_t aligned = (header_bytes + GGUF_ALIGNMENT - 1)
& ~(GGUF_ALIGNMENT - 1);
f->data_base = (const uint8_t*)f->mmap_addr + aligned;
The & ~(GGUF_ALIGNMENT - 1) trick is a classic bit manipulation for rounding
up to a power-of-two alignment. Since GGUF_ALIGNMENT is 32 (which is
0x20), GGUF_ALIGNMENT - 1 is 0x1F, and ~0x1F is a mask with every bit
set except the low five (0xFFFFFFE0 for a 32-bit value, and the 64-bit
equivalent for size_t). ANDing with this mask clears the bottom 5 bits,
effectively rounding down. But we first added GGUF_ALIGNMENT - 1, so the
net effect is rounding up.
Example: header_bytes = 105,743
+ 31 = 105,774
& 0xFFFFFFE0 = 105,760 (aligned to 32)
data_base = mmap_addr + 105,760
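Packaged as a helper (a sketch; akunu inlines the expression), the round-up looks like this:

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of align, which must be a power of two.
constexpr size_t align_up(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}
```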
Each tensor’s data lives at data_base + tensor.offset. The offset is
relative to data_base, not to the start of the file. This is a subtle
but important detail – tensor offsets stored in the file are already
relative to the aligned data section start.
Tensor Data Types: The Full Catalog
GGUF’s tensor type enumeration spans 31 codes (0 through 30). Two of those codes are retired, and not all of the rest see common use, but akunu’s parser defines them all. Here’s the complete enumeration:
typedef enum {
GGUF_DTYPE_F32 = 0, // IEEE 754 float32
GGUF_DTYPE_F16 = 1, // IEEE 754 float16
GGUF_DTYPE_Q4_0 = 2, // 4-bit quantized, type 0
GGUF_DTYPE_Q4_1 = 3, // 4-bit quantized, type 1
// 4, 5 are legacy (Q4_2, Q4_3 -- removed)
GGUF_DTYPE_Q5_0 = 6, // 5-bit quantized, type 0
GGUF_DTYPE_Q5_1 = 7, // 5-bit quantized, type 1
GGUF_DTYPE_Q8_0 = 8, // 8-bit quantized, type 0
GGUF_DTYPE_Q8_1 = 9, // 8-bit quantized, type 1
GGUF_DTYPE_Q2_K = 10, // K-quant, 2-bit
GGUF_DTYPE_Q3_K = 11, // K-quant, 3-bit
GGUF_DTYPE_Q4_K = 12, // K-quant, 4-bit
GGUF_DTYPE_Q5_K = 13, // K-quant, 5-bit
GGUF_DTYPE_Q6_K = 14, // K-quant, 6-bit
GGUF_DTYPE_Q8_K = 15, // K-quant, 8-bit
GGUF_DTYPE_IQ2_XXS = 16, // Importance quant, 2-bit, extra-small
GGUF_DTYPE_IQ2_XS = 17, // Importance quant, 2-bit, small
GGUF_DTYPE_IQ3_XXS = 18, // Importance quant, 3-bit
GGUF_DTYPE_IQ1_S = 19, // Importance quant, 1-bit
GGUF_DTYPE_IQ4_NL = 20, // Importance quant, 4-bit nonlinear
GGUF_DTYPE_IQ3_S = 21, // Importance quant, 3-bit
GGUF_DTYPE_IQ2_S = 22, // Importance quant, 2-bit
GGUF_DTYPE_IQ4_XS = 23, // Importance quant, 4-bit
GGUF_DTYPE_I8 = 24, // Plain int8
GGUF_DTYPE_I16 = 25, // Plain int16
GGUF_DTYPE_I32 = 26, // Plain int32
GGUF_DTYPE_I64 = 27, // Plain int64
GGUF_DTYPE_F64 = 28, // IEEE 754 float64
GGUF_DTYPE_IQ1_M = 29, // Importance quant, 1-bit mixed
GGUF_DTYPE_BF16 = 30, // Brain float16
} GGUFTensorDType;
Note the gap at codes 4 and 5 – those were Q4_2 and Q4_3, experimental quantization types that were removed early in GGML’s history. The enum preserves backward compatibility by keeping the numbering stable.
Akunu’s weight loader (weight_store.cpp) handles the common types with
explicit byte-size calculations:
Type Block Bytes/Block Bits/Weight Notes
--------- ------ ----------- ----------- ------------------
F32 1 elem 4 32 Full precision
F16 1 elem 2 16 Half precision
BF16 1 elem 2 16 Brain float
Q4_0 32 elem 18 4.5 Scale + 4-bit quants
Q4_1 32 elem 20 5.0 Scale+min + 4-bit
Q5_0 32 elem 22 5.5 Scale + 5-bit quants
Q8_0 32 elem 34 8.5 Scale + 8-bit quants
Q2_K 256 elem 84 2.625 K-quant super-block
Q3_K 256 elem 110 3.4375 K-quant super-block
Q4_K 256 elem 144 4.5 K-quant super-block
Q5_K 256 elem 176 5.5 K-quant super-block
Q6_K 256 elem 210 6.5625 K-quant super-block
Q8_K 256 elem 292 9.125 K-quant super-block
The “bits per weight” column gives you the effective compression ratio. Q4_0 isn’t exactly 4 bits per weight – the scale factor adds overhead, bringing it to 4.5 bits. K-quants are even less round because of their complex multi-level structure.
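You can reproduce the table's bits-per-weight column directly from the block geometry; a quick sketch:

```cpp
#include <cassert>

// Effective bits per weight: total bits in one block divided by the
// number of elements the block encodes.
double bits_per_weight(double block_bytes, double block_elems) {
    return block_bytes * 8.0 / block_elems;
}
```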
We’ll go into the byte-level details of each quantization format in a dedicated chapter. For now, just know that the GGUF parser doesn’t care about the internal structure of quantized blocks – it just needs to know the total byte count to hand off to the weight store.
The Parser Implementation: mmap and Cursors
Akunu’s GGUF parser exposes a C API for maximum compatibility, with the implementation in a C++ file. The core strategy is simple:
- Memory-map the entire file
- Walk through it with a cursor pointer
- Build hash maps for O(1) lookup by name
Let’s trace through gguf_open() step by step.
Step 1: Open and mmap
GGUFFile gguf_open(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0 || (size_t)st.st_size < 24) {
        close(fd);
        return nullptr;
    }
    size_t file_size = (size_t)st.st_size;
    void *mapped = mmap(nullptr, file_size, PROT_READ,
                        MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) {
        close(fd);
        return nullptr;
    }
    GGUFFileImpl *f = new GGUFFileImpl();
    f->fd = fd;
    f->mmap_addr = mapped;
    f->mmap_len = file_size;
    f->cursor = (const uint8_t*)mapped;
    f->end = f->cursor + file_size;
    // ...
}
The MAP_PRIVATE flag means modifications to the mapped region (which we never
make) would be copy-on-write. PROT_READ ensures we can only read. The OS will
page in data lazily as we access it.
The minimum file size check is 24 bytes (the header). Anything smaller can’t possibly be a valid GGUF file.
Step 2: Parse the header
uint32_t magic = read_u32(f); // advances cursor by 4
uint32_t version = read_u32(f); // advances cursor by 4
uint64_t tensor_count = read_u64(f); // advances cursor by 8
uint64_t kv_count = read_u64(f); // advances cursor by 8
The read_* functions are thin wrappers around memcpy + cursor advance:
static inline uint32_t read_u32(GGUFFileImpl *f) {
if (!has_bytes(f, 4)) return 0;
uint32_t v;
memcpy(&v, f->cursor, 4);
f->cursor += 4;
return v;
}
Why memcpy instead of a direct cast like *(uint32_t*)f->cursor? Because
direct casts would be undefined behavior if the cursor isn’t aligned to a
4-byte boundary. GGUF strings have variable length, so after reading a string,
the cursor can be at any alignment. memcpy is always safe and modern
compilers optimize it into the same instruction as a direct load when they
can prove alignment.
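To make the point concrete, here's a sketch of an unaligned load via memcpy (little-endian host assumed; the helper name is mine):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Safe unaligned load: defined behavior at any byte offset, and compilers
// typically lower it to a single load instruction.
uint32_t load_u32(const uint8_t *p) {
    uint32_t v;
    std::memcpy(&v, p, 4);
    return v;
}
```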
Step 3: Parse metadata
f->metadata.reserve(kv_count);
for (uint64_t i = 0; i < kv_count; i++) {
GGUFMetadataKV kv;
kv.key = read_string(f);
uint32_t vtype = read_u32(f);
read_metadata_value(f, &kv, vtype);
f->metadata_map[kv.key] = f->metadata.size();
f->metadata.push_back(kv);
}
Each KV pair is read sequentially (you have to, since they’re variable-length).
The key string is allocated on the heap and tracked in owned_strings for
cleanup. The metadata_map provides O(1) lookup by key name.
Step 4: Parse tensor info
f->tensors.reserve(tensor_count);
for (uint64_t i = 0; i < tensor_count; i++) {
GGUFTensorInfo ti;
ti.name = read_string(f);
ti.n_dims = read_u32(f);
ti.n_elements = 1;
for (uint32_t d = 0; d < ti.n_dims; d++) {
ti.dims[d] = read_u64(f);
ti.n_elements *= ti.dims[d];
}
ti.dtype = read_u32(f);
ti.offset = read_u64(f);
f->tensor_map[ti.name] = f->tensors.size();
f->tensors.push_back(ti);
}
Same pattern: sequential parse, build a hash map. The tensor_map maps
tensor names to indices in the tensors vector.
Step 5: Compute data base
size_t header_bytes = (size_t)(f->cursor - (const uint8_t*)f->mmap_addr);
size_t aligned = (header_bytes + GGUF_ALIGNMENT - 1)
& ~(GGUF_ALIGNMENT - 1);
f->data_base = (const uint8_t*)f->mmap_addr + aligned;
At this point, parsing is complete. The entire header has been walked, and we know where the data section begins.
The Internal State: GGUFFileImpl
Let’s look at the complete internal state structure:
struct GGUFFileImpl {
int fd = -1; // File descriptor (kept open)
void *mmap_addr = MAP_FAILED; // mmap base address
size_t mmap_len = 0; // mmap'd region size
const uint8_t *cursor; // Current parse position
const uint8_t *end; // End of mmap'd region
const uint8_t *data_base; // Start of tensor data section
std::vector<GGUFMetadataKV> metadata; // Parsed metadata
std::vector<GGUFTensorInfo> tensors; // Parsed tensor info
// O(1) lookup maps
std::unordered_map<std::string, size_t> metadata_map;
std::unordered_map<std::string, size_t> tensor_map;
// Ownership tracking
std::vector<char*> owned_strings;
std::vector<const char**> owned_str_arrays;
std::vector<float*> owned_flt_arrays;
};
The memory layout in action:
Process virtual memory:
+----------------------------------------------------------+
| Stack / heap / code / etc. |
+----------------------------------------------------------+
| GGUFFileImpl (heap allocated) |
| - metadata vector (small, ~100 entries) |
| - tensors vector (small, ~200 entries) |
| - metadata_map hash table |
| - tensor_map hash table |
| - owned_strings pointers |
+----------------------------------------------------------+
| mmap'd region (file_size bytes) |
| [header | metadata | tensor_info | pad | tensor_data] |
| ^cursor walks through this during parse |
| ^data_base points to start of tensor_data |
+----------------------------------------------------------+
The key insight is that the mmap’d region stays mapped for the lifetime of
the GGUFFile. When you ask for tensor data via gguf_tensor_data(), you
get a raw pointer into this region. No copying at the parser level – the
copy happens later when the weight store uploads to the GPU.
The C API: Functions and Usage
The parser exposes a clean C API. Here’s the complete interface:
// Lifecycle
GGUFFile gguf_open(const char *path);
void gguf_close(GGUFFile file);
// Counts
uint64_t gguf_tensor_count(GGUFFile file);
uint64_t gguf_metadata_count(GGUFFile file);
// Tensor lookup
const GGUFTensorInfo *gguf_get_tensor(GGUFFile file, const char *name);
const GGUFTensorInfo *gguf_get_tensor_by_index(GGUFFile file, uint64_t index);
// Metadata lookup
const GGUFMetadataKV *gguf_get_metadata(GGUFFile file, const char *key);
const GGUFMetadataKV *gguf_get_metadata_by_index(GGUFFile file, uint64_t index);
// Tensor data access
const void *gguf_tensor_data(GGUFFile file, const GGUFTensorInfo *info);
// Array helpers
const char **gguf_get_string_array(GGUFFile file, const char *key,
uint64_t *out_count);
const float *gguf_get_float_array(GGUFFile file, const char *key,
uint64_t *out_count);
The typical usage pattern (from WeightStore::open()):
gguf_ = gguf_open(path.c_str());
if (!gguf_) return false;
// Read a metadata int
const GGUFMetadataKV *kv = gguf_get_metadata(gguf_, "llama.block_count");
if (!kv) return false;
int n_layers = kv->value.u32; // e.g. 32
// Get a tensor
const GGUFTensorInfo *info = gguf_get_tensor(gguf_, "blk.0.attn_q.weight");
const void *data = gguf_tensor_data(gguf_, info);
// data now points into the mmap'd file -- zero-copy!
// Get tokenizer vocabulary
uint64_t vocab_size;
const char **tokens = gguf_get_string_array(gguf_, "tokenizer.ggml.tokens",
&vocab_size);
The String Array Helper: Lazy Decoding
The gguf_get_string_array() function deserves a closer look because it shows
the lazy decoding strategy in action:
const char **gguf_get_string_array(GGUFFile file, const char *key,
uint64_t *out_count) {
const GGUFMetadataKV *kv = gguf_get_metadata(file, key);
if (!kv || kv->type != GGUF_TYPE_ARRAY ||
kv->value.u32 != GGUF_TYPE_STRING)
return nullptr;
uint64_t count = kv->array_len;
const char **arr = (const char**)malloc(count * sizeof(const char*));
file->owned_str_arrays.push_back(arr);
// Walk the raw array data in the mmap
const uint8_t *p = (const uint8_t*)kv->array_data;
for (uint64_t i = 0; i < count; i++) {
uint64_t slen;
memcpy(&slen, p, 8);
p += 8;
char *s = (char*)malloc(slen + 1);
memcpy(s, p, slen);
s[slen] = '\0';
p += slen;
file->owned_strings.push_back(s);
arr[i] = s;
}
*out_count = count;
return arr;
}
This function walks the raw mmap’d bytes of the array, extracting each string. It’s called at most once per key (the result isn’t cached, but callers typically only call it once during initialization). For a 128K vocabulary, this allocates 128K small strings. Not the most memory-efficient approach, but it’s simple and the total memory is tiny compared to the gigabytes of tensor data.
The float array helper is simpler – it just does a bulk memcpy:
const float *gguf_get_float_array(GGUFFile file, const char *key,
                                  uint64_t *out_count) {
    const GGUFMetadataKV *kv = gguf_get_metadata(file, key);
    if (!kv || kv->type != GGUF_TYPE_ARRAY ||
        kv->value.u32 != GGUF_TYPE_FLOAT32)
        return nullptr;
    uint64_t count = kv->array_len;
    float *arr = (float*)malloc(count * sizeof(float));
    file->owned_flt_arrays.push_back(arr);
    memcpy(arr, kv->array_data, count * sizeof(float));
    *out_count = count;
    return arr;
}
Float arrays are already in the right format in the mmap’d data (IEEE 754 little-endian), so it’s a straight copy.
Memory Management: Who Owns What?
The GGUFFileImpl tracks all dynamically allocated memory through three vectors:
- owned_strings: All strings parsed from the file (tensor names, metadata keys, string values, string array elements)
- owned_str_arrays: The const char** arrays returned by gguf_get_string_array()
- owned_flt_arrays: The float* arrays returned by gguf_get_float_array()
On close, everything is freed:
void gguf_close(GGUFFile file) {
if (file->mmap_addr != MAP_FAILED) {
munmap(file->mmap_addr, file->mmap_len);
}
if (file->fd >= 0) {
close(file->fd);
}
for (char *s : file->owned_strings) {
free(s);
}
for (const char **a : file->owned_str_arrays) {
free(a);
}
for (float *a : file->owned_flt_arrays) {
free(a);
}
delete file;
}
This is a simple arena-style pattern: allocate freely during parsing, free everything at once on close. No individual deallocation, no reference counting. The parser’s lifetime matches the model’s lifetime, so this works perfectly.
Bounds Checking and Safety
The parser includes several safety checks:
Cursor bounds checking: Every read_* function checks has_bytes() before
reading:
static inline bool has_bytes(const GGUFFileImpl *f, size_t n) {
return (size_t)(f->end - f->cursor) >= n;
}
Maximum dimensions: Tensors can have at most 4 dimensions. More than that triggers an error.
Element count overflow: Dimension multiplication checks for uint64 overflow before proceeding.
Tensor data bounds: gguf_tensor_data() validates that the offset is within
the mmap’d region:
const void *gguf_tensor_data(GGUFFile file, const GGUFTensorInfo *info) {
if (info->offset >= (size_t)(file->end - file->data_base))
return nullptr;
return file->data_base + info->offset;
}
These checks protect against malformed or malicious GGUF files. They’re not exhaustive (there’s no check that tensor data regions don’t overlap, for example), but they catch the most common corruption scenarios.
A Real GGUF File, Byte by Byte
Let’s walk through the first few hundred bytes of a real GGUF file to tie everything together. Say we have a small model with 2 metadata entries and 3 tensors:
Offset Hex Meaning
------ --- -------
0x0000 47 47 55 46 Magic: "GGUF"
0x0004 03 00 00 00 Version: 3
0x0008 03 00 00 00 00 00 00 00 Tensor count: 3
0x0010 02 00 00 00 00 00 00 00 KV count: 2
--- Metadata KV #0 ---
0x0018 14 00 00 00 00 00 00 00 Key length: 20
0x0020 67 65 6E 65 72 61 6C 2E "general."
0x0028 61 72 63 68 69 74 65 63 "architec"
0x0030 74 75 72 65 "ture"
0x0034 08 00 00 00 Type: STRING (8)
0x0038 05 00 00 00 00 00 00 00 String length: 5
0x0040 6C 6C 61 6D 61 "llama"
--- Metadata KV #1 ---
0x0045   11 00 00 00 00 00 00 00    Key length: 17
0x004D 6C 6C 61 6D 61 2E 62 6C "llama.bl"
0x0055 6F 63 6B 5F 63 6F 75 6E "ock_coun"
0x005D 74 "t"
0x005E 04 00 00 00 Type: UINT32 (4)
0x0062 20 00 00 00 Value: 32
--- Tensor info #0 ---
0x0066 11 00 00 00 00 00 00 00 Name length: 17
0x006E 74 6F 6B 65 6E 5F 65 6D "token_em"
0x0076 62 64 2E 77 65 69 67 68 "bd.weigh"
0x007E 74 "t"
0x007F 02 00 00 00 n_dims: 2
0x0083 00 10 00 00 00 00 00 00 dims[0]: 4096
0x008B 00 80 00 00 00 00 00 00 dims[1]: 32768
0x0093 08 00 00 00 dtype: Q8_0 (8)
0x0097 00 00 00 00 00 00 00 00 offset: 0
--- (more tensor info entries...) ---
--- PADDING to 32-byte alignment ---
--- TENSOR DATA ---
Notice how everything is little-endian and tightly packed. There’s no padding between fields within a section – the variable-length strings make it impossible to use fixed offsets. You have to parse sequentially.
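To tie the layout together, here's a self-contained sketch that serializes the 24-byte header into a buffer and parses it back (little-endian host assumed; the struct and helpers are illustrative, not akunu's):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct GGUFHeaderFields {
    uint32_t magic;
    uint32_t version;
    uint64_t tensor_count;
    uint64_t kv_count;
};

// Write the header fields in GGUF's on-disk order (all little-endian).
void write_header(uint8_t *buf, const GGUFHeaderFields &h) {
    std::memcpy(buf + 0,  &h.magic,        4);
    std::memcpy(buf + 4,  &h.version,      4);
    std::memcpy(buf + 8,  &h.tensor_count, 8);
    std::memcpy(buf + 16, &h.kv_count,     8);
}

GGUFHeaderFields read_header(const uint8_t *buf) {
    GGUFHeaderFields h;
    std::memcpy(&h.magic,        buf + 0,  4);
    std::memcpy(&h.version,      buf + 4,  4);
    std::memcpy(&h.tensor_count, buf + 8,  8);
    std::memcpy(&h.kv_count,     buf + 16, 8);
    return h;
}
```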
Performance Characteristics
Let’s think about the performance of this parser:
Opening a file: The gguf_open() function is dominated by the metadata
parse time. For a typical model with ~100 metadata entries and ~200 tensors,
this takes microseconds. The mmap itself is near-instantaneous (it just sets
up page table entries).
First tensor access: The first time you access tensor data via
gguf_tensor_data(), the OS has to page in the data from disk. For an SSD,
this is on the order of microseconds per 4KB page. A 9MB Q4_0 tensor
spans about 2,300 pages, so first access is roughly 1-2ms.
Subsequent tensor accesses: After the data is paged in, access is just a pointer dereference – nanoseconds. The OS page cache keeps recently accessed pages in RAM.
Memory usage: The parser itself uses very little heap memory. The mmap’d region uses virtual address space but only consumes physical RAM for pages that have been accessed. A 4GB model file mapped but unaccessed uses essentially zero RAM.
Hash map lookups: Both metadata_map and tensor_map use
std::unordered_map, giving O(1) average-case lookup. With ~200 tensors,
the hash table overhead is negligible.
The overall design is optimized for the common case: open the file once, access each tensor once during model load, then never touch the parser again until shutdown. The mmap approach means the OS manages the caching, which is hard to beat for this access pattern.
Why Not a JSON Parser? Why Not Protobuf?
It’s worth asking why GGUF exists as a custom binary format instead of using an existing serialization framework. The answer is practical:
- Zero-copy tensor data: Tensor data needs to be passed directly to GPU upload functions. With a binary format and mmap, you get a raw pointer into the file with no deserialization overhead. JSON or protobuf would require parsing into intermediate structures and copying.
- Self-contained: A single file containing everything means no directory management, no missing companion files, no version mismatches between separate metadata and data files.
- Streaming-friendly: The metadata is at the front of the file, so you can read the model config without touching the (much larger) tensor data. This matters for model inspection tools.
- Alignment control: The 32-byte alignment of the data section is critical for GPU efficiency. Generic formats don't give you this level of control.
- No dependencies: The parser is self-contained C/C++ code. No JSON library, no protobuf compiler, no generated code.
The tradeoff is that you need a custom parser, and the format is harder to inspect with generic tools. But for the specific use case of distributing and loading quantized model weights, it’s hard to argue with the result: a simple format with a simple parser that delivers excellent performance.
In the next chapter, we’ll look at the other side of the coin: SafeTensors and the MLX ecosystem, which took a very different approach to the same problem.