GGUF: Format Specification and Parser
If you’ve been anywhere near the local LLM scene, you’ve encountered GGUF files. They’re the de facto distribution format for quantized models, popularized by llama.cpp and now supported by pretty much every serious inference engine. GGUF stands for “GPT-Generated Unified Format” (though nobody actually calls it that), and it replaced the older GGML format back in 2023.
This chapter is a deep dive into the GGUF format – the binary layout, the parsing
strategy, and how akunu’s gguf_parser.cpp turns a flat file into structured data.
If you’ve ever been curious about what’s actually inside those multi-gigabyte files
you download from HuggingFace, this is for you.
The Big Picture: File Layout
A GGUF file is a single, self-contained binary blob. Everything the inference engine needs – model architecture, hyperparameters, tokenizer vocabulary, and all the weight tensors – lives in one file. No companion JSON, no directory structure, no index files.
The high-level layout is dead simple:
+=============================================+ offset 0
| HEADER |
| +---------------------------------------+ |
| | Magic number (4 bytes, LE) | |
| | Version (4 bytes, LE) | |
| | Tensor count (8 bytes, LE) | |
| | KV pair count (8 bytes, LE) | |
| +---------------------------------------+ |
| |
| +---------------------------------------+ |
| | Metadata KV pairs | |
| | (variable length, kv_count entries) | |
| +---------------------------------------+ |
| |
| +---------------------------------------+ |
| | Tensor info entries | |
| | (variable length, tensor_count) | |
| +---------------------------------------+ |
+=============================================+
| PADDING to 32-byte alignment |
+=============================================+ data_base
| |
| TENSOR DATA |
| |
| (raw bytes, tensors at their declared |
| offsets from data_base) |
| |
+=============================================+ EOF
Let’s zoom in on each section.
The Header: 24 Bytes of Truth
The header is the first 24 bytes of every GGUF file:
Byte offset Size Field Description
----------- ---- ----- -----------
0 4 magic 0x46554747 ("GGUF" in little-endian)
4 4 version Format version (currently 3)
8 8 tensor_count Number of tensors in the file
16 8 kv_count Number of metadata key-value pairs
Let’s break down the magic number. In ASCII, G=0x47, G=0x47, U=0x55,
F=0x46. Stored in little-endian as a 32-bit integer, that’s 0x46554747.
Here’s the byte-level view:
Address: 00 01 02 03 04 05 06 07
Bytes: 47 47 55 46 03 00 00 00
G G U F version=3
Akunu’s parser checks both the magic and the version:
static constexpr uint32_t GGUF_MAGIC = 0x46554747;
static constexpr uint32_t GGUF_VERSION = 3;
uint32_t magic = read_u32(f);
if (magic != GGUF_MAGIC) {
fprintf(stderr, "bad magic 0x%08x\n", magic);
return nullptr;
}
uint32_t version = read_u32(f);
if (version != GGUF_VERSION) {
fprintf(stderr, "unsupported version %u\n", version);
return nullptr;
}
Version 3 is the current version and the only one akunu supports. Version 1 never saw wide use, and version 2 had a brief life before version 3 stabilized the format. The v1-to-v2 bump widened string lengths and array counts from uint32 to uint64; the v2-to-v3 change was smaller still, mainly adding support for big-endian encodings.
Metadata Key-Value Pairs
Immediately after the 24-byte header, you find the metadata section. This is where the model’s configuration lives – architecture type, embedding dimensions, number of layers, RoPE parameters, tokenizer vocabulary, and more.
Each KV pair has this structure:
+---------------------------+
| Key string |
| +---------------------+ |
| | length (8 bytes) | |
| | chars (N bytes) | |
| +---------------------+ |
+---------------------------+
| Value type (4 bytes) |
+---------------------------+
| Value data (variable) |
+---------------------------+
The key is a GGUF string: a uint64 length prefix followed by that many raw bytes (no null terminator in the file, though the parser adds one for convenience).
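As a concrete illustration, here's a minimal sketch of pulling a GGUF string out of a byte buffer. This is not akunu's `read_string` (which works through a file cursor, as shown later); it assumes a little-endian host and a buffer known to be large enough.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Read a GGUF string (uint64 little-endian length prefix followed by raw
// bytes) starting at *p, advancing *p past it.
std::string read_gguf_string(const uint8_t **p) {
    uint64_t len;
    std::memcpy(&len, *p, 8);  // 8-byte length prefix
    *p += 8;
    std::string s(reinterpret_cast<const char *>(*p),
                  static_cast<size_t>(len));
    *p += len;
    return s;  // std::string supplies the null terminator the file omits
}
```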
The value type is one of 13 possible types:
Code Type Size Description
---- ---- ---- -----------
0 UINT8 1 byte Unsigned 8-bit integer
1 INT8 1 byte Signed 8-bit integer
2 UINT16 2 bytes Unsigned 16-bit integer
3 INT16 2 bytes Signed 16-bit integer
4 UINT32 4 bytes Unsigned 32-bit integer
5 INT32 4 bytes Signed 32-bit integer
6 FLOAT32 4 bytes IEEE 754 single-precision float
7 BOOL 1 byte Boolean (0 or 1)
8 STRING variable GGUF string (uint64 length + chars)
9 ARRAY variable Typed array (see below)
10 UINT64 8 bytes Unsigned 64-bit integer
11 INT64 8 bytes Signed 64-bit integer
12 FLOAT64 8 bytes IEEE 754 double-precision float
Most metadata values are simple scalars. For example, the embedding dimension might be stored as:
Key: "llama.embedding_length"
Type: UINT32 (4)
Value: 00 10 00 00 (4096 in little-endian)
Or a float parameter:
Key: "llama.rope.freq_base"
Type: FLOAT32 (6)
Value:  00 24 F4 48              (500000.0 in IEEE 754)
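Decoding such scalars is just a memcpy of the value bytes; here's a sketch with hypothetical helper names (little-endian host assumed):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Decode little-endian scalar KV values from their raw bytes.
uint32_t decode_u32(const uint8_t *p) {
    uint32_t v;
    std::memcpy(&v, p, 4);
    return v;
}

float decode_f32(const uint8_t *p) {
    float v;
    std::memcpy(&v, p, 4);  // IEEE 754 single-precision, little-endian
    return v;
}
```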
Array Values
Arrays are the most complex metadata type. They have this layout:
+-----------------------------+
| Element type (4 bytes) |
| Element count (8 bytes) |
| Element 0 (variable) |
| Element 1 (variable) |
| ... |
| Element N-1 (variable) |
+-----------------------------+
The two most important array uses are the tokenizer vocabulary (an array of strings) and the tokenizer scores (an array of float32). A vocabulary array for a 32K-token model would look something like:
Element type: STRING (8)
Element count: 32000
Element 0:     [len=5] "<unk>"     (the unknown token)
Element 1:     [len=3] "<s>"       (beginning of sequence)
Element 2: [len=4] "</s>" (end of sequence)
...
Element 31999: [len=7] "zoology"
Each string element is a full GGUF string (uint64 length + chars), so parsing an array of strings requires walking through variable-length elements sequentially. There’s no random access – you can’t jump to element 15000 without parsing all preceding elements.
How Akunu Parses Metadata
The parser reads metadata values with a big switch statement. Let’s look at the interesting parts:
static void read_metadata_value(GGUFFileImpl *f, GGUFMetadataKV *kv,
uint32_t type) {
kv->type = type;
kv->array_len = 0;
kv->array_data = nullptr;
switch (type) {
case GGUF_TYPE_UINT8:
kv->value.u32 = read_u8(f);
break;
case GGUF_TYPE_INT32:
kv->value.i32 = read_i32(f);
break;
case GGUF_TYPE_FLOAT32:
kv->value.f32 = read_f32(f);
break;
case GGUF_TYPE_STRING:
kv->value.str = read_string(f);
break;
case GGUF_TYPE_ARRAY: {
uint32_t elem_type = read_u32(f);
uint64_t count = read_u64(f);
kv->array_len = count;
kv->array_data = f->cursor; // <-- raw pointer into mmap!
// Skip past all elements
for (uint64_t i = 0; i < count; i++) {
skip_value(f, elem_type);
}
kv->value.u32 = elem_type; // store element type
break;
}
// ... other types ...
}
}
Notice the array handling. The parser doesn’t decode array elements during the
initial parse. Instead, it records a raw pointer (array_data) into the mmap’d
region and the element type. The actual element decoding happens lazily when
someone calls gguf_get_string_array() or gguf_get_float_array(). This is
a nice optimization – the tokenizer vocabulary can have 100K+ entries, and
parsing all of them upfront would waste time if the caller only needs the
model architecture.
But the parser still has to skip past all elements to find where the next KV
pair starts. The skip_value() function handles this:
static void skip_value(GGUFFileImpl *f, uint32_t type) {
switch (type) {
case GGUF_TYPE_UINT8:
case GGUF_TYPE_INT8:
case GGUF_TYPE_BOOL:
f->cursor += 1;
break;
case GGUF_TYPE_UINT16:
case GGUF_TYPE_INT16:
f->cursor += 2;
break;
case GGUF_TYPE_UINT32:
case GGUF_TYPE_INT32:
case GGUF_TYPE_FLOAT32:
f->cursor += 4;
break;
case GGUF_TYPE_UINT64:
case GGUF_TYPE_INT64:
case GGUF_TYPE_FLOAT64:
f->cursor += 8;
break;
case GGUF_TYPE_STRING: {
uint64_t len = read_u64(f);
f->cursor += len;
break;
}
case GGUF_TYPE_ARRAY: {
uint32_t elem_type = read_u32(f);
uint64_t count = read_u64(f);
for (uint64_t i = 0; i < count; i++) {
skip_value(f, elem_type);
}
break;
}
}
}
This is recursive for nested arrays (arrays of arrays), though in practice GGUF files don’t use nested arrays. The common case is arrays of strings or arrays of floats, which skip efficiently.
Tensor Info Entries
After all metadata KV pairs, the file contains tensor info entries. Each entry describes one tensor: its name, shape, data type, and offset into the data section.
+-------------------------------+
| Tensor name (GGUF string) |
+-------------------------------+
| Number of dimensions (u32) |
+-------------------------------+
| Dimension 0 (u64) |
| Dimension 1 (u64) |
| ...up to 4 dims |
+-------------------------------+
| Data type (u32) |
+-------------------------------+
| Offset (u64) |
+-------------------------------+
Let’s work through a concrete example. A Q4_0-quantized attention Q projection weight for a 4096-dim model with 32 heads might look like:
Name: "blk.0.attn_q.weight"
n_dims: 2
dims[0]: 4096 (output dimension = dim)
dims[1]: 4096 (input dimension = dim)
dtype: 2 (GGUF_DTYPE_Q4_0)
offset: 0x00000000 (first tensor, starts at data_base)
n_elements: 4096 * 4096 = 16,777,216
bytes: 16,777,216 / 32 * 18 = 9,437,184 (about 9 MB)
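That byte count generalizes to any block-quantized type: divide the element count by the block size, multiply by bytes per block. A sketch (the helper name is mine, not akunu's):

```cpp
#include <cassert>
#include <cstdint>

// Size on disk of a block-quantized tensor. Assumes n_elements is a
// multiple of block_elems, which holds for GGUF tensor row sizes.
uint64_t tensor_bytes(uint64_t n_elements, uint64_t block_elems,
                      uint64_t block_bytes) {
    return n_elements / block_elems * block_bytes;
}
```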
The parser stores this in a GGUFTensorInfo struct:
typedef struct {
const char *name;
uint64_t n_elements;
uint32_t n_dims;
uint64_t dims[4];
uint64_t offset;
uint32_t dtype;
} GGUFTensorInfo;
Note that n_elements is computed during parsing by multiplying all dimensions
together. The parser also checks for overflow:
ti.n_elements = 1;
for (uint32_t d = 0; d < ti.n_dims; d++) {
ti.dims[d] = read_u64(f);
if (ti.dims[d] > 0 &&
ti.n_elements > UINT64_MAX / ti.dims[d]) {
fprintf(stderr, "dimension overflow\n");
return nullptr;
}
ti.n_elements *= ti.dims[d];
}
That overflow check matters. A malicious GGUF file could set dimensions to
absurd values, and without the check, n_elements would wrap around to a
small number, leading to undersized buffer allocations and memory corruption.
The Data Section: Alignment Matters
After all tensor info entries, there’s a padding gap to align the data section to a 32-byte boundary. This alignment matters for efficient memory access, especially on GPUs, where misaligned reads can be catastrophically slow. (The GGUF spec lets a file override the default via a general.alignment metadata key; akunu assumes the default of 32.)
static constexpr size_t GGUF_ALIGNMENT = 32;
size_t header_bytes = (size_t)(f->cursor - (const uint8_t*)f->mmap_addr);
size_t aligned = (header_bytes + GGUF_ALIGNMENT - 1)
& ~(GGUF_ALIGNMENT - 1);
f->data_base = (const uint8_t*)f->mmap_addr + aligned;
The & ~(GGUF_ALIGNMENT - 1) trick is a classic bit manipulation for rounding
up to a power-of-two alignment. Since GGUF_ALIGNMENT is 32 (which is
0x20), GGUF_ALIGNMENT - 1 is 0x1F, and ~0x1F is a mask with every bit
set except the low five (0xFFFFFFE0 for a 32-bit value, and the 64-bit
equivalent for size_t). ANDing with this mask clears the bottom 5 bits,
effectively rounding down. But we first added GGUF_ALIGNMENT - 1, so the
net effect is rounding up.
Example: header_bytes = 105,743
+ 31 = 105,774
& 0xFFFFFFE0 = 105,760 (aligned to 32)
data_base = mmap_addr + 105,760
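Packaged as a helper (a sketch; akunu inlines the expression), the round-up looks like this:

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of align, which must be a power of two.
constexpr size_t align_up(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}
```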
Each tensor’s data lives at data_base + tensor.offset. The offset is
relative to data_base, not to the start of the file. This is a subtle
but important detail – tensor offsets stored in the file are already
relative to the aligned data section start.
Tensor Data Types: The Full Catalog
GGUF’s tensor type enumeration spans 31 codes (0 through 30). Two of those codes are retired, and not all of the rest see common use, but akunu’s parser defines them all. Here’s the complete enumeration:
typedef enum {
GGUF_DTYPE_F32 = 0, // IEEE 754 float32
GGUF_DTYPE_F16 = 1, // IEEE 754 float16
GGUF_DTYPE_Q4_0 = 2, // 4-bit quantized, type 0
GGUF_DTYPE_Q4_1 = 3, // 4-bit quantized, type 1
// 4, 5 are legacy (Q4_2, Q4_3 -- removed)
GGUF_DTYPE_Q5_0 = 6, // 5-bit quantized, type 0
GGUF_DTYPE_Q5_1 = 7, // 5-bit quantized, type 1
GGUF_DTYPE_Q8_0 = 8, // 8-bit quantized, type 0
GGUF_DTYPE_Q8_1 = 9, // 8-bit quantized, type 1
GGUF_DTYPE_Q2_K = 10, // K-quant, 2-bit
GGUF_DTYPE_Q3_K = 11, // K-quant, 3-bit
GGUF_DTYPE_Q4_K = 12, // K-quant, 4-bit
GGUF_DTYPE_Q5_K = 13, // K-quant, 5-bit
GGUF_DTYPE_Q6_K = 14, // K-quant, 6-bit
GGUF_DTYPE_Q8_K = 15, // K-quant, 8-bit
GGUF_DTYPE_IQ2_XXS = 16, // Importance quant, 2-bit, extra-small
GGUF_DTYPE_IQ2_XS = 17, // Importance quant, 2-bit, small
GGUF_DTYPE_IQ3_XXS = 18, // Importance quant, 3-bit
GGUF_DTYPE_IQ1_S = 19, // Importance quant, 1-bit
GGUF_DTYPE_IQ4_NL = 20, // Importance quant, 4-bit nonlinear
GGUF_DTYPE_IQ3_S = 21, // Importance quant, 3-bit
GGUF_DTYPE_IQ2_S = 22, // Importance quant, 2-bit
GGUF_DTYPE_IQ4_XS = 23, // Importance quant, 4-bit
GGUF_DTYPE_I8 = 24, // Plain int8
GGUF_DTYPE_I16 = 25, // Plain int16
GGUF_DTYPE_I32 = 26, // Plain int32
GGUF_DTYPE_I64 = 27, // Plain int64
GGUF_DTYPE_F64 = 28, // IEEE 754 float64
GGUF_DTYPE_IQ1_M = 29, // Importance quant, 1-bit mixed
GGUF_DTYPE_BF16 = 30, // Brain float16
} GGUFTensorDType;
Note the gap at codes 4 and 5 – those were Q4_2 and Q4_3, experimental quantization types that were removed early in GGML’s history. The enum preserves backward compatibility by keeping the numbering stable.
Akunu’s weight loader (weight_store.cpp) handles the common types with
explicit byte-size calculations:
Type Block Bytes/Block Bits/Weight Notes
--------- ------ ----------- ----------- ------------------
F32 1 elem 4 32 Full precision
F16 1 elem 2 16 Half precision
BF16 1 elem 2 16 Brain float
Q4_0 32 elem 18 4.5 Scale + 4-bit quants
Q4_1 32 elem 20 5.0 Scale+min + 4-bit
Q5_0 32 elem 22 5.5 Scale + 5-bit quants
Q8_0 32 elem 34 8.5 Scale + 8-bit quants
Q2_K 256 elem 84 2.625 K-quant super-block
Q3_K 256 elem 110 3.4375 K-quant super-block
Q4_K 256 elem 144 4.5 K-quant super-block
Q5_K 256 elem 176 5.5 K-quant super-block
Q6_K 256 elem 210 6.5625 K-quant super-block
Q8_K 256 elem 292 9.125 K-quant super-block
The “bits per weight” column gives you the effective compression ratio. Q4_0 isn’t exactly 4 bits per weight – the scale factor adds overhead, bringing it to 4.5 bits. K-quants are even less round because of their complex multi-level structure.
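You can reproduce the table's bits-per-weight column directly from the block geometry; a quick sketch:

```cpp
#include <cassert>

// Effective bits per weight: total bits in one block divided by the
// number of elements the block encodes.
double bits_per_weight(double block_bytes, double block_elems) {
    return block_bytes * 8.0 / block_elems;
}
```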
We’ll go into the byte-level details of each quantization format in a dedicated chapter. For now, just know that the GGUF parser doesn’t care about the internal structure of quantized blocks – it just needs to know the total byte count to hand off to the weight store.
The Parser Implementation: mmap and Cursors
Akunu’s GGUF parser exposes a C API for maximum compatibility, with the implementation in a C++ file. The core strategy is simple:
- Memory-map the entire file
- Walk through it with a cursor pointer
- Build hash maps for O(1) lookup by name
Let’s trace through gguf_open() step by step.
Step 1: Open and mmap
GGUFFile gguf_open(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0 || (size_t)st.st_size < 24) {
        close(fd);
        return nullptr;
    }
    size_t file_size = (size_t)st.st_size;
    void *mapped = mmap(nullptr, file_size, PROT_READ,
                        MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) {
        close(fd);
        return nullptr;
    }
    GGUFFileImpl *f = new GGUFFileImpl();
    f->fd = fd;
    f->mmap_addr = mapped;
    f->mmap_len = file_size;
    f->cursor = (const uint8_t*)mapped;
    f->end = f->cursor + file_size;
    // ...
}
The MAP_PRIVATE flag means modifications to the mapped region (which we never
make) would be copy-on-write. PROT_READ ensures we can only read. The OS will
page in data lazily as we access it.
The minimum file size check is 24 bytes (the header). Anything smaller can’t possibly be a valid GGUF file.
Step 2: Parse the header
uint32_t magic = read_u32(f); // advances cursor by 4
uint32_t version = read_u32(f); // advances cursor by 4
uint64_t tensor_count = read_u64(f); // advances cursor by 8
uint64_t kv_count = read_u64(f); // advances cursor by 8
The read_* functions are thin wrappers around memcpy + cursor advance:
static inline uint32_t read_u32(GGUFFileImpl *f) {
if (!has_bytes(f, 4)) return 0;
uint32_t v;
memcpy(&v, f->cursor, 4);
f->cursor += 4;
return v;
}
Why memcpy instead of a direct cast like *(uint32_t*)f->cursor? Because
direct casts would be undefined behavior if the cursor isn’t aligned to a
4-byte boundary. GGUF strings have variable length, so after reading a string,
the cursor can be at any alignment. memcpy is always safe and modern
compilers optimize it into the same instruction as a direct load when they
can prove alignment.
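To make the point concrete, here's a sketch of an unaligned load via memcpy (little-endian host assumed; the helper name is mine):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Safe unaligned load: defined behavior at any byte offset, and compilers
// typically lower it to a single load instruction.
uint32_t load_u32(const uint8_t *p) {
    uint32_t v;
    std::memcpy(&v, p, 4);
    return v;
}
```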
Step 3: Parse metadata
f->metadata.reserve(kv_count);
for (uint64_t i = 0; i < kv_count; i++) {
GGUFMetadataKV kv;
kv.key = read_string(f);
uint32_t vtype = read_u32(f);
read_metadata_value(f, &kv, vtype);
f->metadata_map[kv.key] = f->metadata.size();
f->metadata.push_back(kv);
}
Each KV pair is read sequentially (you have to, since they’re variable-length).
The key string is allocated on the heap and tracked in owned_strings for
cleanup. The metadata_map provides O(1) lookup by key name.
Step 4: Parse tensor info
f->tensors.reserve(tensor_count);
for (uint64_t i = 0; i < tensor_count; i++) {
GGUFTensorInfo ti;
ti.name = read_string(f);
ti.n_dims = read_u32(f);
ti.n_elements = 1;
for (uint32_t d = 0; d < ti.n_dims; d++) {
ti.dims[d] = read_u64(f);
ti.n_elements *= ti.dims[d];
}
ti.dtype = read_u32(f);
ti.offset = read_u64(f);
f->tensor_map[ti.name] = f->tensors.size();
f->tensors.push_back(ti);
}
Same pattern: sequential parse, build a hash map. The tensor_map maps
tensor names to indices in the tensors vector.
Step 5: Compute data base
size_t header_bytes = (size_t)(f->cursor - (const uint8_t*)f->mmap_addr);
size_t aligned = (header_bytes + GGUF_ALIGNMENT - 1)
& ~(GGUF_ALIGNMENT - 1);
f->data_base = (const uint8_t*)f->mmap_addr + aligned;
At this point, parsing is complete. The entire header has been walked, and we know where the data section begins.
The Internal State: GGUFFileImpl
Let’s look at the complete internal state structure:
struct GGUFFileImpl {
int fd = -1; // File descriptor (kept open)
void *mmap_addr = MAP_FAILED; // mmap base address
size_t mmap_len = 0; // mmap'd region size
const uint8_t *cursor; // Current parse position
const uint8_t *end; // End of mmap'd region
const uint8_t *data_base; // Start of tensor data section
std::vector<GGUFMetadataKV> metadata; // Parsed metadata
std::vector<GGUFTensorInfo> tensors; // Parsed tensor info
// O(1) lookup maps
std::unordered_map<std::string, size_t> metadata_map;
std::unordered_map<std::string, size_t> tensor_map;
// Ownership tracking
std::vector<char*> owned_strings;
std::vector<const char**> owned_str_arrays;
std::vector<float*> owned_flt_arrays;
};
The memory layout in action:
Process virtual memory:
+----------------------------------------------------------+
| Stack / heap / code / etc. |
+----------------------------------------------------------+
| GGUFFileImpl (heap allocated) |
| - metadata vector (small, ~100 entries) |
| - tensors vector (small, ~200 entries) |
| - metadata_map hash table |
| - tensor_map hash table |
| - owned_strings pointers |
+----------------------------------------------------------+
| mmap'd region (file_size bytes) |
| [header | metadata | tensor_info | pad | tensor_data] |
| ^cursor walks through this during parse |
| ^data_base points to start of tensor_data |
+----------------------------------------------------------+
The key insight is that the mmap’d region stays mapped for the lifetime of
the GGUFFile. When you ask for tensor data via gguf_tensor_data(), you
get a raw pointer into this region. No copying at the parser level – the
copy happens later when the weight store uploads to the GPU.
The C API: Functions and Usage
The parser exposes a clean C API. Here’s the complete interface:
// Lifecycle
GGUFFile gguf_open(const char *path);
void gguf_close(GGUFFile file);
// Counts
uint64_t gguf_tensor_count(GGUFFile file);
uint64_t gguf_metadata_count(GGUFFile file);
// Tensor lookup
const GGUFTensorInfo *gguf_get_tensor(GGUFFile file, const char *name);
const GGUFTensorInfo *gguf_get_tensor_by_index(GGUFFile file, uint64_t index);
// Metadata lookup
const GGUFMetadataKV *gguf_get_metadata(GGUFFile file, const char *key);
const GGUFMetadataKV *gguf_get_metadata_by_index(GGUFFile file, uint64_t index);
// Tensor data access
const void *gguf_tensor_data(GGUFFile file, const GGUFTensorInfo *info);
// Array helpers
const char **gguf_get_string_array(GGUFFile file, const char *key,
uint64_t *out_count);
const float *gguf_get_float_array(GGUFFile file, const char *key,
uint64_t *out_count);
The typical usage pattern (from WeightStore::open()):
gguf_ = gguf_open(path.c_str());
if (!gguf_) return false;
// Read a metadata int
const GGUFMetadataKV *kv = gguf_get_metadata(gguf_, "llama.block_count");
if (!kv) return false;
int n_layers = kv->value.u32; // e.g. 32
// Get a tensor
const GGUFTensorInfo *info = gguf_get_tensor(gguf_, "blk.0.attn_q.weight");
const void *data = gguf_tensor_data(gguf_, info);
// data now points into the mmap'd file -- zero-copy!
// Get tokenizer vocabulary
uint64_t vocab_size;
const char **tokens = gguf_get_string_array(gguf_, "tokenizer.ggml.tokens",
&vocab_size);
The String Array Helper: Lazy Decoding
The gguf_get_string_array() function deserves a closer look because it shows
the lazy decoding strategy in action:
const char **gguf_get_string_array(GGUFFile file, const char *key,
uint64_t *out_count) {
const GGUFMetadataKV *kv = gguf_get_metadata(file, key);
if (!kv || kv->type != GGUF_TYPE_ARRAY ||
kv->value.u32 != GGUF_TYPE_STRING)
return nullptr;
uint64_t count = kv->array_len;
const char **arr = (const char**)malloc(count * sizeof(const char*));
file->owned_str_arrays.push_back(arr);
// Walk the raw array data in the mmap
const uint8_t *p = (const uint8_t*)kv->array_data;
for (uint64_t i = 0; i < count; i++) {
uint64_t slen;
memcpy(&slen, p, 8);
p += 8;
char *s = (char*)malloc(slen + 1);
memcpy(s, p, slen);
s[slen] = '\0';
p += slen;
file->owned_strings.push_back(s);
arr[i] = s;
}
*out_count = count;
return arr;
}
This function walks the raw mmap’d bytes of the array, extracting each string. It’s called at most once per key (the result isn’t cached, but callers typically only call it once during initialization). For a 128K vocabulary, this allocates 128K small strings. Not the most memory-efficient approach, but it’s simple and the total memory is tiny compared to the gigabytes of tensor data.
The float array helper is simpler – it just does a bulk memcpy:
const float *gguf_get_float_array(GGUFFile file, const char *key,
                                  uint64_t *out_count) {
    const GGUFMetadataKV *kv = gguf_get_metadata(file, key);
    if (!kv || kv->type != GGUF_TYPE_ARRAY ||
        kv->value.u32 != GGUF_TYPE_FLOAT32)
        return nullptr;
    uint64_t count = kv->array_len;
    float *arr = (float*)malloc(count * sizeof(float));
    file->owned_flt_arrays.push_back(arr);
    memcpy(arr, kv->array_data, count * sizeof(float));
    *out_count = count;
    return arr;
}
Float arrays are already in the right format in the mmap’d data (IEEE 754 little-endian), so it’s a straight copy.
Memory Management: Who Owns What?
The GGUFFileImpl tracks all dynamically allocated memory through three vectors:
- owned_strings: All strings parsed from the file (tensor names, metadata keys, string values, string array elements)
- owned_str_arrays: The const char** arrays returned by gguf_get_string_array()
- owned_flt_arrays: The float* arrays returned by gguf_get_float_array()
On close, everything is freed:
void gguf_close(GGUFFile file) {
if (file->mmap_addr != MAP_FAILED) {
munmap(file->mmap_addr, file->mmap_len);
}
if (file->fd >= 0) {
close(file->fd);
}
for (char *s : file->owned_strings) {
free(s);
}
for (const char **a : file->owned_str_arrays) {
free(a);
}
for (float *a : file->owned_flt_arrays) {
free(a);
}
delete file;
}
This is a simple arena-style pattern: allocate freely during parsing, free everything at once on close. No individual deallocation, no reference counting. The parser’s lifetime matches the model’s lifetime, so this works perfectly.
Bounds Checking and Safety
The parser includes several safety checks:
Cursor bounds checking: Every read_* function checks has_bytes() before
reading:
static inline bool has_bytes(const GGUFFileImpl *f, size_t n) {
return (size_t)(f->end - f->cursor) >= n;
}
Maximum dimensions: Tensors can have at most 4 dimensions. More than that triggers an error.
Element count overflow: Dimension multiplication checks for uint64 overflow before proceeding.
Tensor data bounds: gguf_tensor_data() validates that the offset is within
the mmap’d region:
const void *gguf_tensor_data(GGUFFile file, const GGUFTensorInfo *info) {
if (info->offset >= (size_t)(file->end - file->data_base))
return nullptr;
return file->data_base + info->offset;
}
These checks protect against malformed or malicious GGUF files. They’re not exhaustive (there’s no check that tensor data regions don’t overlap, for example), but they catch the most common corruption scenarios.
A Real GGUF File, Byte by Byte
Let’s walk through the first few hundred bytes of a real GGUF file to tie everything together. Say we have a small model with 2 metadata entries and 3 tensors:
Offset Hex Meaning
------ --- -------
0x0000 47 47 55 46 Magic: "GGUF"
0x0004 03 00 00 00 Version: 3
0x0008 03 00 00 00 00 00 00 00 Tensor count: 3
0x0010 02 00 00 00 00 00 00 00 KV count: 2
--- Metadata KV #0 ---
0x0018 14 00 00 00 00 00 00 00 Key length: 20
0x0020 67 65 6E 65 72 61 6C 2E "general."
0x0028 61 72 63 68 69 74 65 63 "architec"
0x0030 74 75 72 65 "ture"
0x0034 08 00 00 00 Type: STRING (8)
0x0038 05 00 00 00 00 00 00 00 String length: 5
0x0040 6C 6C 61 6D 61 "llama"
--- Metadata KV #1 ---
0x0045   11 00 00 00 00 00 00 00    Key length: 17
0x004D 6C 6C 61 6D 61 2E 62 6C "llama.bl"
0x0055 6F 63 6B 5F 63 6F 75 6E "ock_coun"
0x005D 74 "t"
0x005E 04 00 00 00 Type: UINT32 (4)
0x0062 20 00 00 00 Value: 32
--- Tensor info #0 ---
0x0066 11 00 00 00 00 00 00 00 Name length: 17
0x006E 74 6F 6B 65 6E 5F 65 6D "token_em"
0x0076 62 64 2E 77 65 69 67 68 "bd.weigh"
0x007E 74 "t"
0x007F 02 00 00 00 n_dims: 2
0x0083 00 10 00 00 00 00 00 00 dims[0]: 4096
0x008B 00 80 00 00 00 00 00 00 dims[1]: 32768
0x0093 08 00 00 00 dtype: Q8_0 (8)
0x0097 00 00 00 00 00 00 00 00 offset: 0
--- (more tensor info entries...) ---
--- PADDING to 32-byte alignment ---
--- TENSOR DATA ---
Notice how everything is little-endian and tightly packed. There’s no padding between fields within a section – the variable-length strings make it impossible to use fixed offsets. You have to parse sequentially.
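To tie the layout together, here's a self-contained sketch that serializes the 24-byte header into a buffer and parses it back (little-endian host assumed; the struct and helpers are illustrative, not akunu's):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct GGUFHeaderFields {
    uint32_t magic;
    uint32_t version;
    uint64_t tensor_count;
    uint64_t kv_count;
};

// Write the header fields in GGUF's on-disk order (all little-endian).
void write_header(uint8_t *buf, const GGUFHeaderFields &h) {
    std::memcpy(buf + 0,  &h.magic,        4);
    std::memcpy(buf + 4,  &h.version,      4);
    std::memcpy(buf + 8,  &h.tensor_count, 8);
    std::memcpy(buf + 16, &h.kv_count,     8);
}

GGUFHeaderFields read_header(const uint8_t *buf) {
    GGUFHeaderFields h;
    std::memcpy(&h.magic,        buf + 0,  4);
    std::memcpy(&h.version,      buf + 4,  4);
    std::memcpy(&h.tensor_count, buf + 8,  8);
    std::memcpy(&h.kv_count,     buf + 16, 8);
    return h;
}
```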
Performance Characteristics
Let’s think about the performance of this parser:
Opening a file: The gguf_open() function is dominated by the metadata
parse time. For a typical model with ~100 metadata entries and ~200 tensors,
this takes microseconds. The mmap itself is near-instantaneous (it just sets
up page table entries).
First tensor access: The first time you access tensor data via
gguf_tensor_data(), the OS has to page in the data from disk. For an SSD,
this is on the order of microseconds per 4KB page. A 9MB Q4_0 tensor
spans about 2,300 pages, so first access is roughly 1-2ms.
Subsequent tensor accesses: After the data is paged in, access is just a pointer dereference – nanoseconds. The OS page cache keeps recently accessed pages in RAM.
Memory usage: The parser itself uses very little heap memory. The mmap’d region uses virtual address space but only consumes physical RAM for pages that have been accessed. A 4GB model file mapped but unaccessed uses essentially zero RAM.
Hash map lookups: Both metadata_map and tensor_map use
std::unordered_map, giving O(1) average-case lookup. With ~200 tensors,
the hash table overhead is negligible.
The overall design is optimized for the common case: open the file once, access each tensor once during model load, then never touch the parser again until shutdown. The mmap approach means the OS manages the caching, which is hard to beat for this access pattern.
Why Not a JSON Parser? Why Not Protobuf?
It’s worth asking why GGUF exists as a custom binary format instead of using an existing serialization framework. The answer is practical:
- Zero-copy tensor data: Tensor data needs to be passed directly to GPU upload functions. With a binary format and mmap, you get a raw pointer into the file with no deserialization overhead. JSON or protobuf would require parsing into intermediate structures and copying.
- Self-contained: A single file containing everything means no directory management, no missing companion files, no version mismatches between separate metadata and data files.
- Streaming-friendly: The metadata is at the front of the file, so you can read the model config without touching the (much larger) tensor data. This matters for model inspection tools.
- Alignment control: The 32-byte alignment of the data section is critical for GPU efficiency. Generic formats don't give you this level of control.
- No dependencies: The parser is self-contained C/C++ code. No JSON library, no protobuf compiler, no generated code.
The tradeoff is that you need a custom parser, and the format is harder to inspect with generic tools. But for the specific use case of distributing and loading quantized model weights, it’s hard to argue with the result: a simple format with a simple parser that delivers excellent performance.
In the next chapter, we’ll look at the other side of the coin: SafeTensors and the MLX ecosystem, which took a very different approach to the same problem.