Akunu Overview and Design Philosophy
Welcome to the deep-dive section of this book. Up until now, we have been building intuition for how LLM inference works on Apple Silicon: the Metal compute pipeline, the memory hierarchy, quantized matrix math, and the attention mechanism. Now it is time to see how all of those pieces come together in a real, production-quality inference engine.
Akunu is a high-performance LLM inference engine written specifically for Apple Silicon. The name comes from the Sinhala word meaning “embers” – a fitting metaphor for a project that tries to extract every last bit of heat from the GPU silicon.
In this chapter we will survey the project at a high level: what it does, how it is organized, what design principles drive it, and what the end-to-end inference flow looks like. Subsequent chapters will zoom in on each subsystem.
What Akunu Is (and What It Is Not)
Akunu is a local inference engine. You give it a model file (GGUF or MLX SafeTensors), it loads the weights onto the Apple GPU, and it runs the full transformer forward pass – prefill and decode – entirely on-device. There is no cloud, no server round-trip, no Python runtime.
Here is what it supports today:
| Feature | Details |
|---|---|
| Architectures | LLaMA, Qwen3, Gemma, Gemma 3, BERT, Whisper |
| Weight formats | GGUF, MLX SafeTensors |
| GGUF quant types | F32, F16, BF16, Q4_0, Q4_1, Q5_0, Q5_K, Q6_K, Q8_0, Q2_K, Q3_K, Q4_K |
| MLX quant types | 3-bit, 4-bit, 6-bit, 8-bit (with configurable group size) |
| Tasks | Text generation, chat, embeddings, speech transcription |
| Decoding modes | Greedy, sampled (top-k/top-p/min-p), speculative (n-gram), grammar-constrained |
| API surface | C API (FFI-friendly), CLI tools, OpenAI-compatible HTTP server |
What it is not: it is not a training framework. It is not a general-purpose tensor library. It does not try to be cross-platform (though the architecture makes a future CUDA backend straightforward, as we will see). Every design decision optimizes for one thing: token throughput on Apple Silicon.
Performance: The Numbers
Let us start with the punchline, because performance is the reason this engine exists. All benchmarks were run on an Apple M4 Pro (16 GPU cores, 273 GB/s memory bandwidth):
Decode throughput (tg128, tokens/sec):
vs llama.cpp:
Average speedup: 1.83x
Best case: 3.66x (Qwen3-0.6B-Q3_K_S: 448 vs 123 t/s)
Wins: 20/21 configurations
vs MLX:
Average speedup: 1.17x
Best case: 1.25x (Qwen3-0.6B-bf16: 207 vs 165 t/s)
Wins: 11/11 configurations
These are not cherry-picked numbers. Across 21 GGUF model configurations and 11 MLX configurations, akunu wins decode throughput in 31 out of 32 tests. The speedup is most dramatic on small models (0.6B-1B parameters) with aggressive quantization (Q2_K through Q5_K), where akunu achieves 2-3.5x the throughput of llama.cpp.
Why? Because for small quantized models the matrix multiplications finish so fast during decode that CPU-side overhead, not GPU compute, becomes the bottleneck. Akunu’s precompiled dispatch table and zero-allocation hot path eliminate that overhead. We will see exactly how in the sections below.
The Five Design Principles
Every non-trivial design decision in akunu traces back to one of five principles. Understanding these up front will make the rest of the codebase click.
Principle 1: Data-Driven Design (ArchDescriptor)
The naive way to support multiple architectures looks like this:
```cpp
// DON'T DO THIS
if (arch == "llama") {
    activation = silu_gate;
    rope = rope_interleaved;
} else if (arch == "qwen3") {
    activation = silu_gate;
    rope = rope_neox;
    has_qk_norm = true;
} else if (arch == "gemma") {
    activation = gelu_gate;
    rope = rope_neox;
    has_qk_norm = true;
    embedding_scale = sqrt(dim);
    // ... 20 more fields
}
```
This approach does not scale. Every new architecture touches dozens of files. Every if/else branch is a potential bug.
Akunu takes a different approach: it captures all architecture-specific differences
in a single POD struct called ArchDescriptor. The struct has about 20 fields
covering activation kernels, RoPE style, embedding scaling, normalization, encoder
config, and more. The entire table builder and prefill engine read from this struct
and never branch on the architecture name.
Adding a new architecture means writing one factory function that fills in the struct. That is it. No code changes in the dispatch table builder, the prefill engine, or the decode loop.
+------------------+ +------------------+ +------------------+
| arch_llama() | | arch_qwen3() | | arch_gemma(dim) |
| activation: | | activation: | | activation: |
| silu_gate_f16 | | silu_gate_f16 | | gelu_gate_f16 |
| rope: | | rope: | | rope: |
| interleaved | | neox | | neox |
| qk_norm: false | | qk_norm: true | | qk_norm: true |
| embed_scale: 0 | | tie_embed: true | | embed_scale: |
+--------+---------+ +--------+---------+ | sqrt(dim) |
| | +--------+---------+
| | |
+------------------------+-------------------------+
|
v
+----------------------------+
| build_dispatch_table(...) |
| (reads ArchDescriptor, |
| never branches on arch) |
+----------------------------+
We will cover ArchDescriptor in depth in Chapter 22.
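To make the pattern concrete, here is a minimal sketch of what such a descriptor and its factories might look like. The field and function names below are illustrative stand-ins, not akunu's actual declarations (the real struct has about 20 fields):

```cpp
#include <cmath>

// Illustrative enums for the two axes the factories above vary.
enum class Activation { SiluGate, GeluGate, Gelu };
enum class RopeStyle  { Interleaved, Neox, None };

// Plain POD: no virtuals, no heap, trivially copyable.
struct ArchDescriptor {
    Activation activation;
    RopeStyle  rope;
    bool  has_qk_norm;
    bool  tie_embeddings;
    float embedding_scale;   // 0 means "no scaling"
};

// One factory per architecture; nothing else in the engine
// branches on the architecture name.
ArchDescriptor arch_llama() {
    return {Activation::SiluGate, RopeStyle::Interleaved, false, false, 0.0f};
}

ArchDescriptor arch_gemma(int dim) {
    return {Activation::GeluGate, RopeStyle::Neox, true, true,
            std::sqrt(static_cast<float>(dim))};
}
```

Adding an architecture is then exactly one new factory function; the table builder consumes the struct without caring which factory produced it.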
Principle 2: Precompiled Dispatch (DispatchTable)
This is the big one. In most inference engines, every forward pass involves:
- Looking up which kernel to run
- Resolving buffer pointers
- Computing dispatch geometry (grid size, threadgroup size)
- Encoding the compute command
Akunu does all of this once, at model load time, and stores the result in a
flat array of DispatchCmd structs. The decode hot path simply iterates this array,
patches a couple of per-token fields (position, token offset), and submits the whole
thing to the GPU.
Model load time (once):

    Parse weights
        |
        v
    Resolve kernel names
        |
        v
    Look up Pipeline State Objects
        |
        v
    Compute grid dimensions
        |
        v
    Bind buffers + params
        |
        v
    Store in DispatchCmd[]

Decode time (every token):

    for each token:
        for each cmd in dispatch_table:
            patch position
            patch token offset
            submit to encoder
The DispatchCmd struct itself is a fixed-size POD type with no heap allocations:
DispatchCmd (fixed size, no heap):
+-----------------------------------+
| Pipeline pso | 8 bytes
| Buffer buffers[8] | 8 x 24 bytes
| uint32_t offsets[8] | 32 bytes
| int buffer_count | 4 bytes
| uint8_t param_bytes[64] | 64 bytes (inline kernel params)
| int param_size, param_index | 8 bytes
| Buffer param_buf | 24 bytes
| uint8_t param2_bytes[16] | 16 bytes (secondary params)
| Dim3 grid, threadgroup | 24 bytes
| bool use_dispatch_threads | 1 byte
| PatchType patch_type | 1 byte
| int patch_offset_1, patch_offset_2| 8 bytes
+-----------------------------------+
The entire forward pass for a single token is typically 50-100 commands (embedding + N layers * ~5 commands each + output norm + logit projection + argmax). These are stored contiguously in memory, which is great for the CPU cache.
This is why akunu’s decode is fast: the CPU-side work per token is essentially a
memcpy of a few patched bytes plus a loop of setBuffer/setBytes/dispatch
calls – all inlined, no virtual dispatch, no hash lookups, no string comparisons.
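A self-contained mock of the replay loop makes the shape of that per-token work concrete. The struct here is a stripped-down stand-in for DispatchCmd (only the fields the loop touches), and MockEncoder replaces the real Metal encoder; the names and layout are illustrative, not akunu's actual code:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct Dim3 { uint32_t x, y, z; };
enum class PatchType : uint8_t { None, Position };

// Stripped-down DispatchCmd: just enough to show patch-and-replay.
struct DispatchCmd {
    uint8_t   param_bytes[64];  // inline kernel params, patched per token
    int       param_size;
    PatchType patch_type;
    int       patch_offset_1;   // where in param_bytes the position lives
    Dim3      grid, threadgroup;
};

// Stand-in for the Metal encoder: just counts dispatches.
struct MockEncoder {
    int dispatches = 0;
    void dispatch(const DispatchCmd &) { ++dispatches; }
};

// Replay the precompiled table for one token: patch a few bytes, encode.
// Real code would also do setBuffer/setBytes before each dispatch.
void encode_token(std::vector<DispatchCmd> &table, uint32_t position,
                  MockEncoder &enc) {
    for (DispatchCmd &cmd : table) {
        if (cmd.patch_type == PatchType::Position)
            std::memcpy(cmd.param_bytes + cmd.patch_offset_1, &position,
                        sizeof(position));
        enc.dispatch(cmd);
    }
}
```

Note there is nothing to decide at decode time: no kernel lookup, no geometry computation, just a memcpy of a few bytes and a flat loop.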
Principle 3: Zero-Allocation Hot Path
Once the model is loaded, the decode loop allocates zero bytes of memory. All
buffers are pre-allocated at model init time in a ScratchBuffers struct:
ScratchBuffers (all pre-allocated at model load):
Decode (single token):
h0 [dim] -- embedding output / residual ping
h1 [dim] -- residual pong
residual [dim] -- norm output
qkv [q_dim+2*kv_dim] -- contiguous Q|K|V
attn_out [max(q_dim,dim)]
ffn_gate [ffn_dim]
ffn_up [ffn_dim]
ffn_act [ffn_dim]
logits [vocab_size]
token_ids [max_chain]
Prefill (batch):
batch_h0 [chunk * dim]
batch_q [chunk * q_dim]
batch_k [chunk * kv_dim]
... (same pattern)
The KV cache is also pre-allocated to the maximum context length. The decode loop
never calls malloc, never calls device.allocate, never resizes a vector. This
matters more than you might think – on Apple Silicon, malloc can take microseconds,
and when you are generating 400+ tokens per second, every microsecond counts.
Even the thread-local error buffer is a static char[512]:
```cpp
thread_local char error_buf[512] = {};
```
No std::string, no exceptions, no heap in the hot path.
Principle 4: Virtual Device Interface
The Device base class provides a pure-virtual interface with about 30 methods
covering buffer allocation, kernel loading, command encoding, and synchronization.
Today there is exactly one implementation: MetalDevice, which wraps the Metal
API. But the abstraction exists for a reason.
+------------------+
| Device (base) |
| pure virtual |
+--------+---------+
|
+------------+------------+
| |
+----+-------+ +-----+------+
| MetalDevice| | CudaDevice |
| (ObjC++) | | (future) |
+------------+ +------------+
All backend-specific code lives behind this interface. The core engine – the
dispatch table builder, the prefill encoder, the decode loop – is pure C++ with
no #import, no @autoreleasepool, no id<MTLBuffer>. If someone wanted to port
akunu to CUDA, they would implement CudaDevice and everything else would just
work.
This is not a hypothetical – the clean separation was a deliberate design choice.
We will explore the Device interface in detail in Chapter 21.
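The seam itself is ordinary C++ polymorphism. Here is a hedged sketch with three representative methods (the real interface has about 30, and the method names here are illustrative); HostDevice is a CPU stub standing in for MetalDevice or a future CudaDevice:

```cpp
#include <cstddef>
#include <new>

// Backend-neutral buffer handle.
struct Buffer { void *ptr; std::size_t size; };

// Pure-virtual backend interface: all GPU-specific code lives behind it.
class Device {
public:
    virtual ~Device() = default;
    virtual Buffer allocate(std::size_t bytes) = 0;
    virtual void   free_buffer(Buffer b) = 0;
    virtual void   synchronize() = 0;
};

// CPU stub implementation, analogous in shape to MetalDevice.
class HostDevice : public Device {
public:
    Buffer allocate(std::size_t bytes) override {
        return { ::operator new(bytes), bytes };
    }
    void free_buffer(Buffer b) override { ::operator delete(b.ptr); }
    void synchronize() override {}   // nothing to wait for on the host
};
```

The core engine only ever sees `Device&`, which is what keeps it free of Objective-C types.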
Principle 5: Lazy Weight Loading
GGUF files can be large – a Q4_0 quantized 8B model is about 4.5 GB. Loading all weights into GPU memory at once would waste time on unused tensors and spike memory usage during initialization.
Akunu’s WeightStore uses lazy loading: when you call get_tensor("layers.5.attention.q.weight"),
it checks its internal cache. If the tensor has not been loaded yet, it reads
the raw bytes from the GGUF file (using memory-mapped I/O) and uploads them to a
GPU buffer. Subsequent calls return the cached buffer instantly.
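The caching logic is a straightforward check-then-load. A sketch, with the file read and GPU upload mocked out (names and the `uploads` counter are illustrative, not akunu's real members):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for an uploaded GPU buffer.
struct GpuBuffer { std::vector<uint8_t> bytes; };

class WeightStore {
public:
    int uploads = 0;  // exposed here only so the example can observe laziness

    // First call loads + caches; later calls return the cached buffer.
    const GpuBuffer &get_tensor(const std::string &name) {
        auto it = cache_.find(name);
        if (it != cache_.end()) return it->second;   // cache hit: instant
        GpuBuffer buf;
        buf.bytes = read_from_file(name);            // real code: mmap'd read
        ++uploads;                                   // real code: GPU upload
        return cache_.emplace(name, std::move(buf)).first->second;
    }

private:
    std::vector<uint8_t> read_from_file(const std::string &name) {
        return std::vector<uint8_t>(name.size());    // mock payload
    }
    std::unordered_map<std::string, GpuBuffer> cache_;
};
```

Unordered-map references stay valid across rehashes, so handing out `const GpuBuffer&` from the cache is safe.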
The WeightProvider class unifies this behind a common interface for both GGUF and
MLX SafeTensors formats:
+-------------------+
| WeightProvider |
| (unified facade) |
+--------+----------+
|
+------------+------------+
| |
+------+------+ +------+------+
| WeightStore | | MLXWeightStore|
| (GGUF) | | (SafeTensors) |
+------+------+ +------+--------+
| |
+-------+-------+ +------+--------+
| GGUF mmap'd | | SafeTensors |
| file on disk | | + config.json |
+----------------+ +---------------+
Weight fusion also happens here: for performance, akunu can concatenate the Q, K,
and V projection matrices (or gate + up FFN matrices) into a single contiguous GPU
buffer, allowing one large GEMV instead of two or three small ones. This is
controlled by ChipConfig.should_fuse_weights, which is true on chips with large
enough SLC (System Level Cache).
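At its core, fusion is just a contiguous concatenation of the weight matrices. A simplified float-only sketch (akunu's actual fusion works on quantized blocks and uploads the result to a single GPU buffer):

```cpp
#include <vector>

// Concatenate three row-major weight matrices into one contiguous buffer,
// so a single large GEMV replaces three small ones.
std::vector<float> fuse_qkv(const std::vector<float> &q,
                            const std::vector<float> &k,
                            const std::vector<float> &v) {
    std::vector<float> fused;
    fused.reserve(q.size() + k.size() + v.size());
    fused.insert(fused.end(), q.begin(), q.end());
    fused.insert(fused.end(), k.begin(), k.end());
    fused.insert(fused.end(), v.begin(), v.end());
    return fused;
}
```

Since the output rows of Q, K, and V are independent, stacking them row-wise yields exactly the Q|K|V result the scratch layout above expects.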
Project Structure
The project is about 265 C++ and Objective-C++ source files, plus around 135 Metal shader files. Here is how they are organized:
akunu/
|
+-- include/akunu/ Public C API headers
| +-- akunu.h Opaque handle API (load, generate, encode, etc.)
| +-- types.h POD structs (AkunuModelConfig, AkunuGenerationStats, etc.)
|
+-- src/
| +-- core/ Architecture-agnostic engine core
| | +-- device.h Virtual GPU device interface
| | +-- dispatch_table.h DispatchCmd + DispatchTable + encode_chain()
| | +-- table_builder.h/cpp Builds dispatch table from weights + config
| | +-- arch_descriptor.h ArchDescriptor + factory functions
| | +-- dtype_descriptor.h DTypeDescriptor + kernel lookup tables
| | +-- chip_config.h ChipConfig (hardware tuning)
| | +-- prefill.h/cpp Batched prefill (GEMM-based)
| |
| +-- inference/ High-level inference orchestration
| | +-- model_state.h ModelState struct (the opaque handle's guts)
| | +-- model_loader.cpp Model loading + initialization
| | +-- decode_loop.cpp Top-level generate loop (prefill + decode)
| | +-- decode_greedy.cpp Chain decode (greedy, zero-alloc)
| | +-- decode_sampled.cpp Sampled decode (top-k/p/min-p)
| | +-- decode_speculative.cpp N-gram speculative decode
| | +-- decode_grammar.cpp Grammar-constrained decode
| | +-- sampling.cpp CPU-side sampling (softmax, top-k, top-p)
| | +-- embedding.cpp BERT-style embedding extraction
| |
| +-- cache/ Memory management
| | +-- kv_cache.h Per-layer K/V buffer arrays
| | +-- scratch.h Pre-allocated scratch buffers
| | +-- whisper_buffers.h Whisper-specific encoder buffers
| |
| +-- weight/ Weight file I/O
| | +-- weight_provider.h Unified GGUF/MLX interface
| | +-- weight_store.h/cpp GGUF weight loading + fusion
| | +-- mlx_weight_store.h/cpp MLX SafeTensors loading
| | +-- gguf_parser.h/cpp Low-level GGUF format parsing
| | +-- safetensors_parser.h SafeTensors header parsing
| |
| +-- tokenizer/ BPE tokenizer
| +-- grammar/ Grammar-constrained decoding (GBNF + XGrammar)
| +-- whisper/ Whisper encoder + decoder
| +-- audio/ Mel spectrogram computation
| +-- server/ OpenAI-compatible HTTP server
| +-- speculative/ N-gram draft predictor
| +-- akunu_api.cpp C API implementation (thin wrappers)
|
+-- backend/
| +-- metal/
| +-- metal_device.h/mm MetalDevice implementation (ObjC++)
| +-- metal_device_impl.h AkunuMetalState (ObjC wrapper)
| +-- metal_types.h Metal-specific type aliases
| +-- kernels/ ~135 Metal shader files
| +-- attention/ Flash attention (prefill + decode variants)
| +-- matmul/ GEMV + GEMM for all quant types
| +-- norm/ RMSNorm, LayerNorm, head norms
| +-- rope/ RoPE (interleaved + NeoX + fused variants)
| +-- activation/ SiLU, GELU, gated variants
| +-- embedding/ Token embedding lookup (all dtypes)
| +-- sampling/ GPU-side argmax, top-k, temperature, penalties
| +-- convert/ Dtype conversion (F32<->F16, dequant)
| +-- conv/ Conv1D for Whisper frontend
| +-- fused/ Fused kernels (GEMV+norm, whisper GEMV)
|
+-- tools/ CLI executables
| +-- akunu_chat.cpp Interactive chat
| +-- akunu_bench.cpp llama-bench compatible benchmark
| +-- akunu_complete.cpp Text completion
| +-- akunu_inspect.cpp Model weight inspector
| +-- akunu_profile.cpp Per-layer GPU profiler
| +-- akunu_serve.cpp OpenAI-compatible HTTP server
| +-- akunu_transcribe.cpp Whisper transcription
| +-- akunu_benchmark.cpp Extended benchmarking
|
+-- tests/ Unit + integration tests
+-- 3rdparty/ XGrammar submodule
+-- bindings/ Language bindings (Swift)
+-- CMakeLists.txt Build system
+-- Makefile Top-level build driver
If you count the lines of actual akunu code (excluding 3rdparty), the core engine is roughly 9,750 lines of C++ and Objective-C++, plus about 135 Metal shader files. That is remarkably compact for what it does – a full inference engine supporting 6 architectures, 2 weight formats, 16+ quantization types, grammar-constrained decoding, speculative decoding, Whisper transcription, and an HTTP server.
High-Level Inference Flow
Let us trace what happens when you call akunu_generate() with a prompt. This is the
30,000-foot view; later chapters will zoom in on each step.
akunu_generate(model, prompt_tokens, n_prompt, max_tokens, sampling, callback, ...)
|
v
run_decode_loop(state, ...)
|
+-- 1. PREFILL (batched, GEMM-based)
| |
| | for chunk in prompt_tokens (up to max_prefill_chunk at a time):
| | encode_prefill(device, weights, config, arch, kv_cache, scratch,
| | chunk_tokens, chunk_size, position)
| | |
| | | For each layer:
| | | GEMM: batch_residual @ Q_weight -> batch_q
| | | GEMM: batch_residual @ K_weight -> batch_k
| | | GEMM: batch_residual @ V_weight -> batch_v
| | | RoPE + write to KV cache
| | | Flash attention (prefill variant)
| | | GEMM: attn_out @ O_weight -> batch_h1
| | | Residual add + RMSNorm
| | | GEMM: batch_residual @ gate_weight -> batch_gate
| | | GEMM: batch_residual @ up_weight -> batch_up
| | | Activation (SiLU*gate or GELU*gate)
| | | GEMM: batch_act @ down_weight -> batch_h1
| | | Residual add + RMSNorm (next layer)
| | |
| | v
| | Output norm + logit projection + argmax -> first token
| |
| v
| Return first_token + timing stats
|
+-- 2. DECODE (chain decode, GEMV-based)
|
| Choose decode path:
| - grammar != null -> decode_grammar()
| - temperature == 0 && speculation_enabled -> decode_speculative()
| - temperature == 0 -> decode_greedy()
| - temperature > 0 -> decode_sampled()
|
| decode_greedy (hot path):
| while generated < max_tokens:
| write next_token to token_ids buffer
| device.begin_encoding()
| device.encode_dispatch_table(&dispatch_table, position, chunk_size)
| device.end_encoding_sync() (or async with double buffering)
| kv_cache.advance(chunk_size)
| read token_ids buffer -> output tokens
| for each token: callback(token_id, text, user_data)
|
v
Return AkunuGenerationStats { prefill_time, decode_time, tokens/sec, ... }
A few things to notice:
Prefill uses GEMM, decode uses GEMV. During prefill, we process many tokens at once, so the Q/K/V projections are matrix-matrix multiplications (M > 1). During decode, we process one token at a time, so they are matrix-vector multiplications (M = 1). Akunu has separate optimized kernels for each.
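To make the M = 1 case concrete, here is a naive reference GEMV (illustrative only; akunu's Metal kernels are heavily optimized and quantization-aware). During prefill, the same weight matrix would instead multiply a [chunk x dim] activation matrix:

```cpp
#include <vector>

// y = W x, with W stored row-major as rows x cols.
std::vector<float> gemv(const std::vector<float> &W,
                        const std::vector<float> &x,
                        int rows, int cols) {
    std::vector<float> y(rows, 0.0f);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            y[r] += W[r * cols + c] * x[c];
    return y;
}
```

A GEMM is this same computation repeated for each column of a multi-token activation matrix, which is why prefill and decode want different kernel shapes.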
Chain decode generates multiple tokens per GPU submission. Instead of submitting
one command buffer per token, akunu submits a “chain” of N tokens in a single
begin_encoding() / end_encoding() pair. The dispatch table is replayed N times
with patched position values. This amortizes the Metal command buffer overhead
across many tokens. The chain size is tuned per chip (64-128 tokens, see ChipConfig).
The callback is synchronous. After each GPU chunk completes, tokens are read
back from the token_ids buffer and delivered to the user’s callback one at a time.
The callback can return false to stop generation early.
No Python in the loop. The entire flow – from tokenization through GPU dispatch
through token decoding – is C++. The C API boundary is a thin wrapper in
akunu_api.cpp that casts the opaque void* handle to ModelState* and forwards
the call.
The ModelState: What Lives Behind the Opaque Handle
When you call akunu_load_model(), it returns an akunu_model_t, which is a
void* pointing to a ModelState struct. This is the central state object that
ties everything together:
```cpp
struct ModelState {
    std::unique_ptr<Device> device;   // GPU device (MetalDevice)
    WeightProvider *weights;          // Weight file access
    Tokenizer tokenizer;              // BPE tokenizer
    AkunuModelConfig config;          // Parsed model config
    ArchDescriptor arch;              // Architecture descriptor
    ChipConfig chip;                  // Hardware tuning params
    KVCache kv_cache;                 // Per-layer K/V buffers
    ScratchBuffers scratch;           // Pre-allocated intermediates
    DispatchTable dispatch_table;     // Precompiled decode commands
    NGramPredictor predictor;         // Speculative n-gram predictor
    bool speculation_enabled;         // Whether spec decode is on

    // Whisper-specific fields
    bool is_whisper;
    std::unique_ptr<WhisperBuffers> whisper_buf;
    std::unique_ptr<MelSpectrogram> mel_spec;
    std::unique_ptr<WhisperModel> whisper_model;
    DispatchTable whisper_decode_table;
    // ... beam search buffers
};
```
This is it. One struct, one allocation. The entire engine state fits in a single cache-friendly object. Compare this to inference frameworks that scatter state across dozens of Python objects, each with its own reference counting and garbage collection pressure.
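The C API boundary is then just a cast away. A hedged sketch of the cast-and-forward pattern (`akunu_layer_count` is a made-up example function, and the struct here is a stripped-down stand-in; the real wrappers live in akunu_api.cpp):

```cpp
// Stand-in for the engine state; the real ModelState is far richer.
struct ModelState {
    int n_layers;
    // ... tokenizer, caches, dispatch tables, etc. ...
};

extern "C" {

// Opaque handle exposed to C callers.
typedef void *akunu_model_t;

// Thin wrapper: cast the opaque handle back and forward the call.
int akunu_layer_count(akunu_model_t handle) {
    return static_cast<ModelState *>(handle)->n_layers;
}

} // extern "C"
```

The opaque `void*` keeps the C header free of C++ types while the wrapper pays only the cost of a pointer cast.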
Model Loading: What Happens at Init Time
The akunu_load_model() function is where all the expensive work happens. Here is
the sequence:
akunu_load_model(path, metallib_path, max_context)
|
+-- 1. Create MetalDevice (MTLCreateSystemDefaultDevice)
+-- 2. Load metallib (compiled shader library)
+-- 3. Open weight file (GGUF or MLX SafeTensors)
+-- 4. Parse model config (dims, layers, heads, vocab, etc.)
+-- 5. Select ArchDescriptor (arch_from_config)
+-- 6. Detect ChipConfig (GPU cores, family, SLC estimate)
+-- 7. Handle format-specific quirks:
| - MLX LLaMA: switch to NeoX RoPE
| - Tie embeddings if output.weight missing
| - Set quant_bits / quant_group_size from MLX metadata
+-- 8. Precompute RoPE frequencies (LLaMA 3 wavelen scaling)
+-- 9. Load tokenizer (from GGUF metadata or HF tokenizer.json)
+-- 10. Set context length (capped at model max or user-specified)
+-- 11. Allocate KV cache (n_layers * 2 buffers * max_seq_len)
+-- 12. Allocate ScratchBuffers (all intermediates)
+-- 13. Build DispatchTable (resolves all PSOs, binds buffers)
+-- 14. Warmup pass (compiles remaining Metal pipelines)
+-- 15. Return ModelState* as opaque handle
Steps 1-12 are straightforward initialization. Step 13 is where the magic happens:
build_dispatch_table() walks through the entire forward pass – embedding, norms,
projections, RoPE, attention, FFN – and for each operation, it resolves the kernel
name from DTypeDescriptor, looks up or compiles the Metal pipeline state object,
computes the dispatch geometry, binds the weight and scratch buffers, and stores
everything in a DispatchCmd. By the time this function returns, the engine knows
exactly what to do for each token – no runtime decisions left.
Supported Architectures at a Glance
Let us briefly survey how each supported architecture maps to akunu’s abstractions:
| Arch | Activation | RoPE | QK Norm | Post-Norm | Enc/Dec | Embed Scale |
|---|---|---|---|---|---|---|
| LLaMA | silu_gate | interleaved | no | no | no | 0 |
| Qwen3 | silu_gate | neox | yes | no | no | 0 |
| Gemma | gelu_gate | neox | yes | yes | no | sqrt(d) |
| Gemma 3 | gelu_gate | neox | yes | yes | no | sqrt(d) |
| Whisper | gelu (plain) | none | no | no | yes | 0 |
| BERT | silu_gate | neox | no | no | no | 0 |
All of these differences are captured in the ArchDescriptor struct – no special
code paths. Gemma 3’s sliding window attention with alternating global/local layers?
That is just cfg.sliding_window_pattern > 0 in the RoPE theta selection. Whisper’s
Conv1D frontend, cross-attention, and sinusoidal positional embeddings? Those are
flag fields in ArchDescriptor plus a separate WhisperModel loading path.
The key insight is that most transformer architectures are minor variations on the same theme. They all have embedding, norm, QKV projection, attention, output projection, FFN, and output logits. The differences are in which activation function, which RoPE variant, whether there is an extra normalization step, and so on. Akunu exploits this regularity by parameterizing the differences rather than branching on them.
What Makes This Different from Other Engines
If you have used llama.cpp, MLX, or vLLM, you might wonder what akunu does differently. Here is a quick comparison:
vs. llama.cpp: llama.cpp builds a computation graph (ggml) at runtime and evaluates it node-by-node. Each node involves a virtual dispatch to find the right kernel, plus buffer management. Akunu eliminates this overhead by precompiling the entire forward pass into a flat command array. llama.cpp is more general (it runs on CPU, CUDA, Metal, Vulkan, etc.), but akunu squeezes more performance out of Metal specifically.
vs. MLX: MLX is a general-purpose array framework (like PyTorch) that happens to run on Metal. It has a JIT compiler, automatic differentiation, and a Python frontend. This generality comes at a cost: each operation goes through MLX’s dispatch layer, which involves hash lookups and potentially JIT compilation. Akunu bypasses all of this – it talks directly to Metal with precompiled pipelines.
vs. vLLM: vLLM targets datacenter GPU inference with features like PagedAttention, continuous batching, and multi-GPU tensor parallelism. Akunu targets single-device Apple Silicon with features like chain decode, SLC-aware weight fusion, and Metal-specific kernel optimization. Different tools for different jobs.
The common thread is specialization. Akunu does fewer things, but does them very well on one specific hardware platform.
A Note on Code Style
Before we dive deeper in the following chapters, a word about the codebase style. Akunu is written in C++17 with a strong preference for:
- POD structs over class hierarchies (DispatchCmd, KVCache, ScratchBuffers)
- Fixed-size inline storage over heap allocation (param_bytes[64], buffers[8])
- Factory functions over constructors (KVCache::create, ScratchBuffers::create)
- Explicit state over hidden globals (ModelState holds everything)
- One virtual class (Device) instead of a deep hierarchy
- Thread-local for truly per-thread state (error buffer, RNG)
The Metal backend uses Objective-C++ (.mm files) because it has to – Metal is an
Objective-C API. But this is strictly quarantined behind the Device interface. The
rest of the engine is pure C++ that compiles with any standard compiler.
Error handling is C-style: functions return null/false on failure and set a
thread-local error string via set_error(). No exceptions in the hot path. No
RAII wrappers around GPU resources (the ModelState destructor handles cleanup).
This style is not “modern C++” in the Herb Sutter sense, but it is effective for systems programming where you care about memory layout, cache behavior, and predictable performance.
Summary
Akunu is a tightly-focused inference engine that trades generality for performance on Apple Silicon. Its key design decisions are:
- ArchDescriptor – all architecture differences as data, not branches
- DispatchTable – precompiled GPU command sequences, replayed per token
- Zero-allocation decode – all buffers pre-allocated, nothing on the hot path
- Virtual Device – clean Metal abstraction, ready for future backends
- Lazy weight loading – tensors loaded on demand, with fusion support
The result is an engine that achieves 1.8x average speedup over llama.cpp on decode and 1.17x over MLX, in about 10,000 lines of C++ plus 135 Metal shaders.
In the next chapter, we will see how to build and run akunu from source.