Akunu Overview and Design Philosophy
Welcome to the deep-dive section of this book. Up until now, we have been building intuition for how LLM inference works on Apple Silicon: the Metal compute pipeline, the memory hierarchy, quantized matrix math, and the attention mechanism. Now it is time to see how all of those pieces come together in a real, production-quality inference engine.
Akunu is a high-performance LLM inference engine written specifically for Apple Silicon. The name comes from the Sinhala word meaning “embers” – a fitting metaphor for a project that tries to extract every last bit of heat from the GPU silicon.
In this chapter we will survey the project at a high level: what it does, how it is organized, what design principles drive it, and what the end-to-end inference flow looks like. Subsequent chapters will zoom in on each subsystem.
What Akunu Is (and What It Is Not)
Akunu is a local inference engine. You give it a model file (GGUF or MLX SafeTensors), it loads the weights onto the Apple GPU, and it runs the full transformer forward pass – prefill and decode – entirely on-device. There is no cloud, no server round-trip, no Python runtime.
Here is what it supports today:
| Feature | Details |
|---|---|
| Architectures | LLaMA, Qwen3, Gemma, Gemma 3, BERT, Whisper |
| Weight formats | GGUF, MLX SafeTensors |
| GGUF quant types | F32, F16, BF16, Q4_0, Q4_1, Q5_0, Q5_K, Q6_K, Q8_0, Q2_K, Q3_K, Q4_K |
| MLX quant types | 3-bit, 4-bit, 6-bit, 8-bit (with configurable group size) |
| Tasks | Text generation, chat, embeddings, speech transcription |
| Decoding modes | Greedy, sampled (top-k/top-p/min-p), speculative (n-gram), grammar-constrained |
| API surface | C API (FFI-friendly), CLI tools, OpenAI-compatible HTTP server |
What it is not: it is not a training framework. It is not a general-purpose tensor library. It does not try to be cross-platform (though the architecture makes a future CUDA backend straightforward, as we will see). Every design decision optimizes for one thing: token throughput on Apple Silicon.
Performance: The Numbers
Let us start with the punchline, because performance is the reason this engine exists. All benchmarks were run on an Apple M4 Pro (16 GPU cores, 273 GB/s memory bandwidth):
Decode throughput (tg128, tokens/sec):
vs llama.cpp:
Average speedup: 1.83x
Best case: 3.66x (Qwen3-0.6B-Q3_K_S: 448 vs 123 t/s)
Wins: 20/21 configurations
vs MLX:
Average speedup: 1.17x
Best case: 1.25x (Qwen3-0.6B-bf16: 207 vs 165 t/s)
Wins: 11/11 configurations
These are not cherry-picked numbers. Across 21 GGUF model configurations and 11 MLX configurations, akunu wins decode throughput in 31 out of 32 tests. The speedup is most dramatic on small models (0.6B-1B parameters) with aggressive quantization (Q2_K through Q5_K), where akunu achieves 2-3.5x the throughput of llama.cpp.
Why? Because for small quantized models the matrix multiplications finish so fast during decode that CPU-side overhead, not GPU compute, becomes the bottleneck. Akunu’s precompiled dispatch table and zero-allocation hot path eliminate that overhead. We will see exactly how in the sections below.
The Five Design Principles
Every non-trivial design decision in akunu traces back to one of five principles. Understanding these up front will make the rest of the codebase click.
Principle 1: Data-Driven Design (ArchDescriptor)
The naive way to support multiple architectures looks like this:
```cpp
// DON'T DO THIS
if (arch == "llama") {
    activation = silu_gate;
    rope = rope_interleaved;
} else if (arch == "qwen3") {
    activation = silu_gate;
    rope = rope_neox;
    has_qk_norm = true;
} else if (arch == "gemma") {
    activation = gelu_gate;
    rope = rope_neox;
    has_qk_norm = true;
    embedding_scale = sqrt(dim);
    // ... 20 more fields
}
```
This approach does not scale. Every new architecture touches dozens of files. Every if/else branch is a potential bug.
Akunu takes a different approach: it captures all architecture-specific differences
in a single POD struct called ArchDescriptor. The struct has about 20 fields
covering activation kernels, RoPE style, embedding scaling, normalization, encoder
config, and more. The entire table builder and prefill engine read from this struct
and never branch on the architecture name.
Adding a new architecture means writing one factory function that fills in the struct. That is it. No code changes in the dispatch table builder, the prefill engine, or the decode loop.
+------------------+ +------------------+ +------------------+
| arch_llama() | | arch_qwen3() | | arch_gemma(dim) |
| activation: | | activation: | | activation: |
| silu_gate_f16 | | silu_gate_f16 | | gelu_gate_f16 |
| rope: | | rope: | | rope: |
| interleaved | | neox | | neox |
| qk_norm: false | | qk_norm: true | | qk_norm: true |
| embed_scale: 0 | | tie_embed: true | | embed_scale: |
+--------+---------+ +--------+---------+ | sqrt(dim) |
| | +--------+---------+
| | |
+------------------------+-------------------------+
|
v
+----------------------------+
| build_dispatch_table(...) |
| (reads ArchDescriptor, |
| never branches on arch) |
+----------------------------+
We will cover ArchDescriptor in depth in Chapter 22.
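To make the pattern concrete, here is a minimal sketch of what such a descriptor and its factories might look like. The field and function names below are illustrative stand-ins, not akunu's actual declarations (the real struct has about 20 fields):

```cpp
#include <cmath>

// Illustrative enums for the two axes the factories above vary.
enum class Activation { SiluGate, GeluGate, Gelu };
enum class RopeStyle  { Interleaved, Neox, None };

// Plain POD: no virtuals, no heap, trivially copyable.
struct ArchDescriptor {
    Activation activation;
    RopeStyle  rope;
    bool  has_qk_norm;
    bool  tie_embeddings;
    float embedding_scale;   // 0 means "no scaling"
};

// One factory per architecture; nothing else in the engine
// branches on the architecture name.
ArchDescriptor arch_llama() {
    return {Activation::SiluGate, RopeStyle::Interleaved, false, false, 0.0f};
}

ArchDescriptor arch_gemma(int dim) {
    return {Activation::GeluGate, RopeStyle::Neox, true, true,
            std::sqrt(static_cast<float>(dim))};
}
```

Adding an architecture is then exactly one new factory function; the table builder consumes the struct without caring which factory produced it.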
Principle 2: Precompiled Dispatch (DispatchTable)
This is the big one. In most inference engines, every forward pass involves:
- Looking up which kernel to run
- Resolving buffer pointers
- Computing dispatch geometry (grid size, threadgroup size)
- Encoding the compute command
Akunu does all of this once, at model load time, and stores the result in a
flat array of DispatchCmd structs. The decode hot path simply iterates this array,
patches a couple of per-token fields (position, token offset), and submits the whole
thing to the GPU.
Model load time (once):

    Parse weights
        |
        v
    Resolve kernel names
        |
        v
    Look up Pipeline State Objects
        |
        v
    Compute grid dimensions
        |
        v
    Bind buffers + params
        |
        v
    Store in DispatchCmd[]

Decode time (every token):

    for each token:
        for each cmd in dispatch_table:
            patch position
            patch token offset
            submit to encoder
The DispatchCmd struct itself is a fixed-size POD type with no heap allocations:
DispatchCmd (fixed size, no heap):
+-----------------------------------+
| Pipeline pso | 8 bytes
| Buffer buffers[8] | 8 x 24 bytes
| uint32_t offsets[8] | 32 bytes
| int buffer_count | 4 bytes
| uint8_t param_bytes[64] | 64 bytes (inline kernel params)
| int param_size, param_index | 8 bytes
| Buffer param_buf | 24 bytes
| uint8_t param2_bytes[16] | 16 bytes (secondary params)
| Dim3 grid, threadgroup | 24 bytes
| bool use_dispatch_threads | 1 byte
| PatchType patch_type | 1 byte
| int patch_offset_1, patch_offset_2| 8 bytes
+-----------------------------------+
The entire forward pass for a single token is typically 50-100 commands (embedding + N layers * ~5 commands each + output norm + logit projection + argmax). These are stored contiguously in memory, which is great for the CPU cache.
This is why akunu’s decode is fast: the CPU-side work per token is essentially a
memcpy of a few patched bytes plus a loop of setBuffer/setBytes/dispatch
calls – all inlined, no virtual dispatch, no hash lookups, no string comparisons.
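A self-contained mock of the replay loop makes the shape of that per-token work concrete. The struct here is a stripped-down stand-in for DispatchCmd (only the fields the loop touches), and MockEncoder replaces the real Metal encoder; the names and layout are illustrative, not akunu's actual code:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct Dim3 { uint32_t x, y, z; };
enum class PatchType : uint8_t { None, Position };

// Stripped-down DispatchCmd: just enough to show patch-and-replay.
struct DispatchCmd {
    uint8_t   param_bytes[64];  // inline kernel params, patched per token
    int       param_size;
    PatchType patch_type;
    int       patch_offset_1;   // where in param_bytes the position lives
    Dim3      grid, threadgroup;
};

// Stand-in for the Metal encoder: just counts dispatches.
struct MockEncoder {
    int dispatches = 0;
    void dispatch(const DispatchCmd &) { ++dispatches; }
};

// Replay the precompiled table for one token: patch a few bytes, encode.
// Real code would also do setBuffer/setBytes before each dispatch.
void encode_token(std::vector<DispatchCmd> &table, uint32_t position,
                  MockEncoder &enc) {
    for (DispatchCmd &cmd : table) {
        if (cmd.patch_type == PatchType::Position)
            std::memcpy(cmd.param_bytes + cmd.patch_offset_1, &position,
                        sizeof(position));
        enc.dispatch(cmd);
    }
}
```

Note there is nothing to decide at decode time: no kernel lookup, no geometry computation, just a memcpy of a few bytes and a flat loop.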
Principle 3: Zero-Allocation Hot Path
Once the model is loaded, the decode loop allocates zero bytes of memory. All
buffers are pre-allocated at model init time in a ScratchBuffers struct:
ScratchBuffers (all pre-allocated at model load):
Decode (single token):
h0 [dim] -- embedding output / residual ping
h1 [dim] -- residual pong
residual [dim] -- norm output
qkv [q_dim+2*kv_dim] -- contiguous Q|K|V
attn_out [max(q_dim,dim)]
ffn_gate [ffn_dim]
ffn_up [ffn_dim]
ffn_act [ffn_dim]
logits [vocab_size]
token_ids [max_chain]
Prefill (batch):
batch_h0 [chunk * dim]
batch_q [chunk * q_dim]
batch_k [chunk * kv_dim]
... (same pattern)
The KV cache is also pre-allocated to the maximum context length. The decode loop
never calls malloc, never calls device.allocate, never resizes a vector. This
matters more than you might think – on Apple Silicon, malloc can take microseconds,
and when you are generating 400+ tokens per second, every microsecond counts.
Even the thread-local error buffer is a static char[512]:
```cpp
thread_local char error_buf[512] = {};
```
No std::string, no exceptions, no heap in the hot path.
Principle 4: Virtual Device Interface
The Device base class provides a pure-virtual interface with about 30 methods
covering buffer allocation, kernel loading, command encoding, and synchronization.
Today there is exactly one implementation: MetalDevice, which wraps the Metal
API. But the abstraction exists for a reason.
+------------------+
| Device (base) |
| pure virtual |
+--------+---------+
|
+------------+------------+
| |
+----+-------+ +-----+------+
| MetalDevice| | CudaDevice |
| (ObjC++) | | (future) |
+------------+ +------------+
All backend-specific code lives behind this interface. The core engine – the
dispatch table builder, the prefill encoder, the decode loop – is pure C++ with
no #import, no @autoreleasepool, no id<MTLBuffer>. If someone wanted to port
akunu to CUDA, they would implement CudaDevice and everything else would just
work.
This is not a hypothetical – the clean separation was a deliberate design choice.
We will explore the Device interface in detail in Chapter 21.
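The seam itself is ordinary C++ polymorphism. Here is a hedged sketch with three representative methods (the real interface has about 30, and the method names here are illustrative); HostDevice is a CPU stub standing in for MetalDevice or a future CudaDevice:

```cpp
#include <cstddef>
#include <new>

// Backend-neutral buffer handle.
struct Buffer { void *ptr; std::size_t size; };

// Pure-virtual backend interface: all GPU-specific code lives behind it.
class Device {
public:
    virtual ~Device() = default;
    virtual Buffer allocate(std::size_t bytes) = 0;
    virtual void   free_buffer(Buffer b) = 0;
    virtual void   synchronize() = 0;
};

// CPU stub implementation, analogous in shape to MetalDevice.
class HostDevice : public Device {
public:
    Buffer allocate(std::size_t bytes) override {
        return { ::operator new(bytes), bytes };
    }
    void free_buffer(Buffer b) override { ::operator delete(b.ptr); }
    void synchronize() override {}   // nothing to wait for on the host
};
```

The core engine only ever sees `Device&`, which is what keeps it free of Objective-C types.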
Principle 5: Lazy Weight Loading
GGUF files can be large – a Q4_0 quantized 8B model is about 4.5 GB. Loading all weights into GPU memory at once would waste time on unused tensors and spike memory usage during initialization.
Akunu’s WeightStore uses lazy loading: when you call get_tensor("layers.5.attention.q.weight"),
it checks its internal cache. If the tensor has not been loaded yet, it reads
the raw bytes from the GGUF file (using memory-mapped I/O) and uploads them to a
GPU buffer. Subsequent calls return the cached buffer instantly.
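The caching logic is a straightforward check-then-load. A sketch, with the file read and GPU upload mocked out (names and the `uploads` counter are illustrative, not akunu's real members):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for an uploaded GPU buffer.
struct GpuBuffer { std::vector<uint8_t> bytes; };

class WeightStore {
public:
    int uploads = 0;  // exposed here only so the example can observe laziness

    // First call loads + caches; later calls return the cached buffer.
    const GpuBuffer &get_tensor(const std::string &name) {
        auto it = cache_.find(name);
        if (it != cache_.end()) return it->second;   // cache hit: instant
        GpuBuffer buf;
        buf.bytes = read_from_file(name);            // real code: mmap'd read
        ++uploads;                                   // real code: GPU upload
        return cache_.emplace(name, std::move(buf)).first->second;
    }

private:
    std::vector<uint8_t> read_from_file(const std::string &name) {
        return std::vector<uint8_t>(name.size());    // mock payload
    }
    std::unordered_map<std::string, GpuBuffer> cache_;
};
```

Unordered-map references stay valid across rehashes, so handing out `const GpuBuffer&` from the cache is safe.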
The WeightProvider class unifies this behind a common interface for both GGUF and
MLX SafeTensors formats:
+-------------------+
| WeightProvider |
| (unified facade) |
+--------+----------+
|
+------------+------------+
| |
+------+------+ +------+------+
| WeightStore | | MLXWeightStore|
| (GGUF) | | (SafeTensors) |
+------+------+ +------+--------+
| |
+-------+-------+ +------+--------+
| GGUF mmap'd | | SafeTensors |
| file on disk | | + config.json |
+----------------+ +---------------+
Weight fusion also happens here: for performance, akunu can concatenate the Q, K,
and V projection matrices (or gate + up FFN matrices) into a single contiguous GPU
buffer, allowing one large GEMV instead of two or three small ones. This is
controlled by ChipConfig.should_fuse_weights, which is true on chips with large
enough SLC (System Level Cache).
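At its core, fusion is just a contiguous concatenation of the weight matrices. A simplified float-only sketch (akunu's actual fusion works on quantized blocks and uploads the result to a single GPU buffer):

```cpp
#include <vector>

// Concatenate three row-major weight matrices into one contiguous buffer,
// so a single large GEMV replaces three small ones.
std::vector<float> fuse_qkv(const std::vector<float> &q,
                            const std::vector<float> &k,
                            const std::vector<float> &v) {
    std::vector<float> fused;
    fused.reserve(q.size() + k.size() + v.size());
    fused.insert(fused.end(), q.begin(), q.end());
    fused.insert(fused.end(), k.begin(), k.end());
    fused.insert(fused.end(), v.begin(), v.end());
    return fused;
}
```

Since the output rows of Q, K, and V are independent, stacking them row-wise yields exactly the Q|K|V result the scratch layout above expects.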
Project Structure
The project is about 265 C++ and Objective-C++ source files, plus around 135 Metal shader files. Here is how they are organized:
akunu/
|
+-- include/akunu/ Public C API headers
| +-- akunu.h Opaque handle API (load, generate, encode, etc.)
| +-- types.h POD structs (AkunuModelConfig, AkunuGenerationStats, etc.)
|
+-- src/
| +-- core/ Architecture-agnostic engine core
| | +-- device.h Virtual GPU device interface
| | +-- dispatch_table.h DispatchCmd + DispatchTable + encode_chain()
| | +-- table_builder.h/cpp Builds dispatch table from weights + config
| | +-- arch_descriptor.h ArchDescriptor + factory functions
| | +-- dtype_descriptor.h DTypeDescriptor + kernel lookup tables
| | +-- chip_config.h ChipConfig (hardware tuning)
| | +-- prefill.h/cpp Batched prefill (GEMM-based)
| |
| +-- inference/ High-level inference orchestration
| | +-- model_state.h ModelState struct (the opaque handle's guts)
| | +-- model_loader.cpp Model loading + initialization
| | +-- decode_loop.cpp Top-level generate loop (prefill + decode)
| | +-- decode_greedy.cpp Chain decode (greedy, zero-alloc)
| | +-- decode_sampled.cpp Sampled decode (top-k/p/min-p)
| | +-- decode_speculative.cpp N-gram speculative decode
| | +-- decode_grammar.cpp Grammar-constrained decode
| | +-- sampling.cpp CPU-side sampling (softmax, top-k, top-p)
| | +-- embedding.cpp BERT-style embedding extraction
| |
| +-- cache/ Memory management
| | +-- kv_cache.h Per-layer K/V buffer arrays
| | +-- scratch.h Pre-allocated scratch buffers
| | +-- whisper_buffers.h Whisper-specific encoder buffers
| |
| +-- weight/ Weight file I/O
| | +-- weight_provider.h Unified GGUF/MLX interface
| | +-- weight_store.h/cpp GGUF weight loading + fusion
| | +-- mlx_weight_store.h/cpp MLX SafeTensors loading
| | +-- gguf_parser.h/cpp Low-level GGUF format parsing
| | +-- safetensors_parser.h SafeTensors header parsing
| |
| +-- tokenizer/ BPE tokenizer
| +-- grammar/ Grammar-constrained decoding (GBNF + XGrammar)
| +-- whisper/ Whisper encoder + decoder
| +-- audio/ Mel spectrogram computation
| +-- server/ OpenAI-compatible HTTP server
| +-- speculative/ N-gram draft predictor
| +-- akunu_api.cpp C API implementation (thin wrappers)
|
+-- backend/
| +-- metal/
| +-- metal_device.h/mm MetalDevice implementation (ObjC++)
| +-- metal_device_impl.h AkunuMetalState (ObjC wrapper)
| +-- metal_types.h Metal-specific type aliases
| +-- kernels/ ~135 Metal shader files
| +-- attention/ Flash attention (prefill + decode variants)
| +-- matmul/ GEMV + GEMM for all quant types
| +-- norm/ RMSNorm, LayerNorm, head norms
| +-- rope/ RoPE (interleaved + NeoX + fused variants)
| +-- activation/ SiLU, GELU, gated variants
| +-- embedding/ Token embedding lookup (all dtypes)
| +-- sampling/ GPU-side argmax, top-k, temperature, penalties
| +-- convert/ Dtype conversion (F32<->F16, dequant)
| +-- conv/ Conv1D for Whisper frontend
| +-- fused/ Fused kernels (GEMV+norm, whisper GEMV)
|
+-- tools/ CLI executables
| +-- akunu_chat.cpp Interactive chat
| +-- akunu_bench.cpp llama-bench compatible benchmark
| +-- akunu_complete.cpp Text completion
| +-- akunu_inspect.cpp Model weight inspector
| +-- akunu_profile.cpp Per-layer GPU profiler
| +-- akunu_serve.cpp OpenAI-compatible HTTP server
| +-- akunu_transcribe.cpp Whisper transcription
| +-- akunu_benchmark.cpp Extended benchmarking
|
+-- tests/ Unit + integration tests
+-- 3rdparty/ XGrammar submodule
+-- bindings/ Language bindings (Swift)
+-- CMakeLists.txt Build system
+-- Makefile Top-level build driver
If you count the lines of actual akunu code (excluding 3rdparty), the core engine is roughly 9,750 lines of C++ and Objective-C++, plus about 135 Metal shader files. That is remarkably compact for what it does – a full inference engine supporting 6 architectures, 2 weight formats, 16+ quantization types, grammar-constrained decoding, speculative decoding, Whisper transcription, and an HTTP server.
High-Level Inference Flow
Let us trace what happens when you call akunu_generate() with a prompt. This is the
30,000-foot view; later chapters will zoom in on each step.
akunu_generate(model, prompt_tokens, n_prompt, max_tokens, sampling, callback, ...)
|
v
run_decode_loop(state, ...)
|
+-- 1. PREFILL (batched, GEMM-based)
| |
| | for chunk in prompt_tokens (up to max_prefill_chunk at a time):
| | encode_prefill(device, weights, config, arch, kv_cache, scratch,
| | chunk_tokens, chunk_size, position)
| | |
| | | For each layer:
| | | GEMM: batch_residual @ Q_weight -> batch_q
| | | GEMM: batch_residual @ K_weight -> batch_k
| | | GEMM: batch_residual @ V_weight -> batch_v
| | | RoPE + write to KV cache
| | | Flash attention (prefill variant)
| | | GEMM: attn_out @ O_weight -> batch_h1
| | | Residual add + RMSNorm
| | | GEMM: batch_residual @ gate_weight -> batch_gate
| | | GEMM: batch_residual @ up_weight -> batch_up
| | | Activation (SiLU*gate or GELU*gate)
| | | GEMM: batch_act @ down_weight -> batch_h1
| | | Residual add + RMSNorm (next layer)
| | |
| | v
| | Output norm + logit projection + argmax -> first token
| |
| v
| Return first_token + timing stats
|
+-- 2. DECODE (chain decode, GEMV-based)
|
| Choose decode path:
| - grammar != null -> decode_grammar()
| - temperature == 0 && speculation_enabled -> decode_speculative()
| - temperature == 0 -> decode_greedy()
| - temperature > 0 -> decode_sampled()
|
| decode_greedy (hot path):
| while generated < max_tokens:
| write next_token to token_ids buffer
| device.begin_encoding()
| device.encode_dispatch_table(&dispatch_table, position, chunk_size)
| device.end_encoding_sync() (or async with double buffering)
| kv_cache.advance(chunk_size)
| read token_ids buffer -> output tokens
| for each token: callback(token_id, text, user_data)
|
v
Return AkunuGenerationStats { prefill_time, decode_time, tokens/sec, ... }
A few things to notice:
Prefill uses GEMM, decode uses GEMV. During prefill, we process many tokens at once, so the Q/K/V projections are matrix-matrix multiplications (M > 1). During decode, we process one token at a time, so they are matrix-vector multiplications (M = 1). Akunu has separate optimized kernels for each.
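To make the M = 1 case concrete, here is a naive reference GEMV (illustrative only; akunu's Metal kernels are heavily optimized and quantization-aware). During prefill, the same weight matrix would instead multiply a [chunk x dim] activation matrix:

```cpp
#include <vector>

// y = W x, with W stored row-major as rows x cols.
std::vector<float> gemv(const std::vector<float> &W,
                        const std::vector<float> &x,
                        int rows, int cols) {
    std::vector<float> y(rows, 0.0f);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            y[r] += W[r * cols + c] * x[c];
    return y;
}
```

A GEMM is this same computation repeated for each column of a multi-token activation matrix, which is why prefill and decode want different kernel shapes.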
Chain decode generates multiple tokens per GPU submission. Instead of submitting
one command buffer per token, akunu submits a “chain” of N tokens in a single
begin_encoding() / end_encoding() pair. The dispatch table is replayed N times
with patched position values. This amortizes the Metal command buffer overhead
across many tokens. The chain size is tuned per chip (64-128 tokens, see ChipConfig).
The callback is synchronous. After each GPU chunk completes, tokens are read
back from the token_ids buffer and delivered to the user’s callback one at a time.
The callback can return false to stop generation early.
No Python in the loop. The entire flow – from tokenization through GPU dispatch
through token decoding – is C++. The C API boundary is a thin wrapper in
akunu_api.cpp that casts the opaque void* handle to ModelState* and forwards
the call.
The ModelState: What Lives Behind the Opaque Handle
When you call akunu_load_model(), it returns an akunu_model_t, which is a
void* pointing to a ModelState struct. This is the central state object that
ties everything together:
```cpp
struct ModelState {
    std::unique_ptr<Device> device;   // GPU device (MetalDevice)
    WeightProvider *weights;          // Weight file access
    Tokenizer tokenizer;              // BPE tokenizer
    AkunuModelConfig config;          // Parsed model config
    ArchDescriptor arch;              // Architecture descriptor
    ChipConfig chip;                  // Hardware tuning params
    KVCache kv_cache;                 // Per-layer K/V buffers
    ScratchBuffers scratch;           // Pre-allocated intermediates
    DispatchTable dispatch_table;     // Precompiled decode commands
    NGramPredictor predictor;         // Speculative n-gram predictor
    bool speculation_enabled;         // Whether spec decode is on

    // Whisper-specific fields
    bool is_whisper;
    std::unique_ptr<WhisperBuffers> whisper_buf;
    std::unique_ptr<MelSpectrogram> mel_spec;
    std::unique_ptr<WhisperModel> whisper_model;
    DispatchTable whisper_decode_table;
    // ... beam search buffers
};
```
This is it. One struct, one allocation. The entire engine state fits in a single cache-friendly object. Compare this to inference frameworks that scatter state across dozens of Python objects, each with its own reference counting and garbage collection pressure.
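The C API boundary is then just a cast away. A hedged sketch of the cast-and-forward pattern (`akunu_layer_count` is a made-up example function, and the struct here is a stripped-down stand-in; the real wrappers live in akunu_api.cpp):

```cpp
// Stand-in for the engine state; the real ModelState is far richer.
struct ModelState {
    int n_layers;
    // ... tokenizer, caches, dispatch tables, etc. ...
};

extern "C" {

// Opaque handle exposed to C callers.
typedef void *akunu_model_t;

// Thin wrapper: cast the opaque handle back and forward the call.
int akunu_layer_count(akunu_model_t handle) {
    return static_cast<ModelState *>(handle)->n_layers;
}

} // extern "C"
```

The opaque `void*` keeps the C header free of C++ types while the wrapper pays only the cost of a pointer cast.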
Model Loading: What Happens at Init Time
The akunu_load_model() function is where all the expensive work happens. Here is
the sequence:
akunu_load_model(path, metallib_path, max_context)
|
+-- 1. Create MetalDevice (MTLCreateSystemDefaultDevice)
+-- 2. Load metallib (compiled shader library)
+-- 3. Open weight file (GGUF or MLX SafeTensors)
+-- 4. Parse model config (dims, layers, heads, vocab, etc.)
+-- 5. Select ArchDescriptor (arch_from_config)
+-- 6. Detect ChipConfig (GPU cores, family, SLC estimate)
+-- 7. Handle format-specific quirks:
| - MLX LLaMA: switch to NeoX RoPE
| - Tie embeddings if output.weight missing
| - Set quant_bits / quant_group_size from MLX metadata
+-- 8. Precompute RoPE frequencies (LLaMA 3 wavelen scaling)
+-- 9. Load tokenizer (from GGUF metadata or HF tokenizer.json)
+-- 10. Set context length (capped at model max or user-specified)
+-- 11. Allocate KV cache (n_layers * 2 buffers * max_seq_len)
+-- 12. Allocate ScratchBuffers (all intermediates)
+-- 13. Build DispatchTable (resolves all PSOs, binds buffers)
+-- 14. Warmup pass (compiles remaining Metal pipelines)
+-- 15. Return ModelState* as opaque handle
Steps 1-12 are straightforward initialization. Step 13 is where the magic happens:
build_dispatch_table() walks through the entire forward pass – embedding, norms,
projections, RoPE, attention, FFN – and for each operation, it resolves the kernel
name from DTypeDescriptor, looks up or compiles the Metal pipeline state object,
computes the dispatch geometry, binds the weight and scratch buffers, and stores
everything in a DispatchCmd. By the time this function returns, the engine knows
exactly what to do for each token – no runtime decisions left.
Supported Architectures at a Glance
Let us briefly survey how each supported architecture maps to akunu’s abstractions:
| Arch | Activation | RoPE | QK Norm | Post-Norm | Enc/Dec | Embed Scale |
|---|---|---|---|---|---|---|
| LLaMA | silu_gate | interleaved | no | no | no | 0 |
| Qwen3 | silu_gate | neox | yes | no | no | 0 |
| Gemma | gelu_gate | neox | yes | yes | no | sqrt(d) |
| Gemma 3 | gelu_gate | neox | yes | yes | no | sqrt(d) |
| Whisper | gelu (plain) | none | no | no | yes | 0 |
| BERT | silu_gate | neox | no | no | no | 0 |
All of these differences are captured in the ArchDescriptor struct – no special
code paths. Gemma 3’s sliding window attention with alternating global/local layers?
That is just cfg.sliding_window_pattern > 0 in the RoPE theta selection. Whisper’s
Conv1D frontend, cross-attention, and sinusoidal positional embeddings? Those are
flag fields in ArchDescriptor plus a separate WhisperModel loading path.
The key insight is that most transformer architectures are minor variations on the same theme. They all have embedding, norm, QKV projection, attention, output projection, FFN, and output logits. The differences are in which activation function, which RoPE variant, whether there is an extra normalization step, and so on. Akunu exploits this regularity by parameterizing the differences rather than branching on them.
What Makes This Different from Other Engines
If you have used llama.cpp, MLX, or vLLM, you might wonder what akunu does differently. Here is a quick comparison:
vs. llama.cpp: llama.cpp builds a computation graph (ggml) at runtime and evaluates it node-by-node. Each node involves a virtual dispatch to find the right kernel, plus buffer management. Akunu eliminates this overhead by precompiling the entire forward pass into a flat command array. llama.cpp is more general (it runs on CPU, CUDA, Metal, Vulkan, etc.), but akunu squeezes more performance out of Metal specifically.
vs. MLX: MLX is a general-purpose array framework (like PyTorch) that happens to run on Metal. It has a JIT compiler, automatic differentiation, and a Python frontend. This generality comes at a cost: each operation goes through MLX’s dispatch layer, which involves hash lookups and potentially JIT compilation. Akunu bypasses all of this – it talks directly to Metal with precompiled pipelines.
vs. vLLM: vLLM targets datacenter GPU inference with features like PagedAttention, continuous batching, and multi-GPU tensor parallelism. Akunu targets single-device Apple Silicon with features like chain decode, SLC-aware weight fusion, and Metal-specific kernel optimization. Different tools for different jobs.
The common thread is specialization. Akunu does fewer things, but does them very well on one specific hardware platform.
A Note on Code Style
Before we dive deeper in the following chapters, a word about the codebase style. Akunu is written in C++17 with a strong preference for:
- POD structs over class hierarchies (DispatchCmd, KVCache, ScratchBuffers)
- Fixed-size inline storage over heap allocation (param_bytes[64], buffers[8])
- Factory functions over constructors (KVCache::create, ScratchBuffers::create)
- Explicit state over hidden globals (ModelState holds everything)
- One virtual class (Device) instead of a deep hierarchy
- Thread-local for truly per-thread state (error buffer, RNG)
The Metal backend uses Objective-C++ (.mm files) because it has to – Metal is an
Objective-C API. But this is strictly quarantined behind the Device interface. The
rest of the engine is pure C++ that compiles with any standard compiler.
Error handling is C-style: functions return null/false on failure and set a
thread-local error string via set_error(). No exceptions in the hot path. No
RAII wrappers around GPU resources (the ModelState destructor handles cleanup).
This style is not “modern C++” in the Herb Sutter sense, but it is effective for systems programming where you care about memory layout, cache behavior, and predictable performance.
Summary
Akunu is a tightly-focused inference engine that trades generality for performance on Apple Silicon. Its key design decisions are:
- ArchDescriptor – all architecture differences as data, not branches
- DispatchTable – precompiled GPU command sequences, replayed per token
- Zero-allocation decode – all buffers pre-allocated, nothing on the hot path
- Virtual Device – clean Metal abstraction, ready for future backends
- Lazy weight loading – tensors loaded on demand, with fusion support
The result is an engine that achieves 1.8x average speedup over llama.cpp on decode and 1.17x over MLX, in about 10,000 lines of C++ plus 135 Metal shaders.
In the next chapter, we will see how to build and run akunu from source.