Introduction
Welcome to Inside Akunu: Building a High-Performance Inference Engine for Apple Silicon. This is a book about going fast on hardware that most people barely understand. It is also a book about taking apart an inference engine – piece by piece, kernel by kernel, buffer by buffer – until you understand every decision that went into making it one of the fastest ways to run large language models on a Mac.
If you have ever wondered why your MacBook Pro can run a 70-billion-parameter model at conversational speed, or how a Metal compute shader can squeeze every last byte of bandwidth out of Apple’s unified memory, or what exactly happens between the moment you press Enter on a prompt and the moment the first token appears on screen, this book is for you.
What Is Akunu?
Akunu is a high-performance inference engine purpose-built for Apple Silicon. It runs large language models (LLMs), speech-to-text models (Whisper), and other transformer-based architectures entirely on the GPU cores of Apple’s M-series chips. It is written in C++17 (with a C API for FFI compatibility) and Metal Shading Language (MSL), with no dependencies on heavyweight frameworks like PyTorch, CoreML, or even Apple’s own MPS (Metal Performance Shaders) graph library.
The numbers speak for themselves: in decode throughput – the metric that determines how fast tokens appear on screen during generation – akunu averages 1.83x the decode throughput of llama.cpp,1 the most popular open-source inference engine for consumer hardware. This is not a marginal improvement squeezed out of micro-optimizations. It is the result of fundamentally rethinking how to map the inference workload onto Apple Silicon’s unique architecture.
Here is a snapshot of decode performance on an M4 Pro (16 GPU cores, 273 GB/s):
| Model (GGUF) | llama.cpp (tok/s) | akunu (tok/s) | Speedup |
|---|---|---|---|
| Qwen3-0.6B Q3_K_S | 123 | 448 | 3.66x |
| Qwen3-0.6B Q4_0 | 169 | 465 | 2.75x |
| TinyLlama-1.1B Q4_0 | 208 | 343 | 1.65x |
| Llama-3.2-1B Q4_0 | 189 | 294 | 1.55x |
| Qwen3-4B Q4_K_M | 63 | 89 | 1.42x |
| Average (19 models) | — | — | 1.83x |
How does akunu achieve this? Not through any single trick, but through a relentless focus on the hardware. Every kernel is hand-tuned for Apple GPU microarchitecture. Every buffer allocation is designed around unified memory. Every dispatch decision accounts for the chip generation, the System Level Cache size, and the specific GPU core count of the machine it is running on. The entire inference pipeline – from loading model weights off disk to sampling the next token – is a single, carefully orchestrated sequence of Metal compute commands with minimal CPU overhead.
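To make the last point concrete, here is a hypothetical sketch of what a per-chip configuration record might look like, with one dispatch heuristic keyed off it. The names, fields, and numbers are illustrative, not akunu’s real API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical per-chip configuration record (illustrative, not akunu's
// real types): dispatch decisions read hardware facts from data like this
// rather than hard-coding constants.
struct ChipConfig {
    std::string family;        // e.g. "M1", "M4"
    uint32_t    gpu_cores;     // physical GPU core count
    uint32_t    slc_bytes;     // System Level Cache size
    double      bandwidth_gbs; // peak memory bandwidth in GB/s
};

// One example heuristic: launch a small multiple of the core count so every
// GPU core has work without heavy oversubscription.
inline uint32_t threadgroups_for(const ChipConfig& c, uint32_t per_core = 4) {
    return c.gpu_cores * per_core;
}
```

The point of the sketch is the shape of the idea: dispatch parameters are data selected per chip, not compile-time constants.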
Who Is This Book For?
This book is written for people with a computer science background who want to understand how high-performance GPU programming works on Apple Silicon, and specifically how that knowledge is applied to build a state-of-the-art inference engine.
You should be comfortable with:
- C/C++ programming. Akunu’s core is C++17, with a C API for external use. You should know your way around pointers, structs, and manual memory management.
- Basic linear algebra. Matrix multiplication, dot products, transpose operations. You do not need a PhD in numerical methods, but you should know what a matrix-vector product is and why it matters for neural networks.
- General computer architecture. Caches, memory hierarchies, pipelining, parallelism. A systems programming or computer architecture course is sufficient background.
- The basics of how neural networks work. You should know what a transformer is at a high level – attention, feed-forward layers, embeddings, softmax. We will not be training models in this book; we will be running them as fast as physically possible.
You do not need prior experience with:
- Apple’s Metal API or Metal Shading Language
- GPU programming (CUDA, OpenCL, or otherwise)
- Apple Silicon internals
- The specific architectures of LLMs like Llama, Mistral, or Qwen
We will teach you all of that from the ground up.
What You Will Learn
This book is organized into nine parts, each building on the last. By the time you finish, you will have a deep understanding of every layer of the stack, from the silicon to the HTTP API.
Part I: Apple Silicon Fundamentals
Before you can write fast code for Apple Silicon, you need to understand what Apple Silicon actually is. Part I takes you on a tour of the hardware, starting with why Apple moved from Intel to ARM, then drilling into the SoC architecture, the GPU core design, the unified memory system, and the generational improvements from M1 through M4.
By the end of Part I, you will understand:
- Why Apple’s unified memory architecture is a game-changer for ML inference
- How Apple’s GPU cores differ from NVIDIA’s streaming multiprocessors
- What a SIMD group is and why it is the fundamental unit of GPU execution
- How the System Level Cache (SLC) acts as a shared last-level cache for all on-chip agents
- Why memory bandwidth – not compute – is the bottleneck for LLM decode
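The last point above can be made concrete with back-of-envelope arithmetic: during decode, roughly every weight byte must be streamed from memory once per generated token, so peak tokens per second is bounded by bandwidth divided by model size. A sketch (the model size in the test is an illustrative figure, not a benchmark claim):

```cpp
#include <cassert>

// Decode roofline: every weight byte is read roughly once per token, so
// throughput is capped by bandwidth / model size, regardless of FLOPs.
inline double decode_tokens_per_s_bound(double bandwidth_gb_s, double model_gb) {
    return bandwidth_gb_s / model_gb;
}
```

On a 273 GB/s M4 Pro, a roughly 2.5 GB quantized model can never exceed about 109 tok/s no matter how fast the ALUs are, which is why Part I keeps returning to bandwidth.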
Part II: Metal Programming
Metal is Apple’s low-level GPU programming framework, and it is the only way to access the full power of the Apple GPU. Part II teaches you Metal from scratch: how to create compute pipelines, write shaders in the Metal Shading Language, dispatch threadgroups, manage buffers, and use SIMD group matrix operations. This is not a toy tutorial – by the end, you will understand the same programming model that akunu’s kernels use.
By the end of Part II, you will understand:
- How to set up a Metal compute pipeline and dispatch work to the GPU
- The Metal Shading Language (MSL) – a C++14-based language for writing GPU kernels
- How threadgroups, SIMD groups, and threads map to the hardware
- The Metal memory model: device memory, threadgroup memory, and how they interact
- SIMD group matrix operations (simdgroup_matrix) – Apple’s answer to NVIDIA’s Tensor Cores
- Performance optimization patterns: coalesced access, occupancy tuning, avoiding bank conflicts
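One pattern from the list above can be previewed right away: on the host side, translating a problem size into a grid of threadgroups is a ceiling division. A minimal sketch (the 256-thread threadgroup width in the example is illustrative; Part II discusses how to choose it):

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division: how many threadgroups of `tg_size` threads are needed
// to cover `n` work items. The last group masks off out-of-range threads.
inline uint32_t threadgroup_count(uint32_t n, uint32_t tg_size) {
    return (n + tg_size - 1) / tg_size;
}
```

For example, a 4096-row matrix-vector product with 256-thread threadgroups dispatches 16 threadgroups; at 4097 rows it dispatches 17, and the final group handles the ragged edge.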
Part III: Machine Learning on Metal
Part III bridges the gap between GPU programming and machine learning. We cover how tensors are represented in GPU memory, the different strategies for matrix multiplication, the attention mechanism, FlashAttention and how it maps to Metal, and quantization – the art of making models small enough to fit in memory without destroying quality.
Part IV: Akunu Architecture
Now we get to akunu itself. Part IV covers the design philosophy, the build system, the C API, the device abstraction layer, architecture descriptors (which let akunu support new model architectures without code changes), and chip configuration.
Part V: The Inference Pipeline
Part V walks through the complete inference pipeline: model loading, the dispatch table (akunu’s precompiled GPU command sequence), the prefill phase, the decode loop, and the decoding strategies: greedy/chain decode, sampled, speculative, and grammar-constrained.
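As a taste of the simplest of those strategies, greedy decoding reduces to an argmax over each step’s logits. A minimal CPU-side sketch (illustrative only: a real engine computes and samples logits on the GPU):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy decoding in essence: pick the highest-logit token each step.
// `logits` stands in for one decode step's model output.
inline int argmax_token(const std::vector<float>& logits) {
    int best = 0;
    for (std::size_t i = 1; i < logits.size(); ++i)
        if (logits[i] > logits[best]) best = static_cast<int>(i);
    return best;
}
```

Everything else in Part V – sampling temperatures, speculative drafts, grammar masks – can be seen as ways of replacing or constraining this one argmax.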
Part VI: Metal Kernels Deep Dive
This is the heart of the book. Part VI takes you through every Metal kernel: GEMV, GEMM, FlashAttention, normalization, RoPE, embedding and activation, and sampling. For each, we explain the algorithm, the memory access pattern, the SIMD group coordination, and include interactive animations showing GPU execution.
Part VII: Weight Management
Models come in many file formats. Part VII covers the weight provider abstraction, the GGUF file format and akunu’s parser, SafeTensors and MLX format support, and the zoo of quantization formats (Q4_0, Q4_1, Q8_0, K-quants, MLX 3/4/6/8-bit, and more).
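To give a flavor of what these formats do, here is a simplified sketch of block-wise 4-bit quantization in the spirit of Q4_0. It is not the real format: Q4_0 packs two 4-bit values per byte and stores its scale as fp16, while this version keeps everything unpacked for clarity:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Simplified block quantization in the spirit of Q4_0: each block of 32
// floats becomes one scale plus 32 signed 4-bit integers in [-8, 7].
// (Sketch only: the real format packs nibbles and uses an fp16 scale.)
struct Block4 {
    float  scale;
    int8_t q[32]; // each value in [-8, 7]
};

inline Block4 quantize_block(const float (&x)[32]) {
    float amax = 0.0f;
    for (float v : x) amax = std::fmax(amax, std::fabs(v));
    Block4 b;
    b.scale = amax / 7.0f; // map the largest magnitude onto +/-7
    float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < 32; ++i) {
        int q = static_cast<int>(std::lround(x[i] * inv));
        b.q[i] = static_cast<int8_t>(q < -8 ? -8 : (q > 7 ? 7 : q));
    }
    return b;
}

inline float dequantize(const Block4& b, int i) { return b.scale * b.q[i]; }
```

The essential trade is visible even in this toy: 32 floats (128 bytes) shrink to roughly 20 bytes per block, at the cost of rounding every weight to one of 16 levels.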
Part VIII: Supporting Systems
Inference engines need more than just GPU kernels. Part VIII covers the KV cache, scratch buffer architecture, the tokenizer (BPE and SentencePiece), the grammar engine (for structured output), Whisper (speech-to-text), and the HTTP server.
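One of those systems can be previewed with a single line of arithmetic: a KV cache is typically a contiguous per-layer buffer indexed by position, head, and head dimension, so appending one decode step is just an offset computation. A sketch (the layout shown is one common choice, not necessarily akunu’s):

```cpp
#include <cassert>
#include <cstddef>

// Offset (in elements) of the vector for (pos, head) in a KV buffer laid
// out as [max_seq][n_heads][head_dim]. Decode writes one `pos` per step.
inline std::size_t kv_offset(std::size_t pos, std::size_t head,
                             std::size_t n_heads, std::size_t head_dim) {
    return (pos * n_heads + head) * head_dim;
}
```

On unified memory this buffer never needs to be copied anywhere: the same allocation the decode kernel appends to is the one the attention kernel reads next step.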
Part IX: Contributing to Akunu
The final part is a contributor guide: dev setup, testing, adding kernels, adding architectures, profiling, and architectural decision records.
Why This Book Exists
There are many resources for learning CUDA programming on NVIDIA GPUs. There are tutorials for PyTorch, guides for TensorRT, deep dives into NVIDIA’s Tensor Core architecture. The NVIDIA ecosystem is mature, well-documented, and widely understood.
The Apple Silicon ecosystem has… almost none of that.
If you want to understand how Apple’s GPU works at the microarchitectural level, you will find sparse documentation, a handful of WWDC sessions, and a lot of educated guesswork from the reverse-engineering community. If you want to write high-performance compute shaders for Metal, the official guides are thin on practical advice. If you want to understand how to map an LLM inference workload onto Apple Silicon efficiently, you are largely on your own.
This book exists to fill that gap. We have spent hundreds of hours profiling, benchmarking, reverse-engineering, and optimizing akunu’s Metal kernels on Apple Silicon. We have learned things about Apple’s GPU that are not documented anywhere. We want to share that knowledge so that the next person who wants to build something fast on Apple hardware does not have to start from zero.
| NVIDIA Ecosystem | Apple Silicon Ecosystem |
|---|---|
| CUDA Handbook | A few WWDC sessions |
| PTX ISA Guide | No public ISA reference |
| Tensor Core documentation | No GPU microarch docs |
| cuBLAS/cuDNN guides | Metal Best Practices (thin) |
| Hundreds of papers and blog posts | Scattered community reverse engineering |

This book fills that gap.
How to Read This Book
This book is designed to be read sequentially. Each chapter builds on concepts introduced in previous chapters, and later parts assume familiarity with the hardware and programming model covered in earlier parts.
That said, here are some suggested paths depending on your background:
“I just want to understand akunu’s codebase so I can contribute.” Read Part I (skim if you already know Apple Silicon), skim Part II (read Chapter 8 on MSL carefully), then jump to Part IV and read sequentially through Part IX.
“I’m a CUDA programmer and I want to learn Metal.” Start with Part I to understand how Apple’s hardware differs from NVIDIA’s. Then read Part II carefully – it is the Metal equivalent of a CUDA programming guide. Chapter 3 (GPU architecture) and Chapter 9 (threadgroups and dispatch) are especially relevant for mapping your CUDA mental model to Metal.
“I want to understand the ML/inference side.” Skim Part I and II for context, then read Part III carefully. Then jump to Part V (the inference pipeline) and Part VI (the kernels).
“I want to understand everything.” Read the whole book, front to back. That is what it is for.
Conventions Used in This Book
Throughout this book, we use the following conventions:
- Code listings are shown in monospaced font. Metal Shading Language code is annotated with comments explaining non-obvious constructs.
- ASCII diagrams are used extensively. We chose ASCII art over images because it renders correctly in every format (web, PDF, terminal), is easy to modify, and can be included in code review comments and commit messages.
- Performance numbers are given for specific hardware configurations. Unless otherwise noted, benchmarks were run on an M4 Pro (16 GPU cores, 273 GB/s) with macOS 15. Your numbers will differ on different hardware.
- “Apple GPU” refers to the GPU cores on Apple Silicon (M1, M2, M3, M4 families), not the older Intel integrated graphics on pre-2020 Macs.
- Register types in Metal are explained when first used. Metal uses uint, float, half (a 16-bit float), and others. We will explain the implications of each.
- Chip-specific details are called out in notes. When behavior differs between M1 and M4, we will tell you.
A Note on Apple’s Documentation (or Lack Thereof)
Apple is famously secretive about its hardware. The company does not publish detailed microarchitectural specifications for its GPUs, does not release ISA references for its shader cores, and does not provide the kind of performance tuning guides that NVIDIA publishes for CUDA.
This means that some of what we describe in this book – particularly regarding the internal structure of GPU cores, the exact sizes of register files, and the behavior of the instruction pipeline – is based on a combination of:
- Apple’s public documentation2 (Metal Best Practices Guide, WWDC sessions, Metal Feature Set tables)
- Reverse engineering (running carefully constructed microbenchmarks to measure latencies, throughputs, and cache sizes)
- Community knowledge3 (the excellent work of Dougall Johnson, Alyssa Rosenzweig, and others who have reverse-engineered Apple GPU internals)
- Empirical observation (running akunu’s kernels with different configurations and measuring what works best)
Where we are confident in our claims, we state them as facts. Where we are making educated inferences from indirect evidence, we say so. We encourage you to verify our claims through your own experiments – that is part of the fun.
Let’s Begin
Apple Silicon is, in our opinion, the most interesting computing platform to emerge in the last decade. It combines extraordinary hardware – a unified memory architecture, a surprisingly powerful GPU, and a custom interconnect fabric – with a programming model (Metal) that is powerful but underexplored. The gap between what this hardware can do and what most software actually does with it is enormous.
Akunu exists to close that gap for inference workloads. This book exists to show you how.
Turn the page. Let’s talk about the silicon.
“Any sufficiently advanced technology is indistinguishable from magic.” — Arthur C. Clarke
Our goal is to make the magic distinguishable.
1. Benchmarks run on an M4 Pro (16 GPU cores, 273 GB/s) against llama.cpp b8610, 3 reps per config. Decode measured at tg128 (128-token generation). The 1.83x is the average across 19 GGUF models. Speedup is highest on small models (3.66x on Qwen3-0.6B Q3_K_S), where per-token dispatch overhead dominates, and converges to parity on larger models (0.98x on Qwen3-8B Q8_0). Prefill averages 0.91x. See BENCHMARKS.md for the full data and Chapter 55 for methodology.
2. Apple. “Metal Best Practices Guide.” developer.apple.com. The primary official reference for Metal compute optimization. See https://developer.apple.com/library/archive/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/index.html.
3. Johnson, D. “Reverse-engineering Apple GPU cores.” 2022. The most detailed public analysis of Apple GPU internals. See https://dougallj.github.io/applegpu/.