Introduction
Welcome to Inside Akunu: Building a High-Performance Inference Engine for Apple Silicon. This is a book about going fast on hardware that most people barely understand. It is also a book about taking apart an inference engine – piece by piece, kernel by kernel, buffer by buffer – until you understand every decision that went into making it one of the fastest ways to run large language models on a Mac.
If you have ever wondered why your MacBook Pro can run a 70-billion-parameter model at conversational speed, or how a Metal compute shader can squeeze every last byte of bandwidth out of Apple’s unified memory, or what exactly happens between the moment you press Enter on a prompt and the moment the first token appears on screen, this book is for you.
What Is Akunu?
Akunu is a high-performance inference engine purpose-built for Apple Silicon. It runs large language models (LLMs), speech-to-text models (Whisper), and other transformer-based architectures entirely on the GPU cores of Apple’s M-series chips. It is written in C++17 (with a C API for FFI compatibility) and Metal Shading Language (MSL), with no dependencies on heavyweight frameworks like PyTorch, CoreML, or even Apple’s own MPS (Metal Performance Shaders) graph library.
The numbers speak for themselves: in decode throughput – the metric that determines how fast tokens appear on screen during generation – akunu averages 1.83x the decode throughput of llama.cpp,1 the most popular open-source inference engine for consumer hardware. This is not a marginal improvement squeezed out of micro-optimizations. It is the result of fundamentally rethinking how to map the inference workload onto Apple Silicon’s unique architecture.
Here is a snapshot of decode performance on an M4 Pro (16 GPU cores, 273 GB/s):
| Model (GGUF) | llama.cpp (tok/s) | akunu (tok/s) | Speedup |
|---|---|---|---|
| Qwen3-0.6B Q3_K_S | 123 | 448 | 3.66x |
| Qwen3-0.6B Q4_0 | 169 | 465 | 2.75x |
| TinyLlama-1.1B Q4_0 | 208 | 343 | 1.65x |
| Llama-3.2-1B Q4_0 | 189 | 294 | 1.55x |
| Qwen3-4B Q4_K_M | 63 | 89 | 1.42x |
| Average (19 models) | — | — | 1.83x |
How does akunu achieve this? Not through any single trick, but through a relentless focus on the hardware. Every kernel is hand-tuned for Apple GPU microarchitecture. Every buffer allocation is designed around unified memory. Every dispatch decision accounts for the chip generation, the System Level Cache size, and the specific GPU core count of the machine it is running on. The entire inference pipeline – from loading model weights off disk to sampling the next token – is a single, carefully orchestrated sequence of Metal compute commands with minimal CPU overhead.
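To make the last point concrete, here is a hypothetical sketch of what a per-chip configuration record might look like, with one dispatch heuristic keyed off it. The names, fields, and numbers are illustrative, not akunu’s real API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical per-chip configuration record (illustrative, not akunu's
// real types): dispatch decisions read hardware facts from data like this
// rather than hard-coding constants.
struct ChipConfig {
    std::string family;        // e.g. "M1", "M4"
    uint32_t    gpu_cores;     // physical GPU core count
    uint32_t    slc_bytes;     // System Level Cache size
    double      bandwidth_gbs; // peak memory bandwidth in GB/s
};

// One example heuristic: launch a small multiple of the core count so every
// GPU core has work without heavy oversubscription.
inline uint32_t threadgroups_for(const ChipConfig& c, uint32_t per_core = 4) {
    return c.gpu_cores * per_core;
}
```

The point of the sketch is the shape of the idea: dispatch parameters are data selected per chip, not compile-time constants.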
Who Is This Book For?
This book is written for people with a computer science background who want to understand how high-performance GPU programming works on Apple Silicon, and specifically how that knowledge is applied to build a state-of-the-art inference engine.
You should be comfortable with:
- C/C++ programming. Akunu’s core is C++17, with a C API for external use. You should know your way around pointers, structs, and manual memory management.
- Basic linear algebra. Matrix multiplication, dot products, transpose operations. You do not need a PhD in numerical methods, but you should know what a matrix-vector product is and why it matters for neural networks.
- General computer architecture. Caches, memory hierarchies, pipelining, parallelism. A systems programming or computer architecture course is sufficient background.
- The basics of how neural networks work. You should know what a transformer is at a high level – attention, feed-forward layers, embeddings, softmax. We will not be training models in this book; we will be running them as fast as physically possible.
You do not need prior experience with:
- Apple’s Metal API or Metal Shading Language
- GPU programming (CUDA, OpenCL, or otherwise)
- Apple Silicon internals
- The specific architectures of LLMs like Llama, Mistral, or Qwen
We will teach you all of that from the ground up.
What You Will Learn
This book is organized into nine parts, each building on the last. By the time you finish, you will have a deep understanding of every layer of the stack, from the silicon to the HTTP API.
Part I: Apple Silicon Fundamentals
Before you can write fast code for Apple Silicon, you need to understand what Apple Silicon actually is. Part I takes you on a tour of the hardware, starting with why Apple moved from Intel to ARM, then drilling into the SoC architecture, the GPU core design, the unified memory system, and the generational improvements from M1 through M4.
By the end of Part I, you will understand:
- Why Apple’s unified memory architecture is a game-changer for ML inference
- How Apple’s GPU cores differ from NVIDIA’s streaming multiprocessors
- What a SIMD group is and why it is the fundamental unit of GPU execution
- How the System Level Cache (SLC) acts as a shared last-level cache for all on-chip agents
- Why memory bandwidth – not compute – is the bottleneck for LLM decode
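The last point above can be made concrete with back-of-envelope arithmetic: during decode, roughly every weight byte must be streamed from memory once per generated token, so peak tokens per second is bounded by bandwidth divided by model size. A sketch (the model size in the test is an illustrative figure, not a benchmark claim):

```cpp
#include <cassert>

// Decode roofline: every weight byte is read roughly once per token, so
// throughput is capped by bandwidth / model size, regardless of FLOPs.
inline double decode_tokens_per_s_bound(double bandwidth_gb_s, double model_gb) {
    return bandwidth_gb_s / model_gb;
}
```

On a 273 GB/s M4 Pro, a roughly 2.5 GB quantized model can never exceed about 109 tok/s no matter how fast the ALUs are, which is why Part I keeps returning to bandwidth.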
Part II: Metal Programming
Metal is Apple’s low-level GPU programming framework, and it is the only way to access the full power of the Apple GPU. Part II teaches you Metal from scratch: how to create compute pipelines, write shaders in the Metal Shading Language, dispatch threadgroups, manage buffers, and use SIMD group matrix operations. This is not a toy tutorial – by the end, you will understand the same programming model that akunu’s kernels use.
By the end of Part II, you will understand:
- How to set up a Metal compute pipeline and dispatch work to the GPU
- The Metal Shading Language (MSL) – a C++14-based language for writing GPU kernels
- How threadgroups, SIMD groups, and threads map to the hardware
- The Metal memory model: device memory, threadgroup memory, and how they interact
- SIMD group matrix operations (simdgroup_matrix) – Apple’s answer to NVIDIA’s Tensor Cores
- Performance optimization patterns: coalesced access, occupancy tuning, avoiding bank conflicts
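One pattern from the list above can be previewed right away: on the host side, translating a problem size into a grid of threadgroups is a ceiling division. A minimal sketch (the 256-thread threadgroup width in the example is illustrative; Part II discusses how to choose it):

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division: how many threadgroups of `tg_size` threads are needed
// to cover `n` work items. The last group masks off out-of-range threads.
inline uint32_t threadgroup_count(uint32_t n, uint32_t tg_size) {
    return (n + tg_size - 1) / tg_size;
}
```

For example, a 4096-row matrix-vector product with 256-thread threadgroups dispatches 16 threadgroups; at 4097 rows it dispatches 17, and the final group handles the ragged edge.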
Part III: Machine Learning on Metal
Part III bridges the gap between GPU programming and machine learning. We cover how tensors are represented in GPU memory, the different strategies for matrix multiplication, the attention mechanism, FlashAttention and how it maps to Metal, and quantization – the art of making models small enough to fit in memory without destroying quality.
Part IV: Akunu Architecture
Now we get to akunu itself. Part IV covers the design philosophy, the build system, the C API, the device abstraction layer, architecture descriptors (which let akunu support new model architectures without code changes), and chip configuration.
Part V: The Inference Pipeline
Part V walks through the complete inference pipeline: model loading, the dispatch table (akunu’s precompiled GPU command sequence), the prefill phase, the decode loop, and the decoding strategies: greedy/chain decode, sampled, speculative, and grammar-constrained.
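As a taste of the simplest of those strategies, greedy decoding reduces to an argmax over each step’s logits. A minimal CPU-side sketch (illustrative only: a real engine computes and samples logits on the GPU):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy decoding in essence: pick the highest-logit token each step.
// `logits` stands in for one decode step's model output.
inline int argmax_token(const std::vector<float>& logits) {
    int best = 0;
    for (std::size_t i = 1; i < logits.size(); ++i)
        if (logits[i] > logits[best]) best = static_cast<int>(i);
    return best;
}
```

Everything else in Part V – sampling temperatures, speculative drafts, grammar masks – can be seen as ways of replacing or constraining this one argmax.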
Part VI: Metal Kernels Deep Dive
This is the heart of the book. Part VI takes you through every Metal kernel: GEMV, GEMM, FlashAttention, normalization, RoPE, embedding and activation, and sampling. For each, we explain the algorithm, the memory access pattern, the SIMD group coordination, and include interactive animations showing GPU execution.
Part VII: Weight Management
Models come in many file formats. Part VII covers the weight provider abstraction, the GGUF file format and akunu’s parser, SafeTensors and MLX format support, and the zoo of quantization formats (Q4_0, Q4_1, Q8_0, K-quants, MLX 3/4/6/8-bit, and more).
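To give a flavor of what these formats do, here is a simplified sketch of block-wise 4-bit quantization in the spirit of Q4_0. It is not the real format: Q4_0 packs two 4-bit values per byte and stores its scale as fp16, while this version keeps everything unpacked for clarity:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Simplified block quantization in the spirit of Q4_0: each block of 32
// floats becomes one scale plus 32 signed 4-bit integers in [-8, 7].
// (Sketch only: the real format packs nibbles and uses an fp16 scale.)
struct Block4 {
    float  scale;
    int8_t q[32]; // each value in [-8, 7]
};

inline Block4 quantize_block(const float (&x)[32]) {
    float amax = 0.0f;
    for (float v : x) amax = std::fmax(amax, std::fabs(v));
    Block4 b;
    b.scale = amax / 7.0f; // map the largest magnitude onto +/-7
    float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < 32; ++i) {
        int q = static_cast<int>(std::lround(x[i] * inv));
        b.q[i] = static_cast<int8_t>(q < -8 ? -8 : (q > 7 ? 7 : q));
    }
    return b;
}

inline float dequantize(const Block4& b, int i) { return b.scale * b.q[i]; }
```

The essential trade is visible even in this toy: 32 floats (128 bytes) shrink to roughly 20 bytes per block, at the cost of rounding every weight to one of 16 levels.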
Part VIII: Supporting Systems
Inference engines need more than just GPU kernels. Part VIII covers the KV cache, scratch buffer architecture, the tokenizer (BPE and SentencePiece), the grammar engine (for structured output), Whisper (speech-to-text), and the HTTP server.
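One of those systems can be previewed with a single line of arithmetic: a KV cache is typically a contiguous per-layer buffer indexed by position, head, and head dimension, so appending one decode step is just an offset computation. A sketch (the layout shown is one common choice, not necessarily akunu’s):

```cpp
#include <cassert>
#include <cstddef>

// Offset (in elements) of the vector for (pos, head) in a KV buffer laid
// out as [max_seq][n_heads][head_dim]. Decode writes one `pos` per step.
inline std::size_t kv_offset(std::size_t pos, std::size_t head,
                             std::size_t n_heads, std::size_t head_dim) {
    return (pos * n_heads + head) * head_dim;
}
```

On unified memory this buffer never needs to be copied anywhere: the same allocation the decode kernel appends to is the one the attention kernel reads next step.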
Part IX: Contributing to Akunu
The final part is a contributor guide: dev setup, testing, adding kernels, adding architectures, profiling, and architectural decision records.
Why This Book Exists
There are many resources for learning CUDA programming on NVIDIA GPUs. There are tutorials for PyTorch, guides for TensorRT, deep dives into NVIDIA’s Tensor Core architecture. The NVIDIA ecosystem is mature, well-documented, and widely understood.
The Apple Silicon ecosystem has… almost none of that.
If you want to understand how Apple’s GPU works at the microarchitectural level, you will find sparse documentation, a handful of WWDC sessions, and a lot of educated guesswork from the reverse-engineering community. If you want to write high-performance compute shaders for Metal, the official guides are thin on practical advice. If you want to understand how to map an LLM inference workload onto Apple Silicon efficiently, you are largely on your own.
This book exists to fill that gap. We have spent hundreds of hours profiling, benchmarking, reverse-engineering, and optimizing akunu’s Metal kernels on Apple Silicon. We have learned things about Apple’s GPU that are not documented anywhere. We want to share that knowledge so that the next person who wants to build something fast on Apple hardware does not have to start from zero.
| NVIDIA Ecosystem | Apple Silicon Ecosystem |
|---|---|
| CUDA Handbook | A few WWDC sessions |
| PTX ISA Guide | No public ISA reference |
| Tensor Core documentation | No GPU microarch docs |
| cuBLAS/cuDNN guides | Metal Best Practices (thin) |
| Hundreds of papers and blog posts | Scattered community reverse engineering |

This book fills that gap.
How to Read This Book
This book is designed to be read sequentially. Each chapter builds on concepts introduced in previous chapters, and later parts assume familiarity with the hardware and programming model covered in earlier parts.
That said, here are some suggested paths depending on your background:
“I just want to understand akunu’s codebase so I can contribute.” Read Part I (skim if you already know Apple Silicon), skim Part II (read Chapter 8 on MSL carefully), then jump to Part IV and read sequentially through Part IX.
“I’m a CUDA programmer and I want to learn Metal.” Start with Part I to understand how Apple’s hardware differs from NVIDIA’s. Then read Part II carefully – it is the Metal equivalent of a CUDA programming guide. Chapter 3 (GPU architecture) and Chapter 9 (threadgroups and dispatch) are especially relevant for mapping your CUDA mental model to Metal.
“I want to understand the ML/inference side.” Skim Part I and II for context, then read Part III carefully. Then jump to Part V (the inference pipeline) and Part VI (the kernels).
“I want to understand everything.” Read the whole book, front to back. That is what it is for.
Conventions Used in This Book
Throughout this book, we use the following conventions:
- Code listings are shown in monospaced font. Metal Shading Language code is annotated with comments explaining non-obvious constructs.
- ASCII diagrams are used extensively. We chose ASCII art over images because it renders correctly in every format (web, PDF, terminal), is easy to modify, and can be included in code review comments and commit messages.
- Performance numbers are given for specific hardware configurations. Unless otherwise noted, benchmarks were run on an M4 Pro (16 GPU cores, 273 GB/s) with macOS 15. Your numbers will differ on different hardware.
- “Apple GPU” refers to the GPU cores on Apple Silicon (M1, M2, M3, M4 families), not the older Intel integrated graphics on pre-2020 Macs.
- Register types in Metal are explained when first used. Metal uses uint, float, half (a 16-bit float), and others. We will explain the implications of each.
- Chip-specific details are called out in notes. When behavior differs between M1 and M4, we will tell you.
A Note on Apple’s Documentation (or Lack Thereof)
Apple is famously secretive about its hardware. The company does not publish detailed microarchitectural specifications for its GPUs, does not release ISA references for its shader cores, and does not provide the kind of performance tuning guides that NVIDIA publishes for CUDA.
This means that some of what we describe in this book – particularly regarding the internal structure of GPU cores, the exact sizes of register files, and the behavior of the instruction pipeline – is based on a combination of:
- Apple’s public documentation2 (Metal Best Practices Guide, WWDC sessions, Metal Feature Set tables)
- Reverse engineering (running carefully constructed microbenchmarks to measure latencies, throughputs, and cache sizes)
- Community knowledge3 (the excellent work of Dougall Johnson, Alyssa Rosenzweig, and others who have reverse-engineered Apple GPU internals)
- Empirical observation (running akunu’s kernels with different configurations and measuring what works best)
Where we are confident in our claims, we state them as facts. Where we are making educated inferences from indirect evidence, we say so. We encourage you to verify our claims through your own experiments – that is part of the fun.
Let’s Begin
Apple Silicon is, in our opinion, the most interesting computing platform to emerge in the last decade. It combines extraordinary hardware – a unified memory architecture, a surprisingly powerful GPU, and a custom interconnect fabric – with a programming model (Metal) that is powerful but underexplored. The gap between what this hardware can do and what most software actually does with it is enormous.
Akunu exists to close that gap for inference workloads. This book exists to show you how.
Turn the page. Let’s talk about the silicon.
“Any sufficiently advanced technology is indistinguishable from magic.” — Arthur C. Clarke
Our goal is to make the magic distinguishable.
1. Benchmarks run on an M4 Pro (16 GPU cores, 273 GB/s) against llama.cpp b8610, 3 reps per config. Decode measured at tg128 (128-token generation). The 1.83x is the average across 19 GGUF models. Speedup is highest on small models (3.66x on Qwen3-0.6B Q3_K_S), where per-token dispatch overhead dominates, and converges to parity on larger models (0.98x on Qwen3-8B Q8_0). Prefill averages 0.91x. See BENCHMARKS.md for the full data and Chapter 55 for methodology.
2. Apple. “Metal Best Practices Guide.” developer.apple.com. The primary official reference for Metal compute optimization. See https://developer.apple.com/library/archive/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/index.html.
3. Johnson, D. “Reverse-engineering Apple GPU cores.” 2022. The most detailed public analysis of Apple GPU internals. See https://dougallj.github.io/applegpu/.