Development Environment Setup
Welcome to Part IX of this book, where we transition from understanding how akunu works to actually contributing to its codebase. If you have read this far, you already know more about Apple Silicon LLM inference than most people who write about it. Now it is time to get your hands dirty.
This chapter walks you through every step of setting up a development environment for akunu on macOS. We will cover hardware prerequisites, toolchain installation, cloning and building the project, IDE configuration, and debugging Metal shaders. By the end, you will have a running build that passes all tests and a workflow that lets you iterate quickly on both CPU and GPU code.
Hardware Prerequisites
Akunu targets Apple Silicon exclusively. You need a Mac with an M-series chip:
Supported Hardware
==================
+------------------+------------+-----------+--------------+
| Chip             | GPU Family | GPU Cores | Memory BW    |
+------------------+------------+-----------+--------------+
| M1               | Apple 7    | 7-8       | 68 GB/s      |
| M1 Pro/Max/Ultra | Apple 7    | 14-64     | 200-800 GB/s |
| M2               | Apple 8    | 8-10      | 100 GB/s     |
| M2 Pro/Max/Ultra | Apple 8    | 16-76     | 200-800 GB/s |
| M3               | Apple 9    | 8-10      | 100 GB/s     |
| M3 Pro/Max/Ultra | Apple 9    | 14-80     | 150-800 GB/s |
| M4               | Apple 9    | 8-10      | 120 GB/s     |
| M4 Pro/Max       | Apple 9    | 16-40     | 273-546 GB/s |
+------------------+------------+-----------+--------------+
Minimum: M1 with 8 GB RAM (small models only)
Recommended: M2 Pro+ with 16+ GB RAM
Ideal: M4 Pro/Max with 36+ GB RAM
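The GPU family number matters because it determines which hardware features are available: Apple 7 covers M1, Apple 8 covers M2, and Apple 9 covers M3 and M4. The Metal Shading Language version is a separate axis, set by the OS and Xcode toolchain rather than by the chip: Metal 3.0 shipped with macOS 13, Metal 3.1 with macOS 14, and Metal 3.2 with macOS 15. A kernel can therefore rely on a newer language feature, such as the SIMD-group and BF16 improvements in the later standards, only when both the toolchain and the GPU support it.
Akunu detects the Metal language standard at build time by probing the Metal
compiler, and the GPU family at runtime through the MTLDevice.supportsFamily
API. You do not need to configure anything: the build system picks the highest
Metal standard your toolchain accepts.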
Software Prerequisites
macOS Version
You need macOS 14 (Sonoma) or later. macOS 15 (Sequoia) is recommended because it ships with Metal 3.2 support and improved GPU debugging tools. You can check your version:
sw_vers --productVersion
# 15.4.1
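A setup script could enforce this minimum rather than leaving it to the reader. The following is a minimal sketch, assuming the version arrives as a dotted string like the one sw_vers prints; parse_major_version and meets_minimum_macos are hypothetical helper names, not part of akunu.

```cpp
#include <string>

// Hypothetical helper: returns the leading integer of a dotted version
// string such as "15.4.1", or -1 if the string does not start with a digit.
inline int parse_major_version(const std::string& version) {
    if (version.empty() || version[0] < '0' || version[0] > '9') return -1;
    int major = 0;
    for (char c : version) {
        if (c < '0' || c > '9') break;  // stop at the first '.'
        major = major * 10 + (c - '0');
    }
    return major;
}

// macOS 14 (Sonoma) is the minimum stated in this chapter.
inline bool meets_minimum_macos(const std::string& version, int minimum_major = 14) {
    return parse_major_version(version) >= minimum_major;
}
```

Comparing only the major component is enough here because the requirement is expressed as a whole release (14 or later).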
Xcode
Xcode 15 or later is required. Xcode 16+ is recommended because it includes:
- Metal 3.2 compiler with the -fmetal-math-fp32-functions=fast optimization
- Improved GPU profiler with per-kernel timing
- Better Metal shader debugging and validation layers
Install Xcode from the App Store, then make sure the command-line tools are selected:
# Check current Xcode version
xcodebuild -version
# Xcode 16.3
# Build version ...
# Ensure command-line tools point to full Xcode (not standalone CLT)
sudo xcode-select -s /Applications/Xcode.app/Contents/Developer
# Verify the Metal compiler is available
xcrun -sdk macosx metal --version
# Apple metal version 32023.155 (metalfe-32023.155)
The Metal compiler (metal) and linker (metallib) are part of Xcode, not
separate installs. If xcrun metal fails, your Xcode installation is
incomplete.
CMake
Akunu uses CMake 3.20+ as its build system for the C++ engine. The Metal shaders have their own Makefile-based build, but CMake orchestrates the C++ compilation and test executables.
# Install via Homebrew (recommended)
brew install cmake
# Verify version
cmake --version
# cmake version 3.31.6
Optional but Recommended
# Ninja: faster parallel builds than make
brew install ninja
# ccache: caches compilation results for faster rebuilds
brew install ccache
Cloning the Repository
Akunu uses git submodules for its third-party dependencies (currently just
XGrammar for grammar-constrained decoding). Always clone with --recursive:
git clone --recursive https://github.com/prabod/akunu.git
cd akunu
If you already cloned without --recursive, initialize the submodules now:
git submodule update --init --recursive
This pulls XGrammar (v0.1.33) into 3rdparty/xgrammar/. Without it, the
build still succeeds but grammar-constrained decoding is disabled.
Let us look at the directory structure:
akunu/
+-- CMakeLists.txt                  # C++ build system
+-- Makefile                        # Metal shader build + convenience targets
+-- VERSION                         # Version file
+-- include/
|   +-- akunu/
|       +-- akunu.h                 # Public C API
|       +-- types.h                 # Public type definitions
+-- src/
|   +-- core/                       # Engine core: dispatch, descriptors, config
|   +-- inference/                  # Decode paths, sampling, model loading
|   +-- tokenizer/                  # BPE tokenizer
|   +-- grammar/                    # GBNF grammar, JSON schema
|   +-- weight/                     # GGUF parser, MLX SafeTensors, weight store
|   +-- whisper/                    # Whisper transcription engine
|   +-- akunu_api.cpp               # C API implementation
+-- backend/
|   +-- metal/
|       +-- metal_device.mm         # Metal GPU backend (Objective-C++)
|       +-- kernels/
|           +-- ShaderTypes.h       # Shared CPU/GPU param structs
|           +-- KernelCommon.h      # Shared GPU utilities
|           +-- MetalKernels.c      # Kernel registration
|           +-- metal/kernel/       # .metal shader source files
|               +-- activation/     # silu.metal, gelu.metal
|               +-- attention/      # flash_attention*.metal, softmax.metal
|               +-- common/         # residual_add.metal, transpose.metal
|               +-- convert/        # dequant_*.metal, f16<->f32
|               +-- embedding/      # embedding_lookup_*.metal
|               +-- fused/          # gemv_q8_0_head_rmsnorm.metal
|               +-- kv_cache/       # kv_cache_write.metal, shift.metal
|               +-- matmul/         # gemv_*.metal, simd_gemm_*.metal
|               +-- norm/           # rmsnorm.metal, layernorm.metal
|               +-- rope/           # rope.metal, rope_neox.metal
|               +-- sampling/       # argmax.metal, gumbel_topk.metal
+-- tools/                          # CLI executables
|   +-- akunu_chat.cpp              # Interactive chat
|   +-- akunu_bench.cpp             # llama-bench style benchmark
|   +-- akunu_profile.cpp           # Per-kernel GPU profiler
|   +-- akunu_serve.cpp             # OpenAI-compatible HTTP server
+-- tests/
|   +-- test_*.cpp                  # Unit and integration tests
|   +-- kernels/                    # Per-kernel GPU correctness tests
|       +-- activation/             # test_silu.cpp, test_gelu.cpp, ...
|       +-- attention/              # test_flash_attention.cpp
|       +-- matmul/                 # test_gemv_f16.cpp, test_gemv_q4_0.cpp, ...
|       +-- norm/                   # test_rmsnorm.cpp, test_gemma_rmsnorm.cpp
|       +-- rope/                   # test_rope.cpp, test_rope_neox.cpp
|       +-- convert/                # test_f32_to_f16.cpp, test_dequant_q4_0.cpp
|       +-- embedding/              # test_embedding_f16.cpp
+-- 3rdparty/
    +-- xgrammar/                   # Grammar-constrained decoding library
There are roughly 120 .metal shader files, 16 kernel tests, 14 unit/integration
tests, and 8 CLI tools. The total C++ codebase is around 15,000 lines, with
another 10,000+ lines of Metal shader code.
Building the Project
Akunu has a two-stage build:
- Metal shaders: .metal source files are compiled to .air (Apple Intermediate Representation), then linked into a single akunu.metallib binary
- C++ engine: CMake builds all source files, links against Metal and Accelerate frameworks, and produces test executables and CLI tools
The Makefile provides convenience targets that run both stages:
Full Build (Shaders + Engine)
make
This is equivalent to make shaders engine. Let us trace what happens:
Step 1: make shaders
================================
For each .metal file in backend/metal/kernels/metal/kernel/**/:
xcrun -sdk macosx metal \
-std=metal3.2 \ <-- auto-detected (3.2 > 3.1 > 3.0)
-I backend/metal/kernels \
-O2 \ <-- optimization level
-fmetal-math-fp32-functions=fast \ <-- Xcode 16+ only
-c backend/metal/kernels/metal/kernel/norm/rmsnorm.metal \
-o build/air/metal/kernel/norm/rmsnorm.air
Then link all .air files into one metallib:
xcrun -sdk macosx metallib build/air/**/*.air -o build/akunu.metallib
Step 2: make engine
================================
mkdir -p build
cd build && cmake .. -DCMAKE_BUILD_TYPE=Release
cd build && make -j$(sysctl -n hw.ncpu)
This produces:
build/akunu_chat # interactive chat tool
build/akunu_bench # benchmark tool
build/akunu_profile # per-kernel profiler
build/akunu_serve # HTTP server
build/akunu_e2e # end-to-end test
build/akunu_test_* # all test executables
build/akunu_kernel_* # per-kernel test executables
build/libakunu_engine.a # static library
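The source-to-output mapping in step 1 of the trace above can be sketched in a few lines. This is an illustration only, assuming the layout shown in the trace (sources under backend/metal/kernels, intermediates under build/air); air_path_for is a hypothetical name and the real Makefile rule may differ.

```cpp
#include <filesystem>

namespace fs = std::filesystem;

// Maps a shader source path to its intermediate .air path, mirroring the
// layout in the build trace. Purely lexical: nothing is read from disk.
inline fs::path air_path_for(const fs::path& metal_source,
                             const fs::path& kernels_root = "backend/metal/kernels",
                             const fs::path& air_root = "build/air") {
    // Strip the kernels root so only "metal/kernel/<category>/<name>.metal" remains.
    fs::path relative = metal_source.lexically_relative(kernels_root);
    fs::path out = air_root / relative;
    out.replace_extension(".air");
    return out;
}
```

Keeping the category subdirectories in the output path means two shaders with the same file name in different categories do not overwrite each other's .air file.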
Build Time Expectations
On an M4 Pro (12-core CPU):
Stage          First Build   Rebuild (1 file changed)
-------------  ------------  --------------------------
Metal shaders  ~15 seconds   ~2 seconds (1 .metal file)
C++ engine     ~25 seconds   ~3 seconds (1 .cpp file)
Total          ~40 seconds   ~5 seconds
The Metal shader build parallelizes across CPU cores. With 120+ shader files,
the first build compiles them all in parallel. Subsequent rebuilds only
recompile changed .metal files and re-link the metallib.
Shader-Only Build
If you are working on Metal kernels and do not need to rebuild C++:
make shaders
Engine-Only Build
If you changed only C++ code and the metallib already exists:
make engine
Shared Library Build
For language bindings (Python, Swift, etc.):
make shared
# Produces: build/libakunu.dylib
Debug Build
For debugging with Xcode or lldb:
mkdir -p build-debug
cd build-debug
cmake .. -DCMAKE_BUILD_TYPE=Debug
make -j$(sysctl -n hw.ncpu)
Debug builds disable optimizations and enable assert macros. They are significantly slower for inference (3-5x) but essential for stepping through code.
Clean Build
make clean
# Removes the entire build/ directory
Metal Shader Compilation Deep Dive
Understanding the shader build pipeline is important because debugging compilation errors in Metal shaders is different from debugging C++ errors.
.metal source files        Shared headers
(120+ files)               (ShaderTypes.h, KernelCommon.h)
        |                          |
        v                          v
+------------------------------------------+
|  metal compiler (xcrun metal)            |
|  -std=metal3.2 -O2 -I includes           |
|  -fmetal-math-fp32-functions=fast        |
+------------------------------------------+
                    |
                    v
.air files (Apple Intermediate Representation)
One per .metal file, in build/air/
                    |
                    v
+------------------------------------------+
|  metallib linker (xcrun metallib)        |
|  Links all .air files into one binary    |
+------------------------------------------+
                    |
                    v
build/akunu.metallib
(single binary, ~3 MB, loaded at runtime)
The Metal standard version is auto-detected by the Makefile:
METAL_STD := $(shell $(METAL_CC) -std=metal3.2 ... && echo metal3.2 || \
($(METAL_CC) -std=metal3.1 ... && echo metal3.1 || echo metal3.0))
This tries Metal 3.2 first, falls back to 3.1, then 3.0. The
-fmetal-math-fp32-functions=fast flag is similarly auto-detected and only
used when available (Xcode 16+). This allows the same codebase to build on
older Xcode versions without modification.
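The fallback chain in the Makefile fragment is a first-match search over a list ordered newest first, with the last entry used unconditionally. Expressed in C++ as a sketch, where pick_metal_std is a hypothetical name and the compiles callback stands in for the elided compiler probe:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Returns the first candidate the probe accepts. The last candidate is the
// unconditional fallback, mirroring the `a || (b || echo c)` chain.
inline std::string pick_metal_std(
    const std::vector<std::string>& newest_first,
    const std::function<bool(const std::string&)>& compiles) {
    for (std::size_t i = 0; i + 1 < newest_first.size(); ++i) {
        if (compiles(newest_first[i])) return newest_first[i];
    }
    return newest_first.empty() ? std::string() : newest_first.back();
}
```

Note that the final entry is never probed, so an installation that rejects even the oldest standard fails later, at the first real shader compile, rather than at detection time.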
Common Shader Build Errors
Missing include: If you add a new .metal file that includes a header
not in the include path:
error: 'MyNewHeader.h' file not found
Fix: Add the header to backend/metal/kernels/ or its include/ subdirectory.
Metal standard mismatch: If you use a Metal 3.2 feature on a machine with only Metal 3.1:
error: unknown attribute 'metal3_2_features_only'
Fix: Guard the feature with #if __METAL_VERSION__ >= 320.
Type mismatch with ShaderTypes.h: The parameter structs in ShaderTypes.h
are shared between CPU (C++) and GPU (Metal). If you change a struct, both
sides must agree:
// This comment in ShaderTypes.h says it all:
// CRITICAL: Any change here MUST be mirrored in
// Sources/KernelStore/MetalTypes.swift.
// All structs are padded to 16-byte boundaries
// for Metal argument buffer alignment.
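One way to make the 16-byte rule fail at compile time instead of at kernel launch is a static_assert next to each struct. The struct below is a hypothetical example written for this sketch; akunu's real parameter structs live in ShaderTypes.h and are not reproduced here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical parameter struct (not from ShaderTypes.h). Fixed-width
// fields keep the layout identical on the CPU and GPU sides.
struct ExampleNormParams {
    std::uint32_t dim;      // elements per row
    float         eps;      // numerical-stability epsilon
    std::uint32_t _pad[2];  // explicit padding up to 16 bytes
};

// Fail the build, not the kernel, if someone changes the layout.
static_assert(sizeof(ExampleNormParams) % 16 == 0,
              "param structs must be padded to a 16-byte multiple");
static_assert(offsetof(ExampleNormParams, eps) == 4,
              "field offsets must match the GPU-side declaration");
```

Explicit padding fields are easier to audit than relying on alignas alone, because every byte of the struct is visible in the declaration.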
Running Tests
Akunu has several categories of tests, each with its own make target.
Unit Tests (No Model Required)
make test-unit
This runs tests that do not need a model file:
build/akunu_test_tokenizer_internal # BPE tokenizer internals
build/akunu_test_grammar # GBNF grammar parsing
build/akunu_test_server # HTTP server logic
build/akunu_test_whisper # Whisper format parsing
These tests are fast (< 1 second total) and should always pass on a clean build.
Inference Tests (Model Required)
make test-infer MODEL=models/Qwen3-0.6B-Q4_0.gguf
This runs tests that need a real model file:
build/akunu_e2e <model> "The capital of France is" 0 10
build/akunu_test_inference <model>
build/akunu_test_tokenizer <model>
You need to download a model first. The smallest model that exercises all code paths is Qwen3-0.6B in Q4_0 quantization (~400 MB).
Kernel Tests (No Model Required)
Kernel tests verify GPU correctness by comparing Metal shader output against CPU reference implementations. They need the metallib but not a model:
# Run individual kernel tests
build/akunu_kernel_test_rmsnorm
build/akunu_kernel_test_gemv_f16
build/akunu_kernel_test_flash_attention
# The full list of 16 kernel tests:
build/akunu_kernel_test_rmsnorm
build/akunu_kernel_test_gemma_rmsnorm
build/akunu_kernel_test_gemv_f16
build/akunu_kernel_test_gemv_q4_0
build/akunu_kernel_test_gemv_q8_0
build/akunu_kernel_test_gemm_f16
build/akunu_kernel_test_silu
build/akunu_kernel_test_gelu
build/akunu_kernel_test_silu_gate
build/akunu_kernel_test_gelu_gate
build/akunu_kernel_test_rope
build/akunu_kernel_test_rope_neox
build/akunu_kernel_test_flash_attention
build/akunu_kernel_test_embedding_f16
build/akunu_kernel_test_f32_to_f16
build/akunu_kernel_test_dequant_q4_0
Each kernel test creates a MetalDevice, loads the metallib, generates
deterministic test data, runs the GPU kernel, and compares against a CPU
reference. See Chapter 52 for a detailed walkthrough of the testing
infrastructure.
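The compare-against-reference step can be illustrated with RMSNorm. The sketch below uses the standard RMSNorm definition and hypothetical helper names; akunu's actual reference implementations and tolerances may differ.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for RMSNorm: y[i] = x[i] / sqrt(mean(x^2) + eps) * w[i].
// Assumes x is non-empty and w has the same length as x.
inline std::vector<float> rmsnorm_reference(const std::vector<float>& x,
                                            const std::vector<float>& w,
                                            float eps) {
    double sum_sq = 0.0;  // accumulate in double to limit rounding error
    for (float v : x) sum_sq += static_cast<double>(v) * v;
    const double inv_rms = 1.0 / std::sqrt(sum_sq / x.size() + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = static_cast<float>(x[i] * inv_rms * w[i]);
    return y;
}

// Largest element-wise absolute difference between two equal-length vectors.
inline float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        worst = std::fmax(worst, std::fabs(a[i] - b[i]));
    return worst;
}
```

A kernel test would copy the GPU output buffer into a std::vector<float> and pass when max_abs_diff against the reference stays below a tolerance chosen for the kernel's precision, typically looser for F16 and quantized kernels than for F32.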
Running All Tests
# Unit + inference
make test MODEL=models/Qwen3-0.6B-Q4_0.gguf
Benchmark
make bench MODEL=models/Qwen3-0.6B-Q4_0.gguf
This runs akunu_bench with 512-token prefill and 128-token generation,
repeated 3 times, reporting tokens/second in llama-bench format.
IDE Setup
Xcode
Xcode is the best IDE for akunu development because it has native Metal shader support, GPU debugging, and frame capture.
Generating an Xcode project from CMake:
mkdir -p build-xcode
cd build-xcode
cmake .. -G Xcode
open akunu.xcodeproj
This creates an Xcode project with all targets (library, tests, tools).
However, it does not handle the Metal shader build – you still need
make shaders from the command line.
Xcode scheme setup:
- Select the akunu_chat scheme for interactive testing
- Edit the scheme: Run > Arguments > add model path as first argument
- Edit the scheme: Run > Options > set Working Directory to project root
- Build and run with Cmd+R
Metal shader editing in Xcode:
Xcode provides syntax highlighting and basic error checking for .metal
files. Open any .metal file from the project navigator. The Metal compiler
runs in the background and shows errors inline.
For shader editing, you want the include paths configured. In the Xcode project, add these to the Metal compiler settings:
Header Search Paths: $(PROJECT_DIR)/backend/metal/kernels
Visual Studio Code
VS Code with the right extensions provides a solid alternative:
# Install recommended extensions
code --install-extension ms-vscode.cpptools
code --install-extension ms-vscode.cmake-tools
code --install-extension nickmass.metal-shader
Create .vscode/settings.json:
{
"cmake.buildDirectory": "${workspaceFolder}/build",
"cmake.configureArgs": ["-DCMAKE_BUILD_TYPE=Debug"],
"C_Cpp.default.includePath": [
"${workspaceFolder}/include",
"${workspaceFolder}/src",
"${workspaceFolder}/backend"
],
"files.associations": {
"*.metal": "metal"
}
}
Create .vscode/tasks.json for shader builds:
{
"version": "2.0.0",
"tasks": [
{
"label": "Build Shaders",
"type": "shell",
"command": "make shaders",
"group": "build"
},
{
"label": "Build All",
"type": "shell",
"command": "make",
"group": {
"kind": "build",
"isDefault": true
}
}
]
}
CLion
CLion has excellent CMake integration. Open the project root directory and CLion will auto-detect the CMakeLists.txt. Add a custom build step for shaders:
- Settings > Build, Execution, Deployment > CMake > add a “Before launch” step that runs make shaders
- Or configure an External Tool for the shader build
Metal Debugger and GPU Profiling
Xcode GPU Frame Capture
The most powerful tool for debugging Metal shaders is Xcode’s GPU Frame Capture:
- Set the METAL_DEVICE_WRAPPER_TYPE environment variable: export METAL_DEVICE_WRAPPER_TYPE=1
- Run your akunu executable under Xcode
- Click the camera icon in the debug bar to capture a GPU frame
- Xcode shows every command buffer, compute encoder, and dispatch
This lets you inspect:
- Buffer contents at any point in the pipeline
- Shader execution time per dispatch
- Thread occupancy and register pressure
- Memory bandwidth utilization
Metal Validation Layer
Enable Metal API validation to catch buffer overflows, misaligned access, and other GPU programming errors:
export MTL_DEBUG_LAYER=1
export METAL_DEBUG_ERROR_MODE=assert
With validation enabled, Metal checks every API call and crashes immediately on misuse rather than producing silent corruption. This is essential during development but adds significant overhead – do not use it for benchmarking.
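Because a leftover export can silently distort timings, a benchmark tool could check the environment before measuring anything. A minimal sketch, assuming only the MTL_DEBUG_LAYER variable shown above; both function names are hypothetical and nothing here claims akunu_bench performs this check.

```cpp
#include <cstdlib>
#include <cstring>

// True when the named environment variable is set to exactly "1".
inline bool env_flag_enabled(const char* name) {
    const char* value = std::getenv(name);
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Hypothetical guard a benchmark could call before timing anything.
inline bool metal_validation_requested() {
    return env_flag_enabled("MTL_DEBUG_LAYER");
}
```

A tool might print a warning or refuse to run when this returns true, so that validation overhead never ends up in reported numbers.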
Metal Shader Debugging
For stepping through shader code line-by-line:
- In Xcode, select Debug > Attach to Process > your running akunu executable
- Enable GPU shader debugging: Product > Scheme > Edit Scheme > Run > Diagnostics > GPU Validation > Shader Validation
- Set a breakpoint in a .metal file
- When the breakpoint hits, you can inspect thread variables, buffer contents, and threadgroup memory
This is slow (100x+ overhead) but invaluable for correctness debugging.
Metal System Trace
For system-level GPU analysis:
# Record a 5-second trace
xctrace record --template 'Metal System Trace' \
--output trace.trace \
--time-limit 5s \
--launch -- build/akunu_bench models/Qwen3-0.6B-Q4_0.gguf -n 32 -r 1
Open the trace in Instruments to see:
- GPU timeline (which kernels ran when)
- CPU-GPU synchronization points
- Memory allocation patterns
- Command buffer scheduling
The akunu_profile Tool
Akunu includes its own per-kernel profiling tool that does not require Xcode:
build/akunu_profile models/Qwen3-0.6B-Q4_0.gguf --tokens 5
This runs each dispatch command in its own command buffer (rather than the normal batched execution) and reports per-kernel GPU time. The output shows exactly which kernels dominate the forward pass:
Decode Summary (5 tokens)
========================================
embedding 0.012 ms 0.8%
attention_norm 0.008 ms 0.5%
qkv_gemv 0.142 ms 9.1%
rope_kv_write 0.015 ms 1.0%
flash_attention 0.098 ms 6.3%
output_gemv 0.047 ms 3.0%
ffn_norm 0.008 ms 0.5%
gate_gemv 0.142 ms 9.1%
up_gemv 0.142 ms 9.1%
silu_gate 0.012 ms 0.8%
down_gemv 0.142 ms 9.1%
... (per layer)
logit_projection 0.350 ms 22.5%
argmax 0.003 ms 0.2%
========================================
Total per token: 1.56 ms
Throughput: 641 t/s (single-token decode)
See Chapter 55 for a complete guide to profiling and benchmarking.
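The throughput line in the sample output follows directly from the per-token total: 1000 ms divided by 1.56 ms per token is about 641 tokens per second. The same conversions as a small sketch, with hypothetical function names:

```cpp
// Converts a per-token latency in milliseconds to tokens per second.
inline double tokens_per_second(double ms_per_token) {
    return 1000.0 / ms_per_token;
}

// Throughput for a whole run: tokens produced divided by elapsed seconds.
inline double run_throughput(int tokens, double elapsed_seconds) {
    return static_cast<double>(tokens) / elapsed_seconds;
}
```

Keep in mind that the profiler runs each dispatch in its own command buffer, so its per-token total is a diagnostic figure and not a substitute for the numbers akunu_bench reports.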
Quick Development Workflow
Here is the workflow most contributors use:
+------------------+     +------------------+     +------------------+
| Edit code        |     | Build            |     | Test             |
| (.metal or .cpp) |---->| make             |---->| kernel test      |
|                  |     | (~5s rebuild)    |     | or e2e test      |
+------------------+     +------------------+     +------------------+
         ^                                                 |
         |                                                 |
         +-------------------------------------------------+
                           Fix and iterate
For Metal kernel work:
# 1. Edit your shader
vim backend/metal/kernels/metal/kernel/norm/rmsnorm.metal
# 2. Rebuild shaders only (~2s)
make shaders
# 3. Rebuild the test (~3s)
make engine
# 4. Run the specific kernel test
build/akunu_kernel_test_rmsnorm
For C++ engine work:
# 1. Edit your source
vim src/core/table_builder.cpp
# 2. Rebuild engine only (~3s)
make engine
# 3. Run relevant test
build/akunu_e2e models/Qwen3-0.6B-Q4_0.gguf "Hello" 0 10
For both (new kernel end-to-end):
# 1. Write the .metal file
# 2. Add params to ShaderTypes.h
# 3. Wire up in table_builder.cpp
# 4. Write the kernel test
# 5. Full rebuild + test
make && build/akunu_kernel_test_your_new_kernel
Downloading Test Models
Several tests and all benchmarks require model files. Here are the recommended test models by size:
Model                            Size     Use Case
-------------------------------  -------  --------------------------
Qwen3-0.6B-Q4_0.gguf             ~400 MB  Default test model (fast)
Llama-3.2-1B-Instruct-Q4_0.gguf  ~700 MB  Test LLaMA architecture
Qwen3-4B-Q4_K_M.gguf             ~2.5 GB  Test larger models
whisper-base-en.bin              ~140 MB  Test Whisper transcription
Place models in the models/ directory at the project root:
mkdir -p models
# Download from HuggingFace or your preferred source
# Example using huggingface-cli:
huggingface-cli download Qwen/Qwen3-0.6B-GGUF \
--include "Qwen3-0.6B-Q4_0.gguf" \
--local-dir models/
The MODEL variable in the Makefile defaults to models/Qwen3-0.6B-Q4_0.gguf.
You can override it:
make test-infer MODEL=models/Llama-3.2-1B-Instruct-Q4_0.gguf
Troubleshooting
“Metallib not found”
The kernel tests look for the metallib in several relative paths:
bool _ok = dev->load_library("../../.build/metallib/akunu.metallib");
if (!_ok) _ok = dev->load_library(".build/metallib/akunu.metallib");
if (!_ok) _ok = dev->load_library("../../../.build/metallib/akunu.metallib");
If none of these match your working directory, either:
- Run tests from the project root: ./build/akunu_kernel_test_rmsnorm
- Or set the metallib path explicitly (if the API supports it)
The simplest fix is to always run tests from the project root directory.
CMake Cannot Find Metal Framework
CMake Error: Could not find framework Metal
This means Xcode command-line tools are not properly installed:
sudo xcode-select -s /Applications/Xcode.app/Contents/Developer
XGrammar Build Fails
If the XGrammar submodule fails to build:
# Ensure submodule is initialized
git submodule update --init --recursive
# If still failing, the build continues without grammar support
# (AKUNU_HAS_XGRAMMAR will be OFF)
Grammar-constrained decoding is optional. The core inference engine works fine without it.
Objective-C++ Compilation Errors
Several files (like metal_device.mm and test files that use Metal) are
compiled as Objective-C++. CMake handles this automatically:
set_source_files_properties(tests/test_device.mm PROPERTIES LANGUAGE OBJCXX)
If you see errors about @interface or NSError, check that the file
extension is .mm (not .cpp) or that CMake has the LANGUAGE OBJCXX
property set.
Summary
Let us recap the essential commands:
# First-time setup
git clone --recursive https://github.com/prabod/akunu.git
cd akunu
# Full build
make
# Quick iterations
make shaders # Metal only
make engine # C++ only
# Tests
make test-unit # No model needed
make test-infer MODEL=models/Qwen3-0.6B-Q4_0.gguf # Needs model
build/akunu_kernel_test_rmsnorm # Single kernel test
# Benchmarks
make bench MODEL=models/Qwen3-0.6B-Q4_0.gguf
build/akunu_profile models/Qwen3-0.6B-Q4_0.gguf --tokens 5
# Debug
mkdir -p build-debug && cd build-debug
cmake .. -DCMAKE_BUILD_TYPE=Debug && make -j$(sysctl -n hw.ncpu)
export MTL_DEBUG_LAYER=1 # Metal validation
Your development environment is now ready. In the next chapter, we will dive deep into akunu’s testing infrastructure – the CPU reference implementations, the kernel test pattern, and how to write tests for new functionality.