Development Environment Setup

Welcome to Part IX of this book, where we transition from understanding how akunu works to actually contributing to its codebase. If you have read this far, you already know more about Apple Silicon LLM inference than most people who write about it. Now it is time to get your hands dirty.

This chapter walks you through every step of setting up a development environment for akunu on macOS. We will cover hardware prerequisites, toolchain installation, cloning and building the project, IDE configuration, and debugging Metal shaders. By the end, you will have a running build that passes all tests and a workflow that lets you iterate quickly on both CPU and GPU code.

Hardware Prerequisites

Akunu targets Apple Silicon exclusively. You need a Mac with an M-series chip:

Supported Hardware
==================

  +------------------+-------------+------------------+------------------+
  | Chip             | GPU Family  | GPU Cores        | Memory BW        |
  +------------------+-------------+------------------+------------------+
  | M1               | Apple 7     | 7-8              | 68 GB/s          |
  | M1 Pro/Max/Ultra | Apple 7     | 14-64            | 200-800 GB/s     |
  | M2               | Apple 8     | 8-10             | 100 GB/s         |
  | M2 Pro/Max/Ultra | Apple 8     | 16-76            | 200-800 GB/s     |
  | M3               | Apple 9     | 8-10             | 100 GB/s         |
  | M3 Pro/Max/Ultra | Apple 9     | 14-80            | 150-819 GB/s     |
  | M4               | Apple 9     | 10               | 120 GB/s         |
  | M4 Pro/Max       | Apple 9     | 16-40            | 273-546 GB/s     |
  +------------------+-------------+------------------+------------------+

  Minimum: M1 with 8 GB RAM (small models only)
  Recommended: M2 Pro+ with 16+ GB RAM
  Ideal: M4 Pro/Max with 36+ GB RAM

The GPU family number matters because it determines which Metal hardware features are available: Apple 7 covers M1, Apple 8 covers M2, and Apple 9 covers M3 and M4. The Metal Shading Language version is a separate axis, set by the OS and Xcode toolchain rather than by the chip: Metal 3.0 shipped with macOS 13, Metal 3.1 with macOS 14, and Metal 3.2 with macOS 15. The newer versions bring improvements to SIMD-group operations, BF16 support, and enhanced matrix operations.
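
The memory bandwidth column is worth a quick calculation. Single-token decode is usually bandwidth-bound, because every generated token streams the model weights from memory once. The sketch below turns that into a rough ceiling; it is a rule of thumb under stated assumptions, not a measured akunu result, and the function name is hypothetical.

```cpp
// Rule-of-thumb ceiling on decode throughput for a bandwidth-bound model.
// Assumption: each generated token reads all weights exactly once. KV-cache
// traffic, compute time, and dispatch overhead are ignored, so real
// throughput will be lower than this bound.
constexpr double decode_ceiling_tokens_per_sec(double bandwidth_gb_per_s,
                                               double weights_gb) {
    return bandwidth_gb_per_s / weights_gb;
}

// Example: a 0.5 GB model on a 100 GB/s chip cannot exceed 200 tokens/s.
static_assert(decode_ceiling_tokens_per_sec(100.0, 0.5) == 200.0,
              "100 GB/s over 0.5 GB gives a 200 tokens/s ceiling");
```

The same division explains why the Pro and Max chips help with larger models: the ceiling scales directly with bandwidth.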

Akunu auto-detects the Metal language version at build time by probing the Metal compiler, and the GPU family at runtime through the MTLDevice.supportsFamily API. You do not need to configure anything – the build system picks the highest Metal standard your installed compiler accepts.

Software Prerequisites

macOS Version

You need macOS 14 (Sonoma) or later. macOS 15 (Sequoia) is recommended because it ships with Metal 3.2 support and improved GPU debugging tools. You can check your version:

sw_vers --productVersion
# 15.4.1

Xcode

Xcode 15 or later is required. Xcode 16+ is recommended because it includes:

  • Metal 3.2 compiler with -fmetal-math-fp32-functions=fast optimization
  • Improved GPU profiler with per-kernel timing
  • Better Metal shader debugging and validation layers

Install Xcode from the App Store, then make sure the command-line tools are selected:

# Check current Xcode version
xcodebuild -version
# Xcode 16.3
# Build version ...

# Ensure command-line tools point to full Xcode (not standalone CLT)
sudo xcode-select -s /Applications/Xcode.app/Contents/Developer

# Verify the Metal compiler is available
xcrun -sdk macosx metal --version
# Apple metal version 32023.155 (metalfe-32023.155)

The Metal compiler (metal) and linker (metallib) are part of Xcode, not separate installs. If xcrun metal fails, your Xcode installation is incomplete.

CMake

Akunu uses CMake 3.20+ as its build system for the C++ engine. The Metal shaders have their own Makefile-based build, but CMake orchestrates the C++ compilation and test executables.

# Install via Homebrew (recommended)
brew install cmake

# Verify version
cmake --version
# cmake version 3.31.6

Two optional tools speed up builds:

# Ninja: faster parallel builds than make
brew install ninja

# ccache: caches compilation results for faster rebuilds
brew install ccache

Cloning the Repository

Akunu uses git submodules for its third-party dependencies (currently just XGrammar for grammar-constrained decoding). Always clone with --recursive:

git clone --recursive https://github.com/prabod/akunu.git
cd akunu

If you already cloned without --recursive, initialize the submodules now:

git submodule update --init --recursive

This pulls XGrammar (v0.1.33) into 3rdparty/xgrammar/. Without it, the build still succeeds but grammar-constrained decoding is disabled.

Let us look at the directory structure:

akunu/
+-- CMakeLists.txt           # C++ build system
+-- Makefile                 # Metal shader build + convenience targets
+-- VERSION                  # Version file
+-- include/
|   +-- akunu/
|       +-- akunu.h          # Public C API
|       +-- types.h          # Public type definitions
+-- src/
|   +-- core/                # Engine core: dispatch, descriptors, config
|   +-- inference/           # Decode paths, sampling, model loading
|   +-- tokenizer/           # BPE tokenizer
|   +-- grammar/             # GBNF grammar, JSON schema
|   +-- weight/              # GGUF parser, MLX SafeTensors, weight store
|   +-- whisper/             # Whisper transcription engine
|   +-- akunu_api.cpp        # C API implementation
+-- backend/
|   +-- metal/
|       +-- metal_device.mm  # Metal GPU backend (Objective-C++)
|       +-- kernels/
|           +-- ShaderTypes.h   # Shared CPU/GPU param structs
|           +-- KernelCommon.h  # Shared GPU utilities
|           +-- MetalKernels.c  # Kernel registration
|           +-- metal/kernel/   # .metal shader source files
|               +-- activation/ # silu.metal, gelu.metal
|               +-- attention/  # flash_attention*.metal, softmax.metal
|               +-- common/     # residual_add.metal, transpose.metal
|               +-- convert/    # dequant_*.metal, f16<->f32
|               +-- embedding/  # embedding_lookup_*.metal
|               +-- fused/      # gemv_q8_0_head_rmsnorm.metal
|               +-- kv_cache/   # kv_cache_write.metal, shift.metal
|               +-- matmul/     # gemv_*.metal, simd_gemm_*.metal
|               +-- norm/       # rmsnorm.metal, layernorm.metal
|               +-- rope/       # rope.metal, rope_neox.metal
|               +-- sampling/   # argmax.metal, gumbel_topk.metal
+-- tools/                   # CLI executables
|   +-- akunu_chat.cpp       # Interactive chat
|   +-- akunu_bench.cpp      # llama-bench style benchmark
|   +-- akunu_profile.cpp    # Per-kernel GPU profiler
|   +-- akunu_serve.cpp      # OpenAI-compatible HTTP server
+-- tests/
|   +-- test_*.cpp           # Unit and integration tests
|   +-- kernels/             # Per-kernel GPU correctness tests
|       +-- activation/      # test_silu.cpp, test_gelu.cpp, ...
|       +-- attention/       # test_flash_attention.cpp
|       +-- matmul/          # test_gemv_f16.cpp, test_gemv_q4_0.cpp, ...
|       +-- norm/            # test_rmsnorm.cpp, test_gemma_rmsnorm.cpp
|       +-- rope/            # test_rope.cpp, test_rope_neox.cpp
|       +-- convert/         # test_f32_to_f16.cpp, test_dequant_q4_0.cpp
|       +-- embedding/       # test_embedding_f16.cpp
+-- 3rdparty/
    +-- xgrammar/            # Grammar-constrained decoding library

There are roughly 120 .metal shader files, 16 kernel tests, 14 unit/integration tests, and 8 CLI tools. The total C++ codebase is around 15,000 lines, with another 10,000+ lines of Metal shader code.

Building the Project

Akunu has a two-stage build:

  1. Metal shaders: .metal source files are compiled to .air (Apple Intermediate Representation), then linked into a single akunu.metallib binary
  2. C++ engine: CMake builds all source files, links against Metal and Accelerate frameworks, and produces test executables and CLI tools

The Makefile provides convenience targets that run both stages:

Full Build (Shaders + Engine)

make

This is equivalent to make shaders engine. Let us trace what happens:

Step 1: make shaders
================================

  For each .metal file in backend/metal/kernels/metal/kernel/**/:

    xcrun -sdk macosx metal \
      -std=metal3.2 \          <-- auto-detected (3.2 > 3.1 > 3.0)
      -I backend/metal/kernels \
      -O2 \                    <-- optimization level
      -fmetal-math-fp32-functions=fast \  <-- Xcode 16+ only
      -c backend/metal/kernels/metal/kernel/norm/rmsnorm.metal \
      -o build/air/metal/kernel/norm/rmsnorm.air

  Then link all .air files into one metallib:

    xcrun -sdk macosx metallib build/air/**/*.air -o build/akunu.metallib


Step 2: make engine
================================

  mkdir -p build
  cd build && cmake .. -DCMAKE_BUILD_TYPE=Release
  cd build && make -j$(sysctl -n hw.ncpu)

  This produces:
    build/akunu_chat         # interactive chat tool
    build/akunu_bench        # benchmark tool
    build/akunu_profile      # per-kernel profiler
    build/akunu_serve        # HTTP server
    build/akunu_e2e          # end-to-end test
    build/akunu_test_*       # all test executables
    build/akunu_kernel_*     # per-kernel test executables
    build/libakunu_engine.a  # static library
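
The source-to-.air mapping in Step 1 keeps the directory structure under backend/metal/kernels/ and swaps the extension. A minimal C++ sketch of that mapping follows. The function name and the two prefix constants are assumptions for illustration, and the real Makefile may compute the path differently.

```cpp
#include <cassert>
#include <string>

// Maps a shader source path to its intermediate .air path, following the
// layout shown in Step 1. Returns an empty string for paths that are outside
// the shader tree or that do not end in ".metal".
inline std::string air_path_for(const std::string& metal_src) {
    assert(!metal_src.empty() && "shader path must not be empty");
    const std::string src_root = "backend/metal/kernels/";  // assumed prefix
    const std::string air_root = "build/air/";              // assumed prefix
    const std::string ext = ".metal";

    if (metal_src.size() < src_root.size() + ext.size()) return "";
    if (metal_src.compare(0, src_root.size(), src_root) != 0) return "";
    if (metal_src.compare(metal_src.size() - ext.size(), ext.size(), ext) != 0) return "";

    // Keep the relative directory structure and swap the extension.
    const std::string rel = metal_src.substr(
        src_root.size(), metal_src.size() - src_root.size() - ext.size());
    return air_root + rel + ".air";
}
```

Called with backend/metal/kernels/metal/kernel/norm/rmsnorm.metal, this yields build/air/metal/kernel/norm/rmsnorm.air, matching the -o argument in the trace.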

Build Time Expectations

On an M4 Pro (12-core CPU):

Stage           First Build    Rebuild (1 file changed)
-----------     -----------    ------------------------
Metal shaders   ~15 seconds    ~2 seconds (1 .metal file)
C++ engine      ~25 seconds    ~3 seconds (1 .cpp file)
Total           ~40 seconds    ~5 seconds

The Metal shader build parallelizes across CPU cores. With 120+ shader files, the first build compiles them all in parallel. Subsequent rebuilds only recompile changed .metal files and re-link the metallib.
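
The "only recompile what changed" rule is the usual make-style timestamp comparison. The sketch below states it in C++ with hypothetical names. Whether akunu's Makefile also tracks shared headers such as ShaderTypes.h is not stated here, so the header check is an assumption.

```cpp
#include <cstdint>

// Existence flag plus modification time (seconds since the epoch).
struct FileStamp {
    bool exists;
    std::int64_t mtime;
};

// Decides whether one shader needs recompiling, in the style of a make rule.
constexpr bool needs_recompile(FileStamp metal_src, FileStamp air_out,
                               FileStamp newest_header) {
    // No .air output yet: must compile.
    if (!air_out.exists) return true;
    // Recompile when the source, or (assumed) any shared header, is newer
    // than the existing output.
    return metal_src.mtime > air_out.mtime ||
           (newest_header.exists && newest_header.mtime > air_out.mtime);
}
```

If headers are tracked this way, touching a shared header invalidates every shader at once, so a header edit costs roughly a first-build shader compile rather than a two-second rebuild.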

Shader-Only Build

If you are working on Metal kernels and do not need to rebuild C++:

make shaders

Engine-Only Build

If you changed only C++ code and the metallib already exists:

make engine

Shared Library Build

For language bindings (Python, Swift, etc.):

make shared
# Produces: build/libakunu.dylib

Debug Build

For debugging with Xcode or lldb:

mkdir -p build-debug
cd build-debug
cmake .. -DCMAKE_BUILD_TYPE=Debug
make -j$(sysctl -n hw.ncpu)

Debug builds disable optimizations and enable assert macros. They are significantly slower for inference (3-5x) but essential for stepping through code.
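
To make the assert behavior concrete, here is a small sketch with a hypothetical helper (not an akunu function). CMake's default Release flags define NDEBUG, which compiles assert() out entirely; Debug builds leave it active.

```cpp
#include <cassert>
#include <cstddef>

// Bounds-checked read. The check runs in Debug builds and disappears when
// NDEBUG is defined, so it costs nothing in optimized inference.
constexpr float checked_read(const float* data, std::size_t count, std::size_t i) {
    assert(i < count && "index out of range");
    return data[i];
}

// True when assert() is active in this translation unit.
constexpr bool asserts_enabled() {
#ifdef NDEBUG
    return false;
#else
    return true;
#endif
}
```

In a Debug build an out-of-range index aborts at the faulty call site; in Release the same call silently reads past the buffer, which is one reason to reproduce crashes in build-debug first.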

Clean Build

make clean
# Removes the entire build/ directory

Metal Shader Compilation Deep Dive

Understanding the shader build pipeline is important because debugging compilation errors in Metal shaders is different from debugging C++ errors.

   .metal source files          Shared headers
   (120+ files)                 (ShaderTypes.h, KernelCommon.h)
        |                              |
        v                              v
  +------------------------------------------+
  |  metal compiler (xcrun metal)            |
  |  -std=metal3.2  -O2  -I includes        |
  |  -fmetal-math-fp32-functions=fast        |
  +------------------------------------------+
        |
        v
   .air files (Apple Intermediate Representation)
   One per .metal file, in build/air/
        |
        v
  +------------------------------------------+
  |  metallib linker (xcrun metallib)        |
  |  Links all .air files into one binary    |
  +------------------------------------------+
        |
        v
   build/akunu.metallib
   (single binary, ~3 MB, loaded at runtime)

The Metal standard version is auto-detected by the Makefile:

METAL_STD := $(shell $(METAL_CC) -std=metal3.2 ... && echo metal3.2 || \
             ($(METAL_CC) -std=metal3.1 ... && echo metal3.1 || echo metal3.0))

This tries Metal 3.2 first, falls back to 3.1, then 3.0. The -fmetal-math-fp32-functions=fast flag is similarly auto-detected and only used when available (Xcode 16+). This allows the same codebase to build on older Xcode versions without modification.
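
The fallback order is easier to see outside Makefile syntax. The sketch below mirrors it in C++: try the newest standard first and settle on the first one the compiler accepts. All names are hypothetical, and the probes are stand-ins; the real detection invokes the Metal compiler from make.

```cpp
// A probe answers: does the installed compiler accept this -std= value?
using CompilerProbe = bool (*)(const char* std_flag);

constexpr const char* pick_metal_std(CompilerProbe accepts) {
    const char* candidates[] = {"metal3.2", "metal3.1"};
    for (const char* c : candidates) {
        if (accepts(c)) return c;
    }
    return "metal3.0";  // last resort, matching the Makefile's final echo
}

// Stand-in probes for demonstration only.
constexpr bool current_toolchain(const char*) { return true; }
constexpr bool older_toolchain(const char* std_flag) {
    return std_flag[7] != '2';  // rejects "metal3.2", accepts the rest
}

static_assert(pick_metal_std(current_toolchain)[7] == '2', "picks metal3.2");
static_assert(pick_metal_std(older_toolchain)[7] == '1', "falls back to metal3.1");
```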

Common Shader Build Errors

Missing include: If you add a new .metal file that includes a header not in the include path:

error: 'MyNewHeader.h' file not found

Fix: Add the header to backend/metal/kernels/ or its include/ subdirectory.

Metal standard mismatch: If you use a Metal 3.2 feature on a machine with only Metal 3.1:

error: unknown attribute 'metal3_2_features_only'

Fix: Guard the feature with #if __METAL_VERSION__ >= 320.
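
In a header shared between Metal and C++ sources, the guard also has to survive the host compiler, where __METAL_VERSION__ is not defined at all. A minimal sketch of that pattern follows; AKUNU_USE_METAL32_PATH is a hypothetical macro name used only for illustration.

```cpp
// __METAL_VERSION__ is defined only by the Metal compiler, so the host C++
// build always takes the fallback branch.
#if defined(__METAL_VERSION__) && __METAL_VERSION__ >= 320
#define AKUNU_USE_METAL32_PATH 1
#else
#define AKUNU_USE_METAL32_PATH 0
#endif

// Lets code branch on the feature without repeating the preprocessor test.
constexpr bool uses_metal32_path() { return AKUNU_USE_METAL32_PATH != 0; }
```

Wrapping the version check in one macro keeps the fallback path compiled and tested on every machine, including those with the newer toolchain.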

Type mismatch with ShaderTypes.h: The parameter structs in ShaderTypes.h are shared between CPU (C++) and GPU (Metal). If you change a struct, both sides must agree:

// This comment in ShaderTypes.h says it all:
// CRITICAL: Any change here MUST be mirrored in
// Sources/KernelStore/MetalTypes.swift.
// All structs are padded to 16-byte boundaries
// for Metal argument buffer alignment.
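
One way to enforce that agreement on the C++ side is with compile-time layout checks. The struct below is a hypothetical example in the style the comment describes; its field names are assumptions and are not copied from ShaderTypes.h.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical kernel parameter block, padded explicitly to 16 bytes.
struct RMSNormParams {
    std::uint32_t dim;    // elements per row
    float eps;            // numerical-stability epsilon
    std::uint32_t _pad0;  // explicit padding up to the 16-byte boundary
    std::uint32_t _pad1;
};

// A layout change now breaks the build instead of silently corrupting the
// values the kernel reads.
static_assert(sizeof(RMSNormParams) % 16 == 0, "params must be 16-byte padded");
static_assert(offsetof(RMSNormParams, eps) == 4, "eps offset must match the shader side");
```

Explicit padding fields are a design choice: they make the size visible in the source rather than leaving it to each compiler's implicit padding rules.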

Running Tests

Akunu has several categories of tests, each with its own make target.

Unit Tests (No Model Required)

make test-unit

This runs tests that do not need a model file:

build/akunu_test_tokenizer_internal  # BPE tokenizer internals
build/akunu_test_grammar             # GBNF grammar parsing
build/akunu_test_server              # HTTP server logic
build/akunu_test_whisper             # Whisper format parsing

These tests are fast (< 1 second total) and should always pass on a clean build.

Inference Tests (Model Required)

make test-infer MODEL=models/Qwen3-0.6B-Q4_0.gguf

This runs tests that need a real model file:

build/akunu_e2e <model> "The capital of France is" 0 10
build/akunu_test_inference <model>
build/akunu_test_tokenizer <model>

You need to download a model first. The smallest model that exercises all code paths is Qwen3-0.6B in Q4_0 quantization (~400 MB).

Kernel Tests (No Model Required)

Kernel tests verify GPU correctness by comparing Metal shader output against CPU reference implementations. They need the metallib but not a model:

# Run individual kernel tests
build/akunu_kernel_test_rmsnorm
build/akunu_kernel_test_gemv_f16
build/akunu_kernel_test_flash_attention

# The full list of 16 kernel tests:
build/akunu_kernel_test_rmsnorm
build/akunu_kernel_test_gemma_rmsnorm
build/akunu_kernel_test_gemv_f16
build/akunu_kernel_test_gemv_q4_0
build/akunu_kernel_test_gemv_q8_0
build/akunu_kernel_test_gemm_f16
build/akunu_kernel_test_silu
build/akunu_kernel_test_gelu
build/akunu_kernel_test_silu_gate
build/akunu_kernel_test_gelu_gate
build/akunu_kernel_test_rope
build/akunu_kernel_test_rope_neox
build/akunu_kernel_test_flash_attention
build/akunu_kernel_test_embedding_f16
build/akunu_kernel_test_f32_to_f16
build/akunu_kernel_test_dequant_q4_0

Each kernel test creates a MetalDevice, loads the metallib, generates deterministic test data, runs the GPU kernel, and compares against a CPU reference. See Chapter 52 for a detailed walkthrough of the testing infrastructure.
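
The CPU half of that pattern can be sketched without any Metal code. The example below uses the standard RMSNorm definition with hypothetical helper names. Akunu's own reference implementations may differ in details (the Gemma variant, for instance), and the GPU output vector would come from the kernel under test.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Deterministic pseudo-random floats in [-1, 1) from a fixed-seed LCG, so a
// failing test reproduces exactly.
inline std::vector<float> make_test_data(std::size_t n, std::uint32_t seed) {
    std::vector<float> out(n);
    std::uint32_t state = seed;
    for (std::size_t i = 0; i < n; ++i) {
        state = state * 1664525u + 1013904223u;
        out[i] = static_cast<float>(state >> 8) / 8388608.0f - 1.0f;
    }
    return out;
}

// CPU reference: y[i] = x[i] / sqrt(mean(x^2) + eps) * w[i], accumulated in
// double so the reference is more precise than the kernel under test.
inline std::vector<float> rmsnorm_ref(const std::vector<float>& x,
                                      const std::vector<float>& w, float eps) {
    assert(!x.empty() && x.size() == w.size());
    double sum_sq = 0.0;
    for (float v : x) sum_sq += static_cast<double>(v) * v;
    const double scale = 1.0 / std::sqrt(sum_sq / static_cast<double>(x.size()) + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = static_cast<float>(x[i] * scale * w[i]);
    return y;
}

// Largest element-wise difference between two outputs of equal length.
inline float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        worst = std::fmax(worst, std::fabs(a[i] - b[i]));
    return worst;
}
```

A test would then pass when max_abs_diff(gpu_output, rmsnorm_ref(x, w, eps)) stays under a tolerance chosen for the kernel's precision; half-precision kernels need a looser bound than F32 ones.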

Running All Tests

# Unit + inference
make test MODEL=models/Qwen3-0.6B-Q4_0.gguf

Benchmark

make bench MODEL=models/Qwen3-0.6B-Q4_0.gguf

This runs akunu_bench with 512-token prefill and 128-token generation, repeated 3 times, reporting tokens/second in llama-bench format.

IDE Setup

Xcode

Xcode is the best IDE for akunu development because it has native Metal shader support, GPU debugging, and frame capture.

Generating an Xcode project from CMake:

mkdir -p build-xcode
cd build-xcode
cmake .. -G Xcode
open akunu.xcodeproj

This creates an Xcode project with all targets (library, tests, tools). However, it does not handle the Metal shader build – you still need make shaders from the command line.

Xcode scheme setup:

  1. Select the akunu_chat scheme for interactive testing
  2. Edit the scheme: Run > Arguments > add model path as first argument
  3. Edit the scheme: Run > Options > set Working Directory to project root
  4. Build and run with Cmd+R

Metal shader editing in Xcode:

Xcode provides syntax highlighting and basic error checking for .metal files. Open any .metal file from the project navigator. The Metal compiler runs in the background and shows errors inline.

For shader editing, you want the include paths configured. In the Xcode project, add these to the Metal compiler settings:

Header Search Paths: $(PROJECT_DIR)/backend/metal/kernels

Visual Studio Code

VS Code with the right extensions provides a solid alternative:

# Install recommended extensions
code --install-extension ms-vscode.cpptools
code --install-extension ms-vscode.cmake-tools
code --install-extension nickmass.metal-shader

Create .vscode/settings.json:

{
  "cmake.buildDirectory": "${workspaceFolder}/build",
  "cmake.configureArgs": ["-DCMAKE_BUILD_TYPE=Debug"],
  "C_Cpp.default.includePath": [
    "${workspaceFolder}/include",
    "${workspaceFolder}/src",
    "${workspaceFolder}/backend"
  ],
  "files.associations": {
    "*.metal": "metal"
  }
}

Create .vscode/tasks.json for shader builds:

{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "Build Shaders",
      "type": "shell",
      "command": "make shaders",
      "group": "build"
    },
    {
      "label": "Build All",
      "type": "shell",
      "command": "make",
      "group": {
        "kind": "build",
        "isDefault": true
      }
    }
  ]
}

CLion

CLion has excellent CMake integration. Open the project root directory and CLion will auto-detect the CMakeLists.txt. Add a custom build step for shaders:

  1. Run > Edit Configurations > select a run configuration > add a “Before launch” step that runs make shaders
  2. Or configure an External Tool for the shader build

Metal Debugger and GPU Profiling

Xcode GPU Frame Capture

The most powerful tool for debugging Metal shaders is Xcode’s GPU Frame Capture:

  1. Set the METAL_DEVICE_WRAPPER_TYPE environment variable:

    export METAL_DEVICE_WRAPPER_TYPE=1
    
  2. Run your akunu executable under Xcode

  3. Click the camera icon in the debug bar to capture a GPU frame

  4. Xcode shows every command buffer, compute encoder, and dispatch

This lets you inspect:

  • Buffer contents at any point in the pipeline
  • Shader execution time per dispatch
  • Thread occupancy and register pressure
  • Memory bandwidth utilization

Metal Validation Layer

Enable Metal API validation to catch buffer overflows, misaligned access, and other GPU programming errors:

export MTL_DEBUG_LAYER=1
export METAL_DEBUG_ERROR_MODE=assert

With validation enabled, Metal checks every API call and crashes immediately on misuse rather than producing silent corruption. This is essential during development but adds significant overhead – do not use it for benchmarking.
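
Because a forgotten export can quietly skew benchmark numbers, a benchmark tool could check the environment at startup. The sketch below uses hypothetical helper names and is not taken from akunu_bench. Treating "0" and the empty string as disabled is an assumption about how the variable is interpreted.

```cpp
#include <cstdlib>

// True when an environment value looks like an enabled switch: set,
// non-empty, and not exactly "0".
constexpr bool is_enabled_value(const char* value) {
    return value != nullptr && value[0] != '\0' &&
           !(value[0] == '0' && value[1] == '\0');
}

// Checks whether the validation layer variable shown above is set in the
// current process environment.
inline bool metal_validation_active() {
    return is_enabled_value(std::getenv("MTL_DEBUG_LAYER"));
}
```

A tool could print a warning, or refuse to report timings, when metal_validation_active() returns true.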

Metal Shader Debugging

For stepping through shader code line-by-line:

  1. In Xcode, select Debug > Attach to Process > your running akunu executable
  2. Enable GPU shader debugging: Product > Scheme > Edit Scheme > Run > Diagnostics > GPU Validation > Shader Validation
  3. Set a breakpoint in a .metal file
  4. When the breakpoint hits, you can inspect thread variables, buffer contents, and threadgroup memory

This is slow (100x+ overhead) but invaluable for correctness debugging.

Metal System Trace

For system-level GPU analysis:

# Record a 5-second trace
xctrace record --template 'Metal System Trace' \
  --output trace.trace \
  --time-limit 5s \
  --launch -- build/akunu_bench models/Qwen3-0.6B-Q4_0.gguf -n 32 -r 1

Open the trace in Instruments to see:

  • GPU timeline (which kernels ran when)
  • CPU-GPU synchronization points
  • Memory allocation patterns
  • Command buffer scheduling

The akunu_profile Tool

Akunu includes its own per-kernel profiling tool that does not require Xcode:

build/akunu_profile models/Qwen3-0.6B-Q4_0.gguf --tokens 5

This runs each dispatch command in its own command buffer (rather than the normal batched execution) and reports per-kernel GPU time. The output shows exactly which kernels dominate the forward pass:

  Decode Summary (5 tokens)
  ========================================
  embedding            0.012 ms    0.8%
  attention_norm       0.008 ms    0.5%
  qkv_gemv             0.142 ms    9.1%
  rope_kv_write        0.015 ms    1.0%
  flash_attention      0.098 ms    6.3%
  output_gemv          0.047 ms    3.0%
  ffn_norm             0.008 ms    0.5%
  gate_gemv            0.142 ms    9.1%
  up_gemv              0.142 ms    9.1%
  silu_gate            0.012 ms    0.8%
  down_gemv            0.142 ms    9.1%
  ... (per layer)
  logit_projection     0.350 ms   22.5%
  argmax               0.003 ms    0.2%
  ========================================
  Total per token:     1.56 ms
  Throughput:          641 t/s (single-token decode)

See Chapter 55 for a complete guide to profiling and benchmarking.
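
The percentage and throughput figures in the sample output follow from two small formulas, sketched here with hypothetical function names.

```cpp
// Share of per-token GPU time spent in one kernel, in percent.
constexpr double percent_of_total(double kernel_ms, double total_ms) {
    return 100.0 * kernel_ms / total_ms;
}

// Single-token decode throughput implied by the per-token latency.
constexpr double tokens_per_second(double ms_per_token) {
    return 1000.0 / ms_per_token;
}

// Checked against the sample output: 0.142 ms of 1.56 ms is about 9.1%,
// and 1.56 ms per token is about 641 tokens/s.
static_assert(percent_of_total(0.142, 1.56) > 9.09 &&
              percent_of_total(0.142, 1.56) < 9.11, "about 9.1%");
static_assert(tokens_per_second(1.56) > 641.0 &&
              tokens_per_second(1.56) < 641.1, "about 641 t/s");
```

Keep in mind that the profiler runs each dispatch in its own command buffer, so the per-token total it reports includes overhead that batched execution avoids.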

Quick Development Workflow

Here is the workflow most contributors use:

  +------------------+     +------------------+     +------------------+
  |  Edit code       |     |  Build           |     |  Test            |
  |  (.metal or .cpp)|---->|  make            |---->|  kernel test     |
  |                  |     |  (~5s rebuild)   |     |  or e2e test     |
  +------------------+     +------------------+     +------------------+
         ^                                                   |
         |                                                   |
         +---------------------------------------------------+
                        Fix and iterate

For Metal kernel work:

# 1. Edit your shader
vim backend/metal/kernels/metal/kernel/norm/rmsnorm.metal

# 2. Rebuild shaders only (~2s)
make shaders

# 3. Rebuild the test (~3s)
make engine

# 4. Run the specific kernel test
build/akunu_kernel_test_rmsnorm

For C++ engine work:

# 1. Edit your source
vim src/core/table_builder.cpp

# 2. Rebuild engine only (~3s)
make engine

# 3. Run relevant test
build/akunu_e2e models/Qwen3-0.6B-Q4_0.gguf "Hello" 0 10

For both (new kernel end-to-end):

# 1. Write the .metal file
# 2. Add params to ShaderTypes.h
# 3. Wire up in table_builder.cpp
# 4. Write the kernel test
# 5. Full rebuild + test
make && build/akunu_kernel_test_your_new_kernel

Downloading Test Models

Several tests and all benchmarks require model files. Here are the recommended test models by size:

Model                          Size     Use Case
-----------------------------  -------  ----------------------------
Qwen3-0.6B-Q4_0.gguf          ~400 MB  Default test model (fast)
Llama-3.2-1B-Instruct-Q4_0    ~700 MB  Test LLaMA architecture
Qwen3-4B-Q4_K_M.gguf          ~2.5 GB  Test larger models
whisper-base-en.bin            ~140 MB  Test Whisper transcription

Place models in the models/ directory at the project root:

mkdir -p models
# Download from HuggingFace or your preferred source
# Example using huggingface-cli:
huggingface-cli download Qwen/Qwen3-0.6B-GGUF \
  --include "Qwen3-0.6B-Q4_0.gguf" \
  --local-dir models/

The MODEL variable in the Makefile defaults to models/Qwen3-0.6B-Q4_0.gguf. You can override it:

make test-infer MODEL=models/Llama-3.2-1B-Instruct-Q4_0.gguf

Troubleshooting

“Metallib not found”

The kernel tests look for the metallib in several relative paths:

bool _ok = dev->load_library("../../.build/metallib/akunu.metallib");
if (!_ok) _ok = dev->load_library(".build/metallib/akunu.metallib");
if (!_ok) _ok = dev->load_library("../../../.build/metallib/akunu.metallib");

If none of these match your working directory, either:

  • Run tests from the project root: ./build/akunu_kernel_test_rmsnorm
  • Or set the metallib path explicitly (if the API supports it)

The simplest fix is to always run tests from the project root directory.
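
The lookup above is a first-match search over candidate paths. A compact way to express it, and to test it without a GPU, is sketched below. The helper and the Loader alias are hypothetical; in a real test the loader would wrap a call such as dev->load_library(path).

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Stand-in for the device's library loader.
using Loader = std::function<bool(const std::string&)>;

// Tries each candidate in order and returns the first path that loads, or an
// empty string if none do.
inline std::string find_metallib(const std::vector<std::string>& candidates,
                                 const Loader& try_load) {
    assert(try_load && "loader must be callable");
    for (const std::string& path : candidates) {
        if (try_load(path)) return path;
    }
    return std::string();
}
```

Returning the path that succeeded, rather than a bare bool, makes "Metallib not found" reports easier to diagnose because the test can log which location it actually used.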

CMake Cannot Find Metal Framework

CMake Error: Could not find framework Metal

This usually means the active developer directory points at the standalone Command Line Tools rather than the full Xcode installation. Point it at Xcode:

sudo xcode-select -s /Applications/Xcode.app/Contents/Developer

XGrammar Build Fails

If the XGrammar submodule fails to build:

# Ensure submodule is initialized
git submodule update --init --recursive

# If still failing, the build continues without grammar support
# (AKUNU_HAS_XGRAMMAR will be OFF)

Grammar-constrained decoding is optional. The core inference engine works fine without it.

Objective-C++ Compilation Errors

Several files (like metal_device.mm and test files that use Metal) are compiled as Objective-C++. CMake handles this automatically:

set_source_files_properties(tests/test_device.mm PROPERTIES LANGUAGE OBJCXX)

If you see errors about @interface or NSError, check that the file extension is .mm (not .cpp) or that CMake has the LANGUAGE OBJCXX property set.

Summary

Let us recap the essential commands:

# First-time setup
git clone --recursive https://github.com/prabod/akunu.git
cd akunu

# Full build
make

# Quick iterations
make shaders         # Metal only
make engine          # C++ only

# Tests
make test-unit                                    # No model needed
make test-infer MODEL=models/Qwen3-0.6B-Q4_0.gguf  # Needs model
build/akunu_kernel_test_rmsnorm                   # Single kernel test

# Benchmarks
make bench MODEL=models/Qwen3-0.6B-Q4_0.gguf
build/akunu_profile models/Qwen3-0.6B-Q4_0.gguf --tokens 5

# Debug
mkdir -p build-debug && cd build-debug
cmake .. -DCMAKE_BUILD_TYPE=Debug && make -j$(sysctl -n hw.ncpu)
export MTL_DEBUG_LAYER=1  # Metal validation

Your development environment is now ready. In the next chapter, we will dive deep into akunu’s testing infrastructure – the CPU reference implementations, the kernel test pattern, and how to write tests for new functionality.