Building and Running Akunu

This chapter covers the practical mechanics of building akunu from source and using its CLI tools. If you have been reading the previous chapters to understand how akunu works, this is where you roll up your sleeves and actually compile it. We will walk through the CMake configuration, build options, Metal shader compilation, and each of the CLI tools.

Prerequisites

Akunu requires:

  • macOS 13 (Ventura) or later – for Metal 3 support and Apple GPU family 7+
  • Xcode 15+ (or at least the Command Line Tools) – for the clang++ compiler with Objective-C++ support and the metal shader compiler
  • CMake 3.20+ – the build system
  • Apple Silicon Mac – M1 or later. Akunu’s Metal backend requires UMA and simdgroup_matrix support (GPU family 7+)
  • A model file – either GGUF format (from llama.cpp ecosystem) or MLX SafeTensors directory

Optional dependencies:

  • XGrammar (v0.1.33) – for grammar-constrained JSON output. Included as a git submodule at 3rdparty/xgrammar

Project Structure

akunu/
├── CMakeLists.txt           # Top-level build configuration
├── include/
│   └── akunu/
│       ├── akunu.h          # C API header
│       └── types.h          # Shared type definitions
├── src/
│   ├── akunu_api.cpp        # C API implementation
│   ├── core/                # Backend-agnostic core
│   │   ├── device.h         # Device abstraction
│   │   ├── dispatch_table.h # Precompiled command sequence
│   │   ├── table_builder.cpp
│   │   ├── arch_descriptor.h
│   │   ├── chip_config.h
│   │   ├── dtype_descriptor.h
│   │   └── ...
│   ├── weight/              # GGUF/MLX weight loading
│   ├── tokenizer/           # BPE tokenizer
│   ├── grammar/             # GBNF/JSON schema grammar
│   ├── inference/           # Decode loops, sampling
│   ├── cache/               # KV cache, scratch buffers
│   ├── server/              # HTTP server
│   ├── speculative/         # Speculative decoding
│   └── whisper/             # Whisper speech-to-text
├── backend/
│   └── metal/
│       ├── metal_device.h
│       ├── metal_device.mm  # ObjC++ Metal implementation
│       ├── metal_device_impl.h
│       └── kernels/         # .metal shader source files
├── tools/                   # CLI executables
│   ├── akunu_chat.cpp
│   ├── akunu_bench.cpp
│   ├── akunu_complete.cpp
│   ├── akunu_inspect.cpp
│   ├── akunu_profile.cpp
│   ├── akunu_benchmark.cpp
│   ├── akunu_serve.cpp
│   └── akunu_transcribe.cpp
├── tests/                   # Test executables
│   ├── kernels/             # Per-kernel correctness tests
│   └── ...
└── 3rdparty/
    └── xgrammar/            # Git submodule

CMake Configuration

The build is configured through CMakeLists.txt. Let’s walk through the key sections.

Language Standards

cmake_minimum_required(VERSION 3.20)
project(akunu VERSION 0.1 LANGUAGES CXX OBJCXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_OBJCXX_STANDARD 17)
set(CMAKE_OBJCXX_FLAGS "${CMAKE_OBJCXX_FLAGS} -fobjc-arc")

Akunu uses C++17 for both its C++ and Objective-C++ sources. The -fobjc-arc flag enables Automatic Reference Counting (ARC) for Objective-C objects – this is how MetalDevice manages Metal API objects (MTLDevice, MTLCommandBuffer, etc.) without manual retain/release.[1]

Backend Selection

option(AKUNU_BACKEND_METAL "Build Metal backend (Apple Silicon)" ON)
option(AKUNU_BACKEND_CUDA  "Build CUDA backend (NVIDIA)" OFF)

Metal is enabled by default. CUDA exists as a placeholder for future work. The backend selection determines which source files and frameworks are linked.

Core Sources

The core engine is pure C++ (no platform dependencies):

set(CORE_SOURCES
    src/weight/gguf_parser.cpp        # GGUF file format parser
    src/weight/weight_store.cpp       # Weight management + fusion
    src/weight/mlx_weight_store.cpp   # MLX SafeTensors parser
    src/core/table_builder.cpp        # Dispatch table construction
    src/core/device_defaults.cpp      # Default Device method implementations
    src/core/prefill.cpp              # Prefill (batched) forward pass
    src/tokenizer/tokenizer.cpp       # BPE tokenizer
    src/grammar/grammar.cpp           # GBNF grammar parser
    src/grammar/json_schema_to_grammar.cpp
    src/grammar/xgrammar_wrapper.cpp  # XGrammar integration
    src/inference/model_state.cpp     # Model lifecycle
    src/inference/sampling.cpp        # Temperature, top-k, top-p, min-p
    src/inference/model_loader.cpp    # Model loading orchestration
    src/inference/decode_greedy.cpp   # Greedy decode loop
    src/inference/decode_sampled.cpp  # Sampled decode loop
    src/inference/decode_speculative.cpp
    src/inference/decode_grammar.cpp  # Grammar-constrained decode
    src/inference/decode_loop.cpp     # Common decode infrastructure
    src/inference/embedding.cpp       # Text embedding (BERT)
    src/whisper/whisper_inference.cpp # Whisper transcription
    src/akunu_api.cpp                 # C API implementation
)

Metal Backend

When AKUNU_BACKEND_METAL is ON:

if(AKUNU_BACKEND_METAL)
    list(APPEND BACKEND_SOURCES backend/metal/metal_device.mm)
endif()

The Metal backend is a single Objective-C++ file (metal_device.mm) that implements the Device virtual interface.

Framework Linking

The Metal backend links five Apple frameworks:

target_link_libraries(akunu_engine PUBLIC
    "-framework Metal"                      # GPU compute
    "-framework MetalPerformanceShaders"    # (available for future use)
    "-framework Foundation"                 # NSObject, NSString, NSURL
    "-framework Accelerate"                 # vDSP (audio processing for Whisper)
    "-framework IOKit"                      # GPU core count detection
)
| Framework | Purpose in Akunu |
| --- | --- |
| Metal | Core GPU API: device, command buffers, compute pipelines |
| MetalPerformanceShaders | Linked but not actively used (available for optimized primitives) |
| Foundation | Objective-C runtime, file URLs, string conversion |
| Accelerate | vDSP for FFT/mel spectrogram in Whisper audio preprocessing |
| IOKit | IORegistryEntryCreateCFProperty to query gpu-core-count from AGXAccelerator |

XGrammar Integration

The XGrammar submodule provides grammar-constrained decoding:

set(XGRAMMAR_DIR "${CMAKE_CURRENT_SOURCE_DIR}/3rdparty/xgrammar")
if(EXISTS "${XGRAMMAR_DIR}/include/xgrammar/xgrammar.h")
    add_subdirectory(${XGRAMMAR_DIR} ${CMAKE_BINARY_DIR}/xgrammar EXCLUDE_FROM_ALL)
    set(AKUNU_HAS_XGRAMMAR ON)
endif()

If the submodule is not initialized, XGrammar is simply disabled and grammar-constrained generation will not be available. To enable it:

git submodule update --init --recursive

Shared Library for Bindings

option(AKUNU_BUILD_SHARED "Build shared library for language bindings" OFF)

When enabled, this builds libakunu.dylib in addition to the static libakunu_engine.a. The shared library exposes the C API (akunu.h) and can be loaded by Python, Swift, or any language with C FFI support.

Building from Source

Basic Build

cd ~/Projects/akunu
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(sysctl -n hw.ncpu)

This produces:

  • libakunu_engine.a – the static library
  • akunu_chat, akunu_bench, akunu_complete, etc. – CLI tools
  • akunu_test_*, akunu_kernel_* – test executables

Build with XGrammar

git submodule update --init --recursive
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(sysctl -n hw.ncpu)

The build system auto-detects XGrammar and sets AKUNU_HAS_XGRAMMAR=1.

Build Shared Library

cmake .. -DCMAKE_BUILD_TYPE=Release -DAKUNU_BUILD_SHARED=ON
make -j$(sysctl -n hw.ncpu)

Debug Build

cmake .. -DCMAKE_BUILD_TYPE=Debug
make -j$(sysctl -n hw.ncpu)

Debug builds enable assertions and disable optimizations. For Metal shader debugging, also enable the Metal validation layer:

export METAL_DEVICE_WRAPPER_TYPE=1
export MTL_DEBUG_LAYER=1
./akunu_chat model.gguf akunu.metallib

Metal Shader Compilation

The Metal shader sources live in backend/metal/kernels/. They must be compiled into a .metallib file before akunu can load them. The compilation pipeline is:

.metal source files
    │
    ▼ (metal compiler)
.air intermediate files
    │
    ▼ (metallib archiver)
akunu.metallib

The compilation command (not automated by CMake – you need to do this manually or via a script):

# Compile all .metal files to .air
xcrun -sdk macosx metal -c -target air64-apple-macos13.0 \
    -I backend/metal/kernels/metal/include \
    backend/metal/kernels/metal/kernel/**/*.metal \
    -o kernels.air

# Archive into metallib
xcrun -sdk macosx metallib kernels.air -o akunu.metallib

The -I flag adds the include directory for shared headers (ShaderTypes.h, KernelCommon.h) that are used across kernel files.
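The two commands above can also be wired into the CMake build so the metallib is regenerated whenever a shader changes. A sketch only – this is not part of akunu's actual CMakeLists.txt, and the target name and paths are illustrative:

```cmake
# Illustrative: rebuild akunu.metallib as part of the normal build.
file(GLOB_RECURSE METAL_SOURCES
     ${CMAKE_SOURCE_DIR}/backend/metal/kernels/metal/kernel/*.metal)

add_custom_command(
    OUTPUT  ${CMAKE_BINARY_DIR}/akunu.metallib
    COMMAND xcrun -sdk macosx metal -c -target air64-apple-macos13.0
            -I ${CMAKE_SOURCE_DIR}/backend/metal/kernels/metal/include
            ${METAL_SOURCES} -o ${CMAKE_BINARY_DIR}/kernels.air
    COMMAND xcrun -sdk macosx metallib ${CMAKE_BINARY_DIR}/kernels.air
            -o ${CMAKE_BINARY_DIR}/akunu.metallib
    DEPENDS ${METAL_SOURCES}
    COMMENT "Compiling Metal shaders into akunu.metallib")

add_custom_target(shaders ALL DEPENDS ${CMAKE_BINARY_DIR}/akunu.metallib)
```

The DEPENDS clause makes the metallib rebuild when any .metal source changes, which is the main thing the manual workflow lacks.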

The resulting akunu.metallib file contains all GPU kernels in compiled form. At runtime, MetalDevice::load_library() loads this file and individual kernels are extracted by name via get_pipeline().[2]

Kernel Organization

The Metal kernels are organized by category:

| Directory | Kernels | Count |
| --- | --- | --- |
| activation/ | SiLU, GELU, gated variants | ~4 |
| attention/ | Flash attention decode, softmax, logit cap | ~3 |
| common/ | Bias add, residual add, transpose, vector ops | ~5 |
| conv/ | Conv1D for Whisper frontend | 1 |
| convert/ | Dequantize (Q4_0, Q4_K, Q8_0, MLX, etc.) | ~12 |
| embedding/ | Embedding lookup per dtype | ~10 |
| fused/ | Fused SiLU+GEMV, fused head norm | ~2 |
| kv_cache/ | KV cache write, shift | ~2 |
| matmul/ | GEMV, wide GEMV, SIMD GEMM per dtype | ~50+ |
| norm/ | RMSNorm, LayerNorm, residual norm, head norm | ~5 |
| rope/ | RoPE (standard, NeoX), fused norm+RoPE | ~4 |
| sampling/ | Argmax, temperature scaling, top-k/p | ~4 |

Total: roughly 100+ kernel functions compiled into a single metallib.
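As a concrete example of what the convert/ kernels compute: in the llama.cpp Q4_0 convention, each block of 32 weights is stored as one FP16 scale d plus 32 packed 4-bit values q, and each weight is reconstructed as (q - 8) * d. The rule, evaluated on a few sample values with a made-up scale:

```shell
# Q4_0 dequantization rule on sample 4-bit values (d = 0.25 is illustrative).
# The GPU dequant kernels apply this same formula per packed weight.
awk 'BEGIN { d = 0.25; for (q = 0; q <= 15; q += 5) printf "q=%2d -> w=%g\n", q, (q - 8) * d }'
```

Since q ranges over 0–15, the offset of 8 centers the representable weights around zero.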

CLI Tools

akunu_chat

Interactive chat with a loaded model. Handles conversation formatting using the model’s native chat template.

./akunu_chat path/to/model.gguf path/to/akunu.metallib

Features:

  • Auto-detects chat template (ChatML, Llama 3, Gemma, etc.)
  • Multi-turn conversation with KV cache reuse
  • Streaming token output
  • System prompt support

akunu_bench

Performance benchmarking tool. Measures prefill and decode throughput.

./akunu_bench path/to/model.gguf path/to/akunu.metallib

Reports:

  • Prefill tokens/second (for various prompt lengths)
  • Decode tokens/second (steady-state generation)
  • Memory usage
  • Model configuration summary

akunu_complete

Text completion (non-chat). Takes a prompt and generates a continuation.

./akunu_complete path/to/model.gguf path/to/akunu.metallib

Useful for testing raw model behavior without chat formatting.

akunu_inspect

Model inspection tool. Dumps model metadata and tensor information.

./akunu_inspect path/to/model.gguf

Shows:

  • Architecture, vocabulary size, embedding dimension
  • Number of layers, heads, KV heads
  • RoPE configuration
  • Tensor names, shapes, and dtypes

akunu_profile

Per-layer GPU timing profiler. Runs each operation in its own command buffer for accurate timing.

./akunu_profile path/to/model.gguf path/to/akunu.metallib

Output shows per-operation GPU time in milliseconds, useful for identifying bottlenecks.

akunu_serve

HTTP API server with OpenAI-compatible endpoints.

./akunu_serve path/to/model.gguf path/to/akunu.metallib --port 8080

Provides:

  • /v1/chat/completions – streaming and non-streaming chat
  • /v1/completions – text completion
  • Multi-model support (load multiple models)
  • Concurrent request handling with per-model mutex
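A minimal request body for the chat endpoint, assuming the standard OpenAI field names (model, messages, stream); the exact set of supported fields depends on the server:

```shell
# Print a minimal OpenAI-style chat request body.
cat <<'EOF'
{"model": "default",
 "messages": [{"role": "user", "content": "Hello"}],
 "stream": false}
EOF
```

With akunu_serve running, a body like this can be POSTed with, e.g., `curl -s http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d @body.json`; setting "stream": true switches the response to server-sent events.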

akunu_transcribe

Speech-to-text using Whisper models.

./akunu_transcribe path/to/whisper-model.gguf path/to/akunu.metallib input.wav

Supports:

  • WAV input (resampled to 16kHz internally)
  • Language detection or forced language
  • Timestamp generation
  • Streaming segment callback

akunu_benchmark

Extended benchmarking tool with more detailed metrics.

./akunu_benchmark path/to/model.gguf path/to/akunu.metallib

Library Targets

The CMake build produces two main library targets:

| Target | Type | Contents |
| --- | --- | --- |
| akunu_engine | Static (libakunu_engine.a) | Core + backend, always built |
| akunu | Shared (libakunu.dylib) | Same, built when AKUNU_BUILD_SHARED=ON |

Both expose the same C API defined in include/akunu/akunu.h. The static library is used by all CLI tools and tests. The shared library is intended for language bindings.

Test Executables

The build produces numerous test executables:

Integration Tests

| Test | Purpose |
| --- | --- |
| akunu_test_device | Metal device creation, buffer allocation |
| akunu_test_weights | GGUF parsing, weight loading |
| akunu_test_table | Dispatch table construction |
| akunu_e2e | End-to-end generation test |
| akunu_test_long_context | Long context handling |
| akunu_test_sampling_quality | Sampling distribution tests |
| akunu_test_config | Model configuration parsing |
| akunu_test_tokenizer | Tokenizer encode/decode |
| akunu_test_inference | Inference pipeline |
| akunu_test_kv_cache | KV cache operations |
| akunu_test_grammar | Grammar parsing and constrained decoding |
| akunu_test_server | HTTP server endpoints |
| akunu_test_whisper | Whisper model loading |
| akunu_test_whisper_e2e | End-to-end transcription |

Kernel Tests

Individual kernel correctness tests (each compiled as ObjC++):

| Test | Kernel Under Test |
| --- | --- |
| akunu_kernel_test_rmsnorm | rmsnorm_f16 |
| akunu_kernel_test_gemma_rmsnorm | rmsnorm_gemma_f16 |
| akunu_kernel_test_gemv_f16 | gemv_f16 |
| akunu_kernel_test_gemv_q4_0 | gemv_q4_0 |
| akunu_kernel_test_gemv_q8_0 | gemv_q8_0 |
| akunu_kernel_test_gemm_f16 | simd_gemm_f16 |
| akunu_kernel_test_silu | silu_f16 |
| akunu_kernel_test_gelu | gelu_f16 |
| akunu_kernel_test_silu_gate | silu_gate_f16 |
| akunu_kernel_test_gelu_gate | gelu_gate_f16 |
| akunu_kernel_test_rope | rope_qkv_write_f16 |
| akunu_kernel_test_rope_neox | rope_neox_qkv_write_f16 |
| akunu_kernel_test_flash_attention | flash_attention_decode_parallel_f16 |
| akunu_kernel_test_embedding_f16 | embedding_lookup_f16 |
| akunu_kernel_test_f32_to_f16 | f32_to_f16 |
| akunu_kernel_test_dequant_q4_0 | dequant_q4_0 |

These tests compare GPU kernel output against CPU reference implementations to verify correctness within FP16 tolerance.
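The pass/fail criterion reduces to an elementwise comparison of GPU output against the CPU reference within a tolerance sized for FP16 precision. The idea, collapsed to a single value – both the sample numbers and the 0.01 tolerance are illustrative, not akunu's actual thresholds:

```shell
# Compare one GPU result against a CPU reference within an absolute tolerance.
gpu=0.3330; ref=0.3333; tol=0.01
awk -v a="$gpu" -v b="$ref" -v t="$tol" \
    'BEGIN { d = a - b; if (d < 0) d = -d; print ((d <= t) ? "PASS" : "FAIL") }'
```

The real tests apply this check across every element of the output tensor, often with a relative rather than absolute tolerance for large magnitudes.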

Troubleshooting

“Failed to load metallib”

The metallib path is incorrect or the file was compiled for a different target. Ensure:

  1. The metallib file exists at the specified path
  2. It was compiled with -target air64-apple-macos13.0 or later
  3. The Metal device supports the required GPU family

“Failed to get pipeline: kernel_name”

A kernel function is missing from the metallib. This usually means:

  1. The kernel source file was not included in the metallib compilation
  2. There is a naming mismatch between the kernel function name in Metal and the string in C++

“allocate: failed to allocate N bytes”

The model is too large for available memory. Options:

  1. Use a smaller quantization (Q4_0 instead of FP16)
  2. Reduce max_context to shrink KV cache
  3. Close other applications to free memory
  4. Use a chip with more unified memory
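To see why reducing max_context helps, note that an FP16 KV cache grows linearly with context length: 2 (K and V) × layers × KV heads × head dim × context × 2 bytes. With assumed shapes for a typical ~8B model (32 layers, 8 KV heads, head dim 128 – illustrative, not any specific model):

```shell
# FP16 KV cache footprint: 2 (K and V) * layers * kv_heads * head_dim * ctx * 2 bytes.
layers=32; kv_heads=8; head_dim=128; ctx=4096
kv_bytes=$((2 * layers * kv_heads * head_dim * ctx * 2))
echo "KV cache at ctx=$ctx: $((kv_bytes / 1024 / 1024)) MiB"
```

Halving max_context halves this footprint, which is often the quickest way to fit a model that is just over the memory limit.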

Build Errors with XGrammar

If XGrammar fails to build, you can disable it:

cmake .. -DCMAKE_BUILD_TYPE=Release
# XGrammar auto-disables if submodule is not initialized

Summary

Building akunu is a standard CMake workflow. The main moving parts are:

  1. CMake configuration with AKUNU_BACKEND_METAL=ON (default)
  2. Metal shader compilation into akunu.metallib (manual step)
  3. Framework linking for Metal, Foundation, Accelerate, IOKit
  4. CLI tools for chat, benchmark, profiling, serving, and transcription

The next chapter covers the C API that all these tools are built on.



  1. Apple, “Transitioning to ARC Release Notes.” ARC (Automatic Reference Counting) eliminates manual retain/release calls for Objective-C objects. The compiler inserts retain/release operations automatically. See https://developer.apple.com/library/archive/releasenotes/ObjectiveC/RN-TransitioningToARC/.

  2. Apple, “Building a Library with Metal’s Command-Line Tools.” The metal and metallib command-line tools compile .metal sources into .metallib archives. See https://developer.apple.com/documentation/metal/shader_libraries/building_a_shader_library_by_precompiling_source_files.