Building and Running Akunu
This chapter covers the practical mechanics of building akunu from source and using its CLI tools. If you have been reading the previous chapters to understand how akunu works, this is where you roll up your sleeves and actually compile it. We will walk through the CMake configuration, build options, Metal shader compilation, and each of the CLI tools.
Prerequisites
Akunu requires:
- macOS 13 (Ventura) or later – for Metal 3 support and Apple GPU family 7+
- Xcode 15+ (or at least the Command Line Tools) – for the clang++ compiler with Objective-C++ support and the metal shader compiler
- CMake 3.20+ – the build system
- Apple Silicon Mac – M1 or later. Akunu's Metal backend requires UMA and simdgroup_matrix support (GPU family 7+)
- A model file – either GGUF format (from the llama.cpp ecosystem) or an MLX SafeTensors directory
Optional dependencies:
- XGrammar (v0.1.33) – for grammar-constrained JSON output. Included as a git submodule at 3rdparty/xgrammar
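Before configuring the build, it can save time to confirm the toolchain is actually on PATH. A minimal check script (a convenience sketch, not part of the repository; the tool names follow the prerequisites above):

```shell
#!/bin/sh
# Prerequisite sanity check: report whether each required tool is on PATH.
# On macOS, xcrun ships with the Command Line Tools and locates the
# metal shader compiler through the SDK.
check() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "MISSING: $1"
  fi
}
check cmake
check git
check xcrun
# Uncomment on macOS to confirm the Metal compiler itself is reachable:
# xcrun -sdk macosx -f metal
```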
Project Structure
akunu/
├── CMakeLists.txt # Top-level build configuration
├── include/
│ └── akunu/
│ ├── akunu.h # C API header
│ └── types.h # Shared type definitions
├── src/
│ ├── akunu_api.cpp # C API implementation
│ ├── core/ # Backend-agnostic core
│ │ ├── device.h # Device abstraction
│ │ ├── dispatch_table.h # Precompiled command sequence
│ │ ├── table_builder.cpp
│ │ ├── arch_descriptor.h
│ │ ├── chip_config.h
│ │ ├── dtype_descriptor.h
│ │ └── ...
│ ├── weight/ # GGUF/MLX weight loading
│ ├── tokenizer/ # BPE tokenizer
│ ├── grammar/ # GBNF/JSON schema grammar
│ ├── inference/ # Decode loops, sampling
│ ├── cache/ # KV cache, scratch buffers
│ ├── server/ # HTTP server
│ ├── speculative/ # Speculative decoding
│ └── whisper/ # Whisper speech-to-text
├── backend/
│ └── metal/
│ ├── metal_device.h
│ ├── metal_device.mm # ObjC++ Metal implementation
│ ├── metal_device_impl.h
│ └── kernels/ # .metal shader source files
├── tools/ # CLI executables
│ ├── akunu_chat.cpp
│ ├── akunu_bench.cpp
│ ├── akunu_complete.cpp
│ ├── akunu_inspect.cpp
│ ├── akunu_profile.cpp
│ ├── akunu_benchmark.cpp
│ ├── akunu_serve.cpp
│ └── akunu_transcribe.cpp
├── tests/ # Test executables
│ ├── kernels/ # Per-kernel correctness tests
│ └── ...
└── 3rdparty/
└── xgrammar/ # Git submodule
CMake Configuration
The build is configured through CMakeLists.txt. Let’s walk through the key sections.
Language Standards
cmake_minimum_required(VERSION 3.20)
project(akunu VERSION 0.1 LANGUAGES CXX OBJCXX)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_OBJCXX_STANDARD 17)
set(CMAKE_OBJCXX_FLAGS "${CMAKE_OBJCXX_FLAGS} -fobjc-arc")
Akunu uses C++17 and Objective-C++ 17. The -fobjc-arc flag enables Automatic Reference Counting for Objective-C objects – this is how MetalDevice manages Metal API objects (MTLDevice, MTLCommandBuffer, etc.) without manual retain/release.1
Backend Selection
option(AKUNU_BACKEND_METAL "Build Metal backend (Apple Silicon)" ON)
option(AKUNU_BACKEND_CUDA "Build CUDA backend (NVIDIA)" OFF)
Metal is enabled by default. CUDA exists as a placeholder for future work. The backend selection determines which source files and frameworks are linked.
Core Sources
The core engine is pure C++ (no platform dependencies):
set(CORE_SOURCES
src/weight/gguf_parser.cpp # GGUF file format parser
src/weight/weight_store.cpp # Weight management + fusion
src/weight/mlx_weight_store.cpp # MLX SafeTensors parser
src/core/table_builder.cpp # Dispatch table construction
src/core/device_defaults.cpp # Default Device method implementations
src/core/prefill.cpp # Prefill (batched) forward pass
src/tokenizer/tokenizer.cpp # BPE tokenizer
src/grammar/grammar.cpp # GBNF grammar parser
src/grammar/json_schema_to_grammar.cpp
src/grammar/xgrammar_wrapper.cpp # XGrammar integration
src/inference/model_state.cpp # Model lifecycle
src/inference/sampling.cpp # Temperature, top-k, top-p, min-p
src/inference/model_loader.cpp # Model loading orchestration
src/inference/decode_greedy.cpp # Greedy decode loop
src/inference/decode_sampled.cpp # Sampled decode loop
src/inference/decode_speculative.cpp
src/inference/decode_grammar.cpp # Grammar-constrained decode
src/inference/decode_loop.cpp # Common decode infrastructure
src/inference/embedding.cpp # Text embedding (BERT)
src/whisper/whisper_inference.cpp # Whisper transcription
src/akunu_api.cpp # C API implementation
)
Metal Backend
When AKUNU_BACKEND_METAL is ON:
if(AKUNU_BACKEND_METAL)
list(APPEND BACKEND_SOURCES backend/metal/metal_device.mm)
endif()
The Metal backend is a single Objective-C++ file (metal_device.mm) that implements the Device virtual interface.
Framework Linking
The Metal backend links five Apple frameworks:
target_link_libraries(akunu_engine PUBLIC
"-framework Metal" # GPU compute
"-framework MetalPerformanceShaders" # (available for future use)
"-framework Foundation" # NSObject, NSString, NSURL
"-framework Accelerate" # vDSP (audio processing for Whisper)
"-framework IOKit" # GPU core count detection
)
| Framework | Purpose in Akunu |
|---|---|
| Metal | Core GPU API: device, command buffers, compute pipelines |
| MetalPerformanceShaders | Linked but not actively used (available for optimized primitives) |
| Foundation | Objective-C runtime, file URLs, string conversion |
| Accelerate | vDSP for FFT/mel spectrogram in Whisper audio preprocessing |
| IOKit | IORegistryEntryCreateCFProperty to query gpu-core-count from AGXAccelerator |
XGrammar Integration
The XGrammar submodule provides grammar-constrained decoding:
set(XGRAMMAR_DIR "${CMAKE_CURRENT_SOURCE_DIR}/3rdparty/xgrammar")
if(EXISTS "${XGRAMMAR_DIR}/include/xgrammar/xgrammar.h")
add_subdirectory(${XGRAMMAR_DIR} ${CMAKE_BINARY_DIR}/xgrammar EXCLUDE_FROM_ALL)
set(AKUNU_HAS_XGRAMMAR ON)
endif()
If the submodule is not initialized, XGrammar is simply disabled and grammar-constrained generation will not be available. To enable it:
git submodule update --init --recursive
Shared Library for Bindings
option(AKUNU_BUILD_SHARED "Build shared library for language bindings" OFF)
When enabled, this builds libakunu.dylib in addition to the static libakunu_engine.a. The shared library exposes the C API (akunu.h) and can be loaded by Python, Swift, or any language with C FFI support.
Building from Source
Basic Build
cd ~/Projects/akunu
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(sysctl -n hw.ncpu)
This produces:
- libakunu_engine.a – the static library
- akunu_chat, akunu_bench, akunu_complete, etc. – CLI tools
- akunu_test_*, akunu_kernel_* – test executables
Build with XGrammar
git submodule update --init --recursive
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(sysctl -n hw.ncpu)
The build system auto-detects XGrammar and sets AKUNU_HAS_XGRAMMAR=1.
Build Shared Library
cmake .. -DCMAKE_BUILD_TYPE=Release -DAKUNU_BUILD_SHARED=ON
make -j$(sysctl -n hw.ncpu)
Debug Build
cmake .. -DCMAKE_BUILD_TYPE=Debug
make -j$(sysctl -n hw.ncpu)
Debug builds enable assertions and disable optimizations. For Metal shader debugging, also enable the Metal validation layer:
export METAL_DEVICE_WRAPPER_TYPE=1
export MTL_DEBUG_LAYER=1
./akunu_chat model.gguf akunu.metallib
Metal Shader Compilation
The Metal shader sources live in backend/metal/kernels/. They must be compiled into a .metallib file before akunu can load them. The compilation pipeline is:
.metal source files
│
▼ (metal compiler)
.air intermediate files
│
▼ (metallib archiver)
akunu.metallib
The compilation command (not automated by CMake – you need to do this manually or via a script):
# Compile all .metal files to .air
xcrun -sdk macosx metal -c -target air64-apple-macos13.0 \
-I backend/metal/kernels/metal/include \
backend/metal/kernels/metal/kernel/**/*.metal \
-o kernels.air
# Archive into metallib
xcrun -sdk macosx metallib kernels.air -o akunu.metallib
The -I flag adds the include directory for shared headers (ShaderTypes.h, KernelCommon.h) that are used across kernel files.
The resulting akunu.metallib file contains all GPU kernels in compiled form. At runtime, MetalDevice::load_library() loads this file and individual kernels are extracted by name via get_pipeline().2
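Since CMake does not automate this step, a small wrapper script is convenient. The sketch below is not part of the repository and assumes the directory layout shown earlier; it compiles each .metal source to its own .air and then archives them (the **/ glob in the command above needs bash's globstar, so find is used instead). DRY_RUN=1 prints the commands rather than executing them, which lets the logic be exercised without Xcode installed; the demo at the bottom dry-runs it over a scratch tree.

```shell
#!/bin/sh
# Sketch of a metallib build script (not part of the repo; layout assumed
# from the project structure above). DRY_RUN=1 echoes commands instead of
# running them.
set -eu

run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$@"; else "$@"; fi; }

build_metallib() {
  kernel_dir=$1; out=$2; airs=""
  for src in $(find "$kernel_dir/kernel" -name '*.metal' | sort); do
    air="${src%.metal}.air"
    run xcrun -sdk macosx metal -c -target air64-apple-macos13.0 \
        -I "$kernel_dir/include" "$src" -o "$air"
    airs="$airs $air"
  done
  run xcrun -sdk macosx metallib $airs -o "$out"
}

# Demo: dry-run over a scratch tree standing in for backend/metal/kernels/metal
demo=$(mktemp -d)
mkdir -p "$demo/kernel/norm" "$demo/include"
: > "$demo/kernel/norm/rmsnorm.metal"
DRY_RUN=1
build_metallib "$demo" "$demo/akunu.metallib"
```

For a real build, run it from the repository root without DRY_RUN, e.g. build_metallib backend/metal/kernels/metal akunu.metallib.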
Kernel Organization
The Metal kernels are organized by category:
| Directory | Kernels | Count |
|---|---|---|
| activation/ | SiLU, GELU, gated variants | ~4 |
| attention/ | Flash attention decode, softmax, logit cap | ~3 |
| common/ | Bias add, residual add, transpose, vector ops | ~5 |
| conv/ | Conv1D for Whisper frontend | 1 |
| convert/ | Dequantize (Q4_0, Q4_K, Q8_0, MLX, etc.) | ~12 |
| embedding/ | Embedding lookup per dtype | ~10 |
| fused/ | Fused SiLU+GEMV, fused head norm | ~2 |
| kv_cache/ | KV cache write, shift | ~2 |
| matmul/ | GEMV, wide GEMV, SIMD GEMM per dtype | ~50+ |
| norm/ | RMSNorm, LayerNorm, residual norm, head norm | ~5 |
| rope/ | RoPE (standard, NeoX), fused norm+RoPE | ~4 |
| sampling/ | Argmax, temperature scaling, top-k/p | ~4 |
Total: roughly 100+ kernel functions compiled into a single metallib.
CLI Tools
akunu_chat
Interactive chat with a loaded model. Handles conversation formatting using the model’s native chat template.
./akunu_chat path/to/model.gguf path/to/akunu.metallib
Features:
- Auto-detects chat template (ChatML, Llama 3, Gemma, etc.)
- Multi-turn conversation with KV cache reuse
- Streaming token output
- System prompt support
akunu_bench
Performance benchmarking tool. Measures prefill and decode throughput.
./akunu_bench path/to/model.gguf path/to/akunu.metallib
Reports:
- Prefill tokens/second (for various prompt lengths)
- Decode tokens/second (steady-state generation)
- Memory usage
- Model configuration summary
akunu_complete
Text completion (non-chat). Takes a prompt and generates a continuation.
./akunu_complete path/to/model.gguf path/to/akunu.metallib
Useful for testing raw model behavior without chat formatting.
akunu_inspect
Model inspection tool. Dumps model metadata and tensor information.
./akunu_inspect path/to/model.gguf
Shows:
- Architecture, vocabulary size, embedding dimension
- Number of layers, heads, KV heads
- RoPE configuration
- Tensor names, shapes, and dtypes
akunu_profile
Per-layer GPU timing profiler. Runs each operation in its own command buffer for accurate timing.
./akunu_profile path/to/model.gguf path/to/akunu.metallib
Output shows per-operation GPU time in milliseconds, useful for identifying bottlenecks.
akunu_serve
HTTP API server with OpenAI-compatible endpoints.
./akunu_serve path/to/model.gguf path/to/akunu.metallib --port 8080
Provides:
- /v1/chat/completions – streaming and non-streaming chat
- /v1/completions – text completion
- Multi-model support (load multiple models)
- Concurrent request handling with per-model mutex
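A request against a running server looks like a standard OpenAI-style chat call. The snippet below only builds and prints the JSON body; the model name, port, and exact field set the server accepts are assumptions following the OpenAI convention the endpoints emulate, and the curl line is commented out so the sketch has no side effects.

```shell
#!/bin/sh
set -eu
# Hypothetical chat request body; field names follow the OpenAI
# chat-completions convention that the endpoints above emulate.
BODY='{
  "model": "default",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"}
  ],
  "stream": false
}'
printf '%s\n' "$BODY"
# With akunu_serve running on port 8080:
# curl -s http://localhost:8080/v1/chat/completions \
#   -H 'Content-Type: application/json' -d "$BODY"
```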
akunu_transcribe
Speech-to-text using Whisper models.
./akunu_transcribe path/to/whisper-model.gguf path/to/akunu.metallib input.wav
Supports:
- WAV input (resampled to 16kHz internally)
- Language detection or forced language
- Timestamp generation
- Streaming segment callback
akunu_benchmark
Extended benchmarking tool with more detailed metrics.
./akunu_benchmark path/to/model.gguf path/to/akunu.metallib
Library Targets
The CMake build produces two main library targets:
| Target | Type | Contents |
|---|---|---|
| akunu_engine | Static (libakunu_engine.a) | Core + backend, always built |
| akunu | Shared (libakunu.dylib) | Same code, built when AKUNU_BUILD_SHARED=ON |
Both expose the same C API defined in include/akunu/akunu.h. The static library is used by all CLI tools and tests. The shared library is intended for language bindings.
Test Executables
The build produces numerous test executables:
Integration Tests
| Test | Purpose |
|---|---|
| akunu_test_device | Metal device creation, buffer allocation |
| akunu_test_weights | GGUF parsing, weight loading |
| akunu_test_table | Dispatch table construction |
| akunu_e2e | End-to-end generation test |
| akunu_test_long_context | Long context handling |
| akunu_test_sampling_quality | Sampling distribution tests |
| akunu_test_config | Model configuration parsing |
| akunu_test_tokenizer | Tokenizer encode/decode |
| akunu_test_inference | Inference pipeline |
| akunu_test_kv_cache | KV cache operations |
| akunu_test_grammar | Grammar parsing and constrained decoding |
| akunu_test_server | HTTP server endpoints |
| akunu_test_whisper | Whisper model loading |
| akunu_test_whisper_e2e | End-to-end transcription |
Kernel Tests
Individual kernel correctness tests (each compiled as ObjC++):
| Test | Kernel Under Test |
|---|---|
| akunu_kernel_test_rmsnorm | rmsnorm_f16 |
| akunu_kernel_test_gemma_rmsnorm | rmsnorm_gemma_f16 |
| akunu_kernel_test_gemv_f16 | gemv_f16 |
| akunu_kernel_test_gemv_q4_0 | gemv_q4_0 |
| akunu_kernel_test_gemv_q8_0 | gemv_q8_0 |
| akunu_kernel_test_gemm_f16 | simd_gemm_f16 |
| akunu_kernel_test_silu | silu_f16 |
| akunu_kernel_test_gelu | gelu_f16 |
| akunu_kernel_test_silu_gate | silu_gate_f16 |
| akunu_kernel_test_gelu_gate | gelu_gate_f16 |
| akunu_kernel_test_rope | rope_qkv_write_f16 |
| akunu_kernel_test_rope_neox | rope_neox_qkv_write_f16 |
| akunu_kernel_test_flash_attention | flash_attention_decode_parallel_f16 |
| akunu_kernel_test_embedding_f16 | embedding_lookup_f16 |
| akunu_kernel_test_f32_to_f16 | f32_to_f16 |
| akunu_kernel_test_dequant_q4_0 | dequant_q4_0 |
These tests compare GPU kernel output against CPU reference implementations to verify correctness within FP16 tolerance.
Troubleshooting
“Failed to load metallib”
The metallib path is incorrect or the file was compiled for a different target. Ensure:
- The metallib file exists at the specified path
- It was compiled with -target air64-apple-macos13.0 or later
- The Metal device supports the required GPU family
“Failed to get pipeline: kernel_name”
A kernel function is missing from the metallib. This usually means:
- The kernel source file was not included in the metallib compilation
- There is a naming mismatch between the kernel function name in Metal and the string in C++
“allocate: failed to allocate N bytes”
The model is too large for available memory. Options:
- Use a smaller quantization (Q4_0 instead of FP16)
- Reduce max_context to shrink the KV cache
- Close other applications to free memory
- Use a chip with more unified memory
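To see why max_context matters, a back-of-envelope estimate helps: the KV cache stores one key vector and one value vector per layer, per KV head, per position. The numbers below are for a hypothetical 32-layer model with 8 KV heads of dimension 128 and an FP16 cache; akunu's actual cache layout and dtype may differ, so treat this as an order-of-magnitude sketch.

```shell
#!/bin/sh
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem
# Example config is hypothetical (Llama-style 8B with GQA, FP16 cache).
layers=32 kv_heads=8 head_dim=128 ctx=8192 bytes=2
total=$(( 2 * layers * kv_heads * head_dim * ctx * bytes ))
echo "KV cache at ctx=$ctx: $(( total / 1024 / 1024 )) MiB"
# The cost is linear in context length, so halving max_context halves it:
echo "KV cache at ctx=$(( ctx / 2 )): $(( total / 2 / 1024 / 1024 )) MiB"
```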
Build Errors with XGrammar
If XGrammar fails to build, you can disable it:
cmake .. -DCMAKE_BUILD_TYPE=Release
# XGrammar auto-disables if submodule is not initialized
Summary
Building akunu is a standard CMake workflow. The main moving parts are:
- CMake configuration with AKUNU_BACKEND_METAL=ON (the default)
- Metal shader compilation into akunu.metallib (a manual step)
- Framework linking for Metal, Foundation, Accelerate, IOKit
- CLI tools for chat, benchmark, profiling, serving, and transcription
The next chapter covers the C API that all these tools are built on.
1. Apple, “Transitioning to ARC Release Notes.” ARC (Automatic Reference Counting) eliminates manual retain/release calls for Objective-C objects; the compiler inserts the retain/release operations automatically. See https://developer.apple.com/library/archive/releasenotes/ObjectiveC/RN-TransitioningToARC/. ↩
2. Apple, “Building a Library with Metal’s Command-Line Tools.” The metal and metallib command-line tools compile .metal sources into .metallib archives. See https://developer.apple.com/documentation/metal/shader_libraries/building_a_shader_library_by_precompiling_source_files. ↩