Building and Running Akunu

This chapter covers the practical mechanics of building akunu from source and using its CLI tools. If you have been reading the previous chapters to understand how akunu works, this is where you roll up your sleeves and actually compile it. We will walk through the CMake configuration, build options, Metal shader compilation, and each of the CLI tools.

Prerequisites

Akunu requires:

  • macOS 13 (Ventura) or later – for Metal 3 support and Apple GPU family 7+
  • Xcode 15+ (or at least the Command Line Tools) – for the clang++ compiler with Objective-C++ support and the metal shader compiler
  • CMake 3.20+ – the build system
  • Apple Silicon Mac – M1 or later. Akunu’s Metal backend requires UMA and simdgroup_matrix support (GPU family 7+)
  • A model file – either GGUF format (from llama.cpp ecosystem) or MLX SafeTensors directory

Optional dependencies:

  • XGrammar (v0.1.33) – for grammar-constrained JSON output. Included as a git submodule at 3rdparty/xgrammar

Project Structure

akunu/
├── CMakeLists.txt           # Top-level build configuration
├── include/
│   └── akunu/
│       ├── akunu.h          # C API header
│       └── types.h          # Shared type definitions
├── src/
│   ├── akunu_api.cpp        # C API implementation
│   ├── core/                # Backend-agnostic core
│   │   ├── device.h         # Device abstraction
│   │   ├── dispatch_table.h # Precompiled command sequence
│   │   ├── table_builder.cpp
│   │   ├── arch_descriptor.h
│   │   ├── chip_config.h
│   │   ├── dtype_descriptor.h
│   │   └── ...
│   ├── weight/              # GGUF/MLX weight loading
│   ├── tokenizer/           # BPE tokenizer
│   ├── grammar/             # GBNF/JSON schema grammar
│   ├── inference/           # Decode loops, sampling
│   ├── cache/               # KV cache, scratch buffers
│   ├── server/              # HTTP server
│   ├── speculative/         # Speculative decoding
│   └── whisper/             # Whisper speech-to-text
├── backend/
│   └── metal/
│       ├── metal_device.h
│       ├── metal_device.mm  # ObjC++ Metal implementation
│       ├── metal_device_impl.h
│       └── kernels/         # .metal shader source files
├── tools/                   # CLI executables
│   ├── akunu_chat.cpp
│   ├── akunu_bench.cpp
│   ├── akunu_complete.cpp
│   ├── akunu_inspect.cpp
│   ├── akunu_profile.cpp
│   ├── akunu_benchmark.cpp
│   ├── akunu_serve.cpp
│   └── akunu_transcribe.cpp
├── tests/                   # Test executables
│   ├── kernels/             # Per-kernel correctness tests
│   └── ...
└── 3rdparty/
    └── xgrammar/            # Git submodule

CMake Configuration

The build is configured through CMakeLists.txt. Let’s walk through the key sections.

Language Standards

cmake_minimum_required(VERSION 3.20)
project(akunu VERSION 0.1 LANGUAGES CXX OBJCXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_OBJCXX_STANDARD 17)
set(CMAKE_OBJCXX_FLAGS "${CMAKE_OBJCXX_FLAGS} -fobjc-arc")

Akunu uses C++17 for both its C++ and Objective-C++ sources. The -fobjc-arc flag enables Automatic Reference Counting (ARC) for Objective-C objects – this is how MetalDevice manages Metal API objects (MTLDevice, MTLCommandBuffer, etc.) without manual retain/release.[1]

Backend Selection

option(AKUNU_BACKEND_METAL "Build Metal backend (Apple Silicon)" ON)
option(AKUNU_BACKEND_CUDA  "Build CUDA backend (NVIDIA)" OFF)

Metal is enabled by default. CUDA exists as a placeholder for future work. The backend selection determines which source files and frameworks are linked.

Core Sources

The core engine is pure C++ (no platform dependencies):

set(CORE_SOURCES
    src/weight/gguf_parser.cpp        # GGUF file format parser
    src/weight/weight_store.cpp       # Weight management + fusion
    src/weight/mlx_weight_store.cpp   # MLX SafeTensors parser
    src/core/table_builder.cpp        # Dispatch table construction
    src/core/device_defaults.cpp      # Default Device method implementations
    src/core/prefill.cpp              # Prefill (batched) forward pass
    src/tokenizer/tokenizer.cpp       # BPE tokenizer
    src/grammar/grammar.cpp           # GBNF grammar parser
    src/grammar/json_schema_to_grammar.cpp
    src/grammar/xgrammar_wrapper.cpp  # XGrammar integration
    src/inference/model_state.cpp     # Model lifecycle
    src/inference/sampling.cpp        # Temperature, top-k, top-p, min-p
    src/inference/model_loader.cpp    # Model loading orchestration
    src/inference/decode_greedy.cpp   # Greedy decode loop
    src/inference/decode_sampled.cpp  # Sampled decode loop
    src/inference/decode_speculative.cpp
    src/inference/decode_grammar.cpp  # Grammar-constrained decode
    src/inference/decode_loop.cpp     # Common decode infrastructure
    src/inference/embedding.cpp       # Text embedding (BERT)
    src/whisper/whisper_inference.cpp # Whisper transcription
    src/akunu_api.cpp                 # C API implementation
)

Metal Backend

When AKUNU_BACKEND_METAL is ON:

if(AKUNU_BACKEND_METAL)
    list(APPEND BACKEND_SOURCES backend/metal/metal_device.mm)
endif()

The Metal backend is a single Objective-C++ file (metal_device.mm) that implements the Device virtual interface.

Framework Linking

The Metal backend links five Apple frameworks:

target_link_libraries(akunu_engine PUBLIC
    "-framework Metal"                      # GPU compute
    "-framework MetalPerformanceShaders"    # (available for future use)
    "-framework Foundation"                 # NSObject, NSString, NSURL
    "-framework Accelerate"                 # vDSP (audio processing for Whisper)
    "-framework IOKit"                      # GPU core count detection
)
| Framework | Purpose in Akunu |
| --- | --- |
| Metal | Core GPU API: device, command buffers, compute pipelines |
| MetalPerformanceShaders | Linked but not actively used (available for optimized primitives) |
| Foundation | Objective-C runtime, file URLs, string conversion |
| Accelerate | vDSP for FFT/mel spectrogram in Whisper audio preprocessing |
| IOKit | IORegistryEntryCreateCFProperty to query gpu-core-count from AGXAccelerator |

XGrammar Integration

The XGrammar submodule provides grammar-constrained decoding:

set(XGRAMMAR_DIR "${CMAKE_CURRENT_SOURCE_DIR}/3rdparty/xgrammar")
if(EXISTS "${XGRAMMAR_DIR}/include/xgrammar/xgrammar.h")
    add_subdirectory(${XGRAMMAR_DIR} ${CMAKE_BINARY_DIR}/xgrammar EXCLUDE_FROM_ALL)
    set(AKUNU_HAS_XGRAMMAR ON)
endif()

If the submodule is not initialized, XGrammar is simply disabled and grammar-constrained generation will not be available. To enable it:

git submodule update --init --recursive

Shared Library for Bindings

option(AKUNU_BUILD_SHARED "Build shared library for language bindings" OFF)

When enabled, this builds libakunu.dylib in addition to the static libakunu_engine.a. The shared library exposes the C API (akunu.h) and can be loaded by Python, Swift, or any language with C FFI support.

Building from Source

Basic Build

cd ~/Projects/akunu
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(sysctl -n hw.ncpu)

This produces:

  • libakunu_engine.a – the static library
  • akunu_chat, akunu_bench, akunu_complete, etc. – CLI tools
  • akunu_test_*, akunu_kernel_* – test executables

Build with XGrammar

git submodule update --init --recursive
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(sysctl -n hw.ncpu)

The build system auto-detects XGrammar and sets AKUNU_HAS_XGRAMMAR=1.

Build Shared Library

cmake .. -DCMAKE_BUILD_TYPE=Release -DAKUNU_BUILD_SHARED=ON
make -j$(sysctl -n hw.ncpu)

Debug Build

cmake .. -DCMAKE_BUILD_TYPE=Debug
make -j$(sysctl -n hw.ncpu)

Debug builds enable assertions and disable optimizations. For Metal shader debugging, also enable the Metal validation layer:

export METAL_DEVICE_WRAPPER_TYPE=1
export MTL_DEBUG_LAYER=1
./akunu_chat model.gguf akunu.metallib

Metal Shader Compilation

The Metal shader sources live in backend/metal/kernels/. They must be compiled into a .metallib file before akunu can load them. The compilation pipeline is:

.metal source files
    │
    ▼ (metal compiler)
.air intermediate files
    │
    ▼ (metallib archiver)
akunu.metallib

The compilation command (not automated by CMake – you need to do this manually or via a script):

# Compile all .metal files to .air
xcrun -sdk macosx metal -c -target air64-apple-macos13.0 \
    -I backend/metal/kernels/metal/include \
    backend/metal/kernels/metal/kernel/**/*.metal \
    -o kernels.air

# Archive into metallib
xcrun -sdk macosx metallib kernels.air -o akunu.metallib

The -I flag adds the include directory for shared headers (ShaderTypes.h, KernelCommon.h) that are used across kernel files.
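The two commands above can also be wired into the CMake build so the metallib is regenerated whenever a shader changes. A sketch only – this is not part of akunu's actual CMakeLists.txt, and the target name and paths are illustrative:

```cmake
# Illustrative: rebuild akunu.metallib as part of the normal build.
file(GLOB_RECURSE METAL_SOURCES
     ${CMAKE_SOURCE_DIR}/backend/metal/kernels/metal/kernel/*.metal)

add_custom_command(
    OUTPUT  ${CMAKE_BINARY_DIR}/akunu.metallib
    COMMAND xcrun -sdk macosx metal -c -target air64-apple-macos13.0
            -I ${CMAKE_SOURCE_DIR}/backend/metal/kernels/metal/include
            ${METAL_SOURCES} -o ${CMAKE_BINARY_DIR}/kernels.air
    COMMAND xcrun -sdk macosx metallib ${CMAKE_BINARY_DIR}/kernels.air
            -o ${CMAKE_BINARY_DIR}/akunu.metallib
    DEPENDS ${METAL_SOURCES}
    COMMENT "Compiling Metal shaders into akunu.metallib")

add_custom_target(shaders ALL DEPENDS ${CMAKE_BINARY_DIR}/akunu.metallib)
```

The DEPENDS clause makes the metallib rebuild when any .metal source changes, which is the main thing the manual workflow lacks.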

The resulting akunu.metallib file contains all GPU kernels in compiled form. At runtime, MetalDevice::load_library() loads this file and individual kernels are extracted by name via get_pipeline().[2]

Kernel Organization

The Metal kernels are organized by category:

| Directory | Kernels | Count |
| --- | --- | --- |
| activation/ | SiLU, GELU, gated variants | ~4 |
| attention/ | Flash attention decode, softmax, logit cap | ~3 |
| common/ | Bias add, residual add, transpose, vector ops | ~5 |
| conv/ | Conv1D for Whisper frontend | 1 |
| convert/ | Dequantize (Q4_0, Q4_K, Q8_0, MLX, etc.) | ~12 |
| embedding/ | Embedding lookup per dtype | ~10 |
| fused/ | Fused SiLU+GEMV, fused head norm | ~2 |
| kv_cache/ | KV cache write, shift | ~2 |
| matmul/ | GEMV, wide GEMV, SIMD GEMM per dtype | ~50+ |
| norm/ | RMSNorm, LayerNorm, residual norm, head norm | ~5 |
| rope/ | RoPE (standard, NeoX), fused norm+RoPE | ~4 |
| sampling/ | Argmax, temperature scaling, top-k/p | ~4 |

Total: roughly 100+ kernel functions compiled into a single metallib.
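As a concrete example of what the convert/ kernels compute: in the llama.cpp Q4_0 convention, each block of 32 weights is stored as one FP16 scale d plus 32 packed 4-bit values q, and each weight is reconstructed as (q - 8) * d. The rule, evaluated on a few sample values with a made-up scale:

```shell
# Q4_0 dequantization rule on sample 4-bit values (d = 0.25 is illustrative).
# The GPU dequant kernels apply this same formula per packed weight.
awk 'BEGIN { d = 0.25; for (q = 0; q <= 15; q += 5) printf "q=%2d -> w=%g\n", q, (q - 8) * d }'
```

Since q ranges over 0–15, the offset of 8 centers the representable weights around zero.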

CLI Tools

akunu_chat

Interactive chat with a loaded model. Handles conversation formatting using the model’s native chat template.

./akunu_chat path/to/model.gguf path/to/akunu.metallib

Features:

  • Auto-detects chat template (ChatML, Llama 3, Gemma, etc.)
  • Multi-turn conversation with KV cache reuse
  • Streaming token output
  • System prompt support

akunu_bench

Performance benchmarking tool. Measures prefill and decode throughput.

./akunu_bench path/to/model.gguf path/to/akunu.metallib

Reports:

  • Prefill tokens/second (for various prompt lengths)
  • Decode tokens/second (steady-state generation)
  • Memory usage
  • Model configuration summary

akunu_complete

Text completion (non-chat). Takes a prompt and generates a continuation.

./akunu_complete path/to/model.gguf path/to/akunu.metallib

Useful for testing raw model behavior without chat formatting.

akunu_inspect

Model inspection tool. Dumps model metadata and tensor information.

./akunu_inspect path/to/model.gguf

Shows:

  • Architecture, vocabulary size, embedding dimension
  • Number of layers, heads, KV heads
  • RoPE configuration
  • Tensor names, shapes, and dtypes

akunu_profile

Per-layer GPU timing profiler. Runs each operation in its own command buffer for accurate timing.

./akunu_profile path/to/model.gguf path/to/akunu.metallib

Output shows per-operation GPU time in milliseconds, useful for identifying bottlenecks.

akunu_serve

HTTP API server with OpenAI-compatible endpoints.

./akunu_serve path/to/model.gguf path/to/akunu.metallib --port 8080

Provides:

  • /v1/chat/completions – streaming and non-streaming chat
  • /v1/completions – text completion
  • Multi-model support (load multiple models)
  • Concurrent request handling with per-model mutex
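A minimal request body for the chat endpoint, assuming the standard OpenAI field names (model, messages, stream); the exact set of supported fields depends on the server:

```shell
# Print a minimal OpenAI-style chat request body.
cat <<'EOF'
{"model": "default",
 "messages": [{"role": "user", "content": "Hello"}],
 "stream": false}
EOF
```

With akunu_serve running, a body like this can be POSTed with, e.g., `curl -s http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d @body.json`; setting "stream": true switches the response to server-sent events.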

akunu_transcribe

Speech-to-text using Whisper models.

./akunu_transcribe path/to/whisper-model.gguf path/to/akunu.metallib input.wav

Supports:

  • WAV input (resampled to 16kHz internally)
  • Language detection or forced language
  • Timestamp generation
  • Streaming segment callback

akunu_benchmark

Extended benchmarking tool with more detailed metrics.

./akunu_benchmark path/to/model.gguf path/to/akunu.metallib

Library Targets

The CMake build produces two main library targets:

| Target | Type | Contents |
| --- | --- | --- |
| akunu_engine | Static (libakunu_engine.a) | Core + backend, always built |
| akunu | Shared (libakunu.dylib) | Same, built when AKUNU_BUILD_SHARED=ON |

Both expose the same C API defined in include/akunu/akunu.h. The static library is used by all CLI tools and tests. The shared library is intended for language bindings.

Test Executables

The build produces numerous test executables:

Integration Tests

| Test | Purpose |
| --- | --- |
| akunu_test_device | Metal device creation, buffer allocation |
| akunu_test_weights | GGUF parsing, weight loading |
| akunu_test_table | Dispatch table construction |
| akunu_e2e | End-to-end generation test |
| akunu_test_long_context | Long context handling |
| akunu_test_sampling_quality | Sampling distribution tests |
| akunu_test_config | Model configuration parsing |
| akunu_test_tokenizer | Tokenizer encode/decode |
| akunu_test_inference | Inference pipeline |
| akunu_test_kv_cache | KV cache operations |
| akunu_test_grammar | Grammar parsing and constrained decoding |
| akunu_test_server | HTTP server endpoints |
| akunu_test_whisper | Whisper model loading |
| akunu_test_whisper_e2e | End-to-end transcription |

Kernel Tests

Individual kernel correctness tests (each compiled as ObjC++):

| Test | Kernel Under Test |
| --- | --- |
| akunu_kernel_test_rmsnorm | rmsnorm_f16 |
| akunu_kernel_test_gemma_rmsnorm | rmsnorm_gemma_f16 |
| akunu_kernel_test_gemv_f16 | gemv_f16 |
| akunu_kernel_test_gemv_q4_0 | gemv_q4_0 |
| akunu_kernel_test_gemv_q8_0 | gemv_q8_0 |
| akunu_kernel_test_gemm_f16 | simd_gemm_f16 |
| akunu_kernel_test_silu | silu_f16 |
| akunu_kernel_test_gelu | gelu_f16 |
| akunu_kernel_test_silu_gate | silu_gate_f16 |
| akunu_kernel_test_gelu_gate | gelu_gate_f16 |
| akunu_kernel_test_rope | rope_qkv_write_f16 |
| akunu_kernel_test_rope_neox | rope_neox_qkv_write_f16 |
| akunu_kernel_test_flash_attention | flash_attention_decode_parallel_f16 |
| akunu_kernel_test_embedding_f16 | embedding_lookup_f16 |
| akunu_kernel_test_f32_to_f16 | f32_to_f16 |
| akunu_kernel_test_dequant_q4_0 | dequant_q4_0 |

These tests compare GPU kernel output against CPU reference implementations to verify correctness within FP16 tolerance.
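The pass/fail criterion reduces to an elementwise comparison of GPU output against the CPU reference within a tolerance sized for FP16 precision. The idea, collapsed to a single value – both the sample numbers and the 0.01 tolerance are illustrative, not akunu's actual thresholds:

```shell
# Compare one GPU result against a CPU reference within an absolute tolerance.
gpu=0.3330; ref=0.3333; tol=0.01
awk -v a="$gpu" -v b="$ref" -v t="$tol" \
    'BEGIN { d = a - b; if (d < 0) d = -d; print ((d <= t) ? "PASS" : "FAIL") }'
```

The real tests apply this check across every element of the output tensor, often with a relative rather than absolute tolerance for large magnitudes.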

Troubleshooting

“Failed to load metallib”

The metallib path is incorrect or the file was compiled for a different target. Ensure:

  1. The metallib file exists at the specified path
  2. It was compiled with -target air64-apple-macos13.0 or later
  3. The Metal device supports the required GPU family

“Failed to get pipeline: kernel_name”

A kernel function is missing from the metallib. This usually means:

  1. The kernel source file was not included in the metallib compilation
  2. There is a naming mismatch between the kernel function name in Metal and the string in C++

“allocate: failed to allocate N bytes”

The model is too large for available memory. Options:

  1. Use a smaller quantization (Q4_0 instead of FP16)
  2. Reduce max_context to shrink KV cache
  3. Close other applications to free memory
  4. Use a chip with more unified memory
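To see why reducing max_context helps, note that an FP16 KV cache grows linearly with context length: 2 (K and V) × layers × KV heads × head dim × context × 2 bytes. With assumed shapes for a typical ~8B model (32 layers, 8 KV heads, head dim 128 – illustrative, not any specific model):

```shell
# FP16 KV cache footprint: 2 (K and V) * layers * kv_heads * head_dim * ctx * 2 bytes.
layers=32; kv_heads=8; head_dim=128; ctx=4096
kv_bytes=$((2 * layers * kv_heads * head_dim * ctx * 2))
echo "KV cache at ctx=$ctx: $((kv_bytes / 1024 / 1024)) MiB"
```

Halving max_context halves this footprint, which is often the quickest way to fit a model that is just over the memory limit.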

Build Errors with XGrammar

If XGrammar fails to build, you can disable it:

cmake .. -DCMAKE_BUILD_TYPE=Release
# XGrammar auto-disables if submodule is not initialized

Summary

Building akunu is a standard CMake workflow. The main moving parts are:

  1. CMake configuration with AKUNU_BACKEND_METAL=ON (default)
  2. Metal shader compilation into akunu.metallib (manual step)
  3. Framework linking for Metal, Foundation, Accelerate, IOKit
  4. CLI tools for chat, benchmark, profiling, serving, and transcription

The next chapter covers the C API that all these tools are built on.



  1. Apple, “Transitioning to ARC Release Notes.” ARC (Automatic Reference Counting) eliminates manual retain/release calls for Objective-C objects. The compiler inserts retain/release operations automatically. See https://developer.apple.com/library/archive/releasenotes/ObjectiveC/RN-TransitioningToARC/.

  2. Apple, “Building a Library with Metal’s Command-Line Tools.” The metal and metallib command-line tools compile .metal sources into .metallib archives. See https://developer.apple.com/documentation/metal/shader_libraries/building_a_shader_library_by_precompiling_source_files.