The C API

Every akunu CLI tool – akunu_chat, akunu_bench, akunu_serve, akunu_transcribe – is built on top of a single C API defined in two header files: include/akunu/akunu.h and include/akunu/types.h. This chapter explains why akunu uses a C API rather than a C++ one, walks through every major function, and shows how to build a complete application from scratch.

Why C?

This is a C++ project. The core engine is written in C++17 with Objective-C++ for the Metal backend. So why is the public API in plain C?

1. FFI compatibility. Every programming language can call C functions. Python has ctypes and cffi. Swift has direct C interop. Rust has extern "C". Java has JNI. Go has cgo. A C API is the universal adapter.[^1]

2. ABI stability. C++ name mangling, vtable layouts, and standard library implementations differ between compilers and versions. A C API with POD structs and opaque pointers has a stable ABI – you can swap out the shared library without recompiling the caller.

3. No header dependencies. The C API headers include only <stdint.h>, <stdbool.h>, and <stddef.h>. No C++ standard library, no Metal headers, no Objective-C. Any C or C++ compiler can parse them.

4. Opaque handles prevent misuse. The caller cannot poke at internal state because the model is just a void*. This forces all interaction through the API functions, making it possible to change internal representations without breaking callers.

Header Organization

The API is split across two files:

types.h – Shared Data Structures

// types.h contains POD structs used by both the C API and internal C++ code

typedef struct {
    uint32_t dim;              // embedding dimension
    uint32_t n_layers;         // transformer layers
    uint32_t n_heads;          // query heads
    uint32_t n_kv_heads;       // key/value heads (GQA)
    uint32_t head_dim;         // dim per head
    uint32_t q_dim;            // total Q projection output
    uint32_t kv_dim;           // total KV projection output
    uint32_t ffn_dim;          // feed-forward intermediate dimension
    uint32_t vocab_size;       // vocabulary size
    uint32_t max_seq_len;      // maximum context length
    float norm_eps;            // RMSNorm/LayerNorm epsilon
    float rope_theta;          // RoPE base frequency
    uint32_t sliding_window_pattern;
    float rope_local_theta;
    char architecture[32];     // "llama", "qwen3", "gemma", "whisper"

    // Encoder parameters (0 = decoder-only)
    uint32_t enc_n_layers;
    uint32_t enc_n_heads;
    uint32_t enc_dim;
    uint32_t enc_ffn_dim;
    uint32_t n_mels;           // mel spectrogram bins (Whisper)
    uint32_t enc_max_seq_len;
} AkunuModelConfig;

Notice that AkunuModelConfig uses fixed-size arrays (char architecture[32]) instead of std::string, and all fields are primitive types. This is a POD struct – it can be safely passed across C/C++ boundaries and even memory-mapped.

Other types in types.h:

| Type | Purpose | Fields |
|------|---------|--------|
| AkunuModelConfig | Model architecture metadata | dim, layers, heads, vocab, etc. |
| AkunuSamplingConfig | Generation sampling parameters | temperature, top_k, top_p, min_p, repeat_penalty |
| AkunuGenerationStats | Post-generation statistics | prompt tokens, generated tokens, prefill/decode times |
| AkunuTranscribeStats | Post-transcription statistics | audio_ms, encode_ms, decode_ms, total_ms |

akunu.h – The Function API

The main header declares all API functions inside an extern "C" block:

#ifdef __cplusplus
extern "C" {
#endif

typedef void *akunu_model_t;  // Opaque model handle

// ... function declarations ...

#ifdef __cplusplus
}
#endif

The extern "C" block ensures C linkage (no name mangling) when compiled as C++. The #ifdef __cplusplus guards make the header valid for both C and C++ compilers.

The Opaque Handle Pattern

The entire model state – weights, KV cache, scratch buffers, dispatch table, tokenizer, device – is wrapped behind a single opaque handle:

typedef void *akunu_model_t;

This is a pointer to an internal C++ ModelState object that the caller never sees. Every API function takes this handle as its first argument. The pattern is:

// Create
akunu_model_t model = akunu_load_model("model.gguf", "akunu.metallib", 0);

// Use
akunu_generate(model, tokens, n_tokens, 256, sampling, callback, NULL);

// Destroy
akunu_free_model(model);

No global state, no singletons. You can load multiple models simultaneously by creating multiple handles. Each handle owns its own GPU resources.[^2]

Model Lifecycle

Loading

akunu_model_t akunu_load_model(const char *model_path,
                                const char *metallib_path,
                                int max_context);

This function does a lot of work:

  1. Creates a MetalDevice (or default device)
  2. Loads the metallib (device.load_library())
  3. Parses the model file (GGUF or MLX SafeTensors)
  4. Allocates weight buffers and uploads weights to GPU
  5. Creates the ArchDescriptor from model metadata
  6. Queries ChipConfig from the device
  7. Allocates KV cache and scratch buffers
  8. Builds the dispatch table (build_dispatch_table())
  9. Initializes the tokenizer
  10. Returns the opaque handle (or NULL on failure)

Parameters:

  • model_path: Path to a .gguf file or MLX SafeTensors directory
  • metallib_path: Path to compiled akunu.metallib. Pass NULL for auto-detection (searches common paths)
  • max_context: Maximum context window. 0 = use model default (capped at 4096)

Freeing

void akunu_free_model(akunu_model_t model);

Releases all GPU buffers, KV cache, scratch buffers, cached pipeline state objects, and the Metal device. After this call, the handle is invalid.

Error Handling

const char *akunu_get_error(void);

Returns the last error message. This uses thread-local storage, so it is safe to call from multiple threads. If akunu_load_model returns NULL, call this to find out why:

akunu_model_t model = akunu_load_model("bad_path.gguf", "akunu.metallib", 0);
if (!model) {
    printf("Error: %s\n", akunu_get_error());
    // "Error: Failed to open file: bad_path.gguf"
}

Model Information

AkunuModelConfig akunu_get_config(akunu_model_t model);
size_t akunu_model_memory(akunu_model_t model);

akunu_get_config returns a copy of the model configuration struct. Since AkunuModelConfig is a POD struct, this is a simple memcpy – no dynamic allocation.

akunu_model_memory returns the total GPU memory used by the model in bytes. This includes weights, KV cache, scratch buffers, and pre-allocated parameter buffers.

Tokenization

int akunu_encode(akunu_model_t model, const char *text,
                 uint32_t *out_tokens, int max_tokens);

const char *akunu_decode_token(akunu_model_t model, uint32_t token_id);

int akunu_token_count(akunu_model_t model, const char *text);

The tokenizer is a BPE implementation built into akunu (no external dependency). Token IDs are uint32_t values.

akunu_encode writes token IDs into a caller-provided buffer. Returns the number of tokens written. If the output buffer is too small, the text is silently truncated.

akunu_decode_token returns a pointer to the token’s text representation. The pointer is valid until the model is freed – it points into the tokenizer’s vocabulary table.

akunu_token_count is a convenience function that counts tokens without allocating an output buffer.

Generation: The Callback Pattern

Generation uses a callback function for streaming output:

typedef bool (*akunu_token_callback)(uint32_t token_id,
                                     const char *text,
                                     void *user_data);

The callback is invoked for each generated token. Returning false stops generation immediately. The user_data pointer is passed through from the akunu_generate call, allowing the callback to access caller state without globals.

AkunuGenerationStats akunu_generate(
    akunu_model_t model,
    const uint32_t *prompt_tokens, int n_prompt,
    int max_tokens,
    AkunuSamplingConfig sampling,
    akunu_token_callback callback,
    void *user_data);

This is the main generation entry point. It:

  1. Resets the KV cache
  2. Runs prefill on the prompt tokens
  3. Enters the decode loop, calling the callback for each token
  4. Returns statistics (prefill time, decode time, tokens/second)

Sampling Configuration

typedef struct {
    float temperature;  // 0 = greedy (argmax)
    int top_k;          // 0 = disabled
    float top_p;        // 1.0 = disabled
    float min_p;        // 0.0 = disabled
    float repeat_penalty;  // 1.0 = disabled
} AkunuSamplingConfig;

Temperature 0 triggers the greedy decode path (argmax on GPU, no CPU sampling). Non-zero temperature runs the sampled decode path with optional top-k, top-p, and min-p filtering.

Generation Statistics

typedef struct {
    int prompt_tokens;
    int generated_tokens;
    float prefill_time_ms;
    float decode_time_ms;
    float prefill_tokens_per_sec;
    float decode_tokens_per_sec;
} AkunuGenerationStats;

This struct is returned by value from akunu_generate. It contains everything you need to report performance.

Continued Generation

For multi-turn chat, you do not want to re-process the entire conversation history each turn. akunu_generate_continue extends the existing KV cache:

AkunuGenerationStats akunu_generate_continue(
    akunu_model_t model,
    const uint32_t *new_tokens, int n_new,
    int max_tokens,
    AkunuSamplingConfig sampling,
    akunu_token_callback callback,
    void *user_data);

This prefills only the new_tokens (the latest user message) and generates from the combined context. The KV cache from previous turns is preserved.

Grammar-Constrained Generation

For structured output (JSON, specific formats), akunu supports grammar-constrained decoding:

akunu_grammar_t akunu_grammar_create(akunu_model_t model, const char *gbnf);
akunu_grammar_t akunu_grammar_create_from_schema(akunu_model_t model,
                                                  const char *json_schema);
akunu_grammar_t akunu_grammar_create_json(akunu_model_t model);
void akunu_grammar_free(akunu_grammar_t grammar);

AkunuGenerationStats akunu_generate_grammar(
    akunu_model_t model,
    const uint32_t *prompt_tokens, int n_prompt,
    int max_tokens,
    AkunuSamplingConfig sampling,
    akunu_grammar_t grammar,
    akunu_token_callback callback,
    void *user_data);

The grammar handle is opaque, like the model handle. Three factory functions create grammars from GBNF strings, JSON Schema strings, or a generic JSON grammar. The grammar masks invalid tokens at each step, guaranteeing the output conforms to the grammar.[^3]

Low-Level API

For benchmarking and custom decode loops, akunu exposes lower-level functions:

// Run prefill, return first generated token
uint32_t akunu_prefill(akunu_model_t model,
                       const uint32_t *tokens, int n_tokens);

// Run one decode step, return next token
uint32_t akunu_decode_step(akunu_model_t model,
                           uint32_t token_id, int position);

// Chain decode: multiple tokens in one GPU submission
int akunu_chain_decode(akunu_model_t model,
                       uint32_t first_token, int start_position,
                       int count, uint32_t *out_tokens);

// Get current KV cache position
int akunu_get_position(akunu_model_t model);

// Reset KV cache
void akunu_reset(akunu_model_t model);

The akunu_chain_decode function is the key primitive for fast greedy generation. It encodes the dispatch table N times into a single command buffer, patching position fields for each token. This is how akunu achieves high throughput for greedy (temperature=0) decoding.

Speculative Decoding

void akunu_set_speculation(akunu_model_t model, bool enabled);

When enabled, the decode loop uses n-gram prediction to speculatively generate multiple tokens, then verifies them against the model. Correctly predicted tokens skip full forward passes. This only works with greedy mode (temperature=0) because the verification requires deterministic token selection.

Embeddings

For BERT-style encoder models:

int akunu_embed(akunu_model_t model,
                const uint32_t *tokens, int n_tokens,
                float *out_embedding, int max_dims);

int akunu_embed_text(akunu_model_t model, const char *text,
                     float *out_embedding, int max_dims);

int akunu_embedding_dim(akunu_model_t model);

akunu_embed runs a forward pass through the encoder, mean-pools the final hidden layer, and writes the resulting embedding vector to out_embedding. Returns the embedding dimension on success, 0 on failure.

akunu_embed_text is a convenience wrapper that tokenizes the text internally.

Whisper Transcription

const char *akunu_transcribe(akunu_model_t model,
                             const char *wav_path,
                             const char *language,
                             AkunuTranscribeStats *stats_out);

const char *akunu_transcribe_pcm(akunu_model_t model,
                                 const float *samples, int n_samples,
                                 const char *language,
                                 AkunuTranscribeStats *stats_out);

bool akunu_is_whisper(akunu_model_t model);
void akunu_set_timestamps(akunu_model_t model, bool enabled);

The transcription API supports both file-based and PCM buffer input. The returned string is valid until the next call or model free – it points to an internal buffer.

Streaming callbacks are also available:

typedef bool (*akunu_segment_callback)(int start_ms, int end_ms,
                                       const char *text, void *user_data);

const char *akunu_transcribe_stream(akunu_model_t model,
                                    const char *wav_path,
                                    const char *language,
                                    AkunuTranscribeStats *stats_out,
                                    akunu_segment_callback callback,
                                    void *user_data);

Chat Templates

const char *akunu_format_chat(akunu_model_t model,
                              const char *system_prompt,
                              const char *user_message);

const char *akunu_chat_template(akunu_model_t model);

akunu_format_chat applies the model’s native chat template to format a system prompt and user message into the expected input format (e.g., ChatML, Llama 3 format, Gemma format). The returned string is valid until the next call.

akunu_chat_template returns the template name as a string (“chatml”, “llama3”, “gemma”, or “unknown”).

Profiling

int akunu_profile_decode_step(akunu_model_t model,
                              uint32_t token_id, int position,
                              float *timing_out, int max_entries);

const char *akunu_profile_label(akunu_model_t model, int index);

The profiling API runs each operation in its own command buffer to get per-operation GPU timing. timing_out receives an array of float values (milliseconds). akunu_profile_label returns the human-readable label for each entry (e.g., “layer.0.attention”, “layer.0.o_proj”).

GPU Sampling Operations

void akunu_gpu_temperature_scale(akunu_model_t model, float temperature);
void akunu_gpu_repetition_penalty(akunu_model_t model,
                                   const uint32_t *token_ids,
                                   int n_tokens, float penalty);

These functions run sampling operations directly on the GPU, avoiding CPU readback of the logits buffer. Temperature scaling is a simple element-wise multiply; repetition penalty adjusts logits for previously seen tokens.

Model Inspection

int akunu_tensor_count(akunu_model_t model);
const char *akunu_tensor_name(akunu_model_t model, int index);
uint32_t akunu_tensor_dtype(akunu_model_t model, int index);
const char *akunu_tensor_raw_dtype(akunu_model_t model, int index);

These functions allow iterating over all tensors in the model. akunu_inspect uses them to dump the full tensor list. akunu_tensor_raw_dtype returns the original dtype string (e.g., “BF16” for SafeTensors) while akunu_tensor_dtype returns the internal GGUF dtype code.

Thread Safety

The akunu API has the following thread safety guarantees:

  1. Different model handles are fully independent. You can call functions on model_A from thread 1 and model_B from thread 2 concurrently with no synchronization needed.

  2. A single model handle is NOT thread-safe. You must serialize all calls to the same model. The akunu_serve server handles this with a per-model mutex.

  3. akunu_get_error() is thread-safe. It uses thread-local storage.

  4. Model loading (akunu_load_model) is thread-safe. Each call creates its own device and resources.

Complete Example

Here is a complete program that loads a model, generates text, and reports statistics:

#include "akunu/akunu.h"
#include <stdio.h>
#include <string.h>

static bool on_token(uint32_t token_id, const char *text, void *user_data) {
    printf("%s", text);
    fflush(stdout);
    (void)token_id;
    (void)user_data;
    return true;  // continue generating
}

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "Usage: %s <model.gguf> <akunu.metallib>\n", argv[0]);
        return 1;
    }

    // Load model
    akunu_model_t model = akunu_load_model(argv[1], argv[2], 4096);
    if (!model) {
        fprintf(stderr, "Failed to load model: %s\n", akunu_get_error());
        return 1;
    }

    // Print model info
    AkunuModelConfig cfg = akunu_get_config(model);
    printf("Model: %s, %u layers, %u dim, %.1f MB GPU memory\n",
           cfg.architecture, cfg.n_layers, cfg.dim,
           akunu_model_memory(model) / 1048576.0);

    // Tokenize prompt
    const char *prompt = "Explain the roofline model in one paragraph:";
    uint32_t tokens[4096];
    int n_tokens = akunu_encode(model, prompt, tokens, 4096);
    printf("Prompt: %d tokens\n\n", n_tokens);

    // Generate
    AkunuSamplingConfig sampling = {
        .temperature = 0.0f,  // greedy
        .top_k = 0,
        .top_p = 1.0f,
        .min_p = 0.0f,
        .repeat_penalty = 1.0f
    };

    AkunuGenerationStats stats = akunu_generate(
        model, tokens, n_tokens,
        256,       // max_tokens
        sampling,
        on_token,
        NULL       // user_data
    );

    // Report
    printf("\n\n--- Stats ---\n");
    printf("Prefill: %d tokens in %.1f ms (%.0f tok/s)\n",
           stats.prompt_tokens, stats.prefill_time_ms,
           stats.prefill_tokens_per_sec);
    printf("Decode:  %d tokens in %.1f ms (%.0f tok/s)\n",
           stats.generated_tokens, stats.decode_time_ms,
           stats.decode_tokens_per_sec);

    akunu_free_model(model);
    return 0;
}

Compile and run:

clang -std=c11 -I include example.c -L build -lakunu_engine \
    -framework Metal -framework Foundation -framework Accelerate \
    -framework IOKit -lstdc++ -o example

./example path/to/model.gguf path/to/akunu.metallib

API Function Reference

| Function | Returns | Description |
|----------|---------|-------------|
| akunu_load_model | akunu_model_t | Load model, returns NULL on error |
| akunu_free_model | void | Free all model resources |
| akunu_get_config | AkunuModelConfig | Get model architecture metadata |
| akunu_model_memory | size_t | Total GPU memory in bytes |
| akunu_get_error | const char* | Last error message (thread-local) |
| akunu_encode | int | Tokenize text to token IDs |
| akunu_decode_token | const char* | Token ID to text |
| akunu_token_count | int | Count tokens in text |
| akunu_generate | AkunuGenerationStats | Full generation pipeline |
| akunu_generate_continue | AkunuGenerationStats | Continue from existing KV cache |
| akunu_generate_grammar | AkunuGenerationStats | Grammar-constrained generation |
| akunu_generate_grammar_continue | AkunuGenerationStats | Continue with grammar |
| akunu_grammar_create | akunu_grammar_t | Create grammar from GBNF |
| akunu_grammar_create_from_schema | akunu_grammar_t | Create grammar from JSON Schema |
| akunu_grammar_create_json | akunu_grammar_t | Create generic JSON grammar |
| akunu_grammar_free | void | Free grammar |
| akunu_prefill | uint32_t | Run prefill, return first token |
| akunu_decode_step | uint32_t | Run one decode step |
| akunu_chain_decode | int | Chain decode multiple tokens |
| akunu_get_position | int | Current KV cache position |
| akunu_set_speculation | void | Enable/disable speculative decode |
| akunu_reset | void | Reset KV cache |
| akunu_embed | int | Compute embeddings from tokens |
| akunu_embed_text | int | Compute embeddings from text |
| akunu_embedding_dim | int | Get embedding dimension |
| akunu_format_chat | const char* | Format chat message |
| akunu_chat_template | const char* | Get template name |
| akunu_transcribe | const char* | Transcribe WAV file |
| akunu_transcribe_pcm | const char* | Transcribe PCM buffer |
| akunu_transcribe_stream | const char* | Transcribe with segment callback |
| akunu_transcribe_pcm_stream | const char* | Transcribe PCM with callback |
| akunu_set_timestamps | void | Enable/disable Whisper timestamps |
| akunu_is_whisper | bool | Check if model is Whisper |
| akunu_profile_decode_step | int | Per-operation GPU timing |
| akunu_profile_label | const char* | Label for profiled operation |
| akunu_gpu_temperature_scale | void | GPU-side temperature scaling |
| akunu_gpu_repetition_penalty | void | GPU-side repetition penalty |
| akunu_tensor_count | int | Number of model tensors |
| akunu_tensor_name | const char* | Tensor name by index |
| akunu_tensor_dtype | uint32_t | Tensor GGUF dtype code |
| akunu_tensor_raw_dtype | const char* | Tensor original dtype string |

Summary

The C API is akunu’s external interface. It uses the opaque handle pattern, POD structs, and C linkage to provide maximum compatibility across languages and compilers. The callback-based generation pattern supports streaming output without allocating result buffers. Thread safety is per-model-handle, requiring callers to serialize access to a single model.


[^1]: The C FFI is effectively the lingua franca of systems programming. See “Foreign Function Interface” on Wikipedia for a survey of language support. Every major language runtime supports calling C functions with zero or minimal overhead. See https://en.wikipedia.org/wiki/Foreign_function_interface.

[^2]: This “handle + function” pattern is sometimes called the “C object pattern” or “ADT (Abstract Data Type) in C.” It provides encapsulation without language-level support for classes. The Linux kernel uses this pattern extensively for device drivers.

[^3]: Grammar-constrained decoding uses the XGrammar library (v0.1.33) internally. XGrammar compiles the grammar into a token mask that can be applied at each decoding step. See the XGrammar project: https://github.com/mlc-ai/xgrammar.