The C API
Every akunu CLI tool – akunu_chat, akunu_bench, akunu_serve, akunu_transcribe – is built on top of a single C API defined in two header files: include/akunu/akunu.h and include/akunu/types.h. This chapter explains why akunu uses a C API rather than a C++ one, walks through every major function, and shows how to build a complete application from scratch.
Why C?
This is a C++ project. The core engine is written in C++17 with Objective-C++ for the Metal backend. So why is the public API in plain C?
1. FFI compatibility. Every programming language can call C functions. Python has ctypes and cffi. Swift has direct C interop. Rust has extern "C". Java has JNI. Go has cgo. A C API is the universal adapter.1
2. ABI stability. C++ name mangling, vtable layouts, and standard library implementations differ between compilers and versions. A C API with POD structs and opaque pointers has a stable ABI – you can swap out the shared library without recompiling the caller.
3. No header dependencies. The C API headers include only <stdint.h>, <stdbool.h>, and <stddef.h>. No C++ standard library, no Metal headers, no Objective-C. Any C or C++ compiler can parse them.
4. Opaque handles prevent misuse. The caller cannot poke at internal state because the model is just a void*. This forces all interaction through the API functions, making it possible to change internal representations without breaking callers.
Header Organization
The API is split across two files:
types.h – Shared Data Structures
// types.h contains POD structs used by both the C API and internal C++ code
typedef struct {
uint32_t dim; // embedding dimension
uint32_t n_layers; // transformer layers
uint32_t n_heads; // query heads
uint32_t n_kv_heads; // key/value heads (GQA)
uint32_t head_dim; // dim per head
uint32_t q_dim; // total Q projection output
uint32_t kv_dim; // total KV projection output
uint32_t ffn_dim; // feed-forward intermediate dimension
uint32_t vocab_size; // vocabulary size
uint32_t max_seq_len; // maximum context length
float norm_eps; // RMSNorm/LayerNorm epsilon
float rope_theta; // RoPE base frequency
uint32_t sliding_window_pattern;
float rope_local_theta;
char architecture[32]; // "llama", "qwen3", "gemma", "whisper"
// Encoder parameters (0 = decoder-only)
uint32_t enc_n_layers;
uint32_t enc_n_heads;
uint32_t enc_dim;
uint32_t enc_ffn_dim;
uint32_t n_mels; // mel spectrogram bins (Whisper)
uint32_t enc_max_seq_len;
} AkunuModelConfig;
Notice that AkunuModelConfig uses fixed-size arrays (char architecture[32]) instead of std::string, and all fields are primitive types. This is a POD struct – it can be safely passed across C/C++ boundaries and even memory-mapped.
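Because the struct is POD, returning it by value and assigning it are plain byte copies with no constructors, destructors, or allocation involved. A quick standalone check using a trimmed-down stand-in struct (illustrative, not the real header):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for AkunuModelConfig: every field is a primitive or a
   fixed-size array, so the struct is trivially copyable. */
typedef struct {
    uint32_t dim;
    uint32_t n_layers;
    char architecture[32];
} MiniConfig;

/* Returning by value copies the bytes -- no allocation, no cleanup. */
MiniConfig get_config(void) {
    MiniConfig c = { 4096, 32, {0} };
    strcpy(c.architecture, "llama");
    return c;
}
```

The same property is what makes the struct safe to embed in a memory-mapped file header.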
Other types in types.h:
| Type | Purpose | Fields |
|---|---|---|
| AkunuModelConfig | Model architecture metadata | dim, layers, heads, vocab, etc. |
| AkunuSamplingConfig | Generation sampling parameters | temperature, top_k, top_p, min_p, repeat_penalty |
| AkunuGenerationStats | Post-generation statistics | prompt tokens, generated tokens, prefill/decode times |
| AkunuTranscribeStats | Post-transcription statistics | audio_ms, encode_ms, decode_ms, total_ms |
akunu.h – The Function API
The main header declares all API functions inside an extern "C" block:
#ifdef __cplusplus
extern "C" {
#endif
typedef void *akunu_model_t; // Opaque model handle
// ... function declarations ...
#ifdef __cplusplus
}
#endif
The extern "C" block ensures C linkage (no name mangling) when compiled as C++. The #ifdef __cplusplus guards make the header valid for both C and C++ compilers.
The Opaque Handle Pattern
The entire model state – weights, KV cache, scratch buffers, dispatch table, tokenizer, device – is wrapped behind a single opaque handle:
typedef void *akunu_model_t;
This is a pointer to an internal C++ ModelState object that the caller never sees. Every API function takes this handle as its first argument. The pattern is:
// Create
akunu_model_t model = akunu_load_model("model.gguf", "akunu.metallib", 0);
// Use
akunu_generate(model, tokens, n_tokens, 256, sampling, callback, NULL);
// Destroy
akunu_free_model(model);
No global state, no singletons. You can load multiple models simultaneously by creating multiple handles. Each handle owns its own GPU resources.2
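The pattern itself fits in a few lines of standalone C. This sketch is illustrative only (a toy counter, not akunu internals): the "header" side sees just a typedef'd pointer, while the implementation file defines the real struct.

```c
#include <assert.h>
#include <stdlib.h>

/* --- what a public header would expose --- */
typedef void *counter_t;

/* --- what the implementation file hides --- */
typedef struct { int value; } CounterState;

counter_t counter_create(void) {
    CounterState *s = malloc(sizeof *s);
    if (s) s->value = 0;
    return s;
}

int counter_increment(counter_t h) {
    return ++((CounterState *)h)->value;  /* only this file knows the layout */
}

void counter_free(counter_t h) {
    free(h);
}
```

The caller can create as many independent counters as it likes, exactly as with model handles.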
Model Lifecycle
Loading
akunu_model_t akunu_load_model(const char *model_path,
const char *metallib_path,
int max_context);
This function does a lot of work:
- Creates a MetalDevice (or uses the default device)
- Loads the metallib (device.load_library())
- Parses the model file (GGUF or MLX SafeTensors)
- Allocates weight buffers and uploads weights to GPU
- Creates the ArchDescriptor from model metadata
- Queries ChipConfig from the device
- Allocates KV cache and scratch buffers
- Builds the dispatch table (build_dispatch_table())
- Initializes the tokenizer
- Returns the opaque handle (or NULL on failure)
Parameters:
- model_path: Path to a .gguf file or MLX SafeTensors directory
- metallib_path: Path to the compiled akunu.metallib. Pass NULL for auto-detection (searches common paths)
- max_context: Maximum context window. 0 = use the model default (capped at 4096)
Freeing
void akunu_free_model(akunu_model_t model);
Releases all GPU buffers, KV cache, scratch buffers, cached pipeline state objects, and the Metal device. After this call, the handle is invalid.
Error Handling
const char *akunu_get_error(void);
Returns the last error message. This uses thread-local storage, so it is safe to call from multiple threads. If akunu_load_model returns NULL, call this to find out why:
akunu_model_t model = akunu_load_model("bad_path.gguf", "akunu.metallib", 0);
if (!model) {
printf("Error: %s\n", akunu_get_error());
// "Error: Failed to open file: bad_path.gguf"
}
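The thread-local mechanism can be sketched in standalone C11 (illustrative only; akunu's actual error plumbing may differ). Each thread gets its own buffer, so a failure on one thread never clobbers the message another thread is about to read:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One error buffer per thread (C11 _Thread_local). */
static _Thread_local char g_last_error[256];

const char *demo_get_error(void) {
    return g_last_error;
}

/* A fallible call records why it failed before returning NULL. */
void *demo_load(const char *path) {
    snprintf(g_last_error, sizeof g_last_error,
             "Failed to open file: %s", path);
    return NULL;
}
```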
Model Information
AkunuModelConfig akunu_get_config(akunu_model_t model);
size_t akunu_model_memory(akunu_model_t model);
akunu_get_config returns a copy of the model configuration struct. Since AkunuModelConfig is a POD struct, this is a simple memcpy – no dynamic allocation.
akunu_model_memory returns the total GPU memory used by the model in bytes. This includes weights, KV cache, scratch buffers, and pre-allocated parameter buffers.
Tokenization
int akunu_encode(akunu_model_t model, const char *text,
uint32_t *out_tokens, int max_tokens);
const char *akunu_decode_token(akunu_model_t model, uint32_t token_id);
int akunu_token_count(akunu_model_t model, const char *text);
The tokenizer is a BPE implementation built into akunu (no external dependency). Token IDs are uint32_t values.
akunu_encode writes token IDs into a caller-provided buffer. Returns the number of tokens written. If the output buffer is too small, the text is silently truncated.
akunu_decode_token returns a pointer to the token’s text representation. The pointer is valid until the model is freed – it points into the tokenizer’s vocabulary table.
akunu_token_count is a convenience function that counts tokens without allocating an output buffer.
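The encode contract (caller-provided buffer, count returned, silent truncation when the buffer fills) is worth seeing in isolation. This toy "tokenizer" splits on spaces rather than running BPE, but follows the same calling convention:

```c
#include <assert.h>
#include <stdint.h>

/* Writes one fake "token id" per space-separated word, truncating
   silently when the caller's buffer is full -- the same contract
   as akunu_encode. Returns the number of tokens written. */
int toy_encode(const char *text, uint32_t *out, int max_tokens) {
    int n = 0;
    int in_word = 0;
    for (const char *p = text; *p; p++) {
        if (*p == ' ') { in_word = 0; continue; }
        if (!in_word) {
            if (n == max_tokens) return n;    /* silent truncation */
            out[n++] = (uint32_t)(p - text);  /* fake id: word offset */
            in_word = 1;
        }
    }
    return n;
}
```

Sizing the buffer generously (or pre-flighting with a count call) is the caller's responsibility, exactly as with akunu_token_count.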
Generation: The Callback Pattern
Generation uses a callback function for streaming output:
typedef bool (*akunu_token_callback)(uint32_t token_id,
const char *text,
void *user_data);
The callback is invoked for each generated token. Returning false stops generation immediately. The user_data pointer is passed through from the akunu_generate call, allowing the callback to access caller state without globals.
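The user_data mechanism is easiest to see with a standalone driver loop. The callback signature below matches the one above; the "generator" is a stand-in that emits fixed tokens, not a model:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef bool (*token_callback)(uint32_t token_id,
                               const char *text,
                               void *user_data);

/* Caller state travels through the void* -- no globals needed. */
typedef struct { char out[64]; int stop_after; int seen; } Collector;

static bool collect(uint32_t id, const char *text, void *user_data) {
    Collector *c = user_data;
    strcat(c->out, text);
    (void)id;
    return ++c->seen < c->stop_after;  /* returning false stops generation */
}

/* Toy "generator": emits tokens until the callback says stop. */
int run_generation(token_callback cb, void *user_data) {
    const char *toks[] = { "Hello", ", ", "world", "!" };
    int emitted = 0;
    for (int i = 0; i < 4; i++) {
        emitted++;
        if (!cb((uint32_t)i, toks[i], user_data)) break;
    }
    return emitted;
}
```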
AkunuGenerationStats akunu_generate(
akunu_model_t model,
const uint32_t *prompt_tokens, int n_prompt,
int max_tokens,
AkunuSamplingConfig sampling,
akunu_token_callback callback,
void *user_data);
This is the main generation entry point. It:
- Resets the KV cache
- Runs prefill on the prompt tokens
- Enters the decode loop, calling the callback for each token
- Returns statistics (prefill time, decode time, tokens/second)
Sampling Configuration
typedef struct {
float temperature; // 0 = greedy (argmax)
int top_k; // 0 = disabled
float top_p; // 1.0 = disabled
float min_p; // 0.0 = disabled
float repeat_penalty; // 1.0 = disabled
} AkunuSamplingConfig;
Temperature 0 triggers the greedy decode path (argmax on GPU, no CPU sampling). Non-zero temperature runs the sampled decode path with optional top-k, top-p, and min-p filtering.
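What the two paths mean mathematically, reduced to a CPU reference (the real paths run on GPU; this is just the arithmetic):

```c
#include <assert.h>
#include <math.h>

/* Greedy path: argmax over the logits, no sampling at all. */
int argmax(const float *logits, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (logits[i] > logits[best]) best = i;
    return best;
}

/* Sampled path begins by dividing logits by the temperature:
   T < 1 sharpens the distribution, T > 1 flattens it. */
void temperature_scale(float *logits, int n, float temperature) {
    for (int i = 0; i < n; i++)
        logits[i] /= temperature;
}
```

Note that temperature scaling never changes which token has the highest logit, which is why temperature 0 can skip it entirely and jump straight to argmax.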
Generation Statistics
typedef struct {
int prompt_tokens;
int generated_tokens;
float prefill_time_ms;
float decode_time_ms;
float prefill_tokens_per_sec;
float decode_tokens_per_sec;
} AkunuGenerationStats;
This struct is returned by value from akunu_generate. It contains everything you need to report performance.
Continued Generation
For multi-turn chat, you do not want to re-process the entire conversation history each turn. akunu_generate_continue extends the existing KV cache:
AkunuGenerationStats akunu_generate_continue(
akunu_model_t model,
const uint32_t *new_tokens, int n_new,
int max_tokens,
AkunuSamplingConfig sampling,
akunu_token_callback callback,
void *user_data);
This prefills only the new_tokens (the latest user message) and generates from the combined context. The KV cache from previous turns is preserved.
Grammar-Constrained Generation
For structured output (JSON, specific formats), akunu supports grammar-constrained decoding:
akunu_grammar_t akunu_grammar_create(akunu_model_t model, const char *gbnf);
akunu_grammar_t akunu_grammar_create_from_schema(akunu_model_t model,
const char *json_schema);
akunu_grammar_t akunu_grammar_create_json(akunu_model_t model);
void akunu_grammar_free(akunu_grammar_t grammar);
AkunuGenerationStats akunu_generate_grammar(
akunu_model_t model,
const uint32_t *prompt_tokens, int n_prompt,
int max_tokens,
AkunuSamplingConfig sampling,
akunu_grammar_t grammar,
akunu_token_callback callback,
void *user_data);
The grammar handle is opaque, like the model handle. Three factory functions create grammars from GBNF strings, JSON Schema strings, or a generic JSON grammar. The grammar masks invalid tokens at each step, guaranteeing the output conforms to the grammar.3
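For reference, a minimal GBNF grammar (the format popularized by llama.cpp, which XGrammar also accepts) that constrains the output to a single yes/no answer might look like:

```
root ::= answer
answer ::= "yes" | "no"
```

At each step, every token that cannot continue a match of this grammar is masked out before sampling.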
Low-Level API
For benchmarking and custom decode loops, akunu exposes lower-level functions:
// Run prefill, return first generated token
uint32_t akunu_prefill(akunu_model_t model,
const uint32_t *tokens, int n_tokens);
// Run one decode step, return next token
uint32_t akunu_decode_step(akunu_model_t model,
uint32_t token_id, int position);
// Chain decode: multiple tokens in one GPU submission
int akunu_chain_decode(akunu_model_t model,
uint32_t first_token, int start_position,
int count, uint32_t *out_tokens);
// Get current KV cache position
int akunu_get_position(akunu_model_t model);
// Reset KV cache
void akunu_reset(akunu_model_t model);
The akunu_chain_decode function is the key primitive for fast greedy generation. It encodes the dispatch table N times into a single command buffer, patching position fields for each token. This is how akunu achieves high throughput for greedy (temperature=0) decoding.
Speculative Decoding
void akunu_set_speculation(akunu_model_t model, bool enabled);
When enabled, the decode loop uses n-gram prediction to speculatively generate multiple tokens, then verifies them against the model. Correctly predicted tokens skip full forward passes. This only works with greedy mode (temperature=0) because the verification requires deterministic token selection.
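The accept/reject logic at the heart of this scheme can be sketched independently of any model: draft tokens are accepted only up to the first position where they disagree with what greedy decoding would have produced. This is illustrative; akunu's n-gram drafting details are not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Compare drafted tokens against the tokens the model verifies
   greedily; return how many leading drafts match. Every accepted
   token skips its own full forward pass. */
int accept_draft(const uint32_t *draft, const uint32_t *verified, int n) {
    int accepted = 0;
    while (accepted < n && draft[accepted] == verified[accepted])
        accepted++;
    return accepted;
}
```

Because verification compares against the greedy choice, the output is bit-identical to plain greedy decoding, which is why the feature is restricted to temperature 0.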
Embeddings
For BERT-style encoder models:
int akunu_embed(akunu_model_t model,
const uint32_t *tokens, int n_tokens,
float *out_embedding, int max_dims);
int akunu_embed_text(akunu_model_t model, const char *text,
float *out_embedding, int max_dims);
int akunu_embedding_dim(akunu_model_t model);
akunu_embed runs a forward pass through the encoder, mean-pools the final hidden layer, and writes the resulting embedding vector to out_embedding. Returns the embedding dimension on success, 0 on failure.
akunu_embed_text is a convenience wrapper that tokenizes the text internally.
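Mean pooling itself is a small operation. As a CPU reference, with hidden states laid out token-major ([n_tokens x dim], row per token):

```c
#include <assert.h>

/* Average the final hidden states across tokens to produce one
   fixed-size embedding vector, as akunu_embed does after the
   encoder forward pass. */
void mean_pool(const float *hidden, int n_tokens, int dim, float *out) {
    for (int d = 0; d < dim; d++) {
        float sum = 0.0f;
        for (int t = 0; t < n_tokens; t++)
            sum += hidden[t * dim + d];
        out[d] = sum / (float)n_tokens;
    }
}
```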
Whisper Transcription
const char *akunu_transcribe(akunu_model_t model,
const char *wav_path,
const char *language,
AkunuTranscribeStats *stats_out);
const char *akunu_transcribe_pcm(akunu_model_t model,
const float *samples, int n_samples,
const char *language,
AkunuTranscribeStats *stats_out);
bool akunu_is_whisper(akunu_model_t model);
void akunu_set_timestamps(akunu_model_t model, bool enabled);
The transcription API supports both file-based and PCM buffer input. The returned string is valid until the next call or model free – it points to an internal buffer.
Streaming callbacks are also available:
typedef bool (*akunu_segment_callback)(int start_ms, int end_ms,
const char *text, void *user_data);
const char *akunu_transcribe_stream(akunu_model_t model,
const char *wav_path,
const char *language,
AkunuTranscribeStats *stats_out,
akunu_segment_callback callback,
void *user_data);
Chat Templates
const char *akunu_format_chat(akunu_model_t model,
const char *system_prompt,
const char *user_message);
const char *akunu_chat_template(akunu_model_t model);
akunu_format_chat applies the model’s native chat template to format a system prompt and user message into the expected input format (e.g., ChatML, Llama 3 format, Gemma format). The returned string is valid until the next call.
akunu_chat_template returns the template name as a string (“chatml”, “llama3”, “gemma”, or “unknown”).
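As an illustration of what one of these templates produces, here is the ChatML layout built by hand (the real function selects the template from model metadata; this standalone helper is only a sketch):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a ChatML-formatted prompt into a caller-provided buffer.
   Returns the number of characters that snprintf would write. */
int format_chatml(char *out, size_t cap,
                  const char *system_prompt, const char *user_message) {
    return snprintf(out, cap,
        "<|im_start|>system\n%s<|im_end|>\n"
        "<|im_start|>user\n%s<|im_end|>\n"
        "<|im_start|>assistant\n",
        system_prompt, user_message);
}
```

The trailing open assistant turn is what cues the model to generate its reply.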
Profiling
int akunu_profile_decode_step(akunu_model_t model,
uint32_t token_id, int position,
float *timing_out, int max_entries);
const char *akunu_profile_label(akunu_model_t model, int index);
The profiling API runs each operation in its own command buffer to get per-operation GPU timing. timing_out receives an array of float values (milliseconds). akunu_profile_label returns the human-readable label for each entry (e.g., “layer.0.attention”, “layer.0.o_proj”).
GPU Sampling Operations
void akunu_gpu_temperature_scale(akunu_model_t model, float temperature);
void akunu_gpu_repetition_penalty(akunu_model_t model,
const uint32_t *token_ids,
int n_tokens, float penalty);
These functions run sampling operations directly on the GPU, avoiding CPU readback of the logits buffer. Temperature scaling is a simple element-wise multiply; repetition penalty adjusts logits for previously seen tokens.
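As a CPU reference for the penalty math, here is the common convention (positive logits divided by the penalty, negative logits multiplied, so both moves make the token less likely). Whether akunu's GPU kernel uses exactly this variant is not specified here:

```c
#include <assert.h>
#include <stdint.h>

/* Push down the logits of previously seen tokens. penalty > 1
   discourages repetition; penalty == 1 is a no-op. */
void repetition_penalty(float *logits, int vocab,
                        const uint32_t *seen, int n_seen, float penalty) {
    for (int i = 0; i < n_seen; i++) {
        uint32_t id = seen[i];
        if (id >= (uint32_t)vocab) continue;
        if (logits[id] > 0.0f) logits[id] /= penalty;
        else                   logits[id] *= penalty;
    }
}
```

Running this on the GPU avoids reading the full vocabulary-sized logits buffer back to the CPU each step.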
Model Inspection
int akunu_tensor_count(akunu_model_t model);
const char *akunu_tensor_name(akunu_model_t model, int index);
uint32_t akunu_tensor_dtype(akunu_model_t model, int index);
const char *akunu_tensor_raw_dtype(akunu_model_t model, int index);
These functions allow iterating over all tensors in the model. akunu_inspect uses them to dump the full tensor list. akunu_tensor_raw_dtype returns the original dtype string (e.g., “BF16” for SafeTensors) while akunu_tensor_dtype returns the internal GGUF dtype code.
Thread Safety
The akunu API has the following thread safety guarantees:
- Different model handles are fully independent. You can call functions on model_A from thread 1 and model_B from thread 2 concurrently with no synchronization needed.
- A single model handle is NOT thread-safe. You must serialize all calls to the same model. The akunu_serve server handles this with a per-model mutex.
- akunu_get_error() is thread-safe. It uses thread-local storage.
- Model loading (akunu_load_model) is thread-safe. Each call creates its own device and resources.
Complete Example
Here is a complete program that loads a model, generates text, and reports statistics:
#include "akunu/akunu.h"
#include <stdio.h>
#include <string.h>
static bool on_token(uint32_t token_id, const char *text, void *user_data) {
printf("%s", text);
fflush(stdout);
(void)token_id;
(void)user_data;
return true; // continue generating
}
int main(int argc, char **argv) {
if (argc < 3) {
fprintf(stderr, "Usage: %s <model.gguf> <akunu.metallib>\n", argv[0]);
return 1;
}
// Load model
akunu_model_t model = akunu_load_model(argv[1], argv[2], 4096);
if (!model) {
fprintf(stderr, "Failed to load model: %s\n", akunu_get_error());
return 1;
}
// Print model info
AkunuModelConfig cfg = akunu_get_config(model);
printf("Model: %s, %u layers, %u dim, %.1f MB GPU memory\n",
cfg.architecture, cfg.n_layers, cfg.dim,
akunu_model_memory(model) / 1048576.0);
// Tokenize prompt
const char *prompt = "Explain the roofline model in one paragraph:";
uint32_t tokens[4096];
int n_tokens = akunu_encode(model, prompt, tokens, 4096);
printf("Prompt: %d tokens\n\n", n_tokens);
// Generate
AkunuSamplingConfig sampling = {
.temperature = 0.0f, // greedy
.top_k = 0,
.top_p = 1.0f,
.min_p = 0.0f,
.repeat_penalty = 1.0f
};
AkunuGenerationStats stats = akunu_generate(
model, tokens, n_tokens,
256, // max_tokens
sampling,
on_token,
NULL // user_data
);
// Report
printf("\n\n--- Stats ---\n");
printf("Prefill: %d tokens in %.1f ms (%.0f tok/s)\n",
stats.prompt_tokens, stats.prefill_time_ms,
stats.prefill_tokens_per_sec);
printf("Decode: %d tokens in %.1f ms (%.0f tok/s)\n",
stats.generated_tokens, stats.decode_time_ms,
stats.decode_tokens_per_sec);
akunu_free_model(model);
return 0;
}
Compile and run:
clang -std=c11 -I include example.c -L build -lakunu_engine \
-framework Metal -framework Foundation -framework Accelerate \
-framework IOKit -lstdc++ -o example
./example path/to/model.gguf path/to/akunu.metallib
API Function Reference
| Function | Returns | Description |
|---|---|---|
| akunu_load_model | akunu_model_t | Load model, returns NULL on error |
| akunu_free_model | void | Free all model resources |
| akunu_get_config | AkunuModelConfig | Get model architecture metadata |
| akunu_model_memory | size_t | Total GPU memory in bytes |
| akunu_get_error | const char* | Last error message (thread-local) |
| akunu_encode | int | Tokenize text to token IDs |
| akunu_decode_token | const char* | Token ID to text |
| akunu_token_count | int | Count tokens in text |
| akunu_generate | AkunuGenerationStats | Full generation pipeline |
| akunu_generate_continue | AkunuGenerationStats | Continue from existing KV cache |
| akunu_generate_grammar | AkunuGenerationStats | Grammar-constrained generation |
| akunu_generate_grammar_continue | AkunuGenerationStats | Continue with grammar |
| akunu_grammar_create | akunu_grammar_t | Create grammar from GBNF |
| akunu_grammar_create_from_schema | akunu_grammar_t | Create grammar from JSON Schema |
| akunu_grammar_create_json | akunu_grammar_t | Create generic JSON grammar |
| akunu_grammar_free | void | Free grammar |
| akunu_prefill | uint32_t | Run prefill, return first token |
| akunu_decode_step | uint32_t | Run one decode step |
| akunu_chain_decode | int | Chain decode multiple tokens |
| akunu_get_position | int | Current KV cache position |
| akunu_set_speculation | void | Enable/disable speculative decode |
| akunu_reset | void | Reset KV cache |
| akunu_embed | int | Compute embeddings from tokens |
| akunu_embed_text | int | Compute embeddings from text |
| akunu_embedding_dim | int | Get embedding dimension |
| akunu_format_chat | const char* | Format chat message |
| akunu_chat_template | const char* | Get template name |
| akunu_transcribe | const char* | Transcribe WAV file |
| akunu_transcribe_pcm | const char* | Transcribe PCM buffer |
| akunu_transcribe_stream | const char* | Transcribe with segment callback |
| akunu_transcribe_pcm_stream | const char* | Transcribe PCM with callback |
| akunu_set_timestamps | void | Enable/disable Whisper timestamps |
| akunu_is_whisper | bool | Check if model is Whisper |
| akunu_profile_decode_step | int | Per-operation GPU timing |
| akunu_profile_label | const char* | Label for profiled operation |
| akunu_gpu_temperature_scale | void | GPU-side temperature scaling |
| akunu_gpu_repetition_penalty | void | GPU-side repetition penalty |
| akunu_tensor_count | int | Number of model tensors |
| akunu_tensor_name | const char* | Tensor name by index |
| akunu_tensor_dtype | uint32_t | Tensor GGUF dtype code |
| akunu_tensor_raw_dtype | const char* | Tensor original dtype string |
Summary
The C API is akunu’s external interface. It uses the opaque handle pattern, POD structs, and C linkage to provide maximum compatibility across languages and compilers. The callback-based generation pattern supports streaming output without allocating result buffers. Thread safety is per-model-handle, requiring callers to serialize access to a single model.
1. The C FFI is effectively the lingua franca of systems programming: every major language runtime supports calling C functions with zero or minimal overhead. For a survey of language support, see "Foreign Function Interface" on Wikipedia: https://en.wikipedia.org/wiki/Foreign_function_interface
2. This "handle + function" pattern is sometimes called the "C object pattern" or "ADT (Abstract Data Type) in C." It provides encapsulation without language-level support for classes. The Linux kernel uses this pattern extensively for device drivers.
3. Grammar-constrained decoding uses the XGrammar library (v0.1.33) internally. XGrammar compiles the grammar into a token mask that can be applied at each decoding step. See the XGrammar project: https://github.com/mlc-ai/xgrammar