The HTTP Server

Akunu ships with a built-in HTTP server that implements the OpenAI-compatible /v1/chat/completions API. This means any application that talks to OpenAI’s API can switch to a local Akunu instance by changing a single URL. The server handles model management, chat template formatting, streaming via Server-Sent Events, rate limiting, metrics, grammar-constrained generation, tool calling, and prefix caching – all in a single header file (src/server/serve.h) with no external dependencies beyond Akunu’s own HTTP primitives.

ServerConfig

The server is configured through a ServerConfig struct:

struct ServerConfig {
    std::string host = "127.0.0.1";
    int port = 8080;
    int max_context = 4096;
    int max_queue_depth = 16;
    int rate_limit_per_minute = 0;  // 0 = unlimited
    std::string api_key;            // empty = no auth
    int default_max_tokens = 2048;
    int idle_timeout_seconds = 0;   // 0 = disabled
};
| Field | Default | Description |
| --- | --- | --- |
| host | 127.0.0.1 | Bind address (use 0.0.0.0 for all interfaces) |
| port | 8080 | Listen port |
| max_context | 4096 | Maximum KV cache context length |
| max_queue_depth | 16 | Maximum queued requests (not yet implemented as a semaphore) |
| rate_limit_per_minute | 0 | Requests per minute per client IP (0 = unlimited) |
| api_key | "" | Bearer token for auth (empty = no auth) |
| default_max_tokens | 2048 | Default max_tokens if not specified in request |
| idle_timeout_seconds | 0 | Auto-unload models after this many seconds idle (0 = disabled) |

When idle_timeout_seconds is set, a background thread checks every 30 seconds for models that have not been accessed within the timeout window and unloads them:

if (config_.idle_timeout_seconds > 0) {
    idle_thread_ = std::thread([this]() {
        while (!idle_stop_.load()) {
            std::this_thread::sleep_for(std::chrono::seconds(30));
            registry_.unload_idle(config_.idle_timeout_seconds);
        }
    });
}

The Model Registry

The server can host multiple models simultaneously. The ModelRegistry manages them:

ModelRegistry
  |
  +-- add(handle, id, path, metallib)
  +-- remove(id)
  +-- resolve(requested_id) -> ModelEntry
  +-- model_list() -> JSON
  +-- unload_idle(timeout_seconds)

Model Resolution

When a request comes in with a model field, the registry uses a multi-level matching strategy:

1. Exact match:            "llama-3.1-8b-q4" == "llama-3.1-8b-q4"
2. Case-insensitive match: "Llama-3.1-8B-Q4" == "llama-3.1-8b-q4"
3. Substring match:        "llama" matches "llama-3.1-8b-q4"
4. Default fallback:       any request -> first loaded model

This flexible matching means you can use short names in your client code ("llama") and they will resolve to the full model ID. The default fallback means single-model setups “just work” regardless of what model name the client sends.
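The cascade above can be expressed as a standalone function. This is a simplified sketch (the real resolve() works on ModelEntry objects under the registry mutex and returns a pointer, not an ID string):

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Simplified sketch of the four-level model resolution strategy.
static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
}

std::string resolve_model(const std::string &requested,
                          const std::vector<std::string> &loaded) {
    if (loaded.empty()) return "";
    // 1. Exact match
    for (const auto &id : loaded)
        if (id == requested) return id;
    // 2. Case-insensitive match
    std::string req_lower = to_lower(requested);
    for (const auto &id : loaded)
        if (to_lower(id) == req_lower) return id;
    // 3. Substring match
    for (const auto &id : loaded)
        if (to_lower(id).find(req_lower) != std::string::npos) return id;
    // 4. Default fallback: first loaded model
    return loaded.front();
}
```

Note that the fallback at step 4 is what makes single-model deployments tolerant of arbitrary client-side model names.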

ModelEntry and Prefix Caching

Each ModelEntry tracks state for prefix caching:

struct ModelEntry {
    akunu_model_t handle;
    std::string id;
    std::string path;
    std::atomic<int64_t> last_access{0};
    std::mutex mu;  // serialize inference per model

    // Prefix cache
    std::vector<uint32_t> cached_tokens;
    int cached_position = 0;

    int shared_prefix(const uint32_t *tokens, int n_tokens) const {
        int shared = 0;
        int limit = std::min((int)cached_tokens.size(), n_tokens);
        for (int i = 0; i < limit; i++) {
            if (cached_tokens[i] != tokens[i]) break;
            shared++;
        }
        return shared;
    }
};

Prefix caching is simple but effective: if the new request shares a prefix with the previous request’s tokens, Akunu can skip re-encoding that prefix and continue from where it left off. This is common in chat scenarios where each turn appends to the conversation history:

Turn 1: [system][user_1]              -> process all
Turn 2: [system][user_1][asst_1][user_2]  -> skip [system][user_1]
Turn 3: [system][user_1][asst_1][user_2][asst_2][user_3]  -> skip more

The server checks if the shared prefix length exceeds the current KV cache position, and if the total estimated length (prefix + new tokens + max generation) fits within max_context. If the context would overflow, it resets the KV cache and processes from scratch:

int shared = entry->shared_prefix(tokens.data(), n_tokens);
bool use_continue = (shared > 0 && shared <= entry->cached_position);

int est_position = use_continue
    ? shared + (n_tokens - shared) + max_tokens
    : n_tokens + max_tokens;

if (est_position > config_.max_context)
    use_continue = false;  // would overflow, reset

if (!use_continue)
    akunu_reset(entry->handle);

API Routes

The server registers these routes:

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/chat/completions | Chat completions (OpenAI-compatible) |
| POST | /v1/completions | Text completions |
| GET | /v1/models | List loaded models |
| POST | /v1/tokenize | Tokenize text (extension) |
| GET | /health | Health check |
| GET | /v1/metrics | Server metrics |
| POST | /v1/models/load | Load a model (extension) |
| POST | /v1/models/unload | Unload a model (extension) |
| POST | /v1/audio/transcriptions | Whisper transcription (OpenAI-compatible) |

Chat Completions

The /v1/chat/completions endpoint accepts the standard OpenAI request format:

{
  "model": "llama-3.1-8b",
  "messages": [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 40,
  "min_p": 0.0,
  "stream": true,
  "stop": ["\n\n"],
  "response_format": {"type": "json_object"},
  "tools": [...]
}

Supported sampling parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| temperature | float | 0.0 | Sampling temperature (0 = greedy) |
| top_p | float | 0.9 | Nucleus sampling threshold |
| top_k | int | 40 | Top-K sampling |
| min_p | float | 0.0 | Minimum probability threshold |
| frequency_penalty | float | 0.0 | Mapped to repeat_penalty = 1 + freq_pen |
| max_tokens | int | 2048 | Maximum tokens to generate |
| stream | bool | false | Enable SSE streaming |
| stop | string/array | [] | Stop sequences |

Chat Template Formatting

The server auto-detects the correct chat template based on the model architecture:

| Architecture | Template | Example |
| --- | --- | --- |
| llama, mistral | LLaMA 3 | <\|start_header_id\|>user<\|end_header_id\|> |
| qwen3 | ChatML | <\|im_start\|>user |
| gemma, gemma3 | Gemma | <start_of_turn>user |
| (default) | ChatML | <\|im_start\|>user |
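As an illustration, the ChatML path (the default) can be sketched as a small formatting function. The function name and message representation here are illustrative, not the actual serve.h internals:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Sketch of ChatML formatting: each message becomes
// <|im_start|>role\ncontent<|im_end|>\n, and the prompt ends with an
// open assistant turn for the model to complete.
std::string format_chatml(
        const std::vector<std::pair<std::string, std::string>> &messages) {
    std::string out;
    for (const auto &[role, content] : messages) {
        out += "<|im_start|>" + role + "\n" + content + "<|im_end|>\n";
    }
    out += "<|im_start|>assistant\n";  // generation starts here
    return out;
}
```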

For Qwen3 models, the server automatically appends /no_think to the system prompt to disable the model’s “thinking” mode, which produces verbose chain-of-thought output that most API users do not want.

Tool Calling

When the request includes a tools array, the server injects tool definitions into the system prompt:

# Tools

You have access to the following tools:

## get_weather
Get the current weather for a location.
Parameters: {"type": "object", "properties": {"location": ...}}

To call a tool, output: <tool_call>{"name": "function_name", "arguments": {...}}</tool_call>

After generation, the output is parsed for tool call patterns. The server recognizes two formats:

  1. ChatML/Qwen: <tool_call>{"name": "...", "arguments": ...}</tool_call>
  2. LLaMA 3.1+: Direct JSON with a "name" key

If tool calls are detected, the response’s finish_reason is set to "tool_calls" and the parsed calls are included in the response.
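Extraction of the ChatML/Qwen format can be sketched as a scan for <tool_call> spans. This is a simplified sketch: the real parser also validates each JSON payload and handles the bare-JSON LLaMA 3.1 format.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch: extract the JSON payloads between <tool_call> and </tool_call>.
std::vector<std::string> parse_tool_calls(const std::string &output) {
    std::vector<std::string> calls;
    const std::string open = "<tool_call>", close = "</tool_call>";
    size_t pos = 0;
    while ((pos = output.find(open, pos)) != std::string::npos) {
        size_t start = pos + open.size();
        size_t end = output.find(close, start);
        if (end == std::string::npos) break;  // unterminated tag: stop
        calls.push_back(output.substr(start, end - start));
        pos = end + close.size();
    }
    return calls;
}
```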

Streaming (Server-Sent Events)

When stream: true, the server uses SSE (Server-Sent Events) to stream tokens as they are generated:

HTTP/1.1 200 OK
Content-Type: text/event-stream

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk",...,"choices":[{"delta":{"role":"assistant"}}]}

data: {"id":"chatcmpl-abc",...,"choices":[{"delta":{"content":"Hello"}}]}

data: {"id":"chatcmpl-abc",...,"choices":[{"delta":{"content":" world"}}]}

data: {"id":"chatcmpl-abc",...,"choices":[{"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Each generated token triggers the callback function, which:

  1. Feeds the token text through the StopSequenceDetector
  2. If safe to emit (no partial stop sequence match), sends an SSE event
  3. If a stop sequence is matched, stops generation

Stop Sequence Detection

The StopSequenceDetector handles the tricky case where a stop sequence might span multiple tokens. For example, if the stop sequence is "\n\n" and tokens arrive as "\n" then "\n", the detector must buffer the first "\n" until it can determine whether the next token completes the stop sequence or not:

Token: "Hello"  -> emit "Hello", buffer empty
Token: " \n"    -> emit " ", buffer "\n" (partial match)
Token: "\n"     -> stop sequence matched! emit nothing, stop

The algorithm:

std::string feed(const std::string& token, bool& stopped) {
    buffer_ += token;

    // Check for a complete match
    for (auto& seq : sequences_) {
        size_t pos = buffer_.find(seq);
        if (pos != std::string::npos) {
            stopped = true;
            return buffer_.substr(0, pos);  // text before the match
        }
    }

    // Check for a partial match at the end of the buffer
    size_t max_prefix = 0;
    for (auto& seq : sequences_) {
        // How much of seq matches the end of the buffer?
        for (size_t len = 1; len <= seq.size(); len++) {
            if (buffer_.ends_with(seq.substr(0, len)))
                max_prefix = std::max(max_prefix, len);
        }
    }

    // Emit everything except the potential partial match
    std::string safe = buffer_.substr(0, buffer_.size() - max_prefix);
    buffer_ = buffer_.substr(buffer_.size() - max_prefix);
    return safe;
}
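A self-contained version of this detector (the class shape here is illustrative; serve.h's detector follows the same feed() logic) can be exercised directly against the "\n\n" example above:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Sketch: buffered stop-sequence detection across token boundaries.
class StopDetector {
public:
    explicit StopDetector(std::vector<std::string> sequences)
        : sequences_(std::move(sequences)) {}

    // Returns the text safe to emit; sets stopped when a sequence completes.
    std::string feed(const std::string &token, bool &stopped) {
        stopped = false;
        buffer_ += token;
        for (const auto &seq : sequences_) {
            size_t pos = buffer_.find(seq);
            if (pos != std::string::npos) {
                stopped = true;
                return buffer_.substr(0, pos);  // text before the match
            }
        }
        // Hold back any suffix of the buffer that could begin a stop sequence
        size_t max_prefix = 0;
        for (const auto &seq : sequences_) {
            size_t top = std::min(seq.size(), buffer_.size());
            for (size_t len = 1; len <= top; len++) {
                if (buffer_.compare(buffer_.size() - len, len, seq, 0, len) == 0)
                    max_prefix = std::max(max_prefix, len);
            }
        }
        std::string safe = buffer_.substr(0, buffer_.size() - max_prefix);
        buffer_.erase(0, buffer_.size() - max_prefix);
        return safe;
    }

private:
    std::vector<std::string> sequences_;
    std::string buffer_;
};
```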

Rate Limiter

The rate limiter uses a token bucket algorithm per client IP:

Rate limiter state per IP:
  tokens: double (starts at rpm, refills over time)
  last_refill: timestamp

On each request:
  elapsed = now - last_refill
  tokens += elapsed * (rpm / 60.0)
  tokens = min(tokens, rpm)       // cap at burst size
  if tokens >= 1.0:
    tokens -= 1.0
    -> allow
  else:
    -> reject with 429

The implementation includes periodic eviction of stale buckets (every 1000 calls, remove entries idle for more than 5 minutes) to prevent unbounded memory growth from unique client IPs.
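The refill arithmetic above maps directly to a small class. This sketch omits the mutex and the stale-bucket eviction, and the names are illustrative rather than the actual serve.h types:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>

// Sketch: per-IP token bucket refilled at rpm/60 tokens per second,
// capped at rpm (the burst size).
class TokenBucketLimiter {
public:
    explicit TokenBucketLimiter(double rpm) : rpm_(rpm) {}

    bool allow(const std::string &ip, double now_seconds) {
        auto &b = buckets_[ip];
        if (!b.initialized) {  // first request from this IP: full bucket
            b.tokens = rpm_;
            b.last_refill = now_seconds;
            b.initialized = true;
        }
        double elapsed = now_seconds - b.last_refill;
        b.tokens = std::min(rpm_, b.tokens + elapsed * (rpm_ / 60.0));
        b.last_refill = now_seconds;
        if (b.tokens >= 1.0) {
            b.tokens -= 1.0;
            return true;
        }
        return false;  // caller responds with 429
    }

private:
    struct Bucket {
        double tokens = 0.0;
        double last_refill = 0.0;
        bool initialized = false;
    };
    double rpm_;
    std::unordered_map<std::string, Bucket> buckets_;
};
```

Starting each bucket full means a new client can burst up to rpm requests immediately, then settles into the steady refill rate.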

Metrics

The Metrics class tracks per-model and aggregate statistics:

class Metrics {
    int total_requests_ = 0;
    int total_prompt_tokens_ = 0;
    int total_completion_tokens_ = 0;
    struct ModelMetrics {
        int requests = 0;
        int prompt_tokens = 0;
        int completion_tokens = 0;
    };
    std::unordered_map<std::string, ModelMetrics> per_model_;
};

The /v1/metrics endpoint returns a JSON snapshot:

{
  "total_requests": 42,
  "total_prompt_tokens": 12000,
  "total_completion_tokens": 8500,
  "uptime_seconds": 3600.5,
  "models": {
    "llama-3.1-8b-q4": {
      "total_requests": 30,
      "total_prompt_tokens": 9000,
      "total_completion_tokens": 6000
    }
  }
}

All metrics operations are mutex-protected for thread safety.

Thread Safety

The server’s thread safety model is straightforward:

HTTP Server
  |
  +-- Request arrives on server thread pool
  |
  +-- Auth check (stateless, safe)
  +-- Rate limiter check (mutex-protected)
  +-- Model resolve (registry mutex)
  |
  +-- Acquire model inference lock (entry->mu)
  |     Only ONE inference per model at a time
  |
  +-- Tokenize, prefill, generate
  |     (single-threaded inference)
  |
  +-- Release model lock
  +-- Record metrics (metrics mutex)

The key constraint is entry->mu – a per-model mutex that serializes inference. This is necessary because the GPU resources (KV cache, scratch buffers) are not duplicated per request. A future enhancement could support concurrent requests to the same model with multiple KV cache slots, but for single-user scenarios this serialization is both simple and correct.

There is a subtle TOCTOU (time-of-check-time-of-use) guard: after acquiring the inference lock, the server re-checks that the model handle is still valid, because the idle unload thread might have freed it between the resolve() call and the lock acquisition:

std::lock_guard<std::mutex> infer_lock(entry->mu);

// Re-check handle after acquiring lock
if (!entry->handle) {
    send_error(conn, 503, "Model was unloaded",
               "server_error", "model_unloaded");
    return;
}

JSON Mode and Grammar Constraints

The server supports three levels of output structure:

  1. Unconstrained: normal text generation
  2. JSON mode (response_format: {type: "json_object"}): augments the system prompt with a JSON instruction and uses Akunu’s grammar engine to constrain output to valid JSON
  3. JSON Schema (response_format: {type: "json_schema", json_schema: {schema: ...}}): constrains output to match a specific JSON schema

Grammar objects are managed with RAII to prevent leaks on early returns:

akunu_grammar_t grammar = nullptr;
struct GrammarGuard {
    akunu_grammar_t& g;
    ~GrammarGuard() { if (g) { akunu_grammar_free(g); g = nullptr; } }
} grammar_guard{grammar};

For JSON mode, the system prompt is augmented:

IMPORTANT: You must respond with valid JSON only. No markdown,
no explanation, just a JSON object or array.

After generation, the server attempts to extract clean JSON from the output, handling cases where the model wraps its response in markdown code fences.
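The fence-stripping step can be sketched as follows (a simplified sketch; the real extraction also validates the result with the JSON parser before returning it):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Sketch: if the model wrapped its answer in a ``` fence, return only
// the fenced body; otherwise return the output unchanged.
std::string extract_json(const std::string &output) {
    size_t fence = output.find("```");
    if (fence == std::string::npos) return output;  // no fence: as-is
    size_t start = output.find('\n', fence);
    if (start == std::string::npos) return output;
    start += 1;  // skip the ``` (or ```json) line itself
    size_t end = output.find("```", start);
    if (end == std::string::npos) return output;    // unterminated fence
    // Trim trailing whitespace before the closing fence
    while (end > start && std::isspace((unsigned char)output[end - 1]))
        end--;
    return output.substr(start, end - start);
}
```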

Request Logging

Every request and response is logged to stderr with timestamps and performance data:

[14:32:05] --> POST /v1/chat/completions model=llama-3.1-8b stream max_tokens=256
[14:32:06] <-- 200 llama-3.1-8b prompt=45 completion=128 prefill=1200 t/s decode=95 t/s 1340ms stop

This gives operators immediate visibility into request patterns and model performance without any additional monitoring infrastructure.

Summary

The Akunu HTTP server packs a lot of functionality into a single header:

serve.h
  |
  +-- ServerConfig           (bind address, limits, auth)
  +-- ModelRegistry          (multi-model, flexible resolution)
  +-- ModelEntry             (per-model state, prefix caching)
  +-- RateLimiter            (token bucket per client IP)
  +-- Metrics                (per-model request/token counters)
  +-- StopSequenceDetector   (buffered multi-token stop detection)
  +-- Chat template logic    (LLaMA 3, ChatML, Gemma auto-detect)
  +-- Tool call parsing      (ChatML and LLaMA formats)
  +-- JSON mode / grammar    (constrained generation)
  +-- SSE streaming          (OpenAI-compatible chunks)
  +-- AkunuServer            (ties it all together)

The design philosophy is zero external dependencies and OpenAI wire compatibility. Any client library that works with the OpenAI API – Python’s openai package, LangChain, LlamaIndex, Cursor, Continue – can point at an Akunu server with no code changes beyond the base URL.