The Grammar Engine

Here is a scenario that breaks naive LLM deployment: you ask a model to produce JSON, and it gives you {"name": "Alice", "age": 30,} – with a trailing comma. Perfectly reasonable text. Completely invalid JSON. Your downstream parser chokes, your API returns a 500, and your user files a bug report.

Grammar-constrained decoding fixes this by making it impossible for the model to produce invalid output. Before sampling each token, the grammar engine masks out every token that would violate the grammar. The model can only choose from tokens that lead to valid continuations. If the grammar says no trailing comma, the token for , gets its logit set to negative infinity before softmax.

Akunu implements this with a Nondeterministic Pushdown Automaton (NPDA) over GBNF grammars, ported from llama.cpp and extended with JSON Schema support and deferred activation for thinking models. This chapter covers the grammar format, the NPDA engine, token-to-codepoint mapping, the apply/accept cycle, and the JSON Schema converter.

GBNF: The Grammar Format

GBNF (GGML BNF) is a BNF-like grammar notation used by llama.cpp. It looks like this:

root   ::= object
object ::= "{" ws pair ("," ws pair)* ws "}"
pair   ::= string ws ":" ws value
string ::= "\"" chars "\""
chars  ::= char*
char   ::= [^"\\] | "\\" escape
value  ::= string | number | "true" | "false" | "null" | object | array
number ::= "-"? [0-9] [0-9]*
ws     ::= [ \t\n]*

Each line defines a rule. The left side is a rule name, ::= separates it from the body, and the body is a sequence of elements separated by spaces. Alternatives are separated by |.

The element types are:

+--------------------+-------------------+---------------------------+
| Syntax             | Example           | Meaning                   |
+--------------------+-------------------+---------------------------+
| "literal"          | "true"            | Match exact string        |
| [abc]              | [0-9]             | Character class           |
| [^abc]             | [^"\\]            | Negated character class   |
| rule-name          | string            | Non-terminal reference    |
| (group)            | ("a" | "b")       | Grouping                  |
| expr*              | char*             | Zero or more              |
| expr+              | [0-9]+            | One or more               |
| expr?              | "-"?              | Zero or one               |
| expr{n}            | [0-9]{4}          | Exactly n                 |
| expr{n,m}          | [0-9]{1,3}        | Between n and m           |
| .                  | .                 | Any character (wildcard)  |
+--------------------+-------------------+---------------------------+
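These element types compose freely. As an illustration (not a rule from Akunu itself), a quoted ISO-8601 date can be expressed with literals, character classes, and bounded repetition together:

```
date ::= "\"" [0-9]{4} "-" [0-9]{2} "-" [0-9]{2} "\""
```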

Internal Representation: Grammar Elements

After parsing, each rule is a vector of GrammarElement structs:

enum GrammarElementType : uint32_t {
    GTYPE_END = 0,           // end of rule
    GTYPE_ALT = 1,           // alternate (|)
    GTYPE_RULE_REF = 2,      // non-terminal reference
    GTYPE_CHAR = 3,          // match Unicode code point
    GTYPE_CHAR_NOT = 4,      // inverse char class [^...]
    GTYPE_CHAR_RNG_UPR = 5,  // upper bound of range
    GTYPE_CHAR_ALT = 6,      // additional char in class
    GTYPE_CHAR_ANY = 7,      // wildcard .
};

struct GrammarElement {
    GrammarElementType type;
    uint32_t value;  // code point, rule ID, etc.
};

A rule like digit ::= [0-9] becomes:

[{GTYPE_CHAR, '0'}, {GTYPE_CHAR_RNG_UPR, '9'}, {GTYPE_END, 0}]

A rule like bool ::= "true" | "false" becomes:

[{GTYPE_CHAR, 't'}, {GTYPE_CHAR, 'r'}, {GTYPE_CHAR, 'u'}, {GTYPE_CHAR, 'e'},
 {GTYPE_ALT, 0},
 {GTYPE_CHAR, 'f'}, {GTYPE_CHAR, 'a'}, {GTYPE_CHAR, 'l'}, {GTYPE_CHAR, 's'}, {GTYPE_CHAR, 'e'},
 {GTYPE_END, 0}]

Character classes are encoded as the first char type (GTYPE_CHAR or GTYPE_CHAR_NOT) followed by GTYPE_CHAR_ALT elements for additional characters, with GTYPE_CHAR_RNG_UPR for ranges:

[a-zA-Z] becomes:
  [{GTYPE_CHAR, 'a'}, {GTYPE_CHAR_RNG_UPR, 'z'},
   {GTYPE_CHAR_ALT, 'A'}, {GTYPE_CHAR_RNG_UPR, 'Z'}]

The GBNF Parser

The parser in grammar.cpp is a recursive-descent parser operating on the GBNF source string. It is structured as a GBNFParser struct with these key methods:

parse_all()          -- top level: parse rule definitions until EOF
parse_rule_def()     -- parse "name ::= body\n"
parse_alternates()   -- parse "seq1 | seq2 | seq3"
parse_sequence()     -- parse a sequence of elements
parse_char_class()   -- parse [a-z] or [^"\\]
handle_repetition()  -- handle *, +, ?, {n,m} after an element

Repetition Expansion

Repetition operators are expanded into helper rules at parse time. This is necessary because the NPDA engine does not directly support repetition – it only understands sequences, alternates, and rule references.

Original:  digit ::= [0-9]+

Expanded:
  digit     ::= digit_sub digit_rep
  digit_sub ::= [0-9]
  digit_rep ::= digit_sub digit_rep |    (empty alternate)

The * operator (zero or more) generates:

expr* becomes:
  rep_rule ::= expr rep_rule |   (empty alternate = epsilon)

The ? operator (zero or one) generates:

expr? becomes:
  opt_rule ::= expr |   (empty alternate)

Bounded repetition {n,m} generates a chain of optional rules:

expr{2,4} becomes:
  expr expr opt_1
  opt_1 ::= expr opt_0 |
  opt_0 ::= expr |

Two copies are mandatory; each opt_N rule contributes at most one more, for a total of 2 to 4.

This expansion happens inside handle_repetition(), which extracts the sub-expression, wraps it in a helper rule if complex, and emits the appropriate chain of rule references.

The NPDA Engine

Now for the core of the grammar engine: the nondeterministic pushdown automaton. This is what determines, at each generation step, which tokens are valid continuations.

What is an NPDA?

A pushdown automaton is like a finite state machine but with a stack. The stack allows it to match nested structures (like balanced braces) that regular expressions cannot. “Nondeterministic” means the automaton can be in multiple states simultaneously – it explores all possible paths through the grammar in parallel.

In Akunu’s implementation, the state is represented as a set of stacks, where each stack is a vector of pointers into rule elements:

using GrammarStack = std::vector<const GrammarElement *>;
using GrammarStacks = std::vector<GrammarStack>;

Each stack represents one possible parsing position in the grammar. The top of the stack points to the next terminal element that needs to be matched. Elements below the top represent “return addresses” – where to continue after the current rule finishes.

Example grammar:
  root   ::= "a" inner "c"
  inner  ::= "b"

Parsing "abc":

Initial stacks (after advance_stack):
  [root@"a"]             -- top of stack points to CHAR 'a' in root

After matching 'a':
  [root@"c", inner@"b"]  -- pushed inner; below the top is the return address (CHAR 'c' in root)

After matching 'b':
  [root@"c"]             -- inner rule finished, popped back to root

After matching 'c':
  []                     -- empty stack = grammar fully matched

advance_stack: Resolving Non-Terminals

The advance_stack() function takes a stack and “advances” it past all GTYPE_RULE_REF elements until every resulting stack ends at a terminal (CHAR, CHAR_NOT, CHAR_ANY) or is empty. This is where nondeterminism kicks in: a rule with alternates produces multiple stacks.

Given stack: [..., ptr to RULE_REF(inner)]

And inner ::= "x" | "y"

advance_stack produces:
  Stack 1: [..., continuation, ptr to CHAR 'x']   (first alternate)
  Stack 2: [..., continuation, ptr to CHAR 'y']   (second alternate)

The function uses a worklist algorithm with deduplication to avoid exponential blowup:

static void advance_stack(const GrammarRules& rules,
                          const GrammarStack& stack,
                          GrammarStacks& new_stacks) {
    std::set<GrammarStack> seen;  // deduplicate
    std::vector<GrammarStack> todo;
    todo.push_back(stack);

    while (!todo.empty()) {
        GrammarStack cur = todo.back(); todo.pop_back();
        if (seen.count(cur)) continue;
        seen.insert(cur);

        if (cur.empty()) {
            new_stacks.push_back(cur);  // grammar complete
            continue;
        }

        const GrammarElement *pos = cur.back();
        switch (pos->type) {
        case GTYPE_RULE_REF:
            // Push alternatives of the referenced rule
            // ... expand into todo ...
            break;
        case GTYPE_CHAR:
        case GTYPE_CHAR_NOT:
        case GTYPE_CHAR_ANY:
            // Terminal -- this stack is ready
            new_stacks.push_back(cur);
            break;
        }
    }
}

match_char: Testing a Code Point

When checking whether a code point matches a character element, match_char() handles chars, ranges, alternates, and negation:

static std::pair<bool, const GrammarElement*> match_char(
        const GrammarElement *pos, uint32_t chr) {
    bool found = false;
    bool is_positive = (pos->type == GTYPE_CHAR || pos->type == GTYPE_CHAR_ANY);

    do {
        if (pos[1].type == GTYPE_CHAR_RNG_UPR) {
            found = found || (pos->value <= chr && chr <= pos[1].value);
            pos += 2;
        } else if (pos->type == GTYPE_CHAR_ANY) {
            found = true; pos += 1;
        } else {
            found = found || pos->value == chr;
            pos += 1;
        }
    } while (pos->type == GTYPE_CHAR_ALT);

    return {found == is_positive, pos};
}

For [^"\\], is_positive is false (because GTYPE_CHAR_NOT), so the result is inverted: found == false means the character is accepted (it is NOT in the exclusion set).

Token-to-Codepoint Mapping

Here is a critical insight: the grammar operates on Unicode code points, but the model generates tokens. A single token might encode multiple code points (“Hello” is 5 code points in 1 token), and a single code point might span multiple tokens (a rare Chinese character might be 3 byte-fallback tokens).

The init_vocab() method pre-computes the code point sequence for every token in the vocabulary:

void Grammar::init_vocab(const Tokenizer& tokenizer) {
    vocab_size_ = tokenizer.vocab_size();
    token_codepoints_.resize(vocab_size_);

    for (int i = 0; i < vocab_size_; i++) {
        std::string text = tokenizer.decode(i);
        token_codepoints_[i] = decode_utf8(text);  // null-terminated
    }
}

This creates a lookup table: token ID -> null-terminated vector of Unicode code points. For example:

Token 9906 ("Hello") -> [72, 101, 108, 108, 111, 0]   (H, e, l, l, o, NUL)
Token 29892 (",")    -> [44, 0]                         (comma, NUL)
Token 259   ("<0xE4>")-> [0xE4 as partial UTF-8]        (lead byte of a 3-byte sequence)

The apply() Method: Masking Invalid Tokens

apply() is called before sampling. It scans every token in the vocabulary, checks whether its code point sequence is a valid continuation of the grammar, and masks invalid tokens to -inf:

void Grammar::apply(float *logits, int vocab_size) const {
    if (awaiting_trigger_) return;  // deferred activation
    if (stacks_.empty()) return;

    // Check if grammar is complete (any empty stack)
    bool allow_eos = false;
    for (auto& stack : stacks_)
        if (stack.empty()) allow_eos = true;

    // Build candidates
    std::vector<GrammarCandidate> candidates;
    for (int i = 0; i < vocab_size; i++) {
        if (is_eos(i)) {
            if (!allow_eos) logits[i] = -INFINITY;
            continue;
        }
        // ... add to candidates ...
    }

    // Reject invalid candidates across all stacks
    auto rejects = reject_candidates(rules_, stacks_, candidates);

    for (auto& r : rejects)
        logits[r.index] = -INFINITY;
}

The reject_candidates() function uses union semantics: a token is rejected only if all stacks reject it. This is the nondeterministic part – the grammar accepts a token if any possible parse path accepts it.

The rejection algorithm works recursively, matching one code point at a time:

reject_candidates(stacks, candidates):
  rejects = reject_for_stack(stacks[0], candidates)
  for each remaining stack:
    rejects = reject_for_stack(stack, rejects)
  return rejects

reject_for_stack(stack, candidates):
  for each candidate token:
    if token's first code point matches stack top:
      add to next_candidates (with code_points advanced by 1)
    else:
      add to rejects
  advance stack past matched element
  recursively check next_candidates against advanced stacks

This handles multi-codepoint tokens correctly: a token like “Hello” (5 codepoints) is checked one codepoint at a time against the grammar, advancing the grammar state at each step.

Partial UTF-8 Handling

Some tokens produce partial UTF-8 sequences (e.g., byte-fallback tokens). The grammar engine handles this with match_partial_char(), which checks whether a partially decoded codepoint could match the current grammar position:

static bool match_partial_char(const GrammarElement *pos,
                               const PartialUTF8& partial) {
    // Calculate the range of codepoints this partial could become
    uint32_t low = partial.value << (partial.n_remain * 6);
    uint32_t high = low | ((1u << (partial.n_remain * 6)) - 1);
    // Check if any codepoint in [low, high] matches
    // ...
}

The accept() Method: Advancing Grammar State

After a token is sampled, accept() advances the grammar state by feeding the token’s code points through the NPDA:

void Grammar::accept(uint32_t token_id) {
    if (awaiting_trigger_) { /* check trigger, return */ }

    const auto& cps = token_codepoints_[token_id];
    GrammarStacks new_stacks;

    for (auto& stack : stacks_) {
        // Match each code point sequentially
        // ... advance stack through the token's code points ...
    }

    stacks_ = new_stacks;
}

After accept(), the grammar’s stacks reflect the new state – only valid continuations from this point forward will be allowed.

Left Recursion Detection

Left recursion in a grammar (e.g., expr ::= expr "+" term) would cause the NPDA’s advance_stack() to loop infinitely. Akunu detects this at parse time:

Grammar Grammar::parse(const std::string& gbnf) {
    // ... parse rules ...

    // Check for left recursion
    for (uint32_t i = 0; i < n; i++) {
        if (detect_left_recursion(rules, i, in_progress, may_be_empty, visited))
            throw std::runtime_error("GBNF: left recursion detected");
    }
}

The detection algorithm does a DFS through rule references, tracking which rules are “in progress” (on the current DFS stack) and which rules “may be empty” (can match epsilon). If a rule references itself (directly or through nullable intermediates) without consuming any input, that is left recursion and we reject the grammar.
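A left-recursive rule can usually be rewritten in the right-recursive or iterative form the engine does accept. For the arithmetic example above:

```
# Instead of: expr ::= expr "+" term | term   (left recursive, rejected)
expr ::= term ("+" term)*
term ::= [0-9]+
```

Both grammars describe the same language; the second consumes a `term` before any recursion-like construct, so `advance_stack()` always makes progress.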

JSON Schema to GBNF: The SchemaConverter

For the common case of JSON output, Akunu provides a json_schema_to_grammar() function that converts a JSON Schema definition into a GBNF grammar. This lives in json_schema_to_grammar.cpp.

The SchemaConverter class is a recursive visitor over JSON Schema nodes:

class SchemaConverter {
    std::string convert(const Json& schema) {
        // Add common rules (ws, json-string, json-number, etc.)
        add_rule("ws", "[ \\t\\n\\r]*");
        add_rule("json-string", "...");
        add_rule("json-number", "...");
        // ...

        // Visit the root schema
        std::string root_body = visit(schema, "root");
        add_rule("root", root_body);
        return format_output();
    }
};

The visitor dispatches on schema type:

Schema type    ->  GBNF generation
-----------        ----------------
"object"       ->  "{" ws prop1 "," ws prop2 ... ws "}"
"array"        ->  "[" ws item ("," ws item)* ws "]"
"string"       ->  json-string (or format-specific rule)
"number"       ->  json-number
"boolean"      ->  "true" | "false"
"null"         ->  "null"
"enum"         ->  "value1" | "value2" | ...
"oneOf"/"anyOf" -> variant1 | variant2 | ...
"allOf"        ->  merged object properties
"const"        ->  literal value

Object Schema

For an object schema with required and optional properties:

{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "age": {"type": "integer"},
    "email": {"type": "string"}
  },
  "required": ["name", "age"]
}

The converter generates:

root ::= "{" ws root-kv-name "," ws root-kv-age ("," ws root-kv-email)? ws "}"
root-kv-name  ::= "\"name\"" ws ":" ws json-string
root-kv-age   ::= "\"age\"" ws ":" ws json-int
root-kv-email ::= "\"email\"" ws ":" ws json-string

Required properties are emitted unconditionally; optional properties are wrapped in (... )?.

String Formats

The converter handles common string formats:

"format": "date"      ->  [0-9]{4} "-" [0-9]{2} "-" [0-9]{2}
"format": "time"      ->  [0-9]{2} ":" [0-9]{2} ":" [0-9]{2}
"format": "date-time" ->  date "T" time ("Z" | offset)
"format": "uuid"      ->  hex{8} "-" hex{4} "-" hex{4} "-" hex{4} "-" hex{12}

Built-in JSON Grammar

For generic JSON mode (no schema), Akunu provides a built-in GBNF grammar:

root     ::= object | array
value    ::= object | array | string | number | "true" | "false" | "null"
object   ::= "{" ws (pair ("," ws pair)*)? ws "}"
pair     ::= string ws ":" ws value
array    ::= "[" ws (value ("," ws value)*)? ws "]"
string   ::= "\"" chars "\""
chars    ::= char*
char     ::= [^"\\] | "\\" escape
escape   ::= ["\\bfnrt/] | "u" [0-9a-fA-F]{4}
number   ::= "-"? int frac? exp?
int      ::= "0" | [1-9] [0-9]*
frac     ::= "." [0-9]+
exp      ::= [eE] [+-]? [0-9]+
ws       ::= [ \t\n\r]*

This grammar guarantees valid JSON output – no trailing commas, no unquoted keys, no mismatched braces.

Deferred Activation: Thinking Models

Models like Qwen3 emit <think>...</think> blocks before their actual answer. If the grammar is active from the start, it would block the model from emitting these thinking tokens (which are not valid JSON or whatever the target grammar is).

Deferred activation solves this:

void Grammar::set_trigger_text(const std::string& text) {
    awaiting_trigger_ = true;
    trigger_text_ = text;
    trigger_buf_.clear();
}

When awaiting_trigger_ is true, apply() returns immediately without masking any tokens. The accept() method accumulates generated text in trigger_buf_ and watches for the trigger string (typically </think>). Once the trigger is found, awaiting_trigger_ flips to false and the grammar begins constraining output normally.

Generation flow with deferred grammar:

  Tokens:  <think> I need to format this as JSON </think> { " n a m e " ...
           |<--- grammar inactive, all tokens allowed --->|<-- grammar active -->|
                                                          ^
                                              trigger "</think>" matched here

The Full apply/accept Cycle

Here is the complete lifecycle during constrained generation:

1. Encode prompt, prefill
2. For each decode step:
   a. Forward pass -> logits [vocab_size]
   b. grammar.apply(logits, vocab_size)
      - For each token, check if its codepoints form a valid continuation
      - Set invalid tokens' logits to -inf
   c. Sample token from masked logits
   d. grammar.accept(token_id)
      - Advance NPDA stacks through the token's codepoints
   e. If grammar.is_done() or grammar.is_exhausted(), stop
3. Output is guaranteed to match the grammar

Summary

Grammar engine architecture:

  GBNF string ──> [Parser] ──> GrammarRules (vectors of GrammarElement)
                                     |
                                     v
                               [NPDA engine]
                                     |
                         +-----------+-----------+
                         |                       |
                    apply(logits)           accept(token)
                         |                       |
                   mask invalid             advance stacks
                   tokens to -inf          through codepoints
                         |                       |
                    [Sampling]              [next step]

  JSON Schema ──> [SchemaConverter] ──> GBNF string ──> (same pipeline)

The grammar engine adds negligible latency to each decode step – the reject_candidates() function processes ~100K tokens in microseconds because most tokens are rejected by the first code point match. The real cost is the one-time init_vocab() call that pre-computes code points for the entire vocabulary, which takes a few milliseconds.