The Grammar Engine
Here is a scenario that breaks naive LLM deployment: you ask a model to produce
JSON, and it gives you {"name": "Alice", "age": 30,} – with a trailing comma.
Perfectly reasonable text. Completely invalid JSON. Your downstream parser chokes,
your API returns a 500, and your user files a bug report.
Grammar-constrained decoding fixes this by making it impossible for the model
to produce invalid output. Before sampling each token, the grammar engine masks
out every token that would violate the grammar. The model can only choose from
tokens that lead to valid continuations. If the grammar says no trailing comma,
the token for , gets its logit set to negative infinity before softmax.
Akunu implements this with a Nondeterministic Pushdown Automaton (NPDA) over GBNF grammars, ported from llama.cpp and extended with JSON Schema support and deferred activation for thinking models. This chapter covers the grammar format, the NPDA engine, token-to-codepoint mapping, the apply/accept cycle, and the JSON Schema converter.
GBNF: The Grammar Format
GBNF (GGML BNF) is a BNF-like grammar notation used by llama.cpp. It looks like this:
root ::= object
object ::= "{" ws pair ("," ws pair)* ws "}"
pair ::= string ws ":" ws value
string ::= "\"" chars "\""
chars ::= char*
char ::= [^"\\] | "\\" escape
value ::= string | number | "true" | "false" | "null" | object | array
number ::= "-"? [0-9] [0-9]*
ws ::= [ \t\n]*
Each line defines a rule. The left side is a rule name, ::= separates it from
the body, and the body is a sequence of elements separated by spaces. Alternatives
are separated by |.
The element types are:
+--------------------+-------------------+---------------------------+
| Syntax | Example | Meaning |
+--------------------+-------------------+---------------------------+
| "literal" | "true" | Match exact string |
| [abc] | [0-9] | Character class |
| [^abc] | [^"\\] | Negated character class |
| rule-name | string | Non-terminal reference |
| (group) | ("a" | "b") | Grouping |
| expr* | char* | Zero or more |
| expr+ | [0-9]+ | One or more |
| expr? | "-"? | Zero or one |
| expr{n} | [0-9]{4} | Exactly n |
| expr{n,m} | [0-9]{1,3} | Between n and m |
| . | . | Any character (wildcard) |
+--------------------+-------------------+---------------------------+
Internal Representation: Grammar Elements
After parsing, each rule is a vector of GrammarElement structs:
enum GrammarElementType : uint32_t {
GTYPE_END = 0, // end of rule
GTYPE_ALT = 1, // alternate (|)
GTYPE_RULE_REF = 2, // non-terminal reference
GTYPE_CHAR = 3, // match Unicode code point
GTYPE_CHAR_NOT = 4, // inverse char class [^...]
GTYPE_CHAR_RNG_UPR = 5, // upper bound of range
GTYPE_CHAR_ALT = 6, // additional char in class
GTYPE_CHAR_ANY = 7, // wildcard .
};
struct GrammarElement {
GrammarElementType type;
uint32_t value; // code point, rule ID, etc.
};
A rule like digit ::= [0-9] becomes:
[{GTYPE_CHAR, '0'}, {GTYPE_CHAR_RNG_UPR, '9'}, {GTYPE_END, 0}]
A rule like bool ::= "true" | "false" becomes:
[{GTYPE_CHAR, 't'}, {GTYPE_CHAR, 'r'}, {GTYPE_CHAR, 'u'}, {GTYPE_CHAR, 'e'},
{GTYPE_ALT, 0},
{GTYPE_CHAR, 'f'}, {GTYPE_CHAR, 'a'}, {GTYPE_CHAR, 'l'}, {GTYPE_CHAR, 's'}, {GTYPE_CHAR, 'e'},
{GTYPE_END, 0}]
Character classes are encoded as the first char type (GTYPE_CHAR or
GTYPE_CHAR_NOT) followed by GTYPE_CHAR_ALT elements for additional characters,
with GTYPE_CHAR_RNG_UPR for ranges:
[a-zA-Z] becomes:
[{GTYPE_CHAR, 'a'}, {GTYPE_CHAR_RNG_UPR, 'z'},
{GTYPE_CHAR_ALT, 'A'}, {GTYPE_CHAR_RNG_UPR, 'Z'}]
The GBNF Parser
The parser in grammar.cpp is a recursive-descent parser operating on the GBNF
source string. It is structured as a GBNFParser struct with these key methods:
parse_all() -- top level: parse rule definitions until EOF
parse_rule_def() -- parse "name ::= body\n"
parse_alternates() -- parse "seq1 | seq2 | seq3"
parse_sequence() -- parse a sequence of elements
parse_char_class() -- parse [a-z] or [^"\\]
handle_repetition() -- handle *, +, ?, {n,m} after an element
Repetition Expansion
Repetition operators are expanded into helper rules at parse time. This is necessary because the NPDA engine does not directly support repetition – it only understands sequences, alternates, and rule references.
Original: digit ::= [0-9]+
Expanded:
digit ::= digit_sub digit_rep
digit_sub ::= [0-9]
digit_rep ::= digit_sub digit_rep | (empty alternate)
The * operator (zero or more) generates:
expr* becomes:
rep_rule ::= expr rep_rule | (empty alternate = epsilon)
The ? operator (zero or one) generates:
expr? becomes:
opt_rule ::= expr | (empty alternate)
Bounded repetition {n,m} generates a chain of optional rules:
expr{2,4} becomes:
expr expr opt_1
opt_1 ::= expr opt_0 |
opt_0 ::= expr |
The two mandatory copies come first; the opt chain then allows up to two more.
This expansion happens inside handle_repetition(), which extracts the
sub-expression, wraps it in a helper rule if complex, and emits the appropriate
chain of rule references.
The NPDA Engine
Now for the core of the grammar engine: the nondeterministic pushdown automaton. This is what determines, at each generation step, which tokens are valid continuations.
What is an NPDA?
A pushdown automaton is like a finite state machine but with a stack. The stack allows it to match nested structures (like balanced braces) that regular expressions cannot. “Nondeterministic” means the automaton can be in multiple states simultaneously – it explores all possible paths through the grammar in parallel.
In Akunu’s implementation, the state is represented as a set of stacks, where each stack is a vector of pointers into rule elements:
using GrammarStack = std::vector<const GrammarElement *>;
using GrammarStacks = std::vector<GrammarStack>;
Each stack represents one possible parsing position in the grammar. The top of the stack points to the next terminal element that needs to be matched. Elements below the top represent “return addresses” – where to continue after the current rule finishes.
Example grammar:
root ::= "a" inner "c"
inner ::= "b"
Parsing "abc":
Initial stacks (after advance_stack):
[root@"a"] -- top of stack points to CHAR 'a' in root
After matching 'a':
[root@"c", inner@"b"] -- pushed inner; below the top sits the return address (CHAR 'c' in root)
After matching 'b':
[root@"c"] -- inner rule finished, popped back to root
After matching 'c':
[] -- empty stack = grammar fully matched
advance_stack: Resolving Non-Terminals
The advance_stack() function takes a stack and “advances” it past all
GTYPE_RULE_REF elements until every resulting stack ends at a terminal
(CHAR, CHAR_NOT, CHAR_ANY) or is empty. This is where nondeterminism kicks in:
a rule with alternates produces multiple stacks.
Given stack: [..., ptr to RULE_REF(inner)]
And inner ::= "x" | "y"
advance_stack produces:
Stack 1: [..., continuation, ptr to CHAR 'x'] (first alternate)
Stack 2: [..., continuation, ptr to CHAR 'y'] (second alternate)
The function uses a worklist algorithm with deduplication to avoid exponential blowup:
static void advance_stack(const GrammarRules& rules,
const GrammarStack& stack,
GrammarStacks& new_stacks) {
std::set<GrammarStack> seen; // deduplicate
std::vector<GrammarStack> todo;
todo.push_back(stack);
while (!todo.empty()) {
GrammarStack cur = todo.back(); todo.pop_back();
if (seen.count(cur)) continue;
seen.insert(cur);
if (cur.empty()) {
new_stacks.push_back(cur); // grammar complete
continue;
}
const GrammarElement *pos = cur.back();
switch (pos->type) {
case GTYPE_RULE_REF:
// Push alternatives of the referenced rule
// ... expand into todo ...
break;
case GTYPE_CHAR:
case GTYPE_CHAR_NOT:
case GTYPE_CHAR_ANY:
// Terminal -- this stack is ready
new_stacks.push_back(cur);
break;
}
}
}
match_char: Testing a Code Point
When checking whether a code point matches a character element, match_char()
handles chars, ranges, alternates, and negation:
static std::pair<bool, const GrammarElement*> match_char(
const GrammarElement *pos, uint32_t chr) {
bool found = false;
bool is_positive = (pos->type == GTYPE_CHAR || pos->type == GTYPE_CHAR_ANY);
do {
if (pos[1].type == GTYPE_CHAR_RNG_UPR) {
found = found || (pos->value <= chr && chr <= pos[1].value);
pos += 2;
} else if (pos->type == GTYPE_CHAR_ANY) {
found = true; pos += 1;
} else {
found = found || pos->value == chr;
pos += 1;
}
} while (pos->type == GTYPE_CHAR_ALT);
return {found == is_positive, pos};
}
For [^"\\], is_positive is false (because GTYPE_CHAR_NOT), so the result
is inverted: found == false means the character is accepted (it is NOT in the
exclusion set).
Token-to-Codepoint Mapping
Here is a critical insight: the grammar operates on Unicode code points, but the model generates tokens. A single token might encode multiple code points (“Hello” is 5 code points in 1 token), and a single code point might span multiple tokens (a rare Chinese character might be 3 byte-fallback tokens).
The init_vocab() method pre-computes the code point sequence for every token
in the vocabulary:
void Grammar::init_vocab(const Tokenizer& tokenizer) {
vocab_size_ = tokenizer.vocab_size();
token_codepoints_.resize(vocab_size_);
for (int i = 0; i < vocab_size_; i++) {
std::string text = tokenizer.decode(i);
token_codepoints_[i] = decode_utf8(text); // null-terminated
}
}
This creates a lookup table: token ID -> null-terminated vector of Unicode code points. For example:
Token 9906 ("Hello") -> [72, 101, 108, 108, 111, 0] (H, e, l, l, o, NUL)
Token 29892 (",") -> [44, 0] (comma, NUL)
Token 259 ("<0xE4>")-> [0xE4 as partial UTF-8] (leading byte of a 3-byte sequence)
The apply() Method: Masking Invalid Tokens
apply() is called before sampling. It scans every token in the vocabulary,
checks whether its code point sequence is a valid continuation of the grammar,
and masks invalid tokens to -inf:
void Grammar::apply(float *logits, int vocab_size) const {
if (awaiting_trigger_) return; // deferred activation
if (stacks_.empty()) return;
// Check if grammar is complete (any empty stack)
bool allow_eos = false;
for (auto& stack : stacks_)
if (stack.empty()) allow_eos = true;
// Build candidates
std::vector<GrammarCandidate> candidates;
for (int i = 0; i < vocab_size; i++) {
if (is_eos(i)) {
if (!allow_eos) logits[i] = -INFINITY;
continue;
}
// ... add to candidates ...
}
// Reject invalid candidates across all stacks
auto rejects = reject_candidates(rules_, stacks_, candidates);
for (auto& r : rejects)
logits[r.index] = -INFINITY;
}
The reject_candidates() function uses union semantics: a token is rejected only
if all stacks reject it. This is the nondeterministic part – the grammar
accepts a token if any possible parse path accepts it.
The rejection algorithm works recursively, matching one code point at a time:
reject_candidates(stacks, candidates):
rejects = reject_for_stack(stacks[0], candidates)
for each remaining stack:
rejects = reject_for_stack(stack, rejects)
return rejects
reject_for_stack(stack, candidates):
for each candidate token:
if token's first code point matches stack top:
add to next_candidates (with code_points advanced by 1)
else:
add to rejects
advance stack past matched element
recursively check next_candidates against advanced stacks
This handles multi-codepoint tokens correctly: a token like “Hello” (5 codepoints) is checked one codepoint at a time against the grammar, advancing the grammar state at each step.
Partial UTF-8 Handling
Some tokens produce partial UTF-8 sequences (e.g., byte-fallback tokens). The
grammar engine handles this with match_partial_char(), which checks whether
a partially decoded codepoint could match the current grammar position:
static bool match_partial_char(const GrammarElement *pos,
const PartialUTF8& partial) {
// Calculate the range of codepoints this partial could become
uint32_t low = partial.value << (partial.n_remain * 6);
uint32_t high = low | ((1u << (partial.n_remain * 6)) - 1);
// Check if any codepoint in [low, high] matches
// ...
}
The accept() Method: Advancing Grammar State
After a token is sampled, accept() advances the grammar state by feeding
the token’s code points through the NPDA:
void Grammar::accept(uint32_t token_id) {
if (awaiting_trigger_) { /* check trigger, return */ }
const auto& cps = token_codepoints_[token_id];
GrammarStacks new_stacks;
for (auto& stack : stacks_) {
// Match each code point sequentially
// ... advance stack through the token's code points ...
}
stacks_ = new_stacks;
}
After accept(), the grammar’s stacks reflect the new state – only valid
continuations from this point forward will be allowed.
Left Recursion Detection
Left recursion in a grammar (e.g., expr ::= expr "+" term) would cause the
NPDA’s advance_stack() to loop infinitely. Akunu detects this at parse time:
Grammar Grammar::parse(const std::string& gbnf) {
// ... parse rules ...
// Check for left recursion
for (uint32_t i = 0; i < n; i++) {
if (detect_left_recursion(rules, i, in_progress, may_be_empty, visited))
throw std::runtime_error("GBNF: left recursion detected");
}
}
The detection algorithm does a DFS through rule references, tracking which rules are “in progress” (on the current DFS stack) and which rules “may be empty” (can match epsilon). If a rule references itself (directly or through nullable intermediates) without consuming any input, that is left recursion and we reject the grammar.
JSON Schema to GBNF: The SchemaConverter
For the common case of JSON output, Akunu provides a json_schema_to_grammar()
function that converts a JSON Schema definition into a GBNF grammar. This lives
in json_schema_to_grammar.cpp.
The SchemaConverter class is a recursive visitor over JSON Schema nodes:
class SchemaConverter {
std::string convert(const Json& schema) {
// Add common rules (ws, json-string, json-number, etc.)
add_rule("ws", "[ \\t\\n\\r]*");
add_rule("json-string", "...");
add_rule("json-number", "...");
// ...
// Visit the root schema
std::string root_body = visit(schema, "root");
add_rule("root", root_body);
return format_output();
}
};
The visitor dispatches on schema type:
Schema type -> GBNF generation
----------- ----------------
"object" -> "{" ws prop1 "," ws prop2 ... ws "}"
"array" -> "[" ws item ("," ws item)* ws "]"
"string" -> json-string (or format-specific rule)
"number" -> json-number
"boolean" -> "true" | "false"
"null" -> "null"
"enum" -> "value1" | "value2" | ...
"oneOf"/"anyOf" -> variant1 | variant2 | ...
"allOf" -> merged object properties
"const" -> literal value
Object Schema
For an object schema with required and optional properties:
{
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"},
"email": {"type": "string"}
},
"required": ["name", "age"]
}
The converter generates:
root ::= "{" ws root-kv-name "," ws root-kv-age ("," ws root-kv-email)? ws "}"
root-kv-name ::= "\"name\"" ws ":" ws json-string
root-kv-age ::= "\"age\"" ws ":" ws json-int
root-kv-email ::= "\"email\"" ws ":" ws json-string
Required properties are emitted unconditionally; optional properties are wrapped
in (... )?.
String Formats
The converter handles common string formats:
"format": "date" -> [0-9]{4} "-" [0-9]{2} "-" [0-9]{2}
"format": "time" -> [0-9]{2} ":" [0-9]{2} ":" [0-9]{2}
"format": "date-time" -> date "T" time ("Z" | offset)
"format": "uuid" -> hex{8} "-" hex{4} "-" hex{4} "-" hex{4} "-" hex{12}
Built-in JSON Grammar
For generic JSON mode (no schema), Akunu provides a built-in GBNF grammar:
root ::= object | array
value ::= object | array | string | number | "true" | "false" | "null"
object ::= "{" ws (pair ("," ws pair)*)? ws "}"
pair ::= string ws ":" ws value
array ::= "[" ws (value ("," ws value)*)? ws "]"
string ::= "\"" chars "\""
chars ::= char*
char ::= [^"\\] | "\\" escape
escape ::= ["\\bfnrt/] | "u" [0-9a-fA-F]{4}
number ::= "-"? int frac? exp?
int ::= "0" | [1-9] [0-9]*
frac ::= "." [0-9]+
exp ::= [eE] [+-]? [0-9]+
ws ::= [ \t\n\r]*
This grammar guarantees valid JSON output – no trailing commas, no unquoted keys, no mismatched braces.
Deferred Activation: Thinking Models
Models like Qwen3 emit <think>...</think> blocks before their actual answer.
If the grammar were active from the start, it would block the model from emitting
these thinking tokens, which are not valid under the target grammar.
Deferred activation solves this:
void Grammar::set_trigger_text(const std::string& text) {
awaiting_trigger_ = true;
trigger_text_ = text;
trigger_buf_.clear();
}
When awaiting_trigger_ is true, apply() returns immediately without masking
any tokens. The accept() method accumulates generated text in trigger_buf_
and watches for the trigger string (typically </think>). Once the trigger is
found, awaiting_trigger_ flips to false and the grammar begins constraining
output normally.
Generation flow with deferred grammar:
Tokens: <think> I need to format this as JSON </think> { " n a m e " ...
|<--- grammar inactive, all tokens allowed --->|<-- grammar active -->|
^
trigger "</think>" matched here
The Full apply/accept Cycle
Here is the complete lifecycle during constrained generation:
1. Encode prompt, prefill
2. For each decode step:
a. Forward pass -> logits [vocab_size]
b. grammar.apply(logits, vocab_size)
- For each token, check if its codepoints form a valid continuation
- Set invalid tokens' logits to -inf
c. Sample token from masked logits
d. grammar.accept(token_id)
- Advance NPDA stacks through the token's codepoints
e. If grammar.is_done() or grammar.is_exhausted(), stop
3. Output is guaranteed to match the grammar
Summary
Grammar engine architecture:
GBNF string ──> [Parser] ──> GrammarRules (vectors of GrammarElement)
|
v
[NPDA engine]
|
+-----------+-----------+
| |
apply(logits) accept(token)
| |
mask invalid advance stacks
tokens to -inf through codepoints
| |
[Sampling] [next step]
JSON Schema ──> [SchemaConverter] ──> GBNF string ──> (same pipeline)
The grammar engine adds negligible latency to each decode step – the
reject_candidates() function processes ~100K tokens in microseconds because
most tokens are rejected by the first code point match. The real cost is the
one-time init_vocab() call that pre-computes code points for the entire
vocabulary, which takes a few milliseconds.