Grammar-Constrained Generation

Here is a problem you have probably run into: you ask an LLM to produce JSON, and it almost works – except there is a trailing comma, or a missing quote, or it randomly starts explaining the JSON instead of just outputting it. You wrap the prompt in “respond only in valid JSON” and it works 95% of the time. For the other 5%, you add retry logic with exponential backoff. Before long, your “simple API call” has three layers of error handling, and you are still finding edge cases in production.

Grammar-constrained generation solves this problem at the source. Instead of hoping the model produces valid output and cleaning up after it, you force every generated token to be valid according to a formal grammar. The model cannot produce a trailing comma because the grammar says commas must be followed by a value. It cannot produce unmatched braces because the grammar tracks nesting depth. The output is syntactically valid by construction.

This chapter covers how akunu implements grammar-constrained decoding, from the grammar format (GBNF) through the constraint engine (XGrammar), JSON schema conversion, and the GPU-accelerated decode loop.

The Constraint Mechanism

The basic idea is simple: before each sampling step, compute which tokens are valid continuations according to the grammar, then mask all other tokens to negative infinity so they have zero probability.

  Standard decode:
  logits --> [sample] --> token

  Grammar-constrained decode:
  logits --> [mask invalid] --> [sample] --> token --> [advance grammar]

The tricky part is doing this efficiently. With a 128k-token vocabulary and a grammar of hundreds of rules, we need to check 128k tokens against the grammar state at every step. A naive implementation can easily cost more than the model forward pass itself.

GBNF: Grammars for LLMs

Akunu uses GBNF (GGML BNF), a variant of Backus-Naur Form adapted for LLM token-level constraint. Here is what a simple JSON grammar looks like in GBNF:

root     ::= object | array
value    ::= object | array | string | number | "true" | "false" | "null"
object   ::= "{" ws (pair ("," ws pair)*)? ws "}"
pair     ::= string ws ":" ws value
array    ::= "[" ws (value ("," ws value)*)? ws "]"
string   ::= "\"" chars "\""
chars    ::= char*
char     ::= [^"\\] | "\\" escape
escape   ::= ["\\bfnrt/] | "u" [0-9a-fA-F]{4}
number   ::= "-"? int frac? exp?
int      ::= "0" | [1-9] [0-9]*
frac     ::= "." [0-9]+
exp      ::= [eE] [+-]? [0-9]+
ws       ::= [ \t\n\r]*

Each line defines a rule. Rules can reference other rules (non-terminals) or specify literal characters and character classes (terminals). The root rule is the entry point.

GBNF supports:

  • Quoted literals: "true", "{", "\\n"
  • Character classes: [0-9], [a-fA-F], [^"\\]
  • Alternation: |
  • Repetition: * (zero or more), + (one or more), ? (optional), {n} (exactly n, as in the escape rule above)
  • Rule references: object, value, string

Akunu’s GBNF parser (grammar.cpp) is a complete recursive-descent parser that converts the textual grammar into a vector of GrammarRule objects. Each rule is a sequence of GrammarElement structs:

enum GrammarElementType : uint32_t {
    GTYPE_END       = 0,  // end of rule
    GTYPE_ALT       = 1,  // alternate (|)
    GTYPE_RULE_REF  = 2,  // non-terminal reference
    GTYPE_CHAR      = 3,  // character match
    GTYPE_CHAR_NOT  = 4,  // inverse class [^...]
    GTYPE_CHAR_RNG_UPR = 5,  // range upper bound
    GTYPE_CHAR_ALT  = 6,  // additional char in class
    GTYPE_CHAR_ANY  = 7,  // wildcard .
};

struct GrammarElement {
    GrammarElementType type;
    uint32_t value;  // code point, rule ID, etc.
};

The parser handles Unicode escapes (\u0041), hex escapes (\x41), and standard C escapes (\n, \t, etc.). Rule names are alphanumeric with hyphens and underscores.

The Legacy NPDA Engine

Akunu’s original grammar engine (before XGrammar) uses a nondeterministic pushdown automaton (NPDA) to track the grammar state. This is the Grammar class in grammar.h.

The NPDA state is a set of stacks, where each stack is a sequence of pointers into the grammar rules. The top of each stack points to the element the grammar expects to match next. Multiple stacks represent the nondeterminism – when a rule has alternation (A | B), both alternatives produce separate stacks.

  Grammar stacks (simplified for "object" rule):

  Stack 0: [char("{"), ws, pair, ...]
  Stack 1: [char("{"), ws, char("}")]

  Stack 0 expects the object with at least one pair.
  Stack 1 expects the empty object "{}".

When the apply() method is called with the logit array, it must:

  1. For each token in the vocabulary, decode the token to Unicode code points
  2. For each stack, simulate accepting those code points
  3. If any stack can accept the full token, the token is valid
  4. If no stack can accept the token, mask its logit to -inf

This is O(V * S * L) where V is vocab size, S is number of stacks, and L is average token length in code points. For V=128k and complex grammars, this becomes very expensive – often 10-50ms per token, which can dominate the total decode time.

The legacy engine also supports a deferred activation mode for thinking models. When set_trigger_text() is called with (for example) </think>, the grammar does not constrain output until that trigger text appears. This allows models like Qwen3 to emit their <think>...</think> block freely, then switch to grammar-constrained generation for the actual structured output:

  Generation flow with deferred activation:

  <think>Let me think about the JSON schema...</think>
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  Grammar inactive (unconstrained)              |
                                                v
  {"name": "Alice", "age": 30}
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  Grammar active (constrained to JSON)

XGrammar: The Fast Path

The NPDA approach works but is slow. XGrammar (integrated as a third-party library) takes a fundamentally different approach: precompute everything at grammar compile time.

The key insight is that for a given grammar state (which rules/positions are active), the set of valid tokens is deterministic. XGrammar precomputes these valid-token sets as packed bitmasks during grammar compilation; at runtime, looking up the bitmask for the current state is essentially O(1).

  NPDA approach (per token):
  +--------------------------------------------------+
  | For each token (128k):                           |
  |   Decode token to code points                    |
  |   For each stack (10-100):                       |
  |     Try accepting code points                    |
  |     If success: token is valid                   |
  +--------------------------------------------------+
  Cost: O(V * S * L) per token step  (~10-50ms)

  XGrammar approach (per token):
  +--------------------------------------------------+
  | Look up precomputed bitmask for current state    |
  | Copy bitmask to GPU                              |
  | Apply bitmask to logits (GPU kernel)             |
  +--------------------------------------------------+
  Cost: O(1) lookup + O(V/32) GPU mask  (~0.05ms)

The XGrammar wrapper (xgrammar_wrapper.h and .cpp) provides a clean interface:

class XGrammarEngine {
public:
    void init_vocab(const Tokenizer& tokenizer);

    bool compile_json_schema(const std::string& schema);
    bool compile_grammar(const std::string& gbnf);
    bool compile_builtin_json();

    bool fill_next_token_bitmask(int32_t *bitmask);
    bool accept_token(int32_t token_id);
    void rollback(int n_tokens);

    bool is_terminated() const;
    bool is_completed() const;
    void reset();

    int bitmask_size() const;
};

The bitmask format is packed: one bit per token, 32 tokens per int32 word. For a 128k vocabulary, the bitmask is 128k / 32 = 4096 int32 words = 16 KB. This is small enough to copy to the GPU every token step with negligible overhead.

  Bitmask layout:  int32[ceil(vocab_size / 32)]

  Word 0:  bits 0-31   = tokens 0-31
  Word 1:  bits 0-31   = tokens 32-63
  Word 2:  bits 0-31   = tokens 64-95
  ...
  Word N:  bits 0-31   = tokens N*32 .. N*32+31

  bit = 1  -->  token is valid (keep logit)
  bit = 0  -->  token is invalid (set logit to -inf)

Vocabulary Initialization

Before compiling any grammar, XGrammar needs the tokenizer’s vocabulary. init_vocab() decodes every token in the vocabulary to its string representation:

void XGrammarEngine::init_vocab(const Tokenizer& tokenizer) {
    vocab_size_ = tokenizer.vocab_size();
    std::vector<std::string> encoded_vocab;
    encoded_vocab.reserve(vocab_size_);
    for (int i = 0; i < vocab_size_; i++) {
        encoded_vocab.push_back(tokenizer.decode((uint32_t)i));
    }
    auto vocab_type = xgrammar::VocabType::BYTE_FALLBACK;
    // ... build TokenizerInfo and GrammarCompiler ...
}

This is a one-time cost that happens during model loading. The decoded strings are what XGrammar uses to determine which tokens can match which grammar positions.

Grammar Compilation

XGrammar supports three compilation modes:

  1. JSON Schema (compile_json_schema): Takes a JSON Schema string and compiles it directly into a grammar matcher. This is the most common path for structured output.

  2. GBNF grammar (compile_grammar): Takes a raw GBNF string. Useful for custom grammars (SQL, code templates, etc.).

  3. Built-in JSON (compile_builtin_json): A pre-optimized grammar for “any valid JSON”. Used when the user requests JSON output without a specific schema.

Each compilation produces a CompiledGrammar object and a GrammarMatcher. The compilation can take 10-100ms (depending on grammar complexity and vocabulary size), but it is a one-time cost amortized over the entire generation.

JSON Schema to GBNF

For the common case of structured JSON output, akunu provides a json_schema_to_grammar() converter that translates a JSON Schema into GBNF.

Here is an example. Given this JSON Schema:

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer" },
    "email": { "type": "string", "format": "email" }
  },
  "required": ["name", "age"]
}

The converter produces GBNF roughly equivalent to:

root ::= "{" ws root-kv-name "," ws root-kv-age ("," ws root-kv-email)? ws "}"
root-kv-name  ::= "\"name\"" ws ":" ws json-string
root-kv-age   ::= "\"age\"" ws ":" ws json-int
root-kv-email ::= "\"email\"" ws ":" ws json-string
ws ::= [ \t\n\r]*
json-string ::= "\"" json-chars "\""
json-int    ::= "0" | [1-9] [0-9]*
...

The converter handles:

  • Object properties with required/optional distinction
  • Arrays with items, minItems, maxItems
  • Strings with format (date, time, date-time, uuid), minLength, maxLength
  • Numbers and integers
  • Enums and const values
  • oneOf, anyOf, allOf composition
  • Nested objects and recursive structures (with a depth limit of 64)
  • additionalProperties control

The SchemaConverter class builds rules incrementally, generating unique rule names to avoid collisions. Each schema node is visited recursively, producing either an inline GBNF expression or a named rule reference:

  visit(schema, "root")
    |
    +--> visit_object(schema, "root")
    |      |
    |      +--> visit(name_schema, "root-name")
    |      |      +--> returns "json-string"
    |      |
    |      +--> visit(age_schema, "root-age")
    |      |      +--> returns "json-int"
    |      |
    |      +--> visit(email_schema, "root-email")
    |             +--> returns "json-string"
    |
    +--> add_rule("root", "{" ws kv-name "," ws kv-age ("," ws kv-email)? "}")

The Grammar Bitmask GPU Kernel

Once we have the bitmask from XGrammar, we need to apply it to the logits. This is done by a simple Metal kernel:

kernel void grammar_bitmask_f16(
    device half *logits       [[buffer(0)]],
    device const uint *bitmask [[buffer(1)]],
    constant uint &vocab_size  [[buffer(2)]],
    uint tid [[thread_position_in_grid]])
{
    if (tid >= vocab_size)
        return;
    uint word = bitmask[tid / 32];
    bool allowed = (word >> (tid % 32)) & 1;
    if (!allowed)
        logits[tid] = half(-INFINITY);
}

Each thread handles one token. It reads the appropriate word from the bitmask, checks the bit for its token, and sets the logit to negative infinity if the token is not allowed. The kernel dispatches with vocab_size threads and a threadgroup size of 256.

For a 128k vocabulary:

  • 128k threads / 256 per threadgroup = 512 threadgroups
  • Each thread does: one integer division, one shift, one AND, one conditional write
  • Total: ~microseconds on Apple Silicon

This is negligible compared to the model forward pass.

The Grammar Decode Loop

Let us trace through decode_grammar() step by step. This is the most complex decode path in akunu because it orchestrates multiple GPU kernels per token.

Setup

bool use_sampling = (sampling.temperature > 0.0f);
uint32_t vocab_count = (uint32_t)state.config.vocab_size;

Pipeline temp_pso = use_sampling
    ? state.device->get_pipeline("temperature_scale_f16") : Pipeline{};
Pipeline mask_pso = state.device->get_pipeline("grammar_bitmask_f16");
Pipeline argmax_pso = state.device->get_pipeline("argmax_f16");

The function fetches three GPU pipelines:

  1. temperature_scale_f16: Multiplies logits by 1/T (only if sampling)
  2. grammar_bitmask_f16: Applies the grammar bitmask
  3. argmax_f16: Finds the token with the highest logit

If repetition penalty is enabled, a fourth pipeline is fetched:

  4. repetition_penalty_f16: Penalizes previously generated tokens

The bitmask buffer is allocated once:

int bitmask_words = xg.bitmask_size();
size_t mask_bytes = bitmask_words * sizeof(int32_t);
std::vector<int32_t> bitmask(bitmask_words);
Buffer mask_buf = state.device->allocate(mask_bytes);

// Precompute first mask before entering the loop
xg.fill_next_token_bitmask(bitmask.data());

Per-token pipeline

Each iteration of the decode loop runs this pipeline:

  Step 1: Copy bitmask to GPU
  +-------------------------------------------+
  | memcpy(gpu_mask_buf, bitmask, mask_bytes)  |
  +-------------------------------------------+

  Step 2: Write current token to GPU
  +-------------------------------------------+
  | write_buffer(token_ids, &next_token, 4)    |
  +-------------------------------------------+

  Step 3: Begin GPU encoding
  +-------------------------------------------+
  | encode_dispatch_table (model forward)      |
  |   --> produces logits in scratch.logits    |
  +-------------------------------------------+

  Step 4 (optional): Temperature scaling
  +-------------------------------------------+
  | temperature_scale_f16 kernel               |
  | logits[i] *= inv_temp   for all i          |
  +-------------------------------------------+

  Step 5 (optional): Repetition penalty
  +-------------------------------------------+
  | repetition_penalty_f16 kernel              |
  | For each prev token: scale its logit       |
  +-------------------------------------------+

  Step 6: Apply grammar mask
  +-------------------------------------------+
  | grammar_bitmask_f16 kernel                 |
  | logits[i] = -inf  where bitmask bit is 0  |
  +-------------------------------------------+

  Step 7a (greedy): GPU argmax
  +-------------------------------------------+
  | argmax_f16 kernel --> winning token ID     |
  +-------------------------------------------+

  Step 7b (sampling): CPU sampling
  +-------------------------------------------+
  | Copy F16 logits to CPU                     |
  | Convert to F32                             |
  | sample_logits() with top-k/top-p/min-p    |
  +-------------------------------------------+

  Step 8: Accept token in grammar
  +-------------------------------------------+
  | xg.accept_token(tok)                       |
  +-------------------------------------------+

  Step 9: Compute next bitmask (for next iter)
  +-------------------------------------------+
  | xg.fill_next_token_bitmask(bitmask)        |
  +-------------------------------------------+

Steps 3-7 are all encoded into a single GPU command buffer and execute without CPU intervention. Step 7b (CPU sampling) requires a sync point – this is the slower path, used when sampling is enabled but the GPU Gumbel-max kernel is not available for grammar-constrained decode.

Note the ordering: the bitmask for the next token is computed at the end of each iteration (step 9), then applied at the beginning of the next iteration (step 6). This overlaps the CPU-side bitmask computation with GPU execution of the next forward pass – a simple form of pipelining.

Termination conditions

The loop terminates when:

if (state.tokenizer.is_eos(tok))    // Model produced EOS
    break;
if (xg.is_terminated())             // Grammar accepted stop token
    break;
if (xg.is_completed())              // Root rule fully matched
    break;

The grammar can signal completion in two ways:

  • Terminated: The grammar accepted a stop token and is done
  • Completed: The root rule has been fully matched (all required output has been generated)

Resource cleanup

The function manually frees GPU buffers at the end:

if (argmax_buf.handle)
    state.device->free_buffer(argmax_buf);
if (mask_buf.handle)
    state.device->free_buffer(mask_buf);
if (rep_tok_buf.handle)
    state.device->free_buffer(rep_tok_buf);

These are per-call allocations (the bitmask buffer, argmax output buffer, and repetition penalty token buffer) that are not part of the model’s persistent state.

The GrammarHandle: Dual Engine Support

Akunu supports both the legacy NPDA engine and XGrammar through a unified GrammarHandle struct:

struct GrammarHandle {
    Grammar legacy;              // Old NPDA engine
#ifdef AKUNU_HAS_XGRAMMAR
    XGrammarEngine xgrammar;     // New XGrammar engine
    bool use_xgrammar = false;
#endif
};

The decode_grammar() function uses the XGrammar path when available (guarded by #ifdef AKUNU_HAS_XGRAMMAR). The legacy engine is still available as a fallback for builds without the XGrammar submodule.

This dual-engine design means:

  • Minimal builds (no submodules) still get grammar support via the NPDA engine
  • Full builds get the ~100x faster XGrammar path
  • The decode loop code is shared; only the mask computation differs

Practical Usage Patterns

JSON output

The most common use case. The user specifies a JSON Schema, akunu converts it to GBNF (or passes it directly to XGrammar), and the output is guaranteed to be valid JSON matching the schema:

  User schema:    { "type": "object", "properties": { "x": {"type": "number"} } }
  Generated:      {"x": 42.5}           <-- always valid JSON
  Never:          {"x": 42.5,}          <-- no trailing commas
  Never:          {"x": forty-two}      <-- no unquoted values
  Never:          Here's the JSON: ...  <-- no prose

SQL generation

Custom GBNF grammars can constrain output to valid SQL:

root        ::= select-stmt
select-stmt ::= "SELECT " columns " FROM " table where? ";"
columns     ::= column ("," column)*
column      ::= [a-zA-Z_] [a-zA-Z0-9_]*
table       ::= [a-zA-Z_] [a-zA-Z0-9_]*
where       ::= " WHERE " condition
...

Code generation

Grammar constraints can enforce syntactic validity for code output – matching braces, proper indentation markers, valid identifier characters. This is especially useful for generating configuration files (YAML, TOML) where a single syntax error makes the output unusable.

Performance Characteristics

The performance of grammar-constrained decode depends on which engine is used:

  Operation            | NPDA engine | XGrammar
  ---------------------+-------------+----------
  Grammar compile      | ~1ms        | 10-100ms
  Per-token mask (CPU) | 10-50ms     | 0.01ms
  Per-token mask (GPU) | N/A         | 0.005ms
  Mask copy to GPU     | N/A         | ~0.01ms
  Bitmask size (128k)  | N/A         | 16 KB
  Memory overhead      | ~10 KB      | ~1 MB (compiled cache)

XGrammar’s higher compile time is amortized over the entire generation. For a 100-token output, the per-token savings of ~10ms add up to ~1 second saved, which dwarfs the 10-100ms compile cost.

The GPU bitmask kernel is so fast (microseconds) that it is essentially free relative to the model forward pass (~5-10ms). The bottleneck in grammar decode is the model itself, not the grammar constraint.

Summary

Grammar-constrained generation transforms LLMs from “usually produces valid output” to “always produces valid output.” Akunu implements this through a dual-engine approach: a legacy NPDA engine for builds without dependencies, and the XGrammar library for production use. XGrammar precomputes token validity bitmasks at compile time, making per-token constraint nearly free. The GPU kernel applies the bitmask in microseconds, and the decode loop orchestrates model forward pass, temperature scaling, repetition penalty, grammar masking, and argmax/sampling into a single GPU command buffer per token. JSON schema conversion to GBNF handles the most common use case automatically.