The Tokenizer

Before a single GPU kernel fires, before attention scores are computed, before any matrix is multiplied – text must become numbers. The tokenizer is the bridge between the world of strings and the world of tensors. It takes a prompt like “Hello, world!” and produces a sequence of integer token IDs like [9906, 11, 1917, 0]. These IDs index into an embedding table to produce the model’s initial hidden states.

Akunu implements its tokenizer in pure C++ with no dependencies on Python, SentencePiece libraries, or Hugging Face transformers. It supports three tokenization algorithms – SentencePiece BPE (LLaMA, Qwen), GPT-2 byte-level BPE (Mistral, Phi), and WordPiece (BERT) – all in about 660 lines of code.

This chapter covers the BPE algorithm in detail, walks through the three variant implementations, explains GPT-2’s bizarre byte-to-unicode mapping, and describes the incremental decode system for real-time streaming.

What is BPE?

Byte Pair Encoding (BPE) is a subword tokenization algorithm. It does not split text into words or characters, but into subword units that balance vocabulary size with encoding efficiency.

The core idea: start with individual characters (or bytes), then iteratively merge the most frequent adjacent pair into a new token. After training, you have a vocabulary of subword units and a set of merge rules that define which pairs to combine and in what order.

Here is BPE encoding in pseudocode:

function bpe_encode(text, merges, vocab):
    tokens = split text into individual characters/bytes
    while tokens has at least 2 elements:
        find the adjacent pair with the highest merge priority
        if no valid merge exists: break
        merge that pair into a single token
    return [vocab[t] for t in tokens]

The critical difference between BPE variants is:

  1. What are the initial units? (characters, bytes, byte-mapped Unicode)
  2. How are merge priorities defined? (scores, ranks, longest-match)
  3. How are spaces handled? (space prefix, byte encoding, no special handling)

SentencePiece BPE (LLaMA, Qwen)

SentencePiece BPE, used by LLaMA and Qwen models, works on Unicode characters. Spaces are replaced with the special marker U+2581 (LOWER ONE EIGHTH BLOCK, ▁). Each token in the vocabulary has a float score – higher scores are merged first.

Encoding Algorithm

std::vector<uint32_t> Tokenizer::bpe_sentencepiece(const std::string& text) const {
    // 1. Prepend space marker and replace spaces
    std::string processed;
    if (add_space_prefix_) processed = SP_SPACE;  // ▁
    for (char c : text) {
        if (c == ' ') processed += SP_SPACE;
        else processed += c;
    }

    // 2. Split into individual UTF-8 characters
    std::vector<std::string> tokens = utf8_chars(processed);

    // 3. Iteratively merge highest-score pair
    while (tokens.size() >= 2) {
        float best_score = -INFINITY;  // from <cmath>
        int best_idx = -1;
        for (int i = 0; i < tokens.size() - 1; i++) {
            std::string merged = tokens[i] + tokens[i+1];
            auto it = token_to_id_.find(merged);
            if (it != token_to_id_.end()) {
                float score = scores_[it->second];
                if (score > best_score) {
                    best_score = score;
                    best_idx = i;
                }
            }
        }
        if (best_idx < 0) break;
        tokens[best_idx] = tokens[best_idx] + tokens[best_idx + 1];
        tokens.erase(tokens.begin() + best_idx + 1);
    }

    // 4. Convert to IDs (with byte fallback for unknown chars)
    // ...
}

Let us trace through an example:

Input: "Hello"
After space prefix: "▁Hello"
Split into chars: ["▁", "H", "e", "l", "l", "o"]

Round 1: Best merge is "l" + "l" -> "ll" (highest score)
  ["▁", "H", "e", "ll", "o"]

Round 2: Best merge is "ll" + "o" -> "llo"
  ["▁", "H", "e", "llo"]

Round 3: Best merge is "e" + "llo" -> "ello"
  ["▁", "H", "ello"]

Round 4: Best merge is "H" + "ello" -> "Hello"
  ["▁", "Hello"]

Round 5: Best merge is "▁" + "Hello" -> "▁Hello"
  ["▁Hello"]

Final: token_to_id["▁Hello"] = 15043
Output: [15043]

Byte Fallback

When a character sequence is not in the vocabulary, SentencePiece falls back to encoding individual bytes as hex tokens: <0x41> for byte 0x41 (‘A’). This ensures any input can be encoded, even if it contains rare Unicode characters or binary data.

// Byte fallback for unknown chars (t is the unmatched piece's UTF-8 string)
for (uint8_t byte : t) {
    char hex[16];
    snprintf(hex, sizeof(hex), "<0x%02X>", byte);
    auto hit = token_to_id_.find(hex);
    if (hit != token_to_id_.end())
        ids.push_back(hit->second);
}

Decoding SentencePiece

Decoding is the reverse: look up each token ID in the vocabulary, replace \u2581 with a space, and handle byte fallback tokens:

if (is_sentencepiece_) {
    int byte_val = parse_byte_token(token);  // parse "<0xAB>" -> 0xAB
    if (byte_val >= 0)
        return std::string(1, (char)byte_val);
    // Replace ▁ with space
    // ... scan for 0xE2 0x96 0x81 (UTF-8 for ▁) ...
}
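The parse_byte_token helper referenced in this snippet can be implemented as a strict "<0xHH>" parser. This exact signature is an assumption about Akunu's internals, but the contract (parse "<0xAB>" to 0xAB, return -1 otherwise) follows from the comment above:

```cpp
#include <cstdio>
#include <string>

// Parse "<0xAB>" -> 0xAB; return -1 if the token is not a byte-fallback token.
int parse_byte_token(const std::string& token) {
    unsigned int value;
    char trailing;
    // Expect exactly "<0xHH>" -- the trailing %c catches extra characters
    if (token.size() == 6 &&
        sscanf(token.c_str(), "<0x%02X>%c", &value, &trailing) == 1)
        return (int)value;
    return -1;
}
```

The length check plus the trailing `%c` rejects look-alikes such as "<0xAB>x" or ordinary tokens like "</s>".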

GPT-2 Byte-Level BPE (Mistral, Phi)

GPT-2’s BPE takes a fundamentally different approach. Instead of operating on Unicode characters, it operates on bytes, but with a twist: each byte is mapped to a printable Unicode character. This avoids control characters in the vocabulary while still being byte-complete.

The Byte-to-Unicode Mapping

This is the most confusing part of GPT-2 tokenization, so let us be very precise.

The mapping assigns each of the 256 possible byte values to a Unicode codepoint:

Printable bytes (0x21-0x7E and 0xA1-0xFF):
  Map to themselves.
  0x41 ('A') -> U+0041 ('A')
  0x61 ('a') -> U+0061 ('a')
  0xC0       -> U+00C0 ('A' with grave)

Non-printable bytes (0x00-0x20, 0x7F-0xA0):
  Map to U+0100 + offset, sequentially.
  0x00 -> U+0100 ('A' with macron)
  0x01 -> U+0101 ('a' with macron)
  ...
  0x20 -> U+0120 ('G' with dot above)
  0x7F -> U+0121 ('g' with dot above)
  ...

Here is a visual:

Byte value:   0x00..0x20     0x21..0x7E   0x7F..0xA0     0xA1..0xFF
              non-printable  printable    non-printable  printable
Maps to:      U+100..U+120   U+21..U+7E   U+121..U+142   U+A1..U+FF
              offset block   identity     offset block   identity

Akunu builds this mapping at construction time:

static void build_byte_to_unicode(std::string table[256]) {
    int offset = 0;
    for (int b = 0; b < 256; b++) {
        if ((b >= 0x21 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFF)) {
            table[b] = codepoint_to_utf8(b);  // identity
        } else {
            table[b] = codepoint_to_utf8(0x100 + offset);
            offset++;
        }
    }
}

GPT-2 Encoding

With the byte mapping in hand, GPT-2 BPE encoding works like this:

  1. Split the input on special tokens (longest-match greedy)
  2. For each non-special segment, convert bytes to GPT-2 Unicode
  3. Split into individual Unicode characters
  4. Apply BPE merges (lowest rank first, not highest score)

std::vector<uint32_t> Tokenizer::bpe_gpt2(const std::string& text) const {
    std::vector<uint32_t> ids;
    auto segments = split_special_tokens(text);

    for (auto& segment : segments) {
        if (is_special_token(segment)) {
            ids.push_back(token_to_id_.at(segment));  // at(): method is const
            continue;
        }

        // Convert bytes to GPT-2 Unicode
        std::string converted;
        for (uint8_t byte : segment)
            converted += byte_to_unicode_[byte];

        // Split into Unicode characters
        auto tokens = utf8_chars(converted);

        // BPE merge: lowest rank first
        while (tokens.size() >= 2) {
            int best_rank = INT_MAX;
            int best_idx = -1;
            for (int i = 0; i < tokens.size() - 1; i++) {
                std::string pair = tokens[i] + " " + tokens[i+1];
                auto it = merge_ranks_.find(pair);
                if (it != merge_ranks_.end() && it->second < best_rank) {
                    best_rank = it->second;
                    best_idx = i;
                }
            }
            if (best_idx < 0) break;
            tokens[best_idx] += tokens[best_idx + 1];
            tokens.erase(tokens.begin() + best_idx + 1);
        }
        // Convert to IDs ...
    }
    return ids;
}

The key difference from SentencePiece: merge rules use ranks (lower is better) stored as "token_a token_b" -> rank_int, while SentencePiece uses per-token scores (higher is better).

Special Token Splitting

Before BPE, GPT-2 tokenizers must split the input on special tokens like <|im_start|>, <|im_end|>, <|endoftext|>, etc. These are matched greedily (longest first) and passed through as-is:

Input: "<|im_start|>system\nHello<|im_end|>"

Split: ["<|im_start|>", "system\nHello", "<|im_end|>"]
        ^ special         ^ BPE encoded    ^ special

Akunu marks special tokens with a \x01 prefix byte during splitting so the encoder can distinguish them from regular text without additional data structures.

GPT-2 Decoding

Decoding reverses the byte mapping: for each Unicode codepoint in the token string, look up the corresponding byte value:

// GPT-2 decode: reverse byte-to-unicode mapping
std::string bytes;
size_t pos = 0;
while (pos < token.size()) {
    uint32_t cp = decode_utf8_cp(token, pos);  // advances pos past the char
    auto it = unicode_to_byte_.find(cp);
    if (it != unicode_to_byte_.end())
        bytes += (char)it->second;
    else
        bytes += codepoint_to_utf8(cp);  // pass through
}
return bytes;

WordPiece (BERT)

The third algorithm, used by BERT and embedding models like nomic-embed, is WordPiece. Unlike BPE, WordPiece does not use merges at all – it uses greedy longest-prefix matching.

Algorithm:
  1. Lowercase input (for uncased BERT)
  2. Split on whitespace and punctuation
  3. For each word:
     a. Try to find the longest prefix in the vocab
     b. If found, emit it and continue from where it ended
     c. For continuation pieces, try "##" + prefix (WordPiece convention)
     d. If no prefix matches, emit [UNK]

Example:

Input: "unaffable"
  Try "unaffable" -> not in vocab
  Try "unaffabl"  -> not in vocab
  ...
  Try "un"        -> in vocab! Emit token for "un"

  Remaining: "affable"
  Try "##affable" -> not in vocab
  Try "##affabl"  -> not in vocab
  ...
  Try "##aff"     -> in vocab! Emit token for "##aff"

  Remaining: "able"
  Try "##able"    -> in vocab! Emit token for "##able"

Output: ["un", "##aff", "##able"]

Akunu’s implementation adds [SEP] at the end (BERT convention) and handles the nomic-bert variant where continuation tokens may not use the ## prefix.

The Tokenizer Construction Pipeline

Akunu supports two paths for loading a tokenizer:

Path 1: From GGUF Metadata

When loading a GGUF model file, the tokenizer vocabulary, scores, merges, and special token IDs are embedded in the file’s metadata:

Tokenizer tok = Tokenizer::from_gguf(
    vocab,           // vector<string>: token strings
    scores,          // vector<float>: per-token scores (SentencePiece)
    merges,          // vector<string>: merge rules (GPT-2)
    bos_id, eos_id,  // special token IDs
    model_type,      // "llama" or "gpt2" or "bert"
    add_bos,         // prepend BOS token?
    add_space_prefix // prepend ▁ in SentencePiece mode?
);

The constructor builds the token_to_id_ reverse lookup map, the merge ranks table (for GPT-2), and the byte-to-unicode mapping tables.

Path 2: From HuggingFace tokenizer.json

For models distributed as HF repos (without GGUF), Akunu has a hand-rolled JSON parser in hf_tokenizer_loader.h that reads tokenizer.json and tokenizer_config.json:

HFTokenizerData data;
bool ok = load_hf_tokenizer("/path/to/model/dir", data);
// data.vocab, data.merges, data.bos_id, data.eos_id, etc.

This parser is deliberately minimal – no JSON library dependency. It uses character-level scanning with find_key(), parse_json_string(), and parse_json_int() to extract exactly the fields it needs:

  • model.vocab: the token-to-ID mapping
  • model.merges: the BPE merge rules
  • added_tokens: special tokens with their IDs
  • pre_tokenizer.type: detects Gemma’s “Split” pre-tokenizer (no space prefix)

The parser also reads tokenizer_config.json for add_bos_token, bos_token, and eos_token fields, handling the various ways different model families specify these values.

Incremental Decode: UTF-8 Buffering for Streaming

When streaming tokens to the user in real time, there is a subtle problem: individual tokens may produce partial UTF-8 sequences. A multi-byte character like a Chinese character or an emoji might be split across two tokens.

Consider: the UTF-8 encoding of a Chinese character is 3 bytes. If token A produces bytes [0xE4, 0xBD] and token B produces byte [0xA0], you cannot emit token A’s output yet – those 2 bytes are not a valid UTF-8 string. You must buffer them and wait for the completing byte.

Akunu’s decode_incremental() handles this:

std::string Tokenizer::decode_incremental(uint32_t token_id) {
    std::string raw = decode(token_id);
    if (raw.empty()) return "";

    utf8_buf_ += raw;  // append to buffer

    // Scan for complete UTF-8 sequences
    std::string output;
    size_t i = 0, last_good = 0;
    while (i < utf8_buf_.size()) {
        int expected = utf8_expected_len(utf8_buf_[i]);
        if (expected == 0) {
            // Orphan continuation byte -- emit as-is
            output += utf8_buf_[i]; i++; last_good = i;
            continue;
        }
        if (i + expected > utf8_buf_.size())
            break;  // incomplete at end -- keep buffered
        output += utf8_buf_.substr(i, expected);
        i += expected; last_good = i;
    }

    // Keep only the incomplete tail
    utf8_buf_ = (last_good < utf8_buf_.size())
        ? utf8_buf_.substr(last_good)
        : "";
    return output;
}

The flow:

Token 1 decoded: [0xE4, 0xBD]
  Buffer: [0xE4, 0xBD]
  Expected: 3 bytes (0xE4 = 3-byte leader)
  Only 2 available -- keep buffered
  Output: ""

Token 2 decoded: [0xA0, 0x48, 0x65]
  Buffer: [0xE4, 0xBD, 0xA0, 0x48, 0x65]
  Sequence 1: [0xE4, 0xBD, 0xA0] = complete 3-byte UTF-8 = "你"
  Sequence 2: [0x48] = ASCII 'H'
  Sequence 3: [0x65] = ASCII 'e'
  Output: "你He"
  Buffer: empty

The state variable utf8_buf_ persists between calls and must be reset between conversations via reset_incremental().

Thinking Block Stripping

Some models (like Qwen3) emit <think>...</think> blocks during chain-of-thought reasoning. When streaming to the user, you typically want to strip these:

Model output: "<think>The user asked about France. Paris is the capital.</think>The capital of France is Paris."

After stripping: "The capital of France is Paris."

The strip_thinking() method implements a streaming-aware state machine:

States:
  THINK_IDLE:   normal output mode, scanning for <think>
  THINK_INSIDE: inside a thinking block, scanning for </think>

Edge cases handled:
  - Partial tag at chunk boundary: "<thi" + "nk>" across two calls
  - Nested angle brackets that are NOT think tags
  - Think blocks spanning multiple streaming chunks

The implementation buffers partial tag matches in think_buf_ and uses the think_state_ enum to track whether we are inside a thinking block.

Extra EOS Token IDs

Different model families use different end-of-sequence tokens. LLaMA 3 uses <|end_of_text|>, ChatML models use <|im_end|>, and older models use </s>. Some models have multiple valid EOS tokens.

Akunu handles this with add_eos_id():

void add_eos_id(uint32_t id) { extra_eos_ids_.push_back(id); }

bool is_eos(uint32_t id) const {
    if (id == eos_id_) return true;
    for (auto eid : extra_eos_ids_)
        if (id == eid) return true;
    return false;
}

The GGUF loader and HF tokenizer loader register the appropriate EOS tokens for each model architecture.

Performance Considerations

Akunu’s tokenizer is intentionally simple and not optimized for speed. The BPE inner loop is O(n^2) in the number of character tokens (scan all pairs, merge one, repeat). For typical prompts of a few hundred to a few thousand tokens, this takes microseconds – negligible compared to the milliseconds of GPU prefill.

Where tokenizer performance does matter is in the vocabulary lookup. The token_to_id_ map is an unordered_map<string, uint32_t> with ~100K-200K entries. Each find() call hashes a string and does a comparison. For the BPE inner loop, this is called O(n) times per merge round, for O(n) rounds, giving O(n^2) hash lookups total. For n=1000 characters, that is 1M lookups – still fast in practice but worth being aware of.

Summary

+----------------+-------------------+-------------------+-------------------+
| Feature        | SentencePiece     | GPT-2 BPE         | WordPiece         |
+----------------+-------------------+-------------------+-------------------+
| Models         | LLaMA, Qwen       | Mistral, Phi      | BERT              |
| Input units    | UTF-8 chars       | Bytes (mapped)    | Chars             |
| Merge priority | Score (higher     | Rank (lower       | N/A (longest      |
|                | is better)        | is better)        | prefix wins)      |
| Space handling | Replace with ▁    | Byte-encode 0x20  | Split on spaces   |
| Special tokens | <s>, </s>         | <|...|> patterns  | [CLS], [SEP]      |
| Unknown chars  | <0xXX> fallback   | Always encodable  | [UNK]             |
+----------------+-------------------+-------------------+-------------------+

The tokenizer is a deceptively simple component – the BPE algorithm is straightforward, but the details (byte mappings, space handling, special token splitting, UTF-8 buffering) are where the bugs hide. Akunu’s implementation handles all three major tokenizer families in a single class, loaded from either GGUF metadata or HuggingFace JSON, with streaming-aware incremental decode.