The Tokenizer
Before a single GPU kernel fires, before attention scores are computed, before any
matrix is multiplied – text must become numbers. The tokenizer is the bridge
between the world of strings and the world of tensors. It takes a prompt like
“Hello, world!” and produces a sequence of integer token IDs like [9906, 11, 1917, 0].
These IDs index into an embedding table to produce the model’s initial hidden
states.
Akunu implements its tokenizer in pure C++ with no dependencies on Python, SentencePiece libraries, or Hugging Face transformers. It supports three tokenization algorithms – SentencePiece BPE (LLaMA, Qwen), GPT-2 byte-level BPE (Mistral, Phi), and WordPiece (BERT) – all in about 660 lines of code.
This chapter covers the BPE algorithm in detail, walks through the three variant implementations, explains GPT-2’s bizarre byte-to-unicode mapping, and describes the incremental decode system for real-time streaming.
What is BPE?
Byte Pair Encoding (BPE) is a subword tokenization algorithm. It does not split text into words or characters, but into subword units that balance vocabulary size with encoding efficiency.
The core idea: start with individual characters (or bytes), then iteratively merge the most frequent adjacent pair into a new token. After training, you have a vocabulary of subword units and a set of merge rules that define which pairs to combine and in what order.
Here is BPE encoding in pseudocode:
function bpe_encode(text, merges, vocab):
    tokens = split text into individual characters/bytes
    while tokens has at least 2 elements:
        find the adjacent pair with the highest merge priority
        if no valid merge exists: break
        merge that pair into a single token
    return [vocab[t] for t in tokens]
The critical difference between BPE variants is:
- What are the initial units? (characters, bytes, byte-mapped Unicode)
- How are merge priorities defined? (scores, ranks, longest-match)
- How are spaces handled? (space prefix, byte encoding, no special handling)
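To make the pseudocode concrete, here is a minimal, self-contained sketch of the merge loop using rank-based priority (lower rank merges first, the GPT-2 scheme described later). The toy rank table is invented for illustration; real tables come from the model file:

```cpp
#include <climits>
#include <map>
#include <string>
#include <vector>

// Toy BPE: repeatedly merge the adjacent pair with the lowest rank until
// no remaining pair appears in the merge table.
std::vector<std::string> bpe_merge(std::vector<std::string> tokens,
                                   const std::map<std::string, int>& ranks) {
    while (tokens.size() >= 2) {
        int best_rank = INT_MAX, best_idx = -1;
        for (size_t i = 0; i + 1 < tokens.size(); i++) {
            auto it = ranks.find(tokens[i] + " " + tokens[i + 1]);
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_idx = (int)i;
            }
        }
        if (best_idx < 0) break;                      // no mergeable pair left
        tokens[best_idx] += tokens[best_idx + 1];     // combine the pair
        tokens.erase(tokens.begin() + best_idx + 1);  // drop the right half
    }
    return tokens;
}
```

With the ranks {"l l": 0, "ll o": 1, "e llo": 2}, the input ["h", "e", "l", "l", "o"] collapses to ["h", "ello"]: first "l l", then "ll o", then "e llo", at which point no listed pair remains.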
SentencePiece BPE (LLaMA, Qwen)
SentencePiece BPE, used by LLaMA and Qwen models, works on Unicode characters.
Spaces are replaced with the special marker ▁ (U+2581, LOWER ONE EIGHTH BLOCK).
Each token in the vocabulary has a float score – higher scores are merged first.
Encoding Algorithm
std::vector<uint32_t> Tokenizer::bpe_sentencepiece(const std::string& text) const {
    // 1. Prepend space marker and replace spaces
    std::string processed;
    if (add_space_prefix_) processed = SP_SPACE;  // ▁
    for (char c : text) {
        if (c == ' ') processed += SP_SPACE;
        else          processed += c;
    }

    // 2. Split into individual UTF-8 characters
    std::vector<std::string> tokens = utf8_chars(processed);

    // 3. Iteratively merge the highest-score pair
    while (tokens.size() >= 2) {
        float best_score = -std::numeric_limits<float>::infinity();
        int best_idx = -1;
        for (int i = 0; i + 1 < (int)tokens.size(); i++) {
            std::string merged = tokens[i] + tokens[i + 1];
            auto it = token_to_id_.find(merged);
            if (it != token_to_id_.end()) {
                float score = scores_[it->second];
                if (score > best_score) {
                    best_score = score;
                    best_idx = i;
                }
            }
        }
        if (best_idx < 0) break;
        tokens[best_idx] += tokens[best_idx + 1];
        tokens.erase(tokens.begin() + best_idx + 1);
    }

    // 4. Convert to IDs (with byte fallback for unknown chars)
    // ...
}
Let us trace through an example:
Input: "Hello"
After space prefix: "▁Hello"
Split into chars: ["▁", "H", "e", "l", "l", "o"]

Round 1: Best merge is "l" + "l" -> "ll" (highest score)
         ["▁", "H", "e", "ll", "o"]
Round 2: Best merge is "ll" + "o" -> "llo"
         ["▁", "H", "e", "llo"]
Round 3: Best merge is "e" + "llo" -> "ello"
         ["▁", "H", "ello"]
Round 4: Best merge is "H" + "ello" -> "Hello"
         ["▁", "Hello"]
Round 5: Best merge is "▁" + "Hello" -> "▁Hello"
         ["▁Hello"]

Final: token_to_id["▁Hello"] = 15043
Output: [15043]
Byte Fallback
When a character sequence is not in the vocabulary, SentencePiece falls back to
encoding individual bytes as hex tokens: <0x41> for byte 0x41 (‘A’). This
ensures any input can be encoded, even if it contains rare Unicode characters
or binary data.
// Byte fallback for unknown chars: t is the token text that was not
// found in the vocabulary
for (uint8_t byte : t) {
    char hex[16];
    snprintf(hex, sizeof(hex), "<0x%02X>", byte);
    auto hit = token_to_id_.find(hex);
    if (hit != token_to_id_.end())
        ids.push_back(hit->second);
}
Decoding SentencePiece
Decoding is the reverse: look up each token ID in the vocabulary, replace \u2581
with a space, and handle byte fallback tokens:
if (is_sentencepiece_) {
    int byte_val = parse_byte_token(token);  // parse "<0xAB>" -> 0xAB
    if (byte_val >= 0)
        return std::string(1, (char)byte_val);
    // Replace ▁ with space
    // ... scan for 0xE2 0x96 0x81 (UTF-8 for ▁) ...
}
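The parse_byte_token helper is referenced above but not shown. A sketch of what it might look like (the exact implementation may differ): it accepts only the fixed six-character form "<0xXX>" and returns -1 for anything else.

```cpp
#include <string>

// Parse a SentencePiece byte-fallback token like "<0xAB>" into its byte
// value (0-255), or return -1 if the token is not a byte token.
int parse_byte_token(const std::string& token) {
    if (token.size() != 6 || token.compare(0, 3, "<0x") != 0 || token[5] != '>')
        return -1;
    int value = 0;
    for (int i = 3; i < 5; i++) {            // two hex digits
        char c = token[i];
        value <<= 4;
        if      (c >= '0' && c <= '9') value |= c - '0';
        else if (c >= 'A' && c <= 'F') value |= c - 'A' + 10;
        else if (c >= 'a' && c <= 'f') value |= c - 'a' + 10;
        else return -1;                      // not hex: not a byte token
    }
    return value;
}
```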
GPT-2 Byte-Level BPE (Mistral, Phi)
GPT-2’s BPE takes a fundamentally different approach. Instead of operating on Unicode characters, it operates on bytes, but with a twist: each byte is mapped to a printable Unicode character. This avoids control characters in the vocabulary while still being byte-complete.
The Byte-to-Unicode Mapping
This is the most confusing part of GPT-2 tokenization, so let us be very precise.
The mapping assigns each of the 256 possible byte values to a Unicode codepoint:
Printable bytes (0x21-0x7E and 0xA1-0xFF):
Map to themselves.
0x41 ('A') -> U+0041 ('A')
0x61 ('a') -> U+0061 ('a')
0xC0 -> U+00C0 ('A' with grave)
Non-printable bytes (0x00-0x20, 0x7F-0xA0):
Map to U+0100 + offset, sequentially.
0x00 -> U+0100 ('A' with macron)
0x01 -> U+0101 ('a' with macron)
...
0x20 -> U+0120 ('G' with dot above)
0x7F -> U+0121 ('g' with dot above)
...
Here is a visual:
Byte value: 0x00 0x01 ... 0x20 | 0x21 ... 0x7E | 0x7F ... 0xA0 | 0xA1 ... 0xFF
            |--non-printable--| |--printable--| |non-printable| |--printable--|
Maps to:    U+100 U+101  U+120 | U+21 ... U+7E | U+121   U+142 | U+A1 ... U+FF
            |--offset block---| |--identity---| |offset block-| |--identity---|
Akunu builds this mapping at construction time:
static void build_byte_to_unicode(std::string table[256]) {
    int offset = 0;
    for (int b = 0; b < 256; b++) {
        if ((b >= 0x21 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFF)) {
            table[b] = codepoint_to_utf8(b);  // identity
        } else {
            table[b] = codepoint_to_utf8(0x100 + offset);
            offset++;
        }
    }
}
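The codepoint_to_utf8 helper is assumed above. A minimal sketch is shown here together with a repeat of the build function so the pair can be exercised end to end; two-byte encoding suffices because the offset block tops out at U+0142, well below U+0800.

```cpp
#include <string>

// Minimal UTF-8 encoder, valid for codepoints below U+0800 -- enough for
// the byte-to-unicode table. A sketch; the real helper handles all planes.
static std::string codepoint_to_utf8(unsigned cp) {
    std::string out;
    if (cp < 0x80) {
        out += (char)cp;                      // 1-byte ASCII
    } else {
        out += (char)(0xC0 | (cp >> 6));      // 2-byte leader
        out += (char)(0x80 | (cp & 0x3F));    // continuation
    }
    return out;
}

static void build_byte_to_unicode(std::string table[256]) {
    int offset = 0;
    for (int b = 0; b < 256; b++) {
        if ((b >= 0x21 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFF))
            table[b] = codepoint_to_utf8(b);              // identity
        else
            table[b] = codepoint_to_utf8(0x100 + offset++);  // offset block
    }
}
```

Spot-checking the table confirms the diagram: byte 0x41 maps to "A", byte 0x20 (space) maps to U+0120 (Ġ), and byte 0xA0, the last non-printable, maps to U+0142.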
GPT-2 Encoding
With the byte mapping in hand, GPT-2 BPE encoding works like this:
- Split the input on special tokens (longest-match greedy)
- For each non-special segment, convert bytes to GPT-2 Unicode
- Split into individual Unicode characters
- Apply BPE merges (lowest rank first, not highest score)
std::vector<uint32_t> Tokenizer::bpe_gpt2(const std::string& text) const {
    std::vector<uint32_t> ids;
    auto segments = split_special_tokens(text);
    for (auto& segment : segments) {
        if (is_special_token(segment)) {
            ids.push_back(token_to_id_.at(segment));
            continue;
        }
        // Convert bytes to GPT-2 Unicode
        std::string converted;
        for (uint8_t byte : segment)
            converted += byte_to_unicode_[byte];
        // Split into Unicode characters
        auto tokens = utf8_chars(converted);
        // BPE merge: lowest rank first
        while (tokens.size() >= 2) {
            int best_rank = INT_MAX;
            int best_idx = -1;
            for (int i = 0; i + 1 < (int)tokens.size(); i++) {
                std::string pair = tokens[i] + " " + tokens[i + 1];
                auto it = merge_ranks_.find(pair);
                if (it != merge_ranks_.end() && it->second < best_rank) {
                    best_rank = it->second;
                    best_idx = i;
                }
            }
            if (best_idx < 0) break;
            tokens[best_idx] += tokens[best_idx + 1];
            tokens.erase(tokens.begin() + best_idx + 1);
        }
        // Convert to IDs ...
    }
    return ids;
}
The key difference from SentencePiece: merge rules use ranks (lower is better)
stored as "token_a token_b" -> rank_int, while SentencePiece uses per-token
scores (higher is better).
Special Token Splitting
Before BPE, GPT-2 tokenizers must split the input on special tokens like
<|im_start|>, <|im_end|>, <|endoftext|>, etc. These are matched greedily
(longest first) and passed through as-is:
Input: "<|im_start|>system\nHello<|im_end|>"
Split: ["<|im_start|>", "system\nHello", "<|im_end|>"]
         ^ special       ^ BPE encoded   ^ special
Akunu marks special tokens with a \x01 prefix byte during splitting so the
encoder can distinguish them from regular text without additional data structures.
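A sketch of how greedy longest-first splitting with the \x01 marker might look. The function name matches the one used above, but the body is illustrative, not Akunu's exact code:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Scan left to right; at each position try every special token, longest
// first, so "<|im_start|>" wins over any shorter overlapping match.
// Matched specials get a '\x01' prefix so the encoder can tell them apart
// from literal text that happens to look like a tag.
std::vector<std::string> split_special_tokens(
        const std::string& text, std::vector<std::string> specials) {
    std::sort(specials.begin(), specials.end(),
              [](const std::string& a, const std::string& b) {
                  return a.size() > b.size();   // longest first
              });
    std::vector<std::string> out;
    std::string plain;                          // accumulates ordinary text
    for (size_t i = 0; i < text.size(); ) {
        bool matched = false;
        for (const auto& sp : specials) {
            if (text.compare(i, sp.size(), sp) == 0) {
                if (!plain.empty()) { out.push_back(plain); plain.clear(); }
                out.push_back('\x01' + sp);     // mark as special
                i += sp.size();
                matched = true;
                break;
            }
        }
        if (!matched) plain += text[i++];
    }
    if (!plain.empty()) out.push_back(plain);
    return out;
}
```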
GPT-2 Decoding
Decoding reverses the byte mapping: for each Unicode codepoint in the token string, look up the corresponding byte value:
// GPT-2 decode: reverse byte-to-unicode mapping
std::string bytes;
size_t pos = 0;
while (pos < token.size()) {
    uint32_t cp = decode_utf8_cp(token, pos);  // advances pos
    auto it = unicode_to_byte_.find(cp);
    if (it != unicode_to_byte_.end())
        bytes += (char)it->second;
    else
        bytes += codepoint_to_utf8(cp);  // pass through
}
return bytes;
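The decode_utf8_cp helper used above reads one codepoint and advances the cursor. A plausible sketch, assuming the by-reference position signature implied by the loop:

```cpp
#include <cstdint>
#include <string>

// Decode one UTF-8 codepoint starting at s[pos] and advance pos past it.
// Continuation bytes and invalid leaders are consumed one byte at a time
// and reported as U+FFFD. A sketch; the real helper may differ.
uint32_t decode_utf8_cp(const std::string& s, size_t& pos) {
    uint8_t b0 = (uint8_t)s[pos];
    int len; uint32_t cp;
    if      (b0 < 0x80)           { len = 1; cp = b0; }         // ASCII
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }  // 110xxxxx
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }  // 1110xxxx
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }  // 11110xxx
    else                          { pos++; return 0xFFFD; }     // not a leader
    if (pos + len > s.size())     { pos++; return 0xFFFD; }     // truncated
    for (int i = 1; i < len; i++)
        cp = (cp << 6) | ((uint8_t)s[pos + i] & 0x3F);          // 6 bits each
    pos += len;
    return cp;
}
```

Feeding it "AĠ你" (bytes 0x41, 0xC4 0xA0, 0xE4 0xBD 0xA0) yields the codepoints U+0041, U+0120, U+4F60 in turn, which is exactly the order the decode loop above walks a token string.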
WordPiece (BERT)
The third algorithm, used by BERT and embedding models like nomic-embed, is WordPiece. Unlike BPE, WordPiece does not use merges at all – it uses greedy longest-prefix matching.
Algorithm:
1. Lowercase the input (for uncased BERT)
2. Split on whitespace and punctuation
3. For each word:
   a. Try to find the longest prefix in the vocab
   b. If found, emit it and continue from where it ended
   c. For continuation pieces, try "##" + prefix (WordPiece convention)
   d. If no prefix matches, emit [UNK]
Example:
Input: "unaffable"
Try "unaffable" -> not in vocab
Try "unaffabl" -> not in vocab
...
Try "un" -> in vocab! Emit token for "un"
Remaining: "affable"
Try "##affable" -> not in vocab
Try "##affabl" -> not in vocab
...
Try "##aff" -> in vocab! Emit token for "##aff"
Remaining: "able"
Try "##able" -> in vocab! Emit token for "##able"
Output: ["un", "##aff", "##able"]
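The greedy longest-prefix loop for a single word can be sketched as follows. This uses a plain string set as the vocabulary and the standard "##" convention; Akunu looks up token IDs instead and also handles the no-prefix nomic variant:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Greedy longest-prefix WordPiece for one whitespace-split word.
// Continuation pieces (anything after the first match) carry "##".
std::vector<std::string> wordpiece_word(
        const std::string& word,
        const std::unordered_set<std::string>& vocab) {
    std::vector<std::string> pieces;
    size_t start = 0;
    while (start < word.size()) {
        size_t end = word.size();
        std::string match;
        while (end > start) {                       // shrink from the right
            std::string piece = word.substr(start, end - start);
            if (start > 0) piece = "##" + piece;    // continuation marker
            if (vocab.count(piece)) { match = piece; break; }
            end--;
        }
        if (match.empty()) return {"[UNK]"};        // no prefix matched at all
        pieces.push_back(match);
        start = end;                                // continue after the match
    }
    return pieces;
}
```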
Akunu’s implementation adds [SEP] at the end (BERT convention) and handles
the nomic-bert variant where continuation tokens may not use the ## prefix.
The Tokenizer Construction Pipeline
Akunu supports two paths for loading a tokenizer:
Path 1: From GGUF Metadata
When loading a GGUF model file, the tokenizer vocabulary, scores, merges, and special token IDs are embedded in the file’s metadata:
Tokenizer tok = Tokenizer::from_gguf(
    vocab,            // vector<string>: token strings
    scores,           // vector<float>: per-token scores (SentencePiece)
    merges,           // vector<string>: merge rules (GPT-2)
    bos_id, eos_id,   // special token IDs
    model_type,       // "llama", "gpt2", or "bert"
    add_bos,          // prepend BOS token?
    add_space_prefix  // prepend ▁ in SentencePiece mode?
);
The constructor builds the token_to_id_ reverse lookup map, the merge ranks
table (for GPT-2), and the byte-to-unicode mapping tables.
Path 2: From HuggingFace tokenizer.json
For models distributed as HF repos (without GGUF), Akunu has a hand-rolled JSON
parser in hf_tokenizer_loader.h that reads tokenizer.json and
tokenizer_config.json:
HFTokenizerData data;
bool ok = load_hf_tokenizer("/path/to/model/dir", data);
// data.vocab, data.merges, data.bos_id, data.eos_id, etc.
This parser is deliberately minimal – no JSON library dependency. It uses
character-level scanning with find_key(), parse_json_string(), and
parse_json_int() to extract exactly the fields it needs:
- model.vocab: the token-to-ID mapping
- model.merges: the BPE merge rules
- added_tokens: special tokens with their IDs
- pre_tokenizer.type: detects Gemma’s “Split” pre-tokenizer (no space prefix)
The parser also reads tokenizer_config.json for add_bos_token, bos_token,
and eos_token fields, handling the various ways different model families
specify these values.
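In the same character-scanning spirit, here is a minimal sketch of what find_key and parse_json_string might look like. The function names come from the chapter; the bodies are illustrative only: they ignore nesting and all escapes beyond a tiny subset, which a real loader must handle.

```cpp
#include <string>

// Locate a quoted key in raw JSON text, returning its position or npos.
size_t find_key(const std::string& json, const std::string& key,
                size_t from = 0) {
    return json.find("\"" + key + "\"", from);
}

// Parse the string value following a key found by find_key.
std::string parse_json_string(const std::string& json, size_t key_pos) {
    size_t colon = json.find(':', key_pos);     // skip past the key
    size_t open  = json.find('"', colon + 1);   // opening quote of the value
    std::string out;
    for (size_t i = open + 1; i < json.size() && json[i] != '"'; i++) {
        if (json[i] == '\\' && i + 1 < json.size()) {
            i++;                                // tiny escape subset
            out += (json[i] == 'n') ? '\n' : json[i];
        } else {
            out += json[i];
        }
    }
    return out;
}
```

This is enough to pull bos_token and eos_token out of a flat tokenizer_config.json, which is the shape those files usually have.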
Incremental Decode: UTF-8 Buffering for Streaming
When streaming tokens to the user in real time, there is a subtle problem: individual tokens may produce partial UTF-8 sequences. A multi-byte character like a Chinese character or an emoji might be split across two tokens.
Consider: the UTF-8 encoding of a Chinese character is 3 bytes. If token A
produces bytes [0xE4, 0xBD] and token B produces byte [0xA0], you cannot
emit token A’s output yet – those 2 bytes are not a valid UTF-8 string. You
must buffer them and wait for the completing byte.
Akunu’s decode_incremental() handles this:
std::string Tokenizer::decode_incremental(uint32_t token_id) {
    std::string raw = decode(token_id);
    if (raw.empty()) return "";
    utf8_buf_ += raw;  // append to buffer

    // Scan for complete UTF-8 sequences
    std::string output;
    size_t i = 0, last_good = 0;
    while (i < utf8_buf_.size()) {
        int expected = utf8_expected_len(utf8_buf_[i]);
        if (expected == 0) {
            // Orphan continuation byte -- emit as-is
            output += utf8_buf_[i];
            i++;
            last_good = i;
            continue;
        }
        if (i + expected > utf8_buf_.size())
            break;  // incomplete at end -- keep buffered
        output += utf8_buf_.substr(i, expected);
        i += expected;
        last_good = i;
    }

    // Keep only the incomplete tail
    utf8_buf_ = (last_good < utf8_buf_.size())
                    ? utf8_buf_.substr(last_good)
                    : "";
    return output;
}
The flow:
Token 1 decoded: [0xE4, 0xBD]
  Buffer:   [0xE4, 0xBD]
  Expected: 3 bytes (0xE4 is a 3-byte leader)
  Only 2 available -- keep buffered
  Output:   ""

Token 2 decoded: [0xA0, 0x48, 0x65]
  Buffer:     [0xE4, 0xBD, 0xA0, 0x48, 0x65]
  Sequence 1: [0xE4, 0xBD, 0xA0] = complete 3-byte UTF-8 = "你"
  Sequence 2: [0x48] = ASCII 'H'
  Sequence 3: [0x65] = ASCII 'e'
  Output:     "你He"
  Buffer:     empty
The state variable utf8_buf_ persists between calls and must be reset between
conversations via reset_incremental().
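The utf8_expected_len helper classifies the byte at the front of a candidate sequence. A plausible sketch, matching the behavior the scan loop relies on (0 for continuation bytes, the announced sequence length for leaders):

```cpp
#include <cstdint>

// Return the total UTF-8 sequence length announced by a leader byte,
// or 0 for a continuation byte (10xxxxxx) or an invalid leader.
int utf8_expected_len(char c) {
    uint8_t b = (uint8_t)c;
    if (b < 0x80)           return 1;   // ASCII
    if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 0;                           // continuation / invalid
}
```

Returning 0 for invalid leaders as well as continuation bytes means garbage input is emitted byte-by-byte rather than stalling the stream, which matches the "orphan byte" branch above.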
Thinking Block Stripping
Some models (like Qwen3) emit <think>...</think> blocks during chain-of-thought
reasoning. When streaming to the user, you typically want to strip these:
Model output: "<think>The user asked about France. Paris is the capital.</think>The capital of France is Paris."
After stripping: "The capital of France is Paris."
The strip_thinking() method implements a streaming-aware state machine:
States:
THINK_IDLE: normal output mode, scanning for <think>
THINK_INSIDE: inside a thinking block, scanning for </think>
Edge cases handled:
- Partial tag at chunk boundary: "<thi" + "nk>" across two calls
- Nested angle brackets that are NOT think tags
- Think blocks spanning multiple streaming chunks
The implementation buffers partial tag matches in think_buf_ and uses the
think_state_ enum to track whether we are inside a thinking block.
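A simplified sketch of such a streaming stripper is shown below. The two states and the partial-tag buffer mirror the description above, but the structure is illustrative: Akunu's strip_thinking handles more edge cases, and an unterminated partial tag at end-of-stream would stay buffered here.

```cpp
#include <string>

// Two-state <think>...</think> stripper for streamed chunks. Text inside
// a think block is discarded; a partial tag at a chunk boundary is held
// in buf until the next chunk resolves it.
struct ThinkStripper {
    enum State { THINK_IDLE, THINK_INSIDE } state = THINK_IDLE;
    std::string buf;  // pending text that might begin a tag

    std::string feed(const std::string& chunk) {
        buf += chunk;
        std::string out;
        while (!buf.empty()) {
            const std::string tag =
                (state == THINK_IDLE) ? "<think>" : "</think>";
            size_t lt = buf.find('<');
            if (state == THINK_IDLE) {
                if (lt == std::string::npos) { out += buf; buf.clear(); break; }
                out += buf.substr(0, lt);   // emit text before a possible tag
                buf.erase(0, lt);
            } else if (lt == std::string::npos || lt > 0) {
                // inside a think block: discard up to a possible tag
                buf.erase(0, lt == std::string::npos ? buf.size() : lt);
                if (buf.empty()) break;
            }
            // buf now starts with '<'
            if (buf.size() < tag.size()) {
                if (tag.compare(0, buf.size(), buf) == 0)
                    break;                  // partial tag: wait for more input
                if (state == THINK_IDLE) out += buf[0];  // literal '<'
                buf.erase(0, 1);
            } else if (buf.compare(0, tag.size(), tag) == 0) {
                buf.erase(0, tag.size());   // full tag: flip state
                state = (state == THINK_IDLE) ? THINK_INSIDE : THINK_IDLE;
            } else {
                if (state == THINK_IDLE) out += buf[0];  // not a tag
                buf.erase(0, 1);
            }
        }
        return out;
    }
};
```

This handles the "<thi" + "nk>" boundary case: the first chunk is held back entirely, and the reassembled tag flips the state on the second call.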
Extra EOS Token IDs
Different model families use different end-of-sequence tokens. LLaMA 3 uses
<|end_of_text|>, ChatML models use <|im_end|>, and older models use </s>.
Some models have multiple valid EOS tokens.
Akunu handles this with add_eos_id():
void add_eos_id(uint32_t id) { extra_eos_ids_.push_back(id); }

bool is_eos(uint32_t id) const {
    if (id == eos_id_) return true;
    for (auto eid : extra_eos_ids_)
        if (id == eid) return true;
    return false;
}
The GGUF loader and HF tokenizer loader register the appropriate EOS tokens for each model architecture.
Performance Considerations
Akunu’s tokenizer is intentionally simple and not optimized for speed. The BPE inner loop is O(n^2) in the number of character tokens (scan all pairs, merge one, repeat). For typical prompts of a few hundred to a few thousand tokens, this takes microseconds – negligible compared to the milliseconds of GPU prefill.
Where tokenizer performance does matter is in the vocabulary lookup. The
token_to_id_ map is an unordered_map<string, uint32_t> with ~100K-200K
entries. Each find() call hashes a string and does a comparison. For the BPE
inner loop, this is called O(n) times per merge round, for O(n) rounds, giving
O(n^2) hash lookups total. For n=1000 characters, that is 1M lookups – still
fast in practice but worth being aware of.
Summary
+----------------+-------------------+-------------------+-------------------+
| Feature        | SentencePiece     | GPT-2 BPE         | WordPiece         |
+----------------+-------------------+-------------------+-------------------+
| Models         | LLaMA, Qwen       | Mistral, Phi      | BERT              |
| Input units    | UTF-8 chars       | Bytes (mapped)    | Chars             |
| Merge priority | Score (higher =   | Rank (lower =     | N/A (longest      |
|                | better)           | better)           | prefix)           |
| Space handling | Replace with ▁    | Byte-encode 0x20  | Whitespace split  |
| Special tokens | <s>, </s>         | <|...|> patterns  | [CLS], [SEP] etc. |
| Unknown chars  | <0xXX> fallback   | Always encodable  | [UNK]             |
+----------------+-------------------+-------------------+-------------------+
The tokenizer is a deceptively simple component – the BPE algorithm is straightforward, but the details (byte mappings, space handling, special token splitting, UTF-8 buffering) are where the bugs hide. Akunu’s implementation handles all three major tokenizer families in a single class, loaded from either GGUF metadata or HuggingFace JSON, with streaming-aware incremental decode.