
The Sampling Pipeline

When temperature > 0, we do not simply pick the most likely token. Instead, we sample from the probability distribution, introducing controlled randomness that makes the model’s output more diverse, creative, and human-like. But turning a raw FP16 logit vector into a sampled token involves a multi-stage pipeline with several filtering steps, each with its own semantics and tradeoffs.

Akunu implements two distinct sampling paths: a CPU path in src/inference/sampling.cpp (used by grammar-constrained decode) and a GPU path via the Gumbel-max kernel (used by the default sampled decode). This chapter covers both, starting with the CPU path because it makes the algorithmic steps explicit, then explaining how the GPU path achieves the same result in a fundamentally different way.

The Full Pipeline: Logits to Token

Here is the complete CPU sampling pipeline, in order:

Raw F16 logits [vocab_size]
    │
    ▼
1. F16 → F32 conversion
    │
    ▼
2. Repetition penalty
    │
    ▼
3. Temperature scaling (logits *= 1/T)
    │
    ▼
4. Softmax (exp + normalize)
    │
    ▼
5. Top-K filter (partial sort, keep K highest)
    │
    ▼
6. Min-P filter (remove tokens below min_p * max_prob)
    │
    ▼
7. Top-P filter (nucleus sampling, cumulative prob ≥ P)
    │
    ▼
8. Re-normalize
    │
    ▼
9. Categorical sampling (uniform random + CDF walk)
    │
    ▼
Token ID

Let’s walk through each stage with the actual code.

The SamplingState Structure

Akunu uses a thread-local SamplingState to avoid repeated allocation:

struct SamplingState {
    std::vector<float> logits;
    struct Candidate {
        uint32_t id;
        float prob;
    };
    std::vector<Candidate> candidates;
    int capacity = 0;

    void ensure(int vocab_size) {
        if (capacity >= vocab_size) return;
        logits.resize(vocab_size);
        candidates.resize(vocab_size);
        capacity = vocab_size;
    }
};

extern thread_local SamplingState sampling_state;

The thread_local qualifier is important: since sampling runs on the CPU and could potentially be called from multiple threads (though Akunu currently uses single-threaded decode), the state is per-thread to avoid data races. The ensure method only reallocates if the vocabulary size has grown, so for typical usage (same model throughout), allocation happens exactly once.

The Candidate struct pairs a token ID with its probability, enabling efficient sorting and filtering without losing track of which probability belongs to which token.

Stage 1: F16 to F32 Conversion

The GPU produces logits in FP16. The CPU needs FP32 for numerical stability during softmax:

const __fp16 *f16 = (const __fp16 *)logits_data;
for (int i = 0; i < vocab_size; i++)
    logits[i] = (float)f16[i];

This is a straightforward widening conversion. On Apple Silicon, __fp16 is a native type and the conversion is hardware-accelerated through the NEON FP16 extension.1 For a 128K vocabulary, this copies 256KB of data and takes roughly 50 microseconds.
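On compilers without native __fp16 support, the same widening can be done in software. Here is an illustrative decoder for IEEE 754 binary16 bits (a sketch for clarity, not part of Akunu; on Apple Silicon the hardware instruction does this in one step):

```cpp
#include <cmath>
#include <cstdint>

// Portable IEEE 754 binary16 -> binary32 decode, equivalent in result to
// the hardware-accelerated __fp16 widening conversion (illustrative only).
float half_to_float(uint16_t h) {
    uint32_t sign = (h >> 15) & 1;
    uint32_t exp  = (h >> 10) & 0x1F;   // 5-bit exponent, bias 15
    uint32_t frac = h & 0x3FF;          // 10-bit fraction
    float result;
    if (exp == 0) {
        // Subnormal (or zero): value = frac * 2^-24
        result = std::ldexp((float)frac, -24);
    } else if (exp == 31) {
        // Infinity (frac == 0) or NaN (frac != 0)
        result = frac ? NAN : INFINITY;
    } else {
        // Normal: value = (1 + frac/1024) * 2^(exp - 15)
        result = std::ldexp(1.0f + frac / 1024.0f, (int)exp - 15);
    }
    return sign ? -result : result;
}
```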

Stage 2: Repetition Penalty

If enabled, repetition penalty discourages the model from repeating tokens it has already generated:

if (repeat_penalty != 1.0f && prev_tokens && n_prev > 0) {
    for (int i = 0; i < n_prev; i++) {
        uint32_t tok = prev_tokens[i];
        if (tok < (uint32_t)vocab_size) {
            if (logits[tok] > 0)
                logits[tok] /= repeat_penalty;
            else
                logits[tok] *= repeat_penalty;
        }
    }
}

The asymmetric treatment is deliberate:

| Logit Sign | Operation           | Effect |
|------------|---------------------|--------|
| Positive   | Divide by penalty   | Reduces probability |
| Negative   | Multiply by penalty | Increases magnitude (more negative, further reducing probability) |

This ensures that repetition penalty always decreases the probability of previously seen tokens, regardless of whether the logit is positive or negative. A penalty of 1.2 shrinks a positive logit to 1/1.2 ≈ 83% of its value and increases a negative logit’s magnitude by 20%.2
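The asymmetric rule can be captured in a one-line helper (a hypothetical sketch, not the Akunu function): with a penalty above 1, both branches move the logit strictly downward, which is the invariant the rule exists to preserve.

```cpp
// Illustrative repetition-penalty rule: divide positive logits,
// multiply negative ones. For penalty > 1, the result is always
// strictly less than the input, so the token's probability drops.
float apply_penalty(float logit, float penalty) {
    return logit > 0 ? logit / penalty : logit * penalty;
}
```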

Stage 3: Temperature Scaling

Temperature controls the “sharpness” of the distribution:

if (temperature <= 0.0f) {
    // Greedy: return argmax
    uint32_t best = 0;
    float best_val = logits[0];
    for (int i = 1; i < vocab_size; i++) {
        if (logits[i] > best_val) {
            best_val = logits[i];
            best = (uint32_t)i;
        }
    }
    return best;
}

float inv_temp = 1.0f / temperature;
for (int i = 0; i < vocab_size; i++)
    logits[i] *= inv_temp;

Multiplying logits by 1/temperature before softmax is mathematically equivalent to dividing the softmax exponents by temperature:

$$\text{softmax}(x_i / T) = \frac{e^{x_i / T}}{\sum_j e^{x_j / T}}$$

| Temperature | Effect                       | Use Case |
|-------------|------------------------------|----------|
| T = 0       | Argmax (greedy)              | Deterministic, factual |
| T < 1       | Sharper distribution         | More focused, less creative |
| T = 1       | Model’s native distribution  | Balanced |
| T > 1       | Flatter distribution         | More diverse, more creative |

Note that temperature = 0 is handled as a special case that short-circuits the entire pipeline, returning the argmax directly.
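To make the table concrete, here is a small hypothetical helper (not part of Akunu) that computes the probability of the most likely token under softmax(logits / T); lowering T pushes it toward 1, raising T pushes it toward uniform:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Probability assigned to the argmax token by softmax(logits / T).
// Uses the max-subtraction trick, so it is safe for large logits.
float softmax_max_prob(std::vector<float> logits, float T) {
    float m = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float &x : logits) {
        x = std::exp((x - m) / T);  // exponent <= 0, never overflows
        sum += x;
    }
    // The max logit maps to exp(0) = 1, the largest transformed value.
    return *std::max_element(logits.begin(), logits.end()) / sum;
}
```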

Stage 4: Softmax (Numerically Stable)

The logits are converted to probabilities using softmax with the max-subtraction trick for numerical stability:

// Find max for numerical stability
float max_logit = logits[0];
for (int i = 1; i < vocab_size; i++) {
    if (logits[i] > max_logit)
        max_logit = logits[i];
}

// Build (index, probability) pairs
float sum = 0;
for (int i = 0; i < vocab_size; i++) {
    float p = expf(logits[i] - max_logit);
    candidates[i] = {(uint32_t)i, p};
    sum += p;
}
float inv_sum = 1.0f / sum;
for (int i = 0; i < vocab_size; i++)
    candidates[i].prob *= inv_sum;

Subtracting max_logit before exponentiation prevents overflow. Without this, expf(logits[i]) could produce +inf for large logits, corrupting the entire computation. After subtraction, the largest exponent is expf(0) = 1, and all others are in (0, 1].3
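A minimal illustration of why the trick matters (a sketch, not the Akunu code): naive exponentiation overflows to +inf for logits in the hundreds, while the max-subtracted form stays finite and sums to 1.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Numerically stable softmax: subtracting the max keeps every
// exponent <= 0, so expf never overflows even for huge logits.
std::vector<float> stable_softmax(std::vector<float> logits) {
    float m = logits[0];
    for (float x : logits) m = std::max(m, x);
    float sum = 0.0f;
    for (float &x : logits) {
        x = std::exp(x - m);
        sum += x;
    }
    for (float &x : logits) x /= sum;
    return logits;
}
```

Note that std::exp(1000.0f) alone already overflows to infinity, which is exactly the failure mode the subtraction avoids.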

Stage 5: Top-K Filter

Top-K keeps only the K most probable tokens, zeroing out the rest:

int n_cand = vocab_size;
int effective_k = (top_k > 0 && top_k < n_cand) ? top_k : std::min(n_cand, 256);
std::partial_sort(
    candidates.data(), candidates.data() + effective_k, candidates.data() + n_cand,
    [](const SamplingState::Candidate& a, const SamplingState::Candidate& b) {
        return a.prob > b.prob;
    });
n_cand = effective_k;

Note the use of std::partial_sort instead of std::sort. Partial sort is O(n log k) compared to full sort’s O(n log n). For a 128K vocabulary with top_k = 40, that is roughly log(40) / log(128K) ≈ 30% of the work of a full sort.4

When top_k is 0 or negative, the default cap of 256 is applied. This prevents the subsequent min-p and top-p steps from operating on the full 128K vocabulary, which would be slow.
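As an isolated sketch (a hypothetical helper over bare probabilities, without token IDs), partial_sort leaves the K largest values at the front in descending order; everything past position K is in unspecified order and can simply be truncated:

```cpp
#include <algorithm>
#include <vector>

// Keep the K largest probabilities in descending order. partial_sort
// does O(n log k) work; the tail beyond K is left unordered and dropped.
std::vector<float> top_k_probs(std::vector<float> probs, int k) {
    std::partial_sort(probs.begin(), probs.begin() + k, probs.end(),
                      [](float a, float b) { return a > b; });
    probs.resize(k);
    return probs;
}
```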

Stage 6: Min-P Filter

Min-P is a relatively recent sampling technique that removes tokens whose probability is less than min_p * max_probability:5

if (min_p > 0 && n_cand > 0) {
    float threshold = candidates[0].prob * min_p;
    for (int i = 1; i < n_cand; i++) {
        if (candidates[i].prob < threshold) {
            n_cand = i;
            break;
        }
    }
}

Because the candidates are already sorted by probability (from the top-K partial sort), candidates[0] holds the maximum probability, and we just scan forward until we find a candidate below the threshold.

Min-P is adaptive: it removes more tokens when the model is confident (one token has very high probability) and fewer tokens when the model is uncertain (many tokens have similar probabilities). This makes it more robust than fixed top-K.

| min_p | When max_prob = 0.9           | When max_prob = 0.1 |
|-------|-------------------------------|---------------------|
| 0.1   | Keep tokens with prob > 0.09  | Keep tokens with prob > 0.01 |
| 0.05  | Keep tokens with prob > 0.045 | Keep tokens with prob > 0.005 |
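The adaptivity is easy to see in a sketch (hypothetical helper operating on a descending-sorted probability list): the same min_p keeps one survivor for a confident distribution but the whole candidate set for a flat one.

```cpp
#include <vector>

// Min-P survivor count over a descending-sorted probability list.
// The cutoff scales with the top probability: confident distributions
// keep few candidates, flat distributions keep many.
int min_p_survivors(const std::vector<float>& sorted_probs, float min_p) {
    float threshold = sorted_probs[0] * min_p;
    int n = (int)sorted_probs.size();
    for (int i = 1; i < n; i++)
        if (sorted_probs[i] < threshold)
            return i;  // everything from here down is filtered out
    return n;
}
```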

Stage 7: Top-P (Nucleus) Filter

Top-P sampling keeps the smallest set of tokens whose cumulative probability exceeds p:

if (top_p < 1.0f && n_cand > 0) {
    float cumsum = 0;
    for (int i = 0; i < n_cand; i++) {
        cumsum += candidates[i].prob;
        if (cumsum >= top_p) {
            n_cand = i + 1;
            break;
        }
    }
}

Since candidates are sorted by descending probability, this walks from the most likely token to the least likely, accumulating probability mass. As soon as the cumulative mass reaches top_p, the remaining tokens are discarded.6

For example, with top_p = 0.95, if the top tokens have probabilities [0.4, 0.3, 0.15, 0.08, 0.04, ...], the cumulative sum first reaches 0.95 at the fifth token (0.4 + 0.3 + 0.15 + 0.08 + 0.04 = 0.97), so we keep 5 tokens.
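That walk can be isolated into a small helper (a hypothetical sketch over a descending-sorted probability list, not the Akunu function):

```cpp
#include <vector>

// Nucleus size: how many descending-sorted tokens are needed before
// the cumulative probability first reaches top_p.
int nucleus_size(const std::vector<float>& sorted_probs, float top_p) {
    float cumsum = 0.0f;
    int n = (int)sorted_probs.size();
    for (int i = 0; i < n; i++) {
        cumsum += sorted_probs[i];
        if (cumsum >= top_p)
            return i + 1;
    }
    return n;  // top_p >= total mass: keep everything
}
```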

Stage 8: Re-Normalization

After filtering, the remaining candidates’ probabilities no longer sum to 1. We fix that:

sum = 0;
for (int i = 0; i < n_cand; i++)
    sum += candidates[i].prob;
inv_sum = 1.0f / sum;
for (int i = 0; i < n_cand; i++)
    candidates[i].prob *= inv_sum;

Stage 9: Categorical Sampling

Finally, we sample from the filtered distribution using the inverse CDF method:

std::uniform_real_distribution<float> dist(0.0f, 1.0f);
float r = dist(rng);
float cumsum = 0;
for (int i = 0; i < n_cand; i++) {
    cumsum += candidates[i].prob;
    if (r <= cumsum)
        return candidates[i].id;
}
return candidates[n_cand - 1].id;

A uniform random number r in [0, 1) is drawn, and we walk through the CDF until r <= cumsum. The last candidate is returned as a fallback in case of floating-point rounding.
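The CDF walk can be sketched as a standalone helper (hypothetical, not the Akunu function); passing r explicitly makes the inverse-CDF logic deterministic and easy to check:

```cpp
#include <vector>

// Inverse-CDF draw from a normalized probability list: walk the
// cumulative sum until it first meets or exceeds r.
int sample_cdf(const std::vector<float>& probs, float r) {
    float cumsum = 0.0f;
    int n = (int)probs.size();
    for (int i = 0; i < n; i++) {
        cumsum += probs[i];
        if (r <= cumsum)
            return i;
    }
    return n - 1;  // fallback for floating-point rounding
}
```

With probs = {0.5, 0.3, 0.2}, r in [0, 0.5] selects index 0, r in (0.5, 0.8] selects index 1, and r in (0.8, 1.0] selects index 2, so each index is chosen with exactly its probability.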

The F32 Overload (Grammar Path)

There is a second sample_logits overload that takes pre-converted F32 logits:

uint32_t sample_logits(float *logits, int vocab_size,
                       float temperature, int top_k, float top_p, float min_p);

This is used by decode_grammar, where the CPU has already read and masked the logits. The pipeline is identical except it skips the F16->F32 conversion and the repetition penalty (grammar decode applies its own masks). The temperature check also differs slightly:

if (temperature != 1.0f) {
    float inv_temp = 1.0f / temperature;
    for (int i = 0; i < vocab_size; i++)
        logits[i] *= inv_temp;
}

When temperature is exactly 1.0, the scaling is skipped entirely (not just for performance – it avoids introducing unnecessary floating-point error).

The GPU Sampling Path: Gumbel-Max

The GPU path replaces the entire CPU pipeline with a single mathematical trick. Instead of:

  1. Softmax to get probabilities
  2. Sample from categorical distribution

We use:

  1. Add Gumbel noise to logits
  2. Argmax

The Gumbel-max theorem states:7

$$\arg\max_i \left(\log \pi_i + G_i\right) \sim \text{Categorical}(\pi)$$

where $G_i \sim \text{Gumbel}(0, 1)$ and $\pi_i = \text{softmax}(\text{logit}_i / T)$.

Since $\log \pi_i = \text{logit}_i / T - \log Z$ differs from $\text{logit}_i / T$ only by a constant, and the argmax is unchanged by adding a constant or scaling by $T > 0$, we can simplify to:

$$\arg\max_i \left(\text{logit}_i + T \cdot G_i\right)$$

This is exactly what the gumbel_topk_f16 kernel computes:

float u = pcg_float(element_seed + i);
u = clamp(u, 1e-7f, 1.0f - 1e-7f);
float gumbel = -log(-log(u));
logits[i] = half(val + temp * gumbel);

The Gumbel noise is generated from a uniform random variable via the inverse CDF: $G = -\log(-\log(U))$ where $U \sim \text{Uniform}(0, 1)$.

The beauty of this approach is that top-k, top-p, and min-p are applied before the noise, so they function identically to the CPU path. Tokens that are filtered out get set to -inf, which means they can never win the argmax regardless of the noise.
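A host-side Monte Carlo sketch (a C++ stand-in for the Metal kernel, using std::mt19937 in place of the PCG hash) confirms the identity empirically: noise-plus-argmax selection frequencies match the softmax distribution.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Empirical check of the Gumbel-max identity: the frequency with which
// argmax(logit_i + T * G_i) selects token i converges to
// softmax(logit_i / T), with G_i = -log(-log(U)), U ~ Uniform(0, 1).
std::vector<double> gumbel_max_freqs(const std::vector<float>& logits,
                                     float T, int trials, unsigned seed) {
    std::mt19937 rng(seed);
    // Clamp away from 0 and 1, mirroring the kernel's clamp on u.
    std::uniform_real_distribution<float> unif(1e-7f, 1.0f - 1e-7f);
    std::vector<double> freqs(logits.size(), 0.0);
    for (int t = 0; t < trials; t++) {
        int best = 0;
        float best_val = -INFINITY;
        for (size_t i = 0; i < logits.size(); i++) {
            float g = -std::log(-std::log(unif(rng)));  // Gumbel(0, 1)
            float v = logits[i] + T * g;
            if (v > best_val) { best_val = v; best = (int)i; }
        }
        freqs[best] += 1.0 / trials;
    }
    return freqs;
}
```

For logits {1, 0} at T = 1, softmax gives e/(e+1) ≈ 0.731 for the first token, and the empirical frequency lands close to that.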

SamplingParams: The Configuration

All sampling parameters come from AkunuSamplingConfig:

| Parameter      | Type  | Default | Meaning |
|----------------|-------|---------|---------|
| temperature    | float | 0.0     | Softmax temperature (0 = greedy) |
| top_k          | int   | 0       | Keep only top-K tokens (0 = no limit, uses 256 cap) |
| top_p          | float | 1.0     | Nucleus sampling threshold |
| min_p          | float | 0.0     | Minimum probability ratio threshold |
| repeat_penalty | float | 1.0     | Repetition penalty factor |

These parameters can be combined freely. Some typical configurations:

| Use Case       | temperature | top_k | top_p | min_p |
|----------------|-------------|-------|-------|-------|
| Greedy/Factual | 0.0         | –     | –     | –     |
| Balanced       | 0.7         | 40    | 0.95  | 0.05  |
| Creative       | 1.0         | 0     | 0.9   | 0.0   |
| Very Creative  | 1.2         | 0     | 0.95  | 0.0   |
| Code Gen       | 0.2         | 40    | 0.95  | 0.1   |

Seed Handling and Reproducibility

The GPU path uses a PCG hash for random number generation:

inline float pcg_float(uint state) {
    state = state * 747796405u + 2891336453u;
    uint word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    word = (word >> 22u) ^ word;
    return float(word) / 4294967296.0f;
}

This is a stateless hash: given the same input state, it always produces the same output. The state is derived from:

uint element_seed = (params.position + params.seed_offset) * 2654435761u;

The seed_offset is a time-based value set once per generation call, and position is patched per-token in the chain. The multiplication by 2654435761 (a Knuth multiplicative hash constant) spreads the entropy across the bits.8
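For illustration, here is a host-side C++ port of pcg_float (the original runs in a Metal kernel); it demonstrates the stateless property: the same input state always yields the same float in [0, 1].

```cpp
#include <cstdint>

// C++ port of the GPU kernel's stateless PCG hash. No per-thread RNG
// state is kept: the output is a pure function of the input state.
float pcg_float(uint32_t state) {
    state = state * 747796405u + 2891336453u;
    uint32_t word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    word = (word >> 22u) ^ word;
    return (float)word / 4294967296.0f;
}
```

Because the hash is stateless, the GPU can evaluate it independently per vocabulary element with no synchronization, which is exactly why it suits the Gumbel kernel.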

For the CPU path, Akunu uses std::uniform_real_distribution with a thread-local rng (random number generator). The RNG state persists across tokens within a generation, providing a different random sequence for each call.

CPU vs GPU: When to Use Which

| Aspect                 | CPU Path                           | GPU Path |
|------------------------|------------------------------------|----------|
| Latency per token      | ~100us (F16->F32 + sort + sample)  | ~0us (no CPU roundtrip) |
| Required for           | Grammar decode                     | Non-grammar sampled decode |
| Top-K method           | std::partial_sort                  | Binary search on logit values |
| RNG quality            | std::mt19937 (Mersenne Twister)    | PCG hash (stateless) |
| Chain decode compatible| No (requires CPU round-trip)       | Yes |

The GPU path is strictly faster because it eliminates the CPU roundtrip. However, the CPU path is necessary when per-token CPU processing is required (grammar constraints, custom logit processors).

Summary

The sampling pipeline transforms raw logits into a sampled token through a carefully ordered sequence of filtering and normalization steps:

  1. Temperature controls distribution sharpness.
  2. Top-K provides a hard cutoff on candidate count.
  3. Min-P provides an adaptive cutoff based on the most likely token.
  4. Top-P provides a probability-mass-based cutoff.
  5. Categorical sampling draws from the filtered distribution.

Akunu offers two implementations: a CPU path for maximum flexibility (grammar constraints) and a GPU path (Gumbel-max) for maximum throughput (chain decode compatible). The GPU path achieves the same statistical result as the CPU path using a mathematical equivalence that replaces softmax + sampling with noise + argmax.



  1. ARM. “ARM Architecture Reference Manual: ARMv8-A.” Section C7.2, FCVT instruction. The __fp16 to float conversion on AArch64 uses the FCVT instruction which executes in 1-2 cycles. See https://developer.arm.com/documentation/ddi0487/latest/.

  2. Keskar, N. S., et al. “CTRL: A Conditional Transformer Language Model for Controllable Generation.” arXiv:1909.05858, 2019. The repetition penalty was popularized by the CTRL model and adopted by most LLM inference frameworks. See https://arxiv.org/abs/1909.05858.

  3. The log-sum-exp trick is a standard technique in numerical computing. See Blanchard, P., Higham, D.J., and Higham, N.J. “Accurately Computing the Log-Sum-Exp and Softmax Functions.” IMA Journal of Numerical Analysis, 2021. See https://doi.org/10.1093/imanum/draa038.

  4. std::partial_sort uses the introsort algorithm with heapselect fallback. In practice, it performs approximately n * log(k) comparisons for selecting the top k from n elements. See: Musser, D.R. “Introspective Sorting and Selection Algorithms.” Software: Practice and Experience, 1997. See https://en.wikipedia.org/wiki/Introsort.

  5. Min-P sampling was introduced in the open-source LLM community and formalized in: “Min-P Sampling: New Way to Control the Creativity of LLMs.” Discussion on llama.cpp GitHub, 2023. It provides more consistent behavior across different probability distributions than fixed top-K. See https://github.com/ggerganov/llama.cpp/pull/3841.

  6. Holtzman, A., et al. “The Curious Case of Neural Text Degeneration.” ICLR 2020. This paper introduced nucleus (top-p) sampling and showed that it produces more coherent text than top-k sampling for open-ended generation. See https://arxiv.org/abs/1904.09751.

  7. The Gumbel-max trick is proven in: Maddison, C.J., Tarlow, D., and Minka, T. “A* Sampling.” NeurIPS 2014. A modern treatment with applications to neural networks: Jang, E., Gu, S., and Poole, B. “Categorical Reparameterization with Gumbel-Softmax.” ICLR 2017. See https://arxiv.org/abs/1411.0030.

  8. O’Neill, M. “PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation.” Harvey Mudd College Technical Report HMC-CS-2014-0905, 2014. The PCG hash provides good statistical properties with minimal state, making it ideal for GPU kernels where per-thread state is expensive. See https://www.pcg-random.org/paper.html.