Testing Infrastructure
If there is one rule that separates a hobby GPU project from production-quality inference code, it is this: every kernel must be tested against a CPU reference implementation. GPUs are excellent at hiding bugs. A GEMV kernel with a subtle off-by-one error in its dequantization logic might produce plausible-looking text 95% of the time and garbled output the other 5%. Without rigorous testing, you will spend weeks chasing “random” quality regressions that are actually deterministic bugs in your shader code.
Akunu takes testing seriously. This chapter walks you through the entire testing infrastructure: the test directory structure, the kernel test pattern, the CPU reference implementations, the tolerance model, and how to write new tests.
Test Directory Structure
Let us start with a map of every test file in the repository:
tests/
+-- test_kernel_helpers.h        # Shared CPU references + GPU helpers
+-- test_device.mm               # MetalDevice lifecycle + basic ops
+-- test_weight_store.cpp        # GGUF/MLX weight loading
+-- test_table_builder.cpp       # Dispatch table construction
+-- test_config.cpp              # ChipConfig + device defaults
+-- test_tokenizer.cpp           # BPE tokenizer (needs model)
+-- test_tokenizer_internal.cpp  # Tokenizer internals (no model)
+-- test_inference.cpp           # Forward pass correctness (needs model)
+-- test_kv_cache.cpp            # KV cache write/shift/advance
+-- test_grammar.cpp             # GBNF parsing + bitmask generation
+-- test_sampling_quality.cpp    # Statistical sampling tests
+-- test_long_context.cpp        # Context window stress test
+-- test_server.cpp              # HTTP server parsing
+-- test_e2e.cpp                 # End-to-end generation test
+-- test_whisper.cpp             # Whisper format parsing
+-- test_whisper_e2e.cpp         # Whisper transcription test
+-- kernels/                     # Per-kernel GPU correctness tests
    +-- activation/
    |   +-- test_silu.cpp
    |   +-- test_gelu.cpp
    |   +-- test_silu_gate.cpp
    |   +-- test_gelu_gate.cpp
    +-- attention/
    |   +-- test_flash_attention.cpp
    +-- convert/
    |   +-- test_f32_to_f16.cpp
    |   +-- test_dequant_q4_0.cpp
    +-- embedding/
    |   +-- test_embedding_f16.cpp
    +-- matmul/
    |   +-- test_gemv_f16.cpp
    |   +-- test_gemv_q4_0.cpp
    |   +-- test_gemv_q8_0.cpp
    |   +-- test_gemm_f16.cpp
    +-- norm/
    |   +-- test_rmsnorm.cpp
    |   +-- test_gemma_rmsnorm.cpp
    +-- rope/
        +-- test_rope.cpp
        +-- test_rope_neox.cpp
There are three tiers of tests:
Tier 1: Kernel Tests (16 tests)
================================
- Test individual GPU kernels against CPU references
- Need only the metallib, no model file
- Fastest to run (< 1s each)
- This is where you start when adding a new kernel
Tier 2: Unit Tests (8 tests)
================================
- Test CPU-side logic (tokenizer, grammar, config, etc.)
- No GPU or model required for most
- Run via "make test-unit"
Tier 3: Integration / E2E Tests (6 tests)
================================
- Test full inference pipeline with real models
- Need a model file (GGUF or MLX SafeTensors)
- Slower to run (seconds to minutes)
- Run via "make test-infer"
The Kernel Test Pattern
Every kernel test in akunu follows the same pattern. Let us study it by
examining the actual test_rmsnorm.cpp from the repository:
Kernel Test Structure
=====================
1. Initialize GPU device + load metallib
2. Get pipeline state object for the kernel
3. For each test case:
a. Generate deterministic float data on CPU
b. Compute expected output using CPU reference
c. Upload input data to GPU buffers (as F16)
d. Set up shader parameters
e. Encode + dispatch the kernel
f. Synchronously wait for GPU completion
g. Read back GPU output (F16 -> F32)
h. Compare GPU output vs CPU reference within tolerance
4. Report pass/fail counts
Here is the actual code for the RMSNorm kernel test, annotated:
// Step 1: Device initialization (from test_kernel_helpers.h macro)
INIT_DEVICE();
// This creates a MetalDevice, loads akunu.metallib, and prints GPU name

// Step 2: Get the pipeline
Pipeline pso = dev->get_pipeline("rmsnorm_f16");

// Step 3a: Sweep across multiple dimensions
int dims[] = {64, 128, 256, 512, 896, 1024, 2048, 2560, 4096};
for (int dim : dims) {
    int n_rows = 4;

    // Step 3b: Deterministic test data
    auto x_data = det_floats(n_rows * dim, 42 + dim, -2.0f, 2.0f);
    auto w_data = det_floats(dim, 99 + dim, 0.5f, 1.5f);

    // Step 3c: CPU reference computation
    auto expected = cpu_rmsnorm(x_data, w_data, dim, eps);

    // Step 3d: Upload to GPU as F16
    Buffer x_buf = make_buf_f16(*dev, x_data);
    Buffer w_buf = make_buf_f16(*dev, w_data);
    Buffer o_buf = dev->allocate(n_rows * dim * 2);

    // Step 3e: Set up params and dispatch
    RMSNormParams params = {(uint32_t)dim, eps, 0, 0};
    dev->begin_encoding();
    dev->set_pipeline(pso);
    dev->set_buffer(x_buf, 0, 0);
    dev->set_buffer(w_buf, 0, 1);
    dev->set_buffer(o_buf, 0, 2);
    dev->set_bytes(&params, sizeof(params), 3);
    dev->dispatch(Dim3(n_rows), Dim3(std::min(1024, dim)));

    // Step 3f: Wait for GPU
    dev->end_encoding_sync();

    // Step 3g-h: Read back and compare
    CHECK(assert_close(
        read_f16(o_buf, n_rows * dim),  // GPU output (F16 -> F32)
        expected,                       // CPU reference (F32)
        5e-2f,                          // absolute tolerance
        5e-2f,                          // relative tolerance
        3                               // max errors to print
    ), "rmsnorm dim sweep");
}
The pattern is beautifully consistent. Every kernel test in the repository follows this exact flow. The variation is only in:
- Which pipeline (kernel name) to use
- What test data to generate
- Which CPU reference function to call
- What tolerance values are appropriate
The Test Helper Library
The file tests/test_kernel_helpers.h is the backbone of the testing
infrastructure. It provides five categories of utilities:
1. Float16 Conversion (CPU-side)
Since Metal kernels operate on F16 data but CPU reference implementations use F32, we need conversion functions:
static inline uint16_t f32_to_f16(float f); // CPU float -> F16 bits
static inline float f16_to_f32(uint16_t h); // F16 bits -> CPU float
These are software implementations of IEEE 754 half-precision conversion. They handle denormals, infinities, and NaN correctly. The GPU does these conversions in hardware, so our CPU-side conversion must match exactly.
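To make the bit manipulation concrete, here is a sketch of how such a scalar round-to-nearest-even conversion can be written. This is illustrative, not the repository's exact code; the `_sketch` names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sketch of F32 -> F16 with round-to-nearest-even.
static inline uint16_t f32_to_f16_sketch(float f) {
    uint32_t x; std::memcpy(&x, &f, sizeof(x));
    uint32_t sign = (x >> 16) & 0x8000u;
    int32_t  exp  = (int32_t)((x >> 23) & 0xFF) - 127 + 15;  // rebias 8-bit -> 5-bit exponent
    uint32_t mant = x & 0x7FFFFFu;
    if (((x >> 23) & 0xFF) == 0xFF)                // Inf / NaN
        return (uint16_t)(sign | 0x7C00u | (mant ? 0x200u : 0));
    if (exp >= 31) return (uint16_t)(sign | 0x7C00u);  // overflow -> Inf
    if (exp <= 0) {                                // F16 denormal or zero
        if (exp < -10) return (uint16_t)sign;      // underflows to signed zero
        mant |= 0x800000u;                         // make the implicit leading 1 explicit
        uint32_t shift = (uint32_t)(14 - exp);
        uint16_t h = (uint16_t)(mant >> shift);
        uint32_t rem = mant & ((1u << shift) - 1), half = 1u << (shift - 1);
        if (rem > half || (rem == half && (h & 1))) h++;   // round to nearest even
        return (uint16_t)(sign | h);
    }
    uint16_t h = (uint16_t)(sign | ((uint32_t)exp << 10) | (mant >> 13));
    uint32_t rem = mant & 0x1FFFu;                 // 13 dropped mantissa bits
    if (rem > 0x1000u || (rem == 0x1000u && (h & 1))) h++; // carry into exponent is correct
    return h;
}

// Hypothetical sketch of the reverse F16 -> F32 conversion.
static inline float f16_to_f32_sketch(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1F, mant = h & 0x3FFu, x;
    if (exp == 0x1F)      x = sign | 0x7F800000u | (mant << 13);  // Inf / NaN
    else if (exp == 0) {
        if (mant == 0) x = sign;                   // signed zero
        else {                                     // normalize the F16 denormal
            exp = 113;                             // 127 - 15 + 1
            while (!(mant & 0x400u)) { mant <<= 1; exp--; }
            x = sign | (exp << 23) | ((mant & 0x3FFu) << 13);
        }
    } else x = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    float f; std::memcpy(&f, &x, sizeof(f));
    return f;
}
```

The hard cases are exactly the ones the comment above calls out: values below 2^-14 must become F16 denormals, and rounding can carry all the way into the exponent field.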
2. Deterministic Random Data
static std::vector<float> det_floats(int count, uint64_t seed = 42,
                                     float lo = -1.0f, float hi = 1.0f);
This generates pseudo-random floats using a linear congruential generator (LCG) with a fixed seed. The key property is determinism: the same seed always produces the same sequence, regardless of platform or compiler. This makes tests reproducible.
The LCG uses the Knuth constants:
s = s * 6364136223846793005 + 1442695040888963407
This is intentionally NOT a cryptographically strong PRNG. We want speed and reproducibility, not security.
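A minimal sketch of such a generator, assuming only the Knuth constants quoted above (the repository's exact bit-mixing may differ):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of a deterministic float generator in the spirit of det_floats().
static std::vector<float> det_floats_sketch(int count, uint64_t seed = 42,
                                            float lo = -1.0f, float hi = 1.0f) {
    std::vector<float> out(count);
    uint64_t s = seed;
    for (int i = 0; i < count; i++) {
        // Knuth's MMIX LCG step: same seed -> same sequence on every platform.
        s = s * 6364136223846793005ULL + 1442695040888963407ULL;
        // Use the high 24 bits; the low bits of an LCG are weak.
        float u = (float)(s >> 40) / 16777216.0f;  // u in [0, 1)
        out[i] = lo + u * (hi - lo);
    }
    return out;
}
```

Calling it twice with the same seed yields identical vectors, which is the property the tests rely on.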
3. GPU Buffer Helpers
static Buffer make_buf_f16(Device& dev, const std::vector<float>& data);
static Buffer make_buf_f32(Device& dev, const std::vector<float>& data);
static Buffer make_buf_u32(Device& dev, const std::vector<uint32_t>& data);
static std::vector<float> read_f16(Buffer buf, int count);
static std::vector<float> read_f32(Buffer buf, int count);
These handle the F32-to-F16 conversion on upload and F16-to-F32 conversion on
readback. The make_buf_f16 function converts each float to F16 on the CPU
side, then uploads the F16 data to a GPU buffer. The read_f16 function reads
F16 data from a GPU buffer and converts each value back to F32.
This is the data flow:

    CPU F32 data          GPU F16 buffer          GPU kernel
    [1.0, 2.0, ...]  -->  make_buf_f16()   -->  [kernel execution]
                                                      |
                                                      v
    CPU F32 result   <--  read_f16()       <--  GPU F16 output buffer
    [0.99, 1.98, ...]

    Note: each F32 <-> F16 conversion introduces quantization error.
    That is why we need tolerances in our comparisons.
4. Device Init and Result Macros
#define INIT_DEVICE() \
    auto dev = Device::create_default(); \
    printf("GPU: %s\n", dev->name()); \
    { \
        bool _ok = dev->load_library("../../.build/metallib/akunu.metallib"); \
        if (!_ok) _ok = dev->load_library(".build/metallib/akunu.metallib"); \
        if (!_ok) _ok = dev->load_library("../../../.build/metallib/akunu.metallib"); \
        if (!_ok) { printf("Metallib not found\n"); return 1; } \
    }

#define PRINT_RESULTS() \
    printf("\n=== Results: %d passed, %d failed ===\n", \
           g_tests_passed, g_tests_failed); \
    return g_tests_failed > 0 ? 1 : 0
The INIT_DEVICE() macro tries three paths for the metallib because tests can
be run from different working directories. The PRINT_RESULTS() macro returns
a non-zero exit code on failure, which lets CI systems detect test failures.
5. Tolerance Comparison
static bool assert_close(const std::vector<float>& actual,
                         const std::vector<float>& expected,
                         float atol, float rtol,
                         int max_errors = 5,
                         const char *label = "");
This is the heart of the comparison system. For each element pair, it checks:
|actual[i] - expected[i]| <= max(atol, |expected[i]| * rtol)
The tolerance model uses both absolute tolerance (atol) and relative
tolerance (rtol), taking the maximum. This handles both near-zero values
(where relative tolerance would be too strict) and large values (where
absolute tolerance would be too loose).
Tolerance Model
===============
For expected value 0.001:
atol = 0.05 --> allowed error = 0.05 (absolute dominates)
rtol = 0.05 --> allowed error = 0.00005 (too strict)
max(0.05, 0.00005) = 0.05 <-- we use this
For expected value 100.0:
atol = 0.05 --> allowed error = 0.05 (too strict)
rtol = 0.05 --> allowed error = 5.0 (relative dominates)
max(0.05, 5.0) = 5.0 <-- we use this
When an error is detected, the function prints up to max_errors mismatches
with their indices, actual values, expected values, and the computed tolerance.
This makes debugging much easier than a simple pass/fail.
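The comparison logic described above fits in a few lines. Here is a sketch (the repository version also updates global pass/fail counters, which this omits):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical sketch of the mixed absolute/relative comparison.
static bool assert_close_sketch(const std::vector<float>& actual,
                                const std::vector<float>& expected,
                                float atol, float rtol,
                                int max_errors = 5, const char* label = "") {
    if (actual.size() != expected.size()) {
        std::printf("[%s] size mismatch: %zu vs %zu\n",
                    label, actual.size(), expected.size());
        return false;
    }
    int errors = 0;
    for (size_t i = 0; i < actual.size(); i++) {
        // Allowed error: whichever of atol / rtol is more permissive.
        float tol  = std::max(atol, std::fabs(expected[i]) * rtol);
        float diff = std::fabs(actual[i] - expected[i]);
        if (diff > tol || std::isnan(actual[i])) {
            if (errors < max_errors)   // fail loudly, but cap the spam
                std::printf("[%s] i=%zu actual=%g expected=%g diff=%g tol=%g\n",
                            label, i, actual[i], expected[i], diff, tol);
            errors++;
        }
    }
    return errors == 0;
}
```

Note the explicit `isnan` check: a NaN compares false against any tolerance, but making it explicit documents that NaN is always a failure.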
CPU Reference Implementations
The test helper library includes CPU reference implementations for every kernel category. These are intentionally simple – no SIMD, no optimization, just readable code that is obviously correct.
RMSNorm Reference
static std::vector<float> cpu_rmsnorm(
    const std::vector<float>& x,
    const std::vector<float>& w,
    int dim, float eps)
{
    int rows = (int)x.size() / dim;
    std::vector<float> out(x.size());
    for (int r = 0; r < rows; r++) {
        float ss = 0;
        for (int i = 0; i < dim; i++)
            ss += x[r*dim + i] * x[r*dim + i];
        float rms = 1.0f / sqrtf(ss / dim + eps);
        for (int i = 0; i < dim; i++)
            out[r*dim + i] = x[r*dim + i] * rms * w[i];
    }
    return out;
}
Compare this with the Metal kernel that runs on the GPU:
kernel void rmsnorm_f16(
    device const half *input, device const half *weight,
    device half *output, constant RMSNormParams &params, ...)
{
    // ... threadgroup reduction for sum of squares ...
    float rms = rsqrt(total_sum_sq / float(dim) + eps);
    for (uint i = tid; i < dim; i += tg_size) {
        row_out[i] = half(float(row_in[i]) * rms) * weight[i];
    }
}
The CPU version operates on F32 floats. The GPU version operates on F16 halfs, converting to F32 only for the accumulation. This precision difference is exactly why we need tolerance in our comparisons.
GEMV Reference
// y[n] = sum_k W[n,k] * x[k]
static std::vector<float> cpu_gemv(
    const std::vector<float>& W,
    const std::vector<float>& x,
    int N, int K)
{
    std::vector<float> y(N, 0);
    for (int n = 0; n < N; n++)
        for (int k = 0; k < K; k++)
            y[n] += W[n*K + k] * x[k];
    return y;
}
This is the simplest possible matrix-vector multiply. The GPU version uses SIMD-group operations, threadgroup memory, and quantized weight formats. But the mathematical result should be the same (within tolerance).
Attention Reference
static std::vector<float> cpu_attention(
    const std::vector<float>& Q,
    const std::vector<float>& K,
    const std::vector<float>& V,
    int seq_len, int kv_seq_len, int head_dim,
    int n_heads, int n_kv_heads, float scale, bool causal)
{
    std::vector<float> O(n_heads * seq_len * head_dim, 0);
    int heads_per_group = n_heads / n_kv_heads;
    for (int h = 0; h < n_heads; h++) {
        int kv_h = h / heads_per_group;  // GQA mapping
        for (int s = 0; s < seq_len; s++) {
            // QK^T
            std::vector<float> scores(kv_seq_len);
            for (int t = 0; t < kv_seq_len; t++) {
                float dot = 0;
                for (int d = 0; d < head_dim; d++)
                    dot += Q[...] * K[...];
                scores[t] = dot * scale;
                if (causal && t > s) scores[t] = -1e9f;
            }
            // Softmax
            // ... standard numerically stable softmax ...
            // Weighted sum of V
            // ... accumulate ...
        }
    }
    return O;
}
This implements the full attention computation: QK^T with scaling, causal
masking, softmax, and V weighting. It handles grouped-query attention (GQA)
through the heads_per_group mapping. The GPU version uses FlashAttention
with online softmax and tiled computation, but produces the same result.
RoPE References
There are two RoPE variants, matching the two GPU kernels:
// Standard (interleaved pairs): [x0,x1, x2,x3, ...]
static void cpu_rope(std::vector<float>& x,
                     int seq_len, int n_heads, int head_dim,
                     float theta, int pos_offset = 0);

// NeoX (split-half): [x0,...,x(N/2-1), x(N/2),...,x(N-1)]
static void cpu_rope_neox(std::vector<float>& x,
                          int seq_len, int n_heads, int head_dim,
                          float theta, int pos_offset = 0);
The standard variant pairs adjacent elements (x[2d], x[2d+1]). The NeoX
variant pairs elements from the first and second halves (x[d], x[half+d]).
Both apply the same rotation math, just with different indexing.
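To show the rotation math itself, here is a sketch of the NeoX variant applied in place. The `[seq_len, n_heads, head_dim]` row-major layout is an assumption for illustration; the repository may order the tensor differently.

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch of the NeoX (split-half) rotation described above.
static void cpu_rope_neox_sketch(std::vector<float>& x,
                                 int seq_len, int n_heads, int head_dim,
                                 float theta, int pos_offset = 0) {
    int half = head_dim / 2;
    for (int s = 0; s < seq_len; s++) {
        int pos = s + pos_offset;
        for (int h = 0; h < n_heads; h++) {
            float* v = &x[(size_t)(s * n_heads + h) * head_dim];
            for (int d = 0; d < half; d++) {
                // Per-pair rotation frequency: theta^(-2d/head_dim).
                float freq = std::pow(theta, -2.0f * d / head_dim);
                float c = std::cos(pos * freq), si = std::sin(pos * freq);
                float a = v[d], b = v[half + d];  // split-half pairing
                v[d]        = a * c - b * si;     // 2D rotation of the pair
                v[half + d] = a * si + b * c;
            }
        }
    }
}
```

The standard variant is identical except that the pair is `(v[2d], v[2d+1])` instead of `(v[d], v[half+d])`. A useful sanity check: position 0 must leave the input unchanged, and every rotation must preserve the norm of its pair.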
Quantization Helpers
For testing quantized GEMV kernels, we need to quantize data on the CPU:
struct Q4_0Block {
    uint16_t scale_f16;  // absmax / 8 in F16
    uint8_t  nibs[16];   // 32 x 4-bit values (each byte = 2 values)
};

static std::vector<uint8_t> quantize_q4_0(
    const std::vector<float>& vals,
    std::vector<float>& dequant_ref);
The quantize_q4_0 function returns both the quantized bytes (for GPU upload)
and the dequantized reference values (for comparison). The dequantized values
account for the quantization error introduced by the round-trip, so our
tolerance comparison measures only the GPU computation error, not the
inherent quantization loss.
Quantization Test Flow
======================
Original F32 values: [0.5, -0.3, 0.8, ...]
        |
        v
quantize_q4_0() produces:
  - Quantized bytes (Q4_0 format)              -> upload to GPU
  - Dequant reference [0.49, -0.29, 0.79, ...] -> compare against
        |
        v
GPU kernel: dequant + GEMV
        |
        v
GPU output: [result using quantized weights]
        |
        v
Compare vs: cpu_gemv(dequant_ref, x)
            (GEMV using the SAME dequantized values)
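A sketch of the per-block quantization step may help. It follows the `Q4_0Block` layout shown above (F16 scale plus 16 nibble bytes); the nibble packing order and the F32 scale here are simplifying assumptions, not the repository's exact code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical sketch of Q4_0 quantization for one 32-value block.
// The real helper stores the scale as F16; we keep F32 for clarity.
static void quantize_q4_0_block_sketch(const float* vals,   // 32 inputs
                                       float& scale,        // out: absmax / 8
                                       uint8_t nibs[16],    // out: packed 4-bit codes
                                       float* dequant_ref)  // out: 32 round-tripped values
{
    float absmax = 0;
    for (int i = 0; i < 32; i++) absmax = std::max(absmax, std::fabs(vals[i]));
    scale = absmax / 8.0f;                        // codes live in [-8, 7]
    float inv = scale ? 1.0f / scale : 0.0f;
    for (int i = 0; i < 32; i++) {
        int q = (int)std::lround(vals[i] * inv);  // nearest code
        q = std::max(-8, std::min(7, q));
        uint8_t u = (uint8_t)(q + 8);             // stored biased by +8
        // Packing assumption: low nibbles hold values 0..15,
        // high nibbles hold values 16..31 (assigned first, OR'd second).
        if (i < 16) nibs[i] = u;
        else        nibs[i - 16] |= (uint8_t)(u << 4);
        dequant_ref[i] = (float)q * scale;        // what the GPU reconstructs
    }
}
```

Comparing GPU output against `dequant_ref` rather than `vals` is what removes the quantization error from the measurement: any remaining difference is the GPU's fault alone.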
The 16 Kernel Tests
Here is a complete inventory of every kernel test, what it tests, and what CPU reference it uses:
+-------------------------+---------------------+-------------------+--------+
| Test                    | Kernel(s)           | CPU Reference     | Tol    |
+-------------------------+---------------------+-------------------+--------+
| test_rmsnorm            | rmsnorm_f16         | cpu_rmsnorm       | 5e-2   |
| test_gemma_rmsnorm      | rmsnorm_gemma_f16   | cpu_gemma_rmsnorm | 5e-2   |
| test_gemv_f16           | gemv_f16            | cpu_gemv          | 1e-2   |
| test_gemv_q4_0          | gemv_q4_0           | cpu_gemv(dequant) | 5e-2   |
| test_gemv_q8_0          | gemv_q8_0           | cpu_gemv(dequant) | 5e-2   |
| test_gemm_f16           | simd_gemm_f16       | cpu_gemm_bt       | 5e-2   |
| test_silu               | silu activation     | cpu_silu          | 1e-3   |
| test_gelu               | gelu activation     | cpu_gelu          | 1e-3   |
| test_silu_gate          | silu_gate_f16       | silu(a)*b         | 1e-2   |
| test_gelu_gate          | gelu_gate_f16       | gelu(a)*b         | 1e-2   |
| test_rope               | rope_f16            | cpu_rope          | 5e-2   |
| test_rope_neox          | rope_neox_f16       | cpu_rope_neox     | 5e-2   |
| test_flash_attention    | flash_attention_*   | cpu_attention     | 5e-2   |
| test_embedding_f16      | embedding_lookup_f16| direct lookup     | 0      |
| test_f32_to_f16         | f32_to_f16          | f32_to_f16 (CPU)  | 0      |
| test_dequant_q4_0       | dequant_q4_0        | quantize_q4_0     | 0      |
+-------------------------+---------------------+-------------------+--------+
Notice the tolerance values. Pure data movement kernels (embedding, conversion, dequantization) use zero tolerance – they must produce bit-identical results. Arithmetic kernels (GEMV, GEMM, norms) use tolerances of 1e-2 to 5e-2 to account for F16 precision loss and different accumulation orders.
Why Different Tolerances?
The tolerance differences come from the nature of each operation:
Precision Loss Sources
======================
1. F16 storage: 10-bit mantissa = ~3 decimal digits
- Every F32->F16 conversion loses precision
- Multiply two F16 values: error compounds
2. Accumulation order:
- CPU: sequential sum (deterministic)
- GPU: SIMD reduction + threadgroup reduction (different order)
- Floating-point addition is not associative:
(a + b) + c != a + (b + c) in general
3. Transcendental functions:
- CPU: libm's exp(), sqrt(), etc. (double precision internally)
- GPU: Metal's exp(), rsqrt() (hardware F32, not F64)
- Metal's "fast math" may use approximations
Result: GEMV with K=4096 accumulates 4096 products.
Each product has F16 quantization error. The sum has
accumulated error proportional to sqrt(K) * epsilon_f16.
For K=4096: sqrt(4096) * 0.001 ~ 0.03, which explains
why we use atol=0.05 for GEMV tests.
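Point 2 above, the non-associativity of floating-point addition, is worth seeing once with real numbers. This tiny example (not from the repository) uses `volatile` to keep the compiler from constant-folding the arithmetic:

```cpp
// Float addition is not associative: a sequential CPU sum and a tree-shaped
// GPU reduction can legitimately produce different results.
static float sum_left() {
    volatile float a = 1e8f, b = -1e8f, c = 1.0f;
    return (a + b) + c;   // (1e8 - 1e8) + 1 = 1
}
static float sum_right() {
    volatile float a = 1e8f, b = -1e8f, c = 1.0f;
    return a + (b + c);   // the 1 is absorbed: -1e8 + 1 rounds back to -1e8, so result is 0
}
```

Neither answer is "wrong"; both are correctly rounded evaluations of a different association. This is exactly why bit-exact CPU/GPU comparison is hopeless for reductions and why the tolerance model exists.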
Test Case Design
Each kernel test exercises multiple configurations to catch edge cases:
Dimension Sweep
===============
Tests run across multiple dimension values:
dims[] = {64, 128, 256, 512, 896, 1024, 2048, 2560, 4096}
Why these values?
64 = minimum useful size
128 = one SIMD group width * 4
256 = common small-model dim
512 = common head_dim * n_heads for small models
896 = Qwen3-0.6B's dim (non-power-of-2!)
1024 = threadgroup max, common dim
2048 = common mid-size dim
2560 = another non-power-of-2 (tests stride logic)
4096 = large model dim (LLaMA 7B/8B)
Non-power-of-2 values (896, 2560) are crucial because they
exercise the remainder handling in strided loops:
for (uint i = tid; i < dim; i += tg_size)
If dim is not a multiple of tg_size, the last iteration
processes fewer elements. Bugs here are common.
Each kernel test also includes edge case tests:
- Zero input: Ensures no NaN propagation (especially for norms where division by near-zero can produce infinity/NaN)
- Constant input: Tests that reduction operations work correctly when all values are identical (catches bugs where partial SIMD lanes have garbage)
- Large values: Tests for overflow in F16 (max representable: ~65504)
- Multi-row: Tests that row indexing is correct (the kernel processes multiple independent rows in one dispatch)
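The zero-input case deserves one concrete illustration (a numeric fact, not repository code): the eps term inside the square root is the only thing standing between an all-zero row and NaN output.

```cpp
#include <cmath>

// The RMS normalizer as used in cpu_rmsnorm: 1 / sqrt(mean_sq + eps).
static float rms_factor(float mean_sq, float eps) {
    return 1.0f / std::sqrt(mean_sq + eps);
}
// With eps = 0 and an all-zero row, rms_factor(0, 0) is +inf, and the
// subsequent 0 * inf multiply produces NaN, which then poisons every
// downstream layer. With eps = 1e-5, the factor is finite (~316) and the
// output stays 0 * 316 = 0, as the zero-input edge tests require.
```

This is why the edge-case tests feed all-zero rows through every norm kernel rather than trusting that "it obviously works".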
Writing a New Kernel Test
Let us walk through writing a test for a hypothetical new kernel. Suppose you
have added a vector_scale_f16 kernel that multiplies every element by a
scalar:
Step 1: Create the Test File
tests/kernels/common/test_vector_scale.cpp
Step 2: Write the Test
/// GPU kernel test: vector_scale_f16
#include "test_kernel_helpers.h"
#include "metal/kernels/ShaderTypes.h"

// CPU reference: trivially correct
static std::vector<float> cpu_vector_scale(
    const std::vector<float>& x, float scale)
{
    std::vector<float> out(x.size());
    for (size_t i = 0; i < x.size(); i++)
        out[i] = x[i] * scale;
    return out;
}

int main() {
    printf("=== Kernel Test: Vector Scale ===\n\n");
    INIT_DEVICE();
    Pipeline pso = dev->get_pipeline("vector_scale_f16");

    // Dimension sweep
    int counts[] = {32, 64, 256, 1000, 1024, 4096, 10000};
    float scales[] = {0.5f, 1.0f, 2.0f, -1.0f, 0.001f};
    for (int count : counts) {
        for (float scale : scales) {
            auto x_data = det_floats(count, 42 + count);
            auto expected = cpu_vector_scale(x_data, scale);

            Buffer x_buf = make_buf_f16(*dev, x_data);
            Buffer o_buf = dev->allocate(count * 2);

            // Assume params struct: { uint32_t count; float scale; ... }
            struct { uint32_t count; float scale; uint32_t _p0, _p1; }
                params = {(uint32_t)count, scale, 0, 0};

            dev->begin_encoding();
            dev->set_pipeline(pso);
            dev->set_buffer(x_buf, 0, 0);
            dev->set_buffer(o_buf, 0, 1);
            dev->set_bytes(&params, sizeof(params), 2);
            dev->dispatch_threads(Dim3(count), Dim3(256));
            dev->end_encoding_sync();

            char label[64];
            snprintf(label, sizeof(label),
                     "count=%d, scale=%.3f", count, scale);
            CHECK(assert_close(
                read_f16(o_buf, count), expected,
                1e-3f, 1e-3f, 3, label
            ), label);

            dev->free_buffer(x_buf);
            dev->free_buffer(o_buf);
        }
    }
    printf("  dimension x scale sweep done\n");

    // Edge case: empty (count=0)
    // Edge case: single element
    // Edge case: scale=0 (should produce all zeros)

    PRINT_RESULTS();
}
Step 3: Add to CMakeLists.txt
Add the test to the KERNEL_TESTS list:
set(KERNEL_TESTS
    # ... existing tests ...
    tests/kernels/common/test_vector_scale.cpp
)
The foreach loop in CMakeLists.txt automatically creates the executable:
foreach(test_src ${KERNEL_TESTS})
    get_filename_component(test_name ${test_src} NAME_WE)
    set(target_name "akunu_kernel_${test_name}")
    add_executable(${target_name} ${test_src})
    target_link_libraries(${target_name} PRIVATE akunu_engine)
    target_include_directories(${target_name} PRIVATE
        ${CMAKE_CURRENT_SOURCE_DIR}/tests)
    set_source_files_properties(${test_src} PROPERTIES LANGUAGE OBJCXX)
endforeach()
This produces build/akunu_kernel_test_vector_scale.
Step 4: Build and Run
make && build/akunu_kernel_test_vector_scale
Expected output:
=== Kernel Test: Vector Scale ===
GPU: Apple M4 Pro
dimension x scale sweep done
=== Results: 35 passed, 0 failed ===
Unit Tests (Non-Kernel)
The non-kernel tests verify CPU-side logic. They follow a simpler pattern since there is no GPU involvement.
test_tokenizer_internal.cpp
Tests BPE tokenizer internals without needing a model file. Validates byte-pair encoding, special token handling, and edge cases like empty strings and UTF-8 multi-byte characters.
test_grammar.cpp
Tests GBNF grammar parsing and bitmask generation. Creates a grammar from a GBNF string, feeds it token sequences, and verifies that the acceptance bitmask correctly allows/rejects tokens at each step.
test_table_builder.cpp
Tests dispatch table construction. Creates a mock model configuration and
verifies that build_dispatch_table produces the expected sequence of commands
with correct parameters, buffer bindings, and patch types.
test_kv_cache.cpp
Tests KV cache operations:
- Write: new key/value vectors are written to correct positions
- Advance: position counter increments correctly
- Shift: sliding window eviction moves data correctly
- Edge cases: write at max_seq_len, shift by more than current length
test_server.cpp
Tests HTTP request parsing and OpenAI API compatibility. No network involved – it feeds raw HTTP strings to the parser and checks the extracted fields.
Integration Tests
test_e2e.cpp
The end-to-end test loads a real model, tokenizes a prompt, runs prefill, generates tokens, and verifies that:
- The model loads without error
- Tokenization produces expected token count
- Prefill returns a valid first token
- Generation produces coherent text (checked by greedy decoding a known prompt)
- EOS token stops generation
test_inference.cpp
More detailed inference tests:
- Forward pass produces non-NaN outputs
- Greedy decoding is deterministic (same prompt always produces same output)
- KV cache reuse works (continuing generation from a saved position)
- Different model architectures produce valid output
test_long_context.cpp
Stress-tests the context window by filling the KV cache to capacity and verifying that:
- The model does not crash or produce NaN
- KV cache shift correctly evicts old entries
- Output quality does not degrade catastrophically after shift
Running Tests in CI
For continuous integration, the recommended test matrix is:
CI Pipeline
===========
Stage 1: Build (parallel)
+------------------------+
| make shaders | <-- Compile Metal shaders
| make engine | <-- Compile C++ code
+------------------------+
Stage 2: Fast Tests (no model, < 10s)
+------------------------+
| make test-unit | <-- Tokenizer, grammar, server
| kernel tests (all 16) | <-- GPU correctness
+------------------------+
Stage 3: Inference Tests (needs model, < 60s)
+------------------------+
| make test-infer | <-- E2E generation test
+------------------------+
Kernel tests require Apple Silicon hardware with Metal support. They cannot run
on Intel Macs or Linux CI runners. For CI, you need a macOS runner with Apple
Silicon (GitHub Actions has macos-14 runners with M1).
Test Philosophy
A few principles that guide akunu’s testing approach:
1. Every GPU kernel has a CPU twin. No exceptions. If the math is complicated enough to run on a GPU, it is complicated enough to verify on a CPU.
2. Deterministic test data. All test data comes from det_floats() with
fixed seeds. No rand(), no time(NULL). Tests must be reproducible.
3. Tolerance, not equality. Floating-point GPU computation will never match CPU computation bit-for-bit. Define appropriate tolerances and document why.
4. Sweep dimensions. Never test just one dimension. Bugs hide in edge cases: non-power-of-2 sizes, sizes smaller than threadgroup width, sizes larger than threadgroup memory.
5. Test the edges. Zero input (division by zero in norms), maximum values (F16 overflow), single-element (degenerate reduction), empty input (boundary conditions).
6. Fail loudly. When a comparison fails, print the index, actual value, expected value, difference, and tolerance. This turns a “test failed” into a diagnosis.
Summary
Akunu’s testing infrastructure rests on three pillars: CPU reference implementations that are simple enough to be obviously correct, GPU test harnesses that handle F16 conversion and buffer management, and a tolerance model that accounts for the inherent precision differences between CPU and GPU computation.
The 16 kernel tests cover every operation in the forward pass. The unit tests verify CPU-side logic. The integration tests validate the full inference pipeline with real models. Together, they provide confidence that changes to the codebase do not break correctness.
In the next chapter, we will use this testing infrastructure as we walk through adding a completely new Metal kernel from scratch.