Testing Infrastructure
If there is one rule that separates a hobby GPU project from production-quality inference code, it is this: every kernel must be tested against a CPU reference implementation. GPUs are excellent at hiding bugs. A GEMV kernel with a subtle off-by-one error in its dequantization logic might produce plausible-looking text 95% of the time and garbled output the other 5%. Without rigorous testing, you will spend weeks chasing “random” quality regressions that are actually deterministic bugs in your shader code.
Akunu takes testing seriously. This chapter walks you through the entire testing infrastructure: the test directory structure, the kernel test pattern, the CPU reference implementations, the tolerance model, and how to write new tests.
Test Directory Structure
Let us start with a map of every test file in the repository:
tests/
+-- test_kernel_helpers.h        # Shared CPU references + GPU helpers
+-- test_device.mm               # MetalDevice lifecycle + basic ops
+-- test_weight_store.cpp        # GGUF/MLX weight loading
+-- test_table_builder.cpp       # Dispatch table construction
+-- test_config.cpp              # ChipConfig + device defaults
+-- test_tokenizer.cpp           # BPE tokenizer (needs model)
+-- test_tokenizer_internal.cpp  # Tokenizer internals (no model)
+-- test_inference.cpp           # Forward pass correctness (needs model)
+-- test_kv_cache.cpp            # KV cache write/shift/advance
+-- test_grammar.cpp             # GBNF parsing + bitmask generation
+-- test_sampling_quality.cpp    # Statistical sampling tests
+-- test_long_context.cpp        # Context window stress test
+-- test_server.cpp              # HTTP server parsing
+-- test_e2e.cpp                 # End-to-end generation test
+-- test_whisper.cpp             # Whisper format parsing
+-- test_whisper_e2e.cpp         # Whisper transcription test
+-- kernels/                     # Per-kernel GPU correctness tests
    +-- activation/
    |   +-- test_silu.cpp
    |   +-- test_gelu.cpp
    |   +-- test_silu_gate.cpp
    |   +-- test_gelu_gate.cpp
    +-- attention/
    |   +-- test_flash_attention.cpp
    +-- convert/
    |   +-- test_f32_to_f16.cpp
    |   +-- test_dequant_q4_0.cpp
    +-- embedding/
    |   +-- test_embedding_f16.cpp
    +-- matmul/
    |   +-- test_gemv_f16.cpp
    |   +-- test_gemv_q4_0.cpp
    |   +-- test_gemv_q8_0.cpp
    |   +-- test_gemm_f16.cpp
    +-- norm/
    |   +-- test_rmsnorm.cpp
    |   +-- test_gemma_rmsnorm.cpp
    +-- rope/
        +-- test_rope.cpp
        +-- test_rope_neox.cpp
There are three tiers of tests:
Tier 1: Kernel Tests (16 tests)
================================
- Test individual GPU kernels against CPU references
- Need only the metallib, no model file
- Fastest to run (< 1s each)
- This is where you start when adding a new kernel
Tier 2: Unit Tests (8 tests)
================================
- Test CPU-side logic (tokenizer, grammar, config, etc.)
- No GPU or model required for most
- Run via "make test-unit"
Tier 3: Integration / E2E Tests (6 tests)
================================
- Test full inference pipeline with real models
- Need a model file (GGUF or MLX SafeTensors)
- Slower to run (seconds to minutes)
- Run via "make test-infer"
The Kernel Test Pattern
Every kernel test in akunu follows the same pattern. Let us study it by
examining the actual test_rmsnorm.cpp from the repository:
Kernel Test Structure
=====================
1. Initialize GPU device + load metallib
2. Get pipeline state object for the kernel
3. For each test case:
a. Generate deterministic float data on CPU
b. Compute expected output using CPU reference
c. Upload input data to GPU buffers (as F16)
d. Set up shader parameters
e. Encode + dispatch the kernel
f. Synchronously wait for GPU completion
g. Read back GPU output (F16 -> F32)
h. Compare GPU output vs CPU reference within tolerance
4. Report pass/fail counts
Here is the actual code for the RMSNorm kernel test, annotated:
// Step 1: Device initialization (from test_kernel_helpers.h macro)
INIT_DEVICE();
// This creates a MetalDevice, loads akunu.metallib, and prints GPU name

// Step 2: Get the pipeline
Pipeline pso = dev->get_pipeline("rmsnorm_f16");

// Step 3a: Sweep across multiple dimensions
int dims[] = {64, 128, 256, 512, 896, 1024, 2048, 2560, 4096};
for (int dim : dims) {
    int n_rows = 4;

    // Step 3b: Deterministic test data
    auto x_data = det_floats(n_rows * dim, 42 + dim, -2.0f, 2.0f);
    auto w_data = det_floats(dim, 99 + dim, 0.5f, 1.5f);

    // Step 3c: CPU reference computation
    auto expected = cpu_rmsnorm(x_data, w_data, dim, eps);

    // Step 3d: Upload to GPU as F16
    Buffer x_buf = make_buf_f16(*dev, x_data);
    Buffer w_buf = make_buf_f16(*dev, w_data);
    Buffer o_buf = dev->allocate(n_rows * dim * 2);

    // Step 3e: Set up params and dispatch
    RMSNormParams params = {(uint32_t)dim, eps, 0, 0};
    dev->begin_encoding();
    dev->set_pipeline(pso);
    dev->set_buffer(x_buf, 0, 0);
    dev->set_buffer(w_buf, 0, 1);
    dev->set_buffer(o_buf, 0, 2);
    dev->set_bytes(&params, sizeof(params), 3);
    dev->dispatch(Dim3(n_rows), Dim3(std::min(1024, dim)));

    // Step 3f: Wait for GPU
    dev->end_encoding_sync();

    // Step 3g-h: Read back and compare
    CHECK(assert_close(
        read_f16(o_buf, n_rows * dim),  // GPU output (F16 -> F32)
        expected,                       // CPU reference (F32)
        5e-2f,                          // absolute tolerance
        5e-2f,                          // relative tolerance
        3                               // max errors to print
    ), "rmsnorm dim sweep");
}
The pattern is beautifully consistent. Every kernel test in the repository follows this exact flow. The variation is only in:
- Which pipeline (kernel name) to use
- What test data to generate
- Which CPU reference function to call
- What tolerance values are appropriate
The Test Helper Library
The file tests/test_kernel_helpers.h is the backbone of the testing
infrastructure. It provides five categories of utilities:
1. Float16 Conversion (CPU-side)
Since Metal kernels operate on F16 data but CPU reference implementations use F32, we need conversion functions:
static inline uint16_t f32_to_f16(float f); // CPU float -> F16 bits
static inline float f16_to_f32(uint16_t h); // F16 bits -> CPU float
These are software implementations of IEEE 754 half-precision conversion. They handle denormals, infinities, and NaN correctly. The GPU does these conversions in hardware, so our CPU-side conversion must match exactly.
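To make the bit manipulation concrete, here is a sketch of how such a scalar round-to-nearest-even conversion can be written. This is illustrative, not the repository's exact code; the `_sketch` names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sketch of F32 -> F16 with round-to-nearest-even.
static inline uint16_t f32_to_f16_sketch(float f) {
    uint32_t x; std::memcpy(&x, &f, sizeof(x));
    uint32_t sign = (x >> 16) & 0x8000u;
    int32_t  exp  = (int32_t)((x >> 23) & 0xFF) - 127 + 15;  // rebias 8-bit -> 5-bit exponent
    uint32_t mant = x & 0x7FFFFFu;
    if (((x >> 23) & 0xFF) == 0xFF)                // Inf / NaN
        return (uint16_t)(sign | 0x7C00u | (mant ? 0x200u : 0));
    if (exp >= 31) return (uint16_t)(sign | 0x7C00u);  // overflow -> Inf
    if (exp <= 0) {                                // F16 denormal or zero
        if (exp < -10) return (uint16_t)sign;      // underflows to signed zero
        mant |= 0x800000u;                         // make the implicit leading 1 explicit
        uint32_t shift = (uint32_t)(14 - exp);
        uint16_t h = (uint16_t)(mant >> shift);
        uint32_t rem = mant & ((1u << shift) - 1), half = 1u << (shift - 1);
        if (rem > half || (rem == half && (h & 1))) h++;   // round to nearest even
        return (uint16_t)(sign | h);
    }
    uint16_t h = (uint16_t)(sign | ((uint32_t)exp << 10) | (mant >> 13));
    uint32_t rem = mant & 0x1FFFu;                 // 13 dropped mantissa bits
    if (rem > 0x1000u || (rem == 0x1000u && (h & 1))) h++; // carry into exponent is correct
    return h;
}

// Hypothetical sketch of the reverse F16 -> F32 conversion.
static inline float f16_to_f32_sketch(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1F, mant = h & 0x3FFu, x;
    if (exp == 0x1F)      x = sign | 0x7F800000u | (mant << 13);  // Inf / NaN
    else if (exp == 0) {
        if (mant == 0) x = sign;                   // signed zero
        else {                                     // normalize the F16 denormal
            exp = 113;                             // 127 - 15 + 1
            while (!(mant & 0x400u)) { mant <<= 1; exp--; }
            x = sign | (exp << 23) | ((mant & 0x3FFu) << 13);
        }
    } else x = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    float f; std::memcpy(&f, &x, sizeof(f));
    return f;
}
```

The hard cases are exactly the ones the comment above calls out: values below 2^-14 must become F16 denormals, and rounding can carry all the way into the exponent field.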
2. Deterministic Random Data
static std::vector<float> det_floats(int count, uint64_t seed = 42,
                                     float lo = -1.0f, float hi = 1.0f);
This generates pseudo-random floats using a linear congruential generator (LCG) with a fixed seed. The key property is determinism: the same seed always produces the same sequence, regardless of platform or compiler. This makes tests reproducible.
The LCG uses the Knuth constants:
s = s * 6364136223846793005 + 1442695040888963407
This is intentionally NOT a cryptographically strong PRNG. We want speed and reproducibility, not security.
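A minimal sketch of such a generator, assuming only the Knuth constants quoted above (the repository's exact bit-mixing may differ):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of a deterministic float generator in the spirit of det_floats().
static std::vector<float> det_floats_sketch(int count, uint64_t seed = 42,
                                            float lo = -1.0f, float hi = 1.0f) {
    std::vector<float> out(count);
    uint64_t s = seed;
    for (int i = 0; i < count; i++) {
        // Knuth's MMIX LCG step: same seed -> same sequence on every platform.
        s = s * 6364136223846793005ULL + 1442695040888963407ULL;
        // Use the high 24 bits; the low bits of an LCG are weak.
        float u = (float)(s >> 40) / 16777216.0f;  // u in [0, 1)
        out[i] = lo + u * (hi - lo);
    }
    return out;
}
```

Calling it twice with the same seed yields identical vectors, which is the property the tests rely on.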
3. GPU Buffer Helpers
static Buffer make_buf_f16(Device& dev, const std::vector<float>& data);
static Buffer make_buf_f32(Device& dev, const std::vector<float>& data);
static Buffer make_buf_u32(Device& dev, const std::vector<uint32_t>& data);
static std::vector<float> read_f16(Buffer buf, int count);
static std::vector<float> read_f32(Buffer buf, int count);
These handle the F32-to-F16 conversion on upload and F16-to-F32 conversion on
readback. The make_buf_f16 function converts each float to F16 on the CPU
side, then uploads the F16 data to a GPU buffer. The read_f16 function reads
F16 data from a GPU buffer and converts each value back to F32.
This is the data flow:

    CPU F32 data          GPU F16 buffer          GPU kernel
    [1.0, 2.0, ...]  -->  make_buf_f16()   -->  [kernel execution]
                                                      |
                                                      v
    CPU F32 result   <--  read_f16()       <--  GPU F16 output buffer
    [0.99, 1.98, ...]

    Note: each F32 <-> F16 conversion introduces quantization error.
    That is why we need tolerances in our comparisons.
4. Device Init and Result Macros
#define INIT_DEVICE() \
    auto dev = Device::create_default(); \
    printf("GPU: %s\n", dev->name()); \
    { \
        bool _ok = dev->load_library("../../.build/metallib/akunu.metallib"); \
        if (!_ok) _ok = dev->load_library(".build/metallib/akunu.metallib"); \
        if (!_ok) _ok = dev->load_library("../../../.build/metallib/akunu.metallib"); \
        if (!_ok) { printf("Metallib not found\n"); return 1; } \
    }

#define PRINT_RESULTS() \
    printf("\n=== Results: %d passed, %d failed ===\n", \
           g_tests_passed, g_tests_failed); \
    return g_tests_failed > 0 ? 1 : 0
The INIT_DEVICE() macro tries three paths for the metallib because tests can
be run from different working directories. The PRINT_RESULTS() macro returns
a non-zero exit code on failure, which lets CI systems detect test failures.
5. Tolerance Comparison
static bool assert_close(const std::vector<float>& actual,
                         const std::vector<float>& expected,
                         float atol, float rtol,
                         int max_errors = 5,
                         const char *label = "");
This is the heart of the comparison system. For each element pair, it checks:
|actual[i] - expected[i]| <= max(atol, |expected[i]| * rtol)
The tolerance model uses both absolute tolerance (atol) and relative
tolerance (rtol), taking the maximum. This handles both near-zero values
(where relative tolerance would be too strict) and large values (where
absolute tolerance would be too loose).
Tolerance Model
===============
For expected value 0.001:
atol = 0.05 --> allowed error = 0.05 (absolute dominates)
rtol = 0.05 --> allowed error = 0.00005 (too strict)
max(0.05, 0.00005) = 0.05 <-- we use this
For expected value 100.0:
atol = 0.05 --> allowed error = 0.05 (too strict)
rtol = 0.05 --> allowed error = 5.0 (relative dominates)
max(0.05, 5.0) = 5.0 <-- we use this
When an error is detected, the function prints up to max_errors mismatches
with their indices, actual values, expected values, and the computed tolerance.
This makes debugging much easier than a simple pass/fail.
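The comparison logic described above fits in a few lines. Here is a sketch (the repository version also updates global pass/fail counters, which this omits):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical sketch of the mixed absolute/relative comparison.
static bool assert_close_sketch(const std::vector<float>& actual,
                                const std::vector<float>& expected,
                                float atol, float rtol,
                                int max_errors = 5, const char* label = "") {
    if (actual.size() != expected.size()) {
        std::printf("[%s] size mismatch: %zu vs %zu\n",
                    label, actual.size(), expected.size());
        return false;
    }
    int errors = 0;
    for (size_t i = 0; i < actual.size(); i++) {
        // Allowed error: whichever of atol / rtol is more permissive.
        float tol  = std::max(atol, std::fabs(expected[i]) * rtol);
        float diff = std::fabs(actual[i] - expected[i]);
        if (diff > tol || std::isnan(actual[i])) {
            if (errors < max_errors)   // fail loudly, but cap the spam
                std::printf("[%s] i=%zu actual=%g expected=%g diff=%g tol=%g\n",
                            label, i, actual[i], expected[i], diff, tol);
            errors++;
        }
    }
    return errors == 0;
}
```

Note the explicit `isnan` check: a NaN compares false against any tolerance, but making it explicit documents that NaN is always a failure.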
CPU Reference Implementations
The test helper library includes CPU reference implementations for every kernel category. These are intentionally simple – no SIMD, no optimization, just readable code that is obviously correct.
RMSNorm Reference
static std::vector<float> cpu_rmsnorm(
    const std::vector<float>& x,
    const std::vector<float>& w,
    int dim, float eps)
{
    int rows = (int)x.size() / dim;
    std::vector<float> out(x.size());
    for (int r = 0; r < rows; r++) {
        float ss = 0;
        for (int i = 0; i < dim; i++)
            ss += x[r*dim + i] * x[r*dim + i];
        float rms = 1.0f / sqrtf(ss / dim + eps);
        for (int i = 0; i < dim; i++)
            out[r*dim + i] = x[r*dim + i] * rms * w[i];
    }
    return out;
}
Compare this with the Metal kernel that runs on the GPU:
kernel void rmsnorm_f16(
    device const half *input, device const half *weight,
    device half *output, constant RMSNormParams &params, ...)
{
    // ... threadgroup reduction for sum of squares ...
    float rms = rsqrt(total_sum_sq / float(dim) + eps);
    for (uint i = tid; i < dim; i += tg_size) {
        row_out[i] = half(float(row_in[i]) * rms) * weight[i];
    }
}
The CPU version operates on F32 floats. The GPU version operates on F16 halfs, converting to F32 only for the accumulation. This precision difference is exactly why we need tolerance in our comparisons.
GEMV Reference
// y[n] = sum_k W[n,k] * x[k]
static std::vector<float> cpu_gemv(
    const std::vector<float>& W,
    const std::vector<float>& x,
    int N, int K)
{
    std::vector<float> y(N, 0);
    for (int n = 0; n < N; n++)
        for (int k = 0; k < K; k++)
            y[n] += W[n*K + k] * x[k];
    return y;
}
This is the simplest possible matrix-vector multiply. The GPU version uses SIMD-group operations, threadgroup memory, and quantized weight formats. But the mathematical result should be the same (within tolerance).
Attention Reference
static std::vector<float> cpu_attention(
    const std::vector<float>& Q,
    const std::vector<float>& K,
    const std::vector<float>& V,
    int seq_len, int kv_seq_len, int head_dim,
    int n_heads, int n_kv_heads, float scale, bool causal)
{
    std::vector<float> O(n_heads * seq_len * head_dim, 0);
    int heads_per_group = n_heads / n_kv_heads;
    for (int h = 0; h < n_heads; h++) {
        int kv_h = h / heads_per_group;  // GQA mapping
        for (int s = 0; s < seq_len; s++) {
            // QK^T
            std::vector<float> scores(kv_seq_len);
            for (int t = 0; t < kv_seq_len; t++) {
                float dot = 0;
                for (int d = 0; d < head_dim; d++)
                    dot += Q[...] * K[...];
                scores[t] = dot * scale;
                if (causal && t > s) scores[t] = -1e9f;
            }
            // Softmax
            // ... standard numerically stable softmax ...
            // Weighted sum of V
            // ... accumulate ...
        }
    }
    return O;
}
This implements the full attention computation: QK^T with scaling, causal
masking, softmax, and V weighting. It handles grouped-query attention (GQA)
through the heads_per_group mapping. The GPU version uses FlashAttention
with online softmax and tiled computation, but produces the same result.
RoPE References
There are two RoPE variants, matching the two GPU kernels:
// Standard (interleaved pairs): [x0,x1, x2,x3, ...]
static void cpu_rope(std::vector<float>& x,
                     int seq_len, int n_heads, int head_dim,
                     float theta, int pos_offset = 0);

// NeoX (split-half): [x0,...,x(N/2-1), x(N/2),...,x(N-1)]
static void cpu_rope_neox(std::vector<float>& x,
                          int seq_len, int n_heads, int head_dim,
                          float theta, int pos_offset = 0);
The standard variant pairs adjacent elements (x[2d], x[2d+1]). The NeoX
variant pairs elements from the first and second halves (x[d], x[half+d]).
Both apply the same rotation math, just with different indexing.
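To show the rotation math itself, here is a sketch of the NeoX variant applied in place. The `[seq_len, n_heads, head_dim]` row-major layout is an assumption for illustration; the repository may order the tensor differently.

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch of the NeoX (split-half) rotation described above.
static void cpu_rope_neox_sketch(std::vector<float>& x,
                                 int seq_len, int n_heads, int head_dim,
                                 float theta, int pos_offset = 0) {
    int half = head_dim / 2;
    for (int s = 0; s < seq_len; s++) {
        int pos = s + pos_offset;
        for (int h = 0; h < n_heads; h++) {
            float* v = &x[(size_t)(s * n_heads + h) * head_dim];
            for (int d = 0; d < half; d++) {
                // Per-pair rotation frequency: theta^(-2d/head_dim).
                float freq = std::pow(theta, -2.0f * d / head_dim);
                float c = std::cos(pos * freq), si = std::sin(pos * freq);
                float a = v[d], b = v[half + d];  // split-half pairing
                v[d]        = a * c - b * si;     // 2D rotation of the pair
                v[half + d] = a * si + b * c;
            }
        }
    }
}
```

The standard variant is identical except that the pair is `(v[2d], v[2d+1])` instead of `(v[d], v[half+d])`. A useful sanity check: position 0 must leave the input unchanged, and every rotation must preserve the norm of its pair.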
Quantization Helpers
For testing quantized GEMV kernels, we need to quantize data on the CPU:
struct Q4_0Block {
    uint16_t scale_f16;  // absmax / 8 in F16
    uint8_t  nibs[16];   // 32 x 4-bit values (each byte = 2 values)
};

static std::vector<uint8_t> quantize_q4_0(
    const std::vector<float>& vals,
    std::vector<float>& dequant_ref);
The quantize_q4_0 function returns both the quantized bytes (for GPU upload)
and the dequantized reference values (for comparison). The dequantized values
account for the quantization error introduced by the round-trip, so our
tolerance comparison measures only the GPU computation error, not the
inherent quantization loss.
Quantization Test Flow
======================
Original F32 values: [0.5, -0.3, 0.8, ...]
        |
        v
quantize_q4_0() produces:
  - Quantized bytes (Q4_0 format)              -> upload to GPU
  - Dequant reference [0.49, -0.29, 0.79, ...] -> compare against
        |
        v
GPU kernel: dequant + GEMV
        |
        v
GPU output: [result using quantized weights]
        |
        v
Compare vs: cpu_gemv(dequant_ref, x)
            (GEMV using the SAME dequantized values)
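A sketch of the per-block quantization step may help. It follows the `Q4_0Block` layout shown above (F16 scale plus 16 nibble bytes); the nibble packing order and the F32 scale here are simplifying assumptions, not the repository's exact code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical sketch of Q4_0 quantization for one 32-value block.
// The real helper stores the scale as F16; we keep F32 for clarity.
static void quantize_q4_0_block_sketch(const float* vals,   // 32 inputs
                                       float& scale,        // out: absmax / 8
                                       uint8_t nibs[16],    // out: packed 4-bit codes
                                       float* dequant_ref)  // out: 32 round-tripped values
{
    float absmax = 0;
    for (int i = 0; i < 32; i++) absmax = std::max(absmax, std::fabs(vals[i]));
    scale = absmax / 8.0f;                        // codes live in [-8, 7]
    float inv = scale ? 1.0f / scale : 0.0f;
    for (int i = 0; i < 32; i++) {
        int q = (int)std::lround(vals[i] * inv);  // nearest code
        q = std::max(-8, std::min(7, q));
        uint8_t u = (uint8_t)(q + 8);             // stored biased by +8
        // Packing assumption: low nibbles hold values 0..15,
        // high nibbles hold values 16..31 (assigned first, OR'd second).
        if (i < 16) nibs[i] = u;
        else        nibs[i - 16] |= (uint8_t)(u << 4);
        dequant_ref[i] = (float)q * scale;        // what the GPU reconstructs
    }
}
```

Comparing GPU output against `dequant_ref` rather than `vals` is what removes the quantization error from the measurement: any remaining difference is the GPU's fault alone.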
The 16 Kernel Tests
Here is a complete inventory of every kernel test, what it tests, and what CPU reference it uses:
+-------------------------+---------------------+-------------------+--------+
| Test                    | Kernel(s)           | CPU Reference     | Tol    |
+-------------------------+---------------------+-------------------+--------+
| test_rmsnorm            | rmsnorm_f16         | cpu_rmsnorm       | 5e-2   |
| test_gemma_rmsnorm      | rmsnorm_gemma_f16   | cpu_gemma_rmsnorm | 5e-2   |
| test_gemv_f16           | gemv_f16            | cpu_gemv          | 1e-2   |
| test_gemv_q4_0          | gemv_q4_0           | cpu_gemv(dequant) | 5e-2   |
| test_gemv_q8_0          | gemv_q8_0           | cpu_gemv(dequant) | 5e-2   |
| test_gemm_f16           | simd_gemm_f16       | cpu_gemm_bt       | 5e-2   |
| test_silu               | silu activation     | cpu_silu          | 1e-3   |
| test_gelu               | gelu activation     | cpu_gelu          | 1e-3   |
| test_silu_gate          | silu_gate_f16       | silu(a)*b         | 1e-2   |
| test_gelu_gate          | gelu_gate_f16       | gelu(a)*b         | 1e-2   |
| test_rope               | rope_f16            | cpu_rope          | 5e-2   |
| test_rope_neox          | rope_neox_f16       | cpu_rope_neox     | 5e-2   |
| test_flash_attention    | flash_attention_*   | cpu_attention     | 5e-2   |
| test_embedding_f16      | embedding_lookup_f16| direct lookup     | 0      |
| test_f32_to_f16         | f32_to_f16          | f32_to_f16 (CPU)  | 0      |
| test_dequant_q4_0       | dequant_q4_0        | quantize_q4_0     | 0      |
+-------------------------+---------------------+-------------------+--------+
Notice the tolerance values. Pure data movement kernels (embedding, conversion, dequantization) use zero tolerance – they must produce bit-identical results. Arithmetic kernels (GEMV, GEMM, norms) use tolerances of 1e-2 to 5e-2 to account for F16 precision loss and different accumulation orders.
Why Different Tolerances?
The tolerance differences come from the nature of each operation:
Precision Loss Sources
======================
1. F16 storage: 10-bit mantissa = ~3 decimal digits
- Every F32->F16 conversion loses precision
- Multiply two F16 values: error compounds
2. Accumulation order:
- CPU: sequential sum (deterministic)
- GPU: SIMD reduction + threadgroup reduction (different order)
- Floating-point addition is not associative:
(a + b) + c != a + (b + c) in general
3. Transcendental functions:
- CPU: libm's exp(), sqrt(), etc. (double precision internally)
- GPU: Metal's exp(), rsqrt() (hardware F32, not F64)
- Metal's "fast math" may use approximations
Result: GEMV with K=4096 accumulates 4096 products.
Each product has F16 quantization error. The sum has
accumulated error proportional to sqrt(K) * epsilon_f16.
For K=4096: sqrt(4096) * 0.001 ~ 0.03, which explains
why we use atol=0.05 for GEMV tests.
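Point 2 above, the non-associativity of floating-point addition, is worth seeing once with real numbers. This tiny example (not from the repository) uses `volatile` to keep the compiler from constant-folding the arithmetic:

```cpp
// Float addition is not associative: a sequential CPU sum and a tree-shaped
// GPU reduction can legitimately produce different results.
static float sum_left() {
    volatile float a = 1e8f, b = -1e8f, c = 1.0f;
    return (a + b) + c;   // (1e8 - 1e8) + 1 = 1
}
static float sum_right() {
    volatile float a = 1e8f, b = -1e8f, c = 1.0f;
    return a + (b + c);   // the 1 is absorbed: -1e8 + 1 rounds back to -1e8, so result is 0
}
```

Neither answer is "wrong"; both are correctly rounded evaluations of a different association. This is exactly why bit-exact CPU/GPU comparison is hopeless for reductions and why the tolerance model exists.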
Test Case Design
Each kernel test exercises multiple configurations to catch edge cases:
Dimension Sweep
===============
Tests run across multiple dimension values:
dims[] = {64, 128, 256, 512, 896, 1024, 2048, 2560, 4096}
Why these values?
64 = minimum useful size
128 = one SIMD group width * 4
256 = common small-model dim
512 = common head_dim * n_heads for small models
896 = Qwen3-0.6B's dim (non-power-of-2!)
1024 = threadgroup max, common dim
2048 = common mid-size dim
2560 = another non-power-of-2 (tests stride logic)
4096 = large model dim (LLaMA 7B/8B)
Non-power-of-2 values (896, 2560) are crucial because they
exercise the remainder handling in strided loops:
for (uint i = tid; i < dim; i += tg_size)
If dim is not a multiple of tg_size, the last iteration
processes fewer elements. Bugs here are common.
Each kernel test also includes edge case tests:
- Zero input: Ensures no NaN propagation (especially for norms where division by near-zero can produce infinity/NaN)
- Constant input: Tests that reduction operations work correctly when all values are identical (catches bugs where partial SIMD lanes have garbage)
- Large values: Tests for overflow in F16 (max representable: ~65504)
- Multi-row: Tests that row indexing is correct (the kernel processes multiple independent rows in one dispatch)
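The zero-input case deserves one concrete illustration (a numeric fact, not repository code): the eps term inside the square root is the only thing standing between an all-zero row and NaN output.

```cpp
#include <cmath>

// The RMS normalizer as used in cpu_rmsnorm: 1 / sqrt(mean_sq + eps).
static float rms_factor(float mean_sq, float eps) {
    return 1.0f / std::sqrt(mean_sq + eps);
}
// With eps = 0 and an all-zero row, rms_factor(0, 0) is +inf, and the
// subsequent 0 * inf multiply produces NaN, which then poisons every
// downstream layer. With eps = 1e-5, the factor is finite (~316) and the
// output stays 0 * 316 = 0, as the zero-input edge tests require.
```

This is why the edge-case tests feed all-zero rows through every norm kernel rather than trusting that "it obviously works".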
Writing a New Kernel Test
Let us walk through writing a test for a hypothetical new kernel. Suppose you
have added a vector_scale_f16 kernel that multiplies every element by a
scalar:
Step 1: Create the Test File
tests/kernels/common/test_vector_scale.cpp
Step 2: Write the Test
/// GPU kernel test: vector_scale_f16
#include "test_kernel_helpers.h"
#include "metal/kernels/ShaderTypes.h"

// CPU reference: trivially correct
static std::vector<float> cpu_vector_scale(
    const std::vector<float>& x, float scale)
{
    std::vector<float> out(x.size());
    for (size_t i = 0; i < x.size(); i++)
        out[i] = x[i] * scale;
    return out;
}

int main() {
    printf("=== Kernel Test: Vector Scale ===\n\n");
    INIT_DEVICE();
    Pipeline pso = dev->get_pipeline("vector_scale_f16");

    // Dimension sweep
    int counts[] = {32, 64, 256, 1000, 1024, 4096, 10000};
    float scales[] = {0.5f, 1.0f, 2.0f, -1.0f, 0.001f};
    for (int count : counts) {
        for (float scale : scales) {
            auto x_data = det_floats(count, 42 + count);
            auto expected = cpu_vector_scale(x_data, scale);

            Buffer x_buf = make_buf_f16(*dev, x_data);
            Buffer o_buf = dev->allocate(count * 2);

            // Assume params struct: { uint32_t count; float scale; ... }
            struct { uint32_t count; float scale; uint32_t _p0, _p1; }
                params = {(uint32_t)count, scale, 0, 0};

            dev->begin_encoding();
            dev->set_pipeline(pso);
            dev->set_buffer(x_buf, 0, 0);
            dev->set_buffer(o_buf, 0, 1);
            dev->set_bytes(&params, sizeof(params), 2);
            dev->dispatch_threads(Dim3(count), Dim3(256));
            dev->end_encoding_sync();

            char label[64];
            snprintf(label, sizeof(label),
                     "count=%d, scale=%.3f", count, scale);
            CHECK(assert_close(
                read_f16(o_buf, count), expected,
                1e-3f, 1e-3f, 3, label
            ), label);

            dev->free_buffer(x_buf);
            dev->free_buffer(o_buf);
        }
    }
    printf("  dimension x scale sweep done\n");

    // Edge case: empty (count=0)
    // Edge case: single element
    // Edge case: scale=0 (should produce all zeros)

    PRINT_RESULTS();
}
Step 3: Add to CMakeLists.txt
Add the test to the KERNEL_TESTS list:
set(KERNEL_TESTS
    # ... existing tests ...
    tests/kernels/common/test_vector_scale.cpp
)
The foreach loop in CMakeLists.txt automatically creates the executable:
foreach(test_src ${KERNEL_TESTS})
    get_filename_component(test_name ${test_src} NAME_WE)
    set(target_name "akunu_kernel_${test_name}")
    add_executable(${target_name} ${test_src})
    target_link_libraries(${target_name} PRIVATE akunu_engine)
    target_include_directories(${target_name} PRIVATE
        ${CMAKE_CURRENT_SOURCE_DIR}/tests)
    set_source_files_properties(${test_src} PROPERTIES LANGUAGE OBJCXX)
endforeach()
This produces build/akunu_kernel_test_vector_scale.
Step 4: Build and Run
make && build/akunu_kernel_test_vector_scale
Expected output:
=== Kernel Test: Vector Scale ===
GPU: Apple M4 Pro
dimension x scale sweep done
=== Results: 35 passed, 0 failed ===
Unit Tests (Non-Kernel)
The non-kernel tests verify CPU-side logic. They follow a simpler pattern since there is no GPU involvement.
test_tokenizer_internal.cpp
Tests BPE tokenizer internals without needing a model file. Validates byte-pair encoding, special token handling, and edge cases like empty strings and UTF-8 multi-byte characters.
test_grammar.cpp
Tests GBNF grammar parsing and bitmask generation. Creates a grammar from a GBNF string, feeds it token sequences, and verifies that the acceptance bitmask correctly allows/rejects tokens at each step.
test_table_builder.cpp
Tests dispatch table construction. Creates a mock model configuration and
verifies that build_dispatch_table produces the expected sequence of commands
with correct parameters, buffer bindings, and patch types.
test_kv_cache.cpp
Tests KV cache operations:
- Write: new key/value vectors are written to correct positions
- Advance: position counter increments correctly
- Shift: sliding window eviction moves data correctly
- Edge cases: write at max_seq_len, shift by more than current length
test_server.cpp
Tests HTTP request parsing and OpenAI API compatibility. No network involved – it feeds raw HTTP strings to the parser and checks the extracted fields.
Integration Tests
test_e2e.cpp
The end-to-end test loads a real model, tokenizes a prompt, runs prefill, generates tokens, and verifies that:
- The model loads without error
- Tokenization produces expected token count
- Prefill returns a valid first token
- Generation produces coherent text (checked by greedy decoding a known prompt)
- EOS token stops generation
test_inference.cpp
More detailed inference tests:
- Forward pass produces non-NaN outputs
- Greedy decoding is deterministic (same prompt always produces same output)
- KV cache reuse works (continuing generation from a saved position)
- Different model architectures produce valid output
test_long_context.cpp
Stress-tests the context window by filling the KV cache to capacity and verifying that:
- The model does not crash or produce NaN
- KV cache shift correctly evicts old entries
- Output quality does not degrade catastrophically after shift
Running Tests in CI
For continuous integration, the recommended test matrix is:
CI Pipeline
===========
Stage 1: Build (parallel)
+------------------------+
| make shaders | <-- Compile Metal shaders
| make engine | <-- Compile C++ code
+------------------------+
Stage 2: Fast Tests (no model, < 10s)
+------------------------+
| make test-unit | <-- Tokenizer, grammar, server
| kernel tests (all 16) | <-- GPU correctness
+------------------------+
Stage 3: Inference Tests (needs model, < 60s)
+------------------------+
| make test-infer | <-- E2E generation test
+------------------------+
Kernel tests require Apple Silicon hardware with Metal support. They cannot run
on Intel Macs or Linux CI runners. For CI, you need a macOS runner with Apple
Silicon (GitHub Actions has macos-14 runners with M1).
Test Philosophy
A few principles that guide akunu’s testing approach:
1. Every GPU kernel has a CPU twin. No exceptions. If the math is complicated enough to run on a GPU, it is complicated enough to verify on a CPU.
2. Deterministic test data. All test data comes from det_floats() with
fixed seeds. No rand(), no time(NULL). Tests must be reproducible.
3. Tolerance, not equality. Floating-point GPU computation will never match CPU computation bit-for-bit. Define appropriate tolerances and document why.
4. Sweep dimensions. Never test just one dimension. Bugs hide in edge cases: non-power-of-2 sizes, sizes smaller than threadgroup width, sizes larger than threadgroup memory.
5. Test the edges. Zero input (division by zero in norms), maximum values (F16 overflow), single-element (degenerate reduction), empty input (boundary conditions).
6. Fail loudly. When a comparison fails, print the index, actual value, expected value, difference, and tolerance. This turns a “test failed” into a diagnosis.
Summary
Akunu’s testing infrastructure rests on three pillars: CPU reference implementations that are simple enough to be obviously correct, GPU test harnesses that handle F16 conversion and buffer management, and a tolerance model that accounts for the inherent precision differences between CPU and GPU computation.
The 16 kernel tests cover every operation in the forward pass. The unit tests verify CPU-side logic. The integration tests validate the full inference pipeline with real models. Together, they provide confidence that changes to the codebase do not break correctness.
In the next chapter, we will use this testing infrastructure as we walk through adding a completely new Metal kernel from scratch.