Performance Profiling and Benchmarking
If you have made it this far in the book, you have a reasonably complete mental model of how akunu turns a pile of quantized weights and Metal shaders into streaming text on Apple Silicon. That is great, but mental models do not ship performance. At some point you need to measure things, and the gap between “I think the attention kernel is the bottleneck” and “the attention kernel consumes 38% of GPU time at 14.2 GB/s effective bandwidth on an M2 Pro” is the gap between guessing and engineering.
This chapter covers the full profiling stack: Apple’s own GPU tools, akunu’s built-in CLI profilers, the key metrics you should care about, roofline analysis for memory-bound inference, and how to interpret the numbers in context by comparing against llama.cpp and MLX.
The Three Metrics That Matter
Before we reach for any tool, let us agree on what we are measuring. LLM inference has three headline numbers that users and developers care about:
| Metric | Definition | Why It Matters |
|---|---|---|
| Prefill tok/s | Prompt tokens processed per second | Determines how fast you can ingest a 4K context window. Governs perceived responsiveness for the first token. |
| Decode tok/s | Generated tokens per second | The sustained throughput the user sees while text is streaming. This is the number people compare across frameworks. |
| TTFT (Time To First Token) | Wall-clock time from prompt submission to first generated token | The most perceptually important metric. Users notice latency more than throughput. TTFT = prefill time + one decode step. |
A fourth metric, peak memory, also matters on Apple Silicon because you are sharing a unified memory pool with the OS, the window compositor, and whatever else the user has open. Running out of memory does not just crash your process; it can trigger aggressive swapping that destroys system responsiveness.
Let us now walk through the tools that let you measure all of this.
akunu_bench: The llama-bench Equivalent
The simplest way to get prefill and decode numbers is akunu_bench, a C++ tool that replicates the methodology of llama-bench from the llama.cpp project. Here is the actual source signature from tools/akunu_bench.cpp:
Usage: akunu_bench <model> [-p N] [-n N] [-r N]
The flags are:
| Flag | Default | Meaning |
|---|---|---|
| `-p N` / `--pp N` | 512 | Prompt length for prefill test |
| `-n N` / `--tg N` | 128 | Number of tokens for the decode (text generation) test |
| `-r N` / `--reps N` | 5 | Repetitions per test (for statistical stability) |
The tool works by creating synthetic prompts filled with the BOS token (token ID 1). This is deliberate – you want a reproducible input that does not depend on tokenizer behavior or prompt content. Here is what happens internally:
- **Prefill test**: Fill a vector of `pp` tokens with BOS. Call `akunu_prefill()` and time it with `std::chrono::high_resolution_clock`. Repeat `reps` times. Report mean and standard deviation.
- **Decode test**: Prefill a single BOS token, then call `akunu_chain_decode()` for `tg` tokens in a single GPU submission. Again, repeat and report statistics.
The output matches the llama-bench markdown table format so you can paste results directly into GitHub issues:
| model | size | test | t/s |
| --- | ---: | ---: | ---: |
| Qwen3-4B-Q4_0.gguf | 2341 MiB | pp512 | 1842.31 +/- 12.40 |
| Qwen3-4B-Q4_0.gguf | 2341 MiB | tg128 | 87.42 +/- 0.83 |
A few things to note about the methodology:
- The decode test uses `akunu_chain_decode()`, which batches all `tg` tokens into a single GPU command buffer submission. This measures the true GPU-limited throughput, not the overhead of individual `akunu_decode_step()` round-trips. If you were to measure decode by calling `decode_step` in a loop, you would be measuring CPU-GPU synchronization overhead as much as actual compute.[^1]
- Each repetition calls `akunu_reset()` to clear the KV cache, ensuring independent measurements. Without the reset, later iterations would operate on a larger KV cache, which changes the attention kernel's memory access pattern.
- The standard deviation across repetitions is typically very small (under 2%) on a quiet system. If you see high variance, check for thermal throttling or background processes competing for GPU resources.
akunu_benchmark: End-to-End with Real Prompts
While akunu_bench gives you clean synthetic numbers, akunu_benchmark exercises the full akunu_generate() path with real prompts of varying lengths:
Usage: akunu_benchmark <model>
This tool runs three prompts (short, medium, long), measures AkunuGenerationStats for each, and reports:
| Column | Meaning |
|---|---|
| Prompt | Length category |
| Tokens | Actual token count after encoding |
| Prefill (t/s) | Prefill throughput |
| Decode (t/s) | Decode throughput |
| First-tok(ms) | Time to first token (the TTFT metric) |
| Prefill(ms) | Raw prefill time |
| Total(s) | Wall-clock total |
After the prompt tests, it also runs a standalone chain decode measurement (128 tokens, greedy) to give you the raw GPU-limited decode throughput independent of sampling overhead.
The key insight from this tool is how prefill scales with prompt length. On Apple Silicon, prefill is a GEMM (matrix-matrix multiply) workload, and the GPU’s utilization increases with larger batch sizes. You will typically see:
- Short prompts (1-10 tokens): Low prefill tok/s because the GEMMs have tiny M dimension and cannot saturate the GPU’s compute units
- Medium prompts (50-200 tokens): Prefill tok/s climbs rapidly as GEMM occupancy improves
- Long prompts (500+ tokens): Prefill tok/s plateaus near the compute-bound peak
akunu_profile: Per-Kernel GPU Timing
This is the real workhorse for optimization. Where akunu_bench tells you how fast, akunu_profile tells you where the time goes.
Usage: akunu_profile <model> [--tokens N]
Here is what happens under the hood, based on the actual source in tools/akunu_profile.cpp:
- Load the model and prefill a single BOS token
- Call `akunu_profile_decode_step()`, which runs each dispatch command in its own `MTLCommandBuffer`, enabling accurate per-kernel GPU timing via Metal's built-in command buffer timing
- Repeat for `N` tokens (default 5), accumulating timing data
- Sort kernels by total GPU time and print a breakdown table
The output looks something like this (simplified):
Per-Kernel GPU Timing Breakdown
==========================================================================================
Kernel Dispatches Total (ms) Avg (ms) % GPU
------------------------------------------------------------------------------------------
L0 GEMV attn_qkv Q4_0 5 0.412 0.082 18.2%
L0 GEMV ffn_down Q4_0 5 0.318 0.064 14.1%
L0 GEMV ffn_gate_up Q4_0 5 0.304 0.061 13.4%
L0 Attention 5 0.201 0.040 8.9%
L0 GEMV attn_output Q4_0 5 0.156 0.031 6.9%
...
There is an important caveat: profiled decode is much slower than normal decode. The profiler wraps each kernel dispatch in its own command buffer to get accurate GPU timing. In normal operation, akunu batches the entire forward pass (embedding + N layers + output norm + logit projection + argmax) into a single command buffer, and the chain decoder batches multiple tokens into one submission. Profiled mode breaks this batching completely, so the absolute numbers are not representative of production throughput – they are only useful for relative comparisons between kernels.[^2]
Reading the Profiler Output
The typical decode step for a LLaMA-like model with n_layers transformer layers contains:
+------------------+
| Embedding Lookup | 1 kernel
+------------------+
|
v
+------------------+
| Layer 0 | ~8-12 kernels per layer
| Attention Norm |
| QKV Projection | (GEMV or fused GEMV+RoPE+KV-write)
| RoPE + KV Write|
| Attention |
| Output Proj |
| Residual Add |
| FFN Norm |
| Gate+Up Proj | (possibly fused into single GEMV)
| Activation | (SiLU*gate or GELU*gate)
| Down Proj |
| Residual Add |
+------------------+
|
v
| Layer 1..N-1 | (repeat)
|
v
+------------------+
| Output Norm | 1 kernel
+------------------+
|
v
+------------------+
| Logit Projection | 1 GEMV (dim -> vocab_size)
+------------------+
|
v
+------------------+
| Argmax | 1 kernel
+------------------+
When you look at the profiler output, the GEMV (matrix-vector multiply) kernels dominate. For a Q4_0 model, the three big GEMVs per layer are:
- **QKV projection**: Multiplies the hidden state by the Q, K, and V weight matrices. For a model with `n_heads=32, n_kv_heads=8, head_dim=128`, this projects `dim=4096` to `q_dim + 2*kv_dim = 4096 + 2*1024 = 6144` elements.
- **FFN gate+up**: Projects `dim` to `2*ffn_dim`. For LLaMA-style models with SwiGLU, `ffn_dim` is typically `~2.7*dim`, so this is the largest single GEMV.
- **FFN down**: Projects `ffn_dim` back to `dim`.
The attention kernel itself is often not the biggest time consumer during decode (single token, long KV cache), because it is a relatively small operation: each head does a dot product of the query against kv_seq_len keys, then a weighted sum of values. The total work scales with n_heads * kv_seq_len * head_dim, which for moderate context lengths is much less than the GEMV work.
Xcode GPU Profiler (Instruments)
For the deepest level of insight, Apple provides GPU profiling through Instruments. There are two relevant instruments:
Metal System Trace
Metal System Trace shows the timeline of GPU command buffer submissions, encoding, and execution. This is the tool to use when you suspect CPU-GPU synchronization issues or want to understand the relationship between akunu’s chain decode submissions and actual GPU execution.
To capture a trace:
- Build akunu with debug symbols (CMake `RelWithDebInfo` or `Debug`)
- Open Instruments and choose the "Metal System Trace" template
- Select your akunu binary as the target
- Record for a few seconds while running a generation
The trace shows:
| Track | What You See |
|---|---|
| GPU Timeline | Individual compute dispatches on the GPU hardware. Each dispatch shows its duration, pipeline state object (PSO) name, and threadgroup configuration. |
| Command Buffer Track | When each MTLCommandBuffer was committed, scheduled, and completed. Gaps between command buffers indicate CPU-side stalls. |
| Encoder Track | The compute command encoder’s encode phase. If encoding takes longer than GPU execution, you are CPU-bound. |
The key thing to look for in the Metal System Trace is GPU idle gaps. In a well-tuned chain decode:
CPU: [encode CB1] [encode CB2] [encode CB3]
GPU: [execute CB1][execute CB2] [execute CB3]
^-- no gap here: GPU stays busy
If you see gaps where the GPU is idle between command buffers, the CPU is not encoding fast enough. Akunu’s chain decode design specifically addresses this by encoding chain_decode_chunk tokens (64-128, depending on chip) into a single command buffer, ensuring the GPU has enough work to stay saturated.
GPU Counters
Instruments also provides GPU hardware counters (on supported devices) that show:
| Counter Group | Key Metrics |
|---|---|
| Occupancy | How many threadgroups are resident on the GPU simultaneously. Low occupancy means the GPU has idle ALUs. |
| Memory | Read/write bandwidth, cache hit rates. Critical for understanding whether your GEMV kernels are memory-bound (they almost always are). |
| ALU | Arithmetic utilization. For quantized GEMV, this is typically low because you are waiting on memory, not compute. |
| Shader | Per-pipeline-state breakdown. Shows which PSOs consume the most GPU time. |
Roofline Analysis for Apple Silicon
The roofline model is the single most useful framework for understanding LLM inference performance on Apple Silicon.[^3] The core idea is simple: every computation has an arithmetic intensity (operations per byte of memory accessed), and the hardware has a memory bandwidth ceiling and a compute ceiling. Your kernel's throughput is limited by whichever ceiling it hits first.
Apple Silicon Memory Bandwidth
| Chip | Memory BW (GB/s) | GPU FP16 TFLOPS | Roofline Knee (ops/byte) |
|---|---|---|---|
| M1 | 68.25 | 2.6 | 38 |
| M1 Pro | 200 | 5.2 | 26 |
| M1 Max | 400 | 10.4 | 26 |
| M2 | 100 | 3.6 | 36 |
| M2 Pro | 200 | 7.0 | 35 |
| M2 Max | 400 | 13.6 | 34 |
| M3 | 100 | 4.1 | 41 |
| M3 Pro | 150 | 7.0 | 47 |
| M3 Max | 400 | 14.2 | 36 |
| M4 | 120 | 4.3 | 36 |
| M4 Pro | 273 | 9.2 | 34 |
| M4 Max | 546 | 18.0 | 33 |
The “roofline knee” is the arithmetic intensity where you transition from memory-bound to compute-bound. For LLM decode, the arithmetic intensity is almost always well below this knee.
Why Decode Is Memory-Bound
During single-token decode, each GEMV reads the entire weight matrix and multiplies it by a single vector. For a Q4_0 weight matrix of shape [N, K]:
- Bytes read: `N * K / 2` bytes (4 bits per weight, packed) + `N * K / 32 * 2` bytes (one FP16 scale per block of 32)
- FLOPs: `2 * N * K` (one multiply-accumulate per weight)
- Arithmetic intensity: roughly `2 * N * K / (N * K * 0.5625)` = ~3.6 ops/byte
That is far below the roofline knee of 26-47 ops/byte. The GEMV is firmly memory-bound. This means:
Decode throughput is determined almost entirely by memory bandwidth.
The theoretical maximum decode tok/s for a model of total weight size W bytes on a chip with bandwidth B bytes/s is:
max_decode_tok_s = B / W
For a 4B parameter Q4_0 model (~2.3 GB weights):
| Chip | BW (GB/s) | Theoretical Max (tok/s) |
|---|---|---|
| M1 | 68.25 | 29.7 |
| M2 Pro | 200 | 87.0 |
| M3 Max | 400 | 174.0 |
| M4 Max | 546 | 237.4 |
In practice, akunu achieves 70-85% of theoretical bandwidth utilization for decode, which is quite good for a real-world system with cache management, RoPE computation, attention, and norm overhead on top of the raw GEMVs.
Why Prefill Is Compute-Bound (for Large Batches)
During prefill, the projections become GEMMs (matrix-matrix multiply) because you are processing seq_len tokens simultaneously. The arithmetic intensity scales with the batch dimension:
- Arithmetic intensity: ~`2 * M` ops/byte (where `M` = batch/seq_len)
For M >= 20 or so, you cross the roofline knee and become compute-bound. This is why prefill throughput is typically 10-50x higher than decode throughput – you are actually using the GPU’s ALUs instead of just waiting on memory.
Bandwidth Utilization: The Real Performance Metric
Raw tok/s numbers are useful for user-facing comparisons, but for engineering purposes, bandwidth utilization is the metric that tells you how close you are to optimal:
bandwidth_utilization = (model_weight_bytes / decode_time_per_token) / peak_memory_bandwidth
Here is how to compute this from akunu_bench output:
- Get model weight bytes from `akunu_model_memory()` (reported as "size" in the bench output)
- Compute decode time per token: `1.0 / decode_tok_s`
- Divide the resulting effective bandwidth (weight bytes / decode time per token) by peak bandwidth
For example, if akunu_bench reports 85 tok/s on a 2341 MiB model on M2 Pro (200 GB/s):
effective_bw = 2341 * 1024 * 1024 / (1/85) = 2341 * 1.0485e6 * 85 = 208.7 GB/s
utilization = 208.7 / 200 = 104.3%
Wait, over 100%? This happens because the System Level Cache (SLC) provides additional effective bandwidth for data that fits or partially fits in the cache hierarchy. The SLC on Apple Silicon can add 20-40% of effective bandwidth for workloads with good temporal locality.[^4] akunu's chain decode exploits this: when processing 64-128 tokens sequentially through each layer, the weight data loaded for token N is still in cache for token N+1.
Identifying Common Bottlenecks
Here is a diagnostic flowchart based on what the profiling tools reveal:
Bottleneck: Low Decode tok/s
Is bandwidth utilization > 70%?
├── YES: You are near optimal for this chip/model combo.
│ Only way to go faster: smaller model or faster chip.
│
└── NO: Something is leaving bandwidth on the table.
│
├── Are there GPU idle gaps in Metal System Trace?
│ ├── YES: CPU encoding is too slow.
│ │ Check: is chain_decode_chunk large enough?
│ │ Check: are you using profiled decode by mistake?
│ │
│ └── NO: Kernels are suboptimal.
│ Use akunu_profile to find the slowest kernel.
│ Common culprits:
│ - Attention kernel with very long KV cache
│ - Logit projection (dim -> vocab_size GEMV, large N)
│ - Unoptimized dtype (Q5_K, Q3_K lack wide variants)
│
└── Is memory usage near system limits?
├── YES: Memory pressure causes swapping. Reduce max_context
│ or use a smaller quantization.
└── NO: Check thermal state (sysctl machdep.xcpm.cpu_thermal_level)
Bottleneck: High TTFT
TTFT is prefill time plus one decode step. If TTFT is high:
Is the prompt very long (>1000 tokens)?
├── YES: Prefill is doing large GEMMs. Check:
│ - Is prefill chunked? (akunu chunks at max_prefill_chunk = 4096)
│ - Are GEMM kernels using simd_matrix operations?
│ - For Q4_0/Q8_0, are the GEMM kernels the quantized variants?
│
└── NO: Short prompt but still slow?
Check if model loading is included in the measurement.
akunu_load_model() compiles PSOs and builds the dispatch table
on first call. Subsequent calls reuse cached state.
Bottleneck: Attention Dominating at Long Context
As context grows, the attention kernel’s cost scales linearly with KV cache length. At some point it overtakes the GEMVs:
| Context Length | Attention % of Decode (typical 4B model) |
|---|---|
| 128 | 3-5% |
| 512 | 8-12% |
| 2048 | 20-30% |
| 4096 | 35-50% |
If attention is your bottleneck, the options are:
- Reduce `max_context` to avoid over-allocating the KV cache
- Use a model with GQA (fewer KV heads means less memory traffic in attention)
- Wait for akunu to implement paged attention or sliding-window eviction
Comparing Against llama.cpp and MLX
Benchmarking against other frameworks is valuable both for validating your measurements and for identifying optimization opportunities. Here is how to set up fair comparisons:
llama.cpp Comparison
Use llama-bench with matching parameters:
# llama.cpp
./llama-bench -m model.gguf -p 512 -n 128 -r 5
# akunu
./akunu_bench model.gguf -p 512 -n 128 -r 5
Key differences to account for:
| Factor | llama.cpp | akunu |
|---|---|---|
| Backend | Metal (via ggml-metal) | Metal (direct MSL) |
| Decode strategy | Single token per GPU submission | Chain decode (64-128 tokens per submission) |
| KV cache layout | Per-layer, row-major | Per-layer, head-major [n_kv_heads, max_seq, head_dim] |
| Weight fusion | None | Gate+Up fused on Pro+ chips (SLC > 16MB) |
| GEMV kernels | ggml generic + Metal shaders | Custom per-dtype Metal shaders with chip-specific tuning |
In practice, akunu's decode throughput is typically 1.1-1.5x llama.cpp's on the same hardware, primarily due to chain decode reducing GPU idle time and chip-specific GEMV tuning.[^5]
MLX Comparison
MLX (Apple’s machine learning framework) uses a different approach:
# MLX benchmark (sketch; prompt and max_tokens are illustrative)
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-4B-4bit")
t0 = time.perf_counter()
text = generate(model, tokenizer, prompt="Hello", max_tokens=128)
print(f"generated in {time.perf_counter() - t0:.2f} s")
Key differences:
| Factor | MLX | akunu |
|---|---|---|
| Language | Python + C++ + Metal | C++ + Metal |
| Weight format | SafeTensors with MLX quantization | GGUF or MLX SafeTensors |
| Graph compilation | JIT traced graphs | Pre-compiled dispatch table |
| Quantization | Group quantized (group_size=64) | GGUF block quant or MLX group quant |
| Overhead | Python dispatch + JIT | Near-zero (POD struct iteration) |
MLX’s Python overhead is minimal for long generations but can be significant for TTFT on short prompts. akunu’s pre-compiled dispatch table avoids any per-token overhead beyond the raw GPU dispatch cost.
What Fair Comparison Looks Like
For a fair comparison, ensure:
- Same model weights – or at least the same effective bits-per-weight. Q4_0 GGUF (4.5 effective bpw) is roughly comparable to MLX 4-bit with group_size=64.
- Same prompt and generation length – especially for prefill comparison, since prefill scales nonlinearly with prompt length.
- Same sampling – use greedy (temperature=0) to eliminate sampling variance.
- Warm start – run at least one throwaway generation before timing to ensure Metal shader compilation is complete and caches are warm.
- Same hardware – obvious, but worth stating. The M1 Pro and M2 Pro have the same 200 GB/s bandwidth but different GPU architectures, which affects compute-bound workloads like prefill.
Profiling Checklist
When you sit down to profile an akunu deployment, here is the sequence:
- **Baseline**: Run `akunu_bench` to establish prefill tok/s, decode tok/s, and TTFT
- **Bandwidth check**: Compute bandwidth utilization from the bench numbers. If it is above 70%, you are in good shape.
- **Kernel breakdown**: Run `akunu_profile` to identify which kernels dominate. The top 3-5 kernels by GPU time are your optimization targets.
- **System-level**: If you suspect CPU-GPU sync issues, use Metal System Trace in Instruments to check for GPU idle gaps.
- **Compare**: Run the same model on llama.cpp and/or MLX to validate your numbers and identify framework-level differences.
- **Thermal**: For sustained workloads, monitor thermal throttling. Apple Silicon aggressively throttles GPU frequency under thermal pressure, which can reduce throughput by 20-40% on fanless MacBooks.
Advanced: Custom Profiling with the C API
The akunu_profile_decode_step() C API function is available for integration into your own profiling harness:
// Allocate timing buffer: n_layers + 4 entries
// [embedding, layer0, layer1, ..., layerN-1, output norm, logit, argmax]
float timing[512];
int n = akunu_profile_decode_step(model, token_id, position, timing, 512);
for (int i = 0; i < n; i++) {
printf("%s: %.3f ms\n", akunu_profile_label(model, i), timing[i]);
}
Each entry corresponds to a dispatch command in the DispatchTable. The labels are stored in a parallel DispatchLabel array (cold data, separate from the hot command array) so that profiling metadata does not pollute the cache lines used by the decode inner loop.
The profiling works by running each dispatch command in its own MTLCommandBuffer and reading back GPUStartTime / GPUEndTime. This gives microsecond-accurate per-kernel GPU timing, but at the cost of massive overhead from the per-kernel command buffer synchronization. You would never use this in production – it is purely a diagnostic tool.
Summary
| Tool | When to Use | Output |
|---|---|---|
| `akunu_bench` | Quick throughput comparison | Prefill tok/s, decode tok/s (markdown table) |
| `akunu_benchmark` | End-to-end with real prompts | TTFT, prefill/decode speed at multiple prompt lengths |
| `akunu_profile` | Identifying kernel bottlenecks | Per-kernel GPU time breakdown, sorted by cost |
| Metal System Trace | CPU-GPU sync analysis | Timeline of command buffer submissions and GPU execution |
| GPU Counters | Hardware utilization | Occupancy, bandwidth, ALU utilization |
| Roofline analysis | Understanding theoretical limits | Whether you are memory-bound or compute-bound |
The fundamental insight for Apple Silicon LLM inference is that decode is memory-bound and will remain so for the foreseeable future. The job of the profiler is not to find ways to make the GPU compute faster – it is to find the places where you are wasting bandwidth or leaving the GPU idle. Chain decode, weight fusion, and chip-specific GEMV tuning are all strategies that akunu uses to close the gap between measured and theoretical bandwidth, and the profiling tools described in this chapter are how you verify that those strategies are working.
[^1]: On Apple Silicon, each `MTLCommandBuffer` commit-and-wait cycle costs approximately 30-80 microseconds of CPU overhead. At 80+ tok/s, a 50 µs overhead per token adds up to 4 ms per second – roughly 0.4% throughput loss from the synchronization calls themselves. The larger cost is the GPU sitting idle during each round-trip while the CPU prepares the next submission.

[^2]: Metal's GPU timing (`GPUStartTime` / `GPUEndTime` on `MTLCommandBuffer`) measures the time the command buffer was executing on the GPU. For a single kernel this is accurate, but for a command buffer containing hundreds of dispatches you only get the total. Apple's GPU Timeline in Instruments provides per-dispatch timing, but requires running inside Xcode.

[^3]: Williams, S., Waterman, A., & Patterson, D. (2009). "Roofline: An Insightful Visual Performance Model for Multicore Architectures." Communications of the ACM, 52(4), 65-76. https://doi.org/10.1145/1498765.1498785

[^4]: Actual SLC sizes estimated in akunu's `ChipConfig`: 8 MB (M1/M2/M3 base), 16 MB (M4 base), 24 MB (M1/M2/M3 Pro), 32 MB (M4 Pro), 48 MB (Max), 96 MB (Ultra). These are not published by Apple but inferred from performance measurements and die analysis.

[^5]: This comparison is for the Metal backend specifically. llama.cpp supports many backends (CUDA, Vulkan, CPU) and architectures; akunu targets Apple Silicon exclusively, which allows tighter optimization.