Performance Profiling and Benchmarking
If you have made it this far in the book, you have a reasonably complete mental model of how akunu turns a pile of quantized weights and Metal shaders into streaming text on Apple Silicon. That is great, but mental models do not ship performance. At some point you need to measure things, and the gap between “I think the attention kernel is the bottleneck” and “the attention kernel consumes 38% of GPU time at 14.2 GB/s effective bandwidth on an M2 Pro” is the gap between guessing and engineering.
This chapter covers the full profiling stack: Apple’s own GPU tools, akunu’s built-in CLI profilers, the key metrics you should care about, roofline analysis for memory-bound inference, and how to interpret the numbers in context by comparing against llama.cpp and MLX.
The Three Metrics That Matter
Before we reach for any tool, let us agree on what we are measuring. LLM inference has three headline numbers that users and developers care about:
| Metric | Definition | Why It Matters |
|---|---|---|
| Prefill tok/s | Prompt tokens processed per second | Determines how fast you can ingest a 4K context window. Governs perceived responsiveness for the first token. |
| Decode tok/s | Generated tokens per second | The sustained throughput the user sees while text is streaming. This is the number people compare across frameworks. |
| TTFT (Time To First Token) | Wall-clock time from prompt submission to first generated token | The most perceptually important metric. Users notice latency more than throughput. TTFT = prefill time + one decode step. |
A fourth metric, peak memory, also matters on Apple Silicon because you are sharing a unified memory pool with the OS, the window compositor, and whatever else the user has open. Running out of memory does not just crash your process; it can trigger aggressive swapping that destroys system responsiveness.
Let us now walk through the tools that let you measure all of this.
akunu_bench: The llama-bench Equivalent
The simplest way to get prefill and decode numbers is akunu_bench, a C++ tool that replicates the methodology of llama-bench from the llama.cpp project. Here is the actual source signature from tools/akunu_bench.cpp:
Usage: akunu_bench <model> [-p N] [-n N] [-r N]
The flags are:
| Flag | Default | Meaning |
|---|---|---|
| `-p N` / `--pp N` | 512 | Prompt length for prefill test |
| `-n N` / `--tg N` | 128 | Number of tokens for the decode (text generation) test |
| `-r N` / `--reps N` | 5 | Repetitions per test (for statistical stability) |
The tool works by creating synthetic prompts filled with the BOS token (token ID 1). This is deliberate – you want a reproducible input that does not depend on tokenizer behavior or prompt content. Here is what happens internally:
- **Prefill test**: Fill a vector of `pp` tokens with BOS. Call `akunu_prefill()` and time it with `std::chrono::high_resolution_clock`. Repeat `reps` times. Report mean and standard deviation.
- **Decode test**: Prefill a single BOS token, then call `akunu_chain_decode()` for `tg` tokens in a single GPU submission. Again, repeat and report statistics.
The output matches the llama-bench markdown table format so you can paste results directly into GitHub issues:
| model | size | test | t/s |
| --- | ---: | ---: | ---: |
| Qwen3-4B-Q4_0.gguf | 2341 MiB | pp512 | 1842.31 +/- 12.40 |
| Qwen3-4B-Q4_0.gguf | 2341 MiB | tg128 | 87.42 +/- 0.83 |
A few things to note about the methodology:
- The decode test uses `akunu_chain_decode()`, which batches all `tg` tokens into a single GPU command buffer submission. This measures the true GPU-limited throughput, not the overhead of individual `akunu_decode_step()` round-trips. If you were to measure decode by calling `decode_step` in a loop, you would be measuring CPU-GPU synchronization overhead as much as actual compute.[^1]
- Each repetition calls `akunu_reset()` to clear the KV cache, ensuring independent measurements. Without the reset, later iterations would operate on a larger KV cache, which changes the attention kernel's memory access pattern.
- The standard deviation across repetitions is typically very small (under 2%) on a quiet system. If you see high variance, check for thermal throttling or background processes competing for GPU resources.
akunu_benchmark: End-to-End with Real Prompts
While akunu_bench gives you clean synthetic numbers, akunu_benchmark exercises the full akunu_generate() path with real prompts of varying lengths:
Usage: akunu_benchmark <model>
This tool runs three prompts (short, medium, long), measures AkunuGenerationStats for each, and reports:
| Column | Meaning |
|---|---|
| Prompt | Length category |
| Tokens | Actual token count after encoding |
| Prefill (t/s) | Prefill throughput |
| Decode (t/s) | Decode throughput |
| First-tok(ms) | Time to first token (the TTFT metric) |
| Prefill(ms) | Raw prefill time |
| Total(s) | Wall-clock total |
After the prompt tests, it also runs a standalone chain decode measurement (128 tokens, greedy) to give you the raw GPU-limited decode throughput independent of sampling overhead.
The key insight from this tool is how prefill scales with prompt length. On Apple Silicon, prefill is a GEMM (matrix-matrix multiply) workload, and the GPU’s utilization increases with larger batch sizes. You will typically see:
- Short prompts (1-10 tokens): Low prefill tok/s because the GEMMs have tiny M dimension and cannot saturate the GPU’s compute units
- Medium prompts (50-200 tokens): Prefill tok/s climbs rapidly as GEMM occupancy improves
- Long prompts (500+ tokens): Prefill tok/s plateaus near the compute-bound peak
akunu_profile: Per-Kernel GPU Timing
This is the real workhorse for optimization. Where akunu_bench tells you how fast, akunu_profile tells you where the time goes.
Usage: akunu_profile <model> [--tokens N]
Here is what happens under the hood, based on the actual source in tools/akunu_profile.cpp:
- Load the model and prefill a single BOS token
- Call `akunu_profile_decode_step()`, which runs each dispatch command in its own `MTLCommandBuffer`, enabling accurate per-kernel GPU timing via Metal's built-in command buffer timing
- Repeat for `N` tokens (default 5), accumulating timing data
- Sort kernels by total GPU time and print a breakdown table
The output looks something like this (simplified):
Per-Kernel GPU Timing Breakdown
==========================================================================================
Kernel Dispatches Total (ms) Avg (ms) % GPU
------------------------------------------------------------------------------------------
L0 GEMV attn_qkv Q4_0 5 0.412 0.082 18.2%
L0 GEMV ffn_down Q4_0 5 0.318 0.064 14.1%
L0 GEMV ffn_gate_up Q4_0 5 0.304 0.061 13.4%
L0 Attention 5 0.201 0.040 8.9%
L0 GEMV attn_output Q4_0 5 0.156 0.031 6.9%
...
There is an important caveat: profiled decode is much slower than normal decode. The profiler wraps each kernel dispatch in its own command buffer to get accurate GPU timing. In normal operation, akunu batches the entire forward pass (embedding + N layers + output norm + logit projection + argmax) into a single command buffer, and the chain decoder batches multiple tokens into one submission. Profiled mode breaks this batching completely, so the absolute numbers are not representative of production throughput – they are only useful for relative comparisons between kernels.[^2]
Reading the Profiler Output
The typical decode step for a LLaMA-like model with n_layers transformer layers contains:
+------------------+
| Embedding Lookup | 1 kernel
+------------------+
|
v
+------------------+
| Layer 0 | ~8-12 kernels per layer
| Attention Norm |
| QKV Projection | (GEMV or fused GEMV+RoPE+KV-write)
| RoPE + KV Write|
| Attention |
| Output Proj |
| Residual Add |
| FFN Norm |
| Gate+Up Proj | (possibly fused into single GEMV)
| Activation | (SiLU*gate or GELU*gate)
| Down Proj |
| Residual Add |
+------------------+
|
v
| Layer 1..N-1 | (repeat)
|
v
+------------------+
| Output Norm | 1 kernel
+------------------+
|
v
+------------------+
| Logit Projection | 1 GEMV (dim -> vocab_size)
+------------------+
|
v
+------------------+
| Argmax | 1 kernel
+------------------+
When you look at the profiler output, the GEMV (matrix-vector multiply) kernels dominate. For a Q4_0 model, the three big GEMVs per layer are:
- **QKV projection**: Multiplies the hidden state by the Q, K, and V weight matrices. For a model with `n_heads=32, n_kv_heads=8, head_dim=128`, this projects `dim=4096` to `q_dim + 2*kv_dim = 4096 + 2*1024 = 6144` elements.
- **FFN gate+up**: Projects `dim` to `2*ffn_dim`. For LLaMA-style models with SwiGLU, `ffn_dim` is typically `~2.7*dim`, so this is the largest single GEMV.
- **FFN down**: Projects `ffn_dim` back to `dim`.
The attention kernel itself is often not the biggest time consumer during decode (single token, long KV cache), because it is a relatively small operation: each head does a dot product of the query against kv_seq_len keys, then a weighted sum of values. The total work scales with n_heads * kv_seq_len * head_dim, which for moderate context lengths is much less than the GEMV work.
Xcode GPU Profiler (Instruments)
For the deepest level of insight, Apple provides GPU profiling through Instruments. There are two relevant instruments:
Metal System Trace
Metal System Trace shows the timeline of GPU command buffer submissions, encoding, and execution. This is the tool to use when you suspect CPU-GPU synchronization issues or want to understand the relationship between akunu’s chain decode submissions and actual GPU execution.
To capture a trace:
- Build akunu with debug symbols (CMake `RelWithDebInfo` or `Debug`)
- Open Instruments and choose the "Metal System Trace" template
- Select your akunu binary as the target
- Record for a few seconds while running a generation
The trace shows:
| Track | What You See |
|---|---|
| GPU Timeline | Individual compute dispatches on the GPU hardware. Each dispatch shows its duration, pipeline state object (PSO) name, and threadgroup configuration. |
| Command Buffer Track | When each MTLCommandBuffer was committed, scheduled, and completed. Gaps between command buffers indicate CPU-side stalls. |
| Encoder Track | The compute command encoder’s encode phase. If encoding takes longer than GPU execution, you are CPU-bound. |
The key thing to look for in the Metal System Trace is GPU idle gaps. In a well-tuned chain decode:
CPU: [encode CB1] [encode CB2] [encode CB3]
GPU: [execute CB1][execute CB2] [execute CB3]
^-- no gap here: GPU stays busy
If you see gaps where the GPU is idle between command buffers, the CPU is not encoding fast enough. Akunu’s chain decode design specifically addresses this by encoding chain_decode_chunk tokens (64-128, depending on chip) into a single command buffer, ensuring the GPU has enough work to stay saturated.
GPU Counters
Instruments also provides GPU hardware counters (on supported devices) that show:
| Counter Group | Key Metrics |
|---|---|
| Occupancy | How many threadgroups are resident on the GPU simultaneously. Low occupancy means the GPU has idle ALUs. |
| Memory | Read/write bandwidth, cache hit rates. Critical for understanding whether your GEMV kernels are memory-bound (they almost always are). |
| ALU | Arithmetic utilization. For quantized GEMV, this is typically low because you are waiting on memory, not compute. |
| Shader | Per-pipeline-state breakdown. Shows which PSOs consume the most GPU time. |
Roofline Analysis for Apple Silicon
The roofline model is the single most useful framework for understanding LLM inference performance on Apple Silicon.[^3] The core idea is simple: every computation has an arithmetic intensity (operations per byte of memory accessed), and the hardware has a memory bandwidth ceiling and a compute ceiling. Your kernel's throughput is limited by whichever ceiling it hits first.
Apple Silicon Memory Bandwidth
| Chip | Memory BW (GB/s) | GPU FP16 TFLOPS | Roofline Knee (ops/byte) |
|---|---|---|---|
| M1 | 68.25 | 2.6 | 38 |
| M1 Pro | 200 | 5.2 | 26 |
| M1 Max | 400 | 10.4 | 26 |
| M2 | 100 | 3.6 | 36 |
| M2 Pro | 200 | 7.0 | 35 |
| M2 Max | 400 | 13.6 | 34 |
| M3 | 100 | 4.1 | 41 |
| M3 Pro | 150 | 7.0 | 47 |
| M3 Max | 400 | 14.2 | 36 |
| M4 | 120 | 4.3 | 36 |
| M4 Pro | 273 | 9.2 | 34 |
| M4 Max | 546 | 18.0 | 33 |
The “roofline knee” is the arithmetic intensity where you transition from memory-bound to compute-bound. For LLM decode, the arithmetic intensity is almost always well below this knee.
Why Decode Is Memory-Bound
During single-token decode, each GEMV reads the entire weight matrix and multiplies it by a single vector. For a Q4_0 weight matrix of shape [N, K]:
- Bytes read: `N * K / 2` bytes (4 bits per weight, packed) + `N * K / 32 * 2` bytes (one FP16 scale per block of 32)
- FLOPs: `2 * N * K` (one multiply-accumulate per weight)
- Arithmetic intensity: roughly `2 * N * K / (N * K * 0.5625)` = ~3.6 ops/byte
That is far below the roofline knee of 26-47 ops/byte. The GEMV is firmly memory-bound. This means:
Decode throughput is determined almost entirely by memory bandwidth.
The theoretical maximum decode tok/s for a model of total weight size W bytes on a chip with bandwidth B bytes/s is:
max_decode_tok_s = B / W
For a 4B parameter Q4_0 model (~2.3 GB weights):
| Chip | BW (GB/s) | Theoretical Max (tok/s) |
|---|---|---|
| M1 | 68.25 | 29.7 |
| M2 Pro | 200 | 87.0 |
| M3 Max | 400 | 174.0 |
| M4 Max | 546 | 237.4 |
In practice, akunu achieves 70-85% of theoretical bandwidth utilization for decode, which is quite good for a real-world system with cache management, RoPE computation, attention, and norm overhead on top of the raw GEMVs.
Why Prefill Is Compute-Bound (for Large Batches)
During prefill, the projections become GEMMs (matrix-matrix multiply) because you are processing seq_len tokens simultaneously. The arithmetic intensity scales with the batch dimension:
- Arithmetic intensity: ~`2 * M` ops/byte (where `M` = batch/seq_len)
For M >= 20 or so, you cross the roofline knee and become compute-bound. This is why prefill throughput is typically 10-50x higher than decode throughput – you are actually using the GPU’s ALUs instead of just waiting on memory.
Bandwidth Utilization: The Real Performance Metric
Raw tok/s numbers are useful for user-facing comparisons, but for engineering purposes, bandwidth utilization is the metric that tells you how close you are to optimal:
bandwidth_utilization = (model_weight_bytes / decode_time_per_token) / peak_memory_bandwidth
Here is how to compute this from akunu_bench output:
- Get model weight bytes from `akunu_model_memory()` (reported as "size" in the bench output)
- Compute decode time per token: `1.0 / decode_tok_s`
- Divide the resulting effective bandwidth (weight bytes / decode time per token) by peak bandwidth
For example, if akunu_bench reports 85 tok/s on a 2341 MiB model on M2 Pro (200 GB/s):
effective_bw = 2341 * 1024 * 1024 / (1/85) = 2341 * 1.0485e6 * 85 = 208.7 GB/s
utilization = 208.7 / 200 = 104.3%
Wait, over 100%? This happens because the System Level Cache (SLC) provides additional effective bandwidth for data that fits or partially fits in the cache hierarchy. The SLC on Apple Silicon can add 20-40% of effective bandwidth for workloads with good temporal locality.[^4] akunu's chain decode exploits this: when processing 64-128 tokens sequentially through each layer, the weight data loaded for token N is still in cache for token N+1.
Identifying Common Bottlenecks
Here is a diagnostic flowchart based on what the profiling tools reveal:
Bottleneck: Low Decode tok/s
Is bandwidth utilization > 70%?
├── YES: You are near optimal for this chip/model combo.
│ Only way to go faster: smaller model or faster chip.
│
└── NO: Something is leaving bandwidth on the table.
│
├── Are there GPU idle gaps in Metal System Trace?
│ ├── YES: CPU encoding is too slow.
│ │ Check: is chain_decode_chunk large enough?
│ │ Check: are you using profiled decode by mistake?
│ │
│ └── NO: Kernels are suboptimal.
│ Use akunu_profile to find the slowest kernel.
│ Common culprits:
│ - Attention kernel with very long KV cache
│ - Logit projection (dim -> vocab_size GEMV, large N)
│ - Unoptimized dtype (Q5_K, Q3_K lack wide variants)
│
└── Is memory usage near system limits?
├── YES: Memory pressure causes swapping. Reduce max_context
│ or use a smaller quantization.
└── NO: Check thermal state (sysctl machdep.xcpm.cpu_thermal_level)
Bottleneck: High TTFT
TTFT is prefill time plus one decode step. If TTFT is high:
Is the prompt very long (>1000 tokens)?
├── YES: Prefill is doing large GEMMs. Check:
│ - Is prefill chunked? (akunu chunks at max_prefill_chunk = 4096)
│ - Are GEMM kernels using simd_matrix operations?
│ - For Q4_0/Q8_0, are the GEMM kernels the quantized variants?
│
└── NO: Short prompt but still slow?
Check if model loading is included in the measurement.
akunu_load_model() compiles PSOs and builds the dispatch table
on first call. Subsequent calls reuse cached state.
Bottleneck: Attention Dominating at Long Context
As context grows, the attention kernel’s cost scales linearly with KV cache length. At some point it overtakes the GEMVs:
| Context Length | Attention % of Decode (typical 4B model) |
|---|---|
| 128 | 3-5% |
| 512 | 8-12% |
| 2048 | 20-30% |
| 4096 | 35-50% |
If attention is your bottleneck, the options are:
- Reduce `max_context` to avoid over-allocating the KV cache
- Use a model with GQA (fewer KV heads means less memory traffic in attention)
- Wait for akunu to implement paged attention or sliding-window eviction
Comparing Against llama.cpp and MLX
Benchmarking against other frameworks is valuable both for validating your measurements and for identifying optimization opportunities. Here is how to set up fair comparisons:
llama.cpp Comparison
Use llama-bench with matching parameters:
# llama.cpp
./llama-bench -m model.gguf -p 512 -n 128 -r 5
# akunu
./akunu_bench model.gguf -p 512 -n 128 -r 5
Key differences to account for:
| Factor | llama.cpp | akunu |
|---|---|---|
| Backend | Metal (via ggml-metal) | Metal (direct MSL) |
| Decode strategy | Single token per GPU submission | Chain decode (64-128 tokens per submission) |
| KV cache layout | Per-layer, row-major | Per-layer, head-major [n_kv_heads, max_seq, head_dim] |
| Weight fusion | None | Gate+Up fused on Pro+ chips (SLC > 16MB) |
| GEMV kernels | ggml generic + Metal shaders | Custom per-dtype Metal shaders with chip-specific tuning |
In practice, akunu's decode throughput is typically 1.1-1.5x llama.cpp's on the same hardware, primarily due to chain decode reducing GPU idle time and chip-specific GEMV tuning.[^5]
MLX Comparison
MLX (Apple’s machine learning framework) uses a different approach:
# MLX benchmark (sketch; prompt and max_tokens are illustrative)
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-4B-4bit")
t0 = time.perf_counter()
text = generate(model, tokenizer, prompt="Hello", max_tokens=128)
print(f"generated in {time.perf_counter() - t0:.2f} s")
Key differences:
| Factor | MLX | akunu |
|---|---|---|
| Language | Python + C++ + Metal | C++ + Metal |
| Weight format | SafeTensors with MLX quantization | GGUF or MLX SafeTensors |
| Graph compilation | JIT traced graphs | Pre-compiled dispatch table |
| Quantization | Group quantized (group_size=64) | GGUF block quant or MLX group quant |
| Overhead | Python dispatch + JIT | Near-zero (POD struct iteration) |
MLX’s Python overhead is minimal for long generations but can be significant for TTFT on short prompts. akunu’s pre-compiled dispatch table avoids any per-token overhead beyond the raw GPU dispatch cost.
What Fair Comparison Looks Like
For a fair comparison, ensure:
- Same model weights – or at least the same effective bits-per-weight. Q4_0 GGUF (4.5 effective bpw) is roughly comparable to MLX 4-bit with group_size=64.
- Same prompt and generation length – especially for prefill comparison, since prefill scales nonlinearly with prompt length.
- Same sampling – use greedy (temperature=0) to eliminate sampling variance.
- Warm start – run at least one throwaway generation before timing to ensure Metal shader compilation is complete and caches are warm.
- Same hardware – obvious, but worth stating. The M1 Pro and M2 Pro have the same 200 GB/s bandwidth but different GPU architectures, which affects compute-bound workloads like prefill.
Profiling Checklist
When you sit down to profile an akunu deployment, here is the sequence:
- **Baseline**: Run `akunu_bench` to establish prefill tok/s, decode tok/s, and TTFT
- **Bandwidth check**: Compute bandwidth utilization from the bench numbers. If it is above 70%, you are in good shape.
- **Kernel breakdown**: Run `akunu_profile` to identify which kernels dominate. The top 3-5 kernels by GPU time are your optimization targets.
- **System-level**: If you suspect CPU-GPU sync issues, use Metal System Trace in Instruments to check for GPU idle gaps.
- **Compare**: Run the same model on llama.cpp and/or MLX to validate your numbers and identify framework-level differences.
- **Thermal**: For sustained workloads, monitor thermal throttling. Apple Silicon aggressively throttles GPU frequency under thermal pressure, which can reduce throughput by 20-40% on fanless MacBooks.
Advanced: Custom Profiling with the C API
The akunu_profile_decode_step() C API function is available for integration into your own profiling harness:
// Allocate timing buffer: n_layers + 4 entries
// [embedding, layer0, layer1, ..., layerN-1, output norm, logit, argmax]
float timing[512];
int n = akunu_profile_decode_step(model, token_id, position, timing, 512);
for (int i = 0; i < n; i++) {
printf("%s: %.3f ms\n", akunu_profile_label(model, i), timing[i]);
}
Each entry corresponds to a dispatch command in the DispatchTable. The labels are stored in a parallel DispatchLabel array (cold data, separate from the hot command array) so that profiling metadata does not pollute the cache lines used by the decode inner loop.
The profiling works by running each dispatch command in its own MTLCommandBuffer and reading back GPUStartTime / GPUEndTime. This gives microsecond-accurate per-kernel GPU timing, but at the cost of massive overhead from the per-kernel command buffer synchronization. You would never use this in production – it is purely a diagnostic tool.
Summary
| Tool | When to Use | Output |
|---|---|---|
| `akunu_bench` | Quick throughput comparison | Prefill tok/s, decode tok/s (markdown table) |
| `akunu_benchmark` | End-to-end with real prompts | TTFT, prefill/decode speed at multiple prompt lengths |
| `akunu_profile` | Identifying kernel bottlenecks | Per-kernel GPU time breakdown, sorted by cost |
| Metal System Trace | CPU-GPU sync analysis | Timeline of command buffer submissions and GPU execution |
| GPU Counters | Hardware utilization | Occupancy, bandwidth, ALU utilization |
| Roofline analysis | Understanding theoretical limits | Whether you are memory-bound or compute-bound |
The fundamental insight for Apple Silicon LLM inference is that decode is memory-bound and will remain so for the foreseeable future. The job of the profiler is not to find ways to make the GPU compute faster – it is to find the places where you are wasting bandwidth or leaving the GPU idle. Chain decode, weight fusion, and chip-specific GEMV tuning are all strategies that akunu uses to close the gap between measured and theoretical bandwidth, and the profiling tools described in this chapter are how you verify that those strategies are working.
[^1]: On Apple Silicon, each `MTLCommandBuffer` commit-and-wait cycle costs approximately 30-80 microseconds of CPU overhead. At 80+ tok/s, a 50 µs overhead per token adds up to 4 ms per second – roughly 0.4% throughput loss from the synchronization calls themselves. The larger cost is the GPU sitting idle during each round-trip while the CPU prepares the next submission.

[^2]: Metal's GPU timing (`GPUStartTime` / `GPUEndTime` on `MTLCommandBuffer`) measures the time the command buffer was executing on the GPU. For a single kernel this is accurate, but for a command buffer containing hundreds of dispatches you only get the total. Apple's GPU Timeline in Instruments provides per-dispatch timing, but requires running inside Xcode.

[^3]: Williams, S., Waterman, A., & Patterson, D. (2009). "Roofline: An Insightful Visual Performance Model for Multicore Architectures." Communications of the ACM, 52(4), 65-76. https://doi.org/10.1145/1498765.1498785

[^4]: Actual SLC sizes estimated in akunu's `ChipConfig`: 8 MB (M1/M2/M3 base), 16 MB (M4 base), 24 MB (M1/M2/M3 Pro), 32 MB (M4 Pro), 48 MB (Max), 96 MB (Ultra). These are not published by Apple but inferred from performance measurements and die analysis.

[^5]: This comparison is for the Metal backend specifically. llama.cpp supports many backends (CUDA, Vulkan, CPU) and architectures; akunu targets Apple Silicon exclusively, which allows tighter optimization.