Performance Profiling and Benchmarking

If you have made it this far in the book, you have a reasonably complete mental model of how akunu turns a pile of quantized weights and Metal shaders into streaming text on Apple Silicon. That is great, but mental models do not ship performance. At some point you need to measure things, and the gap between “I think the attention kernel is the bottleneck” and “the attention kernel consumes 38% of GPU time at 14.2 GB/s effective bandwidth on an M2 Pro” is the gap between guessing and engineering.

This chapter covers the full profiling stack: Apple’s own GPU tools, akunu’s built-in CLI profilers, the key metrics you should care about, roofline analysis for memory-bound inference, and how to interpret the numbers in context by comparing against llama.cpp and MLX.

The Three Metrics That Matter

Before we reach for any tool, let us agree on what we are measuring. LLM inference has three headline numbers that users and developers care about:

| Metric | Definition | Why It Matters |
| --- | --- | --- |
| Prefill tok/s | Prompt tokens processed per second | Determines how fast you can ingest a 4K context window. Governs perceived responsiveness for the first token. |
| Decode tok/s | Generated tokens per second | The sustained throughput the user sees while text is streaming. This is the number people compare across frameworks. |
| TTFT (Time To First Token) | Wall-clock time from prompt submission to first generated token | The most perceptually important metric. Users notice latency more than throughput. TTFT = prefill time + one decode step. |
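
Since TTFT is just prefill time plus one decode step, you can estimate it directly from the two throughput numbers a benchmark reports. A minimal arithmetic sketch (illustrative only, not an akunu API):

```cpp
#include <cassert>

// Illustrative arithmetic only (not an akunu API): estimate TTFT from
// the prefill and decode throughput a benchmark reports.
double estimate_ttft_ms(int prompt_tokens, double prefill_tok_s,
                        double decode_tok_s) {
    double prefill_ms = 1000.0 * prompt_tokens / prefill_tok_s; // ingest prompt
    double first_decode_ms = 1000.0 / decode_tok_s;             // one decode step
    return prefill_ms + first_decode_ms;
}
```

For a 512-token prompt at 1842 tok/s prefill and 87 tok/s decode (the sample bench numbers used later in this chapter), this comes to roughly 290 ms.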

A fourth metric, peak memory, also matters on Apple Silicon because you are sharing a unified memory pool with the OS, the window compositor, and whatever else the user has open. Running out of memory does not just crash your process; it can trigger aggressive swapping that destroys system responsiveness.

Let us now walk through the tools that let you measure all of this.

akunu_bench: The llama-bench Equivalent

The simplest way to get prefill and decode numbers is akunu_bench, a C++ tool that replicates the methodology of llama-bench from the llama.cpp project. Here is the actual source signature from tools/akunu_bench.cpp:

Usage: akunu_bench <model> [-p N] [-n N] [-r N]

The flags are:

| Flag | Default | Meaning |
| --- | ---: | --- |
| -p N / --pp N | 512 | Prompt length for prefill test |
| -n N / --tg N | 128 | Number of tokens for decode (text generation) test |
| -r N / --reps N | 5 | Repetitions per test (for statistical stability) |

The tool works by creating synthetic prompts filled with the BOS token (token ID 1). This is deliberate – you want a reproducible input that does not depend on tokenizer behavior or prompt content. Here is what happens internally:

  1. Prefill test: Fill a vector of pp tokens with BOS. Call akunu_prefill() and time it with std::chrono::high_resolution_clock. Repeat reps times. Report mean and standard deviation.

  2. Decode test: Prefill a single BOS token, then call akunu_chain_decode() for tg tokens in a single GPU submission. Again, repeat and report statistics.
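
The repeat-and-report pattern in both steps can be sketched as follows; `run` stands in for the timed call (e.g. akunu_prefill), and this illustrates the methodology rather than reproducing the tool's actual source:

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <vector>

// Time `run` reps times, then report mean and standard deviation in
// milliseconds, as akunu_bench's output table does.
struct Stats { double mean_ms; double stddev_ms; };

template <typename F>
Stats time_reps(F run, int reps) {
    std::vector<double> samples;
    for (int r = 0; r < reps; ++r) {
        auto t0 = std::chrono::high_resolution_clock::now();
        run();
        auto t1 = std::chrono::high_resolution_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    double mean = 0.0;
    for (double v : samples) mean += v;
    mean /= reps;
    double var = 0.0;
    for (double v : samples) var += (v - mean) * (v - mean);
    return {mean, std::sqrt(var / reps)};
}
```

In the real tool, each repetition must also be preceded by a KV-cache reset so that the measurements are independent, as described below.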

The output matches the llama-bench markdown table format so you can paste results directly into GitHub issues:

| model | size | test | t/s |
| --- | ---: | ---: | ---: |
| Qwen3-4B-Q4_0.gguf | 2341 MiB | pp512 | 1842.31 +/- 12.40 |
| Qwen3-4B-Q4_0.gguf | 2341 MiB | tg128 | 87.42 +/- 0.83 |

A few things to note about the methodology:

  • The decode test uses akunu_chain_decode(), which batches all tg tokens into a single GPU command buffer submission. This measures the true GPU-limited throughput, not the overhead of individual akunu_decode_step() round-trips. If you were to measure decode by calling decode_step in a loop, you would be measuring CPU-GPU synchronization overhead as much as actual compute.1

  • Each repetition calls akunu_reset() to clear the KV cache, ensuring independent measurements. Without the reset, later iterations would operate on a larger KV cache, which changes the attention kernel’s memory access pattern.

  • The standard deviation across repetitions is typically very small (under 2%) on a quiet system. If you see high variance, check for thermal throttling or background processes competing for GPU resources.

akunu_benchmark: End-to-End with Real Prompts

While akunu_bench gives you clean synthetic numbers, akunu_benchmark exercises the full akunu_generate() path with real prompts of varying lengths:

Usage: akunu_benchmark <model>

This tool runs three prompts (short, medium, long), measures AkunuGenerationStats for each, and reports:

| Column | Meaning |
| --- | --- |
| Prompt | Length category |
| Tokens | Actual token count after encoding |
| Prefill (t/s) | Prefill throughput |
| Decode (t/s) | Decode throughput |
| First-tok (ms) | Time to first token (the TTFT metric) |
| Prefill (ms) | Raw prefill time |
| Total (s) | Wall-clock total |

After the prompt tests, it also runs a standalone chain decode measurement (128 tokens, greedy) to give you the raw GPU-limited decode throughput independent of sampling overhead.

The key insight from this tool is how prefill scales with prompt length. On Apple Silicon, prefill is a GEMM (matrix-matrix multiply) workload, and the GPU’s utilization increases with larger batch sizes. You will typically see:

  • Short prompts (1-10 tokens): Low prefill tok/s because the GEMMs have tiny M dimension and cannot saturate the GPU’s compute units
  • Medium prompts (50-200 tokens): Prefill tok/s climbs rapidly as GEMM occupancy improves
  • Long prompts (500+ tokens): Prefill tok/s plateaus near the compute-bound peak

akunu_profile: Per-Kernel GPU Timing

This is the real workhorse for optimization. Where akunu_bench tells you how fast, akunu_profile tells you where the time goes.

Usage: akunu_profile <model> [--tokens N]

Here is what happens under the hood, based on the actual source in tools/akunu_profile.cpp:

  1. Load the model and prefill a single BOS token
  2. Call akunu_profile_decode_step() which runs each dispatch command in its own MTLCommandBuffer, enabling accurate per-kernel GPU timing via Metal’s built-in command buffer timing
  3. Repeat for N tokens (default 5), accumulating timing data
  4. Sort kernels by total GPU time and print a breakdown table
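
Steps 3-4 amount to folding per-token samples into per-kernel totals and sorting by total GPU time. A sketch of that aggregation, with illustrative types rather than akunu's actual representation:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Fold per-token (kernel label, GPU ms) samples into per-kernel totals,
// then sort descending by total time. KernelRow is a stand-in type.
struct KernelRow {
    std::string label;
    int dispatches = 0;
    double total_ms = 0.0;
};

std::vector<KernelRow> breakdown(
        const std::vector<std::pair<std::string, double>>& samples) {
    std::map<std::string, KernelRow> acc;
    for (const auto& [label, ms] : samples) {
        KernelRow& row = acc[label];
        row.label = label;
        row.dispatches += 1;
        row.total_ms += ms;
    }
    std::vector<KernelRow> rows;
    for (const auto& [label, row] : acc) rows.push_back(row);
    std::sort(rows.begin(), rows.end(),
              [](const KernelRow& a, const KernelRow& b) {
                  return a.total_ms > b.total_ms;
              });
    return rows;
}
```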

The output looks something like this (simplified):

  Per-Kernel GPU Timing Breakdown
  ==========================================================================================
  Kernel                                  Dispatches  Total (ms)    Avg (ms)      % GPU
  ------------------------------------------------------------------------------------------
  L0 GEMV attn_qkv Q4_0                 5           0.412         0.082         18.2%
  L0 GEMV ffn_down Q4_0                 5           0.318         0.064         14.1%
  L0 GEMV ffn_gate_up Q4_0              5           0.304         0.061         13.4%
  L0 Attention                          5           0.201         0.040          8.9%
  L0 GEMV attn_output Q4_0              5           0.156         0.031          6.9%
  ...

There is an important caveat: profiled decode is much slower than normal decode. The profiler wraps each kernel dispatch in its own command buffer to get accurate GPU timing. In normal operation, akunu batches the entire forward pass (embedding + N layers + output norm + logit projection + argmax) into a single command buffer, and the chain decoder batches multiple tokens into one submission. Profiled mode breaks this batching completely, so the absolute numbers are not representative of production throughput – they are only useful for relative comparisons between kernels.2

Reading the Profiler Output

The typical decode step for a LLaMA-like model with n_layers transformer layers contains:

+------------------+
| Embedding Lookup |  1 kernel
+------------------+
        |
        v
+------------------+
| Layer 0          |  ~8-12 kernels per layer
|   Attention Norm |
|   QKV Projection |  (GEMV or fused GEMV+RoPE+KV-write)
|   RoPE + KV Write|
|   Attention      |
|   Output Proj    |
|   Residual Add   |
|   FFN Norm       |
|   Gate+Up Proj   |  (possibly fused into single GEMV)
|   Activation     |  (SiLU*gate or GELU*gate)
|   Down Proj      |
|   Residual Add   |
+------------------+
        |
        v
+------------------+
| Layer 1..N-1     |  (repeat)
+------------------+
        |
        v
+------------------+
| Output Norm      |  1 kernel
+------------------+
        |
        v
+------------------+
| Logit Projection |  1 GEMV (dim -> vocab_size)
+------------------+
        |
        v
+------------------+
| Argmax           |  1 kernel
+------------------+

When you look at the profiler output, the GEMV (matrix-vector multiply) kernels dominate. For a Q4_0 model, the three big GEMVs per layer are:

  1. QKV projection: Multiplies the hidden state by the Q, K, and V weight matrices. For a model with n_heads=32, n_kv_heads=8, head_dim=128, this projects dim=4096 to q_dim + 2*kv_dim = 4096 + 2*1024 = 6144 elements.

  2. FFN gate+up: Projects dim to 2*ffn_dim. For LLaMA-style models with SwiGLU, ffn_dim is typically ~2.7*dim, so this is the largest single GEMV.

  3. FFN down: Projects ffn_dim back to dim.

The attention kernel itself is often not the biggest time consumer during decode (single token, long KV cache), because it is a relatively small operation: each head does a dot product of the query against kv_seq_len keys, then a weighted sum of values. The total work scales with n_heads * kv_seq_len * head_dim, which for moderate context lengths is much less than the GEMV work.
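
A quick way to see the GEMV dominance is to tally the weight bytes each projection streams per decode step. A back-of-envelope sketch using the Q4_0 cost of 0.5625 bytes per weight; the ffn_dim of 11008 (a typical LLaMA-style ~2.7*dim value) is an assumption, not a figure read from any real model:

```cpp
#include <cassert>
#include <cstdint>

// Q4_0 stores 4-bit packed values plus one FP16 scale per 32-weight
// block: 0.5 + 2/32 = 0.5625 bytes per weight.
double q4_0_bytes(int64_t rows, int64_t cols) {
    return static_cast<double>(rows) * cols * 0.5625;
}

// Weight traffic for one layer's projections during single-token decode.
double per_layer_gemv_mib(int64_t dim, int64_t qkv_out, int64_t ffn_dim) {
    double qkv  = q4_0_bytes(qkv_out, dim);      // QKV projection
    double gu   = q4_0_bytes(2 * ffn_dim, dim);  // gate+up (possibly fused)
    double down = q4_0_bytes(dim, ffn_dim);      // down projection
    double outp = q4_0_bytes(dim, dim);          // attention output projection
    return (qkv + gu + down + outp) / (1024.0 * 1024.0);
}
```

With dim = 4096 and the 6144-wide QKV output from the example above, this comes to roughly 95 MiB of weights streamed per layer per token, which dwarfs the few MiB of KV data the attention kernel touches at moderate context lengths.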

Xcode GPU Profiler (Instruments)

For the deepest level of insight, Apple provides GPU profiling through Instruments. There are two relevant instruments:

Metal System Trace

Metal System Trace shows the timeline of GPU command buffer submissions, encoding, and execution. This is the tool to use when you suspect CPU-GPU synchronization issues or want to understand the relationship between akunu’s chain decode submissions and actual GPU execution.

To capture a trace:

  1. Build akunu with debug symbols (CMake RelWithDebInfo or Debug)
  2. Open Instruments, choose “Metal System Trace” template
  3. Select your akunu binary as the target
  4. Record for a few seconds while running a generation

The trace shows:

| Track | What You See |
| --- | --- |
| GPU Timeline | Individual compute dispatches on the GPU hardware. Each dispatch shows its duration, pipeline state object (PSO) name, and threadgroup configuration. |
| Command Buffer Track | When each MTLCommandBuffer was committed, scheduled, and completed. Gaps between command buffers indicate CPU-side stalls. |
| Encoder Track | The compute command encoder's encode phase. If encoding takes longer than GPU execution, you are CPU-bound. |

The key thing to look for in the Metal System Trace is GPU idle gaps. In a well-tuned chain decode:

CPU:  [encode CB1] [encode CB2]        [encode CB3]
GPU:           [execute CB1][execute CB2]        [execute CB3]
                                   ^-- no gap here: GPU stays busy

If you see gaps where the GPU is idle between command buffers, the CPU is not encoding fast enough. Akunu’s chain decode design specifically addresses this by encoding chain_decode_chunk tokens (64-128, depending on chip) into a single command buffer, ensuring the GPU has enough work to stay saturated.

GPU Counters

Instruments also provides GPU hardware counters (on supported devices) that show:

| Counter Group | Key Metrics |
| --- | --- |
| Occupancy | How many threadgroups are resident on the GPU simultaneously. Low occupancy means the GPU has idle ALUs. |
| Memory | Read/write bandwidth, cache hit rates. Critical for understanding whether your GEMV kernels are memory-bound (they almost always are). |
| ALU | Arithmetic utilization. For quantized GEMV, this is typically low because you are waiting on memory, not compute. |
| Shader | Per-pipeline-state breakdown. Shows which PSOs consume the most GPU time. |

Roofline Analysis for Apple Silicon

The roofline model is the single most useful framework for understanding LLM inference performance on Apple Silicon.3 The core idea is simple: every computation has an arithmetic intensity (operations per byte of memory accessed), and the hardware has a memory bandwidth ceiling and a compute ceiling. Your kernel’s throughput is limited by whichever ceiling it hits first.

Apple Silicon Memory Bandwidth

| Chip | Memory BW (GB/s) | GPU FP16 TFLOPS | Roofline Knee (ops/byte) |
| --- | ---: | ---: | ---: |
| M1 | 68.2 | 2.6 | 38 |
| M1 Pro | 200 | 5.2 | 26 |
| M1 Max | 400 | 10.4 | 26 |
| M2 | 100 | 3.6 | 36 |
| M2 Pro | 200 | 7.0 | 35 |
| M2 Max | 400 | 13.6 | 34 |
| M3 | 100 | 4.1 | 41 |
| M3 Pro | 150 | 7.0 | 47 |
| M3 Max | 400 | 14.2 | 36 |
| M4 | 120 | 4.3 | 36 |
| M4 Pro | 273 | 9.2 | 34 |
| M4 Max | 546 | 18.0 | 33 |

The “roofline knee” is the arithmetic intensity where you transition from memory-bound to compute-bound. For LLM decode, the arithmetic intensity is almost always well below this knee.

Why Decode Is Memory-Bound

During single-token decode, each GEMV reads the entire weight matrix and multiplies it by a single vector. For a Q4_0 weight matrix of shape [N, K]:

  • Bytes read: N * K / 2 bytes (4 bits per weight, packed) + N * K / 32 * 2 bytes (one FP16 scale per block of 32)
  • FLOPs: 2 * N * K (multiply-accumulate)
  • Arithmetic intensity: roughly 2 * N * K / (N * K * 0.5625) = ~3.6 ops/byte

That is far below the roofline knee of 26-47 ops/byte. The GEMV is firmly memory-bound. This means:

Decode throughput is determined almost entirely by memory bandwidth.

The theoretical maximum decode tok/s for a model of total weight size W bytes on a chip with bandwidth B bytes/s is:

max_decode_tok_s = B / W

For a 4B parameter Q4_0 model (~2.3 GB weights):

| Chip | BW (GB/s) | Theoretical Max (tok/s) |
| --- | ---: | ---: |
| M1 | 68.2 | 29.7 |
| M2 Pro | 200 | 87.0 |
| M3 Max | 400 | 174.0 |
| M4 Max | 546 | 237.4 |
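
Each row of the table is one application of the formula; a helper to reproduce them (the ~2.3 decimal GB weight size is the figure used above):

```cpp
#include <cassert>
#include <cmath>

// max_decode_tok_s = B / W: peak bandwidth in GB/s divided by total
// weight size in (decimal) GB.
// e.g. max_decode_tok_s(200.0, 2.3) is about 87 tok/s (the M2 Pro row).
double max_decode_tok_s(double bandwidth_gb_s, double weight_gb) {
    return bandwidth_gb_s / weight_gb;
}
```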

In practice, akunu achieves 70-85% of theoretical bandwidth utilization for decode, which is quite good for a real-world system with cache management, RoPE computation, attention, and norm overhead on top of the raw GEMVs.

Why Prefill Is Compute-Bound (for Large Batches)

During prefill, the projections become GEMMs (matrix-matrix multiply) because you are processing seq_len tokens simultaneously. The arithmetic intensity scales with the batch dimension:

  • Arithmetic intensity: ~2 * M ops/byte (where M = batch/seq_len)

For M >= 20 or so, you cross the roofline knee and become compute-bound. This is why prefill throughput is typically 10-50x higher than decode throughput – you are actually using the GPU’s ALUs instead of just waiting on memory.
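
Using the ~2*M ops/byte approximation above, you can solve for the batch size at which a given chip's knee is crossed. Illustrative arithmetic only, not something akunu computes at runtime:

```cpp
#include <cassert>

// Smallest batch dimension M whose arithmetic intensity (~2*M ops/byte)
// reaches the given roofline knee, i.e. where prefill turns compute-bound.
int compute_bound_batch(double knee_ops_per_byte) {
    int m = 1;
    while (2.0 * m < knee_ops_per_byte) ++m;
    return m;
}
```

For the knee values in the table above (26-47 ops/byte), this lands at M of roughly 13-24, consistent with the "M >= 20 or so" rule of thumb.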

Bandwidth Utilization: The Real Performance Metric

Raw tok/s numbers are useful for user-facing comparisons, but for engineering purposes, bandwidth utilization is the metric that tells you how close you are to optimal:

bandwidth_utilization = (model_weight_bytes / decode_time_per_token) / peak_memory_bandwidth

Here is how to compute this from akunu_bench output:

  1. Get model weight bytes from akunu_model_memory() (reported as “size” in bench output)
  2. Compute decode time per token: 1.0 / decode_tok_s
  3. Divide effective bandwidth by peak bandwidth

For example, if akunu_bench reports 85 tok/s on a 2341 MiB model on M2 Pro (200 GB/s):

effective_bw = 2341 * 1024 * 1024 / (1/85) = 2341 * 1.0485e6 * 85 = 208.7 GB/s
utilization = 208.7 / 200 = 104.3%

Wait, over 100%? This happens because the System Level Cache (SLC) provides additional effective bandwidth for data that fits or partially fits in the cache hierarchy. The SLC on Apple Silicon can add 20-40% of effective bandwidth for workloads with good temporal locality.4 akunu’s chain decode exploits this: when processing 64-128 tokens sequentially through each layer, the weight data loaded for token N is still in cache for token N+1.
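
The recipe condenses to a few lines. A sketch that reproduces the M2 Pro example above:

```cpp
#include <cassert>

// Effective bandwidth is weight bytes times decode tok/s; utilization
// is that over the chip's peak. A result above 1.0 indicates SLC
// assistance, as discussed above.
double bandwidth_utilization(double model_mib, double decode_tok_s,
                             double peak_gb_s) {
    double weight_bytes = model_mib * 1024.0 * 1024.0; // weights read per token
    double effective_gb_s = weight_bytes * decode_tok_s / 1e9;
    return effective_gb_s / peak_gb_s;
}
```

Plugging in the example numbers (2341 MiB, 85 tok/s, 200 GB/s peak) returns roughly 1.04, the 104.3% figure from the worked example.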

Identifying Common Bottlenecks

Here is a diagnostic flowchart based on what the profiling tools reveal:

Bottleneck: Low Decode tok/s

Is bandwidth utilization > 70%?
├── YES: You are near optimal for this chip/model combo.
│        Only way to go faster: smaller model or faster chip.
│
└── NO: Something is leaving bandwidth on the table.
         │
         ├── Are there GPU idle gaps in Metal System Trace?
         │   ├── YES: CPU encoding is too slow.
         │   │        Check: is chain_decode_chunk large enough?
         │   │        Check: are you using profiled decode by mistake?
         │   │
         │   └── NO: Kernels are suboptimal.
         │            Use akunu_profile to find the slowest kernel.
         │            Common culprits:
         │            - Attention kernel with very long KV cache
         │            - Logit projection (dim -> vocab_size GEMV, large N)
         │            - Unoptimized dtype (Q5_K, Q3_K lack wide variants)
         │
         └── Is memory usage near system limits?
             ├── YES: Memory pressure causes swapping. Reduce max_context
             │        or use a smaller quantization.
             └── NO: Check thermal state (sysctl machdep.xcpm.cpu_thermal_level)

Bottleneck: High TTFT

TTFT is prefill time plus one decode step. If TTFT is high:

Is the prompt very long (>1000 tokens)?
├── YES: Prefill is doing large GEMMs. Check:
│        - Is prefill chunked? (akunu chunks at max_prefill_chunk = 4096)
│        - Are GEMM kernels using simd_matrix operations?
│        - For Q4_0/Q8_0, are the GEMM kernels the quantized variants?
│
└── NO: Short prompt but still slow?
         Check if model loading is included in the measurement.
         akunu_load_model() compiles PSOs and builds the dispatch table
         on first call. Subsequent calls reuse cached state.

Bottleneck: Attention Dominating at Long Context

As context grows, the attention kernel’s cost scales linearly with KV cache length. At some point it overtakes the GEMVs:

| Context Length | Attention % of Decode (typical 4B model) |
| ---: | ---: |
| 128 | 3-5% |
| 512 | 8-12% |
| 2048 | 20-30% |
| 4096 | 35-50% |

If attention is your bottleneck, the options are:

  • Reduce max_context to avoid over-allocating KV cache
  • Use a model with GQA (fewer KV heads = less memory traffic in attention)
  • Wait for akunu to implement paged attention or sliding window eviction
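
The GQA option is easy to quantify, since attention's per-token traffic is essentially one full read of the K and V caches. A sketch assuming an FP16 KV cache (the cache dtype is an assumption here, not stated in the text):

```cpp
#include <cassert>
#include <cstdint>

// Approximate KV bytes the attention kernel streams per decode step for
// one layer: every cached key and value is read once.
double attn_kv_mib(int n_kv_heads, int64_t kv_len, int head_dim) {
    double elems = 2.0 * n_kv_heads * kv_len * head_dim; // K and V tensors
    return elems * 2.0 / (1024.0 * 1024.0);              // FP16 = 2 bytes/elem
}
```

At 4096 context with head_dim = 128, going from 32 KV heads to 8 cuts the per-layer attention read from 64 MiB to 16 MiB per token, which is exactly the "less memory traffic" the bullet refers to.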

Comparing Against llama.cpp and MLX

Benchmarking against other frameworks is valuable both for validating your measurements and for identifying optimization opportunities. Here is how to set up fair comparisons:

llama.cpp Comparison

Use llama-bench with matching parameters:

# llama.cpp
./llama-bench -m model.gguf -p 512 -n 128 -r 5

# akunu
./akunu_bench model.gguf -p 512 -n 128 -r 5

Key differences to account for:

| Factor | llama.cpp | akunu |
| --- | --- | --- |
| Backend | Metal (via ggml-metal) | Metal (direct MSL) |
| Decode strategy | Single token per GPU submission | Chain decode (64-128 tokens per submission) |
| KV cache layout | Per-layer, row-major | Per-layer, head-major [n_kv_heads, max_seq, head_dim] |
| Weight fusion | None | Gate+Up fused on Pro+ chips (SLC > 16MB) |
| GEMV kernels | ggml generic + Metal shaders | Custom per-dtype Metal shaders with chip-specific tuning |

In practice, akunu’s decode throughput is typically 1.1-1.5x llama.cpp’s on the same hardware, primarily due to chain decode reducing GPU idle time and chip-specific GEMV tuning.5

MLX Comparison

MLX (Apple’s machine learning framework) uses a different approach:

# MLX benchmark
import mlx.core as mx
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Qwen3-4B-4bit")
# ... time the generation

Key differences:

| Factor | MLX | akunu |
| --- | --- | --- |
| Language | Python + C++ + Metal | C++ + Metal |
| Weight format | SafeTensors with MLX quantization | GGUF or MLX SafeTensors |
| Graph compilation | JIT traced graphs | Pre-compiled dispatch table |
| Quantization | Group quantized (group_size=64) | GGUF block quant or MLX group quant |
| Overhead | Python dispatch + JIT | Near-zero (POD struct iteration) |

MLX’s Python overhead is minimal for long generations but can be significant for TTFT on short prompts. akunu’s pre-compiled dispatch table avoids any per-token overhead beyond the raw GPU dispatch cost.

What Fair Comparison Looks Like

For a fair comparison, ensure:

  1. Same model weights – or at least the same effective bits-per-weight. Q4_0 GGUF (4.5 effective bpw) is roughly comparable to MLX 4-bit with group_size=64.
  2. Same prompt and generation length – especially for prefill comparison, since prefill scales nonlinearly with prompt length.
  3. Same sampling – use greedy (temperature=0) to eliminate sampling variance.
  4. Warm start – run at least one throwaway generation before timing to ensure Metal shader compilation is complete and caches are warm.
  5. Same hardware – obvious, but worth stating. The M3 Pro and M2 Pro have the same 200 GB/s bandwidth but different GPU architectures, which affects compute-bound workloads like prefill.

Profiling Checklist

When you sit down to profile an akunu deployment, here is the sequence:

  1. Baseline: Run akunu_bench to establish prefill tok/s, decode tok/s, and TTFT
  2. Bandwidth check: Compute bandwidth utilization from the bench numbers. If >70%, you are in good shape.
  3. Kernel breakdown: Run akunu_profile to identify which kernels dominate. The top 3-5 kernels by GPU time are your optimization targets.
  4. System-level: If you suspect CPU-GPU sync issues, use Metal System Trace in Instruments to check for GPU idle gaps.
  5. Compare: Run the same model on llama.cpp and/or MLX to validate your numbers and identify framework-level differences.
  6. Thermal: For sustained workloads, monitor thermal throttling. Apple Silicon aggressively throttles GPU frequency under thermal pressure, which can reduce throughput by 20-40% on fanless MacBooks.

Advanced: Custom Profiling with the C API

The akunu_profile_decode_step() C API function is available for integration into your own profiling harness:

// Allocate timing buffer: n_layers + 4 entries
// [embedding, layer0, layer1, ..., layerN-1, output norm, logit, argmax]
float timing[512];
int n = akunu_profile_decode_step(model, token_id, position, timing, 512);

for (int i = 0; i < n; i++) {
    printf("%s: %.3f ms\n", akunu_profile_label(model, i), timing[i]);
}

Each entry corresponds to a dispatch command in the DispatchTable. The labels are stored in a parallel DispatchLabel array (cold data, separate from the hot command array) so that profiling metadata does not pollute the cache lines used by the decode inner loop.

The profiling works by running each dispatch command in its own MTLCommandBuffer and reading back GPUStartTime / GPUEndTime. This gives microsecond-accurate per-kernel GPU timing, but at the cost of massive overhead from the per-kernel command buffer synchronization. You would never use this in production – it is purely a diagnostic tool.

Summary

| Tool | When to Use | Output |
| --- | --- | --- |
| akunu_bench | Quick throughput comparison | Prefill tok/s, decode tok/s (markdown table) |
| akunu_benchmark | End-to-end with real prompts | TTFT, prefill/decode speed at multiple prompt lengths |
| akunu_profile | Identifying kernel bottlenecks | Per-kernel GPU time breakdown, sorted by cost |
| Metal System Trace | CPU-GPU sync analysis | Timeline of command buffer submissions and GPU execution |
| GPU Counters | Hardware utilization | Occupancy, bandwidth, ALU utilization |
| Roofline analysis | Understanding theoretical limits | Whether you are memory-bound or compute-bound |

The fundamental insight for Apple Silicon LLM inference is that decode is memory-bound and will remain so for the foreseeable future. The job of the profiler is not to find ways to make the GPU compute faster – it is to find the places where you are wasting bandwidth or leaving the GPU idle. Chain decode, weight fusion, and chip-specific GEMV tuning are all strategies that akunu uses to close the gap between measured and theoretical bandwidth, and the profiling tools described in this chapter are how you verify that those strategies are working.


  1. On Apple Silicon, each MTLCommandBuffer commit-and-wait cycle costs approximately 30-80 microseconds of CPU overhead. At 80+ tok/s, a 50us overhead per token adds up to 4ms per second – a fraction of a percent on its own, but per-token submission also leaves the GPU idle while the CPU encodes the next step, which is where the larger throughput loss comes from.

  2. Metal’s GPU timing (GPUStartTime / GPUEndTime on MTLCommandBuffer) measures the time the command buffer was executing on the GPU. For a single kernel this is accurate, but for a command buffer containing hundreds of dispatches, you only get the total. Apple’s GPU Timeline in Instruments provides per-dispatch timing, but requires running inside Xcode.

  3. Williams, S., Waterman, A., & Patterson, D. (2009). “Roofline: an insightful visual performance model for multicore architectures.” Communications of the ACM, 52(4), 65-76. See https://doi.org/10.1145/1498765.1498785.

  4. Actual SLC sizes estimated in akunu’s ChipConfig: 8 MB (M1/M2/M3 base), 16 MB (M4 base), 24 MB (M1/M2/M3 Pro), 32 MB (M4 Pro), 48 MB (Max), 96 MB (Ultra). These are not published by Apple but inferred from performance measurements and die analysis.

  5. This comparison is for the Metal backend specifically. llama.cpp supports many backends (CUDA, Vulkan, CPU) and architectures; akunu targets Apple Silicon exclusively, which allows tighter optimization.