
Performance Optimization Patterns

We have covered the memory model and the matrix operations. Now we need to talk about how to make them fast. This chapter is a catalog of GPU performance patterns – the techniques that separate a naive kernel from one that saturates the hardware. Each pattern is explained in general terms, then grounded in how akunu applies it. If you have ever stared at a GPU profiler and wondered why your kernel is at 30% utilization, this chapter is for you.

The Roofline Model: Know Your Bottleneck

Before optimizing anything, you need to know what you are optimizing for. The roofline model gives you a simple framework: every kernel is either compute-bound or memory-bandwidth-bound.1

Performance (GFLOPS)
    ^
    |                  ┌─────────────── Peak Compute
    |                 /
    |                /
    |               / <-- Roofline
    |              /
    |             /
    |            /
    |           /
    |──────────/───────────────────── Peak Bandwidth
    |         /
    |        /
    |       /
    └──────┴──────────────────────────>
           Arithmetic Intensity (FLOPS/byte)

Arithmetic intensity = (total FLOPs) / (total bytes transferred). It tells you how much compute work you do for each byte you read from memory.

  • Low arithmetic intensity (e.g., GEMV with M=1): ~1 FLOP/byte. You read a weight, multiply once, move on. You are bandwidth-bound. Adding more compute units does not help.
  • High arithmetic intensity (e.g., GEMM with M=512): ~512 FLOPs/byte. You read a weight and reuse it 512 times. You are compute-bound. More bandwidth does not help.

For Apple Silicon, the crossover point depends on the chip:

| Chip | Peak Compute (FP16) | Memory BW | Crossover (FLOPs/byte) |
|------|---------------------|-----------|------------------------|
| M1 | ~2.6 TFLOPS | 68 GB/s | ~38 |
| M1 Max | ~10.4 TFLOPS | 400 GB/s | ~26 |
| M4 | ~4.3 TFLOPS | 120 GB/s | ~36 |
| M4 Pro | ~8.7 TFLOPS | 273 GB/s | ~32 |
| M4 Max | ~17.4 TFLOPS | 546 GB/s | ~32 |

This means: if your kernel’s arithmetic intensity is below ~32, it is bandwidth-bound on most Apple Silicon. Every single GEMV in the decode path (arithmetic intensity ~1-2) is bandwidth-bound. Only prefill GEMM and attention (with long sequences) are compute-bound.

What This Means for Akunu

The roofline model dictates akunu’s optimization priorities:

  1. For decode (bandwidth-bound): Minimize bytes read. Use quantization. Fuse operations to avoid re-reading intermediates. Optimize for memory access patterns.
  2. For prefill (compute-bound): Maximize compute utilization. Use SIMD matrix operations. Maximize occupancy. Optimize threadgroup sizes.
  3. For attention (mixed): Short sequences are bandwidth-bound (small KV cache); long sequences become compute-bound.

Pattern 1: Vectorized Loads

GPU memory controllers deliver data in large chunks. On Apple Silicon, the memory bus is 128 or 256 bits wide. Reading a single float16 (2 bytes) wastes most of that bus width. Instead, you want to read 4 or 8 elements at once using vector types:

// Slow: scalar loads (2 bytes each)
half val0 = input[tid * 4 + 0];
half val1 = input[tid * 4 + 1];
half val2 = input[tid * 4 + 2];
half val3 = input[tid * 4 + 3];

// Fast: vectorized load (8 bytes at once)
half4 vals = *(device const half4 *)(input + tid * 4);

The vectorized version generates a single memory transaction instead of four. On Apple Silicon, half4 loads are the sweet spot – 8 bytes per load, which matches the register file width.2

In akunu’s GEMV kernels, you will see patterns like:

// Load 16 half values at once (32 bytes)
device const half *x = B_half + row * K + il * 16;
// Thread reads 16 consecutive halves via multiple half4 loads

The simd_gemm_f16.metal kernel has each thread load 16 consecutive half-precision values per K-step, using the thread’s position within the SIMD group to cover different portions of the tile.

Pattern 2: Coalesced Memory Access

Coalescing means that adjacent threads access adjacent memory locations, so the hardware can merge their loads into a single wide memory transaction.

Thread 0 reads address 0x1000
Thread 1 reads address 0x1002
Thread 2 reads address 0x1004
...
Thread 31 reads address 0x103E

=> Hardware merges into ONE 64-byte transaction

Uncoalesced access – where threads read scattered addresses – is catastrophic. Instead of one transaction, you get 32 individual transactions, each wasting bus bandwidth.

Thread 0 reads address 0x1000   // row 0, col 0
Thread 1 reads address 0x4000   // row 1, col 0
Thread 2 reads address 0x7000   // row 2, col 0

=> 32 separate transactions (32x slower)

How Akunu Ensures Coalescing

In akunu’s GEMV kernels, the weight matrix B is stored in row-major order with rows corresponding to output dimensions. Each SIMD group processes a contiguous block of rows. Within the K-reduction loop, threads within a SIMD group read contiguous elements along the K dimension:

Thread 0: B[row][k_base + 0..3]
Thread 1: B[row][k_base + 4..7]
Thread 2: B[row][k_base + 8..11]
...

This is perfectly coalesced along K. For the output (N dimension), different SIMD groups write to different output rows, which are widely separated – but writes are much less frequent than reads, so this is acceptable.

For quantized formats, coalescing is trickier. Q4_0 blocks are 18 bytes: 16 four-bit values + 2 bytes of scale. The block layout is designed so that adjacent threads can read adjacent blocks, maintaining coalescing despite the non-power-of-2 block size.

Pattern 3: Threadgroup Memory and Bank Conflicts

Threadgroup memory (also called shared memory on NVIDIA) is a fast scratchpad local to a threadgroup. On Apple Silicon, it is organized into banks – typically 32 banks of 4 bytes each.3

Bank conflicts occur when multiple threads in a SIMD group access the same bank simultaneously:

// Bank conflict: stride 32 equals the bank count, so every thread maps to bank 0
shmem[tid * 32]  // (tid * 32) % 32 == 0 for all threads => 32-way conflict, fully serialized

// No conflict: threads access consecutive addresses
shmem[tid]       // each thread hits a different bank

In akunu’s GEMM kernels, the A and B tiles are loaded into threadgroup memory. The tile layout is carefully chosen to avoid bank conflicts during the subsequent simdgroup_load operations. The sa (A tile) and sb (B tile) pointers are offset:

threadgroup half *sa = shmem;
threadgroup half *sb = shmem + 4096 / sizeof(half);

The A tile gets 4096 bytes (2048 halves) and the B tile gets 2048 bytes (1024 halves), for a total of 6144 bytes per threadgroup. These sizes are chosen to minimize bank conflicts when loading into SIMD matrix registers.

Pattern 4: Loop Unrolling

GPU compilers can unroll loops, but sometimes you need to help. Unrolling reduces loop overhead (branch instructions, index increments) and exposes instruction-level parallelism:

// Before: tight loop, branch every iteration
for (int i = 0; i < 8; i++) {
    simdgroup_load(mb, sb + ..., stride);
    simdgroup_multiply_accumulate(mc[i], ma, mb, mc[i]);
}

// After (compiler typically does this): 8 loads + 8 MACs, no branches
// The 'constexpr' loop bound helps the compiler unroll

In akunu’s GEMM kernel, the inner K-loop has a stride of NK=32, with 4 sub-steps of 8 elements each. The sub-steps process 8 output tiles per SIMD group. The compiler unrolls both the sub-step loop and the tile loop because the bounds are compile-time constants.

Pattern 5: Function Constants (Metal Specialization)

Metal’s function constants are a form of compile-time specialization that lets you create optimized kernel variants without code duplication:

constant uint FC_GEMM_K [[function_constant(10)]];
constant bool FC_GEMM_K_SPECIALIZED = is_function_constant_defined(FC_GEMM_K);

// In the kernel:
const uint K_dim = FC_GEMM_K_SPECIALIZED ? FC_GEMM_K : K;

When FC_GEMM_K is defined, the compiler knows the K dimension at compile time and can:

  • Unroll the K-loop completely for small K
  • Eliminate bounds checking
  • Pre-compute strides and offsets
  • Optimize register allocation

Akunu uses function constants extensively for MLX quantized kernels, where group_size and K are baked into the pipeline state object:

uint32_t fc_indices[] = {0, 1};
uint32_t fc_values[] = {(uint32_t)quant_group_size, (uint32_t)K};
pso = device.get_pipeline(kernel_name, cache_key, fc_indices, fc_values, 2);

Each unique (group_size, K) combination gets a specialized pipeline. These are cached in MetalDevice::pso_cache_ so the specialization cost is paid once at model load time.4

Pattern 6: Kernel Fusion

Kernel fusion combines multiple operations into a single GPU dispatch. Each dispatch has overhead: pipeline binding, buffer binding, dispatch command encoding, and GPU scheduling. More importantly, each dispatch boundary forces intermediate results to be written to and re-read from global memory.

Fusions in Akunu

Akunu applies several fusions:

1. Residual + RMSNorm Fusion

Instead of separate residual_add and rmsnorm dispatches:

// Unfused: 3 dispatches, 3 reads + 3 writes
residual = input + skip_connection     // read input, skip; write residual
norm_input = rmsnorm(residual, weight) // read residual, weight; write norm_input

Akunu uses residual_rmsnorm_f16:

// Fused: 1 dispatch, reads input + skip + weight, writes norm_output + updated residual
residual_rmsnorm_f16(input, skip, weight, norm_output, residual, params)

This saves two kernel launches and two round-trips to global memory.

2. SiLU(gate) * up + Down GEMV Fusion

As discussed in the previous chapter, the fused SiLU+down kernel combines activation and projection:

// Unfused: 2 dispatches
act = SiLU(gate) * up              // read gate, up; write act
down = GEMV(down_weight, act)      // read act, down_weight; write down
// Fused: 1 dispatch
down = fused_silu_gemv(gate, up, down_weight)  // reads gate, up, weight; writes down

This eliminates the intermediate act buffer write and re-read.

3. QK-Norm + RoPE + KV Cache Write Fusion

For architectures with per-head Q/K norms (Qwen3, Gemma), akunu fuses head normalization, rotary position encoding, and KV cache writes into a single kernel:

// From table_builder.cpp:
Pipeline fused_pso = device.get_pipeline("head_norm_rope_neox_kv_write_f16");

This replaces 3-4 separate dispatches with one, which is especially impactful because these operations are tiny (operating on a single head at a time) and the dispatch overhead would dominate.

4. QKV Projection Fusion

When Q, K, and V weight matrices share the same dtype and the SLC is large enough to benefit:

bool fuse_qkv = (chip.should_fuse_weights || is_mlx) && q_dtype == k_dtype && k_dtype == v_dtype;
if (fuse_qkv) {
    Buffer fused_w = weights.fuse_weights(q_name, k_name, v_name);
    gemv(scratch.residual, fused_w, scratch.qkv, 0, q_dtype, qkv_total, dim);
}

Three GEMV dispatches become one, reading the fused weight matrix once instead of three times.

5. Gate + Up Projection Fusion

Same principle applied to the FFN gate and up projections:

bool fuse_gate_up = (chip.should_fuse_weights || gate_is_mlx) && (gate_dtype == up_dtype);
if (fuse_gate_up) {
    Buffer fused_gate_up_w = weights.fuse_weights(gate_name, up_name);
    gemv(scratch.attn_out, fused_gate_up_w, scratch.ffn_gate, 0, gate_dtype, 2 * ffn_dim, dim);
}

When NOT to Fuse

Fusion is not always beneficial. The SLC-gated fusion decisions in akunu illustrate this: on chips with small SLC (M1 base, 8 MB estimated), the fused QKV weight matrix may be too large to fit in cache, causing more cache thrashing than the unfused version. The ChipConfig::should_fuse_weights flag controls this:

c.should_fuse_weights = (c.slc_bytes >= 16 * 1024 * 1024);  // Pro+ and M4 Base

Pattern 7: Occupancy and Threadgroup Sizing

Occupancy is the ratio of active threads to the maximum the GPU can support simultaneously. Higher occupancy generally means better latency hiding – when one SIMD group stalls on a memory access, another can execute.

On Apple Silicon, each GPU core can run multiple threadgroups concurrently (the exact limit depends on register pressure and threadgroup memory usage). The threadgroup size directly affects occupancy:

| Threadgroup Size | SIMD Groups | Typical Use |
|------------------|-------------|-------------|
| 32 | 1 | Very light kernels (argmax) |
| 128 | 4 | Standard GEMV (4 SIMD groups) |
| 256 | 8 | Wide GEMV, standard GEMM |
| 1024 | 32 | Flash attention, RMSNorm |

Akunu’s ChipConfig controls threadgroup sizing for normalization kernels:

c.norm_tg_size = 1024;  // max threads for RMSNorm
c.max_threads_per_tg = 1024;  // Metal's maximum

For RMSNorm on a 4096-dimensional model, the threadgroup size is min(4096, 1024) = 1024. All 1024 threads participate in the reduction (computing the root-mean-square), with the final result broadcast to all threads for the normalization step.

For GEMV, the threadgroup size trades off between parallelism and overhead:

  • 128 threads (4 SIMD groups): Standard GEMV. Good for small chips with limited cores.
  • 256 threads (8 SIMD groups): Wide GEMV. Better occupancy on Pro+ chips with many cores. Controlled by chip.gemv_wide_standard.

Pattern 8: Avoiding Redundant Work with Pre-Computation

The most efficient computation is the one you do not do. Akunu pre-computes everything possible at model load time:

  1. Pipeline state objects are created and cached in pso_cache_. No pipeline compilation during inference.

  2. Buffer bindings are resolved once in the dispatch table. The DispatchCmd stores actual Buffer handles and byte offsets, not symbolic names.

  3. Kernel parameters (dimensions, epsilon values, strides) are stored as raw bytes in DispatchCmd::param_bytes[64]. For static params, a GPU buffer is pre-allocated:

// From table_builder.cpp, end of build_dispatch_table():
for (auto& cmd : cmds) {
    if (cmd.param_size > 0) {
        cmd.param_buf = device.allocate(cmd.param_bytes, cmd.param_size);
    }
}

This means the dispatch table replay loop does almost zero work per command beyond Metal API calls. No string lookups, no hash table queries, no conditional logic.

  4. RoPE frequencies can be pre-computed into a GPU buffer (arch.rope_freqs) rather than computed per-token.

  5. Dispatch geometry (grid size, threadgroup size) is computed once and stored in DispatchCmd::grid and DispatchCmd::threadgroup.

Pattern 9: Minimizing Command Buffer Overhead

Each Metal command buffer has submission overhead: the CPU must package the commands, the GPU command processor must parse them, and there is a synchronization cost at completion. Akunu minimizes this in three ways:

1. Chain Decode: Many Tokens Per Command Buffer

Instead of one command buffer per token, akunu’s chain decode batches multiple tokens into a single submission:

// ChipConfig determines batch size:
c.chain_decode_chunk = 128;  // M4: 128 tokens per submission
c.chain_decode_chunk = 64;   // M1 base: 64 tokens per submission

One command buffer encodes the full dispatch table N times (once per token), with per-token patching of position and offsets. This amortizes the command buffer overhead across many tokens.

2. Unretained References

Akunu uses commandBufferWithUnretainedReferences, which tells Metal not to retain buffer references. This avoids atomic reference counting on every setBuffer call – a significant savings when a single command buffer contains thousands of buffer bindings.5

3. Event-Based Pipelining

Akunu supports overlapping GPU execution with CPU encoding using MTLSharedEvent:

// GPU signals event after completing
[cmdBuffer encodeSignalEvent:pipelineEvent value:signalVal];
[cmdBuffer commit];

// Next command buffer waits for the event (GPU-GPU sync, no CPU involvement)
[nextCmdBuffer encodeWaitForEvent:pipelineEvent value:eventValue];

This allows the CPU to encode the next batch of tokens while the GPU is still processing the current one. The event-based synchronization is GPU-to-GPU, avoiding a CPU round-trip.

Pattern 10: The setBuffer vs setBytes Split

This pattern is specific to akunu’s dispatch table replay and deserves its own section. During chain decode, the same commands are repeated for each token. Most parameters are identical across tokens – only the position-dependent fields change.

Akunu splits parameters into two categories:

| Category | Mechanism | Per-Token Cost | Example |
|----------|-----------|----------------|---------|
| Static params | setBuffer() with pre-allocated GPU buffer | Zero (buffer already bound) | GEMV dimensions, strides |
| Position-patched params | setBytes() with inline patching | ~64-byte memcpy + encode | RoPE position, KV seq length |

This is visible in the encode_chain fast path. For each command, the encoder patches position-dependent parameters in-place and calls set_bytes to inline the (small, <64 byte) parameter data into the command buffer:

for (int i = 0; i < n; i++) {
    for (auto& cmd : table.commands) {
        device.set_pipeline(cmd.pso);
        // Bind buffers (static — same every token)
        for (int b = 0; b < cmd.buffer_count; b++)
            device.set_buffer(cmd.buffers[b], cmd.offsets[b], b);
        // Patch and set parameters (only position/kv_len change per token)
        if (cmd.patch_type != PATCH_NONE)
            apply_patch(cmd, pos + i);
        device.set_bytes(cmd.param_bytes, cmd.param_size, cmd.param_index);
        device.dispatch(cmd.grid, cmd.threadgroup);
    }
}

For a typical 32-layer model, the dispatch table has ~260 commands. Only the RoPE, attention, and argmax commands need per-token patching (~100 commands). The rest pass through with unchanged param_bytes — but all use set_bytes (not setBuffer) since the parameter data is always small enough (<64 bytes) to inline.

Profiling Tools

Knowing these patterns is only useful if you can measure their impact. Apple provides several profiling tools:

Xcode GPU Debugger

Capture a Metal frame and inspect:

  • Per-dispatch GPU time
  • Memory bandwidth utilization
  • Occupancy
  • Wait time (stalls)

Metal System Trace (Instruments)

Part of Instruments.app. Shows:

  • Command buffer submission and completion timeline
  • GPU utilization over time
  • CPU-GPU synchronization points
  • Memory allocation events

akunu_profile

Akunu includes a profiling tool that uses per-layer command buffers to get GPU timing for each operation:

// From akunu.h:
int akunu_profile_decode_step(akunu_model_t model, uint32_t token_id,
                              int position, float *timing_out, int max_entries);
const char *akunu_profile_label(akunu_model_t model, int index);

This runs each layer in its own command buffer (much slower than normal inference) but gives you per-operation GPU timing. The output looks like:

embedding              0.012 ms
layer.0.qkv_proj       0.045 ms
layer.0.rope_kv_write  0.008 ms
layer.0.attention      0.082 ms
layer.0.o_proj         0.041 ms
layer.0.fused_ffn_norm 0.006 ms
layer.0.gate_up_proj   0.078 ms
layer.0.ffn            0.043 ms
...

Counter Sampling

Metal supports GPU hardware counter sampling for detailed performance analysis. You can measure:

  • ALU utilization
  • Memory read/write bytes
  • Cache hit rates
  • Occupancy percentages

These are available through MTLCounterSampleBuffer and are essential for diagnosing whether a kernel is compute-bound or bandwidth-bound.6

Putting It All Together: A Single Layer’s Performance Profile

Let’s trace through one transformer layer during decode (M=1) on M4 Pro and identify the bottleneck for each operation:

| Operation | Kernel | Bottleneck | Time | Notes |
|-----------|--------|------------|------|-------|
| Attention Norm | residual_rmsnorm_f16 | BW (dim reads) | 0.006 ms | Light: 4096 elements |
| Q/K/V GEMV (fused) | gemv_q4_0 | BW (weight read) | 0.12 ms | Reads ~6 MB fused QKV weight |
| RoPE + KV Write | rope_qkv_write_f16 | BW | 0.008 ms | Light: head_dim/2 elements |
| Flash Attention | flash_attn_decode_parallel | Mixed | 0.04-0.1 ms | Depends on seq length |
| O Projection | gemv_q4_0 | BW | 0.04 ms | 4096x4096 weight |
| FFN Norm | residual_rmsnorm_f16 | BW | 0.006 ms | Light |
| Gate+Up GEMV (fused) | gemv_q4_0 | BW | 0.15 ms | Reads ~12 MB fused gate+up |
| Fused SiLU+Down | gemv_q4_0_silu | BW | 0.04 ms | 4096x14336 weight |
| Total per layer | | | ~0.43 ms | |

For 32 layers: ~13.8 ms/token = ~72 tok/s. This is close to the theoretical bandwidth limit of ~70 tok/s we computed in the memory chapter, confirming that decode is bandwidth-saturated.

Common Mistakes

Mistake 1: Optimizing Compute for a Bandwidth-Bound Kernel

If your GEMV kernel is at 95% bandwidth utilization and 10% compute utilization, making the math faster will not help. You need to reduce the number of bytes read (quantize to lower bits, fuse operations to eliminate intermediate buffers).

Mistake 2: Tiny Threadgroups

Using a threadgroup of 32 threads for a GEMV kernel means only 1 SIMD group per threadgroup. The GPU core has no threads to switch to when this SIMD group stalls on memory. Use at least 128 threads (4 SIMD groups) for any memory-heavy kernel.

Mistake 3: Forgetting Threadgroup Barriers

In GEMM kernels that use threadgroup memory, forgetting threadgroup_barrier(mem_flags::mem_threadgroup) between writing to and reading from shared memory causes data races. The barrier ensures all threads in the threadgroup have completed their writes before any thread reads.

Mistake 4: Over-Fusing

Fusing too many operations into one kernel can increase register pressure, reducing occupancy and hurting performance. If a fused kernel needs more registers than the hardware can provide, the GPU will “spill” registers to memory, destroying performance. The separate activation + GEMV fallback in akunu exists for exactly this reason.

Summary

Performance optimization on Apple Silicon GPU follows a clear decision tree:

  1. Is the kernel bandwidth-bound or compute-bound? Use the roofline model.
  2. If bandwidth-bound: Reduce bytes (quantize, fuse), improve access patterns (coalesce, vectorize).
  3. If compute-bound: Maximize utilization (occupancy, SIMD matrix ops, loop unrolling).
  4. Always: Pre-compute everything possible, minimize dispatch overhead, use function constants for specialization.

Akunu applies all of these patterns systematically, with the ChipConfig and DTypeDescriptor tables encoding the chip-specific and dtype-specific tuning decisions. The dispatch table pre-resolves everything at model load time so the hot path is a tight loop of Metal API calls.



  1. Williams, S., Waterman, A., & Patterson, D. (2009). “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM, 52(4), 65-76. The roofline model remains the most effective tool for classifying GPU kernel performance. See https://doi.org/10.1145/1498765.1498785.

  2. Apple, “Metal Best Practices Guide: Optimize Memory Accesses,” 2024. Vectorized loads are recommended for maximizing memory throughput. See https://developer.apple.com/documentation/xcode/analyzing-the-performance-of-your-metal-app.

  3. The exact bank configuration on Apple Silicon is not officially documented. The 32-bank / 4-byte-per-bank configuration is inferred from performance profiling and is consistent with other GPU architectures. See Dougall Johnson’s Apple GPU documentation: https://dougallj.github.io/applegpu/.

  4. Apple, “Using Function Specialization to Build Pipeline Variants.” Function constants are the recommended way to create specialized shader variants without preprocessor macros. See https://developer.apple.com/documentation/metal/using-function-specialization-to-build-pipeline-variants.

  5. Apple, “commandBufferWithUnretainedReferences Documentation.” Unretained references eliminate atomic retain/release overhead but require manual lifetime management. See https://developer.apple.com/documentation/metal/mtlcommandqueue/1508684-makecommandbufferwithunretainedr.

  6. Apple, “Optimizing Performance with the GPU Counters Instrument,” 2024. GPU hardware counters provide per-kernel metrics including ALU utilization, memory bandwidth, and cache hit rates. See https://developer.apple.com/documentation/xcode/analyzing-the-performance-of-your-metal-app.