
Memory Model and Buffer Management

If you have ever profiled a GPU workload and been disappointed by the utilization numbers, the root cause was almost certainly memory. Not compute. Memory. On Apple Silicon, the story is both simpler and subtler than on discrete GPUs: the CPU and GPU share a single physical memory pool (Unified Memory Architecture, or UMA), which eliminates an entire class of problems – but introduces a few new ones. This chapter walks through Metal’s memory model from first principles, shows how akunu maps it to inference, and explains why bandwidth – not FLOPS – is the number you should be worrying about.

The Unified Memory Architecture

Traditional GPU programming on NVIDIA or AMD involves two physically separate memory pools. The CPU has its DDR/LPDDR; the GPU has its GDDR/HBM. Every time you want the GPU to see CPU data, you perform an explicit copy across the PCIe bus (~32 GB/s per direction for PCIe 4.0 x16). Every time you want CPU results, you copy back. These copies are slow relative to local memory, awkward to overlap with compute, and a constant source of bugs.1

Apple Silicon threw this model out. Starting with M1, the CPU, GPU, and Neural Engine all share a single pool of LPDDR memory with a single set of page tables.2 There is no PCIe bus. There is no copy. When the CPU writes to address 0x1234, the GPU can read from that same address – because it is the same physical page.

┌──────────────────────────────────────────────────┐
│             Unified Memory (LPDDR5/5X)           │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐        │
│   │  CPU    │   │  GPU    │   │ Neural  │        │
│   │  Cores  │   │  Cores  │   │ Engine  │        │
│   └────┬────┘   └────┬────┘   └────┬────┘        │
│        │             │             │             │
│        └─────────────┼─────────────┘             │
│                      │                           │
│            ┌─────────┴─────────┐                 │
│            │   System Level    │                 │
│            │   Cache (SLC)     │                 │
│            └─────────┬─────────┘                 │
│                      │                           │
│            ┌─────────┴─────────┐                 │
│            │  Fabric / Memory  │                 │
│            │    Controller     │                 │
│            └───────────────────┘                 │
└──────────────────────────────────────────────────┘

This has profound implications for inference engines:

  1. Zero-copy buffer sharing. A MTLBuffer allocated with MTLResourceStorageModeShared is readable and writable by both CPU and GPU without any explicit transfer. The CPU gets a raw pointer (buf.contents); the GPU gets the same backing pages.

  2. No staging buffers needed. On CUDA, you allocate a “host-pinned” staging buffer, memcpy into it, then launch a cuMemcpyHtoD. On Metal with shared mode, you just write directly and dispatch.

  3. Coherency is automatic (mostly). After GPU work completes (i.e., waitUntilCompleted returns), the CPU sees all GPU writes immediately. The hardware’s cache coherency protocol handles this.3

Metal Storage Modes

Metal offers three storage modes for buffers. Understanding them is essential because choosing wrong means either unnecessary copies or corrupted data.

| Storage Mode | CPU Access | GPU Access | Copy Needed?      | Best For                               |
|--------------|------------|------------|-------------------|----------------------------------------|
| Shared       | Read/Write | Read/Write | No                | UMA devices (all Apple Silicon)        |
| Private      | None       | Read/Write | Yes (blit)        | GPU-only temporaries                   |
| Managed      | Read/Write | Read/Write | Yes (synchronize) | macOS discrete GPU (not Apple Silicon) |

On Apple Silicon, Shared mode is the overwhelmingly correct choice. It provides direct CPU and GPU access with no copies. Private mode can be slightly faster for buffers the CPU never touches (the GPU may have more freedom in cache management), but in practice the difference is negligible for inference workloads – and you lose the ability to read results without a blit encoder.4

Managed mode exists for macOS systems with discrete AMD GPUs and is irrelevant to Apple Silicon. If you see Managed in Metal sample code, it is targeting a different hardware class.

Why Akunu Uses Shared Mode Everywhere

Look at how MetalDevice::allocate works in akunu:

Buffer MetalDevice::allocate(size_t bytes) {
    id<MTLBuffer> buf = [STATE.device newBufferWithLength:MAX(bytes, 16)
                                                 options:MTLResourceStorageModeShared];
    void *h = (void *)CFBridgingRetain(buf);
    allocated_bytes_ += MAX(bytes, 16);
    return {h, bytes, [buf contents]};
}

Every allocation uses MTLResourceStorageModeShared. The returned Buffer struct stores both the opaque Metal handle (h) and the CPU-accessible pointer ([buf contents]). This means:

  • Weight loading is zero-copy. When akunu parses a GGUF file, it can mmap the file and allocate Metal buffers that point at the same pages. The GPU reads weights directly from the memory-mapped file. No memcpy, no staging buffer, no DMA transfer.

  • KV cache is CPU-readable. After decode completes, the CPU can inspect KV cache contents directly through buf.contents – useful for debugging and speculative decoding verification.

  • Scratch buffers just work. The ScratchBuffers struct allocates everything up front with device.allocate(). During the hot path, nothing is allocated or freed.

The Buffer Struct: Akunu’s Abstraction

Akunu wraps Metal buffers in a simple POD struct defined in device.h:

struct Buffer {
    void *handle;    // Backend-specific pointer (MTLBuffer*, CUdeviceptr, etc.)
    size_t size;     // Size in bytes
    void *contents;  // CPU-accessible pointer (for UMA or mapped memory)
};

This is deliberately minimal. No reference counting, no smart pointers, no virtual methods. Just three fields that fit in 24 bytes. The handle is an opaque pointer – on Metal it is a CFBridgingRetain’d id<MTLBuffer>, on a future CUDA backend it would be a CUdeviceptr. The contents pointer is the CPU-visible address; on discrete GPUs it would be nullptr.

Why no reference counting? Because buffer lifetimes in inference are trivially static. You allocate all buffers at model load time, use them for the entire session, and free them at shutdown. There is no dynamic buffer creation in the hot path. This means the overhead of shared_ptr or Arc-style reference counting is pure waste.5

The allocate-with-data Overload

There is a second allocate overload that accepts initial data:

Buffer MetalDevice::allocate(const void *data, size_t bytes) {
    id<MTLBuffer> buf = [STATE.device newBufferWithBytes:data
                                                  length:bytes
                                                 options:MTLResourceStorageModeShared];
    void *h = (void *)CFBridgingRetain(buf);
    allocated_bytes_ += bytes;
    return {h, bytes, [buf contents]};
}

This is used for two things: uploading initial weight data and creating pre-filled parameter buffers. Metal’s newBufferWithBytes: internally memcpys the data into the newly allocated buffer. On UMA this is a simple memcpy; there is no bus transfer.

The setBytes() Optimization: Inline Small Data

Not everything needs a buffer. Metal provides setBytes:length:atIndex: for small, frequently-changing data. Instead of allocating a buffer, writing to it, and binding it, you just hand the encoder a pointer and a length. Metal copies the bytes directly into the command buffer’s inline data area.6

The rules are:

| Data Size | Mechanism   | Per-Dispatch Cost                  |
|-----------|-------------|------------------------------------|
| <= 4 KB   | setBytes()  | ~0 (inline in command buffer)      |
| > 4 KB    | setBuffer() | Buffer bind + potential cache miss |

Akunu uses this aggressively. Every kernel’s parameter struct (dimensions, epsilon values, strides) is under 64 bytes. The DispatchCmd struct stores these inline:

// Inline params (up to 64 bytes -- covers all kernel param structs)
uint8_t param_bytes[64];
int param_size;
int param_index;  // buffer index for setBytes/setBuffer

During dispatch table replay, static params use pre-allocated GPU buffers via setBuffer() (zero per-token work), while position-patched params use setBytes() (the patched bytes get copied inline into the command buffer). This split is the key insight from akunu’s Metal backend – setBytes() is perfect for the few parameters that change per token, while setBuffer() avoids redundant work for the many parameters that do not.

// From MetalDevice::encode_dispatch_table:
// Static params: use pre-allocated GPU buffer (no per-token work)
if (cmd.param_buf.handle)
    [enc setBuffer:(__bridge id<MTLBuffer>)cmd.param_buf.handle
            offset:0
           atIndex:(NSUInteger)cmd.param_index];
else
    [enc setBytes:cmd.param_bytes
           length:(NSUInteger)cmd.param_size
          atIndex:(NSUInteger)cmd.param_index];

Memory Alignment

Metal requires 16-byte alignment for buffer offsets when binding with setBuffer:offset:atIndex:.7 This is a hardware constraint – the GPU’s memory controller fetches aligned 16-byte chunks, and unaligned offsets would require splitting a fetch across two cache lines.

Akunu ensures alignment in several ways:

  1. Buffer sizes are rounded up. The minimum allocation is MAX(bytes, 16), ensuring every buffer is at least one alignment unit.

  2. QKV sub-offsets are naturally aligned. The ScratchBuffers struct computes QKV offsets as q_dim * 2 and (q_dim + kv_dim) * 2. Since q_dim and kv_dim are always multiples of head_dim (typically 64 or 128), and each element is 2 bytes (FP16), these offsets are always multiples of 128 or 256 bytes – far exceeding the 16-byte requirement.

  3. GGUF block alignment. Quantized formats like Q4_0 use 32-element blocks (18 bytes each). Weight tensors are stored as contiguous arrays of these blocks, and GGUF pads to alignment boundaries.

Resource Hazards and Synchronization

On discrete GPUs with separate memory, you have explicit copy commands that create natural synchronization points. On UMA with shared mode, the GPU and CPU can both touch the same memory at any time. This creates resource hazards: what happens if the CPU writes to a buffer while the GPU is reading it?

Metal handles this through command buffer completion:

  1. CPU-to-GPU: The CPU writes data before calling begin_encoding(). The encoder captures buffer references at encode time. As long as writes complete before encoding begins, the GPU will see them.

  2. GPU-to-CPU: After end_encoding_sync() returns (or waitUntilCompleted signals), all GPU writes are visible to the CPU. The hardware flushes caches as part of command buffer completion.

  3. GPU-to-GPU (within a command buffer): Metal guarantees that compute dispatches within a single command encoder execute in order. Dispatch N sees all writes from dispatch N-1. No barriers needed.8

  4. GPU-to-GPU (across command buffers): You need explicit synchronization. Akunu’s end_encoding_event() / begin_encoding_after_event() uses MTLSharedEvent for GPU-to-GPU signaling across command buffers:

// Signal after this command buffer completes
[STATE.cmdBuffer encodeSignalEvent:STATE.pipelineEvent value:signal_val];
[STATE.cmdBuffer commit];

// Next command buffer waits for the signal
[STATE.cmdBuffer encodeWaitForEvent:STATE.pipelineEvent value:STATE.eventValue];

This is critical for pipelined chain decode, where one command buffer is executing on the GPU while the CPU encodes the next one.

The Bandwidth Bottleneck

Here is the uncomfortable truth about LLM inference on Apple Silicon: you will almost never be compute-bound during token generation. You will be memory-bandwidth-bound.

Why? Consider a single decode step for a 7B parameter model with Q4_0 quantization. Each parameter is 4.5 bits on average (4 bits of data plus the amortized per-block scale). The total weight data is roughly:

$$7 \times 10^9 \times 4.5 / 8 \approx 3.9 \text{ GB}$$

Every decode step reads every weight exactly once (each layer’s Q, K, V, O, gate, up, down projections). That is 3.9 GB of memory reads per token. On an M4 Pro with ~273 GB/s memory bandwidth, the theoretical floor is:

$$3.9 \text{ GB} / 273 \text{ GB/s} \approx 14.3 \text{ ms/token} \approx 70 \text{ tok/s}$$

The actual compute work (multiply-accumulate operations) takes a fraction of this time. The GPU cores are waiting for data to arrive from DRAM, not crunching numbers.9

| Chip   | Memory Bandwidth | Theoretical Max tok/s (7B Q4) | Theoretical Max tok/s (70B Q4) |
|--------|------------------|-------------------------------|--------------------------------|
| M1     | 68 GB/s          | ~17                           | ~1.7                           |
| M1 Pro | 200 GB/s         | ~51                           | ~5.1                           |
| M1 Max | 400 GB/s         | ~102                          | ~10.2                          |
| M2     | 100 GB/s         | ~26                           | ~2.6                           |
| M3 Pro | 150 GB/s         | ~38                           | ~3.8                           |
| M4     | 120 GB/s         | ~31                           | ~3.1                           |
| M4 Pro | 273 GB/s         | ~70                           | ~7.0                           |
| M4 Max | 546 GB/s         | ~140                          | ~14.0                          |

This table reveals a fundamental reality: your token generation speed is almost entirely determined by how fast your chip can feed data to the GPU cores. More compute cores help for prefill (which is compute-bound), but for autoregressive decode, bandwidth is king.10

Implications for Engine Design

This bandwidth bottleneck drives several akunu design decisions:

  1. Minimize weight reads. Read each weight exactly once per token. Fuse operations where possible (SiLU + down projection, QKV projection) to reduce the number of kernel launches and avoid re-reading intermediate buffers.

  2. Use the System Level Cache (SLC). Apple Silicon chips have a large last-level cache shared between CPU and GPU. On M4 Pro, it is estimated at 32 MB. Weights that fit in SLC are served at cache bandwidth (~2 TB/s on M4), not DRAM bandwidth. This is why akunu fuses QKV and gate+up weights on chips with large SLC – the fused weight matrix is more likely to be in cache for the second read.11

  3. Quantize aggressively. Q4_0 reads half the bytes of FP16. Q2_K reads a quarter. Lower precision means less bandwidth, which directly translates to higher tok/s.

  4. Avoid unnecessary reads. The dispatch table pre-computes everything that can be pre-computed. Buffer bindings, pipeline states, parameter structs – all resolved at build time, not at dispatch time.

Pre-Allocated Scratch Buffers

Akunu allocates all intermediate buffers once at model load time via the ScratchBuffers struct:

struct ScratchBuffers {
    Buffer h0;         // [dim] FP16
    Buffer h1;         // [dim] FP16
    Buffer residual;   // [dim] FP16
    Buffer qkv;        // [q_dim + 2*kv_dim] FP16
    Buffer attn_out;   // [max(q_dim, dim)] FP16
    Buffer ffn_gate;   // [2 * ffn_dim] FP16 (2x for fused gate+up)
    Buffer ffn_up;     // [ffn_dim] FP16
    Buffer ffn_act;    // [ffn_dim] FP16
    Buffer logits;     // [vocab_size] FP16
    Buffer token_ids;  // [max_chain] U32
    // ... plus batch buffers for prefill
};

A few things to notice:

  • ffn_gate is 2x ffn_dim. This accommodates fused gate+up projections, where the output of a single GEMV contains both the gate and up vectors contiguously.

  • qkv is contiguous. Q, K, and V are stored in a single buffer at computed offsets (qkv_q_offset, qkv_k_offset, qkv_v_offset). When QKV fusion is enabled, a single GEMV writes all three projections into this buffer. When not fused, three separate GEMVs write to their respective sub-regions.

  • No dynamic allocation. The ScratchBuffers::create factory is called once. After that, every forward pass reuses the same buffers. This means zero memory allocation overhead in the hot path, zero fragmentation, and deterministic memory usage.

  • Prefill buffers are separate. Batch buffers (batch_h0, batch_q, etc.) are sized for the maximum prefill chunk. They are larger than decode buffers by a factor of prefill_chunk (typically 4096).

KV Cache Memory Layout

The KV cache is another critical memory structure, defined in kv_cache.h:

struct KVCache {
    int n_layers;
    int n_kv_heads;
    int head_dim;
    int max_length;
    int current_length;
    std::vector<Buffer> k_buffers;  // one per layer
    std::vector<Buffer> v_buffers;  // one per layer
    int kv_stride;  // max_length * head_dim
};

Each layer gets two buffers: one for K, one for V. The layout is [n_kv_heads, max_length, head_dim] in FP16. The kv_stride (= max_length * head_dim) is the number of elements between consecutive KV heads.

For a 32-layer model with 8 KV heads, 128 head dim, and 4096 max context:

$$\text{KV bytes per layer} = 8 \times 4096 \times 128 \times 2 = 8 \text{ MB}$$ $$\text{Total KV cache} = 32 \times 2 \times 8 = 512 \text{ MB}$$

This is significant – half a gigabyte just for KV cache on a model that “only” has 7B parameters. For 70B models with 80 layers, the KV cache can easily exceed 4 GB. This is why max_context is capped at 4096 by default in akunu_load_model().

Buffer Memory Layout (Shared Mode — UMA)

  ┌──────────────────────────────────────────────┐
  │  Model Weights (3.9 GB)                      │  GPU reads, CPU writes at load
  ├──────────────────────────────────────────────┤
  │  KV Cache (512 MB)                           │  GPU reads+writes per token
  ├──────────────────────────────────────────────┤
  │  Scratch Buffers (~10 MB)                    │  GPU reads+writes, reused every step
  ├──────────────────────────────────────────────┤
  │  Prefill Batch Buffers (~200 MB)             │  Used only during prefill
  └──────────────────────────────────────────────┘

Write Buffer and Read Buffer: Portability

The Device base class provides default implementations of write_buffer and read_buffer that use plain memcpy:

virtual void write_buffer(Buffer dst, const void *src, size_t bytes, size_t offset = 0) {
    if (dst.contents)
        memcpy((char *)dst.contents + offset, src, bytes);
}

virtual void read_buffer(void *dst, Buffer src, size_t bytes, size_t offset = 0) {
    if (src.contents)
        memcpy(dst, (const char *)src.contents + offset, bytes);
}

On UMA, this is literally a memcpy – there is no DMA, no bus, no asynchronous transfer. The comment says “Override for discrete GPU backends (CUDA cuMemcpyHtoD, etc.).” This is the portability escape hatch: a future CUDA backend would override these methods to use proper device-to-host copies.

Memory Tracking

MetalDevice tracks total bytes allocated:

size_t allocated_bytes_ = 0;
// In allocate():
allocated_bytes_ += MAX(bytes, 16);
// In free_buffer():
if (buf.size <= allocated_bytes_) allocated_bytes_ -= buf.size;

This is exposed through akunu_model_memory() in the C API, letting callers see how much GPU memory a loaded model uses. It is a simple counter, not a full memory allocator – because akunu does not need a full memory allocator. Buffers are allocated at init, freed at shutdown, and nothing happens in between.

Common Pitfalls

Before we leave the memory model, let me highlight a few traps that catch even experienced Metal developers:

Pitfall 1: Reading GPU Results Before Completion

// WRONG: GPU may still be writing
device.end_encoding_async();
float *result = (float *)logits_buf.contents;  // Race condition!

Always call wait() or end_encoding_sync() before reading GPU-written buffers from the CPU. UMA does not mean coherent-at-all-times; it means coherent-after-completion.

Pitfall 2: Buffer Lifetime with Unretained References

Akunu uses commandBufferWithUnretainedReferences for performance. This means Metal will NOT retain buffer references – if you free a buffer before the GPU finishes, you get a GPU fault. This is safe in akunu because all buffers outlive the GPU work (they are freed only at model destruction), but adding dynamic buffer management would require switching to commandBuffer (with retain) or careful lifetime tracking.12

Pitfall 3: Assuming Private Mode is Faster

On Apple Silicon, Private mode offers minimal benefit over Shared mode for inference workloads. The GPU’s cache hierarchy works the same way for both. Private prevents CPU access, which can be slightly more efficient for GPU-only temporaries, but the difference is typically <1% for bandwidth-bound kernels. Akunu chose simplicity over micro-optimization here.

Pitfall 4: Ignoring the 16-Byte Alignment Rule

If you bind a buffer with an offset that is not a multiple of 16, Metal will either silently produce garbage or crash with a validation error (if Metal validation is enabled). The fix is always the same: pad your data structures to 16-byte boundaries. Notice how akunu’s param structs include _p0, _p1 padding fields:

struct { uint32_t dim; float eps; uint32_t _p0, _p1; } norm_params;

Those _p0, _p1 fields pad the struct to 16 bytes, ensuring alignment when passed through setBytes().

Summary

Apple Silicon’s UMA simplifies GPU programming enormously: no copies, no staging buffers, no DMA. But it does not eliminate the fundamental bottleneck of memory bandwidth. For LLM inference, where every token requires reading the entire weight matrix, bandwidth determines throughput.

Akunu’s memory strategy is:

  • Shared mode everywhere for zero-copy CPU-GPU access
  • Static allocation of all buffers at model load time
  • Pre-allocated param buffers with the setBytes/setBuffer split for per-token patching
  • 16-byte aligned everything
  • Bandwidth-aware design driving quantization, weight fusion, and SLC exploitation

In the next chapter, we will look at the compute side: how Apple’s SIMD group matrix operations turn those bandwidth-fed bytes into actual matrix multiplications.



  1. NVIDIA CUDA Programming Guide, Section 3.2.2, “Device Memory.” The PCIe bus bottleneck is well-documented; PCIe 4.0 x16 provides ~32 GB/s in each direction (~64 GB/s aggregate). See https://docs.nvidia.com/cuda/cuda-c-programming-guide/.

  2. Apple, “Apple M1 Chip,” 2020. The Unified Memory Architecture was first introduced with M1, providing up to 68.25 GB/s bandwidth with a single memory pool. See https://www.apple.com/newsroom/2020/11/apple-unleashes-m1/.

  3. Apple, “Metal Best Practices Guide: Resource Storage Modes,” 2024. On Apple Silicon, shared mode buffers are coherent after command buffer completion. See https://developer.apple.com/documentation/metal/choosing-a-resource-storage-mode-for-apple-gpus.

  4. Apple, “Metal Feature Set Tables.” On Apple Silicon (Apple GPU family 7+), MTLResourceStorageModeShared is the recommended mode for most buffers. Private mode is useful for render targets and textures that the CPU never accesses. See https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf.

  5. This is a deliberate design decision. llama.cpp uses a similar approach with ggml_backend_buffer, where buffer lifetimes are tied to the model context. Reference counting would add atomic operations to every buffer bind – thousands per forward pass.

  6. Apple, “setBytes:length:atIndex: Documentation.” The data is copied inline into the command buffer. The maximum size is 4 KB. For larger data, use a buffer. See https://developer.apple.com/documentation/metal/mtlcomputecommandencoder.

  7. Apple, “Metal Best Practices Guide: Buffer Alignment.” Buffer offsets must be a multiple of 16 bytes for setBuffer:offset:atIndex:. See https://developer.apple.com/documentation/metal/mtlcomputecommandencoder.

  8. Apple, “Metal Programming Guide: Command Organization.” Within a single compute command encoder, dispatches execute in order with implicit memory barriers. See https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu.

  9. This analysis follows the roofline model methodology. See Williams, S., Waterman, A., & Patterson, D. (2009). “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM, 52(4), 65-76. See https://doi.org/10.1145/1498765.1498785.

  10. Memory bandwidth numbers sourced from Apple’s product specifications and independent testing by Anandtech and Chips and Cheese. Actual achieved bandwidth is typically 70-85% of theoretical peak due to memory controller overhead. See https://chipsandcheese.com/.

  11. The System Level Cache (SLC) is described in various Apple Silicon die analyses. See “Apple M1 Die Shot Analysis” by Chips and Cheese, 2021. SLC sizes are estimated from die area analysis and performance profiling; Apple does not officially disclose exact SLC sizes. See https://www.apple.com/newsroom/2020/11/apple-unleashes-m1/.

  12. Apple, “commandBufferWithUnretainedReferences Documentation.” Using unretained references avoids the overhead of Metal retaining every buffer referenced by a command buffer. This is safe when buffer lifetimes are manually managed. See https://developer.apple.com/documentation/metal/mtlcommandqueue/1508684-makecommandbufferwithunretainedr.