Unified Memory Architecture
If there’s one hardware feature that makes Apple Silicon uniquely suited for running large language models locally, it’s the Unified Memory Architecture (UMA). It’s the reason a MacBook with 64 GB of RAM can run a 70B model that would otherwise demand multiple $2,000+ NVIDIA GPUs in a PC. Let’s understand exactly how it works and why it matters for inference.
The Traditional Memory Problem
On a discrete GPU system, there are two completely separate memory pools:
┌─────────────────────────────────────────────────────────┐
│ TRADITIONAL DISCRETE GPU SYSTEM │
│ │
│ CPU Side GPU Side │
│ ┌──────────┐ ┌──────────────┐ │
│ │ System │ PCIe 4.0 x16 │ VRAM │ │
│ │ RAM │◄──────────────────►│ (GDDR6X) │ │
│ │ 64 GB │ ~32 GB/s │ 24 GB │ │
│ │ DDR5 │ bidirectional │ │ │
│ │ │ │ Bandwidth: │ │
│ │ BW: ~90 │ │ ~1 TB/s │ │
│ │ GB/s │ │ │ │
│ └──────────┘ └──────────────┘ │
│ │
│ To run inference: │
│ 1. Load model into system RAM (disk → RAM) │
│ 2. Copy weights to VRAM (RAM → PCIe → VRAM) │
│ 3. GPU computes on VRAM │
│ 4. Copy results back (VRAM → PCIe → RAM) │
│ │
│ PROBLEM: Model must fit in VRAM (24 GB max typical) │
│ PROBLEM: PCIe transfer is slow (~32 GB/s) │
│ PROBLEM: Double the memory usage (model in both pools) │
└─────────────────────────────────────────────────────────┘
A 70B model in Q4_0 quantization is about 40 GB. It literally doesn’t fit in a 24 GB RTX 4090’s VRAM. You need model parallelism across multiple GPUs, or offloading to CPU (which is painfully slow over PCIe).
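That size estimate is easy to check. In the GGML/GGUF Q4_0 format, weights are stored in 32-element blocks of 18 bytes each (an FP16 scale plus 32 four-bit values), i.e. 4.5 bits per weight. A quick sketch:

```c
#include <assert.h>

// Q4_0 (GGML): blocks of 32 weights, each block = 2-byte FP16 scale
// + 16 bytes of packed 4-bit values = 18 bytes, i.e. 4.5 bits/weight.
double q4_0_bytes(double n_params) {
    return n_params / 32.0 * 18.0;
}
```

For 70e9 parameters this gives about 39.4 GB, comfortably over a 24 GB VRAM limit.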
Apple’s Unified Memory
On Apple Silicon, there’s one memory pool, shared by everything:
┌─────────────────────────────────────────────────────────┐
│ APPLE SILICON UNIFIED MEMORY │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ SoC Die │ │
│ │ │ │
│ │ CPU ──┐ │ │
│ │ ├──► Fabric ──► SLC ──► Memory Controller │ │
│ │ GPU ──┤ │ │
│ │ ├──► │ │
│ │ NE ──┘ │ │
│ └────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ LPDDR5 │ │
│ │ Unified │ │
│ │ Memory │ │
│ │ Pool │ │
│ │ │ │
│ │ 16-192 GB │ │
│ └─────────────┘ │
│ │
│ To run inference: │
│ 1. Load model into memory (disk → unified memory) │
│ 2. GPU computes on it directly. Done. │
│ │
│ NO COPY. NO VRAM LIMIT. NO PCIe BOTTLENECK. │
└─────────────────────────────────────────────────────────┘
The implications:
- No VRAM limit: A Mac Studio with 192 GB can hold models that would otherwise be split across several 80 GB NVIDIA A100s
- No copy overhead: Weights loaded once are accessible by both CPU and GPU
- Zero-copy operations: CPU can rearrange GPU data directly (akunu uses this for Whisper)
The Memory Hierarchy
Even though memory is “unified,” there’s still a hierarchy of caches between the cores and DRAM: per-core L1, per-cluster L2, the shared System Level Cache (SLC), and finally LPDDR5 DRAM. A memory request falls through these layers until it finds its data (a cache hit) or misses at every level and goes to DRAM.
For inference, the critical numbers are:
- SLC size: determines how much of the model weights are “hot” in cache
- DRAM bandwidth: determines maximum decode speed for memory-bound operations
Why Memory Bandwidth Is Everything for Decode
During token generation (decode), the model processes one token at a time. For each token, it needs to:
- Read the embedding for the input token: tiny
- For each layer (e.g., 32 layers in a 7B model):
  - Read Q/K/V projection weights: `(dim + 2 × kv_dim) × dim × bytes_per_weight` (KV is smaller with GQA)
  - Read KV cache: `2 × seq_len × kv_dim × 2 bytes` (FP16)
  - Read output projection weights: `dim × dim × bytes_per_weight`
  - Read FFN gate+up weights: `2 × dim × ffn_dim × bytes_per_weight`
  - Read FFN down weights: `ffn_dim × dim × bytes_per_weight`
For a 7B Q4_0 model, that’s roughly 3.8 GB of weights read per token. The computation (multiply-accumulate) is fast. The bottleneck is reading the weights from memory.
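That tally can be sketched numerically, assuming the original LLaMA-7B shapes (32 layers, dim 4096, kv_dim 4096 since 7B predates GQA, ffn_dim 11008, vocab 32000) and ~0.5625 bytes per Q4_0 weight:

```c
#include <assert.h>

// Bytes of weights streamed per generated token, ignoring the small
// norm tensors and the KV cache reads. Shapes are passed in so other
// model sizes can be plugged in; defaults assume LLaMA-7B.
double weight_bytes_per_token(int n_layers, long dim, long kv_dim,
                              long ffn_dim, long vocab,
                              double bytes_per_weight) {
    double per_layer =
        (double)(dim + 2 * kv_dim) * dim   /* Q/K/V projections   */
      + (double)dim * dim                  /* output projection   */
      + 2.0 * dim * ffn_dim                /* FFN gate + up       */
      + (double)ffn_dim * dim;             /* FFN down            */
    double total = per_layer * n_layers
                 + 2.0 * vocab * dim;      /* embedding + LM head */
    return total * bytes_per_weight;
}
```

With the 7B shapes this lands at ~3.8 GB per token, matching the figure above.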
Token generation speed ≈ memory bandwidth / model size (bytes).[^1]
LLaMA 7B Q4_0 (~3.8 GB) theoretical decode speed:
| Chip | Bandwidth | Theoretical tok/s |
|---|---|---|
| M4 | 120 GB/s | ~31 |
| M4 Pro | 273 GB/s | ~71 |
| M4 Max | 546 GB/s | ~143 |
| M3 Ultra | 819 GB/s | ~215 |
| RTX 4090 (GDDR6X) | 1,008 GB/s | ~265 (but 24 GB VRAM limit) |
The RTX 4090 has higher bandwidth, but only 24 GB of VRAM. For models that fit, it’s faster. For large models (70B+), it can’t run them at all without multi-GPU setups.
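The table is just the one-line formula applied per chip; a minimal sketch:

```c
#include <assert.h>

// Upper bound on memory-bound decode throughput: every generated token
// must stream the entire model's weights through the memory controller.
double max_tokens_per_sec(double bandwidth_gb_s, double model_gb) {
    return bandwidth_gb_s / model_gb;
}
```

Real-world throughput lands below this bound (see the bandwidth-utilization discussion later in this chapter).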
The Roofline Model
The roofline model helps visualize whether an operation is compute-bound or memory-bound:
Performance
(FLOPS)
│
│ ┌─────────────────────── Compute Ceiling
│ /│ (peak TFLOPS)
│ / │
│ / │
│ / │
│ / │
│ / │
│ / │
│ / │
│ / Memory │ Compute
│ / Bound │ Bound
│ / │
│/ │
└────────────┼──────────────────────
Ridge Point
Operational Intensity
(FLOPS / byte)
GEMV (decode, M=1, see Chapter 34): Each weight is read once and used for 1 multiply-add. Operational intensity = 2 FLOPS / (0.5 bytes for Q4_0) = 4 FLOPS/byte. This is well below the ridge point → memory-bound.
GEMM (prefill, M=seq_len): Each weight is read once but used for M multiply-adds. Operational intensity = 2M FLOPS / (0.5 bytes). For M=512: 2048 FLOPS/byte → compute-bound.
This is why prefill is fast (compute-bound, GPU cores are busy) and decode is slower (memory-bound, waiting for data).
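The classification can be expressed as a small helper. The 7 TFLOPS FP16 figure used below is an illustrative assumption, not a measured M4 Pro number:

```c
#include <assert.h>
#include <stdbool.h>

// Operational intensity of a matmul against quantized weights:
// each weight is reused for m_rows multiply-adds (2 FLOPs each),
// and costs bytes_per_weight to read (~0.5 for Q4_0).
double op_intensity(int m_rows, double bytes_per_weight) {
    return 2.0 * m_rows / bytes_per_weight;
}

// An op is memory-bound when its intensity sits below the ridge
// point, peak FLOPS divided by peak bandwidth.
bool is_memory_bound(double intensity, double peak_tflops, double bw_tb_s) {
    return intensity < peak_tflops / bw_tb_s;
}
```

With an assumed 7 TFLOPS and 0.273 TB/s, the ridge point is ~26 FLOPS/byte: GEMV (intensity 4) falls below it, GEMM at M=512 (intensity 2048) far above.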
How Akunu Exploits UMA
1. Memory-Mapped Weight Loading
GGUF files are opened with mmap(). The OS maps the file directly into the process’s virtual address space. When a page is first accessed, the kernel loads it from disk. The GPU can then access these pages directly — no explicit copy needed.
Traditional: Disk → fread() → RAM buffer → cudaMemcpy() → VRAM
Akunu UMA: Disk → mmap() → Unified Memory ← GPU reads directly
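A minimal sketch of this loading path (the magic check and error handling are illustrative; akunu's actual loader may differ):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a weight file read-only into the process address space. The
// mapping is demand-paged: nothing is read from disk until a page is
// first touched, and on Apple Silicon the GPU can address the same
// pages without any copy.
const uint8_t *map_weights(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        // the mapping survives the close
    if (p == MAP_FAILED) return NULL;
    *size_out = (size_t)st.st_size;
    return (const uint8_t *)p;        // GGUF files begin with magic "GGUF"
}
```

Because pages are faulted in lazily, a model larger than free RAM can still be opened; only the pages actually touched compete for physical memory.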
2. CPU-Side Buffer Operations
For Whisper’s cross-attention, akunu precomputes K/V projections on the GPU, then rearranges the memory layout on the CPU:
```c
// CPU directly reads/writes GPU buffer contents (UMA makes this efficient):
// rearrange src, laid out [enc_seq][dec_dim] with heads interleaved,
// into dst, laid out [n_heads][enc_seq][head_dim].
for (int h = 0; h < n_heads; h++) {
    for (int p = 0; p < enc_seq; p++) {
        memcpy(&dst[(h * enc_seq + p) * head_dim],
               &src[p * dec_dim + h * head_dim],
               head_dim * sizeof(__fp16));
    }
}
```
On a discrete GPU, this would require: GPU→VRAM→PCIe→RAM→CPU rearrange→RAM→PCIe→VRAM→GPU. On Apple Silicon: CPU reads and writes directly to the same memory the GPU uses. Hundreds of microseconds vs potentially milliseconds.
3. Minimal Synchronization
Because CPU and GPU share memory, akunu can submit GPU work and immediately prepare the next operation. There’s no need to wait for a PCIe transfer to complete — just a lightweight event/fence for GPU command completion.
4. KV Cache Persistence
The KV cache lives in unified memory and persists across requests. In akunu’s server mode, if a new request shares a prefix with the previous one, the cached KV entries are already in memory — no need to recompute or retransfer.
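Prefix reuse reduces to a longest-common-prefix scan over token IDs; a sketch with illustrative names:

```c
#include <stddef.h>
#include <stdint.h>

// Length of the shared token prefix between the previous request and
// the new one. KV entries [0, result) can be reused as-is; only the
// tail of the new request needs prefill.
size_t shared_prefix_len(const int32_t *prev, size_t prev_len,
                         const int32_t *next, size_t next_len) {
    size_t n = prev_len < next_len ? prev_len : next_len;
    size_t i = 0;
    while (i < n && prev[i] == next[i]) i++;
    return i;
}
```

In a chat session, where each turn re-sends the whole conversation, the shared prefix is nearly the full context, so prefill cost stays proportional to the new turn only.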
Memory Bandwidth in Practice
Real bandwidth utilization is always less than theoretical peak. Several factors reduce effective bandwidth:
Theoretical Peak: 273 GB/s (M4 Pro)
│
Bank conflicts: │ -5%
Cache misses: │ -10%
Non-coalesced access: │ -15%
TLB misses: │ -5%
▼
Practical Peak: ~175-195 GB/s (65-70% utilization)
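Treating those losses as independent multiplicative factors reproduces the practical-peak range; the percentages are illustrative estimates from the figure, not measurements:

```c
#include <assert.h>

// Compound the per-cause bandwidth losses multiplicatively.
double effective_bandwidth(double peak_gb_s) {
    return peak_gb_s * 0.95   /* bank conflicts       -5%  */
                     * 0.90   /* cache misses         -10% */
                     * 0.85   /* non-coalesced access -15% */
                     * 0.95;  /* TLB misses           -5%  */
}
```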
Akunu achieves high bandwidth utilization through:
- Vectorized loads: `half4` reads (8 bytes at once) instead of individual `half` reads
- Coalesced access: Adjacent threads read adjacent memory addresses
- Block-strided K-loops: Process K-dimension in blocks of 512 to maximize cache line reuse
- Weight layout optimization: Quantized blocks are laid out for sequential access
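The block-strided idea can be illustrated on the CPU with a plain GEMV; `KB = 512` and the function name are illustrative, and the real kernels operate on quantized blocks in Metal:

```c
#include <stddef.h>

// Process the K dimension in blocks so each cached stretch of x is
// reused across all output rows before being evicted.
void gemv_blocked(const float *W, const float *x, float *y, int M, int K) {
    const int KB = 512;                    // K-block size (assumption)
    for (int m = 0; m < M; m++) y[m] = 0.0f;
    for (int k0 = 0; k0 < K; k0 += KB) {
        int kend = k0 + KB < K ? k0 + KB : K;
        for (int m = 0; m < M; m++) {      // every row reuses x[k0..kend)
            float acc = 0.0f;
            for (int k = k0; k < kend; k++)
                acc += W[(size_t)m * K + k] * x[k];
            y[m] += acc;
        }
    }
}
```

The arithmetic is identical to a naive GEMV; only the traversal order changes, which is exactly the kind of transformation that matters in a memory-bound regime.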
Comparison: UMA vs Discrete vs Cloud
| Feature | Apple UMA | Discrete GPU | Cloud (A100) |
|---|---|---|---|
| Max memory | 192 GB | 24 GB (consumer) | 80 GB |
| Bandwidth | 819 GB/s (Ultra) | 1 TB/s (4090) | 2 TB/s |
| Transfer overhead | None | PCIe ~32 GB/s | NVLink ~900 GB/s |
| 70B Q4 model | Fits in 96 GB Mac | Needs multi-GPU | Fits one A100 |
| Cost | $4K Mac Studio | ~$2K GPU + PC | ~$3/hr rental |
| Power | ~60W | ~450W | ~400W |
| Programmability | Metal | CUDA | CUDA |
Apple Silicon’s sweet spot is models that don’t quite fit in consumer GPU VRAM (13B-70B). The ability to run these models locally, at reasonable speed, with no cloud dependency, is akunu’s core value proposition.
Summary
Unified Memory Architecture means:
- One memory pool shared by CPU and GPU
- No data copying between host and device memory
- Total system RAM available for model weights
- Memory bandwidth is the primary bottleneck for decode
- Quantization directly translates to speed (less data to read)
The key equation for decode performance: tokens/sec ≈ memory_bandwidth / model_size_bytes
This is why akunu focuses heavily on quantization support (Q4_0 is 4× smaller than FP16, so 4× faster decode) and why its dispatch table eliminates every unnecessary overhead — when you’re memory-bound, every wasted cycle is a wasted byte of bandwidth.
[^1]: Pope, R., et al. “Efficiently Scaling Transformer Inference.” MLSys 2023. This paper provides a thorough analysis of the memory-bound vs compute-bound regimes of transformer inference and derives the bandwidth-limited decode throughput formula. See https://arxiv.org/abs/2211.05102.