Unified Memory Architecture
If there’s one hardware feature that makes Apple Silicon uniquely suited for running large language models locally, it’s the Unified Memory Architecture (UMA). It’s the reason a MacBook with 64 GB of RAM can run a 70B model that would otherwise demand multiple $2,000+ NVIDIA GPUs in a PC. Let’s understand exactly how it works and why it matters for inference.
The Traditional Memory Problem
On a discrete GPU system, there are two completely separate memory pools:
┌─────────────────────────────────────────────────────────┐
│ TRADITIONAL DISCRETE GPU SYSTEM │
│ │
│ CPU Side GPU Side │
│ ┌──────────┐ ┌──────────────┐ │
│ │ System │ PCIe 4.0 x16 │ VRAM │ │
│ │ RAM │◄──────────────────►│ (GDDR6X) │ │
│ │ 64 GB │ ~32 GB/s │ 24 GB │ │
│ │ DDR5 │ bidirectional │ │ │
│ │ │ │ Bandwidth: │ │
│ │ BW: ~90 │ │ ~1 TB/s │ │
│ │ GB/s │ │ │ │
│ └──────────┘ └──────────────┘ │
│ │
│ To run inference: │
│ 1. Load model into system RAM (disk → RAM) │
│ 2. Copy weights to VRAM (RAM → PCIe → VRAM) │
│ 3. GPU computes on VRAM │
│ 4. Copy results back (VRAM → PCIe → RAM) │
│ │
│ PROBLEM: Model must fit in VRAM (24 GB max typical) │
│ PROBLEM: PCIe transfer is slow (~32 GB/s) │
│ PROBLEM: Double the memory usage (model in both pools) │
└─────────────────────────────────────────────────────────┘
A 70B model in Q4_0 quantization is about 40 GB. It literally doesn’t fit in a 24 GB RTX 4090’s VRAM. You need model parallelism across multiple GPUs, or offloading to CPU (which is painfully slow over PCIe).
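That size estimate is easy to check. In the GGML/GGUF Q4_0 format, weights are stored in 32-element blocks of 18 bytes each (an FP16 scale plus 32 four-bit values), i.e. 4.5 bits per weight. A quick sketch:

```c
#include <assert.h>

// Q4_0 (GGML): blocks of 32 weights, each block = 2-byte FP16 scale
// + 16 bytes of packed 4-bit values = 18 bytes, i.e. 4.5 bits/weight.
double q4_0_bytes(double n_params) {
    return n_params / 32.0 * 18.0;
}
```

For 70e9 parameters this gives about 39.4 GB, comfortably over a 24 GB VRAM limit.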
Apple’s Unified Memory
On Apple Silicon, there’s one memory pool, shared by everything:
┌─────────────────────────────────────────────────────────┐
│ APPLE SILICON UNIFIED MEMORY │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ SoC Die │ │
│ │ │ │
│ │ CPU ──┐ │ │
│ │ ├──► Fabric ──► SLC ──► Memory Controller │ │
│ │ GPU ──┤ │ │
│ │ ├──► │ │
│ │ NE ──┘ │ │
│ └────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ LPDDR5 │ │
│ │ Unified │ │
│ │ Memory │ │
│ │ Pool │ │
│ │ │ │
│ │ 16-192 GB │ │
│ └─────────────┘ │
│ │
│ To run inference: │
│ 1. Load model into memory (disk → unified memory) │
│ 2. GPU computes on it directly. Done. │
│ │
│ NO COPY. NO VRAM LIMIT. NO PCIe BOTTLENECK. │
└─────────────────────────────────────────────────────────┘
The implications:
- No VRAM limit: A Mac Studio with 192 GB can hold models that would otherwise be split across several 80 GB NVIDIA A100s
- No copy overhead: Weights loaded once are accessible by both CPU and GPU
- Zero-copy operations: CPU can rearrange GPU data directly (akunu uses this for Whisper)
The Memory Hierarchy
Even though memory is “unified,” there’s still a hierarchy of caches between the cores and DRAM: per-core L1, per-cluster L2, the shared System Level Cache (SLC), and finally LPDDR5 DRAM. A memory request falls through these layers until it finds its data (a cache hit) or misses at every level and goes to DRAM.
For inference, the critical numbers are:
- SLC size: determines how much of the model weights are “hot” in cache
- DRAM bandwidth: determines maximum decode speed for memory-bound operations
Why Memory Bandwidth Is Everything for Decode
During token generation (decode), the model processes one token at a time. For each token, it needs to:
- Read the embedding for the input token: tiny
- For each layer (e.g., 32 layers in a 7B model):
  - Read Q/K/V projection weights: `(dim + 2 × kv_dim) × dim × bytes_per_weight` (KV is smaller with GQA)
  - Read KV cache: `2 × seq_len × kv_dim × 2 bytes` (FP16)
  - Read output projection weights: `dim × dim × bytes_per_weight`
  - Read FFN gate+up weights: `2 × dim × ffn_dim × bytes_per_weight`
  - Read FFN down weights: `ffn_dim × dim × bytes_per_weight`
For a 7B Q4_0 model, that’s roughly 3.8 GB of weights read per token. The computation (multiply-accumulate) is fast. The bottleneck is reading the weights from memory.
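That tally can be sketched numerically, assuming the original LLaMA-7B shapes (32 layers, dim 4096, kv_dim 4096 since 7B predates GQA, ffn_dim 11008, vocab 32000) and ~0.5625 bytes per Q4_0 weight:

```c
#include <assert.h>

// Bytes of weights streamed per generated token, ignoring the small
// norm tensors and the KV cache reads. Shapes are passed in so other
// model sizes can be plugged in; defaults assume LLaMA-7B.
double weight_bytes_per_token(int n_layers, long dim, long kv_dim,
                              long ffn_dim, long vocab,
                              double bytes_per_weight) {
    double per_layer =
        (double)(dim + 2 * kv_dim) * dim   /* Q/K/V projections   */
      + (double)dim * dim                  /* output projection   */
      + 2.0 * dim * ffn_dim                /* FFN gate + up       */
      + (double)ffn_dim * dim;             /* FFN down            */
    double total = per_layer * n_layers
                 + 2.0 * vocab * dim;      /* embedding + LM head */
    return total * bytes_per_weight;
}
```

With the 7B shapes this lands at ~3.8 GB per token, matching the figure above.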
Token generation speed ≈ memory bandwidth / model size (bytes).[^1]
LLaMA 7B Q4_0 (~3.8 GB) theoretical decode speed:
| Chip | Bandwidth | Theoretical tok/s |
|---|---|---|
| M4 | 120 GB/s | ~31 |
| M4 Pro | 273 GB/s | ~71 |
| M4 Max | 546 GB/s | ~143 |
| M3 Ultra | 819 GB/s | ~215 |
| RTX 4090 (GDDR6X) | 1,008 GB/s | ~265 (but 24 GB VRAM limit) |
The RTX 4090 has higher bandwidth, but only 24 GB of VRAM. For models that fit, it’s faster. For large models (70B+), it can’t run them at all without multi-GPU setups.
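The table is just the one-line formula applied per chip; a minimal sketch:

```c
#include <assert.h>

// Upper bound on memory-bound decode throughput: every generated token
// must stream the entire model's weights through the memory controller.
double max_tokens_per_sec(double bandwidth_gb_s, double model_gb) {
    return bandwidth_gb_s / model_gb;
}
```

Real-world throughput lands below this bound (see the bandwidth-utilization discussion later in this chapter).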
The Roofline Model
The roofline model helps visualize whether an operation is compute-bound or memory-bound:
Performance
(FLOPS)
│
│ ┌─────────────────────── Compute Ceiling
│ /│ (peak TFLOPS)
│ / │
│ / │
│ / │
│ / │
│ / │
│ / │
│ / │
│ / Memory │ Compute
│ / Bound │ Bound
│ / │
│/ │
└────────────┼──────────────────────
Ridge Point
Operational Intensity
(FLOPS / byte)
GEMV (decode, M=1, see Chapter 34): Each weight is read once and used for 1 multiply-add. Operational intensity = 2 FLOPS / (0.5 bytes for Q4_0) = 4 FLOPS/byte. This is well below the ridge point → memory-bound.
GEMM (prefill, M=seq_len): Each weight is read once but used for M multiply-adds. Operational intensity = 2M FLOPS / (0.5 bytes). For M=512: 2048 FLOPS/byte → compute-bound.
This is why prefill is fast (compute-bound, GPU cores are busy) and decode is slower (memory-bound, waiting for data).
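The classification can be expressed as a small helper. The 7 TFLOPS FP16 figure used below is an illustrative assumption, not a measured M4 Pro number:

```c
#include <assert.h>
#include <stdbool.h>

// Operational intensity of a matmul against quantized weights:
// each weight is reused for m_rows multiply-adds (2 FLOPs each),
// and costs bytes_per_weight to read (~0.5 for Q4_0).
double op_intensity(int m_rows, double bytes_per_weight) {
    return 2.0 * m_rows / bytes_per_weight;
}

// An op is memory-bound when its intensity sits below the ridge
// point, peak FLOPS divided by peak bandwidth.
bool is_memory_bound(double intensity, double peak_tflops, double bw_tb_s) {
    return intensity < peak_tflops / bw_tb_s;
}
```

With an assumed 7 TFLOPS and 0.273 TB/s, the ridge point is ~26 FLOPS/byte: GEMV (intensity 4) falls below it, GEMM at M=512 (intensity 2048) far above.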
How Akunu Exploits UMA
1. Memory-Mapped Weight Loading
GGUF files are opened with mmap(). The OS maps the file directly into the process’s virtual address space. When a page is first accessed, the kernel loads it from disk. The GPU can then access these pages directly — no explicit copy needed.
Traditional: Disk → fread() → RAM buffer → cudaMemcpy() → VRAM
Akunu UMA: Disk → mmap() → Unified Memory ← GPU reads directly
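A minimal sketch of this loading path (the magic check and error handling are illustrative; akunu's actual loader may differ):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a weight file read-only into the process address space. The
// mapping is demand-paged: nothing is read from disk until a page is
// first touched, and on Apple Silicon the GPU can address the same
// pages without any copy.
const uint8_t *map_weights(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        // the mapping survives the close
    if (p == MAP_FAILED) return NULL;
    *size_out = (size_t)st.st_size;
    return (const uint8_t *)p;        // GGUF files begin with magic "GGUF"
}
```

Because pages are faulted in lazily, a model larger than free RAM can still be opened; only the pages actually touched compete for physical memory.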
2. CPU-Side Buffer Operations
For Whisper’s cross-attention, akunu precomputes K/V projections on the GPU, then rearranges the memory layout on the CPU:
```c
// CPU directly reads/writes GPU buffer contents (UMA makes this efficient):
// rearrange src, laid out [enc_seq][dec_dim] with heads interleaved,
// into dst, laid out [n_heads][enc_seq][head_dim].
for (int h = 0; h < n_heads; h++) {
    for (int p = 0; p < enc_seq; p++) {
        memcpy(&dst[(h * enc_seq + p) * head_dim],
               &src[p * dec_dim + h * head_dim],
               head_dim * sizeof(__fp16));
    }
}
```
On a discrete GPU, this would require: GPU→VRAM→PCIe→RAM→CPU rearrange→RAM→PCIe→VRAM→GPU. On Apple Silicon: CPU reads and writes directly to the same memory the GPU uses. Hundreds of microseconds vs potentially milliseconds.
3. Minimal Synchronization
Because CPU and GPU share memory, akunu can submit GPU work and immediately prepare the next operation. There’s no need to wait for a PCIe transfer to complete — just a lightweight event/fence for GPU command completion.
4. KV Cache Persistence
The KV cache lives in unified memory and persists across requests. In akunu’s server mode, if a new request shares a prefix with the previous one, the cached KV entries are already in memory — no need to recompute or retransfer.
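Prefix reuse reduces to a longest-common-prefix scan over token IDs; a sketch with illustrative names:

```c
#include <stddef.h>
#include <stdint.h>

// Length of the shared token prefix between the previous request and
// the new one. KV entries [0, result) can be reused as-is; only the
// tail of the new request needs prefill.
size_t shared_prefix_len(const int32_t *prev, size_t prev_len,
                         const int32_t *next, size_t next_len) {
    size_t n = prev_len < next_len ? prev_len : next_len;
    size_t i = 0;
    while (i < n && prev[i] == next[i]) i++;
    return i;
}
```

In a chat session, where each turn re-sends the whole conversation, the shared prefix is nearly the full context, so prefill cost stays proportional to the new turn only.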
Memory Bandwidth in Practice
Real bandwidth utilization is always less than theoretical peak. Several factors reduce effective bandwidth:
Theoretical Peak: 273 GB/s (M4 Pro)
│
Bank conflicts: │ -5%
Cache misses: │ -10%
Non-coalesced access: │ -15%
TLB misses: │ -5%
▼
Practical Peak: ~175-195 GB/s (65-70% utilization)
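Treating those losses as independent multiplicative factors reproduces the practical-peak range; the percentages are illustrative estimates from the figure, not measurements:

```c
#include <assert.h>

// Compound the per-cause bandwidth losses multiplicatively.
double effective_bandwidth(double peak_gb_s) {
    return peak_gb_s * 0.95   /* bank conflicts       -5%  */
                     * 0.90   /* cache misses         -10% */
                     * 0.85   /* non-coalesced access -15% */
                     * 0.95;  /* TLB misses           -5%  */
}
```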
Akunu achieves high bandwidth utilization through:
- Vectorized loads: `half4` reads (8 bytes at once) instead of individual `half` reads
- Coalesced access: Adjacent threads read adjacent memory addresses
- Block-strided K-loops: Process K-dimension in blocks of 512 to maximize cache line reuse
- Weight layout optimization: Quantized blocks are laid out for sequential access
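The block-strided idea can be illustrated on the CPU with a plain GEMV; `KB = 512` and the function name are illustrative, and the real kernels operate on quantized blocks in Metal:

```c
#include <stddef.h>

// Process the K dimension in blocks so each cached stretch of x is
// reused across all output rows before being evicted.
void gemv_blocked(const float *W, const float *x, float *y, int M, int K) {
    const int KB = 512;                    // K-block size (assumption)
    for (int m = 0; m < M; m++) y[m] = 0.0f;
    for (int k0 = 0; k0 < K; k0 += KB) {
        int kend = k0 + KB < K ? k0 + KB : K;
        for (int m = 0; m < M; m++) {      // every row reuses x[k0..kend)
            float acc = 0.0f;
            for (int k = k0; k < kend; k++)
                acc += W[(size_t)m * K + k] * x[k];
            y[m] += acc;
        }
    }
}
```

The arithmetic is identical to a naive GEMV; only the traversal order changes, which is exactly the kind of transformation that matters in a memory-bound regime.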
Comparison: UMA vs Discrete vs Cloud
| Feature | Apple UMA | Discrete GPU | Cloud (A100) |
|---|---|---|---|
| Max memory | 192 GB | 24 GB (consumer) | 80 GB |
| Bandwidth | 819 GB/s (Ultra) | 1 TB/s (4090) | 2 TB/s |
| Transfer overhead | None | PCIe ~32 GB/s | NVLink ~900 GB/s |
| 70B Q4 model | Fits in 96 GB Mac | Needs multi-GPU | Fits one A100 |
| Cost | $4K Mac Studio | ~$2K GPU + PC | ~$3/hr rental |
| Power | ~60W | ~450W | ~400W |
| Programmability | Metal | CUDA | CUDA |
Apple Silicon’s sweet spot is models that don’t quite fit in consumer GPU VRAM (13B-70B). The ability to run these models locally, at reasonable speed, with no cloud dependency, is akunu’s core value proposition.
Summary
Unified Memory Architecture means:
- One memory pool shared by CPU and GPU
- No data copying between host and device memory
- Total system RAM available for model weights
- Memory bandwidth is the primary bottleneck for decode
- Quantization directly translates to speed (less data to read)
The key equation for decode performance: tokens/sec ≈ memory_bandwidth / model_size_bytes
This is why akunu focuses heavily on quantization support (Q4_0 is 4× smaller than FP16, so 4× faster decode) and why its dispatch table eliminates every unnecessary overhead — when you’re memory-bound, every wasted cycle is a wasted byte of bandwidth.
[^1]: Pope, R., et al. “Efficiently Scaling Transformer Inference.” MLSys 2023. This paper provides a thorough analysis of the memory-bound vs compute-bound regimes of transformer inference and derives the bandwidth-limited decode throughput formula. See https://arxiv.org/abs/2211.05102.