System-on-Chip Architecture
If you’ve spent your career thinking about computers as a collection of separate chips on a motherboard — a CPU here, a GPU there, RAM sticks in their slots — then Apple Silicon is going to fundamentally rewire how you think about hardware. Everything lives on one die (or two, in the Ultra variants). And that changes everything about how we write high-performance inference code.
What Is a System-on-Chip?
A System-on-Chip (SoC) integrates what traditionally were separate components — processor, graphics, memory controller, I/O — onto a single piece of silicon. This isn’t a new idea; your phone has been running on SoCs for over a decade. What Apple did with the M-series was bring this approach to laptop and desktop-class performance.
In a traditional PC, the CPU has its own RAM (DDR5, ~90 GB/s) and the GPU has its own VRAM (GDDR6X, ~1 TB/s). Data must cross the PCIe bus (~32 GB/s) to move between them. Two separate memory pools, two separate bandwidth domains.
On Apple Silicon, everything — CPU, GPU, Neural Engine — shares one pool of LPDDR5 memory (120-819 GB/s depending on chip). No copy, no PCIe bottleneck. The GPU reads the same bytes the CPU wrote.
┌─────────────────────────────────────────────────────────────┐
│ TRADITIONAL PC ARCHITECTURE │
│ │
│ ┌──────────┐ PCIe x16 ┌────────────────┐ │
│ │ CPU │◄──────────────►│ Discrete GPU │ │
│ │ (Intel/ │ ~32 GB/s │ (NVIDIA/AMD) │ │
│ │ AMD) │ │ │ │
│ └────┬─────┘ └───────┬────────┘ │
│ │ DDR5 │ GDDR6X │
│ │ ~90 GB/s │ ~1 TB/s │
│ ┌────┴─────┐ ┌───────┴───────┐ │
│ │ System │ │ Video RAM │ │
│ │ RAM │ │ (VRAM) │ │
│ │ 32-128GB │ │ 8-24 GB │ │
│ └──────────┘ └───────────────┘ │
│ │
│ Two separate memory pools. Data must be COPIED between them│
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ APPLE SILICON ARCHITECTURE │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ SINGLE SoC DIE │ │
│ │ │ │
│ │ ┌─────┐ ┌─────┐ ┌──────┐ ┌──────────┐ │ │
│ │ │ CPU │ │ GPU │ │Neural│ │ Media │ │ │
│ │ │ │ │ │ │Engine│ │ Engine │ │ │
│ │ └──┬──┘ └──┬──┘ └──┬───┘ └────┬─────┘ │ │
│ │ └───────┴───────┴──────────┘ │ │
│ │ │ Fabric │ │
│ │ ┌────────┴────────┐ │ │
│ │ │ System Level │ │ │
│ │ │ Cache (SLC) │ │ │
│ │ └────────┬────────┘ │ │
│ └──────────────┼───────────────────────────┘ │
│ ┌──────┴──────┐ │
│ │ Unified │ │
│ │ Memory │ │
│ │ 16-192 GB │ │
│ └─────────────┘ │
│ │
│ ONE memory pool. CPU and GPU access it with ZERO copies. │
└─────────────────────────────────────────────────────────────┘
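To put a number on the copy penalty, here is a quick back-of-the-envelope sketch using the approximate bandwidth figures above (real transfers also pay latency and protocol overhead, so treat these as lower bounds):

```rust
// Rough model: time to stage a ~3.8 GB set of weights for the GPU.
// Bandwidth figures are the approximate ones quoted above.
fn transfer_seconds(bytes: f64, bandwidth_bytes_per_s: f64) -> f64 {
    bytes / bandwidth_bytes_per_s
}

fn main() {
    let model_bytes = 3.8e9; // ~7B parameters at Q4_0

    // Traditional PC: weights must cross PCIe into VRAM at load time.
    let pcie = transfer_seconds(model_bytes, 32.0e9);
    println!("PCIe x16 copy: {:.0} ms", pcie * 1e3); // ~119 ms per full copy

    // Apple Silicon UMA: the GPU reads the bytes the CPU mapped. Zero copies.
    println!("UMA copy:      0 ms (same physical memory)");
}
```

The one-time load cost is the small part; the same arithmetic applies to every intermediate tensor that would otherwise bounce between pools.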
The two diagrams above illustrate the difference in how data flows through each architecture — note the PCIe bottleneck in the traditional model versus the direct, zero-copy access under UMA.
Anatomy of the Die
Let’s map out every major component on an M4 Pro die:[^1]
┌──────────────────────────────────────────────────────────────────┐
│ M4 Pro SoC Die │
│ │
│ ┌─────────────────────────┐ ┌──────────────────────────────┐ │
│ │ CPU Cluster │ │ GPU (20 cores) │ │
│ │ │ │ │ │
│ │ ┌────┐┌────┐┌────┐ │ │ ┌───┐┌───┐┌───┐┌───┐┌───┐ │ │
│ │ │ P0 ││ P1 ││ P2 │ │ │ │ 0 ││ 1 ││ 2 ││ 3 ││ 4 │ │ │
│ │ └────┘└────┘└────┘ │ │ └───┘└───┘└───┘└───┘└───┘ │ │
│ │ ┌────┐┌────┐┌────┐ │ │ ┌───┐┌───┐┌───┐┌───┐┌───┐ │ │
│ │ │ P3 ││ P4 ││ P5 │ │ │ │ 5 ││ 6 ││ 7 ││ 8 ││ 9 │ │ │
│ │ └────┘└────┘└────┘ │ │ └───┘└───┘└───┘└───┘└───┘ │ │
│ │ ┌────┐┌────┐┌────┐ │ │ ┌───┐┌───┐┌───┐┌───┐┌───┐ │ │
│ │ │ P6 ││ P7 ││ P8 │ │ │ │10 ││11 ││12 ││13 ││14 │ │ │
│ │ └────┘└────┘└────┘ │ │ └───┘└───┘└───┘└───┘└───┘ │ │
│ │ ┌────┐ │ │ ┌───┐┌───┐┌───┐┌───┐┌───┐ │ │
│ │ │ P9 │ P=Performance │ │ │15 ││16 ││17 ││18 ││19 │ │ │
│ │ └────┘ │ │ └───┘└───┘└───┘└───┘└───┘ │ │
│ │ ┌────┐┌────┐┌────┐ │ │ 20 GPU Cores │ │
│ │ │ E0 ││ E1 ││ E2 │ │ │ │ │
│ │ └────┘└────┘└────┘ │ │ │ │
│ │ ┌────┐ E=Efficiency │ │ │ │
│ │ │ E3 │ │ │ │ │
│ │ └────┘ │ │ │ │
│ └─────────────────────────┘ └──────────────────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────┐ ┌────────────┐ ┌──────────┐ │
│ │ Neural Engine│ │ Media │ │ Display │ │ Secure │ │
│ │ 16 Cores │ │ Engine │ │ Engine │ │ Enclave │ │
│ │ 38 TOPS │ │ ProRes │ │ │ │ │ │
│ └──────────────┘ └──────────┘ └────────────┘ └──────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Fabric / Interconnect │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ System Level Cache (SLC) — 36 MB │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Memory │ │Thunderbolt│ │ PCIe │ │ USB │ │
│ │Controller│ │Controller │ │Controller│ │ Controller │ │
│ └──────────┘ └───────────┘ └──────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────────────────┘
CPU Clusters: Performance and Efficiency
Apple Silicon uses ARM’s big.LITTLE concept with two core types:
Performance Cores (P-cores):
- Wide, out-of-order execution with deep pipelines
- High clock speeds (up to ~4.5 GHz on M4)
- Large L1 caches (192 KB instruction, 128 KB data per core)
- Large shared L2 cache (16-32 MB per cluster)
- Used for compute-intensive tasks
Efficiency Cores (E-cores):
- Narrower, simpler pipeline
- Much lower power consumption
- Lower clock speeds (~2.8 GHz)
- Handle background tasks, I/O-bound work
For akunu, the CPU mainly handles model loading, tokenization, dispatch table building, grammar constraint checking, and control flow between GPU dispatches. The GPU does the neural network computation.
The GPU
This is where akunu spends most of its time. Each GPU core contains ALUs, a texture sampling unit, and tile memory. Core counts vary: M4 has 10, M4 Pro has 20, M4 Max has 40, M4 Ultra has 80. We’ll cover GPU architecture in extreme detail in Chapter 3.
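As a rough illustration of what those core counts buy, here is a peak-FLOPS estimate. The 128 FP32 lanes per core and ~1.5 GHz clock are public ballpark figures for this GPU generation, not numbers from this chapter, so treat the result as an order-of-magnitude sketch:

```rust
// Peak FP32 estimate: cores × lanes/core × 2 (an FMA counts as 2 FLOPs) × clock.
fn peak_fp32_flops(cores: u64, lanes_per_core: u64, clock_hz: f64) -> f64 {
    (cores * lanes_per_core * 2) as f64 * clock_hz
}

fn main() {
    // 20 cores per the M4 Pro diagram above; lane count and clock are
    // rough public estimates, not figures from this chapter.
    let tflops = peak_fp32_flops(20, 128, 1.5e9) / 1e12;
    println!("M4 Pro GPU, est. peak FP32: ~{:.1} TFLOPS", tflops);
}
```

Note that for decode this compute ceiling is rarely the limit — as we'll see below, memory bandwidth is.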
The Neural Engine
Apple’s dedicated matrix multiplication accelerator: 16 cores on M4 Pro, 38 TOPS at INT8. However, akunu does NOT use the Neural Engine. It’s only accessible through CoreML, which doesn’t give the fine-grained control needed for optimized inference. Metal compute shaders give us full control over memory layout, kernel design, and dispatch — which is why akunu achieves 1.83x faster decode than llama.cpp.
System Level Cache (SLC)
The SLC is a large, shared last-level cache between all die components and DRAM:
CPU L2 GPU L2 NE Cache
\ | /
▼ ▼ ▼
┌─────────────────────────┐
│ System Level Cache │
│ M4: 16 MB │
│ M4 Pro: 36 MB │
│ M4 Max: 48 MB │
│ M4 Ultra: 96 MB │
└────────────┬────────────┘
▼
┌───────────────┐
│ LPDDR5 DRAM │
└───────────────┘
The SLC is critical for inference: weight tiles that fit in SLC get accessed much faster, and intermediate activations often fit entirely.[^2] Akunu’s ChipConfig::slc_size tunes behavior based on available SLC.
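Here is a sketch of what SLC-aware tuning can look like. The ChipConfig struct and the tile-selection policy below are illustrative, not akunu’s actual implementation:

```rust
// Hypothetical config: pick the largest power-of-two weight tile that
// still leaves room in the SLC for activations.
struct ChipConfig {
    slc_size: usize, // bytes of System Level Cache
}

fn pick_tile_bytes(cfg: &ChipConfig, activation_bytes: usize) -> usize {
    let budget = cfg.slc_size.saturating_sub(activation_bytes);
    let mut tile = 1usize << 20; // 1 MB floor
    // Double the tile while it still fits in the remaining SLC budget.
    while tile * 2 <= budget {
        tile *= 2;
    }
    tile
}

fn main() {
    let m4_pro = ChipConfig { slc_size: 36 << 20 }; // 36 MB SLC
    let tile = pick_tile_bytes(&m4_pro, 4 << 20);   // ~4 MB of activations
    println!("weight tile size: {} MB", tile >> 20); // 32 MB on M4 Pro
}
```

The same policy yields smaller tiles on M4 (16 MB SLC) and larger ones on Max/Ultra — which is exactly why this is a per-chip tunable rather than a constant.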
Memory Controller
| Chip | Memory Bus | Bandwidth | Max Memory |
|---|---|---|---|
| M4 | 128-bit | 120 GB/s | 32 GB |
| M4 Pro | 256-bit | 273 GB/s | 64 GB |
| M4 Max | 512-bit | 546 GB/s | 128 GB |
| M4 Ultra | 1024-bit | 819 GB/s | 192 GB |
For decode, token generation time is approximately model_size_bytes / memory_bandwidth. For a 7B Q4_0 model (~3.8 GB): M4 Pro gives ~71 tok/s theoretical, M4 Max ~143 tok/s. Akunu gets close to these limits.
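That formula is easy to sanity-check. A sketch using the bandwidth table above and the ~3.8 GB Q4_0 model from the text:

```rust
// Decode is memory-bandwidth bound: each generated token reads (roughly)
// every model weight once, so tok/s ≈ bandwidth / model size.
fn decode_tokens_per_sec(model_bytes: f64, bandwidth_bytes_per_s: f64) -> f64 {
    bandwidth_bytes_per_s / model_bytes
}

fn main() {
    let q4_7b = 3.8e9; // ~7B model at Q4_0, as in the text
    println!("M4:       ~{:.0} tok/s", decode_tokens_per_sec(q4_7b, 120.0e9));
    println!("M4 Pro:   ~{:.0} tok/s", decode_tokens_per_sec(q4_7b, 273.0e9));
    println!("M4 Max:   ~{:.0} tok/s", decode_tokens_per_sec(q4_7b, 546.0e9));
    println!("M4 Ultra: ~{:.0} tok/s", decode_tokens_per_sec(q4_7b, 819.0e9));
}
```

These are ceilings, not measurements: KV-cache reads, attention compute, and imperfect bandwidth utilization all shave a few percent off in practice.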
The Chip Hierarchy: Binning and Variants
Apple creates chip families from the same base design:
BASE DIE ──────► M4 (binned, fewer cores)
│
├──────────► M4 Pro (mid-range)
│
└──────────► M4 Max (full config)
│
│ UltraFusion (2 dies)
▼
M4 Ultra (2x Max)
The Ultra variant connects two Max dies via UltraFusion (~2.5 TB/s die-to-die bandwidth). Software sees it as one unified chip — no special programming needed.
What This Means for Akunu
- Zero-copy weight loading: GGUF files are memory-mapped; the GPU reads directly from mmap’d regions
- CPU-side operations on GPU buffers: Whisper’s cross-attention K/V rearrangement happens on CPU, directly on GPU-accessible memory
- Minimal CPU-GPU sync: Precompiled dispatch table minimizes CPU’s role in the hot path
- SLC-aware tuning: ChipConfig adjusts tile and batch sizes based on SLC
- Bandwidth as bottleneck: During decode, akunu is memory-bandwidth bound — quantization (Q4_0) has dramatic impact
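The last point deserves numbers. In the GGML/GGUF layout, Q4_0 stores blocks of 32 weights as 16 nibble bytes plus a 2-byte f16 scale — 18 bytes per 32 weights, i.e. 4.5 bits per weight. A sketch of the resulting decode ceilings on an M4 Pro (the nominal 7B parameter count here gives ~3.9 GB, in line with the ~3.8 GB quoted earlier):

```rust
// Why quantization matters when you are bandwidth-bound: fewer bytes per
// weight means proportionally more tokens per second.
// Q4_0 (GGML layout): 32 weights -> 16 nibble bytes + 2-byte f16 scale.
fn q4_0_bytes(params: u64) -> u64 { params * 18 / 32 }
fn f16_bytes(params: u64) -> u64 { params * 2 }

fn main() {
    let params = 7_000_000_000u64;
    let bw = 273.0e9; // M4 Pro memory bandwidth
    let q4 = q4_0_bytes(params) as f64;
    let f16 = f16_bytes(params) as f64;
    println!("Q4_0: {:.1} GB -> ~{:.0} tok/s", q4 / 1e9, bw / q4);
    println!("F16:  {:.1} GB -> ~{:.0} tok/s", f16 / 1e9, bw / f16);
}
```

Going from F16 to Q4_0 shrinks the weights by roughly 3.6x, and decode throughput scales by almost the same factor.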
In the next chapter, we’ll zoom into the GPU architecture specifically.
[^1]: Apple. “Apple M4 Pro chip.” apple.com, 2024. Die specifications including GPU core count, memory bandwidth, and SLC size. https://www.apple.com/newsroom/2024/10/apple-introduces-m4-pro-and-m4-max/
[^2]: Frumusanu, A. “Apple’s M4 Family: A Deep Dive.” AnandTech / Chips and Cheese, 2024. Independent analysis of Apple Silicon die layout, cache hierarchy, and fabric bandwidth. https://chipsandcheese.com/