System-on-Chip Architecture
If you’ve spent your career thinking about computers as a collection of separate chips on a motherboard — a CPU here, a GPU there, RAM sticks in their slots — then Apple Silicon is going to fundamentally rewire how you think about hardware. Everything lives on one die (or two, in the Ultra variants). And that changes everything about how we write high-performance inference code.
What Is a System-on-Chip?
A System-on-Chip (SoC) integrates what traditionally were separate components — processor, graphics, memory controller, I/O — onto a single piece of silicon. This isn’t a new idea; your phone has been running on SoCs for over a decade. What Apple did with the M-series was bring this approach to laptop and desktop-class performance.
In a traditional PC, the CPU has its own RAM (DDR5, ~90 GB/s) and the GPU has its own VRAM (GDDR6X, ~1 TB/s). Data must cross the PCIe bus (~32 GB/s) to move between them. Two separate memory pools, two separate bandwidth domains.
On Apple Silicon, everything — CPU, GPU, Neural Engine — shares one pool of LPDDR5 memory (120-819 GB/s depending on chip). No copy, no PCIe bottleneck. The GPU reads the same bytes the CPU wrote.
┌─────────────────────────────────────────────────────────────┐
│ TRADITIONAL PC ARCHITECTURE │
│ │
│ ┌──────────┐ PCIe x16 ┌────────────────┐ │
│ │ CPU │◄──────────────►│ Discrete GPU │ │
│ │ (Intel/ │ ~32 GB/s │ (NVIDIA/AMD) │ │
│ │ AMD) │ │ │ │
│ └────┬─────┘ └───────┬────────┘ │
│ │ DDR5 │ GDDR6X │
│ │ ~90 GB/s │ ~1 TB/s │
│ ┌────┴─────┐ ┌───────┴───────┐ │
│ │ System │ │ Video RAM │ │
│ │ RAM │ │ (VRAM) │ │
│ │ 32-128GB │ │ 8-24 GB │ │
│ └──────────┘ └───────────────┘ │
│ │
│ Two separate memory pools. Data must be COPIED between them│
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ APPLE SILICON ARCHITECTURE │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ SINGLE SoC DIE │ │
│ │ │ │
│ │ ┌─────┐ ┌─────┐ ┌──────┐ ┌──────────┐ │ │
│ │ │ CPU │ │ GPU │ │Neural│ │ Media │ │ │
│ │ │ │ │ │ │Engine│ │ Engine │ │ │
│ │ └──┬──┘ └──┬──┘ └──┬───┘ └────┬─────┘ │ │
│ │ └───────┴───────┴──────────┘ │ │
│ │ │ Fabric │ │
│ │ ┌────────┴────────┐ │ │
│ │ │ System Level │ │ │
│ │ │ Cache (SLC) │ │ │
│ │ └────────┬────────┘ │ │
│ └──────────────┼───────────────────────────┘ │
│ ┌──────┴──────┐ │
│ │ Unified │ │
│ │ Memory │ │
│ │ 16-192 GB │ │
│ └─────────────┘ │
│ │
│ ONE memory pool. CPU and GPU access it with ZERO copies. │
└─────────────────────────────────────────────────────────────┘
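To put a number on the copy penalty, here is a quick back-of-the-envelope sketch using the approximate bandwidth figures above (real transfers also pay latency and protocol overhead, so treat these as lower bounds):

```rust
// Rough model: time to stage a ~3.8 GB set of weights for the GPU.
// Bandwidth figures are the approximate ones quoted above.
fn transfer_seconds(bytes: f64, bandwidth_bytes_per_s: f64) -> f64 {
    bytes / bandwidth_bytes_per_s
}

fn main() {
    let model_bytes = 3.8e9; // ~7B parameters at Q4_0

    // Traditional PC: weights must cross PCIe into VRAM at load time.
    let pcie = transfer_seconds(model_bytes, 32.0e9);
    println!("PCIe x16 copy: {:.0} ms", pcie * 1e3); // ~119 ms per full copy

    // Apple Silicon UMA: the GPU reads the bytes the CPU mapped. Zero copies.
    println!("UMA copy:      0 ms (same physical memory)");
}
```

The one-time load cost is the small part; the same arithmetic applies to every intermediate tensor that would otherwise bounce between pools.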
The two diagrams above illustrate the difference in how data flows through each architecture — note the PCIe bottleneck in the traditional model versus the direct, zero-copy access under UMA.
Anatomy of the Die
Let’s map out every major component on an M4 Pro die:[^1]
┌──────────────────────────────────────────────────────────────────┐
│ M4 Pro SoC Die │
│ │
│ ┌─────────────────────────┐ ┌──────────────────────────────┐ │
│ │ CPU Cluster │ │ GPU (20 cores) │ │
│ │ │ │ │ │
│ │ ┌────┐┌────┐┌────┐ │ │ ┌───┐┌───┐┌───┐┌───┐┌───┐ │ │
│ │ │ P0 ││ P1 ││ P2 │ │ │ │ 0 ││ 1 ││ 2 ││ 3 ││ 4 │ │ │
│ │ └────┘└────┘└────┘ │ │ └───┘└───┘└───┘└───┘└───┘ │ │
│ │ ┌────┐┌────┐┌────┐ │ │ ┌───┐┌───┐┌───┐┌───┐┌───┐ │ │
│ │ │ P3 ││ P4 ││ P5 │ │ │ │ 5 ││ 6 ││ 7 ││ 8 ││ 9 │ │ │
│ │ └────┘└────┘└────┘ │ │ └───┘└───┘└───┘└───┘└───┘ │ │
│ │ ┌────┐┌────┐┌────┐ │ │ ┌───┐┌───┐┌───┐┌───┐┌───┐ │ │
│ │ │ P6 ││ P7 ││ P8 │ │ │ │10 ││11 ││12 ││13 ││14 │ │ │
│ │ └────┘└────┘└────┘ │ │ └───┘└───┘└───┘└───┘└───┘ │ │
│ │ ┌────┐ │ │ ┌───┐┌───┐┌───┐┌───┐┌───┐ │ │
│ │ │ P9 │ P=Performance │ │ │15 ││16 ││17 ││18 ││19 │ │ │
│ │ └────┘ │ │ └───┘└───┘└───┘└───┘└───┘ │ │
│ │ ┌────┐┌────┐┌────┐ │ │ 20 GPU Cores │ │
│ │ │ E0 ││ E1 ││ E2 │ │ │ │ │
│ │ └────┘└────┘└────┘ │ │ │ │
│ │ ┌────┐ E=Efficiency │ │ │ │
│ │ │ E3 │ │ │ │ │
│ │ └────┘ │ │ │ │
│ └─────────────────────────┘ └──────────────────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────┐ ┌────────────┐ ┌──────────┐ │
│ │ Neural Engine│ │ Media │ │ Display │ │ Secure │ │
│ │ 16 Cores │ │ Engine │ │ Engine │ │ Enclave │ │
│ │ 38 TOPS │ │ ProRes │ │ │ │ │ │
│ └──────────────┘ └──────────┘ └────────────┘ └──────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Fabric / Interconnect │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ System Level Cache (SLC) — 36 MB │ │
│ └──────────────────────────────────────────────────────────┘ │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ Memory │ │Thunderbolt│ │ PCIe │ │ USB │ │
│ │Controller│ │Controller │ │Controller│ │ Controller │ │
│ └──────────┘ └───────────┘ └──────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────────────────┘
CPU Clusters: Performance and Efficiency
Apple Silicon uses ARM’s big.LITTLE concept with two core types:
Performance Cores (P-cores):
- Wide, out-of-order execution with deep pipelines
- High clock speeds (up to ~4.5 GHz on M4)
- Large L1 caches (192 KB instruction, 128 KB data per core)
- Large shared L2 cache (16-32 MB per cluster)
- Used for compute-intensive tasks
Efficiency Cores (E-cores):
- Narrower, simpler pipeline
- Much lower power consumption
- Lower clock speeds (~2.8 GHz)
- Handle background tasks, I/O-bound work
For akunu, the CPU mainly handles model loading, tokenization, dispatch table building, grammar constraint checking, and control flow between GPU dispatches. The GPU does the neural network computation.
The GPU
This is where akunu spends most of its time. Each GPU core contains ALUs, a texture sampling unit, and tile memory. Core counts vary: M4 has 10, M4 Pro has 20, M4 Max has 40, M4 Ultra has 80. We’ll cover GPU architecture in extreme detail in Chapter 3.
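As a rough illustration of what those core counts buy, here is a peak-FLOPS estimate. The 128 FP32 lanes per core and ~1.5 GHz clock are public ballpark figures for this GPU generation, not numbers from this chapter, so treat the result as an order-of-magnitude sketch:

```rust
// Peak FP32 estimate: cores × lanes/core × 2 (an FMA counts as 2 FLOPs) × clock.
fn peak_fp32_flops(cores: u64, lanes_per_core: u64, clock_hz: f64) -> f64 {
    (cores * lanes_per_core * 2) as f64 * clock_hz
}

fn main() {
    // 20 cores per the M4 Pro diagram above; lane count and clock are
    // rough public estimates, not figures from this chapter.
    let tflops = peak_fp32_flops(20, 128, 1.5e9) / 1e12;
    println!("M4 Pro GPU, est. peak FP32: ~{:.1} TFLOPS", tflops);
}
```

Note that for decode this compute ceiling is rarely the limit — as we'll see below, memory bandwidth is.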
The Neural Engine
Apple’s dedicated matrix multiplication accelerator: 16 cores on M4 Pro, 38 TOPS at INT8. However, akunu does NOT use the Neural Engine. It’s only accessible through CoreML, which doesn’t give the fine-grained control needed for optimized inference. Metal compute shaders give us full control over memory layout, kernel design, and dispatch — which is why akunu achieves 1.83x faster decode than llama.cpp.
System Level Cache (SLC)
The SLC is a large, shared last-level cache between all die components and DRAM:
CPU L2 GPU L2 NE Cache
\ | /
▼ ▼ ▼
┌─────────────────────────┐
│ System Level Cache │
│ M4: 16 MB │
│ M4 Pro: 36 MB │
│ M4 Max: 48 MB │
│ M4 Ultra: 96 MB │
└────────────┬────────────┘
▼
┌───────────────┐
│ LPDDR5 DRAM │
└───────────────┘
The SLC is critical for inference: weight tiles that fit in SLC get accessed much faster, and intermediate activations often fit entirely.[^2] Akunu’s ChipConfig::slc_size tunes behavior based on available SLC.
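Here is a sketch of what SLC-aware tuning can look like. The ChipConfig struct and the tile-selection policy below are illustrative, not akunu’s actual implementation:

```rust
// Hypothetical config: pick the largest power-of-two weight tile that
// still leaves room in the SLC for activations.
struct ChipConfig {
    slc_size: usize, // bytes of System Level Cache
}

fn pick_tile_bytes(cfg: &ChipConfig, activation_bytes: usize) -> usize {
    let budget = cfg.slc_size.saturating_sub(activation_bytes);
    let mut tile = 1usize << 20; // 1 MB floor
    // Double the tile while it still fits in the remaining SLC budget.
    while tile * 2 <= budget {
        tile *= 2;
    }
    tile
}

fn main() {
    let m4_pro = ChipConfig { slc_size: 36 << 20 }; // 36 MB SLC
    let tile = pick_tile_bytes(&m4_pro, 4 << 20);   // ~4 MB of activations
    println!("weight tile size: {} MB", tile >> 20); // 32 MB on M4 Pro
}
```

The same policy yields smaller tiles on M4 (16 MB SLC) and larger ones on Max/Ultra — which is exactly why this is a per-chip tunable rather than a constant.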
Memory Controller
| Chip | Memory Bus | Bandwidth | Max Memory |
|---|---|---|---|
| M4 | 128-bit | 120 GB/s | 32 GB |
| M4 Pro | 256-bit | 273 GB/s | 64 GB |
| M4 Max | 512-bit | 546 GB/s | 128 GB |
| M4 Ultra | 1024-bit | 819 GB/s | 192 GB |
For decode, token generation time is approximately model_size_bytes / memory_bandwidth. For a 7B Q4_0 model (~3.8 GB): M4 Pro gives ~71 tok/s theoretical, M4 Max ~143 tok/s. Akunu gets close to these limits.
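That formula is easy to sanity-check. A sketch using the bandwidth table above and the ~3.8 GB Q4_0 model from the text:

```rust
// Decode is memory-bandwidth bound: each generated token reads (roughly)
// every model weight once, so tok/s ≈ bandwidth / model size.
fn decode_tokens_per_sec(model_bytes: f64, bandwidth_bytes_per_s: f64) -> f64 {
    bandwidth_bytes_per_s / model_bytes
}

fn main() {
    let q4_7b = 3.8e9; // ~7B model at Q4_0, as in the text
    println!("M4:       ~{:.0} tok/s", decode_tokens_per_sec(q4_7b, 120.0e9));
    println!("M4 Pro:   ~{:.0} tok/s", decode_tokens_per_sec(q4_7b, 273.0e9));
    println!("M4 Max:   ~{:.0} tok/s", decode_tokens_per_sec(q4_7b, 546.0e9));
    println!("M4 Ultra: ~{:.0} tok/s", decode_tokens_per_sec(q4_7b, 819.0e9));
}
```

These are ceilings, not measurements: KV-cache reads, attention compute, and imperfect bandwidth utilization all shave a few percent off in practice.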
The Chip Hierarchy: Binning and Variants
Apple creates chip families from the same base design:
BASE DIE ──────► M4 (binned, fewer cores)
│
├──────────► M4 Pro (mid-range)
│
└──────────► M4 Max (full config)
│
│ UltraFusion (2 dies)
▼
M4 Ultra (2x Max)
The Ultra variant connects two Max dies via UltraFusion (~2.5 TB/s die-to-die bandwidth). Software sees it as one unified chip — no special programming needed.
What This Means for Akunu
- Zero-copy weight loading: GGUF files are memory-mapped; the GPU reads directly from mmap’d regions
- CPU-side operations on GPU buffers: Whisper’s cross-attention K/V rearrangement happens on CPU, directly on GPU-accessible memory
- Minimal CPU-GPU sync: Precompiled dispatch table minimizes CPU’s role in the hot path
- SLC-aware tuning: ChipConfig adjusts tile and batch sizes based on SLC
- Bandwidth as bottleneck: During decode, akunu is memory-bandwidth bound — quantization (Q4_0) has dramatic impact
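The last point deserves numbers. In the GGML/GGUF layout, Q4_0 stores blocks of 32 weights as 16 nibble bytes plus a 2-byte f16 scale — 18 bytes per 32 weights, i.e. 4.5 bits per weight. A sketch of the resulting decode ceilings on an M4 Pro (the nominal 7B parameter count here gives ~3.9 GB, in line with the ~3.8 GB quoted earlier):

```rust
// Why quantization matters when you are bandwidth-bound: fewer bytes per
// weight means proportionally more tokens per second.
// Q4_0 (GGML layout): 32 weights -> 16 nibble bytes + 2-byte f16 scale.
fn q4_0_bytes(params: u64) -> u64 { params * 18 / 32 }
fn f16_bytes(params: u64) -> u64 { params * 2 }

fn main() {
    let params = 7_000_000_000u64;
    let bw = 273.0e9; // M4 Pro memory bandwidth
    let q4 = q4_0_bytes(params) as f64;
    let f16 = f16_bytes(params) as f64;
    println!("Q4_0: {:.1} GB -> ~{:.0} tok/s", q4 / 1e9, bw / q4);
    println!("F16:  {:.1} GB -> ~{:.0} tok/s", f16 / 1e9, bw / f16);
}
```

Going from F16 to Q4_0 shrinks the weights by roughly 3.6x, and decode throughput scales by almost the same factor.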
In the next chapter, we’ll zoom into the GPU architecture specifically.
[^1]: Apple. “Apple M4 Pro chip.” apple.com, 2024. Die specifications including GPU core count, memory bandwidth, and SLC size. https://www.apple.com/newsroom/2024/10/apple-introduces-m4-pro-and-m4-max/
[^2]: Frumusanu, A. “Apple’s M4 Family: A Deep Dive.” AnandTech / Chips and Cheese, 2024. Independent analysis of Apple Silicon die layout, cache hierarchy, and fabric bandwidth. https://chipsandcheese.com/