The Apple Silicon Revolution
On June 22, 2020, at the Worldwide Developers Conference, Apple CEO Tim Cook announced that the Mac was transitioning from Intel x86 processors to Apple’s own custom ARM-based chips.[1] This was not a surprise to industry watchers – rumors had been circulating for years – but the ambition and speed of the transition stunned everyone.[2] Apple was not just switching CPU vendors. They were bringing the entire system-on-chip (SoC) approach that had made the iPhone and iPad dominant in mobile computing to the desktop and laptop. Within two years, every Mac in the lineup would run on Apple Silicon.
This chapter tells the story of how we got here, why it matters, and what it means for the kind of workload we care about most in this book: running large language models as fast as possible on local hardware.
Why Apple Left Intel
To understand why Apple Silicon exists, you need to understand why Apple was unhappy with Intel. The relationship between the two companies began in 2005, when Steve Jobs announced the Mac’s transition from PowerPC to Intel x86. At the time, Intel’s chips offered a compelling combination of performance and power efficiency, and the x86 ecosystem was the dominant force in personal computing.
For about a decade, this partnership worked well. Intel’s tick-tock cadence – alternating between process shrinks and microarchitecture improvements – delivered steady performance gains. But starting around 2015, things began to go wrong.
Intel’s Process Stagnation
Intel’s manufacturing process, once the envy of the semiconductor industry, hit a wall. The company’s 10nm node (which they originally planned to ship in 2016) was delayed repeatedly. The 14nm process that was supposed to be a one-generation stopgap ended up being stretched across more than half a decade of products:
Intel's 14nm Purgatory
=======================
2014: Broadwell (14nm) -- on schedule
2015: Skylake (14nm) -- ok, next year we move to 10nm
2016: Kaby Lake (14nm+) -- 10nm delayed, here's a refinement
2017: Coffee Lake (14nm++) -- still no 10nm, more cores this time
2018: Coffee Lake Refresh (14nm++) -- yeah, 10nm is still not ready
2019: Ice Lake (10nm) -- finally! but only in low-power laptops
2021: Rocket Lake (14nm) -- back to 14nm for desktops. really.
Each year that Intel stayed on 14nm, the performance improvements got smaller. The company resorted to increasing power consumption and adding cores to show benchmark improvements, but the fundamental per-core performance was stagnating.
Meanwhile, Apple’s in-house chip team – born from the 2008 acquisition of PA Semi, a boutique chip design firm – was making ARM-based processors that improved by 20-40% in performance every single year. By 2020, Apple’s A14 chip (in the iPhone 12) was competitive with Intel’s laptop processors in single-threaded performance, despite running at a fraction of the power.
The Power Problem
For a laptop maker like Apple, power efficiency is not a nice-to-have; it is the single most important metric. Every watt of power consumed becomes a watt of heat that must be dissipated. More heat means bigger fans, thicker chassis, and shorter battery life. Intel’s chips were designed primarily for desktops and servers, then adapted for laptops. Apple wanted chips designed for laptops first.
Power Efficiency Comparison (circa 2020)
=========================================
Intel Core i7 (10th gen laptop):
+----------------------------+
| TDP: 28W (actual: 35-50W) |
| Performance: ████████ |
| Per-watt: ███ |
+----------------------------+
Apple A14 (iPad/iPhone):
+----------------------------+
| TDP: ~5-6W |
| Performance: ██████ |
| Per-watt: █████████████ |
+----------------------------+
Apple wanted to bring that per-watt advantage to the Mac.
The gap in power efficiency was not just about process technology (though TSMC’s nodes were ahead of Intel’s). It was about fundamental architectural choices. ARM’s instruction set, which we will discuss shortly, is inherently more power-efficient than x86. And Apple’s custom microarchitecture was designed from the ground up for efficiency in a way that Intel’s x86 legacy made difficult.
The Integration Advantage
Perhaps the most important reason Apple moved to its own silicon was integration. With Intel chips, a Mac was a collection of discrete components: an Intel CPU, a separate AMD or NVIDIA GPU (or Intel integrated graphics), a separate ISP (image signal processor), a separate Thunderbolt controller, a separate security chip (T1, then T2). Each of these communicated over various buses and protocols, with all the latency and power overhead that implies.
Intel-era Mac Architecture (simplified)
=========================================
+--------+ PCIe +------------+
| Intel |<------------>| AMD/NVIDIA |
| CPU | | Discrete |
| | | GPU |
+---+----+ +------------+
| |
| DDR4 | GDDR6
v v
+--------+ +-----------+
| System | | VRAM |
| Memory | | (separate)|
| (RAM) | | |
+--------+ +-----------+
Data must cross PCIe bus to go between CPU and GPU.
GPU has its own memory (VRAM), separate from system RAM.
Bandwidth between CPU and GPU is limited (~16 GB/s PCIe 3.0).
Apple’s SoC approach puts everything on a single chip, sharing a single pool of memory. We will explore this unified memory architecture in detail in Chapter 4, but the key insight is this: when the CPU and GPU share the same memory, you eliminate the need to copy data between them. For machine learning workloads, where models can be tens of gigabytes, this is transformative.
The ARM Instruction Set Architecture
Before we dive into Apple Silicon specifically, let’s talk about ARM – the instruction set architecture (ISA) that forms the foundation of every Apple Silicon chip.
ARM stands for “Advanced RISC Machines” (originally “Acorn RISC Machine,” after the British company that designed the first ARM processor in 1985). ARM does not manufacture chips; it designs instruction set architectures and licenses them to chip makers. Apple holds an ARM architecture license (implementing ARMv8.5-A in M1, ARMv8.6-A in M2/M3, and ARMv9.2-A in M4) and designs its own custom microarchitecture that implements that ISA.
This distinction is important: the ISA defines what instructions the processor understands. The microarchitecture defines how those instructions are executed. Two chips can implement the same ISA with radically different performance characteristics. Apple’s custom ARM cores (codenamed Firestorm, Icestorm, Avalanche, Blizzard, Everest, Sawtooth, etc.) are not the same as Qualcomm’s Kryo cores or ARM’s own Cortex cores, even though they all execute ARM instructions.
RISC vs. CISC
ARM is a RISC (Reduced Instruction Set Computer) architecture.[3] Intel’s x86 is a CISC (Complex Instruction Set Computer) architecture. This is one of the most important distinctions in processor design, and it has deep implications for power efficiency and performance.
CISC (x86) philosophy: Provide a large number of complex instructions. A single instruction might load data from memory, perform an arithmetic operation, and store the result back to memory. The idea is to minimize the number of instructions the compiler needs to generate, reducing code size.
RISC (ARM) philosophy: Provide a smaller number of simple instructions. Each instruction does one thing: load data, perform arithmetic, or store data, but not a combination. The idea is that simpler instructions can be executed faster, and the hardware can be simpler (and therefore more power-efficient).
CISC vs. RISC: Adding Two Numbers from Memory
===============================================
x86 (CISC):
+-----------------------------------------+
| ADD [mem_a], [mem_b] | <-- One instruction
| | but internally decoded
| Internally becomes: | into multiple micro-ops
| load temp1, [mem_a] |
| load temp2, [mem_b] |
| add temp1, temp1, temp2 |
| store [mem_a], temp1 |
+-----------------------------------------+
ARM (RISC):
+-----------------------------------------+
| LDR X0, [mem_a] | <-- Load first operand
| LDR X1, [mem_b] | <-- Load second operand
| ADD X0, X0, X1 | <-- Add them
| STR X0, [mem_a] | <-- Store result
+-----------------------------------------+
ARM: 4 instructions, each simple and predictable
x86: 1 instruction, but complex internal decoding required
In practice, modern x86 processors bridge the gap by decoding CISC instructions into internal RISC-like micro-operations (micro-ops). But this decoding step consumes die area and power. ARM processors skip this step entirely because their instructions are already simple. This is one of the main reasons ARM chips are more power-efficient.
Fixed-Width Instructions
One of ARM’s most important characteristics is that all instructions are the same width: 32 bits (4 bytes). (32-bit ARM also supports the compressed “Thumb” encoding for code density, but Apple Silicon runs in AArch64 mode, where every instruction is 32 bits.)
This might seem like a minor detail, but it has profound implications for the processor’s front end – the part of the chip that fetches and decodes instructions.
x86 Variable-Length Instructions
=================================
Memory layout of x86 code:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| 1 byte | 3 bytes | 7 bytes | 2 bytes|
| instr | instr | instr | instr |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
The processor cannot tell where one instruction ends and
the next begins without decoding each instruction sequentially.
This is hard to parallelize.
ARM Fixed-Width Instructions
==============================
Memory layout of ARM code:
+----------+----------+----------+----------+----------+
| 4 bytes | 4 bytes | 4 bytes | 4 bytes | 4 bytes |
| instr | instr | instr | instr | instr |
+----------+----------+----------+----------+----------+
Every instruction starts at a 4-byte boundary.
The processor can easily fetch and decode multiple
instructions in parallel. No ambiguity about boundaries.
For Intel, the variable-length instruction format means the front-end decode logic is one of the most complex (and power-hungry) parts of the chip. The decoder must examine each byte to determine the instruction length before it can find the start of the next instruction. Intel has thrown enormous engineering resources at this problem (instruction caches, micro-op caches, complex decode pipelines), but it remains an inherent disadvantage.
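The front-end asymmetry can be made concrete with a toy sketch in pure Python (the instruction lengths are invented for illustration): fixed-width boundaries depend only on position, while variable-length boundaries force a sequential scan.

```python
# Toy illustration of instruction-boundary discovery.
# Fixed-width (ARM-style): every boundary is i * 4, computable independently,
# so a wide front end can fetch and decode many instructions in parallel.
# Variable-length (x86-style): each boundary depends on having decoded the
# previous instruction's length first -- an inherently sequential chain.

def arm_boundaries(code_len, width=4):
    # All boundaries known up front; no decoding required to find them.
    return list(range(0, code_len, width))

def x86_boundaries(lengths):
    # Must walk instruction by instruction to find where the next one starts.
    offsets, pos = [], 0
    for n in lengths:
        offsets.append(pos)
        pos += n
    return offsets

print(arm_boundaries(20))            # [0, 4, 8, 12, 16]
print(x86_boundaries([1, 3, 7, 2]))  # [0, 1, 4, 11]
```

The same byte count yields instantly knowable boundaries on the left and a serial dependency chain on the right, which is exactly the decoder complexity the text describes.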
The Register File
ARM’s AArch64 architecture provides 31 general-purpose 64-bit registers (X0-X30), compared to x86-64’s 16 general-purpose registers (RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, R8-R15). Having more registers means less “register spilling” – the costly process of saving register values to the stack when the processor runs out of registers.
Register Files Compared
========================
ARM AArch64: x86-64:
+------+------+------+ +------+------+
| X0 | X1 | X2 | | RAX | RBX |
+------+------+------+ +------+------+
| X3 | X4 | X5 | | RCX | RDX |
+------+------+------+ +------+------+
| X6 | X7 | X8 | | RSI | RDI |
+------+------+------+ +------+------+
| X9 | X10 | X11 | | RBP | RSP |
+------+------+------+ +------+------+
| X12 | X13 | X14 | | R8 | R9 |
+------+------+------+ +------+------+
| X15 | X16 | X17 | | R10 | R11 |
+------+------+------+ +------+------+
| X18 | X19 | X20 | | R12 | R13 |
+------+------+------+ +------+------+
| X21 | X22 | X23 | | R14 | R15 |
+------+------+------+ +------+------+
| X24 | X25 | X26 | 16 registers
+------+------+------+
| X27 | X28 | X29 |
+------+------+------+
| X30 (LR) | SP |
+------+------+------+
31 registers + SP
Plus SIMD/FP registers: ARM provides 32 128-bit vector registers (V0-V31,
NEON, with optional SVE extensions); baseline x86-64 provides 16 XMM
registers (32 only with AVX-512).
ARM also specifies a dedicated link register (X30/LR) for function return addresses, which simplifies function call conventions and branch prediction.
NEON and Advanced SIMD
ARM includes NEON, a SIMD (Single Instruction, Multiple Data) extension that operates on 128-bit vector registers. NEON can process multiple data elements simultaneously:
NEON SIMD Operation: Adding 4 floats at once
==============================================
V0: | float_a0 | float_a1 | float_a2 | float_a3 | (128 bits)
+----------+----------+----------+----------+
+ + + +
V1: | float_b0 | float_b1 | float_b2 | float_b3 | (128 bits)
+----------+----------+----------+----------+
= = = =
V2: | float_c0 | float_c1 | float_c2 | float_c3 | (128 bits)
+----------+----------+----------+----------+
FADD V2.4S, V0.4S, V1.4S -- One instruction, four additions
Apple’s implementation of NEON is extremely wide and high-throughput. The performance cores (P-cores) on M-series chips can execute up to four 128-bit NEON operations per cycle, giving them enormous SIMD throughput for CPU-side workloads.
For the purposes of this book, NEON is relevant primarily for CPU-side preprocessing (tokenizer operations, weight format conversion) rather than the main inference workload, which runs on the GPU. But it is good to know it is there.
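To make the lane-wise semantics of the diagram concrete, here is a pure-Python sketch of what FADD V2.4S does (the function name and lists are illustrative stand-ins; real NEON operates on 128-bit hardware registers):

```python
# Sketch of NEON lane-wise addition: one FADD V2.4S, V0.4S, V1.4S
# instruction performs four independent float additions in lockstep.

def fadd_4s(v0, v1):
    # Each 128-bit register holds four 32-bit floats ("4S" = 4 single-precision lanes).
    assert len(v0) == len(v1) == 4
    return [a + b for a, b in zip(v0, v1)]

print(fadd_4s([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
# [11.0, 22.0, 33.0, 44.0]
```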
AMX: Apple’s Secret Matrix Coprocessor
Here is something you will not find in the official ARM specification: Apple’s M-series chips include a custom matrix coprocessor called AMX (Apple Matrix Extensions). AMX is not part of the ARM ISA – it is a proprietary Apple extension that provides hardware-accelerated matrix multiplication on the CPU side.
AMX operates on its own set of registers and can perform operations like multiplying two 16x16 matrices of FP16 values in a single instruction. Apple uses AMX internally in the Accelerate framework and CoreML, but it is not officially documented or exposed as a public API.
AMX: Apple's Matrix Coprocessor (undocumented)
================================================
+---------------------+
| Apple CPU Core |
| |
| +------+ +------+ |
| | ALU | | NEON | |
| +------+ +------+ |
| |
| +---------------+ |
| | AMX | | <-- Custom matrix coprocessor
| | 16x16 FP16 | | Not part of ARM ISA
| | matrix mul | | Undocumented by Apple
| | in one cycle | | Used by Accelerate/CoreML
| +---------------+ |
| |
+---------------------+
AMX registers are separate from the ARM register file.
AMX instructions are encoded as system register writes.
We mention AMX here for completeness, but akunu does not use it. Akunu runs the entire inference workload on the GPU, where the much larger number of ALUs and the higher memory bandwidth make GPU compute far more efficient for the matrix operations that dominate LLM inference. AMX is more useful for smaller matrix operations that need low latency (like real-time audio processing).
The A-Series Lineage
Apple Silicon did not spring into existence fully formed. It is the culmination of over a decade of chip design work that began with the original iPhone.
The Mobile Era (2010-2019)
Apple started designing its own chips with the A4, which powered the original iPad and iPhone 4 in 2010. Each generation brought significant improvements:
The A-Series Evolution
=======================
A4 (2010) -- First Apple-designed SoC. ARM Cortex-A8 based.
| Single core, 45nm Samsung process.
v
A5 (2011) -- Dual-core ARM Cortex-A9. First Apple GPU (PowerVR).
|
v
A6 (2012) -- First CUSTOM Apple CPU core (Swift). No more ARM reference
| designs. This is where Apple diverged from everyone else.
v
A7 (2013) -- First 64-bit mobile processor. EVER. Caught the industry
| off guard. Desktop-class architecture in a phone.
v
A8 (2014) -- 20nm TSMC. Performance and efficiency improvements.
|
v
A9 (2015) -- Custom "Twister" core. Huge IPC gains.
|
v
A10 (2016) -- First big.LITTLE configuration (2 perf + 2 efficiency cores)
| "Fusion" architecture.
v
A11 (2017) -- First Apple-designed GPU. Dropped PowerVR. "Bionic" branding
|             begins. First Neural Engine (2-core, 600B ops/sec).
v
A12 (2018) -- 7nm TSMC. 8-core Neural Engine (5 trillion ops/sec).
|
v
A13 (2019) -- "Bionic." Fastest mobile chip by a huge margin.
| 8-core Neural Engine (6 trillion ops/sec).
v
A14 (2020) -- 5nm TSMC. First 5nm chip in any consumer device.
| Performance competitive with Intel laptop chips.
v
M1 (2020) -- A14 core design, scaled up for Mac.
The revolution begins.
Several key milestones in this lineage are directly relevant to Apple Silicon and akunu:
A6 (2012): Custom CPU cores. Apple stopped using ARM’s reference CPU designs and started designing its own microarchitectures. This gave Apple the freedom to make architectural decisions optimized for their specific use cases, and it is why Apple’s ARM cores consistently outperform everyone else’s ARM cores.
A7 (2013): 64-bit ARM. Apple was the first company to ship a 64-bit ARM processor in a consumer device. Qualcomm and Samsung scrambled to catch up. The 64-bit address space would later be essential for the large memory configurations in M-series chips.
A11 (2017): Custom GPU. Apple designed its own GPU, replacing the PowerVR cores it had licensed for years. This custom GPU design is the direct ancestor of the GPU cores in every M-series chip, and understanding its architecture is critical for understanding akunu’s Metal kernels.
A11 (2017): First Neural Engine. Apple’s first dedicated ML accelerator appeared in the A11. While akunu does not use the Neural Engine (it targets the GPU for maximum flexibility and performance), the Neural Engine’s existence shows how seriously Apple takes on-device ML.
From Phone to Mac: The M1 Announcement
On November 10, 2020, Apple announced the M1 chip and the first three Macs to use it: the MacBook Air, MacBook Pro 13-inch, and Mac mini. The M1 was not just an A14 chip with a different name – it was the A14’s core design scaled up for the thermal and power envelope of a laptop:
A14 (iPhone) vs. M1 (Mac)
===========================
A14: M1:
+---------------------+ +-------------------------------+
| 2 Perf + 4 Eff cores| | 4 Perf + 4 Eff cores |
| 4 GPU cores | | 7 or 8 GPU cores |
| 16-core Neural Eng | | 16-core Neural Engine |
| 4GB/6GB RAM | | 8GB/16GB Unified Memory |
| ~4W power budget | | ~15-20W power budget |
| LPDDR4X | | LPDDR4X, 128-bit bus |
| 11.8B transistors | | 16B transistors |
| 88mm^2 die size | | 120mm^2 die size |
+---------------------+ +-------------------------------+
Same CPU core design (Firestorm + Icestorm)
Same GPU core design
But more of everything, with more memory and bandwidth
The M1 was a revelation. It offered:
- Better CPU performance than Intel’s Core i9 in most workloads
- Better GPU performance than Intel’s Iris integrated graphics (and competitive with low-end discrete GPUs)
- Up to 18 hours of battery life in the MacBook Air (20 in the 13-inch MacBook Pro)
- Fanless operation in the MacBook Air (the chip ran cool enough to not need a fan)
- Unified Memory Architecture with zero-copy sharing between CPU and GPU
The tech industry was stunned. Not because Apple had made a fast chip – the A-series trajectory made that predictable – but because the first-generation Mac chip was this good. It was not a compromise “good enough for the Mac” chip; it was genuinely better than Intel’s best laptop offerings in almost every measurable way.
What Makes Apple Silicon Different
Now that we understand the history, let’s talk about what makes Apple Silicon architecturally unique. There are several key differentiators that matter for our purposes:
1. System-on-Chip (SoC) Integration
Unlike traditional PCs where the CPU, GPU, memory controller, I/O controllers, and other components are separate chips on a motherboard, Apple Silicon puts everything on a single die (or, for Ultra variants, two dies connected by a high-speed interconnect).
Traditional PC Architecture Apple Silicon SoC
========================== ==================
+------+ +------+ +------+ +---------------------------+
| CPU | | GPU | | WiFi | | CPU GPU Neural Eng |
+--+---+ +--+---+ +--+---+ | |
| | | | Media ISP Secure Enc |
===+====PCIe==+====USB===+==== | |
| | Memory Thunderbolt |
+--+---------------------------+ | Controller Controller |
| Motherboard | | |
| +--------+ +-----------+ | | I/O SLC Fabric |
| | Memory | | Storage | | +---------------------------+
| | DIMMs | | Controller| | |
| +--------+ +-----------+ | +---------------------------+
+------------------------------+ | Unified Memory |
| (LPDDR on-package) |
Multiple chips, multiple buses, +---------------------------+
multiple memory pools
One chip, one memory pool,
one interconnect fabric
This integration has several benefits:
- Lower latency: Components communicate over an on-die fabric instead of PCIe or USB. On-die communication can be an order of magnitude lower latency than crossing a PCIe bus.
- Lower power: Driving signals across a PCIe bus or memory DIMMs requires much more power than on-die communication. The shorter the wire, the less power it takes.
- Higher bandwidth: Apple can build wide, custom interconnects between components because they are all on the same die. The M4 Max, for example, has a 512-bit memory bus – wider than what you would typically see on a discrete GPU.
2. Unified Memory Architecture (UMA)
This is the single most important differentiator for ML workloads, and we will devote all of Chapter 4 to it. The short version: on Apple Silicon, the CPU and GPU share the same physical memory. There is no “system RAM” and “VRAM” – there is just memory, and both the CPU and GPU can access it.
For LLM inference, this means:
- No copying model weights between CPU and GPU. When akunu loads a model from disk, the CPU reads the file into a Metal buffer, and the GPU can immediately access those weights without any data transfer.
- No VRAM limitation. On a discrete GPU, your model must fit in VRAM (12GB, 24GB, etc.). On Apple Silicon, your model can use the entire unified memory pool – up to 128GB on the M4 Max.
- CPU post-processing is free. After the GPU generates logits, the CPU can read the output buffer directly without waiting for a DMA transfer.
Why UMA Matters for LLM Inference
===================================
Discrete GPU (NVIDIA):
1. Load model from disk -> CPU memory (RAM) [slow: disk I/O]
2. Copy weights from RAM -> GPU VRAM (PCIe) [slow: PCIe ~32 GB/s]
3. GPU computes inference using VRAM [fast: HBM ~900 GB/s]
4. Copy results from VRAM -> RAM (PCIe) [slow: PCIe ~32 GB/s]
5. CPU reads results from RAM [fast: DDR5 ~50 GB/s]
Apple Silicon (UMA):
1. Load model from disk -> unified memory [slow: disk I/O]
2. GPU computes inference using same memory [fast: ~273-546 GB/s]
3. CPU reads results from same memory [fast: same bus]
Steps 2 and 4 from the discrete GPU path simply do not exist on Apple Silicon.
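The cost of the missing steps is easy to estimate. A back-of-envelope sketch (bandwidth figures follow the diagram above; they are illustrative, not measured):

```python
# Rough cost of staging a 4.3 GB model into VRAM over PCIe, versus the
# zero-copy UMA path where CPU and GPU address the same physical pages.

MODEL_GB = 4.3     # Llama 3.1 8B at Q4_0, per the diagram above
PCIE_GBPS = 32.0   # PCIe 4.0 x16, one direction (illustrative)

pcie_copy_s = MODEL_GB / PCIE_GBPS   # step 2 on the discrete-GPU path
uma_copy_s = 0.0                     # no such step exists under UMA

print(f"PCIe staging copy: {pcie_copy_s * 1000:.0f} ms")  # ~134 ms
print(f"UMA staging copy:  {uma_copy_s * 1000:.0f} ms")
```

A one-time ~134 ms copy is tolerable at load time; the real pain on discrete GPUs comes when the model does not fit in VRAM and weights must repeatedly cross that same bus during inference.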
3. Efficiency Cores and Performance Cores (big.LITTLE)
Apple Silicon uses a heterogeneous CPU design with two types of cores:
- Performance cores (P-cores): Wide, deep-pipeline, out-of-order cores designed for maximum single-threaded performance. These are among the fastest CPU cores in the world. They consume more power.
- Efficiency cores (E-cores): Narrower, simpler cores designed for maximum performance per watt. They handle background tasks and light workloads, allowing the P-cores to stay idle (and powered down) when they are not needed.
CPU Cluster Architecture
=========================
+-------------------------------------------+
| Performance Cluster |
| +---------+ +---------+ +---------+ +--+ |
| | P-core | | P-core | | P-core | |..| |
| | Wide | | Wide | | Wide | | | |
| | OoO | | OoO | | OoO | | | |
| | 192KB | | 192KB | | 192KB | | | |
| | L1 | | L1 | | L1 | | | |
| +---------+ +---------+ +---------+ +--+ |
| Shared L2 Cache (12-16MB) |
+-------------------------------------------+
+-------------------------------------------+
| Efficiency Cluster |
| +-------+ +-------+ +-------+ +-------+ |
| |E-core | |E-core | |E-core | |E-core | |
| |Narrow | |Narrow | |Narrow | |Narrow | |
| |In-ord | |In-ord | |In-ord | |In-ord | |
| | 128KB | | 128KB | | 128KB | | 128KB | |
| | L1 | | L1 | | L1 | | L1 | |
| +-------+ +-------+ +-------+ +-------+ |
| Shared L2 Cache (4MB) |
+-------------------------------------------+
For akunu, the P-cores matter during model loading and weight rearrangement (CPU-side work), while the GPU handles the actual inference computation. The E-cores might handle I/O and tokenization during inference.
4. The Custom GPU
Apple’s GPU is a tile-based deferred renderer (TBDR) designed primarily for mobile graphics, but it turns out to be surprisingly capable for compute workloads as well. We will cover the GPU architecture in detail in Chapter 3, but the key points are:
- Apple’s GPU uses 32-thread SIMD groups (similar to NVIDIA’s 32-thread warps)
- It has on-chip tile memory (similar to NVIDIA’s shared memory)
- It supports SIMD group matrix operations for hardware-accelerated matrix multiplication
- It shares the same unified memory as the CPU, with no separate VRAM
The GPU is where akunu spends the vast majority of its time, and understanding its architecture is the foundation for understanding akunu’s performance.
5. The Neural Engine
Apple Silicon includes a dedicated Neural Engine – a specialized accelerator for neural network inference. The Neural Engine is optimized for a specific set of operations (convolutions, matrix multiplications, etc.) and can be more power-efficient than the GPU for those operations.
However, akunu does not use the Neural Engine. Here is why:
- Limited programmability: The Neural Engine is accessed through CoreML, which abstracts away the hardware details. You cannot write custom kernels for it.
- Limited precision support: The Neural Engine is optimized for FP16 and INT8 operations. Many of akunu’s kernels need mixed-precision arithmetic.
- Limited flexibility: The Neural Engine is designed for standard neural network layers. Custom operations (like akunu’s fused kernels) cannot run on it.
- GPU is fast enough: Apple’s GPU, with SIMD group matrix operations and high memory bandwidth, is more than fast enough for inference when properly optimized.
Why Akunu Uses the GPU, Not the Neural Engine
===============================================
Neural Engine:
+-------------------+
| + Power efficient |
| + Good for std ops |
| - No custom kernels|
| - CoreML only |
| - Fixed data types |
| - Limited control |
+-------------------+
GPU:
+-------------------+
| + Full control |
| + Custom kernels |
| + Any data type |
| + High bandwidth |
| + SIMD group ops |
| + Threadgroup mem |
| - More power |
+-------------------+
For maximum performance with custom operations, the GPU wins.
How Apple Silicon Changes the Game for Local Inference
With the background in place, let’s talk about why Apple Silicon is uniquely well-suited for running LLMs locally – and why akunu exists.
The Memory Wall
The fundamental bottleneck in LLM inference is not compute; it is memory bandwidth. During the decode phase (generating one token at a time), the model must read its entire weight matrix for every single token. For a 7-billion-parameter model in 4-bit quantization, that is roughly 3.5GB of data that must be read from memory for every single token generated.
The speed at which you can read data from memory – the memory bandwidth – determines the upper bound on your decode speed. This is the “memory wall.”
The Memory Wall: Why Bandwidth Matters
========================================
Model: Llama 3.1 8B (Q4_0 quantization)
Weight size: ~4.3 GB
Operation per token: read all weights + compute
If memory bandwidth is B (GB/s) and model size is S (GB):
Maximum theoretical tokens/sec = B / S
+-------------------+----------+-----------------------+
| Hardware | BW (GB/s)| Max decode (tok/s) |
+-------------------+----------+-----------------------+
| DDR4 laptop | 38 | 38/4.3 = ~9 tok/s |
| M1 (LPDDR4X) | 68 | 68/4.3 = ~16 tok/s |
| RTX 3090 (HBM) | 936 | 936/4.3 = ~218 tok/s |
| M4 Pro (LPDDR5X) | 273 | 273/4.3 = ~63 tok/s |
| M4 Max (LPDDR5X) | 546 | 546/4.3 = ~127 tok/s |
+-------------------+----------+-----------------------+
Note: These are THEORETICAL MAXIMUMS. Actual performance depends on
how efficiently the software uses the available bandwidth.
Akunu gets remarkably close to these theoretical limits.
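The table above is just the formula B / S applied per device; a few lines of Python reproduce it (bandwidth figures as quoted above, not measured):

```python
# Bandwidth-ceiling estimate: decode speed is bounded by memory bandwidth
# divided by model size, because every generated token re-reads all weights.

MODEL_GB = 4.3  # Llama 3.1 8B at Q4_0, per the table above

hardware_bw_gbps = {
    "DDR4 laptop": 38,
    "M1 (LPDDR4X)": 68,
    "M4 Pro (LPDDR5X)": 273,
    "M4 Max (LPDDR5X)": 546,
    "RTX 3090 (HBM)": 936,
}

for name, bw in hardware_bw_gbps.items():
    ceiling = bw / MODEL_GB  # theoretical maximum tokens/sec
    print(f"{name:18s} {bw:4d} GB/s -> <= {ceiling:5.1f} tok/s")
```

These are upper bounds; real decode speed also pays for the KV cache, activations, and any bandwidth the kernels fail to saturate.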
Apple Silicon’s advantage here is threefold:
- High-bandwidth LPDDR5X memory. The M4 Max provides 546 GB/s of memory bandwidth, which is in the range of older high-end discrete GPUs.
- No PCIe bottleneck. On an NVIDIA system, even if the GPU has 900+ GB/s of HBM bandwidth, the model weights must first cross the 32-64 GB/s PCIe bus to get from system RAM to VRAM. If the model does not fit in VRAM, you hit the PCIe wall. On Apple Silicon, there is no such bottleneck – the full memory bandwidth is available to the GPU.
- Large memory capacity. The M4 Max supports up to 128GB of unified memory. You can run a 70B model in 4-bit quantization (~35GB) without any model splitting or offloading.
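The ~35GB figure follows from a simple rule of thumb (parameter count times bits per weight, divided by 8), sketched here with an illustrative helper that ignores embeddings, metadata, and KV-cache overhead:

```python
# Rough weight footprint of a quantized model in GB:
# params (billions) * bits-per-weight / 8 bits-per-byte.
# Real files are somewhat larger (scales, metadata, unquantized layers).

def weight_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(f"70B @ 4-bit: ~{weight_gb(70, 4):.0f} GB")  # ~35 GB
print(f"8B  @ 4-bit: ~{weight_gb(8, 4):.0f} GB")   # ~4 GB
```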
The Total-Cost-of-Ownership Argument
There is also a practical economic argument for Apple Silicon inference. An NVIDIA A100 or H100 GPU costs thousands to tens of thousands of dollars (or thousands per month in cloud rental), requires a dedicated server, consumes hundreds of watts, and needs active cooling. A MacBook Pro with an M4 Max chip can run significant models while sitting on your lap, running on battery, making no noise.
For personal use, development, testing, and small-scale deployment, Apple Silicon offers a compelling total cost of ownership that discrete GPU setups cannot match.
What Akunu Exploits
Akunu is designed from the ground up to exploit Apple Silicon’s unique characteristics:
Akunu's Hardware-Aware Design
==============================
Apple Silicon Feature Akunu Exploitation
======================== ================================
Unified Memory (UMA) --> Zero-copy weight loading.
CPU rearranges weights directly
in GPU-accessible buffers.
High memory bandwidth --> Bandwidth-optimized kernels.
GEMV kernels saturate the
memory bus.
SIMD group matrix ops --> Hardware-accelerated matrix
multiplication in GEMM and
attention kernels.
System Level Cache (SLC) --> Kernel tiling tuned to SLC
size per chip generation.
GPU core count varies --> Dispatch parameters tuned per
chip (M1 vs M4 Pro vs M4 Max).
Threadgroup memory --> Cooperative kernels that share
data between SIMD groups via
fast on-chip memory.
A Roadmap of What’s Coming
Let’s close this chapter with a preview of the journey ahead. Here is the conceptual stack we will build up over the course of this book:
The Akunu Stack (bottom to top)
================================
Layer 7: Application
+----------------------------------------------------------+
| HTTP Server | CLI | Swift/Python Bindings |
+----------------------------------------------------------+
Layer 6: Decoding Strategies
+----------------------------------------------------------+
| Greedy | Sampled | Speculative | Grammar-Constrained |
+----------------------------------------------------------+
Layer 5: Inference Pipeline
+----------------------------------------------------------+
| Model Loading | Prefill | Decode Loop | KV Cache |
+----------------------------------------------------------+
Layer 4: Metal Kernels
+----------------------------------------------------------+
| GEMV | GEMM | FlashAttn | RMSNorm | RoPE | Sampling |
+----------------------------------------------------------+
Layer 3: Metal Compute Framework
+----------------------------------------------------------+
| Pipelines | Buffers | Command Encoders | Threadgroups |
+----------------------------------------------------------+
Layer 2: Apple GPU Architecture
+----------------------------------------------------------+
| GPU Cores | SIMD Groups | Tile Memory | Register File |
+----------------------------------------------------------+
Layer 1: Apple Silicon SoC
+----------------------------------------------------------+
| CPU | GPU | Neural Engine | Memory | SLC | Fabric |
+----------------------------------------------------------+
Layer 0: Silicon
+----------------------------------------------------------+
| TSMC 3nm/5nm Process | Transistors | Interconnect |
+----------------------------------------------------------+
We start at Layer 0/1 (this chapter and the next four)
and work our way up to Layer 7.
In the next chapter, we will zoom into Layer 1 and examine the System-on-Chip architecture in detail. We will look at every component on the Apple Silicon die and understand how they work together.
Summary
Let’s recap what we covered in this chapter:
- Apple transitioned from Intel to ARM because Intel’s process technology stagnated, ARM offers better power efficiency, and Apple’s SoC approach enables superior integration.
- The ARM ISA is a RISC architecture with fixed-width 32-bit instructions, 31 general-purpose registers, NEON SIMD extensions, and (on Apple chips) the undocumented AMX matrix coprocessor.
- Apple’s chip design lineage stretches from the A4 (2010) through the A14 (2020) to the M1 and beyond, with key milestones including custom CPU cores (A6), 64-bit ARM (A7), custom GPU (A11), and the Neural Engine (A11).
- Apple Silicon’s key differentiators include SoC integration, unified memory architecture, heterogeneous CPU cores, a custom TBDR GPU, and a dedicated Neural Engine.
- For LLM inference, the critical factors are memory bandwidth (the memory wall), the absence of a PCIe bottleneck (UMA), and the large unified memory pool.
- Akunu exploits UMA for zero-copy weight loading, high bandwidth for saturating GEMV kernels, SIMD group matrix ops for hardware-accelerated matmul, and per-chip tuning for optimal dispatch parameters.
Next up: let’s crack open the chip and see what is inside.
1. Apple. “Apple unleashes M1.” apple.com, November 2020. The original announcement detailing unified memory architecture and performance-per-watt claims. See https://www.apple.com/newsroom/2020/11/apple-unleashes-m1/.
2. Turley, J. “Apple Ignites the ARM Mac.” Microprocessor Report, 2020. Analysis of Apple’s transition from Intel to ARM and the architectural advantages of the M1. See https://www.linleygroup.com/.
3. Hennessy, J. and Patterson, D. “Computer Architecture: A Quantitative Approach.” 6th Edition. The foundational text on RISC vs CISC tradeoffs, pipeline design, and the memory wall. See https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1.