The GPU: Cores, SIMD Groups, and Threadgroups
This is the most important chapter for understanding akunu’s Metal kernels. Every kernel in akunu — every GEMV, every FlashAttention variant, every normalization — is designed around the execution model we’re about to explore. If you’ve worked with CUDA, many concepts will feel familiar, but the terminology and some architectural details differ. If you haven’t, don’t worry — we’ll build up from first principles.
The Big Picture: Why GPUs?
A CPU is designed to execute a single thread of complex instructions as fast as possible. It has deep pipelines, branch prediction, out-of-order execution, and large caches. A single CPU core might run at 4+ GHz but can only do one or two multiply-accumulate operations per clock.
A GPU takes the opposite approach. It trades single-thread performance for massive parallelism. Each individual “thread” is simpler and slower, but there are thousands of them running simultaneously.
CPU (M4 Pro Performance Core):
┌──────────────────────────────────┐
│ Complex OoO pipeline │
│ Branch predictor │
│ 192 KB L1I + 128 KB L1D │
│ ~4.5 GHz │
│ 1-2 FMA per clock │
│ ≈ 9 GFLOPS FP32 per core │
└──────────────────────────────────┘
× 10 P-cores = ~90 GFLOPS
GPU (M4 Pro, 20 cores):
┌──────────┐┌──────────┐┌──────────┐ ... ┌──────────┐
│ Core 0 ││ Core 1 ││ Core 2 │ │ Core 19 │
│ 128 ALUs││ 128 ALUs││ 128 ALUs│ │ 128 ALUs│
│ ~1.5 GHz ││ ~1.5 GHz ││ ~1.5 GHz │ │ ~1.5 GHz │
└──────────┘└──────────┘└──────────┘ └──────────┘
20 cores × 128 ALUs × 2 FMA × 1.5 GHz ≈ 7,680 GFLOPS FP32
FP16: ~15,360 GFLOPS (2x throughput)
For matrix multiplication — the core operation in neural networks — you need to do the same thing (multiply and add) billions of times with different data. GPUs are purpose-built for exactly this.
GPU Core Architecture
Each Apple GPU core is a self-contained processing unit. Let’s look inside one:
┌─────────────────────────────────────────────────────────┐
│ GPU CORE │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Execution Units │ │
│ │ │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │
│ │ │ ALU×32 │ │ ALU×32 │ │ ALU×32 │ │ ALU×32 │ │ │
│ │ │ (SG 0) │ │ (SG 1) │ │ (SG 2) │ │ (SG 3) │ │ │
│ │ └────────┘ └────────┘ └────────┘ └────────┘ │ │
│ │ SIMD SIMD SIMD SIMD │ │
│ │ Group 0 Group 1 Group 2 Group 3 │ │
│ │ │ │
│ │ Up to 32 SIMD groups can be resident │ │
│ │ (1024 threads max per core) │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────┐ ┌──────────────────────────────┐ │
│ │ Register File │ │ Threadgroup Memory (32 KB) │ │
│ │ (per thread) │ │ (shared within threadgroup)│ │
│ └─────────────────┘ └──────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ L1 Cache / Tile Memory │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ L2 Cache (shared across cores) │
└─────────────────────────────────────────────────────────┘
Key facts:
- Each core has 128 ALUs organized into groups of 32¹
- Each group of 32 ALUs forms a SIMD group (Apple’s term for what NVIDIA calls a “warp”)
- A core can have up to 1024 threads resident (32 SIMD groups × 32 threads)
- 32 KB of threadgroup memory (what NVIDIA calls “shared memory”)
The SIMD Group: 32 Threads in Lockstep
The SIMD group is the fundamental unit of execution on the GPU. It’s a group of 32 threads that execute the exact same instruction at the exact same time, but on different data. This is called SIMT (Single Instruction, Multiple Threads).
SIMD Group (32 threads):
┌──────────────────────────────────────────────────────┐
│ Thread 0 Thread 1 Thread 2 ... Thread 31 │
│ │
│ All execute: result = a[tid] * b[tid] + c[tid] │
│ │
│ t0: a[0]*b[0]+c[0] │
│ t1: a[1]*b[1]+c[1] │
│ t2: a[2]*b[2]+c[2] │
│ ... │
│ t31: a[31]*b[31]+c[31] │
│ │
│ ALL happen in ONE clock cycle (on 32 ALUs) │
└──────────────────────────────────────────────────────┘
This has profound implications:
- Branch divergence: there is no penalty if all 32 threads take the same branch. But if thread 0 takes the `if` path and thread 1 takes the `else` path, the SIMD group must execute BOTH paths (with inactive threads masked out). This wastes cycles.
- Memory coalescing: when all 32 threads access consecutive memory addresses, the hardware can combine them into a single wide memory transaction. Random access patterns are much slower.
- SIMD intrinsics: threads within a SIMD group can communicate directly via `simd_sum`, `simd_max`, `simd_shuffle`, etc. These are essentially free — no memory access needed.
SIMD Group Communication
This is one of the most powerful features for reduction operations (like computing a dot product). Threads in a SIMD group can share data without going through memory:
simd_sum example (sum 32 values across a SIMD group):
Thread: 0 1 2 3 ... 31
Value: 1.0 2.0 3.0 4.0 ... 32.0
simd_sum → Every thread gets: 528.0
No shared memory needed! Hardware does this in ~5 cycles (log2(32) shuffle steps).
simd_shuffle_down (shift values):
Thread: 0 1 2 3 4 5 ...
Before: a0 a1 a2 a3 a4 a5 ...
After (offset=2): a2 a3 a4 a5 a6 a7 ...
simd_shuffle_xor (butterfly pattern):
Thread: 0 1 2 3 4 5 6 7
XOR(1): a1 a0 a3 a2 a5 a4 a7 a6
XOR(2): a2 a3 a0 a1 a6 a7 a4 a5
XOR(4): a4 a5 a6 a7 a0 a1 a2 a3
Akunu uses these extensively. For example, in GEMV kernels, each thread in a SIMD group computes a partial dot product, then simd_sum combines them. In FlashAttention decode, simd_sum computes the Q·K dot product across threads that each hold different head dimensions.
The Threadgroup
A threadgroup is a collection of threads that:
- Execute on the same GPU core
- Can share threadgroup memory (fast on-chip SRAM)
- Can synchronize with threadgroup_barrier()
Threadgroup (128 threads = 4 SIMD groups):
┌─────────────────────────────────────────────────────────┐
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ SIMD Group 0│ │ SIMD Group 1│ │
│ │ threads 0-31│ │ threads 32-63│ │
│ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ SIMD Group 2│ │ SIMD Group 3│ │
│ │ threads 64-95│ │threads 96-127│ │
│ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Threadgroup Memory (up to 32 KB) │ │
│ │ Accessible by ALL threads in this threadgroup │ │
│ │ ~1 cycle latency (vs ~100+ cycles for DRAM) │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ threadgroup_barrier(mem_flags::mem_threadgroup) │
│ ↑ Ensures all threads have finished their writes │
│ before any thread reads │
└─────────────────────────────────────────────────────────┘
Common threadgroup sizes in akunu:²
- 32 threads (1 SG): FlashAttention decode fast — no barriers needed!
- 128 threads (4 SGs): GEMV, GEMM, RMSNorm, RoPE
- 256 threads (8 SGs): Large GEMV variants
- 1024 threads (32 SGs): FlashAttention decode parallel, Gumbel-max sampling
Why Threadgroup Memory Matters
Threadgroup memory is like a programmer-managed L1 cache. It’s fast (~1 cycle) but small (32 KB). The classic use is tiled matrix multiply:
Without threadgroup memory:
Thread 0 reads A[0,0] from DRAM (100+ cycles)
Thread 1 reads A[0,0] from DRAM (100+ cycles) — SAME DATA!
Thread 2 reads A[0,0] from DRAM — again!
... massive bandwidth waste
With threadgroup memory:
1. Thread 0 loads A[0,0..31] into shared[0..31]
2. threadgroup_barrier() — wait for all loads
3. ALL threads read from shared[] (1 cycle each)
4. One DRAM fetch serves 32 threads
Akunu’s GEMM kernel (simd_gemm_f16.metal, covered in Chapter 35) allocates 6144 bytes of threadgroup memory: sa[4096] for weight tiles and sb[2048] for activation tiles. This lets 128 threads share weight data loaded once from DRAM.
The Execution Hierarchy
This is the most important mental model for GPU programming — every kernel you write maps onto this hierarchy.
Thread identification variables (used in every kernel):
kernel void my_kernel(
uint tid [[thread_position_in_grid]], // Global thread ID
uint tgid [[threadgroup_position_in_grid]], // Which threadgroup
uint tiisg [[thread_index_in_simdgroup]], // Lane within SG (0-31)
uint sgitg [[simdgroup_index_in_threadgroup]], // Which SG in TG (0-N)
uint tiitg [[thread_position_in_threadgroup]] // Thread within TG
) { ... }
CUDA vs Metal Terminology
If you’re coming from CUDA, here’s the translation table:
| CUDA Concept | Metal Concept | Size |
|---|---|---|
| Warp | SIMD Group | 32 threads |
| Block | Threadgroup | Variable (32-1024) |
| Grid | Grid | Variable |
| Shared Memory | Threadgroup Memory | Up to 32 KB |
| `__syncthreads()` | `threadgroup_barrier()` | — |
| `__shfl_sync()` | `simd_shuffle()` | — |
| `warpSize` | Always 32 | 32 |
| `blockDim` | `threads_per_threadgroup` | — |
| `threadIdx` | `thread_position_in_threadgroup` | — |
| `blockIdx` | `threadgroup_position_in_grid` | — |
| `wmma` (Tensor Cores) | `simdgroup_matrix` | 8×8 tiles |
Key differences:
- No warp divergence penalty tracking in Metal³ — the hardware handles it, but you should still avoid divergence
- SIMD group matrix operations use 8×8 tiles (vs NVIDIA’s 16×16 Tensor Core tiles)
- Threadgroup memory is explicitly sized per dispatch, not declared with `__shared__`
- No dynamic parallelism — kernels can’t launch other kernels
Occupancy and Register Pressure
Each GPU core has a fixed amount of resources. The more resources a threadgroup uses, the fewer threadgroups can be resident simultaneously:
GPU Core Resources:
┌────────────────────────────────┐
│ Register File: ~32K regs │
│ Threadgroup Mem: 32 KB │
│ Max Threads: 1024 │
│ Max SGs: 32 │
└────────────────────────────────┘
Example: Kernel uses 32 registers per thread, 4096 bytes TG mem
Threads per TG: 128
Registers per TG: 128 × 32 = 4096
TG memory per TG: 4096 bytes
Max concurrent TGs: min(
32K / 4096 = 8 (register limited),
32K / 4096 = 8 (TG mem limited),
1024 / 128 = 8 (thread limited)
) = 8 threadgroups, 1024 threads
Occupancy: 1024 / 1024 = 100%
High occupancy means more threads to hide memory latency. When one SIMD group is waiting for a memory fetch (100+ cycles), the core can switch to another SIMD group that has work ready. This is how GPUs tolerate high memory latency — they never run out of work.
Dispatch Models
Metal offers two dispatch models:
dispatchThreadgroups(gridSize, threadgroupSize):
- You specify how many threadgroups to launch and their size
- Total threads = gridSize × threadgroupSize
- May launch more threads than needed (you handle bounds checking)
dispatchThreads(totalThreads, threadgroupSize):
- You specify the total number of threads needed
- Metal handles partial threadgroups at the edges
- Cleaner for simple 1:1 thread-to-data mappings
Akunu’s DispatchCmd has a use_dispatch_threads flag that selects between these. Most kernels use dispatchThreadgroups for precise control.
The Tile-Based Deferred Renderer (TBDR)
Apple’s GPU was originally designed for mobile graphics, which uses a Tile-Based Deferred Rendering architecture. In graphics, the screen is divided into tiles and each tile is rendered entirely in fast tile memory before writing to DRAM.
For compute shaders (which akunu uses exclusively), TBDR doesn’t apply directly. Compute shaders bypass the tiling hardware and operate like a traditional GPU compute model. However, the tile memory architecture means:
- The GPU has fast on-chip storage (threadgroup memory)
- Memory access patterns that fit in cache lines are rewarded
- The GPU is efficient at processing data in blocks/tiles
This is why akunu’s kernels are organized around tiles: 32-element K-blocks in GEMV, 32×64 output tiles in GEMM, 32-position KV tiles in FlashAttention.
Summary
The Apple GPU execution model is:
- Grid dispatches many threadgroups
- Each threadgroup runs on one GPU core
- Each threadgroup contains multiple SIMD groups of 32 threads
- Threads within a SIMD group execute in lockstep and communicate via SIMD intrinsics
- Threads within a threadgroup share threadgroup memory and synchronize via barriers
- High occupancy hides memory latency
Akunu’s kernels are designed around this hierarchy: SIMD-level reductions for dot products, threadgroup-level cooperation for tiled matrix multiply, and grid-level parallelism across output rows and attention heads.
Next, we’ll look at the unified memory architecture that makes Apple Silicon uniquely suited for inference workloads.
1. Grinberg, D. “Reverse-engineering Apple GPU cores.” Asahi Linux project, 2022. The most detailed public analysis of Apple GPU core internals, SIMD group behavior, and threadgroup memory layout. See https://dougallj.github.io/applegpu/. ↩
2. Apple. “Metal Best Practices Guide.” developer.apple.com. Official guidance on threadgroup sizing, occupancy, and memory access patterns for Metal compute shaders. See https://developer.apple.com/library/archive/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/index.html. ↩
3. Apple. “Metal Feature Set Tables.” developer.apple.com. Defines GPU family capabilities including simdgroup_matrix support (Family 7+), max threadgroup size, and threadgroup memory limits. See https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf. ↩