The GPU: Cores, SIMD Groups, and Threadgroups
This is the most important chapter for understanding akunu’s Metal kernels. Every kernel in akunu — every GEMV, every FlashAttention variant, every normalization — is designed around the execution model we’re about to explore. If you’ve worked with CUDA, many concepts will feel familiar, but the terminology and some architectural details differ. If you haven’t, don’t worry — we’ll build up from first principles.
The Big Picture: Why GPUs?
A CPU is designed to execute a single thread of complex instructions as fast as possible. It has deep pipelines, branch prediction, out-of-order execution, and large caches. A single CPU core might run at 4+ GHz but can only do one or two multiply-accumulate operations per clock.
A GPU takes the opposite approach. It trades single-thread performance for massive parallelism. Each individual “thread” is simpler and slower, but there are thousands of them running simultaneously.
CPU (M4 Pro Performance Core):
┌──────────────────────────────────┐
│ Complex OoO pipeline │
│ Branch predictor │
│ 192 KB L1I + 128 KB L1D │
│ ~4.5 GHz │
│ 1-2 FMA per clock │
│ ≈ 9 GFLOPS FP32 per core │
└──────────────────────────────────┘
× 10 P-cores = ~90 GFLOPS
GPU (M4 Pro, 20 cores):
┌──────────┐┌──────────┐┌──────────┐ ... ┌──────────┐
│ Core 0 ││ Core 1 ││ Core 2 │ │ Core 19 │
│ 128 ALUs││ 128 ALUs││ 128 ALUs│ │ 128 ALUs│
│ ~1.5 GHz ││ ~1.5 GHz ││ ~1.5 GHz │ │ ~1.5 GHz │
└──────────┘└──────────┘└──────────┘ └──────────┘
20 cores × 128 ALUs × 2 FMA × 1.5 GHz ≈ 7,680 GFLOPS FP32
FP16: ~15,360 GFLOPS (2x throughput)
For matrix multiplication — the core operation in neural networks — you need to do the same thing (multiply and add) billions of times with different data. GPUs are purpose-built for exactly this.
GPU Core Architecture
Each Apple GPU core is a self-contained processing unit. Let’s look inside one:
┌─────────────────────────────────────────────────────────┐
│ GPU CORE │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Execution Units │ │
│ │ │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │
│ │ │ ALU×32 │ │ ALU×32 │ │ ALU×32 │ │ ALU×32 │ │ │
│ │ │ (SG 0) │ │ (SG 1) │ │ (SG 2) │ │ (SG 3) │ │ │
│ │ └────────┘ └────────┘ └────────┘ └────────┘ │ │
│ │ SIMD SIMD SIMD SIMD │ │
│ │ Group 0 Group 1 Group 2 Group 3 │ │
│ │ │ │
│ │ Up to 32 SIMD groups can be resident │ │
│ │ (1024 threads max per core) │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────┐ ┌──────────────────────────────┐ │
│ │ Register File │ │ Threadgroup Memory (32 KB) │ │
│ │ (per thread) │ │ (shared within threadgroup)│ │
│ └─────────────────┘ └──────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ L1 Cache / Tile Memory │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ L2 Cache (shared across cores) │
└─────────────────────────────────────────────────────────┘
Key facts:
- Each core has 128 ALUs organized into groups of 32¹
- Each group of 32 ALUs forms a SIMD group (Apple’s term for what NVIDIA calls a “warp”)
- A core can have up to 1024 threads resident (32 SIMD groups × 32 threads)
- 32 KB of threadgroup memory (what NVIDIA calls “shared memory”)
The SIMD Group: 32 Threads in Lockstep
The SIMD group is the fundamental unit of execution on the GPU. It’s a group of 32 threads that execute the exact same instruction at the exact same time, but on different data. This is called SIMT (Single Instruction, Multiple Threads).
SIMD Group (32 threads):
┌──────────────────────────────────────────────────────┐
│ Thread 0 Thread 1 Thread 2 ... Thread 31 │
│ │
│ All execute: result = a[tid] * b[tid] + c[tid] │
│ │
│ t0: a[0]*b[0]+c[0] │
│ t1: a[1]*b[1]+c[1] │
│ t2: a[2]*b[2]+c[2] │
│ ... │
│ t31: a[31]*b[31]+c[31] │
│ │
│ ALL happen in ONE clock cycle (on 32 ALUs) │
└──────────────────────────────────────────────────────┘
This has profound implications:
- Branch divergence: there is no penalty if all 32 threads take the same branch. But if thread 0 takes the `if` path and thread 1 takes the `else` path, the SIMD group must execute BOTH paths (with inactive threads masked out). This wastes cycles.
- Memory coalescing: when all 32 threads access consecutive memory addresses, the hardware can combine them into a single wide memory transaction. Random access patterns are much slower.
- SIMD intrinsics: threads within a SIMD group can communicate directly via `simd_sum`, `simd_max`, `simd_shuffle`, etc. These are essentially free — no memory access needed.
SIMD Group Communication
This is one of the most powerful features for reduction operations (like computing a dot product). Threads in a SIMD group can share data without going through memory:
simd_sum example (sum 32 values across a SIMD group):
Thread: 0 1 2 3 ... 31
Value: 1.0 2.0 3.0 4.0 ... 32.0
simd_sum → Every thread gets: 528.0
No shared memory needed! Hardware does this in ~5 cycles (log2(32) shuffle steps).
simd_shuffle_down (shift values):
Thread: 0 1 2 3 4 5 ...
Before: a0 a1 a2 a3 a4 a5 ...
After (offset=2): a2 a3 a4 a5 a6 a7 ...
simd_shuffle_xor (butterfly pattern):
Thread: 0 1 2 3 4 5 6 7
XOR(1): a1 a0 a3 a2 a5 a4 a7 a6
XOR(2): a2 a3 a0 a1 a6 a7 a4 a5
XOR(4): a4 a5 a6 a7 a0 a1 a2 a3
Akunu uses these extensively. For example, in GEMV kernels, each thread in a SIMD group computes a partial dot product, then simd_sum combines them. In FlashAttention decode, simd_sum computes the Q·K dot product across threads that each hold different head dimensions.
The Threadgroup
A threadgroup is a collection of threads that:
- Execute on the same GPU core
- Can share threadgroup memory (fast on-chip SRAM)
- Can synchronize with threadgroup_barrier()
Threadgroup (128 threads = 4 SIMD groups):
┌─────────────────────────────────────────────────────────┐
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ SIMD Group 0│ │ SIMD Group 1│ │
│ │ threads 0-31│ │ threads 32-63│ │
│ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ SIMD Group 2│ │ SIMD Group 3│ │
│ │ threads 64-95│ │threads 96-127│ │
│ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Threadgroup Memory (up to 32 KB) │ │
│ │ Accessible by ALL threads in this threadgroup │ │
│ │ ~1 cycle latency (vs ~100+ cycles for DRAM) │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ threadgroup_barrier(mem_flags::mem_threadgroup) │
│ ↑ Ensures all threads have finished their writes │
│ before any thread reads │
└─────────────────────────────────────────────────────────┘
Common threadgroup sizes in akunu:²
- 32 threads (1 SG): FlashAttention decode fast — no barriers needed!
- 128 threads (4 SGs): GEMV, GEMM, RMSNorm, RoPE
- 256 threads (8 SGs): Large GEMV variants
- 1024 threads (32 SGs): FlashAttention decode parallel, Gumbel-max sampling
Why Threadgroup Memory Matters
Threadgroup memory is like a programmer-managed L1 cache. It’s fast (~1 cycle) but small (32 KB). The classic use is tiled matrix multiply:
Without threadgroup memory:
Thread 0 reads A[0,0] from DRAM (100+ cycles)
Thread 1 reads A[0,0] from DRAM (100+ cycles) — SAME DATA!
Thread 2 reads A[0,0] from DRAM — again!
... massive bandwidth waste
With threadgroup memory:
1. Thread 0 loads A[0,0..31] into shared[0..31]
2. threadgroup_barrier() — wait for all loads
3. ALL threads read from shared[] (1 cycle each)
4. One DRAM fetch serves 32 threads
Akunu’s GEMM kernel (simd_gemm_f16.metal, covered in Chapter 35) allocates 6144 bytes of threadgroup memory: sa[4096] for weight tiles and sb[2048] for activation tiles. This lets 128 threads share weight data loaded once from DRAM.
The Execution Hierarchy
This is the most important mental model for GPU programming — every kernel you write maps onto this hierarchy.
Thread identification variables (used in every kernel):
kernel void my_kernel(
uint tid [[thread_position_in_grid]], // Global thread ID
uint tgid [[threadgroup_position_in_grid]], // Which threadgroup
uint tiisg [[thread_index_in_simdgroup]], // Lane within SG (0-31)
uint sgitg [[simdgroup_index_in_threadgroup]], // Which SG in TG (0-N)
uint tiitg [[thread_position_in_threadgroup]] // Thread within TG
) { ... }
CUDA vs Metal Terminology
If you’re coming from CUDA, here’s the translation table:
| CUDA Concept | Metal Concept | Size |
|---|---|---|
| Warp | SIMD Group | 32 threads |
| Block | Threadgroup | Variable (32-1024) |
| Grid | Grid | Variable |
| Shared Memory | Threadgroup Memory | Up to 32 KB |
| `__syncthreads()` | `threadgroup_barrier()` | — |
| `__shfl_sync()` | `simd_shuffle()` | — |
| `warpSize` | Always 32 | 32 |
| `blockDim` | `threads_per_threadgroup` | — |
| `threadIdx` | `thread_position_in_threadgroup` | — |
| `blockIdx` | `threadgroup_position_in_grid` | — |
| `wmma` (Tensor Cores) | `simdgroup_matrix` | 8×8 tiles |
Key differences:
- No warp divergence penalty tracking in Metal³ — the hardware handles it, but you should still avoid divergence
- SIMD group matrix operations use 8×8 tiles (vs NVIDIA’s 16×16 Tensor Core tiles)
- Threadgroup memory is explicitly sized per dispatch, not declared with `__shared__`
- No dynamic parallelism — kernels can’t launch other kernels
Occupancy and Register Pressure
Each GPU core has a fixed amount of resources. The more resources a threadgroup uses, the fewer threadgroups can be resident simultaneously:
GPU Core Resources:
┌────────────────────────────────┐
│ Register File: ~32K regs │
│ Threadgroup Mem: 32 KB │
│ Max Threads: 1024 │
│ Max SGs: 32 │
└────────────────────────────────┘
Example: Kernel uses 32 registers per thread, 4096 bytes TG mem
Threads per TG: 128
Registers per TG: 128 × 32 = 4096
TG memory per TG: 4096 bytes
Max concurrent TGs: min(
32K / 4096 = 8 (register limited),
32K / 4096 = 8 (TG mem limited),
1024 / 128 = 8 (thread limited)
) = 8 threadgroups, 1024 threads
Occupancy: 1024 / 1024 = 100%
High occupancy means more threads to hide memory latency. When one SIMD group is waiting for a memory fetch (100+ cycles), the core can switch to another SIMD group that has work ready. This is how GPUs tolerate high memory latency — they never run out of work.
Dispatch Models
Metal offers two dispatch models:
dispatchThreadgroups(gridSize, threadgroupSize):
- You specify how many threadgroups to launch and their size
- Total threads = gridSize × threadgroupSize
- May launch more threads than needed (you handle bounds checking)
dispatchThreads(totalThreads, threadgroupSize):
- You specify the total number of threads needed
- Metal handles partial threadgroups at the edges
- Cleaner for simple 1:1 thread-to-data mappings
Akunu’s DispatchCmd has a use_dispatch_threads flag that selects between these. Most kernels use dispatchThreadgroups for precise control.
The Tile-Based Deferred Renderer (TBDR)
Apple’s GPU was originally designed for mobile graphics, which uses a Tile-Based Deferred Rendering architecture. In graphics, the screen is divided into tiles and each tile is rendered entirely in fast tile memory before writing to DRAM.
For compute shaders (which akunu uses exclusively), TBDR doesn’t apply directly. Compute shaders bypass the tiling hardware and operate like a traditional GPU compute model. However, the tile memory architecture means:
- The GPU has fast on-chip storage (threadgroup memory)
- Memory access patterns that fit in cache lines are rewarded
- The GPU is efficient at processing data in blocks/tiles
This is why akunu’s kernels are organized around tiles: 32-element K-blocks in GEMV, 32×64 output tiles in GEMM, 32-position KV tiles in FlashAttention.
Summary
The Apple GPU execution model is:
- Grid dispatches many threadgroups
- Each threadgroup runs on one GPU core
- Each threadgroup contains multiple SIMD groups of 32 threads
- Threads within a SIMD group execute in lockstep and communicate via SIMD intrinsics
- Threads within a threadgroup share threadgroup memory and synchronize via barriers
- High occupancy hides memory latency
Akunu’s kernels are designed around this hierarchy: SIMD-level reductions for dot products, threadgroup-level cooperation for tiled matrix multiply, and grid-level parallelism across output rows and attention heads.
Next, we’ll look at the unified memory architecture that makes Apple Silicon uniquely suited for inference workloads.
1. Grinberg, D. “Reverse-engineering Apple GPU cores.” Asahi Linux project, 2022. The most detailed public analysis of Apple GPU core internals, SIMD group behavior, and threadgroup memory layout. See https://dougallj.github.io/applegpu/. ↩
2. Apple. “Metal Best Practices Guide.” developer.apple.com. Official guidance on threadgroup sizing, occupancy, and memory access patterns for Metal compute shaders. See https://developer.apple.com/library/archive/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/index.html. ↩
3. Apple. “Metal Feature Set Tables.” developer.apple.com. Defines GPU family capabilities including simdgroup_matrix support (Family 7+), max threadgroup size, and threadgroup memory limits. See https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf. ↩