The Apple GPU Family: M1 through M4
Now that we understand the SoC architecture, unified memory, and GPU execution model, let’s survey the actual hardware across Apple Silicon generations. Each generation brought meaningful improvements for ML inference, and akunu’s ChipConfig system tunes its behavior for each.
Generation Overview¹
| Gen | Year | Process | GPU Family | Key ML Feature |
|---|---|---|---|---|
| M1 | 2020 | 5nm | Apple 7 | SIMD group matrix ops, UMA |
| M2 | 2022 | 5nm (2nd gen) | Apple 8 | More bandwidth, cores |
| M3 | 2023 | 3nm | Apple 9 | Dynamic caching, ray tracing HW |
| M4 | 2024 | 3nm (2nd gen) | Apple 9+ | Enhanced ML, higher clocks |
M1 Family (2020-2021)
The M1 was the first Apple Silicon chip for the Mac, and it proved the concept.
M1 Family Specifications:
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│          │ M1       │ M1 Pro   │ M1 Max   │ M1 Ultra │
├──────────┼──────────┼──────────┼──────────┼──────────┤
│ GPU Cores│ 7-8      │ 14-16    │ 24-32    │ 48-64    │
│ CPU Cores│ 8        │ 8-10     │ 10       │ 20       │
│ Max RAM  │ 16 GB    │ 32 GB    │ 64 GB    │ 128 GB   │
│ Mem BW   │ 68 GB/s  │ 200 GB/s │ 400 GB/s │ 800 GB/s │
│ SLC      │ 16 MB    │ 24 MB    │ 48 MB    │ 96 MB    │
│ GPU Fam  │ Apple 7  │ Apple 7  │ Apple 7  │ Apple 7  │
│ FP32 TF  │ 2.6      │ 5.2      │ 10.4     │ 20.8     │
│ FP16 TF  │ 5.2      │ 10.4     │ 20.8     │ 41.6     │
│ NE TOPS  │ 11       │ 11       │ 11       │ 22       │
└──────────┴──────────┴──────────┴──────────┴──────────┘
For inference:
- M1 base: Usable for small models (1-3B Q4). 68 GB/s bandwidth limits decode speed
- M1 Max 64GB: First chip that could comfortably run 13B Q4 models
- M1 Ultra 128GB: Could run 70B Q4 models, albeit slowly
Key capability: Apple GPU Family 7 introduced simdgroup_matrix operations (8×8 matrix tiles), which akunu uses for GEMM kernels during prefill.
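The tile math is easy to sketch host-side. A small Python sketch (the function name is illustrative, not akunu's API) of how many 8×8 tiles a GEMM output requires:

```python
import math

# simdgroup_matrix operates on 8x8 tiles: an M x N GEMM output is
# covered by ceil(M/8) * ceil(N/8) tiles, each accumulated over K in
# 8-wide steps. The 8x8 tile size is the Family 7 figure cited above.
def tile_count(m: int, n: int, tile: int = 8) -> int:
    return math.ceil(m / tile) * math.ceil(n / tile)

# A 512-token prefill against a 4096-wide projection matrix:
tiles = tile_count(512, 4096)  # 64 * 512 = 32768 tiles
```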
M2 Family (2022-2023)
An evolutionary improvement: more cores, more bandwidth, same architecture.
M2 Family Specifications:
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│          │ M2       │ M2 Pro   │ M2 Max   │ M2 Ultra │
├──────────┼──────────┼──────────┼──────────┼──────────┤
│ GPU Cores│ 8-10     │ 16-19    │ 30-38    │ 60-76    │
│ CPU Cores│ 8        │ 10-12    │ 12       │ 24       │
│ Max RAM  │ 24 GB    │ 32 GB    │ 96 GB    │ 192 GB   │
│ Mem BW   │ 100 GB/s │ 200 GB/s │ 400 GB/s │ 800 GB/s │
│ SLC      │ 16 MB    │ 24 MB    │ 48 MB    │ 96 MB    │
│ GPU Fam  │ Apple 8  │ Apple 8  │ Apple 8  │ Apple 8  │
│ FP32 TF  │ 3.6      │ 6.8      │ 13.6     │ 27.2     │
│ FP16 TF  │ 7.2      │ 13.6     │ 27.2     │ 54.4     │
│ NE TOPS  │ 15.8     │ 15.8     │ 15.8     │ 31.6     │
└──────────┴──────────┴──────────┴──────────┴──────────┘
For inference:
- M2 Max 96GB: Sweet spot for 70B Q4 models
- M2 Ultra 192GB: Could run quantized 100B+ models
- ~30% faster than M1 at equivalent tier
M3 Family (2023-2024)
The jump to 3nm brought significant GPU architectural changes.
M3 Family Specifications:
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│          │ M3       │ M3 Pro   │ M3 Max   │ M3 Ultra │
├──────────┼──────────┼──────────┼──────────┼──────────┤
│ GPU Cores│ 8-10     │ 14-18    │ 30-40    │ 60-80    │
│ CPU Cores│ 8        │ 11-12    │ 14-16    │ 28-32    │
│ Max RAM  │ 24 GB    │ 36 GB    │ 128 GB   │ 192 GB   │
│ Mem BW   │ 100 GB/s │ 150 GB/s │ 400 GB/s │ 800 GB/s │
│ SLC      │ 16 MB    │ 24 MB    │ 48 MB    │ 96 MB    │
│ GPU Fam  │ Apple 9  │ Apple 9  │ Apple 9  │ Apple 9  │
│ FP32 TF  │ 4.1      │ 7.4      │ 16.4     │ 32.8     │
│ FP16 TF  │ 8.2      │ 14.8     │ 32.8     │ 65.6     │
│ NE TOPS  │ 18       │ 18       │ 18       │ 36       │
└──────────┴──────────┴──────────┴──────────┴──────────┘
M3’s GPU innovations:
- Dynamic Caching: the GPU allocates register and threadgroup memory dynamically per kernel at runtime, instead of statically at compile time. This improves occupancy for kernels that don’t use their worst-case allocation.
- Ray Tracing Hardware: Not relevant for inference, but shows architectural maturation
- Mesh Shading: Graphics feature, not relevant here
- Apple GPU Family 9: same SIMD group matrix ops as Family 7, but with improved scheduling
M4 Family (2024-2025)
The latest generation, optimized for AI workloads.
M4 Family Specifications:
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│          │ M4       │ M4 Pro   │ M4 Max   │ M4 Ultra │
├──────────┼──────────┼──────────┼──────────┼──────────┤
│ GPU Cores│ 10       │ 20       │ 40       │ 80       │
│ CPU Cores│ 4P+6E    │ 10P+4E   │ 12P+4E   │ 24P+8E   │
│ Max RAM  │ 32 GB    │ 48 GB    │ 128 GB   │ 192 GB   │
│ Mem BW   │ 120 GB/s │ 273 GB/s │ 546 GB/s │ 819 GB/s │
│ SLC      │ 16 MB    │ 36 MB    │ 48 MB    │ 96 MB    │
│ GPU Fam  │ Apple 9  │ Apple 9  │ Apple 9  │ Apple 9  │
│ FP32 TF  │ 4.6      │ 9.2      │ 18.4     │ 36.8     │
│ FP16 TF  │ 9.2      │ 18.4     │ 36.8     │ 73.6     │
│ NE TOPS  │ 38       │ 38       │ 38       │ 76       │
└──────────┴──────────┴──────────┴──────────┴──────────┘
M4 highlights for inference:
- Significantly higher memory bandwidth at each tier (M4 Pro: 273 vs M3 Pro: 150 GB/s)
- Larger SLC (M4 Pro: 36 MB vs M3 Pro: 24 MB)
- More GPU cores per tier
- Enhanced Neural Engine (38 TOPS)
Inference Performance by Chip
Here are theoretical decode speeds for common model sizes:
Tokens/sec (theoretical max, Q4_0 quantization):
Model Size:     3B (~1.7GB)   7B (~3.8GB)   13B (~7.3GB)   70B (~40GB)
──────────────────────────────────────────────────────────────────────
M4                   70            31             16            —*
M4 Pro              160            71             37            6.8
M4 Max              321           143             74           13.6
M4 Ultra            481           215            112           20.4
──────────────────────────────────────────────────────────────────────
*Doesn't fit in 32 GB RAM
Actual performance is typically 60-80% of theoretical, due to cache misses, dispatch overhead, and non-ideal memory access patterns²; akunu’s tuned kernels usually reach 70-85% of theoretical bandwidth utilization.
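The ceilings in the table above reduce to a single division: decoding one token reads every weight once, so tokens/sec is bounded by bandwidth over model size. A Python sketch (function names and the 0.75 default are illustrative):

```python
# Decode is bandwidth-bound: each generated token reads every weight
# once, so the theoretical ceiling is bandwidth divided by model size.
def theoretical_tokens_per_sec(bandwidth_gbps: float, model_gb: float) -> float:
    return bandwidth_gbps / model_gb

# Scale by an achieved-utilization factor (70-85% for a tuned engine).
def expected_tokens_per_sec(bandwidth_gbps: float, model_gb: float,
                            utilization: float = 0.75) -> float:
    return theoretical_tokens_per_sec(bandwidth_gbps, model_gb) * utilization

# M4 Pro (273 GB/s) with a 7B Q4_0 model (~3.8 GB):
ceiling = theoretical_tokens_per_sec(273, 3.8)  # ~71.8 tok/s
```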
Akunu’s ChipConfig
Akunu detects the hardware at startup and selects tuning parameters via the ChipConfig struct. Here’s how different chips get different configurations:
ChipConfig Parameters:
┌────────────────────┬─────────────────────────────────────┐
│ Parameter          │ Purpose                             │
├────────────────────┼─────────────────────────────────────┤
│ slc_size           │ SLC size in bytes. Affects tile     │
│                    │ sizes and prefetch strategies.      │
├────────────────────┼─────────────────────────────────────┤
│ gemv_k_threshold   │ K dimension where GEMV switches     │
│                    │ from small (4 SGs) to large (8 SGs) │
│                    │ variant. Higher bandwidth chips can │
│                    │ benefit from more parallelism.      │
├────────────────────┼─────────────────────────────────────┤
│ prefill_chunk      │ Max tokens per prefill batch.       │
│                    │ Limited by threadgroup memory for   │
│                    │ attention; larger chips handle more.│
├────────────────────┼─────────────────────────────────────┤
│ chain_decode_count │ Tokens per chained decode GPU       │
│                    │ submission. More cores → more tokens│
│                    │ can be chained without overhead.    │
└────────────────────┴─────────────────────────────────────┘
Example configurations:
M4 (10 GPU cores, 120 GB/s, 16 MB SLC):
gemv_k_threshold = 2048
prefill_chunk = 512
chain_decode_count = 4
M4 Pro (20 GPU cores, 273 GB/s, 36 MB SLC):
gemv_k_threshold = 2048
prefill_chunk = 2048
chain_decode_count = 6
M4 Max (40 GPU cores, 546 GB/s, 48 MB SLC):
gemv_k_threshold = 4096
prefill_chunk = 4096
chain_decode_count = 8
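Sketched as code, the configurations above might look like the following. This is a Python sketch: the field names follow the table, but the struct layout and name-keyed lookup are assumptions, not akunu's actual source (real detection would query the hardware rather than a string key):

```python
from dataclasses import dataclass

# Per-chip tuning parameters, mirroring the ChipConfig table above.
@dataclass(frozen=True)
class ChipConfig:
    slc_size: int            # SLC size in bytes
    gemv_k_threshold: int    # K where GEMV switches to the 8-SG variant
    prefill_chunk: int       # max tokens per prefill batch
    chain_decode_count: int  # tokens per chained decode GPU submission

CONFIGS = {
    "M4":     ChipConfig(slc_size=16 << 20, gemv_k_threshold=2048,
                         prefill_chunk=512,  chain_decode_count=4),
    "M4 Pro": ChipConfig(slc_size=36 << 20, gemv_k_threshold=2048,
                         prefill_chunk=2048, chain_decode_count=6),
    "M4 Max": ChipConfig(slc_size=48 << 20, gemv_k_threshold=4096,
                         prefill_chunk=4096, chain_decode_count=8),
}
```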
The key insight: the same kernels run on all chips, but the dispatch table uses different configurations. The GEMV kernel for Q4_0 on an M4 uses 4 SIMD groups (128 threads), while on an M4 Max it might use 8 SIMD groups (256 threads) for K dimensions above the threshold.
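That dispatch decision is a pure function of K and the chip's threshold. A Python sketch (names are illustrative):

```python
SIMD_WIDTH = 32  # threads per SIMD group on Apple GPUs

# 4 SIMD groups (128 threads) for small K, 8 SIMD groups (256 threads)
# once K exceeds the chip's gemv_k_threshold.
def gemv_simdgroups(k: int, threshold: int) -> int:
    return 8 if k > threshold else 4

# On an M4 Max (threshold 4096), K = 8192 selects the large variant:
groups = gemv_simdgroups(8192, threshold=4096)   # 8
threads = groups * SIMD_WIDTH                    # 256
```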
Choosing the Right Hardware for Your Workload
┌─────────────────────────────────────────────────────────────┐
│                  MODEL SIZE vs CHIP GUIDE                   │
│                                                             │
│  Model Size   Recommended Minimum    Sweet Spot             │
│  ──────────   ───────────────────    ──────────             │
│  1-3B         M4 (16 GB)             M4 Pro (24 GB)         │
│  7B           M4 Pro (24 GB)         M4 Pro (48 GB)         │
│  13B          M4 Pro (48 GB)         M4 Max (64 GB)         │
│  34B          M4 Max (64 GB)         M4 Max (128 GB)        │
│  70B          M4 Max (128 GB)        M4 Ultra (192 GB)      │
│  100B+        M4 Ultra (192 GB)      —                      │
│                                                             │
│  Rule of thumb: you need ~1.2x the Q4 model size as RAM     │
│  (weights + KV cache + scratch buffers + OS overhead)       │
└─────────────────────────────────────────────────────────────┘
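The rule of thumb translates directly into arithmetic. A Python sketch (the ~0.5625 bytes/parameter figure is inferred from the Q4 model sizes quoted in this section; function names are hypothetical):

```python
# Q4_0 stores roughly 4.5 bits (0.5625 bytes) per parameter, consistent
# with the sizes above (13B * 0.5625 ≈ 7.3 GB, 70B * 0.5625 ≈ 39 GB).
def q4_size_gb(params_billions: float) -> float:
    return params_billions * 0.5625

# RAM budget: ~1.2x the model size covers weights, KV cache,
# scratch buffers, and OS overhead.
def min_ram_gb(model_gb: float) -> float:
    return model_gb * 1.2

# A 70B Q4 model (~40 GB) wants roughly 48 GB of RAM.
needed = min_ram_gb(40)
```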
Summary
Across four generations:
- M1 introduced UMA and SIMD group matrix ops — the foundation
- M2 increased bandwidth and core counts — evolutionary improvement
- M3 added dynamic caching and 3nm efficiency — architectural refinement
- M4 pushed bandwidth dramatically higher — the best for inference
Akunu’s ChipConfig abstracts these differences into tuning parameters, so the same codebase runs optimally on everything from an M1 MacBook Air to an M4 Ultra Mac Studio. The key variables are always the same: memory bandwidth, GPU core count, and SLC size.
In the next part, we’ll learn how to actually program these GPUs using the Metal framework.
¹ Apple. “Apple M1 chip”, “Apple M2 chip”, “Apple M3 chip”, “Apple M4 chip.” apple.com, 2020-2024. Official specifications for each chip generation. See https://www.apple.com/newsroom/2024/05/apple-introduces-m4-chip/. ↩
² Frumusanu, A. and Smith, R. “Apple M-series deep dives.” AnandTech, 2020-2023. Independent benchmarks and die analysis across generations. See https://chipsandcheese.com/. ↩