The Apple GPU Family: M1 through M4
Now that we understand the SoC architecture, unified memory, and GPU execution model, let’s survey the actual hardware across Apple Silicon generations. Each generation brought meaningful improvements for ML inference, and akunu’s ChipConfig system tunes its behavior for each.
Generation Overview¹
| Gen | Year | Process | GPU Family | Key ML Feature |
|---|---|---|---|---|
| M1 | 2020 | 5nm | Apple 7 | SIMD group matrix ops, UMA |
| M2 | 2022 | 5nm (2nd gen) | Apple 8 | More bandwidth, cores |
| M3 | 2023 | 3nm | Apple 9 | Dynamic caching, ray tracing HW |
| M4 | 2024 | 3nm (2nd gen) | Apple 9+ | Enhanced ML, higher clocks |
M1 Family (2020-2021)
The M1 was the first Apple Silicon chip for the Mac, and it proved the concept.
M1 Family Specifications:
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│          │ M1       │ M1 Pro   │ M1 Max   │ M1 Ultra │
├──────────┼──────────┼──────────┼──────────┼──────────┤
│ GPU Cores│ 7-8      │ 14-16    │ 24-32    │ 48-64    │
│ CPU Cores│ 8        │ 8-10     │ 10       │ 20       │
│ Max RAM  │ 16 GB    │ 32 GB    │ 64 GB    │ 128 GB   │
│ Mem BW   │ 68 GB/s  │ 200 GB/s │ 400 GB/s │ 800 GB/s │
│ SLC      │ 16 MB    │ 24 MB    │ 48 MB    │ 96 MB    │
│ GPU Fam  │ Apple 7  │ Apple 7  │ Apple 7  │ Apple 7  │
│ FP32 TF  │ 2.6      │ 5.2      │ 10.4     │ 20.8     │
│ FP16 TF  │ 5.2      │ 10.4     │ 20.8     │ 41.6     │
│ NE TOPS  │ 11       │ 11       │ 11       │ 22       │
└──────────┴──────────┴──────────┴──────────┴──────────┘
For inference:
- M1 base: Usable for small models (1-3B Q4). 68 GB/s bandwidth limits decode speed
- M1 Max 64GB: First chip that could comfortably run 13B Q4 models
- M1 Ultra 128GB: Could run 70B Q4 models, albeit slowly
Key capability: Apple GPU Family 7 introduced simdgroup_matrix operations (8×8 matrix tiles), which akunu uses for GEMM kernels during prefill.
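The tile math is easy to sketch host-side. A small Python sketch (the function name is illustrative, not akunu's API) of how many 8×8 tiles a GEMM output requires:

```python
import math

# simdgroup_matrix operates on 8x8 tiles: an M x N GEMM output is
# covered by ceil(M/8) * ceil(N/8) tiles, each accumulated over K in
# 8-wide steps. The 8x8 tile size is the Family 7 figure cited above.
def tile_count(m: int, n: int, tile: int = 8) -> int:
    return math.ceil(m / tile) * math.ceil(n / tile)

# A 512-token prefill against a 4096-wide projection matrix:
tiles = tile_count(512, 4096)  # 64 * 512 = 32768 tiles
```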
M2 Family (2022-2023)
An evolutionary improvement: more cores, more bandwidth, same architecture.
M2 Family Specifications:
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│          │ M2       │ M2 Pro   │ M2 Max   │ M2 Ultra │
├──────────┼──────────┼──────────┼──────────┼──────────┤
│ GPU Cores│ 8-10     │ 16-19    │ 30-38    │ 60-76    │
│ CPU Cores│ 8        │ 10-12    │ 12       │ 24       │
│ Max RAM  │ 24 GB    │ 32 GB    │ 96 GB    │ 192 GB   │
│ Mem BW   │ 100 GB/s │ 200 GB/s │ 400 GB/s │ 800 GB/s │
│ SLC      │ 16 MB    │ 24 MB    │ 48 MB    │ 96 MB    │
│ GPU Fam  │ Apple 8  │ Apple 8  │ Apple 8  │ Apple 8  │
│ FP32 TF  │ 3.6      │ 6.8      │ 13.6     │ 27.2     │
│ FP16 TF  │ 7.2      │ 13.6     │ 27.2     │ 54.4     │
│ NE TOPS  │ 15.8     │ 15.8     │ 15.8     │ 31.6     │
└──────────┴──────────┴──────────┴──────────┴──────────┘
For inference:
- M2 Max 96GB: Sweet spot for 70B Q4 models
- M2 Ultra 192GB: Could run quantized 100B+ models
- ~30% faster than M1 at equivalent tier
M3 Family (2023-2024)
The jump to 3nm brought significant GPU architectural changes.
M3 Family Specifications:
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│          │ M3       │ M3 Pro   │ M3 Max   │ M3 Ultra │
├──────────┼──────────┼──────────┼──────────┼──────────┤
│ GPU Cores│ 8-10     │ 14-18    │ 30-40    │ 60-80    │
│ CPU Cores│ 8        │ 11-12    │ 14-16    │ 28-32    │
│ Max RAM  │ 24 GB    │ 36 GB    │ 128 GB   │ 192 GB   │
│ Mem BW   │ 100 GB/s │ 150 GB/s │ 400 GB/s │ 800 GB/s │
│ SLC      │ 16 MB    │ 24 MB    │ 48 MB    │ 96 MB    │
│ GPU Fam  │ Apple 9  │ Apple 9  │ Apple 9  │ Apple 9  │
│ FP32 TF  │ 4.1      │ 7.4      │ 16.4     │ 32.8     │
│ FP16 TF  │ 8.2      │ 14.8     │ 32.8     │ 65.6     │
│ NE TOPS  │ 18       │ 18       │ 18       │ 36       │
└──────────┴──────────┴──────────┴──────────┴──────────┘
M3’s GPU innovations:
- Dynamic Caching: the GPU allocates register and threadgroup memory dynamically per kernel at runtime, instead of statically at compile time. This improves occupancy for kernels that don’t use their worst-case allocation.
- Ray Tracing Hardware: Not relevant for inference, but shows architectural maturation
- Mesh Shading: Graphics feature, not relevant here
- Apple GPU Family 9: same SIMD group matrix ops as Family 7, but with improved scheduling
M4 Family (2024-2025)
The latest generation, optimized for AI workloads.
M4 Family Specifications:
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│          │ M4       │ M4 Pro   │ M4 Max   │ M4 Ultra │
├──────────┼──────────┼──────────┼──────────┼──────────┤
│ GPU Cores│ 10       │ 20       │ 40       │ 80       │
│ CPU Cores│ 4P+6E    │ 10P+4E   │ 12P+4E   │ 24P+8E   │
│ Max RAM  │ 32 GB    │ 48 GB    │ 128 GB   │ 192 GB   │
│ Mem BW   │ 120 GB/s │ 273 GB/s │ 546 GB/s │ 819 GB/s │
│ SLC      │ 16 MB    │ 36 MB    │ 48 MB    │ 96 MB    │
│ GPU Fam  │ Apple 9  │ Apple 9  │ Apple 9  │ Apple 9  │
│ FP32 TF  │ 4.6      │ 9.2      │ 18.4     │ 36.8     │
│ FP16 TF  │ 9.2      │ 18.4     │ 36.8     │ 73.6     │
│ NE TOPS  │ 38       │ 38       │ 38       │ 76       │
└──────────┴──────────┴──────────┴──────────┴──────────┘
M4 highlights for inference:
- Significantly higher memory bandwidth at each tier (M4 Pro: 273 vs M3 Pro: 150 GB/s)
- Larger SLC (M4 Pro: 36 MB vs M3 Pro: 24 MB)
- More GPU cores per tier
- Enhanced Neural Engine (38 TOPS)
Inference Performance by Chip
Here are theoretical decode speeds for common model sizes:
Tokens/sec (theoretical max, Q4_0 quantization):
Model Size:     3B (~1.7GB)   7B (~3.8GB)   13B (~7.3GB)   70B (~40GB)
──────────────────────────────────────────────────────────────────────
M4                   70            31             16            —*
M4 Pro              160            71             37            6.8
M4 Max              321           143             74           13.6
M4 Ultra            481           215            112           20.4
──────────────────────────────────────────────────────────────────────
*Doesn't fit in 32 GB RAM
Actual performance is typically 60-80% of theoretical, due to cache misses, dispatch overhead, and non-ideal memory access patterns²; akunu’s tuned kernels usually reach 70-85% of theoretical bandwidth utilization.
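The ceilings in the table above reduce to a single division: decoding one token reads every weight once, so tokens/sec is bounded by bandwidth over model size. A Python sketch (function names and the 0.75 default are illustrative):

```python
# Decode is bandwidth-bound: each generated token reads every weight
# once, so the theoretical ceiling is bandwidth divided by model size.
def theoretical_tokens_per_sec(bandwidth_gbps: float, model_gb: float) -> float:
    return bandwidth_gbps / model_gb

# Scale by an achieved-utilization factor (70-85% for a tuned engine).
def expected_tokens_per_sec(bandwidth_gbps: float, model_gb: float,
                            utilization: float = 0.75) -> float:
    return theoretical_tokens_per_sec(bandwidth_gbps, model_gb) * utilization

# M4 Pro (273 GB/s) with a 7B Q4_0 model (~3.8 GB):
ceiling = theoretical_tokens_per_sec(273, 3.8)  # ~71.8 tok/s
```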
Akunu’s ChipConfig
Akunu detects the hardware at startup and selects tuning parameters via the ChipConfig struct. Here’s how different chips get different configurations:
ChipConfig Parameters:
┌────────────────────┬─────────────────────────────────────┐
│ Parameter          │ Purpose                             │
├────────────────────┼─────────────────────────────────────┤
│ slc_size           │ SLC size in bytes. Affects tile     │
│                    │ sizes and prefetch strategies.      │
├────────────────────┼─────────────────────────────────────┤
│ gemv_k_threshold   │ K dimension where GEMV switches     │
│                    │ from small (4 SGs) to large (8 SGs) │
│                    │ variant. Higher bandwidth chips can │
│                    │ benefit from more parallelism.      │
├────────────────────┼─────────────────────────────────────┤
│ prefill_chunk      │ Max tokens per prefill batch.       │
│                    │ Limited by threadgroup memory for   │
│                    │ attention; larger chips handle more.│
├────────────────────┼─────────────────────────────────────┤
│ chain_decode_count │ Tokens per chained decode GPU       │
│                    │ submission. More cores → more tokens│
│                    │ can be chained without overhead.    │
└────────────────────┴─────────────────────────────────────┘
Example configurations:
M4 (10 GPU cores, 120 GB/s, 16 MB SLC):
gemv_k_threshold = 2048
prefill_chunk = 512
chain_decode_count = 4
M4 Pro (20 GPU cores, 273 GB/s, 36 MB SLC):
gemv_k_threshold = 2048
prefill_chunk = 2048
chain_decode_count = 6
M4 Max (40 GPU cores, 546 GB/s, 48 MB SLC):
gemv_k_threshold = 4096
prefill_chunk = 4096
chain_decode_count = 8
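Sketched as code, the configurations above might look like the following. This is a Python sketch: the field names follow the table, but the struct layout and name-keyed lookup are assumptions, not akunu's actual source (real detection would query the hardware rather than a string key):

```python
from dataclasses import dataclass

# Per-chip tuning parameters, mirroring the ChipConfig table above.
@dataclass(frozen=True)
class ChipConfig:
    slc_size: int            # SLC size in bytes
    gemv_k_threshold: int    # K where GEMV switches to the 8-SG variant
    prefill_chunk: int       # max tokens per prefill batch
    chain_decode_count: int  # tokens per chained decode GPU submission

CONFIGS = {
    "M4":     ChipConfig(slc_size=16 << 20, gemv_k_threshold=2048,
                         prefill_chunk=512,  chain_decode_count=4),
    "M4 Pro": ChipConfig(slc_size=36 << 20, gemv_k_threshold=2048,
                         prefill_chunk=2048, chain_decode_count=6),
    "M4 Max": ChipConfig(slc_size=48 << 20, gemv_k_threshold=4096,
                         prefill_chunk=4096, chain_decode_count=8),
}
```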
The key insight: the same kernels run on all chips, but the dispatch table uses different configurations. The GEMV kernel for Q4_0 on an M4 uses 4 SIMD groups (128 threads), while on an M4 Max it might use 8 SIMD groups (256 threads) for K dimensions above the threshold.
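That dispatch decision is a pure function of K and the chip's threshold. A Python sketch (names are illustrative):

```python
SIMD_WIDTH = 32  # threads per SIMD group on Apple GPUs

# 4 SIMD groups (128 threads) for small K, 8 SIMD groups (256 threads)
# once K exceeds the chip's gemv_k_threshold.
def gemv_simdgroups(k: int, threshold: int) -> int:
    return 8 if k > threshold else 4

# On an M4 Max (threshold 4096), K = 8192 selects the large variant:
groups = gemv_simdgroups(8192, threshold=4096)   # 8
threads = groups * SIMD_WIDTH                    # 256
```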
Choosing the Right Hardware for Your Workload
┌─────────────────────────────────────────────────────────────┐
│                  MODEL SIZE vs CHIP GUIDE                   │
│                                                             │
│  Model Size   Recommended Minimum    Sweet Spot             │
│  ──────────   ───────────────────    ──────────             │
│  1-3B         M4 (16 GB)             M4 Pro (24 GB)         │
│  7B           M4 Pro (24 GB)         M4 Pro (48 GB)         │
│  13B          M4 Pro (48 GB)         M4 Max (64 GB)         │
│  34B          M4 Max (64 GB)         M4 Max (128 GB)        │
│  70B          M4 Max (128 GB)        M4 Ultra (192 GB)      │
│  100B+        M4 Ultra (192 GB)      —                      │
│                                                             │
│  Rule of thumb: you need ~1.2x the Q4 model size as RAM     │
│  (weights + KV cache + scratch buffers + OS overhead)       │
└─────────────────────────────────────────────────────────────┘
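The rule of thumb translates directly into arithmetic. A Python sketch (the ~0.5625 bytes/parameter figure is inferred from the Q4 model sizes quoted in this section; function names are hypothetical):

```python
# Q4_0 stores roughly 4.5 bits (0.5625 bytes) per parameter, consistent
# with the sizes above (13B * 0.5625 ≈ 7.3 GB, 70B * 0.5625 ≈ 39 GB).
def q4_size_gb(params_billions: float) -> float:
    return params_billions * 0.5625

# RAM budget: ~1.2x the model size covers weights, KV cache,
# scratch buffers, and OS overhead.
def min_ram_gb(model_gb: float) -> float:
    return model_gb * 1.2

# A 70B Q4 model (~40 GB) wants roughly 48 GB of RAM.
needed = min_ram_gb(40)
```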
Summary
Across four generations:
- M1 introduced UMA and SIMD group matrix ops — the foundation
- M2 increased bandwidth and core counts — evolutionary improvement
- M3 added dynamic caching and 3nm efficiency — architectural refinement
- M4 pushed bandwidth dramatically higher — the best for inference
Akunu’s ChipConfig abstracts these differences into tuning parameters, so the same codebase runs optimally on everything from an M1 MacBook Air to an M4 Ultra Mac Studio. The key variables are always the same: memory bandwidth, GPU core count, and SLC size.
In the next part, we’ll learn how to actually program these GPUs using the Metal framework.
¹ Apple. “Apple M1 chip”, “Apple M2 chip”, “Apple M3 chip”, “Apple M4 chip.” apple.com, 2020-2024. Official specifications for each chip generation. See https://www.apple.com/newsroom/2024/05/apple-introduces-m4-chip/. ↩
² Frumusanu, A. and Smith, R. “Apple M-series deep dives.” AnandTech, 2020-2023. Independent benchmarks and die analysis across generations. See https://chipsandcheese.com/. ↩