Chip Configuration and Hardware Tuning

Akunu runs on everything from an M1 MacBook Air with 8 GPU cores and 68 GB/s bandwidth to an M4 Ultra Mac Studio with 80 GPU cores and over a terabyte per second of bandwidth. The same inference engine, the same kernels, the same dispatch table structure – but with very different optimal tuning parameters. A threadgroup size that saturates an M1 might leave an M4 Max starved. A chain decode chunk size that is efficient on M4 might cause command buffer overhead to dominate on M1.

The ChipConfig struct solves this: it derives all hardware-specific tuning parameters from two inputs – the GPU core count and the GPU family number. Every magic number in the engine lives here, not scattered across the codebase.

The ChipConfig Struct

Defined in src/core/chip_config.h:

struct ChipConfig {
    // === Hardware info ===
    int gpu_cores;           // GPU core count (from IOKit)
    int gpu_family;          // Apple GPU family (7=M1, 8=M2/M3, 9=M4)
    int max_threads_per_tg;  // Metal max (typically 1024)

    // === SLC (System Level Cache) ===
    int slc_bytes;             // estimated SLC size
    bool should_fuse_weights;  // fuse gate+up, QKV when SLC is large enough

    // === GEMV tuning ===
    int q4_small_k_threshold;  // Q4_0: use small-K variant below this
    int wide_gemv_threshold;   // switch to wide GEMV when N exceeds this
    bool gemv_wide_standard;   // use 256-thread standard GEMV on Pro+

    // === Chain decode ===
    int chain_decode_chunk;  // tokens per GPU submission

    // === Prefill ===
    int max_prefill_chunk;  // max tokens per prefill GEMM batch

    // === Norm dispatch ===
    int norm_tg_size;  // threadgroup size for RMSNorm
};

Every field has a clear purpose and measurable impact on performance. Let’s go through each one.

Chip Detection

The factory function ChipConfig::from_gpu(int cores, int family) builds a config from two hardware-detected values:

  • cores: The GPU core count, queried from IOKit’s AGXAccelerator service via IORegistryEntryCreateCFProperty(CFSTR("gpu-core-count")). This is the most reliable indicator of chip tier.
  • family: The Apple GPU family number, queried via Metal’s [device supportsFamily:]. Family 7 = M1, Family 8 = M2/M3, Family 9 = M4.

These two values together uniquely identify the chip tier and generation:

Cores   Family  Chip        SLC (est.)  Bandwidth
8-10    7       M1          8 MB        68 GB/s
14-16   7       M1 Pro      24 MB       200 GB/s
24-32   7       M1 Max      48 MB       400 GB/s
48-64   7       M1 Ultra    96 MB       800 GB/s
8-10    8       M2/M3       8 MB        100-150 GB/s
16-18   8       M2/M3 Pro   24 MB       150-200 GB/s
30-40   8       M2/M3 Max   48 MB       400 GB/s
8-10    9       M4          16 MB       120 GB/s
16-20   9       M4 Pro      32 MB       273 GB/s
30-40   9       M4 Max      48 MB       546 GB/s
60-80   9       M4 Ultra    96 MB       1092 GB/s
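As an illustrative helper, the table's core-count ranges collapse into a simple tier classifier (boundaries taken from the table; a hypothetical function, not part of the engine):

```cpp
#include <cassert>
#include <string>

// Hypothetical tier classifier derived from the table above. Boundaries
// use the lowest core count seen at each tier across generations.
std::string tier_name(int cores) {
    if (cores >= 48) return "Ultra";  // 48-80 cores
    if (cores >= 24) return "Max";    // 24-40 cores
    if (cores >= 14) return "Pro";    // 14-20 cores
    return "Base";                    // 8-10 cores
}
```

Note that core count alone is ambiguous across generations (an M1 Max and an M4 Max can both report 32-40 cores), which is exactly why from_gpu also takes the family number.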

SLC Size Estimation

The System Level Cache (SLC) is the large last-level cache shared between CPU, GPU, and other IP blocks. Apple does not officially disclose SLC sizes, but they can be estimated from die analysis and performance profiling.1

if (cores >= 60)
    c.slc_bytes = 96 * 1024 * 1024;  // Ultra
else if (cores >= 30)
    c.slc_bytes = 48 * 1024 * 1024;  // Max
else if (cores >= 16)
    c.slc_bytes = (family >= 9) ? 32 * 1024 * 1024 : 24 * 1024 * 1024;  // Pro
else
    c.slc_bytes = (family >= 9) ? 16 * 1024 * 1024 : 8 * 1024 * 1024;  // Base

Key observations:

  • M4 (family 9) gets larger SLC estimates at every tier. Die analysis suggests M4 improved the cache hierarchy significantly.2
  • Ultra chips (2x Max die-to-die) get double the Max SLC.

Impact on Weight Fusion

The SLC size directly controls whether QKV and gate+up weight fusion is beneficial:

c.should_fuse_weights = (c.slc_bytes >= 16 * 1024 * 1024);  // Pro+ and M4 Base

Weight fusion concatenates two or three weight matrices into one, replacing multiple GEMV dispatches with a single larger one. This reduces kernel launch overhead and can improve cache utilization – but only if the fused weight matrix fits (or mostly fits) in the SLC. On an M1 base with 8 MB SLC, a fused QKV weight matrix for a 4096-dim model might be:

$$Q + K + V = (4096 \times 4096 + 2 \times 4096 \times 1024) \times 0.5625 \text{ bytes/elem} \approx 14.2 \text{ MB (Q4_0)}$$

This exceeds the 8 MB SLC, causing cache thrashing. On an M4 Pro with 32 MB SLC, the fused matrix fits comfortably, and the fusion saves two kernel launches.
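The arithmetic can be checked directly. Q4_0 packs 32 elements into an 18-byte block (16 bytes of 4-bit quants plus a 2-byte scale), i.e. 0.5625 bytes per element:

```cpp
#include <cassert>
#include <cstddef>

// Q4_0: 32 elements per 18-byte block -> 0.5625 bytes per element.
constexpr size_t q4_0_bytes(size_t elems) { return elems / 32 * 18; }

// Fused QKV element count for dim=4096, q_dim=4096, kv_dim=1024.
constexpr size_t fused_qkv_elems = (4096 + 2 * 1024) * 4096ull;
constexpr size_t fused_qkv_bytes = q4_0_bytes(fused_qkv_elems);  // ~14.2 MB

// The fusion trade-off described above, as cache-fit checks:
constexpr bool fits_m1_base_slc = fused_qkv_bytes <= 8u  * 1024 * 1024;  // no
constexpr bool fits_m4_pro_slc  = fused_qkv_bytes <= 32u * 1024 * 1024; // yes
```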

GEMV Kernel Selection

Three parameters control GEMV variant selection:

q4_small_k_threshold

if (family >= 9 && cores >= 16)
    c.q4_small_k_threshold = 512;   // M4 Pro+
else if (cores >= 16)
    c.q4_small_k_threshold = 1024;  // M1-M3 Pro+
else
    c.q4_small_k_threshold = 2048;  // Base chips

This determines when to use a small-K GEMV variant that uses fewer SIMD groups for the K-dimension reduction; at or above the threshold, the more parallel standard variant takes over. Larger chips set the threshold lower because they have more GPU cores to keep busy, so the extra parallelism pays off at smaller K values.

wide_gemv_threshold

c.wide_gemv_threshold = 32768;

When the output dimension N exceeds this threshold (e.g., vocabulary projection with 128K tokens), switch to a “wide” GEMV kernel with more output rows per threadgroup. This is constant across all chips because the decision depends on the N dimension, not the hardware.

gemv_wide_standard

c.gemv_wide_standard = (cores >= 16);

On Pro+ chips (16+ cores), use 256-thread threadgroups for the standard GEMV instead of 128-thread. More threads per threadgroup means more SIMD groups, which means better occupancy on chips with many cores. On base chips with 8-10 cores, the extra threads would just increase register pressure without improving occupancy.

Chain Decode Chunk Size

if (cores >= 60)
    c.chain_decode_chunk = 128;      // Ultra
else if (cores >= 30)
    c.chain_decode_chunk = 128;      // Max
else if (family >= 9)
    c.chain_decode_chunk = 128;      // M4 Pro/Base
else if (cores >= 16)
    c.chain_decode_chunk = 96;       // M1-M3 Pro
else
    c.chain_decode_chunk = 64;       // M1/M2 Base

This is the number of tokens batched into a single GPU command buffer during chain decode. The trade-offs:

  • Larger chunks: Amortize command buffer overhead across more tokens. Better GPU utilization. But each submission takes longer, increasing latency to first token in a batch.
  • Smaller chunks: Lower per-submission latency. But higher overhead ratio (command buffer setup / useful work).

M4’s improved GPU command processor can handle larger chunks efficiently. Older chips with smaller caches and slower command processing benefit from smaller chunks.

For a 32-layer model with ~200 commands per token:

  • Chunk 128: 25,600 dispatches per command buffer
  • Chunk 64: 12,800 dispatches per command buffer

The total generation throughput is virtually identical (both are bandwidth-limited), but the chunk size affects how quickly the first token in a chunk appears.
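As a sketch of how this parameter is consumed, a hypothetical planner that splits a generation run into per-command-buffer chunks (not the engine's actual decode loop):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical chunk planner: each entry is one GPU command buffer
// containing at most chain_decode_chunk tokens.
std::vector<int> plan_chunks(int n_tokens, int chain_decode_chunk) {
    std::vector<int> chunks;
    while (n_tokens > 0) {
        int c = std::min(n_tokens, chain_decode_chunk);
        chunks.push_back(c);
        n_tokens -= c;
    }
    return chunks;
}
```

For example, 300 tokens with a chunk size of 128 become three submissions of 128, 128, and 44 tokens.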

Prefill Chunk Size

c.max_prefill_chunk = 4096;

Maximum tokens per prefill GEMM batch. This is constant because prefill is compute-bound (large M), and the GEMM kernel performance is relatively insensitive to batch size above a few hundred tokens. The limit prevents excessive memory usage for the batch scratch buffers.

Norm Threadgroup Size

c.norm_tg_size = 1024;

RMSNorm and LayerNorm use a single threadgroup to process the entire dimension vector. The threadgroup size is min(dim, norm_tg_size). For a 4096-dim model, this means 1024 threads (the Metal maximum), with each thread processing 4 elements.
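The dispatch geometry can be sketched as (a hypothetical helper mirroring the min(dim, norm_tg_size) rule):

```cpp
#include <algorithm>
#include <cassert>

// Norm dispatch geometry: one threadgroup, capped at norm_tg_size
// threads, each thread covering ceil(dim / threads) elements.
struct NormGeom { int threads; int elems_per_thread; };

NormGeom norm_geometry(int dim, int norm_tg_size = 1024) {
    int threads = std::min(dim, norm_tg_size);
    return { threads, (dim + threads - 1) / threads };
}
```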

This is constant because normalization kernels are lightweight and the threadgroup size has minimal impact on performance.

How ChipConfig Flows Through the System

The configuration flows from hardware detection to kernel dispatch:

┌───────────────┐      ┌────────────────┐      ┌──────────────────┐
│  MetalDevice  │─────>│  ChipConfig    │─────>│  table_builder   │
│  gpu_cores()  │      │  from_gpu()    │      │  emit_gemv()     │
│  gpu_family() │      │                │      │  build_dispatch  │
└───────────────┘      └────────────────┘      └──────────────────┘
                                                        │
                                                        ▼
                                                ┌──────────────────┐
                                                │  DispatchTable   │
                                                │  (commands[]     │
                                                │   with resolved  │
                                                │   kernel+geom)   │
                                                └──────────────────┘
  1. MetalDevice::chip_config() calls ChipConfig::from_gpu(gpu_core_count(), gpu_family())
  2. The config is passed to build_dispatch_table() and prefill()
  3. The table builder uses chip.should_fuse_weights to decide QKV/gate+up fusion
  4. emit_gemv() uses chip.wide_gemv_threshold, chip.q4_small_k_threshold, and chip.gemv_wide_standard to select kernel variants
  5. The chain decoder uses chip.chain_decode_chunk for batch sizing
  6. The resulting dispatch table has hardware-optimal kernel choices baked in

All of this happens once at model load time. At inference time, the dispatch table is simply replayed – no hardware queries, no branching, no configuration lookups.
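Putting the per-field rules together, here is a condensed sketch of the factory function, assembled from the snippets earlier in this chapter (branches with identical values are merged, so this is not the verbatim implementation):

```cpp
#include <cassert>

// Condensed sketch of ChipConfig::from_gpu, built from the per-field
// derivation rules shown earlier in this chapter.
struct ChipConfig {
    int  gpu_cores, gpu_family, max_threads_per_tg;
    int  slc_bytes;
    bool should_fuse_weights;
    int  q4_small_k_threshold, wide_gemv_threshold;
    bool gemv_wide_standard;
    int  chain_decode_chunk, max_prefill_chunk, norm_tg_size;

    static ChipConfig from_gpu(int cores, int family) {
        ChipConfig c{};
        c.gpu_cores = cores;
        c.gpu_family = family;
        c.max_threads_per_tg = 1024;

        // SLC estimate by tier; family 9 (M4) gets larger estimates.
        if      (cores >= 60) c.slc_bytes = 96 * 1024 * 1024;  // Ultra
        else if (cores >= 30) c.slc_bytes = 48 * 1024 * 1024;  // Max
        else if (cores >= 16) c.slc_bytes = (family >= 9 ? 32 : 24) * 1024 * 1024;
        else                  c.slc_bytes = (family >= 9 ? 16 :  8) * 1024 * 1024;
        c.should_fuse_weights = c.slc_bytes >= 16 * 1024 * 1024;

        // GEMV variant thresholds.
        if      (family >= 9 && cores >= 16) c.q4_small_k_threshold = 512;
        else if (cores >= 16)                c.q4_small_k_threshold = 1024;
        else                                 c.q4_small_k_threshold = 2048;
        c.wide_gemv_threshold = 32768;
        c.gemv_wide_standard  = cores >= 16;

        // Chain decode chunk (Ultra/Max/M4 branches merged: all 128).
        if      (cores >= 30 || family >= 9) c.chain_decode_chunk = 128;
        else if (cores >= 16)                c.chain_decode_chunk = 96;
        else                                 c.chain_decode_chunk = 64;

        c.max_prefill_chunk = 4096;
        c.norm_tg_size = 1024;
        return c;
    }
};
```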

Real-World Impact

To make the impact concrete, here are approximate decode speeds for a 7B Q4_0 model on different chips, showing how ChipConfig-driven tuning affects performance:

Chip     Cores  Family  Chunk  Fuse?  Wide GEMV?  Decode tok/s
M1       8      7       64     No     No          ~15
M1 Pro   16     7       96     Yes    Yes         ~45
M1 Max   32     7       128    Yes    Yes         ~90
M4       10     9       128    Yes    No          ~28
M4 Pro   20     9       128    Yes    Yes         ~65
M4 Max   40     9       128    Yes    Yes         ~130

The “Fuse?” column shows whether weight fusion is enabled (SLC >= 16 MB). The “Wide GEMV?” column shows whether 256-thread GEMV is used (cores >= 16). These configuration-driven choices account for a significant portion of the performance difference between “works” and “fast.”

Design Philosophy

ChipConfig embodies a few key principles:

1. All magic numbers in one place. Instead of hardcoded constants in table_builder.cpp, prefill.cpp, and decode_loop.cpp, every tuning parameter is in ChipConfig. This makes it easy to review, adjust, and reason about the tuning.

2. Derived, not configured. Users do not set these values. They are automatically derived from hardware detection. This eliminates “you need to tune parameter X for your specific hardware” friction.

3. Conservative defaults. The base chip configuration (64 tokens/chunk, no fusion, 128-thread GEMV) is conservative. It works correctly on every chip. The Pro/Max/Ultra configurations are more aggressive but only activate when the hardware supports them.

4. Chip tier matters more than generation. The core count (which determines chip tier: base/Pro/Max/Ultra) has more impact on tuning than the GPU family (which determines generation: M1/M2/M3/M4). An M1 Max (32 cores) gets more aggressive tuning than an M4 base (10 cores), even though the M4 is a newer generation.

Deep Dive: The SLC and Weight Fusion Decision

The should_fuse_weights decision deserves a more thorough analysis because it is one of the most impactful configuration choices. Let’s trace through the numbers for a concrete model.

Consider Llama 3 8B with Q4_0 quantization:

  • dim = 4096, n_heads = 32, n_kv_heads = 8, head_dim = 128
  • q_dim = 4096, kv_dim = 1024, ffn_dim = 14336

QKV fusion candidate:

Weight        Shape        Bytes (Q4_0)
Q projection  4096 x 4096  ~9.4 MB
K projection  1024 x 4096  ~2.4 MB
V projection  1024 x 4096  ~2.4 MB
Fused QKV     6144 x 4096  ~14.2 MB

On an M1 base (8 MB SLC), the fused QKV matrix (14.2 MB) far exceeds the cache. Reading it causes cache thrashing – the end of the matrix evicts the beginning before the next token needs it. Three separate GEMVs at 9.4 + 2.4 + 2.4 = 14.2 MB read the same total bytes but each individual matrix is smaller. The K and V matrices (2.4 MB each) may actually fit in the SLC between dispatches, providing a small cache benefit.

On an M4 Pro (32 MB SLC), the fused QKV matrix (14.2 MB) fits with room to spare. The single GEMV dispatch saves two kernel launches (~0.01ms each) and potentially benefits from better memory controller scheduling (one large sequential read vs. three separate reads with dispatch overhead between them).

Gate+Up fusion candidate:

Weight           Shape         Bytes (Q4_0)
Gate projection  14336 x 4096  ~33.0 MB
Up projection    14336 x 4096  ~33.0 MB
Fused Gate+Up    28672 x 4096  ~66.0 MB

The fused gate+up matrix (66 MB) exceeds the SLC on every Apple Silicon chip. So why fuse at all? Because the fusion still saves one kernel launch and one dispatch table entry per layer, and the memory controller can prefetch more efficiently when reading one contiguous allocation. On Pro+ chips where the SLC is large enough to hold a meaningful portion of the matrix, the cache hit rate on the “tail” of the read (after the GPU has processed the head) is better than with two separate reads.

On base chips, the gate+up fusion is skipped entirely (controlled by should_fuse_weights = false) because the kernel launch saving does not compensate for the worse cache behavior.

Deep Dive: Chain Decode Chunk Size

The chunk size determines how many tokens are packed into a single Metal command buffer. Let’s analyze the trade-offs quantitatively.

Command buffer overhead: Creating a command buffer, encoding commands, and submitting takes a fixed amount of CPU time. On Apple Silicon, this is approximately:

  • Command buffer creation: ~2 us
  • Encoder creation: ~1 us
  • Submission (commit): ~5 us
  • GPU scheduling delay: ~10-50 us (varies by chip and load)

Total fixed overhead per submission: ~20-60 us.

Per-token encoding time: Each token requires ~260 Metal API calls (set pipeline, set buffer, set bytes, dispatch). Each API call takes approximately 50-100ns. Total per-token encoding: ~13-26 us.

For different chunk sizes on M4 Pro:

Chunk Size  Encoding Time  Fixed Overhead  Overhead Ratio  Tokens Until Display
1           26 us          60 us           70%             1
16          416 us         60 us           13%             16
64          1.7 ms         60 us           3.4%            64
128         3.3 ms         60 us           1.8%            128
The “Tokens Until Display” column shows the latency cost: with a chunk size of 128, the user must wait for all 128 tokens to be generated before any are displayed. For interactive chat at 70 tok/s, a chunk of 128 takes ~1.8 seconds – potentially noticeable.

Akunu balances this by using smaller chunks for the first few tokens (to minimize time-to-first-token) and larger chunks for sustained generation. The chain_decode_chunk in ChipConfig sets the maximum chunk size for sustained generation.

On older chips (M1 base), the per-token encoding time is higher (slower CPU) and the GPU execution time is longer (less bandwidth), so the fixed overhead is relatively smaller. A chunk of 64 achieves a good balance. On M4, the faster CPU makes encoding cheaper and the faster GPU makes execution shorter, so a chunk of 128 is optimal.
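One possible shape for that ramp, sketched as code (a hypothetical schedule; the chapter does not show the actual split, and the starting size of 8 is an assumption):

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical ramp: start with a small submission to minimize time to
// first token, then double up to the ChipConfig ceiling for sustained
// decode. The initial size of 8 is an assumed value.
int next_chunk(int tokens_done, int chain_decode_chunk) {
    int c = 8;
    while (c < chain_decode_chunk && c <= tokens_done) c *= 2;
    return std::min(c, chain_decode_chunk);
}
```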

Deep Dive: GEMV Variant Selection Logic

The three GEMV-related parameters work together to form a decision tree. Let me trace through how emit_gemv in table_builder.cpp uses them.

For each GEMV dispatch, the kernel variant is selected as follows:

Input: N (output rows), K (reduction dimension), dtype

1. Look up DTypeDescriptor for dtype
2. If N > chip.wide_gemv_threshold AND wide kernel exists:
     → Use wide GEMV (more output rows per TG)
3. Else if small-K kernel exists AND K < chip.q4_small_k_threshold:
     → Use small-K GEMV (fewer threads for K-reduction)
4. Else:
     → Use standard GEMV
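
The same selection logic as a small function (a sketch: kernel availability is modeled as plain booleans rather than the engine's descriptor lookup, and the K-variant direction follows the ChipConfig field comment, i.e. a small-K variant below the threshold):

```cpp
#include <cassert>
#include <string>

// Sketch of emit_gemv's variant selection. has_wide / has_small_k stand
// in for "this kernel variant exists for the dtype".
std::string pick_gemv(int N, int K, bool has_wide, bool has_small_k,
                      int wide_threshold, int small_k_threshold) {
    if (N > wide_threshold && has_wide)       return "wide";
    if (has_small_k && K < small_k_threshold) return "small_k";
    return "standard";
}
```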

For a concrete example with Llama 3 8B (Q4_0):

Operation           N       K      M4 Pro Decision       M1 Base Decision
Q projection        4096    4096   Standard (N < 32768)  Standard
K projection        1024    4096   Standard              Standard
V projection        1024    4096   Standard              Standard
O projection        4096    4096   Standard              Standard
Gate projection     14336   4096   Standard              Standard
Up projection       14336   4096   Standard              Standard
Down projection     4096    14336  Standard              Standard
Logit (vocab=128K)  128256  4096   Wide (N > 32768)      Wide

The wide GEMV is only used for the vocabulary projection, where N (vocab size) is very large. All other projections use the standard kernel. The q4_small_k_threshold would matter for models with smaller hidden dimensions or for the KV projections, where K might be below the threshold on base chips.

The Backwards Compatibility Shim

The struct includes a backwards-compatibility factory:

static ChipConfig from_gpu_cores(int cores) { return from_gpu(cores, 0); }

This allows older code that only knows the core count (not the GPU family) to still create a configuration. With family = 0, the M4-specific optimizations (larger SLC, lower K-threshold) are skipped, and the config falls back to conservative M1-era tuning.

Extending ChipConfig

If you were adding CUDA support, ChipConfig would need additional fields:

// Hypothetical CUDA additions:
int sm_count;               // number of streaming multiprocessors
int shared_mem_per_sm;      // shared memory (threadgroup memory equivalent)
int warp_size;              // always 32, but good to be explicit
bool has_tensor_cores;
int tensor_core_gen;        // 0=none, 3=Volta, 4=Ampere, 5=Hopper
int l2_cache_bytes;         // L2 cache size (analogous to SLC)

The factory function would detect these from cuDeviceGetAttribute and derive tuning parameters. The should_fuse_weights decision would key off l2_cache_bytes instead of slc_bytes. The chain_decode_chunk would adapt to the different command submission model (CUDA streams vs Metal command buffers). The table builder would use the same ChipConfig interface, just with different values.

For Vulkan, additional fields might include:

int subgroup_size;          // analogous to SIMD width (varies: 16, 32, 64)
bool has_subgroup_matrix;   // cooperative matrix support
int max_workgroup_size;     // varies by device

The beauty of the ChipConfig approach is that the table builder does not care which backend produced the values. It just reads should_fuse_weights, wide_gemv_threshold, etc. The backend-specific detection logic is encapsulated in the factory function.3

Summary

ChipConfig is a simple struct that maps hardware capabilities (GPU core count, GPU family) to optimal tuning parameters (SLC size, fusion decisions, GEMV variant selection, chain decode chunk size). It is built once at device creation and consumed at dispatch table build time. All decisions are baked into the dispatch table – nothing is queried or branched on at inference time.



  1. Chips and Cheese. (2021-2024). “Apple M1/M2/M3/M4 Die Analysis.” SLC sizes are estimated from die area analysis, performance counter measurements, and benchmark profiling. Apple does not officially disclose SLC sizes. See https://chipsandcheese.com/.

  2. Apple, “Apple M4 chip” (2024). The M4 chip announcement highlights “next-generation GPU” with improved ray tracing and machine learning performance. Independent analysis suggests significant cache hierarchy improvements. See https://www.apple.com/newsroom/2024/05/apple-introduces-m4-chip/.

  3. This separation of concerns – hardware detection in the backend, tuning consumption in the core – is the same pattern used by the Device abstraction layer. ChipConfig is effectively a “capabilities descriptor” that the core engine reads, analogous to how ArchDescriptor is a “model capabilities descriptor.”