Chip Configuration and Hardware Tuning
Akunu runs on everything from an M1 MacBook Air with 8 GPU cores and 68 GB/s bandwidth to an M4 Ultra Mac Studio with 80 GPU cores and over a terabyte per second of bandwidth. The same inference engine, the same kernels, the same dispatch table structure – but with very different optimal tuning parameters. A threadgroup size that saturates an M1 might leave an M4 Max starved. A chain decode chunk size that is efficient on M4 might cause command buffer overhead to dominate on M1.
The ChipConfig struct solves this: it derives all hardware-specific tuning parameters from two inputs – the GPU core count and the GPU family number. Every magic number in the engine lives here, not scattered across the codebase.
The ChipConfig Struct
Defined in src/core/chip_config.h:
```cpp
struct ChipConfig {
    // === Hardware info ===
    int gpu_cores;              // GPU core count (from IOKit)
    int gpu_family;             // Apple GPU family (7=M1, 8=M2/M3, 9=M4)
    int max_threads_per_tg;     // Metal max (typically 1024)

    // === SLC (System Level Cache) ===
    int slc_bytes;              // estimated SLC size
    bool should_fuse_weights;   // fuse gate+up, QKV when SLC is large enough

    // === GEMV tuning ===
    int q4_small_k_threshold;   // Q4_0: use small-K variant below this
    int wide_gemv_threshold;    // switch to wide GEMV when N exceeds this
    bool gemv_wide_standard;    // use 256-thread standard GEMV on Pro+

    // === Chain decode ===
    int chain_decode_chunk;     // tokens per GPU submission

    // === Prefill ===
    int max_prefill_chunk;      // max tokens per prefill GEMM batch

    // === Norm dispatch ===
    int norm_tg_size;           // threadgroup size for RMSNorm
};
```
Every field has a clear purpose and measurable impact on performance. Let’s go through each one.
Chip Detection
The factory function ChipConfig::from_gpu(int cores, int family) builds a config from two hardware-detected values:
- `cores`: the GPU core count, queried from IOKit's `AGXAccelerator` service via `IORegistryEntryCreateCFProperty(CFSTR("gpu-core-count"))`. This is the most reliable indicator of chip tier.
- `family`: the Apple GPU family number, queried via Metal's `[device supportsFamily:]`. Family 7 = M1, family 8 = M2/M3, family 9 = M4.
These two values together uniquely identify the chip tier and generation:
| Cores | Family | Chip | SLC (est.) | Bandwidth |
|---|---|---|---|---|
| 8-10 | 7 | M1 | 8 MB | 68 GB/s |
| 14-16 | 7 | M1 Pro | 24 MB | 200 GB/s |
| 24-32 | 7 | M1 Max | 48 MB | 400 GB/s |
| 48-64 | 7 | M1 Ultra | 96 MB | 800 GB/s |
| 8-10 | 8 | M2/M3 | 8 MB | 100-150 GB/s |
| 16-18 | 8 | M2/M3 Pro | 24 MB | 150-200 GB/s |
| 30-40 | 8 | M2/M3 Max | 48 MB | 400 GB/s |
| 8-10 | 9 | M4 | 16 MB | 120 GB/s |
| 16-20 | 9 | M4 Pro | 32 MB | 273 GB/s |
| 30-40 | 9 | M4 Max | 48 MB | 546 GB/s |
| 60-80 | 9 | M4 Ultra | 96 MB | 1092 GB/s |
SLC Size Estimation
The System Level Cache (SLC) is the large last-level cache shared between CPU, GPU, and other IP blocks. Apple does not officially disclose SLC sizes, but they can be estimated from die analysis and performance profiling.[^1]
```cpp
if (cores >= 60)
    c.slc_bytes = 96 * 1024 * 1024;                                    // Ultra
else if (cores >= 30)
    c.slc_bytes = 48 * 1024 * 1024;                                    // Max
else if (cores >= 16)
    c.slc_bytes = (family >= 9) ? 32 * 1024 * 1024 : 24 * 1024 * 1024; // Pro
else
    c.slc_bytes = (family >= 9) ? 16 * 1024 * 1024 : 8 * 1024 * 1024;  // Base
```
Key observations:
- M4 (family 9) gets larger SLC estimates at every tier. Die analysis suggests M4 improved the cache hierarchy significantly.[^2]
- Ultra chips (2x Max die-to-die) get double the Max SLC.
Impact on Weight Fusion
The SLC size directly controls whether QKV and gate+up weight fusion is beneficial:
```cpp
c.should_fuse_weights = (c.slc_bytes >= 16 * 1024 * 1024); // Pro+ and M4 Base
```
Weight fusion concatenates two or three weight matrices into one, replacing multiple GEMV dispatches with a single larger one. This reduces kernel launch overhead and can improve cache utilization – but only if the fused weight matrix fits (or mostly fits) in the SLC. On an M1 base with 8 MB SLC, a fused QKV weight matrix for a 4096-dim model might be:
$$Q + K + V = (4096 \times 4096 + 2 \times 1024 \times 4096) \times 0.5625 \text{ bytes/elem} \approx 14.2 \text{ MB (Q4\_0)}$$
where 0.5625 bytes/element is Q4_0's true density: each 32-weight block stores 16 bytes of 4-bit weights plus a 2-byte scale, 18 bytes in all.
This exceeds the 8 MB SLC, causing cache thrashing. On an M4 Pro with 32 MB SLC, the fused matrix fits comfortably, and the fusion saves two kernel launches.
GEMV Kernel Selection
Three parameters control GEMV variant selection:
q4_small_k_threshold
```cpp
if (family >= 9 && cores >= 16)
    c.q4_small_k_threshold = 512;   // M4 Pro+
else if (cores >= 16)
    c.q4_small_k_threshold = 1024;  // M1-M3 Pro+
else
    c.q4_small_k_threshold = 2048;  // Base chips
```
This determines when to use a “large-K” GEMV variant that uses more SIMD groups for the K-dimension reduction. Larger chips can afford more parallelism at smaller K values because they have more GPU cores to keep busy.
wide_gemv_threshold
```cpp
c.wide_gemv_threshold = 32768;
```
When the output dimension N exceeds this threshold (e.g., vocabulary projection with 128K tokens), switch to a “wide” GEMV kernel with more output rows per threadgroup. This is constant across all chips because the decision depends on the N dimension, not the hardware.
gemv_wide_standard
```cpp
c.gemv_wide_standard = (cores >= 16);
```
On Pro+ chips (16+ cores), use 256-thread threadgroups for the standard GEMV instead of 128-thread. More threads per threadgroup means more SIMD groups, which means better occupancy on chips with many cores. On base chips with 8-10 cores, the extra threads would just increase register pressure without improving occupancy.
Chain Decode Chunk Size
```cpp
if (cores >= 60)
    c.chain_decode_chunk = 128;  // Ultra
else if (cores >= 30)
    c.chain_decode_chunk = 128;  // Max
else if (family >= 9)
    c.chain_decode_chunk = 128;  // M4 Pro/Base
else if (cores >= 16)
    c.chain_decode_chunk = 96;   // M1/M2/M3 Pro
else
    c.chain_decode_chunk = 64;   // M1/M2 Base
```
This is the number of tokens batched into a single GPU command buffer during chain decode. The trade-offs:
- Larger chunks: Amortize command buffer overhead across more tokens. Better GPU utilization. But each submission takes longer, increasing latency to first token in a batch.
- Smaller chunks: Lower per-submission latency. But higher overhead ratio (command buffer setup / useful work).
M4’s improved GPU command processor can handle larger chunks efficiently. Older chips with smaller caches and slower command processing benefit from smaller chunks.
For a 32-layer model with ~200 commands per token:
- Chunk 128: 25,600 dispatches per command buffer
- Chunk 64: 12,800 dispatches per command buffer
The total generation throughput is virtually identical (both are bandwidth-limited), but the chunk size affects how quickly the first token in a chunk appears.
Prefill Chunk Size
```cpp
c.max_prefill_chunk = 4096;
```
Maximum tokens per prefill GEMM batch. This is constant because prefill is compute-bound (large M), and the GEMM kernel performance is relatively insensitive to batch size above a few hundred tokens. The limit prevents excessive memory usage for the batch scratch buffers.
Norm Threadgroup Size
```cpp
c.norm_tg_size = 1024;
```
RMSNorm and LayerNorm use a single threadgroup to process the entire dimension vector. The threadgroup size is min(dim, norm_tg_size). For a 4096-dim model, this means 1024 threads (the Metal maximum), with each thread processing 4 elements.
This is constant because normalization kernels are lightweight and the threadgroup size has minimal impact on performance.
How ChipConfig Flows Through the System
The configuration flows from hardware detection to kernel dispatch:
```
┌──────────────┐      ┌────────────────┐      ┌──────────────────┐
│ MetalDevice  │─────>│ ChipConfig     │─────>│ table_builder    │
│ gpu_cores()  │      │ from_gpu()     │      │ emit_gemv()      │
│ gpu_family() │      │                │      │ build_dispatch   │
└──────────────┘      └────────────────┘      └──────────────────┘
                                                       │
                                                       ▼
                                              ┌──────────────────┐
                                              │ DispatchTable    │
                                              │ (commands[]      │
                                              │  with resolved   │
                                              │  kernel+geom)    │
                                              └──────────────────┘
```
1. `MetalDevice::chip_config()` calls `ChipConfig::from_gpu(gpu_core_count(), gpu_family())`
2. The config is passed to `build_dispatch_table()` and `prefill()`
3. The table builder uses `chip.should_fuse_weights` to decide QKV/gate+up fusion
4. `emit_gemv()` uses `chip.wide_gemv_threshold`, `chip.q4_small_k_threshold`, and `chip.gemv_wide_standard` to select kernel variants
5. The chain decoder uses `chip.chain_decode_chunk` for batch sizing
6. The resulting dispatch table has hardware-optimal kernel choices baked in
All of this happens once at model load time. At inference time, the dispatch table is simply replayed – no hardware queries, no branching, no configuration lookups.
Real-World Impact
To make the impact concrete, here are approximate decode speeds for a 7B Q4_0 model on different chips, showing how ChipConfig-driven tuning affects performance:
| Chip | Cores | Family | Chunk | Fuse? | Wide GEMV? | Decode tok/s |
|---|---|---|---|---|---|---|
| M1 | 8 | 7 | 64 | No | No | ~15 |
| M1 Pro | 16 | 7 | 96 | Yes | Yes | ~45 |
| M1 Max | 32 | 7 | 128 | Yes | Yes | ~90 |
| M4 | 10 | 9 | 128 | Yes | No | ~28 |
| M4 Pro | 20 | 9 | 128 | Yes | Yes | ~65 |
| M4 Max | 40 | 9 | 128 | Yes | Yes | ~130 |
The “Fuse?” column shows whether weight fusion is enabled (SLC >= 16 MB). The “Wide GEMV?” column shows whether 256-thread GEMV is used (cores >= 16). These configuration-driven choices account for a significant portion of the performance difference between “works” and “fast.”
Design Philosophy
ChipConfig embodies a few key principles:
1. All magic numbers in one place. Instead of hardcoded constants in table_builder.cpp, prefill.cpp, and decode_loop.cpp, every tuning parameter is in ChipConfig. This makes it easy to review, adjust, and reason about the tuning.
2. Derived, not configured. Users do not set these values. They are automatically derived from hardware detection. This eliminates “you need to tune parameter X for your specific hardware” friction.
3. Conservative defaults. The base chip configuration (64 tokens/chunk, no fusion, 128-thread GEMV) is conservative. It works correctly on every chip. The Pro/Max/Ultra configurations are more aggressive but only activate when the hardware supports them.
4. Chip tier matters more than generation. The core count (which determines chip tier: base/Pro/Max/Ultra) has more impact on tuning than the GPU family (which determines generation: M1/M2/M3/M4). An M1 Max (32 cores) gets more aggressive tuning than an M4 base (10 cores), even though the M4 is a newer generation.
Deep Dive: The SLC and Weight Fusion Decision
The should_fuse_weights decision deserves a more thorough analysis because it is one of the most impactful configuration choices. Let’s trace through the numbers for a concrete model.
Consider Llama 3 8B with Q4_0 quantization:
- `dim = 4096`, `n_heads = 32`, `n_kv_heads = 8`, `head_dim = 128`
- `q_dim = 4096`, `kv_dim = 1024`, `ffn_dim = 14336`
QKV fusion candidate:
| Weight | Shape | Bytes (Q4_0) |
|---|---|---|
| Q projection | 4096 x 4096 | ~9.4 MB |
| K projection | 1024 x 4096 | ~2.4 MB |
| V projection | 1024 x 4096 | ~2.4 MB |
| Fused QKV | 6144 x 4096 | ~14.2 MB |
On an M1 base (8 MB SLC), the fused QKV matrix (14.2 MB) far exceeds the cache. Reading it causes cache thrashing – the end of the matrix evicts the beginning before the next token needs it. Three separate GEMVs at 9.4 + 2.4 + 2.4 = 14.2 MB read the same total bytes but each individual matrix is smaller. The K and V matrices (2.4 MB each) may actually fit in the SLC between dispatches, providing a small cache benefit.
On an M4 Pro (32 MB SLC), the fused QKV matrix (14.2 MB) fits with room to spare. The single GEMV dispatch saves two kernel launches (~0.01ms each) and potentially benefits from better memory controller scheduling (one large sequential read vs. three separate reads with dispatch overhead between them).
Gate+Up fusion candidate:
| Weight | Shape | Bytes (Q4_0) |
|---|---|---|
| Gate projection | 14336 x 4096 | ~33.0 MB |
| Up projection | 14336 x 4096 | ~33.0 MB |
| Fused Gate+Up | 28672 x 4096 | ~66.0 MB |
The fused gate+up matrix (66 MB) exceeds the SLC on every tier except Ultra, and even fills most of an Ultra's estimated 96 MB. So why fuse at all? Because the fusion still saves one kernel launch and one dispatch table entry per layer, and the memory controller can prefetch more efficiently when reading one contiguous allocation. On Pro+ chips, where the SLC is large enough to hold a meaningful portion of the matrix, the cache hit rate on the "tail" of the read (after the GPU has processed the head) is better than with two separate reads.
On base chips, the gate+up fusion is skipped entirely (controlled by should_fuse_weights = false) because the kernel launch saving does not compensate for the worse cache behavior.
Deep Dive: Chain Decode Chunk Size
The chunk size determines how many tokens are packed into a single Metal command buffer. Let’s analyze the trade-offs quantitatively.
Command buffer overhead: Creating a command buffer, encoding commands, and submitting takes a fixed amount of CPU time. On Apple Silicon, this is approximately:
- Command buffer creation: ~2 us
- Encoder creation: ~1 us
- Submission (commit): ~5 us
- GPU scheduling delay: ~10-50 us (varies by chip and load)
Total fixed overhead per submission: ~20-60 us.
Per-token encoding time: Each token requires ~260 Metal API calls (set pipeline, set buffer, set bytes, dispatch). Each API call takes approximately 50-100ns. Total per-token encoding: ~13-26 us.
For different chunk sizes on M4 Pro:
| Chunk Size | Encoding Time | Fixed Overhead | Overhead Ratio | Tokens Until Display |
|---|---|---|---|---|
| 1 | 26 us | 60 us | 70% | 1 |
| 16 | 416 us | 60 us | 13% | 16 |
| 64 | 1.7 ms | 60 us | 3.4% | 64 |
| 128 | 3.3 ms | 60 us | 1.8% | 128 |
The “Tokens Until Display” column shows the latency cost: with a chunk size of 128, the user must wait for all 128 tokens to be generated before any are displayed. For interactive chat at 70 tok/s, a chunk of 128 takes ~1.8 seconds – potentially noticeable.
Akunu balances this by using smaller chunks for the first few tokens (to minimize time-to-first-token) and larger chunks for sustained generation. The chain_decode_chunk in ChipConfig sets the maximum chunk size for sustained generation.
On older chips (M1 base), the per-token encoding time is higher (slower CPU) and the GPU execution time is longer (less bandwidth), so the fixed overhead is relatively smaller. A chunk of 64 achieves a good balance. On M4, the faster CPU makes encoding cheaper and the faster GPU makes execution shorter, so a chunk of 128 is optimal.
Deep Dive: GEMV Variant Selection Logic
The three GEMV-related parameters work together to form a decision tree. Let me trace through how emit_gemv in table_builder.cpp uses them.
For each GEMV dispatch, the kernel variant is selected as follows:
```
Input: N (output rows), K (reduction dimension), dtype

1. Look up DTypeDescriptor for dtype
2. If N > chip.wide_gemv_threshold AND wide kernel exists:
   → Use wide GEMV (more output rows per TG)
3. Else if large-K kernel exists AND K >= chip.q4_small_k_threshold:
   → Use large-K GEMV (more threads for K-reduction)
4. Else:
   → Use standard GEMV
```
For a concrete example with Llama 3 8B (Q4_0):
| Operation | N | K | M4 Pro Decision | M1 Base Decision |
|---|---|---|---|---|
| Q projection | 4096 | 4096 | Standard (N < 32768) | Standard |
| K projection | 1024 | 4096 | Standard | Standard |
| V projection | 1024 | 4096 | Standard | Standard |
| O projection | 4096 | 4096 | Standard | Standard |
| Gate projection | 14336 | 4096 | Standard | Standard |
| Up projection | 14336 | 4096 | Standard | Standard |
| Down projection | 4096 | 14336 | Standard | Standard |
| Logit (vocab=128K) | 128256 | 4096 | Wide (N > 32768) | Wide |
The wide GEMV is only used for the vocabulary projection, where N (vocab size) is very large. All other projections use the standard kernel. The q4_small_k_threshold would matter for models with smaller hidden dimensions or for the KV projections, where K might be below the threshold on base chips.
The Backwards Compatibility Shim
The struct includes a backwards-compatibility factory:
```cpp
static ChipConfig from_gpu_cores(int cores) { return from_gpu(cores, 0); }
```
This allows older code that only knows the core count (not the GPU family) to still create a configuration. With family = 0, the M4-specific optimizations (larger SLC, lower K-threshold) are skipped, and the config falls back to conservative M1-era tuning.
Extending ChipConfig
If you were adding CUDA support, ChipConfig would need additional fields:
```cpp
// Hypothetical CUDA additions:
int sm_count;            // number of streaming multiprocessors
int shared_mem_per_sm;   // shared memory (threadgroup memory equivalent)
int warp_size;           // always 32, but good to be explicit
bool has_tensor_cores;
int tensor_core_gen;     // 0=none, 3=Volta, 4=Ampere, 5=Hopper
int l2_cache_bytes;      // L2 cache size (analogous to SLC)
```
The factory function would detect these from cuDeviceGetAttribute and derive tuning parameters. The should_fuse_weights decision would key off l2_cache_bytes instead of slc_bytes. The chain_decode_chunk would adapt to the different command submission model (CUDA streams vs Metal command buffers). The table builder would use the same ChipConfig interface, just with different values.
For Vulkan, additional fields might include:
```cpp
int subgroup_size;        // analogous to SIMD width (varies: 16, 32, 64)
bool has_subgroup_matrix; // cooperative matrix support
int max_workgroup_size;   // varies by device
```
The beauty of the ChipConfig approach is that the table builder does not care which backend produced the values. It just reads should_fuse_weights, wide_gemv_threshold, etc. The backend-specific detection logic is encapsulated in the factory function.[^3]
Summary
ChipConfig is a simple struct that maps hardware capabilities (GPU core count, GPU family) to optimal tuning parameters (SLC size, fusion decisions, GEMV variant selection, chain decode chunk size). It is built once at device creation and consumed at dispatch table build time. All decisions are baked into the dispatch table – nothing is queried or branched on at inference time.
[^1]: Chips and Cheese (2021-2024). "Apple M1/M2/M3/M4 Die Analysis." SLC sizes are estimated from die area analysis, performance counter measurements, and benchmark profiling. Apple does not officially disclose SLC sizes. See https://chipsandcheese.com/.

[^2]: Apple, "Apple M4 chip" (2024). The M4 chip announcement highlights a "next-generation GPU" with improved ray tracing and machine learning performance. Independent analysis suggests significant cache hierarchy improvements. See https://www.apple.com/newsroom/2024/05/apple-introduces-m4-chip/.

[^3]: This separation of concerns (hardware detection in the backend, tuning consumption in the core) is the same pattern used by the Device abstraction layer. ChipConfig is effectively a "capabilities descriptor" that the core engine reads, analogous to how ArchDescriptor is a "model capabilities descriptor."