
GEMV: The Workhorse of Decode

During token-by-token decode, every linear projection in the transformer – Q, K, V, output, gate, up, down, and the final logit projection – is a matrix-vector multiply (GEMV). The activation is a single row ([1, K]), the weight matrix is [N, K], and the output is [1, N]. Since every generated token triggers all of these GEMVs, the GEMV kernels are the single most important performance factor in decode throughput.

Akunu ships a large family of GEMV kernels in backend/metal/kernels/metal/kernel/matmul/, each specialized for a different weight format, matrix size, or fused operation. This chapter covers every major variant: FP16, Q4_0, Q8_0, Q4_K, Wide GEMV, MLX Q4, Batched GEMV, and the fused SiLU variants. For each, we will look at the actual Metal Shading Language (MSL) code, understand the thread mapping, and explain why each design decision was made.

Why So Many Kernels?

A reasonable question: why not write one generic GEMV kernel and parameterize it? The answer is that the dequantization logic for each format is fundamentally different, and templating it all into one kernel would prevent the Metal compiler from generating optimal code. Each format has different:

  • Block sizes: Q4_0 uses 32 elements per block, Q4_K uses 256 elements, Q8_0 uses 32 elements, MLX uses group_size (typically 32 or 64).
  • Memory access patterns: Q4_0 reads uint16 pairs, Q8_0 reads int8 arrays, Q4_K reads interleaved scale metadata, MLX reads from three separate arrays.
  • Arithmetic: Q4_0 uses the nibble pre-scaling trick, Q8_0 uses simple multiply, Q4_K uses multi-level scale+min reconstruction, MLX uses affine dequant (scale * quant + bias).
  • Optimal thread counts: Q4_0 works best with 128 or 256 threads, Q8_0 with 256, Q4_K with 256.

The compiler can specialize loop bounds, eliminate dead code paths, and optimize register allocation when each kernel has a single code path. A unified kernel with runtime dispatch would pay a branch penalty and register pressure cost on every iteration of the inner loop.

The downside is code duplication, but Akunu mitigates this through C++ templates (for the SiLU variants) and consistent naming conventions that make the kernel family navigable.

The Universal GEMV Pattern

Despite their diversity, all Akunu GEMV kernels share a common structure:

  1. Thread mapping: Map thread IDs to output rows and K-dimension positions.
  2. Accumulation loop: Each thread computes partial dot products for its assigned rows.
  3. SIMD reduction: simd_sum() across the 32-lane SIMD group to get the full dot product.
  4. Output write: Lane 0 of each SIMD group writes the final result.
                Thread Mapping (NR=4, NSIMD=4)
    ┌──────────────────────────────────────────────────┐
    │  Threadgroup (128 threads = 4 SIMD groups)       │
    │                                                  │
    │  SG0: rows 0-3    SG1: rows 4-7                  │
    │  SG2: rows 8-11   SG3: rows 12-15                │
    │                                                  │
    │  Within each SG (32 lanes):                      │
    │  Each lane reads VPT=16 elements per K-block     │
    │  Lane stride = 32 lanes × 16 = 512 elements      │
    └──────────────────────────────────────────────────┘

Interactive: How a GEMV Kernel Executes on Apple Silicon

[Interactive animation – omitted here. It steps through Akunu's FP16 GEMV kernel on a simplified model: weight matrix W [8 rows, K=2048] and activation vector x in DRAM, output y[8], with SG0 covering rows 0-3 and SG1 rows 4-7. Real kernels use 16 rows per threadgroup and K-blocks of 512 elements, but the principle is identical.]

FP16 GEMV (gemv_f16)

The FP16 GEMV is the simplest and serves as a reference design. Let’s examine it in detail.

Constants and Configuration

constant constexpr uint GEMV_NR = 4;     // rows per SIMD group
constant constexpr uint GEMV_NSIMD = 4;  // SIMD groups per TG
constant constexpr uint GEMV_VPT = 16;   // values per thread per K-block
constant constexpr uint GEMV_BLOCK_K = GEMV_VPT * SIMD_WIDTH;  // 512
Parameter   Value   Meaning
NR          4       Each SIMD group handles 4 output rows simultaneously
NSIMD       4       4 SIMD groups per threadgroup = 128 threads
VPT         16      Each thread processes 16 K-elements per block
BLOCK_K     512     K-elements processed per outer loop iteration
Rows/TG     16      NR * NSIMD = 16 output rows per threadgroup

Thread-to-Row Mapping

uint rows_per_tg = GEMV_NR * GEMV_NSIMD;  // 16
uint base_row = tgid * rows_per_tg + sgid * GEMV_NR;

Threadgroup tgid handles rows [tgid*16, tgid*16+15]. Within the threadgroup, SIMD group sgid handles 4 consecutive rows starting at base_row.

The Accumulation Loop

float sums[GEMV_NR] = {};
const uint k_aligned = (K / GEMV_BLOCK_K) * GEMV_BLOCK_K;

for (uint k_block = 0; k_block < k_aligned; k_block += GEMV_BLOCK_K) {
    uint k_off = k_block + slid * GEMV_VPT;

    // Load activation block (contiguous, cache-friendly)
    device const half4 *x4 = (device const half4 *)(x + k_off);
    float4 xf0 = float4(x4[0]), xf1 = float4(x4[1]),
           xf2 = float4(x4[2]), xf3 = float4(x4[3]);

    // Process NR rows
    for (uint r = 0; r < GEMV_NR; r++) {
        uint row = base_row + r;
        if (row >= N) break;
        device const half4 *w4 = (device const half4 *)(W + row * K + k_off);
        float4 wf0 = float4(w4[0]), wf1 = float4(w4[1]),
               wf2 = float4(w4[2]), wf3 = float4(w4[3]);
        sums[r] += dot(xf0, wf0) + dot(xf1, wf1)
                 + dot(xf2, wf2) + dot(xf3, wf3);
    }
}

Key observations:

  1. Vector loads: half4 loads read 8 bytes at once. 4 of them cover 16 half values = VPT.
  2. Activation reuse: The activation vector x is loaded once and reused across all NR=4 rows. This is the fundamental optimization of multi-row GEMV.
  3. dot() intrinsic: Metal’s dot(float4, float4) computes a 4-element dot product in a single instruction on Apple Silicon.
  4. Contiguous access: Both x and W are accessed contiguously within each K-block, maximizing cache line utilization.

SIMD Reduction and Output

for (uint r = 0; r < GEMV_NR; r++) {
    float total = simd_sum(sums[r]);
    if (slid == 0) {
        uint row = base_row + r;
        if (row < N) y[row] = half(total * params.alpha);
    }
}

simd_sum() sums a value across all 32 lanes of the SIMD group in log2(32) = 5 steps using butterfly shuffles.1 Only lane 0 writes the result; after the reduction every lane holds the same full sum, so the other writes would be redundant.

Remainder Handling

for (uint k = k_aligned + slid; k < K; k += 32) {
    float xk = float(x[k]);
    for (uint r = 0; r < GEMV_NR; r++) {
        uint row = base_row + r;
        if (row >= N) break;
        sums[r] += xk * float(W[row * K + k]);
    }
}

When K is not a multiple of 512, the remainder elements are handled with scalar accesses at stride 32 (one element per lane). This is less efficient but handles edge cases correctly.

Q4_0 GEMV (gemv_q4_0)

Quantized GEMV is where things get interesting. Q4_0 packs two 4-bit values per byte with a shared FP16 scale factor per block of 32 elements. Akunu provides two variants tuned for different K dimensions.

Q4_0 Data Layout

block_q4_0 (18 bytes per 32 elements):
┌───────────┬──────────────────────────────────┐
│  d (FP16) │  qs[16] (16 bytes = 32 nibbles)  │
│  2 bytes  │  low nibble  = elem 0..15        │
│           │  high nibble = elem 16..31       │
└───────────┴──────────────────────────────────┘

The Nibble Extraction Trick

The most performance-critical code is the dot product function:

inline float block_q4_0_dot_y(device const block_q4_0 *qb, float sumy,
                               thread float *yl, int il) {
    float d = qb->d;
    float acc[4] = {0.f, 0.f, 0.f, 0.f};
    device const uint16_t *qs = ((device const uint16_t *)qb + 1 + il/2);
    for (int i = 0; i < 8; i += 2) {
        acc[0] += yl[i + 0] * (qs[i / 2] & 0x000F);
        acc[1] += yl[i + 1] * (qs[i / 2] & 0x0F00);
        acc[2] += yl[i + 8] * (qs[i / 2] & 0x00F0);
        acc[3] += yl[i + 9] * (qs[i / 2] & 0xF000);
    }
    return d * (sumy * -8.f + acc[0] + acc[1] + acc[2] + acc[3]);
}

This is dense code. Let’s decode what it does:

  1. uint16 reads: Instead of reading individual bytes, it reads uint16_t values. Each uint16 contains 4 nibbles.

  2. Mask extraction: The four masks (0x000F, 0x0F00, 0x00F0, 0xF000) extract different nibbles from the uint16:

    • 0x000F: bits 0-3 (low nibble of low byte)
    • 0x00F0: bits 4-7 (high nibble of low byte)
    • 0x0F00: bits 8-11 (low nibble of high byte)
    • 0xF000: bits 12-15 (high nibble of high byte)
  3. Pre-scaled activations: The activations yl[] are pre-divided by powers of 256 to account for the positional shift of each nibble:

    yl[i + 0] = y0;             // × 1 (matches 0x000F)
    yl[i + 1] = y1 / 256.f;    // × 1/256 (matches 0x0F00, shifted by 8 bits)
    yl[i + 8] = y16 / 16.f;    // × 1/16 (matches 0x00F0, shifted by 4 bits)
    yl[i + 9] = y17 / 4096.f;  // × 1/4096 (matches 0xF000, shifted by 12 bits)
    
  4. Zero-point correction: The sumy * -8.f term subtracts 8 from each quantized value (Q4_0 uses unsigned 0-15 with a zero point of 8), multiplied by the sum of activation values.

This approach avoids any explicit bit-shifting operations, relying instead on pre-scaled multiplications that compile to efficient multiply-add sequences on Apple Silicon.

Two Size Variants

Variant       SIMD Groups   Threads   Rows/TG   Optimal K
gemv_q4_0     4 (NSG=4)     128       16        K < 2048
gemv_q4_0_l   8 (NSG=8)     256       32        K >= 2048

The large variant doubles the threadgroup size, which provides more parallelism across the K dimension. For large K, this better saturates the GPU’s memory bandwidth. For small K, the extra threads would be wasted because there are not enough K-blocks to keep them busy.

The threshold is hardware-dependent and configured in ChipConfig:

if (family >= 9 && cores >= 16)
    c.q4_small_k_threshold = 512;   // M4 Pro+
else if (cores >= 16)
    c.q4_small_k_threshold = 1024;  // M3 Pro+
else
    c.q4_small_k_threshold = 2048;  // Base chips

Q8_0 GEMV (gemv_q8_0)

Q8_0 is simpler than Q4_0 because each element is a full byte – no nibble extraction needed:

constant constexpr uint NR = 4;
constant constexpr uint NSIMD = 8;  // 256 threads, 32 rows/TG

for (uint kb = tiisg; kb < nblocks; kb += SIMD_WIDTH) {
    const uint base_k = kb * QK8_0;

    device const half4 *x4 = (device const half4 *)(x + base_k);
    float4 xf[8];
    for (uint i = 0; i < 8; i++)
        xf[i] = float4(x4[i]);

    for (uint r = 0; r < NR; r++) {
        uint row = base_row + r;
        if (row >= N) break;
        device const block_q8_0 &blk = W[row * nblocks + kb];
        float d = float(blk.d);
        float dot = 0;
        for (uint g = 0; g < 8; g++) {
            dot += xf[g][0] * float(blk.qs[g*4+0])
                 + xf[g][1] * float(blk.qs[g*4+1])
                 + xf[g][2] * float(blk.qs[g*4+2])
                 + xf[g][3] * float(blk.qs[g*4+3]);
        }
        sums[r] += d * dot;
    }
}

Each block_q8_0 contains 32 int8 values with a shared FP16 scale. The 8-way unrolled inner loop processes all 32 elements (4 per group, 8 groups). The int8-to-float conversion is straightforward – no masking or shifting needed.

Q8_0 is 2x the size of Q4_0 but significantly faster because the dequantization is trivial. It is used for models where quality is prioritized over memory footprint.

Q4_K GEMV (gemv_q4_k)

Q4_K is a “super-block” format from GGML that uses a more sophisticated quantization scheme with per-sub-block scales and mins. Each super-block covers 256 elements (vs Q4_0’s 32) and occupies 144 bytes: two FP16 super-block values (scale and min), 12 bytes of packed 6-bit sub-block scale/min metadata, and 128 bytes of quantized nibbles.

The kernel is ported directly from llama.cpp’s kernel_mul_mv_q4_K_f32_impl:

constant constexpr short GEMV_Q4K_NCOLS = 2;   // nr0: rows per SIMD group
constant constexpr short GEMV_Q4K_NSG   = 8;   // SIMD groups per threadgroup
Parameter   Value   Meaning
NCOLS       2       2 rows per SIMD group (vs 4 for Q4_0)
NSG         8       8 SIMD groups = 256 threads
Rows/TG     16      NCOLS * NSG = 16

The reduced NCOLS (2 vs 4) reflects the larger per-block metadata overhead: more registers are needed for scale/min decoding, leaving less room for row parallelism.

Thread Mapping

The Q4_K thread mapping is more complex than Q4_0:

const short ix = tiisg / 8;   // 0..3 (4 threads share blocks)
const short it = tiisg % 8;   // 0..7
const short iq = it / 4;      // 0 or 1 (sub-block selector)
const short ir = it % 4;      // 0..3 (element within sub-block)

4 threads share a block (ix selects which group of blocks), and within each block, the 8 remaining thread indices select sub-blocks and elements. This complex mapping ensures that each thread reads a contiguous portion of the quantized data.

Scale Decoding

The scales are packed in a 12-byte structure using 6-bit values with 2-bit offsets:

constexpr uint16_t kmask1 = 0x3f3f;
constexpr uint16_t kmask2 = 0x0f0f;
constexpr uint16_t kmask3 = 0xc0c0;

sc16[0] = sc[0] & kmask1;
sc16[1] = sc[2] & kmask1;
sc16[2] = ((sc[4] >> 0) & kmask2) | ((sc[0] & kmask3) >> 2);
sc16[3] = ((sc[4] >> 4) & kmask2) | ((sc[2] & kmask3) >> 2);

This decodes the interleaved 6-bit scale and min values from the packed representation. kmask1 extracts the low 6 bits of each byte for the first four values; for the remaining values, kmask2 takes 4 low bits from the later bytes while kmask3 contributes the top 2 bits of the earlier bytes (shifted down), reassembling the full 6-bit quantities.

Q4_K Accumulation and Output

The final dot product combines the dequantized values with the per-sub-block scales and the block-level scale/minimum:

sumf[row] += dh[0] * ((acc1[0] + 1.f/256.f * acc1[1]) * sc8[0] +
                       (acc1[2] + 1.f/256.f * acc1[3]) * sc8[1] * 1.f/16.f +
                       (acc2[0] + 1.f/256.f * acc2[1]) * sc8[4] +
                       (acc2[2] + 1.f/256.f * acc2[3]) * sc8[5] * 1.f/16.f) -
              dh[1] * (sumy[0] * sc8[2] + sumy[1] * sc8[3] +
                       sumy[2] * sc8[6] + sumy[3] * sc8[7]);

The dh[0] is the block scale and dh[1] is the block minimum. The sc8[] values are the sub-block scales decoded earlier. The 1.f/256.f and 1.f/16.f factors compensate for the nibble positional encoding (same technique as Q4_0). The minimum term (dh[1] * sumy * sc8) subtracts the zero-point contribution.

This complex expression evaluates to a simple conceptual operation: sum(dequant(weight[i]) * activation[i]) for all elements in the super-block, but expressed in terms of the packed representation to avoid explicit dequantization into a temporary buffer.

Q4_K vs Q4_0: When to Use Which

Aspect                 Q4_0                      Q4_K
Block size             32 elements               256 elements (super-block)
Metadata overhead      2 bytes per 32 elements   16 bytes per 256 elements
Scale granularity      Per 32 elements           Per 32 (sub-block) + per 256 (super-block)
Has minimum/bias       No (zero point = 8)       Yes (per sub-block min)
Quantization quality   Good                      Better (lower quantization error)
Dequant complexity     Low                       Medium
Memory per element     0.5625 bytes              0.5625 bytes

Q4_K matches Q4_0’s memory footprint while delivering better quality, at the cost of more complex dequantization. In practice, Q4_K is preferred for larger models where the quality difference matters, while Q4_0 is used for speed-critical paths or when model quality is less sensitive to quantization.

Wide GEMV (gemv_wide_f16)

The “wide” variant uses 8 SIMD groups (256 threads) instead of 4, processing 32 rows per threadgroup:

constant constexpr uint GEMV_WIDE_NR = 4;
constant constexpr uint GEMV_WIDE_NSIMD = 8;     // 8 SIMD groups
constant constexpr uint GEMV_WIDE_VPT = 16;
constant constexpr uint GEMV_WIDE_BLOCK_K = GEMV_WIDE_VPT * SIMD_WIDTH;  // 512

The algorithm is identical to gemv_f16; only the NSIMD parameter changes. The wider threadgroup provides better occupancy on chips with many GPU cores (Pro, Max, Ultra), where the smaller threadgroup would leave execution units idle.

Wide variants exist for multiple formats: gemv_wide_f16, gemv_wide_q4_0, gemv_wide_q4_k, gemv_wide_q8_0, gemv_wide_mlx_q4, and gemv_wide_mlx_q8.

MLX Q4 GEMV (gemv_mlx_q4)

MLX-format quantization differs from GGML’s block format. Instead of per-block scale factors interleaved with data, MLX stores weights, scales, and biases in three separate contiguous arrays:

Packed buffer layout:
┌────────────────────┬────────────────┬────────────────┐
│ U32 weights        │ FP16 scales    │ FP16 biases    │
│ [N × K/8] uint32   │ [N × n_groups] │ [N × n_groups] │
└────────────────────┴────────────────┴────────────────┘

The dequantization formula is: value = scale * quant_value + bias

This is an affine quantization scheme (scale + bias) vs GGML’s scaled scheme (scale + zero-point). The per-group bias allows better representation of asymmetric distributions.

Function Constant Specialization

constant uint FC_GROUP_SIZE [[function_constant(0)]];
constant uint FC_K_DIM      [[function_constant(1)]];
constant bool FC_SPECIALIZED = is_function_constant_defined(FC_GROUP_SIZE);

When the host knows the group size and K dimension at pipeline creation time, it specializes the kernel with these as compile-time constants. This enables the Metal compiler to:

  1. Replace division/modulo by group_size with shifts/masks (if power of 2)
  2. Unroll loops with known trip counts
  3. Eliminate branches on K alignment

Pre-Scaled Activation Trick

constant constexpr float PS0 = 1.0f;
constant constexpr float PS1 = 1.0f / 16.0f;
constant constexpr float PS2 = 1.0f / 256.0f;
constant constexpr float PS3 = 1.0f / 4096.0f;

const float4 ps4 = float4(PS0, PS1, PS2, PS3);
float4 xs0 = xf0 * ps4, xs1 = xf1 * ps4, xs2 = xf2 * ps4, xs3 = xf3 * ps4;

This is the same pre-scaling trick used in Q4_0, adapted for MLX’s nibble layout. Each uint16 contains 4 nibbles at bit positions 0, 4, 8, and 12. The pre-scale factors compensate for the positional shift, allowing the kernel to multiply the pre-scaled activation directly with the masked uint16 value.

The Accumulation

float dot = xs0[0] * float(w0 & 0x000fu) + xs0[1] * float(w0 & 0x00f0u)
          + xs0[2] * float(w0 & 0x0f00u) + xs0[3] * float(w0 & 0xf000u)
          + xs1[0] * float(w1 & 0x000fu) + xs1[1] * float(w1 & 0x00f0u)
          + xs1[2] * float(w1 & 0x0f00u) + xs1[3] * float(w1 & 0xf000u)
          + ...;
sums[r] += s * dot + b * x_sum;

The s * dot term handles the scale, and b * x_sum handles the bias. The bias contributes b * sum(x_elements) because each quantized value is multiplied by the bias (via the affine formula), and summing gives bias * sum(activations).

Template-Based Variants

The MLX GEMV uses a template to generate both standard and wide variants:

template<int NSIMD_T, int NR_T = 4>
void gemv_mlx_q4_impl(...) { ... }

// Standard: 4 SIMD groups, 128 threads, 16 rows/TG
kernel void gemv_mlx_q4(...) {
    gemv_mlx_q4_impl<4, 4>(x, packed, y, params, tgpig, tiisg, sgitg);
}

// Wide: 8 SIMD groups, 256 threads, 32 rows/TG
kernel void gemv_mlx_q4_l(...) {
    gemv_mlx_q4_impl<8, 4>(x, packed, y, params, tgpig, tiisg, sgitg);
}

Batched GEMV (gemv_batched_f16)

Batched GEMV handles the case where M > 1 but is still small (2-16 rows). This occurs during:

  • Speculative decoding verification (batch of draft + 1 tokens)
  • Very short prefill sequences (2-8 tokens)

constant uint FC_BATCH_M [[function_constant(11)]];
constant constexpr uint MAX_BATCH_M = 16;
constant constexpr uint BGEMV_NCOLS = 4;
constant constexpr uint BGEMV_NSIMD = 8;

float sums[BGEMV_NCOLS][MAX_BATCH_M] = {};

for (uint k = slid; k < K; k += 32) {
    for (uint bm = 0; bm < batchM; bm++) {
        float xk = float(x[bm * lda + k]);
        for (uint c = 0; c < BGEMV_NCOLS; c++) {
            uint col = base_col + c;
            if (col < N) sums[c][bm] += xk * float(W[col * K + k]);
        }
    }
}

The key difference from standard GEMV: the inner loop iterates over both batch rows (bm) and output columns (c). The activation matrix has batchM rows, each producing an independent output row.

FC_BATCH_M is a function constant that lets the host specialize the kernel for a known batch size, enabling the compiler to unroll the batch loop.

Fused SiLU GEMV (gemv_q4_0_silu)

The fused SiLU variant is a significant optimization for the FFN down-projection. Instead of running SiLU(gate) * up as a separate element-wise dispatch between the gate/up GEMVs and the down GEMV, it computes the activation inline while performing the down-projection GEMV:

for (short i = 0; i < 8; i += 2) {
    const float g0  = float(gb[i +  0]);
    const float g1  = float(gb[i +  1]);
    const float g16 = float(gb[i + 16]);
    const float g17 = float(gb[i + 17]);

    const float a0  = (g0  / (1.f + exp(-g0)))  * float(ub[i +  0]);
    const float a1  = (g1  / (1.f + exp(-g1)))  * float(ub[i +  1]);
    const float a16 = (g16 / (1.f + exp(-g16))) * float(ub[i + 16]);
    const float a17 = (g17 / (1.f + exp(-g17))) * float(ub[i + 17]);

    sumy     += a0 + a1 + a16 + a17;
    yl[i + 0] = a0;
    yl[i + 1] = a1  / 256.f;
    yl[i + 8] = a16 / 16.f;
    yl[i + 9] = a17 / 4096.f;
}

The SiLU computation x / (1 + exp(-x)) is computed per element, then multiplied by the corresponding up projection value. The result feeds into the same Q4_0 dot product machinery used by the standard kernel.

Performance impact: This fusion eliminates:

  • The SiLU element-wise kernel dispatch (the gate and up values are read directly from their GEMV output buffers)
  • The intermediate buffer write and read for the activated vector (SiLU(gate) * up never touches device memory)

For a 7B model, this saves one dispatch and two buffer round-trips per layer, or 32 dispatches across 32 layers. The tradeoff is more ALU work per thread (the exp() and division), but since GEMV is memory-bound, the extra ALU is effectively free.2

Fused SiLU variants exist for Q4_0, Q8_0, Q4_K, MLX Q4, and other formats. Each is a template instantiation with the corresponding dequantization logic.

MLX Embedding Lookup (embedding_lookup_mlx_q4)

While not strictly a GEMV, the MLX embedding lookup kernel lives alongside the GEMV kernels and uses similar dequantization:

const uint32_t word = W_u32[token_id * n_u32_per_row + u32_idx];
const uint qval = (word >> (within * bits)) & mask;
output[token_idx * K + d_idx] = half(s * float(qval) + b);

It supports arbitrary bit widths (not just 4) via the bits parameter, extracting quantized values from packed 32-bit words. The dispatch is 2D: (dim, seq_len), with one thread per output element.

Kernel Selection Strategy

The host selects GEMV kernels based on the weight format and matrix dimensions. Here is the decision tree:

Weight Format   K < threshold            K >= threshold             N > wide_threshold
FP16            gemv_f16 (4 SG)          gemv_f16 (4 SG)            gemv_wide_f16 (8 SG)
Q4_0            gemv_q4_0 (4 SG, 128t)   gemv_q4_0_l (8 SG, 256t)   gemv_wide_q4_0
Q8_0            gemv_q8_0 (8 SG, 256t)   same                       gemv_wide_q8_0
Q4_K            gemv_q4_k (8 SG, 256t)   same                       gemv_wide_q4_k
MLX Q4          gemv_mlx_q4 (4 SG)       gemv_mlx_q4_l (8 SG)       gemv_wide_mlx_q4

The wide_gemv_threshold is typically 32768 (output dimension), which is only exceeded by the logit projection in models with very large vocabularies.

On Pro+ chips (cores >= 16), the “wide standard” flag is set, which promotes even the standard GEMV to use 8 SIMD groups. This is because Pro chips have enough GPU cores to benefit from the additional parallelism even at moderate N dimensions.

Memory Access Patterns

Understanding the memory access pattern is crucial for GEMV performance, since it is memory-bound:

For gemv_f16 with K=4096, NR=4:

Thread 0: reads x[0..15],    W[row][0..15]    for 4 rows
Thread 1: reads x[16..31],   W[row][16..31]   for 4 rows
...
Thread 31: reads x[496..511], W[row][496..511] for 4 rows

Next K-block: Thread 0 reads x[512..527], etc.

Within a K-block, all threads in a SIMD group access contiguous memory for both x and W, so the GPU can coalesce the accesses into full cache-line reads. The activation vector x is loaded once per K-block and reused NR=4 times, giving a theoretical 4x activation reuse factor.

For quantized formats, the memory access is even more favorable: Q4_0 reads 18 bytes per 32 elements (one block), compared to 64 bytes for FP16. This ~3.6x memory reduction translates directly to higher effective bandwidth for the weight reads, which dominate the memory traffic.

Q5_0, Q5_K, Q6_K, Q2_K, Q3_K GEMV Variants

Beyond the core Q4_0, Q8_0, and Q4_K kernels, Akunu supports several additional GGML quantization formats:

Q5_0 (5-bit per element, group size 32): Each block contains 4 bytes of high-bit storage in addition to the Q4_0-style nibble data. The extra bit per element improves precision slightly. The kernel extracts the 5th bit from a separate bitfield and ORs it into the 4-bit value during dequantization.

Q5_K (5-bit K-quant, super-block 256): Similar to Q4_K but with an extra bit per element. Uses the same super-block structure with per-sub-block scales and mins, plus an additional bit plane.

Q6_K (6-bit K-quant, super-block 256): Provides significantly better quality than Q4_K at 1.5x the memory cost. The dequantization extracts 6-bit values from packed bytes, using two distinct bit planes. Popular for models where quality is critical but FP16 is too expensive.

Q2_K (2-bit K-quant, super-block 256): The most aggressive quantization, storing only 2 bits per element plus 4-bit scales and 4-bit mins. Quality is noticeably degraded but memory usage is halved vs Q4_0. Used for very large models that would not otherwise fit in memory.

Q3_K (3-bit K-quant, super-block 256): A middle ground between Q2_K and Q4_K. Uses 3 bits per element with similar super-block metadata to Q4_K. Provides better quality than Q2_K at 1.5x the memory.

All of these kernels follow the same multi-row GEMV pattern described for Q4_0 and Q4_K, differing only in:

  • The number of bits extracted per element (2, 3, 5, or 6)
  • The block/super-block metadata layout
  • The dequantization arithmetic

Each has standard, wide, and batched variants, plus fused SiLU variants where applicable. The code is generated from templates where possible, but the dequantization functions are hand-written for each format to ensure optimal register usage and instruction scheduling.

BF16 GEMV (gemv_bf16)

Akunu also supports BF16 (Brain Float 16) weights, which use 8 exponent bits and 7 mantissa bits (vs FP16’s 5 exponent and 10 mantissa). The BF16 GEMV kernel is structurally identical to the FP16 kernel but converts each BF16 weight to FP32 during loading.

BF16 weights are used by some models (particularly those trained in BF16) and offer better dynamic range than FP16 at the cost of reduced precision. On Apple Silicon, there is no native BF16 support in the ALU (unlike NVIDIA’s Ampere+), so BF16 values are converted to FP32 for computation. The conversion is simple: BF16 is the upper 16 bits of an IEEE 754 float32, so the conversion is just a left-shift by 16 bits.

Arithmetic Intensity Analysis

Understanding why GEMV is memory-bound requires analyzing its arithmetic intensity – the ratio of compute operations to memory bytes transferred.

For an FP16 GEMV computing y[N] = x[K] @ W[N,K]:

Metric                 Value                       Notes
Operations             2 * N * K                   multiply + add per element
Weight bytes           N * K * 2                   FP16, 2 bytes per element
Activation bytes       K * 2                       Read once
Output bytes           N * 2                       Write once
Total bytes            2NK + 2K + 2N ≈ 2NK         Dominated by weight reads
Arithmetic intensity   2NK / 2NK = 1.0 FLOP/byte   With multi-row reuse: NR * 1.0

An arithmetic intensity of 1.0 FLOP/byte means that for every byte of memory transferred, we perform only 1 floating-point operation. Apple Silicon GPUs have ~200-400 GB/s memory bandwidth but 7-14 TFLOPS of compute. At 1.0 FLOP/byte, we would need 7-14 TB/s of bandwidth to saturate the compute – about 35x more than available.3

Multi-row processing (NR=4) improves this to 4 FLOP/byte by reusing the activation vector, but this is still well within the memory-bound regime. The roofline model confirms that GEMV performance is limited by memory bandwidth, not compute capability.

For quantized formats, the effective arithmetic intensity is higher because less data is transferred:

Format   Bytes per element   Effective AI (NR=4)
FP16     2.0                 4.0 FLOP/byte
Q8_0     ~1.06               ~7.5 FLOP/byte
Q4_0     ~0.56               ~14.3 FLOP/byte
Q4_K     ~0.56               ~14.3 FLOP/byte

Q4_0 achieves 14.3 FLOP/byte – still memory-bound on Apple Silicon, but much closer to the roofline ridge point. This is why quantization provides such a large speedup for decode: the weight read traffic is cut by ~3.5x compared to FP16.

The Bandwidth Utilization Picture

On an M4 Pro with ~200 GB/s memory bandwidth and a 7B Q4_0 model (dim=4096, ffn_dim=14336), the theoretical decode speed is:

Weight bytes per token:
  QKV:  (4096 + 1024 + 1024) * 4096 * 0.56 = ~14.0 MB
  O:    4096 * 4096 * 0.56 = ~9.4 MB
  Gate: 14336 * 4096 * 0.56 = ~32.9 MB
  Up:   14336 * 4096 * 0.56 = ~32.9 MB
  Down: 4096 * 14336 * 0.56 = ~32.9 MB
  ─────────────────────────────────
  Per layer: ~122 MB
  32 layers: ~3.9 GB
  + Logit projection: 128256 * 4096 * 0.56 = ~294 MB (vocab=128K)
  Total: ~4.2 GB

Theoretical tok/sec at 200 GB/s: 200 / 4.2 ≈ 47.6 tok/sec

In practice, Akunu achieves 80-90% of this theoretical maximum. The gap comes from:

  1. Non-weight memory traffic (activations, KV cache reads, norm computation)
  2. Kernel launch overhead (minimized by chain decode)
  3. Attention computation (memory-bound but separate from GEMV)
  4. Imperfect memory coalescing at tile boundaries

Register Pressure and Occupancy

Each GEMV kernel maintains NR float accumulators per SIMD group, plus registers for the loaded activation and weight values. For the FP16 kernel:

Register UseCountBytes
sums[NR=4]4 floats16 bytes
xf0-xf3 (activation)4 float464 bytes
wf0-wf3 (weight, per row)4 float464 bytes
Loop indices, temporaries~832 bytes
Total per thread~176 bytes

Apple Silicon GPUs provide on the order of 96 32-bit registers per thread (384 bytes), so the GEMV kernels use less than half the available register file. This ensures high occupancy: the GPU can keep many SIMD groups resident per core, hiding memory latency by switching between them.

For the Q4_K kernel, register pressure is higher (16-element float arrays for yl and yh, plus the scale decoding registers), but still within bounds. The 2-row-per-SIMD-group design (NCOLS=2) was specifically chosen to fit in registers.

GPU Dispatch Geometry Examples

Let’s work through the dispatch for a concrete case: the K projection of Llama 3.1 8B, where N=1024 (kv_dim) and K=4096, using Q4_0 quantization:

K = 4096 > q4_small_k_threshold (2048 on base chip)
→ Use gemv_q4_0_l (NSG=8, 256 threads)
→ Rows per TG = NR0 * NSG = 4 * 8 = 32
→ Grid size = ceil(1024 / 32) = 32 threadgroups
→ Total threads = 32 * 256 = 8192

Within each threadgroup:
  8 SIMD groups, each handling 4 output rows
  Each SIMD group walks the row's K/32 = 128 Q4_0 blocks,
  covering 16 blocks per iteration across its lanes:
  128/16 = 8 outer loop iterations, 32 values per block

For the logit projection (N=128256, K=4096):

N = 128256 > wide_gemv_threshold (32768)
→ Use gemv_wide_q4_0 (NSG=8, 256 threads, 32 rows/TG)
→ Grid size = 128256 / 32 = 4008 threadgroups
→ Total threads = 4008 * 256 ≈ 1M

This is the largest GEMV dispatch in the model, taking ~2ms on M4 Pro.

Complete Kernel Variant Matrix

Here is the full matrix of all GEMV kernel files in Akunu:

Base Kernel   Standard            Wide                Batched                Fused SiLU
FP16          gemv_f16            gemv_wide_f16       gemv_batched_f16       —
BF16          gemv_bf16           —                   —                      —
Q4_0          gemv_q4_0 (+_l)     gemv_wide_q4_0      gemv_batched_q4_0      gemv_q4_0_silu (+_l)
Q4_1          gemv_q4_1           gemv_wide_q4_1      gemv_batched_q4_1      —
Q5_0          gemv_q5_0           gemv_wide_q5_0      gemv_batched_q5_0      —
Q8_0          gemv_q8_0           gemv_wide_q8_0      gemv_batched_q8_0      gemv_q8_0_silu
Q4_K          gemv_q4_k           gemv_wide_q4_k      gemv_batched_q4_k      gemv_q4_k_silu
Q5_K          gemv_q5_k           —                   gemv_batched_q5_k      —
Q6_K          gemv_q6_k           —                   gemv_batched_q6_k      —
Q2_K          gemv_q2_k           —                   gemv_batched_q2_k      —
Q3_K          gemv_q3_k           —                   gemv_batched_q3_k      —
MLX Q4        gemv_mlx_q4 (+_l)   gemv_wide_mlx_q4    gemv_batched_mlx_q4    gemv_mlx_q4_silu
MLX Q3        —                   —                   —                      gemv_mlx_q3_silu
MLX Q6        —                   —                   —                      gemv_mlx_q6_silu
MLX Q8        —                   gemv_wide_mlx_q8    —                      gemv_mlx_q8_silu
MLX Gen       gemv_mlx_gen        gemv_wide_mlx_gen   gemv_batched_mlx_gen   —

That is over 50 kernel entry points. Each is generated from a relatively small set of patterns (the base algorithm is the same), but the dequantization logic, thread counts, and buffer layouts differ per format.

The fused SiLU variants are particularly valuable: they collapse three operations (gate GEMV, up GEMV, SiLU multiply) into one kernel, eliminating two dispatches and three intermediate buffer round-trips per layer. For a 32-layer model, this saves 64 GEMV dispatches and ~192 buffer reads/writes.
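The per-model savings follow directly from the dispatch counts (a back-of-the-envelope check, not measured data):

```python
LAYERS = 32
UNFUSED_FFN_DISPATCHES = 3   # gate GEMV, up GEMV, SiLU multiply
FUSED_FFN_DISPATCHES = 1     # one fused kernel
INTERMEDIATE_BUFFERS = 3     # each written once and read once per layer

dispatches_saved = (UNFUSED_FFN_DISPATCHES - FUSED_FFN_DISPATCHES) * LAYERS
buffer_accesses_saved = INTERMEDIATE_BUFFERS * 2 * LAYERS  # write + read

print(dispatches_saved, buffer_accesses_saved)  # 64 192
```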

Metal-Specific Optimizations

Several Metal-specific features are exploited across the GEMV kernels:

make_uniform()

The Q4_0 and Q4_K kernels use make_uniform() on parameters:

const int N = make_uniform(int(params.N));
const int K = make_uniform(int(params.K));

make_uniform() marks a value as uniform across the SIMD group – a promise to the Metal compiler that every lane holds the same value. This enables the compiler to:

  • Use scalar registers instead of vector registers for the value
  • Hoist loop-invariant computations out of SIMD-divergent branches
  • Optimize conditional execution (all lanes take the same branch)

Without make_uniform(), the compiler must conservatively assume that different lanes might have different values for params.N, generating slower divergent code.

half4 and float4 Vector Operations

Apple Silicon’s GPU handles 4-component vector operations efficiently: dot(float4, float4) compiles to a short chain of fused multiply-adds that the scheduler can issue back-to-back. Using half4 for loads ensures that the load-store unit reads 8 bytes in one transaction, which is the optimal load granularity for Apple’s memory subsystem.

The pattern of loading as half4, converting to float4, computing, and accumulating is optimal:

  • Load: 8 bytes in one transaction (half4)
  • Convert: hardware F16->F32 widening (1 cycle)
  • Compute: FMA in F32 (full precision accumulation)
  • This avoids F16 accumulation errors while keeping loads narrow
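The accumulation-precision point can be demonstrated with Python’s stdlib binary16 support (a host-side sketch; the GPU behavior it emulates is analogous):

```python
import struct

def to_half(x: float) -> float:
    # Round-trip through IEEE binary16 to emulate half-precision rounding.
    return struct.unpack('<e', struct.pack('<e', x))[0]

K = 4096
x = to_half(0.01)  # one product term, already representable in half

# Wide (F32-style) accumulation: inputs are half, running sum stays wide.
wide = 0.0
for _ in range(K):
    wide += x

# Naive F16 accumulation: the running sum is rounded to half every step.
narrow = 0.0
for _ in range(K):
    narrow = to_half(narrow + x)

print(f"wide={wide:.4f}  narrow={narrow:.4f}")
# The narrow sum stalls once its magnitude makes each 0.01 increment
# smaller than half an ULP, landing far below the true total of ~40.97.
```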

SIMD Group Size

On Apple Silicon, the SIMD group size is always 32 (NVIDIA’s warp size is likewise fixed at 32, while AMD wavefronts can be either 32 or 64). This simplifies kernel design:

  • All simd_sum() and simd_max() calls reduce exactly 32 values
  • Thread-to-row mappings can use fixed constants (SIMD_WIDTH=32)
  • No need for runtime warp/wavefront size queries

Threadgroup Memory Absence

Unlike GEMM kernels, most GEMV kernels use no threadgroup memory. The entire computation is done in registers with simd_sum() for cross-lane reduction. This is possible because GEMV’s data flow is simple: each SIMD group independently computes NR output rows, and the only cross-lane communication is the final reduction.

This zero-TG-memory design has two benefits:

  1. No threadgroup memory allocation overhead at dispatch time
  2. No barrier overhead (simd_sum is barrier-free within a SIMD group)

The only exceptions are the batched GEMV (which uses shared memory for cross-SIMD-group reduction when the batch dimension exceeds one SIMD group) and some Q4_K variants that use threadgroup memory for scale decoding shared across threads.
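The barrier-free reduction is a butterfly (shuffle-XOR) pattern; this sketch emulates it for a 32-lane SIMD group:

```python
SIMD_WIDTH = 32

def simd_sum(lanes):
    """Emulate Metal's simd_sum(): log2(32) = 5 shuffle-XOR steps,
    using no threadgroup memory and no barriers."""
    vals = list(lanes)
    offset = SIMD_WIDTH // 2
    while offset > 0:
        # Each lane adds the value held by lane (lane_id XOR offset).
        vals = [vals[i] + vals[i ^ offset] for i in range(SIMD_WIDTH)]
        offset //= 2
    return vals  # every lane now holds the full total

partials = [float(i) for i in range(SIMD_WIDTH)]  # per-lane dot-product pieces
result = simd_sum(partials)
assert all(v == sum(partials) for v in result)
```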

Summary

Akunu’s GEMV kernel family is the performance backbone of decode. The key design principles are:

  1. Multi-row processing (NR): Reusing the activation vector across multiple output rows to improve arithmetic intensity from 1 to NR FLOP/byte.
  2. Vectorized loads: half4 and uint16_t reads to maximize memory bandwidth utilization via coalesced transactions.
  3. SIMD reduction: Apple’s simd_sum() for efficient cross-lane communication without threadgroup barriers.
  4. Size-specialized variants: Different threadgroup sizes (128 vs 256 threads) for different K dimensions and hardware tiers.
  5. Fused operations: SiLU + GEMV fusion to eliminate intermediate buffers, collapsing three dispatches into one and saving six buffer accesses per layer.
  6. Function constant specialization: Compile-time K and group_size for better code generation and loop unrolling.
  7. Format-specific dequantization: Nibble pre-scaling for Q4_0, multi-level scale+min for Q4_K, affine dequant for MLX – each tuned for its format’s memory layout.
  8. Zero threadgroup memory: Most GEMV kernels avoid threadgroup memory entirely, relying on SIMD intrinsics for all cross-lane communication.

The GEMV kernels collectively represent Akunu’s largest investment in kernel engineering – over 50 entry points across 30+ source files – because decode throughput directly determines how fast text appears on screen. Every percentage point of bandwidth utilization improvement translates directly to faster generation.

Practical Impact: End-to-End Decode Profile

To understand how GEMV kernels fit into the full picture, here is a representative decode profile for Llama 3.1 8B Q4_0 on M4 Pro:

| Operation | Time per Token | % of Total | Kernel Type |
|---|---|---|---|
| Embedding lookup | 0.01ms | 0.1% | embedding |
| QKV GEMV (x32 layers) | 2.4ms | 30% | gemv_q4_0 |
| RoPE + KV write (x32) | 0.3ms | 4% | rope_kv_write |
| Attention (x32) | 1.2ms | 15% | flash_attention |
| O GEMV (x32) | 0.8ms | 10% | gemv_q4_0 |
| Fused SiLU GEMV (x32) | 2.8ms | 35% | gemv_q4_0_silu |
| Output norm | 0.01ms | 0.1% | rmsnorm |
| Logit GEMV | 0.3ms | 4% | gemv_q4_0 |
| Argmax | 0.01ms | 0.1% | argmax |
| Total | ~8ms | 100% | |

GEMV operations account for ~79% of the total decode time. The fused SiLU GEMV is the single largest contributor because the FFN’s intermediate dimension (14336) is 3.5x the model dimension (4096). Optimizing GEMV kernels has the highest return on investment of any kernel engineering work in Akunu.

The most impactful optimizations in Akunu’s GEMV history were:

  1. Multi-row (NR=4): Gave a 2-3x speedup over NR=1 by reusing the activation vector.
  2. Fused SiLU: Collapsed three dispatches per layer into one, saving ~15% of total decode time.
  3. Hardware-tuned NSG: Matching the threadgroup size to the hardware tier improved occupancy by ~10%.
  4. Function constant K specialization (for MLX): Enabled 5-8% speedup from compiler loop optimizations.
  5. half4 vectorized loads: Improved memory coalescing by 4x compared to scalar half loads.

These optimizations compound: a kernel that is 2x faster from multi-row, 15% faster from fusion, and 10% faster from tuning delivers ~2.5x total improvement over a naive implementation.



  1. Apple. “Metal Shading Language Specification, Version 3.1.” Section 6.9, SIMD-group Functions. simd_sum() performs a butterfly reduction across all active threads in a SIMD group, completing in ceil(log2(SIMD_WIDTH)) steps. See https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf.

  2. Williams, S., Waterman, A., and Patterson, D. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM, 2009. GEMV is firmly in the memory-bound regime (low arithmetic intensity), meaning additional ALU operations like SiLU computation do not affect wall-clock time. See https://doi.org/10.1145/1498765.1498785.

  3. The roofline model shows that an algorithm with arithmetic intensity below the “ridge point” (peak FLOPS / peak bandwidth) is memory-bound. For Apple M4 Pro: ridge point = 14 TFLOPS / 200 GB/s = 70 FLOP/byte. GEMV at 1-14 FLOP/byte is well below this threshold.