Appendix C: Glossary
This glossary covers the terms used throughout this book and in the akunu source code. Each entry gives a definition pitched at a CS audience and notes where the concept appears in akunu’s implementation. Terms are listed alphabetically.
ALU (Arithmetic Logic Unit)
The functional unit within a GPU core that performs integer and floating-point arithmetic. Apple Silicon GPUs have ALUs organized into SIMD groups of 32 threads. In akunu, ALU utilization is typically low during decode (memory-bound) and high during prefill (compute-bound). See Chapter 55 for roofline analysis.
Apple GPU Family
Apple’s versioning scheme for GPU feature sets. Family 7 = M1, Family 8 = M2/M3, Family 9 = M4. In akunu, ChipConfig::gpu_family stores this value, which is used to select kernel variants and tuning parameters (e.g., native BF16 support requires Family 9+).
Argmax
The operation that returns the index of the maximum value in a vector. In greedy decoding, argmax(logits) selects the next token. In akunu, the argmax kernel runs on the GPU as the final step of the dispatch table, writing the result to the token_ids buffer for chain decode.
ARM (Advanced RISC Machines)
The CPU instruction set architecture used by Apple Silicon. All M-series chips use ARM’s AArch64 (64-bit) ISA. Relevant to akunu only for CPU-side operations (tokenization, weight loading, sampling); the inference hot path runs entirely on the GPU.
Attention
The core mechanism of transformer models. Given queries Q, keys K, and values V, computes softmax(Q @ K^T / sqrt(d)) @ V. In akunu, the attention kernel reads Q from the scratch buffer and K/V from the head-major KV cache. See AttentionParams in Appendix A.
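For concreteness, single-query attention (the decode-time case: one query against a cached K/V for a single head) reads directly as code. This is a CPU sketch with illustrative names, not akunu’s GPU kernel:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// q: [d], K/V: [n_pos][d]. Computes softmax(q . K^T / sqrt(d)) @ V.
std::vector<float> attend(const std::vector<float>& q,
                          const std::vector<std::vector<float>>& K,
                          const std::vector<std::vector<float>>& V) {
    const size_t d = q.size(), n = K.size();
    std::vector<float> scores(n);
    float max_s = -INFINITY;
    for (size_t i = 0; i < n; ++i) {               // q . K[i] / sqrt(d)
        float s = 0.f;
        for (size_t j = 0; j < d; ++j) s += q[j] * K[i][j];
        scores[i] = s / std::sqrt(float(d));
        max_s = std::max(max_s, scores[i]);
    }
    float sum = 0.f;                               // numerically stable softmax
    for (float& s : scores) { s = std::exp(s - max_s); sum += s; }
    std::vector<float> out(d, 0.f);                // weighted sum of values
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < d; ++j) out[j] += (scores[i] / sum) * V[i][j];
    return out;
}
```

During decode, n grows by one per token while d stays fixed, which is why the K/V reads dominate the cost.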
BF16 (Brain Float 16)
A 16-bit floating-point format with 8-bit exponent and 7-bit mantissa, matching FP32’s exponent range at the cost of precision. Native hardware support on M4 (GPU Family 9). In akunu, BF16 weights use dtype code 30 (converted to FP16 at load) or 31 (native BF16 on M4+). See dtype_descriptor.h.
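Because BF16 is just the top 16 bits of an FP32 value, conversion is a truncation (plus optional rounding). A sketch, not akunu’s loader code:

```cpp
#include <cstdint>
#include <cstring>

// Round-to-nearest-even FP32 -> BF16: keep the high 16 bits
// (sign, 8-bit exponent, top 7 mantissa bits).
uint16_t f32_to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    uint32_t rounding = 0x7FFF + ((bits >> 16) & 1);  // nearest-even bias
    return uint16_t((bits + rounding) >> 16);
}

// BF16 -> FP32 is exact: restore the low mantissa bits as zero.
float bf16_to_f32(uint16_t h) {
    uint32_t bits = uint32_t(h) << 16;
    float x;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```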
BOS (Beginning of Sequence)
A special token (typically ID 1) that marks the start of a sequence. In akunu, akunu_bench uses BOS-filled synthetic prompts for reproducible benchmarking. See tools/akunu_bench.cpp.
BPE (Byte-Pair Encoding)
A subword tokenization algorithm that iteratively merges the most frequent adjacent pairs of characters/tokens. Used by most modern LLMs (GPT, LLaMA, Qwen). In akunu, the tokenizer implementation in src/tokenizer/tokenizer.h handles BPE encoding and decoding.
Causal Masking
The constraint in autoregressive language models that position i can only attend to positions 0..i (not future positions). During prefill, akunu’s GEMM-based attention applies a causal mask to the attention scores. During single-token decode, causal masking is implicit because the query is always the latest position.
Chain Decode
Akunu’s technique for batching multiple greedy decode steps into a single GPU command buffer submission. Instead of committing one command buffer per token (incurring ~50us sync overhead each time), akunu encodes 64-128 forward passes back-to-back, with the argmax output of token N feeding as input to token N+1 via a shared GPU buffer. See encode_chain() in dispatch_table.h and ADR-5 in Chapter 56.
ChipConfig
A struct in src/core/chip_config.h that captures hardware-derived tuning parameters for Apple Silicon GPU families. Includes SLC size estimates, GEMV kernel thresholds, chain decode chunk sizes, and norm dispatch geometry. Created via ChipConfig::from_gpu(cores, family).
Command Buffer
A Metal API object (MTLCommandBuffer) that holds a sequence of encoded GPU commands. In akunu, each chain decode chunk is encoded into one command buffer. The command buffer is committed to the GPU queue and either waited on synchronously (end_encoding_sync) or monitored asynchronously.
Compute Command Encoder
A Metal API object (MTLComputeCommandEncoder) used to encode compute dispatches (set pipeline, set buffers, dispatch threads) into a command buffer. In akunu, MetalDevice::begin_encoding() creates a new encoder, and the dispatch table is encoded through it.
Decode
The autoregressive token generation phase where the model processes one token at a time, appending each to the KV cache. Decode is memory-bound on Apple Silicon because each step reads the entire weight matrix for a single vector multiplication. Contrast with Prefill.
Dispatch Table
A pre-compiled sequence of DispatchCmd structs representing one token’s complete forward pass. Built once during model initialization by build_dispatch_table(). Replayed N times by encode_chain() during inference. See src/core/dispatch_table.h and ADR-1 in Chapter 56.
DispatchCmd
A POD struct containing everything needed for a single GPU kernel dispatch: pipeline state object, buffer bindings (up to 8), inline parameters (up to 64 bytes), threadgroup memory, dispatch geometry, and per-token patching instructions. Defined in dispatch_table.h.
DType Descriptor
A struct in src/core/dtype_descriptor.h that maps a GGUF dtype code to the appropriate kernel names and dispatch geometry. Contains fields for GEMV, GEMV-wide, GEMM, embedding, and fused SiLU kernel names, plus threadgroup sizes for each. The kDTypes[] array is the single source of truth for all dtype-dependent behavior.
Embedding
The process of converting a discrete token ID into a dense floating-point vector. The embedding table is a matrix of shape [vocab_size, dim] where each row is the learned representation of one token. In akunu, the embedding lookup is the first kernel in the dispatch table, reading from the token_ids buffer and writing to the h0 scratch buffer. Quantized embedding kernels (e.g., embedding_lookup_q4_0) dequantize on the fly.
Encoder-Decoder
A transformer architecture with separate encoder and decoder stacks connected by cross-attention. The encoder processes the input (e.g., mel spectrograms for Whisper) in parallel; the decoder generates output tokens autoregressively, attending to both its own previous outputs and the encoder’s representations. In akunu, enabled by ArchDescriptor::is_encoder_decoder = true. See arch_whisper() in arch_descriptor.h.
EOS (End of Sequence)
A special token that signals the model wants to stop generating. When the model outputs EOS, akunu’s akunu_generate() terminates the decode loop and returns the generation statistics.
FFN (Feed-Forward Network)
The position-wise fully-connected sub-layer in each transformer block. Modern LLMs use a gated variant (SwiGLU or GEGLU) with three weight matrices: gate, up, and down projections. In akunu, the FFN intermediate dimension is stored in AkunuModelConfig::ffn_dim and is typically ~2.7x the model dimension for SwiGLU architectures.
FNV-1a
A non-cryptographic hash function (Fowler-Noll-Vo) used in akunu’s N-gram predictor for hashing token contexts. The 64-bit variant uses offset basis 14695981039346656037 and prime 1099511628211. See NGramPredictor::context_hash() in ngram_predictor.h.
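The algorithm itself is two operations per byte. A sketch using the constants given above (how akunu feeds token IDs into the byte stream is not shown here):

```cpp
#include <cstddef>
#include <cstdint>

// FNV-1a 64-bit: xor the byte in, then multiply by the FNV prime.
uint64_t fnv1a64(const uint8_t* data, size_t len) {
    uint64_t h = 14695981039346656037ULL;   // offset basis
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];                        // xor first...
        h *= 1099511628211ULL;               // ...then multiply
    }
    return h;
}
```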
FP16 (Half-Precision Float)
IEEE 754 half-precision: 5-bit exponent, 10-bit mantissa. The native compute precision for Apple Silicon GPUs. In akunu, all intermediate activations (hidden states, attention outputs, FFN intermediates) are FP16. Weight matrices may be quantized to lower precision but are dequantized to FP16 during computation.
FlashAttention
An efficient attention algorithm that tiles the softmax computation to avoid materializing the full [seq_len, seq_len] attention matrix in memory [1]. In akunu, the prefill attention kernel uses a tiled approach inspired by FlashAttention, computing attention in chunks that fit in threadgroup memory.
GBNF (GGML BNF)
A grammar specification format based on BNF (Backus-Naur Form), used for constrained decoding in llama.cpp and akunu. In akunu, akunu_grammar_create() parses a GBNF string and creates a grammar constraint that masks invalid tokens at each generation step. See src/grammar/json_schema_to_grammar.h.
GELU (Gaussian Error Linear Unit)
An activation function: GELU(x) = x * Phi(x) where Phi is the standard Gaussian CDF. Used by Gemma (with gate: GEGLU) and Whisper (plain GELU). In akunu, implemented as act_gelu() in KernelCommon.h using the tanh approximation.
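The tanh approximation mentioned above is the common fast form; exact GELU uses erf instead. A CPU sketch (the function name is illustrative, not akunu’s act_gelu):

```cpp
#include <cmath>

// GELU via the tanh approximation:
// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
float gelu_tanh(float x) {
    const float k = 0.7978845608f;  // sqrt(2/pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}
```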
GEMM (General Matrix-Matrix Multiply)
The C = alpha * A @ B + beta * C operation. Used during prefill when processing multiple tokens simultaneously. The arithmetic intensity scales with the batch dimension M, making prefill compute-bound for moderate batch sizes. In akunu, GEMM kernels use simdgroup_matrix hardware intrinsics and are selected via gemm_kernel_for() in dtype_descriptor.h. See GEMMParams in Appendix A.
GEMV (General Matrix-Vector Multiply)
The y = A @ x operation (M=1 case of GEMM). The dominant operation during single-token decode. Memory-bound on Apple Silicon because the entire weight matrix must be read for each multiplication. In akunu, GEMV kernels are specialized per dtype and chip configuration, with standard, large-K, and wide-N variants.
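A reference GEMV is two nested loops; each output element consumes an entire row of A, so the whole matrix streams through memory once per token, which is exactly why decode is bandwidth-limited. CPU sketch, not a kernel:

```cpp
#include <cstddef>
#include <vector>

// y = A @ x with A row-major [M, K], x of length K.
std::vector<float> gemv(const std::vector<float>& A, size_t M, size_t K,
                        const std::vector<float>& x) {
    std::vector<float> y(M, 0.0f);
    for (size_t m = 0; m < M; ++m)
        for (size_t k = 0; k < K; ++k)
            y[m] += A[m * K + k] * x[k];   // one full row of A per output
    return y;
}
```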
GGUF (GGML Universal File)
A binary file format for storing quantized LLM weights and metadata. Successor to GGML format, used by llama.cpp and supported by most open-source LLM tools. In akunu, parsed by src/weight/gguf_parser.h. Contains tensor data, model architecture metadata, tokenizer vocabulary, and quantization parameters in a single file.
GQA (Grouped Query Attention)
An attention variant where multiple query heads share a single key/value head, reducing KV cache memory and attention compute [2]. For example, LLaMA 3 uses 32 query heads but only 8 KV heads (ratio 4:1). In akunu, GQA is handled by the attention kernel via the n_heads / n_kv_heads fields in AttentionParams.
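The query-to-KV-head mapping amounts to an integer division. A sketch assuming the consecutive grouping used by LLaMA-style GQA (function name illustrative):

```cpp
#include <cstdint>

// With 32 Q heads and 8 KV heads, Q heads 0-3 share KV head 0,
// heads 4-7 share KV head 1, and so on.
uint32_t kv_head_for(uint32_t q_head, uint32_t n_heads, uint32_t n_kv_heads) {
    uint32_t group = n_heads / n_kv_heads;  // queries per KV head (4 for 32:8)
    return q_head / group;
}
```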
Gumbel-Max Trick
A method for sampling from a categorical distribution by adding Gumbel-distributed noise to log-probabilities and taking the argmax. Considered but not adopted as the default sampling strategy in akunu (see ADR-6 in Chapter 56). The main barrier is incompatibility with grammar-constrained decoding and top-k/top-p filtering.
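The trick works directly on logits, since log-softmax only shifts every entry by the same constant. A minimal sketch (not akunu code):

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// argmax(logit_i + g_i) with g_i ~ Gumbel(0,1) is an exact sample from
// the softmax distribution over the logits.
size_t gumbel_max_sample(const std::vector<float>& logits, std::mt19937& rng) {
    std::uniform_real_distribution<float> uni(1e-20f, 1.0f);
    size_t best = 0;
    float best_score = -INFINITY;
    for (size_t i = 0; i < logits.size(); ++i) {
        float g = -std::log(-std::log(uni(rng)));   // Gumbel(0,1) noise
        if (logits[i] + g > best_score) { best_score = logits[i] + g; best = i; }
    }
    return best;
}
```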
Head-Major Layout
A memory layout for the KV cache where all positions for a given head are contiguous: [n_kv_heads, max_seq_len, head_dim]. Chosen by akunu (ADR-10) because the attention kernel reads all K/V vectors for one head sequentially, and contiguous layout maximizes memory bandwidth utilization.
K-Quant (K-Quantization)
A family of GGUF quantization formats (Q2_K through Q6_K) that use a two-level hierarchical scheme with 256-element super-blocks containing smaller sub-blocks with their own 6-bit scales. Provides better accuracy than basic block quantization at the same bit width. See Appendix B for format details.
KV Cache
A buffer that stores the key and value vectors for all previously processed tokens, avoiding recomputation during autoregressive decode. In akunu, defined in src/cache/kv_cache.h as a KVCache struct with per-layer K and V buffers in head-major FP16 layout. Memory cost scales as 2 * n_layers * n_kv_heads * max_seq_len * head_dim * 2 bytes.
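The memory-cost formula is simple arithmetic. As a worked example, a LLaMA-3-8B-like configuration (32 layers, 8 KV heads, head_dim 128, an 8192-token context; illustrative numbers, not read from a real model) needs exactly 1 GiB:

```cpp
#include <cstdint>

// Bytes = 2 (K and V) * n_layers * n_kv_heads * max_seq_len
//       * head_dim * 2 (FP16 bytes per element).
uint64_t kv_cache_bytes(uint64_t n_layers, uint64_t n_kv_heads,
                        uint64_t max_seq_len, uint64_t head_dim) {
    return 2 * n_layers * n_kv_heads * max_seq_len * head_dim * 2;
}
```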
LayerNorm (Layer Normalization)
A normalization technique: LN(x) = (x - mean(x)) / sqrt(var(x) + eps) * weight + bias. Used by Whisper. In akunu, the LayerNormParams struct (Appendix A) drives the kernel. Most modern LLMs use RMSNorm instead.
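The formula reads directly as code. A CPU sketch (function name illustrative; akunu’s kernel runs this on the GPU):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// LN(x) = (x - mean(x)) / sqrt(var(x) + eps) * weight + bias
std::vector<float> layer_norm(const std::vector<float>& x,
                              const std::vector<float>& weight,
                              const std::vector<float>& bias,
                              float eps = 1e-5f) {
    const size_t n = x.size();
    float mean = 0.f, var = 0.f;
    for (float v : x) mean += v;
    mean /= n;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= n;                                  // population variance
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i)
        out[i] = (x[i] - mean) / std::sqrt(var + eps) * weight[i] + bias[i];
    return out;
}
```

Dropping the mean/bias terms and normalizing by sqrt(mean(x^2) + eps) instead gives RMSNorm, defined later in this glossary.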
LLM (Large Language Model)
A neural network with billions of parameters trained on large text corpora to generate text autoregressively. Akunu is an inference engine for LLMs on Apple Silicon, supporting architectures like LLaMA, Qwen, and Gemma.
Logits
The raw (unnormalized) output scores from the model’s final linear projection. A vector of size vocab_size where each element represents the model’s confidence that the corresponding token should come next. In akunu, logits are stored in the scratch.logits buffer (FP16, vocab_size * 2 bytes).
Mel Spectrogram
A time-frequency representation of audio, computed by applying the mel-scale filterbank to the Short-Time Fourier Transform (STFT) of a waveform. Whisper models expect 80-bin or 128-bin mel spectrograms at 16kHz sample rate. In akunu, mel computation is handled by src/audio/mel.h as a preprocessing step before encoder inference. The bin count is stored in AkunuModelConfig::n_mels.
Memory Mapping (mmap)
An operating system facility that maps a file’s contents into virtual memory, allowing the file to be read as if it were in RAM without explicit read calls. In akunu, both GGUF and SafeTensors files are opened via mmap(), enabling zero-copy access to tensor data. The OS manages paging from disk as needed, which means model loading appears near-instantaneous for files already in the page cache.
make_uniform()
A Metal shader helper function defined in KernelCommon.h that wraps simd_broadcast_first(). Tells the Metal compiler that a value is the same across all threads in a SIMD group, enabling better predication and vectorization. Used for loop bounds and uniform conditionals in GEMV/GEMM kernels.
Metal
Apple’s low-level GPU programming framework, analogous to Vulkan or Direct3D 12. Provides direct access to GPU compute via compute pipelines, command buffers, and buffers. Akunu uses Metal exclusively as its GPU backend via MetalDevice in backend/metal/metal_device.h.
Metallib
A pre-compiled Metal shader library (.metallib file). Contains compiled pipeline state objects for all of akunu’s GPU kernels. Loaded at model init time via MetalDevice::load_library(). Using a pre-compiled metallib avoids runtime shader compilation, which can take seconds.
MHA (Multi-Head Attention)
The standard attention mechanism where Q, K, and V are split into multiple heads, each attending independently, then concatenated. In akunu, the number of heads is specified by AkunuModelConfig::n_heads (for Q) and n_kv_heads (for K/V in GQA).
MLX
Apple’s array computation framework for machine learning, implemented in C++ and Metal with Python bindings. Uses SafeTensors format with group quantization. In akunu, MLX-format models are loaded via MLXWeightStore in src/weight/mlx_weight_store.h, with weight name mapping from HuggingFace conventions to akunu’s canonical names.
MSL (Metal Shading Language)
The programming language for Metal GPU shaders, based on C++14 with extensions for GPU-specific types (half, simdgroup, threadgroup memory). All of akunu’s GPU kernels are written in MSL. The source lives in backend/metal/kernels/.
NeoX RoPE
A variant of Rotary Position Embeddings where the rotation dimensions are arranged in a split-half pattern: the first head_dim/2 elements are one component, the second half is the other. Used by Qwen, Gemma, and GPT-NeoX. In akunu, selected via ArchDescriptor::rope_kernel = "rope_neox_qkv_write_f16".
Neural Engine
A dedicated machine learning accelerator on Apple Silicon SoCs, optimized for dense matrix operations on fixed-size tensors via Core ML. Akunu does not use the Neural Engine because it requires models in Core ML format and does not support the dynamic shapes needed for autoregressive decoding with variable-length KV caches. The GPU provides more flexibility for custom kernel implementations.
N-Gram Predictor
Akunu’s lightweight speculative decoding module that predicts future tokens based on frequency tables of recently observed n-gram patterns (up to 4-grams). Does not require a draft model. Defined in src/speculative/ngram_predictor.h. Enabled via akunu_set_speculation(model, true).
NPDA (Neural Processing and Data Acceleration)
Apple’s term for the collection of hardware blocks on their SoCs that accelerate ML workloads, including the GPU, Neural Engine, and AMX (Apple Matrix eXtensions). Akunu uses only the GPU via Metal; it does not target the Neural Engine or AMX.
Ping-Pong Buffers
The technique of alternating between two buffers (h0 and h1) for the transformer’s residual stream. Each layer reads from one buffer, writes intermediate results, then adds the residual back to the other buffer. This avoids allocating a new buffer per layer. See ScratchBuffers in src/cache/scratch.h and ADR-9 in Chapter 56.
Pipeline State Object (PSO)
A Metal API object (MTLComputePipelineState) representing a compiled GPU kernel ready for dispatch. In akunu, PSOs are cached in MetalDevice::pso_cache_ (keyed by kernel name) and looked up by get_pipeline(). The dispatch table stores PSO handles directly to avoid per-dispatch lookups.
Prefill
The phase of LLM inference where the entire prompt is processed in one batch to populate the KV cache. Unlike decode (which processes one token at a time), prefill uses GEMM (matrix-matrix) operations and can be compute-bound for longer prompts. In akunu, prefill is triggered by akunu_prefill() and uses the batch_* scratch buffers.
Q4_0
The most common GGUF quantization format. 32-element blocks with one FP16 scale each; 4 bits per weight value; 4.5 effective bits per weight. See Appendix B for the full block layout and dequantization formula.
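Assuming the upstream GGUF Q4_0 layout (low nibbles hold elements 0-15, high nibbles elements 16-31; see Appendix B), dequantization is one subtract-and-scale per weight. A sketch with a float scale standing in for the stored FP16:

```cpp
#include <cstddef>
#include <cstdint>

// Dequantize one 32-weight Q4_0 block: w[i] = (q[i] - 8) * scale.
void dequant_q4_0(float scale, const uint8_t nibbles[16], float out[32]) {
    for (size_t i = 0; i < 16; ++i) {
        out[i]      = float((nibbles[i] & 0x0F) - 8) * scale;  // low nibble
        out[i + 16] = float((nibbles[i] >> 4)   - 8) * scale;  // high nibble
    }
}
```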
QKV Fusion
The optimization of fusing the Q, K, and V linear projections into a single GEMV that writes to a contiguous output buffer [q_dim + 2*kv_dim]. Reduces three GEMV dispatches to one. In akunu, the QKV buffer is scratch.qkv with sub-offsets qkv_q_offset, qkv_k_offset, qkv_v_offset.
Repetition Penalty
A technique to discourage the model from repeating tokens by modifying logits for recently generated tokens. Positive logits are divided by the penalty factor; negative logits are multiplied. In akunu, configurable via AkunuSamplingConfig::repeat_penalty and can optionally be applied on the GPU via a kernel driven by RepetitionPenaltyParams.
Residual Connection
A shortcut that adds a layer’s input to its output: output = layer(x) + x. Prevents vanishing gradients in deep networks and is used in every transformer layer (both after attention and after FFN). In akunu, residual additions alternate between the h0 and h1 ping-pong buffers.
RMSNorm (Root Mean Square Normalization)
A simplified normalization: RMSNorm(x) = x / sqrt(mean(x^2) + eps) * weight. Omits the mean subtraction of LayerNorm. Used by LLaMA, Qwen, Gemma, and most modern LLMs. In akunu, driven by RMSNormParams (Appendix A) and selected via ArchDescriptor::norm_type = "rmsnorm".
Roofline Model
A visual performance model that plots a kernel’s achievable throughput (FLOPS) against its arithmetic intensity (FLOPS/byte), bounded by the hardware’s peak compute and peak memory bandwidth. For LLM decode on Apple Silicon, most kernels (GEMV, attention, norms) fall in the memory-bound region. See Chapter 55 for a detailed roofline analysis with Apple Silicon numbers.
RoPE (Rotary Position Embeddings)
A position encoding method that rotates query and key vectors by position-dependent angles, allowing the model to learn relative position relationships [3]. In akunu, RoPE is applied by the fused rope_qkv_write_f16 kernel during decode and the standalone rope_f16 kernel during prefill. See RoPEParams and RoPEQKVWriteParams in Appendix A.
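Each rotation acts on a pair of vector components, with an angle that grows with the token position and shrinks with the pair index. A sketch of the interleaved (GPT-J style) pairing; the NeoX variant pairs split halves instead, and theta_base = 10000 is the common default assumed here:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Rotate each (even, odd) pair of x by angle pos * theta_base^(-i/d).
void rope_inplace(std::vector<float>& x, int pos, float theta_base = 10000.f) {
    const size_t d = x.size();
    for (size_t i = 0; i + 1 < d; i += 2) {
        float freq = std::pow(theta_base, -float(i) / float(d));
        float angle = pos * freq;
        float c = std::cos(angle), s = std::sin(angle);
        float x0 = x[i], x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;    // standard 2D rotation
        x[i + 1] = x0 * s + x1 * c;
    }
}
```

Because the rotation is applied to both Q and K, the dot product q · k depends only on the *difference* of their positions, which is what makes the encoding relative.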
SafeTensors
A simple binary format for storing tensors, developed by Hugging Face. The header is a JSON object mapping tensor names to their dtype, shape, and byte offsets; the rest of the file is raw tensor data. In akunu, parsed by SafeTensorsParser in src/weight/safetensors_parser.h. MLX models use SafeTensors as their container format.
Sampling
The process of selecting the next token from the logit distribution. Options include greedy (argmax), temperature scaling, top-k (keep only top K logits), top-p/nucleus (keep logits whose cumulative probability exceeds p), and min-p (keep logits above a minimum probability threshold). In akunu, configured via AkunuSamplingConfig in types.h.
Scratch Buffers
Pre-allocated GPU buffers for all intermediate computations during inference. Created once at model load time. Includes h0/h1 (residual ping-pong), qkv, attn_out, ffn_gate/up/act, logits, and batch variants for prefill. See ScratchBuffers in src/cache/scratch.h.
SIMD Group
A group of 32 threads that execute in lockstep on Apple Silicon GPUs (equivalent to a “warp” on NVIDIA GPUs or “wavefront” on AMD). SIMD group operations (simd_sum, simd_max, simd_broadcast_first) are used extensively in akunu’s reduction kernels. The width is defined as SIMD_WIDTH = 32 in KernelCommon.h.
simdgroup_matrix
A Metal intrinsic type that maps to Apple Silicon’s hardware matrix multiply unit. Supports 8x8 FP16 matrix tiles. Used by akunu’s GEMM kernels (simd_gemm_*) for prefill operations with tiling constants TILE_M=64, TILE_N=64, TILE_K=32 defined in KernelCommon.h.
SiLU (Sigmoid Linear Unit)
An activation function: SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)). Used by LLaMA and Qwen in the FFN’s SwiGLU block. In akunu, implemented as act_silu() in KernelCommon.h. Fused SiLU GEMV kernels (gemv_q4_0_silu, gemv_mlx_q4_silu, etc.) apply this during the GEMV accumulation.
SLC (System Level Cache)
A large shared cache on Apple Silicon that sits between the GPU/CPU cores and main memory. Size ranges from 8 MB (M1 base) to 96 MB (Ultra). Not directly programmable, but its presence means that data recently read by the GPU may still be in cache for subsequent reads. In akunu, ChipConfig::slc_bytes estimates the SLC size and should_fuse_weights is enabled when the SLC is large enough to benefit from weight fusion.
SoC (System on Chip)
An integrated circuit that combines CPU, GPU, Neural Engine, memory controller, and other components on a single die. Apple’s M-series chips are SoCs with unified memory architecture. Relevant to akunu because UMA eliminates the PCIe bottleneck found in discrete GPU systems.
Softmax
The function softmax(x)_i = exp(x_i) / sum(exp(x_j)) that converts logits to a probability distribution. Used in attention (over the attention scores) and optionally for final token sampling. In akunu, the standalone softmax kernel is driven by SoftmaxParams (Appendix A); during decode attention, softmax is fused into the attention kernel.
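In practice the max is subtracted before exponentiating so exp() never overflows; this changes nothing mathematically, since the shift cancels in the ratio. A CPU sketch of the stable form:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// softmax(x)_i = exp(x_i - max(x)) / sum_j exp(x_j - max(x))
std::vector<float> softmax(std::vector<float> x) {
    float m = x[0];
    for (float v : x) m = std::max(m, v);
    float sum = 0.f;
    for (float& v : x) { v = std::exp(v - m); sum += v; }
    for (float& v : x) v /= sum;
    return x;
}
```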
Speculative Decoding
A technique to accelerate autoregressive generation by using a fast predictor to draft multiple tokens, then verifying them in parallel with the full model.4 In akunu, implemented via the N-gram predictor (src/speculative/ngram_predictor.h) which does not require a separate draft model. Enabled via akunu_set_speculation(model, true).
SwiGLU
A gated FFN variant: FFN(x) = (SiLU(W_gate @ x) * (W_up @ x)) @ W_down. Combines SiLU activation with a gating mechanism. Used by LLaMA, Qwen, and most modern LLMs. In akunu, the SwiGLU pattern is encoded in the dispatch table as: (1) fused gate+up GEMV, (2) SiLU-gate activation kernel (or fused SiLU GEMV), (3) down projection GEMV.
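The formula above can be written out as three matrix-vector products plus an element-wise gate. A CPU reference sketch (row-major [out, in] matrices; not akunu’s fused-kernel path):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

static std::vector<float> matvec(const std::vector<float>& W, size_t out_dim,
                                 size_t in_dim, const std::vector<float>& x) {
    std::vector<float> y(out_dim, 0.f);
    for (size_t o = 0; o < out_dim; ++o)
        for (size_t i = 0; i < in_dim; ++i) y[o] += W[o * in_dim + i] * x[i];
    return y;
}

// FFN(x) = (SiLU(W_gate @ x) * (W_up @ x)) @ W_down
std::vector<float> swiglu_ffn(const std::vector<float>& Wg,
                              const std::vector<float>& Wu,
                              const std::vector<float>& Wd,
                              size_t dim, size_t ffn_dim,
                              const std::vector<float>& x) {
    std::vector<float> g = matvec(Wg, ffn_dim, dim, x);   // gate projection
    std::vector<float> u = matvec(Wu, ffn_dim, dim, x);   // up projection
    for (size_t i = 0; i < ffn_dim; ++i)
        g[i] = g[i] / (1.f + std::exp(-g[i])) * u[i];     // SiLU(g) * u
    return matvec(Wd, dim, ffn_dim, g);                   // down projection
}
```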
Threadgroup
A group of threads that share threadgroup memory and can synchronize via barriers. In Metal, threadgroups are the unit of dispatch: you specify (grid_size, threadgroup_size) when dispatching a kernel. In akunu, threadgroup sizes are tuned per kernel type and chip: GEMV typically uses 128 or 256 threads, GEMM uses 128 (4 SIMD groups).
Threadgroup Memory
Fast on-chip memory shared among threads in a threadgroup (equivalent to “shared memory” in CUDA). Limited to 32 KB per threadgroup on Apple Silicon (MAX_TG_MEMORY in KernelCommon.h). Used in akunu for GEMM tile buffers, reduction scratch space, and attention score accumulation.
TTFT (Time To First Token)
The wall-clock time from submitting a prompt to receiving the first generated token. Equals prefill time plus one decode step. The most perceptually important latency metric for interactive applications. Measured by akunu_benchmark (Chapter 55).
Tokenizer
The component that converts text to token IDs (encoding) and token IDs back to text (decoding). In akunu, the tokenizer is loaded from GGUF metadata or MLX tokenizer.json and exposes akunu_encode() and akunu_decode_token() via the C API. Implementation in src/tokenizer/tokenizer.h.
Tiling
A technique for breaking large matrix operations into smaller blocks (tiles) that fit in fast on-chip memory. In akunu’s GEMM kernels, tiling constants are TILE_M=64, TILE_N=64, TILE_K=32 (defined in KernelCommon.h). Each threadgroup processes one output tile, loading A and B sub-tiles into threadgroup memory cooperatively before computing the tile product using simdgroup_matrix intrinsics.
UMA (Unified Memory Architecture)
Apple Silicon’s memory architecture where CPU and GPU share the same physical memory pool. Eliminates the need for explicit data transfers between CPU and GPU. In akunu, all buffers (weights, KV cache, scratch) are allocated once and accessed by both CPU and GPU without copying. Metal buffers allocated via MTLDevice.makeBuffer(bytesNoCopy:...) enable true zero-copy access.
Min-p Sampling
A sampling strategy that keeps all tokens whose probability is at least min_p * max_probability. Unlike top-k (fixed count) or top-p (fixed cumulative threshold), min-p scales naturally with the model’s confidence: when the model is confident, fewer tokens pass the threshold; when uncertain, more pass. In akunu, configured via AkunuSamplingConfig::min_p.
Top-k Sampling
A sampling strategy that restricts the candidate set to the k tokens with the highest logits before applying softmax and drawing a random sample. Reduces the probability of low-quality long-tail tokens. In akunu, configured via AkunuSamplingConfig::top_k.
Top-p Sampling (Nucleus Sampling)
A sampling strategy that sorts tokens by probability and includes tokens until the cumulative probability exceeds p [5]. More adaptive than top-k because the number of candidates varies with the distribution’s entropy. In akunu, configured via AkunuSamplingConfig::top_p.
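The three filters (top-k, top-p, min-p) can be checked together in a single pass over the sorted candidates. The combined pipeline below is an illustrative sketch, not akunu’s documented behavior:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Return the indices of a (non-empty) probability vector that survive
// top-k, top-p, and min-p filtering, highest probability first.
std::vector<int> filter_candidates(const std::vector<float>& probs,
                                   int top_k, float top_p, float min_p) {
    std::vector<int> idx(probs.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return probs[a] > probs[b]; });
    float max_prob = probs[idx[0]];
    std::vector<int> keep;
    float cum = 0.f;
    for (int i : idx) {
        if ((int)keep.size() >= top_k) break;        // top-k: fixed count
        if (cum >= top_p && !keep.empty()) break;    // top-p: cumulative mass
        if (probs[i] < min_p * max_prob) break;      // min-p: relative floor
        keep.push_back(i);
        cum += probs[i];
    }
    return keep;
}
```

Sampling then proceeds by renormalizing over the surviving candidates and drawing from that restricted distribution.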
Transformer
The neural network architecture underlying modern LLMs, based on self-attention and position-wise feed-forward networks [6]. A decoder-only transformer (LLaMA, GPT) processes tokens autoregressively; an encoder-decoder transformer (Whisper) has separate encoder and decoder stacks. Akunu supports both via the ArchDescriptor::is_encoder_decoder flag.
Vocabulary Size
The number of distinct tokens the model can produce, typically 32K to 128K for modern LLMs. Stored in AkunuModelConfig::vocab_size. Determines the size of the final logit projection GEMV (dim -> vocab_size) and the logits scratch buffer (vocab_size * 2 bytes FP16).
Warp / Wave
Terms used by NVIDIA (“warp”, 32 threads) and AMD (“wavefront”, 32 or 64 threads) for the SIMD execution unit equivalent to Apple’s “SIMD group.” All three refer to the same concept: a group of threads executing the same instruction in lockstep. Apple Silicon uses a fixed SIMD width of 32.
Weight Fusion
The optimization of concatenating two or more weight matrices into a single contiguous buffer so they can be loaded by a single GEMV dispatch. In akunu, gate and up projection weights are fused (WeightProvider::fuse_weights()) on Pro+ chips where the SLC is large enough (>16 MB) to benefit from sequential access to the larger combined buffer. QKV weights can also be fused.
WeightProvider
The abstraction layer in src/weight/weight_provider.h that wraps either a GGUF WeightStore or an MLX MLXWeightStore, providing a uniform interface for tensor access, metadata queries, and weight fusion regardless of the underlying file format. Format detection is automatic based on file path (directory or .safetensors = MLX, otherwise GGUF).
WeightStore
The GGUF-specific weight loading backend. Opens a GGUF file via gguf_open(), extracts model configuration from metadata, and provides zero-copy GPU buffer access to tensor data via memory mapping. Defined alongside the GGUF parser in src/weight/weight_store.h.
Whisper
OpenAI’s speech recognition model, an encoder-decoder transformer that processes mel spectrograms to produce text transcriptions [7]. In akunu, Whisper is supported via the arch_whisper() descriptor, which enables encoder-decoder mode, cross-attention, Conv1D frontend, LayerNorm, and bias terms. The C API exposes akunu_transcribe() and related functions.
xgrammar
A third-party library (vendored in 3rdparty/xgrammar/) that implements grammar-constrained decoding. Compiles GBNF grammars and JSON schemas into efficient token masks that can be applied at each generation step to guarantee structurally valid output. Integrated into akunu via akunu_grammar_create() and akunu_generate_grammar().
Zero-Copy
The ability to share data between CPU and GPU without physically copying bytes. On Apple Silicon with UMA, Metal buffers are backed by physical pages that both the CPU and GPU can access. In akunu, GGUF tensor data is memory-mapped (mmap) from the file and the GPU buffer is created over the same pages, achieving true zero-copy weight loading. The SafeTensorsParser similarly uses mmap for MLX format files.
Index of Terms by Category
For quick navigation, here are the glossary terms grouped by topic:
Hardware and Platform: ALU, Apple GPU Family, ARM, ChipConfig, Metal, Metallib, MSL, NPDA, SIMD Group, simdgroup_matrix, SLC, SoC, Threadgroup, Threadgroup Memory, UMA, Warp/Wave
Quantization and Data Formats: BF16, FP16, GGUF, K-Quant, MLX, Q4_0, SafeTensors, Zero-Copy
Model Architecture: Attention, Causal Masking, GELU, GQA, LayerNorm, MHA, NeoX RoPE, RMSNorm, RoPE, SiLU, Softmax, SwiGLU, Tiling, Transformer
Inference Engine: Argmax, Chain Decode, Command Buffer, Compute Command Encoder, Decode, Dispatch Table, DispatchCmd, DType Descriptor, EOS, KV Cache, Logits, Head-Major Layout, Ping-Pong Buffers, Pipeline State Object, Prefill, QKV Fusion, Scratch Buffers, Speculative Decoding, TTFT, Weight Fusion, WeightProvider, WeightStore
Operations: BOS, BPE, GEMM, GEMV, Sampling, Top-k, Top-p, Min-p, Tokenizer, Vocabulary Size
External Libraries and Tools: FlashAttention, GBNF, Gumbel-Max Trick, LLM, N-Gram Predictor, Whisper, xgrammar
References
1. Dao, T. et al. (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” NeurIPS 2022. https://arxiv.org/abs/2205.14135
2. Ainslie, J. et al. (2023). “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.” EMNLP 2023. https://arxiv.org/abs/2305.13245
3. Su, J. et al. (2021). “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv:2104.09864. https://arxiv.org/abs/2104.09864
4. Leviathan, Y. et al. (2023). “Fast Inference from Transformers via Speculative Decoding.” ICML 2023. https://arxiv.org/abs/2211.17192
5. Holtzman, A. et al. (2020). “The Curious Case of Neural Text Degeneration.” ICLR 2020. https://arxiv.org/abs/1904.09751
6. Vaswani, A. et al. (2017). “Attention Is All You Need.” NeurIPS 2017. https://arxiv.org/abs/1706.03762
7. Radford, A. et al. (2022). “Robust Speech Recognition via Large-Scale Weak Supervision.” arXiv:2212.04356. https://arxiv.org/abs/2212.04356