Introduction to Metal
Welcome to Part II of this book. In Part I, we explored the hardware that makes Apple Silicon so compelling for machine learning inference: the system-on-chip design, the GPU architecture with its SIMD groups and threadgroups, and the unified memory architecture that eliminates the CPU-to-GPU copy bottleneck. Now it is time to learn how to actually program that hardware.
The answer is Metal – Apple’s low-level GPU programming framework. Over the next seven chapters, we will go from zero to writing high-performance compute shaders that form the backbone of ML inference engines like akunu. By the end of Part II, you will understand every layer of the Metal compute stack, from the API calls in Swift/Objective-C all the way down to the individual threads executing on GPU cores.
Let us begin.
What Is Metal?
Metal is Apple’s unified graphics and compute API.[1] Introduced at WWDC 2014, it replaced the aging OpenGL ES on iOS and eventually OpenGL and OpenCL on macOS. Think of Metal as Apple’s answer to Vulkan or DirectX 12 – a modern, low-overhead GPU programming interface that gives you explicit control over the hardware.[2]
But Metal is more than just a graphics API. It has three major facets:
+------------------------------------------------------------+
|                      Metal Framework                       |
+------------------------------------------------------------+
|                                                            |
|  +----------------+  +----------------+  +--------------+  |
|  | Metal API      |  | Metal Shading  |  | Metal        |  |
|  | (Swift/ObjC)   |  | Language       |  | Performance  |  |
|  |                |  | (C++14-based)  |  | Shaders      |  |
|  | - Device       |  |                |  | (MPS)        |  |
|  | - Buffers      |  | - Compute      |  |              |  |
|  | - Queues       |  |   kernels      |  | - MatMul     |  |
|  | - Encoders     |  | - Vertex/      |  | - Conv       |  |
|  | - Pipelines    |  |   Fragment     |  | - Image      |  |
|  | - Textures     |  |   shaders      |  |   ops        |  |
|  +----------------+  +----------------+  +--------------+  |
|                                                            |
+------------------------------------------------------------+
- The Metal API (Swift/Objective-C): The host-side interface. You use this to create GPU devices, allocate memory, build command buffers, and submit work to the GPU. This runs on the CPU.
- The Metal Shading Language (MSL): The language you write GPU programs in. It is based on C++14 with Apple-specific extensions for GPU concepts like threadgroups, SIMD operations, and address spaces. Your compute kernels are written in MSL.
- Metal Performance Shaders (MPS): A library of pre-built, highly optimized GPU kernels for common operations – matrix multiplication, convolution, image processing, neural network layers, and more. Think of MPS as Apple’s cuDNN equivalent.
For ML inference, we primarily care about the compute side of Metal. We will barely touch graphics (vertex/fragment shaders, render passes, etc.). Our world is compute kernels, buffers, and dispatches.
Metal vs. The Competition
If you are coming from CUDA, OpenCL, or Vulkan, you will find Metal familiar in some ways and different in others. Let us compare:
Metal vs. CUDA
+-------------------+-------------------------+-------------------------+
| Aspect            | CUDA                    | Metal                   |
+-------------------+-------------------------+-------------------------+
| Vendor            | NVIDIA only             | Apple only              |
| Language          | CUDA C/C++ (extended)   | MSL (C++14 extended)    |
| Host API          | CUDA Runtime/Driver     | Metal API (Swift/ObjC)  |
|                   | (C/C++)                 |                         |
| Execution model   | Grid → Block → Thread   | Grid → Threadgroup →    |
|                   |                         | Thread                  |
| SIMD width        | 32 (warp)               | 32 (SIMD group)         |
| Shared memory     | __shared__              | threadgroup address     |
|                   |                         | space                   |
| Sync primitive    | __syncthreads()         | threadgroup_barrier()   |
| Matrix accel.     | Tensor Cores (wmma)     | simdgroup_matrix        |
| Memory model      | Discrete + UVA          | Unified (UMA)           |
| Ecosystem         | Massive (cuDNN, cuBLAS, | Smaller (MPS, MPSGraph, |
|                   | TensorRT, Triton)       | Core ML)                |
| Maturity for ML   | 15+ years               | ~5 years                |
+-------------------+-------------------------+-------------------------+
The biggest conceptual difference: CUDA’s programming model treats the GPU as a separate device with its own memory.[3] You explicitly copy data between CPU and GPU. On Apple Silicon with Metal, the CPU and GPU share the same physical memory (UMA). There is no copy – you allocate a buffer once, and both CPU and GPU can access it.
The biggest practical difference: CUDA has a massive ecosystem for ML. Libraries like cuDNN, cuBLAS, TensorRT, and Triton make it possible to write high-performance ML code without touching raw CUDA kernels. Metal’s ecosystem is smaller. MPS provides some building blocks, but for state-of-the-art inference, you often need to write custom kernels – which is exactly what akunu does.
Metal vs. OpenCL
+-------------------+-------------------------+-------------------------+
| Aspect            | OpenCL                  | Metal                   |
+-------------------+-------------------------+-------------------------+
| Portability       | Cross-platform          | Apple only              |
| API style         | C-based, verbose        | ObjC/Swift, modern      |
| Shader language   | OpenCL C (C99-based)    | MSL (C++14)             |
| Runtime compile   | Yes (common)            | Yes + offline           |
| Performance       | Driver-dependent        | Tuned for Apple HW      |
| Status on Apple   | Deprecated since        | Active, primary GPU API |
|                   | macOS 10.14             |                         |
+-------------------+-------------------------+-------------------------+
OpenCL was once available on macOS, but Apple deprecated it in 2018. Metal is the only supported path for GPU compute on Apple platforms. If you have existing OpenCL kernels, they need to be ported to MSL.
The good news: the conceptual mapping is straightforward. OpenCL work-groups are Metal threadgroups. OpenCL __local memory is Metal threadgroup memory. OpenCL __global is Metal device. The languages are different (C99 vs C++14), but the GPU programming model is fundamentally the same.
Metal vs. Vulkan
+-------------------+-------------------------+-------------------------+
| Aspect            | Vulkan                  | Metal                   |
+-------------------+-------------------------+-------------------------+
| Portability       | Cross-platform          | Apple only              |
| Verbosity         | Extremely verbose       | Moderate                |
| Shader language   | SPIR-V (usually from    | MSL (write directly)    |
|                   | GLSL/HLSL)              |                         |
| Compute support   | Full                    | Full                    |
| Validation layers | External (very helpful) | Metal API Validation    |
| Driver overhead   | Very low                | Very low                |
| On Apple          | Via MoltenVK (wrapper   | Native                  |
|                   | over Metal)             |                         |
+-------------------+-------------------------+-------------------------+
Vulkan and Metal share the same philosophy: explicit, low-overhead GPU control. Both require you to manage command buffers, synchronization, and pipeline states yourself. Vulkan is more verbose (creating a compute pipeline in Vulkan can take hundreds of lines), while Metal strikes a balance between control and usability.
Fun fact: MoltenVK, the Vulkan-on-Apple implementation, is actually a translation layer that converts Vulkan calls to Metal calls underneath. So Metal is the true native API.
Summary: Why Metal for ML on Apple?
Why Metal?
==========
1. It is the ONLY way to access Apple GPU compute
   (OpenCL is deprecated, no CUDA, Vulkan is via MoltenVK)
2. Unified Memory Architecture means ZERO-COPY buffer sharing
   between CPU and GPU -- huge for inference
3. Metal Shading Language is pleasant to write
   (C++14 with nice extensions, not as painful as GLSL)
4. simdgroup_matrix operations give you hardware-accelerated
   matrix multiply (like Tensor Cores)
5. Apple tunes Metal drivers specifically for their hardware
   (you get the best possible performance)
The Metal Ecosystem
Let us zoom in on the three pillars of the Metal ecosystem and understand how they fit together for ML workloads.
The Metal API (Host Side)
The Metal API is an Objective-C/Swift framework. You use it on the CPU to orchestrate GPU work. The key objects are:
+----------------------------------------------------+
| Your Application                                   |
| (Swift / ObjC / C++)                               |
+----------------------------------------------------+
                             |
                             v
+----------------------------------------------------+
| MTLDevice                                          |
| Represents the GPU. Entry point for everything.    |
|   - makeCommandQueue()                             |
|   - makeBuffer(length:options:)                    |
|   - makeComputePipelineState(function:)            |
|   - makeLibrary(source:options:)                   |
+----------------------------------------------------+
                             |
         +-------------------+-------------------+
         |                   |                   |
         v                   v                   v
+-----------------+ +-----------------+ +-----------------+
| MTLCommandQueue | | MTLBuffer       | | MTLLibrary      |
| Ordered queue   | | GPU memory      | | Collection of   |
| of cmd buffers  | | allocation      | | compiled shaders|
+-----------------+ +-----------------+ +-----------------+
         |                                       |
         v                                       v
+------------------+                    +------------------+
| MTLCommandBuffer |                    | MTLFunction      |
| A batch of GPU   |                    | A single shader  |
| commands         |                    | entry point      |
+------------------+                    +------------------+
         |                                       |
         v                                       v
+--------------------------+        +-------------------------+
| MTLComputeCommandEncoder |        | MTLComputePipelineState |
| Records compute cmds     |        | Compiled, ready-to-run  |
| (set buffers, dispatch)  |        | version of a kernel     |
+--------------------------+        +-------------------------+
We will explore each of these objects in detail in Chapter 7. For now, just know the flow: you create a device, create a command queue, create command buffers, encode commands into them, and commit them to the GPU.
The Metal Shading Language (MSL)
MSL is the language you write GPU programs in. It looks like C++14 with some extra keywords and types:
// A simple MSL compute kernel
kernel void add_arrays(
    device const float* a [[buffer(0)]],
    device const float* b [[buffer(1)]],
    device float* result [[buffer(2)]],
    uint id [[thread_position_in_grid]]
) {
    result[id] = a[id] + b[id];
}
Key MSL features for compute:
- Address space qualifiers: device, constant, threadgroup, thread
- Attribute syntax: [[buffer(0)]], [[thread_position_in_grid]], [[kernel]]
- Built-in vector types: half, half4, float4, uint2, etc.
- SIMD group intrinsics: simd_sum(), simd_shuffle(), etc.
- Threadgroup memory: shared memory within a threadgroup
- SIMD group matrix ops: simdgroup_half8x8, simdgroup_multiply_accumulate()
MSL is covered in depth in Chapter 8.
Metal Performance Shaders (MPS)
MPS is Apple’s library of optimized GPU kernels. For ML, the most relevant parts are:
- MPSMatrixMultiplication: Optimized GEMM
- MPSImageConvolution: 2D convolution
- MPSNNGraph: Neural network inference graph (older API)
- MPSGraph: A more modern compute graph framework
MPS is useful as a starting point, but for state-of-the-art inference performance, custom kernels often outperform MPS. This is because:
- MPS kernels are general-purpose. A custom kernel can be specialized for exact matrix dimensions, quantization formats, and fusion patterns.
- MPS cannot fuse arbitrary operations. A custom kernel can fuse (say) dequantization + matrix multiply + bias add + activation into a single pass, saving memory bandwidth.
- MPS does not support all quantization formats used in modern LLM inference.
This is exactly why akunu writes its own Metal kernels rather than relying on MPS.
When to Use Metal Compute vs. Metal Graphics
Metal supports two kinds of GPU work:
- Graphics (render pipelines): Vertex shaders, fragment shaders, rasterization, render passes. This is for drawing things on screen.
- Compute (compute pipelines): General-purpose computation on the GPU. No rendering, no pixels – just data in, data out.
For ML inference, we use compute exclusively. Here is why:
Graphics Pipeline:                Compute Pipeline:
==================                ==================
Vertices → Vertex Shader          Data → Compute Kernel → Data
         → Rasterizer
         → Fragment Shader
         → Framebuffer
- Fixed-function stages           - Fully programmable
- Designed for rendering          - Designed for GPGPU
- Data flows through              - You control data flow
  a rigid pipeline                  completely
- Output is pixels                - Output is whatever you want
Compute pipelines give us:
- Arbitrary data access patterns: Read from and write to any buffer location
- Threadgroup shared memory: Fast scratchpad for inter-thread communication
- Flexible dispatch: 1D, 2D, or 3D grids of arbitrary size
- No rendering overhead: No rasterizer, no framebuffer, no blend state
There are rare cases where graphics shaders are (ab)used for compute – for example, some older GPU compute techniques use fragment shaders to process textures. But on modern Apple GPUs, compute shaders are the right tool for ML.
The Metal Programming Model
Now let us build up the mental model for how Metal compute works. This is the single most important section in this chapter, so take your time with it.
The Big Picture
Here is the full pipeline from your application to GPU execution:
YOUR APPLICATION (CPU)
======================
1. Get a reference to the GPU
      +------------------+
      | MTLDevice        |  <-- Represents the GPU hardware
      +------------------+
               |
2. Create a command queue (once, reuse it)
               |
               v
      +------------------+
      | MTLCommandQueue  |  <-- FIFO queue of command buffers
      +------------------+
               |
3. Create a command buffer (one per "batch" of work)
               |
               v
      +------------------+
      | MTLCommandBuffer |  <-- Container for GPU commands
      +------------------+
               |
4. Create a compute command encoder
               |
               v
      +--------------------------+
      | MTLComputeCommandEncoder |  <-- Records compute commands
      +--------------------------+
               |
5. Set the pipeline state (which kernel to run)
6. Set buffers (input/output data)
7. Dispatch threadgroups (how many threads to launch)
8. End encoding
               |
               v
9. Commit the command buffer to the GPU
               |
               v
GPU EXECUTION
=============
The GPU picks up the command buffer from the queue,
executes the recorded commands:
  - Binds the kernel
  - Binds the buffers
  - Launches threadgroups across GPU cores
  - Each thread runs the kernel function
  - Results are written to output buffers
               |
               v
RESULTS AVAILABLE
=================
(In the output buffer, which on UMA is already
accessible to the CPU -- no copy needed!)
Let us walk through each component.
MTLDevice – The GPU
Everything starts with a MTLDevice. This object represents the GPU hardware. You get it like this:
// Swift
guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("Metal is not supported on this device")
}
Or in Objective-C:
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
On a Mac with Apple Silicon, this gives you the built-in GPU. On Macs with multiple GPUs (e.g., an Intel-based Mac Pro, or an Intel Mac with an eGPU attached), you can enumerate all devices with MTLCopyAllDevices().
The device is your factory for creating everything else: buffers, command queues, pipeline states, libraries.
MTLCommandQueue – The Submission Highway
A command queue is an ordered sequence of command buffers. You typically create one queue at startup and reuse it for the lifetime of your application:
guard let commandQueue = device.makeCommandQueue() else {
    fatalError("Could not create command queue")
}
Think of it as a highway on-ramp. Command buffers you commit to the queue will be executed in order (mostly – Metal can reorder independent work for efficiency, but the observable results respect submission order for resources).
MTLCommandBuffer – A Batch of Work
A command buffer is a container for GPU commands. You create one whenever you have work to submit:
guard let commandBuffer = commandQueue.makeCommandBuffer() else {
    fatalError("Could not create command buffer")
}
A command buffer can contain multiple encoder passes. For compute work, each pass uses a MTLComputeCommandEncoder. The command buffer is not executed until you call commit().
Command Buffer Lifecycle:
=========================
Created ──> Encoding ──> Committed ──> Scheduled ──> Completed
            (you record  (you call     (Metal        (GPU has
             commands)    .commit())    schedules     finished)
                                        execution)
MTLComputeCommandEncoder – Recording Commands
The encoder is how you record commands into the command buffer. For compute work:
guard let encoder = commandBuffer.makeComputeCommandEncoder() else {
    fatalError("Could not create compute encoder")
}
// Record commands:
encoder.setComputePipelineState(pipelineState)
encoder.setBuffer(inputBuffer, offset: 0, index: 0)
encoder.setBuffer(outputBuffer, offset: 0, index: 1)
encoder.dispatchThreadgroups(gridSize, threadsPerThreadgroup: groupSize)
encoder.endEncoding()
Important: the encoder does not execute anything. It records commands into the command buffer. Execution happens later when you commit.
MTLComputePipelineState – The Compiled Kernel
Before you can dispatch a kernel, you need to compile it into a pipeline state object (PSO). This involves:
- Loading your MSL source code (or a pre-compiled .metallib binary)
- Getting an MTLFunction from the library
- Creating an MTLComputePipelineState from the function
// Load the default library (compiled from .metal files in your project)
let library = device.makeDefaultLibrary()!
// Get the kernel function by name
let function = library.makeFunction(name: "add_arrays")!
// Create the pipeline state (this compiles the function for the GPU)
let pipelineState = try device.makeComputePipelineState(function: function)
Creating a PSO can be expensive (it involves final compilation and optimization), so you typically do it once at startup and cache the result. We will explore this in detail in Chapter 7.
Dispatch – Launching Threads
The final piece is telling the GPU how many threads to launch. Metal gives you two options:
// Option 1: Dispatch by threadgroup count
// You specify: (number of threadgroups) x (threads per threadgroup)
let gridSize = MTLSize(width: 64, height: 1, depth: 1)
let groupSize = MTLSize(width: 256, height: 1, depth: 1)
encoder.dispatchThreadgroups(gridSize, threadsPerThreadgroup: groupSize)
// Total threads = 64 * 256 = 16,384
// Option 2: Dispatch by total thread count (Metal adjusts automatically)
let totalThreads = MTLSize(width: 16384, height: 1, depth: 1)
let groupSize = MTLSize(width: 256, height: 1, depth: 1)
encoder.dispatchThreads(totalThreads, threadsPerThreadgroup: groupSize)
Option 1 (dispatchThreadgroups) requires you to calculate the grid dimensions yourself. Option 2 (dispatchThreads) lets you specify the total number of threads, and Metal handles the math. We will discuss the tradeoffs in Chapter 9.
Putting It All Together: Hello World Compute Shader
Let us write a complete example that adds two arrays on the GPU. This is the “Hello World” of GPU computing.
Step 1: The Metal Shader (MSL)
Create a file called compute.metal:
// compute.metal
// A simple kernel that adds two arrays element-wise.
#include <metal_stdlib>
using namespace metal;
kernel void add_arrays(
    device const float* inA [[buffer(0)]],
    device const float* inB [[buffer(1)]],
    device float* out [[buffer(2)]],
    uint id [[thread_position_in_grid]]
) {
    out[id] = inA[id] + inB[id];
}
Let us break down every piece:
- #include <metal_stdlib> – Includes standard Metal functions and types.
- using namespace metal; – Avoids having to prefix everything with metal::.
- kernel void add_arrays(...) – The kernel keyword marks this as a compute kernel entry point. It must return void.
- device const float* inA [[buffer(0)]] – A pointer to read-only data in the device address space (GPU-accessible memory). The [[buffer(0)]] attribute tells Metal this is bound at buffer index 0.
- device float* out [[buffer(2)]] – A writable output buffer at index 2.
- uint id [[thread_position_in_grid]] – A built-in variable that gives each thread its unique index in the dispatch grid. Thread 0 gets id=0, thread 1 gets id=1, etc.
The kernel body is trivial: each thread reads one element from inA and inB, adds them, and writes the result to out.
Visualization of execution:
===========================
Thread 0: out[0] = inA[0] + inB[0]
Thread 1: out[1] = inA[1] + inB[1]
Thread 2: out[2] = inA[2] + inB[2]
...
Thread N: out[N] = inA[N] + inB[N]
Each thread handles exactly one element.
All threads execute in parallel across GPU cores.
Step 2: The Host Code (Swift)
Here is the complete Swift code to set up Metal, compile the kernel, prepare data, dispatch the kernel, and read the results:
import Metal
import Foundation
// ============================================================
// STEP 1: Get the GPU device
// ============================================================
guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("Metal is not supported on this device")
}
print("Using GPU: \(device.name)")
// ============================================================
// STEP 2: Create a command queue
// ============================================================
guard let commandQueue = device.makeCommandQueue() else {
    fatalError("Could not create command queue")
}
// ============================================================
// STEP 3: Load and compile the shader
// ============================================================
// Option A: Load from a .metal file in the project bundle
// let library = device.makeDefaultLibrary()!
// Option B: Compile from source string at runtime
let shaderSource = """
#include <metal_stdlib>
using namespace metal;
kernel void add_arrays(
    device const float* inA [[buffer(0)]],
    device const float* inB [[buffer(1)]],
    device float* out [[buffer(2)]],
    uint id [[thread_position_in_grid]]
) {
    out[id] = inA[id] + inB[id];
}
"""
let library = try! device.makeLibrary(source: shaderSource, options: nil)
let function = library.makeFunction(name: "add_arrays")!
let pipelineState = try! device.makeComputePipelineState(function: function)
// ============================================================
// STEP 4: Prepare the data
// ============================================================
let arrayLength = 1_000_000
let bufferSize = arrayLength * MemoryLayout<Float>.size
// Create Metal buffers with shared storage mode (CPU + GPU access)
let bufferA = device.makeBuffer(length: bufferSize, options: .storageModeShared)!
let bufferB = device.makeBuffer(length: bufferSize, options: .storageModeShared)!
let bufferOut = device.makeBuffer(length: bufferSize, options: .storageModeShared)!
// Fill input buffers with data
let pointerA = bufferA.contents().bindMemory(to: Float.self, capacity: arrayLength)
let pointerB = bufferB.contents().bindMemory(to: Float.self, capacity: arrayLength)
for i in 0..<arrayLength {
    pointerA[i] = Float(i)
    pointerB[i] = Float(i) * 2.0
}
// ============================================================
// STEP 5: Create a command buffer and encoder
// ============================================================
let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
// ============================================================
// STEP 6: Encode the compute command
// ============================================================
encoder.setComputePipelineState(pipelineState)
encoder.setBuffer(bufferA, offset: 0, index: 0)
encoder.setBuffer(bufferB, offset: 0, index: 1)
encoder.setBuffer(bufferOut, offset: 0, index: 2)
// Calculate dispatch sizes
let threadGroupSize = MTLSize(width: 256, height: 1, depth: 1)
let threadGroups = MTLSize(
    width: (arrayLength + 255) / 256, // Ceiling division
    height: 1,
    depth: 1
)
// Note: 1,000,000 is not a multiple of 256, so this padded grid launches
// 192 extra threads that index past the end of the buffers. Production
// code should bounds-check in the kernel (passing the length in) or use
// dispatchThreads, which supports a non-uniform final threadgroup.
encoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadGroupSize)
encoder.endEncoding()
// ============================================================
// STEP 7: Commit and wait
// ============================================================
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
// ============================================================
// STEP 8: Read the results
// ============================================================
let pointerOut = bufferOut.contents().bindMemory(to: Float.self, capacity: arrayLength)
// Verify a few results
for i in 0..<5 {
    print("out[\(i)] = \(pointerOut[i]) (expected: \(Float(i) + Float(i) * 2.0))")
}
// Output:
// out[0] = 0.0 (expected: 0.0)
// out[1] = 3.0 (expected: 3.0)
// out[2] = 6.0 (expected: 6.0)
// out[3] = 9.0 (expected: 9.0)
// out[4] = 12.0 (expected: 12.0)
The Full Flow Visualized
Here is what happens when you run this code:
CPU Side                                  GPU Side
========                                  ========
1. MTLCreateSystemDefaultDevice()
   +---> Gets handle to GPU
2. makeCommandQueue()
   +---> Creates submission queue
3. makeLibrary(source:)                   Compiler: MSL → AIR → GPU ISA
   makeFunction(name:)
   makeComputePipelineState()
4. makeBuffer() x 3                       Allocates in unified memory
   Fill bufferA, bufferB                  (same physical RAM)
5. makeCommandBuffer()
   makeComputeCommandEncoder()
6. setComputePipelineState()  |
   setBuffer() x 3            | All recorded, not
   dispatchThreadgroups()     | yet executed
   endEncoding()              |
7. commandBuffer.commit()
   +------------------------------------> GPU picks up work
                                                |
   commandBuffer.waitUntilCompleted()           v
   (CPU blocks here)                      GPU launches 3,907
                                          threadgroups of 256
                                          threads each
                                                |
                                                v
                                          Each thread runs
                                          add_arrays kernel
                                                |
                                                v
                                          Results written to
                                          bufferOut
                                                |
   <-----(completion signal)--------------------+  Done!
8. Read pointerOut[i]                     (Same physical memory,
                                          no copy needed!)
Notice step 8: we read the results directly from the buffer’s contents() pointer. There is no “copy back from GPU” step. This is the UMA advantage – the buffer lives in unified memory, accessible to both CPU and GPU.
The Objective-C Version
Since akunu is written in C/C++ and uses the Metal Objective-C API (via Objective-C++), here is the same example in Objective-C:
#import <Metal/Metal.h>
#import <Foundation/Foundation.h>
int main(int argc, const char * argv[]) {
    @autoreleasepool {
        // 1. Get the device
        id<MTLDevice> device = MTLCreateSystemDefaultDevice();
        NSLog(@"Using GPU: %@", device.name);

        // 2. Create command queue
        id<MTLCommandQueue> commandQueue = [device newCommandQueue];

        // 3. Compile the shader
        NSString *shaderSource = @
            "#include <metal_stdlib>\n"
            "using namespace metal;\n"
            "kernel void add_arrays(\n"
            "    device const float* inA [[buffer(0)]],\n"
            "    device const float* inB [[buffer(1)]],\n"
            "    device float* out [[buffer(2)]],\n"
            "    uint id [[thread_position_in_grid]]\n"
            ") {\n"
            "    out[id] = inA[id] + inB[id];\n"
            "}\n";
        NSError *error = nil;
        id<MTLLibrary> library = [device newLibraryWithSource:shaderSource
                                                      options:nil
                                                        error:&error];
        id<MTLFunction> function = [library newFunctionWithName:@"add_arrays"];
        id<MTLComputePipelineState> pso =
            [device newComputePipelineStateWithFunction:function error:&error];

        // 4. Create buffers
        NSUInteger arrayLength = 1000000;
        NSUInteger bufferSize = arrayLength * sizeof(float);
        id<MTLBuffer> bufA = [device newBufferWithLength:bufferSize
                                                 options:MTLResourceStorageModeShared];
        id<MTLBuffer> bufB = [device newBufferWithLength:bufferSize
                                                 options:MTLResourceStorageModeShared];
        id<MTLBuffer> bufOut = [device newBufferWithLength:bufferSize
                                                   options:MTLResourceStorageModeShared];

        // Fill input data
        float *ptrA = (float *)bufA.contents;
        float *ptrB = (float *)bufB.contents;
        for (NSUInteger i = 0; i < arrayLength; i++) {
            ptrA[i] = (float)i;
            ptrB[i] = (float)i * 2.0f;
        }

        // 5-7. Encode and dispatch
        id<MTLCommandBuffer> cmdBuf = [commandQueue commandBuffer];
        id<MTLComputeCommandEncoder> enc = [cmdBuf computeCommandEncoder];
        [enc setComputePipelineState:pso];
        [enc setBuffer:bufA offset:0 atIndex:0];
        [enc setBuffer:bufB offset:0 atIndex:1];
        [enc setBuffer:bufOut offset:0 atIndex:2];
        MTLSize groupSize = MTLSizeMake(256, 1, 1);
        MTLSize gridSize = MTLSizeMake((arrayLength + 255) / 256, 1, 1);
        // As in the Swift version, the padded grid launches 192 extra
        // threads; production code should bounds-check in the kernel.
        [enc dispatchThreadgroups:gridSize threadsPerThreadgroup:groupSize];
        [enc endEncoding];
        [cmdBuf commit];
        [cmdBuf waitUntilCompleted];

        // 8. Read results
        float *ptrOut = (float *)bufOut.contents;
        for (int i = 0; i < 5; i++) {
            NSLog(@"out[%d] = %.1f (expected: %.1f)", i, ptrOut[i],
                  (float)i + (float)i * 2.0f);
        }
    }
    return 0;
}
The Objective-C API maps almost 1:1 to Swift. The main difference is syntax ([device newCommandQueue] vs device.makeCommandQueue()). akunu uses Objective-C++ so it can mix C++ code with Metal API calls seamlessly.
How akunu Uses Metal
Now that you understand the basics, let us peek at how akunu structures its Metal usage. We will go much deeper in Part IV, but a high-level overview helps connect the dots:
akunu Architecture (simplified):
================================
+----------------------------------------------------------+
| akunu C API                                              |
| ak_context_create()   ak_generate()   ak_decode_step()   |
+----------------------------------------------------------+
                             |
                             v
+----------------------------------------------------------+
| MetalDevice                                              |
|                                                          |
| - device_:    id<MTLDevice>                              |
| - queue_:     id<MTLCommandQueue>                        |
| - pso_cache_: HashMap<string, MTLComputePipelineState>   |
|                                                          |
| Methods:                                                 |
| - allocate(size)                    → Buffer             |
| - begin_encoding()                  → starts recording   |
| - set_pipeline(Pipeline)            → sets the kernel    |
| - set_buffer(Buffer, offset, index) → binds data         |
| - dispatch(grid, threadgroup)       → launches threads   |
| - end_encoding()                    → finishes recording |
| - commit()                          → submits to GPU     |
+----------------------------------------------------------+
                             |
                             v
+----------------------------------------------------------+
| Metal Shaders                                            |
| (.metal files compiled to .metallib)                     |
|                                                          |
| - gemv_f16.metal    (matrix-vector multiply)             |
| - gemm_f16.metal    (matrix-matrix multiply)             |
| - attention.metal   (flash attention)                    |
| - rms_norm.metal    (RMS normalization)                  |
| - rope.metal        (rotary position embedding)          |
| - ...                                                    |
+----------------------------------------------------------+
akunu wraps the Metal API in a MetalDevice class that provides a cleaner, C++-friendly interface. Instead of creating MTLCommandBuffer and MTLComputeCommandEncoder objects directly, you call methods like begin_encoding(), set_pipeline(), set_buffer(), and dispatch().
The pipeline state objects (PSOs) are cached in a hash map (pso_cache_) so they are only compiled once. The cache key includes the kernel name and any specialization constants (e.g., "gemv_k128" for a GEMV kernel specialized for K=128).
Key Takeaways
Before moving to Chapter 7 where we dive deep into compute pipelines, let us summarize what we have learned:
- Metal is Apple’s low-level GPU API – the only way to do GPU compute on Apple platforms.
- Metal has three parts: the host API (Swift/ObjC), the Metal Shading Language (MSL), and Metal Performance Shaders (MPS).
- For ML inference, we use compute pipelines, not graphics pipelines.
- The programming model flows: Device → Queue → Command Buffer → Encoder → Dispatch → Commit.
- On Apple Silicon, UMA means zero-copy: buffers allocated with .storageModeShared are accessible to both CPU and GPU without any data transfer.
- MSL is C++14 with GPU extensions: address spaces, vector types, SIMD intrinsics, and thread indexing built-ins.
- akunu wraps Metal in a MetalDevice class that caches pipeline states and provides a streamlined C++ interface.
In the next chapter, we will go deep on compute pipelines: how kernels are compiled, how pipeline states are created and cached, and how the command encoding flow works in practice.
Exercises
- Run the Hello World example: If you have a Mac with Apple Silicon, create an Xcode project (macOS Command Line Tool), add a .metal file with the add_arrays kernel, and run the Swift host code. Verify the output.
- Modify the kernel: Change the kernel to compute out[i] = inA[i] * inB[i] + 1.0 (fused multiply-add). How does the host code change? (Hint: it does not, only the shader changes.)
- Explore MTLDevice properties: Print device.maxThreadgroupMemoryLength, device.maxThreadsPerThreadgroup, and device.name. What values do you get on your hardware?
- Think about error handling: Our example uses fatalError and force-unwraps everywhere. In production code (like akunu), what error handling strategy would you use?
- Compare with CUDA: If you have experience with CUDA, write down the mapping between CUDA concepts and Metal concepts. For example: cudaMalloc maps to device.makeBuffer, <<<grid, block>>> maps to dispatchThreadgroups, etc.
1. Apple. “Metal Programming Guide.” developer.apple.com. The official reference for Metal’s programming model, including device creation, command queues, and compute pipelines. See https://developer.apple.com/documentation/metal/performing-calculations-on-a-gpu.
2. Apple. “Metal Shading Language Specification.” developer.apple.com. The formal MSL specification covering types, address spaces, and built-in functions. See https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf.
3. Apple. “Metal Best Practices Guide.” developer.apple.com. Performance guidance for buffer management, pipeline caching, and dispatch strategies. See https://developer.apple.com/library/archive/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/index.html.