The Device Abstraction Layer
If you look at akunu’s core code – the table builder, the prefill engine, the decode loops – you will notice something conspicuously absent: there are no Metal API calls. No id<MTLBuffer>, no [encoder setComputePipelineState:], no MTLSizeMake. The core is pure C++17, completely agnostic to the GPU backend. All hardware interaction flows through a single abstract class: Device.
This chapter examines the Device abstraction, its concrete MetalDevice implementation, and the design decisions behind it.
The Problem
An inference engine needs to:
- Allocate GPU memory
- Load compiled shader libraries
- Create compute pipelines from shader functions
- Encode sequences of GPU commands (set buffers, set parameters, dispatch)
- Submit and synchronize
These operations look completely different across GPU APIs. Metal uses Objective-C message passing ([encoder setBuffer:...]). CUDA uses C function calls (cuLaunchKernel()). Vulkan uses verbose descriptor sets and command buffers. If the core engine hardcodes any of these, it cannot be ported.
The solution is a pure virtual interface that captures the common operations and lets each backend implement them in its native API.
The Device Interface
The Device class in src/core/device.h defines the contract. Let’s walk through it section by section.
Handles: Buffer, Pipeline, Dim3
Before the class itself, three simple structs define the currency of GPU programming:
struct Buffer {
void *handle; // Backend-specific (MTLBuffer*, CUdeviceptr, etc.)
size_t size; // Size in bytes
void *contents; // CPU-accessible pointer (UMA) or nullptr (discrete)
};
struct Pipeline {
void *handle; // Backend-specific (MTLComputePipelineState*, CUfunction, etc.)
};
struct Dim3 {
uint32_t x, y, z;
Dim3(uint32_t x = 1, uint32_t y = 1, uint32_t z = 1) : x(x), y(y), z(z) {}
};
These are plain, trivially copyable structs. No virtual methods, no reference counting, no destructors. On a 64-bit target, Buffer is 24 bytes, Pipeline is 8 bytes, and Dim3 is 12 bytes. They are cheap to copy, pass by value, and store in arrays – which matters because the dispatch table stores hundreds of them.
The void* handle pattern is the C equivalent of generics. On Metal, Buffer::handle is a CFBridgingRetain’d id<MTLBuffer>. On CUDA, it would be a CUdeviceptr. The core code never dereferences these pointers; it just passes them back to the device.
Device Info
virtual const char *name() const = 0;
virtual const char *backend_name() const = 0; // "metal", "cuda", "vulkan"
virtual int gpu_core_count() const = 0;
virtual int gpu_family() const { return 0; } // Apple GPU family (7=M1, 8=M2/M3, 9=M4)
virtual size_t total_memory() const = 0;
These methods expose hardware capabilities that affect algorithmic decisions. gpu_core_count() is used by ChipConfig::from_gpu() to determine SLC size estimates, GEMV variant selection, and chain decode chunk size. gpu_family() distinguishes M1/M2/M3/M4 for generation-specific tuning.
Note that gpu_family() has a default implementation returning 0. This is because non-Apple backends do not have “GPU families” – the method is Apple-specific but exposed at the abstract level because ChipConfig needs it.
Library and Pipeline Management
virtual bool load_library(const std::string& path) = 0;
virtual Pipeline get_pipeline(const std::string& name) = 0;
virtual Pipeline get_pipeline(const std::string& name,
const std::string& cache_key,
const uint32_t *constant_indices,
const uint32_t *constant_values,
int n_constants,
const uint32_t *constant_types = nullptr) = 0;
load_library loads a compiled shader library (metallib on Metal, PTX/CUBIN on CUDA). get_pipeline retrieves a named compute function, optionally specialized with function constants.
The two-overload design is important. The simple overload (get_pipeline(name)) handles the common case of a kernel with no specialization. The extended overload with constant_indices/constant_values handles Metal function constants and could map to CUDA template instantiation or Vulkan specialization constants.
The cache_key parameter in the extended overload is separate from name because the same kernel function can produce multiple specialized pipelines. For example, gemv_mlx_q4 with group_size=64, K=4096 gets a cache key of "gemv_mlx_q4_gs64_k4096", while the same kernel with K=2048 gets "gemv_mlx_q4_gs64_k2048". Both pipelines come from the same source function but are compiled with different constants.
Buffer Management
virtual Buffer allocate(size_t bytes) = 0;
virtual Buffer allocate(const void *data, size_t bytes) = 0;
virtual void free_buffer(Buffer buf) = 0;
virtual void write_buffer(Buffer dst, const void *src, size_t bytes, size_t offset = 0) {
if (dst.contents) memcpy((char *)dst.contents + offset, src, bytes);
}
virtual void read_buffer(void *dst, Buffer src, size_t bytes, size_t offset = 0) {
if (src.contents) memcpy(dst, (const char *)src.contents + offset, bytes);
}
virtual void *buffer_contents(Buffer buf) { return buf.contents; }
The two allocate overloads handle empty allocation (for scratch buffers) and initialized allocation (for uploading weight data). free_buffer releases the GPU memory.
write_buffer and read_buffer have default implementations that use memcpy – correct for UMA where buf.contents is a CPU-accessible pointer. A CUDA backend would override these to use cuMemcpyHtoD and cuMemcpyDtoH.
Command Encoding
This is the heart of the interface – the methods that encode GPU commands:
virtual void begin_encoding() = 0;
virtual void set_pipeline(Pipeline pso) = 0;
virtual void set_buffer(Buffer buf, int offset, int index) = 0;
virtual void set_bytes(const void *data, int size, int index) = 0;
virtual void set_threadgroup_memory(int bytes, int index) = 0;
virtual void dispatch(Dim3 grid, Dim3 threadgroup) = 0;
virtual void dispatch_threads(Dim3 total, Dim3 threadgroup) = 0;
virtual double end_encoding_sync() = 0;
virtual void end_encoding_async() = 0;
virtual void wait() = 0;
The encoding sequence mirrors Metal’s command encoder pattern:
begin_encoding()
set_pipeline(pso)
set_buffer(buf, offset, index)
set_bytes(params, size, index)
dispatch(grid, threadgroup)
// ... more dispatches ...
end_encoding_sync() // or end_encoding_async() + wait()
On Metal, these map directly:
| Device Method | Metal Equivalent |
|---|---|
| `begin_encoding()` | `[queue commandBuffer]` + `[cmdBuffer computeCommandEncoder]` |
| `set_pipeline(pso)` | `[encoder setComputePipelineState:pso]` |
| `set_buffer(buf, off, idx)` | `[encoder setBuffer:buf offset:off atIndex:idx]` |
| `set_bytes(data, size, idx)` | `[encoder setBytes:data length:size atIndex:idx]` |
| `set_threadgroup_memory(bytes, idx)` | `[encoder setThreadgroupMemoryLength:bytes atIndex:idx]` |
| `dispatch(grid, tg)` | `[encoder dispatchThreadgroups:grid threadsPerThreadgroup:tg]` |
| `dispatch_threads(total, tg)` | `[encoder dispatchThreads:total threadsPerThreadgroup:tg]` |
| `end_encoding_sync()` | `[encoder endEncoding]` + `[cmdBuffer commit]` + `[cmdBuffer waitUntilCompleted]` |
On CUDA, the mapping would be different but conceptually similar: begin_encoding would create a CUDA stream, set_pipeline would set the current function, and dispatch would call cuLaunchKernel.
Advanced Synchronization
The interface provides several synchronization patterns beyond simple sync/async:
virtual void end_encoding_handler() { end_encoding_sync(); }
virtual void wait_handler() { wait(); }
virtual void end_encoding_event() { end_encoding_async(); }
virtual void begin_encoding_after_event() { begin_encoding(); }
virtual void wait_event() { wait(); }
virtual void wait_async() { wait(); }
These have default implementations that fall back to simple sync/async, but MetalDevice overrides them with more sophisticated patterns:
| Pattern | Purpose | Metal Implementation |
|---|---|---|
| Handler | CPU+GPU overlap via completion callback | addCompletedHandler: + semaphore |
| Event | GPU-GPU pipeline via shared event | encodeSignalEvent: + encodeWaitForEvent: |
| Wait Async | Wait for previous buffer only | Check prevCmdBuffer only |
The event-based pattern is used for chain decode pipelining: while the GPU executes command buffer N, the CPU encodes command buffer N+1. The shared event ensures N+1 does not start executing until N completes, without requiring a CPU round-trip.
Dispatch Table Fast Path
virtual void encode_dispatch_table(const void *table_ptr,
int start_position, int count) {
// Default: use generic encode_chain (virtual calls per command)
}
virtual void encode_command_range(const void *table_ptr,
int position,
int cmd_start, int cmd_count) {}
These methods allow the backend to bypass the virtual call overhead of the generic encode_chain function. MetalDevice overrides encode_dispatch_table with a tight loop that calls Metal API functions directly on the ObjC encoder object, avoiding the per-command virtual dispatch through set_pipeline, set_buffer, etc.
The void* parameter (pointing to a DispatchTable) is a deliberate abstraction leak – the backend needs to know the table structure to iterate it efficiently. The alternative (encoding through the virtual interface) works but is slower.
Hardware Capabilities
virtual ChipConfig chip_config() const;
virtual const DTypeDescriptor& dtype_lookup(uint32_t dtype) const;
const char *embedding_kernel_for(uint32_t dtype) const;
const char *gemm_kernel_for(uint32_t dtype, int M) const;
chip_config() returns hardware tuning parameters derived from gpu_core_count() and gpu_family(). The default implementation (in device_defaults.cpp) calls ChipConfig::from_gpu():
ChipConfig Device::chip_config() const {
return ChipConfig::from_gpu(gpu_core_count(), gpu_family());
}
dtype_lookup returns the kernel name and dispatch geometry for a given quantization format. Both have default implementations using the global tables in dtype_descriptor.h and chip_config.h.
Factory
static std::unique_ptr<Device> create_default(); // declared inside class Device
This static factory method creates the default device for the current platform. On macOS, it creates a MetalDevice. On a hypothetical CUDA platform, it would create a CudaDevice.
The MetalDevice Implementation
MetalDevice in backend/metal/metal_device.h implements all pure virtual methods. It is an Objective-C++ class (compiled from .mm files) that wraps Metal API objects.
Internal State
class MetalDevice : public Device {
private:
void *device_; // id<MTLDevice> (via CFBridgingRetain)
void *queue_; // id<MTLCommandQueue>
void *library_; // id<MTLLibrary>
void *cmd_buffer_; // id<MTLCommandBuffer> (current)
void *encoder_; // id<MTLComputeCommandEncoder> (current)
size_t allocated_bytes_ = 0;
std::unordered_map<std::string, void *> pso_cache_;
std::string device_name_;
};
Notice the void* pointers – even within the MetalDevice, the ObjC objects are stored as raw pointers. This is because the header file (metal_device.h) is included by C++ translation units that cannot parse Objective-C types. The actual ObjC types are accessed through the AkunuMetalState wrapper in metal_device_impl.h:
@interface AkunuMetalState : NSObject
@property(nonatomic, strong) id<MTLDevice> device;
@property(nonatomic, strong) id<MTLCommandQueue> queue;
@property(nonatomic, strong) id<MTLLibrary> library;
@property(nonatomic, strong) id<MTLCommandBuffer> cmdBuffer;
@property(nonatomic, strong) id<MTLComputeCommandEncoder> encoder;
@property(nonatomic, strong) id<MTLCommandBuffer> prevCmdBuffer;
@property(nonatomic, strong) id<MTLSharedEvent> pipelineEvent;
@property(nonatomic, assign) uint64_t eventValue;
@end
The device_ pointer is actually a CFBridgingRetain’d reference to an AkunuMetalState instance. In the .mm implementation, a macro bridges back to the typed object:
#define STATE ((__bridge AkunuMetalState *)device_)
This pattern – ObjC state wrapped in a C++ class with void* storage – is the standard way to mix ObjC and C++ in headers that must be parseable by both compilers.1
Pipeline Caching
std::unordered_map<std::string, void *> pso_cache_;
Every get_pipeline call first checks the cache. Pipeline creation (compiling a Metal function into a compute pipeline state) is expensive – it involves shader compilation, register allocation, and GPU resource validation. Caching ensures this happens once per kernel variant.
For function-constant-specialized pipelines, the cache key includes the specialization parameters:
Pipeline MetalDevice::get_pipeline(const std::string &name,
                                   const std::string &cache_key, ...) {
  auto it = pso_cache_.find(cache_key);
  if (it != pso_cache_.end())
    return {it->second};
  // ... create specialized pipeline, CFBridgingRetain it into ptr ...
  pso_cache_[cache_key] = ptr;
  return {ptr};
}
A model with MLX Q4 weights might create 20+ specialized pipelines (different K dimensions for each layer’s GEMV). These are all cached after the first build_dispatch_table() call.
GPU Core Count Detection
MetalDevice queries the actual GPU core count from IOKit, not from Metal:
static int iokit_gpu_core_count() {
  int cores = 0;
  io_iterator_t iter = 0;
  IOServiceGetMatchingServices(kIOMainPortDefault,
      IOServiceMatching("AGXAccelerator"), &iter);
  io_service_t service = IOIteratorNext(iter);
  CFNumberRef ref = (CFNumberRef)IORegistryEntryCreateCFProperty(
      service, CFSTR("gpu-core-count"), kCFAllocatorDefault, 0);
  if (ref) CFNumberGetValue(ref, kCFNumberIntType, &cores);
  // ... CFRelease(ref), IOObjectRelease(service) and (iter) ...
  return cores;
}
Metal’s API does not expose the GPU core count directly. The AGXAccelerator IOKit service (Apple’s GPU driver) stores it as a registry property. This is more reliable than string-parsing the device name, though a fallback heuristic exists:
if ([name containsString:@"Ultra"]) cached = 60;
else if ([name containsString:@"Max"]) cached = 30;
else if ([name containsString:@"Pro"]) cached = 16;
else cached = 10;
GPU Family Detection
int MetalDevice::gpu_family() const {
id<MTLDevice> dev = STATE.device;
if ([dev supportsFamily:MTLGPUFamilyApple9]) return 9; // M4
if ([dev supportsFamily:MTLGPUFamilyApple8]) return 8; // M2/M3
if ([dev supportsFamily:MTLGPUFamilyApple7]) return 7; // M1
return 6;
}
This is used by ChipConfig::from_gpu() to apply generation-specific tuning. M4 (family 9) has improved cache hierarchy and native BF16 support; M1 (family 7) has smaller SLC and no BF16.
Buffer Allocation
Buffer MetalDevice::allocate(size_t bytes) {
id<MTLBuffer> buf = [STATE.device newBufferWithLength:MAX(bytes, 16)
options:MTLResourceStorageModeShared];
void *h = (void *)CFBridgingRetain(buf);
allocated_bytes_ += MAX(bytes, 16);
return {h, bytes, [buf contents]};
}
Key details:
- Minimum 16 bytes (alignment requirement)
- `MTLResourceStorageModeShared` (UMA zero-copy)
- `CFBridgingRetain` prevents ARC from releasing the buffer
- `[buf contents]` provides the CPU-accessible pointer
- `allocated_bytes_` tracks total GPU memory usage
The Encode Dispatch Table Fast Path
The most performance-critical method is encode_dispatch_table, which replays the dispatch table for chain decode. Rather than calling the virtual set_pipeline/set_buffer/dispatch methods (which go through vtable dispatch), it operates directly on the Metal encoder:
void MetalDevice::encode_dispatch_table(const void *table_ptr,
                                        int start_position, int count) {
  const DispatchTable &table = *(const DispatchTable *)table_ptr;
  const DispatchCmd *cmds = table.commands.data();
  const int n_cmds = (int)table.commands.size();
  id<MTLComputeCommandEncoder> enc = STATE.encoder;
  for (int tok = 0; tok < count; tok++) {
    for (int c = 0; c < n_cmds; c++) {
      const DispatchCmd &cmd = cmds[c];
      [enc setComputePipelineState:
               (__bridge id<MTLComputePipelineState>)cmd.pso.handle];
      // ... buffer binding, param patching, dispatch ...
    }
  }
}
This avoids ~6 virtual calls per command (set_pipeline, set_buffer x N, set_bytes, dispatch), which at 200 commands per token and 128 tokens per batch would be ~150,000 virtual calls per submission. The direct Metal calls are significantly faster.2
Thread Safety Note
From the header:
/// THREAD SAFETY: This class is NOT thread-safe. All encoding operations
/// must be called from a single thread. The server serializes access
/// via the per-model mutex in ModelEntry.
Metal command encoders are inherently single-threaded. The MetalDevice does not add locking because the performance cost would be unacceptable in the hot path. Callers must ensure serialized access.
Why This Abstraction Exists
Three reasons:
1. Portability. While akunu currently only has a Metal backend, the abstraction makes it possible to add CUDA, Vulkan, or WebGPU backends without touching the core engine. The dispatch table, table builder, and decode loops work identically regardless of backend.
2. Testability. A mock Device implementation can be used for unit testing without a real GPU. The table builder can be tested by checking the commands it generates, without actually dispatching them.
3. Code organization. The separation forces a clean boundary between “what to compute” (core) and “how to compute it” (backend). Objective-C++ is confined to a single .mm file. The rest of the project compiles as standard C++17.
The abstraction is intentionally thin. It does not try to abstract away GPU programming – it just abstracts away the specific API calls. You still think in terms of buffers, pipelines, threadgroups, and dispatches. This is a pragmatic choice: the alternative (a high-level “tensor operation” abstraction) would hide the performance-critical details that make the difference between 30 tok/s and 70 tok/s.
Summary
The Device abstraction layer is a pure virtual C++ class with ~20 virtual methods covering device info, buffer management, command encoding, and synchronization. MetalDevice implements it using Objective-C++ with direct Metal API calls, an ObjC state wrapper for ARC compatibility, and a pipeline cache for kernel specialization. The fast path (encode_dispatch_table) bypasses the virtual interface entirely for maximum chain decode throughput.
1. This pattern is described in Apple's "Mixing Objective-C and C++" technical note. The key challenge is that Objective-C types (`id<MTLDevice>`, etc.) cannot appear in C++ headers that might be included by pure C++ translation units. The `void*` + bridge cast pattern is the standard workaround. See https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/ProgrammingWithObjectiveC/Introduction/Introduction.html. ↩
2. Virtual function call overhead on modern CPUs is typically 2-5 ns, due to the indirect branch prediction miss on the first call and a potential instruction cache miss for the vtable. At ~150,000 calls per submission, this adds up to 0.3-0.75 ms – significant compared to the ~14 ms total decode time for a 7B model. ↩