What is Zinc, the LLM inference engine in Zig for AMD and Apple

Last update: 02/04/2026
Author: Isaac
  • Zinc is an LLM inference engine in Zig that leverages AMD RDNA3/RDNA4 GPUs and Apple Silicon without relying on ROCm, CUDA, or Python.
  • It uses Vulkan on Linux and Metal on macOS with shaders fine-tuned for each architecture and validated support for quantized GGUF models.
  • It offers CLI, HTTP server with OpenAI-compatible API, and integrated web chat in a single binary that is easy to compile with Zig.
  • The project is under active development, with good performance figures and a roadmap focused on batching, KV compression and greater efficiency.

Zinc inference engine with Zig and Vulkan

Zinc is a local inference engine for large language models (LLMs), designed to get the most out of AMD's consumer GPUs (RDNA3 and RDNA4) and Apple Silicon chips, using Zig as the language and Vulkan/Metal as graphics backends. If you have a modern graphics card and are frustrated at not being able to use it properly with vLLM, ROCm, or other poorly optimized solutions, this project is designed precisely to fill that gap.

Zinc's objective is very clear: to offer a simple, fast, dependency-free way to run GGUF models locally, with a single build, a single binary, and a server compatible with the OpenAI API that includes an integrated browser chat interface. All this with hand-crafted shaders for each platform, a low-level design, and a user experience tailored to developers and advanced users.

What is Zinc (Zig inference engine) and what problem does it solve?

Zinc is an LLM inference engine written primarily in Zig and designed from the ground up to fully leverage the hardware of two widespread but traditionally underserved GPU families: AMD's consumer GPUs based on RDNA3/RDNA4 and Apple Silicon SoCs (M1, M2, M3, M4, M5). Instead of relying on ROCm, CUDA, or MLX, it builds on Vulkan 1.3 on Linux and Metal on macOS.

The project originated from a recurring complaint in the community: those with a powerful AMD graphics card find that ROCm doesn't support consumer cards well, vLLM simply doesn't work without ROCm, and llama.cpp's Vulkan route treats the GPU as a second-class citizen, with generic shaders, no architecture-specific tuning, and no solid server story.

Zinc's premise is that the hardware is already ready: RDNA3 and RDNA4 graphics cards have more than enough memory bandwidth, compute power, and VRAM to run large models, and Apple Silicon chips feature unified memory and powerful compute units. The real bottleneck is in the software, so a dedicated engine has been built for these platforms with a very low-level systems approach.

In practice, Zinc is currently able to load large GGUF models (around 21 GB of VRAM for the weights of a 35B-parameter model), build a computation graph with hundreds of nodes (for example, for a hybrid transformer with MoE and SSM layers), and generate coherent text at competitive speeds, especially on finely tuned RDNA4 hardware.

Why was Zig chosen to build Zinc?

Zig fits very well with the kind of work a GPU inference engine focused on Vulkan and Metal requires: intensive calls into the graphics API, manual GPU memory management, command buffer recording, and an integrated shader build pipeline. It's pure, unadulterated systems code, and Zig offers a toolset very well suited to that reality.

A key point is `@cImport`, which allows direct access to Vulkan's C ABI without generating complex bindings or additional layers. This makes it easier to call low-level API functions exactly as defined, reducing friction and overhead when interacting with the driver and the graphics runtime.

Zig's `comptime` is very useful for generating dispatch tables by quantization type and for specializing code paths at compile time, which translates into less branching at runtime and more efficient selection of kernels and formats such as Q4_K, Q5_K, Q6_K, Q8_0, or F16 depending on the GGUF model being loaded.

Error and resource management with `errdefer` helps keep GPU resource cleanup under control: buffers, images, descriptors, and other Vulkan/Metal objects are properly released even when intermediate errors occur, preventing VRAM leaks and inconsistent states after failures in the middle of initialization or the decode loop.

Zig's own build system greatly simplifies shader-compilation integration: the `zig build` command can chain steps that invoke `glslc` on Linux, place the compiled SPIR-V files in `zig-out/share/zinc/shaders/`, and produce a single binary at `zig-out/bin/zinc` that bundles everything. On macOS, Metal shaders (MSL) are compiled at runtime.

General architecture: Vulkan in AMD and Metal in Apple Silicon

Zinc is designed with two distinct execution routes: one for Linux with AMD GPUs using Vulkan 1.3, and another for macOS with Apple Silicon using Metal. From the user's point of view, though, the experience is practically identical: build the binary, choose a validated GGUF model, and launch inference via the CLI, HTTP server, or web chat interface.

The AMD route takes full advantage of the capabilities of RDNA3 and RDNA4: hand-written GLSL compute shaders with wave64, cooperative matrix, and architecture-specific tiling strategies, running on Vulkan. This isn't a generic backend that "also works on AMD," but rather kernels specifically tuned to exploit the hardware.


On Apple Silicon, Zinc uses native shaders in Metal (MSL) with simdgroup operations, leveraging unified memory and zero-copy mmap to load models directly without costly intermediate copying steps. This aligns well with Apple's SoC philosophy, where the CPU and GPU share the same memory space.

Backend selection happens automatically at compile time: Zig detects whether it's being built in a Linux environment with Vulkan support or on macOS with Metal and activates the corresponding path. From the build command's perspective, the user simply runs `zig build -Doptimize=ReleaseFast` and obtains a binary ready for their platform.

This design avoids dependence on specific driver stacks like ROCm or MLX, as well as runtimes like CUDA or Python. Zinc is presented as "a single binary, without a heavyweight ML stack," something especially attractive for those who want to set up a local LLM server on a desktop or laptop machine without getting into complex installations.

Supported platforms, GPUs, and models

Zinc currently focuses on two major environments: Linux with AMD RDNA3/RDNA4 GPUs via Vulkan 1.3, and macOS with Apple Silicon from M1 to M5 via Metal. This covers dedicated desktop graphics cards, AI-oriented cards (e.g., Radeon AI PRO), and the chips integrated into Apple laptops and desktops.

On Linux, compatibility has been validated on cards such as the AMD Radeon AI PRO R9700 (RDNA4, 32 GB) and RDNA3 cards such as the RX 7900 XTX. For RDNA4, it is recommended to enable the cooperative matrix optimization by setting the `RADV_PERFTEST=coop_matrix` environment variable before running Zinc, as this enables more efficient compute paths in the RADV driver.

On macOS, Zinc runs on Apple Silicon from the early M1 models to recent generations like the M4 and later, using Metal as the backend and purpose-written MSL shaders. It has been tested, for example, on an M1 Pro with 32 GB of unified memory, enough to accommodate models with several billion parameters, as well as certain 35B variants with specific memory requirements.

In terms of models, Zinc intentionally limits itself to a very specific set of GGUFs that have been validated end-to-end, rather than listing a huge theoretical catalog. Among them are Qwen3.5 variants with 2B and 35B parameters, with quantizations such as Q4_K_M or Q4_K_XL designed to balance VRAM consumption and inference performance.

Supported quantization formats include Q4_K, Q5_K, Q6_K, Q8_0, and F16, allowing for quality and speed adjustments based on available hardware. In current benchmarks, for example, the Qwen3.5 2B Q4_K_M model achieves around 27 tok/s on RDNA4 and about 17 tok/s on Apple Silicon in single-stream decoding.
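To get a feel for what those formats mean in VRAM terms, here is a minimal back-of-the-envelope sketch. The bits-per-weight figures are approximate averages for llama.cpp-style K-quant mixes (the exact value depends on the _S/_M/_XL tensor mix), not values taken from Zinc itself:

```python
# Rough VRAM estimate for quantized GGUF weights.
# Bits-per-weight values below are approximate averages for
# llama.cpp-style quant mixes; real figures vary per tensor mix.
BITS_PER_WEIGHT = {
    "Q4_K": 4.8,   # roughly 4.5-5.0 depending on the mix
    "Q5_K": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def weight_vram_gb(n_params: float, quant: str) -> float:
    """Approximate GB of VRAM needed just for the model weights."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# A 35B-parameter model at ~4.8 bits/weight lands near the ~21 GB
# figure quoted earlier in this article.
print(round(weight_vram_gb(35e9, "Q4_K"), 1))  # → 21.0
```

This also shows why the 2B variants fit comfortably on almost any of the supported GPUs, while the 35B variants need a 32 GB-class card or a well-provisioned unified-memory Mac.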

Current performance and project status

Zinc is still experimental software under active development, but it already boasts respectable performance figures and a stable pipeline for CLI and HTTP server inference. In a test environment with a Radeon AI PRO R9700 (RDNA4, 32 GB, 576 GB/s), speeds of around 38 tok/s were measured on the Qwen3.5 35B-A3B UD model quantized to Q4_K_XL.

In simple command-line decoding tests, a typical example is launching a short prompt like "The capital of France is" and generating 128 tokens. Under these conditions, figures close to 37.95 tok/s with around 26.3 ms/tok have been seen on the 35B model, and around 26.71 tok/s (37.4 ms/tok) on the 2B Q4_K_M model on the same RDNA4 node.

It is noteworthy that the 35B model can run faster than the 2B on that hardware, indicating that the bottleneck isn't simply the number of parameters, but rather the shape of the graph, the type of kernels used, and the efficiency of the decode path. The engine achieves approximately 112.5 GB/s of modeled bandwidth for the full token path, around 19.5% of the chip's theoretical peak.
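The relationships between those figures are simple arithmetic, and it can be worth sanity-checking them. This short sketch converts throughput to per-token latency and computes the bandwidth-utilization fraction from the numbers quoted above:

```python
# Sanity-check the decode numbers: tokens/s to latency per token, and
# what fraction of peak DRAM bandwidth a single decode stream touches.
def ms_per_token(tok_per_s: float) -> float:
    """Per-token latency in milliseconds for a given throughput."""
    return 1000.0 / tok_per_s

def bandwidth_utilization(modeled_gb_s: float, peak_gb_s: float) -> float:
    """Fraction of the card's theoretical peak bandwidth in use."""
    return modeled_gb_s / peak_gb_s

# 37.95 tok/s on the 35B model corresponds to ~26.35 ms per token,
# matching the ~26.3 ms/tok figure above.
print(round(ms_per_token(37.95), 2))
# 112.5 GB/s of modeled traffic on a 576 GB/s part is ~19.5% of peak.
print(round(100 * bandwidth_utilization(112.5, 576.0), 1))  # → 19.5
```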

This apparent "underuse" of memory isn't a problem in itself, since a single decoding stream isn't designed to completely saturate the DRAM bandwidth. The remaining time is spent in medium and small kernels, and at greater depths in the compute graph. To improve overall utilization, the right approach isn't to ask a single stream to do everything, but rather to implement batching and concurrency.

Compared with other engines like llama.cpp, Zinc's current baseline on RDNA4 for the 35B model and basic decoding is around 40 tok/s, while llama.cpp can exceed 100 tok/s on the same node and model. Zinc's roadmap involves closing this gap by optimizing hot kernels, reducing Vulkan overhead, and improving the reasoned chat route (/v1/chat/completions) to bring it closer to the performance of the pure decode route.

Tuned internals and shaders

The Zinc engine is composed of several fundamental blocks that are already complete: Vulkan infrastructure, a GGUF parser and model loader, RDNA3/RDNA4 GPU detection, a native BPE tokenizer (built from GGUF metadata), a set of 16 GLSL compute shaders, compute graph builders for the supported architectures, and the forward-pass loop for decoding.


The AMD shaders are written specifically for RDNA, using wave64, cooperative matrix, and tiling schemes tailored to the architecture's memory hierarchy and compute units. This contrasts with generic solutions where the same shader serves all purposes and performance falls far short of the card's potential.

On Apple Silicon, the path relies on native MSL kernels that exploit simdgroup operations and direct access to models mapped into unified memory, minimizing copies and taking advantage of the chip's integrated nature. Optimization of this route lags slightly behind RDNA4's, but is actively underway.

The computation graph that Zinc builds for a complex model such as Qwen3.5 35B-A3B-UD can exceed 700 nodes, including MoE (Mixture of Experts) layers, SSM layers, and the vocabulary projection. The whole graph is moving towards a strategy where decoding is recorded as a single Vulkan command per token, reducing the roughly 120 GPU-CPU round trips per token that currently penalize performance.

In addition to the inference core, Zinc includes an integrated HTTP server with an OpenAI-compatible API under /v1, as well as a browser-based chat interface served from /. The API supports token streaming, and there is a health endpoint at /health for integrations with orchestrators and monitoring systems.

Installing dependencies and compiling Zinc

Very few external tools are needed to use Zinc, especially on macOS with Apple Silicon. In that environment, simply install Zig (version 0.15.2 or higher) and the Xcode command-line tools using `xcode-select --install`. Vulkan, glslc, Python, and MLX are not required, as the entire process relies on Metal and Zig's build system.

On Linux with an AMD GPU, the setup is just as straightforward: it is recommended to update the packages with `apt update` and install `libvulkan-dev`, `vulkan-tools`, and `glslc`, as well as `git` to clone the repository. Zig 0.15.2+ must be downloaded from the official website and added to the PATH so that `zig build` runs without problems.

Once the dependencies are installed, the build flow is identical on both platforms: clone the official GitHub repository with `git clone https://github.com/zolotukhin/zinc.git`, navigate into the directory, and run `zig build -Doptimize=ReleaseFast`. The `ReleaseFast` option is important to obtain an optimized binary and avoid misleading performance metrics.

The resulting binary is saved at `./zig-out/bin/zinc`. On Linux, the build process also compiles the GLSL shaders to SPIR-V and places them in `zig-out/share/zinc/shaders/`. On macOS, by contrast, Metal shaders are compiled at runtime from the MSL code included in the project.

One relevant detail is worth keeping in mind on Linux with RDNA4: some newer versions of glslc may introduce significant performance regressions when compiling the shaders. The project recommends using the version of glslc from the system repositories rather than newer versions obtained through other means.

Preliminary check and first steps with Zinc

Before issuing any prompt, it is advisable to perform a pre-check with the command `./zig-out/bin/zinc --check`. This command verifies that the host environment, GPU, and necessary assets are in order, and that there are no missing shaders, Vulkan/Metal capabilities, or model metadata.

On machines with RDNA4, it is especially useful to export the variable `RADV_PERFTEST=coop_matrix` before the check and inference sessions, and then call `./zig-out/bin/zinc --check` again. The goal is to confirm that the GPU has been detected correctly, that the cooperative matrix optimizations are active, and that no critical warnings arise.

If the check ends by showing READY [OK], the environment is in a reasonable state for testing inference. If warnings appear, it is advisable to resolve them first (drivers, Vulkan versions, uncompiled shaders, lack of VRAM for the chosen model, etc.) before judging the engine's performance or stability.

The check also validates aspects such as Vulkan device discovery, active GPU selection, compatibility of managed models when passing `--model-id`, GGUF metadata, and an estimate of whether the model fits in the available VRAM on the selected GPU.

Once that phase is over, the recommended flow is very simple: list the catalog of compatible models for the machine, download one validated for your GPU profile, and perform an initial CLI inference with a short prompt. This verifies that the entire pipeline (tokenization, forward pass, text output) works correctly.

Catalog of models, download and management

Zinc incorporates a managed model system that detects the GPU profile (for example, amd-rdna4-32gb or apple-silicon) and shows which validated models best match that memory and capability configuration. The command `./zig-out/bin/zinc model list` is used to browse this catalog.

To download a model, the `model pull` command is used, for example `./zig-out/bin/zinc model pull qwen35-2b-q4k-m`. This downloads the GGUF file corresponding to the Qwen3.5 2B model with Q4_K_M quantization and saves it to a local cache, also verifying its SHA-256 hash to ensure the integrity of the download.
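The integrity check that `model pull` performs is a standard streamed SHA-256 comparison. The sketch below illustrates the general technique; the file name and digest are stand-ins, not real catalog entries:

```python
# Sketch of a GGUF-download integrity check: hash the file in chunks
# (so multi-GB models never need to fit in RAM) and compare against
# the expected digest. Names here are illustrative only.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Streamed SHA-256 of a file, 1 MiB at a time."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_download(path: Path, expected_hex: str) -> bool:
    """True if the file's digest matches the catalog's expected hash."""
    return sha256_of(path) == expected_hex

# Demo with a tiny stand-in file instead of a real multi-GB model:
demo = Path("demo.gguf")
demo.write_bytes(b"not a real model")
expected = hashlib.sha256(b"not a real model").hexdigest()
print(verify_download(demo, expected))  # → True
```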

Once downloaded, you can set a default model for future runs with `./zig-out/bin/zinc model use qwen35-2b-q4k-m`, and check which model is currently active with `./zig-out/bin/zinc model active`. This way, many subsequent invocations don't need to specify `--model-id` explicitly.


When you no longer need a model, you can remove it from the cache with `./zig-out/bin/zinc model rm qwen35-2b-q4k-m`, which helps manage disk space if you experiment with multiple quantization versions or alternative models.

The catalog is intentionally kept narrow: it focuses on models that the project team has validated end-to-end, rather than promising theoretical compatibility with hundreds of GGUF variants. For finer details on model workflows and caching options, refer to Zinc's web documentation.

CLI inference, HTTP server, and web chat

The most direct way to test Zinc is the CLI. Once the binary is compiled, a model downloaded, and the check passed, you can run a first test with something like `./zig-out/bin/zinc --model-id qwen35-2b-q4k-m --prompt "The capital of France is"`. On Linux with RDNA4, remember to keep `export RADV_PERFTEST=coop_matrix` set to take advantage of the cooperative matrix path.

If everything is configured correctly, the logs should show messages from the model loader, the completion of the prefill, and the number of tokens generated along with the speed in tok/s, ending with coherent output text, typically containing the word "Paris" in the example above. Seeing that sequence indicates that the main inference path is operational.

In addition to pure CLI mode, Zinc integrates a server mode accessible with `./zig-out/bin/zinc chat`, which starts an HTTP server (by default on port 9090) and opens the built-in chat interface in your browser. From there, you can interact with the model much like an online chat, including support for longer "reasoning" modes.

The server can also be launched manually with `./zig-out/bin/zinc --model-id qwen35-2b-q4k-m -p 8080`, so you can choose whichever port suits you best. Then simply open http://localhost:8080/ in your browser to access the same chat interface integrated into the binary.

Zinc's exposed HTTP API is OpenAI-compatible under the /v1 path, which lets you point clients or SDKs that already work with the OpenAI API directly at your Zinc instance. A /health endpoint is also included, designed for health checks in production deployments or orchestrated environments.
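Because the API follows the OpenAI shape, a request can be built with nothing but the standard library. This is a minimal sketch; the port (8080) matches the manual-launch example above, and the exact model id accepted in the body is an assumption based on the managed model names:

```python
# Minimal sketch of an OpenAI-style request aimed at a local Zinc
# instance. The /v1/chat/completions path and streaming flag follow
# the OpenAI API shape; the model id here is assumed, not verified.
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # wherever `zinc ... -p 8080` is serving

def chat_request(prompt: str, model: str = "qwen35-2b-q4k-m",
                 stream: bool = False) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions (not yet sent)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("The capital of France is")
print(req.full_url)  # → http://localhost:8080/v1/chat/completions
# To actually send it: urllib.request.urlopen(req) with the server running.
```

With a running server, the same payload works from any OpenAI-compatible SDK by overriding its base URL.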

Current limitations and roadmap

Although Zinc is already functional, the project openly declares that it is still in an experimental phase and that some parts are still evolving. CLI mode is currently the most polished way to start; the server, model coverage, and performance fine-tuning are being refined in frequent iterations.

There are some acknowledged limitations: the list of supported models is intentionally kept narrow; the Apple Silicon path is still being fine-tuned to approach RDNA4's efficiency; and the continuous batching needed to serve multiple clients simultaneously is on the roadmap but not completely finalized.

The development team is working along several key lines whose goal is to move from "raw decoding above 30 tok/s" to supporting long reasoning workloads while maintaining that figure and improving overall GPU utilization. This includes bridging the gap between /v1/completions and /v1/chat/completions, where chat templates and time-to-first-token (TTFT) add extra overhead.

Another area of iteration is reducing descriptor churn on Vulkan's hot path: the idea is to reuse bindings and minimize per-token work in the decoding loop, so that CPU time and latency per Vulkan operation stay as low as possible.

The planned advanced phases include TurboQuant KV compression, a key-value (KV) state compression technique that can reduce memory consumption and improve speed on long sequences, as well as continuous batching to increase the aggregate token rate across multiple simultaneous streams.
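To see why KV compression matters for long sequences, it helps to estimate the cache size with the standard transformer formula. The layer and head dimensions below are illustrative placeholders, not Qwen3.5's actual configuration:

```python
# Back-of-the-envelope KV-cache size for a transformer decoder.
# The factor of 2 covers keys plus values; F16 elements by default.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Total bytes of KV cache for one decode stream."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative config: 48 layers, 8 KV heads, head_dim 128,
# a 32k-token context, F16 storage.
gb = kv_cache_bytes(48, 8, 128, 32_768) / 1e9
print(round(gb, 2))  # → 6.44 (GB per stream)
```

At several gigabytes per stream, the appeal of both KV compression (shrinking each stream's cache) and continuous batching (amortizing weights across streams that each carry their own cache) becomes obvious.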

Those who want to delve deeper into the code have a relatively manageable base: about 5,000 lines of Zig and about 2,000 lines of GLSL, along with the MSL shaders for Apple, profiling tools (`--profile`, still being adjusted so that it does not distort measurements), and a detailed development guide in the web documentation.

Overall, Zinc paints an interesting picture for anyone with a modern AMD GPU or Apple Silicon system who wants to truly leverage their hardware for local LLMs without resorting to ROCm, CUDA, or heavy Python runtimes. Its combination of Zig, Vulkan, and Metal, together with shaders tailored to each architecture, makes it a strong contender for building an efficient inference environment on your own machine, whether for experimentation, serving an OpenAI-compatible API, or gaining a deep understanding of how a low-level inference engine is built.