
Conversation

@kpouget (Contributor) commented Nov 7, 2025

Hello, I would like to discuss whether this work could be integrated into the llama.cpp codebase.

The API Remoting frontend/backend pair allows escaping the VM isolation, with the help of the virt-gpu paravirtualization (and the virglrenderer library on the host side).

  • ggml-remotingfrontend is a GGML API implementation running in the guest: it intercepts the GGML API calls and forwards them to the virt-gpu virtual device.
  • ggml-remotingbackend is a library loaded by virglrenderer (a PR will be opened soon for discussion): it opens a GGML library and forwards to it the calls received from virglrenderer.
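To make the split concrete, here is a minimal sketch of the intended call flow. The names (apir_cmd, frontend_encode, backend_dispatch) are hypothetical and only illustrate the pattern, not the actual symbols of this PR:

// Hypothetical sketch of the API Remoting call flow; names are illustrative.
#include <cstdint>
#include <cstring>
#include <vector>

// One entry per forwarded GGML API call.
enum class apir_cmd : uint32_t {
    GET_DEVICE_PROPS,
    BUFFER_ALLOC,
    GRAPH_COMPUTE,
};

struct apir_header {
    apir_cmd cmd;
    uint32_t payload_size;
};

// Guest side (ggml-remotingfrontend): serialize the intercepted call and hand
// it to the virt-gpu virtual device (the transport is provided by the hypervisor).
std::vector<uint8_t> frontend_encode(apir_cmd cmd, const void * payload, uint32_t size) {
    std::vector<uint8_t> buf(sizeof(apir_header) + size);
    apir_header hdr { cmd, size };
    std::memcpy(buf.data(), &hdr, sizeof(hdr));
    if (size != 0) {
        std::memcpy(buf.data() + sizeof(hdr), payload, size);
    }
    return buf; // written to the virt-gpu command queue by the frontend
}

// Host side (ggml-remotingbackend, loaded by virglrenderer): decode the command
// and invoke the real GGML backend (Metal or Vulkan).
void backend_dispatch(const uint8_t * buf, size_t len) {
    (void) len;
    apir_header hdr;
    std::memcpy(&hdr, buf, sizeof(hdr));
    switch (hdr.cmd) {
        case apir_cmd::GRAPH_COMPUTE:
            // deserialize the ggml_cgraph and run it on the host GGML backend
            break;
        default:
            break;
    }
}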

The code is currently a POC; I will refine it after the first round of feedback.

  • Some serialization functions have been borrowed from ggml-RPC. The overall idea is the same, but the transport layer is virtualization-aware, which helps limit the buffer copies.
  • The supports_op method is implemented in a hacky way: I copied the ggml-metal definition into the frontend library and expose, from the ggml-metal backend, the few properties required to compute it. IIRC, this was only needed for the micro-benchmark to work correctly (ggml-rpc simply returns true to avoid this bottleneck).
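As an illustration of that frontend-side check, here is a minimal sketch, assuming a hypothetical apir_device_props struct filled once from the host-side backend; the real logic mirrors the full ggml-metal definition:

// Illustrative only: the frontend answers supports_op locally to avoid a
// round-trip to the host for every operation.
#include "ggml.h"

// Hypothetical device properties exposed once by the host-side backend.
struct apir_device_props {
    bool has_simdgroup_mm;
};

static bool frontend_supports_op(const apir_device_props & props, const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_NONE:
        case GGML_OP_ADD:
        case GGML_OP_MUL:
            return true;
        case GGML_OP_MUL_MAT:
            // the real check also depends on the tensor types involved
            return props.has_simdgroup_mm;
        default:
            // ggml-rpc simply returns true here; the frontend tries to be accurate
            return false;
    }
}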

Here is the context behind this PR:

[image attached to the original comment]

@github-actions bot added labels on Nov 7, 2025: build (Compilation issues), ggml (changes relating to the ggml tensor library for machine learning), Apple Metal (https://en.wikipedia.org/wiki/Metal_(API))
@rgerganov (Collaborator)

Very interesting work, thanks for sharing it!

Is it possible to get your PoC running on a Linux host with libkrun and KVM?

@kpouget (Contributor, Author) commented Nov 10, 2025

> Is it possible to get your PoC running on a Linux host with libkrun and KVM?

Not yet, as MacOS has been the main target so far, but I'm now working on setting up the Linux environment where I can test this setup.
In theory, it should work fine out of the box. In practice... time will tell :)

The host side relies on virglrenderer, which had to be modified for libkrun/MacOS to work in-process, while on Linux virglrenderer runs as a separate process. So I need to check that my code works well when triggered inside that separate process. Once confirmed, I'll open a PR on virglrenderer upstream and share the instructions to test the full stack on Linux.

For MacOS, the user-friendly instructions are detailed in the blog post, and I can share the steps to build from sources on demand.

@kpouget (Contributor, Author) commented Dec 8, 2025

I opened the RFC PR on virglrenderer: https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1584

The code now works on Linux (tested with the podman --runtime krun lightweight container/VM).

To reproduce the POC on Linux (with krun), prepare a container image that contains:

  • virglrenderer compiled with -Dapir=true
  • this llama.cpp PR compiled with:
    • -DGGML_REMOTINGFRONTEND=ON, to intercept the GGML API calls in the guest container
    • -DGGML_REMOTINGBACKEND=ON, to compile the backend library loaded by the virglrenderer server
    • -DGGML_VULKAN=ON, so that the backend library can load ggml-vulkan on the host, which performs the actual GPU acceleration
  • the following environment variables exposed (a sketch of how the backend library consumes them follows the block below):
# for virglrenderer to load the API Remoting backend
VIRGL_APIR_BACKEND_LIBRARY=/usr/lib64/libggml-remotingbackend.so
VIRGL_APIR_LOG_TO_FILE=/tmp/topsail_apir_virglrenderer.log

# for the API Remoting backend to know how to load the GGML backend
APIR_LLAMA_CPP_GGML_LIBRARY_PATH=/usr/lib64/libggml-vulkan.so
APIR_LLAMA_CPP_GGML_LIBRARY_REG=ggml_backend_vk_reg
APIR_LLAMA_CPP_GGML_LIBRARY_INIT=ggml_backend_vk_init
APIR_LLAMA_CPP_LOG_TO_FILE=/tmp/topsail_apir_llama_cpp.log

# may not be necessary
RENDER_SERVER_EXEC_PATH=/usr/libexec/virgl_render_server
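
As referenced above, here is a rough sketch of how the backend library can resolve the GGML backend from those APIR_LLAMA_CPP_GGML_LIBRARY_* variables (illustrative code, error handling trimmed; load_ggml_backend is a hypothetical name, not the PR's actual implementation):

#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>

// Resolve the GGML backend library named by the environment, so the API
// Remoting backend stays agnostic of which GGML backend (Vulkan, Metal, ...)
// actually runs on the host.
static void * load_ggml_backend() {
    const char * path = std::getenv("APIR_LLAMA_CPP_GGML_LIBRARY_PATH"); // e.g. libggml-vulkan.so
    const char * reg  = std::getenv("APIR_LLAMA_CPP_GGML_LIBRARY_REG");  // e.g. ggml_backend_vk_reg
    const char * init = std::getenv("APIR_LLAMA_CPP_GGML_LIBRARY_INIT"); // e.g. ggml_backend_vk_init
    if (!path || !reg || !init) {
        std::fprintf(stderr, "APIR: missing APIR_LLAMA_CPP_GGML_LIBRARY_* variables\n");
        return nullptr;
    }
    void * handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        std::fprintf(stderr, "APIR: dlopen(%s) failed: %s\n", path, dlerror());
        return nullptr;
    }
    // Look up the registration and init entry points by name.
    void * reg_fn  = dlsym(handle, reg);
    void * init_fn = dlsym(handle, init);
    if (!reg_fn || !init_fn) {
        std::fprintf(stderr, "APIR: dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return nullptr;
    }
    (void) init_fn;
    return reg_fn; // to be cast to the GGML backend registration entry point
}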

Or simply try it with this command:

VERSION=v0.15.0-apir.0.1.1_apir.b7003-remoting-0.2.1_b15-linux

ramalama run --oci-runtime krun \
  --env GGML_VK_DISABLE_INTEGER_DOT_PRODUCT=1 \
  --image  quay.io/crcont/remoting:$VERSION \
  llama3.2 "hello"

Note that:

  • this POC focused on MacOS, where the in-VM/container inference performance is tied to the remoting stack

    • but virglrenderer does not currently work natively on this system (we have a fork with an in-process implementation, but the upstream integration is pending due to lack of time)
  • this POC works on Linux, but with degraded performance. I will investigate what is behind this performance loss and how it can be resolved.

  • this POC does not support multi-user operation (e.g., multiple llama.cpp instances using the API Remoting simultaneously)

@kpouget (Contributor, Author) commented Jan 9, 2026

Closing this PR; I opened a new PR, #18718, with the v2.
