
Conversation

@kpouget (Contributor) commented Nov 7, 2025

Hello, I would like to discuss whether this work could be integrated into the llama.cpp codebase.

The API Remoting frontend/backend pair allows escaping the VM isolation, with the help of the virt-gpu paravirtualization (and the virglrenderer library on the host side).

  • ggml-remotingfrontend is a GGML API implementation running in the guest: it intercepts the GGML API calls and forwards them to the virt-gpu virtual device.
  • ggml-remotingbackend is a library loaded by virglrenderer (a PR will be opened soon for discussion): it opens a GGML library and forwards to it the calls received from virglrenderer.
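To make the split concrete, here is a minimal sketch of the intended call flow. The names (apir_cmd, frontend_encode, backend_dispatch) are hypothetical and only illustrate the pattern, not the actual symbols of this PR:

// Hypothetical sketch of the API Remoting call flow; names are illustrative.
#include <cstdint>
#include <cstring>
#include <vector>

// One entry per forwarded GGML API call.
enum class apir_cmd : uint32_t {
    GET_DEVICE_PROPS,
    BUFFER_ALLOC,
    GRAPH_COMPUTE,
};

struct apir_header {
    apir_cmd cmd;
    uint32_t payload_size;
};

// Guest side (ggml-remotingfrontend): serialize the intercepted call and hand
// it to the virt-gpu virtual device (the transport is provided by the hypervisor).
std::vector<uint8_t> frontend_encode(apir_cmd cmd, const void * payload, uint32_t size) {
    std::vector<uint8_t> buf(sizeof(apir_header) + size);
    apir_header hdr { cmd, size };
    std::memcpy(buf.data(), &hdr, sizeof(hdr));
    if (size != 0) {
        std::memcpy(buf.data() + sizeof(hdr), payload, size);
    }
    return buf; // written to the virt-gpu command queue by the frontend
}

// Host side (ggml-remotingbackend, loaded by virglrenderer): decode the command
// and invoke the real GGML backend (Metal or Vulkan).
void backend_dispatch(const uint8_t * buf, size_t len) {
    (void) len;
    apir_header hdr;
    std::memcpy(&hdr, buf, sizeof(hdr));
    switch (hdr.cmd) {
        case apir_cmd::GRAPH_COMPUTE:
            // deserialize the ggml_cgraph and run it on the host GGML backend
            break;
        default:
            break;
    }
}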

The code is currently a POC; I will refine it after the first round of feedback.

  • Some serialization functions have been borrowed from ggml-RPC. The overall idea is the same, but the transport layer is virtualization-aware, which helps limit the buffer copies.
  • The supports_op method is implemented in a hacky way: I copied the ggml-metal definition into the frontend library and expose, from the ggml-metal backend, the few properties required to compute it. IIRC, this was only needed for the micro-benchmark to work correctly (ggml-rpc simply returns true to avoid this bottleneck).
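As an illustration of that frontend-side check, here is a minimal sketch, assuming a hypothetical apir_device_props struct filled once from the host-side backend; the real logic mirrors the full ggml-metal definition:

// Illustrative only: the frontend answers supports_op locally to avoid a
// round-trip to the host for every operation.
#include "ggml.h"

// Hypothetical device properties exposed once by the host-side backend.
struct apir_device_props {
    bool has_simdgroup_mm;
};

static bool frontend_supports_op(const apir_device_props & props, const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_NONE:
        case GGML_OP_ADD:
        case GGML_OP_MUL:
            return true;
        case GGML_OP_MUL_MAT:
            // the real check also depends on the tensor types involved
            return props.has_simdgroup_mm;
        default:
            // ggml-rpc simply returns true here; the frontend tries to be accurate
            return false;
    }
}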

Here is the context behind this PR:

[image attached to the original comment]

@github-actions bot added labels on Nov 7, 2025: build (Compilation issues), ggml (changes relating to the ggml tensor library for machine learning), Apple Metal (https://en.wikipedia.org/wiki/Metal_(API))
@rgerganov (Collaborator)

Very interesting work, thanks for sharing it!

Is it possible to get your PoC running on a Linux host with libkrun and KVM?

@kpouget (Contributor, Author) commented Nov 10, 2025

> Is it possible to get your PoC running on a Linux host with libkrun and KVM?

Not yet, as MacOS has been the main target so far, but I'm now working on setting up the Linux environment where I can test this setup.
In theory, it should work fine out of the box. In practice... time will tell :)

The host side relies on virglrenderer, which had to be modified for libkrun/MacOS to work in-process, while on Linux virglrenderer runs as a separate process. So I need to check that my code works well when triggered inside that separate process. Once confirmed, I'll open a PR on virglrenderer upstream and share the instructions to test the full stack on Linux.

For MacOS, the user-friendly instructions are detailed in the blog post, and I can share the steps to build from sources on demand.

@kpouget (Contributor, Author) commented Dec 8, 2025

I opened the RFC PR on virglrenderer: https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1584

The code now works on Linux (tested with the podman --runtime krun lightweight container/VM).

To reproduce the POC on Linux (with krun), prepare a container image that contains:

  • virglrenderer compiled with -Dapir=true
  • this llama.cpp PR compiled with:
    • -DGGML_REMOTINGFRONTEND=ON, to intercept the GGML API calls in the guest container
    • -DGGML_REMOTINGBACKEND=ON, to compile the backend library loaded by the virglrenderer server
    • -DGGML_VULKAN=ON, so that the backend library can load ggml-vulkan on the host, which performs the actual GPU acceleration
  • the following environment variables exposed (a sketch of how the backend library consumes them follows the block below):
# for virglrenderer to load the API Remoting backend
VIRGL_APIR_BACKEND_LIBRARY=/usr/lib64/libggml-remotingbackend.so
VIRGL_APIR_LOG_TO_FILE=/tmp/topsail_apir_virglrenderer.log

# for the API Remoting backend to know how to load the GGML backend
APIR_LLAMA_CPP_GGML_LIBRARY_PATH=/usr/lib64/libggml-vulkan.so
APIR_LLAMA_CPP_GGML_LIBRARY_REG=ggml_backend_vk_reg
APIR_LLAMA_CPP_GGML_LIBRARY_INIT=ggml_backend_vk_init
APIR_LLAMA_CPP_LOG_TO_FILE=/tmp/topsail_apir_llama_cpp.log

# may not be necessary
RENDER_SERVER_EXEC_PATH=/usr/libexec/virgl_render_server
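
As referenced above, here is a rough sketch of how the backend library can resolve the GGML backend from those APIR_LLAMA_CPP_GGML_LIBRARY_* variables (illustrative code, error handling trimmed; load_ggml_backend is a hypothetical name, not the PR's actual implementation):

#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>

// Resolve the GGML backend library named by the environment, so the API
// Remoting backend stays agnostic of which GGML backend (Vulkan, Metal, ...)
// actually runs on the host.
static void * load_ggml_backend() {
    const char * path = std::getenv("APIR_LLAMA_CPP_GGML_LIBRARY_PATH"); // e.g. libggml-vulkan.so
    const char * reg  = std::getenv("APIR_LLAMA_CPP_GGML_LIBRARY_REG");  // e.g. ggml_backend_vk_reg
    const char * init = std::getenv("APIR_LLAMA_CPP_GGML_LIBRARY_INIT"); // e.g. ggml_backend_vk_init
    if (!path || !reg || !init) {
        std::fprintf(stderr, "APIR: missing APIR_LLAMA_CPP_GGML_LIBRARY_* variables\n");
        return nullptr;
    }
    void * handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        std::fprintf(stderr, "APIR: dlopen(%s) failed: %s\n", path, dlerror());
        return nullptr;
    }
    // Look up the registration and init entry points by name.
    void * reg_fn  = dlsym(handle, reg);
    void * init_fn = dlsym(handle, init);
    if (!reg_fn || !init_fn) {
        std::fprintf(stderr, "APIR: dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return nullptr;
    }
    (void) init_fn;
    return reg_fn; // to be cast to the GGML backend registration entry point
}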

Or simply try it with this command:

VERSION=v0.15.0-apir.0.1.1_apir.b7003-remoting-0.2.1_b15-linux

ramalama run --oci-runtime krun \
  --env GGML_VK_DISABLE_INTEGER_DOT_PRODUCT=1 \
  --image  quay.io/crcont/remoting:$VERSION \
  llama3.2 "hello"

Note that:

  • this POC focused on MacOS, where the in-VM/container inference performance is tied to the remoting stack

    • but virglrenderer does not currently work natively on this system (we have a fork with an in-process implementation, but the upstream integration is pending due to lack of time)
  • this POC works on Linux, but with degraded performance. I will investigate what is behind this performance loss and how it can be resolved.

  • this POC does not support multi-user operation (e.g., multiple llama.cpp instances using the API Remoting simultaneously)

@kpouget (Contributor, Author) commented Jan 9, 2026

Closing this PR; I opened a new PR, #18718, with the v2.
