
Conversation

@kpouget
Contributor

@kpouget kpouget commented Jan 9, 2026

This is a follow-up of #17072.

The API Remoting frontend/backend pair allows GGML calls to escape the VM isolation, with the help of virt-gpu paravirtualization (and the virglrenderer library on the host side).

  • ggml-remotingfrontend is a GGML API implementation that intercepts the GGML API calls and forwards them to the virt-gpu virtual device
  • ggml-remotingbackend is a library loaded by virglrenderer (a PR will be opened soon for discussion), which opens a GGML backend library and forwards the calls received from virglrenderer (see the illustrative sketch below)
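
To give a rough idea of the forwarding scheme, here is a minimal, self-contained sketch. All names (apir_example_*, the command values, the wire layout) are hypothetical illustrations, not the actual RPC protocol or identifiers of this PR:

/* Hypothetical sketch of the frontend/backend forwarding idea; the real code
 * uses its own generated RPC layer and the virt-gpu transport. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Made-up command identifiers, in the spirit of an APIR command type. */
typedef enum {
    APIR_EXAMPLE_CMD_GET_DEVICE_COUNT = 1,
    APIR_EXAMPLE_CMD_GRAPH_COMPUTE    = 2,
} apir_example_cmd;

/* Frontend (guest) side: pack a command into a buffer that would travel over
 * the virt-gpu channel to virglrenderer on the host. */
static size_t apir_example_encode(uint8_t *buf, apir_example_cmd cmd,
                                  const void *payload, uint32_t payload_size) {
    uint32_t cmd32 = (uint32_t) cmd;
    memcpy(buf, &cmd32, sizeof(cmd32));
    memcpy(buf + sizeof(cmd32), &payload_size, sizeof(payload_size));
    if (payload_size) {
        memcpy(buf + 2 * sizeof(uint32_t), payload, payload_size);
    }
    return 2 * sizeof(uint32_t) + payload_size;
}

/* Backend (host) side: decode the command and hand it to the GGML backend
 * library that virglrenderer loaded. */
static void apir_example_dispatch(const uint8_t *buf) {
    uint32_t cmd32 = 0, payload_size = 0;
    memcpy(&cmd32, buf, sizeof(cmd32));
    memcpy(&payload_size, buf + sizeof(cmd32), sizeof(payload_size));
    printf("dispatching command %u with %u payload bytes\n", cmd32, payload_size);
}

int main(void) {
    uint8_t buf[64];
    uint32_t dummy = 0;
    apir_example_encode(buf, APIR_EXAMPLE_CMD_GET_DEVICE_COUNT, &dummy, sizeof(dummy));
    apir_example_dispatch(buf);
    return 0;
}

In the actual PR, the serialization code for this RPC is generated (the *.gen.h and *.gen.c files mentioned below), and the transport is the virt-gpu device exposed by the VMM.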

Here is the context behind this PR:

[image]

See the Virglrenderer PR which enables the API Remoting trampoline required in Virglrenderer:
https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1590

  • this work focused on macOS, where in-VM/container inference performance is tied to the remoting stack

  • the code works on Linux, but I didn't thoroughly evaluate the performance there

  • Add support for the APIR capset containers/libkrun#508 --> the libkrun VMM patch that allows routing of the APIR capset to Virglrenderer

Disclaimer: I got help from Claude Code to finalize this PR, mostly through pre-submit reviews (no automated C code generation involved). Claude Code did generate the Python code generator (see the *.gen.h and *.gen.c files) used for the backend/frontend RPC (it was generated based on the C/H files I had manually written).

@kpouget kpouget requested a review from ggerganov as a code owner January 9, 2026 13:29
@kpouget kpouget changed the title from "ggml: new backend for Virglrenderer API Remoting" to "ggml: new backend for Virglrenderer API Remoting (v2)" Jan 9, 2026
@kpouget kpouget changed the title from "ggml: new backend for Virglrenderer API Remoting (v2)" to "ggml: new backend for Virglrenderer API Remoting acceleration (v2)" Jan 9, 2026
@github-actions github-actions bot added the build (Compilation issues), python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) labels Jan 9, 2026
@taronaeo taronaeo self-assigned this Jan 10, 2026
@taronaeo
Collaborator

I'll review this in a while. If we were to merge this, we will need a named maintainer for the backend for maintainability reasons. Will it be you? :)

Collaborator

@taronaeo taronaeo left a comment


  1. Spacing across the PR is very inconsistent. Please use 4-space indentation and keep it consistent.
  2. The vendor files within ggml-remotingfrontend/include - can they be discovered/downloaded separately from the codebase? See:
    - Avoid adding third-party dependencies, extra files, extra headers, etc.
  3. Inconsistent styling:
__attribute__((unused))
static inline const char *apir_command_name(ApirCommandType type)
{

vs.

static ggml_status ggml_backend_remoting_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {

Please follow CONTRIBUTING.md: https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md

@kpouget
Contributor Author

kpouget commented Jan 12, 2026

thanks for the review @taronaeo, I think I followed and fixed all the suggestions

If we were to merge this, we will need a named maintainer for the backend for maintainability reasons. Will it be you? :)

yes, would be me indeed :)

Collaborator

@taronaeo taronaeo left a comment


Looks a lot better now, thank you for cleaning up the code.

  1. I'm still wondering, are the 3rd party vendor files required to be part of GGML/Llama.cpp? (Can they be downloaded separately during development time via a script?)
  2. I'm not sure if I missed it, but I don't see the required GGML_BACKEND_DL_IMPL macro call in this PR. Did GGML register your backend correctly?
  3. #18718 (comment)

I'm also interested in testing this PR out on my MacBook. Do you have any guides/steps for me to follow to test it?

@kpouget
Contributor Author

kpouget commented Jan 13, 2026

I'm also interested in testing this PR out on my MacBook. Do you have any guides/steps for me to follow to test it?

sure :)

the blog post has the steps to reproduce it with pre-compiled binaries:
https://developers.redhat.com/articles/2025/09/18/reach-native-speed-macos-llamacpp-container-inference#try_api_remoting_with_ramalama

actually, you should be able to follow the INSTALL steps from my release page:
https://github.com/crc-org/llama.cpp/releases/tag/b7356-remoting-0.3.0

(I'll try to regenerate the binaries before the end of the week)

and this document has the steps to rebuild the different sources; you can request access

happy to discuss it on IBM-RH slack if you need help

@kpouget
Contributor Author

kpouget commented Jan 14, 2026

For information, I'll be at FOSDEM at the end of the month to present the work behind this PR:
https://fosdem.org/2026/schedule/event/C9NF8K-api_remoting_for_llama_cpp_near-native_gpu_speed_in_macos_containers/

@kpouget
Contributor Author

kpouget commented Jan 14, 2026

I'm not sure if I missed it, but I don't see the required GGML_BACKEND_DL_IMPL macro call in this PR. Did GGML register your backend correctly?

indeed, I'm not using it at the moment (and everything works fine), I'll review tomorrow how it should be used

@taronaeo
Collaborator

For information, I'll be at FOSDEM at the end of the month to present the work behind this PR: https://fosdem.org/2026/schedule/event/C9NF8K-api_remoting_for_llama_cpp_near-native_gpu_speed_in_macos_containers/

That's great and congratulations! I apologise for my slowness in reviewing this.

I've tested this, and it looks great. Performance is pretty good.

There are CI failures. For the LLAMA_CURL ones, could you rebase with master to fix them?

I'm not sure if I missed it, but I don't see the required GGML_BACKEND_DL_IMPL macro call in this PR. Did GGML register your backend correctly?

indeed, I'm not using it at the moment (and everything works fine), I'll review tomorrow how it should be used

Odd and interesting, but I can see it registered during the benchmark, so all is good :)

@taronaeo
Collaborator

Also, could you do/consider the following?

  1. Add yourself to the CODEOWNERS file so that GitHub/we can identify the maintainer to ping when issues arise.
  2. If possible (this can be done in a follow-up PR), add backend documentation, e.g., https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/zDNN.md
  3. nitpick: Could the backend name be a little more descriptive? I was thinking something like ggml-virgl.

@kpouget
Contributor Author

kpouget commented Jan 16, 2026

That's great and congratulations! I apologise for my slowness in reviewing this.

no rush, thanks for having a look, that's much appreciated 🙏🏻

I'm not sure if I missed it, but I don't see the required GGML_BACKEND_DL_IMPL macro call in this PR. Did GGML register your backend correctly?

indeed, I'm not using it at the moment (and everything works fine), I'll review tomorrow how it should be used

Odd and interesting, but I can see it registered during the benchmark, so all is good :)

I'm actually confused about the intended behavior of this. Or rather about the actual behavior, as I have a good idea of the intent.

if I add this:

GGML_BACKEND_DL_IMPL(ggml_backend_remoting_frontend_reg)

I still see that:

load_backend: failed to find ggml_backend_init in /home/kpouget/remoting-linux/llama_cpp/build.remoting-frontend/bin/libggml-vulkan.so
load_backend: failed to find ggml_backend_init in /home/kpouget/remoting-linux/llama_cpp/build.remoting-frontend/bin/libggml-remotingfrontend.so
load_backend: failed to find ggml_backend_init in /home/kpouget/remoting-linux/llama_cpp/build.remoting-frontend/bin/libggml-cpu.so

and I'm confused about the way it's actually implemented (I didn't review it in depth), as I feel that this part already does the job, but in a non-generic way:

    ggml_backend_registry() {
        ...
#ifdef GGML_USE_REMOTINGFRONTEND
        register_backend(ggml_backend_remoting_frontend_reg());
#endif
        ...
    }

Also, could you do/consider the following?

yes sure.
I'll push the rebase soon, I need to CI-validate it first.

@taronaeo
Collaborator

I'm actually confused about the intended behavior of this. Or rather about the actual behavior, as I have a good idea of the intent.

IIRC,

    ggml_backend_registry() {
        ...
#ifdef GGML_USE_REMOTINGFRONTEND
        register_backend(ggml_backend_remoting_frontend_reg());
#endif
        ...
    }

Only registers the backend for static builds i.e., with -DGGML_NATIVE=ON -DGGML_BACKEND_DL=OFF. But when we try to build for dynamic loading i.e., with -DGGML_NATIVE=OFF -DGGML_BACKEND_DL=ON, it would not be able to register the backend.

Please have a go with switching those 2 macros and see if your backend is registered for both cases. It's likely that because GGML_BACKEND_DL_IMPL is not part of your code, building for dynamic loading will fail to load your backend.
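
For reference, here is the gist as a simplified sketch (not the verbatim upstream macro definition, and the export attributes may differ): for dynamic loading, the backend's shared library has to export a ggml_backend_init() entry point, which is exactly the symbol the "failed to find ggml_backend_init" messages in your log are probing for.

// Simplified sketch of what GGML_BACKEND_DL_IMPL provides for dynamic loading;
// the exact upstream expansion may differ.
#include "ggml-backend.h"

// Registration entry point implemented by the remoting frontend.
extern "C" ggml_backend_reg_t ggml_backend_remoting_frontend_reg(void);

// load_backend() dlopen()s each backend .so and resolves this exact symbol
// name, hence the "failed to find ggml_backend_init" messages when a library
// does not export it.
extern "C" ggml_backend_reg_t ggml_backend_init(void) {
    return ggml_backend_remoting_frontend_reg();
}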

@kpouget
Contributor Author

kpouget commented Jan 20, 2026

Please have a go with switching those 2 macros and see if your backend is registered for both cases.

thanks for the suggestion, I went back to this part of the code (loading the ggml library) and managed to rework and simplify the way the remoting backend loads the GGML library implementation. From 3 config options (lib path, reg fct, init fct), I'm down to one (lib path)!

I'm also reworking the way API Remoting gets configured (the path to the libraries to load), so that the hypervisor is in charge of it via an API instead of environment variables. That will make things much cleaner. I'll push the update (and rebase) soon.

nitpick: Could the backend name be a little more descriptive? I was thinking something like ggml-virgl.

yes, I'm thinking about ggml-virtgpu-apir, and I'll try to see if I can have the backend (currently ggml-remotingbackend) stored in a subdirectory of the frontend.

EDIT: ggml-virtgpu-apir cannot work in the build system (because of the second -), so going for ggml-virtgpu

@kpouget
Contributor Author

kpouget commented Jan 26, 2026

@taronaeo, I was able to complete the rebase and finalize multiple aspects of the polishing, including this part:

nitpick: Could the backend name be a little more descriptive? I was thinking something like ggml-virgl.

I used ggml-virtgpu (ggml-virtgpu-apir isn't allowed by the build system, unfortunately, because of the second -) and I moved the backend to ggml-virtgpu/backend. Cleaner this way :)

The only pending thing I see is the documentation, I'll give that a kick during the week.

After quite some struggle to get Virglrenderer, the CI test harness, and llama.cpp to work together on Linux and macOS, I managed to get things properly aligned :)

This version of the PR (named b7755-remoting-0.4.4 in my repo), plus the virglrenderer releases (v1.2.0-remoting-0.3.5-macos / v1.2.0-remoting-0.3.5-linux), should be feature complete; I don't expect any significant change coming from my side anymore.

I was able to get the build and some tests running on Linux, but I'm afraid the Vulkan backend on my Intel CPU has some race conditions that prevent the testing from running end to end :/ I'll review that further to be sure it doesn't come from the API Remoting layer, but the Vulkan testing doesn't succeed any better.

@taronaeo
Collaborator

The only pending thing I see is the documentation, I'll give that a kick during the week.

Feel free to push the documentation in a separate PR :)

I was able to get the build and some tests running on Linux, but I'm afraid the Vulkan backend on my Intel CPU has some race conditions that prevent the testing from running end to end :/ I'll review that further to be sure it doesn't come from the API Remoting layer, but the Vulkan testing doesn't succeed any better.

I think it's fine if the feature is limited to macOS for now. You'll just need to specify in your documentation that there is currently this limitation and that it is being worked on.

As an aside, there are CI errors again haha. Can you fix them? I'll review the PR again in a while.

Copy link
Collaborator

@taronaeo taronaeo left a comment


The GGML-to-device implementation generally looks okay. Just one question about the backend initialization.

@taronaeo
Collaborator

CIs are still failing :(

Once those are fixed, let me know when this PR is ready for merge.

@kpouget
Contributor Author

kpouget commented Jan 26, 2026

Once those are fixed, let me know when this PR is ready for merge.

@taronaeo, the PR is ready from my POV:
  • your comments should all have been addressed
  • the documentation will come later this week with another PR
  • my CI tests passed

only thing is that I couldn't test against the latest master (b7837) because llama-cli wasn't answering correctly :/

./llama_cpp/build.remoting-backend/bin/llama-cli -ngl 99 -m /Users/kevinpouget/models/llama3.2 
> say nothing

{"name": "say", "parameters": {"x": "nothing"}}
> What's the GGML API?

{"name": "get_api_documentation", "parameters": {"x": "GGML API"}}

this ^^^ is the MacOS native run, so I guess something's broken elsewhere ...

seems to be this commit that broke it actually:

as with b7755 I get the expected answer (😛)

> What's the GGML API?

GGML (Geometry Game Markup Language) is a markup language used to describe 3D geometry in games. It's primarily used in the context of game development, particularly with the Unity game engine...

@taronaeo
Collaborator

only thing is that I couldn't test against the latest master (b7837) because llama-cli wasn't answering correctly :/

./llama_cpp/build.remoting-backend/bin/llama-cli -ngl 99 -m /Users/kevinpouget/models/llama3.2 
> say nothing

{"name": "say", "parameters": {"x": "nothing"}}
> What's the GGML API?

{"name": "get_api_documentation", "parameters": {"x": "GGML API"}}

this ^^^ is the MacOS native run, so I guess something's broken elsewhere ...

Interesting. llama-cli is a thin-client with llama-server running in the background. Were you able to get it working with llama-server and a simple HTTP request?

IMO we should try to aim for a working backend before merging with upstream.

@kpouget
Contributor Author

kpouget commented Jan 28, 2026

I've opened #19155; the issue is unrelated to my PR 😌

Below is the latest manual testing, rebased on top of b7849, and there is the automated build and perf test.

$ ramalama   run --image quay.io/crcont/remoting:v0.16.0-apir.0.1.4-rc4  ibm/granite:2b
🦭 > hello
Hello! It's a pleasure to meet you. How can I assist you today?
$ ramalama   run --image quay.io/crcont/remoting:v0.16.0-apir.0.1.4-rc4   smollm:135m
🦭 > hello
Hello! How can I help you?
$ ramalama   run --image quay.io/crcont/remoting:v0.16.0-apir.0.1.4-rc4  ollama://llama3.2
🦭 > hello
{"name": "print", "parameters": {"s": "hello"}}
$ ramalama   run --image quay.io/crcont/remoting:v0.16.0-apir.0.1.4-rc4  mistral:7b
🦭 > hello
 Hello! How can I assist you today? If you have any questions or need help with something, feel free to ask. I'm here to help!

interestingly, the perf testing is unaffected by the bug, although the output text is clearly wrong (and that's the vanilla ggml-metal output)

      "output_text": "{\"name\": \"decide\", \"parameters\": {\"value\": \"see and buy the bike\"}}",
      "output_tokens": 22,
[image]

@taronaeo
Collaborator

Great! Yeah usually it's good to test other models to see if the issue is related to a specific model. I guess this PR is good to merge, merging.

@taronaeo taronaeo merged commit b7feacf into ggml-org:master Jan 28, 2026
147 of 151 checks passed
@kpouget kpouget deleted the upstream branch January 28, 2026 09:52
@kpouget
Contributor Author

kpouget commented Jan 28, 2026

great, thanks again for your help with the review, it was really appreciated!
and great timing to have this merged before FOSDEM, that will make a good conclusion to the talk 😃
