[RFC] ggml: new backend for API Remoting #17072
Very interesting work, thanks for sharing it! Is it possible to get your PoC running on a Linux host with
Not yet, as macOS has been the main target so far, but I am now setting up a Linux environment where I can test this setup. The host side relies on For macOS, the user-friendly instructions are detailed in the blog post, and I can share the steps to build from source on demand.
I opened the RFC PR on virglrenderer: https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1584 and the code now works on Linux (tested with the To reproduce the POC on Linux (with
or simply try it with this command: Note that:
Closing this PR: I opened a new PR, #18718, with the v2.
Hello, I would like to discuss whether this work could be integrated into the `llama.cpp` codebase.

The API Remoting backend/frontend allow escaping the VM isolation, with the help of the `virt-gpu` paravirtualization (and the `virglrenderer` library on the host side).

- `ggml-remotingfrontend` is a GGML API implementation, which intercepts the GGML API calls and forwards them to the `virt-gpu` virtual device.
- `ggml-remotingbackend` is a library loaded by `virglrenderer` (PR will be opened soon for discussion), which opens a GGML library and forwards the calls received from `virglrenderer`.

The code is currently a POC; I will refine it after the first round of feedback.
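To make the frontend/backend split above concrete, here is a minimal sketch of the intercept-and-forward pattern. It is not the PoC's code: every name (`Frontend`, `Backend`, `Transport`, `CMD_ADD_F32`, `put`, `get`) is hypothetical, the wire format is invented for illustration, and the `virt-gpu` transport is replaced by a plain in-process function call. The "operation" is a trivial float addition standing in for a real GGML call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <map>
#include <utility>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Hypothetical command id; the PoC's real wire protocol is not shown in this thread.
constexpr uint32_t CMD_ADD_F32 = 1;

// Append a trivially copyable value to a byte buffer.
template <typename T>
void put(Bytes & buf, const T & value) {
    const auto * p = reinterpret_cast<const uint8_t *>(&value);
    buf.insert(buf.end(), p, p + sizeof(T));
}

// Read a trivially copyable value stored at `offset`.
template <typename T>
T get(const Bytes & buf, std::size_t offset) {
    assert(offset + sizeof(T) <= buf.size());
    T value;
    std::memcpy(&value, buf.data() + offset, sizeof(T));
    return value;
}

// Host side: stands in for the library loaded by virglrenderer.
class Backend {
public:
    Backend() {
        // A real backend would call into the loaded GGML library here.
        handlers_[CMD_ADD_F32] = [](const Bytes & req) {
            const std::size_t a_off = sizeof(uint32_t);
            const std::size_t b_off = a_off + sizeof(float);
            Bytes reply;
            put(reply, get<float>(req, a_off) + get<float>(req, b_off));
            return reply;
        };
    }

    // Decode the command id and run the matching handler.
    // Unknown commands yield an empty reply.
    Bytes dispatch(const Bytes & request) const {
        auto it = handlers_.find(get<uint32_t>(request, 0));
        return it == handlers_.end() ? Bytes{} : it->second(request);
    }

private:
    std::map<uint32_t, std::function<Bytes(const Bytes &)>> handlers_;
};

// Stand-in for the virt-gpu transport: encoded request in, encoded reply out.
using Transport = std::function<Bytes(const Bytes &)>;

// Guest side: exposes the API locally and forwards each call over the transport.
class Frontend {
public:
    explicit Frontend(Transport transport) : transport_(std::move(transport)) {}

    float add_f32(float a, float b) const {
        Bytes request;
        put(request, CMD_ADD_F32);
        put(request, a);
        put(request, b);
        return get<float>(transport_(request), 0);
    }

private:
    Transport transport_;
};
```

To try it, construct a `Backend`, pass a lambda that calls `backend.dispatch(request)` as the `Transport` of a `Frontend`, and call `frontend.add_f32(...)`. The caller only sees the frontend API, while the work happens behind the transport boundary. This sketch copies every argument into the request buffer, which a virtualization-aware transport would try to avoid.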
- `ggml-RPC`. The overall idea is the same, but the transport layer is virtualization-aware, which helps limit buffer copies.
- The `supports_op` method is implemented in a hacky way: I've copied the `ggml-metal` definition to the frontend library, and I expose the few properties required to compute it from the `ggml-metal` backend. IIRC, this was only needed for the micro-benchmark to work correctly (`ggml-rpc` simply returns `true` to avoid this bottleneck).

Here is the context behind this PR: