Examples/kubernetes dev with model downloading functionality #7
shenron0101 wants to merge 2 commits into phymbert:example/kubernetes
Conversation
phymbert left a comment:
Thanks for the effort, this is a good start. We need to bring it to the original repo. Let's merge it, then we can discuss there.
    livenessProbe:
      httpGet:
        path: /
You mean you want to remove this?
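(If the probe is kept: the llama.cpp server exposes a GET /health endpoint, which would be a more meaningful target than /. A minimal sketch; the port and delay below are assumptions, not values from this chart.)

```yaml
livenessProbe:
  httpGet:
    path: /health   # served by llama.cpp server; returns 503 until the model is loaded
    port: 8080      # assumed server port
  initialDelaySeconds: 30  # give the model time to load before probing
```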
    name: modelRunner
    description: A Helm chart for Kubernetes

    # A chart can be either an 'application' or a 'library' chart.
    ---

    {{- end}} (no newline at end of file)
Mind that each file must end with an empty line
    - -c
    - |
      set -e
      if curl -L {{ $modelConfig.url }} --output /models/{{ $modelName }}/{{ $modelName }}.gguf; then
It will not support sharded model files. Better to let the llama.cpp server handle the initial download.
OK, but then we won't be able to have a job running it, which will prevent us from updating it using kubectl apply. Also, I don't believe the llama.cpp server supports auto-download? I know Ollama does. When the llama.cpp server container tries to start, it needs a model file to point to or else it errors out.
No, I developed that feature some time ago; see the doc.
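(For context: the server can download a model at startup via its --model-url flag, which would make the separate download job unnecessary. A minimal sketch of how the container spec could use it; the image tag, model URL, and paths below are illustrative assumptions, not part of this PR.)

```yaml
containers:
  - name: llama-cpp-server
    image: ghcr.io/ggerganov/llama.cpp:server  # assumed image; use whichever the chart targets
    args:
      - --model-url
      - https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf  # hypothetical URL
      - --model
      - /models/phi-2.gguf  # where the server stores/loads the downloaded file
      - --host
      - "0.0.0.0"
      - --port
      - "8080"
```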
Maybe it would be easier if I push the base branch to the original repo?
Yes, ideally we merge here first, and once finalized we can push.
@phymbert @OmegAshEnr01n Awesome work you've done here. Small question: when this chart is deployed, are the models' APIs compatible with the OpenAI API, like the way Together AI works, where I just change OPENAI_API_KEY and OPENAI_BASE_URL (https://api.together.xyz/v1)?
Hi @ceddybi, please check the server API docs from llama.cpp.
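(Short answer for context: the llama.cpp server exposes OpenAI-compatible /v1 endpoints such as /v1/chat/completions, so a client can in principle be pointed at the in-cluster Service. A sketch of the client-side environment; the Service name and port are assumptions, not values from this chart.)

```yaml
env:
  - name: OPENAI_BASE_URL
    value: http://modelrunner.default.svc.cluster.local:8080/v1  # assumed Service name/port
  - name: OPENAI_API_KEY
    value: sk-no-key-required  # the server only checks a key when started with --api-key
```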
Is it necessary to limit this to MIG here? llama.cpp supports pre-Ampere GPUs, so it would be nice to use more standard multi-GPU container techniques.
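(The standard technique referred to here is presumably requesting whole GPUs through the NVIDIA device plugin, which works on pre-Ampere cards that lack MIG. A minimal sketch; the GPU count is illustrative.)

```yaml
resources:
  limits:
    nvidia.com/gpu: 2  # whole-GPU request via the NVIDIA device plugin; no MIG required
```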
Hi,
I have built the Helm chart according to the template you provided earlier. I think this can still be improved in some ways; any comments are welcome.
Feature set for the Helm chart
Pending testing
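(For readers of the chart: the template iterates per-model settings as {{ $modelName }} / {{ $modelConfig.url }}, so the corresponding values.yaml would look roughly like the sketch below; the model name and URL are illustrative assumptions.)

```yaml
models:
  phi-2:  # each key becomes {{ $modelName }}
    url: https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf  # read as {{ $modelConfig.url }}
```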