fix: only reset LoRA configs when they have changed from previous batch #19280

Open

agent-enemy-2 wants to merge 1 commit into ggml-org:master from agent-enemy-2:bug-fix/memory-leak

Conversation

agent-enemy-2 commented Feb 3, 2026

Overview

Fix for #19217

Currently we set the LoRA config for every token request, even if the configuration has not changed between batches. It appears that in this PR, llama-context.cpp line 1078 added scheduler reserving whenever new LoRA configs are set, so now, whenever a LoRA config is in use, we reserve the scheduler for every batch we decode.

This change adds a field to the server slot that holds the previous batch's LoRA config. For each batch, we check whether the config is the same as the previous batch's and only set it when it has changed; otherwise we proceed without touching it.
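
For illustration, a rough sketch of that idea is below. The lora_configs_equal helper and the slot.lora / slot.prev_lora names are illustrative assumptions, not necessarily the exact code in this PR; the comparison is assumed to cover adapter path and scale.

// Sketch only: compare the slot's requested LoRA config against the one
// applied for the previous batch, and only re-apply it (and thereby trigger
// a scheduler reserve) when something actually changed.
static bool lora_configs_equal(const std::vector<common_adapter_lora_info> & a,
                               const std::vector<common_adapter_lora_info> & b) {
    if (a.size() != b.size()) {
        return false;
    }
    for (size_t i = 0; i < a.size(); ++i) {
        if (a[i].path != b[i].path || a[i].scale != b[i].scale) {
            return false;
        }
    }
    return true;
}

// per-batch, before decoding (slot.lora and slot.prev_lora are illustrative):
if (!lora_configs_equal(slot.lora, slot.prev_lora)) {
    common_set_adapter_lora(ctx, slot.lora); // this is what triggers the reserve
    slot.prev_lora = slot.lora;              // remember for the next batch
}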

Testing

I was able to reproduce the issue relatively easily with some debug logs:

./llama-server -hf bartowski/Meta-Llama-3.1-8B-Instruct-GGUF:Q8_0 --lora ~/Downloads/LoRA-Llama-3.1-8B-MultiReflection-f16.gguf -c 1024 --parallel 1 --host 0.0.0.0 --port 8080 --verbose

and could see the following repeating endlessly, along with huge memory spikes:

Setting sched_need_reserve to TRUE in the set_adapter_lora function
set_embeddings: value = 0
Reserving memory during decoding

sched_reserve: reserving ...
sched_reserve: max_nodes = 3692
srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" requests"}}],"created":1770077607,"id":"chatcmpl-1gqsKPcqCDHRRMZFNqfWqnl557ARXVc9","model":"bartowski/Meta-Llama-3.1-8B-Instruct-GGUF:Q8_0","system_fingerprint":"b7916-0dfcd3b60","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":109,"prompt_ms":628.118,"prompt_per_token_ms":5.762550458715597,"prompt_per_second":173.5342722227352,"predicted_n":2,"predicted_ms":180.088,"predicted_per_token_ms":90.044,"predicted_per_second":11.105681666740704}}


sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
sched_reserve:       MTL0 compute buffer size =   509.12 MiB
sched_reserve:        CPU compute buffer size =    18.01 MiB
sched_reserve: graph nodes  = 1903
sched_reserve: graph splits = 2

After these changes, the issue no longer appears and memory remains stable.

ggerganov (Member) left a comment

Better to move the check inside llama_context::set_adapter_lora() and not set the sched_need_reserve flag if the same adapters are passed.

agent-enemy-2 force-pushed the bug-fix/memory-leak branch 2 times, most recently from e810f4d to 8edc85d on February 4, 2026 at 05:34
fix: updating LoRA config check, moving it into common.cpp

Removing dangling unused import
agent-enemy-2 (Author) commented Feb 4, 2026

I had to implement the check in common_set_adapter_lora because that method was clearing the LoRAs every time and then setting the sched_need_reserve flag as well.

I added some helpers to check whether the LoRA configs are the same; if they are, we skip both the clearing and the scheduler reserve.
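
As a rough sketch of that shape (not the exact code in this PR): the loras_unchanged helper name is made up, and the loop body is based on the upstream common_set_adapter_lora as I understand it.

// Sketch only: compare the requested adapters against what was applied for
// the previous call and bail out early, so llama_clear_adapter_lora() and
// llama_set_adapter_lora() are never called and sched_need_reserve stays
// untouched when nothing changed.
void common_set_adapter_lora(struct llama_context * ctx, std::vector<common_adapter_lora_info> & lora) {
    if (loras_unchanged(ctx, lora)) {
        return; // same adapters and scales -> no clear, no reserve
    }

    llama_clear_adapter_lora(ctx);
    for (auto & la : lora) {
        if (la.scale != 0.0f) {
            llama_set_adapter_lora(ctx, la.ptr, la.scale);
        }
    }
}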

ggerganov (Member) commented Feb 4, 2026

> I had to implement the check in common_set_adapter_lora because that method was clearing the LoRAs every time and then setting the sched_need_reserve flag as well.

Can't we update the logic in llama_set_adapter_lora() to first check if they are the same and, if they are, not do anything? We already have such logic in other functions, such as:

void llama_context::set_causal_attn(bool value) {
    LLAMA_LOG_DEBUG("%s: value = %d\n", __func__, value);

    if (cparams.causal_attn == value) {
        return;
    }

    cparams.causal_attn = value;

    sched_need_reserve = true;
}
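
Applied to the adapter setter, the same early-return pattern might look roughly like this. This is a hypothetical sketch only: the loras member (adapter pointer to scale) and the exact signature are assumptions, not the actual llama.cpp code.

// Hypothetical sketch of the suggested early-return check.
void llama_context::set_adapter_lora(llama_adapter_lora * adapter, float scale) {
    // assumed member: a map from adapter pointer to its current scale
    auto it = loras.find(adapter);
    if (it != loras.end() && it->second == scale) {
        return; // same adapter with the same scale -> nothing to do, no reserve
    }

    loras[adapter] = scale;

    sched_need_reserve = true;
}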

agent-enemy-2 (Author) commented

Apologies if I'm missing something.

In the common_set_adapter_lora function, we call llama_clear_adapter_lora, which would prevent any check in llama_context::set_adapter_lora from telling whether the LoRA adapter being passed in is the same as what is currently in the context, since there would no longer be a current adapter.

We could remove that call to clear the LoRAs (or move it into llama_context::set_adapter_lora), but I wasn't sure whether there is a reason we want it in the common functionality.

My latest change attempts to do what you're looking for, just in the common_set_adapter_lora function, prior to clearing the context LoRAs.
