forked from ggml-org/llama.cpp
feat: merge from master in #16376 #2
Merged
Conversation
(ggml-org#15533)

* Add DeepSeek V3.1 thinking mode support
  - Added COMMON_CHAT_FORMAT_DEEPSEEK_V3_1 enum value
  - Created common_chat_params_init_deepseek_v3_1() function (currently uses the R1 implementation)
  - Created common_chat_parse_deepseek_v3_1() function that handles the V3.1 thinking format:
    - Extracts reasoning content before the '</think>' tag into reasoning_content
    - Extracts regular content after the '</think>' tag into content
    - No opening '<think>' tag in the V3.1 format
  - Added detection logic for V3.1 templates based on the pattern: `message['prefix'] is defined and message['prefix'] and thinking`
  - Added a V3.1 case to the parsing switch statement

  This addresses the issue where V3.1 outputs reasoning content followed by '</think>' and then regular content, without the opening '<think>' tag. (A minimal parsing sketch follows this commit log.)
* Another attempt at V3.1 non-thinking mode
* Fix test, but it's not asserting anything
* Ignore vim swap files in tests dir
* Update the test
* Try using try_find_literal instead of regex
* Passing test
* Revert "Try using try_find_literal instead of regex" (reverts commit c50d887)
* Remove unnecessary change
* Remove comment
* Add code to handle non-thinking mode
* Try to set message['prefix'] when thinking is enabled
* This fixes reasoning, but breaks normal content; we need state in the chat parser
* DeepSeek V3.1 thinking is now the default; disable with `--reasoning-budget 0`
* Simplify (DeepSeek V3.1 reasoning)
* Fix sign inversion bug
* Add some tool calling code (not working)
* Tool calls working in non-reasoning mode
* Attempt a unit test for tool call parsing
* Passing test
* Add tests for both happy path and broken fenced DeepSeek V3.1 tool call variants
* Passing DeepSeek V3.1 tool call tests, but the model is not working
* Revert assistant response prefill change; not my monkeys
* Add fenced_thinking unit test variant; passes, but thinking tool calling still isn't working for some reason
* Tests pass in reasoning mode; the e2e tool test also passes
* Make a copy of the parse_json_tool_calls function for deepseek-v3.1 so as to not accidentally introduce regressions
* Fix thinking_forced_open logic; tool calling broken, need to add another test case
* That's what I get for cargo-culting a newline
* Add multi tool call test for DeepSeek V3.1 non-reasoning
* Move test, remove .gitignore change
* Place deepseek-v3.1 reasoning test directly into the existing reasoning function per CISC's request
* Address whitespace CI failure
* Merge two assert_equals per CISC's request
* Add DeepSeek-V3.1 tests to tests/test-chat.cpp per CISC's request
* Merge DeepSeek V3.1 and regular parse_json_tool_calls() function behaviors by adding an optional update_cursor argument
* Update tests/test-chat-parser.cpp, applying nine separate review suggestions Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* DeepSeek V3.1 fix reasoning_format none
* Strip grammar down to strictly what we expect based on the model card; throw out parts we cargo-culted from R1 that don't make sense
* Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* DeepSeek V3.1 - Add an edge case where thinking is forced open, there is tool calling in the reasoning content, but then the model just stops the output without closing the </think> tag, so it's not a partial. In this case, use the tool call in the reasoning content.
* DeepSeek V3.1 - simplify update_cursor
* Update common/chat.cpp, applying three separate review suggestions Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Fix indent

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
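For readers unfamiliar with the format change described in this commit log, here is a minimal sketch of the split it implements: reasoning text first, then a closing `</think>`, then the regular content, with no opening `<think>` tag. Names such as `split_v3_1` are illustrative only; the actual `common_chat_parse_deepseek_v3_1()` additionally handles partial output and tool calls.

```cpp
// Minimal illustration of the DeepSeek V3.1 split: reasoning before </think>,
// regular content after it, no opening <think>. Not the real parser.
#include <iostream>
#include <string>

struct parsed_msg {
    std::string reasoning_content;
    std::string content;
};

static parsed_msg split_v3_1(const std::string & output) {
    parsed_msg msg;
    const std::string end_tag = "</think>";
    const size_t pos = output.find(end_tag);
    if (pos == std::string::npos) {
        // no closing tag: treat everything as regular content (non-thinking mode)
        msg.content = output;
    } else {
        msg.reasoning_content = output.substr(0, pos);
        msg.content           = output.substr(pos + end_tag.size());
    }
    return msg;
}

int main() {
    const auto msg = split_v3_1("the user wants X, so ...</think>Here is the answer.");
    std::cout << "reasoning: " << msg.reasoning_content << "\n";
    std::cout << "content:   " << msg.content << "\n";
}
```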
* vulkan: sort graph to allow more parallel execution

  Add a backend proc to allow the backend to modify the graph. The Vulkan implementation looks at which nodes depend on each other and greedily reorders them to group together nodes that don't depend on each other. It only reorders the nodes; it doesn't change the contents of any of them. Combined with ggml-org#15489, this reduces the number of synchronizations needed. (A simplified reordering sketch follows below.)
* call optimize_graph per-split
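The following is a simplified sketch of the greedy grouping idea, assuming a plain acyclic dependency graph with stand-in types; it is not the ggml-vulkan implementation.

```cpp
// Greedy "wave" ordering: repeatedly emit every node whose inputs have already
// been emitted, so nodes that don't depend on each other end up adjacent.
// Assumes an acyclic graph; node/graph types are stand-ins, not ggml types.
#include <vector>

struct node {
    std::vector<int> src;   // indices of nodes this node reads from
};

static std::vector<int> sort_for_parallelism(const std::vector<node> & nodes) {
    std::vector<int>  order;
    std::vector<bool> emitted(nodes.size(), false);

    while (order.size() < nodes.size()) {
        // one wave: all nodes whose dependencies were emitted in earlier waves
        for (size_t i = 0; i < nodes.size(); ++i) {
            if (emitted[i]) continue;
            bool ready = true;
            for (int s : nodes[i].src) {
                if (!emitted[s]) { ready = false; break; }
            }
            if (ready) {
                order.push_back((int) i);
            }
        }
        for (int i : order) emitted[i] = true;   // commit the wave before the next pass
    }
    return order;
}
```

Nodes emitted in the same wave have no dependencies on each other, so a backend can record them back to back without intervening synchronization.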
Add SVG and PNG based on llama1-icon.svg
… (ggml-org#15877)
* feat: Extra debugging support for model conversion - added BF16 support for llama-callback-eval and support for dumping intermediate steps in run-org-model.py
… (ggml-org#15828)
* requirements : update transformers/torch for Embedding Gemma

  This commit updates the requirements to support converting Embedding Gemma 300m models. The motivation for this change is that during development I had a local copy of the transformers package, which is what I used for converting the models. This was a mistake on my part; I should have also updated my transformers version to the official release. I had checked the requirements/requirements-convert_legacy_llama.txt file, noted that the version was >=4.45.1,<5.0.0, and came to the conclusion that no update would be needed. This assumed that Embedding Gemma would be in a transformers release at the time Commit fb15d64 ("llama : add support for EmbeddingGemma 300m (ggml-org#15798)") was merged, so anyone wanting to convert the models themselves would be able to do so. However, Embedding Gemma is a preview release, and this commit updates the requirements to use this preview release.
* resolve additional python dependencies
* fix pyright errors in tokenizer test and remove unused import
* contrib : add notes about merging PRs
* Update CONTRIBUTING.md Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* CUDA: Add mul_mat_id support to mmf

  Add support for mul_mat_id for bs < 16
* Review: use warp_size, fix should_use_mmf condition
* Launch one block per expert, stride along n_expert_used
* templatize mul_mat_id
* Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids
* Reduce compile times by dividing mmf into f16, bf16 and f32 variants
* Divide mmf by ncols_dst
* Add missing files
* Fix MUSA/HIP builds
This commit adds caching of the ROCm installation for the windows-latest-cmake-hip job. The motivation for this is that the installation can sometimes hang and/or not complete properly, leaving an invalid installation that later fails the build. By caching the installation we can hopefully keep a good installation available in the cache and avoid the installation step. Refs: ggml-org#15365
… (ggml-org#15893) This commit adds checks for two function pointers returned from ggml_backend_reg_get_proc_address. The motivation for this is that the function pointer could be nullptr if the get proc address function changes in the future. This is also consistent with all the other calls to ggml_backend_reg_get_proc_address in the code base.
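A hedged sketch of the guard pattern this describes is shown below. Only `ggml_backend_reg_get_proc_address` is the real ggml API; the symbol name looked up and the function pointer type are placeholders.

```cpp
// Treat the proc address as optional and only call it when non-null.
// "example_optimize_graph" and example_optimize_t are placeholders.
#include "ggml-backend.h"

typedef void (*example_optimize_t)(ggml_backend_t backend, struct ggml_cgraph * graph);

static void try_optional_proc(ggml_backend_reg_t reg, ggml_backend_t backend, struct ggml_cgraph * graph) {
    example_optimize_t fn =
        (example_optimize_t) ggml_backend_reg_get_proc_address(reg, "example_optimize_graph");
    if (fn == NULL) {
        return;   // the registry may not expose this symbol; skip instead of crashing
    }
    fn(backend, graph);
}
```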
* CANN: implement LRU cache for ACL graphs in CANN backend
  - Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects.
  - Graphs are loaded on demand and evicted using an LRU policy when capacity is exceeded.
  - Updated push, move_to_front, and clear methods to manage cached graphs efficiently.
  - Ensures reuse of graphs, reducing graph reconstruction overhead in the CANN backend.
* fix typo
* The LRU cache capacity can be configured via an env variable Signed-off-by: noemotiovon <757486878@qq.com>
* refactor acl graph
* refactor && fix review comments Signed-off-by: noemotiovon <757486878@qq.com>

Signed-off-by: noemotiovon <757486878@qq.com>
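A minimal sketch of the LRU mechanics described above, assuming a `std::list` kept in most-recently-used-first order; the type and member names below are illustrative, not the actual internals of `ggml_cann_graph_lru_cache`.

```cpp
// LRU sketch: front of the list is the most recently used graph; when the
// configured capacity is exceeded, the back (least recently used) is evicted.
#include <cstddef>
#include <list>
#include <memory>

struct cached_graph { /* captured graph state would live here */ };

struct graph_lru_cache {
    size_t capacity = 12;                                   // e.g. configurable via an env variable
    std::list<std::unique_ptr<cached_graph>> graphs;        // front = most recently used

    void push(std::unique_ptr<cached_graph> g) {
        graphs.push_front(std::move(g));
        if (graphs.size() > capacity) {
            graphs.pop_back();                               // evict least recently used
        }
    }

    void move_to_front(std::list<std::unique_ptr<cached_graph>>::iterator it) {
        graphs.splice(graphs.begin(), graphs, it);           // O(1) relink on cache hit, no copy
    }

    void clear() { graphs.clear(); }
};
```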
* CANN: Add ROPE sin/cos cache for reuse

  Introduce a sin/cos caching mechanism in ROPE to avoid redundant computation across layers. The cache is built on the first layer per device and reused by subsequent layers if parameters match.
  - Added sin_cache / cos_cache pointers and position_length tracking
  - Introduced cache validity flags and properties: (ext_factor, theta_scale, freq_scale, attn_factor, is_neox)
  - Accelerates ROPE by eliminating repeated sin/cos generation

  This change reduces overhead in multi-layer scenarios while preserving correctness by verifying parameter consistency. Co-authored-by: hipudding <huafengchun@gmail.com>
* fix typo Signed-off-by: noemotiovon <757486878@qq.com>

Signed-off-by: noemotiovon <757486878@qq.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
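A sketch of the "reuse only if parameters match" check, with illustrative field names; the actual cache additionally holds device-side sin/cos buffers.

```cpp
// Rebuild the sin/cos tables only when any of the tracked ROPE parameters change.
struct rope_cache {
    bool  valid        = false;
    int   position_len = 0;
    float ext_factor   = 0.0f;
    float theta_scale  = 0.0f;
    float freq_scale   = 0.0f;
    float attn_factor  = 0.0f;
    bool  is_neox      = false;
    // sin/cos device buffers would live here
};

static bool can_reuse(const rope_cache & c, int position_len, float ext_factor, float theta_scale,
                      float freq_scale, float attn_factor, bool is_neox) {
    return c.valid &&
           c.position_len == position_len &&
           c.ext_factor   == ext_factor   &&
           c.theta_scale  == theta_scale  &&
           c.freq_scale   == freq_scale   &&
           c.attn_factor  == attn_factor  &&
           c.is_neox      == is_neox;
}
```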
* tests : filter out no-ops from coverage report

  This commit is a follow-up commit for ggml-org#15745 to address the feedback on how no-op operations should be filtered out from the coverage report. Regarding the feedback about the UNARY and GLU sub-operations not being handled, I am not exactly sure what should be done. They are included in the coverage; for example ABS, ELU, EXP, GELU, GEGLU, GEGLU_ERF etc. are in the list of covered operations:

  ```console
  $ ./build/bin/test-backend-ops --show-coverage
  Operations covered by tests (89):
    ✓ ABS ✓ ACC ✓ ADD ✓ ADD1 ✓ ADD_ID ✓ ARANGE ✓ ARGMAX ✓ ARGSORT ✓ CLAMP ✓ CONCAT
    ✓ CONV_2D ✓ CONV_2D_DW ✓ CONV_3D ✓ CONV_TRANSPOSE_1D ✓ CONV_TRANSPOSE_2D ✓ COS ✓ COUNT_EQUAL ✓ CPY
    ✓ CROSS_ENTROPY_LOSS ✓ CROSS_ENTROPY_LOSS_BACK ✓ DIAG_MASK_INF ✓ DIV ✓ DUP ✓ ELU ✓ EXP ✓ FLASH_ATTN_EXT
    ✓ GATED_LINEAR_ATTN ✓ GEGLU ✓ GEGLU_ERF ✓ GEGLU_QUICK ✓ GELU ✓ GELU_ERF ✓ GELU_QUICK ✓ GET_ROWS
    ✓ GET_ROWS_BACK ✓ GROUP_NORM ✓ HARDSIGMOID ✓ HARDSWISH ✓ IM2COL ✓ IM2COL_3D ✓ L2_NORM ✓ LEAKY_RELU
    ✓ LOG ✓ MEAN ✓ MUL ✓ MUL_MAT ✓ MUL_MAT_ID ✓ NEG ✓ NORM ✓ OPT_STEP_ADAMW ✓ OPT_STEP_SGD ✓ OUT_PROD
    ✓ PAD ✓ PAD_REFLECT_1D ✓ POOL_2D ✓ REGLU ✓ RELU ✓ REPEAT ✓ REPEAT_BACK ✓ RMS_NORM ✓ RMS_NORM_BACK
    ✓ ROLL ✓ ROPE ✓ ROPE_BACK ✓ RWKV_WKV6 ✓ RWKV_WKV7 ✓ SCALE ✓ SET ✓ SET_ROWS ✓ SGN ✓ SIGMOID ✓ SILU
    ✓ SILU_BACK ✓ SIN ✓ SOFT_MAX ✓ SOFT_MAX_BACK ✓ SQR ✓ SQRT ✓ SSM_CONV ✓ SSM_SCAN ✓ STEP ✓ SUB ✓ SUM
    ✓ SUM_ROWS ✓ SWIGLU ✓ SWIGLU_OAI ✓ TANH ✓ TIMESTEP_EMBEDDING ✓ UPSCALE
  Operations without tests (14):
    ✗ ADD_REL_POS ✗ CUSTOM ✗ DIAG ✗ DIAG_MASK_ZERO ✗ FLASH_ATTN_BACK ✗ GET_REL_POS ✗ IM2COL_BACK
    ✗ MAP_CUSTOM1 ✗ MAP_CUSTOM2 ✗ MAP_CUSTOM3 ✗ POOL_1D ✗ POOL_2D_BACK ✗ WIN_PART ✗ WIN_UNPART
  Coverage Summary:
    Total operations: 103
    Tested operations: 89
    Untested operations: 14
    Coverage: 86.4%
  ```

  Refs: ggml-org#15745
* use of ggml_op enum values instead of strcmp
…15924) This commit applies to the release workflow the same ROCm-installation caching that already exists for the main CI workflow, introduced in Commit ff02caf ("ci : cache ROCm installation in windows-latest-cmake-hip (ggml-org#15887)").
* metal : make the backend async ggml-ci
* cont : add comments, extend op offload, clean up ggml-ci
* metal : fix batch size for MUL_MAT_ID
* metal : remove deprecated ggml_backend_metal_buffer_from_ptr
* metal : create only metal buffers, no wrapping of host memory ggml-ci
* metal : restore .alloc_buffer for buffer_from_ptr_type ggml-ci
* metal : remove broken implementation of GGML_OP_SET ggml-ci
* metal : clean-up loose ends, ready for tests ggml-ci
* metal : support both private and shared buffers ggml-ci
* metal : enable private buffers + add global device queue
* metal : disable host buffer to prevent races ggml-ci
* metal : avoid extra copy during set_tensor ggml-ci
* metal : use separate buffer types for shared and private Metal buffers ggml-ci
* metal : simplify synchronization logic ggml-ci
* metal : fix build ggml-ci
* metal : do not implement cpy_tensor ggml-ci
* metal : separate implementations for shared and private buffers ggml-ci
This commit fixes the zero padding for odd dimensions in ggml_compute_forward_timestep_embedding_f32. The motivation for this is that currently, if an odd dimension is used, the padding check incorrectly uses the dimension value for indexing. For example, with dim=15:
- Elements 0-6 are set to cosine values
- Elements 7-13 are set to sine values
- Element 14 is left uninitialized (contains garbage)
- Element 15 is correctly set to zero

This fix changes embed_data[dim] to embed_data[2 * half] so that element 14 (the first unused element) is properly set to zero, as well as the last element. Resolves: ggml-org/ggml#1324
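The sketch below is a self-contained illustration of the padding logic as described, not the actual ggml_compute_forward_timestep_embedding_f32 source; names and the buffer layout are simplified.

```cpp
// Illustrative timestep embedding with the corrected padding for odd dims.
#include <cmath>
#include <cstdio>
#include <vector>

static void timestep_embedding(float timestep, int dim, int max_period, std::vector<float> & embed_data) {
    const int half = dim / 2;
    embed_data.assign(dim + 1, -1.0f);               // illustrative buffer; -1 marks "untouched"
    for (int j = 0; j < half; ++j) {
        const float freq = std::exp(-std::log((float) max_period) * j / half);
        const float arg  = timestep * freq;
        embed_data[j]        = std::cos(arg);         // elements [0, half)
        embed_data[j + half] = std::sin(arg);         // elements [half, 2*half)
    }
    if (dim % 2 != 0) {
        embed_data[2 * half] = 0.0f;                  // fix: zero the first unused slot (was embed_data[dim])
    }
}

int main() {
    std::vector<float> e;
    timestep_embedding(3.0f, 15, 10000, e);           // odd dim: element 14 is now explicitly zeroed
    printf("e[14] = %f\n", e[14]);
}
```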
* support non-contiguous Q in build_attn_mha * Update src/llama-graph.cpp ggml-ci Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…rs (ggml-org#15909)
* Extend the support of T5 models with different encoder-decoder layers Signed-off-by: Jie Fu <jiefu@tencent.com>
* Apply review suggestions to convert_hf_to_gguf.py, gguf-py/gguf/constants.py, gguf-py/gguf/gguf_writer.py, src/llama-arch.cpp, src/llama-arch.h, src/llama-hparams.h, and a series of updates to src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Rename n_dec_layer --> dec_n_layer Signed-off-by: Jie Fu <jiefu@tencent.com>
* Adapt to cases when dec_n_layer > n_layer Signed-off-by: Jie Fu <jiefu@tencent.com>

Signed-off-by: Jie Fu <jiefu@tencent.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
… (ggml-org#15872)
* Add fastdiv and fastmodulo to k_bin_bcast kernel
* Address review comments
* `prod_` instead of `prod` suffix
* Add test case for `k_bin_bcast_unravel` in CUDA backend
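For context, the general fastdiv/fastmodulo technique replaces division by a fixed runtime divisor with a multiply and a shift of a precomputed constant. The sketch below shows a Lemire-style host-side version (valid for divisors >= 2, using a 128-bit intermediate available on GCC/Clang); the helpers added to `k_bin_bcast` follow the same idea but are not necessarily this exact formulation.

```cpp
// Fast division/modulo by a fixed divisor d >= 2: precompute M = floor(2^64 / d) + 1
// once, then quotient and remainder of 32-bit values need no hardware division.
#include <cassert>
#include <cstdint>

struct fastdiv_u32 {
    uint64_t M;
    uint32_t d;

    explicit fastdiv_u32(uint32_t divisor) : M(UINT64_MAX / divisor + 1), d(divisor) {}

    uint32_t div(uint32_t n) const {
        return (uint32_t) (((__uint128_t) M * n) >> 64);         // n / d
    }
    uint32_t mod(uint32_t n) const {
        uint64_t lowbits = M * n;
        return (uint32_t) (((__uint128_t) lowbits * d) >> 64);   // n % d
    }
};

int main() {
    fastdiv_u32 f(7);
    for (uint32_t n = 0; n < 100000; ++n) {
        assert(f.div(n) == n / 7);
        assert(f.mod(n) == n % 7);
    }
}
```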
Since the prefill length is not fixed, graphs constructed for the prefill stage cannot be reused. For this reason, ACL graph execution is disabled by default during prefill.
* kleidiai: fix GGML_ASSERT(*cur_backend_id != -1) failed
* remove the Whisper-specific check for GET_ROWS support
* feat: Improve code block theming * chore: update webui build output * chore: Update webui static build
…onditional rendering for Actions Dropdown for Chat Conversation Items (ggml-org#16369)
* fix: Render Conversation action dialogs as singletons from Chat Sidebar level
* chore: update webui build output
* fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup
* chore: Update webui static build
* fix: Always truncate conversation names
* chore: Update webui static build
* common: introduce http.h for httplib-based client

  This change moves cpp-httplib based URL parsing and client setup into a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`. It is an iteration towards removing libcurl, while intentionally minimizing changes to existing code to guarantee the same behavior when `LLAMA_CURL` is used. Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* tools : add missing WIN32_LEAN_AND_MEAN Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
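As a rough illustration of what an httplib-based helper can look like, here is a hypothetical `common_http_get`; the function name and the URL splitting are assumptions, only the `httplib::Client` API from cpp-httplib is real, and the actual `common/http.h` may expose something different.

```cpp
// Hypothetical fetch helper built on cpp-httplib.
#include <string>
#include <utility>
#include "httplib.h"   // cpp-httplib

// split "http://host:port/path" into ("http://host:port", "/path")
static std::pair<std::string, std::string> split_url(const std::string & url) {
    const size_t scheme_end = url.find("://");
    const size_t path_start = url.find('/', scheme_end == std::string::npos ? 0 : scheme_end + 3);
    if (path_start == std::string::npos) {
        return { url, "/" };
    }
    return { url.substr(0, path_start), url.substr(path_start) };
}

static bool common_http_get(const std::string & url, std::string & body_out) {
    auto [base, path] = split_url(url);
    httplib::Client cli(base);                 // https requires CPPHTTPLIB_OPENSSL_SUPPORT
    cli.set_follow_location(true);             // follow redirects, e.g. to a CDN
    auto res = cli.Get(path.c_str());
    if (!res || res->status != 200) {
        return false;
    }
    body_out = res->body;
    return true;
}
```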
* CI: Properly install rocwmma for hip builds on windows; we now install rocwmma from ubuntu packages
* CI: update linux rocm docker build to use rocm 7.0
… (ggml-org#16075)
* Fix to use hidden_size_per_head
* Fix num heads
* Fix array
* Fix loading weights
* Support old GGUF converted by the previous version of llama.cpp
* Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Move shared parameter definitions to the outside of loop
* Do not calculate n_embd_head_k,v by n_embd / n_head

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
(ggml-org#16221)
* HIP: Disable ROCWMMA fatt on CDNA when compiled against ROCWMMA 2.0.0

  rocwmma 2.0.0 includes a bug in the code faking fp16 accumulation on CDNA
* CUDA: Fix volta condition in ggml_cuda_should_use_wmma_fattn
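A purely illustrative guard for the described workaround, assuming rocWMMA exposes `ROCWMMA_VERSION_MAJOR` through its version header and that the WMMA FlashAttention path is toggled by a `GGML_HIP_ROCWMMA_FATTN`-style macro; the commit's real condition may differ.

```cpp
// Illustrative only: macro names for the fattn toggle are examples.
#if defined(GGML_HIP_ROCWMMA_FATTN) && defined(ROCWMMA_VERSION_MAJOR) && ROCWMMA_VERSION_MAJOR >= 2
#  if defined(__gfx908__) || defined(__gfx90a__)   // CDNA1 / CDNA2
     // rocWMMA 2.0.0 mis-handles the emulated fp16 accumulation on CDNA,
     // so disable the WMMA FlashAttention path and fall back to the plain kernel.
#    undef GGML_HIP_ROCWMMA_FATTN
#  endif
#endif
```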
dindinw approved these changes on Oct 2, 2025
Labels: Apple Metal, Ascend NPU, build, devops, documentation, examples, ggml, IBM zDNN, nix, Nvidia GPU, OpenCL, python, script, server, SYCL, testing, Vulkan