Step3.5 support #19271
Conversation
I was just downloading it and will give it a try soon.

@ggerganov Any idea what the deal is with this?

    /usr/bin/ld: ../../bin/libllama.so.0.0.1: undefined reference to `bool llama_model_loader::get_key_or_arr<float, 512ul>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::array<float, 512ul>&, unsigned int, bool)'

Getting it locally too, not just on CI.
    case LLM_ARCH_STEP35:
        {
            ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);

            hparams.swa_type = LLAMA_SWA_TYPE_STANDARD;

            // MoE + SWA parameters
            ml.get_key(LLM_KV_EXPERT_FEED_FORWARD_LENGTH,        hparams.n_ff_exp);
            ml.get_key(LLM_KV_EXPERT_SHARED_FEED_FORWARD_LENGTH, hparams.n_ff_shexp, false);
            ml.get_key(LLM_KV_EXPERT_GATING_FUNC,                hparams.expert_gating_func, false);
            ml.get_key(LLM_KV_EXPERT_WEIGHTS_SCALE,              hparams.expert_weights_scale, false);
            ml.get_key(LLM_KV_EXPERT_WEIGHTS_NORM,               hparams.expert_weights_norm, false);

            // Step35 uses sigmoid gating by default (if not set in GGUF)
            if (hparams.expert_gating_func == LLAMA_EXPERT_GATING_FUNC_TYPE_NONE) {
                hparams.expert_gating_func = LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID;
            }

            ml.get_key(LLM_KV_ATTENTION_SLIDING_WINDOW, hparams.n_swa);
            ml.get_key_or_arr(LLM_KV_ATTENTION_SLIDING_WINDOW_PATTERN, hparams.swa_layers, hparams.n_layer);
            ml.get_key_or_arr(LLM_KV_ROPE_DIMENSION_COUNT_PER_LAYER,   hparams.rope_dim_per_layer, hparams.n_layer);
            ml.get_key_or_arr(LLM_KV_SWIGLU_LIMITS,                    hparams.swiglu_limits, hparams.n_layer);
            ml.get_key_or_arr(LLM_KV_SWIGLU_LIMITS_SHARED,             hparams.swiglu_limits_shared, hparams.n_layer);

            // Optional: Step35-only gating for applying rope scaling (HF: yarn_only_types).
            // Default is 3 (apply on all layers) if the key is absent.
            ml.get_key(
                format("%s.rope.scaling.apply_mask", ml.get_arch_name().c_str()),
                hparams.rope_scaling_apply_mask,
                false
            );

            hparams.has_rope_freq_base_per_layer = ml.get_key_or_arr(
                format("%s.rope.freq_base_per_layer", ml.get_arch_name().c_str()),
                hparams.rope_freq_base_per_layer,
                hparams.n_layer,
                false
            );

            type = LLM_TYPE_UNKNOWN;
        } break;
Maybe one of these does not compile correctly. On Mac it builds OK, btw.
Got it.
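(An aside for readers of this thread: the get_key_or_arr calls in the hunk above read a GGUF metadata key that may be stored either as a single scalar or as a per-layer array. The snippet below is a minimal standalone sketch of that behaviour under assumed names and an assumed signature; it is not the llama.cpp implementation.)

```cpp
// Hedged sketch only; helper name and signature are hypothetical, not the llama.cpp API.
#include <array>
#include <cstdint>
#include <cstdio>
#include <vector>

// Fill `result` for `n_layer` layers from a GGUF value that is either a scalar or an array.
template <typename T, size_t N>
bool get_key_or_arr_sketch(const std::vector<T> & values, std::array<T, N> & result, uint32_t n_layer) {
    if (values.empty() || n_layer > N) {
        return false;
    }
    if (values.size() == 1) {
        result.fill(values[0]);   // scalar key: same value broadcast to every layer
        return true;
    }
    if (values.size() != (size_t) n_layer) {
        return false;             // array key must provide exactly one entry per layer
    }
    for (uint32_t il = 0; il < n_layer; ++il) {
        result[il] = values[il];
    }
    return true;
}

int main() {
    std::array<float, 512> rope_dim_per_layer{};
    // e.g. a per-layer rope dimension key stored as a single scalar for a 4-layer model
    get_key_or_arr_sketch<float, 512>({64.0f}, rope_dim_per_layer, 4);
    printf("layer 0: %g, layer 3: %g\n", rope_dim_per_layer[0], rope_dim_per_layer[3]);
}
```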
@pwilkin Thanks a lot for putting this together so quickly, and for taking the time to extract the changes into a clean PR — we really appreciate the interest and effort around step3.5. We’re the authors of the step3.5 model, and we already have an official llama.cpp integration prepared, which we’re planning to submit very shortly (likely tomorrow). Since the upcoming PR will be the reference implementation we’ll be maintaining going forward, we’d prefer to have that land as the primary integration. For that reason, we were hoping this PR could be held off from merging for now.
    ml.get_key(
        format("%s.rope.scaling.apply_mask", ml.get_arch_name().c_str()),
        hparams.rope_scaling_apply_mask,
        false
    );
I think you will need something like this (const char * instead of std::string):

    template bool llama_model_loader::get_key_or_arr<float, 512>(const char *, std::array<float, 512> & result, uint32_t n, bool required);

But I think it's better to just put this as a LLM_KV_*, like all other metadata keys.
Already did it :>
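(For context on the linker error quoted near the top of the thread: when a member template is defined in a .cpp file, only the explicitly instantiated signatures are emitted, so a call that passes a std::string key fails to link unless a matching instantiation exists. The sketch below uses hypothetical file and type names, not the actual llama.cpp sources.)

```cpp
// loader.h -- hypothetical header; the template is only declared here.
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

struct model_loader {
    template <typename T, size_t N>
    bool get_key_or_arr(const std::string & key, std::array<T, N> & result,
                        uint32_t n, bool required = true);
};

// loader.cpp -- the template body lives in a .cpp file, so the compiler only emits
// code for the signatures explicitly instantiated below.
template <typename T, size_t N>
bool model_loader::get_key_or_arr(const std::string & key, std::array<T, N> & result,
                                  uint32_t n, bool required) {
    (void) key; (void) required;
    for (uint32_t i = 0; i < n && i < N; ++i) {
        result[i] = T(0);   // dummy body, just to make the sketch complete
    }
    return true;
}

// Without this line, any other translation unit calling
// get_key_or_arr<float, 512>(std::string, ...) fails at link time with
// "undefined reference to bool model_loader::get_key_or_arr<float, 512ul>(...)".
template bool model_loader::get_key_or_arr<float, 512>(const std::string &,
                                                        std::array<float, 512> &,
                                                        uint32_t, bool);
```

Registering the key as an LLM_KV_* entry, as suggested above, presumably avoids adding a new instantiation by routing the lookup through the already-instantiated paths.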
        apply_mask |= 0x1
    if "sliding_attention" in yarn_only_types:
        apply_mask |= 0x2
    self.gguf_writer.add_uint32(f"{self.gguf_writer.arch}.rope.scaling.apply_mask", int(apply_mask))
Better to add this as a self.gguf_writer.add_* function
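(To make the mask semantics concrete: combining this converter snippet with the "default is 3 (apply on all layers)" comment in the C++ hunk earlier, one plausible reading is that bit 0x1 covers full-attention layers and bit 0x2 covers sliding-window layers. The decode below is an illustrative sketch under that assumption, not code from this PR.)

```cpp
#include <cstdint>
#include <cstdio>

// Assumed bit meanings (inferred from the converter, where "sliding_attention" sets 0x2).
enum : uint32_t {
    ROPE_SCALING_APPLY_FULL = 0x1,  // assumption: full-attention layers
    ROPE_SCALING_APPLY_SWA  = 0x2,  // assumption: sliding-window layers
};

// Decide whether rope (YaRN) scaling applies to a given layer type.
static bool rope_scaling_applies(uint32_t apply_mask, bool is_swa_layer) {
    return (apply_mask & (is_swa_layer ? ROPE_SCALING_APPLY_SWA : ROPE_SCALING_APPLY_FULL)) != 0;
}

int main() {
    const uint32_t apply_mask = 3;  // default when the GGUF key is absent: apply on all layers
    printf("full-attention layer: %d\n", rope_scaling_applies(apply_mask, /*is_swa_layer=*/false));
    printf("sliding-window layer: %d\n", rope_scaling_applies(apply_mask, /*is_swa_layer=*/true));
}
```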
        raise ValueError(f"Unprocessed experts: {experts}")


@ModelBase.register("Step35ForCausalLM")
Should this be:

    -@ModelBase.register("Step35ForCausalLM")
    +@ModelBase.register("Step3p5ForCausalLM")
Closing as duplicate of #19283
The model creators put up a llama.cpp fork, but they did it as a bare directory upload, and the llama.cpp version it is based on is quite far outdated, so I attempted to pull a clean PR out of their code. I'm not guaranteeing it's all correct, but I'm putting it up since there's probably going to be a lot of interest in the new model.