
Conversation


@bjf-frz commented Jan 12, 2026

What does this PR do?

This PR fixes a critical bug in the training path of NPUQwen3VLMoeTextExperts, where incorrect routing weights were used during the token unpermutation step.
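For context, here is a minimal pure-PyTorch sketch of the MoE token permute/unpermute pattern at issue (shapes, names, and values are illustrative; this is not the torch_npu kernel API):

import torch

num_tokens, hidden, num_experts, top_k = 4, 8, 16, 2

hidden_states = torch.randn(num_tokens, hidden)
routing_weights = torch.softmax(torch.randn(num_tokens, num_experts), dim=-1)  # (num_tokens, num_experts)
topk_probs, router_indices = torch.topk(routing_weights, top_k, dim=-1)       # (num_tokens, top_k) each

# Permute: replicate each token once per selected expert.
expert_out = hidden_states.unsqueeze(1).expand(-1, top_k, -1)  # stand-in for per-expert MLP outputs

# Unpermute: merge each token's top_k expert outputs, weighted by the
# top-k probabilities (the role played by `probs` in the fused kernel call).
next_states = (expert_out * topk_probs.unsqueeze(-1)).sum(dim=1)  # (num_tokens, hidden)

The bug concerns which probabilities end up in that weighting during training.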

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: [fsdp] feat: update NPU fused kernels for Qwen3 moe block #4406
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

With these modifications, the GPU and NPU results are numerically consistent: the reward trends align, and the remaining discrepancies fall within an acceptable margin of error.

[Image: results (GPU vs. NPU reward curves)]
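A rough sketch of the kind of consistency check this implies (hypothetical reward values and tolerances, not the actual test script):

import torch

# Hypothetical per-step rewards collected from each backend.
gpu_rewards = torch.tensor([0.12, 0.18, 0.25, 0.31])
npu_rewards = torch.tensor([0.12, 0.18, 0.24, 0.31])

# Elementwise agreement within a chosen tolerance.
assert torch.allclose(gpu_rewards, npu_rewards, rtol=5e-2, atol=1e-3)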

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review; otherwise, the reviewer may deprioritize this PR.

@gemini-code-assist bot left a comment


Code Review

This pull request attempts to fix an issue in the NPU patch for Qwen3-VL MoE models. However, the change introduces a critical bug due to a tensor shape mismatch. The modification assumes routing_weights has the full expert dimension, but the calling function passes a tensor with only the top-k probabilities. This will cause an index-out-of-bounds error during execution. I've provided a comment with a detailed explanation and a suggestion to revert the change.

Comment on lines +188 to +193
num_tokens = hidden_states.shape[0]
top_k = router_indices.shape[1]
batch_idx = torch.arange(num_tokens, device=routing_weights.device)
batch_idx = batch_idx.unsqueeze(1).expand(-1, top_k)
selected_probs = routing_weights[batch_idx, router_indices]
next_states = torch_npu.npu_moe_token_unpermute(output, row_ids_map, probs=selected_probs)

critical

This change appears to introduce a bug due to a mismatch in tensor shapes between the caller and this function.

The calling function, NPUQwen3VLMoeTextSparseMoeBlock.forward, reassigns routing_weights to the top-k probability values at line 225:

# verl/models/transformers/npu_patch.py:225
routing_weights, router_indices = torch.topk(routing_weights, self.top_k, dim=-1)

This results in routing_weights having a shape of (num_tokens, top_k) when passed to this function during training.

However, the new code here attempts to index routing_weights using router_indices, which contains expert indices that can be larger than top_k - 1:

# verl/models/transformers/npu_patch.py:192
selected_probs = routing_weights[batch_idx, router_indices]

This will lead to an index out of bounds error.
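A minimal repro of the mismatch (illustrative sizes, not the actual model config):

import torch

num_tokens, num_experts, top_k = 4, 16, 2

routing_weights = torch.rand(num_tokens, num_experts)
# The caller overwrites routing_weights with the top-k values:
routing_weights, router_indices = torch.topk(routing_weights, top_k, dim=-1)
# routing_weights is now (4, 2), but router_indices holds expert ids in [0, 16).

batch_idx = torch.arange(num_tokens).unsqueeze(1).expand(-1, top_k)
selected_probs = routing_weights[batch_idx, router_indices]  # IndexError whenever an id >= 2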

The original implementation seems correct, as it directly uses the (num_tokens, top_k) routing_weights tensor, which appears to be the expected input for npu_moe_token_unpermute.

If the intent is for routing_weights to be the full (num_tokens, num_experts) probability distribution, the fix should be in NPUQwen3VLMoeTextSparseMoeBlock.forward to avoid overwriting routing_weights. Without that change in the caller, this PR is incorrect.
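A sketch of that caller-side alternative (variable names assumed from the quoted code; untested):

# In NPUQwen3VLMoeTextSparseMoeBlock.forward: keep the full distribution
# instead of overwriting it with the top-k values.
topk_probs, router_indices = torch.topk(routing_weights, self.top_k, dim=-1)
# Pass the full routing_weights (not topk_probs) to the experts module, so its
# gather routing_weights[batch_idx, router_indices] indexes the expert dimension.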

Suggested change
- num_tokens = hidden_states.shape[0]
- top_k = router_indices.shape[1]
- batch_idx = torch.arange(num_tokens, device=routing_weights.device)
- batch_idx = batch_idx.unsqueeze(1).expand(-1, top_k)
- selected_probs = routing_weights[batch_idx, router_indices]
- next_states = torch_npu.npu_moe_token_unpermute(output, row_ids_map, probs=selected_probs)
+ next_states = torch_npu.npu_moe_token_unpermute(output, row_ids_map, probs=routing_weights)

@bjf-frz changed the title from "[model]: qwen3-vl-30b npu_patch fix" to "[model] fix: qwen3-vl-30b npu_patch fix" on Jan 14, 2026