
[Qwen3VL] add qwen3 vl#2409

Draft
shuhuayu wants to merge 19 commits into pytorch:main from shuhuayu:modeldev

Conversation

shuhuayu (Contributor) commented Feb 20, 2026

Add Qwen3 VL model support. To be refactored for the new config system. Currently supports: decoder (FSDP, EP, TP, PP) and vision encoder (FSDP, TP); to be adjusted based on needs and testing. Tested training on the 2B, 8B, and 30B-A3B models, both from scratch and loading from an HF checkpoint.

meta-cla bot added the CLA Signed label Feb 20, 2026
shuhuayu requested review from SherlockNoMad, fegin, tianyu-l, wconstab, and wwwjn, and removed the request for SherlockNoMad February 20, 2026 20:59
shuhuayu marked this pull request as a draft February 20, 2026 21:00
```python
if isinstance(non_ep_grads_total_norm, DTensor):
    non_ep_grads_total_norm = non_ep_grads_total_norm.full_tensor()
# Group non-EP grads by mesh to handle mixed meshes (e.g., VLM models
# where vision encoder params are on (fsdp) mesh while decoder params
```
Contributor:

If the decoder has FSDP 2 and TP 2, what FSDP degree does the vision encoder have, 2 or 4?

shuhuayu (Author):

It is 2. Both the decoder and the vision encoder share the same FSDP mesh and TP mesh.

Contributor:

So on the TP mesh, the encoder is doing replicated computation? If so, why would you need special treatment for grad clipping here?

shuhuayu (Author):

Thanks for the question. I think this is similar to the separate grouping of tensors on the EP mesh and the non-EP meshes. In our implementation, when TP is enabled, some tensors (e.g., mlp.weight) are on the (fsdp, tp) mesh while others (e.g., pos_embed.weight) are on the (fsdp,) mesh, so we need to reduce each group separately before summing up.
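The per-mesh grouping described in this comment can be sketched as follows. This is a hypothetical, torch-free illustration (the function name `clip_grad_norm_mixed_meshes` and the dict-of-lists input are invented for the sketch): each mesh group's squared partial norm stands in for a reduction performed within that mesh, and only the scalar partials are combined into the global norm.

```python
import math

def clip_grad_norm_mixed_meshes(grads_by_mesh, max_norm):
    """Hypothetical sketch of grad clipping over mixed meshes.

    grads_by_mesh maps a mesh key, e.g. ("fsdp", "tp") or ("fsdp",),
    to a flat list of gradient values. In the real DTensor code each
    group's partial norm would be reduced over its own mesh (e.g. via
    full_tensor()); here the per-group sum of squares plays that role.
    """
    partial_sq = {}
    for mesh, grads in grads_by_mesh.items():
        # per-mesh reduction: L2 norm squared within one mesh group
        partial_sq[mesh] = sum(g * g for g in grads)
    # combine the scalar partials into one global norm
    total_norm = math.sqrt(sum(partial_sq.values()))
    clip_coef = min(1.0, max_norm / (total_norm + 1e-6))
    return total_norm, clip_coef
```

The key point is that summing the squared partial norms is only valid after each group has been fully reduced over its own mesh; mixing unreduced shards from different meshes would double- or under-count.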

Contributor:

I believe you need to put them on the TP mesh (as Replicate) instead of just letting them hang there. Otherwise DCP won't be able to figure out how to save/load.

shuhuayu (Author):

Thanks for the suggestion! I used NoParallel to wrap the norms and patch_embed, which simplified things a lot, and we no longer need the extra logic to handle mixed non-EP meshes.
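The mesh-uniformity requirement behind this exchange can be illustrated with a small torch-free check. All names here are hypothetical sketches, not the actual torchtitan API: in the real code this role is played by NoParallel, which gives a parameter a Replicate placement on the TP mesh so every parameter ends up on the same (fsdp, tp) mesh.

```python
def assert_uniform_mesh(param_meshes: dict) -> None:
    """Hypothetical checkpoint-side check: DCP-style save/load is
    straightforward only when every parameter lives on the same mesh."""
    meshes = set(param_meshes.values())
    if len(meshes) > 1:
        raise ValueError(f"mixed meshes: {sorted(meshes)}")

def wrap_no_parallel(mesh: tuple) -> tuple:
    """Hypothetical NoParallel-like wrap: add the tp dim (conceptually
    as Replicate) so the param joins the (fsdp, tp) mesh without doing
    any sharded computation on it."""
    return mesh if "tp" in mesh else mesh + ("tp",)
```

With pos_embed.weight stuck on ("fsdp",) the check fails; after the NoParallel-style wrap, all parameters share ("fsdp", "tp") and both DCP and grad clipping see a single mesh.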

```python
self.enable_weight_tying = orig_weight_tying

if self.enable_weight_tying:
    if self.tok_embeddings is not None and self.output is not None:
```
Contributor:

Just curious: what happens when we use PP, do we stop weight tying? Is this algorithmically correct?

Contributor:

I think of weight tying as an extreme memory-saving technique, only used for very small models (e.g., Qwen3 0.6B) that don't need PP for training. So it should be OK that the two are simply disjoint; we do need to make sure NotImplementedError is thrown properly, though.

shuhuayu (Author):

Good catch. This is a workaround for using PP on a model with enable_weight_tying set to true: if we do not temporarily disable weight tying, we hit an assertion error when training from scratch. When loading from a checkpoint, this workaround is still problematic, and I am still working on it. Weight tying is used on small models; for Qwen3-VL, it is only used for the 2B model. We are considering removing this combination entirely, i.e., erroring out when PP is applied to a model with weight tying. What do you think?
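The error-out option proposed here could look like the following minimal sketch (the function name and parameters are assumptions, not the actual torchtitan config API):

```python
def validate_weight_tying(enable_weight_tying: bool, pp_degree: int) -> None:
    """Hypothetical config check: weight tying targets small models that
    don't need pipeline parallelism, so rather than silently untying the
    weights across PP stages, fail loudly when both are requested."""
    if enable_weight_tying and pp_degree > 1:
        raise NotImplementedError(
            "weight tying is not supported with pipeline parallelism; "
            "disable one of the two"
        )
```

This matches the reviewer's earlier point that the two features can stay disjoint as long as the unsupported combination raises NotImplementedError instead of silently training with untied weights.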

Contributor:

I see. I didn't realize weight tying was strictly a memory-saving technique; I assumed it also reduced the number of learnable parameters in a way that is meaningful for convergence.

shuhuayu (Author):

Weight tying also introduces some complexity in initialization: if we initialize the output layer the same way as the token embedding, the initial errors would be huge. So to make weight tying work well, we need to add extra initialization logic.
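For illustration, here is a torch-free sketch of what tying means (the class and attribute names are hypothetical, loosely mirroring the tok_embeddings/output names in the diff): the output projection aliases the embedding matrix, so there is one set of values to learn, and any initialization applied to one side is seen by the other. This is why the output layer cannot simply be re-initialized with a separate scheme after tying.

```python
class TinyTiedLM:
    """Hypothetical sketch of weight tying using plain Python lists."""

    def __init__(self, vocab_size: int, dim: int):
        # token-embedding matrix: one row per vocabulary entry
        self.tok_embeddings = [[0.02] * dim for _ in range(vocab_size)]
        # tying: the output projection reuses the SAME object, not a
        # copy, so an update to either is visible through both names
        self.output = self.tok_embeddings

    def tied(self) -> bool:
        return self.output is self.tok_embeddings
```

Under PP, tok_embeddings and output typically live on different pipeline stages, so this single shared object no longer exists, which is the root of the incompatibility discussed above.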


Labels

ciflow/8gpu, CLA Signed

3 participants