docs: add supported features, parity checks, and perf sections to all model READMEs (#2354) #2420
Hi @HemantSudarshan! Thank you for your pull request and welcome to our community.

**Action Required:** In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

**Process:** In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with *CLA signed*. If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
… model READMEs

Addresses pytorch#2354: Better document for model perf / supported techniques / parity checks.

Changes:
- Add comprehensive supported-features tables to all 7 model READMEs (llama3, llama3_ft, llama4, deepseek_v3, qwen3, flux, gpt_oss)
- Add cross-model feature matrix to `torchtitan/models/README.md`
- Add parity check methodology sections with HF baseline comparison
- Add performance sections with Llama 3 H100/H200 benchmarks
- Add parity/performance sections to 6 experiment READMEs
- Add Parity Testing section to `tests/README.md` pointing to `scripts/checkpoint_conversion/numerical_tests_example.py`

All feature claims verified against source code (parallelize_fn, TrainSpec, model configs). Three review cycles performed.
Summary
Closes #2354 — "Better document for model perf / supported techniques / parity checks"
This PR adds standardized documentation to every model and experiment README, addressing the three requirements from user feedback: supported features, parity checks, and performance.
All feature claims have been verified against source code across three independent review cycles (parallelize functions, TrainSpec registrations, model configs, runtime guards).
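The "verify claims against TrainSpec registrations" step can be illustrated with a short sketch. Note that `TrainSpec` below is a simplified stand-in dataclass, not torchtitan's actual class; only the field names (`pipelining_fn`, `build_validator_fn`) come from this PR, and everything else is an assumption for illustration:

```python
# Sketch of deriving README feature claims from a train spec instead of
# hand-writing them. This TrainSpec is a simplified stand-in, NOT
# torchtitan's real class; only the field names are taken from this PR.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TrainSpec:
    name: str
    pipelining_fn: Optional[Callable] = None       # None => no PP support
    build_validator_fn: Optional[Callable] = None  # None => no validation


def supported_features(spec: TrainSpec) -> dict:
    """Map spec wiring to the claims a README feature table would make."""
    return {
        "PP": spec.pipelining_fn is not None,
        "Validation": spec.build_validator_fn is not None,
    }


# A spec registered with pipelining_fn=None must not claim PP support.
features = supported_features(TrainSpec(name="example_model"))
```

This mirrors the corrections described below: a feature is only documented as supported when the corresponding hook is actually wired up in the registration.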
What Changed (15 files, 693 additions, 42 deletions)
Model READMEs (7 files)
Top-level Feature Matrix (1 file)
`torchtitan/models/README.md`: Added a 20-row × 7-model comparison table covering FSDP, HSDP, TP, PP, CP, EP, ETP, DDP, AC, torch.compile, Float8, MXFP8, Async TP, Loss Parallel, HF Interop, DualPipeV, Validation, MoE, Custom Trainer, and Benchmarks Published.

Experiment READMEs (6 files)
Added parity checks and performance sections to:
`autoparallel`, `compiler_toolkit`, `simple_fsdp`, `torchcomms`, `transformers_modeling_backend`, `vlm`.

Tests README (1 file)
`tests/README.md`: Added a Parity Testing section that addresses the original complaint: "tests directory doesn't seem to have [parity checks]". Points users to `scripts/checkpoint_conversion/numerical_tests_example.py` with full instructions.

Verification Methodology
Every feature claim was cross-referenced against the actual source code:
- `parallelize_*.py`, `TrainSpec.pipelining_fn`
- `apply_compile()` call presence in parallelize functions
- `model.converters` config, `Float8ColwiseParallel` / `MXLinear` usage
- `enable_async_tp` parameter value in parallelize calls
- `get_dual_pipe_v_flag()` runtime guard (requires `pp_enabled AND ep_enabled`)
- `build_validator_fn` wiring in TrainSpec
- `scripts/checkpoint_conversion/numerical_tests_example.py` actual output

Specific corrections made during verification:
- (`pipeline_llm`)
- (`pipelining_fn=None`)
- (`apply_compile()` not called)
- (`enable_async_tp=False`)
- (`build_validator` is wired)

What This PR Does NOT Do
- `numerical_tests_example.py` rather than creating new ones

How to Review
- `parallelize_*.py` or `__init__.py` `TrainSpec`
- `scripts/checkpoint_conversion/README.md`
- `numerical_tests_example.py`
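The numerical parity check the review checklist points to can be sketched roughly as follows. This is an illustrative sketch only: the function names and the `1e-3` tolerance are assumptions, not taken from `numerical_tests_example.py`, which remains the authoritative reference:

```python
# Rough sketch of a logit-parity check between two model implementations
# (e.g. a torchtitan model vs. a Hugging Face baseline). Function names
# and the tolerance are illustrative; see numerical_tests_example.py for
# the actual procedure.

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat sequences."""
    assert len(a) == len(b), "outputs must have the same number of elements"
    return max(abs(x - y) for x, y in zip(a, b))


def check_parity(logits_a, logits_b, atol=1e-3):
    """Return (passed, diff) for a simple absolute-tolerance parity check."""
    diff = max_abs_diff(logits_a, logits_b)
    return diff <= atol, diff


# Two nearly identical logit vectors pass at atol=1e-3.
passed, diff = check_parity([0.10, -1.20, 3.40], [0.1002, -1.2001, 3.3999])
```

In practice such checks compare tensors with something like `torch.testing.assert_close` rather than plain Python lists; the flat-list version above just makes the pass/fail criterion explicit.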