Releases: NVIDIA/cloudai
v1.5.rc4
v1.5.rc3
v1.5.rc2
What's Changed
- B200 M-bridge misconfig by @srivatsankrishnan in #777
- Fix M-bridge report generation by @srivatsankrishnan in #778
- Fix installation logic for File on k8s by @amaslenn in #781
- Address issues with Sleep test over K8s by @amaslenn in #779
- M-bridge Job ID extraction by @srivatsankrishnan in #783
- Make tests more stable on systems with slurm binaries by @amaslenn in #784
- Update doc on using hf token for the first time by @amaslenn in #785
- Improvements for NCCL over k8s by @amaslenn in #786
Full Changelog: v1.5.rc1...v1.5.rc2
v1.5.rc1
What's Changed
- Provide CMS-friendly documentation build by @amaslenn in #769
- Model name/model size for verify configs by @srivatsankrishnan in #772
- Add test/test scenario files for AIConfigurator for QA testing by @srivatsankrishnan in #773
- Improve reliability in ports selection for Dynamo on Slurm by @amaslenn in #771
- Upgrade container versions for common examples by @amaslenn in #776
- Add diff (value + percentage) in cmp report table if exactly two results are compared by @amaslenn in #774
- Do not use internal URLs in documentation by @amaslenn in #775
- Do not enable recompute-activations by default by @amaslenn in #768
Full Changelog: v1.5.beta7...v1.5.rc1
v1.5.beta7
What's Changed
- M bridge Documentation by @srivatsankrishnan in #765
- Remove hardcoded --distribution=arbitrary by @juntaowww in #766
- M bridge updates by @srivatsankrishnan in #767
New Contributors
- @juntaowww made their first contribution in #766
Full Changelog: v1.5.beta6...v1.5.beta7
v1.5.beta6
What's Changed
- Update codeowners by @srivatsankrishnan in #717
- Aiconfig by @srivatsankrishnan in #760
- Rula review by @RulaHallak in #761
- Automatically install sshd for NCCL k8s workers if not available by @amaslenn in #759
- Add workload for OSU Micro Benchmark by @allkoow in #742
- Rename field model_config to model_cfg in NIXLKVBench workload by @allkoow in #763
- Megatron Bridge in CloudAI by @srivatsankrishnan in #764
New Contributors
Full Changelog: v1.5.beta5...v1.5.beta6
v1.5.beta5
What's Changed
- UCC add file generator by @yaeliyac in #747
- Do not set -N/--nodes if nodelist is specified by @amaslenn in #746
- Use genai-perf from Dynamo container when running k8s by @amaslenn in #748
- Fix empty table if not all results are available by @amaslenn in #753
- Ensure reports order by @amaslenn in #754
- Update documentation on Dynamo k8s multi node by @amaslenn in #749
- Fix bokeh charts generation by @amaslenn in #755
- Enhancements for Dynamo with k8s by @amaslenn in #752
- Fix a crash during dry-run for Dynamo scenario by @amaslenn in #757
- Describe global options for cloudai CLI by @amaslenn in #758
Full Changelog: v1.5.beta4...v1.5.beta5
v1.5.beta4
What's Changed
- Add new installable type: HF model by @amaslenn in #735
- Add extra_srun_args on TestRun level by @amaslenn in #734
- Dynamo pass/fail and slurm example by @amaslenn in #736
- Add support for HF model in K8s by @amaslenn in #737
- Configure Dynamo k8s based on TOML, not an extra config by @amaslenn in #738
- Fine tune CodeRabbit reviews by @amaslenn in #740
- Expand K8s Dynamo support to disagg and multinode by @amaslenn in #739
- Generate reports in dry-run by @amaslenn in #741
- Update documentation by @amaslenn in #743
- Simplify Dynamo slurm configuration by @amaslenn in #745
Full Changelog: v1.5.beta3...v1.5.beta4
v1.5.beta3
What's Changed
- Print scenario status table at the end of a run by @amaslenn in #730
- Always set number of nodes for srun cmd by @amaslenn in #729
- Convert base System into pydantic model by @amaslenn in #732
- Add HF home dir property inside System model by @amaslenn in #733
Full Changelog: v1.5.beta2...v1.5.beta3
v1.5.beta2
What's Changed
- Fix NameError for K8s batch run by @amaslenn in #721
- Add DDLB workload by @nsarka in #711
- Updates for Dynamo over K8s by @amaslenn in #724
- Fixed an issue where using dependencies could result in an infinite loop by @amaslenn in #725
- Report results dir to users as early as possible by @amaslenn in #726
- Configure AI code review tools by @amaslenn in #728
- Kill and wait for ETCD process to be gone by @amaslenn in #727
- DeepEP benchmark by @ybenvidia in #723
New Contributors
Full Changelog: v1.5.beta1...v1.5.beta2