Conversation

@CedricHwong CedricHwong commented Dec 26, 2025

What does this PR do?

Type of change: Bug fix

Overview:

  • Synchronizes MSE calibration amax across distributed groups (DP/EP/TP) after calibration finishes (a rough sketch of the idea follows this list).
  • Adds a multi‑GPU test that verifies amax values match when distributed_sync=True and differ when distributed_sync=False.
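
A rough sketch of the idea behind the sync (the helper name, the element‑wise max reduction, and the group argument are illustrative assumptions, not taken from the actual change):

  import torch
  import torch.distributed as dist

  def sync_amax_across_group(amax: torch.Tensor, group=None) -> torch.Tensor:
      # After per-rank MSE calibration, reduce amax across the process group so
      # every rank ends up with the same value. An element-wise max is one
      # plausible reduction; the actual implementation may reduce differently.
      if dist.is_available() and dist.is_initialized():
          dist.all_reduce(amax, op=dist.ReduceOp.MAX, group=group)
      return amax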

Usage

  import copy
  import modelopt.torch.quantization as mtq

  # Build a quantization config that uses MSE calibration
  cfg = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
  cfg["algorithm"] = {
      "method": "mse",
      "distributed_sync": True,
  }
  # Run quantization + calibration (forward_loop feeds calibration data)
  model = mtq.quantize(model, cfg, forward_loop)
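
A quick spot-check after calibration is to compare per-quantizer amax values across ranks (this assumes the quantizer modules expose an amax attribute; the attribute name is an assumption, not something introduced by this change):

  # Hypothetical spot-check: print each quantizer's amax so the values can be
  # compared across ranks. The `amax` attribute name is assumed here.
  for name, module in model.named_modules():
      amax = getattr(module, "amax", None)
      if amax is not None:
          print(name, amax)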

Testing

PYTHONPATH=/root/epfs/workspace/code/personal_repos/Model-Optimizer pytest -q tests/gpu/torch/quantization/test_mse_calibrate_sync.py
  Result: 3 passed, 1 skipped (the single‑GPU case is skipped).

Additional Information

  • New test validates distributed amax synchronization for MSE calibration under NCCL (sketched below).
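
A minimal sketch of the assertion the test performs (the helper name and the use of all_gather are illustrative assumptions; the test file is the source of truth):

  import torch
  import torch.distributed as dist

  def assert_amax_synced(local_amax: torch.Tensor):
      # Gather every rank's calibrated amax and assert they are identical.
      # How local_amax is extracted from the quantized model is test-specific.
      world_size = dist.get_world_size()
      gathered = [torch.empty_like(local_amax) for _ in range(world_size)]
      dist.all_gather(gathered, local_amax)
      for other in gathered[1:]:
          assert torch.equal(gathered[0], other), "amax differs across ranks"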

@CedricHwong CedricHwong requested a review from a team as a code owner December 26, 2025 11:23
@CedricHwong CedricHwong requested a review from ajrasane December 26, 2025 11:23

copy-pr-bot bot commented Dec 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: CedricHwong <997630814@qq.com>