
Make tensor of indices a LongTensor #2431

Merged
tianyu-l merged 3 commits into pytorch:main from hirschsn:fix-moe-integer-overflow
Feb 25, 2026
Conversation

@hirschsn
Contributor

Indices get multiplied with strides in operations like aten::index_put. A 32-bit index can lead to a silent overflow, since TorchInductor does not do any overflow checking on its own.

This commit fixes GitHub issue #2430

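The failure mode can be reproduced without PyTorch at all. A minimal sketch of the silent index*stride wraparound in 32-bit signed arithmetic (the stride and index values here are hypothetical, chosen only to cross the int32 limit):

```python
# Simulate how a linear offset is formed when index arithmetic is done
# in 32-bit signed integers: the product wraps around instead of erroring.
def linear_offset_32bit(index: int, stride: int) -> int:
    """Compute index * stride with 32-bit signed wraparound semantics."""
    product = (index * stride) & 0xFFFFFFFF
    if product >= 0x80000000:      # reinterpret the top bit as the sign bit
        product -= 0x100000000
    return product

stride = 8192            # hypothetical row stride of a large table
index = 300_000          # a perfectly valid row index

offset64 = index * stride                     # correct 64-bit offset
offset32 = linear_offset_32bit(index, stride) # silently wraps negative

print(offset64)  # 2457600000
print(offset32)  # -1837367296
```

With a 64-bit (Long) index tensor the multiply stays in range; with a 32-bit one the compiled kernel would address memory at a garbage (here negative) offset with no error raised.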
@meta-cla bot added the CLA Signed label Feb 24, 2026
@tianyu-l
Contributor

Sounds reasonable to me, but why does it only happen for compile, not eager?

@tianyu-l tianyu-l requested a review from xmfan February 24, 2026 19:06
@hirschsn
Contributor Author

hirschsn commented Feb 25, 2026

According to a Torch profile, the kernel launched by index_put in eager mode is void at::native::index_elementwise_kernel</* ... */, at::native::gpu_index_kernel<at::native::index_put_kernel_impl //... . The kernel that does the actual indexing is gpu_index_kernel (aten/src/ATen/native/cuda/IndexKernel.cu:101ff), and it receives the indices as 64-bit integers.

I think they are converted in index_put_with_sort_kernel (which seems to be the dispatch target for CUDA): there the indices go through makeLinearIndex (aten/src/ATen/native/cuda/Indexing.cu:615), which in turn does:

  for (auto & i : indices) {
    if (i.defined() && i.dtype() == at::kInt) {
      i = i.to(at::kLong);
    }
  }
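In other words, eager mode widens any 32-bit index tensor to 64 bits before the linear index is formed, so the index*stride multiply is always done in 64-bit arithmetic. A pure-Python sketch of that promotion (dtypes tracked as strings for illustration; no PyTorch required):

```python
# Pure-Python analog of the ATen loop quoted above: any defined index
# array tagged int32 is widened to int64 before linear indexing.
def promote_indices(indices):
    """Return the index list with every int32 entry retagged as int64."""
    return [(vals, "int64") if vals is not None and dtype == "int32"
            else (vals, dtype)
            for vals, dtype in indices]

indices = [([0, 300_000], "int32"), ([5, 7], "int64")]
print([dtype for _, dtype in promote_indices(indices)])  # ['int64', 'int64']
```

The compiled path does not run through this helper, which is consistent with the overflow only appearing under compile.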

@tianyu-l
Contributor

Have you measured the perf/memory impact of this change? It sounds necessary to me anyway.

@hirschsn
Contributor Author

We measured on a Qwen3 MoE model with a moderate local batch size (16); the performance impact was negligible (<1% TPS).

@tianyu-l tianyu-l merged commit ac359c0 into pytorch:main Feb 25, 2026
9 of 11 checks passed