Skip to content

Conversation

@jamesETsmith
Copy link
Collaborator

@jamesETsmith jamesETsmith commented Nov 12, 2025

Description

This PR bumps hipCollections to the latest release release/rocmds-25.10

As we move toward theRock-based versions of ROCm (which are all ROCm@7.9+), I'm going to remove build system utilities for ROCm@6. They are bulky workarounds and it will slow us down as we try to get the ROCm@7.9-based builds going. We can always add them back in later if we absolutely have to (I don't think that's going to be the case).

As a benefit of these changes, we've been able to remove some of the nasty macros inside the source code that switch between cuda/hip.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature]])
  • I've leverage the tools to beautify the python and c++ code. 👈 still need to do this
  • The PR is complete and small, read the Google eng practice (CL equals to PR) to understand more about small PR. In DGL, we consider PRs with less than 200 lines of core code change are small (example, test and documentation could be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

This commit (sloppily) adds support for the latest hipcollections
release (release/rocmds-25.10). There are all sorts of potholes here,
hipcollections + libhipcxx + rocm need special care to work correctly.

I've opened issued in libhipcxx to try and address some of these but for
now, we handle it with some source file manipulation in our script to
install graphbolt dependencies.

This PR will break backward compatibility with rocm < 7, but that should
be ok because we'll probably be focused on rocm@7.9+ in the next
release.
There are some challenges with the new hipco installation. For example,
libhipcxx cmake config get installed in /opt/rocm/lib/rapids/cmake
instead of the /opt/rocm/lib/cmake.
@jamesETsmith jamesETsmith self-assigned this Nov 12, 2025
@jamesETsmith jamesETsmith added the enhancement New feature or request label Nov 12, 2025
@jamesETsmith jamesETsmith changed the title [Feature] Bump hipCollection version [Feature] Bump hipCollections version Nov 12, 2025
Copy link
Collaborator

@GMNGeoffrey GMNGeoffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does only supporting ROCm 7 place any constraints on the hardware we can support

Comment on lines +509 to +512
output_edge_tensors.push_back(ops::IndptrEdgeIdsImpl(
output_indptr, output_indptr.scalar_type(),
*edge_id_offsets,
static_cast<int64_t>(static_cast<indptr_t>(output_size))));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just a reformatting? Maybe revert, as it seems unrelated?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just reformatting, but it seems like it was incorrect before. When I made changes to the file, it then tripped the CI checks here so even though it's separate from the important changes here, I think we should keep it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think such linters should really only be looking at modified lines, but I see it comes from upstream, so no need to fight it.

sed -i '/#error semaphore is not supported on AMD hardware and should not be included/d' ${INSTALL_PREFIX}/include/rapids/libhipcxx/cuda/std/semaphore
sed -i '/#error semaphore is not supported on AMD hardware and should not be included/d' ${INSTALL_PREFIX}/include/rapids/libhipcxx/hip/std/semaphore

# TODO remove this once the patches are merged
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend tagging all TODOs with a corresponding GitHub issue, e.g. TODO(#123)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea this is a miss by myself when I added this todo a couple of weeks ago. It's tracked with an internal ticket and I'm trying to find the public reference. I think it's already been merged into the rocm-libraries, but those changes weren't included in several important releases of rocm (e.g. 7.0.0 and 7.1.0). I'll track down the links and add them here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it was merged into the develop branch on rocm-libraries here: ROCm/rocm-libraries#1883. Not sure when it will get included in the rocm release though. I've added a link to this PR and more comments here to better document this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was suggesting something like making an issue in this repo to fix this and linking this TODO to that issue. Same with the TODO above. So it would be

Suggested change
# TODO remove this once the patches are merged
# TODO(#1234) remove this once the patches are merged upstream

and then issue dmlc#1234 can have all the necessary detail. I think it would also be more self-explanatory if rather than copying all the lose header files in this directory, this applied this upstream commit as a patch, but anyways, this is all kind of unrelated to your change here.

Comment on lines 22 to 28
# TODO this is an unacceptable way to do this, hopefully we can resolve https://github.com/ROCm/libhipcxx/issues/10 quickly
# previous version of libhipcxx allowed semaphores and we didn't have a problems so we're turning them on here
# find and remove all lines in the libhipcxx that contain "#error semaphore is not supported on AMD hardware and should not be included"
sed -i '/#error semaphore is not supported on AMD hardware and should not be included/d' ${INSTALL_PREFIX}/include/rapids/libhipcxx/cuda/semaphore
sed -i '/#error semaphore is not supported on AMD hardware and should not be included/d' ${INSTALL_PREFIX}/include/rapids/libhipcxx/hip/semaphore
sed -i '/#error semaphore is not supported on AMD hardware and should not be included/d' ${INSTALL_PREFIX}/include/rapids/libhipcxx/cuda/std/semaphore
sed -i '/#error semaphore is not supported on AMD hardware and should not be included/d' ${INSTALL_PREFIX}/include/rapids/libhipcxx/hip/std/semaphore
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😬 I think it would be good to dig deeper and understand why this worked anyways and if we can really ignore these errors

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it turns out that <cuda/std/semaphore> was also unsupported in previous versions of libhipcxx, but now they've added firm guardrails around it. Apparently, it can sometimes cause issues when run on the device (but not host).

I dug into how it's used in DGL and reported on it in my ticket on libhipcxx.

TLDR, we use it just to manage access to our IO resources used by graphbolt. However, we only ever use it on the host (as far as I can tell) which means that we should be able to use <cuda/std/semaphore> here since it never runs on the device.

With that said, I understand this solution is bad practice. My hope with this PR was to get something working that we could refer to later because we need to clean up our dependencies eventually and I think we should do it before we try and get things running with theRock based builds

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you say have it as a reference, do you mean you don't want to merge it? It would probably be worth testing to see if a normal std::counting_semaphore would work here, as you mention in that issue

@jamesETsmith
Copy link
Collaborator Author

@GMNGeoffrey thanks for the comments, ROCm@7 supports all the same instinct architectures as ROCm@6.4.3 plus gfx950. References for rocm@7 and rocm@6.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants