Fabrics 2/2: Implementation of the fabrics API#266
Fabrics 2/2: Implementation of the fabrics API#266jonasohland wants to merge 52 commits intodmf-mxl:mainfrom
Conversation
d20cc77 to
fb07fdd
Compare
7c3f830 to
8eefc1f
Compare
083112b to
20f7c76
Compare
KimonHoffmann
left a comment
There was a problem hiding this comment.
Thanks for this massive contribution!
I looked through everything except for the test implementation and the demo. Most of remarks refer to syntactical aspects.
KimonHoffmann
left a comment
There was a problem hiding this comment.
Thanks for this massive contribution!
I looked through everything except for the test implementation and the demo. Most of my remarks refer to syntactical aspects.
Thanks for looking through our code! We really appreciate the effort. I have addressed almost everything, I have only one or two questions and have left your comments regarding those unresolved. |
6b0e78a to
6d6fade
Compare
KimonHoffmann
left a comment
There was a problem hiding this comment.
Looks good, thank you for addressing my comments!
I noticed two smaller things that still need addressing (use of bit_cast<> and constexpr vs. inline), but the rest looks great.
8f6b6d0 to
4dac926
Compare
b5509a7 to
4c4dcda
Compare
KimonHoffmann
left a comment
There was a problem hiding this comment.
Looks good to me, thanks for the updates!
KimonHoffmann
left a comment
There was a problem hiding this comment.
Two small comments on the four commits added since my last review.
lapointejp
left a comment
There was a problem hiding this comment.
Seems okay to me. Just a couple typos and wording things to check.
Thank you so very much for contributing this @jonasohland and @mlefebvre1.
6c0f2ee to
7e9060c
Compare
|
I've run into a couple of issues while trying to get the mxl-fabrics-demo app to stream between initiator and target using the verbs provider. I'm running on Ubuntu 22.04 with a Mellanox ConnectX 6 DX card. The various verbs packages are at version 2510.0.11-1, installed via Nvidia's doca-roce package version 3.2.1-044000. I'm not sure if these issues could be fixed by modifying the environment (system packages, OS settings etc). However, I'm describing here the changes I made to the code to make it work for me. The first error is when the initiator attempts to register the shared memory: This happens because the flow reader opens the shared memory in read-only mode, but the memory registration requests write access using This issue can be temporarily fixed by changing the access mode in The second issue is the target receives a value of 0 for the number of valid slices received. The log message shows the following: The value of valid slices is received in the completion data which is handled by Target::ReadResult RMAGrainIngressProtocol::processCompletion(Endpoint& ep, Completion const& completion)
{
if (auto data = completion.tryData(); data)
{
if (_immDataBuffer)
{
// request the next completion data with the internal buffer
ep.recv(immDataRegion());
// return the data in the internal buffer (before it's arrived)
return Target::ReadResult{_immDataBuffer->data};
}
else
{
return Target::ReadResult{data->data()};
}
}
return {};
}The issue can be fixed by always returning the data from the current completion, after posting a request for the next one: Target::ReadResult RMAGrainIngressProtocol::processCompletion(Endpoint& ep, Completion const& completion)
{
if (auto data = completion.tryData(); data)
{
if (_immDataBuffer)
{
ep.recv(immDataRegion());
}
return Target::ReadResult{data->data()};
}
return {};
} |
|
Hi! @pprayle Thanks for your comment! To your first point: Thanks! |
|
@pprayle BTW, Memory registration in libfabric with the |
Co-authored-by: Michael Lefebvre <michael.lefebvre.31@gmail.com> Signed-off-by: Michael Lefebvre <michael.lefebvre@riedel.net>
Co-authored-by: Michael Lefebvre <michael.lefebvre.31@gmail.com> Signed-off-by: Michael Lefebvre <michael.lefebvre@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
…emless interopability. Co-authored-by: Michael Lefebvre <michael.lefebvre.31@gmail.com> Signed-off-by: Michael Lefebvre <michael.lefebvre@riedel.net>
…ng the supplied domain instead of hardcoded /dev/shm. Co-authored-by: Michael Lefebvre <michael.lefebvre.31@gmail.com> Signed-off-by: Michael Lefebvre <michael.lefebvre@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
As reported by @pprayle Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
a6cecc7 to
0436f4e
Compare
Co-authored-by: Jonas Ohland <jonas.ohland@gmail.com> Signed-off-by: Jonas Ohland (Riedel) <jonas.ohland@riedel.net>
KimonHoffmann
left a comment
There was a problem hiding this comment.
A few small things I stumbled upon.
| */ | ||
| MXL_EXPORT | ||
| mxlStatus mxlFabricsInitiatorRemoveTarget(mxlFabricsInitiator in_initiator, mxlTargetInfo const in_targetInfo); | ||
| mxlStatus mxlFabricsInitiatorRemoveTarget(mxlFabricsInitiator in_initiator, mxlFabricsTargetInfo const in_targetInfo); |
There was a problem hiding this comment.
We could get rid of this const qualification, especially in the context of of this declaration. Feels free to keep it for the definition if it proves to be helpful there (even though I personally tend to not use them on by value arguments).
| */ | ||
| MXL_EXPORT | ||
| mxlStatus mxlFabricsInitiatorAddTarget(mxlFabricsInitiator in_initiator, mxlTargetInfo const in_targetInfo); | ||
| mxlStatus mxlFabricsInitiatorAddTarget(mxlFabricsInitiator in_initiator, mxlFabricsTargetInfo const in_targetInfo); |
There was a problem hiding this comment.
We could get rid of this const qualification, especially in the context of of this declaration. Feels free to keep it for the definition if it proves to be helpful there (even though I personally tend to not use them on by value arguments).
| */ | ||
| MXL_EXPORT | ||
| mxlStatus mxlFabricsTargetInfoToString(mxlTargetInfo const in_targetInfo, char* out_string, size_t* in_stringSize); | ||
| mxlStatus mxlFabricsTargetInfoToString(mxlFabricsTargetInfo const in_targetInfo, char* out_string, size_t* in_stringSize); |
There was a problem hiding this comment.
We could get rid of this const qualification of in_targetInfo, especially in the context of of this declaration. Feels free to keep it for the definition if it proves to be helpful there (even though I personally tend to not use them on by value arguments).
| : _inner(entry) | ||
| {} | ||
|
|
||
| Completion::Token Completion::randomToken() |
There was a problem hiding this comment.
Ist this called often? If yes this way of obtaining the token could prove quite expensive.
| mxlFabricsRegionsFromUserBuffers(memoryRegions.data(), memoryRegions.size(), &outRegions); | ||
|
|
||
| return {outRegions, regions}; | ||
| std::vector<Region> regions; |
There was a problem hiding this comment.
Could use auto with right hand type auto regions = std::vector<Region>{}.
| }; | ||
|
|
||
| // clang-format on | ||
| std::vector<Region> regions; |
There was a problem hiding this comment.
Could use auto with right hand type auto regions = std::vector<Region>{}.
|
|
||
| bool TargetInfo::operator==(TargetInfo const& other) const noexcept | ||
| { | ||
| return fabricAddress == other.fabricAddress && remoteRegions == other.remoteRegions && id == other.id; |
There was a problem hiding this comment.
Could use parentheses around binary subexpressions.
| auto regionObj = regionValue.get<picojson::object>(); | ||
| auto addr = std::stoull(regionObj.at("addr").get<std::string>()); | ||
| auto len = std::stoull(regionObj.at("len").get<std::string>()); | ||
| auto rkey = std::stoull(regionObj.at("rkey").get<std::string>()); |
There was a problem hiding this comment.
Could use a comment as to why those fields are not fetched as numeric values directly.
| throw Exception::invalidArgument("Invalid provider specified"); | ||
| } | ||
|
|
||
| uint64_t caps = FI_RMA | FI_REMOTE_WRITE; |
There was a problem hiding this comment.
Please use the std:: qualified variant.
This is step 2 of merging the fabrics functionality into the main mxl repo. Now we would like to merge the code that implements the api in
fabrics.hEnables transfer of grains and slices using remote memory access, tcp and shared memory on supported hardware.
Future improvements will include audio support and capability management.
Makes #78 obsolete.
Closes #181
NOTE: You must use the following cmake configure flag to enable the Fabric's API build :
-DMXL_ENABLE_FABRICS_OFI=on