feat: add support for multi-stream SSDs (such as FDP SSDs) #92
base: master
NVMe SSDs such as Flexible Data Placement (FDP) SSDs (TP4146) can recognize data lifetime information on the device. Passing data lifetime information (writeHint) to such devices lowers write amplification and improves performance. Kernel file systems (ext4, XFS, btrfs, F2FS) already support setting the writeHint via fcntl(). This patch adds support for data lifetime information in BeeGFS, which enables BeeGFS to use multi-stream SSDs such as FDP SSDs.
iamjoemccormick
left a comment
Hi @Cici-Lii, thanks for the proposed feature and the PR. I took an initial pass and left some comments. The main blocker right now is mixed-version compatibility of the new message format. Please take a look at the feedback and let me know if you have any questions.
outIOInfo->userID = i_uid_read(&this->vfs_inode);
outIOInfo->groupID = i_gid_read(&this->vfs_inode);
outIOInfo->writeHint = this->vfs_inode.i_write_hint;
todo: The field i_write_hint is likely not present in all kernels BeeGFS supports.
Please add a feature-detect check in client_module/build/feature-detect.sh to detect the presence of i_write_hint in struct inode and gate the client-side access accordingly.
For example, I presume the client should fall back to RWH_WRITE_LIFE_NOT_SET when the field is unavailable:
#ifdef KERNEL_HAS_INODE_I_WRITE_HINT
outIOInfo->writeHint = this->vfs_inode.i_write_hint;
#else
outIOInfo->writeHint = 0; /* RWH_WRITE_LIFE_NOT_SET */
#endif
Changes made as suggested. Thanks for your feedback!
int r = fcntl(fd, F_SET_RW_HINT, &writeHint);
if (r < 0) {
   LOG(GENERAL, ERR, "Failed to set writeHint.",
      ("writeHint", StringTk::uint64ToStr(writeHint)));
}
todo: If writeHint == RWH_WRITE_LIFE_NOT_SET (0), we should skip calling fcntl(F_SET_RW_HINT, …) entirely, since no hint was requested.
If a non-zero writeHint is provided and fcntl() fails for any reason (including EINVAL for unsupported kernels), we should log an error rather than silently dropping the hint. This makes it visible when the client expects lifetime hints to be applied but the server cannot honor them.
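A minimal sketch of the behavior requested above, written as plain C rather than against the BeeGFS code base. The helper name `apply_write_hint` and the `hint_result` enum are hypothetical and used only for illustration; `fprintf` stands in for the `LOG(...)` macro in the diff. The fallback `#define`s use the values from the Linux UAPI headers in case the libc headers do not provide them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fallback definitions for older libc headers; values match the Linux UAPI. */
#ifndef F_LINUX_SPECIFIC_BASE
#define F_LINUX_SPECIFIC_BASE 1024
#endif
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
#endif
#ifndef RWH_WRITE_LIFE_NOT_SET
#define RWH_WRITE_LIFE_NOT_SET 0
#endif

/* Hypothetical result type, not part of BeeGFS. */
enum hint_result { HINT_SKIPPED, HINT_APPLIED, HINT_FAILED };

/* Hypothetical helper: apply a lifetime hint to an open file descriptor. */
static enum hint_result apply_write_hint(int fd, uint64_t write_hint)
{
    /* No hint requested: skip the fcntl() call entirely. */
    if (write_hint == RWH_WRITE_LIFE_NOT_SET)
        return HINT_SKIPPED;

    /* F_SET_RW_HINT expects a pointer to a 64-bit value. */
    if (fcntl(fd, F_SET_RW_HINT, &write_hint) < 0) {
        /* A hint was requested but could not be applied: make it visible. */
        fprintf(stderr, "Failed to set writeHint %llu: %s\n",
                (unsigned long long)write_hint, strerror(errno));
        return HINT_FAILED;
    }
    return HINT_APPLIED;
}
```

With this shape, a caller can tell "no hint requested" apart from "hint requested but rejected", which is the distinction the comment asks to surface in the logs.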
todo: Unfortunately adding fields to network messages like the writeHint to WriteLocalFileMsg and WriteLocalFileRDMAMsg will create incompatibilities when there are mixed client/server versions. If a client sends a message with this field to a server that doesn't know about it yet, the server-side message deserialization will fail. This is also a problem if the server was updated and expects the new message format, but the client omits this field.
The expectation is that any 8.x client can communicate with any 8.x server. While we can sometimes find a way to roll out a change like this in a minor BeeGFS release, it will always require extra handling. Because this tends to introduce technical debt that we have to go back and clean up at the next major release, we try to avoid these kinds of changes if possible.
So I'll still need to justify the change internally. It would be helpful if you could share any testing you've done that demonstrates the performance improvement and/or other benefits you've seen with this patch.
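The mixed-version problem can be illustrated with a toy message format. This is a sketch only: `struct write_msg`, `msg_serialize`, and `msg_deserialize` are hypothetical names, the layout is not the real WriteLocalFileMsg wire format, and it assumes a deserializer that requires the body length to match exactly what its own version expects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified message body; NOT the real BeeGFS layout. */
struct write_msg {
    int64_t offset;
    int64_t count;
    uint64_t write_hint; /* only on the wire in the "new" format */
};

#define OLD_BODY_LEN (2 * sizeof(int64_t))
#define NEW_BODY_LEN (OLD_BODY_LEN + sizeof(uint64_t))

/* Serialize in the old or new format; buf must hold NEW_BODY_LEN bytes.
 * Returns the number of bytes written. */
static size_t msg_serialize(const struct write_msg *m, bool with_hint,
                            unsigned char *buf)
{
    size_t pos = 0;
    memcpy(buf + pos, &m->offset, sizeof m->offset);
    pos += sizeof m->offset;
    memcpy(buf + pos, &m->count, sizeof m->count);
    pos += sizeof m->count;
    if (with_hint) {
        memcpy(buf + pos, &m->write_hint, sizeof m->write_hint);
        pos += sizeof m->write_hint;
    }
    return pos;
}

/* Strict deserializer: rejects any body whose length differs from what
 * this version expects. */
static bool msg_deserialize(const unsigned char *buf, size_t len,
                            bool expect_hint, struct write_msg *out)
{
    size_t expected = expect_hint ? NEW_BODY_LEN : OLD_BODY_LEN;
    size_t pos = 0;

    if (len != expected)
        return false;

    memcpy(&out->offset, buf + pos, sizeof out->offset);
    pos += sizeof out->offset;
    memcpy(&out->count, buf + pos, sizeof out->count);
    pos += sizeof out->count;
    out->write_hint = 0;
    if (expect_hint)
        memcpy(&out->write_hint, buf + pos, sizeof out->write_hint);
    return true;
}
```

In this model a new-format message is rejected by an old receiver, and an old-format message is rejected by a new receiver, so both upgrade orders break unless the format change is negotiated or gated in some way.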
Sure. I've added more details to the pull request. Please take a look.
#ifdef BEEGFS_NVFS
   bool nvfs;
#endif
   uint64_t writeHint;
todo: we need to ensure the writeHint is also initialized in functions like RemotingIOInfo_initOpen() and RemotingIOInfo_initSpecialClose(). Otherwise it may hold whatever random garbage is on the stack, and the storage side may then try to assign random or invalid lifetimes.
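A minimal sketch of the kind of initialization meant here. The struct `io_info_sketch` and the function `io_info_sketch_init` are hypothetical stand-ins, not the real RemotingIOInfo or its init functions; the point is only that every init path assigns the hint explicitly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same value as the kernel's RWH_WRITE_LIFE_NOT_SET. */
enum { WRITE_LIFE_NOT_SET = 0 };

/* Hypothetical stand-in for RemotingIOInfo, reduced to a few fields. */
struct io_info_sketch {
    unsigned user_id;
    unsigned group_id;
    uint64_t write_hint;
};

/* Hypothetical init function: every field gets a defined value, so a
 * stack-allocated struct never carries leftover garbage as a hint. */
static void io_info_sketch_init(struct io_info_sketch *info)
{
    info->user_id = 0;
    info->group_id = 0;
    info->write_hint = WRITE_LIFE_NOT_SET;
}
```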
  WriteLocalFileMsgBase(const NumNodeID clientNumID, const char* fileHandleID,
     const uint16_t targetID, const PathInfo* pathInfo, const unsigned accessFlags,
-    const int64_t offset, const int64_t count)
+    const int64_t offset, const int64_t count, const unsigned writeHint)
todo: this uses unsigned but elsewhere uses uint64_t.
This commit addresses the review comments from commit 046a3af. It includes minor fixes such as correcting field types, refining condition checks, and adding initialization assignments. Additionally, it enhances the PR description with further details on FDP technology and the benchmark results demonstrating WAF reduction.
Signed-off-by: Qian Li <qian01.li@samsung.com>
NVMe SSDs such as Flexible Data Placement (FDP) SSDs (TP4146) can recognize data lifetime information on the device. Passing data lifetime information (writeHint) to such devices lowers write amplification and improves performance.
Kernel file systems (ext4, XFS, btrfs, F2FS) already support setting the writeHint via fcntl(). This patch adds support for data lifetime information in BeeGFS, which enables BeeGFS to use multi-stream SSDs such as FDP SSDs.
A brief proposal for this feature is here: #59
The following provides an overview of the FDP technology, the current state of kernel support, the design of this patch, and the benchmark results.
FDP Technology
Flexible Data Placement (FDP) is a data placement technology that has been merged into the NVMe specification (v2.1). FDP SSDs can reduce write amplification (measured as the write amplification factor, WAF) by allowing the host to control where data is written according to its lifetime.
In summary, the biggest advantage of FDP compared to conventional SSDs lies in the flexibility it provides to the host: precise control over data placement into isolated Reclaim Units (RUs) via Reclaim Unit Handles (RUHs). This feature allows developers to place data with similar lifetimes into the same RU. As a result, during garbage collection (GC), most data in an RU becomes invalid at the same time, significantly reducing the amount of valid data that needs to be migrated. This greatly lowers write amplification and extends device lifespan.
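To make the link between GC migration and write amplification concrete, here is a small worked calculation. It uses the usual definition of WAF (bytes written to flash divided by bytes written by the host); the function name `waf` and the numbers in the usage note are illustrative assumptions and are not taken from the benchmark in this PR.

```c
#include <assert.h>
#include <stdint.h>

/* WAF = (host writes + data migrated by GC) / host writes.
 * host_bytes must be non-zero. */
static double waf(uint64_t host_bytes, uint64_t migrated_bytes)
{
    assert(host_bytes > 0);
    return (double)(host_bytes + migrated_bytes) / (double)host_bytes;
}
```

For example, if the host writes 100 units and GC has to migrate another 40 units of still-valid data, `waf(100, 40)` gives 1.4. If every reclaimed RU contains only invalid data, nothing is migrated and `waf(100, 0)` gives 1.0, which is the case FDP tries to approach by grouping data with similar lifetimes.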
Current Kernel Support for FDP
Since commit 449813515d3e (block, fs: Restore the per-bio/request data lifetime fields), both file systems (f2fs, ext4, btrfs) and the block layer in the Linux kernel have supported data lifetime fields. The key fields involved are i_write_hint in inode and bi_write_hint in bio.
In 2025, commit 38e8397dde63 (nvme: use fdp streams if write stream is provided) extended the kernel driver to support FDP functionality. Notably, bi_write_stream is essentially redundant with bi_write_hint. Personally, I don’t fully understand why the kernel community accepted a new, redundant field after bi_write_hint already existed.
BeeGFS Support for FDP
This patch adds FDP support to BeeGFS. As described in the initial pull request: "This patch adds support for data lifetime information in BeeGFS, and it enables BeeGFS to use multi-stream SSDs, such as FDP SSDs."
We have modified three I/O paths in BeeGFS (direct I/O, buffered I/O, and page cache I/O, which uses the kernel page cache) to accept data lifetime hints from the kernel, propagate them through the network, and ultimately deliver them to the storage server, where they are used to direct data placement via FDP commands. The overall design can be simply summarized as shown in the figure below.
BeeGFS Benchmark Results
Since FDP technology significantly reduces write amplification through intelligent data placement based on data lifetime, we conducted comparative testing of BeeGFS on conventional SSDs versus FDP SSDs, focusing on the WAF metric.
Test Config
To measure WAF, we first performed a precondition write to fill the disk to over 90% of its capacity (the disk is 8TB), and then conducted the WAF test.
We used FIO for testing, with FDP configured to use six streams (0–5). When testing the FDP SSD, user data was categorized into four lifetime hints—short, medium, long, and extreme—and each hint was written into a separate stream (streams 2–5). All remaining data was directed to stream 0, while stream 1 was left unused.
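The stream layout described above can be sketched as a simple mapping. This is an illustration only: the function name `hint_to_stream` is hypothetical, and the assumption that short, medium, long, and extreme map to streams 2, 3, 4, and 5 in that order is mine; the description only states that each hint went to a separate stream in the range 2–5. The hint values themselves are the ones the Linux kernel defines for RWH_WRITE_LIFE_*.

```c
#include <assert.h>
#include <stdint.h>

/* Lifetime hint values as defined by the Linux UAPI (RWH_WRITE_LIFE_*). */
enum {
    LIFE_NOT_SET = 0,
    LIFE_NONE    = 1,
    LIFE_SHORT   = 2,
    LIFE_MEDIUM  = 3,
    LIFE_LONG    = 4,
    LIFE_EXTREME = 5
};

/* Hypothetical mapping from lifetime hint to stream index, following the
 * test layout: four hinted classes on streams 2-5, everything else on
 * stream 0, stream 1 never used. */
static unsigned hint_to_stream(uint64_t write_hint)
{
    switch (write_hint) {
    case LIFE_SHORT:   return 2;
    case LIFE_MEDIUM:  return 3;
    case LIFE_LONG:    return 4;
    case LIFE_EXTREME: return 5;
    default:           return 0; /* unhinted or unknown data */
    }
}
```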
WAF was measured across all three I/O paths: direct I/O, buffered I/O, and page cache I/O.
Test Results
The maximum reduction in WAF reached 40%. In all cases, the WAF approached 1, indicating minimal to no write amplification on the device.
In conclusion, these results show that the patch substantially reduces WAF by utilizing the FDP feature.