@Cici-Lii Cici-Lii commented Dec 15, 2025

Some NVMe SSDs (e.g. Flexible Data Placement SSDs, TP4146) can make use of data lifetime information on the device. Passing data lifetime information (writeHint) down to such devices lowers write amplification and improves performance.
Kernel file systems (ext4, XFS, btrfs, F2FS) already support setting the write hint via fcntl(). This patch adds support for data lifetime information to BeeGFS, which enables BeeGFS to use multi-stream SSDs such as FDP SSDs.
A brief proposal for this feature is here: #59

The following provides an overview of the FDP technology, the current state of kernel support, the design of this patch, and the benchmark results.

FDP Technology

Flexible Data Placement (FDP) is a new data placement technology that has been merged into the NVMe specification v2.1. FDP SSDs can reduce the write amplification factor (WAF) because the host is allowed to control where data is written according to its lifetime.

The biggest advantage of FDP over conventional SSDs is the flexibility it gives the host: precise control over data placement into isolated Reclaim Units (RUs) via Reclaim Unit Handles (RUHs). This lets developers place data with similar lifetimes into the same RU. As a result, most data in an RU becomes invalid at around the same time, so garbage collection (GC) has far less valid data to migrate. This lowers write amplification and extends device lifespan.

                  Conventional SSD                                           FDP SSD
+---------------------------------------------------+ +---------------------------------------------------+
|                 Host Data Streams                 | |                 Host Data Streams                 |
|   +-------+   +-------+   +-------+   +-------+   | |   +-------+   +-------+   +-------+   +-------+   |
|   |   1   |   |   2   |   |   3   |   |   4   |   | |   |   1   |   |   2   |   |   3   |   |   4   |   |
|   +-------+   +-------+   +-------+   +-------+   | |   +-------+   +-------+   +-------+   +-------+   |
+---------------------------------------------------+ +---------------------------------------------------+
+---------------------------------------------------+ +---------------------------------------------------+
|                       SSD                         | |                       SSD                         |
|   +-------------------------------------------+   | |   +-------------------------------------------+   |
|   |                    FTL                    |   | |   |                    FTL                    |   |
|   +-------------------------------------------+   | |   +-------------------------------------------+   |
|   +---+---+---+---+---+---+---+---+---+---+---+   | |   +-----+    +-----+         +-----+    +-----+   |
|   | 2 | 4 | 1 | 3 | 2 | 3 | 4 | 1 | 2 | 2 | 4 |   | |   |  1  |    |  1  |         |  3  |    |  3  |   |
|   +---+---+---+---+---+---+---+---+---+---+---+   | |   +-----+    +-----+         +-----+    +-----+   |
|   | 3 | 1 | 4 | 2 | 3 | 1 | 1 | 2 | 1 | 3 | 1 |   | |   +-----+    +-----+         +-----+    +-----+   |
|   +---+---+---+---+---+---+---+---+---+---+---+   | |   |  2  |    |  2  |         |  4  |    |  4  |   |
|   | 4 | 2 | 3 | 1 | 2 | 2 | 1 | 4 | 3 | 2 | 3 |   | |   +-----+    +-----+         +-----+    +-----+   |
|   +---+---+---+---+---+---+---+---+---+---+---+   | |   +-----+    +-----+         +-----+    +-----+   |
|   | 1 | 3 | 2 | 4 | 3 |   |   |   |   |   |   |   | |   |  1  |    |  2  |         |     |    |     |   |
|   +---+---+---+---+---+---+---+---+---+---+---+   | |   +-----+    +-----+         +-----+    +-----+   |
|   |   |   |   |   |   |   |   |   |   |   |   |   | |   <-RU-->                                         |
|   +---+---+---+---+---+---+---+---+---+---+---+   | |   <-------RG0------>         <-------RG1------>   |
+---------------------------------------------------+ +---------------------------------------------------+
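
The effect of lifetime grouping on garbage collection can be illustrated with a little arithmetic. The sketch below assumes a simplified model that is not part of this patch: to reclaim a unit of `total_pages` pages, of which `valid_pages` are still valid, the device first copies the valid pages elsewhere, and only the remaining pages become available for new host writes. The function names and the model itself are illustrative assumptions.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Simplified steady-state model (an assumption for illustration only):
 * reclaiming one unit relocates valid_pages and frees room for
 * (total_pages - valid_pages) pages of new host data, so
 *   WAF = (host writes + relocated writes) / host writes
 *       = total_pages / (total_pages - valid_pages).
 * Returns -1.0 when the unit is fully valid and nothing can be reclaimed.
 */
double gc_waf(unsigned total_pages, unsigned valid_pages)
{
    assert(valid_pages <= total_pages);
    if (valid_pages == total_pages)
        return -1.0;
    return (double)total_pages / (double)(total_pages - valid_pages);
}

/* Prints the modelled WAF for a few valid-page counts in a 256-page unit. */
void print_gc_waf_examples(void)
{
    const unsigned valid[] = { 0, 64, 128 };
    for (unsigned i = 0; i < sizeof(valid) / sizeof(valid[0]); i++)
        printf("valid=%u WAF=%.2f\n", valid[i], gc_waf(256, valid[i]));
}
```

In this model, a unit with no valid pages left gives a WAF of 1, while a unit that is still half valid gives a WAF of 2. Placing data with similar lifetimes in the same RU pushes the device toward the first case.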

Current Kernel Support for FDP

Since commit 449813515d3e (block, fs: Restore the per-bio/request data lifetime fields), both file systems (f2fs, ext4, btrfs) and the block layer in the Linux kernel have supported data lifetime fields. The key fields involved are i_write_hint in inode and bi_write_hint in bio.

In 2025, commit 38e8397dde63 (nvme: use fdp streams if write stream is provided) extended the kernel driver to support FDP functionality. Notably, bi_write_stream appears largely redundant with bi_write_hint; personally, I don't fully understand why the kernel community accepted a new field when bi_write_hint already existed.

BeeGFS Support for FDP

This patch adds FDP support to BeeGFS. As described in the initial pull request: "This patch adds support for data lifetime information in BeeGFS, and it enables BeeGFS to use multi-stream SSDs, such as FDP SSDs."

We have modified three I/O paths in BeeGFS (direct I/O, buffered I/O, and page cache I/O, which uses the kernel page cache) to accept data lifetime hints from the kernel, propagate them over the network, and deliver them to the storage server, where they are used to direct data placement via FDP commands. The figure below summarizes the overall design.

           VFS_Inode                 RemotingIOInfo              WriteLocalFileMsg
    +---------------------+      +---------------------+      +---------------------+
    |  enum i_write_hint  + ---> |  uint64_t writeHint + ---> |  +---------------+  |
    +---------------------+      +---------------------+      |  |   NetMessage  |  |
                                                              |  +---------------+  |
                                                              |  uint64_t writeHint |
                                                              +----------+----------+
                                                               serialize to payload
                                                                         |
                                                                         v
Client                                                        +---------------------+
------------------ TCP/IP or RDMA --------------------------- |         CTX         | ----
Server                                                        +----------+----------+
               +--------------- get writeHint -------------+   deserialize to member
               |                                           |             |
               v                    WriteLocalFileMsg      |             v
    +---------------------+      +---------------------+   |  +---------------------+
    |      openFile       | <--- +  +---------------+  | <-+- +  uint64_t writeHint |
    +----------+----------+      |  |  NetMessage   |  |      +---------------------+
               |                 |  +---------------+  |       WriteLocalFileMsgBase
               |                 |  +---------------+  |
               v                 |  | WriteLocalFile|  |
    +---------------------+      |  |     MsgBase   |  |
    |        fcntl        |      |  +---------------+  |
    +---------------------+      +---------------------+
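
The serialize and deserialize steps in the figure can be sketched as a round trip. This is only an illustration: the struct, the field order, the 24-byte size, and the little-endian encoding are assumptions made for the example and do not reflect the actual BeeGFS wire format or its serialization helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical write-request fields; not the real BeeGFS message layout. */
struct write_request {
    int64_t  offset;
    int64_t  count;
    uint64_t write_hint; /* RWH_WRITE_LIFE_* value taken from the inode */
};

#define WRITE_REQUEST_WIRE_SIZE 24u /* three 8-byte fields (assumed) */

/* Store a 64-bit value in little-endian byte order. */
void put_u64_le(uint8_t *dst, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t)(v >> (8 * i));
}

/* Load a 64-bit value stored in little-endian byte order. */
uint64_t get_u64_le(const uint8_t *src)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)src[i] << (8 * i);
    return v;
}

/* Client side: returns the number of bytes written, or 0 if buf is too small. */
size_t write_request_serialize(const struct write_request *req,
                               uint8_t *buf, size_t len)
{
    assert(req != NULL && buf != NULL);
    if (len < WRITE_REQUEST_WIRE_SIZE)
        return 0;
    put_u64_le(buf, (uint64_t)req->offset);
    put_u64_le(buf + 8, (uint64_t)req->count);
    put_u64_le(buf + 16, req->write_hint);
    return WRITE_REQUEST_WIRE_SIZE;
}

/* Server side: returns 0 on success, -1 if the payload is too short. */
int write_request_deserialize(const uint8_t *buf, size_t len,
                              struct write_request *out)
{
    assert(buf != NULL && out != NULL);
    if (len < WRITE_REQUEST_WIRE_SIZE)
        return -1;
    out->offset = (int64_t)get_u64_le(buf);
    out->count = (int64_t)get_u64_le(buf + 8);
    out->write_hint = get_u64_le(buf + 16);
    return 0;
}
```

After a successful round trip the server-side struct carries the same hint the client read from the inode, which is the value later handed to fcntl().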

BeeGFS Benchmark Results

Since FDP technology significantly reduces write amplification through intelligent data placement based on data lifetime, we conducted comparative testing of BeeGFS on conventional SSDs versus FDP SSDs, focusing on the WAF metric.

Test Config
To measure WAF, we first performed a precondition write to fill the disk to over 90% of its capacity (the disk is 8 TB), and then conducted the WAF test.
We used FIO for testing, with FDP configured to use six streams (0–5). When testing the FDP SSD, user data was categorized into four lifetime hints (short, medium, long, and extreme), and each hint was written into a separate stream (streams 2–5). All remaining data was directed to stream 0, while stream 1 was left unused.
WAF was measured across all three I/O paths: direct I/O, buffered I/O, and page cache I/O.
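
The hint-to-stream assignment described above can be written down as a small lookup. This is a sketch under assumptions: the text only says that the four hints went to streams 2–5, so the exact pairing (short to 2, medium to 3, and so on) and the function names are hypothetical. The hint values mirror the RWH_WRITE_LIFE_* constants from the Linux UAPI headers.

```c
#include <assert.h>
#include <stdint.h>

/* Same numeric values as RWH_WRITE_LIFE_* in <linux/fcntl.h>. */
enum {
    HINT_NOT_SET = 0,
    HINT_NONE    = 1,
    HINT_SHORT   = 2,
    HINT_MEDIUM  = 3,
    HINT_LONG    = 4,
    HINT_EXTREME = 5
};

/*
 * Assumed pairing for illustration: one stream per lifetime hint,
 * everything else (no hint, unknown hint) goes to stream 0.
 */
unsigned stream_for_hint(uint64_t hint)
{
    switch (hint) {
    case HINT_SHORT:   return 2;
    case HINT_MEDIUM:  return 3;
    case HINT_LONG:    return 4;
    case HINT_EXTREME: return 5;
    default:           return 0;
    }
}

/* Checks that stream 1 is never selected, matching the test setup. */
void stream_mapping_self_check(void)
{
    for (uint64_t hint = HINT_NOT_SET; hint <= HINT_EXTREME; hint++)
        assert(stream_for_hint(hint) != 1);
}
```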

Test Results
The largest WAF reduction was about 40% (buffered I/O). On the FDP SSD, the WAF was close to 1 on all three I/O paths, indicating almost no write amplification on the device.

I/O path        Conventional SSD WAF   FDP SSD WAF   Reduction
DIO             1.68                   1.01          39.88%
Buffer I/O      1.70                   1.01          40.59%
PageCache I/O   1.61                   1.01          37.27%
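
For reference, the reduction percentages follow from (conventional WAF - FDP WAF) / conventional WAF. The helper below is a minimal sketch of that arithmetic using the measured values from the results; the function names are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Relative WAF reduction in percent: (conventional - fdp) / conventional. */
double waf_reduction_pct(double conventional, double fdp)
{
    assert(conventional > 0.0);
    return (conventional - fdp) / conventional * 100.0;
}

/* Recomputes the reduction for each measured I/O path. */
void print_waf_reductions(void)
{
    static const struct {
        const char *path;
        double conventional;
        double fdp;
    } rows[] = {
        { "DIO",           1.68, 1.01 },
        { "Buffer I/O",    1.70, 1.01 },
        { "PageCache I/O", 1.61, 1.01 },
    };

    for (unsigned i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
        printf("%-14s %.2f%%\n", rows[i].path,
               waf_reduction_pct(rows[i].conventional, rows[i].fdp));
}
```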

In conclusion, this patch substantially reduces WAF by using the FDP feature.

Member

@iamjoemccormick iamjoemccormick left a comment

Hi @Cici-Lii, thanks for the proposed feature and the PR. I took an initial pass and left some comments. The main blocker right now is mixed-version compatibility of the new message format. Please take a look at the feedback and let me know if you have any questions.


outIOInfo->userID = i_uid_read(&this->vfs_inode);
outIOInfo->groupID = i_gid_read(&this->vfs_inode);
outIOInfo->writeHint = this->vfs_inode.i_write_hint;
Member

todo: The field i_write_hint is likely not present in all kernels BeeGFS supports.

Please add a feature-detect check in client_module/build/feature-detect.sh to detect the presence of i_write_hint in struct inode and gate the client-side access accordingly.

For example, I presume the client should fall back to RWH_WRITE_LIFE_NOT_SET when the field is unavailable:

#ifdef KERNEL_HAS_INODE_I_WRITE_HINT
    outIOInfo->writeHint = this->vfs_inode.i_write_hint;
#else
    outIOInfo->writeHint = 0; /* RWH_WRITE_LIFE_NOT_SET */
#endif

Author

@Cici-Lii Cici-Lii Jan 13, 2026

Changes made as suggested. Thanks for your feedback!

Comment on lines 586 to 590
int r = fcntl(fd, F_SET_RW_HINT, &writeHint);
if (r < 0) {
   LOG(GENERAL, ERR, "Failed to set writeHint.",
      ("writeHint", StringTk::uint64ToStr(writeHint)));
}
Member

todo: If writeHint == RWH_WRITE_LIFE_NOT_SET (0), we should skip calling fcntl(F_SET_RW_HINT, …) entirely, since no hint was requested.

If a non-zero writeHint is provided and fcntl() fails for any reason (including EINVAL for unsupported kernels), we should log an error rather than silently dropping the hint. This makes it visible when the client expects lifetime hints to be applied but the server cannot honor them.
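
A minimal sketch of the behavior described above, written as plain C rather than against the BeeGFS logging API. The function name and the use of stderr for logging are assumptions for illustration; F_SET_RW_HINT and RWH_WRITE_LIFE_NOT_SET are the Linux constants, with fallback definitions in case the libc headers do not expose them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fallbacks for older libc headers; values match <linux/fcntl.h>. */
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT 1036 /* F_LINUX_SPECIFIC_BASE (1024) + 12 */
#endif
#ifndef RWH_WRITE_LIFE_NOT_SET
#define RWH_WRITE_LIFE_NOT_SET 0
#endif

/*
 * Returns 0 if no hint was requested or the hint was applied,
 * and -1 if a hint was requested but fcntl() failed.
 */
int apply_write_hint(int fd, uint64_t write_hint)
{
    if (write_hint == RWH_WRITE_LIFE_NOT_SET)
        return 0; /* no hint requested, skip the syscall entirely */

    if (fcntl(fd, F_SET_RW_HINT, &write_hint) < 0) {
        /* Make the failure visible instead of silently dropping the hint. */
        fprintf(stderr, "Failed to set writeHint %llu: %s\n",
                (unsigned long long)write_hint, strerror(errno));
        return -1;
    }
    return 0;
}
```

With this shape, a request without a hint never touches fcntl(), and any failure for a non-zero hint is reported to the caller and logged.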

Member

todo: Unfortunately adding fields to network messages like the writeHint to WriteLocalFileMsg and WriteLocalFileRDMAMsg will create incompatibilities when there are mixed client/server versions. If a client sends a message with this field to a server that doesn't know about it yet, the server-side message deserialization will fail. This is also a problem if the server was updated and expects the new message format, but the client omits this field.

The expectation is that any 8.x client can communicate with any 8.x server. While we can sometimes find a way to roll out a change like this in a minor BeeGFS release, it will always require extra handling. Because this tends to introduce technical debt that we have to go back and clean up at the next major release, we try to avoid these kinds of changes if possible.

So I'll still need to justify the change internally. It would be helpful if you could share any testing you've done that demonstrates the performance improvement and/or other benefits you've seen with this patch.
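
To make the compatibility concern concrete, here is a toy model of the four client/server combinations. It assumes a strict parser that only accepts the exact payload size it was built for, and the 16-byte and 24-byte sizes are invented for the example; neither is taken from the BeeGFS code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message versions: without and with the writeHint field. */
enum msg_version { MSG_V_OLD = 0, MSG_V_NEW = 1 };

/* Assumed payload sizes: offset + count, plus 8 bytes for the hint. */
size_t payload_size(enum msg_version v)
{
    return v == MSG_V_NEW ? 24 : 16;
}

/*
 * Under the strict-parser assumption, a receiver can only decode a
 * payload whose size matches the format it expects.
 * Returns 1 if the pair can communicate, 0 otherwise.
 */
int can_communicate(enum msg_version sender, enum msg_version receiver)
{
    assert(sender <= MSG_V_NEW && receiver <= MSG_V_NEW);
    return payload_size(sender) == payload_size(receiver);
}
```

In this model, both mixed combinations fail: a new client cannot talk to an old server, and an old client cannot talk to a new server.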

Author

Sure. I've added more details to the pull request. Please take a look.

#ifdef BEEGFS_NVFS
bool nvfs;
#endif
uint64_t writeHint;
Member

todo: we need to ensure writeHint is also initialized in functions like RemotingIOInfo_initOpen() and RemotingIOInfo_initSpecialClose(). Otherwise it may hold whatever garbage is on the stack, and the storage side would try to apply random or invalid lifetimes.
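
As a rough illustration of the point above, the sketch below uses a trimmed-down, hypothetical stand-in for RemotingIOInfo (the real struct has many more fields and different init signatures). Every init path sets the hint explicitly, so stale memory never reaches the storage side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for RemotingIOInfo, reduced to two fields. */
struct io_info {
    unsigned access_flags;
    uint64_t write_hint;
};

/* Simulates uninitialized stack memory by filling the struct with junk. */
void io_info_poison(struct io_info *info)
{
    assert(info != NULL);
    memset(info, 0xAB, sizeof(*info));
}

/* Init path for a normal open: the hint starts out as "not set". */
void io_info_init_open(struct io_info *info, unsigned access_flags)
{
    assert(info != NULL);
    info->access_flags = access_flags;
    info->write_hint = 0; /* RWH_WRITE_LIFE_NOT_SET */
}

/* Init path for a special close: same rule, no field left untouched. */
void io_info_init_special_close(struct io_info *info)
{
    assert(info != NULL);
    info->access_flags = 0;
    info->write_hint = 0; /* RWH_WRITE_LIFE_NOT_SET */
}
```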

  WriteLocalFileMsgBase(const NumNodeID clientNumID, const char* fileHandleID,
     const uint16_t targetID, const PathInfo* pathInfo, const unsigned accessFlags,
-    const int64_t offset, const int64_t count)
+    const int64_t offset, const int64_t count, const unsigned writeHint)
Member

todo: this parameter is declared as unsigned, but writeHint is uint64_t everywhere else. Please use uint64_t consistently.

@iamjoemccormick iamjoemccormick linked an issue Dec 18, 2025 that may be closed by this pull request
@iamjoemccormick iamjoemccormick added the beegfs/client, beegfs/storage, enhancement (New feature or request), and signed-cla (Contributor has a signed ThinkParQ CLA on file) labels Dec 18, 2025
This commit addresses the review comments from
commit 046a3af.
It includes minor fixes such as correcting field types,
refining condition checks, and adding initialization assignments.
Additionally, it enhances the PR description with
further details on FDP technology and the benchmark results
demonstrating WAF reduction.

Signed-off-by: Qian Li <qian01.li@samsung.com>
Development

Successfully merging this pull request may close these issues.

Supports multi-stream devices
