Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds configurable streaming download support for remote log segments to improve download throughput. The implementation adds a new streaming path using OpenDAL's reader API while keeping the existing range-based download as a fallback option, allowing for easy A/B testing and rollback.
Changes:
- Added two new configuration fields:
scanner_remote_log_streaming_read(default: true) andscanner_remote_log_streaming_read_concurrency(default: 4) to control the streaming behavior - Refactored the download implementation to support both streaming and range-based approaches, selected at runtime based on configuration
- Updated C++ bindings to expose the new configuration options
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/fluss/src/config.rs | Adds new config fields for streaming read control with comprehensive tests for the boolean flag behavior |
| crates/fluss/src/client/table/scanner.rs | Passes new streaming config parameters to RemoteLogDownloader |
| crates/fluss/src/client/table/remote_log.rs | Implements streaming download path and refactors existing range-based download into separate helper functions |
| bindings/cpp/src/lib.rs | Updates FFI config struct and connection initialization to include new streaming fields |
| bindings/cpp/src/ffi_converter.hpp | Maps C++ Configuration fields to FFI config structure |
| bindings/cpp/include/fluss.hpp | Adds streaming configuration fields to C++ Configuration struct with appropriate defaults |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| remote_path: &str, | ||
| local_path: &Path, | ||
| remote_fs_props: &HashMap<String, String>, | ||
| streaming_read: bool, | ||
| streaming_read_concurrency: usize, | ||
| ) -> Result<PathBuf> { |
There was a problem hiding this comment.
The documentation comment at line 862 states "using streaming read/write" but the function now supports both streaming and range-based downloads depending on the streaming_read parameter. Consider updating the comment to reflect this, such as "Download a file from remote storage to local using either streaming or range-based read/write".
|
@luoyuxia @fresh-borzoni @leekeiabstraction PTAL if u have time. |
luoyuxia
left a comment
There was a problem hiding this comment.
@zhaohaidao Thanks. LGTM overall.
- also update bindings/python to respect the config
- update the docs in website directory
Co-authored-by: yuxia Luo <luoyuxia@alumni.sjtu.edu.cn>
Thanks. Comments are addressed. PTAL @luoyuxia |
luoyuxia
left a comment
There was a problem hiding this comment.
Thanks. LGTM! Will wait to see if there's more comments before merge
fresh-borzoni
left a comment
There was a problem hiding this comment.
LGTM, left minor comments
|
@luoyuxia @fresh-borzoni Comments are addressed. PTAL if u have time. |
fresh-borzoni
left a comment
There was a problem hiding this comment.
@zhaohaidao Ty, LGTM
Summary
Why
Range-loop reads on remote segments increase per-chunk request overhead and can reduce effective object-store throughput. Streaming download improves transfer shape and reduces remote-read overhead for this workload profile.
What Changed
crates/fluss/src/config.rs:scanner_remote_log_streaming_read(defaulttrue)scanner_remote_log_streaming_read_concurrency(default4)crates/fluss/src/client/table/scanner.rs: pass streaming knobs intoRemoteLogDownloader.crates/fluss/src/client/table/remote_log.rs:reader_with().chunk().concurrent())bindings/cpp/src/lib.rs: switch toConfig::default() + field overridesto avoid missing new config fields.Benchmark
Fixed workload:
cargo run --release --example custom-consume-table -- --task LdsHomeFeedQH --hours-ago 10 --bucket-mode range --max-buckets 4timeout 180sper run| Factor | (r/s) |
|---|---:|
|
streaming_read=true(B0_v3) 39750.45 ||
streaming_read=false(F6) 17950.00 |Turning streaming off: -54.84% throughput