@xbattlax
Contributor

Summary

Implement size-based file scan task planning for iceberg-rust, addressing issue #128.

Changes

  • Add crates/iceberg/src/scan/bin_packing.rs with:

    • Greedy bin-packing algorithm with configurable lookback
    • BinPackingStream<S> for async streaming (memory efficient)
    • CombinedScanTask grouping for balanced parallel execution
    • Weight calculation considering data size and file open costs
  • Update crates/iceberg/src/scan/context.rs:

    • File splitting based on split_offsets (Parquet row group boundaries)
    • Fallback to byte-range splitting when offsets unavailable
    • Optimized delete file handling (move ownership for last split)
  • Update crates/iceberg/src/scan/mod.rs:

    • Add with_split_target_size(), with_split_open_file_cost(), with_split_lookback() builder methods
    • Add plan_tasks() method returning CombinedScanTaskStream
  • Update crates/iceberg/src/scan/task.rs:

    • Add CombinedScanTask struct and CombinedScanTaskStream type
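For readers unfamiliar with the approach, the greedy bin-packing with lookback and the weight calculation listed for bin_packing.rs can be sketched roughly as follows. This is a simplified, synchronous illustration and not the code in this PR: `Task`, `Bin`, `weight`, and `pack` are hypothetical names, and the weight is assumed to be the task length floored at the open-file cost.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a file scan task; only the byte length matters here.
#[derive(Debug, Clone, PartialEq)]
struct Task {
    path: &'static str,
    length: u64,
}

/// Assumed weight: the task's size, but never less than the cost of opening a file,
/// so that many tiny files do not all land in a single group.
fn weight(task: &Task, open_file_cost: u64) -> u64 {
    task.length.max(open_file_cost)
}

/// A group of tasks that is still accepting new members.
struct Bin {
    tasks: Vec<Task>,
    weight: u64,
}

/// Greedy first-fit packing that keeps at most `lookback` bins open at once.
fn pack(
    tasks: Vec<Task>,
    target_size: u64,
    open_file_cost: u64,
    lookback: usize,
) -> Vec<Vec<Task>> {
    let lookback = lookback.max(1);
    let mut open: VecDeque<Bin> = VecDeque::new();
    let mut done: Vec<Vec<Task>> = Vec::new();

    for task in tasks {
        let w = weight(&task, open_file_cost);
        // Look for the first open bin with enough room left for this task.
        match open.iter().position(|b| b.weight + w <= target_size) {
            Some(i) => {
                open[i].tasks.push(task);
                open[i].weight += w;
            }
            None => {
                // No open bin fits: start a new one, and emit the oldest bin
                // if more than `lookback` bins would otherwise stay open.
                open.push_back(Bin { tasks: vec![task], weight: w });
                if open.len() > lookback {
                    if let Some(oldest) = open.pop_front() {
                        done.push(oldest.tasks);
                    }
                }
            }
        }
    }

    // Flush whatever is still open, oldest first.
    done.extend(open.into_iter().map(|b| b.tasks));
    done
}
```

With a target size of 100, an open-file cost of 10, a lookback of 2, and tasks of 60, 50, 40, 5, and 90 bytes (in that order), this sketch groups them as [60, 40], [50, 5], and [90]. Because only `lookback` bins are held at a time, the same logic can be driven from a stream without collecting every task first.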

Notes

This matches the Java Iceberg implementation's TableScanUtil.planTasks() functionality:

  • Large files are split into multiple tasks for parallel processing
  • Small files are combined to reduce file open overhead
  • Streaming implementation avoids collecting all tasks into memory
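The splitting of large files can be illustrated with a minimal sketch of the two strategies named under Changes: one split per `split_offsets` entry when offsets are present, and fixed-size byte ranges otherwise. The function name `plan_splits` and the `(start, length)` tuple representation are assumptions made for illustration, and the sketch assumes the offsets are sorted ascending and lie within the file.

```rust
/// Returns `(start, length)` byte ranges for a file of `file_len` bytes.
///
/// Assumes `split_offsets` is sorted ascending and every offset is < `file_len`.
fn plan_splits(file_len: u64, split_offsets: &[u64], target_size: u64) -> Vec<(u64, u64)> {
    if !split_offsets.is_empty() {
        // Offset-aware path: each split starts at a recorded offset (for Parquet,
        // a row group boundary) and runs to the next offset or the end of the file.
        split_offsets
            .iter()
            .enumerate()
            .map(|(i, &start)| {
                let end = split_offsets.get(i + 1).copied().unwrap_or(file_len);
                (start, end.saturating_sub(start))
            })
            .collect()
    } else {
        // Fallback path: cut the file into byte ranges of at most `target_size`.
        let target = target_size.max(1); // guard against a zero target size
        let mut splits = Vec::new();
        let mut start = 0;
        while start < file_len {
            let len = target.min(file_len - start);
            splits.push((start, len));
            start += len;
        }
        splits
    }
}
```

In this sketch, splits produced from offsets can be much smaller than the target size; the bin-packing step is what would recombine them into evenly sized groups.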

Closes #128
