Skip to content

Conversation

@BubbleCal
Copy link
Contributor

@BubbleCal BubbleCal commented Jan 20, 2026

this avoids reading all token sets and doc sets into memory, reduces the peak memory footprint by 25%

related: #5754

@github-actions
Copy link
Contributor

PR Review: perf: merge partitions in stream style

This PR improves memory efficiency by streaming partition loads during merge operations instead of loading all partitions upfront. The claim of 25% peak memory reduction is meaningful for large indices.

P0/P1 Issues

1. (P1) Edge case in single-partition path for lazy loading

In the single-partition branch (lines 225-242 in the new code), partitions are copied without being loaded first. The code accesses part.store and part.id directly from PartitionSource:

part.store
    .copy_index_file(&token_file_path(part.id), self.dest_store)
    .await?;

This is correct, but ensure that PartitionSource fields (store, id) remain pub(super) or appropriately accessible since they're now accessed directly rather than through loaded partition methods.

2. (P1) Potential off-by-one in next_id tracking

In new():

next_id: max_id + 1,

Then in ensure_builder():

self.builder = Some(InnerBuilder::new(
    self.next_id,
    with_position,
    self.token_set_format,
));
self.next_id += 1;

And in flush():

let next_id = builder.id() + 1;
// ...
self.next_id = next_id + 1;  // This sets next_id to builder.id() + 2

After flush, next_id becomes builder.id() + 2, but the new builder is created with next_id (which is builder.id() + 1). This seems intentional but the double increment in flush (next_id + 1) may skip IDs. Please verify this is the intended behavior - it looks like it might create gaps in partition IDs.

Minor Observations (not blocking)

  • Good use of get_num_compute_intensive_cpus() to bound the buffer size
  • The new test_merge_streams_partitions_in_batches test validates the streaming behavior
  • Error handling for mismatched position settings across partitions is a good addition

Overall this is a solid memory optimization. Please verify the next_id tracking logic before merging.

@BubbleCal BubbleCal requested a review from Xuanwo January 20, 2026 10:13
@codecov
Copy link

codecov bot commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 87.09677% with 24 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-index/src/scalar/inverted/merger.rs 86.44% 15 Missing and 9 partials ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Collaborator

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@BubbleCal BubbleCal merged commit 01374ab into main Jan 20, 2026
31 checks passed
@BubbleCal BubbleCal deleted the yang/stream-merge-partitions branch January 20, 2026 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants