
Conversation

Contributor

@wkalt wkalt commented Jan 20, 2026

Lance's read pipeline is split into independent scheduling and decoding loops, designed to maximize both IO and CPU parallelism.

When we schedule a range of rows to read, the scheduler breaks that range up into "scan lines" and issues requests against the storage backend at the maximum parallelism the backend advertises. For each scan line, a message is sent over an unbounded mpsc channel to the decoder. The message includes futures that wrap the results of the column download operations.

Although requests against storage are limited by the advertised capabilities of the store, the limitation applies only to in-flight IO requests. Nothing stops the results of completed IO requests from piling up in memory prior to decoding. This works fine when IO is slower than decoding, but if IO is significantly faster than decoding (for example, RAM- or NVMe-based storage, or particularly slow decoding tasks), we can accumulate a large amount of downloaded data in memory well before we need it.

This patch introduces a semaphore-based backpressure mechanism. Permits are acquired prior to scheduling and released when the corresponding decoding futures complete, limiting the amount of work that can accumulate between the scheduler and the consumer.
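
For illustration, here is a minimal sketch of that mechanism using a tokio `Semaphore` and an unbounded mpsc channel. The permit count, the loop bodies, and the simulated download/decode work are placeholders rather than Lance's actual types:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

#[tokio::main]
async fn main() {
    // At most 4 scan lines' worth of downloaded-but-undecoded data may exist
    // between the scheduler and the decoder at any time.
    let permits = Arc::new(Semaphore::new(4));
    let (tx, mut rx) = mpsc::unbounded_channel();

    // Scheduling loop: acquire a permit before issuing the IO for a scan line,
    // then send the result (with the permit attached) to the decoder.
    let scheduler = tokio::spawn({
        let permits = permits.clone();
        async move {
            for i in 0..32u8 {
                let permit = permits.clone().acquire_owned().await.unwrap();
                let downloaded = vec![i; 1024]; // stand-in for the column download
                tx.send((downloaded, permit)).unwrap();
            }
        }
    });

    // Decoding loop: the permit is dropped only once decoding finishes, which
    // is what keeps the scheduler from racing too far ahead.
    while let Some((downloaded, _permit)) = rx.recv().await {
        let _decoded: u64 = downloaded.iter().map(|b| u64::from(*b)).sum();
        // _permit is dropped here, releasing capacity back to the scheduler.
    }
    scheduler.await.unwrap();
}
```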

@github-actions github-actions bot added the enhancement New feature or request label Jan 20, 2026
Contributor Author

wkalt commented Jan 20, 2026

This is in WIP/discussion state -- there are some unresolved questions.

One big question is how to make this work optimally against a storage-backed memory cache, which is the original motivation for the experiment. The problem is that some requests will behave like object storage and some will behave like memory, so we may need something smarter and more adaptive than a fixed limit on the channel size. It may be useful to allow the caller to bring their own semaphore, in order to keep this complexity out of lance itself.
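
A rough sketch of what a bring-your-own-semaphore interface could look like; the `ScanBackpressure` type and its constructors are hypothetical, not existing Lance API:

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// Hypothetical knob for the scan pipeline: either a fixed permit count, or a
/// caller-provided semaphore so any adaptive policy lives outside of lance.
pub struct ScanBackpressure {
    permits: Arc<Semaphore>,
}

impl ScanBackpressure {
    /// Fixed limit on scan lines in flight between scheduler and decoder.
    pub fn bounded(max_in_flight: usize) -> Self {
        Self {
            permits: Arc::new(Semaphore::new(max_in_flight)),
        }
    }

    /// Bring your own semaphore, e.g. one shared across scans or tuned by the
    /// caller depending on whether requests hit the memory cache or cold storage.
    pub fn with_semaphore(permits: Arc<Semaphore>) -> Self {
        Self { permits }
    }

    /// Called by the scheduler before dispatching a scan line; the returned
    /// permit is dropped by the decoder once decoding completes.
    pub async fn acquire(&self) -> OwnedSemaphorePermit {
        self.permits
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed")
    }
}
```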

Also, I'm not sure the IO latency simulation in the benchmark is good enough. If we merge this, we will probably want to scale the benchmark back and make it a bit more targeted.

Here are the current textual benchmark results ("unbounded" is the current behavior):

=== I/O Latency: 0ms ===
  latency= 0ms p= 4 unbounded      1771346 rows/s, 423.0 MB, io_reqs=1
  latency= 0ms p= 4 bounded_c=2    4831215 rows/s,  24.4 MB, io_reqs=98
  latency= 0ms p= 4 bounded_c=4    1206624 rows/s,  48.4 MB, io_reqs=98
  latency= 0ms p= 4 bounded_c=8    1240292 rows/s,  48.4 MB, io_reqs=98
  latency= 0ms p= 4 bounded_c=16   1207817 rows/s,  48.4 MB, io_reqs=98

=== I/O Latency: 1ms ===
  latency= 1ms p= 4 unbounded      1808342 rows/s, 423.0 MB, io_reqs=1
  latency= 1ms p= 4 bounded_c=2     462761 rows/s,  12.4 MB, io_reqs=98
  latency= 1ms p= 4 bounded_c=4     460657 rows/s,  12.4 MB, io_reqs=98
  latency= 1ms p= 4 bounded_c=8     456573 rows/s,  12.4 MB, io_reqs=98
  latency= 1ms p= 4 bounded_c=16    464966 rows/s,  12.4 MB, io_reqs=98

=== I/O Latency: 5ms ===
  latency= 5ms p= 4 unbounded      1668330 rows/s, 423.0 MB, io_reqs=1
  latency= 5ms p= 4 bounded_c=2     163463 rows/s,  12.4 MB, io_reqs=98
  latency= 5ms p= 4 bounded_c=4     162007 rows/s,  12.4 MB, io_reqs=98
  latency= 5ms p= 4 bounded_c=8     162050 rows/s,  12.4 MB, io_reqs=98
  latency= 5ms p= 4 bounded_c=16    162829 rows/s,  12.4 MB, io_reqs=98

=== I/O Latency: 50ms ===
  latency=50ms p= 4 unbounded       936137 rows/s, 423.0 MB, io_reqs=1
  latency=50ms p= 4 bounded_c=2      19863 rows/s,  12.4 MB, io_reqs=98
  latency=50ms p= 4 bounded_c=4      19869 rows/s,  12.4 MB, io_reqs=98
  latency=50ms p= 4 bounded_c=8      19873 rows/s,  12.4 MB, io_reqs=98
  latency=50ms p= 4 bounded_c=16     19872 rows/s,  12.4 MB, io_reqs=98

I validated the effectiveness against my real application's memory cache, but as mentioned, I think more work is required to cover the cold-data case.

Contributor Author

wkalt commented Jan 20, 2026

Another issue with this patch is that it precludes some IO coalescing that is currently performed in the unbounded case. Ideally we would have some way to independently control the degree of coalescing and the degree of permitting/readahead, but currently the coalescing is done in the same step (scheduling) as the storage request dispatch. For this it could make sense to split scheduling/decoding into plan/schedule/decode or schedule/execute/decode; the permitting could then happen between plan and schedule, which would be downstream of IO coalescing.
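
For discussion, here is a rough sketch of what that split could look like, with coalescing finished before any permit is taken. All names and stages are hypothetical, not existing Lance code:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

// Placeholder for a coalesced IO request produced by the planning stage.
struct CoalescedRequest {
    ranges: Vec<std::ops::Range<u64>>,
}

#[tokio::main]
async fn main() {
    // Plan -> schedule handoff; coalescing happens upstream of the permits.
    let (plan_tx, mut plan_rx) = mpsc::channel::<CoalescedRequest>(16);
    // Schedule -> decode handoff, carrying the permit with the downloaded bytes.
    let (io_tx, mut io_rx) = mpsc::unbounded_channel();
    let permits = Arc::new(Semaphore::new(4));

    // Plan: merge adjacent ranges freely; no permits are held here, so the
    // backpressure limit does not constrain coalescing decisions.
    tokio::spawn(async move {
        for i in 0..8u64 {
            let req = CoalescedRequest {
                ranges: vec![i * 4096..(i + 1) * 4096],
            };
            plan_tx.send(req).await.unwrap();
        }
    });

    // Schedule: acquire a permit per coalesced request, then dispatch the IO.
    let schedule = tokio::spawn({
        let permits = permits.clone();
        async move {
            while let Some(req) = plan_rx.recv().await {
                let permit = permits.clone().acquire_owned().await.unwrap();
                let len: usize = req.ranges.iter().map(|r| (r.end - r.start) as usize).sum();
                let bytes = vec![0u8; len]; // stand-in for the downloaded data
                io_tx.send((bytes, permit)).unwrap();
            }
        }
    });

    // Decode: the permit is dropped only after decoding, throttling the scheduler.
    while let Some((bytes, _permit)) = io_rx.recv().await {
        let _checksum: u64 = bytes.iter().map(|b| u64::from(*b)).sum();
    }
    schedule.await.unwrap();
}
```

The bounded plan -> schedule channel here only keeps the planner from running away; the actual backpressure on downloaded bytes is still the permit held through decode.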

    let next_task = ReadBatchTask {
        task: task.boxed(),
        num_rows: num_rows as u32,
        backpressure_permits: Vec::new(), // Permits are now inside the task future

Member

I think my preference would be to just remove the tokio::spawn call up above this point. Does that solve the backpressure issue? See this change here for an example: a9efde8
