
Conversation

Contributor

@wkalt wkalt commented Jan 20, 2026

Lance's read pipeline is split into independent scheduling and decoding loops, designed to maximize both IO and CPU parallelism.

When we schedule a range of rows to read, the scheduler breaks that range up into "scan lines" and issues requests against the storage backend at the maximum parallelism the backend advertises. For each scan line, a message is sent over an unbounded mpsc channel to the decoder. The message includes futures that wrap the results of the column download operations.

Although requests against storage are limited by the advertised capabilities of the store, the limitation applies only to in-flight IO requests. Nothing stops the results of completed IO requests from piling up in memory prior to decoding. This works fine when IO is slower than decoding, but if IO is significantly faster than decoding (for example, RAM- or NVMe-based storage, or particularly slow decoding tasks), we can accumulate a large amount of downloaded data in memory well before we need it.

This patch introduces a semaphore-based backpressure mechanism. Permits are acquired prior to scheduling and released when the corresponding decoding futures complete, limiting the amount of work that can accumulate between the scheduler and the consumer.
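
For illustration, here is a minimal sketch of that mechanism using a tokio `Semaphore` and an unbounded mpsc channel. The permit count, the loop bodies, and the simulated download/decode work are placeholders rather than Lance's actual types:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

#[tokio::main]
async fn main() {
    // At most 4 scan lines' worth of downloaded-but-undecoded data may exist
    // between the scheduler and the decoder at any time.
    let permits = Arc::new(Semaphore::new(4));
    let (tx, mut rx) = mpsc::unbounded_channel();

    // Scheduling loop: acquire a permit before issuing the IO for a scan line,
    // then send the result (with the permit attached) to the decoder.
    let scheduler = tokio::spawn({
        let permits = permits.clone();
        async move {
            for i in 0..32u8 {
                let permit = permits.clone().acquire_owned().await.unwrap();
                let downloaded = vec![i; 1024]; // stand-in for the column download
                tx.send((downloaded, permit)).unwrap();
            }
        }
    });

    // Decoding loop: the permit is dropped only once decoding finishes, which
    // is what keeps the scheduler from racing too far ahead.
    while let Some((downloaded, _permit)) = rx.recv().await {
        let _decoded: u64 = downloaded.iter().map(|b| u64::from(*b)).sum();
        // _permit is dropped here, releasing capacity back to the scheduler.
    }
    scheduler.await.unwrap();
}
```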

@github-actions github-actions bot added the enhancement New feature or request label Jan 20, 2026
Contributor Author

wkalt commented Jan 20, 2026

This is in WIP/discussion state -- there are some unresolved questions.

One big question is how to make this work optimally against a storage-backed memory cache, which is the original motivation for the experiment. The problem is that some requests will behave like object storage and some will behave like memory, so we may need something smarter and more adaptive than a fixed limit on the channel size. It may be useful to allow the caller to bring their own semaphore, in order to keep this complexity out of lance itself.
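
A rough sketch of what a bring-your-own-semaphore interface could look like; the `ScanBackpressure` type and its constructors are hypothetical, not existing Lance API:

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// Hypothetical knob for the scan pipeline: either a fixed permit count, or a
/// caller-provided semaphore so any adaptive policy lives outside of lance.
pub struct ScanBackpressure {
    permits: Arc<Semaphore>,
}

impl ScanBackpressure {
    /// Fixed limit on scan lines in flight between scheduler and decoder.
    pub fn bounded(max_in_flight: usize) -> Self {
        Self {
            permits: Arc::new(Semaphore::new(max_in_flight)),
        }
    }

    /// Bring your own semaphore, e.g. one shared across scans or tuned by the
    /// caller depending on whether requests hit the memory cache or cold storage.
    pub fn with_semaphore(permits: Arc<Semaphore>) -> Self {
        Self { permits }
    }

    /// Called by the scheduler before dispatching a scan line; the returned
    /// permit is dropped by the decoder once decoding completes.
    pub async fn acquire(&self) -> OwnedSemaphorePermit {
        self.permits
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed")
    }
}
```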

Also, I'm not sure the IO latency simulation in the benchmark is good enough. If we merge this, we will probably want to scale the benchmark back and make it a bit more targeted.

Here are the current textual benchmark results ("unbounded" is the current behavior):

=== I/O Latency: 0ms ===
  latency= 0ms p= 4 unbounded      1771346 rows/s, 423.0 MB, io_reqs=1
  latency= 0ms p= 4 bounded_c=2    4831215 rows/s,  24.4 MB, io_reqs=98
  latency= 0ms p= 4 bounded_c=4    1206624 rows/s,  48.4 MB, io_reqs=98
  latency= 0ms p= 4 bounded_c=8    1240292 rows/s,  48.4 MB, io_reqs=98
  latency= 0ms p= 4 bounded_c=16   1207817 rows/s,  48.4 MB, io_reqs=98

=== I/O Latency: 1ms ===
  latency= 1ms p= 4 unbounded      1808342 rows/s, 423.0 MB, io_reqs=1
  latency= 1ms p= 4 bounded_c=2     462761 rows/s,  12.4 MB, io_reqs=98
  latency= 1ms p= 4 bounded_c=4     460657 rows/s,  12.4 MB, io_reqs=98
  latency= 1ms p= 4 bounded_c=8     456573 rows/s,  12.4 MB, io_reqs=98
  latency= 1ms p= 4 bounded_c=16    464966 rows/s,  12.4 MB, io_reqs=98

=== I/O Latency: 5ms ===
  latency= 5ms p= 4 unbounded      1668330 rows/s, 423.0 MB, io_reqs=1
  latency= 5ms p= 4 bounded_c=2     163463 rows/s,  12.4 MB, io_reqs=98
  latency= 5ms p= 4 bounded_c=4     162007 rows/s,  12.4 MB, io_reqs=98
  latency= 5ms p= 4 bounded_c=8     162050 rows/s,  12.4 MB, io_reqs=98
  latency= 5ms p= 4 bounded_c=16    162829 rows/s,  12.4 MB, io_reqs=98

=== I/O Latency: 50ms ===
  latency=50ms p= 4 unbounded       936137 rows/s, 423.0 MB, io_reqs=1
  latency=50ms p= 4 bounded_c=2      19863 rows/s,  12.4 MB, io_reqs=98
  latency=50ms p= 4 bounded_c=4      19869 rows/s,  12.4 MB, io_reqs=98
  latency=50ms p= 4 bounded_c=8      19873 rows/s,  12.4 MB, io_reqs=98
  latency=50ms p= 4 bounded_c=16     19872 rows/s,  12.4 MB, io_reqs=98

I validated the effectiveness against my real application's memory cache, but as mentioned, I think more work is required to cover the cold-data case.

Contributor Author

wkalt commented Jan 20, 2026

Another issue with this patch is that it precludes some IO coalescing that is currently performed in the unbounded case. Ideally we would have some way to independently control the degree of coalescing and the degree of permitting/readahead, but currently the coalescing is done in the same step (scheduling) as the storage request dispatch. For this it could make sense to split scheduling/decoding into plan/schedule/decode or schedule/execute/decode; the permitting could then happen between plan and schedule, which would be downstream of IO coalescing.
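
For discussion, here is a rough sketch of what that split could look like, with coalescing finished before any permit is taken. All names and stages are hypothetical, not existing Lance code:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

// Placeholder for a coalesced IO request produced by the planning stage.
struct CoalescedRequest {
    ranges: Vec<std::ops::Range<u64>>,
}

#[tokio::main]
async fn main() {
    // Plan -> schedule handoff; coalescing happens upstream of the permits.
    let (plan_tx, mut plan_rx) = mpsc::channel::<CoalescedRequest>(16);
    // Schedule -> decode handoff, carrying the permit with the downloaded bytes.
    let (io_tx, mut io_rx) = mpsc::unbounded_channel();
    let permits = Arc::new(Semaphore::new(4));

    // Plan: merge adjacent ranges freely; no permits are held here, so the
    // backpressure limit does not constrain coalescing decisions.
    tokio::spawn(async move {
        for i in 0..8u64 {
            let req = CoalescedRequest {
                ranges: vec![i * 4096..(i + 1) * 4096],
            };
            plan_tx.send(req).await.unwrap();
        }
    });

    // Schedule: acquire a permit per coalesced request, then dispatch the IO.
    let schedule = tokio::spawn({
        let permits = permits.clone();
        async move {
            while let Some(req) = plan_rx.recv().await {
                let permit = permits.clone().acquire_owned().await.unwrap();
                let len: usize = req.ranges.iter().map(|r| (r.end - r.start) as usize).sum();
                let bytes = vec![0u8; len]; // stand-in for the downloaded data
                io_tx.send((bytes, permit)).unwrap();
            }
        }
    });

    // Decode: the permit is dropped only after decoding, throttling the scheduler.
    while let Some((bytes, _permit)) = io_rx.recv().await {
        let _checksum: u64 = bytes.iter().map(|b| u64::from(*b)).sum();
    }
    schedule.await.unwrap();
}
```

The bounded plan -> schedule channel here only keeps the planner from running away; the actual backpressure on downloaded bytes is still the permit held through decode.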

    let next_task = ReadBatchTask {
        task: task.boxed(),
        num_rows: num_rows as u32,
        backpressure_permits: Vec::new(), // Permits are now inside the task future

Member

I think my preference would be to just remove the tokio::spawn call up above this point. Does that solve the backpressure issue? See this change here for an example: a9efde8
