Paused workers shouldn't steal tasks (flaky test_avoid_paused_workers) #5664

@crusaderky

#5431 changed Scheduler.decide_worker so that it no longer assigns new tasks to workers with paused or closing_gracefully status.
However, those workers can still steal tasks, which effectively negates the benefit of that PR.

This shows up as flakiness in test_avoid_paused_workers; the test frequently hangs in CI on these lines:

    while (len(w1.tasks), len(w2.tasks), len(w3.tasks)) != (4, 0, 4):
        await asyncio.sleep(0.01)

Above, w2 is paused, yet the tuple ends up as (3, 1, 4) instead of (4, 0, 4). If you add await wait(futures), the test hangs deterministically: a task is always stolen from one of the running workers by the paused one, and it sits there forever because nothing steals it back.
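
To illustrate the behaviour described above, here is a minimal toy model. It is not the work-stealing code in distributed: `Worker`, `make_cluster`, `steal_one` and the `respect_status` flag are hypothetical names invented for this sketch, and the balancing rule (move one task from the busiest worker to the least loaded candidate) is a simplifying assumption. It only shows how ignoring worker status when picking a thief produces (3, 1, 4), while restricting thieves to running workers keeps (4, 0, 4).

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    running = "running"
    paused = "paused"
    closing_gracefully = "closing_gracefully"


@dataclass
class Worker:
    # Hypothetical stand-in for a worker's scheduler-side state.
    name: str
    status: Status = Status.running
    tasks: list = field(default_factory=list)


def make_cluster():
    """Three workers; w2 is paused and holds no tasks."""
    w1 = Worker("w1", tasks=[0, 1, 2, 3])
    w2 = Worker("w2", status=Status.paused)
    w3 = Worker("w3", tasks=[4, 5, 6, 7])
    return [w1, w2, w3]


def steal_one(workers, respect_status=True):
    """Move one task from the busiest worker to the least loaded thief.

    With respect_status=True only running workers may act as thieves.
    Returns True if a task was moved.
    """
    if respect_status:
        thieves = [w for w in workers if w.status is Status.running]
    else:
        thieves = list(workers)
    if not thieves:
        return False
    victim = max(workers, key=lambda w: len(w.tasks))
    thief = min(thieves, key=lambda w: len(w.tasks))
    # Stealing only helps if it reduces the imbalance.
    if len(victim.tasks) - len(thief.tasks) < 2:
        return False
    thief.tasks.append(victim.tasks.pop())
    return True


def loads(workers):
    return tuple(len(w.tasks) for w in workers)


if __name__ == "__main__":
    naive = make_cluster()
    steal_one(naive, respect_status=False)
    print(loads(naive))  # (3, 1, 4): the paused worker stole a task

    fixed = make_cluster()
    steal_one(fixed, respect_status=True)
    print(loads(fixed))  # (4, 0, 4): no eligible thief is underloaded
```

In this sketch the task that lands on w2 never runs, because w2 is paused and nothing moves the task back; that is the same reason await wait(futures) would hang in the real test.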

Adding , config={"distributed.scheduler.work-stealing": False} to the gen_cluster decorator makes the issue disappear.

Labels

flaky test: Intermittent failures on CI.
