Closed
Labels
flaky test: Intermittent failures on CI.
Description
#5431 changed Scheduler.decide_worker to stop it from assigning new tasks to workers with paused or closing_gracefully status.
However, those workers are still stealing tasks - effectively negating the benefits of the PR.
This is reflected by the flakiness of test_avoid_paused_workers; the test frequently hangs in CI on the lines
    while (len(w1.tasks), len(w2.tasks), len(w3.tasks)) != (4, 0, 4):
        await asyncio.sleep(0.01)

Above, w2 is paused. However, the tuple ends up looking like (3, 1, 4) instead. If you add await wait(futures), the test will start hanging deterministically, since a task is always stolen from one of the running workers to the paused one, and there it sits forever since nothing steals it back.
Adding config={"distributed.scheduler.work-stealing": False} to the gen_cluster decorator makes the issue disappear.
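To picture the mismatch described above, here is a minimal toy model of a single steal. It is not the actual scheduler or work-stealing code: ToyWorker, make_workers, steal_once and the only_running flag are all hypothetical names invented for illustration, and the task counts are simply chosen to mirror the tuples in the test. It only sketches the idea that a steal which ignores worker status can move a task onto a paused worker, while one that filters by status the way decide_worker now does would not.

```python
from dataclasses import dataclass, field


# Toy model only: these names are hypothetical and do not mirror
# distributed's real Scheduler or work-stealing classes.
@dataclass
class ToyWorker:
    name: str
    status: str = "running"  # "running", "paused" or "closing_gracefully"
    tasks: list = field(default_factory=list)


def make_workers():
    """Three workers shaped like the test: w1 and w3 busy, w2 paused and idle."""
    return [
        ToyWorker("w1", tasks=["a", "b", "c", "d"]),
        ToyWorker("w2", status="paused"),
        ToyWorker("w3", tasks=["e", "f", "g", "h"]),
    ]


def steal_once(workers, only_running):
    """Move one task from the busiest running worker to the idlest candidate.

    With only_running=False every worker is a candidate thief, including
    paused ones. With only_running=True, non-running workers are skipped.
    Returns True if a task was moved.
    """
    victims = [w for w in workers if w.status == "running" and w.tasks]
    thieves = [w for w in workers if not only_running or w.status == "running"]
    if not victims or not thieves:
        return False
    victim = max(victims, key=lambda w: len(w.tasks))
    thief = min(thieves, key=lambda w: len(w.tasks))
    # Only steal when it actually reduces the imbalance.
    if len(victim.tasks) - len(thief.tasks) < 2:
        return False
    thief.tasks.append(victim.tasks.pop())
    return True


def counts(workers):
    return tuple(len(w.tasks) for w in workers)


unfiltered = make_workers()
steal_once(unfiltered, only_running=False)
print(counts(unfiltered))  # the paused w2 ends up holding a task

filtered = make_workers()
steal_once(filtered, only_running=True)
print(counts(filtered))  # w2 stays empty
```

In this toy, the task that lands on the paused worker stays there, because only running workers are ever considered as victims; that is the same shape as the hang described above, where nothing steals the task back.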