Stop all a job's retries from quickly draining out through one node in under a minute #5458
Open
Conversation
I tried to get Anthropic Claude to un-ignore typing in job.py. It thought for half an hour, and then I spent 4 hours cleaning up after it. But it successfully tricked me into typing job.py, so there's that, I guess.
This should fix #5452.
I had wanted to implement fancy tracking of which Slurm nodes jobs ran on, and logic to exclude them from retries, but that would have required poking a bunch of holes in the job abstractions. I didn't want the Slurm batch system to try to save the job store IDs of the jobs it got handed, since nothing else in the batch system layer cares about that. And I didn't want to build more API onto the AbstractBatchSystem for directing jobs to or away from nodes and tracking the nodes they ran on, just to implement it for Slurm. And I didn't want to deal with figuring out whether there were enough nodes in a Slurm partition to let you exclude the ones the last N failures happened on and still ever get the job to run. Nor did I want to get into a situation where your flaky node(s) had been fixed but your job was sitting waiting a long time for space on a node that was very busy.
So this adds simple exponential retry backoff, tuned so that if Cactus is doing 6 retries it ought to be waiting about 20 minutes by the end.
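For illustration, here is a minimal sketch of that kind of backoff schedule, assuming the delay is retryBackoffSeconds multiplied by retryBackoffFactor raised to the number of previous failures; the function name and the exact indexing are assumptions, not Toil's actual code.

```python
def retry_backoff(failures: int,
                  backoff_seconds: float = 2.0,
                  backoff_factor: float = 3.0) -> float:
    """Hypothetical wait (in seconds) before retrying a job that has
    already failed `failures` times; the base delay grows by
    `backoff_factor` with each additional failure."""
    return backoff_seconds * backoff_factor ** failures

if __name__ == "__main__":
    # Schedule with the defaults (2 seconds, factor 3). Whether Toil
    # starts the exponent at 0 or 1, adds jitter, or caps the delay is
    # not shown here; this only illustrates the exponential growth.
    for failures in range(1, 7):
        print(f"after failure {failures}: wait {retry_backoff(failures):.0f} s")
```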
Changelog Entry
To be copied to the draft changelog by merger:
Added --retryBackoffSeconds (default 2) and --retryBackoffFactor (default 3) to control the exponential backoff between job retries.
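As a usage sketch, these settings could be applied from a workflow script through Toil's Python API; the attribute names on the options namespace are assumed to mirror the new flags and are not confirmed here.

```python
from toil.common import Toil
from toil.job import Job

def hello(job):
    job.log("Job body that may be retried on failure")

if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("file:my-jobstore")
    options.retryCount = 6
    # Assumed option-namespace spellings for the new flags.
    options.retryBackoffSeconds = 2
    options.retryBackoffFactor = 3
    with Toil(options) as toil:
        toil.start(Job.wrapJobFn(hello))
```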