Stop all a job's retries from quickly draining out through one node in under a minute #5458
Open
Conversation
I tried to get Anthropic Claude to un-ignore typing in job.py. It thought for half an hour, and then I spent 4 hours cleaning up after it. But it successfully tricked me into typing job.py, so there's that, I guess.
This should fix #5452.
I had wanted to implement fancy tracking of which Slurm nodes jobs ran on, and logic to exclude them from retries, but that would have required poking a bunch of holes in the job abstractions. I didn't want the Slurm batch system to try to save the job store IDs of the jobs it got handed, since nothing else in the batch system layer cares about that. And I didn't want to build more API onto the AbstractBatchSystem for directing jobs to or away from nodes and tracking the nodes they ran on, just to implement it for Slurm. And I didn't want to deal with figuring out whether there were enough nodes in a Slurm partition to let you exclude the ones the last N failures happened on and still ever get the job to run. Nor did I want to get into a situation where your flaky node(s) had been fixed but your job was sitting waiting a long time for space on a node that was very busy.
So this adds simple exponential retry backoff, tuned so that if Cactus is doing 6 retries it ought to be waiting about 20 minutes by the end.
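For illustration, here is a minimal sketch of that kind of backoff schedule, assuming the delay is retryBackoffSeconds multiplied by retryBackoffFactor raised to the number of previous failures; the function name and the exact indexing are assumptions, not Toil's actual code.

```python
def retry_backoff(failures: int,
                  backoff_seconds: float = 2.0,
                  backoff_factor: float = 3.0) -> float:
    """Hypothetical wait (in seconds) before retrying a job that has
    already failed `failures` times; the base delay grows by
    `backoff_factor` with each additional failure."""
    return backoff_seconds * backoff_factor ** failures

if __name__ == "__main__":
    # Schedule with the defaults (2 seconds, factor 3). Whether Toil
    # starts the exponent at 0 or 1, adds jitter, or caps the delay is
    # not shown here; this only illustrates the exponential growth.
    for failures in range(1, 7):
        print(f"after failure {failures}: wait {retry_backoff(failures):.0f} s")
```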
Changelog Entry
To be copied to the draft changelog by merger:
Added --retryBackoffSeconds (default 2) and --retryBackoffFactor (default 3) to control the exponential backoff between job retries.
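As a usage sketch, these settings could be applied from a workflow script through Toil's Python API; the attribute names on the options namespace are assumed to mirror the new flags and are not confirmed here.

```python
from toil.common import Toil
from toil.job import Job

def hello(job):
    job.log("Job body that may be retried on failure")

if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("file:my-jobstore")
    options.retryCount = 6
    # Assumed option-namespace spellings for the new flags.
    options.retryBackoffSeconds = 2
    options.retryBackoffFactor = 3
    with Toil(options) as toil:
        toil.start(Job.wrapJobFn(hello))
```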