Stop all a job's retries from quickly draining out through one node in under a minute #5458

Open

adamnovak wants to merge 5 commits into master from issues/5452-plug-retry-drain

Conversation

@adamnovak
Member

This should fix #5452.

I had wanted to implement fancy tracking of what Slurm nodes jobs ran on, and logic to exclude them from retries, but that would have required poking a bunch of holes in the job abstractions. I didn't want the Slurm batch system to try to save the job store IDs of the jobs it got handed, since nothing else in the batch system layer cares about that. And I didn't want to build more API onto the AbstractBatchSystem for directing jobs to/away from nodes and tracking the nodes they ran on, just to implement it for Slurm. And I didn't want to deal with figuring out whether there were enough nodes in a Slurm partition to let you exclude the ones the last N failures happened on and still ever get the job to run. Nor did I want to get into a situation where your flaky node(s) had been fixed but your job was left waiting a long time for space on a node that was very busy.

So this adds a simple exponential retry backoff, tuned so that if Cactus is doing 6 retries it ought to be waiting about 20 minutes by the end.
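For scale, here is a minimal sketch of what an exponential backoff schedule driven by those two knobs looks like. The exact formula this PR uses is not quoted here, so the delay = seconds * factor ** attempt form, the snake_case parameter names, and the lack of any cap or jitter below are assumptions for illustration only:

    # Hypothetical sketch of an exponential retry backoff like the one described.
    # Assumes delay = retryBackoffSeconds * retryBackoffFactor ** attempt,
    # which may not match the PR's actual formula.

    def backoff_delay(attempt: int, backoff_seconds: float = 2.0, backoff_factor: float = 3.0) -> float:
        """How long to wait, in seconds, before retry number `attempt` (0-based)."""
        return backoff_seconds * backoff_factor ** attempt

    if __name__ == "__main__":
        # With the defaults of 2 and 3, successive retries wait
        # 2, 6, 18, 54, 162, 486, ... seconds, so a job that keeps landing
        # on one bad node backs off for minutes instead of burning through
        # every retry in under a minute.
        for attempt in range(7):
            print(f"retry {attempt + 1}: wait {backoff_delay(attempt):.0f} s")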

Changelog Entry

To be copied to the draft changelog by merger:

  • Toil's job retry logic now has an exponential backoff delay, controlled by --retryBackoffSeconds (default 2) and --retryBackoffFactor (default 3).

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst.
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
  • Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

I tried to get Anthropic Claude to un-ignore typing in job.py.

It thought for half an hour, and then I spent 4 hours cleaning up after it. But it successfully tricked me into typing job.py, so there's that, I guess.

Development

Successfully merging this pull request may close these issues.

One bad Slurm node can consume all your retries in under a minute and fail your workflow, requiring a manual restart
