Slurm batch system can submit jobs multiple times if sbatch manages to get the job in but fails #5459

@adamnovak

The Slurm batch system has logic around retrying sbatch submission commands when they fail.

However, @glennhickey had a job submission fail with exit status 1 and the message `sbatch: error: Batch job submission failed: Socket timed out on send/recv operation`, even though the Slurm leader actually received and ran the job. (We did not save standard output for that job submission, so we don't know whether we also received the ID assigned to the submitted job alongside the error message on standard error.)

When Toil retried the submission, a duplicate instance of the job ran at the same time as the first. As far as I can tell, this left bad state in the job store, as the two jobs fought over whose writes to promise files ought to win. We don't have anything in the job store logic to make promise-fulfillment writes atomic with updates to job state; we rely on knowing that a job failed and re-running it to successful completion to get a coherent view of all the promise files it needed to update.

We need to figure out how to get exactly one copy of a job submitted to Slurm, and to know its Slurm ID, even if we get disconnected from the Slurm leader in the middle of the submission process. It's not clear if sbatch guarantees that the job ID will be on standard output if and only if the Slurm leader accepted the job. We might need code to go sniff the jobs in queue or running to make sure none of them look suspiciously like a job we failed to submit. Or we might need to do something to the sbatch command to make sure it is rejected if the job already exists.
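One way the "sniff the queue" idea could look is sketched below. This is not Toil's actual code, just a minimal illustration under some assumptions: each submission gets a unique job name, `squeue --name` is used to look for that name before any retry, and the helper names (`run_command`, `find_job_ids_by_name`, `submit_once`) and the `toil-` name prefix are hypothetical. It also assumes the leader is reachable again by the time `squeue` runs and that the job has not already left the queue; a job that finished quickly would need a different lookup.

```python
import subprocess
import uuid
from typing import Callable, List, Optional

# A runner takes an argument list and returns a CompletedProcess.
# It is injectable so the logic can be exercised without a real Slurm cluster.
Runner = Callable[[List[str]], subprocess.CompletedProcess]


def run_command(args: List[str]) -> subprocess.CompletedProcess:
    """Run a command, capturing stdout and stderr as text."""
    return subprocess.run(args, capture_output=True, text=True)


def find_job_ids_by_name(name: str, run: Runner = run_command) -> List[str]:
    """Return the IDs of queued or running jobs with the given job name."""
    result = run(["squeue", "--noheader", "--format=%i", f"--name={name}"])
    if result.returncode != 0:
        # If we cannot query the queue, refuse to guess: resubmitting
        # blindly is exactly what produces duplicate jobs.
        raise RuntimeError(f"squeue failed: {result.stderr.strip()}")
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def submit_once(script: str, max_attempts: int = 3,
                run: Runner = run_command, name: Optional[str] = None) -> str:
    """Submit a script, retrying only if the job is verifiably absent."""
    # Hypothetical naming scheme: a unique name per logical submission.
    name = name or f"toil-{uuid.uuid4().hex}"
    for _ in range(max_attempts):
        result = run(["sbatch", "--parsable", f"--job-name={name}", script])
        if result.returncode == 0 and result.stdout.strip():
            # --parsable prints the job ID, optionally followed by ";cluster".
            return result.stdout.strip().split(";")[0]
        # sbatch reported failure, but the leader may have accepted the job.
        existing = find_job_ids_by_name(name, run)
        if existing:
            return existing[0]
    raise RuntimeError(f"could not submit {script} after {max_attempts} attempts")
```

With this shape, a timed-out `sbatch` whose job was in fact accepted resolves to the existing job's ID rather than a second submission. A complementary option, also untested here, might be `sbatch --dependency=singleton`, which holds a job until earlier jobs with the same name and user have ended; it would not reject the duplicate, only keep it from running concurrently.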

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1812
