
Feature/add job recovery #114

Open

Akopti8 wants to merge 27 commits into main from feature/add-job-recovery

Conversation

Akopti8 (Collaborator) commented Jan 15, 2026

This adds startup recovery so the API can rebuild in-memory job state after a crash/restart.

  • Adds host_job_id to job records + DB schema updates (Postgres + SQLite) and updates job creation to persist host + host_job_id.

  • New RecoverAllJobs() runs on boot:

    • docker: checks container state and recovers running/exited containers
    • aws-batch: queries AWS for current status; finalizes terminal jobs, keeps running jobs in ActiveJobs (no watcher loop)
    • subprocess: not recoverable; marks job DISMISSED and writes a server log line explaining it was dismissed due to restart
  • Adds LOST handling (status guards + UI icons/templates) and updates logs/results fetching to be status-aware.

  • Adds a small Docker controller helper (ContainerInfo) to detect “missing container” cleanly.

Goal: avoid “orphaned” non-terminal jobs after restarts, and make recovery behavior explicit + visible in logs/UI.
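At a high level, the boot-time dispatch looks roughly like this (simplified sketch; the helper names recoverDockerJob and recoverAWSBatchJob are illustrative rather than the exact functions in the diff):

```go
// Simplified sketch of RecoverAllJobs: rebuild state for every job that was
// not in a terminal status when the API went down.
func RecoverAllJobs(db Database) error {
	records, err := db.GetNonTerminalJobs()
	if err != nil {
		return fmt.Errorf("recovery: could not list non-terminal jobs: %w", err)
	}
	for _, r := range records {
		switch r.Host {
		case "docker":
			recoverDockerJob(db, r) // inspect the container via host_job_id, reattach or finalize
		case "aws-batch":
			recoverAWSBatchJob(db, r) // query AWS for current status, finalize terminal jobs
		case "subprocess":
			// subprocess jobs cannot be reattached after a restart
			log.Warnf("Recovery(subprocess): dismissing job=%s after restart", r.JobID)
			_ = db.updateJobRecord(r.JobID, DISMISSED, time.Now())
		}
	}
	return nil
}
```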

Note: I was unable to test whether an aws-batch job still gets status updates from the Lambda function post-recovery.

Akopti8 added 13 commits January 5, 2026 15:08
- Update Database interface to include hostJobID in addJob and add updateJobHost method.
- Extend PostgresDB and SQLiteDB implementations to support hostJobID.
- Introduce GetNonTerminalJobs method to retrieve jobs not in terminal states.
- Add NewRecoveredDockerJob function to initialize Docker jobs from records.
- Implement recovery logic for running and exited Docker containers.
- Integrate job recovery process into main application flow.
@Akopti8 Akopti8 marked this pull request as ready for review January 16, 2026 18:46
@Akopti8 Akopti8 requested a review from slawler January 16, 2026 18:46
@slawler slawler requested a review from ar-siddiqui January 16, 2026 19:37
ar-siddiqui (Collaborator) commented:

@Akopti8 for Docker jobs that were ACCEPTED before the crash: do they get requeued in the system at startup?
If yes, do these jobs get added to the pending jobs queue?
If no, do we update their status as DISMISSED?

Akopti8 (Collaborator, Author) commented Jan 20, 2026

> @Akopti8 for Docker jobs that were ACCEPTED before the crash: do they get requeued in the system at startup? If yes, do these jobs get added to the pending jobs queue? If no, do we update their status as DISMISSED?

Good point. Right now Docker jobs that are ACCEPTED at startup do not get requeued / added to a pending queue. On recovery we try to find the docker container via host_job_id; if we can’t find it (or it’s missing), we mark the job as LOST.
I tested by flipping a previously successful job back to ACCEPTED and restarting; it was updated to LOST.

ar-siddiqui (Collaborator) commented Jan 20, 2026

Okay. So, is it fair to assume that the design decision here is that only jobs that were running at the time of the crash will be truly recovered, and the jobs that were queued (ACCEPTED status) will be considered lost?

Can we not requeue them? As it will be a major limitation of the system that ACCEPTED jobs are lost at crash.

Even if we stick with this design decision, the status should be updated to DISMISSED, not LOST, as for these jobs we know with certainty that they weren't started at the time of the crash. The way I see it, LOST basically means we have no idea what happened to the job. In the case I am describing, we do know what happened to these jobs.

Akopti8 (Collaborator, Author) commented Jan 21, 2026

> Okay. So, is it fair to assume that the design decision here is that only jobs that were running at the time of the crash will be truly recovered, and the jobs that were queued (ACCEPTED status) will be considered lost?
>
> Can we not requeue them? As it will be a major limitation of the system that ACCEPTED jobs are lost at crash.
>
> Even if we stick with this design decision, the status should be updated to DISMISSED, not LOST, as for these jobs we know with certainty that they weren't started at the time of the crash. The way I see it, LOST basically means we have no idea what happened to the job. In the case I am describing, we do know what happened to these jobs.

From what I can tell right now, we don’t persist the original execution request for Docker jobs. In Execution() the inputs are validated, then embedded into the in-memory cmd and a DB row is created with status, process, and submitter, but I don’t see the inputs or the resolved command being stored anywhere.

Because of that, if the API crashes while a Docker job is still in ACCEPTED and it never received a host_job_id (meaning no container was created), I’m not confident we can safely requeue it on restart since we can’t reliably reconstruct the request-specific inputs.

If we want requeue-on-restart behavior for these ACCEPTED jobs, we would likely need to persist some form of the execution request at submit time, for example the inputs and process version or even just the final resolved command. Otherwise, updating these jobs to DISMISSED feels more accurate than LOST, since we know they never started. The exception would be jobs that were already accepted with a host_job_id present, in which case they could reasonably be considered lost.
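For that future work, a rough sketch of what "persist enough to requeue" could look like (nothing in this PR does this; every name below is hypothetical):

```go
// Hypothetical record that could be written at submit time so that an ACCEPTED
// job could be requeued after a restart. Field and method names are illustrative.
type persistedExecutionRequest struct {
	ProcessID      string          `json:"process_id"`
	ProcessVersion string          `json:"process_version"`
	Inputs         json.RawMessage `json:"inputs"` // validated inputs exactly as submitted
}

// In Execution(), after validation and before the in-memory cmd is built,
// something like (hypothetical DB method):
//   payload, _ := json.Marshal(persistedExecutionRequest{ProcessID: pid, ProcessVersion: ver, Inputs: rawInputs})
//   db.addJobExecutionRequest(jobID, payload)
```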

ar-siddiqui (Collaborator) commented:

Correct, we don't persist those things in DB as of now. I agree with your thought process here. I think in Design.md you should write brief notes about Recovery, stating your design decisions and future work. Besides other things, you can also mention that since we are not persisting enough job information to do a recovery, we would mark accepted jobs as dismissed and that future work can take place to allow the system to do a complete recovery, but it will be a more involved effort. Also, whatever we do in the future, we need to think about both recovery and shutdown. They are linked because a graceful shutdown deliberately dismisses jobs, which is something that might not be desirable if we can fully recover and requeue jobs.

> The exception would be jobs that were already accepted with a host_job_id present, in which case they could reasonably be considered lost.

Can you explain the above line? How can this happen? How can we have a job that has ACCEPTED status but got a host_job_id, since the status is changed first at line 319 and host_job_id is updated at line 322?

ar-siddiqui (Collaborator) commented:

@slawler currently, if the app gets shutdown signals, it starts teardown steps, which include dismissing all jobs across hosts and for all statuses, including AWS jobs that are pending or running. Is this okay with you? Are you okay with the fact that in case of a graceful shutdown, the app would explicitly dismiss all jobs, including all AWS jobs, or should we change this behaviour?

This behaviour was always there; it hasn't been introduced by this PR.

Akopti8 (Collaborator, Author) commented Jan 22, 2026

> Correct, we don't persist those things in DB as of now. I agree with your thought process here. I think in Design.md you should write brief notes about Recovery, stating your design decisions and future work. Besides other things, you can also mention that since we are not persisting enough job information to do a recovery, we would mark accepted jobs as dismissed and that future work can take place to allow the system to do a complete recovery, but it will be a more involved effort. Also, whatever we do in the future, we need to think about both recovery and shutdown. They are linked because a graceful shutdown deliberately dismisses jobs, which is something that might not be desirable if we can fully recover and requeue jobs.

> The exception would be jobs that were already accepted with a host_job_id present, in which case they could reasonably be considered lost.

> Can you explain the above line? How can this happen? How can we have a job that has ACCEPTED status but got a host_job_id, since the status is changed first at line 319 and host_job_id is updated at line 322?

I just pushed a change for accepted Docker jobs. I don't see a Design.md file; did you mean for me to create one, or add the notes somewhere in DEV_GUIDE.md?

ar-siddiqui (Collaborator) commented:

I meant DEV_GUIDE.md.

Also, I don't understand the last change; an ACCEPTED job would never have the host_job_id, per my last comment?
So why are we doing this check?

I will arrange a meeting next week to do a PR Review.

Akopti8 (Collaborator, Author) commented Jan 23, 2026

> I meant DEV_GUIDE.md.

> Also, I don't understand the last change; an ACCEPTED job would never have the host_job_id, per my last comment? So why are we doing this check?

> I will arrange a meeting next week to do a PR Review.

You’re right, that one was on me.
An ACCEPTED job will never have a host_job_id, so the check was unnecessary.

I’ve fixed it so that all ACCEPTED jobs are marked DISMISSED during recovery, since they never started execution and can’t be safely requeued.

I also added a Job Recovery section to DEV_GUIDE.md documenting the recovery behavior across all backends and the rationale for DISMISSED vs LOST.

slawler (Member) commented Jan 23, 2026

> @slawler currently, if the app gets shutdown signals, it starts teardown steps, which include dismissing all jobs across hosts and for all statuses, including AWS jobs that are pending or running. Is this okay with you? Are you okay with the fact that in case of a graceful shutdown, the app would explicitly dismiss all jobs, including all AWS jobs, or should we change this behaviour?

> This behaviour was always there; it hasn't been introduced by this PR.

Let's discuss next week, need to think through this one.

ar-siddiqui (Collaborator) left a review:

Good progress for your first time touching this repo; please see my comments.

}

if r.HostJobID == "" {
	if r.Status == ACCEPTED {
ar-siddiqui (Collaborator):

If a job has ACCEPTED status, it hasn't been started. We should separately handle ACCEPTED cases even before checking for HostJobID and update their status to DISMISSED.

if r.Status == ACCEPTED {
	log.Warnf("Recovery(docker): ACCEPTED job missing container ID, marking DISMISSED job=%s", r.JobID)
	_ = db.updateJobRecord(r.JobID, DISMISSED, time.Now())
}
ar-siddiqui (Collaborator):

What happens if host_job_id is missing but status is running?
We should mark it as LOST here.
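Taking this and the previous comment together, the ordering could look roughly like this inside the Docker recovery loop (sketch only, reusing the identifiers from the hunks above):

```go
// Handle ACCEPTED first: the job never started, so dismiss it outright.
if r.Status == ACCEPTED {
	log.Warnf("Recovery(docker): ACCEPTED job never started, marking DISMISSED job=%s", r.JobID)
	_ = db.updateJobRecord(r.JobID, DISMISSED, time.Now())
	continue
}

// A non-ACCEPTED job with no container ID can no longer be located: mark it LOST.
if r.HostJobID == "" {
	log.Warnf("Recovery(docker): job missing container ID, marking LOST job=%s", r.JobID)
	_ = db.updateJobRecord(r.JobID, LOST, time.Now())
	continue
}
```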

job.ctx = ctx
job.ctxCancel = cancel

if err := job.initLogger(); err != nil {
ar-siddiqui (Collaborator):

initLogger() uses os.Create. Calling it here without modifying it will overwrite server/process logs created before the crash.
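If the eventual fix is to append rather than truncate when recovering, a minimal sketch of the idea (logPath and the logFile field are placeholders; this is not how initLogger is currently written):

```go
// Open the existing log file in append mode on recovery so pre-crash lines survive.
f, err := os.OpenFile(logPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
if err != nil {
	return fmt.Errorf("recovery: could not open job log for append: %w", err)
}
j.logFile = f // placeholder field name
```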


func recoverExitedContainer(j *DockerJob, exitCode int) {
	if exitCode == 0 {
		j.NewStatusUpdate(SUCCESSFUL, time.Now())
ar-siddiqui (Collaborator):

We should run the metadata write routine here if the job finished successfully.

}

if err := j.initLogger(); err != nil {
	log.Warnf("Recovery(aws-batch): failed to init logger job=%s: %v", r.JobID, err)
ar-siddiqui (Collaborator):

The job should be marked lost before continuing.
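In other words, something along these lines (sketch only, extending the hunk above):

```go
if err := j.initLogger(); err != nil {
	log.Warnf("Recovery(aws-batch): failed to init logger job=%s: %v", r.JobID, err)
	// Without a working logger we cannot safely track this job; record it as LOST
	// and move on to the next record.
	_ = db.updateJobRecord(r.JobID, LOST, time.Now())
	continue
}
```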

api/jobs/jobs.go Outdated
func FetchResults(svc *s3.S3, jid string, status string) (interface{}, error) {

	logs, err := FetchLogs(svc, jid, true)
	// LOST jobs never have results
ar-siddiqui (Collaborator):

This is not required since we only ever ask for results if the job is successful.

api/jobs/jobs.go Outdated
return nil, fmt.Errorf("no results available")
}

// Only successful jobs can have results
ar-siddiqui (Collaborator):

This check is okay; it is just double-checking, but the above check should be removed. Why treat LOST specially? DISMISSED, RUNNING, etc. also don't have a result.
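That is, a single guard should suffice (sketch, based on the signature shown in the hunk above and assuming SUCCESSFUL is comparable to the status parameter):

```go
// Only SUCCESSFUL jobs can have results; every other status gets the same error.
if status != SUCCESSFUL {
	return nil, fmt.Errorf("no results available")
}
```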

api/jobs/jobs.go Outdated
continue
}

// NEW: LOST jobs don't have container logs
ar-siddiqui (Collaborator):

I don't think this is needed.

return
}
j.PID = fmt.Sprintf("%d", j.execCmd.Process.Pid)
j.DB.updateJobHost(j.UUID, "subprocess", j.PID)
ar-siddiqui (Collaborator):

Why is this needed?

ar-siddiqui (Collaborator) commented Feb 3, 2026:

Why are HTML files whitespace-modified?

ar-siddiqui (Collaborator) commented:

@Akopti8 I reviewed the PR. You can see my review comments above.

I also then went ahead and addressed those comments. I made a mistake and pushed my commits to your branch; I should have pushed to a new branch so that you could easily see the diff between your last commit and my changes. Luckily, we still have an easy way to clearly see the diffs.

Here are your changes and my review comments: https://github.com/Dewberry/sepex/pull/114/changes/BASE..2a9e3fdb5011bb65635e055db46096cb47af6462

Here are my changes to address those comments, plus some improvements: https://github.com/Dewberry/sepex/pull/114/changes/2a9e3fdb5011bb65635e055db46096cb47af6462..12f2ca4d4abb127bf13ac8d684ad459e683c15b9

Let me know what you think of my changes. I'm happy to modify anything if needed. I put my thought process in DEV_GUIDE.md and some code comments.

I didn't do any testing, so that is something that needs to be carried out.

Akopti8 (Collaborator, Author) commented Feb 10, 2026

> @Akopti8 I reviewed the PR. You can see my review comments above.
>
> I also then went ahead and addressed those comments. I made a mistake and pushed my commits to your branch; I should have pushed to a new branch so that you could easily see the diff between your last commit and my changes. Luckily, we still have an easy way to clearly see the diffs.
>
> Here are your changes and my review comments: https://github.com/Dewberry/sepex/pull/114/changes/BASE..2a9e3fdb5011bb65635e055db46096cb47af6462
>
> Here are my changes to address those comments, plus some improvements: https://github.com/Dewberry/sepex/pull/114/changes/2a9e3fdb5011bb65635e055db46096cb47af6462..12f2ca4d4abb127bf13ac8d684ad459e683c15b9
>
> Let me know what you think of my changes. I'm happy to modify anything if needed. I put my thought process in DEV_GUIDE.md and some code comments.
>
> I didn't do any testing, so that is something that needs to be carried out.


Thanks for the review and for the changes you made. It was really helpful seeing how you addressed things in the code!

I tested recovery manually in a few different ways:

Subprocess jobs

I started a subprocess job, then manually crashed the API and restarted it right away. After restart, the job was marked LOST, which makes sense since the process was gone.

I also tested the case where a job never actually started. I started a job, crashed the API immediately, then manually removed the host_job_id and set the status to ACCEPTED in the DB to simulate a crash before the job launched (I don’t think I can do that fast enough in real time). After restarting the API, that job was marked DISMISSED.

Docker jobs

Same as subprocess for the “never started” case (ACCEPTED + no host id → DISMISSED).

I also tested a job that was already running in Docker. I crashed the API while the container was still running, restarted the API, and verified that the job was reattached in memory and continued tracking correctly.

AWS Batch jobs

The main thing I could test here is that after restart the API reattaches the job and continues handling status updates correctly.

I don’t have access to the Lambda that sends status callbacks, so I couldn’t fully simulate the real status update flow.

I also can’t realistically simulate a true “lost” AWS job case.


@slawler, as for incorporating this into the Newman tests, I think we can do it by adding a few steps after the Newman tests that already exist. We could submit a long async job, docker kill the API, and then start it up again; on restart, another Newman collection could check that the recovered jobs' statuses are assigned correctly (LOST or DISMISSED).
@ar-siddiqui, I am curious to hear what you think about the above idea.



Development

Successfully merging this pull request may close these issues.

At startup, resolve status of jobs that were previously in non-terminated status

3 participants