
Feature/add job recovery #114

Open

Akopti8 wants to merge 27 commits into main from feature/add-job-recovery

Conversation

Akopti8 (Collaborator) commented Jan 15, 2026

This adds startup recovery so the API can rebuild in-memory job state after a crash/restart.

  • Adds host_job_id to job records + DB schema updates (Postgres + SQLite) and updates job creation to persist host + host_job_id.

  • New RecoverAllJobs() runs on boot:

    • docker: checks container state and recovers running/exited containers
    • aws-batch: queries AWS for current status; finalizes terminal jobs, keeps running jobs in ActiveJobs (no watcher loop)
    • subprocess: not recoverable; marks job DISMISSED and writes a server log line explaining it was dismissed due to restart
  • Adds LOST handling (status guards + UI icons/templates) and updates logs/results fetching to be status-aware.

  • Adds a small Docker controller helper (ContainerInfo) to detect “missing container” cleanly.

Goal: avoid “orphaned” non-terminal jobs after restarts, and make recovery behavior explicit + visible in logs/UI.
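At a high level, the boot-time dispatch looks roughly like this (simplified sketch; the helper names recoverDockerJob and recoverAWSBatchJob are illustrative rather than the exact functions in the diff):

```go
// Simplified sketch of RecoverAllJobs: rebuild state for every job that was
// not in a terminal status when the API went down.
func RecoverAllJobs(db Database) error {
	records, err := db.GetNonTerminalJobs()
	if err != nil {
		return fmt.Errorf("recovery: could not list non-terminal jobs: %w", err)
	}
	for _, r := range records {
		switch r.Host {
		case "docker":
			recoverDockerJob(db, r) // inspect the container via host_job_id, reattach or finalize
		case "aws-batch":
			recoverAWSBatchJob(db, r) // query AWS for current status, finalize terminal jobs
		case "subprocess":
			// subprocess jobs cannot be reattached after a restart
			log.Warnf("Recovery(subprocess): dismissing job=%s after restart", r.JobID)
			_ = db.updateJobRecord(r.JobID, DISMISSED, time.Now())
		}
	}
	return nil
}
```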

Note: I was unable to test whether an aws-batch job still gets status updates from the Lambda function post-recovery.

Akopti8 added 13 commits January 5, 2026 15:08
- Update Database interface to include hostJobID in addJob and add updateJobHost method.
- Extend PostgresDB and SQLiteDB implementations to support hostJobID.
- Introduce GetNonTerminalJobs method to retrieve jobs not in terminal states.
- Add NewRecoveredDockerJob function to initialize Docker jobs from records.
- Implement recovery logic for running and exited Docker containers.
- Integrate job recovery process into main application flow.
@Akopti8 Akopti8 marked this pull request as ready for review January 16, 2026 18:46
@Akopti8 Akopti8 requested a review from slawler January 16, 2026 18:46
@slawler slawler requested a review from ar-siddiqui January 16, 2026 19:37
ar-siddiqui (Collaborator) commented:

@Akopti8 for Docker jobs that were ACCEPTED before the crash: do they get requeued in the system at startup?
If yes, do these jobs get added to the pending jobs queue?
If no, do we update their status as DISMISSED?

Akopti8 (Collaborator, Author) commented Jan 20, 2026

> @Akopti8 for Docker jobs that were ACCEPTED before the crash: do they get requeued in the system at startup? If yes, do these jobs get added to the pending jobs queue? If no, do we update their status as DISMISSED?

Good point. Right now Docker jobs that are ACCEPTED at startup do not get requeued / added to a pending queue. On recovery we try to find the docker container via host_job_id; if we can’t find it (or it’s missing), we mark the job as LOST.
I tested by flipping a previously successful job back to ACCEPTED and restarting; it was updated to LOST.

ar-siddiqui (Collaborator) commented Jan 20, 2026

Okay. So, is it fair to assume that the design decision here is that only jobs that were running at the time of the crash will be truly recovered, and the jobs that were queued (ACCEPTED status) will be considered lost?

Can we not requeue them? As it will be a major limitation of the system that ACCEPTED jobs are lost at crash.

Even if we stick with this design decision, the status should be updated to DISMISSED, not LOST, as for these jobs we know with certainty that they weren't started at the time of the crash. The way I see it, LOST basically means we have no idea what happened to the job. In the case I am describing, we do know what happened to these jobs.

Akopti8 (Collaborator, Author) commented Jan 21, 2026

> Okay. So, is it fair to assume that the design decision here is that only jobs that were running at the time of the crash will be truly recovered, and the jobs that were queued (ACCEPTED status) will be considered lost?
>
> Can we not requeue them? As it will be a major limitation of the system that ACCEPTED jobs are lost at crash.
>
> Even if we stick with this design decision, the status should be updated to DISMISSED, not LOST, as for these jobs we know with certainty that they weren't started at the time of the crash. The way I see it, LOST basically means we have no idea what happened to the job. In the case I am describing, we do know what happened to these jobs.

From what I can tell right now, we don’t persist the original execution request for Docker jobs. In Execution() the inputs are validated, then embedded into the in-memory cmd and a DB row is created with status, process, and submitter, but I don’t see the inputs or the resolved command being stored anywhere.

Because of that, if the API crashes while a Docker job is still in ACCEPTED and it never received a host_job_id (meaning no container was created), I’m not confident we can safely requeue it on restart since we can’t reliably reconstruct the request-specific inputs.

If we want requeue-on-restart behavior for these ACCEPTED jobs, we would likely need to persist some form of the execution request at submit time, for example the inputs and process version or even just the final resolved command. Otherwise, updating these jobs to DISMISSED feels more accurate than LOST, since we know they never started. The exception would be jobs that were already accepted with a host_job_id present, in which case they could reasonably be considered lost.
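For that future work, a rough sketch of what "persist enough to requeue" could look like (nothing in this PR does this; every name below is hypothetical):

```go
// Hypothetical record that could be written at submit time so that an ACCEPTED
// job could be requeued after a restart. Field and method names are illustrative.
type persistedExecutionRequest struct {
	ProcessID      string          `json:"process_id"`
	ProcessVersion string          `json:"process_version"`
	Inputs         json.RawMessage `json:"inputs"` // validated inputs exactly as submitted
}

// In Execution(), after validation and before the in-memory cmd is built,
// something like (hypothetical DB method):
//   payload, _ := json.Marshal(persistedExecutionRequest{ProcessID: pid, ProcessVersion: ver, Inputs: rawInputs})
//   db.addJobExecutionRequest(jobID, payload)
```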

ar-siddiqui (Collaborator) commented:

Correct, we don't persist those things in DB as of now. I agree with your thought process here. I think in Design.md you should write brief notes about Recovery, stating your design decisions and future work. Besides other things, you can also mention that since we are not persisting enough job information to do a recovery, we would mark accepted jobs as dismissed and that future work can take place to allow the system to do a complete recovery, but it will be a more involved effort. Also, whatever we do in the future, we need to think about both recovery and shutdown. They are linked because a graceful shutdown deliberately dismisses jobs, which is something that might not be desirable if we can fully recover and requeue jobs.

> The exception would be jobs that were already accepted with a host_job_id present, in which case they could reasonably be considered lost.

Can you explain the above line? How can this happen? How can we have a job that has ACCEPTED status but got a host_job_id, since the status is changed first at line 319 and host_job_id is updated at line 322?

ar-siddiqui (Collaborator) commented:

@slawler currently, if the app gets shutdown signals, it starts teardown steps, which include dismissing all jobs across hosts and for all statuses, including AWS jobs that are pending or running. Is this okay with you? Are you okay with the fact that in case of a graceful shutdown, the app would explicitly dismiss all jobs, including all AWS jobs, or should we change this behaviour?

This behaviour was always there; it hasn't been introduced by this PR.

Akopti8 (Collaborator, Author) commented Jan 22, 2026

> Correct, we don't persist those things in DB as of now. I agree with your thought process here. I think in Design.md you should write brief notes about Recovery, stating your design decisions and future work. Besides other things, you can also mention that since we are not persisting enough job information to do a recovery, we would mark accepted jobs as dismissed and that future work can take place to allow the system to do a complete recovery, but it will be a more involved effort. Also, whatever we do in the future, we need to think about both recovery and shutdown. They are linked because a graceful shutdown deliberately dismisses jobs, which is something that might not be desirable if we can fully recover and requeue jobs.

> The exception would be jobs that were already accepted with a host_job_id present, in which case they could reasonably be considered lost.

> Can you explain the above line? How can this happen? How can we have a job that has ACCEPTED status but got a host_job_id, since the status is changed first at line 319 and host_job_id is updated at line 322?

I just pushed a change for accepted Docker jobs. I don't see a Design.md file; did you mean for me to create one, or add the notes somewhere in DEV_GUIDE.md?

ar-siddiqui (Collaborator) commented:

I meant DEV_GUIDE.md.

Also, I don't understand the last change; an ACCEPTED job would never have the host_job_id, per my last comment?
So why are we doing this check?

I will arrange a meeting next week to do a PR Review.

Akopti8 (Collaborator, Author) commented Jan 23, 2026

> I meant DEV_GUIDE.md.

> Also, I don't understand the last change; an ACCEPTED job would never have the host_job_id, per my last comment? So why are we doing this check?

> I will arrange a meeting next week to do a PR Review.

You’re right, that one was on me.
An ACCEPTED job will never have a host_job_id, so the check was unnecessary.

I’ve fixed it so that all ACCEPTED jobs are marked DISMISSED during recovery, since they never started execution and can’t be safely requeued.

I also added a Job Recovery section to DEV_GUIDE.md documenting the recovery behavior across all backends and the rationale for DISMISSED vs LOST.

slawler (Member) commented Jan 23, 2026

> @slawler currently, if the app gets shutdown signals, it starts teardown steps, which include dismissing all jobs across hosts and for all statuses, including AWS jobs that are pending or running. Is this okay with you? Are you okay with the fact that in case of a graceful shutdown, the app would explicitly dismiss all jobs, including all AWS jobs, or should we change this behaviour?

> This behaviour was always there; it hasn't been introduced by this PR.

Let's discuss next week, need to think through this one.

ar-siddiqui (Collaborator) left a review:

Good progress for your first time touching this repo; please see my comments.

}

if r.HostJobID == "" {
	if r.Status == ACCEPTED {
ar-siddiqui (Collaborator):

If a job has ACCEPTED status, it hasn't been started. We should separately handle ACCEPTED cases even before checking for HostJobID and update their status to DISMISSED.

if r.Status == ACCEPTED {
	log.Warnf("Recovery(docker): ACCEPTED job missing container ID, marking DISMISSED job=%s", r.JobID)
	_ = db.updateJobRecord(r.JobID, DISMISSED, time.Now())
}
ar-siddiqui (Collaborator):

What happens if host_job_id is missing but status is running?
We should mark it as LOST here.
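Taking this and the previous comment together, the ordering could look roughly like this inside the Docker recovery loop (sketch only, reusing the identifiers from the hunks above):

```go
// Handle ACCEPTED first: the job never started, so dismiss it outright.
if r.Status == ACCEPTED {
	log.Warnf("Recovery(docker): ACCEPTED job never started, marking DISMISSED job=%s", r.JobID)
	_ = db.updateJobRecord(r.JobID, DISMISSED, time.Now())
	continue
}

// A non-ACCEPTED job with no container ID can no longer be located: mark it LOST.
if r.HostJobID == "" {
	log.Warnf("Recovery(docker): job missing container ID, marking LOST job=%s", r.JobID)
	_ = db.updateJobRecord(r.JobID, LOST, time.Now())
	continue
}
```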

job.ctx = ctx
job.ctxCancel = cancel

if err := job.initLogger(); err != nil {
ar-siddiqui (Collaborator):

initLogger() uses os.Create. Calling it here without modifying it will overwrite server/process logs created before the crash.
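If the eventual fix is to append rather than truncate when recovering, a minimal sketch of the idea (logPath and the logFile field are placeholders; this is not how initLogger is currently written):

```go
// Open the existing log file in append mode on recovery so pre-crash lines survive.
f, err := os.OpenFile(logPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
if err != nil {
	return fmt.Errorf("recovery: could not open job log for append: %w", err)
}
j.logFile = f // placeholder field name
```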


func recoverExitedContainer(j *DockerJob, exitCode int) {
	if exitCode == 0 {
		j.NewStatusUpdate(SUCCESSFUL, time.Now())
ar-siddiqui (Collaborator):

We should run the metadata write routine here if the job finished successfully.

}

if err := j.initLogger(); err != nil {
	log.Warnf("Recovery(aws-batch): failed to init logger job=%s: %v", r.JobID, err)
ar-siddiqui (Collaborator):

The job should be marked lost before continuing.
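In other words, something along these lines (sketch only, extending the hunk above):

```go
if err := j.initLogger(); err != nil {
	log.Warnf("Recovery(aws-batch): failed to init logger job=%s: %v", r.JobID, err)
	// Without a working logger we cannot safely track this job; record it as LOST
	// and move on to the next record.
	_ = db.updateJobRecord(r.JobID, LOST, time.Now())
	continue
}
```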

api/jobs/jobs.go Outdated
func FetchResults(svc *s3.S3, jid string, status string) (interface{}, error) {

	logs, err := FetchLogs(svc, jid, true)
	// LOST jobs never have results
ar-siddiqui (Collaborator):

This is not required since we only ever ask for results if the job is successful.

api/jobs/jobs.go Outdated
return nil, fmt.Errorf("no results available")
}

// Only successful jobs can have results
ar-siddiqui (Collaborator):

This check is okay; it is just double-checking, but the above check should be removed. Why treat LOST specially? DISMISSED, RUNNING, etc. also don't have a result.
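That is, a single guard should suffice (sketch, based on the signature shown in the hunk above and assuming SUCCESSFUL is comparable to the status parameter):

```go
// Only SUCCESSFUL jobs can have results; every other status gets the same error.
if status != SUCCESSFUL {
	return nil, fmt.Errorf("no results available")
}
```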

api/jobs/jobs.go Outdated
continue
}

// NEW: LOST jobs don't have container logs
ar-siddiqui (Collaborator):

I don't think this is needed.

return
}
j.PID = fmt.Sprintf("%d", j.execCmd.Process.Pid)
j.DB.updateJobHost(j.UUID, "subprocess", j.PID)
ar-siddiqui (Collaborator):

Why is this needed?

ar-siddiqui (Collaborator) commented Feb 3, 2026:

Why are HTML files whitespace-modified?

ar-siddiqui (Collaborator) commented:

@Akopti8 I reviewed the PR. You can see my review comments above.

I also then went ahead and addressed those comments. I made a mistake and pushed my commits to your branch; I should have pushed to a new branch so that you could easily see the diff between your last commit and my changes. Luckily, we still have an easy way to clearly see the diffs.

Here are your changes and my review comments: https://github.com/Dewberry/sepex/pull/114/changes/BASE..2a9e3fdb5011bb65635e055db46096cb47af6462

Here are my changes to address those comments, plus some improvements: https://github.com/Dewberry/sepex/pull/114/changes/2a9e3fdb5011bb65635e055db46096cb47af6462..12f2ca4d4abb127bf13ac8d684ad459e683c15b9

Let me know what you think of my changes. I'm happy to modify anything if needed. I put my thought process in DEV_GUIDE.md and some code comments.

I didn't do any testing, so that is something that needs to be carried out.

Akopti8 (Collaborator, Author) commented Feb 10, 2026

> @Akopti8 I reviewed the PR. You can see my review comments above.
>
> I also then went ahead and addressed those comments. I made a mistake and pushed my commits to your branch; I should have pushed to a new branch so that you could easily see the diff between your last commit and my changes. Luckily, we still have an easy way to clearly see the diffs.
>
> Here are your changes and my review comments: https://github.com/Dewberry/sepex/pull/114/changes/BASE..2a9e3fdb5011bb65635e055db46096cb47af6462
>
> Here are my changes to address those comments, plus some improvements: https://github.com/Dewberry/sepex/pull/114/changes/2a9e3fdb5011bb65635e055db46096cb47af6462..12f2ca4d4abb127bf13ac8d684ad459e683c15b9
>
> Let me know what you think of my changes. I'm happy to modify anything if needed. I put my thought process in DEV_GUIDE.md and some code comments.
>
> I didn't do any testing, so that is something that needs to be carried out.


Thanks for the review and for the changes you made. It was really helpful seeing how you addressed things in the code!

I tested recovery manually in a few different ways:

Subprocess jobs

I started a subprocess job, then manually crashed the API and restarted it right away. After restart, the job was marked LOST, which makes sense since the process was gone.

I also tested the case where a job never actually started. I started a job, crashed the API immediately, then manually removed the host_job_id and set the status to ACCEPTED in the DB to simulate a crash before the job launched (I don’t think I can do that fast enough in real time). After restarting the API, that job was marked DISMISSED.

Docker jobs

Same as subprocess for the “never started” case (ACCEPTED + no host id → DISMISSED).

I also tested a job that was already running in Docker. I crashed the API while the container was still running, restarted the API, and verified that the job was reattached in memory and continued tracking correctly.

AWS Batch jobs

The main thing I could test here is that after restart the API reattaches the job and continues handling status updates correctly.

I don’t have access to the Lambda that sends status callbacks, so I couldn’t fully simulate the real status update flow.

I also can’t realistically simulate a true “lost” AWS job case.


@slawler, as for incorporating this into the Newman tests, I think we can do it by adding a few steps after the Newman tests that already exist. We could submit a long async job, docker kill the API, and then start it up again; on restart, another Newman collection could check that the recovered jobs' statuses are assigned correctly (LOST or DISMISSED).
@ar-siddiqui, I am curious to hear what you think about the above idea.



Development

Successfully merging this pull request may close these issues.

At startup, resolve status of jobs that were previously in non-terminated status

3 participants