Conversation
…r, and Subprocess jobs
- Update Database interface to include hostJobID in addJob and add updateJobHost method.
- Extend PostgresDB and SQLiteDB implementations to support hostJobID.
- Introduce GetNonTerminalJobs method to retrieve jobs not in terminal states.
- Add NewRecoveredDockerJob function to initialize Docker jobs from records.
- Implement recovery logic for running and exited Docker containers.
- Integrate job recovery process into main application flow.
(A sketch of the resulting Database interface shape follows the commit list below.)
…subprocess jobs during API restarts
…status in job and status views
…for Docker, AWS Batch, and subprocess jobs
…ve logging for Docker and AWS Batch jobs
…ng for better clarity
…arity and consistency
… for job results and metadata
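For orientation, the "Update Database interface…" commit above implies roughly the following interface shape. This is a minimal sketch: the signatures and the `JobRecord` type are assumptions inferred from calls that appear later in this review (`updateJobHost`, `updateJobRecord`), not the repo's actual code.

```go
package jobs

import "time"

// JobRecord is a hypothetical stand-in for a persisted job row.
type JobRecord struct {
	JobID     string
	Status    string
	Host      string // e.g. "docker", "aws-batch", "subprocess"
	HostJobID string // container ID, Batch job ID, or PID
	Updated   time.Time
}

// Database reflects the additions described in the commit message above:
// addJob now carries a hostJobID, updateJobHost records it once the job has
// actually launched, and GetNonTerminalJobs feeds the recovery pass on boot.
type Database interface {
	addJob(jobID, status, host, hostJobID, processID, submitter string, updated time.Time) error
	updateJobHost(jobID, host, hostJobID string) error
	updateJobRecord(jobID, status string, updated time.Time) error
	GetNonTerminalJobs() ([]JobRecord, error)
}
```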
@Akopti8 for Docker jobs that were …

Good point. Right now Docker jobs that are …
Okay. So, is it fair to assume that the design decision here is that only jobs that were running at the time of the crash will be truly recovered, and the jobs that were queued (ACCEPTED) will not be? Can we not requeue them? It would be a major limitation of the system if queued jobs are simply dropped. Even if we stick with this design decision, the status of those jobs should at least be updated to something other than ACCEPTED.
From what I can tell right now, we don’t persist the original execution request for Docker jobs. In Execution() the inputs are validated, then embedded into the in-memory cmd, and a DB row is created with status, process, and submitter, but I don’t see the inputs or the resolved command being stored anywhere. Because of that, if the API crashes while a Docker job is still in ACCEPTED and it never received a host_job_id (meaning no container was created), I’m not confident we can safely requeue it on restart, since we can’t reliably reconstruct the request-specific inputs.

If we want requeue-on-restart behavior for these ACCEPTED jobs, we would likely need to persist some form of the execution request at submit time, for example the inputs and process version, or even just the final resolved command. Otherwise, updating these jobs to DISMISSED feels more accurate than LOST, since we know they never started. The exception would be jobs that were already accepted with a host_job_id present, in which case they could reasonably be considered lost.
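If requeue-on-restart were ever wanted for these ACCEPTED jobs, persisting the resolved request at submit time could look something like the sketch below. Everything here is hypothetical: the `submittedJob` fields, the `request_json` column, and `persistRequest` are illustrations, not existing schema or code.

```go
package jobs

import (
	"database/sql"
	"encoding/json"
)

// submittedJob is a hypothetical snapshot of everything needed to rebuild a
// Docker execution request after a restart.
type submittedJob struct {
	ProcessID      string         `json:"processID"`
	ProcessVersion string         `json:"processVersion"`
	Inputs         map[string]any `json:"inputs"`
	ResolvedCmd    []string       `json:"resolvedCmd"`
}

// persistRequest stores the snapshot next to the job row. A "request_json"
// TEXT/JSONB column would have to be added to the jobs table first; the $1
// placeholders assume a Postgres driver (SQLite would use ?).
func persistRequest(db *sql.DB, jobID string, req submittedJob) error {
	raw, err := json.Marshal(req)
	if err != nil {
		return err
	}
	_, err = db.Exec(`UPDATE jobs SET request_json = $1 WHERE id = $2`, raw, jobID)
	return err
}
```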
Correct, we don't persist those things in DB as of now. I agree with your thought process here. I think in the dev guide we should document this design decision.

> The exception would be jobs that were already accepted with a host_job_id present, in which case they could reasonably be considered lost.

Can you explain the above line? How can this happen? How can we have a job that has ACCEPTED status but also a host_job_id?
@slawler currently, if the app gets a shutdown signal, it starts teardown steps that include dismissing all jobs across hosts and statuses, including AWS jobs that are pending or running. Are you okay with the app explicitly dismissing all jobs, including all AWS jobs, on a graceful shutdown, or should we change this behaviour? This behaviour was already there; it hasn't been introduced by this PR.
…D jobs and mark them as DISMISSED
I just pushed a change for accepted docker jobs. I don't see any dev guide in the repo; which file do you mean?
I meant DEV_GUIDE.md. Also, I don't understand the last change; an ACCEPTED job would never have a host_job_id, per my last comment? I will arrange a meeting next week to do a PR review.
You’re right, that one was on me. I’ve fixed it so that all ACCEPTED jobs are marked DISMISSED during recovery, since they never started execution and can’t be safely requeued. I also added a Job Recovery section to DEV_GUIDE.md documenting the recovery behavior across all backends and the rationale for DISMISSED vs LOST. |
Let's discuss next week, need to think through this one. |
ar-siddiqui left a comment
Good progress for a first time touching this repo; please see my comments.
api/jobs/recovery.go (outdated)
    }

    if r.HostJobID == "" {
        if r.Status == ACCEPTED {
If a job has ACCEPTED status, it hasn't been started. We should separately handle ACCEPTED cases even before checking for HostJobID and update their status to DISMISSED.
    if r.Status == ACCEPTED {
        log.Warnf("Recovery(docker): ACCEPTED job missing container ID, marking DISMISSED job=%s", r.JobID)
        _ = db.updateJobRecord(r.JobID, DISMISSED, time.Now())
    }
What happens if host_job_id is missing but status is running?
We should mark it as LOST here.
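A sketch of the ordering these two comments suggest, using stand-in names (the status constants, `jobRecord`, `updateJobRecord`) taken from the diff context rather than the repo's real types:

```go
package jobs

import (
	"time"

	log "github.com/sirupsen/logrus"
)

const (
	ACCEPTED  = "accepted"
	RUNNING   = "running"
	DISMISSED = "dismissed"
	LOST      = "lost"
)

type jobRecord struct {
	JobID     string
	Status    string
	HostJobID string
}

type recoveryDB interface {
	updateJobRecord(jobID, status string, t time.Time) error
}

// classifyDockerRecord handles ACCEPTED jobs first (they never started, so
// DISMISSED), then treats any other non-terminal job with no container ID as
// LOST. It returns true only when container inspection/reattachment should continue.
func classifyDockerRecord(db recoveryDB, r jobRecord) bool {
	switch {
	case r.Status == ACCEPTED:
		log.Warnf("Recovery(docker): ACCEPTED job never started, marking DISMISSED job=%s", r.JobID)
		_ = db.updateJobRecord(r.JobID, DISMISSED, time.Now())
		return false
	case r.HostJobID == "":
		log.Warnf("Recovery(docker): missing container ID for %s job, marking LOST job=%s", r.Status, r.JobID)
		_ = db.updateJobRecord(r.JobID, LOST, time.Now())
		return false
	default:
		return true
	}
}
```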
    job.ctx = ctx
    job.ctxCancel = cancel

    if err := job.initLogger(); err != nil {
initLogger() uses os.Create. Calling it here without modifying it will overwrite server/process logs created before the crash.
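One way to avoid clobbering pre-crash output is to open the log in append mode instead of `os.Create`; a minimal sketch, assuming the job's log path is known:

```go
package jobs

import (
	"fmt"
	"os"
)

// openJobLog opens the job's log file without truncating it: O_CREATE covers
// the first run, O_APPEND lets a recovery pass keep output written before the crash.
func openJobLog(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, fmt.Errorf("open job log %s: %w", path, err)
	}
	return f, nil
}
```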
    func recoverExitedContainer(j *DockerJob, exitCode int) {
        if exitCode == 0 {
            j.NewStatusUpdate(SUCCESSFUL, time.Now())
We should run the metadata write routine here if the job finished successfully.
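A sketch of that suggestion, with `DockerJob`, `NewStatusUpdate`, and `WriteMeta` as stand-ins for the real job type and its metadata routine (the method names are assumptions):

```go
package jobs

import (
	"time"

	log "github.com/sirupsen/logrus"
)

// DockerJob is a stub; NewStatusUpdate and WriteMeta stand in for the job's
// real status-update and metadata-write routines.
type DockerJob struct{ JobID string }

func (j *DockerJob) NewStatusUpdate(status string, t time.Time) {}
func (j *DockerJob) WriteMeta() error                           { return nil }

func recoverExitedContainer(j *DockerJob, exitCode int) {
	if exitCode == 0 {
		j.NewStatusUpdate("successful", time.Now())
		// The container finished while the API was down, so the usual
		// post-success bookkeeping (metadata/results write) still has to run.
		if err := j.WriteMeta(); err != nil {
			log.Warnf("Recovery(docker): metadata write failed job=%s: %v", j.JobID, err)
		}
		return
	}
	j.NewStatusUpdate("failed", time.Now())
}
```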
    }

    if err := j.initLogger(); err != nil {
        log.Warnf("Recovery(aws-batch): failed to init logger job=%s: %v", r.JobID, err)
The job should be marked lost before continuing.
api/jobs/jobs.go (outdated)
    func FetchResults(svc *s3.S3, jid string, status string) (interface{}, error) {

        logs, err := FetchLogs(svc, jid, true)
        // LOST jobs never have results
This is not required since we only ever ask for results if the job is successful.
api/jobs/jobs.go (outdated)
        return nil, fmt.Errorf("no results available")
    }

    // Only successful jobs can have results
This check is okay; it is just double-checking, but the above check should be removed. Why treat LOST specially? DISMISSED, RUNNING, etc. also don't have a result.
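In other words, a single guard on the final status is enough; a tiny sketch with status values written as plain strings for illustration:

```go
package jobs

import "fmt"

// resultsAllowed gates results on SUCCESSFUL once, rather than special-casing
// LOST: DISMISSED, RUNNING, ACCEPTED, etc. equally have no results.
func resultsAllowed(status string) error {
	if status != "successful" {
		return fmt.Errorf("no results available: job status is %q, results exist only for successful jobs", status)
	}
	return nil
}
```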
api/jobs/jobs.go (outdated)
        continue
    }

    // NEW: LOST jobs don't have container logs
I don't think this is needed.
api/jobs/subprocess_jobs.go (outdated)
        return
    }
    j.PID = fmt.Sprintf("%d", j.execCmd.Process.Pid)
    j.DB.updateJobHost(j.UUID, "subprocess", j.PID)
Why are HTML files whitespace-modified?
…nces in job handling
@Akopti8 I reviewed the PR; you can see my review comments above. I also went ahead and addressed those comments. I made a mistake and pushed my commits to your branch; I should have pushed to a new branch so that you could easily see the diff between your last commit and my changes. Luckily, we still have an easy way to see the diffs clearly.

Here are your changes and my review comments: https://github.com/Dewberry/sepex/pull/114/changes/BASE..2a9e3fdb5011bb65635e055db46096cb47af6462

Here are my changes to address those comments, plus some improvements: https://github.com/Dewberry/sepex/pull/114/changes/2a9e3fdb5011bb65635e055db46096cb47af6462..12f2ca4d4abb127bf13ac8d684ad459e683c15b9

Let me know what you think of my changes; I'm happy to modify anything if needed. I put my thought process in DEV_GUIDE.md and in some code comments. I didn't do any testing, so that still needs to be carried out.
Thanks for the review and for the changes you made. It was really helpful seeing how you addressed things in the code! I tested recovery manually in a few different ways:

**Subprocess jobs**: I started a subprocess job, then manually crashed the API and restarted it right away. After restart, the job was marked LOST, which makes sense since the process was gone. I also tested the case where a job never actually started: I started a job, crashed the API immediately, then manually removed the host_job_id and set the status to ACCEPTED in the DB to simulate a crash before the job launched (I don't think I can do that fast enough in real time). After restarting the API, that job was marked DISMISSED.

**Docker jobs**: Same as subprocess for the "never started" case (ACCEPTED + no host id → DISMISSED). I also tested a job that was already running in Docker: I crashed the API while the container was still running, restarted the API, and verified that the job was reattached in memory and continued tracking correctly.

**AWS Batch jobs**: The main thing I could test here is that after restart the API reattaches the job and continues handling status updates correctly. I don't have access to the Lambda that sends status callbacks, so I couldn't fully simulate the real status update flow. I also can't realistically simulate a true "lost" AWS job case.

@slawler, as for incorporating this into the newman tests, I think we can do it by adding a few steps after the existing newman tests: submit a long async job, docker kill the API, start it back up, and then run another newman collection that checks whether the recovered jobs' statuses are correctly assigned (LOST or DISMISSED).
This adds startup recovery so the API can rebuild in-memory job state after a crash/restart.
- Adds `host_job_id` to job records plus DB schema updates (Postgres + SQLite), and updates job creation to persist the host and host job id.
- New `RecoverAllJobs()` runs on boot; recovered jobs are re-added to `ActiveJobs` (no watcher loop).
- Adds LOST handling (status guards + UI icons/templates) and updates logs/results fetching to be status-aware.
- Adds a small Docker controller helper (`ContainerInfo`) to detect "missing container" cleanly.

Goal: avoid "orphaned" non-terminal jobs after restarts, and make recovery behavior explicit and visible in logs/UI.

Note: I was unable to test whether an aws-batch job gets status updates from the lambda function post-recovery.
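For reference, a condensed sketch of what the boot-time pass described above could look like. The helper names (`GetNonTerminalJobs`, the per-host recover functions) and status strings are assumptions meant to illustrate the dispatch shape, not the PR's exact implementation.

```go
package jobs

import (
	"time"

	log "github.com/sirupsen/logrus"
)

type jobRecord struct {
	JobID     string
	Status    string
	Host      string
	HostJobID string
}

type recoveryStore interface {
	GetNonTerminalJobs() ([]jobRecord, error)
	updateJobRecord(jobID, status string, t time.Time) error
}

// RecoverAllJobs walks every job that was not in a terminal state when the
// process died and hands it to a host-specific recovery routine.
func RecoverAllJobs(db recoveryStore) error {
	records, err := db.GetNonTerminalJobs()
	if err != nil {
		return err
	}
	for _, r := range records {
		switch r.Host {
		case "docker":
			recoverDockerJob(db, r)
		case "aws-batch":
			recoverAWSBatchJob(db, r)
		case "subprocess":
			// A local process never survives an API restart: jobs that never
			// started (ACCEPTED) become DISMISSED, everything else is LOST.
			status := "lost"
			if r.Status == "accepted" {
				status = "dismissed"
			}
			log.Warnf("Recovery(subprocess): marking %s job=%s", status, r.JobID)
			_ = db.updateJobRecord(r.JobID, status, time.Now())
		default:
			log.Warnf("Recovery: unknown host %q job=%s", r.Host, r.JobID)
		}
	}
	return nil
}

// Stubs standing in for the real per-host recovery logic.
func recoverDockerJob(db recoveryStore, r jobRecord)   {}
func recoverAWSBatchJob(db recoveryStore, r jobRecord) {}
```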