Job management changes #41

Open
dylanlee wants to merge 5 commits into main from job-management-changes

dylanlee commented Aug 28, 2025

This PR covers changes to how the pipeline submits and tracks Nomad jobs. The biggest changes are to nomad_job_manager.py.

nomad_job_manager.py was changed to:

  1. Use polling to monitor job status instead of the Nomad event stream
  2. Add better tracking for lost or cancelled Nomad jobs
  3. Propagate exceptions to pipeline_stages.py more directly, so that a pipeline can be marked failed earlier and does not waste compute time
  4. Use a semaphore to limit the number of jobs dispatched or polled concurrently. This limits the load a single pipeline places on the Nomad server
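
For reviewers who want a mental model of points 1, 3, and 4 together, here is a minimal sketch of semaphore-limited polling. It is not the code in nomad_job_manager.py: `wait_for_job`, `fetch_status`, `JobFailedError`, and the status strings are hypothetical names chosen for illustration, and `fetch_status` stands in for whatever call actually queries Nomad.

```python
import asyncio
from typing import Awaitable, Callable


class JobFailedError(RuntimeError):
    """Hypothetical error raised when a job ends in a non-successful state."""


# Assumed status strings for illustration; the real values depend on
# what the Nomad API reports for a job or allocation.
FAILED_STATES = {"failed", "lost", "cancelled"}


async def wait_for_job(
    job_id: str,
    fetch_status: Callable[[str], Awaitable[str]],
    limiter: asyncio.Semaphore,
    poll_interval: float = 5.0,
) -> str:
    """Poll a job until it completes, raising as soon as it fails."""
    while True:
        # Hold the semaphore only for the duration of the request, so that
        # at most N status calls are in flight at once for this pipeline.
        async with limiter:
            status = await fetch_status(job_id)
        if status == "complete":
            return status
        if status in FAILED_STATES:
            # Raising here lets the calling stage fail early instead of
            # waiting on a job that will never finish.
            raise JobFailedError(f"job {job_id} ended with status {status!r}")
        # Sleep outside the semaphore so waiting jobs do not block others.
        await asyncio.sleep(poll_interval)
```

Sharing one `asyncio.Semaphore` between dispatch and polling calls would keep the total number of concurrent requests under a fixed ceiling, such as the urllib3 pool size mentioned in the commit message.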

pipeline_stages.py was also changed to stagger job submissions within each stage. This is another measure to reduce the load placed on the Nomad server when many pipeline jobs run at once.
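
The staggering idea can be sketched roughly as follows. This is an illustration under assumptions, not the implementation in pipeline_stages.py: `run_stage`, `run_job`, and `stagger_seconds` are hypothetical names, and the fixed per-job delay is just one possible staggering scheme.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def run_stage(
    job_ids: Sequence[str],
    run_job: Callable[[str], Awaitable[str]],
    stagger_seconds: float = 1.0,
) -> list[str]:
    """Start each job a little later than the previous one, then wait for all."""

    async def delayed(index: int, job_id: str) -> str:
        # Job 0 starts immediately, job 1 after one interval, and so on.
        await asyncio.sleep(index * stagger_seconds)
        return await run_job(job_id)

    tasks = [
        asyncio.create_task(delayed(i, job_id))
        for i, job_id in enumerate(job_ids)
    ]
    try:
        return await asyncio.gather(*tasks)
    except Exception:
        # One job failed: cancel the rest so the stage fails early,
        # wait for the cancellations to settle, then re-raise.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        raise
```

With this shape, a job that fails before later jobs are submitted prevents those submissions entirely, which is where the savings in compute time would come from.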

dylanlee and others added 5 commits August 28, 2025 11:52
Added a semaphore to NomadJobManager so that the dispatch job requests
don't exceed the urllib3 pool size limit used by the nomad python
library

Edited nomad_job_manager.py to work with polling instead of the Nomad
event stream. Also modified pipeline_stages.py so that a stage is marked
failed if a job in that stage is lost by the Nomad API

When I was cherry-picking changes in other files in src into this branch,
I accidentally copied over main.py changes. Reverted the changes since
src/main.py changes are being tracked in the update-entrypoint branch