Conversation
/assign @terrytangyuan
terrytangyuan left a comment
I am not sure if this is a common use case. Could you elaborate?
The ability to suspend and resume Jobs is often desired when cluster resources are limited and a higher priority Job needs to execute in the place of another Job.
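For context, this is the same semantics the built-in Kubernetes batch/v1 Job API exposes through `spec.suspend`. A minimal sketch of toggling that field with client-go follows; the `suspendJob` helper, namespace, and job name are illustrative, not part of this PR:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// suspendJob flips spec.suspend on a batch/v1 Job. When set to true, the Job
// controller deletes the Job's active pods; setting it back to false lets the
// controller recreate them and the Job continues.
func suspendJob(ctx context.Context, cs kubernetes.Interface, ns, name string, suspend bool) error {
	job, err := cs.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	job.Spec.Suspend = &suspend
	_, err = cs.BatchV1().Jobs(ns).Update(ctx, job, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// Suspend a lower-priority job so a higher-priority one can use its resources.
	if err := suspendJob(context.Background(), cs, "default", "low-priority-train", true); err != nil {
		panic(err)
	}
	fmt.Println("job suspended")
}
```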
/ok-to-test
What are the changes you are trying to make to the training operator?
Add some logic to the PyTorchJob lifecycle: delete the pods when the job is suspended and recreate them when it is resumed. Also optimize the PyTorchJob status management module so that it keeps working correctly after the suspend/resume state is added (see the sketch below).
add job partial success status
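A rough sketch of the kind of suspend-aware reconcile step described above, assuming a `Suspend` field on the job's run policy. The type stubs and helper names (`deletePodsAndServices`, `reconcilePods`, `setSuspendedCondition`) are hypothetical stand-ins, not the actual training-operator code:

```go
package controller

import "context"

// Minimal stand-ins for the real CRD and controller types; the actual
// training-operator definitions differ.
type PyTorchJob struct {
	Spec struct {
		RunPolicy struct {
			Suspend *bool
		}
	}
}

type Reconciler struct{}

// Hypothetical helpers the real controller would implement.
func (r *Reconciler) deletePodsAndServices(ctx context.Context, job *PyTorchJob) error { return nil }
func (r *Reconciler) reconcilePods(ctx context.Context, job *PyTorchJob) error         { return nil }
func (r *Reconciler) setSuspendedCondition(ctx context.Context, job *PyTorchJob, suspended bool) error {
	return nil
}

// reconcileSuspend deletes pods when the job is suspended and recreates them
// when it is resumed, leaving the job object and its status in place so the
// existing status management keeps working.
func (r *Reconciler) reconcileSuspend(ctx context.Context, job *PyTorchJob) error {
	suspended := job.Spec.RunPolicy.Suspend != nil && *job.Spec.RunPolicy.Suspend

	if suspended {
		// Suspended: tear down the running pods but keep accumulated status.
		if err := r.deletePodsAndServices(ctx, job); err != nil {
			return err
		}
		return r.setSuspendedCondition(ctx, job, true)
	}

	// Not (or no longer) suspended: recreate pods and clear the condition.
	if err := r.reconcilePods(ctx, job); err != nil {
		return err
	}
	return r.setSuspendedCondition(ctx, job, false)
}
```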
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I am not sure if suspend is common in distributed training jobs. There will be side effects depending on the training framework, especially when pods are deleted and recreated. |
This is not about the training job itself. |