Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/ok-to-test |
|
Thanks for the PR, is it ready to review? |
@gaocegege Ready for review. Thanks :) |
| if err != nil { | ||
| return err | ||
| } | ||
| if commonutil.IsSucceeded(jobStatus) || commonutil.IsFailed(jobStatus) || (jobSuspended != nil && *jobSuspended) { |
There was a problem hiding this comment.
My concern here is Suspend is just a transition state, should we delete all the pods or just the active ones, leaving the completed pods(succeeded/failed) as they are.
There was a problem hiding this comment.
If anything, it should have the same semantics as kubernetes Job, where we delete the running pods.
There was a problem hiding this comment.
Yes, the current implementation is consistent with batch/job.
| } | ||
|
|
||
| func (jc *JobController) JobSuspended(job interface{}) (*bool, error) { | ||
| log.Infof("Not implemented.") |
There was a problem hiding this comment.
I am wondering if we should merge this since the feature is not completed.
There was a problem hiding this comment.
This is a base class default function in case Job subclasses(TFJob, MPIJob, etc.) do not implement this method.
There was a problem hiding this comment.
Actually it will be override in Job subclass which supports job suspend.
There was a problem hiding this comment.
If DeletePodsAndServices has an implementation, I don't see why this function wouldn't have one.
|
How is this PR going now? |
|
Is this actively being worked on? Or will we get rid of the common repo first? |
@alculquicondor Maybe, we will work on the Job suspend feature in the next kubeflow release cycle (maybe kubeflow v1.8?). Since we didn't push this feature to the enhancement lists for the next kubeflow release (v1.7) and the feature freeze for the next kubeflow version (v1.7) is coming up.
|
|
Agree. we will take this up in next release after our merging kubeflow/common as planned in kubeflow/trainer#1714 (comment) |
|
@tenzen-y how do you feel about starting with the integration for mpi-operator v2 and follow through with training-operator later? |
@alculquicondor Yes. that is a good idea. I was thinking of the same. |
|
Excellent! |
You are right. I will work on the following steps after kubeflow feature freeze date (1/25) since I have no enough bandwidth for mpi-operator v2, now:
Although, other anyone can take step 3 after step 1 is completed. |
|
@mimowo will help with suspend in mpi-operator kubeflow/mpi-operator#504 |
|
Great! Thanks to @mimowo! |
Agreed. We could try to work on Job suspend feature for kubeflow v1.8. |
|
@johnugeorge how are we doing with the branch creation? |
To support job suspend semantics like Kubernetes batch job: https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job