Conversation
Signed-off-by: cegao <ce.gao@outlook.com>
|
/hold |
Signed-off-by: cegao <ce.gao@outlook.com>
|
It is blocked by kubeflow/common#181 |
Pull Request Test Coverage Report for Build 1562410664
💛 - Coveralls |
terrytangyuan
left a comment
There was a problem hiding this comment.
LGTM. Pending approval from the PR in kubeflow/common.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gaocegege, terrytangyuan The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
|
/hold cancel |
gaocegege
left a comment
There was a problem hiding this comment.
We decided not to add the successpolicy into the common since we may move to SuccessCondition in the future. Thus we just implement it in PyTorch.Spec and TF.Spec.
/cc @terrytangyuan
| // Leave a succeeded condition for the following two cases: | ||
| // 1. If default success policy is used and worker 0 has completed. | ||
| // 2. If `SuccessPolicyAllWorkers` success policy is used and all workers are succeeded. | ||
| if expected == 0 || (worker0Completed && *pytorchjob.Spec.SuccessPolicy != pytorchv1.SuccessPolicyAllWorkers) { |
There was a problem hiding this comment.
L387 might take a bit of time to understand. Maybe it's because the order of the two conditions are not arranged as the comment lists.
There was a problem hiding this comment.
The comments and conditions are copied from TF. I will refine it soon.
| if !ok { | ||
| return true, nil | ||
| } | ||
| podSlices, err := p.getPodSlices(job, |
There was a problem hiding this comment.
I understand the slice approach definitely works. But what is the point to create a the getPodSlices method while all we need is just the worker pod with index=0? And what kind of disadvantages will it have to use Get function or List with label selector to catch worker0 pod?
There was a problem hiding this comment.
It is to keep consistency with TF controller. The code is copied from it.
| } | ||
|
|
||
| // GetContainerExitCode gets the container exit code from the given pod. | ||
| func GetContainerExitCode(pod *corev1.Pod, name string) int32 { |
| commonutil "github.com/kubeflow/common/pkg/util" | ||
| "github.com/sirupsen/logrus" | ||
| corev1 "k8s.io/api/core/v1" | ||
| v1 "k8s.io/api/core/v1" |
There was a problem hiding this comment.
Thanks for the comment.
Signed-off-by: cegao <ce.gao@outlook.com>
|
/assign @zw0610 |
Signed-off-by: cegao <ce.gao@outlook.com>
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Signed-off-by: cegao ce.gao@outlook.com
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...format, will close the issue(s) when PR gets merged):Fixes #
Checklist: