Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| "github.com/prometheus/client_golang/prometheus/promauto" | ||
| log "github.com/sirupsen/logrus" | ||
| "k8s.io/api/core/v1" | ||
| v1 "k8s.io/api/core/v1" |
There was a problem hiding this comment.
This change is unnecessary
There was a problem hiding this comment.
It is created by gofmt I think, maybe we can keep it.
| // Check if the pod is retryable. | ||
| if spec.RestartPolicy == apiv1.RestartPolicyExitCode { | ||
| if pod.Status.Phase == v1.PodFailed && trainutil.IsRetryableExitCode(exitCode) { | ||
| if pod.Status.Phase == v1.PodFailed && !trainutil.IsRetryableExitCode(exitCode) { |
There was a problem hiding this comment.
I think this was intended and correct but we probably need to improve the log message here
There was a problem hiding this comment.
The original logic here is to provide flexibility to define your own retryable exit code. This should not be changed.
Since kubernetes doesn't have restart method, That's the reason we delete the pod, and wait for next reconcile loop to create a new one. It's asynchronous so restart here may be confused to users.
|
|
||
| if job == nil { | ||
| if pod.Labels[apiv1.GroupNameLabel] == jc.Controller.GetGroupNameLabelValue() { | ||
| if pod.Labels[apiv1.GroupNameLabel] != jc.Controller.GetGroupNameLabelValue() { |
There was a problem hiding this comment.
Please check logics here.
common/pkg/controller.v1/common/job_controller.go
Lines 321 to 336 in ac0c6e1
If we can not find the job, we will end reconcile loop and directly return. The only thing matters is if we want to have some meaningful log.
There're several reason job == nil here
- Kind doesn't match
- GroupNameLabel doesn't match
- Job doesn't exist. In this case, the pod is an orphan pod.
the 3rd one is the only case we want to persist the log. Does it make sense?
There was a problem hiding this comment.
@Jeffwan
Thank you so much Jeff. Much clearer after your patient explanation.
But I am still a little confused
I think the three cases should be:
1.Kind doesn't match
2.Name unmatched
3. Name matched but UID unmatched
My understanding is that third case cannot prove this pod is an orphan pod. It can only prove that the controllerRef point s to a different job.
Thank you again. Really appreciate your help
Thank you
There was a problem hiding this comment.
@xfate123 You are right. I didn't follow logic in resolveControllerRef. UID mismatch is one of the cases. We check GroupNameLabel again in caller side which I think is unnecessary. We use fixed value kubeflow.org for most of the operators.
My understanding is that third case cannot prove this pod is an orphan pod. It can only prove that the controllerRef point s to a different job.
I may not explain this clearly. you are right. It could be an orphan pod (job has been deleted - case 2) or a pod point to different job (UID mismatch - case 3). case 1 seems match the criterion as well.
There was a problem hiding this comment.
@Jeffwan Thank you so much for your clear explanation
| // Check if the pod is retryable. | ||
| if spec.RestartPolicy == apiv1.RestartPolicyExitCode { | ||
| if pod.Status.Phase == v1.PodFailed && trainutil.IsRetryableExitCode(exitCode) { | ||
| if pod.Status.Phase == v1.PodFailed && !trainutil.IsRetryableExitCode(exitCode) { |
There was a problem hiding this comment.
The original logic here is to provide flexibility to define your own retryable exit code. This should not be changed.
Since kubernetes doesn't have restart method, That's the reason we delete the pod, and wait for next reconcile loop to create a new one. It's asynchronous so restart here may be confused to users.
|
|
||
| if job == nil { | ||
| if pod.Labels[apiv1.GroupNameLabel] == jc.Controller.GetGroupNameLabelValue() { | ||
| if pod.Labels[apiv1.GroupNameLabel] != jc.Controller.GetGroupNameLabelValue() { |
There was a problem hiding this comment.
Please check logics here.
common/pkg/controller.v1/common/job_controller.go
Lines 321 to 336 in ac0c6e1
If we can not find the job, we will end reconcile loop and directly return. The only thing matters is if we want to have some meaningful log.
There're several reason job == nil here
- Kind doesn't match
- GroupNameLabel doesn't match
- Job doesn't exist. In this case, the pod is an orphan pod.
the 3rd one is the only case we want to persist the log. Does it make sense?
Co-authored-by: depfu[bot] <23717796+depfu[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
just some minor errors I found when I studied the code