minor errors in pod.go by xfate123 · Pull Request #100 · kubeflow/common

xfate123 · 2020-08-04T20:09:16Z

Just some minor errors I found while studying the code.

kubeflow-bot · 2020-08-04T20:09:21Z

This change is

k8s-ci-robot · 2020-08-04T20:09:22Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign richardsliu
You can assign the PR to them by writing /assign @richardsliu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

terrytangyuan · 2020-08-04T20:22:33Z

pkg/controller.v1/common/pod.go

 	"github.com/prometheus/client_golang/prometheus/promauto"
 	log "github.com/sirupsen/logrus"
-	"k8s.io/api/core/v1"
+	v1 "k8s.io/api/core/v1"


This change is unnecessary.

gaocegege · 2020-08-05

I think gofmt created it; maybe we can keep it.

terrytangyuan · 2020-08-04T20:26:39Z

pkg/controller.v1/common/pod.go

 			// Check if the pod is retryable.
 			if spec.RestartPolicy == apiv1.RestartPolicyExitCode {
-				if pod.Status.Phase == v1.PodFailed && trainutil.IsRetryableExitCode(exitCode) {
+				if pod.Status.Phase == v1.PodFailed && !trainutil.IsRetryableExitCode(exitCode) {


I think this was intended and correct, but we probably need to improve the log message here.

Jeffwan · 2020-08-10T16:06:19Z

The original logic here provides the flexibility to define your own retryable exit codes. This should not be changed.

Kubernetes doesn't have a restart method, which is why we delete the pod and wait for the next reconcile loop to create a new one. Because this is asynchronous, calling it a "restart" here may confuse users.
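For readers following the thread, here is a minimal Go sketch of the check being discussed: a failed pod with a retryable exit code is deleted so that a later reconcile pass can recreate it. The names `podPhase`, `isRetryableExitCode`, and `shouldDeleteForRetry` are hypothetical stand-ins, and the threshold of 128 is an assumption for illustration only, not something confirmed in this PR.

```go
package main

import "fmt"

// podPhase mirrors the small subset of pod phases this sketch needs.
type podPhase string

const (
	podRunning podPhase = "Running"
	podFailed  podPhase = "Failed"
)

// isRetryableExitCode is a hypothetical stand-in for
// trainutil.IsRetryableExitCode. The threshold is an assumption
// made for this sketch (signal-style exit codes).
func isRetryableExitCode(exitCode int32) bool {
	return exitCode >= 128
}

// shouldDeleteForRetry reports whether a pod under the ExitCode restart
// policy should be deleted so that a later reconcile pass recreates it.
// Nothing is restarted in place; deletion is the only action taken.
func shouldDeleteForRetry(phase podPhase, exitCode int32) bool {
	return phase == podFailed && isRetryableExitCode(exitCode)
}

func main() {
	cases := []struct {
		phase    podPhase
		exitCode int32
	}{
		{podFailed, 137},
		{podFailed, 1},
		{podRunning, 137},
	}
	for _, c := range cases {
		fmt.Printf("phase=%s exitCode=%d deleteForRetry=%t\n",
			c.phase, c.exitCode, shouldDeleteForRetry(c.phase, c.exitCode))
	}
}
```

Negating the predicate, as the diff above proposes, would delete pods whose exit codes are not retryable, which is the opposite of the intent described in the review.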

merlintang · 2020-08-04T21:09:14Z

pkg/controller.v1/common/pod.go


 		if job == nil {
-			if pod.Labels[apiv1.GroupNameLabel] == jc.Controller.GetGroupNameLabelValue() {
+			if pod.Labels[apiv1.GroupNameLabel] != jc.Controller.GetGroupNameLabelValue() {


Good catch.

Jeffwan · 2020-08-10T16:13:28Z

@xfate123

Please check the logic here.

common/pkg/controller.v1/common/job_controller.go

Lines 321 to 336 in ac0c6e1

	func (jc *JobController) resolveControllerRef(namespace string, controllerRef *metav1.OwnerReference) metav1.Object {
		// We can't look up by UID, so look up by Name and then verify UID.
		// Don't even try to look up by Name if it's the wrong Kind.
		if controllerRef.Kind != jc.Controller.GetAPIGroupVersionKind().Kind {
			return nil
		}
		job, err := jc.Controller.GetJobFromInformerCache(namespace, controllerRef.Name)
		if err != nil {
			return nil
		}
		if job.GetUID() != controllerRef.UID {
			// The controller we found with this Name is not the same one that the
			// ControllerRef points to.
			return nil
		}
		return job

If we cannot find the job, we end the reconcile loop and return directly. The only thing that matters is whether we want a meaningful log message.

There are several reasons why job == nil here:

1. Kind doesn't match
2. GroupNameLabel doesn't match
3. Job doesn't exist. In this case, the pod is an orphan pod.

~~The 3rd one is the only case where we want to persist the log.~~ Does it make sense?

xfate123 · 2020-08-10

@Jeffwan
Thank you so much, Jeff. It is much clearer after your patient explanation, but I am still a little confused.
I think the three cases should be:
1. Kind doesn't match
2. Name doesn't match
3. Name matches but UID doesn't match
My understanding is that the third case cannot prove this pod is an orphan pod. It can only prove that the controllerRef points to a different job.
Thank you again. I really appreciate your help.

Jeffwan · 2020-08-12

@xfate123 You are right. I didn't follow the logic in resolveControllerRef. UID mismatch is one of the cases. We check GroupNameLabel again on the caller side, which I think is unnecessary. We use the fixed value kubeflow.org for most of the operators.

> My understanding is that the third case cannot prove this pod is an orphan pod. It can only prove that the controllerRef points to a different job.

I may not have explained this clearly. You are right. It could be an orphan pod (the job has been deleted, case 2) or a pod pointing to a different job (UID mismatch, case 3). Case 1 seems to match the criterion as well.

xfate123 · 2020-08-20

@Jeffwan Thank you so much for your clear explanation.
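The three nil cases agreed on in this thread can be sketched as a small, self-contained Go program. This is only an illustration under stated assumptions: `ownerRef`, `job`, `nilReason`, and `resolve` are hypothetical simplifications, a plain map stands in for the informer cache, and "TFJob" is just an example kind. The real resolveControllerRef returns only nil and does not report a reason.

```go
package main

import "fmt"

// ownerRef and job are simplified, hypothetical stand-ins for
// metav1.OwnerReference and the job object held in the informer cache.
type ownerRef struct {
	Kind, Name, UID string
}

type job struct {
	Name, UID string
}

// nilReason names why resolution produced no job.
type nilReason int

const (
	resolved nilReason = iota
	kindMismatch
	jobNotFound
	uidMismatch
)

func (r nilReason) String() string {
	switch r {
	case resolved:
		return "resolved"
	case kindMismatch:
		return "kind mismatch"
	case jobNotFound:
		return "job not found (possible orphan pod)"
	case uidMismatch:
		return "UID mismatch (controllerRef points to a different job)"
	}
	return "unknown"
}

// resolve follows the same order of checks as the excerpt above:
// Kind first, then lookup by Name, then UID verification.
func resolve(expectedKind string, cache map[string]job, ref ownerRef) (*job, nilReason) {
	if ref.Kind != expectedKind {
		return nil, kindMismatch
	}
	j, ok := cache[ref.Name]
	if !ok {
		return nil, jobNotFound
	}
	if j.UID != ref.UID {
		return nil, uidMismatch
	}
	return &j, resolved
}

func main() {
	cache := map[string]job{"train": {Name: "train", UID: "uid-1"}}
	refs := []ownerRef{
		{Kind: "Deployment", Name: "train", UID: "uid-1"},
		{Kind: "TFJob", Name: "gone", UID: "uid-9"},
		{Kind: "TFJob", Name: "train", UID: "uid-2"},
		{Kind: "TFJob", Name: "train", UID: "uid-1"},
	}
	for _, ref := range refs {
		_, reason := resolve("TFJob", cache, ref)
		fmt.Printf("%+v -> %s\n", ref, reason)
	}
}
```

Returning a reason alongside the nil result is one way a caller could log something more meaningful than a generic message, which is the concern raised above.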


Co-authored-by: depfu[bot] <23717796+depfu[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

minor errors in pod.go

ac0c6e1

k8s-ci-robot requested review from Jeffwan and gaocegege August 4, 2020 20:09

k8s-ci-robot added the size/XS label Aug 4, 2020

terrytangyuan reviewed Aug 4, 2020

merlintang reviewed Aug 4, 2020

Jeffwan suggested changes Aug 10, 2020
