
fix: creating multiple pods at the same time will be assigned to the same NPU.#53

Open
lomtom wants to merge 1 commit into Project-HAMi:main from lomtom:main

Conversation

@lomtom

@lomtom lomtom commented Jan 30, 2026

fix: #52

@hami-robot
Contributor

hami-robot bot commented Jan 30, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lomtom
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot bot requested review from DSFans2014 and archlitchi January 30, 2026 08:07
@hami-robot
Contributor

hami-robot bot commented Jan 30, 2026

Welcome @lomtom! It looks like this is your first PR to Project-HAMi/ascend-device-plugin 🎉

Contributor

@DSFans2014 DSFans2014 left a comment


please sign-off your commit

…same NPU.

Signed-off-by: lomtom <lomtom@qq.com>
@lomtom
Author

lomtom commented Feb 2, 2026

please sign-off your commit

It's done.

@DSFans2014
Contributor

Is this issue consistently reproducible? Can this fix be verified to resolve that problem? @lomtom

@lomtom
Author

lomtom commented Feb 4, 2026

Is this issue consistently reproducible? Can this fix be verified to resolve that problem? @lomtom

Yes, this is almost 100% reproducible, and after applying the fix I verified that the issue is resolved.

@DSFans2014
Contributor

Is this issue consistently reproducible? Can this fix be verified to resolve that problem? @lomtom

Yes, this is almost 100% reproducible, and after applying the fix I verified that the issue is resolved.

Hi lomtom, please explain why setting annotations[util.DeviceBindPhase] = util.DeviceBindSuccess resolves this problem.

@lomtom
Author

lomtom commented Feb 4, 2026

hi, @DSFans2014

pod, err := util.GetPendingPod(ctx, ps.nodeName)

This method filters for pending Pods, but after a Pod has been processed its DeviceBindPhase annotation does not change. As a result, the already-processed Pod can still be selected when devices are assigned for the second Pod (at that moment, the first Pod is still in the Pending state for a short time).

https://github.com/Project-HAMi/HAMi/blob/0369aab84876bd26ae64c636f94c5da6d53b6925/pkg/util/util.go#L95-L101

		if phase, ok := p.Annotations[DeviceBindPhase]; !ok {
			continue
		} else {
			if strings.Compare(phase, DeviceBindAllocating) != 0 {
				continue
			}
		}

Therefore, if the Pod's DeviceBindPhase annotation is changed to util.DeviceBindSuccess after Allocate has been executed, it will be filtered out the next time, even though it remains in the Pending state for a short while.
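
For reference, a minimal sketch of what such a change could look like, assuming a standard client-go clientset; the annotation key/value literals and the helper name markBindSuccess are illustrative, not the actual HAMi implementation (in HAMi the key/value are the util.DeviceBindPhase and util.DeviceBindSuccess constants):

	// Minimal sketch, not the actual HAMi code: after a successful Allocate,
	// patch the Pod so that the pending-Pod lookup no longer selects it.
	package allocate

	import (
		"context"
		"encoding/json"
		"fmt"

		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"k8s.io/apimachinery/pkg/types"
		"k8s.io/client-go/kubernetes"
	)

	const (
		// Assumed key/value; in HAMi these are the util.DeviceBindPhase and
		// util.DeviceBindSuccess constants.
		deviceBindPhase   = "hami.io/bind-phase"
		deviceBindSuccess = "success"
	)

	// markBindSuccess sets the bind-phase annotation to "success" so that a
	// concurrent allocation for another Pod no longer picks this Pod up while
	// it is still briefly reported as Pending.
	func markBindSuccess(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
		patch := map[string]any{
			"metadata": map[string]any{
				"annotations": map[string]string{
					deviceBindPhase: deviceBindSuccess,
				},
			},
		}
		data, err := json.Marshal(patch)
		if err != nil {
			return fmt.Errorf("marshal annotation patch: %w", err)
		}
		_, err = client.CoreV1().Pods(namespace).Patch(
			ctx, name, types.StrategicMergePatchType, data, metav1.PatchOptions{})
		return err
	}

Once the annotation reads success, the strings.Compare(phase, DeviceBindAllocating) check in GetPendingPod skips the Pod, so two near-simultaneous allocations no longer resolve to the same candidate.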

@DSFans2014
Contributor


Got it, thank you. However, there are still potential concurrency issues due to the nodelock problem in Volcano; we will fix that later.

Contributor

@DSFans2014 DSFans2014 left a comment


/lgtm


Projects

None yet

Development

Successfully merging this pull request may close these issues.

Creating multiple pods at the same time causes them to be assigned to the same NPU

2 participants