feat(gpu): refactor gpu plugins #2

Open
JustinChengLZ wants to merge 22 commits into luomingmeng:dev/support-gpu-memory-qrm-plugin from JustinChengLZ:dev/support-gpu-memory-plugin

@JustinChengLZ

What type of PR is this?

  • Refactor the GPU plugins into resourcePlugins and devicePlugins
  • StaticPolicy multiplexes each request to the respective plugin
  • Add unit tests
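
For reviewers, the multiplexing described above can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the `Request` and `Plugin` types, the `labelPlugin` stand-in, the `Register` method, and the `example.com/...` resource names are all hypothetical, and only the name `StaticPolicy` comes from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a hypothetical, simplified allocation request.
type Request struct {
	PodName      string
	ResourceName string
	Quantity     int
}

// Plugin is a hypothetical interface; the PR's real plugin interfaces are not shown on this page.
type Plugin interface {
	Allocate(req Request) (string, error)
}

// labelPlugin stands in for either a resource plugin or a device plugin.
type labelPlugin struct{ label string }

func (p labelPlugin) Allocate(req Request) (string, error) {
	return fmt.Sprintf("%s:%d for %s", p.label, req.Quantity, req.PodName), nil
}

// ErrUnknownResource is returned when no plugin handles the requested resource.
var ErrUnknownResource = errors.New("no plugin registered for resource")

// StaticPolicy routes each request to the plugin registered for its resource name.
type StaticPolicy struct {
	plugins map[string]Plugin
}

func NewStaticPolicy() *StaticPolicy {
	return &StaticPolicy{plugins: make(map[string]Plugin)}
}

// Register associates a resource name with the plugin that serves it.
func (p *StaticPolicy) Register(resourceName string, plugin Plugin) {
	p.plugins[resourceName] = plugin
}

// Allocate looks up the plugin by resource name and delegates to it.
func (p *StaticPolicy) Allocate(req Request) (string, error) {
	plugin, ok := p.plugins[req.ResourceName]
	if !ok {
		return "", fmt.Errorf("%w: %q", ErrUnknownResource, req.ResourceName)
	}
	return plugin.Allocate(req)
}

func main() {
	policy := NewStaticPolicy()
	policy.Register("example.com/gpu-memory", labelPlugin{label: "gpu-memory"})
	policy.Register("example.com/rdma", labelPlugin{label: "rdma"})

	out, err := policy.Allocate(Request{PodName: "pod-a", ResourceName: "example.com/rdma", Quantity: 1})
	fmt.Println(out, err)
}
```

Keying the lookup on the resource name keeps the policy itself unaware of whether the target is a resource plugin or a device plugin.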

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

chore: add unit tests

@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-memory-plugin branch from a6c0da9 to 01a1035 Compare October 3, 2025 06:28
feat: introduce rdma state and allow states to share within gpu sub-plugins

feat: implement rdma custom device plugin and implement logic for accompany resource allocation
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-memory-plugin branch from a4a23f1 to 9442945 Compare October 15, 2025 06:27
JustinChengLZ and others added 2 commits October 18, 2025 23:22
feat: implement allocation of accompany resource first before device
- Remove unused ResourcePluginsNames field and related configurations
- Add DefaultAccompanyResourceName method to CustomDevicePlugin interface
- Make registry maps private and add getter functions
- Improve error handling and cleanup in StaticPolicy allocation
- Simplify device topology initialization and allocation logic
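
One way to read "accompany resource first, then device" together with the cleanup bullet is sketched below. It is an illustration under assumptions rather than the PR's implementation: the `allocator` interface, `fakeAllocator`, and `allocateWithAccompany` are hypothetical names, and the rollback on failure is a guess at what "cleanup" covers.

```go
package main

import "fmt"

// allocator is a hypothetical interface; the real plugin interfaces are not shown on this page.
type allocator interface {
	Allocate(pod string) error
	Release(pod string)
}

// fakeAllocator records allocations in memory and can be told to fail.
type fakeAllocator struct {
	name      string
	fail      bool
	allocated map[string]bool
}

func newFake(name string, fail bool) *fakeAllocator {
	return &fakeAllocator{name: name, fail: fail, allocated: map[string]bool{}}
}

func (f *fakeAllocator) Allocate(pod string) error {
	if f.fail {
		return fmt.Errorf("%s: allocation failed for %s", f.name, pod)
	}
	f.allocated[pod] = true
	return nil
}

func (f *fakeAllocator) Release(pod string) { delete(f.allocated, pod) }

// allocateWithAccompany allocates the accompany resource first and the device second.
// If the device step fails, the accompany allocation is released so no partial state remains.
func allocateWithAccompany(pod string, accompany, device allocator) error {
	if err := accompany.Allocate(pod); err != nil {
		return fmt.Errorf("accompany resource: %w", err)
	}
	if err := device.Allocate(pod); err != nil {
		accompany.Release(pod) // cleanup (assumed behavior)
		return fmt.Errorf("device: %w", err)
	}
	return nil
}

func main() {
	mem := newFake("gpu-memory", false)
	dev := newFake("gpu-device", true) // simulate a failing device allocation

	err := allocateWithAccompany("pod-a", mem, dev)
	fmt.Println("error:", err)
	fmt.Println("accompany still allocated:", mem.allocated["pod-a"])
}
```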
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-memory-plugin branch 2 times, most recently from b361407 to 7ea9a29 Compare October 21, 2025 03:19
JustinChengLZ and others added 10 commits October 21, 2025 11:20
refactor(gpu): restructure device plugin and resource management
introduce a new strategy framework for GPU allocation with filtering, sorting and binding components
add helper functions for GPU memory and device allocation
remove redundant checks and simplify allocation logic
Restructure gpu allocation strategy into separate packages for better maintainability. Move filtering, sorting and binding strategies to dedicated directories and implement unified generic allocation strategy. Update manager to use new strategy structure and rename default strategy constant.
Convert public strategy fields to private and provide getter/setter methods
to maintain encapsulation while allowing controlled access to the strategies
Introduce DeviceAffinityGroup field to DeviceInfo struct to support device affinity grouping with priority levels.
feat(gpu): implement strategy-based GPU allocation framework
feat: implement device affinity strategy
feat(npu): develop device affinity binding and filtering strategies
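
The filter, sort, and bind stages named in these commits could compose along the following lines. This is a hedged sketch only: `Device`, `FilterFunc`, `LessFunc`, `BindFunc`, `GenericStrategy`, and `defaultStrategy` are hypothetical names chosen for illustration, and the actual packages in the PR may be organized quite differently.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Device is a hypothetical, simplified view of an allocatable device.
type Device struct {
	ID         string
	FreeMemory int
	Healthy    bool
}

// The three pluggable stages (names are assumptions, not the PR's types).
type FilterFunc func(d Device) bool
type LessFunc func(a, b Device) bool
type BindFunc func(candidates []Device, n int) ([]Device, error)

// GenericStrategy keeps its stages private, in the spirit of the encapsulation commit.
type GenericStrategy struct {
	filter FilterFunc
	less   LessFunc
	bind   BindFunc
}

func NewGenericStrategy(f FilterFunc, l LessFunc, b BindFunc) *GenericStrategy {
	return &GenericStrategy{filter: f, less: l, bind: b}
}

// Allocate runs filter, then sort, then bind.
func (s *GenericStrategy) Allocate(devices []Device, n int) ([]Device, error) {
	candidates := make([]Device, 0, len(devices))
	for _, d := range devices {
		if s.filter(d) {
			candidates = append(candidates, d)
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return s.less(candidates[i], candidates[j])
	})
	return s.bind(candidates, n)
}

// takeFirst binds the first n sorted candidates.
func takeFirst(candidates []Device, n int) ([]Device, error) {
	if n > len(candidates) {
		return nil, errors.New("not enough candidate devices")
	}
	return candidates[:n], nil
}

// defaultStrategy keeps healthy devices and prefers those with the most free memory.
func defaultStrategy() *GenericStrategy {
	return NewGenericStrategy(
		func(d Device) bool { return d.Healthy },
		func(a, b Device) bool { return a.FreeMemory > b.FreeMemory },
		takeFirst,
	)
}

func main() {
	devices := []Device{
		{ID: "gpu-0", FreeMemory: 10, Healthy: true},
		{ID: "gpu-1", FreeMemory: 40, Healthy: false},
		{ID: "gpu-2", FreeMemory: 30, Healthy: true},
		{ID: "gpu-3", FreeMemory: 20, Healthy: true},
	}
	picked, err := defaultStrategy().Allocate(devices, 2)
	fmt.Println(picked, err)
}
```

Keeping each stage as a function value makes it cheap to swap a single stage (for example, a device-affinity binder) without touching the other two.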
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-memory-plugin branch from f4ff416 to 5e84d3a Compare October 28, 2025 02:43
JustinChengLZ and others added 4 commits October 29, 2025 10:50
feat: when device affinity of first priority is unable to decide allocation, go to next priority to allocate
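
The priority fallback in this commit message might look something like the sketch below. It rests on two assumptions that the page does not confirm: that each device carries one affinity group per priority level (index 0 being the first priority), and that "unable to decide" means no single group at that level has enough free devices. `Device`, `Groups`, `allocateByAffinity`, and `exampleDevices` are hypothetical names.

```go
package main

import "fmt"

// Device is hypothetical: Groups[p] is the device's affinity group at priority p,
// with index 0 being the first (highest) priority.
type Device struct {
	ID     string
	Groups []string
}

// allocateByAffinity tries each priority level in order. At a level it picks the
// first group (in input order) holding at least n free devices; if no group
// qualifies, it falls through to the next priority.
func allocateByAffinity(free []Device, n int) ([]Device, int, bool) {
	if n <= 0 {
		return nil, -1, false
	}
	maxLevels := 0
	for _, d := range free {
		if len(d.Groups) > maxLevels {
			maxLevels = len(d.Groups)
		}
	}
	for p := 0; p < maxLevels; p++ {
		byGroup := make(map[string][]Device)
		var order []string // preserves first-seen group order for determinism
		for _, d := range free {
			if p >= len(d.Groups) {
				continue
			}
			g := d.Groups[p]
			if _, seen := byGroup[g]; !seen {
				order = append(order, g)
			}
			byGroup[g] = append(byGroup[g], d)
		}
		for _, g := range order {
			if members := byGroup[g]; len(members) >= n {
				return members[:n], p, true
			}
		}
	}
	return nil, -1, false
}

// exampleDevices models two tightly coupled pairs that share one wider group.
func exampleDevices() []Device {
	return []Device{
		{ID: "gpu-0", Groups: []string{"pair-a", "numa-0"}},
		{ID: "gpu-1", Groups: []string{"pair-a", "numa-0"}},
		{ID: "gpu-2", Groups: []string{"pair-b", "numa-0"}},
		{ID: "gpu-3", Groups: []string{"pair-b", "numa-0"}},
	}
}

func main() {
	for _, n := range []int{2, 3, 5} {
		picked, priority, ok := allocateByAffinity(exampleDevices(), n)
		fmt.Println(n, picked, priority, ok)
	}
}
```

With these example devices, a request for two devices is satisfied inside a single pair, while a request for three falls through to the wider second-priority group.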

fix: simplify logic of unallocated devices and change name of field
feat(gpu): implement device affinity binding strategy
- introduce DefaultResourceStateGeneratorRegistry for resource state generation
- add SetResourceState method to state interface
- move strategy registry to separate package
- enhance GenericAllocationStrategy with dynamic strategy selection
- update device topology registry with thread-safe operations
- consolidate GPU and RDMA device plugin initialization
- improve state checkpoint handling with resource state generators
- add custom strategy configuration options
refactor gpu plugin state and allocation strategy manager
2 participants