Merged

37 commits
0123bb1
feat: support hami ascend device
DSFans2014 Nov 6, 2025
11657ea
refactor: revert ASCEND310PvGPU
DSFans2014 Nov 6, 2025
5740a81
fix: fix gemini comment
DSFans2014 Nov 6, 2025
f9d3dfc
refactor: rm unused code in test
DSFans2014 Nov 6, 2025
702d723
fix: fix cicd
DSFans2014 Nov 7, 2025
4461b3f
fix: fix style
DSFans2014 Nov 7, 2025
13fa9d9
fix: fix comment
DSFans2014 Nov 7, 2025
8502b35
docs: add docs
DSFans2014 Nov 7, 2025
f8ff355
fix: fix commit
DSFans2014 Nov 7, 2025
d11d22d
fix: fix the comment about log level
DSFans2014 Nov 7, 2025
6b81acf
fix: fix comment
DSFans2014 Nov 10, 2025
c8a7544
docs: update user guide
DSFans2014 Nov 10, 2025
20ed063
fix: fix ASCEND310P names
DSFans2014 Nov 10, 2025
7ac7f29
update documents
archlitchi Nov 10, 2025
9e41055
update documents
archlitchi Nov 10, 2025
9778f77
update documents
archlitchi Nov 10, 2025
1563281
fix: fix the title of docs
DSFans2014 Nov 10, 2025
6f392a0
fix: fix typo
DSFans2014 Nov 10, 2025
4c58fc2
docs: update ascend-device-configmap file name
DSFans2014 Nov 10, 2025
65d75b4
add user guide of mindcluster mode in the doc
JackyTYang Nov 11, 2025
0e9abf6
Merge branch 'feat/support-hami-ascend-device' of github.com:DSFans20…
JackyTYang Nov 11, 2025
de55e81
docs: update docs
DSFans2014 Nov 11, 2025
dfee718
refactor: add hami and mindcluster dir
DSFans2014 Nov 13, 2025
cafdeea
refactor: rename ascend/device_info
DSFans2014 Nov 13, 2025
671309e
fix: fix package name
DSFans2014 Nov 13, 2025
7770021
fix: fix package name
DSFans2014 Nov 13, 2025
fc3ea66
fix: fix node_info_test
DSFans2014 Nov 13, 2025
9b9b033
fix: fix node_info_test
DSFans2014 Nov 13, 2025
5f1f174
Merge branch 'master' of github-james:DSFans2014/volcano into feat/su…
DSFans2014 Nov 13, 2025
359f08a
refactor: rename package
DSFans2014 Nov 13, 2025
1887e05
fix: fix style
DSFans2014 Nov 13, 2025
f325340
refactor: add ExtractResourceRequest fun
DSFans2014 Nov 13, 2025
6f6c3ee
fix: fix compile
DSFans2014 Nov 13, 2025
4d07e4b
refactor: mv the hami definition to hami dir
DSFans2014 Nov 17, 2025
a7fa2cd
fix: fix compile error
DSFans2014 Nov 17, 2025
d97a5b3
Merge branch 'master' of github-james:DSFans2014/volcano into feat/su…
DSFans2014 Nov 17, 2025
bab5654
Merge branch 'master' of github-james:DSFans2014/volcano into feat/su…
DSFans2014 Nov 18, 2025
293 changes: 293 additions & 0 deletions docs/user-guide/how_to_use_vnpu.md
# Ascend vNPU User Guide

## Introduction

Volcano supports **two vNPU modes** for sharing Ascend devices:

---

### 1. MindCluster mode

**Description**:

The initial version of [MindCluster](https://gitcode.com/Ascend/mind-cluster), the official Ascend cluster scheduling add-on, required custom modifications and recompilation of Volcano. Furthermore, it only supported Volcano releases 1.7 and 1.9, which complicated its use and cut users off from newer Volcano features.

To address this, we have integrated its core scheduling logic for Ascend vNPU into Volcano's native device-share plugin, which is designed specifically for scheduling and sharing heterogeneous resources like GPUs and NPUs. This integration provides seamless access to vNPU capabilities through the procedure below, while maintaining full compatibility with the latest Volcano features.

**Use case**:

- vNPU cluster for the Ascend 310 series
- Support for more chip types to come

---

### 2. HAMi mode

**Description**:

This mode is developed by the third-party community HAMi, which also developed the [volcano-vgpu](./how_to_use_volcano_vgpu.md) feature. It supports vNPU for both the Ascend 310 and Ascend 910 series, and it can manage heterogeneous Ascend clusters (clusters that mix multiple Ascend types, e.g. 910A, 910B2, 910B3, 310P).

**Use case**:

- NPU and vNPU cluster for the Ascend 910 series
- NPU and vNPU cluster for the Ascend 310 series
- Heterogeneous Ascend cluster

---

## Installation

To enable vNPU scheduling, the following components must be set up based on the selected mode:


**Prerequisites**:

- Kubernetes >= 1.16
- Volcano >= 1.14
- [ascend-docker-runtime](https://gitcode.com/Ascend/mind-cluster/tree/master/component/ascend-docker-runtime) (for HAMi mode)

### Install Volcano:

* Follow instructions in [Volcano Installer Guide](https://github.com/volcano-sh/volcano?tab=readme-ov-file#quick-start-guide)

### Install ascend-device-plugin and third-party components

In this step, you need to install the ascend-device-plugin that matches the vNPU mode you selected. MindCluster mode additionally requires several third-party components from Ascend.

---

#### MindCluster Mode

##### Install Third-Party Components

Follow the official [Ascend documentation](https://www.hiascend.com/document/detail/zh/mindcluster/72rc1/clustersched/dlug/mxdlug_start_006.html#ZH-CN_TOPIC_0000002470358262__section1837511531098) to install the following components:
- NodeD
- Ascend Device Plugin
- Ascend Docker Runtime
- ClusterD
- Ascend Operator

> **Note:** Skip the installation of `ascend-volcano` mentioned in the document above, as we have already installed native Volcano from the Volcano community in the **Prerequisites** section.

**Configuration Adjustment for Ascend Device Plugin:**

When installing `ascend-device-plugin`, you must set the `presetVirtualDevice` parameter to `"false"` in the `device-plugin-310P-volcano-v{version}.yaml` file to enable dynamic virtualization of 310P:

```yaml
...
args: [
  "device-plugin",
  "-useAscendDocker=true",
  "-volcanoType=true",
  "-presetVirtualDevice=false",
  "-logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log",
  "-logLevel=0"
]
...
```
For detailed information, please consult the official [Ascend MindCluster documentation.](https://www.hiascend.com/document/detail/zh/mindcluster/72rc1/clustersched/dlug/cpaug_0020.html)
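One way to make this change before deploying is a quick in-place edit, sketched below under the assumption that the manifest is available locally (`$FILE` stands in for `device-plugin-310P-volcano-v{version}.yaml`):

```shell
# Flip presetVirtualDevice from true to false in the device-plugin manifest.
# $FILE is a placeholder for device-plugin-310P-volcano-v{version}.yaml.
sed -i 's/-presetVirtualDevice=true/-presetVirtualDevice=false/' "$FILE"
```

You can of course make the same change in an editor; the point is only that the flag must read `false` before the plugin is applied.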

##### Scheduler Config Update
```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: predicates
      - name: deviceshare
        arguments:
          deviceshare.AscendMindClusterVNPUEnable: true # enable Ascend vNPU
    configurations:
    ...
    - name: init-params
      arguments: {"grace-over-time": "900", "presetVirtualDevice": "false"} # to enable dynamic virtualization, presetVirtualDevice must be set to "false"
```
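After editing the ConfigMap, the scheduler needs to pick up the new configuration. A minimal sketch, assuming the default install names (`volcano-scheduler-configmap.yaml` is a local copy of the ConfigMap above; `volcano-scheduler` is the deployment name in a default install):

```shell
# Apply the updated scheduler configuration
kubectl -n volcano-system apply -f volcano-scheduler-configmap.yaml
# Restarting the scheduler is a safe way to ensure the new
# deviceshare arguments are loaded
kubectl -n volcano-system rollout restart deployment volcano-scheduler
```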

---

#### HAMi mode

##### Label the Node with `ascend=on`

```shell
kubectl label node {ascend-node} ascend=on
```

##### Deploy `hami-scheduler-device` ConfigMap

```shell
kubectl apply -f https://raw.githubusercontent.com/Project-HAMi/ascend-device-plugin/refs/heads/main/ascend-device-configmap.yaml
```

##### Deploy ascend-device-plugin

```shell
kubectl apply -f https://raw.githubusercontent.com/Project-HAMi/ascend-device-plugin/refs/heads/main/ascend-device-plugin.yaml
```

For more information, refer to the [ascend-device-plugin documentation](https://github.com/Project-HAMi/ascend-device-plugin).
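Once the plugin pod is running, the labeled node should advertise `huawei.com/*` resources. A quick check (`{ascend-node}` is a placeholder for your node name):

```shell
# The node's capacity/allocatable section should now list
# huawei.com/Ascend* resources registered by the plugin
kubectl describe node {ascend-node} | grep -i huawei.com
```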

##### Scheduler Config Update
```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: predicates
      - name: deviceshare
        arguments:
          deviceshare.AscendHAMiVNPUEnable: true # enable Ascend vNPU
          deviceshare.SchedulePolicy: binpack # scheduling policy: binpack / spread
          deviceshare.KnownGeometriesCMNamespace: kube-system
          deviceshare.KnownGeometriesCMName: hami-scheduler-device
```

**Note:** `volcano-vgpu` has its own `KnownGeometriesCMName` and `KnownGeometriesCMNamespace`. If you want to use both vNPU and vGPU in the same Volcano cluster, you need to merge the ConfigMaps from both sides and reference the merged ConfigMap here.

## Usage

Usage differs depending on the mode you selected.

---

### MindCluster mode

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindx-dls
  namespace: vnpu
  labels:
    ring-controller.atlas: ascend-310P
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
  - event: PodEvicted
    action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 3
  queue: default
  tasks:
  - name: "default-test"
    replicas: 1
    template:
      metadata:
        labels:
          app: infers
          ring-controller.atlas: ascend-310P
          vnpu-dvpp: "null"
          vnpu-level: low
      spec:
        schedulerName: volcano
        containers:
        - name: resnet50infer
          image: swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.1.RC1-300I-Duo-py311-openeuler24.03-lts
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: false
          command: ["/bin/bash", "-c", "tail -f /dev/null"]
          resources:
            requests:
              huawei.com/npu-core: 8
            limits:
              huawei.com/npu-core: 8
        nodeSelector:
          host-arch: huawei-arm
```
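To try this example, save the manifest locally and submit it. A sketch, assuming the file name `vnpu-job.yaml` (the `vcjob` short name is provided by Volcano's Job CRD):

```shell
kubectl create namespace vnpu       # the Job above expects this namespace
kubectl apply -f vnpu-job.yaml
kubectl get vcjob -n vnpu mindx-dls # check the Volcano Job status
kubectl get pods -n vnpu -o wide    # see where the vNPU task was scheduled
```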

The supported Ascend chips and their `ResourceNames` are shown in the following table:

| ChipName | JobLabel and TaskLabel | ResourceName |
|-------|------------------------------------|-------|
| 310P3 | ring-controller.atlas: ascend-310P | huawei.com/npu-core |

**Description of Labels in the Virtualization Task YAML**

| **Key** | **Value** | **Description** |
| ------------------------- | --------------- |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **vnpu-level** | **low** | Low configuration (default). Selects the lowest-configuration "virtualized instance template." |
| | **high** | Performance-first. When cluster resources are sufficient, the scheduler will choose the highest-configured virtualized instance template possible. When most physical NPUs in the cluster are already in use and only a few AI Cores remain on each device, the scheduler will allocate templates that match the remaining AI Core count rather than forcing high-profile templates. For details, refer to the table below. |
| **vnpu-dvpp** | **yes** | The Pod uses DVPP. |
| | **no** | The Pod does not use DVPP. |
| | **null** | Default value. DVPP usage is not considered. |
| **ring-controller.atlas** | **ascend-310P** | Indicates that the task uses products from the Atlas inference series. |

**Effect of DVPP and Level Configurations**

| **Product Model** | **Requested AI Core Count** | **vnpu-dvpp** | **vnpu-level** | **Downgrade** | **Selected Template** |
| --------------------------------------- | --------------------------- |---------------| -------------------- | ------------- | --------------------- |
| **Atlas Inference Series (8 AI Cores)** | **1** | `null` | Any value | – | `vir01` |
| | **2** | `null` | `low` / other values | – | `vir02_1c` |
| | **2** | `null` | `high` | No | `vir02` |
| | **2** | `null` | `high` | Yes | `vir02_1c` |
| | **4** | `yes` | `low` / other values | – | `vir04_4c_dvpp` |
| | **4** | `no` | `low` / other values | – | `vir04_3c_ndvpp` |
| | **4** | `null` | `low` / other values | – | `vir04_3c` |
| | **4** | `yes` | `high` | – | `vir04_4c_dvpp` |
| | **4** | `no` | `high` | – | `vir04_3c_ndvpp` |
| | **4** | `null` | `high` | No | `vir04` |
| | **4** | `null` | `high` | Yes | `vir04_3c` |
| | **8 or multiples of 8** | Any value | Any value | – | – |
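As a worked example of the table above, here is a task-template fragment (fields taken from the Job example earlier) requesting 4 AI Cores without DVPP, which the table maps to the `vir04_3c_ndvpp` template:

```yaml
# Pod template labels and resource request for a 4-core, no-DVPP vNPU
metadata:
  labels:
    ring-controller.atlas: ascend-310P
    vnpu-dvpp: "no"   # the task does not use DVPP
    vnpu-level: low   # with 4 cores and dvpp=no -> vir04_3c_ndvpp per the table
spec:
  containers:
  - name: infer
    resources:
      requests:
        huawei.com/npu-core: 4
      limits:
        huawei.com/npu-core: 4
```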


**Notice**

For **chip virtualization (non-full-card usage)**, the value of `vnpu-dvpp` must strictly match the corresponding value listed in the table above. Any other value will cause the task to fail to be scheduled.


For detailed information, please consult the official [Ascend MindCluster documentation.](https://www.hiascend.com/document/detail/zh/mindcluster/72rc1/clustersched/dlug/cpaug_0020.html)

---

### HAMi mode

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ascend-pod
spec:
  schedulerName: volcano
  containers:
  - name: ubuntu-container
    image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        huawei.com/Ascend310P: "1"
        huawei.com/Ascend310P-memory: "4096"
```
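Once the pod is running, you can confirm scheduling and inspect the device from inside the container. A sketch, which assumes `npu-smi` is made available in the container by ascend-docker-runtime:

```shell
kubectl get pod ascend-pod -o wide      # confirm the pod was scheduled
kubectl exec ascend-pod -- npu-smi info # inspect the NPU visible to the pod
```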

The supported Ascend chips and their `ResourceNames` are shown in the following table:

| ChipName | ResourceName | ResourceMemoryName |
|-------|-------|-------|
| 910A | huawei.com/Ascend910A | huawei.com/Ascend910A-memory |
| 910B2 | huawei.com/Ascend910B2 | huawei.com/Ascend910B2-memory |
| 910B3 | huawei.com/Ascend910B3 | huawei.com/Ascend910B3-memory |
| 910B4 | huawei.com/Ascend910B4 | huawei.com/Ascend910B4-memory |
| 910B4-1 | huawei.com/Ascend910B4-1 | huawei.com/Ascend910B4-1-memory |
| 310P3 | huawei.com/Ascend310P | huawei.com/Ascend310P-memory |
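The same request pattern applies to the other chips in the table. For illustration only, a hypothetical resource block for one slice of a 910B3 (the memory value here is an arbitrary example; the unit follows the plugin's geometry ConfigMap, commonly megabytes):

```yaml
resources:
  limits:
    huawei.com/Ascend910B3: "1"
    huawei.com/Ascend910B3-memory: "8192" # illustrative value only
```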