aborting stress test leaves GPU at 100% and in P0 #14

@sherwoac

Hi.

I compiled gst for sm_120a and ran it on Ubuntu 24.

I ran gst and pressed CTRL+C after a while, as shown in the output below.
Afterwards the GPU stayed at 100% utilization in the P0 state with no running processes listed, and I had to reboot to clear it. This is repeatable.

dam@z10:~/CODE/GPUStressTest/build$ ./gst 1
./gst capturing GPU information...
WATCHDOG starting, TIMEOUT: 600 seconds
Detected 1 CUDA Capable device(s)
./gst Done.
Device 0: "NVIDIA GeForce RTX 5090"
./gst done capturing GPU information.
DEBUG_MATRIX_SIZES: Checking matrix size only (no CUDA execution) for: T4
Initilizing T4 based test suite
GPU Memory: 31, memgb: 16


Device 0: "NVIDIA GeForce RTX 5090", PCIe: a
stress_tests[0].test_name FP16
P hsh
m 31864
n 38648
k 88304
ta 0
tb 1
B 0

***** STARTING TEST 0: FP16 On Device 0 NVIDIA GeForce RTX 5090
testing cublasLt
Allocate matrixSize Total Bytes A + B + C:  14915943040 
#### args: ta=N tb=T m=31864 n=38648 k=88304  alpha = (0x3f800000, 1) beta= (0x00000000, 0)
#### args: lda=31864 ldb=38648 ldc=31864 ldd=31864 loop=10
^^^^ CUDA : elapsed = 15.22 sec,  Gflops = 142896.701 
testing cublasLt pass
***** TEST FP16 On Device 0 NVIDIA GeForce RTX 5090
stress_tests[1].test_name C32
P ccc
m 11432
n 16424
k 61000
ta 0
tb 1
B 0

***** STARTING TEST 1: C32 On Device 0 NVIDIA GeForce RTX 5090
testing cublasLt
Allocate matrixSize Total Bytes A + B + C:  15095801344 
#### args: ta=N tb=T m=11432 n=16424 k=61000  alpha = (0x3f800000 1), (0x00000000 0) beta= (0x00000000 0), (0x00000000 0)
#### args: lda=11432 ldb=16424 ldc=11432 ldd=11432 loop=10
^C
adam@z10:~/CODE/GPUStressTest/build$ nvidia-smi
Thu Jul  3 14:19:07 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:0A:00.0 Off |                  N/A |
| 41%   57C    P0            129W /  575W |       2MiB /  32607MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
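
One possible mitigation, sketched here under assumptions rather than taken from gst's source: install a SIGINT handler that only sets a flag, let the test loop check that flag between iterations, and run cleanup before the process exits. The names `g_stop`, `on_sigint`, `install_handler`, and `run_loop` are hypothetical, and the cleanup step is a placeholder for where a CUDA program would call something like `cudaDeviceSynchronize()` and `cudaFree()`. This is not a diagnosis of the behavior above, only an illustration of the pattern.

```cpp
#include <csignal>

// Set by the signal handler. sig_atomic_t is the type the standard
// guarantees can be written safely from a signal handler.
static volatile std::sig_atomic_t g_stop = 0;

// Handler does the minimum: record that an interrupt was requested.
extern "C" void on_sigint(int) { g_stop = 1; }

// Hypothetical helper: route Ctrl+C (SIGINT) to the handler above.
void install_handler() { std::signal(SIGINT, on_sigint); }

// Hypothetical test loop. `step` stands in for one GEMM iteration.
// Returns how many iterations completed before the loop stopped.
int run_loop(int max_iters, void (*step)(int)) {
    int done = 0;
    for (int i = 0; i < max_iters && !g_stop; ++i) {
        step(i);   // one unit of work; the flag is checked before the next one
        ++done;
    }
    // Cleanup placeholder (assumption): wait for outstanding GPU work and
    // release device buffers here instead of letting the process die mid-kernel.
    return done;
}
```

With this shape, an interrupt arriving during an iteration lets that iteration finish, after which the loop exits and the cleanup path runs, instead of the process being terminated with work still queued on the device.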
