Failure on A100 Card #5

@karthik86248

Running the GPUStress tool on an A100 card reports the error below. However, the card appears to be healthy and working correctly according to the hardware tests performed by our hardware vendor.

Command Executed: ./gst -T=1
Output:
./gst capturing GPU information...
WATCHDOG starting, TIMEOUT: 600 seconds
Detected 2 CUDA Capable device(s)
Device 0: "NVIDIA A100 80GB PCIe"
Device 1: "NVIDIA A100 80GB PCIe"
Initilizing A100 80 GB based test suite
TYPE=2
GPU Memory: 79, memgb: 80
Device 0: "NVIDIA A100 80GB PCIe", PCIe: 17
***** STARTING TEST 0: INT8 On Device 0 NVIDIA A100 80GB PCIe

math_type 10

args: matrixSizeA 34878833064 matrixSizeB 16672535724 matrixSizeC 28662757344

args: ta=N tb=T m=244872 n=117052 k=142437 lda=7835904 ldb=3745792 ldc=7835904

loop=1
***** TEST INT8 On Device 0 NVIDIA A100 80GB PCIe
***** TEST PASSED ****
TEST TIME: 24 seconds
***** STARTING TEST 1: FP16 On Device 0 NVIDIA A100 80GB PCIe

math_type 0

args: matrixSizeA 13000629632 matrixSizeB 13114567936 matrixSizeC 13557084032

args: ta=N tb=N m=115928 n=116944 k=112144 lda=115928 ldb=112144 ldc=115928

loop=1
***** TEST FP16 On Device 0 NVIDIA A100 80GB PCIe
***** TEST PASSED ****
TEST TIME: 17 seconds
***** STARTING TEST 2: TF32 On Device 0 NVIDIA A100 80GB PCIe

math_type 0

args: matrixSizeA 18964675584 matrixSizeB 6192059904 matrixSizeC 14359400064

std::exception: out of memory
testing cublasLt fail
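
A quick sanity check on the numbers in the log, as a sketch only. For the INT8 and FP16 tests, the printed matrixSizeA/B/C values equal m*k, k*n and m*n, so they look like element counts rather than byte counts. The TF32 test failed before printing m/n/k, so only its totals can be summed. The bytes-per-element values below are assumptions (not taken from the gst output or source), and the helper names are hypothetical; the estimate may or may not reflect what the tool actually allocates.

```python
# Values copied from the log above. TF32 has no m/n/k because the
# test failed before that line was printed.
tests = {
    "INT8": {"m": 244872, "n": 117052, "k": 142437,
             "sizes": (34878833064, 16672535724, 28662757344)},
    "FP16": {"m": 115928, "n": 116944, "k": 112144,
             "sizes": (13000629632, 13114567936, 13557084032)},
    "TF32": {"sizes": (18964675584, 6192059904, 14359400064)},
}

# ASSUMPTION: storage width per element for each test type.
# These are guesses for illustration, not values reported by the tool.
assumed_bytes_per_element = {"INT8": 1, "FP16": 2, "TF32": 4}


def sizes_match_dims(test):
    """Return True if matrixSizeA/B/C equal m*k, k*n, m*n."""
    m, n, k = test["m"], test["n"], test["k"]
    return test["sizes"] == (m * k, k * n, m * n)


def footprint_gib(sizes, bytes_per_element):
    """Estimated size in GiB if every element uses bytes_per_element bytes."""
    return sum(sizes) * bytes_per_element / 2**30


for name, test in tests.items():
    if "m" in test:
        print(name, "sizes equal m*k, k*n, m*n:", sizes_match_dims(test))
    total = sum(test["sizes"])
    estimate = footprint_gib(test["sizes"], assumed_bytes_per_element[name])
    print(f"{name}: {total} elements, about {estimate:.1f} GiB under the assumed width")
```

Comparing these estimates with the free memory reported for the device just before the TF32 test starts may help narrow down whether the request itself is too large or whether memory from the earlier tests was still in use.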
