[Bug]: Metadata collector crashes on nodes running driver 550 or older #810

@lalitadithya

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Bug Description

The metadata collector calls nvmlDeviceGetPlatformInfo, which is only available from R560 onwards, so clusters running R550 or older cannot run the metadata collector.

nvmlDeviceGetPlatformInfo is only needed to get the chassis number, which applies to Blackwell and newer. Can we skip this API call based on the driver version or some other property?
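
To illustrate the driver-version option, here is a rough sketch in Python. It is not the collector's actual code: the names `driver_major`, `supports_platform_info`, and `collect_chassis` are hypothetical, and the 560 threshold is taken from the availability stated above. The version string is assumed to have the usual dotted form (for example "550.54.15"), as returned by the NVML driver version query.

```python
# Hypothetical sketch: gate the chassis lookup on the driver branch.
# The threshold comes from this issue (the call is available from R560 onwards).
MIN_PLATFORM_INFO_MAJOR = 560


def driver_major(version):
    """Return the major branch of a dotted driver version such as '550.54.15'."""
    head = version.strip().split(".", 1)[0]
    if not head.isdecimal():
        raise ValueError(f"unrecognized driver version: {version!r}")
    return int(head)


def supports_platform_info(version):
    """True when the driver branch is new enough to export the call."""
    return driver_major(version) >= MIN_PLATFORM_INFO_MAJOR


def collect_chassis(version, query_platform_info):
    """Run the (caller-supplied) platform info query only on supported drivers.

    Returns None on older drivers so the rest of the metadata can still
    be collected without the chassis number.
    """
    if not supports_platform_info(version):
        return None
    return query_platform_info()
```

With something like this, a node on R550 would simply report no chassis number instead of failing at startup. Note that a version check alone may not be enough if the symbol is bound when the binary loads, since the lookup error would then occur before any check runs.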

Component

Core Service

Steps to Reproduce

  1. Install any version of NVSentinel on a cluster running the R550 driver
  2. Check the metadata collector logs

Environment

  • NVSentinel version: any
  • Kubernetes version: any
  • Deployment method: helm

Logs/Output

symbol lookup error: undefined symbol: nvmlDeviceGetPlatformInfo
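
Since the failure is an undefined symbol, "some other property" could be whether the installed library exports the symbol at all. A minimal sketch of that probe in Python with ctypes, assuming the NVML shared library is loadable as libnvidia-ml.so.1; `load_nvml` and `has_symbol` are hypothetical helper names, not part of NVSentinel or NVML:

```python
import ctypes

# Symbol named in the lookup error above.
PLATFORM_INFO_SYMBOL = "nvmlDeviceGetPlatformInfo"


def load_nvml(path="libnvidia-ml.so.1"):
    """Try to load the NVML shared library; return None if it is not present."""
    try:
        return ctypes.CDLL(path)
    except OSError:
        return None


def has_symbol(lib, name):
    """Return True if the loaded library exposes `name`.

    ctypes resolves symbols lazily on attribute access and raises
    AttributeError when the symbol is missing, so no call is made here.
    """
    try:
        getattr(lib, name)
    except AttributeError:
        return False
    return True


if __name__ == "__main__":
    nvml = load_nvml()
    if nvml is None:
        print("NVML library not found")
    elif has_symbol(nvml, PLATFORM_INFO_SYMBOL):
        print("platform info call is available")
    else:
        print("platform info call is missing; skip the chassis lookup")
```

The advantage over a version comparison is that it tests the actual capability, so it does not depend on the version threshold being right.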

Labels

bug (Something isn't working)
