NixOS-CUDA CI/CD Infrastructure, including NixOS configurations for Hydra and the builders. This is not an official NixOS project.
The purpose of this system is to improve the maintainability of hardware-accelerated (specifically CUDA) software in Nixpkgs. Sustainable maintenance and development of Nixpkgs CUDA requires both a comprehensive test suite run on a schedule, for retroactive detection, and a lighter on-push test suite that notifies contributors early and prevents regressions from being merged.
We aim to detect and distinguish between:
- build failures;
- breakages of basic functionality, like the loading of shared libraries by downstream applications in their GPU branches;
- architecture-specific errors;
- errors in collective communication libraries;
- regressions in performance and closure sizes.
Overview of the currently available hardware and access:
| Hostname | Purpose | IP | GPU | GPU architecture |
|---|---|---|---|---|
| ada | GPU builder | ada.nixos-cuda.org - 144.76.101.55 | RTX 4000 Ada (SFF) | Ada Lovelace |
| pascal | GPU builder | pascal.nixos-cuda.org - 95.216.72.164 | GeForce GTX 1080 | Pascal |
| hydra | Hydra + binary cache | hydra.nixos-cuda.org - 37.27.129.22 | - | - |
| atlas | CPU builder | atlas.nixos-cuda.org - 95.216.20.88 | - | - |
| oxide-1 | CPU builder (provided by Oxide computers) | oxide-1.nixos-cuda.org - 45.154.216.118 | - | - |
Hydra jobsets
We are using declarative Hydra jobsets. All jobsets are defined in a dedicated repository.
Here are the jobsets currently running on Hydra.
- `cuda-packages-[un]stable`: builds nixpkgs's `release-cuda.nix` jobset.
- `cuda-gpu-tests-[un]stable`: runs the nixpkgs GPU tests on builders with `cuda` capability.
Learn more here.
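The `cuda` capability mentioned above corresponds to a Nix system feature: a derivation that sets `requiredSystemFeatures = [ "cuda" ]` is only scheduled on machines advertising that feature. The following is a minimal sketch of how a GPU builder could be registered on the scheduling host using the standard NixOS `nix.buildMachines` option. It is not the configuration actually deployed here; the `systems` value, SSH user, key path, and `maxJobs` are assumptions chosen for illustration.

```nix
# Hypothetical NixOS module for the host that dispatches builds.
{
  nix.distributedBuilds = true;

  nix.buildMachines = [
    {
      hostName = "ada.nixos-cuda.org";
      systems = [ "x86_64-linux" ];            # assumption: builder architecture
      protocol = "ssh-ng";
      sshUser = "builder";                     # hypothetical user name
      sshKey = "/run/secrets/builder-ssh-key"; # hypothetical key path
      maxJobs = 4;                             # illustrative value
      # Derivations with `requiredSystemFeatures = [ "cuda" ]`
      # can only be built on machines listing "cuda" here.
      supportedFeatures = [ "cuda" "big-parallel" ];
    }
  ];
}
```

CPU-only builders would be listed the same way, without `"cuda"` in `supportedFeatures`, so GPU tests are never dispatched to them.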
Hydra's binary cache is exposed for development purposes. For a compliant way to consume CUDA with Nix, refer to NVIDIA. The substituter is currently backed by harmonia.
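Outside of NixOS, the same cache can be requested per project. The sketch below uses the flake-level `nixConfig` attribute with `extra-substituters` and `extra-trusted-public-keys`. The nixpkgs branch, the `x86_64-linux` system, and the choice of `cudaPackages.saxpy` as the example package are assumptions for illustration; whether a given build is actually substituted depends on it matching what Hydra built.

```nix
# flake.nix (illustrative sketch)
{
  # Nix asks for confirmation before honoring these settings,
  # unless `accept-flake-config` is enabled by the user.
  nixConfig = {
    extra-substituters = [ "https://cache.nixos-cuda.org" ];
    extra-trusted-public-keys = [
      "cache.nixos-cuda.org:74DUi4Ye579gUqzH4ziL9IyiJBlDpMRn9MBN8oNan9M="
    ];
  };

  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable"; # assumed branch

  outputs = { self, nixpkgs }:
    let
      system = "x86_64-linux"; # assumption
      pkgs = import nixpkgs {
        inherit system;
        # CUDA is unfree, and CUDA-enabled variants are opt-in in nixpkgs.
        config = {
          allowUnfree = true;
          cudaSupport = true;
        };
      };
    in
    {
      # Example package only; any CUDA-enabled attribute could be used.
      packages.${system}.default = pkgs.cudaPackages.saxpy;
    };
}
```

On NixOS, the equivalent system-wide settings are: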
{
  nix.settings.substituters = [
    "https://cache.nixos-cuda.org"
  ];
  nix.settings.trusted-public-keys = [
    "cache.nixos-cuda.org:74DUi4Ye579gUqzH4ziL9IyiJBlDpMRn9MBN8oNan9M="
  ];
}

- Hardware monitoring
  - Set up Grafana -> https://grafana.nixos-cuda.org
  - Monitor basic hardware metrics
  - Monitor GPU metrics
    - Set up `collectd` + Prometheus on the GPU nodes
    - Add the `cuda-gpu` dashboard
  - Harmonia binary cache monitoring in Grafana.
    Once we update to `nixos-26.05`, Harmonia will be at version `>=0.3.0`, which ships the ability to monitor metrics from Grafana.
- Coverage
  - Remove hard-coded attribute lists: cf. "Collect `gpuChecks` by following `recurseIntoAttrs`" in "MVE"; same for packages.
  - Data-Center Hardware and Multi-GPU set-ups
    - Probably requires ephemeral builders due to cost.
    - Currently no multi-GPU/collective communications test-suites available in Nixpkgs.
  - Jetson (tentatively, based on owned hardware and colocation)
- Efficiency:
  - `harmonia` → `snix-narbridge`;
  - virtiofsd flat stores → snix virtiofs; in particular, we should hope to eliminate the inefficient Nix substitution;
  - Ephemeral Builders:
    - Make NixOS work on Azure (under pain limits).
    - Basic functionality: on-demand deployment and automatic deallocation of remote builders; hooking up the builders to Hydra.
    - IO costs: synchronizing the closures is likely to be the bottleneck. Cf. the snix virtio story.
- Isolation and Access Control:
  - [Serge] Move remote builders, Hydra, and web services to microvms with isolated stores.
  - Prevent unaudited SSH access to hypervisors and to Hydra (currently Gaetan and Serge in authorized keys).
  - Pull-based Deployment.
- Minimal Viable Example:
  - [third parties via Jonas] Initial funding for GPU hardware.
  - [Jonas] GitHub organization, domain names, web page.
  - [Gaetan] Set up NixOS and Hydra.
  - [Gaetan] ZFS Nix store on `ada`, `pascal`.
  - [Gaetan] Set up `sops-nix` for managing the secrets.
  - [Gaetan] Hydra.
  - [Gaetan] Back up the Hydra configuration (DB?, jobsets?).
  - [Gaetan] Move Hydra to `ada` (more storage available).
  - [Serge] Figure out how Hydra inputs work.
  - Open PR for cuda-gpu-tests jobset (currently the input points at Gaetan's branch) -> NixOS/nixpkgs#454251
  - Collect `gpuChecks` by following `recurseIntoAttrs` and `passthru.tests` (currently using a hard-coded list). -> nixos-cuda/hydra-jobsets#2
  - Declarative jobsets (currently configured via web UI). -> nixos-cuda/hydra-jobsets#4
  - [Gaetan] Expose binary cache
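
As a rough illustration of the "Monitor basic hardware metrics" item in the list above, the sketch below pairs a Prometheus node exporter on each builder with a scrape job on a monitoring host. Note that it swaps in the node exporter rather than the `collectd` setup named in the roadmap, and it does not cover GPU metrics. The module names `builder` and `monitor`, the port, and the choice of scrape targets are assumptions, not the deployed configuration.

```nix
# Two hypothetical NixOS modules, grouped in one attribute set for brevity.
{
  # Imported on each builder: expose basic hardware metrics.
  builder = {
    services.prometheus.exporters.node = {
      enable = true;
      port = 9100;
      enabledCollectors = [ "systemd" ];
    };
  };

  # Imported on the monitoring host: scrape the builders and run Grafana.
  monitor = {
    services.prometheus = {
      enable = true;
      scrapeConfigs = [
        {
          job_name = "node";
          static_configs = [
            {
              # Assumed targets, taken from the hardware table.
              targets = [
                "ada.nixos-cuda.org:9100"
                "pascal.nixos-cuda.org:9100"
              ];
            }
          ];
        }
      ];
    };

    services.grafana.enable = true;
  };
}
```

In practice the exporter port would need to be reachable only from the monitoring host, for example through firewall rules or a private network.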