150 changes: 94 additions & 56 deletions docs/compute/monitoring-resources.md
@@ -1,59 +1,12 @@
# Monitoring Resources

CU Research Computing has two main tools that can help users monitor their HPC resources:
* [Slurmtools](#slurmtools): A [module](./modules.md) that loads a collection of functions to assess recent usage statistics
* [Slurm commands](#monitoring-through-slurm-commands): Slurm provides built-in commands that allow users to retrieve usage summaries, job efficiency data, job history, and priority.
* [XDMoD](#xdmod): A web portal for viewing metrics at the system, partition, and user levels.

## Slurmtools
Slurmtools is a collection of helper scripts for everyday use of the [SLURM](https://slurm.schedmd.com/overview.html) job scheduler. It can be loaded as a module from any node (including login nodes) and can help answer the following questions:
* How many core hours (SUs) have I used recently?
* Who is using all of the SUs on my group's account?
* What jobs have I run over the past few days?
* What is my priority?
* How efficient are my jobs?

### __Step 1__: Log in
If you have a CURC account, log in as you [normally would](../getting_started/logging-in.md) from a terminal using your IdentiKey and Duo:

```bash
$ ssh ralphie@login.rc.colorado.edu
```

### __Step 2__: Load the slurm module for the HPC resource you want to query metrics about (Blanca, Alpine):
```bash
$ module load slurm/alpine # use slurm/blanca for Blanca
```

### __Step 3__: Load the `slurmtools` module
```bash
$ module load slurmtools
```

You will see the following informational message:

```
You have successfully loaded slurmtools, a collection of functions
to assess recent usage statistics. Available commands include:

'suacct' (SU usage for each user of a specified account over N days)

'suuser' (SU usage for a specified user over N days)

'seff' (CPU and RAM efficiency for a specified job)

'seff-array' (CPU, RAM, and time efficiency for a specified array job)

'jobstats' (job statistics for all jobs run by a specified user over N days)

'levelfs' (current fair share priority for a specified user)


Type any command without arguments for usage instructions
```

### __Step 4__: Get some metrics!

#### How many Service Units (core hours) have I used?
## Monitoring Through Slurm Commands
You can obtain important usage and efficiency metrics directly through Slurm's built-in commands to answer the following questions:
### How many Service Units (core hours) have I used?

Type the command name for a usage hint:
```bash
@@ -85,7 +38,7 @@ This output tells us that:
* Ralphie's usage by account varied from 3,812 SUs to 15,987 SUs


#### Who is using all of the SUs on my group's account?
### Who is using all of the SUs on my group's account?

Type the command name for a usage hint:
```bash
@@ -121,7 +74,7 @@ This output tells us that:
* Five users used the account in the past 180 days.
* Their usage ranged from 24 SUs to 84,216 SUs

#### What jobs have I run over the past few days?
### What jobs have I run over the past few days?

Type the command name for a usage hint:
```bash
@@ -166,7 +119,7 @@ This output tells me that:
* The elapsed times ranged from 0 hours to 1 hour and 48 minutes


#### What is my priority?
### What is my priority?

Type the command name for a usage hint:
```bash
@@ -212,7 +165,7 @@ What is "Priority"?
* Your "Fair Share" priority has a half life of 14 days (i.e., it recovers fully in ~1 month with zero usage)
```

#### How efficient are my jobs?
### How efficient are my jobs?

Type the command name for a usage hint:
```bash
@@ -250,7 +203,7 @@ This output tells us that:

This information is also sent to users who include the `--mail-type` and `--mail-user` directives in their jobs.

#### How can I check the efficiency of array jobs?
### How can I check the efficiency of array jobs?

Use the `seff-array` command with the help flag for a usage hint:
```
@@ -303,6 +256,91 @@ The above indicates that all of the jobs displayed less than 40% CPU efficiency,
If you are not familiar with Job Arrays in SLURM, you can learn more on the ["Scaling Up with job Arrays" page](../running-jobs/job-arrays.md).
```

### How can I check memory and GPU utilization for my jobs?

Use the help flag for a usage hint:
```
$ sacct -h
```
```
Purpose: This command reports detailed accounting information for completed jobs,
including CPU, memory, and GPU metrics such as `gpumem` and `gpuutil`.
```
To view the maximum GPU resource usage for a job, the command is:
```
Usage:
  sacct -j <jobid> -pno JobID,TRESUsageInMax

options:
  -j <jobid>                Specifies the Slurm job ID you want information about.
  -p                        Produces parsable output, with fields delimited by "|".
  -n                        Removes the header row from the output.
  -o JobID,TRESUsageInMax   Chooses the output fields. Here, you're requesting the job ID and the maximum resource usage recorded for each step.
```
To check the maximum usage metrics for job 18194943, run:
```
$ sacct -j 18194943 -pno JobID,TRESUsageInMax
```
This will display the output:
```
--------------------------------------------------------
18194943||
18194943.batch|cpu=00:56:24,energy=0,fs/disk=350130624578,gres/gpumem=38888M,gres/gpuutil=100,mem=10372780K,pages=3465,vmem=10194376K|
18194943.extern|cpu=00:00:00,energy=0,fs/disk=5273,gres/gpumem=0,gres/gpuutil=0,mem=256K,pages=0,vmem=0|
--------------------------------------------------------
```
This tells us:
| Variable | Description |
| ------------- |------------|
|`cpu` | Total CPU time consumed (hours:minutes:seconds).|
|`energy` | Energy consumed by the job in arbitrary units (often 0 for systems without energy accounting).|
|`fs/disk` | Amount of disk I/O performed by the job (in bytes). |
|`gres/gpumem` | Peak GPU memory usage. |
|`gres/gpuutil` | Peak GPU utilization as a percentage. |
|`mem` | RAM used by the job. |
|`pages` | Number of memory pages used. Typically for advanced monitoring; can often be ignored. |
|`vmem` | Virtual memory used by the job. Includes RAM + swap + memory-mapped files. |

```{note}
`sacct` shows multiple entries for each job. The primary entry is the `.batch` step, which reflects the actual workload and contains the meaningful resource-usage metrics. The `.extern` entry accounts for processes running outside the job script (for example, SSH sessions attached to the job) and will typically show minimal or negligible usage. Reference the `.batch` line when reviewing your job's resource usage.
```
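
Because `sacct` accepts a `<jobid>.<step>` specification, you can also query the `.batch` step directly when that is the only line you need. A minimal sketch using the example job ID from above:
```
$ sacct -j 18194943.batch -pno JobID,TRESUsageInMax
```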

Similarly, to view the average GPU resource usage for a job:
```
Usage:
  sacct -j <jobid> -pno JobID,TRESUsageInAve --noconvert

options:
  -j <jobid>                Filters the report to show only the specified job ID.
  -p                        Produces parsable output, with fields delimited by "|".
  -n                        Removes the header row from the output.
  -o JobID,TRESUsageInAve   Chooses the output fields. Here, you're requesting the job ID and the average resource usage recorded for each step.
  --noconvert               Prevents unit conversion (e.g., 2048M is not converted to 2G), so raw values are reported.
```
To check the average usage metrics for job 18194943, run:
```
$ sacct -j 18194943 -pno JobID,TRESUsageInAve --noconvert
```
This will display the output:
```
--------------------------------------------------------
18194943||
18194943.batch|cpu=00:56:24,energy=0,fs/disk=350130624578,gres/gpumem=40777023488,gres/gpuutil=100,mem=10621726720,pages=3465,vmem=10439041024|
18194943.extern|cpu=00:00:00,energy=0,fs/disk=5273,gres/gpumem=0,gres/gpuutil=0,mem=262144,pages=0,vmem=0|
--------------------------------------------------------
```
This output contains the same fields as the `TRESUsageInMax` example, but the values represent average rather than peak usage, and because `--noconvert` is used, memory values are reported in raw bytes rather than converted units (e.g., `mem=10621726720` bytes instead of `mem=10372780K`).
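
Converting these raw byte counts by hand can make them easier to interpret; for instance, the `gres/gpumem` value above works out to roughly 38 GiB. A minimal sketch, assuming any shell with `awk` available:
```
$ echo 40777023488 | awk '{printf "%.2f GiB\n", $1 / 1024 / 1024 / 1024}'
37.98 GiB
```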

```{important}
- GPU metrics are currently available only on NVIDIA GPUs. Accessing these metrics requires CUDA 12 or newer.
- Currently, AMD GPUs, H200s, and Blanca H100 GPUs do not support GPU metric collection.
- Users running jobs on unsupported GPUs or older CUDA versions will see zero or infinite values for the GPU memory and utilization fields. Make sure your jobs run on compatible hardware to obtain meaningful GPU metrics (see the check sketched below).
```
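
If you are unsure whether a job landed on hardware that supports GPU metric collection, one simple check (our suggestion, not a CURC-specific requirement) is to record the GPU model, driver version, and CUDA version from inside the job:
```
# Add this line to your batch script; it prints the GPU model, driver version,
# and supported CUDA version to the job's output log.
nvidia-smi
```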

## XDMoD

XDMoD is a web portal for viewing metrics at the system, partition, and user levels.
2 changes: 1 addition & 1 deletion docs/getting_started/faq.md
@@ -197,7 +197,7 @@ sstat --jobs=your_job_id --format=User,JobName,JobId,MaxRSS

For more information on `sstat` or `sacct` commands, [take a look at our Useful Slurm Commands tutorial.](../running-jobs/slurm-commands.md) Or visit the Slurm reference pages on [sstat](https://slurm.schedmd.com/sstat.html) and [sacct](https://slurm.schedmd.com/sacct.html).

You can also view information related to service unit (SU) usage and CPU & RAM efficiency by using [slurmtools](../compute/monitoring-resources.md#slurmtools). Note that CPU & RAM efficiency statistics will be included in emails sent when a job completes, if requested.
You can also view information related to service unit (SU) usage and CPU & RAM efficiency on our [Monitoring Resources page](../compute/monitoring-resources.md#monitoring-through-slurm-commands). Note that CPU & RAM efficiency statistics will be included in emails sent when a job completes, if requested.
::::

### How can I see my current FairShare priority?
2 changes: 1 addition & 1 deletion docs/software/curc_provided_software.md
@@ -127,7 +127,7 @@ Before requesting a software installation, please review our [Software installat
| [SAMtools](http://www.htslib.org/doc/samtools.html) | 1.16.1 | Samtools is a suite of programs for interacting with high-throughput sequencing data.|
| [ScaLAPACK](https://netlib.org/scalapack/) | 2.2.0 |ScaLAPACK is a library of high-performance linear algebra routines for parallel distributed memory machines.|
| [Singularity/Apptainer](https://apptainer.org/) | 3.6.4 (D), 3.7.4 |Singularity/Apptainer is a computer program that performs operating-system-level virtualization also known as containerization. [CURC Usage Guide](./containerization.md#apptainer)|
| [Slurmtools](../compute/monitoring-resources.md#slurmtools) | 0.0.0 |A collection of helper scripts for everyday use of the Slurm job scheduler.|
| Slurmtools | 0.0.0 |A collection of helper scripts for everyday use of the Slurm job scheduler.|
| [SQLite](https://sqlite.org/index.html) | 3.36.0, 3.38.01, 3.46.1 |SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine.|
| [SRA-Toolkit](https://hpc.nih.gov/apps/sratoolkit.html) | 3.0.0 | The SRA Toolkit and SDK from NCBI is a collection of tools and libraries for using data in the INSDC Sequence Read Archives.|
| [STAR](https://github.com/alexdobin/STAR/tree/master) | 2.7.10b | A tool to align RNA-seq data. The STAR algorithm uses suffix arrays, seed clustering, and stitching.|