From 9b2bea9797883098523e35952dc1a30be6acec57 Mon Sep 17 00:00:00 2001 From: mohalkh5 Date: Thu, 11 Dec 2025 01:30:39 -0700 Subject: [PATCH 01/26] adding gpu metrics collection command --- docs/compute/monitoring-resources.md | 153 +++++++++++++++--------- docs/getting_started/faq.md | 2 +- docs/software/curc_provided_software.md | 2 +- 3 files changed, 99 insertions(+), 58 deletions(-) diff --git a/docs/compute/monitoring-resources.md b/docs/compute/monitoring-resources.md index 8f90e5da..0bb9666a 100644 --- a/docs/compute/monitoring-resources.md +++ b/docs/compute/monitoring-resources.md @@ -1,59 +1,12 @@ # Monitoring Resources CU Research Computing has two main tools which can help users monitor their HPC resources: -* [Slurmtools](#slurmtools): A [module](./modules.md) that loads a collection of functions to assess recent usage statistics +* [Slurm commands](#monitoring-through-slurm-commands): Slurm provides built-in commands that allow users to retrieve usage summaries, job efficiency data, job history, and priority. * [XDMoD](#xdmod): A web portal for viewing metrics at the system, partition, and user-levels. -## Slurmtools -Slurmtools is a collection of helper scripts for everyday use of the [SLURM](https://slurm.schedmd.com/overview.html) job scheduler. Slurmtools can be loaded in as a module from any node (including login nodes). Slurmtools can help us understand the following questions: -* How many core hours (SUs) have I used recently? -* Who is using all of the SUs on my group's account? -* What jobs have I run over the past few days? -* What is my priority? -* How efficient are my jobs? - -### __Step 1__: Log in -If you have a CURC account, login as you [normally would](../getting_started/logging-in.md) using your identikey and Duo from a terminal: - -```bash -$ ssh ralphie@login.rc.colorado.edu -``` - -### __Step 2__: Load the slurm module for the HPC resource you want to query metrics about (Blanca, Alpine): -```bash -$ module load slurm/alpine # substitute alpine for blanca -``` - -### __Step 3__: Load the `slurmtools` module -```bash -$ module load slurmtools -``` - -You will see the following informational message: - -``` -You have sucessfully loaded slurmtools, a collection of functions - to assess recent usage statistics. Available commands include: - - 'suacct' (SU usage for each user of a specified account over N days) - - 'suuser' (SU usage for a specified user over N days) - - 'seff' (CPU and RAM efficiency for a specified job) - - 'seff-array' (CPU, RAM, and time efficiency for a specified array job) - - 'jobstats' (job statistics for all jobs run by a specified user over N days) - - 'levelfs' (current fair share priority for a specified user) - - - Type any command without arguments for usage instructions - ``` - -### __Step 4__: Get some metrics! - -#### How many Service Units (core hours) have I used? +## Monitoring Through Slurm Commands +You can obtain important usage and efficiency metrics directly through Slurm’s built-in commands and answer the following questions: +### How many Service Units (core hours) have I used? Type the command name for usage hint: ```bash @@ -85,7 +38,7 @@ This output tells us that: * Ralphie's usage by account varied from 3,812 SUs to 15,987 SUs -#### Who is using all of the SUs on my groups' account? +### Who is using all of the SUs on my groups' account? Type the command name for usage hint: ```bash @@ -121,7 +74,7 @@ This output tells us that: * Five users used the account in the past 180 days. 
* Their usage ranged from 24 SUs to 84,216 SUs -#### What jobs have I run over the past few days? +### What jobs have I run over the past few days? Type the command name for usage hint: ```bash @@ -166,7 +119,7 @@ This output tells me that: * The elapsed times ranged from 0 hours to 1 hour and 48 minutes -#### What is my priority? +### What is my priority? Type the command name for usage hint: ```bash @@ -212,7 +165,7 @@ What is "Priority"? * Your "Fair Share" priority has a half life of 14 days (i.e., it recovers fully in ~1 month with zero usage) ``` -#### How efficient are my jobs? +### How efficient are my jobs? Type the command name for usage hint: ```bash @@ -250,7 +203,7 @@ This output tells us that: This information is also sent to users who include the `--mail` directive in jobs. -#### How can I check the efficiency of array jobs? +### How can I check the efficiency of array jobs? Use the `seff-array` command with the help flag for a usage hint: ``` @@ -303,6 +256,94 @@ The above indicates that all of the jobs displayed less than 40% CPU efficiency, If you are not familiar with Job Arrays in SLURM, you can learn more on the ["Scaling Up with job Arrays" page](../running-jobs/job-arrays.md). ``` +### How can I check memory and GPU utilization for my jobs? + +Type the command name for usage hint: +``` +$ sacct +``` +``` +Purpose: This command reports detailed accounting information for completed jobs, +including CPU, memory, and GPU metrics such as `gpumem` and `gpuutil`. +``` +To view the maximum GPU resource usage for a job, the command is: +``` +Usage: +sacct -j -Pno TRESUsageInMax -p + +positional arguments: + jobid Job ID to query. + +options: + -j Specifies the Slurm job ID you want information about. + -P Output in pipe-delimited format (parseable). + -n Removes the header row from the output. + -o TRESUsageInMax Chooses the output field(s). Here, you’re requesting to display maximum resource usage. + -p Print fields with trailing delimiters (helpful for scripts). +``` +In order to check the metrics for job 18194943, run: +``` +$ sacct -j 18194943 -Pno TRESUsageInMax -p +``` +This will display the output: +``` +-------------------------------------------------------- +cpu=00:56:24,energy=0,fs/disk=350130624578,gres/gpumem=38888M,gres/gpuutil=100,mem=10372780K,pages=3465,vmem=10194376K| +cpu=00:00:00,energy=0,fs/disk=5273,gres/gpumem=0,gres/gpuutil=0,mem=256K,pages=0,vmem=0| +-------------------------------------------------------- +``` +This tells us: +| Variable | Description | +| ------------- |------------| +|`cpu` | Total CPU time consumed (hours:minutes:seconds).| +|`energy` | Energy consumed by the job in arbitrary units (often 0 for systems without energy accounting).| +|`fs/disk` | Amount of disk I/O performed by the job (in bytes). | +|`gres/gpumem` | Peak GPU memory usage. | +|`gres/gpuutil` | Peak GPU utilization as a percentage. | +|`mem` | RAM used by the job. | +|`pages` | Number of memory pages used. Typically for advanced monitoring; can often be ignored. | +|`vmem` | Virtual memory used by the job. Includes RAM + swap + memory-mapped files. | + +```{note} +`sacct` reports two entries for each job: the first line reflects the actual workload (the primary job step) and contains the meaningful resource-usage metrics, while the second line corresponds to Slurm’s lightweight `extern` step, which performs the job’s shell setup and shows negligible usage. The first line is the one users should reference. 
+``` + +Similarly, to view the average GPU resource usage for a job +``` +Usage: +sacct -j -Pno TRESUsageInAve -p --noconvert + +positional arguments: + jobid Job ID to query. + +options: + -j Filters the report to only show information for the specified job ID. + -P Output in pipe-delimited format (parseable). + -n Removes the header row from the output. + -o TRESUsageInAve Chooses the output field(s). Here, you’re requesting to display average resource usage metrics. + -p Print fields with trailing delimiters (helpful for scripts). + --noconvert Prevents conversion of the units (e.g., MB → GB), ensuring raw data remains unchanged. +``` +In order to check average GPU metrics for job 18194943, run: +``` +$ sacct -j 18194943 -Pno TRESUsageInAve -p --noconvert +``` +This will display the output: +``` +-------------------------------------------------------- +cpu=00:56:24,energy=0,fs/disk=350130624578,gres/gpumem=40777023488,gres/gpuutil=100,mem=10621726720,pages=3465,vmem=10439041024| +cpu=00:00:00,energy=0,fs/disk=5273,gres/gpumem=0,gres/gpuutil=0,mem=262144,pages=0,vmem=0| +-------------------------------------------------------- +``` +This output contains the same fields as in the `TRESUsageInMax` example, but here they represent average usage instead of peak usage. + +```{important} +- GPU metrics are currently available only on NVIDIA GPUs. Accessing these metrics requires CUDA 12 or newer. +- For AMD GPUs, memory usage (`gpumem`) is available, but GPU utilization (`gpuutil`) is not supported. +- Alpine H200s and Blanca H100 GPUs do not currently support GPU metric collection. +- Users running jobs on unsupported GPUs or older CUDA versions will see zeros or infinite value for GPU memory and utilization fields. Make sure your jobs are running on compatible hardware to obtain meaningful GPU metrics. +``` + ## XDMoD XDMoD is a web portal for viewing metrics at the system-, partition- and user-levels. diff --git a/docs/getting_started/faq.md b/docs/getting_started/faq.md index 7c7ca684..e6adc17e 100644 --- a/docs/getting_started/faq.md +++ b/docs/getting_started/faq.md @@ -197,7 +197,7 @@ sstat --jobs=your_job_id --format=User,JobName,JobId,MaxRSS For more information on `sstat` or `sacct` commands, [take a look at our Useful Slurm Commands tutorial.](../running-jobs/slurm-commands.md) Or visit the Slurm reference pages on [sstat](https://slurm.schedmd.com/sstat.html) and [sacct](https://slurm.schedmd.com/sacct.html). -You can also view information related to service unit (SU) usage and CPU & RAM efficiency by using [slurmtools](../compute/monitoring-resources.md#slurmtools). Note that CPU & RAM efficiency statistics will be included in emails sent when a job completes, if requested. +You can also view information related to service unit (SU) usage and CPU & RAM efficiency [here](../compute/monitoring-resources.md#monitoring-through-slurm-commands). Note that CPU & RAM efficiency statistics will be included in emails sent when a job completes, if requested. :::: ### How can I see my current FairShare priority? 
diff --git a/docs/software/curc_provided_software.md b/docs/software/curc_provided_software.md index 5e29d2e8..2235f011 100644 --- a/docs/software/curc_provided_software.md +++ b/docs/software/curc_provided_software.md @@ -127,7 +127,7 @@ Before requesting a software installation, please review our [Software installat | [SAMtools](http://www.htslib.org/doc/samtools.html) | 1.16.1 | Samtools is a suite of programs for interacting with high-throughput sequencing data.| | [ScaLAPACK](https://netlib.org/scalapack/) | 2.2.0 |ScaLAPACK is a library of high-performance linear algebra routines for parallel distributed memory machines.| | [Singularity/Apptainer](https://apptainer.org/) | 3.6.4 (D), 3.7.4 |Singularity/Apptainer is a computer program that performs operating-system-level virtualization also known as containerization. [CURC Usage Guide](./containerization.md#apptainer)| -| [Slurmtools](../compute/monitoring-resources.md#slurmtools) | 0.0.0 |A collection of helper scripts for everyday use of the Slurm job scheduler.| +| Slurmtools | 0.0.0 |A collection of helper scripts for everyday use of the Slurm job scheduler.| | [SQLite](https://sqlite.org/index.html) | 3.36.0, 3.38.01, 3.46.1 |SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine.| | [SRA-Toolkit](https://hpc.nih.gov/apps/sratoolkit.html) | 3.0.0 | The SRA Toolkit and SDK from NCBI is a collection of tools and libraries for using data in the INSDC Sequence Read Archives.| | [STAR](https://github.com/alexdobin/STAR/tree/master) | 2.7.10b | A tool to align RNA-seq data. The STAR algorithm uses suffix arrays, seed clustering, and stitching.| From b34206ddb7bbcd004f84879fb65b32bf11dd85d0 Mon Sep 17 00:00:00 2001 From: mohalkh5 Date: Thu, 11 Dec 2025 14:21:26 -0700 Subject: [PATCH 02/26] Update docs/compute/monitoring-resources.md Co-authored-by: Michael Schneider --- docs/compute/monitoring-resources.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/compute/monitoring-resources.md b/docs/compute/monitoring-resources.md index 0bb9666a..4e127987 100644 --- a/docs/compute/monitoring-resources.md +++ b/docs/compute/monitoring-resources.md @@ -260,7 +260,7 @@ If you are not familiar with Job Arrays in SLURM, you can learn more on the ["Sc Type the command name for usage hint: ``` -$ sacct +$ sacct -h ``` ``` Purpose: This command reports detailed accounting information for completed jobs, From fe66f6e505ff6ea475acbf4632c565a3cde61867 Mon Sep 17 00:00:00 2001 From: mohalkh5 Date: Thu, 11 Dec 2025 14:25:16 -0700 Subject: [PATCH 03/26] Update docs/compute/monitoring-resources.md Co-authored-by: Michael Schneider --- docs/compute/monitoring-resources.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/compute/monitoring-resources.md b/docs/compute/monitoring-resources.md index 4e127987..92fdce9f 100644 --- a/docs/compute/monitoring-resources.md +++ b/docs/compute/monitoring-resources.md @@ -276,7 +276,6 @@ positional arguments: options: -j Specifies the Slurm job ID you want information about. - -P Output in pipe-delimited format (parseable). -n Removes the header row from the output. -o TRESUsageInMax Chooses the output field(s). Here, you’re requesting to display maximum resource usage. -p Print fields with trailing delimiters (helpful for scripts). 
From 85b18bc6c2c1d0509f981a9d2f526a8131cf3b8c Mon Sep 17 00:00:00 2001 From: mohalkh5 Date: Thu, 11 Dec 2025 14:25:23 -0700 Subject: [PATCH 04/26] Update docs/compute/monitoring-resources.md Co-authored-by: Michael Schneider --- docs/compute/monitoring-resources.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/compute/monitoring-resources.md b/docs/compute/monitoring-resources.md index 92fdce9f..4e18d29f 100644 --- a/docs/compute/monitoring-resources.md +++ b/docs/compute/monitoring-resources.md @@ -317,7 +317,6 @@ positional arguments: options: -j Filters the report to only show information for the specified job ID. - -P Output in pipe-delimited format (parseable). -n Removes the header row from the output. -o TRESUsageInAve Chooses the output field(s). Here, you’re requesting to display average resource usage metrics. -p Print fields with trailing delimiters (helpful for scripts). From bbe65479962b4eb8c5c5d1efc376d74970d66275 Mon Sep 17 00:00:00 2001 From: mohalkh5 Date: Thu, 11 Dec 2025 14:25:42 -0700 Subject: [PATCH 05/26] Update docs/compute/monitoring-resources.md Co-authored-by: Michael Schneider --- docs/compute/monitoring-resources.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/compute/monitoring-resources.md b/docs/compute/monitoring-resources.md index 4e18d29f..69cf30f5 100644 --- a/docs/compute/monitoring-resources.md +++ b/docs/compute/monitoring-resources.md @@ -310,7 +310,7 @@ This tells us: Similarly, to view the average GPU resource usage for a job ``` Usage: -sacct -j -Pno TRESUsageInAve -p --noconvert +sacct -j -pno JobID,TRESUsageInAve --noconvert positional arguments: jobid Job ID to query. From 1b478e7cefdd0b5c02ff91e79e3000b279094f75 Mon Sep 17 00:00:00 2001 From: mohalkh5 Date: Thu, 11 Dec 2025 14:25:49 -0700 Subject: [PATCH 06/26] Update docs/compute/monitoring-resources.md Co-authored-by: Michael Schneider --- docs/compute/monitoring-resources.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/compute/monitoring-resources.md b/docs/compute/monitoring-resources.md index 69cf30f5..6dac4327 100644 --- a/docs/compute/monitoring-resources.md +++ b/docs/compute/monitoring-resources.md @@ -269,7 +269,7 @@ including CPU, memory, and GPU metrics such as `gpumem` and `gpuutil`. To view the maximum GPU resource usage for a job, the command is: ``` Usage: -sacct -j -Pno TRESUsageInMax -p +sacct -j -pno JobID,TRESUsageInMax positional arguments: jobid Job ID to query. 
From c9639819272ba8980a63c7381dabad7b3b2ce339 Mon Sep 17 00:00:00 2001 From: mohalkh5 Date: Thu, 11 Dec 2025 14:26:34 -0700 Subject: [PATCH 07/26] Update docs/compute/monitoring-resources.md Co-authored-by: Michael Schneider --- docs/compute/monitoring-resources.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/compute/monitoring-resources.md b/docs/compute/monitoring-resources.md index 6dac4327..1fb33f16 100644 --- a/docs/compute/monitoring-resources.md +++ b/docs/compute/monitoring-resources.md @@ -282,7 +282,7 @@ options: ``` In order to check the metrics for job 18194943, run: ``` -$ sacct -j 18194943 -Pno TRESUsageInMax -p +$ sacct -j 18194943 -pno JobID,TRESUsageInMax ``` This will display the output: ``` From 2627c9d88bb4d3fb5e10da59741e208aea3bf928 Mon Sep 17 00:00:00 2001 From: mohalkh5 Date: Thu, 11 Dec 2025 14:49:59 -0700 Subject: [PATCH 08/26] updating the output displayed and it's explanation --- docs/compute/monitoring-resources.md | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/docs/compute/monitoring-resources.md b/docs/compute/monitoring-resources.md index 1fb33f16..4a4f7963 100644 --- a/docs/compute/monitoring-resources.md +++ b/docs/compute/monitoring-resources.md @@ -277,8 +277,7 @@ positional arguments: options: -j Specifies the Slurm job ID you want information about. -n Removes the header row from the output. - -o TRESUsageInMax Chooses the output field(s). Here, you’re requesting to display maximum resource usage. - -p Print fields with trailing delimiters (helpful for scripts). + -o TRESUsageInMax Chooses the output field(s). Here, you’re requesting to display maximum resource usage for a specific jobid. ``` In order to check the metrics for job 18194943, run: ``` @@ -287,8 +286,9 @@ $ sacct -j 18194943 -pno JobID,TRESUsageInMax This will display the output: ``` -------------------------------------------------------- -cpu=00:56:24,energy=0,fs/disk=350130624578,gres/gpumem=38888M,gres/gpuutil=100,mem=10372780K,pages=3465,vmem=10194376K| -cpu=00:00:00,energy=0,fs/disk=5273,gres/gpumem=0,gres/gpuutil=0,mem=256K,pages=0,vmem=0| +18194943|| +18194943.batch|cpu=00:56:24,energy=0,fs/disk=350130624578,gres/gpumem=38888M,gres/gpuutil=100,mem=10372780K,pages=3465,vmem=10194376K| +18194943.extern|cpu=00:00:00,energy=0,fs/disk=5273,gres/gpumem=0,gres/gpuutil=0,mem=256K,pages=0,vmem=0| -------------------------------------------------------- ``` This tells us: @@ -304,7 +304,7 @@ This tells us: |`vmem` | Virtual memory used by the job. Includes RAM + swap + memory-mapped files. | ```{note} -`sacct` reports two entries for each job: the first line reflects the actual workload (the primary job step) and contains the meaningful resource-usage metrics, while the second line corresponds to Slurm’s lightweight `extern` step, which performs the job’s shell setup and shows negligible usage. The first line is the one users should reference. +`sacct` shows multiple entries for each job. The primary entry is the one with the `.batch` step, which reflects the actual workload and contains the meaningful resource-usage metrics. The `.extern` entry corresponds to Slurm’s lightweight setup step and will show minimal or negligible usage. Users should reference the `.batch` line when reviewing their job’s resource usage. ``` Similarly, to view the average GPU resource usage for a job @@ -319,26 +319,25 @@ options: -j Filters the report to only show information for the specified job ID. -n Removes the header row from the output. 
-o TRESUsageInAve Chooses the output field(s). Here, you’re requesting to display average resource usage metrics. - -p Print fields with trailing delimiters (helpful for scripts). --noconvert Prevents conversion of the units (e.g., MB → GB), ensuring raw data remains unchanged. ``` In order to check average GPU metrics for job 18194943, run: ``` -$ sacct -j 18194943 -Pno TRESUsageInAve -p --noconvert +$ sacct -j 18194943 -pno JobID,TRESUsageInAve --noconvert ``` This will display the output: ``` -------------------------------------------------------- -cpu=00:56:24,energy=0,fs/disk=350130624578,gres/gpumem=40777023488,gres/gpuutil=100,mem=10621726720,pages=3465,vmem=10439041024| -cpu=00:00:00,energy=0,fs/disk=5273,gres/gpumem=0,gres/gpuutil=0,mem=262144,pages=0,vmem=0| +18194943|| +18194943.batch|cpu=00:56:24,energy=0,fs/disk=350130624578,gres/gpumem=40777023488,gres/gpuutil=100,mem=10621726720,pages=3465,vmem=10439041024| +18194943.extern|cpu=00:00:00,energy=0,fs/disk=5273,gres/gpumem=0,gres/gpuutil=0,mem=262144,pages=0,vmem=0| -------------------------------------------------------- ``` This output contains the same fields as in the `TRESUsageInMax` example, but here they represent average usage instead of peak usage. ```{important} - GPU metrics are currently available only on NVIDIA GPUs. Accessing these metrics requires CUDA 12 or newer. -- For AMD GPUs, memory usage (`gpumem`) is available, but GPU utilization (`gpuutil`) is not supported. -- Alpine H200s and Blanca H100 GPUs do not currently support GPU metric collection. +- Currently, AMD GPUs, H200s and Blanca H100 GPUs do not support GPU metric collection. - Users running jobs on unsupported GPUs or older CUDA versions will see zeros or infinite value for GPU memory and utilization fields. Make sure your jobs are running on compatible hardware to obtain meaningful GPU metrics. ``` From d239d0d99eff4b338bbd2da7e252d4f9048039fb Mon Sep 17 00:00:00 2001 From: mohalkh5 Date: Mon, 5 Jan 2026 17:24:22 -0700 Subject: [PATCH 09/26] update the note about unsupported GPUs --- docs/compute/monitoring-resources.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/compute/monitoring-resources.md b/docs/compute/monitoring-resources.md index 4a4f7963..87ae2bb2 100644 --- a/docs/compute/monitoring-resources.md +++ b/docs/compute/monitoring-resources.md @@ -336,8 +336,9 @@ This will display the output: This output contains the same fields as in the `TRESUsageInMax` example, but here they represent average usage instead of peak usage. ```{important} -- GPU metrics are currently available only on NVIDIA GPUs. Accessing these metrics requires CUDA 12 or newer. -- Currently, AMD GPUs, H200s and Blanca H100 GPUs do not support GPU metric collection. +- GPU metrics are currently available only on select NVIDIA GPUs and require CUDA 12 or newer. +- At this time, GPU memory and utilization metrics are not available on the following configurations: AMD GPUs, GH200s and MIG enables GPUs on Alpine, P100s and A40s on Blanca. +- Core cluster GPUs (e.g., core-gpu[0-4], viz1, viz2), which power Core Desktop and MATLAB GUI, do not currently support GPU memory or utilization metrics. - Users running jobs on unsupported GPUs or older CUDA versions will see zeros or infinite value for GPU memory and utilization fields. Make sure your jobs are running on compatible hardware to obtain meaningful GPU metrics. 
``` From 9e2392c78d626aab4472b40cf5039721ed39e407 Mon Sep 17 00:00:00 2001 From: mohalkh5 Date: Tue, 6 Jan 2026 13:29:35 -0700 Subject: [PATCH 10/26] Update docs/compute/monitoring-resources.md Co-authored-by: Brandon <53541061+b-reyes@users.noreply.github.com> --- docs/compute/monitoring-resources.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/compute/monitoring-resources.md b/docs/compute/monitoring-resources.md index 87ae2bb2..f5bccb46 100644 --- a/docs/compute/monitoring-resources.md +++ b/docs/compute/monitoring-resources.md @@ -5,7 +5,7 @@ CU Research Computing has two main tools which can help users monitor their HPC * [XDMoD](#xdmod): A web portal for viewing metrics at the system, partition, and user-levels. ## Monitoring Through Slurm Commands -You can obtain important usage and efficiency metrics directly through Slurm’s built-in commands and answer the following questions: +You can obtain important usage and efficiency metrics directly through Slurm’s built-in commands and answer questions posed by the subsections below. ### How many Service Units (core hours) have I used? Type the command name for usage hint: From 9f613de306b8c02679819db5e63dae39c063f655 Mon Sep 17 00:00:00 2001 From: mohalkh5 Date: Tue, 6 Jan 2026 13:30:48 -0700 Subject: [PATCH 11/26] update "peak" to "max" Co-authored-by: Brandon <53541061+b-reyes@users.noreply.github.com> --- docs/compute/monitoring-resources.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/compute/monitoring-resources.md b/docs/compute/monitoring-resources.md index f5bccb46..1374575f 100644 --- a/docs/compute/monitoring-resources.md +++ b/docs/compute/monitoring-resources.md @@ -297,8 +297,8 @@ This tells us: |`cpu` | Total CPU time consumed (hours:minutes:seconds).| |`energy` | Energy consumed by the job in arbitrary units (often 0 for systems without energy accounting).| |`fs/disk` | Amount of disk I/O performed by the job (in bytes). | -|`gres/gpumem` | Peak GPU memory usage. | -|`gres/gpuutil` | Peak GPU utilization as a percentage. | +|`gres/gpumem` | Max GPU memory usage. | +|`gres/gpuutil` | Max GPU utilization as a percentage. | |`mem` | RAM used by the job. | |`pages` | Number of memory pages used. Typically for advanced monitoring; can often be ignored. | |`vmem` | Virtual memory used by the job. Includes RAM + swap + memory-mapped files. | From 6a3cb0e4f7fda215257af9797aa49cf8cd7feacf Mon Sep 17 00:00:00 2001 From: mohalkh5 Date: Tue, 6 Jan 2026 13:31:34 -0700 Subject: [PATCH 12/26] gramm Co-authored-by: Brandon <53541061+b-reyes@users.noreply.github.com> --- docs/compute/monitoring-resources.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/compute/monitoring-resources.md b/docs/compute/monitoring-resources.md index 1374575f..1d9ca727 100644 --- a/docs/compute/monitoring-resources.md +++ b/docs/compute/monitoring-resources.md @@ -339,7 +339,7 @@ This output contains the same fields as in the `TRESUsageInMax` example, but her - GPU metrics are currently available only on select NVIDIA GPUs and require CUDA 12 or newer. - At this time, GPU memory and utilization metrics are not available on the following configurations: AMD GPUs, GH200s and MIG enables GPUs on Alpine, P100s and A40s on Blanca. - Core cluster GPUs (e.g., core-gpu[0-4], viz1, viz2), which power Core Desktop and MATLAB GUI, do not currently support GPU memory or utilization metrics. 
-- Users running jobs on unsupported GPUs or older CUDA versions will see zeros or infinite value for GPU memory and utilization fields. Make sure your jobs are running on compatible hardware to obtain meaningful GPU metrics. +- Users running jobs on unsupported GPUs or older CUDA versions will see zeros or infinite values for GPU memory and utilization fields. Make sure your jobs are running on compatible hardware to obtain meaningful GPU metrics. ``` ## XDMoD From 5dd7eebcabd5e9fc23d085efcfa668838faa0228 Mon Sep 17 00:00:00 2001 From: mohalkh5 Date: Tue, 6 Jan 2026 13:33:48 -0700 Subject: [PATCH 13/26] update "peak" to "max" Co-authored-by: Brandon <53541061+b-reyes@users.noreply.github.com> --- docs/compute/monitoring-resources.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/compute/monitoring-resources.md b/docs/compute/monitoring-resources.md index 1d9ca727..ec85a8cd 100644 --- a/docs/compute/monitoring-resources.md +++ b/docs/compute/monitoring-resources.md @@ -333,7 +333,7 @@ This will display the output: 18194943.extern|cpu=00:00:00,energy=0,fs/disk=5273,gres/gpumem=0,gres/gpuutil=0,mem=262144,pages=0,vmem=0| -------------------------------------------------------- ``` -This output contains the same fields as in the `TRESUsageInMax` example, but here they represent average usage instead of peak usage. +This output contains the same fields as in the `TRESUsageInMax` example, but here they represent average usage instead of max usage. ```{important} - GPU metrics are currently available only on select NVIDIA GPUs and require CUDA 12 or newer. From 99bf9eb44e789a37cbd20ca857c7d5a5853b7c5b Mon Sep 17 00:00:00 2001 From: mohalkh5 Date: Wed, 7 Jan 2026 08:55:23 -0700 Subject: [PATCH 14/26] adding modifications based on the review --- docs/compute/monitoring-resources.md | 301 +++++++++++++++++++-------- 1 file changed, 209 insertions(+), 92 deletions(-) diff --git a/docs/compute/monitoring-resources.md b/docs/compute/monitoring-resources.md index ec85a8cd..518332c2 100644 --- a/docs/compute/monitoring-resources.md +++ b/docs/compute/monitoring-resources.md @@ -12,7 +12,8 @@ Type the command name for usage hint: ```bash $ suuser ``` -``` +This will display the output: +```text Purpose: This function computes the number of Service Units (SUs) consumed by a specified user over N days. @@ -24,7 +25,8 @@ Check usage for the last 365 days: ```bash $ suuser ralphie 365 ``` -``` +This will display the output: +```text SU used by user ralphie in the last 365 days: Cluster|Account|Login|Proper Name|Used|Energy| alpine|admin|ralphie|Ralphie|15987|0| @@ -44,7 +46,8 @@ Type the command name for usage hint: ```bash $ suacct ``` -``` +This will display the output: +```text Purpose: This function computes the number of Service Units (SUs) consumed by each user of a specified account over N days. @@ -59,7 +62,8 @@ Most user accounts follow the naming convention `ucbXXX_ascX`, in this example w ```bash $ suacct admin 180 ``` -``` +This will display the output: +```text SU used by account (allocation) admin in the last 180 days: Cluster|Account|Login|Proper Name|Used|Energy alpine|admin|||763240|0 @@ -80,7 +84,8 @@ Type the command name for usage hint: ```bash $ jobstats ``` -``` +This will display the output: +```text Purpose: This function shows statistics for each job run by a specified user over N days. 
@@ -92,7 +97,8 @@ Check ralphie's jobstats for the past 35 days: ```bash $ jobstats ralphie 35 ``` -``` +This will display the output: +```text job stats for user ralphie over past 35 days jobid jobname partition qos account cpus state start-date-time elapsed wait ------------------------------------------------------------------------------------------------------------------- @@ -121,11 +127,22 @@ This output tells me that: ### What is my priority? +```{important} +What is "Priority"? +* Your priority is a number between 0.0 --> 1.0 that defines your relative placement in the queue of scheduled jobs +* Your priority is computed each time a job is scheduled and reflects the following factors: + * Your "Fair Share priority" (the ratio of resources you are allocated versus those you have consumed for a given account) + * Your job size (slightly larger jobs have higher priority) + * Your time spent in the queue (jobs gain priority the longer they wait) + * The partition and qos you choose (this is a minor consideration on CURC systems) +* Your "Fair Share" priority has a half life of 14 days (i.e., it recovers fully in ~1 month with zero usage) +``` Type the command name for usage hint: ```bash $ levelfs ``` -``` +This will display the output: +```text Purpose: This function shows the current fair share priority of a specified user. A value of 1 indicates average priority compared to other users in an account. A value of < 1 indicates lower than average priority @@ -141,29 +158,18 @@ Check Ralphie's fair share priority: ```bash $ levelfs ralphie ``` -``` -ralphie -admin LevelFS: inf -ucb-general LevelFS: 44.796111 -tutorial1 LevelFS: inf -ucb-testing LevelFS: inf +```text +LevelFS for user ralphie and institution ucb: + +Account LevelFS_User LevelFS_Inst +----------------------------------------------------- +admin 1.016845 1.105260 +ucb-general inf 1.105260 ``` This output tells me: -* Ralphie hasn't used `admin`, `tutorial1`, or `ucb-testing` for more than a month, and therefore Ralphie has very high ("infinite") priority. -* Ralphie has used `ucb-general` but not much. Priority is >> 1 , therefore Ralphie can expect lower-than-average queue waits compared to average ucb-general waits. - - -```{important} -What is "Priority"? -* Your priority is a number between 0.0 --> 1.0 that defines your relative placement in the queue of scheduled jobs -* Your priority is computed each time a job is scheduled and reflects the following factors: - * Your "Fair Share priority" (the ratio of resources you are allocated versus those you have consumed for a given account) - * Your job size (slightly larger jobs have higher priority) - * Your time spent in the queue (jobs gain priority the longer they wait) - * The partition and qos you choose (this is a minor consideration on CURC systems) -* Your "Fair Share" priority has a half life of 14 days (i.e., it recovers fully in ~1 month with zero usage) -``` +* Ralphie hasn't used `ucb-general` and therefore Ralphie has very high ("infinite") priority. +* Ralphie has used `admin` but not much. Priority is >> 1 , therefore Ralphie can expect lower-than-average queue waits compared to average ucb-general waits. ### How efficient are my jobs? 
@@ -171,7 +177,8 @@ Type the command name for usage hint: ```bash $ seff ``` -``` +This will display the output: +```text Usage: seff [Options] Options: -h Help menu @@ -179,29 +186,50 @@ Usage: seff [Options] -d Debug mode: display raw Slurm data ``` -Now check the efficiency of job 8636572: +Now check the efficiency of job 20522520: ```bash -$ seff 8636572 -``` +$ seff 20522520 ``` -Job ID: 8636572 +This will display the output: +```text +Job ID: 20522520 Cluster: alpine User/Group: ralphie/ralphiegrp State: COMPLETED (exit code 0) Nodes: 1 -Cores per node: 24 -CPU Utilized: 04:04:05 -CPU Efficiency: 92.18% of 04:24:48 core-walltime -Job Wall-clock time: 00:11:02 -Memory Utilized: 163.49 MB -Memory Efficiency: 0.14% of 113.62 GB +Cores per node: 21 + +──────── CPU Metrics ──────── +CPU Utilized: 7-14:11:13 +CPU Efficiency: 75.77% of 10-00:27:21 core-walltime +Job Wall-clock time: 11:27:01 +Memory Utilized: 21.25 GiB +Memory Efficiency: 27.96% of 76.00 GiB (76.00 GiB/node) + +──────── GPU Metrics ──────── +Number of GPUs: 1 +GPU Type: l40 +NOTE: GPU metric availability may vary by GPU type. + Please refer to our documentation for details: https://curc.readthedocs.io/en/latest/compute/monitoring-resources.html#how-can-i-check-memory-and-gpu-utilization-for-my-jobs +Max GPU Utilization: 41% +Max GPU Memory Utilized: 874.00 MiB +``` +```{note} +The `seff` output is divided into two sections: CPU Metrics and GPU Metrics. +- CPU Metrics are always shown, regardless of the type of job, and summarize CPU utilization, memory usage, and efficiency. +- GPU Metrics are displayed only for jobs that request GPUs. For CPU-only jobs, this section will not appear. +The example above is for a GPU job, which is why both CPU and GPU metrics are shown. ``` - This output tells us that: -* the 24 cores reserved for this job were 92% utilized (anything > 80% is pretty good) -* 163.49 MB RAM was used of 113.62 GB RAM reserved (0.14%). This job is "cpu bound" so the memory inefficiency is not a major issue. +* The 21 CPU cores reserved for this job were ~76% utilized, which is reasonable but slightly below the ideal range (>80%). +* The job ran for ~11.5 hours wall-clock time, accumulating ~7.6 days of total CPU time, indicating a long-running workload with sustained CPU usage. +* 21.25 GiB of RAM was used out of 76.00 GiB reserved (~28%), suggesting the job was not memory-bound and could likely run with a smaller memory request. +* One `L40` GPU was allocated, but maximum GPU utilization was 41%, with only ~874 MiB of GPU memory used, indicating the GPU was underutilized for much of the run. This information is also sent to users who include the `--mail` directive in jobs. +```{see also} +Not all GPUs support report memory and utilization metrics in `seff` output. See ["Why am I not seeing GPU memory or utilization metrics for my job?"](../getting-started/faq.md#why-am-I-not-seeing-GPU-memory-or-utilization-metrics-for-my-job) for supported configurations and requirements. +``` ### How can I check the efficiency of array jobs? 
@@ -209,7 +237,8 @@ Use the `seff-array` command with the help flag for a usage hint: ``` $ seff-array -h ``` -``` +This will display the output: +```text usage: seff-array.py [-h] [-c CLUSTER] [--version] jobid positional arguments: @@ -220,37 +249,49 @@ options: -c CLUSTER, --cluster CLUSTER --version show program's version number and exit ``` -In order to check the efficiency of all jobs in job array 8636572, run the command: +In order to check the efficiency of all jobs in job array 14083647, run the command: ``` -$ seff-array 8636572 +$ seff-array 14083647 ``` This will display the status of all jobs in the array: -``` +```text +-------------------------------------------------------- +Job Information +ID: 14083647 +Name: vecjob_gpu +Cluster: alpine +User/Group: ralphie/ralphiegrp +Requested CPUs: 1 cores on 1 node(s) +Requested Memory: 3.75G +Requested Time: 00:10:00 -------------------------------------------------------- Job Status -COMPLETED: 249 -FAILED: 4 -PENDING: 1 -RUNNING: 22 -TIMEOUT: 4 +COMPLETED: 4 -------------------------------------------------------- +-------------------------------------------------------- +Finished Job Statistics +(excludes pending, running, and cancelled jobs) +Average CPU Efficiency 19.21% +Average Memory Usage 0.00G +Average Run-time 3.50s +--------------------- ``` Additionally, `seff-array` will display a histogram of the efficiency statistics all of the jobs in the array, separated into 10% increments. For example: -``` +```text CPU Efficiency (%) --------------------- -+0.00e+00 - +1.00e+01 [ 3] ▌ -+1.00e+01 - +2.00e+01 [244] ████████████████████████████████████████ -+2.00e+01 - +3.00e+01 [ 8] █▎ -+3.00e+01 - +4.00e+01 [ 2] ▍ -+4.00e+01 - +5.00e+01 [ 0] -+5.00e+01 - +6.00e+01 [ 0] -+6.00e+01 - +7.00e+01 [ 0] -+7.00e+01 - +8.00e+01 [ 0] -+8.00e+01 - +9.00e+01 [ 0] -+9.00e+01 - +1.00e+02 [ 0] -``` -The above indicates that all of the jobs displayed less than 40% CPU efficiency, with the majority (244/256) demonstrating between 10% and 20% efficiency. This information will also be displayed for memory and time efficiency. ++0.00e+00 - +1.00e+01 [0] ++1.00e+01 - +2.00e+01 [2] ████████████████████████████████████████ ++2.00e+01 - +3.00e+01 [2] ████████████████████████████████████████ ++3.00e+01 - +4.00e+01 [0] ++4.00e+01 - +5.00e+01 [0] ++5.00e+01 - +6.00e+01 [0] ++6.00e+01 - +7.00e+01 [0] ++7.00e+01 - +8.00e+01 [0] ++8.00e+01 - +9.00e+01 [0] ++9.00e+01 - +1.00e+02 [0] +``` +The above indicates that all of the jobs displayed less than 30% CPU efficiency, with the jobs evenly split (2/4) between 10%–20% and 20%–30% efficiency. This information will also be displayed for memory and time efficiency. ```{seealso} If you are not familiar with Job Arrays in SLURM, you can learn more on the ["Scaling Up with job Arrays" page](../running-jobs/job-arrays.md). @@ -258,27 +299,101 @@ If you are not familiar with Job Arrays in SLURM, you can learn more on the ["Sc ### How can I check memory and GPU utilization for my jobs? -Type the command name for usage hint: +The `sacct` command provides detailed accounting information for completed jobs, including CPU, memory, and GPU metrics (e.g., `gpumem`, `gpuutil`). To see a list of available options and usage examples, run `sacct` with no arguments. ``` $ sacct -h ``` -``` -Purpose: This command reports detailed accounting information for completed jobs, -including CPU, memory, and GPU metrics such as `gpumem` and `gpuutil`. +This will display the output: +```text +sacct [