
3_resources


Requesting resources on HPC

Introduction

One of the most important components of working on the cluster is making sure that you request a reasonable amount of memory, CPUs and time on the nodes you are submitting to. There are some quite extensive guides on Sigma2 about this, both here and here. However, the purpose of this guide is to simplify matters somewhat for the group.

Saga runs on a shared user model - every allocation period (roughly every six months), we request a CPU-hours budget as a group. For example, if we request 100,000 CPU hours, we must share this among the whole group. All jobs we run deduct from this total, so it is very important that any job submitted requests a fair and efficient amount.

What exactly do CPU hours mean?

You don't really need to worry too much about this, other than to know that we have a quota. You can see it like so on any of the login nodes:

cost -p nn10082K --details

The output will tell you how much we have in the quota, how much we have used, what is available and how many are pending (i.e. will be spent on jobs in the queue).

CPU hours are the billing units - i.e. jobs cost CPU hours, which are deducted from our total. There is detailed information here about how this calculation is made, but you do not really need to know the ins and outs.

However, what you should keep in mind is that the cost of a job scales with how much memory you request, the number of CPUs and the time it runs for. We are billed for the amount requested (except for time), so if you request a huge amount of memory and CPUs but only use a small fraction of that request, your job will cost many times more than it should. This is why we need to ensure jobs are efficient.
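
As a rule of thumb (this is a simplification of the billing model described in the Sigma2 docs, and it is worked through with concrete numbers further down this page), the cost works out roughly as:

billing hours ≈ max(CPUs requested, GB of memory requested × queue memory factor) × hours the job actually ran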

A short note on the pipelines

For those of you using the genotyping pipelines, you do not really need to check the CPU costs as the Nextflow scripts are designed to deal with this for you. You should mainly perform these checks for non-pipeline scripts or any custom analyses you are performing.

A quick start guide

Looking to quickly get an overview of how to make sure you aren't burning up all our resources? Take the following steps:

  1. Calculate the potential job costs using job_cost.py
  2. Run a test job
  3. Examine the job efficiency
  4. Alter your resource requests if needed

Calculating potential job costs

To do the hard work, Erik and I have written a utility script in Python that will calculate how much a job could potentially cost. Full disclosure: I had the original idea, but Erik made it much nicer than my old script! If you followed all the steps here then you will be able to run it without issue. It takes a single slurm script (i.e. like the one you would submit) as input. Run it like so:

job_cost.py my_example.slurm

NB - try to ensure you do not have comments AFTER your #SBATCH commands in your slurm scripts! Inline comments like the example below will not work with the script. The simplest solution is just to put them on a separate line.

#SBATCH --time 1-12:00:00 # This comment will break the time argument and the script will complain you have not provided time
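
The same directive with the comment moved onto its own line parses fine:

# Request 1 day and 12 hours of walltime
#SBATCH --time 1-12:00:00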

Once you've ensured that your slurm script conforms to this formatting, running the utility will give you estimates of the cost. An example looks like this (NB it has prettier colours on Saga):


================= Saga Job Cost Report =====================

┌──────────────────────── WARNING ─────────────────────────┐
│                                                          │
│   This job can be expensive!                             │
│                                                          │
│   Highest possible cost: 80,094.12 CPU-hours             │
│   On queue: normal                                       │
│   Expense: 10,412.24 NOK at market price                 │
│                                                          │
│   Consider reducing walltime, CPUs, or requested         │
│   memory, or choosing a different queue. Carefully       │
│   inspect the calculated cost also on the cheapest       │
│   queue. It could be unacceptably high as well.          │
│                                                          │
└──────────────────────────────────────────────────────────┘

┌───────────────────── CHEAPEST QUEUE ─────────────────────┐
│                                                          │
│   Cheapest cost: 24,864.00 CPU-hours                     │
│   On queue: hugemem                                      │
│   Expense: 3,232.32 NOK at market price                  │
│                                                          │
└──────────────────────────────────────────────────────────┘

================= Basic ====================================

- Your requested resource allocation

    • time per task (hrs)  : 168
    • cpus per task        : 4
    • memory per task (gb) : 50
    • array size (tasks)   : 37

- Cost in CPU-hours on each queue

    • normal
      per task : 2,164.71
      total    : 80,094.12
    • bigmem
      per task : 928.18
      total    : 34,342.53
    • hugemem
      per task : 672.00
      total    : 24,864.00

- NB: These are the costs of submitting your script once.

================= Advanced =================================

- CPU-hours occupied by requested CPUs:

    • per task : 672.00
    • total    : 24,864.00

- CPU-hours occupied by requested memory (per queue)

    • normal
      per task : 2,164.71
      total    : 80,094.12
    • bigmem
      per task : 928.18
      total    : 34,342.53
    • hugemem
      per task : 89.01
      total    : 3,293.25

================= End of Cost Report =======================

You can see that the script outputs the CPU-hour and NOK cost for your job across the different possible queues. It gives you a warning if you are requesting a lot of memory and some advice on how to deal with this. It also recommends the cheapest queue for you to submit the job to.

The rest of the output is split into a basic and an advanced section. In the basic section, you can see your requested resources. The utility also detects if this is an array script (provided the array is specified in the #SBATCH header). It then tells you the cost of the job in CPU hours on each of the three available queues. Remember - this is all based on the assumption that the job runs for exactly the time you requested.

In the advanced section, it shows you how much you would be charged for the CPUs alone and then the cost scaled by the memory requested. The Saga pricing model charges you whichever of these two is larger. In the case above, the CPU hours occupied by the requested memory are higher than the CPU-only cost on both the normal and bigmem queues. On hugemem, however, the memory cost falls below the CPU-only cost, so the job is billed for the CPUs alone - which is why hugemem is the best place to submit this job.

There are other ways to improve your cost estimates, including being sure you are actually requesting what you need. So read on for a guide on how to do this.

Run a test job - an illustrative example of why

It is absolutely essential that you run a test job before submitting many similar jobs doing the same thing. This is especially true if running arrays. One simple reason for this is that we are still charged for jobs, even if they fail - so the very first thing you should do is ensure it actually works!

The other reason for this is that the slurm output can be very useful for getting an idea of what sort of resources a job used. Here is an example:

Job 15812769 consumed 374.0 billing hours from project nn10082k.

Submitted 2025-09-05T14:16:05; waited 12.4 hours in the queue after becoming eligible to run.

Requested wallclock time: 2.0 days
Elapsed wallclock time:   15.0 hours

Job exited normally.

Task and CPU statistics:
ID              CPUs  Tasks  CPU util                Start  Elapsed  Exit status
15812769          10            0.0 %  2025-09-06T02:39:53   15.0 h  0
15812769.batch    10      1    10.0 %  2025-09-06T02:39:53   15.0 h  0

Used CPU time:   14.9 CPU hours
Unused CPU time: 5.6 CPU days

Memory statistics, in GiB:
ID               Alloc   Usage
15812769         100.0        
15812769.batch   100.0     0.0

Job 15812769 completed at Sat Sep 6 17:37:34 CEST 2025

What does this tell us? Firstly, the job cost us 374 CPU hours. We can see that here:

Job 15812769 consumed 374.0 billing hours from project nn10082k.

Next we can look at the time it ran for:

Requested wallclock time: 2.0 days
Elapsed wallclock time:   15.0 hours

In this example, 2 days were requested but the job only used 15 hours. Generally we are not charged for the unused time but it is worth ensuring you request only the time you need - that way your jobs will not be queued for so long (for example, this job was queued for 12 hours before running). Here we could easily request 20 hours for a bit of leeway and it would run without issue.
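
For this particular job, that would just mean editing the time request in the slurm header to something like the line below (the exact value is up to you - leave enough leeway for runs that take a little longer than your test):

#SBATCH --time 20:00:00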

We can also see the number of CPU hours the job used.

Used CPU time:   14.9 CPU hours
Unused CPU time: 5.6 CPU days

Again - it used a lot less than requested - i.e. the job is inefficient - more on this later.

What about the number of CPUs?

Task and CPU statistics:
ID              CPUs  Tasks  CPU util                Start  Elapsed  Exit status
15812769          10            0.0 %  2025-09-06T02:39:53   15.0 h  0
15812769.batch    10      1    10.0 %  2025-09-06T02:39:53   15.0 h  0

The job requested 10 CPUs but it only used 1 of them - i.e. 10% of CPUs were utilised. So this is highly inefficient. We can also see the memory requested and how much was actually used.

Memory statistics, in GiB:
ID               Alloc   Usage
15812769         100.0        
15812769.batch   100.0     0.0

In this example, the job requested 100 GB of memory - but it used hardly anything. It actually used approximately 40 MB of memory, which is so small it is not shown here. So this job cost a large amount of CPU hours because of its memory request, but it hardly used any of it at all.

How did this job cost so much?

A short breakdown to give an illustrative example here. This job cost 374 billing hours. It did the following:

  • requested 10 CPUs
  • requested 100 GB memory
  • ran for 15 hours

Each CPU has a cost of 1. Memory on the normal node has a cost of 0.2577 per GB. So this job had a CPU cost of 10 and a memory cost of about 25.8 (100 × 0.2577). Saga charges whichever of these is higher, multiplied by the hours the job ran. So here the cost is 15 × 25.8 ≈ 386 - which is approximately the total cost highlighted at the end of the job.
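
If you want to sanity check this arithmetic for your own jobs, a rough back-of-the-envelope version of the same calculation (using the memory factor quoted above for the normal queue - this is not the official Sigma2 calculator) can be run on the command line:

awk 'BEGIN { cpus=10; mem_gb=100; factor=0.2577; hours=15;
             units = (cpus > mem_gb*factor) ? cpus : mem_gb*factor;
             printf "%.1f billing hours\n", units*hours }'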

However, we actually only used 1 CPU and less than 1 GB of memory. If the job had requested that instead, it would have cost just 15 CPU hours. This is a huge difference and an illustration of why it is important to ensure your jobs are efficient.
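
As a rough illustration (these exact values are hypothetical - base yours on your own test job), a right-sized header for this job might look something like:

#SBATCH --cpus-per-task 1
#SBATCH --mem-per-cpu 2G
#SBATCH --time 20:00:00

With 1 CPU and a couple of GB of memory, the CPU cost is the larger of the two billing factors, so the job would be charged roughly 15 CPU hours for its 15-hour run.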

Improving job efficiency

So how can you improve efficiency? Besides running a test job you can also use seff which actually does some of this for you. For example:

seff 15812769

Gives us the following output for the job we looked at above:

Job ID: 15812769
Array Job ID: 15811659_59
Cluster: saga
User/Group: some/user
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 10
CPU Utilized: 14:53:36
CPU Efficiency: 9.96% of 6-05:36:20 core-walltime
Job Wall-clock time: 14:57:38
Memory Utilized: 42.19 MB
Memory Efficiency: 0.04% of 100.00 GB

This gives a clearer breakdown of the CPU and memory efficiency. As a rough guideline, you want to adjust your memory and CPU requests to achieve as much efficiency as possible - ideally around 90%. That ensures we do not waste resources. So, run your test job and then use seff to get an idea. Look at the job output too, and if you have any questions, don't be afraid to ask!!

There is more info on estimating efficiency here if you'd like to dive deeper.
