3_managing_data
By now you have a reasonable idea of how to work on the HPC using slurm. The next important step is to understand how best to manage data access and storage while using the group resources on the cluster. Here we will lay out details on accessing data, managing workspaces and what/where to store data. This document is an evolving set of guidelines but one that I ask you to follow as best you can. Please do contact me if anything is unclear or you would like advice on managing your data.
Generally on SAGA you will work in three main locations that can be accessed directly during job submission. There is also storage on NIRD (see below for more info). Each of these has different storage and file-number limits. As of 24/01/24, they are as follows:
| Area | Space | Files |
|---|---|---|
| `$HOME` | 20 Gb | 100 000 |
| `/cluster/projects/nn10082k/` | 5 Tb | 5 000 000 |
| `$USERWORK` | No limit | No limit |
| `/nird/projects/NS10082K` | 10 Tb | No limit |
For most of your work, $HOME is only really useful for running a
conda installation or storing small scripts.
The majority of your work will be done in `$USERWORK`. This is an environment variable that points to your user work directory; for example, mine is `/cluster/work/users/msravine`. As you can see from the table above, this directory has no limits. However, the downside is that it is purged every 42 days and, in times of low storage availability, every 21 days.
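You can check exactly where your own `$USERWORK` points with:

```bash
echo $USERWORK
```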
Similarly, we have a larger amount of group storage in `/cluster/projects/nn10082k`. This is not purged and is instead used by us as a group. However, as a general rule, it is not to be used for analyses. The purpose of this directory is only to store outputs that you do not want to be deleted by the regular purge. These might be vcfs or other important files that take a long time to recreate. In general, if you can regenerate a file easily within 24 hours, it does not need to be stored here.
Before starting work on the cluster, please make a directory in
/cluster/projects/nn10082k/ with your name - this is your personal
storage area in the group storage.
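For example, using my directory as a template (substitute your own name):

```bash
# create your personal directory in the group storage (the directory name is an example)
mkdir -p /cluster/projects/nn10082k/msravine
```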
Finally a tip - if you ever need to check disk quota or usage, the following command will help:
```
dusage
```
It might seem confusing to get used to this purging system, but here is a simple guideline to properly manage your work on Saga. To demonstrate good working practice, we will use the lab genotyping pipeline as an example.
- First check whether the samples you wish to genotype have already been aligned to the reference genome. To do this, check the `/cluster/projects/nn10082k/crams` directory. You can use `find` to search for individual samples - i.e. the following will identify both the `.cram` and the `.crai` index.

  ```
  find * -name "PDOM2012NOR0025F*"
  ```

- If the individual has not been mapped, you will need to download the reads. Starting in `$USERWORK`, you should download the reads from NIRD. Using the same individual as above:

  ```
  rsync -ravzhP --chown=:username /nird/projects/NS10082K/domesticus/PDOM2012NOR0025F* .
  ```

  Here we use rsync because it is faster and it also allows you to restart failed copies. The `--chown` option changes the group ownership after the files are copied - see below for a guide on exactly what this means. If you are downloading multiple individuals, it might take some time, so do it in a screen. As a general rule, reads are never to be stored in `/cluster/projects/nn10082k/` if they are backed up on NIRD.

- Run the pipeline on the reads you have downloaded. The first script will produce a bam file. You should convert it to a cram (see the sketch after this list) and then, after checking with Mark, you can store it in the crams directory - i.e. `/cluster/projects/nn10082k/crams`.

- If you are going to genotype other individuals already mapped to the reference genome, you can use paths that point directly to files in `/cluster/projects/nn10082k/crams` - i.e. there is no need to copy them to `$USERWORK`.

- Run the rest of the pipeline to produce your vcf. When you are satisfied you have a raw vcf (i.e. prior to filtering) and a filtered vcf you wish to keep, you can copy these to your directory in `/cluster/projects/nn10082k/`.

- If you wish to do any analyses on these vcfs, you can access them directly from slurm scripts while they are in `/cluster/projects/nn10082k/`. You may keep outputs in your project directory if they are difficult to make, but do not use this as storage for intermediate files.
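As referenced in the pipeline steps above, here is a minimal sketch of the bam-to-cram conversion with samtools. The filenames and reference path are placeholders; the reference must be the same one the reads were mapped to, so check with Mark if you are unsure.

```bash
# convert the bam to a cram (cram compression is reference-based, so the mapping reference is required)
samtools view -C -T /path/to/reference_genome.fa \
  -o PDOM2012NOR0025F.cram PDOM2012NOR0025F.bam

# index the cram so the .crai sits alongside it
samtools index PDOM2012NOR0025F.cram
```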
Remember as a general rule, do not point slurm scripts to
/cluster/projects/nn10082k/ for anything other than reading files for
a job. All outputs must be written to $USERWORK under all
circumstances. You may copy them over at a later stage.
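For instance, a minimal sketch of a slurm script that follows this rule might look like the block below. The account name is assumed to match the project directory, and the input path, resource requests and `bcftools` command are placeholders for whatever analysis you are actually running.

```bash
#!/bin/bash
#SBATCH --job-name=vcf_stats
#SBATCH --account=nn10082k        # assumed to match the project allocation
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=4G
#SBATCH --cpus-per-task=1

# read input directly from the project area (reading only)
INPUT=/cluster/projects/nn10082k/your_name/your_filtered.vcf.gz   # placeholder path

# write all outputs to $USERWORK
OUTDIR=$USERWORK/vcf_stats
mkdir -p "$OUTDIR"

# placeholder command - substitute your own analysis (bcftools must be available, e.g. via module load)
bcftools stats "$INPUT" > "$OUTDIR/your_filtered.vcf.stats.txt"
```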
Also please try to keep your storage area clear. If you realise you no longer need files, please delete them. If you are finished with an analysis or dataset, you can store the scripts and the necessary files on NIRD (see below for more details).
If you have any questions about whether you are doing things correctly, please just ask Mark!
If you are working on sparrows in the group, you should have access to the sparrow sample catalogue. This is an open (but not editable) spreadsheet that you can search for the location of blood, extracted DNA and sequence data (i.e. reads and bams). It is regularly updated and will help you find what you need on the cluster. It also contains information on the names, locations, sex and other metadata of samples.
All individuals sampled or sequenced by the lab are assigned a unique
sample name. This name has been specially designed to incorporate
metadata. To demonstrate this we will use an example -
PDOM2012NOR0025F. This can be broken down with the following
information:
- `PDOM` - this is the species ID - *Passer domesticus* here. See the `species_key` tab of the spreadsheet for other species. Note that some subspecies are given a species code in order to make it easier to identify them.
- `2012` - the year of collection (or, where unavailable, the year the sample was submitted to the catalogue).
- `NOR` - the alpha-3 country code, Norway in this case - see `country_key` on the spreadsheet for a full list.
- `0025` - this is the 25th individual sampled. In this context, it means this is the 25th house sparrow sampled in Norway in 2012.
- `F` - the sex, here this is a female. `J` is juvenile and `U` is unknown or unidentified (depending on species).
All new samples entered into the system MUST be given a UiO code. It is important to also record the original ID assigned by the collector (especially if it is a collaborator). All reads included in the datasets and all downstream files MUST use this name.
Note you may notice that some samples occur as duplicates in the catalogue. There is a reason for this - some samples have been sequenced more than once. If you look closely at the read names, you will see that each row corresponds to a different pair of reads. There are several reasons this might be the case - i.e. replicate sequencing to ensure enough coverage, sequencing across multiple lanes to prevent batch effects and so on. It is not an error and the genotyping pipeline is built to account for this, so it is no problem to provide it with multiples of the same sample.
The whole point of these IDs is that metadata can be easily encoded in the sample names, making it much easier for you to parse downstream analyses. H/T to PhD student Jack Harper for helping design these codes. If you have any queries, spot any mistakes or need assistance with assigning sample codes, please discuss this with Mark.
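Because the ID format is fixed-width, you can pull the metadata straight out of a sample name with shell parameter expansion. A small sketch using the example above (assuming the standard four-letter species code):

```bash
sample="PDOM2012NOR0025F"

species=${sample:0:4}   # PDOM - species code
year=${sample:4:4}      # 2012 - year of collection
country=${sample:8:3}   # NOR  - alpha-3 country code
number=${sample:11:4}   # 0025 - individual number
sex=${sample:15:1}      # F    - sex (F female, J juvenile, U unknown)

echo "$species $year $country $number $sex"
```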
For now, all sequence data is stored on NIRD. This will change soon as
we begin to migrate reads to the European Nucleotide Archive. However
for now, if you need to find an individual you can do so at the
following path on Saga (assuming NIRD is mounted securely):
```
/nird/projects/NS10082K/
```
If you are logged into NIRD itself, the path is:
```
/projects/NS10082K/
```
Using `ls` will show you a series of directories, each named after a species (or subspecies) without the genus name. For example, `bactrianus` contains all reads from Bactrian sparrows.
Do not move or remove samples from this directory without permission - please only copy from it.
If you need to stage an analysis using a set of samples, you can
transfer them from within a screen session from NIRD to SAGA using the
following command:
```
rsync -ravzhP --chown=:username /nird/projects/NS10082K/domesticus/* .
```
This will use rsync to download every read from all domesticus samples
to your chosen directory. Be warned though, something like this will
take quite some time to complete!
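If you have not used screen before, a sketch of running the transfer inside a detachable session looks like this (the session name is arbitrary):

```bash
# start a named screen session
screen -S nird_transfer

# inside the session, run the transfer, then detach with Ctrl-a d
rsync -ravzhP --chown=:username /nird/projects/NS10082K/domesticus/* .

# later, reattach to check on progress
screen -r nird_transfer
```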
We have recently altered the storage to only include [cram files](https://en.wikipedia.org/wiki/CRAM_(file_format)) as these are compressed [bam files](https://en.wikipedia.org/wiki/Binary_Alignment_Map) and they are generally much smaller in size. All crams are stored on the nn10082k project storage area here:

```
/cluster/projects/nn10082k/crams
```
As with reads, they are separated into directories that reflect species/subspecies. If you are genotyping using the pipeline, you do not need to copy crams to your directory; the pipeline can be run from file paths that point directly here. However, if you are using any analyses that manipulate the crams, or you need to reconvert them to bams (possible with samtools, as sketched below), then please make a local copy in `$USERWORK`.
Please do not add crams to this directory without prior approval from Mark.
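If you do need to reconvert a cram to a bam, a minimal samtools sketch is below. The filenames, the subdirectory under `crams` and the reference path are placeholders, and the reference must be the one the cram was aligned against; work on a local copy in `$USERWORK` as noted above.

```bash
# copy the cram locally first - do not manipulate files in the project directory
cp /cluster/projects/nn10082k/crams/domesticus/PDOM2012NOR0025F.cram "$USERWORK"/

# convert the cram back to a bam (reference path is an example only)
cd "$USERWORK"
samtools view -b -T /path/to/reference_genome.fa \
  -o PDOM2012NOR0025F.bam PDOM2012NOR0025F.cram
```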
As with crams, general purpose vcfs can be found in the following directory:
```
/cluster/projects/nn10082k/crams
```
This should only be used to store approved vcfs for specific projects. If you place a file in here, please ensure it has a) a clear name, b) an explanation in the README file, and c) your initials in the README so that we can identify who it belongs to. All vcfs in this directory should be indexed.
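For reference, a vcf can be compressed and indexed like this before it goes into the directory (the filename is just an example):

```bash
# bgzip-compress the vcf (required for indexing)
bgzip my_project_filtered.vcf          # produces my_project_filtered.vcf.gz

# create a tabix (.tbi) index with bcftools
bcftools index -t my_project_filtered.vcf.gz
```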
NIRD is the National Infrastructure for Research Data. This is our longer-term storage for project outputs and important data.
NIRD is mounted within Saga. To access it you can use the following:
```
ls /nird/projects/NS10082K/
```
NIRD has two types of storage - datalake and datapeak. The distinction is complex but, for our purposes, data that is accessed more regularly is on datapeak, whereas data that is accessed less often is on datalake.
The standard NIRD directory defaults to datapeak, but you can access both like this:

```
/nird/datalake/NS10082K
/nird/datapeak/NS10082K
```
You can also log-in to NIRD as a separate cluster to Saga. For example:
```
ssh username@login.nird-lmd.sigma2.no
```
At present, we are using NIRD to store reads although this will change in the near future. It is used to store our most important data as a long-term storage facility.
Please do not upload anything to NIRD without discussing it with Mark first.
One issue we have run into is the use of group permissions for different filesystems and how this can mess with our disk size quota. It is a complex issue and I will not go into all the detail here other than to say that when you copy files from NIRD or the project directory, if you do not change group ownership, it will still count towards our disk usage.
The short summary of this is that by copying files to `$USERWORK` without changing the group, we are still eating into our quota when we shouldn't be! Luckily this is very easy to solve. There are two different ways to deal with this, depending on how you move or copy data.
If you move or copy files without setting the group (for example with `mv`), you need to change the group ownership immediately after doing so. So for example, if you copy or move a file from NIRD, you would do the following:

```
# move the file to the current directory
mv /nird/datalake/NS10082K/test_file .

# change group ownership
chgrp -R username test_file
```
Just replace `username` with your own. You can also run this on directories if needed.
If you use rsync to copy, you can change the group ownership while doing so. For example:

```
rsync -ravzh --chown=:username /nird/datalake/NS10082K/test_file .
```
This is especially useful if you are pulling down large files from NIRD.
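Whichever method you use, it is worth confirming that the group actually changed, for example:

```bash
# the group column should now show your own group rather than the project group
ls -l test_file

# and dusage will show the updated quota usage
dusage
```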