3_managing_data
By now you have a reasonable idea of how to work on the HPC using slurm. The next important step is to understand how best to manage data access and storage while using the group resources on the cluster. Here we will lay out details on accessing data, managing workspaces and what/where to store data. This document is an evolving set of guidelines but one that I ask you to follow as best you can. Please do contact me if anything is unclear or you would like advice on managing your data.
Generally on SAGA you will work in three main locations that can be accessed directly during job submission. There is also storage on NIRD (see below for more info). Each of these has different storage and file-number limits. As of 24/01/24, they are as follows:
| Area | Space | Files |
|---|---|---|
| `$HOME` | 20 Gb | 100 000 |
| `/cluster/projects/nn10082k/` | 5 Tb | 5 000 000 |
| `$USERWORK` | No limit | No limit |
| `/nird/projects/NS10082K` | 10 Tb | No limit |
For most of your work, $HOME is only really useful for running a
conda installation or storing small scripts.
The majority of your work will be done in `$USERWORK`. This is an environment variable that points to your user work directory; for example, mine is `/cluster/work/users/msravine`. As you can see from the table above, this directory has no limits. However, the downside is that it is purged every 42 days and, in times of low storage availability, every 21 days.
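You can check exactly where your own `$USERWORK` points with:

```bash
echo $USERWORK
```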
Similarly, we have a larger amount of group storage in `/cluster/projects/nn10082k`. This is not purged and is instead used by us as a group. However, as a general rule, it is not to be used for analyses. The purpose of this directory is only to store outputs that you do not want to be deleted by the regular purge. These might be vcfs or other important files that take a long time to recreate. In general, if you can regenerate a file easily within 24 hours, it does not need to be stored here.
Before starting work on the cluster, please make a directory in
/cluster/projects/nn10082k/ with your name - this is your personal
storage area in the group storage.
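For example, using my directory as a template (substitute your own name):

```bash
# create your personal directory in the group storage (the directory name is an example)
mkdir -p /cluster/projects/nn10082k/msravine
```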
Finally a tip - if you ever need to check disk quota or usage, the following command will help:
```
dusage
```
It might seem confusing to get used to this purging system, but here is a simple guideline to properly manage your work on Saga. To demonstrate good working practice, we will use the lab genotyping pipeline as an example.
- First check whether the samples you wish to genotype have already been aligned to the reference genome. To do this, check the `/cluster/projects/nn10082k/crams` directory. You can use `find` to search for individual samples - i.e. the following will identify both the `.cram` and the `.crai` index.

  ```
  find * -name "PDOM2012NOR0025F*"
  ```

- If the individual has not been mapped, you will need to download the reads. Starting in `$USERWORK`, you should download the reads from NIRD. Using the same individual as above:

  ```
  rsync -ravzhP --chown=:username /nird/projects/NS10082K/domesticus/PDOM2012NOR0025F* .
  ```

  Here we use rsync because it is faster and it also allows you to restart failed copies. The `--chown` option changes the group ownership after the files are copied - see below for a guide on exactly what this means. If you are downloading multiple individuals, it might take some time, so do it in a screen. As a general rule, reads are never to be stored in `/cluster/projects/nn10082k/` if they are backed up on NIRD.

- Run the pipeline on the reads you have downloaded. The first script will produce a bam file. You should convert it to a cram (see the sketch after this list) and then, after checking with Mark, you can store it in the crams directory - i.e. `/cluster/projects/nn10082k/crams`.

- If you are going to genotype other individuals already mapped to the reference genome, you can use paths that point directly to files in `/cluster/projects/nn10082k/crams` - i.e. there is no need to copy them to `$USERWORK`.

- Run the rest of the pipeline to produce your vcf. When you are satisfied you have a raw vcf (i.e. prior to filtering) and a filtered vcf you wish to keep, you can copy these to your directory in `/cluster/projects/nn10082k/`.

- If you wish to do any analyses on these vcfs, you can access them directly from slurm scripts while they are in `/cluster/projects/nn10082k/`. You may keep outputs in your project directory if they are difficult to make, but do not use this as storage for intermediate files.
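As referenced in the pipeline steps above, here is a minimal sketch of the bam-to-cram conversion with samtools. The filenames and reference path are placeholders; the reference must be the same one the reads were mapped to, so check with Mark if you are unsure.

```bash
# convert the bam to a cram (cram compression is reference-based, so the mapping reference is required)
samtools view -C -T /path/to/reference_genome.fa \
  -o PDOM2012NOR0025F.cram PDOM2012NOR0025F.bam

# index the cram so the .crai sits alongside it
samtools index PDOM2012NOR0025F.cram
```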
Remember as a general rule, do not point slurm scripts to
/cluster/projects/nn10082k/ for anything other than reading files for
a job. All outputs must be written to $USERWORK under all
circumstances. You may copy them over at a later stage.
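For instance, a minimal sketch of a slurm script that follows this rule might look like the block below. The account name is assumed to match the project directory, and the input path, resource requests and `bcftools` command are placeholders for whatever analysis you are actually running.

```bash
#!/bin/bash
#SBATCH --job-name=vcf_stats
#SBATCH --account=nn10082k        # assumed to match the project allocation
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=4G
#SBATCH --cpus-per-task=1

# read input directly from the project area (reading only)
INPUT=/cluster/projects/nn10082k/your_name/your_filtered.vcf.gz   # placeholder path

# write all outputs to $USERWORK
OUTDIR=$USERWORK/vcf_stats
mkdir -p "$OUTDIR"

# placeholder command - substitute your own analysis (bcftools must be available, e.g. via module load)
bcftools stats "$INPUT" > "$OUTDIR/your_filtered.vcf.stats.txt"
```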
Also please try to keep your storage area clear. If you realise you no longer need files, please delete them. If you are finished with an analysis or dataset, you can store the scripts and the necessary files on NIRD (see below for more details).
If you have any questions about whether you are doing things correctly, please just ask Mark!
If you are working on sparrows in the group, you should have access to the sparrow sample catalogue. This is an open (but not editable) spreadsheet that you can search for the location of blood, extracted DNA and sequence data (i.e. reads and bams). It is regularly updated and will help you find what you need on the cluster. It also contains information on the names, locations, sex and other metadata of samples.
All individuals sampled or sequenced by the lab are assigned a unique
sample name. This name has been specially designed to incorporate
metadata. To demonstrate this we will use an example -
PDOM2012NOR0025F. This can be broken down with the following
information:
- `PDOM` - this is the species ID - *Passer domesticus* here. See the `species_key` tab of the spreadsheet for other species. Note that some subspecies are given a species code in order to make it easier to identify them.
- `2012` - the year of collection (or, where unavailable, the year the sample was submitted to the catalogue).
- `NOR` - the alpha-3 country code, Norway in this case - see `country_key` on the spreadsheet for a full list.
- `0025` - this is the 25th individual sampled. In this context, it means this is the 25th house sparrow sampled in Norway in 2012.
- `F` - the sex, here this is a female. `J` is juvenile and `U` is unknown or unidentified (depending on species).
All new samples entered into the system MUST be given a UiO code. It is important to also record the original ID assigned by the collector (especially if it is a collaborator). All reads included in the datasets and all downstream files MUST use this name.
Note you may notice that some samples occur as duplicates in the catalogue. There is a reason for this - some samples have been sequenced more than once. If you look closely at the read names, you will see that each row corresponds to a different pair of reads. There are several reasons this might be the case - i.e. replicate sequencing to ensure enough coverage, sequencing across multiple lanes to prevent batch effects and so on. It is not an error and the genotyping pipeline is built to account for this, so it is no problem to provide it with multiples of the same sample.
The whole point of these IDs is that metadata can be easily encoded in the sample names, making it much easier for you to parse downstream analyses. H/T to PhD student Jack Harper for helping design these codes. If you have any queries, spot any mistakes or need assistance with assigning sample codes, please discuss this with Mark.
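Because the ID format is fixed-width, you can pull the metadata straight out of a sample name with shell parameter expansion. A small sketch using the example above (assuming the standard four-letter species code):

```bash
sample="PDOM2012NOR0025F"

species=${sample:0:4}   # PDOM - species code
year=${sample:4:4}      # 2012 - year of collection
country=${sample:8:3}   # NOR  - alpha-3 country code
number=${sample:11:4}   # 0025 - individual number
sex=${sample:15:1}      # F    - sex (F female, J juvenile, U unknown)

echo "$species $year $country $number $sex"
```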
For now, all sequence data is stored on NIRD. This will change soon as
we begin to migrate reads to the European Nucleotide Archive. However
for now, if you need to find an individual you can do so at the
following path on Saga (assuming NIRD is mounted securely):
```
/nird/projects/NS10082K/
```
If you are logged into NIRD itself, the path is:
```
/projects/NS10082K/
```
Using `ls` will show you a series of directories, each named after a species (or subspecies) without the genus name. For example, `bactrianus` contains all reads from Bactrian sparrows.
Do not move or remove samples from this directory without permission - please only copy from it.
If you need to stage an analysis using a set of samples, you can
transfer them from within a screen session from NIRD to SAGA using the
following command:
```
rsync -ravzhP --chown=:username /nird/projects/NS10082K/domesticus/* .
```
This will use rsync to download every read from all domesticus samples
to your chosen directory. Be warned though, something like this will
take quite some time to complete!
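If you have not used screen before, a sketch of running the transfer inside a detachable session looks like this (the session name is arbitrary):

```bash
# start a named screen session
screen -S nird_transfer

# inside the session, run the transfer, then detach with Ctrl-a d
rsync -ravzhP --chown=:username /nird/projects/NS10082K/domesticus/* .

# later, reattach to check on progress
screen -r nird_transfer
```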
We have recently altered the storage to only include [cram files](https://en.wikipedia.org/wiki/CRAM_(file_format)) as these are compressed [bam files](https://en.wikipedia.org/wiki/Binary_Alignment_Map) and they are generally much smaller in size. All crams are stored on the nn10082k project storage area here:

```
/cluster/projects/nn10082k/crams
```
As with reads, they are separated into directories that reflect species/subspecies. If you are genotyping using the pipeline, you do not need to copy crams to your directory; the pipeline can be run from file paths that point directly here. However, if you are using any analyses that manipulate the crams, or you need to reconvert them to bams (possible with samtools, as sketched below), then please make a local copy in `$USERWORK`.
Please do not add crams to this directory without prior approval from Mark.
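If you do need to reconvert a cram to a bam, a minimal samtools sketch is below. The filenames, the subdirectory under `crams` and the reference path are placeholders, and the reference must be the one the cram was aligned against; work on a local copy in `$USERWORK` as noted above.

```bash
# copy the cram locally first - do not manipulate files in the project directory
cp /cluster/projects/nn10082k/crams/domesticus/PDOM2012NOR0025F.cram "$USERWORK"/

# convert the cram back to a bam (reference path is an example only)
cd "$USERWORK"
samtools view -b -T /path/to/reference_genome.fa \
  -o PDOM2012NOR0025F.bam PDOM2012NOR0025F.cram
```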
As with crams, general purpose vcfs can be found in the following directory:
```
/cluster/projects/nn10082k/crams
```
This should only be used to store approved vcfs for specific projects. If you place a file in here, please ensure it has a) a clear name, b) an explanation in the README file, and c) your initials in the README so that we can identify who it belongs to. All vcfs in this directory should be indexed.
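For reference, a vcf can be compressed and indexed like this before it goes into the directory (the filename is just an example):

```bash
# bgzip-compress the vcf (required for indexing)
bgzip my_project_filtered.vcf          # produces my_project_filtered.vcf.gz

# create a tabix (.tbi) index with bcftools
bcftools index -t my_project_filtered.vcf.gz
```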
NIRD is the National Infrastructure for Research Data. This is our longer-term storage for project outputs and important data.
NIRD is mounted within Saga. To access it you can use the following:
```
ls /nird/projects/NS10082K/
```
NIRD has two types of storage - datalake and datapeak. The distinction is complex but, for our purposes, data that is accessed more regularly is on datapeak, whereas data that is accessed less often is on datalake.
The standard NIRD directory defaults to datapeak, but you can access both like this:

```
/nird/datalake/NS10082K
/nird/datapeak/NS10082K
```
You can also log-in to NIRD as a separate cluster to Saga. For example:
```
ssh username@login.nird-lmd.sigma2.no
```
At present, we are using NIRD to store reads although this will change in the near future. It is used to store our most important data as a long-term storage facility.
Please do not upload anything to NIRD without discussing it with Mark first.
One issue we have run into is the use of group permissions for different filesystems and how this can mess with our disk size quota. It is a complex issue and I will not go into all the detail here other than to say that when you copy files from NIRD or the project directory, if you do not change group ownership, it will still count towards our disk usage.
The short summary of this is that by copying files to `$USERWORK` without changing the group, we are still eating into our quota when we shouldn't be! Luckily this is very easy to solve. There are two different ways to deal with this, depending on how you move or copy data.
If you move or copy files without setting the group (for example with `mv`), you need to change the group ownership immediately after doing so. So for example, if you copy or move a file from NIRD, you would do the following:

```
# move the file to the current directory
mv /nird/datalake/NS10082K/test_file .

# change group ownership
chgrp -R username test_file
```
Just replace `username` with your own. You can also run this on directories if needed.
If you use rsync to copy, you can change the group ownership while doing so. For example:

```
rsync -ravzh --chown=:username /nird/datalake/NS10082K/test_file .
```
This is especially useful if you are pulling down large files from NIRD.
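Whichever method you use, it is worth confirming that the group actually changed, for example:

```bash
# the group column should now show your own group rather than the project group
ls -l test_file

# and dusage will show the updated quota usage
dusage
```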