Reference Genome instruction #112
base: develop
MikhailAf
commented
Apr 3, 2025
|
|
||
| Before importing a new reference genome, users are encouraged to **check which reference genomes are already available** in the system. This helps avoid duplication and ensures consistency across datasets. Users can: | ||
|
|
||
| * Browse existing reference genomes in the **File Browser** (under the Reference Genomes category), or |
File Browser -> File Manager
|
|
||
| ### **Required File Format** | ||
|
|
||
| If the reference genome needed is not listed in the **File Browser** or returned by the `GET /api/v1/reference-genomes` endpoint, users can import a custom reference genome into ODM to support their dataset. |
File Browser -> File Manager
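For context on the endpoint mentioned in this hunk, here is a minimal Python sketch of checking whether a genome is already present before importing a new one. Only the path `GET /api/v1/reference-genomes` comes from the text; the host, the `Genestack-API-Token` header name, and the assumption that the response is a JSON array of records with an `assembly` field are illustrative and should be checked against the actual API.

```python
import json
import urllib.request

BASE_URL = "https://odm.example.com"   # hypothetical host
TOKEN_HEADER = "Genestack-API-Token"   # assumed header name


def build_list_request(token):
    """Build (but do not send) the GET request for the documented path."""
    return urllib.request.Request(
        BASE_URL + "/api/v1/reference-genomes",
        headers={TOKEN_HEADER: token, "Accept": "application/json"},
        method="GET",
    )


def find_by_assembly(records, assembly):
    """Return the records whose assembly matches the requested one."""
    return [r for r in records if r.get("assembly") == assembly]


# Assumed response shape, used here instead of a live call.
sample_response = json.loads("""[
  {"organism": "Homo sapiens", "assembly": "GRCh38", "release": "109"},
  {"organism": "Mus musculus", "assembly": "GRCm39", "release": "109"}
]""")

request = build_list_request("my-token")
matches = find_by_assembly(sample_response, "GRCm39")
```

An empty `matches` list would suggest the genome still needs to be imported.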
|
|
||
| This response confirms successful import and provides a unique **accession ID**. | ||
|
|
||
| The newly imported reference genome is now available in ODM and visible in the File Manager. |
I believe we should also mention that the imported reference genome becomes available only after successful initialisation; until then it cannot be used.
|
|
||
| ### **Preparing Metadata** | ||
|
|
||
| To upload VCF files, you must also provide a metadata file in TSV (tab-separated values) format. This file should include at least the following fields: |
> you must also provide a metadata file in TSV

Let's mention that the metadata file is needed in this particular case in order to supply the reference genome information. I find this confusing and would like to add more detail here, because in general the metadata file can be skipped.
> This file should include at least the following fields:

Why? No. To indicate which reference genome should be used, the user needs to have either the attribute Genome Version or the attribute genestack.bio:organism in the metadata file.
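A minimal sketch of the check described in this comment: whether a metadata TSV carries either of the two attributes that identify the reference genome. The attribute names come from the comment; the helper name and the sample columns (`Sample Source ID`, the `GRCm39` value) are hypothetical.

```python
import csv
import io

# Attribute names taken from the review comment.
GENOME_COLUMNS = ("Genome Version", "genestack.bio:organism")


def genome_columns_present(tsv_text):
    """Return the genome-identifying columns found in the TSV header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return [column for column in GENOME_COLUMNS if column in header]


# Hypothetical metadata file with one sample row.
example_tsv = "Sample Source ID\tGenome Version\nS1\tGRCm39\n"
found = genome_columns_present(example_tsv)
```

An empty result means the file does not say which reference genome to use.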
|
|
||
|
|
||
| * **Genome Version**: The exact name of the reference genome as it appears in ODM |
Genome Version may contain one of two values:
- assembly: if several releases exist (for example, 100 and 109), a link to the latest release (109) will be created;
- name: a link to that exact release will be created.
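The two options can be illustrated with a small resolver. This is only a sketch of the behaviour described in the comment, not ODM's implementation: the record fields `name`, `assembly`, and `release`, the sample names, and the choice to try an exact name match before falling back to the assembly are all assumptions.

```python
def resolve_genome_version(value, genomes):
    """Pick the genome a Genome Version value would link to (illustrative)."""
    # Option "name": link to the exact release.
    for genome in genomes:
        if genome.get("name") == value:
            return genome
    # Option "assembly": link to the latest release of that assembly.
    candidates = [g for g in genomes if g.get("assembly") == value]
    if not candidates:
        return None
    return max(candidates, key=lambda g: int(g["release"]))


# Hypothetical records for two releases of the same assembly.
genomes = [
    {"name": "GRCh38 release 100", "assembly": "GRCh38", "release": "100"},
    {"name": "GRCh38 release 109", "assembly": "GRCh38", "release": "109"},
]

by_assembly = resolve_genome_version("GRCh38", genomes)
by_name = resolve_genome_version("GRCh38 release 100", genomes)
```

With these records, the assembly lookup selects release 109 and the name lookup selects release 100, matching the example in the comment.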
|
|
||
| Additional optional fields, such as **Version**, **Accession**, or **User**, may also be included and will not interfere with the upload. The system is flexible and accepts metadata files with varying numbers of columns. | ||
|
|
||
| !!! note "Metadata file examples" |
I'm confused by these examples and suggest using the information from this old article https://genestack.atlassian.net/wiki/spaces/~940367389/pages/3417047043/Working+with+Reference+genomes+version+ODM+1.53#Examples to show which options the user has and how they can be used. The number of other columns is unnecessary detail when we are talking about reference genomes.
|
|
||
| As with other data types, the request should include: | ||
|
|
||
| * A **metadata file** with information about the reference genome and organism |
> and organism

can be removed, it's necessary to provide information about organism
|
|
||
| * A **metadata file** with information about the reference genome and organism | ||
| * A **VCF file**, either compressed (**.vcf.gz**) or plain (**.vcf**) (see an example [VCF file](https://s3.amazonaws.com/bio-test-data/gVCF_Mm_Demo.vcf)) | ||
| * A **link structure** connecting the data to samples, libraries, or preparations |
Why? We don't provide this information in the request body of the job endpoint used to import VCF data.
|
|
||
| !!! note "Important" | ||
| Unlike transcriptomics or flow cytometry data, **a reference genome must be specified** when importing VCF files. If no metadata is provided, the system defaults to using the **human reference genome (GRCh38)**. To use a different genome, you must include a metadata file where the **Genome Version** matches the name of a **previously imported custom reference genome** in your ODM instance. |
"must be specified" -> "can be specified"? It may be fine for our client to use the default reference genome.
| * **organism**, **assembly**, **release**: Core genome attributes | ||
| * **annotationUrl**: Link to the annotation file used (e.g., GTF from Ensembl) | ||
| * **genestack:accession**: ODM accession for the reference genome | ||
| * **initializationStatus**: Should be COMPLETE if the genome is ready for use |
> Should be COMPLETE if the genome is ready for use

If I remember correctly, the user won't be able to import a VCF file with a metadata file that refers to a reference genome whose initialisation did not succeed.
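If the docs end up stating this, a reader-side check could look like the sketch below. It uses the field names from the hunk (`genestack:accession`, `initializationStatus`) and the `COMPLETE` value; the sample accessions and the `FAILED` status value are made up for illustration.

```python
def is_ready(record):
    """A genome is treated as usable only when initialization is COMPLETE."""
    return record.get("initializationStatus") == "COMPLETE"


def usable_accessions(records):
    """Return accessions of genomes that finished initialization."""
    return [r["genestack:accession"] for r in records if is_ready(r)]


# Hypothetical records; accessions and the FAILED value are placeholders.
records = [
    {"genestack:accession": "GSF000001", "initializationStatus": "COMPLETE"},
    {"genestack:accession": "GSF000002", "initializationStatus": "FAILED"},
    {"genestack:accession": "GSF000003"},  # status missing entirely
]

ready = usable_accessions(records)
```

Records with a missing or non-`COMPLETE` status are excluded, so only the first accession remains.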