@MikhailAf
Contributor

@MikhailAf requested a review from eeliane on April 3, 2025 13:57
@MikhailAf requested review from a team as code owners on April 3, 2025 13:57
@MikhailAf requested a review from a team on April 3, 2025 13:57
@MikhailAf marked this pull request as draft on May 30, 2025 12:30
@srz11d requested review from MariaBorodaenko and eeliane on June 2, 2025 15:55
@MikhailAf marked this pull request as ready for review on July 20, 2025 16:30
@MikhailAf requested review from a team as code owners on July 20, 2025 16:30

Before importing a new reference genome, users are encouraged to **check which reference genomes are already available** in the system. This helps avoid duplication and ensures consistency across datasets. Users can:

* Browse existing reference genomes in the **File Browser** (under the Reference Genomes category), or
Contributor

File Browser -> File Manager


### **Required File Format**

If the reference genome needed is not listed in the **File Browser** or returned by the `GET /api/v1/reference-genomes` endpoint, users can import a custom reference genome into ODM to support their dataset.
Contributor

File Browser -> File Manager
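
For readers who want to script the check described in the quoted paragraph, here is a minimal sketch in Python that builds the request for the `GET /api/v1/reference-genomes` endpoint. Only the path comes from the documentation under review; the host `odm.example.com`, the bearer-token header, and the helper name `build_list_request` are assumptions made for illustration and may not match a real ODM instance.

```python
from urllib.parse import urljoin
from urllib.request import Request

# Path quoted in the documentation under review.
REFERENCE_GENOMES_PATH = "/api/v1/reference-genomes"


def build_list_request(base_url: str, token: str) -> Request:
    """Build (but do not send) a GET request for the reference genome list."""
    url = urljoin(base_url, REFERENCE_GENOMES_PATH)
    # Assumption: bearer-token auth; replace with whatever your instance uses.
    headers = {"Accept": "application/json", "Authorization": f"Bearer {token}"}
    return Request(url, headers=headers, method="GET")


# Hypothetical host and placeholder token, for illustration only.
request = build_list_request("https://odm.example.com", "<token>")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be run without access to an instance.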


This response confirms successful import and provides a unique **accession ID**.

The newly imported reference genome is now available in ODM and visible in the File Manager.
Contributor

I believe we should also mention that the imported reference genome becomes available only after successful initialisation; until then it cannot be used.


### **Preparing Metadata**

To upload VCF files, you must also provide a metadata file in TSV (tab-separated values) format. This file should include at least the following fields:
Contributor

> you must also provide a metadata file in TSV

Let's mention that the metadata file is needed in this particular case to specify the reference genome information. I'm a bit confused here and would like to add more detail, because in general the metadata file can be skipped.

Contributor

> This file should include at least the following fields:

Why? No: to specify which reference genome should be used, the user has to include either the Genome Version attribute or the genestack.bio:organism attribute in the metadata file.
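
The requirement in this comment can be illustrated with a small Python sketch that checks whether a TSV metadata header names at least one of the two attributes. The attribute names come from the comment; the function name `names_reference_genome` and the sample rows (including the value `GRCm39`) are hypothetical, and this is not a description of how ODM itself validates metadata.

```python
import csv
import io

# Attribute names taken from the review comment above.
GENOME_ATTRIBUTES = ("Genome Version", "genestack.bio:organism")


def names_reference_genome(tsv_text: str) -> bool:
    """Return True if the TSV header contains at least one genome attribute."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return any(column.strip() in GENOME_ATTRIBUTES for column in header)


# Illustrative metadata: one file with a genome attribute, one without.
with_genome = "Genome Version\tUser\nGRCm39\talice\n"
without_genome = "User\nalice\n"

print(names_reference_genome(with_genome))
print(names_reference_genome(without_genome))
```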


To upload VCF files, you must also provide a metadata file in TSV (tab-separated values) format. This file should include at least the following fields:

* **Genome Version**: The exact name of the reference genome as it appears in ODM
Contributor

Genome Version may contain one of two values:

* assembly: if multiple releases exist (for example, 100 and 109), a link to the latest release (109) is created; or
* name: a link to the exact release is created.
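
As a way to picture the rule in this comment, the following Python sketch resolves a Genome Version value against a list of genomes. It is an illustration under assumptions, not ODM's implementation: the `Genome` record, the name format `GRCh38.100`, and the choice to try an exact name match before an assembly match are all hypothetical. Only the release numbers 100 and 109 come from the comment.

```python
from typing import NamedTuple, Optional


class Genome(NamedTuple):
    name: str      # exact identifier of one release (hypothetical format)
    assembly: str
    release: int


def resolve(genome_version: str, genomes: list[Genome]) -> Optional[Genome]:
    """Pick the genome a Genome Version value would link to."""
    # An exact name match links to that specific release.
    for genome in genomes:
        if genome.name == genome_version:
            return genome
    # Otherwise treat the value as an assembly and pick the latest release.
    candidates = [g for g in genomes if g.assembly == genome_version]
    return max(candidates, key=lambda g: g.release, default=None)


genomes = [
    Genome("GRCh38.100", "GRCh38", 100),
    Genome("GRCh38.109", "GRCh38", 109),
]

print(resolve("GRCh38", genomes))      # assembly: latest release
print(resolve("GRCh38.100", genomes))  # name: exact release
```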


Additional optional fields, such as **Version**, **Accession**, or **User**, may also be included and will not interfere with the upload. The system is flexible and accepts metadata files with varying numbers of columns.

!!! note "Metadata file examples"
Contributor

I'm confused by these examples and suggest using the information from this old article, https://genestack.atlassian.net/wiki/spaces/~940367389/pages/3417047043/Working+with+Reference+genomes+version+ODM+1.53#Examples, to show which options the user has and how they can be used. The number of other columns is unnecessary information when we are talking about reference genomes.


As with other data types, the request should include:

* A **metadata file** with information about the reference genome and organism
Contributor

> and organism

This can be removed; it isn't necessary to provide information about the organism.


* A **metadata file** with information about the reference genome and organism
* A **VCF file**, either compressed (**.vcf.gz**) or plain (**.vcf**); see this example [VCF file](https://s3.amazonaws.com/bio-test-data/gVCF_Mm_Demo.vcf)
* A **link structure** connecting the data to samples, libraries, or preparations
Contributor

Why? We don't provide this information in the request body of the job endpoint used to import VCF data.

* A **link structure** connecting the data to samples, libraries, or preparations

!!! note "Important"
    Unlike transcriptomics or flow cytometry data, **a reference genome must be specified** when importing VCF files. If no metadata is provided, the system defaults to using the **human reference genome (GRCh38)**. To use a different genome, you must include a metadata file where the **Genome Version** matches the name of a **previously imported custom reference genome** in your ODM instance.
Contributor

"must be specified" -> "can be specified"? Maybe it is okay for our client to use the default reference genome.

* **organism**, **assembly**, **release**: Core genome attributes
* **annotationUrl**: Link to the annotation file used (e.g., GTF from Ensembl)
* **genestack:accession**: ODM accession for the reference genome
* **initializationStatus**: Should be COMPLETE if the genome is ready for use
Contributor

> Should be COMPLETE if the genome is ready for use

If I remember correctly, the user won't be able to import a VCF file with a metadata file that mentions a reference genome whose initialisation did not succeed.
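
A client could guard against this case by filtering reference genome records on `initializationStatus` before using them. The Python sketch below assumes each record is a dictionary carrying the fields listed in the quoted documentation; the accessions and the `IN_PROGRESS` status value are placeholders invented for the example, and the real response shape may differ.

```python
def ready_genomes(records: list[dict]) -> list[dict]:
    """Keep only records whose initialisation has finished."""
    return [r for r in records if r.get("initializationStatus") == "COMPLETE"]


# Hypothetical records; field names follow the quoted documentation.
records = [
    {
        "genestack:accession": "GSF000001",
        "assembly": "GRCh38",
        "initializationStatus": "COMPLETE",
    },
    {
        "genestack:accession": "GSF000002",
        "assembly": "GRCm39",
        "initializationStatus": "IN_PROGRESS",  # placeholder status value
    },
]

ready = ready_genomes(records)
print([r["genestack:accession"] for r in ready])
```

Records without an `initializationStatus` field are treated as not ready, which errs on the side of not referencing a genome that may be unusable.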

4 participants