docs:better draft of first tutorial 1_Zarr_And_Xarray #23
Open: adamklie wants to merge 3 commits into main from docs
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -5,7 +5,7 @@ | |
|
|
||
| # SeqData (Annotated sequence data) | ||
|
|
||
| [[documentation](https://seqdata.readthedocs.io/en/latest/)][[tutorials]()] | ||
| [[documentation](https://seqdata.readthedocs.io/en/latest/)][[tutorials](https://github.com/ML4GLand/SeqData/tree/docs/docs/tutorials)] | ||
|
|
||
| SeqData is a Python package for preparing ML-ready genomic sequence datasets. Some of the key features of SeqData include: | ||
|
|
||
|
|
@@ -15,7 +15,7 @@ SeqData is a Python package for preparing ML-ready genomic sequence datasets. So | |
| - Offers out-of-core dataloading from disk to CPU to GPU | ||
|
|
||
| > [!NOTE] | ||
| > SeqData is under active development. The API has largely been decided on, but may change slightly across versions until the first major release. | ||
| > The API for SeqData has largely been decided on, but may change slightly across versions until the first major release. | ||
|
|
||
| ## Installation | ||
|
|
||
|
|
@@ -27,13 +27,14 @@ Although my focus will largely follow my research projects and the feedback I re | |
|
|
||
| - v0.1.0: ✔️ Initial API for reading BAM, FASTA, BigWig and Tabular data and building PyTorch dataloaders | ||
| - v0.2.0: (WIP) Bug fixes, improved documentation, tutorials, and examples | ||
| - v0.3.0: Improved out of core functionality, robust BED classification datasets | ||
| - v0.0.4 — Interoperability with AnnData and SnapATAC2 | ||
| - v0.X.0: Improved out of core functionality, robust BED classification datasets | ||
| - v0.X.4 — Interoperability with AnnData and SnapATAC2 | ||
|
|
||
| ## Usage | ||
| ## Quickstart | ||
| The examples below illustrate the simplest way to read in data from commonly used file formats. For a more comprehensive guide to using the SeqData API, see the full [documentation](https://seqdata.readthedocs.io/en/latest/). | ||
|
|
||
| ### Loading data from "flat" files | ||
| The simplest way to store genomic sequence data is in a table or in a "flat" fasta file. Though this can easily be accomplished using something like `pandas.read_csv`, the SeqData interface keeps the resulting on-disk and in-memory objects standardized with the rest of the SeqData and larger ML4GLand API. | ||
| ### Loading sequences from "flat" files | ||
| The simplest way to store genomic sequence data is as plain text strings in a table. For reading sequences from one or more csv/tsv files, use the `read_table` function: | ||
|
|
||
| ```python | ||
| import seqdata as sd | ||
|
|
@@ -48,12 +49,38 @@ sdata = sd.read_table( | |
| ) | ||
| ``` | ||
|
|
||
| Will generate a `sdata.zarr` file containing the sequences in the `seq_col` column of `sequences.tsv`. The resulting `sdata` object can then be used for downstream analysis. | ||
| These "fixed" sequences can also be stored in FASTA format. In SeqData, we call this a "flat" fasta file. Use the `read_flat_fasta` function to read sequences from such a file: | ||
|
|
||
| ```python | ||
| import seqdata as sd | ||
| sdata = sd.read_flat_fasta( | ||
| name="seq", # name of resulting xarray variable containing sequences | ||
| out="sdata.zarr", # output file | ||
| fasta="sequences.fa", # fasta file | ||
| fixed_length=False, # whether all sequences are the same length | ||
| batch_size=1000, # number of sequences to load at once | ||
| overwrite=True, # overwrite the output file if it exists | ||
| ) | ||
| ``` | ||
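To make the "flat" FASTA idea concrete, here is a minimal standard-library sketch of what such a file contains: one header line per record followed by the full sequence. The helper `parse_flat_fasta` and the toy records are hypothetical and only illustrate the file layout; this is not how SeqData reads or stores the data.

```python
from typing import Dict, Iterable


def parse_flat_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into a {record name: sequence} mapping.

    Hypothetical helper for illustration only (not part of SeqData).
    """
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token of the header.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[name] += line.upper()
    return records


# Toy input: two records of different lengths.
example = """>seq1
ACGT
acgt
>seq2
GGCC
""".splitlines()

records = parse_flat_fasta(example)

# Mirrors the question the `fixed_length` argument asks:
# do all sequences share one length?
fixed_length = len({len(seq) for seq in records.values()}) == 1
```

In this toy input the two records differ in length, which is the situation the example call above describes with `fixed_length=False`.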
|
|
||
| ### Loading sequences from genomic coordinates | ||
| Sequences are often referenced implicitly, as genomic coordinates in BED-like files that point into a genome FASTA file, rather than fully specified as above. We can use `read_genome_fasta` to load sequences from a genome fasta file using regions in a BED-like file: | ||
|
|
||
| ```python | ||
| import seqdata as sd | ||
| sdata = sd.read_genome_fasta( | ||
| name="seq", # name of resulting xarray variable containing sequences | ||
| out="sdata.zarr", # output file | ||
| fasta="genome.fa", # fasta file | ||
| bed="regions.bed", # bed file | ||
| fixed_length=False, # whether all sequences are the same length | ||
| batch_size=1000, # number of sequences to load at once | ||
| overwrite=True, # overwrite the output file if it exists | ||
| ) | ||
| ``` | ||
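As a rough illustration of what "loading sequences from genomic coordinates" means, the sketch below slices an in-memory genome using BED intervals. It relies on the standard BED convention of 0-based, half-open coordinates. The helpers `parse_bed` and `extract_regions` and the toy genome are hypothetical, and this is not SeqData's implementation.

```python
from typing import Dict, Iterable, List, Tuple

Region = Tuple[str, int, int]


def parse_bed(lines: Iterable[str]) -> List[Region]:
    """Read (chrom, start, end) from the first three BED columns."""
    regions: List[Region] = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def extract_regions(genome: Dict[str, str], regions: List[Region]) -> List[str]:
    """Slice each region out of the genome.

    BED intervals are 0-based and half-open, which matches Python slicing.
    """
    return [genome[chrom][start:end] for chrom, start, end in regions]


# Toy genome and regions (made up for illustration).
genome = {"chr1": "ACGTACGTAC", "chr2": "TTTTGGGGCC"}
bed_lines = ["chr1\t0\t4", "chr2\t4\t8\tpeak2"]

seqs = extract_regions(genome, parse_bed(bed_lines))
```

Because the intervals are half-open, a region `chr1 0 4` covers exactly four bases.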
|
|
||
| ### Loading data from BAM files | ||
| Reading from bam files allows one to choose custom counting strategies (often necessary with ATAC-seq data). | ||
| ### Loading read depth from BAM files | ||
| In functional genomics, we often work with aligned sequence reads stored in BAM files. In many applications, it is useful to quantify the pileup of reads at each position to describe a signal of interest (e.g. protein binding, chromatin accessibility, etc.). Used in combination with BED-like files, we can extract both sequences and base-pair resolution read pileup with the `read_bam` function: | ||
|
|
||
| ```python | ||
| import seqdata as sd | ||
|
|
@@ -68,8 +95,10 @@ sdata = sd.read_bam( | |
| ) | ||
| ``` | ||
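To show what a base-pair resolution pileup is, here is a small standard-library sketch that counts reads over a region under two counting strategies: full coverage of each aligned interval, or only the position where each read starts. The function `pileup`, its `mode` values, and the toy reads are hypothetical; they are not SeqData options, and the intervals are assumed to be 0-based and half-open.

```python
from typing import List, Sequence, Tuple


def pileup(
    reads: Sequence[Tuple[int, int]],
    region_start: int,
    region_end: int,
    mode: str = "coverage",
) -> List[int]:
    """Count reads at each position of [region_start, region_end).

    mode="coverage": every position a read spans is counted.
    mode="start":    only the position where a read begins is counted.
    Both modes are illustrative assumptions, not SeqData's API.
    """
    depth = [0] * (region_end - region_start)
    for read_start, read_end in reads:
        if mode == "coverage":
            # Clip the read to the region before counting.
            lo = max(read_start, region_start)
            hi = min(read_end, region_end)
            for pos in range(lo, hi):
                depth[pos - region_start] += 1
        elif mode == "start":
            if region_start <= read_start < region_end:
                depth[read_start - region_start] += 1
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return depth


# Three toy aligned reads as (start, end) intervals.
reads = [(0, 4), (2, 6), (5, 9)]

coverage = pileup(reads, 0, 8)
starts = pileup(reads, 0, 8, mode="start")
```

The same reads give different signals under the two modes, which is why the choice of counting strategy matters for assays such as ATAC-seq.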
|
|
||
| ### Loading data from BigWig files | ||
| [BigWig files](https://genome.ucsc.edu/goldenpath/help/bigWig.html) are a common way to store track-based data and the workhorse of modern genomic sequence based ML. ... | ||
| Because BAM files contain read alignments, we can use different strategies for quantifying the pileup at each position. See the TODO for a deeper dive into...TODO | ||
|
Collaborator
some TODOs here |
||
|
|
||
| ### Loading read depth from BigWig files | ||
| BAM files can be quite large and often carry more information than we need. [BigWig files](https://genome.ucsc.edu/goldenpath/help/bigWig.html) are a common way to store quantitative values at each genomic position (e.g. read depth, methylation fraction, etc.). Use the `read_bigwig` function to read them: | ||
|
|
||
| ```python | ||
| import seqdata as sd | ||
|
|
@@ -84,22 +113,17 @@ sdata = sd.read_bigwig( | |
| ) | ||
| ``` | ||
|
|
||
| ### Working with Zarr stores and XArray objects | ||
| The SeqData API is built to convert data from common formats to Zarr stores on disk. The Zarr store... When coupled with XArray and Dask, we also have the ability to lazy load data and work with data that is too large to fit in memory. | ||
| ### Building a dataloader | ||
| One of the main goals of SeqData is to allow a seamless flow from files on disk to machine learning ready datasets. Once data has been loaded with the functions above, this can be achieved by building a PyTorch dataloader with the `get_torch_dataloader` function: | ||
|
Collaborator
after |
||
|
|
||
| ```python | ||
| import seqdata as sd | ||
| dl = sd.get_torch_dataloader( | ||
| sdata, # SeqData object (e.g. as returned by read_table) | ||
| sample_dims="_sequence", # dimension to sample along | ||
| variables=["seqs"], # list of variables to include in the dataloader | ||
| batch_size=2, | ||
| ) | ||
| ``` | ||
|
|
||
| Admittedly, working with XArray can take some getting used to... | ||
|
|
||
| ### Building a dataloader | ||
| The main goal of SeqData is to allow a seamless flow | ||
|
|
||
| ## Contributing | ||
| This section was modified from https://github.com/pachterlab/kallisto. | ||
|
|
||
| All contributions, including bug reports, documentation improvements, and enhancement suggestions are welcome. Everyone within the community is expected to abide by our [code of conduct](https://github.com/ML4GLand/EUGENe/blob/main/CODE_OF_CONDUCT.md) | ||
|
|
||
| As we work towards a stable v1.0.0 release, we typically develop on branches. These are merged into `dev` once sufficiently tested. `dev` is the latest, stable, development branch. | ||
|
|
||
| `main` is used only for official releases and is considered to be stable. If you submit a pull request, please make sure to request to merge into `dev` and NOT `main`. | ||
| This generates a PyTorch dataloader that returns batches as Python dictionaries with the specified variables as keys. | ||
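The shape of such dictionary batches can be sketched without PyTorch. The generator below is a hypothetical stand-in (not `get_torch_dataloader` itself) that walks a toy dataset in fixed-size chunks and yields one dictionary per batch, keyed by the requested variable names. The names `iter_batches` and `toy`, and the `labels` variable, are assumptions made for illustration.

```python
from typing import Dict, Iterator, List, Sequence


def iter_batches(
    data: Dict[str, Sequence],
    variables: List[str],
    batch_size: int,
) -> Iterator[Dict[str, list]]:
    """Yield batches as {variable name: list of values} dictionaries.

    Illustrative only: a real dataloader would also handle shuffling,
    conversion to tensors, and out-of-core loading.
    """
    n_samples = len(data[variables[0]])
    for start in range(0, n_samples, batch_size):
        stop = start + batch_size
        # Only the requested variables appear as keys in each batch.
        yield {var: list(data[var][start:stop]) for var in variables}


# Toy dataset with two variables; only "seqs" is requested below.
toy = {"seqs": ["ACGT", "TTTT", "GGCC"], "labels": [0, 1, 1]}

batches = list(iter_batches(toy, variables=["seqs"], batch_size=2))
```

Each batch here has only the `seqs` key because that is the only variable requested, and the last batch is smaller when the number of samples is not a multiple of `batch_size`.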
v0.X.0