diff --git a/README.md b/README.md index dc4c443..0cb0a3e 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ # SeqData (Annotated sequence data) -[[documentation](https://seqdata.readthedocs.io/en/latest/)][[tutorials]()] +[[documentation](https://seqdata.readthedocs.io/en/latest/)][[tutorials](https://github.com/ML4GLand/SeqData/tree/docs/docs/tutorials)] SeqData is a Python package for preparing ML-ready genomic sequence datasets. Some of the key features of SeqData include: @@ -15,7 +15,7 @@ SeqData is a Python package for preparing ML-ready genomic sequence datasets. So - Offers out-of-core dataloading from disk to CPU to GPU > [!NOTE] -> SeqData is under active development. The API has largely been decided on, but may change slightly across versions until the first major release. +> The API for SeqData has largely been decided on, but may change slightly across versions until the first major release. ## Installation @@ -27,13 +27,14 @@ Although my focus will largely follow my research projects and the feedback I re - v0.1.0: ✔️ Initial API for reading BAM, FASTA, BigWig and Tabular data and building loading PyTorch dataloaders - v0.2.0: (WIP) Bug fixes, improved documentation, tutorials, and examples -- v0.3.0: Improved out of core functionality, robust BED classification datasets -- v0.0.4 — Interoperability with AnnData and SnapATAC2 +- v0.X.0: Improved out of core functionality, robust BED classification datasets +- v0.X.4: Interoperability with AnnData and SnapATAC2 -## Usage +## Quickstart +The examples below illustrate the simplest way to read in data from commonly used file formats. For a more comprehensive guide to using the SeqData API, see the full [documentation](https://seqdata.readthedocs.io/en/latest/). -### Loading data from "flat" files -The simplest way to store genomic sequence data is in a table or in a "flat" fasta file. 
Though this can easily be accomplished using something like `pandas.read_csv`, the SeqData interface keeps the resulting on-disk and in-memory objects standardized with the rest of the SeqData and larger ML4GLand API. +### Loading sequences from "flat" files +The simplest way to store genomic sequence data is as plain text strings in a table. For reading sequences from one or more csv/tsv files, use the `read_table` function: ```python from seqdata import read_table @@ -48,12 +49,38 @@ sdata = sd.read_table( ) ``` -Will generate a `sdata.zarr` file containing the sequences in the `seq_col` column of `sequences.tsv`. The resulting `sdata` object can then be used for downstream analysis. +Such fully specified sequences can also be stored in FASTA format. In SeqData, we call this a "flat" fasta file. Use the `read_flat_fasta` function to read sequences from such a file: + +```python +import seqdata as sd +sdata = sd.read_flat_fasta( + name="seq", # name of resulting xarray variable containing sequences + out="sdata.zarr", # output file + fasta="sequences.fa", # fasta file + fixed_length=False, # whether all sequences are the same length + batch_size=1000, # number of sequences to load at once + overwrite=True, # overwrite the output file if it exists +) +``` ### Loading sequences from genomic coordinates +Rather than being fully specified as above, sequences are commonly referenced implicitly via genomic coordinates in BED-like files that index into a genome FASTA. 
We can use `read_genome_fasta` to load sequences from a genome fasta file using regions in a BED-like file: + +```python +import seqdata as sd +sdata = sd.read_genome_fasta( + name="seq", # name of resulting xarray variable containing sequences + out="sdata.zarr", # output file + fasta="genome.fa", # fasta file + bed="regions.bed", # bed file + fixed_length=False, # whether all sequences are the same length + batch_size=1000, # number of sequences to load at once + overwrite=True, # overwrite the output file if it exists +) +``` -### Loading data from BAM files -Reading from bam files allows one to choose custom counting strategies (often necessary with ATAC-seq data). +### Loading read depth from BAM files +In functional genomics, we often work with aligned sequence reads stored in BAM files. In many applications, it is useful to quantify the pileup of reads at each position to describe a signal of interest (e.g. protein binding, chromatin accessibility, etc.). Used in combination with BED-like files, we can extract both sequences and base-pair resolution read pileup with the `read_bam` function: ```python from seqdata import read_bam @@ -68,8 +95,10 @@ sdata = sd.read_bam( ) ``` -### Loading data from BigWig files -[BigWig files](https://genome.ucsc.edu/goldenpath/help/bigWig.html) are a common way to store track-based data and the workhorse of modern genomic sequence based ML. ... +Because BAM files contain read alignments, we can use different strategies for quantifying the pileup at each position. See the TODO for a deeper dive into...TODO + +### Loading read depth from BigWig files +BAM files can be quite large and often carry more information than we need. [BigWig files](https://genome.ucsc.edu/goldenpath/help/bigWig.html) are a common way to store quantitative values at each genomic position (e.g. read depth, methylation fraction, etc.). Use the `read_bigwig` function to extract these values over regions in a BED-like file: 
```python from seqdata import read_bigwig @@ -84,22 +113,17 @@ sdata = sd.read_bigwig( ) ``` -### Working with Zarr stores and XArray objects -The SeqData API is built to convert data from common formats to Zarr stores on disk. The Zarr store... When coupled with XArray and Dask, we also have the ability to lazy load data and work with data that is too large to fit in memory. +### Building a dataloader +One of the main goals of SeqData is to allow a seamless flow from files on disk to ML-ready datasets. This can be achieved after loading data with the above functions by building a PyTorch dataloader with the `get_torch_dataloader` function: ```python +import seqdata as sd +dl = sd.get_torch_dataloader( + sdata, # SeqData object (e.g. as returned by read_table) + sample_dims="_sequence", # dimension to sample along + variables=["seqs"], # list of variables to include in the dataloader + batch_size=2, + ) ``` -Admittedly, working with XArray can take some getting used to... - -### Building a dataloader -The main goal of SeqData is to allow a seamless flow - -## Contributing -This section was modified from https://github.com/pachterlab/kallisto. - -All contributions, including bug reports, documentation improvements, and enhancement suggestions are welcome. Everyone within the community is expected to abide by our [code of conduct](https://github.com/ML4GLand/EUGENe/blob/main/CODE_OF_CONDUCT.md) - -As we work towards a stable v1.0.0 release, and we typically develop on branches. These are merged into `dev` once sufficiently tested. `dev` is the latest, stable, development branch. - -`main` is used only for official releases and is considered to be stable. If you submit a pull request, please make sure to request to merge into `dev` and NOT `main`. +This generates a PyTorch dataloader that returns batches as Python dictionaries with the specified variables as keys. 
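To illustrate the dictionary-style batches described above, here is a minimal standalone sketch using plain PyTorch. The `DictDataset` class and the `"seqs"` variable name are hypothetical stand-ins for illustration, not part of the SeqData API:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DictDataset(Dataset):
    """Hypothetical stand-in that yields dict samples, mimicking
    the dict-style batches produced by get_torch_dataloader."""
    def __init__(self, seqs: torch.Tensor):
        self.seqs = seqs

    def __len__(self) -> int:
        return self.seqs.shape[0]

    def __getitem__(self, i: int) -> dict:
        # Each sample is a dict of variable name -> array
        return {"seqs": self.seqs[i]}

# 4 "sequences" of length 2, just dummy numbers
ds = DictDataset(torch.arange(8, dtype=torch.float32).reshape(4, 2))
dl = DataLoader(ds, batch_size=2)

# PyTorch's default collate stacks dict samples into a dict of batched tensors
batch = next(iter(dl))
print(list(batch.keys()))          # ['seqs']
print(tuple(batch["seqs"].shape))  # (2, 2)
```

Iterating the real dataloader works the same way: each batch is a dictionary keyed by the variables you requested.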
diff --git a/docs/tutorials/4_Zarr_And_XArray.ipynb b/docs/tutorials/1_Zarr_And_XArray.ipynb similarity index 66% rename from docs/tutorials/4_Zarr_And_XArray.ipynb rename to docs/tutorials/1_Zarr_And_XArray.ipynb index f293f02..e6b2bee 100644 --- a/docs/tutorials/4_Zarr_And_XArray.ipynb +++ b/docs/tutorials/1_Zarr_And_XArray.ipynb @@ -5,7 +5,7 @@ "metadata": {}, "source": [ "# Working with Zarr and Xarray\n", - "Most Pythonistas are familair with Pandas and NumPy (and maybe Torch) for handling their data and might be less familiar with Zarr and Xarray. This tutorial is meant to highlight what you need to know about Zarr and Xarray to work with SeqData. More comprehensive tutorials can be found in the [Xarray](https://docs.xarray.dev/en/stable/) and [Zarr](https://zarr.dev/)" + "The packages Zarr and Xarray form the backbone of SeqData. Most Pythonistas are familiar with Pandas and NumPy but might be less familiar with Zarr or Xarray. This tutorial is meant to give you what you need to know about Zarr and Xarray to work with SeqData. More comprehensive tutorials can be found in the [Xarray](https://docs.xarray.dev/en/stable/) and [Zarr](https://zarr.dev/) documentation." ] }, { @@ -19,28 +19,28 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Genomics data is multidimensional and complex, and while Pandas is great for 2D data and NumPy can handle n-dimensional arrays, Xarray is specifically designed to handle n-dimensional data with labeled dimensions and coordinates. We believe this leads to a more intuitive, more concise, and less error-prone developer experience. The good thing about Xarray is that it built with a Pythonic API very similar to Pandas and NumPy and can easily convert between these libraries." + "Genomics data is multidimensional and complex, and while Pandas is great for 2D data and NumPy can handle n-dimensional arrays, Xarray is specifically designed to handle n-dimensional data with labeled dimensions and coordinates. 
We believe this leads to a more intuitive, more concise, and less error-prone developer experience. The good thing about Xarray is that it is built with a Pythonic API very similar to Pandas and NumPy and can easily convert between these libraries (when applicable)." ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 2, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The history saving thread hit an unexpected error (DatabaseError('database disk image is malformed')).History will not be written to the database.\n" - ] - } - ], + "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import xarray as xr" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Xarray Data Structures\n", + "Adapted from: https://docs.xarray.dev/en/latest/user-guide/data-structures.html#data-structures" + ] + }, { "cell_type": "markdown", "metadata": { @@ -49,12 +49,12 @@ } }, "source": [ - "Xarray has two core data structures that are fundamentally N-dimensional. The first are `DataArrays` which are simply labeled, N-dimensional array. `DataArrays` are an N-D generalization of a `pandas.Series` and work very similarly to numpy arrays:" + "Xarray has two core data structures that are fundamentally N-dimensional. The first are `DataArray`s, which are simply labeled, N-dimensional arrays. `DataArray`s are an N-D generalization of a `pandas.Series` and work very similarly to NumPy arrays:" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -424,23 +424,23 @@ " fill: currentColor;\n", "}\n", "
<xarray.DataArray (x: 2, y: 3)> Size: 48B\n",
-       "array([[ 0.62386048, -0.39317812, -1.0522729 ],\n",
-       "       [ 1.10703133,  0.63643986,  1.45434113]])\n",
+       "array([[-0.01801908, -0.05525951, -0.65284135],\n",
+       "       [-0.33377063, -1.00420174, -0.70043332]])\n",
        "Coordinates:\n",
        "  * x        (x) int64 16B 10 20\n",
-       "Dimensions without coordinates: y
" + "Dimensions without coordinates: y" ], "text/plain": [ " Size: 48B\n", - "array([[ 0.62386048, -0.39317812, -1.0522729 ],\n", - " [ 1.10703133, 0.63643986, 1.45434113]])\n", + "array([[-0.01801908, -0.05525951, -0.65284135],\n", + " [-0.33377063, -1.00420174, -0.70043332]])\n", "Coordinates:\n", " * x (x) int64 16B 10 20\n", "Dimensions without coordinates: y" ] }, - "execution_count": 2, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -454,12 +454,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The second is a multi-dimensional, in-memory array database called a Dataset. It is a Python dictionary like container of `DataArray` objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the `pandas.DataFrame.`" + "We created a 2D array above with the labeled dimensions `x` and `y`. Xarray uses 'coordinates' to provide meaningful labels for the dimensions of a dataset. In this case we gave the two dimensions the coordinates 10 and 20. Coordinates are not required, but will enable indexing of data along those dimensions beyond simple integer indexing." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The second Xarray data structure worth mentioning is the `Dataset`. `Dataset`s are multi-dimensional, in-memory array databases that behave like Python dictionaries of `DataArray` objects. `Dataset`s can be aligned aligned along any number of shared dimensions, and serve a similar purpose in Xarray to a `DataFrame` in pandas." 
] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -834,10 +841,10 @@ " * x (x) int64 16B 10 20\n", "Dimensions without coordinates: y\n", "Data variables:\n", - " foo (x, y) float64 48B 0.6239 -0.3932 -1.052 1.107 0.6364 1.454\n", + " foo (x, y) float64 48B -0.01802 -0.05526 -0.6528 -0.3338 -1.004 -0.7004\n", " bar (x) int64 16B 1 2\n", - " baz float64 8B 3.142" + " baz float64 8B 3.142" ], "text/plain": [ " Size: 88B\n", @@ -846,12 +853,12 @@ " * x (x) int64 16B 10 20\n", "Dimensions without coordinates: y\n", "Data variables:\n", - " foo (x, y) float64 48B 0.6239 -0.3932 -1.052 1.107 0.6364 1.454\n", + " foo (x, y) float64 48B -0.01802 -0.05526 -0.6528 -0.3338 -1.004 -0.7004\n", " bar (x) int64 16B 1 2\n", " baz float64 8B 3.142" ] }, - "execution_count": 3, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -861,19 +868,9 @@ "ds" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "The power of the dataset over a plain dictionary is that, in addition to pulling out arrays by name, it is possible to select or combine data along a dimension across all arrays simultaneously. Like a DataFrame, datasets facilitate array operations with heterogeneous data – the difference is that the arrays in a dataset can have not only different data types, but also different numbers of dimensions.\n", - "\n" - ] - }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -1243,23 +1240,23 @@ " fill: currentColor;\n", "}\n", "
<xarray.DataArray 'foo' (x: 2, y: 3)> Size: 48B\n",
-       "array([[ 0.62386048, -0.39317812, -1.0522729 ],\n",
-       "       [ 1.10703133,  0.63643986,  1.45434113]])\n",
+       "array([[-0.01801908, -0.05525951, -0.65284135],\n",
+       "       [-0.33377063, -1.00420174, -0.70043332]])\n",
        "Coordinates:\n",
        "  * x        (x) int64 16B 10 20\n",
-       "Dimensions without coordinates: y
" + "Dimensions without coordinates: y" ], "text/plain": [ " Size: 48B\n", - "array([[ 0.62386048, -0.39317812, -1.0522729 ],\n", - " [ 1.10703133, 0.63643986, 1.45434113]])\n", + "array([[-0.01801908, -0.05525951, -0.65284135],\n", + " [-0.33377063, -1.00420174, -0.70043332]])\n", "Coordinates:\n", " * x (x) int64 16B 10 20\n", "Dimensions without coordinates: y" ] }, - "execution_count": 4, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -1268,13 +1265,6 @@ "ds[\"foo\"]" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Working with Xarray" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -1287,7 +1277,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Xarray supports four different kinds of indexing, as described below and summarized in this table:\n", + "Xarray supports four different kinds of indexing, as summarized in this table:\n", "\n", "| Dimension lookup | Index lookup | `DataArray` syntax | `Dataset` syntax |\n", "|------------------|--------------|-------------------------------|------------------------------|\n", @@ -1303,12 +1293,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's see how indexing works on some dummy data:" + "Let's see how indexing works in practice:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Positional indexing" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 18, "metadata": {}, "outputs": [ { @@ -1678,31 +1675,31 @@ " fill: currentColor;\n", "}\n", "
<xarray.DataArray (time: 4, space: 3)> Size: 96B\n",
-       "array([[0.38986545, 0.46173691, 0.33087501],\n",
-       "       [0.34059049, 0.93449628, 0.18913959],\n",
-       "       [0.57198535, 0.41152441, 0.58869555],\n",
-       "       [0.23481294, 0.33283331, 0.05015661]])\n",
+       "array([[0.38236869, 0.35003513, 0.19700494],\n",
+       "       [0.07759571, 0.88967283, 0.94486908],\n",
+       "       [0.24583904, 0.58348918, 0.15871799],\n",
+       "       [0.17580363, 0.94530778, 0.99840317]])\n",
        "Coordinates:\n",
        "  * time     (time) datetime64[ns] 32B 2000-01-01 2000-01-02 ... 2000-01-04\n",
-       "  * space    (space) <U2 24B 'IA' 'IL' 'IN'
    • time
      PandasIndex
      PandasIndex(DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04'], dtype='datetime64[ns]', name='time', freq='D'))
    • space
      PandasIndex
      PandasIndex(Index(['IA', 'IL', 'IN'], dtype='object', name='space'))
  • " ], "text/plain": [ " Size: 96B\n", - "array([[0.38986545, 0.46173691, 0.33087501],\n", - " [0.34059049, 0.93449628, 0.18913959],\n", - " [0.57198535, 0.41152441, 0.58869555],\n", - " [0.23481294, 0.33283331, 0.05015661]])\n", + "array([[0.38236869, 0.35003513, 0.19700494],\n", + " [0.07759571, 0.88967283, 0.94486908],\n", + " [0.24583904, 0.58348918, 0.15871799],\n", + " [0.17580363, 0.94530778, 0.99840317]])\n", "Coordinates:\n", " * time (time) datetime64[ns] 32B 2000-01-01 2000-01-02 ... 2000-01-04\n", " * space (space)
    <xarray.DataArray (time: 2, space: 3)> Size: 48B\n",
    -       "array([[0.38986545, 0.46173691, 0.33087501],\n",
    -       "       [0.34059049, 0.93449628, 0.18913959]])\n",
    +       "array([[0.38236869, 0.35003513, 0.19700494],\n",
    +       "       [0.07759571, 0.88967283, 0.94486908]])\n",
            "Coordinates:\n",
            "  * time     (time) datetime64[ns] 16B 2000-01-01 2000-01-02\n",
    -       "  * space    (space) <U2 24B 'IA' 'IL' 'IN'
    " + " * space (space) <U2 24B 'IA' 'IL' 'IN'" ], "text/plain": [ " Size: 48B\n", - "array([[0.38986545, 0.46173691, 0.33087501],\n", - " [0.34059049, 0.93449628, 0.18913959]])\n", + "array([[0.38236869, 0.35003513, 0.19700494],\n", + " [0.07759571, 0.88967283, 0.94486908]])\n", "Coordinates:\n", " * time (time) datetime64[ns] 16B 2000-01-01 2000-01-02\n", " * space (space)
    <xarray.DataArray ()> Size: 8B\n",
    -       "array(0.38986545)\n",
    +       "
    <xarray.DataArray (time: 2)> Size: 16B\n",
    +       "array([0.38236869, 0.07759571])\n",
            "Coordinates:\n",
    -       "    time     datetime64[ns] 8B 2000-01-01\n",
    -       "    space    <U2 8B 'IA'
    " + " * time (time) datetime64[ns] 16B 2000-01-01 2000-01-02\n", + " space <U2 8B 'IA'
    " ], "text/plain": [ - " Size: 8B\n", - "array(0.38986545)\n", + " Size: 16B\n", + "array([0.38236869, 0.07759571])\n", "Coordinates:\n", - " time datetime64[ns] 8B 2000-01-01\n", + " * time (time) datetime64[ns] 16B 2000-01-01 2000-01-02\n", " space
    <xarray.DataArray (time: 2)> Size: 16B\n",
    -       "array([0.38986545, 0.34059049])\n",
    +       "array([0.38236869, 0.07759571])\n",
            "Coordinates:\n",
            "  * time     (time) datetime64[ns] 16B 2000-01-01 2000-01-02\n",
    -       "    space    <U2 8B 'IA'
    " + " space <U2 8B 'IA'" ], "text/plain": [ " Size: 16B\n", - "array([0.38986545, 0.34059049])\n", + "array([0.38236869, 0.07759571])\n", "Coordinates:\n", " * time (time) datetime64[ns] 16B 2000-01-01 2000-01-02\n", " space
    <xarray.DataArray (time: 2)> Size: 16B\n",
    -       "array([0.38986545, 0.34059049])\n",
    +       "
    <xarray.DataArray (time: 2, space: 3)> Size: 48B\n",
    +       "array([[0.38236869, 0.35003513, 0.19700494],\n",
    +       "       [0.07759571, 0.88967283, 0.94486908]])\n",
            "Coordinates:\n",
            "  * time     (time) datetime64[ns] 16B 2000-01-01 2000-01-02\n",
    -       "    space    <U2 8B 'IA'
    " + " * space (space) <U2 24B 'IA' 'IL' 'IN'
    " ], "text/plain": [ - " Size: 16B\n", - "array([0.38986545, 0.34059049])\n", + " Size: 48B\n", + "array([[0.38236869, 0.35003513, 0.19700494],\n", + " [0.07759571, 0.88967283, 0.94486908]])\n", "Coordinates:\n", " * time (time) datetime64[ns] 16B 2000-01-01 2000-01-02\n", - " space
    <xarray.DataArray (time: 2, space: 3)> Size: 48B\n",
    -       "array([[0.38986545, 0.46173691, 0.33087501],\n",
    -       "       [0.34059049, 0.93449628, 0.18913959]])\n",
    +       "
    <xarray.DataArray (time: 2)> Size: 16B\n",
    +       "array([0.38236869, 0.07759571])\n",
            "Coordinates:\n",
            "  * time     (time) datetime64[ns] 16B 2000-01-01 2000-01-02\n",
    -       "  * space    (space) <U2 24B 'IA' 'IL' 'IN'
    " + " space <U2 8B 'IA'
    " ], "text/plain": [ - " Size: 48B\n", - "array([[0.38986545, 0.46173691, 0.33087501],\n", - " [0.34059049, 0.93449628, 0.18913959]])\n", + " Size: 16B\n", + "array([0.38236869, 0.07759571])\n", "Coordinates:\n", " * time (time) datetime64[ns] 16B 2000-01-01 2000-01-02\n", - " * space (space)
    <xarray.DataArray (time: 2)> Size: 16B\n",
    -       "array([0.38986545, 0.34059049])\n",
    +       "
    <xarray.DataArray (time: 2, space: 3)> Size: 48B\n",
    +       "array([[0.38236869, 0.35003513, 0.19700494],\n",
    +       "       [0.07759571, 0.88967283, 0.94486908]])\n",
            "Coordinates:\n",
            "  * time     (time) datetime64[ns] 16B 2000-01-01 2000-01-02\n",
    -       "    space    <U2 8B 'IA'
    " + " * space (space) <U2 24B 'IA' 'IL' 'IN'
    " ], "text/plain": [ - " Size: 16B\n", - "array([0.38986545, 0.34059049])\n", + " Size: 48B\n", + "array([[0.38236869, 0.35003513, 0.19700494],\n", + " [0.07759571, 0.88967283, 0.94486908]])\n", "Coordinates:\n", " * time (time) datetime64[ns] 16B 2000-01-01 2000-01-02\n", - " space
    • foo
      (time, space)
      float64
      0.09745 0.03016 ... 0.1258 0.6561
      array([[0.09745444, 0.03015742, 0.45398843],\n",
      +       "       [0.74742904, 0.23636766, 0.69940638],\n",
      +       "       [0.67474816, 0.82512846, 0.64404217],\n",
      +       "       [0.06185114, 0.12577389, 0.65610942]])
    • time
      PandasIndex
      PandasIndex(DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04'], dtype='datetime64[ns]', name='time', freq='D'))
    • space
      PandasIndex
      PandasIndex(Index(['IA', 'IL', 'IN'], dtype='object', name='space'))
  • " ], "text/plain": [ " Size: 152B\n", @@ -4538,21 +4547,36 @@ " * time (time) datetime64[ns] 32B 2000-01-01 2000-01-02 ... 2000-01-04\n", " * space (space) " + " foo (time, space) float64 8B 0.09745" ], "text/plain": [ " Size: 24B\n", @@ -4936,10 +4960,10 @@ " * time (time) datetime64[ns] 8B 2000-01-01\n", " * space (space) " + " foo (space) float64 24B 0.09745 0.03016 0.454" ], "text/plain": [ " Size: 56B\n", @@ -5334,10 +5358,10 @@ " time datetime64[ns] 8B 2000-01-01\n", " * space (space)
    <xarray.Dataset> Size: 24B\n",
    -       "Dimensions:  (time: 1, space: 1)\n",
    -       "Coordinates:\n",
    -       "  * time     (time) datetime64[ns] 8B 2000-01-01\n",
    -       "  * space    (space) <U2 8B 'IA'\n",
    +       "
    <xarray.Dataset> Size: 16B\n",
    +       "Dimensions:  (x: 1, y: 1, z: 1)\n",
    +       "Dimensions without coordinates: x, y, z\n",
            "Data variables:\n",
    -       "    foo      (time, space) float64 8B 0.9963
    " + " foo (x, y, z) int64 8B 42\n", + " bar (y, z) int64 8B 24
    " ], "text/plain": [ - " Size: 24B\n", - "Dimensions: (time: 1, space: 1)\n", - "Coordinates:\n", - " * time (time) datetime64[ns] 8B 2000-01-01\n", - " * space (space) Size: 16B\n", + "Dimensions: (x: 1, y: 1, z: 1)\n", + "Dimensions without coordinates: x, y, z\n", "Data variables:\n", - " foo (time, space) float64 8B 0.9963" + " foo (x, y, z) int64 8B 42\n", + " bar (y, z) int64 8B 24" ] }, - "execution_count": 18, + "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "ds[dict(space=[0], time=[0])]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Converting between Xarray and the NumPy stack" + "ds = xr.Dataset({\"foo\": ((\"x\", \"y\", \"z\"), [[[42]]]), \"bar\": ((\"y\", \"z\"), [[24]])})\n", + "ds\n" ] }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 35, "metadata": {}, "outputs": [ { "data": { - "text/plain": [ - "x y\n", - "10 0 0.623860\n", - " 1 -0.393178\n", - " 2 -1.052273\n", - "20 0 1.107031\n", - " 1 0.636440\n", - " 2 1.454341\n", - "dtype: float64" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# convert to a pandas Series\n", - "series = data.to_series()\n", - "series" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Writing to Zarr stores" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Zarr is a Python package that provides an implementation of chunked, compressed, N-dimensional arrays. Zarr has the ability to store arrays in a range of ways, including in memory, in files, and in cloud-based object storage such as Amazon S3 and Google Cloud Storage. Xarray’s Zarr backend allows xarray to leverage these capabilities, including the ability to store and analyze datasets far too large fit onto disk (particularly in combination with dask)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, + "text/html": [ + "
    \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
    <xarray.Dataset> Size: 16B\n",
    +       "Dimensions:  (x: 1, y: 1, z: 1)\n",
    +       "Dimensions without coordinates: x, y, z\n",
    +       "Data variables:\n",
    +       "    foo      (y, z, x) int64 8B 42\n",
    +       "    bar      (y, z) int64 8B 24
    " + ], + "text/plain": [ + " Size: 16B\n", + "Dimensions: (x: 1, y: 1, z: 1)\n", + "Dimensions without coordinates: x, y, z\n", + "Data variables:\n", + " foo (y, z, x) int64 8B 42\n", + " bar (y, z) int64 8B 24" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds.transpose(\"y\", \"z\", \"x\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Concatenating Xarray objects" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can concatenate Xarray objects along a new or existing dimension using the concat() function:" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
    \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
    <xarray.DataArray (x: 2, y: 3)> Size: 48B\n",
    +       "array([[0, 1, 2],\n",
    +       "       [3, 4, 5]])\n",
    +       "Coordinates:\n",
    +       "  * x        (x) <U1 8B 'a' 'b'\n",
    +       "  * y        (y) int64 24B 10 20 30
    " + ], + "text/plain": [ + " Size: 48B\n", + "array([[0, 1, 2],\n", + " [3, 4, 5]])\n", + "Coordinates:\n", + " * x (x) \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
    <xarray.DataArray (x: 2, y: 3)> Size: 48B\n",
    +       "array([[0, 1, 2],\n",
    +       "       [3, 4, 5]])\n",
    +       "Coordinates:\n",
    +       "  * x        (x) <U1 8B 'a' 'b'\n",
    +       "  * y        (y) int64 24B 10 20 30
    " + ], + "text/plain": [ + " Size: 48B\n", + "array([[0, 1, 2],\n", + " [3, 4, 5]])\n", + "Coordinates:\n", + " * x (x)