Skip to content

Zarr support#190

Draft
keller-mark wants to merge 108 commits intoscverse:develfrom
keller-mark:keller-mark/zarr
Draft

Zarr support#190
keller-mark wants to merge 108 commits intoscverse:develfrom
keller-mark:keller-mark/zarr

Conversation

@keller-mark
Copy link

@keller-mark keller-mark commented Nov 5, 2024

Fixes #91

These changes are from both me and @Artur-man

The main public-facing changes here are:

  • The ZarrAnnData class
  • read_zarr and write_zarr top-level functions
  • Support for from_Seurat(output_class="ZarrAnnData")
  • Support for from_SingleCellExperiment(output_class="ZarrAnnData")

Internally:

  • read_zarr_helpers.R is the zarr analog of read_h5ad_helpers.R
  • write_zarr_helpers.R is the zarr analog of write_h5ad_helpers.R
  • Test fixtures within inst/extdata/example.zarr (this makes the diff noisy, apologies)
  • Lots of tests:
    • test-Zarr-read.R (35 new tests)
    • test-Zarr-write.R (70)
    • test-ZarrAnnData.R (26)
    • test-h5ad-zarr.R (17)

A number of these functions generate warnings in the R console that are intended to be followed up on to improve the code (and should probably be resolved as end users may not appreciate them), but the tests still pass despite these warnings.

Known things that are not implemented here:

  • support for recarrays
  • usage of mode = c("r", "r+", "a", "w", "w-", "x") parameter value

Copy link
Member

@rcannood rcannood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work @keller-mark and @Artur-man !

I went through the PR for a first time and left some minor comments. I will review the code by running it a couple of times next :)

attrs <- g$get_attrs()$to_list()

if (!all(c("encoding-type", "encoding-version") %in% names(attrs))) {
path <- name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where are a lot of linting issues in this file -- could you run lintr::lint_package() and fix any issues that pop up?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done ... did a full lint_package check and corrected some R check issues too!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should probably be removed

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

expect_equal(array_h5ad, array_zarr)
})

test_that("reading mappings is same for h5ad and zarr", {
Copy link

@Artur-man Artur-man Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a difference between rhdf5, Rarr and h5py on how rec arrays and structured data types are read.

To me, h5py and Rarr reads it correctly, but rhdf5 does not:
Huber-group-EMBL/rhdf5#192

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need an issue for this here as well? From our point of view it doesn't matter so much what is returned by {rhdf5} as long as what we return from the AnnData matches Python.

Going the other way (writing a recarray) is more difficult because we don't know when to use that format. I don't think we test this. Also, I don't think the AnnData disk spec really covers alternative arrays like recarray.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Artur-man
Copy link

We are really close, tests are passing on my local but we have to wait until changes in Rarr are in BioC-devel.
Let me give a break I will review the entire PR. Please also let me know if you catch stuff, otherwise I will get back to this to finalize.

Thanks @Bisaloo again for answering all issues of Rarr!

@Bisaloo
Copy link
Contributor

Bisaloo commented Jan 27, 2026

As discussed in the bioconductor zulip, you're not likely to get up-to-date macOS x86 binaries for Rarr. So you'll need to either force installing from source, or add the bioc r-universe as an additional repository for this test to pass.

Alternatively, setting 1.11.24 as minimum required version for Rarr might do the trick to force installing from source when a recent enough binary isn't available.

@github-actions
Copy link

github-actions bot commented Feb 25, 2026

🐰 Bencher Report

Branchkeller-mark/zarr
Testbedubuntu-latest

⚠️ WARNING: Truncated view!

The full continuous benchmarking report exceeds the maximum length allowed on this platform.

🐰 View full continuous benchmarking report in Bencher

@lazappi lazappi assigned lazappi and unassigned lazappi Feb 26, 2026
@lazappi lazappi self-requested a review February 26, 2026 06:24
@lazappi
Copy link
Collaborator

lazappi commented Feb 26, 2026

As discussed in the bioconductor zulip, you're not likely to get up-to-date macOS x86 binaries for Rarr. So you'll need to either force installing from source, or add the bioc r-universe as an additional repository for this test to pass.

Alternatively, setting 1.11.24 as minimum required version for Rarr might do the trick to force installing from source when a recent enough binary isn't available.

Could you please explain what the issue with this is? Is it something on the Bioconductor side so do we need to look into it?

I'm guessing the current release check is also expected to fail because we need a later version of {Rarr}?

@Bisaloo
Copy link
Contributor

Bisaloo commented Feb 26, 2026

Could you please explain what the issue with this is? Is it something on the Bioconductor side so do we need to look into it?

BioC macOS x86 builder is no longer working. They're planning on pulling the binaries from r-universe, as they are doing with windows binaries, but it's still a work in progress.
In the meantime, the latest available version for macOS x86 binaries in BioC devel is quite outdated.
It's something that needs to be (and will be) resolved by Bioconductor.

But you can still work around the issue if you need to. By requiring installation from source (this should happen automatically if you request a version more recent than what the binaries provide). Or by getting the binaries from r-universe yourself.

I'm guessing the current release check is also expected to fail because we need a later version of {Rarr}?

Yes, exactly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement Zarr backend

5 participants