
Conversation

@cgwalters
Collaborator

Obviously massively Assisted-by: OpenCode (Opus 4.5) here; I did a lot of design work of course, but this is looking pretty nice to me, and it seems to work well!


@cgwalters cgwalters left a comment


Doing a review here so it's more visible

@giuseppe
Member

I've not looked into the details yet, but are we comfortable with the risks of this intrusive handling of the storage library? The storage library doesn't have any stability guarantees and can change at any time.

@cgwalters
Collaborator Author

I've not looked into the details yet, but are we comfortable with the risks of this intrusive handling of the storage library? The storage library doesn't have any stability guarantees and can change at any time.

Is there anything fundamentally different from the storage access in bootc-dev/bootc#1962 though?

As far as the "can change at any time" - in reality we have pretty strong constraints there on any truly backwards-incompatible changes, right? Because even though there's one Go library for it, there are multiple tools using that library which may not always be upgraded at the same time, and I definitely remember an incident with cri-o + podman on different schedules.

I think if we did any future breaking changes, we'd just need to remember to also update this library right? We can enforce that with CI.

@giuseppe
Member

Is there anything fundamentally different from the storage access in bootc-dev/bootc#1962 though?

No, but I've added that only for the PoC; I think that logic belongs in container-libs.

As far as the "can change at any time" - in reality we have pretty strong constraints there on any truly backwards-incompatible changes, right? Because even though there's one Go library for it, there are multiple tools using that library which may not always be upgraded at the same time, and I definitely remember an incident with cri-o + podman on different schedules.

Sure, but a breakage there would be caught earlier by testing with the previous version of the container tools (and would hopefully block a release), while it could be too late for a separate implementation like the one proposed here.

As long as we are OK with the potential risks, I am OK with this. It is better to get unblocked and make some progress instead of waiting for the "perfect" solution.

@cgwalters
Collaborator Author

No, but I've added that only for the PoC; I think that logic belongs in container-libs.

Yes, agreed. I think we can pursue a proper API for this (and I like the jsonrpc-fdpass + splitfdstream approach, but it's obviously up for debate).

That said, there's an entirely different path to pursue here, which is a Go implementation of what's in composefs-rs (this repo) (or redoing the containers-storage composefs backend to do IPC to a cfsctl that we productize and ship, or a hybrid). I suspect that such a path would be relatively straightforward, and it would force us to standardize the file formats etc. even more.

@allisonkarlitskaya
Collaborator

So from what I can tell, the main thrust here is to get the reflinking thing between c-storage and composefs-rs working, right?

I'd have a really strong preference here to add a new API to the existing skopeo proxy and containers-image-store-rs. There are already fd-passing APIs there: we'd just need a lot more fds...

@giuseppe's argument about having a parallel implementation has traction with me. I'm not sure that I'd prefer a vibecoded containers-storage reimplementation, even if it is in Rust, compared to the real thing, specifically because we're gonna RPC to it anyway...

See also #137 for existing code that understands enough of tarsplit (because it's part of zstd:chunked) to get its job done...

I know you're trying to reduce CI pain here by decreasing the size of the stack of modules we have, but I'm not sure how I feel about mass-importing a vibecoding session here... the fact that most of that is in a separate crate helps a bit, of course.

Also: are you still 100% sure "pull into c-storage and then copy to composefs" is the way we definitely want to do this? I know the arguments about exotic authentication mechanisms, but skopeo would cover us there. I also know the arguments about zstd:chunked, but we have our own support in #137. I also know the arguments about combining zstd:chunked with the exotic authentication mechanisms thing and that's where I have a bit of sympathy for this approach, but ... it seems like maybe we could possibly figure something out still? Also: if this is the plan, then we'll need skopeo (at least) around anyway to do the pulling for us, so having a separate implementation of c-storage won't really gain us anything in terms of reducing on-disk size, etc...

@travier
Member

travier commented Jan 30, 2026

Also: are you still 100% sure "pull into c-storage and then copy to composefs" is the way we definitely want to do this?

Another argument for this logic is local layering. Pull the updated image into c-storage, do a local layered build with podman, import into composefs "for free".

@allisonkarlitskaya
Collaborator

Also: are you still 100% sure "pull into c-storage and then copy to composefs" is the way we definitely want to do this?

Another argument for this logic is local layering. Pull the updated image into c-storage, do a local layered build with podman, import into composefs "for free".

Indeed. That's exactly what we do for our examples, and it'd be nice to see those be more efficient...

Add layer references in the original manifest diff_id order rather
than the order layers finish downloading. This ensures reproducible
config splitstream digests regardless of download timing.

Fixes: 54c3c6d ("splitstream: Rework file format")
Assisted-by: OpenCode (Claude claude-opus-4-5-20250514)
This new internal crate provides safe, zero-copy parsing of tar archive
headers using the zerocopy crate. It supports:

- POSIX.1-1988, UStar (POSIX.1-2001), and GNU tar header formats
- Base-256 encoding for large values (GNU extension)
- EntryType enum for all standard tar entry types
- Checksum verification

The goal is to share code between composefs-oci and cstorage (which both
do tar header parsing), and eventually enable upstream contribution to
the tar-rs crate (ref: alexcrichton/tar-rs#392).

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
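
To illustrate the GNU base-256 extension mentioned above, here is a minimal sketch of how a tar numeric field (such as the 12-byte size field) can be decoded; this is illustrative only, not the tar-header crate's actual API.

```rust
/// Decode a tar numeric field such as the 12-byte size field.
/// GNU base-256 extension: if the high bit of the first byte is set, the
/// remaining bits form a big-endian binary integer; otherwise the field is
/// NUL/space-terminated octal ASCII.
fn parse_tar_numeric(field: &[u8]) -> Option<u64> {
    match field.first() {
        Some(b) if b & 0x80 != 0 => {
            let mut value = (field[0] & 0x7f) as u64;
            for &byte in &field[1..] {
                value = value.checked_mul(256)?.checked_add(byte as u64)?;
            }
            Some(value)
        }
        _ => {
            let s = std::str::from_utf8(field).ok()?;
            let s = s.trim_matches(|c: char| c == '\0' || c == ' ');
            if s.is_empty() {
                Some(0)
            } else {
                u64::from_str_radix(s, 8).ok()
            }
        }
    }
}
```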
@cgwalters
Collaborator Author

but I'm not sure how I feel about mass-importing a vibecoding session here...

I understand the point of what you're saying, but the definition of vibecoding is not looking at the code, and I can assure you that I have and will continue to read and sanity-check all "core" code to which I attach my name. This code will be responsible for updating my operating system and those of other people and customers, and so for sure it will not be "vibecoded" by the generally accepted definition of that term. "Includes substantial AI generation" - absolutely. But this took quite a bit of iteration.
This is of course an industry-wide debate, with many reasonable positions to take.

To me another key input into this is "could the person have written it themselves?" And I hope you'd agree that for this the answer is yes, because I have a long track record from before LLMs even came on the scene. It's going to be a lot harder for people joining the industry now, but that's a whole other thing...


Also: are you still 100% sure "pull into c-storage and then copy to composefs" is the way we definitely want to do this?

I don't think that has to be the only way, no. I'm totally fine with an opt-in mode here that uses e.g. oci-distribution - it would be a very obvious thing to support. I just don't think we can use it in bootc at least by default.

Also: if this is the plan, then we'll need skopeo (at least) around anyway to do the pulling for us, so having a separate implementation of c-storage won't really gain us anything in terms of reducing on-disk size, etc...

I agree it makes sense to put this in skopeo (and/or podman). There's no question there and I already have some PoC patches there, but we need to come to consensus on the design (jsonrpc-fdpass for IPC + splitfdstream for tar layer interchange is again my proposal).

But that said: this whole problem has been so painful for so long and I really want to make forward progress - patching skopeo also implies waiting on a whole review-merge-release-ship cycle, probably a month at the fastest. Not the end of the world if we agree and commit. There's a flip side though - we can make this an opt-in feature and gain more experience with it. The places that would use this in bootc at least are already both experimental (unified storage and composefs). Although we'd need to be careful to turn off this code path for bootc install and bootc switch --transport containers-storage by default if we did that.

@allisonkarlitskaya
Collaborator

But that said: this whole problem has been so painful for so long and I really want to make forward progress - patching skopeo also implies waiting on a whole review-merge-release-ship cycle, probably a month at the fastest.

How about incredibly evil hacks? We know the ID of the container and the ID of the layers and the names of the files in the tar archives... and you know what the layout of the repository looks like... could we just go looking for the files? skopeo would waste its time reassembling the tar archive and streaming it to us (just to have us ignore it), but it would have to do that anyway in order to hash it...

Then the thing on the other side of the RPC looks more like "open this [named] file for me"...

@cgwalters
Collaborator Author

How about incredibly evil hacks? We know the ID of the container and the ID of the layers and the names of the files in the tar archives... and you know what the layout of the repository looks like... could we just go looking for the files?

How is that different than what this code is doing? What specific portion of the code here feels riskier than re-implementing it a different way?

but it would have to do that anyway in order to hash it...

I don't think we need to validate the tar digest ("diffid") when pulling from containers-storage: by default. Should we support it? Absolutely. But there's a notable difference in efficiency. IMO where we really want to go is "composefs digest per layer as replacement for diffid" and I have some work on that.

@cgwalters
Collaborator Author

See also #137 for existing code that understands enough of tarsplit (because it's part of zstd:chunked) to get its job done...

What's funny of course is that you and I are almost arguing opposite ends here in the two PRs of "should we reimplement here parts of container-libs" 😄

(But to clarify for me though, unified storage is WAY more important than zstd:chunked)

@allisonkarlitskaya
Collaborator

What's funny of course is that you and I are almost arguing opposite ends here in the two PRs of "should we reimplement here parts of container-libs" 😄

I suppose I always dreamed of containers-storage backending directly on to composefs and no longer having its own repository. That would be a better future, I think we could agree...

And in that context, stuff like "ability to pull directly from the network without skopeo" starts to make a lot more sense...

@allisonkarlitskaya
Collaborator

I suppose I always dreamed of containers-storage backending directly on to composefs.

Actually, there's a thought: we could teach skopeo a bit about how to write to composefs... the object storage part is obviously the most important bit, but it's also the easiest bit: instead of it handing us a tonne of fds that we do reflinks to, we could just hand it our object directory (as an fd, or by name) and have it do that part...

@cgwalters
Collaborator Author

I suppose I always dreamed of containers-storage backending directly on to composefs and no longer having its own repository. That would be a better future, I think we could agree...

Yes, I think as part of this we must deliver sealed OCI, and a revamp of the current containers-storage: composefs backend is in order IMO. However, we need a solid impl of https://github.com/containers/composefs-rs/blob/main/doc/plans/oci-sealing-spec.md
(I'll post some WIP on that soon)

Actually, there's a thought: we could teach skopeo a bit about how to write to composefs...

Yes, but there's a shorter-term sticking point: for bootc, because we have logically bound images at the current time, we must expose our images in a way that current podman (i.e. container-libs) understands.

For this reason among others I would like to stick with the plan of pulling into containers-storage first, then reflinking to composefs.

Migrate composefs-oci to use the shared tar-header crate instead of
the tar crate for header parsing. The tar crate is still used for
PaxExtensions (PAX header parsing) and Builder (test utilities).

This eliminates duplicate tar header parsing logic and uses
zerocopy for safe, zero-copy access to header fields.

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
Add test-all and build-cstorage targets for building and testing
with the containers-storage feature enabled.

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
Add a new crate implementing the splitfdstream binary format, which
enables streaming tar archives with external file references.

The format uses an 8-byte LE signed prefix followed by optional data:
- Negative prefix: inline data of size |prefix|
- Non-negative prefix: external file descriptor reference by index

Key types:
- SplitfdstreamWriter: Build splitfdstreams with inline/external chunks
- SplitfdstreamReader: Parse splitfdstreams back into chunks
- SplitfdstreamTarReader: Read adapter that reconstructs byte streams

This enables zero-copy tar reconstruction from containers-storage's
tar-split metadata by streaming headers inline while referencing
file content via fd.

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
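
For clarity, here is a minimal sketch of the chunk framing described above (an 8-byte little-endian signed prefix), assuming a generic std::io::Write sink; this is not the SplitfdstreamWriter API itself.

```rust
use std::io::{self, Write};

/// Append one inline chunk: a negative prefix whose magnitude is the data
/// length, followed by the data itself. A zero-length inline chunk must be
/// skipped (or merged), since a prefix of 0 means "external fd 0".
fn write_inline(out: &mut impl Write, data: &[u8]) -> io::Result<()> {
    assert!(!data.is_empty());
    out.write_all(&(-(data.len() as i64)).to_le_bytes())?;
    out.write_all(data)
}

/// Append one external chunk: a non-negative prefix naming the fd by index;
/// the referenced file's content is substituted when the stream is read back.
fn write_external(out: &mut impl Write, fd_index: u64) -> io::Result<()> {
    out.write_all(&(fd_index as i64).to_le_bytes())
}
```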
Add a new crate providing read-only access to containers-storage
(the storage backend used by podman, buildah, and other container tools).

Key components:
- Storage: Main entry point, discovers and opens storage locations
- Layer: Overlay layer with content access via diff_dir
- Image: OCI image with manifest/config parsing
- TarSplitFdStream: Zero-copy tar-split streaming with fd passing
- LockFile: Wire-compatible locking with Go containers/storage

The crate uses cap-std for capability-based file operations and
provides a safe, read-only interface to access container images
and layers stored by podman/buildah.

This is adapted from cgwalters/cstor-rs with modifications to fit
the composefs-rs codebase style and conventions.

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
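
As a sanity check on how the pieces fit together, here is a hypothetical usage sketch; only the type names (Storage, Image, Layer) and diff_dir come from the description above, and the method names (discover, open_image, layers, id) are assumptions, not the crate's actual API.

```rust
// Hypothetical sketch: method names are assumptions, not the real cstorage API.
fn list_layer_dirs() -> anyhow::Result<()> {
    // Discover the default containers-storage location (e.g. the overlay
    // graph root used by podman/buildah) and open it read-only.
    let storage = cstorage::Storage::discover()?;
    let image = storage.open_image("docker.io/library/busybox:latest")?;
    for layer in image.layers()? {
        // Each overlay layer exposes its extracted content directory.
        println!("layer {} -> {:?}", layer.id(), layer.diff_dir());
    }
    Ok(())
}
```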
Add a new method to Repository that stores objects using FICLONE
reflinks when available, falling back to regular copy otherwise.

This enables zero-copy import from containers-storage by reflinking
files directly into the objects/ directory, avoiding data duplication
on filesystems that support reflinks (btrfs, XFS with reflinks).

The method only falls back to copy on EOPNOTSUPP (reflink not supported)
or EXDEV (cross-device); other errors are propagated to the caller.

Includes test coverage for the new method.

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
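
A minimal sketch of the reflink-then-fallback behavior described above, using the raw FICLONE ioctl; the function name and exact signature are illustrative, not the Repository method itself.

```rust
use std::fs::File;
use std::io;
use std::os::fd::AsRawFd;

// FICLONE ioctl request number on Linux (also exported by the libc crate).
const FICLONE: libc::c_ulong = 0x4004_9409;

/// Try to reflink `src` into `dst`; fall back to a byte copy only on
/// EOPNOTSUPP or EXDEV, and propagate every other error to the caller.
fn clone_or_copy(mut src: &File, mut dst: &File) -> io::Result<()> {
    let rc = unsafe { libc::ioctl(dst.as_raw_fd(), FICLONE as _, src.as_raw_fd()) };
    if rc == 0 {
        return Ok(()); // zero-copy reflink succeeded
    }
    let err = io::Error::last_os_error();
    match err.raw_os_error() {
        Some(libc::EOPNOTSUPP) | Some(libc::EXDEV) => {
            // Filesystem can't reflink (or crosses devices): plain copy.
            io::copy(&mut src, &mut dst)?;
            Ok(())
        }
        _ => Err(err),
    }
}
```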
Add optional containers-storage feature that enables importing
container images directly from podman/buildah's local storage
into composefs repositories.

Key functionality:
- import_from_containers_storage(): Main async entry point
- Uses cstorage crate for Storage/Layer/Image access
- Leverages tar-split metadata for zero-copy streaming
- Uses ensure_object_from_file() for reflink-based storage
- Shows progress via indicatif progress bar

The integration uses spawn_blocking to avoid blocking the async
runtime during synchronous file I/O operations.

Usage: cfsctl oci pull containers-storage:alpine:latest

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
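
The spawn_blocking pattern mentioned above looks roughly like the sketch below; the closure body (summing sizes under a layer's diff_dir) is a stand-in, not the actual import code.

```rust
use std::path::PathBuf;

/// Run synchronous filesystem work off the async runtime's worker threads.
/// Stand-in body: the real import walks the layer via tar-split metadata.
async fn scan_layer(diff_dir: PathBuf) -> anyhow::Result<u64> {
    tokio::task::spawn_blocking(move || -> anyhow::Result<u64> {
        let mut total = 0;
        for entry in std::fs::read_dir(&diff_dir)? {
            total += entry?.metadata()?.len();
        }
        Ok(total)
    })
    .await?
}
```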
Update cfsctl to detect and handle containers-storage: image
references when the containers-storage feature is enabled.

When an image reference starts with 'containers-storage:', cfsctl
routes to the native cstor import function instead of using skopeo,
enabling zero-copy layer import via reflinks.

Also improve image name matching in cstorage to support short names
like 'busybox' that resolve to 'docker.io/library/busybox:latest'.

The containers-storage feature is now enabled by default in cfsctl.

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
Add ImportStats struct to track containers-storage import operations:
- Count of layers (total and already present)
- Count of objects (reflinked, copied, already present)
- Bytes (reflinked, copied, inlined)

The repository now provides ensure_object_from_file_with_stats() and
ObjectStoreMethod enum to report how each object was stored.

The pull() function in composefs-oci now automatically routes
containers-storage: references to the native cstor import path,
returning a PullResult struct that includes optional ImportStats.
This keeps the routing logic in the library rather than CLI.

cfsctl prints statistics after a containers-storage pull showing
whether reflinks were used for zero-copy import.

Example output:
  Import statistics:
    layers: 1 (0 already present)
    objects: 16 total (16 reflinked, 0 copied, 0 already present)
    reflinked: 4.22 MiB (zero-copy)
    inlined: 228.90 KiB

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
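
A sketch of the shape these statistics could take, matching the counters listed above; the field names are assumptions rather than the crate's actual definitions.

```rust
/// Sketch only: field names are guesses based on the commit message.
#[derive(Debug, Default, Clone)]
pub struct ImportStats {
    pub layers_total: u64,
    pub layers_already_present: u64,
    pub objects_reflinked: u64,
    pub objects_copied: u64,
    pub objects_already_present: u64,
    pub bytes_reflinked: u64,
    pub bytes_copied: u64,
    pub bytes_inlined: u64,
}

/// How a single object ended up in the repository.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ObjectStoreMethod {
    Reflinked,
    Copied,
    AlreadyPresent,
}
```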
When running as an unprivileged user, files in containers-storage may
have restrictive permissions (e.g., /etc/shadow with mode 0600 owned by
remapped UIDs). This commit adds a user namespace helper that enables
reading these files by spawning a helper process via `podman unshare`.

The helper runs as UID 0 inside the user namespace and can read any file.
It communicates with the parent process via Unix socket using JSON-RPC 2.0
with SCM_RIGHTS file descriptor passing for zero-copy streaming.

Key components:
- userns.rs: can_bypass_file_permissions() to detect if helper is needed
- userns_helper.rs: Helper process with JSON-RPC protocol, StorageProxy
  client, and ProxiedLayerStream for streaming layer content

The cstor.rs import code now automatically uses the proxy when running
as an unprivileged user, falling back to direct access when running as
root or with CAP_DAC_OVERRIDE.

Rather than manually setting up user namespaces (parsing /etc/subuid,
calling newuidmap/newgidmap, etc.), we delegate all that complexity
to `podman unshare` which already handles all the edge cases.

Ported from cgwalters/cstor-rs.

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
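
The fd-passing mechanism described above boils down to SCM_RIGHTS over a unix socket; below is a minimal sketch of the sending side using the nix crate (0.27-era API), with the JSON-RPC framing and method names left out since those are specific to the helper.

```rust
use std::io::IoSlice;
use std::os::fd::RawFd;

use nix::sys::socket::{sendmsg, ControlMessage, MsgFlags};

/// Send one JSON-RPC message over a unix socket, attaching a file descriptor
/// via SCM_RIGHTS so the peer can read the file without having permission to
/// open it itself.
fn send_reply_with_fd(sock: RawFd, json: &[u8], fd: RawFd) -> nix::Result<usize> {
    let iov = [IoSlice::new(json)];
    let fds = [fd];
    let cmsgs = [ControlMessage::ScmRights(&fds)];
    sendmsg::<()>(sock, &iov, &cmsgs, MsgFlags::empty(), None)
}
```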
Image IDs from `podman build --iidfile` include the `sha256:` prefix,
but containers-storage directories use just the hex digest. Strip the
prefix when opening images to support both formats.

Fixes CI failure where `cfsctl oci pull containers-storage:sha256:...`
failed with "image not found".

See containers/skopeo#2750

Assisted-by: OpenCode (Opus 4.5)
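
The fix amounts to normalizing the reference before lookup, roughly like this illustrative helper (not the exact code):

```rust
/// Accept both "sha256:<hex>" and bare "<hex>" image IDs.
fn normalize_image_id(id: &str) -> &str {
    id.strip_prefix("sha256:").unwrap_or(id)
}
```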
Add a libtest-mimic based integration test suite that:
- Runs as a userns helper (calls init_if_helper at startup)
- Tests cfsctl CLI functionality
- Compares containers-storage vs skopeo import paths

The equivalence test builds a minimal podman image, imports via both
paths, and verifies both produce identical splitstream digests.

Assisted-by: OpenCode (Claude claude-opus-4-5-20250514)
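
For orientation, a skeleton of a libtest-mimic harness along these lines; the test body and the init_if_helper call site are illustrative placeholders.

```rust
use libtest_mimic::{Arguments, Trial};

fn main() {
    // If this binary was re-executed as the userns helper, hand control to the
    // helper protocol here (hypothetical call, see the userns_helper commit).
    // composefs_oci::userns_helper::init_if_helper();

    let args = Arguments::from_args();
    let tests = vec![Trial::test("containers_storage_equivalence", || {
        // Build a minimal image with podman, import it via both the
        // containers-storage and skopeo paths, and compare splitstream digests.
        Ok(())
    })];
    libtest_mimic::run(&args, tests).exit();
}
```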