-
Notifications
You must be signed in to change notification settings - Fork 14
Import cstor-rs, use for reflinked imports from containers-storage #218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ed16d50 to
6df2401
Compare
cgwalters
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing a review here so it's more visible
|
I've not looked into details yet, but are we comfortable with the risks of these intrusive handling of the storage library? The storage library doesn't have any stability guarantees and can change at any time |
Is there anything fundamentally different from the storage access in bootc-dev/bootc#1962 though? As far as the "can change at any time" - in reality we have pretty strong constraints there on any truly backwards-incompatible changes right? Because even though there's one Go library for it, there's multiple tools using that library which may not always be upgraded at the same time and I definitely remember an incident with cri-o + podman on different schedules. I think if we did any future breaking changes, we'd just need to remember to also update this library right? We can enforce that with CI. |
6df2401 to
455891a
Compare
No, but I've added that only for the PoC, I think that logic belongs to container-libs.
Sure, but testing with the previous version of the container tools will be caught earlier (and hopefully block a release), while it could be too late for a separate implementation as the one proposed here. As long as we are OK with the potential risks, I am OK with this. It is better to get unblocked and make some progress instead of waiting for the "perfect" solution. |
Yes, agreed. I think we can pursue a proper API for this (and I like the jsonrpc-fdpass + splitfdstream, but obviously up for debate) That said, there's an entirely different path to pursue here, which is a Go implementation of what's in composefs-rs (this repo) (or redoing the containers-storage-composefs backend on IPC to |
|
So from what I can tell, the main thrust here is to get the reflinking thing between c-storage and composefs-rs working, right? I'd have a really strong preference here to add a new API to the existing skopeo proxy and containers-image-store-rs. There are already fd-passing APIs there: we'd just need a lot more fds... @giuseppe's argument about having a parallel implementation has traction with me. I'm not sure that I'd prefer a vibecoded containers-storage reimplementation, even if it is in Rust, compared to the real thing, specifically because we're gonna RPC to it anyway... See also #137 for existing code that understands enough of tarsplit (because it's part of zstd:chunked) to get its job done... I know you're trying to reduce CI pain here by decreasing the size of the stack of modules we have, but I'm not sure how I feel about mass-importing a vibecoding session here... the fact that most of that is in a separate crate helps a bit, of course. Also: are you still 100% sure "pull into c-storage and then copy to composefs" is the way we definitely want to do this? I know the arguments about exotic authentication mechanisms, but skopeo would cover us there. I also know the arguments about zstd:chunked, but we have our own support in #137. I also know the arguments about combining zstd:chunked with the exotic authentication mechanisms thing and that's where I have a bit of sympathy for this approach, but ... it seems like maybe we could possibly figure something out still? Also: if this is the plan, then we'll need skopeo (at least) around anyway to do the pulling for us, so having a separate implementation of c-storage won't really gain us anything in terms of reducing on-disk size, etc... |
Another argument for this logic is local layering. Pull the updated image in c-storage, do a local layered build with podman, import into composefs "for free". |
Indeed. That's exactly what we do for our examples, and it'd be nice to see those be more efficient... |
Add layer references in the original manifest diff_id order rather than the order layers finish downloading. This ensures reproducible config splitstream digests regardless of download timing. Fixes: 54c3c6d ("splitstream: Rework file format") Assisted-by: OpenCode (Claude claude-opus-4-5-20250514)
This new internal crate provides safe, zero-copy parsing of tar archive headers using the zerocopy crate. It supports: - POSIX.1-1988, UStar (POSIX.1-2001), and GNU tar header formats - Base-256 encoding for large values (GNU extension) - EntryType enum for all standard tar entry types - Checksum verification The goal is to share code between composefs-oci and cstorage (which both do tar header parsing), and eventually enable upstream contribution to the tar-rs crate (ref: alexcrichton/tar-rs#392). Assisted-by: OpenCode (Opus 4.5) Signed-off-by: Colin Walters <walters@verbum.org>
I understand the point of what you're saying, but the definition of vibecoding is not looking at the code, and I can assure you that I have and will continue to read and sanity check all "core" code to which I attach my name. This code will be responsible for updating my operating system and those of people and customers, and so for sure it will not be "vibecoded" by the generally accepted definition of that term. "Includes substantial AI generation" - absolutely. But this took quite a bit of iteration. To me another key input into this is "could the person have written it themselves"? And I hope you'd agree for this the answer is yes, because I have a long track record from before LLMs even came on the scene. It's going to be a lot harder for people joining the industry now, but that's a whole other thing...
I don't think that has to be the only way, no. I'm totally fine with an opt-in mode here that uses e.g. oci-distribution - it would be a very obvious thing to support. I just don't think we can use it in bootc at least by default.
I agree it makes sense to put this in skopeo (and/or podman). There's no question there and I already have some PoC patches there, but we need to come to consensus on the design (jsonrpc-fdpass for IPC + splitfdstream for tar layer interchange is again my proposal). But that said: this whole problem has been so painful for so long and I really want to make fowrard progress - patching skopeo also implies waiting on a whole review-merge-release-ship cycle, probably fastest path a month. Not the end of the world if we agree and commit. There's a flip side though - we can make this an opt-in feature and gain more experience with it. The places that would use this in bootc at least are already both experimental (unified storage and composefs). Although we'd need to be careful to turn off this code path for |
b9fa9db to
3c7573b
Compare
How about incredibly evil hacks? We know the ID of the container and the ID of the layers and the names of the files in the tar archives... and you know what the layout of the repository looks like... could we just go looking for the files? skopeo would waste its time reassembling the tar archive and streaming it to us (just to have us ignore it), but it would have to do that anyway in order to hash it... The the thing on the other side of the RPC looks more like "open this [named] file for me"... |
How is that different than what this code is doing? What specific portion of the code here feels riskier than re-implementing it a different way?
I don't think we need to validate the tar digest ("diffid") when pulling from containers-storage: by default. Should we support it? Absolutely. But there's a notable difference in efficiency. IMO where we really want to go is "composefs digest per layer as replacement for diffid" and I have some work on that. |
What's funny of course is that you and I are almost arguing opposite ends here in the two PRs of "should we reimplement here parts of container-libs" 😄 (But to clarify for me though, unified storage is WAY more important than zstd:chunked) |
I suppose I always dreamed of containers-storage backending directly on to composefs and no longer having its own repository. That would be a better future, I think we could agree... And in that context, stuff like "ability to pull directly from the network without skopeo" starts to make a lot more sense... |
Actually, there's a thought: we could teach skopeo a bit about how to write to composefs... the object storage part is obviously the most important bit, but it's also the easiest bit: instead of it handing us a tonne of fds that we do reflinks to we could just hand it our object directory (as an fd, or by name) and have it do that part... |
Yes I think as part of this we must deliver sealed OCI and a revamp of the current containers-storage: composefs backend is in order IMO. However: we need a solid impl of https://github.com/containers/composefs-rs/blob/main/doc/plans/oci-sealing-spec.md
Yes but there's a shorter term sticking point: For bootc because we have logically bound images at the current time we must expose our images in a way that current podman (i.e. container-libs) understands. For this reason among others I would like to stick with the plan of pulling into containers-storage first, then reflinking to composefs. |
Migrate composefs-oci to use the shared tar-header crate instead of the tar crate for header parsing. The tar crate is still used for PaxExtensions (PAX header parsing) and Builder (test utilities). This eliminates duplicate tar header parsing logic and uses zerocopy for safe, zero-copy access to header fields. Assisted-by: OpenCode (Opus 4.5) Signed-off-by: Colin Walters <walters@verbum.org>
Add test-all and build-cstorage targets for building and testing with the containers-storage feature enabled. Assisted-by: OpenCode (Opus 4.5) Signed-off-by: Colin Walters <walters@verbum.org>
Add a new crate implementing the splitfdstream binary format, which enables streaming tar archives with external file references. The format uses an 8-byte LE signed prefix followed by optional data: - Negative prefix: inline data of size |prefix| - Non-negative prefix: external file descriptor reference by index Key types: - SplitfdstreamWriter: Build splitfdstreams with inline/external chunks - SplitfdstreamReader: Parse splitfdstreams back into chunks - SplitfdstreamTarReader: Read adapter that reconstructs byte streams This enables zero-copy tar reconstruction from containers-storage's tar-split metadata by streaming headers inline while referencing file content via fd. Assisted-by: OpenCode (Opus 4.5) Signed-off-by: Colin Walters <walters@verbum.org>
Add a new crate providing read-only access to containers-storage (the storage backend used by podman, buildah, and other container tools). Key components: - Storage: Main entry point, discovers and opens storage locations - Layer: Overlay layer with content access via diff_dir - Image: OCI image with manifest/config parsing - TarSplitFdStream: Zero-copy tar-split streaming with fd passing - LockFile: Wire-compatible locking with Go containers/storage The crate uses cap-std for capability-based file operations and provides a safe, read-only interface to access container images and layers stored by podman/buildah. This is adapted from cgwalters/cstor-rs with modifications to fit the composefs-rs codebase style and conventions. Assisted-by: OpenCode (Opus 4.5) Signed-off-by: Colin Walters <walters@verbum.org>
Add a new method to Repository that stores objects using FICLONE reflinks when available, falling back to regular copy otherwise. This enables zero-copy import from containers-storage by reflinking files directly into the objects/ directory, avoiding data duplication on filesystems that support reflinks (btrfs, XFS with reflinks). The method only falls back to copy on EOPNOTSUPP (reflink not supported) or EXDEV (cross-device); other errors are propagated to the caller. Includes test coverage for the new method. Assisted-by: OpenCode (Opus 4.5) Signed-off-by: Colin Walters <walters@verbum.org>
Add optional containers-storage feature that enables importing container images directly from podman/buildah's local storage into composefs repositories. Key functionality: - import_from_containers_storage(): Main async entry point - Uses cstorage crate for Storage/Layer/Image access - Leverages tar-split metadata for zero-copy streaming - Uses ensure_object_from_file() for reflink-based storage - Shows progress via indicatif progress bar The integration uses spawn_blocking to avoid blocking the async runtime during synchronous file I/O operations. Usage: cfsctl oci pull containers-storage:alpine:latest Assisted-by: OpenCode (Opus 4.5) Signed-off-by: Colin Walters <walters@verbum.org>
Update cfsctl to detect and handle containers-storage: image references when the containers-storage feature is enabled. When an image reference starts with 'containers-storage:', cfsctl routes to the native cstor import function instead of using skopeo, enabling zero-copy layer import via reflinks. Also improve image name matching in cstorage to support short names like 'busybox' that resolve to 'docker.io/library/busybox:latest'. The containers-storage feature is now enabled by default in cfsctl. Assisted-by: OpenCode (Opus 4.5) Signed-off-by: Colin Walters <walters@verbum.org>
Add ImportStats struct to track containers-storage import operations:
- Count of layers (total and already present)
- Count of objects (reflinked, copied, already present)
- Bytes (reflinked, copied, inlined)
The repository now provides ensure_object_from_file_with_stats() and
ObjectStoreMethod enum to report how each object was stored.
The pull() function in composefs-oci now automatically routes
containers-storage: references to the native cstor import path,
returning a PullResult struct that includes optional ImportStats.
This keeps the routing logic in the library rather than CLI.
cfsctl prints statistics after a containers-storage pull showing
whether reflinks were used for zero-copy import.
Example output:
Import statistics:
layers: 1 (0 already present)
objects: 16 total (16 reflinked, 0 copied, 0 already present)
reflinked: 4.22 MiB (zero-copy)
inlined: 228.90 KiB
Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
When running as an unprivileged user, files in containers-storage may have restrictive permissions (e.g., /etc/shadow with mode 0600 owned by remapped UIDs). This commit adds a user namespace helper that enables reading these files by spawning a helper process via `podman unshare`. The helper runs as UID 0 inside the user namespace and can read any file. It communicates with the parent process via Unix socket using JSON-RPC 2.0 with SCM_RIGHTS file descriptor passing for zero-copy streaming. Key components: - userns.rs: can_bypass_file_permissions() to detect if helper is needed - userns_helper.rs: Helper process with JSON-RPC protocol, StorageProxy client, and ProxiedLayerStream for streaming layer content The cstor.rs import code now automatically uses the proxy when running as an unprivileged user, falling back to direct access when running as root or with CAP_DAC_OVERRIDE. Rather than manually setting up user namespaces (parsing /etc/subuid, calling newuidmap/newgidmap, etc.), we delegate all that complexity to `podman unshare` which already handles all the edge cases. Ported from cgwalters/cstor-rs. Assisted-by: OpenCode (Opus 4.5) Signed-off-by: Colin Walters <walters@verbum.org>
Image IDs from `podman build --iidfile` include the `sha256:` prefix, but containers-storage directories use just the hex digest. Strip the prefix when opening images to support both formats. Fixes CI failure where `cfsctl oci pull containers-storage:sha256:...` failed with "image not found". See containers/skopeo#2750 Assisted-by: OpenCode (Opus 4.5)
Add a libtest-mimic based integration test suite that: - Runs as a userns helper (calls init_if_helper at startup) - Tests cfsctl CLI functionality - Compares containers-storage vs skopeo import paths The equivalence test builds a minimal podman image, imports via both paths, and verifies both produce identical splitstream digests. Assisted-by: OpenCode (Claude claude-opus-4-5-20250514)
3c7573b to
6c16d20
Compare
Obviously massively
Assisted-by: OpenCode (Opus 4.5)here; I did a lot of design work of course and but this is looking pretty nice to me, and it seems to work well!