-
Notifications
You must be signed in to change notification settings - Fork 149
feat(unixfs): configurable CID Profiles from IPIP-499 #1088
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
add configurable size estimation modes for determining when to switch between BasicDirectory and HAMTDirectory: - SizeEstimationLinks: legacy mode using len(name) + len(CID), default - SizeEstimationBlock: full serialized dag-pb block size (accurate) - SizeEstimationDisabled: link-count only via MaxLinks, ignores size includes: - HAMTSizeEstimation global for default mode - WithSizeEstimationMode option for per-directory override - helper functions for accurate protobuf size calculation part of IPIP-499 UnixFS CID Profiles implementation.
introduces UnixFSProfile struct with predefined profiles: - UnixFS_v0_2015: legacy CIDv0 settings (256 KiB chunks, 174 links/node) - UnixFS_v1_2025: modern CIDv1 settings (1 MiB chunks, 1024 links/node) profiles control file chunking, DAG width, and HAMT sharding parameters. ApplyGlobals() sets all relevant global variables at once. part of IPIP-499 implementation.
Codecov Report❌ Patch coverage is @@ Coverage Diff @@
## main #1088 +/- ##
==========================================
+ Coverage 61.06% 61.61% +0.54%
==========================================
Files 264 265 +1
Lines 26221 26505 +284
==========================================
+ Hits 16013 16331 +318
+ Misses 8527 8486 -41
- Partials 1681 1688 +7
... and 5 files with indirect coverage changes 🚀 New features to boost your workflow:
|
add SerialFileOptions and NewSerialFileWithOptions to control whether symlinks are preserved as UnixFS symlink nodes (Data.Type=4) or dereferenced and replaced with their target content during file traversal.
Link.Size is already uint64, so the explicit conversions are redundant and flagged by golangci-lint unconvert check.
HAMT sharding threshold comparison was historically implemented as `>` in JS and `>=` in Go: - JS: https://github.com/ipfs/helia/blob/005c2a7/packages/unixfs/src/commands/utils/is-over-shard-threshold.ts#L31 - Go: https://github.com/ipfs/boxo/blob/319662c/ipld/unixfs/io/directory.go#L438 This inconsistency meant a directory exactly at the 256 KiB threshold would stay basic in JS but convert to HAMT in Go, producing different CIDs for the same input. This commit changes Go to use `>` (matching JS), so a directory exactly at the threshold now stays as a basic (flat) directory. This aligns cross-implementation behavior for CID determinism per IPIP-499. Also adds SizeEstimationMode to MkdirOpts so MFS directories respect the configured estimation mode instead of always using the global default.
b844fc9 to
6707376
Compare
- fix trailing newline in directory_test.go - add #1088 PR references to changelog entries
- files: fix nil filter check in serialFile.Size() - unixfs/io: document thread-safety for global vars and ApplyGlobals - changelog: move DefaultBlockSize to Changed section with breaking marker
| // Thread safety: this function modifies global variables and is not safe | ||
| // for concurrent use. Call it once during program initialization, before | ||
| // starting any imports. Do not call from multiple goroutines. | ||
| func (p UnixFSProfile) ApplyGlobals() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ℹ️ boxo already had globals (DefaultBlockSize, DefaultLinksPerBlock etc)
this is a very surgical way of having predefined UnixFS profiles in boxo itself that users can apply programmatically at startup of their app.
i dont like it tbh, but others in golang that we already use (like certmagic, or even net.DefaultResolver) have similar way of managing global defaults, so maybe its just me not being a fan of globals.
this is a compromise which delivers ability to set-and-forget profile on startup, but implemented in smallest amount of code that avoids breaking every user of existing APIs
gammazero
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a circular symlink test to show that it works when not dereferencing symlinks, otherwise it fails.
|
Triage:
|
IPIP-499 block-bytes estimation improvements: - add fast path optimization in needsToSwitchByBlockSize to skip expensive exact calculation when clearly above threshold (+256 margin) - clarify documentation for linkSerializedSize, calculateBlockSize, cachedBlockSize, and SetStat methods with IPIP-499 context - extract saveAndRestoreGlobals as package-level test helper tests for mode/mtime block size overhead: - verify exact protobuf overhead: mode (3 bytes), mtime seconds (8 bytes), nanoseconds (5 bytes), combined (16 bytes) - verify cachedBlockSize accuracy after add/remove/replace operations - verify linkSerializedSize matches actual link contribution - verify HAMT threshold accounts for metadata overhead - test fast path and near-boundary exact calculation behavior
…utable consolidate fragmented size tracking into a single method and field: - merge `cachedBlockSize` into `estimatedSize` (single field for all modes) - replace `addToEstimatedSize`, `removeFromEstimatedSize`, and `updateCachedBlockSize` with unified `updateEstimatedSize(name, oldLink, newLink)` - remove `SetSizeEstimationMode` from Directory interface; mode is now set only at creation time via `WithSizeEstimationMode` option this prevents mode changes after directory creation which could cause size tracking inconsistencies, and simplifies the calling code from two method calls per operation to one. test coverage: - TestHAMTToBasicDowngrade: new test for HAMT->Basic threshold boundaries covering both SizeEstimationLinks and SizeEstimationBlock modes - TestEstimatedSizeAccuracy: verifies size tracking after add/remove/replace - TestProfileHAMTThresholdBehavior: upgrade threshold boundaries - TestDynamicDirectorySwitch: Basic<->HAMT conversions
removes the need for protobuf serialization when checking HAMT threshold. the block size is now computed arithmetically from protobuf field definitions: - dataFieldSerializedSize(): UnixFS Data field (Type + optional mode/mtime) - linkSerializedSize(): PBLink fields (Hash, Name, Tsize) + wrapper this replaces the previous approach that serialized a temporary node copy when near the threshold boundary. the arithmetic calculation is exact and verified against actual serialization in TestDataFieldSerializedSizeMatchesActual. calculateBlockSize() moved to test-only code in profile_test.go.
19fb7a1 to
6141039
Compare
…2025 # Conflicts: # CHANGELOG.md
|
@gammazero @aschmahmann i've applied feedback from today's colo
|
replace bit shift notation with plain arithmetic for readability
add inline comments explaining: - varintLen formula derivation (ceil(N/7) without branching) - why negative int64 always uses 10 bytes in protobuf encoding
add maybeCollapseToRawLeaf() to DagModifier that collapses a ProtoNode with a single RawNode child (and no metadata) to just the RawNode. this ensures CID compatibility with `ipfs add` for single-block files when RawLeaves is enabled. the collapse happens in GetNode() rather than during Sync() to avoid issues with intermediate operations like seeks and modifications.
enables MFS to respect kubo's Import.* config options by accepting tree-wide settings via functional options on NewRoot(): - WithChunker: configures chunker for files (was hardcoded to default) - WithMaxLinks: configures directory link threshold for HAMT sharding - WithSizeEstimationMode: configures size estimation for HAMT decisions changes: - mfs/root.go: add RootOption pattern and apply settings to root directory - mfs/dir.go: inherit chunker from parent, propagate SizeEstimationMode to child directories when loaded from disk via cacheNode() - mfs/file.go: use configured chunker instead of hardcoded default - mfs/ops.go: add Chunker field to MkdirOpts - ipld/unixfs/io/directory.go: add SetSizeEstimationMode() to interface, fix addLinkChild() to correctly handle entry replacements (was double- decrementing totalLinks) - ipld/unixfs/mod/dagmodifier.go: add identitySafeDAGService wrapper to handle identity CID overflow gracefully during append operations, preserving identity for small files while switching to sha2-256 when data exceeds the 128-byte limit
add tests to verify that WithChunker, WithMaxLinks, and WithSizeEstimationMode correctly propagate settings to files and directories created in MFS: - TestRootOptionChunker: verifies custom chunker produces expected block count (512-byte chunks vs default 256KB) - TestRootOptionMaxLinks: verifies custom MaxLinks triggers HAMT sharding (3 links vs default ~174) - TestRootOptionSizeEstimationMode: verifies mode propagates to directories after reload from DAG - TestChunkerInheritance: verifies chunker propagates through nested subdirectories (/a/b/c) all tests use non-default values and include assertions to confirm the custom settings are actually being used rather than falling back to defaults.
add per-directory HAMT sharding size threshold support: - add hamtShardingSize field to BasicDirectory and HAMTDirectory - add Get/SetHAMTShardingSize() methods to Directory interface - add getEffectiveShardingSize() helper for per-directory or global fallback - propagate HAMTShardingSize to child directories in cacheNode/setNodeData - add HAMTShardingSize to MkdirOpts with parent inheritance add RootOptions: - WithMaxHAMTFanout(n) sets HAMT bucket width - WithHAMTShardingSize(size) sets per-directory size threshold add tests: - TestRootOptionMaxHAMTFanout - TestRootOptionHAMTShardingSize - TestHAMTShardingSizeInheritance
expandSparse creates zero-padding when writing past end of file, but wasn't updating dm.curNode to point to the new node. subsequent writes would use the old unexpanded node, losing data.
3ed97b0 to
7884ae2
Compare
- use Go 1.22+ range-over-int syntax in IPIP-499 tests - expand ipld/unixfs/io package documentation with overview of directory types, HAMT sharding config, and IPIP-499 profiles
gammazero
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only one question about checking Mode() when converting to leaf node. Otherwise, looks good.
directories with mode/mtime metadata (set via WithStat) would lose this optional metadata when: - converting from BasicDirectory to HAMTDirectory (during sharding) - converting from HAMTDirectory to BasicDirectory (when shrinking) - reloading a HAMT directory from disk root cause: HAMT shards did not support mode/mtime in their UnixFS data, and the conversion functions did not propagate these fields. changes: - add HAMTShardDataWithStat() to include mode/mtime in HAMT shard nodes - add SetStat() to hamt.Shard to store metadata for serialization - propagate mode/mtime and SizeEstimationMode during Basic<->HAMT conversions in DynamicDirectory - extract mode/mtime from fsNode when loading HAMT via NewDirectoryFromNode also adds tests for: negative mtime encoding, SizeEstimationDisabled with maxLinks=0, unicode filenames in size estimation, concurrent HAMT conversion, mode/mtime preservation after reload, and exact HAMT threshold boundary behavior.
when calling Mkdir with Mkparents=true and a custom Chunker, intermediate directories would inherit the chunker from root instead of using the one specified in MkdirOpts. now parentsOpts includes opts.Chunker so all directories created in the path use the same chunker.
- fix gofumpt formatting: comment alignment with unicode chars, octal literal - rename TestConcurrentHAMTConversion to TestSequentialHAMTConversion and serialize operations to avoid race condition (Directory is not thread-safe for concurrent reads and writes)
move the maxLinks check into the error handling block to eliminate the intermediate `existed` boolean variable. the logic is equivalent but more idiomatic: check maxLinks only when RemoveChild returns ErrNotExist (new entry), skip it when removal succeeds (replacement). adds test for replacement behavior at maxLinks capacity. suggested by @gammazero in #1088 review: #1088 (comment)
|
|
||
| // TestHAMTDirectoryModeAndMtimeAfterReload verifies that mode and mtime metadata | ||
| // survives HAMT conversion and reload from DAG. | ||
| func TestHAMTDirectoryModeAndMtimeAfterReload(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. This test makes me feel much better. It could have been easily overlooked.
previously maybeCollapseToRawLeaf only checked ModTime when deciding
whether to keep a ProtoNode wrapper. files with Mode metadata (unix
permissions) but no ModTime would incorrectly collapse to RawNode,
losing the permission information.
now checks both ModTime and Mode before collapsing:
if !fsn.ModTime().IsZero() || fsn.Mode() != 0 {
also refactored metadata preservation tests into a table-driven test
covering ModTime-only, Mode-only, and both metadata fields.
suggested-by: @gammazero
ref: #1088 (comment)
moved package docs from dagmodifier.go to dedicated doc.go file. expanded documentation to cover: - MFS semantics for metadata handling - clarification that Mode and ModTime are optional (most use cases do not set them) - mtime update behavior on content modification (matches Unix fs) - identity CID handling - RawNode growth conversion - raw leaf collapsing behavior also added inline comments at mtime update sites explaining the behavior and how to preserve specific mtime values if needed.
|
Addressed review feedback:
All CI checks passing. If no concerns I will merge tomorrow to unblock Kubo 0.40 RC1 (ipfs/go-ipfs-cmds#315, ipfs/kubo#11148) |
Adds building blocks for reproducible CID generation across IPFS implementations, based on IPIP-499
Users of boxo can now
SizeEstimationLinks), accurate block-based (SizeEstimationBlock), or disable size-based thresholds entirely(
SizeEstimationDisabled)UnixFS_v0_2015orUnixFS_v1_2025profiles from IPIP-499 viaApplyGlobals()andCidBuilder()SerialFileOptions.DereferenceSymlinks>=to>) – 6707376Related: IPIP-499, used by kubo#11148