Support "holey" archival bags#12131

Closed
qqmyers wants to merge 15 commits intoIQSS:developfrom
GlobalDataverseCommunityConsortium:DANS-2157_holey_bags

@qqmyers qqmyers commented Jan 30, 2026

What this PR does / why we need it: Previously, Dataverse always archived a dataset as a single zipped bag file whose size was limited only by the (compressed) size of the metadata and data. This PR adds support for the "holey" bag mechanism, in which some or all data files are not sent in the bag but are instead listed in a fetch.txt file. To complete the bag, the receiver must read the fetch file, retrieve the listed files from their URLs, and place each one at the location (path/filename) specified for it.
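As a rough illustration of the receiver's first step, the sketch below parses fetch.txt content into (URL, length, path) triples. It assumes the standard BagIt layout of one entry per line with whitespace-separated `url length filepath` fields; the class and method names are hypothetical and are not part of this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Dataverse code: parses BagIt fetch.txt content.
public class FetchTxtParser {

    // Returns one String[3] per entry: {url, length, path-in-bag}.
    static List<String[]> parse(String fetchTxt) {
        List<String[]> entries = new ArrayList<>();
        for (String line : fetchTxt.split("\\R")) {
            if (line.isBlank()) {
                continue; // ignore empty lines
            }
            // Split into at most 3 fields so a path containing spaces stays whole.
            String[] parts = line.trim().split("\\s+", 3);
            if (parts.length != 3) {
                throw new IllegalArgumentException("Malformed fetch.txt line: " + line);
            }
            entries.add(parts);
        }
        return entries;
    }
}
```

A receiver would then download each URL and write the result to the given path inside the unpacked bag, supplying whatever credential Dataverse requires for that file.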

Whether a file is included in the zip is determined by two new JVM options:

dataverse.bagit.holey.max-file-size
dataverse.bagit.holey.max-data-size

These set, respectively, the largest (uncompressed) size allowed for an individual data file and the maximum aggregate (uncompressed) size of the data that should be zipped.
Files are now processed in order of increasing size, which means that the zip will include the largest possible number of files when the max-data-size limit is used.
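The selection rule described above can be sketched as follows. This is a minimal illustration, not the BagGenerator code: the class and method names are hypothetical, and treating both limits as inclusive is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the zip/fetch.txt split; not the actual BagGenerator logic.
public class HoleyBagSplit {

    // Returns the paths to include in the zip; every other file would be listed in fetch.txt.
    static List<String> selectForZip(Map<String, Long> sizes, long maxFileSize, long maxDataSize) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(sizes.entrySet());
        // Smallest files first; ties are broken by path so the result is deterministic.
        entries.sort(Map.Entry.<String, Long>comparingByValue()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()));

        List<String> zipped = new ArrayList<>();
        long total = 0;
        for (Map.Entry<String, Long> e : entries) {
            long size = e.getValue();
            // Assumption: a file exactly at a limit is still allowed.
            boolean underFileLimit = size <= maxFileSize;
            boolean underDataLimit = size <= maxDataSize - total; // written this way to avoid overflow
            if (underFileLimit && underDataLimit) {
                zipped.add(e.getKey());
                total += size;
            }
        }
        return zipped;
    }
}
```

Because the files are visited in ascending size order, once one file fails the aggregate limit every later (larger) file fails it too, so taking the smallest files first maximizes the number of files that fit in the zip.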

Which issue(s) this PR closes:

  • Closes #
    DANS DD-2157

Special notes for your reviewer:
This builds on #12063.
The internal BagGenerator was recently updated to include the gbrecs parameter, which suppresses download counts when the downloads are for archival purposes. This PR also adds that parameter to the URLs in the fetch.txt file to ensure that the third-party receiver doesn't accidentally trigger download counts.
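For illustration, a fetch.txt entry carrying the gbrecs parameter might be built as in the sketch below. The `/api/access/datafile/{id}` path and the `gbrecs=true` query form are assumptions about how the URL is composed, and the class and method names are hypothetical; the PR itself is the authority on the exact URL.

```java
// Hypothetical sketch; the URL shape is an assumption, not taken from this PR's diff.
public class FetchLine {

    // Builds one BagIt fetch.txt line: "<url> <length> <path-in-bag>".
    static String fetchLine(String baseUrl, long fileId, long sizeInBytes, String bagPath) {
        // gbrecs=true is assumed to be the form that suppresses download counting.
        String url = baseUrl + "/api/access/datafile/" + fileId + "?gbrecs=true";
        return url + " " + sizeInBytes + " " + bagPath;
    }
}
```

A call such as `FetchLine.fetchLine("https://demo.example.org", 42, 1024, "data/readme.txt")` would produce a single line with the URL, the size in bytes, and the target path inside the bag.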

In its first iteration, this PR assumes that the receiver will retrieve the missing files directly from Dataverse, which means it may need an API key or other credential to get those files. The next step will be to add an option for those files to be transferred separately to the same place the bag is sent and to adjust the fetch file accordingly. This will ensure that, with holey bags, completion of the archiving by Dataverse means that all data is in the receiving service.

Suggestions on how to test this: Set up archiving, set either or both of the settings above, and verify that the split between the files in the zip and the files listed in fetch.txt is correct.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

qqmyers commented Jan 30, 2026

wrong base branch

@qqmyers qqmyers closed this Jan 30, 2026