Support "holey" archival bags by qqmyers · Pull Request #12131 · IQSS/dataverse

qqmyers · 2026-01-30T21:49:43Z

What this PR does / why we need it: Previously, Dataverse always archived a dataset as a single zipped bag file whose size was limited only by the (compressed) size of the metadata and data. This PR adds support for the "holey" bag mechanism, in which some or all data files are left out of the bag and are instead listed in a fetch.txt file. To complete the bag, the receiver must read the fetch file, retrieve the listed files from their URLs, and place each one at the specified location (path/filename).
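To illustrate the receiver's side, here is a minimal sketch of reading a fetch.txt file. It assumes the BagIt layout of one entry per line in the form `url length path`, where the length may be `-` if unknown. The names `FetchEntry` and `parse_fetch_txt` and the example URLs are hypothetical and not part of this PR, and the path check is only one possible safeguard.

```python
from pathlib import PurePosixPath
from typing import List, NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when fetch.txt gives "-" (size unknown)
    path: str              # location inside the bag, e.g. data/file.csv


def parse_fetch_txt(text: str) -> List[FetchEntry]:
    """Parse fetch.txt content into entries (hypothetical helper)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two whitespace runs only, so that a path
        # containing spaces stays in one piece.
        url, length, path = line.split(None, 2)
        parts = PurePosixPath(path)
        # Assumed safeguard: refuse paths that could land outside the bag.
        if parts.is_absolute() or ".." in parts.parts:
            raise ValueError(f"unsafe path in fetch.txt: {path!r}")
        entries.append(FetchEntry(url, None if length == "-" else int(length), path))
    return entries


sample = (
    "https://example.org/api/access/datafile/42 1024 data/big file.csv\n"
    "https://example.org/api/access/datafile/43 - data/sub/other.bin\n"
)
entries = parse_fetch_txt(sample)
```

A receiver would then download each `url` (supplying whatever credential the source requires) and write the result to `path` relative to the bag root before validating the bag.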

Whether files are included is determined by two new JVM options:

dataverse.bagit.holey.max-file-size
dataverse.bagit.holey.max-data-size

These set, respectively, the largest (uncompressed) size an individual data file may have and the maximum aggregate (uncompressed) size of data that should be zipped.
Files are now processed in order of increasing size, which means that the zip will include the largest number of files possible when the max-data-size limit is used.
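The selection rule described above can be sketched as follows. This is an illustration only, not the BagGenerator code: the function name `split_files`, treating both limits as inclusive, and treating an unset limit as "no limit" are all assumptions.

```python
from typing import Dict, List, Optional, Tuple


def split_files(
    sizes: Dict[str, int],
    max_file_size: Optional[int] = None,
    max_data_size: Optional[int] = None,
) -> Tuple[List[str], List[str]]:
    """Return (files to zip, files to list in fetch.txt).

    Hypothetical sketch: sizes maps file name to uncompressed size in bytes.
    """
    zipped: List[str] = []
    fetched: List[str] = []
    total = 0
    # Smallest files first, so a fixed aggregate budget holds as many
    # files as possible. The name is used only as a tie-breaker.
    for name, size in sorted(sizes.items(), key=lambda kv: (kv[1], kv[0])):
        too_big = max_file_size is not None and size > max_file_size
        over_budget = max_data_size is not None and total + size > max_data_size
        if too_big or over_budget:
            fetched.append(name)
        else:
            zipped.append(name)
            total += size
    return zipped, fetched


zipped, fetched = split_files(
    {"a": 10, "b": 50, "c": 200, "d": 30},
    max_file_size=100,
    max_data_size=60,
)
```

With these inputs, `a` and `d` (40 bytes in total) go into the zip, `b` would push the total past 60, and `c` exceeds the per-file limit, so both end up in fetch.txt.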

Which issue(s) this PR closes:

Closes #
DANS DD-2157

Special notes for your reviewer:
This builds on #12063.
The internal BagGenerator was recently updated to include the gbrecs parameter, which suppresses download counts when the downloads are for archival purposes. This PR also adds that parameter to the URLs in the fetch.txt file to ensure that the third-party receiver doesn't accidentally trigger download counts.
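As a rough illustration of what adding the parameter to a fetch URL involves, the sketch below appends `gbrecs` to a URL's query string. The helper name `with_gbrecs`, the example URLs, and the value `true` are assumptions made for this example. The PR text only names the parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_gbrecs(url: str) -> str:
    """Return url with a gbrecs query parameter set (hypothetical helper)."""
    parts = urlsplit(url)
    # Drop any existing gbrecs entry so the function is idempotent.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "gbrecs"
    ]
    query.append(("gbrecs", "true"))  # assumed value
    return urlunsplit(parts._replace(query=urlencode(query)))


plain = with_gbrecs("https://example.org/api/access/datafile/42")
existing = with_gbrecs("https://example.org/api/access/datafile/42?format=original")
```

Rebuilding the query string rather than concatenating `?gbrecs=true` keeps URLs that already carry other parameters well-formed.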

In its first iteration, this PR assumes that the receiver will retrieve the missing files directly from Dataverse, which means it may need an API key or other credential to get those files. The next step will be to add an option for those files to be transferred separately to the same place the bag is sent and to adjust the fetch file accordingly. This will ensure that, with holey bags, completion of the archiving by Dataverse means that all data is in the receiving service.

Suggestions on how to test this: Set up archiving, set either or both of the settings above, and verify that the split between files included in the zip and files listed in fetch.txt is correct.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

qqmyers · 2026-01-30T23:11:12Z

wrong base branch

qqmyers added 15 commits January 7, 2026 13:35

initial impl

f8f7739

fix requestedSettings handling

5bd6f8d

efficiency improvement

4aaf6ca

QDR fixes transx timeout, ignored bag thread setting, add deletable

7cdef81

archival submit fix - per version cache

67e01e0

Add check to display submit button only if prior versions are archvd

50e8c61

Merge remote-tracking branch 'IQSS/develop' into DANS-2097

74a73fb

setting name tweak, add docs, release note

0642897

simplify

ca0af05

basic fetch

1808d2d

order by file size

366eccd

only add subcollection folders (if they exist)

eec333b

replace deprecated constructs

5445700

restore name collision check

b746d5d

add null check to quiet log/avoid exception

88edc8a

qqmyers closed this Jan 30, 2026

qqmyers wants to merge 15 commits into IQSS:develop from GlobalDataverseCommunityConsortium:DANS-2157_holey_bags
