What this PR does / why we need it: Previously, Dataverse always archived a dataset as a single zipped bag file whose size was limited only by the (compressed) size of the metadata and data. This PR adds support for the "holey" bag mechanism in which some/all datafiles are not sent in the bag but are instead listed in a fetch.txt file. To complete the bag, the receiver must read the fetch file, retrieve the listed files from their URLs and place them at the specified location (path/filename).
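For reference, each line of a fetch.txt in the BagIt spec (RFC 8493) has the form `URL LENGTH FILEPATH`, where LENGTH may be `-` if unknown. A holey bag produced here might contain a line like the following (the file id, size, and path are illustrative; the `gbrecs` parameter is discussed below):

```text
https://demo.dataverse.org/api/access/datafile/42?gbrecs=true 1048576 data/observations.csv
```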
Whether files are included is determined by two new JVM options, which specify the largest allowed (uncompressed) size of an individual data file and the maximum aggregate (uncompressed) size that should be zipped.
Files are now processed in order of increasing size, which means the zip will include the largest possible number of files when the max-data-size limit is in effect.
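The smallest-first ordering can be sketched as follows. This is not the actual BagGenerator code, just an illustrative Python sketch of the selection logic, assuming per-file and aggregate size limits as described above:

```python
def split_files(files, max_file_size, max_total_size):
    """Decide which files go into the zipped bag and which are
    listed in fetch.txt instead.

    files: list of (name, size) tuples; sizes are uncompressed bytes.
    Returns (zipped, fetched) lists of file names.
    """
    zipped, fetched = [], []
    total = 0
    # Process smallest-first so the zip holds as many files as possible
    # before either limit is reached.
    for name, size in sorted(files, key=lambda f: f[1]):
        if size <= max_file_size and total + size <= max_total_size:
            zipped.append(name)
            total += size
        else:
            # Too large individually, or would push the aggregate
            # over the limit: list it in fetch.txt instead.
            fetched.append(name)
    return zipped, fetched
```

With a 50-byte per-file limit and a 20-byte aggregate limit, for example, a 5-byte and a 10-byte file fit in the zip while a 100-byte file is deferred to fetch.txt.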
Which issue(s) this PR closes:
DANS DD-2157
Special notes for your reviewer:
This builds on #12063.
The internal BagGenerator was recently updated to include the gbrecs parameter, which suppresses download counts when downloads are made for archival purposes. This PR also adds that parameter to the URLs in the fetch.txt file to ensure that the third-party receiver doesn't accidentally inflate download counts.
In its first iteration, this PR assumes that the receiver will retrieve the missing files directly from Dataverse, which means it may need an API key or other credential to get those files. The next step will be to add an option for those files to be transferred separately to the same place the bag is sent, and to adjust the fetch file accordingly. This will ensure that, with holey bags, completion of the archiving by Dataverse means that all data is in the receiving service.
Suggestions on how to test this: Set up archiving, configure either or both of the JVM options above, and verify that the split between files included in the zip and files listed in fetch.txt is correct.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: