feat(transaction): Add option to check added data files in FastAppendAction #2025
Which issue does this PR close?
This PR does not close any existing issues. It addresses an optimization opportunity in the fast append workflow.
Use Case
I need to validate data files before committing them to the table. Currently, validate_added_data_files() is called internally during commit(), which means validation occurs on every commit attempt, including retries.
Enhancement
By disabling validate_added_data_files() in the commit method, I can perform validation once before attempting the commit. This allows commit retries without re-running validation, reducing overhead in retry scenarios. It's a performance optimization that provides more control over the validation/commit lifecycle.
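To illustrate the retry overhead described here, the following is a minimal sketch using a mock, not the iceberg-rust types. `MockProducer`, `commit_with_retries`, `validation_counts`, and the `check_added` flag are hypothetical names introduced only for this example; the sketch assumes the first two commit attempts fail with a simulated conflict.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a snapshot producer; NOT the iceberg-rust type.
struct MockProducer {
    validations: Cell<u32>, // how many times validation ran
    attempts: Cell<u32>,    // how many commit attempts were made
    fail_first: u32,        // number of initial attempts that fail
}

impl MockProducer {
    fn new(fail_first: u32) -> Self {
        Self {
            validations: Cell::new(0),
            attempts: Cell::new(0),
            fail_first,
        }
    }

    /// Stands in for the (potentially expensive) data file validation.
    fn validate_added_data_files(&self) -> Result<(), String> {
        self.validations.set(self.validations.get() + 1);
        Ok(())
    }

    /// One commit attempt; validates only when `check_added` is true.
    fn commit(&self, check_added: bool) -> Result<(), String> {
        if check_added {
            self.validate_added_data_files()?;
        }
        let n = self.attempts.get() + 1;
        self.attempts.set(n);
        if n <= self.fail_first {
            Err(format!("simulated commit conflict on attempt {}", n))
        } else {
            Ok(())
        }
    }
}

/// Retries `commit` up to `max_attempts` times, returning the last error on failure.
fn commit_with_retries(p: &MockProducer, check_added: bool, max_attempts: u32) -> Result<(), String> {
    let mut last_err = String::from("no attempts made");
    for _ in 0..max_attempts {
        match p.commit(check_added) {
            Ok(()) => return Ok(()),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}

/// Returns (validations when checking on every attempt, validations when checking once up front).
fn validation_counts() -> (u32, u32) {
    // Validation runs inside every commit attempt.
    let every = MockProducer::new(2);
    commit_with_retries(&every, true, 5).unwrap();

    // Validation runs once before the retry loop; commits skip it.
    let once = MockProducer::new(2);
    once.validate_added_data_files().unwrap();
    commit_with_retries(&once, false, 5).unwrap();

    (every.validations.get(), once.validations.get())
}
```

With two failed attempts before success, the in-commit variant validates on each of the three attempts, while the validate-once variant validates a single time regardless of how many retries follow.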
What changes are included in this PR?
This commit adds an option to FastAppendAction to disable the validation step snapshot_producer.validate_added_data_files() during commits. This is similar to the existing option to disable snapshot_producer.validate_duplicate_files() in append.rs.
The change is implemented in crates/iceberg/src/transaction/append.rs.
Are these changes tested?
These changes have been manually tested outside the test framework. I noticed that the existing with_check_duplicate() method has no test coverage. I'm not sure if either feature is just too small to be considered in scope for the project's test strategy. If helpful, I can add tests for both with_check_duplicate() and the new validate_added_data_files() method here in this PR.
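As a rough idea of what such tests could assert, here is a sketch against a simplified stand-in rather than the real FastAppendAction. `AppendOptions`, `with_check_added_data_files`, and `validations_to_run` are hypothetical names for illustration only; the defaults (both checks enabled) and the order of the two checks are assumptions, not taken from append.rs.

```rust
/// Hypothetical, simplified stand-in for the flags on FastAppendAction.
#[derive(Debug, Clone, PartialEq)]
struct AppendOptions {
    check_duplicate: bool,
    check_added_data_files: bool,
}

impl Default for AppendOptions {
    fn default() -> Self {
        // Assumption: both checks stay enabled unless explicitly turned off.
        Self {
            check_duplicate: true,
            check_added_data_files: true,
        }
    }
}

impl AppendOptions {
    /// Mirrors the builder style of the existing with_check_duplicate() option.
    fn with_check_duplicate(mut self, v: bool) -> Self {
        self.check_duplicate = v;
        self
    }

    /// Hypothetical name for the new toggle; the real method name may differ.
    fn with_check_added_data_files(mut self, v: bool) -> Self {
        self.check_added_data_files = v;
        self
    }

    /// Names of the validation steps a commit would run (order is an assumption).
    fn validations_to_run(&self) -> Vec<&'static str> {
        let mut steps = Vec::new();
        if self.check_added_data_files {
            steps.push("validate_added_data_files");
        }
        if self.check_duplicate {
            steps.push("validate_duplicate_files");
        }
        steps
    }
}
```

A real test would exercise the action through a transaction against a test table; the stand-in only shows the shape of the assertions: defaults run both checks, and each toggle removes exactly one step.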