857 speed up repeated csv reads binary #944
Open
EricBenschneider wants to merge 61 commits into daphne-project:main from
d336e5a to 370822d
This reverts commit 85ea77a.
This reverts commit 8d79817.
370822d to 953a5e4
Description:
This PR refactors CSV parsing to support daphne's binary data format (.dbdf), which stores DenseMatrix and Frame objects in a streamlined manner. With the new mechanism, parsed CSV data is saved as a .dbdf file and reused on later reads, which makes repeated reads faster. Key changes include:
Efficient Binary Saving and Loading:
When the use-dbdf-optimization flag is set, or save_csv_as_bin is set to true in the user config, the reader first attempts to load a previously generated .dbdf file using readDaphne().
If the .dbdf file is found and valid, the data is loaded directly from the .dbdf file.
If the .dbdf file is not found or an error occurs during its load, the standard CSV parsing path is executed, and a .dbdf file is generated afterward using writeDaphne().
This dual-path strategy ensures that subsequent runs benefit from fast native binary I/O.
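The dual-path strategy described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `Values`, `parseCsv`, `saveBinary`, `loadBinary`, `readCsvCached`, the toy binary layout, and the `<csv path>.dbdf` naming scheme are all assumptions made for the example. The real implementation uses readDaphne() and writeDaphne() on DenseMatrix and Frame objects.

```cpp
#include <cstddef>
#include <fstream>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a dense matrix; not daphne's DenseMatrix.
using Values = std::vector<double>;

// Slow path: parse comma-separated numbers from a text file.
Values parseCsv(const std::string &csvPath) {
    std::ifstream in(csvPath);
    Values out;
    std::string line, cell;
    while (std::getline(in, line)) {
        std::stringstream row(line);
        while (std::getline(row, cell, ','))
            out.push_back(std::stod(cell));
    }
    return out;
}

// Write the values as a raw binary file (toy layout: element count, then doubles).
void saveBinary(const std::string &binPath, const Values &v) {
    std::ofstream out(binPath, std::ios::binary);
    const std::size_t n = v.size();
    out.write(reinterpret_cast<const char *>(&n), sizeof n);
    out.write(reinterpret_cast<const char *>(v.data()), n * sizeof(double));
}

// Fast path: returns std::nullopt if the binary file is missing or truncated.
// The toy format has no header validation, so a corrupted count is not detected.
std::optional<Values> loadBinary(const std::string &binPath) {
    std::ifstream in(binPath, std::ios::binary);
    std::size_t n = 0;
    if (!in.read(reinterpret_cast<char *>(&n), sizeof n))
        return std::nullopt;
    Values v(n);
    if (!in.read(reinterpret_cast<char *>(v.data()), n * sizeof(double)))
        return std::nullopt;
    return v;
}

// Dual-path read: try the binary file first; otherwise parse the CSV
// and write the binary file so that later reads can take the fast path.
Values readCsvCached(const std::string &csvPath, bool useCache) {
    const std::string binPath = csvPath + ".dbdf"; // assumed naming scheme
    if (useCache)
        if (auto cached = loadBinary(binPath))
            return *cached;
    Values parsed = parseCsv(csvPath);
    if (useCache)
        saveBinary(binPath, parsed);
    return parsed;
}
```

Note that this sketch serves the binary file whenever it exists, even if the CSV file has changed since; it makes no attempt to detect a stale binary file.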
Performance Gains:
Experiments show that the .dbdf saving mechanism significantly reduces read times.

The following chart lists the .dbdf file size relative to the original CSV file. REP is a matrix of repeating signed integers.
The CSV files used for these results were generated. The resulting frame contains an even mix of all numeric value types currently supported by daphne.

The CSV files used for these results were generated. The resulting matrix consists of random floating-point values.

The results show that the first read is noticeably slower than a normal CSV read, since the .dbdf file is also written after parsing. However, the performance increase on repeated reads seems to justify the feature.
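A rough break-even estimate makes this trade-off concrete. The symbols are assumptions introduced here for illustration, not measurements from this PR: $T_{\mathrm{csv}}$ is the time of a normal CSV read, $T_{\mathrm{first}}$ the time of the first read with the optimization enabled (CSV parsing plus writing the .dbdf file), and $T_{\mathrm{bin}}$ the time of each later read from the .dbdf file. For $n$ reads of the same file, the optimization pays off when

$$T_{\mathrm{first}} + (n-1)\,T_{\mathrm{bin}} < n\,T_{\mathrm{csv}} \iff n > \frac{T_{\mathrm{first}} - T_{\mathrm{bin}}}{T_{\mathrm{csv}} - T_{\mathrm{bin}}}, \quad \text{assuming } T_{\mathrm{bin}} < T_{\mathrm{csv}}.$$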
Supported Data Structures:
The mechanism works for both DenseMatrix (with numeric value types) and Frame objects that don't contain strings. This limitation exists because daphne's binary data format currently doesn't support strings. If that changes in the future, this feature could be extended to support string values as well.
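The eligibility rule above amounts to a check over the column types of a frame. A minimal sketch follows, assuming a hypothetical `ColType` enum and `canCacheAsBinary` helper; the actual type codes and the check in daphne may look different.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical column type tags for illustration only.
enum class ColType { SI8, SI32, SI64, UI8, UI32, UI64, F32, F64, STR };

// A frame qualifies for the binary path only if no column holds strings.
bool canCacheAsBinary(const std::vector<ColType> &schema) {
    return std::none_of(schema.begin(), schema.end(),
                        [](ColType t) { return t == ColType::STR; });
}
```

A frame with the schema `{SI64, F64}` would qualify under this check, while `{F32, STR}` would fall back to plain CSV parsing.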
Testing:
Unit Tests:
The existing test suite (e.g., in ReadCsvTest.cpp) now verifies that:
System-Level Tests:
System-level tests have been executed by running full .daphne files that use the readFrame and readMatrix functions. These tests confirm that:
Overall, the new .dbdf saving workflow is working as expected and provides a significant performance boost.
Please review and test the changes. Feedback is welcome!