Benchmark JSON serialization to binary formats (Parquet, Avro, Protobuf, ORC). JSON files are loaded into Arrow format as an intermediary. Each codec uses its default compression level, e.g. 3 for zstd and 6 for gzip; Snappy has no configurable compression level.
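The pipeline can be pictured with a short sketch. This is a minimal, hypothetical example of the JSON → Arrow → Parquet leg using the `arrow` and `parquet` crates; the file names and two-field schema are illustrative, and the exact APIs differ slightly between crate versions:

```rust
use std::fs::File;
use std::io::BufReader;
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::ReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical schema for newline-delimited JSON records.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, true),
    ]));

    // Stream the JSON file as Arrow RecordBatches.
    let input = BufReader::new(File::open("raw_data/sample.json")?);
    let reader = ReaderBuilder::new(schema.clone()).build(input)?;

    // Write Parquet with zstd level 3, the default level this benchmark uses.
    let props = WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::try_new(3)?))
        .build();
    let mut writer =
        ArrowWriter::try_new(File::create("output/sample.parquet")?, schema, Some(props))?;
    for batch in reader {
        writer.write(&batch?)?;
    }
    writer.close()?;
    Ok(())
}
```

Avro, Protobuf, and ORC follow the same shape: decode JSON into Arrow batches once, then hand the batches to the format-specific writer.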
| Format | Compression | Time (ms) | Size | Files | Ratio | MB/s | Rows/s |
|---|---|---|---|---|---|---|---|
| avro | gzip | 217801.45 | 125.27 MB | 2 | 7.78x | 4.47 | 1566 |
| avro | none | 158799.67 | 595.13 MB | 10 | 1.64x | 6.14 | 2147 |
| avro | snappy | 153354.26 | 179.88 MB | 3 | 5.42x | 6.36 | 2224 |
| avro | zstd | 173574.12 | 127.83 MB | 2 | 7.62x | 5.61 | 1965 |
| orc | gzip | 2299.70 | 46.98 MB | 1 | 20.74x | 423.80 | 148281 |
| orc | none | 9399.54 | 551.47 MB | 5 | 1.77x | 103.69 | 36279 |
| orc | snappy | 761.36 | 93.60 MB | 1 | 10.41x | 1280.08 | 447883 |
| orc | zstd | 1025.39 | 47.41 MB | 1 | 20.56x | 950.47 | 332557 |
| parquet | gzip | 1677.35 | 37.06 MB | 1 | 26.30x | 581.04 | 203298 |
| parquet | none | 14482.22 | 178.76 MB | 2 | 5.45x | 67.30 | 23546 |
| parquet | snappy | 849.15 | 59.76 MB | 1 | 16.31x | 1147.75 | 401580 |
| parquet | zstd | 917.32 | 38.26 MB | 1 | 25.47x | 1062.45 | 371737 |
| protobuf | gzip | 50786.57 | 78.36 MB | 3 | 12.44x | 19.19 | 6714 |
| protobuf | none | 18770.55 | 621.74 MB | 20 | 1.57x | 51.92 | 18167 |
| protobuf | snappy | 22680.47 | 140.67 MB | 5 | 6.93x | 42.97 | 15035 |
| protobuf | zstd | 25615.61 | 51.02 MB | 2 | 19.10x | 38.05 | 13312 |
- Total rows: 341002
- Original JSON size: 974.61 MB
- Fastest: orc + snappy (761.36 ms)
- Slowest: avro + gzip (217801.45 ms)
- Best ratio: parquet + gzip (26.30x)
- Worst ratio: protobuf + none (1.57x)
Build with `cargo build --release` or download a prebuilt binary from the releases page.
```sh
# Convert all JSON files in raw_data/ to all formats
./target/release/serde-bench

# Specific format and compression
./target/release/serde-bench -f parquet -c zstd

# Custom input/output
./target/release/serde-bench -i data/ -o results/

# Benchmark 5 iterations, verbose
./target/release/serde-bench --iterations 5 -v

# Export results as JSON
./target/release/serde-bench --json-file bench.json
```

| Flag | Description | Default |
|---|---|---|
| `-i` | Input files/directories | `raw_data` |
| `-o` | Output directory | `output` |
| `-f` | Format: `parquet`, `avro`, `protobuf`, `orc`, `all` | `all` |
| `-c` | Compression: `none`, `zstd`, `snappy`, `gzip` | `all` |
| `--iterations` | Benchmark iterations | `1` |
| `--dry-run` | Benchmark without writing files | - |
| `-v` | Verbose output | - |
Contributions are welcome! Here's how to get started:
- Fork the repository and clone your fork
- Create a feature branch (`git checkout -b feature/my-feature`)
- Make your changes
- Run `cargo build --release` to ensure it compiles
- Test with sample data: `cargo run --release -- -i raw_data/ --dry-run`
- Push to your fork and open a pull request
Multi-threading support: currently, all benchmarks run single-threaded. Parallel execution could be added for:
- Processing multiple input files concurrently
- Running different format/compression combinations in parallel
- Parallel row group writing for formats that support it (Parquet, ORC)
This would significantly improve benchmark throughput on multi-core systems.
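As a rough illustration of the first item, per-file parallelism could be layered on with the `rayon` crate (an assumed dependency here); `convert_file` is a hypothetical stand-in for the real per-file conversion logic:

```rust
use rayon::prelude::*;
use std::path::{Path, PathBuf};

// Hypothetical: convert one JSON file and return the encoded size in bytes.
fn convert_file(path: &Path) -> std::io::Result<u64> {
    // ... read JSON, encode to the target format, write the output ...
    let _ = path;
    Ok(0)
}

fn convert_all(inputs: &[PathBuf]) -> Vec<std::io::Result<u64>> {
    // par_iter() fans the per-file work across a thread pool sized to the
    // machine's core count; results come back in input order.
    inputs.par_iter().map(|p| convert_file(p)).collect()
}

fn main() {
    let files = vec![PathBuf::from("raw_data/sample.json")];
    let results = convert_all(&files);
    println!("{} file(s) processed", results.len());
}
```

The second item has the same shape: iterate over (format, compression) pairs instead of paths and let rayon schedule the combinations.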