Benchmarking tool for serialization between JSON, Avro, Protobuf, Parquet and others

themoah/serde-bench

serde-bench

Benchmarks the conversion of JSON to binary formats (Parquet, Avro, Protobuf, ORC). JSON files are first loaded into Arrow as an intermediate representation, then written out in each target format. Each codec runs at its default compression level (e.g. 3 for zstd, 6 for gzip); Snappy has no compression levels.
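The default levels mentioned above can be pictured as a small lookup. This is only a sketch: the `Compression` enum and `default_level` method are hypothetical names for illustration and are not taken from the serde-bench source.

```rust
/// Hypothetical model of the four compression options listed in this README.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Compression {
    None,
    Zstd,
    Snappy,
    Gzip,
}

impl Compression {
    /// Default level per codec as stated in the README.
    /// `None` means the codec has no level to configure.
    fn default_level(self) -> Option<i32> {
        match self {
            Compression::Zstd => Some(3),
            Compression::Gzip => Some(6),
            Compression::Snappy | Compression::None => None,
        }
    }
}

fn main() {
    let all = [
        Compression::None,
        Compression::Zstd,
        Compression::Snappy,
        Compression::Gzip,
    ];
    for codec in all {
        println!("{:?}: {:?}", codec, codec.default_level());
    }
}
```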

Sample output

Format     Compress      Time (ms)   Size (bytes)   Files      Ratio         MB/s         Rows/s
-----------------------------------------------------------------------------------------------
avro       gzip          217801.45      125.27 MB       2       7.78x         4.47           1566
avro       none          158799.67      595.13 MB      10       1.64x         6.14           2147
avro       snappy        153354.26      179.88 MB       3       5.42x         6.36           2224
avro       zstd          173574.12      127.83 MB       2       7.62x         5.61           1965
orc        gzip            2299.70       46.98 MB       1      20.74x       423.80         148281
orc        none            9399.54      551.47 MB       5       1.77x       103.69          36279
orc        snappy           761.36       93.60 MB       1      10.41x      1280.08         447883
orc        zstd            1025.39       47.41 MB       1      20.56x       950.47         332557
parquet    gzip            1677.35       37.06 MB       1      26.30x       581.04         203298
parquet    none           14482.22      178.76 MB       2       5.45x        67.30          23546
parquet    snappy           849.15       59.76 MB       1      16.31x      1147.75         401580
parquet    zstd             917.32       38.26 MB       1      25.47x      1062.45         371737
protobuf   gzip           50786.57       78.36 MB       3      12.44x        19.19           6714
protobuf   none           18770.55      621.74 MB      20       1.57x        51.92          18167
protobuf   snappy         22680.47      140.67 MB       5       6.93x        42.97          15035
protobuf   zstd           25615.61       51.02 MB       2      19.10x        38.05          13312

Total rows: 341002
Original JSON size: 974.61 MB
Fastest: orc + snappy (761.36 ms)
Slowest: avro + gzip (217801.45 ms)
Best ratio: parquet + gzip (26.30x)
Worst ratio: protobuf + none (1.57x)
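
The Ratio, MB/s and Rows/s columns appear consistent with simple quotients over the original JSON size, the total row count and the elapsed time. A minimal sketch, assuming that is how they are defined (`Metrics` and `derive_metrics` are hypothetical names, not taken from the serde-bench source), checked against the parquet + gzip row:

```rust
/// Derived metrics for one format/compression run.
struct Metrics {
    ratio: f64,      // original JSON size / output size
    mb_per_s: f64,   // original JSON size / elapsed seconds
    rows_per_s: f64, // total rows / elapsed seconds
}

/// Assumed definitions; they reproduce the sample table but are not
/// confirmed against the tool's implementation.
fn derive_metrics(input_mb: f64, output_mb: f64, rows: u64, time_ms: f64) -> Metrics {
    let secs = time_ms / 1000.0;
    Metrics {
        ratio: input_mb / output_mb,
        mb_per_s: input_mb / secs,
        rows_per_s: rows as f64 / secs,
    }
}

fn main() {
    // Inputs from the sample output: 974.61 MB of JSON, 341002 rows,
    // parquet + gzip wrote 37.06 MB in 1677.35 ms.
    let m = derive_metrics(974.61, 37.06, 341_002, 1677.35);
    println!(
        "{:.2}x  {:.2} MB/s  {:.0} rows/s",
        m.ratio, m.mb_per_s, m.rows_per_s
    );
}
```

Under these assumptions, throughput is measured against the input JSON size rather than the output size, which is why a well-compressed format does not report a lower MB/s.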

Install

cargo build --release

Or download it from the releases page.

Usage

# Convert all JSON files in raw_data/ to all formats
./target/release/serde-bench

# Specific format and compression
./target/release/serde-bench -f parquet -c zstd

# Custom input/output
./target/release/serde-bench -i data/ -o results/

# Benchmark 5 iterations, verbose
./target/release/serde-bench --iterations 5 -v

# Export results as JSON
./target/release/serde-bench --json-file bench.json

Options

Flag           Description                                 Default
--------------------------------------------------------------------
-i             Input files/directories                     raw_data
-o             Output directory                            output
-f             Format: parquet, avro, protobuf, orc, all   all
-c             Compression: none, zstd, snappy, gzip       all
--iterations   Benchmark iterations                        1
--dry-run      Benchmark without writing files             -
--json-file    Export results as JSON to the given file    -
-v             Verbose output                              -
Contributing

Contributions are welcome! Here's how to get started:

  1. Fork the repository and clone your fork
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Make your changes
  4. Run cargo build --release to ensure it compiles
  5. Test with sample data: cargo run --release -- -i raw_data/ --dry-run
  6. Push to your fork and open a pull request

Good First Issues

Multi-threading support: currently, all benchmarks run single-threaded. Parallel execution could be added for:

  • Processing multiple input files concurrently
  • Running different format/compression combinations in parallel
  • Parallel row group writing for formats that support it (Parquet, ORC)

This would significantly improve benchmark throughput on multi-core systems.
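
As a starting point, the second item (running format/compression combinations in parallel) could be sketched with scoped threads from the standard library. This is an illustration only: `run_one` is a hypothetical stand-in for a single benchmark run, and nothing here reflects the actual serde-bench code structure.

```rust
use std::thread;
use std::time::{Duration, Instant};

static FORMATS: [&str; 4] = ["parquet", "avro", "protobuf", "orc"];
static CODECS: [&str; 4] = ["none", "zstd", "snappy", "gzip"];

/// Hypothetical stand-in for one benchmark run. A real implementation would
/// write the Arrow data in `format` with `codec`; here only a no-op is timed.
fn run_one(format: &str, codec: &str) -> (String, Duration) {
    let start = Instant::now();
    // ... conversion work would go here ...
    (format!("{format}+{codec}"), start.elapsed())
}

/// Spawn one scoped thread per format/compression pair and collect the
/// results in spawn order.
fn run_all_parallel() -> Vec<(String, Duration)> {
    thread::scope(|s| {
        let mut handles = Vec::new();
        for &format in FORMATS.iter() {
            for &codec in CODECS.iter() {
                handles.push(s.spawn(move || run_one(format, codec)));
            }
        }
        handles
            .into_iter()
            .map(|h| h.join().expect("benchmark thread panicked"))
            .collect()
    })
}

fn main() {
    for (label, elapsed) in run_all_parallel() {
        println!("{label}: {elapsed:?}");
    }
}
```

One design caveat: combinations that run concurrently compete for CPU, memory bandwidth and disk, so per-combination timings may no longer be comparable with single-threaded results. Bounding the number of worker threads, or reporting parallel and sequential timings separately, would be reasonable options to consider.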
