thekid commented Aug 2, 2025

Snappy is widely used in Google projects such as Bigtable and MapReduce, and for compressing data in Google's internal RPC systems. It is also used in open-source projects like MariaDB ColumnStore, Cassandra, Couchbase, Hadoop, LevelDB, MongoDB, RocksDB, Lucene, Spark, Parquet, InfluxDB, and Ceph. Firefox uses Snappy to compress data in localStorage.

  • Compression
  • Decompression
  • Input streaming
  • Output streaming

See https://en.wikipedia.org/wiki/Snappy_(compression), https://google.github.io/snappy/ and xp-forge/mongodb#62 (comment)


⚠️ Note: The streaming and string-based operations contain large amounts of duplicated code, inlined and purpose-adapted for performance reasons!

thekid commented Aug 3, 2025

Output streaming is a bit challenging, as the compressed data starts with the uncompressed length:

The first bytes of the stream are the length of uncompressed data, stored as a little-endian varint
See https://en.wikipedia.org/wiki/Snappy_(compression)#Stream_format
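
To make the preamble concrete, here is a minimal sketch of the varint encoding described in the quote above. The function name `snappy_varint()` is hypothetical and not part of this pull request; it only illustrates how the uncompressed length is split into 7-bit groups, least significant group first, with the high bit marking that another byte follows.

```php
<?php
// Hypothetical helper for illustration, not the API added by this pull request.
// Encodes a non-negative length as a little-endian varint.
function snappy_varint(int $length): string {
  $out= '';
  while ($length >= 0x80) {
    // Emit the lowest 7 bits and set the continuation bit
    $out.= chr(($length & 0x7f) | 0x80);
    $length>>= 7;
  }
  // The last byte has the continuation bit cleared
  return $out.chr($length);
}
```

For example, `bin2hex(snappy_varint(2207250))` yields `92dc8601`, so a 2207250-byte input needs a four-byte preamble. The encoder has to know the total length before writing its first byte, which is why output streaming is awkward here.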

One solution could be to do the following:

use io\streams\FileOutputStream;
use io\streams\compress\Snappy;

$snappy= new Snappy();

$out= new FileOutputStream('compressed.sn');
$out->write($snappy->length(strlen($data)));

$stream= $snappy->create($out);
$stream->write($data);
$stream->close();

...but that feels hacky.


We could overload the second parameter to create(), as Snappy does not use a compression level, which would give us:

use io\streams\FileOutputStream;
use io\streams\compress\Snappy;

$snappy= new Snappy();

$stream= $snappy->create(new FileOutputStream('compressed.sn'), strlen($data));
$stream->write($data);
$stream->close();

...but that would be inconsistent with other implementations. The classical options approach would give us something like this:

$stream= $snappy->create(new FileOutputStream('compressed.sn'), ['length' => strlen($data)]);

...but that's error-prone due to its "string-key" nature. We could solve this with an Options object, currently using two keys: level (default: -1) and length (default: null):

$stream= $snappy->create(new FileOutputStream('compressed.sn'), new Options(length: strlen($data)));
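
As a rough sketch of why an object beats string keys, the following shows one possible shape for such a class. This is an assumption for illustration only: the actual Options class in this pull request may look different. It uses named arguments (PHP 8.0+) and readonly promoted properties (PHP 8.1+), with the two keys and defaults mentioned above.

```php
<?php
// Illustrative sketch only; the real Options class may be implemented differently.
final class Options {
  public function __construct(
    public readonly int $level= -1,     // -1: use the algorithm's default level
    public readonly ?int $length= null, // null: uncompressed length not known
  ) { }
}

// Named arguments let callers set only what they need
$options= new Options(length: 2207250);
```

With this shape, a misspelled key such as `new Options(lenght: 10)` raises an `Error` for the unknown named parameter at the call site, whereas a typo in an array key would go unnoticed.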

thekid added the enhancement label Aug 3, 2025

thekid commented Aug 3, 2025

Integration testing buffered vs. unbuffered snappy compression shows the implementation has bugs:

# Calls compress()
$ xp snappy.script.php -c pdf.streaming > pdf.sn
pdf.streaming (2207250 -> 876717) 0.064 seconds & 2044.38 kB used / 6550.12 kB peak

# Calls open($out)
$ xp snappy.script.php -buf pdf.streaming pdf.sn
[.]
pdf.streaming (2207250 -> 876717) 0.074 seconds & 1426.95 kB used / 6832.72 kB peak

# Calls open($out, new Options(length: $size))
$ xp snappy.script.php -out pdf.streaming pdf.sn
[.]
pdf.streaming (2207250 -> 89427) 0.190 seconds & 1445.20 kB used / 1786.59 kB peak

All of these yield the following decompression error:

$ snappy -d pdf.sn > pdf.return
snappy: pdf.sn: compressed block of length 876717: expecting 2207250 bytes, got 909072

For comparison, this is what is expected:

$ snappy pdf.streaming > pdf.sn
pdf.streaming: 2207250 -> 2199987 (99.67%)
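
When debugging errors like the one above, it can help to check which length a file announces in its first bytes. The sketch below decodes the varint preamble; the function name `snappy_expected_length()` is hypothetical, and it assumes the data starts with the little-endian varint described earlier in this thread.

```php
<?php
// Hypothetical debugging helper, not part of this pull request.
// Decodes the little-endian varint at the start of $bytes.
function snappy_expected_length(string $bytes): int {
  $length= 0;
  $shift= 0;
  for ($i= 0, $n= strlen($bytes); $i < $n; $i++) {
    $byte= ord($bytes[$i]);
    // The lower 7 bits carry payload, least significant group first
    $length|= ($byte & 0x7f) << $shift;
    // A cleared high bit marks the last byte of the varint
    if ($byte < 0x80) return $length;
    $shift+= 7;
  }
  throw new InvalidArgumentException('Truncated varint');
}
```

Calling it on the first few bytes of a file, for instance `snappy_expected_length(file_get_contents('pdf.sn', false, null, 0, 10))`, gives the number of bytes a decoder expects to produce. If that matches the input size, as the "expecting 2207250 bytes" message suggests here, the preamble is fine and the fault lies in the compressed body.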

thekid commented Aug 3, 2025

Using https://github.com/google/snappy/tree/main/testdata files copied to ./fixtures:

Integration testing for compress()

for file in $(ls -1 fixtures/* | grep -v baddata); do 
  echo "== $file =="
  xp snappy.script.php -c $file > sn 
  snappy -d sn > test 
  diff -u test $file && echo "OK"
  rm sn test 
done

✅ Works

thekid commented Aug 3, 2025

Streaming, while being a bit slower for small files, really shines with large files:

== fixtures/lcet10.txt ==
Compress: fixtures/lcet10.txt (426754 -> 234392) 0.088 seconds & 1415.44 kB used / 2265.16 kB peak
Stream:   fixtures/lcet10.txt (426754 -> 234392) 0.099 seconds & 1440.20 kB used / 1805.58 kB peak

== fixtures/download.mp4 ==
Compress: fixtures/download.mp4 (612228640 -> 611611198) 34.002 seconds & 599200.44 kB used / 1197666.18 kB peak
Stream:   fixtures/download.mp4 (612228640 -> 611611198) 9.174 seconds & 1443.69 kB used / 1897.12 kB peak

The 584 MB video file compresses in 9 seconds instead of 34, and has a peak memory usage of just 1.8 Megabytes vs. 1.1 Gigabytes!

thekid merged commit c3950ea into main on Aug 15, 2025
thekid deleted the feature/snappy-compression branch on August 15, 2025 19:27