Deterministic, streaming Content-Defined Chunking (CDC) for Rust
chunkrs is a high-performance, portable infrastructure library for FastCDC chunking and cryptographic hashing.
Bytes in → Chunks & hashes out.
Zero-copy streaming. Runtime-agnostic async. Fits delta sync, backup, deduplication, and most other CDC workloads.
- Streaming-first: Processes multi-GB files with constant memory (no full-file buffering)
- Deterministic-by-design: Identical bytes always produce identical chunk hashes, regardless of batching or execution timing
- Zero-allocation hot path: Thread-local buffer pools eliminate allocator contention under load
- FastCDC algorithm: Gear hash rolling boundary detection with configurable min/avg/max sizes
- BLAKE3 identity: Cryptographic chunk hashing (optional, incremental)
- Runtime-agnostic async: Works with Tokio, async-std, or any `futures-io` runtime
- Strictly safe: `#![forbid(unsafe_code)]`
chunkrs processes one logical byte stream at a time with strictly serial CDC state:
```
┌───────────────┐     ┌──────────────┐     ┌──────────────────┐
│  Input Byte   │     │ I/O Batching │     │ Serial CDC State │
│  Stream       │────▶│ (8KB buffers │────▶│ Machine          │
│ (any io::Read │     │ for syscall  │     │ (FastCDC rolling │
│ or AsyncRead) │     │ efficiency)  │     │ hash)            │
└───────────────┘     └──────────────┘     └──────────────────┘

     ┌─────────────┐     ┌────────────────────────────┐
     │             │     │ Chunk {                    │
 ──▶ │   Chunk     │────▶│   data: Bytes,             │
     │   Stream    │     │   offset: Option<u64>,     │
     │             │     │   hash: Option<ChunkHash>  │
     └─────────────┘     │ }                          │
                         └────────────────────────────┘
```
| Scenario | Recommendation |
|---|---|
| Delta sync (rsync-style) | ✅ Perfect fit |
| Backup tools | ✅ Ideal for single-stream chunking |
| Deduplication (CAS) | ✅ Use with your own index |
| NVMe Gen4/5 saturation | ✅ 3–5 GB/s per core |
| Distributed dedup | ✅ Stateless, easy to distribute |
| Any other CDC use case | ✅ Likely fits |
```toml
[dependencies]
chunkrs = "0.8"
```

```rust
use std::fs::File;
use chunkrs::{Chunker, ChunkConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.bin")?;
    let chunker = Chunker::new(ChunkConfig::default());
    for chunk in chunker.chunk(file) {
        let chunk = chunk?;
        println!("offset: {:?}, len: {}, hash: {:?}",
            chunk.offset, chunk.len(), chunk.hash);
    }
    Ok(())
}
```

What's in the Chunk Stream:
Each element is a Chunk containing:
- `data: Bytes` — the actual chunk payload (zero-copy reference when possible), ready for subsequent use (e.g., writing to disk)
- `offset: Option<u64>` — byte position in the original stream
- `hash: Option<ChunkHash>` — BLAKE3 hash for content identity (if enabled)
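As a taste of the "use with your own index" pattern from the use-case table earlier, a minimal dedup ratio counter can be built directly on the chunk stream. A sketch, assuming `ChunkHash` implements `Hash + Eq` (the index itself is application code, not part of chunkrs):

```rust
use std::collections::HashMap;
use std::fs::File;
use chunkrs::{Chunker, ChunkConfig};

fn dedup_ratio(path: &str) -> Result<f64, Box<dyn std::error::Error>> {
    let chunker = Chunker::new(ChunkConfig::default());
    let mut index: HashMap<_, u64> = HashMap::new(); // ChunkHash -> occurrence count
    let (mut total, mut unique) = (0u64, 0u64);

    for chunk in chunker.chunk(File::open(path)?) {
        let chunk = chunk?;
        total += chunk.data.len() as u64;
        let seen = index.entry(chunk.hash.expect("hashing enabled by default")).or_insert(0);
        if *seen == 0 {
            unique += chunk.data.len() as u64; // first occurrence: would be stored
        }
        *seen += 1;
    }
    Ok(total as f64 / unique.max(1) as f64)
}
```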
| Type | Description |
|---|---|
| `Chunker` | Stateful CDC engine (maintains rolling hash across batches) |
| `Chunk` | Content-addressed block with `Bytes` payload and optional BLAKE3 hash |
| `ChunkHash` | 32-byte BLAKE3 hash identifying chunk content |
| `ChunkConfig` | Min/avg/max chunk sizes and hash configuration |
| `ChunkIter` | Iterator over chunks (sync) |
| `ChunkError` | Error type for chunking operations |
```rust
use chunkrs::{Chunker, ChunkConfig};

// From file
let file = std::fs::File::open("data.bin")?;
let chunker = Chunker::new(ChunkConfig::default());
for chunk in chunker.chunk(file) {
    let chunk = chunk?;
    // chunk.data: Bytes - the chunk payload
    // chunk.offset: Option<u64> - position in original stream
    // chunk.hash: Option<ChunkHash> - BLAKE3 hash (if enabled)
}

// From memory
let data: Vec<u8> = vec![0u8; 1024 * 1024];
let chunks: Vec<_> = chunker.chunk_bytes(data);
```

Runtime-agnostic via `futures-io`:
```rust
use futures_util::StreamExt;
use chunkrs::{ChunkConfig, ChunkError};

async fn process<R: futures_io::AsyncRead + Unpin>(reader: R) -> Result<(), ChunkError> {
    let mut stream = chunkrs::chunk_async(reader, ChunkConfig::default());
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        // Process
    }
    Ok(())
}
```

Tokio compatibility:
```rust
use chunkrs::ChunkConfig;
use tokio::fs::File;
use tokio_util::compat::TokioAsyncReadCompatExt;

let file = File::open("data.bin").await?;
let stream = chunkrs::chunk_async(file.compat(), ChunkConfig::default());
```
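A fuller, runnable sketch of the same thing (assumes the tokio "full" feature set and tokio-util with its "compat" feature):

```rust
use chunkrs::ChunkConfig;
use futures_util::StreamExt;
use tokio_util::compat::TokioAsyncReadCompatExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = tokio::fs::File::open("data.bin").await?;
    // Bridge tokio's AsyncRead to futures-io, then stream chunks.
    let mut stream = chunkrs::chunk_async(file.compat(), ChunkConfig::default());
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        println!("offset: {:?}, len: {}", chunk.offset, chunk.data.len());
    }
    Ok(())
}
```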
Choose based on your deduplication granularity needs:

```rust
use chunkrs::ChunkConfig;

// Small files / high dedup (8 KiB average)
let small = ChunkConfig::new(2 * 1024, 8 * 1024, 32 * 1024)?;

// Default (16 KiB average) - good general purpose
let default = ChunkConfig::default();

// Large files / high throughput (256 KiB average)
let large = ChunkConfig::new(64 * 1024, 256 * 1024, 1024 * 1024)?;
```
```rust
use chunkrs::{ChunkConfig, HashConfig};

// With BLAKE3 (default)
let with_hash = ChunkConfig::default();

// Boundary detection only (faster, no content identity)
let no_hash = ChunkConfig::default().with_hash_config(HashConfig::disabled());
```

Throughput targets on modern hardware:
| Storage | Single-core CDC | Bottleneck |
|---|---|---|
| NVMe Gen4 | ~3–5 GB/s | CPU (hashing) |
| NVMe Gen5 | ~3–5 GB/s | CDC algorithm |
| SATA SSD | ~500 MB/s | Storage |
| 10 Gbps LAN | ~1.2 GB/s | Network |
| HDD | ~200 MB/s | Seek latency |
Memory usage:
- Constant: `O(batch_size)`, typically 4–16 MB per stream
- Thread-local cache: ~64 MB per thread (reusable)
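The thread-local cache follows the usual pool pattern: buffers return to a per-thread free list instead of the allocator. An illustrative sketch of that pattern (not chunkrs's actual internals):

```rust
use std::cell::RefCell;

thread_local! {
    // Per-thread free list: no locks, no cross-thread contention.
    static POOL: RefCell<Vec<Vec<u8>>> = RefCell::new(Vec::new());
}

fn acquire(capacity: usize) -> Vec<u8> {
    POOL.with(|p| p.borrow_mut().pop())
        .map(|mut buf| {
            buf.clear();
            buf.reserve(capacity);
            buf
        })
        .unwrap_or_else(|| Vec::with_capacity(capacity))
}

fn release(buf: Vec<u8>) {
    // Hand the buffer back for reuse by this thread's next acquire().
    POOL.with(|p| p.borrow_mut().push(buf));
}
```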
To saturate NVMe Gen5: Process multiple files concurrently (application-level parallelism). Do not attempt to parallelize within a single file—this destroys deduplication ratios.
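That pattern in sketch form: fan whole files out to worker threads, keeping each file a single serial stream (the file list and error handling here are illustrative):

```rust
use std::fs::File;
use std::thread;
use chunkrs::{Chunker, ChunkConfig};

fn chunk_files(paths: Vec<String>) {
    // One worker per file: parallelism across streams, never within one,
    // so every file keeps its deterministic chunk boundaries.
    let workers: Vec<_> = paths
        .into_iter()
        .map(|path| {
            thread::spawn(move || {
                let chunker = Chunker::new(ChunkConfig::default());
                let mut count = 0u64;
                for chunk in chunker.chunk(File::open(&path).expect("open failed")) {
                    chunk.expect("chunking failed");
                    count += 1;
                }
                (path, count)
            })
        })
        .collect();

    for worker in workers {
        let (path, count) = worker.join().expect("worker panicked");
        println!("{path}: {count} chunks");
    }
}
```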
chunkrs guarantees content-addressable identity:
- Strong guarantee: Identical byte streams produce identical `ChunkHash` (BLAKE3) values
- Boundary stability: For identical inputs and configurations, chunk boundaries are deterministic across different batch sizes or execution timings
- Serial consistency: Rolling hash state is strictly maintained across batch boundaries
What this means: You can re-chunk a file on Tuesday with different I/O batch sizes and get bit-identical chunks to Monday's run. This is essential for delta sync correctness.
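One way to check the guarantee yourself is to feed the same bytes through readers with wildly different read granularities and compare the hash sequences. A sketch (the `TrickleReader` helper is ours, and it assumes `ChunkHash` implements `PartialEq` and `Debug`):

```rust
use std::io::{Cursor, Read};
use chunkrs::{Chunker, ChunkConfig, ChunkHash};

/// Test helper: yields at most `max` bytes per read() call,
/// simulating a very different I/O batching pattern.
struct TrickleReader<R> {
    inner: R,
    max: usize,
}

impl<R: Read> Read for TrickleReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let n = buf.len().min(self.max);
        self.inner.read(&mut buf[..n])
    }
}

fn hashes<R: Read>(reader: R) -> Vec<Option<ChunkHash>> {
    let chunker = Chunker::new(ChunkConfig::default());
    chunker.chunk(reader).map(|c| c.unwrap().hash).collect()
}

fn main() {
    let data: Vec<u8> = (0..4_000_000u32).map(|i| (i.wrapping_mul(31) % 251) as u8).collect();
    let whole = hashes(Cursor::new(data.clone()));
    let dripped = hashes(TrickleReader { inner: Cursor::new(data), max: 7 });
    assert_eq!(whole, dripped); // identical chunk hashes regardless of read sizes
}
```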
- No unsafe code: `#![forbid(unsafe_code)]`
- Comprehensive testing: Unit tests, doc tests, and property-based tests ensure:
  - Determinism invariants
  - Batch equivalence (chunking a whole buffer vs. streaming it in pieces yields the same results)
  - No panics on edge cases (empty files, single bytes, max-size boundaries)
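A property test in that spirit, sketched with the proptest crate (illustrative of the approach, not the crate's actual suite):

```rust
use std::io::Cursor;
use chunkrs::{Chunker, ChunkConfig};
use proptest::prelude::*;

proptest! {
    // Chunking arbitrary inputs (including empty ones) must never panic,
    // and concatenating the chunk payloads must reproduce the input.
    #[test]
    fn chunks_reassemble_losslessly(data in prop::collection::vec(any::<u8>(), 0..65536)) {
        let chunker = Chunker::new(ChunkConfig::default());
        let mut rebuilt = Vec::new();
        for chunk in chunker.chunk(Cursor::new(data.clone())) {
            rebuilt.extend_from_slice(&chunk.unwrap().data);
        }
        prop_assert_eq!(rebuilt, data);
    }
}
```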
Boundary Detection: FastCDC (Gear rolling hash)
- Byte-by-byte polynomial rolling hash via lookup table
- Dual-mask normalization (small/large chunk detection)
- Configurable min/avg/max constraints
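In sketch form, the per-byte step looks roughly like this; the table contents and masks below are illustrative stand-ins, not chunkrs's actual constants:

```rust
/// Illustrative Gear table: 256 pseudo-random u64s (filled here with
/// splitmix64 so the example is self-contained; real tables differ).
const GEAR: [u64; 256] = {
    let mut table = [0u64; 256];
    let mut seed: u64 = 0x9e37_79b9_7f4a_7c15;
    let mut i = 0;
    while i < 256 {
        seed = seed.wrapping_add(0x9e37_79b9_7f4a_7c15);
        let mut z = seed;
        z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
        table[i] = z ^ (z >> 31);
        i += 1;
    }
    table
};

// Dual-mask normalization: a strict mask before the average size
// (boundaries are rare, discouraging small chunks) and a loose one after.
const MASK_S: u64 = 0x0000_d930_0353_0000; // more set bits: harder to match
const MASK_L: u64 = 0x0000_d900_0003_0000; // fewer set bits: easier to match

fn find_boundary(data: &[u8], min: usize, avg: usize, max: usize) -> usize {
    let mut hash: u64 = 0;
    for (i, &byte) in data.iter().enumerate().take(max) {
        // Rolling update: shift the window, add the byte's table entry.
        hash = (hash << 1).wrapping_add(GEAR[byte as usize]);
        if i + 1 < min {
            continue; // respect the minimum chunk size
        }
        let mask = if i + 1 < avg { MASK_S } else { MASK_L };
        if hash & mask == 0 {
            return i + 1; // cut point found
        }
    }
    data.len().min(max) // forced cut at max size (or end of input)
}
```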
Chunk Identity: BLAKE3 (when enabled)
- Incremental hashing for streaming
- 32-byte cryptographic digests
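The incremental pattern, shown directly against the blake3 crate (how chunkrs drives it internally is an implementation detail):

```rust
fn streamed_identity(parts: &[&[u8]]) -> blake3::Hash {
    // Feed bytes as they arrive; the digest equals hashing the
    // concatenation in one shot.
    let mut hasher = blake3::Hasher::new();
    for part in parts {
        hasher.update(part);
    }
    hasher.finalize()
}

fn main() {
    let one_shot = blake3::hash(b"hello world");
    let streamed = streamed_identity(&[b"hello".as_slice(), b" ", b"world"]);
    assert_eq!(one_shot, streamed); // 32-byte digests match
}
```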
| Feature | Description | Default |
|---|---|---|
| `hash-blake3` | BLAKE3 chunk hashing | ✅ |
| `async-io` | Async `Stream` support via `futures-io` | ❌ |
```toml
# Default: sync + hashing
[dependencies]
chunkrs = "0.8"

# Minimal: sync only, no hashing
[dependencies]
chunkrs = { version = "0.8", default-features = false }

# Full featured: sync + async + hashing
[dependencies]
chunkrs = { version = "0.8", features = ["async-io"] }
```

Current: 0.8.0 — Core API stable, comprehensive feature set, seeking production feedback.
Note: version bumped to 0.8.0 because the design, APIs, and feature set are close to mature.
Core Functionality:
- FastCDC rolling hash, sync, async I/O, zero-copy, BLAKE3 hashing, thread-local buffer pools, deterministic chunking
Quality & Safety:
- 45 unit tests + 40 doctests, fuzzing, no `unsafe`
- Documentation and examples
- Benchmarks
0.9.x — Production Hardening:
- Extended cross-platform testing (Windows, macOS, Linux variants)
- Additional fuzzing targets for edge cases
- Miri validation for memory safety
- Performance profiling and optimization for specific workloads
- Enhanced error messages with context
1.0.0 — Stable Release:
- Alternative hash algorithms (xxHash for speed, SHA-256 for compatibility)
- Configurable buffer pool sizes for memory-constrained environments
- Custom allocator support for specialized use cases
- Formal SemVer commitment with MSRV policy
- Comprehensive integration guide and production deployment patterns
Post-1.0 — Additive Features Only:
- SIMD optimizations (AVX2/AVX-512) for rolling hash
- Hardware-accelerated hashing (BLAKE3 SIMD, SHA-NI)
- Advanced CDC algorithm variants (e.g., pattern-aware chunking)
- `no_std` support for embedded environments
These features are intentionally out of scope:
- Networking: Handle in application layer
- Encryption: Pre-encrypt or post-encrypt at application layer
- Compression: Apply compression before or after chunking
- Deduplication indexing: Use companion crates (CAS index implementations)
- Distributed coordination: Manage at application level
We're actively seeking feedback on:
- Real-world deployment patterns and performance characteristics
- Edge cases and failure modes in production
- Integration patterns with storage systems and databases
- Feature requests that align with CDC use cases
Open an issue or discussion on GitHub; issues and pull requests are welcome.
- See ARCHITECTURE.md for design and implementation details.
- See CHANGELOG.md for version history.
This crate implements the FastCDC algorithm described in:

Wen Xia, Yukun Zhou, Hong Jiang, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang, Qing Liu,
"FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication",
in Proceedings of the USENIX Annual Technical Conference (USENIX ATC '16), Denver, CO, USA, June 22–24, 2016, pages 101–114.

Wen Xia, Xiangyu Zou, Yukun Zhou, Hong Jiang, Chuanyi Liu, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang,
"The Design of Fast Content-Defined Chunking for Data Deduplication based Storage Systems",
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2020.
This crate is inspired by the original fastcdc crate but focuses on a modernized API with streaming-first design, strict determinism, and allocation-conscious internals.
MIT License — see LICENSE