itar builds constant‑time indexes over one or more tar file shards, enabling direct, random access to members without extracting the archives. It ships a lightweight CLI (itar) and a Python API.
Designed for large datasets and deep‑learning pipelines, it supports single or sharded tar archives with thread‑safe access for concurrent reads.
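
The source does not describe how itar makes its reads thread-safe. As a minimal sketch of one common approach, assuming a single shared file handle, the seek and the read can be guarded together with a lock so that no other thread moves the file position between them. `LockedReader` and `read_at` are hypothetical names used for illustration only and are not part of the itar API.

```python
import io
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedReader:
    """Serialize seek+read pairs on one shared file object (illustrative helper)."""

    def __init__(self, fileobj):
        self._file = fileobj
        self._lock = threading.Lock()

    def read_at(self, offset, length):
        # seek() and read() must run as one unit; otherwise another thread
        # could move the file position between the two calls.
        with self._lock:
            self._file.seek(offset)
            return self._file.read(length)


# Stand-in for a tar shard: 26 fixed-size records, one per letter A..Z.
payload = b"".join(bytes([65 + i]) * 4 for i in range(26))
reader = LockedReader(io.BytesIO(payload))

# Eight worker threads share one handle and fetch records by offset.
with ThreadPoolExecutor(max_workers=8) as pool:
    chunks = list(pool.map(lambda i: reader.read_at(i * 4, 4), range(26)))

assert chunks[0] == b"AAAA" and chunks[25] == b"ZZZZ"
```

A lock keeps the sketch portable. Per-thread file handles or positional reads are alternatives that avoid serializing the readers.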

    pip install itar[cli]

    echo "Hello world!" > hello.txt
    tar cf hello.tar hello.txt      # regular tarball
    itar index create hello.itar    # indexes hello.tar
    itar index list hello.itar      # list indexed members

Read the member back from Python:

    import itar

    with itar.open("hello.itar") as archive:
        print(archive["hello.txt"].read())

Give each shard a zero-padded suffix before building the index:

    tar cf photos-0.tar wedding/      # shard 0
    tar cf photos-1.tar vacation/     # shard 1
    itar index create photos.itar     # discovers photos-0.tar, photos-1.tar, ...
    itar index list -l photos.itar    # shard index, offsets, byte sizes

Members from all shards are then reachable through one handle in Python:

    import itar

    with itar.open("photos.itar") as photos:
        assert "wedding/cake.jpg" in photos
        img_bytes = photos["vacation/sunrise.jpg"].read()

| Command | Purpose |
|---|---|
| `itar index create <archive>.itar [--single TAR \| --shards shard0.tar shard1.tar ...]` | Indexes a single archive or an explicit set of shards. With no flags, shards are auto-discovered next to `<archive>.itar`. |
| `itar index list <archive>.itar` | Lists members. Use `-l` for shard/offset info and `-H` for human-readable sizes. |
| `itar index check <archive>.itar` | Validates recorded entries; add `--member NAME` to focus on specific files. |
| `itar cat <archive>.itar <member>` | Streams a member’s bytes to stdout. |
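
The source does not say how auto-discovery orders the shards it finds. The sketch below only illustrates why a zero-padded suffix is a sensible convention once there are ten or more shards: plain name sorting puts `photos-10.tar` before `photos-2.tar`, while padded names sort in numeric order. `shard_names` is a hypothetical helper, not part of itar.

```python
def shard_names(prefix, count, width):
    """Return shard file names such as photos-07.tar (hypothetical naming helper)."""
    return [f"{prefix}-{i:0{width}d}.tar" for i in range(count)]


unpadded = shard_names("photos", 12, width=1)  # photos-0.tar ... photos-11.tar
padded = shard_names("photos", 12, width=2)    # photos-00.tar ... photos-11.tar

# Lexicographic sorting scrambles the unpadded names.
print(sorted(unpadded)[:4])
# → ['photos-0.tar', 'photos-1.tar', 'photos-10.tar', 'photos-11.tar']

# Zero-padded names keep their numeric order under the same sort.
print(sorted(padded) == padded)
# → True
```

When in doubt about ordering, the `--shards` flag from the table above lets the shard list be given explicitly.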

- `itar.index.build(shards, progress_bar=False) -> dict`: construct an index mapping for paths, file objects, or buffers.
- `itar.index.create("archive.itar", shards)`: convenience wrapper that builds and saves an index file.
- `itar.index.dump(index, path)`: serialize an index you built elsewhere.
- `itar.index.load(path) -> dict`: load the msgpack index without opening shards.
- `itar.open(path, *, shards=None, open_fn=None) -> IndexedTarFile`: attach shard handles using an existing index file.

An itar index file is a simple MessagePack dictionary mapping member paths to metadata:

    {
        "path/to/member1.jpg": [    # file name
            null,                   # either null or shard index (0-based)
            [
                2048,               # metadata byte offset
                2560,               # data byte offset
                1048576,            # file length in bytes
            ],
        ],
        ...
    }
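
To make the layout concrete, here is a minimal sketch that produces entries of the same shape with the standard-library `tarfile` module and then reads a member with a single seek and read. It is not itar's implementation: `build_tar`, `build_index` and `read_member` are hypothetical helpers, the shard slot is simply left as `None` because only one archive is involved, and the MessagePack serialization step is left out.

```python
import io
import tarfile


def build_tar(files):
    """Create an in-memory tar archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def build_index(shard):
    """Map member name -> [shard index, [header offset, data offset, size]]."""
    shard.seek(0)
    index = {}
    with tarfile.open(fileobj=shard, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                # TarInfo.offset is where the header starts,
                # TarInfo.offset_data is where the file contents start.
                index[member.name] = [None, [member.offset, member.offset_data, member.size]]
    return index


def read_member(shard, entry):
    """Fetch a member's bytes directly, without scanning the archive."""
    _shard_index, (_meta_offset, data_offset, length) = entry
    shard.seek(data_offset)
    return shard.read(length)


shard = build_tar({
    "hello.txt": b"Hello world!\n",
    "data/blob.bin": bytes(range(256)) * 5,
})
index = build_index(shard)
print(read_member(shard, index["hello.txt"]))  # → b'Hello world!\n'
```

Because each entry stores the data offset and the length, a lookup costs one dictionary access plus one seek, regardless of where the member sits in the archive.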