Skip to content

Schmiddo/itar

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

47 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

itar

image

itar builds constant‑time indexes over one or more tar file shards, enabling direct, random access to members without extracting the archives. It ships a lightweight CLI (itar) and a Python API.

Designed for large datasets and deep‑learning pipelines, it supports single or sharded tar archives with thread‑safe access for concurrent reads.

Quickstart

pip install itar[cli]

Single tarball

echo "Hello world!" > hello.txt
tar cf hello.tar hello.txt       # regular tarball

itar index create hello.itar     # indexes hello.tar
itar index list hello.itar       # list indexed members
import itar

with itar.open("hello.itar") as archive:
    print(archive["hello.txt"].read())

Sharded tarballs

Give each shard a zero-padded suffix before building the index:

tar cf photos-0.tar wedding/    # shard 0
tar cf photos-1.tar vacation/   # shard 1

itar index create photos.itar   # discovers photos-0.tar, photos-1.tar, ...
itar index list -l photos.itar  # shard index, offsets, byte sizes
import itar

with itar.open("photos.itar") as photos:
    assert "wedding/cake.jpg" in photos
    img_bytes = photos["vacation/sunrise.jpg"].read()

CLI reference

Command Purpose
itar index create <archive>.itar [--single TAR | --shards shard0.tar shard1.tar ...] Indexes a single archive or an explicit set of shards. With no flags, shards are auto-discovered next to <archive>.itar.
itar index list <archive>.itar Lists members. Use -l for shard/offset info and -H for human-readable sizes.
itar index check <archive>.itar Validates recorded entries; add --member NAME to focus on specific files.
itar cat <archive>.itar <member> Streams a member’s bytes to stdout.

Python helpers

  • itar.index.build(shards, progress_bar=False) -> dict: construct an index mapping for paths, file objects, or buffers.
  • itar.index.create("archive.itar", shards): convenience wrapper that builds + saves an index file.
  • itar.index.dump(index, path): serialize an index you built elsewhere.
  • itar.index.load(path) -> dict: load the msgpack index without opening shards.
  • itar.open(path, *, shards=None, open_fn=None) -> IndexedTarFile: attach shard handles using an existing index file.

itar File Format

An itar index file is a simple MessagePack dictionary mapping member paths to metadata:

{
    "path/to/member1.jpg": [  # file name
        null,                 # either null or shard index (0-based)
        [
            2048,             # metadata byte offset
            2560,             # data byte offset
            1048576,          # file length in bytes
        ],
    ],
    ...
}

About

tar file index for constant-time access

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%