Rework chunk parsing in order to avoid repeatedly re-reading data #48

pettyalex · 2025-06-03T14:37:49Z

I observed that the skip argument of data.table::fread is very slow, and the data.table developers advise against using skip for chunking: Rdatatable/data.table#1721

This change will use pipes to incrementally read the input files in chunks instead of repeatedly decompressing and re-reading the files. This should enable much smaller chunk sizes, which will significantly reduce memory usage.
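The idea can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `read_in_chunks` and its arguments are hypothetical, and it uses a `gzfile()` connection where the PR description mentions pipes (a `pipe()` connection running a decompression command would be read the same way). The point is that one connection stays open across chunks, so each call to `readLines()` continues where the previous one stopped instead of skipping from the start of the file.

```r
library(data.table)

# Hypothetical helper (not from the PR): read a delimited text file in chunks
# through a single open connection, so earlier rows are never re-read.
read_in_chunks <- function(path, chunk_size, process_chunk) {
  # gzfile() reads gzip-compressed files and also plain uncompressed files.
  con <- gzfile(path, open = "rt")
  on.exit(close(con))  # close only once all chunks have been consumed

  # Read the header once; it is prepended to every chunk so column names match.
  header <- readLines(con, n = 1L)
  if (length(header) == 0L) return(invisible(NULL))

  repeat {
    # Continues from the current position of the connection.
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    chunk <- fread(text = c(header, lines), header = TRUE)
    process_chunk(chunk)
  }
  invisible(NULL)
}
```

Because `fread` infers column types separately for each chunk, a real implementation would probably want to pin them with `colClasses`; that is left out here to keep the sketch short.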

pettyalex added 2 commits May 26, 2025 16:34

Update Tractor chunk processing to not repeatedly re-read from the beginning of the file.

5db331f

Fix pipes being closed prematurely.

a7db4de
