Chunksize - a simple approach to splitting #16

web-sst · 2024-11-30T04:16:34Z

This change implements a simple alternative to https://github.com/simonw/ttok/pull/2/files. It does not write output files; instead it prints each chunk in sequence through the existing output mechanism.

For plain text output this is of limited interest. With --tokens it produces readable split output. Combined with --encode it produces one encoded chunk per line, which can be piped to split -l 1 - prefix (the - makes split read standard input) to write one file per chunk; each file can be decoded later. This makes it straightforward to split long text into chunks for embedding or other purposes.
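The chunking behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the code from this PR: chunk_tokens and format_encoded are hypothetical names, and the token IDs are made up rather than produced by a real tokenizer.

```python
from typing import Iterator, List


def chunk_tokens(tokens: List[int], chunksize: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `chunksize` tokens (hypothetical helper)."""
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    for start in range(0, len(tokens), chunksize):
        yield tokens[start:start + chunksize]


def format_encoded(tokens: List[int], chunksize: int) -> str:
    """Render one chunk per line as space-separated token IDs."""
    return "\n".join(
        " ".join(str(t) for t in chunk) for chunk in chunk_tokens(tokens, chunksize)
    )


# Made-up token IDs standing in for real tokenizer output.
example_ids = [101, 202, 303, 404, 505]
print(format_encoded(example_ids, 2))
```

The last chunk may be shorter than the requested size, which is why a line-oriented tool such as split -l 1 is a natural fit for separating the chunks afterwards.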

This prints any output as chunks of the given size. It is useful for running something like this: ttok -m model -i infile --encode --chunksize N | split -l 1 - 'outfile.' (where N is the chunk size in tokens) and then decoding the resulting outfiles, so that each chunk fits within an embedding model's token limit.
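As a rough sketch of the "decode later" step, the per-chunk files can be read back in order before each one is passed to a decoder. The helper name read_chunks is hypothetical, and the sketch assumes the files were written with the outfile. prefix and split's default fixed-length suffixes (aa, ab, ...), each holding one line of space-separated token IDs.

```python
from pathlib import Path
from typing import List


def read_chunks(directory: str, prefix: str = "outfile.") -> List[List[int]]:
    """Load token-ID chunks from files named like outfile.aa, outfile.ab, ..."""
    chunks = []
    # Suffixes of equal length sort lexicographically, so sorted() restores
    # the order in which the chunks were written.
    for path in sorted(Path(directory).glob(prefix + "*")):
        text = path.read_text().strip()
        if text:  # skip empty files
            chunks.append([int(tok) for tok in text.split()])
    return chunks
```

Each list returned here corresponds to one chunk; the same files could instead be fed one at a time to ttok --decode to recover the text of each chunk.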

web-sst added 3 commits November 13, 2024 20:58

Add "--chunksize int" option

862c6af

Remove debugging statement

afd3867

Add tests for --chunksize

b56570b
