@web-sst web-sst commented Nov 30, 2024

This change implements a simple alternative to https://github.com/simonw/ttok/pull/2/files. It does not write output files; instead it prints each chunk in sequence through the existing output mechanism.

With plain text output the effect is uninteresting. With --tokens it gives readable split output. Combined with --encode it produces one encoded chunk per line, which can be piped to `split -l 1 - prefix` to write one file per chunk; each file can be decoded later. This makes it straightforward to split long text into chunks for embedding or other purposes.
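As a rough illustration of the behaviour described above, the chunking can be sketched in plain Python. This is not the code in this PR: `chunk_tokens` and `format_chunks` are hypothetical names, and the sketch assumes the tokens are already available as a list of integer IDs and that each encoded chunk is printed as space-separated integers on its own line.

```python
from typing import Iterator, List


def chunk_tokens(tokens: List[int], chunksize: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `chunksize` tokens (hypothetical helper)."""
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    for start in range(0, len(tokens), chunksize):
        yield tokens[start:start + chunksize]


def format_chunks(tokens: List[int], chunksize: int) -> str:
    """Render one chunk per line, token IDs separated by spaces (assumed format)."""
    return "\n".join(
        " ".join(str(token) for token in chunk)
        for chunk in chunk_tokens(tokens, chunksize)
    )


if __name__ == "__main__":
    # Made-up token IDs, for illustration only.
    print(format_chunks([101, 102, 103, 104, 105], 2))
```

The last chunk may be shorter than the requested size; every other chunk holds exactly `chunksize` tokens.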

This prints any output as chunks of the given size. It is useful
for running something like this:

ttok -m model -i infile --encode --chunksize size | split -l 1 - 'outfile.'

and then decoding the outfiles in order to accommodate embedding
model token limits.
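
One possible way to read the split files back in order is sketched below. It is an assumption-laden example, not part of this PR: `read_chunk_files` is a hypothetical helper, and it assumes each file holds one line of space-separated integer token IDs and that the default `split` suffixes (`aa`, `ab`, ...) sort lexicographically in chunk order.

```python
import glob
from typing import List


def read_chunk_files(prefix: str) -> List[List[int]]:
    """Load token chunks from files named `prefix` + suffix, in sorted order."""
    chunks: List[List[int]] = []
    # glob.escape guards against special characters in the prefix itself.
    for path in sorted(glob.glob(glob.escape(prefix) + "*")):
        with open(path, encoding="utf-8") as handle:
            text = handle.read().strip()
        if text:  # skip empty files
            chunks.append([int(token) for token in text.split()])
    return chunks


if __name__ == "__main__":
    for index, chunk in enumerate(read_chunk_files("outfile.")):
        print(index, len(chunk))
```

Each loaded chunk can then be decoded or passed to an embedding model separately.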