A few thoughts/questions:

Please excuse any ignorance here - I don't have a lot of experience in the lower levels / finer details of tokenizers and fine-tuning. Also, I don't expect a reply to each of these points, to be clear - just dropping some thoughts for you to skim in case they're helpful.

* Seems like `startofthought` should be renamed to `thought_start` to be consistent with `im_start`? Or vice-versa. I'm also kind of curious what `im` means / stands for. A name like `message_start` would make more sense to me, in terms of clarity/consistency. Is there a reason to follow ChatML in this respect? `fim_x` standing for `fill-in-middle x` adds a little more confusion too, since I'm guessing `im_start` doesn't stand for `in-middle start`.
* If it's possible to start pretty fresh here, why not use an explicit syntax for the concept of "closing tags" so it's clear that `_start`/`_end` aren't part of the name/semantics of the tag? Something XML-like would make nesting and grouping more natural and cleanly extensible I think. E.g. instead of `<|thought_start|> <|thought_end|>` it could be `<|thought:|> <|:thought|>` or `<|thought:start|> <|thought:end|>` - and ditto for `message`. XML also has self-closing tags to take inspiration from - could be used for separators like `file_separator` and `fim_middle` - for the example colon syntax above you'd just leave out the colon like `<|my_self_closing_tag|>`.
* Are `file_separator`s used *within* messages? If so, must they always come at the start or end of a message? If not, then it seems like you'd want to have enclosing-style tags rather than just a separator/delimiter? Otherwise you can't embed "files" in the middle of a message, if I'm understanding correctly. Also wondering whether it actually makes sense to add this as an spec-level abstraction, rather than just leaving it to userland - like the Claude-style semantic XML recommendation - e.g. `<snippet>...</snippet>` or `<memory>...</memory>`. Maybe I'm missing the point of what "file" refers to here though.
* Is the choice of `<s>` and `</s>` somehow constrained by pretraining? Given that different base models use different BOS/EOS, I'm guessing not? (Or will BOS/EOS tag *change* depending on the base model, unlike the other tags?) If it's not constrained, then it seems less than ideal to use this since it very plainly clashes with HTML's [strikethrough](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/s) tag, which is still regularly used. I'm not sure how well escaping and unescaping is handled in training and inference libs - TGI doesn't seem to be able to generate strikethrough HTML with llama 2, at least - guessing it was a pretraining oversight, and also guessing unescaping isn't currently automatic in popular inference libs anyway, but would be happy to be wrong there.
* For creative/entertainment applications, there are reasonable uses cases where you e.g. have multiple users interacting with multiple characters *and* where users want to temporarily assume the role of one of the characters, while letting the LLM act/speak for that character the rest of the time. I think there are some things to consider for this sort of use case - e.g. for a user to take the role of another character, it might pay to make it easy to stop generation before that character's response by ensuring that the spec is strict on `name=foo` always coming immediately after the role (assuming that there may be other metadata alongside `name` in the future), with a single space between them. That way a simple stop sequence can be used to interrupt generation right before the character in question was about to speak. 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A few thoughts/questions: #3

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

A few thoughts/questions: #3

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions