Thank you for your excellent work!
I'm very interested in whether you have explored the impact of different tokenizers on model performance.
A closely related issue is the long-standing set of drawbacks of BPE tokenization. Recently, some research has begun to address this by processing byte sequences directly, either with attention mechanisms or with prediction-based dynamic grouping, for example:
- Byte Latent Transformer: Patches Scale Better Than Tokens (https://arxiv.org/abs/2412.09871)
- From Bytes to Ideas: Language Modeling with Autoregressive U-Nets (https://arxiv.org/abs/2506.14761)
- H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages (https://arxiv.org/abs/2508.05628)
I would like to know whether you think it is possible to build a byte-level continuous autoregressive language model, or to apply the autoencoder directly to byte sequences. A rough sketch of the second idea is below.
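To make the second idea concrete, here is a minimal sketch in PyTorch (my own illustration, not code from this repo; the patch length, latent size, and module names are all hypothetical choices) of an autoencoder that compresses a fixed-length patch of raw bytes into a single continuous latent vector, which a continuous autoregressive model could then be trained to predict:

```python
# Minimal sketch (my own illustration, not code from this repo):
# an autoencoder that maps a fixed-length patch of raw bytes to one
# continuous latent vector and reconstructs the bytes from it.
import torch
import torch.nn as nn

PATCH_LEN = 16    # bytes per patch (hypothetical choice)
LATENT_DIM = 128  # size of the continuous latent per patch (hypothetical choice)

class BytePatchAutoencoder(nn.Module):
    def __init__(self, patch_len=PATCH_LEN, latent_dim=LATENT_DIM, emb_dim=64):
        super().__init__()
        self.patch_len = patch_len
        self.byte_emb = nn.Embedding(256, emb_dim)              # one embedding per byte value
        self.encoder = nn.Sequential(                           # byte-patch embeddings -> latent
            nn.Linear(patch_len * emb_dim, 512), nn.GELU(),
            nn.Linear(512, latent_dim),
        )
        self.decoder = nn.Sequential(                           # latent -> per-byte logits
            nn.Linear(latent_dim, 512), nn.GELU(),
            nn.Linear(512, patch_len * 256),
        )

    def forward(self, bytes_in):                                 # bytes_in: (B, patch_len) int64 in [0, 255]
        x = self.byte_emb(bytes_in).flatten(1)                   # (B, patch_len * emb_dim)
        z = self.encoder(x)                                      # (B, latent_dim) continuous patch latent
        logits = self.decoder(z).view(-1, self.patch_len, 256)   # (B, patch_len, 256)
        return z, logits

# Toy usage: reconstruct random byte patches.
model = BytePatchAutoencoder()
patches = torch.randint(0, 256, (8, PATCH_LEN))
z, logits = model(patches)
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), patches.reshape(-1))
loss.backward()
print(z.shape, loss.item())
```

This uses fixed-length byte patches for simplicity; the papers above instead learn dynamic, content-aware chunk boundaries, which seems like the harder but more interesting direction to combine with a continuous autoregressive model.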