
Experiment about choice of text tokenizer #13

@lilitong7

Thank you for your excellent work!
I'm very interested in whether you have explored the impact of different tokenizers on model performance.
Furthermore, a closely related question concerns the long-standing drawbacks of BPE tokenizers. Recently, some research has begun to address this by processing byte sequences directly, either with attention mechanisms or with prediction-based dynamic grouping (a toy sketch of the latter follows the references below):

Byte Latent Transformer: Patches Scale Better Than Tokens
https://arxiv.org/abs/2412.09871

From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
https://arxiv.org/abs/2506.14761

H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages
https://arxiv.org/abs/2508.05628
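
To make "prediction-based dynamic grouping" concrete, here is a toy sketch of the idea as I understand it. This is my own illustration, not code from the papers above or from this repo: the real systems use a learned byte-level language model for the entropy estimate, whereas here I just use a bigram count model and an arbitrary threshold.

```python
# Toy sketch of entropy-based dynamic byte patching: a tiny next-byte model
# estimates the uncertainty at each position, and a new patch starts whenever
# the predicted entropy exceeds a threshold (roughly the BLT-style idea).
import math
from collections import Counter, defaultdict

def train_bigram_byte_model(corpus: bytes):
    """Count byte bigrams to get a crude conditional distribution P(next | prev)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus[:-1], corpus[1:]):
        counts[prev][nxt] += 1
    return counts

def next_byte_entropy(counts, prev: int) -> float:
    """Shannon entropy (bits) of the predicted next-byte distribution."""
    dist = counts.get(prev)
    if not dist:
        return 8.0  # no evidence: assume maximal uncertainty over 256 byte values
    total = sum(dist.values())
    return -sum((c / total) * math.log2(c / total) for c in dist.values())

def dynamic_patches(text: bytes, counts, threshold: float = 2.5):
    """Start a new patch whenever the next-byte entropy crosses the threshold."""
    patches, current = [], [text[0]]
    for prev, nxt in zip(text[:-1], text[1:]):
        if next_byte_entropy(counts, prev) > threshold:
            patches.append(bytes(current))
            current = []
        current.append(nxt)
    patches.append(bytes(current))
    return patches

corpus = b"the quick brown fox jumps over the lazy dog " * 50
model = train_bigram_byte_model(corpus)
print(dynamic_patches(b"the quick brown fox", model))
```

The resulting patches are variable-length, so hard-to-predict regions get split into small groups while predictable regions are merged, which is what makes this attractive as a tokenizer replacement.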

I would like to know whether you think it is possible to build a byte-level continuous autoregressive language model, or to apply an autoencoder directly to byte sequences.
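
On the last point, the mechanics seem straightforward if the autoencoder operates on byte patches. Below is a minimal hypothetical sketch, not anything from this repo: the patch size, latent dimension, and MLP architecture are arbitrary placeholders, just to show how fixed-size byte patches could be mapped to continuous latents that a continuous autoregressive model would then predict instead of discrete tokens.

```python
# Minimal PyTorch sketch of an autoencoder over fixed-size byte patches:
# each patch of K raw bytes is compressed into one continuous latent vector,
# and decoded back to per-byte logits for a reconstruction loss.
import torch
import torch.nn as nn

K, D = 16, 128  # bytes per patch, latent dimension (arbitrary choices)

class BytePatchAutoencoder(nn.Module):
    def __init__(self, patch_size: int = K, latent_dim: int = D):
        super().__init__()
        self.patch_size = patch_size
        self.embed = nn.Embedding(256, 64)            # one embedding per byte value
        self.encoder = nn.Sequential(
            nn.Flatten(start_dim=-2),                 # (B, K, 64) -> (B, K*64)
            nn.Linear(patch_size * 64, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
        self.decoder = nn.Linear(latent_dim, patch_size * 256)  # logits per byte slot

    def forward(self, patches: torch.Tensor):
        # patches: (B, K) integer byte values in [0, 255]
        z = self.encoder(self.embed(patches))         # (B, D) continuous latent
        logits = self.decoder(z).view(-1, self.patch_size, 256)
        return z, logits

# One reconstruction step on random bytes, just to show the shapes.
model = BytePatchAutoencoder()
batch = torch.randint(0, 256, (8, K))
z, logits = model(batch)
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), batch.reshape(-1))
loss.backward()
print(z.shape, loss.item())
```

My uncertainty is less about whether such an autoencoder can be trained and more about whether byte patches (fixed-size or dynamically grouped as above) give latents that are smooth enough for continuous autoregressive prediction, which is why I am curious about your view.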
