Fix gemma3 token vocab mismatch by Saibo-creator · Pull Request #128 · epfl-dlab/transformers-CFG

Saibo-creator · 2025-04-09T20:24:03Z

The Gemma 3 tokenizer is inconsistent: its vocab_size value is 262144, but its vocab dictionary has 262145 entries.

It has a special token, <image_soft_token>, with token id 262144. This token is included in the vocab dictionary returned by tokenizer.get_vocab() but is not counted in vocab_size.
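The mismatch described above can be reproduced in miniature. The following is a minimal sketch, not code from this PR: `extra_tokens` and `StubTokenizer` are hypothetical names introduced for illustration, and the stub scales the vocabulary down to five entries. It assumes only that the tokenizer exposes a `vocab_size` attribute and a `get_vocab()` method, as Hugging Face tokenizers do.

```python
from typing import Dict


def extra_tokens(tokenizer) -> Dict[str, int]:
    """Return the tokens whose id falls outside range(tokenizer.vocab_size).

    Hypothetical helper for illustration; not part of transformers-CFG.
    """
    limit = tokenizer.vocab_size
    return {tok: idx for tok, idx in tokenizer.get_vocab().items() if idx >= limit}


class StubTokenizer:
    """Tiny stand-in that mimics the reported mismatch (sizes scaled down)."""

    # Reported size: one less than the number of entries in get_vocab().
    vocab_size = 4

    def get_vocab(self) -> Dict[str, int]:
        # The last entry plays the role of <image_soft_token> at id == vocab_size.
        return {"<pad>": 0, "a": 1, "b": 2, "c": 3, "<image_soft_token>": 4}


tok = StubTokenizer()
print(len(tok.get_vocab()), tok.vocab_size)
print(extra_tokens(tok))
```

A real tokenizer loaded with `AutoTokenizer.from_pretrained` could be passed to `extra_tokens` in place of the stub. Whether the result matches the numbers quoted above depends on the tokenizer version installed.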


urroxyz · 2025-04-10T00:07:33Z

I think #126 is more robust because it analyzes the mismatch and then aligns, rather than always truncating, so it is a fix not just for Gemma 3 but also for other models with their own niche vocabulary problems.

Fix gemma3 token vocab mismatch by using tokenizer.vocab_size when available

e5a52c0

Saibo-creator mentioned this pull request Apr 9, 2025

transformers-CFG incompatible with gemma-3: causes tokenizer and model vocab size mismatch #127

Open

Saibo-creator wants to merge 1 commit into main from fix_gemma3
