
Fix gemma3 token vocab mismatch #128

Open
Saibo-creator wants to merge 1 commit into main from fix_gemma3

Conversation

Saibo-creator (Collaborator) commented Apr 9, 2025

The Gemma 3 tokenizer has an inconsistency between its vocab_size value (262144) and the size of its vocab dictionary (262145).

It has a special token, <image_soft_token>, with token id 262144; this token appears in the vocab dictionary returned by tokenizer.get_vocab() but is not counted in vocab_size.
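The mismatch can be detected without hard-coding anything Gemma-specific: collect the tokens whose ids fall outside the range [0, vocab_size). A minimal sketch, using a mocked vocab with the numbers reported above rather than the real tokenizer (loading the actual model would require `transformers` and a download); the helper name `find_extra_tokens` is hypothetical:

```python
def find_extra_tokens(vocab_size, vocab):
    """Return the entries of `vocab` whose ids fall outside [0, vocab_size)."""
    return {tok: tid for tok, tid in vocab.items() if tid >= vocab_size}


# Simulated Gemma 3 situation: vocab_size is 262144, but get_vocab()
# returns 262145 entries because <image_soft_token> carries id 262144.
vocab = {f"tok{i}": i for i in range(262144)}
vocab["<image_soft_token>"] = 262144

extra = find_extra_tokens(262144, vocab)
print(extra)  # {'<image_soft_token>': 262144}
```

With the real tokenizer, `vocab` would come from `tokenizer.get_vocab()` and `vocab_size` from `tokenizer.vocab_size`; the same check would flag <image_soft_token> as the out-of-range entry.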

urroxyz (Contributor) commented Apr 10, 2025

I think #126 is more robust because it aligns the vocabularies after analyzing the mismatch rather than always truncating, so it isn't just a fix for Gemma 3 but also for other models with their own niche vocabulary problems.


