Fix Perplexity Score For Tokenizers without bos_token_id #682
Merged
lhoestq merged 1 commit into huggingface:main on Jun 20, 2025
Conversation
Fwiw, I'm running into the same problem and haven't yet worked out how to get a perplexity score when the input is a single token.
Member
good catch! thanks for the fix :)
Some tokenizers, like Qwen2.5's, don't include a `bos_token_id`, which means using such a model with the perplexity metric fails with an error caused by the tokenizer not containing a `bos_token_id`. There is also a similar PR for the transformers library itself, handling this situation back in February: fix: condition bos_token_id and space as token #36211

This PR adds a `not None` check to the block that prepends the `bos_token_id`.
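
A minimal sketch of the kind of guard this PR describes, assuming the perplexity computation optionally prepends a BOS token before scoring. The function name `compute_perplexity`, the model id, and the variable names are illustrative only, not the metric's actual API; the real fix lives in the library's perplexity metric code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


def compute_perplexity(texts, model_id="Qwen/Qwen2.5-0.5B", add_start_token=True, device="cpu"):
    # Illustrative sketch: load a causal LM and score each text individually.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
    model.eval()

    perplexities = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt").to(device)
        input_ids = enc["input_ids"]

        # The guard described in this PR: only prepend a BOS token if the tokenizer
        # actually defines one. Qwen2.5's tokenizer has bos_token_id = None, so the
        # unconditional version would try to build a tensor from None and fail.
        if add_start_token and tokenizer.bos_token_id is not None:
            bos = torch.tensor([[tokenizer.bos_token_id]], device=device)
            input_ids = torch.cat([bos, input_ids], dim=1)

        with torch.no_grad():
            # Causal LMs shift labels internally, so labels=input_ids yields the
            # mean negative log-likelihood over the sequence.
            loss = model(input_ids, labels=input_ids).loss
        perplexities.append(torch.exp(loss).item())
    return perplexities
```

Note that when the input is a single token and no BOS token can be prepended, there is nothing left to predict after the label shift, which is the situation the first comment above describes.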