
@jlamprou jlamprou commented Nov 30, 2024

Inconsistent token handling between uint16 and uint32 datasets

Description

There's a data processing inconsistency between process_data_dolly.py and lm_datasets.py when using uint32 dtype. In process_data_dolly.py, when -1 is used as a separator token for Qwen models (which use uint32), it overflows to 4294967295. However, lm_datasets.py is hardcoded to look for 65535 (the uint16 overflow value) as the separator token.

Problem

  • For non-Qwen models, binary data is stored as uint16, so -1 correctly overflows to 65535
  • For Qwen models, binary data is stored as uint32, so -1 overflows to 4294967295
  • lm_datasets.py only checks for 65535 in its input processing logic:

    if 65535 in input_ids:
        source_len = np.where(input_ids == 65535)[0][0]

This means the separator token isn't being detected for Qwen model data, leading to incorrect prompt/response segmentation.
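The overflow behavior described above can be verified directly: casting -1 to an unsigned NumPy dtype wraps it to that dtype's maximum value. A minimal demonstration (assuming NumPy is available):

```python
import numpy as np

# -1 wraps to the dtype's maximum value when cast to an unsigned type
print(np.array([-1], dtype=np.int64).astype(np.uint16)[0])  # 65535
print(np.array([-1], dtype=np.int64).astype(np.uint32)[0])  # 4294967295
```

These are exactly the two sentinel values at issue: np.iinfo(np.uint16).max and np.iinfo(np.uint32).max.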

Steps to Fix

  1. Update lm_datasets.py to handle both overflow cases:

     if 65535 in input_ids or 4294967295 in input_ids:
         source_len = np.where((input_ids == 65535) | (input_ids == 4294967295))[0][0]

  2. Or better yet, make the separator token value configurable based on the model type/dtype being used.
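The configurable approach from step 2 could be sketched with a hypothetical helper (find_separator is not part of the existing code) that derives the sentinel from the array's own dtype instead of hardcoding either value:

```python
import numpy as np

def find_separator(input_ids: np.ndarray):
    """Hypothetical helper: return the index of the -1 separator token.

    The separator is stored as -1 cast to an unsigned dtype, so it always
    equals the dtype's maximum value (65535 for uint16, 4294967295 for uint32).
    Returns None when no separator is present.
    """
    sep = np.iinfo(input_ids.dtype).max
    hits = np.where(input_ids == sep)[0]
    return int(hits[0]) if hits.size else None
```

With something like this, lm_datasets.py would call find_separator(input_ids) instead of checking 65535 directly, and the same code path would handle both uint16 and uint32 data.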

Related Files

  • lm_datasets.py
  • process_data_dolly.py

When processing Qwen model data with uint32 dtype, the -1 separator token overflows to 4294967295 instead of 65535. Add support for detecting both values to ensure correct prompt/response segmentation.
