
@jlamprou jlamprou commented Nov 30, 2024

Inconsistent token handling between uint16 and uint32 datasets

Description

There's a data processing inconsistency between process_data_dolly.py and lm_datasets.py when using uint32 dtype. In process_data_dolly.py, when -1 is used as a separator token for Qwen models (which use uint32), it overflows to 4294967295. However, lm_datasets.py is hardcoded to look for 65535 (the uint16 overflow value) as the separator token.

Problem

  • For non-Qwen models, binary data is stored as uint16, so -1 correctly overflows to 65535
  • For Qwen models, binary data is stored as uint32, so -1 overflows to 4294967295
  • lm_datasets.py only checks for 65535 in its input processing logic:

    if 65535 in input_ids:
        source_len = np.where(input_ids == 65535)[0][0]

This means the separator token isn't being detected for Qwen model data, leading to incorrect prompt/response segmentation.
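The overflow behavior described above can be verified directly: casting -1 to an unsigned NumPy dtype wraps it to that dtype's maximum value. A minimal demonstration (assuming NumPy is available):

```python
import numpy as np

# -1 wraps to the dtype's maximum value when cast to an unsigned type
print(np.array([-1], dtype=np.int64).astype(np.uint16)[0])  # 65535
print(np.array([-1], dtype=np.int64).astype(np.uint32)[0])  # 4294967295
```

These are exactly the two sentinel values at issue: np.iinfo(np.uint16).max and np.iinfo(np.uint32).max.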

Steps to Fix

  1. Update lm_datasets.py to handle both overflow cases:

     if 65535 in input_ids or 4294967295 in input_ids:
         source_len = np.where((input_ids == 65535) | (input_ids == 4294967295))[0][0]

  2. Or better yet, make the separator token value configurable based on the model type/dtype being used.
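The configurable approach from step 2 could be sketched with a hypothetical helper (find_separator is not part of the existing code) that derives the sentinel from the array's own dtype instead of hardcoding either value:

```python
import numpy as np

def find_separator(input_ids: np.ndarray):
    """Hypothetical helper: return the index of the -1 separator token.

    The separator is stored as -1 cast to an unsigned dtype, so it always
    equals the dtype's maximum value (65535 for uint16, 4294967295 for uint32).
    Returns None when no separator is present.
    """
    sep = np.iinfo(input_ids.dtype).max
    hits = np.where(input_ids == sep)[0]
    return int(hits[0]) if hits.size else None
```

With something like this, lm_datasets.py would call find_separator(input_ids) instead of checking 65535 directly, and the same code path would handle both uint16 and uint32 data.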

Related Files

  • lm_datasets.py
  • process_data_dolly.py

When processing Qwen model data with uint32 dtype, the -1 separator token overflows to 4294967295 instead of 65535. Add support for detecting both values to ensure correct prompt/response segmentation.
