This project investigates temporal biases in large language model training data by analyzing tokenizer patterns.
This research extends Hayase et al.'s work on Data Mixture Inference through tokenizer analysis to examine how language models represent content from different decades. It focuses on three key questions:
- How is training data distributed across different time periods?
- What is the correlation between data volume and temporal recency?
- How do temporal distributions affect model performance across different decades?
The approach adapts linear programming techniques to identify temporal signatures in tokenizer merge rules (a sketch follows this list), allowing us to:
- Identify decade-specific language patterns
- Quantify the temporal distribution of training data
- Measure the impact of these patterns on model performance
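For intuition, here is a minimal sketch of what such a linear program might look like, built on scipy.optimize.linprog. It is not this repository's implementation: the decade labels, the synthetic freq_margin matrix (how much more frequent each applied merge is than its strongest competitor in each decade's text), and true_mix are all illustrative assumptions.

```python
"""Sketch of a mixture-inference linear program, loosely following the idea
in Hayase et al.'s Data Mixture Inference. All data below is synthetic."""
import numpy as np
from scipy.optimize import linprog

# freq_margin[t, d]: at merge step t, how much more frequent the pair the
# tokenizer actually merged is in decade-d text than its strongest competitor.
rng = np.random.default_rng(0)
decades = ["1980s", "1990s", "2000s", "2010s"]
T, D = 200, len(decades)
true_mix = np.array([0.05, 0.15, 0.30, 0.50])          # hypothetical ground truth
freq_margin = rng.normal(loc=true_mix, scale=0.2, size=(T, D))

# Variables: D mixture weights alpha, then T slack variables s.
# Minimise total slack subject to  freq_margin @ alpha + s >= 0,
# alpha >= 0, s >= 0, sum(alpha) = 1.
c = np.concatenate([np.zeros(D), np.ones(T)])
A_ub = np.hstack([-freq_margin, -np.eye(T)])            # -(F @ alpha) - s <= 0
b_ub = np.zeros(T)
A_eq = np.concatenate([np.ones(D), np.zeros(T)])[None, :]
b_eq = np.array([1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (D + T), method="highs")
alpha = res.x[:D]
for name, w in zip(decades, alpha):
    print(f"{name}: {w:.2f}")
```

Because the recovered weights are constrained to sum to one, they can be read directly as estimated proportions of each decade in the training mix.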
- notebooks/: Jupyter notebooks for analysis and visualization
- src/data/: Data loading and processing utilities
- src/validation/: Evaluation metrics and statistical validation
- Python 3.8+
- Required packages: pandas, numpy, matplotlib, seaborn, transformers (see requirements.txt for the full list)
git clone https://github.com/RoshaniPawar16/temporal_tokenizer_analysis.git
cd temporal_tokenizer_analysis
pip install -r requirements.txt
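After installation, a quick sanity check is to pull a tokenizer's ordered BPE merge list, which is the raw material the analysis operates on. The snippet below is a minimal sketch using the transformers library; the model name "gpt2" is only an example and is not prescribed by this project.

```python
"""Load a pretrained fast tokenizer and extract its ordered BPE merge rules."""
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The fast tokenizer's full configuration, including the ordered merge list,
# is available as a JSON string via the backend tokenizer.
config = json.loads(tokenizer.backend_tokenizer.to_str())
merges = config["model"]["merges"]

# Depending on the tokenizers version, each merge is either "a b" or ["a", "b"].
pairs = [tuple(m.split(" ", 1)) if isinstance(m, str) else tuple(m) for m in merges]
print(f"{len(pairs)} merge rules; first five: {pairs[:5]}")
```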