This project investigates temporal biases in large language model training data by analyzing tokenizer patterns.
This research extends Hayase et al.'s work on Data Mixture Inference through tokenizer analysis to examine how language models represent content from different decades. It focuses on three key questions:
- How is training data distributed across different time periods?
- What is the correlation between data volume and temporal recency?
- How do temporal distributions affect model performance across different decades?
The approach adapts linear programming techniques to identify temporal signatures in tokenizer merge rules (a sketch follows this list), allowing us to:
- Identify decade-specific language patterns
- Quantify the temporal distribution of training data
- Measure the impact of these patterns on model performance
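For intuition, here is a minimal sketch of what such a linear program might look like, built on scipy.optimize.linprog. It is not this repository's implementation: the decade labels, the synthetic freq_margin matrix (how much more frequent each applied merge is than its strongest competitor in each decade's text), and true_mix are all illustrative assumptions.

```python
"""Sketch of a mixture-inference linear program, loosely following the idea
in Hayase et al.'s Data Mixture Inference. All data below is synthetic."""
import numpy as np
from scipy.optimize import linprog

# freq_margin[t, d]: at merge step t, how much more frequent the pair the
# tokenizer actually merged is in decade-d text than its strongest competitor.
rng = np.random.default_rng(0)
decades = ["1980s", "1990s", "2000s", "2010s"]
T, D = 200, len(decades)
true_mix = np.array([0.05, 0.15, 0.30, 0.50])          # hypothetical ground truth
freq_margin = rng.normal(loc=true_mix, scale=0.2, size=(T, D))

# Variables: D mixture weights alpha, then T slack variables s.
# Minimise total slack subject to  freq_margin @ alpha + s >= 0,
# alpha >= 0, s >= 0, sum(alpha) = 1.
c = np.concatenate([np.zeros(D), np.ones(T)])
A_ub = np.hstack([-freq_margin, -np.eye(T)])            # -(F @ alpha) - s <= 0
b_ub = np.zeros(T)
A_eq = np.concatenate([np.ones(D), np.zeros(T)])[None, :]
b_eq = np.array([1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (D + T), method="highs")
alpha = res.x[:D]
for name, w in zip(decades, alpha):
    print(f"{name}: {w:.2f}")
```

Because the recovered weights are constrained to sum to one, they can be read directly as estimated proportions of each decade in the training mix.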
- notebooks/: Jupyter notebooks for analysis and visualization
- src/data/: Data loading and processing utilities
- src/validation/: Evaluation metrics and statistical validation
- Python 3.8+
- Required packages: pandas, numpy, matplotlib, seaborn, transformers (see requirements.txt for the full list)
git clone https://github.com/RoshaniPawar16/temporal_tokenizer_analysis.git
cd temporal_tokenizer_analysis
pip install -r requirements.txt
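After installation, a quick sanity check is to pull a tokenizer's ordered BPE merge list, which is the raw material the analysis operates on. The snippet below is a minimal sketch using the transformers library; the model name "gpt2" is only an example and is not prescribed by this project.

```python
"""Load a pretrained fast tokenizer and extract its ordered BPE merge rules."""
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The fast tokenizer's full configuration, including the ordered merge list,
# is available as a JSON string via the backend tokenizer.
config = json.loads(tokenizer.backend_tokenizer.to_str())
merges = config["model"]["merges"]

# Depending on the tokenizers version, each merge is either "a b" or ["a", "b"].
pairs = [tuple(m.split(" ", 1)) if isinstance(m, str) else tuple(m) for m in merges]
print(f"{len(pairs)} merge rules; first five: {pairs[:5]}")
```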