Added Q&A language pairs dataset and training #2
FiveTechSoft wants to merge 16 commits into olivkoch:main
Updated evaluation commands and added Q&A pairs example.
Pull Request Overview
This PR adds support for training on Q&A language pairs by introducing a new dataset builder and training script. The implementation enables the TinyRecursiveModels framework to work with natural language question-answer pairs.
Key changes:
- New Q&A dataset builder with templates for factual, mathematical, color, animal, time, and weather questions
- Training script configured for Q&A pairs with appropriate hyperparameters
- Documentation updates showing how to prepare data, train, and evaluate Q&A models
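For readers unfamiliar with template-based generation, the sketch below illustrates the general idea behind a builder of this kind. It is not the code from dataset/build_qa_dataset.py: the template strings, the `generate_pairs` helper, and the output keys `question` and `answer` are assumptions made for illustration. Only the country/capital entries are taken from the diff in this PR.

```python
import random

# Facts taken from the diff shown in this PR.
CAPITALS = [
    {"country": "Japan", "capital": "Tokyo"},
    {"country": "China", "capital": "Beijing"},
    {"country": "India", "capital": "New Delhi"},
]

# Hypothetical (question, answer) templates; the PR's real templates may differ.
TEMPLATES = [
    ("What is the capital of {country}?", "{capital}"),
    ("Which city is the capital of {country}?", "{capital}"),
]


def generate_pairs(n, seed=0):
    """Return n question/answer dicts by filling random templates with random facts."""
    rng = random.Random(seed)  # seeded so the generated dataset is reproducible
    pairs = []
    for _ in range(n):
        fact = rng.choice(CAPITALS)
        question_tpl, answer_tpl = rng.choice(TEMPLATES)
        pairs.append(
            {
                "question": question_tpl.format(**fact),
                "answer": answer_tpl.format(**fact),
            }
        )
    return pairs


if __name__ == "__main__":
    for pair in generate_pairs(3):
        print(pair["question"], "->", pair["answer"])
```

Seeding the generator keeps train and test splits reproducible between runs, which matters when the same script is used to rebuild the dataset before evaluation.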
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| train_qa_pairs.sh | Shell script with hyperparameters for training the model on Q&A dataset |
| dataset/build_qa_dataset.py | Python script that generates Q&A pairs from templates and converts them to the expected dataset format |
| README.md | Documentation updates adding Q&A pairs example and correcting evaluation command |
Comments suppressed due to low confidence (1)
dataset/build_qa_dataset.py:184
- Variable train_data is not used.
train_data = []
dataset/build_qa_dataset.py
Outdated
{"country": "Japan", "capital": "Tokyo"},
{"country": "China", "capital": "Beijing"},
{"country": "India", "capital": "New Delhi"},
{"country": "Brazil", "capital": "Brasília"},
The city name "Brasília" contains a special character (í) that may cause encoding issues depending on the tokenization approach used. Since the create_qa_sequence() function uses a simple regex-based word tokenizer that only matches \b\w+\b, the accented character might not be handled correctly. Consider using "Brasilia" without the accent or ensuring proper Unicode handling in the tokenizer.
Suggested change:
- {"country": "Brazil", "capital": "Brasília"},
+ {"country": "Brazil", "capital": "Brasilia"},
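Before dropping the accent, it may be worth checking how the pattern actually behaves. The sketch below assumes the tokenizer applies `\b\w+\b` through Python's `re` module to `str` input, as the comment describes; the `tokenize` helper and its `normalize` parameter are hypothetical and are not part of create_qa_sequence(). In Python 3, `\w` is Unicode-aware for `str` patterns, so the precomposed "í" stays inside the word. The split only appears when the text is in decomposed (NFD) form or when `re.ASCII` is set, and NFC normalization avoids the first case.

```python
import re
import unicodedata

# Assumed to mirror the pattern quoted in the review comment.
WORD_RE = re.compile(r"\b\w+\b")


def tokenize(text: str, normalize: bool = True) -> list[str]:
    """Split text into word tokens, optionally normalizing to NFC first."""
    if normalize:
        # NFC merges a base letter and its combining accent into one code point.
        text = unicodedata.normalize("NFC", text)
    return WORD_RE.findall(text)


# Escapes are used so the example does not depend on the file's own encoding.
NFC = "Bras\u00edlia"                      # precomposed í (U+00ED)
NFD = unicodedata.normalize("NFD", NFC)    # "i" followed by combining acute (U+0301)

if __name__ == "__main__":
    print(tokenize(NFC, normalize=False))               # one token
    print(tokenize(NFD, normalize=False))               # split at the combining mark
    print(tokenize(NFD))                                # one token after NFC normalization
    print(re.findall(r"\b\w+\b", NFC, flags=re.ASCII))  # split when \w is ASCII-only
```

If the vocabulary is built from these tokens, normalizing input once at load time keeps "Brasília" as a single entry without changing the dataset text.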