Added Q&A language pairs dataset and training #2

Open
FiveTechSoft wants to merge 16 commits into olivkoch:main from FiveTechSoft:main

@FiveTechSoft

No description provided.

Copilot AI review requested due to automatic review settings November 12, 2025 21:29

Copilot AI left a comment

Pull Request Overview

This PR adds support for training on Q&A language pairs by introducing a new dataset builder and training script. The implementation enables the TinyRecursiveModels framework to work with natural language question-answer pairs.

Key changes:

  • New Q&A dataset builder with templates for factual, mathematical, color, animal, time, and weather questions
  • Training script configured for Q&A pairs with appropriate hyperparameters
  • Documentation updates showing how to prepare data, train, and evaluate Q&A models
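As a rough illustration of the template approach summarized above, the sketch below expands one question template over a few country/capital entries. The entries are the ones visible in this PR's diff; the template wording, the `build_pairs` helper, and the output keys are assumptions made for illustration and are not taken from `dataset/build_qa_dataset.py`.

```python
# Hypothetical sketch: names and template text are illustrative only,
# not the actual contents of dataset/build_qa_dataset.py.
CAPITALS = [
    {"country": "Japan", "capital": "Tokyo"},
    {"country": "China", "capital": "Beijing"},
    {"country": "India", "capital": "New Delhi"},
    {"country": "Brazil", "capital": "Brasília"},
]

# Assumed template wording for a "factual" question category.
QUESTION_TEMPLATE = "What is the capital of {country}?"
ANSWER_TEMPLATE = "The capital of {country} is {capital}."


def build_pairs(entries):
    """Fill both templates with each entry and return question/answer dicts."""
    return [
        {
            "question": QUESTION_TEMPLATE.format(**entry),
            "answer": ANSWER_TEMPLATE.format(**entry),
        }
        for entry in entries
    ]


pairs = build_pairs(CAPITALS)
print(pairs[0])
```

Other categories (mathematical, color, animal, time, weather) would presumably follow the same pattern with their own templates and entry lists.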

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| train_qa_pairs.sh | Shell script with hyperparameters for training the model on Q&A dataset |
| dataset/build_qa_dataset.py | Python script that generates Q&A pairs from templates and converts them to the expected dataset format |
| README.md | Documentation updates adding Q&A pairs example and correcting evaluation command |

Comments suppressed due to low confidence (1)

dataset/build_qa_dataset.py:184

  • Variable train_data is not used.
    train_data = []

{"country": "Japan", "capital": "Tokyo"},
{"country": "China", "capital": "Beijing"},
{"country": "India", "capital": "New Delhi"},
{"country": "Brazil", "capital": "Brasília"},

Copilot AI Nov 12, 2025

The city name "Brasília" contains a special character (í) that may cause encoding issues depending on the tokenization approach used. Since the create_qa_sequence() function uses a simple regex-based word tokenizer that only matches \b\w+\b, the accented character might not be handled correctly. Consider using "Brasilia" without the accent or ensuring proper Unicode handling in the tokenizer.

Suggested change
- {"country": "Brazil", "capital": "Brasília"},
+ {"country": "Brazil", "capital": "Brasilia"},
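Whether the accent actually breaks tokenization can be checked directly. The sketch below assumes the tokenizer applies the `\b\w+\b` pattern quoted above through Python 3's `re` module on `str` input; the `tokenize` helper is hypothetical and is not the PR's `create_qa_sequence()`. Under that assumption, `\w` already matches the precomposed "í", and a split only appears if the text arrives in decomposed (NFD) form or the pattern is compiled with `re.ASCII`.

```python
import re
import unicodedata

# Assumed pattern, copied from the review comment above.
WORD_RE = re.compile(r"\b\w+\b")


def tokenize(text: str) -> list[str]:
    """Hypothetical helper: normalize to NFC, then extract word tokens."""
    return WORD_RE.findall(unicodedata.normalize("NFC", text))


composed = "Bras\u00edlia"     # "í" as the single code point U+00ED
decomposed = "Brasi\u0301lia"  # "i" followed by combining acute accent U+0301

print(WORD_RE.findall(composed))    # one token: the whole word
print(WORD_RE.findall(decomposed))  # two tokens: the combining mark is not a word character
print(tokenize(decomposed))         # one token again after NFC normalization
print(re.findall(r"\b\w+\b", composed, flags=re.ASCII))  # two tokens under ASCII-only matching
```

If this holds for the PR's tokenizer, normalizing input to NFC would be an alternative to dropping the accent from the data.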
