Added Q&A language pairs dataset and training by FiveTechSoft · Pull Request #2 · olivkoch/TinyRecursiveModels

FiveTechSoft · 2025-11-12T21:29:38Z

No description provided.

Updated evaluation commands and added Q&A pairs example.

Copilot

Pull Request Overview

This PR adds support for training on Q&A language pairs by introducing a new dataset builder and training script. The implementation enables the TinyRecursiveModels framework to work with natural language question-answer pairs.

Key changes:

New Q&A dataset builder with templates for factual, mathematical, color, animal, time, and weather questions
Training script configured for Q&A pairs with appropriate hyperparameters
Documentation updates showing how to prepare data, train, and evaluate Q&A models
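
To illustrate the template-based approach summarized above, here is a minimal sketch of how one question category could be generated. It is not the code from dataset/build_qa_dataset.py: the template string, the `build_factual_pairs` helper, and the `question`/`answer` keys are assumptions made for illustration. Only the country/capital records follow the shape shown in the diff for this PR.

```python
# Hypothetical sketch: names here are illustrative, not taken from the PR.
FACTUAL_TEMPLATE = "What is the capital of {country}?"

# Records in the same shape as the entries visible in the PR diff.
CAPITALS = [
    {"country": "Japan", "capital": "Tokyo"},
    {"country": "China", "capital": "Beijing"},
    {"country": "India", "capital": "New Delhi"},
]


def build_factual_pairs(template, facts):
    """Fill the question template once per fact and pair it with its answer."""
    return [
        # str.format ignores the unused "capital" keyword argument.
        {"question": template.format(**fact), "answer": fact["capital"]}
        for fact in facts
    ]


pairs = build_factual_pairs(FACTUAL_TEMPLATE, CAPITALS)
```

Each category (mathematical, color, animal, time, weather) would presumably add its own template and fact list in the same way.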

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.

File	Description
train_qa_pairs.sh	Shell script with hyperparameters for training the model on Q&A dataset
dataset/build_qa_dataset.py	Python script that generates Q&A pairs from templates and converts them to the expected dataset format
README.md	Documentation updates adding Q&A pairs example and correcting evaluation command

Comments suppressed due to low confidence (1)

dataset/build_qa_dataset.py:184

Variable train_data is not used.

    train_data = []

Copilot · 2025-11-12T21:33:44Z

dataset/build_qa_dataset.py

+            {"country": "Japan", "capital": "Tokyo"},
+            {"country": "China", "capital": "Beijing"},
+            {"country": "India", "capital": "New Delhi"},
+            {"country": "Brazil", "capital": "Brasília"},


The city name "Brasília" contains a special character (í) that may cause encoding issues depending on the tokenization approach used. Since the create_qa_sequence() function uses a simple regex-based word tokenizer that only matches \b\w+\b, the accented character might not be handled correctly. Consider using "Brasilia" without the accent or ensuring proper Unicode handling in the tokenizer.

Suggested change

-            {"country": "Brazil", "capital": "Brasília"},
+            {"country": "Brazil", "capital": "Brasilia"},
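
One way to check the concern in this comment is to run the quoted pattern on the string directly. The sketch below assumes the tokenizer applies `\b\w+\b` to Python 3 `str` input. The `tokenize` helper is a hypothetical stand-in, not the PR's create_qa_sequence(). In Python 3, `\w` is Unicode-aware for `str` patterns, so a precomposed "í" stays inside the word. The split only appears if the text is in decomposed form (a plain "i" followed by a combining accent) or if the pattern is compiled with `re.ASCII`.

```python
import re
import unicodedata

WORD_PATTERN = re.compile(r"\b\w+\b")  # pattern quoted in the review comment


def tokenize(text: str) -> list[str]:
    """Hypothetical stand-in for the word-splitting step (not the PR's code)."""
    # NFC composes "i" + U+0301 into the single code point U+00ED before matching.
    return WORD_PATTERN.findall(unicodedata.normalize("NFC", text))


composed = "Bras\u00edlia"     # precomposed í (U+00ED)
decomposed = "Brasi\u0301lia"  # "i" followed by U+0301 COMBINING ACUTE ACCENT

raw_composed = WORD_PATTERN.findall(composed)      # one token
raw_decomposed = WORD_PATTERN.findall(decomposed)  # split at the combining mark
ascii_only = re.findall(r"\b\w+\b", composed, flags=re.ASCII)  # split at í
normalized = tokenize(decomposed)                  # one token again
```

Under these assumptions, normalizing input to NFC keeps "Brasília" intact without dropping the accent from the dataset.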

FiveTechSoft added 3 commits November 12, 2025 22:20

added Q&A dataset and training

aa0de2f

Revise evaluation commands and add Q&A section

3daf295

Updated evaluation commands and added Q&A pairs example.

Merge branch 'olivkoch:main' into main

a55bac9

Copilot AI review requested due to automatic review settings November 12, 2025 21:29

Copilot started reviewing on behalf of FiveTechSoft November 12, 2025 21:30

Copilot finished reviewing on behalf of FiveTechSoft November 12, 2025 21:32

Copilot AI reviewed Nov 12, 2025

FiveTechSoft added 13 commits November 13, 2025 21:48

Add files via upload

f6d16fd

Add files via upload

2a002f8

Add files via upload

81d6c56

Add files via upload

75d5dbc

Add files via upload

5824bf9

more complex Q&A dataset

f0e3ff4

Added math & gsm8k dataset

2ed2e92

math and gsmk8 trainer

166b74f

math & gsmk8 config

5cc8aba

math and gsmk8 evaluation

e67d916

updated

941d4fd

updated

9fea295

added file

6f2435a

FiveTechSoft wants to merge 16 commits into olivkoch:main from FiveTechSoft:main

1 participant
