WIP: Refactor Course Lab Benchmark #27

tareknaser · 2025-12-01T18:53:22Z

Current State

install.sh clones the entire SWE-agent repository, with support for minisweagent and openhands
run.sh overfits to claudecode
Multiple data formats (xlsx, json, csv) with transformation scripts
Schema couples lab metadata with task details
No clear separation between courses, labs, and tasks

Proposed Changes

Separate course metadata in courses_metadata.json
Each lab can contain multiple tasks
- Tasks link back to their parent lab and course via lab_number, lab_name, and course_id
Each task can specify a git repository that will be cloned and uploaded to the execution environment
Schema validation: tests that run in CI verify data integrity
Agents: initial support for claudecode only, with tests against real-world tasks, including file modifications
Current data: Only 2 simple test tasks included for validation
- Basic file modification task
- Another simple task that tests repo cloning

Full migration of existing labs is out of scope for this PR. The current 2 tasks are just to validate and test the proposed infrastructure.
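
A minimal sketch of the kind of integrity check the schema-validation tests could perform, assuming each course record carries a `course_id` and each task carries `course_id`, `lab_number`, and `lab_name`. The function name `validate_tasks` and the sample records are hypothetical and are not taken from the PR.

```python
REQUIRED_TASK_KEYS = {"course_id", "lab_number", "lab_name"}


def validate_tasks(courses, tasks):
    """Return a list of readable problems; an empty list means the data is consistent."""
    known_courses = {course["course_id"] for course in courses}
    problems = []
    for index, task in enumerate(tasks):
        # A task must carry every key that links it to its lab and course.
        missing = REQUIRED_TASK_KEYS - set(task)
        if missing:
            problems.append(f"task {index}: missing {sorted(missing)}")
            continue
        # The course a task points at must exist in the course metadata.
        if task["course_id"] not in known_courses:
            problems.append(f"task {index}: unknown course_id {task['course_id']!r}")
    return problems


# Hypothetical sample data, for illustration only.
courses = [{"course_id": "demo_course"}]
tasks = [
    {"course_id": "demo_course", "lab_number": 1, "lab_name": "echo"},
    {"course_id": "other_course", "lab_number": 1, "lab_name": "echo"},
    {"course_id": "demo_course", "lab_number": 2},
]
problems = validate_tasks(courses, tasks)
```

A CI test could then simply assert that `validate_tasks` returns an empty list for the checked-in data.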

benchmarks/course_lab_bench_v2/src/executor.py

xuafeng · 2025-12-01T19:37:47Z

benchmarks/course_lab_bench_v2/README.md

+- `repo_url`: Git repository to clone (null if not needed)
+- `setup_commands`: Commands to run before testing
+- `timeout_minutes`: Maximum execution time (default: 30)
+- `lab_url`: Link to lab instructions


We can also add a "How to add new agents" section based on the previous add_agents.md.
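
For reference, the four README fields quoted above could be loaded roughly as follows. This is only a sketch: the class name `TaskConfig`, the field types (for example, `setup_commands` as a list of strings), and the example URL are assumptions; only the field names and the 30-minute default come from the README excerpt.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaskConfig:
    lab_url: str                                  # link to lab instructions
    repo_url: Optional[str] = None                # null when no repository is needed
    setup_commands: List[str] = field(default_factory=list)
    timeout_minutes: int = 30                     # default stated in the README

    @classmethod
    def from_dict(cls, raw):
        # Ignore keys this sketch does not know about.
        known = {"lab_url", "repo_url", "setup_commands", "timeout_minutes"}
        return cls(**{key: value for key, value in raw.items() if key in known})

    @property
    def timeout_seconds(self):
        return self.timeout_minutes * 60


# Hypothetical record, for illustration only.
config = TaskConfig.from_dict({"lab_url": "https://example.com/lab1", "repo_url": None})
```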

benchmarks/course_lab_bench_v2/data/benchmark/labs.jsonl

xuafeng · 2025-12-05T21:36:43Z

benchmarks/courselab_bench/courselab_bench/agent/react.py

+import time
+from typing import Any
+from loguru import logger
+


I just read the whole repo. My only concern is this part. I think we need an agent class that works for any agent: REACTAgent is one agent, and the Claude agent is another.
Then we need to figure out how to run an agent (such as Claude Code) in your env class.
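
One way to read this suggestion is a small abstract base class that every agent implements. The sketch below is an assumption about the shape of that interface, not code from the PR: `Agent`, `EchoAgent`, and `FakeEnv` are hypothetical names, and the environment is assumed to expose a single `execute(command)` method.

```python
from abc import ABC, abstractmethod


class Agent(ABC):
    """Common interface; `env` is any object exposing execute(command)."""

    def __init__(self, env):
        self.env = env

    @abstractmethod
    def run(self, task_prompt):
        """Work on the task and return a dict describing the run."""


class EchoAgent(Agent):
    """Toy agent used only to exercise the interface."""

    def run(self, task_prompt):
        output = self.env.execute(f"echo {task_prompt}")
        return {"agent": "echo", "output": output}


class FakeEnv:
    """Stand-in for the real environment; records commands instead of running them."""

    def __init__(self):
        self.commands = []

    def execute(self, command):
        self.commands.append(command)
        return f"ran: {command}"


env = FakeEnv()
result = EchoAgent(env).run("hello")
```

With this shape, a ReAct-style agent and a Claude Code agent would differ only in what `run` does with the environment.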


benchmarks/courselab_bench/pyproject.toml

xuafeng · 2025-12-11T23:35:00Z

benchmarks/courselab_bench/data/test_course/test__simple__echo/config.json

@@ -0,0 +1,7 @@
+{


It is very clear and clean. I like this a lot. I think it can be used for many system tasks (env setup, system implementation, performance, diagnosis).

benchmarks/courselab_bench/README.md

xuafeng · 2025-12-11T23:38:19Z

benchmarks/courselab_bench/run_benchmark.py

+        default="data/tasks.jsonl",
+        help="Path to tasks JSONL file (default: data/tasks.jsonl)",
+    )
+    parser.add_argument("--model", type=str, default="anthropic/claude-sonnet-4-5-20250929")


Maybe we can also add agent selection later.
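
A sketch of what agent selection could look like next to the existing `--model` argument. The `--agent` flag, its choices, and the `AGENTS` mapping are hypothetical; only the `--model` default is taken from the diff above.

```python
import argparse

# Hypothetical mapping from CLI names to agent class names.
AGENTS = {"react": "REACTAgent", "claudecode": "ClaudeCodeAgent"}


def build_parser():
    parser = argparse.ArgumentParser(description="Run the course lab benchmark")
    parser.add_argument("--model", type=str, default="anthropic/claude-sonnet-4-5-20250929")
    # Assumed flag: restricts the value to the registered agent names.
    parser.add_argument(
        "--agent",
        choices=sorted(AGENTS),
        default="react",
        help="Which agent to run (default: react)",
    )
    return parser


args = build_parser().parse_args(["--agent", "claudecode"])
selected = AGENTS[args.agent]
```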

xuafeng · 2025-12-11T23:42:16Z

benchmarks/courselab_bench/courselab_bench/agent/react.py

+    pass
+
+
+class REACTAgent:


For another agent, I am thinking:

class ClaudeCodeAgent:
    def __init__(self, env):
        # Install Claude Code in the env.
        self.env = env

    def run(self):
        result = self.env.execute("claude -p xxx .....")
        # May need to parse the result to get logs and execution results;
        # I am not sure it has all the trajectory you need.
        return result
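
The idea above can be exercised without Docker by swapping in a recording environment. In this sketch `RecordingEnv`, the `install_command` parameter, the placeholder install step, and the returned dict layout are all assumptions made for illustration; only the `claude -p` invocation comes from the comment, and the prompt is shell-quoted with `shlex.quote` so it survives as one argument.

```python
import shlex


class RecordingEnv:
    """Hypothetical stand-in for the env class; records commands instead of running them."""

    def __init__(self):
        self.commands = []

    def execute(self, command):
        self.commands.append(command)
        return {"exit_code": 0, "output": "done"}


class ClaudeCodeAgent:
    def __init__(self, env, install_command):
        self.env = env
        # The real install step is not shown in the PR; the caller supplies it.
        self.env.execute(install_command)

    def run(self, prompt):
        # Quote the prompt so spaces and quotes do not split the argument.
        command = f"claude -p {shlex.quote(prompt)}"
        result = self.env.execute(command)
        return {"command": command, "exit_code": result["exit_code"], "log": result["output"]}


env = RecordingEnv()
agent = ClaudeCodeAgent(env, install_command="echo 'install step goes here'")
outcome = agent.run("fix the failing test in lab 1")
```

Whether the command output contains the full trajectory is the open question raised in the comment; this sketch only returns whatever the environment reports.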

xuafeng · 2025-12-11T23:45:33Z

benchmarks/courselab_bench/courselab_bench/environment/docker.py

@@ -0,0 +1,150 @@
+import subprocess


If each folder only has one file, I am not sure we need the folder.

tareknaser · 2025-12-12

I tried to keep the design as modular as possible for this PoC by separating the components into folders. We should be able to add new agents, environments (currently only Docker is supported), models (right now it’s just litellm, but we can support other methods), and evaluation scripts.

xuafeng

Thanks a lot, @tareknaser. I did a pass. It is very clear and clean.
Does the folder structure also work for other benchmarks? Maybe we can unify them later (long term).


xuafeng reviewed Dec 1, 2025

tareknaser force-pushed the refactor_course_lab_bench branch from a9f1972 to 4a1dde6 on December 5, 2025 16:46

xuafeng reviewed Dec 5, 2025

wip: course lab benchmark rework

1e84d46

Signed-off-by: Tarek <tareknaser360@gmail.com>

tareknaser force-pushed the refactor_course_lab_bench branch from 4a1dde6 to 1e84d46 on December 11, 2025 01:46

xuafeng reviewed Dec 11, 2025 (three reviews)

tareknaser added 8 commits December 12, 2025 13:57

docs(course_lab_bench): update task instructions to include course metadata

ace20cc

Signed-off-by: Tarek <tareknaser360@gmail.com>

docs(pyproject.toml): update author information

4acb95f

Signed-off-by: Tarek <tareknaser360@gmail.com>

fix(docker): go PATH for login shells in Docker environment

f6c9f23

Add PATH exports to profile files so Go is available when using `bash -lc`

Signed-off-by: Tarek <tareknaser360@gmail.com>

feat(executor): retry mechanism for evaluation script to handle flaky tests

de6959a

Signed-off-by: Tarek <tareknaser360@gmail.com>

docs(courselab_bench): add a note on previous labs reference implementation

6678826

Signed-off-by: Tarek <tareknaser360@gmail.com>

feat(courselab_bench): modify system prompt to emphasize focus on current task

7d7f696

Signed-off-by: Tarek <tareknaser360@gmail.com>

feat(courselab_bench): add config option to add starter files

b7e9b5e

Signed-off-by: Tarek <tareknaser360@gmail.com>

feat(courselab_bench): add validation for starter and output files in config

fa1fb00

Signed-off-by: Tarek <tareknaser360@gmail.com>
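
The retry commit in the list above (de6959a) is not shown in this conversation, so the following is only a generic sketch of retrying an evaluation command to tolerate flaky tests. The function name `run_with_retries`, its parameters, and the returned dict are hypothetical.

```python
import subprocess
import sys
import time


def run_with_retries(command, attempts=3, delay_seconds=0.0):
    """Run `command` until it exits with status 0 or `attempts` runs have failed."""
    last = None
    for attempt in range(1, attempts + 1):
        last = subprocess.run(command, capture_output=True, text=True)
        if last.returncode == 0:
            return {"passed": True, "attempts": attempt, "output": last.stdout}
        if attempt < attempts:
            # Pause before the next try; skipped after the final failure.
            time.sleep(delay_seconds)
    return {"passed": False, "attempts": attempts, "output": last.stdout}


# Illustration with trivial commands run through the current Python interpreter.
ok = run_with_retries([sys.executable, "-c", "print('pass')"])
bad = run_with_retries([sys.executable, "-c", "raise SystemExit(1)"], attempts=2)
```

Recording the number of attempts alongside the verdict makes it possible to tell a clean pass from one that only succeeded after retries.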
