@tareknaser
Collaborator

Current State

  • SWE-agent repository cloned entirely by install.sh, support for minisweagent and openhands
  • run.sh overfits to claudecode
  • Multiple data formats (xlsx, json, csv) with transformation scripts
  • Schema couples lab metadata with task details
  • No clear separation between courses, labs, and tasks

Proposed Changes

  • Separate course metadata in courses_metadata.json
  • Each lab can contain multiple tasks
    • Tasks link back to their parent lab and course via lab_number, lab_name, and course_id
  • Each task can specify a git repository that will be cloned and uploaded to the execution environment
  • Schema validation: tests that run in CI verify data integrity
  • Agents: initial support for claudecode only, with tests against real-world tasks, including file modifications
  • Current data: Only 2 simple test tasks included for validation
    • Basic file modification task
    • Another simple task that tests repo cloning

Full migration of existing labs is out of scope for this PR. The current 2 tasks are just to validate and test the proposed infrastructure.
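
The linkage described above (tasks pointing back to their lab and course) can be checked with a small validation pass. The following is a minimal sketch, not the PR's actual test code: the field names `lab_number`, `lab_name`, and `course_id` come from the description, while the layout of `courses_metadata.json` (a `courses` list of objects keyed by `course_id`) and the function name `validate_tasks` are assumptions made for illustration.

```python
import json

# Link fields named in the PR description.
REQUIRED_TASK_FIELDS = ("course_id", "lab_number", "lab_name")


def validate_tasks(task_lines, courses_metadata):
    """Return a list of error strings; an empty list means the data is consistent."""
    # Assumed layout: {"courses": [{"course_id": "..."}, ...]}
    known_courses = {course["course_id"] for course in courses_metadata["courses"]}
    errors = []
    for lineno, line in enumerate(task_lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL input
        task = json.loads(line)
        for field in REQUIRED_TASK_FIELDS:
            if field not in task:
                errors.append(f"line {lineno}: missing field '{field}'")
        course_id = task.get("course_id")
        if course_id is not None and course_id not in known_courses:
            errors.append(f"line {lineno}: unknown course_id '{course_id}'")
    return errors


if __name__ == "__main__":
    courses = {"courses": [{"course_id": "example-course"}]}
    tasks = [
        '{"course_id": "example-course", "lab_number": 1, "lab_name": "demo"}',
        '{"course_id": "missing-course", "lab_number": 2, "lab_name": "demo"}',
    ]
    for error in validate_tasks(tasks, courses):
        print(error)
```

A check of this shape could run in CI as an ordinary unit test that asserts the returned list is empty.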

@tareknaser force-pushed the refactor_course_lab_bench branch from a9f1972 to 4a1dde6 on December 5, 2025 16:46
- `repo_url`: Git repository to clone (null if not needed)
- `setup_commands`: Commands to run before testing
- `timeout_minutes`: Maximum execution time (default: 30)
- `lab_url`: Link to lab instructions
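
As an illustration of how these fields and their defaults might be loaded from one line of a JSONL file, here is a minimal sketch. The class name `TaskConfig`, the empty-list default for `setup_commands`, and the positive-timeout check are assumptions; only the four field names, the null `repo_url`, and the 30-minute default come from the list above.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaskConfig:
    # Hypothetical container; field names follow the documented list.
    repo_url: Optional[str] = None  # null means no repository is cloned
    setup_commands: List[str] = field(default_factory=list)  # assumed default
    timeout_minutes: int = 30  # documented default
    lab_url: Optional[str] = None

    @classmethod
    def from_json_line(cls, line):
        raw = json.loads(line)
        # Keep only the fields this sketch knows about; other keys are ignored.
        known = {name: raw[name] for name in cls.__dataclass_fields__ if name in raw}
        config = cls(**known)
        if config.timeout_minutes <= 0:
            # Assumed sanity check, not taken from the PR.
            raise ValueError("timeout_minutes must be positive")
        return config
```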
Collaborator

We can also add a "how to add new agents" section based on the previous add_agents.md.

import time
from typing import Any
from loguru import logger

Collaborator

@xuafeng Dec 5, 2025

I just read the whole repo. My only concern is this part. I think we need an agent class that works for any agent: REACTAgent is one agent, and the Claude agent is another.
Then we need to figure out how to run an agent (such as Claude Code) in your env class.
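
One way to read this suggestion is a shared base class that the harness depends on, with each agent as a subclass. The sketch below is an assumption about what that could look like, not the PR's design: `Agent`, `ReactStyleAgent`, `CliAgent`, `RecordingEnv`, and `run_task` are hypothetical names, and the environment is assumed to expose an `execute(command)` method.

```python
from abc import ABC, abstractmethod


class Agent(ABC):
    """Common interface; every name here is hypothetical."""

    def __init__(self, env):
        self.env = env  # assumed to expose execute(command) -> str

    @abstractmethod
    def run(self, task_description):
        """Work on the task inside self.env and return a result summary."""


class ReactStyleAgent(Agent):
    def run(self, task_description):
        # Placeholder for a model-driven loop (think, act, observe).
        observation = self.env.execute("ls")
        return {"agent": "react", "task": task_description, "output": observation}


class CliAgent(Agent):
    def __init__(self, env, command_template):
        super().__init__(env)
        self.command_template = command_template

    def run(self, task_description):
        # A CLI agent is driven by a single command executed inside the environment.
        command = self.command_template.format(task=task_description)
        return {"agent": "cli", "task": task_description, "output": self.env.execute(command)}


class RecordingEnv:
    """Stand-in environment that records commands instead of running them."""

    def __init__(self):
        self.commands = []

    def execute(self, command):
        self.commands.append(command)
        return f"ran: {command}"


def run_task(agent, task_description):
    # The harness only relies on the Agent interface, not on a concrete agent.
    return agent.run(task_description)
```

With this split, adding an agent means adding a subclass, while the environment class stays unaware of which agent is running in it.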

Signed-off-by: Tarek <tareknaser360@gmail.com>
@tareknaser force-pushed the refactor_course_lab_bench branch from 4a1dde6 to 1e84d46 on December 11, 2025 01:46
@@ -0,0 +1,7 @@
{
Collaborator

It is very clear and clean. I really like this. I think it can be used by many system tasks (env setup, system implementation, performance, diagnosis).

default="data/tasks.jsonl",
help="Path to tasks JSONL file (default: data/tasks.jsonl)",
)
parser.add_argument("--model", type=str, default="anthropic/claude-sonnet-4-5-20250929")
Collaborator

Maybe we can also add agent selection later.
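
Agent selection could be exposed as one more command-line flag next to `--model`. This is a minimal sketch under stated assumptions: the `--agent` flag, its choices, and the `build_parser` helper are hypothetical, and only the `--model` argument and its default are taken from the diff.

```python
import argparse

# Hypothetical list of agent names; the PR itself only supports claudecode.
AVAILABLE_AGENTS = ("claudecode", "react")


def build_parser():
    parser = argparse.ArgumentParser(description="Run benchmark tasks")
    # Taken from the diff under review.
    parser.add_argument("--model", type=str, default="anthropic/claude-sonnet-4-5-20250929")
    # Assumed addition: argparse rejects any value outside `choices`.
    parser.add_argument(
        "--agent",
        choices=AVAILABLE_AGENTS,
        default="claudecode",
        help="Agent to run (default: claudecode)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"agent={args.agent} model={args.model}")
```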

pass


class REACTAgent:
Collaborator

@xuafeng Dec 11, 2025

For another agent, I am thinking of something like:

class ClaudeCodeAgent:
    def __init__(self, env):
        self.env = env
        # Install Claude Code in the env.

    def run(self):
        result = self.env.execute("claude -p xxx .....")
        # May need to parse the result to get logs and execution results;
        # I am not sure it has all the trajectory you need.
        return result
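
To make the idea concrete, here is a minimal runnable sketch with a stand-in environment. Everything except the `claude -p` invocation mentioned in the comment is an assumption: `ExecResult`, `FakeEnv`, and the returned dictionary are hypothetical, and installation inside the environment is left out because the discussion does not give an install command.

```python
import shlex
from dataclasses import dataclass


@dataclass
class ExecResult:
    # Hypothetical shape of what env.execute returns.
    exit_code: int
    output: str


class FakeEnv:
    """Stand-in for the real environment; records commands instead of running them."""

    def __init__(self):
        self.commands = []

    def execute(self, command):
        self.commands.append(command)
        return ExecResult(exit_code=0, output="done")


class ClaudeCodeAgent:
    def __init__(self, env):
        self.env = env
        # Installing the CLI inside the environment would happen here (omitted).

    def run(self, prompt):
        # shlex.quote keeps the whole prompt as a single shell argument.
        command = f"claude -p {shlex.quote(prompt)}"
        result = self.env.execute(command)
        # Only the exit code and raw output are kept; whether the output holds
        # the full trajectory is an open question in the discussion.
        return {"success": result.exit_code == 0, "log": result.output}


if __name__ == "__main__":
    env = FakeEnv()
    print(ClaudeCodeAgent(env).run("fix the failing test"))
    print(env.commands)
```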

@@ -0,0 +1,150 @@
import subprocess
Collaborator

If each folder only has one file, I am not sure we need the folder.

Collaborator Author

I tried to keep the design as modular as possible for this PoC by separating the components into folders. We should be able to add new agents, environments (currently only Docker is supported), models (right now it’s just litellm, but we can support other methods), and evaluation scripts.

Collaborator

@xuafeng left a comment

Thanks a lot, @tareknaser. I did a pass. It is very clear and clean.
Does the folder structure also work for other benchmarks? Maybe we can unify them later (long-term).

…tadata

Signed-off-by: Tarek <tareknaser360@gmail.com>
Add PATH exports to profile files so Go is available when using `bash -lc`
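
For context, `bash -lc` starts a non-interactive login shell, which reads `/etc/profile` and then the first of `~/.bash_profile`, `~/.bash_login`, or `~/.profile` that exists, but not `~/.bashrc` unless one of those files sources it. The sketch below shows one way such an export could be appended idempotently. It is an illustration only: the Go location `/usr/local/go/bin` and the helper name `ensure_export` are assumptions, and the commit's actual changes are not shown in this conversation.

```python
from pathlib import Path

# Assumption: Go is installed under /usr/local/go; the commit does not state the path.
GO_EXPORT = 'export PATH="$PATH:/usr/local/go/bin"'


def ensure_export(profile_path, export_line=GO_EXPORT):
    """Append export_line to profile_path unless already present.

    Returns True if the file was changed, False otherwise.
    """
    path = Path(profile_path)
    existing = path.read_text() if path.exists() else ""
    if export_line in existing.splitlines():
        return False
    # Make sure the new line starts on its own line.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    path.write_text(existing + separator + export_line + "\n")
    return True
```

Calling the helper for each profile file that the image uses keeps repeated setup runs from piling up duplicate `export` lines.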

Signed-off-by: Tarek <tareknaser360@gmail.com>
… tests

Signed-off-by: Tarek <tareknaser360@gmail.com>
…tation

Signed-off-by: Tarek <tareknaser360@gmail.com>
…rent task

Signed-off-by: Tarek <tareknaser360@gmail.com>
… config

Signed-off-by: Tarek <tareknaser360@gmail.com>
3 participants