WIP: Refactor Course Lab Benchmark #27
base: main
a9f1972 to 4a1dde6
- `repo_url`: Git repository to clone (null if not needed)
- `setup_commands`: Commands to run before testing
- `timeout_minutes`: Maximum execution time (default: 30)
- `lab_url`: Link to lab instructions
We could also add a "How to add new agents" section based on the previous add_agents.md.
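For reference, the four task fields documented in the diff above could be loaded as follows. This is only a minimal sketch: the `Task` dataclass, the `parse_task` helper, the example URL, and the choice to treat `lab_url` as the only required field are assumptions for illustration, not code from this PR.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    """Hypothetical container for one line of the tasks JSONL file."""

    lab_url: str                          # link to the lab instructions
    repo_url: Optional[str] = None        # null when no repository is needed
    setup_commands: list = field(default_factory=list)
    timeout_minutes: int = 30             # documented default


def parse_task(line: str) -> Task:
    """Parse one JSONL line, applying the documented defaults."""
    raw = json.loads(line)
    return Task(
        lab_url=raw["lab_url"],
        repo_url=raw.get("repo_url"),
        setup_commands=list(raw.get("setup_commands") or []),
        timeout_minutes=int(raw.get("timeout_minutes", 30)),
    )


# Placeholder record; the URL and command are made up for the example.
example = '{"lab_url": "https://example.com/lab1", "repo_url": null, "setup_commands": ["make deps"]}'
task = parse_task(example)
print(task)
```

Keeping the defaults in one place like this would let the runner treat missing and null fields the same way.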
import time
from typing import Any
from loguru import logger
I just read through the whole repo, and this part is my only concern. I think we need a generic agent class that works for any agent: REACTAgent is one agent, and the Claude agent is another.
Then we need to figure out how to run an agent (such as Claude Code) in your env class.
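One possible shape for such a shared interface is sketched below. Everything here is hypothetical: the `Agent` base class, the `Environment` protocol with a single `execute` method, the `EchoEnvironment` stand-in, and the `some-agent-cli` command are illustrative names, not the PR's actual classes.

```python
from abc import ABC, abstractmethod
from typing import Protocol


class Environment(Protocol):
    """Anything that can run a shell command and return its output (assumed API)."""

    def execute(self, command: str) -> str: ...


class Agent(ABC):
    """Common interface so the runner does not care which agent it drives."""

    def __init__(self, env: Environment) -> None:
        self.env = env

    @abstractmethod
    def run(self, task_prompt: str) -> str:
        """Work on the task inside self.env and return a result string."""


class EchoEnvironment:
    """Stand-in for a real environment; it only echoes the command."""

    def execute(self, command: str) -> str:
        return f"ran: {command}"


class CliAgent(Agent):
    """Agent that delegates to a command-line tool inside the environment."""

    def __init__(self, env: Environment, cli_command: str) -> None:
        super().__init__(env)
        self.cli_command = cli_command

    def run(self, task_prompt: str) -> str:
        # Quote the prompt with repr() so it is passed as one argument.
        return self.env.execute(f"{self.cli_command} {task_prompt!r}")


agent: Agent = CliAgent(EchoEnvironment(), cli_command="some-agent-cli")
print(agent.run("fix the failing test"))
```

With this split, a ReAct-style loop and a CLI-driven agent would differ only in how `run` is implemented, while the environment stays unaware of which one it hosts.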
Signed-off-by: Tarek <tareknaser360@gmail.com>
4a1dde6 to 1e84d46
@@ -0,0 +1,7 @@
{
This is very clear and clean; I really like it. I think it can be used for many system tasks (env setup, system implementation, performance, diagnosis).
default="data/tasks.jsonl",
help="Path to tasks JSONL file (default: data/tasks.jsonl)",
)
parser.add_argument("--model", type=str, default="anthropic/claude-sonnet-4-5-20250929")
Maybe we can also add agent selection later.
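Agent selection could be exposed next to the existing `--model` flag, roughly as sketched here. The `--agent` flag, its default, and the names in `AGENT_NAMES` are assumptions for illustration; only the `--model` argument is taken from the diff.

```python
import argparse

# Hypothetical agent names; placeholders, not the repo's actual registry.
AGENT_NAMES = ["react", "claude_code"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Existing flag from the diff under review.
    parser.add_argument("--model", type=str, default="anthropic/claude-sonnet-4-5-20250929")
    # Assumed new flag: restrict the value to known agent names.
    parser.add_argument(
        "--agent",
        type=str,
        choices=AGENT_NAMES,
        default="react",
        help="Agent implementation to run (default: react)",
    )
    return parser


args = build_parser().parse_args(["--agent", "claude_code"])
print(args.agent, args.model)
```

Using `choices` makes argparse reject unknown agent names before any environment is started.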
pass


class REACTAgent:
For another agent, I am thinking of something like:
class ClaudeCodeAgent:
    def __init__(self, env):
        # Install Claude Code in the env.
        self.env = env
    def run(self):
        result = self.env.execute("claude -p xxx .....")
        # We may need to parse the result to get logs and execution results; I am not sure it has all the trajectory you need.
        return result
@@ -0,0 +1,150 @@
import subprocess
If each folder only has one file, I am not sure we need the folder.
I tried to keep the design as modular as possible for this PoC by separating the different components into folders. We should be able to add new agents, environments (currently only Docker is supported), models (right now it’s just litellm, but we can support other methods), and evaluation scripts.
Thanks a lot, @tareknaser. I did a pass. It is very clear and clean.
Does the folder structure also work for other benchmarks? Maybe we can unify them later (long term).
…tadata Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Add PATH exports to profile files so Go is available when using `bash -lc` Signed-off-by: Tarek <tareknaser360@gmail.com>
… tests Signed-off-by: Tarek <tareknaser360@gmail.com>
…tation Signed-off-by: Tarek <tareknaser360@gmail.com>
…rent task Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
… config Signed-off-by: Tarek <tareknaser360@gmail.com>
Current State
- `install.sh`, support for `minisweagent` and `openhands`
- `run.sh` overfits to `claudecode`
- … (`xlsx`, `json`, `csv`) with transformation scripts
Proposed Changes
- `courses_metadata.json`
- `lab_number`, `lab_name`, and `course_id`
- `claudecode` only with tests against real-world tasks including file modifications