Integrate SREGym #30
Conversation
…s/sregym/sregym_core'
git-subtree-dir: benchmarks/sregym/sregym_core
git-subtree-split: 529f564ef3c955db2ed82aac2f84c6eb05e078c7
LGTM (doge) LByM
Resolves #26
xuafeng left a comment:
Looks great, and I left some comments.
```
cd cli
./run_all_local.sh <model_name>
```
Can we add a section to highlight how to add/test new agents? I saw you have a sentence about this in Section 2.
Okay!
Piggy-backing on @xuafeng's comment, could you also show an example? You can follow the template here: https://github.com/SREGym/system-intelligence-benchmark/tree/main/benchmarks/course_lab_bench#how-to-extend-the-benchmark (see the "Adding a new artifact" subsection).
benchmarks/sregym/README.md (outdated)
```
@@ -0,0 +1,69 @@
# SREGym Quick Guide
```
Can you follow the TEMPLATE's structure, adding a quick intro to your benchmark and what the tasks look like? You can refer to https://github.com/sys-intelligence/system-intelligence-benchmark/tree/main/benchmarks/course_exam_bench or https://github.com/sys-intelligence/system-intelligence-benchmark/tree/main/benchmarks/sysmobench.
benchmarks/sregym/run.sh (outdated)
```
if [ $# -ne 1 ]; then
    echo "Usage: $0 <model_location>"
    echo "Example: $0 \"gemini/gemini-2.5-flash\""
```
How about exposing the agent as a parameter as well?
```
@@ -0,0 +1 @@
{"text": "text of one doc", "metadata": {"scenarios": "XXX", "subtask": "XXXX", "description": "xx", "link": "XXX", "XXX": "XXX"}}
```
(No newline at end of file)
We can remove this folder.
```
@@ -0,0 +1,2 @@
{"sys_prompt": "You are XXX", "user_prompt": "what", "thinking": "chain of thought", "response": "XXX", "metadata": {"scenario": "XX", "subtask": "XXX", "data_quality":"high", "XXX": "XXX"}}
```
(No newline at end of file)
We can remove this folder.
benchmarks/sregym/env.toml (outdated)
```
@@ -0,0 +1,15 @@
[llm]
```
I did not see this file used anywhere. Maybe you can delete it?
To have a unified config file across different benchmarks, is it possible to put the required env variables in env.toml? Then, in main.py, read env.toml and set them with os.environ["MY_VAR"] = "hello". I think that would have the same effect as using dotenv to read from .env.
You can refer to sysmobench: https://github.com/sys-intelligence/system-intelligence-benchmark/blob/7abdd4d6d3823a4de6f1c161a19617d5632bc227/benchmarks/sysmobench/src/main.py#L20C1-L20C77
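For example, a minimal sketch of that idea in main.py could look like the following, assuming env.toml keeps its key/value pairs under the `[llm]` table shown in the diff above (the key names here are placeholders, not the actual file contents):

```python
# Sketch only: load env.toml and export its [llm] entries as environment
# variables, similar to what dotenv does for a .env file.
import os
import tomllib  # Python 3.11+; use the third-party "tomli" package on older versions

with open("env.toml", "rb") as f:
    config = tomllib.load(f)

# Assumed layout: an [llm] table holding entries such as GEMINI_API_KEY
# or OPENAI_API_KEY; adjust the table/key names to the real file.
for key, value in config.get("llm", {}).items():
    os.environ[key.upper()] = str(value)
```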
benchmarks/sregym/README.md (outdated)
```
## Run SREGym

1. Prepare `.env` for the configurations. You can make a copy of `.env.example` into `.env` and set the keys in the `.env` file. For System Intelligence, you need to set the API keys for the models you want to test, like below:
```
Actually, when I read the README.md and the repo, I could not easily find the .env.example file. Maybe you can consider moving it under the sregym folder?
I think this could be a bit tricky, since that would put it outside of the subtree. But I can specify the path more clearly. 😃
```
PROVIDER="litellm"

GEMINI_API_KEY="XXXXXX"
OPENAI_API_KEY="XXXXXX"
```
How about AzureOpenAI and our own hosted open-source models? Do we support them? I think we would need to set an endpoint_url?
I think we can solve this together with the env file issue. I discussed it a bit with the team; they tend to prefer more direct exposure of the LLM backend, so I need to do some work on the SREGym side 😃
I reviewed everything except sregym_core, and aside from the issues Xuan mentioned, it looks good to me.
Sure!
I think this container might fail because the entry point is ./test.sh. Docker runs this script as the first process, and when it finishes (yours simply terminates without running anything) the container exits as well. Also, my understanding is that anything passed after the image name in a `docker run image <...>` command becomes arguments to test.sh, which in your case doesn't process any (again, because it exits immediately). You might want to check out an example from ArtEvalBench: https://github.com/sys-intelligence/system-intelligence-benchmark/blob/main/benchmarks/arteval_bench/Dockerfile
I noticed your tasks are missing a "test method", namely a command that the framework can run to validate whether the agent solved the task correctly or not. You may want to take a look at course_lab_bench for a simple example: https://github.com/SREGym/system-intelligence-benchmark/blob/main/benchmarks/course_lab_bench/data/benchmark/course_lab_tasks_mit_65840_2024.jsonl . Or, for a more complex example, check out the evaluator JSON field in arteval_bench: https://github.com/SREGym/system-intelligence-benchmark/blob/main/benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl
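For illustration only, a task entry that carries its own validation command might look roughly like the sketch below (the field names are hypothetical, not the actual course_lab_bench or arteval_bench schema):

```python
# Hypothetical shape of a benchmark task entry that includes a command the
# framework can run to grade the attempt; real schemas are in the linked JSONL files.
import json

task = {
    "task_id": "example-task-001",
    "instruction": "What the agent is asked to do.",
    "test_method": "bash run_checks.sh example-task-001",  # grading command (illustrative)
}

# Each task would be stored as one JSON object per line (JSONL).
print(json.dumps(task))
```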
Hi folks @xuafeng @bastoica @Qian-Cheng-nju!
Thanks for your reviews! There are a lot of comments, so I have not acked all of them. Even if everything looks good, please hold off a bit before merging, since I am waiting for the upstream PR with the modifications inside SREGym to be approved. If the upstream does not need further adaptations, then we can merge. 🤠
Thanks @Jackcuii for the efforts. Let me know when you are ready to merge.
@Jackcuii just a quick clarification. Does SREGym have a method to test/validate that agents solve the tasks correctly/successfully?
Yes! It is called the oracle. Our SREGym is totally wrapped inside, and SREGym will report the result directly, so we do not need a separate test method.
Cool, thanks for confirming. Do you think it can be integrated with the system intelligence framework? For example, by creating an Evaluator class to invoke those oracles and populating a corresponding evaluator field. @xuafeng wdyt?
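A rough sketch of what such a wrapper could look like, purely as an illustration (the class, the method names, and the way SREGym's oracle is invoked are hypothetical, not SREGym's actual API):

```python
# Hypothetical Evaluator adapter around SREGym's built-in oracle; the
# oracle call itself is a placeholder and would need to be wired to SREGym.

class SREGymEvaluator:
    """Maps SREGym's oracle verdict onto the framework's pass/fail result."""

    def __init__(self, task_id: str):
        self.task_id = task_id

    def evaluate(self) -> dict:
        # In a real integration this would call into SREGym, which already
        # validates the agent's actions and reports the result directly.
        passed = self._invoke_sregym_oracle(self.task_id)
        return {"task_id": self.task_id, "passed": passed}

    def _invoke_sregym_oracle(self, task_id: str) -> bool:
        raise NotImplementedError("wire this to SREGym's oracle")
```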
I have discussed this with Tianyin and Xuan. We reached an agreement that the most important part of the framework is decoupling. SREGym also has an abstraction with decoupling (it just uses different terms and implementations), so we chose to clarify the correspondence in the README (the abstraction part). Meanwhile, aligning the implementation with the SI framework would cost a lot of effort to change the original implementation of SREGym, which may frustrate other potential participants.
Yes. Ported benchmarks have their own logic and specific implementations; they only need to follow the high-level interface.
@Jackcuii Let me know when you are done with the upstream changes/updates. Thanks.
@Jackcuii The changes are merged upstream in SREGym. Very excited for this integration!
Hi Xuan! Glad to tell you the upstream modifications were just approved! Thanks~
Add SREGym - benchmark for SRE Agents