Conversation

@Jackcuii (Collaborator) commented Dec 4, 2025

Add SREGym - benchmark for SRE Agents

@Jackcuii requested a review from @xuafeng on December 4, 2025, 08:22

@Jackcuii (Collaborator, Author) commented Dec 4, 2025

Resolves #26

@xuafeng (Collaborator) left a comment

Looks great. I left some comments.

```
cd cli
./run_all_local.sh <model_name>
```

Collaborator

Can we add a section highlighting how to add and test new agents? I saw you have a sentence about this in Section 2.

Collaborator Author

Okay!

@@ -0,0 +1,69 @@
# SREGym Quick Guide

Collaborator

Can you follow the TEMPLATE's structure, adding a quick intro to your benchmark and what the tasks look like? You can refer to https://github.com/sys-intelligence/system-intelligence-benchmark/tree/main/benchmarks/course_exam_bench or https://github.com/sys-intelligence/system-intelligence-benchmark/tree/main/benchmarks/sysmobench

Collaborator

I wrote a comment below, but you can separate the technical description from the "why". The technical description should go into README.md, while the reasoning behind SREGym should go into WHY.md. #21 shows a template for the artifact evaluation benchmark (or follow the links @xuafeng mentioned).


if [ $# -ne 1 ]; then
    echo "Usage: $0 <model_location>"
    echo "Example: $0 \"gemini/gemini-2.5-flash\""

Collaborator

How about exposing the agent as a parameter?

@@ -0,0 +1 @@
{"text": "text of one doc", "metadata": {"scenarios": "XXX", "subtask": "XXXX", "description": "xx", "link": "XXX", "XXX": "XXX"}} No newline at end of file

Collaborator

We can remove this folder.

@@ -0,0 +1,2 @@
{"sys_prompt": "You are XXX", "user_prompt": "what", "thinking": "chain of thought", "response": "XXX", "metadata": {"scenario": "XX", "subtask": "XXX", "data_quality":"high", "XXX": "XXX"}}

Collaborator

We can remove this folder.

@@ -0,0 +1,15 @@
[llm]

Collaborator

I did not see this file used anywhere. Maybe you can delete it?

Collaborator

To have a unified config file across different benchmarks, is it possible to put the required environment variables in env.toml? Then, in main.py, read env.toml and set them with os.environ["MY_VAR"] = "hello". I think it would have the same effect as using dotenv to read from .env.

You can refer to sysmobench. https://github.com/sys-intelligence/system-intelligence-benchmark/blob/7abdd4d6d3823a4de6f1c161a19617d5632bc227/benchmarks/sysmobench/src/main.py#L20C1-L20C77
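
For illustration, a minimal sketch of that idea, assuming Python 3.11+ (for the standard-library tomllib) and a hypothetical env.toml with a flat [env] table; this is not SREGym's actual config layout:

```python
import os
import tomllib  # standard library in Python 3.11+


def load_env_from_toml(path: str = "env.toml") -> None:
    """Read key/value pairs from a TOML file and export them as environment variables."""
    with open(path, "rb") as f:  # tomllib requires a binary file handle
        config = tomllib.load(f)
    for key, value in config.get("env", {}).items():
        os.environ[key] = str(value)  # os.environ values must be strings


if __name__ == "__main__":
    load_env_from_toml()
    print(os.environ.get("GEMINI_API_KEY", "<unset>"))
```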


## Run SREGym

1. Prepare a `.env` file for the configuration. You can make a copy of `.env.example` as `.env` and set the keys in the `.env` file. For System Intelligence, you need to set the API keys for the models you want to test, like below:

Collaborator

Actually, when I read the README.md and the repo, I cannot easily find the .env.example file. Maybe you can consider moving it under the sregym folder?

Collaborator Author

I think this could be a bit tricky, since it would move the file out of the subtree. But I can specify the path more clearly. 😃

PROVIDER="litellm"

GEMINI_API_KEY="XXXXXX"
OPENAI_API_KEY="XXXXXX"

Collaborator

How about AzureOpenAI and our own self-hosted open-source models? Do we support them? I think we need to set an endpoint_url?

@Jackcuii (Collaborator, Author) commented Dec 5, 2025

I think we can solve this together with the env file issue. I discussed it a bit with the team. They tend to prefer more direct exposure of the LLM backend, so I need to work a bit on the SREGym side 😃
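
(For context on what that could look like: with a LiteLLM-style backend, Azure and self-hosted OpenAI-compatible endpoints are usually configured through environment variables. The sketch below is purely illustrative with placeholder values; the variable names SREGym eventually exposes may differ.)

```
PROVIDER="litellm"

# Azure OpenAI (variable names follow LiteLLM's conventions; values are placeholders)
AZURE_API_KEY="XXXXXX"
AZURE_API_BASE="https://<your-resource>.openai.azure.com/"
AZURE_API_VERSION="2024-02-15-preview"

# Self-hosted OpenAI-compatible endpoint (hypothetical)
OPENAI_API_KEY="XXXXXX"
OPENAI_API_BASE="http://localhost:8000/v1"
```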

@Qian-Cheng-nju (Collaborator)

I reviewed everything except sregym_core, and aside from the issues Xuan mentioned, it looks good to me.

@bastoica (Collaborator) left a comment

Great work @Jackcuii! Could you add a WHY.md explaining the reasoning behind introducing an SRE benchmark (e.g., why is this benchmark needed, who does it help, etc.)? You could take a look at #21 for a template, but it shouldn't be longer than 500 words.

@Jackcuii (Collaborator, Author) commented Dec 6, 2025

Sure!

Collaborator

I think this container might fail because the entry point is ./test.sh. Docker runs this script as the first process, and when it finishes (yours simply terminates without running anything) the container exits as well. Also, my understanding is that any docker run image <...> command becomes an argument to test.sh, which in your case doesn't process any arguments (again, because it exits immediately). You might want to check out an example from ArtEvalBench: https://github.com/sys-intelligence/system-intelligence-benchmark/blob/main/benchmarks/arteval_bench/Dockerfile

Collaborator

@Jackcuii @xuafeng we might actually want to run the Docker image to make sure? What do you think?

Collaborator

I noticed your tasks are missing a "test method", namely a command that the framework can run to validate whether the agent solved the task correctly or not. You may want to take a look at course_lab_bench for a simple example: https://github.com/SREGym/system-intelligence-benchmark/blob/main/benchmarks/course_lab_bench/data/benchmark/course_lab_tasks_mit_65840_2024.jsonl . Or, for a more complex example, check out the evaluator JSON field in arteval_bench: https://github.com/SREGym/system-intelligence-benchmark/blob/main/benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl
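
For illustration, a tasks.jsonl entry with such a field could look like the hypothetical line below (all field names and the command are made up for this example and are not SREGym's actual schema):

```json
{"task_id": "sregym-001", "scenario": "pod-crashloop", "subtask": "diagnosis", "prompt": "Diagnose the failing checkout service.", "test_method": "python -m sregym.oracle --task sregym-001"}
```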

@Jackcuii (Collaborator, Author) commented Dec 8, 2025

Hi folks @xuafeng @bastoica @Qian-Cheng-nju!
Glad to tell you that I have fixed all the issues mentioned above, including:

  • add agent name selection to the README
  • add extension instructions
  • add task descriptions
  • update the run.sh parameter
  • remove the unused files (.toml)
  • update the model selection logic (upstream side)
  • add Azure support
  • add a clearer explanation of the .env
  • remove the entrypoint from the Dockerfile
  • add WHY.md

Thanks for your reviews! There are a lot of comments, so I did not ack each of them individually.

And if everything looks good, please still hold off a bit before merging, since I am waiting for the upstream PR with modifications inside SREGym to be approved. If the upstream does not need further adaptations, then we can merge. 🤠

@xuafeng (Collaborator) commented Dec 9, 2025

> Glad to tell you that I have fixed all the issues mentioned above, including: [...]

Thanks @Jackcuii for the efforts. Let me know when you are ready for merge.

@xuafeng closed this on Dec 9, 2025
@xuafeng reopened this on Dec 9, 2025

@bastoica (Collaborator) commented Dec 9, 2025

@Jackcuii just a quick clarification. Does SREGym have a method to test/validate that agents solve the tasks correctly/successfully?

@Jackcuii (Collaborator, Author) commented Dec 9, 2025

> @Jackcuii just a quick clarification. Does SREGym have a method to test/validate that agents solve the tasks correctly/successfully?

Yes! It is called the Oracle. For the diagnosis task, we have an LLM-as-a-judge oracle. For the mitigation task, we have specially designed oracles.

Our SREGym is fully wrapped inside, and SREGym reports the result directly, so we do not need the test_method 😃

@bastoica (Collaborator) commented Dec 9, 2025

Cool, thanks for confirming. Do you think it can be integrated with the system intelligence framework? For example, by creating an Evaluator class to invoke those oracles and populating a corresponding test_method field in tasks.jsonl. You can take a look at how the other benchmarks implement this (for arteval_bench the JSONL field is called evaluator).

@xuafeng wdyt?
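
To make the suggestion concrete, here is a minimal sketch of such an adapter (class and method names are hypothetical; SREGym's actual oracle API may differ):

```python
# Hypothetical sketch only: SREGym's real oracle interface may differ.
from dataclasses import dataclass


@dataclass
class EvalResult:
    passed: bool
    detail: str


class SREGymEvaluator:
    """Adapter that invokes an SREGym oracle and maps its verdict
    onto a framework-level pass/fail result."""

    def __init__(self, oracle):
        # `oracle` is assumed to expose an evaluate(task_id) method that
        # returns an object with `success` and `explanation` attributes.
        self.oracle = oracle

    def evaluate(self, task_id: str) -> EvalResult:
        verdict = self.oracle.evaluate(task_id)
        return EvalResult(passed=bool(verdict.success),
                          detail=str(verdict.explanation))
```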

@Jackcuii (Collaborator, Author) commented Dec 9, 2025

> Cool, thanks for confirming. Do you think it can be integrated with the system intelligence framework? For example, by creating an Evaluator class to invoke those oracles and populating a corresponding test_method field in tasks.jsonl. You can take a look at how the other benchmarks implement this (for arteval_bench the JSONL field is called evaluator).
>
> @xuafeng wdyt?

I have discussed this with Tianyin and Xuan. We agreed that the most important part of the framework is decoupling. SREGym also has a decoupled abstraction (it just uses different terms and implementations), so we chose to clarify the correspondence in the README (the abstraction part). Meanwhile, aligning the implementation with the SI framework would require a lot of effort to change the original implementation of SREGym, which may frustrate other potential participants.

@xuafeng (Collaborator) commented Dec 9, 2025

> I have discussed this with Tianyin and Xuan. We agreed that the most important part of the framework is decoupling. [...]

Yes. Ported benchmarks have their own logic and specific implementations. They only need to follow the high-level interface.

@xuafeng (Collaborator) commented Dec 11, 2025

@Jackcuii Let me know when you are done with the upstream changes/updates. Thanks.

@HacksonClark (Collaborator)

@Jackcuii The changes are merged upstream in SREGym. Very excited for this integration!

@Jackcuii (Collaborator, Author)

> @Jackcuii Let me know when you are done with the upstream changes/updates. Thanks.

Hi Xuan! Glad to tell you the upstream modifications were just approved! Thanks~

@xuafeng merged commit 4950df8 into sys-intelligence:main on Dec 16, 2025
4 checks passed
@xuafeng mentioned this pull request on Dec 16, 2025