Conversation

@Jackcuii (Collaborator) commented Dec 4, 2025

Add SREGym - benchmark for SRE Agents

@Jackcuii requested a review from @xuafeng on December 4, 2025, 08:22

@Jackcuii (Collaborator, Author) commented Dec 4, 2025

Resolves #26

@xuafeng (Collaborator) left a comment

Looks great. I left some comments.

```
cd cli
./run_all_local.sh <model_name>
```

Collaborator

Can we add a section highlighting how to add and test new agents? I saw you have a sentence about this in Section 2.

Collaborator Author

Okay!

@@ -0,0 +1,69 @@
# SREGym Quick Guide

Collaborator

Can you follow the TEMPLATE's structure, adding a quick intro to your benchmark and what the tasks look like? You can refer to https://github.com/sys-intelligence/system-intelligence-benchmark/tree/main/benchmarks/course_exam_bench or https://github.com/sys-intelligence/system-intelligence-benchmark/tree/main/benchmarks/sysmobench

Collaborator

I wrote a comment below, but you can separate the technical description from the "why". The technical description should go into README.md, while the reasoning behind SREGym should go into WHY.md. #21 shows a template for the artifact evaluation benchmark (or follow the links @xuafeng mentioned).


if [ $# -ne 1 ]; then
    echo "Usage: $0 <model_location>"
    echo "Example: $0 \"gemini/gemini-2.5-flash\""

Collaborator

How about exposing the agent as a parameter?

@@ -0,0 +1 @@
{"text": "text of one doc", "metadata": {"scenarios": "XXX", "subtask": "XXXX", "description": "xx", "link": "XXX", "XXX": "XXX"}} No newline at end of file

Collaborator

We can remove this folder.

@@ -0,0 +1,2 @@
{"sys_prompt": "You are XXX", "user_prompt": "what", "thinking": "chain of thought", "response": "XXX", "metadata": {"scenario": "XX", "subtask": "XXX", "data_quality":"high", "XXX": "XXX"}}

Collaborator

We can remove this folder.

@@ -0,0 +1,15 @@
[llm]

Collaborator

I did not see this file used anywhere. Maybe you can delete it?

Collaborator

To have a unified config file across different benchmarks, is it possible to put the required environment variables in env.toml? Then, in main.py, read env.toml and set them with os.environ["MY_VAR"] = "hello". I think it would have the same effect as using dotenv to read from .env.

You can refer to sysmobench. https://github.com/sys-intelligence/system-intelligence-benchmark/blob/7abdd4d6d3823a4de6f1c161a19617d5632bc227/benchmarks/sysmobench/src/main.py#L20C1-L20C77
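
For illustration, a minimal sketch of that idea, assuming Python 3.11+ (for the standard-library tomllib) and a hypothetical env.toml with a flat [env] table; this is not SREGym's actual config layout:

```python
import os
import tomllib  # standard library in Python 3.11+


def load_env_from_toml(path: str = "env.toml") -> None:
    """Read key/value pairs from a TOML file and export them as environment variables."""
    with open(path, "rb") as f:  # tomllib requires a binary file handle
        config = tomllib.load(f)
    for key, value in config.get("env", {}).items():
        os.environ[key] = str(value)  # os.environ values must be strings


if __name__ == "__main__":
    load_env_from_toml()
    print(os.environ.get("GEMINI_API_KEY", "<unset>"))
```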


## Run SREGym

1. Prepare a `.env` file for the configuration. You can make a copy of `.env.example` as `.env` and set the keys in the `.env` file. For System Intelligence, you need to set the API keys for the models you want to test, like below:

Collaborator

Actually, when I read the README.md and the repo, I cannot easily find the .env.example file. Maybe you can consider moving it under the sregym folder?

Collaborator Author

I think this could be a bit tricky, since it would move the file out of the subtree. But I can specify the path more clearly. 😃

PROVIDER="litellm"

GEMINI_API_KEY="XXXXXX"
OPENAI_API_KEY="XXXXXX"

Collaborator

How about AzureOpenAI and our own self-hosted open-source models? Do we support them? I think we need to set an endpoint_url?

@Jackcuii (Collaborator, Author) commented Dec 5, 2025

I think we can solve this together with the env file issue. I discussed it a bit with the team. They tend to prefer more direct exposure of the LLM backend, so I need to work a bit on the SREGym side 😃
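
(For context on what that could look like: with a LiteLLM-style backend, Azure and self-hosted OpenAI-compatible endpoints are usually configured through environment variables. The sketch below is purely illustrative with placeholder values; the variable names SREGym eventually exposes may differ.)

```
PROVIDER="litellm"

# Azure OpenAI (variable names follow LiteLLM's conventions; values are placeholders)
AZURE_API_KEY="XXXXXX"
AZURE_API_BASE="https://<your-resource>.openai.azure.com/"
AZURE_API_VERSION="2024-02-15-preview"

# Self-hosted OpenAI-compatible endpoint (hypothetical)
OPENAI_API_KEY="XXXXXX"
OPENAI_API_BASE="http://localhost:8000/v1"
```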

@Qian-Cheng-nju (Collaborator)

I reviewed everything except sregym_core, and aside from the issues Xuan mentioned, it looks good to me.

@bastoica (Collaborator) left a comment

Great work @Jackcuii! Could you add a WHY.md explaining the reasoning behind introducing an SRE benchmark (e.g., why is this benchmark needed, who does it help, etc.)? You could take a look at #21 for a template, but it shouldn't be longer than 500 words.

@Jackcuii (Collaborator, Author) commented Dec 6, 2025

Sure!

Collaborator

I think this container might fail because the entry point is ./test.sh. Docker runs this script as the first process, and when it finishes (yours simply terminates without running anything) the container exits as well. Also, my understanding is that any docker run image <...> command becomes an argument to test.sh, which in your case doesn't process any arguments (again, because it exits immediately). You might want to check out an example from ArtEvalBench: https://github.com/sys-intelligence/system-intelligence-benchmark/blob/main/benchmarks/arteval_bench/Dockerfile

Collaborator

@Jackcuii @xuafeng we might actually want to run the Docker image to make sure? What do you think?

Collaborator

I noticed your tasks are missing a "test method", namely a command that the framework can run to validate whether the agent solved the task correctly or not. You may want to take a look at course_lab_bench for a simple example: https://github.com/SREGym/system-intelligence-benchmark/blob/main/benchmarks/course_lab_bench/data/benchmark/course_lab_tasks_mit_65840_2024.jsonl . Or, for a more complex example, check out the evaluator JSON field in arteval_bench: https://github.com/SREGym/system-intelligence-benchmark/blob/main/benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl
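
For illustration, a tasks.jsonl entry with such a field could look like the hypothetical line below (all field names and the command are made up for this example and are not SREGym's actual schema):

```json
{"task_id": "sregym-001", "scenario": "pod-crashloop", "subtask": "diagnosis", "prompt": "Diagnose the failing checkout service.", "test_method": "python -m sregym.oracle --task sregym-001"}
```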

@Jackcuii (Collaborator, Author) commented Dec 8, 2025

Hi folks @xuafeng @bastoica @Qian-Cheng-nju!
Glad to tell you that I have fixed all the issues mentioned above, including:

  • add agent name selection to the README
  • add extension instructions
  • add task descriptions
  • update the run.sh parameter
  • remove the unused files (.toml)
  • update the model selection logic (upstream side)
  • add Azure support
  • add a clearer explanation of the .env
  • remove the entrypoint from the Dockerfile
  • add WHY.md

Thanks for your reviews! There are a lot of comments, so I did not ack each of them individually.

And if everything looks good, please still hold off a bit before merging, since I am waiting for the upstream PR with modifications inside SREGym to be approved. If the upstream does not need further adaptations, then we can merge. 🤠

@xuafeng (Collaborator) commented Dec 9, 2025

> Glad to tell you that I have fixed all the issues mentioned above, including: [...]

Thanks @Jackcuii for the efforts. Let me know when you are ready for merge.

@xuafeng closed this on Dec 9, 2025
@xuafeng reopened this on Dec 9, 2025

@bastoica (Collaborator) commented Dec 9, 2025

@Jackcuii just a quick clarification. Does SREGym have a method to test/validate that agents solve the tasks correctly/successfully?

@Jackcuii (Collaborator, Author) commented Dec 9, 2025

> @Jackcuii just a quick clarification. Does SREGym have a method to test/validate that agents solve the tasks correctly/successfully?

Yes! It is called the Oracle. For the diagnosis task, we have an LLM-as-a-judge oracle. For the mitigation task, we have specially designed oracles.

Our SREGym is fully wrapped inside, and SREGym reports the result directly, so we do not need the test_method 😃

@bastoica (Collaborator) commented Dec 9, 2025

Cool, thanks for confirming. Do you think it can be integrated with the system intelligence framework? For example, by creating an Evaluator class to invoke those oracles and populating a corresponding test_method field in tasks.jsonl. You can take a look at how the other benchmarks implement this (for arteval_bench the JSONL field is called evaluator).

@xuafeng wdyt?
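
To make the suggestion concrete, here is a minimal sketch of such an adapter (class and method names are hypothetical; SREGym's actual oracle API may differ):

```python
# Hypothetical sketch only: SREGym's real oracle interface may differ.
from dataclasses import dataclass


@dataclass
class EvalResult:
    passed: bool
    detail: str


class SREGymEvaluator:
    """Adapter that invokes an SREGym oracle and maps its verdict
    onto a framework-level pass/fail result."""

    def __init__(self, oracle):
        # `oracle` is assumed to expose an evaluate(task_id) method that
        # returns an object with `success` and `explanation` attributes.
        self.oracle = oracle

    def evaluate(self, task_id: str) -> EvalResult:
        verdict = self.oracle.evaluate(task_id)
        return EvalResult(passed=bool(verdict.success),
                          detail=str(verdict.explanation))
```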

@Jackcuii (Collaborator, Author) commented Dec 9, 2025

> Cool, thanks for confirming. Do you think it can be integrated with the system intelligence framework? For example, by creating an Evaluator class to invoke those oracles and populating a corresponding test_method field in tasks.jsonl. You can take a look at how the other benchmarks implement this (for arteval_bench the JSONL field is called evaluator).
>
> @xuafeng wdyt?

I have discussed this with Tianyin and Xuan. We agreed that the most important part of the framework is decoupling. SREGym also has a decoupled abstraction (it just uses different terms and implementations), so we chose to clarify the correspondence in the README (the abstraction part). Meanwhile, aligning the implementation with the SI framework would require a lot of effort to change the original implementation of SREGym, which may frustrate other potential participants.

@xuafeng (Collaborator) commented Dec 9, 2025

> I have discussed this with Tianyin and Xuan. We agreed that the most important part of the framework is decoupling. [...]

Yes. Ported benchmarks have their own logic and specific implementations. They only need to follow the high-level interface.

@xuafeng (Collaborator) commented Dec 11, 2025

@Jackcuii Let me know when you are done with the upstream changes/updates. Thanks.

@HacksonClark (Collaborator)

@Jackcuii The changes are merged upstream in SREGym. Very excited for this integration!

@Jackcuii (Collaborator, Author)

> @Jackcuii Let me know when you are done with the upstream changes/updates. Thanks.

Hi Xuan! Glad to tell you the upstream modifications were just approved! Thanks~

@xuafeng merged commit 4950df8 into sys-intelligence:main on Dec 16, 2025
4 checks passed
@xuafeng mentioned this pull request on Dec 16, 2025