
Conversation

@houqiii commented Dec 11, 2025

Introduction

This PR integrates TopoSense-Bench, a rigorous benchmark designed to evaluate Large Language Models (LLMs) on the Semantic-Spatial Sensor Scheduling (S³) problem.

It originates from the ACM MobiCom '26 paper: "IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling".

Unlike standard QA tasks, this benchmark requires the LLM to act as an agent that translates high-level user intents (e.g., "Find my backpack lost between the library and the gym") into precise physical sensor node IDs within a large-scale digital twin.
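
For illustration, a query and its expected grounding might look like the following (a hypothetical pair; the node IDs are invented to match the ID format shown under Key Features below):

    Query:  "Find my backpack lost between the library and the gym"
    Nodes:  library_1_camera_02, gym_1_camera_01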

Key Features

  • Hugging Face Integration:
    • Unlike benchmarks that store data locally, this implementation uses the datasets library to load data directly from Hugging Face.
    • Benefit: Keeps the repository lightweight and ensures users always access the latest version of the dataset.
  • RAG-based Evaluation Logic:
    • Implements a TopologyManager that simulates a retrieval system. It dynamically fetches relevant building/floor topological data based on the user query, testing the model's ability to reason over long contexts and spatial constraints.
  • Specialized Evaluator:
    • Includes a custom TopoSenseEvaluator to robustly parse sensor node IDs (e.g., teaching_building_1_camera_03) from model output and match them against ground truth; a minimal sketch of this matching step follows this list.
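
A minimal sketch of such a matching step, assuming node IDs follow the underscore-separated shape of the example above (illustrative only, not the actual TopoSenseEvaluator):

    import re

    # Assumed node-ID shape, inferred from "teaching_building_1_camera_03":
    # lowercase alphanumeric tokens joined by underscores.
    NODE_ID_RE = re.compile(r"\b[a-z][a-z0-9]*(?:_[a-z0-9]+)+\b")

    def extract_node_ids(model_output: str) -> set[str]:
        """Pull candidate sensor node IDs out of free-form model output."""
        return set(NODE_ID_RE.findall(model_output.lower()))

    def is_correct(model_output: str, ground_truth: set[str]) -> bool:
        """Score a prediction: it must name at least one ground-truth node."""
        return bool(extract_node_ids(model_output) & {g.lower() for g in ground_truth})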

📊 Dataset Statistics

  • Scale: 5,250 natural language queries.
  • Environment: A university digital twin with 33 buildings, 161 floor plans, and 2,510 sensors.
  • Tasks:
    1. Intra-Zone Perception
    2. Intra-Building Coordination
    3. Inter-Building Coordination

Implementation Details

  • Directory Structure: Follows the repository standard (benchmarks/toposense_bench/).
  • SDK Usage: Reuses sdk.executor.SimpleExecutor for LLM calls and sdk.utils for configuration management.
  • Configuration: Sensitive keys are managed via env.toml (template provided).

How to Run

  1. Navigate to the benchmark directory:

    cd benchmarks/toposense_bench
  2. Install dependencies:

    pip install -r requirements.txt
  3. Configure env.toml with your API Key (e.g., OPENAI_API_KEY).
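
     A minimal sketch of what env.toml might contain (any key name other than OPENAI_API_KEY is an assumption; the provided template is authoritative):

     # Hypothetical env.toml sketch; consult the included template for the real schema.
     OPENAI_API_KEY = "sk-..."
     # BASE_URL = "https://api.deepseek.com/v1"  # assumed optional key for OpenAI-compatible endpoints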

  4. Run the evaluation script:

    # Example using GPT-4o
    bash run.sh "gpt-4o"
    
    # Example using DeepSeek (via OpenAI-compatible endpoint)
    bash run.sh "openai/deepseek-chat"

@@ -0,0 +1,23 @@
#!/bin/bash

Collaborator commented:

Can you follow the template and add an install.sh, which is needed for our integration? Thanks.

os.makedirs(output_dir)

df = pd.DataFrame(results)
df.to_json(
@xuafeng commented Dec 11, 2025:

Can you refer to https://github.com/sys-intelligence/system-intelligence-benchmark/tree/main/benchmarks/course_exam_bench#output-files and produce result files at different levels of detail? summary.json is needed.

@xuafeng left a comment:

Thanks for the contributions. I left some comments and also looped in our team members for feedback.

@xuafeng requested a review from tareknaser on December 11, 2025 at 22:54

## 📊 Overview

- **Source**: Hosted on [Hugging Face](https://huggingface.co/datasets/IoT-Brain-Project/TopoSense-Bench) (Seamlessly integrated via the `datasets` library).
Collaborator commented:

The link returns a "404". Is it because the dataset is not public yet?

@xuafeng left a comment:

Add more comments.

{
"answer": "sensor_name_here",
"explanation": "Brief reasoning based on map tags"
}
Collaborator commented:

The closing ``` fence is missing.

try:
    # Load the 'topology' configuration.
    # Hugging Face assigns uploaded JSONL files to the 'train' split by default.
    ds = load_dataset("IoT-Brain/TopoSense-Bench", "topology", split="train")
Collaborator commented:

"IoT-Brain/TopoSense-Bench"

Not aligned to the README's Hugging Face link

@tareknaser left a comment:

Thank you for the contribution! I left some comments.

Collaborator commented:

Could you add an entry for the benchmark to the root project README?

Collaborator commented:

This test isn’t running in CI right now. Please add it to .github/workflows/test.yml

Collaborator commented:

There’s also ongoing work in another PR to add a Why.md file to each benchmark directory. See the discussion: #21 (comment)

@houqiii (author) commented Dec 14, 2025

Thank you @xuafeng, @Qian-Cheng-nju, and @tareknaser for the constructive feedback!

I have pushed the latest changes which address all the points raised in the review. Here is a summary of the updates:

1. Output Format & Engineering (@xuafeng)

  • Standardized Output: Updated src/main.py to align with the repository standard. It now generates three files (a sketch of this logic appears below):
    • summary.json (Aggregated statistics with overall and category-level accuracy).
    • results.jsonl (Minimal results).
    • results_detailed.jsonl (Full debugging info including prompts and retrieval status).
  • Installation: Added install.sh to automate environment setup.
  • Link Fix: Corrected the Hugging Face URL in README.md to point to the correct organization (IoT-Brain).

2. Documentation & CI (@tareknaser)

  • Why TopoSense: Added benchmarks/toposense_bench/Why.md to explain the benchmark's significance.
  • Root README: Added TopoSense-Bench entry to the main project README.
  • CI Integration: Added toposense_bench to the test matrix in .github/workflows/test.yml.

3. Code Fixes (@Qian-Cheng-nju)

  • Syntax: Fixed the missing closing ``` fence in the JSON prompt template.
  • Topology Loader: Updated src/topology_loader.py to use the correct HF dataset path (IoT-Brain/TopoSense-Bench), ensuring consistency with the README.
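
For reviewers, a minimal sketch of the three-level output logic described in point 1 (the field names in each result record are assumptions; the actual src/main.py may differ):

    import json
    import os
    from collections import defaultdict

    def write_outputs(results, output_dir):
        """Write summary.json, results.jsonl and results_detailed.jsonl.

        Each item in `results` is assumed to carry at least
        'id', 'category', 'prediction' and 'correct' fields.
        """
        os.makedirs(output_dir, exist_ok=True)

        # summary.json: overall plus category-level accuracy.
        by_cat = defaultdict(lambda: {"total": 0, "correct": 0})
        for r in results:
            by_cat[r["category"]]["total"] += 1
            by_cat[r["category"]]["correct"] += int(r["correct"])
        summary = {
            "total": len(results),
            "accuracy": sum(int(r["correct"]) for r in results) / max(len(results), 1),
            "by_category": {
                cat: {**v, "accuracy": v["correct"] / v["total"]}
                for cat, v in by_cat.items()
            },
        }
        with open(os.path.join(output_dir, "summary.json"), "w") as f:
            json.dump(summary, f, indent=2)

        # results.jsonl: minimal per-query records.
        with open(os.path.join(output_dir, "results.jsonl"), "w") as f:
            for r in results:
                f.write(json.dumps({k: r[k] for k in ("id", "prediction", "correct")}) + "\n")

        # results_detailed.jsonl: full records for debugging.
        with open(os.path.join(output_dir, "results_detailed.jsonl"), "w") as f:
            for r in results:
                f.write(json.dumps(r) + "\n")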

I verified the changes locally with a smoke test script, confirming that the data loading, RAG retrieval, and file generation logic work as expected.

Ready for the next round of review!

@Qian-Cheng-nju commented:

The new version looks great to me — thank you very much!

@xuafeng merged commit 4b3c323 into sys-intelligence:main on Dec 22, 2025. 3 checks passed.