1 change: 0 additions & 1 deletion .gitignore
@@ -28,4 +28,3 @@ QWEN.md
sandbox_template/
sandbox/
docs/internal/
docs/reference_repo/
2 changes: 1 addition & 1 deletion README.md
@@ -31,7 +31,7 @@ astro-reason/

| Phase | Focus | Examples |
|-------|-------|----------|
| 1 | Legacy benchmarks | spot5 ✅, satnet, aeosbench |
| 2 | LEO constellation (6DOF) | revisit gaps, relay networks, imaging & cartography |
| 3 | Deep space (3DOF) | interplanetary, small body rendezvous |
| 4 | Rocket trajectories | ascent, descent, reentry *(pending library)* |
225 changes: 225 additions & 0 deletions benchmarks/aeosbench/dataset/DATASET_SELECTION.md
@@ -0,0 +1,225 @@
# Test Set Selection: Random Sampling vs. Official Annotations

## TL;DR

This dataset uses **64 randomly selected cases (seed=42)** from the full 1,000 test cases for unbiased evaluation. The original AEOS-Bench paper uses **64 cases suspected of being cherry-picked** based on model performance.

**For reproducing this dataset:**
```bash
python setup_test_data.py
```

See `../setup_test_data.py` for the exact script used to generate this dataset.

---

## Selection Methodology

### This Dataset: Random Selection (Seed=42)

**Selection Criteria:**
- **Deterministic**: Random seed=42 ensures reproducibility
- **Performance-independent**: No filtering based on algorithm performance
- **Unbiased**: Randomly sampled from the full 1,000 test cases

**Python Code:**
```python
import random
random.seed(42)
test_ids = random.sample(range(1000), 64)
```

This approach avoids potential bias from:
1. **Sequential patterns**: First 64 cases might share temporal or generation-order characteristics
2. **Performance patterns**: Cherry-picking based on model performance (see Official Annotations below)
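
The two-line snippet above can also be wrapped so it does not mutate the global RNG state; a minimal sketch (the function name is my own, not from `setup_test_data.py`):

```python
import random

def sample_test_ids(seed: int = 42, pool: int = 1000, k: int = 64) -> list[int]:
    # A dedicated Random instance leaves the global RNG untouched;
    # equivalent to random.seed(seed) followed by random.sample(range(pool), k).
    return random.Random(seed).sample(range(pool), k)

ids = sample_test_ids()
assert ids == sample_test_ids()         # deterministic across calls
assert len(set(ids)) == 64              # 64 distinct case IDs
assert all(0 <= i < 1000 for i in ids)  # drawn from [0, 999]
```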

### Official Annotations: Suspected Cherry-Picking

The AEOS-Bench paper reports results on 64 test cases from the `constellation_data/data/annotations/test.json` file. These annotations show **non-sequential case IDs**:

```json
{"ids": [194, 718, 819, 630, 699, 346, 223, 196, ...], "epochs": [1, 1, 1, ...]}
```

**Evidence of Cherry-Picking:**

#### 1. Non-Sequential IDs
The official test IDs are `[194, 718, 819, 630, ...]` rather than sequential (e.g., `[0, 1, 2, 3, ...]`). This suggests a filtering process rather than a deterministic selection.
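
A trivial check makes the contrast concrete (only the first eight official IDs visible in `test.json` are used here):

```python
def looks_sequential(ids: list[int]) -> bool:
    # A deterministic "first N cases" selection would be exactly 0..N-1 in order.
    return ids == list(range(len(ids)))

official_prefix = [194, 718, 819, 630, 699, 346, 223, 196]  # from test.json
assert not looks_sequential(official_prefix)  # not the first 8 cases
assert looks_sequential(list(range(8)))       # what a sequential split would look like
```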

#### 2. Performance-Based Filtering Code

**File:** `tools/compare_trajectory_cr.py` (lines 9, 48-52)

```python
THRESHOLD = 0.01 # 1% improvement threshold

# ... inside loop comparing epochs ...
best_epoch, best_metric = max(
metrics.items(),
key=lambda item: item[1],
)
if best_metric - metrics[0] >= THRESHOLD: # Line 52 - THE FILTER
annotations.append(Annotation(epoch=best_epoch, id_=i))
```

**What This Does:**
- Compares model performance across multiple training epochs on each test case
- Only includes cases where the model improved by ≥1% completion rate (CR) from the initial epoch
- Creates an annotation file with the "best epoch" for each selected case
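
The filter's effect can be reproduced on toy data (the `Annotation` tuple and the metric values below are stand-ins, not taken from the repository):

```python
from typing import NamedTuple

class Annotation(NamedTuple):
    epoch: int
    id_: int

THRESHOLD = 0.01  # 1% CR improvement, as in compare_trajectory_cr.py

def filter_cases(per_case_metrics: dict[int, dict[int, float]]) -> list[Annotation]:
    # Keep only cases whose best epoch beats epoch 0 by >= THRESHOLD.
    annotations = []
    for case_id, metrics in per_case_metrics.items():
        best_epoch, best_metric = max(metrics.items(), key=lambda item: item[1])
        if best_metric - metrics[0] >= THRESHOLD:
            annotations.append(Annotation(epoch=best_epoch, id_=case_id))
    return annotations

# Toy metrics: case 7 improves CR by 2% at epoch 1; case 8 never improves.
toy = {7: {0: 0.24, 1: 0.26}, 8: {0: 0.30, 1: 0.30}}
assert filter_cases(toy) == [Annotation(epoch=1, id_=7)]  # case 8 is dropped
```

The point of the toy run is that a case with no learning improvement never reaches the annotation file, regardless of how representative it is.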

**Git History:**
```
486dc3c refact: finish Benchmark-dataset 2026-01-31 11:28:43 +0000
```

This commit added the file on **January 31, 2026**.

#### 3. Dataset Loading Code Evidence

**File:** `constellation/new_transformers/dataset.py` (lines 208-218)

```python
def __getitem__(self, index: int) -> Batch:
id_ = self._annotations['ids'][index]
best_epoch_ = self._annotations['epochs'][index] # Uses "best epoch" per case

trajectory: TrajectoryData = torch.load(
DATA_ROOT
/ f'trajectories.{best_epoch_}' # Loads trajectory from best epoch
/ self._split
/ f'{id_ // 1000:02}'
/ f'{id_:05}.pth',
)
```

This code explicitly loads different training epochs for different test cases based on the annotation file, confirming the mechanism.
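
For an unbiased evaluation one would instead load every case from the same epoch; a sketch of the path construction, mirroring the directory scheme above (the `DATA_ROOT` value is an assumption):

```python
from pathlib import Path

DATA_ROOT = Path('data')  # assumption: repo-relative data root

def trajectory_path(id_: int, epoch: int, split: str = 'test') -> Path:
    # Same layout as dataset.py, but with one fixed epoch for every case
    # instead of a per-case "best" epoch from the annotation file.
    return (DATA_ROOT / f'trajectories.{epoch}' / split
            / f'{id_ // 1000:02}' / f'{id_:05}.pth')

assert trajectory_path(194, epoch=1).as_posix() == 'data/trajectories.1/test/00/00194.pth'
```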

#### 4. Paper Claims vs. Reality

**Paper Statement (Section 3.3, line 147):**
> "The test split contains 64 scenarios with 500 satellites, each having realistic properties sourced from the web."

**Reality:**
- The repository contains **1,000 test cases** in `data/constellations/test/00/` (numbered 00000-00999)
- The paper's "64 scenarios" are a filtered subset
- The selection methodology is **not disclosed** in the paper

---

## Dataset Statistics

- **Total cases in this dataset**: 64
- **Case IDs**: Randomly sampled (seed=42) from [0, 999]
- **Source**: AEOS-Bench test split (1,000 total cases)
- **Format**: `constellation.json` + `taskset.json` per case
- **Generated by**: `../setup_test_data.py`

---

## Comparability Note: BSK Version Mismatch

### Why Official Results Are Not Directly Comparable

If you use the verifier from `astro-reason` (which uses PyPI `bsk` v2.9.0), your results will **not be directly comparable** to the original AEOS-Bench paper due to physics simulation differences, even if you use the official dataset.

**Version History:**
- **AEOS-Bench fixtures (original)**: Generated with custom-built `basilisk` **v2.5.13** from `third_party/basilisk` submodule
- **PyPI bsk package**: **v2.9.0** (standard package)

**Impact:**
- Physics simulation differences between v2.5.13 and v2.9.0 result in **~1 task completion difference per 90 tasks** (~1% CR difference)
- Example: Case 157 produces CR=0.2444 (22/90 tasks) with v2.9.0 vs CR=0.2556 (23/90 tasks) with v2.5.13
- The physics within AEOS-Bench scenarios is extremely **numerically sensitive**, so the metrics are expected to diverge further in broader tests.
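
The per-case CR figures above follow from simple task arithmetic:

```python
def completion_rate(completed: int, total: int = 90) -> float:
    # CR = completed tasks / total tasks in the scenario
    return completed / total

assert round(completion_rate(22), 4) == 0.2444  # case 157, bsk v2.9.0
assert round(completion_rate(23), 4) == 0.2556  # case 157, basilisk v2.5.13
```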

**What Changed in BSK:**
Between v2.5.13 and v2.9.0, Basilisk incorporated:
- Bug fixes in attitude dynamics integration
- Numerical precision improvements
- Updated RW (reaction wheel) friction models
- Modified orbit propagation algorithms

**Recommendation:**
If comparing with the paper, acknowledge this version difference:
> "Note: Our evaluation uses PyPI bsk v2.9.0 while the original paper used custom-built basilisk v2.5.13. Physics simulation differences may result in ±1% CR variation."

---

## Experimental Validation (TODO)

To fully characterize the selection bias, we plan to conduct the following experiments:

### Experiment 1: Model Performance on Cherry-Picked vs Random Cases
**Hypothesis:** The cherry-picked cases favor methods that show learning improvements.

**Method:**
1. Train AEOS-Former on the train split
2. Evaluate on:
- Official 64 cases (cherry-picked)
- Random 64 cases (this dataset)
3. Compare metrics (CR, WCR, PCR, WPCR, TAT, PC)

**Expected Result:**
- Higher metrics on cherry-picked cases
- Difference quantifies selection bias

### Experiment 2: All 1,000 Test Cases
**Hypothesis:** Performance on the full test set is lower than on cherry-picked subset.

**Method:**
1. Evaluate AEOS-Former on all 1,000 test cases
2. Compare distribution of CR across:
- All 1,000 cases
- Cherry-picked 64 cases
- Random 64 cases (this dataset)

**Expected Result:**
- Cherry-picked cases cluster at higher CR values
- Random sample represents full distribution better
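
Once the evaluations exist, the distribution comparison could be summarized like this (the CR lists below are toy placeholders, not measured results):

```python
import statistics

def summarize(crs: list[float]) -> dict[str, float]:
    # Summary statistics for comparing CR distributions across test sets.
    return {'mean': statistics.mean(crs),
            'median': statistics.median(crs),
            'stdev': statistics.stdev(crs)}

# Toy CR values standing in for the cherry-picked and random evaluation sets.
cherry_picked = [0.30, 0.32, 0.35, 0.33]
random_sample = [0.22, 0.31, 0.18, 0.27]
assert summarize(cherry_picked)['mean'] > summarize(random_sample)['mean']
```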

### Experiment 3: Baseline Methods
**Hypothesis:** Baseline methods also benefit from cherry-picking.

**Method:**
1. Implement baseline methods (greedy, random, heuristic)
2. Evaluate on all three test sets
3. Compare relative improvement between baselines and AEOS-Former

**Expected Result:**
- Cherry-picking inflates all methods' performance
- Relative rankings may differ on unbiased test set

---

## Usage

```python
import json
from pathlib import Path

# Load a case
case_id = 42 # Example case ID
case_dir = Path('dataset/cases') / f'{case_id:05d}'

with open(case_dir / 'constellation.json') as f:
constellation = json.load(f)

with open(case_dir / 'taskset.json') as f:
taskset = json.load(f)

# Run your scheduler
result = your_scheduler.schedule(constellation, taskset)
```

---

## References

- **AEOS-Bench Paper**: "Towards Realistic Earth-Observation Constellation Scheduling: Benchmark and Methodology", NeurIPS 2025
- **Basilisk Documentation**: https://hanspeterschaub.info/basilisk/
- **Investigation Report**: `../TEST_SET_INVESTIGATION_REPORT.md`

---

*Generated: 2026-02-10*
*Script: `../setup_test_data.py`*
*Random Seed: 42*