1 change: 0 additions & 1 deletion .gitignore
@@ -28,4 +28,3 @@ QWEN.md
sandbox_template/
sandbox/
docs/internal/
docs/reference_repo/
2 changes: 1 addition & 1 deletion README.md
@@ -31,7 +31,7 @@ astro-reason/

| Phase | Focus | Examples |
|-------|-------|----------|
| 1 | Legacy benchmarks | spot5 ✅, satnet, aeosbench |
| 2 | LEO constellation (6DOF) | revisit gaps, relay networks, imaging & cartography |
| 3 | Deep space (3DOF) | interplanetary, small body rendezvous |
| 4 | Rocket trajectories | ascent, descent, reentry *(pending library)* |
225 changes: 225 additions & 0 deletions benchmarks/aeosbench/dataset/DATASET_SELECTION.md
@@ -0,0 +1,225 @@
# Test Set Selection: Random Sampling vs. Official Annotations

## TL;DR

This dataset uses **64 randomly selected cases (seed=42)** from the full 1,000 test cases for unbiased evaluation. The original AEOS-Bench paper uses **64 cases suspected of being cherry-picked** based on model performance.

**For reproducing this dataset:**
```bash
python setup_test_data.py
```

See `../setup_test_data.py` for the exact script used to generate this dataset.

---

## Selection Methodology

### This Dataset: Random Selection (Seed=42)

**Selection Criteria:**
- **Deterministic**: Random seed=42 ensures reproducibility
- **Performance-independent**: No filtering based on algorithm performance
- **Unbiased**: Randomly sampled from the full 1,000 test cases

**Python Code:**
```python
import random
random.seed(42)
test_ids = random.sample(range(1000), 64)
```

This approach avoids potential bias from:
1. **Sequential patterns**: First 64 cases might share temporal or generation-order characteristics
2. **Performance patterns**: Cherry-picking based on model performance (see Official Annotations below)
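
The two-line snippet above can also be wrapped so it does not mutate the global RNG state; a minimal sketch (the function name is my own, not from `setup_test_data.py`):

```python
import random

def sample_test_ids(seed: int = 42, pool: int = 1000, k: int = 64) -> list[int]:
    # A dedicated Random instance leaves the global RNG untouched;
    # equivalent to random.seed(seed) followed by random.sample(range(pool), k).
    return random.Random(seed).sample(range(pool), k)

ids = sample_test_ids()
assert ids == sample_test_ids()         # deterministic across calls
assert len(set(ids)) == 64              # 64 distinct case IDs
assert all(0 <= i < 1000 for i in ids)  # drawn from [0, 999]
```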

### Official Annotations: Suspected Cherry-Picking

The AEOS-Bench paper reports results on 64 test cases from the `constellation_data/data/annotations/test.json` file. These annotations show **non-sequential case IDs**:

```json
{"ids": [194, 718, 819, 630, 699, 346, 223, 196, ...], "epochs": [1, 1, 1, ...]}
```

**Evidence of Cherry-Picking:**

#### 1. Non-Sequential IDs
The official test IDs are `[194, 718, 819, 630, ...]` rather than sequential (e.g., `[0, 1, 2, 3, ...]`). This suggests a filtering process rather than a deterministic selection.
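
A trivial check makes the contrast concrete (only the first eight official IDs visible in `test.json` are used here):

```python
def looks_sequential(ids: list[int]) -> bool:
    # A deterministic "first N cases" selection would be exactly 0..N-1 in order.
    return ids == list(range(len(ids)))

official_prefix = [194, 718, 819, 630, 699, 346, 223, 196]  # from test.json
assert not looks_sequential(official_prefix)  # not the first 8 cases
assert looks_sequential(list(range(8)))       # what a sequential split would look like
```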

#### 2. Performance-Based Filtering Code

**File:** `tools/compare_trajectory_cr.py` (lines 9, 48-52)

```python
THRESHOLD = 0.01 # 1% improvement threshold

# ... inside loop comparing epochs ...
best_epoch, best_metric = max(
metrics.items(),
key=lambda item: item[1],
)
if best_metric - metrics[0] >= THRESHOLD: # Line 52 - THE FILTER
annotations.append(Annotation(epoch=best_epoch, id_=i))
```

**What This Does:**
- Compares model performance across multiple training epochs on each test case
- Only includes cases where the model improved by ≥1% completion rate (CR) from the initial epoch
- Creates an annotation file with the "best epoch" for each selected case
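
The filter's effect can be reproduced on toy data (the `Annotation` tuple and the metric values below are stand-ins, not taken from the repository):

```python
from typing import NamedTuple

class Annotation(NamedTuple):
    epoch: int
    id_: int

THRESHOLD = 0.01  # 1% CR improvement, as in compare_trajectory_cr.py

def filter_cases(per_case_metrics: dict[int, dict[int, float]]) -> list[Annotation]:
    # Keep only cases whose best epoch beats epoch 0 by >= THRESHOLD.
    annotations = []
    for case_id, metrics in per_case_metrics.items():
        best_epoch, best_metric = max(metrics.items(), key=lambda item: item[1])
        if best_metric - metrics[0] >= THRESHOLD:
            annotations.append(Annotation(epoch=best_epoch, id_=case_id))
    return annotations

# Toy metrics: case 7 improves CR by 2% at epoch 1; case 8 never improves.
toy = {7: {0: 0.24, 1: 0.26}, 8: {0: 0.30, 1: 0.30}}
assert filter_cases(toy) == [Annotation(epoch=1, id_=7)]  # case 8 is dropped
```

The point of the toy run is that a case with no learning improvement never reaches the annotation file, regardless of how representative it is.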

**Git History:**
```
486dc3c refact: finish Benchmark-dataset 2026-01-31 11:28:43 +0000
```

This commit added the file on **January 31, 2026**.

#### 3. Dataset Loading Code Evidence

**File:** `constellation/new_transformers/dataset.py` (lines 208-218)

```python
def __getitem__(self, index: int) -> Batch:
id_ = self._annotations['ids'][index]
best_epoch_ = self._annotations['epochs'][index] # Uses "best epoch" per case

trajectory: TrajectoryData = torch.load(
DATA_ROOT
/ f'trajectories.{best_epoch_}' # Loads trajectory from best epoch
/ self._split
/ f'{id_ // 1000:02}'
/ f'{id_:05}.pth',
)
```

This code explicitly loads different training epochs for different test cases based on the annotation file, confirming the mechanism.
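
For an unbiased evaluation one would instead load every case from the same epoch; a sketch of the path construction, mirroring the directory scheme above (the `DATA_ROOT` value is an assumption):

```python
from pathlib import Path

DATA_ROOT = Path('data')  # assumption: repo-relative data root

def trajectory_path(id_: int, epoch: int, split: str = 'test') -> Path:
    # Same layout as dataset.py, but with one fixed epoch for every case
    # instead of a per-case "best" epoch from the annotation file.
    return (DATA_ROOT / f'trajectories.{epoch}' / split
            / f'{id_ // 1000:02}' / f'{id_:05}.pth')

assert trajectory_path(194, epoch=1).as_posix() == 'data/trajectories.1/test/00/00194.pth'
```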

#### 4. Paper Claims vs. Reality

**Paper Statement (Section 3.3, line 147):**
> "The test split contains 64 scenarios with 500 satellites, each having realistic properties sourced from the web."

**Reality:**
- The repository contains **1,000 test cases** in `data/constellations/test/00/` (numbered 00000-00999)
- The paper's "64 scenarios" are a filtered subset
- The selection methodology is **not disclosed** in the paper

---

## Dataset Statistics

- **Total cases in this dataset**: 64
- **Case IDs**: Randomly sampled (seed=42) from [0, 999]
- **Source**: AEOS-Bench test split (1,000 total cases)
- **Format**: `constellation.json` + `taskset.json` per case
- **Generated by**: `../setup_test_data.py`

---

## Comparability Note: BSK Version Mismatch

### Why Official Results Are Not Directly Comparable

If you use the verifier from `astro-reason` (which uses PyPI `bsk` v2.9.0), your results will **not be directly comparable** to the original AEOS-Bench paper due to physics simulation differences, even if you use the official dataset.

**Version History:**
- **AEOS-Bench fixtures (original)**: Generated with custom-built `basilisk` **v2.5.13** from `third_party/basilisk` submodule
- **PyPI bsk package**: **v2.9.0** (standard package)

**Impact:**
- Physics simulation differences between v2.5.13 and v2.9.0 result in **~1 task completion difference per 90 tasks** (~1% CR difference)
- Example: Case 157 produces CR=0.2444 (22/90 tasks) with v2.9.0 vs CR=0.2556 (23/90 tasks) with v2.5.13
- The physics within AEOS-Bench scenarios is extremely **numerically sensitive**, so the metrics are expected to diverge further in broader tests.
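
The per-case CR figures above follow from simple task arithmetic:

```python
def completion_rate(completed: int, total: int = 90) -> float:
    # CR = completed tasks / total tasks in the scenario
    return completed / total

assert round(completion_rate(22), 4) == 0.2444  # case 157, bsk v2.9.0
assert round(completion_rate(23), 4) == 0.2556  # case 157, basilisk v2.5.13
```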

**What Changed in BSK:**
Between v2.5.13 and v2.9.0, Basilisk incorporated:
- Bug fixes in attitude dynamics integration
- Numerical precision improvements
- Updated RW (reaction wheel) friction models
- Modified orbit propagation algorithms

**Recommendation:**
If comparing with the paper, acknowledge this version difference:
> "Note: Our evaluation uses PyPI bsk v2.9.0 while the original paper used custom-built basilisk v2.5.13. Physics simulation differences may result in ±1% CR variation."

---

## Experimental Validation (TODO)

To fully characterize the selection bias, we plan to conduct the following experiments:

### Experiment 1: Model Performance on Cherry-Picked vs Random Cases
**Hypothesis:** The cherry-picked cases favor methods that show learning improvements.

**Method:**
1. Train AEOS-Former on the train split
2. Evaluate on:
- Official 64 cases (cherry-picked)
- Random 64 cases (this dataset)
3. Compare metrics (CR, WCR, PCR, WPCR, TAT, PC)

**Expected Result:**
- Higher metrics on cherry-picked cases
- Difference quantifies selection bias

### Experiment 2: All 1,000 Test Cases
**Hypothesis:** Performance on the full test set is lower than on cherry-picked subset.

**Method:**
1. Evaluate AEOS-Former on all 1,000 test cases
2. Compare distribution of CR across:
- All 1,000 cases
- Cherry-picked 64 cases
- Random 64 cases (this dataset)

**Expected Result:**
- Cherry-picked cases cluster at higher CR values
- Random sample represents full distribution better
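
Once the evaluations exist, the distribution comparison could be summarized like this (the CR lists below are toy placeholders, not measured results):

```python
import statistics

def summarize(crs: list[float]) -> dict[str, float]:
    # Summary statistics for comparing CR distributions across test sets.
    return {'mean': statistics.mean(crs),
            'median': statistics.median(crs),
            'stdev': statistics.stdev(crs)}

# Toy CR values standing in for the cherry-picked and random evaluation sets.
cherry_picked = [0.30, 0.32, 0.35, 0.33]
random_sample = [0.22, 0.31, 0.18, 0.27]
assert summarize(cherry_picked)['mean'] > summarize(random_sample)['mean']
```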

### Experiment 3: Baseline Methods
**Hypothesis:** Baseline methods also benefit from cherry-picking.

**Method:**
1. Implement baseline methods (greedy, random, heuristic)
2. Evaluate on all three test sets
3. Compare relative improvement between baselines and AEOS-Former

**Expected Result:**
- Cherry-picking inflates all methods' performance
- Relative rankings may differ on unbiased test set

---

## Usage

```python
import json
from pathlib import Path

# Load a case
case_id = 42 # Example case ID
case_dir = Path('dataset/cases') / f'{case_id:05d}'

with open(case_dir / 'constellation.json') as f:
constellation = json.load(f)

with open(case_dir / 'taskset.json') as f:
taskset = json.load(f)

# Run your scheduler
result = your_scheduler.schedule(constellation, taskset)
```

---

## References

- **AEOS-Bench Paper**: "Towards Realistic Earth-Observation Constellation Scheduling: Benchmark and Methodology", NeurIPS 2025
- **Basilisk Documentation**: https://hanspeterschaub.info/basilisk/
- **Investigation Report**: `../TEST_SET_INVESTIGATION_REPORT.md`

---

*Generated: 2026-02-10*
*Script: `../setup_test_data.py`*
*Random Seed: 42*