This is the official implementation of the **FISBe (FlyLight Instance Segmentation Benchmark)** evaluation protocol.

The benchmark supports 2D and 3D segmentations and computes a wide range of commonly used evaluation metrics (e.g., AP, F1, coverage). Crucially, it provides specialized error attribution for topological errors (False Merges, False Splits) relevant to filamentous structures.

### Key Features
- **Official Protocol:** Implements the exact ranking score ($S$) and matching logic defined in the FISBe paper.
- **Topology-Aware:** Uses skeleton-based localization (`clDice`) to handle thin structures robustly.
- **Error Attribution:** Explicitly quantifies False Merges (FM) and False Splits (FS) via many-to-many matching.
- **Flexibility:** Supports HDF5 (`.hdf`, `.h5`) and Zarr (`.zarr`) files.
- **Modes:** Single file, folder evaluation, or 3x stability analysis.
- **Partly Labeled Support:** Robust evaluation that ignores background conflicts for sparse Ground Truth.

---

## Installation

The recommended way to install is using `uv` (fastest) or `micromamba`.
### Option 1: Using `uv` (Fastest)

```bash
# 1. Install uv (if not installed)
pip install uv

# 2. Clone and install
git clone https://github.com/Kainmueller-Lab/evaluate-instance-segmentation.git
cd evaluate-instance-segmentation
uv venv
uv pip install -e .
```

### Option 2: Using `micromamba` or `conda`

```bash
micromamba create -n evalinstseg python=3.10
```

## Usage: Command Line

The `evalinstseg` command is automatically available after installation.
### 1. Evaluate a Single File
```bash
evalinstseg \
--res_file tests/pred/sample_01.hdf \
--res_key volumes/gmm_label_cleaned \
--gt_file tests/gt/sample_01.zarr \
--gt_key volumes/gt_instances \
--split_file assets/sample_list_per_split.txt \
--out_dir tests/results \
--app flylight
```
### 2. Evaluate a Folder
If you provide a directory path to `--res_file`, the tool will look for matching files in the `--gt_file` directory.
```bash
evalinstseg \
--res_file /path/to/predictions_folder \
--res_key volumes/gmm_label_cleaned \
--gt_file /path/to/ground_truth_folder \
--gt_key volumes/gt_instances \
  --out_dir /path/to/output_folder
```

### 3. Stability Analysis
Evaluates three repeated runs of the same model on the same samples and reports how stable the resulting scores are.

**Requirements:**

- `--run_dirs`: Provide exactly 3 folders.
- `--gt_file`: The folder containing Ground Truth files (filenames must match predictions).

### 4. Partly Labeled Data
If your ground truth is sparse (not fully dense), use the `--partly` flag. See the **Partly Labeled Data Mode** section for details on how False Positives are handled.

## Usage: Python Package
You can integrate the benchmark directly into your Python scripts or notebooks.
```python
metrics = evaluate_volume(
    ...,  # prediction/GT arrays and localization settings (elided here)
    add_general_metrics=["false_merge", "false_split"]
)
```
## FISBe Benchmark Protocol
For a complete reference of all calculated metrics, see [docs/METRICS.md](docs/METRICS.md).
> **Note:** Some output keys use internal names; see the documentation for the exact mapping to website/leaderboard columns.

### Official FlyLight Configuration (`--app flylight`)
The `flylight` preset implements the specific metrics described in the FISBe paper for evaluating long-range thin filamentous neuronal structures.

**Primary Ranking Score ($S$)**
The single scalar used to rank methods on the leaderboard:
$$S = 0.5 \cdot \text{avF1} + 0.5 \cdot C$$
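As a sanity check, the ranking score is just an equal-weight average of the two components; the values below are illustrative, not real benchmark numbers:

```python
def ranking_score(av_f1: float, coverage: float) -> float:
    """FISBe primary ranking score: equal-weight mean of avF1 and coverage C."""
    return 0.5 * av_f1 + 0.5 * coverage

# Illustrative values only.
print(ranking_score(0.40, 0.60))  # 0.5
```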

### Key Metrics
- **avF1**: Average F1 score across clDice thresholds.
- **C (Coverage)**: Average GT skeleton coverage (assignment via max clPrecision; scoring via clRecall on union of matches).
- **clDiceTP**: Average clDice score of matched TPs at threshold 0.5.
- **tp**: Relative number of TPs at threshold 0.5 (`TP_0.5 / N_GT`).
- **FS (False Splits)**: Sum over GT of `max(0, N_assigned_pred - 1)`.
- **FM (False Merges)**: Sum over predictions of `max(0, N_assigned_gt - 1)`.
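The FS/FM counting rules above can be sketched from a set of matched pairs. The pairs here are hypothetical inputs; the real tool derives them via the paper's greedy many-to-many matching rather than taking them as given:

```python
from collections import Counter

def count_fs_fm(assignments: list[tuple[int, int]]) -> tuple[int, int]:
    """assignments: matched (gt_id, pred_id) pairs from many-to-many matching.
    FS: per GT instance, each assigned prediction beyond the first is a split.
    FM: per prediction, each assigned GT instance beyond the first is a merge."""
    preds_per_gt = Counter(gt for gt, _ in assignments)
    gts_per_pred = Counter(pred for _, pred in assignments)
    fs = sum(max(0, n - 1) for n in preds_per_gt.values())
    fm = sum(max(0, n - 1) for n in gts_per_pred.values())
    return fs, fm

# GT 1 is split across preds 10 and 11; pred 10 also merges in GT 2.
print(count_fs_fm([(1, 10), (1, 11), (2, 10)]))  # (1, 1)
```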

### Partly Labeled Data Mode (`--partly`)
FISBe includes 71 partly labeled images where only a subset of neurons is annotated.
- **Logic**: Unmatched predictions are only counted as False Positives if they match a **Foreground GT instance**.
- **Background Exclusion**: Predictions matching background (unlabeled regions) are ignored.
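Under the assumption that each unmatched prediction comes with a row of overlap scores against all GT labels (index 0 reserved for background), the decision reduces to a best-match check; a minimal sketch:

```python
def is_false_positive_partly(overlap_row: list[float]) -> bool:
    """overlap_row[k]: localization score of this unmatched prediction vs GT
    label k, with k == 0 reserved for background. In --partly mode the
    prediction only counts as FP if its best match is foreground (index > 0)."""
    best_gt = overlap_row.index(max(overlap_row))
    return best_gt > 0

# Best match is background (index 0) -> ignored, not an FP.
print(is_false_positive_partly([0.9, 0.1, 0.0]))  # False
# Best match is foreground GT 2 -> counted as FP.
print(is_false_positive_partly([0.2, 0.1, 0.7]))  # True
```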

## Output Structure
Metrics returned by the API or saved to disk are grouped into category-specific dictionaries:

```python
metrics["confusion_matrix"]
├── TP / FP / FN # Counts across all images
├── precision / recall # Standard detection metrics
└── avAP # Mean precision × recall proxy

metrics["general"]
├── aggregate_score # S (Official Ranking Score)
├── avg_gt_skel_coverage # C (Coverage)
├── FM # Global False Merge count
└── FS # Global False Split count

metrics["curves"]
└── F1_0.1 … F1_0.9 # Per-threshold performance
```
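Assuming a metrics dictionary shaped like the tree above (all values fabricated for illustration), avF1 can be recovered from the per-threshold curves and combined with coverage to reproduce the ranking score:

```python
def av_f1(curves: dict) -> float:
    """Average the per-threshold F1 entries (keys F1_0.1 ... F1_0.9)."""
    f1s = [v for k, v in curves.items() if k.startswith("F1_")]
    return sum(f1s) / len(f1s)

# Fabricated example output for illustration only.
metrics = {
    "general": {"aggregate_score": 0.45, "avg_gt_skel_coverage": 0.5},
    "curves": {"F1_0.1": 0.6, "F1_0.5": 0.4, "F1_0.9": 0.2},
}
a = av_f1(metrics["curves"])  # ≈ 0.4
s = 0.5 * a + 0.5 * metrics["general"]["avg_gt_skel_coverage"]
print(round(s, 6))  # 0.45, consistent with aggregate_score
```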
---

**assets/sample_list_per_split.txt**
(1) samples for FlyLight completely:

train:
R38F04-20181005_63_G3
R38F04-20181005_63_G5
R38F04-20181005_63_H1
R53A10-20181019_64_A4
R75E01-20181030_64_D1
VT008647-20171222_63_D2
VT008647-20171222_63_D1
VT008647-20171222_63_E1
VT019303-20171013_65_B6
VT019307-20171013_65_F1
VT033051-20171128_61_E4
VT033051-20171128_61_E2
VT040433-20170919_63_D6
VT047848-20171020_66_I3
VT047848-20171020_66_I2
VT047848-20171020_66_J2
VT047848-20171020_66_I5
VT061467-20180911_62_E5

val:
R22C03-20180918_66_J2
VT012403-20171128_61_B2
VT033614-20171124_64_H1
VT033614-20171124_64_H5
VT041298-20171114_63_C3

test:
JRC_SS04989-20160318_24_A2
R14A02-20180905_65_A6
R54A09-20181019_64_H1
VT011145-20171222_63_I1
VT027175-20171031_62_H3
VT027175-20171031_62_H4
VT050157-20171110_61_C1

(2) samples for FlyLight partly:

train:
R14B11-20180905_65_D2
R14B11-20180905_65_D6
R24D12-20180921_65_J6
R38F04-20181005_63_G2
R38F04-20181005_63_G4
VT003236-20170602_62_G4
VT003236-20170602_62_G5
VT007080-20170517_61_A2
VT007080-20170517_61_A4
VT007080-20170517_61_A5
VT008135-20171122_61_C2
VT008647-20171222_63_D5
VT008647-20171222_63_D6
VT010264-20171222_63_H2
VT010264-20171222_63_H5
VT011049-20180918_66_I1
VT024641-20170615_62_D2
VT024641-20170615_62_D3
VT024641-20170615_62_D5
VT024641-20170615_62_D6
VT024641-20170615_62_E1
VT025523-20170915_64_I1
VT026776-20171017_62_J1
VT033051-20171128_61_E3
VT033296-20171010_62_B4
VT034391-20171128_61_G2
VT038149-20171103_62_F1
VT039484-20171020_64_C1
VT039484-20171020_64_C2
VT040430-20170919_63_C4
VT040433-20170919_63_E1
VT045568-20171020_66_C5
VT045568-20171020_66_D2
VT047848-20171020_66_I1
VT047848-20171020_66_I4
VT047848-20171020_66_J1
VT050217-20171110_61_D6
VT050217-20171110_61_E1
VT058568-20170926_64_E1
VT060731-20170517_63_F1
VT060731-20170517_63_F2
VT061467-20180911_62_E4
VT062059-20170727_61_D4

val:
JRC_SS05008-20160318_24_B1
JRC_SS05008-20160318_24_B2
R22C03-20180918_66_J1
R9F03-20181030_62_B5
VT008194-20171222_63_A3
VT008194-20171222_63_A5
VT012403-20171128_61_B1
VT033614-20171124_64_H4
VT039350-20171020_64_A1
VT039350-20171020_64_A3
VT039350-20171020_64_A6
VT059775-20170630_63_D5

test:
R54A09-20181019_64_H4
R54A09-20181019_64_H6
R73H08-20181030_62_G5
VT006202-20170511_63_C4
VT011145-20171222_63_I2
VT021537-20171003_61_C3
VT023747-20171017_61_F1
VT027175-20171031_62_H6
VT028606-20170721_65_A2
VT028606-20170721_65_A3
VT033453-20170721_65_D2
VT033453-20170721_65_D4
VT033453-20170721_65_D5
VT046838-20170922_62_A2
VT050157-20171110_61_C5
VT058571-20170926_64_G6
