This is the official implementation of the **FISBe (FlyLight Instance Segmentation Benchmark)** evaluation protocol.

The benchmark supports 2D and 3D segmentations and computes a wide range of commonly used evaluation metrics (e.g., AP, F1, coverage). Crucially, it provides specialized error attribution for topological errors (False Merges, False Splits) relevant to filamentous structures.

### Key Features
- **Official Protocol:** Implements the exact ranking score ($S$) and matching logic defined in the FISBe paper.
- **Topology-Aware:** Uses skeleton-based localization (`clDice`) to handle thin structures robustly.
- **Error Attribution:** Explicitly quantifies False Merges (FM) and False Splits (FS) via many-to-many matching.
- **Flexibility:** Supports HDF5 (`.hdf`, `.h5`) and Zarr (`.zarr`) files.
- **Modes:** Single file, folder evaluation, or 3x stability analysis.
- **Partly Labeled Support:** Robust evaluation that ignores background conflicts for sparse Ground Truth.

---

## Installation

The recommended way to install is using `uv` (fastest) or `micromamba`.
### Option 1: Using `uv` (Fastest)

```bash
# 1. Install uv (if not installed)
pip install uv

# 2. Clone and install
git clone https://github.com/Kainmueller-Lab/evaluate-instance-segmentation.git
cd evaluate-instance-segmentation
uv venv
uv pip install -e .
```

### Option 2: Using `micromamba` or `conda`

```bash
micromamba create -n evalinstseg python=3.10
```

## Usage: Command Line

The `evalinstseg` command is automatically available after installation.
### 1. Evaluate a Single File
```bash
evalinstseg \
--res_file tests/pred/sample_01.hdf \
--res_key volumes/gmm_label_cleaned \
--gt_file tests/gt/sample_01.zarr \
--gt_key volumes/gt_instances \
--split_file assets/sample_list_per_split.txt \
--out_dir tests/results \
--app flylight
```
### 2. Evaluate a Folder
If you provide a directory path to `--res_file`, the tool will look for matching files in the `--gt_file` directory.
```bash
evalinstseg \
--res_file /path/to/predictions_folder \
--res_key volumes/gmm_label_cleaned \
--gt_file /path/to/ground_truth_folder \
--gt_key volumes/gt_instances \
  --out_dir /path/to/output_folder
```

### 3. Stability Analysis
Evaluates three repeated runs of the same model on the same samples and reports how stable the resulting scores are.

**Requirements:**

- `--run_dirs`: Provide exactly 3 folders.
- `--gt_file`: The folder containing Ground Truth files (filenames must match predictions).

### 4. Partly Labeled Data
If your ground truth is sparse (not fully dense), use the `--partly` flag. See the **Partly Labeled Data Mode** section for details on how False Positives are handled.

## Usage: Python Package
You can integrate the benchmark directly into your Python scripts or notebooks.
```python
metrics = evaluate_volume(
    ...,  # prediction/GT arrays and localization settings (elided here)
    add_general_metrics=["false_merge", "false_split"]
)
```
## FISBe Benchmark Protocol
For a complete reference of all calculated metrics, see [docs/METRICS.md](docs/METRICS.md).
> **Note:** Some output keys use internal names; see the documentation for the exact mapping to website/leaderboard columns.

### Official FlyLight Configuration (`--app flylight`)
The `flylight` preset implements the specific metrics described in the FISBe paper for evaluating long-range thin filamentous neuronal structures.

**Primary Ranking Score ($S$)**
The single scalar used to rank methods on the leaderboard:
$$S = 0.5 \cdot \text{avF1} + 0.5 \cdot C$$
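As a sanity check, the ranking score is just an equal-weight average of the two components; the values below are illustrative, not real benchmark numbers:

```python
def ranking_score(av_f1: float, coverage: float) -> float:
    """FISBe primary ranking score: equal-weight mean of avF1 and coverage C."""
    return 0.5 * av_f1 + 0.5 * coverage

# Illustrative values only.
print(ranking_score(0.40, 0.60))  # 0.5
```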

### Key Metrics
- **avF1**: Average F1 score across clDice thresholds.
- **C (Coverage)**: Average GT skeleton coverage (assignment via max clPrecision; scoring via clRecall on union of matches).
- **clDiceTP**: Average clDice score of matched TPs at threshold 0.5.
- **tp**: Relative number of TPs at threshold 0.5 (`TP_0.5 / N_GT`).
- **FS (False Splits)**: Sum over GT of `max(0, N_assigned_pred - 1)`.
- **FM (False Merges)**: Sum over predictions of `max(0, N_assigned_gt - 1)`.
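The FS/FM counting rules above can be sketched from a set of matched pairs. The pairs here are hypothetical inputs; the real tool derives them via the paper's greedy many-to-many matching rather than taking them as given:

```python
from collections import Counter

def count_fs_fm(assignments: list[tuple[int, int]]) -> tuple[int, int]:
    """assignments: matched (gt_id, pred_id) pairs from many-to-many matching.
    FS: per GT instance, each assigned prediction beyond the first is a split.
    FM: per prediction, each assigned GT instance beyond the first is a merge."""
    preds_per_gt = Counter(gt for gt, _ in assignments)
    gts_per_pred = Counter(pred for _, pred in assignments)
    fs = sum(max(0, n - 1) for n in preds_per_gt.values())
    fm = sum(max(0, n - 1) for n in gts_per_pred.values())
    return fs, fm

# GT 1 is split across preds 10 and 11; pred 10 also merges in GT 2.
print(count_fs_fm([(1, 10), (1, 11), (2, 10)]))  # (1, 1)
```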

### Partly Labeled Data Mode (`--partly`)
FISBe includes 71 partly labeled images where only a subset of neurons is annotated.
- **Logic**: Unmatched predictions are only counted as False Positives if they match a **Foreground GT instance**.
- **Background Exclusion**: Predictions matching background (unlabeled regions) are ignored.
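Under the assumption that each unmatched prediction comes with a row of overlap scores against all GT labels (index 0 reserved for background), the decision reduces to a best-match check; a minimal sketch:

```python
def is_false_positive_partly(overlap_row: list[float]) -> bool:
    """overlap_row[k]: localization score of this unmatched prediction vs GT
    label k, with k == 0 reserved for background. In --partly mode the
    prediction only counts as FP if its best match is foreground (index > 0)."""
    best_gt = overlap_row.index(max(overlap_row))
    return best_gt > 0

# Best match is background (index 0) -> ignored, not an FP.
print(is_false_positive_partly([0.9, 0.1, 0.0]))  # False
# Best match is foreground GT 2 -> counted as FP.
print(is_false_positive_partly([0.2, 0.1, 0.7]))  # True
```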

## Output Structure
Metrics returned by the API or saved to disk are grouped into category-specific dictionaries:

```python
metrics["confusion_matrix"]
├── TP / FP / FN # Counts across all images
├── precision / recall # Standard detection metrics
└── avAP # Mean precision × recall proxy

metrics["general"]
├── aggregate_score # S (Official Ranking Score)
├── avg_gt_skel_coverage # C (Coverage)
├── FM # Global False Merge count
└── FS # Global False Split count

metrics["curves"]
└── F1_0.1 … F1_0.9 # Per-threshold performance
```
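Assuming a metrics dictionary shaped like the tree above (all values fabricated for illustration), avF1 can be recovered from the per-threshold curves and combined with coverage to reproduce the ranking score:

```python
def av_f1(curves: dict) -> float:
    """Average the per-threshold F1 entries (keys F1_0.1 ... F1_0.9)."""
    f1s = [v for k, v in curves.items() if k.startswith("F1_")]
    return sum(f1s) / len(f1s)

# Fabricated example output for illustration only.
metrics = {
    "general": {"aggregate_score": 0.45, "avg_gt_skel_coverage": 0.5},
    "curves": {"F1_0.1": 0.6, "F1_0.5": 0.4, "F1_0.9": 0.2},
}
a = av_f1(metrics["curves"])  # ≈ 0.4
s = 0.5 * a + 0.5 * metrics["general"]["avg_gt_skel_coverage"]
print(round(s, 6))  # 0.45, consistent with aggregate_score
```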
---

**assets/sample_list_per_split.txt**
(1) samples for FlyLight completely:

train:
R38F04-20181005_63_G3
R38F04-20181005_63_G5
R38F04-20181005_63_H1
R53A10-20181019_64_A4
R75E01-20181030_64_D1
VT008647-20171222_63_D2
VT008647-20171222_63_D1
VT008647-20171222_63_E1
VT019303-20171013_65_B6
VT019307-20171013_65_F1
VT033051-20171128_61_E4
VT033051-20171128_61_E2
VT040433-20170919_63_D6
VT047848-20171020_66_I3
VT047848-20171020_66_I2
VT047848-20171020_66_J2
VT047848-20171020_66_I5
VT061467-20180911_62_E5

val:
R22C03-20180918_66_J2
VT012403-20171128_61_B2
VT033614-20171124_64_H1
VT033614-20171124_64_H5
VT041298-20171114_63_C3

test:
JRC_SS04989-20160318_24_A2
R14A02-20180905_65_A6
R54A09-20181019_64_H1
VT011145-20171222_63_I1
VT027175-20171031_62_H3
VT027175-20171031_62_H4
VT050157-20171110_61_C1

(2) samples for FlyLight partly:

train:
R14B11-20180905_65_D2
R14B11-20180905_65_D6
R24D12-20180921_65_J6
R38F04-20181005_63_G2
R38F04-20181005_63_G4
VT003236-20170602_62_G4
VT003236-20170602_62_G5
VT007080-20170517_61_A2
VT007080-20170517_61_A4
VT007080-20170517_61_A5
VT008135-20171122_61_C2
VT008647-20171222_63_D5
VT008647-20171222_63_D6
VT010264-20171222_63_H2
VT010264-20171222_63_H5
VT011049-20180918_66_I1
VT024641-20170615_62_D2
VT024641-20170615_62_D3
VT024641-20170615_62_D5
VT024641-20170615_62_D6
VT024641-20170615_62_E1
VT025523-20170915_64_I1
VT026776-20171017_62_J1
VT033051-20171128_61_E3
VT033296-20171010_62_B4
VT034391-20171128_61_G2
VT038149-20171103_62_F1
VT039484-20171020_64_C1
VT039484-20171020_64_C2
VT040430-20170919_63_C4
VT040433-20170919_63_E1
VT045568-20171020_66_C5
VT045568-20171020_66_D2
VT047848-20171020_66_I1
VT047848-20171020_66_I4
VT047848-20171020_66_J1
VT050217-20171110_61_D6
VT050217-20171110_61_E1
VT058568-20170926_64_E1
VT060731-20170517_63_F1
VT060731-20170517_63_F2
VT061467-20180911_62_E4
VT062059-20170727_61_D4

val:
JRC_SS05008-20160318_24_B1
JRC_SS05008-20160318_24_B2
R22C03-20180918_66_J1
R9F03-20181030_62_B5
VT008194-20171222_63_A3
VT008194-20171222_63_A5
VT012403-20171128_61_B1
VT033614-20171124_64_H4
VT039350-20171020_64_A1
VT039350-20171020_64_A3
VT039350-20171020_64_A6
VT059775-20170630_63_D5

test:
R54A09-20181019_64_H4
R54A09-20181019_64_H6
R73H08-20181030_62_G5
VT006202-20170511_63_C4
VT011145-20171222_63_I2
VT021537-20171003_61_C3
VT023747-20171017_61_F1
VT027175-20171031_62_H6
VT028606-20170721_65_A2
VT028606-20170721_65_A3
VT033453-20170721_65_D2
VT033453-20170721_65_D4
VT033453-20170721_65_D5
VT046838-20170922_62_A2
VT050157-20171110_61_C5
VT058571-20170926_64_G6
