
feat(pipeline): End-to-end validation with ground truth meshes #5


Summary

Add real end-to-end testing by reconstructing objects with known ground truth meshes and textures, then comparing results via quantitative metrics.

Motivation

Current tests validate CLI/script behavior but not reconstruction quality. We need regression detection for actual output quality when the pipeline or models change.

Proposed Design

Trigger & Execution

  • Manual kick-off (e.g., scripts/validate.sh)
  • Runs overnight — full pipeline too expensive for CI

Test Assets (stored in R2)

  • Location: r2:hummat-assets/mini-mesh/validation/
  • Initial objects:
    • 016_mokka — geometry only (from the automatica dataset)
    • 018_mustard_bottle — geometry only (from the automatica dataset)
    • YCB mustard bottle — geometry + texture (for texture validation)
  • Format: GT mesh (.ply/.obj), test video, metadata (expected scale, etc.)

Geometry Metrics

Metric                Purpose
------                -------
Chamfer distance      Overall surface accuracy
Hausdorff distance    Worst-case error (catches outliers, thin structures)
F-score @ threshold   % of surface within tolerance (e.g., 1 mm, 5 mm)
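
A minimal sketch of these metrics using Open3D (already a dependency), computed on point clouds sampled from the aligned meshes; the function name and the 5 mm default threshold are illustrative:

```python
import numpy as np
import open3d as o3d

def geometry_metrics(pred: o3d.geometry.PointCloud,
                     gt: o3d.geometry.PointCloud,
                     tau: float = 0.005) -> dict:
    """Chamfer, Hausdorff and F-score between two aligned point clouds (metres)."""
    # Nearest-neighbour distances in both directions.
    d_pg = np.asarray(pred.compute_point_cloud_distance(gt))
    d_gp = np.asarray(gt.compute_point_cloud_distance(pred))
    chamfer = d_pg.mean() + d_gp.mean()        # symmetric Chamfer distance
    hausdorff = max(d_pg.max(), d_gp.max())    # worst-case error
    precision = (d_pg < tau).mean()            # predicted surface within tau of GT
    recall = (d_gp < tau).mean()               # GT surface covered by the prediction
    fscore = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"chamfer": chamfer, "hausdorff": hausdorff,
            f"fscore@{tau * 1000:.0f}mm": fscore}
```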

Alignment Pipeline

  1. Fast Global Registration (FGR) — coarse alignment using FPFH features (handles unknown correspondences)
  2. Scaled ICP — refine with scale estimation (TransformationEstimationPointToPoint(with_scaling=True))
  3. Fallback — if FGR fitness < threshold, use PCA-based init (centroid + principal axes) → scaled ICP
  4. Outlier filtering — remove predicted points far from GT (handles unmasked runs with extra geometry like table surface)
  5. Compute metrics — on the filtered, aligned point sets (steps 1–3 are sketched below)
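
A rough Open3D sketch of steps 1–3; the voxel size and fitness threshold are placeholders to be tuned, and outlier filtering (step 4) is left out:

```python
import numpy as np
import open3d as o3d

def pca_init(src, tgt):
    """Step 3 fallback: match centroids and principal axes.
    Axis sign/handedness ambiguity is ignored in this sketch."""
    a, b = np.asarray(src.points), np.asarray(tgt.points)
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    ua = np.linalg.svd((a - ca).T @ (a - ca))[0]
    ub = np.linalg.svd((b - cb).T @ (b - cb))[0]
    T = np.eye(4)
    T[:3, :3] = ub @ ua.T
    T[:3, 3] = cb - T[:3, :3] @ ca
    return T

def align(pred, gt, voxel=0.005, min_fitness=0.3):
    """Returns a 4x4 similarity transform mapping pred onto gt."""
    def preprocess(pcd):
        down = pcd.voxel_down_sample(voxel)
        down.estimate_normals(
            o3d.geometry.KDTreeSearchParamHybrid(radius=2 * voxel, max_nn=30))
        fpfh = o3d.pipelines.registration.compute_fpfh_feature(
            down, o3d.geometry.KDTreeSearchParamHybrid(radius=5 * voxel, max_nn=100))
        return down, fpfh

    src, src_fpfh = preprocess(pred)
    tgt, tgt_fpfh = preprocess(gt)

    # Step 1: FGR on FPFH features (no known correspondences).
    fgr = o3d.pipelines.registration.registration_fgr_based_on_feature_matching(
        src, tgt, src_fpfh, tgt_fpfh,
        o3d.pipelines.registration.FastGlobalRegistrationOption(
            maximum_correspondence_distance=1.5 * voxel))

    # Step 3: fall back to PCA init when the coarse alignment is poor.
    init = fgr.transformation if fgr.fitness >= min_fitness else pca_init(src, tgt)

    # Step 2: scaled ICP refinement (estimates rotation, translation and scale).
    icp = o3d.pipelines.registration.registration_icp(
        src, tgt, 2 * voxel, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint(
            with_scaling=True))
    return icp.transformation
```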

Texture Metrics

  • Method: Render-based comparison
    • Render GT and reconstructed mesh from N viewpoints
    • Compare rendered images via PSNR, SSIM, LPIPS
  • Captures: Albedo accuracy + sharpness
  • Renderer: trimesh + pyrender (default), with nvdiffrast as an optional faster backend; the per-view comparison is sketched below
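
A sketch of the per-view comparison, assuming GT and reconstruction have already been rendered from the same camera into uint8 RGB arrays. scikit-image is an assumed extra dependency for PSNR/SSIM; the lpips call follows the library's documented tensor convention:

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

_lpips = lpips.LPIPS(net="alex")  # perceptual distance, lower is better

def compare_views(img_gt: np.ndarray, img_pred: np.ndarray) -> dict:
    """img_gt, img_pred: (H, W, 3) uint8 renders from the same viewpoint."""
    psnr = peak_signal_noise_ratio(img_gt, img_pred)
    ssim = structural_similarity(img_gt, img_pred, channel_axis=-1)

    def to_tensor(img):
        # lpips expects NCHW float tensors scaled to [-1, 1]
        return torch.from_numpy(img).float().permute(2, 0, 1)[None] / 127.5 - 1.0

    with torch.no_grad():
        lp = _lpips(to_tensor(img_gt), to_tensor(img_pred)).item()
    return {"psnr": psnr, "ssim": ssim, "lpips": lp}
```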

Output

  • JSON report: Machine-readable metrics for tracking
  • Markdown summary: Human-readable results
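
A hypothetical report writer illustrating both outputs (file names and layout are placeholders; assumes a flat dict of float metrics):

```python
import json
from pathlib import Path

def write_reports(metrics: dict, out_dir: Path) -> None:
    """Dump metrics as report.json and a small Markdown table in report.md."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "report.json").write_text(json.dumps(metrics, indent=2))
    rows = ["| Metric | Value |", "| --- | --- |"]
    rows += [f"| {name} | {value:.4f} |" for name, value in metrics.items()]
    (out_dir / "report.md").write_text("\n".join(rows) + "\n")
```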

Regression Detection

  • Initially: Manual inspection, no automatic pass/fail
  • Once baselines are established: relative thresholds with an absolute floor
    • Flag if metrics degrade > X% from stored baseline
    • OR if metrics fall below absolute minimum acceptable quality
    • Catches both sudden regressions and slow drift over time (see the sketch below)
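
The check itself could be as simple as this sketch (the 10% degradation threshold is a placeholder):

```python
def regressed(current: float, baseline: float, floor: float,
              max_rel: float = 0.10, higher_is_better: bool = True) -> bool:
    """True if a metric degraded more than max_rel from its baseline
    or crossed the absolute quality floor."""
    if higher_is_better:  # e.g. F-score, PSNR, SSIM
        return current < baseline * (1 - max_rel) or current < floor
    # lower-is-better metrics (e.g. Chamfer distance, LPIPS): the floor acts as a ceiling
    return current > baseline * (1 + max_rel) or current > floor
```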

Alternatives Considered

  1. Manual visual inspection only: Compare reconstructed meshes by eye in Blender/MeshLab. Rejected because: subjective, not reproducible, doesn't scale, can't detect subtle regressions.

  2. Synthetic data with perfect GT: Render synthetic scenes and reconstruct them. Considered but deprioritized because: doesn't test real-world capture challenges (lighting variation, motion blur, reflections), though useful as a complementary test later.

  3. Unit tests on pipeline stages only: Test individual components (SfM accuracy, mesh extraction) in isolation. Rejected as primary approach because: integration bugs and quality drift across stages would go undetected. Current CLI tests already cover this partially.

  4. CI integration with smaller assets: Run validation on every PR with tiny test objects. Rejected because: meaningful reconstruction takes hours on GPU; tiny objects don't exercise the full pipeline realistically. Manual/nightly trigger is more practical.

Tasks

  • Set up R2 storage structure for validation assets
  • Record/prepare test videos for initial objects
  • Upload GT meshes and videos to R2
  • Implement alignment pipeline (FGR + scaled ICP + PCA fallback + filtering)
  • Implement geometry metrics (Chamfer, Hausdorff, F-score)
  • Implement render-based texture comparison
  • Create scripts/validate.sh entry point
  • Generate JSON + Markdown reports
  • Document usage in docs/
  • Add baseline storage and regression detection (after initial baselines captured)

Dependencies

  • trimesh — mesh loading, point sampling
  • pyrender — rendering
  • lpips — perceptual similarity
  • open3d — FGR, scaled ICP, point cloud operations

Future Considerations

  • Add more objects incrementally (challenging cases: reflective surfaces, thin structures)
  • Integration with scheduled runs (nightly/weekly)
