
feat(pipeline): End-to-end validation with ground truth meshes #5


Summary

Add real end-to-end testing by reconstructing objects with known ground truth meshes and textures, then comparing results via quantitative metrics.

Motivation

Current tests validate CLI/script behavior but not reconstruction quality. We need regression detection for actual output quality when the pipeline or models change.

Proposed Design

Trigger & Execution

  • Manual kick-off (e.g., scripts/validate.sh)
  • Runs overnight — full pipeline too expensive for CI

Test Assets (stored in R2)

  • Location: r2:hummat-assets/mini-mesh/validation/
  • Initial objects:
    • 016_mokka — geometry only (from the automatica dataset)
    • 018_mustard_bottle — geometry only (from the automatica dataset)
    • YCB mustard bottle — geometry + texture (for texture validation)
  • Format: GT mesh (.ply/.obj), test video, metadata (expected scale, etc.)

Geometry Metrics

Metric                Purpose
------                -------
Chamfer distance      Overall surface accuracy
Hausdorff distance    Worst-case error (catches outliers, thin structures)
F-score @ threshold   % of surface within tolerance (e.g., 1 mm, 5 mm)
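
A minimal sketch of these metrics using Open3D (already a dependency), computed on point clouds sampled from the aligned meshes; the function name and the 5 mm default threshold are illustrative:

```python
import numpy as np
import open3d as o3d

def geometry_metrics(pred: o3d.geometry.PointCloud,
                     gt: o3d.geometry.PointCloud,
                     tau: float = 0.005) -> dict:
    """Chamfer, Hausdorff and F-score between two aligned point clouds (metres)."""
    # Nearest-neighbour distances in both directions.
    d_pg = np.asarray(pred.compute_point_cloud_distance(gt))
    d_gp = np.asarray(gt.compute_point_cloud_distance(pred))
    chamfer = d_pg.mean() + d_gp.mean()        # symmetric Chamfer distance
    hausdorff = max(d_pg.max(), d_gp.max())    # worst-case error
    precision = (d_pg < tau).mean()            # predicted surface within tau of GT
    recall = (d_gp < tau).mean()               # GT surface covered by the prediction
    fscore = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"chamfer": chamfer, "hausdorff": hausdorff,
            f"fscore@{tau * 1000:.0f}mm": fscore}
```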

Alignment Pipeline

  1. Fast Global Registration (FGR) — coarse alignment using FPFH features (handles unknown correspondences)
  2. Scaled ICP — refine with scale estimation (TransformationEstimationPointToPoint(with_scaling=True))
  3. Fallback — if FGR fitness < threshold, use PCA-based init (centroid + principal axes) → scaled ICP
  4. Outlier filtering — remove predicted points far from GT (handles unmasked runs with extra geometry like table surface)
  5. Compute metrics — on the filtered, aligned point sets (steps 1–3 are sketched below)
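
A rough Open3D sketch of steps 1–3; the voxel size and fitness threshold are placeholders to be tuned, and outlier filtering (step 4) is left out:

```python
import numpy as np
import open3d as o3d

def pca_init(src, tgt):
    """Step 3 fallback: match centroids and principal axes.
    Axis sign/handedness ambiguity is ignored in this sketch."""
    a, b = np.asarray(src.points), np.asarray(tgt.points)
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    ua = np.linalg.svd((a - ca).T @ (a - ca))[0]
    ub = np.linalg.svd((b - cb).T @ (b - cb))[0]
    T = np.eye(4)
    T[:3, :3] = ub @ ua.T
    T[:3, 3] = cb - T[:3, :3] @ ca
    return T

def align(pred, gt, voxel=0.005, min_fitness=0.3):
    """Returns a 4x4 similarity transform mapping pred onto gt."""
    def preprocess(pcd):
        down = pcd.voxel_down_sample(voxel)
        down.estimate_normals(
            o3d.geometry.KDTreeSearchParamHybrid(radius=2 * voxel, max_nn=30))
        fpfh = o3d.pipelines.registration.compute_fpfh_feature(
            down, o3d.geometry.KDTreeSearchParamHybrid(radius=5 * voxel, max_nn=100))
        return down, fpfh

    src, src_fpfh = preprocess(pred)
    tgt, tgt_fpfh = preprocess(gt)

    # Step 1: FGR on FPFH features (no known correspondences).
    fgr = o3d.pipelines.registration.registration_fgr_based_on_feature_matching(
        src, tgt, src_fpfh, tgt_fpfh,
        o3d.pipelines.registration.FastGlobalRegistrationOption(
            maximum_correspondence_distance=1.5 * voxel))

    # Step 3: fall back to PCA init when the coarse alignment is poor.
    init = fgr.transformation if fgr.fitness >= min_fitness else pca_init(src, tgt)

    # Step 2: scaled ICP refinement (estimates rotation, translation and scale).
    icp = o3d.pipelines.registration.registration_icp(
        src, tgt, 2 * voxel, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint(
            with_scaling=True))
    return icp.transformation
```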

Texture Metrics

  • Method: Render-based comparison
    • Render GT and reconstructed mesh from N viewpoints
    • Compare rendered images via PSNR, SSIM, LPIPS
  • Captures: Albedo accuracy + sharpness
  • Renderer: trimesh + pyrender (default), with nvdiffrast as an optional faster backend; the per-view comparison is sketched below
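
A sketch of the per-view comparison, assuming GT and reconstruction have already been rendered from the same camera into uint8 RGB arrays. scikit-image is an assumed extra dependency for PSNR/SSIM; the lpips call follows the library's documented tensor convention:

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

_lpips = lpips.LPIPS(net="alex")  # perceptual distance, lower is better

def compare_views(img_gt: np.ndarray, img_pred: np.ndarray) -> dict:
    """img_gt, img_pred: (H, W, 3) uint8 renders from the same viewpoint."""
    psnr = peak_signal_noise_ratio(img_gt, img_pred)
    ssim = structural_similarity(img_gt, img_pred, channel_axis=-1)

    def to_tensor(img):
        # lpips expects NCHW float tensors scaled to [-1, 1]
        return torch.from_numpy(img).float().permute(2, 0, 1)[None] / 127.5 - 1.0

    with torch.no_grad():
        lp = _lpips(to_tensor(img_gt), to_tensor(img_pred)).item()
    return {"psnr": psnr, "ssim": ssim, "lpips": lp}
```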

Output

  • JSON report: Machine-readable metrics for tracking
  • Markdown summary: Human-readable results
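
A hypothetical report writer illustrating both outputs (file names and layout are placeholders; assumes a flat dict of float metrics):

```python
import json
from pathlib import Path

def write_reports(metrics: dict, out_dir: Path) -> None:
    """Dump metrics as report.json and a small Markdown table in report.md."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "report.json").write_text(json.dumps(metrics, indent=2))
    rows = ["| Metric | Value |", "| --- | --- |"]
    rows += [f"| {name} | {value:.4f} |" for name, value in metrics.items()]
    (out_dir / "report.md").write_text("\n".join(rows) + "\n")
```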

Regression Detection

  • Initially: Manual inspection, no automatic pass/fail
  • Once baselines are established: relative thresholds with an absolute floor
    • Flag if metrics degrade > X% from stored baseline
    • OR if metrics fall below absolute minimum acceptable quality
    • Catches both sudden regressions and slow drift over time (see the sketch below)
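
The check itself could be as simple as this sketch (the 10% degradation threshold is a placeholder):

```python
def regressed(current: float, baseline: float, floor: float,
              max_rel: float = 0.10, higher_is_better: bool = True) -> bool:
    """True if a metric degraded more than max_rel from its baseline
    or crossed the absolute quality floor."""
    if higher_is_better:  # e.g. F-score, PSNR, SSIM
        return current < baseline * (1 - max_rel) or current < floor
    # lower-is-better metrics (e.g. Chamfer distance, LPIPS): the floor acts as a ceiling
    return current > baseline * (1 + max_rel) or current > floor
```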

Alternatives Considered

  1. Manual visual inspection only: Compare reconstructed meshes by eye in Blender/MeshLab. Rejected because: subjective, not reproducible, doesn't scale, can't detect subtle regressions.

  2. Synthetic data with perfect GT: Render synthetic scenes and reconstruct them. Considered but deprioritized because: doesn't test real-world capture challenges (lighting variation, motion blur, reflections), though useful as a complementary test later.

  3. Unit tests on pipeline stages only: Test individual components (SfM accuracy, mesh extraction) in isolation. Rejected as primary approach because: integration bugs and quality drift across stages would go undetected. Current CLI tests already cover this partially.

  4. CI integration with smaller assets: Run validation on every PR with tiny test objects. Rejected because: meaningful reconstruction takes hours on GPU; tiny objects don't exercise the full pipeline realistically. Manual/nightly trigger is more practical.

Tasks

  • Set up R2 storage structure for validation assets
  • Record/prepare test videos for initial objects
  • Upload GT meshes and videos to R2
  • Implement alignment pipeline (FGR + scaled ICP + PCA fallback + filtering)
  • Implement geometry metrics (Chamfer, Hausdorff, F-score)
  • Implement render-based texture comparison
  • Create scripts/validate.sh entry point
  • Generate JSON + Markdown reports
  • Document usage in docs/
  • Add baseline storage and regression detection (after initial baselines captured)

Dependencies

  • trimesh — mesh loading, point sampling
  • pyrender — rendering
  • lpips — perceptual similarity
  • open3d — FGR, scaled ICP, point cloud operations

Future Considerations

  • Add more objects incrementally (challenging cases: reflective surfaces, thin structures)
  • Integration with scheduled runs (nightly/weekly)
