
Performance Refactoring

Hans edited this page Jan 18, 2026 · 1 revision

Refactoring for Performance (Without Approximations)

This wiki page documents a real-world refactor in the simple codebase that achieved a ~30% speedup through low-level optimization alone.

The refactor:

  • does not change algorithms
  • does not introduce approximations
  • preserves numerical behavior up to normal floating-point roundoff

Performance gains come from restructuring data movement, fusing pipeline stages, and refactoring hot loops to better match how modern CPUs execute code.

This documentation serves two purposes:

  • explain what was changed and why
  • provide guidance for future performance work in simple

Why This Matters (for New Contributors)

At first glance, much of the original code already looks optimized: it uses FFTW, avoids obvious inefficiencies, and is written clearly in terms of scientific intent. So why bother with refactors like this?

Performance is often limited by data movement, not math

In image-processing pipelines, the dominant cost is frequently:

  • streaming large arrays through memory
  • repeated full-image passes
  • rearranging data (for example fftshift, clipping, masking)

Even when each individual operation is efficient, chaining many of them together can leave the pipeline bandwidth-bound.

This refactor shows that reducing the number of full-array passes can yield large speedups without changing any math.
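As a structural illustration (not code from simple), the sketch below contrasts a chain of full-array passes with a single fused sweep. Both produce the same result; the fused form touches each element once, which is what saves memory traffic in compiled code.

```python
import numpy as np

def chained(x, mask):
    # Three full-array passes: each line streams the whole array through memory.
    t = x - x.mean()
    t = t / x.std()
    return t * mask

def fused(x, mask):
    # One pass: the same arithmetic applied per element in a single sweep.
    m, s = x.mean(), x.std()
    out = np.empty_like(x)
    for i in range(x.size):
        out.flat[i] = (x.flat[i] - m) / s * mask.flat[i]
    return out
```

In NumPy the explicit Python loop is of course slower; the point is the shape of the computation. In a compiled kernel the fused loop loads the data once instead of three times.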

Clear code and fast code are not opposites

The original implementation is easy to read because each operation is expressed as a separate method call.

The refactored version keeps that clarity at the orchestration level (for example prepimg4align still reads like a pipeline), while moving complexity into well-named, tightly scoped kernels.

Key idea:

  • readable pipelines at the top
  • specialized, performance-aware kernels underneath

“No approximations” does not mean “no refactoring”

A common misconception is that performance gains require:

  • approximations
  • shortcuts
  • reduced precision
  • altered algorithms

This example demonstrates the opposite:

  • the math is unchanged
  • results are numerically equivalent (up to floating-point roundoff)
  • the speedup comes from restructuring work, not skipping it

Understanding mathematical equivalences (for example spatial shifts versus Fourier phase modulation) is a powerful optimization tool.
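The equivalence mentioned here is the Fourier shift theorem: a circular shift in real space equals multiplication by a linear phase in Fourier space. A minimal NumPy check (illustrative only; simple's kernels are Fortran):

```python
import numpy as np

n, s = 16, 3                        # signal length and integer shift
x = np.random.default_rng(0).standard_normal(n)

# Real-space circular shift by s samples.
shifted = np.roll(x, s)

# The same shift applied as a phase ramp in Fourier space.
k = np.arange(n)
phase = np.exp(-2j * np.pi * k * s / n)
via_fourier = np.fft.ifft(np.fft.fft(x) * phase).real

assert np.allclose(shifted, via_fourier)
```

Because the phase factor can be applied while data is already being streamed in Fourier space (for example during clipping), the shift costs essentially nothing extra.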

Mechanical sympathy

These changes reflect how modern CPUs actually execute code:

  • memory bandwidth is precious
  • branches and function calls in hot loops are expensive
  • compilers optimize best when loops are simple and uniform

Writing fast code in simple often means shaping algorithms so they work with the hardware, not against it.

Case Study: prepimg4align

This case study focuses on refactoring the particle preprocessing pipeline used before alignment.

Original pipeline (conceptual)

The baseline implementation performs the following steps:

normalize
→ FFT
→ clip
→ optional spatial shift
→ CTF phase flip
→ IFFT
→ mask
→ optional instrument-function division
→ FFT

Each step typically performs a full pass over the image.

Refactored pipeline

The refactored version preserves the same logical steps but consolidates most work into two fused kernels:

  • norm_noise_fft_clip_shift
  • ifft_mask_divwinstrfun_fft

The call site remains short and readable, while the heavy lifting is done inside specialized kernels.
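A sketch of the shape of the refactored call site. This is a Python stand-in with drastically simplified kernel bodies; the real kernels are Fortran and do considerably more (noise normalization, optional shift, CTF phase flip, instrument-function division):

```python
import numpy as np

def norm_noise_fft_clip_shift(img, box):
    """Fused kernel 1 (simplified): normalize, forward FFT, clip to `box`."""
    img = (img - img.mean()) / img.std()
    ft = np.fft.fftshift(np.fft.fft2(img))
    c, h = img.shape[0] // 2, box // 2
    return np.fft.ifftshift(ft[c - h:c + h, c - h:c + h])

def ifft_mask_divwinstrfun_fft(ft, mask):
    """Fused kernel 2 (simplified): inverse FFT, real-space mask, forward FFT."""
    return np.fft.fft2(np.fft.ifft2(ft).real * mask)

def prepimg4align(img, box, mask):
    # The orchestration stays a short, readable two-step pipeline.
    ft = norm_noise_fft_clip_shift(img, box)
    return ifft_mask_divwinstrfun_fft(ft, mask)
```

The point of the sketch is the division of labor: the top-level routine still reads like the conceptual pipeline, and all fusion lives inside the two kernels.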

Refactoring principles applied

Fuse full-array passes

Multiple operations that streamed over the entire image were combined into single kernels, reducing memory traffic.

Replace spatial shift with Fourier-domain phase modulation

The real-space shift was replaced by its mathematically equivalent Fourier phase factor, applied during clipping.

This avoids an additional resampling pass while preserving exact behavior.

Combine fftshift with useful work

Quadrant swaps (fftshift) were fused with normalization, masking, and division so that data rearrangement is never done on its own.
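The idea can be shown with a precomputed fftshift index map: the gather through shifted indices and the mask multiply happen in one pass, so the quadrant swap never runs as a standalone sweep (illustrative NumPy, even-sized images assumed):

```python
import numpy as np

x = np.arange(16.0).reshape(4, 4)
mask = (x % 2 == 0).astype(float)

# Separate passes: rearrange the data, then mask it.
separate = np.fft.fftshift(x) * mask

# Fused: gather through the precomputed fftshift index map while applying
# the mask, combining rearrangement with useful work.
n = x.shape[0]
idx = (np.arange(n) + n // 2) % n          # fftshift index map for even n
fused = x[np.ix_(idx, idx)] * mask

assert np.allclose(separate, fused)
```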

Short-circuit near no-op cases

Cheap guards avoid expensive arithmetic when normalization or shifting would have no meaningful effect.
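A minimal sketch of such a guard (hypothetical helper name and tolerance):

```python
import numpy as np

def normalize_maybe(img, tol=1e-6):
    # Cheap guard: skip the full-array pass entirely when the image is
    # already (near-)normalized, since the transform would be a no-op.
    m, s = img.mean(), img.std()
    if abs(m) < tol and abs(s - 1.0) < tol:
        return img
    return (img - m) / s
```

The two scalar reductions are far cheaper than streaming the image through a subtract-and-divide pass that changes nothing.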

Mixed precision with intent

  • double precision for reductions (mean, variance)
  • single precision for per-pixel transforms and FFTs

This preserves numerical intent while matching FFT precision.
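In NumPy terms, the same policy looks like this: accumulate the reductions in float64 even when the pixel data is float32, then do the per-pixel work in float32.

```python
import numpy as np

img = np.random.default_rng(1).standard_normal(10**6).astype(np.float32)

# Reductions accumulate in double precision, avoiding drift when summing
# millions of single-precision pixels...
mean64 = img.mean(dtype=np.float64)
var64 = img.var(dtype=np.float64)

# ...while the per-pixel transform stays in single precision, matching
# the precision of the FFTs downstream.
normalized = ((img - np.float32(mean64)) / np.float32(np.sqrt(var64)))
```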

Case Study: CTF apply_serial

This case study focuses on refactoring the serial CTF application kernel.

Baseline structure

The original implementation:

  • uses a select case construct on the CTF mode
  • duplicates nearly identical nested loops for each mode
  • recomputes geometry and address mapping inside the hot loops

This structure is correct but inefficient for a performance-critical kernel.

Refactored structure

The refactored version introduces:

  • a pure elemental kernel for computing the CTF value
  • a single unified loop
  • masked mode handling via the merge intrinsic
  • memoized geometry and addressing

Refactoring principles applied

Unify duplicated loops

The CTF value is computed once per frequency, and the mode-specific behavior is applied via masked arithmetic.

This reduces instruction cache pressure and improves compiler optimization opportunities.
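Fortran's merge intrinsic selects between two values per element without branching; the NumPy analogue is np.where. A hypothetical stand-in for the unified loop (the sine is a placeholder, not the real CTF formula):

```python
import numpy as np

def apply_ctf(freq_sq, mode):
    # Compute the (placeholder) CTF value once per frequency...
    tval = np.sin(0.5 * freq_sq)

    # ...then apply mode-specific behavior by masked selection instead of
    # duplicating the whole loop per mode.
    # Fortran equivalent: merge(abs(tval), tval, mode == FLIP)
    flip = (mode == "flip")
    return np.where(flip, np.abs(tval), tval)
```

One loop body serves every mode, so the compiler sees a single uniform kernel rather than several near-copies.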

Hoist invariants

Constants derived from wavelength, defocus, and spherical aberration are computed once outside the loop.
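For instance, the standard CTF phase aberration χ(s) = πλΔf·s² − (π/2)·Cs·λ³·s⁴ factors into two loop-invariant constants (the numeric values below are hypothetical):

```python
import numpy as np

wavelength, defocus, cs = 1.97e-2, 1.2, 2.7   # hypothetical optics constants

# Hoisted: these depend only on the optics, never on the loop index.
c1 = np.pi * wavelength * defocus
c2 = 0.5 * np.pi * cs * wavelength**3

# Hot loop body reduces to two multiplies and a subtract per element.
freq_sq = np.linspace(0.0, 0.25, 64)
chi = c1 * freq_sq - c2 * freq_sq**2
```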

Memoize geometry and addressing

Angle, spatial frequency squared, and physical array indices are precomputed and reused.

This removes repeated atan2, division, and address computation from the hot loop.
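The memoization amounts to building the geometry tables once, up front (illustrative NumPy):

```python
import numpy as np

n = 8
ky, kx = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing="ij")

# Precomputed once, outside the hot loop:
ang = np.arctan2(ky, kx)          # angle per frequency component
spa_freq_sq = kx**2 + ky**2       # |k|^2 per frequency component

# The hot loop (or vectorized kernel) then indexes these tables instead of
# calling atan2 and dividing on every iteration.
```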

Make the hot loop SIMD-friendly

The refactored loop:

  • contains no data-dependent branching
  • has a uniform body
  • is easier for the compiler to vectorize

The result is faster execution even in serial mode.

Validation and Numerical Equivalence

This refactor preserves the algorithm exactly. Any differences are limited to expected floating-point roundoff.

Recommended validation steps:

  • compare baseline and refactored outputs in real space
  • compare Fourier-domain representations
  • compute max-absolute and relative error norms
  • test both shift and no-shift paths
  • run end-to-end alignment regression tests

FFT plans (precision and flags) should be kept identical during comparison to avoid confounding effects.
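The error norms in the checklist above can be computed with a small helper like this (hypothetical thresholds; pick tolerances appropriate to the precision in use):

```python
import numpy as np

def error_norms(baseline, refactored):
    # Max-absolute and relative L2 error between baseline and refactored output.
    diff = baseline - refactored
    max_abs = np.abs(diff).max()
    rel = np.linalg.norm(diff) / np.linalg.norm(baseline)
    return max_abs, rel

a = np.random.default_rng(2).standard_normal((32, 32))
b = a + 1e-7 * np.random.default_rng(3).standard_normal((32, 32))
max_abs, rel = error_norms(a, b)
assert max_abs < 1e-5 and rel < 1e-5
```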

Performance Checklist for Future Refactors

When working on performance-critical code in simple:

  • profile first; identify the dominant costs
  • fuse consecutive full-array passes
  • minimize domain round-trips
  • eliminate transcendentals from inner loops
  • hoist invariants aggressively
  • memoize geometry and index mapping
  • prefer uniform loops over mode-specific branching
  • keep orchestration readable; specialize the hot path
  • validate numerical equivalence early and often