Performance Refactoring
This wiki documents a real-world refactor in the simple codebase that achieved a ~30% speedup through low-level optimization only.
The refactor:
- does not change algorithms
- does not introduce approximations
- preserves numerical behavior up to normal floating-point roundoff
Performance gains come from restructuring data movement, fusing pipeline stages, and refactoring hot loops to better match how modern CPUs execute code.
This documentation serves two purposes:
- explain what was changed and why
- provide guidance for future performance work in simple
At first glance, much of the original code already looks optimized: it uses FFTW, avoids obvious inefficiencies, and is written clearly in terms of scientific intent. So why bother with refactors like this?
In image-processing pipelines, the dominant cost is frequently:
- streaming large arrays through memory
- repeated full-image passes
- rearranging data (for example fftshift, clipping, masking)
Even when each individual operation is efficient, chaining many of them together can become bandwidth-bound.
This refactor shows that reducing the number of full-array passes can yield large speedups without changing any math.
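The effect can be sketched in NumPy (the production kernels are Fortran; function names and the clip threshold here are illustrative). Chaining array operations streams the image through memory once per step, while a fused kernel does the same math in a single pass:

```python
import numpy as np

def preprocess_unfused(img, mean, sdev, mask):
    # Three logical steps, three full passes over the image,
    # each materializing a temporary array.
    t = (img - mean) / sdev          # pass 1: normalize
    t = np.clip(t, -3.0, 3.0)        # pass 2: clip outliers
    return t * mask                  # pass 3: mask

def preprocess_fused(img, mean, sdev, mask):
    # The same math in one streaming pass, the way a fused compiled
    # kernel executes it (a plain loop for clarity; in Python this is
    # the slow version, in Fortran/C it is the fast one).
    out = np.empty_like(img)
    fi, fm, fo = img.ravel(), mask.ravel(), out.ravel()
    for i in range(fi.size):
        v = (fi[i] - mean) / sdev
        v = min(max(v, -3.0), 3.0)
        fo[i] = v * fm[i]
    return out
```

Both functions produce identical results; only the number of trips through memory differs.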
The original implementation is easy to read because each operation is expressed as a separate method call.
The refactored version keeps that clarity at the orchestration level (for example prepimg4align still reads like a pipeline), while moving complexity into well-named, tightly scoped kernels.
Key idea:
- readable pipelines at the top
- specialized, performance-aware kernels underneath
A common misconception is that performance gains require:
- approximations
- shortcuts
- reduced precision
- altered algorithms
This example demonstrates the opposite:
- the math is unchanged
- results are numerically equivalent (up to floating-point roundoff)
- the speedup comes from restructuring work, not skipping it
Understanding mathematical equivalences (for example spatial shifts versus Fourier phase modulation) is a powerful optimization tool.
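The shift/phase equivalence is the Fourier shift theorem: translating an image in real space equals multiplying its transform by a linear phase ramp. A NumPy sketch (illustrative, not SIMPLE's Fortran code):

```python
import numpy as np

def shift_via_phase(img, dy, dx):
    """Translate img by (dy, dx) via a phase ramp in Fourier space."""
    ny, nx = img.shape
    ky = np.fft.fftfreq(ny)[:, None]   # cycles per pixel along y
    kx = np.fft.fftfreq(nx)[None, :]   # cycles per pixel along x
    phase = np.exp(-2j * np.pi * (ky * dy + kx * dx))
    return np.fft.ifft2(np.fft.fft2(img) * phase).real
```

For integer shifts this matches a circular np.roll exactly (up to roundoff); for fractional shifts it interpolates, which is why applying the phase during an existing Fourier-domain pass can replace a separate resampling pass.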
These changes reflect how modern CPUs actually execute code:
- memory bandwidth is precious
- branches and function calls in hot loops are expensive
- compilers optimize best when loops are simple and uniform
Writing fast code in simple often means shaping algorithms so they work with the hardware, not against it.
The first case study focuses on refactoring the particle preprocessing pipeline used before alignment.
The baseline implementation performs the following steps:
normalize → FFT → clip → optional spatial shift → CTF phase flip → IFFT → mask → optional instrument-function division → FFT
Each step typically performs a full pass over the image.
The refactored version preserves the same logical steps but consolidates most work into two fused kernels:
- norm_noise_fft_clip_shift
- ifft_mask_divwinstrfun_fft
The call site remains short and readable, while the heavy lifting is done inside specialized kernels.
Multiple operations that streamed over the entire image were combined into single kernels, reducing memory traffic.
The real-space shift was replaced by its mathematically equivalent Fourier phase factor, applied during clipping.
This avoids an additional resampling pass while preserving exact behavior.
Quadrant swaps (fftshift) were fused with normalization, masking, and division so that data rearrangement is never done on its own.
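For even-sized images, the quadrant swap can be fused with a scaling (or masking) pass so the rearrangement is never a standalone sweep. A NumPy sketch with an illustrative scale factor:

```python
import numpy as np

def fftshift_scale_fused(img, scale):
    """Quadrant swap fused with scaling: one read and one write per pixel."""
    ny, nx = img.shape            # assumes even dimensions
    h, w = ny // 2, nx // 2
    out = np.empty_like(img)
    out[:h, :w] = img[h:, w:] * scale   # each quadrant is moved and
    out[:h, w:] = img[h:, :w] * scale   # scaled in the same pass,
    out[h:, :w] = img[:h, w:] * scale   # instead of fftshift followed
    out[h:, w:] = img[:h, :w] * scale   # by a separate multiply
    return out
```

The result equals np.fft.fftshift(img) * scale, but the data is rearranged and scaled in one traversal rather than two.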
Cheap guards avoid expensive arithmetic when normalization or shifting would have no meaningful effect.
A mixed-precision strategy was adopted:
- double precision for reductions (mean, variance)
- single precision for per-pixel transforms and FFTs
This preserves numerical intent while matching FFT precision.
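A NumPy sketch of the precision split (illustrative; the production kernels are Fortran): statistics are accumulated in double precision, while the per-pixel normalization stays in single precision to match single-precision FFT plans:

```python
import numpy as np

def normalize_mixed(img32):
    """Reductions in float64, per-pixel transform in float32."""
    acc = img32.astype(np.float64)        # promote only for the reduction
    mean, sdev = acc.mean(), acc.std()
    # Per-pixel arithmetic stays in float32, matching single-precision FFTs.
    return (img32 - np.float32(mean)) / np.float32(sdev)
```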
A second case study focuses on refactoring the serial CTF application kernel.
The original implementation:
- uses a select-case on mode
- duplicates nearly identical nested loops for each mode
- recomputes geometry and address mapping inside hot loops
This structure is correct but inefficient for a performance-critical kernel.
The refactored version introduces:
- a pure elemental kernel for computing the CTF value
- a single unified loop
- masked mode handling using merge
- memoized geometry and addressing
The CTF value is computed once per frequency, and the mode-specific behavior is applied via masked arithmetic.
This reduces instruction cache pressure and improves compiler optimization opportunities.
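The pattern can be sketched in NumPy, with np.where standing in for Fortran's merge; the mode names and the branchy baseline are illustrative, not SIMPLE's actual code:

```python
import numpy as np

def apply_ctf_branchy(ctf, img, mode):
    # Baseline shape: one duplicated loop body per mode.
    if mode == "flip":
        return img * np.sign(ctf)
    if mode == "ctf":
        return img * ctf
    return img * ctf**2               # "square"

def apply_ctf_uniform(ctf, img, mode):
    # Unified body: the CTF is evaluated once, and mode handling is
    # masked arithmetic (merge in Fortran), so the inner loop has a
    # uniform, branch-free body the compiler can vectorize.
    factor = np.where(mode == "flip", np.sign(ctf),
                      np.where(mode == "ctf", ctf, ctf**2))
    return img * factor
```

Both versions agree for every mode; the uniform version trades a tiny amount of extra arithmetic for a single predictable loop body.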
Constants derived from wavelength, defocus, and spherical aberration are computed once outside the loop.
Angle, spatial frequency squared, and physical array indices are precomputed and reused.
This removes repeated atan2, division, and address computation from the hot loop.
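In NumPy terms, the memoization amounts to computing the angular and radial grids once, outside the hot loop (an illustrative sketch; the real Fortran kernel also caches the physical array indices):

```python
import numpy as np

def precompute_geometry(ny, nx):
    """Angle and |k|^2 grids computed once, before the per-pixel loop."""
    ky = np.fft.fftfreq(ny)[:, None]
    kx = np.fft.fftfreq(nx)[None, :]
    ang = np.arctan2(ky, kx)          # hoists atan2 out of the hot loop
    spa_freq_sq = ky * ky + kx * kx   # hoists the frequency products
    return ang, spa_freq_sq
```

The CTF kernel then consumes these cached grids, so the loop body reduces to pure arithmetic on precomputed values.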
The refactored loop:
- contains no data-dependent branching
- has a uniform body
- is easier for the compiler to vectorize
The result is faster execution even in serial mode.
This refactor preserves the algorithm exactly. Any differences are limited to expected floating-point roundoff.
Recommended validation steps:
- compare baseline and refactored outputs in real space
- compare Fourier-domain representations
- compute max-absolute and relative error norms
- test both shift and no-shift paths
- run end-to-end alignment regression tests
FFT plans (precision and flags) should be kept identical during comparison to avoid confounding effects.
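The error-norm comparison can be sketched as follows (the function name is illustrative; choose tolerances to match single-precision roundoff when the pipeline runs in float32):

```python
import numpy as np

def equivalence_report(baseline, refactored):
    """Max-absolute and relative L2 error between two pipeline outputs."""
    diff = np.abs(np.asarray(baseline) - np.asarray(refactored))
    max_abs = float(diff.max())
    ref_norm = float(np.linalg.norm(baseline))
    rel_l2 = float(np.linalg.norm(diff)) / ref_norm if ref_norm > 0 else 0.0
    return max_abs, rel_l2
```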
When working on performance-critical code in simple:
- profile first; identify the dominant costs
- fuse consecutive full-array passes
- minimize domain round-trips
- eliminate transcendentals from inner loops
- hoist invariants aggressively
- memoize geometry and index mapping
- prefer uniform loops over mode-specific branching
- keep orchestration readable; specialize the hot path
- validate numerical equivalence early and often