Performance Refactoring
This wiki documents a real-world refactor in the simple codebase that achieved a ~30% speedup through low-level optimization only.
The refactor:
- does not change algorithms
- does not introduce approximations
- preserves numerical behavior up to normal floating-point roundoff
Performance gains come from restructuring data movement, fusing pipeline stages, and refactoring hot loops to better match how modern CPUs execute code.
This documentation serves two purposes:
- explain what was changed and why
- provide guidance for future performance work in simple
At first glance, much of the original code already looks optimized: it uses FFTW, avoids obvious inefficiencies, and is written clearly in terms of scientific intent. So why bother with refactors like this?
In image-processing pipelines, the dominant cost is frequently:
- streaming large arrays through memory
- repeated full-image passes
- rearranging data (for example fftshift, clipping, masking)
Even when each individual operation is efficient, chaining many of them together can become bandwidth-bound.
This refactor shows that reducing the number of full-array passes can yield large speedups without changing any math.
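The effect can be sketched in NumPy (the production kernels are Fortran; function names and the clip threshold here are illustrative). Chaining array operations streams the image through memory once per step, while a fused kernel does the same math in a single pass:

```python
import numpy as np

def preprocess_unfused(img, mean, sdev, mask):
    # Three logical steps, three full passes over the image,
    # each materializing a temporary array.
    t = (img - mean) / sdev          # pass 1: normalize
    t = np.clip(t, -3.0, 3.0)        # pass 2: clip outliers
    return t * mask                  # pass 3: mask

def preprocess_fused(img, mean, sdev, mask):
    # The same math in one streaming pass, the way a fused compiled
    # kernel executes it (a plain loop for clarity; in Python this is
    # the slow version, in Fortran/C it is the fast one).
    out = np.empty_like(img)
    fi, fm, fo = img.ravel(), mask.ravel(), out.ravel()
    for i in range(fi.size):
        v = (fi[i] - mean) / sdev
        v = min(max(v, -3.0), 3.0)
        fo[i] = v * fm[i]
    return out
```

Both functions produce identical results; only the number of trips through memory differs.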
The original implementation is easy to read because each operation is expressed as a separate method call.
The refactored version keeps that clarity at the orchestration level (for example prepimg4align still reads like a pipeline), while moving complexity into well-named, tightly scoped kernels.
Key idea:
- readable pipelines at the top
- specialized, performance-aware kernels underneath
A common misconception is that performance gains require:
- approximations
- shortcuts
- reduced precision
- altered algorithms
This example demonstrates the opposite:
- the math is unchanged
- results are numerically equivalent (up to floating-point roundoff)
- the speedup comes from restructuring work, not skipping it
Understanding mathematical equivalences (for example spatial shifts versus Fourier phase modulation) is a powerful optimization tool.
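The shift/phase equivalence is the Fourier shift theorem: translating an image in real space equals multiplying its transform by a linear phase ramp. A NumPy sketch (illustrative, not SIMPLE's Fortran code):

```python
import numpy as np

def shift_via_phase(img, dy, dx):
    """Translate img by (dy, dx) via a phase ramp in Fourier space."""
    ny, nx = img.shape
    ky = np.fft.fftfreq(ny)[:, None]   # cycles per pixel along y
    kx = np.fft.fftfreq(nx)[None, :]   # cycles per pixel along x
    phase = np.exp(-2j * np.pi * (ky * dy + kx * dx))
    return np.fft.ifft2(np.fft.fft2(img) * phase).real
```

For integer shifts this matches a circular np.roll exactly (up to roundoff); for fractional shifts it interpolates, which is why applying the phase during an existing Fourier-domain pass can replace a separate resampling pass.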
These changes reflect how modern CPUs actually execute code:
- memory bandwidth is precious
- branches and function calls in hot loops are expensive
- compilers optimize best when loops are simple and uniform
Writing fast code in simple often means shaping algorithms so they work with the hardware, not against it.
The first case study focuses on refactoring the particle preprocessing pipeline used before alignment.
The baseline implementation performs the following steps:
normalize → FFT → clip → optional spatial shift → CTF phase flip → IFFT → mask → optional instrument-function division → FFT
Each step typically performs a full pass over the image.
The refactored version preserves the same logical steps but consolidates most work into two fused kernels:
- norm_noise_fft_clip_shift
- ifft_mask_divwinstrfun_fft
The call site remains short and readable, while the heavy lifting is done inside specialized kernels.
Multiple operations that streamed over the entire image were combined into single kernels, reducing memory traffic.
The real-space shift was replaced by its mathematically equivalent Fourier phase factor, applied during clipping.
This avoids an additional resampling pass while preserving exact behavior.
Quadrant swaps (fftshift) were fused with normalization, masking, and division so that data rearrangement is never done on its own.
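For even-sized images, the quadrant swap can be fused with a scaling (or masking) pass so the rearrangement is never a standalone sweep. A NumPy sketch with an illustrative scale factor:

```python
import numpy as np

def fftshift_scale_fused(img, scale):
    """Quadrant swap fused with scaling: one read and one write per pixel."""
    ny, nx = img.shape            # assumes even dimensions
    h, w = ny // 2, nx // 2
    out = np.empty_like(img)
    out[:h, :w] = img[h:, w:] * scale   # each quadrant is moved and
    out[:h, w:] = img[h:, :w] * scale   # scaled in the same pass,
    out[h:, :w] = img[:h, w:] * scale   # instead of fftshift followed
    out[h:, w:] = img[:h, :w] * scale   # by a separate multiply
    return out
```

The result equals np.fft.fftshift(img) * scale, but the data is rearranged and scaled in one traversal rather than two.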
Cheap guards avoid expensive arithmetic when normalization or shifting would have no meaningful effect.
A mixed-precision strategy was adopted:
- double precision for reductions (mean, variance)
- single precision for per-pixel transforms and FFTs
This preserves numerical intent while matching FFT precision.
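A NumPy sketch of the precision split (illustrative; the production kernels are Fortran): statistics are accumulated in double precision, while the per-pixel normalization stays in single precision to match single-precision FFT plans:

```python
import numpy as np

def normalize_mixed(img32):
    """Reductions in float64, per-pixel transform in float32."""
    acc = img32.astype(np.float64)        # promote only for the reduction
    mean, sdev = acc.mean(), acc.std()
    # Per-pixel arithmetic stays in float32, matching single-precision FFTs.
    return (img32 - np.float32(mean)) / np.float32(sdev)
```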
A second case study focuses on refactoring the serial CTF application kernel.
The original implementation:
- uses a select-case on mode
- duplicates nearly identical nested loops for each mode
- recomputes geometry and address mapping inside hot loops
This structure is correct but inefficient for a performance-critical kernel.
The refactored version introduces:
- a pure elemental kernel for computing the CTF value
- a single unified loop
- masked mode handling using merge
- memoized geometry and addressing
The CTF value is computed once per frequency, and the mode-specific behavior is applied via masked arithmetic.
This reduces instruction cache pressure and improves compiler optimization opportunities.
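The pattern can be sketched in NumPy, with np.where standing in for Fortran's merge; the mode names and the branchy baseline are illustrative, not SIMPLE's actual code:

```python
import numpy as np

def apply_ctf_branchy(ctf, img, mode):
    # Baseline shape: one duplicated loop body per mode.
    if mode == "flip":
        return img * np.sign(ctf)
    if mode == "ctf":
        return img * ctf
    return img * ctf**2               # "square"

def apply_ctf_uniform(ctf, img, mode):
    # Unified body: the CTF is evaluated once, and mode handling is
    # masked arithmetic (merge in Fortran), so the inner loop has a
    # uniform, branch-free body the compiler can vectorize.
    factor = np.where(mode == "flip", np.sign(ctf),
                      np.where(mode == "ctf", ctf, ctf**2))
    return img * factor
```

Both versions agree for every mode; the uniform version trades a tiny amount of extra arithmetic for a single predictable loop body.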
Constants derived from wavelength, defocus, and spherical aberration are computed once outside the loop.
Angle, spatial frequency squared, and physical array indices are precomputed and reused.
This removes repeated atan2, division, and address computation from the hot loop.
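In NumPy terms, the memoization amounts to computing the angular and radial grids once, outside the hot loop (an illustrative sketch; the real Fortran kernel also caches the physical array indices):

```python
import numpy as np

def precompute_geometry(ny, nx):
    """Angle and |k|^2 grids computed once, before the per-pixel loop."""
    ky = np.fft.fftfreq(ny)[:, None]
    kx = np.fft.fftfreq(nx)[None, :]
    ang = np.arctan2(ky, kx)          # hoists atan2 out of the hot loop
    spa_freq_sq = ky * ky + kx * kx   # hoists the frequency products
    return ang, spa_freq_sq
```

The CTF kernel then consumes these cached grids, so the loop body reduces to pure arithmetic on precomputed values.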
The refactored loop:
- contains no data-dependent branching
- has a uniform body
- is easier for the compiler to vectorize
The result is faster execution even in serial mode.
This refactor preserves the algorithm exactly. Any differences are limited to expected floating-point roundoff.
Recommended validation steps:
- compare baseline and refactored outputs in real space
- compare Fourier-domain representations
- compute max-absolute and relative error norms
- test both shift and no-shift paths
- run end-to-end alignment regression tests
FFT plans (precision and flags) should be kept identical during comparison to avoid confounding effects.
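The error-norm comparison can be sketched as follows (the function name is illustrative; choose tolerances to match single-precision roundoff when the pipeline runs in float32):

```python
import numpy as np

def equivalence_report(baseline, refactored):
    """Max-absolute and relative L2 error between two pipeline outputs."""
    diff = np.abs(np.asarray(baseline) - np.asarray(refactored))
    max_abs = float(diff.max())
    ref_norm = float(np.linalg.norm(baseline))
    rel_l2 = float(np.linalg.norm(diff)) / ref_norm if ref_norm > 0 else 0.0
    return max_abs, rel_l2
```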
When working on performance-critical code in simple:
- profile first; identify the dominant costs
- fuse consecutive full-array passes
- minimize domain round-trips
- eliminate transcendentals from inner loops
- hoist invariants aggressively
- memoize geometry and index mapping
- prefer uniform loops over mode-specific branching
- keep orchestration readable; specialize the hot path
- validate numerical equivalence early and often