Skip to content

kunzmi/mpp

Repository files navigation

MPP - Michael Performance Primitives

Intel has IPP, Nvidia has NPP – and I have MPP: Michael Performance Primitives (actually not that easy to find a three letter name where the last two letters are given and that is still available...).

For a very long time now, many – if not all – of my cuda related projects rely in some way on NPP. I even wrote a C# based wrapper for this library as a part of managedCuda and all functions in NPP have a counterpart in C#: I thus know how many functions NPP provides (more than 5000 including all variants).

But from managedCuda I also know that many things are repetitive in NPP: nppiAdd does (mostly) the same for all datatypes, nppiSub is basically the same as nppiAdd, etc. If one only counts the actual functionalities independent of the data type, that means leaving out different variations, temp-buffer size and other helper functions, the number of 5000 functions quickly reduces to less than 300. A feasible number. Once I knew that number, the idea for MPP was born – because even though the number of functions is large, there is always one function missing: a special edge case, a special variant, a different datatype. At the end, one usually must implement a similar kernel with hopefully a similar performance. But to do so, what is the special ingredient that makes NPP a set of performance primitives and not just another vectorAdd-hello-world example?

Working now with Cuda for more than 15 years, this was not per se the most difficult problem to solve – the question was more to get things generalized and scaled to all NPP-use-cases: The solution is to use C++ templates and making heavy use of concepts introduced in C++20. And a different approach to handle unaligned data as compared to NPP, but this would go beyond the scope of a project readme.md – I’ll likely fill the wiki with some information on that over time.

Finally, now that I implemented the entire imaging subset of NPP for all datatypes, I can say that it is not that complicated to achieve the performance of NPP, it is even possible to become faster than NPP, sometimes even much faster!

I compared the average runtime of a few exemplery primitives in this table: Performance Analysis.

The short resume:

  • Aligned data is either as fast as NPP or faster (with very few exceptions).
  • Unaligned data is significantly faster with MPP than with NPP due to my one-kernel-approach.
  • Box- or Gauss-filtering with large kernels is (up to 90x) faster with MPP than with NPP. All tests were done on a RTX4090.
  • Median-filtering with NPP is quite slow, I've seen up to 400x improvement. But for the time beeing, only kernel sizes 3x3, 5x5 and 7x7 are implemented, NPP supports all possible kernel sizes.

To date, nearly all functions of the image processing subset of NPP are implemented – except for a few functions where I either have not yet decided on how to implement them or I simply have no idea what they are good for, how to use them or the NPP/IPP description is not in agreement with the actually computed outcome. Also missing is batch-processing and the 1D signal processing part of NPP.

What is already available in MPP:

  • All functionality in

    • arithmetic
    • colorConversion, except nppiAlphaCompColorKey, nppiLUTPaletteSwap, nppiRGBToCbYCr422Gamma
    • dataExchangeAndInit
    • geometricTransforms
    • morphology
    • thresholdAndCompare
    • statistics
    • filtering is also complete except for nppiHistogramOfGradients, nppiFilterGaussPyramidLayerUp/Down, nppiFilterHoughLine, nppiFloodFill, nppiLabelMarkersUF, nppiSegmentWatershed and nppiSignedDistanceTransform
  • Some new functions that I always missed in NPP, e.g. conjugate complex multiplication

  • Supported datatypes are:

    • unsigned byte (8u)
    • signed byte (8s)
    • unsigned short (16u)
    • signed short (16s)
    • unsigned int (32u)
    • signed int (32s)
    • in some exceptional cases also 64bit int (signed/unsigned).
    • 16bit half-float (16f)
    • 16bit bfloat (16bf)
    • 32bit float (32f)
    • 64bit float (64f)
    • 16bit signed complex int (16sc)
    • 32bit signed complex int (32sc)
    • 32bit complex float (32fc)
  • One (C1), two (C2), three (C3), four (C4) channel images and three channels plus alpha channel (AC4)

  • All kernels are implemented for all meaningful datatypes (e.g. no min/max operations on complex numbers). Using templates for the different channel counts and different datatypes makes this relatively easy to realize: one just needs to add another instantiation.

  • All interpolation methods are available in all kernels that use interpolation, i.e.:

    • NearestNeighbor
    • Linear
    • CubicHermiteSpline (called ‘cubic’ e.g. in Matlab)
    • CubicLagrange (‘cubic’ in NPP)
    • Cubic2ParamBSpline
    • Cubic2ParamCatmullRom
    • Cubic2ParamB05C03
    • SuperSampling (only for image downsampling)
    • Lanczos2Lobed
    • Lanczos3Lobed
  • All kernels that might need to read pixels outside the image area (e.g. filtering, geometric transformation, etc.) support border processing in a similar way to the _Border variants in NPP. Except that NPP only supports the border type “Replicate”, MPP implements:

    • None
    • Constant
    • Replicate
    • Mirror
    • MirrorReplicate
    • Wrap
    • SmoothEdge, which mimics IPP's SmoothEdge edge feature and which is not implemented in NPP.
  • A C++ class-based wrapper for NPP like the one in managedCuda – initially the idea was to use NPP as a reference implementation but I quickly gave that up: NPP has too many bugs and inconsistencies that it could be used as a reference. The code is still there though and was usefull for performance comparisons.

  • All cuda kernels are also implemented in a “simpleCPU” variant using no optimization and no parallelization at all, but the same operators and functors as the cuda kernels. This simplified variant is now considered as the reference implementation, the gold standard that all computed values must reach.

  • Compiles on Windows with VisualC++ and Linux with GCC 13.3+

  • No dependencies other than provided in the repository (except Cuda SDK)

The TODO list gets smaller and smaller, things that I want to add:

  • There's no install, releasing or packaging, CMake as of now just compiles the code and that's it.
  • Signal processing part of NPP/IPP, i.e. NPPs or IPPs?

What is next?

  • Once MPP is fully implemented with Cuda, the next logical step would be to port the cuda kernels to hip for AMD GPUs.
  • Large parts of the code could also be reused for a CPU implementation using SSE or AVX, which then leads to the question: how easy is it to reach the performance of IPP on CPU? Arm’s Neon vector units would then also benefit from such an implementation…

A full list of the currently implemented functions in MPP:

Arithmetic and Logical

Name 8u 8s 16u 16s 32u 32s 16f 16bf 32f 64f 16sc 32sc 32fc | C1 C2 C3 C4 AC4
Abs |
AbsDiff |
AbsDiffC |
AbsDiffDevC |
Add |
AddC |
AddDevC |
AddProduct |
AddSquare |
AddWeighted |
AlphaComp |
AlphaCompC |
AlphaPremul |
AlphaPremulC |
And |
AndC |
AndDevC |
Div |
DivC |
DivDevC |
DivInv1 |
DivInvC1 |
DivInvDevC1 |
Exp |
LShiftC |
Ln |
Mul |
MulC |
MulDevC |
MulScale |
MulScaleC |
MulScaleDevC |
Not |
Or |
OrC |
OrDevC |
RShiftC |
Sqr |
Sqrt |
Sub |
SubC |
SubDevC |
SubInv1 |
SubInvC1 |
SubInvDevC1 |
Xor |
XorC |
XorDevC |

1I added this variant to allow for inplace operation SrcDst = Src2 - SrcDst or SrcDst = Src2 / SrcDst in addition to SrcDst = SrcDst - Src2

Color Conversion

Name 8u 8s 16u 16s 32u 32s 16f 16bf 32f 64f 16sc 32sc 32fc | C1 C2 C3 C4 AC4
BGRtoHLS |
BGRtoHSV |
BGRtoLUV |
BGRtoLab |
CFAToRGB |
ColorToGray |
ColorTwist3x3 |
ColorTwist3x3From411 |
ColorTwist3x3From420 |
ColorTwist3x3From422 |
ColorTwist3x3To411 |
ColorTwist3x3To420 |
ColorTwist3x3To422 |
ColorTwist3x41 |
ColorTwist3x4From4111 |
ColorTwist3x4From4201 |
ColorTwist3x4From4221 |
ColorTwist3x4To4111 |
ColorTwist3x4To4201 |
ColorTwist3x4To4221 |
ColorTwist4x4 |
ColorTwist4x4C |
CompColorKey |
ConvertSampling4222 |
GammaCorrBT7093 |
GammaCorrsRGB3 |
GammaInvCorrBT7093 |
GammaInvCorrsRGB3 |
GradientColorToGray |
HLStoBGR |
HLStoRGB |
HSVtoBGR |
HSVtoRGB |
LUT4 |
LUTPalette |
LUTPaletteAC4 |
LUTPaletteC3 |
LUTTrilinear |
LUTTrilinearAC4 |
LUVtoBGR |
LUVtoRGB |
LabtoBGR |
LabtoRGB |
RGBToCFA5 |
RGBtoHLS |
RGBtoHSV |
RGBtoLUV |
RGBtoLab |

1The NPP API provides individual functions for different color space conversions, such as RGBtoYCbCr. But then variants exist for planar configurations in RGBtoYCbCr but not for BGRtoYCbCr and vice versa - as internally all of these conversions can be represented by a 3x4 matrix operation, I opted for a gereneralized API and provide the user the matrix values to pass to the ColorTwist3x4 function.

2MPP only provides a primitive to convert from and to 422 C2 packed format to/from 2/3-channel planar. All other resampling operations on planar data can be replaced by either a Copy-function or a Resize-function.

3The gamma curve that NPP and IPP are using corresponds to the 'BT709' gamma curve, but most common is likely the sRGB curve, so MPP provides both.

4For 8u and 16u data, it is more performant to pre-compute all possible LUT values in a palette and then use the LUTPalette function. MPP provides routines to compute a palette from a LUT on host and on device.

5The inverse of CFAtoRGB seems like a nice to have

Data Exchange and Initialization

Name 8u 8s 16u 16s 32u 32s 16f 16bf 32f 64f 16sc 32sc 32fc | C1 C2 C3 C4 AC4
Copy |
CopyBorder |
CopySubpix |
SetC |
SetDevC |
SwapChannel |
Transpose |

Filtering

Name 8u 8s 16u 16s 32u 32s 16f 16bf 32f 64f 16sc 32sc 32fc | C1 C2 C3 C4 AC4
BilateralGaussFilter |
BoxAndSumSquareFilter1 |
BoxFilter |
CannyEdge |
ColumnFilter |
ColumnWindowSum |
Filter |
FixedFilter2 |
GradientVectorPrewitt |
GradientVectorScharr |
GradientVectorSobel |
HarrisCornerResponse |
MaxFilter |
MedianFilter3 |
MinFilter |
RowFilter |
RowWindowSum |
SeparableFilter4 |
ThresholdAdaptiveBoxFilter |
UnsharpFilter |
WienerFilter |

1A specialised box filter for one-channel images that returns in first channel result image the mean value under the box area and in the second channel the summed squared pixel values. The result can then be used as a pre-computed step in the CrossCorrelationCoefficient function.

2One function for the following filter kernels: Gauss, HighPass, LowPass, Laplace, PrewittHoriz, PrewittVert, RobertsDown, RobertsUp, ScharrHoriz, ScharrVert, Sharpen, SobelCross, SobelHoriz, SobelVert, SobelHorizSecond, SobelVertSecond

3Only for filter sizes 3x3, 5x5 and 7x7

4Replaces nppiFilterGaussAdvancedBorder

Geometry Transforms

Name 8u 8s 16u 16s 32u 32s 16f 16bf 32f 64f 16sc 32sc 32fc | C1 C2 C3 C4 AC4
Mirror |
Remap |
RemapC21 |
Resize |
ResizeSqrPixel |
Rotate |
WarpAffine |
WarpAffineBack |
WarpPerspective |
WarpPerspectiveBack |

1NPP has a Remap function that takes two one-channel maps for X and Y pixel indices. MPP additionally has the same routine with a packed two-channel index-map, i.e. X and Y component in one image.

Complex Numbers1

Name 8u 8s 16u 16s 32u 32s 16f 16bf 32f 64f 16sc 32sc 32fc | C1 C2 C3 C4 AC4
Angle |
Conj |
ConjMul |
Imag |
Magnitude2 |
MagnitudeLog2 |
MagnitudeSqr2 |
MakeComplex |
MakeComplexImag |
Real |

1NPP offers only rudimentary support for complex numbers in a category named linear transforms, but geometry and color transforms are also linear transforms, why I renamed the category Complex Numbers.

2As a special feature, the Magnitude functions can return a full spectrum with zero frequency in the center if the input is of size (width/2)+1, the size of a FFTW/CUFFT R2C FFT.

Morphological

Name 8u 8s 16u 16s 32u 32s 16f 16bf 32f 64f 16sc 32sc 32fc | C1 C2 C3 C4 AC4
BlackHat |
Close |
Dilation |
DilationGray |
DilationMask |
Erosion |
ErosionGray |
ErosionMask |
MorphologyGradient1 |
Open |
TopHat |

1When one calls the MorphGradient functions in NPP, it returns the result of BlackHat

Statistics

Name 8u 8s 16u 16s 32u 32s 16f 16bf 32f 64f 16sc 32sc 32fc | C1 C2 C3 C4 AC4
AverageError |
AverageRelativeError |
CircularRadialProfile |
CountInRange |
CrossCorrelation |
CrossCorrelationCoefficient |
CrossCorrelationNormalized |
DotProduct |
EllipticalRadialProfile1 |
HistogramEven |
HistogramRange |
Integral |
MSE |
MSSSIM |
Max |
MaxEvery |
MaxIndex |
MaximumError |
MaximumRelativeError |
Mean |
MeanStd |
Min |
MinEvery |
MinIndex |
MinMax |
MinMaxIndex |
NormDiffInf |
NormDiffL1 |
NormDiffL2 |
NormInf |
NormL1 |
NormL2 |
NormRelInf |
NormRelL1 |
NormRelL2 |
PSNR |
QualityIndex2 |
QualityIndexWindow2 |
RectStdDev |
SSIM |
SqrIntegral |
SquareDistanceNormalized |
Sum |

1I never managed to get a result out of EllipticalRadialProfile in NPP.

2QualityIndex is identical to the one in NPP computing a global index without a sliding window. QualityIndexWindow is implemented using a sliding window approach as is done in the original paper / code with a window size of 11x11 pixels.

Threshold and Sompare

Name 8u 8s 16u 16s 32u 32s 16f 16bf 32f 64f 16sc 32sc 32fc | C1 C2 C3 C4 AC4
Compare1 |
CompareC1 |
CompareDevC1 |
CompareEqEps |
CompareEqEpsC |
CompareEqEpsDevC |
CompareFloat2 |
ReplaceIf3 |
ReplaceIfC3 |
ReplaceIfDevC3 |
ReplaceIfFloat3 |
Threshold |
ThresholdDevC |
ThresholdGT |
ThresholdGTDevC |
ThresholdGTVal |
ThresholdLT |
ThresholdLTDevC |
ThresholdLTGT |
ThresholdLTVal |
ThresholdVal |

1Additionally to what NPP allows, MPP can apply the compare-operation per channel or for all channels and return a result for each channel or for all channels together.

2CompareFloat allows to check for unnormalized floating values, such as INF or NaN, a feature I always missed in NPP.

3ReplaceIf, replaces a pixel value if the compare-operation returns true: Basically a combined Compare/Set-with-mask operation.

About

An open source implementation of NPP

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published