The code in src/main.cu is a contrived data processing pipeline built to showcase NVIDIA's profiling tools and a couple of common pitfalls when getting started with CUDA programming. There are four implementations of the same pipeline; they are functionally identical but differ in runtime performance. The implementations are:
- Baseline
- Launch pipeline with more than one thread block
- Process pipeline with CUDA streams to prevent unnecessary blocking
- Use coalesced memory access (relevant blog post)
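
The three optimizations listed above can be illustrated with a minimal sketch. It is not taken from src/main.cu: the kernels `scale` and `scale_chunked`, the helper `run_demo`, the `CHECK` macro, and all sizes are hypothetical names and values chosen for illustration. It assumes a CUDA-capable GPU and only shows the general shape of each technique: a grid-stride kernel launched with many blocks, two streams so that copies and kernels for different halves of the data can overlap, and a coalesced versus a chunked (uncoalesced) access pattern.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical error-check helper: prints the CUDA error and makes the
// enclosing function return -1 (no cleanup on failure, this is only a sketch).
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s\n", cudaGetErrorString(err_));   \
            return -1;                                           \
        }                                                        \
    } while (0)

// Coalesced access: on each loop iteration, neighbouring threads touch
// neighbouring elements. The grid-stride loop keeps the kernel correct for
// any launch configuration, from <<<1, 256>>> up to many blocks.
__global__ void scale(float* data, int n, float factor) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Uncoalesced access: each thread walks its own contiguous chunk, so
// neighbouring threads touch addresses `chunk` elements apart.
__global__ void scale_chunked(float* data, int n, float factor) {
    int total = blockDim.x * gridDim.x;
    int chunk = (n + total - 1) / total;
    int begin = (blockIdx.x * blockDim.x + threadIdx.x) * chunk;
    int end = min(begin + chunk, n);
    for (int i = begin; i < end; ++i) {
        data[i] *= factor;
    }
}

// Processes two halves of a buffer on two streams and returns the number of
// elements that do not hold the expected value (0 on success, -1 on error).
int run_demo() {
    const int n = 1 << 20, half = n / 2, threads = 256;
    const int blocks = (half + threads - 1) / threads;  // many blocks, not 1

    float *h = nullptr, *d = nullptr;
    CHECK(cudaMallocHost(&h, n * sizeof(float)));  // pinned host memory lets async copies overlap
    CHECK(cudaMalloc(&d, n * sizeof(float)));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t streams[2];
    for (int k = 0; k < 2; ++k) CHECK(cudaStreamCreate(&streams[k]));

    for (int k = 0; k < 2; ++k) {
        float* hp = h + k * half;
        float* dp = d + k * half;
        size_t bytes = half * sizeof(float);
        CHECK(cudaMemcpyAsync(dp, hp, bytes, cudaMemcpyHostToDevice, streams[k]));
        scale<<<blocks, threads, 0, streams[k]>>>(dp, half, 2.0f);
        // 64 blocks of 256 threads gives each thread a 32-element chunk.
        scale_chunked<<<64, threads, 0, streams[k]>>>(dp, half, 2.0f);
        CHECK(cudaMemcpyAsync(hp, dp, bytes, cudaMemcpyDeviceToHost, streams[k]));
    }
    CHECK(cudaGetLastError());       // catches launch-configuration errors
    CHECK(cudaDeviceSynchronize());  // waits for both streams to finish

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h[i] != 4.0f) ++mismatches;  // 1.0 * 2.0 * 2.0
    }

    for (int k = 0; k < 2; ++k) cudaStreamDestroy(streams[k]);
    cudaFree(d);
    cudaFreeHost(h);
    return mismatches;
}
```

To try the sketch, call `run_demo()` from a `main` function and compile with `nvcc`. Both kernels compute the same result, so any difference between them should show up only in the profiler: Nsight Systems shows whether work on the two streams overlaps, and Nsight Compute reports memory access efficiency for each kernel.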
A graphical representation of the contrived pipeline:

A Dockerfile, along with bash scripts for building and running the Docker container, is located in the deploy directory. To build and run the container, use:
./deploy/build-docker.sh
./deploy/run-docker.sh

The application is built with CMake:
mkdir build && cd build
cmake ..
make

Run the application with ./nsight-demo.
To generate an Nsight Systems report (.nsys-rep) and Nsight Compute reports (.ncu-rep), run ./deploy/profile.sh. The reports are saved in the ./nsys-reports directory.