From 2da0b8d6b34cfa3ac247fdd7236beab48fbc47c2 Mon Sep 17 00:00:00 2001
From: Yunsong Wang
Date: Fri, 19 Sep 2025 12:18:35 -0700
Subject: [PATCH 01/12] Create best_practices.md

---
 docs/best_practices.md | 237 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 237 insertions(+)
 create mode 100644 docs/best_practices.md

diff --git a/docs/best_practices.md b/docs/best_practices.md
new file mode 100644
index 00000000..65c3c742
--- /dev/null
+++ b/docs/best_practices.md
@@ -0,0 +1,237 @@
# NVBench Best Practices

NVBench is a **small but actively developed benchmarking library** for **CUDA GPU workloads**. It is **well-documented** and comes with many examples to help users get started quickly.

* [README](https://github.com/NVIDIA/nvbench/blob/main/README.md) — installation and basic usage.
* [Benchmark documentation](https://github.com/NVIDIA/nvbench/blob/main/docs/benchmarks.md) — detailed features.
* [Examples](https://github.com/NVIDIA/nvbench/tree/main/examples) — sample benchmarks.
* [CLI guides](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md) and [CLI axis documentation](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md).

> **Note:** This document complements the official guides. All code is for demonstration purposes and is **not a production recommendation**.

---

## Key Features

* Purpose-built for **CUDA GPU workloads**.
* Provides **GPU-aware features**: warmup runs, synchronization, throughput/latency metrics, parameter sweeps, etc.
* Produces **machine-readable output** (JSON, CSV) for regression tracking and CI pipelines.
* **A natural choice for GPU benchmarking**; CPU code is also supported.
* Python bindings are planned for future releases.

---

## Getting Started: Benchmarking a Simple GPU Kernel

### Naive Example

```cpp
void sequence_bench(nvbench::state& state) {
  auto data = thrust::device_vector<int>(10);
  state.exec([&data](nvbench::launch& launch) {
    thrust::sequence(data.begin(), data.end());
  });
}
NVBENCH_BENCH(sequence_bench);
```

> This may compile with **unused parameter warnings** and may **hang at runtime**, because NVBench requires explicit CUDA stream targeting and careful handling of synchronous kernels.

---

### Correct Usage with Stream

```cpp
void sequence_bench(nvbench::state& state) {
  auto data = thrust::device_vector<int>(10);
  state.exec([&data](nvbench::launch& launch) {
    thrust::sequence(thrust::device.on(launch.get_stream()), data.begin(), data.end());
  });
}
NVBENCH_BENCH(sequence_bench);
```

---

### Avoiding Deadlocks with `exec_tag::sync`

```cpp
void sequence_bench(nvbench::state& state) {
  auto data = thrust::device_vector<int>(10);
  state.exec(nvbench::exec_tag::sync, [&data](nvbench::launch& launch) {
    thrust::sequence(thrust::device.on(launch.get_stream()), data.begin(), data.end());
  });
}
NVBENCH_BENCH(sequence_bench);
```

> This ensures correct timing and avoids hangs caused by implicit synchronization in `thrust` calls.

---

## Multi-GPU Awareness

By default, NVBench runs on **all available GPUs**, which may increase runtime significantly. There are two easy ways to restrict this.
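One is NVBench's built-in device option: the `-d`/`--devices` flag from the [CLI help](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md) limits a run to the listed device IDs (the same flag appears again near the end of this document):

```bash
# Run the benchmark only on device 0.
./sequence_bench -d 0
```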
The other is the `CUDA_VISIBLE_DEVICES` environment variable:

```bash
export CUDA_VISIBLE_DEVICES=0
```

Example run output for a single GPU:

```bash
user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench
# Devices

## [0] `Quadro RTX 8000`
* SM Version: 750 (PTX Version: 750)
* Number of SMs: 72
* SM Default Clock Rate: 1770 MHz
* Global Memory: 48232 MiB Free / 48403 MiB Total
* Global Memory Bus Peak: 672 GB/sec (384-bit DDR @7001MHz)
* Max Shared Memory: 64 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1024/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

# Log

Run: [1/1] sequence_bench [Device=0]
Pass: Cold: 0.006257ms GPU, 0.009850ms CPU, 0.50s total GPU, 4.40s total wall, 79920x

# Benchmark Results

## sequence_bench

### [0] Quadro RTX 8000

| Samples | CPU Time | Noise | GPU Time | Noise |
|---------|----------|-------|----------|--------|
| 79920x | 9.850 us | 9.62% | 6.257 us | 13.32% |
```

---

## Benchmarking Multiple Problem Sizes

Add an **axis** to test multiple input sizes without recompiling:

```cpp
void sequence_bench(nvbench::state& state) {
  auto const n = state.get_int64("Num");
  auto data = thrust::device_vector<int>(n);

  state.exec(nvbench::exec_tag::sync, [&data](nvbench::launch& launch) {
    thrust::sequence(thrust::device.on(launch.get_stream()), data.begin(), data.end());
  });
}

NVBENCH_BENCH(sequence_bench)
  .add_int64_axis("Num", {10, 100, 1000, 1000000});
```

CLI override example:

```bash
user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench -a Num=[10,100000]
# Devices

## [0] `Quadro RTX 8000`
* SM Version: 750 (PTX Version: 750)
* Number of SMs: 72
* SM Default Clock Rate: 1770 MHz
* Global Memory: 48232 MiB Free / 48403 MiB Total
* Global Memory Bus Peak: 672 GB/sec (384-bit DDR @7001MHz)
* Max Shared Memory: 64 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1024/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

# Log

Run: [1/2] sequence_bench [Device=0 Num=10]
Pass: Cold: 0.006318ms GPU, 0.009948ms CPU, 0.50s total GPU, 4.37s total wall, 79152x
Run: [2/2] sequence_bench [Device=0 Num=100000]
Pass: Cold: 0.006586ms GPU, 0.010193ms CPU, 0.50s total GPU, 4.14s total wall, 75936x

# Benchmark Results

## sequence_bench

### [0] Quadro RTX 8000

| Num | Samples | CPU Time | Noise | GPU Time | Noise |
|--------|---------|-----------|-------|----------|--------|
| 10 | 79152x | 9.948 us | 9.63% | 6.318 us | 13.73% |
| 100000 | 75936x | 10.193 us | 9.62% | 6.586 us | 12.86% |
```

---

## Comparing Algorithms

You can easily benchmark alternative implementations.
For example, replacing `thrust::sequence` with `thrust::transform`:

```cpp
void sequence_bench(nvbench::state& state) {
  auto const n = state.get_int64("Num");
  auto data = thrust::device_vector<int>(n);

  state.exec(nvbench::exec_tag::sync, [&data, n](nvbench::launch& launch) {
    thrust::transform(
      thrust::device.on(launch.get_stream()),
      thrust::counting_iterator<int>(0),
      thrust::counting_iterator<int>(static_cast<int>(n)),
      data.begin(),
      cuda::std::identity{}
    );
  });
}
```

Record results to JSON for post-processing:

```bash
user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench --json sequence_transform.json
```

Compare with reference `thrust::sequence` run using `nvbench_compare.py`:

```bash
user@nvbench-test:~/nvbench/scripts$ ./nvbench_compare.py sequence_ref.json sequence_transform.json
['sequence_ref.json', 'sequence_transform.json']
# sequence_bench

## [0] Quadro RTX 8000

| Num | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------|------------|-------------|------------|-------------|-----------|---------|----------|
| 10 | 6.288 us | 13.70% | 6.301 us | 14.38% | 0.013 us | 0.20% | SAME |
| 100 | 6.331 us | 13.74% | 6.350 us | 15.15% | 0.019 us | 0.31% | SAME |
| 1000 | 6.548 us | 13.29% | 6.504 us | 13.95% | -0.043 us | -0.66% | SAME |
| 1000000 | 12.528 us | 7.56% | 12.507 us | 8.41% | -0.021 us | -0.17% | SAME |

# Summary

- Total Matches: 4
  - Pass (diff <= min_noise): 4
  - Unknown (infinite noise): 0
  - Failure (diff > min_noise): 0
```

> The two implementations perform nearly identically, demonstrating how NVBench can be used to **compare different algorithms or kernel implementations**.

---

## Summary

* Always **specify the CUDA stream** and use `exec_tag::sync` for synchronous kernels.
* Use **axes** and **CLI overrides** for flexible multi-size benchmarking.
* Record results in **JSON/CSV** for CI integration and regression analysis.
* NVBench is **actively developed**, easy to use, and ideal for **GPU benchmarking**, but note that it is small and has **limited community support**.

For more details and advanced examples, visit the [NVBench repository](https://github.com/NVIDIA/nvbench).

From af9773df09ea07c677be1c3b38d88679c62c4fc1 Mon Sep 17 00:00:00 2001
From: Yunsong Wang
Date: Fri, 19 Sep 2025 12:37:04 -0700
Subject: [PATCH 02/12] Update best_practices.md

---
 docs/best_practices.md | 168 +++++++++++++++++++++++++++++------------
 1 file changed, 118 insertions(+), 50 deletions(-)

diff --git a/docs/best_practices.md b/docs/best_practices.md
index 65c3c742..ca77142e 100644
--- a/docs/best_practices.md
+++ b/docs/best_practices.md
@@ -1,29 +1,24 @@
 # NVBench Best Practices

-NVBench is a **small but actively developed benchmarking library** for **CUDA GPU workloads**. It is **well-documented** and comes with many examples to help users get started quickly.
+NVBench is a **small but actively developed benchmarking library** for CUDA GPU workloads. The [README](https://github.com/NVIDIA/nvbench/blob/main/README.md) is the best starting point, listing all the details where users can get hands-on 101 from installation to usage of the framework. It has links to, e.g., the [benchmark documentation](https://github.com/NVIDIA/nvbench/blob/main/docs/benchmarks.md), which contains all the information needed for users to use NVBench and all the key features NVBench supports.
It also has links and pointers to [code examples](https://github.com/NVIDIA/nvbench/tree/main/examples) where users can get hands-on samples on how to apply NVBench to their own codebase.

This document is not meant to replace the detailed benchmark documentation ([here](https://github.com/NVIDIA/nvbench/blob/main/docs/benchmarks.md)) or the CLI help guides ([CLI help](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md) and [CLI axis help](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md)). All the examples shown are for demonstration purposes and are **not a recommendation guide for the best use in real-case problems**.

---

## NVBench

* Purpose-built for CUDA GPU workloads.
* Provides GPU-aware features: warmup runs, synchronization, throughput/latency metrics, and parameter sweeps, etc.
* Produces machine-readable output (JSON, CSV) suitable for regression tracking and CI pipelines.
* The natural choice for benchmarking GPU-accelerated code.
* Also supports benchmarking normal CPU implementations.
* Python code support is on the roadmap.

---

## Benchmark Your GPU Code with NVBench

Let’s begin with a simple example for users who are new to NVBench and want to learn the basics of benchmarking GPU code. Consider measuring the performance of `thrust::sequence` on a GPU. Similar to `std::iota`, suppose we have an input array of 10 elements, and we want `thrust::sequence` to populate it with the sequence of values from 0 to 9. The following example demonstrates this approach:

```cpp
void sequence_bench(nvbench::state& state) {
  auto data = thrust::device_vector<int>(10);
  state.exec([&data](nvbench::launch& launch) {
    thrust::sequence(data.begin(), data.end());
  });
}
NVBENCH_BENCH(sequence_bench);
```

Will this code work as-is? Depending on the build system configuration, compilation may succeed but produce warnings indicating that `launch` is an unused parameter. The code may or may not execute correctly. This is a common scenario when users, accustomed to a sequential programming mindset, overlook the fact that GPU architectures are highly parallel. Streams and synchronization play a critical role in accurately measuring performance in benchmark code.

In this context, two common mistakes should be noted:

1.
Stream Awareness: NVBench requires knowledge of the exact CUDA stream being targeted to correctly trace kernel execution and measure performance.
2. Explicit Stream Specification: Users must explicitly provide the stream to be benchmarked. For example, passing the NVBench launch stream ensures correct execution and measurement:

```cpp
void sequence_bench(nvbench::state& state) {
  auto data = thrust::device_vector<int>(10);
  state.exec([&data](nvbench::launch& launch) {
    thrust::sequence(thrust::device.on(launch.get_stream()), data.begin(), data.end());
  });
}
NVBENCH_BENCH(sequence_bench);
```

By explicitly specifying `launch.get_stream()`, NVBench can correctly target the kernels executed on that stream. After recompilation, the compilation warnings will be resolved, and the build will complete successfully. However, at runtime, the code may hang, for example:

```bash

######################################################################
##################### Possible Deadlock Detected #####################
######################################################################

Forcing unblock: The current measurement appears to have deadlocked
and the results cannot be trusted.

This happens when the KernelLauncher synchronizes the CUDA device.
If this is the case, pass the `sync` exec_tag to the `exec` call:

  state.exec(<KernelLauncher>); // Deadlock
  state.exec(nvbench::exec_tag::sync, <KernelLauncher>); // Safe
```

The runtime execution log indicates a deadlock, and NVBench terminated the run to prevent unnecessary execution. The log shows that the issue arises from implicit synchronization within the target kernel—in this case, the `thrust::sequence` call. By default, unless explicitly specified, `thrust` uses a synchronous execution policy internally. Therefore, users must pass `nvbench::exec_tag::sync` to ensure correct benchmarking. This will **not** produce a build-time error but can cause runtime hangs if omitted.

Now, we fix the code:

```cpp
void sequence_bench(nvbench::state& state) {
  auto data = thrust::device_vector<int>(10);
  state.exec(nvbench::exec_tag::sync, [&data](nvbench::launch& launch) {
    thrust::sequence(thrust::device.on(launch.get_stream()), data.begin(), data.end());
  });
}
NVBENCH_BENCH(sequence_bench);
```

If we run the benchmark now, results are displayed without issues.
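As a side note, the `sync` tag is only needed because `thrust` synchronizes internally. A plain CUDA kernel launched on the NVBench-provided stream is asynchronous and can use the default `exec` overload. The sketch below is illustrative only; the kernel and its launch configuration are hypothetical, not part of this document's benchmark:

```cpp
// Hypothetical hand-written equivalent of thrust::sequence.
__global__ void sequence_kernel(int* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = i; }
}

void sequence_kernel_bench(nvbench::state& state) {
  auto data = thrust::device_vector<int>(10);
  state.exec([&data](nvbench::launch& launch) {
    // An asynchronous launch on launch.get_stream() needs no exec_tag::sync.
    sequence_kernel<<<1, 32, 0, launch.get_stream()>>>(
      thrust::raw_pointer_cast(data.data()), static_cast<int>(data.size()));
  });
}
NVBENCH_BENCH(sequence_kernel_bench);
```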
Users may notice, especially in a multi-GPU environment, that many results are collected more than expected:

```bash
user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench
# Devices

## [0] `Quadro RTX 8000`
* SM Version: 750 (PTX Version: 750)
* Number of SMs: 72
* SM Default Clock Rate: 1770 MHz
* Global Memory: 48232 MiB Free / 48403 MiB Total
* Global Memory Bus Peak: 672 GB/sec (384-bit DDR @7001MHz)
* Max Shared Memory: 64 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1024/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

## [1] `NVIDIA RTX A400`
* SM Version: 860 (PTX Version: 860)
* Number of SMs: 6
* SM Default Clock Rate: 1762 MHz
* Global Memory: 2801 MiB Free / 3769 MiB Total
* Global Memory Bus Peak: 96 GB/sec (64-bit DDR @6001MHz)
* Max Shared Memory: 100 KiB/SM, 48 KiB/Block
* L2 Cache Size: 1024 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1536/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

# Log

Run: [1/2] sequence_bench [Device=0]
Pass: Cold: 0.006150ms GPU, 0.009768ms CPU, 0.50s total GPU, 4.52s total wall, 81312x
Run: [2/2] sequence_bench [Device=1]
Pass: Cold: 0.007819ms GPU, 0.013864ms CPU, 0.50s total GPU, 3.59s total wall, 63952x

# Benchmark Results

## sequence_bench

### [0] Quadro RTX 8000

| Samples | CPU Time | Noise | GPU Time | Noise |
|---------|----------|--------|----------|--------|
| 81312x | 9.768 us | 13.55% | 6.150 us | 20.16% |

### [1] NVIDIA RTX A400

| Samples | CPU Time | Noise | GPU Time | Noise |
|---------|-----------|---------|----------|---------|
| 63952x | 13.864 us | 432.95% | 7.819 us | 447.95% |
```

By default, NVBench runs all GPUs locally unless specified. If not specified, it will run all available GPUs. This is especially problematic if your system has multiple GPUs and you want to target a particular GPU to save build time. In our case, we target **RTX8000**:

```bash
user@nvbench-test:~/nvbench/build/bin$ export CUDA_VISIBLE_DEVICES=0
```

Now, if we rerun:

```bash
user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench
# Devices

## [0] `Quadro RTX 8000`
* SM Version: 750 (PTX Version: 750)
* Number of SMs: 72
* SM Default Clock Rate: 1770 MHz
* Global Memory: 48232 MiB Free / 48403 MiB Total
* Global Memory Bus Peak: 672 GB/sec (384-bit DDR @7001MHz)
* Max Shared Memory: 64 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1024/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

# Log

Run: [1/1] sequence_bench [Device=0]
Pass: Cold: 0.006257ms GPU, 0.009850ms CPU, 0.50s total GPU, 4.40s total wall, 79920x

# Benchmark Results

## sequence_bench

### [0] Quadro RTX 8000

| Samples | CPU Time | Noise | GPU Time | Noise |
|---------|----------|-------|----------|--------|
| 79920x | 9.850 us | 9.62% | 6.257 us | 13.32% |
```

## Benchmarking Multiple Problem Sizes

Benchmarking a single problem size is usually **not enough for real-world problems**. In most cases, we want to run different problem sizes for the same kernel. NVBench provides an **“axis”** feature to help with this. For example, to test input sizes from `10` to `1000000`:

```cpp
void sequence_bench(nvbench::state& state) {
  auto const n = state.get_int64("Num");
  auto data = thrust::device_vector<int>(n);

  state.exec(nvbench::exec_tag::sync, [&data](nvbench::launch& launch) {
    thrust::sequence(thrust::device.on(launch.get_stream()), data.begin(), data.end());
  });
}
NVBENCH_BENCH(sequence_bench)
  .add_int64_axis("Num", std::vector<nvbench::int64_t>{10, 100, 1000, 1000000});
```

**Axis is a powerful tool** provided by NVBench.
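Multiple axes can also be attached to the same benchmark, in which case NVBench sweeps the full cross product of their values. The sketch below is illustrative; the `Stride` axis is a hypothetical second parameter, not part of the original benchmark:

```cpp
// Illustrative sketch: two axes produce a 2 x 3 = 6-configuration sweep.
void strided_sequence_bench(nvbench::state& state) {
  auto const n      = state.get_int64("Num");
  auto const stride = state.get_int64("Stride");  // hypothetical second axis
  auto data = thrust::device_vector<int>(n * stride);

  state.exec(nvbench::exec_tag::sync, [&data, stride](nvbench::launch& launch) {
    // Write 0, stride, 2*stride, ... via thrust::sequence's init/step overload.
    thrust::sequence(thrust::device.on(launch.get_stream()),
                     data.begin(), data.end(), 0, static_cast<int>(stride));
  });
}
NVBENCH_BENCH(strided_sequence_bench)
  .add_int64_axis("Num", std::vector<nvbench::int64_t>{10, 1000})
  .add_int64_axis("Stride", std::vector<nvbench::int64_t>{1, 2, 4});
```

Users may also encounter situations where they want to test only certain sizes.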
NVBench provides a **flexible CLI**, so users can change the benchmark parameters **without recompiling the code**:

```bash
user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench -a Num=[10,100000]
# Devices

## [0] `Quadro RTX 8000`
* SM Version: 750 (PTX Version: 750)
* Number of SMs: 72
* SM Default Clock Rate: 1770 MHz
* Global Memory: 48232 MiB Free / 48403 MiB Total
* Global Memory Bus Peak: 672 GB/sec (384-bit DDR @7001MHz)
* Max Shared Memory: 64 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1024/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No

# Log

Run: [1/2] sequence_bench [Device=0 Num=10]
Pass: Cold: 0.006318ms GPU, 0.009948ms CPU, 0.50s total GPU, 4.37s total wall, 79152x
Run: [2/2] sequence_bench [Device=0 Num=100000]
Pass: Cold: 0.006586ms GPU, 0.010193ms CPU, 0.50s total GPU, 4.14s total wall, 75936x

# Benchmark Results

## sequence_bench

### [0] Quadro RTX 8000

| Num | Samples | CPU Time | Noise | GPU Time | Noise |
|--------|---------|-----------|-------|----------|--------|
| 10 | 79152x | 9.948 us | 9.63% | 6.318 us | 13.73% |
| 100000 | 75936x | 10.193 us | 9.62% | 6.586 us | 12.86% |
```

For more details about **CLI axis control**, please check [here](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md).

---

## Comparing Algorithms Using NVBench

Once benchmarks are set up, a major use case is evaluating the performance of different algorithms. For example, the same sequence algorithm can be written manually using `thrust::transform`. We can compare the performance of a manual transform sequence against `thrust::sequence`.

### Step 1: Record Reference Performance

Record the `thrust::sequence` benchmark in a JSON file for post-processing:

```bash
user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench --json sequence_ref.json
```

### Step 2: Update Code with `thrust::transform`

```cpp
void sequence_bench(nvbench::state& state) {
  auto const n = state.get_int64("Num");
  auto data = thrust::device_vector<int>(n);

  state.exec(nvbench::exec_tag::sync, [&data, n](nvbench::launch& launch) {
    thrust::transform(
      thrust::device.on(launch.get_stream()),
      thrust::counting_iterator<int>(0),
      thrust::counting_iterator<int>(static_cast<int>(n)),
      data.begin(),
      cuda::std::identity{}
    );
  });
}
```

### Step 3: Run Benchmark with Transform and Save JSON

```bash
user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench --json sequence_transform.json
```

### Step 4: Compare Results

NVBench provides a convenient script under `nvbench/scripts` called `nvbench_compare.py`. After copying the JSON files to the scripts folder:

```bash
user@nvbench-test:~/nvbench/scripts$ ./nvbench_compare.py sequence_ref.json sequence_transform.json
['sequence_ref.json', 'sequence_transform.json']
# sequence_bench

## [0] Quadro RTX 8000

| Num | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------|------------|-------------|------------|-------------|-----------|---------|----------|
| 10 | 6.288 us | 13.70% | 6.301 us | 14.38% | 0.013 us | 0.20% | SAME |
| 100 | 6.331 us | 13.74% | 6.350 us | 15.15% | 0.019 us | 0.31% | SAME |
| 1000 | 6.548 us | 13.29% | 6.504 us | 13.95% | -0.043 us | -0.66% | SAME |
| 1000000 | 12.528 us | 7.56% | 12.507 us | 8.41% | -0.021 us | -0.17% | SAME |

# Summary

- Total Matches: 4
  - Pass (diff <= min_noise): 4
  - Unknown (infinite noise): 0
  - Failure (diff > min_noise): 0
```

We can see that the performance of the two approaches is essentially the same.

---

For more information on how to use NVBench in your projects, please check the [NVBench repository](https://github.com/NVIDIA/nvbench). Feel free to raise questions or feature requests via **GitHub issues** or **discussions**, and enjoy benchmarking with NVBench!
From be0cda88a3b485a017b3f8bd3ba1cc6e326f7bc8 Mon Sep 17 00:00:00 2001
From: Yunsong Wang
Date: Fri, 19 Sep 2025 12:38:44 -0700
Subject: [PATCH 03/12] Update best_practices.md

---
 docs/best_practices.md | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/docs/best_practices.md b/docs/best_practices.md
index ca77142e..8b3b14b8 100644
--- a/docs/best_practices.md
+++ b/docs/best_practices.md
@@ -1,10 +1,8 @@
 # NVBench Best Practices

-NVBench is a **small but actively developed benchmarking library** for CUDA GPU workloads. The [README](https://github.com/NVIDIA/nvbench/blob/main/README.md) is the best starting point, listing all the details where users can get hands-on 101 from installation to usage of the framework. It has links to, e.g., the [benchmark documentation](https://github.com/NVIDIA/nvbench/blob/main/docs/benchmarks.md), which contains all the information needed for users to use NVBench and all the key features NVBench supports. It also has links and pointers to [code examples](https://github.com/NVIDIA/nvbench/tree/main/examples) where users can get hands-on samples on how to apply NVBench to their own codebase.
+NVBench is a **small yet actively developed benchmarking library** for CUDA GPU workloads. The [README](https://github.com/NVIDIA/nvbench/blob/main/README.md) serves as the ideal starting point, providing detailed guidance for users to get hands-on experience—from installation to framework usage. It includes links to the [benchmark documentation](https://github.com/NVIDIA/nvbench/blob/main/docs/benchmarks.md), which covers all essential features and usage instructions, as well as links to [code examples](https://github.com/NVIDIA/nvbench/tree/main/examples) that demonstrate how to integrate NVBench into a user’s codebase.

-This document is not meant to replace the detailed benchmark documentation ([here](https://github.com/NVIDIA/nvbench/blob/main/docs/benchmarks.md)) or the CLI help guides ([CLI help](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md) and [CLI axis help](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md)). All the examples shown are for demonstration purposes and are **not a recommendation guide for the best use in real-case problems**.
-
----
+This document is **not intended to replace** the detailed benchmark documentation ([here](https://github.com/NVIDIA/nvbench/blob/main/docs/benchmarks.md)) or the CLI help guides ([CLI help](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help.md) and [CLI axis help](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md)). All examples provided are for demonstration purposes and are **not intended as recommendations for best practices in real-world scenarios**.

 ## NVBench
@@ -15,8 +13,6 @@ This document is **not intended to replace** the detailed benchmark documentatio
 * Also supports benchmarking normal CPU implementations.
 * Python code support is on the roadmap.

----

 ## Benchmark Your GPU Code with NVBench
 Let’s begin with a simple example for users who are new to NVBench and want to learn the basics of benchmarking GPU code. Consider measuring the performance of `thrust::sequence` on a GPU. Similar to `std::iota`, suppose we have an input array of 10 elements, and we want `thrust::sequence` to populate it with the sequence of values from 0 to 9.
The following example demonstrates this approach:
@@ -233,8 +229,6 @@ Pass: Cold: 0.006586ms GPU, 0.010193ms CPU, 0.50s total GPU, 4.14s total wall, 7
 For more details about **CLI axis control**, please check [here](https://github.com/NVIDIA/nvbench/blob/main/docs/cli_help_axis.md).

----

 ## Comparing Algorithms Using NVBench

From 87aa856b4dca46c0f3b0a37bcfbd6ea0fbb2cf79 Mon Sep 17 00:00:00 2001
From: Yunsong Wang
Date: Fri, 19 Sep 2025 12:41:37 -0700
Subject: [PATCH 04/12] Update best_practices.md

---
 docs/best_practices.md | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/docs/best_practices.md b/docs/best_practices.md
index 8b3b14b8..b43b5b8d 100644
--- a/docs/best_practices.md
+++ b/docs/best_practices.md
@@ -25,13 +25,10 @@ void sequence_bench(nvbench::state& state) {
 }
 NVBENCH_BENCH(sequence_bench);
 ```
+Will this code work as-is? Depending on the build system configuration, compilation may succeed but generate warnings indicating that `launch` is an unused parameter. The code may or may not execute correctly. This often occurs when users, accustomed to a sequential programming mindset, overlook the fact that GPU architectures are highly parallel. Proper use of streams and synchronization is essential for accurately measuring performance in benchmark code.

-Will this code work as-is? Depending on the build system configuration, compilation may succeed but produce warnings indicating that `launch` is an unused parameter. The code may or may not execute correctly. This is a common scenario when users, accustomed to a sequential programming mindset, overlook the fact that GPU architectures are highly parallel. Streams and synchronization play a critical role in accurately measuring performance in benchmark code.
+A common mistake in this context is neglecting stream specification: NVBench requires knowledge of the exact CUDA stream being targeted to correctly trace kernel execution and measure performance. Therefore, users must explicitly provide the stream to be benchmarked. For example, passing the NVBench launch stream ensures correct execution and accurate measurement:

-In this context, two common mistakes should be noted:
-
-1. Stream Awareness: NVBench requires knowledge of the exact CUDA stream being targeted to correctly trace kernel execution and measure performance.
-2. Explicit Stream Specification: Users must explicitly provide the stream to be benchmarked. For example, passing the NVBench launch stream ensures correct execution and measurement:
 ```cpp
 void sequence_bench(nvbench::state& state) {
   auto data = thrust::device_vector<int>(10);
   state.exec([&data](nvbench::launch& launch) {
     thrust::sequence(thrust::device.on(launch.get_stream()), data.begin(), data.end());
   });
 }
 NVBENCH_BENCH(sequence_bench);
 ```
@@ -75,7 +72,7 @@ void sequence_bench(nvbench::state& state) {
 NVBENCH_BENCH(sequence_bench);
 ```
-If we run the benchmark now, results are displayed without issues. Users may notice, especially in a multi-GPU environment, that many results are collected more than expected:
+When the benchmark is executed, results are displayed without issues.
However, users, particularly in a multi-GPU environment, may observe that more results are collected than expected:

From 63f884d666920b792627f15953353023955e8d41 Mon Sep 17 00:00:00 2001
From: Yunsong Wang
Date: Fri, 19 Sep 2025 12:46:30 -0700
Subject: [PATCH 05/12] Update best_practices.md

---
 docs/best_practices.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/best_practices.md b/docs/best_practices.md
index b43b5b8d..5d92c8ce 100644
--- a/docs/best_practices.md
+++ b/docs/best_practices.md
@@ -128,7 +128,7 @@ Pass: Cold: 0.007819ms GPU, 0.013864ms CPU, 0.50s total GPU, 3.59s total wall, 6
 | 63952x | 13.864 us | 432.95% | 7.819 us | 447.95% |
 ```

-By default, NVBench runs all GPUs locally unless specified. If not specified, it will run all available GPUs. This is especially problematic if your system has multiple GPUs and you want to target a particular GPU to save build time. In our case, we target **RTX8000**:
+By default, NVBench runs benchmarks on all available GPUs unless specified otherwise. On multi-GPU systems, this can unnecessarily increase runtime and resource usage. To target a specific GPU, saving both time and resources, you can set the `CUDA_VISIBLE_DEVICES` environment variable. In our case, we target the **RTX8000**:

 ```bash
 user@nvbench-test:~/nvbench/build/bin$ export CUDA_VISIBLE_DEVICES=0
 ```

From 0b25b91223476f52d7531ea21505b7d363640e69 Mon Sep 17 00:00:00 2001
From: Yunsong Wang
Date: Fri, 19 Sep 2025 12:53:30 -0700
Subject: [PATCH 06/12] Update best_practices.md

---
 docs/best_practices.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/best_practices.md b/docs/best_practices.md
index 5d92c8ce..6c2f8134 100644
--- a/docs/best_practices.md
+++ b/docs/best_practices.md
@@ -293,4 +293,4 @@ We can see that the performance of the two approaches is essentially the same.

 ---

-For more information on how to use NVBench in your projects, please check the [NVBench repository](https://github.com/NVIDIA/nvbench). Feel free to raise questions or feature requests via **GitHub issues** or **discussions**, and enjoy benchmarking with NVBench!
+For more information on how to use NVBench in your projects, please check the [NVBench repository](https://github.com/NVIDIA/nvbench). Feel free to raise questions or feature requests via GitHub [issues](https://github.com/NVIDIA/nvbench/issues) or [discussions](https://github.com/NVIDIA/nvbench/discussions), and enjoy benchmarking with NVBench!

From df27dcb5df2a1bd1e612f5346c53c5bed3f3b28f Mon Sep 17 00:00:00 2001
From: Yunsong Wang
Date: Fri, 19 Sep 2025 12:55:40 -0700
Subject: [PATCH 07/12] Update best_practices.md

---
 docs/best_practices.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/best_practices.md b/docs/best_practices.md
index 6c2f8134..a4a6f58b 100644
--- a/docs/best_practices.md
+++ b/docs/best_practices.md
@@ -1,6 +1,6 @@
 # NVBench Best Practices

-NVBench is a **small yet actively developed benchmarking library** for CUDA GPU workloads. The [README](https://github.com/NVIDIA/nvbench/blob/main/README.md) serves as the ideal starting point, providing detailed guidance for users to get hands-on experience—from installation to framework usage.
It includes links to the [benchmark documentation](https://github.com/NVIDIA/nvbench/blob/main/docs/benchmarks.md), which covers all essential features and usage instructions, as well as links to [code examples](https://github.com/NVIDIA/nvbench/tree/main/examples) that demonstrate how to integrate NVBench into a user’s codebase.
+NVBench is a **small yet actively developed benchmarking library** for CUDA GPU workloads. The [README](https://github.com/NVIDIA/nvbench/blob/main/README.md) serves as the ideal starting point, providing detailed guidance for users to get hands-on experience—from installation to framework usage. It includes links to the [benchmark documentation](https://github.com/NVIDIA/nvbench/blob/main/docs/benchmarks.md), which covers all essential features and usage instructions, as well as links to [code examples](https://github.com/NVIDIA/nvbench/tree/main/examples) that demonstrate how to integrate and apply various NVBench features within a user’s codebase.

From 54cbcd0c427c251449ed3f3fdc6b9066db054340 Mon Sep 17 00:00:00 2001
From: Yunsong Wang
Date: Fri, 19 Sep 2025 12:58:37 -0700
Subject: [PATCH 08/12] Update best_practices.md

---
 docs/best_practices.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/best_practices.md b/docs/best_practices.md
index a4a6f58b..27f56ea8 100644
--- a/docs/best_practices.md
+++ b/docs/best_practices.md
@@ -6,8 +6,7 @@ This document is **not intended to replace** the detailed benchmark documentatio

 ## NVBench

-* Purpose-built for CUDA GPU workloads.
-* Provides GPU-aware features: warmup runs, synchronization, throughput/latency metrics, and parameter sweeps, etc.
+* Provides GPU-aware features: runtime customization, throughput calculations, parameter sweeps, and more.
 * Produces machine-readable output (JSON, CSV) suitable for regression tracking and CI pipelines.

From b95acb80d49f6b3a0d3aca90e01ed0e1c7764de0 Mon Sep 17 00:00:00 2001
From: Yunsong Wang
Date: Fri, 19 Sep 2025 12:59:20 -0700
Subject: [PATCH 09/12] Update best_practices.md

---
 docs/best_practices.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/best_practices.md b/docs/best_practices.md
index 27f56ea8..ce051a31 100644
--- a/docs/best_practices.md
+++ b/docs/best_practices.md
@@ -12,7 +12,7 @@ This document is **not intended to replace** the detailed benchmark documentatio
 * Also supports benchmarking normal CPU implementations.
 * Python code support is on the roadmap.

-## Benchmark Your GPU Code with NVBench
+## Benchmarking GPU Workloads with NVBench

 Let’s begin with a simple example for users who are new to NVBench and want to learn the basics of benchmarking GPU code. Consider measuring the performance of `thrust::sequence` on a GPU. Similar to `std::iota`, suppose we have an input array of 10 elements, and we want `thrust::sequence` to populate it with the sequence of values from 0 to 9.
The following example demonstrates this approach: ```cpp

From df7abef8493b921773d8f9c12e2f4bc6c1a37d10 Mon Sep 17 00:00:00 2001
From: Yunsong Wang
Date: Fri, 19 Sep 2025 13:03:48 -0700
Subject: [PATCH 10/12] Update best_practices.md

---
 docs/best_practices.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/best_practices.md b/docs/best_practices.md
index ce051a31..c9138954 100644
--- a/docs/best_practices.md
+++ b/docs/best_practices.md
@@ -288,7 +288,9 @@ user@nvbench-test:~/nvbench/scripts$ ./nvbench_compare.py sequence_ref.json sequ
 - Failure (diff > min_noise): 0
 ```

-We can see that the performance of the two approaches is essentially the same.
+We can see that the performance of the two approaches is essentially the same.
+
+The same JSON files can also be fed into CI pipelines to trace performance regressions over time.

 ---

From 8af0aa38d5d4cba0e34ca39a8bda8b2565fa3b89 Mon Sep 17 00:00:00 2001
From: Yunsong Wang
Date: Mon, 29 Sep 2025 12:22:07 -0700
Subject: [PATCH 11/12] Updates

---
 docs/best_practices.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/docs/best_practices.md b/docs/best_practices.md
index c9138954..8cc82a0c 100644
--- a/docs/best_practices.md
+++ b/docs/best_practices.md
@@ -18,13 +18,13 @@ Let’s begin with a simple example

 ```cpp
 void sequence_bench(nvbench::state& state) {
   auto data = thrust::device_vector<int>(10);
-  state.exec([&data](nvbench::launch& launch) {
+  state.exec([&data](nvbench::launch&) {
     thrust::sequence(data.begin(), data.end());
   });
 }
 NVBENCH_BENCH(sequence_bench);
 ```
-Will this code work as-is? Depending on the build system configuration, compilation may succeed but generate warnings indicating that `launch` is an unused parameter. The code may or may not execute correctly. This often occurs when users, accustomed to a sequential programming mindset, overlook the fact that GPU architectures are highly parallel. Proper use of streams and synchronization is essential for accurately measuring performance in benchmark code.
+Will this code run correctly as written? While it may compile successfully, runtime behavior isn’t guaranteed. This is a common pitfall for developers used to sequential programming, who may overlook the massively parallel nature of GPU architectures. To ensure accurate performance measurement in benchmark code, proper use of streams and synchronization is crucial.

 A common mistake in this context is neglecting stream specification: NVBench requires knowledge of the exact CUDA stream being targeted to correctly trace kernel execution and measure performance. Therefore, users must explicitly provide the stream to be benchmarked. For example, passing the NVBench launch stream ensures correct execution and accurate measurement:

@@ -74,7 +74,7 @@ NVBENCH_BENCH(sequence_bench);
 ```
-When the benchmark is executed, results are displayed without issues. However, users, particularly in a multi-GPU environment, may observe that more results are collected than expected:
+When the benchmark is executed, results are displayed without issues.
However, users, particularly in a multi-GPU environment, may observe that more results are collected than expected: ```bash -user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench +user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench # Devices ## [0] `Quadro RTX 8000` @@ -106,9 +106,9 @@ user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench # Log Run: [1/2] sequence_bench [Device=0] -Pass: Cold: 0.006150ms GPU, 0.009768ms CPU, 0.50s total GPU, 4.52s total wall, 81312x +Pass: Cold: 0.006150ms GPU, 0.009768ms CPU, 0.50s total GPU, 4.52s total wall, 81312x Run: [2/2] sequence_bench [Device=1] -Pass: Cold: 0.007819ms GPU, 0.013864ms CPU, 0.50s total GPU, 3.59s total wall, 63952x +Pass: Cold: 0.007819ms GPU, 0.013864ms CPU, 0.50s total GPU, 3.59s total wall, 63952x # Benchmark Results @@ -136,7 +136,7 @@ user@nvbench-test:~/nvbench/build/bin$ export CUDA_VISIBLE_DEVICES=0 Now, if we rerun: ```bash -user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench +user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench # Devices ## [0] `Quadro RTX 8000` @@ -155,7 +155,7 @@ user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench # Log Run: [1/1] sequence_bench [Device=0] -Pass: Cold: 0.006257ms GPU, 0.009850ms CPU, 0.50s total GPU, 4.40s total wall, 79920x +Pass: Cold: 0.006257ms GPU, 0.009850ms CPU, 0.50s total GPU, 4.40s total wall, 79920x # Benchmark Results @@ -207,9 +207,9 @@ user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench -a Num=[10,100000] # Log Run: [1/2] sequence_bench [Device=0 Num=10] -Pass: Cold: 0.006318ms GPU, 0.009948ms CPU, 0.50s total GPU, 4.37s total wall, 79152x +Pass: Cold: 0.006318ms GPU, 0.009948ms CPU, 0.50s total GPU, 4.37s total wall, 79152x Run: [2/2] sequence_bench [Device=0 Num=100000] -Pass: Cold: 0.006586ms GPU, 0.010193ms CPU, 0.50s total GPU, 4.14s total wall, 75936x +Pass: Cold: 0.006586ms GPU, 0.010193ms CPU, 0.50s total GPU, 4.14s total wall, 75936x # Benchmark Results @@ -267,7 +267,7 @@ user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench --json sequence_transfor NVBench provides a convenient script under `nvbench/scripts` called `nvbench_compare.py`. After copying the JSON files to the scripts folder: ```bash -user@nvbench-test:~/nvbench/scripts$ ./nvbench_compare.py sequence_ref.json sequence_transform.json +user@nvbench-test:~/nvbench/scripts$ ./nvbench_compare.py sequence_ref.json sequence_transform.json ['sequence_ref.json', 'sequence_transform.json'] # sequence_bench @@ -288,7 +288,7 @@ user@nvbench-test:~/nvbench/scripts$ ./nvbench_compare.py sequence_ref.json sequ - Failure (diff > min_noise): 0 ``` -We can see that the performance of the two approaches is essentially the same. +We can see that the performance of the two approaches is essentially the same. (wanted to mention users can also use the json file to trace regressions in CI) From 3a9c80d33b48f8d1e36fd084ddd8c3f330345372 Mon Sep 17 00:00:00 2001 From: Yunsong Wang Date: Mon, 29 Sep 2025 12:38:12 -0700 Subject: [PATCH 12/12] Updates --- docs/best_practices.md | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/docs/best_practices.md b/docs/best_practices.md index 8cc82a0c..f2ae7e87 100644 --- a/docs/best_practices.md +++ b/docs/best_practices.md @@ -127,16 +127,9 @@ Pass: Cold: 0.007819ms GPU, 0.013864ms CPU, 0.50s total GPU, 3.59s total wall, 6 | 63952x | 13.864 us | 432.95% | 7.819 us | 447.95% | ``` -By default, NVBench runs benchmarks on all available GPUs unless specified otherwise. 
On multi-GPU systems, this can unnecessarily increase runtime and resource usage. To target a specific GPU, saving both time and resources, you can set the `CUDA_VISIBLE_DEVICES` environment variable. In our case, we target the **RTX8000**: - -```bash -user@nvbench-test:~/nvbench/build/bin$ export CUDA_VISIBLE_DEVICES=0 -``` - -Now, if we rerun: - +By default, NVBench executes benchmarks on all available GPUs unless instructed otherwise. On multi-GPU systems, this can lead to longer runtimes and higher resource usage. To focus on a specific GPU and optimize both time and resources, users can use the `-d` CLI option to select the target GPU. In our example, we target the **RTX8000**: ```bash -user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench +user@nvbench-test:~/nvbench/build/bin$ ./sequence_bench -d 0 # Devices ## [0] `Quadro RTX 8000`