diff --git a/Accelerated_Python_User_Guide/notebooks/2D_Navier_Stokes_In_Warp.ipynb b/Accelerated_Python_User_Guide/notebooks/2D_Navier_Stokes_In_Warp.ipynb
new file mode 100644
index 00000000..7d57b9ef
--- /dev/null
+++ b/Accelerated_Python_User_Guide/notebooks/2D_Navier_Stokes_In_Warp.ipynb
@@ -0,0 +1,1535 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# SPDX-License-Identifier: Apache-2.0 AND CC-BY-NC-4.0\n",
+        "#\n",
+        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+        "# you may not use this file except in compliance with the License.\n",
+        "# You may obtain a copy of the License at\n",
+        "#\n",
+        "# http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing, software\n",
+        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+        "# See the License for the specific language governing permissions and\n",
+        "# limitations under the License."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# GPU-Accelerated 2-D Navier Stokes Simulation using NVIDIA Warp\n",
+        "\n",
+        "## Overview\n",
+        "\n",
+        "In this notebook, we will build a fluid solver entirely in Python using the Warp framework. Specifically, we will simulate 2-D turbulent flow in a box.\n",
+        "\n",
+        "Through this implementation, you will learn:\n",
+        "\n",
+        "* Writing SIMT (single instruction, multiple threads) kernels for finite difference operators and flow initialization\n",
+        "* Warp's tile-based programming primitives for efficient matrix operations, e.g., FFT, inverse FFT, and matrix transpose\n",
+        "* CUDA graph capture for improving performance of timestepping codes\n",
+        "\n",
+        "By the end of this lab, we will have built a GPU-accelerated 2-D Navier-Stokes solver in a periodic box, while gaining experience in applying Warp's features to leverage a GPU for a computational physics problem. "
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## Setup\n",
+        "\n",
+        "Before we begin implementing the 2-D N-S solver, let's ensure we have all the necessary packages installed."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# install Warp\n",
+        "%pip install warp-lang \n",
+        "\n",
+        "# install visualization package\n",
+        "%pip install matplotlib Pillow"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Now let us import the necessary libraries and initialize Warp to check if GPU support is available:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import numpy as np\n",
+        "import warp as wp\n",
+        "\n",
+        "if wp.get_cuda_device_count() > 0:\n",
+        "    print(\"GPU detected successfully\")\n",
+        "else:\n",
+        "    print(\"No GPU detected!\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## Introduction\n",
+        "\n",
+        "This seemingly simple 2-D example combines multiple Warp features that can be also leveraged to build industrial-grade solvers:\n",
+        " - Accelerated Warp kernels for finite difference operators\n",
+        " - Tile-based matrix operations for a more performant code\n",
+        " - CUDA graph capture to speed-up multiple calls to the same kernel\n",
+        "\n",
+        "Before diving into the code, let's understand a bit of physics behind the underlying equations. "
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## Background: Equations, boundary conditions, and initial conditions\n",
+        "\n",
+        "### 1. Vorticity-stream function formulation\n",
+        "\n",
+        "For a 2-D incompressible flow, we define:\n",
+        "- **Vorticity**: $\\omega = \\partial v/\\partial x - \\partial u/\\partial y$\n",
+        "- **Stream function**: $\\psi$ such that $u = \\partial\\psi/\\partial y$, $v = -\\partial\\psi/\\partial x$\n",
+        "\n",
+        "In the vorticity and stream function equation above, $u$ and $v$ are the two components of the velocity vector $\\mathbf{u} = [u, v]$ in 2-D. This formulation automatically satisfies continuity and eliminates pressure from the Navier-Stokes equations.\n",
+        "\n",
+        "### 2. Governing equation\n",
+        "\n",
+        "In 2-D, the non-dimensional governing equations for incompressible flows can be written using the **vorticity-stream function transport equation** as\n",
+        "\n",
+        "$$\\underbrace{\\frac{\\partial \\omega}{\\partial t}}_{\\text{unsteadiness}} + \\underbrace{\\frac{\\partial \\psi}{\\partial y}\\frac{\\partial \\omega}{\\partial x} - \\frac{\\partial \\psi}{\\partial x}\\frac{\\partial \\omega}{\\partial y}}_{\\text{advection}} = \\underbrace{\\frac{1}{Re}\\left(\\frac{\\partial^2 \\omega}{\\partial x^2} + \\frac{\\partial^2 \\omega}{\\partial y^2}\\right)}_{\\text{diffusion}} \\tag{1},$$\n",
+        "\n",
+        "along with the relationship between stream function and vorticity, given by a Poisson equation. \n",
+        "\n",
+        "Starting from the definition of vorticity (in 2-D)\n",
+        " \n",
+        "$$\\omega = \\frac{\\partial v}{\\partial x} - \\frac{\\partial u}{\\partial y}.$$\n",
+        " \n",
+        "Substituting the streamfunction relations $u = \\frac{\\partial \\psi}{\\partial y}$ and $v = -\\frac{\\partial \\psi}{\\partial x}$ gives\n",
+        " \n",
+        "$$\\omega = \\frac{\\partial}{\\partial x}\\left(-\\frac{\\partial \\psi}{\\partial x}\\right) - \\frac{\\partial}{\\partial y}\\left(\\frac{\\partial \\psi}{\\partial y}\\right),$$\n",
+        " \n",
+        "$$\\omega = -\\frac{\\partial^2 \\psi}{\\partial x^2} - \\frac{\\partial^2 \\psi}{\\partial y^2}.$$\n",
+        " \n",
+        "Rearranging gives the Poisson equation\n",
+        " \n",
+        "$$\\frac{\\partial^2 \\psi}{\\partial x^2} + \\frac{\\partial^2 \\psi}{\\partial y^2} = -\\omega \\tag{2}.$$\n",
+        "\n",
+        "When writing the solver, equations (1) and (2) will be solved at each timestep to evolve the flow. \n",
+        "\n",
+        "**NOTE: These equations might look intimidating, but we will break down the different parts in the subsequent cells to make them more tractable. So, please bear with us for a bit!**\n",
+        "\n",
+        "### 3. Spectral Poisson solver\n",
+        "\n",
+        "The Poisson equation in (2) can be discretized and solved using iterative methods like Jacobi and Gauss-Seidel methods. However, since we have a periodic domain, we can represent both $\\omega$ and $\\psi$ using Fourier series with sine and cosine basis functions as\n",
+        " \n",
+        "$$\\omega(x,y) = \\sum_{m,n} \\hat{\\omega}_{m,n} e^{i(k_x x + k_y y)},$$\n",
+        " \n",
+        "$$\\psi(x,y) = \\sum_{m,n} \\hat{\\psi}_{m,n} e^{i(k_x x + k_y y)},$$\n",
+        " \n",
+        "where $\\hat{\\omega}_{m,n}$ and $\\hat{\\psi}_{m,n}$ are what we call the Fourier coefficients, and $k_x = \\frac{2\\pi m}{L_x}$, $k_y = \\frac{2\\pi n}{L_y}$ are the wavenumbers. Now, if we substitute these Fourier representations back into equation (2), we can compute the derivatives as\n",
+        "\n",
+        "$$\\frac{\\partial^2 \\psi}{\\partial x^2} = \\sum_{m,n} -k_x^2 \\hat{\\psi}_{m,n} e^{i(k_x x + k_y y)},$$\n",
+        " \n",
+        "$$\\frac{\\partial^2 \\psi}{\\partial y^2} = \\sum_{m,n} -k_y^2 \\hat{\\psi}_{m,n} e^{i(k_x x + k_y y)}.$$\n",
+        " \n",
+        "Substituting these into the Poisson equation, i.e., equation (2) above will result in\n",
+        " \n",
+        "$$\\sum_{m,n} -(k_x^2 + k_y^2) \\hat{\\psi}_{m,n} e^{i(k_x x + k_y y)} = -\\sum_{m,n} \\hat{\\omega}_{m,n} e^{i(k_x x + k_y y)}.$$\n",
+        " \n",
+        "Since this must hold for each pair of $[k_x, k_y]$ independently, **the Poisson equation reduces to a simple algebraic equation in Fourier space** \n",
+        "\n",
+        "$$\\hat{\\psi}_{m,n} =  \\frac{\\hat{\\omega}_{m,n}}{(k_x^2 + k_y^2)}. \\tag{3}$$\n",
+        "\n",
+        "\n",
+        "Note that **Fourier space** refers to the spectral domain where we work directly with the Fourier coefficients $\\hat{\\omega}_{m,n}$ and $\\hat{\\psi}_{m,n}$ along with their corresponding wavenumbers $k_x$ and $k_y$. After solving in Fourier space, we recover the physical fields $\\omega$ and $\\psi$ using the inverse Fourier transform. The wavenumbers are defined as $k_x = \\frac{2\\pi m}{L_x}, \\quad k_y = \\frac{2\\pi n}{L_y}$, where $m$ and $n$ are the Fourier mode indices, each ranging from $[-N/2, N/2 -1]$ on a square grid (with $N$ being the total number of grid points along any given direction of the square grid). \n",
+        "\n",
+        "\n",
+        "### 4. Time integration \n",
+        "Runge-Kutta methods advance the solution by taking multiple intermediate \"substeps\" within each timestep, achieving better accuracy than a single forward step. In this example, we use a **strong-stability preserving (SSP) third-order** **explicit** Runge-Kutta method.\n",
+        "\n",
+        "- **Third-order**: The numerical error scales as $\\mathcal{O}(\\Delta t)^3$. So, halving the timestep reduces error by $\\sim 8\\times$.\n",
+        "\n",
+        "- **SSP**: Schemes designed specifically to prevent spurious oscillations for hyperbolic PDEs. Interested readers can read more [here](https://api.drum.lib.umd.edu/server/api/core/bitstreams/faa96f40-0fbb-4af0-84a3-5e3b80c908df/content).\n",
+        "\n",
+        "- **Explicit**: Each new timestep is computed directly from the previous one (easier to program, as we will see!).\n",
+        "\n",
+        "SSP RK3 scheme takes the following form below, where $\\omega^{(n)}$ and $\\omega^{(n+1)}$ are the vorticity field at the beginning and the end of a timestep. $\\omega^{(1)}$ and $\\omega^{(2)}$ are the intermediate vorticity fields.\n",
+        "\n",
+        "$$\\omega^{(1)} = \\omega^{(n)} + \\Delta t \\mathcal{L}(\\omega^{(n)}),$$\n",
+        " \n",
+        "$$\\omega^{(2)} = \\frac{3}{4}\\omega^{(n)} + \\frac{1}{4}\\omega^{(1)} + \\frac{1}{4}\\Delta t \\mathcal{L}(\\omega^{(1)}),$$ \n",
+        "\n",
+        "$$\\omega^{(n+1)} = \\frac{1}{3}\\omega^{(n)} + \\frac{2}{3}\\omega^{(2)} + \\frac{2}{3}\\Delta t \\mathcal{L}(\\omega^{(2)}). \\tag{4}$$\n",
+        "\n",
+        "Here $\\mathcal{L}(\\omega)$ encapsulates the finite-difference forms of both advection and diffusion terms in equation (1). \n",
+        "\n",
+        "\n",
+        "### 5. Initial condition: 2-D decaying turbulence\n",
+        "\n",
+        "In this problem, we solve the governing equations on a square domain of size $2\\pi \\times 2\\pi$. Recall that the wavenumbers are defined as $k_x = 2 \\pi m / L_x$ and $k_y = 2 \\pi n / L_y$. By choosing $L_x = L_y = 2\\pi$, the wavenumbers $k_x$ and $k_y$ become integers (specifically, $k_x = m$ and $k_y = n$), which simplifies some of the FFT-based operations. This choice of domain size is a common convention in spectral solver literature for this reason.\n",
+        "\n",
+        "The initial condition is generated using the energy spectrum from San & Staples CNF (2012)\n",
+        "\n",
+        "$$E(k) = \\frac{a_s}{2 k_p}\\left(\\frac{k}{k_p}\\right)^{2s+1} \\exp\\left[-\\left(s + \\frac{1}{2}\\right)\\left(\\frac{k}{k_p}\\right)^2\\right] \\tag{5},$$\n",
+        "\n",
+        "where\n",
+        "- $k = |\\mathbf{k}| = \\sqrt{k_x^2 + k_y^2}$,\n",
+        "- $k_p = 12 (wavenumber at which $E(k)$ peaks), \n",
+        "- $s = 3$ (shape parameter of the spectrum),\n",
+        "- $a_s = (2s+1)^{s+1}/(2^s s!)$.\n",
+        "\n",
+        "The magnitude of Fourier coefficients related to the initial energy spectrum is given by \n",
+        "\n",
+        "$$|\\hat{\\omega}(\\mathbf{k})| = \\sqrt{\\frac{k}{\\pi}E(k)},$$\n",
+        "\n",
+        "which is then randomized with a phase function $\\zeta(\\mathbf{k})$ so that $\\hat{\\omega}(k_x, k_y) = \\hat{\\omega}(\\mathbf{k})$ becomes \n",
+        "\n",
+        "$$\\hat{\\omega}(\\mathbf{k}) = \\sqrt{\\frac{k}{\\pi}E(k)} \\, e^{i\\zeta(\\mathbf{k})}. \\tag{6}$$\n",
+        " \n",
+        "The phase function $\\zeta(\\mathbf{k})$ is constructed with specific symmetry constraints to ensure that the vorticity field $\\omega(x,y)$ is real-valued in the physical space. The details of how $\\zeta(\\mathbf{k})$ is formulated is provided in **Appendix A**.\n",
+        "\n",
+        "**NOTE: Although the initialization equations may appear complex, this notebook focuses primarily on the Warp implementation. Much of the boilerplate code will be provided for you.**"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## Building blocks of the solver\n",
+        "\n",
+        "$$\n",
+        "\\boxed{\\;\\;\n",
+        "\\begin{array}{c}\n",
+        "\\large\\textbf{Solver Pipeline} \\\\[14pt]\n",
+        "\\hat{\\omega} \\;\\xrightarrow{\\;\\text{IFFT}\\;}\\; \\omega(t) \\;\\xrightarrow{\\quad}\\;\n",
+        "\\underbrace{\\text{Discretize} \\;\\to\\; \\text{1 RK substep} \\;\\to\\; \\text{Poisson solve}}_{\\times\\, 3 \\text{\\ times for a RK3 scheme}} \n",
+        "\\;\\xrightarrow{\\quad}\\; \\omega(t+\\Delta t) \\\\[6pt]\n",
+        "\\uparrow \\hspace{20em} \\downarrow \\\\[-2pt]\n",
+        "\\xleftarrow{\\hspace{5em} \\text{For next timestep/RK step} \\hspace{2em}}\n",
+        "\\end{array}\n",
+        "\\;\\;}\n",
+        "$$\n",
+        "\n",
+        "There are four main building blocks of the solver that we would tackle in a sequential manner.\n",
+        "\n",
+        "- Initialize $\\omega$, i.e., generate $\\hat{\\omega}$ in the Fourier space using the energy spectrum specified above and inverse FFT it to the physical space.\n",
+        "- Discretize the advection and diffusion terms to obtain the RHS $\\mathcal{L}(\\omega)$ of $\\frac{\\partial \\omega}{\\partial t} = \\mathcal{L}(\\omega)$ (see equation 1).\n",
+        "- Advancing $\\omega(t)$ to $\\omega(t+\\Delta t)$ using SSP RK3 scheme (equation 4).\n",
+        "- Solve Poisson equation (equation 2) in the Fourier space $\\hat{\\psi}_{m,n} = \\hat{\\omega}_{m,n}/(k_x^2 + k_y^2)$ (equation 3), and inverse FFT $\\hat{psi}$ to $\\psi$ after each RK substep.  "
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## Simulation parameters"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# grid resolution\n",
+        "N_GRID = 512\n",
+        "# box size\n",
+        "LEN = 2 * np.pi\n",
+        "# delta t for timestepping\n",
+        "DT = 0.001\n",
+        "# Reynolds number\n",
+        "RE = 1000.0\n",
+        "# define SSP-RK3 coefficients used for timestepping\n",
+        "RK3_COEFFS = [\n",
+        "    [1.0, 0.0, 1.0],\n",
+        "    [3.0 / 4.0, 1.0 / 4.0, 1.0 / 4.0],\n",
+        "    [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0],\n",
+        "]\n",
+        "# h = l/N (size of a computational cell)\n",
+        "H = LEN / N_GRID\n",
+        "# parameters for Warp's tiled-FFT functionality\n",
+        "TILE_M = 1\n",
+        "TILE_N = N_GRID\n",
+        "TILE_TRANSPOSE_DIM = 16\n",
+        "BLOCK_DIM = TILE_N // 2"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## Building block 1: Flow initialization in Warp\n",
+        "\n",
+        "Recall from equation (6) in the Background section that the initial vorticity field $\\hat{\\omega}(\\mathbf{k})$ in Fourier space is given by\n",
+        "\n",
+        "$$\\hat{\\omega}(\\mathbf{k}) = \\underbrace{\\sqrt{\\frac{k}{\\pi}E(k)}}_{\\text{amplitude}} \\cdot \\underbrace{e^{i\\zeta(\\mathbf{k})}}_{\\text{phase}} = \\underbrace{\\sqrt{\\frac{k}{\\pi}E(k)}}_{\\text{amplitude}} \\cdot \\left(\\cos\\zeta(\\mathbf{k}) + i\\sin\\zeta(\\mathbf{k})\\right),$$\n",
+        "\n",
+        "which gives us the real and imaginary parts\n",
+        " \n",
+        "$$\\text{Re}(\\hat{\\omega}) = \\sqrt{\\frac{k}{\\pi}E(k)} \\cos\\zeta(\\mathbf{k}), \\quad \\text{Im}(\\hat{\\omega}) = \\sqrt{\\frac{k}{\\pi}E(k)} \\sin\\zeta(\\mathbf{k}).$$\n",
+        "\n",
+        "$E(k)$ is the energy spectrum (equation 5) and $\\zeta(\\mathbf{k})$ is the phase function with symmetry constraints ensuring a real-valued physical field.\n",
+        "\n",
+        "**What do we need to implement?**\n",
+        "\n",
+        "To initialize the flow, we need:\n",
+        "\n",
+        "1. A kernel to populate $\\hat{\\omega}(\\mathbf{k})$ on the entire 2-D grid using equation (6).\n",
+        "\n",
+        "2. Two helper functions for that kernel (and equation 6):\n",
+        "\n",
+        "     a. A function to compute the **energy spectrum** $E(k)$ at each wavenumber magnitude.\n",
+        "\n",
+        "     b. A function to compute the **phase** $\\zeta(\\mathbf{k})$.\n",
+        "  \n",
+        "The helper functions will be `@wp.func` device functions (callable from kernels), and the initialization kernel will be a `@wp.kernel`."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "**Note on Warp math utilities:** Warp provides built-in functions for basic operations like `wp.cos()`, `wp.sqrt()`, `wp.sin()`, etc. for use in kernels. For complex number arithmetic, Warp uses `wp.vec2f` and `wp.vec2d` as data types for 32-bit and 64-bit precision, respectively. For example, complex numbers can represented as `wp.vec2f(real_part, imaginary_part)`, where the first component is the real part of type `wp.float32` and the second component is the imaginary part of type `wp.float32`.\n",
+        "\n",
+        "**EXERCISE: Assuming the helper functions are correct, complete the kernel `decaying_turbulence_initializer(...)` by computing the amplitude using the Warp function `def energy_spectrum(...)`, the phase function using `def phase_randomizer(...)`, and combining them to form the complex valued `omega_hat_init[i, j]`.**"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "@wp.func\n",
+        "def factorial(n: wp.int32) -> wp.int32:\n",
+        "    \"\"\"Compute factorial.\n",
+        "\n",
+        "    Args:\n",
+        "        n: Input integer for which we want factorial.\n",
+        "\n",
+        "    Returns:\n",
+        "        Factorial of input n.\n",
+        "    \"\"\"\n",
+        "    result = wp.int32(1)\n",
+        "    for i in range(2, n + 1):\n",
+        "        result *= i\n",
+        "    return result\n",
+        "\n",
+        "\n",
+        "@wp.func\n",
+        "def energy_spectrum(k: wp.float32, s: wp.int32, kp: wp.float32) -> wp.float32:\n",
+        "    \"\"\"Compute energy at wavenumber magnitude k.\n",
+        "\n",
+        "    Follows San and Staples 2012 Computers and Fluids (page 49).\n",
+        "    https://www.sciencedirect.com/science/article/abs/pii/S0045793012001363.\n",
+        "\n",
+        "    Args:\n",
+        "        k: Input wavenumber magnitude.\n",
+        "        s: Shape parameter of spectrum.\n",
+        "        kp: Wavenumber magnitude at which maximum of energy spectrum lies.\n",
+        "\n",
+        "    Returns:\n",
+        "        Energy contained at wavenumber magnitude k.\n",
+        "    \"\"\"\n",
+        "    s_factorial = wp.float32(factorial(s))\n",
+        "    s_float32 = wp.float32(s)\n",
+        "    a_s = (2.0 * s_float32 + 1.0) ** (s_float32 + 1.0) / (2.0**s_float32 * s_factorial)\n",
+        "    energy_k = (\n",
+        "        a_s\n",
+        "        / (2.0 * kp)\n",
+        "        * (k / kp) ** (2.0 * s_float32 + 1.0)\n",
+        "        * wp.exp(-(s_float32 + 0.5) * (k / kp) ** 2.0)\n",
+        "    )\n",
+        "    return energy_k\n",
+        "\n",
+        "\n",
+        "@wp.func\n",
+        "def phase_randomizer(\n",
+        "    n: int,\n",
+        "    zeta: wp.array2d(dtype=wp.float32),\n",
+        "    eta: wp.array2d(dtype=wp.float32),\n",
+        "    i: int,\n",
+        "    j: int,\n",
+        ") -> wp.float32:\n",
+        "    \"\"\"Calculate value of the random phase at index (i, j).\n",
+        "\n",
+        "    Follows San and Staples 2012 to return phase value in any quadrant based on\n",
+        "    the values of eta and zeta in the first quadrant.\n",
+        "\n",
+        "    Args:\n",
+        "        n: Size of the simulation domain.\n",
+        "        zeta: First phase function.\n",
+        "        eta: Second phase function\n",
+        "        i: rowwise index on the 2-D simulation domain.\n",
+        "        j: columnwise index on the 2-D simulation domain\n",
+        "\n",
+        "    Returns:\n",
+        "        Value of the random phase in any quadrant.\n",
+        "    \"\"\"\n",
+        "    n_half = n // 2\n",
+        "\n",
+        "    # first quadrant\n",
+        "    if i < n_half and j < n_half:\n",
+        "        return zeta[i, j] + eta[i, j]\n",
+        "    # second quadrant\n",
+        "    if i >= n_half and j < n_half:\n",
+        "        return -zeta[n - i, j] + eta[n - i, j]\n",
+        "    # third quadrant\n",
+        "    if i >= n_half and j >= n_half:\n",
+        "        return -zeta[n - i, n - j] - eta[n - i, n - j]\n",
+        "    # fourth quadrant\n",
+        "    return zeta[i, n - j] - eta[i, n - j]\n",
+        "\n",
+        "@wp.kernel\n",
+        "def decaying_turbulence_initializer(\n",
+        "    n: int,\n",
+        "    kp: wp.float32,\n",
+        "    s: wp.int32,\n",
+        "    k_mag: wp.array2d(dtype=wp.float32),\n",
+        "    zeta: wp.array2d(dtype=wp.float32),\n",
+        "    eta: wp.array2d(dtype=wp.float32),\n",
+        "    omega_hat_init: wp.array2d(dtype=wp.vec2f), # In Warp, wp.vec2f/wp.vec2d are used to handle complex numbers\n",
+        "):\n",
+        "    \"\"\"Initialize the vorticity field in Fourier space for decaying turbulence.\n",
+        "\n",
+        "    Args:\n",
+        "        n: Size of the simulation domain.\n",
+        "        kp: Wavenumber magnitude at which maximum of energy spectrum lies.\n",
+        "        s: Shape parameter of the energy spectrum.\n",
+        "        k_mag: Wavenumber magnitude on the 2-D grid.\n",
+        "        zeta: First phase function for phase randomization.\n",
+        "        eta: Second phase function for phase randomization.\n",
+        "        omega_hat_init: Output vorticity field in Fourier space.\n",
+        "    \"\"\"\n",
+        "    i, j = wp.tid()\n",
+        "\n",
+        "    amplitude = wp.sqrt((k_mag[i, j] / wp.pi) * energy_spectrum(k_mag[i, j], s, kp)) # MISSING\n",
+        "    phase = phase_randomizer(n, zeta, eta, i, j) # MISSING\n",
+        "    omega_hat_init[i, j] = wp.vec2f(amplitude*wp.cos(phase), amplitude*wp.sin(phase)) # MISSING"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## Building blocks 2 and 3: Discretization of advection and viscous diffusion terms for RHS and Runge-Kutta timestepping\n",
+        "\n",
+        "Recall the vorticity transport equation (1) from the Background section given as\n",
+        "\n",
+        "$$\\frac{\\partial \\omega}{\\partial t} + \\underbrace{\\frac{\\partial \\psi}{\\partial y}\\frac{\\partial \\omega}{\\partial x} - \\frac{\\partial \\psi}{\\partial x}\\frac{\\partial \\omega}{\\partial y}}_{\\text{advection}} = \\underbrace{\\frac{1}{Re}\\left(\\frac{\\partial^2 \\omega}{\\partial x^2} + \\frac{\\partial^2 \\omega}{\\partial y^2}\\right)}_{\\text{diffusion}}.$$\n",
+        "\n",
+        "We discretize the **advection** and **diffusion** terms on a **uniform grid** using **central finite difference schemes**. Let us see what does that mean. It's not as scary as it sounds!\n",
+        "### Advection term\n",
+        "\n",
+        "Using central differences for first derivatives with grid spacing $h$,\n",
+        "\n",
+        "$$\\frac{\\partial \\omega}{\\partial x}\\bigg|_{i,j} \\approx \\frac{\\omega_{i+1,j} - \\omega_{i-1,j}}{2h}, \\quad \\frac{\\partial \\omega}{\\partial y}\\bigg|_{i,j} \\approx \\frac{\\omega_{i,j+1} - \\omega_{i,j-1}}{2h}, \\quad \\frac{\\partial \\psi}{\\partial x}\\bigg|_{i,j} \\approx \\frac{\\psi_{i+1,j} - \\psi_{i-1,j}}{2h}, \\quad \\frac{\\partial \\psi}{\\partial y}\\bigg|_{i,j} \\approx \\frac{\\psi_{i,j+1} - \\psi_{i,j-1}}{2h}.$$\n",
+        "\n",
+        "The advection term becomes\n",
+        "\n",
+        "$$\\text{Advection}_{i,j} = \\underbrace{\\frac{\\omega_{i+1,j} - \\omega_{i-1,j}}{2h} \\cdot \\frac{\\psi_{i,j+1} - \\psi_{i,j-1}}{2h}}_{\\texttt{term\\_1}} - \\underbrace{\\frac{\\omega_{i,j+1} - \\omega_{i,j-1}}{2h} \\cdot \\frac{\\psi_{i+1,j} - \\psi_{i-1,j}}{2h}}_{\\texttt{term\\_2}}.$$\n",
+        "\n",
+        "### Diffusion (or Laplacian) term\n",
+        "\n",
+        "Using the standard 5-point stencil for the Laplacian, the diffusion term becomes\n",
+        "\n",
+        "$$\\text{Diffusion}_{i,j}  = \\nabla^2 \\omega\\bigg|_{i,j} = \\frac{\\partial^2 \\omega}{\\partial x^2}\\bigg|_{i,j} + \\frac{\\partial^2 \\omega}{\\partial y^2}\\bigg|_{i,j} \\approx \\frac{\\omega_{i+1,j} - 2\\omega_{i,j} + \\omega_{i-1,j}}{h^2} + \\frac{\\omega_{i,j+1} - 2\\omega_{i,j} + \\omega_{i,j-1}}{h^2} = \\frac{\\omega_{i+1,j} + \\omega_{i-1,j} + \\omega_{i,j+1} + \\omega_{i,j-1} - 4\\omega_{i,j}}{h^2}.$$\n",
+        "\n",
+        "### Runge-Kutta timestepping\n",
+        "\n",
+        "Recall the SSP-RK3 scheme from equation (4). Each sub-step has the general form\n",
+        "\n",
+        "$$\\omega^{(\\text{new})} = c_0 \\cdot \\omega^{(n)} + c_1 \\cdot \\omega^{(\\text{old})} + c_2 \\cdot \\Delta t \\cdot \\mathcal{L}(\\omega^{(\\text{old})}),$$\n",
+        "\n",
+        "where \n",
+        "\n",
+        "- $\\mathcal{L}(\\omega) = \\frac{1}{Re}\\text{Diffusion} - \\text{Advection}$ is the RHS term we just discretized above. \n",
+        "- $\\omega^{(n)}$ is the vorticity field on at the beginning of the timestep $t$. \n",
+        "- $\\omega^{\\text{old}}$ and $\\omega^{\\text{new}}$ are the vorticity field at the beginning and the end **of a RK substep**, respectively. \n",
+        "\n",
+        "**Note: At the first and last RK substep specifically, $\\omega^{\\text{(new)}}$ and $\\omega^{\\text{(old)}}$ are equal to $\\omega(t+1)$ and $\\omega(t)$, respectively. For intermediate RK substeps, this does not necessarily hold true.**\n",
+        "\n",
+        "In a Warp kernel, all of these operations would translate to:\n",
+        "1. Compute `rhs = (1/Re) * diffusion(...) - advection(...)`.\n",
+        "2. Update `omega_old = coeff0 * omega_n + coeff1 * omega_old + coeff2 * dt * rhs`, where `omega_n` is the vorticity field at the very beginning of the RK timestepping. By passing different coefficients, each time the kernel is called, the same kernel can handle all three RK sub-steps. Here, note that the same variable `omega_old` is overwritten to store the new values after a RK substep to store $\\omega^{\\text{(new)}}$. \n",
+        "\n",
+        "**What do we need to implement?**\n",
+        "\n",
+        "Two `@wp.func` helper functions:\n",
+        "1. `viscous_advection_rk3_kernel(...)` - Warp kernel that calls two Warp functions `advection(...)` and `diffusion(...)` to calculate the advection and diffusion terms, and then performs one RK update on the entire 2-D grid.\n",
+        "2. `advection(...)` - Warp function that computes the advection term at a grid point given the flow information on the neighboring left, right, top, and bottom neighbors.\n",
+        "3. `diffusion(...)` - Warp function computes the Laplacian at a grid point given the flow information on the left, right, top and bottom neighbors."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "**EXERCISE: Fill in the MISSING parts in `advection(...)`, `diffusion(...)`, using the finite difference formulas described above and the RK update equation.**\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "@wp.func\n",
+        "def cyclic_index(idx: wp.int32, n: wp.int32) -> wp.int32:\n",
+        "    \"\"\"Map any index to [0, n-1] for periodic boundary conditions.\n",
+        "\n",
+        "    Args:\n",
+        "        idx: Input index that may be outside the valid range.\n",
+        "        n: Grid size defining the periodic domain.\n",
+        "\n",
+        "    Returns:\n",
+        "        Index wrapped to the range [0, n-1].\n",
+        "    \"\"\"\n",
+        "    ret_idx = idx % n\n",
+        "    if ret_idx < 0:\n",
+        "        ret_idx += n\n",
+        "    return ret_idx\n",
+        "    \n",
+        "@wp.func\n",
+        "def advection(\n",
+        "    omega_left: wp.float32,\n",
+        "    omega_right: wp.float32,\n",
+        "    omega_top: wp.float32,\n",
+        "    omega_down: wp.float32,\n",
+        "    psi_left: wp.float32,\n",
+        "    psi_right: wp.float32,\n",
+        "    psi_top: wp.float32,\n",
+        "    psi_down: wp.float32,\n",
+        "    h: wp.float32,\n",
+        ") -> wp.float32:\n",
+        "    \"\"\"Calculate the advection term using central finite difference.\n",
+        "\n",
+        "    Args:\n",
+        "        omega_left: Vorticity at (i-1, j).\n",
+        "        omega_right: Vorticity at (i+1, j).\n",
+        "        omega_top: Vorticity at (i, j+1).\n",
+        "        omega_down: Vorticity at (i, j-1).\n",
+        "        psi_left: Stream function at (i-1, j).\n",
+        "        psi_right: Stream function at (i+1, j).\n",
+        "        psi_top: Stream function at (i, j+1).\n",
+        "        psi_down: Stream function at (i, j-1).\n",
+        "        h: Grid spacing.\n",
+        "\n",
+        "    Returns:\n",
+        "        Advection term value at grid point (i, j).\n",
+        "    \"\"\"\n",
+        "    inv_2h = 1.0 / (2.0 * h)\n",
+        "    term_1 = ((omega_right - omega_left) * inv_2h) * ((psi_top - psi_down) * inv_2h) # MISSING\n",
+        "    term_2 = ((omega_top - omega_down) * inv_2h) * ((psi_right - psi_left) * inv_2h) # MISSING\n",
+        "    return term_1 - term_2\n",
+        "\n",
+        "\n",
+        "@wp.func\n",
+        "def diffusion(\n",
+        "    omega_left: wp.float32,\n",
+        "    omega_right: wp.float32,\n",
+        "    omega_center: wp.float32,\n",
+        "    omega_down: wp.float32,\n",
+        "    omega_top: wp.float32,\n",
+        "    h: wp.float32,\n",
+        ") -> wp.float32:\n",
+        "    \"\"\"Calculate the Laplacian for viscous diffusion using central difference.\n",
+        "\n",
+        "    Args:\n",
+        "        omega_left: Vorticity at (i-1, j).\n",
+        "        omega_right: Vorticity at (i+1, j).\n",
+        "        omega_center: Vorticity at (i, j).\n",
+        "        omega_down: Vorticity at (i, j-1).\n",
+        "        omega_top: Vorticity at (i, j+1).\n",
+        "        h: Grid spacing.\n",
+        "\n",
+        "    Returns:\n",
+        "        Laplacian of vorticity at grid point (i, j).\n",
+        "    \"\"\"\n",
+        "    inv_h2 = 1.0 / (h * h)\n",
+        "    # combine both the diffusion terms in the x and y direction together\n",
+        "    laplacian = (omega_left + omega_right + omega_top + omega_down - 4.0*omega_center) * inv_h2 # MISSING\n",
+        "    return laplacian"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### Validating advection and diffusion kernels\n",
+        "\n",
+        "Before moving to `def viscous_advection_rk3_kernel(...)`, we validate the finite-difference implementation using a **manufactured solution** with known analytical derivatives.\n",
+        "\n",
+        "$$\\psi(x,y) = \\sin(k_x x)\\sin(k_y y), \\quad \\omega(x,y) = (k_x^2 + k_y^2)\\sin(k_x x)\\sin(k_y y)$$\n",
+        "\n",
+        "Note that $\\nabla^{2} \\psi = -\\omega$. \n",
+        "\n",
+        "Applying the Laplacian to $\\omega$\n",
+        "\n",
+        "$$\\nabla^2\\omega = \\frac{\\partial^2 \\omega}{\\partial x^2} + \\frac{\\partial^2 \\omega}{\\partial y^2} = -(k_x^2 + k_y^2)^2 \\sin(k_x x)\\sin(k_y y) = -(k_x^2 + k_y^2)\\omega.$$\n",
+        "\n",
+        "Evaluating $\\frac{\\partial \\psi}{\\partial x}\\frac{\\partial \\omega}{\\partial y} - \\frac{\\partial \\psi}{\\partial y}\\frac{\\partial \\omega}{\\partial x}$ gives\n",
+        "\n",
+        "$$\\frac{\\partial \\psi}{\\partial x} = k_x\\cos(k_x x)\\sin(k_y y), \\quad \\frac{\\partial \\psi}{\\partial y} = k_y\\sin(k_x x)\\cos(k_y y)$$\n",
+        "\n",
+        "$$\\frac{\\partial \\omega}{\\partial x} = (k_x^2+k_y^2)k_x\\cos(k_x x)\\sin(k_y y), \\quad \\frac{\\partial \\omega}{\\partial y} = (k_x^2+k_y^2)k_y\\sin(k_x x)\\cos(k_y y)$$\n",
+        "\n",
+        "Substituting into the Jacobian:\n",
+        "\n",
+        "$$\\frac{\\partial \\psi}{\\partial x}\\frac{\\partial \\omega}{\\partial y} = (k_x^2+k_y^2)k_x k_y \\cos(k_x x)\\sin(k_x x)\\sin(k_y y)\\cos(k_y y)$$\n",
+        "\n",
+        "$$\\frac{\\partial \\psi}{\\partial y}\\frac{\\partial \\omega}{\\partial x} = (k_x^2+k_y^2)k_x k_y \\sin(k_x x)\\cos(k_x x)\\cos(k_y y)\\sin(k_y y)$$\n",
+        "\n",
+        "These two terms are identical, so advection terms evaluate to zero.\n",
+        "\n",
+        "The cell below compares numerical output against these analytical values. \n",
+        "\n",
+        "**EXERCISE: Run the initial validation and try playing with `n_grid`, `kx`, `ky` parameters. For now, treat the `def compute_advection_diffusion_kernel(...)` as a black box. We will encounter a similar kernel soon.**"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "from utils import validate_advection_diffusion\n",
+        "\n",
+        "@wp.kernel\n",
+        "def compute_advection_diffusion_kernel(\n",
+        "    omega: wp.array2d(dtype=wp.float32),\n",
+        "    psi: wp.array2d(dtype=wp.float32),\n",
+        "    advection_out: wp.array2d(dtype=wp.float32),\n",
+        "    diffusion_out: wp.array2d(dtype=wp.float32),\n",
+        "    h: wp.float32,\n",
+        "    n: wp.int32,\n",
+        "):\n",
+        "    \"\"\"Compute advection and diffusion terms at each grid point.\n",
+        "\n",
+        "    Args:\n",
+        "        omega: Vorticity field.\n",
+        "        psi: Stream function field.\n",
+        "        advection_out: Output array for advection term J(psi, omega).\n",
+        "        diffusion_out: Output array for diffusion term (Laplacian of omega).\n",
+        "        h: Grid spacing.\n",
+        "        n: Grid size.\n",
+        "    \"\"\"\n",
+        "    i, j = wp.tid()\n",
+        "\n",
+        "    # obtain the neighboring indices for the [i, j]th cell in a periodic square box\n",
+        "    i_left = cyclic_index(i - 1, n)\n",
+        "    i_right = cyclic_index(i + 1, n)\n",
+        "    j_down = cyclic_index(j - 1, n)\n",
+        "    j_top = cyclic_index(j + 1, n)\n",
+        "\n",
+        "    # gather omega values at neighbors\n",
+        "    omega_center = omega[i, j]\n",
+        "    omega_left = omega[i_left, j]\n",
+        "    omega_right = omega[i_right, j]\n",
+        "    omega_down = omega[i, j_down]\n",
+        "    omega_top = omega[i, j_top]\n",
+        "\n",
+        "    # gather psi values at neighbors\n",
+        "    psi_left = psi[i_left, j]\n",
+        "    psi_right = psi[i_right, j]\n",
+        "    psi_down = psi[i, j_down]\n",
+        "    psi_top = psi[i, j_top]\n",
+        "\n",
+        "    # compute diffusion term (Laplacian)\n",
+        "    diffusion_out[i, j] = diffusion(\n",
+        "        omega_left,\n",
+        "        omega_right,\n",
+        "        omega_center,\n",
+        "        omega_down,\n",
+        "        omega_top,\n",
+        "        h,\n",
+        "    )\n",
+        "\n",
+        "    # compute advection term J(psi, omega)\n",
+        "    advection_out[i, j] = advection(\n",
+        "        omega_left,\n",
+        "        omega_right,\n",
+        "        omega_top,\n",
+        "        omega_down,\n",
+        "        psi_left,\n",
+        "        psi_right,\n",
+        "        psi_top,\n",
+        "        psi_down,\n",
+        "        h,\n",
+        "    )\n",
+        "\n",
+        "# Change n_grid, kx, ky here and play with different values\n",
+        "(fig_diff, _), (fig_adv, _) = validate_advection_diffusion(\n",
+        "    compute_advection_diffusion_kernel, n_grid=N_GRID, kx=2, ky=3\n",
+        ")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "**EXERCISE: Given that we have the correct `def advection(...)` and `def diffusion(...)` functions, let's assemble the timestepping kernel together. Recall**\n",
+        "\n",
+        "$$\\omega^{(\\text{new})} = c_0 \\cdot \\omega^{(n)} + c_1 \\cdot \\omega^{(\\text{old})} + c_2 \\cdot \\Delta t \\cdot \\mathcal{L}(\\omega^{(\\text{old})}).$$\n",
+        "\n",
+        "In coding terminology, we can write `omega_old = coeff0 * omega_n + coeff1 * omega_old + coeff2 * dt * rhs`.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "\n",
+        "@wp.kernel\n",
+        "def viscous_advection_rk3_kernel(\n",
+        "    n: int,\n",
+        "    h: float,\n",
+        "    re: float,\n",
+        "    dt: float,\n",
+        "    coeff0: float,\n",
+        "    coeff1: float,\n",
+        "    coeff2: float,\n",
+        "    omega_0: wp.array2d(dtype=float),\n",
+        "    omega_1: wp.array2d(dtype=float),\n",
+        "    psi: wp.array2d(dtype=float),\n",
+        "    rhs: wp.array2d(dtype=float),\n",
+        "):\n",
+        "    \"\"\"Perform a single substep of SSP-RK3.\n",
+        "\n",
+        "    Args:\n",
+        "        n: Grid size.\n",
+        "        h: Grid spacing.\n",
+        "        re: Reynolds number.\n",
+        "        dt: Timestep size.\n",
+        "        coeff0: SSP-RK3 coefficient for omega_0.\n",
+        "        coeff1: SSP-RK3 coefficient for omega_1.\n",
+        "        coeff2: SSP-RK3 coefficient for RHS.\n",
+        "        omega_0: Vorticity field at the beginning of the timestep.\n",
+        "        omega_1: Vorticity field at the end of the RK step.\n",
+        "        psi: Stream function field.\n",
+        "        rhs: Temporarily stores diffusion + advection terms.\n",
+        "    \"\"\"\n",
+        "    i, j = wp.tid()\n",
+        "\n",
+        "    # obtain the neighboring indices for the [i, j]th cell in a periodic square box\n",
+        "    left_idx = cyclic_index(i - 1, n)\n",
+        "    right_idx = cyclic_index(i + 1, n)\n",
+        "    top_idx = cyclic_index(j + 1, n)\n",
+        "    down_idx = cyclic_index(j - 1, n)\n",
+        "\n",
+        "    # compute viscous diffusion term\n",
+        "    rhs[i, j] = (1.0 / re) * diffusion(\n",
+        "        omega_1[left_idx, j],\n",
+        "        omega_1[right_idx, j],\n",
+        "        omega_1[i, j],\n",
+        "        omega_1[i, down_idx],\n",
+        "        omega_1[i, top_idx],\n",
+        "        h,\n",
+        "    )\n",
+        "\n",
+        "    # add advection term\n",
+        "    rhs[i, j] -= advection(\n",
+        "        omega_1[left_idx, j],\n",
+        "        omega_1[right_idx, j],\n",
+        "        omega_1[i, top_idx],\n",
+        "        omega_1[i, down_idx],\n",
+        "        psi[left_idx, j],\n",
+        "        psi[right_idx, j],\n",
+        "        psi[i, top_idx],\n",
+        "        psi[i, down_idx],\n",
+        "        h,\n",
+        "    )\n",
+        "\n",
+        "    # perform RK update\n",
+        "    omega_1[i, j] = coeff0 * omega_0[i, j] + coeff1 * omega_1[i, j] + coeff2 * dt * rhs[i, j] # MISSING"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## Building block 4: Spectral Poisson solver using tile-based FFT and transpose\n",
+        "\n",
+        "Recall from equation (3) in the Background section that the Poisson equation in the Fourier space is\n",
+        "\n",
+        "$$\\hat{\\psi}_{m,n} = \\frac{\\hat{\\omega}_{m,n}}{k_x^2 + k_y^2}.$$\n",
+        "\n",
+        "To solve this, we need to perform the follow steps:\n",
+        "1. Transform $\\omega$ to Fourier space using 2-D FFT → $\\hat{\\omega}$.\n",
+        "2. Divide by $|k|^2$ to obtain $\\hat{\\psi}$.\n",
+        "3. Transform $\\hat{\\psi}$ back to physical space using 2-D IFFT.\n",
+        "\n",
+        "**Warp's FFT primitives**: Warp provides `wp.tile_fft()` and `wp.tile_ifft()` that operate on `wp.vec2f` (or `wp.vec2d`) arrays and perform row-wise FFT/IFFT, where:\n",
+        "- `.x` component = **real part**\n",
+        "- `.y` component = **imaginary part**\n",
+        "\n",
+        "Since our physical fields $\\omega$, $\\psi$ are stored as `wp.float32`, we need a helper kernel to convert between data types.\n",
+        "\n",
+        "**What do we need to implement?**\n",
+        "\n",
+        "1. **Data type conversion**: Kernels to move data from <u>`wp.float32` to `wp.vec2f`</u> (for FFT) and from <u>`wp.vec2f` to `wp.float32`</u> (after IFFT).\n",
+        "\n",
+        "2. **Tile-based 1-D FFT/IFFT kernels**: Kernels that will use `wp.tile_fft()` and `wp.tile_ifft()` to perform row-wise transforms.\n",
+        "\n",
+        "3. **Transpose kernel**: A tiled transpose for efficient matrix transpose.\n",
+        "\n",
+        "4. **2-D FFT composition**: A Python function that will put together the kernels in steps (2) and (3) above to do 2-D FFT as **row FFT → transpose → row FFT**.\n",
+        "\n",
+        "Let's build these step by step."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### Step 1: Data type conversion kernels\n",
+        "\n",
+        "Since Warp's FFT operates on `wp.vec2f` (complex) arrays but our physical fields are `wp.float32` (real), we need:\n",
+        "\n",
+        "- `copy_float_to_vec2`: Convert real array to complex with zero imaginary part (before FFT).\n",
+        "- `extract_real_and_scale`: Extract real part and apply scaling by dividing the real part by `scale` (after IFFT)."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "**EXERCISE: Complete the `def extract_real_and_scale(...)` kernel definition. `def copy_float_to_vec2(...)` is already completed for you for reference.**"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "@wp.kernel\n",
+        "def copy_float_to_vec2(\n",
+        "    omega: wp.array2d(dtype=wp.float32), omega_complex: wp.array2d(dtype=wp.vec2f)\n",
+        "):\n",
+        "    \"\"\"Copy real vorticity to a complex array with zero imaginary part.\n",
+        "\n",
+        "    Args:\n",
+        "        omega: Input real-valued vorticity array.\n",
+        "        omega_complex: Output complex array where real part is omega, imaginary is 0.\n",
+        "    \"\"\"\n",
+        "    i, j = wp.tid()\n",
+        "    omega_complex[i, j] = wp.vec2f(omega[i, j], 0.0)\n",
+        "\n",
+        "\n",
+        "@wp.kernel\n",
+        "def extract_real_and_scale(\n",
+        "    scale: wp.float32,\n",
+        "    complex_array: wp.array2d(dtype=wp.vec2f),\n",
+        "    real_array: wp.array2d(dtype=wp.float32),\n",
+        "):\n",
+        "    \"\"\"Extract real part from complex array and scale in one pass.\n",
+        "\n",
+        "    Args:\n",
+        "        scale: Scale factor to multiply each element by.\n",
+        "        complex_array: Input complex array (vec2f where .x is real part).\n",
+        "        real_array: Output real array (scaled).\n",
+        "    \"\"\"\n",
+        "    i, j = wp.tid()\n",
+        "    real_array[i, j] = complex_array[i, j].x / scale # MISSING"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### Step 2: Tile-based 1-D FFT and IFFT kernels\n",
+        "\n",
+        "Warp's `wp.tile_fft()` and `wp.tile_ifft()` perform FFT on tiles loaded into registers. For row-wise transforms:\n",
+        "\n",
+        "1. Each thread block loads one row (tile) of the 2-D array\n",
+        "2. Performs FFT/IFFT using `wp.tile_fft()` / `wp.tile_ifft()`\n",
+        "3. Stores the result back to global memory\n",
+        "\n",
+        "The tile dimensions `TILE_M=1` and `TILE_N=N_GRID` mean each tile is exactly one row.\n",
+        "\n",
+        "**TODO: I am not sure if we should ask the readers to fill in the kernels definitions for `tile_fft(...)` and `tile_ifft(...)`. Also, @Eric for a better description of tile-based operations...**"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "@wp.kernel\n",
+        "def fft_tiled(x: wp.array2d(dtype=wp.vec2f), y: wp.array2d(dtype=wp.vec2f)):\n",
+        "    \"\"\"Perform 1-D FFT on each row using wp.tile_fft().\n",
+        "\n",
+        "    Args:\n",
+        "        x: Input complex array of shape (N, N).\n",
+        "        y: Output complex array of shape (N, N) storing FFT results.\n",
+        "    \"\"\"\n",
+        "    i, _, _ = wp.tid()\n",
+        "    a = wp.tile_load(x, shape=(TILE_M, TILE_N), offset=(i * TILE_M, 0))\n",
+        "    wp.tile_fft(a)\n",
+        "    wp.tile_store(y, a, offset=(i * TILE_M, 0))\n",
+        "\n",
+        "\n",
+        "@wp.kernel\n",
+        "def ifft_tiled(x: wp.array2d(dtype=wp.vec2f), y: wp.array2d(dtype=wp.vec2f)):\n",
+        "    \"\"\"Perform 1-D inverse FFT on each row using wp.tile_ifft().\n",
+        "\n",
+        "    Args:\n",
+        "        x: Input complex array of shape (N, N).\n",
+        "        y: Output complex array of shape (N, N) storing IFFT results.\n",
+        "    \"\"\"\n",
+        "    i, _, _ = wp.tid()\n",
+        "    a = wp.tile_load(x, shape=(TILE_M, TILE_N), offset=(i * TILE_M, 0))\n",
+        "    wp.tile_ifft(a)\n",
+        "    wp.tile_store(y, a, offset=(i * TILE_M, 0))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### Validating FFT/IFFT kernels \n",
+        "To verify that `fft_tiled` and `ifft_tiled` work correctly, we test them with a known signal $f(x) = \\sin(x)$ on the domain $[0, 2\\pi)$. The Fourier transform of $\\sin(x)$ has a simple analytical form. Since $\\sin(x) = \\frac{e^{ix} - e^{-ix}}{2i}$, the FFT should produce peaks only at wavenumbers $k = \\pm 1$, with zero amplitude elsewhere. \n",
+        "\n",
+        "The `validate_fft_roundtrip(...)` function initializes the signal $f(x)$ on the $[0, 2\\pi)$ domain and performs a full roundtrip, i.e., it applies the FFT to the signal, then applies the IFFT to the result. The function then plots the magnitude spectrum $|\\hat{f}(k)|$ and compares the original signal with the reconstructed signal obtained from the FFT -> IFFT roundtrip."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Validate FFT kernels with a simple sine wave test\n",
+        "from utils import validate_fft_roundtrip\n",
+        "\n",
+        "fig, axes, max_error = validate_fft_roundtrip(\n",
+        "    fft_kernel=fft_tiled,\n",
+        "    ifft_kernel=ifft_tiled,\n",
+        "    n_grid=N_GRID,\n",
+        "    tile_m=TILE_M,\n",
+        "    tile_n=TILE_N,\n",
+        "    block_dim=BLOCK_DIM,\n",
+        ")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### Step 3: Tiled transpose kernel\n",
+        "\n",
+        "To compute 2-D FFT using 1-D row transforms, we need a transpose operation between passes\n",
+        "\n",
+        "$$\\text{2-D FFT} = \\text{row FFT} \\rightarrow \\text{transpose} \\rightarrow \\text{row FFT}.$$\n",
+        "\n",
+        "To perform the matrix transpose, we again use Warp's tile-based primitive `wp.tile_transpose()` below in the kernel `def tiled_transpose(...)` that takes in a `x` array and transposes it to give the `y` array.\n",
+        "\n",
+        "**TODO**: Same question as above, should we ask them to fill in these tile kernels. Again @Eric for a more accurate description of tile-based transpose...\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "@wp.kernel\n",
+        "def tiled_transpose(x: wp.array2d(dtype=wp.vec2f), y: wp.array2d(dtype=wp.vec2f)):\n",
+        "    \"\"\"Transpose a 2-D array using tiled approach with shared memory.\n",
+        "\n",
+        "    Args:\n",
+        "        x: Input complex array.\n",
+        "        y: Output complex array storing the transpose of x.\n",
+        "    \"\"\"\n",
+        "    i, j = wp.tid()\n",
+        "    t = wp.tile_load(\n",
+        "        x,\n",
+        "        shape=(TILE_TRANSPOSE_DIM, TILE_TRANSPOSE_DIM),\n",
+        "        offset=(i * TILE_TRANSPOSE_DIM, j * TILE_TRANSPOSE_DIM),\n",
+        "        storage=\"shared\",\n",
+        "    )\n",
+        "    t_transposed = wp.tile_transpose(t)\n",
+        "    wp.tile_store(\n",
+        "        y, t_transposed, offset=(j * TILE_TRANSPOSE_DIM, i * TILE_TRANSPOSE_DIM)\n",
+        "    )"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### Validating transpose kernel\n",
+        "\n",
+        "In `def validate_transpose(...)`, we validate this using an upper triangular matrix, which should become lower triangular after transpose. For example, in case of a $4 \\times 4$ upper triangular matrix, transpose operation should result in the following change\n",
+        "\n",
+        "$$\n",
+        "\\begin{bmatrix} 1 & 1 & 1 & 1 \\\\ 0 & 1 & 1 & 1 \\\\ 0 & 0 & 1 & 1 \\\\ 0 & 0 & 0 & 1 \\end{bmatrix}\n",
+        "\\xrightarrow{\\text{transpose}}\n",
+        "\\begin{bmatrix} 1 & 0 & 0 & 0 \\\\ 1 & 1 & 0 & 0 \\\\ 1 & 1 & 1 & 0 \\\\ 1 & 1 & 1 & 1 \\end{bmatrix}.\n",
+        "$$"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import utils\n",
+        "from utils import validate_transpose\n",
+        "\n",
+        "fig, axes = validate_transpose(\n",
+        "    transpose_kernel=tiled_transpose,\n",
+        "    tile_transpose_dim=TILE_TRANSPOSE_DIM,\n",
+        "    n_test=64,\n",
+        ")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### Step 4: Going from $\\hat{\\omega}$ to $\\hat{\\psi}$ in the Fourier space\n",
+        "\n",
+        "After transforming $\\omega$ to Fourier space, we solve equation (3)\n",
+        "\n",
+        "$$\\hat{\\psi}_{m,n} = \\frac{\\hat{\\omega}_{m,n}}{k_x^2 + k_y^2} = \\frac{\\hat{\\omega}_{m,n}}{|k|^2} = \\hat{\\omega}_{m,n} \\cdot |k|^{-2}.$$\n",
+        "\n",
+        "We precompute $1/|k|^2$ and multiply it elementwise to the `omega_hat` array to obtain the `psi_hat` array."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "**EXERCISE: Please fill out the `def multiply_k2_inverse(...)` kernel definition.**"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "@wp.kernel\n",
+        "def multiply_k2_inverse(\n",
+        "    k2i: wp.array2d(dtype=wp.float32),\n",
+        "    omega_hat: wp.array2d(dtype=wp.vec2f),\n",
+        "    psi_hat: wp.array2d(dtype=wp.vec2f),\n",
+        "):\n",
+        "    \"\"\"Solve Poisson equation in Fourier space: psi_hat = omega_hat / |k|^2.\n",
+        "\n",
+        "    Args:\n",
+        "        k2i: Precomputed 1/|k|^2 array (0 at k=0 to handle DC component).\n",
+        "        omega_hat: Fourier transform of vorticity.\n",
+        "        psi_hat: Output Fourier transform of stream function.\n",
+        "    \"\"\"\n",
+        "    i, j = wp.tid()\n",
+        "    psi_hat[i, j] =  omega_hat[i, j] * k2i[i, j] # MISSING"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Now that we have all the kernel calls figured out, we need to put together two Python functions before we start assembling the solver together!\n",
+        "\n",
+        "1. `def fft_2d(fft_kernel, input_arr)` - Python function that composes row-wise FFT/IFFT -> transpose → row-wise FFT to perform a full 2-D FFT/IFFT on a input complex (`wp.vec2f`) array.\n",
+        "\n",
+        "2. `def solve_poisson(omega, k2i)` - Python function that takes the real-valued vorticity field and pre-computed $|k|^2$ and gives out real values $\\psi$.\n",
+        "\n",
+        "3. `def step(omega_old, omega_new)` - Python function that performs on Runge-Kutte timestep to go from `omega_old` to `omega_new`."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "**EXERCISE: Until now, you have mostly filled in and worked through Warp kernels and functions. In the next cell, we ask you to call a Warp kernel from inside the Python function `def fft_2d(fft_kernel, input_arr)`. This function is FFT/IFFT agnostic and can be used for both (how?). We ask you to fill two kernel calls, one to `def tiled_transpose(...)` and another to `def fft_kernel(...)`.**"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "def fft_2d(fft_kernel, input_arr):\n",
+        "    \"\"\"Perform 2-D FFT or IFFT using row-wise transform + transpose pattern.\n",
+        "\n",
+        "    Args:\n",
+        "        fft_kernel: Either def fft_tiled(...) or def ifft_tiled(...) Warp kernel.\n",
+        "        input_arr: Input complex array.\n",
+        "    \n",
+        "    Returns:\n",
+        "        Output complex array with FFT/IFFT results.\n",
+        "    \"\"\"\n",
+        "    # allocate temporary buffers for intermediate FFT results\n",
+        "    temp_1 = wp.zeros((N_GRID, N_GRID), dtype=wp.vec2f)\n",
+        "    temp_2 = wp.zeros((N_GRID, N_GRID), dtype=wp.vec2f)\n",
+        "\n",
+        "    # perform rowwise FFT/IFFT\n",
+        "    wp.launch_tiled(\n",
+        "        fft_kernel,\n",
+        "        dim=[N_GRID, 1],\n",
+        "        inputs=[input_arr],\n",
+        "        outputs=[temp_1],\n",
+        "        block_dim=BLOCK_DIM,\n",
+        "    )\n",
+        "\n",
+        "    wp.launch_tiled(\n",
+        "        tiled_transpose,\n",
+        "        dim=(N_GRID // TILE_TRANSPOSE_DIM, N_GRID // TILE_TRANSPOSE_DIM),\n",
+        "        inputs= [temp_1], # MISSING \n",
+        "        outputs= [temp_2], # MISSING \n",
+        "        block_dim=TILE_TRANSPOSE_DIM * TILE_TRANSPOSE_DIM,\n",
+        "    )\n",
+        "\n",
+        "    # perform columnwise FFT/IFFT\n",
+        "    output_arr = wp.empty_like(input_arr)\n",
+        "    wp.launch_tiled(\n",
+        "        fft_kernel,\n",
+        "        dim=[N_GRID, 1],\n",
+        "        inputs= [temp_2], # MISSING \n",
+        "        outputs=[output_arr],\n",
+        "        block_dim=BLOCK_DIM,\n",
+        "    )\n",
+        "    \n",
+        "    return output_arr"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "**EXERCISE: Now, let's fill in the function call to `def fft_2d(...)` in the `def solve_poisson(...)` below.**"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "def solve_poisson(omega, psi, k2i):\n",
+        "    \"\"\"Solve the Poisson equation using FFT.\n",
+        "    \n",
+        "    Args:\n",
+        "        omega: Vorticity field (wp.float32 array).\n",
+        "        psi: Stream function solution (wp.float32 array).\n",
+        "        k2i: Precomputed 1/|k|^2 array.\n",
+        "    \"\"\"\n",
+        "    # allocate all temporary arrays locally\n",
+        "    omega_complex = wp.zeros((N_GRID, N_GRID), dtype=wp.vec2f)\n",
+        "    temp_1 = wp.zeros((N_GRID, N_GRID), dtype=wp.vec2f)\n",
+        "    temp_2 = wp.zeros((N_GRID, N_GRID), dtype=wp.vec2f)\n",
+        "\n",
+        "    # convert vorticity from wp.float32 to wp.vec2f\n",
+        "    wp.launch(copy_float_to_vec2, dim=(N_GRID, N_GRID),\n",
+        "              inputs=[omega], outputs=[omega_complex])\n",
+        "\n",
+        "    # forward FFT: take omega\n",
+        "    # Remember fft_2d first and second arguments are fft_tiled/ifft_tiled and vorticity field (complex valued, wp.vec2f)\n",
+        "    temp_1 = fft_2d(fft_tiled, omega_complex) # MISSING\n",
+        "\n",
+        "    # solve Poisson in Fourier space: psi_hat = omega_hat / |k|^2\n",
+        "    wp.launch(multiply_k2_inverse, dim=(N_GRID, N_GRID),\n",
+        "              inputs=[k2i, temp_1], outputs=[temp_2])\n",
+        "\n",
+        "    # inverse FFT: psi_hat -> psi\n",
+        "    # Remember fft_2d first and second arguments are fft_tiled/ifft_tiled and vorticity field (complex valued, wp.vec2f)\n",
+        "    temp_1 = fft_2d(ifft_tiled, temp_2) # MISSING\n",
+        "\n",
+        "    # extract real part and normalize\n",
+        "    wp.launch(extract_real_and_scale, dim=(N_GRID, N_GRID),\n",
+        "              inputs=[wp.float32(N_GRID * N_GRID), temp_1],\n",
+        "              outputs=[psi]) # MISSING"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### Validating the Poisson solver\n",
+        "\n",
+        "We validate `solve_poisson` using an analytical test case. Consider the Poisson equation $\\nabla^2 \\psi = -\\omega$. If we choose $\\psi_{\\text{analytical}} = \\sin(k_x x)\\sin(k_y y)$, then\n",
+        "\n",
+        "$$\\nabla^2 \\psi = -k_x^2 \\sin(k_x x)\\sin(k_y y) - k_y^2 \\sin(k_x x)\\sin(k_y y) = -(k_x^2 + k_y^2)\\sin(k_x x)\\sin(k_y y).$$\n",
+        "\n",
+        "which means $\\omega = (k_x^2 + k_y^2)\\sin(k_x x)\\sin(k_y y)$.\n",
+        "\n",
+        "In the `def validate_poisson_solver(...)`, we set $\\omega = (k_x^2 + k_y^2)\\sin(k_x x)\\sin(k_y y)$ and make a call to `def solve_poisson(...)` to obtain $\\psi_{\\text{numerical}}$. Thereafter, we compare the obtained $\\psi_{\\text{numerical}}$ with $\\psi_{\\text{analytical}}$.\n",
+        "\n",
+        "In the cell below, for starters, we have set $k_x = k_y = 1$. Run the cell and see what you get. \n",
+        "\n",
+        "**Exercise: Change the values of $k_x$ and $k_y$ and see how the plots look for different $[k_x, k_y]$ pair.**"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Validate Poisson solver\n",
+        "from utils import validate_poisson_solver\n",
+        "\n",
+        "fig, axes, max_error = validate_poisson_solver(\n",
+        "    solve_poisson_func=solve_poisson,\n",
+        "    n_grid=N_GRID,\n",
+        "    kx=1,\n",
+        "    ky=1,\n",
+        ")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "We have already given you the `def step(...)` full definition in the cell below. A single call to `def step(...)` basically advances the whole $\\omega$ field on the 2-D grid from $t$ to $t + \\Delta t$. "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "def step(omega_0, omega_1, psi, rhs, k2i):\n",
+        "    \"\"\"Advance simulation by one timestep using SSP-RK3.\n",
+        "    \n",
+        "    Args:\n",
+        "        omega_0: Vorticity at the start of timestep on the 2-D grid.\n",
+        "        omega_1: Vorticity at the end of timestep/RK substep on the 2-D grid.\n",
+        "        psi: Stream function on the 2-D grid.\n",
+        "        rhs: Temporary array for advection + diffusion terms.\n",
+        "        k2i: Precomputed 1/|k|^2 for Poisson solver.\n",
+        "    \"\"\"\n",
+        "    # loop over 3 RK substeps with coefficients (c0, c1, c2)\n",
+        "    for c0, c1, c2 in RK3_COEFFS:\n",
+        "        rhs.zero_()\n",
+        "\n",
+        "        # kernel call to compute the advection and diffusion components\n",
+        "        # then advance the vorticity 1 RK3 substep\n",
+        "        wp.launch(\n",
+        "            viscous_advection_rk3_kernel,\n",
+        "            dim=(N_GRID, N_GRID),\n",
+        "            inputs=[N_GRID, H, RE, DT, c0, c1, c2, omega_0, omega_1, psi, rhs],\n",
+        "        )\n",
+        "\n",
+        "        # solve Poisson equation to get psi from omega_1\n",
+        "        solve_poisson(omega_1, psi, k2i)\n",
+        "\n",
+        "    # omega_1 now holds omega(t+dt), copy back to omega_0 for next timestep\n",
+        "    wp.copy(omega_0, omega_1)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## Assembling the Solver\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Now that we have all the building blocks in place, let's put them together to run the simulation. In the cells below, we will:\n",
+        "\n",
+        "1. Allocate arrays for vorticity ($\\omega$), stream function ($\\psi$), RHS terms, and pre-compute $1/|k|^2$.\n",
+        "2. Initialize the vorticity field using the decaying turbulence spectrum mentioned above.\n",
+        "3. Capture the time-stepping loop in a CUDA graph for improved performance.\n",
+        "4. Run the simulation and visualize the results."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### 1. Allocating arrays for the simulation"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# allocate warp arrays for vorticity, stream-function, and RHS of NS equation\n",
+        "omega_0 = wp.zeros((N_GRID, N_GRID), dtype=wp.float32) # MISSING\n",
+        "omega_1 = wp.zeros((N_GRID, N_GRID), dtype=wp.float32) # MISSING\n",
+        "psi = wp.zeros((N_GRID, N_GRID), dtype=wp.float32) # MISSING\n",
+        "rhs = wp.zeros((N_GRID, N_GRID), dtype=wp.float32) # MISSING\n",
+        "\n",
+        "# precompute 1/k^2 for spectral Poisson solver (avoid division by zero at k=0)\n",
+        "k = np.fft.fftfreq(N_GRID, d=1.0 / N_GRID)\n",
+        "kx, ky = np.meshgrid(k, k)\n",
+        "k2 = kx**2 + ky**2\n",
+        "k2i_np = np.zeros_like(k2)\n",
+        "nonzero = k2 != 0\n",
+        "k2i_np[nonzero] = 1.0 / k2[nonzero]\n",
+        "k2i = wp.array2d(k2i_np.astype(np.float32), dtype=wp.float32)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### 2. Initialize vorticity on the 2-D grid using the spectrum discussed before"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# compute initial vorticity distribution for decaying turbulence\n",
+        "k_mag_np = np.sqrt(k**2 + k[:, np.newaxis] ** 2)\n",
+        "k_mag = wp.array2d(k_mag_np.astype(np.float32), dtype=wp.float32)\n",
+        "\n",
+        "rng = np.random.default_rng(42)\n",
+        "zeta_np = 2 * np.pi * rng.random((N_GRID // 2 + 1, N_GRID // 2 + 1))\n",
+        "eta_np = 2 * np.pi * rng.random((N_GRID // 2 + 1, N_GRID // 2 + 1))\n",
+        "zeta = wp.array2d(zeta_np.astype(np.float32), dtype=wp.float32)\n",
+        "eta = wp.array2d(eta_np.astype(np.float32), dtype=wp.float32)\n",
+        "omega_complex = wp.zeros((N_GRID, N_GRID), dtype=wp.vec2f)\n",
+        "\n",
+        "# set parameters for energy spectrum\n",
+        "K_P = 12.0\n",
+        "S = 3\n",
+        "\n",
+        "wp.launch(\n",
+        "    decaying_turbulence_initializer,\n",
+        "    dim=(N_GRID, N_GRID),\n",
+        "    inputs=[N_GRID, K_P, S, k_mag, zeta, eta],\n",
+        "    outputs=[omega_complex], # MISSING\n",
+        ")\n",
+        "\n",
+        "# compute IFFT of omega_complex field to get initial vorticity in physical space\n",
+        "# Remember fft_2d first and second arguments are fft_tiled/ifft_tiled and vorticity field (complex valued, wp.vec2f)\n",
+        "fft_result = fft_2d(ifft_tiled, omega_complex) # MISSING\n",
+        "\n",
+        "# extract real part to get initial vorticity field (scale=1.0 for normalization)\n",
+        "wp.launch(\n",
+        "    extract_real_and_scale,\n",
+        "    dim=(N_GRID, N_GRID),\n",
+        "    inputs=[1.0, fft_result],\n",
+        "    outputs=[omega_0],\n",
+        ")\n",
+        "\n",
+        "# for initial distribution, set both omega_1 and omega_0 to be the same\n",
+        "wp.copy(omega_1, omega_0)\n",
+        "\n",
+        "# solve initial Poisson equation to get psi from initial vorticity field\n",
+        "solve_poisson(omega_1, psi, k2i)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### 3. CUDA graph capture\n",
+        "\n",
+        "CUDA graphs let you define a sequence of operations (kernel launches, data movement) and their dependencies once, then replay it repeatedly with minimal overhead. Normally, each kernel launch incurs CPU-side setup costs. For short-running kernels launched many times, this overhead dominates. By capturing the entire workflow in a graph, these setup costs are paid once during graph creation, and subsequent launches are nearly free.\n",
+        "\n",
+        "In Warp, you can capture a CUDA graph with `wp.ScopedCapture()`. In the code block below, a single call to `step(...)` is captured and will be replayed for as many times as we want to run the simulation."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# capture first step in a CUDA graph\n",
+        "with wp.ScopedCapture() as capture:\n",
+        "    step(omega_0, omega_1, psi, rhs, k2i)\n",
+        "step_graph = capture.graph"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### 4. Running the simulation and saving the results\n",
+        "\n",
+        "The boilerplate code below runs the simulation for 2000 timesteps and saves the vorticity field at every 10th timestep. Thereafter, the vorticity fields are plotted and saved as a GIF."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import time\n",
+        "from PIL import Image\n",
+        "import matplotlib.pyplot as plt\n",
+        "from matplotlib.colors import Normalize\n",
+        "import os\n",
+        "\n",
+        "NUM_FRAMES = 200\n",
+        "STEPS_PER_FRAME = 10\n",
+        "\n",
+        "# Colormap setup\n",
+        "cmap = plt.cm.twilight\n",
+        "norm = Normalize(vmin=-15, vmax=15)\n",
+        "\n",
+        "frames = []\n",
+        "print(f\"Running {NUM_FRAMES} frames ({STEPS_PER_FRAME} steps each)...\")\n",
+        "\n",
+        "start_time = time.perf_counter()\n",
+        "for frame in range(NUM_FRAMES):\n",
+        "    # Advance simulation\n",
+        "    for _ in range(STEPS_PER_FRAME):\n",
+        "        wp.capture_launch(step_graph)\n",
+        "\n",
+        "    # Capture frame\n",
+        "    vorticity = omega_1.numpy().T  # Transpose for correct orientation\n",
+        "    colored = cmap(norm(vorticity))\n",
+        "    rgb = (colored[:, :, :3] * 255).astype(np.uint8)\n",
+        "    frames.append(rgb)\n",
+        "\n",
+        "    if (frame + 1) % 50 == 0:\n",
+        "        print(f\"  Frame {frame + 1}/{NUM_FRAMES}\")\n",
+        "\n",
+        "elapsed = time.perf_counter() - start_time\n",
+        "total_steps = NUM_FRAMES * STEPS_PER_FRAME\n",
+        "print(f\"Completed {total_steps} steps in {elapsed:.2f}s ({total_steps/elapsed:.0f} steps/s)\")"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import IPython.display\n",
+        "\n",
+        "# Create output directory\n",
+        "os.makedirs(\"./images/chapter-12.2\", exist_ok=True)\n",
+        "output_file = f\"./images/chapter-12.2/turbulence_{N_GRID}x{N_GRID}_Re{int(RE)}.gif\"\n",
+        "\n",
+        "# Save as GIF\n",
+        "pil_images = [Image.fromarray(frame) for frame in frames]\n",
+        "pil_images[0].save(\n",
+        "    output_file,\n",
+        "    save_all=True,\n",
+        "    append_images=pil_images[1:],\n",
+        "    duration=50,\n",
+        "    loop=0,\n",
+        ")\n",
+        "\n",
+        "print(f\"Saved: {output_file}\")\n",
+        "IPython.display.Image(output_file)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## Validation\n",
+        "\n",
+        "**TODO**"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## Conclusion\n",
+        "\n",
+        "**TODO**"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## References\n",
+        "\n",
+        "**TODO**"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "---\n",
+        "## Appendix A: Phase symmetry for real-valued physical fields\n",
+        "\n",
+        "Following San and Staples 2012, we decompose the phase as $\\zeta(\\mathbf{k}) = \\xi(\\mathbf{k}) + \\eta(\\mathbf{k})$, where $\\xi$ and $\\eta$ are independently chosen random values in $[0, 2\\pi]$ in the first quadrant ($k_x \\geq 0$, $k_y \\geq 0$).\n",
+        "\n",
+        "$\\xi(\\mathbf{k})$ and $\\eta(\\mathbf{k})$ must follow these relations in other quadrants\n",
+        "\n",
+        "$$\\xi(-k_x, k_y) = -\\xi(k_x, k_y),$$\n",
+        "$$\\xi(-k_x, -k_y) = -\\xi(k_x, k_y),$$\n",
+        "$$\\xi(k_x, -k_y) = \\xi(k_x, k_y),$$\n",
+        "\n",
+        "$$\\eta(-k_x, k_y) = \\eta(k_x, k_y),$$\n",
+        "$$\\eta(-k_x, -k_y) = -\\eta(k_x, k_y),$$\n",
+        "$$\\eta(k_x, -k_y) = -\\eta(k_x, k_y).$$"
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "warp-cfd",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.11.10"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 2
+}
diff --git a/Accelerated_Python_User_Guide/notebooks/utils.py b/Accelerated_Python_User_Guide/notebooks/utils.py
new file mode 100644
index 00000000..fa08403b
--- /dev/null
+++ b/Accelerated_Python_User_Guide/notebooks/utils.py
@@ -0,0 +1,449 @@
+# SPDX-License-Identifier: Apache-2.0 AND CC-BY-NC-4.0
+"""Utility functions for validating FFT kernels in the 2D Navier-Stokes solver."""
+
+import numpy as np
+import warp as wp
+import matplotlib.pyplot as plt
+
+
+def validate_fft_roundtrip(
+    fft_kernel,
+    ifft_kernel,
+    n_grid: int,
+    tile_m: int,
+    tile_n: int,
+    block_dim: int,
+    figsize=(14, 4),
+):
+    """Validate FFT and IFFT kernels using a sine wave test.
+
+    Creates a sin(x) function on [0, 2\pi], applies FFT to show the amplitude
+    spectrum (should peak at k=1), then applies IFFT and compares with original.
+
+    Args:
+        fft_kernel: The fft_tiled Warp kernel.
+        ifft_kernel: The ifft_tiled Warp kernel.
+        n_grid: Grid resolution (must match TILE_N used in kernel compilation).
+        tile_m: TILE_M parameter (typically 1 for row-wise FFT).
+        tile_n: TILE_N parameter (typically N_GRID).
+        block_dim: Block dimension for kernel launch.
+        figsize: Figure size tuple.
+
+    Returns:
+        fig, axes: Matplotlib figure and axes.
+        max_error: Maximum absolute error between original and reconstructed.
+    """
+    # Create physical grid on [0, 2\pi)
+    L = 2.0 * np.pi
+    x = np.linspace(0, L, n_grid, endpoint=False)
+
+    # Create test signal: sin(x) which has a peak at k=1
+    # Since f(x) = sin(x) = (e^{ix} - e^{-ix}) / 2i, FFT should show peaks at k=\pm 1
+    signal = np.sin(x)
+
+    # Prepare 2D array (fft_tiled operates on rows of 2D arrays)
+    # We'll use a single row for this 1D test
+    signal_2d = signal.reshape(1, n_grid).astype(np.float32)
+
+    # Convert to complex (wp.vec2f format): [real, imag]
+    # Shape must be (1, n_grid, 2) for Warp to interpret as (1, n_grid) of vec2f
+    signal_complex = np.zeros((1, n_grid, 2), dtype=np.float32)
+    signal_complex[:, :, 0] = signal_2d  # Real part
+    signal_complex[:, :, 1] = 0.0  # Imaginary part = 0
+
+    # Create Warp arrays - pass shape (1, n_grid) and let Warp handle vec2f
+    input_wp = wp.array(signal_complex, dtype=wp.vec2f, shape=(1, n_grid))
+    fft_result = wp.zeros((1, n_grid), dtype=wp.vec2f)
+    ifft_result = wp.zeros((1, n_grid), dtype=wp.vec2f)
+
+    # Perform FFT
+    wp.launch_tiled(
+        fft_kernel,
+        dim=[1, 1],
+        inputs=[input_wp],
+        outputs=[fft_result],
+        block_dim=block_dim,
+    )
+
+    # Perform IFFT
+    wp.launch_tiled(
+        ifft_kernel,
+        dim=[1, 1],
+        inputs=[fft_result],
+        outputs=[ifft_result],
+        block_dim=block_dim,
+    )
+
+    # Synchronize and get results
+    wp.synchronize()
+
+    # Convert back to numpy
+    fft_np = fft_result.numpy()
+    ifft_np = ifft_result.numpy()
+
+    # Extract real and imaginary parts from FFT result
+    fft_real = fft_np[0, :, 0]
+    fft_imag = fft_np[0, :, 1]
+    fft_magnitude = np.sqrt(fft_real**2 + fft_imag**2)
+
+    # IFFT result (need to normalize by N)
+    reconstructed = ifft_np[0, :, 0] / n_grid  # Real part, normalized
+
+    # Wavenumbers for FFT (standard ordering: 0, 1, 2, ..., N/2-1, -N/2, ..., -1)
+    k = np.fft.fftfreq(n_grid, d=L / (2 * np.pi * n_grid))
+
+    # Compute error
+    max_error = np.max(np.abs(signal - reconstructed))
+
+    # Create figure with 3 panels
+    fig, axes = plt.subplots(1, 3, figsize=figsize)
+
+    # Panel 1: Original signal
+    axes[0].plot(x, signal, "b-", linewidth=2, label="sin(x)")
+    axes[0].set_xlabel("x")
+    axes[0].set_ylabel("f(x)")
+    axes[0].set_title("Original Signal: f(x) = sin(x)")
+    axes[0].set_xlim([0, L])
+    axes[0].grid(True, alpha=0.3)
+    axes[0].legend()
+
+    # Panel 2: FFT magnitude spectrum
+    # Shift for better visualization (put k=0 at center)
+    k_shifted = np.fft.fftshift(k)
+    mag_shifted = np.fft.fftshift(fft_magnitude)
+
+    axes[1].stem(k_shifted, mag_shifted, basefmt=" ")
+    axes[1].set_xlabel("Wavenumber k")
+    axes[1].set_ylabel(r"$|\hat{f}(k)|$")
+    axes[1].set_title("FFT Amplitude Spectrum")
+    axes[1].set_xlim([-10, 10])  # Focus on low wavenumbers
+    axes[1].grid(True, alpha=0.3)
+    axes[1].axvline(x=1, color="r", linestyle="--", alpha=0.5, label="k=1")
+    axes[1].axvline(x=-1, color="r", linestyle="--", alpha=0.5, label="k=-1")
+    axes[1].legend()
+
+    # Panel 3: Comparison of original vs reconstructed
+    axes[2].plot(x, signal, "b-", linewidth=2, label="Original")
+    axes[2].plot(x, reconstructed, "r--", linewidth=2, label="FFT→IFFT")
+    axes[2].set_xlabel("x")
+    axes[2].set_ylabel("f(x)")
+    axes[2].set_title(f"Comparison b/w original and reconstructed signal")
+    axes[2].set_xlim([0, L])
+    axes[2].grid(True, alpha=0.3)
+    axes[2].legend()
+
+    plt.tight_layout()
+
+    return fig, axes, max_error
+
+
+def validate_transpose(
+    transpose_kernel,
+    tile_transpose_dim: int,
+    n_test: int = 64,
+    figsize=(10, 4),
+):
+    """Validate the tiled transpose kernel with a visual comparison.
+
+    Creates an upper triangular matrix (1s above diagonal, 0s below),
+    applies transpose, and shows before/after.
+
+    Args:
+        transpose_kernel: The tiled_transpose Warp kernel.
+        tile_transpose_dim: TILE_TRANSPOSE_DIM parameter used in kernel.
+        n_test: Size of the test matrix (default 64x64).
+        figsize: Figure size tuple.
+
+    Returns:
+        fig, axes: Matplotlib figure and axes.
+    """
+    # Create upper triangular matrix: 1s above diagonal, 0s below
+    pattern = np.triu(np.ones((n_test, n_test), dtype=np.float32))
+
+    # Convert to complex format (wp.vec2f) - store pattern in real part
+    input_complex = np.zeros((n_test, n_test, 2), dtype=np.float32)
+    input_complex[:, :, 0] = pattern
+
+    # Create Warp arrays
+    input_wp = wp.array(input_complex, dtype=wp.vec2f, shape=(n_test, n_test))
+    output_wp = wp.zeros((n_test, n_test), dtype=wp.vec2f)
+
+    # Launch transpose kernel
+    wp.launch_tiled(
+        transpose_kernel,
+        dim=(n_test // tile_transpose_dim, n_test // tile_transpose_dim),
+        inputs=[input_wp],
+        outputs=[output_wp],
+        block_dim=tile_transpose_dim * tile_transpose_dim,
+    )
+    wp.synchronize()
+
+    # Get result
+    transposed_result = output_wp.numpy()[:, :, 0]
+
+    # Plot
+    fig, axes = plt.subplots(1, 2, figsize=figsize)
+
+    axes[0].imshow(pattern, cmap="gray_r")
+    axes[0].set_title("Original")
+    axes[0].set_xlabel("j")
+    axes[0].set_ylabel("i")
+
+    axes[1].imshow(transposed_result, cmap="gray_r")
+    axes[1].set_title("After transpose")
+    axes[1].set_xlabel("j")
+    axes[1].set_ylabel("i")
+
+    plt.tight_layout()
+    return fig, axes
+
+
+def validate_poisson_solver(
+    solve_poisson_func,
+    n_grid: int,
+    kx: int = 1,
+    ky: int = 1,
+    figsize=(14, 4),
+):
+    """Validate the Poisson solver using an analytical test case.
+
+    Tests with \omega = (kx^2 + ky^2)sin(kx·x)sin(ky·y), which has the analytical
+    solution psi = sin(kx·x)sin(ky·y) for the Poisson equation \laplacian \psi = -\omega.
+
+    Args:
+        solve_poisson_func: The solve_poisson function to validate.
+        n_grid: Grid resolution.
+        kx: Wavenumber in x-direction (default 1).
+        ky: Wavenumber in y-direction (default 1).
+        figsize: Figure size tuple.
+
+    Returns:
+        fig, axes: Matplotlib figure and axes.
+        max_error: Maximum absolute error between computed and analytical solution.
+    """
+    import warp as wp
+
+    # Create 2D grid on [0, 2\pi) \times [0, 2\pi)
+    L = 2.0 * np.pi
+    x = np.linspace(0, L, n_grid, endpoint=False)
+    y = np.linspace(0, L, n_grid, endpoint=False)
+    X, Y = np.meshgrid(x, y)
+
+    # Analytical solution: \psi = sin(kx·x)sin(ky·y)
+    psi_analytical = np.sin(kx * X) * np.sin(ky * Y)
+
+    # Corresponding vorticity: \omega = (kx^2 + ky^2)sin(kx·x)sin(ky·y)
+    # From \laplacian\psi = -\omega, we have \omega = -\laplacian\psi = (kx^2 + ky^2)\psi
+    k_squared = kx**2 + ky**2
+    omega = k_squared * psi_analytical
+
+    # Precompute 1/|k|^2 for spectral Poisson solver
+    k_freq = np.fft.fftfreq(n_grid, d=1.0 / n_grid)
+    kx_grid, ky_grid = np.meshgrid(k_freq, k_freq)
+    k2 = kx_grid**2 + ky_grid**2
+    k2i_np = np.zeros_like(k2)
+    nonzero = k2 != 0
+    k2i_np[nonzero] = 1.0 / k2[nonzero]
+    k2i = wp.array(k2i_np.astype(np.float32), dtype=wp.float32)
+
+    # Create Warp arrays for omega and psi
+    omega_wp = wp.array(omega.astype(np.float32), dtype=wp.float32)
+    psi_wp = wp.zeros((n_grid, n_grid), dtype=wp.float32)
+
+    # Solve Poisson equation
+    solve_poisson_func(omega_wp, psi_wp, k2i)
+    wp.synchronize()
+
+    # Get computed solution
+    psi_computed = psi_wp.numpy()
+
+    # Compute error
+    error = np.abs(psi_computed - psi_analytical)
+    max_error = np.max(error)
+
+    # Create visualization
+    fig, axes = plt.subplots(1, 3, figsize=figsize)
+
+    # Common colormap settings
+    vmin = min(psi_analytical.min(), psi_computed.min())
+    vmax = max(psi_analytical.max(), psi_computed.max())
+
+    # Panel 1: Computed solution
+    im0 = axes[0].imshow(
+        psi_computed,
+        extent=[0, L, 0, L],
+        origin="lower",
+        cmap="RdBu_r",
+        vmin=vmin,
+        vmax=vmax,
+    )
+    axes[0].set_title(r"Computed $\psi$")
+    axes[0].set_xlabel("x")
+    axes[0].set_ylabel("y")
+    plt.colorbar(im0, ax=axes[0], shrink=0.8)
+
+    # Panel 2: Analytical solution
+    im1 = axes[1].imshow(
+        psi_analytical,
+        extent=[0, L, 0, L],
+        origin="lower",
+        cmap="RdBu_r",
+        vmin=vmin,
+        vmax=vmax,
+    )
+    axes[1].set_title(r"Analytical $\psi = \sin(x)\sin(y)$")
+    axes[1].set_xlabel("x")
+    axes[1].set_ylabel("y")
+    plt.colorbar(im1, ax=axes[1], shrink=0.8)
+
+    # Panel 3: Error
+    im2 = axes[2].imshow(error, extent=[0, L, 0, L], origin="lower", cmap="hot")
+    axes[2].set_title(f"Error")
+    axes[2].set_xlabel("x")
+    axes[2].set_ylabel("y")
+    plt.colorbar(im2, ax=axes[2], shrink=0.8)
+
+    plt.tight_layout()
+    return fig, axes, max_error
+
+
+def validate_advection_diffusion(
+    advection_diffusion_kernel,
+    n_grid: int,
+    kx: int = 2,
+    ky: int = 3,
+    figsize=(10, 4),
+):
+    """Validate finite-difference advection and diffusion using analytical test case.
+
+    Uses test functions:
+        omega = (kx^2 + ky^2) * sin(kx*x) * sin(ky*y)
+        psi = sin(kx*x) * sin(ky*y)
+
+    Analytical results:
+        Diffusion: nabla^2 omega = -(kx^2 + ky^2)^2 * sin(kx*x) * sin(ky*y)
+        Advection: J(psi, omega) = 0 (since omega = c * psi with constant c)
+
+    Args:
+        advection_diffusion_kernel: Warp kernel with signature:
+            (omega, psi, advection_out, diffusion_out, h, n) where all arrays
+            are wp.array2d(dtype=wp.float32), h is wp.float32, n is wp.int32.
+        n_grid: Grid resolution.
+        kx: Wavenumber in x-direction.
+        ky: Wavenumber in y-direction.
+        figsize: Figure size for each 1x2 panel.
+
+    Returns:
+        (fig_diff, axes_diff): Matplotlib figure/axes for diffusion comparison.
+        (fig_adv, axes_adv): Matplotlib figure/axes for advection comparison.
+        max_diff_error: Maximum absolute error in diffusion.
+        max_adv_error: Maximum absolute error in advection.
+    """
+    # Create 2D grid on [0, 2*pi) x [0, 2*pi)
+    L = 2.0 * np.pi
+    h = L / n_grid
+    x = np.linspace(0, L, n_grid, endpoint=False)
+    y = np.linspace(0, L, n_grid, endpoint=False)
+    X, Y = np.meshgrid(x, y)
+
+    # Analytical fields
+    k_squared = kx**2 + ky**2
+    psi_analytical = np.sin(kx * X) * np.sin(ky * Y)
+    omega_analytical = k_squared * psi_analytical
+
+    # Analytical diffusion: nabla^2 omega = -(kx^2 + ky^2) * omega
+    diffusion_analytical = -k_squared * omega_analytical
+
+    # Analytical advection: J(psi, omega) = 0 (since omega = c * psi)
+    advection_analytical = np.zeros_like(omega_analytical)
+
+    # Create Warp arrays
+    omega_wp = wp.array(omega_analytical.astype(np.float32), dtype=wp.float32)
+    psi_wp = wp.array(psi_analytical.astype(np.float32), dtype=wp.float32)
+    advection_wp = wp.zeros((n_grid, n_grid), dtype=wp.float32)
+    diffusion_wp = wp.zeros((n_grid, n_grid), dtype=wp.float32)
+
+    # Launch user-provided kernel
+    wp.launch(
+        advection_diffusion_kernel,
+        dim=(n_grid, n_grid),
+        inputs=[omega_wp, psi_wp, advection_wp, diffusion_wp, float(h), n_grid],
+    )
+    wp.synchronize()
+
+    # Get numerical results
+    diffusion_numerical = diffusion_wp.numpy()
+    advection_numerical = advection_wp.numpy()
+
+    # --- Diffusion comparison plot (1x2 panel) ---
+    fig_diff, axes_diff = plt.subplots(1, 2, figsize=figsize)
+
+    vmin_d = min(diffusion_analytical.min(), diffusion_numerical.min())
+    vmax_d = max(diffusion_analytical.max(), diffusion_numerical.max())
+
+    im0 = axes_diff[0].imshow(
+        diffusion_analytical,
+        extent=[0, L, 0, L],
+        origin="lower",
+        cmap="RdBu_r",
+        vmin=vmin_d,
+        vmax=vmax_d,
+    )
+    axes_diff[0].set_title(r"Analytical $\nabla^2\omega$")
+    axes_diff[0].set_xlabel("x")
+    axes_diff[0].set_ylabel("y")
+    plt.colorbar(im0, ax=axes_diff[0], shrink=0.8)
+
+    im1 = axes_diff[1].imshow(
+        diffusion_numerical,
+        extent=[0, L, 0, L],
+        origin="lower",
+        cmap="RdBu_r",
+        vmin=vmin_d,
+        vmax=vmax_d,
+    )
+    axes_diff[1].set_title(rf"Numerical Diffusion")
+    axes_diff[1].set_xlabel("x")
+    axes_diff[1].set_ylabel("y")
+    plt.colorbar(im1, ax=axes_diff[1], shrink=0.8)
+
+    fig_diff.suptitle(f"Diffusion Validation (kx={kx}, ky={ky}, N={n_grid})", y=1.02)
+    fig_diff.tight_layout()
+
+    # --- Advection comparison plot (1x2 panel) ---
+    fig_adv, axes_adv = plt.subplots(1, 2, figsize=figsize)
+
+    # For advection, analytical is zero, so set symmetric limits around numerical
+    adv_max = max(np.abs(advection_numerical).max(), 1e-10)
+    vmin_a, vmax_a = -adv_max, adv_max
+
+    im2 = axes_adv[0].imshow(
+        advection_analytical,
+        extent=[0, L, 0, L],
+        origin="lower",
+        cmap="RdBu_r",
+        vmin=vmin_a,
+        vmax=vmax_a,
+    )
+    axes_adv[0].set_title(r"Analytical Advection (= 0)")
+    axes_adv[0].set_xlabel("x")
+    axes_adv[0].set_ylabel("y")
+    plt.colorbar(im2, ax=axes_adv[0], shrink=0.8)
+
+    im3 = axes_adv[1].imshow(
+        advection_numerical,
+        extent=[0, L, 0, L],
+        origin="lower",
+        cmap="RdBu_r",
+        vmin=vmin_a,
+        vmax=vmax_a,
+    )
+    axes_adv[1].set_title(rf"Numerical Advection")
+    axes_adv[1].set_xlabel("x")
+    axes_adv[1].set_ylabel("y")
+    plt.colorbar(im3, ax=axes_adv[1], shrink=0.8)
+
+    fig_adv.suptitle(f"Advection Validation (kx={kx}, ky={ky}, N={n_grid})", y=1.02)
+    fig_adv.tight_layout()
+
+    return (fig_diff, axes_diff), (fig_adv, axes_adv)