# CuPy

:::{questions}
- How can I make my Python code run on a GPU?
- How do I copy data to and from GPU memory?
:::

:::{objectives}
- Understand the basics of the CuPy library and its functionality
- Analyze and detect whether a variable is stored in CPU or GPU memory
- Execute a data-copy operation from host to device memory and vice versa
- Rewrite a simple NumPy/SciPy function into its CuPy equivalent that runs on the GPU
:::


## Introduction to CuPy

Another excellent tool for writing Python code to run on GPUs is CuPy.
CuPy implements most of the NumPy/SciPy operations
and acts as a drop-in replacement to run existing
code on both NVIDIA CUDA and AMD ROCm platforms.
By design, the CuPy interface is as close as possible to NumPy/SciPy,
making code porting much easier.

:::{note}
A common misconception is that CuPy is an official NVIDIA project.
It is in fact a community-driven project. It was originally developed
to support a deep-learning framework called Chainer (now deprecated),
and at that time it only supported CUDA as a target. Nowadays CuPy has
solid support for both NVIDIA CUDA and AMD ROCm platforms.
:::

## Basics of CuPy

CuPy's syntax is essentially identical to that of NumPy. A list of
NumPy/SciPy APIs and their corresponding CuPy implementations
is summarised here:

[Complete comparison of NumPy and SciPy to CuPy functions](https://docs.cupy.dev/en/stable/reference/comparison.html#comparison-table).

In short, CuPy provides N-dimensional arrays (ndarray),
sparse matrices, and the associated routines for GPU devices,
most having the same API as NumPy/SciPy.

Let us take a look at the following code snippet, which calculates the L2 norm of an array.
Note how simple it is to run it on a GPU device using CuPy: essentially, change `np` to `cp`.
**NumPy**
```
import numpy as np
x_cpu = np.array([1, 2, 3])
l2_cpu = np.linalg.norm(x_cpu)
```

**CuPy**

```
import cupy as cp
x_gpu = cp.array([1, 2, 3])
l2_gpu = cp.linalg.norm(x_gpu)
```
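Note that `l2_gpu` above lives in GPU memory, just like `x_gpu`.
As a small illustration (not part of the original snippet), here is one way
to inspect and fetch the result back to the host:

```
import cupy as cp

x_gpu = cp.array([1, 2, 3])
l2_gpu = cp.linalg.norm(x_gpu)

print(type(l2_gpu))        # cupy.ndarray (zero-dimensional), not a Python float
print(float(l2_gpu))       # casting copies the value back to the host
print(cp.asnumpy(l2_gpu))  # so does cp.asnumpy()
```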
:::{warning}
Do not change the import line
in the code to something like

`import cupy as np`

as this can cause problems wherever you need
actual NumPy behaviour rather than CuPy.
:::


### Conversion to/from NumPy arrays

Although `cupy.ndarray` is the CuPy counterpart of NumPy's `numpy.ndarray`,
the main difference is that a `cupy.ndarray` resides on the `current device`
(see below), and the two are not implicitly convertible to each other.
When you need to manipulate CPU and GPU arrays together, an explicit data transfer
may be required to move them to the same location – either CPU or GPU.
For this purpose, CuPy implements the following methods:

- To convert `numpy.ndarray` to `cupy.ndarray`, use `cupy.array()` or `cupy.asarray()`
- To convert `cupy.ndarray` to `numpy.ndarray`, use `cupy.asnumpy()` or `cupy.ndarray.get()`

These methods accept arbitrary input, meaning that they can be applied to any data
that is located on either the host or the device.

Here is an example that demonstrates the use of both methods:
```
>>> import numpy as np
>>> import cupy as cp
>>>
>>> x_cpu = np.array([1, 2, 3])  # allocating array x on cpu
>>> y_cpu = np.array([4, 5, 6])  # allocating array y on cpu
>>> x_cpu + y_cpu  # add x and y
array([5, 7, 9])
>>>
>>> x_gpu = cp.asarray(x_cpu)  # move x to gpu
>>> x_gpu + y_cpu  # now it should fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cupy/_core/core.pyx", line 1375, in cupy._core.core._ndarray_base.__add__
  File "cupy/_core/core.pyx", line 1799, in cupy._core.core._ndarray_base.__array_ufunc__
  File "cupy/_core/_kernel.pyx", line 1285, in cupy._core._kernel.ufunc.__call__
  File "cupy/_core/_kernel.pyx", line 159, in cupy._core._kernel._preprocess_args
  File "cupy/_core/_kernel.pyx", line 145, in cupy._core._kernel._preprocess_arg
TypeError: Unsupported type <class 'numpy.ndarray'>
>>>
>>> cp.asnumpy(x_gpu) + y_cpu
array([5, 7, 9])
>>> cp.asnumpy(x_gpu) + cp.asnumpy(y_cpu)
array([5, 7, 9])
>>> x_gpu + cp.asarray(y_cpu)
array([5, 7, 9])
>>> cp.asarray(x_gpu) + cp.asarray(y_cpu)
array([5, 7, 9])
```



:::{note}
Converting between `cupy.ndarray` and `numpy.ndarray` incurs a data transfer
between the host (CPU) and the GPU device,
which is costly in terms of performance.
:::

### Current Device

CuPy introduces the concept of a `current device`,
which is the default GPU device on which
the allocation, manipulation, calculation, etc.
of arrays take place. The `cupy.ndarray.device` attribute
can be used to determine which device a CuPy array is allocated on.
By default, the ID of the current device is 0.

```
>>> import cupy as cp
>>> x_gpu = cp.array([1, 2, 3, 4, 5])
>>> x_gpu.device
<CUDA Device 0>
```
To obtain the total number of accessible devices,
one can utilize the `getDeviceCount` function:
```
>>> import cupy as cp
>>> cp.cuda.runtime.getDeviceCount()
1
```

To switch to another GPU device, use the `Device` context manager.
For example, the following code snippet creates an array on GPU 1:
```
>>> import cupy as cp
>>> with cp.cuda.Device(1):
...     x_gpu1 = cp.array([1, 2, 3, 4, 5])
...
>>> print("x_gpu1 is on device:", x_gpu1.device)
x_gpu1 is on device: <CUDA Device 1>
```

Sometimes it is more convenient to set the device globally:
```
>>> import cupy as cp
>>> cp.cuda.runtime.setDevice(1)
```

All CuPy operations (except for multi-GPU features
and device-to-device copies) are performed
on the currently active device.

:::{note}
The device is still reported as a CUDA device even if you are on AMD GPUs.

In general, CuPy functions expect
the data array to be on the current device.
Passing an array stored on a non-current
device may work depending on the hardware configuration,
but it is generally discouraged as it may not be performant.
:::
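Putting these device-management calls together, here is a minimal sketch
that runs the same small computation on every accessible GPU and collects
the results on the host (the variable names are ours):

```
import cupy as cp

# run the same computation on every accessible GPU
n_gpus = cp.cuda.runtime.getDeviceCount()
results = []
for dev_id in range(n_gpus):
    with cp.cuda.Device(dev_id):          # make dev_id the current device
        x = cp.arange(10) ** 2            # allocated on the current device
        results.append(float(x.sum()))    # float() copies the scalar to the host
print(results)
```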
### Exercises: Matrix Multiplication

:::{exercise} Exercise: Matrix Multiplication
The first example is a simple matrix multiplication in single precision (float32).
The arrays are created with random values in the range of -1.0 to 1.0.
Convert the NumPy code to run on a GPU using CuPy.

```
import numpy as np

A = np.random.uniform(low=-1., high=1., size=(64,64)).astype(np.float32)
B = np.random.uniform(low=-1., high=1., size=(64,64)).astype(np.float32)
C = np.matmul(A,B)
```
:::

:::{solution}
```
import cupy as cp

A = cp.random.uniform(low=-1., high=1., size=(64,64)).astype(cp.float32)
B = cp.random.uniform(low=-1., high=1., size=(64,64)).astype(cp.float32)
C = cp.matmul(A,B)
```

Notice that in this snippet of code the variable C remains on the GPU.
You have to copy it back to the CPU explicitly if needed;
otherwise, all the data on the GPU is freed once the program ends.
:::


### Exercises: moving data from GPU to CPU

:::{exercise} Exercise: moving data from GPU to CPU
This code snippet computes a singular value decomposition (SVD)
of a matrix. In this case, the matrix is a
single-precision 64x64 matrix of random values. First, re-write the code
using CuPy to enable it for the GPU. Second, add a few lines to
copy the variable u back to the CPU and print the objects' types using the `type()` function.
```
import numpy as np

A = np.random.uniform(low=-1., high=1., size=(64, 64)).astype(np.float32)
u, s, v = np.linalg.svd(A)
```
:::

:::{solution}
```
import cupy as cp

A = cp.random.uniform(low=-1., high=1., size=(64, 64)).astype(cp.float32)
u_gpu, s_gpu, v_gpu = cp.linalg.svd(A)
print("type(u_gpu) = ", type(u_gpu))
u_cpu = cp.asnumpy(u_gpu)
print("type(u_cpu) = ", type(u_cpu))
```
:::

### Exercises: CuPy vs Numpy/SciPy

:::{exercise} CuPy vs Numpy/SciPy
Although the CuPy team aims for complete
NumPy/SciPy API coverage so that CuPy becomes a full drop-in replacement,
there are some important differences between CuPy and NumPy,
and one should keep them in mind when porting NumPy code to CuPy.

Here are various examples illustrating the differences.
:::

:::{solution} When CuPy is different from NumPy/SciPy
### Cast behavior from float to integer

Some casting behaviors from float to integer
are not defined in the C++ specification;
the casts from a negative float to an
unsigned integer and from infinity to an integer
are two such examples. The behavior of NumPy
depends on your CPU architecture.
This is the result on an Intel CPU:

```python
>>> np.array([-1], dtype=np.float32).astype(np.uint32)
array([4294967295], dtype=uint32)

>>> cp.array([-1], dtype=np.float32).astype(np.uint32)
array([0], dtype=uint32)
```

```python
>>> np.array([float('inf')], dtype=np.float32).astype(np.int32)
array([-2147483648], dtype=int32)

>>> cp.array([float('inf')], dtype=np.float32).astype(np.int32)
array([2147483647], dtype=int32)
```
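If your code relies on such casts, one portable workaround (our suggestion,
not from the CuPy documentation) is to make the intended behavior explicit
before casting, so that NumPy and CuPy agree:

```python
import numpy as np
import cupy as cp

def to_uint32(xp, x):
    # make negative inputs explicit (clamp to zero) before the cast
    return xp.clip(x, 0, None).astype(xp.uint32)

print(to_uint32(np, np.array([-1.0], dtype=np.float32)))  # [0]
print(to_uint32(cp, cp.array([-1.0], dtype=cp.float32)))  # [0]
```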
### Random methods support dtype argument

NumPy's random value generator does not support
a dtype argument and instead always returns
a float64 value. CuPy, on the other hand, supports both
float32 and float64, thanks to cuRAND.

```
>>> np.random.randn(dtype=np.float32)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: randn() got an unexpected keyword argument 'dtype'

>>> cp.random.randn(dtype=np.float32)
array(1.3591791, dtype=float32)
```

### Out-of-bounds indices

By default, CuPy handles out-of-bounds indices
differently from NumPy when using integer array indexing:
NumPy raises an error, whereas CuPy wraps them around.


```
>>> x = np.array([0, 1, 2])
>>> x[[1, 3]] = 10
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index 3 is out of bounds for axis 0 with size 3

>>> x = cp.array([0, 1, 2])
>>> x[[1, 3]]
array([1, 0])
>>> x[[1, 3]] = 10
>>> x
array([10, 10, 2])
```

### Duplicate values in indices

CuPy's `__setitem__` behaves differently from NumPy's
when integer arrays reference the same location multiple times.
NumPy stores the value corresponding to the last element
among the elements referencing duplicate locations.
In CuPy, the value that is actually stored is undefined.

```
>>> a = np.zeros((2,))
>>> i = np.arange(10000) % 2
>>> v = np.arange(10000).astype(np.float32)
>>> a[i] = v
>>> a
array([9998., 9999.])

>>> a = cp.zeros((2,))
>>> i = cp.arange(10000) % 2
>>> v = cp.arange(10000).astype(cp.float32)
>>> a[i] = v
>>> a
array([4592., 4593.])
```


### Zero-dimensional array

#### Reduction methods

NumPy's reduction functions (e.g. `numpy.sum()`) return
scalar values (e.g. `numpy.float32`). However,
the CuPy counterparts return zero-dimensional `cupy.ndarray`s.
That is because CuPy scalar values (e.g. `cupy.float32`)
are aliases of NumPy scalar values and are allocated
in CPU memory. If these types were returned, it would be
necessary to synchronize between GPU and CPU. If you want to
use scalar values, cast the returned arrays explicitly.

```
>>> type(np.sum(np.arange(3))) == np.int64
True

>>> type(cp.sum(cp.arange(3))) == cp.ndarray
True
```

#### Type promotion

CuPy automatically promotes the dtypes of `cupy.ndarray`s
in a function with two or more operands; the result dtype
is determined by the dtypes of the inputs. This is different
from NumPy's rule on type promotion when operands contain
zero-dimensional arrays. Zero-dimensional `numpy.ndarray`s
are treated as if they were scalar values when they appear
as operands of a NumPy function. This may affect the dtype
of the output, depending on the values of the "scalar" inputs.

```
>>> (np.array(3, dtype=np.int32) * np.array([1., 2.], dtype=np.float32)).dtype
dtype('float32')

>>> (cp.array(3, dtype=np.int32) * cp.array([1., 2.], dtype=np.float32)).dtype
dtype('float64')
```

### Matrix type (`numpy.matrix`)

SciPy returns `numpy.matrix` (a subclass of `numpy.ndarray`)
when dense matrices are computed from sparse matrices
(e.g., coo_matrix + ndarray). However, CuPy returns
`cupy.ndarray` for such operations.

There is no plan to provide a `numpy.matrix` equivalent in CuPy.
This is because the use of `numpy.matrix` is no longer
recommended since NumPy 1.15.

### Data types

The data type of CuPy arrays cannot be non-numeric, such as strings or objects.
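For instance, creating a string array fails (a quick check of our own;
the exact error message may vary between CuPy versions):

```
>>> import cupy as cp
>>> cp.array([1.0, 2.0])   # numeric dtypes are fine
array([1., 2.])
>>> cp.array(["a", "b"])   # non-numeric dtypes raise an error
Traceback (most recent call last):
  ...
ValueError: Unsupported dtype <U1
```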
### Universal Functions only work with CuPy array or scalar

Unlike NumPy, Universal Functions in CuPy only work with
CuPy arrays or scalars. They do not accept other objects
(e.g., lists or `numpy.ndarray`).

```
>>> np.power([np.arange(5)], 2)
array([[ 0, 1, 4, 9, 16]])

>>> cp.power([cp.arange(5)], 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cupy/_core/_kernel.pyx", line 1285, in cupy._core._kernel.ufunc.__call__
  File "cupy/_core/_kernel.pyx", line 159, in cupy._core._kernel._preprocess_args
  File "cupy/_core/_kernel.pyx", line 145, in cupy._core._kernel._preprocess_arg
TypeError: Unsupported type <class 'list'>
```

### Random seed arrays are hashed to scalars

Like NumPy, CuPy's RandomState objects accept seeds
either as numbers or as full NumPy arrays.

However, unlike NumPy, array seeds will be hashed down
to a single 64-bit number. In contrast, NumPy often
expands the seeds into a larger state space of 128 bits.
Therefore, CuPy's implementation may not communicate as
much entropy to the underlying random number generator.



### NaN (not-a-number) handling

Prior to CuPy v11, CuPy's reduction functions (e.g., `cupy.sum()`)
handled NaNs in complex numbers differently from NumPy's counterparts:

```
>>> a = [0.5 + 3.7j, complex(0.7, np.nan), complex(np.nan, -3.9), complex(np.nan, np.nan)]
>>> a
[(0.5+3.7j), (0.7+nanj), (nan-3.9j), (nan+nanj)]
>>>
>>> a_np = np.asarray(a)
>>> print(a_np.max(), a_np.min())
(0.7+nanj) (0.7+nanj)
>>>
>>> a_cp = cp.asarray(a_np)
>>> print(a_cp.max(), a_cp.min())
(nan-3.9j) (nan-3.9j)
```

The reason is that internally the reduction is performed
in a strided fashion, so it does not ensure a
proper comparison order and cannot follow NumPy's rule
of always propagating the first-encountered NaN.
This difference does not apply when the {abbr}`CUB library (CUB is an accelerator backend shipped together with CuPy. It accelerates reduction operations -- such as sum, prod, amin, amax, argmin, argmax -- and other routines, such as inclusive scans -- ex: cumsum --, histograms, sparse matrix-vector multiplications and ReductionKernel)` [is enabled](https://docs.cupy.dev/en/stable/user_guide/performance.html#use-cub-cutensor-backends-for-reduction-and-other-routines),
which is the default for CuPy v11 and later.

### Contiguity / Strides

To provide the best performance, the contiguity of
a resulting ndarray is not guaranteed to match
that of NumPy's output.

```
>>> a = np.array([[1, 2], [3, 4]], order='F')
>>> a
array([[1, 2],
       [3, 4]])
>>> print((a + a).flags.f_contiguous)
True

>>> a = cp.array([[1, 2], [3, 4]], order='F')
>>> a
array([[1, 2],
       [3, 4]])
>>> print((a + a).flags.f_contiguous)
False
```
:::

## Interoperability

CuPy implements standard APIs for data exchange and interoperability,
which means it can be used in conjunction with
any other library supporting these standards. For example,
NumPy, Numba, PyTorch, TensorFlow, and MPI4Py, among others,
can operate directly on CuPy arrays.
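For example, GPU data can be exchanged with PyTorch through the DLPack
protocol without copying. A minimal sketch, assuming a CUDA-enabled
PyTorch installation:

```
import cupy as cp
import torch

t = torch.arange(10, device="cuda")
c = cp.from_dlpack(t)   # zero-copy: a CuPy view of the same GPU buffer
c *= 2
print(t)                # the change is visible from the PyTorch side
```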
### NumPy
CuPy implements the `__array_ufunc__` interface
[(see NEP 13)](https://numpy.org/neps/nep-0013-ufunc-overrides.html)
and the `__array_function__` interface
[(see NEP 18)](https://numpy.org/neps/nep-0018-array-function-protocol.html),
as well as the [Python Array API Standard](https://data-apis.org/array-api/latest).


Note that the return type of these operations is still consistent with the input type.

```
>>> import cupy as cp
>>> import numpy as np
>>> dev_arr = cp.random.randn(1, 2, 3, 4).astype(cp.float32)
>>> result = np.sum(dev_arr)
>>> print(type(result))
<class 'cupy.ndarray'>
```


:::{note}
The `__array_ufunc__` feature requires NumPy 1.13 or later.

The `__array_function__` feature requires NumPy 1.16 or later;
as of NumPy 1.17, `__array_function__` is enabled by default.

NEP 13 — A mechanism for overriding Ufuncs

NEP 18 — A dispatch mechanism for NumPy's high level array functions
:::

### Numba

CuPy implements `__cuda_array_interface__`,
which is compatible with Numba v0.39.0 or later
(see [CUDA Array Interface](https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html)
for details). It means one can pass CuPy arrays to kernels JITed with Numba.

:::{warning}
This example may crash at kernel launch on some software setups.
:::
```
import cupy as cp
from numba import cuda

@cuda.jit
def add(x, y, out):
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, x.shape[0], stride):
        out[i] = x[i] + y[i]

a = cp.arange(10)
b = a * 2

out = cp.zeros_like(a)
print(out)  # => [0 0 0 0 0 0 0 0 0 0]

add[1, 32](a, b, out)
print(out)  # => [ 0  3  6  9 12 15 18 21 24 27]
```

In addition, `cupy.asarray()` supports zero-copy conversion from a Numba CUDA array to a CuPy array.
```
>>> import numpy as np
>>> import cupy as cp
>>> import numba
>>> x = np.arange(10)
>>> type(x)
<class 'numpy.ndarray'>
>>> x_numba = numba.cuda.to_device(x)
>>> type(x_numba)
<class 'numba.cuda.cudadrv.devicearray.DeviceNDArray'>
>>> x_cupy = cp.asarray(x_numba)
>>> type(x_cupy)
<class 'cupy.ndarray'>
```

### CPU/GPU agnostic code

When you begin porting code to the GPU, you have to
consider how to handle data that may live on either the CPU or the GPU.
CuPy's compatibility with NumPy/SciPy makes it possible to write CPU/GPU agnostic code.
For this purpose, CuPy implements the `cupy.get_array_module()` function, which
returns a reference to `cupy` if any of its arguments resides on a GPU and `numpy` otherwise.

Here is an example of a CPU/GPU agnostic function
(the name `xp` is a widely used convention in the community):

```
>>> import numpy as np
>>> import cupy as cp
>>> # define a simple function: f(x)=x+1
>>> def addone(x):
...     xp = cp.get_array_module(x)  # cupy if x is on the GPU, otherwise numpy
...     print("Using:", xp.__name__)
...     return x + 1
...
>>> # create an array and copy it to GPU
>>> a_cpu = np.arange(0, 20, 2)
>>> a_gpu = cp.asarray(a_cpu)
>>> # GPU/CPU agnostic code also works with CuPy
>>> print(addone(a_cpu))
Using: numpy
[ 1  3  5  7  9 11 13 15 17 19]
>>> print(addone(a_gpu))
Using: cupy
[ 1  3  5  7  9 11 13 15 17 19]
```
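Having both code paths available also makes it easy to compare performance,
as recommended in the keypoints below. A small sketch of our own using CuPy's
built-in `cupyx.profiler.benchmark` helper, which takes care of the GPU
synchronization that a naive `time.time()` measurement would miss:

```
import numpy as np
import cupy as cp
from cupyx.profiler import benchmark

def l2norm(x):
    xp = cp.get_array_module(x)  # cupy or numpy, depending on where x lives
    return xp.linalg.norm(x)

x_cpu = np.random.rand(2**20).astype(np.float32)
x_gpu = cp.asarray(x_cpu)

print(benchmark(l2norm, (x_cpu,), n_repeat=20))  # runs on the CPU
print(benchmark(l2norm, (x_gpu,), n_repeat=20))  # runs on the GPU
```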
## User-Defined Kernels

Sometimes you need a specific GPU function or routine
that is not provided by an existing library or tool.
In these situations, you need to write a "custom kernel",
i.e. a user-defined GPU kernel. Custom kernels written
with CuPy require only a small snippet of code,
and CuPy automatically wraps and compiles it.
Compiled binaries are then cached and reused in subsequent runs.


CuPy provides three templates for user-defined kernels:

- `cupy.ElementwiseKernel`: user-defined elementwise kernel
- `cupy.ReductionKernel`: user-defined reduction kernel
- `cupy.RawKernel`: user-defined CUDA/HIP kernel


### ElementwiseKernel

An elementwise kernel operates on arrays element by element.
It has four components:
 - input argument list
 - output argument list
 - function code
 - kernel name

The argument lists consist of comma-separated argument definitions.
Each argument definition consists of a type specifier and an argument name.
Names of NumPy data types can be used as type specifiers.

```
>>> my_kernel = cp.ElementwiseKernel(
...     'float32 x, float32 y',   # input arg list
...     'float32 z',              # output arg list
...     'z = (x - y) * (x - y)',  # function
...     'my_kernel')              # kernel name
```

In the first line, the kernel object is instantiated and bound to the name `my_kernel`.
The next lines declare the variables to be used as input (x and y) and output (z).
These variables can be typed with NumPy data types, as shown.
The function code then follows. The last line states the kernel name,
which is `my_kernel` in this case.

The above kernel can be called on either scalars or arrays,
since the ElementwiseKernel class does the indexing with broadcasting automatically:

```
>>> import cupy as cp
>>> # user-defined kernel
>>> my_kernel = cp.ElementwiseKernel(
...     'float32 x, float32 y',
...     'float32 z',
...     'z = (x - y) * (x - y)',
...     'my_kernel')
>>> # allocating arrays x and y
>>> x = cp.arange(10, dtype=cp.float32).reshape(2, 5)
>>> y = cp.arange(5, dtype=cp.float32)
>>> # launch the kernel
>>> my_kernel(x, y)
array([[ 0.,  0.,  0.,  0.,  0.],
       [25., 25., 25., 25., 25.]], dtype=float32)
>>> my_kernel(x, 5)
array([[25., 16.,  9.,  4.,  1.],
       [ 0.,  1.,  4.,  9., 16.]], dtype=float32)
```

Sometimes it would be nice to create a generic kernel that can handle multiple data types.
CuPy allows this with the use of a type placeholder.
The above `my_kernel` can be made type-generic as follows:

```
my_kernel_generic = cp.ElementwiseKernel(
    'T x, T y',
    'T z',
    'z = (x - y) * (x - y)',
    'my_kernel_generic')
```

If a type specifier is one character, T in this case,
it is treated as a **type placeholder**. The same character used
elsewhere in the kernel definition indicates the same type.
More than one type placeholder can be used in a kernel definition.
The actual type of these placeholders is determined
by the actual argument types. The ElementwiseKernel class
first checks the output arguments and then the input arguments
to determine the actual type. If no output arguments are given
on the kernel invocation, only the input arguments
are used to determine the type.

```
my_kernel_generic2 = cp.ElementwiseKernel(
    'X x, Y y',
    'Z z',
    'z = (x - y) * (x - y)',
    'my_kernel_generic2')
```

:::{note}
The above kernel, i.e. `my_kernel_generic2`,
requires the output argument to be explicitly specified,
because the type Z cannot be automatically determined from
the input arguments X and Y.
:::
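Whichever variant you choose, it is a good habit to validate a custom kernel
against the plain array expression it replaces. A small sketch of our own:

```
import cupy as cp

my_kernel = cp.ElementwiseKernel(
    'float32 x, float32 y', 'float32 z',
    'z = (x - y) * (x - y)', 'my_kernel')

x = cp.random.rand(1000).astype(cp.float32)
y = cp.random.rand(1000).astype(cp.float32)

# the kernel should agree with the equivalent array expression
assert cp.allclose(my_kernel(x, y), (x - y) ** 2)
```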
### ReductionKernel

The second type of CuPy custom kernel is the reduction kernel,
which is focused on kernels of the Map-Reduce type.
The ReductionKernel class has four extra parts:

 - Identity value: the initial value of the reduction
 - Mapping expression: pre-processes each element to be reduced
 - Reduction expression: an operator to reduce the multiple mapped values.
   Two special variables, a and b, are used for this operation
 - Post-mapping expression: transforms the resulting reduced values.
   The special variable a is used as input.
   The output should be written to the output variable

Here is an example that computes the L2 norm along a specified axis:
```
>>> import cupy as cp
>>> # user-defined kernel
>>> l2norm_kernel = cp.ReductionKernel(
...     'T x',          # input arg list
...     'T y',          # output arg list
...     'x * x',        # mapping expression
...     'a + b',        # reduction expression
...     'y = sqrt(a)',  # post-mapping expression
...     '0',            # identity value
...     'l2norm'        # kernel name
... )
>>> # allocating array
>>> x = cp.arange(10, dtype=cp.float32).reshape(2, 5)
>>> x
array([[0., 1., 2., 3., 4.],
       [5., 6., 7., 8., 9.]], dtype=float32)
>>> # kernel launch
>>> l2norm_kernel(x, axis=1)
array([ 5.477226 , 15.9687195], dtype=float32)
```

### RawKernel

The last one is the `RawKernel` class, which is used to define kernels from raw CUDA/HIP source code.

A `RawKernel` object allows you to launch the kernel through CUDA's `cuLaunchKernel` interface,
which gives you control over, e.g., grid size, block size, shared memory size, and stream.

```
>>> add_kernel = cp.RawKernel(r'''
... extern "C" __global__
... void my_add(const float* x1, const float* x2, float* y) {
...     int tid = blockDim.x * blockIdx.x + threadIdx.x;
...     y[tid] = x1[tid] + x2[tid];
... }
... ''', 'my_add')
>>> x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
>>> x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
>>> y = cp.zeros((5, 5), dtype=cp.float32)
>>> add_kernel((5,), (5,), (x1, x2, y))
>>> y
array([[ 0.,  2.,  4.,  6.,  8.],
       [10., 12., 14., 16., 18.],
       [20., 22., 24., 26., 28.],
       [30., 32., 34., 36., 38.],
       [40., 42., 44., 46., 48.]], dtype=float32)
```

:::{note}
The kernel does not have return values. You need to pass
both input arrays and output arrays as arguments.

When using printf() in your GPU kernel, you may need to
synchronize the stream to see the output.

The kernel is declared in an extern "C" block,
indicating that C linkage is used. This is to ensure
that the kernel names are not mangled, so that they can be retrieved by name.
:::


### `cupy.fuse` decorator

Apart from using the above templates for custom kernels,
CuPy provides the `cupy.fuse` decorator, which "fuses" a
function into a single kernel, thereby dramatically
lowering the kernel launch overhead.
Moreover, since the syntax resembles a Numba decorator, it is
much easier to define an elementwise or reduction kernel this way
than with the `ElementwiseKernel` or `ReductionKernel` templates.

Note, however, that it is still experimental, i.e. there may be bugs and
incomplete functionality yet to be fixed.

Here is an example using the `cupy.fuse()` decorator:

```
>>> import cupy as cp
>>> # adding the decorator to the function squared_diff
>>> @cp.fuse(kernel_name='squared_diff')
... def squared_diff(x, y):
...     return (x - y) * (x - y)
...
>>> # allocating x and y
>>> x = cp.arange(10, dtype=cp.float32)
>>> y = x[::-1]
>>> # call the function
>>> squared_diff(x, y)
array([81., 49., 25.,  9.,  1.,  1.,  9., 25., 49., 81.], dtype=float32)
```

### Low-level features

In addition to custom kernels, access to low-level CUDA/HIP features
is available for those who need more fine-grained control for performance:

- Stream and Event: CUDA streams and per-thread default streams are supported by all APIs
- Memory Pool: customizable memory allocator with a built-in memory pool
- Profiler: supports profiling code using the CUDA Profiler and NVTX
- Host API Binding: directly call CUDA libraries, such as
  NCCL, cuDNN, cuTENSOR, and cuSPARSELt APIs, from Python




## Summary

In this episode, we have learned about:

- CuPy basics
- Moving data between the CPU and GPU devices
- Different ways to launch GPU kernels



:::{keypoints}
- GPUs have massive computing power compared to CPUs
- CuPy is a good first step into GPU programming in Python
- CuPy provides an extensive collection of GPU array functions
- Always have both the CPU and GPU versions of your code available,
  so that you can compare performance as well as validate the results
- Fine-tuning for optimal performance of real-world applications can be tedious
:::


## See also

- [CuPy Homepage](https://docs.cupy.dev/en/stable/index.html)
- [GPU programming: When, Why and How?](https://enccs.github.io/gpu-programming)
- [CUDA Python from Nvidia](https://nvidia.github.io/cuda-python/latest)
- [CuPy Wiki page](https://en.wikipedia.org/wiki/CuPy)
- [Chainer Blog](https://chainer.org/announcement/2019/12/05/released-v7.html)