A cuBLAS/CUDA-based implementation of multi-GPU large matrix multiplication. It is a standalone C/C++ command-line application that launches large matrix-matrix multiplications and produces profiled outputs on a GPU cluster. Thanks to its lightweight codebase, it can easily be turned into a C/C++ library.
- CUDA - A parallel computing platform and programming model developed by NVIDIA for GPU-accelerated computing.
- cuBLAS - The NVIDIA CUDA Basic Linear Algebra Subprograms (cuBLAS) library for efficient GPU-accelerated linear algebra operations.
- Dependencies
- Environment
- Important Files
- Installation
- Documentation
- Running the Application
- Available Options
The LargeMM application relies on the following dependencies:
| Dependency | Version |
|---|---|
| CUDA | 11.6.1+ |
| GCC | 10.3.0+ |
| CMake | 3.24.2+ |
CUDA modules should be loaded prior to compilation or execution.
This application is designed to run on one to four Tesla V100 SXM2 GPUs. The default environment is a GPU node on Gadi.
- `data` folder stores performance data of LargeMM.
- `profile` folder stores profiler timeline files for the performance of `v2_ngpus_reduction`, `v1_1_n_streams`, and `base_cublasDgemm`.
- `test` folder stores tests for `v2_ngpus_reduction`, `v1_1_n_streams`, and `base_cublasDgemm`.
- Clone the repository into your workspace and navigate to the project directory:

  ```shell
  git clone https://github.com/Zlisch/LargeMM.git
  cd LargeMM
  ```

- Run the installation script:

  ```shell
  chmod +x ./INSTALL.sh
  ./INSTALL.sh
  ```
Alternatively, you can download the latest executable directly from the link.
You can either view the documentation in the header files of the cloned repository or, if you are using Visual Studio Code, browse the generated HTML documentation:

- Install the Live Server extension in Visual Studio Code. To enable Live Server, press `cmd+shift+p` in Visual Studio Code, type `live server` in the prompt, and select `Open with Live Server`.
- With the Live Server extension enabled, enter `http://127.0.0.1:5500/docs/html/globals.html` in your browser to view the documentation.
After running `./INSTALL.sh`, use the following command to run `v2_ngpus_reduction` with the lookup table on 4 GPUs and print the output:

```shell
./bin/largemm -s "-1" -m 28377 -a 2 -g 4
```

To run LargeMM under the NVIDIA Nsight Systems profiler, use:

```shell
nsys profile --stats=true ./bin/largemm -s "-1" -m 28377 -a 2 -g 4
```

Alternatively, you can build your own run script; a template is provided in `./run.sh`.
`-s`

- Description: Specify the stream stride (the square root of the number of streams to be used) for each GPU. If `-1` is given, a lookup table is used instead to decide the number of streams for each GPU.
- Example: Run `v2_ngpus_reduction` with 9 streams per GPU on 4 GPUs and print the output:

  ```shell
  ./bin/largemm -s 3 -m 28377 -a 2 -g 4
  ```

`-a`
- Description: Specify the algorithm to run.
| Value | Algorithm Version |
|---|---|
| 0 | base_cublasDgemm |
| 1 | v1_1_n_streams |
| 2 | v2_ngpus_reduction |
| 3 | v2_ngpus_parallel_a |
| 4 | v2_ngpus_parallel_a_n_streams_breadth |
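The variant names above suggest different multi-GPU work decompositions. As a purely conceptual sketch (an assumption drawn from the names, not taken from the project's source), splitting the rows of A across the available devices, computing each block product locally, and gathering the row blocks reproduces the full product:

```python
import numpy as np

def split_matmul(A, B, n_gpus):
    """Hypothetical illustration only: emulate a row-wise split of A across
    n_gpus devices, as a multi-GPU C = A @ B might partition the work.
    The actual partitioning used by LargeMM's variants is defined in its source."""
    row_chunks = np.array_split(A, n_gpus, axis=0)  # one chunk per "GPU"
    partials = [chunk @ B for chunk in row_chunks]  # each device's local dgemm
    return np.vstack(partials)                      # gather the row blocks

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
B = rng.standard_normal((6, 5))
assert np.allclose(split_matmul(A, B, 4), A @ B)
```

In the real application each chunk would live on a separate device and the per-chunk products would be cuBLAS `Dgemm` calls overlapped across CUDA streams.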
`-m`
- Description: Row dimension of the matrix.
- Example: Run `v2_ngpus_reduction` on a 6 GB square matrix (row dimension 28377 in double precision):

  ```shell
  ./bin/largemm -s 3 -m 28377 -a 2 -g 4
  ```

`-g`
- Description: Specify the number of GPUs to use. Cannot be zero.
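The 6 GB figure quoted for `-m 28377` can be sanity-checked with a quick calculation: a square double-precision matrix with row dimension m occupies m² × 8 bytes.

```python
# Size of a square double-precision matrix with row dimension m = 28377.
m = 28377
bytes_per_double = 8
size_gb = m * m * bytes_per_double / 1024**3  # bytes -> GiB
print(f"{size_gb:.2f} GB")  # prints "6.00 GB"
```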