⚡ Optimizing PCSR #51
Replies: 7 comments 5 replies
-
Loop-unswitching within
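For context on the technique named above: loop unswitching hoists a condition that does not change across iterations out of the loop, so each branch becomes a tight loop without a per-iteration test. A generic illustration only (not the actual PCSR/STGraph code):

```cpp
// Generic loop-unswitching example: `use_weights` is loop-invariant,
// so it can be hoisted out, leaving two branch-free loops.
#include <vector>
#include <cstddef>

// Before: the branch is evaluated on every iteration.
double sum_before(const std::vector<double>& vals, const std::vector<double>& w, bool use_weights)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < vals.size(); ++i) {
        if (use_weights)
            sum += vals[i] * w[i];
        else
            sum += vals[i];
    }
    return sum;
}

// After: the loop is "unswitched" on the invariant condition.
double sum_after(const std::vector<double>& vals, const std::vector<double>& w, bool use_weights)
{
    double sum = 0.0;
    if (use_weights) {
        for (std::size_t i = 0; i < vals.size(); ++i)
            sum += vals[i] * w[i];
    } else {
        for (std::size_t i = 0; i < vals.size(); ++i)
            sum += vals[i];
    }
    return sum;
}
```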
-
Finally some results

Based on the results from the previous post in this discussion, we can see that almost 55% of the time is spent moving data from the host to the GPU. We tried to see how long these transfers actually take; the above inferences were made after running a few tests on Nithin's laptop, with the results as shown below. We can notice that if the PCSR data transfer time is reduced to match PyG-T, we should see comparable performance between Naive and PCSR. This is odd, because our previous benchmarks on Colab said 40% of the time went into updating the PCSR. This got us thinking: since PCSR is a CPU data structure, it depends on the capability of the CPU, so performance will differ from machine to machine. We decided to print the proportion of time taken for updates per epoch on Nithin's laptop (a sketch of this kind of measurement follows below). We believe the reason the update time per epoch is 40% of the total time on Google Colab is that its CPU has low compute and its resources are shared among multiple users. We can see that 40% of the total time is still spent on updates, but the total time is much lower because Nithin's CPU was dedicated to running this task at that moment.

Things to look into
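For reference, a minimal sketch of how the per-epoch split between PCSR updates (CPU side) and host-to-device transfer can be measured. The `pcsr.add_edges()` call is a hypothetical placeholder, not the actual STGraph/PCSR API; the timing pattern is the point.

```cpp
// Sketch: proportion of an epoch spent in CPU-side PCSR updates vs. host-to-device transfer.
#include <chrono>
#include <iostream>
#include <vector>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

int main()
{
    double update_ms = 0.0, transfer_ms = 0.0;

    auto epoch_start = Clock::now();

    // --- 1. PCSR update (CPU side) ---
    auto t0 = Clock::now();
    // pcsr.add_edges(edges_for_this_timestep);   // hypothetical update call
    auto t1 = Clock::now();
    update_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();

    // --- 2. Host-to-device transfer of the CSR arrays ---
    std::vector<int> row_offsets(1000001, 0);      // placeholder CSR array
    int* d_row_offsets = nullptr;
    cudaMalloc(&d_row_offsets, row_offsets.size() * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(d_row_offsets, row_offsets.data(),
               row_offsets.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    transfer_ms += ms;

    // --- 3. forward/backward pass would go here ---

    double epoch_ms = std::chrono::duration<double, std::milli>(Clock::now() - epoch_start).count();
    std::cout << "updates:  " << 100.0 * update_ms  / epoch_ms << "% of epoch\n";
    std::cout << "transfer: " << 100.0 * transfer_ms / epoch_ms << "% of epoch\n";

    cudaFree(d_row_offsets);
    return 0;
}
```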
Working Conclusion (needs to be verified)
-
Pinned Memory
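For background on the topic named above: CUDA copies from pinned (page-locked) host memory skip the extra staging copy that pageable memory requires, so host-to-device transfers are typically faster. A minimal, generic sketch comparing the two (illustrative only, not the PCSR code itself):

```cpp
// Sketch: timing host-to-device copies from pageable vs. pinned host memory.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

static float time_h2d(const int* src, int* dst, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t N = 100000000;
    const size_t bytes = N * sizeof(int);

    int* d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);

    // Pageable host memory (regular allocation).
    int* pageable = (int*)calloc(N, sizeof(int));
    // Pinned (page-locked) host memory.
    int* pinned = nullptr;
    cudaMallocHost(&pinned, bytes);
    std::memset(pinned, 0, bytes);

    std::printf("pageable H2D: %.2f ms\n", time_h2d(pageable, d_buf, bytes));
    std::printf("pinned   H2D: %.2f ms\n", time_h2d(pinned, d_buf, bytes));

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d_buf);
    return 0;
}
```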
-
🤔 Do we really need Thrust Vectors?

From the comments posted earlier in this discussion, we know that moving data from the CPU to the GPU is usually the bottleneck of CUDA programs involving large data. We have now witnessed that our PCSRGraph implementation takes quite some time to transfer data: compared to PyG-T, transfer time in PCSRGraph is 100 times slower. We needed to find a way to make transfers faster. We also learned that Thrust vectors take up quite some time when moving data in and out of the GPU; this has to do with pinned versus pageable memory.

Measuring Thrust data transfer performance

A basic CUDA program, optim.cu, was written:

```cpp
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/remove.h>
#include <thrust/sort.h>
#include <cub/cub.cuh>
#include <iostream>
int main()
{
    int N = 300000000;

    // Moving data from CPU to GPU using Thrust vectors
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    thrust::host_vector<int> h_vec_1(N, 1);
    thrust::device_vector<int> d_vec_1;

    // measuring time to move data from CPU to GPU
    cudaEventRecord(start);
    d_vec_1 = h_vec_1;
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    std::cout << "Using thrust (CPU to GPU): " << milliseconds << " ms" << std::endl;

    // moving data from GPU to CPU
    cudaEventRecord(start);
    h_vec_1 = d_vec_1;
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    std::cout << "Using thrust (GPU to CPU): " << milliseconds << " ms" << std::endl;
    // Moving data from CPU to GPU using regular CUDA methods
    int *a;     // The array on the host CPU machine
    int *dev_a; // The array for the GPU device

    cudaEvent_t start_og, stop_og;
    cudaEventCreate(&start_og);
    cudaEventCreate(&stop_og);

    // 2.a allocate the memory on the CPU and the GPU
    a = (int *)malloc(N * sizeof(int));
    cudaMalloc(&dev_a, N * sizeof(int));

    // 2.b. fill the array 'a' on the CPU with dummy values
    for (int i = 0; i < N; i++)
        a[i] = 1;

    cudaEventRecord(start_og);
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(stop_og);
    cudaEventSynchronize(stop_og);
    milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start_og, stop_og);
    std::cout << "\nWithout thrust (CPU to GPU): " << milliseconds << " ms" << std::endl;

    cudaEventRecord(start_og);
    cudaMemcpy(a, dev_a, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop_og);
    cudaEventSynchronize(stop_og);
    milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start_og, stop_og);
    std::cout << "Without thrust (GPU to CPU): " << milliseconds << " ms" << std::endl;
    cudaFree(dev_a);
    free(a);
    return 0;
}
```

This is what we benchmark using this program: the time to move data from the CPU to the GPU and back, first using Thrust vectors and then using plain cudaMemcpy on regular arrays.

Conclusion

Without Thrust, moving data from CPU to GPU is 160,000 times faster than with Thrust, while moving data from GPU to CPU is 14,000 times faster. Both are dramatically faster (assuming that my benchmarking methods are correct 🤞🏽). We chose Thrust because it made developing our CUDA code for PCSR convenient, but it seems to come with drawbacks such as a large amount of time spent on data transfers. Will try re-implementing a version of PCSR without using any thrust vectors and only regular C++ arrays/vectors.
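As a rough idea of what a thrust-free version could look like, here is a minimal sketch of a device buffer managed with plain cudaMalloc/cudaMemcpy behind a small RAII wrapper. `DeviceBuffer` and its methods are made-up names for illustration, not existing STGraph code.

```cpp
// Sketch of a thrust-free device buffer: raw cudaMalloc/cudaMemcpy behind a small RAII wrapper.
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

template <typename T>
struct DeviceBuffer {                 // hypothetical replacement for thrust::device_vector
    T* data = nullptr;
    std::size_t size = 0;

    explicit DeviceBuffer(std::size_t n) : size(n)
    {
        cudaMalloc(&data, n * sizeof(T));
    }
    ~DeviceBuffer() { cudaFree(data); }

    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;

    void copy_from_host(const T* host_ptr)
    {
        cudaMemcpy(data, host_ptr, size * sizeof(T), cudaMemcpyHostToDevice);
    }
    void copy_to_host(T* host_ptr) const
    {
        cudaMemcpy(host_ptr, data, size * sizeof(T), cudaMemcpyDeviceToHost);
    }
};

int main()
{
    std::vector<int> h(1000000, 1);   // regular C++ vector on the host
    DeviceBuffer<int> d(h.size());    // raw device allocation, no thrust
    d.copy_from_host(h.data());
    d.copy_to_host(h.data());
    return 0;
}
```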
-
Updates

Followed #53 and implemented similar logic for PCSR.

Weird Error

When building a dummy version of the PCSRGraph to account for context object space, the program is unable to deallocate that object. The error is:
-
@nithinmanoj10 closing this out, since we pushed PCSR to its limits but it wasn't able to surpass the benefits of using GPU-based data structures. Noting here that this was the starting point for integrating STGraph with dynamic graph data structures, and hence was a crucial part of this project.
-
Here we shall discuss various optimizing techniques used to make PCSR faster. Even a gain of 0.01 seconds is considered a win.