⚡ Optimizing PCSR #51

nithinmanoj10 · 2023-05-24T08:05:09Z

nithinmanoj10
May 24, 2023
Maintainer

Here we shall discuss various optimizing techniques used to make PCSR faster. Even a gain by 0.01 seconds is considered a win.

nithinmanoj10 · 2023-05-24T08:36:47Z

nithinmanoj10
May 24, 2023
Maintainer Author

Loop-unswitching within `pcsr.edge_update_list`

While exploring techniques for loop optimizations, I came across an approach known as loop-unswitching. The concept was presented in an article titled Loop Optimizations: Taking Matters into Your Hands.

To illustrate the technique, consider the following code snippet:

for (int i = 0; i < n; i++) {
    switch (operation) {
        case ADD: a[i] += x * x; break;
        case SUB: a[i] -= x * x; break;
    }
}

It can be restructured using loop-unswitching as follows:

auto x_2 = x * x;
if (operation == ADD) {
    for (int i = 0; i < n; i++) {
        a[i] += x_2;
    }
} else if (operation == SUB) {
    for (int i = 0; i < n; i++) {
        a[i] -= x_2;
    }
}

Applying loop-unswitching to `pcsr.edge_update_list`

I attempted to apply the loop-unswitching technique to the pcsr.edge_update_list method. Here is the original code:

void PCSR::edge_update_list(std::vector<std::tuple<uint32_t, uint32_t>> edge_list, bool is_delete = false, bool is_reverse_edge = false)
{
    for (auto &edge : edge_list)
    {
        uint32_t src = (is_reverse_edge == true) ? std::get<1>(edge) : std::get<0>(edge);
        uint32_t dst = (is_reverse_edge == true) ? std::get<0>(edge) : std::get<1>(edge);

        if (is_delete)
            delete_edge(src, dst);
        else
            add_edge(src, dst, 1);
    }
}

And here is the version with loop-unswitching applied:

void PCSR::edge_update_list_optm(std::vector<std::tuple<uint32_t, uint32_t>> edge_list, bool is_delete = false, bool is_reverse_edge = false)
{
    // Applying loop unswitching for-loop optimization based on conditions: (is_delete, is_reverse_edge)
    if (is_delete == false && is_reverse_edge == false)
    {
        for (auto &edge : edge_list)
        {
            uint32_t src = std::get<0>(edge);
            uint32_t dst = std::get<1>(edge);

            add_edge(src, dst, 1);
        }
        return;
    }

    if (is_delete == false && is_reverse_edge == true)
    {
        for (auto &edge : edge_list)
        {
            uint32_t src = std::get<1>(edge);
            uint32_t dst = std::get<0>(edge);

            add_edge(src, dst, 1);
        }
        return;
    }

    if (is_delete == true && is_reverse_edge == false)
    {
        for (auto &edge : edge_list)
        {
            uint32_t src = std::get<0>(edge);
            uint32_t dst = std::get<1>(edge);

            delete_edge(src, dst);
        }
        return;
    }

    if (is_delete == true && is_reverse_edge == true)
    {
        for (auto &edge : edge_list)
        {
            uint32_t src = std::get<1>(edge);
            uint32_t dst = std::get<0>(edge);

            delete_edge(src, dst);
        }
        return;
    }
}

Results

Upon observation, there was no noticeable improvement in the updates per second metric. In fact, the performance seemed to be worse. This outcome is intriguing, and I intend to investigate the underlying reasons at a later stage.

Local Variable Optimization

The aforementioned article also mentioned the practice of assigning values of global variables used within if-else condition statements to local variables. To explore this concept, I applied it to the parameters is_reverse_edge and is_delete by introducing local variables is_reverse_edge_local and is_delete_local, respectively.

Here's the modified code:

void PCSR::edge_update_list(std::vector<std::tuple<uint32_t, uint32_t>> edge_list, bool is_delete = false, bool is_reverse_edge = false)
{
    bool is_reverse_edge_local = is_reverse_edge;
    bool is_delete_local = is_delete;

    for (auto &edge : edge_list)
    {
        uint32_t src = (is_reverse_edge_local == true) ? std::get<1>(edge) : std::get<0>(edge);
        uint32_t dst = (is_reverse_edge_local == true) ? std::get<0>(edge) : std::get<1>(edge);

        if (is_delete_local)
            delete_edge(src, dst);
        else
            add_edge(src, dst, 1);
    }
}

Results

I conducted seven runs of both the local-variable-optimized version and the original version of PCSR. I recorded the "updates per second" metric each time and calculated the average values.

Average updates per second for the Local Variable Optimized Version: 695,443
Average updates per second for the Old Version: 685,610

The optimized version exhibited approximately a 1.41% increase in updates per second compared to the older version.

Conclusion

In the case of loop-unswitching, there was no significant enhancement in the updates per second metric. However, by employing the local variable assignment approach, we observed a 1.41% increase in updates per second.

0 replies

JoelMathewC · 2023-05-24T13:39:11Z

JoelMathewC
May 24, 2023
Maintainer

Looking at the evidence

PCSR

We can see that of the total 40% of the time is going in insertions. But that still begs to ask where the rest of the time is going in. Because Naive and PCSR differed in only one aspect and that was the insertions. We will need to understand where the extra time is going in.

1 reply

JoelMathewC May 24, 2023
Maintainer

Lead 1

There is a possibility that the excess time is spent in moving data to the GPU and back. Let us verify the proportion of time spent in the process.

Looks like 55% of the total time is spent in moving the CSR arrays onto the GPU. 40% in insertion and the rest 5% in the actual GNN processing and that does cause the math to add up!

JoelMathewC · 2023-05-24T19:51:28Z

JoelMathewC
May 24, 2023
Maintainer

Finally some results

Based on the results from the previous post in this discussion we can see that almost 55% of the time is spent in moving data from the host to the GPU. We tried to see how long pytorch took to synchronously move data from CPU to GPU. We saw that it took only a fraction of time that was being taken by thrust. After doing a bit of research we came across the following reference which explained that thrust used paged memory transfer which was much slower than the alternative pinned memory transfer. More can be found on this NVIDIA blog.

The above inferences were made after running a few tests on Nithins laptop, the results are as shown below

python3 dynamic_bench.py --max_feat_size 16 --max_num_nodes 55000 --num_epochs 10

We can notice here that if the PCSR data transfer time is reduced to match PyG-T. We should be having parallel performance between Naive and PCSR. This is odd because our previous benchmarks on colab said 40% of the time went in updating the PCSR. This got us thinking that since PCSR is a CPU datastructure it depends on the capability of the CPU. Hence the performance will differ from machine to machine. We decided to print the proportion of time taken for updates per epoch on Nithin's laptop. We believe that the reason why the time for updates per epoch is 40% of total time in google colab is because the CPU has low compute and its resources are being shared by multiple individuals.

We can see that 40% of the total time is still being used for updates, but the total time is much lesser because Nithin has a CPU that at the moment was dedicated to running this task.

Things to look into

We need to see why PCSR is able to run the GNN compute is 0.06sec but Naive is taking 0.13sec
But we can notice that in the first picture, PCSR seems to have taken the same amount of time as colab so this sorta gets us thinking whats influencing the results. Is it the amount of background processes?

Working Conclusion (Need to be verified)

If the Host to Device transfer can be speeded up PCSR might have parallel performance with Naive
The time taken for edge updates in PCSR for a proper powerful CPU is negligible

1 reply

JoelMathewC May 25, 2023
Maintainer

Attempt 1

I attempted to move data to the GPU in PCSR using pytorch and ended up with the same results. So we will need to read more into pinned memory to get an understanding of how that works. I'm starting to have the feeling that PyG performing so quickly was probably no because of pinned memory because we never explicitly set it to be pinned. Maybe some other optimization is at work there.

JoelMathewC · 2023-05-26T23:49:33Z

JoelMathewC
May 26, 2023
Maintainer

Pinned Memory

1 reply

JoelMathewC May 26, 2023
Maintainer

The response was found on this forum indicates pinning of memory might have disadvantages we may have to look into.

nithinmanoj10 · 2023-05-27T16:35:10Z

nithinmanoj10
May 27, 2023
Maintainer Author

🤔 Do we really need Thrust Vectors?

From the comments posted earlier in this discussion, we know that moving data from CPU to GPU is usually the bottleneck of CUDA programs with large data involved. Now we have witnessed in our PCSRGraph implementation, it is taking quite some time to transfer data. Compared to PyG-T, transfer time in PCSRGraph is 100 times slower. We needed to find a way to make transfer time faster.

We also learned that thrust vectors take up quite some time when trying to transfer data in and out of the GPU. It had to do something with regards to pinned and pageable memory.

Measuring Thrust data transfer performance

A basic CUDA program was written

optim.cu

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/remove.h>
#include <thrust/sort.h>
#include <cub/cub.cuh>

int main()
{
    int N = 300000000;

    // Moving data from CPU to GPU using Thrust vectors
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    thrust::host_vector<int> h_vec_1(N, 1);
    thrust::device_vector<int> d_vec_1;

    // measuring time to move data from CPU to GPU
    cudaEventRecord(start);
    d_vec_1 = h_vec_1;
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);

    std::cout << "Using thrust (CPU to GPU): " << milliseconds << " ms" << std::endl;

    // moving data from GPU to CPU
    cudaEventRecord(start);
    h_vec_1 = d_vec_1;
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);

    std::cout << "Using thrust (GPU to CPU): " << milliseconds << " ms" << std::endl;

    // Moving data from CPU to GPU using regular CUDA methods
    int *a;     // The arrays on the host CPU machine
    int *dev_a; // The arrays for the GPU device

    // 2.a allocate the memory on the CPU
    cudaEvent_t start_og, stop_og;
    cudaEventCreate(&start_og);
    cudaEventCreate(&stop_og);

    a = (int *)malloc(N * sizeof(int));

    // 2.b. fill the array 'a' on the CPU with dummy values
    for (int i = 0; i < N; i++)
        a[i] = 1;

    cudaEventRecord(start_og);
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(stop_og);

    cudaEventSynchronize(stop_og);
    milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start_og, stop_og);

    std::cout << "\nWithout thrust (CPU to GPU): " << milliseconds << " ms" << std::endl;

    cudaEventRecord(start_og);
    cudaMemcpy(a, dev_a, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop_og);

    cudaEventSynchronize(stop_og);
    milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start_og, stop_og);

    std::cout << "Without thrust (GPU to CPU): " << milliseconds << " ms" << std::endl;
}

This is what we benchmark using this program

1. We initialize a thrust host_vector of size 300 million, with each value set to 1.
2. Measure the time taken to move this thrust vector from host to device (CPU to GPU)
3. Measure the time taken to move a device vector from device to host (GPU to CPU)

We also measure the following

1. We initialize a C++ array of size 300 million, with each value set to 1.
2. Measure the time taken to move this array from host to device (CPU to GPU)
3. Measure the time taken to move an array from device to host (GPU to CPU)

These were the results

Conclusion

Without thrust, moving data from CPU to GPU is 160,000 times faster than using thrust! While moving data from GPU to CPU is 14,000 times faster. Both are incredibly faster (assuming that my benchmarking methods are correct 🤞🏽).

We chose thrust because it was convenient developing our CUDA code for PCSR. But I guess it comes with drawbacks such as huge time taken for data transfers. Will try re-implementing a version of PCSR without using any thrust vectors and only regular C++ arrays/vectors.

1 reply

nithinmanoj10 May 27, 2023
Maintainer Author

🎉 There is progress

TL;DR

Was able to make PCSR overall 1.25 times faster than previous implementation

Replacing thrust vectors

As discussed in the above comment, replaced the occurence of thrust vectors in the PCSR class and the get_csr_ptrs function. Not getting much into it in detail right now. Will do so when committing all the changes and finalizing on the optimizations after benchmarking. For now here are the results

Results

Without thrust vectors

With thrust vectors

Conclusion

We can notice that when we reach feature size of 16, PCSR is just of the same performance level as PyG-T.
There is an improvement of GPU move time.
The overall GPU move time compared to PyG-T is still worse. Need to check on that.

JoelMathewC · 2023-06-21T22:58:18Z

JoelMathewC
Jun 21, 2023
Maintainer

Updates

Followed #53 and implemented similar logic for PCSR.

Weird Error

When building a dummy version of the PCSRGraph to account for context object space, the program is unable to deallocate that object.

Graph = PCSRGraph([[(0,0)]],1);     // FAILS
Graph = PCSRGraph([[(0,1)]],2);     // SUCCEEDS

The error is

free(): invalid next size (fast)

1 reply

JoelMathewC Nov 4, 2023
Maintainer

This was resolved.

JoelMathewC · 2023-11-04T03:48:43Z

JoelMathewC
Nov 4, 2023
Maintainer

@nithinmanoj10 closing this out, since we pushed PCSR to its limits but it wasn't able to surpass the benefits of using GPU based data structures. Noting here that this was the starting point to integration of STGraph with dynamic graph data structures and hence was a crucial part of this project.

0 replies

⚡ Optimizing PCSR #51

Uh oh!

nithinmanoj10 May 24, 2023 Maintainer

Replies: 7 comments · 5 replies

Uh oh!

nithinmanoj10 May 24, 2023 Maintainer Author

Loop-unswitching within pcsr.edge_update_list

Applying loop-unswitching to pcsr.edge_update_list

Results

Local Variable Optimization

Results

Conclusion

Uh oh!

JoelMathewC May 24, 2023 Maintainer

Looking at the evidence

PCSR

Uh oh!

Uh oh!

JoelMathewC May 24, 2023 Maintainer

Lead 1

Uh oh!

Uh oh!

JoelMathewC May 24, 2023 Maintainer

Finally some results

Things to look into

Working Conclusion (Need to be verified)

Uh oh!

JoelMathewC May 25, 2023 Maintainer

Attempt 1

Uh oh!

JoelMathewC May 26, 2023 Maintainer

Pinned Memory

Uh oh!

JoelMathewC May 26, 2023 Maintainer

Uh oh!

nithinmanoj10 May 27, 2023 Maintainer Author

🤔 Do we really need Thrust Vectors?

Measuring Thrust data transfer performance

Conclusion

Uh oh!

nithinmanoj10 May 27, 2023 Maintainer Author

🎉 There is progress

TL;DR

Replacing thrust vectors

Results

Conclusion

Uh oh!

Uh oh!

JoelMathewC Jun 21, 2023 Maintainer

Updates

Weird Error

Uh oh!

JoelMathewC Nov 4, 2023 Maintainer

Uh oh!

JoelMathewC Nov 4, 2023 Maintainer

nithinmanoj10
May 24, 2023
Maintainer

Replies: 7 comments 5 replies

nithinmanoj10
May 24, 2023
Maintainer Author

Loop-unswitching within `pcsr.edge_update_list`

Applying loop-unswitching to `pcsr.edge_update_list`

JoelMathewC
May 24, 2023
Maintainer

JoelMathewC May 24, 2023
Maintainer

JoelMathewC
May 24, 2023
Maintainer

JoelMathewC May 25, 2023
Maintainer

JoelMathewC
May 26, 2023
Maintainer

JoelMathewC May 26, 2023
Maintainer

nithinmanoj10
May 27, 2023
Maintainer Author

nithinmanoj10 May 27, 2023
Maintainer Author

JoelMathewC
Jun 21, 2023
Maintainer

JoelMathewC Nov 4, 2023
Maintainer

JoelMathewC
Nov 4, 2023
Maintainer