
Conversation

@michaelneuder
Collaborator

A few timing results

num threads, timing of loop 3
1, 0.92242285818792880
2, 0.66375376703217626
4, 0.66562559804879129
8, 0.67282426310703158

@michaelneuder
Collaborator Author

I should note that the timings above are for Nx=1600 and Ny=1275

@michaelneuder
Collaborator Author

Ok after coming back to this, now my timing results are looking much better! I wonder if this specific compute node I am on is better, or if I was doing something wrong before. Either way, we are now seeing some better speedups!

Here is a little table:

threads, timing, speedup
1, 0.52083734888583422, 1
2, 0.44693268812261522, 1.16
4, 0.37362689198926091, 1.39
8, 0.23520854488015175, 2.21

Generally, past this point, increasing the thread count isn't actually useful.
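That flattening is consistent with Amdahl's law: if a fraction p of the runtime is parallelized and that part scales ideally, the speedup on n threads is

```latex
S(n) = \frac{1}{(1 - p) + p/n},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}.
```

As a rough fit, the 8-thread result S ≈ 2.21 gives p ≈ 0.63, so no thread count can push this configuration much beyond 1/(1 − 0.63) ≈ 2.7 until more of the code is parallelized.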

@michaelneuder
Collaborator Author

michaelneuder commented Apr 15, 2021

Ok I am having trouble getting replicable results unfortunately. Now when I connect to the academic cluster, I get the performance below for the exact same experiment as the previous comment.

1, 0.51881068991497159, 1
2, 0.45359245501458645, 1.14
4, 0.57991017214953899, 0.89
8, 0.57748107495717704, 0.90

This is discouraging, as without replicability, we can't properly test or benchmark our implementations. So we need to sort this out ASAP.

@michaelneuder
Collaborator Author

Ok, some really great results! I switched to AWS to sanity-check what was happening with the performance and ran the same experiment with much better success. On a t2.2xlarge instance, I was able to get nearly perfect linear scaling for the loop 3 runtime!! Phew, this is a huge relief.

threads, timing, speedup
1, 0.44721634100005758, 1
2, 0.22189785400007622, 2.01
4, 0.11443732100008219, 3.91
8, 0.0568768589999, 7.86

Even though this is a single loop, speeding it up dramatically impacts the overall runtime of the code. For a single time step in serial, we have a runtime of 3.93 (s), but with 8 threads I am getting 2.33 (s), which is a 1.69x speedup! Pretty awesome for a single parallel loop.

@michaelneuder
Collaborator Author

I am also parallelizing loop 5 in the calc_explicit call. When using 8 threads, the timing drops from 0.11551769900006548 to 0.015365126999995482, a 7.5x speedup. This brings the overall timing of a single iteration down to 1.98 (s), which gives an overall speedup of 3.93/1.98 = 1.98x for the entire code!

@michaelneuder
Collaborator Author

Running for 50 iterations in serial we have the following timing output

real 3m34.483s
user 3m29.935s
sys 0m3.846s

and for parallel

real 1m51.155s
user 3m38.685s
sys 0m0.887s

So we are seeing a 214 (s) / 111 (s) = 1.93x speedup. I also verified correctness by checking the Nusselt numbers, and they are identical for both runs.

@michaelneuder
Collaborator Author

Ok, I made each loop in calc_explicit parallel and got the overall runtime down to

real 1m39.510s
user 3m46.794s
sys 0m0.849s

So we have 214 (s) / 99 (s) = 2.16x speedup!

@michaelneuder
Collaborator Author

michaelneuder commented Apr 15, 2021

OK, I am trying to parallelize other parts of the main loop besides calc_explicit, and am running into some weird behavior. It can be boiled down to the example below.

   !$OMP PARALLEL DO num_threads(8) private(tmp_uy) schedule(dynamic)
   do it = 1,Nx
      !$OMP CRITICAL
      ! Solve for v
      call calc_vi(tmp_uy, phi(:,it))
      uy(:,it) = tmp_uy
      ! Solve for u
      if (kx(it) /= 0.0_dp) then
         !ux(:,it) = -CI*d1y(tmp_uy)/kx(it)
         ux(:,it) = CI*d1y(tmp_uy)/kx(it)
      else if (kx(it) == 0.0_dp) then
         ux(:,it) = cmplx(0.0_dp, 0.0_dp, kind=C_DOUBLE_COMPLEX) ! Zero mean flow!
      end if
      !$OMP END CRITICAL
   end do
   !$OMP END PARALLEL DO

This is my sanity check for the loop iterations being independent: each iteration is wrapped in a critical region, so the iterations run in a random order, but only one at a time. Yet this actually breaks the code, and the Nusselt number quickly explodes into a NaN. I believe this means the loop iterations are not independent, but I can't quite work out why. It seems like tmp_uy should be the only private variable. I believe the problem is coming from the calc_vi call, because if I make phi a thread-private variable, the code doesn't explode, but the Nusselt number is off by a small amount, which makes sense because phi really should be a shared variable. But since phi is only accessed at the phi(:,it) slice, it seems like each iteration should be independent?

@michaelneuder
Collaborator Author

I added a lot more parallelism today, including in the x direction of stages 1-3. We are seeing very good performance results:

with Nx=4800 Ny=3825 on m4.4xlarge (16 cores) running 16 threads
serial
overall timing: 46.162012183000115 (s)
parallel
overall timing: 5.7642300109998814 (s)
speedup = 8.008x
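As a rough sanity check, inverting Amdahl's law S(n) = 1/(f + (1 − f)/n) for the remaining serial fraction f (assuming the parallel part scales ideally):

```latex
f = \frac{n/S - 1}{n - 1} = \frac{16/8.008 - 1}{16 - 1} \approx 0.067
```

So roughly 7% of the runtime still behaves as serial; shrinking that remainder is where additional threads would pay off next.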

@michaelneuder michaelneuder changed the title One level of OMP parallelize in calc_explicit. OpenMP Apr 30, 2021