OpenMP #10
base: nusselt
Conversation
|
I should note that the timings above are for Nx=1600 and Ny=1275.
|
Ok, after coming back to this, my timing results are now looking much better! I wonder if this specific compute node I am on is better, or if I was doing something wrong before. Either way, we are now seeing some better speedups! Here is a little table:

threads, timing, speedup

Generally, past this point, increasing the thread count isn't actually useful.
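For reference, a minimal sketch of how a per-loop timing like this can be collected with omp_get_wtime; the loop body, array names, and sizes below are placeholders rather than the actual loop 3:

program time_loop
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: Nx = 1600, Ny = 1275
  real(dp), allocatable :: f(:,:), g(:,:)
  real(dp) :: t0, t1
  integer :: i, j

  allocate(f(Ny,Nx), g(Ny,Nx))
  call random_number(f)

  t0 = omp_get_wtime()
  !$OMP PARALLEL DO private(j)
  do i = 1, Nx
     do j = 1, Ny
        g(j,i) = 2.0_dp*f(j,i)   ! placeholder work standing in for loop 3
     end do
  end do
  !$OMP END PARALLEL DO
  t1 = omp_get_wtime()

  ! checksum keeps the compiler from discarding the loop
  print *, 'threads =', omp_get_max_threads(), ' timing =', t1 - t0, ' sum =', sum(g)
end program time_loop

The speedup column is then just the 1-thread timing divided by the N-thread timing, with the thread count set through OMP_NUM_THREADS or a num_threads clause.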
|
Ok, I am having trouble getting reproducible results, unfortunately. Now when I connect to the academic cluster, I get the performance below for the exact same experiment as the previous comment (threads, timing, speedup):

1, 0.51881068991497159, 1

This is discouraging, because without reproducibility we can't properly test or benchmark our implementations, so we need to sort this out ASAP.
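Part of the variation on a shared cluster node may simply be other jobs, but thread placement can also differ between runs. A small sketch (standard omp_lib calls only) for logging the runtime environment next to each measurement, so numbers taken on different nodes can at least be compared; pinning threads via OMP_PROC_BIND / OMP_PLACES (or a proc_bind clause) is another thing that sometimes reduces run-to-run noise:

program env_check
  use omp_lib
  implicit none
  ! Record the OpenMP runtime settings alongside each timing run.
  print *, 'max threads :', omp_get_max_threads()
  print *, 'num procs   :', omp_get_num_procs()
  print *, 'dynamic?    :', omp_get_dynamic()
  print *, 'proc bind   :', omp_get_proc_bind()
end program env_check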
|
Ok, some really great results! I switched to AWS to sanity-check what was happening with the performance and ran the same experiment with much better success. On a t2.2xlarge instance, I was able to get nearly perfect linear scaling for the loop 3 runtime!! Phew, this is a huge relief.

threads, timing, speedup

Even though this is a single loop, speeding it up dramatically impacts the overall runtime of the code. For a single time step in serial we have a runtime of 3.93 s, but with 8 threads I am getting 2.33 s, which is a 1.69x speedup! Pretty awesome for a single parallel loop.
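As a rough consistency check (taking loop 3 itself to scale close to 8x at 8 threads, as the scaling above suggests), the overall numbers imply that this loop is roughly half of the serial time step. With T_1 = 3.93 s, T_8 = 2.33 s, and p the fraction of the time step spent in loop 3:

$$
T_8 \approx (1-p)\,T_1 + \frac{p\,T_1}{8}
\;\Rightarrow\;
p \approx \frac{8}{7}\left(1 - \frac{T_8}{T_1}\right)
= \frac{8}{7}\left(1 - \frac{2.33}{3.93}\right) \approx 0.47,
$$

so the remaining ~53% of the time step is why the overall speedup is 1.69x rather than 8x.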
|
I am also parallelizing loop 5 in the calc_explicit call. With 8 threads, the timing for that loop drops from 0.11551769900006548 to 0.015365126999995482, a 7.5x speedup, and the overall timing of a single iteration drops to 1.98 s, which gives an overall speedup of 3.93/1.98 = 1.98x for the entire code!
|
Running for 50 iterations in serial, we have the following timing output: real 3m34.483s. For parallel: real 1m51.155s. So we are seeing a 214 s / 111 s = 1.93x speedup. I also verified correctness by checking the Nusselt numbers, and they are identical for both runs.
|
Ok, I made each loop in calc_explicit parallel and got the overall runtime down to real 1m39.510s. So we have a 214 s / 99 s = 2.16x speedup!
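One way such a set of loops can be structured (not necessarily how it is done in this branch, and the array names are placeholders) is to let several independent loops share a single PARALLEL region, so the thread team is created once per call to calc_explicit rather than once per loop; the NOWAIT is only safe because the two loops write different arrays:

!$OMP PARALLEL private(i, j)
!$OMP DO schedule(static)
do i = 1, Nx
   do j = 1, Ny
      expl_T(j,i) = adv_T(j,i) + diff_T(j,i)     ! placeholder explicit terms
   end do
end do
!$OMP END DO NOWAIT
!$OMP DO schedule(static)
do i = 1, Nx
   do j = 1, Ny
      expl_phi(j,i) = adv_phi(j,i) + diff_phi(j,i)
   end do
end do
!$OMP END DO
!$OMP END PARALLEL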
|
OK, I am trying to parallelize other parts of the main loop besides calc_explicit, and am running into some weird behavior. It can be boiled down to the example below.

!$OMP PARALLEL DO num_threads(8) private(tmp_uy) schedule(dynamic)
do it = 1,Nx
   !$OMP CRITICAL
   ! Solve for v
   call calc_vi(tmp_uy, phi(:,it))
   uy(:,it) = tmp_uy
   ! Solve for u
   if (kx(it) /= 0.0_dp) then
      !ux(:,it) = -CI*d1y(tmp_uy)/kx(it)
      ux(:,it) = CI*d1y(tmp_uy)/kx(it)
   else if (kx(it) == 0.0_dp) then
      ux(:,it) = cmplx(0.0_dp, 0.0_dp, kind=C_DOUBLE_COMPLEX) ! Zero mean flow!
   end if
   !$OMP END CRITICAL
end do
!$OMP END PARALLEL DO

This is my sanity check for the loop iterations being independent: each of them is wrapped in a critical region, so they will run in a random order but only one at a time. Yet this actually breaks the code, and the Nusselt number quickly explodes into a NaN. I believe this means the loop iterations are not independent, but I can't quite make out why. It seems like …
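Two guesses about what could make the iterations order-dependent even though CRITICAL guarantees they never overlap (guesses only, since calc_vi isn't shown here): (1) if tmp_uy is intent(inout) and the solve uses its incoming values, the private copies start out undefined; (2) if calc_vi (or d1y) keeps state between calls, e.g. a SAVEd or module-level work array, then each column's result feeds into whichever column happens to be solved next. A hypothetical illustration of the second case, not the real calc_vi:

! Hypothetical sketch: a SAVEd work array makes each call depend on the
! previous one, so iteration *order* matters even though CRITICAL ensures
! the calls never overlap.
subroutine calc_vi(v, rhs)
  use, intrinsic :: iso_c_binding, only: C_DOUBLE, C_DOUBLE_COMPLEX
  complex(C_DOUBLE_COMPLEX), intent(out) :: v(:)
  complex(C_DOUBLE_COMPLEX), intent(in)  :: rhs(:)
  complex(C_DOUBLE_COMPLEX), allocatable, save :: work(:)   ! persists across calls
  if (.not. allocated(work)) then
     allocate(work(size(v)))
     work = cmplx(0.0_C_DOUBLE, 0.0_C_DOUBLE, kind=C_DOUBLE_COMPLEX)
  end if
  v    = rhs + work   ! stand-in for the real solve: output depends on 'work'
  work = v            ! carried into the *next* call, whichever column that is
end subroutine calc_vi

One way to separate the two guesses: replace CRITICAL with an ORDERED region (and add an ordered clause to the PARALLEL DO directive). That keeps the original iteration order while still using threads; if the NaN goes away, the problem is carried state rather than the privatization of tmp_uy.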
|
I added a lot more parallelism today, including in the x direction of stages 1-3. We are seeing great performance results with Nx=4800, Ny=3825 on an m4.4xlarge instance (16 cores) running 16 threads.
A few timing results:

num threads, timing of loop 3 (s)
1, 0.92242285818792880
2, 0.66375376703217626
4, 0.66562559804879129
8, 0.67282426310703158