
Conversation

@michaelneuder
Collaborator

A few timing results

num threads, timing of loop 3
1, 0.92242285818792880
2, 0.66375376703217626
4, 0.66562559804879129
8, 0.67282426310703158

@michaelneuder
Collaborator Author

I should note that the timings above are for Nx=1600 and Ny=1275

@michaelneuder
Collaborator Author

Ok after coming back to this, now my timing results are looking much better! I wonder if this specific compute node I am on is better, or if I was doing something wrong before. Either way, we are now seeing some better speedups!

Here is a little table:

threads, timing, speedup
1, 0.52083734888583422, 1
2, 0.44693268812261522, 1.16
4, 0.37362689198926091, 1.39
8, 0.23520854488015175, 2.21

Generally, past this point, increasing the thread count isn't actually useful.
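That flattening is consistent with Amdahl's law: if a fraction p of the runtime is parallelized and that part scales ideally, the speedup on n threads is

```latex
S(n) = \frac{1}{(1 - p) + p/n},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}.
```

As a rough fit, the 8-thread result S ≈ 2.21 gives p ≈ 0.63, so no thread count can push this configuration much beyond 1/(1 − 0.63) ≈ 2.7 until more of the code is parallelized.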

@michaelneuder
Collaborator Author

michaelneuder commented Apr 15, 2021

Ok I am having trouble getting replicable results unfortunately. Now when I connect to the academic cluster, I get the performance below for the exact same experiment as the previous comment.

1, 0.51881068991497159, 1
2, 0.45359245501458645, 1.14
4, 0.57991017214953899, 0.89
8, 0.57748107495717704, 0.90

This is discouraging, as without replicability, we can't properly test or benchmark our implementations. So we need to sort this out ASAP.

@michaelneuder
Collaborator Author

Ok, some really great results! I switched to AWS to sanity-check what was happening with the performance and ran the same experiment with much better success. On a t2.2xlarge instance, I was able to get nearly perfect linear scaling for the loop 3 runtime!! Phew, this is a huge relief.

threads, timing, speedup
1, 0.44721634100005758, 1
2, 0.22189785400007622, 2.01
4, 0.11443732100008219, 3.91
8, 0.0568768589999, 7.86

Even though this is a single loop, speeding it up dramatically impacts the overall runtime of the code. For a single time step in serial, we have a runtime of 3.93 (s), but with 8 threads I am getting 2.33 (s), which is a 1.69x speedup! Pretty awesome for a single parallel loop.

@michaelneuder
Collaborator Author

I am also parallelizing loop 5 in the calc_explicit call. When using 8 threads, the timing drops from 0.11551769900006548 to 0.015365126999995482, a 7.5x speedup. This brings the overall timing of a single iteration down to 1.98 (s), which gives an overall speedup of 3.93/1.98 = 1.98x for the entire code!

@michaelneuder
Collaborator Author

Running for 50 iterations in serial we have the following timing output

real 3m34.483s
user 3m29.935s
sys 0m3.846s

and for parallel

real 1m51.155s
user 3m38.685s
sys 0m0.887s

So we are seeing a 214 (s) / 111 (s) = 1.93x speedup. I also verified correctness by checking the Nusselt numbers, and they are identical for both runs.

@michaelneuder
Collaborator Author

Ok, I made each loop in calc_explicit parallel and got the overall runtime down to

real 1m39.510s
user 3m46.794s
sys 0m0.849s

So we have 214 (s) / 99 (s) = 2.16x speedup!

@michaelneuder
Collaborator Author

michaelneuder commented Apr 15, 2021

OK, I am trying to parallelize other parts of the main loop besides calc_explicit, and am running into some weird behavior. It can be boiled down to the example below.

   !$OMP PARALLEL DO num_threads(8) private(tmp_uy) schedule(dynamic)
   do it = 1,Nx
      !$OMP CRITICAL
      ! Solve for v
      call calc_vi(tmp_uy, phi(:,it))
      uy(:,it) = tmp_uy
      ! Solve for u
      if (kx(it) /= 0.0_dp) then
         !ux(:,it) = -CI*d1y(tmp_uy)/kx(it)
         ux(:,it) = CI*d1y(tmp_uy)/kx(it)
      else if (kx(it) == 0.0_dp) then
         ux(:,it) = cmplx(0.0_dp, 0.0_dp, kind=C_DOUBLE_COMPLEX) ! Zero mean flow!
      end if
      !$OMP END CRITICAL
   end do
   !$OMP END PARALLEL DO

This is my sanity check for the loop iterations being independent: each iteration is wrapped in a critical region, so the iterations run in a random order, but only one at a time. Yet this actually breaks the code, and the Nusselt number quickly explodes into a NaN. I believe this means the loop iterations are not independent, but I can't quite work out why. It seems like tmp_uy should be the only private variable. I believe the problem is coming from the calc_vi call, because if I make phi a thread-private variable, the code doesn't explode, but the Nusselt number is off by a small amount, which makes sense because phi really should be a shared variable. But since phi is only accessed at the phi(:,it) slice, it seems like each iteration should be independent?

@michaelneuder
Collaborator Author

I added a lot more parallelism today, including in the x direction of stages 1-3. We are seeing very good performance results:

with Nx=4800 Ny=3825 on m4.4xlarge (16 cores) running 16 threads
serial
overall timing: 46.162012183000115 (s)
parallel
overall timing: 5.7642300109998814 (s)
speedup = 8.008x
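As a rough sanity check, inverting Amdahl's law S(n) = 1/(f + (1 − f)/n) for the remaining serial fraction f (assuming the parallel part scales ideally):

```latex
f = \frac{n/S - 1}{n - 1} = \frac{16/8.008 - 1}{16 - 1} \approx 0.067
```

So roughly 7% of the runtime still behaves as serial; shrinking that remainder is where additional threads would pay off next.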

@michaelneuder michaelneuder changed the title One level of OMP parallelize in calc_explicit. OpenMP Apr 30, 2021