Single-node MPI strong scaling differences between 2d and 3d
I am investigating the strong scaling behaviour of the 2d and 3d versions of ExaHyPE. While the 2d version shows reasonable scalability, the 3d version does not.
- To exclude interconnect effects, all experiments are performed on a single node of SuperMUC Phase 2.
- All plotters are turned off.
In my experiments, I switch the master-worker communication (M/W) and the neighbour communication (N) on or off independently.
Only Peano communication (M/W=off,N=off)
ranks | adapter name | iterations | total CPU time [t]=s | average CPU time [t]=s | total user time [t]=s | average user time [t]=s |
---|---|---|---|---|---|---|
2 | ADERDGTimeStep | 29 | 37.07 | 1.27828 | 316.768 | 10.923 |
3 | ADERDGTimeStep | 29 | 36.25 | 1.25 | 305.752 | 10.5432 |
12 | ADERDGTimeStep | 29 | 27.04 | 0.932414 | 206.142 | 7.10834 |
28 | ADERDGTimeStep | 29 | 10.27 | 0.354138 | 24.4012 | 0.841421 |
M/W=on, N=off
ranks | adapter name | iterations | total CPU time [t]=s | average CPU time [t]=s | total user time [t]=s | average user time [t]=s |
---|---|---|---|---|---|---|
2 | ADERDGTimeStep | 29 | 37.58 | 1.29586 | 316.044 | 10.8981 |
3 | ADERDGTimeStep | 29 | 36.3 | 1.25172 | 306.977 | 10.5854 |
12 | ADERDGTimeStep | 29 | 27.21 | 0.938276 | 207.078 | 7.14064 |
28 | ADERDGTimeStep | 29 | 10.27 | 0.354138 | 24.5317 | 0.845921 |
M/W=off, N=on
ranks | adapter name | iterations | total CPU time [t]=s | average CPU time [t]=s | total user time [t]=s | average user time [t]=s |
---|---|---|---|---|---|---|
2 | ADERDGTimeStep | 29 | 39.52 | 1.36276 | 337.709 | 11.6451 |
3 | ADERDGTimeStep | 29 | 99.04 | 3.41517 | 995.378 | 34.3234 |
12 | ADERDGTimeStep | 29 | 121.76 | 4.19862 | 1106.45 | 38.1534 |
28 | ADERDGTimeStep | 29 | 18.24 | 0.628966 | 105.858 | 3.65027 |
M/W=on, N=on
Slightly worse than M/W=off,N=on.
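
To make the tables easier to compare, here is a minimal sketch that derives speedup and parallel efficiency from the total user time column. It is only an illustration: the 2-rank run is assumed as the baseline because no 1-rank run is listed, and the helper name `strong_scaling` is hypothetical and not part of ExaHyPE or Peano.

```python
ITERATIONS = 29  # iterations of ADERDGTimeStep in every run above

# Total user time [s] per rank count, copied from the three tables above.
total_user_time = {
    "M/W=off, N=off": {2: 316.768, 3: 305.752, 12: 206.142, 28: 24.4012},
    "M/W=on, N=off": {2: 316.044, 3: 306.977, 12: 207.078, 28: 24.5317},
    "M/W=off, N=on": {2: 337.709, 3: 995.378, 12: 1106.45, 28: 105.858},
}


def strong_scaling(times, base_ranks=2):
    """Return {ranks: (avg_time, speedup, efficiency)} relative to base_ranks.

    Assumption: the base_ranks run serves as the reference, so
    speedup = T(base) / T(p) and efficiency = speedup * base / p.
    """
    base = times[base_ranks]
    result = {}
    for ranks, total in sorted(times.items()):
        speedup = base / total
        efficiency = speedup * base_ranks / ranks
        result[ranks] = (total / ITERATIONS, speedup, efficiency)
    return result


if __name__ == "__main__":
    for config, times in total_user_time.items():
        print(config)
        for ranks, (avg, s, e) in strong_scaling(times).items():
            print(f"  {ranks:>2} ranks: avg {avg:8.4f} s, "
                  f"speedup {s:5.2f}, efficiency {e:5.2f}")
```

With N=on, the 3- and 12-rank runs come out with a speedup below 1 relative to the 2-rank run, i.e. they are slower than the baseline.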
Insights:
- The poor performance of the 3- and 12-rank runs is caused by load balancing issues.
- For the 28-rank run, the load balancer (LB) only deploys 10 ranks. This is actually a well-balanced setup for 10 ranks. (If we run with exactly 10 ranks, we have a load balancing issue again.)