Additional refinement along MPI boundaries
Peano's spacetree traversal is inverted every second iteration. This poses a problem for ExaHyPE's master-worker communication if the master rank has local subtrees.
Imagine a uniform grid where a fork was performed on level 2, introducing 3^d - 1 workers for our master rank. Each of these workers, as well as the master rank itself, then holds 1/3^d of the computational domain. We call the portion belonging to the master rank the local subtree of the master (rank).
In the first iteration, we correctly kick off all workers on the coarse grid before we descend into the local subtree of the master. In the second iteration, however, we start within the local subtree of the master and perform computations there. Only once we reach the coarse grid do we kick off all workers. The workers thus had to wait while we performed our computations on the master's local subtree, and only now do they start their own computations. The whole tree traversal might hence take twice as long as expected. This becomes even worse if there are more Master-Worker boundaries. The red bars in the plot below indicate such scenarios.
Tobias proposes to add additional refinement along the MPI boundary to prevent the need for vertical communication, at least from the master to the worker. This might work. I wonder, however, whether it is really beneficial for ExaHyPE to change the traversal order in every second iteration. In an MPI setting where we perform asynchronous communication, an inversion of the traversal order is especially ill-suited, since we then have to wait until the neighbour's last message has been received. Without the inversion, we would only need to wait (and block) until all messages for the currently touched vertex have been received.
Update: The traversal order is hardwired into Peano. It is necessary to run it forward and backward.
Additional Refinement along the MPI boundary
This can be accomplished by continuously refining the topmost parent patch (which is of type Cell) of every cell of type Descendant that is located at a Master-Worker boundary. This has to be done until we end up with a patch of type Cell at the Master-Worker boundary.
At this point, we then need to send the solution values of the master cell to the worker. We might further need to impose initial conditions.
We further need to perform status flag merges in prepareSendToWorker.
Problems with this approach: it might trigger a ripple effect of further refinements around the artificially refined cell.
Introduce a no-operation traversal
We could further introduce a no-operation traversal before we perform reductions and broadcasts, which would rewind Peano's streams but would not perform any computations or communication. In this case, we would always follow the top-down traversal, and the observed Master-Worker synchronisation would not appear.
Problems with this approach:
- Batching is currently not possible with multiple adapters. We could maybe perform a no-operation in every second iteration. However, we would then still have a Master-Worker synchronisation, or would we not? Alternatively, we could have a single empty traversal in front of a batch. That would work.
- The BoundaryDataExchanger of the heaps always assumes an inversion of the traversal in every iteration. To alter this behaviour, we would need to change the receive methods of Peano's AbstractHeap, DoubleHeap, and BoundaryDataExchanger classes: we are required to add a bool "assumeForwardTraversal" (defaulting to false) to their signature. We are further required to update ExaHyPE's solver implementations: whenever we receive boundary data after we have run the no-operation traversal, we need to set this new flag to true when calling receiveData. If we run a batch of iterations, this must be done only in the first iteration of the batch.
For optimal performance, it might be useful to employ both techniques. We might use "loop padding", i.e. insert empty traversals, in order to end up with the forward traversal any time we need to broadcast or reduce something. In general, it might be useful to handle broadcasts and reductions outside of the mappings. For this, however, we would need to plug into both runAsMaster and runAsWorker.