Reduce memory footprint
Open issues
- We have consecutive heap indices for volume data and face data. We thus need to store only one index for each and can derive the others by incrementing the index up to a bound we know as developers. We distinguish between cell and face data since we have helper cells that allocate face data but no cell data.
- We can remove the prediction and volumeFlux fields completely if we also perform the time integration in the volume integral and the boundary extrapolation routines. This would further make it easier to switch between global and local time stepping: we would just load a different kernel for the boundary extrapolation and allocate space-time face data if the user switches local time stepping on.
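A minimal sketch of the fused volume integral, assuming a tensor-product layout; only the time quadrature is shown, the stiffness-matrix part of the volume integral and the flux evaluation are omitted, and all names are hypothetical:

```cpp
#include <cstddef>

// Sketch: instead of first time-integrating spaceTimeVolumeFlux into a
// separate volumeFlux array, the volume integral accumulates the
// quadrature-weighted space-time flux directly, so the intermediate
// volumeFlux field is never materialised.
void volumeIntegralFused(double* update, const double* spaceTimeFlux,
                         const double* timeQuadWeights,
                         std::size_t numSpacePoints, std::size_t numTimePoints) {
  for (std::size_t x = 0; x < numSpacePoints; ++x) {
    double timeIntegral = 0.0;
    for (std::size_t t = 0; t < numTimePoints; ++t) {
      timeIntegral += timeQuadWeights[t] * spaceTimeFlux[t * numSpacePoints + x];
    }
    update[x] += timeIntegral;  // no volumeFlux storage needed
  }
}
```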
- Allocate all the temporary arrays like rhs, lQi_old, etc. only once per thread and not dynamically during the kernel calls. (This could be done easily now in each solver!)
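One way to realise the once-per-thread allocation in C++ is `thread_local` storage. The sketch below uses assumed placeholder sizes (BasisSize, NumVariables) and hypothetical names:

```cpp
#include <array>

// Sketch: the predictor temporaries (rhs, lQi_old, ...) live in one
// thread_local struct that is allocated once per thread, instead of
// being allocated dynamically on every kernel call.
constexpr int BasisSize = 4;      // N+1, assumed example value
constexpr int NumVariables = 5;   // assumed example value
constexpr int SpaceTimeSize =     // nVar * (N+1)^{d+1} for d = 3
    NumVariables * BasisSize * BasisSize * BasisSize * BasisSize;

struct PredictorTemporaries {
  std::array<double, SpaceTimeSize> rhs{};
  std::array<double, SpaceTimeSize> lQi_old{};
};

PredictorTemporaries& threadTemporaries() {
  thread_local PredictorTemporaries temporaries;  // one allocation per thread
  return temporaries;
}

void spaceTimePredictorKernel() {
  PredictorTemporaries& tmp = threadTemporaries();
  tmp.rhs.fill(0.0);  // reuse the buffers, no new/delete per call
  // ... Picard iterations writing into tmp.rhs / tmp.lQi_old ...
}
```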
- Create one "big" ADERDGTimeStep function in kernels/solver. This might help the compiler and is more cache-friendly.
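The fused function could simply chain the existing kernels so that a cell's data stays hot in cache between phases. Below is a minimal sketch with stub kernel bodies; the real signatures in kernels/solver will differ:

```cpp
#include <vector>

// Stub kernels standing in for the real ADER-DG phases.
void spaceTimePredictor(std::vector<double>& lQi, const std::vector<double>& luh) {
  for (std::size_t i = 0; i < lQi.size(); ++i) lQi[i] = luh[i];  // placeholder body
}
void volumeIntegral(std::vector<double>& lduh, const std::vector<double>& lQi) {
  for (std::size_t i = 0; i < lduh.size(); ++i) lduh[i] += lQi[i];  // placeholder body
}
void surfaceIntegral(std::vector<double>& lduh) {
  (void)lduh;  // Riemann contributions omitted in this sketch
}
void solutionUpdate(std::vector<double>& luh, const std::vector<double>& lduh, double dt) {
  for (std::size_t i = 0; i < luh.size(); ++i) luh[i] += dt * lduh[i];
}

// Sketch: one "big" time-step function chaining all phases, so the
// compiler can inline/interleave them and the cell data stays cached.
void ADERDGTimeStep(std::vector<double>& luh, double dt) {
  std::vector<double> lQi(luh.size());        // temporaries local to the step
  std::vector<double> lduh(luh.size(), 0.0);
  spaceTimePredictor(lQi, luh);
  volumeIntegral(lduh, lQi);
  surfaceIntegral(lduh);
  solutionUpdate(luh, lduh, dt);
}
```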
Done
I think the following was originally Tobias' idea: it is not necessary to store temporary data on the heap for every cell description. We need to analyse which ADER-DG fields are temporary and which need to be stored persistently on the heap. From my point of view, the following fields are temporary:
- spaceTimePredictor
- predictor
- spaceTimeVolumeFlux (includes sources)
- volumeFlux (includes sources)
The space-time fields have a massive memory footprint: they scale with (N+1)^{d+1} and d*(N+1)^{d+1}, respectively.
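To put numbers on this scaling, the following sketch uses assumed example values (d = 3, order N = 3, 5 variables, 8-byte doubles); with these, the space-time predictor costs ~10 KiB and the space-time flux ~30 KiB per cell:

```cpp
// Assumed example values, not taken from any concrete solver setup.
constexpr int ipow(int base, int exp) {
  return exp == 0 ? 1 : base * ipow(base, exp - 1);
}

constexpr int d = 3, N = 3, nVar = 5;
constexpr int spaceTimePoints = ipow(N + 1, d + 1);          // (N+1)^{d+1} = 256
constexpr long predictorBytes = 8L * nVar * spaceTimePoints;     // 10240 B per cell
constexpr long fluxBytes = 8L * d * nVar * spaceTimePoints;      // 30720 B per cell
```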
I thus propose that we assign each thread its own spaceTimePredictor, spaceTimeVolumeFlux, predictor, and volumeFlux fields and remove these fields from the heap cell descriptions. This would reduce the memory footprint of the ADER-DG method dramatically (and might further lead to more cache-friendly code?).
In a second step, we should remove the volumeFlux field completely, skip the time integration of the spaceTimeVolumeFlux, and perform the volume integral directly with the spaceTimeVolumeFlux.
Implementation details
- Allocate the temporary arrays in the Prediction mapping once for each thread