% Source: commit 8f658065 by Philipp Samfaß ("added missing file", parent 5f6fb1a4)
\chapter{Advanced Reactive Extensions}
\label{chapter:reactive}
\exahype\ features several reactive extensions in preparation for modern and future supercomputers, which are characterized by increasing load imbalances (e.g., due to hardware performance fluctuations) as well as by potentially unreliable execution (e.g., silent data corruption). The reactive extensions help mitigate load imbalances for \exahype's\ ADER-DG implementation (see Section~\ref{sec:reactive_lb}) and allow the engine to detect and correct silent errors in ADER-DG's space-time predictor (see Section~\ref{sec:reactive_error_correction}).
\section{Prerequisites}
All features in this chapter operate between MPI processes, i.e., they require distributed-memory parallelization with MPI to be enabled.
Furthermore, as the techniques are task-based, they also rely on \exahype's shared memory parallelization with TBB.
Both levels of parallelism therefore need to be active in order to use the reactive extensions.
Moreover, the reactive techniques require that the task-based parallelization of ADER-DG's space-time predictor and the fused time stepping optimisation are activated:
\begin{code}
"optimisation" : {
...
"fuse_algorithmic_steps": "all",
...
"spawn_predictor_as_background_thread": true,
...
}
\end{code}
Lastly, the implementation of the reactive extensions currently only supports global time stepping, i.e., the following settings should be used:
\begin{code}
"solvers": {
"type": "ADER-DG",
...
"time_stepping": "global"
...
}
\end{code}
\section{Load Balancing through Reactive Task Offloading}
\label{sec:reactive_lb}
Despite Peano's spacetree domain decomposition, load imbalances may arise between MPI ranks in \exahype, e.g., due to dynamic AMR, non-uniform costs of the non-linear space-time predictor tasks (varying numbers of Picard iterations), or hardware performance fluctuations. Therefore, \exahype\ features a second level of task-based load balancing on top of domain decomposition, where space-time predictor tasks in its ADER-DG implementation can be temporarily given away (``offloaded'') to other MPI ranks in order to improve load balance.
To fine-tune its load balancing decisions reactively, \exahype\ introspectively monitors its performance at runtime by measuring waiting times in MPI operations for each rank. It further makes these waiting times available globally in the background to all MPI ranks. If a rank waits for a long time in MPI, it can take up further space-time predictor tasks from another rank. If a rank is just being waited on by other ranks, it is likely a bottleneck and it should be relieved of some computations.
With reactive task offloading, overloaded bottleneck ranks and underutilized ranks are determined for every time step based on the measured and globally distributed waiting times.
Additionally, the waiting times are used to compute task offloading rules. These rules specify how many tasks a given rank should hand over to another rank in order to improve the overall load balance.
The task offloading rules are adapted according to a diffusive process (called ``reactive diffusion'') at runtime after every time step, taking into account the most recently available load information.
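The diffusive update of the offloading rules can be illustrated with a simplified, self-contained sketch. All names, the cost model, and the concrete update formula below are illustrative assumptions and do not reflect \exahype's actual implementation:

```python
# Illustrative sketch of diffusion-based task offloading. The update rule,
# the cost model, and all names are hypothetical, not ExaHyPE's actual code.
def update_offloading_rules(wait_times, tasks_to_offload, temperature=0.5):
    """wait_times[r]: measured MPI waiting time of rank r (seconds).
    tasks_to_offload[victim][target]: per-step offloading rules.
    A rank with a short waiting time is likely being waited upon, i.e.,
    it is a bottleneck ("victim") and should hand tasks to a rank with a
    long waiting time, which has spare capacity ("target")."""
    avg = sum(wait_times) / len(wait_times)
    for victim, w in enumerate(wait_times):
        if w >= avg:
            continue  # rank is not a bottleneck
        # pick the rank that currently waits longest as the target
        target = max(range(len(wait_times)), key=lambda r: wait_times[r])
        # diffusive step: offload a fraction (the "temperature") of the
        # imbalance, converted into a task count via a toy cost model
        imbalance = wait_times[target] - w          # seconds
        extra = int(temperature * imbalance * 10)   # toy model: 10 tasks/second
        victim_rules = tasks_to_offload.setdefault(victim, {})
        victim_rules[target] = victim_rules.get(target, 0) + extra
    return tasks_to_offload

# rank 0 never waits (bottleneck), rank 1 waits 2 s (idle), rank 2 is average
rules = update_offloading_rules([0.0, 2.0, 1.0], {})
```

After each time step the rules are recomputed from the latest waiting times, so the scheme keeps adapting to the current load situation.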
To enable reactive task offloading with the reactive diffusion load balancing strategy, the paragraph
\begin{code}
"offloading": {
"offloading_lb_strategy": "reactive_diffusion"
}
\end{code}
needs to be added to the specification file. Further, the toolkit needs to be re-run and the application needs to be re-compiled.
Recompilation is necessary to activate Peano's performance analysis features, which are used by reactive task offloading for performance introspection. As an alternative, performance analysis can also be enabled by setting build mode \texttt{MODE=PeanoProfile}.
The above configuration in the code snippet enables the diffusion-based load balancing strategy for reactive task offloading\footnote{Other strategies are available, but they are targeted at developers and are therefore not discussed further here.}. Setting
\begin{code}
"offloading": {
"offloading_lb_strategy": "none"
}
\end{code}
in the specification file disables reactive task offloading (no re-compilation required).
%Todo: might want to document some fine-tuning features here, too
% A ``temperature'' (i.e., a relaxation factor) specifies how aggressively tasks should be given away.
\section{Task Outcome Sharing and Silent Error Correction with Process Replication}
\label{sec:reactive_error_correction}
As the number of hardware components on supercomputers scales, errors such as process failures or silent data corruptions become more likely. To render \exahype\ resilient against such errors, process replication may be employed.
Using the \teampi\ library for transparent replication of MPI ranks (\url{https://gitlab.lrz.de/hpcsoftware/teaMPI}, also provided as a submodule), \exahype\ can run multiple replicas of MPI processes in parallel. So-called \emph{teams} are groups of MPI processes that each represent a single application instance. Multiple teams run independently of each other in parallel. With $K$ teams, each MPI process has $K-1$ replica processes which --- in a baseline configuration --- store exactly the same data and execute the same computations.
To reduce the cost of redundant computation with such replication, \exahype\ was extended with a task outcome sharing feature, where the results of space-time predictor task computations (``task outcomes'') are exchanged between the teams. Recycling task outcomes from other teams allows \exahype\ to skip some of its own computations. Ideally, a run with two-fold redundancy ($K=2$) then runs as fast as a baseline run without replication (assuming both configurations use the same total number of cores).
Moreover, with process replication, individual process failures can be tolerated, as it is highly unlikely that a rank and all of its replicas fail at the same time\footnote{This requires a stable MPI implementation that can continue despite failures of individual MPI ranks. \teampi\ was extended (available at \url{https://gitlab.lrz.de/hpcsoftware/teaMPI/-/tree/ulfm_failure_tolerance}) to make use of one such MPI implementation (ULFM). In case of a process failure, it can kill the failed team and continue the computation with the ``healthy'' team. Future work in the resilience context may build upon this prototype.}.
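Conceptually, running with $K$ teams splits the global set of ranks into $K$ replicas of the application. A minimal sketch of one possible world-rank to team mapping (block-wise and equally sized here; teaMPI's actual mapping may differ):

```python
def team_of(world_rank, world_size, num_teams):
    """Map a world rank to a (team, team-local rank) pair, assuming a
    block-wise split into equally sized teams. Illustrative only: teaMPI's
    actual rank-to-team mapping may differ."""
    team_size = world_size // num_teams
    return world_rank // team_size, world_rank % team_size

# 8 world ranks, 2 teams: ranks 0-3 form team 0, ranks 4-7 form team 1
```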
\paragraph{Enable Transparent Replication with TeaMPI}
To make use of task outcome sharing and silent error resilience, \exahype\ needs to be linked against \teampi. Further, as the resilience features rely on library calls into \teampi, the paths to \teampi's\ headers and library must be set:
\begin{code}
export COMPILER_CFLAGS=${COMPILER_CFLAGS}" -I<path_to_teampi>/include"
export COMPILER_LFLAGS=${COMPILER_LFLAGS}" -L<path_to_teampi>/lib"
\end{code}
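A hypothetical run configuration could then look as follows; the number of teams is selected via teaMPI's \texttt{TEAMS} environment variable (the application name and rank counts below are placeholders):

```shell
# Two replica teams; with 4 MPI ranks in total this yields 2 ranks per team.
# The mpirun line is a placeholder and therefore commented out here.
export TEAMS=2
# mpirun -np 4 ./ExaHyPE-<Application> <spec_file>
echo "running with ${TEAMS} teams"
```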
\paragraph{Enable Resilience Features}
The resilience features are enabled by adding the section
\begin{code}
"resilience" : {
"resilience_mode" : <enum>,
<further optional settings>
}
\end{code}
to the specification file and by selecting a resilience mode other than \texttt{none} (see table \ref{tab:23_resilience_modes}). Further, the toolkit must be re-run and the application must be re-compiled with the aforementioned build settings for \teampi.
\begin{table}[htb]
\caption{
Different resilience modes.
\label{tab:23_resilience_modes}
}
\begin{center}
\begin{tabular}{p{1.5cm}p{11cm}}
\toprule
{\bf Options}& {\bf Description} \\
\midrule\midrule
\multicolumn{2}{l}{\texttt{none}} \\ &
Disables all resilience features.
\\
\midrule\midrule
\multicolumn{2}{l}{\texttt{task\_sharing}} \\ &
Enables task outcome sharing between teams in order to reduce the overhead of redundant computations.
\\
\midrule\midrule
\multicolumn{2}{l}{\texttt{task\_sharing\_error\_checks}} \\ &
Uses task outcome sharing between teams and tries to detect silent errors in ADER-DG's space-time predictor. \emph{Caution}: in this mode, there is no attempt to fix a detected silent error. While the application will try to continue nonetheless, unexpected behavior (e.g., a deadlock that eventually runs into a timeout) may well arise, and the corrupted results may be visible in written output files.
\\
\midrule\midrule
\multicolumn{2}{l}{\texttt{task\_sharing\_error\_correction}} \\ &
Uses task outcome sharing between teams and tries to detect \emph{and to correct} silent errors in ADER-DG's space-time predictor.
\\ \bottomrule
\end{tabular}
\end{center}
\end{table}
With enabled detection or correction of silent errors, further options must be configured as detailed next.
\paragraph{Options for Error Resilience against Silent Data Corruption}
For resilience against silent errors, an error checking mechanism (``check\_mechanism'') must be selected, which determines which task outcomes are checked for errors:
\begin{code}
"resilience" : {
"resilience_mode" : <...>,
"check_mechanism" : <enum>
...
}
\end{code}
The available options are explained in table \ref{tab:23_check_mechanism}\footnote{Other options exist mainly for development purposes and are therefore not discussed further.}.
\begin{table}[htb]
\caption{
Different error checking mechanisms.
\label{tab:23_check_mechanism}
}
\begin{center}
\begin{tabular}{p{1.5cm}p{11cm}}
\toprule
{\bf Options}& {\bf Description} \\
\midrule\midrule
\multicolumn{2}{l}{\texttt{check\_dubious\_stps} (default)} \\ &
Only those space-time predictor task outcomes classified as \emph{dubious} (i.e., potentially affected by a silent error) are subject to error detection and correction. Task outcomes are classified as dubious according to the selected error criteria (detailed below). \\
\midrule\midrule
\multicolumn{2}{l}{\texttt{check\_all\_stps}} \\ &
The outcomes of \emph{all} space-time predictor tasks are checked.
\\ \bottomrule
\end{tabular}
\end{center}
\end{table}
Space-time predictor task outcomes are identified as dubious (i.e., as potentially being corrupted) according to a combination of several error criteria. Each error criterion can be individually activated or deactivated (see table \ref{tab:23_error_criteria}). While some criteria are of boolean nature, others compute numerical values in $[0,\infty)$, where a higher value implies a more dubious outcome. For these numerical criteria, a tolerance threshold needs to be specified to discriminate between dubious and trustworthy task outcomes. In general, lower tolerance values make error detection more sensitive to silent errors, but they also incur a larger overhead, since a larger subset of all task outcomes then needs to be checked.
The error criteria are not only important for pre-filtering potentially corrupted results: given two disagreeing outcomes computed on different teams, they are also used to select which of the two outcomes is more plausible and should therefore be used for subsequent computations. In this way, they also affect the success of error correction.
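As an illustration, a numerical criterion of this kind boils down to a relative-change check against a tolerance. The following sketch of the time step sizes criterion is an illustrative assumption, not \exahype's actual implementation:

```python
def is_dubious_time_step(dt_new, dt_old, tol):
    """Flag a task outcome as dubious if the admissible time step size
    changes too strongly relative to the previous value (illustrative
    sketch of a 'check_time_step_sizes'-style criterion)."""
    if dt_old == 0.0:
        return True  # no reference value: conservatively treat as dubious
    rel_change = abs(dt_new - dt_old) / dt_old  # value in [0, inf)
    return rel_change > tol
```

Lower tolerances flag more outcomes as dubious, increasing sensitivity at the price of more checks.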
\begin{table}[htb]
\caption{
Configuration options for the error criteria. The first three options are boolean values while the other options represent numerical tolerance thresholds. The thresholds need to be chosen according to the simulation scenario and the sensitivity requirements.
\label{tab:23_error_criteria}
}
\begin{center}
\begin{tabular}{p{1.5cm}p{11cm}}
\toprule
{\bf Options}& {\bf Description} \\
\midrule\midrule
\multicolumn{2}{l}{\texttt{check\_time\_step\_sizes}} \\ & Enable/disable the time step sizes error criterion, which locally tracks the relative change of the admissible time step size per grid cell. Large changes may indicate silent data corruption.
\\
\midrule\midrule
\multicolumn{2}{l}{\texttt{check\_admissibility}} \\ & Enable/disable the admissibility error criterion. Using this criterion, task outcomes are checked for physical admissibility (as implemented in the user solver) and for NaN values. Physically inadmissible outcomes or outcomes with NaN values are flagged as dubious.
\\
\midrule\midrule
\multicolumn{2}{l}{\texttt{check\_derivatives}} \\ & Enable/disable the derivatives error criterion, which tracks the relative changes in the second derivatives of the solution caused by space-time predictor task outcomes. Large changes may indicate silent data corruption.
\\
\midrule\midrule
\multicolumn{2}{l}{\texttt{tol\_time\_step\_sizes}} \\ & Tolerance for the time step sizes error criterion.
\\
\midrule\midrule
\multicolumn{2}{l}{\texttt{tol\_derivatives}} \\ & Tolerance for the derivatives error criterion.
\\ \bottomrule
\end{tabular}
\end{center}
\end{table}
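Putting the options together, a resilience section with error correction and the default check mechanism might look as follows. The boolean values and tolerance thresholds below are placeholders that must be tuned to the simulation scenario:
\begin{code}
"resilience" : {
  "resilience_mode" : "task_sharing_error_correction",
  "check_mechanism" : "check_dubious_stps",
  "check_time_step_sizes" : true,
  "check_admissibility" : true,
  "check_derivatives" : true,
  "tol_time_step_sizes" : 0.01,
  "tol_derivatives" : 0.01
}
\end{code}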
\section{Fine-tuning of the MPI Progression}
\label{sec:reactive_progression}
For all reactive features in this chapter, minimizing MPI message latency and maximizing throughput are critical for robust performance.
The reactive extensions in \exahype\ rely on many small non-blocking messages which are typically unexpected at the receiver ranks.
To ensure a fast delivery of such messages, special emphasis needs to be put on their progression.
By default, a progress helper task is used to this end.
The helper task makes progress on outstanding non-blocking MPI messages; it further probes for incoming messages and issues the corresponding receive operations. Once it has finished executing, it reschedules itself so that the shared memory runtime eventually picks it up again at a later point. The task is terminated at application shutdown.
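The life cycle of the self-rescheduling helper can be sketched as follows, using Python's thread pool as a stand-in for the TBB runtime; the actual MPI progression calls (testing outstanding requests, probing for messages) are replaced by a placeholder:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

shutdown = threading.Event()
polls = []  # records each polling round (placeholder for MPI progression)

def progress_task(pool):
    """Poll outstanding communication once, then reschedule itself so the
    shared-memory runtime eventually picks it up again (sketch of the
    default 'progress_task' strategy; real code would test non-blocking
    requests and probe for incoming messages here)."""
    polls.append(1)  # stand-in for one round of MPI progression
    if not shutdown.is_set():
        pool.submit(progress_task, pool)  # re-enqueue for a later poll

pool = ThreadPoolExecutor(max_workers=1)
pool.submit(progress_task, pool)
# ... the application runs; at shutdown the task stops rescheduling itself:
shutdown.set()
pool.shutdown(wait=True)
```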
Other available progression strategies are aimed at improving the progression further, e.g., by polling more aggressively.
Progression can be manually adapted \emph{at compile time} (i.e., the toolkit must be re-run and the application re-compiled) in the \texttt{distributed\_memory} section of the specification file with the parameter \texttt{reactive\_progress}:
\begin{code}
"distributed_memory": {
...
"reactive_progress" : <enum>
...
}
\end{code}
The relevant parameter values are listed in table \ref{tab:23_progression}.
\begin{table}[htb]
\caption{
Configuration options for the progression parameter \texttt{reactive\_progress}.
\label{tab:23_progression}
}
\begin{center}
\begin{tabular}{p{1.5cm}p{11cm}}
\toprule
{\bf Options}& {\bf Description} \\
\midrule\midrule
\multicolumn{2}{l}{\texttt{progress\_task} (default)} \\ & Uses a self-rescheduling progress task for message progression. Only one core/thread at a time can perform progression.
\\
\midrule\midrule
\multicolumn{2}{l}{\texttt{progress\_thread}} \\ & Dedicates a TBB thread to message progression. Compared to the default setting, this option polls MPI more aggressively, but it typically requires a dedicated core, i.e., uses more hardware resources. Usually, \texttt{"background\_job\_consumers" : <num\_cores>-2} should be set to have dedicated cores available for the master thread as well as for the progression thread.
\\
\midrule\midrule
\multicolumn{2}{l}{\texttt{mpi\_thread\_split}} \\ & Uses thread-private (``split'') MPI communicators for each thread, so that multiple threads can perform progression in parallel efficiently. If used with MPI implementations that support the multiple endpoints feature, this may result in improved performance (``MPI\_THREAD\_SPLIT'' programming model, see \url{https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-guide-linux/top/additional-supported-features/multiple-endpoints-support/mpi-thread-split-programming-model.html}).
\\ \bottomrule
\end{tabular}
\end{center}
\end{table}