Commit b89def39 authored by Philipp Samfaß

Merge branch 'master' into mpi_offloading

parents bd4b0f13 97c2a194
\teaMPI\ is an open source library built with C++. It plugs into MPI via
MPI's PMPI interface and provides an additional interface for
advanced task-based parallelism on distributed memory architectures.
Its research vision reads as follows:
\item {\bf Intelligent task-based load balancing}. Applications can hand over
tasks to \teaMPI. These tasks have to be ready, i.e.~without incoming
dependencies, and both their input and their output have to be serialisable.
It is now up to \teaMPI\ to decide whether a task is injected into the local runtime or temporarily moved
to another rank, where we compute it and then bring back its result.
\item {\bf MPI idle time load metrics}. \teaMPI\ can plug into some MPI
calls---hidden from the application---and measure how long this MPI call
idles, i.e.~waits for incoming messages. It provides lightweight
synchronisation mechanisms between MPI ranks such that MPI ranks can globally
identify ranks that are very busy and ranks which tend to wait for MPI
messages. Such data can be used to guide load balancing. Within \teaMPI, it
can be used to instruct the task-based load balancing how to move data around.
\item {\bf Black-box replication}. By hijacking MPI calls, \teaMPI\ can split
up the global number $N$ of ranks into $T$ teams of equal size. Each team
assumes that there are only $N/T$ ranks in the system and thus runs completely
independently of the other teams. This splitting is completely hidden from the
application. \teaMPI, however, provides a heartbeat mechanism which identifies
whether one team becomes slower and slower. This can be used as a guideline for
resiliency---assuming that failures in ranks that will eventually fail first
manifest in a speed deterioration of their team.
\item {\bf Replication task sharing}. \teaMPI\ for teams can identify tasks
that are replicated in different teams. Every time the library detects that a
task has been computed that is replicated on another team and handed out to
\teaMPI, it can take the task's outcome, copy it over to the other team, and
cancel the task execution there. This reduces the overhead cost of
resiliency via replication (teams) massively.
\item {\bf Smart progression and smart caching}. If a SmartNIC (Mellanox
BlueField) is available to a node, \teaMPI\ can run dedicated helper code on the SmartNIC which
polls MPI all the time. For dedicated MPI messages (tasks), it can hijack
MPI. If a rank A sends data to a rank B, the MPI send is actually deployed to
the SmartNIC which fetches the data directly from A's memory (RDMA). It is in
turn cached on rank B, from where it is directly deployed into the memory of B if
B has issued a non-blocking receive. Otherwise, it is at least available on
the SmartNIC, where we cache it, once it is requested.
\item {\bf Smart sniff}. Real-time-guided load balancing in HPC typically
suffers from the fact that the load balancing is unable to distinguish
ill-balancing from network congestion. We can thus construct situations
where a congested network suggests to the load balancing that ranks are idle,
and the load balancing consequently starts to move data around. As a result,
the network experiences even more stress and we enter a feedback cycle. With
SmartNICs, we can deploy heartbeats to the network device and distinguish
network problems from ill-balancing---eventually enabling smarter timing-based
load balancing.
\item {\bf Smart balancing}. With SmartNICs, \teaMPI\ can outsource its
task-based node balancing completely to the network. All distribution
decisions and data movements are championed by the network card rather than
the main CPU.
\item {\bf Smart replication}. With SmartNICs, \teaMPI\ can outsource its
replication functionality including the task distribution and replication to
the network.
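The rank arithmetic behind the black-box replication item above can be sketched without any MPI calls. The following is a minimal illustration that assumes teams are formed from contiguous blocks of global ranks; the names \texttt{TeamPlacement} and \texttt{placeInTeam} are hypothetical and not part of \teaMPI's API, and the mapping the library actually uses may differ.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical helper type: where a global rank ends up after the split.
struct TeamPlacement {
  int team;      // index of the team, 0 <= team < T
  int teamRank;  // rank as seen by the application, 0 <= teamRank < N/T
  int teamSize;  // number of ranks the application believes exist (N/T)
};

// Assumption: teams are contiguous blocks of N/T global ranks.
TeamPlacement placeInTeam(int globalRank, int numberOfRanks, int numberOfTeams) {
  if (numberOfRanks <= 0 || numberOfTeams <= 0 ||
      numberOfRanks % numberOfTeams != 0) {
    throw std::invalid_argument("N must be a positive multiple of T");
  }
  if (globalRank < 0 || globalRank >= numberOfRanks) {
    throw std::out_of_range("global rank outside [0, N)");
  }
  const int teamSize = numberOfRanks / numberOfTeams;
  const TeamPlacement placement{globalRank / teamSize, globalRank % teamSize,
                                teamSize};
  // Postcondition: the placement is valid within the team layout.
  assert(placement.team < numberOfTeams && placement.teamRank < teamSize);
  return placement;
}
```

With $N=8$ and $T=2$, for example, global rank 5 would be placed in team 1 and would see itself as rank 1 of 4 under this assumed layout.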
\section*{History and literature}
\teaMPI\ was started as an MScR project by Benjamin Hazelwood under the
supervision of Tobias Weinzierl.
It was subsequently extended significantly
and rewritten by Philipp Samfass as part of his PhD thesis.
The two core papers describing the research behind the library are
\item Philipp Samfass, Tobias Weinzierl, Benjamin Hazelwood, Michael
Bader: \emph{TeaMPI -- Replication-based Resilience without the (Performance)
Pain} (published at ISC 2020) \url{}
\item Philipp Samfass, Tobias Weinzierl, Dominic E. Charrier, Michael Bader:
\emph{Lightweight Task Offloading Exploiting MPI Wait Times for Parallel
Adaptive Mesh Refinement} (CPE 2020; in press)
\section*{Dependencies and prerequisites}
\teaMPI's core is plain C++17 code.
We however use a whole set of tools around it:
\item GNU autotools (automake) to set up the system (required).
\item C++17-compatible C++ compiler (required).
\item MPI 3. MPI's multithreaded support and non-blocking collectives
\item Intel's Threading Building Blocks (TBB) or OpenMP 4.5 (required).
\item Doxygen if you want to create HTML pages or PDFs of the in-code
\section*{Who should read this document}
This guidebook is written for users of \teaMPI, and for people who want to
extend it.
The text is thus organised into three parts:
First, we describe how to build, install and use \teaMPI.
Second, we describe the vision and rationale behind the software as well as its
application scenarios.
Third, we describe implementation specifics.
Philipp Samfass,
Tobias Weinzierl
% There are two ways to obtain \Peano:
% You can either download one of the archives we provide on the webpage, or you
% can work directly against a repository clone.
% If you work with the archive, type in a
% \begin{code}
% tar -xzvf myarchive.tar.gz
% \end{code}
% in the directory where you've stored your downloaded file.
% If you work with the git archive, you have to clone this archive first.
% I grant access to \Peano\ free of charge.
% However, I ask users to sign up for the software if they intend to push
% modifications to the code (which I very much appreciate\footnote{\Peano's
% guidebook (the file you currently read) is hosted within \Peano's git
% repository, too. I'm always happy if people add content to this
% documentation, too.}).
% This way I can report to funding agencies how frequently the software is used, and
% I also have at least some ideas which application areas benefit from the
% software and where it is actively used and developed.
% If you do not intend to modify the \Peano\ core code base, you can just clone
% the code anonymously.
% \begin{code}
% git clone
% cd Peano
% git checkout p4
% \end{code}
% \begin{remark}
% I still maintain the ``old'' Peano in the repository (version 3), and most users
% consider this to be the standard Peano generation.
% For the present document, it is thus important that you manually switch to the
% branch \texttt{p4}.
% \end{remark}
% \section{Prepare the configure environment}
% This step should only be required if you work directly against the git repository.
% If you prefer to download a snapshot of \Peano, then you can skip this section
% and continue with \ref{section:installation:configure}.
% \begin{itemize}
% \item Ensure you have the autotool packages installed on your system. They
% typically are shipped in the packages \texttt{autoconf}, \texttt{automake} and
% \texttt{libtool}.
% \item Set up the configure environment:
% \begin{code}
% libtoolize; aclocal; autoconf; autoheader;
% cp src/ .;
% automake --add-missing
% \end{code}
% \end{itemize}
% \noindent
% These steps should only be required once, unless you push major revisions to the
% development branch.
% \section{Configure}
% \label{section:installation:configure}
% \Peano\ relies on the autotools to set up its build environment.
% Change into the project's directory and type in
% \begin{code}
% ./configure --help
% \end{code}
% The \texttt{--help} option should give you some information about the available
% variants of \Peano.
% In principle, a sole \texttt{./configure} call is sufficient, but you might want
% to adapt your build; notably as the default build is serial and does not bring
% along any support for postprocessing.
% While the help output should be reasonably verbose, I summarise key options
% below:
% \begin{center}
% \begin{tabular}{lp{10cm}}
% \texttt{--prefix=/mypath} & Will install \Peano\ in \texttt{/mypath}.
% \\
% \texttt{--with-multithreading} & Switch on multithreading. By default, we
% build without multithreading, but a \texttt{--with-multithreading=cpp}, e.g.,
% makes \Peano\ use the C++ threading model. Please consult \texttt{--help} for
% details. Note: To use OpenMP multithreading, you will need a compiler which
% supports OpenMP 5.0 (e.g - GCC 9.x).
% \\
% \texttt{--with-mpi} & Enable the MPI version of \Peano. You have to tell the
% build environment however which compile command to use. Please note that
% we need a C++ MPI wrapper. So \texttt{--with-mpi=mpcxx} is a typical call.
% \\
% \texttt{--with-vtk} & Inform \Peano\ that VTK is available on the system.
% \Peano\ mainly
% relies on its own IO data format. Even some VTK dump routines (used usually
% only for debugging) are written by hand, i.e.~do not rely on the VTK
% libraries.
% However, \Peano\ comes along with some command line conversion tools that can convert its tailored data
% dumps into VTK, and they do rely on the VTK libraries. With this flag, you
% tell \Peano\ to build these tools.
% The VTK installation is sometimes not easy (and you might have to provide additional parameters depending on your installation). I dedicate Chapter \ref{chapter:vtk} on VTK.
% \\
% \texttt{--with-hdf5} & Make \Peano\ support HDF5 output. Not stable at the
% moment.
% \\
% \texttt{--with-delta} & Configure \Peano\ such that the geometry library
% $\Delta $ is used. Not stable at the moment.
% \end{tabular}
% \end{center}
% \begin{remark}
% I recommend that you start a first test without any additional flavours of
% \Peano, i.e.~to work with a plain \texttt{./configure} call. Once the tests
% pass, I recommend that you first add IO (VTK) and then parallelisation.
% \end{remark}
% \noindent
% Compilers, linkers and both compiler and linker flags can be changed by
% resetting the corresponding environment variables \emph{prior} to the configure
% call.
% Alternatively, you can pass the variables to be used to \texttt{configure}
% through arguments.
% Please consult the \texttt{--help} output for details.
% For some supercomputers that we use frequently, there are recommendations which
% setting to use (Chapter~\ref{chapter:selected-HPC-platforms}).
% \section{Build}
% Once the configuration has been successful, a simple
% \begin{code}
% make
% \end{code}
% should build the \Peano\ core and some examples.
% \begin{code}
% make install
% \end{code}
% finally will deploy the \Peano\ files in the directory specified via the
% \texttt{prefix} before.
% If you haven't set the prefix, then the default will likely be a system
% directory.
% Unless you have superuser rights, the installation then will fail.
% So I recommend that you install \Peano\ into a local directory specifying it via
% the \texttt{--prefix} option.
% \begin{remark}
% To develop with \Peano, you don't have to install it. You can instead just skip
% the installation and work in the download directory.
% \end{remark}
% \section{Installation test}
% Once you have compiled \Peano, I recommend that you run all tests using
% \begin{code}
% make check
% \end{code}
% or to run 2D and 3D tests individually, use
% \begin{code}
% src/examples/unittests/UnitTests2d
% src/examples/unittests/UnitTests3d
% \end{code}
% \noindent
% You can run these builds with different core counts and also MPI support if you
% have compiled with MPI.
% The executables contain both node correctness tests and MPI ping-pong tests,
% i.e.~you can both assess a valid build plus a working MPI environment.
% \section{Python configuration}
% \Peano\ is developed with Python 3. Python 2 is not supported.
% To make any Python example pass, you have to set the Python paths.
% There are multiple ways to do so.
% The standard way is to set the \texttt{PYTHONPATH}:
% \begin{code}
% export PYTHONPATH=myPeanoDirectory/python
% \end{code}
% This can even be done in the \texttt{.bashrc}.
% Alternatively, you can augment your Python scripts later on with
% \begin{code}
% import sys
% PEANO4_PYTHON_DIR="myPeanoDirectory/python"
% sys.path.append(PEANO4_PYTHON_DIR)
% \end{code}
% \noindent
% Please note that most Python scripts require you to set the
% \texttt{LD\_LIBRARY\_PATH}.
% If you install \Peano, then this path should be automatically right.
% If you work with a local \Peano\ installation, you can either tell \Peano\
% through its API where to search for the library (all routines that add a
% library accept additional search paths), or you can again use the
% \texttt{sys.path.append} command to extend \texttt{LD\_LIBRARY\_PATH}.
% \begin{remark}
% I personally prefer the variant to go through the environment variable.
% \Peano's Python API yields plain old makefiles building executables that work
% completely without any Python.
% This way, we ensure that we work smoothly on supercomputers, too, where Python
% sometimes is not available in the latest version or on the compute nodes.
% As long as you run through environment variables or the \Peano\ routines to
% specify paths, your makefiles also will continue to work that way.
% \end{remark}
% \section{Fortran}
% \begin{remark}
% If you don't use Fortran in your own code, ignore this section.
% \end{remark}
% \Peano's core does not use any Fortran at all.
% However, many users use Fortran for their domain-specific programming.
% If you want to have a seamless integration of your particular Fortran choice
% through \Peano's Python API, invoke \texttt{./configure} ensuring that the
% Fortran variables---in particular the environment variable \texttt{FC}
% identifying the compiler---are properly set.
% For many codes and the GNU Fortran compiler, you need the flag
% \texttt{-fdefault-real-8}.
% You can export \texttt{FCFLAGS} and make it contain this argument before you
% invoke \texttt{configure}.
% As \Peano's core does not use Fortran---it is only applications built on top
% of \Peano---you can redefine the flags later on (see \ref{section:installation:applications-built-with-Peano}).
% \section{\Peano\ components and build variants}
% \label{chapter:installation:build-variants}
% \Peano\ currently is delivered as a set of archives, i.e.~static libraries:
% \begin{itemize}
% \item There is a technical architecture (\texttt{Tarch}) and the actual
% \texttt{Peano4Core}.
% \item Each archive variant is available as a release version, as a debug
% version, as a version with tracing and assertions and as a tracing-only
% version.
% \item Each archive variant is available as 2d build and as 3d build. If you
% need higher dimensions, you have to build the required libraries manually.
% \end{itemize}
% The version terminology is as follows:
% \begin{itemize}
% \item {\bf debug} The debug versions of the archives all have the postfix
% \texttt{\_debug}. If you link against these versions, the full set of
% assertions, of all tracing and all debug messages is available; though you can
% always filter on the application-level which information you want to see
% (cmp.~Chapter \ref{section:logging}).
% \item {\bf asserts} These versions of the archives all have the postfix
% \texttt{\_asserts}. If you link against these versions, all assertions are on.
% The code also contains tracing macros (see below).
% \item {\bf tracing} The tracing versions of the archives all have the postfix
% \texttt{\_trace}. If you link against these versions, all assertions and debug
% messages are removed, but some tracing is active. You can switch on/off the
% tracing per class (cmp.~Chapter \ref{section:logging}), and different tracing
% backends allow you to connect to different (profiling) tools.
% \item {\bf release} The release versions of the archives have no
% particular postfix. They disable tracing, any debugging and all assertions.
% These archives should be used by production runs.
% \end{itemize}
% \begin{remark}
% Peano's release components still write info messages (they suppress debug,
% tracing and logging, but not essential information). If you want to filter out
% these messages, too, you have to apply a log filter (Chapter
% \ref{section:logging:log-filter}).
% \end{remark}
% \noindent
% Besides these archives, the \Peano\ installation also comes along with a set of
% example applications.
% They are found in the directory \texttt{src/examples}.
% Most examples create by default two variants of the example: a debug
% version and one with only tracing enabled.
% Several examples furthermore come along as 2d and 3d build.
% \section{Documentation}
% \begin{figure}
% \begin{center}
% \includegraphics[width=0.4\textwidth]{10_installation/webpage.png}
% \hspace{0.4cm}
% \includegraphics[width=0.4\textwidth]{10_installation/source-docu.png}
% \end{center}
% \caption{
% \Peano's webpage (left) and a screenshot from the auto-generated source code
% docu which can be reached through the webpage if you don't want to generate
% the pages yourself.
% This documentation also provides indices and a search function as well as all
% documentation formulae typeset with LaTeX.
% }
% \end{figure}
% There are three major types/resources of documentation of the software:
% \begin{enumerate}
% \item This guidebook/cookbook that describes how to use the code base from a
% high abstraction level and with anecdotal examples.
% \item The documentation of the C++ code. Here, I follow an ``everything is in
% the code'' philosophy.
% \item The documentation of the Python code. Here, I follow an ``everything is
% in the code'' philosophy.
% \end{enumerate}
% \noindent
% For the code documentation, ``everything is in the code'' means that all
% documentation is comments within the Python script or C++ header files,
% respectively.
% You can create a webpage from this distributed information through
% the tool \texttt{doxygen}\footnote{\url{}.}.
% There are two Doxygen configuration files in the repository, i.e.~I keep the
% Python and the C++ documentation output separate.
% To create the documentation, switch into directory \texttt{src} or
% \texttt{python}, respectively.
% In both directories, the Doxygen config file is called
% \texttt{peano.doxygen-configuration}, i.e.~calling
% \begin{code}
% doxygen peano.doxygen-configuration
% \end{code}
% \noindent
% gives you the output. If you prefer not to generate and maintain the
% documentation yourself, the \Peano\ webpage hosts the autogenerated
% documentation, too.
% It is updated roughly once a week.
% \section{Compile/link options for applications built with \Peano}
% \label{section:installation:applications-built-with-Peano}
% \Peano\ is a framework, i.e.~ships very few executables (mainly only for test
% purposes and for postprocessing).
% Its core delivery is a set of libraries.
% Application codes then build on top of these libraries.
% While it is up to you to decide which build system you use in your application,
% \Peano\ natively favours makefiles.
% If you use \Peano's Python API, it will create dummy makefiles.
% Higher-level \Peano\ applications such as \ExaHyPE\ build upon the Python API
% and thus work with makefiles, too.
% When they create these makefiles, they parse your local \Peano\ installation's
% makefiles and extract the information from there.
% That is, when you've selected a particular compiler or compiler option when you
% install \Peano, this choice will automatically propagate through to your
% domain-specific codes.
% There's however the option to overwrite the chosen settings---either by defining
% appropriate environment variables before you invoke \texttt{make} or by
% resetting variables manually within your Python scripts that configure the
% environment.
\chapter{Using \teaMPI}
\chapter{\teaMPI\ for developers}
\chapter{\teaMPI's architecture}
\teaMPI's design is conceptually very simple:
\item The user code links against the \teaMPI\ library which in turn hooks
into the PMPI interface. As a result, \teaMPI\ can hijack MPI calls, map them
onto subcommunicators (subsets of ranks) or trigger special functions besides
the MPI core functionality.
\item Furthermore, \teaMPI\ provides an API through which programmers can,
for example, inform it about tasks or query load balancing information.
\item If \teaMPI\ is built with SmartNIC support, each library instance running on the
host is paired up with a SmartTea instance running on the BlueField. The
library can then interact with the \teaMPI\ instance running on the SmartNIC,
and the two of them can orchestrate data transfer completely independently of
the user application.
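The interception idea from the first item above can be sketched without an MPI installation. In an actual PMPI wrapper, the intercepting function would carry the name of the MPI routine (for instance \texttt{MPI\_Recv}) and forward to the matching \texttt{PMPI\_} routine; here the forwarded call is a generic callable so that the sketch compiles on its own. The names \texttt{IdleTimeStats} and \texttt{interceptAndTime} are hypothetical and only illustrate how a wrapper could record the time spent inside a call.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical accumulator for the time spent inside intercepted calls.
struct IdleTimeStats {
  std::chrono::steady_clock::duration total =
      std::chrono::steady_clock::duration::zero();
  int calls = 0;
};

// Stand-in for a PMPI-style wrapper: 'callReal' plays the role of the
// PMPI_ routine that does the actual work, hidden from the application.
template <typename RealCall>
int interceptAndTime(IdleTimeStats& stats, RealCall&& callReal) {
  const auto start = std::chrono::steady_clock::now();
  const int returnCode = callReal();  // forward to the real implementation
  const auto elapsed = std::chrono::steady_clock::now() - start;
  // steady_clock is monotonic, so the elapsed time cannot be negative.
  assert(elapsed >= std::chrono::steady_clock::duration::zero());
  stats.total += elapsed;
  ++stats.calls;
  return returnCode;
}
```

The application only sees the return code of the forwarded call, while the wrapper accumulates timing data that could feed a load metric.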
\chapter{Smart progression}
Let a rank A send data to a rank B.
With non-blocking MPI, A issues an \texttt{Isend}, B issues an \texttt{Irecv},
and both ranks continue to compute while MPI transfers the data.
In practice, this does not happen, as
\item MPI implementations require CPU cycles to move large or numerous
messages around and thus have to be called regularly (progressed) via sends,
receives or waits.
\item MPI message exchange suffers from congestion, i.e.~the moment we wait
for a particular message, MPI may first have to transfer other messages
which come earlier in the queue.
Progression is a technique frequently discussed in the literature.
The standard recommendations on how to handle it are
\item make your own code call \texttt{MPI\_Wait}, \texttt{MPI\_Test},
\texttt{MPI\_Probe} in regular intervals;
\item dedicate one thread to MPI's progression machinery (supported by
Intel MPI, e.g.).
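The first recommendation can be sketched as a compute loop that interrupts itself at regular intervals. The sketch below is MPI-free so that it compiles on its own: the callable \texttt{pollMPI} stands in for a call such as \texttt{MPI\_Test}, and the name \texttt{computeWithManualProgression} as well as the fixed polling interval are assumptions made for illustration only.

```cpp
#include <cassert>

// Runs 'iterations' work items and invokes 'pollMPI' after every
// 'pollInterval' items. Returns how often the poll was triggered.
template <typename Work, typename Poll>
int computeWithManualProgression(int iterations, int pollInterval,
                                 Work&& doWork, Poll&& pollMPI) {
  assert(pollInterval > 0);
  int polls = 0;
  for (int i = 0; i < iterations; ++i) {
    doWork(i);  // the actual computation
    if ((i + 1) % pollInterval == 0) {
      pollMPI();  // in real code: e.g. MPI_Test on outstanding requests
      ++polls;
    }
  }
  return polls;
}
```

The polling interval has to be chosen by the programmer, and the poll is interleaved with the compute kernel.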
Both ``solutions'' (if they work) are unsatisfying, as they require manual
intervention by programmers and thus make code more complex and/or sacrifice CPU
With \teaMPI, we offer an alternative:
\item Users tell \teaMPI\ about particular messages (message tags) which are
critical to the simulation and non-blocking data transfer.
From here on, \teaMPI\ knows about \emph{critical} message tags.
\item Whenever a user code calls \texttt{MPI\_Isend}, \teaMPI\ hijacks this
send, i.e.~it is not executed. This happens only for messages with the right
tag (see step (1) in architecture illustration).
\item Next, \teaMPI\ issues a remote data transfer from the
\texttt{MPI\_Isend} buffer on the BlueField (BF), i.e.~data is copied to the BF.
This runs totally parallel to the CPU (as it is driven by the BF).
See step (2) in the architecture illustration.
\item Once all data resides on the BF, an \texttt{MPI\_Wait} or
\texttt{MPI\_Test} on the sender succeeds. Again, \teaMPI\ hides the fact
that the test or wait does not directly hit the MPI library.
\item Once all data resides on the BF, the BF issues a non-blocking MPI call
to the receiver's BF. As the BF has a CPU of its own, it can poll the MPI
library until this data transfer is complete, i.e.~it takes over the
responsibility of a progression thread. Alternatively, it can work with plain
blocking data transfer (step (3) in the illustration).
On the receiver side, we implement the mirrored pattern.
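The sender-side steps above can be sketched as a tag registry plus a small state machine. This is an MPI-free illustration under stated assumptions: the class \texttt{CriticalTags}, the enumeration \texttt{SendStage} and the functions \texttt{isHijacked}, \texttt{advance} and \texttt{senderWaitSucceeds} are hypothetical names chosen for this sketch and are not taken from \teaMPI's API.

```cpp
#include <cassert>
#include <unordered_set>

// Hypothetical registry of message tags the user has declared critical.
class CriticalTags {
 public:
  void declareCritical(int tag) { tags_.insert(tag); }
  bool isCritical(int tag) const { return tags_.count(tag) > 0; }

 private:
  std::unordered_set<int> tags_;
};

// Hypothetical stages of a hijacked send, in the order of the steps above.
enum class SendStage {
  Posted,              // user code called the non-blocking send
  CopyingToSenderBF,   // data is being fetched from the send buffer
  OnSenderBF,          // all data resides on the sender's BF
  InFlightBetweenBFs,  // the sender's BF transfers to the receiver's BF
  OnReceiverBF         // data has arrived on the receiver's BF
};

// Only sends with a critical tag are taken over; others go straight to MPI.
bool isHijacked(const CriticalTags& tags, int tag) {
  return tags.isCritical(tag);
}

// Moves a hijacked send to its next stage.
SendStage advance(SendStage stage) {
  assert(stage != SendStage::OnReceiverBF);  // final stage has no successor
  return static_cast<SendStage>(static_cast<int>(stage) + 1);
}

// A wait or test on the sender succeeds once the data resides on the BF.
bool senderWaitSucceeds(SendStage stage) {
  return static_cast<int>(stage) >= static_cast<int>(SendStage::OnSenderBF);
}
```

In this sketch, the sender's wait already succeeds while the message is still in flight between the two BFs, which mirrors the decoupling of host and network described in the steps.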
The pattern is not free of drawbacks: It tends to increase the message latency
and requires additional memory (on the BFs).
However, we allow the user to specify manually for which messages it is used, so
it can be tailored towards the user's needs.
Even if the technique is only applied to non-urgent messages, it thus relieves
pressure on the MPI subsystem from the host's point of view.
% [SVG figure source (created with Inkscape 0.92.4, viewBox 0 0 210 297); markup omitted]