
Commit 97c2a194 authored by Tobias Weinzierl


\chapter{\teaMPI's (smart) architecture}
\teaMPI's design is conceptually very simple:
\chapter{Smart progression}
Let a rank A send data to a rank B.
With non-blocking MPI, A issues an \texttt{Isend}, B issues an \texttt{Irecv},
and both ranks continue to compute while MPI transfers the memory.
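This idealised overlap can be sketched as follows; the buffer size and tag are arbitrary illustration values, and the program has to be launched with two ranks (e.g.~via \texttt{mpirun -n 2}):

```c
/* Idealised non-blocking exchange: rank 0 sends, rank 1 receives,
   and both (supposedly) compute while MPI transfers the data. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[1024];
    MPI_Request req = MPI_REQUEST_NULL;
    if (rank == 0) {
        MPI_Isend(buf, 1024, MPI_DOUBLE, 1, /*tag=*/0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        MPI_Irecv(buf, 1024, MPI_DOUBLE, 0, /*tag=*/0, MPI_COMM_WORLD, &req);
    }

    /* ... both ranks continue to compute here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}
```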
In practice, this overlap often does not materialise, as
\begin{itemize}
  \item MPI implementations require CPU cycles to move large or numerous
  messages around and thus have to be called regularly (progressed) via sends,
  receives or waits;
  \item MPI message exchange suffers from congestion, i.e.~the moment we wait
  for a particular message, MPI may first have to transfer other messages
  which come earlier in the queue.
\end{itemize}
Progression is a technique frequently discussed in the literature.
The standard recommendations on how to handle it are to
\begin{itemize}
  \item make your own code call \texttt{MPI\_Wait}, \texttt{MPI\_Test} or
  \texttt{MPI\_Probe} at regular intervals;
  \item dedicate one thread to MPI's progression machinery (supported by
  Intel MPI, for example).
\end{itemize}
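The first recommendation can be sketched as below. The kernel \texttt{compute\_chunk} is a hypothetical placeholder for the user's compute work; the point is merely that \texttt{MPI\_Test} is interleaved with it so that MPI gets CPU cycles to progress the outstanding transfer:

```c
/* Manual progression: poll MPI_Test between compute chunks so the
   pending transfer behind *req makes progress. compute_chunk() is a
   hypothetical user kernel, not part of any real API. */
#include <mpi.h>

void compute_chunk(int i);  /* assumed user compute kernel */

void overlap_with_progression(MPI_Request *req, int numChunks) {
    int done = 0;
    for (int i = 0; i < numChunks; i++) {
        compute_chunk(i);
        if (!done) {
            /* Give the MPI library a chance to move data. */
            MPI_Test(req, &done, MPI_STATUS_IGNORE);
        }
    }
    if (!done) {
        MPI_Wait(req, MPI_STATUS_IGNORE);
    }
}
```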
Both ``solutions'' (if they work) are unsatisfying, as they require manual
intervention by the programmer and thus make the code more complex and/or
sacrifice CPU cores.
With \teaMPI, we offer an alternative:
\begin{enumerate}
  \item Users tell \teaMPI\ about particular messages (message tags) which are
  critical to the simulation and to non-blocking data transfer.
  From hereon, \teaMPI\ knows about \emph{critical} message tags.
  \item Whenever the user code calls \texttt{MPI\_Isend}, \teaMPI\ hijacks this
  send, i.e.~it is not executed. This happens only for messages with the right
  tag (see step (1) in the architecture illustration).
  \item Next, \teaMPI\ issues a remote data transfer from the
  \texttt{MPI\_Isend} buffer onto the BlueField (BF), i.e.~the data is copied
  to the BF. This runs completely in parallel to the CPU (as it is driven by
  the BF). See step (2) in the architecture illustration.
  \item Once all data resides on the BF, an \texttt{MPI\_Wait} or
  \texttt{MPI\_Test} on the sender succeeds. Again, \teaMPI\ hides the fact
  that the test or wait does not directly hit the MPI library.
  \item Once all data resides on the BF, the BF issues a non-blocking MPI call
  to the receiver's BF. As the BF has a CPU of its own, it can poll the MPI
  library until this data transfer is complete, i.e.~it takes over the
  responsibility of a progression thread. Alternatively, it can work with plain
  blocking data transfers (step (3) in the illustration).
\end{enumerate}
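One standard mechanism by which such a hijack can be realised is MPI's profiling interface (PMPI), where a library interposes its own \texttt{MPI\_Isend} and forwards uninteresting calls to \texttt{PMPI\_Isend}. The sketch below illustrates this mechanism only; \texttt{is\_critical\_tag} and \texttt{enqueue\_for\_bluefield} are hypothetical helpers, not \teaMPI's actual code, and completing the request immediately is a simplification (in the scheme above, the request completes only once the data resides on the BF):

```c
/* Illustrative PMPI interposition sketch. is_critical_tag() and
   enqueue_for_bluefield() are hypothetical helpers, not teaMPI's API. */
#include <mpi.h>

int  is_critical_tag(int tag);  /* tags registered by the user */
void enqueue_for_bluefield(const void *buf, int count,
                           MPI_Datatype type, int dest, int tag);

int MPI_Isend(const void *buf, int count, MPI_Datatype type, int dest,
              int tag, MPI_Comm comm, MPI_Request *req) {
    if (is_critical_tag(tag)) {
        /* Step (1): do not hand the message to MPI; step (2) then copies
           it to the BlueField asynchronously. */
        enqueue_for_bluefield(buf, count, type, dest, tag);
        /* Simplification: a real implementation completes the request
           only once the data resides on the BF. */
        *req = MPI_REQUEST_NULL;
        return MPI_SUCCESS;
    }
    /* All other messages go straight through to the MPI library. */
    return PMPI_Isend(buf, count, type, dest, tag, comm, req);
}
```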
On the receiver side, we implement the mirrored pattern.
The pattern is not free of drawbacks: it tends to increase the message latency
and requires additional memory (on the BFs).
However, we allow the user to specify manually for which messages it is used,
so it can be tailored to the user's needs.
Even if the technique is only applied to non-urgent messages, it thus relieves
pressure on the MPI subsystem from the host's point of view.
\part{Realisation}
\input{30_architecture}
\end{document}