\chapter{Smart progression} \label{chapter:smart-progression}

\begin{definition}
Let a rank A send data to a rank B. With non-blocking MPI, A issues an \texttt{Isend}, B issues an \texttt{Irecv}, and both ranks continue to compute while MPI transfers the data in the background. In practice, this overlap often does not materialise, since
\begin{enumerate}
  \item MPI implementations require CPU cycles to move large messages or large numbers of messages around and thus have to be called regularly (progressed) via sends, receives or waits;
  \item MPI message exchange suffers from congestion, i.e.~by the time we wait for a particular message, MPI may first have to transfer other messages that are ahead of it in the queue.
\end{enumerate}
\end{definition}

\begin{center}
  \includegraphics[width=0.6\textwidth]{35_smart-progression/architecture.pdf}
\end{center}

\noindent
Progression is discussed frequently in the literature. The standard recommendations for how to handle it are
\begin{enumerate}
  \item make your own code call \texttt{MPI\_Wait}, \texttt{MPI\_Test} or \texttt{MPI\_Probe} at regular intervals (a minimal sketch is given at the end of this chapter);
  \item dedicate one thread to MPI's progression machinery (supported by Intel MPI, for example).
\end{enumerate}

\noindent
Both ``solutions'' (if they work) are unsatisfying, as they require manual intervention by the programmer and thus make the code more complex and/or sacrifice CPU resources. With \teaMPI, we offer an alternative:
\begin{enumerate}
  \item Users tell \teaMPI\ which messages (identified by their message tags) are critical to the simulation and shall be transferred in a non-blocking fashion. From here on, \teaMPI\ knows about these \emph{critical} message tags.
  \item Whenever the user code calls \texttt{MPI\_Isend}, \teaMPI\ hijacks this send, i.e.~it is not handed over to the MPI library. This happens only for messages with a critical tag (step (1) in the architecture illustration); a sketch of such tag-based interception is given at the end of this chapter.
  \item Next, \teaMPI\ issues a remote data transfer of the \texttt{MPI\_Isend} buffer onto the BlueField (BF), i.e.~the data is copied to the BF. This runs completely in parallel to the CPU, as it is driven by the BF (step (2) in the architecture illustration).
  \item Once all data resides on the BF, an \texttt{MPI\_Wait} or \texttt{MPI\_Test} on the sender succeeds. Again, \teaMPI\ hides the fact that the test or wait does not hit the MPI library directly.
  \item Once all data resides on the BF, the BF issues a non-blocking MPI call to the receiver's BF. As the BF has a CPU of its own, it can poll the MPI library until this data transfer is complete, i.e.~it takes over the responsibility of a progression thread. Alternatively, it can work with plain blocking data transfer (step (3) in the illustration).
\end{enumerate}

\noindent
On the receiver side, we implement the mirrored pattern. The pattern is not free of drawbacks: it tends to increase the message latency and requires additional memory on the BFs. However, the user specifies manually for which messages it is used, so it can be tailored to the application's needs. Even if the technique is applied only to non-urgent messages, it relieves pressure on the MPI subsystem from the host's point of view.
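
\noindent
To make the first recommendation above concrete, the following sketch (in C) interleaves chunks of computation with calls to \texttt{MPI\_Test}, so that the MPI library gets regular opportunities to progress a pending transfer. It assumes a request stemming from a preceding \texttt{MPI\_Isend} or \texttt{MPI\_Irecv}; \texttt{compute\_chunk} and \texttt{more\_work\_left} are placeholders for the application's actual work loop.

\begin{verbatim}
#include <mpi.h>

/* Placeholders for the application's own work loop. */
extern int  more_work_left(void);
extern void compute_chunk(void);

/* Overlap local computation with a pending non-blocking transfer
   by polling MPI_Test between compute chunks. */
void overlap_compute_and_transfer(MPI_Request* request) {
  int completed = 0;
  while (more_work_left()) {
    compute_chunk();                      /* a slice of local work */
    if (!completed) {
      /* give MPI CPU cycles to move the message forward */
      MPI_Test(request, &completed, MPI_STATUS_IGNORE);
    }
  }
  if (!completed) {
    MPI_Wait(request, MPI_STATUS_IGNORE); /* ensure completion */
  }
}
\end{verbatim}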
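
\noindent
The tag-based hijacking behind steps (1) and (2) can, in principle, be realised through the MPI profiling interface (PMPI), where the wrapper library provides its own \texttt{MPI\_Isend} that dispatches on the message tag. The sketch below illustrates that idea only: \texttt{is\_critical\_tag} and \texttt{forward\_to\_bluefield} are hypothetical helpers, and neither the BlueField-side logic nor \teaMPI's actual implementation is reproduced here.

\begin{verbatim}
#include <mpi.h>

/* Hypothetical helpers: query the set of registered critical tags
   and hand a send buffer over to the BlueField. */
extern int is_critical_tag(int tag);
extern int forward_to_bluefield(const void* buf, int count,
                                MPI_Datatype datatype, int dest,
                                int tag, MPI_Comm comm,
                                MPI_Request* request);

/* PMPI wrapper: intercepts MPI_Isend and diverts critical tags. */
int MPI_Isend(const void* buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm,
              MPI_Request* request) {
  if (is_critical_tag(tag)) {
    /* steps (1)+(2): hijack the send; the host-side request
       completes once the BlueField holds a copy of the data */
    return forward_to_bluefield(buf, count, datatype, dest, tag,
                                comm, request);
  }
  /* non-critical messages go straight to the MPI library */
  return PMPI_Isend(buf, count, datatype, dest, tag, comm, request);
}
\end{verbatim}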