\chapter{Smart progression}
\label{chapter:smart-progression}

\begin{definition}
 Let a Rank A send data to a Rank B.
 With non-blocking MPI, A issues an \texttt{Isend}, B issues an \texttt{Irecv},
 and both ranks continue to compute while MPI transfers the data.
 In practice, this overlap often does not materialise, as
 \begin{enumerate}
   \item MPI implementations require CPU cycles to move large or numerous
   messages around and thus have to be called regularly (progressed) via sends,
   receives or waits;
   \item MPI message exchange suffers from congestion, i.e.~the moment we wait
   for a particular message, MPI first has to transfer all other messages that
   sit ahead of it in the queue.
 \end{enumerate}
\end{definition}
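
\noindent
To make the problem concrete, the following minimal C sketch shows the
textbook non-blocking exchange described above; the function names
\texttt{exchange} and \texttt{compute} are illustrative only.

\begin{verbatim}
#include <mpi.h>

/* Stands in for the rank-local work that should overlap with the transfer. */
void compute(void) { /* ... */ }

void exchange(double *sendBuf, double *recvBuf, int count, int peer) {
  MPI_Request requests[2];

  MPI_Isend(sendBuf, count, MPI_DOUBLE, peer, /* tag */ 0,
            MPI_COMM_WORLD, &requests[0]);
  MPI_Irecv(recvBuf, count, MPI_DOUBLE, peer, /* tag */ 0,
            MPI_COMM_WORLD, &requests[1]);

  compute();   /* ideally overlaps with the message transfer ...        */

  /* ... yet many implementations move the bulk of a large message only
     here, once the wait hands CPU cycles to the progression engine.    */
  MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
}
\end{verbatim}

\noindent
Without progression, the \texttt{MPI\_Waitall} at the end is often where the
actual data movement happens, and the intended overlap is lost.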


\begin{center}
 \includegraphics[width=0.6\textwidth]{35_smart-progression/architecture.pdf}
\end{center}

\noindent
Progression is a technique frequently discussed in the literature. 
The standard recommendations for handling it are to
\begin{enumerate}
  \item make your own code call \texttt{MPI\_Wait}, \texttt{MPI\_Test} or
  \texttt{MPI\_Probe} at regular intervals (see the sketch below); or
  \item dedicate one thread to MPI's progression machinery (supported by
  Intel MPI, for example).
\end{enumerate}
\noindent
Both ``solutions'' (if they work at all) are unsatisfactory: they require
manual intervention by the programmer and thus make the code more complex
and/or sacrifice CPU resources.
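
The first recommendation, for example, boils down to interleaving the
rank-local computation with explicit \texttt{MPI\_Test} calls. The following
C sketch assumes the local work can be split into chunks; the names
\texttt{computeChunk} and \texttt{computeWithProgression} are purely
illustrative.

\begin{verbatim}
#include <mpi.h>

/* One slice of the rank-local work (illustrative). */
void computeChunk(int chunk) { (void)chunk; /* ... */ }

/* Interleave the local work with MPI_Test so the library can progress
   the pending transfer behind the given request.                      */
void computeWithProgression(MPI_Request *request, int numberOfChunks) {
  int completed = 0;
  for (int chunk = 0; chunk < numberOfChunks; chunk++) {
    computeChunk(chunk);
    if (!completed) {
      MPI_Test(request, &completed, MPI_STATUS_IGNORE); /* donate cycles */
    }
  }
  MPI_Wait(request, MPI_STATUS_IGNORE);  /* guarantee completion */
}
\end{verbatim}

\noindent
Interleaving communication calls with the compute kernel in this way is
exactly the kind of additional code complexity criticised above.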




With \teaMPI, we offer an alternative:
\begin{enumerate}
  \item Users tell \teaMPI\ about particular message tags which are critical
  to the simulation and exchanged via non-blocking data transfer.
  From here on, \teaMPI\ knows about these \emph{critical} message tags.
  \item Whenever the user code calls \texttt{MPI\_Isend}, \teaMPI\ hijacks this
  send, i.e.~it is not executed. This happens only for messages with a critical
  tag (see step (1) in the architecture illustration and the sketch after this
  list).
  \item Next, \teaMPI\ issues a remote data transfer of the
  \texttt{MPI\_Isend} buffer onto the BlueField (BF), i.e.~the data is copied
  to the BF.
  This transfer runs completely in parallel to the host CPU (as it is driven
  by the BF).
  See step (2) in the architecture illustration.
  \item Once all data resides on the BF, an \texttt{MPI\_Wait} or
  \texttt{MPI\_Test} on the sender succeeds. Again, \teaMPI\ hides the fact
  that the test or wait does not directly hit the MPI library.
  \item Once all data resides on the BF, the BF issues a non-blocking MPI call
  to the receiver's BF. As the BF has a CPU of its own, it can poll the MPI
  library until this data transfer is complete, i.e.~it takes over the
  responsibility of a progression thread. Alternatively, it can work with plain
  blocking data transfer (step (3) in the illustration).
\end{enumerate}
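
\noindent
A natural way to realise the interception from step (2) is MPI's profiling
interface, where a wrapper library shadows the \texttt{MPI\_Isend} entry point
and forwards non-critical messages to \texttt{PMPI\_Isend}. The C sketch below
is only our hedged illustration of this mechanism, not \teaMPI's actual
implementation; \texttt{isCriticalTag} and \texttt{copySendBufferToBlueField}
are hypothetical helpers.

\begin{verbatim}
#include <mpi.h>
#include <stdbool.h>

/* Hypothetical helpers (not part of the actual teaMPI API). */
bool isCriticalTag(int tag);
void copySendBufferToBlueField(const void *buf, int count,
                               MPI_Datatype type, int dest, int tag,
                               MPI_Comm comm, MPI_Request *proxyRequest);

/* Wrapper shadowing MPI_Isend: critical tags are redirected to the BF
   (steps (1) and (2)); all other messages fall through to the native
   implementation via the profiling entry point PMPI_Isend.            */
int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request) {
  if (isCriticalTag(tag)) {
    copySendBufferToBlueField(buf, count, datatype, dest, tag, comm, request);
    return MPI_SUCCESS;
  }
  return PMPI_Isend(buf, count, datatype, dest, tag, comm, request);
}
\end{verbatim}

\noindent
An analogous wrapper around \texttt{MPI\_Wait} and \texttt{MPI\_Test} would
then report completion as soon as the buffer has arrived on the BF (step (4)).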


\noindent
On the receiver side, we implement the mirrored pattern.
The pattern is not free of drawbacks: it tends to increase message latency
and requires additional memory on the BFs.
However, the user specifies manually for which messages it is used, so it can
be tailored to the application's needs.
Even if the technique is applied only to non-urgent messages, it thus takes
pressure off the MPI subsystem from the host's point of view.