
% 00_vision.tex (committed by Philipp Samfaß, May 06, 2021)

\chapter{Vision and Introductory Remarks}

\teaMPI\ is an open source library written in C++. It plugs into MPI via
MPI's PMPI interface.
%plus provides an additional interface for
%advanced task-based parallelism on distributed memory architectures.
Its research vision reads as follows:

\begin{enumerate}
% \item {\bf Intelligent task-based load balancing}. Applications can hand over
% tasks to \teaMPI. These tasks have to be ready, i.e.~without incoming
% dependencies, and both their input and their output have to be serialisable.
% It is then up to \teaMPI\ to decide whether a task is injected into the local
% runtime or temporarily moved to another rank, where we compute it and then
% bring back its result.
% \item {\bf MPI idle time load metrics}. \teaMPI\ can plug into some MPI
% calls---hidden from the application---and measure how long this MPI call
% idles, i.e.~waits for incoming messages. It provides lightweight
% synchronisation mechanisms between MPI ranks such that MPI ranks can globally
% identify ranks that are very busy and ranks which tend to wait for MPI
% messages. Such data can be used to guide load balancing. Within \teaMPI, it
% can be used to instruct the task-based load balancing how to move data around.
  \item {\bf Black-box replication}. By hijacking MPI calls, \teaMPI\ can
  split up the global number $N$ of ranks into $T$ teams of equal size. Each
  team assumes that there are only $N/T$ ranks in the system and thus runs
  completely independently of the other teams.
  This splitting is completely hidden from the application. \teaMPI\ however
  provides a heartbeat mechanism which identifies if one team becomes slower
  and slower. This can be used as a guideline for resiliency---assuming that
  failures first manifest in a speed deterioration of the team hosting the
  rank that will eventually fail.
  \item {\bf Replication task sharing}. \teaMPI\ for teams can identify tasks
  that are replicated in different teams. Every time the library detects that
  a task handed over to \teaMPI\ has been computed and is replicated on
  another team, it can take the task's outcome, copy it over to the other
  team, and cancel the task execution there. This massively reduces the
  overhead cost of resiliency via replication (teams). This feature has not
  been integrated into \teaMPI\ yet, but has rather been implemented and
  evaluated directly in the ExaHyPE engine
  (\url{https://gitlab.lrz.de/exahype/ExaHyPE-Engine/-/tree/philipp/tasksharing_offloading}).
\end{enumerate}

\teaMPI\ is compatible with SmartMPI
(\url{https://gitlab.lrz.de/prototypes/mpi_offloading}), a library for
exploring SmartNICs. For access to SmartMPI, please contact Philipp Samfass
at \url{samfass@in.tum.de}. Our vision is to use \teaMPI\ and SmartMPI in
conjunction to achieve the following objectives:

\begin{enumerate}
  \item {\bf Smart progression and smart caching}. If a SmartNIC (Mellanox
  BlueField) is available to a node, \teaMPI\ can run a dedicated helper code
  on the SmartNIC which polls MPI all the time. For dedicated MPI messages
  (tasks), it can hijack MPI: if a rank A sends data to a rank B, the MPI
  send is actually deployed to the SmartNIC, which may fetch the data
  directly from A's memory (RDMA). The data is in turn cached on rank B's
  side, from where it is directly deployed into B's memory if B has issued a
  non-blocking receive. Otherwise, it is at least available on the SmartNIC,
  where we cache it until it is requested.
  \item {\bf Smart sniffing}.
  Realtime-guided load balancing in HPC typically suffers from the fact that
  the load balancing is unable to distinguish ill-balancing from network
  congestion. As a result, we can construct situations where a congested
  network suggests to the load balancing that ranks are idle, and the load
  balancing consequently starts to move data around. The network then
  experiences even more stress and we enter a feedback cycle. With SmartNICs,
  we can deploy heartbeats to the network device and distinguish network
  problems from ill-balancing---eventually enabling smarter timing-based
  load balancing.
  \item {\bf Smart balancing}. With SmartNICs, \teaMPI\ can outsource its
  task-based node balancing completely to the network. All distribution
  decisions and data movements are then handled by the network card rather
  than the main CPU.
  \item {\bf Smart replication}. With SmartNICs, \teaMPI\ can outsource its
  replication functionality, including the task distribution and
  replication, to the network.
\end{enumerate}

\section*{History and literature}

\teaMPI\ was started as an MScR project by Benjamin Hazelwood under the
supervision of Tobias Weinzierl. It has since been extended by Philipp
Samfass as part of his PhD thesis. Within this scope, some aspects of
\teaMPI's\ vision were evaluated together with ExaHyPE
(\url{https://gitlab.lrz.de/exahype/ExaHyPE-Engine}).

Important note: up to now, \teaMPI\ is not aware of tasks! Task offloading
and task sharing were implemented in ExaHyPE using \teaMPI's\ interface and
transparent replication functionality. Future work will need to investigate
how \teaMPI\ can be made aware of tasks and how ExaHyPE's prototypical task
offloading and task sharing algorithms can be extracted into \teaMPI.
Two core papers describing the research behind the library are
\begin{itemize}
  \item Philipp Samfass, Tobias Weinzierl, Benjamin Hazelwood, Michael Bader:
  \emph{TeaMPI -- Replication-based Resilience without the (Performance)
  Pain} (published at ISC 2020). \url{https://arxiv.org/abs/2005.12091}
  \item Philipp Samfass, Tobias Weinzierl, Dominic E. Charrier, Michael
  Bader: \emph{Lightweight Task Offloading Exploiting MPI Wait Times for
  Parallel Adaptive Mesh Refinement} (CPE 2020; in press).
  \url{https://arxiv.org/abs/1909.06096}
\end{itemize}

\section*{Dependencies and prerequisites}

\teaMPI's core is plain C++11 code. We do, however, use a whole set of tools
around it:
\begin{itemize}
  \item CMake 3.05 or newer (required).
  \item A C++11-compatible C++ compiler (required).
  %\item MPI 3. MPI's multithreaded support and non-blocking collectives
  %(required).
  \item Doxygen if you want to create HTML pages or PDFs of the in-code
  documentation.
\end{itemize}

\section*{Who should read this document}

This guidebook is written for users of \teaMPI\ and for people who want to
extend it. Presently, the guidebook is limited to discussing only the most
important aspects of \teaMPI\ in a very condensed form. For more
documentation or further help, please feel free to contact
\url{samfass@in.tum.de}.

%The text is thus organised into three parts:
%First, we describe how to build, install and use \teaMPI.
%Second, we describe the vision and rationale behind the software as well as
%its application scenarios.
%Third, we describe implementation specifica.

{
  \flushright
  \today \\
  Philipp Samfass, Tobias Weinzierl \\
}