Commit f9d85088 authored by Philipp Samfaß

changed docu

parent 72291018
@@ -34,14 +34,19 @@ Its research vision reads as follows:
\teaMPI, it can take the task's outcome, copy it over to the other team, and
cancel the task execution there. This reduces the overhead cost of
resiliency via replication (teams) massively.
\end{enumerate}
\teaMPI\ is compatible with SmartMPI (\url{https://gitlab.lrz.de/prototypes/mpi_offloading}), which is a library for exploring SmartNICs.
Our vision is to use teaMPI and SmartMPI in conjunction to achieve the following objectives:
\begin{enumerate}
\item {\bf Smart progression and smart caching}. If a SmartNIC (Mellanox
BlueField) is available to a node, \teaMPI\ can run dedicated helper code on the SmartNIC which
polls MPI all the time. For dedicated MPI messages (tasks), it can hijack
MPI. If a rank A sends data to a rank B, the MPI send is actually deployed to
the SmartNIC, which may fetch the data directly from A's memory (RDMA). The data is in
turn cached on rank B's side, from where it is deployed directly into the memory of B if
B has issued a non-blocking receive. Otherwise, it is at least available on
the SmartNIC, where we cache it until it is requested.
\item {\bf Smart snif}. Real-time-guided load balancing in HPC typically
suffers from the fact that the load balancer is unable to distinguish
ill-balancing from network congestion. As a result, we can construct situations
@@ -66,10 +71,17 @@ Its research vision reads as follows:
\teaMPI\ was started as an MScR project by Benjamin Hazelwood under the
supervision of Tobias Weinzierl.
After that, it has been partially extended
by Philipp Samfass as part of his PhD thesis.
Within this scope, some aspects of \teaMPI's\ vision were evaluated together
with ExaHyPE (\url{https://gitlab.lrz.de/exahype/ExaHyPE-Engine}).
Important note: up to now, \teaMPI\ is not aware of tasks!
Task offloading and task sharing were implemented in ExaHyPE using \teaMPI's\ interface
and transparent replication functionality.
Future work will need to investigate how \teaMPI\ can be made aware of tasks and how ExaHyPE's
prototypical task offloading and task sharing algorithms can be extracted into \teaMPI.
Two core papers describing the research behind the library are
\begin{itemize}
\item Philipp Samfass, Tobias Weinzierl, Benjamin Hazelwood, Michael
@@ -84,15 +96,14 @@ The two core papers describing the research behind the library are
\section*{Dependencies and prerequisites}
\teaMPI's core is plain C++11 code.
However, we use a whole set of tools around it:
\begin{itemize}
\item CMake 3.05 or newer (required).
\item C++11-compatible C++ compiler (required).
\item MPI 3 with multithreaded support and non-blocking collectives
(required).
\item Intel's Threading Building Blocks (TBB) or OpenMP 4.5 (required).
\item Doxygen if you want to create HTML pages or PDFs of the in-code
documentation.
\end{itemize}
@@ -117,4 +128,4 @@ Third, we describe implementation specifics.
\\
}
\ No newline at end of file
\chapter{Installation}
We use CMake/CCMake to build teaMPI.
To build, follow these steps from inside the root folder:
\begin{code}
cd lib
mkdir build
cd build
cmake ..
make
\end{code}
If you want to build with SmartMPI support, use
\begin{code}
cd lib
mkdir build_smartmpi
cd build_smartmpi
cmake -DENABLE_SMARTMPI=1 ..
make
\end{code}
This will build the library inside the build* directory.
% There are two ways to obtain \Peano:
% You can either download one of the archives we provide on the webpage, or you
% can work directly against a repository clone.
\chapter{Using \teaMPI}
Using \teaMPI\ is as simple as linking \teaMPI\ to the application and setting \texttt{LD\_LIBRARY\_PATH} to point to \teaMPI:
\begin{enumerate}
\item Link with \texttt{-ltmpi -L<path to teaMPI>}
\item Add the path to \teaMPI\ to \texttt{LD\_LIBRARY\_PATH} (see the sketch below)
\end{enumerate}
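For illustration, these two steps might look as follows on the command line. This is a minimal sketch; \texttt{myApplication} and the path to the \teaMPI\ checkout are placeholders:
\begin{code}
# link the application against teaMPI (libtmpi.so); the path is a placeholder
mpicxx -o myApplication myApplication.o -L/path/to/teaMPI/lib -ltmpi

# make the shared library visible at run time
export LD_LIBRARY_PATH=/path/to/teaMPI/lib:$LD_LIBRARY_PATH
\end{code}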
\section{Running with \teaMPI}
Please set the number of teams with the \texttt{TEAMS} environment variable (default: 2).
To use the provided example miniapps:
\begin{enumerate}
  \item run \texttt{make} in the applications folder;
  \item run each application in the bin folder with the required command line parameters (documented in each application folder), as in the sketch below.
\end{enumerate}
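As a concrete sketch (the miniapp name, rank count and command line parameters are placeholders):
\begin{code}
cd applications
make

# run a miniapp with two teams; with 4 MPI ranks and TEAMS=2,
# each team consists of 2 ranks
TEAMS=2 mpirun -np 4 ./bin/someMiniapp <its command line parameters>
\end{code}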
\section{Example Heartbeat Usage}
The following code models the structure of many scientific applications. Per loop iteration, the two \texttt{MPI\_Sendrecv} calls act as heartbeats.
The first starts the timer for this rank and the second stops it. Additionally, the second heartbeat passes
the data buffer for comparison with the other teams. Only a hash of the data is sent.
At the end of the application, the heartbeat times will be written to CSV files.
\begin{code}
double data[SIZE];
for (int t = 0; t < NUM_TRIALS; t++)
{
  MPI_Barrier(MPI_COMM_WORLD);

  // Start heartbeat: zero-size Sendrecv to MPI_PROC_NULL with send tag 1
  MPI_Sendrecv(MPI_IN_PLACE, 0, MPI_BYTE, MPI_PROC_NULL, 1, MPI_IN_PLACE, 0,
               MPI_BYTE, MPI_PROC_NULL, 0, MPI_COMM_SELF, MPI_STATUS_IGNORE);

  for (int i = 0; i < NUM_COMPUTATIONS; i++) {
    // Arbitrary computation on data
  }

  // End heartbeat and compare data: send tag -1, only a hash of the
  // data buffer is exchanged between teams
  MPI_Sendrecv(data, SIZE, MPI_DOUBLE, MPI_PROC_NULL, -1, MPI_IN_PLACE, 0,
               MPI_BYTE, MPI_PROC_NULL, 0, MPI_COMM_SELF, MPI_STATUS_IGNORE);

  MPI_Barrier(MPI_COMM_WORLD);
}
\end{code}
@@ -8,12 +8,8 @@
into the PMPI interface. As a result, \teaMPI\ can hijack MPI calls, map them
onto subcommunicators (subsets of ranks) or trigger special functions besides
the MPI core functionality (see the sketch after this list).
\item In the future, \teaMPI\ may provide an API such that programmers can inform
it about tasks, for example, or query load balancing information.
\item \teaMPI\ maps communicators and ranks appropriately to SmartMPI, which can take care of message progression.
\end{itemize}
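To illustrate the interception mechanism, the following is a generic sketch of a PMPI wrapper, not \teaMPI's actual wrapper code; the communicator mapping is only a placeholder:
\begin{code}
#include <mpi.h>

// Generic PMPI interception sketch: the linker resolves the application's
// MPI_Send to this wrapper. The wrapper may remap the communicator (e.g.
// onto a team communicator) or add bookkeeping, and then forwards the call
// to the actual MPI implementation through the PMPI entry point.
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
  MPI_Comm mappedComm = comm;  // placeholder for a communicator mapping
  return PMPI_Send(buf, count, datatype, dest, tag, mappedComm);
}
\end{code}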
\chapter{Smart progression}
\label{chapter:smart-progression}
\begin{definition}
Let a Rank A send data to a Rank B.
With non-blocking MPI, A issues an \texttt{Isend}, B issues an \texttt{Irecv},
and both ranks continue to compute while MPI transfers the data.
In practice this does not happen, as
\begin{enumerate}
\item MPI implementations require CPU cycles to move big or numerous
messages around and thus have to be called regularly (progressed) via sends,
receives or waits.
\item MPI message exchange suffers from congestion, i.e.~the moment we wait
for a particular message, MPI may first have to transfer other messages
which are ahead in the queue.
\end{enumerate}
\end{definition}
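In plain MPI, the sender side of this pattern reads roughly as follows. This is a minimal sketch: the buffer, message size, destination rank and the \texttt{compute} callback are placeholders for the application's data and work.
\begin{code}
#include <mpi.h>

// Rank A (sender); Rank B mirrors this with an MPI_Irecv.
void sendWithOverlap(const double* buffer, int N, int rankB, int tag,
                     MPI_Comm comm, void (*compute)()) {
  MPI_Request request;
  MPI_Isend(buffer, N, MPI_DOUBLE, rankB, tag, comm, &request);

  compute();  // ideally overlaps with the data transfer ...

  // ... but in practice the transfer often only progresses once we
  // re-enter MPI here
  MPI_Wait(&request, MPI_STATUS_IGNORE);
}
\end{code}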
\begin{center}
\includegraphics[width=0.6\textwidth]{35_smart-progression/architecture.pdf}
\end{center}
\noindent
Progression is a technique frequently discussed in the literature.
The standard recommendations on how to handle it are
\begin{enumerate}
\item make your own code call \texttt{MPI\_Wait}, \texttt{MPI\_Test},
\texttt{MPI\_Probe} in regular intervals (see the sketch below the list);
\item dedicate one thread to MPI's progression machinery (supported by
Intel MPI, e.g.).
\end{enumerate}
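Recommendation (1), for example, boils down to sprinkling test calls throughout the computation. This is a minimal sketch; the buffer, destination rank, \texttt{numChunks} and the \texttt{computeChunk} callback are placeholders:
\begin{code}
#include <mpi.h>

// Manual progression: interleave MPI_Test calls with the computation so
// that the MPI implementation gets CPU cycles to move the message forward.
void sendWithManualProgression(const double* buffer, int N, int rankB, int tag,
                               MPI_Comm comm, int numChunks,
                               void (*computeChunk)(int)) {
  MPI_Request request;
  int complete = 0;
  MPI_Isend(buffer, N, MPI_DOUBLE, rankB, tag, comm, &request);

  for (int chunk = 0; chunk < numChunks; chunk++) {
    computeChunk(chunk);                                  // user computation
    if (!complete) {
      MPI_Test(&request, &complete, MPI_STATUS_IGNORE);   // progress MPI
    }
  }
  if (!complete) {
    MPI_Wait(&request, MPI_STATUS_IGNORE);
  }
}
\end{code}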
\noindent
Both ``solutions'' (if they work) are unsatisfying, as they require manual
intervention by programmers and thus make code more complex and/or sacrifice CPU
resources.
With \teaMPI, we offer an alternative:
\begin{enumerate}
\item Users tell \teaMPI\ about particular messages (message tags) which are
critical to the simulation and exchanged via non-blocking data transfer.
From hereon, \teaMPI\ knows about \emph{critical} message tags.
\item Whenever the user code calls \texttt{MPI\_Isend}, \teaMPI\ hijacks this
send, i.e.~it is not executed. This happens only for messages with the right
tag (see step (1) in the architecture illustration).
\item Next, \teaMPI\ issues a remote data transfer of the
\texttt{MPI\_Isend} buffer onto the BlueField (BF), i.e.~data is copied to the
BF.
This runs totally parallel to the CPU (as it is driven by the BF).
See step (2) in the architecture illustration.
\item Once all data resides on the BF, an \texttt{MPI\_Wait} or
\texttt{MPI\_Test} on the sender succeeds. Again, \teaMPI\ hides away the fact
that the test or wait does not directly hit the MPI library.
\item Once all data resides on the BF, the BF issues a non-blocking MPI call
to the receiver's BF. As the BF has a CPU of its own, it can poll the MPI
library until this data transfer is complete, i.e.~it takes over the
responsibility of a progression thread. Alternatively, it can work with plain
blocking data transfer (step (3) in the illustration).
\end{enumerate}
\noindent
On the receiver side, we implement the mirrored pattern.
The pattern is not free of drawbacks: it tends to increase the message latency
and requires additional memory (on the BFs).
However, we allow the user to specify manually for which messages it is used, so
it can be tailored towards the user's needs.
Even if the technique is only applied to non-urgent messages, it thus takes
pressure off the MPI subsystem from a host's point of view.
[35_smart-progression/architecture.svg: Inkscape source of the architecture figure referenced above. The figure shows "Rank A (sender)" and "Rank B (sender)" connected by an arrow labelled "Normal non-blocking MPI", two further boxes above them connected by an arrow labelled "Either normal non-blocking MPI with permanent polling or blocking MPI", and step markers (1), (2), (3).]
@@ -30,13 +30,13 @@
\part{Building, installing and using \teaMPI}
\input{10_installation}
\input{11_usage}
%\input{12_developer}
%\part{Use cases}
\part{Realisation}
\input{30_architecture}
%\input{35_smart-progression}
\end{document}
SMARTMPI_LIB=../../mpi_offloading/build/smartmpi_lib
SMARTMPI_INC=../../mpi_offloading/smartmpi_lib
CC=mpicxx
CFLAGS += -fPIC -g -Wall -std=c++11 -I${SMARTMPI_INC}
LDFLAGS += -shared -L${SMARTMPI_LIB} -lsmartmpi
SRC = Rank.cpp RankControl.cpp Timing.cpp Wrapper.cpp teaMPI.cpp CommStats.cpp
DEP = Rank.h RankControl.h Timing.h Wrapper.h Logging.h teaMPI.h CommStats.h
OBJECTS = $(SRC:.cpp=.o)
TARGET = libtmpi.so
.PHONY : clean
all: $(TARGET)
%.o: %.cpp $(DEP)
	$(CC) $(CFLAGS) -c $< -o $@

$(TARGET) : $(OBJECTS)
	$(CC) $(LDFLAGS) $^ -o $@

clean:
	rm -f $(OBJECTS) $(TARGET)