Commit 9489c3c2 authored by Philipp Samfaß

some more docu and minor improvements for communication statistics

parent 9c195827
\chapter{Vision and Introductory Remarks}
\teaMPI\ is an open source library built with C++. It plugs into MPI via
MPI's PMPI interface.
%plus provides an additional interface for advanced
%advanced task-based parallelism on distributed memory architectures.
Its research vision reads as follows:
\begin{enumerate}
% \item {\bf Intelligent task-based load balancing}. Applications can hand over
% tasks to \teaMPI. These tasks have to be ready, i.e.~without incoming
% dependencies, and both their input and their output have to be serialisable.
% It is now up to \teaMPI\ to decide whether a task is injected into the local runtime or temporarily moved
% to another rank, where we compute it and then bring back its result.
% \item {\bf MPI idle time load metrics}. \teaMPI\ can plug into some MPI
% calls---hidden from the application---and measure how long this MPI call
% idles, i.e.~waits for incoming messages. It provides lightweight
% synchronisation mechanisms between MPI ranks such that MPI ranks can globally
% identify ranks that are very busy and ranks which tend to wait for MPI
% messages. Such data can be used to guide load balancing. Within \teaMPI, it
% can be used to instruct the task-based load balancing how to move data around.
\item {\bf Black-box replication}. By hijacking MPI calls, \teaMPI\ can split
up the global number $N$ of ranks into $T$ teams of equal size. Each team
assumes that there are only $N/T$ ranks in the system and thus runs completely
independently of the other teams. This splitting is completely hidden from the
application. \teaMPI\ however provides a heartbeat mechanism which detects
whether one team becomes progressively slower. This can be used as a guideline for
resiliency, assuming that ranks which will eventually fail first manifest
themselves through a speed deterioration of their team. A minimal sketch of
such a team split is given right after this list.
\item {\bf Replication task sharing}. \teaMPI\ for teams can identify tasks
that are replicated in different teams. Every time the library detects that a
task handed over to \teaMPI\ and replicated on another team has been computed,
it can take the task's outcome, copy it over to the other team, and cancel the
task execution there. This massively reduces the overhead cost of
resiliency via replication (teams). This feature has not yet been integrated
into \teaMPI; it has instead been implemented and evaluated directly in the ExaHyPE engine (\url{https://gitlab.lrz.de/exahype/ExaHyPE-Engine/-/tree/philipp/tasksharing_offloading}).
\end{enumerate}
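To illustrate the black-box replication idea from the list above, the following
sketch shows how a PMPI wrapper could, in principle, split \texttt{MPI\_COMM\_WORLD}
into teams. This is a simplified illustration rather than \teaMPI's\ actual
implementation; the \texttt{TEAMS} environment variable and the contiguous
rank-to-team mapping are assumptions made for this example only.
\begin{code}
#include <mpi.h>
#include <cstdlib>

static MPI_Comm teamComm = MPI_COMM_NULL;

// Intercept MPI_Init via the PMPI interface and split the world
// into equally sized teams (assumes worldSize is divisible by numTeams).
int MPI_Init(int *argc, char ***argv) {
  int err = PMPI_Init(argc, argv);
  int worldRank, worldSize;
  PMPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
  PMPI_Comm_size(MPI_COMM_WORLD, &worldSize);

  // Assumed convention: the number of teams is passed via an environment variable.
  const char* teamsEnv = std::getenv("TEAMS");
  int numTeams = (teamsEnv != nullptr) ? std::atoi(teamsEnv) : 2;
  int teamSize = worldSize / numTeams;
  int team     = worldRank / teamSize;   // contiguous rank-to-team mapping

  PMPI_Comm_split(MPI_COMM_WORLD, team, worldRank, &teamComm);
  return err;
}

// All other wrapped MPI calls then substitute teamComm for MPI_COMM_WORLD,
// so each team only ever observes N/T ranks.
\end{code}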
\teaMPI\ is compatible with SmartMPI (\url{https://gitlab.lrz.de/prototypes/mpi_offloading}), a library for exploring SmartNICs.
For access to SmartMPI, please contact Philipp Samfass at \url{samfass@in.tum.de}.
Our vision is to use \teaMPI\ and SmartMPI in conjunction to achieve the following objectives:
\begin{enumerate}
\item {\bf Smart progression and smart caching}. If a SmartNIC (Mellanox
BlueField) is available to a node, \teaMPI\ can run dedicated helper code on the SmartNIC which
polls MPI all the time. For dedicated MPI messages (tasks), it can hijack
MPI: if a rank A sends data to a rank B, the MPI send is actually deployed to
the SmartNIC, which may fetch the data directly from A's memory (RDMA). The data is in
turn cached on rank B's SmartNIC, from where it is deployed directly into B's memory if
B has issued a non-blocking receive. Otherwise, it is at least available on
the SmartNIC, where we cache it until it is requested.
\item {\bf Smart sniffing}. Realtime-guided load balancing in HPC typically
suffers from the fact that the load balancing is unable to distinguish
load imbalance from network congestion. As a result, we can construct situations
where a congested network suggests to the load balancing that ranks were idle,
and the load balancing consequently starts to move data around. The network
then experiences even more stress and we enter a feedback cycle. With
SmartNICs, we can deploy heartbeats to the network device and distinguish
network problems from load imbalance, eventually enabling smarter timing-based
load balancing.
\item {\bf Smart balancing}. With SmartNICs, \teaMPI\ can outsource its
task-based node balancing completely to the network. All distribution
decisions and data movements are handled by the network card rather than
the main CPU.
\item {\bf Smart replication}. With SmartNICs, \teaMPI\ can outsource its
replication functionality including the task distribution and replication to
the network.
\end{enumerate}
\section*{History and literature}
\teaMPI\ was started as an MScR project by Benjamin Hazelwood under the
supervision of Tobias Weinzierl.
After that, it has been partially extended
by Philipp Samfass as part of his PhD thesis.
Within this scope, some aspects of \teaMPI's\ vision were evaluated together
with ExaHyPE (\url{https://gitlab.lrz.de/exahype/ExaHyPE-Engine}).
Important note: up to now, \teaMPI\ is not aware of tasks!
Task offloading and task sharing were implemented in ExaHyPE using \teaMPI's\ interface
and transparent replication functionality.
Future work will need to investigate how \teaMPI\ can be made aware of tasks and how ExaHyPE's
prototypical task offloading and task sharing algorithms can be extracted into \teaMPI.
Two core papers describing the research behind the library are
\begin{itemize}
\item Philipp Samfass, Tobias Weinzierl, Benjamin Hazelwood, Michael
Bader: \emph{TeaMPI -- Replication-based Resilience without the (Performance)
Pain} (published at ISC 2020) \url{https://arxiv.org/abs/2005.12091}
\item Philipp Samfass, Tobias Weinzierl, Dominic E. Charrier, Michael Bader:
\emph{Lightweight Task Offloading Exploiting MPI Wait Times for Parallel
Adaptive Mesh Refinement} (CPE 2020; in press)
\url{https://arxiv.org/abs/1909.06096}
\end{itemize}
\section*{Dependencies and prerequisites}
\teaMPI's core is plain C++11 code.
However, we use a whole set of tools around it:
\begin{itemize}
\item CMake 3.05 or newer (required).
\item C++11-compatible C++ compiler (required).
%\item MPI 3. MPI's multithreaded support and non-blocking collectives
%(required).
\item Doxygen if you want to create HTML pages or PDFs of the in-code
documentation.
\end{itemize}
\section*{Who should read this document}
This guidebook is written for users of \teaMPI, and for people who want to
extend it.
Presently, the guidebook is limited to discussing only the most important aspects of \teaMPI\ in a very
condensed form.
For more documentation or further help, please feel free to contact \url{samfass@in.tum.de}.
%The text is thus organised into three parts:
%First, we describe how to build, install and use \teaMPI.
%Second, we describe the vision and rationale behind the software as well as its
%application scenarios.
%Third, we describe implementation specifica.
{
\flushright
\today
\\
Philipp Samfass,
Tobias Weinzierl
\\
}
@@ -75,4 +75,12 @@ for (int t = 0; t < NUM_TRIALS; t++)
}
\end{code}
\section{Tracking P2P Communication Statistics}
TeaMPI can be configured at build time to plug into commonly used point-to-point (p2p) communication routines in order to track the overall communication volume per team and per rank:
\begin{code}
cmake -DENABLE_COMM_STATS=on ..
\end{code}
The communication statistics can be dumped into files with a filename prefix set by the environment variable \texttt{TMPI\_STATS\_FILENAME\_PREFIX} and an output directory path set
by the environment variable \texttt{TMPI\_STATS\_OUTPUT\_PATH}.
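For example, a run that dumps the statistics could be launched as follows
(the application name and the paths are placeholders):
\begin{code}
export TMPI_STATS_FILENAME_PREFIX=commstats
export TMPI_STATS_OUTPUT_PATH=/path/to/output
mpirun -np 8 ./myapplication
\end{code}
TeaMPI then automatically appends the team number and rank to the filename prefix when writing the files.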
@@ -11,6 +11,7 @@ include_directories(${MPI_INCLUDE_PATH})
#option(ENABLE_ASSERTS "Compiles assertions if set" OFF)
#option(ENABLE_DEBUG "Compile debug mode" OFF)
option(ENABLE_SMARTMPI "Enables support for SmartMPI library" OFF)
option(ENABLE_COMM_STATS "Track communication statistics for p2p operations" OFF)
set(COMMUNICATION_MODE_STRINGS "client_server_mpi" "client_server_gpi")
set(SMARTMPI_COMMUNICATION_MODE "client_server_mpi" CACHE STRING "default communication mode")
@@ -64,6 +65,10 @@ endif()
endif()
if(ENABLE_COMM_STATS)
target_compile_definitions(tmpi PUBLIC COMM_STATS)
endif()
target_compile_options(tmpi PRIVATE ${CMAKE_CXX_COMPILER_FLAGS})
#CC=mpiicpc
@@ -3,7 +3,7 @@
* CommStats.cpp
*
* Created on: 4 Feb 2020
 * Author: Philipp Samfass
*/
#include "CommStats.h"
@@ -22,12 +22,8 @@
#include <unistd.h>
#include <list>
#include <vector>
struct CommunicationStatistics::CommunicationStats stats;
size_t CommunicationStatistics::computeCommunicationVolume(MPI_Datatype datatype, int count) {
int size;
@@ -50,7 +46,7 @@ void CommunicationStatistics::outputCommunicationStatistics() {
std::cout.flush();
std::string filenamePrefix = getEnvString("TMPI_STATS_FILENAME_PREFIX");
std::string outputPathPrefix = getEnvString("TMPI_STATS_OUTPUT_PATH");
if (!filenamePrefix.empty()) {
/**
* @file CommStats.h
* @author Philipp Samfass
* @brief Communication statistics track the communication volume and
* can output how many bytes were sent/received (currently only for p2p communication).
*/
#ifndef STATISTICS_H_
#define STATISTICS_H_
#include <mpi.h>
#include <atomic>
/**
* Communication statistics track the communication volume and
* can output how many bytes were sent/received.
*/
namespace CommunicationStatistics {
/**
* Struct for storing the communication statistics
*/
struct CommunicationStats {
std::atomic<size_t> sentBytes;     ///< number of sent bytes
std::atomic<size_t> receivedBytes; ///< number of received bytes
};
/**
 * Computes the communication volume in bytes for a message of given datatype and count.
* @param datatype The MPI datatype being sent/received.
* @param count The number of datatype elements.
*/
size_t computeCommunicationVolume(MPI_Datatype datatype, int count);
/**
* Tracks a send operation in the statistics.
* @param datatype The MPI datatype being sent.
* @param count The number of datatype elements.
*/
void trackSend(MPI_Datatype datatype, int count);
/**
* Tracks a receive operation in the statistics.
* @param datatype The MPI datatype being received.
* @param count The number of datatype elements.
*/
void trackReceive(MPI_Datatype datatype, int count);
/**
* Dumps the communication statistics into a file.
* The output path and filename prefix are read from environment variables
* TMPI_STATS_OUTPUT_PATH and TMPI_STATS_FILENAME_PREFIX.
* TeaMPI automatically appends team number and rank to the filename prefix.
*/
void outputCommunicationStatistics();
}
@@ -103,7 +103,7 @@ int MPI_Comm_free(MPI_Comm *comm) {
int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm) {
#ifdef COMM_STATS
CommunicationStatistics::trackSend(datatype, count);
#endif
//assert(comm == MPI_COMM_WORLD);
@@ -136,7 +136,7 @@ int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
err = PMPI_Recv(buf, count, datatype, source, tag, getTeamComm(comm), status);
}
#else
#ifdef COMM_STATS
CommunicationStatistics::trackReceive(datatype, count);
#endif
err = PMPI_Recv(buf, count, datatype, source, tag, getTeamComm(comm), status);
@@ -176,7 +176,7 @@ int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest,
err = PMPI_Isend(buf, count, datatype, dest, tag, getTeamComm(comm), request);
}
#else
#ifdef COMM_STATS
CommunicationStatistics::trackSend(datatype, count);
#endif
err = PMPI_Isend(buf, count, datatype, dest, tag, getTeamComm(comm), request);
@@ -188,7 +188,7 @@ int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest,
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
MPI_Comm comm, MPI_Request *request) {
//assert(comm == MPI_COMM_WORLD);
#ifdef COMM_STATS
CommunicationStatistics::trackReceive(datatype, count);
#endif
int err = PMPI_Irecv(buf, count, datatype, source, tag, getTeamComm(comm), request);
@@ -356,7 +356,7 @@ int MPI_Finalize() {
}
smpi_finalize();
#endif
#ifdef COMM_STATS
CommunicationStatistics::outputCommunicationStatistics();
#endif
#ifdef DirtyCleanUp
/**
* @file teaMPI.h
* @mainpage TeaMPI
*
* Welcome to the source code documentation of teaMPI.
* TeaMPI is a PMPI wrapper library to achieve transparent replication of MPI processes.
 * Please find more documentation on building, installing, and running with teaMPI in the guidebook in the documentation folder.
* @image html ./teaMPI.jpg
*/
#ifndef _TEAMPI_H_
@@ -8,11 +12,35 @@
#include <mpi.h>
/**
* Returns an MPI communicator for inter-team replication.
* Currently, only communication between replica ranks is allowed.
* E.g., team 0 rank x can only communicate with team 1 rank x and team 2 rank x.
* A rank in the returned MPI communicator refers to the team number.
* E.g., if we send something to rank 2 on the communicator, the data is transferred
* to the replica rank in team 2.
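 *
 * A minimal usage sketch (hypothetical application code, assuming at least two teams):
 * @code
 * MPI_Comm interTeam = TMPI_GetInterTeamComm();
 * double payload = 42.0;
 * if (TMPI_GetTeamNumber() == 0) {
 *   // send to the replica of this rank in team 1
 *   MPI_Send(&payload, 1, MPI_DOUBLE, 1, 0, interTeam);
 * } else if (TMPI_GetTeamNumber() == 1) {
 *   // receive from the replica of this rank in team 0
 *   MPI_Recv(&payload, 1, MPI_DOUBLE, 0, 0, interTeam, MPI_STATUS_IGNORE);
 * }
 * @endcode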
*/
MPI_Comm TMPI_GetInterTeamComm();
/**
* Returns the number of the team the calling rank belongs to.
*/
int TMPI_GetTeamNumber();
/**
* Returns the world rank in MPI_COMM_WORLD of the calling rank.
*/
int TMPI_GetWorldRank();
/**
* Returns the size of the inter-team communicator which is equal to the number of teams (but may be changed in the future if required).
*/
int TMPI_GetInterTeamCommSize();
/**
* Returns 1 if the calling rank belongs to team 0.
* This routine is useful if only a single team should execute some actions such as file I/O.
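 *
 * Sketch (hypothetical application code; writeOutput() is a placeholder):
 * @code
 * if (TMPI_IsLeadingRank()) {
 *   // only team 0 performs the file I/O
 *   writeOutput();
 * }
 * @endcode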
*/
int TMPI_IsLeadingRank();
#endif