\chapter{Running \exahype\ on some supercomputers}
\label{sec:apx-supercomputers}


In this chapter, we collect some remarks on our experiences of how to use \exahype\ on
particular supercomputers.

\section{Hamilton (Durham's local supercomputer)}
\label{section:supercomputers:Hamilton}


We have successfully tested \exahype\ with the following modules on Hamilton 7:
\begin{code}
module load intel/xe_2017.2 
module load intelmpi/intel/2017.2
module load gcc
\end{code}

\noindent
Given \exahype's size, it is reasonable to use \texttt{/ddn/data/username} as
work directory instead of the home directory. SLURM is used as the batch system, and an
appropriate job script resembles
\begin{code}
#!/bin/bash
#SBATCH --job-name="ExaHyPE"
#SBATCH -o ExaHyPE.%A.out
#SBATCH -e ExaHyPE.%A.err
#SBATCH -t 01:00:00
#SBATCH --exclusive
#SBATCH -p par7.q
#SBATCH --nodes=24
#SBATCH --cpus-per-task=6
#SBATCH --mail-user=tobias.weinzierl@durham.ac.uk
#SBATCH --mail-type=ALL
source /etc/profile.d/modules.sh

module load intel/xe_2017.2 
module load intelmpi/intel/2017.2
module load gcc

export I_MPI_FABRICS="tmi"
mpirun ./ExaHyPE-Euler  EulerFlow.exahype
\end{code}

\noindent
For the Euler equations (five unknowns) on the unit square with polynomial order
$p=3$, $h=0.001$ is a reasonable mesh size to start with: as Peano refines by a
factor of three per level, seven refinement steps are required before the mesh
width $3^{-7}\approx 4.6\cdot 10^{-4}$ drops below $h$, i.e.~the run yields a
spacetree of depth 8.


Hamilton relies on Omnipath.
Unfortunately, the default fabric configuration of Intel MPI does not seem to work
properly for \exahype\ once the problem sizes become large.
You have to tell MPI explicitly which driver/fabric to use;
otherwise, your code might deadlock.
One fabric that seems to work is \texttt{dapl}, selected via
\begin{code}
export I_MPI_FABRICS="dapl"
\end{code}

\noindent
While \texttt{dapl} seems to be very robust, we found it slightly slower than
\texttt{tmi} as used in the script above.
Furthermore, it needs significantly more memory per MPI rank.
Therefore, we typically use \texttt{tmi}, which, however, has to be set explicitly
via \texttt{export} on Hamilton.


One of the big selling points of Omnipath is that it is well-suited for small
message sizes.
Compared to other (Infiniband-based) systems, it thus seems wise to reduce
the message packet sizes in your \exahype\ specification file.
Notably, we often see improved performance once we start to decrease
\texttt{buffer-size}.

\section{SuperMUC (Munich's petascale machine)}
\label{section:supercomputers:SuperMUC}

There are very few pitfalls on SuperMUC; they mainly arise from the interplay
of IBM's MPI with Intel's TBB as well as from new compiler versions. Please load
a reasonably recent GCC version (the Intel compiler by default picks up a GCC that
is too old) as well as TBB manually before you compile
\begin{code}
module load gcc/4.9
module load tbb
\end{code}

\noindent
and remember to do so in your job scripts, too:
\begin{code}
#!/bin/bash
#@ job_type = parallel
##@ job_type = MPICH
#@ class = micro
#@ node = 1
#@ tasks_per_node = 1
#@ island_count = 1
#@ wall_clock_limit = 24:00:00
#@ energy_policy_tag = ExaHyPE_rulez
#@ minimize_time_to_solution = yes
#@ job_name = LRZ-test
#@ network.MPI = sn_all,not_shared,us
#@ output = LRZ-test.out
#@ error =  LRZ-test.err
#@ notification=complete
#@ notify_user=tobias.weinzierl@durham.ac.uk
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
module load gcc/4.9
module load tbb
\end{code}

\noindent
If you use Intel's TBB in combination with \texttt{poe} or MPI, please ensure
that you set
\begin{code}
export OMP_NUM_THREADS=28
export MP_TASK_AFFINITY=core:28
\end{code}

\noindent
manually, explicitly and correctly before you launch your application.
If you forget to do so, \exahype\ still launches the number of TBB threads
specified in your specification file, but all of these threads are pinned to one
single core.
You will get at most a speedup of two (from the core plus its hyperthread) in
this case\footnote{Thanks to Nicolay Hammer from LRZ for identifying this
issue.}.
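
To make this concrete, the tail of a corresponding job script may look as sketched
below. This is merely an illustration: the binary and specification file names are
the placeholders used throughout this chapter, and the thread count of 28 simply
matches the exports quoted above and has to be adapted to the node type you run on.
\begin{code}
# illustrative job script tail; adapt binary and specification file names
export OMP_NUM_THREADS=28
export MP_TASK_AFFINITY=core:28
poe ./ExaHyPE-Euler EulerFlow.exahype
\end{code}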

 
%\section{Tornado KNL (RSC group prototype)}
\section{Archer's KNL partition (EPCC supercomputer)}
\label{section:supercomputers:archer}

Archer's default Java version does not meet \exahype's requirements and the default
Java configuration does not provide the toolkit with enough heap memory (cf.~Section
\ref{section:appendix-toolkit:troubleshooting}).
Furthermore, we have not used the Cray tools yet but stick to Intel, and therefore
have to load a well-suited GCC version manually:

\begin{code}
module load java/jdk1.8.0_51
module swap PrgEnv-cray PrgEnv-intel
module load gcc
\end{code}

\noindent
To accommodate the toolkit, we use the modified Java invocation:
\begin{code}
java -XX:MaxHeapSize=512m -jar Toolkit/dist/ExaHyPE.jar
\end{code}
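
\noindent
If you invoke the toolkit frequently, a small shell alias saves retyping the heap
flag. The alias name below is our own, hypothetical choice, and the relative path
assumes that you work from the repository root:
\begin{code}
# hypothetical convenience alias; run from the repository root
alias exahype-toolkit='java -XX:MaxHeapSize=512m -jar Toolkit/dist/ExaHyPE.jar'
exahype-toolkit EulerFlow.exahype
\end{code}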


\noindent
For shared memory support, we encountered three issues:
\begin{enumerate}
  \item We should use EPCC's compiler wrapper \texttt{CC} instead of invoking
  the compilers manually.
  \item The module does not initialise the \texttt{TBB\_SHLIB} variable that we
  use in our scripts, so we have to set it manually.
  \item The default compiler behaviour links all libraries statically into the
  executable. However, the TBB libs are not available as static variants.
  To change this behaviour, we had to instruct the linker explicitly to link
  against the shared library variants.
\end{enumerate}

\noindent
Overall, these three lines fix the behaviour:
\begin{code}
  export EXAHYPE_CC=CC
  export TBB_SHLIB="-L/opt/intel/compilers_and_libraries_2017.0.098\
/linux/tbb/lib/intel64/gcc4.7 -ltbb"
  export CRAYPE_LINK_TYPE=dynamic
\end{code}
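
\noindent
With these variables in place, a shared-memory build should then only require
selecting TBB as shared memory back-end and rebuilding. Note that
\texttt{SHAREDMEM=TBB} and the plain \texttt{make} call below are assumptions
about the usual \exahype\ build workflow rather than Archer specifics:
\begin{code}
  # sketch only; assumes the usual ExaHyPE build workflow and that the
  # three exports above are already in place
  export SHAREDMEM=TBB
  make clean && make -j8
\end{code}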


\noindent
As on SuperMUC, we observe that a plain launch of executables through
\texttt{aprun} {\em does not allow the code to exploit shared memory
parallelism}. We explicitly have to make the cores available to the application
in the run command through
\begin{code}
  aprun -n ... -d coresPerTask ... -cc depth
\end{code}
where \texttt{-cc} configures the pinning. According to the Archer
documentation, this configuration still does not enable hyperthreading. 
If hyperthreading is required, we have to append \texttt{-j 4} to the
invocation, too.
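
As a concrete, purely illustrative example, the following line starts four MPI
ranks with 16 cores each and no hyperthreading; the rank and core counts as well
as the binary and specification file names are placeholders:
\begin{code}
  # illustrative only: 4 MPI ranks, 16 cores per rank, no hyperthreading;
  # -d should match the thread count configured in your specification file
  aprun -n 4 -d 16 -cc depth ./ExaHyPE-Euler EulerFlow.exahype
  # with hyperthreading, additionally append -j 4 (and scale -d accordingly)
\end{code}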

\section{RWTH Aachen Cluster}

We have successfully tested \exahype\ on RWTH Aachen's RZ clusters using MUST.
Here, it is important to switch the GCC version before you compile, as the
default GCC is version 4.8.5, which does not fully implement C++11.

\begin{code}
module load UNITE must


#module unload gcc
module unload openmpi
module switch intel gcc/5
module load intel openmpi

export SHAREDMEM=none
export COMPILER=manual
export EXAHYPE_CC="mpiCC -std=c++11 -g3"
export COMPILER_CFLAGS="$FLAGS_FAST"
\end{code}

\noindent
The above setup uses the compiler variant \texttt{manual}, as RWTH has installed
MUST such that \texttt{mustrun} automatically dispatches the executable onto the
right cluster.
To create a binary that is compatible with this cluster, the flags from
\texttt{FLAGS\_FAST} are to be used.
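
A MUST-instrumented run is then typically launched through \texttt{mustrun},
which accepts the usual \texttt{mpiexec}-style arguments; the rank count and the
binary and specification file names below are placeholders:
\begin{code}
# illustrative: mustrun wraps the MPI launcher and adds the MUST correctness checks
mustrun -np 4 ./ExaHyPE-Euler EulerFlow.exahype
\end{code}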

\section{CoolMUC 3}

LRZ's KNL system CoolMUC 3 uses Omnipath as well. Therefore, ensure that you
set the MPI fabric properly as soon as you use more than one node; otherwise,
\exahype\ will deadlock:

\begin{code}
export I_MPI_FABRICS="tmi"
\end{code}
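
\noindent
A minimal multi-node SLURM script might therefore look as sketched below.
Partition and account settings are omitted deliberately (consult the LRZ
documentation), and the binary and specification file names are again
placeholders:
\begin{code}
#!/bin/bash
#SBATCH --job-name="ExaHyPE"
#SBATCH --nodes=2
#SBATCH -t 01:00:00
# partition, account, task layout etc. omitted -- consult the LRZ documentation

source /etc/profile.d/modules.sh
# load your compiler/MPI modules here
export I_MPI_FABRICS="tmi"
mpirun ./ExaHyPE-Euler EulerFlow.exahype
\end{code}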



\section{Hazelhen (Cray)} 

Cray may configure the Intel compiler to link all libraries statically. TBB,
however, is by default not built statically, so add the following to
\texttt{TBB\_SHLIB}
\begin{code}
-dynamic -ltbb
\end{code}
i.e.\ make sure these flags end up on the link command.
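
For instance, if \texttt{TBB\_SHLIB} otherwise only carries the TBB link flag,
the setting boils down to
\begin{code}
export TBB_SHLIB="-dynamic -ltbb"
\end{code}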

\section{Frankfurt machines}
For the machines

\begin{itemize}
  \item Generic Ubuntu laptop
  \item Iboga
  \item LOEWE
  \item FUCHS
  \item Hamilton
  \item SuperMUC
\end{itemize}

please see the configuration settings in the \texttt{ClusterConfigs}
directory within \texttt{Miscellaneous} in the main repository.