\chapter{Performance and parameter studies}
\label{sec:apx-performance-studies}
In this chapter, we collect some remarks on how to determine and improve the performance of
\exahype\ applications.
\section{Improving and measuring shared memory scalability}
\begin{itemize}
\item If predictor or update background jobs are turned on, TBB may still use two cores during
the time stepping and the mesh refinement iterations even if you specify
that you only want to use a single TBB thread.
You will then measure a runtime close to the one obtained with two TBB threads.
Therefore, turn predictor and update background jobs off when measuring single-core runtimes;
a specification file snippet showing how is given after this list.
\item For single-node runs with uniform meshes, it is often more efficient to
use parallel fors and no predictor or update background jobs, as spawning background jobs incurs additional overhead.
Turn them back on if you use AMR or MPI.
\item When performing any scalability study, be aware that your application can only
exploit multiple cores or nodes if your problem size is big enough.
If the number of cells is small, you cannot expect work to
be distributed efficiently and fairly among the used hardware processing units.
\item For low orders or smaller PDEs, it often makes sense to put ADER-DG's
space-time predictor and space-time volume flux arrays on the stack.
\end{itemize}
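A minimal specification file excerpt implementing the first and the last of these suggestions, reusing keys from the JSON template shown later in this chapter, might look as follows:
\begin{code}
"optimisation": {
    "spawn_predictor_as_background_thread": false,
    "spawn_update_as_background_thread": false
},
...
"aderdg_kernel": {
    "allocate_temporary_arrays": "stack"
}
\end{code}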
\section{Performing performance and parameter studies with \texttt{sweep.py}}
The difficulty of reproducing performance studies has been identified as one of the major issues in High-Performance Computing.
\exahype\ ships with a tool which allows performing easily reproducible and portable performance studies.
The tool, named \texttt{sweep.py}, can be found in the folder \texttt{ExaHyPE-Engine/Benchmarks/python}.
\texttt{sweep.py} uses a single configuration file as input in which you specify parameter ranges, environment variables, and the hardware
resources you want to use. You further specify which parameter variations require building a new executable.
Furthermore, the configuration file refers to a specification file template and a job script template.
The templates hold placeholders which have to match the parameter keys in the configuration file.
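For illustration, a parameter key \texttt{order} in the configuration file, e.g.
\begin{code}
[parameters]
order = 3,4,5
\end{code}
is matched by a \texttt{\{\{order\}\}} placeholder in the specification file template:
\begin{code}
"order": {{order}},
\end{code}
Both excerpts are taken from the example study discussed later in this chapter.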
\paragraph{Building executables and submitting jobs} We start by building all required executables via:
\begin{code}
<path_to_sweep>/sweep.py <path_to_configuration_file> build
\end{code}
During the build process, \texttt{sweep.py} ensures that every parameter specified in the
configuration file maps to a placeholder in the specification file template.
After building the required executables, the next step is to generate the job scripts.
This is done via:
\begin{code}
<path_to_sweep>/sweep.py <path_to_configuration_file> scripts
\end{code}
Again, consistency checks are performed.
Finally, the jobs are submitted via:
\begin{code}
<path_to_sweep>/sweep.py <path_to_configuration_file> submit
\end{code}
Before submission, \texttt{sweep.py} checks if all executables and scripts
exist and if all mandatory placeholders have been placed into the
job script template. Furthermore, the tool checks if a log filter file named \texttt{exahype.log-filter}
has been placed into the project repository. Log filters are used to minimise the
log output written by the application, since output can become a bottleneck in
parallel applications.
If any of the mentioned checks fails, the submission process is stopped and
an error message is printed out.
\paragraph{Cancelling jobs and deleting files}
Per batch of experiments, \texttt{sweep.py} memorises the jobs you have launched
in the subfolder \texttt{<myoutputfolder>/history}.
They can all be cancelled with:
\begin{code}
<path_to_sweep>/sweep.py <path_to_configuration_file> cancel
\end{code}
If you have submitted multiple experiments in a row
using the same output folder, you can find the corresponding configuration
files in \texttt{<myoutputfolder>/history}, too. They will have a hashed
name, though. This way, you can also cancel the jobs
submitted for previous experiments.
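For instance, the jobs of a previous experiment can be cancelled by pointing \texttt{sweep.py} at the stored configuration file:
\begin{code}
<path_to_sweep>/sweep.py <myoutputfolder>/history/<hashed_configuration_file> cancel
\end{code}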
If you want to delete the whole output folder including executables, scripts, history and results, run:
\begin{code}
<path_to_sweep>/sweep.py <path_to_configuration_file> cleanAll
\end{code}
You can delete individual subfolders via:
\begin{code}
<path_to_sweep>/sweep.py <path_to_configuration_file> cleanBuild
<path_to_sweep>/sweep.py <path_to_configuration_file> cleanScripts
<path_to_sweep>/sweep.py <path_to_configuration_file> cleanResults
<path_to_sweep>/sweep.py <path_to_configuration_file> cleanHistory
\end{code}
\paragraph{Parsing runtimes}
After completion of all jobs (or of some jobs if you want to have a peek),
you can run
\begin{code}
<path_to_sweep>/sweep.py <path_to_configuration_file> parseAdapters
\end{code}
which tells you which jobs have been processed successfully and which
ones have not. However, the main purpose of this subprogram is to parse
the measured CPU and real time spent in the different phases (\textit{adapters})
of the algorithm from the output files written by your jobs.
The parsed data is put into a CSV table which can be easily edited
with any spreadsheet software.
Depending on the solver implementation (fused algorithmic phases vs. straightforward implementation using separate
predictor, Riemann, and corrector loops), the algorithmic phases constituting a time step differ.
\texttt{sweep.py} comes with another subprogram \texttt{parseTimeStepTimes} for post-processing the
output of the \texttt{parseAdapters} subprogram. This subprogram is aware of the two different solver implementations
when computing the CPU and real time spent in the time stepping adapters.
Furthermore, if you have run the same experiment multiple times, it will compute
the measured minimum, maximum, and mean times. It will also include the standard
deviation of the measured times into the generated CSV table.
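Analogously to the other subprograms, this post-processing step can be invoked via:
\begin{code}
<path_to_sweep>/sweep.py <path_to_configuration_file> parseTimeStepTimes
\end{code}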
\paragraph{Parsing LIKWID metrics}
It is possible to instruct \texttt{sweep.py} to use LIKWID's \texttt{perfctr} wrapper.
In this case, you specify the LIKWID metric groups you want to measure
in the \texttt{[general]} section of the configuration file, e.g.:
\begin{code}
[general]
...
likwid = MEM_DP,L2,L3CACHE,L2CACHE
\end{code}
Note that these measurements will then be performed
within the same job; you might need to increase the walltime in this case.
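For example, the walltime limit set in the \texttt{[jobs]} section of the example configuration below could be doubled:
\begin{code}
[jobs]
time = 24:00:00
\end{code}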
After the experiments have been run, the metrics can be parsed
with yet another parser subprogram:
\begin{code}
<path_to_sweep>/sweep.py <path_to_configuration_file> parseMetrics
\end{code}
\subsection{Example: ElasticWave3D - shared memory scalability}
Below you find an example of a comprehensive performance study examining the multi-core
scalability of the linear ADER-DG application \texttt{ElasticWave3D} \\
(path: \texttt{ApplicationExamples}\texttt{/Linear/ElasticWave3D}).
The example is used to detail the sections of the configuration file
and to show an example of a job script and a specification file template.
The study varies the number of cores (and background job consumer tasks), the polynomial order,
the ADER-DG implementation (``fused'' vs.\ ``nonfused''), whether predictor background jobs are used, whether
update background jobs are used, whether zero or one level of AMR is used, and so on.
The parameter keys and their associated lists of values can be specified as a
single parameter key with a single value:
\begin{code}
<key0> = <value0>
\end{code}
or as a single key with multiple values:
\begin{code}
<key> = <value0>,<value1>
\end{code}
or as a single key with multiple quoted values:
\begin{code}
<key> = "<value0>","<value1>"
\end{code}
or as a single key with a mix of quoted and unquoted values:
\begin{code}
<key> = "<value0>",<value1>,"<value2>"
\end{code}
We span the parameter space as the Cartesian product of the individual
parameter key--value lists.
A parameter key is a dimension in the parameter space.
Different values for a certain parameter key are separated by commas (',').
You can further put parameter values in quotation marks ('"') if a value
contains commas itself, or to prevent trimming of leading or trailing whitespace.
For each parameter combination, one experiment will be performed.
Elements of the parameter space may be mapped to a job script directly, or whole subspaces may be
mapped to a single job script: while each parameter combination in \texttt{parameters} is mapped to
its own job script, the parameters in \texttt{parameters\_grouped} are grouped onto the same job script,
as the excerpt after this paragraph illustrates.
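A hypothetical configuration excerpt illustrating the difference (both keys appear in the example configuration below):
\begin{code}
[parameters]
; one job script is generated per order
order = 3,4
[parameters_grouped]
; both kernel variants are run within each of the generated job scripts
kernels = generic,optimised
\end{code}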
It is further possible to prescribe and vary environment variables.
For each distinct configuration of environment variables,
a new executable is created.
This prevents you from accidentally using a different compiler when rerunning an experiment.
You specify environment variables in exactly the same way as parameters.
The workflow from which the scripts below were extracted assumes that the configuration file
and the templates are placed in the project folder. For convenience, we typically symlink
\texttt{Benchmarks/python/sweep.py} into the project folder, too.
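Assuming the paths from the configuration file below and a hypothetical configuration file name \texttt{tbb-scaling.ini}, the setup could look as follows:
\begin{code}
cd <mypath>/ExaHyPE-Engine/ApplicationExamples/Linear/ElasticWave3D
ln -s ../../../Benchmarks/python/sweep.py sweep.py
./sweep.py tbb-scaling.ini build
\end{code}
The configuration file used for this study reads: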
\begin{code}
[general]
exahype_root = <mypath>/ExaHyPE-Engine
project_name = Elastic
project_path = ApplicationExamples/Linear/ElasticWave3D/
spec_template = %(project_path)s/elastic.exahype2-template
job_template = %(project_path)s/hamilton7.job-template
output_path = %(project_path)s/tbb-scaling
make_threads = 24
run_command = ""
job_submission = sbatch --qos=no_core_limit
job_cancellation = scancel
compile_time_parameters = order,tempArrays,kernels
; you can also perform likwid measurements
; likwid = MEM_DP,L2,L3CACHE,L2CACHE
[jobs]
time = 12:00:00
mail = myemail@address.com
num_cpus = 24
; total ranks (!) x nodes x { cores : consumerTasks }
ranks_nodes_cores = 1 x 1 x {24:12,16:8,12:6,8:4,4:2,2:1,1:0}
; tag for the current run
run = 0
; any number is possible as tag; for each tag, a run is performed
; run = 1,2,3
[environment]
; for each environment variable combination, an executable is built
EXAHYPE_CC = icpc
COMPILER = Intel
MODE = RELEASE
DISTRIBUTEDMEM = NONE
SHAREDMEM = TBB
USE_IPO = on
COMPILER_CFLAGS = ""
[parameters]
; a job is created for each configuration
architecture = hsw
dimension = 3
; for each order an executable is built as order is listed in 'compile_time_parameters'
order = 3,4,5,6,7,8
timeSteps = 50
fused = false,true
predictorBackground = false,true
updateBackground = false,true
profilingTarget = whole_code
maximumMeshDepth = 0,1
[parameters_grouped]
; runs for these parameters are grouped onto the same job
; the number of generated jobs thus equals the number of
; combinations of parameters in section [parameters]
prolongationBackground = false
amrBackground = true
tempArrays = stack
minJobsPerRush = 2
maxJobsPerRush = 2147483647
kernels = optimised
timeStepping = globalfixed
batchFactor = 1.0
maximumMeshSize = 1.0
highPrioJobProcessing = one_at_a_time
lowPrioJobProcessing = run_if_no_high_priority_job_left
\end{code}
The matching specification file template is given below.
It has to be in the new \exahype\ JSON specification file format.
You can generate the new format with the \texttt{-s} switch from
the original \exahype\ specification file format.
Note that the placeholders \texttt{\{\{coresPerRank\}\}}, \texttt{\{\{consumerTasks\}\}}, and
\texttt{\{\{ranksPerNode\}\}} do not correspond to entries in the parameter sections;
they are derived from the \texttt{ranks\_nodes\_cores} entry in the \texttt{[jobs]} section.
\begin{code}
{
"project_name": "Elastic",
"paths": {
"exahype_path": "./ExaHyPE",
"output_directory": "./ApplicationExamples/Linear/ElasticWave3D",
"peano_kernel_path": "./Peano"
},
"architecture": "{{architecture}}",
"computational_domain": {
"dimension": {{dimension}},
"offset": [
-12.0,
0.0,
-12.0
],
"time_steps": {{timeSteps}},
"width": [
24.0,
24.0,
24.0
]
},
"profiling": {
"profiling_target": "{{profilingTarget}}"
},
"shared_memory": {
"cores": {{coresPerRank}},
"background_job_consumers": {{consumerTasks}},
"high_priority_background_job_processing" : "{{highPrioJobProcessing}}",
"low_priority_background_job_processing" : "{{lowPrioJobProcessing}}",
"min_background_jobs_in_one_rush" : {{minJobsPerRush}},
"max_background_jobs_in_one_rush" : {{maxJobsPerRush}},
"properties_file": "sharedmemory.properties",
"autotuning_strategy": "dummy",
"thread_stack_size": 16777216
},
"distributed_memory": {
"timeout": 60,
"load_balancing_type": "static",
"buffer_size": 64,
"load_balancing_strategy": "hotspot",
"node_pool_strategy": "fair",
"ranks_per_node": {{ranksPerNode}}
},
"optimisation": {
"fuse_algorithmic_steps": {{fused}},
"fuse_algorithmic_steps_factor": 0.99,
"spawn_predictor_as_background_thread": {{predictorBackground}},
"spawn_update_as_background_thread": {{updateBackground}},
"spawn_prolongation_as_background_thread": {{prolongationBackground}},
"spawn_amr_background_threads": {{amrBackground}},
"disable_vertex_exchange_in_time_steps": true,
"time_step_batch_factor": {{batchFactor}},
"disable_metadata_exchange_in_batched_time_steps": true,
"double_compression": 0.0,
"spawn_double_compression_as_background_thread": false
},
"solvers": [
{
"name": "MyElasticWaveSolver",
"order": {{order}},
"maximum_mesh_depth": {{maximumMeshDepth}},
"maximum_mesh_size": {{maximumMeshSize}},
"type": "ADER-DG",
"time_stepping": "{{timeStepping}}",
"aderdg_kernel": {
"basis": "Legendre",
"implementation": "{{kernels}}",
"allocate_temporary_arrays" : "{{tempArrays}}",
"adjust_solution": "patchwise",
"language": "C",
"nonlinear": false,
"optimised_kernel_debugging": [],
"optimised_terms": [],
"space_time_predictor": {},
"terms": [
"flux",
"ncp",
"material_parameters",
"point_sources"
]
},
"variables": [
{
"multiplicity": 3,
"name": "v"
},
{
"multiplicity": 6,
"name": "sigma"
}
],
"material_parameters": [
{
"multiplicity": 1,
"name": "rho"
},
{
"multiplicity": 1,
"name": "cp"
},
{
"multiplicity": 1,
"name": "cs"
}
],
"point_sources": 1,
"parameters": {
"amr_regularization": false
},
"plotters": []
}
]
}
\end{code}
Finally, you need to specify a job script template in order to
perform the performance study on your supercomputer of choice.
We ran the study on Durham University's \texttt{hamilton7} machine which
uses the SLURM scheduler. A template for this machine may look as follows:
\begin{code}
#!/bin/bash
# Mandatory parameters are:
# time, ranks, nodes,
# job_name, output_file, error_file,
# body
#
# Optional parameters are:
# tasks, coresPerTask, mail
#SBATCH --job-name={{job_name}}
#SBATCH -o {{output_file}}
#SBATCH -e {{error_file}}
#SBATCH -t {{time}}
#SBATCH --exclusive
#SBATCH -p par7.q
#SBATCH --mem=MaxMemPerNode
#SBATCH --ntasks={{ranks}}
#SBATCH --nodes={{nodes}}
#SBATCH --cpus-per-task={{coresPerRank}}
#SBATCH --mail-user={{mail}}
#SBATCH --mail-type=END
module purge
module load slurm
module load intel/xe_2017.2
module load intelmpi/intel/2017.2
module load gcc
module load likwid
export TBB_SHLIB="-L/ddn/apps/Cluster-Apps/intel/xe_2017.2/tbb/lib/intel64/gcc4.7 -ltbb"
export I_MPI_FABRICS="tmi"
{{body}}
\end{code}
\section{Supercomputer build environments and job script templates}
You find build environments for a range of supercomputers in the
\texttt{ExaHyPE-Engine/Benchmarks/environment} subfolder, and job script templates
for a range of supercomputers in the
\texttt{ExaHyPE-Engine/Benchmarks/job-templates} subfolder.
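Assuming the build environments are plain shell snippets, they would typically be loaded before invoking the \texttt{build} subprogram, e.g.:
\begin{code}
source <path_to_exahype_root>/Benchmarks/environment/<machine_specific_file>
\end{code}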
\section{Further post-processing of CSV tables}
You can find further useful tools for post-processing the CSV tables generated
by \texttt{sweep.py} in the \texttt{ExaHyPE-Engine/Benchmarks/python} folder.
The \texttt{tableslicer.py} tool provides the following options:
\begin{code}
This is tableslicer.py: A small tool for extracting columns from a table which
is scanned according to a filter.
positional arguments:
table The CSV table to work with.
optional arguments:
-h, --help show this help message and exit
--filter FILTER [FILTER ...]
Specify a list of key-value pairs. Example:
./tableslicer.py --filter order=3 maximumMeshDepth=3
...
--cols COLS [COLS ...]
Specify the list of columns you want to read from the
rows matching the filter. Example: ./tableslicer.py
... --cols cores realtime_min
--min [MIN] Specify the column you want to determine the minimum
value of. All (filtered) rows with that value will be
written out. If you do not specify anything, the last
column will be used. Example: ./tableslicer.py
--filter .. --cols order maximumMeshDepth ... --min
order
--max [MAX] Specify the column you want to determine the maximum
value of. All (filtered) rows with that value will be
written out. If you do not specify anything, the last
column will be used. Example: ./tableslicer.py
--filter .. --cols order maximumMeshDepth ... --max
order
--header Write a header to the output file.
--no-header Write no header to the output file.
--compress Remove columns where the same value is found in every
row.
--no-compress Do not remove columns where the same value is found in
every row.
-s SORT [SORT ...], --sort SORT [SORT ...]
Specify a list of sorting key columns. Order is
important. Example: ./tableslicer.py ... ... --cols
fused cores --sort cores fused
--input-delim [INPUTDELIM]
Specify the delimiter used in the input table.
--output-delim [OUTPUTDELIM]
Specify the delimiter for the output table.
--output OUTPUT The output file.
\end{code}
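A typical invocation, here extracting the core counts and the minimal measured real times for a fixed polynomial order (the file names are hypothetical), could look as follows:
\begin{code}
./tableslicer.py timesteps.csv --filter order=3 --cols cores realtime_min --sort cores --output order3-scaling.csv
\end{code}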
There is further the \texttt{speedupcalculator.py} which can be used
as follows:
\begin{code}
This is speedupcalculator: A small tool for computing speedups for data stored
in a CSV table. The last column of a table row is assumed to be the data
column. The first row is assumed to store the reference value if no reference
value is explicitly specified.
positional arguments:
table The CSV table to work on.
optional arguments:
-h, --help show this help message and exit
--header Write a header to the output file.
--no-header Write no header to the output file.
--keys Include the key columns to the output.
--no-keys Remove the key columns from the output.
--data Include the original data column into the output file.
--no-data Remove the original data column from the output file.
--input-delim [INPUTDELIM]
Specify the delimiter used in the input table.
--output-delim [OUTPUTDELIM]
Specify the delimiter for the output table.
--reference [REFERENCE]
Specify a reference value > 0.
-o [OUTPUT], --output [OUTPUT]
Output file
\end{code}
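The output of \texttt{tableslicer.py} can be fed directly into this tool. Assuming the rows of the hypothetical table from above are sorted such that the single-core measurement comes first, the following call computes the speedup over one core:
\begin{code}
./speedupcalculator.py order3-scaling.csv --no-keys --output order3-speedup.csv
\end{code}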