Authors: G. Hautreux, D. Dellis, C. Moulinec, A. Sunderland, A. Gray, A. Proeme, V. Codreanu, A. Emerson, B. Eguzkitza, J. Strassburg, M. Louhivuori

Abstract: The work produced within this task is an extension of the UEABS (Unified European Applications Benchmark Suite) for accelerators. As a first version of the extension, this document presents a full definition of a benchmark suite for accelerators. It covers each code, describing the code itself, the test cases defined for the benchmarks, and the problems that could arise in the coming months.

Like the UEABS, this suite aims to present results for many scientific fields that can use accelerated HPC resources. Hence, it will help the European scientific communities decide which infrastructures they could buy in the near future. We focus on Intel Xeon Phi coprocessors and NVIDIA GPU cards for benchmarking, as they are the two most important accelerated resources available at the moment.

The following table lists the codes that will be presented in the next sections as well as their available implementations. It should be noted that OpenMP can be used with the Intel Xeon Phi architecture while CUDA is used for NVIDIA GPU cards. OpenCL is a third alternative that can be used on both architectures.
The chosen codes are all part of the UEABS suite except PFARM, which comes from PRACE-2IP, and SHOC, which is a synthetic benchmark suite for accelerators.

Download paper: PDF

Authors: Jeremie Gaidamour, Dimitri Lecas, Pierre-Francois Lavallee
IDRIS/CNRS, Campus universitaire d’Orsay, rue John Von Neumann, Batiment 506, F-91403 Orsay, France

Abstract: The HYDRO mini-application has been successfully used as a research vehicle in previous PRACE projects [6]. In this paper, we evaluate the benefits of the tasking model introduced in recent OpenMP standards [9]. We have developed a new version of HYDRO using the concept of OpenMP tasks and this implementation is compared to already existing and optimized OpenMP versions of HYDRO.

Download paper: PDF

Authors: Jorge Rodriguez (a)
(a) BSC-CNS: Barcelona Supercomputing Center, Torre Girona, C/Jordi Girona, 31, 08034 Barcelona, Spain

Alya [5] is a computational mechanics code capable of solving different physics. It has been used extensively on
MareNostrum III (BSC’s Tier-0 machine), and it has also been used as a benchmarking code in the PRACE Unified
European Applications Benchmark Suite. In this document, Extrae is used to collect and analyze
performance data during an Alya simulation in a petaflop environment.
The performance analysis with Extrae [2] [3] has revealed some potential improvements in Alya;
if they are implemented, exascale scalability could be achieved.
Application Code: Alya

Download PDF

B. Lindi (a)*, T. Ponweiser (b), P. Jovanovic (c), T. Arslan (a)
(a) Norwegian University of Science and Technology
(b) RISC Software GmbH, a company of Johannes Kepler University Linz
(c) Institute of Physics Belgrade

This study has profiled the application Code_Saturne, which is part of the PRACE benchmark suite. The profiling has been
carried out with the tools HPCToolkit and Tuning and Analysis Utilities (TAU), with the aim of finding compute kernels
suitable for autotuning.
Autotuning is regarded as a necessary step in achieving sustainable performance at the Exascale level, as Exascale systems
will most likely have a heterogeneous runtime environment. A heterogeneous runtime environment imposes a parameter
space on the application's run-time behavior which cannot be explored by a traditional compiler. Nor can the run-time
behavior be explored manually by the developer or code owner, as this would be too time-consuming.
The tool Orio has been used to autotune the identified compute kernels. Orio has been used on traditional Intel processors,
Intel Xeon Phi and NVIDIA GPUs. The compute kernels make a small contribution to the overall execution time of
Code_Saturne. By autotuning with Orio, these kernels have been improved by 3-5%.
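
The idea of empirical autotuning described in this abstract can be illustrated with a minimal sketch. It does not use Orio or reproduce its interface; the kernel `blocked_sum_of_squares`, the tuning parameter `block`, and the candidate values are all hypothetical and chosen only for illustration. The sketch simply times each candidate and keeps the fastest, which is the simplest (exhaustive) search an autotuner can perform.

```python
import time


def blocked_sum_of_squares(data, block):
    """Hypothetical kernel: sum of squares, processed in chunks of `block` elements."""
    total = 0.0
    for start in range(0, len(data), block):
        chunk = data[start:start + block]
        total += sum(x * x for x in chunk)
    return total


def autotune(kernel, data, candidates, repeats=3):
    """Time the kernel for every candidate value and return (best_value, timings)."""
    timings = {}
    for value in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data, value)
            # Keep the fastest of several runs to reduce timing noise.
            best = min(best, time.perf_counter() - t0)
        timings[value] = best
    best_value = min(timings, key=timings.get)
    return best_value, timings


data = [float(i) for i in range(20000)]
candidates = [16, 64, 256, 1024]  # assumed search space, not taken from the paper
best_block, timings = autotune(blocked_sum_of_squares, data, candidates)
print("best block size on this machine:", best_block)
```

The selected value depends on the machine the sketch runs on, which is the point: the search is repeated per platform instead of being fixed at compile time.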

Download PDF

Authors: Sadaf Alam, Ugo Varetto (a)
(a) Swiss National Supercomputing Centre, Lugano, Switzerland

Abstract: Recently, MPI implementations have been extended to support accelerator devices such as Intel Many Integrated Core (MIC) and NVIDIA GPUs. This has
been accomplished by changes to different levels of the software stacks and MPI implementations. In order to evaluate the performance and
scalability of accelerator-aware MPI libraries, we developed portable micro-benchmarks to identify factors that influence the efficiency of primitive
MPI point-to-point and collective operations. These benchmarks have been implemented in OpenACC, CUDA and OpenCL. On the Intel MIC
platform, existing MPI benchmarks can be executed with appropriate mapping onto the MIC and CPU cores. Our results demonstrate that the
MPI operations are highly sensitive to the memory and I/O bus configurations on the node. The current implementation of the MIC on-node
communication interface exhibits additional limitations on the placement of the card and data transfers over the memory bus.

Download PDF

Authors: Mikael Rannar (a), Maciej Szpindler (b)
(a) HPC2N & Department of Computing Science, Umea University
(b) Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw

The scaling of the HBM (HIROMB-BOOS Model) ocean circulation model on selected PRACE Tier-0 systems is
described. The model has been ported to the BlueGene/Q architecture and tested for OpenMP and mixed
OpenMP/MPI parallel performance and scaling with a given test case scenario. Benchmarking of the selected
computational kernels and model procedures with a micro-benchmarking module has been proposed for further
integration with the model code. Details on the micro-benchmark proposal and results of the scaling tests are

Download PDF

Authors: J. Donners (a)*, A. Mourits (b), M. Genseberger (b), B. Jagers (b)
(a) SURFsara, Amsterdam, The Netherlands
(b) Deltares, Delft, The Netherlands

The Delft3D modelling suite has been ported to the PRACE Tier-0 and Tier-1 infrastructure. The portability of
Delft3D was improved by removing platform-dependent options from the build system and replacing non-standard constructs in the source. Three benchmarks were used to investigate the scaling of Delft3D: (1) a
large, regular domain; (2) a realistic, irregular domain with a low fill-factor; (3) a regular domain with a
sediment transport module. The first benchmark clearly shows good scalability up to a thousand cores for a
suitable problem. The other benchmarks show reasonable scalability up to about 100 cores. For test case (2) the
main bottleneck is the serialized I/O. An attempt was made to implement a separate I/O server by dedicating the last MPI
process to I/O, but this work is not yet finished. The imbalance due to the irregular domain can be
reduced somewhat by using a cyclic placement of MPI tasks. Test case (3) benefits from inlining of often-called

Download PDF
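
The effect of the cyclic placement of MPI tasks mentioned in the Delft3D abstract above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the paper: the per-rank work values are hypothetical, and the sketch assumes that neighbouring ranks of an irregular domain carry similar amounts of work and that the load summed per node is the quantity of interest.

```python
def block_placement(n_ranks, n_nodes):
    """Consecutive ranks fill one node before moving on to the next."""
    per_node = -(-n_ranks // n_nodes)  # ceiling division
    return [rank // per_node for rank in range(n_ranks)]


def cyclic_placement(n_ranks, n_nodes):
    """Ranks are dealt out to the nodes round-robin."""
    return [rank % n_nodes for rank in range(n_ranks)]


def node_loads(placement, work, n_nodes):
    """Sum the work of all ranks placed on each node."""
    loads = [0] * n_nodes
    for rank, node in enumerate(placement):
        loads[node] += work[rank]
    return loads


# Hypothetical work per rank: the first subdomains are mostly active cells,
# the last ones mostly inactive (low fill-factor).
work = [8, 8, 7, 7, 2, 2, 1, 1]
n_nodes = 2

block = node_loads(block_placement(len(work), n_nodes), work, n_nodes)
cyclic = node_loads(cyclic_placement(len(work), n_nodes), work, n_nodes)
print("block :", block)   # [30, 6]
print("cyclic:", cyclic)  # [18, 18]
```

With block placement the heavy ranks end up on the same node, while cyclic placement spreads them across nodes. The per-rank imbalance itself is unchanged, which fits the abstract's wording that the imbalance is reduced only "somewhat".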

Authors: Maciej Cytowski, Maciej Filocha, Jakub Katarzynski, Maciej Szpindler
Interdisciplinary Centre for Mathematical and Computational Modeling (ICM), University of Warsaw,
In this whitepaper we describe our effort to measure the performance of applications and synthetic benchmarks
under different simultaneous multithreading (SMT) modes. This processor architecture feature is currently
available in many petascale HPC systems worldwide. Both the IBM Power7 processors available in Power775 (IH) and the IBM Power
A2 processors available in Blue Gene/Q are built upon 4-way simultaneous multithreaded cores. It should also be mentioned that
multithreading is predicted to be one of the leading features of future exascale systems available by the end of the next decade [1].

Download PDF

Authors: J. Mark Bull (a)*, Andrew Emerson (b)
(a) EPCC, University of Edinburgh, King’s Buildings, Mayfield Road, Edinburgh EH9 3JZ, UK.
(b) CINECA, via Magnanelli 6/3, 40033 Casalecchio di Reno, Bologna, Italy

This White Paper reports on the selection of a set of application codes taken from the existing PRACE and
DEISA application benchmark suites to form a single Unified European Application Benchmark Suite

The selected codes are: QCD, NAMD, GROMACS, Quantum Espresso, CP2K, GPAW, Code_Saturne,

Download PDF


These whitepapers have been prepared by the PRACE Implementation Phase Projects in accordance with the Consortium Agreements and Grant Agreements n° RI-261557, n° RI-283493, or n° RI-312763.

They solely reflect the opinion of the parties to such agreements on a collective basis in the context of the PRACE Implementation Phase Projects and to the extent foreseen in such agreements. Please note that even though all participants in the PRACE IP Projects are members of PRACE AISBL, these whitepapers have not been approved by the Council of PRACE AISBL and therefore do not emanate from it nor should be considered to reflect PRACE AISBL’s individual opinion.

Copyright notices

© 2014 PRACE Consortium Partners. All rights reserved. This document is a project
document of a PRACE Implementation Phase project. All contents are reserved by default
and may not be disclosed to third parties without the written consent of the PRACE partners,
except as mandated by the European Commission contracts RI-261557, RI-283493, or RI-312763 for reviewing and dissemination purposes.

All trademarks and other rights on third party products mentioned in the document are
acknowledged as owned by the respective holders.