Performance prediction

Authors: Jerry Eriksson (b), Pedro Ojeda-May (b), Thomas Ponweiser (a,*), Thomas Steinreiter (a)
(a) RISC Software GmbH, Softwarepark 35, 4232 Hagenberg, Austria
(b) High Performance Computing Center North (HPC2N), MIT Huset, Umeå Universitet, 901 87 Umeå, Sweden

Abstract: The use of modern profiling and tracing tools is vital for understanding program behaviour, performance bottlenecks and optimisation potential in HPC applications. Despite their obvious benefits, such tools are still not widely adopted within the HPC user community. The two main reasons for this are, firstly, unawareness and, secondly, the sometimes prohibitive complexity of getting started with these tools. In this work we aim to address this issue by presenting and comparing the capabilities of four different performance analysis tools: 1) HPCToolkit, 2) Extrae and Paraver, 3) SCALASCA and 4) the Intel Trace Analyzer and Collector (ITAC). The practical usage of these tools is demonstrated through case studies on the widely used molecular dynamics simulation code GROMACS.

Download paper: PDF


Authors: M. Hruszowiec (a), P. Potasz (b), A. Szymańska-Kwiecień (a), M. Uchroński (a)
(a) Wroclaw Centre of Networking and Supercomputing (WCSS), Wroclaw University of Science and Technology
(b) Department of Theoretical Physics, Wroclaw University of Science and Technology

Abstract: This work focuses on the parallel simulation of electron-electron interactions in materials with non-trivial topological order (i.e. Chern insulators). The problem of interacting electrons can be solved by diagonalizing a many-body Hamiltonian matrix in a basis of configurations of electrons distributed among the possible single-particle energy levels (the configuration interaction method). The number of possible configurations increases exponentially with the number of electrons and energy levels: 6 electrons occupying 24 energy levels correspond to a Hilbert space dimension of about 10^5, and 12 electrons give about 10^6 configurations. Solving such a problem requires effective computational methods and highly efficient optimization of the source code. The project focuses on many-body effects related to strongly interacting electrons on flat bands with non-trivial topology. Such systems are expected to be useful for the study and understanding of new topological phases of matter, and further in the future could be used to design novel nanomaterials. GPU accelerators will be used to improve performance and scalability in the parallel simulation of electron-electron interactions in materials with non-trivial topological order.
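To make the quoted growth concrete, the short sketch below counts configurations as combinations of N electrons distributed over M single-particle levels, i.e. the binomial coefficient C(M, N); this counting convention is our assumption for illustration and is not taken from the whitepaper. For 6 electrons in 24 levels it gives roughly 1.3 × 10^5 states, and for 12 electrons roughly 2.7 × 10^6, matching the orders of magnitude quoted in the abstract.

    #include <stdio.h>

    /* Number of ways to place n electrons into m single-particle levels,
     * computed as the binomial coefficient C(m, n). The intermediate
     * products stay integral and fit comfortably for the small cases here. */
    static unsigned long long binomial(unsigned m, unsigned n)
    {
        if (n > m)
            return 0ULL;
        if (n > m - n)
            n = m - n; /* use the symmetry C(m, n) = C(m, m - n) */
        unsigned long long result = 1ULL;
        for (unsigned k = 1; k <= n; ++k)
            result = result * (m - n + k) / k; /* exact division at each step */
        return result;
    }

    int main(void)
    {
        const unsigned levels = 24;          /* single-particle levels from the abstract */
        const unsigned electrons[] = { 6, 12 };
        for (unsigned i = 0; i < sizeof electrons / sizeof electrons[0]; ++i)
            printf("%2u electrons in %u levels: %llu configurations\n",
                   electrons[i], levels, binomial(levels, electrons[i]));
        return 0;
    }

The output (134596 and 2704156 configurations) makes the combinatorial scaling explicit and shows why direct diagonalization quickly becomes intractable without efficient parallel implementations.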

Download paper: PDF


Authors: William Killian (a), Renato Miceli (b,c,*), EunJung Park (a), Marco Alvarez Vega (a), John Cavazos (a)
(a) University of Delaware, USA
(b) Irish Centre for High-End Computing (ICHEC), Ireland
(c) Université de Rennes, France

Abstract: Vectorization support in hardware continues to expand and grow as we continue to rely on superscalar architectures. Unfortunately, compilers are not always able to generate optimal code for the hardware; detecting and generating vectorized code is extremely complex. Programmers can use a number of tools to aid in development and tuning, but most of these tools require expert or domain-specific knowledge to use. In this work we aim to provide techniques for determining the best way to optimize certain codes, with the end goal of guiding the compiler into generating optimized code without requiring expert knowledge from the developer.
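As an illustration of the kind of guidance the abstract refers to, the sketch below shows a loop that compilers often decline to auto-vectorize because the pointers might alias, alongside a variant where restrict qualifiers and an OpenMP SIMD directive make the programmer's intent explicit. The example is ours and is not taken from the whitepaper.

    #include <stddef.h>

    /* Without further information the compiler must assume that 'out' may
     * overlap 'a' or 'b', which can prevent auto-vectorization. */
    void saxpy_plain(float *out, const float *a, const float *b,
                     float alpha, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            out[i] = alpha * a[i] + b[i];
    }

    /* 'restrict' promises non-overlapping arrays and the OpenMP 4.0
     * 'simd' directive asks the compiler to vectorize the loop. */
    void saxpy_hinted(float *restrict out, const float *restrict a,
                      const float *restrict b, float alpha, size_t n)
    {
        #pragma omp simd
        for (size_t i = 0; i < n; ++i)
            out[i] = alpha * a[i] + b[i];
    }

Compiler vectorization reports (for example -fopt-info-vec with GCC or -qopt-report with the Intel compiler) can then be used to confirm whether the hinted loop was actually vectorized.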

Download paper: PDF


Authors: Zhengxiong Hou, Christian Perez
INRIA, LIP, ENS-Lyon, France

Abstract: On multi-core clusters or supercomputers, obtaining good performance when running high performance computing (HPC) applications is a main concern. In this report, performance-oriented auto-tuning strategies and experimental results are presented for stencil HPC applications on multi-core parallel machines. A typical 2D Jacobi benchmark is chosen as the experimental stencil application. The main tuning strategies include the data partitioning within a multi-core node, the number of threads within a multi-core node, the data partitioning across a number of nodes, and the number of nodes in a multi-core cluster system. The experimental results are based on multi-core parallel machines from PRACE or Grid'5000, such as the Curie and Stremi clusters.
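For readers unfamiliar with the benchmark, the sketch below shows the core of a shared-memory 2D Jacobi sweep with two of the tuning parameters mentioned in the abstract exposed explicitly: the number of OpenMP threads and the block (tile) size used for data partitioning within a node. The parameter names and the tiling scheme are illustrative assumptions on our part, not taken from the whitepaper.

    #include <omp.h>

    /* One Jacobi sweep over an n x n grid (with a fixed boundary),
     * partitioned into bs x bs tiles and parallelized over tiles.
     * 'nthreads' and 'bs' are the intra-node tuning parameters. */
    void jacobi_sweep(const double *restrict u, double *restrict unew,
                      int n, int bs, int nthreads)
    {
        #pragma omp parallel for collapse(2) num_threads(nthreads) schedule(static)
        for (int ib = 1; ib < n - 1; ib += bs) {
            for (int jb = 1; jb < n - 1; jb += bs) {
                int imax = ib + bs < n - 1 ? ib + bs : n - 1;
                int jmax = jb + bs < n - 1 ? jb + bs : n - 1;
                for (int i = ib; i < imax; ++i)
                    for (int j = jb; j < jmax; ++j)
                        unew[i * n + j] = 0.25 * (u[(i - 1) * n + j] +
                                                  u[(i + 1) * n + j] +
                                                  u[i * n + j - 1] +
                                                  u[i * n + j + 1]);
            }
        }
    }

Auto-tuning in this setting amounts to searching over the thread count and tile size for a given grid and node, and then, once the computation is distributed with MPI, over the inter-node data partitioning and the number of nodes, which are the remaining strategies listed in the abstract.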

Download paper: PDF


Disclaimer

These whitepapers have been prepared by the PRACE Implementation Phase Projects and in accordance with the Consortium Agreements and Grant Agreements n° RI-261557, n° RI-283493, or n° RI-312763.

They solely reflect the opinion of the parties to such agreements on a collective basis in the context of the PRACE Implementation Phase Projects and to the extent foreseen in such agreements. Please note that even though all participants to the PRACE IP Projects are members of PRACE AISBL, these whitepapers have not been approved by the Council of PRACE AISBL and therefore do not emanate from it, nor should they be considered to reflect PRACE AISBL's individual opinion.

Copyright notices

© 2014 PRACE Consortium Partners. All rights reserved. This document is a project document of a PRACE Implementation Phase project. All contents are reserved by default and may not be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contracts RI-261557, RI-283493, or RI-312763 for reviewing and dissemination purposes.

All trademarks and other rights on third party products mentioned in the document are acknowledged as owned by the respective holders.