Quantum MD applications
Authors: Thomas Ponweisera,*, Malgorzata Wierzbowskab,†
aResearch Institute for Symbolic Computation (RISC), Johannes Kepler University, Altenberger Straße 69, 4040 Linz, Austria
bInstitute of Physics, Polish Academy of Science, Al. Lotnikow 32/46, 02-668 Warsaw, Poland
WANNIER90 is a quantum-mechanics code for the computation of maximally localized Wannier functions, ballistic transport, thermoelectrics, and Berry-phase derived properties – such as optical conductivity, orbital magnetization, and anomalous Hall conductivity. In this whitepaper, we report on optimizations for WANNIER90 carried out in the course of the PRACE preparatory access project PA2231. Through performance tuning based on the integrated tool suite HPCToolkit and further parallelization of runtime-critical calculations, significant improvements in performance and scalability have been achieved.
Previously unfeasible computations with more than 64 atoms are now possible and the code exhibits almost perfect strong scaling behavior up to 2048 processes for sufficiently large problem settings.
Authors: J. Alberdi-Rodrigueza,b, A. Rubioa,c, M. Oliveirad, A. Charalampidoue,f, D. Foliase,f
aNano-Bio Spectroscopy Group and European Theoretical Spectroscopy Facility (ETSF)
University of the Basque Country UPV/EHU, Donostia, Spain
b Department of Computer Architecture and Technology University of the Basque Country UPV/EHU, Donostia, Spain
c Max Planck Institute for the Structure and Dynamics of Matter, Hamburg, Germany, Departamento de Fisica de Materiales, Centro de Fisica de Materiales CSIC-UPV/EHU-MPC and DIPC, University of the Basque Country, UPV/EHU, Donostia, Spain
Fritz-Haber-Institut Max-Planck-Gesellschaft, Berlin, Germany
d Department of Physics, University of Liège
e Greek Research and Technology Network, Athens, Greece
f Scientific Computing Center, Aristotle University of Thessaloniki, Greece
Octopus is a software package for density-functional theory (DFT) and its time-dependent variant (TDDFT). A Linear Combination of Atomic Orbitals (LCAO) calculation is performed prior to the actual DFT run. LCAO is used to obtain an initial guess of the densities and thus to start the Self-Consistent Field (SCF) cycle of the Ground-State (GS) calculation. The system initialization and LCAO steps consume a large amount of memory and do not demonstrate good performance. In this study, extensive profiling has been performed in order to identify large matrices and the scaling behavior of the initialization and LCAO steps. Alternative implementations of LCAO in Octopus have been investigated in order to optimize the memory usage and performance of the LCAO approach. The use of the ScaLAPACK library led to a significant improvement in memory allocation and performance. Benchmark tests have been performed on the MareNostrum III HPC system using various combinations of atomic-system sizes and numbers of CPU cores.
Soon-Heum Koa, Simen Reine, Thomas Kjærgaard
National Supercomputing Centre, Linköping University, 581 83 Linköping, Sweden
Centre for Theoretical and Computational Chemistry, Department of Chemistry, Oslo University, Postbox 1033, Blindern, 0315, Oslo, Norway
LEAP – Center for Theoretical Chemistry, Department of Chemistry, Aarhus University, Langelandsgade 140, Aarhus C, 8000, Denmark
In this paper, we present the performance of LSDALTON's DFT method in large molecular simulations of biological interest. We primarily focus on evaluating the performance gain obtained by applying the density-fitting (DF) scheme and the auxiliary density matrix method (ADMM). The enabling effort went towards finding the right build environment (the combination of compiler, MPI implementation and supporting libraries) that generates a fully 64-bit-integer-based binary. Using three biological molecules varying in size, we verify that the DF and ADMM schemes provide a substantial performance gain for the DFT code, at the cost of large memory consumption to store extra matrices and a slight change in scalability characteristics with the ADMM calculation. In the insulin simulation, the parallel region of the code accelerates by 30 percent with the DF calculation and by 56 percent in the case of the DF-ADMM calculations.
Mariusz Uchronski, Agnieszka Kwiecien, Marcin Gebarowski
WCSS, Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland
CP2K is an application for atomistic and molecular simulation and, with its excellent scalability, is particularly important with regard to use on future exascale systems. The code is well parallelized using MPI and hybrid MPI/OpenMP, typically scaling well up to 1 core per atom in the system. Research on CP2K done within PRACE-1IP showed that, due to the heavy use of sparse matrix multiplication for large systems, there is room for performance improvement. The main goal of this work, undertaken within PRACE-3IP, was to investigate the most time-consuming routines and port them to accelerators, particularly GPGPUs. The relevant area of the code that can be effectively accelerated is the matrix multiplication (the DBCSR library). A significant amount of work had already been done on the DBCSR library using CUDA. We focused on enabling the library on a potentially wider range of computing resources using the OpenCL and OpenACC technologies, to bring the overall application closer to exascale. We present the ports and promising performance results. The work done has led to the identification of a number of issues with using OpenACC in CP2K, which need to be further investigated and resolved to make the application and the technology work better together.
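The abstract above centers on DBCSR's blocked sparse matrix multiplication. As a minimal illustration of the underlying idea — storing and multiplying only the occupied dense sub-blocks, so that each block product becomes a small dense GEMM suitable for offloading to an accelerator — here is a sketch in Python; the dictionary-of-blocks layout and function name are our own simplifications, not DBCSR's actual API or distribution scheme.

```python
import numpy as np

def block_sparse_matmul(a_blocks, b_blocks):
    """Multiply two block-sparse matrices stored as {(i, k): dense_block}.

    This mimics the core idea behind CP2K's DBCSR library: only the
    non-zero dense sub-blocks are stored and multiplied, so the work
    scales with the number of occupied blocks, not the full matrix size.
    """
    c_blocks = {}
    for (i, k), a in a_blocks.items():
        for (k2, j), b in b_blocks.items():
            if k != k2:
                continue
            prod = a @ b  # small dense GEMM: the natural offload target
            if (i, j) in c_blocks:
                c_blocks[(i, j)] += prod
            else:
                c_blocks[(i, j)] = prod
    return c_blocks
```

In DBCSR these block GEMMs are batched and dispatched to GPU streams, which is precisely what the CUDA, OpenCL and OpenACC ports discussed above target.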
Authors: J.A. Astrom
CSC – IT Center for Science, Espoo, Finland
Abstract: NUMFRAC is a generic particle-based code for the simulation of non-linear mechanics in disordered solids. The generic theory of the code is outlined and examples are given for glacier calving and fretting. This text is to a large degree a part of the publication: J. A. Åström, T. I. Riikilä, T. Tallinen, T. Zwinger, D. Benn, J. C. Moore, and J. Timonen, A particle-based simulation model for glacier dynamics, The Cryosphere Discuss., 7, 921-941, 2013.
Authors: A. Calzolaria, C. Cavazzonib
a Istituto Nanoscienze CNR-NANO-S3, I-41125 Modena Italy
b CINECA – Via Magnanelli 6/3, 40033 Casalecchio di Reno (Bologna)
Abstract: This work concerns the enabling of the Time-Dependent Density Functional Theory kernel (TurboTDDFT) of the Quantum-ESPRESSO package on petascale systems. TurboTDDFT is a fundamental tool for investigating nanostructured materials and nanoclusters, whose optical properties are determined by their electronic excited states. Enabling TurboTDDFT on petascale systems will open up the possibility to compute optical properties for large systems relevant for technological applications. Plasmonic excitations, in particular, are important for a large range of applications, from biological sensing and energy conversion to subwavelength waveguides. The goal of the present project was the implementation of novel strategies for reducing the memory requirements and improving the weak scalability of the TurboTDDFT code, aiming at a significant improvement of the code's capabilities and at the ability to study the plasmonic properties of metal nanoparticles (Ag, Au) and their dependence on the size of the system under test.
Authors: Massimiliano Guarrasia, Sandro Frigiob, Andrew Emersona and Giovanni Erbaccia
a CINECA, Italy
b University of Camerino, Italy
In this paper, we present part of the work carried out by CINECA in the framework of the PRACE-2IP project, aimed at studying the effect on performance of implementing a 2D Domain Decomposition algorithm in DFT codes that use a standard 1D (or slab) parallel domain decomposition. The performance of this new algorithm is tested on two example applications: Quantum ESPRESSO, a popular code used in materials science, and the CFD code BlowupNS.
In the first part of this paper, we present the codes that we use; in the last part, we show the performance increase obtained using this new algorithm.
Authors: Al. Charalampidoua,b, P. Korosogloua,b, F. Ortmannc, S. Rochec
a Greek Research and Technology Network, Athens, Greece
b Scientific Computing Center, Aristotle University of Thessaloniki, Thessaloniki, Greece
c Catalan Institute of Nanotechnology, Spain
This study has focused on an application for Quantum Hall Transport simulations and, more specifically, on how to overcome an initially identified potential performance bottleneck related to the I/O of wave functions. These operations are required in order to enable and facilitate continuation runs of the code. After trying several implementations for performing these I/O operations in parallel (using the MPI I/O library), we show that a performance gain in the range of 1.5–2 can be achieved when switching from the initial POSIX-only approach to the parallel MPI I/O approach on both the CURIE and HERMIT PRACE Tier-0 systems. Moreover, we show that, because the I/O throughput scales with an increasing number of cores, the overall performance of the code remains efficient up to at least 8192 processes.
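The switch described above — from per-process POSIX files to a single shared file written collectively with MPI I/O — hinges on each rank computing a disjoint byte range for its share of the wave functions. A small sketch of that offset arithmetic (the block distribution and all names are illustrative assumptions; the abstract does not specify the actual file layout):

```python
def wavefunction_offsets(n_ranks, n_states, state_bytes):
    """Compute the byte offset at which each MPI rank writes its share of
    the wave functions into one shared file (simple block distribution).

    With POSIX-only I/O each rank writes a private file; with MPI I/O all
    ranks write disjoint regions of a single file at offsets like these,
    which the MPI library can aggregate into large, well-aligned requests.
    """
    base, rem = divmod(n_states, n_ranks)
    offsets, counts = [], []
    pos = 0
    for rank in range(n_ranks):
        n_local = base + (1 if rank < rem else 0)  # states owned by this rank
        offsets.append(pos * state_bytes)
        counts.append(n_local * state_bytes)
        pos += n_local
    return offsets, counts
```

In an actual MPI code, rank `r` would then issue a collective write (e.g. `MPI_File_write_at_all`) of `counts[r]` bytes at `offsets[r]`.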
Authors: Martti Louhivuoria, Jussi Enkovaaraa,b
a CSC – IT Center for Science Ltd., PO Box 405, 02101 Espoo, Finland
b Aalto University, Department of Applied Physics, PO Box 11100, 00076 Aalto, Finland
In recent years, graphical processing units (GPUs) have generated a lot of excitement in the computational sciences by promising a significant increase in computational power compared to conventional processors. While this holds in many cases for small-scale computational problems that can be solved using the processing power of a single computing unit, the efficient usage of multiple GPUs in parallel over multiple interconnected computing units has been problematic.
Increasingly, the real-life problems tackled by computational scientists require large-scale parallel computing, and thus it is crucial that GPU-enabled software reach good parallel scalability to reap the benefits of GPU acceleration. This is exactly what has been achieved for GPAW, a popular quantum chemistry program, by Hakala et al. in their recent work.
Authors: Simen Reinea, Thomas Kjærgaarda, Trygve Helgakera, Ole Widar Saastadb, Andrew Sunderlandc
a Centre for Theoretical and Computational Chemistry (CTCC), Department of Chemistry, University of Oslo, Oslo, Norway,
b University Center for Information Technology, University of Oslo, Oslo, Norway
c STFC Daresbury Laboratory, Warrington, United Kingdom
Linear Scaling DALTON (LSDALTON) is a powerful molecular electronic structure program that is the focus of software optimization projects in PRACE 1IP-WP7.2 and PRACE 1IP-WP7.5. This part of the project focuses on the introduction of parallel diagonalization routines from the ScaLAPACK library into the latest MPI version of LSDALTON. The parallelization work has involved three main tasks: i) redistribution of the matrices assembled for the SCF cycle from a serial/distributed state to the two-dimensional block-cyclic data distribution used by PBLAS and ScaLAPACK; ii) interfacing of LSDALTON data structures to the parallel diagonalization routines in ScaLAPACK; iii) performance testing to determine the favored ScaLAPACK eigensolver methodology.
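Task i) above is the redistribution into ScaLAPACK's block-cyclic layout. The standard index mapping, applied independently to the row and column dimensions of the process grid, can be sketched as follows (a 1D illustration with hypothetical parameter names, not LSDALTON's actual interface):

```python
def block_cyclic_owner(global_idx, block_size, n_procs):
    """Map a global row (or column) index to its owning process and local
    index under ScaLAPACK's 1D block-cyclic distribution.

    Blocks of `block_size` consecutive indices are dealt out to processes
    round-robin; applying this mapping independently to rows and columns
    over a 2D process grid yields the layout PBLAS/ScaLAPACK expect.
    """
    block = global_idx // block_size   # which block this index falls in
    owner = block % n_procs            # blocks are dealt round-robin
    local_block = block // n_procs     # blocks this owner already holds
    local_idx = local_block * block_size + global_idx % block_size
    return owner, local_idx
```

The redistribution step then amounts to sending each matrix element from wherever the SCF assembly left it to `owner`, at position `local_idx`, in both dimensions.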
Authors: Fabio Affinitoa, Emanuele Cocciab, Sandro Sorellac, Leonardo Guidonib,
a CINECA, Casalecchio di Reno, Italy
b Università dell'Aquila, L'Aquila, Italy
c SISSA, Trieste, Italy
Quantum Monte Carlo (QMC) methods are a promising technique for the study of the electronic structure of correlated molecular systems. The technical goal of the present project is to demonstrate the scalability of the TurboRVB code for a series of systems having different properties in terms of the number of electrons, the number of variational parameters and the size of the basis set.
Authors:Iain Bethunea, Adam Cartera, Xu Guoa, Paschalis Korosogloub,c
a EPCC, The University of Edinburgh, James Clerk Maxwell Building, The King’s
Buildings, Edinburgh EH9 3JZ, United Kingdom
b AUTH, Aristotle University of Thessaloniki, Thessaloniki 52124, Greece
c GRNET, Greek Research & Technology Network, L. Mesogeion 56, Athens, Greece
CP2K is a powerful materials science and computational chemistry code and is widely used by research groups across Europe and beyond. The recent addition of a linear scaling KS-DFT method within the code has made it possible to simulate systems of unprecedented size – 1,000,000 atoms or more – making full use of petascale computing resources. Here we report on work undertaken within PRACE 1-IP WP 7.1 to port and test CP2K on JUGENE, the PRACE Tier-0 BlueGene/P system. In addition, development work was performed to reduce the memory usage of a key data structure within the code, to make it more suitable for the limited memory environment of the BlueGene/P. Finally, we present a set of benchmark results and analysis of a large test case.
Authors: Luigi Genovesea,b, Brice Videaua, Thierry Deutscha, Huan Tranc, Stefan Goedeckerc
a Laboratoire de Simulation Atomistique, SP2M/INAC/CEA, 17 Av. des Martyrs, 38054 Grenoble, France
b European Synchrotron Radiation Facility, 6 rue Horowitz, BP 220, 38043 Grenoble, France
c Institut für Physik, Universität Basel, Klingelbergstr. 82, 4056 Basel, Switzerland
Electronic structure calculations (DFT codes) are certainly among the disciplines for which an increase in computational power corresponds to an advancement in scientific results. In this report, we present the ongoing advancements of a DFT code that can run on massively parallel, hybrid and heterogeneous CPU-GPU clusters. This DFT code, named BigDFT, is delivered under the GNU-GPL license either as a stand-alone version or integrated in the ABINIT software package. Hybrid BigDFT routines were initially ported with NVIDIA's CUDA language, and recently more functionalities have been added with new routines written within the Khronos OpenCL standard. The formalism of this code is based on Daubechies wavelets, a systematic real-space basis set whose properties are well suited for an extension to a GPU-accelerated environment. In addition to focusing on the performance of the MPI and OpenMP parallelization of the BigDFT code, this report also addresses the usage of GPU resources in a complex code with different kinds of operations. The present and expected performance of hybrid architectures in the framework of electronic structure calculations is also discussed.
Reyesa, Iain Bethunea
The University of Edinburgh, James Clerk Maxwell Building, Mayfield
Road, Edinburgh, EH9 3JZ,UK
This report describes the results of a PRACE Preparatory Access Type Cb project to optimize the implementation
of Møller-Plesset second-order perturbation theory (MP2) in CP2K, to allow it to be used efficiently on the
PRACE Research Infrastructure. The work consisted of three stages: firstly, serial optimization of several key computational kernels; secondly, an OpenMP implementation of the parallel 3D Fourier Transform to support mixed-mode MPI/OpenMP use of CP2K; and thirdly, benchmarking the performance gains achieved by the new code on HERMIT for a test case representative of proposed production simulations. Consistent speedups of 8% were
achieved in the integration kernel routines as a result of the serial optimization. When using 8 OpenMP threads
per MPI process, speedups of up to 10x for the 3D FFT were achieved, and for some combinations of MPI
processes and OpenMP threads, overall speedups of 66% for the whole code were measured. As a result of this
work, a proposal for full PRACE Project Access has been submitted.
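The second stage above threads the 3D FFT across OpenMP threads within each MPI process. The same worker-thread parallelism is exposed, for instance, by SciPy's FFT interface; a toy illustration of the idea (this is of course not CP2K's Fortran implementation):

```python
import numpy as np
from scipy import fft

def forward_3d_fft(grid, n_threads=4):
    """3D FFT split across n_threads worker threads -- the same kind of
    thread-level parallelism the OpenMP-parallel 3D FFT in CP2K exploits,
    here delegated to SciPy's `workers` mechanism."""
    return fft.fftn(grid, workers=n_threads)
```

In mixed-mode CP2K the MPI ranks additionally decompose the grid between nodes, so the threaded transform runs on each rank's local slab or pencil.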
Authors: Peicho Petkova, Petko Petkovb,*, Georgi Vayssilovb, Stoyan Markovc
a Faculty of Physics, University of Sofia, 1164 Sofia, Bulgaria
b Faculty of Chemistry, University of Sofia, 1164 Sofia, Bulgaria
c National Centre for Supercomputing Applications, Sofia, Bulgaria
The reported work aims at the implementation of a method allowing realistic simulation of large or extra-large biochemical systems (of 10^6 to 10^7 atoms) with first-principles quantum chemical methods. The current methods treat the whole system simultaneously. In this way, the compute time increases rapidly with the size of the system and does not allow efficient parallelization of the calculations, due to the mutual interactions between the electron density in all parts of the system. In order to avoid these problems, we implemented a version of the Fragment Molecular Orbital (FMO) method, in which the whole system is divided into fragments that are calculated separately. This approach assures nearly linear scaling of the compute time with the size of the system and provides efficient parallelization of the job. The work includes the development of pre- and post-processing components for an automatic division of the system into monomers and the reconstruction of the total energy and electron density of the whole system.
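The reconstruction of the total energy from separately calculated fragments follows the many-body FMO expansion; truncated at dimers (FMO2) it reads E ≈ Σ_I E_I + Σ_{I<J} (E_IJ − E_I − E_J). A minimal sketch of this post-processing step (the abstract does not state which truncation order is used, so the two-body form here is an illustrative assumption):

```python
def fmo2_total_energy(monomer_energies, dimer_energies):
    """Reconstruct the total energy from fragment calculations using the
    two-body FMO expansion:

        E = sum_I E_I + sum_{I<J} (E_IJ - E_I - E_J)

    monomer_energies: {fragment: E_I}
    dimer_energies:   {(I, J): E_IJ} with I < J
    """
    total = sum(monomer_energies.values())
    for (i, j), e_ij in dimer_energies.items():
        # each dimer contributes only its interaction energy
        total += e_ij - monomer_energies[i] - monomer_energies[j]
    return total
```

Because every monomer and dimer energy is an independent quantum-chemical calculation, the sums parallelize trivially across fragments, which is the source of the near-linear scaling claimed above.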
Authors: Iain Bethunea, Adam Cartera, Kevin Stratforda, Paschalis Korosogloub,c
a EPCC, The University of Edinburgh, James Clerk Maxwell Building, The King's
Buildings, Edinburgh, EH9 3JZ, United Kingdom
This report describes the work undertaken under PRACE-1IP to support the European scientific communities who make use of CP2K in their research. This was done in two ways – firstly, by improving the performance of the code for a wide range of usage scenarios. The updated code was then tested and installed on the PRACE CURIE supercomputer. We believe this approach both supports existing user communities by delivering better application performance and demonstrates to potential users the benefits of using optimized and scalable software like CP2K on the PRACE infrastructure.
Authors: Jussi Enkovaaraa, Martti Louhivuoria, Petar Jovanovicb, Vladimir Slavnicb, Mikael Rännarc
a CSC – IT Center for Science, P.O. Box 405, FI-02101 Espoo, Finland
b Scientific Computing Laboratory, Institute of Physics Belgrade, Pregrevica 118, 11080 Belgrade, Serbia
c Department of Computing Science, Umeå University, SE-901 87 Umeå, Sweden
GPAW is a versatile software package for first-principles simulations of nanostructures utilizing density-functional theory and time-dependent density-functional theory. Even though GPAW is already used for massively parallel calculations on several supercomputer systems, some performance bottlenecks still exist. First, the implementation based on the Python programming language introduces an I/O bottleneck during initialization, which becomes serious when using thousands of CPU cores. Second, the current linear-response time-dependent density-functional theory implementation contains a large matrix which is replicated on all CPUs. When reaching for larger and larger systems, memory runs out due to this replication. In this report, we discuss the work done on resolving these bottlenecks. In addition, we have also worked on optimization aspects that are directed more at future usage. As the number of cores in multicore CPUs is still increasing, a hybrid parallelization combining shared-memory and distributed-memory parallelization is becoming appealing. We have experimented with hybrid OpenMP/MPI and report here the initial results. GPAW also performs large dense matrix diagonalizations with the ScaLAPACK library. Due to limitations in ScaLAPACK, these diagonalizations are expected to become a bottleneck in the future, which has led us to investigate alternatives to ScaLAPACK.
The work aims at evaluating the performance of DALTON on different platforms and implementing new strategies to enable the code for petascale. The activities have been organized into four tasks within the PRACE project: (i) analysis of the current status of the DALTON quantum mechanics (QM) code and identification of bottlenecks, implementation of several performance improvements of DALTON QM and a first attempt at hybrid parallelization; (ii) implementation of MPI integral components in LSDALTON, improvements of optimization and scalability, and interfacing of matrix operations to PBLAS and ScaLAPACK numerical library routines; (iii) interfacing of the DALTON and LSDALTON QM codes to the ChemShell quantum mechanics/molecular mechanics (QM/MM) package and benchmarking of QM/MM calculations using this approach; (iv) analysis of the impact of DALTON QM system components with Dimemas. Part of the results reported here have been achieved through collaboration with the ScalaLife project.
Authors: Simen Reine(a), Thomas Kjærgaard(a), Trygve Helgaker(a), Olav Vahtras(b,d), Zilvinas Rinkevicius(b,g), Bogdan Frecus(b), Thomas W. Keal(c), Andrew Sunderland(c), Paul Sherwood(c), Michael Schliephake(d), Xavier Aguilar(d), Lilit Axner(d), Maria Francesca Iozzi(e), Ole Widar Saastad(e), Judit Gimenez(f)
a Centre for Theoretical and Computational Chemistry (CTCC), Department of Chemistry, University of Oslo, P.O.Box 1033 Blindern, N-0315 Oslo, Norway
b KTH Royal Institute of Technology, School of Biotechnology, Division of Theoretical Chemistry & Biology, S-106 91 Stockholm, Sweden
c Computational Science & Engineering Department, STFC Daresbury Laboratory, Daresbury Science and Innovation Campus, Warrington, Cheshire, WA4 4AD, UK
d PDC Center for High-Performance Computing at Royal Institute of Technology (KTH), Teknikringen 14, 100 44 Stockholm, Sweden
e University center for Information technology, University of Oslo, P.O.Box 1059 Blindern, N-0316 Oslo, Norway
f Computer Sciences – Performance Tools, Barcelona Supercomputing Center, Campus Nord UP C6, C/ Jordi Girona, 1-3, Barcelona, 08034
g KTH Royal Institute of Technology, Swedish e-Science Center (SeRC), S-100 44, Stockholm, Sweden
In this paper, we present development work carried out on the Quantum ESPRESSO software package within PRACE-1IP. We describe the different activities performed to enable the Quantum ESPRESSO user community to challenge the frontiers of science by running extreme computing simulations on European Tier-0 systems of the current and next generation. Three main sections are described: 1) the improvement of parallelization efficiency of two DFT-based applications: Nuclear Magnetic Resonance (NMR) and EXact-eXchange (EXX) calculations; 2) the introduction of innovative van der Waals interactions at the ab-initio level; 3) the porting of the PWscf code to a hybrid system equipped with NVIDIA GPU technology.
Authors: Dusan Stankovic*a, Aleksandar Jovica, Petar Jovanovica, Dusan Vudragovica, Vladimir Slavnica
a Institute of Physics Belgrade, Serbia
In this whitepaper, we report work that was done on enabling support for the FFTE Fast Fourier Transform library in Quantum ESPRESSO, enabling threading for the FFTW3 library (previously supported in Quantum ESPRESSO only in a serial version), and benchmarking and comparing their performance with the existing implementations of FFT in Quantum ESPRESSO.
Classical MD applications
Authors: Mariusz Uchronskia, Agnieszka Kwieciena,*, Marcin Gebarowskia, Justyna Kozlowskaa
a WCSS, Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland
Abstract: The prototypes evaluated within the PRACE-2IP project provide a range of different computing hardware, including general-purpose Graphics Processing Units (GPUs) and accelerators like the Intel Xeon Phi. In this work, we evaluated the performance and energy consumption of two prototypes when used for a real-case simulation. Due to the heterogeneity of the prototypes, we decided to use the DL_POLY molecular simulation package and its OpenCL port for the tests. The DL_POLY OpenCL port implements one of the methods – the Constraints SHAKE (CS) component. SHAKE is a two-stage algorithm based on the leapfrog Verlet integration scheme. We used four test cases for the evaluation: one from the DL_POLY application test-suite – H2O – and three real cases provided by a user. We show the performance results and discuss the usage experience with the prototypes in the context of ease of use, the porting effort required, and energy consumption.
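For reference, the two-stage SHAKE idea — take an unconstrained leapfrog step, then iteratively correct the positions along the old bond vectors until each constraint is satisfied — can be sketched for a single bond with unit masses (a strong simplification of DL_POLY's general multi-constraint implementation):

```python
import numpy as np

def shake_bond(r1, r2, r1_old, r2_old, d0, tol=1e-10, max_iter=100):
    """Iteratively correct unconstrained positions r1, r2 until the bond
    length returns to d0 (SHAKE for one constraint, unit masses).

    r1_old, r2_old are the positions before the unconstrained leapfrog
    step; the correction acts along the old bond vector, as in SHAKE.
    """
    r1, r2 = r1.astype(float), r2.astype(float)
    d_old = r2_old - r1_old
    for _ in range(max_iter):
        d = r2 - r1
        diff = d @ d - d0 * d0           # constraint violation |d|^2 - d0^2
        if abs(diff) < tol:
            break
        g = diff / (4.0 * (d @ d_old))   # Lagrange multiplier (unit masses)
        r1 = r1 + g * d_old
        r2 = r2 - g * d_old
    return r1, r2
```

In a production code this loop runs over all constraints simultaneously until every bond converges, which is what makes the kernel both iterative and data-parallel — and hence a natural OpenCL target.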
D. Grancharov, N. Ilieva, E. Lilkova, L. Litov, S. Markov, P. Petkov, I. Todorov
NCSA, Akad. G. Bonchev 25A, Sofia 1311, Bulgaria
STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, UK
A library implementing the AGBNP2 [1, 2] implicit solvent model that was developed within PRACE-2IP is integrated into the DL_POLY_4 molecular dynamics package in order to speed up the time to solution for protein solvation processes. Generally, implicit solvent models lighten the computational load by reducing the degrees of freedom of the model, removing those of the solvent and thus concentrating only on the protein dynamics, which is facilitated by the absence of friction with solvent molecules. Furthermore, periodic boundary conditions are no longer formally required, since long-range electrostatic calculations cannot be applied to systems with variable dielectric permittivity. The AGBNP2 implicit solvation model improves the conformational sampling of the protein dynamics by including the influence of the solvent-accessible surface and water-protein hydrogen-bonding effects as interactive force corrections to the atoms of the protein surface. This requires the development of suitable bookkeeping data structures, in accordance with the domain decomposition framework of DL_POLY, with dynamically adjustable inter-connectivity to describe the protein surface. The work also requires the use of advanced B-tree search libraries as part of the AGBNP library, in order to reduce the memory and compute requirements, and the automatic derivation of the van der Waals radii of atoms from the self-interaction potentials.
P. Petkov, I. Todorov, D. Grancharov, N. Ilieva, E. Lilkova, L. Litov, S. Markov
NCSA, Akad. G. Bonchev 25A, Sofia 1311, Bulgaria
STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, UK
Electrostatic interactions in molecular simulations are usually evaluated by employing the Ewald summation method, which splits the summation into a short-range part, treated in real space, and a long-range part, treated in reciprocal space. For performance purposes, in molecular dynamics software the latter is usually handled by the SPME or P3M grid-based methods, both relying on the 3D fast Fourier transform (FFT) as their central operation. However, the Ewald summation method is derived for model systems that are subject to 3D periodic boundary conditions (PBC), while there are many models of scientific as well as commercial interest where the geometry implies a 1D or 2D structure. Thus for systems such as membranes, interfaces, linear protein complexes, thin layers, nanotubes, etc., employing Ewald-summation-based techniques is either computationally very disadvantageous or outright impossible. Another approach to evaluate the electrostatic interactions is to solve the Poisson equation of the model-system charge distribution on a 3D spatial grid. The formulation of the method allows an elegant way to switch the dependence on periodic boundary conditions on and off in a simple manner. Furthermore, 3D FFT kernels are known to scale poorly at large scale due to excessive memory and communication overheads, which makes Poisson solvers a viable alternative for DL_POLY on the road to exascale. This paper describes the work undertaken to integrate a Poisson solver library, developed in PRACE-2IP, within the DL_POLY_4 domain decomposition framework. The library relies on a unique combination of the bi-conjugate gradient (BiCG) and conjugate gradient (CG) methods to warrant both independence of initial conditions with rapid convergence of the solution on the one hand, and stabilization of possible fluctuations of the iterative solution on the other.
The implementation involves the development of procedures for generating charge density and electrostatic potential grids in real space over all domains in a distributed manner, as well as halo exchange routines and functions to calculate the gradient of the potential in order to recover the electrostatic forces on point charges.
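A minimal model of the grid-based approach: discretize the Poisson equation with finite differences and solve the resulting linear system iteratively. The sketch below uses plain CG on a 1D model problem with zero-boundary (non-periodic) conditions — the production library's BiCG/CG combination, 3D grids and distributed halo exchanges are beyond this illustration:

```python
import numpy as np

def solve_poisson_1d_cg(rho, h, tol=1e-10, max_iter=1000):
    """Solve -phi'' = rho on a 1D grid with zero-boundary conditions using
    the conjugate-gradient method (the production library combines BiCG
    and CG; plain CG suffices for this symmetric model problem)."""
    n = len(rho)

    def apply_laplacian(phi):
        # -phi'' via second-order central differences, phi = 0 at both ends
        ax = 2.0 * phi - np.roll(phi, 1) - np.roll(phi, -1)
        ax[0] = 2.0 * phi[0] - phi[1]
        ax[-1] = 2.0 * phi[-1] - phi[-2]
        return ax / (h * h)

    phi = np.zeros(n)
    r = rho - apply_laplacian(phi)       # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        ap = apply_laplacian(p)
        alpha = rs / (p @ ap)
        phi += alpha * p
        r -= alpha * ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return phi
```

In the distributed 3D setting, `apply_laplacian` is where the halo exchange happens: each domain needs one layer of neighbouring grid points before applying the stencil, and the forces are then recovered from the gradient of `phi`.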
Buket Benek Gursoya, Henrik Nagelb
a Irish Centre for High-End Computing, Ireland
b Norwegian University of Science and Technology, Norway
This whitepaper investigates the potential benefit of using the OpenACC directive-based programming tool for
enabling DL_POLY_4 on GPUs. DL_POLY is a well-known general-purpose molecular dynamics simulation
package, which has already been parallelized using MPI-2. DL_POLY_3 was accelerated using the CUDA
framework by the Irish Centre for High-End Computing (ICHEC) in collaboration with Daresbury Laboratory.
In this work, we have been inspired by the existing CUDA port to evaluate the effectiveness of OpenACC in
further enabling DL_POLY_4 on the road to Exascale. We have been particularly concerned with investigating
the benefits of OpenACC in terms of maintainability, programmability and portability issues that are becoming
increasingly challenging as we advance to the Exascale era. The impact of the OpenACC port has been assessed
in the context of a change in the reciprocal vector dimension for the calculation of SPME forces. Moreover, the
interoperability of OpenACC with the existing CUDA port has been analyzed.
Authors: Mariusz Uchronskia, Marcin Gebarowskia, Agnieszka Kwieciena,*
a Wroclaw Centre for Networking and Supercomputing (WCSS), Wyb. Wyspianskiego
27, 50-370 Wroclaw, Poland
The SHAKE and RATTLE algorithms are widely used in molecular dynamics simulations and, for this reason, are relevant for a broad range of scientific applications. In this work, an existing CPU+GPU implementation of the SHAKE and RATTLE algorithms from the DL_POLY application is investigated. DL_POLY is a general-purpose parallel molecular dynamics simulation package developed at Daresbury Laboratory by W. Smith and I. T. Todorov. The OpenCL code of the SHAKE algorithm for the DL_POLY application is analyzed for further optimization possibilities. Our work on the RATTLE algorithm is focused on porting the algorithm from Fortran to OpenCL and adjusting it to the GPGPU architecture.
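Where SHAKE corrects positions, RATTLE additionally corrects velocities so that the time derivative of each constraint also vanishes. For a single bond with unit masses, the velocity stage amounts to removing the relative-velocity component along the (already position-corrected) bond — a strong simplification of the Fortran routine being ported:

```python
import numpy as np

def rattle_velocities(r1, r2, v1, v2):
    """RATTLE velocity stage for one bond with unit masses: project out
    the relative-velocity component along the constrained bond, so that
    d/dt |r2 - r1|^2 = 0 holds exactly after the velocity update."""
    d = r2 - r1
    g = (d @ (v2 - v1)) / (2.0 * (d @ d))  # velocity Lagrange multiplier
    return v1 + g * d, v2 - g * d
```

Like the SHAKE position loop, this projection is applied per constraint and iterated over coupled constraints, giving the same data-parallel structure that makes an OpenCL port attractive.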
Lysaghta, Mariusz Uchronskib, Agnieszka Kwiecienb, Marcin Gebarowskib, Peter Nasha, Ivan Girottoa and Ilian T. Todorovc
a Irish Centre for High End Computing, Tower Building, Trinity Technology and
Enterprise Campus, Grand Canal Quay, Dublin 2, Ireland
b Wroclaw Centre for Network and Supercomputing, Wybrzeze Wyspianskiego 27,
50-370 Wroclaw, Poland
c STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD, United Kingdom
We describe recent development work carried out on the GPU-enabled classical molecular dynamics software package DL_POLY. We describe how we have updated the original GPU port of DL_POLY 3 in order to align the 'CUDA+OpenMP'-based code with the recently released MPI-based DL_POLY 4 package. In the process of updating the code, we have also fixed several bugs, which allows us to benchmark the GPU-enabled code on many more GPU nodes than was previously possible. We also describe how we have recently initiated the development of an OpenCL-based implementation of DL_POLY and present a performance analysis of the set of DL_POLY modules that have so far been ported to GPUs using the OpenCL framework.
Preliminary Porting Experiences, Results and Next Steps
Authors: Sadaf Alam, Ugo Varetto
Swiss National Supercomputing Centre, Lugano, Switzerland
Abstract: This report introduces a hybrid implementation of the Gromacs application and provides instructions on building and executing on PRACE prototype platforms with Graphical Processing Units (GPU) and Many Integrated Cores (MIC) accelerator technologies. GROMACS currently employs message-passing MPI parallelism, multi-threading using OpenMP and contains kernels for non-bonded interactions that are accelerated using the CUDA programming language. As a result, the execution model is multi-faceted where end users can tune the application execution according to the underlying platforms. We present results that have been collected on the PRACE prototype systems as well as on other GPU and MIC accelerated platforms with similar configurations. We also report on the preliminary porting effort that involves a fully portable implementation of GROMACS using OpenCL programming language instead of CUDA, which is only available on NVIDIA GPU devices.
Authors: Fabio Affinitoa, Andrew Emersona, Leandar Litovb, Peicho Petkovb, Rossen Apostolovc,d, Lilit Axnerc, Berk Hessd and Erik Lindahld, Maria Francesca Iozzie
a CINECA Supercomputing, Applications and Innovation Department, via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
b National Center for Supercomputing Applications, Sofia, Bulgaria
c PDC Center for High-Performance Computing at Royal Institute of Technology (KTH), Teknikringen
14, 100 44 Stockholm, Sweden
d Department of Theoretical Physics, Royal Institute of Technology (KTH), Stockholm, Sweden
e Research Computing Services Group, University of Oslo, Postboks 1059 Blindern, 0316 Oslo, Norway
The work aims at evaluating the performance of GROMACS on different platforms and at determining the optimal set of conditions on given architectures for petascale molecular dynamics simulations. The activities have been organized into three tasks within the PRACE project: (i) optimization of GROMACS performance on Blue Gene systems; (ii) parallel scaling of the OpenMP implementation; (iii) development of multiple step-size symplectic integrators adapted to large biomolecular systems. Part of the results reported here has been achieved through collaboration with the ScalaLife project.
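For task (iii), the general idea of a multiple step-size symplectic integrator is to evaluate cheap, fast-varying forces more often than expensive, slowly varying ones. A generic r-RESPA-style sketch of this scheme (an assumption about the general technique, not the integrator actually developed in the project):

```python
def respa_step(x, v, f_fast, f_slow, dt_outer, n_inner, mass=1.0):
    """One outer step of a two-level symplectic multiple time-step
    (r-RESPA-style) integrator: slow forces kick at dt_outer, fast
    forces are integrated with velocity Verlet at dt_outer/n_inner."""
    dt_inner = dt_outer / n_inner
    v = v + 0.5 * dt_outer * f_slow(x) / mass      # slow half-kick
    for _ in range(n_inner):                        # fast velocity Verlet
        v = v + 0.5 * dt_inner * f_fast(x) / mass
        x = x + dt_inner * v
        v = v + 0.5 * dt_inner * f_fast(x) / mass
    v = v + 0.5 * dt_outer * f_slow(x) / mass      # slow half-kick
    return x, v
```

In biomolecular systems the fast forces would be bonded interactions and the slow ones long-range non-bonded terms; the splitting preserves the symplectic structure that keeps energy drift bounded.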
Computational Fluid Dynamics (CFD) applications
Authors: A. Shamakinaa,1, P. Tsoutsanisb,2
a High-Performance Computing Center Stuttgart (HLRS), University of Stuttgart, Nobelstrasse 19, 70569 Stuttgart, Germany
b Centre for Computational Engineering Sciences, Cranfield University, College Rd, Cranfield MK43 0AL, United Kingdom
Abstract: The Higher-Order finite-Volume unstructured code Enhancement (HOVE2) is an open-source code in the field of computational fluid dynamics (CFD). It enables the simulation of compressible turbulent flows. In this White Paper, we report on optimizations of the HOVE2 code implemented in the course of the PRACE Preparatory Access Type C project “HOVE2” in the time frame of December 2018 to June 2019. The optimization focused on implementing ParMETIS support and MPI-IO. Through the optimization of the MPI collective communications, significant speedups have been achieved. In particular, on 70 compute nodes, MPI-IO reduced write times by a factor of 180 compared to normal I/O.
Authors: T. Arslana,*, M. Ozbulutb
a Norwegian University of Science and Technology
b Piri Reis University
Abstract: Graphics processing unit (GPU) accelerated supercomputers have proved to be very powerful and energy-efficient for compute-intensive applications, and have become the new standard in high-performance computing (HPC) and a critical ingredient in the pursuit of exascale computing. In this study, a quantitative comparison of GPUs applied to the solution of a violent free-surface flow problem with the Smoothed Particle Hydrodynamics (SPH) method is presented. The performance and scalability of the cards are evaluated on a sway sloshing problem in a tank by solving Euler’s equation of motion with the weakly compressible SPH (WCSPH) method. The algorithm demands extensive computational power for simulations that require a large number of particles, as in a sloshing tank with a two- or three-dimensional complex geometry; parallelization of the solver is therefore the key to applying the method to real industrial flow problems. Recent research has shown that the WCSPH approach is well suited to GPUs because of its explicit formulation. The comparisons show how many times faster the proposed method runs on the GPU than on the CPU. For the sway sloshing problem, the time histories of free-surface elevations at the left side wall of the tank are compared with experimental and numerical results available in the literature to demonstrate the accuracy of the method and, finally, to analyze the solver’s efficiency on GPUs.
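For context, the "weakly compressible" part of WCSPH refers to an explicit equation of state, commonly Tait's, that converts small density fluctuations into pressure, avoiding an implicit pressure solve. A minimal sketch under standard assumptions (water reference density 1000 kg/m³, an artificial sound speed c0 chosen roughly 10x the maximum flow speed so density variation stays near 1%):

```python
def tait_pressure(rho, rho0=1000.0, c0=30.0, gamma=7.0):
    """Tait equation of state used in weakly compressible SPH:
    pressure from density. B sets the stiffness of the fluid."""
    B = rho0 * c0**2 / gamma
    return B * ((rho / rho0)**gamma - 1.0)
```

Because every particle's pressure follows directly from its density, the update is embarrassingly parallel, which is the property that makes WCSPH map so well onto GPUs.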
Authors: Thomas Ponweisera,*, Panagiotis Tsoutsanisb
a Research Institute for Symbolic Computation (RISC), Johannes Kepler University, Altenberger Straße 69, 4040 Linz, Austria
b Centre for Computational Engineering Sciences, Cranfield University, College Rd, Cranfield MK43 0AL, United Kingdom
Abstract: UCNS3D is a computational-fluid-dynamics (CFD) code for the simulation of viscous flows on arbitrary unstructured meshes. It employs very high-order numerical schemes which inherently are easier to scale than lower-order numerical schemes due to the higher ratio of computation versus communication. In this white paper, we report on optimizations of the UCNS3D code implemented in the course of the PRACE Preparatory Access Type C project “HOVE” in the time frame of February to August 2016. Through the optimization of dense linear algebra operations, in particular matrix-vector products, by formula rewriting, pre-computation and the usage of BLAS, significant speedups of the code by factors of 2 to 6 have been achieved for representative benchmark cases. Moreover, very good scalability up to the order of 10,000 CPU cores has been demonstrated.
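The kind of rewriting described, precomputing a combined operator once so that the per-step work collapses into a single BLAS call, can be sketched as follows (illustrative matrices only, not UCNS3D's actual data structures; NumPy's `@` dispatches to a BLAS gemm):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 128))    # e.g. a reconstruction operator
B = rng.standard_normal((128, 16))    # e.g. a basis/stencil operator
xs = rng.standard_normal((1000, 16))  # one right-hand side per cell/step

# Naive: apply both operators for every right-hand side
naive = [A @ (B @ x) for x in xs]

# Rewritten: precompute the combined operator once (one BLAS gemm),
# then a single matrix-matrix product covers all right-hand sides
C = A @ B
fast = (C @ xs.T).T

assert np.allclose(naive, fast)
```

The speedup comes from doing one large, cache-friendly gemm instead of many small matrix-vector products, which is consistent with the factor-2-to-6 gains the paper reports.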
Authors: A. Cassagnea,*, J-F. Boussugeb, G. Puigtb
a Centre Informatique National de l’Enseignement Superieur, Montpellier, France
b Centre Europeen de Recherche et de Formation Avancee en Calcul Scientifique, Toulouse, France
Abstract: We enabled hybrid OpenMP/MPI computations for a new-generation CFD code based on a high-order method (the Spectral Difference method) dedicated to Large Eddy Simulation (LES). The code is written in Fortran 90 and parallelized with the MPI library and OpenMP directives. This white paper focuses on achieving good performance with the OpenMP shared-memory model on a standard environment (bi-socket nodes with multi-core x86 processors). The goal was to reduce the number of MPI communications by using MPI communication between nodes and the OpenMP approach for all cores within a node. Three different approaches are compared: full MPI, full OpenMP and hybrid OpenMP/MPI. We observed that hybrid and full MPI computations took nearly the same time for a small number of cores.
Authors: Ahmet Durana,b,*, M. Serdar Celebia,c, Senol Piskina,c and Mehmet Tuncela,c
a Istanbul Technical University, National Center for High Performance Computing of Turkey (UHeM), Istanbul 34469, Turkey
b Istanbul Technical University, Department of Mathematics, Istanbul 34469, Turkey
c Istanbul Technical University, Informatics Institute, Istanbul 34469, Turkey
Abstract: We study a bio-medical fluid flow simulation using the incompressible, laminar OpenFOAM solver icoFoam and other direct solvers (kernel class) such as SuperLU_DIST 3.3 and SuperLU_MCDT (Many-Core Distributed) for the large penta-diagonal and hepta-diagonal matrices arising from the simulation of blood flow in arteries with a structured mesh domain. A realistic simulation of the sloshing of blood in the heart or in the vessels of the whole body is a complex problem and may take a very long time, thousands of hours, for the main tasks such as pre-processing (meshing), decomposition and solving the large linear systems. We generated the structured mesh using the blockMesh mesh generator and decomposed it with the decomposePar tool. After the decomposition, we used icoFoam as the flow simulator/solver. For example, the total run time of a simple case is about 1500 hours without preconditioning on one core for one period of the cardiac cycle, measured on the Linux Nehalem Cluster available at the National Center for High Performance Computing of Turkey (UHeM). This important problem therefore deserves careful consideration for usage on multi-petascale or exascale systems. Our aim is to test the potential scaling capability of the fluid solver for multi-petascale systems. We started from relatively small instances of the whole simulation and solved large linear systems, measuring the wall-clock time of single time steps; this version gives important clues for larger versions of the problem. Our general strategy is then to increase the problem size and the number of time steps gradually to obtain a better picture. We tested the performance of the icoFoam solver on TGCC Curie (a Tier-0 system) at CEA, France. We consider three large sparse matrices of sizes 8 million x 8 million, 32 million x 32 million, and 64 million x 64 million.
We achieved scaled speed-up for the largest matrices, of size 64 million x 64 million, on runs of up to 16384 cores. In other words, we find that the scalability improves as the problem size increases for this application. This shows that there is no structural problem in the software up to this scale, which is an important and encouraging result.
Moreover, we embedded other direct solvers (kernel class), such as SuperLU_DIST 3.3 and SuperLU_MCDT, in addition to the solvers provided by OpenFOAM. Since future exascale systems are expected to have heterogeneous and many-core distributed nodes, we believe that our SuperLU_MCDT software is a good candidate for future systems. SuperLU_MCDT worked on up to 16384 cores for the large penta-diagonal matrices from 2D problems and the hepta-diagonal matrices from 3D problems coming from the incompressible blood flow simulation, without any problem.
Authors: Sebastian Szkodaa,c, Zbigniew Kozaa, Mateusz Tykierkob,c
a email@example.com Faculty of Physics and Astronomy, University of Wroclaw, Poland
b Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, Poland
c Wroclaw Centre for Networking and Supercomputing, Wroclaw University of Technology, Poland
Abstract: The aim of this research is to examine the possibility of parallelizing the Frisch-Hasslacher-Pomeau (FHP) model, a cellular automata algorithm for modeling fluid flow, on clusters of modern graphics processing units (GPUs). To this end an Open Computing Language (OpenCL) implementation for GPUs was written and compared with a previous, semi-automatic one based on OpenACC compiler pragmas (S. Szkoda, Z. Koza, and M. Tykierko, Multi-GPGPU Cellular Automata Simulations using OpenACC, http://www.prace-project.eu/IMG/pdf/wp154.pdf). Both implementations were tested on up to 16 Fermi-class GPUs using the MPICH3 library for inter-process communication. We found that for both multi-GPU implementations the weak scaling is practically linear for up to 16 devices, which suggests that the FHP model can be successfully run even on much larger clusters. Secondly, the pragma-based OpenACC implementation, while much easier to develop and maintain, gives performance as good as the manually written OpenCL code.
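As background, the FHP state is purely Boolean: each lattice site holds one bit per velocity channel, so both streaming and collisions reduce to bitwise operations that vectorize well on GPUs. A toy sketch of this layout (four channels on a square lattice for brevity; the real FHP model uses six channels on a hexagonal lattice with fuller collision tables):

```python
import numpy as np

# Four velocity channels on a square lattice, each a Boolean array.
shifts = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # E, W, S, N

def stream(channels):
    """Propagate every particle one site along its velocity direction."""
    return [np.roll(c, s, axis=(0, 1)) for c, s in zip(channels, shifts)]

def collide(channels):
    """Head-on pair collision: E+W <-> N+S. Mass-conserving and exact,
    since everything is Boolean (no round-off error)."""
    e, w, s, n = channels
    ew = e & w & ~s & ~n   # exactly an east-west pair at a site
    ns = n & s & ~e & ~w   # exactly a north-south pair
    return [(e ^ ew) | ns,  # outgoing E
            (w ^ ew) | ns,  # outgoing W
            (s ^ ns) | ew,  # outgoing S
            (n ^ ns) | ew]  # outgoing N
```

On a GPU, 32 or 64 such lattice rows are typically packed into one machine word, so a single bitwise instruction updates dozens of sites at once.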
Authors: Charles Moulinec, David R. Emerson
STFC Daresbury Laboratory, Warrington, WA4 4AD, UK
Understanding the influence of wave distribution, hydrodynamics and sediment transport is crucial for the placement of off-shore energy generating platforms. The TELEMAC suite is used for this purpose, and the performance of the triple coupling between TOMAWAC for wave propagation, TELEMAC-3D for hydrodynamics and SISYPHE for sediment transport is investigated for several mesh sizes, the largest grid having over 10 million elements. The coupling has been tested on up to 3,072 processors and good performance is generally observed.
Authors: E. Casonia, J. Aguadoa, M. Riveroa, M. Vazquez, G. Houzeaux
Department of Computer Applications for Scientific Engineering. BSC, Nexus I, Gran Capita 2-4, 08034 Barcelona, Spain
This paper describes the work done in the Alya multiphysics code, an open-source software package developed at the Barcelona Supercomputing Center (BSC-CNS). The main activities of this socio-economic application project concern the development of a coupled fluid-electro-mechanical model to simulate the computational mechanics of the heart. Several aspects of the simulation process, methodology and performance of the code are presented in detail.
Authors: J. Donners, M. Guarrasi, A. Emerson, M. Genseberger
SURFsara, Amsterdam, The Netherlands
CINECA, Bologna, Italy
Deltares, Delft, The Netherlands
The applications Delft3D-FLOW and SWAN are used to simulate water flow and water waves, respectively. These two applications have been coupled with Delft3D-WAVE and the combination of these three executables has been optimized on the Bull cluster “Cartesius”. The runtime could be decreased by a factor of 4 with hardly any additional hardware. Over 80% of the total runtime consisted of unnecessary I/O operations for the coupling, of which 70% could be removed. Both I/O optimizations and replacement with MPI were used. The Delft3D-FLOW application has also been ported to and benchmarked on the IBM Blue Gene/Q system “Fermi”.
Thomas Ponweiser, Peter Stadelmeyer, Tomas Karasek
Johannes Kepler University Linz, RISC Software GmbH, Austria
VSB-Technical University of Ostrava, IT4Innovations, Czech Republic
Multi-physics, high-fidelity simulations are becoming an increasingly important part of industrial design processes. Simulations of fluid-structure interactions (FSI) are of great practical significance – especially within the aeronautics industry – and because of their complexity they require huge computational resources. On the basis of OpenFOAM, a partitioned, strongly coupled solver for transient FSI simulations with independent meshes for the fluid and solid domains has been implemented. Using two different kinds of model sets, a geometrically simple 3D beam with quadratic cross section and a geometrically complex aircraft configuration, runtime and scalability characteristics are investigated. By modifying the implementation of OpenFOAM’s inter-processor communication, the scalability limit could be increased by one order of magnitude (from below 512 to above 4096 processes) for a model with 61 million cells.
Seren Soner, Can Ozturan
Computer Engineering Department, Bogazici University, Istanbul, Turkey
OpenFOAM is an open source computational fluid dynamics (CFD) package with a large user base from many areas of engineering and science. This whitepaper documents an enablement tool called PMSH that was developed to generate multi-billion element unstructured tetrahedral meshes for OpenFOAM. PMSH is developed as a wrapper code around the popular open source sequential Netgen mesh generator. Parallelization of the mesh generation process is carried out in five main stages: (i) generation of a coarse volume mesh; (ii) partitioning of the coarse mesh to get sub-meshes, each of which is processed by a processor; (iii) extraction and refinement of coarse surface sub-meshes to produce fine surface sub-meshes; (iv) re-meshing of each fine surface sub-mesh to get the final fine volume mesh; (v) matching of partition boundary vertices followed by global vertex numbering. An integer based barycentric coordinate method is developed for matching distributed partition boundary vertices. This method does not have the precision-related problems of floating point coordinate based vertex matching. Test results obtained on an SGI Altix ICE X system with 8192 cores and 14 TB of total memory confirm that our approach does indeed enable us to generate multi-billion element meshes in a scalable way. The PMSH tool is available at https://code.google.com/p/pmsh/
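The exact-matching idea in stage (v) can be sketched as follows: if each partition computes the barycentric coordinates of its boundary vertices exactly (here with rational arithmetic as a stand-in for PMSH's integer scaling), matching becomes an exact dictionary lookup with no floating-point tolerance. All function names are hypothetical, not PMSH's API:

```python
from fractions import Fraction

def barycentric_key(p, tri):
    """Exact barycentric coordinates of point p in triangle tri, computed
    with rational arithmetic so identical points on different partitions
    produce bit-identical, hashable keys."""
    (x1, y1), (x2, y2), (x3, y3) = [tuple(map(Fraction, v)) for v in tri]
    px, py = map(Fraction, p)
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (px - x3) + (x3 - x2) * (py - y3)) / det
    l2 = ((y3 - y1) * (px - x3) + (x1 - x3) * (py - y3)) / det
    return (l1, l2, 1 - l1 - l2)

def match_boundary(verts_a, verts_b, tri):
    """Pair boundary vertices from two partitions via exact lookup:
    returns {index in verts_b: index in verts_a} for shared vertices."""
    table = {barycentric_key(p, tri): i for i, p in enumerate(verts_a)}
    return {j: table[barycentric_key(q, tri)]
            for j, q in enumerate(verts_b)
            if barycentric_key(q, tri) in table}
```

Because the keys are exact, two partitions that derive the same physical vertex from the same coarse element always agree, which is precisely what a floating-point nearest-neighbour search cannot guarantee.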
Authors: Tomas Karasek, David Horak, Vaclav Hapla, Alexandros Markopoulos, Lubomir Riha, Vit Vondrak, Tomas Brzobohaty
IT4Innovations, VSB-Technical University of Ostrava (VSB)
The solution of multiscale and/or multiphysics problems is one of the domains that can benefit most from the use of supercomputers. Those problems are often very complex, and their accurate description and numerical solution require several different solvers. For example, problems of Fluid Structure Interaction (FSI) are usually solved using two different discretization schemes: finite volumes for the Computational Fluid Dynamics (CFD) part and finite elements for the structural part of the problem. This paper summarizes different libraries and solvers used by the PRACE community that are able to deal with multiscale and/or multiphysics problems, such as Elmer, Code_Saturne, Code_Aster and OpenFOAM.
The main bottlenecks in performance and scalability on the side of Computational Structure Mechanics (CSM) codes are identified and their possible extensions to fulfill the needs of future exascale problems are shown. Numerical results for the strong and weak scalability of the CSM solver implemented in our FLLOP library are presented.
Authors : Sebastian Szkoda, Zbigniew Koza, Mateusz Tykierko
Faculty of Physics and Astronomy, University of Wroclaw, Poland
Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, Poland
Wroclaw Centre for Networking and Supercomputing, Wroclaw University of Technology, Poland
The Frisch-Hasslacher-Pomeau (FHP) model is a lattice gas cellular automaton designed to simulate fluid flows using exact, purely Boolean arithmetic, without any round-off error. Here we investigate the problem of its efficient porting to clusters of Fermi-class graphics processing units. To this end two multi-GPU implementations were developed and examined: one using the NVIDIA CUDA and GPU Direct technologies explicitly and the other one using CUDA implicitly through the OpenACC compiler directives and the MPICH2 MPI interface for communication. For a single Tesla C2090 GPU device both implementations yield up to a 7-fold acceleration over an algorithmically comparable, highly optimized multi-threaded implementation running on a server-class CPU. The weak scaling for the explicit multi-GPU CUDA implementation is almost linear for up to 8 devices (the maximum number of devices used in the tests), which suggests that the FHP model can be successfully run on much larger clusters and is a prospective candidate for exascale computational fluid dynamics. The scaling for the OpenACC approach turns out to be less favorable due to compiler-related technical issues. We found that the multi-GPU approach can bring considerable benefits for this class of problems, and that GPU programming can be significantly simplified through the use of the OpenACC standard, without a significant loss of performance, provided that the compilers supporting OpenACC improve their handling of the communication between GPUs.
Code_Saturne is a popular open-source computational fluid dynamics package. We have carried out a study of applying MPI 2.0 / MPI 3.0 one-sided communication routines to Code_Saturne and their impact on improving the scalability of the code for future peta/exa-scaling. We have developed modified versions of the halo exchange routine in Code_Saturne. Our modifications showed that MPI 2.0 one-sided calls give some speed improvement and less memory overhead compared to the original version. The MPI 3.0 version, on the other hand, was unstable and could not run.
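For context, a halo (ghost-cell) exchange with one-sided semantics lets each rank write its boundary values directly into a neighbour's ghost cells, the access pattern MPI_Put provides, instead of pairing sends with receives. A sequential stand-in sketch of that pattern (plain Python lists in place of MPI ranks and windows; the layout and function name are hypothetical, not Code_Saturne's halo routine):

```python
import numpy as np

def halo_exchange(subdomains):
    """Put-style halo exchange over a periodic 1D decomposition: each
    'rank' writes its boundary values directly into its neighbours'
    ghost cells, mimicking what MPI_Put does into a remote window.
    Layout per rank: [left ghost | interior ... | right ghost]."""
    n = len(subdomains)
    for rank, a in enumerate(subdomains):
        left, right = (rank - 1) % n, (rank + 1) % n
        subdomains[left][-1] = a[1]    # put my first interior cell
        subdomains[right][0] = a[-2]   # put my last interior cell
    return subdomains
```

In real MPI the puts would be bracketed by a synchronization epoch (e.g. `MPI_Win_fence`); the memory-overhead benefit reported above comes from not needing intermediate message buffers.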
Application Code: Code_Saturne
Jan Christian Meyer
High Performance Computing Section, IT Dept., Norwegian University of Science and Technology
The LULESH proxy application models the behavior of the ALE3D multi-physics code with an explicit shock hydrodynamics problem, and was created to evaluate interactions between programming models and architectures using a representative code significantly less complex than the application it models. As identified in the PRACE deliverable D7.2.1, the OmpSs programming model specifically targets programming at the exascale, and this whitepaper investigates the effectiveness of its support for development on hybrid
Cytowskia, Matteo Bernardinib
a Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw
b Universita di Roma La Sapienza, Dipartimento di Ingegneria Meccanica e Aerospaziale
The project aimed at extending the capabilities of an existing flow solver for the Direct Numerical Simulation of turbulent flows. Starting from the scalability analysis of the MPI baseline code, the main goal of the project was to devise an MPI/OpenMP hybridization capable of exploiting the full potential of the current architectures provided in the PRACE framework. The project was very successful: the new hybrid version of the code outperformed the pure MPI version on the IBM Blue Gene/Q architecture (FERMI).
Scotta, V. Weinbergb†, O. Hoenena, A. Karmakarb, L. Fazendeiroc
a Max-Planck-Institut fur Plasmaphysik IPP, 85748 Garching b. Munchen, Germany
b Leibniz Rechenzentrum der Bayerischen Akademie der Wissenschaften, 85748 Garching b. Munchen, Germany
c Chalmers University of Technology, 412 96 Gothenburg, Sweden
We discuss a detailed weak scaling analysis of GEM, a 3D MPI-parallelised gyrofluid code used in theoretical
plasma physics at the Max Planck Institute of Plasma Physics, IPP at Garching near Munich, Germany. Within a
PRACE Preparatory Access Project various versions of the code have been analysed on the HPC systems
SuperMUC at LRZ and JUQUEEN at Julich Supercomputing Centre (JSC) to improve the parallel scalability of
the application. The diagnostic tool Scalasca has been used to filter out suboptimal routines. The code uses the
electromagnetic gyrofluid model, which is a superset of magnetohydrodynamics and drift-Alfven microturbulence,
and also includes several relevant kinetic processes. GEM can be used with different geometries depending on
the targeted use case, and has been proven to show good scalability when the computational domain is
distributed amongst two dimensions. Such a distribution allows grids with sufficient size to describe
conventional tokamak devices. In order to enable simulation of very large tokamaks (such as the next generation
nuclear fusion device ITER in Cadarache, France) the third dimension has been parallelised and weak scaling
has been achieved for significantly larger grids.
Siwei Donga, Vegard Eideb, Jeroen Engelbertsc
a Universidad Politecnica Madrid, Spain
b Norwegian University of Science and Technology, Norway
c SURFsara, Amsterdam, Netherlands
The SHEAR code is developed at the School of Aeronautics, Universidad Politecnica de Madrid, for the
simulation of turbulent structures of shear flows. The code has been well tested on smaller clusters. This white
paper describes the work done to scale and optimise SHEAR for large systems like the Blue Gene/Q system
JUQUEEN in Julich.
Riccardo Brogliaa, Stefano Zaghia, Roberto Muscaria, Francesco Salvadoreb, Soon-Heum Koc
a CNR-INSEAN National Marine Technology Research Institute, Via di Vallerano
139, Rome 00128, Italy
b CINECA, Via dei Tizii 6, Rome 00185, Italy
cNSC National Supercomputing Centre, Linkoping University, 58183
In this paper, the work that has been performed to extend the capabilities of the Xnavis software, a well tested and validated parallel flow solver developed by the research group of CNR-INSEAN, is reported. The solver is based on the finite volume discretization of the unsteady incompressible Navier-Stokes equations; its main features include a level-set approach to handle free-surface flows and a dynamical overlapping grids approach, which allows dealing with bodies in relative motion. The baseline code features a hybrid MPI/OpenMP parallelization, proven to scale when running on the order of hundreds of cores (i.e. Tier-1 platforms). However, some issues arise when trying to use this code on the current massively parallel HPC facilities provided in the Tier-0 PRACE context. First of all, it is mandatory to assess an efficient speed-up up to thousands of processors. Other important aspects are related to the pre- and post-processing phases, which need to be optimized and, possibly, parallelized. The last concerns the implementation of MPI-I/O procedures in order to accelerate data access and to reduce the number of generated files.
Stoyan Markov, Peicho Petkov, Damyan Grancharov and Georgi Georgiev
National Centre for Supercomputing Application, Akad. G. Bonchev Str., 25A, 1113 Sofia, Bulgaria
We investigated a possible way of treating electrostatic interactions by numerically solving Poisson’s equation using the Conjugate Gradient method and the Stabilized BiConjugate Gradient method. The aim of the research was to test the execution time of prototype programs running on BlueGene/P and CPU/GPU systems. The results show that the tested methods are applicable for electrostatics treatment in molecular-dynamics simulations.
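A matrix-free Conjugate Gradient iteration for a discretized Poisson equation, the first of the two tested methods, can be sketched as follows (a 1D model problem for brevity; the actual prototype programs are not reproduced here):

```python
import numpy as np

def cg_poisson_1d(b, h, tol=1e-8):
    """Solve -u'' = b on a uniform grid with zero Dirichlet boundaries
    using matrix-free Conjugate Gradient: the Laplacian is applied as a
    stencil, so the matrix is never stored."""
    def apply_A(u):                       # action of the 1D Laplacian
        Au = 2.0 * u
        Au[1:] -= u[:-1]
        Au[:-1] -= u[1:]
        return Au / h**2
    x = np.zeros_like(b)
    r = b - apply_A(x)                    # initial residual
    p = r.copy()
    rs = r.dot(r)
    for _ in range(len(b) * 10):
        Ap = apply_A(p)
        alpha = rs / p.dot(Ap)            # optimal step along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r.dot(r)
        if rs_new < tol**2:
            break
        p = r + (rs_new / rs) * p         # new A-conjugate direction
        rs = rs_new
    return x
```

BiCGStab, the second method tested, follows the same matrix-free pattern but also handles non-symmetric operators, at the cost of two operator applications per iteration.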
a Greek Research and Technology Network, Athens, Greece
b Scientific Computing Center, Aristotle University of Thessaloniki Thessaloniki
c HLRS, Nobelstr. 19, D-70569 Stuttgart, Germany
d The Cyprus Institute, 20 Konstantinou Kavafi Street 2121 Aglantzia,
The project objective was to develop and justify an OpenFOAM model for the simulation of a TES tank. In the course of the project we obtained scalability results, which are presented in this paper. Scalability tests have been performed on the HLRS Hermit HPC system using various combinations of decomposition methods, cell capacities and numbers of physical CPU cores.
O. Akinci1, M. Sahin2, B.O. Kanat3
1 National High Performance Computing Center of Turkey (UHeM), Istanbul
Technical University (ITU), Ayazaga Campus, UHeM Office, Istanbul
2 ITU, Ayazaga Campus, Faculty of Aeronautics and Astronautics, Istanbul
3 ITU, Ayazaga Campus, Computer Engineering Department, Istanbul 34469,
ViscoSolve is a stable unstructured finite volume method for parallel large-scale viscoelastic fluid flow calculations. The code incorporates the open-source libraries PETSc and MPI for parallel computation. In this whitepaper we report on work done to investigate the scaling performance of the ViscoSolve code.
SURFsara, Science Park 140, 1098XG Amsterdam, the Netherlands
This white paper reports on an enabling effort that involves porting a legacy 2D fluid dynamics Fortran code to NVIDIA GPUs. Given the complexity of both the code and the underlying (custom) numerical method, the natural choice was to use NVIDIA CUDA C to achieve the best possible performance. We achieved over 4.5x speed-up on a single K20 compared to the original code executed on a dual-socket E5-2687W.
Paride Dagnaa, Joerg Hertzerb
a CINECA-SCAI Department, Via R. Sanzio 4, Segrate (MI) 20090, Italy
b HLRS, Nobelstr. 19, D-70569 Stuttgart, Germany
The performance results from the hybridization of the OpenFOAM linear system solver, tested on the CINECA Fermi and the HLRS Hermit supercomputers are presented in this paper. A comparison between the original and the hybrid OpenFOAM versions on four physical problems, based on four different solvers, will be shown and a detailed analysis of the behavior of the main computing and communication phases, which are responsible for scalability during the linear system solution, will be given.
Stephane Glocknerb, N. Audiffrena, H. Ouvrardb,a
a CINES, Centre Informatique National de l’Enseignement Superieur, Montpellier, France
b I2M (Institut de Mecanique et d’Ingenierie de Bordeaux)
It has already been shown that the numerical tool Thetis, based on the resolution of the Navier-Stokes equations for multiphase flows, gives accurate results for coastal applications, e.g. wave breaking, tidal bore propagation, tsunami generation, swash flows, etc. [1,2,3,4,5,6]. In this study our goal is to improve the time and memory consumption of the set-up phase of the simulation (partitioning and building the computational mesh), to examine the potential benefits of a hybrid approach with the Hypre library, and to fine-tune the implementation of the code on the Curie Tier-0 system. We also implement parallel POSIX, VTK and HDF5 I/O. Thetis is now able to run efficiently with up to 1 billion mesh nodes on 16384 cores of CURIE in a production context.
Thibaut Delozea, Yannick Hoaraub
a IMFT, 2 allee du Professeur Camille Soula, Toulouse 31400, France
b IMFS, 2 rue Boussingault, Strasbourg 67000, France
The present work focuses on the development of an efficient and robust CFD solver for aeronautics: the NSMB code (Navier-Stokes MultiBlock). A specific advanced version of the code containing turbulence modeling approaches developed by IMFT is the object of the present optimization. The project aims at improving the performance of the MPI version of the code in order to use efficiently the fat-node part of Curie (TGCC, France). Different load balancing strategies have been studied in order to achieve an optimal distribution of work using up to 4096 processes.
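One classic load balancing strategy for multiblock codes is the Longest-Processing-Time heuristic: sort blocks by cost (e.g. cell count) and always give the next-largest block to the least-loaded process. A sketch of that heuristic (illustrative only; not necessarily one of the strategies evaluated in the project):

```python
import heapq

def lpt_assign(block_costs, n_procs):
    """Longest-Processing-Time heuristic: returns a list of
    (load, process, assigned block indices) tuples."""
    heap = [(0, p, []) for p in range(n_procs)]   # (load, proc id, blocks)
    heapq.heapify(heap)
    for cost, block in sorted(((c, i) for i, c in enumerate(block_costs)),
                              reverse=True):       # largest blocks first
        load, p, blocks = heapq.heappop(heap)      # least-loaded process
        blocks.append(block)
        heapq.heappush(heap, (load + cost, p, blocks))
    return heap
```

LPT guarantees a maximum load within 4/3 of the optimum; for structured multiblock meshes the blocks themselves can also be split when a single block exceeds the ideal per-process load.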
A. Schnurpfeila, A. Schillera,
F. Janetzkoa, St. Meiera, G. Sutmanna
a Forschungszentrum Juelich –
JSC, Wilhelm-Johnen-Strasse, 52425 Juelich, Germany
MP2C is a molecular dynamics code that focuses on mesoscopic particle dynamics simulations on massively parallel computers. The program is a development of JSC together with the Institute for Theoretical Biophysics and Soft Matter (IFF-2) of the Institute for Solid State Research (IFF) at Juelich. Within the frame of the PRACE Internal Call, further optimization of the code as well as the targeting of possible bottlenecks were addressed. The project was mainly performed on JUGENE, a BG/P based architecture located at the Forschungszentrum Juelich, Germany. Besides that, some scaling tests were performed on JUROPA, an Intel Nehalem based general purpose supercomputer, also located at the Forschungszentrum Juelich. In this report the efforts made in working on the program package are presented.
CSC –IT Center for Science Ltd,
P.O.Box 405, FI-02101 Espoo, Finland
The Sun exhibits magnetic activity at various spatial and temporal scales. The best known example is the 11-year
sunspot cycle which is related to the 22-year periodicity of the Sun’s magnetic field. The sunspots, and thus solar
magnetic activity, have some robust systematic features: in the beginning of the cycle sunspots appear at latitudes
around 40 degrees. As the cycle progresses these belts of activity move towards the equator. The sign of the
magnetic field changes from one cycle to the next and the large-scale field remains approximately anti-symmetric
with respect to the equator. This cycle has been studied using direct observations for four centuries. Furthermore,
proxy data from tree rings and Greenland ice cores has revealed that the cycle has persisted through millennia. The
period and amplitude of activity change from cycle to cycle and there are even periods of several decades in the
modern era when the activity has been very low. Since it is unlikely that the primordial field of the hydrogen gas that formed the Sun billions of years ago could have survived to the present day, the solar magnetic field is considered to be continuously replenished by some dynamo mechanism.
Kevin Stratforda, Ignacio
aEPCC, The King’s Buildings,
The University of Edinburgh, EH9 3JZ, United Kingdom
bDepartament de Fisica
Fonamental, Universitat de Barcelona, Carrer Marti i Franques, 08028 Barcelona, Spain
This project looked at the performance of simulations of bacterial swimmers using a lattice Boltzmann code for complex fluids.
Authors: A. Turka,*,
C. Moulinecb, A.G. Sunderlandb, C. Aykanata
aBilkent University, Comp.
Eng. Dept., Ankara, Turkey
bSTFC Daresbury Laboratory,
Warrington WA4 4AD, UK
Code Saturne is an open-source, multi-purpose Computational Fluid Dynamics (CFD) software package developed by Electricite de France Recherche et Developpement (EDF R&D). Code Saturne has been selected as an application of interest for the CFD community in the Partnership for Advanced Computing in Europe First Implementation Phase Project (PRACE-1IP), and various efforts towards improving its scalability have been conducted. In this whitepaper, the efforts towards understanding and improving the preprocessing subsystem of Code Saturne are described; to this end, the performance of the different mesh partitioning software packages that can be used is investigated.
Authors: aC. Moulinec, A.G. Sunderland, bP. Kabelikova, A. Ronovsky, V. Vondrak, cA. Turk, C. Aykanat, dC. Theodosiou
aComputational Science and Engineering Department, STFC Daresbury Laboratory, United Kingdom
bDepartment of Applied Mathematics, VSB-Technical University of Ostrava, 17. listopadu 15,
708 33 Ostrava, Czech Republic
cDepartment of Computer Engineering, Bilkent University, 06800 Bilkent Ankara, Turkey
dScientific Computational Center, Aristotle University, 54 124 Thessaloniki, Greece
Some of the optimisations required to prepare Code_Saturne for petascale simulations are presented in this white paper, along with the performance of the code. A mesh multiplication package based on parallel global refinement of hexahedral meshes has been developed for Code_Saturne to handle meshes containing billions
of cells and to efficiently exploit PRACE Tier-0 system capabilities. Several parallel partitioning tools have been tested and Code_Saturne performance has been assessed up to a 3.2 billion cell mesh. The parallel code is highly scalable and demonstrates good parallel speed-up at very high core counts, e.g. from
32,768 to 65,536 cores.
Author: Massimiliano Culpo
CINECA, Via Magnanelli 6/3, Casalecchio di Reno (BO) I-40033, Italy
The scaling behavior of different OpenFOAM versions is analyzed on two benchmark
problems. Results show that the applications scale reasonably well up to a thousand tasks.
An in-depth profiling identifies the calls to the MPI_Allreduce function in the linear algebra
core libraries as the main communication bottleneck. Sub-optimal on-core performance is due to the sparse matrix storage format, which at present does not employ any cache-blocking mechanism. Possible strategies to overcome these limitations are proposed and
analyzed, and preliminary results on prototype implementations are presented.
Moylesa, Peter Nash, Ivan Girotto
Irish Centre for High End Computing, Grand Canal Quay, Dublin 2
This report outlines work undertaken for PRACE-2IP, describing the computational methods used to examine the petascaling of OpenFOAM on the French Tier-0 system CURIE. The case study was provided by the National University of Ireland, Galway (NUIG). The profiling techniques used to uncover bottlenecks, specifically in communication and file I/O within the code, provide an insight into the behaviour of OpenFOAM and highlight practices that will be of benefit to the user community.
Middle East Technical University, Department of Computer Engineering, 06800 Ankara, Turkey
Solution of large sparse linear systems is frequently the most time-consuming operation in computational fluid dynamics simulations. Improving the scalability of this operation is likely to have a significant impact on the overall scalability of the application. In this white paper we show scalability results up to a thousand cores for a new algorithm devised to solve large sparse linear systems. We have also compared pure MPI and hybrid MPI-OpenMP implementations of the same algorithm.
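The new algorithm itself is not described in the abstract; purely as an illustrative baseline (not the authors' method), a Krylov solver of the kind such work targets can be sketched in a few lines. The CSR layout and the toy matrix below are assumptions for the example.

```python
# Minimal conjugate-gradient (CG) solver over a CSR-stored sparse matrix.
# Generic baseline for context only, NOT the whitepaper's algorithm.

def csr_matvec(indptr, indices, data, x):
    """y = A @ x for a CSR matrix given by (indptr, indices, data)."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

def cg(indptr, indices, data, b, tol=1e-10, max_iter=1000):
    """Conjugate gradients for a symmetric positive definite system."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                 # residual b - A x (x = 0 initially)
    p = list(r)                 # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = csr_matvec(indptr, indices, data, p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# 1D Laplacian tridiag(-1, 2, -1), n = 4, in CSR form; solution is all ones
indptr = [0, 2, 5, 8, 10]
indices = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
data = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
b = [1.0, 0.0, 0.0, 1.0]
x = cg(indptr, indices, data, b)
```

The matrix-vector product is the only kernel that touches the matrix, which is why the storage format and its parallel distribution dominate the scalability of solvers in this class.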
Earth Science applications
Authors: Mads R. B. Kristensena and Roman Nutermana
a Niels Bohr Institute, University of Copenhagen, Denmark
In this paper, we explore the challenges of running the current version (v1.2.2) of the Community Earth System Model (CESM) on Juqueen. We present a set of workarounds for the Juqueen supercomputer that enable massively parallel executions and demonstrate scalability up to 3024 CPU cores.
P. Nolan, A. McKinstry
Irish Centre for High-End Computing (ICHEC), Ireland
Climate change due to increasing anthropogenic greenhouse gases and land surface change is currently one of
the most relevant environmental concerns. It threatens ecosystems and human societies. However, its impact on
the economy and our living standards depends largely on our ability to anticipate its effects and take appropriate
action. Earth System Models (ESMs), such as EC-Earth, can be used to provide society with information on the
future climate. EC-Earth3 generates reliable predictions and projections of global climate change, which are a
prerequisite to support the development of national adaptation and mitigation strategies.
This project investigates methods to enhance the parallel capabilities of EC-Earth3 by offloading bottleneck
routines to GPUs and Intel Xeon Phi coprocessors. To gain a full understanding of climate change at a regional
scale will require EC-Earth3 to be run at a much higher spatial resolution (T3999, approx. 5 km) than is currently
feasible. It is envisaged that the work outlined in this project will provide climate scientists with valuable data
for simulations planned for future exascale systems.
CINECA-SCAI Department, Via R. Sanzio 4, Segrate (MI) 20090, Italy
SPEED (SPectral Elements in Elastodynamics with Discontinuous Galerkin) is an open-source code for seismic hazard analyses, jointly developed by the Departments of Structural Engineering and of Mathematics at Politecnico di Milano.
In this paper, performance results from the optimization and hybridization work done on SPEED, tested on the CINECA Fermi BG/Q supercomputer, are shown. A comparison between the pure MPI and hybrid SPEED versions on three earthquake scenarios of increasing complexity is presented, and a detailed analysis of the advantages that come from hybridization and optimization of the computing and I/O phases is given.
Molinesa, Nicole Audiffrenba, Albanne
a CNRS, LEGI, Grenoble,
bCINES, Centre Informatique
National de l’Enseignement Superieur, 34000 Montpellier FRANCE
This project aims at preparing the high-resolution ocean/sea-ice realistic modeling environment implemented by the European DRAKKAR consortium for use on PRACE Tier-0 computers. DRAKKAR participating Teams jointly develop and share this modeling environment to address a wide range of scientific questions investigating multiple-scale interactions in the world ocean. Each team relies on the achievements of DRAKKAR to have available for its research the most efficient and up-to-date ocean models and related modeling tools. Two original realistic model configurations, based on the NEMO modeling framework, are considered in this project. They are designed to make possible the study of the role of multiple-scale interactions in the ocean variability, in the ocean carbon cycle and in marine ecosystems changes.
Charles Moulinec, Yoann Audouin, Andrew Sunderland
STFC Daresbury Laboratory, UK
This report details optimization undertaken on the Computational Fluid Dynamic (CFD) software suite TELEMAC, a modelling system for free surface waters with over 200 installations worldwide. The main focus of the work has involved eliminating memory bottlenecks occurring at the pre-processing stage that have historically limited the size of simulations processed. This has been achieved by localizing global arrays in the pre-processing tool, known as PARTEL. Parallelism in the partitioning stage has also been improved by replacing the serial partitioning tool with a new parallel implementation.
These optimizations have enabled massively parallel runs of TELEMAC-2D, a Shallow Water Equations based code,
involving over 200 million elements to be undertaken on Tier-0 systems. These runs simulate extreme flooding events on very fine meshes (locally less than one meter). Simulations at this scale are crucial for predicting and understanding flooding events occurring, e.g. in the region of the Rhine river.
Zwinger, Mika Malinen, Juha Ruokolainen, Peter Raback
CSC – IT Center for Science, P.O. Box 405, FI-02101 Espoo, Finland
By gaining and losing mass, glaciers and ice-sheets play a key role in sea level evolution. This is obvious when
considering the past 20000 years, during which the collapse of the large northern hemisphere ice-sheets after the
Last Glacial Maximum contributed to a 120m rise in sea level. This is particularly worrying when the future is
considered. Indeed, recent observations clearly indicate that important changes in the velocity structure of both
the Antarctic and Greenland ice-sheets are occurring, suggesting that large and irreversible changes may already
have been initiated. This was clearly emphasised in the last report published by the Intergovernmental Panel on Climate Change (IPCC). The IPCC also asserted that current knowledge of the key processes causing the observed accelerations was poor, and concluded that reliable projections of sea-level rise (SLR) obtained with process-based models are currently unavailable. Most of these uncertain key processes have in common that their physical/numerical characteristics are poorly reflected, or even completely missing, in the established simplified models, such as those based on the shallow ice approximation (SIA), that have been in use for decades. Whereas those
simplified models run on common PC systems, the new approaches require higher resolution and larger
computational models, which demand High Performance Computing (HPC) methods to be applied. In other
words, numerical glaciology, like climatology and oceanography decades ago, needs to be updated for HPC with
scalable codes, in order to deliver the prognostic simulations demanded by the IPCC. The DECI project
ElmerIce, and enabling work associated with it, improved simulations of key processes that lead to continental
ice loss. The project also developed new data assimilation methods. This was intended to decrease the degree of
uncertainty affecting future SLR scenarios and consequently contribute to on-going international debates
surrounding coastal adaptation and sea-defence planning. These results directly feed into existing projects, such
as the European FP7 project ice2sea, which has the objective of improving projections of the contribution of continental ice to future sea-level rise, and the French ANR ADAGe project, coordinated by O. Gagliardini,
which has the objective to develop data assimilation methods dedicated to ice flow studies. Results from these
projects will directly impact the upcoming IPCC assessment report (AR5).
Authors: Sebastian von Alfthana, Dusan Stankovicb, Vladimir
Institute, Helsinki, Finland
bInstitute of Physics
In this whitepaper we report work that was done to investigate and improve the performance of a hybrid-Vlasov code for simulating Earth's magnetosphere. We improved the performance of the code through a hybrid OpenMP-MPI mode.
Dalibor Lukas, Jan Zapletal
Department of Applied Mathematics, IT4Innovations, VSB-Technical University of Ostrava, Czech Republic
In this paper, a new parallel acoustic simulation package based on the boundary element method (BEM) is presented. The package is built on top of SPECFEM3D, a parallel software package for seismic simulations, e.g. earthquake simulations of the globe. The acoustic simulation relies on a Fourier transform of the seismic elastodynamic data resulting from SPECFEM3D_GLOBE, which are then postprocessed by a sequence of solutions to Helmholtz equations in the exterior of the globe. For the acoustic simulations BEM has been employed, which reduces the computation to the sphere; however, its naive implementation suffers from quadratic time and memory complexity with respect to the number of unknowns. To overcome the latter, the method was accelerated using hierarchical matrices and adaptive cross approximation (ACA) techniques, which is referred to as Fast BEM. First, a hierarchical clustering of the globe surface triangulation is performed. The arising cluster pairs decompose the fully populated BEM matrices into a hierarchy of blocks, which are classified as far-field or near-field. While the near-field blocks are kept as full matrices, the far-field blocks are approximated by low-rank matrices. This
reduces the quadratic complexity of the serial code to almost linear complexity, i.e. O(n log(n)), where n denotes the number of triangles. Furthermore, a parallel implementation was developed in which the blocks are assigned to concurrent MPI processes with an optimal load balance. The processes share the triangulation data. The parallel code reduces the computational complexity to O(n log(n)/N), where N denotes the number of processes. This is a novel implementation of BEM that improves on the computational times of traditional volume discretization methods, e.g. finite elements, by an order of magnitude.
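As a sketch of the compression idea described above, the following toy implementation of adaptive cross approximation with full pivoting approximates a far-field interaction block by low-rank factors. The dense residual is kept purely for clarity; production Fast BEM codes use partial pivoting and never assemble the full block. The kernel and cluster geometry are illustrative assumptions, not SPECFEM3D data.

```python
# Sketch of adaptive cross approximation (ACA) with full pivoting.
# Dense residual for clarity only; real Fast BEM codes use partial
# pivoting and never assemble the full far-field block.

def aca(entry, m, n, eps=1e-6, max_rank=None):
    """Return factors U, V with A[i][j] ~ sum_k U[k][i] * V[k][j]."""
    max_rank = max_rank if max_rank is not None else min(m, n)
    R = [[entry(i, j) for j in range(n)] for i in range(m)]  # residual
    U, V = [], []
    for _ in range(max_rank):
        # pivot = residual entry of largest magnitude (full pivoting)
        i_p, j_p = max(((i, j) for i in range(m) for j in range(n)),
                       key=lambda ij: abs(R[ij[0]][ij[1]]))
        piv = R[i_p][j_p]
        if abs(piv) < eps:        # residual negligible: done
            break
        u = [R[i][j_p] / piv for i in range(m)]   # scaled cross column
        v = R[i_p][:]                             # cross row
        for i in range(m):                        # rank-1 residual update
            for j in range(n):
                R[i][j] -= u[i] * v[j]
        U.append(u)
        V.append(v)
    return U, V

# Far-field block: two well-separated index clusters, smooth 1/distance kernel
m = n = 16
entry = lambda i, j: 1.0 / (100.0 + j - i)   # illustrative kernel
U, V = aca(entry, m, n)
rank = len(U)
err = max(abs(entry(i, j) - sum(U[k][i] * V[k][j] for k in range(rank)))
          for i in range(m) for j in range(n))
```

Because the kernel is smooth over well-separated clusters, a rank far below 16 already reproduces the 16x16 block to the requested tolerance, which is exactly why storing far-field blocks in low-rank form yields the almost linear complexity quoted above.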
Chandan Basub, Alastair McKinstryc, Muhammad
Asifd, Andrew Portere, Eric Maisonnavef,
Sophie Valckef, Uwe Fladrichg
aSARA, Amsterdam, The Netherlands
bNSC, Linkoping, Sweden
cICHEC, Galway, Ireland
dIC3, Barcelona, Spain
eSTFC, Daresbury, United Kingdom
fCERFACS, Toulouse, France
gSMHI, Norrkoping, Sweden
The EC-EARTH model is a global, coupled climate model that consists of the separate components IFS for the
atmosphere and NEMO for the ocean that are coupled using the OASIS coupler. EC-EARTH was ported and run on
the Curie system. Different configurations, using resolutions from T159 (approx. 128km) to T799 (approx 25km),
were available for benchmarking. Scalasca was used to analyze the performance of the model in detail. Although it
was expected that either the I/O or the coupling would be a bottleneck for scaling of the highest resolution model,
this is clearly not yet the case. The IFS model uses two MPI_Alltoallv calls per timestep that dominate the loss of
scaling at 1024 cores. Using the OpenMP functionality in IFS could potentially increase scalability considerably, but
this does not yet work on Curie. Work is ongoing to make MPI_Alltoallv more efficient on Curie. It is expected that I/O and/or coupling will become a bottleneck once IFS can be scaled beyond 2000 cores. Therefore, the
OASIS team increased the scalability of OASIS dramatically with the implementation of a radically different
approach, showing less than 1% overhead at 2000 cores. The scalability of NEMO was improved during an earlier
PRACE project. The I/O subsystem in IFS is described; it is probably not easily accelerated unless it is rewritten to use a different file format.
Dalibor Lukas, Petr Kovar, Tereza Kovarova, Jan Zapletal
Department of Applied Mathematics, IT4Innovations, VSB-Technical University of Ostrava, Czech Republic
In this paper, a new parallel acoustic simulation package based on the boundary element method (BEM) is presented. The package is built on top of SPECFEM3D, a parallel software package for seismic simulations, e.g. earthquake simulations of the globe. The acoustic simulation relies on a Fourier transform of the seismic elastodynamic data resulting from SPECFEM3D_GLOBE, which are then postprocessed by a sequence of solutions to Helmholtz equations in the exterior of the globe. For the acoustic simulations BEM has been employed, which reduces the computation to the sphere; however, its naive implementation suffers from quadratic time and memory complexity with respect to the number of unknowns. To overcome the latter, the method was accelerated using hierarchical matrices and adaptive cross approximation techniques, which is referred to as Fast BEM. First, a hierarchical clustering of the globe surface triangulation is performed. The arising cluster pairs decompose the fully populated BEM matrices into a hierarchy of blocks, which are classified as far-field or near-field. While the
near-field blocks are kept as full matrices, the far-field blocks are approximated by low-rank matrices. This reduces the quadratic complexity of the serial code to almost linear complexity, i.e. O(n log(n)), where n denotes the number of triangles. Furthermore, a parallel implementation was developed in which the blocks are assigned to concurrent MPI processes with an optimal load balance. The novelty of our approach is a nontrivial and theoretically supported memory distribution of the hierarchical matrices and right-hand side vectors, so that the overall memory consumption per process is O(n log(n)/N + n/sqrt(N)), which is at the same time the theoretical limit.
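The balanced assignment of blocks to processes described above is not spelled out in the abstract; a minimal greedy sketch of the underlying load-balancing step (longest-processing-time-first assignment of block costs to processes, an assumed heuristic standing in for the authors' theoretically supported distribution) could look like this:

```python
# Greedy longest-processing-time (LPT) assignment of matrix blocks to
# processes. Illustrative stand-in; the whitepaper's distribution is
# derived theoretically, not by this heuristic.
import heapq

def distribute_blocks(costs, num_procs):
    """Assign block b (cost costs[b]) to a process; return owner list."""
    heap = [(0.0, p) for p in range(num_procs)]   # (current load, proc)
    heapq.heapify(heap)
    owner = [None] * len(costs)
    # place heaviest blocks first, always on the least-loaded process
    for b in sorted(range(len(costs)), key=lambda b: -costs[b]):
        load, p = heapq.heappop(heap)
        owner[b] = p
        heapq.heappush(heap, (load + costs[b], p))
    return owner

costs = [7.0, 5.0, 4.0, 3.0, 3.0, 2.0]   # e.g. work per block (assumed)
owner = distribute_blocks(costs, 3)
loads = [0.0] * 3
for b, p in enumerate(owner):
    loads[p] += costs[b]
```

A min-heap keyed on current load keeps each assignment O(log N); the heterogeneous block costs produced by the near-field/far-field classification are what make such balancing nontrivial.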
Zielinskia, John Donnersa
aSARA B.V., Science Park
140, 1098XG Amsterdam, The Netherlands
The project focused on evaluating the code for a possible introduction of OpenMP, on its actual implementation, and on extensive tests. The major time-consuming parts of the code were detected and thoroughly analyzed. The most time-consuming part was successfully parallelized using OpenMP. Very extensive test simulations using the hybrid code allowed for many further improvements and validation of its results. Possible further improvements have also been discussed with the developers, to be implemented in the near future.
Author: Chandan Basu
National Supercomputer Center, Linkoping University, Linkoping 581 83, Sweden
The high-resolution version of EC-EARTH was ported to Curie. The scalability of the code was tested up to 3500 CPU cores. An example EC-EARTH run was profiled using the TAU tool.
Authors: Mikolaj Szydlarskia*, and Vegard Eideb
a Institute of Theoretical Astrophysics, UiO
b CIT Department, NTNU
Abstract: In this white paper we report our experiences in porting the stellar atmosphere simulation code Bifrost to the Intel Xeon Phi Knights Landing (KNL) architecture. Bifrost is a parallel Fortran 95/MPI code for solving the 3D radiation magnetohydrodynamic (RMHD) equations on a staggered grid using a high-order compact finite difference scheme. The focus is on finding the most performant runtime setup and estimating the possible performance gain compared with Intel Haswell based nodes.
Authors: T. Ponweiser, M.E. Innocenti, G. Lapenta, A. Beck, S. Markidis
Research Institute for Symbolic Computation (RISC), Johannes Kepler University, Altenberger Straße 69, 4040 Linz, Austria
Center for mathematical Plasma Astrophysics, Department of Mathematics, K.U. Leuven, Celestijnenlaan 200B, B-3001 Leuven, Belgium
Laboratoire Leprince-Ringuet, Ecole Polytechnique, CNRS-IN2P3, France
KTH Royal Institute of Technology, Stockholm, Sweden
Parsek2D-MLMD is a semi-implicit Multi Level Multi Domain Particle-in-Cell (PIC) code for the simulation of astrophysical and space plasmas. In this whitepaper, we report on improvements to Parsek2D-MLMD carried out in the course of the PRACE preparatory access project 2010PA1802. Through algorithmic enhancements – in particular the implementation of smoothing and temporal sub-stepping – as well as through performance tuning using HPCToolkit, the efficiency of the code has been improved significantly. For representative benchmark cases, we consistently achieved a total speedup of a factor of 10 or more.
J. Donners, J. Bedorf
SURFsara, Amsterdam, The Netherlands
Leiden University, Leiden, The Netherlands
This white paper describes a project to modify the I/O of the Bonsai astrophysics code to scale up to more than 10,000 nodes on the Titan system. A remaining bottleneck is the I/O: the creation of separate files for each MPI task overloads the Lustre metadata server. The use of the SIONlib library on the Lustre filesystems of different PRACE systems is investigated. Several issues had to be resolved, both with the SIONlib library and with the Liblustre API, before a satisfactory I/O performance could be achieved. For a few thousand MPI tasks, SIONlib reaches about half the performance of a naive approach in which each MPI task writes a separate file. However, when more MPI tasks are used, the SIONlib library shows the same performance as the naive approach. The SIONlib library exhibits both the performance and the scalability needed to be successful at exascale.
Kacper Kowalik, Artur Gawryszczak, Marcin Lawenda, Michal Hanasza, Norbert Meyer
Nicolaus Copernicus University, Jurija Gagarina 11, 87-100 Torun, Poland
Copernicus Astronomical Center, Polish Academy of Sciences, Bartycka 18, 00-716 Warszawa, Poland
Poznan Supercomputing and Networking Centre, Dabrowskiego 79a, 60-529 Poznan, Poland
PIERNIK is an MHD code created at the Centre for Astronomy, Nicolaus Copernicus University in Torun, Poland. The
current version of the code uses a simple, conservative numerical scheme, which is known as Relaxing TVD
scheme (RTVD). The aim of this project was to increase the performance of the PIERNIK code in the case where the computational domain is decomposed into a large number of smaller grids and each concurrent process is assigned a significant number of those grids. This optimization enables PIERNIK to run efficiently on Tier-0 machines. In Chapter 1 we introduce the PIERNIK software in more detail. Next we focus on the scientific aspects (Chapter 2) and discuss the algorithms used (Chapter 3), including potential optimization issues. Subsequently we present the performance analysis (Chapter 4), carried out with the Scalasca and Vampir tools. In the final Chapter 5 we present the optimization results. In the appendix we provide technical information about the installation and tests.
Joachim Heina, Anders Johansonb
of Mathematical Sciences & Lunarc, Lund University, Box 118, 221
00 Lund, Sweden
of Astronomy and Theoretical Physics, Lund University, Box 43, 221 00
The simulation of weakly compressible turbulent gas flows with embedded particles is one of the main objectives of the Pencil Code. While the code mostly deploys high-order finite difference schemes, portions of the code require the use of Fourier space methods. This report describes an optimisation project to improve the performance of the parallel Fourier transformation in the code. Certain optimisations which significantly improve the performance of the parallel FFT were observed to have a negative impact on other parts of the code, such that the overall performance decreases. Despite this challenge, the project managed to improve the performance of the parallel FFT within the Pencil Code by a factor of 2.4 and the overall performance of the application by 8% for a project-relevant benchmark.
Nikunena, Frank Scheinerb
aCSC – IT Center for
Science, P.O. Box 405, FI-02101 Espoo, Finland
bHigh Performance Computing
Center Stuttgart (HLRS), University of Stuttgart, D-70550 Stuttgart, Germany
Planck is a mission of the European Space Agency (ESA) to map the anisotropies of the cosmic microwave
background with the highest accuracy ever achieved. Planck is supported by several computing centres,
including CSC (Finland) and NERSC (USA). Computational resources were provided by CSC through the DECI
project Planck-LFI, and by NERSC as a regular production project. This whitepaper describes how PRACE-2IP
staff helped Planck-LFI with two types of support tasks: (1) porting their applications to the execution machine
and seeking ways to improve applications’ performance; and (2) improving performance and facilities to transfer
data between the execution site and the different data centres where data is stored.
M. Cytowskib,*, J. Heinc, J. Hertzerd
aINAF – Osservatorio Astrofisico di Catania, Italy
bInterdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw, Poland
cLunarc Lund University, Sweden
dHLRS, University of Stuttgart, Germany
In this whitepaper we report work that was done to investigate and improve the performance of a mixed MPI and OpenMP implementation of the FLY code for cosmological simulations on a PRACE Tier-0 system Hermit (Cray XE6).
Ghellera, Graziella Ferinia, Maciej Cytowskib,
aCINECA, Via Magnanelli 6/3,
Casalecchio di Reno, 40033, Italy
bICM, University of Warsaw,
ul.Pawinskiego 5a, 02-106 Warsaw, Poland
Ring 1, 28759 Bremen, Germany
In this paper we present the work performed to build and optimize the cosmological simulation code ENZO on Jugene, the Blue Gene/P system available at the Forschungszentrum Juelich in Germany. The work allowed us to define the optimal setup to perform high-resolution simulations aimed at the description of non-thermal phenomena (e.g. the acceleration of relativistic particles at shock waves) active in massive galaxy clusters during their cosmological evolution. These simulations will be the subject of a proposal in a future call for projects of the PRACE EU funded project (http://www.prace-ri.eu/).
Authors: Ata Turka,
Cevdet Aykanata, G. Vehbi Demircia, Sebastian
von Alfthanb, Ilja Honkonenb
Bilkent University, Computer Engineering Department, 06800 Ankara, Turkey
Finnish Meteorological Institute, PO Box 503, FI-00101 Helsinki, Finland
This whitepaper describes the load-balancing performance issues observed and tackled during the petascaling of a space plasma simulation code developed at the Finnish Meteorological Institute (FMI). The code models the communication pattern as a hypergraph, and partitions the computational grid using the parallel hypergraph partitioning scheme (PHG) of the Zoltan partitioning framework. The result of the partitioning determines the distribution of grid cells to processors. It is observed that the partitioning phase takes a substantial percentage of the overall computation time. Alternative (graph-partitioning-based) schemes that perform almost as well as the hypergraph partitioning scheme, while requiring less preprocessing overhead and achieving better balance, are proposed and investigated. A comparison of Zoltan's PHG, ParMeTiS, and PT-SCOTCH in terms of effect on running time, preprocessing overhead and load-balancing quality is presented. Test results on the Juelich BlueGene/P cluster are presented.
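The quality measures such comparisons rest on (communication volume and load balance) can be evaluated independently of the partitioner; the small sketch below computes an edge cut and a load-imbalance ratio for a given partition vector. The graph model and names are illustrative assumptions; Zoltan, ParMeTiS and PT-SCOTCH each report their own variants of these metrics.

```python
# Evaluation of a given partition: edge cut (a proxy for communication)
# and load imbalance (max load / average load). Metric definitions follow
# common partitioner conventions; the toy graph is an assumption.

def partition_quality(adjacency, weights, parts, num_parts):
    """adjacency[u] lists neighbors of cell u; parts[u] is u's part id."""
    cut = 0
    for u, neighbors in enumerate(adjacency):
        for v in neighbors:
            if u < v and parts[u] != parts[v]:   # count each edge once
                cut += 1
    loads = [0.0] * num_parts
    for u, w in enumerate(weights):
        loads[parts[u]] += w
    imbalance = max(loads) / (sum(loads) / num_parts)
    return cut, imbalance

# 2x2 grid of cells split into two halves: cut crosses two edges
adjacency = [[1, 2], [0, 3], [0, 3], [1, 2]]
weights = [1.0, 1.0, 1.0, 1.0]
cut, imbalance = partition_quality(adjacency, weights, [0, 0, 1, 1], 2)
```

The graph edge cut only approximates the true communication volume, which is precisely the modelling gap that motivates hypergraph formulations such as Zoltan's PHG.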
Finite Element applications
A. Abba a*, A. Emerson b, M. Nini a, M. Restelli c, M. Tugnoli a
a* Dipartimento di Scienze e Tecnologie Aerospaziali, Politecnico di Milano, Via La Masa, 34, 20156 Milano, Italy
b Cineca, via Magnanelli 6/3, 40033 Casalecchio di Reno, Bologna, Italy
c Max-Planck-Institut für Plasmaphysik, Boltzmannstraße 2, D-85748 Garching, Germany
We present a performance analysis of the numerical code DG-comp which is based on a Local Discontinuous Galerkin method and designed for the simulation of compressible turbulent flows in complex geometries. In the code several subgrid-scale models for Large Eddy Simulation and a hybrid RANS-LES model are implemented. Within a PRACE Preparatory Access Project, the attention was mainly focused on the following aspects:
1. Testing of the parallel scalability on three different Tier-0 architectures available in PRACE (Fermi, MareNostrum3 and Hornet);
2. Parallel profiling of the application with the Scalasca tool;
3. Optimization of the I/O strategy with the HDF5 library.
Without any code optimizations, it was found that the application demonstrated strong parallel scaling to more than 1000 cores on Hornet and MareNostrum, and at least up to 16384 cores on Fermi. The performance characteristics giving rise to this behaviour were confirmed with Scalasca. As regards the I/O strategy, a modification to the code was made to allow the use of HDF5-formatted files for output. This enhancement resulted in an increase in performance for most input datasets and a significant decrease in the storage space required. Other data were collected on the influence of the optimal compiler options to employ on the different computer systems and on the influence of numerical libraries for the linear algebra computations in the code.
Byckling, Mika Malinen, Juha Ruokolainen, Peter Raback
CSC – IT Center for Science, Keilaranta
14, 02101 Espoo, Finland
Recent developments of the Elmer finite element solver are presented. The applicability of the code to industrial problems has been improved by introducing features for handling rotational boundary conditions with mortar finite elements. The scalability of the code has been improved by making the code thread-safe and by multithreading some critical sections of the code. The developments are described and some scalability results are presented.
Authors: X. Saez, E.
Casoni, G. Houzeaux, M. Vazquez
Dept. of Computer Applications in
Science and Engineering, Barcelona Supercomputing Center (BSC-CNS),
08034 Barcelona, Spain
While solid mechanics codes are now proven tools in both industry and research, the increasingly demanding requirements of both sectors are fuelling the need for more computational power and more advanced algorithms. While commercial codes are widely used in industry during the design process, they often lag behind academic codes in terms of computational efficiency. In fact, commercial codes are usually general purpose and include millions of lines of code. Massively parallel computers appeared only recently, and the adaptation of these codes is proceeding slowly. In the meantime, academia adapted very quickly to the new computer architectures and now offers an attractive alternative: not so much modelling breadth, but more accuracy.
Alya is a computational mechanics code developed at the Barcelona Supercomputing Center (BSC-CNS) that solves Partial Differential Equations (PDEs) on unstructured meshes. To address the lack of an efficient parallel solid mechanics code, and motivated by demand from industrial partners, Alya-Solidz, the specific Alya module for solving computational solid mechanics problems, has been enhanced to treat large complex problems involving solid deformations and fracture. Some of these developments have been carried out in the framework of the PRACE-2IP European project.
In this article a solid mechanics simulation strategy for parallel supercomputers based on a hybrid approach is presented. A hybrid parallelization approach combining MPI tasks with OpenMP threads is proposed in order to exploit the different levels of parallelism of actual multicore architectures. This paper describes the strategy programmed in Alya and shows nearly optimal scalability results for some solid mechanical problems.
T. Kozubek, M. Jarosova, M. Mensik, A. Markopoulos
CE IT4Innovations, VSB-TU of Ostrava, 17. listopadu 15, 70833 Ostrava, Czech Republic
We describe a hybrid FETI (Finite Element Tearing and Interconnecting) method based on our variant of the FETI-type domain decomposition method called Total FETI. In our approach, a small number of neighboring subdomains is aggregated into clusters, which results in a smaller coarse problem. To solve the original problem, the Total FETI method is applied twice: to the clusters (macro-subdomains) and then to the subdomains in each cluster. This approach simplifies the implementation of hybrid FETI methods and makes it possible to extend the parallelization of the original problem up to tens of thousands of cores, thanks to the coarse space reduction and the resulting lower memory requirements. The performance is demonstrated on a linear elasticity benchmark.
T. Kozubek, D. Horak, V. Hapla
CE IT4Innovations, VSB-TU of Ostrava, 17. listopadu 15, 70833 Ostrava, Czech Republic
Most of the computations (subdomain problems) appearing in FETI-type methods are purely local and therefore parallelizable without any data transfers. However, if we also want to accelerate the dual actions, some communication is needed due to the primal-dual transition. The distribution of primal matrices is quite straightforward: each core works with the local part associated with its subdomains. A natural approach on massively parallel computers is to maximize the number of subdomains, so that the sizes of the subdomain stiffness matrices are reduced, which accelerates their factorization and the subsequent pseudoinverse application, among the most time-consuming actions. On the other hand, a negative effect of this is an increase of the null space dimension and of the number of Lagrange multipliers on the subdomain interfaces, i.e. the dual dimension, so that the bottleneck of the TFETI method becomes the application of the projector onto the natural coarse space, especially the part called the coarse problem solution. In this paper, we suggest and test different parallelization strategies for the coarse problem solution with regard to improving the massively parallel TFETI implementation.
Simultaneously we discuss some details of our FLLOP (FETI Light Layer on PETSc) implementation and demonstrate its performance on an engineering elastostatic benchmark of a car engine block with up to almost 100 million DOFs. The best parallelization strategy, based on MUMPS, was implemented into the multi-physics finite-element-based open-source code Elmer developed by CSC, Finland.
Authors: T. Kozubeka, V. Vondraka, P. Rabackb, J. Ruokolainenb
a Department of Applied Mathematics, VSB-TU of Ostrava, 17. listopadu 15, 70833 Ostrava, Czech Republic
b CSC – IT Center for Science, Keilaranta 14 a, 02101 Espoo, Finland
The bottlenecks in the numerical solution of many engineering problems depend strongly on the techniques used to solve the systems of linear equations that result from their linearizations and finite element discretizations. Large linearized problems can be solved efficiently using so-called scalable algorithms based on multigrid or domain decomposition methods. In cooperation with the Elmer team two variants of the domain decomposition method have been implemented into Elmer: (i) FETI-1 (Finite Element Tearing and Interconnecting) introduced by Farhat and Roux and (ii) Total FETI introduced by Dostal, Horak, and Kucera. In the latter, the Dirichlet boundary conditions are torn off so that all subdomains are floating, which makes the method very flexible. In this paper, we review the results related to the efficient solution of the symmetric positive semidefinite systems arising in FETI methods when they are applied to elliptic boundary value problems. More specifically, we show three different strategies to find the so-called fixing nodes (or DOFs – degrees of freedom), which enable an effective regularization of the corresponding subdomain system matrices that eliminates the work with singular matrices. The performance is illustrated on an elasticity benchmark computed using Elmer on the French Tier-0 system CURIE.
J. Ruokolainena, P. Rabacka,*, M. Lylya, T. Kozubekb, V. Vondrakb, V. Karakasisc
a CSC – IT Center for Science, Keilaranta 14 a, 02101 Espoo, Finland
b Department of Applied Mathematics, VSB – Technical University of Ostrava, 17. listopadu 15, 70833 Ostrava-Poruba, Czech Republic
c ICCS-NTUA, 9, Iroon Polytechniou Str., GR-157 73 Zografou, Greece
Elmer is a finite element software package for the solution of multiphysical problems. In the present work some performance bottlenecks in the workflow are eliminated. In preprocessing, the mesh splitting scheme is improved to allow the conservation of mesh grading for simple problems. For the solution of linear systems a preliminary FETI domain decomposition method is implemented. It utilizes a direct factorization of the local problem and an iterative method for joining the results from the subproblems. The weak scaling of FETI is shown to be nearly ideal, with the number of iterations staying almost fixed. For postprocessing, binary output formats and an XDMF+HDF5 I/O routine are implemented. Both may be used in conjunction with parallel visualization software.
Authors: Vasileios Karakasis1, Georgios Goumas1, Konstantinos Nikas2,*, Nectarios Koziris1, Juha
Ruokolainen3, and Peter Raback3
1Institute of Communication and Computer Systems (ICCS), Greece
2Greek Research & Technology Network (GRNET), Greece
3CSC -IT Center for Science Ltd., Finland
Multiphysics simulations are at the core of modern Computer Aided Engineering (CAE), allowing the analysis of multiple, simultaneously acting physical phenomena. These simulations often rely on Finite Element Methods (FEM) and the solution of large linear systems which, in turn, result in multiple calls of the costly Sparse Matrix-Vector Multiplication (SpMV) kernel. We have recently proposed the Compressed Sparse eXtended (CSX) format, which applies aggressive compression to the column indexing structure of the CSR format and is able to provide an average performance improvement of more than 40% over multithreaded CSR implementations. This work integrates CSX into the Elmer multiphysics simulation software and evaluates its impact on the total execution time of the solver. Despite its preprocessing cost, CSX is able to improve the performance of Elmer's SpMV component (using multithreaded CSR) by almost 40% and provides an up to 15% performance gain in overall solver time after 1000 linear system iterations. To our knowledge, this is one of the first attempts to evaluate the real impact of an innovative sparse-matrix storage format within a 'production' multiphysics software package.
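The core idea behind CSX — compressing the column-index structure of CSR — can be illustrated with a toy per-row delta-encoding. CSX itself goes much further (run-length substructure detection, variable-size index fields), so this sketch only conveys the principle, not the actual format:

```python
import numpy as np

def csr_spmv(values, col_idx, row_ptr, x):
    """Baseline CSR sparse matrix-vector product y = A @ x."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

def delta_encode(col_idx, row_ptr):
    """Per-row delta-encoding of column indices; small deltas could be
    stored in fewer bytes, which is the idea CSX pushes much further."""
    deltas = []
    for i in range(len(row_ptr) - 1):
        prev = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            deltas.append(col_idx[k] - prev)
            prev = col_idx[k]
    return deltas

def spmv_delta(values, deltas, row_ptr, x):
    """SpMV over delta-encoded indices, reconstructing columns on the fly."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        col = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            col += deltas[k]
            y[i] += values[k] * x[col]
    return y
```

Because SpMV is memory-bandwidth bound, any bytes saved on the index structure translate directly into throughput, which is why such compression pays off despite the extra decode work.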
K. Georgiev, N. Kosturski, I. Lirkov, S. Margenov, Y. Vutov
National Center for Supercomputing Applications, Acad. G. Bonchev str, Bl. 25-A, 1113 Sofia, Bulgaria
The White Paper content is focused on: a) construction and analysis of novel scalable algorithms to enhance scientific applications based on mesh methods (mainly on finite element method (FEM) technology); b) optimization of a new class of algorithms on many core systems.
On the one hand, the commonly accepted benchmark problem in computational fluid dynamics (CFD) – the time-dependent system of incompressible Navier-Stokes equations – is considered. The activities were motivated by advanced large-scale simulations of turbulent flows in the atmosphere and in the ocean, simulation of multiphase flows in order to extract average statistics, solving subgrid problems as part of homogenization procedures, etc. The computer model is based on the implementation of a new class of parallel numerical methods and algorithms for time-dependent problems. It only requires the solution of tridiagonal linear systems and is therefore computationally very efficient, with a computational complexity of the same order as that of an explicit scheme, and yet unconditionally stable. The scheme is particularly convenient for parallel implementation. Among the most important novel ideas is to avoid the transposition which is usually used in alternating-directions time stepping algorithms. The final goal is to provide portable tools for integration in commonly accepted codes like Elmer and OpenFOAM. The newly developed software is organized as a computer library for use by researchers dealing with the solution of the incompressible Navier-Stokes equations.
On the other hand, we implement and develop new scalable algorithms and software for FEM simulations with typically O(10^9) degrees of freedom in space for an IBM Blue Gene/P computer. We have considered voxel and unstructured meshes; stationary and time-dependent problems; linear and nonlinear models. The performed work was focused on the development of scalable mesh methods and tuning of the related software tools, mainly for the IBM Blue Gene/P architecture, but other massively parallel computers and MPI clusters were taken into account too. Efficient algorithms for time stepping, mesh refinement and parallel mappings were implemented. The aim here is again to provide software tools for integration in Elmer and OpenFOAM. The computational models address discrete problems in the range of O(10^9) degrees of freedom in space. The related time stepping techniques and iterative solvers are targeted to meet the Tier-1 and (further) Tier-0 requirements.
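The computational kernel of such a scheme — solving tridiagonal linear systems — can be sketched with the classical Thomas algorithm, an O(n) forward-elimination/back-substitution solver. This is an illustrative sketch of the kernel, not the library's actual implementation:

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c and right-hand side d in O(n) operations.
    a[0] and c[-1] are unused; returns the solution vector x."""
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                    # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):           # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Because each directional sweep of an alternating-directions step reduces to many independent tridiagonal solves of this kind, the per-step cost stays comparable to an explicit scheme while retaining unconditional stability.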
Scalability on 512 IBM Blue Gene/P nodes and several other high-performance computing clusters is currently documented for the tested software modules, and some of the results are presented in this paper. Comparison results of running the Elmer code on an Intel cluster (16 cores, Intel Xeon X5560) and on the IBM Blue Gene/P computer are included. Variants of 1D, 2D and 3D domain partitioning for the 3D test problems were systematically analysed, showing the advantages of the 3D partitioning for the Blue Gene/P communication system.
Saeza, Taner Akguna, Edilberto Sanchezb
a Barcelona Supercomputing Center – Centro Nacional de Supercomputacion, C/ Gran Capita 2-4, Barcelona, 08034, Spain
b Laboratorio Nacional de Fusion, Avda Complutense 22, 28040 Madrid, Spain
In this paper we report the work done in Task 7.2 of the PRACE-1IP project on the code EUTERPE. We report on the progress made on the hybridization of the code with MPI and OpenMP; the status of the porting to GPUs; the outline of the parameter analysis; and the study of the possibility of incorporating I/O forwarding to improve performance. Our initial findings indicate that particle-in-cell codes such as EUTERPE are suitable candidates for the new computing paradigms involving heterogeneous architectures.
Life Science applications
V. Ruggieroa*, D. Codonib and D. Tordellab
aCINECA, SCAI Rome, bPolitecnico di Torino, DISAT
In past literature, most simulations of warm clouds assumed static and homogeneous conditions. We are interested in simulating more realistic regimes of warm clouds, which are in fact systems living in perpetually transitional states. These time evolutions depend strongly on the turbulent air flow hosting the cloud, and on the transport phenomena taking place through the complex surfaces that bound the cloud with respect to the clear air surrounding it.
In our simulations, cloud boundaries (called interfaces in the text) are modelled through shear-less turbulent mixing, matching two interacting flow regions – a small portion of cloud and an adjacent clear-air portion of equivalent volume – at different turbulent intensities. An initial condition reproduces local stable or unstable stratification in density and temperature. The droplet model includes evaporation, condensation, collision and coalescence. The typical water content inside a warm cloud parcel of about 500 m^3, when associated with an initial condition where drops are 30 microns in diameter, leads to an initial number of drops of the order of 10^11. A simulation grid of up to 4092x2048x2048 points is sought, which leads to a Taylor-microscale Reynolds number of 500. The governing equations are the Navier-Stokes equations under the Boussinesq approximation, coupled to the transport equation for the water vapour, represented as a passive scalar, and for the drops, seen as inertial particles transported by background turbulence and gravity. The code uses a slab parallelization. The system contains a huge number of discrete elements, i.e. the water droplets, which undergo an intense clustering due to turbulent fluctuations. Turbulent clustering is not predictable and in turn produces an imbalance in the communication rate among different cores. As a consequence, the computational burden among the cores in the cluster is not evenly distributed. This, per se, highly limits performance and binds the parallelization organization to that of a slab structure. Furthermore, clustering increases in time and induces an inhomogeneous enhancement of the local droplet collision rate, as well as a concomitant depression of the growth in size of water droplets.
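In generic form (our transcription of the system described above; the exact terms and scalings used in the code may differ), the governing equations read:

```latex
% Incompressible Navier--Stokes under the Boussinesq approximation,
% coupled to a passive-scalar transport equation for the water vapour q.
\begin{aligned}
  \nabla \cdot \mathbf{u} &= 0, \\
  \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}
    &= -\frac{1}{\rho_0}\nabla p + \nu \nabla^2 \mathbf{u}
       - \frac{\rho'}{\rho_0}\, g\, \hat{\mathbf{z}}, \\
  \frac{\partial q}{\partial t} + (\mathbf{u}\cdot\nabla)\, q
    &= \kappa \nabla^2 q,
\end{aligned}
```

with the droplets treated separately as inertial particles advected by the resolved turbulence and gravity.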
The long-term evolution of many kinds of transients must be considered in order to understand the above processes. This, in association with the variation of a quite large set of control parameters, will be the main motivation to request Tier-0 level computational resources in the future for the simulation of water droplets' growth, collision, coalescence and clustering inside turbulent warm cloud-clear air interfaces.
Judit Gimeneza*, Fernando Gomezb, Roberto Lopezb
aBarcelona Supercomputing Center, Spain, bArtelnics S.L., Spain
Artelnics develops the professional predictive analytics solution called Neural Designer. It makes intelligent use of data by discovering complex relationships, recognizing unknown patterns, predicting actual trends or finding associations. The new OpenMP and MPI version of the code allows Artelnics to build predictive models in computer instances with many virtual cores and in supercomputing clusters, respectively. The current version of the code reports efficiencies close to 90% for both the MPI and the OpenMP parallelizations. Neural Designer now is capable of analysing bigger data sets in less time, providing Artelnics customers with results in a way previously unachievable.
Andrew Sunderland, Martin Plummer
STFC Daresbury Laboratory, Warrington, United Kingdom
DNA oxidative damage has long been associated with the development of a variety of cancers including colon, breast and prostate, whilst RNA damage has been implicated in a variety of neurological diseases, such as Alzheimer's disease and Parkinson's disease. Radiation damage arises when energy is deposited in cells by ionizing radiation, which in turn leads to strand breaks in DNA. The strand breaks are associated with electrons trapped in quasi-bound 'resonances' on the basic components of the DNA. HPC usage will enable the study of this resonance formation in much more detail than in current initial calculations. The associated application is UKRmol, a widely used, general-purpose electron-molecule collision package, and the enabling aim is to replace a serial propagator (coupled PDE solver) with a parallel equivalent module.
R. Oguz Selvitopi, Gunduz Vehbi Demirci, Ata Turk, Cevdet Aykanat
Bilkent University, Computer Engineering Department, 06800 Ankara, TURKEY
This whitepaper addresses applicability of the Map/Reduce paradigm for scalable and easy parallelization of fundamental data mining approaches with the aim of exploring/enabling processing of terabytes of data on PRACE Tier-0 supercomputing systems. To this end, we first test the usage of MR-MPI library, a lightweight Map/Reduce implementation that uses the MPI library for inter-process communication, on PRACE HPC systems; then propose MR-MPI-based implementations of a number of machine learning algorithms and constructs; and finally provide experimental analysis measuring the scaling performance of the proposed implementations. We test our multiple machine learning algorithms with different datasets. The obtained results show that utilization of the Map/Reduce paradigm can be a strong enhancer on the road to petascale.
Thomas Roblitz, Ole W. Saastad, Hans A. Eide, Katerina Michalickova, Alexander Johan Nederbragt, Bastiaan Star
Department for Research Computing, University Center for Information Technology (USIT), University of Oslo, P.O. Box 1059, Blindern, 0316 Oslo, Norway
Center for Ecological and Evolutionary Synthesis, Department of Biosciences (CEES), University of Oslo, P.O. Box 1066, Blindern, 0316 Oslo, Norway
Sequencing projects, like the Aqua Genome project, generate vast amounts of data which are processed through different workflows composed of several steps linked together. Currently, such workflows are often run manually on large servers. With the increasing amount of raw data that approach is no longer feasible. The successful implementation of the project's goals requires 2-3 orders of magnitude scaling of computing, while at the same time achieving high reliability and supporting ease of use of supercomputing resources. We describe two example use cases, the implementation challenges and constraints, and the actual application enabling, and report our findings.
Authors: A. Charalampidoua,b, P. Daogloua,b, D. Foliasa,b, P. Borovskac,d, V. Ganchevac,e
a Greek Research and Technology Network, Athens, Greece
b Scientific Computing Center, Aristotle University of Thessaloniki, Greece
c National Centre for Supercomputing Applications, Bulgaria
d Department of Computer Systems, Technical University of Sofia, Bulgaria
e Department of Programming and Computer Technologies, Technical University of Sofia, Bulgaria
The project focuses on performance investigation and improvement of the multiple biological sequence alignment software MSA_BG on the BlueGene/Q supercomputer JUQUEEN. For this purpose, scientific experiments in the area of bioinformatics have been carried out, using influenza virus sequences as a case study. The objectives of the project are code optimization, porting, scaling, profiling and performance evaluation of the MSA_BG software. To this end we have developed a hybrid MPI/OpenMP parallelization on top of the MPI-only code, and we showcase the advantages of this approach through the results of benchmark tests that were performed on JUQUEEN. The experimental results show that the hybrid parallel implementation provides considerably better performance than the original code.
Authors: Plamenka Borovska, Veska Gancheva
National Centre for Supercomputing Applications, Bulgaria
In silico biological sequence processing is a key task in molecular biology. This scientific area requires powerful computing resources for exploring large sets of biological data. Parallel in silico simulations based on methods and algorithms for the analysis of biological data using high-performance distributed computing are essential for accelerating the research and reducing the investment. Multiple sequence alignment is a widely used method for biological sequence processing; its goal is the alignment of DNA and protein sequences. This paper presents an innovative parallel algorithm, MSA_BG, for the multiple alignment of biological sequences that is highly scalable and locality-aware. The MSA_BG algorithm we describe is iterative and is based on the concept of Artificial Bee Colony (ABC) metaheuristics and the concept of the correlation of algorithmic and architectural spaces. The metaphor of the ABC metaheuristics has been constructed and the functionalities of the agents have been defined. The conceptual parallel model of computation has been designed and the algorithmic framework of the parallel algorithm constructed. Experimental simulations based on the parallel implementation of the MSA_BG algorithm for multiple sequence alignment on a heterogeneous compact computer cluster and on the supercomputer BlueGene/P have been carried out for the case study of influenza virus variability investigation. The performance estimation and profiling analyses have shown that the parallel system is well balanced both with respect to the workload and to the machine size.
Authors: Soon-Heum Koa, Plamenka Borovskab‡, Veska Ganchevac†
a National Supercomputing Center, Linkoping University, 58183 Linkoping, Sweden
b Department of Computer Systems, Technical University of Sofia, Sofia, Bulgaria
c Department of Programming and Computer Technologies, Technical University of Sofia, Sofia,
This activity within the PRACE-2IP project aims to investigate and improve the performance of the multiple sequence alignment software ClustalW on the BlueGene/Q supercomputer JUQUEEN, for the case study of influenza virus sequences. Porting, tuning, profiling, and scaling of the code have been accomplished in this respect. A parallel I/O interface has been designed for efficient sequence dataset input, in which sub-groups' local masters take care of the reading operation and broadcast the dataset to their slaves. The optimal group size has been investigated and the effect of the reading buffer size on reading performance has been examined. The application to the ClustalW software shows that the current implementation with parallel I/O provides considerably better performance than the original code with respect to the I/O segment, leading to up to a 6.8 times speed-up for inputting the dataset when using 8192 JUQUEEN cores.
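The grouping scheme described above — local masters read, then broadcast to their group — can be sketched as follows. The function names, the contiguous-slice layout and the lowest-rank-as-master convention are our own assumptions for illustration, not ClustalW's actual implementation:

```python
def io_groups(nprocs, group_size):
    """Split ranks into I/O sub-groups: the lowest rank of each group
    acts as the local master that reads its share of the dataset and
    broadcasts it to the other group members."""
    groups = {}
    for rank in range(nprocs):
        gid = rank // group_size
        groups.setdefault(gid, []).append(rank)
    # (master, members) per group
    return {gid: (members[0], members) for gid, members in groups.items()}

def simulate_read(nprocs, group_size, dataset):
    """Each local master 'reads' a contiguous slice of the dataset,
    then every member of its group receives a copy of that slice."""
    layout = io_groups(nprocs, group_size)
    ngroups = len(layout)
    chunk = (len(dataset) + ngroups - 1) // ngroups
    received = {}
    for gid, (master, members) in layout.items():
        data = dataset[gid * chunk:(gid + 1) * chunk]  # master reads
        for r in members:                              # broadcast
            received[r] = data
    return received
```

Only one rank per group touches the file system, so the number of concurrent readers — and hence contention on the parallel file system — is controlled by the group size, which is exactly the parameter the study tunes.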
Authors: D. Grancharov, E. Lilkova, N. Ilieva, P. Petkov, S. Markov and L. Litov
National Centre for Supercomputing Applications, Acad. G. Bonchev Str, Bl. 25-A, 1113 Sofia, Bulgaria
Based on the analysis of the performance, scalability, workload increase and distribution of the MD simulation packages GROMACS and NAMD for very large systems and core numbers, we evaluate the possibilities for overcoming the deterioration of the scalability and performance of the existing MD packages through the implementation of symplectic integration algorithms with multiple step sizes.
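A minimal sketch of such a multiple-step-size symplectic integrator is the two-level r-RESPA-style scheme below, assuming unit mass and a given fast/slow force splitting. This is an illustration of the technique, not the scheme implemented in GROMACS or NAMD:

```python
def respa_step(x, v, f_fast, f_slow, dt, n_inner):
    """One outer step of a two-level multiple-time-step symplectic
    integrator: slow forces are applied with the outer step dt,
    fast forces with the inner step dt / n_inner (unit mass)."""
    h = dt / n_inner
    v += 0.5 * dt * f_slow(x)          # half kick with slow forces
    for _ in range(n_inner):           # inner velocity-Verlet loop
        v += 0.5 * h * f_fast(x)
        x += h * v
        v += 0.5 * h * f_fast(x)
    v += 0.5 * dt * f_slow(x)          # closing half kick, slow forces
    return x, v
```

Because the expensive slow (e.g. long-range) forces are evaluated only once per outer step while the cheap stiff forces take the small step, the cost per unit of simulated time drops without sacrificing the long-term energy behaviour that symplecticity provides.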
Particle Physics applications
Jacques David,Vincent Bergeaud
CEA/DEN/SA2P, CEA-Saclay, 91191 , Gif sur Yvette, France
CEA/DEN/DM2S/LGLS, CEA-Saclay, 91191 , Gif sur Yvette, France
Mathematical models designed to simulate complex physics are used in scientific and engineering studies. In the case of nuclear applications, numerical simulation is paramount for assessing safety parameters such as fuel temperature, in order to gain confidence through comparison with experiment. The URANIE tool uses propagation methods to assess uncertainties in simulation output parameters in order to better evaluate confidence intervals (e.g., of temperature, pressure, etc.). This supports the Verification, Validation and Uncertainty Quantification (VVUQ) process used for safety analysis.
While URANIE is well suited to launching many instances of serial codes, it suffers from a lack of scalability and portability when used for coupled simulations and/or parallel codes. The aim of the project is therefore to enhance this launching mechanism to support a wider variety of applications, leveraging HPC capabilities to reach a new level of statistical assessment for models.
Alexei Strelchenko, Marcus Petschlies and Giannis Koutsou
CaSToRC, Nicosia 2121, Cyprus
We extend the QUDA library, an open-source library for performing calculations in lattice QCD on Graphics
Processing Units (GPUs) using NVIDIA’s CUDA platform, to include kernels for non-degenerate twisted mass and
multi-GPU Domain Wall fermion operators. Performance analysis is provided for both cases.
Authors: Oguz Selvitopia, Cevdet Aykanata*
a Bilkent University, Computer Engineering Department, 06800 Ankara, TURKEY
Abstract: Parallelizing sparse irregular applications on distributed memory systems poses serious scalability challenges due to communication bottlenecks, which manifest themselves in an unpredictable manner as high bandwidth and/or latency overheads. The importance of the different components of the overall communication cost can be disproportionate due to the irregularity and sparseness inherent in the application. In such conditions, the best strategy for reducing communication overheads should favor the metric that is most crucial for performance; a general method that attributes the same importance to all metrics is likely to suffer. This work takes on the communication challenges posed by latency-bound irregular applications, i.e., applications characterized by a high average and/or maximum number of messages per processor. The basic idea of our approach is to impose a regular communication pattern onto otherwise irregular communication operations and in this way provide a low upper bound on the maximum number of messages handled by a processor. Using a regular communication pattern eliminates the irregularity in latency-bound communication operations and necessitates a store-and-forward scheme that consists of multiple stages of communication. Our findings show that the proposed approach is a remedy for latency-bound applications; it scales seemingly unscalable instances and leads to an average of 50% reduction in parallel runtime on 256 processors.
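One way to picture such a staged store-and-forward pattern — the whitepaper does not specify its topology, so a hypercube is assumed here purely for illustration — is dimension-ordered routing: every message reaches its destination by correcting one address bit per stage, so each processor only ever exchanges with its log2(P) neighbours instead of arbitrarily many irregular partners:

```python
def hypercube_route(src, dst, ndims):
    """Route a message from src to dst in a hypercube of 2**ndims
    processors by correcting one differing address bit per stage.
    Intermediate hops store and forward the message."""
    path = [src]
    cur = src
    for d in range(ndims):
        if (cur ^ dst) & (1 << d):
            cur ^= 1 << d
            path.append(cur)
    return path

def max_partners(ndims):
    """Distinct neighbours any processor talks to under this scheme
    when every processor sends to every other one."""
    P = 1 << ndims
    partners = {p: set() for p in range(P)}
    for src in range(P):
        for dst in range(P):
            path = hypercube_route(src, dst, ndims)
            for a, b in zip(path, path[1:]):
                partners[a].add(b)
    return max(len(s) for s in partners.values())
```

The bound drops from up to P-1 messages per processor to log2(P), which is precisely the kind of low upper bound on message counts that makes latency-bound instances scalable.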
Authors: Emmanuel Agulloa, Luc Girauda, Stephane Lanterib, Ludovic Moyab, Jean Romana, and Olivier Rouchonc
a Hiepacs project-team, Inria Bordeaux-Sud Ouest Research Center, Talence, France
b Nachos project-team, Inria Sophia Antipolis-Mediterranee, Sophia Antipolis, France
c CINES, Montpellier, France
Abstract: We report on our work aiming at enabling large-scale simulations of frequency-domain electromagnetic wave propagation based on a recently developed innovative simulation software combining a high order finite element discretization scheme formulated on an unstructured tetrahedral grid, and scalable sparse linear solvers. The enabling numerical tool is a domain decomposition solution strategy for the sparse system of linear equations resulting from the spatial discretization of the underlying PDEs (Partial Differential Equations), that can be either a purely algebraic algorithm working at the matrix operator level (i.e. a black-box solver), or a tailored algorithm designed at the continuous PDE level (i.e. a PDE-based solver). The PDEs at hand here are the frequency-domain (or time-harmonic) Maxwell equations. Two concrete and different applications are considered for illustrating the modeling capabilities of the simulation software and assessing its parallel performances on the road to Exascale: the scattering of a plane wave by an aircraft, and the interaction of an electromagnetic wave with a heterogeneous model of head tissues.
Authors:A. Artiguesa, G. Houzeauxa
aBarcelona Supercomputing Center
The Alya System is the BSC simulation code for multi-physics problems. It is based on a Variational Multiscale Finite Element Method for unstructured meshes.
Work distribution is achieved by partitioning the original mesh into subdomains (sub-meshes). This pre-partition step has until now been done in serial by only one process, using the METIS library. This is a huge bottleneck when larger meshes with millions of elements have to be partitioned: either the data does not fit in the memory of a single computing node or, in the cases where it does fit, Alya takes too long in this serial partitioning step.
In this document, we explain the tasks done to design, implement and test a new parallel partitioning algorithm for Alya, in which a subset of the workers is in charge of partitioning the mesh in parallel, using the ParMETIS library.
The partitioning workers load consecutive parts of the main mesh with a parallel space-partitioning bin structure, capable of obtaining the adjacent boundary elements of their respective sub-meshes. With this local mesh, each of the partitioning workers is able to create its local element adjacency graph and to partition the mesh.
We have validated our new algorithm using a Navier-Stokes problem on a small cube mesh of 1000 elements.
Then we performed a scalability test on a 30M element mesh to check if the time to partition the mesh is reduced
proportionally with the number of partitioning workers.
We have also compared METIS and ParMETIS with respect to the balancing of the element distribution among the domains, to test how using many partitioning workers to partition the mesh affects the scalability of Alya. These tests showed that it is better to use fewer partitioning workers to partition the mesh.
Finally, we have two sections explaining the results and the future work that has to be done in order to finalize and improve the parallel partition algorithm.
Krzysztof T. Zwierzynski*a
*aPoznan Supercomputing and Networking Center, ul. Z. Noskowskiego 12/14, 61-704 Poznan, Poland
In this paper, we consider the problem of designing a self-improving meta-model of a job workflow that is sensitive to changes in the computational environment. As examples of the searched combinatorial objects, permutations and some classes of integral graphs are used. We propose a number of dedicated methods that can improve the execution time of the workflow, based on decision trees and the replication of some actors in the workflow.
Authors: Jerome Richarda,+, Vincent Lanoreb,+ and Christian Perezc,+
a University of Orleans, France
b Ecole Normale Superieure de Lyon, France
+ Avalon Research-Team, LIP, ENS Lyon, France
Abstract: The Fast Fourier Transform (FFT) is a widely-used building block for many high-performance scientific applications. The efficient computing of FFT is paramount for the performance of these applications. This has led to many efforts to implement machine and computation specific optimizations. However, no existing FFT library is capable of easily integrating and automating the selection of new and/or unique optimizations.
To ease FFT specialization, this paper evaluates the use of component-based software engineering, a programming paradigm that consists of building applications by assembling small software units. Component models are known to have many software engineering benefits but usually have insufficient performance for high-performance scientific applications.
This paper uses the L2C model, a general-purpose high-performance component model, and studies its performance and adaptation capabilities on 3D FFTs. Experiments show that L2C, and components in general, enables easy handling of 3D FFT specializations while obtaining performance comparable to that of well-known libraries. However, a higher-level component model is needed to automatically generate an adequate L2C assembly.
Authors: R. Oguz Selvitopia, Cevdet Aykanata,*
a Bilkent University, Computer Engineering Department, 06800 Ankara, TURKEY
Abstract: Parallel iterative solvers are widely used in solving large sparse linear systems of equations on large-scale parallel architectures. These solvers generally contain two different types of communication operations: point-to-point (P2P) and global collective communications. In this work, we present a computational reorganization method to exploit a property that is commonly found in Krylov subspace methods. This reorganization allows P2P and collective communications to be performed simultaneously. We exploit this opportunity by embedding the content of the P2P messages into the messages exchanged in the collective communications, in order to reduce the latency overhead of the solver. Experiments on two different supercomputers with up to 2048 processors show that the proposed latency-avoiding method exhibits superior scalability, especially with an increasing number of processors.
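A simple alpha-beta-style latency accounting — our own illustrative model, not taken from the whitepaper — shows why embedding P2P contents into a collective helps: the per-message startup costs of the separate P2P phase disappear, leaving only the collective's log2(P) stages:

```python
from math import log2, ceil

def latency_separate(P, halo_msgs, alpha=1.0):
    """Startup (latency) cost per iteration when P2P halo exchanges
    and the allreduce are performed as separate phases: halo_msgs
    point-to-point messages plus a log2(P)-stage collective.
    alpha is the per-message startup cost."""
    return alpha * (halo_msgs + ceil(log2(P)))

def latency_embedded(P, alpha=1.0):
    """Startup cost when the P2P contents travel piggybacked inside
    the log2(P)-stage collective's messages (store-and-forward through
    the reduction tree): only the collective's messages remain."""
    return alpha * ceil(log2(P))
```

The model ignores the extra bandwidth of the enlarged collective messages; it captures only the latency term, which is the dominant cost for the small messages typical of Krylov solver iterations at high processor counts.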
Gunduz Vehbi Demirci, Ata Turk, R. Oguz Selvitopi, Kadir Akbudak, Cevdet Aykanat
Bilkent University, Computer Engineering Department, 06800 Ankara, TURKEY
Abstract: This whitepaper addresses the applicability of the MapReduce paradigm for scientific computing by realizing it on the widely used sparse matrix-vector multiplication (SpMV) operation with a recent library developed for this purpose. Scaling SpMV operations proves vital as it is a kernel that finds its applications in many scientific problems from different domains. Generally, the scalability improvement of these operations is negatively affected by high communication requirements of the multiplication, especially at large processor counts in the case of strong scaling. We propose two partitioning-based methods to reduce these requirements and allow SpMV operations to be performed more efficiently. We demonstrate how to parallelize SpMV operations using MR-MPI, an efficient and portable library that aims at enabling usage of MapReduce paradigm in scientific computing. We test our methods extensively with different matrices. The obtained results show that the utilization of communication-efficient methods and constructs are required on the road to Exascale.
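The map/shuffle/reduce decomposition of SpMV described above can be sketched in pure Python. This is a serial illustration of the paradigm only, not MR-MPI's API:

```python
from collections import defaultdict

def mapreduce_spmv(nonzeros, x):
    """SpMV y = A @ x in MapReduce style: map emits (row, a_ij * x_j)
    for each nonzero, the shuffle groups pairs by row, and reduce sums
    the partial products. A is given as (i, j, a_ij) triples."""
    # map phase: one key-value pair per nonzero
    kv = [(i, a * x[j]) for (i, j, a) in nonzeros]
    # shuffle phase: group values by key (row index)
    grouped = defaultdict(list)
    for key, val in kv:
        grouped[key].append(val)
    # reduce phase: sum the partial products per row
    return {row: sum(vals) for row, vals in grouped.items()}
```

In a distributed setting the shuffle is where the communication cost lives — every pair whose row lives on another processor must be moved — which is exactly the cost the proposed partitioning methods aim to reduce.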
Download paper: PDF
Authors: Petri Nikunena, Frank Scheinerb
a CSC – IT Center for Science, P.O. Box 405, FI-02101 Espoo, Finland
b High Performance Computing Center Stuttgart (HLRS), University of Stuttgart, D-70550 Stuttgart, Germany
Abstract: Planck is a mission of the European Space Agency (ESA) to map the anisotropies of the cosmic microwave background with the highest accuracy ever achieved. Planck is supported by several computing centers, including CSC (Finland) and NERSC (USA). Computational resources were provided by CSC through the DECI project Planck-LFI, and by NERSC as a regular production project. This whitepaper describes how PRACE-2IP staff helped Planck-LFI with two types of support tasks: (1) porting their applications to the execution machine and seeking ways to improve application performance; and (2) improving the performance of, and facilities for, data transfer between the execution site and the different data centers where data is stored.
Download paper: PDF
Krzysztof T. Zwierzynski
Poznan Supercomputing and Networking Center, ul. Z. Noskowskiego 12/14, 61-704 Poznan, Poland
In this white paper, we report on work done on the problem of generating combinatorial structures with some rare invariant properties. These combinatorial structures are connected to integral graphs. All 588 such graphs of order 1 ≤ n ≤ 12 are known. The main goal of this work was to reduce generation time by distributing graph generators over hosts in the PRACE-RI, and to reduce the time spent sieving integral graphs by performing the eigenvalue calculations on a GPGPU device using OpenCL. This work is also a study of how to minimize the overhead connected with using OpenCL kernels.
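The sieving step keeps exactly those graphs whose adjacency spectrum consists of integers only. As a CPU-side sketch of that test (pure Python with exact arithmetic; the whitepaper's OpenCL kernels instead compute eigenvalues numerically on the GPU), one can build the characteristic polynomial with the Faddeev-LeVerrier recurrence and deflate it by candidate integer roots:

```python
from fractions import Fraction

def charpoly(A):
    """Coefficients [1, c1, ..., cn] of det(xI - A) for an n x n
    integer matrix, via the Faddeev-LeVerrier recurrence."""
    n = len(A)
    M = [[Fraction(0)] * n for _ in range(n)]
    coeffs = [Fraction(1)]
    c = Fraction(1)
    for k in range(1, n + 1):
        # M <- A @ (M + c * I), c <- -trace(M) / k
        T = [[M[i][j] + (c if i == j else 0) for j in range(n)] for i in range(n)]
        M = [[sum(Fraction(A[i][l]) * T[l][j] for l in range(n)) for j in range(n)]
             for i in range(n)]
        c = -sum(M[i][i] for i in range(n)) / k
        coeffs.append(c)
    return coeffs

def is_integral_graph(A):
    """A graph is integral iff every root of its characteristic
    polynomial is an integer. Adjacency eigenvalues lie in [-n, n],
    so we deflate by each integer root found in that range."""
    p = [int(c) for c in charpoly(A)]
    n = len(A)
    for _ in range(n):
        for r in range(-n, n + 1):
            # Synthetic division of p by (x - r); last entry = remainder.
            q, rem = [], 0
            for c in p:
                rem = rem * r + c
                q.append(rem)
            if q[-1] == 0:
                p = q[:-1]
                break
        else:
            return False  # some root is not an integer
    return True

# K4 (spectrum 3, -1, -1, -1) is integral; the path P3 (spectrum
# sqrt(2), 0, -sqrt(2)) is not.
K4 = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]
P3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(is_integral_graph(K4), is_integral_graph(P3))  # True False
```

For the small orders considered (n ≤ 12) the exact test is cheap; the GPU approach pays off when millions of candidate graphs must be sieved.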
Download paper: PDF
Authors: Dimitris Siakavaras, Konstantinos Nikas, Nikos Anastopoulos, and Georgios Goumas
Greek Research & Technology Network (GRNET), Greece
Abstract: This whitepaper studies the various aspects and challenges of performance scaling on large scale shared memory systems.
Our experiments are performed on a large ccNUMA machine that consists of 72 IBM 3755 nodes connected with NumaConnect, providing shared memory over a total of 1728 cores, a number far beyond conventional server platforms. As benchmarks, three data-intensive and memory-bound applications with different communication patterns are selected, namely Jacobi, CSR SpM-V and Floyd-Warshall. Our results illustrate the need for NUMA-aware design and implementation of shared-memory parallel algorithms in order to achieve scaling to high core counts. At the same time, we observed that, depending on its communication pattern, an application may benefit more from explicit communication using message passing.
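For reference, the Jacobi benchmark reduces to a sweep like the following (a serial Python sketch; the studied implementations are threaded codes). The NUMA issue arises because each sweep streams over the whole grid: unless each thread's block of the array is allocated on that thread's local memory node (e.g. via first-touch initialization), every iteration crosses the interconnect, and only the few points at block boundaries genuinely need remote access:

```python
def jacobi_step(u):
    """One Jacobi sweep on a 1-D grid with fixed boundary values:
    each interior point becomes the mean of its two neighbors. In a
    threaded version, the grid is split into contiguous per-thread
    blocks, and only the two points at each block boundary are read
    from another thread's data -- the implicit communication pattern
    that NUMA-aware placement must respect."""
    return [u[0]] + [(u[i - 1] + u[i + 1]) / 2
                     for i in range(1, len(u) - 1)] + [u[-1]]

# Converges to the linear profile between the boundary values 0 and 4.
u = [0.0, 0.0, 0.0, 4.0]
for _ in range(50):
    u = jacobi_step(u)
print([round(v, 3) for v in u])  # [0.0, 1.333, 2.667, 4.0]
```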
Download paper: PDF
Authors: P. Marsaleka, A. Grygarb, T. Karaseka*, T. Brzobohatya
a IT4Innovations, VSB – Technical University of Ostrava, Ostrava, Czech Republic
b Invent Medical Group s.r.o., Ostrava, Czech Republic
The objective of this work is to replace laboratory testing of cranial orthosis designs with virtual prototyping based on numerical modeling and simulation technologies. In this paper, the focus is on the pre-processing stage of the numerical modeling process. In the future, computational structural dynamics simulations will be performed during the development of new cranial orthosis designs at the Invent Medical Group (IMG) company. The stiffness of a cranial orthosis will be evaluated by means of numerical modeling and simulation, replacing the need to physically test each and every new design to ensure its proper behavior. The objective of this project is to create a semi-automatic system for mesh generation from the input geometric model, based on the open-source Netgen Mesh Generator. The geometry of the structure is provided as an input file for meshing in STL format. The output of this project is a software tool that takes a geometric model as input and produces a finite element mesh with all boundary conditions as output. The mesh produced is then used to calculate the stiffness of the cranial orthosis using the open-source code ESPRESO. Simulation results are verified by comparison with the same numerical simulation performed in the well-established software package ANSYS, as well as with physical experiments.
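To illustrate the STL input format mentioned above, here is a minimal ASCII STL reader in Python (a hypothetical sketch for exposition only; the actual tool hands the STL file directly to Netgen, and a production reader would also need to handle the binary STL variant):

```python
def parse_ascii_stl(text):
    """Minimal ASCII STL reader: returns a list of facets, each a
    (normal, vertices) pair, where normal is a float triple and
    vertices is a list of three float triples."""
    facets = []
    normal, verts = None, []
    for line in text.splitlines():
        tok = line.split()
        if not tok:
            continue
        if tok[:2] == ["facet", "normal"]:
            normal = tuple(float(t) for t in tok[2:5])
            verts = []
        elif tok[0] == "vertex":
            verts.append(tuple(float(t) for t in tok[1:4]))
        elif tok[0] == "endfacet":
            facets.append((normal, verts))
    return facets

# A single-triangle solid in the ASCII STL format.
stl = """solid tri
facet normal 0 0 1
  outer loop
    vertex 0 0 0
    vertex 1 0 0
    vertex 0 1 0
  endloop
endfacet
endsolid tri"""
print(len(parse_ascii_stl(stl)))  # 1
```

Since STL describes only a triangulated surface (with no topology or units), the meshing stage must reconstruct connectivity and generate the volume mesh, which is what Netgen provides.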
Disclaimer: These whitepapers have been prepared by the PRACE Implementation Phase Projects in accordance with the Consortium Agreements and Grant Agreements n° RI-261557, n° RI-283493, or n° RI-312763.
They solely reflect the opinion of the parties to such agreements on a collective basis in the context of the PRACE Implementation Phase Projects and to the extent foreseen in such agreements. Please note that even though all participants to the PRACE IP Projects are members of PRACE AISBL, these whitepapers have not been approved by the Council of PRACE AISBL and therefore do not emanate from it nor should be considered to reflect PRACE AISBL’s individual opinion.
Copyright notices: © 2014 PRACE Consortium Partners. All rights reserved. This document is a project document of a PRACE Implementation Phase project. All contents are reserved by default and may not be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contracts RI-261557, RI-283493, or RI-312763 for reviewing and dissemination purposes. All trademarks and other rights on third-party products mentioned in the document are acknowledged as owned by the respective holders.