Scalable algorithms

Authors: Emmanuel Agulloa, Luc Girauda, Matias Hastarana, Stéphane Lanterib, Laurent Lothc, Ludovic Moyab, Florent Pruvosta, Jean Romana,d and Olivier Rouchond

a Hiepacs project-team, Inria Bordeaux-Sud Ouest Research Center, Talence, France

b Nachos project-team, Inria Sophia Antipolis-Méditerranée, Sophia Antipolis, France

c ANDRA, Direction de la Recherche et du Développement, Chatenay-Malabry, France

d CINES, Montpellier, France

Abstract: We report on our work aimed at enabling large-scale simulations of nuclear waste management based on simulation software combining a finite element discretization scheme formulated on a hybrid hexahedral-tetrahedral grid with scalable sparse linear solvers. The enabling numerical tool is a domain decomposition solution strategy for the sparse system of linear equations resulting from the spatial discretization of the underlying PDEs (Partial Differential Equations). The PDEs at hand are Darcy's equations. A concrete application is considered for assessing the benefits of using a domain decomposition sparse linear solution strategy as compared to the originally used approach, which is based on a more classical preconditioned iterative algorithm.
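The domain decomposition strategy itself is detailed in the paper, not in this abstract; as a much simplified illustration of the idea, the sketch below applies a damped overlapping additive Schwarz iteration to a 1D Poisson-type model problem (a toy stand-in for a discretized Darcy pressure equation). All names, the damping value, and the subdomain layout are illustrative assumptions, not the paper's solver.

```python
# Toy illustration (not the paper's solver): damped overlapping additive
# Schwarz for the 1D model problem A x = f with A = tridiag(-1, 2, -1),
# a simplified stand-in for a discretized Darcy pressure equation.

def apply_A(x):
    """Matrix-vector product with tridiag(-1, 2, -1)."""
    n = len(x)
    y = [2.0 * x[i] for i in range(n)]
    for i in range(n):
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y

def tridiag_solve(rhs):
    """Direct (Thomas) solve of the local tridiag(-1, 2, -1) system."""
    n = len(rhs)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0], d[0] = -0.5, rhs[0] / 2.0
    for i in range(1, n):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def schwarz_solve(f, subdomains, iters=300, damping=0.5):
    """Each sweep: global residual, independent local direct solves on the
    overlapping subdomains (parallel in a real implementation), then a
    damped additive update of the global iterate."""
    x = [0.0] * len(f)
    for _ in range(iters):
        Ax = apply_A(x)
        r = [f[i] - Ax[i] for i in range(len(f))]
        for lo, hi in subdomains:
            e = tridiag_solve(r[lo:hi])
            for i in range(lo, hi):
                x[i] += damping * e[i - lo]
    return x
```

Two subdomains with a few points of overlap suffice for this toy problem; the convergence rate improves with larger overlap, which is exactly the kind of trade-off a production domain decomposition solver tunes.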

Download PDF

Authors: A. Sgattonia, L. Fedelia,b, S. Sinigardia,c, A. Marocchinod, A. Macchia,b,*, V. Weinberge, A. Karmakare

a CNR/INO Via Moruzzi 1, 56124 Pisa, Italy

b University of Pisa, Largo Pontecorvo 3, 56127 Pisa Italy

c University of Bologna, Dipartimento di Fisica e Astronomia, Via Irnerio 46, 40126, Bologna, Italy

d “La Sapienza” – Università di Roma, Italy

e Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities, 85748 Garching b. München, Germany

We present a detailed strong and weak scaling analysis of PICCANTE, an open-source, massively parallel, fully relativistic Particle-In-Cell (PIC) code. PIC codes are widely used in plasma physics and astrophysics to study cases where kinetic effects are relevant. PICCANTE is primarily developed to study laser-plasma interaction.
Within a PRACE Preparatory Access Project, various revisions of different routines of the code have been analysed on the HPC systems JUQUEEN at Jülich Supercomputing Centre (JSC), Germany, and FERMI at CINECA, Italy, to improve the scalability and I/O performance of the application. The diagnostic tool Scalasca is used to identify suboptimal routines. Different output strategies are discussed. The detailed strong and weak scaling behaviour of the improved code is presented in comparison with the original version of the code.
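The abstract does not reproduce PICCANTE's particle push; as an illustrative stand-in for the kind of kernel that dominates PIC runtime, the sketch below implements the standard Boris rotation used by most PIC codes, in its non-relativistic, E = 0 form. Its key property, exact conservation of particle speed under the magnetic rotation, is easy to check. All names and parameters here are assumptions, not PICCANTE source.

```python
# Illustrative sketch (not PICCANTE source): non-relativistic Boris velocity
# rotation for a charged particle in a magnetic field with E = 0. The update
# is an exact rotation, hence energy-conserving, which is why PIC codes
# favour this scheme.

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def boris_push(v, B, q_over_m, dt):
    """One velocity update v^{n-1/2} -> v^{n+1/2} (E = 0 case)."""
    t = tuple(q_over_m * dt / 2.0 * Bi for Bi in B)      # half-angle vector
    t2 = sum(ti * ti for ti in t)
    s = tuple(2.0 * ti / (1.0 + t2) for ti in t)         # rotation vector
    v_prime = tuple(v[i] + cross(v, t)[i] for i in range(3))
    return tuple(v[i] + cross(v_prime, s)[i] for i in range(3))
```

In a full PIC loop this rotation sits between two half-step electric-field kicks; the relativistic variant applies the same rotation to the momentum divided by the Lorentz factor.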

Download PDF

Authors: Luca Marradia,b, Dimitris Dellisc, Kirsten Martensa,b,*
a Université Grenoble Alpes, LIPHY, F-38000 Grenoble, France
b CNRS, LIPHY, F-38000 Grenoble, France
c Greek Research & Technology Network, Mesogeion 56, 11527, Athens, Greece

In this paper we describe the development of a code that implements coarse-grained dynamics for the large-scale modelling of three-dimensional athermal yielding and flow of disordered systems under externally applied steady shear. The stochastic lattice model for the heterogeneous flow response involves long-range elastic interactions, which are resolved using fast Fourier techniques implemented in parallel in an efficient, well-scaling MPI algorithm.
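The key computational step mentioned above, evaluating long-range elastic interactions with fast Fourier techniques, amounts to a circular convolution of the local plastic activity with an elastic propagator. The 1D sketch below is illustrative only (the paper's model is 3D and MPI-parallel, and the kernel shown is a made-up stand-in for the Eshelby-type propagator); it verifies the convolution-theorem shortcut against the direct O(n²) sum.

```python
import cmath

# Illustrative 1D sketch: a long-range kernel applied via the convolution
# theorem, y = IFFT(FFT(x) * FFT(kernel)), instead of a direct O(n^2) sum.

def fft(a, inverse=False):
    """Radix-2 Cooley-Tukey FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    sign = 1.0 if inverse else -1.0
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def circular_convolve(x, kernel):
    """y[i] = sum_j kernel[(i - j) mod n] * x[j], computed in O(n log n)."""
    n = len(x)
    prod = [xi * ki for xi, ki in zip(fft(x), fft(kernel))]
    return [v.real / n for v in fft(prod, inverse=True)]  # un-normalised IFFT
```

In the paper's setting the transform is distributed across MPI ranks, but the algebra is identical: the long-range sum becomes a pointwise product in Fourier space.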

Download PDF

Authors: S. Lanteria,*, R. Légera, C. Scheid and J. Viquerata, T. Cabelb and G. Hautreuxb
a INRIA Sophia Antipolis – Méditerranée research center, France
b CINES, Montpellier, France

This paper is concerned with the development of a scalable high order finite element type solver for the numerical modeling of light interaction with nanometer scale structures. From the mathematical modeling point of view, one has to deal with the differential system of Maxwell equations in the time domain, coupled to an appropriate differential model of the behavior of the underlying material (which can be a dielectric and/or a metal) at optical frequencies. For the numerical solution of the resulting system of differential equations, we have designed a high order DGTD (Discontinuous Galerkin Time-Domain) solver that has been adapted to hybrid MIMD/SIMD computing. Here we discuss this latter aspect and report on preliminary performance results obtained on the Curie system of the PRACE research infrastructure.

Download PDF

Authors: Nenad Vukmirovića,*, Petar Jovanovića, Marko Mladenovića
a Scientific Computing Laboratory, Institute of Physics Belgrade, University of Belgrade, Pregrevica 118, 11080 Belgrade, Serbia

The overlapping methods code for electronic structure calculations of large organic systems was benchmarked and profiled. Based on the profiling results, several performance and scalability bottlenecks caused by communication were identified. The code was then refactored to improve its performance with respect to these identified weak points. The modified code extends the maximal system size for which good scaling is observed on the Curie machine from 4,000 atoms to 16,000 atoms. On the other hand, we found that on the Hermit machine the original code already exhibits rather good scaling, and consequently the modified code leads to only minor improvements.

Download PDF

Authors: Tomas Kozubeka, David Horaka, Vaclav Haplaa, Lubomir Rihaa
a IT4Innovations, VSB-Technical University of Ostrava (VSB-TUO)

Abstract: This report summarizes the results of scalability improvements to the algorithms used in Total FETI (TFETI). A performance evaluation of two new techniques is presented: (1) a novel pipelined implementation of the CG method in PETSc, and (2) a MAGMA LU solver running on the following many-core accelerators: the Nvidia Tesla K20m GPU and the Intel Xeon Phi 5110P MIC.
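The pipelined CG reformulation itself is in the referenced report; as context, the sketch below is plain textbook CG with comments marking the two dot products that become blocking global reductions in a distributed run. Hiding those reductions behind the matrix-vector product is precisely what the pipelined variant achieves. The demo operator and all names are illustrative, not the PETSc code.

```python
# Textbook CG (illustrative, not the PETSc implementation). The two dot
# products per iteration each require a blocking MPI_Allreduce at scale; the
# pipelined variant rearranges the recurrences so these reductions can be
# overlapped with the (communication-heavy) matrix-vector product.

def poisson1d(v):
    """Stand-in operator: matvec with tridiag(-1, 2, -1)."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0) for i in range(n)]

def cg(matvec, b, tol=1e-12, max_iter=500):
    n = len(b)
    x = [0.0] * n
    r = list(b)                        # r0 = b - A*0
    p = list(r)
    rr = sum(ri * ri for ri in r)      # dot product -> global reduction
    for _ in range(max_iter):
        Ap = matvec(p)
        pAp = sum(p[i] * Ap[i] for i in range(n))  # dot -> global reduction
        alpha = rr / pAp
        for i in range(n):
            x[i] += alpha * p[i]
            r[i] -= alpha * Ap[i]
        rr_new = sum(ri * ri for ri in r)          # dot -> global reduction
        if rr_new ** 0.5 < tol:
            break
        beta = rr_new / rr
        rr = rr_new
        p = [r[i] + beta * p[i] for i in range(n)]
    return x
```

The trade-off is that the pipelined recurrences are slightly less numerically robust than the classical ones, which is why such variants are typically offered as an option rather than a default.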

Download paper: PDF

Authors: Ahmet Durana,b,* and Mehmet Tuncela,c
a Istanbul Technical University, National Center for High Performance Computing of Turkey (UHeM), Istanbul 34469, Turkey
b Istanbul Technical University, Department of Mathematics, Istanbul 34469, Turkey
c Istanbul Technical University, Informatics Institute, Istanbul 34469, Turkey

Abstract: In this project, we propose a new hybrid algorithm for parameter optimization and implement it using MPI. In particular, we study a scalable parallel nonlinear parameter optimization algorithm with parameter pools for a nonlinear dynamical system, the asset flow differential equations (AFDEs), in ℝ4. We generate time series pairs as proxies for market price and net asset value using random walk simulation, where the volatilities of the time series are similar to those of real closed-end funds traded on the New York Stock Exchange (NYSE). When we apply the algorithm to a set of simulated time series, we observe that the computed optimal parameter values, the average number of quasi-Newton iterations, the average nonlinear least-squares errors, and the average maximum improvement factors converge, after oscillations, to certain values within correspondingly small ranges. Moreover, we tested on 64, 128, 256 and 512 cores using 512 initial parameter vectors, and achieved speed-up up to 512 cores. The algorithm is also applicable to parameter optimization of related nonlinear dynamical systems of differential equations with thousands of parameters.
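The AFDEs and the hybrid algorithm's details are in the paper; the sketch below illustrates only the generic pattern of a pool of initial parameter vectors, each refined independently (here by damped Gauss-Newton, a stand-in for the quasi-Newton refinement), with the best least-squares fit kept. The exponential toy model and every name here are assumptions for illustration.

```python
import math

# Illustrative pattern only (toy model, not the AFDEs): each member of a pool
# of initial parameter vectors is refined by damped Gauss-Newton on a
# nonlinear least-squares fit, and the best final fit wins. In the paper's
# setting each pool member would be an independent MPI task.

def residuals(params, ts, ys):
    a, b = params
    return [a * math.exp(b * t) - y for t, y in zip(ts, ys)]

def sse(params, ts, ys):
    return sum(r * r for r in residuals(params, ts, ys))

def gauss_newton(params, ts, ys, iters=50):
    a, b = params
    for _ in range(iters):
        r = residuals((a, b), ts, ys)
        # Jacobian of r_j = a*exp(b*t_j) - y_j with respect to (a, b)
        J = [(math.exp(b * t), a * t * math.exp(b * t)) for t in ts]
        # normal equations (2x2): (J^T J) delta = -J^T r
        g00 = sum(j[0] * j[0] for j in J)
        g01 = sum(j[0] * j[1] for j in J)
        g11 = sum(j[1] * j[1] for j in J)
        c0 = -sum(j[0] * ri for j, ri in zip(J, r))
        c1 = -sum(j[1] * ri for j, ri in zip(J, r))
        det = g00 * g11 - g01 * g01
        if abs(det) < 1e-30:
            break
        da = (c0 * g11 - c1 * g01) / det
        db = (g00 * c1 - g01 * c0) / det
        step = 1.0                      # backtracking damping for robustness
        while step > 1e-8 and sse((a + step * da, b + step * db), ts, ys) > sse((a, b), ts, ys):
            step *= 0.5
        a, b = a + step * da, b + step * db
    return (a, b)

def pool_optimize(pool, ts, ys):
    fits = [gauss_newton(p, ts, ys) for p in pool]
    return min(fits, key=lambda p: sse(p, ts, ys))
```

Because pool members never communicate during refinement, the pattern scales trivially with the number of cores; the interesting parallel-algorithm questions (load balance, pool refresh) are exactly what the paper studies.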

Download paper: PDF

Authors: Peicho Petkov, Damyan Grancharov, Stoyan Markov, Georgi Georgiev, Elena Lilkova, Nevena Ilieva, Leandar Litov
National Centre for Supercomputing Application, Bulgaria

Abstract: We have optimized an implementation of a massively parallel algorithm for solving the Poisson equation using a 27-point stencil discretization scheme based on the stabilized biconjugate gradient method. The code is written in the C programming language with OpenMP parallelization. The main objective of this whitepaper is the optimization of the code for native Intel Xeon Phi execution, where we observe nearly linear scalability on the MIC architecture for the larger problem sizes.
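The paper's solver is a C/OpenMP BiCGStab on a 27-point 3D stencil; the sketch below is a language-neutral illustration of the same stabilized biconjugate gradient recurrence, here unpreconditioned and applied to a 1D 3-point stencil for brevity (moving to the 3D 27-point operator only changes the matvec). All names are illustrative.

```python
# Illustrative BiCGStab (van der Vorst's recurrence) on a 1D Poisson stencil;
# the paper's code applies the same recurrence to a 27-point 3D stencil, with
# the matvec, dot products, and vector updates parallelized via OpenMP.

def stencil_matvec(v):
    """1D 3-point stencil, tridiag(-1, 2, -1); 3D analogue: 27-point stencil."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0) for i in range(n)]

def bicgstab(matvec, b, tol=1e-10, max_iter=500):
    n = len(b)
    x = [0.0] * n
    r = list(b)                 # r0 = b - A*0
    r_hat = list(r)             # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    for _ in range(max_iter):
        rho_new = sum(r_hat[i] * r[i] for i in range(n))
        beta = (rho_new / rho) * (alpha / omega)
        p = [r[i] + beta * (p[i] - omega * v[i]) for i in range(n)]
        v = matvec(p)
        alpha = rho_new / sum(r_hat[i] * v[i] for i in range(n))
        s = [r[i] - alpha * v[i] for i in range(n)]
        t = matvec(s)
        tt = sum(ti * ti for ti in t)
        omega = sum(t[i] * s[i] for i in range(n)) / tt if tt else 0.0
        x = [x[i] + alpha * p[i] + omega * s[i] for i in range(n)]
        r = [s[i] - omega * t[i] for i in range(n)]
        rho = rho_new
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
    return x
```

On the MIC architecture the per-element loops above are the natural OpenMP targets, and the dot products become reductions; that structure is what the native Xeon Phi optimization works on.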

Download paper: PDF

Authors: Seren Soner, Can Ozturan
Computer Engineering Department, Bogazici University, Istanbul, Turkey

Abstract: The problem of computing the maximum flow on capacitated networks arises in many application areas. In heterogeneous computing, it arises in job or process scheduling when allocations of resources to jobs or processes need to be tuned. The maximum flow solver is difficult to parallelize; highly optimized sequential maximum flow solvers, such as those by Goldberg, exist. This work describes how some of the concurrency problems are resolved in our existing Pmaxflow solver. Pmaxflow employs a parallel pre-flow push algorithm to compute the maximum flow. Results of various tests that compare Goldberg's sequential solvers and Pmaxflow on a NUMA shared-memory computer are presented.
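As a compact reference point for the algorithmic family Pmaxflow parallelizes, the sketch below is the generic sequential pre-flow push (push-relabel) algorithm on a dense capacity matrix. It is illustrative only, not Pmaxflow code; the example graph and all names are made up, and antiparallel edge pairs are assumed absent for simplicity.

```python
# Generic sequential pre-flow push (push-relabel) on a dense capacity matrix.
# Illustrative only; assumes no pair of nodes has edges in both directions.

def max_flow(capacity, s, t):
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    height = [0] * n
    height[s] = n                       # source starts at height n
    excess = [0] * n
    for v in range(n):                  # saturate all edges out of the source
        if capacity[s][v]:
            flow[s][v] = capacity[s][v]
            flow[v][s] = -capacity[s][v]
            excess[v] += capacity[s][v]
    active = [v for v in range(n) if v not in (s, t) and excess[v] > 0]
    while active:
        u = active.pop()
        for v in range(n):              # push along admissible residual edges
            if excess[u] == 0:
                break
            residual = capacity[u][v] - flow[u][v]
            if residual > 0 and height[u] == height[v] + 1:
                d = min(excess[u], residual)
                flow[u][v] += d
                flow[v][u] -= d
                excess[u] -= d
                excess[v] += d
                if v not in (s, t) and v not in active:
                    active.append(v)
        if excess[u] > 0:               # no admissible edge left: relabel
            height[u] = 1 + min(height[v] for v in range(n)
                                if capacity[u][v] - flow[u][v] > 0)
            active.append(u)
    return sum(flow[s])                 # net flow out of the source
```

The concurrency difficulty the paper addresses is visible here: pushes and relabels touch shared excess and height data, so parallel workers processing different active vertices must coordinate those updates.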

Download paper: PDF

Authors: Chandan Basu, Johan Raber
National Supercomputer Center, Linköping University, SE-581 83 Linköping

Abstract: A network-topology-aware MPI all-to-all algorithm, developed in an earlier PRACE project, is improved. The interface of the all-to-all routine is made similar to the MPI all-to-all routine, and the routine is being tested for accuracy.

A new data classifier routine is developed to collect network characteristics data. The classifier works very well, and in a scalable way, when categories are well separated in feature space. A new data collection scheme is developed. The scalability of the data acquisition has been partly addressed, but there are remaining issues to be solved.
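The improved routine mirrors MPI_Alltoall semantics; the sketch below simulates, in plain Python with no MPI, the shift schedule behind a classic all-to-all exchange, where in round k every rank sends its block for the rank k positions away. This is illustrative only (names and the schedule choice are assumptions); a topology-aware variant keeps the same semantics but reorders the rounds to match the network.

```python
# Plain-Python simulation (no MPI) of the shift schedule behind a classic
# all-to-all exchange: in round k, rank r sends its block for rank (r+k) % p.
# A topology-aware variant preserves these semantics but reorders rounds so
# each one maps onto disjoint, nearby network links.

def shift_alltoall(send):
    """send[r][d] = block rank r holds for rank d; returns recv with
    recv[d][r] = block that rank d received from rank r (MPI_Alltoall
    semantics)."""
    p = len(send)
    recv = [[None] * p for _ in range(p)]
    for k in range(p):                  # p communication rounds
        for r in range(p):              # all ranks act concurrently in MPI
            dst = (r + k) % p
            recv[dst][r] = send[r][dst]
    return recv
```

Any reordering of the (k, r) pairs yields the same final recv arrays, which is exactly the freedom a topology-aware schedule exploits.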

Download paper: PDF

Authors: Nick Brown, J. Mark Bull, Iain Bethune
EPCC, The University of Edinburgh, King’s Buildings, Mayfield Road, Edinburgh, EH9 3JZ, U.K.

Abstract: Iterative methods for solving large sparse systems of linear equations are widely used in many HPC applications. Extreme scaling of these methods can be difficult, however, since global communication to form dot products is typically required at every iteration.

To try to overcome this limitation we propose a hybrid approach, where the matrix is partitioned into blocks. Within each block, we use a highly optimised (parallel) conventional solver, but we then couple the blocks together using block Jacobi or some other multisplitting technique that can be implemented in either a synchronous or an asynchronous fashion. This allows us to keep the block size below the point where the conventional iterative methods no longer scale, and to avoid global communication (and possibly synchronisation) across all processes.

Our block framework has been built to use PETSc, a popular scientific suite for solving sparse linear systems, as the synchronous intra-block solver, and we demonstrate results on up to 32768 cores of a Cray XE6 system. At this scale, the conventional solvers are still more efficient, though trends suggest that the hybrid approach may be beneficial at higher core counts.
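To make the coupling concrete, the sketch below runs the outer block-Jacobi iteration on a toy 1D Poisson matrix split into two diagonal blocks, with a small dense direct solve standing in for the highly optimised parallel solver used inside each block. Sizes, the block layout, and all names are illustrative assumptions, not the paper's framework.

```python
# Toy outer block-Jacobi loop (illustrative; in the paper the inner solve is a
# parallel PETSc solver per block, and the outer coupling can also run
# asynchronously). Update rule per block k:
#   A_kk x_k^{new} = b_k - sum_{j != k} A_kj x_j^{old}

def dense_solve(A, b):
    """Gaussian elimination with partial pivoting (stand-in inner solver)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def block_jacobi(A, b, blocks, iters=200):
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x_new = x[:]
        for lo, hi in blocks:           # blocks are independent -> parallel
            rhs = [b[i] - sum(A[i][j] * x[j] for j in range(n)
                              if not lo <= j < hi) for i in range(lo, hi)]
            sub = [row[lo:hi] for row in A[lo:hi]]
            x_new[lo:hi] = dense_solve(sub, rhs)
        x = x_new
    return x
```

Only the off-block couplings (the rhs terms above) cross block boundaries, which is why the scheme needs no global reductions; in the asynchronous variant, each block simply uses whichever neighbouring iterate has most recently arrived.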

Download paper: PDF

Authors: Ahmet Durana,b, M. Serdar Celebia,c, Mehmet Tuncela,c, Figen Oztopraka,c
a Istanbul Technical University, National Center for High Performance Computing of Turkey (UHeM), Istanbul 34469, Turkey
b Istanbul Technical University, Department of Mathematics, Istanbul 34469, Turkey
c Istanbul Technical University, Informatics Institute, Istanbul 34469, Turkey

Abstract: Structural analysis of large sparse matrices is important for obtaining scalable direct solvers. In this paper, we focus on spectral analysis of large sparse matrices. We believe that handling exceptional, challenging matrices via Gerschgorin circles and tuned parameters is a beneficial and practical way to stabilize the performance of sparse direct solvers. Nearly defective matrices are among the matrices that challenge solver performance. We observe that the choice of super-nodal storage parameters affects the number of fill-ins and, accordingly, the memory usage.
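As a minimal illustration of the Gerschgorin-circle screening mentioned above (the paper's actual exception-handling heuristics are not reproduced here), the sketch below computes the discs and a cheap, conservative "zero might be an eigenvalue" flag of the kind that could trigger fallback parameters in a direct solver. All names are illustrative.

```python
# Illustrative Gerschgorin screening (not the paper's code): every eigenvalue
# of A lies in some disc centred at A[i][i] with radius equal to the sum of
# |off-diagonal| entries in row i. If no disc contains 0, A is provably
# nonsingular; if some disc does, a solver might switch to more conservative
# pivoting or storage parameters.

def gerschgorin_discs(A):
    """Return a (center, radius) pair per row of the dense matrix A."""
    return [(row[i], sum(abs(v) for j, v in enumerate(row) if j != i))
            for i, row in enumerate(A)]

def possibly_singular(A):
    """Conservative flag: True whenever some Gerschgorin disc contains 0."""
    return any(abs(center) <= radius for center, radius in gerschgorin_discs(A))
```

The flag errs on the safe side: a matrix it marks may still be well-conditioned, but a matrix it clears is guaranteed nonsingular, which makes the test cheap insurance before an expensive factorization.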

Download paper: PDF


These whitepapers have been prepared by the PRACE Implementation Phase Projects and in accordance with the Consortium Agreements and Grant Agreements n° RI-261557, n° RI-283493, or n° RI-312763.

They solely reflect the opinion of the parties to such agreements on a collective basis in the context of the PRACE Implementation Phase Projects and to the extent foreseen in such agreements. Please note that even though all participants in the PRACE IP Projects are members of PRACE AISBL, these whitepapers have not been approved by the Council of PRACE AISBL and therefore do not emanate from it, nor should they be considered to reflect PRACE AISBL’s individual opinion.

Copyright notices

© 2014 PRACE Consortium Partners. All rights reserved. This document is a project document of a PRACE Implementation Phase project. All contents are reserved by default and may not be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contracts RI-261557, RI-283493, or RI-312763 for reviewing and dissemination purposes.

All trademarks and other rights on third party products mentioned in the document are acknowledged as own by the respective holders.
