Scalable algorithms

Authors: Tomas Kozubeka, David Horaka, Vaclav Haplaa, Lubomir Rihaa
aIT4Innovations, VSB-Technical University of Ostrava (VSB-TUO)

Abstract: This report summarizes the results of the scalability improvements of the algorithms used in Total FETI TFETI). A performance evaluation of two new techniques is presented in this report: (1) a novel pipelined implementation of CG method in PETSc and (2) a MAGMA LU solver running on following many-cores accelerators: GPU Nvidia Tesla K20m and Intel MIC Xeon Phi 5110P.

Download paper: PDF

Authors: Ahmet Duranab* and Mehmet Tuncela,c
aIstanbul Technical University, National Center for High Performance Computing of Turkey (UHeM), Istanbul 34469, Turkey
bIstanbul Technical University,Department of Mathematics, Istanbul 34469, Turkey
cIstanbul Technical University, Informatics Institute, Istanbul 34469, Turkey

Abstract: In this project, we propose a new hybrid algorithm for parameter optimization and implement it using MPI. In particular, we study a scalable parallel nonlinear parameter optimization algorithm with parameter pools for a nonlinear dynamical system called the asset flow differential equations (AFDEs) in ?4. We generate time series pairs as proxy to market price and net asset value by using random walk simulation where the volatilities of the time series are similar to that of real closed-end funds traded on New York Stock Exchange (NYSE). When we apply the algorithm by using simulations for a set of time series, we observe that the computed optimal parameter values, average number of quasi-Newton iterations, the average nonlinear least squares errors, and the average maximum improvement factors can converge certain values within corresponding small ranges, after oscillations. Moreover, we tested for 64, 128, 256 and 512 cores using the 512 initial parameter vectors. We achieved speed-up for the time series to run up to 512 cores. The algorithm is applicable for parameter optimization of the related nonlinear dynamical system of differential equations with thousands of parameters as well.

Download paper: PDF

Authors: Peicho Petkov, Damyan Grancharov, Stoyan Markov, Georgi Georgiev, Elena Lilkova, Nevena Ilieva, Leandar Litov
National Centre for Supercomputing Application, Bulgaria

Abstract: We have optimized an implementation of a massively parallel algorithm for solving the Poisson equation using a 27-stencil discretization scheme based on the stabilized biconjugate gradient method. The code is written in the C programming language with OpenMP parallelization. The main objective of this whitepaper lies in the optimization of the code for native Intel Xeon Phi execution, where we observe nearly linear scalability on the MIC architecture for the bigger problem sizes.

Download paper: PDF

Authors: Seren Soner, Can Ozturan
Computer Engineering Department, Bogazici University, Istanbul, Turkey

Abstract: The problem of computing the maximum flow problem on capacitated networks arises in many application areas. In the area of heterogeneous computing, it arises in job or process scheduling when allocations of resources to jobs/processes need to be tuned. The maximum flow solver is difficult to parallelize. Highly optimized sequential version of maximum flow solvers such as those by Goldberg exists. This work describes how some of the concurrency problems are resolved in our existing Pmaxflow ( Pmaxflow employs a parallel pre-flow push algorithm to compute the maximum flow. Results of various tests that compare Goldberg’s sequential solvers and Pmaxflow on a NUMA shared memory computer are presented.

Download paper: PDF

Authors: Chandan Basu, Johan Raber
National Supercomputer Center, Linköping University, SE-581 83 Linköping

Abstract: A network topology aware MPI all-to-all algorithm which was developed in earlier PRACE project is improved. The interface of the all-to-allroutine is made similar to MPI all-to-all routine. The routine is being tested for accuracy.

A new data classifier routine is developed to collect network characteristics data. The classifier works very well and in a scalable way whencategories are as well separated in feature space. A new data collection scheme is developed. The scalability of the data acquisition has beenpartly addressed but there are remaining issues to be solved.

Download paper: PDF

Authors: Nick Brown, J. Mark Bull, Iain Bethune
EPCC, The University of Edinburgh, King’s Buildings, Mayfield Road, Edinburgh, EH9 3JZ, U.K

Abstract: Iterative methods for solving large sparse systems of linear equations are widely used in many HPC applications. Extreme scaling of these methods can be difficult, however, since global communication to form dot products is typically required at every iteration.

To try to overcome this limitation we propose a hybrid approach, where the matrix is partitioned into blocks. Within each block, we use a highly optimised (parallel) conventional solver, but we then couple the blocks together using block Jacobi or some other multisplitting technique that can be implemented in either a synchronous or an asynchronous fashion. This allows us to limit the block size to the point where the conventional iterative methods no longer scale, and to avoid global communication (and possibly synchronisation) across all processes.

Our block framework has been built to use PETSc, a popular scientific suite for solving sparse linear systems, as the synchronous intra-block solver, and we demonstrate results on up to 32768 cores of a Cray XE6 system. At this scale, the conventional solvers are still more efficient, though trends suggest that the hybrid approach may be beneficial at higher core counts.

Download paper: PDF

Authors: Ahmet Durana,b, M. Serdar Celebia,c, Mehmet Tuncela,c, Figen Oztopraka,c
a Istanbul Technical University, National Center for High Performance Computing of Turkey (UHeM), Istanbul 34469, Turkey
bIstanbul Technical University,Department of Mathematics, Istanbul 34469, Turkey
cIstanbul Technical University, Informatics Institute, Istanbul 34469, Turkey

Abstract: It is significant to perform structural analysis of large sparse matrices in order to obtain scalable direct solvers. In this paper, we focus on spectral analysis of large sparse matrices. We believe that the approach for exception handling of challenging matrices via Gerschgorin circles and using tuned parameters is beneficial and practical to stabilize the performance of sparse direct solvers. Nearly defective matrices are among challenging matrices for the performance of solver. We observe that the usage of super-nodal storage parameters affects the number of fill-ins and memory usage accordingly.

Download paper: PDF


These whitepapers have been prepared by the PRACE Implementation Phase Projects and in accordance with the Consortium Agreements and Grant Agreements n° RI-261557, n°RI-283493, or n°RI-312763.

They solely reflect the opinion of the parties to such agreements on a collective basis in the context of the PRACE Implementation Phase Projects and to the extent foreseen in such agreements. Please note that even though all participants to the PRACE IP Projects are members of PRACE AISBL, these whitepapers have not been approved by the Council of PRACE AISBL and therefore do not emanate from it nor should be considered to reflect PRACE AISBL’s individual opinion.

Copyright notices

© 2014 PRACE Consortium Partners. All rights reserved. This document is a project document of a PRACE Implementation Phase project. All contents are reserved by default and may no t be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contracts RI-261557, RI-283493, or RI-312763 for reviewing and dissemination purposes.

All trademarks and other rights on third party products mentioned in the document are acknowledged as own by the respective holders.

Share: Share on LinkedInTweet about this on TwitterShare on FacebookShare on Google+Email this to someone