Performance Assessment of Pipelined Conjugate Gradient method in Alya

Exascale computing is currently one of the most active topics in High Performance Computing (HPC). Although the hardware is not yet available, the software community is developing and updating codes that can use exascale architectures efficiently once they become available. Alya is one of the codes being developed towards exascale computing. It is part of the simulation packages of the Unified European Applications Benchmark Suite (UEABS) and the Accelerators Benchmark Suite of PRACE, and thus complies with the highest standards in HPC. Even though Alya has proven its scalability on up to hundreds of thousands of CPU cores, some expensive routines could affect its performance on exascale architectures. One of these routines is the conjugate gradient (CG) algorithm. CG is relevant because it is called at each time step to solve a linear system of equations. The bottleneck in CG is the large number of collective communication calls. In particular, the preconditioned CG (PCG) already implemented in Alya utilises two collective communications per iteration. In the present work, we developed and implemented a pipelined version of the PCG (PPCG) algorithm, which allows us to halve the number of collectives. We then took advantage of non-blocking MPI communications to further reduce the waiting time during message exchange. The resulting implementation was analysed in detail using the Extrae/Paraver profiling tools. The PPCG implementation was tested by studying the flow around a 3D sphere. Several tests were performed with different numbers of processes and workloads to assess the strong and weak scaling of the implemented algorithms. This work was carried out in the context of the PRACE preparatory access program, and the simulations were run on the MareNostrum 4 (MN4) supercomputer at the Barcelona Supercomputing Center (BSC).
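
To illustrate how a pipelined PCG can get by with a single collective per iteration, the following is a minimal serial sketch of one commonly cited pipelined formulation (due to Ghysels and Vanroose). It is not taken from Alya and may differ from the PPCG variant implemented in this work. The tridiagonal test matrix, the Jacobi preconditioner and all function names (`matvec`, `jacobi`, `fused_dots`, `pipelined_pcg`) are assumptions made for illustration only. The helper `fused_dots` stands in for the one global reduction: in an MPI code both dot products could be packed into a single non-blocking reduction (for example `MPI_Iallreduce`) that is completed only after the preconditioner and matrix-vector product have been applied, which is where the overlap would come from.

```python
import math

def matvec(d, x):
    """y = A x for the SPD tridiagonal matrix A = tridiag(-1, d, -1) (assumed test operator)."""
    n = len(x)
    y = [d[i] * x[i] for i in range(n)]
    for i in range(n - 1):
        y[i] -= x[i + 1]
        y[i + 1] -= x[i]
    return y

def jacobi(d, r):
    """Apply the Jacobi (diagonal) preconditioner M^{-1} r (assumed preconditioner)."""
    return [ri / di for ri, di in zip(r, d)]

def fused_dots(r, u, w, counter):
    """Stand-in for ONE collective: (r,u) and (w,u) would share a single reduction."""
    counter[0] += 1
    gamma = sum(a * b for a, b in zip(r, u))
    delta = sum(a * b for a, b in zip(w, u))
    return gamma, delta

def axpy(a, x, y):
    """Return a*x + y as a new list."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def pipelined_pcg(d, b, tol=1e-10, max_iter=200):
    """Solve A x = b; returns (x, iterations, number of fused reductions)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)              # r = b - A x with x = 0
    u = jacobi(d, r)         # u = M^{-1} r
    w = matvec(d, u)         # w = A u
    z = q = s = p = [0.0] * n
    gamma_old = alpha_old = 1.0
    counter = [0]
    for it in range(max_iter):
        # The only global reduction of the iteration. With MPI it could be
        # started here and waited on after the two local operations below.
        gamma, delta = fused_dots(r, u, w, counter)
        if math.sqrt(abs(gamma)) < tol:
            return x, it, counter[0]
        m = jacobi(d, w)     # local work that can overlap the reduction
        nv = matvec(d, m)    # local work (plus neighbour exchange in parallel)
        if it == 0:
            beta, alpha = 0.0, gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha_old)
        # Recurrences replace the extra products of classical PCG.
        z = axpy(beta, z, nv)
        q = axpy(beta, q, m)
        s = axpy(beta, s, w)
        p = axpy(beta, p, u)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, s, r)
        u = axpy(-alpha, q, u)
        w = axpy(-alpha, z, w)
        gamma_old, alpha_old = gamma, alpha
    return x, max_iter, counter[0]

# Small usage example on a 50-unknown, diagonally dominant system.
diag = [3.0 + 0.1 * i for i in range(50)]
rhs = [1.0] * 50
sol, iters, reductions = pipelined_pcg(diag, rhs)
print("iterations:", iters, "fused reductions:", reductions)
```

In this sketch each iteration issues exactly one fused reduction, plus one final reduction for the convergence check. The price is four extra vector recurrences per iteration and, in finite precision, a somewhat less accurate residual than classical PCG.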

  • WP287