Authors: Marcin Krotkiewskia
a University of Oslo/SIGMA2
Abstract: Processing of large numbers (hundreds of thousands) of small files (i.e., up to a few KB) is notoriously problematic for all modern parallel file systems. While modern storage solutions provide high and scalable bandwidth through parallel storage servers connected with a high-speed network, accessing small files is sequential and latency-bound. Paradoxically, performance of file access is worse than if the files were stored on a local hard drive. We present a generic solution for large-scale HPC facilities that improves the performance of workflows dealing with large numbers of small files. The files are saved inside a single large file containing a disk image, similarly to an archive. When needed, the image is mounted through the Unix loop-back device, and the contents of the image are available to the user in the form of a usual directory tree. Since mounting of disks under Unix often requires super-user privileges, security concerns and possible ways to address them are considered. A complete Python implementation of a framework for image creation, mounting, and unmounting is presented. A seamless integration into HPC environments managed by SLURM is discussed using the examples of read-only software modules created by administrators and user-created disk images with read-only application input data. Finally, results of performance benchmarks carried out on the Abel supercomputer facility in Oslo, Norway, are shown.
Download paper: PDF
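The core idea of the abstract above, many small files stored inside one large container file, can be illustrated with a minimal Python sketch. Note the substitution: the paper uses a disk image mounted through the loop-back device, which typically needs super-user privileges, so this sketch uses an in-memory tar archive instead. The function names and the file names are hypothetical and are not taken from the paper's implementation.

```python
import io
import tarfile


def pack_small_files(files):
    """Pack a {name: bytes} mapping into one in-memory tar container."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def list_members(container):
    """Return the names of all files stored in the container."""
    with tarfile.open(fileobj=io.BytesIO(container), mode="r") as archive:
        return archive.getnames()


def read_member(container, name):
    """Read one small file back out of the single container."""
    with tarfile.open(fileobj=io.BytesIO(container), mode="r") as archive:
        return archive.extractfile(name).read()


if __name__ == "__main__":
    # Hypothetical input: 100 tiny files that would otherwise be 100
    # separate objects on the parallel file system.
    small_files = {f"data/{i:04d}.txt": f"sample {i}\n".encode() for i in range(100)}
    container = pack_small_files(small_files)
    print(len(list_members(container)), "files in one container of", len(container), "bytes")
    print(read_member(container, "data/0042.txt"))
```

Unlike a mounted disk image, a tar archive does not present its contents as an ordinary directory tree, so applications would have to be adapted to read from it. That limitation is what the loop-back approach described in the abstract avoids.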
Authors: F. Hernandeza
a IN2P3/CNRS computing center, Lyon, France
Abstract: In this white paper we present preliminary results of our work, which aims to evaluate the suitability of HTTP-based software tools for transporting scientific data over high-latency network links. We present a motivating use case and the tools used for this test, and provide some quantitative results and the perspectives of this work.
Download paper: PDF
Authors: Marcin Krotkiewski*a, Andreas Pantelib
a University of Oslo/SIGMA2
b The Cyprus Institute
Abstract: We present a portable and flexible framework for transferring large amounts of scientific data. Its main objective is to prepare the data for transfer (e.g., tar, compress, encrypt, compute and verify hash), while delegating the network transfer to established external utilities. The framework is based on the pipeline approach, i.e., the transferred data stream is passed through a sequence of pipeline elements (or ‘filters’), which process the data and forward it downstream. The computations are done on the fly, hence there is no need for extra storage to archive and otherwise process the data. Compute-intensive tasks (e.g., compression, hash computations, encryption) are executed in parallel with the network transfer, which decreases the overall execution time of the workflow. The framework is implemented in Python 3, is portable, and is easy to extend: individual pipeline elements are similar to Python’s file-like objects. We provide a general-purpose stager application driven by configuration files, and a Python API, which we use to implement example custom workflows.
Download paper: PDF
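The pipeline approach described in the abstract above, a chain of file-like filters that process data and forward it downstream, might look roughly like the following sketch. The class names (`HashFilter`, `CompressFilter`, `MemorySink`) and the function `run_pipeline` are hypothetical and do not reflect the framework's actual API. The sketch also runs sequentially, whereas the paper overlaps the computation with the network transfer.

```python
import hashlib
import io
import zlib


class MemorySink:
    """Last pipeline element: stands in for the external transfer utility."""

    def __init__(self):
        self.buffer = io.BytesIO()

    def write(self, data):
        return self.buffer.write(data)

    def close(self):
        pass


class CompressFilter:
    """Compress the stream on the fly and forward it downstream."""

    def __init__(self, downstream):
        self.downstream = downstream
        self._compressor = zlib.compressobj()

    def write(self, data):
        self.downstream.write(self._compressor.compress(data))
        return len(data)

    def close(self):
        # Flush whatever the compressor still buffers before closing.
        self.downstream.write(self._compressor.flush())
        self.downstream.close()


class HashFilter:
    """Update a SHA-256 digest and forward the data unchanged."""

    def __init__(self, downstream):
        self.downstream = downstream
        self.digest = hashlib.sha256()

    def write(self, data):
        self.digest.update(data)
        return self.downstream.write(data)

    def close(self):
        self.downstream.close()


def run_pipeline(chunks):
    """Push chunks through hash -> compress -> sink without temporary files."""
    sink = MemorySink()
    head = HashFilter(CompressFilter(sink))
    for chunk in chunks:
        head.write(chunk)
    head.close()
    return head.digest.hexdigest(), sink.buffer.getvalue()


if __name__ == "__main__":
    checksum, payload = run_pipeline([b"abc" * 1000, b"def" * 1000])
    print("sha256 of original stream:", checksum)
    print("compressed size:", len(payload), "bytes")
```

Because every element exposes only `write` and `close`, a new processing step can be added by wrapping it around an existing element, which is the extension property the abstract attributes to file-like pipeline elements.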
Authors: B. Lawrencea, C. Maynardb, A. Turnerc, X. Guoc, D. Sloan-Murphyc*
a University of Reading, United Kingdom
b Met Office, United Kingdom
c Edinburgh Parallel Computing Centre, University of Edinburgh, United Kingdom
Abstract: Solving the bottleneck of I/O is key in the move towards exascale computing. Research communities must be informed of the I/O performance of existing resources in order to make reasonable decisions for the future. This paper therefore presents benchmarks for the write capabilities of the ARCHER, COSMA, UK-RDF DAC, and JASMIN HPC systems, using MPI-IO and, in selected cases, the HDF5 and NetCDF parallel libraries.
We find a reasonable expectation is for approximately 50% of the theoretical system maximum bandwidth to be attainable in practice. Contention is shown to have a dramatic effect on performance. MPI-IO, HDF5 and NetCDF are found to scale similarly but the high-level libraries introduce a small amount of performance overhead.
For the Lustre file system, on a single shared file, maximum performance is found by maximising the stripe count and matching the individual stripe size to the magnitude of I/O operation performed. HDF5 is discovered to scale poorly on Lustre due to an unfavourable interaction with the H5Fclose() routine.
Authors: C. Basua*, A. De Nicolab, G. Milanob
a National Supercomputer Centre, Linköping University, Sweden
b University of Salerno, Department of Chemistry and Biology, Via Giovanni Paolo II, 132, 84084, Fisciano, Italy
Abstract: A new parallel I/O scheme is implemented in the hybrid particle-field MD simulation code called OCCAM. In the new implementation the numbers of input and output files are greatly reduced. Furthermore, the sizes of the input and output files are reduced, as the new files are in binary format rather than the ASCII format used in the original code. The I/O performance is improved by bulk data transfers, which replace the small and frequent data transfers of the original code. The results of tests on two different systems show 6-18% performance improvements.
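The two effects named in the abstract above, smaller files from binary storage and fewer, larger writes, can be seen in a small Python sketch. This is only an illustration with made-up data. It is unrelated to OCCAM's actual file formats, and the 17-digit text format is an assumption chosen so that the ASCII values round-trip exactly.

```python
import array
import io


def write_ascii(values, stream):
    """Write one value per line as text: many small writes, large output."""
    for value in values:
        stream.write(f"{value:.17e}\n".encode("ascii"))


def write_binary(values, stream):
    """Write all values as raw doubles in a single bulk transfer."""
    stream.write(array.array("d", values).tobytes())


def read_binary(raw):
    """Recover the list of doubles from the raw binary buffer."""
    values = array.array("d")
    values.frombytes(raw)
    return list(values)


if __name__ == "__main__":
    # Hypothetical data set: 10,000 coordinates.
    coordinates = [0.001 * i for i in range(10_000)]

    ascii_out, binary_out = io.BytesIO(), io.BytesIO()
    write_ascii(coordinates, ascii_out)
    write_binary(coordinates, binary_out)

    print("ASCII bytes: ", len(ascii_out.getvalue()))
    print("binary bytes:", len(binary_out.getvalue()))
```

Each double occupies 8 bytes in binary form, while the text form needs more than 20 characters per value at this precision, and the binary path issues one write call instead of one per value.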
Authors: F. Sottilea, C. Roedla, V. Slavnicb, P. Jovanovicb, D. Stankovicb, P. Kestenerc, F. Houssenc
a Laboratoire des Solides Irradiés, Ecole Polytechnique, CNRS, CEA, UMR 7642, 91128 Palaiseau cedex, France
b Scientific Computing Laboratory, Institute of Physics Belgrade, University of Belgrade, Pregrevica 118, 11080 Belgrade, Serbia
c Maison de la Simulation, USR 3441, bat. 565, CEA Saclay, 91191 Gif-sur-Yvette cedex, France
Abstract: The main goal of this PRACE project was to evaluate how GPUs could speed up the DP code – a linear response TDDFT code. Profiling analysis of the code has been done to identify computational bottlenecks to be delegated to the GPU. In order to speed up this code using GPUs, two different strategies have been developed: a local one and a global one. Both strategies have been implemented with cuBLAS and/or CUDA C. Results showed that one can reasonably expect about 10 times speedup on the total execution time, depending on the structure of the input and the size of datasets used, and speedups up to 16 have been observed for some cases.
Authors: Ernest Artiagaa, Toni Cortesa,b
a Barcelona Supercomputing Center, Barcelona, Spain
b Universitat Politècnica de Catalunya, Barcelona, Spain
Abstract: The purpose of this white paper is to document the measurements obtained in PRACE-2IP regarding file system metadata performance, and to assess mechanisms to further improve such performance. The final goal of the task is to identify the open issues related to file systems for multi-petascale and exascale facilities, and propose novel solutions that can be applied to Lustre, enabling it to manage a huge number of files on a system with many heterogeneous devices while efficiently delivering huge data bandwidth and low latency, minimizing the response time.
The tasks reported here included the observation, measurement and study of a large scale system currently in production, in order to identify the key metadata-related issues; the development of a prototype aimed at improving the metadata behaviour in such a system and also at providing a framework to easily deploy novel metadata management techniques on top of other systems; the measurement and study of specially deployed Lustre and GPFS prototypes to validate the presence of the metadata issues observed in current in-production systems; and finally the porting of the framework prototype to test novel metadata management techniques on both a production system using GPFS and the PRACE Lustre prototype facility at CEA.
Authors: A. Mignonea,∗, G. Muscianisib, M. Rivib, G. Bodoc
a Dipartimento di Fisica Generale, Università di Torino, via Pietro Giuria 1, 10125 Torino, Italy
b Consorzio Interuniversitario CINECA, via Magnanelli, 6/3, 40033 Casalecchio di Reno (Bologna), Italy
c INAF, Osservatorio Astronomico di Torino, Strada Osservatorio 20, Pino Torinese, Italy
Abstract: PLUTO is a modular and multi-purpose numerical code for astrophysical fluid dynamics targeting highly supersonic and magnetized flows. As astrophysical applications are becoming increasingly demanding in terms of grid resolution and I/O, efforts have been spent to overcome the main bottlenecks of the code mainly related to an obsolete and no longer maintained library providing parallel functionality. Successful achievements have been pursued in The Partnership for Advanced Computing in Europe First Implementation Phase Project (PRACE-1IP) and are described in the present white-paper.
Authors: Valentin Pavlov, Peicho Petkov
NCSA, Acad. G. Bonchev str., bl. 25A, Sofia 1000, Bulgaria
Abstract: In MPI multiprocessing environments, data I/O in the GROMACS molecular dynamics package is handled by the master node. All input data is read by the master node, then scattered to the computing nodes, and on each step gathered back in full and possibly written out. This method is fine for most of the architectures that use shared memory or where the amount of RAM on the master node can be extended (as in clusters and Cray machines), but introduces a bottleneck for distributed memory systems with hard memory limits like the IBM Blue Gene/P. The effect is that even though a Tier-0 Blue Gene/P machine has enough overall computing power and RAM, it cannot process molecular systems with more than 5,000,000 atoms because the master node simply does not have enough RAM to hold all necessary input data. In this paper we describe an approach that eliminates this bottleneck and allows such large systems to be processed on Tier-0 Blue Gene/Ps. The approach is based on using global memory structures which are distributed among all computing nodes. We utilize the Global Arrays Toolkit by PNNL to achieve this goal. We analyze which structures need to be changed, design an interface for virtual arrays and rewrite all routines which deal with data I/O of the corresponding structures. Our results indicate that the approach works and we present the simulation of a large bio-molecular system (lignocellulose) on an IBM Blue Gene/P machine.
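To make the idea of a globally addressed array distributed among all computing nodes more concrete, the following Python sketch computes which rank owns which part of such an array. It assumes a simple contiguous block distribution, which is an illustrative choice. The abstract does not state how the Global Arrays Toolkit distributes the GROMACS structures, and the function names are hypothetical.

```python
def block_bounds(n_items, n_ranks, rank):
    """Return the half-open global index range [start, stop) held by a rank."""
    base, extra = divmod(n_items, n_ranks)
    # The first `extra` ranks hold one additional item each.
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return start, start + size


def owner_of(n_items, n_ranks, index):
    """Return the rank that holds a given global index."""
    base, extra = divmod(n_items, n_ranks)
    boundary = extra * (base + 1)  # first index held by a rank with `base` items
    if index < boundary:
        return index // (base + 1)
    return extra + (index - boundary) // base


if __name__ == "__main__":
    n_atoms, n_ranks = 5_000_000, 4096  # hypothetical job size
    start, stop = block_bounds(n_atoms, n_ranks, rank=0)
    print("rank 0 holds", stop - start, "of", n_atoms, "atoms")
    print("atom 4,999,999 is held by rank", owner_of(n_atoms, n_ranks, n_atoms - 1))
```

With such a mapping no single node has to hold the full data set: each rank stores only its own block and can locate any other element from its global index alone.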
Authors: Jan Christian Meyera, Jørn Amundsena, Xavier Saezb
a Norwegian University of Science and Technology (NTNU), Trondheim, NO-7491, Norway
b Barcelona Supercomputing Center, c/ Jordi Girona, 29, 08034 Barcelona, Spain
Abstract: Rewriting application I/O for performance and scalability on petaflops machines easily becomes a formidable task in terms of man-hour effort. Furthermore, on HPC systems the gap of compute to I/O capability in Tflop/s vs GByte/s has increased by a factor of 10 in recent years. This makes the insertion of I/O forwarding software layers between the application and file system layer increasingly feasible from a performance point of view. This whitepaper describes the work on evaluating the IOFSL I/O Forwarding and Scalability Layer on PRACE applications. The results show that the approach is relevant, but is presently made infeasible by the associated overhead and issues with the software stack.
Authors: Raúl de la Cruz, Hadrien Calmet, Guillaume Houzeaux
Barcelona Supercomputing Center, Edificio NEXUS I, c/ Gran Capitán 2-4, 08034 Barcelona, Spain
Abstract: Alya is a Computational Mechanics (CM) code developed at Barcelona Supercomputing Center, which solves Partial Differential Equations (PDEs) in non-structured meshes, using Finite Element (FE) methods. Being a large scale scientific code, Alya demands substantial I/O processing, which may consume considerable time and can therefore potentially reduce speed-up at petascale. Consequently, I/O turns out to be a critical point to consider in achieving desirable performance levels. The current Alya I/O model is based on a master-slave approach, which limits scaling and I/O parallelization. However, efficient parallel I/O can be achieved using freely available middleware libraries that provide parallel access to disks. The HDF5 parallel I/O implementation shows a relatively low complexity of use and a wide range of features compared to other implementations, such as MPI-IO and netCDF. Furthermore, HDF5 offers some interesting advantages, such as a shorter development cycle and a hierarchical data format with metadata support, and it is becoming a de facto standard as well. Moreover, in order to achieve an open-standard format in Alya, the XDMF approach (eXtensible Data Model Format) has been used as metadata container (light data) in cooperation with HDF5 (heavy data). To overcome the I/O barrier at petascale, XDMF & HDF5 have been introduced in Alya and compared to the original master-slave strategy. Both versions are deployed, tested and measured on Curie and Jugene Tier-0 supercomputers. Our preliminary results on the testbed platforms show a clear improvement of the new parallel I/O compared with the original implementation.
Author: Bjørn Lindi
Norwegian University of Science and Technology (NTNU), Trondheim, NO-7491, Norway
Abstract: Darshan is a set of libraries that can characterize MPI-IO and POSIX file access within typical HPC applications in a non-intrusive way. It can be used to investigate the I/O behavior of an MPI program. An application’s I/O behavior can easily be an obstacle to achieving petascale performance. Hence, being able to characterize the I/O of an HPC application is an important step on the path to developing scaling properties. Darshan has been used on selected applications from task 7.1 and 7.2. This whitepaper describes the work carried out and the results achieved.
Authors: Philippe Wauteleta∗, Pierre Kestenerb
a IDRIS-CNRS, Campus universitaire d’Orsay, rue John Von Neumann, Bâtiment 506, F-91403 Orsay, France
b CEA Saclay, DSM / Maison de la Simulation, centre de Saclay, F-91191 Gif-sur-Yvette, France
Abstract: The results of two kinds of parallel IO performance measurements on the CURIE supercomputer are presented in this report. In a first series of tests, we use the open source IOR benchmark to make a comparative study of the parallel reading and writing performances on the CURIE Lustre filesystem using different IO paradigms (POSIX, MPI-IO, HDF5 and Parallel-netCDF). The impact of the parallel mode (collective or independent) and of the MPI-IO hints on the performance is also studied. In a second series of tests, we use a well known scientific code in the HPC astrophysics community: RAMSES, which is a grid-based hydrodynamics solver with adaptive mesh refinement (AMR). IDRIS added support for the 3 following parallel IO approaches: MPI-IO, HDF5 and Parallel-netCDF. They are compared to the traditional one file per MPI process approach. Results from the two series of tests (synthetic with IOR and more realistic with RAMSES) are compared. This study could serve as a good starting point for helping other application developers in improving parallel IO performance.
Author: Huub Stoffers
SARA, Science Park 140, 1098 XG Amsterdam, The Netherlands
Abstract: The I/O subsystems of high performance computing installations tend to be very system specific. There is only a loose coupling with compute architectures and consequently considerable freedom of choice in design and ample room for tradeoffs at various implementation levels. The PRACE Tier-0 systems are no exception in this respect. The I/O subsystem of JUGENE at the Forschungszentrum Jülich is not necessarily very similar to the I/O subsystems of IBM Blue Gene/P installations elsewhere. However, most applications that need efficient handling of petascale data cannot afford to ignore the I/O subsystem. To some extent system specific arrangements of resources need to be known to avoid system- or site-specific bottlenecks.
The first section of the paper gives a fairly detailed description of the I/O subsystem of the PRACE Tier-0 system JUGENE at Jülich and points out the maximum I/O rates that can be achieved, following from the specifications of the components that have been used and the way they are interconnected.
In the second section, the description is complemented with some practical guidelines on how to do efficient I/O on JUGENE and a description of some tools for source level adaptation of applications for improving I/O performance. Using standard I/O and MPI communication, rather than adopting a particular library, variations on hierarchical organization of I/O within an application are explored in more depth and compared for performance. Splitting a parallel program of multiple tasks into a number of equally sized I/O groups, in which a few tasks do I/O on behalf of the others, improves performance only moderately when group membership is determined rather arbitrarily. BlueGene specific MPI extensions can be used to bring topological information, about which tasks are being served by the same I/O node, into the program. Example programs are given on how these calls can be used to create a division into groups that are not only balanced in the I/O volume they produce, but also in the underlying resources they have at their disposal to handle th
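The difference between arbitrary and topology-aware I/O groups can be sketched in a few lines of Python. This is a toy model, not the paper's example programs: it uses no MPI, and the rank-to-I/O-node mapping `io_node_of_rank` is a hypothetical input that, on a real Blue Gene, would come from the system-specific MPI extensions mentioned above.

```python
from collections import defaultdict


def arbitrary_groups(num_tasks, group_size):
    """Equally sized groups formed from consecutive ranks, ignoring topology."""
    return [
        list(range(start, min(start + group_size, num_tasks)))
        for start in range(0, num_tasks, group_size)
    ]


def topology_groups(io_node_of_rank):
    """One group per I/O node, containing exactly the ranks that node serves."""
    groups = defaultdict(list)
    for rank, node in enumerate(io_node_of_rank):
        groups[node].append(rank)
    return [groups[node] for node in sorted(groups)]


def io_nodes_per_group(groups, io_node_of_rank):
    """Count how many distinct I/O nodes each group's traffic passes through."""
    return [len({io_node_of_rank[rank] for rank in group}) for group in groups]


if __name__ == "__main__":
    # Hypothetical mapping: 8 tasks served alternately by I/O nodes 0 and 1.
    io_node_of_rank = [0, 1, 0, 1, 0, 1, 0, 1]

    naive = arbitrary_groups(len(io_node_of_rank), group_size=4)
    aware = topology_groups(io_node_of_rank)

    print("arbitrary groups:", naive, io_nodes_per_group(naive, io_node_of_rank))
    print("topology groups: ", aware, io_nodes_per_group(aware, io_node_of_rank))
```

In this toy mapping both schemes produce two groups of four tasks, but only the topology-aware grouping ties each group to a single I/O node, so the groups are balanced in the resources behind them as well as in size.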
Disclaimer: These whitepapers have been prepared by the PRACE Implementation Phase Projects and in accordance with the Consortium Agreements and Grant Agreements n° RI-261557, n° RI-283493, or n° RI-312763.
They solely reflect the opinion of the parties to such agreements on a collective basis in the context of the PRACE Implementation Phase Projects and to the extent foreseen in such agreements. Please note that even though all participants to the PRACE IP Projects are members of PRACE AISBL, these whitepapers have not been approved by the Council of PRACE AISBL and therefore do not emanate from it nor should be considered to reflect PRACE AISBL’s individual opinion.
Copyright notices: © 2014 PRACE Consortium Partners. All rights reserved. This document is a project document of a PRACE Implementation Phase project. All contents are reserved by default and may not be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contracts RI-261557, RI-283493, or RI-312763 for reviewing and dissemination purposes.
All trademarks and other rights on third party products mentioned in the document are acknowledged as owned by their respective holders.