PRACE Preparatory Access – 13th cut-off evaluation in June 2013


Type A – Code scalability testing


Abstract: The purpose of the project is to evaluate tools, techniques and methodologies for optimizing and tuning MPI/OpenMP/hybrid applications on the Curie thin and fat nodes, and to demonstrate their best usage in the context of the Curie Best Practice Guide (PRACE-3IP, Task 7.3). The codes used will consist of simple micro-benchmarks (sparse kernels).
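
As an illustration of what such a sparse-kernel micro-benchmark can look like (the proposal does not specify the kernels, so the matrix size, density and repetition count below are assumptions), a minimal serial sparse matrix-vector (SpMV) timing loop in Python:

    # Minimal sketch of an SpMV micro-benchmark of the kind that could feed an
    # MPI/OpenMP tuning study; size, density and repetitions are illustrative.
    import time
    import numpy as np
    import scipy.sparse as sp

    n, density, reps = 200_000, 5e-5, 50             # assumed problem parameters
    A = sp.random(n, n, density=density, format="csr", random_state=0)
    x = np.ones(n)

    A @ x                                            # warm-up run
    t0 = time.perf_counter()
    for _ in range(reps):
        y = A @ x
    elapsed = (time.perf_counter() - t0) / reps

    flops = 2 * A.nnz                                # one multiply + one add per stored entry
    print(f"avg SpMV time: {elapsed*1e3:.2f} ms, ~{flops/elapsed/1e9:.2f} GFLOP/s")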


Abstract: Nanoparticle formulations provide a potential route to increasing the oral bioavailability of poorly-soluble drugs. However, at present there is little detailed understanding of the processes, at the atomistic scale, that determine the efficacy of drug encapsulation into the nanoparticle during synthesis. We have developed a multiscale molecular dynamics simulation method that allows us to simulate the process of drug-polymer nanoparticle formation from mixed organic/aqueous solvent mixtures. However, producing realistically-sized simulations (3.5M particles or more) requires extreme computing resources. In this preparatory project we will explore the ability of MareNostrum to run these simulations, benchmarking the code and optimising parameters for the most efficient use of resources.


Abstract: In the present project we want to test the feasibility of ab initio electronic-structure Density Functional calculations on an overstoichiometric uranium oxide, U4O9. The calculation cells we plan to test would contain 832 atoms, which is quite a lot for such calculations. The presence of uranium atoms makes the calculations all the heavier. With the allocated time we do not plan to perform any production calculations, but only to test the scalability and calculation times for the U4O9 structure, in order to check the feasibility of the calculations we plan to propose in a regular submission and the CPU time we should then ask for. The scientific interest of such calculations is detailed below. We want to use the Abinit code for these tests (see below); this code has already been used on many Tier-0 platforms, especially Curie. We would like to test its ability to deal with U4O9 both on a platform we know well (Curie) and on another one we are not used to (MareNostrum), but which is used by other people in our laboratory. The presence of uranium atoms in the U4O9 simulation box makes it mandatory to use a large number of plane waves (large energy cut-off) and also makes the number of electrons in the calculation very large (7040), so the code has to compute a large number of eigenvectors and eigenfunctions. In this sense the present calculation would correspond to a regular calculation (e.g. on silicon) containing about 2000 atoms. Such calculations need computer resources unavailable on Tier-1 and smaller platforms.


Abstract: Cunetsim is a wireless mobile network simulator designed for large-scale simulations. The main idea is to perform an entire network simulation inside the GPU context, where each simulated node is executed by one dedicated GPU core, i.e. an independent execution environment for each simulated node. Cunetsim implements the master-worker (MW) model and introduces an innovative CPU-GPU co-simulation methodology in that it considers the GPU as the main simulation environment and the CPU as a controller. In addition, nodes communicate using a message-passing approach based on buffer exchange without any global information. It has to be mentioned that Cunetsim is developed following a hardware-software co-design methodology optimized for the NVIDIA GPU architecture. The distributed version of Cunetsim is developed according to a three-tier architecture denoted coordinator-master-worker (CMW), which targets large to extra-large scale network simulations. The model supports distributed and parallel simulation for a heterogeneous computing node architecture with both multi-core CPUs and GPUs. The model aims at maximizing the hardware usage rate while reducing the overall management overhead. In the CMW model, the coordinator is the top-level simulation CPU process that performs an initial partitioning of the simulation into multiple instances and is responsible for load balancing and synchronization services among all the active masters.

Compared to existing master-worker models, the CMW is natively parallel and GPU-compliant, and can be extended to support additional computing resources. The performance gain of the model has been evaluated through different benchmarking scenarios using low-cost publicly available GPU platforms. The results have shown that a speedup of up to 3000 times can be achieved compared to a sequential execution, and of up to 5.8 times using 6 NVIDIA GPUs (compared to a mono-GPU MW-based simulation). Using the TGCC Curie supercomputer, we aim to evaluate the scalability of the framework; in particular, we will study the variability of the simulation bottleneck during the simulation.


Abstract: Cells are enclosed by membranes, controlling the signals and molecules that enter and leave each cell. Cell membranes comprise an impermeable bilayer, made up from a wide range of lipids, into which are inserted many membrane proteins. This complex system manages many vital and important physiological processes, for example the infection and budding of viruses and the proliferation of tumours. Increasingly, the curvature of cell membranes is being recognized as important in these (and other) biological processes.

We are studying in detail the role played by membrane curvature in two key biological processes: (1) the recruitment of an oncogenic cell signalling protein (Ras) to different regions on a curved patch of cell membrane and (2) the behaviour of the membrane proteins that adorn the envelope of the Dengue virus. These are both important: mutations in Ras are found in >30% of all human tumours, and the mosquito-borne dengue virus infects more than 50 million people annually.

We have built coarse-grained models of both these systems and propose to study their dynamics using the highly optimised classical molecular dynamics code GROMACS. Despite the reduction in complexity brought about by the coarse-graining, the models of the curved patch of bilayer and the Dengue virus comprise 1.0 million and 1.5 million particles, respectively. This is too large for us to simulate either system using local computational resources. We have experience of running large coarse-grained systems (millions of particles) on Cray XE6 systems (Hermit and HECToR), which has reinforced the importance of careful scaling calculations for each new project. Preparatory access to Hermit and the other PRACE supercomputers would allow us to test and benchmark the scaling and suitability of both of the new simulation models on a wide range of machines. This is necessary due to the recent, large improvements made to the GROMACS molecular dynamics code courtesy of the EU ScalaLife project (see point 5). If successful, these data would be used to support an application to use a suitable PRACE Tier-0 machine in the next regular call.


Abstract: Magnetic reconnection is a ubiquitous plasma physics phenomenon, characterized by a sudden, often explosive, reconfiguration of the magnetic field topology. It is one of the most fundamental and important processes in plasmas. Reconnection governs magnetic energy dissipation and its transformation into plasma thermal and kinetic energy and into nonthermal particle acceleration, and is thus responsible for many of the most spectacular and violent phenomena in space, astrophysical and laboratory plasmas, such as solar/stellar and accretion-disk flares, magnetic substorms in the Earth’s magnetosphere and disruptions in magnetic confinement fusion experiments. A common feature of many reconnection environments is the presence of background plasma turbulence, which affects the reconnection process in ways that are only poorly understood. Conversely, the reconnection of small-scale magnetic fields plays a crucial role in the dissipation of energy in turbulent plasmas, i.e., the understanding of plasma turbulence itself is pinned to the understanding of reconnection.

The general aim of this project is to investigate turbulence and magnetic reconnection in strongly magnetised, weakly collisional plasmas. Our approach pioneers a novel physical model [the Kinetic Reduced Electron Heating Model (KREHM) — Zocco & Schekochihin, Phys. Plasmas 18, 102309 (2011)], which analytically exploits the plasma anisotropy introduced by a strong magnetic field to yield an asymptotically exact, reduced [4D (position vector plus velocity parallel to the magnetic field) instead of 6D (position and velocity vectors)] kinetic description of the plasma. The advantage of such a reduced description over fully 6D models is that much larger separation of scales (and numerical resolutions) can be achieved, while retaining the key physical processes that govern the dynamics of strongly magnetised, weakly collisional plasmas. We have recently developed a new, massively parallel code (VIRIATO) to numerically integrate the equations of KREHM. VIRIATO employs a suite of state-of-the-art numerical algorithms, including a Hermite-polynomial representation of velocity space, which ensures the code is spectrally accurate in both real and velocity space. This project will use this unique computational tool to shed light on the processes behind energy conversion and dissipation in 4D phase-space in kinetic turbulence and reconnection.


Abstract: The intergalactic medium (IGM) is the rarefied material which spans the vast distances between galaxies in the Universe. The IGM therefore straddles the interface between studies of galaxy formation and the evolution of large-scale structure, and its observable properties are closely intertwined with both processes. One of the key observational probes of the IGM is the Lyman-alpha forest of hydrogen absorption lines observed in the spectra of distant quasars. Careful comparison between detailed hydrodynamical simulations of the Lyman-alpha forest and high-resolution, high signal-to-noise echellette spectra has yielded valuable insights into how cold dark matter is [Viel et al. 2008, PRL, 100, 041304], the epoch of reionisation [Bolton & Haehnelt 2007, MNRAS, 382, 385] and the interplay between galaxies and gas in the early Universe [Viel et al. 2013, MNRAS, 429, 1734]. A key limitation of the existing numerical models, however, is their limited dynamic range. This translates into rather small simulation volumes due to the requirement of resolving the Jeans scale in the IGM. Highly resolved simulations are essential for a quantitative comparison with the available high-quality, high-resolution observational data [Bolton & Becker 2009, MNRAS, 398, L26]. Large-scale variations and rare objects, such as massive dark matter haloes and deep voids, are therefore not well captured in existing Lyman-alpha forest simulations. This significantly limits the utility of these models when confronted with observational data, and requires (sometimes large) corrections to be applied to the simulation results. This preparatory project aims to alleviate these issues by bridging the important gap between small and large scales, by preparing the groundwork necessary for performing the highest-resolution Lyman-alpha forest simulation to date within a large 75 Mpc/h volume.


Type B – Code development and optimization by the applicant (without PRACE support)


Abstract: The goals of this project are to (i) implement and test the scalability of the OASIS3-MCT coupler on the Hermit PRACE tier-0 system and (ii) enhance the parallel capabilities of the EC-EARTH3 earth system model by coupling with OASIS3-MCT.

The OASIS3 coupler, currently developed in the framework of the EU FP7 IS-ENES project, is software allowing synchronized exchanges of coupling information between numerical codes representing different components of the climate system. OASIS3 is currently used by approximately 35 climate modelling groups in Europe, the USA, Canada, Australia and Asia. The scaling performance of OASIS3 has been tested locally using several cluster computers with up to 2000 CPU cores. It was found that the scaling of OASIS3 plateaued at 500 cores. This scaling issue was found to be memory-related, in particular due to the creation of the regridding weight maps.

The current OASIS3-MCT version was significantly refactored with respect to OASIS3. OASIS3-MCT is now interfaced with the Model Coupling Toolkit (MCT) developed by Argonne National Laboratory in the USA. MCT implements fully parallel regridding (as a parallel matrix-vector multiplication) and parallel distributed exchanges of the coupling fields, based on pre-computed regridding weights and addresses. First tests done with a high-resolution toy model and with CERFACS’ real ARPEGE T799 NEMIX 0.25 deg coupled model on the PRACE Tier-0 machine, Bullx Curie, show very good results (1). These results were reproduced on the ICHEC cluster “stokes”. Although only preliminary, these results are encouraging and clearly show that the bottleneck of the previous OASIS3 version at high core counts (> 1000) has been resolved.
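
To illustrate the regridding-as-SpMV idea mentioned above, here is a minimal serial sketch in Python; the two-point averaging weights and grid sizes are purely illustrative and are not the OASIS3-MCT/MCT implementation:

    # Serial sketch of regridding as a sparse matrix-vector product: each target
    # grid point is a weighted combination of a few source points, with the
    # weights precomputed once (the toy 2-point weights are illustrative only).
    import numpy as np
    import scipy.sparse as sp

    n_src, n_tgt = 8, 4
    rows, cols, weights = [], [], []
    for t in range(n_tgt):                 # each target cell averages two source cells
        rows += [t, t]
        cols += [2 * t, 2 * t + 1]
        weights += [0.5, 0.5]

    R = sp.csr_matrix((weights, (rows, cols)), shape=(n_tgt, n_src))  # precomputed once
    field_src = np.arange(n_src, dtype=float)                         # coupling field on source grid
    field_tgt = R @ field_src                                         # "regridding" = SpMV each coupling step
    print(field_tgt)                                                  # [0.5 2.5 4.5 6.5]

In the parallel case the rows of R and the vectors are distributed, so the same operation becomes a distributed sparse matrix-vector product.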

With the continual advancement of available architectures it is necessary to port OASIS3-MCT to new systems, in this case the Hermit Cray XE6 system, to ensure its continued usefulness as a cutting-edge coupling tool. In addition, the scaling improvement of OASIS3-MCT may depend on memory size. Hence, testing on a machine such as Hermit, which has a relatively small memory-per-core ratio, will provide valuable results.

On completion of the scaling experiments of OASIS3-MCT, the EC-EARTH3 earth system model will be coupled with OASIS3-MCT and scaling experiments will be performed. The three main components of EC-EARTH3 are IFS (for the atmosphere), NEMO (for the ocean) and OASIS3 (for coupling). It is envisaged that the use of OASIS3-MCT in place of OASIS3 will improve the scaling performance of EC-EARTH3, which, depending on the setup (resolution etc.), should then scale well up to and above 4000 cores on Hermit.

This work is part of the PRACE-2IP Work Package 8 and more specifically falls under the ‘scientific domain’ of climate.

(1) https://verc.enes.org/oasis/general…


Abstract: Protoplanetary disks are structures mostly composed of gas rotating around newly born stars. The gas slowly loses its angular momentum and falls onto the central star. The main issue in protoplanetary disk theory over the last 40 years has precisely been to understand the physical mechanism responsible for such outward angular momentum transport. It quickly became obvious that a source of turbulence was necessary in order to provide the efficient level of transport required by typical disk lifetimes. For a long time, though, its origin eluded the community. Hydrodynamical processes, initially identified as a likely candidate, have so far proven unsuccessful, although they remain an area of active research. In 1991, Balbus and Hawley realized that the magnetorotational instability (MRI), first identified in the 1960s (Velikhov and Chandrasekhar), was a promising source to drive the turbulence, which would thus be magnetohydrodynamical (MHD) in essence. The last two decades have seen detailed MHD numerical simulations of accretion disks being developed. They have confirmed that turbulence in accretion disks is most likely MRI-driven.

Most of the established properties of MRI-driven turbulence have been obtained in the so-called shearing box approximation. Instead of solving the set of MHD equations in a computational domain containing the entire disk, we perform local simulations, i.e. we solve the MHD flow inside a small box around a given radius and use periodic boundary conditions in the azimuthal and vertical directions and shear-periodic boundary conditions in the radial direction to take into account the differential rotation of the disk (the angular velocity decreases with radius).
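
For illustration, a much simplified sketch of how shear-periodic radial boundary conditions can be filled (box sizes, shear rate and the whole-cell rounding of the azimuthal shift are assumptions; production codes interpolate the fractional shift, and the sign convention depends on the box orientation):

    # Simplified sketch of shear-periodic radial boundaries in a shearing box:
    # radial ghost zones are copied from the opposite side, shifted in azimuth
    # by the distance the boundaries have sheared past each other.
    import numpy as np

    nx, ny, ng = 16, 32, 2          # radial cells, azimuthal cells, ghost layers (assumed)
    Lx, Ly = 1.0, 4.0               # box sizes (assumed)
    q_omega = 1.5 * 1.0             # q*Omega with q=1.5 (Keplerian) and Omega=1 (assumed)

    def fill_radial_ghosts(u, t):
        """u has shape (nx + 2*ng, ny); fill the radial ghost layers at time t."""
        shift = q_omega * Lx * t / (Ly / ny)          # azimuthal offset in cells
        s = int(round(shift)) % ny                    # rounded to whole cells for brevity
        inner, outer = u[ng:2*ng, :], u[nx:nx+ng, :]  # physical cells adjacent to the boundaries
        u[:ng, :] = np.roll(outer, +s, axis=1)        # inner ghosts from outer edge, shifted
        u[nx+ng:, :] = np.roll(inner, -s, axis=1)     # outer ghosts from inner edge, shifted back
        return u

    u = np.random.rand(nx + 2*ng, ny)
    fill_radial_ghosts(u, t=0.3)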

The purpose of this project is to extend the capabilities of the RAMSES-GPU code by adding the gravitational term to the MHD solver, in order to study the development of the MRI in stratified shearing boxes. Other technical improvements related to the GPU implementation will also be made (e.g. overlapping GPU-CPU memory transfers with MPI communications). The RAMSES-GPU code has already been used on CURIE for a high-resolution (800x1600x800) MRI simulation at moderate Prandtl number (Pm=4), using 256 GPUs for 500,000 time steps.



Abstract: PHOENIX is a multi-purpose stellar and planetary atmosphere simulation package which can be used to model all types of stars (from brown dwarfs to massive O stars), novae and supernovae, (irradiated or not) gas giant planets and (in development) terrestrial planets. PHOENIX has 1D (/1D) and 3D (/3D) modes, where we have taken care that the 3D mode includes the same micro-physics as the 1D mode. We have just completed the non-local thermodynamic equilibrium (NLTE) module for the 3D mode of PHOENIX, which allows detailed non-equilibrium modeling of the spectra of planets and stars for very complex model atoms. Such NLTE/3D calculations are only possible (for scientifically interesting simulations) if massively parallel systems are used, e.g., 4 million core-hours (Cray XE6) for a simulation with NLTE effects in H, He, C, N, O, Mg and Ca (ionization stages I-III) and only 70000 voxels. PHOENIX/3D is designed to scale to millions of processes; in this project we plan to port and adapt PHOENIX/3D to BlueGene/Q systems and to develop algorithms and setups for such machines that will scale to millions of processes.


Abstract: Overall, Work Package 7 of the PRACE 3rd Implementation Phase (3IP), 2IP and 1IP provided support for applications via PRACE Preparatory Access or DECI calls. A new activity within the 3IP, Task 7.1.C, is focused on the identification of applications that address major socio-economic challenges. The process that follows the identification concentrates on enabling those applications or on bringing expertise and support to solve particular, complex scientific problems. The standard procedures for enabling contain, among other things, steps like porting, performance analysis, optimization, scaling improvement, validating and testing and, naturally, reporting.

One of the ongoing projects within this task, concerning climate modelling and an accurate description of the chemical processes in long-term simulations, has its goals set on taking advantage of modern equipment and the GPGPU programming paradigm. The development work has already been ongoing for a few months and is ready for the first test runs. A local GPU cluster has been used so far for debugging and for very small tests. To push the development, implementation, validation and tests further, towards real-implementation-scale runs and tests, access to a Tier-0 system with a hybrid partition is essential. Since the hybrid partition of CURIE was not available at the 3rd December cut-off date, we would now like to apply for a budget for it.


Abstract: Overall, Work Package 7 of the PRACE 3rd Implementation Phase (3IP), 2IP and 1IP provided support for applications via PRACE Preparatory Access or DECI calls. A new activity within the 3IP, Task 7.1.C, is focused on the identification of applications that address major socio-economic challenges. The process that follows the identification concentrates on enabling those applications or on bringing expertise and support to solve particular, complex scientific problems. The standard procedures for enabling contain, among other things, steps like porting, performance analysis, optimization, scaling improvement, validating and testing and, naturally, reporting.

After the initial enabling period between January and May 2013, the first results are known. One of the ongoing projects concerns the Big Data for Machine Learning topic. For this particular topic, the JUQUEEN budget of the 2010PA1393 proposal has been completely depleted and overrun by another 80% (about 450,000 used in total). The contributors to this project foresee more tests on JUQUEEN and also on CURIE in the period from July 2013 to May 2014. Therefore it has been decided to submit a separate proposal for it, so that it does not collide with the budget requirements of the other ongoing projects. It would be very beneficial if the amount of PNUs available for both JUQUEEN and CURIE could be a little bit larger than for 2010PA1393.


Abstract: Overall, Work Package 7 of the PRACE 3rd Implementation Phase (3IP), 2IP and 1IP provided support for applications via PRACE Preparatory Access or DECI calls. A new activity within the 3IP, Task 7.1.C, is focused on the identification of applications that address major socio-economic challenges. The process that follows the identification concentrates on enabling those applications or on bringing expertise and support to solve particular, complex scientific problems. The standard procedures for enabling contain, among other things, steps like porting, performance analysis, optimization, scaling improvement, validating and testing and, naturally, reporting.

This proposal is again a cumulative application for the remaining six projects run within Task 7.1.C on major socio-economic challenges and the associated applications (the two other projects, on Big Data and on Climate Change, were submitted separately under 2010PA1754 and 2010PA1510, respectively). After the initial enabling stage between January and May 2013, the first preliminary results are known.


Abstract: This project is a continuation of the project “Broadening the scalability of TermoFluids code” with code 2010PA0813. That was one of the first project agreements between the PRACE partnership and an SME in Spain. In that project, different parts of the code were accelerated and adapted in order to achieve performance on Tier-0 machines, including hybrid clusters with GPU accelerators. Given the encouraging results achieved in the adaptation of the code to hybrid clusters, the objective of this new project is to deepen the use of hybrid parallelization strategies in the TermoFluids libraries.

TermoFluids (TF) is a parallel object-oriented library focused on the simulation of CFD&HT (Computational Fluid Dynamics and Heat Transfer) problems. It is based on finite-volume symmetry-preserving discretizations and includes several turbulence models in order to solve problems of industrial interest. TermoFluids has been conceived from the early stages to achieve good performance on modern supercomputers. In recent years the code functionality has been extended to multi-physics simulations, opening a range of possibilities to obtain good performance on the different existing supercomputer architectures. Currently TF is mostly used on medium-sized clusters, engaging typically between 250 and 2048 CPU cores per execution. However, its scalability has been proven with up to 10^4 CPU cores on the MareNostrum II supercomputer. Since TermoFluids is throughput-oriented software, we invest a lot of effort in the evolution of its parallelization strategies. In particular, in this project we focus on benefiting from the great potential of GPU accelerators.

As mentioned above, in the previous project we took the first steps in adapting the TermoFluids software to hybrid clusters. An initial analysis of the main CFD kernel showed that around 60% of the time-step execution is spent in the solution of the Poisson system, derived from the pressure correction in a fractional-step projection scheme. Consequently, we focused on introducing GPU co-processors for the acceleration of this part. The resulting acceleration on the Curie hybrid nodes (compared with execution without the GPUs but engaging all the available CPU cores) was around 4x. This result is in agreement with results reported by other authors for similar problems. Moreover, considering the whole execution, including the parts that were not accelerated, the overall speedup achieved was around 2x. In addition to these results, in the previous project we identified the main aspects that influence performance on hybrid clusters for our application.

With this scenario in mind, the objective of this new project is to accelerate the 40% of the time step that was not addressed in the previous one. We plan to apply techniques similar to the ones used in the acceleration of the Poisson system solution, while attending to the particularities of the different stages of the time-step algorithm. With a first version of the implementation we expect an acceleration of around 4x for these parts, and we will then work on exploring further optimizations.
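
The figures quoted above are consistent with a simple Amdahl's-law estimate, using only the fractions and local speedups stated in the abstract:

    # Amdahl's-law check of the quoted figures: accelerating the Poisson part
    # (60% of the time step) by ~4x gives roughly the reported ~2x overall, and
    # accelerating the remaining 40% by a similar factor approaches ~4x overall.
    def overall_speedup(fractions_and_speedups):
        """fractions_and_speedups: list of (time fraction, local speedup)."""
        return 1.0 / sum(f / s for f, s in fractions_and_speedups)

    print(overall_speedup([(0.6, 4.0), (0.4, 1.0)]))   # ~1.82x, consistent with "around 2x"
    print(overall_speedup([(0.6, 4.0), (0.4, 4.0)]))   # 4.0x if the rest is also accelerated 4x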


Abstract: The last two decades have seen tremendous advances in our understanding of human brain structure and function, particularly at the level of systems neuroscience, where neuroimaging methods have led to better delineation of brain networks and brain modules. Even more striking advances have been reported in molecular genetics research, where the Human Genome Project has provided a first estimate of the human genome, including the total number of genes and their chromosomal locations, and with the development of functional genomics. Yet, despite important progress in both molecular genetics and neuroimaging research, there has been relatively little integration of the two fields (Hariri and Weinberger, 2003). Clearly, the integration of genetics and functional genomics information into neuroimaging methods promises to significantly improve our understanding of a given brain disease. It should lead to the development of biomarkers and, in the future, to personalised medicine.

However, a successful investigation requires several challenges to be met: 1) manage complex, high-dimensional, and large data; 2) develop the statistical methods that are needed to extract the relevant information; 3) develop the software components that will permit large computations to be done; and 4) organise the work, led by specific applications, with expert clinical partners. These four components are the challenges addressed by the Brainomics project. We gather academic and industrial partners to tackle these four aspects of imaging genetics integration. We believe that imaging genetics will make fast and significant advances only if appropriate tools are designed, constructed, and proposed to the scientific community. These tools will also be of relevance for pharmaceutical companies.

We therefore plan to develop the following components. First, a data management system will integrate neuroimaging, phenotypic (clinical or behavioural) information, and genetics or genomics data. In our experience, it is not enough that these complex and large data be well organized on the disks: a relevant query system should be proposed to help researchers quickly access the data on which statistical analysis has to be performed. Access to the genetic/genomic information available as web resources (NCBI, EBI, KEGG, etc.) is also essential to link with the data available from a specific cohort. Once the relevant data are extracted (and that often requires several complex queries), they have to be analysed with the relevant statistical methods, which often involve cross-validation or permutation techniques and therefore large computing resources. Last, these analyses have to be tailored and put in the context of a specific research topic (e.g. addiction, schizophrenia, brain tumours).


Abstract: Helsim is a 3D electro-magnetic Particle-in-Cell simulation with in-situ visualization developed in the Intel ExaScience Lab Flanders. It takes particular care to balance the computational load for handling the particles and to trade this off against computation. This should allow Helsim to simulate experimental configurations with highly imbalanced particle distributions. Moreover, Helsim includes in-situ visualization, where the visualization happens during the simulation. The goal of this proposal is to scale the Helsim simulation and the in-situ visualization to large node counts in order to tackle large problems featuring highly imbalanced particle distributions. This will be validated by running large-scale experiments and studying magnetic reconnection in setups such as those described in http://arxiv.org/pdf/1104.0605.pdf.


Abstract: The design of breeding blankets in a fusion reactor is one of the most challenging technological problems for the development of fusion energy. Unfortunately, experimental set-ups will not be at our disposal during the next decade, and numerical tools that provide realistic simulations are necessary. In this proposal, we aim at developing efficient large-scale magnetohydrodynamic solvers for the forthcoming many-core supercomputers.

This preparatory access can be considered as a further step on the road to a fully distributed-memory implementation of the multilevel BDDC solver. This implementation devotes a subset of MPI tasks in the global communicator to the solution of the coarse-grid problem, which is solved by means of a recursive invocation to the BDDC preconditioning approach.

The most involved challenge to be solved by this layer of code is the efficient re-distribution of the coarse-grid problem from the fine-grid to the coarse-grid tasks. We have so far been able to develop within FEMPAR a set of subroutines that, given a computational mesh distributed over a set of, say, P processes, computes a new optimal partition of the mesh into Q parts (with Q not necessarily equal to P) in parallel by means of the distributed-memory multilevel graph partitioning codes available in Zoltan. Once the computational mesh is partitioned, it has to be migrated so that the partition and the distributed-memory layout of the data match. The proposed development tasks for this preparatory access are focused on this data migration process. It is unclear whether our codes can tolerate a high-quality static partition of the coarse-grid mesh with (potentially) high associated re-distribution costs, or whether they instead require fast incremental partitioners with low associated data movement costs. Even worse, the trade-off between these two factors may be scale-, algorithm-, architecture- and/or problem-dependent. This justifies why this project requires the PRACE facilities in order to guide the development and optimization processes.


Abstract: Conjugate Gradient (CG) type iterative solvers are widely used for solving sparse systems of linear equations. They are used in several scientific computing applications and can be applied to sparse systems that are too large to be handled by direct methods such as the Cholesky decomposition. The Conjugate Gradient method, being a kernel operation in sparse linear system solution, finds application in a wide range of areas, such as solving partial differential equations and various other optimization problems.

For CG, in most cases, the basic operations performed at each iteration are a sparse matrix-vector multiplication (SpMxV) of the form y = Ax, followed by linear vector operations on the dense vectors y and x. In the parallel versions of these solvers, the sparse matrix A is partitioned among processors, and each processor becomes responsible for the computation on the data assigned to it. Here, processors might need data from other processors to perform their local computations; hence communication between different processors is required. Specifically, each SpMxV operation requires point-to-point communication of some vector elements between specific processors, whereas some of the linear vector operations (e.g. inner products) require the communication of a single word among all processors.
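
For reference, here is a serial sketch of a textbook CG iteration with comments marking where a distributed-memory implementation would communicate; the reformulation described below targets exactly these synchronization points (the 1D Laplacian test matrix is illustrative):

    # Serial sketch of standard Conjugate Gradient; comments flag the two kinds
    # of communication a distributed implementation needs per iteration.
    import numpy as np
    import scipy.sparse as sp

    def cg(A, b, tol=1e-8, maxiter=200):
        x = np.zeros_like(b)
        r = b - A @ x                       # SpMxV: point-to-point halo exchange in parallel
        p = r.copy()
        rs_old = r @ r                      # inner product: global all-reduce in parallel
        for _ in range(maxiter):
            Ap = A @ p                      # SpMxV: point-to-point communication
            alpha = rs_old / (p @ Ap)       # inner product: first global synchronization point
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r                  # inner product: second global synchronization point
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(100, 100), format="csr")  # 1D Laplacian test matrix
    b = np.ones(100)
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))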

We devised a reformulation of the parallel CG algorithm that reduces the number of messages communicated in a single iteration. Our method reduces the number of global synchronization points in a single iteration and consequently reduces the overhead of the parallel CG algorithm. Another advantage of our reformulation is that it allows the parallel CG method to scale better, since it avoids point-to-point communication operations. Since CG solvers are used as low-level kernels in several scientific computing applications, this approach can improve the scalability characteristics of the corresponding applications considerably.

We realized our formulation with various communication algorithms proposed for implementing reduction operations. We plan to study models and algorithms for architectures whose underlying network topologies are tori and fat trees. We plan to test our methods and algorithms on different architectures and to increase the performance of parallel CG on supercomputers, benefiting applications whose performance depends on sparse linear system solvers. We plan to validate our method on different supercomputers to show that our reformulation’s improvement is independent of the underlying architecture.


Abstract: Parallel tempering is a well-established technique to accelerate simulations of complex interacting systems displaying rugged energy landscapes. Applications range from simulation of biomolecules and proteins to studies of phase transitions in condensed matter or spin glasses. The key idea is to perform Monte Carlo or molecular dynamics simulations of independent replicas of the system of interest at different temperatures. At regular time intervals, an exchange is attempted between different replicas and the corresponding temperatures are swapped using a Metropolis acceptance rule. As replicas are allowed to explore states at higher temperatures, they can overcome energy barriers and therefore sample more effectively the relevant configuration space.
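
A minimal sketch of the exchange move described above (the energies and inverse temperatures below are illustrative; in a production code each replica's energy would come from its own Monte Carlo or molecular dynamics kernel):

    # Minimal sketch of the replica-exchange (parallel tempering) swap move:
    # adjacent temperatures are swapped with the standard Metropolis acceptance
    # probability min(1, exp(dBeta * dE)).
    import math
    import random

    def attempt_swaps(energies, betas):
        """energies[i], betas[i] describe replica i; swap neighbouring betas in place."""
        for i in range(len(betas) - 1):
            d_beta = betas[i] - betas[i + 1]
            d_energy = energies[i] - energies[i + 1]
            if random.random() < min(1.0, math.exp(d_beta * d_energy)):
                betas[i], betas[i + 1] = betas[i + 1], betas[i]   # replicas keep their state,
                                                                  # only the temperatures move

    energies = [-105.2, -98.7, -91.3, -84.0]      # illustrative instantaneous energies
    betas = [1.0, 0.8, 0.6, 0.4]                  # inverse temperatures, hottest last
    attempt_swaps(energies, betas)
    print(betas)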

Since replicas evolve independently, the algorithm can be easily parallelized by distributing replicas on different cores. However, due to synchronization at exchange attempts, a straightforward implementation may result in inefficient use of computational resources and poor scalability. The problem is particularly severe when equilibration times depend strongly on temperature, as in systems with glassy dynamics. Moreover, from the implementation point of view, it is desirable to decouple the parallel tempering logic from the actual simulation kernels, which could then be chosen so as to better match the target performance for a given architecture and problem size.

To tackle these issues, we want to implement a flexible, multi-GPU parallel tempering code. The idea is to exploit a hybrid simulation scheme, where groups of replicas are distributed over GPU nodes and exchanges are implemented through a high-level interface. Our goal is twofold: to keep the code highly scalable on HPC resources and to make it open to extensions and further optimization by allowing different simulation backends. The currently targeted backend is RUMD (http://rumd.org/), which runs entirely on GPU and is particularly efficient on small and medium system sizes. Adapters for other simulation packages may be added in the future. If our multi-GPU implementation is successful, we plan to apply for a regular PRACE project on hybrid machines to study an exciting, open problem of condensed matter physics: the quest for ideal glass transitions.


Type C – Code development with support from experts from PRACE


Abstract: Thanks to the computational power now available with petaflop machines, it is now possible to run hundreds or thousands of numerical simulations where only one was possible a few years ago. This gives engineers the possibility to introduce a statistical treatment of uncertainties in numerical simulation, which adds a lot of value for a whole range of engineering problems: uncertainty ranking and quantification, model calibration, and estimation of the likelihood of extreme events. At CEA, these statistical methods are grouped in a tool called URANIE. Given a nominal simulation, URANIE provides the ability to define uncertainties on the input data, create a set of modified input files taking those uncertainties into account, launch hundreds or thousands of instances of the code on these modified input files, and then perform a statistical treatment of the dependency between uncertainties on the input and output data. While URANIE is well suited for launching many instances of serial codes, it suffers from a lack of scalability and portability when used for coupled simulations (via MPI or via the SALOME framework) and/or parallel codes. The aim of the project is therefore to enhance this launching mechanism to support a wider variety of applications.
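
For illustration only (this is not the URANIE API), the launching workflow described above boils down to something like the following sketch, where the input template, the perturbed parameter and the launched command are all hypothetical stand-ins:

    # Generic sketch of an uncertainty-propagation campaign: draw perturbed input
    # values, write one input file per sample from a template, and launch the
    # instances; "viscosity" and the echo command are hypothetical stand-ins.
    import subprocess
    import numpy as np

    rng = np.random.default_rng(0)
    n_samples = 8
    viscosities = rng.normal(loc=1.0e-3, scale=1.0e-4, size=n_samples)  # assumed uncertain input

    template = "viscosity = {visc:.6e}\n"        # stand-in for a real solver input file
    procs = []
    for i, visc in enumerate(viscosities):
        fname = f"run_{i:03d}.input"
        with open(fname, "w") as f:
            f.write(template.format(visc=visc))
        # Stand-in for submitting the real (possibly MPI-parallel) code through a
        # batch system; a production tool would track job status and collect outputs.
        procs.append(subprocess.Popen(["echo", "launching", fname]))

    for p in procs:
        p.wait()                                 # statistical post-processing of outputs would follow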

This project has been granted help from PRACE via support from PSNC experts.


Abstract: This preparatory access project will port and optimize our existing finite difference scheme for the simulation of turbulent Rayleigh-Benard convection in a closed cylindrical cell to the Eurora compute cluster at CINECA, which is based on new Intel MIC processors. We wish to optimize the program, which is available with hybrid parallelization, for future high-resolution studies of large-aspect-ratio convection systems.


Abstract: WARIS is an in-house multi-purpose framework focused on solving scientific problems using mainly Finite Difference Methods (FDM) as the numerical scheme. Nevertheless, the numerical methods supported in the framework are not tied only to explicit time-integration schemes, but also include semi-implicit and implicit schemes in order to guarantee stability for linear and non-linear terms. The framework was designed from scratch to solve Earth Science and Computational Fluid Dynamics problems in a parallel and efficient way across a wide variety of architectures. Structured meshes (regular or non-regular) are employed to represent the problem domains, since they are better suited to optimization on accelerator-based architectures. To succeed in such a challenge, the WARIS framework was initially designed to be modular in order to ease development cycles, portability, reusability and future extensions of the framework.

The WARIS framework is composed of two primary systems, the Physical Simulator Kernel (PSK) and the Workflow Manager (WM). The PSK system is in charge of providing the spatial and temporal discretization code for the simulated physics. Its aim is also to provide a basis for the specialization of physical problems (e.g. Advection-Diffusion-Reaction or Navier-Stokes governing equations) on any forthcoming architecture (e.g. new general-purpose processor vector ISAs, GPGPUs or the Intel Xeon Phi). Therefore, this module is basically a template that provides the appropriate framework for implementing a specific simulator. As a consequence, flexibility in design must be attained to let the specialization accommodate any kind of physics by reusing as much code as possible. This approach will minimize the development cycle by reducing the code size and the debugging effort.
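
As an illustration of the generic kind of kernel such a specialization implements, here is a minimal 1D explicit finite-difference advection-diffusion step (grid, velocity and diffusivity are assumed values, unrelated to WARIS itself):

    # Minimal 1D explicit finite-difference step for an advection-diffusion
    # equation; parameters are illustrative and satisfy the explicit stability limits.
    import numpy as np

    nx, dx, dt = 200, 0.01, 1.0e-4
    u, kappa = 1.0, 0.05                         # advection velocity, diffusivity
    x = np.arange(nx) * dx
    c = np.exp(-((x - 1.0) ** 2) / 0.01)         # initial Gaussian pulse

    def step(c):
        """One explicit update: first-order upwind advection + central-difference diffusion."""
        cm = np.roll(c, 1)                       # periodic neighbours c[i-1]
        cp = np.roll(c, -1)                      # c[i+1]
        adv = -u * (c - cm) / dx                 # upwind difference (valid for u > 0)
        diff = kappa * (cp - 2 * c + cm) / dx**2
        return c + dt * (adv + diff)

    for _ in range(1000):
        c = step(c)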

Porting WARIS to an emerging heterogeneous architecture such as the Intel Xeon Phi (MIC) would mark a step forward in our development roadmap. This new many-core architecture may enable us to run very large environmental simulations (volcanic ash dispersion or fluid dynamics cases) in a rapid and efficient way, unveiling the next generation of real-time simulations of actual cases at a European and world-wide level for the foreseeable future.


Abstract: The UCD-SPH code utilises the Smoothed Particle Hydrodynamics (SPH) method for modelling wave interaction with an Oscillating Wave Surge Converter (OWSC) device. The SPH scheme used in the UCD-SPH code is based on the SPH-ALE formulation. The standard SPH method is a purely Lagrangian technique, and the particles are moving nodes that are advected with the local velocity and carry field variables such as pressure and density. The SPH-ALE formulation, however, is based on the solution of a moving Riemann problem in the Arbitrary Lagrangian-Eulerian context, and hence the so-called particles are moving control volumes rather than particles. As the fields are only defined at a set of discrete points, smoothing (interpolation) kernels are used to define a continuous field and to ensure differentiability.
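
For illustration, one common choice of SPH smoothing kernel is the M4 cubic spline; the abstract does not state which kernel UCD-SPH uses, so the sketch below is generic:

    # One common SPH smoothing kernel (the M4 cubic spline in 3D, support 2h);
    # this is illustrative, not necessarily the kernel used in UCD-SPH.
    import numpy as np

    def cubic_spline_w(r, h):
        """Cubic spline kernel W(r, h) in 3D; r may be a scalar or an array."""
        q = np.asarray(r, dtype=float) / h
        sigma = 1.0 / (np.pi * h**3)                       # 3D normalization
        w = np.where(q < 1.0, 1.0 - 1.5 * q**2 + 0.75 * q**3,
            np.where(q < 2.0, 0.25 * (2.0 - q)**3, 0.0))
        return sigma * w

    # A field value at a point is then a kernel-weighted sum over neighbours:
    # A(x) ~ sum_j (m_j / rho_j) * A_j * W(|x - x_j|, h)
    r = np.linspace(0.0, 2.5, 6)
    print(cubic_spline_w(r, h=1.0))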

The code has been developed by Dr. Ashkan Rafiee and has been validated extensively in numerous applications. The version that will be used in this project is based upon the three-dimensional SPH approach and has recently been parallelised using MPI.

This project seeks to accelerate compute-intensive components of the code by enabling the code for the Intel MIC architecture using OpenMP and Intel’s Language Extensions for Offload (LEO). Subsequently the code will be optimised for the MIC architecture, where the MIC’s wide vector units will be exploited where possible. The result will be a hybrid MPI/OpenMP code which we aim to scale across all MIC-based nodes on the Eurora machine.


Abstract: The goal of the project is to port, analyse and evaluate three relevant HPC applications (Alya, iPIC3D and AVBP) and two computational kernels (Cholesky decomposition and sparse LU factorization) on the Intel Xeon Phi accelerator. This accelerator promises performance comparable to GPUs while also supporting a wide range of well-known parallel and distributed programming models such as pthreads, OpenMP, OmpSs*, Intel TBB, MPI, Cilk++, and even OpenCL.

The first step of the project will be to quickly port the three applications and the two computational kernels to the Xeon Phi to study the out-of-the-box performance and scalability of this platform. The second step will study the inter-node scalability of the three applications and the intra-node scalability of the two computational kernels. Finally, on the last step, OmpSs will be used to optimize selected parts of the three applications and the two computational kernels.

The results of this study will bring new insights about the performance/productivity tradeoff offered by the Xeon Phi and the role that this family of accelerators may have on future Exascale systems. Additionally, the results obtained will be analyzed to provide feedback to improve the scalability and performance of the three studied applications and the OmpSs runtime on the Xeon Phi.

*OmpSs is a superset of OpenMP that also supports a dataflow execution of a sequential program via tasks.


Abstract: The AVBP code is a recognized standard in the domain of massively parallel computing applied to Large Eddy Simulation of compressible reactive and transcritical flows. As of 2010 it had been ported to 18 of the top 20 machines in the world (top500.org) and since then has been ported to all major architectures (Intel, IBM, AMD, SiCortex). With a community of over 200 users in Europe (France, Germany, Spain, Italy) and in the world (USA, Australia), it is in constant adaptation and improvement to take advantage of new leadership-class architectures. With this access, CERFACS aims at providing the community with the best experience on this new architecture. For a few months we have been running AVBP on one Xeon Phi coprocessor. The first attempt was the native mode: AVBP runs entirely on the Xeon Phi based on its MPI implementation. This approach works, but performance is of course not great compared to the CPU version. This is due to the hardware specificities of the Xeon Phi compared to the Xeon processor, but also to the fact that neither the vectorization nor the thread parallelisation in AVBP is sufficiently developed. We would like to find the best programming model to obtain performance on this new architecture. Two models are available: MPI versus offload. The first is very easy to manage because AVBP is already based on MPI, but performance is disappointing. The offload model should give better performance, but it implies a different programming model: Intel Language Extensions for Offload, OpenACC directives, or OpenMP 4 in the future.


Abstract: The plan is to extend the successfully completed simulations of Type I burst flame propagation on neutron stars to the Xeon Phi platform. The original code is written in Fortran 90 but requires adaptation to run efficiently on the Xeon Phi platform. A specific property of this project that distinguishes it from many others is that we are not primarily interested in running much bigger problems; rather, we want to run current (medium-sized) problems much faster for parameter-space sweeps. As such, it presents an exciting challenge to extract parallelism from the Xeon Phi on medium-sized problems.


Abstract: The primary goal of this proposal is the porting and optimization of a sophisticated 3D seismic modeling and imaging tool called SPECFEM3D for the characterization of earthquakes and mapping of Earth’s interior on all scales. The proposed research thus affects the fields of exploration geophysics, regional and global seismology, and even ocean acoustics or non-destructive testing, since the tool propagates acoustic waves. Extrapolating the www.top500.org Top500 list, computers with exaflop peak performance are expected to become available around 2020, and the proposed research ensures that seismologists will be able to effectively and efficiently harness such resources.

Over the proposed research period we intend to further develop and enhance our open-source software SPECFEM3D for the simulation of 3D seismic wave propagation in acoustic, elastic or anelastic media on hybrid INTEL MIC architectures to address the so-called forward problem in seismology, i.e., accurately and efficiently propagating acoustic waves in a known complex medium. These simulations account for heterogeneity in the crust and mantle, topography, anisotropy, attenuation, fluid-solid interactions, self-gravitation, rotation, and the oceans. A major goal is to be able to routinely and efficiently reach short seismic periods in global simulations.

In the context of the seismological so-called inverse problem, our goal is to harness the power of the forward modeling tools to enhance the quality of images of Earth’s interior and the earthquake rupture process. The approach used is to minimize remaining frequency-dependent phase and amplitude differences between simulated and observed seismograms based on adjoint techniques in combination with conjugate gradient methods, an approach we refer to as ’adjoint tomography’. In seismology, adjoint methods facilitate the calculation of the gradient of a misfit function with respect to the model parameters based on just two numerical simulations for each earthquake at each iteration, independent of the number of seismographic stations or the number of measurements.
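
As a simple illustration, the quantity minimized in adjoint tomography can be written, in its most basic least-squares form (the proposal itself works with frequency-dependent phase and amplitude measurements), as:

    % Illustrative least-squares waveform misfit; the gradient of chi with respect
    % to the model m is obtained from one forward and one adjoint simulation per event.
    \chi(m) = \frac{1}{2} \sum_{r} \int_{0}^{T}
              \left\| \mathbf{s}(\mathbf{x}_r, t; m) - \mathbf{d}(\mathbf{x}_r, t) \right\|^{2} \, dt

Here s are simulated and d observed seismograms at receivers x_r; the sum runs over receivers, yet the gradient still requires only two simulations per earthquake.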

Following successful application of this technique on conventional CPU clusters in the last few years, during the proposed research period we plan to adapt our code to perform such adjoint tomography on an INTEL MIC cluster. The resulting images of the crust and mantle will help address unresolved tectonic and geodynamical questions. In future work our most ambitious goal is to move towards adjoint tomography of the entire planet, which is an illustration of the development of application software towards exascale computing.

In the context of exascale software development, we will take advantage of the recent introduction of INTEL MIC co-processor and hierarchical architectures with multicore chips, hardware multithreading, etc. For our parallel application, this will require introducing a high level of shared memory parallelism without compromising portability. To make our software package efficient and yet easy to maintain, we will rely on higher-level programming environments such as OpenMP. During the proposed research period we will work extensively on computing on INTEL MIC co-processor cards.


Abstract: In statistical mechanics, “spin system” indicates a broad class of models used for the description of a number of physical phenomena. Although the formulation of the models appears quite simple, the study of spin systems is by no means trivial, and most of the time numerical simulations (very often based on Monte Carlo methods) are the only way to understand their behaviour. In recent years we developed highly tuned versions of a code for the simulation of spin systems, and in particular of the Heisenberg Spin Glass (HSG) system. The code has been ported to multiple platforms, including clusters of GPUs. The code resorts to MPI for inter-GPU parallelization, whereas on each computing node we use either OpenMP and vectorization or CUDA, depending on the type of node (CPU or GPU). We carried out extensive numerical experiments on single CPUs and on clusters of GPUs and obtained interesting results that have been presented at conferences and published in major journals like the Journal of Parallel and Distributed Computing and Computer Physics Communications.

With the present proposal, we would like to have access to the recently announced Intel MIC technology to extend the experiments and compare the performance of the MIC with that of the latest Nvidia GPUs (based on the Kepler architecture). Although the code implements two algorithms (overrelaxation and heat bath) for a specific spin system (HSG), the memory access pattern and the kind of operations are quite common in other applications, so the results can be of interest also for people working in related fields (e.g., the solution of PDEs by means of relaxation methods). We expect to start the activity by tuning the single-node code to make the best use of the MIC technology. After that, we need to work on the development of the multi-node version to see if it is possible to achieve overlapping of communication and computation, as we successfully did running on clusters of GPUs.

The code is written in C and is fully portable. The most time-consuming loops have been written in such a way that the vectorization is carried out directly by the Intel compiler, so that neither compiler directives nor intrinsic vectorization primitives are required. The code requires random number generation, which is carried out using SFMT, the SIMD-oriented Fast Mersenne Twister algorithm. No external library is required. The code allows the spin configuration to be saved periodically for checkpointing long-lasting simulations. However, for benchmarking, the dump is not required, so the I/O is very limited. At the end of the project we expect to release the source code, as we did for the multi-GPU version.
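
For illustration, the overrelaxation move mentioned above amounts, for Heisenberg spins, to reflecting each spin about its local exchange field; below is a generic numpy sketch (the lattice and couplings are assumptions, and a production code would sweep in checkerboard order so that each reflection sees updated neighbours and energy is conserved exactly):

    # Sketch of the microcanonical overrelaxation move for Heisenberg spins:
    # each unit spin is reflected about its local field, which leaves the
    # bilinear exchange energy unchanged.  The simultaneous update is for brevity.
    import numpy as np

    rng = np.random.default_rng(0)
    L = 16
    spins = rng.normal(size=(L, L, L, 3))
    spins /= np.linalg.norm(spins, axis=-1, keepdims=True)     # unit 3-vectors

    def local_field(s):
        """Sum of the six nearest-neighbour spins on a periodic cubic lattice."""
        h = np.zeros_like(s)
        for axis in range(3):
            h += np.roll(s, +1, axis=axis) + np.roll(s, -1, axis=axis)
        return h

    def overrelax(s):
        h = local_field(s)
        h2 = np.sum(h * h, axis=-1, keepdims=True)
        sh = np.sum(s * h, axis=-1, keepdims=True)
        return 2.0 * sh / h2 * h - s          # reflection s -> 2(s.h)h/|h|^2 - s

    spins = overrelax(spins)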


Abstract: Alya is a parallel multiphysics code developed at BSC-CNS. Alya solves coupled problems in parallel, such as in/compressible turbulent flow, non-linear solid mechanics, electromagnetism, thermal flow, etc. It is parallelized using MPI and OpenMP. In the last months we have been doing small tests on Xeon Phi cards, with very promising results. This project aims to push these tests as far as possible, but now with large-scale real-world problems, in particular in the computational solid mechanics field. The problem we target comes from a geometry provided by FUSION FOR ENERGY for the reactor vacuum vessel.


Abstract: The aim of the project is to port the ab initio electron transport code SMEAGOL to the new Intel Xeon Phi based computer architecture. SMEAGOL calculates the quantum electron transport properties of nano-devices at an applied bias voltage from first principles. In order to obtain the electron density in such systems out of equilibrium at an applied voltage, the non-equilibrium Green’s function (NEGF) formalism is used, and the Hamiltonian is obtained from the density functional theory (DFT) localized-orbital basis set code SIESTA. SMEAGOL is used to predict current-versus-voltage characteristics across different types of nano-devices, such as molecular junctions, thin films, nanowires and two-dimensional materials such as graphene. Such theoretical calculations are important in order to rationalize the multitude of experimental data on the transport characteristics of such junctions, both for fundamental scientific understanding and for the design of new devices. An example of a successful application of these methods is magnetic tunnel junctions, where NEGF-based first-principles calculations have driven experimental research, and which are used routinely in hard disk drive read heads and in magnetic random access memories.

The typical device setup consists of two metallic electrodes separated by a spacer, which can for example be a molecule or a thin film. Inherently, the simulations of such nano-device setups involve rather large supercells, and SMEAGOL can currently treat systems with more than 10000 atoms. Moreover, since experimentally such nano-junctions show considerable fluctuations in their structures, which are then reflected in the conductivities, one needs to perform many calculations for different realistic interfaces in order to obtain statistically relevant results that can be compared to experimental data. Such calculations therefore require considerable computational resources, and efficient parallelization of the code is of paramount importance. SMEAGOL scales well in parallel on standard compute clusters, and on BlueGene/Q systems scaling has been demonstrated up to more than 10000 processors for calculations at an applied voltage. The Intel MIC architecture is an emerging technology that can provide the possibility to accelerate sections of parallel code at reduced cost and power consumption. We therefore aim to port the SMEAGOL code to the Intel MIC and to compare its performance to that of other available computer architectures, such as standard compute clusters or BlueGene/Q systems. The work will be carried out in collaboration with PRACE experts at the Irish Centre for High-End Computing (ICHEC).
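
For reference, the standard NEGF expressions evaluated by codes of this kind are (generic textbook form, not SMEAGOL-specific notation):

    % Standard NEGF expression for the current through a two-terminal device at
    % bias V (spin-degenerate case); T(E,V) is the transmission coefficient.
    I(V) = \frac{2e}{h} \int T(E, V)\,\bigl[f_L(E - \mu_L) - f_R(E - \mu_R)\bigr]\,dE,
    \qquad
    T(E, V) = \mathrm{Tr}\bigl[\Gamma_L\, G^{r}\, \Gamma_R\, G^{a}\bigr]

Here G^r and G^a are the retarded and advanced Green's functions of the scattering region and Gamma_L,R the electrode coupling matrices; evaluating them for many energies and bias points is what makes these calculations computationally demanding.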


Abstract: A fast, robust and scalable library for solving a sparse linear system AX=B is important in many science and engineering applications. In terms of correctness, direct methods are more reliable than their iterative counterparts. SuperLU_DIST (see [1]) is a suitable base algorithm for developing a solver. However, it has several weaknesses that diminish its practical performance in certain situations (see [2, 3] and references therein). We (ITU-UHeM) are developing a linear solver, SuperLU_MCDT (Many Core Distributed), utilizing the MPI+X hybrid programming model to reduce the communication overhead associated with MPI so that better scalability can be achieved, in addition to other new capabilities. We will customize and test SuperLU_MCDT on EURORA, the Eurotech SandyBridge + Intel MIC hybrid cluster at CINECA.

[1] X. S. Li, J. W. Demmel, J. R. Gilbert, L. Grigori, M. Shao, I. Yamazaki, SuperLU Users’ Guide, 1999, update: 2012

[2] A. Duran, M.S. Celebi, M. Tuncel and B. Akaydin, Design and implementation of new hybrid algorithm and solver on CPU for large sparse linear systems, PRACE, PN:283493, PRACE-2IP white paper, Libraries, WP 43, July 13, 2012

[3] M.S. Celebi, A. Duran, M. Tuncel and B. Akaydin, Scalable and improved SuperLU on GPU for heterogeneous systems, PRACE, PN:283493, PRACE-2IP white paper, Libraries, WP 44, July 13, 2012


Abstract: PARFLUX is a high performance computing project based on a finite-volume code with interface capture called FluxIC. The method was developed by Jean-Philippe Braeunig and co-workers (PhD thesis, 2007) and belongs to the family of CmFD (Computational multi-Fluid Dynamics) codes.

FluxIC has proved its efficiency on various physical test cases, in particular for impact calculations; results of this type have been presented at several ISOPE conferences (http://www.isope.org/) and are the subject of scientific publications.

The goal of this PRACE project is the porting of FluxIC to the MIC architecture with the support of experts.


Abstract: GADGET is a freely available code, widely used for cosmological N-body/SPH simulations to solve a wide range of astrophysical tasks: colliding and merging galaxies, the formation of large-scale structure in space, the dynamics of the gaseous intergalactic medium, star formation and its regulation, etc. The objective of this project is code optimization, porting GADGET to new hybrid computing platforms, and scalability testing of GADGET on the Eurora system in order to assess the efficiency of the algorithm. Our work is part of the PRACE-1IP project.


Abstract: Biological sequence processing is essential for bioinformatics and the life sciences. This scientific area requires powerful computing resources for exploring large sets of biological data. Parallel implementations of methods and algorithms for the analysis of biological data using high-performance computing are important for accelerating research and reducing the investment. Multiple sequence alignment is a basic method in DNA and protein analysis. The project is aimed at carrying out scientific experiments in the area of bioinformatics on the basis of parallel computer simulations. The aim of the project is the optimization and investigation of the parallel performance and efficiency of an innovative parallel algorithm, MSA_BG, for multiple alignment of biological sequences, which is highly scalable and locality-aware. The MSA_BG algorithm is iterative and is based on the concept of Artificial Bee Colony metaheuristics and the concept of algorithmic and architectural space correlation. The case study is discovering the evolution of the influenza virus and similarity searching between RNA segments of various influenza A virus strains, utilizing all 8 available segments of influenza virus A, on the basis of a parallel hybrid program implementation of the MSA_BG multiple sequence alignment method.


Abstract: The study of the variability of the influenza virus is a problem of great importance nowadays. Influenza type A viruses cause epidemics and pandemics. Restricting the spread of pandemics and treating people infected by the influenza virus relies heavily on the latest achievements of molecular biology, bioinformatics and biocomputing, as well as many other advanced areas of science. The influenza A virus genome comprises 10 genes carried by 8 single-stranded negative-sense RNA molecules with a total length of about 13600 bases that replicate in the host cell nucleus. Segment sizes range from 890 to 2341 nucleotides. The world's DNA databases are publicly accessible and usually contain information on more than one (up to several thousand) individual genomes for each species. Up to June 1, 2011, 6895 human and avian isolates of the influenza virus had been completely sequenced and made available through GenBank. Scientists now depend on these databases and on access to the information they contain. This scientific area requires powerful computing resources for exploring large sets of biological data. In silico biological sequence processing is key for molecular biology. The parallel implementation of methods and algorithms for the analysis of biological data using high-performance computing is essential for accelerating research and reducing the required investment. Multiple sequence alignment is an important method in DNA and protein analysis. ClustalW has become the most popular tool; it implements a progressive method for multiple sequence alignment. The project aims to carry out scientific experiments in the area of bioinformatics on the basis of parallel computer simulations. The computational aspect of this project is code optimization for GPGPU systems and the investigation of the efficiency and scalability of a hybrid parallel implementation based on the ClustalW algorithm on the EURORA system, for the case study of investigating viral nucleotide sequences.
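
As background on the progressive method mentioned above, ClustalW first computes distances between all pairs of sequences, then builds a guide tree from them, and finally aligns the sequences progressively. The pairwise stage is embarrassingly parallel, which is what hybrid CPU/GPU implementations typically exploit. The sketch below (not the project's code) scores all pairs of a few toy RNA fragments with a simple Needleman-Wunsch global alignment, parallelized over pairs with OpenMP; the sequences and scoring parameters are made up for illustration.

    /* Illustrative sketch of the all-pairs distance stage of a progressive
     * aligner such as ClustalW. Each pair is independent, so the loop over
     * pairs parallelizes trivially; a GPU version would map pairs to thread
     * blocks instead. Sequences and scoring scheme are hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <omp.h>

    #define MATCH     2
    #define MISMATCH -1
    #define GAP      -2

    static int max3(int a, int b, int c) { return a > b ? (a > c ? a : c) : (b > c ? b : c); }

    /* Needleman-Wunsch global alignment score with a linear gap penalty. */
    static int nw_score(const char *a, const char *b)
    {
        int la = (int)strlen(a), lb = (int)strlen(b);
        int *prev = malloc((lb + 1) * sizeof(int));
        int *curr = malloc((lb + 1) * sizeof(int));
        for (int j = 0; j <= lb; ++j) prev[j] = j * GAP;
        for (int i = 1; i <= la; ++i) {
            curr[0] = i * GAP;
            for (int j = 1; j <= lb; ++j) {
                int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? MATCH : MISMATCH);
                curr[j] = max3(diag, prev[j] + GAP, curr[j - 1] + GAP);
            }
            int *tmp = prev; prev = curr; curr = tmp;
        }
        int score = prev[lb];
        free(prev); free(curr);
        return score;
    }

    int main(void)
    {
        /* Toy RNA fragments standing in for influenza segments. */
        const char *seq[] = { "AGGGUUCACG", "AGGCUUCACG", "AGGGUUAACG", "UGGGUUCACC" };
        const int n = 4;
        int score[4][4];

        /* All n*(n-1)/2 pairs are independent: parallelize over them. */
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j)
                score[i][j] = nw_score(seq[i], seq[j]);

        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j)
                printf("score(%d,%d) = %d\n", i, j, score[i][j]);
        return 0;
    }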


Abstract: PIERNIK is a grid-based MHD code using a simple, conservative numerical scheme known as the Relaxing TVD (RTVD) scheme. The code relies on a dimensionally split algorithm of second order in space and time. The Relaxing TVD scheme is easily extensible to account for additional fluid components (multiple fluids, dust, cosmic rays) and additional physical processes, such as fluid interactions, Ohmic resistivity effects and self-gravity. The simplicity and small number of floating-point operations of the basic algorithm are reflected in high serial performance.

A unique feature of the PIERNIK code is our original implementation of anisotropic transport of the cosmic-ray component in the fluid approximation (Hanasz and Lesch 2003, Astronomy and Astrophysics 412, 331). The basic explicit CR diffusion scheme has recently been supplemented by one of our team members (Artur Gawryszczak) with a multigrid-diffusion scheme.

We have recently implemented two new modules: Adaptive Mesh Refinement (AMR) and a Multigrid (MG) solver. The AMR algorithm allows us to reach much larger effective resolutions than was possible with a uniform grid. It dynamically adds regions of improved resolution (fine grids) where required by refinement criteria, and it can also delete grids that are no longer needed to maintain a high-quality solution. The MG solver, on the other hand, is one of the fastest known methods for solving elliptic and parabolic differential equations, which in our case describe, respectively, the self-gravity of the fluid and the diffusion of cosmic rays (a generic sketch of such a multigrid cycle is given below). In addition, the isolated external boundaries for self-gravity use a multipole expansion of the potential to determine proper boundary values in a fast and efficient manner.
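
The following is a textbook illustration of the multigrid idea mentioned above, not PIERNIK's multigrid-diffusion or self-gravity solver: a recursive V-cycle for the 1-D Poisson equation with a damped-Jacobi smoother. The grid size, smoother and cycle counts are assumptions made for the example.

    /* Textbook V-cycle for the 1-D Poisson equation -u'' = f on a uniform
     * grid with zero Dirichlet boundaries, sketching the multigrid idea.
     * Not PIERNIK's implementation; smoother and parameters are assumed. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void smooth(double *u, const double *f, int n, double h, int sweeps)
    {
        /* Damped Jacobi (omega = 2/3), a common multigrid smoother. */
        double *tmp = calloc(n + 2, sizeof(double));
        for (int s = 0; s < sweeps; ++s) {
            for (int i = 1; i <= n; ++i)
                tmp[i] = u[i] + (2.0 / 3.0) *
                         (0.5 * (u[i - 1] + u[i + 1] + h * h * f[i]) - u[i]);
            memcpy(u + 1, tmp + 1, n * sizeof(double));
        }
        free(tmp);
    }

    static void vcycle(double *u, const double *f, int n, double h)
    {
        smooth(u, f, n, h, 3);                       /* pre-smoothing           */
        if (n > 1) {
            int nc = (n - 1) / 2;                    /* coarse interior points  */
            double *r  = calloc(n + 2, sizeof(double));
            double *fc = calloc(nc + 2, sizeof(double));
            double *uc = calloc(nc + 2, sizeof(double));
            for (int i = 1; i <= n; ++i)             /* residual r = f + u''    */
                r[i] = f[i] + (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h);
            for (int i = 1; i <= nc; ++i)            /* full-weighting restrict */
                fc[i] = 0.25 * (r[2 * i - 1] + 2.0 * r[2 * i] + r[2 * i + 1]);
            vcycle(uc, fc, nc, 2.0 * h);             /* coarse-grid correction  */
            for (int i = 1; i <= nc; ++i) {          /* linear prolongation     */
                u[2 * i]     += uc[i];
                u[2 * i - 1] += 0.5 * uc[i];
                u[2 * i + 1] += 0.5 * uc[i];
            }
            free(r); free(fc); free(uc);
        }
        smooth(u, f, n, h, 3);                       /* post-smoothing          */
    }

    int main(void)
    {
        int n = 127;                                 /* 2^k - 1 interior points */
        double h = 1.0 / (n + 1);
        double *u = calloc(n + 2, sizeof(double));
        double *f = calloc(n + 2, sizeof(double));
        for (int i = 1; i <= n; ++i) f[i] = 1.0;     /* constant source term    */
        for (int cycle = 0; cycle < 10; ++cycle) vcycle(u, f, n, h);
        printf("u at midpoint = %f (exact 0.125)\n", u[(n + 1) / 2]);
        free(u); free(f);
        return 0;
    }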

The combination of these two algorithms makes PIERNIK an ideal tool for simulations of multiphysics phenomena in the gaseous disks of galaxies, active galactic nuclei, and planet-forming circumstellar disks. In this project we focus on the case of galactic disks, which involve many physical ingredients and processes, such as magnetic field generation, cosmic-ray transport and gravitational-instability-induced star formation. In such cases we need to resolve a multiscale environment ranging from parsec-scale, gravitationally bound star-forming regions up to CR-driven outflows tens of kpc long. However, our initial large-scale tests show that PIERNIK suffers from a few bottlenecks that cripple its performance for large numbers (greater than 4096) of processors. We hope that this project will allow us to overcome the few obstacles that prevent us from reaching the full potential of our code.


Abstract: The original contribution of this project is a handy, scalable parallelization of nested loops which couples features of strip mining, fission and fusion of loops (see http://parlab.eecs.berkeley.edu/wik…). In essence, the solution proposed in this project makes the light part of the nested operations sequential, but vectorized, while leaving the heavy parts readily deployable for an easily scalable parallel computation. The application chosen is the valuation of Rainbow options, e.g. options on the max/min of n assets, using multivariate binomial lattices (Boyle et al., 1989). This kind of modelling is little used in practice because of the curse of dimensionality: the size of the problem increases exponentially with the number of assets. For #sec=4 on 120 intervals, on a 32-core machine with 128 GB RAM, a conveniently honed multi-threaded program in Aptech Gauss executes in 211033 seconds on 8 threads and scales down to 51602 seconds on 32 threads. The code is easily scalable to a higher number of threads, in step with the number of cores in a single shared-memory machine/node; Gauss is not an MPI application and works on individual nodes. With this new computational tool in hand, some research targets that were previously out of reach become feasible. For instance, the study of the applicability of Richardson's extrapolation to multivariate lattice results is another original contribution of this research project, since in the extant literature only univariate binomial lattices have been studied with respect to these acceleration techniques. In conclusion, coupling parallel computation with Richardson's extrapolation will make multivariate lattices competitive with Monte Carlo simulation / regression or quantization methods, which can even turn out to be slower when computed serially.
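
For background, the sketch below prices a single-asset European call on a CRR binomial lattice and parallelizes the per-time-step backward-induction sweep with OpenMP; in the rainbow-option case each node expands into an n-asset sub-lattice, which is where the curse of dimensionality and the loop restructuring described above enter. The parameters are illustrative, and this is not the Aptech Gauss code used in the project.

    /* Illustrative CRR binomial lattice for a single-asset European call,
     * with the backward-induction sweep at each time step parallelized by
     * OpenMP. Parameters are made up; not the project's Gauss code. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <omp.h>

    int main(void)
    {
        const double S0 = 100.0, K = 100.0, r = 0.03, sigma = 0.2, T = 1.0;
        const int    N  = 120;                          /* time intervals      */
        const double dt = T / N;
        const double u  = exp(sigma * sqrt(dt));        /* CRR up factor       */
        const double d  = 1.0 / u;
        const double p  = (exp(r * dt) - d) / (u - d);  /* risk-neutral prob.  */
        const double disc = exp(-r * dt);

        double *V  = malloc((N + 1) * sizeof(double));
        double *Vn = malloc((N + 1) * sizeof(double));

        /* Terminal payoffs at the N+1 leaves of the lattice. */
        for (int j = 0; j <= N; ++j) {
            double ST = S0 * pow(u, j) * pow(d, N - j);
            V[j] = fmax(ST - K, 0.0);
        }

        /* Backward induction: within a time step the node updates are
         * independent (double buffering avoids a race), so the sweep can be
         * parallelized; the outer time loop stays sequential. */
        for (int step = N - 1; step >= 0; --step) {
            #pragma omp parallel for
            for (int j = 0; j <= step; ++j)
                Vn[j] = disc * (p * V[j + 1] + (1.0 - p) * V[j]);
            double *tmp = V; V = Vn; Vn = tmp;          /* swap buffers        */
        }

        printf("CRR call price with %d steps: %.4f\n", N, V[0]);
        free(V); free(Vn);
        return 0;
    }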


Abstract: We aim at the development of efficient, reliable and future-proof numerical schemes and software for the parallel solution of partial differential equations (PDEs) arising in industrial and scientific applications. Here we are especially interested in technical flows including fluid-structure interaction, chemical reactions and multiphase flow behaviour, which can be found in a wide variety of (multi-)physics problems. We use a paradigm we call 'hardware-oriented numerics' for the implementation of our simulators: numerical efficiency and hardware efficiency are addressed simultaneously in the course of development. On the one hand, adjusting data structures and solvers to a specific domain patch dominates the asymptotic behaviour of a solver; on the other hand, utilising modern multi- and many-core architectures as well as hardware accelerators such as GPUs has recently become state of the art. With this preparatory access application we aim at evaluating the Intel Xeon Phi accelerator as a backend expansion for our codes.


Abstract: Atomistic simulation using ab initio approaches such as Density Functional Theory is becoming increasingly important as a tool for discovering, designing and analysing the properties of new materials for a diverse set of applications. CP2K is a freely available and highly popular toolkit implementing a wide range of simulation methods, from classical potential models through the efficient Quickstep DFT algorithm to post-DFT methods such as Hartree-Fock exchange and Møller-Plesset second-order perturbation theory (MP2). The code is widely used by European computational scientists and has been ported to various HPC architectures including multi-core clusters, IBM BlueGene, and CUDA GPU systems. Intel's Xeon Phi is an attractive new architecture for HPC applications, as it offers the high performance and low cost/power of GPU-based accelerators together with the ease of programming of standard CPUs. In particular for CP2K, we already have an efficient OpenMP implementation which can be used to exploit the parallelism available on the Xeon Phi co-processor and thus improve the performance of the application compared with running on the host CPU alone. In this project, we will optimise and extend the parallelisation of CP2K for the Xeon Phi co-processor, and demonstrate the benefits of this approach by computing the optimised structure of a range of materials of the langasite family, which have potential applications as fuel cells, using the PRACE EURORA cluster.


Abstract: The simulation of earthquake rupture and radiating seismic wave propagation in complex 3-D heterogeneous material is today at the heart of hypothesis testing and knowledge gain in many branches of Earth sciences. A code frequently used in the geophysics community to simulate the propagation of seismic waves is the software package SeisSol.

The goal of this PRACE-internal preparatory access project is to port and optimize the application SeisSol on the new Intel MIC architecture.

The main activities planned within the project can be summarized as follows:

- Performance optimization and improvement of the scaling behavior on the EURORA prototype machine at CINECA.

- Test of the various programming models for the Intel MIC architecture, such as native execution or offload mode (a generic sketch of these two modes is given after this list). Investigation of the performance of the different modes of the MKL library (Automatic Offload, Compiler Assisted Offload, Native Execution).

- Performance evaluation and tuning of a hybrid implementation using both MPI and OpenMP. Investigation of the influence of the granularity of the parallelization on the performance.

- Optimization of the sparse matrix-matrix multiplications.

- Optimization and tuning of the dynamic load balancing.
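
For reference, the following generic sketch (not SeisSol code) contrasts the two Intel MIC execution modes listed above: in native mode the whole program is compiled for and run on the coprocessor unchanged, whereas in compiler-assisted offload the annotated region is shipped to the card at run time. The kernel, array size and device number are assumptions, and the offload pragma requires the Intel compiler.

    /* Generic sketch of the two Intel MIC execution modes, not SeisSol code.
     * Native mode: compile the whole program for the coprocessor and run it
     * there. Compiler-assisted offload: the annotated region below is sent
     * to the card. Kernel and sizes are hypothetical; Intel compiler only. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    /* Mark the function so a MIC-side version is generated as well. */
    __attribute__((target(mic)))
    void scale_add(const double *a, const double *b, double *c, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            c[i] = 2.0 * a[i] + b[i];
    }

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 1.0; }

        /* Compiler-assisted offload: inputs are copied to MIC card 0, the
         * kernel runs there using the card's OpenMP threads, and the result
         * is copied back. Without a card, execution falls back to the host. */
        #pragma offload target(mic:0) in(a, b : length(N)) out(c : length(N))
        scale_add(a, b, c, N);

        printf("c[10] = %f\n", c[10]);
        free(a); free(b); free(c);
        return 0;
    }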


Abstract: New hydrocarbon reservoirs are increasingly hard to find, which has forced exploration into ever more complex geological areas and increased the demand for new seismic imaging technology. In areas of structural complexity, seismic surveys are combined with controlled-source electromagnetic surveys to improve seismic velocity models and seismic imaging. Mapping the crust is an inverse problem where the principal objective is to infer mechanical properties of the subsurface from acoustic reflection measurements. Solving the inverse problem is compute intensive both for acoustic and for controlled-source electromagnetic (CSEM) methods. The project will investigate these methods on the new Intel Xeon Phi architecture with the target of improving time to solution (higher throughput) and, if possible, also improving imaging (higher performance).

The Institute of Applied Geophysics at NTNU, together with the company EMGS (www.emgs.com), will provide the methods to be studied on the Intel Xeon Phi. The methods are forward-modelling techniques applied to acoustic or controlled-source electromagnetic data.


Abstract: The software framework FMPS implements a non-linear FEM analysis method, which for most engineering problems consists mainly of two computationally intensive parts: assembly of the stiffness matrix and solution of a linear system via the preconditioned conjugate gradient (PCG) method. In the proposed project, both parts will be vectorized and parallelized to run on a cluster of Intel MICs. The planned performance evaluations will focus on the general feasibility of using Intel MIC accelerated compute nodes for the execution and optimization of structural-mechanics FEM codes. A reference sketch of the PCG kernel is given below.
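
The sketch below is a minimal serial preconditioned conjugate gradient with a Jacobi (diagonal) preconditioner on a toy CSR matrix, given purely for reference. It is not the FMPS implementation; the matrix and right-hand side are made up, and in the project this loop is the part that is vectorized and distributed across Intel MIC accelerated nodes.

    /* Minimal Jacobi-preconditioned conjugate gradient on a toy SPD matrix
     * in CSR form. A reference sketch of the second FMPS kernel mentioned
     * above, not the FMPS implementation. */
    #include <stdio.h>
    #include <math.h>

    #define N 4

    static const int    row_ptr[N + 1] = {0, 2, 5, 8, 10};
    static const int    col_idx[10]    = {0, 1, 0, 1, 2, 1, 2, 3, 2, 3};
    static const double val[10]        = {4, -1, -1, 4, -1, -1, 4, -1, -1, 4};

    static void spmv(const double *x, double *y)           /* y = A*x */
    {
        for (int i = 0; i < N; ++i) {
            double s = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                s += val[k] * x[col_idx[k]];
            y[i] = s;
        }
    }

    static double dot(const double *a, const double *b)
    {
        double s = 0.0;
        for (int i = 0; i < N; ++i) s += a[i] * b[i];
        return s;
    }

    int main(void)
    {
        double b[N] = {1, 2, 3, 4}, x[N] = {0}, r[N], z[N], p[N], Ap[N];
        double diag[N] = {4, 4, 4, 4};                      /* Jacobi M = diag(A) */

        for (int i = 0; i < N; ++i) { r[i] = b[i]; z[i] = r[i] / diag[i]; p[i] = z[i]; }
        double rz = dot(r, z);

        for (int it = 0; it < 100 && sqrt(dot(r, r)) > 1e-10; ++it) {
            spmv(p, Ap);
            double alpha = rz / dot(p, Ap);
            for (int i = 0; i < N; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            for (int i = 0; i < N; ++i) z[i] = r[i] / diag[i];
            double rz_new = dot(r, z);
            double beta = rz_new / rz;
            rz = rz_new;
            for (int i = 0; i < N; ++i) p[i] = z[i] + beta * p[i];
        }

        for (int i = 0; i < N; ++i) printf("x[%d] = %f\n", i, x[i]);
        return 0;
    }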

