Procurement and commissioning of HPC service

Author: Willi Homberg (a)
(a) Jülich Supercomputing Centre (JSC), Germany

Contributors: Eric Boyer (GENCI), Radoslaw Januszewski (PSNC), Norbert Meyer (PSNC), Dirk Pleiter (JSC), François Robin (CEA), Philippe Segers (GENCI), Gert Svensson (KTH), Mateusz Tykierko (WCSS),
Torsten Wilde (LRZ) and all PRACE partners contributing to the TCO survey

Abstract: An analysis of Total Cost of Ownership (TCO) is widely used to support acquisition and planning decisions for a wide range of assets that incur significant maintenance or operating costs over their ownership life. The TCO calculation aggregates the initial investment, as capital expenditure (CapEx), with the operational expenditure (OpEx) over the whole life cycle of the asset. It is a central concern in budgeting and asset life-cycle management, e.g. when prioritising acquisition proposals. Not surprisingly, evaluating the TCO of a supercomputing system is a widely used approach to comparing different options.
Any data centre, and an HPC centre in particular, requires an advanced infrastructure that supports efficient work with computing and data resources. Although the details depend on the particular customer situation, infrastructure-related costs are important contributors to the TCO, with energy among the most important cost-driving factors to be considered in a TCO calculation. The world's fastest supercomputers of 2017 consume up to 18 MW, close to what is generally considered the upper limit for the electricity budget (20 MW). Our study shows that on average the CapEx for IT equipment represents less than half of the total costs (46.8 %), and a large fraction of the costs (33.8 %) is due to the investment in developing the infrastructure together with the related operational costs. A share of 11.1 % of the TCO is attributed to personnel costs.
Reducing TCO is generally challenging and there are no simple recipes. Cost reductions have to be analysed in conjunction with the strategic goals of the data centre, its defined requirements and its SLAs; this makes it a multi-layered analysis rather than a simple cutting of costs. Changes that reduce costs in one part of the system may increase costs in other parts, so a trade-off analysis is needed that takes all system- and data-infrastructure-related costs into account. This also requires making certain assumptions about the usage of the system and the typical workload it will run, and balancing Time-to-Solution (TtS) against Energy-to-Solution (EtS) to target the “best value for money”.
This white paper is based on a questionnaire sent to PRACE partners in the work package “HPC Commissioning and Prototyping” (WP5) to identify the most relevant cost factors to be considered when purchasing, installing or upgrading, and operating a state-of-the-art HPC centre. Another goal of the report is to assess the increasing role of the infrastructure and of power consumption, especially with regard to the upcoming exascale systems. Promising strategies for cutting costs while preserving high performance are discussed, and examples of best practice are listed. Finally, a practical example shows to what extent the TCO concept is used in HPC procurements at PRACE partner sites.
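
To make the aggregation concrete, here is a minimal sketch of the TCO calculation described above: CapEx for IT and infrastructure plus OpEx (energy, personnel, maintenance) accumulated over the system's life cycle. All figures in the example are illustrative assumptions, not numbers from the study; only the ~18 MW peak draw is quoted in the abstract.

```python
# Minimal TCO sketch: TCO aggregates CapEx with OpEx over the life cycle.
# All numbers are illustrative assumptions, not figures from the paper.

HOURS_PER_YEAR = 8760

def energy_cost(avg_power_mw, eur_per_mwh, years):
    """Lifetime electricity cost, assuming constant average load and price."""
    return avg_power_mw * HOURS_PER_YEAR * eur_per_mwh * years

def tco(it_capex, infra_capex, annual_opex, lifetime_energy, years):
    """Total Cost of Ownership = CapEx (IT + infrastructure) + OpEx over `years`."""
    return it_capex + infra_capex + annual_opex * years + lifetime_energy

if __name__ == "__main__":
    lifetime = 5                                   # assumed system lifetime in years
    energy = energy_cost(avg_power_mw=10.0,        # assumed average draw, below the 18 MW peak
                         eur_per_mwh=80.0,         # assumed electricity price
                         years=lifetime)
    total = tco(it_capex=100e6, infra_capex=40e6,  # assumed investments
                annual_opex=5e6,                   # assumed personnel + maintenance per year
                lifetime_energy=energy, years=lifetime)
    print(f"Energy over {lifetime} years: {energy/1e6:.1f} M EUR")
    print(f"TCO: {total/1e6:.1f} M EUR")
```

Even with these rough assumptions, the energy bill alone comes to tens of millions of euros over the lifetime, which is why it appears among the dominant cost drivers above.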

Download paper: PDF


Authors: Miroslaw Kupczyk (a)*, Damian Kaliszan (a), Huub Stoffers (b), Niall Wilson (c), Felip Moll (d)
(a) Poznan Supercomputing and Networking Center (PSNC)
(b) SURFsara, Amsterdam
(c) ICHEC, Galway
(d) Barcelona Supercomputing Center (BSC)

Abstract: The document presents the background and assumptions of the pilot deployment of the Urgent Computing environment in the PRACE RI. It presents several possible scenarios for integrating this functionality, bearing in mind that the PRACE RI is a distributed environment with distinct policies, requirements and local limitations. The final recommendations and guidelines will be presented in the project deliverable.

Download paper: PDF


Authors: N. Ilieva (a,b,c), Z. Kiss (d)**, B. Pavlov (e,a), G. Szigeti (d)
(a) National Centre for Supercomputing Applications (NCSA), Sofia, Bulgaria
(b) Institute of Information and Communication Technologies, BAS, Sofia, Bulgaria
(c) Institute for Interdisciplinary Research and Technology (INIRT), Sofia, Bulgaria
(d) KIFU – NIIFP, Budapest, Hungary
(e) Sofia University “St. Kl. Ohridski”, Sofia, Bulgaria

Abstract: Being able to handle large volumes of valuable data is a central point in linking large-scale scientific instruments (satellites, laser facilities, sequencers, accelerators, etc.) with HPC infrastructure. Whether the data is created experimentally or synthetically, it has to be reliably transferred, analysed, stored and archived for further use or reference, as re-creating it may be expensive, time-consuming or even impossible. This necessitates improved support for data- and compute-intensive applications in terms of data transfer and architecture diversification. We discuss these issues using the example of two large-scale research infrastructures: accelerators (in the context of high-energy physics research and medical applications) and the Extreme Light Infrastructure (ELI).

Download paper: PDF


Author: Radosław Januszewski (a)
(a) Poznań Supercomputing and Networking Center, Noskowskiego 10, 61-704 Poznań, Poland

Abstract: Due to the increasing share of energy costs and the growing heat density of servers, cooling is becoming highly challenging and very important. While power consumption is recognised as one of the main challenges for future HPC systems, attention to this issue tends to be limited to the power consumed by the computer hardware alone, leaving aside the increase in cooling power required by more densely packaged, highly integrated hardware. The information presented herein results from data collected and analysed through a detailed survey distributed among PRACE partners. The PRACE 3IP Pre-Commercial Procurement (PCP) contributes to the development of energy-efficient HPC technologies and architectures, targeting improved cooling and energy efficiency of the overall system along with fine-scale monitoring of energy consumption. The methodology designed for this PCP to evaluate energy efficiency could be re-used in other HPC infrastructure procurements, allowing for a better view of the TCO of the future system. This paper provides an overview of traditional cooling technologies and devices currently used in modern HPC data centres, as well as some innovative and promising solutions adopted by some PRACE partners that may pave the way for future standards. The advantages and disadvantages of each described solution are discussed, and general recommendations are provided as to which aspects HPC centres should take into account when selecting and building a cooling system.
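
As a rough illustration of why cooling power cannot be left aside, the sketch below uses Power Usage Effectiveness (PUE), the ratio of total facility power to IT power. PUE is a common industry convention rather than a metric defined in this paper, and all figures are assumptions; the PUE values only indicate typical orders of magnitude for air-cooled, optimised and warm-water-cooled facilities.

```python
# Illustrative cooling-overhead sketch; all figures are assumptions.

def facility_power(it_power_mw, pue):
    """Total facility draw once cooling and other infrastructure are included."""
    return it_power_mw * pue

it_load = 5.0                  # assumed IT load in MW
for pue in (1.8, 1.4, 1.1):    # roughly: legacy air cooling / optimised / warm-water cooling
    total = facility_power(it_load, pue)
    print(f"PUE {pue}: {total:.1f} MW total, {total - it_load:.1f} MW overhead")
```

The spread between the first and last line of output, several megawatts for a mid-sized system, is the cooling-related cost that a hardware-only power analysis misses.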

Download paper: PDF


Authors:
Michał Nowak, Gerard Frankowski, Norbert Meyer
Poznań Supercomputing and Networking Center, Poznań, Poland (PSNC)

Erhan Yilmaz, Okan Erdogan
National Center for High Performance Computing of Turkey (UHEM)

Contributors:
Jean-Philippe Nominé, François Robin
CEA/DIF/DSSI – CEA/DAM Ile-de-France, Bruyères-le-Châtel, 91297 Arpajon Cedex, France

Abstract: Securing the HPC infrastructure is an important task. Awareness of the importance of this topic is high, but the level of investment and the skills required to organise proper protection make it a difficult one, and widely varying levels of solutions and practices are observed. There is a huge number of security threats coming from both the Internet and internal networks and, even though the cost may seem high, it is crucial to introduce an adequate level of security to the infrastructure, because the cost of losing data is usually much higher. Based on a survey of the PRACE community, this paper describes security technologies used in data centres and, especially, in their HPC-centre subset. It gives a set of general recommendations on how to enhance the security of the HPC infrastructure.

Download paper: PDF


Authors:
Radosław Januszewski, Poznań Supercomputing and Networking Center, Noskowskiego 10, 61-704 Poznań, Poland
Ladina Gilly, Swiss National Supercomputing Centre, Via Trevano 131, CH-6900 Lugano, Switzerland
Erhan Yilmaz, National High Performance Computing Center of Turkey, Uydu Yolu, Maslak, 34469, İstanbul, Turkey
Axel Auweter, Leibniz-Rechenzentrum, Boltzmannstraße 1, D-85748 Garching bei München, Germany
Gert Svensson, PDC, KTH, Center for High Performance Computing, SE-100 44 Stockholm, Sweden

Abstract: Due to the increasing share of energy costs and the growing heat density of servers, cooling is becoming very challenging and very important. While it is commonly accepted that power consumption is the number one challenge for future HPC systems, attention often focuses on the power consumed by the compute hardware alone, leaving aside the necessary increase in cooling power required by more densely packaged, highly integrated hardware. The information presented herein results from data collected and analysed through a detailed survey distributed among PRACE partners. In the paper we go into the particulars of cooling, presenting the different technologies and devices currently used in modern HPC data centres. We also describe innovative and promising solutions adopted by some PRACE partners that may pave the way for future standards, highlighting the advantages and disadvantages of each described solution. In the final part we provide general recommendations on what HPC centres need to take into account when building an appropriate cooling system.

Download paper: PDF


Author:
Marcin Pospieszny, Poznań Supercomputing and Networking Center, Noskowskiego 10, 61-704 Poznań, Poland

Contributors:
Jean-Philippe Nominé, CEA/DIF/DSSI – CEA/DAM Ile-de-France, Bruyères-le-Châtel, 91297 Arpajon Cedex, France
Ladina Gilly, Swiss National Supercomputing Centre, Via Trevano 131, CH-6900 Lugano, Switzerland
François Robin, CEA/DIF/DSSI – CEA/DAM Ile-de-France, Bruyères-le-Châtel, 91297 Arpajon Cedex, France
Norbert Meyer, Poznań Supercomputing and Networking Center, Noskowskiego 10, 61-704 Poznań, Poland
Radosław Januszewski, Poznań Supercomputing and Networking Center, Noskowskiego 10, 61-704 Poznań, Poland

Abstract: The design of the electrical distribution network for an HPC centre is an important strategic task when planning a new centre or planning the upgrade of an existing one. All decisions taken at the design stage may have long-term effects on the operation, maintenance and later upgrade of the power supply infrastructure.

Based on a survey of the PRACE community, this paper describes common issues related to power distribution in HPC centres, discusses available solutions to the common problems associated with electrical energy distribution, and finally gives recommendations that may be useful for general or technical managers facing a new facility design or upgrade project.

Download paper: PDF


Author: Ladina Gilly, Swiss National Supercomputing Centre, Via Trevano 131, CH-6900 Lugano, Switzerland

Abstract: Choosing the location is an important strategic task when planning a new data centre, as it will impact virtually every step of the planning and realisation process as well as the future operation of the centre and its extension possibilities. It is not uncommon in the data centre industry to spend significant effort and resources on finding and acquiring the optimal site for a new data centre. It is therefore unsurprising that the industry has written widely about the factors that ought to be considered when selecting a future location. However, within the community of HPC research centres, requirements in a number of areas tend to differ from those of traditional data centres. Based on a survey of the PRACE community, to which 10 sites responded, this paper discusses where requirements and options in the search for a new location for an HPC centre differ from those of a traditional data centre, and briefly discusses the criteria to which this community attributes the most importance when selecting a site.

Download paper: PDF


Authors: Erhan Yılmaz, National High Performance Computing Center of Turkey, Uydu Yolu, Maslak, 34469, İstanbul, Turkey
Ladina Gilly, CSCS – Swiss National Supercomputing Centre, Lugano, Switzerland

Abstract: Defining a level of redundancy is a strategic question when planning a new data centre, as it will directly impact the entire design of the building as well as the construction and operational costs. It will also affect how to integrate future extension plans into the design. Redundancy is also a key strategic issue when upgrading or retrofitting an existing facility.

Redundancy is a central strategic question for any business that relies on data centres for its operation. In traditionally data-centre-reliant industries such as Internet Service Providers (ISPs), banks, insurance companies or credit card services, redundancy is of paramount importance, as a loss of availability has an immediate and sometimes drastic impact on, for example, revenue or legal due diligence. For this reason, the industry has formed a number of clear standards and guidelines that address the topics of redundancy and reliability.

Both topics are of course just as important for HPC centres, but not always in the same way, given that some of the trade-off mechanisms may differ substantially; this makes it difficult for an HPC centre to rely fully on the existing standards used by the traditional data centre industry.

This white paper aims to discuss the key factors to be taken into account when selecting a level of redundancy and reliability for an HPC centre, providing managers with a set of topics that need to be considered when designing a new HPC centre or upgrading an existing one. These factors all have an impact on the design and cost of construction as well as on the future operational costs of the centre.
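
To make the redundancy trade-off concrete, the sketch below applies the standard parallel-availability model, in which a system of n redundant units fails only if all n fail. This is textbook reliability theory rather than a formula from the paper, and the single-unit availability used is an assumption.

```python
# Standard parallel-availability model; the 99.9 % UPS availability is assumed.

HOURS_PER_YEAR = 8760

def parallel_availability(a, n):
    """Availability of n redundant units: the system fails only if all n fail."""
    return 1.0 - (1.0 - a) ** n

ups = 0.999  # assumed availability of a single UPS unit

for n in (1, 2):
    avail = parallel_availability(ups, n)
    downtime_h = (1.0 - avail) * HOURS_PER_YEAR
    print(f"{n} unit(s): availability {avail:.6f}, ~{downtime_h:.2f} h downtime/year")
```

Going from one unit to an N+1 pair cuts expected downtime from hours to minutes per year, but doubles the equipment to buy, power and maintain; that cost-versus-availability balance is exactly the design decision discussed above.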

Download paper: PDF


Authors: Richard Blake, STFC, Daresbury Laboratory, Warrington WA4 4AD, England
François Robin, CEA/DIF, Bruyères-le-Châtel, 91297 Arpajon, France
Marco Sbrighi, CINECA, Via Magnanelli 6/3, 40033 Casalecchio di Reno, Bologna, Italy

Abstract: The procurement of High Performance Computing systems is a complex process which seeks to meet a range of technical and financial targets with an optimal solution that maximises the benefits to the potential users whilst minimising risk. At the heart of the procurement process is the need to define the requirements of the system: is it to procure a test vehicle for assessing new technologies, to provide a solution for a specific application, or to provide a general solution for a broad range of applications? Is the expectation to have high availability, minimise the acquisition or running cost, maximise the capabilities, minimise the time it takes to solve a problem, or a mix of the above? Once the requirements have been decided and quantified, the appropriate procurement route needs to be selected from a spectrum of possibilities, ranging from pre-commercial procurement, to stimulate research and development into innovative technologies, to open procurements where the requirements can be met by current technologies. The various offerings need to be evaluated quantitatively, according to pre-defined criteria, in order to determine the Most Economically Advantageous Tender prior to signature of the contract. Acceptance of the system is then usually performed on site through the completion of appropriate benchmarks. The management of commercial and technical risks is paramount to the success of the project in terms of securing quality solutions on time and on budget. The purpose of this white paper, produced within the PRACE 1IP project, is to address these issues based on the experience of PRACE partners in procuring large systems.
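
As a sketch of what "evaluated quantitatively, according to pre-defined criteria" can look like in practice, the example below ranks tenders by a weighted sum of normalised criterion scores. The criteria, weights and scores are hypothetical and are not taken from any actual PRACE procurement.

```python
# Hypothetical weighted-criteria tender evaluation; weights and scores are
# illustrative assumptions, not values from any real procurement.

CRITERIA_WEIGHTS = {                 # published before tenders open; must sum to 1.0
    "sustained_performance": 0.40,
    "tco_over_lifetime":     0.35,   # lower cost maps to a higher normalised score
    "risk_and_roadmap":      0.15,
    "support_quality":       0.10,
}

def weighted_score(scores):
    """Scores are assumed pre-normalised to [0, 1] for each criterion."""
    return sum(CRITERIA_WEIGHTS[c] * scores[c] for c in CRITERIA_WEIGHTS)

offers = {
    "vendor_A": {"sustained_performance": 0.9, "tco_over_lifetime": 0.6,
                 "risk_and_roadmap": 0.7, "support_quality": 0.8},
    "vendor_B": {"sustained_performance": 0.7, "tco_over_lifetime": 0.9,
                 "risk_and_roadmap": 0.8, "support_quality": 0.7},
}

# The Most Economically Advantageous Tender is the offer with the highest score.
for vendor, scores in offers.items():
    print(f"{vendor}: {weighted_score(scores):.3f}")
best = max(offers, key=lambda v: weighted_score(offers[v]))
print(f"MEAT: {best}")
```

Note how the published weights encode the strategic priorities discussed above: shifting weight between performance and lifetime cost can change which tender wins, which is why the criteria must be fixed and disclosed before offers are evaluated.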

Download paper: PDF


Disclaimer

These whitepapers have been prepared by the PRACE Implementation Phase Projects and in accordance with the Consortium Agreements and Grant Agreements n° RI-261557, n° RI-283493, or n° RI-312763.

They solely reflect the opinion of the parties to such agreements on a collective basis in the context of the PRACE Implementation Phase Projects and to the extent foreseen in such agreements. Please note that even though all participants to the PRACE IP Projects are members of PRACE AISBL, these whitepapers have not been approved by the Council of PRACE AISBL and therefore do not emanate from it nor should be considered to reflect PRACE AISBL’s individual opinion.

Copyright notices

© 2014 PRACE Consortium Partners. All rights reserved. This document is a project document of a PRACE Implementation Phase project. All contents are reserved by default and may not be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contracts RI-261557, RI-283493, or RI-312763 for reviewing and dissemination purposes.

All trademarks and other rights on third party products mentioned in the document are acknowledged as own by the respective holders.