Table of Contents
- 1. Introduction
- 2. System Architecture / Configuration
- 2.1. Processor Architecture / MCM Architecture
- 2.2. Building Block Architecture
- 2.3. Memory Architecture
- 3. Programming Environment / Basic Porting
- 4. Performance Analysis
- 5. Debugging
- 6. European ARM-based Systems
- 6.1. Isambard
- 6.2. Mont-Blanc
- Further documentation
The ARM processor is widely used in mobile phones and has a reputation for being very energy efficient. This has piqued interest in using the processor as a building block in HPC systems, because energy efficiency has always been a major concern in developing Peta-scale and Exa-scale computers.
This best practice guide provides information about the ARM64 architecture and the programming models that programmers can use in order to achieve good performance with their applications on this architecture.
The guide gives a description of the hardware of the ARM64 processor. It provides information on the programming models and development environment, as well as on porting programs. Furthermore, it describes tools and strategies for analyzing and improving the performance of applications.
Finally, there is a description of test and production systems that already exist in Europe or are planned for the near future.
The ARM version used for testing is produced by HiSilicon (a company 100% owned by Huawei). The system used for evaluation is a Huawei TaiShan 2280 dual-socket system with the following characteristics:
Huawei TaiShan 2280 characteristics
- Dual socket system
- Processors: Two ARM Hi1616 64 bit, 32 core, 2.4 GHz processors, Cortex-A72 cores
- Memory: 256 GiB RAM, DDR4, 8×32 GiB, Micron, 2400 MHz
- Memory is distributed across four NUMA banks of 64 GiB each
- Network: 2x GbE + 2x 10GbE
HiSilicon 1616 processor cores
- Full ARMv8-A support
- Superscalar structure and out-of-order pipeline execution
- Dynamic branch prediction, including branch target buffer (BTB), global history buffer (GHB), return stack, and indirect prediction
- Prefetch is supported: a regular sequence of read addresses leads to the pre-loading of not-yet-accessed cachelines for subsequent addresses.
- Cacheline size 128 bytes (1024 bits)
- Level 1 cache: 1 MiB, 2-way set-associative
- Level 2 cache: 8 MiB, 16-way set-associative
- Level 3 cache: 32 MiB, 16-way set-associative
- SIMD instruction set (32× 128-bit registers)
The exact internal specifications of the Hi1616 processor are available in the processor manual from HiSilicon (available upon request). The Hi1616 accepts the compiler flags -march=armv8-a -mcpu=cortex-a72, i.e. it is a Cortex-A72 implementation of the ARMv8-A architecture.
The ThunderX 2K, one of the two platforms of the ThunderX® family, serves as Cavium's customer reference platform.
It serves foremost for testing state-of-the-art ARM hardware and software solutions, so that customers can evaluate ThunderX performance and features with their specific applications.
The ThunderX 2K reference platform is a 1/2 SSI form factor sled inside a 2U chassis. The 2U chassis can accommodate four such sleds, each of which is a dual-socket design.
The ThunderX 2K reference platform has several configurations, including CN8800-2S1N-128G-2.0-CDK. The configuration considered here has the CN8890-Series processor as a bootable system-on-chip (SoC). ThunderX implements a fully-compliant ARMv8 64-bit instruction set architecture (ISA). The processor has a dual-socket configuration with 48 cores per socket.
Cavium ThunderX CN8890-Series processor characteristics
- Dual Socket Configuration with 48 cores per socket
- 64-bit ARMv8 server SoC
- Cores running at up to 2.5 GHz, delivering up to 160 GOPS
- 78 KiB L1 instruction cache and 32 KiB L1 data cache per core, 16 MiB shared L2 cache
- 128 virtual NICs
- Package: 52.5×52.5 mm, 2601-ball BGA
The ThunderX family supports dual-socket solutions (Figure 1) using the Cavium Coherent Processor Interconnect (CCPI™). This allows the creation of nodes with up to 96 cores. The dual-socket CCPI systems allow the creation of many-core nodes. Developers can expand the number of nodes using Cavium's scalable fabric, available on some chips in the family. The fabric provides high-speed connectivity in an XY mesh with connections to the four nearest neighbors in a 2D fabric. The architecture also supports 3D fabric connections. Nodes connected via the fabric do not share I/O or memory, but they can deliver messages to applications anywhere in the fabric. This provides a way to build systems with thousands of nodes. Fabric monitoring and SLA features allow management and tracking of very large clusters.
Figure 1. Cavium ThunderX design for the dual-socket system
Figure 2. Cavium ThunderX2 ARM Processor
The second-generation ThunderX2® product family introduced by Cavium was released for general availability in early 2018. ThunderX2 is a family of 64-bit ARMv8 processors, based on Broadcom's original Vulcan design and rebranded by Cavium. ThunderX2 is fully compliant with the ARMv8 architecture and ARM's Server Base System Architecture (SBSA) standard.
In the ThunderX2 CN99XX series, the ARM-based system-on-chip (SoC) integrates high-performance custom fully out-of-order (OOO) cores, supporting up to 32 ARMv8.1 cores in a single-socket configuration and 64 cores in a dual-socket configuration, with frequencies of up to 2.5 GHz in nominal mode and 3 GHz in turbo mode. Simultaneous multithreading (SMT) is supported, allowing up to four threads per physical core.
The ThunderX2 ARM-based SoC is fully cache coherent across dual sockets using the second generation of the Cavium Coherent Processor Interconnect (CCPI2), with a speed of 600 Gbps. Each core has 32 KB L1 instruction and data caches, as well as a 256 KB L2 cache. The 32 MB L3 cache is distributed among the cores.
Each ThunderX2 ARM processor provides multiple (up to 8) DDR4 memory controllers per chip, with a capacity of up to 4 TB of memory in a dual-socket configuration. Up to 56 PCIe Gen3 lanes (supported widths include x1, x2, x4, x8 and x16) are provided, along with integrated I/O and SATAv3 ports.
An overview of the units within the HiSilicon Hi1616 processor is shown in Figure 3.
Dual-socket systems use the HCCS cache-coherence interconnect for the inter-chip connection. It is assumed that the cache-coherence system in the QIAN (Skyros coherent fabric) extends through the HCCS to form a dual-socket cache-coherent system.
Two memory controllers per processor yield four NUMA banks in a dual-socket system. See Section 2.3.1 for more on NUMA banks.
Figure 3. HiSilicon 1616 block diagram
An overview of the building blocks that Cavium has used to build the ThunderX is shown in Figure 4. Depending on the target market, some of these building blocks are removed to reduce power consumption or to increase the clock speed .
Figure 4. Cavium ThunderX building blocks
Figure 5 shows the design of the Cavium ThunderX2 CN99XX product in the single-socket configuration. A dual-socket configuration is also supported; the interconnect between the two sockets/NUMA nodes has been upgraded from the original CCPI used for the first-generation ThunderX processors to the Cavium Coherent Processor Interconnect 2 (CCPI2), which runs at 600 Gbps.
The nodes are connected with different interconnects according to the vendors' choices. Please refer to the chapter on European ARM-based systems for information on the interconnects of specific ARM-based systems.
Figure 5. Cavium ThunderX2 CN99XX product design (single socket) 
From a programmer's point of view, the command numactl -H provides an overview of the NUMA architecture. Below is the output obtained on the Huawei system:
```
$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 64353 MB
node 0 free: 59467 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 64446 MB
node 1 free: 62318 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 2 size: 64446 MB
node 2 free: 57445 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 3 size: 64443 MB
node 3 free: 91 MB
node distances:
node   0   1   2   3
  0:  10  15  20  20
  1:  15  10  20  20
  2:  20  20  10  15
  3:  20  20  15  10
```
A single ThunderX SoC can include up to 48 cores. Full cache coherence is maintained across all 96 cores of a node using CCPI, as shown in Figure 6. Each core has its own 78 KiB L1 instruction cache and 32 KiB L1 data cache. The cores on the same socket share a 16 MiB L2 cache; the Cavium ThunderX has no L3 cache. All cores in the system are cache coherent with respect to L1, L2 and DMA accesses. The node has 128 GiB of DDR4 DRAM per socket (288-pin UDIMM with ECC, up to 2133 MHz).
Figure 6. Two Cavium ThunderX SoC connected via CCPI
The main properties of the L1 and L2 caches are summarized in the following table (from ).
| Cache | Property | Value |
|-------|----------|-------|
| L1 data cache | Policy | Write-through |
| L1 instruction cache | Size | 78 KiB |
Full cache coherency is supported across the dual sockets. Each core has 32 KB L1 instruction and data caches; the L2 cache per core is 256 KB. The 32 MB L3 cache is distributed among the cores.
Multiple DDR4 72-bit memory controllers (up to 8) are supported on each chip, running at up to 2666 MHz in 1DPC and up to 2400 MHz in 2DPC configurations. The dual-socket configuration supports more than 1 TB and up to 4 TB of memory.
Please refer to the Chapter about “European ARM-Based systems” for the specific ARM-based systems’ memory architecture info, e.g. the NUMA architecture overview, etc.
There are several compilers available for the ARM architecture. The GNU set of compilers comes with the Linux distribution, while a set of open-source compilers (FLANG, CLANG, CLANG C++) for integration with LLVM comes from the public GitHub repository. This project is being spearheaded by the Lawrence Livermore, Sandia and Los Alamos national laboratories. The ARM HPC compiler is a set of compilers made by the ARM consortium, ARM HPC development tools. Cray’s CCE compilers, included in the Cray Programming Environment, are also supported for the ThunderX2 ARM processor. 
Compilers installed on Huawei High Silicon:
- GNU compiler suite gcc, gfortran, g++ (version 7.2.0)
- ARM compilers for HPC armclang, armclang++, armflang (version 18.0)
Currently the GNU and ARM compilers have been tested and evaluated on Ubuntu 16.04.
Compilers installed on Cavium ThunderX:
- GNU compiler suite gcc, gfortran, g++ (version 7.2.0)
- LLVM compilers clang, clang++, flang (version 6.0.1)
Currently the GNU and LLVM compilers have been tested and evaluated on Ubuntu 16.04.
Compilers installed on Cavium ThunderX2 (GW4 Isambard’s Cray XC50/ARM-based early access system)
- Cray compilers (included in PrgEnv-cray/6.0.5 with cce version 18.104.22.16823)
- GNU compiler suite gcc, gfortran, g++ (version 6.1.0 , 7.1.0, 7.2.0, 8.1.0)
- ARM compilers for HPC armclang, armclang++, armflang (version 18.1, 18.2, 18.3, 18.4, 18.4.2)
Currently the Cray, GNU, and ARM compilers have been tested and evaluated on the GW4 Isambard Cray XC50/ARM-based early access system with SUSE Linux 12.3.
Table 1. Suggested compiler flags for ARM HPC compilers
| Compiler | Suggested flags |
|----------|-----------------|
| armclang | -Ofast -march=armv8-a -mcpu=cortex-a72 -fomit-frame-pointer |
| armclang++ | -Ofast -march=armv8-a -mcpu=cortex-a72 -fomit-frame-pointer |
| armflang | -Ofast -march=armv8-a -mcpu=cortex-a72 -fomit-frame-pointer |
The ARM compiler flags do not differ very much from the GNU compiler flags. Loop unrolling etc. are similar, but prefetching will work differently, as there are no software prefetch instructions.
Table 2. Suggested compiler flags for GNU compilers
| Compiler | Suggested flags |
|----------|-----------------|
| gcc | -Ofast -march=armv8-a -mcpu=cortex-a72 -floop-optimize -falign-loops -falign-labels -falign-functions -falign-jumps -fomit-frame-pointer |
| g++ | -Ofast -march=armv8-a -mcpu=cortex-a72 -floop-optimize -falign-loops -falign-labels -falign-functions -falign-jumps -fomit-frame-pointer |
| gfortran | -Ofast -march=armv8-a -mcpu=cortex-a72 -floop-optimize -falign-loops -falign-labels -falign-functions -falign-jumps -fomit-frame-pointer |
Table 3. Suggested compiler flags for GNU compilers
| Compiler | Suggested flags |
|----------|-----------------|
| gcc | -Ofast -march=armv8-a -mcpu=thunderxt88 -fopenmp -floop-optimize -falign-loops -falign-labels -falign-functions -falign-jumps -fomit-frame-pointer |
| g++ | -Ofast -march=armv8-a -mcpu=thunderxt88 -fopenmp -floop-optimize -falign-loops -falign-labels -falign-functions -falign-jumps -fomit-frame-pointer |
| gfortran | -Ofast -march=armv8-a -mcpu=thunderxt88 -fopenmp -floop-optimize -falign-loops -falign-labels -falign-functions -falign-jumps -fomit-frame-pointer |
Table 4. Suggested compiler flags for LLVM compilers
| Compiler | Suggested flags |
|----------|-----------------|
| clang | -Ofast -march=armv8-a -mcpu=thunderxt88 -fopenmp -fomit-frame-pointer |
| clang++ | -Ofast -march=armv8-a -mcpu=thunderxt88 -fopenmp -fomit-frame-pointer |
| flang | -Ofast -march=armv8-a -mcpu=thunderxt88 -fopenmp -fomit-frame-pointer |
The -mcmodel=large option was used with all compilers to compile the FT application. This application uses more than 2 GB of memory per thread/process and therefore requires the large memory model.
Table 5. Suggested compiler flags for Cray compilers
| Compiler | Suggested flags |
|----------|-----------------|
| cc (C compiler wrapper) | default options |
| CC (C++ compiler wrapper) | default options |
| ftn (Fortran compiler wrapper) | default options |
Table 6. Suggested compiler flags for GNU compilers
| Compiler | Suggested flags |
|----------|-----------------|
| gcc | -Ofast -march=armv8.1-a -mcpu=thunderx2t99 -mtune=thunderx2t99 -fopenmp -funroll-loops |
| g++ | -Ofast -march=armv8.1-a -mcpu=thunderx2t99 -mtune=thunderx2t99 -fopenmp -funroll-loops |
| gfortran | -Ofast -march=armv8.1-a -mcpu=thunderx2t99 -mtune=thunderx2t99 -fopenmp -funroll-loops |
Table 7. Suggested compiler flags for ARM HPC compilers
| Compiler | Suggested flags |
|----------|-----------------|
| armclang | -Ofast -march=armv8.1-a -mcpu=native -ffast-math -fopenmp |
| armclang++ | -Ofast -march=armv8.1-a -mcpu=native -ffast-math -fopenmp |
| armflang | -Ofast -march=armv8.1-a -mcpu=native -ffast-math -fopenmp |
The well-known set of benchmarks found in the NPB suite is used for several examples in this guide. The different compilers show varying performance on the different NPB benchmarks. The figure below shows the performance recorded using the OpenMP version of the NPB benchmarks. The OpenMP version is chosen over MPI because the OpenMP thread library is an integral part of the compiler and should be evaluated together with the code generation. From the figure it is evident that they all do a reasonably good job.
The High Performance Conjugate Gradients (HPCG) benchmark is gaining more and more interest, as the Linpack benchmark used to rank the 500 fastest systems in the world has some shortcomings. HPCG generally yields a very low processor efficiency because the benchmark is highly memory bound.
Figure 7. Compiler scaling performance using the ARM compiler
The figure above shows that not all of the benchmarks in the NPB suite scale equally well; some have limited suitability for parallel execution using a thread model like OpenMP. The results are also a performance measure of the OpenMP library used. The GNU and ARM compilers use different OpenMP libraries: GNU uses the default supplied with the Linux distribution, while the ARM compiler supplies its own library. The figure below illustrates the difference in scaling when using different OpenMP libraries.
Figure 8. Compiler scaling performance using the GNU compilers
Figure 9. Compiler performance comparison
There are some differences in performance between the ARM compiler and the GNU compilers. The GNU compilers seem to have an edge with integer sort (IS, a bucket sort, written in C) and unstructured-grid benchmarks (unstructured memory access, written in Fortran) such as UA.
Figure 10. Compiler performance comparison
The HPCG benchmark is compiled using the reference version of the linear algebra and other functions that are normally called from an optimized library. This is deliberate, as the test should illustrate the different compilers' ability to generate efficient code. It is clear that all the C++ compilers tested generate code that performs this task reasonably well.
It is interesting to note that performance drops as soon as the benchmark footprint spans more than one NUMA memory node. The total memory of 256 GiB is made up of four NUMA nodes of 64 GiB each. This performance drop is a consequence of the different costs of memory accesses in a Non-Uniform Memory Access (NUMA) system. The problem might well be addressed using a hybrid model with one MPI rank per NUMA node and 16 threads per rank, keeping the fine-grained memory accesses within the local NUMA node. It is beyond the scope of this guide to go into the details of hybrid programming.
As always with HPCG, the absolute performance compared to the theoretical peak performance is very low. This is part of the reason why this benchmark now attracts growing interest and is used, in the top500 HPCG list, as an alternative benchmark alongside the top500 HPL benchmark.
The compiler scaling performance of Cavium ThunderX using the GNU compiler is shown in Figure 11.
Figure 11. Compiler scaling performance of Cavium ThunderX using the GNU compiler
The compiler scaling performance of Cavium ThunderX for the LLVM compiler is shown in Figure 12.
Figure 12. Compiler scaling performance of Cavium ThunderX using the LLVM compiler
Figure 13. Compiler performance comparison
There are some differences in performance between the GNU and LLVM compilers.
The ARM Performance Libraries are a set of libraries containing BLAS, LAPACK and FFT (including FFTW wrappers), in both serial and multithreaded (mp) versions, for both 32-bit and 64-bit (int64/ilp64) integers.
Only dynamically linked versions of the ARM Performance Libraries are available.
Documentation is available in different formats at the ARM.com web site.  The BLAS and LAPACK syntax should be compatible with the usual Netlib syntax. There is also an interface for FFTW so applications written to use FFTW can be compiled without changes.
Another frequently used performance library is OpenBLAS. While not vendor-optimised, it is often used in HPC. It builds on the framework of the well-known Goto and Goto2 libraries (see the OpenBLAS wiki page). OpenBLAS has been ported to the ARM architecture and works well with the top500 HPL benchmark.
For this guide, only a superficial study of library performance is performed: a quick test of the most common functions, both single-threaded and multithreaded. The goal is to show that libraries exist for the most common functions needed by scientific programs.
The well-known top500 test HPL is a good candidate for evaluating the two performance libraries head to head. While it uses only a few routines from the library, it will expose differences. The HPL test is an MPI program, and hence only single-threaded versions of the BLAS library are tested. Hybrid models could also be tested, but they are beyond the scope of this guide. In the HPL test, OpenBLAS seems to perform quite well.
As with the compilers, the threading model can vary from library to library; this is addressed in the next test below.
Figure 14. Performance Libraries running HPL using 64 cores
Multithreaded libraries normally have their own threading model and slightly different algorithms for the different routines. The thread performance of the libraries can be tested using the commonly used matrix-matrix multiplication (dgemm) test. While this is only a single routine, it is nevertheless one of the best known and is commonly selected for performance evaluation.
Figure 15. Performance Libraries testing dgemm
Fast Fourier Transforms are widely used and are also implemented in the ARM Performance Libraries. The scaling of FFTs is often limited, and with the current two-dimensional test the scaling was found to be limited for both FFTW and ARMpl. Hence the performance evaluation below is run single-threaded on one core. The figure below shows run times (where lower is better) for the commonly used FFTW library and the ARM Performance Libraries.
Figure 16. Performance Libraries testing fft (single threaded, run times means that lower is better)
The performance picture is somewhat mixed; it is not easy to make good FFT implementations. The results obtained with the ARM Performance Libraries are not stable (the reason for this is unknown), while the FFTW library has been around for a long time and is more mature. Clearly, more focus should be put on the FFT implementations. It must be noted that only two-dimensional FFT functions have been tested.
The open-source OpenMPI works nicely on the ARM architecture. Both the configure and the make process run smoothly, and the resulting binaries and libraries work well.
OpenMP is implemented in a range of compilers; it is normally part of the compiler. Some compilers come with an embedded OpenMP library, while others rely on the GNU gomp library. Most OpenMP libraries use the POSIX thread library (libpthread) that comes with the Linux installation. Performance differences are therefore expected when evaluating multithreaded applications built with different compilers.
Performance is a crucial characteristic of most applications, whether serial or parallel. Performance analysis is usually a challenging process, as it requires a lot of effort and skill from the analyst. Fortunately, a number of tools and frameworks facilitate application code analysis and optimization. These tools apply two major techniques: profiling and tracing.
Profiling is a general term for an inventory of application events (which might be methods, functions, classes, etc.) and timings for their execution, based on hardware-counter information obtained by sampling (interrupting the program at a given time interval). A profile typically contains aggregated statistics (minimum, maximum, average, maximal deviation, etc.) for all application events. The profiling information is largely static: it does not contain any details about the timing of the individual events, only their consolidated values. A typical result of profiling is a call graph (Figure 17), which shows the inclusive costs (absolute time or a percentage of the total execution) of the most time-consuming events, but does not give details on when these parts were actually running.
However, exactly this latter information is required in order to trace back the in-depth statistics of every particular call during the application's execution, for example when analyzing the bottlenecks of MPI communication. This task can be accomplished by tracing: a technique that logs the timing of all application events (i.e. their start and end timestamps). A trace (Figure 18) allows the exact duration of each event to be retrieved and aligned with the other events, thus enabling the identification of sources of performance degradation.
Figure 17. Example of a profile
Figure 18. Example of a trace
The ecosystem of performance analysis tools spans a wide range of frameworks for profiling, tracing and (most often) a combination of both techniques. Below is an overview of the most widespread tools for standard CPUs, as well as tools for ARM SoCs, which can be used on high-performance ARM systems.
Perf is a profiler tool for Linux 2.6+ based systems, which can also create traces.
Ftrace is the open-source tracing framework for the Linux kernel .
LTTng is an open source tracing framework for Linux .
Extrae is an instrumentation package that captures information during program execution and generates Paraver and Dimemas traces. The information collected by Extrae includes entry to and exit from the programming-model runtime, hardware counters, call-stack references, user functions, periodic samples and user events. Extrae is distributed under the LGPL license.
The Score-P (Scalable Performance Measurement Infrastructure for Parallel Codes)  measurement infrastructure is a highly scalable and easy-to-use tool suite for profiling, event tracing, and online analysis of HPC applications. Score-P offers the user a maximum of convenience by supporting a number of analysis tools. Currently, it works with Periscope, Scalasca, Vampir, and Tau and is open for other tools. Score-P comes together with the new Open Trace Format Version 2, the Cube4 profiling format and the Opari2 instrumenter. Score-P is available under the New BSD Open Source license .
ARM DS-5 Streamline performance analyzer
Commercial tools such as the ARM DS-5 Streamline performance analyzer combine the functionality of a sample-based profiler with kernel trace data and processor performance counters, providing high-level visibility of how applications make use of kernel and system-level resources.
Callgrind is an open-source profiler using execution-driven cache simulation via dynamic runtime instrumentation (provided by the open-source project Valgrind). This way, it can directly analyze cache behavior of program binaries for x86, x86-64, PPC32/64 and ARM. It comes with the visualization GUI KCachegrind, which provides various views of performance data such as annotated call graphs and tree maps for call nesting, as well as annotation of source and machine code .
Cray Performance Analysis Tool (CrayPAT) is the full-featured performance analysis tool set provided on Cray systems, including the ARM-based Cray XC50. CrayPAT-lite is a simplified, easy-to-use version of CrayPAT which provides basic performance analysis information automatically, with a minimum of user interaction, i.e. less overhead. Using CrayPAT-lite can be a good starting point for users to explore the code behaviour and decide whether to use the full-featured CrayPAT for further performance profiling and analysis. The latest version of the Cray Performance Measurement and Analysis Tools User Guide can be found here: .
Arm Forge combines Arm DDT (the graphical debugger) and Arm MAP (the parallel profiler). Arm Forge is a cross platform tool and supports multiple parallel architectures and models, including MPI, UPC, CUDA and OpenMP. The Arm Forge User Guide can be found here: . Further info on Arm DDT and Arm MAP are introduced below.
Arm MAP is a parallel profiler that supports MPI, OpenMP and single-threaded programs. Arm MAP supports both interactive and batch modes for gathering profile data. Further info on Arm MAP can be found here: . The Arm MAP User Guide is provided here: .
Most of the performance tools (e.g. Extrae, Score-P) use performance counters in order to gather information regarding the application performance. The access to these counters is critical for performance analysis of the parallel applications. The hardware counters are available through standard libraries, e.g. Linux perf and PAPI (Performance Application Programming Interface). PAPI provides the tool designer and application engineer with a consistent interface and methodology for use of the performance counter hardware found in most major microprocessors .
However, hardware counters are not always available on a SoC by default. In the framework of the Mont-Blanc project, PAPI support for advanced performance analysis on the Cavium ThunderX CN8890 SoC was provided . In order to gain access, the following steps were performed:
- enabling support for the ThunderX PMU in the Linux kernel;
- extending the PAPI event definitions to support the ThunderX SoC and its hardware counters.
The ARMv8 architecture foresees that performance counters can be accessed via a PMU (Performance Monitoring Unit). The PMU is handled as component 0 in PAPI, which is the default CPU component. However, the PMU hardware module is optional in the ARMv8 architecture definition. The availability of the PMU also depends on the kernel version of the operating system.
Moreover, the extension of PAPI to support Cavium ThunderX was provided by the authors . This made it possible to install Extrae and Paraver, with whose help the performance analysis of the LB (Lattice Boltzmann) application was performed.
The perf package was installed on the Huawei HiSilicon system. Below are the commands for working with this tool.
```
# To install perf
sudo yum install perf

# To show the list of all supported hardware and software events
perf list

# To collect basic statistics for the IS benchmark
perf stat -e task-clock,cycles,instructions,branch-misses ./bin/is.C.x
...
 Performance counter stats for './bin/is.C.x':

      44296,251400 task-clock:u (msec)   # 3,852 CPUs utilized
   105.470.817.037 cycles:u              # 2,381 GHz
    61.900.885.161 instructions:u        # 0,59 insn per cycle
            53.159 branch-misses:u

      11,498640056 seconds time elapsed

# To collect profile data
perf record ./bin/is.C.x

# To display the perf.data from the current directory
perf report -n
```
The perf profile for the IS benchmark is represented in Figure 19.
Figure 19. The perf profile for the IS benchmark
The Score-P v4.1 measurement infrastructure was installed on the Huawei HiSilicon system. The IS benchmark from the NPB3.3.1-OMP suite was selected as an example application for profiling. A profile created using Score-P is given in Figure 20. Cube is used to visualize the information from the profile. Cube is a generic tool for displaying a multi-dimensional performance space consisting of the dimensions (i) performance metric, (ii) call path, and (iii) system resource. Each dimension can be represented as a tree, where non-leaf nodes of the tree can be collapsed or expanded to achieve the desired level of granularity. In addition, Cube can display multi-dimensional Cartesian process topologies.
Figure 20. Visualization of the IS profile using the Cube
Arm DDT is a graphical debugger that can be used in many different development environments, such as MPI codes, OpenMP codes, GPU codes, and hybrid codes. Further info on Arm DDT can be found here: . The Arm DDT User Guide is provided here: .
GDB4HPC, a GDB-based parallel debugger included in the Cray Programming Environment (Cray PE), can be used to debug Fortran and C/C++ applications using MPI, SHMEM, PGAS and OpenMP.
To compile the code to debug, the debug flag should be added to the compile command. For most compilers this flag is -g.
Please refer to the Arm DDT User Guide for further info on the compiler flags to use/avoid when debugging with Arm DDT: .
GW4 Isambard, the first Arm-based Cray XC50 ’Scout’ system , was due to arrive in 2018 as part of the UK’s national Tier-2 HPC service. The system, funded by EPSRC, is delivered by the GW4 Alliance together with the UK’s Met Office, Cray, Arm and Cavium.
The full Isambard system is a Cray XC50 containing 10,496 cores. Each of the 164 compute nodes has two 32-core Cavium ThunderX2 processors with a clock rate of 2.1 GHz, and each processor has eight 2666 MHz DDR4 channels. Isambard uses the Cray Aries interconnect with a Dragonfly topology to connect the nodes. 480 terabytes of Lustre storage are provided by a Cray Sonexion 3000 storage cabinet. The Isambard User Documentation can be found here:
Account requests for Isambard should be submitted via SAFE for the EPSRC Tier-2 systems. Further information can be found here: .
Guidance on how to connect to Isambard can be found here:
The production environment on the Isambard Phase 2 – XC50 ARM system, including the available modules, can be found here: .
A full Cray XC50 CPE (Cray Programming Environment) is to be provided on Isambard.
Figure 21. Cray Programming Environment for XC50 with ARM 
Further information about Isambard usage can be found in the Isambard User Documentation here: 
There are also a number of useful materials which are good sources for better understanding and using Isambard:
- McIntosh-Smith, S., Price, J., Deakin, T., and Poenaru, A. Comparative Benchmarking of the First Generation of HPC-Optimised Arm Processors on Isambard. Cray User Group, CUG 2018, Stockholm, Sweden. 
- McIntosh-Smith, S., Price, J., Deakin, T., and Poenaru, A. A Performance Analysis of the First Generation of HPC-Optimised Arm Processors. In press, Concurrency and Computation: Practice and Experience, Feb 2019. 
- Isambard’s benchmarks repository, which contains build instructions, run scripts and output files for all the benchmarks presented in the paper above:
- The “isambard list” on the ARM HPC packages wiki:
Since October 2011, the aim of the series of European projects called Mont-Blanc has been to design a new type of computer architecture capable of setting future global HPC standards, built from energy-efficient Arm solutions.
The first phase of the Mont-Blanc project established the following goals: to design and deploy a sufficiently large HPC prototype system based on current mobile commodity technology; to port and optimize the software stack and enable its use for HPC; and to port and optimize a set of HPC applications to be run on this system. The first phase of Mont-Blanc was coordinated by the Barcelona Supercomputing Center (BSC), and the first goal of the project was achieved with the deployment of the Mont-Blanc 1 Prototype at the BSC facilities. The project investigated the use of low-power Arm processors for HPC and gave rise to the world’s first Arm-based HPC cluster, which helped demonstrate the viability of using Arm technology for HPC.
Phase 2 of the project was again coordinated by the BSC. Two years after the start of the initial project, the European Commission extended the Mont-Blanc project activities until September 2016. This three-year extension enabled further development of the OmpSs parallel programming model to automatically exploit multiple cluster nodes, transparent application checkpointing for fault tolerance, support for ARMv8 64-bit processors, and the initial design of the Mont-Blanc Exascale architecture.
The third phase of the Mont-Blanc project started in October 2015; it is coordinated by Bull (Atos group) and funded by the European Commission under the Horizon 2020 programme. The third phase adopts a co-design approach to ensure that hardware and system innovations are readily translated into benefits for HPC applications. It aims at designing a new high-end HPC platform that is able to deliver a new level of performance / energy ratio when executing real applications. As part of Phase 3 of Mont-Blanc, Atos built a new prototype. It is named Dibona, after the Dibona peak in the French Alps, and it started operation in fall 2017.
Following on from the three successive Mont-Blanc projects since 2011, the three core partners Arm, Barcelona Supercomputing Center and Bull (Atos Group) have united again to trigger the development of the next generation of industrial processor for Big Data and High Performance Computing. The Mont-Blanc 2020 consortium also includes CEA, Forschungszentrum Jülich, Kalray, and SemiDynamics. It intends to pave the way to the future low-power European processor for Exascale. To improve the economic sustainability of the processor generations that will result from the Mont-Blanc 2020 effort, the project includes the analysis of the requirements of other markets.
The Mont-Blanc 1 Prototype has 63 blades, each containing 15 SDB (Samsung Daughter Board) nodes. The Mont-Blanc compute node is a server-on-module architecture. Each SDB is built around a Samsung Exynos 5250 mobile SoC integrating ARM Cortex-A15 CPUs at 1.7 GHz in a dual-core configuration sharing 1 MB of on-die L2 cache, and a mobile ARM Mali-T604 GPU. The SoC connects to the on-board 4 GB of LPDDR3-1600 RAM through two memory channels shared among the CPUs and GPU, providing a peak memory bandwidth of 12.8 GB/s. The node’s interconnect interface is provided through an on-die USB 3.0 interface utilizing a discrete on-board NIC (integrating a USB 3.0 to 1 Gb Ethernet bridge), with additional fabric providing the Ethernet PHY. An external 16 GB µSD card is connected to a µSD memory slot and provides the boot-loader, the OS system image, and local scratch storage. Each blade hosts 15 Mont-Blanc nodes, which are interconnected through an on-board 1 Gb Ethernet switch fabric; the switch provides two 10 GbE up-links. The operating system is Ubuntu 14.04, kernel 3.11.0. Each node has a 15 GB root partition; the /home and /apps folders are mounted over the network (Lustre) from the file servers and are shared among all the nodes.
Figure 22. Rack Mont-Blanc 1 Prototype
During the Mont-Blanc 2 project, several ARM-based clusters were deployed. We use the term mini-clusters to refer to the set of platforms deployed within the project, based on ARM SoCs from both the mobile and server markets. These platforms are usually deployed in the form of a small set of computational nodes made accessible as standard HPC clusters.
Table 8. Mini-clusters
| Cluster | Number of CPUs in the cluster (CPUs/node) | Architecture | Feature of Interest |
|---|---|---|---|
| Jetson (Jetson TK1) | 28 (4) | ARMv7-A | Ethernet via PCIe |
| Octodroid (ODROID-XU3) | 112 (8) | ARMv7-A | big.LITTLE |
| Merlin (Applied Micro X-Gene 2) | 24 (8) | ARMv8 | L3 cache |
| Thunder (ThunderX) | 480 (96) | ARMv8 | NUMA / CPU density |
| Jetson-TX (Nvidia Jetson TX1) | 56 (4) | ARMv8 | Cortex-A57 implementation |
The Mont-Blanc Dibona prototype has 48 nodes, each node includes:
- 2 ThunderX2 CPUs (32 cores per CPU, 64 per node, each core at 2 GHz, 32 MB L3 cache)
- 128 GB of main memory per node (16 DDR4 DIMM slots, 8 channels per CPU)
- 128 GB local storage (+ 8 TB NFS)
The Mont-Blanc Dibona prototype is equipped with a 100 Gb/s EDR interconnect in a fat-tree topology with a pruning factor of ½ at the L1 level. The theoretical peak performance of the Dibona prototype is approximately 49 Tflop/s.
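The quoted peak can be reproduced with a back-of-the-envelope calculation, assuming 8 double-precision FLOPs per cycle per core for ThunderX2 (two 128-bit NEON FMA pipelines; this per-core figure is an assumption not stated above):

```shell
# Sanity check of Dibona's theoretical peak from the node specification.
nodes=48
cores_per_node=64
clock_ghz=2
flops_per_cycle=8   # assumed: 2 FMA pipes x 2 DP lanes x 2 ops (mul+add)
peak_gflops=$(( nodes * cores_per_node * clock_ghz * flops_per_cycle ))
echo "${peak_gflops} Gflop/s"   # ~49 Tflop/s, matching the quoted figure
```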
Figure 23. Rack Mont-Blanc 3 Prototype
Users must use Secure Shell (ssh) tools to log in to the cluster or transfer files to it. On the Mont-Blanc 1 Prototype, users connecting to the cluster (mb.bsc.es) are logged in to one of the login nodes at random. On the Mont-Blanc mini-clusters, users log in to ssh.hca.bsc.es.
The Environment Modules package (http://modules.sourceforge.net/) provides a dynamic modification of a user’s environment via modulefiles. Each modulefile contains the information needed to configure the shell for an application or a compilation. Modules can be loaded and unloaded dynamically, in a clean fashion.
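A typical Environment Modules session might look like the following; the module names are illustrative examples, not the actual names installed on the Mont-Blanc systems:

```shell
module avail             # list the modulefiles available on the system
module load gnu/7.2.0    # configure the shell for a particular compiler
module list              # show currently loaded modules
module unload gnu/7.2.0  # cleanly undo the environment changes
```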
The SLURM job scheduler is the tool selected by the Mont-Blanc consortium as the load balancing engine for the mini-clusters. SLURM is an open source and highly scalable job scheduling system for Linux clusters.
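A minimal SLURM batch script for such a cluster might look like this (node counts, task counts, time limit, and the binary name are illustrative assumptions):

```shell
#!/bin/bash
#SBATCH --job-name=is_bench      # job name shown in the queue
#SBATCH --nodes=2                # number of compute nodes
#SBATCH --ntasks-per-node=8      # MPI ranks per node
#SBATCH --time=00:30:00          # wall-clock limit

# Launch the (hypothetical) benchmark binary across the allocation.
srun ./is.C.x
```

The script is submitted with `sbatch job.sh`, and `squeue -u $USER` shows its status in the queue.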
On the Mont-Blanc prototypes and the mini-clusters, different compilers and programming models are available (GNU Compiler Suite, ARM Compiler, ARM HPC Compiler, MPICH, OpenMPI, OmpSs), as well as scientific libraries (ARM Performance Libraries, ATLAS, clBLAS, etc.) and developer tools (Extrae, Paraver, Allinea DDT, PAPI).
Figure 24. Mont-Blanc Software Stack
 Best Practice Guide – Intel Xeon Phi, January 2017, http:/
 PRACE Webpage, http://www.prace-ri.eu/.
 Cavium ThunderX ARM processors, https:/
 Cavium ThunderX design for the dual-socket system, http:/
 ThunderX2® ARM Processors, https:/
 ThunderX2 – Cavium, https:/
 ThunderX2® CN99XX Product Brief, https:/
 Investigating Cavium’s ThunderX: The First ARM Server SoC With Ambition, https:/
 Puzovic, M., Quantitative and Comparative Analysis of Manycore HPC Nodes, http://www.oerc.ox.ac.uk/sites/default/files/uploads/ProjectFiles/CUDA/ Presentations/2017/2017-06-14-Milos-Puzovic.pdf.
 Cray XC50 ARM Product Brief, https:/
 Isambard User Documentation, https:/
 Isambard User Documentation – Request Account, https:/
 Isambard User Documentation – Connecting to Isambard, https:/
 Isambard User Documentation – Phase 2 XC50 ARM, https:/
 Isambard Benchmarks Repository, https:/
 Isambard-list on ARM HPC package wiki, https://gitlab.com/arm-hpc/packages/wikis/categories/isambard-list .
 The Cray Programming Environment https://www.cray.com/sites/default/files/SB-Cray-Programming-Environment.pdf .
 Cray Performance Measurement and Analysis Tools User Guide, https://pubs.cray.com/content/S-2376/7.0.0/cray-performance-measurement-and-analysis-tools-user-guide/ .
 Arm Forge User Guide, https://developer.arm.com/docs/101136/latest/arm-forge .
 Arm MAP webpage, https://developer.arm.com/products/software-development-tools/hpc/arm-forge/arm-map .
 Arm MAP User Guide, https://developer.arm.com/docs/101136/latest/map .
 Arm DDT webpage, https://developer.arm.com/products/software-development-tools/hpc/arm-forge/arm-ddt .
 Arm DDT User Guide, https://developer.arm.com/docs/101136/latest/ddt .
 PRACE Public Deliverable 7.6 Best Practice Guides for New and Emerging Architectures, http:/
 NPB Benchmark. The NAS Parallel Benchmarks (NPB).
 The High Performance Conjugate Gradients (HPCG) Benchmark. The HPCG Benchmark project is an effort to create a new metric for ranking HPC systems.
 The High Performance Conjugate Gradients (HPCG) Benchmark TOP500 list.
 OpenBLAS library wiki.
 McIntosh-Smith, S., Price, J., Deakin, T., and Poenaru, A. Comparative Benchmarking of the First Generation of HPC-Optimised Arm Processors on Isambard. Cray User Group, CUG 2018, Stockholm, Sweden. http:/
 McIntosh-Smith, S., Price, J., Deakin, T., and Poenaru, A. A Performance Analysis of the First Generation of HPC-Optimised Arm Processors. In press, Concurrency and Computation: Practice and Experience, Feb 2019. DOI: 10.1002/cpe.5110.
 D Ernst, The Arm Technology Ecosystem: Current Products and Future Outlook, 2018. https:/
 McCraw, H., Terpstra, D., Dongarra, J., Davis, K., Musselman, R. Beyond the CPU: Hardware Performance Counter Monitoring on Blue Gene/Q .
 It should also be noted that the vector instructions have been renamed from version 7 to version 8. They no longer start with the prefix ‘v’; for example, ‘vmul d0, d0, d1’ in version 7 becomes ‘mul v0.u8, v0.u8, v1.u8’ in version 8.
 For each compiler, a significant dispersion in the runtimes of the benchmarks was observed during these experiments. Values with the greatest deviation from the mean were removed, and the plots were built without taking them into account. In addition, the processor exhibits thermal problems: it overheats during computation on all cores.