Best Practice Guide – Cray XE/XC

Best Practice Guide – Cray XE/XC (V4.1)

Andrew Turner

EPCC

Xu Guo

EPCC

Lilit Axner

KTH

Mark Filipiak

EPCC

11 December 2013


Table of Contents

1. Introduction
1.1 PRACE Cray XE/XC/XT Systems
1.2 Useful Links
2. System Architecture and Configuration
2.1 Cray XE
2.1.1 Processor architecture / MCM architecture
2.1.2 Building block architecture
2.1.3 Memory architecture
2.1.4 Interconnect
2.1.5 I/O subsystem architecture
2.1.6 Available file systems
2.1.7 Operating system (CLE)
2.2 Cray XC
2.2.1 Processor architecture / MCM architecture
2.2.2 Building block architecture
2.2.3 Memory architecture
2.2.4 Interconnect
2.2.5 I/O subsystem architecture
2.2.6 Available file systems
2.2.7 Operating system (CLE)
3. System Access
3.1 HERMIT
3.2 HECToR
3.3 Lindgren
3.4 Sisu
3.5 Archer
3.6 Making access more convenient using an SSH Agent
3.6.1 Set up an SSH key pair protected by a passphrase
3.6.2 Copy the public part of the key to the remote host
3.6.3 Enabling the SSH Agent
3.6.4 Adding access to other remote machines
3.6.5 SSH Agent forwarding
4. Programming Environment / Basic Porting
4.1 Modules environment
4.2 Compiler wrapper commands
4.3 Available compilers
4.3.1 Partitioned Global Address Space (PGAS) compiler support
4.4 Available (vendor optimised) numerical libraries
4.5 Available MPI implementations
4.6 OpenMP
4.6.1 Compiler flags
4.7 SHMEM
5. Batch system/job command language
5.1 Basic batch system commands
5.2 Hyper-Threading (Cray XC systems only)
5.3 Job submission example scripts for parallel jobs using MPI
5.4 Job submission example scripts for parallel jobs using OpenMP
5.5 Multiple ‘aprun’ commands in a single job script
5.5.1 Requesting the correct number of cores
5.5.2 Multiple ‘aprun’ syntax
5.5.3 Example job submission script
5.6 Job arrays
5.6.1 Example job array submission script
5.6.2 Submitting job arrays using PBSPro
5.6.3 Submitting job arrays using Torque
5.6.4 Submitting job arrays using SLURM
5.6.5 Interacting with individual job instances in an array
5.7 Interactive jobs
5.8 Selecting nodes with particular attributes
5.8.1 Identifying nodes with particular attributes
5.9 Writing job submission scripts in Perl and Python
5.9.1 Example Perl job submission script
5.9.2 Example Python job submission script
6. Performance analysis
6.1 Available Performance Analysis Tools
6.2 Cray Performance Analysis Tool (CrayPAT)
6.2.1 Instrumenting a code with pat_build
6.2.2 Analysing profile results using pat_report
6.2.3 Using hardware counters
6.3 Reveal
6.3.1 Using Reveal
6.4 Allinea MAP
6.5 General hints for interpreting profiling results
6.5.1 Spotting load-imbalance
7. Tuning
7.1 Optimisation summary
7.2 Serial (single-core) optimisation
7.2.1 Compiler optimisation flags
7.2.2 Using Libraries
7.2.3 Writing Optimal Serial Code
7.2.4 Cache Optimisation
7.3 Parallel optimisation
7.3.1 Load-imbalance
7.3.2 MPI Optimisation
7.3.3 Mapping tasks/threads onto cores
7.3.4 Core specialisation
7.4 Advanced OpenMP usage
7.4.1 Environment variables
7.5 Memory optimisation
7.5.1 Memory affinity
7.5.2 Memory allocation (malloc) tuning
7.5.3 Using huge pages
7.6 I/O optimisation
8. Debugging
8.1 Available Debuggers
8.2 TotalView
8.2.1 Example, Debugging an MPI application
8.2.2 Example: Using Totalview to debug a core file
8.2.3 TotalView Limitations
8.3 Cray ATP
8.3.1 ATP Example
8.4 GDB (GNU Debugger)
8.4.1 Useful lgdb commands

1. Introduction

This best-practice guide is designed to help users get the best productivity out of the PRACE Cray XE and XC (and to a lesser extent XT) systems. We will cover a wide range of topics including:

  • Architecture

  • Code porting and compilation

  • Debugging tools

  • Serial optimisation and compiler flags

  • Parallel optimisation

  • I/O best-practice and optimisation

  • Performance analysis tools

We will focus on providing information that is generally applicable to all Cray XE/XC systems. There are considerable differences in the architecture of Cray XE and XC systems, so these are described separately in Section 2. Most of the rest of the guide is common to Cray XE and XC systems. Where there are differences between sites (e.g. the login process), we provide the information for each site and links to further information available from the site itself.

1.1 PRACE Cray XE/XC/XT Systems

Cray XE systems are hosted by a number of PRACE partner sites including:

Tier-0 – HERMIT, HLRS, Germany Hermit is a PRACE Tier-0 system located at HLRS, Germany. The system is being installed in three steps: it is currently at Installation Step 1, and the final step, Installation Step 2, is planned for 2014, when a fully integrated Phase 1 system will be available.

Tier-1 – HECToR, EPCC, UK HECToR (High-End Computing Terascale Resource) is the UK’s front-line national supercomputing service, which is provided by the HECToR Partners including EPCC. The HECToR service consists of a Cray XE6 supercomputer (phase 3) with a peak performance of greater than 800 TFlops, a high-performance parallel file system (esFS), a GPU testbed machine and an archive facility.

Tier-1 – Lindgren, KTH, Sweden Lindgren is a Cray XE6 system, based on the AMD Opteron 12-core “Magny-Cours” (2.1 GHz) processors and the Cray Gemini interconnect technology. It is located at PDC Center for High Performance Computing. It has 16 racks with a theoretical peak performance of 305 TFlops. Lindgren was ranked in place 31 amongst the 500 most powerful computer systems in the world (Top500, June 2011). Lindgren is named after the Swedish 20th Century children’s book author Astrid Lindgren.

Cray XC systems are hosted by a number of PRACE partner sites including:

Tier-1 – Sisu, CSC, Finland Sisu is a Cray XC30 system installed at CSC’s Kajaani site. The first stage has been completed, providing 250 TFlops theoretical peak performance. The second stage will be completed in 2014, to bring Sisu into the petaflops class.

Tier-1 – Archer, EPCC, UK Archer will be the UK’s front-line national supercomputing service, replacing HECToR in 2014, and is provided by the Archer Partners including EPCC. The Archer service consists of a Cray XC30 supercomputer with a peak performance of greater than 1.5 PFlops, a high-performance parallel file system (Lustre), and an archive facility.

This guide may also be useful for users of the Cray XT systems hosted by PRACE partner sites.

2. System Architecture and Configuration

This section is split into two parts: one for Cray XE systems, one for Cray XC systems.

2.1 Cray XE

This section provides an overview of the Cray XE architecture and configurations. Cray XE systems generally consist of two types of nodes: service nodes and compute nodes. Service nodes are used for a variety of tasks on the system including acting as login nodes.

2.1.1 Processor architecture / MCM architecture

2.1.1.1 Compute node hardware

Cray XE compute nodes contain two AMD Opteron processors. Depending on the site these can be either 12-core Magny-Cours or 16-core Interlagos processors. These individual processors are connected to each other by HyperTransport links.

Each Opteron processor consists of two NUMA regions (containing either 6 or 8 cores).

On Compute Nodes      HERMIT       HECToR       Lindgren
Processor             Interlagos   Interlagos   Magny-Cours
Processors per node   2            2            2
Cores per processor   16           16           12
Clock rate            2.3 GHz      2.3 GHz      2.1 GHz
L1 Cache              16 KB        16 KB        64 KB
L2 Cache              2 MB         2 MB         512 KB
L3 Cache              6 MB         6 MB         5 MB

2.1.1.2 Vector-type instructions

One of the keys to getting good performance out of the Opteron architecture is writing your code in such a way that the compiler can make use of the vector-type, floating point operations available on the processor. There are a number of different vector-type instruction families available that all execute in a similar manner: SSE (Streaming SIMD Extensions) Instructions; AVX (Advanced Vector eXtensions) Instructions and FMA4 (Fused Multiply-Add with 4 operands) Instructions. AVX and FMA4 instructions are only available on the Bulldozer architecture.

These instructions can use the floating-point unit (FPU) to operate on multiple floating point numbers simultaneously as long as the numbers are contiguous in memory. SSE instructions contain a number of different operations (for example: arithmetic, comparison, type conversion) that operate on two operands in 128-bit registers. AVX instructions expand SSE to allow operations to operate on three operands and on a data path expanded from 128- to 256-bits – this is especially important in the Bulldozer architecture as a core can have exclusive access to a 256-bit floating point pipeline. The FMA4 instructions are a further expansion to SSE instructions that allow a fused multiply-add operation on 4 operands – these have the potential to greatly increase performance for simulation codes. Both the AVX and FMA4 instruction sets are relatively new innovations and it may take some time before they are effectively supported by compilers.

2.1.1.3 Bulldozer (Interlagos) Architecture

Each Interlagos processor consists of two NUMA regions (or dies), each containing 8 cores. Each NUMA region is made up of 4 modules, each of which has two cores and a shared floating point execution unit. See Figure 2.1 for an overview of the Bulldozer architecture.

The shared FPU in a module is the major difference from previous versions of the Opteron processor. This unit consists of two 128-bit pipelines that can be combined into a single 256-bit pipeline. Hence, the module FPU is able to execute either a single 256-bit AVX vector instruction or two 128-bit SSE vector instructions per instruction cycle. The FPU also introduces an additional 256-bit fused multiply/add vector instruction which can improve the performance of these operations on the processor. Each 128-bit pipeline can operate on two double precision floating point numbers per clock cycle and the combined 256-bit pipeline can operate on 4 double precision floating point numbers per clock cycle.

Figure 2.1: Overview of an Interlagos processor NUMA region (8 cores). Image courtesy of Wikipedia.

2.1.1.4 Magny-Cours Architecture

Each Magny-Cours processor consists of two NUMA regions (or dies) each containing 6 cores.

The FPU in a Magny-Cours processor has a single 128-bit pipeline that can operate on two double precision floating-point numbers per clock cycle.

2.1.1.5 Service node hardware

The following table gives information on the service nodes of each machine. The service nodes are used for the login environment, internal services, etc.

On Service Nodes      HERMIT                     HECToR        Lindgren
Processor             AMD Opteron Processor 23   AMD Opteron   AMD Opteron Processor 23
Cores per processor   6                          2             6
Clock rate            2.2 GHz                    2.6 GHz       2.2 GHz

2.1.2 Building block architecture

Cray XE compute nodes contain two AMD Opteron processors. Depending on the site these can be either 12-core Magny-Cours or 16-core Interlagos processors. These individual processors are connected to each other by HyperTransport links.

Each Opteron processor consists of two NUMA regions (or dies) each containing either 6 or 8 cores.

The HyperTransport network is also linked directly to the Gemini router chips to provide access to the Cray High-Performance Network.

The layout of a Cray XE compute node is shown in Figure 2.2.

Figure 2.2: Overview of a “Magny-Cours” Cray XE compute node. Image courtesy of Cray Inc.

The number of nodes varies from system to system. The table below summarises the number of nodes on each system.

System                 HERMIT        HECToR         Lindgren
Cabinets               38            30             16
No. of Compute Nodes   3552          2816           1516
No. of Cores/Node      32            32             24
No. of Cores           > 113,000     90,112         36,384
Peak Performance       1000 Tflops   > 800 Tflops   305.6 Tflops
No. of Service Nodes   96            24             24

2.1.3 Memory architecture

All the processors on a node share 32 GB of DDR3 memory (HERMIT also offers some nodes with 64 GB memory). The total memory on HECToR is 58 TB and on Lindgren the total memory is 47.38 TB.

2.1.3.1 Bulldozer (Interlagos) architecture

All 4 modules (8 cores) within a NUMA region (or die) share an 8MB L3 cache (6MB data) with each module having a 2MB L2 data cache shared between the two cores. Each core has its own 16 KB data cache.

The memory bandwidth for each of the PRACE systems is shown in the table below.

System                               HERMIT   HECToR   Lindgren
Memory frequency (MHz)               1600     1333     1333
Memory bandwidth per node (GB/s)     102.4    85.3     85.3
Memory bandwidth per socket (GB/s)   51.2     42.7     42.7
Memory bandwidth per module (GB/s)   6.4      5.3      -
Memory bandwidth per core (GB/s)     3.2      2.7      3.6

The main memory bandwidth is 51.2GB/s (6.4GB/s per module, 3.2GB/s per core).

2.1.3.2 Magny-Cours architecture

All 6 cores within a NUMA region (or die) share a 6MB L3 cache (5MB data), with each core having a 512KB L2 data cache and 64KB L1 data cache.

The main memory bandwidth is 42.6GB/s (3.55GB/s per core).

2.1.4 Interconnect

Cray XE systems use the Cray Gemini interconnect which links all the compute nodes in a 3D torus. Every two XE compute nodes share a Gemini router which is connected to the processors and main memory via their HyperTransport links.

There are 10 links from each Gemini router on to the high-performance network (HPN); the peak bi-directional bandwidth of each link is 8 GB/s and the latency is around 1-1.5 microseconds.

Further details can be found at the following links:

2.1.5 I/O subsystem architecture

The I/O subsystems are system dependent.

  • HERMIT

HERMIT uses the Cray Data Virtualization Service (DVS) which is an I/O forwarding service that can parallelize the I/O transactions of an underlying POSIX-compliant file system.

  • HECToR

HECToR phase 2b has 12 I/O nodes to provide the connection between the machine and the data storage. Each I/O node is fully integrated into the toroidal communication network of the machine via their own Gemini chips. They are connected to the high-performance esFS external data storage via Infiniband fibre.

  • Lindgren

The primary storage for users of Lindgren is the site-wide Lustre file system Klemming, to which it is connected via Lustre router nodes (currently two). These transport Lustre traffic between the compute nodes on the Cray-internal Gemini interconnect and the Lustre servers residing on PDC's high-performance storage Infiniband fabric. The current aggregate bandwidth of Klemming is about 5 Gbyte/s for reading and writing and the size is roughly 300 TB.

2.1.6 Available file systems

Cray XE systems use two separate file systems: the “home” filesystem and the “work” filesystem.

The “home” filesystem is backed up and can be used for critical files and small permanent datasets. It cannot be accessed from the compute nodes, so all files required for running a job on the compute nodes must be present in the “work” filesystem. It should also be noted that the “home” filesystem is not designed for the long term storage of large sets of results. For long term storage, an archive facility should be used.

The “work” filesystem is a Lustre distributed parallel file system. It is the only filesystem that can be accessed from the compute nodes. Thus all input data files must be present on the “work” filesystem before running and all output files generated during the execution on compute nodes must be written to the “work” filesystem. There is no separate backup of data on the “work” filesystem.

The table below shows the filesystem architectures/capacities at the sites.

System                HERMIT               HECToR               Lindgren
“home” architecture   BlueArc Mercury 55   BlueArc Titan 2200   AFS
“home” capacity       60TB                 70TB                 -
“work” architecture   Lustre               Lustre               Lustre
“work” capacity       2.7PB                1.0PB                0.45PB
Archive capacity      -                    1.02PB               -

Please consult the individual site documentation for further information on the file systems.

The HECToR site also includes an archiving facility. More information is available at:

2.1.7 Operating system (CLE)

The operating system on Cray XE is the Cray Linux Environment (CLE) which in turn is based on SuSE Linux. CLE consists of two components: CLE and Compute Node Linux (CNL).

The service nodes of a Cray XE system (for example, the frontend nodes) run a full-featured version of Linux (CLE).

The compute nodes of a Cray XE system run CNL. CNL is a stripped-down version of Linux that has been extensively modified to reduce both the memory footprint of the OS and also the amount of variation in compute node performance due to OS overhead.

2.1.7.1 Cluster Compatibility Mode (CCM)

If you require full-featured Linux on the compute nodes of a Cray XE system (for example, to run an independent software vendor (ISV) code) you may be able to employ Cluster Compatibility Mode (CCM). The installation of this feature is generally site dependent. Please contact your site for information.

2.2 Cray XC

This section provides an overview of the Cray XC architecture and configurations. Cray XC systems generally consist of two types of nodes: service nodes and compute nodes. Depending on the site, external login nodes are also provided.

Service nodes are used for a variety of tasks on the system: interfaces for the file system, interfaces to external login nodes, job launcher nodes, or internal login nodes.

The processors used in the compute nodes, service nodes, and external login nodes can be different. Compute nodes currently all use Intel Xeon processors; in the future, compute nodes including Nvidia GPU or Intel Xeon Phi accelerators will become available.

2.2.1 Processor architecture / MCM architecture

2.2.1.1 Compute node hardware

Cray XC compute nodes contain two Intel Xeon E5-2600 series processors. Depending on the site these can be either 8-core E5-2670 (Sandy Bridge) or 12-core E5-2697 (Ivy Bridge), and each core can support 2 Hyper-Threads. These individual processors are connected to each other by two QuickPath Interconnect (QPI) links. The memory in a node is shared between the two processors, with non-uniform memory access: each processor consists of one NUMA region (containing either 8 or 12 cores) and access to the processor’s own memory region is faster than access to the other processor’s memory region.

On Compute Nodes      Sisu                Archer
Processor             E5-2670             E5-2697
Processors per node   2                   2
Cores per processor   8                   12
Clock rate            2.6 GHz             2.7 GHz
L1 Cache              32I+32D KB, 8-way   32I+32D KB, 8-way
L2 Cache              256 KB, 8-way       256 KB, 8-way
L3 Cache              20 MB, 16-way       30 MB, 16-way

(L1 Cache is 32 KB for Instructions + 32 KB for Data. L3 Cache is shared between all the cores in one processor, but each core has direct access to its ‘local’ 2.5 MB of the L3 Cache, see Figure 2.3.)

2.2.1.2 Vector-type instructions

One of the keys to getting good performance out of the Xeon architecture is writing your code in such a way that the compiler can make use of the vector-type, floating point operations available on the processor. There are two different vector-type instruction families available that execute in a similar manner: SSE (Streaming SIMD Extensions) Instructions and AVX (Advanced Vector eXtensions) Instructions.

These instructions can use the floating-point unit (FPU) to operate on multiple floating point numbers simultaneously as long as the numbers are contiguous in memory. SSE instructions contain a number of different operations (for example: arithmetic, comparison, type conversion) that operate on two operands in 128-bit registers. AVX instructions expand SSE to allow operations to operate on three operands and on a data path expanded from 128- to 256-bits. In the E5-2600 architecture each core has a 256-bit floating point pipeline.

2.2.1.3 E5-2670 (Sandy Bridge) Architecture

Each E5-2670 processor consists of one NUMA region and contains 8 cores. Each core contains one 256-bit Floating Point Unit (FPU), executing one 256-bit AVX vector instruction or one 128-bit SSE instruction per cycle. See Figure 2.3 for an overview of the E5-2670 architecture.

Figure 2.3: Overview of an E5-2670 processor. Each core is directly connected to 2.5 MB of the L3 cache, and all the components are connected with a bidirectional ring interconnection. There is an integrated memory controller connected to DDR3 memory, dual QPI links and a 40-lane PCIe3 link. Image courtesy of Christopher Dahnken, Intel; taken from his talk ‘Intel Sandy Bridge Overview: Understanding the Core’ given during the ‘Get “up to speed” with the Cray Cascade’ tutorial hosted by CSCS, 11-14 March 2013.

2.2.1.4 E5-2697 (Ivy Bridge) Architecture

The E5-2697 processor has a similar structure to the E5-2670, except with 12 cores and with a corresponding increase in L3 cache to 30 MB, still with fast access from each core to its ‘local’ 2.5 MB of the L3 cache.

2.2.1.5 Service node hardware

The following table gives the information on the service nodes on each machine. The service nodes are used as internal login nodes, interfaces to external login nodes, interfaces to the file system, job launcher nodes, and for internal services. The service nodes often have the same processors as the compute nodes but this is not necessary.

On Service Nodes      Sisu      Archer
Processor             E5-2670   E5-2670
Processors per node   1         1
Cores per processor   8         8
Clock rate            2.6 GHz   2.6 GHz

2.2.1.6 External login nodes

On Cray XC systems, the login nodes are usually external to the system (that is, not connected to the high-speed interconnect (Cray’s Aries interconnect) between the compute nodes). This is the case on Sisu and Archer. The external login nodes are used for tasks such as compilation, performance analysis tools, and job submission. An advantage of external login nodes is that users can still do compilation, file transfer, etc., even if the XC system itself is down.

On External Login Nodes   Sisu      Archer
Processor                 E5-2670   E5-2670
Processors per node       1         2
Cores per processor       8         8
Clock rate                2.6 GHz   2.6 GHz

2.2.1.7 Pre/Postprocessing nodes

On Archer, there are two further external nodes which provide serial preprocessing and postprocessing.

2.2.2 Building block architecture

Cray XC compute nodes contain two Intel Xeon processors. Depending on the site these can be either 8-core Sandy Bridge or 12-core Ivy Bridge processors. These individual processors are connected to each other by dual QuickPath Interconnect (QPI) links.

Each Xeon processor consists of one NUMA region, each containing either 8 or 12 cores.

One of the processors is connected via the PCI Express 3 bus to the Aries router chip to provide access to the Cray Aries interconnect.

The layout of a Cray XC compute node is shown in Figure 2.4.

Figure 2.4: Overview of a Cray XC compute node, in this case on Sisu. Image courtesy of CSC, Finland.

The number of nodes varies from system to system. The table below summarises the number of nodes on each system. Groups consist of 2 cabinets.

System                        Sisu         Archer
Groups (2 cabinets / group)   2            8 (note 1)
No. of Compute Nodes          736          3008
No. of Cores/Node             16           24
No. of Cores                  11,776       72,192
Peak Performance              250 Tflops   1.5 Pflops
No. of Service Nodes          16           32
No. of External Login Nodes   6            8
  • Note 1: 7 groups on Archer have 2632 nodes with 64 GB memory per node, 1 group has 384 nodes with 128 GB memory per node.

2.2.3 Memory architecture

The amount of memory per node varies with the system, and Archer has some nodes with larger amounts of memory (128 GB per node). On Sisu, all the processors on a compute node share 32 GB of 1600 MHz DDR3 memory. On Archer, all the processors on a compute node share 64/128 GB of 1866 MHz DDR3 memory.

Service and login nodes have different amounts of memory.

The memory for the nodes is given in the table below.

System                               Sisu     Archer
Memory per compute node              32 GB    64/128 GB
Total memory for the compute nodes   24 TB    216 TB
Memory per service node              32 GB    32 GB
Memory per external login node       256 GB   256 GB

2.2.3.1 Intel Sandy Bridge architecture

A Sandy Bridge processor consists of one NUMA region. All 8 cores share a 20MB L3 unified (i.e., containing data and instructions) cache. Each core has its own 256 KB L2 unified cache, 32 KB L1 data cache, and 32 KB L1 instruction cache.

2.2.3.2 Intel Ivy Bridge architecture

An Ivy Bridge processor consists of one NUMA region. All 12 cores share a 30MB L3 unified (i.e., containing data and instructions) cache. Each core has its own 256 KB L2 unified cache, 32 KB L1 data cache, and 32 KB L1 instruction cache.

The nominal memory bandwidth for each of the PRACE XC systems is shown in the table below.

System                               Sisu   Archer
Memory frequency (MHz)               1600   1866
Memory bandwidth per node (GB/s)     102    120
Memory bandwidth per socket (GB/s)   51     58
Memory bandwidth per core (GB/s)     6.4    5.0

2.2.4 Interconnect

Cray XC systems use the Cray Aries interconnect which links all the compute nodes in a Dragonfly topology. The Dragonfly topology consists of 2D all-to-all electrical connections between nodes in a group, with all-to-all optical connections between the groups. Every 4 XC compute nodes share an Aries router. The connection to each node is via PCIe3 to one of the processors in the node. The peak bi-directional node-to-node bandwidth is 15 GB/s and the latency is around 1.3 microseconds with an additional 100 nanoseconds when communicating over the optical links.

Further details can be found at the following links:

2.2.5 I/O subsystem architecture

The I/O subsystems are system dependent.

  • Sisu

Sisu uses a Lustre file system shared between the Sisu and Taito machines. On Sisu all directories use the same Lustre-based file server: thus all directories are visible to both the login nodes and the computing nodes. The connection between the Aries network and the Lustre servers is via service nodes acting as Lustre routers, which connect to the Lustre file servers via an Infiniband network. In addition to the local directories in Sisu, users have access to the CSC archive server, which is intended for long term data storage. The archive server is used through iRODS software.

  • Archer

Archer uses a Lustre file system for the “work” file system (/work on Archer), which is accessible from the compute and service nodes and the external login nodes, through an Infiniband network. The “home” file system (/home on Archer) is accessible from the external login nodes and compute nodes, through a 10 Gb Ethernet network and NFS to NetAPP network attached storage. The archive server is connected as a GPFS file system and is accessible from the login nodes.

2.2.6 Available file systems

Cray XC systems generally use two file systems: a “home” filesystem and a “work” filesystem.

The “home” filesystem is backed up and can be used for users’ source code and individual applications, scripts, small permanent datasets, and other critical files. Generally, there is limited disk space on the “home” system, and large datasets will be stored on the “work” file system. On Archer, access from the compute nodes to the “home” filesystem is read-only.

The “work” filesystem is a Lustre distributed parallel file system. It is the recommended (or only) filesystem that can be accessed from the compute nodes. Thus all input data files should be present on the “work” filesystem before running and all output files generated during the execution on compute nodes should be written to the “work” filesystem. There is no separate backup of data on the “work” filesystem and the “work” filesystem is not designed for the long term storage of large sets of results. For long term storage, an archive facility should be used.

On Sisu, it is recommended that users’ individual software should be compiled in the $TMPDIR directory. This resides on the login node and compilation is much faster using this directory.
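
A minimal sketch of this workflow, assuming the source is kept under $HOME and the resulting binary is copied to the “work” filesystem ($WRKDIR); the directory and file names are only illustrative:

cd $TMPDIR
cp -r $HOME/my_source .
cd my_source
ftn -o my_prog my_prog.f90
cp my_prog $WRKDIR/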

The table below shows the filesystem architectures/capacities at the sites.

System                Sisu               Archer
“home” name           $HOME, $USERAPPL   /home
“home” architecture   Lustre             NFS
“home” capacity       43 TB              216 TB
“work” name           $WRKDIR            /work
“work” architecture   Lustre             Lustre
“work” capacity       768 TB             4.6 PB
Archive capacity      2 TB / user        7.5 PB disk, 19.5 PB tape

Please consult the individual site documentation for further information on the file systems.

All sites also include an archiving facility. More information is available at:

2.2.7 Operating system (CLE)

The operating system on Cray XC is the Cray Linux Environment (CLE) which in turn is based on SuSE Linux. CLE consists of two components: CLE and Compute Node Linux (CNL).

The service nodes of a Cray XC system (for example, the frontend nodes) run a full-featured version of Linux (CLE).

The compute nodes of a Cray XC system run CNL. CNL is a stripped-down version of Linux that has been extensively modified to reduce both the memory footprint of the OS and also the amount of variation in compute node performance due to OS overhead.

2.2.7.1 Cluster Compatibility Mode (CCM)

If you require full-featured Linux on the compute nodes of a Cray XC system (for example, to run a code from an independent software vendor (ISV)) you may be able to employ Cluster Compatibility Mode (CCM). The installation of this feature is generally site dependent. Please contact your site for information.

3. System Access

The majority of users will connect to the systems using an interactive SSH session. The connection and authentication instructions are typically site-dependent so please consult the documentation for the particular site you are connecting to.

The following are some basic instructions for connecting to each system.

3.1 HERMIT

The only way to access HERMIT frontend/login nodes from outside HWW net is through ssh using the following command:

ssh [userID]@xe601.hww.de

The frontend node is the single point of access to the entire cluster, where users can set up their environment, move data, edit and compile programs, create batch scripts, etc. Interactive usage, e.g. running programs which may lead to a high load, is NOT allowed on the frontend/login node. The compute nodes for running parallel jobs are only available via the batch system.

3.2 HECToR

To log into HECToR, users should use the “login.hector.ac.uk” address:

ssh [userID]@login.hector.ac.uk

Transferring data to and from HECToR can be performed using scp, GridFTP or bbFTP.

3.3 Lindgren

Lindgren users can access the system via the login node using appropriate login software:

lindgren.pdc.kth.se

Depending on what operating system (Linux, Windows, Mac OS X) users have on their local computer, they will have to find the correct software to install to access the Lindgren system. Kerberos v5 software (from Heimdal or MIT) is needed to get a Kerberos ticket. Alternatively, users require SSH software that supports GSSAPI with KeyExchange (from modified OpenSSH) or kerberized telnet software (from Heimdal).

3.4 Sisu

To connect to Sisu, use ssh (Linux, Mac OS X) or PuTTY (Windows). For example, when using the ssh command, connect in the following way:

ssh sisu.csc.fi -l username

The Sisu supercomputer has six login nodes named sisu-login1.csc.fi – sisu-login6.csc.fi. When you open a new terminal connection using the server name sisu.csc.fi you will end up on one of these login nodes. You can also open a connection to a specific login node if needed:

ssh sisu-login5.csc.fi -l username

Access is also possible using the SSH console tool in the Scientist’s User Interface provided by CSC (more details in the CSC Computing Environment User’s Guide, chapter 1.3).

3.5 Archer

On the ARCHER system interactive access can be achieved via SSH, either directly from a command line terminal or using an SSH client.

To log into ARCHER you should use the “login.archer.ac.uk” address:

ssh [userID]@login.archer.ac.uk

3.6 Making access more convenient using an SSH Agent

Using an SSH Agent makes accessing the resources more convenient as you only have to enter your passphrase once per day to access any remote resource – this can include accessing resources via a chain of SSH sessions.

This approach combines the security of having a passphrase to access remote resources with the convenience of having password-less access. Having access of this sort set up also makes it extremely convenient to use client applications to access remote resources, for example:

  • the Tramp Emacs plugin that allows you to access and edit files on a remote host as if they are local files;

  • the Parallel Tools Platform for the Eclipse IDE that allows you to edit your source code on a local Eclipse installation and compile and test on a remote host;

  • the Allinea DDT (debugger) and MAP (profiler) that allow you to run the client on your local machine while debugging and/or profiling codes running on a remote host.

We will demonstrate the process using the HECToR facility but this should work on most remote Linux machines (unless the system administrator has explicitly set up the system to forbid access using an SSH Agent).

Note: this description applies if your local machine is Linux or Mac OSX. Windows is not set up for this type of access; making it work involves a lot of extra setup, including installing a Linux emulator called Cygwin – it is probably much easier to install a proper operating system such as Linux.

Note: not all remote hosts allow connections using an SSH key pair. If you find this method does not work it is worth checking with the remote site that such connections are allowed.

3.6.1 Set up an SSH key pair protected by a passphrase

Using a terminal (the command line), set up a key pair that contains your e-mail address and enter a passphrase you will use to unlock the key. In this example the user has username “user” and email address “your@email.ac.uk”:

ssh-keygen -t rsa -C "your@email.ac.uk"
...
-bash-4.1$ ssh-keygen -t rsa -C "your@email.ac.uk"
Generating public/private rsa key pair.
Enter file in which to save the key (/Home/user/.ssh/id_rsa): [Enter]
Enter passphrase (empty for no passphrase): [Passphrase]
Enter same passphrase again: [Passphrase]
Your identification has been saved in /Home/user/.ssh/id_rsa.
Your public key has been saved in /Home/user/.ssh/id_rsa.pub.
The key fingerprint is:
03:d4:c4:6d:58:0a:e2:4a:f8:73:9a:e8:e3:07:16:c8 your@email.ac.uk
The key's randomart image is:
+--[ RSA 2048]----+
|    . ...+o++++. |
| . . . =o..      |
|+ . . .......o o |
|oE .   .         |
|o =     .   S    |
|.    +.+     .   |
|.  oo            |
|.  .             |
| ..              |
+-----------------+

(Remember to replace “your@email.ac.uk” with your e-mail address. The default location for the key, here /Home/user/.ssh/id_rsa, depends on the system: it will be the pathname of “~/.ssh/id_rsa”.)

3.6.2 Copy the public part of the key to the remote host

Using your normal login password, add the public part of your key pair to the “authorized_keys” file on the remote host you wish to connect to using the SSH Agent. This can be achieved by appending the contents of the public part of the key to the remote file:

-bash-4.1$ cat ~/.ssh/id_rsa.pub | 
ssh user@login.hector.ac.uk 'cat - >> ~/.ssh/authorized_keys'
Password: [Password]

(remember to replace “user” with your username). Now you can test that your key pair is working correctly by attempting to connect to the remote host and run a command. You should be asked for your key pair passphrase (which you entered when you created the key pair) rather than your remote machine password.

-bash-4.1$ ssh user@login.hector.ac.uk 'date'
Enter passphrase for key '/Home/user/.ssh/id_rsa': [Passphrase]
Wed May  8 10:36:47 BST 2013

(remember to replace “user” with your username).

3.6.3 Enabling the SSH Agent

So far we have just replaced the need to enter a password to access a remote host with the need to enter a key pair passphrase. The next step is to enable an SSH Agent on your local system so that you only have to enter the passphrase once per day and after that you will be able to access the remote system without entering the passphrase.

Most modern Linux distributions (and Mac OSX) should have ssh-agent running by default. If your system does not, you can find instructions for enabling it for your distribution using Google.

To add the private part of your key pair to the SSH Agent, use the ’ssh-add’ command (on your local machine); you will need to enter your passphrase one more time:

-bash-4.1$ ssh-add ~/.ssh/id_rsa
Enter passphrase for /Home/user/.ssh/id_rsa: [Passphrase]
Identity added: /Home/user/.ssh/id_rsa (/Home/user/.ssh/id_rsa)

Now you can test that you can access the remote host without needing to enter your passphrase:

-bash-4.1$ ssh user@login.hector.ac.uk 'date'
Warning: Permanently added the RSA host key for IP address '193.62.216.27'
to the list of known hosts.
Wed May  8 10:42:55 BST 2013

(remember to replace “user” with your username).

3.6.4 Adding access to other remote machines

If you have more than one remote host that you access regularly, you can simply add the public part of your key pair to the ’authorized_keys’ file on any hosts you wish to access by repeating step 2 above.
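
For example, to add the same public key to the ARCHER login nodes (see Section 3.5) you would repeat the command from step 2 with the ARCHER hostname:

-bash-4.1$ cat ~/.ssh/id_rsa.pub | 
ssh user@login.archer.ac.uk 'cat - >> ~/.ssh/authorized_keys'
Password: [Password]

(remember to replace “user” with your username).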

3.6.5 SSH Agent forwarding

Now that you have enabled an SSH Agent to access remote resources you can perform an additional configuration step that will allow you to access all hosts that have your public key part uploaded from any host you connect to with the SSH Agent, without the need to install the private part of the key pair anywhere except your local machine.

This increases the security of the key pair as the private part is only stored in one place (your local machine) and makes access more convenient (as you only need to enter your passphrase once on your local machine to enable access between all machines that have the public part of the key pair).

Forwarding is controlled by a configuration file located on your local machine at “~/.ssh/config”. Each remote site (or group of sites) can have an entry in this file which may look something like:

Host hector
  HostName login.hector.ac.uk
  User user
  ForwardAgent yes

(remember to replace “user” with your username).

The “Host hector” line defines a short name for the entry. In this case, instead of typing “ssh login.hector.ac.uk” to access the HECToR login nodes, you could use “ssh hector” instead. The remaining lines define the options for the “hector” host.

  • Hostname login.hector.ac.uk – defines the full address of the host

  • User username – defines the username to use by default for this host (replace “username” with your own username on the remote host)

  • ForwardAgent yes – tells SSH to forward the local SSH Agent to the remote host; this is the option that allows you to store the private part of your key on your local machine only and export the access to remote sites

Now you can use SSH to access HECToR without needing to enter your username or the full hostname every time:

-bash-4.1$ ssh hector 'hostname -a'
c0-0c1s1n2 login3 hector-xe6-3

You can set up as many of these entries as you need in your local configuration file. Other options are available. See:

(or “man ssh_config” on any machine with SSH installed) for a description of the SSH configuration file.

4. Programming Environment / Basic Porting

4.1 Modules environment

Cray XE/XC systems use the modules environment to control the programming environment. For more information on using modules see the output of “man module” on the system or:

Some quick start commands for the modules environment usage:

To check which modules are presently loaded, type

module list

To search all the available modules, type

module avail

To display information about a specific module, type

module show [module_name]

To load a specific module for usage, type

module load [module_name]

To unload a specific module, type

module unload [module_name]

To swap a loaded module with another one, type

module swap [first_module_name] [second_module_name]

4.2 Compiler wrapper commands

No matter which programming environment you have loaded, you access the compilers via the following commands:

  • ftn – Fortran compiler

  • cc – C compiler

  • CC – C++ compiler

Using these high-level compiler wrapper commands ensures that you are compiling for the correct processor architecture on the compute nodes and also makes sure that all the correct library versions are accessed.

You should not use the native compiler commands (e.g. gfortran) on Cray XE/XC systems.
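
As a quick illustration, the wrapper commands are used in place of the native compiler commands; the source and executable names below are only placeholders for your own code:

# Fortran, C and C++ compilation with the compiler wrappers
ftn -o my_fortran_prog my_fortran_prog.f90
cc  -o my_c_prog my_c_prog.c
CC  -o my_cxx_prog my_cxx_prog.cpp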

4.3 Available compilers

A number of different compiler suites are available on Cray XE/XC systems. These include:

  • Cray compilers – the “PrgEnv-cray” module

  • Portland group compilers – the “PrgEnv-pgi” module (note that Portland group compilers are currently not included in any PRACE Cray XC system)

  • GNU compilers – the “PrgEnv-gnu” modules

  • Intel compilers – the “PrgEnv-intel” modules (note that Intel compilers have no support for the AMD Bulldozer architecture so will produce sub-optimal code on XE systems).

Site       Available Compiler Suites
HERMIT     Cray (default), GNU, PGI
HECToR     Cray (default), GNU, PGI
Lindgren   Cray, GNU, Intel, PGI
Sisu       Cray (default), GNU, Intel
Archer     Cray (default), GNU, Intel

The default programming environment differs from site to site. You can switch compiler suites with the “module swap” command. For example, to switch from the Cray compiler suite to the GNU compiler suite you would use:

module swap PrgEnv-cray PrgEnv-gnu

You can also switch between different versions of compilers within a compiler suite by using the “module swap” command. For example, to change the version of the GNU compiler you are using:

module swap gcc gcc/4.6.1

The compiler version modules for the different compiler suites are:

  • cce – For the Cray compiler suite

  • pgi – For the PGI compiler suite

  • gcc – For the GNU compiler suite

  • intel – For the Intel compiler suite

4.3.1 Partitioned Global Address Space (PGAS) compiler support

The Cray compiler suite supports both the Co-array Fortran (CAF) and Unified Parallel C (UPC) PGAS language extensions. For more information on enabling these options see the appropriate section on parallel programming below. You can also find out more at:

Cray XE/XC systems also support the Chapel PGAS language through the “chapel” compiler. You can use this compiler by adding the “chapel” module:

module add chapel

You can find more information on the Chapel language on the Chapel web site.

4.4 Available (vendor optimised) numerical libraries

The Cray CLE distribution comes with a range of optimised numerical libraries compiled for all the supported compiler suites listed above. The libraries are listed in the table below along with their current module names and a brief description. Generally, if you wish to use a library in your code you should only need to load the module before compilation.

Library         Module          Description
LibSci          cray-libsci     Cray Scientific Library: includes BLAS, LAPACK, BLACS and ScaLAPACK
MKL             PrgEnv-intel    Intel Math Kernel Library: includes BLAS, LAPACK, BLACS and ScaLAPACK (XC systems only)
PETSc           cray-petsc      Portable, Extensible Toolkit for Scientific Computation
FFTW            fftw            Fastest Fourier Transform in the West, versions 2 and 3
Trilinos        cray-trilinos   For solving linear and non-linear systems of equations
Global Arrays   cray-ga         An efficient and portable “shared-memory” programming interface for distributed-memory computers

Many of these libraries use the Cray autotuning framework to improve the on-node performance. This framework automatically selects the best version of the library routines based on the size and nature of your problem at runtime. Note that the Cray Scientific Library is not currently provided for the Intel compiler suite; use Intel’s Math Kernel Library instead. The linker options for MKL when used with the Intel compiler can be found using Intel’s Link Line Advisor.
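
For example, to use FFTW in a Fortran code you would typically load the module and then build with the compiler wrapper, which adds the include and library paths automatically (the source file name is a placeholder):

module load fftw
ftn -o my_fft_prog my_fft_prog.f90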

More information on the library contents can be found on the following web pages:

4.5 Available MPI implementations

Cray XE systems use a version of the MPICH 2 library that has been optimised for the Gemini interconnect. Cray XC systems use a version of the MPICH 2 library that has been optimised for the Aries interconnect. The version of the MPI library is controlled by the “cray-mpich2” module. All users will have the default “cray-mpich2” module loaded when they connect to the system – for best performance we recommend using the default or later versions.

You can get a list of available versions of the “cray-mpich2” module by using the “module avail” command. For example:

module avail cray-mpich2

Once the cray-mpich2 module is loaded, compiling using the standard wrapper compiler commands will automatically include and link to the MPI headers and libraries – you do not need to specify any more options on the command line.
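
As a minimal sketch of building and running an MPI code (the executable name and task counts are illustrative; see Section 5 for complete job submission scripts):

# the wrapper automatically adds the MPI headers and libraries
ftn -o my_mpi_prog my_mpi_prog.f90
# inside a batch job, launch with aprun, e.g. 64 tasks with 32 per node
aprun -n 64 -N 32 ./my_mpi_prog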

4.6 OpenMP

All of the compiler suites available on the Cray XE system support the OpenMP 3 standard.

Note: in the Cray compiler suite OpenMP functionality is turned on by default.

4.6.1 Compiler flags

The compiler flags to include OpenMP for the various compiler suites are:

Compiler   Enable OpenMP   Disable OpenMP
Cray       -h omp          -h noomp
PGI        -mp             -Mnoopenmp
GNU        -fopenmp        (omit -fopenmp)
Intel      -openmp         (omit -openmp)
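
For example, to build an OpenMP code with the flags from the table above (the source file name is a placeholder):

# Cray compiler suite (OpenMP is enabled by default; the flag is shown for clarity)
ftn -h omp -o my_omp_prog my_omp_prog.f90

# GNU compiler suite (after: module swap PrgEnv-cray PrgEnv-gnu)
ftn -fopenmp -o my_omp_prog my_omp_prog.f90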

You may find these links useful:

4.7 SHMEM

To compile code that uses SHMEM you should ensure that the cray-shmem module is loaded. Loading this module will ensure that all the correct environment variables are set for linking to the libsma static and dynamic libraries. You can load the module with the command:

module load cray-shmem

For more information on using SHMEM, see the Cray man pages at:

5. Batch system/job command language

To run a job on Cray XE/XC machines, you will need to write a submission script and submit your job to the batch system. As the batch system installed is site-specific, you should consult the local documentation for details on the batch system in operation at the site:

This section will introduce the basics of writing job submission scripts for Cray XE/XC systems and then go on to look at more advanced batch job topics, including:

  • how to run multiple, concurrent parallel jobs in a single job submission script;

  • how to run multiple, concurrent parallel jobs using job arrays;

  • how to run interactive parallel jobs through the batch system;

  • how to select compute nodes with particular characteristics for your job;

  • how to use Perl or Python to write job submission scripts.

5.1 Basic batch system commands

This is a very short summary of the most important basic job submission commands on a Cray XE/XC system.

To submit your job to the batch system:

qsub your_job_script.pbs # for PBS or Torque/Moab
sbatch your_job_script # for SLURM

To check the job status:

qstat # for PBS or Torque/Moab
squeue # for SLURM

To check only your job status:

qstat -u $USER # for PBS or Torque/Moab
squeue -u $USER # for SLURM

To remove your job from the job queue (if your job is running, this will also stop it):

qdel [jobID] # for PBS or Torque/Moab
scancel [jobID] # for SLURM

5.2 Hyper-Threading (Cray XC systems only)

Cray XC systems use Intel Xeon processors, which offer Hyper-Threading. Intel Hyper-Threading improves the throughput of a processor by allowing two program threads to share each of the processor cores. When one thread stalls (e.g. because of a cache miss) the processor can execute instructions from the second thread. Hyper-Threading is effectively a hardware version of threading. The operating system sees the two Hyper-Threads as two processors.

The Cray XC30 software and hardware stack provides full support for using Hyper-Threads, and Hyper-Threading is always on. So for a node with two 8-core Sandy Bridge processors (16 cores), 32 ranks are visible to the OS: ranks 0-7 and 16-23 on socket 0, ranks 8-15 and 24-31 on socket 1. Hyper-Thread pairs are 0 and 16, 1 and 17, …, 7 and 23, …, 15 and 31.

Any performance improvement is typically much less than 2 times with 2 Hyper-Threads. Using Hyper-Threading may actually degrade the performance of your application. It is recommended to try it; if it does not help, turn it off. Switching Hyper-Threading on and off is easily controlled at run time with the aprun -j option:

  • -j 1 = no Hyper-Threading (a node appears as 16 cores (Sandy Bridge))

  • -j 2 = Hyper-Threading enabled (a node appears as 32 cores (Sandy Bridge))

The default on Sisu and Archer is no Hyper-Threading.
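
For example, on a node with two 8-core Sandy Bridge processors, the two settings for a pure MPI executable look like this (the executable name is a placeholder):

# no Hyper-Threading: 16 tasks on the 16 physical cores
aprun -j 1 -n 16 -N 16 ./my_mpi_executable.x

# Hyper-Threading enabled: 32 tasks on the 32 logical cores
aprun -j 2 -n 32 -N 32 ./my_mpi_executable.x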

5.3 Job submission example scripts for parallel jobs using MPI

Version 12 of PBSPro is used on Archer, with some site-dependent modifications, and this means a slightly different batch script from those used on the other PRACE systems that use PBSPro. The essential difference is that the number of nodes is specified rather than the number of tasks. The mppwidth, mppnppn, and mppdepth settings are no longer used.

For PBS (Hector) or Torque/Moab (Hermit, Lindgren):

#!/bin/bash --login

# This example is for Hermit or Hector
# On HERMIT and HECToR the maximum core number per node is 32. 
# On Lindgren the maximum core number per node is 24.

# The jobname
#PBS -N your_job_name

# The total number of parallel tasks for your job.
# The example requires 2048 parallel tasks
#PBS -l mppwidth=2048

# Specify how many processes per node.
# On HERMIT and HECToR valid mppnppn values are from 1 to 32. 
# On Lindgren valid mppnppn values are from 1 to 24. 
#PBS -l mppnppn=32

# Specify the wall clock time required for your job.
#PBS -l walltime=00:20:00

# Specify which budget account that your job will be charged to.
#PBS -A your_budget_account               

# Change to the directory that the job was submitted from.
cd $PBS_O_WORKDIR

# Launch the parallel job using aprun.
# Run the executable my_mpi_executable.x using total
# of 2048 parallel tasks, with 32 tasks assigned per node.
aprun -n 2048 -N 32 ./my_mpi_executable.x arg1 arg2

For PBS (Archer):

#!/bin/bash --login

# This example is for Archer
# On Archer the maximum core number per node is 48 with hyper-threading,
# 24 without hyper-threading.

# The jobname
#PBS -N your_job_name

# The total number of nodes for your job.
# The example requires 100 compute nodes
#PBS -l select=100

# Specify the wall clock time required for your job.
#PBS -l walltime=00:20:00

# Specify which budget account that your job will be charged to.
#PBS -A your_budget_account               

# Make sure any symbolic links are resolved to the absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)               

# Change to the directory that the job was submitted from.
cd $PBS_O_WORKDIR

# Launch the parallel job using aprun.
# Run the executable my_mpi_executable.x using total
# of 2400 parallel tasks, with 24 tasks assigned per node.
aprun -n 2400 -N 24 ./my_mpi_executable.x arg1 arg2
# Using hyper-threading, 4800 tasks can be run, with 48 tasks
# assigned per node
# aprun -j 2 -n 4800 -N 48 ./my_mpi_executable.x arg1 arg2

For SLURM (Sisu):

#!/bin/bash --login

# This example is for Intel Xeon processors with hyperthreading
# On Sisu the maximum core number per node is 32 with hyper-threading,
# 16 without hyper-threading.

# The jobname
#SBATCH -J your_job_name

# The total number of parallel tasks for your job.
# The example requires 2048 parallel tasks
#SBATCH -n 2048

# Specify how many processes per node.
# On Sisu valid ntasks-per-node values are from 1 to 32 with hyper-threading,
# 1 to 16 without hyper-threading.
#SBATCH --ntasks-per-node=32

# Specify the wall clock time required for your job.
#SBATCH -t 00:20:00

# Specify the partition to use, SLURM does not choose one for you.
#SBATCH -p large

# Specify which budget account that your job will be charged to.  If
# this line is omitted, the budget account associated with your group ID
# will be charged.
#SBATCH --account=your_budget_account               

# Change to the directory that the job was submitted from.
cd $SLURM_SUBMIT_DIR

# Launch the parallel job using aprun.
# Run the executable my_mpi_executable.x using total
# of 2048 parallel tasks, with 32 tasks assigned per node.
# -j 2 is needed to use all 32 hyperthreads
aprun -j 2 -n 2048 -N 32 ./my_mpi_executable.x arg1 arg2

5.4 Job submission example scripts for parallel jobs using OpenMP

For PBS (Hector) or Torque/Moab (Hermit, Lindgren):

#!/bin/bash --login

# This example is for Hermit or Hector
# On HERMIT and HECToR: maximum thread number per node is 32. 
# On Lindgren: maximum thread number per node is 24.

# The jobname
#PBS -N job_name

# The total number of cores required for your job.
#PBS -l mppwidth=32

# Specify how many processes per node.
#PBS -l mppnppn=32

# Specify the wall clock time required for your job.
#PBS -l walltime=00:20:00

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Set the number of OpenMP threads per node
export OMP_NUM_THREADS=32

# Launch the OpenMP job to the allocated compute node using aprun
aprun -n 1 -N 1 -d $OMP_NUM_THREADS ./my_openmp_executable.x arg1 arg2

For PBS (Archer):

#!/bin/bash --login

# This example is for Archer
# On Archer the maximum core number per node is 48 with hyper-threading,
# 24 without hyper-threading.

# The jobname
#PBS -N job_name

# The total number of nodes required for your job.
#PBS -l select=1

# Specify the wall clock time required for your job.
#PBS -l walltime=00:20:00

# Make sure any symbolic links are resolved to the absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)               

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Set the number of OpenMP threads per node
export OMP_NUM_THREADS=24

# Launch the OpenMP job to the allocated compute node using aprun
aprun -n 1 -N 1 -d $OMP_NUM_THREADS ./my_openmp_executable.x arg1 arg2

# On Archer, -j 2 is needed to use all 48 hyperthreads
#export OMP_NUM_THREADS=48
#aprun -j 2 -n 1 -N 1 -d $OMP_NUM_THREADS ./my_openmp_executable.x arg1 arg2

For SLURM (Sisu):

#!/bin/bash --login

# This example is for Intel Xeon processors with hyperthreading
# On Sisu the maximum core number per node is 32 with hyper-threading,
# 16 without hyper-threading.

# The jobname
#SBATCH -J your_job_name

# The total number of cores required for your job.
#SBATCH -n 32

# Specify how many processes per node.
#SBATCH --ntasks-per-node=32

# Specify the wall clock time required for your job.
#SBATCH -t 00:20:00

# Specify the partition to use, SLURM does not choose one for you.
#SBATCH -p test

# Change to the directory that the job was submitted from.
cd $SLURM_SUBMIT_DIR

# Set the number of OpenMP threads per node
export OMP_NUM_THREADS=32

# Launch the OpenMP job to the allocated compute node using aprun
# -j 2 is needed to use all 32 hyperthreads
aprun -j 2 -n 1 -N 1 -d $OMP_NUM_THREADS ./my_openmp_executable.x arg1 arg2

5.5 Multiple ’aprun’ commands in a single job script

One of the most efficient ways of running multiple simulations in parallel on Cray XE/XC systems is to use a single job submission script to run multiple simulations. This can be achieved by having multiple ’aprun’ commands in a single script and requesting enough resources from the batch system to run them in parallel.

The examples in this section all assume you are using the bash shell for your job submission script but the principles are easily adapted to Perl, Python or tcsh.

This technique is particularly useful if you have many jobs that use a small number of cores that you want to run simultaneously, as the job looks to the batch system like a single large job and is thus easier to schedule.

Note: each ’aprun’ command must run on a separate compute node as Cray XE/XC machines only allow exclusive node access. This means you cannot use this technique to run multiple instances of a program on a single compute node.

5.5.1 Requesting the correct number of cores

The total number of cores requested for a job of this type is the sum of the number of cores required for all the simulations in the script. For example, if we have 16 simulations which each run using 2048 cores then we would need to ask for 32768 cores (1024 nodes on a 32-core per node system).

5.5.2 Multiple ’aprun’ syntax

The difference between specifying a single aprun command and specifying multiple ’aprun’ commands in your job submission script is that each aprun command must be run in the background (i.e., appended with an &) and there must be a ’wait’ command after the final aprun command. For example, to run 4 CP2K simulations which each use 2048 cores (8192 cores in total) and 32 cores per node:

cd $basedir/simulation1/
aprun -n 2048 -N 32 cp2k.popt < input1.cp2k > output1.cp2k &
cd $basedir/simulation2/
aprun -n 2048 -N 32 cp2k.popt < input2.cp2k > output2.cp2k &
cd $basedir/simulation3/
aprun -n 2048 -N 32 cp2k.popt < input3.cp2k > output3.cp2k &
cd $basedir/simulation4/
aprun -n 2048 -N 32 cp2k.popt < input4.cp2k > output4.cp2k &

# Wait for all simulations to complete
wait

Of course, this could have been achieved more concisely using a loop:

for i in {1..4}; do
  cd $basedir/simulation${i}/
  aprun -n 2048 -N 32 cp2k.popt < input${i}.cp2k > output${i}.cp2k &
done

# Wait for all simulations to complete
wait

5.5.3 Example job submission script

This PBSPro job submission script runs 16, 2048-core CP2K simulations in parallel with the input in the directories ’simulation1’, ’simulation2’, etc.:

For PBS (Hector) or Torque/Moab (Hermit, Lindgren):

#!/bin/bash --login

# This example is for Hermit or Hector
# On HERMIT and HECToR the maximum core number per node is 32. 
# On Lindgren the maximum core number per node is 24.

# The jobname
#PBS -N your_job_name

# The total number of parallel tasks for your job.
#    This is the sum of the number of parallel tasks
#    required by each of the aprun commands you are using.
#    In this example we have 16 * 2048 = 32768 tasks
#PBS -l mppwidth=32768

# Specify how many processes per node.
# On HERMIT and HECToR valid mppnppn values are from 1 to 32. 
# On Lindgren valid mppnppn values are from 1 to 24. 
#PBS -l mppnppn=32

# Specify the wall clock time required for your job.
#    In this example we want 6 hours 
#PBS -l walltime=6:0:0

# Specify which budget that your job will be charged to.
#PBS -A your_budget_account               

# The base directory is the dir that the job was submitted from.
# All simulations are in subdirectories of this directory.
basedir=$PBS_O_WORKDIR

# Loop over simulations, running them in the background
for i in {1..16}; do
   # Change to the directory for this simulation
   cd $basedir/simulation${i}/
   aprun -n 2048 -N 32 cp2k.popt < input${i}.cp2k > output${i}.cp2k &
done

# Wait for all jobs to finish before exiting the
# job submission script
wait
exit 0

For PBS (Archer):

#!/bin/bash --login

# This example is for Archer
# On Archer the maximum core number per node is 48 with hyper-threading,
# 24 without hyper-threading.

# The jobname
#PBS -N your_job_name

# The total number of nodes for your job.
#    This is the sum of the number of nodes
#    required by each of the aprun commands you are using.
#    In this example we have 16 * 2400 / 24 = 1600 nodes
#    since we are not using hyper-threading
#PBS -l select=1600

# Specify the wall clock time required for your job.
#    In this example we want 6 hours 
#PBS -l walltime=6:0:0

# Specify which budget that your job will be charged to.
#PBS -A your_budget_account               

# Make sure any symbolic links are resolved to the absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)               

# The base directory is the dir that the job was submitted from.
# All simulations are in subdirectories of this directory.
basedir=$PBS_O_WORKDIR

# Loop over simulations, running them in the background
for i in {1..16}; do
   # Change to the directory for this simulation
   cd $basedir/simulation${i}/
   aprun -n 2400 -N 24 cp2k.popt < input${i}.cp2k > output${i}.cp2k &
done

# Wait for all jobs to finish before exiting the
# job submission script
wait
exit 0

And here is a SLURM version.

#!/bin/bash --login

# This example is for Intel Xeon processors with hyperthreading
# On Sisu the maximum core number per node is 32 with hyper-threading,
# 16 without hyper-threading.

# The jobname
#SBATCH -J your_job_name

# The total number of parallel tasks for your job.
#    This is the sum of the number of parallel tasks
#    required by each of the aprun commands you are using.
#    In this example we have 16 * 2048 = 32768 tasks
#SBATCH -n 32768

# Specify how many processes per node.
# On Sisu valid ntasks-per-node values are from 1 to 32 with hyper-threading,
# 1 to 16 without hyper-threading.
#SBATCH --ntasks-per-node=32

# Specify the wall clock time required for your job.
#    In this example we want 6 hours 
#SBATCH -t 6:0:0

# Specify the partition to use, SLURM does not choose one for you.
#SBATCH -p large

# Specify which budget that your job will be charged to.
#SBATCH --account=your_budget_account

# The base directory is the dir that the job was submitted from.
# All simulations are in subdirectories of this directory.
basedir=$SLURM_SUBMIT_DIR

# Loop over simulations, running them in the background
for i in {1..16}; do
   # Change to the directory for this simulation
   cd $basedir/simulation${i}/
   # -j 2 is needed to use all 32 hyperthreads
   aprun -j 2 -n 2048 -N 32 cp2k.popt < input${i}.cp2k > output${i}.cp2k &
done

# Wait for all jobs to finish before exiting the
# job submission script
wait
exit 0

In this example, it is assumed that all of the input for the simulations has been setupprior to submitting the jobs. Of course, in reality, you may find that it is more usefulfor the job submission script to programmatically prepare the input for each job beforethe aprun command.

5.6 Job arrays

Often, you will want to run the same job submission script multiple times in parallel for many different input parameters. Job arrays provide a mechanism for doing this without the need to issue multiple ’qsub’ or ’sbatch’ commands and without the penalty of having large numbers of jobs appearing in the queue.

5.6.1 Example job array submission script

Each job instance in the job array is able to access its unique array index through the environment variable $PBS_ARRAY_INDEX (for PBSPro) or $PBS_ARRAYID (for Torque).

This can be used to programmatically select which set of input parameters you want to use. One common way to use job arrays is to place the input for each job instance in a separate subdirectory which has a number as part of its name. For example, if you have 10 sets of input in ten subdirectories called job01, job02, …, job10 then you would be able to use the following script to run a job array that runs each of these jobs (this example is for PBSPro):

For PBS (Hector) or Torque/Moab (Hermit, Lindgren):

#!/bin/bash

# This example is for Hermit and Hector.
# On HERMIT and HECToR the maximum core number per node is 32. 
# On Lindgren the maximum core number per node is 24.

# The jobname
#PBS -N your_job_name

# The total number of parallel tasks for your job.
# The example requires 2048 parallel tasks
#PBS -l mppwidth=2048

# Specify how many processes per node.
# On HERMIT and HECToR valid mppnppn values are from 1 to 32. 
# On Lindgren valid mppnppn values are from 1 to 24. 
#PBS -l mppnppn=32

# Specify the wall clock time required for your job.
#PBS -l walltime=00:20:00

# Specify which budget account that your job will be charged to.
#PBS -A your_budget_account               

# Change to the directory that the job was submitted from.
cd $PBS_O_WORKDIR

# Get the subdirectory name for this job instance in the array
#   Note that this example is for PBSPro. For the Torque batch system
#   you would need to use the $PBS_ARRAYID environment variable instead
jobid=`printf "%02d" $PBS_ARRAY_INDEX`
jobdir="job$jobid"

# Change to the subdirectory for this job instance in the array
cd $jobdir

# Run this job instance in its subdirectory.  The executable is in
# the directory above.
echo "Running $PBS_JOBNAME"
aprun -n 2048 -N 32 ../my_mpi_executable.x arg1 arg2

For PBS (Archer):

#!/bin/bash

# This example is for Archer.
# On Archer the maximum core number per node is 48 with hyper-threading,
# 24 without hyper-threading.

# The jobname
#PBS -N your_job_name

# The total number of nodes for your job.
# The example requires 100 nodes
#PBS -l select=100

# Specify the wall clock time required for your job.
#PBS -l walltime=00:20:00

# Specify which budget account that your job will be charged to.
#PBS -A your_budget_account               

# Job arrays have to be rerunnable, so over-ride any system-wide
# default setting
#PBS -r y

## Make sure any symbolic links are resolved to the absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)               

# Change to the directory that the job was submitted from.
cd $PBS_O_WORKDIR

# Get the subdirectory name for this job instance in the array
jobid=`printf "%02d" $PBS_ARRAY_INDEX`
jobdir="job$jobid"

# Change to the subdirectory for this job instance in the array
cd $jobdir

# Run this job instance in its subdirectory.  The executable is in
# the directory above.
echo "Running $PBS_JOBNAME"
aprun -n 2400 -N 24 ../my_mpi_executable.x arg1 arg2
# On Archer, -j 2 is needed to use all 48 hyperthreads
#aprun -j 2 -n 4800 -N 48 ../my_mpi_executable.x arg1 arg2

How you submit job arrays differs depending on whether your site uses the PBSPro or Torque batch system.

5.6.2 Submitting job arrays using PBSPro

The ’-J’ option to the ’qsub’ command is used to submit a job array under PBSPro. For example, to submit a job array consisting of 10 instances, numbered from 1 to 10, you would use the command:

qsub -J 1-10 array_job_script.pbs

You can also specify a stride other than 1 for array jobs. For example, to submit a job array consisting of 5 instances, numbered 2, 4, 6, 8, and 10, you would use the command:

qsub -J 2-10:2 array_job_script.pbs

5.6.3 Submitting job arrays using Torque

The ’-t’ option to the ’qsub’ command is used to submit a job array under Torque. For example, to submit a job array consisting of 10 instances, numbered from 1 to 10, you would use the command:

qsub -t 1-10 array_job_script.pbs

5.6.4 Submitting job arrays using SLURM

The current version of SLURM (2.5.6) on Sisu does not support job arrays. SLURM 2.6 does support job arrays and, if this becomes available on Sisu, this section will be updated.
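
For reference, SLURM 2.6 and later support job arrays through the ’--array’ option to ’sbatch’, and each job instance can read its index from the SLURM_ARRAY_TASK_ID environment variable. The script below is an untested sketch only, assuming SLURM 2.6 or later and following the structure of the PBS job array examples above:

#!/bin/bash --login

# The jobname
#SBATCH -J your_job_name

# Resources for each job instance in the array
#SBATCH -n 2048
#SBATCH --ntasks-per-node=32
#SBATCH -t 00:20:00
#SBATCH -p large

# Request 10 job instances, numbered 1 to 10
#SBATCH --array=1-10

# Change to the directory that the job was submitted from
cd $SLURM_SUBMIT_DIR

# Get the subdirectory name for this job instance in the array
jobid=`printf "%02d" $SLURM_ARRAY_TASK_ID`
cd job$jobid

# -j 2 is needed to use all 32 hyperthreads
aprun -j 2 -n 2048 -N 32 ../my_mpi_executable.x arg1 arg2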

5.6.5 Interacting with individual job instances in an array

You can see the status of job arrays and their job instances using:

qstat -Jt

You can refer to an individual job instance in a job array by using its array index. For example, to delete just the job instance with array index 5 from the batch system (assuming your job ID is 1234), you would use:

qdel 1234[5]

5.7 Interactive jobs

Note that interactive jobs are not available on all Cray XE/XC execution sites.

Interactive jobs on Cray XE/XC systems are useful for debugging or developmental work as they allow youto issue ’aprun’ commands directly from the command line.

With PBS or Torque/Moab (on Hermit, Hector, Lindgren, Archer), you need to have the nodes allocated to you via the batch system. To submit an interactive job reserving 256 cores for 1 hour you would use the following qsub command on Hermit, Hector and Lindgren:

qsub -IVl mppwidth=256,walltime=1:0:0 -A budget

and for Archer, requesting 10 nodes, equivalent to 240 cores:

qsub -IVl select=10,walltime=1:0:0 -A budget

When you submit this job your terminal will display something like:

qsub: waiting for job 492383.sdb to start

and once the job runs you will be returned to a standard Linux command line. However, while the job lasts you will be able to run parallel jobs by using the ’aprun’ command directly at your command prompt. The maximum number of cores you can use is limited by the value of mppwidth you specified at submission time.
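
Once the interactive session has started, an illustrative command to run a 256-task MPI program on a 32-core per node system would be:

aprun -n 256 -N 32 ./my_mpi_executable.x arg1 arg2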

On Sisu, SLURM is not used for interactive jobs. Interactive jobs are launched from the login node using aprun directly, e.g.

aprun -n12 hello.exe

The number of interactive nodes you can use depends on the total number in the system and whether other users are already using them. You can find out how many interactive nodes are available with

xtnodestat -d

which displays the number of available nodes (amongst other information). You can find out the maximum number of interactive nodes on the system using

xtprocadmin | grep interactive

which gives a list of the interactive nodes.

5.8 Selecting nodes with particular attributes

It is possible to select specific nodes for job execution using the batch system. The resource ’mppnodes’ (PBS, but not Archer, see below) or ’nodelist’ (SLURM) allows you to specify a list of particular compute nodes for execution. This can be useful in the case that you are working on a heterogeneous system (with different numbers of cores or different amounts of memory on a particular compute node).

For example, to submit a 4-node job (128 cores on a 32-core per node system) to the compute nodes numbered 2, 3, 4 and 5 you would add the following line to a PBS or Torque/Moab job submission script:

#PBS -l mppnodes="2-5"

or, if your nodes are not sequentially numbered, for example 2, 4, 6, 8, you can use a comma separated list

#PBS -l mppnodes="2,4,6,8"

Using SLURM, the corresponding lines would be (here for nodes named nid00002 to nid00005):

#SBATCH --nodelist=nid[00002-00005]

or, if your nodes are not sequentially numbered, for example 2, 4, 6, 8, you can use a comma separated list

#SBATCH --nodelist=nid00002,nid00004,nid00006,nid00008

On Archer, choosing different types of nodes can be achieved through the PBS select statement: the ’mppnodes’ resource is not used. Specifying particular nodes is not so useful on Archer since the compute nodes are almost all the same. It is more useful to select nodes based on the type of node. For example, if you want to choose 4 large memory nodes, put the following select statement in the job submission script:

#PBS -l select=4:bigmem=true

Each node has several different resource settings, which can be used to choose particular types of nodes. To list all the nodes and their available resources, give the command

pbsnodes -a | less

This produces a very long list, so it is best to pipe this to less to scroll through it. Archer has a very homogeneous set of nodes, and the only useful resources to use in the select statement are bigmem, to choose large memory compute nodes, and serial, to choose one of the pre-/post-processing nodes.
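
For example, to request one of the serial pre-/post-processing nodes you could use a select statement of the same form. This is a sketch only, assuming the serial resource is requested in the same way as bigmem; check the Archer documentation for the exact syntax:

#PBS -l select=1:serial=true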

5.8.1 Identifying nodes with particular attributes

To make effective use of the ability to select particular compute nodes for execution you need to have a way to get a list of compute nodes with particular attributes in the format needed for the ’mppnodes’ resource. The ’cnselect’ command is used on Cray XE/XC systems to perform this task.

For example, to return a list of all compute nodes with 32 cores you would use:

cnselect numcores.eq.32

or to select all compute nodes with more than 32GB of memory:

cnselect availmem.gt.32000
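
The node list returned by ’cnselect’ can then be passed to the batch system on the command line, for example (an illustrative command only; the script name is a placeholder):

qsub -l mppnodes="$(cnselect numcores.eq.32)" my_job_script.pbs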

On Archer, ’cnselect’ is available but it is simpler to use the PBS select statement to choose nodes with particular attributes.

5.9 Writing job submission scripts in Perl and Python

It can often be useful to be able to use the features of Perl and/or Python to write more complex job submission scripts. The richer programming environment compared to standard shell scripts can make it easier to dynamically generate input for jobs or put complex workflows together.

Please note that the examples provided in this section are so simple that they could easily be written in bash or tcsh but they provide the necessary information needed to be able to use Perl and Python to write your own, more complex, job submission scripts.

You submit Perl and Python job submission scripts using ’qsub’ or ’sbatch’ as for standard jobs.
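
For example (the script names here are just placeholders):

qsub my_job_script.pl
sbatch my_job_script.py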

5.9.1 Example Perl job submission script

This example script shows how to run a CP2K job using Perl. It illustrates the necessary system calls to change directories and load modules within a Perl script but does not contain any program complexity.

For PBS (Hermit, Hector) or Torque/Moab (Lindgren):

#!/usr/bin/perl

# This example is for Hermit or Hector
# On HERMIT and HECToR the maximum core number per node is 32. 
# On Lindgren the maximum core number per node is 24.

# The jobname
#PBS -N your_job_name

# The total number of parallel tasks for your job.
# The example requires 2048 parallel tasks
#PBS -l mppwidth=2048

# Specify how many processes per node.
# On HERMIT and HECToR valid mppnppn values are from 1 to 32. 
# On Lindgren valid mppnppn values are from 1 to 24. 
#PBS -l mppnppn=32

# Set the budget to charge the job to.
#    The budget name is site-dependent
#PBS -A budget

# Set the number of MPI tasks and MPI tasks per node
my $mpiTasks = 2048;
my $tasksPerNode = 32;

# Set the executable name and input and output files
my $execName = "cp2k.popt";
my $inputName = "input";
my $outputName = "output";
my $runCode = "$execName < $inputName > $outputName";

# Set up the string to run our job
my $aprunString = "aprun -n $mpiTasks -N $tasksPerNode $runCode";

# Set the command to load the cp2k module
#   This is more complicated in Perl as we cannot access the 
#   'module' command directly so we need to use a set of commands
#   to make sure the subshell that runs the aprun process has the 
#   correct environment setup. This string will be prepended to the
#   aprun command
my $moduleString = "source /etc/profile; module load cp2k;";

# Change to the directory the job was submitted from
chdir($ENV{'PBS_O_WORKDIR'});

# Run the job
#    This is a combination of the module loading string and the
#    actual aprun command. Both of these are set above.
system("$moduleString  $aprunString");

# Exit the job
exit(0);

For PBS (Archer):

#!/usr/bin/perl

# This example is for Archer
# On Archer the maximum core number per node is 48 with hyper-threading,
# 24 without hyper-threading.

# The jobname
#PBS -N your_job_name

# Run time
#PBS -l walltime=01:00:00

# The total number of nodes for your job.
# The example requires 2400 parallel tasks = 100 nodes
#PBS -l select=100

# Set the budget to charge the job to.
#    The budget name is site-dependent
#PBS -A budget

# Set the number of MPI tasks and MPI tasks per node
my $mpiTasks = 2400;
my $tasksPerNode = 24;

# Set the executable name and input and output files
my $execName = "cp2k.popt";
my $inputName = "input";
my $outputName = "output";
my $runCode = "$execName < $inputName > $outputName";

# Set up the string to run our job
# On Archer, -j 2 is needed to use all 48 hyperthreads
my $aprunString = "aprun -n $mpiTasks -N $tasksPerNode $runCode";

# Set the command to load the cp2k module
#   This is more complicated in Perl as we cannot access the 
#   'module' command directly so we need to use a set of commands
#   to make sure the subshell that runs the aprun process has the 
#   correct environment setup. This string will be prepended to the
#   aprun command
my $moduleString = "source /etc/profile; module load cp2k;";

# Change to the directory the job was submitted from
chdir($ENV{'PBS_O_WORKDIR'});

# Run the job
#    This is a combination of the module loading string and the
#    actual aprun command. Both of these are set above.
system("$moduleString  $aprunString");

# Exit the job
exit(0);

For SLURM (Sisu):

#!/usr/bin/perl

# This example is for Intel Xeon processors with hyperthreading
# On Sisu the maximum core number per node is 32 with hyper-threading,
# 16 without hyper-threading.

# The jobname
#SBATCH -J your_job_name

# Run time
#SBATCH -t 6:0:0

# The total number of parallel tasks for your job.
# The example requires 2048 parallel tasks
#SBATCH -n 2048

# Specify how many processes per node.
# On Sisu valid ntasks-per-node values are from 1 to 32 with hyper-threading,
# 1 to 16 without hyper-threading.
#SBATCH --ntasks-per-node=32

# Specify the partition to use, SLURM does not choose one for you.
#SBATCH -p large

# Specify which budget account that your job will be charged to.  If
# this line is omitted, the budget account associated with your group ID
# will be charged.
#SBATCH --account=your_budget_account               

# Set the number of MPI tasks and MPI tasks per node
my $mpiTasks = 2048;
my $tasksPerNode = 32;

# Set the executable name and input and output files
my $execName = "cp2k.popt";
my $inputName = "input";
my $outputName = "output";
my $runCode = "$execName < $inputName > $outputName";

# Set up the string to run our job
# -j 2 is needed to use all 32 hyperthreads
my $aprunString = "aprun -j 2 -n $mpiTasks -N $tasksPerNode $runCode";

# Set the command to load the cp2k module
#   This is more complicated in Perl as we cannot access the 
#   'module' command directly so we need to use a set of commands
#   to make sure the subshell that runs the aprun process has the 
#   correct environment setup. This string will be prepended to the
#   aprun command
my $moduleString = "source /etc/profile; module load cp2k;";

# Change to the directory the job was submitted from
chdir($ENV{'SLURM_SUBMIT_DIR'});

# Run the job
#    This is a combination of the module loading string and the
#    actual aprun command. Both of these are set above.
system("$moduleString  $aprunString");

# Exit the job
exit(0);

5.9.2 Example Python job submission script

This example script shows how to run a CP2K job using Python. It illustrates the necessary system callsto change directories and load modules within a Python script but does not contain any program complexity.

For PBS (Hermit, Hector) or Torque/Moab (Lindgren):

#!/usr/bin/python
# This example is for Hermit or Hector
# On HERMIT and HECToR the maximum core number per node is 32. 
# On Lindgren the maximum core number per node is 24.

# The jobname
#PBS -N your_job_name

# The total number of parallel tasks for your job.
# The example requires 2048 parallel tasks
#PBS -l mppwidth=2048

# Specify how many processes per node.
# On HERMIT and HECToR valid mppnppn values are from 1 to 32. 
# On Lindgren valid mppnppn values are from 1 to 24. 
#PBS -l mppnppn=32

# Set the budget to charge the job to.
#    The budget name is site-dependent
#PBS -A budget

# Import the Python modules required for system operations
import os
import sys

# Set the number of MPI tasks and MPI tasks per node
mpiTasks = 2048
tasksPerNode = 32

# Set the executable name and input and output files
execName = "cp2k.popt"
inputName = "input"
outputName = "output"
runCode = "{0} < {1} > {2}".format(execName, inputName, outputName)

# Set up the string to run our job
aprunString = "aprun -n {0} -N {1} {2}".format(mpiTasks, tasksPerNode, runCode)

# Set the command to load the cp2k module
#   This is more complicated in Python as we cannot access the 
#   'module' command directly so we need to use a set of commands
#   to make sure the subshell that runs the aprun process has the 
#   correct environment setup. This string will be prepended to the
#   aprun command
moduleString = "source /etc/profile; module load cp2k; "

# Change to the directory the job was submitted from
os.chdir(os.environ["PBS_O_WORKDIR"])

# Run the job
#    This is a combination of the module loading string and the
#    actual aprun command. Both of these are set above.
os.system(moduleString + aprunString)

# Exit the job
sys.exit(0)

For PBS (Archer):

#!/usr/bin/python

# This example is for Archer
# On Archer the maximum core number per node is 48 with hyper-threading,
# 24 without hyper-threading.

# The jobname
#PBS -N your_job_name

# Run time
#PBS -l walltime=01:00:00

# The total number of nodes for your job.
# The example requires 2400 parallel tasks = 100 nodes
#PBS -l select=100

# Set the budget to charge the job to.
#    The budget name is site-dependent
#PBS -A budget

# Import the Python modules required for system operations
import os
import sys

# Set the number of MPI tasks and MPI tasks per node
mpiTasks = 2400
tasksPerNode = 24

# Set the executable name and input and output files
execName = "cp2k.popt"
inputName = "input"
outputName = "output"
runCode = "{0} < {1} > {2}".format(execName, inputName, outputName)

# Set up the string to run our job
# On Archer, -j 2 is needed to use all 48 hyperthreads
aprunString = "aprun -n {0} -N {1} {2}".format(mpiTasks, tasksPerNode, runCode)

# Set the command to load the cp2k module
#   This is more complicated in Python as we cannot access the 
#   'module' command directly so we need to use a set of commands
#   to make sure the subshell that runs the aprun process has the 
#   correct environment setup. This string will be prepended to the
#   aprun command
moduleString = "source /etc/profile; module load cp2k; "

# Change to the directory the job was submitted from
os.chdir(os.environ["PBS_O_WORKDIR"])

# Run the job
#    This is a combination of the module loading string and the
#    actual aprun command. Both of these are set above.
os.system(moduleString + aprunString)

# Exit the job
sys.exit(0)

For SLURM (Sisu):

#!/usr/bin/python
# This example is for Intel Xeon processors with hyperthreading
# On Sisu the maximum core number per node is 32 with hyper-threading,
# 16 without hyper-threading.

# The jobname
#SBATCH -J your_job_name

# The total number of parallel tasks for your job.
# The example requires 2048 parallel tasks
#SBATCH -n 2048

# Specify how many processes per node.
# On Sisu valid ntasks-per-node values are from 1 to 32 with hyper-threading,
# 1 to 16 without hyper-threading.
#SBATCH --ntasks-per-node=32

# Specify the partition to use, SLURM does not choose one for you.
#SBATCH -p large

# Specify which budget account that your job will be charged to.  If
# this line is omitted, the budget account associated with your group ID
# will be charged.
#SBATCH --account=your_budget_account               

# Import the Python modules required for system operations
import os
import sys

# Set the number of MPI tasks and MPI tasks per node
mpiTasks = 2048
tasksPerNode = 32

# Set the executable name and input and output files
execName = "cp2k.popt"
inputName = "input"
outputName = "output"
runCode = "{0} < {1} > {2}".format(execName, inputName, outputName)

# Set up the string to run our job
# -j 2 is needed to use all 32 hyperthreads
aprunString = "aprun -j 2 -n {0} -N {1} {2}".format(mpiTasks, tasksPerNode, runCode)

# Set the command to load the cp2k module
#   This is more complicated in Python as we cannot access the 
#   'module' command directly so we need to use a set of commands
#   to make sure the subshell that runs the aprun process has the 
#   correct environment setup. This string will be prepended to the
#   aprun command
moduleString = "source /etc/profile; module load cp2k; "

# Change to the directory the job was submitted from
os.chdir(os.environ["SLURM_SUBMIT_DIR"])

# Run the job
#    This is a combination of the module loading string and the
#    actual aprun command. Both of these are set above.
os.system(moduleString + aprunString)

# Exit the job
sys.exit(0)

6. Performance analysis

6.1 Available Performance Analysis Tools

All Cray XE/XC machines come with the Cray Performance Tools (module “perftools”) installed. These tools include:

  • Cray Performance Analysis Tool (CrayPAT).

  • PAPI hardware counters.

  • Apprentice2 performance visualisation suite.

  • Reveal performance analysis and code restructuring tool.

CrayPAT provides the interface to using all these tools.

There are also other tools available on each system. Please refer to the appropriate documentation at each site.

Site      Tools
HERMIT    Cray PerfTools, PAPI
HECToR    Cray PerfTools, Scalasca, VampirTrace
Lindgren  Cray PerfTools, Paraver, Extrae
Sisu      Cray PerfTools
Archer    Cray PerfTools

6.2 Cray Performance Analysis Tool (CrayPAT)

CrayPAT consists of two command line tools which are used to profile your code: ’pat_build’ adds instrumentation to your code so that when you run the instrumented binary the profiling information is stored on disk; ’pat_report’ reads the profiling information on disk and compiles it into human-readable reports. The analysis produced by ’pat_report’ can be visualised using the Cray Apprentice2 tool ’app2’.

CrayPAT can perform two types of performance analysis: sampling experiments and tracing experiments. A sampling experiment probes the code at a predefined interval and produces a report based on these statistics. A tracing experiment explicitly monitors the code performance within named routines. Typically, the overhead associated with a tracing experiment is higher than that associated with a sampling experiment but provides much more detailed information. The key to getting useful data out of a sampling experiment is to run your profiling for a representative length of time.

Detailed documentation on CrayPAT is available from Cray: http://docs.cray.com/

6.2.1 Instrumenting a code with pat_build

Often the best way to analyse the performance is to run a sampling experiment and then use the results from this to perform a focused tracing experiment. In fact, CrayPAT contains the functionality for automating this process – known as Automatic Profiling Analysis (APA).

The example below illustrates how to do this.

6.2.1.1 Example: Profiling the CASTEP Code

Here is a step-by-step example of instrumenting and profiling the performance of a code (CASTEP) using CrayPAT.

The first step is to compile your code with the profiling tools attached. First load the Cray ’perftools’ module with:

module add perftools

Next, compile the CASTEP code in the standard way on the “work” filesystem (which is accessible on the compute nodes). Once you have the CASTEP executable you need to instrument it for a sampling experiment. The following command will produce the ’castep+samp’ instrumented executable:

pat_build -O apa -o castep+samp castep

Run your program as you usually would in the batch system at your site. Here is an example job submission script for a 32-core per node (Interlagos) system using PBS.

#!/bin/bash --login
#PBS -N bench4_il
#PBS -l mppwidth=1024
#PBS -l mppnppn=32 
#PBS -l walltime=1:00:00
# Set this to your budget code
#PBS -A budget

# Switch to the directory you submitted the job from
cd $PBS_O_WORKDIR

# Load the perftools module
module add perftools

# Run the sampling experiment
CASTEP_EXEDIR=/work/user/software/CASTEP/bin
aprun -n 1024 -N 32 $CASTEP_EXEDIR/castep+samp input

When the job completes successfully the directory you submitted the job from will either contain an *.xf file (if you used a small number of cores) or a directory (for large core counts) with a name something like:

castep+samp+25370-14s

The actual name is dependent on the name of your instrumented executable and the process ID. In this case we used 1024 cores so we have got a directory.

The next step is to produce a basic performance report and also the input for ’pat_build’ to create a tracing experiment. This is done (usually interactively) by using the ’pat_report’ command (you must have the ’perftools’ module loaded to use the pat_report command). In our example we would type:

pat_report -o castep_samp_1024.pat castep+samp+25370-14s

This will produce a text report (in castep_samp_1024.pat) listing the various routines in the program and how much time was spent in each one during the run (see below for a discussion on the contents of this file). It also produces a castep+samp+25370-14s.ap2 file and a castep+samp+25370-14s.apa file:

  • The *.ap2 file can be used with the Apprentice2 visualisation tool to get an alternative view of the code performance.

  • The *.apa file can be used as the input to “pat_build” to produce a tracing experiment focused on the routines that take most time (as determined in the previous sampling experiment). We illustrate this below.

For information on interpreting the results in the sampling experiment text report, see the section below.

To produce a focused tracing experiment based on the results from the sampling experiment above we would use the ’pat_build’ command with the *.apa file produced above. (Note, there is no need to change directory to the directory containing the original binary to run this command.)

pat_build -O castep+samp+25370-14s.apa -o castep+apa

This will generate a new instrumented version of the CASTEP executable (castep+apa) in the current working directory. You can then submit a job (as outlined above) making sure that you reference this new executable. This will then run a tracing experiment based on the options in the *.apa file.

Once again, you will find that your job generates a *.xf file (or a directory) which you can analyse using the ’pat_report’ command. Please see the section below for information on analysing tracing experiments.

6.2.2 Analysing profile results using pat_report

The ’pat_report’ command is able to produce many different profile reports from the profile data. You can select the type of pre-defined report options with the -O flag to ’pat_report’. A selection of the most generally useful pre-defined report types is:

  • ca+src – Show the callers (bottom-up view) leading to the routines that have a high use in the report and include source code line numbers for the calls and time-consuming statements.

  • load_balance – Show load-balance statistics for the high-use routines in the program. Parallel processes with minimum, maximum and median times for routines will be displayed. Only available with tracing experiments.

  • mpi_callers – Show MPI message statistics. Only available with tracing experiments

Multiple options can be specified to the -O flag if they are separated by commas. For example:

pat_report -O ca+src,load_balance -o castep_furtherinfo_1024.pat \
    castep+samp+25370-14s

You can also define your own custom reports using the -b and -d flags to ’pat_report’. Details on how to do this can be found in the ’pat_report’ documentation and man pages. The output from the pre-defined report types (described above) shows the settings of the -b and -d flags that were used to generate the report. You can use these examples as the basis for your own custom reports.

pat_report automatically produces a .ap2 file, which is a self-contained archive of the reports, in a form suitable for the Cray Apprentice2 tool app2.
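
For example, to view the sampling results from the CASTEP run above with Apprentice2 (assuming you have an X connection to the system) you could use:

app2 castep+samp+25370-14s.ap2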

Below, we show examples of the output from sampling and tracing experiments for the pure MPI version of CASTEP. A report from your own code will differ in the routines and times shown. Also, if your code uses OpenMP threads, SHMEM, or CAF/UPC then you will also see different output.

6.2.2.1 Example: results from a sampling experiment on CASTEP

In a sampling experiment, the main summary table produced by ’pat_report’ will look something like (of course, your code will contain different functions and subroutines):

Table 1:  Profile by Function

 Samp%  |  Samp  |  Imb.  |  Imb.  |Group 
        |        |  Samp  | Samp%  | Function 
        |        |        |        |  PE=HIDE 

 100.0% | 6954.6 |     -- |     -- |Total
|----------------------------------------------------------------------
|  46.0% | 3199.8 |     -- |     -- |MPI
||---------------------------------------------------------------------
||  12.9% |  897.0 |  302.0 |  25.6% |mpi_bcast
||  12.9% |  896.4 |   78.6 |   8.2% |mpi_alltoallv
||   8.7% |  604.2 |  139.8 |  19.1% |MPI_ALLREDUCE
||   5.4% |  378.6 | 1221.4 |  77.5% |MPI_SCATTER
||   4.9% |  341.6 |  320.4 |  49.2% |MPI_BARRIER
||   1.1% |   75.8 |  136.2 |  65.2% |mpi_gather
||=====================================================================
|  38.8% | 2697.9 |     -- |     -- |USER
||---------------------------------------------------------------------
||   9.4% |  652.1 |  171.9 |  21.2% |comms_transpose_exchange.3720
||   6.0% |  415.2 |   41.8 |   9.3% |zgemm_kernel_n
||   3.4% |  237.6 |   29.4 |  11.2% |zgemm_kernel_l
||   3.1% |  215.6 |   24.4 |  10.3% |zgemm_otcopy
||   1.7% |  119.8 |   20.2 |  14.6% |zgemm_oncopy
||   1.3% |   92.3 |   11.7 |  11.4% |zdotu_k
||=====================================================================
|  15.2% | 1057.0 |     -- |     -- |ETC
||---------------------------------------------------------------------
||   2.0% |  140.5 |   32.5 |  19.1% |__ion_MOD_ion_q_recip_interpolation
||   2.0% |  139.8 |  744.2 |  85.5% |__remainder_piby2
||   1.1% |   78.3 |   32.7 |  29.9% |__ion_MOD_ion_apply_and_add_ylm
||   1.1% |   73.5 |   35.5 |  33.1% |__trace_MOD_trace_entry
|======================================================================

You can see that CrayPAT splits the results from a sampling experiment into threesections:

  • MPI – Samples accumulated in message passing routines

  • USER – Samples accumulated in user routines. These are usually the functions and subroutines that are part of your program.

  • ETC – Samples accumulated in library routines.

The first two columns indicate the % and number of samples (mean of samples computed across all parallel tasks) and indicate the functions in the program where the sampling experiment spent the most time. Columns 3 and 4 give an indication of the differences between the minimum number of samples and maximum number of samples found across different parallel tasks. This, in turn, gives an indication of where the load-imbalance in the program is found. Of course, it may be that load imbalance is found in routines where insignificant amounts of time are spent. In this case, the load-imbalance may not actually be significant (for example, the large imbalance time seen in ’mpi_gather’ in the example above).

In the example above, the largest amounts of time seem to be spent in the MPI routines: mpi_bcast, mpi_alltoallv and MPI_ALLREDUCE and in the program routines: comms_transpose_exchange and the BLAS routine ’zgemm’.

To narrow down where these time consuming calls are in the program we can use the ’ca+src’ report option to ’pat_report’ to get a view of the split of samples between the different calls to the time consuming routines in the program.

pat_report -O ca+src -o castep_calls_1024.pat castep+samp+25370-14s

An extract from the report produced by the above command looks like:

Table 1:  Profile by Function and Callers, with Line Numbers

 Samp%  |  Samp  |Group 
        |        | Function 
        |        |  Caller 
        |        |   PE=HIDE 

 100.0% | 6954.6 |Total
|--------------------------------------------------------------------------------------
--snip--
|  38.8% | 2697.9 |USER
||-------------------------------------------------------------------------------------
||   9.4% |  652.1 |comms_transpose_exchange.3720
|||------------------------------------------------------------------------------------
3||   3.7% |  254.4 |__comms_MOD_comms_transpose_n:comms.mpi.F90:line.17965
4||        |        | __comms_MOD_comms_transpose:comms.mpi.F90:line.17826
5||        |        |  __fft_MOD_fft_3d:fft.fftw3.F90:line.482
6||   3.1% |  215.0 |   __basis_MOD_basis_real_recip_reduced_one:basis.F90:line.4104
7||        |        |    __wave_MOD_wave_real_to_recip_wv_bks:wave.F90:line.24945
8||        |        |     __pot_MOD_pot_nongamma_apply_wvfn_slice:pot.F90:line.2640
9||        |        |      __pot_MOD_pot_apply_wvfn_slice:pot.F90:line.2550
10|        |        |       __electronic_MOD_electronic_apply_h_recip_wv:electronic.F90:line.4364
11|        |        |        __electronic_MOD_electronic_apply_h_wv:electronic.F90:line.4057
12|   3.0% |  211.4 |         __electronic_MOD_electronic_minimisation:electronic.F90:line.1957
13|   3.0% |  205.6 |          __geometry_MOD_geom_get_forces:geometry.F90:line.6299
14|   2.8% |  195.2 |           __geometry_MOD_geom_bfgs:geometry.F90:line.2199
15|        |        |            __geometry_MOD_geometry_optimise:geometry.F90:line.1051
16|        |        |             MAIN__:castep.f90:line.1013
17|        |        |              main:castep.f90:line.623
--snip--

This indicates that (in the USER routines) the part of the ’comms_transpose_exchange’ routine in which the program is spending most time is around line 3720 (in many cases the time-consuming line will be the start of a loop). Furthermore, we can see that approximately one third of the samples in the routine (3.7%) are originating from the call to ’comms_transpose_exchange’ in the stack shown. Here, it looks like they are part of the geometry optimisation section of CASTEP. The remaining two-thirds of the time will be indicated further down in the report.

In the original report (castep_samp_1024.pat) we also get a table reporting on the wall clock time of the program. The table reports the minimum, maximum and median values of wall clock time from the full set of parallel tasks. The table also lists the maximum memory usage of each of the parallel tasks:

Table 3:  Wall Clock Time, Memory High Water Mark

   Process  |  Process  |PE=[mmm] 
      Time  |    HiMem  |
            | (MBytes)  |

 124.409675 |   100.761 |Total
|---------------------------------
| 126.922418 |   101.637 |pe.39
| 124.329474 |   102.336 |pe.2
| 122.241495 |   101.355 |pe.4
|=================================

If you see a very large difference in the minimum and maximum walltime for different parallel tasks it indicates that there is a large amount of load-imbalance in your code and it should be investigated more thoroughly.

6.2.2.2 Example: results from a tracing experiment on CASTEP

The main summary table from a tracing experiment looks very similar to the main summary table from the sampling experiment with the sample counter columns replaced by time columns:

Table 1:  Profile by Function Group and Function

 Time%  |      Time  |     Imb.  |  Imb.  |    Calls  |Group 
        |            |     Time  | Time%  |           | Function 
        |            |           |        |           |  PE=HIDE 

 100.0% | 219.421582 |        -- |     -- | 5661054.0 |Total
|--------------------------------------------------------------------------
|  48.3% | 106.010194 |        -- |     -- |  210415.6 |MPI_SYNC
||-------------------------------------------------------------------------
||  22.1% |  48.462454 | 48.436312 |  99.9% |    1700.1 |mpi_gather_(sync)
||  12.0% |  26.327995 | 25.902518 |  98.4% |   13549.0 |mpi_bcast_(sync)
||   7.0% |  15.312987 | 12.925545 |  84.4% |       9.0 |mpi_barrier_(sync)
||   3.3% |   7.155499 |  3.679692 |  51.4% |   90020.5 |mpi_allreduce_(sync)
||   2.8% |   6.232381 |  6.227867 |  99.9% |    1104.0 |mpi_scatter_(sync)
||   1.1% |   2.518877 |  0.841358 |  33.4% |  104033.0 |mpi_alltoallv_(sync)
||=========================================================================
|  25.1% |  55.160228 |        -- |     -- |       2.0 |USER
||-------------------------------------------------------------------------
|  25.1% |  55.159796 | 65.926769 |  55.3% |       1.0 | main
||=========================================================================
|  20.9% |  45.947098 |        -- |     -- |  210447.6 |MPI
||-------------------------------------------------------------------------
||  15.0% |  32.998366 |  0.835895 |   2.5% |  104033.0 |mpi_alltoallv
||   4.3% |   9.431468 |  0.338688 |   3.5% |   90020.5 |MPI_ALLREDUCE
||=========================================================================
|   5.6% |  12.304063 |  1.288770 |   9.6% | 5240188.9 |STRING
||-------------------------------------------------------------------------
|   5.6% |  12.304063 |  1.288770 |   9.6% | 5240188.9 | memcpy
|==========================================================================

You should note that the single MPI section has now been split into two sections: MPI and MPI_SYNC. The MPI section now contains the time spent by the MPI library in actually doing work while the MPI_SYNC section contains the time spent in MPI routines waiting for the message to complete.

6.2.3 Using hardware counters

CrayPAT also provides hardware counter data. CrayPAT supports both predefined hardware counter events and individual hardware counters. Please refer to the CrayPAT documentation for further details of the hardware counter events: http://docs.cray.com/

To use the hardware counters, you need to set the environment variable PAT_RT_PERFCTR in your job script when running your tracing experiment. For example:

export PAT_RT_PERFCTR=21

(on an Interlagos system) will specify the twenty-first of twenty-two predefined sets of hardware counter events to be measured and reported. This set provides information on floating-point operations and cache performance. On Intel Xeon Sandy Bridge systems, there are 12 sets of hardware counters and on Intel Xeon Ivy Bridge systems, there are 9 sets of hardware counters.

Note that hardware counters are only supported with tracing experiments, not with sampling experiments. If you set PAT_RT_PERFCTR for a sampling experiment the instrumented executable will fail to run.

When you produce a profile report using ’pat_report’, hardware counter information will automatically be included in the report if the PAT_RT_PERFCTR environment variable was set when the experiment was run.

A full list of the predefined sets of counters can be found on the hardware counters man page on your execution site. To access the documentation, use:

man hwpc

6.2.3.1 Example: Hardware counter results from CASTEP running on a Cray XE (Interlagos-based)

=====================================================================================
  USER
-------------------------------------------------------------------------------------
  Time%                                                     25.2% 
  Time                                                  55.178458 secs
  Imb. Time                                                    -- secs
  Imb. Time%                                                   -- 
  Calls                                  0.036 /sec           2.0 calls
  PAPI_L1_DCM                           31.388M/sec    1732264145 misses
  PAPI_TLB_DM                            0.244M/sec      13481844 misses
  PAPI_RES_STL                          11.189 secs   25734805221 cycles
  PAPI_L1_DCA                          823.706M/sec   45459611237 refs
  PAPI_FP_OPS                         2031.598M/sec  112122144958 ops
  DATA_CACHE_REFILLS_FROM_NORTHBRIDGE    6.508M/sec     359181707 fills
  Average Time per Call                                 27.589229 secs
  CrayPat Overhead : Time                 0.0%                    
  User time (approx)                    55.189 secs  126935432457 cycles  100.0% Time
  Total time stalled                    11.189 secs   25734805221 cycles
  HW FP Ops / User time               2031.598M/sec  112122144958 ops   11.0%peak(DP)
  HW FP Ops / WCT                     2031.598M/sec               
  Computational intensity                 0.88 ops/cycle     2.47 ops/ref
  MFLOPS (aggregate)                 130022.30M/sec               
  TLB utilization                      3371.91 refs/miss    6.586 avg uses
  D1 cache hit,miss ratios               96.2% hits          3.8% misses
  D1 cache utilization (misses)          26.24 refs/miss    3.280 avg hits
  System to D1 refill                    6.508M/sec     359181707 lines
  System to D1 bandwidth               397.229MB/sec  22987629227 bytes

The hardware counters for a routine can give hints as to what is causing any problems in performance.

6.3 Reveal

Reveal is Cray’s next-generation integrated performance analysis and code optimization tool. It works with the Cray Compiling Environment (CCE) only. Reveal provides a GUI to navigate through annotated source code of an application, displaying information about loops (similar to the Cray compiler loopmark listing) and inlined code. It provides information on why a loop was vectorised (or why it wasn’t) and what inlined code was produced. It can work out the OpenMP scope of variables within loops, suggest OpenMP directives to parallelise loops, and insert these directives into the source code.

Performance data collected by the Cray performance tools (pat_build, pat_report) from tracing runs of the application can be used in Reveal to indicate which loops are the best candidates for parallelisation.

6.3.1 Using Reveal

Load the Cray compiler and performance tools modules:

module load PrgEnv-cray
module load perftools

and generate a program library (only available with CCE), for example:

ftn -O3 -hpl=my_program.pl -c my_program_file1.f90
ftn -O3 -hpl=my_program.pl -c my_program_file2.f90
ftn -O3 -hpl=my_program.pl -c my_program_file3.f90

Then you can run Reveal on the program library

reveal my_program.pl

to find out what optimisations the compiler has performed, and find suggested OpenMP directives for loops.

To combine performance data with Reveal, the work performed in the loops needs to be measured. Compile and link the program with -h profile_generate to switch on the loop work measurements:

ftn -c -h profile_generate my_program.f
ftn -o my_program -h profile_generate my_program.o

Then instrument the executable, run it, and generate a performance report in Apprentice2 format:

pat_build -w my_program
aprun -n pes ./my_program+pat
pat_report -o my_program.ap2 my_program+pat+1233456-XX.xf

Then the performance data file can be combined with the program library in Reveal:

reveal my_program.pl my_program.ap2

Note that -h profile_generate disables most optimisations, so you should do the loop work estimate runs first, and then generate the program library with your chosen optimisations.

Detailed documentation on Reveal is available from Cray: http://docs.cray.com/

6.4 Allinea MAP

Note: not all sites have Allinea MAP installed, please consult your local documentation for more information.

6.5 General hints for interpreting profiling results

There are several general tips for using the results from the performance analysis:

  • Examine the routines where most of the time is spent to see if they can be optimised in any way – The “ct+src” report option can often be useful here to determine which regions of a function are using the most time.

  • Look for load-imbalance in the code – This is indicated by a large difference in computing time between different parallel tasks.

  • High values of time spent in MPI usually indicate something wrong in the code – Load-imbalance, a bad communication pattern, or just not scaling to the specified number of tasks.

  • High values of cache misses (seen via hardware counters) could mean a bad memory access in an array or loop – Examine the arrays and loops in the problematic function to see if they can be reorganised to access the data in a more cache-friendly way.

The Tuning chapter in this guide has a lot of information on how to optimise both serial and parallel code. Combining profiling information with the optimisation techniques is the best way to improve the performance of your code.

6.5.1 Spotting load-imbalance

By far the largest obstacle to scaling applications out to large numbers of parallel tasks is the presence of load-imbalance in the parallel code. Spotting the symptoms of load-imbalance in performance analysis is key to optimising the scaling performance of your code. Typical symptoms include:

For MPI codes

large amounts of time spent in MPI_BARRIER or MPI collectives (which include an implicit barrier).

For OpenMP codes

large amounts of time in “_omp_barrier” or “_omp_barrierp”.

For PGAS codes

large amounts of time spent in synchronisation functions.

When running a CrayPAT tracing experiment with “-g mpi” the MPI time will be split into computation and “sync” time. A high “sync” time can indicate problems with load imbalance in an MPI code.

The location of the accumulation of load-imbalance time often gives an indication of where in your program the load-imbalance is occurring.

Please also see the Load-imbalance section in the Tuning chapter.

7. Tuning

This section discusses best practice for improving the performance of your code on Cray XE/XC systems. We begin with a discussion of how to optimise the serial (single-core) compute performance and then discuss how to improve parallel performance.

Please note that these are general guidelines and some/all of the recommendations may not have an impact on your code. We always advise that you analyse the performance of your code using the profiling tools detailed in the Performance analysis section to identify bottlenecks and parallel performance issues (such as load imbalance).

7.1 Optimisation summary

A summary of getting the best performance from your code would be:

  1. Select the right (parallel) algorithm for your problem. If you do not do this then no amount of optimisation will give you the best performance.

  2. Use the compiler optimisation flags (and use pointers sparingly in your code).

  3. Use the optimised numerical libraries supplied by Cray rather than coding yourself.

  4. Eliminate any load-imbalance in your code (CrayPAT can help identify load-balance issues). If you have load-imbalance then your code will never scale up to large core counts.

7.2 Serial (single-core) optimisation

7.2.1 Compiler optimisation flags

One of the easiest optimisations to perform is to use the correct compiler flags. This optimisation technique is extremely simple as it does not require you to modify your source code – although alterations to your source code may allow compiler flags to have more beneficial effects. It is often worth taking the time to try a number of optimisation flag combinations to see what effect they have on the performance of your code. In addition, many of the compilers will provide information on what optimisations they are performing and, more usefully, what optimisations they are not performing and why. The flags needed to enable this information are indicated below.

Typical optimisations that can be performed by the compiler include:

Loop optimisation

such as vectorisation and unrolling.

Inlining

replacing a call to a function with the actual function source code.

Local logical block optimisations

such as scheduling, algebraic identity removal.

Global optimisations

such as constant propagations, dead store eliminations (still within a single source code file).

Inter-procedural analyses

try to optimise across subroutine/function boundary calls (can span multiple source code files).

The compiler-specific documentation and man pages contain more information about which optimisations particular flags will enable/disable.

When using the more aggressive optimisation options it is important to be aware that the resulting output might be affected, for example a loss of precision. Some of the optimisation options allow changing the order of execution and changing how arithmetic computations are performed. When using aggressive optimisations it is important to test your code to ensure that it still produces the correct result.

Many compiler suites allow pragmas or flags to be placed in the source code to give more information on whether or not (or even how) particular sections of code should be optimised. These can be useful, particularly for restricting optimisation in sections of code where the order of execution is critical. For example, the Portland Group compiler can vectorise individual loops, perform memory prefetching, and select an optimisation level for a code section.

Cray Compiler Suite

The -O1, -O2 and -O3 flags instruct the compiler to attempt various levels of optimisation (with -O1 being the least aggressive and -O3 being the most aggressive). The default is -O2 but most codes should benefit from increasing this to -O3. Increased floating point optimisation (at the expense of reduced conformity to the IEEE standard) can be switched on with the -fp3 flag.

To enable information on successful optimisations use the -Omsgs flag and to enable information on failed optimisations add the -Onegmsgs flag.
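
As an illustration only (my_source.f90 is a placeholder file name), a compile line combining these flags via the ftn compiler wrapper might look like:

ftn -O3 -fp3 -Omsgs -Onegmsgs -c my_source.f90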

GNU Compiler Suite

The -O1, -O2 and -O3 flags instruct the compiler to attempt various levels of optimisation (with -O1 being the least aggressive and -O3 being the most aggressive). The options -ffast-math and -funroll-loops may also be tried.

The option -ftree-vectorizer-verbose=N will generate information about attempted loop vectorisations.
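
As an illustration only (assuming the PrgEnv-gnu module is loaded; my_source.f90 is a placeholder), a compile line using these flags via the ftn wrapper might look like:

ftn -O3 -ffast-math -funroll-loops -ftree-vectorizer-verbose=2 -c my_source.f90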

PGI Compiler Suite

The most useful set of optimisation flags for most codes will be: -fast -Mipa=fast. Other useful optimisation flags are -O3, -Mpfi, -Mpfo, -Minline, -Munroll and -Mvect.

To enable information on successful optimisations use the -Minfo flag and to enable information on failed optimisations add the -Mneginfo flag.
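
As an illustration only (assuming the PrgEnv-pgi module is loaded; my_source.f90 is a placeholder), for example:

ftn -fast -Mipa=fast -Minfo -Mneginfo -c my_source.f90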

Intel Compiler Suite

The most useful high-level flag is -fast which enables a set of common recommended optimisations (including -O3). The -O1, -O2 and -O3 flags instruct the compiler to attempt various levels of optimisation (with -O1 being the least aggressive and -O3 being the most aggressive). The default is -O2. Further optimisations can be tried with the -unroll-aggressive and -opt-prefetch flags, and the -no-prec-div and -fp-model fast=2 flags to increase the floating point optimisations. As always, test whether the increased optimisation level actually does improve the performance of your code. The AVX floating point instructions are switched on with the -xAVX option.

The option -opt-report will give information on the optimisations performed by the Intel compiler, and -guide can give you guidance on vectorisation and parallelisation. Intel also provides profile-guided optimisation, please see the Intel documentation.
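
As an illustration only (assuming the PrgEnv-intel module is loaded; my_source.f90 is a placeholder, and these options may not suit all codes):

ftn -O3 -xAVX -no-prec-div -fp-model fast=2 -opt-report -c my_source.f90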

7.2.2 Using Libraries

Another easy way to boost the serial performance for your code is to use the optimised numerical libraries provided on the Cray XE/XC system. More information on the libraries available on the system can be found in the section: Available numerical libraries.

7.2.3 Writing Optimal Serial Code

The speed of computation is determined by the efficiency of your algorithm (essentially the number of operations required to complete the calculation) and how well the compiled executable can exploit the Opteron or Xeon architecture.

When actually writing your code the largest single effect you can have on performance is by selecting the appropriate algorithm for the problems you are studying. The algorithm you choose is dependent on many things but may include such considerations as:

Precision

Do you need to use double precision floating point numbers? If not, single or mixed-precision algorithms can run up to twice as fast as the double precision versions.

Problem size

What are the scaling properties of your algorithm? Would a different approach allow you to treat larger problems more efficiently?

Complexity

Although a particular algorithm may theoretically have the best scaling properties, is it so complex that this benefit is lost during coding?

Often algorithm selection is non-trivial and a good proportion of code benchmarking and profiling is needed to elucidate the best choice.

Once you have selected the best algorithms for your code you should endeavour to write your code in such a way that allows the compiler to exploit the Opteron or Xeon processor architecture in the most efficient way.

The first rule is that if your code segment can be replaced by an optimised library call then you should do this (see Available Numerical Libraries). If your code segment does not have an equivalent in one of the standard optimised numerical libraries then you should try to use code constructs that will expose instruction-level parallelism or vectorisation (also known as SSE/AVX instructions) to the compiler, while avoiding hand-coding simple optimisations that the compiler can perform easily itself. For floating-point intensive kernels the following general advice applies:

  • Avoid the use of pointers – these limit the optimisation that the compiler can perform.

  • Avoid using function calls, branching statements and goto statements wherever possible (see the sketch after this list).

  • Only loops of stride 1 are amenable to vectorisation.

  • For nested loops, the innermost loop should be the longest and have a stride of 1.

  • Loops with a low number of iterations and/or little computation should be unrolled.
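
As a minimal illustration of these points in C (the variable and array names are purely illustrative), hoisting a loop-invariant branch out of an inner loop leaves simple, stride-1 loops that the compiler can vectorise:

/* A conditional branch inside the inner loop inhibits vectorisation: */
for (i = 0; i < n; i++) {
    if (scale > 0.0)
        y[i] = scale * x[i];
    else
        y[i] = x[i];
}

/* Hoisting the loop-invariant test out of the loop exposes two simple,
   stride-1 loops that the compiler can vectorise: */
if (scale > 0.0) {
    for (i = 0; i < n; i++)
        y[i] = scale * x[i];
} else {
    for (i = 0; i < n; i++)
        y[i] = x[i];
}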

7.2.4 Cache Optimisation

Main memory access on systems such as Cray XE/XC machines is usually around two orders of magnitude slower than performing a single floating-point operation. One solution used in the Opteron and Xeon architectures to mitigate this is to use a hierarchy of smaller, faster memory spaces on the processor known as caches. This solution works because there is often a high chance of a particular address from memory being needed again within a short interval, or of an address from the same vicinity of memory being needed at the same time. This suggests that we can improve the performance of our code if we write it in such a way that we access the data in memory in an order that allows the cache hierarchy to be used as efficiently as possible.

Cache optimisation can be a very complex subject but we will try to provide a few general principles that can be applied to your codes that should help improve cache efficiency. The CrayPAT tool introduced in the Performance Analysis section can be used to monitor the cache efficiency of your code through the use of hardware counters.

Effectively, in programming for cache efficiency we are seeking to provide additional locality in our code. Here, locality refers to both spatial locality – using data located in blocks of consecutive memory addresses; and temporal locality – using the same address multiple times in a short period of time.

  • Spatial locality can be improved by looping over data (in the innermost loop of nested loops) using a stride of 1 (or, in Fortran, by using array syntax); see the sketch after this list.

  • Temporal locality can be improved by using short loops that do not contain function calls or branching statements.
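
As a sketch in C (where arrays are stored in row-major order; the array names and bound n are illustrative), the loop ordering determines whether the innermost loop walks through contiguous memory:

/* Poor spatial locality: the inner loop strides through memory n elements
   at a time, so each access is likely to touch a new cache line. */
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        c[i][j] = a[i][j] + b[i][j];

/* Good spatial locality: the inner loop has stride 1 and consecutive
   iterations reuse the cache lines already fetched. */
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        c[i][j] = a[i][j] + b[i][j];

In Fortran the opposite loop ordering applies, because arrays are stored in column-major order.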

There are two other ways in which the cache technology can have a detrimental effect on code performance.

Part of the way in which caches are able to achieve high performance is by mapping each memory address onto a limited set of cache lines; this is known as n-way set associativity. This property of caches can seriously affect the performance of codes where two array variables involved in an operation map to the same cache line, so that the cache line must be refilled twice for each instance of the operation. One way to minimise this effect is to avoid using powers of 2 for your array dimensions (as cache lines are always powers of 2) or, if you see this happening in your code, to pad the arrays with enough extra elements to stop it happening.
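
A minimal sketch of such padding in C (the sizes are purely illustrative): if arrays with a power-of-2 row length are accessed together, increasing the row length by a few elements changes the address spacing so that the accesses no longer compete for the same cache sets:

#define N    1024       /* power-of-2 row length: prone to cache-line conflicts */
#define NPAD (N + 8)    /* padded row length reduces the conflicts */

double a[N][NPAD];      /* only the first N elements of each row are used */
double b[N][NPAD];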

The other major effect on users' codes comes in the form of so-called TLB misses. The TLB in question is the translation lookaside buffer and is the mechanism that the cache/memory hierarchy uses to convert application addresses to physical memory addresses. If a mapping is not contained in the TLB then main memory must be accessed for further information, resulting in a large performance penalty. TLB misses most often occur in codes as they loop through an array using a large stride.

The specific cache layout is processor model dependent; consult the site-specific documentation for information on the cache layout of the machine you are using.

7.3 Parallel optimisation

Some of the most important advice from the serial optimisation section also applies for parallel optimisation, namely:

  • Choose the correct algorithm for your problem.

  • Use vendor-provided libraries wherever possible.

When programming in parallel you will also need to select the parallel programming model to use. As the Cray XE/XC systems are MPP machines with distributed memory you have the following options:

  • Pure MPI – using just the MPI communications library.

  • Pure SHMEM – using just the SHMEM, single-sided communications library.

  • Pure PGAS – using one of the Partitioned Global Address Space (PGAS) implementations, such as Coarray Fortran (CAF) or Unified Parallel C (UPC).

  • Hybrid approach – using a combination of parallel programming models (most often MPI+OpenMP but MPI+CAF and MPI+SHMEM are also used).

The Cray XE/XC interconnect architecture includes hardware support for single-sided communications, which means that SHMEM and PGAS approaches run very efficiently and, if your algorithm is amenable to such an approach, are worth considering as an alternative to the more traditional pure MPI approach. A caveat here is that if your code makes heavy use of collective communications (for example, all-to-all or allreduce type operations) then you will find that the optimised MPI versions of these routines almost always outperform the equivalents coded using SHMEM or PGAS. The Aries interconnect on the Cray XC machines includes hardware support for MPI_Barrier, MPI_Allreduce (message sizes up to 16 bytes), and MPI_Alltoall.

In addition, because Cray XE/XC machines are constructed from quite powerful SMP building blocks (i.e. individual nodes with up to 32 cores), a hybrid programming approach using OpenMP for parallelism within a node and MPI for communications between nodes will generally produce code with better scaling properties than a pure MPI approach.

7.3.1 Load-imbalance

None of the parallel optimisation advice here will allow your code to scale to larger numbers of cores if your code has a large amount of load-imbalance.

Load-imbalance in parallel algorithms arises where different parallel tasks (or threads) have a large difference in the amount of computational work to perform. This, in turn, leads to some tasks (or threads) sitting idle at synchronisation points while waiting for other tasks to complete their block of work. Obviously, this can lead to a large amount of inefficiency in the program and can seriously inhibit good scaling behaviour.

Before optimising the parallel performance of your code it is always worth profiling (see the Performance Analysis section) to try to identify the level of load-imbalance in your code; CrayPAT provides excellent tools for this. If you find a large amount of load-imbalance then you should eliminate this as much as possible before proceeding. Note that load-imbalance may only become apparent once you start using the code on higher and higher numbers of cores.

Eliminating load-imbalance can involve changing the algorithm you are using and/or changing the parallel decomposition of your problem. Generally, this issue is very code specific.

7.3.2 MPI Optimisation

The majority of parallel, scientific software still uses the MPI library as the main way to implement parallelism in the code, so much effort has been put in by Cray software engineers to optimise the MPI performance on Cray XE/XC systems. You should make use of this by using high-level MPI routines for parallel operations wherever possible. For example, you should almost always use MPI collective calls rather than writing your own versions using lower-level MPI sends and receives.

When writing MPI (or hybrid MPI+X) code you should:

  • overlap communication and computation by using non-blocking operations wherever possible;

  • pre-post receives before the matching send operation is called to save memory copies and MPI buffer management overheads (see the sketch after this list);

  • send few large messages rather than many small messages to minimise latency costs;

  • use collective communication routines as little as possible;

  • avoid the use of mpi_sendrecv as this is an extremely slow operation unless the two MPI tasks involved are perfectly synchronised.
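
A minimal sketch of the first two points in C with MPI (the neighbour rank, tag, buffer names, and the routines do_interior_work and do_boundary_work are all hypothetical placeholders):

MPI_Request recv_req, send_req;

/* Pre-post the receive before the matching send is issued elsewhere. */
MPI_Irecv(recv_buf, count, MPI_DOUBLE, neighbour, tag, MPI_COMM_WORLD, &recv_req);
MPI_Isend(send_buf, count, MPI_DOUBLE, neighbour, tag, MPI_COMM_WORLD, &send_req);

/* Overlap: do work that does not depend on the incoming message. */
do_interior_work();

/* Complete the communication only when the halo data is actually needed. */
MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
MPI_Wait(&send_req, MPI_STATUS_IGNORE);
do_boundary_work();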

Some useful MPI environment variables that can be used to tune the performance of your application are:

MPICH_GNI_MAX_EAGER_MSG_SIZE

tune the use of the eager messaging protocol which tries to minimise the use of the MPI system buffer. The default on Cray XE/XC systems is 8192 bytes. Increasing/decreasing this value may improve performance.

MPICH_GNI_RECV_CQ_SIZE

increases the buffer size for messages that are received before the receive has been posted (default is 40960 bytes on XE/XC systems). Increasing this may improve performance if you have a large number of such messages. Better to alter the code to pre-post receives if possible though.

MPICH_USE_DMAPP_COLL

attempts to use the highly optimized collective algorithms, if available. On XC systems, this can use the Aries hardware to implement MPI Barrier, Allreduce, and Alltoall.

MPICH_ENV_DISPLAY

set to display the current environment settings when an MPI program is executed.

Use “man intro_mpi” on the machine to show a full list of available options.
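
For example, the following lines could be added to a job submission script before the aprun command (the eager message size shown is purely illustrative; suitable values are application dependent):

export MPICH_ENV_DISPLAY=1
export MPICH_GNI_MAX_EAGER_MSG_SIZE=16384
export MPICH_USE_DMAPP_COLL=1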

7.3.3 Mapping tasks/threads onto cores

The way in which your parallel tasks/threads are mapped onto the cores of the Cray XE/XC compute nodes can have a large effect on performance. Some options you may want to consider are:

  • When underpopulating a compute node with parallel tasks it can often be beneficial to ensure that the parallel tasks are evenly spread across NUMA regions using the -S option to aprun (see below). This has the potential to optimise the memory bandwidth available to each core. It also frees up the additional cores for use by the multithreaded version of Cray’s LibSci library: set the OMP_NUM_THREADS environment variable to the number of spare cores available to each parallel task and use the “-d $OMP_NUM_THREADS” option to aprun (see below).

  • On the AMD Bulldozer architecture (Interlagos processors), if you use half the cores per node you may be able to get additional performance by ensuring that each core has exclusive access to the shared floating point unit. You can do this by specifying the “-d 2” option to aprun (see below).

The aprun command which launches parallel jobs onto Cray XE/XC compute nodes has a range of options for specifying how parallel tasks and threads are mapped onto the actual cores on a node. Some of the most important options are:

-n parallel_tasks

Total number of parallel tasks (not including threads). Default is 1.

-N parallel_tasks_per_node

Number of parallel tasks (not including threads) per node. Default is the number of cores in a node.

-d threads_per_parallel_task

Number of threads per parallel task. For OpenMP codes this will usually be equal to $OMP_NUM_THREADS. Default is 1.

-S parallel_tasks_per_numa

Number of parallel tasks to assign to each NUMA region on the node. There are 4 NUMA regions per XE compute node, 2 NUMA regions per XC compute node. Default is number of cores in a NUMA region.

-j num_cpus

(XC only) Number of CPUs per compute unit, which is the Cray terminology for number of Hyper-Threads per core. Default is 1.

Some examples should help to illustrate the various options. In the first two examples we assume we are running on Cray XE compute nodes that have 32 cores per node arranged into 4 NUMA regions of 8 cores each.

Example 1:

Pure MPI job using 1024 MPI tasks (-n option) with 32 tasks per node (-N option):

aprun -n 1024 -N 32 my_app.x

This is analogous to the behaviour of mpiexec on Linux clusters.

Example 2:

Hybrid MPI/OpenMP job using 512 MPI tasks (-n option) with 8 OpenMP threads per MPI task (-d option), 4096 cores in total. There will be 4 MPI tasks per node (-N option) and the 8 OpenMP threads are placed such that the threads associated with each MPI task are assigned to the same NUMA region (1 MPI task per NUMA region, -S option):

aprun -n 512 -N 4 -d 8 -S 1 my_app.x

The equivalent commands on a Cray XC system with 16 cores (each possibly with 2 Hyper-Threads) per node arranged into 2 NUMA regions of 8 cores are:

Example 3:

Pure MPI job using 1024 MPI tasks (-n option) with 16 tasks per node (-N option):

aprun -n 1024 -N 16 my_app.x

Using Hyper-Threading, 32 tasks can be fitted onto each node:

aprun -n 1024 -N 32 -j 2 my_app.x

Example 4:

Hybrid MPI/OpenMP job without Hyper-Threading, using 512 MPI tasks (-n option) with 8 OpenMP threads per MPI task (-d option), 4096 cores in total. There will be 2 MPI tasks per node (-N option) and the 8 OpenMP threads are placed such that the threads associated with each MPI task are assigned to the same NUMA region (1 MPI task per NUMA region, -S option):

aprun -n 512 -N 2 -d 8 -S 1 my_app.x

Further information on job placement can be found in the Cray documentation, or by typing:

man aprun

when logged on to any of the Cray XE/XC systems.

Note that when submitting jobs using SLURM, the cores (or cores x Hyper-Threads if using aprun -j 2) used for the OpenMP threads in each MPI task need to be reserved using:

#SBATCH --cpus-per-task=<number of OpenMP threads>

in the SLURM batch script.
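
As a sketch, a hybrid job on a 16-core-per-node XC system using 4 MPI tasks per node with 4 OpenMP threads per task might reserve the cores as follows (the node count is purely illustrative and other SBATCH options, such as partition and time limit, are omitted):

#SBATCH --nodes=128
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=4
aprun -n 512 -N 4 -S 2 -d 4 my_app.x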

7.3.4 Core specialisation

Although the Cray Linux Environment on the compute nodes is highly optimised for HPC use, system tasks do take up some time and, since they have to run on one of the cores used for the application, they can affect scalability at very high core counts; this is ’OS noise’ or ’OS jitter’. It is possible to dedicate one core for doing system tasks, with all the other cores on a node being used to run the application. This can improve the scalability of some codes.

This is called ’core specialisation’. To enable core specialisation use the -r n option in aprun,with n usually 1 or 2. For example

aprun -r 2 -n 1024 -N 30 my_app.x

on a 32-core per node XE system will run with two cores per node dedicated to system tasks, leaving 30 cores per node for the application. The apcount command can be used to calculate the batch reservation needed. On an XC system, the system tasks can be run on the Hyper-Threads when using the -j 1 option in aprun, so that there will still be 16 cores (Sandy Bridge) or 24 cores (Ivy Bridge) per node available for the application.

On XE systems, if the MPI asynchronous progress feature is used, then core specialisation is needed to reserve some cores for the MPI helper threads used for the asynchronous progress engine. On XC systems, if the application is not using Hyper-Threads (aprun -j 1) then the asynchronous progress engine will run in the unused Hyper-Threads; if Hyper-Threads are being used (aprun -j 2), then cores will need to be reserved for the asynchronous progress engine. See the intro_mpi man page for details.

7.4 Advanced OpenMP usage

On Cray XE/XC systems, when using the GNU compiler suite, the location of the thread that initialises the data can determine the location of the data. This means that if you initialise your data in the serial portion of the code then the data will be located in the NUMA region associated with thread 0. This behaviour can have implications for performance in the parallel regions of the code if a thread from a different NUMA region then tries to access that data. If you are using the Cray or PGI compiler suites then there is no guarantee of where shared data will be located if your OpenMP code spans multiple NUMA regions.

You can overcome these limitations, when using the GNU compiler suite, by initialising your data in parallel (within a parallel region). This can give improved performance if all the subsequent data access is only from each thread to its own NUMA region, but poorer performance if threads access data from other NUMA regions.
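
A minimal sketch of this parallel (first-touch) initialisation in C with OpenMP, assuming the same static loop schedule is used for the initialisation and the later compute loops so that each thread first touches the data it will subsequently work on (array and variable names are illustrative):

/* Initialise in parallel so that each page is first touched by the thread
   (and hence allocated in the NUMA region) that will later use it. */
#pragma omp parallel for schedule(static)
for (i = 0; i < n; i++) {
    a[i] = 0.0;
    b[i] = 1.0;
}

/* The compute loop uses the same static schedule, so each thread mostly
   accesses memory local to its own NUMA region. */
#pragma omp parallel for schedule(static)
for (i = 0; i < n; i++)
    a[i] = a[i] + scale * b[i];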

We always recommend that OpenMP code does not span multiple NUMA regions on Cray XE/XC systems. In general, it has been found that it is very difficult to gain any parallel performance when using OpenMP parallel regions that span multiple NUMA regions on a Cray XE compute node – it is expected that this will be true for Cray XC compute nodes also. For this reason, you will generally find that it is best to use one of the following task/thread layouts if your code combines MPI and OpenMP.

Optimal MPI/OpenMP task/thread affinities for 32-core Cray XE compute nodes

MPI Tasks per NUMA Region   Threads per MPI task   aprun syntax
1                           8                      aprun -n … -N 4 -S 1 -d 8 …
2                           4                      aprun -n … -N 8 -S 2 -d 4 …
4                           2                      aprun -n … -N 16 -S 4 -d 2 …

Optimal MPI/OpenMP task/thread affinities for 24-core Cray XE compute nodes

MPI Tasks per NUMA Region   Threads per MPI task   aprun syntax
1                           6                      aprun -n … -N 4 -S 1 -d 6 …
2                           3                      aprun -n … -N 8 -S 2 -d 3 …
3                           2                      aprun -n … -N 12 -S 3 -d 2 …

Optimal MPI/OpenMP task/thread affinities for 16-core Cray XC compute nodes

MPI Tasks per NUMA Region   Threads per MPI task   aprun syntax
1                           8                      aprun -n … -N 2 -S 1 -d 8 …
2                           4                      aprun -n … -N 4 -S 2 -d 4 …
4                           2                      aprun -n … -N 8 -S 4 -d 2 …

Optimal MPI/OpenMP task/thread affinities for 24-core Cray XC compute nodes

MPI Tasks per NUMA Region   Threads per MPI task   aprun syntax
1                           12                     aprun -n … -N 2 -S 1 -d 12 …
2                           6                      aprun -n … -N 4 -S 2 -d 6 …
3                           4                      aprun -n … -N 6 -S 3 -d 4 …
4                           3                      aprun -n … -N 8 -S 4 -d 3 …

There is a known issue with OpenMP thread migration when using the GNU programming environment to compile OpenMP code with multiple parallel regions. When a parallel region is finished and a new parallel region begins, all of the threads become assigned to core 0, leading to extremely poor performance. To prevent this happening you should add the “-cc none” option to aprun.

Combining MPI and OpenMP using the Intel compiler needs some care with the binding of the threads. A helper thread is created in the Intel implementation of OpenMP and this affects the CPU binding by aprun. So CPU binding should be switched off, spreading the threads over the whole node or the NUMA region. Two cases have been tested and found to work (here on a 16 cores per node XC system):

  • 1 MPI task per node x 16 OpenMP threads, without Hyper-Threading

export OMP_NUM_THREADS=16
export KMP_AFFINITY="granularity=fine,compact,1"
aprun -n <number of MPI tasks> -d 16 -N 1 -cc none application.exe
  • 2 MPI tasks per node x 8 OpenMP threads, without Hyper-Threading

export OMP_NUM_THREADS=8
export KMP_AFFINITY="compact,1"
aprun -n <number of MPI tasks> -d 8 -N 2 -cc numa_node -S 1 -ss application.exe

7.4.1 Environment variables

The following are the most important OpenMP environment variables:

OMP_NUM_THREADS=n

Sets the maximum number of OpenMP threads available to each parallel task.

OMP_NESTED=true

Enable nested OpenMP parallel regions. Available on Cray, GNU, and Intel compilers.

OMP_SCHEDULE=policy

Determines how iterations of loops are scheduled.

OMP_STACKSIZE=size

Specifies the size of the stack for threads created.

OMP_WAIT_POLICY=policy

Controls the desired behaviour of waiting threads.

A more complete list of OpenMP environment variables can be found in the OpenMP specification.
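
For example, a job script fragment might set (the values shown are purely illustrative):

export OMP_NUM_THREADS=8
export OMP_SCHEDULE="dynamic,4"
export OMP_STACKSIZE=64M
export OMP_WAIT_POLICY=active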

7.5 Memory optimisation

Although the dynamic memory allocation procedures in modern programming languages offer a large amount of convenience, the allocation and deallocation functions are time-consuming operations. For this reason they should be avoided in subroutines/functions that are frequently called.

The aprun option -m size[h|hs] specifies the per-PE required Resident Set Size (RSS) memory size in megabytes (K, M, and G suffixes, case insensitive, are supported). If you do not include the -m option, the default amount of memory available to each task equals the minimum value of (compute node memory size) / (number of cores) calculated for each compute node.

7.5.1 Memory affinity

Please see the discussion of memory affinity in the Advanced OpenMP usage section.

7.5.2 Memory allocation (malloc) tuning

The default is to allow remote-NUMA-node memory allocation to all assigned NUMA nodes. Use the aprunoption -ss to specify strict memory containment per NUMA node.

Linux also provides some environment variables to control how malloc behaves, e.g. MALLOC_TRIM_THRESHOLD_, which is the amount of free space at the top of the heap after a free() that needs to exist before malloc will return the memory to the OS. Returning memory to the OS is costly. The default setting of 128 KBytes is much too low for a node with 32 GBytes of memory and one application. Setting it higher might improve performance for some applications.
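
For example (the value is purely illustrative), a job submission script might raise the threshold to 512 MBytes with:

export MALLOC_TRIM_THRESHOLD_=536870912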

7.5.3 Using huge pages

Huge pages are virtual memory pages that are larger than the default 4KB page size. They can improve the memory performance for codes that have common access patterns across large datasets. Generally, huge pages are accessed by a user using the libhugetlbfs library (-lhugetlbfs) and by setting the environment variable HUGETLB_MORECORE=yes at runtime. Huge pages can sometimes provide better performance by reducing the number of TLB misses and by enforcing larger sequential physical memory inside each page.

The Cray XE/XC systems are set up to have huge pages available by default. The craype-hugepages<size> modules can be used to set the necessary link options and environment variables to enable the use of huge pages. <size> can be 128K, 512K, 2M, 8M, 16M, 64M for XE systems and 2M, 4M, 8M, 16M, 32M, 64M, 128M, 256M, 512M for XC systems. The choices craype-hugepages2M and craype-hugepages8M have been the most commonly successful on XE systems.

Note that the -lhugetlbfs link option is set up appropriately when a huge pages module is loaded; you do not need to specify this explicitly on the link line. You will also need to load the appropriate craype-hugepages<size> module at runtime (in your job submission script) for huge pages to work.

If you know the memory requirements of your application in advance you should set the -m option to aprun when you launch your job to preallocate the appropriate number of huge pages. This improves performance by reducing operating system overhead. The syntax is:

-m<size>h

request size MBytes per PE (advisory)

-m<size>hs

request size MBytes per PE (required)
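
As a sketch (the source file name, task count and per-PE memory figure are purely illustrative), a code could be built and run with 2 MB huge pages as follows. At compile time:

module load craype-hugepages2M
cc -o my_app.x my_code.c

and in the job submission script:

module load craype-hugepages2M
aprun -n 1024 -m700hs my_app.x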

7.6 I/O optimisation

I/O subsystems are system dependent. Please consult the individual site documentation for information on the I/O subsystems.

The HECToR CSE team has published some information on how to optimise the I/O on the HECToR system which includes information that may be generally applicable.

8. Debugging

8.1 Available Debuggers

The Cray XE and XC systems come with a number of tools to aid in debugging your program. Some systems may also have third-party debugging tools (such as the TotalView debugger) installed. Please consult the site-specific documentation for information on any installed third-party debugging tools.

Site       Available Debuggers
HERMIT     Allinea DDT, Cray ATP
HECToR     Cray ATP, RogueWave TotalView, lgdb, Allinea DDT
Lindgren   RogueWave TotalView
Sisu       Cray ATP, RogueWave TotalView, lgdb
Archer     Cray ATP, Allinea DDT, lgdb

Note that the usefulness and accuracy of the information within any debugger depends on your compilation options. If you have optimisation switched on then you may find that the line numbers listed in the debugging information do not correspond with the statements in your source code file. For debugging code we always recommend that you compile with optimisation switched off and the -g flag enabled to provide the most accurate information.

8.2 TotalView

Cray XE/XC TotalView provides source-level debugging of Fortran, C, and C++ code compiled by the Cray, PGI, Intel and GNU compilers. The debugging tool provides both a command line interface and a Motif-based GUI. It supports MPI message queue display and watchpoints. There may be some limitations which are site dependent. Please consult your site for further information.

8.2.1 Example, Debugging an MPI application

The following example shows how to invoke TotalView to debug an MPI code.

  • Start an X-server on your local machine (if you need to).

  • Login to system using ’ssh -Y’ (or ’ssh -X’) to enable X-windows forwarding.

  • Compile your code with the ’-g’ option. Your code and executable must be in the “work” directory.

  • If the batch system is PBS or Torque/Moab, submit the TotalView job to the batch system and leave the terminal you submitted the job from open. Below is an example for an Interlagos-based Cray XE system.

#!/bin/sh
#PBS -A your_budget_account
#PBS -l walltime=00:05:00
#PBS -v DISPLAY
#PBS -l mppwidth=64
#PBS -l mppnppn=32

cd $PBS_O_WORKDIR

totalview aprun -a -n 64 -N 32 /work/.../myprog.x

The -a option after aprun causes TotalView to pass all the rest of the arguments to aprun; otherwise they would be treated as arguments to TotalView.

  • If the batch system is SLURM, use the salloc command. For example, for 32 MPI tasks on 2 nodes:

salloc --nodes=2 --ntasks-per-node=16 -t 00:05:00 -p test totalview aprun -a -j 1 -n 32 -N 16 /work/.../myprog.x

On XC systems, this may need to be launched on an internal login node.

  • When the job starts, the dialogue in Figure 8.1 will be displayed – click ’OK’:

Figure 8.1: Initial TotalView dialogue.

  • An empty TotalView debugging window will be displayed (Figure 8.2) – click ’Go’ to start the program:

Figure 8.2: Empty TotalView debugging window.

  • The ’Halt program’ dialogue will be displayed (Figure 8.3) – click ’Yes’ to begin debugging

Figure 8.3: Initial Totalview dialogue.

  • The TotalView debugging window (Figure 8.4) will be displayed with your source code in the middle frame. The top-left frame shows the current call tree and the top-right frame shows the current values of defined variables.

Figure 8.4: Populated TotalView debugging window.

  • To add a breakpoint at a particular subroutine or function select ’Debug -> Breakpoint -> At Location…’ and enter the name of the subroutine or function in the resulting dialogue (Figure 8.5) and click ’OK’:

Figure 8.5: Add Breakpoint dialogue.

  • Click ’Go’ in the Totalview debugging window and the program will run until the named routine is reached (Figure 8.6):

Figure 8.6: TotalView debugging window halted at breakpoint.

  • You can add further breakpoints by scrolling through the source and clicking on the line number to the left of the source code.

8.2.2 Example: Using Totalview to debug a core file

To generate core files you just need your working directory to be in the “work” filesystem, and have the line:

ulimit -c unlimited

in your batch script. Unfortunately the option to tag core files with the process ID is not enabled, so if more than one process dumps core then the core files will overwrite each other. You can use Cray ATP (see below) to produce a selected set of core files for a program that crashes.

To use the TotalView GUI to debug a core file, follow these steps:

  • Start an X Server on your local machine and login to the system using the ’-Y’ (or ’-X’) option to ssh. Launch TotalView:

totalview
  • Click the down arrow on the field showing Start a new process, and from the drop-down list select “Open a core file”. The Program and Core file fields are then displayed.

  • In the Program field, enter the name of the program you wish to debug. In the Core file field, enter the name of the core file produced by this program. If necessary, use the Browse functions to find and select the files.

  • Click the “OK” button. TotalView opens the executable and core files.

Alternatively, you can use the command-line interface (CLI) to debug the program by entering the following command:

totalviewcli program_name core_file_name

or the GUI by entering

totalview program_name core_file_name

8.2.3 TotalView Limitations

The TotalView debugging suite for the Cray XE/XC differs in functionality from the standard TotalView implementation in the following ways:

  • The TotalView Visualizer is not included.

  • The TotalView HyperHelp browser is not included.

  • Debugging multiple threads on compute nodes is not supported.

  • Debugging MPI_Spawn(), OpenMP, Cray SHMEM, or PVM programs is not supported.

  • Compiled EVAL points and expressions are not supported.

  • Type transformations for the PGI C++ compiler standard template library collection classes are not supported.

  • Exception handling for the PGI C++ compiler runtime library is not supported.

  • Spawning a process onto the compute processors is not supported.

  • Machine partitioning schemes, gang scheduling, or batch systems are not supported.

In some cases, TotalView functionality is limited because Compute Node Linux (CNL) does not support the feature in the user program.

8.3 Cray ATP

Cray ATP (Abnormal Termination Processing) is a tool that monitors your application and, in the event of an abnormal termination, it will collate the failure information from all the running processes into files for analysis.

With ATP enabled, in the event of abnormal termination, all of the stacktraces are gathered from the dying processes, analysed and collated into a single file called atpMergedBT.dot. In addition the stacktrace from the first process to die (hence the probable cause of the failure) is delivered to stderr.

The atpMergedBT.dot file can be viewed using the statview command that is accessible by loading the stat module.

ATP will also dump a heuristically-selected set of core files which will be named core.atp.apid.rank, where ’apid’ is the APID of the process and ’rank’ is the parallel rank of the process that produced the core file.

8.3.1 ATP Example

The ATP module (atp) is always loaded by the PrgEnv module set. To enable ATP functionality, set the ATP_ENABLED environment variable in your job submission script and allow core files to be produced:

export ATP_ENABLED=1
ulimit -c unlimited

and then run your job using aprun as usual. Once your application has terminated abnormally you need to log into the service while exporting the X display back to your local machine (you must have an X server running locally) with:

ssh -Y username@site

Load the stat module with:

module add stat

and view the merged stacktrace with:

statview atpMergedBT.dot

The stderr from your job should also contain useful information that has been processed by ATP. For further information on stat see the manual page: man STAT.

8.4 GDB (GNU Debugger)

The standard GNU debugger (gdb) is available on Cray XE/XC systems in a modified form as lgdb. The debugger supports Cray, PGI, and GNU compilers and the Fortran, C, and C++ languages. The debugger currently only supports the command-line interface.

First, you must compile your program with debugging symbols (-g flag). You should also usually ensure that optimisation is turned off (-O0 flag) so that reordering of source code lines does not take place. (Of course, it may sometimes be necessary to include optimisation if this is the cause of the problems.)

The lgdb module needs to be loaded using

module load cray-lgdb

and an interactive session requested (see Section 5.7 Interactive jobs). Make sure that you are in the ’work’ filesystem, and then type lgdb. This starts the debugger, providing a command-line interface similar to gdb.

At the lgdb command prompt (dbg all>), use the launch command to launch an application. For example, to run the application xthi that is in the current directory:

launch $job{64} ./xthi

$job{64} defines a ’process set’, here called $job, with 64 ’processes’. You choose the name of the process set, and there can be several process sets, with different names. launch uses aprun to run the application and $job{64} is exactly equivalent to using aprun -n 64. Additional options for aprun can be passed using --aprun-args="<aprun options>". For example, to launch a hybrid MPI-OpenMP job, use

export OMP_NUM_THREADS=8
lgdb
launch $job{4} --aprun-args="-j 1 -N 2 -d 8 -S 1 -ss" ./xthi
.
.
.

Environment variables for the application can also be set within lgdb.

If an error occurs, there can be leftover applications running after using lgdb. After quitting from lgdb, always use apstat to see if you have any aprun jobs still running from the lgdb session and use apkill with the relevant application ID (apid) to kill them (you may need apkill -SIGKILL <apid> to properly kill off the application).

8.4.1 Useful lgdb commands

Please see the man page for lgdb and also the help command in lgdb for detailed information on how to use lgdb. Some of the most often used commands are listed below.

  • break function_name – (or b) insert breakpoint at start of function function_name

  • break file:line_number – insert breakpoint at line_number in specified file

  • continue – (or c) continue running program until next breakpoint is reached

  • next – (or n) step to next line of program (will also step into subroutines)

  • list – (or l) list source code around current position

  • list start_line,end_line – list source code from start_line to end_line in current function.

  • print variable_name – (or p) print the value of the specified variable variable_name in all ranks

  • print $process_set_name::variable_name – print the value of the specified variable variable_name in the specified process set process_set_name

  • print $process_set_name{rank}::variable_name – print the value of the specified variable variable_name for the particular rank rank in the specified process set process_set_name

  • whatis variable_name – print information on the variable type and array dimensions (if this is an array), for variable variable_name

  • kill $process_set_name – kill the debugging session for the specified process set process_set_name, remaining in lgdb

  • quit – (or q) quit lgdb and halt the running program.
