Best Practice Guide – Cray XE

Table of Contents

1. Introduction

This best-practice guide is designed to help users get the best productivity out of the PRACE Cray XE (and, to a lesser extent, XT) systems. We will cover a wide range of topics including:

  • Architecture
  • Code porting and compilation
  • Debugging tools
  • Serial optimisation and compiler flags
  • Parallel optimisation
  • I/O best-practice and optimisation
  • Performance analysis tools

We will focus on providing information that is generally applicable to all Cray XE systems, but some information is provided separately where there are site-specific differences. We also provide links to site-specific information in the relevant places.

1.1 PRACE Cray XE/XT Systems

Cray XE systems are hosted by a number of PRACE partner sites including:

Tier-0 – Hermit, HLRS, Germany

Hermit is a new PRACE Tier-0 system located at HLRS, Germany. The system is being installed in steps and is currently in Installation Step 1; the final step, Installation Step 2, is planned for 2013, when a fully integrated Phase 1 system will be available.


Tier-1 – HECToR, EPCC, UK

HECToR (High-End Computing Terascale Resource) is the UK’s front-line national supercomputing service, which is provided by the HECToR Partners, including EPCC. The HECToR service consists of a Cray XE6 supercomputer (phase 3) with a peak performance of greater than 800 TFlops, a high-performance parallel file system (esFS), a GPU testbed machine and an archive facility.


Tier-1 – Lindgren, KTH, Sweden

Lindgren is a Cray XE6 system based on the AMD Opteron 12-core “Magny-Cours” (2.1 GHz) processor and the Cray Gemini interconnect technology. It is located at the PDC Center for High Performance Computing. It has 16 racks with a theoretical peak performance of 305 Tflops. Lindgren is ranked in place 31 amongst the 500 most powerful computer systems in the world (Top500, June 2011). Lindgren is named after the Swedish 20th-century children’s book author Astrid Lindgren.


This guide may also be useful for users of the Cray XT systems which are hosted by PRACE partner sites:

2. System Architecture and Configuration

This section provides an overview of the Cray XE architecture and configurations. Cray XE systems generally consist of two types of nodes: service nodes and compute nodes. Service nodes are used for a variety of tasks on the system, including acting as login nodes.

2.1 Processor architecture / MCM architecture

2.1.1 Compute node hardware

Cray XE compute nodes contain two AMD Opteron processors. Depending on the site these can be either 12-core Magny-Cours or 16-core Interlagos processors. These individual processors are connected to each other by HyperTransport links.

Each Opteron processor consists of two NUMA regions (containing either 6 or 8 cores).

On Compute Nodes      HERMIT    HECToR    Lindgren
Processors per node   2         2         2
Cores per processor   16        16        12
Clock rate            2.3 GHz   2.3 GHz   2.1 GHz
L1 Cache              16 KB     16 KB     64 KB
L2 Cache              2 MB      2 MB      512 KB
L3 Cache              6 MB      6 MB      5 MB

2.1.2 Vector-type instructions

One of the keys to getting good performance out of the Opteron architecture is writing your code in such a way that the compiler can make use of the vector-type, floating-point operations available on the processor. There are a number of different vector-type operations available that all execute in a similar manner: SSE (Streaming SIMD Extensions) instructions; AVX (Advanced Vector eXtensions) instructions; and FMA4 (Fused Multiply-Add with 4 operands) instructions. AVX and FMA4 instructions are only available on the Bulldozer architecture.

These instructions can use the floating-point unit (FPU) to operate on multiple floating-point numbers simultaneously as long as the numbers are contiguous in memory. SSE instructions contain a number of different operations (for example: arithmetic, comparison, type conversion) that operate on two operands in 128-bit registers. AVX instructions expand SSE to allow operations on three operands and on a data path expanded from 128 to 256 bits – this is especially important in the Bulldozer architecture as a core can have exclusive access to a 256-bit floating-point pipeline. The FMA4 instructions are a further expansion to SSE that allow a fused multiply-add operation on 4 operands – these have the potential to greatly increase performance for simulation codes. Both the AVX and FMA4 instruction sets are relatively new innovations and it may take some time before they are effectively supported by compilers.
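
If you are unsure which of these instruction sets the compute-node processors support, you can check the CPU flags reported by Linux. A simple sketch (run through aprun from within a batch or interactive job so the command executes on a compute node rather than a login node); look for “sse”, “avx” and “fma4” in the output:

# Print the instruction-set flags reported by the first compute-node CPU
aprun -n 1 grep -m 1 flags /proc/cpuinfo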

2.1.3 Bulldozer (Interlagos) Architecture

Each Interlagos processor consists of two NUMA regions (or dies) each containing 8 cores. Each NUMA region is made up of 4 modules, each of which has two cores and a shared floating-point execution unit. See Figure 2.1 for an overview of the Bulldozer architecture.

The shared FPU in a module is the major difference from previous versions of the Opteron processor. This unit consists of two 128-bit pipelines that can be combined into a 256-bit pipeline. Hence, the module FPU is able to execute either a single 256-bit AVX vector instruction or two 128-bit SSE vector instructions per instruction cycle. The FPU also introduces an additional 256-bit fused multiply/add vector instruction which can improve the performance of these operations on the processor. Each 128-bit pipeline can operate on two double-precision floating-point numbers per clock cycle and the combined 256-bit pipeline can operate on 4 double-precision floating-point numbers per clock cycle.


Figure 2.1: Overview of an Interlagos processor NUMA region (8 cores). Image courtesy of Wikipedia.

2.1.4 Magny-Cours Architecture

Each Magny-Cours processor consists of two NUMA regions (or dies) each containing 6 cores.

The FPU in a Magny-Cours processor has a single 128-bit pipeline that can operate on two double-precision floating-point numbers per clock cycle.

2.1.5 Service node hardware

The table below summarises the service nodes on each machine. The service nodes are used for the login environment, internal services, etc.

On Service Nodes      HERMIT                     HECToR        Lindgren
Processor             AMD Opteron Processor 23   AMD Opteron   AMD Opteron Processor 23
Cores per processor   6                          2             6
Clock rate            2.2 GHz                    2.6 GHz       2.2 GHz

2.2 Building block architecture

Cray XE compute nodes contain two AMD Opteron processors. Depending on the site these can be either 12-core Magny-Cours or 16-core Interlagos processors. These individual processors are connected to each other by HyperTransport links.

Each Opteron processor consists of two NUMA regions (or dies) each containing either 6 or 8 cores.

The HyperTransport network is also linked directly to the Gemini router chips to provide access to the Cray High-Performance Network.

The layout of a Cray XE compute node is shown in Figure 2.2.


Figure 2.2: Overview of a “Magny-Cours” Cray XE compute node. Image courtesy of Cray Inc.

The number of nodes varies from system to system. The table below summarises the number of nodes on each system.

                       HERMIT      HECToR         Lindgren
No. of Compute Nodes   3552        2816           1516
No. of Cores/Node      32          32             24
No. of Cores           > 113,000   90,112         36,384
Peak Performance       ?           > 800 Tflops   305.6 Tflops
No. of Service Nodes   96          24             24

2.3 Memory architecture

All the processors on the node share 32 GB of DDR3 memory (HERMIT also offers some nodes with 64 GB of memory). The total memory on HECToR is 58 TB and on Lindgren the total memory is 47.38 TB.

2.3.1 Bulldozer (Interlagos) architecture

All 4 modules (8 cores) within a NUMA region (or die) share an 8 MB L3 cache (6 MB data), with each module having a 2 MB L2 data cache shared between the two cores. Each core has its own 16 KB L1 data cache.

The memory bandwidth for each of the PRACE systems is shown in the table below.

                                     HERMIT   HECToR   Lindgren
Memory frequency (MHz)               1600     1333     1333
Memory bandwidth per node (GB/s)     102.4    85.3     85.3
Memory bandwidth per socket (GB/s)   51.2     42.7     42.7
Memory bandwidth per module (GB/s)   6.4      5.3      -
Memory bandwidth per core (GB/s)

The main memory bandwidth is 51.2 GB/s per socket (6.4 GB/s per module, 3.2 GB/s per core).

2.3.2 Magny-Cours architecture

All 6 cores within a NUMA region (or die) share a 6 MB L3 cache (5 MB data), with each core having a 512 KB L2 data cache and a 64 KB L1 data cache.

The main memory bandwidth is 42.6 GB/s per socket (3.55 GB/s per core).

2.4 Interconnect

Cray XE systems use the Cray Gemini interconnect, which links all the compute nodes in a 3D torus. Every two XE compute nodes share a Gemini router, which is connected to the processors and main memory via their HyperTransport links.

There are 10 links from each Gemini router onto the high-performance network (HPN); the peak bi-directional bandwidth of each link is 8 GB/s and the latency is around 1-1.5 microseconds.

Further details can be found at the following links:

2.5 I/O subsystem architecture

The I/O subsystems are system dependent.


  • HERMIT

HERMIT uses the Cray Data Virtualization Service (DVS), which is an I/O forwarding service that can parallelize the I/O transactions of an underlying POSIX-compliant file system.

  • HECToR

HECToR phase 2b has 12 I/O nodes to provide the connection between the machine and the data storage. Each I/O node is fully integrated into the toroidal communication network of the machine via its own Gemini chip. They are connected to the high-performance esFS external data storage via InfiniBand fibre.

  • Lindgren

The primary storage for users of Lindgren is the site-wide Lustre file system Klemming, to which it is connected via Lustre router nodes (currently two). These transport Lustre traffic between the compute nodes on the Cray-internal Gemini interconnect and the Lustre servers residing on PDC’s high-performance storage InfiniBand fabric. The current aggregate bandwidth of Klemming is about 5 Gbyte/s for reading and writing and the size is roughly 300 TB.

2.6 Available file systems

Cray XE systems use two separate file systems: the “home” filesystem and the “work” filesystem.

The “home” filesystem is backed up and can be used for critical files and small permanent datasets. It cannot be accessed from the compute nodes, so all files required for running a job on the compute nodes must be present in the “work” filesystem. It should also be noted that the “home” filesystem is not designed for the long-term storage of large sets of results. For long-term storage, an archive facility should be used.

The “work” filesystem is a Lustre distributed parallel file system. It is the only filesystem that can be accessed from the compute nodes. Thus all input data files must be present on the “work” filesystem before running, and all output files generated during execution on the compute nodes must be written to the “work” filesystem. There is no separate backup of data on the “work” filesystem.

The table below shows the filesystem architectures/capacities at the sites.

                      HERMIT               HECToR               Lindgren
“home” architecture   BlueArc Mercury 55   BlueArc Titan 2200   AFS
“home” capacity       60 TB                70 TB
“work” architecture   Lustre               Lustre               Lustre
“work” capacity       2.7 PB               1.0 PB               0.45 PB
Archive capacity                           1.02 PB

Please consult the individual site documentation for further information on the file systems.

The HECToR site also includes an archiving facility. More information is available at:

2.7 Operating system (CLE)

The operating system on the Cray XE is the Cray Linux Environment (CLE), which in turn is based on SuSE Linux. CLE consists of two components: a full-featured Linux for the service nodes and Compute Node Linux (CNL) for the compute nodes.

The service nodes of a Cray XE system (for example, the frontend nodes) run the full-featured version of Linux.

The compute nodes of a Cray XE system run CNL. CNL is a stripped-down version of Linux that has been extensively modified to reduce both the memory footprint of the OS and the amount of variation in compute node performance due to OS overhead.

2.7.1 Cluster Compatibility Mode (CCM)

If you require full-featured Linux on the compute nodes of a Cray XE system (for example, to run an ISV code) you may be able to employ Cluster Compatibility Mode (CCM). The installation of this feature is generally site-dependent. Please contact your site for information.

3. System Access

The majority of users will connect to the systems using an interactive SSH session. The connection and authentication instructions are typically site-dependent, so please consult the documentation for the particular site you are connecting to.

The following are some basic instructions for connecting to each system.


3.1 HERMIT

The only way to access the HERMIT frontend/login nodes from outside the HWW net is through ssh, using the following command:

ssh [userID]

The frontend node is the single point of access to the entire cluster, where users can set up their environment, move data, edit and compile programs, and create batch scripts. Interactive usage, e.g. running programs which may lead to a high load, is NOT allowed on the frontend/login node. The compute nodes for running parallel jobs are only available via the batch system.

3.2 HECToR

To log into HECToR, users should use the “” address:

ssh [userID]

Transferring data to and from HECToR can be performed using scp, GridFTP or bbFTP.

3.3 Lindgren

Lindgren users can access the system via the login node using appropriate login software:

Depending on the operating system (Linux, Windows, Mac OS X) on their local computer, users will have to find and install the correct software to access the Lindgren system. Kerberos v5 software (from Heimdal or MIT) is needed to get a Kerberos ticket. Alternatively, users require SSH software that supports GSSAPI with KeyExchange (from modified OpenSSH) or kerberized telnet software (from Heimdal).

4. Programming Environment / Basic Porting

4.1 Modules environment

The Cray XE system uses the modules environment to control the programming environment. For more information on using modules see the output of “man module” on the system or:

Some quick-start commands for using the modules environment:

To check which modules are presently loaded, type

module list

To list all the available modules, type

module avail

To show the details of a specific module, type

module show [module_name]

To load a specific module for usage, type

module load [module_name]

To unload a specific module, type

module unload [module_name]

To swap a loaded module with another one, type

module swap [first_module_name] [second_module_name]

4.2 Compiler wrapper commands

No matter which programming environment you have loaded, you access the compilers via the following commands:

  • ftn – Fortran compiler
  • cc – C compiler
  • CC – C++ compiler

Using these high-level compiler wrapper commands ensures that you are compiling for the correct processor architecture on the compute nodes and also makes sure that all the correct library versions are accessed.

You should not use the native compiler commands (e.g. pgf90, gfortran) on Cray XE systems.
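
For example, to compile a Fortran or C source file (hypothetical file names), you use the wrappers exactly as you would a native compiler:

ftn -o my_prog.x my_prog.f90
cc -o my_c_prog.x my_c_prog.c

The wrappers invoke whichever compiler suite is currently loaded (see the next section) and automatically add the correct architecture flags and library paths.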

4.3 Available compilers

A number of different compiler suites are available on Cray XE systems. These may include:

  • Cray compilers – the “PrgEnv-cray” module
  • Portland group compilers – the “PrgEnv-pgi” module
  • GNU compilers – the “PrgEnv-gnu” modules
  • Intel compilers – the “PrgEnv-intel” modules (note that the Intel compilers have no support for the AMD Bulldozer architecture so will produce sub-optimal code).

Site       Available Compiler Suites
HERMIT     Cray (default), GNU, PGI
HECToR     Cray (default), GNU, PGI
Lindgren   Cray, GNU, Intel, PGI

The default programming environment differs from site to site. You can switch compiler suites with the “module swap” command. For example, to switch from the Cray compiler suite to the GNU compiler suite you would use:

module swap PrgEnv-cray PrgEnv-gnu

You can also switch between different versions of compilers within a compiler suite by using the “module swap” command. For example, to change the version of the GNU compiler you are using:

module swap gcc gcc/4.6.1

The compiler version modules for the different compiler suites are:

  • cce – For the Cray compiler suite
  • pgi – For the PGI compiler suite
  • gcc – For the GNU compiler suite
  • intel – For the Intel compiler suite

4.3.1 Partitioned Global Address Space (PGAS) compiler support

The Cray compiler suite supports both the Co-array Fortran (CAF) and Unified Parallel C (UPC) PGAS language extensions. For more information on enabling these options see the appropriate section on parallel programming below. You can also find out more at:

Cray XE systems also support the Chapel PGAS language through the “chapel” compiler. You can use this compiler by adding the “chapel” module:

module add chapel

You can find more information on the Chapel language on the Chapel web site.

4.4 Available (vendor optimised) numerical libraries

The Cray CLE distribution comes with a range of optimised numerical libraries compiled for all the supported compiler suites listed above. The libraries are listed in the table below along with their current module names and a brief description. Generally, if you wish to use a library in your code you should only need to load the module before compilation.

Library         Module      Description
LibSci          xt-libsci   Cray Scientific Library; includes BLAS, LAPACK, BLACS and ScaLAPACK
PETSc           petsc       Portable, Extensible Toolkit for Scientific Computation
FFTW            fftw        Fastest Fourier Transform in the West (versions 2 and 3)
Global Arrays   ga          Toolkit providing a shared-memory-style programming interface for distributed arrays

Many of these libraries use the Cray auto-tuning framework to improve the on-node performance. This framework automatically selects the best version of the library routines based on the size and nature of your problem at runtime.
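
For example, to use FFTW you would load the module and then compile with the wrappers as usual (hypothetical file name); the wrapper picks up the include and library paths from the loaded module:

module load fftw
ftn -o fftw_prog.x fftw_prog.f90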

More information on the library contents can be found on the following web pages:

4.5 Available MPI implementations

Cray XE systems use a version of the MPICH2 library that has been optimised for the Gemini interconnect. The version of the MPI library is controlled by the “xt-mpich2” module. All users will have the default “xt-mpich2” module loaded when they connect to the system – for best performance we recommend using the default or later versions.

You can get a list of available versions of the “xt-mpich2” module by using the “module avail” command. For example:

module avail xt-mpich2

Once the xt-mpich2 module is loaded, compiling using the standard wrapper compiler commands will automatically include and link to the MPI headers and libraries – you do not need to specify any more options on the command line.

4.6 OpenMP

All of the compiler suites available on the Cray XE system support the OpenMP 3 standard.

Note: in the Cray compiler suite OpenMP functionality is turned on by default.

4.6.1 Compiler flags

The compiler flags to include OpenMP for the various compiler suites are:

Compiler   Enable OpenMP   Disable OpenMP
Cray       -h omp          -h noomp
GNU        -fopenmp        omit -fopenmp
Intel      -openmp         omit -openmp
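
For example, to build a hypothetical OpenMP Fortran code under the Cray and GNU programming environments respectively:

ftn -h omp -o omp_prog.x omp_prog.f90
ftn -fopenmp -o omp_prog.x omp_prog.f90

Remember that under PrgEnv-cray OpenMP is enabled by default, so the -h omp flag is strictly redundant there.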

You may find these links useful:


4.7 SHMEM

To compile code that uses SHMEM you should ensure that the xt-shmem module is loaded. Loading this module will ensure that all the correct environment variables are set for linking to the libsma static and dynamic libraries. You can load the module with the command:

module load xt-shmem

For more information on using SHMEM, see the Cray man pages at:

5. Batch system/job command language

To run a job on Cray XE6 machines, you will need to write a submission script and submit your job to the batch system. As the batch system installed is site-specific, you should consult the local documentation for details of the batch system in operation at your site:

This section will introduce the basics of writing job submission scripts for Cray XE systems and then go on to look at more advanced batch job topics, including:

  • how to run multiple, concurrent parallel jobs in a single job submission script;
  • how to run multiple, concurrent parallel jobs using job arrays;
  • how to run interactive parallel jobs through the batch system;
  • how to select compute nodes with particular characteristics for your job;
  • how to use Perl or Python to write job submission scripts.

5.1 Basic batch system commands

This is a very short summary of the most important basic job submission commands on a Cray XE system.

To submit your job to the batch system:

qsub your_job_script.pbs

To check the status of all jobs in the batch system:

qstat

To check only your job status:

qstat -u $USER

To remove your job from the job queue (if your job is running, this will also stop it):

qdel [jobID]

5.2 Job submission example scripts for parallel jobs using MPI

#!/bin/bash --login

# This example is for systems using Interlagos processors.
# On HERMIT and HECToR the maximum core number per node is 32. 
# On Lindgren the maximum core number per node is 24.

# The jobname
#PBS -N your_job_name

# The total number of parallel tasks for your job.
# The example requires 2048 parallel tasks
#PBS -l mppwidth=2048

# Specify how many processes per node.
# On HERMIT and HECToR valid mppnppn values are from 1 to 32. 
# On Lindgren valid mppnppn values are from 1 to 24. 
#PBS -l mppnppn=32

# Specify the wall clock time required for your job.
#PBS -l walltime=00:20:00

# Specify which budget account that your job will be charged to.
#PBS -A your_budget_account

# Change to the directory that the job was submitted from.
cd $PBS_O_WORKDIR

# Launch the parallel job using aprun.
# Run the executable my_mpi_executable.x using total
# of 2048 parallel tasks, with 32 tasks assigned per node.
aprun -n 2048 -N 32 ./my_mpi_executable.x arg1 arg2

5.3 Job submission example scripts for parallel jobs using OpenMP

#!/bin/bash --login

# This example is for systems with Interlagos processors.
# On HERMIT and HECToR: maximum thread number per node is 32. 
# On Lindgren: maximum thread number per node is 24.

# The jobname
#PBS -N job_name

# The total number of cores required for your job.
#PBS -l mppwidth=32

# Specify how many processes per node.
#PBS -l mppnppn=32

# Specify the wall clock time required for your job.
#PBS -l walltime=00:20:00

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Set the number of OpenMP threads per node
export OMP_NUM_THREADS=32

# Launch the OpenMP job to the allocated compute node using aprun
aprun -n 1 -N 1 -d $OMP_NUM_THREADS ./my_openmp_executable.x arg1 arg2

5.4 Multiple ’aprun’ commands in a single job script

One of the most efficient ways of running multiple simulations in parallel on Cray XE systems is to use a single job submission script to run multiple simulations. This can be achieved by having multiple ’aprun’ commands in a single script and requesting enough resources from the batch system to run them in parallel.

The examples in this section all assume you are using the bash shell for your job submission script, but the principles are easily adapted to Perl, Python or tcsh.

This technique is particularly useful if you have many jobs that each use a small number of cores that you want to run simultaneously, as the job looks to the batch system like a single large job and is thus easier to schedule.

Note: each ’aprun’ command must run on a separate compute node as Cray XE machines only allow exclusive node access. This means you cannot use this technique to run multiple instances of a program on a single compute node.

5.4.1 Requesting the correct number of cores

The total number of cores requested for a job of this type is the sum of the number of cores required for all the simulations in the script. For example, if we have 16 simulations which each run using 2048 cores then we would need to ask for 32768 cores (1024 nodes on a 32-core-per-node system), as sketched below.
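
As a sketch, the corresponding resource request lines in the job submission script would be:

# 16 simulations * 2048 cores each = 32768 cores in total
#PBS -l mppwidth=32768
#PBS -l mppnppn=32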

5.4.2 Multiple ’aprun’ syntax

The difference between specifying a single aprun command and specifying multiple ’aprun’ commands in your job submission script is that each of the aprun commands must be run in the background (i.e. appended with an ’&’) and there must be a ’wait’ command after the final aprun command. For example, to run 4 CP2K simulations which each use 2048 cores (8192 cores in total) and 32 cores per node:

cd $basedir/simulation1/
aprun -n 2048 -N 32 cp2k.popt < input1.cp2k > output1.cp2k &
cd $basedir/simulation2/
aprun -n 2048 -N 32 cp2k.popt < input2.cp2k > output2.cp2k &
cd $basedir/simulation3/
aprun -n 2048 -N 32 cp2k.popt < input3.cp2k > output3.cp2k &
cd $basedir/simulation4/
aprun -n 2048 -N 32 cp2k.popt < input4.cp2k > output4.cp2k &

# Wait for all simulations to complete
wait

Of course, this could have been achieved more concisely using a loop:

for i in {1..4}; do
  cd $basedir/simulation${i}/
  aprun -n 2048 -N 32 cp2k.popt < input${i}.cp2k > output${i}.cp2k &
done

# Wait for all simulations to complete
wait

5.4.3 Example job submission script

This PBSPro job submission script runs 16 2048-core CP2K simulations in parallel, with the input in the directories ’simulation1’, ’simulation2’, etc.:

#!/bin/bash --login

# This example is for systems using Interlagos processors.
# On HERMIT and HECToR the maximum core number per node is 32. 
# On Lindgren the maximum core number per node is 24.

# The jobname
#PBS -N your_job_name

# The total number of parallel tasks for your job.
#    This is the sum of the number of parallel tasks
#    required by each of the aprun commands you are using.
#    In this example we have 16 * 2048 = 32768 tasks
#PBS -l mppwidth=32768

# Specify how many processes per node.
# On HERMIT and HECToR valid mppnppn values are from 1 to 32. 
# On Lindgren valid mppnppn values are from 1 to 24. 
#PBS -l mppnppn=32

# Specify the wall clock time required for your job.
#    In this example we want 6 hours 
#PBS -l walltime=6:0:0

# Specify which budget that your job will be charged to.
#PBS -A your_budget_account

# The base directory is the dir that the job was submitted from.
# All simulations are in subdirectories of this directory.
basedir=$PBS_O_WORKDIR

# Loop over simulations, running them in the background
for i in {1..16}; do
   # Change to the directory for this simulation
   cd $basedir/simulation${i}/
   aprun -n 2048 -N 32 cp2k.popt < input${i}.cp2k > output${i}.cp2k &
done

# Wait for all jobs to finish before exiting the
# job submission script
wait

exit 0

In this example, it is assumed that all of the input for the simulations has been set up prior to submitting the jobs. Of course, in reality, you may find that it is more useful for the job submission script to programmatically prepare the input for each job before the aprun command.

5.5 Job arrays

Often, you will want to run the same job submission script multiple times in parallel for many different input parameters. Job arrays provide a mechanism for doing this without the need to issue multiple ’qsub’ commands and without the penalty of having large numbers of jobs appearing in the queue.

5.5.1 Example job array submission script

Each job instance in the job array is able to access its unique array index through the environment variable $PBS_ARRAY_INDEX (for PBSPro) or $PBS_ARRAYID (for Torque).

This can be used to programmatically select which set of input parameters you want to use. One common way to use job arrays is to place the input for each job instance in a separate subdirectory which has a number as part of its name. For example, if you have 10 sets of input in ten subdirectories called job01, job02, …, job10 then you would be able to use the following script to run a job array that runs each of these jobs (this example is for PBSPro):


#!/bin/bash --login

# This example is for systems using Interlagos processors.
# On HERMIT and HECToR the maximum core number per node is 32. 
# On Lindgren the maximum core number per node is 24.

# The jobname
#PBS -N your_job_name

# The total number of parallel tasks for your job.
# The example requires 2048 parallel tasks
#PBS -l mppwidth=2048

# Specify how many processes per node.
# On HERMIT and HECToR valid mppnppn values are from 1 to 32. 
# On Lindgren valid mppnppn values are from 1 to 24. 
#PBS -l mppnppn=32

# Specify the wall clock time required for your job.
#PBS -l walltime=00:20:00

# Specify which budget account that your job will be charged to.
#PBS -A your_budget_account

# Change to the directory that the job was submitted from.
cd $PBS_O_WORKDIR

# Get the subdirectory name for this job instance in the array
#   Note that this example is for PBSPro. For the Torque batch system
#   you would need to use the $PBS_ARRAYID environment variable instead
jobid=`printf "%02d" $PBS_ARRAY_INDEX`
jobdir="job${jobid}"

# Change to the subdirectory for this job instance in the array
cd $jobdir

# Run this job instance in its subdirectory
echo "Running $jobname"
aprun -n 2048 -N 32 ./my_mpi_executable.x arg1 arg2

How you submit job arrays differs depending on whether your site uses the PBSPro or Torque batch system.

5.5.2 Submitting job arrays using PBSPro

The ’-J’ option to the ’qsub’ command is used to submit a job array under PBSPro. For example, to submit a job array consisting of 10 instances, numbered from 1 to 10, you would use the command:

qsub -J 1-10 array_job_script.pbs

You can also specify a stride other than 1 for array jobs. For example, to submit a job array consisting of 5 instances, numbered 2, 4, 6, 8, and 10, you would use the command:

qsub -J 2-10:2 array_job_script.pbs

5.5.3 Submitting job arrays using Torque

The ’-t’ option to the ’qsub’ command is used to submit a job array under Torque. For example, to submit a job array consisting of 10 instances, numbered from 1 to 10, you would use the command:

qsub -t 1-10 array_job_script.pbs

5.5.4 Interacting with individual job instances in an array

You can refer to an individual job instance in a job array by using its array index. For example, to delete just the job instance with array index 5 from the batch system (assuming your job ID is 1234), you would use:

qdel 1234[5]

5.6 Interactive jobs

Note that interactive jobs are not available on all Cray XE execution sites.

Interactive jobs on Cray XE systems are useful for debugging or development work as they allow you to issue ’aprun’ commands directly from the command line. To submit an interactive job reserving 256 cores (on a 32-core-per-node Interlagos system) for 1 hour you would use the following qsub command:

qsub -IVl mppwidth=256,walltime=1:0:0 -A budget

When you submit this job your terminal will display something like:

qsub: waiting for job 492383.sdb to start

and once the job runs you will be returned to a standard Linux command line. However, while the job lasts you will be able to run parallel jobs by using the ’aprun’ command directly at your command prompt. The maximum number of cores you can use is limited by the value of mppwidth you specified at submission time.
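
For example, within the interactive session above you could launch a 256-core run directly at the prompt (hypothetical executable name):

aprun -n 256 -N 32 ./my_executable.x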

5.7 Selecting nodes with particular attributes

It is possible to select specific nodes for job execution using the batch system. The resource ’mppnodes’ allows you to specify a list of particular compute nodes for execution. This can be useful in the case that you are working on a heterogeneous system (with different numbers of cores or different amounts of memory on particular compute nodes).

For example, to submit a 4-node job (128 cores on a 32-core-per-node system) to the compute nodes numbered 2, 3, 4 and 5 you would add the following line to your job submission script:

#PBS -l mppnodes="2-5"

or, if your nodes are not sequentially numbered, for example 2, 4, 6, 8, you can use a comma-separated list:

#PBS -l mppnodes="2,4,6,8"

5.7.1 Identifying nodes with particular attributes

To make effective use of the ability to select particular compute nodes for execution you need to have a way to get a list of compute nodes with particular attributes in the format needed for the ’mppnodes’ resource. The ’cnselect’ command is used on Cray XE systems to perform this task.

For example, to return a list of all compute nodes with 32 cores you would use:

cnselect numcores.eq.32

or to select all compute nodes with more than 32 GB of memory (note that cnselect takes memory sizes in MB):

cnselect availmem.gt.32768
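
The output of ’cnselect’ can be passed directly to the ’mppnodes’ resource; a minimal sketch (assuming a job script called my_job_script.pbs):

qsub -l mppnodes="$(cnselect numcores.eq.32)" my_job_script.pbs
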
5.8 Writing job submission scripts in Perl and Python

It can often be useful to be able to use the features of Perl and/or Python to write more complex job submission scripts. The richer programming environment compared to standard shell scripts can make it easier to dynamically generate input for jobs or put complex workflows together.

Please note that the examples provided in this section are so simple that they could easily be written in bash or tcsh, but they provide the necessary information needed to be able to use Perl and Python to write your own, more complex, job submission scripts.

You submit Perl and Python job submission scripts using ’qsub’ as for standard jobs.

5.8.1 Example Perl job submission script

This example script shows how to run a CP2K job using Perl. It illustrates the necessary system calls to change directories and load modules within a Perl script but does not contain any program complexity.


#!/usr/bin/perl

# This example is for systems using Interlagos processors.
# On HERMIT and HECToR the maximum core number per node is 32. 
# On Lindgren the maximum core number per node is 24.

# The jobname
#PBS -N your_job_name

# The total number of parallel tasks for your job.
# The example requires 2048 parallel tasks
#PBS -l mppwidth=2048

# Specify how many processes per node.
# On HERMIT and HECToR: mppnppn values are from 1 to 32. 
# On Lindgren: mppnppn values are from 1 to 24. 
#PBS -l mppnppn=32

# Set the budget to charge the job to.
#    The budget name is site-dependent
#PBS -A budget

# Set the number of MPI tasks and MPI tasks per node
my $mpiTasks = 2048;
my $tasksPerNode = 32;

# Set the executable name and input and output files
my $execName = "cp2k.popt";
my $inputName = "input";
my $outputName = "output";
my $runCode = "$execName < $inputName > $outputName";

# Set up the string to run our job
my $aprunString = "aprun -n $mpiTasks -N $tasksPerNode $runCode";

# Set the command to load the cp2k module
#   This is more complicated in Perl as we cannot access the 
#   'module' command directly so we need to use a set of commands
#   to make sure the subshell that runs the aprun process has the 
#   correct environment setup. This string will be prepended to the
#   aprun command
my $moduleString = "source /etc/profile; module load cp2k;";

# Change to the directory the job was submitted from
chdir($ENV{PBS_O_WORKDIR});

# Run the job
#    This is a combination of the module loading string and the
#    actual aprun command. Both of these are set above.
system("$moduleString  $aprunString");

# Exit the job
exit(0);

5.8.2 Example Python job submission script

This example script shows how to run a CP2K job using Python. It illustrates the necessary system calls to change directories and load modules within a Python script but does not contain any program complexity.

#!/usr/bin/python

# This example is for systems using Interlagos processors.
# On HERMIT and HECToR the maximum core number per node is 32. 
# On Lindgren the maximum core number per node is 24.

# The jobname
#PBS -N your_job_name

# The total number of parallel tasks for your job.
# The example requires 2048 parallel tasks
#PBS -l mppwidth=2048

# Specify how many processes per node.
# On HERMIT and HECToR valid mppnppn values are from 1 to 32. 
# On Lindgren valid mppnppn values are from 1 to 24. 
#PBS -l mppnppn=32

# Set the budget to charge the job to.
#    The budget name is site-dependent
#PBS -A budget

# Import the Python modules required for system operations
import os
import sys

# Set the number of MPI tasks and MPI tasks per node
mpiTasks = 2048
tasksPerNode = 32

# Set the executable name and input and output files
execName = "cp2k.popt"
inputName = "input"
outputName = "output"
runCode = "{0} < {1} > {2}".format(execName, inputName, outputName)

# Set up the string to run our job
aprunString = "aprun -n {0} -N {1} {2}".format(mpiTasks, tasksPerNode, runCode)

# Set the command to load the cp2k module
#   This is more complicated in Python as we cannot access the 
#   'module' command directly so we need to use a set of commands
#   to make sure the subshell that runs the aprun process has the 
#   correct environment setup. This string will be prepended to the
#   aprun command
moduleString = "source /etc/profile; module load cp2k; "

# Change to the directory the job was submitted from
os.chdir(os.environ["PBS_O_WORKDIR"])

# Run the job
#    This is a combination of the module loading string and the
#    actual aprun command. Both of these are set above.
os.system(moduleString + aprunString)

# Exit the job
sys.exit(0)

6. Performance analysis

6.1 Available Performance Analysis Tools

All Cray XE machines come with the Cray Performance Tools (module “perftools”) installed. These tools include:

  • Cray Performance Analysis Tool (CrayPAT).
  • PAPI hardware counters.
  • Apprentice2 performance visualisation suite.

CrayPAT provides the interface to using all these tools.

There are also other tools available on each system. Please refer to the appropriate documentation at each site.

Site       Additional Performance Analysis Tools
HECToR     Cray PerfTools, Scalasca, VampirTrace
Lindgren   Cray PerfTools, Paraver, Extrae

6.2 Cray Performance Analysis Tool (CrayPAT)

CrayPAT consists of two command-line tools which are used to profile your code: ’pat_build’ adds instrumentation to your code so that, when you run the instrumented binary, the profiling information is stored on disk; ’pat_report’ reads the profiling information on disk and compiles it into human-readable reports.

CrayPAT can perform two types of performance analysis: sampling experiments and tracing experiments. A sampling experiment probes the code at a predefined interval and produces a report based on these statistics. A tracing experiment explicitly monitors the code performance within named routines. Typically, the overhead associated with a tracing experiment is higher than that associated with a sampling experiment, but it provides much more detailed information. The key to getting useful data out of a sampling experiment is to run your profiling for a representative length of time.

Detailed documentation on CrayPAT is available from Cray:

6.2.1 Instrumenting a code with pat_build

Often the best way to analyse the performance is to run a sampling experiment and then use the results from this to perform a focussed tracing experiment. In fact, CrayPAT contains the functionality for automating this process – known as Automatic Profiling Analysis (APA).

The example below illustrates how to do this.

  • Example: Profiling the CASTEP Code
    Here is a step-by-step example of instrumenting and profiling the performance of a code (CASTEP) using CrayPAT. The first step is to compile your code with the profiling tools attached. First load the Cray ’perftools’ module with:

    module add perftools

    Next, compile the CASTEP code in the standard way on the “work” filesystem (which is accessible on the compute nodes). Once you have the CASTEP executable you need to instrument it for a sampling experiment. The following command will produce the ’castep+samp’ instrumented executable:

    pat_build -O apa -o castep+samp castep

    Run your program as you usually would in the batch system at your site. Here is an example job submission script for a 32-core-per-node (Interlagos) system.

    #!/bin/bash --login
    #PBS -N bench4_il
    #PBS -l mppwidth=1024
    #PBS -l mppnppn=32 
    #PBS -l walltime=1:00:00
    # Set this to your budget code
    #PBS -A budget
    # Switch to the directory you submitted the job from
    cd $PBS_O_WORKDIR
    # Load the perftools module
    module add perftools
    # Run the sampling experiment
    aprun -n 1024 -N 32 $CASTEP_EXEDIR/castep+samp input

    When the job completes successfully the directory you submitted the job from will either contain an *.xf file (if you used a small number of cores) or a directory (for large core counts) with a name something like:


    castep+samp+25370-14s

    The actual name is dependent on the name of your instrumented executable and the process ID. In this case we used 1024 cores so we have got a directory.

    The next step is to produce a basic performance report and also the input for ’pat_build’ to create a tracing experiment. This is done (usually interactively) by using the ’pat_report’ command (you must have the ’perftools’ module loaded to use the pat_report command). In our example we would type:

    pat_report -o castep_samp_1024.pat castep+samp+25370-14s 

    this will produce a text report (in castep_samp_1024.pat) listing the various routines in the program and how much time was spent in each one during the run (see below for a discussion of the contents of this file). It also produces a castep+samp+25370-14s.ap2 file and a castep+samp+25370-14s.apa file:

    • The *.ap2 file can be used with the Apprentice2 visualisation tool to get an alternative view of the code performance.
    • The *.apa file can be used as the input to “pat_build” to produce a tracing experiment focussed on the routines that take the most time (as determined in the previous sampling experiment). We illustrate this below.

    For information on interpreting the results in the sampling experiment text report, see the section below.

    To produce a focussed tracing experiment based on the results from the sampling experiment above we would use the ’pat_build’ command with the *.apa file produced above. (Note, there is no need to change directory to the directory containing the original binary to run this command.)

    pat_build -O castep+samp+25370-14s.apa -o castep+apa

    This will generate a new instrumented version of the CASTEP executable (castep+apa) in the current working directory. You can then submit a job (as outlined above) making sure that you reference this new executable. This will then run a tracing experiment based on the options in the *.apa file.

    Once again, you will find that your job generates a *.xf file (or a directory) which you can analyse using the ’pat_report’ command. Please see the section below for information on analysing tracing experiments.

6.2.2 Analysing profile results using pat_report

The ’pat_report’ command is able to produce many different profile reports from the profile data. You can select the type of pre-defined report options with the -O flag to ’pat_report’. A selection of the most generally useful pre-defined report types are:

  • ca+src – Show the callers (bottom-up view) leading to the routines that have a high use in the report and include source code line numbers for the calls and time-consuming statements.
  • load_balance – Show load-balance statistics for the high-use routines in the program. Parallel processes with minimum, maximum and median times for routines will be displayed. Only available with tracing experiments.
  • mpi_callers – Show MPI message statistics. Only available with tracing experiments.

Multiple options can be specified to the -O flag if they are separated by commas. For example:

pat_report -O ca+src,load_balance -o castep_furtherinfo_1024.pat castep+samp+25370-14s

You can also define your own custom reports using the -b and -d flags to ’pat_report’. Details on how to do this can be found in the ’pat_report’ documentation and man pages. The output from the pre-defined report types (described above) shows the settings of the -b and -d flags that were used to generate the report. You can use these examples as the basis for your own custom reports.

Below, we show examples of the output from sampling and tracing experiments for the pure MPI version of CASTEP. A report from your own code will differ in the routines and times shown. Also, if your code uses OpenMP threads, SHMEM, or CAF/UPC then you will also see different output.

  • Example: results from a sampling experiment on CASTEP
    In a sampling experiment, the main summary table produced by ’pat_report’ will look something like this (of course, your code will contain different functions and subroutines):

    Table 1:  Profile by Function
     Samp%  |  Samp  |  Imb.  |  Imb.  |Group 
            |        |  Samp  | Samp%  | Function 
            |        |        |        |  PE=HIDE 
     100.0% | 6954.6 |     -- |     -- |Total
     | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     |  46.0% | 3199.8 |     -- |     -- |MPI
     || - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
     ||  12.9% |  897.0 |  302.0 |  25.6% |mpi_bcast
     ||  12.9% |  896.4 |   78.6 |   8.2% |mpi_alltoallv
     ||   8.7% |  604.2 |  139.8 |  19.1% |MPI_ALLREDUCE
     ||   5.4% |  378.6 | 1221.4 |  77.5% |MPI_SCATTER
     ||   4.9% |  341.6 |  320.4 |  49.2% |MPI_BARRIER
     ||   1.1% |   75.8 |  136.2 |  65.2% |mpi_gather
     |  38.8% | 2697.9 |     -- |     -- |USER
     || - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
     ||   9.4% |  652.1 |  171.9 |  21.2% |comms_transpose_exchange.3720
     ||   6.0% |  415.2 |   41.8 |   9.3% |zgemm_kernel_n
     ||   3.4% |  237.6 |   29.4 |  11.2% |zgemm_kernel_l
     ||   3.1% |  215.6 |   24.4 |  10.3% |zgemm_otcopy
     ||   1.7% |  119.8 |   20.2 |  14.6% |zgemm_oncopy
     ||   1.3% |   92.3 |   11.7 |  11.4% |zdotu_k
     |  15.2% | 1057.0 |     -- |     -- |ETC
     || - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
     ||   2.0% |  140.5 |   32.5 |  19.1% |__ion_MOD_ion_q_recip_interpolation
     ||   2.0% |  139.8 |  744.2 |  85.5% |__remainder_piby2
     ||   1.1% |   78.3 |   32.7 |  29.9% |__ion_MOD_ion_apply_and_add_ylm
     ||   1.1% |   73.5 |   35.5 |  33.1% |__trace_MOD_trace_entry

    You can see that CrayPAT splits the results from a sampling experiment into three sections:

    • MPI – Samples accumulated in message passing routines
    • USER – Samples accumulated in user routines. These are usually the functions and subroutines that are part of your program.
    • ETC – Samples accumulated in library routines.

    The first two columns indicate the % and number of samples (mean of samples computed across all parallel tasks) and indicate the functions in the program where the sampling experiment spent the most time. Columns 3 and 4 give an indication of the differences between the minimum number of samples and maximum number of samples found across different parallel tasks. This, in turn, gives an indication of where the load-imbalance in the program is found. Of course, it may be that load imbalance is found in routines where insignificant amounts of time are spent. In this case, the load-imbalance may not actually be significant (for example, the large imbalance time seen in ’mpi_gather’ in the example above).

    In the example above, the largest amounts of time seem to be spent in the MPI routines mpi_bcast, mpi_alltoallv and MPI_ALLREDUCE, in the program routine comms_transpose_exchange, and in the BLAS routine ’zgemm’.

    To narrow down where these time-consuming calls are in the program we can use the ’ca+src’ report option to ’pat_report’ to get a view of the split of samples between the different calls to the time-consuming routines in the program.

    pat_report -O ca+src -o castep_calls_1024.pat castep+samp+25370-14s 

    An extract from the report produced by the above command looks like:

    Table 1:  Profile by Function and Callers, with Line Numbers
     Samp%  |  Samp  |Group 
            |        | Function 
            |        |  Caller 
            |        |   PE=HIDE 
     100.0% | 6954.6 |Total
     | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     |  38.8% | 2697.9 |USER
     || - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
     ||   9.4% |  652.1 |comms_transpose_exchange.3720 
     ||| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    3||   3.7% |  254.4 |__comms_MOD_comms_transpose_n:comms.mpi.F90:line.17965
    4||        |        | __comms_MOD_comms_transpose:comms.mpi.F90:line.17826
    5||        |        |  __fft_MOD_fft_3d:fft.fftw3.F90:line.482
    6||   3.1% |  215.0 |   __basis_MOD_basis_real_recip_reduced_one:basis.F90:line.4104
    7||        |        |    __wave_MOD_wave_real_to_recip_wv_bks:wave.F90:line.24945
    8||        |        |     __pot_MOD_pot_nongamma_apply_wvfn_slice:pot.F90:line.2640
    9||        |        |      __pot_MOD_pot_apply_wvfn_slice:pot.F90:line.2550
    10|        |        |       __electronic_MOD_electronic_apply_h_recip_wv:electronic.F90:line.4364
    11|        |        |        __electronic_MOD_electronic_apply_h_wv:electronic.F90:line.4057
    12|   3.0% |  211.4 |         __electronic_MOD_electronic_minimisation:electronic.F90:line.1957
    13|   3.0% |  205.6 |          __geometry_MOD_geom_get_forces:geometry.F90:line.6299
    14|   2.8% |  195.2 |           __geometry_MOD_geom_bfgs:geometry.F90:line.2199
    15|        |        |            __geometry_MOD_geometry_optimise:geometry.F90:line.1051
    16|        |        |             MAIN__:castep.f90:line.1013
    17|        |        |              main:castep.f90:line.623

    This indicates that (in the USER routines) the part of the ’comms_transpose_exchange’ routine in which the program is spending most time is around line 3720 (in many cases the time-consuming line will be the start of a loop). Furthermore, we can see that approximately one third of the samples in the routine (3.7%) originate from the call to ’comms_transpose_exchange’ in the stack shown. Here, it looks like they are part of the geometry optimisation section of CASTEP. The remaining two-thirds of the time will be indicated further down in the report.

    In the original report (castep_samp_1024.pat) we also get a table reporting on the wall clock time of the program. The table reports the minimum, maximum and median values of wall clock time from the full set of parallel tasks. The table also lists the maximum memory usage of each of the parallel tasks:

    Table 3:  Wall Clock Time, Memory High Water Mark
       Process  |  Process  |PE=[mmm] 
          Time  |    HiMem  |
                | (MBytes)  |
     124.409675 |   100.761 |Total
     | - - - - - - - - - - - - - - - - 
     | 126.922418 |   101.637 |pe.39
     | 124.329474 |   102.336 |pe.2
     | 122.241495 |   101.355 |pe.4

    If you see a very large difference in the minimum and maximum walltime for different parallel tasks it indicates that there is a large amount of load-imbalance in your code and it should be investigated more thoroughly.

  • Example: results from a tracing experiment on CASTEP
    The main summary table from a tracing experiment looks very similar to the main summary table from the sampling experiment, with the sample counter columns replaced by time columns:

    Table 1:  Profile by Function Group and Function
     Time%  |      Time  |     Imb.  |  Imb.  |    Calls  |Group 
            |            |     Time  | Time%  |           | Function 
            |            |           |        |           |  PE=HIDE 
     100.0% | 219.421582 |        -- |     -- | 5661054.0 |Total
     | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     |  48.3% | 106.010194 |        -- |     -- |  210415.6 |MPI_SYNC
     || - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
     ||  22.1% |  48.462454 | 48.436312 |  99.9% |    1700.1 |mpi_gather_(sync)
     ||  12.0% |  26.327995 | 25.902518 |  98.4% |   13549.0 |mpi_bcast_(sync)
     ||   7.0% |  15.312987 | 12.925545 |  84.4% |       9.0 |mpi_barrier_(sync)
     ||   3.3% |   7.155499 |  3.679692 |  51.4% |   90020.5 |mpi_allreduce_(sync)
     ||   2.8% |   6.232381 |  6.227867 |  99.9% |    1104.0 |mpi_scatter_(sync)
     ||   1.1% |   2.518877 |  0.841358 |  33.4% |  104033.0 |mpi_alltoallv_(sync)
     |  25.1% |  55.160228 |        -- |     -- |       2.0 |USER
     || - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
     |  25.1% |  55.159796 | 65.926769 |  55.3% |       1.0 | main
     |  20.9% |  45.947098 |        -- |     -- |  210447.6 |MPI
     || - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
     ||  15.0% |  32.998366 |  0.835895 |   2.5% |  104033.0 |mpi_alltoallv
     ||   4.3% |   9.431468 |  0.338688 |   3.5% |   90020.5 |MPI_ALLREDUCE
     |   5.6% |  12.304063 |  1.288770 |   9.6% | 5240188.9 |STRING
     || - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
     |   5.6% |  12.304063 |  1.288770 |   9.6% | 5240188.9 | memcpy

    You should note that the single MPI section has now been split into two sections: MPI and MPI_SYNC. The MPI section now contains the time spent by the MPI library in actually doing work, while the MPI_SYNC section contains the time spent in MPI routines waiting for the message to complete.

6.2.3 Using hardware counters

CrayPAT also provides hardware counter data. CrayPAT supports both predefined hardware counter events and individual hardware counters. Please refer to the CrayPAT documentation for further details of the hardware counter events:

To use the hardware counters, you need to set the environment variable PAT_RT_HWPC in your job script when running your tracing experiment. For example:

export PAT_RT_HWPC=21

(on an Interlagos system) will specify the twenty-first of twenty-two predefined sets of hardware counter events to be measured and reported. This set provides information on floating-point operations and cache performance.

Note that hardware counters are only supported with tracing experiments, not with sampling experiments. If you set PAT_RT_HWPC for a sampling experiment the instrumented executable will fail to run.
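
A minimal job-script fragment (a sketch based on the CASTEP tracing example above) is:

module add perftools
export PAT_RT_HWPC=21
aprun -n 1024 -N 32 $CASTEP_EXEDIR/castep+apa input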

When you produce a profile report using ’pat_report’, hardware counter information will automatically be included in the report if the PAT_RT_HWPC environment variable was set when the experiment was run.

A full list of the predefined sets of counters can be found on the hardware counters man page at your execution site. To access the documentation, use:

man hwpc
  • Example: Hardware counter results from CASTEP
     - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
      Time%                                                     25.2% 
      Time                                                  55.178458 secs
      Imb. Time                                                    -- secs
      Imb. Time%                                                   -- 
      Calls                                  0.036 /sec           2.0 calls
      PAPI_L1_DCM                           31.388M/sec    1732264145 misses
      PAPI_TLB_DM                            0.244M/sec      13481844 misses
      PAPI_RES_STL                          11.189 secs   25734805221 cycles
      PAPI_L1_DCA                          823.706M/sec   45459611237 refs
      PAPI_FP_OPS                         2031.598M/sec  112122144958 ops
      DATA_CACHE_REFILLS_FROM_NORTHBRIDGE    6.508M/sec     359181707 fills
      Average Time per Call                                 27.589229 secs
      CrayPat Overhead : Time                 0.0%                    
      User time (approx)                    55.189 secs  126935432457 cycles  100.0% Time
      Total time stalled                    11.189 secs   25734805221 cycles
      HW FP Ops / User time               2031.598M/sec  112122144958 ops   11.0%peak(DP)
      HW FP Ops / WCT                     2031.598M/sec               
      Computational intensity                 0.88 ops/cycle     2.47 ops/ref
      MFLOPS (aggregate)                 130022.30M/sec               
      TLB utilization                      3371.91 refs/miss    6.586 avg uses
      D1 cache hit,miss ratios               96.2% hits          3.8% misses
      D1 cache utilization (misses)          26.24 refs/miss    3.280 avg hits
      System to D1 refill                    6.508M/sec     359181707 lines
      System to D1 bandwidth               397.229MB/sec  22987629227 bytes

    The hardware counters for a routine can give hints as to what is causing any problems in performance.

6.3 General hints for interpreting profiling results

There are several general tips for using the results from the performance analysis:

  • Examine the routines where most of the time is spent to see if they can be optimised in any way – the “ca+src” report option can often be useful here to determine which regions of a function are using the most time.
  • Look for load-imbalance in the code – this is indicated by a large difference in computing time between different parallel tasks.
  • High values of time spent in MPI usually indicate something wrong in the code – load-imbalance, a bad communication pattern, or just not scaling to the specified number of tasks.
  • High values of cache misses (seen via hardware counters) could mean bad memory access in an array or loop – examine the arrays and loops in the problematic function to see if they can be reorganised to access the data in a more cache-friendly way.

The Tuning chapter in this guide has a lot of information on how to optimise both serial and parallel code. Combining profiling information with the optimisation techniques is the best way to improve the performance of your code.

6.3.1 Spotting load-imbalance

By far the largest obstacle to scaling applications out to large numbers of parallel tasks is the presence of load-imbalance in the parallel code. Spotting the symptoms of load-imbalance in performance analysis is key to optimising the scaling performance of your code. Typical symptoms include:

For MPI codes
large amounts of time spent in MPI_BARRIER or MPI collectives (which include an implicit barrier).
For OpenMP codes
large amounts of time in “_omp_barrier” or “_omp_barrierp”.
For PGAS codes
large amounts of time spent in synchronisation functions.

When running a CrayPAT tracing experiment with “-g mpi” the MPI time will be split into computation and “sync” time. A high “sync” time can indicate problems with load-imbalance in an MPI code.

The location of the accumulation of load-imbalance time often gives an indication of where in your program the load-imbalance is occurring.

Please also see the Load-imbalance section in the Tuning chapter.

7. Tuning

This section discusses best practice for improving the performance of your code on Cray XE systems. We begin with a discussion of how to optimise the serial (single-core) compute performance and then discuss how to improve parallel performance.

Please note that these are general guidelines and some/all of the recommendations may not have an impact on your code. We always advise that you analyse the performance of your code using the profiling tools detailed in the Performance analysis section to identify bottlenecks and parallel performance issues (such as load imbalance).

7.1 Optimisation summary

A summary of getting the best performance from your code would be:

  1. Select the right (parallel) algorithm for your problem. If you do not do this then no amount of optimisation will give you the best performance.
  2. Use the compiler optimisation flags (and use pointers sparingly in your code).
  3. Use the optimised numerical libraries supplied by Cray rather than coding the algorithms yourself.
  4. Eliminate any load-imbalance in your code (CrayPAT can help identify load-imbalance issues). If you have load-imbalance then your code will never scale up to large core counts.

7.2 Serial (single-core) optimisation

7.2.1 Compiler optimisation flags

One of the easiest optimisations to perform is to use the correct compiler flags. This technique is extremely simple as it does not require you to modify your source code – although alterations to your source code may allow compiler flags to have more beneficial effects. It is often worth taking the time to try a number of optimisation flag combinations to see what effect they have on the performance of your code. In addition, many of the compilers will provide information on what optimisations they are performing and, more usefully, what optimisations they are not performing and why. The flags needed to enable this information are indicated below.

Typical optimisations that can be performed by the compiler include:

Loop optimisation
such as vectorisation and unrolling.
Function inlining
replacing a call to a function with the actual function source code.
Local logical block optimisations
such as scheduling and algebraic identity removal.
Global optimisations
such as constant propagation and dead store elimination (still within a single source code file).
Inter-procedural analyses
try to optimise across subroutine/function call boundaries (can span multiple source code files).

The compiler-specific documentation and man pages contain more information about which optimisations particular flags will enable/disable.

When using the more aggressive optimisation options it is important to be aware that the resulting output might be affected, for example by a loss of precision. Some of the optimisation options allow changing the order of execution and changing how arithmetic computations are performed. When using aggressive optimisations it is important to test your code to ensure that it still produces the correct result.

Many compiler suites allow pragmas or flags to be placed in the source code to give more information on whether or not (or even how) particular sections of code should be optimised. These can be useful, particularly for restricting optimisation in sections of code where the order of execution is critical. For example, the Portland Group compiler can vectorise individual loops, perform memory prefetching, and select an optimisation level for a code section.

Cray Compiler Suite

The -O1, -O2 and -O3 flags instruct the compiler to attempt various levels of optimisation (with -O1 being the least aggressive and -O3 being the most aggressive). The default is -O2 but most codes should benefit from increasing this to -O3.

To enable information on successful optimisations use the -Omsgs flag; to enable information on failed optimisations add the -Onegmsgs flag.

GNU Compiler Suite

The -O1, -O2 and -O3 flags instruct the compiler to attempt various levels of optimisation (with -O1 being the least aggressive and -O3 being the most aggressive).

The option -ftree-vectorizer-verbose=N will generate information about attempted loop vectorisations.

PGI Compiler Suite

The most useful set of optimisation flags for most codes will be: -fast -Mipa=fast. Other useful optimisation flags are -O3, -Mpfi, -Mpfo, -Minline, -Munroll and -Mvect.

To enable information on successful optimisations use the -Minfo flag; to enable information on failed optimisations add the -Mneginfo flag.

Intel Compiler Suite

The most useful high-level flag is -fast, which enables a set of common recommended optimisations. The -O1, -O2 and -O3 flags instruct the compiler to attempt various levels of optimisation (with -O1 being the least aggressive and -O3 being the most aggressive).
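For illustration, the lines below sketch how a hypothetical Fortran source file (my_source.f90) might be compiled through the ftn compiler driver with the optimisation and reporting flags discussed above; the PrgEnv module loaded determines which underlying compiler suite is invoked:

ftn -O3 -Omsgs -Onegmsgs -c my_source.f90              # PrgEnv-cray
ftn -O3 -ftree-vectorizer-verbose=2 -c my_source.f90   # PrgEnv-gnu
ftn -fast -Mipa=fast -Minfo -Mneginfo -c my_source.f90 # PrgEnv-pgi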

7.2.2 Using Libraries

Another easy way to boost the serial performance of your code is to use the optimised numerical libraries provided on the Cray XE system. More information on the libraries available on the system can be found in the section: Available Numerical Libraries.

7.2.3 Writing Optimal Serial Code

The speed of computation is determined by the efficiency of your algorithm (essentially the number of operations required to complete the calculation) and how well the compiled executable can exploit the Opteron architecture.

When actually writing your code, the largest single effect you can have on performance is by selecting the appropriate algorithm for the problems you are studying. The choice of algorithm depends on many things, which may include such considerations as:

Precision
do you need to use double precision floating point numbers? If not, single- or mixed-precision algorithms can run up to twice as fast as the double precision versions.
Problem size
what are the scaling properties of your algorithm? Would a different approach allow you to treat larger problems more efficiently?
Complexity
although a particular algorithm may theoretically have the best scaling properties, is it so complex that this benefit is lost during coding?

Often algorithm selection is non-trivial and a good proportion of code benchmarking and profiling is needed to elucidate the best choice.

Once you have selected the best algorithms for your code you should endeavour to write the code in such a way that the compiler can exploit the Opteron processor architecture in the most efficient way.

The first rule is that if a code segment can be replaced by an optimised library call then you should do this (see Available Numerical Libraries). If your code segment does not have an equivalent in one of the standard optimised numerical libraries then you should try to use code constructs that expose instruction-level parallelism or vectorisation (also known as SSE/AVX instructions) to the compiler, while avoiding simple optimisations that the compiler can perform easily. For floating-point intensive kernels the following general advice applies (a short sketch follows the list):

  • Avoid the use of pointers – these limit the optimisation that the compiler can perform.
  • Avoid using function calls, branching statements and goto statements wherever possible.
  • Only loops of stride 1 are amenable to vectorisation.
  • For nested loops, the innermost loop should be the longest and have a stride of 1.
  • Loops with a low number of iterations and/or little computation should be unrolled.
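As a minimal sketch of these guidelines, the following C kernel (function and array names are purely illustrative) uses a stride-1 inner loop with no calls or branches, which compilers can typically map onto SSE/AVX instructions:

#include <stddef.h>

/* Stride-1 loop with no function calls or branches: a good candidate
   for compiler vectorisation. The restrict qualifiers assert that the
   arrays do not overlap, which helps the compiler vectorise. */
void daxpy(size_t n, double a, const double * restrict x,
           double * restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}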

7.2.4 Cache Optimisation

Main memory access on systems such as Cray XE machines is usually around two orders of magnitude slower than performing a single floating-point operation. One solution used in the Opteron architecture to mitigate this is a hierarchy of smaller, faster memory spaces on the processor known as caches. This works because there is often a high chance of a particular address from memory being needed again within a short interval, or of an address from the same vicinity of memory being needed at the same time. We can therefore improve the performance of our code by accessing the data in memory in a way that allows the cache hierarchy to be used as efficiently as possible.

Cache optimisation can be a very complex subject but we will try to provide a few general principles that can be applied to your codes and should help improve cache efficiency. The CrayPAT tool introduced in the Performance Analysis section can be used to monitor the cache efficiency of your code through the use of hardware counters.

Effectively, in programming for cache efficiency we are seeking to provide additional locality in our code. Here, locality refers to both spatial locality – using data located in blocks of consecutive memory addresses – and temporal locality – using the same address multiple times in a short period of time.

  • Spatial locality can be improved by looping over data (in the innermost loop of nested loops) using a stride of 1 (or, in Fortran, by using array syntax); see the sketch below.
  • Temporal locality can be improved by using short loops that do not contain function calls or branching statements.
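The following C sketch (array names and sizes are illustrative) shows the first point: C stores arrays row-major, so making the last index the innermost loop gives stride-1 access; in Fortran, which is column-major, the opposite loop ordering applies.

#define N 1024
static double a[N][N], b[N][N];

/* The inner loop runs over the last index, so consecutive iterations
   touch consecutive memory addresses (stride 1). */
void copy_scaled(double s)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = s * b[i][j];
}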

There are two other ways in which the cache technology can have a detrimental effect on code performance.

Part of the way in which caches achieve high performance is by mapping each memory address onto a set number of cache lines; this is known as n-way set associativity. This property of caches can seriously affect the performance of codes where two array variables involved in an operation map to the same cache line, so that the cache line must be refilled twice for each instance of the operation. One way to minimise this effect is to avoid using powers of 2 for your array dimensions (as cache sizes are always powers of 2) or, if you see this happening in your code, to pad the arrays with enough elements to stop it happening (see the sketch below).
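A sketch of the padding idea in C (sizes are illustrative): with a power-of-two leading dimension, a[i][j] and b[i][j] can map to the same cache set on every iteration, while a small pad breaks this alignment.

#define N   1024   /* power-of-two dimension: conflict-prone          */
#define PAD 8      /* a few extra (unused) elements break the mapping */
static double a[N][N + PAD], b[N][N + PAD];

/* a[i][j] and b[i][j] no longer collide in the same cache set */
void add_arrays(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] += b[i][j];
}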

The other major effect on users’ codes comes in the form of so-called TLB misses. The TLB in question is the translation lookaside buffer, the mechanism that the cache/memory hierarchy uses to convert application addresses to physical memory addresses. If a mapping is not contained in the TLB then main memory must be accessed for further information, resulting in a large performance penalty. TLB misses most often occur when codes loop through an array using a large stride.

The specific cache layout is processor-model dependent; consult the site-specific documentation for information on the cache layout of the machine you are using.

7.3 Parallel optimisation

Some of the most important advice from the serial optimisation section also applies for parallel optimisation, namely:

  • Choose the correct algorithm for your problem.
  • Use vendor-provided libraries wherever possible.

When programming in parallel you will also need to select the parallel programming model to use. As the Cray XE system is an MPP machine with distributed memory you have the following options:

  • Pure MPI – using just the MPI communications library.
  • Pure SHMEM – using just the SHMEM single-sided communications library.
  • Pure PGAS – using one of the Partitioned Global Address Space (PGAS) implementations, such as Coarray Fortran (CAF) or Unified Parallel C (UPC).
  • Hybrid approach – using a combination of parallel programming models (most often MPI+OpenMP, but MPI+CAF and MPI+SHMEM are also used).

The Cray XE interconnect architecture includes hardware support for single-sided communications, which means that SHMEM and PGAS approaches run very efficiently and, if your algorithm is amenable to such an approach, are worth considering as an alternative to the more traditional pure MPI approach. A caveat here is that if your code makes heavy use of collective communications (for example, all-to-all or allreduce type operations) then you will find that the optimised MPI versions of these routines almost always outperform the equivalents coded using SHMEM or PGAS.

In addition, as Cray XE machines are constructed from quite powerful SMP building blocks (i.e. individual nodes with up to 32 cores), a hybrid programming approach using OpenMP for parallelism within a node and MPI for communications outwith a node will generally produce code with better scaling properties than a pure MPI approach.

7.3.1 Load-imbalance

None of the parallel optimisation advice here will allow your code to scale to larger numbers of cores if your code has a large amount of load-imbalance.

Load-imbalance in parallel algorithms arises where different parallel tasks (or threads) have large differences in the amount of computational work to perform. This, in turn, leads to some tasks (or threads) sitting idle at synchronisation points while waiting for other tasks to complete their block of work. Obviously, this can lead to a large amount of inefficiency in the program and can seriously inhibit good scaling behaviour.

Before optimising the parallel performance of your code it is always worth profiling (see the Profiling section) to try to identify the level of load-imbalance in your code; CrayPAT provides excellent tools for this. If you find a large amount of load-imbalance then you should eliminate this as much as possible before proceeding. Note that load-imbalance may only become apparent once you start using the code on higher and higher numbers of cores.

Eliminating load-imbalance can involve changing the algorithm you are using and/or changing the parallel decomposition of your problem. Generally, this issue is very code-specific.

7.3.2 MPI Optimisation

The majority of parallel scientific software still uses the MPI library as the main way to implement parallelism, so much effort has been put in by Cray software engineers to optimise the MPI performance on Cray XE systems. You should take advantage of this by using high-level MPI routines for parallel operations wherever possible. For example, you should almost always use MPI collective calls rather than writing your own versions using lower-level MPI sends and receives.

When writing MPI (or hybrid MPI+X) code you should:

  • overlap communication and computation by using non-blocking operations wherever possible (see the sketch after this list);
  • pre-post receives before the matching send operation is called to save memory copies and MPI buffer management overheads;
  • send few large messages rather than many small messages to minimise latency costs;
  • use collective communication routines as little as possible;
  • avoid the use of mpi_sendrecv as this is an extremely slow operation unless the two MPI tasks involved are perfectly synchronised.
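A minimal C sketch of the first two points (buffer names, message size and neighbour ranks are illustrative):

#include <mpi.h>

void exchange(double *sendbuf, double *recvbuf, int count,
              int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* Pre-post the receive before the matching send is issued */
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* ... computation that touches neither buffer runs here,
       overlapping with the communication ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}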

Some useful MPI environment variables that can be used to tune the performance of your application are:

MPICH_ENV_DISPLAY
set to display the current environment settings when an MPI program is executed.
MPICH_FAST_MEMCPY
use an optimised memory copy function in all MPI routines.
MPICH_MAX_SHORT_MSG_SIZE
tunes the use of the eager messaging protocol, which tries to minimise the use of the MPI system buffer. The default on Cray XE systems is usually 128000 bytes. Increasing/decreasing this value may improve performance.
MPICH_USE_DMAPP_COLL
can give better performance for MPI_Allreduce and MPI_Barrier for large numbers of cores.
MPICH_UNEX_BUFFER_SIZE
increases the buffer size for messages that are received before the receive has been posted (default is 60MB). Increasing this may improve performance if you have a large number of such messages. It is better to alter the code to pre-post receives if possible, though.

Use “man intro_mpi” on the machine to show a full list of available options.

7.3.3 Mapping tasks/threads onto cores

The way in which your parallel tasks/threads are mapped onto the cores of the Cray XE compute nodes can have a large effect on performance. Some options you may want to consider are:

  • When underpopulating a compute node with parallel tasks it can often be beneficial to ensure that the parallel tasks are evenly spread across NUMA regions using the -S option to aprun (see below). This has the potential to optimise the memory bandwidth available to each core and to free up the additional cores for use by the multithreaded version of Cray’s LibSci library: set the OMP_NUM_THREADS environment variable to the number of spare cores that are available to each parallel task and use the “-d $OMP_NUM_THREADS” option to aprun (see below).
  • On the AMD Bulldozer architecture (Interlagos processors), if you use half the cores per node you may be able to get additional performance by ensuring that each core has exclusive access to the shared floating point unit. You can do this by specifying the “-d 2” option to aprun (see below).

The aprun command, which launches parallel jobs onto Cray XE compute nodes, has a range of options for specifying how parallel tasks and threads are mapped onto the actual cores on a node. Some of the most important options are:

-n parallel_tasks
Total number of parallel tasks (not including threads). Default is 1.
-N parallel_tasks_per_node
Number of parallel tasks (not including threads) per node. Default is the number of cores in a node.
-d threads_per_parallel_task
Number of threads per parallel task. For OpenMP codes this will usually be equal to $OMP_NUM_THREADS. Default is 1.
-S parallel_tasks_per_numa
Number of parallel tasks to assign to each NUMA region on the node. There are 4 NUMA regions per XE compute node. Default is the number of cores in a NUMA region.

Some examples should help to illustrate the various options. In all the examples we assume we are running on Cray XE compute nodes that have 32 cores per node arranged into 4 NUMA regions of 8 cores each.

Example 1:

Pure MPI job using 1024 MPI tasks (-n option) with 32 tasks per node (-N option):

aprun -n 1024 -N 32 my_app.x

This is analogous to the behaviour of mpiexec on Linux clusters.

Example 2:

Hybrid MPI/OpenMP job using 512 MPI tasks (-n option) with 8 OpenMP threads per MPI task (-d option), 4096 cores in total. There will be 4 MPI tasks per node (-N option) and the 8 OpenMP threads are placed such that the threads associated with each MPI task are assigned to the same NUMA region (1 MPI task per NUMA region, -S option):

aprun -n 512 -N 4 -d 8 -S 1 my_app.x

Further information on job placement can be found in the Cray documentation or by typing:

man aprun

when logged on to any of the Cray XE systems.

7.4 Advanced OpenMP usage

On Cray XE systems, when using the GNU compiler suite, the location of the thread that initialises the data can determine the location of the data. This means that if you allocate your data in the serial portion of the code then the data will be located in the NUMA region associated with thread 0. This behaviour can have implications for performance in the parallel regions of the code if a thread from a different NUMA region then tries to access that data. If you are using the Cray or PGI compiler suites then there is no guarantee of where shared data will be located if your OpenMP code spans multiple NUMA regions. We always recommend that OpenMP code does not span multiple NUMA regions on Cray XE systems. See below for recommended task/thread configurations.

You can overcome this limitation, when using the GNU compiler suite, by initialising your data in parallel (within a parallel region), as in the sketch below, or, for any compiler suite, by not using OpenMP parallel regions that span multiple NUMA regions on a node.
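A sketch of such parallel “first touch” initialisation in C with OpenMP (names are illustrative):

#include <stdlib.h>

double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);

    /* Each thread initialises (first touches) the chunk it will later
       work on; with the same static schedule as the compute loops, the
       pages end up in the touching thread's NUMA region. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;

    return a;
}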

In general, it has been found that it is very difficult to gain any parallel performance when using OpenMP parallel regions that span multiple NUMA regions on a Cray XE compute node. For this reason, you will generally find that it is best to use one of the following task/thread layouts if your code contains OpenMP.

32-core nodes (four 8-core NUMA regions per node):

MPI Tasks per NUMA Region    Threads per MPI task    aprun syntax
1                            8                       aprun -n … -N 4 -S 1 -d 8 …
2                            4                       aprun -n … -N 8 -S 2 -d 4 …
4                            2                       aprun -n … -N 16 -S 4 -d 2 …

24-core nodes (four 6-core NUMA regions per node):

MPI Tasks per NUMA Region    Threads per MPI task    aprun syntax
1                            6                       aprun -n … -N 4 -S 1 -d 6 …
2                            3                       aprun -n … -N 8 -S 2 -d 3 …
3                            2                       aprun -n … -N 12 -S 3 -d 2 …

There is a known issue with OpenMP thread migration when using the GNU programming environment to compile OpenMP code with multiple parallel regions. When a parallel region finishes and a new parallel region begins, all of the threads become assigned to core 0, leading to extremely poor performance. To prevent this happening you should add the “-cc none” option to aprun.
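For example (task and thread counts are illustrative):

aprun -cc none -n 64 -N 4 -S 1 -d 8 my_app.x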

7.4.1 Environment variables

The following are the most important OpenMP environment variables:

OMP_NUM_THREADS
Sets the maximum number of OpenMP threads available to each parallel task.
OMP_NESTED
Enables nested OpenMP parallel regions. Note that this functionality is currently only supported by the Cray and GNU compilers.
OMP_SCHEDULE
Determines how iterations of loops are scheduled.
OMP_STACKSIZE
Specifies the size of the stack for threads created.
OMP_WAIT_POLICY
Controls the desired behaviour of waiting threads.
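For example, a job submission script might contain (values are illustrative):

export OMP_NUM_THREADS=8
export OMP_STACKSIZE=64M
export OMP_SCHEDULE=static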

A more complete list of OpenMP environment variables can be found in the OpenMP specification.

7.5 Memory optimisation

Although the dynamic memory allocation procedures in modern programming languages offer a large amount of convenience, the allocation and deallocation functions are time-consuming operations. For this reason they should be avoided in subroutines/functions that are frequently called (see the sketch below).
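A sketch of this idea in C (names are illustrative): allocate a work array once and reuse it across calls, rather than calling malloc()/free() inside a frequently called function.

#include <stdlib.h>

static double *work     = NULL;  /* reused across calls           */
static size_t  work_len = 0;     /* current capacity, in elements */

double *get_workspace(size_t n)
{
    if (n > work_len) {          /* (re)allocate only when needed */
        free(work);
        work = malloc(n * sizeof *work);
        work_len = (work != NULL) ? n : 0;
    }
    return work;
}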

The aprun option -m size[h|hs] specifies the per-PE required Resident Set Size (RSS) memory size in megabytes (K, M, and G suffixes, case insensitive, are supported). If you do not include the -m option, the default amount of memory available to each task equals the minimum value of (compute node memory size) / (number of cores), calculated for each compute node.

7.5.1 Memory affinity

Please see the discussion of memory affinity in the OpenMP section above.

7.5.2 Memory allocation (malloc) tuning

The default is to allow remote-NUMA-node memory allocation to all assigned NUMA nodes. Use the aprunoption -ss to specify strict memory containment per NUMA node.

Linux also provides some environment variables to control how malloc behaves, e.g. MALLOC_TRIM_THRESHOLD_, which is the amount of free space at the top of the heap, after a free(), that must exist before malloc will return the memory to the OS. Returning memory to the OS is costly. The default setting of 128 KBytes is much too low for a node with 4 GBytes of memory and one application. Setting it higher might improve performance for some applications.
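For example, to raise the threshold to 512 MBytes (an illustrative value) before launching the application:

export MALLOC_TRIM_THRESHOLD_=536870912
aprun -n 32 my_app.x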

7.5.3 Using huge pages

Huge pages are virtual memory pages that are larger than the default 4KB page size. They can improve memory performance for codes that have common access patterns across large datasets. Huge pages are generally accessed via the libhugetlbfs library (-lhugetlbfs) and by setting HUGETLB_MORECORE=yes at runtime. Huge pages can sometimes provide better performance by reducing the number of TLB misses and by enforcing larger sequential physical memory inside each page.

The Cray XE system is set up to have huge pages available by default. The modules craype-hugepages2m and craype-hugepages8m can be used to set the necessary link options and environment variables to enable the use of 2MB or 8MB huge pages respectively. The AMD Opteron also supports multiple page sizes (128KB, 512KB, 2MB, 8MB, 16MB, 64MB); the default huge page size is 2 Mbytes. You will also need to load the appropriate craype-hugepages module at runtime (in your job submission script) for huge pages to work.

If you know the memory requirements of your application in advance you should set the -m option to aprun when you launch your job to preallocate the appropriate number of huge pages. This improves performance by reducing operating system overhead. The syntax is:

-m sizeh
request size Mbytes of huge pages per PE (advisory)
-m sizehs
request size Mbytes of huge pages per PE (required)
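For example, a job script fragment using 2MB huge pages with an advisory preallocation (the size shown is illustrative) might be:

module load craype-hugepages2m
aprun -n 64 -N 32 -m1400h my_app.x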

7.6 I/O optimisation

I/O subsystems are system dependent. Please consult the individual site documentation for information on the I/O subsystems.

The HECToR CSE team has published some information on how to optimise the I/O on the HECToR system, which includes information that may be generally applicable.

8. Debugging

8.1 Available Debuggers

The Cray XE system comes with a number of tools to aid in debugging your program. Some systems may also have third-party debugging tools (such as the TotalView debugger) installed. Please consult the site-specific documentation for information on any installed third-party debugging tools.

Site        Available Debuggers
HECToR      Cray ATP, RogueWave TotalView, lgdb
Lindgren    RogueWave TotalView

Note that the usefulness and accuracy of the information within any debugger depends on your compilation options. If you have optimisation switched on then you may find that the line numbers listed in the debugging information do not correspond to the statements in your source code file. For debugging we always recommend that you compile with optimisation switched off and the -g flag enabled to provide the most accurate information.

8.2 TotalView

Cray XE TotalView provides source-level debugging of Fortran, C, and C++ code compiled by the Cray, PGI, Intel and GNU compilers. The debugging tool provides both a command line interface and a Motif-based GUI. It supports MPI message queue display and watchpoints. There may be some limitations, which are site dependent; please consult your site for further information.

8.2.1 Example: Debugging an MPI application

The following example shows how to invoke TotalView to debug an MPI code.

  • Start an X-server on your local machine (if you need to).
  • Login to the system using ’ssh -Y’ (or ’ssh -X’) to enable X-windows forwarding.
  • Compile your code with the ’-g’ option. Your code and executable must be in the “work” directory.
  • Submit the TotalView job to the batch system and leave the terminal you submitted the job from open. Below is an example for an Interlagos-based Cray XE system.
#PBS -A your_budget_account
#PBS -l walltime=00:05:00
#PBS -l mppwidth=64
#PBS -l mppnppn=32


totalview aprun -a -b -a xt -n 64 -N 32 /work/.../myprog.x 
  • When the job starts, the dialogue in Figure 8.1 will be displayed – click ’OK’:


Figure 8.1: Initial TotalView dialogue.

  • An empty TotalView debugging window will be displayed (Figure 8.2) – click ’Go’ to start the program:


Figure 8.2: Empty TotalView debugging window.

  • The ’Halt program’ dialogue will be displayed (Figure 8.3) – click ’Yes’ to begin debugging:


Figure 8.3: ’Halt program’ dialogue.

  • The TotalView debugging window (Figure 8.4) will be displayed with your source code in the middle frame. The top-left frame shows the current call tree and the top-right frame shows the current values of defined variables.


Figure 8.4: Populated TotalView debugging window.

  • To add a breakpoint at a particular subroutine or function select ’Debug -> Breakpoint -> At Location…’, enter the name of the subroutine or function in the resulting dialogue (Figure 8.5) and click ’OK’:


Figure 8.5: Add Breakpoint dialogue.

  • Click ’Go’ in the TotalView debugging window and the program will run until the named routine is reached (Figure 8.6):


Figure 8.6: TotalView debugging window halted at breakpoint.

  • You can add further breakpoints by scrolling through the source and clicking on the line number to the left of the source code.

8.2.2 Example: Using TotalView to debug a core file

To generate core files you just need your working directory to be in the “work” filesystem, and to have the line:

ulimit -c unlimited

in your batch script. Unfortunately the option to tag core files with the process ID is not enabled, so if more than one process dumps core then the core files will overwrite each other. You can use Cray ATP (see below) to produce a merged set of core files for a program that crashes.

To use the TotalView GUI to debug a core file, follow these steps:

  • Start an X Server on your local machine and login to the system using the ’-Y’ (or ’-X’) option to ssh, then launch TotalView with the ’totalview’ command.

  • Click the down arrow on the field showing “Start a new process”, and from the drop-down list select “Open a core file”. The Program and Core file fields display.
  • In the Program field, enter the name of the program you wish to debug. In the Core file field, enter the name of the core file produced by this program. If necessary, use the Browse functions to find and select the files.
  • Click the “OK” button. TotalView opens the executable and core files.

Alternatively, you can use the command-line interface (CLI) to debug the program by entering the following command:

totalviewcli program_name core_file_name

8.2.3 TotalView Limitations

The TotalView debugging suite for the Cray XE differs in functionality from the standard TotalView implementation in the following ways:

  • The TotalView Visualizer is not included.
  • The TotalView HyperHelp browser is not included.
  • Debugging multiple threads on compute nodes is not supported.
  • Debugging MPI_Spawn(), OpenMP, Cray SHMEM, or PVM programs is not supported.
  • Compiled EVAL points and expressions are not supported.
  • Type transformations for the PGI C++ compiler standard template library collection classes are not supported.
  • Exception handling for the PGI C++ compiler runtime library is not supported.
  • Spawning a process onto the compute processors is not supported.
  • Machine partitioning schemes, gang scheduling, or batch systems are not supported.

In some cases, TotalView functionality is limited because Compute Node Linux (CNL) does not support the feature in the user program.

8.3 Cray ATP

Cray ATP (Abnormal Termination Processing) is a tool that monitors your application and, in the event of an abnormal termination, collates the failure information from all the running processes into files for analysis.

With ATP enabled, in the event of abnormal termination, all of the stacktraces are gathered from the dying processes, analysed, and collated into a single file called atpMergedBT.dot. In addition, the stacktrace from the first process to die (hence the probable cause of the failure) is delivered to stderr.

The file can be viewed using the statview command, which is made available by loading the stat module.

ATP will also dump a heuristically-selected set of core files, named core.atp.apid.rank, where ’apid’ is the APID of the application and ’rank’ is the parallel rank of the process that produced the core file.

8.3.1 ATP Example

To enable ATP you should load the atp module in your job submission script with the command:

module load atp

and set the ATP_ENABLED environment variable to enable the ATP functionality:

export ATP_ENABLED=1

and then run your job using aprun as usual. Once your application has terminated abnormally you need to log into the system while exporting the X display back to your local machine (you must have an X server running locally) with:

ssh -Y username@site

Load the stat module with:

module add stat

and view the merged stacktrace with:

statview atpMergedBT.dot
The stderr from your job should also contain useful information that has been processed by ATP.

8.4 GDB (GNU Debugger)

The standard GNU debugger, GDB, is available on Cray XE systems. The debugger currently only supports the command line interface.

There are two components that you must use to debug your parallel program using GDB:

  • The ’lgdb’ program, which launches gdbserver processes attached to your parallel program.
  • The ’gdb’ program, which connects to the remote program instances (started using ’lgdb’) and provides the debugging command line interface.

When you execute your program using ’lgdb’ the system will provide instructions on how to connect to the gdbserver process to debug your program. If your site does not support interactive access (i.e. you can only execute parallel jobs via job submission scripts) then you must remember to redirect STDOUT from the gdbserver process to a file you can access while the job is running, so that you have access to the information needed to connect gdb to the gdbserver. By default, on many Cray XE systems, the output from STDOUT is only delivered once the job is completed. See the example below for details.

8.4.1 Launching your program using ’lgdb’

The ’lgdb’ command is used to launch your program and attach a gdbserver process to enable debugging. If you are running interactively, the syntax for launching a 64-task job and debugging parallel process 0 would be:

lgdb --pes=0 --command="aprun -n 64 -N 32 my_parallel_program.x"

This command will yield instructions on how to connect the ’gdb’ process, which will look something like:

user@login1:/work/user/debug> less stdout.txt 
sending /opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/lgdbd... completed
sending /opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/gdbserver... completed

*** create a new window and load the correct lgdb module for each target
*** run gdb from the following path:
/opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/gdb [PATH-TO-YOUR-APPLICATION]

*** the following gdb target commands should be used in separate windows
*** [Pe=0] to debug this Pe type the following in gdb
target remote nid00003:10000

If you do not have interactive access and need to run in batch mode then you simply replace the normal aprun command in your job submission script with the call to ’lgdb’ and redirect STDOUT to a file. For example:

lgdb --pes=0 --command="aprun -n 64 -N 32 my_parallel_program.x" > stdout.txt

You must redirect STDOUT to a file in this way so that you can access the information shown above on how to connect to the gdbserver from the ’gdb’ program.

8.4.2 Debugging the remote gdbserver using ’gdb’

Once you have your compute processes running with an associated gdbserver (started using the ’lgdb’ command as specified above) you can start the GNU debugger on the command line on the login node with a command such as:

user@login1:/work/user/debug> \
  /opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/gdb my_parallel_program.x

This will give you the ’(gdb)’ prompt where you can enter the command to connect to the gdbserver process and start debugging. For example:

(gdb) target remote nid00003:10000

Now you can use gdb in the same way as you would if you were debugging a local program.

8.4.3 Useful GDB commands

Please see the GNU debugger documentation for a full list of the gdb commands. Some of the most often used commands are listed below.

  • break function_name – (or b) insert a breakpoint at the start of the specified function
  • break file:line_number – insert a breakpoint at the given line number in the specified file
  • continue – (or c) continue running the program until the next breakpoint is reached
  • next – (or n) step to the next line of the program (steps over subroutine calls; use ’step’ to step into them)
  • list – (or l) list the source code around the current position
  • list start_line,end_line – list the source code from start_line to end_line in the current function
  • print variable_name – (or p) print the value of the specified variable
  • print array_name(index) – print the value at the specified index of a 1D array
  • print array_name(index1,index2) – print the value at the specified index of a 2D array
  • print array_name(index)@elements – print ’elements’ values from the array starting at index
  • ptype variable_name – print information on the variable type and array dimensions (if it is an array)
  • quit – (or q) quit gdb and halt the running program

8.4.4 Example: debugging an MPI program using GDB

This example illustrates the debugging of the VASP 5 code.

First, you must compile your program with debugging symbols (-g flag). You should also usually ensure that optimisation is turned off (-O0 flag) so that reordering of source code lines does not take place. (Of course, it may sometimes be necessary to include optimisation if this is the cause of the problems.)

In this example we will assume that you are running without interactive access to the compute nodes. Write a job submission script for your job in the usual way but with the following changes: load the ’xt-lgdb’ module and replace the standard aprun line with a call to ’lgdb’ that contains your aprun command and redirects STDOUT. For example:

#!/bin/bash --login
#PBS -N vasp_debug

# Number of MPI processes
#PBS -l mppwidth=64
#PBS -l mppnppn=32

# Walltime for the debug job
#PBS -l walltime=1:0:0

# Your account code
#PBS -A z01

# Add the Cray GDB module
module add xt-lgdb

# Location of the VASP 5 executable
export VASP_EXEDIR=/work/user/software/VASP/bin

# Change to the directory the job was submitted from
cd $PBS_O_WORKDIR
# Start the gdbserver with our parallel job.
#   We make sure we redirect STDOUT (to stdout.txt) so we can access
#    the information needed to attach to the remote gdbserver
#   We also use the --pes=0 option to start a single gdbserver instance
#    attached to the first MPI task
lgdb --pes=0 --command="aprun -n 64 -N 32 $VASP_EXEDIR/vasp" > stdout.txt

You should then submit this job in the usual way. Once the job is running, you will be able to inspect the contents of the ’stdout.txt’ file to get the ID of the server to attach to using GDB. For example:

user@login1:/work/user/debug> less stdout.txt 
sending /opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/lgdbd... completed
sending /opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/gdbserver... completed

*** create a new window and load the correct lgdb module for each target
*** run gdb from the following path:
/opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/gdb [PATH-TO-YOUR-APPLICATION]

*** the following gdb target commands should be used in separate windows
*** [Pe=0] to debug this Pe type the following in gdb
target remote nid00003:10000

This tells us the ’gdb’ binary to use and indicates that we should use GDB to target the remote gdbserver at ’nid00003:10000’. On the login node command line, run the specified ’gdb’ executable:

user@login1:/work/user/debug> \
  /opt/cray/xt-tools/lgdb/1.4/xt/x86_64-unknown-linux-gnu/bin/gdb $VASP_EXEDIR/vasp

dlopen failed on 'libthread_db.so.1' - /lib64/libthread_db.so.1: undefined symbol: ps_lgetfpregs
GDB will not be able to debug pthreads.

GNU gdb 6.8
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

and then target the remote gdbserver with the command specified in the ’stdout.txt’ file:

(gdb) target remote nid00003:10000
Remote debugging using nid00003:10000
[New Thread 22131]
0x00000000012aed60 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:82
82      ../sysdeps/unix/syscall-template.S: No such file or directory.
        in ../sysdeps/unix/syscall-template.S
Current language:  auto; currently asm

Now we can add a breakpoint at one of our program functions and proceed to it. For example:

(gdb) b force_and_stress_
Breakpoint 1 at 0x87f168: file ./force.f, line 1160.
(gdb) c

Once the program has reached the specified breakpoint we can start debugging. To see the current backtrace of where we are in the program:

Breakpoint 1, force_and_stress_ (kineden=Cannot access memory at address 0x0
) at ./force.f:1160
1160          CALL START_TIMING("G")
Current language:  auto; currently fortran
(gdb) bt
#0  force_and_stress_ (kineden=Cannot access memory at address 0x0
) at ./force.f:1160
#1  0x000000000041ad48 in vamp () at ./main.f:2665
#2  0x00000000004008e0 in main ()
#3  0x0000000001374d14 in __libc_start_main (main=0x4008a0 <main>, argc=1, ubp_av=0x7fffffffb548, 
    init=0x1375200 <__libc_csu_init>, fini=0x13751c0 <__libc_csu_fini>, rtld_fini=0, stack_end=0x7fffffffb538)
    at libc-start.c:226
#4  0x00000000004007a9 in _start () at ../sysdeps/x86_64/elf/start.S:113

We can list the source code lines and add another breakpoint further into the routine by line number:

(gdb) l 1160,1180
1160          CALL START_TIMING("G")
1162          DO ISP=1,WDES%NCDIJ
1163             CALL RC_ADD(CHTOT(1,ISP),1.0_q,CHTOT(1,ISP),0.0_q,CHTOTL(1,ISP),GRIDC)
1164          ENDDO
1165          IF (LDO_METAGGA().AND.LMIX_TAU()) THEN
1166             DO ISP=1,WDES%NCDIJ
1167                CALL RC_ADD(KINEDEN%TAU(1,ISP),1.0_q,KINEDEN%TAU(1,ISP),0.0_q,KINEDEN%TAUL(1,ISP),GRIDC)
1168             ENDDO
1169          ENDIF
1170          RHOLM_LAST=RHOLM
1173             CALL SET_CHARGE(W, WDES, INFO%LOVERL, &
1174                  GRID, GRIDC, GRID_SOFT, GRIDUS, C_TO_US, SOFT_TO_C, &
1175                  LATT_CUR, P, SYMM, T_INFO, &
1178             CALL STOP_TIMING("G",IO%IU6,'CHARGE')
1179          ENDIF
1180    ! - - - - - - - - - - -- FORCES ON IONS     - - - - - - - - - - - - - - 
(gdb) b ./force.f:1172
Breakpoint 2 at 0x87f37a: file ./force.f, line 1172.

and then proceed to this breakpoint:

(gdb) c

Breakpoint 2, force_and_stress_ (kineden=Cannot access memory at address 0x0
) at ./force.f:1172

Now we can examine the values of some of the variables:

(gdb) ptype info%lchcon
type = logical
(gdb) p info%lchcon
$1 = .FALSE.
(gdb) ptype rholm
type = double precision (0,0)
(gdb) p rholm(1,1)
$2 = 0.051804883959039337
(gdb) p rholm(1,1)@3
$3 = (0.051804883959039337, 0.0083683781999898572, -0.0018751730313048671)

The last expression shows the values of 3 consecutive elements of rholm starting at (1,1).

Once you have finished debugging you can kill the running program and quit the debugger with the ’quit’ command:

(gdb) q
The program is running.  Exit anyway? (y or n) y

Author: Andrew Turner, Xu Guo, Lilit

Date: 30 May 2012

