Best Practice Guide – Curie

Best Practice Guide – Curie v1.17

Jacques David

CEA

Jean-Noel Richet

CEA

Eric Boyer

CINES

Nikos Anastopoulos

GRNET

Guillaume Collet

CEA

Guillaume Colin de Verdière

CEA

Tyra van Olmen

CINES

Hilde Ouvrard

CINES

Cédric Cocquebert

CEA

Bruno Frogé

CEA

Alexandros Haritatos

ICCS

Konstantinos Nikas

GRNET

Theodoros Gkountouvas

GRNET

Nikela Papadopoulou

GRNET

Evangelia Athanasaki

GRNET

November 29th, 2013


Table of Contents

1. Introduction
2. System Architecture and Configuration
2.1. Curie’s configuration
2.1.1. Extra Large nodes
2.1.2. Thin nodes
2.1.3. Hybrid nodes
2.2. Operating system
2.3. File systems
3. System Access
3.1. Login to Curie
3.2. Passwords
3.3. File transfer
3.4. PRACE infrastructure
3.4.1. Connect to a remote PRACE supercomputer
3.4.2. Connect to Curie from a remote PRACE supercomputer
3.4.3. Transfer data between PRACE supercomputers
4. Production Environment
4.1. Accounting
4.1.1. The ccc_myproject command
4.2. Available shells
4.3. Restore lost files
4.4. Module environment
4.5. Job submission and monitoring
4.5.1. Job submission
4.5.2. Job monitoring
5. Programming Environment / Basic Porting
5.1. Text editors
5.2. Compilers
5.2.1. Available compilers
5.2.2. Compiler flags
5.3. Available libraries
5.3.1. MKL Library
5.3.2. Other libraries
5.4. MPI
5.4.1. MPI implementations
5.4.2. MPI Compiler
5.5. OpenMP
5.5.1. Compiler flags
5.5.2. Execution modes
5.6. CUDA
5.7. OpenCL
6. Debugging
6.1. Available Debuggers
6.1.1. GDB
6.1.2. IDB
6.1.3. DDT
6.2. Compiler flags
7. Performance Analysis
7.1. Available performance analysis tools
7.1.1. PAPI
7.1.2. VampirTrace/Vampir
7.1.3. Scalasca
7.1.4. Paraver
7.1.5. Valgrind
7.2. Use cases and hints for interpreting results
7.2.1. PAPI
7.2.2. VampirTrace/Vampir
7.2.3. Scalasca + Vampir
7.2.4. Scalasca + PAPI
8. Tuning Applications
8.1. Choosing or excluding nodes
8.2. Process distribution, affinity and binding
8.2.1. Introduction
8.2.2. SLURM
8.2.3. BullxMPI
8.3. Advanced MPI usage
8.3.1. Embarrassingly parallel jobs and MPMD jobs
8.3.2. BullxMPI
8.4. Advanced OpenMP usage
8.4.1. Environment variables
8.4.2. Thread affinity
8.5. Hybrid programming
8.5.1. Optimal processes/threads strategy
8.6. GPGPU programming
8.6.1. Reorganise the code
8.6.2. Profile the result
8.6.3. Introduce GPGPU
8.7. Memory optimisation
8.7.1. Memory affinity
8.7.2. Memory Allocation tuning
9. Case studies
9.1. Case study 1: Executing the STREAM benchmark on Extra Large nodes
9.2. Case study 2: Executing the SpMV and Jacobi kernels on Extra Large and Thin nodes
9.2.1. The SpMV kernel
9.2.2. The Jacobi kernel
9.2.3. Optimising the SpMV kernel using OpenMP on Extra Large nodes
9.2.4. Executing hybrid SpMV and Jacobi kernels on Extra Large nodes
9.2.5. Executing hybrid SpMV and Jacobi kernels on Thin nodes

1. Introduction

This Best Practice Guide is intended to help users to get the best
productivity out of the PRACE French Tier 0 Curie system. Curie is a
supercomputer of the bullx series (designed by Bull for extreme computing),
which is owned by GENCI (Grand Equipement National de Calcul Intensif,
French representative in PRACE), hosted and operated by CEA (Commissariat
à l’Energie Atomique) in its TGCC (Très Grand Centre de calcul du CEA) in
Bruyères-le-Châtel.

Curie is the second European Tier-0 system, installed in France; it is
a Bull system based on Intel processors. The first phase of the system, a.k.a.
“fat nodes” (Intel X7560, code-named Nehalem-EX), was completed at the end of 2010; a
subsequent “hybrid” part with Nvidia M2090 GPUs was installed during summer 2011, and
the full system, a.k.a. “thin nodes” (with more than 80,000 cores of Intel’s latest
E5-2680 processors, code-named Sandy Bridge), became available in October 2011.
In late 2012, the “fat nodes” were converted into “extra large” nodes by
grouping four of them into a single, “super-fat” shared-memory node, using the
novel Bull Coherent Switch (BCS) architecture.

 

Figure 1. Bull Curie supercomputer in TGCC


 

This guide covers a wide range of topics, including:

  • Architecture
  • Code porting and compilation
  • Debugging tools
  • Serial optimisation and compiler flags
  • Parallel optimisation, including two hybrid modes, MPI/OpenMP and CPU/GPU
  • I/O best practice and optimisation
  • Performance analysis tools
  • Tuning applications
  • Case Studies

While targeted at the bullx supercomputer series, most of the
information presented here is also relevant to any supercomputer based
on Intel x86-64 processors and InfiniBand.

More information about Curie and access to Curie can be found at:

http://www-hpc.cea.fr/en/complexe/tgcc-curie.htm

More information on Curie’s hosting centre can be found at:

http://www-hpc.cea.fr/en/complexe/tgcc.htm

Information on GENCI can be found at:

http://www.genci.fr/?lang=en

Information on CEA can be found at:

http://www.cea.fr/english_portal

The latest version of this document can be found at:

http://www.prace-ri.eu/Best-Practice-Guides?lang=en

and other PRACE-related information at:

http://www.prace-ri.eu/?lang=en

2. System Architecture and Configuration

2.1. Curie’s configuration

Curie is composed of three different architectures: 90 Extra Large
nodes, 5040 Thin nodes, and 144 Hybrid nodes.

2.1.1. Extra Large nodes

The Extra Large nodes mainly target
hybrid parallel codes (MPI-OpenMP) requiring large memory and/or multithreading
capacity, as well as pre- and post-processing tasks. They have been formed by grouping
four bullx S6010 modules (previously known as “Curie Fat nodes”) into a single
shared-memory system, using Bull’s novel Coherent Switch (BCS) architecture.
BCS is an ASIC chip designed by Bull to feature a memory caching system for the
interconnect network between the processors, memory and I/O. It is responsible
for providing a global, consistent view of main memory data for all processors
of the system. The end result is a super-fat node consisting of 128 cores and
a total of 512 GB main memory. Specifications for the Curie Extra Large nodes
are given in Table 1.

 

Table 1. Curie Extra Large nodes specifications

System: bullx S6010 modules (4x), interconnected with Bull’s BCS technology
Processor Model: Intel X7560 (Nehalem EX)
Total Number of Nodes: 90
Total Number of Cores: 11520
Number of Sockets (NUMA nodes) per Node: 16
Number of Cores per Socket: 8
Hyperthreading: disabled
Threads per Core: 1
Total Number of Threads per Node: 128
Total Amount of Memory per Node: 512 GB
Peak Frequency per Core: 2.26 GHz
Amount of Local Memory per Socket (NUMA node): 32 GB
L3 Cache per Socket (shared): 24 MB
L2 Cache per Core: 256 KB
L1 Cache per Core: 32 KB

 

The Extra Large nodes are accessible through the xlarge queue.

Figure 2 shows the block diagram of a Curie Extra Large Node. It
comprises 4 modules (boards), each of which contains 4 sockets (packages) with 8
cores, giving 128 processors in total. Each socket is adjacent to a local
memory controller and memory bus (NUMA node), which offers a separate path
to main memory for that socket’s processors. This is a Non-Uniform Memory
Access (NUMA) organisation, which guarantees uncontended and fast memory
access when threads are spread across different sockets and their data are
allocated on the local memory modules.

The figure depicts the CPU numbers that correspond to each
processor in a Curie Extra Large Node, as seen by the operating system itself. These
numbers can be used by the programmer to specify a desired affinity mapping at a
low level.

 

Figure 2. Processor topology inside a Curie Extra-Large node


 

Figure 3 presents a closer view of the processor
topology for one of the four modules of a Curie Extra Large node. Up until
the autumn of 2012, such a module constituted a single and standalone Curie
Fat node.

 

Figure 3. Closer view of one of the four modules of a Curie Extra Large node
(formerly constituting a single Curie Fat node)


 

Figure 4 shows the Bull Coherent Switch (BCS) architecture used to
interconnect the 4 bullx S6010 modules (boards) into one Extra Large node. With the
BCS architecture, thanks to the caching of coherency information, snoop responses
consume only 5 to 10% of the Intel QPI bandwidth and of the switch fabric
bandwidth. BCS provides local memory access latency comparable to that of regular
4-socket systems. Via the eXtended QPI (XQPI) network, a 4-socket bullx module
communicates with the other 3 modules as if the whole machine were a single
16-socket system. Therefore all accesses to local memory have the bandwidth and
latency of a regular 4-socket system. BCS is capable of transferring data at an
aggregate rate of 230 GB/s (9 ports x 25.6 GB/s).

 

Figure 4. Bull Coherent Switch architecture


 

2.1.2. Thin nodes

The Thin nodes are mainly targeted at MPI parallel
codes. They contain fewer cores per node, compared to the Extra Large nodes, but
their total number is much larger. Their specifications are given in Table
2.

 

Table 2. Curie Thin nodes specifications

System: bullx B510
Processor Model: Intel E5-2680 (Sandy Bridge EP)
Total Number of Nodes: 5040
Total Number of Cores: 80640
Number of Sockets (NUMA nodes) per Node: 2
Number of Cores per Socket: 8
Hyperthreading: disabled
Threads per Core: 1
Total Number of Cores per Node: 16
Total Amount of Memory per Node: 64 GB
Peak Frequency per Core: 2.7 GHz
Amount of Local Memory per Socket (NUMA node): 32 GB
L3 Cache per Socket (shared): 20 MB
L2 Cache per Core: 256 KB
L1 Cache per Core: 32 KB

 

The Thin nodes are accessible through the standard queue.

Figure 5 shows the block diagram of a Curie Thin Node. It has 2
sockets each with 8 cores, giving 16 processors in total. Each socket is
adjacent to a local memory controller and memory bus (NUMA node).

The figure depicts the CPU numbers that correspond to each
processor in a Curie Thin Node, as seen by the operating system itself. These
numbers can be used by the programmer to specify a desired affinity mapping at a
low level.

 

Figure 5. Processor topology inside a Curie Thin node


 

2.1.3. Hybrid nodes

The Hybrid nodes are targeted at parallel codes that
are designed to use GPUs (partially or totally) as accelerators.
Their specifications are given in Table
3.

 

Table 3. Curie Hybrid nodes specifications

System: bullx B505
Processor Model: Intel Westmere EP
GPU Model: Nvidia M2090 T20A
Total Number of Nodes: 144
Total Number of Cores: 1152
Total Number of GPUs: 288
Number of Sockets (NUMA nodes) per Node: 2
Number of Cores per Socket: 4
Total Number of Cores per Node: 8
Number of GPUs per Node: 2
Total Amount of Memory per Node: 24 GB
Total Amount of Memory per GPU: 6 GB
Peak Frequency per Core: 2.67 GHz

 

The Hybrid nodes are accessible through the hybrid queue.

2.2. Operating system

The operating system on Curie’s nodes is bullx Supercomputer Suite
RC1, based on Red Hat Enterprise Linux 6 beta.

2.3. File systems

On Curie four file systems for user files are available: HOME,
SCRATCH, WORK, and STORE.

  • HOME
    • I/O performance: slow (NFS)
    • Quota: 3GB
    • Use: sources, job submission scripts, parameter
      files…
    • Note: data are backed up
    • Environment variable: $HOMEDIR
      (or $HOME )
  • SCRATCH
    • I/O performance: fastest (Lustre)
    • Quota: 20 TB and 2 000 000 files or directories
    • Use: data, code output…
    • Environment variable:
      $SCRATCHDIR
  • WORK
    • I/O performance: fast (Lustre via routers)
    • Quota: 1 TB and 500 000 files or directories
    • Use: commonly used files (source code, binaries, …)
    • Environment variable: $WORKDIR
  • STORE
    • I/O performance: fast (Lustre via routers + HPSS +
      Tape)
    • Quota: 100 000 files and directories
    • Use: data archiving as packed data (tar files, …), no
      direct computation (see the sketch at the end of this section)
    • Important:
      • Expected file size range 1GB-100GB
      • Backup mechanism relies on file modification time:
        avoid using copy options like -p or -a
    • Environment variable: $STOREDIR

    Inappropriate usage might stop the execution of a user’s code.

  • ccc_quota gives information about your current usage of the
    filesystems:

    bash-4.0$ ccc_quota
    Disk quotas for user xxxxxx (uid xxxxx):
                ----------VOLUME----------   -----------INODE------------
    Filesystem    usage  soft  hard  grace   files    soft    hard  grace
    ----------    -----  ----  ----  -----   -----    ----    ----  -----
          home       3G    3G    3G      -       -       -       -      -
          work  903.68G  1.0T  1.1T      -   5.07K  500.0K  501.0K      -
         store        4  4.0T  4.1T      -       1  100.0K  101.0K      -

    You get the volume used (VOLUME) and the number of files or directories (INODE).

    The “soft” value is the threshold that you should not exceed. It acts as a hard limit if no grace period is set. If a grace period is set, you are warned when you go above the soft limit, so that you can get back below the threshold before the end of the period.

    If set, the “grace period” is the time you have to correct your excess over the soft limit before it is enforced as a hard limit.

    The “hard” value is enforced by the system in all cases: you can never exceed it.

  • Data protection: For each filesystem you can access, your personal directory is located under a container structure that protects your data. The directory path follows this template:
    /ccc/filesystem/container/group/login

    A container is a lock that cannot be crossed by a user who is not part of that container. A container gathers a community of similar users and corresponds to a trust perimeter.

    Users sharing the same container can collaborate and exchange data.

    Users from different containers cannot collaborate or exchange data.
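
As an illustration of the STORE usage guidelines above, here is a hedged sketch that packs a results directory into a single archive and copies it to STORE (the directory and archive names are illustrative):

-bash-4.1$ tar -cvzf results_2013.tgz resultsdirectory
-bash-4.1$ cp results_2013.tgz ${STOREDIR}/   # plain cp (no -p or -a) so the modification time is updated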

3. System Access

3.1. Login to Curie

From your local machine, you need to use the ssh command to access Curie. ssh is a program for logging into a remote machine and for executing commands on it.

-bash-4.1$ ssh login@curie.ccc.cea.fr
password: ****

If you need a graphical environment, you have to use the -X option:

-bash-4.1$ ssh -X login@curie.ccc.cea.fr

To log out from Curie, you can type Ctrl-d or use the exit command.

If you have problems when authenticating, you can try the -Y option.

3.2. Passwords

You will occasionally need to change your password. This can be done with the kpasswd command:

-bash-4.1$ kpasswd
Changing password for user **

Password rules are the following:

  • Your password should be at least 12 characters long and use at least 3 different character classes
  • Your password needs to be renewed after 12 months
  • A previously used password cannot be reused
  • If you modify your password, you will need to wait 5 days before modifying it again

If you lose your password, contact the TGCC Hotline (mailto: hotline.tgcc@cea.fr) and ask for a password reset.

3.3. File transfer

To transfer files between Curie and your local machine, you can use the scp command.

  • Create an archive with the directories you want to copy (it will be faster to transfer):
    -bash-4.1$ tar -cvzf archivename.tgz directoryname1 directoryname2

    or in case of a file:

    -bash-4.1$ tar -cvzf archivename.tgz filename
  • Transfer the archive to Curie:
    -bash-4.1$ scp archivename.tgz login@curie.ccc.cea.fr:/ccc/cont***/home/login
  • Uncompress the archive in your target directory:
    -bash-4.1$ tar -xvzf archivename.tgz -C destinationdirectory

3.4. PRACE infrastructure

Curie is part of the PRACE infrastructure, and access to the internal PRACE network and related services is available from the Curie login nodes.

Note: PRACE services like GSI-SSH and GridFTP require an authorised X.509 grid certificate. To register your grid certificate in the CEA authorisation database, please provide the Distinguished Name of your X.509 grid certificate to hotline.tgcc@cea.fr.

Note: in-depth documentation of the PRACE services and their use will soon be provided on the PRACE-RI website. In the meantime, we provide guidelines below for the most useful tasks.

3.4.1. Connect to a remote PRACE supercomputer

  • To connect with SSH to a remote PRACE supercomputer with your login information for that system:
    $ ssh jugene5d.zam.kfa-juelich.de -l your_fzj_login
  • If you have an X.509 grid certificate, registered in the authorisation database of the remote site you want to connect to, you can also connect with GSI-SSH to that remote system once your grid credentials are enabled on Curie:
    $ gsissh jugene5d.zam.kfa-juelich.de -p 2222

Note: Please see your grid Certification Authority documentation to learn how to enable your grid credential on Curie and the remote site documentation to learn how to register your grid certificate at this remote site.

3.4.2. Connect to Curie from a remote PRACE supercomputer

From a remote PRACE supercomputer, you can connect with SSH to Curie login nodes with your Curie login information

$ ssh curie-prace.ccc.cea.fr -l your_cea_login

If you have an X.509 grid certificate, registered in the CEA authorisation database, you can also connect to Curie with GSI-SSH once your grid credential is enabled on the remote site:

$ gsissh curie-prace.ccc.cea.fr -p 2222

Note: please see your Grid Certification Authority documentation to learn how to enable your grid credential on the remote site.

Note: to register your grid certificate in CEA authorisation database, please provide the Distinguished Name of your X.509 grid certificate to hotline.tgcc@cea.fr.

3.4.3. Transfer data between PRACE supercomputers

To transfer data from/to Curie, you can use SCP with the login information on both local and remote supercomputers:

  • From a remote PRACE supercomputer
    $ scp myfile your_cea_login@curie-prace.ccc.cea.fr:/path/to/copy
  • From a Curie login node
    $ scp myfile your_fzj_login@jugene5d.zam.kfa-juelich.de:/path/to/copy

    If you have an X.509 grid certificate, registered in both the CEA and the remote site authorisation databases, you can also transfer data with GridFTP:

  • From a remote PRACE supercomputer
    $ globus-url-copy gsiftp://jugene5d.zam.kfa-juelich.de:2812/path/to/sourcefile gsiftp://garbin-prace.eole.ccc.cea.fr:2812/path/to/destdir/
  • From a Curie login node
    $ globus-url-copy gsiftp://garbin-prace.eole.ccc.cea.fr:2812/path/to/sourcefile gsiftp://jugene5d.zam.kfa-juelich.de:2812/path/to/destdir/

Note: to register your grid certificate in CEA authorisation database, please provide the Distinguished Name of your X.509 grid certificate to hotline.tgcc@cea.fr .

4. Production Environment

4.1. Accounting

4.1.1. The ccc_myproject command

The command ccc_myproject gives information about the accounting of your project:

bash-4.1$ ccc_myproject
Accounting for project XXXXXXXX on Curie at 2011-04-13
Login                              Time in hours
login01     ............................75382.44
login02     ................................0.00
Total       ............................75382.44
Allocated   ..........................2000000.00
Percent Used...............................3.77%
Project deadline 201X-0X-0X

You will find:

  • the consumed compute time per project’s member
  • the total consumed compute time
  • the project’s deadline

The accounting is updated once a day.

4.2. Available shells

The default shell is bash. ksh, csh, tcsh and zsh are also available. We strongly recommend using the bash shell (only bash and csh are supported by the support team).

4.3. Restore lost files

Contact the TGCC Hotline (mailto: hotline.tgcc@cea.fr) or call +33 1 7757 4242.

4.4. Module environment

The module command allows the shell environment to be changed easily by initialising, modifying or unsetting environment variables. This gives you a complete environment to launch software or to link your code with a library.

The command line option list indicates the loaded modules in your environment:

-bash-4.1$ module list
Currently Loaded Modulefiles:
1) intel/12.0.084(default) 2) bullmpi/0.18.1(default)

The command line option avail gives all the available modules:

-bash-4.1$ module avail
------------------------------ /usr/local/ccc_users_env/modules/softwares -----------------------------------
abinit/6.4.2    cpmd/3.13.2     espresso/4.2.1  gaussian/09-B01
gromacs/4.5.3   namd/2.7        saturne/2.0.0   siesta/2.0.2    vasp/5.2.11
------------------------------ /usr/local/ccc_users_env/modules/development---------------------------------
cmake/2.8.3    ddd/3.3.12     jdk/1.6.0_23   papi/4.1.1     paraver/3.99
scalasca/1.3.2 swig/2.0.1     valgrind/3.6.0
------------------------------ /usr/local/ccc_users_env/modules/mpi -----------------------------------------
bullmpi/0.17.2          bullmpi/0.18.1(default)
------------------------------ /usr/local/ccc_users_env/modules/libraries -----------------------------------
boost/1.45.0    fftw3/3.2.2     hdf5/1.8.5      mumps/4.9.2     parmetis/3.1.1  phdf5/1.8.5     qt/4.7.1
fftw2/2.1.5     gsl/1.14        metis/4.0.1     netcdf/4.1.1    petsc/3.1       ptscotch/5.1.11 scotch/5.1.11
------------------------------ /usr/local/ccc_users_env/modules/compilers -----------------------------------
gcc/4.5.1 intel/12.0.084(default)

The command line options load and unload respectively load and unload a module:

-bash-4.1$ module list
Currently Loaded Modulefiles:
1) intel/12.0.084(default) 2) bullmpi/0.18.1(default)
-bash-4.1$ module unload bullmpi/0.18.1
-bash-4.1$ module list
Currently Loaded Modulefiles:
1) intel/12.0.084(default)
-bash-4.1$ module load bullmpi/0.17.2
-bash-4.1$ module list
Currently Loaded Modulefiles:
1) intel/12.0.084(default) 2) bullmpi/0.17.2

The command line option switch does the previous operation in one command line:

-bash-4.1$ module switch bullmpi bullmpi/0.17.2
-bash-4.1$ module list
Currently Loaded Modulefiles:
1) intel/12.0.084(default) 2) bullmpi/0.17.2

The command line option show indicates how the environment is changed by loading a module. The option help gives information about the specified module.

-bash-4.1$ module help gcc/4.5.1
----------- Module Specific Help for 'gcc/4.5.1' ------------------
Name : gcc
Description : GNU C, C++ and Fortran compilers
Version : 4.5.1
Web Site : http://gcc.gnu.org/
-bash-4.1$ module show gcc/4.5.1
-------------------------------------------------------------------
/usr/local/ccc_users_env/modules/compilers/gcc/4.5.1:
module-whatis GNU Compiler Collection
conflict gcc
prepend-path PATH /usr/local/gcc-4.5.1/bin
prepend-path LIBRARY_PATH /usr/local/gcc-4.5.1/lib
prepend-path LD_LIBRARY_PATH /usr/local/gcc-4.5.1/lib:/usr/local/gcc-
4.5.1/lib64
prepend-path MANPATH /usr/local/gcc-4.5.1/man
prepend-path INFOPATH /usr/local/gcc-4.5.1/info
prepend-path CPATH /usr/local/gcc-4.5.1/include
prepend-path FPATH /usr/local/gcc-4.5.1/include
-------------------------------------------------------------------

Advice: in most modules, we set environment variables like $MKL_LIBS or $FFTW3_INC_DIR which point to a library or path. We strongly recommend that you use them in your Makefile. For example, when you switch to a newer module, these variables will still be defined (but will point to the new library or path).
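
For instance, a hedged sketch of compile and link lines that rely on these module variables rather than on hard-coded paths ($FFTW3_INC_DIR and $MKL_LIBS are the variables quoted above; the source file name is illustrative):

-bash-4.1$ module load fftw3
-bash-4.1$ icc -I${FFTW3_INC_DIR} -c mycode.c
-bash-4.1$ icc -o mycode.exe mycode.o ${MKL_LIBS}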

4.5. Job submission and monitoring

4.5.1. Job submission

Job submission, resource allocation and job launching over the cluster are managed by SLURM. Special commands prefixed with ccc_ are provided to perform these operations. To submit a batch job, you first have to write a shell script which contains:

  • A set of directives. These directives are lines beginning with #MSUB which describe the resources needed by your job.
  • How to execute your code.

Then your job can be launched by submitting this script to SLURM. The job will enter a batch queue. When resources are available, the job will be launched over the allocated nodes. Jobs can be monitored.

The following paragraphs describe the ccc_* commands and give some example scripts for different types of jobs.

4.5.1.1. The ccc_mprun command

ccc_mprun launches parallel jobs over the nodes allocated by the resource manager:

ccc_mprun ./a.out

By default, ccc_mprun takes information (number of nodes, number of processors, etc.) from the resource manager to launch the job. However, you can specify or change its behaviour with the following command line options:

  • -n nproc : number of tasks to run.
  • -c ncore : number of cores per task.
  • -N nnode : number of nodes to use.
  • -M mem : required amount of memory per core in MB.
  • -T time : maximum walltime of the allocations in seconds.
  • -x : requests exclusive usage of allocated nodes.
  • -E extra : extra parameters to pass directly to the underlying resource mgr.
  • -K : only allocates resources. If a program is specified, it is executed only once; it would typically contain ccc_mprun calls to launch parallel commands on the allocated resources.
  • -e 'options' : additional parameters to pass to the mpirun command.
  • -d ddt : launches the application in debug mode using DDT.

Type ccc_mprun -h for an updated and complete documentation.
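
For example, a hedged sketch combining several of these options (the task and core counts are illustrative):

ccc_mprun -n 64 -c 2 -M 3500 -x ./a.out   # 64 tasks, 2 cores per task, 3500 MB per core, exclusive nodes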

4.5.1.2. Script examples
  • Sequential job
    #!/bin/bash
    #MSUB -r MyJob               # Request name
    #MSUB -n 1                   # Number of tasks to use
    #MSUB -T 600                 # Elapsed time limit in seconds of the job (default: 1800)
    #MSUB -q test                # Batch queue request
    #MSUB -o example_%I.o        # Standard output. %I is the job id
    #MSUB -e example_%I.e        # Error output. %I is the job id
    #MSUB -A paxxxx              # Project ID
    ##MSUB -@ noreply@cea.fr:end # Uncomment this line for being notified at the end of the job by sending a mail at the given address
    
    set -x
    cd ${BRIDGE_MSUB_PWD}        # BRIDGE_MSUB_PWD is an environment variable which contains the directory where the script was submitted
    ./a.out
  • Parallel MPI job
    #!/bin/bash
    #MSUB -r MyJob_Para          # Request name
    #MSUB -n 32                  # Number of tasks to use
    #MSUB -T 1800                # Elapsed time limit in seconds
    #MSUB -o example_%I.o        # Standard output. %I is the job id
    #MSUB -e example_%I.e        # Error output. %I is the job id
    #MSUB -A paxxxx              # Project ID
    
    set -x
    cd ${BRIDGE_MSUB_PWD}
    ccc_mprun ./a.out
    #or
    # ccc_mprun -n 32 ./a.out
    #or
    # ccc_mprun -n ${BRIDGE_MSUB_NPROC} ./a.out
    # BRIDGE_MSUB_NPROC represents the number of tasks
  • Parallel OpenMP/Multithreaded job
    #!/bin/bash
    #MSUB -r MyJob_Para          # Request name
    #MSUB -n 1                   # Number of tasks to use
    #MSUB -c 16                  # Number of threads per task to use
    #MSUB -T 1800                # Elapsed time limit in seconds
    #MSUB -o example_%I.o        # Standard output. %I is the job id
    #MSUB -e example_%I.e        # Error output. %I is the job id
    #MSUB -A paxxxx              # Project ID
    
    set -x
    cd ${BRIDGE_MSUB_PWD}        # BRIDGE_MSUB_PWD is an environment variable which contains the directory where the script was submitted
    export OMP_NUM_THREADS=16
    ./a.out
    #!/bin/bash
    #MSUB -r MyJob_Para          # Request name
    #MSUB -n 1                   # Number of tasks to use
    #MSUB -c 16                  # Number of threads per task to use
    #MSUB -T 1800                # Elapsed time limit in seconds
    #MSUB -o example_%I.o        # Standard output. %I is the job id
    #MSUB -e example_%I.e        # Error output. %I is the job id
    #MSUB -A paxxxx              # Project ID
    
    set -x
    cd ${BRIDGE_MSUB_PWD}        # BRIDGE_MSUB_PWD is an environment variable which contains the directory where the script was submitted
    export OMP_NUM_THREADS=${BRIDGE_MSUB_NCORE} # BRIDGE_MSUB_NCORE represents the number of cores dedicated per task
    ccc_mprun ./a.out

    Warning : an OpenMP/Multithreaded program can only run inside a node. If you ask for more threads than available cores in a node, your submission will be rejected.

    As of now only one thread is available per core as hyperthreading is disabled.

  • Parallel hybrid OpenMP/MPI or Multithreaded/MPI
    #!/bin/bash
    #MSUB -r MyJob_ParaHyb       # Request name
    #MSUB -n 8                   # Total number of tasks to use
    #MSUB -c 4                   # Number of threads per task to use
    #MSUB -T 1800                # Elapsed time limit in seconds
    #MSUB -o example_%I.o        # Standard output. %I is the job id
    #MSUB -e example_%I.e        # Error output. %I is the job id
    #MSUB -A paxxxx              # Project ID
    
    set -x
    cd ${BRIDGE_MSUB_PWD}
    export OMP_NUM_THREADS=4
    ccc_mprun ./a.out
    # This script will launch 8 MPI tasks. Each task will have 4 OpenMP threads

    The previous script can be rewritten:

    #!/bin/bash
    #MSUB -r MyJob_ParaHyb       # Request name
    #MSUB -n 32                  # Total number of core to use
    #MSUB -T 1800                # Elapsed time limit in seconds
    #MSUB -o example_%I.o        # Standard output. %I is the job id
    #MSUB -e example_%I.e        # Error output. %I is the job id
    #MSUB -A paxxxx              # Project ID
    
    set -x
    cd ${BRIDGE_MSUB_PWD}
    export OMP_NUM_THREADS=4
    ccc_mprun -n 8 -c 4 ./a.out
    # This script will launch 8 MPI tasks. Each task will have 4 cores for launching OpenMP threads.

    You can ask the number of nodes you need:

    #!/bin/bash
    #MSUB -r MyJob_ParaHyb       # Request name
    #MSUB -n 4                   # Total number of tasks to use
    #MSUB -c 16                  # Number of threads per task to use
    #MSUB -N 4                   # Number of nodes
    #MSUB -T 1800                # Elapsed time limit in seconds
    #MSUB -o example_%I.o        # Standard output. %I is the job id
    #MSUB -e example_%I.e        # Error output. %I is the job id
    #MSUB -A paxxxx              # Project ID
    
    set -x
    cd ${BRIDGE_MSUB_PWD}
    export OMP_NUM_THREADS=16
    ccc_mprun ./a.out
    # This script will launch 4 MPI tasks over 4 nodes (i.e. one MPI task per node). Each task will have 16 OpenMP threads.
  • GPU job. A simple single-GPU job:
    #!/bin/bash
    #MSUB -r GPU_Job             # Request name
    #MSUB -n 1                   # Total number of tasks to use
    #MSUB -T 1800                # Elapsed time limit in seconds
    #MSUB -o example_%I.o        # Standard output. %I is the job id
    #MSUB -e example_%I.e        # Error output. %I is the job id
    #MSUB -q hybrid              # Hybrid partition of GPU nodes
    #MSUB -A paxxxx              # Project ID
    
    set -x
    cd ${BRIDGE_MSUB_PWD}
    module load cuda
    ccc_mprun ./a.out

    You should use ccc_mprun to run GPU code because ccc_mprun manages the binding of processes (see section Process binding).

    Hybrid MPI/GPU job:

    #!/bin/bash
    #MSUB -r MPI_GPU_Job         # Request name
    #MSUB -n 8                   # Total number of tasks to use
    #MSUB -N 4                   # Number of nodes
    #MSUB -T 1800                # Elapsed time limit in seconds
    #MSUB -o example_%I.o        # Standard output. %I is the job id
    #MSUB -e example_%I.e        # Error output. %I is the job id
    #MSUB -q hybrid              # Hybrid partition of GPU nodes
    #MSUB -A paxxxx              # Project ID
    
    set -x
    cd ${BRIDGE_MSUB_PWD}
    module load cuda
    ccc_mprun ./a.out

    Curie Hybrid nodes have 2 GPUs per node. This script launches 8 MPI processes over 4 nodes.

  • MPMD job. An MPMD job (Multiple Program, Multiple Data) is a parallel job which launches different executables over the processes.
    #!/bin/bash
    #MSUB -r MyJob_Para          # Request name
    #MSUB -n 32                  # Total number of tasks to use
    #MSUB -T 1800                # Elapsed time limit in seconds
    #MSUB -o example_%I.o        # Standard output. %I is the job id
    #MSUB -e example_%I.e        # Error output. %I is the job id
    #MSUB -A paxxxx              # Project ID
    
    set -x
    cd ${BRIDGE_MSUB_PWD}
    cat << END > app.conf
    1       ./bin1
    5       ./bin2
    26      ./bin3
    END
    ccc_mprun -f app.conf
4.5.1.3. The ccc_msub command

The previous scripts have to be submitted to the resource manager with the ccc_msub command:

bash-4.1$ cat script.sh
#!/bin/bash
#MSUB -r MyJob_Para   # Request name
#MSUB -n 32           # Number of tasks to use
#MSUB -T 1800         # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -A paxxxx       # Project ID

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out
bash-4.1$ ccc_msub script.sh
Submitted Batch Session 1556

Remark: #MSUB directive lines are not mandatory. If a directive is not specified, a default value will be used.

Directive lines can be specified through command line options to ccc_msub. In this case, command line parameters take precedence over script directives.

bash-4.1$ cat script.sh
#!/bin/bash
#MSUB -r MyJob_Para   # Request name
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out
bash-4.1$ ccc_msub -n 32 -T 1800 script.sh
Submitted Batch Session 1557

If any of the command line options -n, -N, -c or -x is given, it cancels the effect of all the #MSUB directives -n, -N, -c and -x.

We recommend using the #MSUB directives rather than the command line options. Here are some other command line options for ccc_msub:

  • -o output_file : standard output file (special character %I will be replaced by the job ID).
  • -e error_file : standard error file (special character %I will be replaced by the job ID).
  • -r reqname : job name
  • -n nprocs : number of tasks that will be used in parallel mode (default=1)
  • -c ncores : number of cores per parallel task to allocate (default=1)
  • -N nnodes : number of nodes to allocate for parallel usage
  • -T time_limit : maximum walltime of the batch job in seconds (default=18000)
  • -M mem_limit : maximum memory amount required per allocated core in MB
  • -x : request for exclusive usage of allocated nodes
  • -X : enables X11 forwarding (useful for DDT)
  • -A project : specify the project id
  • -E “extra_parameters…” : extra parameters to pass directly to the underlying batch system
  • -q partition : requested type of node
  • -Q qos : requested QoS
  • -S starttime : requested start time using format like “HH:MM” or “MM/DD HH:MM”
  • -@ mailopts : mail options following the pattern mailaddress:begin|end|begin,end
    • The following line will send a mail to jdoe at the beginning and the end of the job
      ccc_msub -@ jdoe@foo.com:begin,end
    • The default behaviour depends on the underlying batch system

Type ccc_msub -h for an updated and complete documentation.

Don’t forget to specify your correct project ID with the -A option. Otherwise, you may use hours from another project.

4.5.1.4. Choosing between Curie’s three architectures

When submitting your job on Curie, you can choose which of the three available architectures your job is going to run on:

  • Curie’s Thin nodes, using the -q standard option.
  • Curie’s Hybrid nodes, using the -q hybrid option.
  • Curie’s Extra Large nodes, using the -q xlarge option.

This choice is exclusive: your job can only be submitted to one of these architectures at a time.
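
For example, to target the Extra Large nodes (a hedged sketch; the script name is illustrative):

-bash-4.1$ ccc_msub -q xlarge script.sh

or, equivalently, add the directive #MSUB -q xlarge to the submission script.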

ccc_mpinfo displays the available partitions/queues that can be used for a job:

                      --------------CPUS------------  -------------NODES------------
PARTITION    STATUS   TOTAL   DOWN    USED    FREE    TOTAL   DOWN    USED    FREE     MpC  CpN SpN CpS TpC
---------    ------   ------  ------  ------  ------  ------  ------  ------  ------   ---- --- --- --- ---
standard     up        80240       0   68856   11384    5015       0    4313     702   4000  16   2   8   1
xlarge       up         9728       0    9473     255      76       0      75       1   4000  128  16   8   1
hybrid       up         1144       0      32    1112     143       0      84      59   2900   8   2   4   1
4.5.1.5. Using job QOS

Depending on the quantity of resources and the job duration needed, or on the job's purpose (debugging, normal production, etc.), you have to select the appropriate job QOS (Quality of Service). A job QOS lets you extend the normal job submission limits in exchange for a lower job priority, or get a higher job priority in exchange for fewer resources or a shorter duration.

ccc_mqinfo displays the job QOS that can be used and the associated job limits:

$ ccc_mqinfo
Name    Priority  MaxCPUs  MaxNodes  MaxRun  MaxSub     MaxTime
------  --------  -------  --------  ------  ------  ----------
long          18     1024                 2       8  3-00:00:00
normal        20                                300  1-00:00:00
test          40                  8               2    00:30:00

For instance, to develop or debug your code, you may submit a job using the test QOS which will allow it to be scheduled faster. This QOS is limited to 2 jobs of maximum 30 minutes and 8 nodes each. CPU time accounting does not depend on the chosen QOS.

To specify a QOS, you can use the -Q option on the ccc_msub command line or add the #MSUB -Q qosname directive to your submission script, as in the example below. If no QOS is specified, the default QOS normal will be used.

#!/bin/bash
#MSUB -r MyJob        # Request name
#MSUB -n 64           # Number of tasks to use (256 max for test QoS)
#MSUB -T 1800         # Elapsed time limit in seconds of the job (1800 max with test QoS)
#MSUB -Q test         # QoS test
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -q standard     # Choosing standard nodes
#MSUB -A raxxxx       # Project ID

ccc_mprun ./a.out
4.5.1.6. Multi Step job

To launch a multi step job like this:

JOB A ==> JOB B ==> JOB C

where JOB B can be launched only once JOB A has finished, and JOB C can be launched only once JOB B has finished.

Here are the corresponding scripts:

  • JOB_A.sh :
    #!/bin/bash
    #MSUB -r JOB_A
    #MSUB -n 32
    ccc_mprun ./a.out
    ccc_msub JOB_B.sh
  • JOB_B.sh :
    #!/bin/bash
    #MSUB -r JOB_B
    #MSUB -n 16
    ccc_mprun ./b.out
    ccc_msub JOB_C.sh
  • JOB_C.sh :
    #!/bin/bash
    #MSUB -r JOB_C
    #MSUB -n 8
    ccc_mprun ./c.out

Then, only JOB_A.sh has to be submitted. When it finishes, the script launches JOB_B.sh, etc…

Be careful: if a job is killed or reaches its time allocation limit, the whole job is removed and the final ccc_msub may not be executed. To avoid this, you can use ccc_tremain from libccc_user (described below) or use the #MSUB -w directive like this:

#!/bin/bash
#MSUB -r JOB_A
#MSUB -n 32
#MSUB -w
ccc_msub JOB_A.sh
ccc_mprun ./a.out

The #MSUB -w directive creates a dependency between jobs with the same name. If you submit two jobs with the same name, the second will run only when the first has finished. For multi-step jobs, this means you can submit the next script before the ccc_mprun command: the next job will only be launched once the current job is done.

4.5.2. Job monitoring

4.5.2.1. The ccc_mpp command

ccc_mpp provides information about jobs on the cluster.

bash-3.0$ ccc_mpp
USER     GROUP     BATCHID  NCPU QUEUE   STATE   RLIM    RUN   SUSP   OLD   NAME     NODES
login    s8        3117     36   test    RUN     30.0m   3.4m      -  3.4m  job_A    curie[22-23]
login    s8        3119     24   test    PEN     30.0m      -      - 31.0s  job_B

Here are command line options for ccc_mpp:

  • -r : prints ’running’ batch jobs
  • -s : prints ’suspended’ batch jobs
  • -p : prints ’pending’ batch jobs
  • -q queue : requested batch queue
  • -u user : requested user
  • -g group : requested group
  • -n : prints results without colors
4.5.2.2. The ccc_mpeek command

ccc_mpeek gives information about a job during its run.

Here are command line options for ccc_mpeek:

  • -o : prints the standard output
  • -e : prints the standard error output
  • -s : prints the job submission script
  • -t : same as -o in “tail -f” mode
4.5.2.3. The ccc_mdel command

ccc_mdel kills jobs:

bash-4.1$ ccc_mpp
USER     GROUP     BATCHID  NCPU QUEUE   STATE   RLIM    RUN   SUSP   OLD   NAME     NODES
login    s8        3117     36   test    RUN     30.0m   3.4m      -  3.4m  job_A    curie[22-23]
bash-4.1$ ccc_mdel 3117
4.5.2.4. The libccc_user library

We provide a library which allows you to get information about jobs. A useful feature is the subroutine ccc_tremain, which gives the remaining execution time in seconds before the job ends. This is useful, for example, if your code risks running beyond the allocated duration: you can then save restart files for a subsequent job.

  • C/C++
    #include "ccc_user.h"
    ...
    double time_remain;
    int error;
    ...
    error = ccc_tremain(&time_remain);
    if(!error)
      printf("Time remaining before job ends: %lf seconds\n", time_remain);
    ...
  • Fortran
    ...
    double precision :: time_remain
    ...
    call ccc_tremain(time_remain)
    print*, 'Time remaining before job ends: ', time_remain, ' seconds'
    ...
  • Example to compile a program using libccc_user
    ...
    $ module load libccc_user
    $ icc -o prog.exe prog.c ${CCC_LIBCCC_USER_LDFLAGS}
    ...
4.5.2.5. Diagnosing a stuck job

Sometimes a job seems to be stuck.

To understand what is really happening, a useful script named “clustack” has been developed. It will give you an aggregated view of the stacks of the processes associated with a job.

Usage:

clustack slurmjob:jobid

The jobid is displayed when submitting a job, and can also be obtained afterwards through the use of “ccc_mpp -u $USER”.
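
For instance, using the id of a running job (the jobid below is illustrative):

clustack slurmjob:1556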

5. Programming Environment / Basic Porting

5.1. Text editors

  • vi, for knowledgeable Unix people
  • emacs, for Unix gurus
  • nano, a simple text editor
  • nedit, a full-fledged graphical (X11) editor
  • gedit, the Gnome graphical editor

5.2. Compilers

5.2.1. Available compilers

The available compilers on the cluster are:

  • Intel Compiler suite (icc, icpc, ifort)
  • GNU Compiler suite (gcc, g++, gfortran)

To know which version is installed, use the command

bash-4.1$ module avail

We strongly recommend using the Intel Compiler Suite, which provides the best performance.

5.2.2. Compiler flags

5.2.2.1. Intel Compiler suite
  • C/C++: The Intel C/C++ compilers are icc and icpc. Compilation options are the same, except for the C language behaviour: icpc treats all source files as C++ files, whereas icc distinguishes between C and C++ files.
    • Basic flags:
      • -o exe_file : names the executable exe_file.
      • -c : generates the corresponding object file. Does not create an executable.
      • -g : compiles in a debugging mode – see ’Debugging’.
      • -I dir_name : specifies the path where the include files are located.
      • -L dir_name : specifies the path where the libraries are located.
      • -l bib : asks to link the libbib.a (or libbib.so) library.
    • Optimisations:
      • -O0, -O1, -O2, -O3 : optimisation levels – default : -O2.
      • -opt_report : generates a report which describes the optimisation in stderr (-O3 required).
      • -ip, -ipo : inter-procedural optimisations (mono and multi files).
      • -fast : default high optimisation level (-O3 -ipo -static).

        Be careful: this option is not allowed with MPI, because the MPI context needs to call some libraries which only exist in dynamic mode; this is incompatible with the -static option. You need to replace -fast by -O3 -ipo.

      • -ftz : considers all denormalised numbers as zeros at runtime.
      • -fp-relaxed : mathematical optimisation functions. Leads to a small loss of accuracy.
    • Preprocessor:
      • -E : preprocesses the files and sends the result to the standard output.
      • -P : preprocesses the files and sends the result to file.i.
      • -Dname=<value> : defines the macro “name”.
      • -M : creates a list of dependencies.
    • Practical:
      • -pg : profiling with gprof (needed at compilation time).
      • -mp, -mp1 : IEEE arithmetic; -mp1 is a compromise between time and accuracy.
  • Fortran: The Intel Fortran compiler is ifort.
    • Basic flags:
      • -o exe_file : names the executable exe_file.
      • -c : generates the corresponding object file. Does not create an executable.
      • -g : compiles in a debugging mode – see ’Debugging’.
      • -I dir_name : specifies the path where the include files are located.
      • -L dir_name : specifies the path where the libraries are located.
      • -l bib : asks to link the libbib.a (or libbib.so) library.
    • Optimisations:
      • -O0, -O1, -O2, -O3 : optimisation levels – default : -O2.
      • -opt_report : generates a report which describes the optimisation in stderr (-O3 required).
      • -ip, -ipo : inter-procedural optimisations (mono and multi files).
      • -fast : default high optimisation level (-O3 -ipo -static).

        Be careful: this option is not allowed with MPI, because the MPI context needs to call some libraries which only exist in dynamic mode; this is incompatible with the -static option. You need to replace -fast by -O3 -ipo.

      • -ftz : considers all denormalised numbers as zeros at runtime.
      • -fp-relaxed : mathematical optimisation functions. Leads to a small loss of accuracy.
      • -align all : fills the memory up to get a natural alignment of the data.
      • -pad : makes the modification of the memory positions operational.
    • Run-time checks:
      • -C or -check : generates a code which adds run-time array bound checks in order to diagnose run-time errors such as segmentation faults.
    • Preprocessor:
      • -E : preprocesses the files and sends the result to the standard output.
      • -P : preprocesses the files and sends the result to file.i.
      • -Dname=<value> : defines the macro “name”.
      • -M : creates a list of dependencies.
      • -fpp : preprocesses the files and compiles.
    • Practical:
      • -pg : profiling with gprof (needed at compilation time).
      • -mp, -mp1 : IEEE arithmetic; -mp1 is a compromise between time and accuracy.
      • -i8 : promotes integers to 8 bytes (64 bits) by default.
      • -r8 : promotes reals to 8 bytes (64 bits) by default.
      • -module <dir> : writes/reads the *.mod files in the dir directory.
      • -fp-model strict : tells the compiler to strictly adhere to value-safe optimisations when implementing floating-point calculations and enables floating-point exception semantics. It might slow down your program.

For further information, please refer to the compilers’ man pages.

Some options allow the compiler to use specific instructions of Intel processors in order to optimise the code. These options are compatible with most Intel processors; the compiler will generate these instructions if the processor supports them.

  • -xSSE4.2 : May generate Intel® SSE4 Efficient Accelerated String and Text Processing instructions. May generate Intel® SSE4 Vectorising Compiler and Media Accelerator, Intel® SSSE3, SSE3, SSE2, and SSE instructions.
  • -xSSE4.1 : May generate Intel® SSE4 Vectorising Compiler and Media Accelerator instructions for Intel processors. May generate Intel® SSSE3, SSE3, SSE2, and SSE instructions.
  • -xSSSE3 : May generate Intel® SSSE3, SSE3, SSE2, and SSE instructions for Intel processors.
  • -xSSE3 : May generate Intel® SSE3, SSE2, and SSE instructions for Intel processors.
  • -xSSE2 : May generate Intel® SSE2 and SSE instructions for Intel processors.
  • -xHost : this option will apply one of the previous options depending on the processor where the compilation is performed. This option is recommended for optimising your code.

None of these options are used by default. The SSE instructions use the vectorisation capability of Intel processors.

5.2.2.2. Intel Sandy Bridge processors

Curie Thin nodes use the latest Intel processors based on the Sandy Bridge architecture. This architecture provides new vectorisation instructions called AVX, for Advanced Vector eXtensions. The -xAVX option allows the compiler to generate code specific to the Curie Thin nodes.

Be careful, code generated with the -xAVX option only runs on Intel Sandy Bridge processors. Otherwise, you will get the error message:

Fatal Error: This program was not built to run in your system.
Please verify that both the operating system and the processor support Intel(R) AVX.

Curie login nodes are Curie Extra Large nodes with Nehalem-EX processors. AVX code can be generated on these nodes through cross-compilation by adding the -xAVX option. On a Curie Extra Large node, the -xHost option will not generate AVX code. If you need to compile with -xHost or if the installation requires some tests (like autotools/configure), you can submit a job which will compile on the Curie Thin nodes.
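
A hedged sketch of such a compilation job, following the submission script conventions shown earlier (the build commands and project ID are illustrative):

#!/bin/bash
#MSUB -r Build_AVX    # Request name
#MSUB -n 1            # One task is enough for the compilation
#MSUB -T 1800         # Elapsed time limit in seconds
#MSUB -q standard     # Thin nodes (Sandy Bridge): -xHost resolves to AVX here
#MSUB -o build_%I.o   # Standard output. %I is the job id
#MSUB -e build_%I.e   # Error output. %I is the job id
#MSUB -A paxxxx       # Project ID

set -x
cd ${BRIDGE_MSUB_PWD}
./configure CC=icc CFLAGS="-O3 -xHost"   # illustrative build commands for your application
make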

5.2.2.3. GNU Compiler suite
  • Debugging:
    • -Wall : Short for “warn about all,” this flag tells gfortran to generate warnings about many common sources of bugs, such as having a subroutine or function with the same name as a built-in one, or passing the same variable as an intent(in) and an intent(out) argument of the same subroutine.
    • -Wextra : In conjunction with -Wall, gives warnings about even more potential problems. In particular, -Wextra warns about subroutine arguments that are never used, which is almost always a bug.
    • -w : Inhibits all warning messages (Not advised)
    • -Werror : Makes all warnings into errors.

5.3. Available libraries

5.3.1. MKL Library

The Intel MKL library is integrated in the Intel package and contains:

  • BLAS, SparseBLAS.
  • LAPACK, ScaLAPACK.
  • Sparse Solver, CBLAS.
  • Discrete Fourier and Fast Fourier transform (contains the FFTW interface, see FFTW).

If you don’t need ScaLAPACK:

ifort -o myexe myobject.o ${MKL_LIBS}

If you need ScaLAPACK:

mpif90 -o myexe myobject.o ${MKL_SCA_LIBS}

We provide multithreaded versions for compiling with MKL:

ifort -o myexe myobject.o ${MKL_LIBS_MT}
mpif90 -o myexe myobject.o ${MKL_SCA_LIBS_MT}

To use multithreaded MKL, you have to set the OpenMP environment variable OMP_NUM_THREADS.
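
For example, a hedged run sketch for a binary linked with the multithreaded MKL (the thread count is illustrative):

export OMP_NUM_THREADS=8
./myexe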

We strongly recommend using these MKL_* variables.

5.3.2. Other libraries

Please see the other software section.

5.4. MPI

5.4.1. MPI implementations

The default MPI implementation is BullxMPI, a library provided by Bull. It is based on Open MPI.

curie50$ module list
Currently Loaded Modulefiles:
1) c/intel/12.0.3.174(default) 2) bullxmpi/1.1.8.1(default)
curie50$ ompi_info -a

The default version of BullxMPI is given by the command module list.

5.4.2. MPI Compiler

MPI programs are compiled and linked using the mpicc, mpic++, mpif77 and mpif90 wrappers.

curie50$ mpicc -c test.c
curie50$ mpicc -o test.exe test.o

By default, the wrappers use Intel compilers. To use GNU compilers, you need to set the following environment variables:

  • OMPI_CC for C
  • OMPI_CXX for C++
  • OMPI_F77 for Fortran77
  • OMPI_FC for Fortran90

For example:

curie50$ module load gcc
curie50$ module list
Currently Loaded Modulefiles:
1) oscar-modules/1.0.3   2) c/intel/12.0.3.174   3) fortran/intel/12.0.3.174
curie50$ mpicc -show
icc -I/opt/mpi/bullxmpi/1.1.8.1/include -pthread -L/opt/mpi/bullxmpi/1.1.8.1/lib -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl
curie50$ export OMPI_CC=gcc
curie50$ mpicc -show
gcc -I/opt/mpi/bullxmpi/1.1.8.1/include -pthread -L/opt/mpi/bullxmpi/1.1.8.1/lib -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

The -show option displays the full underlying compiler command, including all the libraries and headers needed to use MPI.

5.5. OpenMP

The Intel and GNU compilers both support OpenMP. More precisely, GNU gcc 4.4 supports OpenMP 3.0 and GNU gcc 4.7 supports OpenMP 3.1. Since September 2011, Intel C++ and Fortran compilers support OpenMP 3.1.

5.5.1. Compiler flags

Intel compiler flag: -openmp.

-bash-4.1$ ifort -openmp -o prog.exe prog.f90

GNU compiler flag: -fopenmp.

-bash-4.1$ gcc -fopenmp -o prog.exe prog.c

5.5.2. Execution modes

OpenMP offers a set of different scheduling policies that control how iterations of “for” and “do” parallel loops are scheduled to threads. This enables applications to choose the policy that best matches their needs. These policies can be specified by the programmer either with the schedule clause in the parallel loop constructs, or via the OMP_SCHEDULE environment variable.

The schedule clause has the syntax schedule(type[, chunk_size]). The type can be one of the following values:

  • static: The iteration space is divided into chunks of size chunk_size, which are assigned cyclically to threads. This policy is called static because the assignment is fully determined in advance (it depends only on the iteration count, the chunk size and the number of threads), so there is no runtime overhead associated with work distribution. If chunk_size is not specified, the whole iteration space is evenly divided among threads. Static scheduling without a specified chunk size is the default scheduling policy if a parallel loop construct has no schedule clause.
  • dynamic: This policy dynamically assigns chunks of contiguous iterations to threads. Whenever a thread “consumes” its chunk, it requests the next available one from the runtime system, until no chunks remain to be distributed. In this way threads are always kept busy, system resources are utilised efficiently and the load imbalance is reduced. The chunk size is again specified with the chunk_size parameter. Small chunk sizes achieve better load balancing but entail larger overhead, and vice-versa. The user should experiment with different sizes to find the one that makes the best compromise between low overhead and good scalability.
  • guided: This policy initially assigns large iteration chunks to threads, reducing them exponentially with each successive assignment. The assignment is dynamic. Due to the “geometry” of chunk allocation, guided scheduling usually has less overhead than dynamic. The minimum chunk size, beyond which no further reduction is allowed, can optionally be specified through chunk_size.
  • auto: The scheduling decision is left to the compiler and/or the runtime system.
  • runtime: The schedule type and chunk size are determined at runtime, through the OMP_SCHEDULE environment variable.
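
If the parallel loops use schedule(runtime), the policy can then be selected at launch time. A hedged sketch (the schedule, chunk size and thread count are illustrative values):

export OMP_NUM_THREADS=16
export OMP_SCHEDULE="dynamic,64"
ccc_mprun ./a.out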

5.6. CUDA

The CUDA compiler is available on Curie Hybrid nodes to compile GPU-accelerated programs.

curie50$ module load cuda
curie50$ module list
Currently Loaded Modulefiles:
1) c/intel/12.0.3.191 2) fortran/intel/12.0.3.191 3) mkl/12.0.4.191

To compile a simple CUDA code:

curie50$ nvcc -arch=sm_20 -o prog.exe prog.cu

To compile a hybrid CUDA code:

curie50$ ls
  cuda.cu  prog.c
curie50$ module load cuda
curie50$ icc -c prog.c
curie50$ nvcc -arch=sm_20 -ccbin=icc -c cuda.cu
curie50$ icc -o prog_cuda.exe prog.o cuda.o -L${CUDA_ROOT}/lib64 -lcudart

The CUDA module sets environment variables (like CUDA_ROOT) which give access to the CUDA SDK. For example:

curie50$ module show cuda
-------------------------------------------------------------------
/usr/local/ccc_users_env/modules/compilers/cuda/4.0:

module-whatis    NVIDIA Comput Unified Device Architecture
conflict         cuda
prepend-path     PATH /usr/local/cuda-4.0/bin
prepend-path     PATH /usr/local/cuda-4.0/computeprof/bin
prepend-path     LD_LIBRARY_PATH /usr/local/cuda-4.0/lib64
prepend-path     LD_LIBRARY_PATH /usr/local/cuda-4.0/computeprof/bin
setenv           CUDA_LIB_DIR /usr/local/cuda-4.0/lib64
setenv           CUDA_ROOT /usr/local/cuda-4.0
setenv           CUDA_SDK_ROOT /usr/local/cuda-4.0/sdk/C
setenv           NV_OPENCL_SDK_ROOT /usr/local/cuda-4.0/sdk/OpenCL
setenv           NV_OPENCL_INC_DIR /usr/local/cuda-4.0/sdk/OpenCL/common/inc
-------------------------------------------------------------------

5.7. OpenCL

NVIDIA provides tools to compile OpenCL programs. They are loaded with the CUDA module.

curie50$ module load cuda
curie50$ gcc -I${NV_OPENCL_INC_DIR} -o prog_ocl.exe prog.c -lOpenCL

6. Debugging

6.1. Available Debuggers

6.1.1. GDB

The “gdb” command launches the GNU debugger. In order to be able to use symbol tables, your code should have been compiled with the corresponding “-g” debugging flag. Then you can launch the debugging session with, e.g.:

gdb ./a.out

For more details, please refer to the “gdb” man page.

6.1.2. IDB

The “idb” command launches the Intel debugger. In order to be able to use symbol tables, your code should have been compiled with the corresponding “-g” debugging flag. Then you can launch the debugging session with, e.g.:

idb ./a.out

For more details, please refer to the “idb” man page.

6.1.3. DDT

To use DDT, you need to load a module. For instance:

bash-4.1 $ module load ddt

Then use the command ddt. In case of parallel code, you need to replace the line

mpirun -n 16 ./a.out

by

ddt -start -n 16 ./a.out

in your submission script.

Example of submission script:

#!/bin/bash
#MSUB -r MyJob_Para   # Request name
#MSUB -n 32           # Number of tasks to use
#MSUB -T 1800         # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
ddt -start -n 32 ./a.out

The script is then submitted with:

bash-4.1$ ccc_msub -X ddt.job

Note: you must submit with -X for ccc_msub if you want X11 forwarding.

6.2. Compiler flags

Before debugging, you need to compile your code with these flags:

  • Generic flags:
    • -g : Generates extra debugging information usable by GDB. -g3 includes even more debugging information. This option is available for the GNU and Intel C/C++ and Fortran compilers.
    • -O0 : Suppresses all optimisations.
  • GNU Fortran compiler: gfortran
    • -fbacktrace : Specifies that if the program crashes, a backtrace should be produced if possible, showing what functions or subroutines were being called at the time of the error.
    • -fbounds-check : Adds a check that the array index is within the bounds of the array every time an array element is accessed. This substantially slows down the program, but it is a very useful way to find bugs related to arrays; without this flag, an illegal array access will either produce a subtle error that might not become apparent until much later in the program, or cause an immediate segmentation fault with very little information about the cause of the error.
  • Intel Fortran compiler: ifort
    • -traceback : generate extra information to provide source file traceback at run time.
    • -check bounds : enables checking for array subscript expressions.
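For instance, a Fortran code could be compiled for a debugging session as follows (illustrative command lines, with prog.f90 as a placeholder source file):

bash-4.00 $ gfortran -g -O0 -fbacktrace -fbounds-check -o prog.exe prog.f90
bash-4.00 $ ifort -g -O0 -traceback -check bounds -o prog.exe prog.f90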

7. Performance Analysis

7.1. Available performance analysis tools

7.1.1. PAPI

PAPI is an API which allows you to retrieve hardware counters from the CPU. To compile a program with PAPI, you need to load the PAPI module:

module load papi/4.1.3

You need to compile your code with links to PAPI like:

ifort -I${PAPI_INC_DIR} papi.f90 ${PAPI_LIBS}

Two examples of PAPI usage are given in Section 7.2.1, “PAPI”.

7.1.2. VampirTrace/Vampir

VampirTrace is a library which enables you to profile your parallel code by taking traces during the execution of the program. Here we present an introduction to Vampir/VampirTrace.

In order to use VampirTrace, you need to load the VampirTrace module and compile your code with a VampirTrace compiler:

bash-4.00 $ module load vampirtrace
bash-4.00 $ vtcc -c prog.c
bash-4.00 $ vtcc -o prog.exe prog.o

Available VampirTrace compilers are:

  • C compiler: vtcc
  • C++ compilers: vtc++, vtCC and vtcxx
  • Fortran compilers: vtf77 and vtf90

To compile an MPI code with VampirTrace, you need to use the -vt option, as in this example:

bash-4.00 $ vtcc -vt:cc mpicc -g -c prog.c
bash-4.00 $ vtcc -vt:cc mpicc -g -o prog.exe prog.o

This option applies for all the previous languages:

  • MPI C compiler:
    vtcc  -vt:cc mpicc
  • MPI C++ compilers:
    vtc++ -vt:cxx mpic++
    vtCC  -vt:cxx mpiCC
    vtcxx -vt:cxx mpicxx
  • MPI Fortran compilers:
    vtf77 -vt:f77 mpif77
    vtf90 -vt:f90 mpif90

By default, VampirTrace wrappers use Intel compilers. To change to another compiler, you can also use the -vt option:

bash-4.00 $ vtcc -vt:cc gcc -O2 -c prog.c
bash-4.00 $ vtcc -vt:cc gcc -O2 -o prog.exe prog.o

To profile an OpenMP or a Hybrid OpenMP/MPI application, you should add the corresponding OpenMP option for the compiler:

bash-4.00 $ vtcc -openmp -O2 -c prog.c
bash-4.00 $ vtcc -openmp -O2 -o prog.exe prog.o

Once your code is compiled, you can submit your job. Here is an example of submission script:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 32 # Number of tasks to use
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./prog.exe

At the end of execution, the program generates many profiling files:

bash-4.00 $ ls
a.out a.out.0.def.z a.out.1.events.z ... a.out.otf

To visualise those files, you must load the vampir module:

bash-4.00 $ module load vampir
bash-4.00 $ vampir a.out.otf

If you need more information, you can contact Curie’s support hotline.

7.1.3. Scalasca

Scalasca is a performance analysis toolset for parallel programs developed by the Jülich Supercomputing Center. Further information about the Scalasca project is available at http://www.scalasca.org.

The languages and parallel programming paradigms supported by Scalasca are Fortran, C, C++, MPI, OpenMP, hybrid OpenMP/MPI. Scalasca integrates both measurements and analysis tools.

The Scalasca Userguide is available at http://www2.fz-juelich.de/jsc/datapool/scalasca/UserGuide.pdf.

7.1.3.1. Using Scalasca on Curie

To load the scalasca module:

bash-4.00 $ module load scalasca

Scalasca has been compiled with the Bull MPI compiler. Please check that Bull MPI is in the list of your loaded modules. You can use the command:

bash-4.00 $ module list
7.1.3.2. Application instrumentation

Instrumentation by means of Scalasca is introduced into the program by compiling it with the scalasca -instrument command.

  • C with MPI: To compile a standard MPI C program:
    bash-4.00 $ scalasca -instrument mpicc -o prog.exe prog.c
  • Fortran with MPI: To compile a standard MPI Fortran program:
    bash-4.00 $ scalasca -instrument mpif90 -o ftest.exe ftest.f90
  • Fortran with OpenMP: To create an executable instrumented for OpenMP measurement:
    bash-4.00 $ scalasca -instrument ifort -openmp -o ftest.exe ftest.f90
  • Fortran with hybrid OpenMP/MPI: To create an executable instrumented for MPI and OpenMP measurement from the hybrid OpenMP/MPI program ftest.F90:
    bash-4.00 $ scalasca -instrument mpif90 -openmp -O3 -o ftest.exe ftest.F90
  • C/C++ with OpenMP or hybrid OpenMP/MPI: Use the same commands as in the Fortran examples, replacing mpif90 with mpicc/mpicxx.
  • Instrumentation inside a makefile: When using Makefiles, it is often convenient to define a “preparation preposition” placeholder (e.g., PREP) which can be prefixed to (selected) compile and link commands:
    MPIF90 = $(PREP) mpif90
    MPICC = $(PREP) mpicc
    MPICXX = $(PREP) mpicxx

    These can make it easier to prepare an instrumented version of the program by building with

    make PREP="scalasca -instrument"

Once your program is compiled, you can submit your job. Here is an example of submission script:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 32 # Number of tasks to use
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
scalasca -analyze mpirun ./prog.exe

At the end of execution, the program generates a directory which contains the profiling files:

bash-4.00 $ ls epik_*
...

To visualise the files, type:

bash-4.00 $ scalasca -examine epik_*

If you need more information, you can contact Curie’s support hotline.

7.1.4. Paraver

To use Paraver, load the paraver and papi modules:

bash-4.00 $ module load paraver
bash-4.00 $ module load papi

You need to compile your code with the -g option and to export the environment variable EXTRAE_CONFIG_FILE:

bash-4.00 $ export EXTRAE_CONFIG_FILE=my_config_file.xml

Here is an example of a basic my_config_file.xml (source: Paraver training, September 2010 at BSC):

<?xml version="1.0" ?>
<trace enabled="yes" home="/usr/local/paraver-3.99" initial-mode="detail" type="paraver" xml-parser-id="Id: xml-parse.c 494 2010-11-10 09:44:07Z harald $">
  <mpi enabled="yes">
    <counters enabled="yes" />
  </mpi>
  <openmp enabled="yes">
    <locks enabled="no" />
    <counters enabled="yes" />
  </openmp>
  <counters enabled="yes">
    <cpu enabled="yes" starting-set-distribution="1">
      <set enabled="yes" domain="all">
      PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_L1_DCM
      </set>
    </cpu>
   <network enabled="no" />
   <resource-usage enabled="no" />
   <memory-usage enabled="no" />
  </counters>
  <bursts enabled="yes">
    <threshold enabled="yes">500u</threshold>
    <mpi-statistics enabled="yes" />
  </bursts>
</trace>

PAPI_TOT_INS, PAPI_TOT_CYC and PAPI_L1_DCM are hardware counters. To see which counters are available on Curie, type:

bash-4.00 $ module load papi; papi_avail

Other xml configuration files are available in /usr/local/paraver-3.99/share/example.

Once the instrumented job has run, you need to merge the traces:

bash-4.00 $ mpi2prv -syn -f TRACE.mpits -o my_trace.prv

Then, to launch Paraver:

bash-4.00 $ wxparaver

To analyse your .prv trace, load it in the tool, and load a .cfg configuration file. Configuration files are available at http://www.bsc.es/plantillaA.php?cat_id=493.

7.1.5. Valgrind

Valgrind is a debugging and profiling suite for Linux executables. It consists of a core that performs program instrumentation and a set of tools for profiling and detecting memory management and threading bugs. In general, the profiling performed by Valgrind is solely based on instrumentation and thus is more time-consuming than hardware-assisted methods (e.g. PAPI).

The most common Valgrind tools are the following:

  • memcheck: detects memory-management problems such as illegal accesses, use of uninitialized pointers, memory leaks, bad memory de-allocations, etc.
  • cachegrind: performs cache profiling by simulating program execution under a user-specified cache hierarchy.
  • callgrind: extension to cachegrind that additionally provides information about call graphs.
  • massif: performs detailed heap profiling by taking regular snapshots of a program’s heap, and shows heap usage over time, including information about which parts of the program are responsible for most memory allocations.
  • helgrind: detects misuses of the Pthreads API (e.g. unlocking a not-locked mutex), potential deadlocks arising from lock ordering problems (e.g. acquiring two locks in reverse order by two threads could lead to deadlock), and data races.
  • DRD (Data-Race Detector): detects the same classes of errors as Helgrind, but offers additionally the opportunity to detect situations of excessive lock contention.

To use Valgrind on Curie, you should first load the valgrind module:

bash-4.00 $ module load valgrind

The user program should be compiled with -g to include debugging information. Using -O0 is also a good idea, if you can tolerate the slowdown. Use of -O2 and above is not recommended.

You invoke Valgrind like this:

bash-4.00 $ valgrind [valgrind-options] your-prog [your-prog-options]

The most important option is --tool, which dictates which Valgrind tool to run. For example, if you want to run the command ls -l using the memory-checking tool Memcheck, issue this command:

bash-4.00 $ valgrind --tool=memcheck ls -l

The user is referred to Valgrind’s home page (http://valgrind.org/) for a detailed description of the functionality and the command line options of each tool.

7.2. Use cases and hints for interpreting results

7.2.1. PAPI

Here is an example of PAPI usage to get the number of floating point operations of a matrix DAXPY (in Fortran).

program main
  implicit none
  include 'f90papi.h'
!
  integer, parameter :: size = 1000
  integer, parameter :: ntimes = 10
  double precision, dimension(size,size) :: A,B,C
  integer :: i,j,n
! PAPI Variables
  integer, parameter :: max_event = 1
  integer, dimension(max_event) :: event
  integer :: num_events, retval
  integer(kind=8), dimension(max_event) :: values
! Init PAPI
  call PAPIf_num_counters( num_events )
  print *, 'Number of hardware counters supported: ', num_events
  call PAPIf_query_event(PAPI_FP_INS, retval)
  if (retval .NE. PAPI_OK) then
    event(1) = PAPI_TOT_INS
  else
    ! Total floating point operations
    event(1) = PAPI_FP_INS
  end if
! Init Matrix
  do i=1,size
    do j=1,size
      C(i,j) = real(i+j,8)
      B(i,j) = -i+0.1*j
    end do
  end do
! Set up counters
  num_events = 1
  call PAPIf_start_counters( event, num_events, retval)
! Clear the counter values
  call PAPIf_read_counters(values, num_events,retval)
! DAXPY
  do n=1,ntimes
    do i=1,size
      do j=1,size
        A(i,j) = 2.0*B(i,j) + C(i,j)
      end do
    end do
  end do
! Stop the counters and put the results in the array values
  call PAPIf_stop_counters(values,num_events,retval)
! Print results
  if (event(1) .EQ. PAPI_TOT_INS) then
    print *, 'TOT Instructions: ',values(1)
  else
    print *, 'FP Instructions: ',values(1)
  end if
end program main

To compile, you have to load the PAPI module.

bash-4.00 $ module load papi/4.1.3
bash-4.00 $ ifort -I${PAPI_INC_DIR} papi.f90 ${PAPI_LIBS}
bash-4.00 $ ./a.out
Number of hardware counters supported: 7
FP Instructions: 10046163

To get the available counters, type the papi_avail command.

PAPI can also retrieve the MFLOPS of a given region of code; here is an example.

program main
  implicit none
  include 'f90papi.h'
!
  integer, parameter :: size = 1000
  integer, parameter :: ntimes = 100
  double precision, dimension(size,size) :: A,B,C
  integer :: i,j,n
! PAPI Variables
  integer :: retval
  real(kind=4) :: proc_time, mflops, real_time
  integer(kind=8) :: flpins
! Init PAPI
  retval = PAPI_VER_CURRENT
  call PAPIf_library_init(retval)
  if ( retval.NE.PAPI_VER_CURRENT) then
    print*, 'PAPI_library_init', retval
  end if
  call PAPIf_query_event(PAPI_FP_INS, retval)
! Init Matrix
  do i=1,size
    do j=1,size
      C(i,j) = real(i+j,8)
      B(i,j) = -i+0.1*j
    end do
  end do
! Setup Counter
  call PAPIf_flips( real_time, proc_time, flpins, mflops, retval )
! DAXPY
  do n=1,ntimes
    do i=1,size
      do j=1,size
        A(i,j) = 2.0*B(i,j) + C(i,j)
      end do
    end do
  end do
! Collect the data into the Variables passed in
  call PAPIf_flips( real_time, proc_time, flpins, mflops, retval)
! Print results
  print *, 'Real_time: ', real_time
  print *, ' Proc_time: ', proc_time
  print *, ' Total flpins: ', flpins
  print *, ' MFLOPS: ', mflops
!
end program main

The output is the following:

bash-4.00 $ module load papi/4.1.3
bash-4.00 $ ifort -I${PAPI_INC_DIR} papi_flops.f90 ${PAPI_LIBS}
bash-4.00 $ ./a.out
Real_time: 6.1250001E-02
 Proc_time: 5.1447589E-02
 Total flpins: 100056592
 MFLOPS: 1944.826

For more details, contact Curie’s support hotline or visit the PAPI website at http://icl.cs.utk.edu/papi/index.html .

7.2.2. VampirTrace/Vampir

VampirTrace allocates a buffer to store its profiling information. If the buffer is full, VampirTrace flushes it to disk. By default, the size of this buffer is 32MB per process and it can be flushed only once. You can increase (or reduce) the size of the buffer through the environment variable VT_BUFFER_SIZE:

bash-4.00 $ export VT_BUFFER_SIZE=64M
bash-4.00 $ ccc_mprun ./prog.exe

In this example, the buffer is set to 64MB. You can also increase the number of flushes through the environment variable VT_MAX_FLUSHES:

bash-4.00 $ export VT_MAX_FLUSHES=10
bash-4.00 $ ccc_mprun ./prog.exe

If the value of VT_MAX_FLUSHES is set to 0, the number of flushes is unlimited.

By default, VampirTrace first stores profiling information in a local directory (/tmp) of each process. These files can be very large and fill the directory. You can change this local directory through the environment variable VT_PFORM_LDIR:

bash-4.00 $ export VT_PFORM_LDIR=$SCRATCHDIR
7.2.2.1. Vampir server

Traces generated by VampirTrace can be very large, and Vampir can become very slow when visualising them. For this purpose Vampir provides VampirServer, a parallel program which uses several processors to accelerate the Vampir visualisation.

7.2.3. Scalasca + Vampir

Scalasca can generate OTF trace files which can be visualised with Vampir. To activate traces, add the -t flag to scalasca when you launch the run. Here is the previous script, modified accordingly:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 32 # Number of tasks to use
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
scalasca -analyze -t mpirun ./prog.exe

At the end of execution, the program generates a directory which contains the profiling files:

bash-4.00 $ ls epik_*
...

You can visualise these files as before. To generate the OTF trace files, you can type:

bash-4.00 $ ls epik_*
bash-4.00 $ elg2otf epik_*

It will generate an OTF file under the epik_* directory. To visualise it, load Vampir:

bash-4.00 $ module load vampir
bash-4.00 $ vampir epik_*/a.otf

7.2.4. Scalasca + PAPI

Scalasca can retrieve hardware counters with PAPI. For example, if you want to retrieve the number of floating point operations:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 32 # Number of tasks to use
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id

set -x
cd ${BRIDGE_MSUB_PWD}
export EPK_METRICS=PAPI_FP_OPS
scalasca -analyze mpirun ./prog.exe

The number of floating point operations appears in the profile when you visualise it. You can retrieve at most 3 hardware counters at the same time on Curie. To specify several counters, set the environment variable EPK_METRICS:

bash-4.00 $ export EPK_METRICS="PAPI_FP_OPS:PAPI_TOT_CYC"

8. Tuning Applications

8.1. Choosing or excluding nodes

SLURM provides the possibility of choosing or excluding any nodes in the reservation for your job.

To choose nodes, you need to modify your submission script like:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 32 # Number of tasks to use
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -A paxxxx # Project ID
#MSUB -E '-w curie[1000-1003]' # Include 4 nodes (curie1000 to curie1003)
set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out

To exclude nodes:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 32 # Number of tasks to use
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -A paxxxx # Project ID
#MSUB -E '-x curie[1000-1003]' # Exclude 4 nodes (curie1000 to curie1003)
set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out

8.2. Process distribution, affinity and binding

8.2.1. Introduction

8.2.1.1. Definitions
  • Hardware topology: The hardware topology is the organisation of cores, processors, sockets and memory in a node. A map of the topology can be created with hwloc (http://www.open-mpi.org/projects/hwloc/). You can get access to hwloc on Curie like this:
    bash-4.00 $ module load hwloc
  • Binding: A Linux process can be bound (or pinned) to one or many cores. It means that a process and its threads can only run on a given selection of cores. For example, a process which is bound to a socket on a Curie Thin node can run on any of the 8 cores of that socket.
  • Affinity: Affinity represents the policy of resource binding (cores and memory) for processes and/or threads.
  • Distribution: The distribution of MPI processes describes how these processes are spread across the cores, sockets or nodes.

On Curie, the default behaviour for distribution, affinity and binding is managed by SLURM, using the ccc_mprun command.

8.2.1.2. Process distribution

We present here some examples of MPI process distributions.

  • Block or round: This is the standard distribution. From the SLURM manpage: the block distribution method will distribute tasks to a node such that consecutive tasks share a node. For example, consider an allocation of two nodes, each with 8 cores. A block distribution request will place tasks 0 to 7 on the first node and tasks 8 to 15 on the second node.
  • Cyclic by socket: From the SLURM manpage, the cyclic distribution method will distribute tasks to a socket such that consecutive tasks are distributed over consecutive sockets (in a round-robin fashion). For example, consider an allocation of two nodes, each with 2 sockets of 4 cores. A cyclic distribution by socket (with a block distribution by node) will place tasks 0,2,4,6 on the first socket and tasks 1,3,5,7 on the second socket.
  • Cyclic by node: From the SLURM manpage, the cyclic distribution method will distribute tasks to a node such that consecutive tasks are distributed over consecutive nodes (in a round-robin fashion). For example, consider an allocation of two nodes, each with 2 sockets of 4 cores. A cyclic distribution by node will place tasks 0,2,4,6,8,10,12,14 on the first node and tasks 1,3,5,7,9,11,13,15 on the second node.
8.2.1.3. Why is affinity important to improve performance?

Curie nodes are NUMA (Non-Uniform Memory Access) nodes. It means that it will take longer to access some regions of memory than others. This is due to the fact that all memory regions are not physically on the same bus.

As an example, if a datum is in memory module 0, a process running on the second socket will take more time to access it than a process running on the first socket. We can thus introduce the notion of local data versus remote data: for a process running on socket 0, a datum is local if it is in memory module 0 and remote if it is in memory module 1.

We can thus deduce the reasons why tuning the process affinity is important:

  • Data locality improves performance. If your code uses shared memory (like pthreads or OpenMP), the best choice is to group your threads on the same socket. The shared data should then be local to the socket and, moreover, the data will potentially stay in the processor’s cache.
  • System processes can interrupt your process running on a core. If your process is not bound to a core or to a socket, it can be moved to another core or to another socket. In this case, all the data of this process has to be moved with it, and this can take some time.
  • MPI communication is faster between processes which are on the same socket. If you know that two processes perform many communications, you can bind them to the same socket.
  • On Curie Hybrid nodes, each GPU is connected to a bus which is local to one socket. A process takes longer to access a GPU which is not connected to its socket.

For all these reasons, it is better to know the NUMA configuration of Curie nodes (Extra Large, Thin and Hybrid). In the following section, we will present some ways to tune your processes’ affinity for your jobs.

8.2.1.4. CPU affinity mask

The affinity of a process is defined by a mask. A mask is a binary value whose length is given by the number of cores available on a node. For example, Curie Hybrid nodes have 8 cores: the binary mask will have 8 digits. Each digit is 0 or 1, and the process will run only on the core(s) which have a corresponding 1. A binary mask must be read from right to left.

For example, a process which runs on the cores 0,4,6 and 7 will have as affinity binary mask: 11010001

SLURM and BullxMPI use these masks, but converted to hexadecimal.

  • To convert a binary value to hexadecimal:
    bash-4.00 $ echo "obase=16;ibase=2;11010001"|bc
    D1
  • To convert a hexadecimal value to binary:
    bash-4.00 $ echo "obase=2;ibase=16;D1"|bc
    11010001

The numbering of the cores is the PU number from the output of hwloc.
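For instance, if the hwloc module provides the lstopo tool, the topology and the PU numbering of the current node can be printed with (output not shown):

bash-4.00 $ module load hwloc
bash-4.00 $ lstopo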

8.2.2. SLURM

SLURM is the default launcher for jobs on Curie. SLURM manages the processes even for sequential jobs. We recommend you use ccc_mprun. By default, SLURM binds processes to a core. The distribution is block by node and by core.

The option -E '--cpu_bind=verbose' for ccc_mprun gives you a report about the binding of processes before the run:

bash-4.00 $ ccc_mprun -E '--cpu_bind=verbose' -q hybrid -n 8 ./a.out
cpu_bind=MASK - curie7054, task 3 3 [3534]: mask 0x8 set
cpu_bind=MASK - curie7054, task 0 0 [3531]: mask 0x1 set
cpu_bind=MASK - curie7054, task 1 1 [3532]: mask 0x2 set
cpu_bind=MASK - curie7054, task 2 2 [3533]: mask 0x4 set
cpu_bind=MASK - curie7054, task 4 4 [3535]: mask 0x10 set
cpu_bind=MASK - curie7054, task 5 5 [3536]: mask 0x20 set
cpu_bind=MASK - curie7054, task 7 7 [3538]: mask 0x80 set
cpu_bind=MASK - curie7054, task 6 6 [3537]: mask 0x40 set

In this example, we can see that process 5 has 20 as hexadecimal mask, i.e. 00100000 in binary: the 5th process will run only on core 5.

8.2.2.1. Process distribution

To change the default distribution of processes, you can use the option -E '-m' for ccc_mprun. With SLURM, you have two levels for process distribution: node and socket.

  • Node block distribution:
    ccc_mprun -E '-m block' ./a.out
  • Node cyclic distribution:
    ccc_mprun -E '-m cyclic' ./a.out

By default, the distribution over the socket is block. In the following examples for socket distribution, the node distribution will be block.

  • Socket block distribution:
    ccc_mprun -E '-m block:block' ./a.out
  • Socket cyclic distribution:
    ccc_mprun -E '-m block:cyclic' ./a.out
8.2.2.2. Curie Hybrid node

On a Curie Hybrid node, each GPU is connected to a socket. It will take longer for a process to access a GPU if this process is not on the same socket of the GPU. By default, the distribution is block by core. Then the MPI rank 0 is located on the first socket and the MPI rank 1 is on the first socket too. The majority of GPU codes will assign GPU 0 to MPI rank 0 and GPU 1 to MPI rank 1. In this case, the bandwidth between MPI rank 1 and GPU 1 is not optimal.

If your code does this, in order to obtain the best performance, you should:

  • use the block:cyclic distribution
  • if you intend to use only 2 MPI processes per node, you can reserve 4 cores per process with the directive #MSUB -c 4. The two processes will then be placed on two different sockets (see the example script below).
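As an illustration, a submission script following the second approach could look like the sketch below (the request name, project ID and executable are placeholders, and any queue directive required for the Hybrid nodes is omitted):

#!/bin/bash
#MSUB -r MyJob_GPU    # Request name
#MSUB -n 2            # 2 MPI processes, one per GPU
#MSUB -c 4            # 4 cores per process: one process per socket
#MSUB -T 1800         # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -A paxxxx       # Project ID

set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./gpu_prog.exe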
8.2.2.3. Process binding

By default, processes are bound to a core. For multi-threaded jobs, processes create threads: these threads will be bound to the assigned core. To allow these threads to use other cores, SLURM provides the option -c to assign several cores to a process.

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 8 # Number of tasks to use
#MSUB -c 4 # Assign 4 cores per process
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -A paxxxx # Project ID

export OMP_NUM_THREADS=4
ccc_mprun ./a.out

In this example, our hybrid OpenMP/MPI code runs on 8 MPI processes and each process will use 4 OpenMP threads. We give here an example for the output with the verbose option for binding:

bash-4.00 $ ccc_mprun ./a.out
cpu_bind=MASK - curie1139, task 5 5 [18761]: mask 0x40404040 set
cpu_bind=MASK - curie1139, task 0 0 [18756]: mask 0x1010101 set
cpu_bind=MASK - curie1139, task 1 1 [18757]: mask 0x10101010 set
cpu_bind=MASK - curie1139, task 6 6 [18762]: mask 0x8080808 set
cpu_bind=MASK - curie1139, task 4 4 [18760]: mask 0x4040404 set
cpu_bind=MASK - curie1139, task 3 3 [18759]: mask 0x20202020 set
cpu_bind=MASK - curie1139, task 2 2 [18758]: mask 0x2020202 set
cpu_bind=MASK - curie1139, task 7 7 [18763]: mask 0x80808080 set

We can see here that the MPI rank 0 process is launched over the cores 0, 8, 16 and 24 of the node. These cores are all located on the node’s first socket.

Remark: with the -c option, SLURM will try to group the allocated cores as closely as possible to obtain the best performance. In the previous example, all the cores of an MPI process are located on the same socket.

Another example:

bash-4.00 $ ccc_mprun -n 1 -c 32 -E '--cpu_bind=verbose' ./a.out
cpu_bind=MASK - curie1017, task 0 0 [34710]: mask 0xffffffff set

We can see the process is not bound to a core and can run over all cores of a node.

8.2.3. BullxMPI

BullxMPI has its own process management policy. To use it, you first have to disable SLURM’s process management policy by adding the directive #MSUB -E '--cpu_bind=none'. You can then use the BullxMPI launcher mpirun:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 32 # Number of tasks to use
#MSUB -x # Require an exclusive node
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -A paxxxx # Project ID
#MSUB -E '--cpu_bind=none' # Disable default SLURM binding
mpirun -np 32 ./a.out

Note: In this example, BullxMPI process management policy can be effective only on the 32 cores allocated by SLURM.

The default BullxMPI process management policy is:

  • the processes are not bound
  • the processes can run on all cores
  • the default distribution is block by core and by node

The option --report-bindings gives you a report about the binding of processes before the run:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 32 # Number of tasks to use
#MSUB -x # Require an exclusive node
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -A paxxxx # Project ID
#MSUB -E '--cpu_bind=none' # Disable default SLURM binding
mpirun --report-bindings --bind-to-socket --cpus-per-proc 4 -np 8 ./a.out

This script gives the following output:

+ mpirun --bind-to-socket --cpus-per-proc 4 -np 8 ./a.out
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],3] to socket 1 cpus 2222
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],4] to socket 2 cpus 4444
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],5] to socket 2 cpus 4444
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],6] to socket 3 cpus 8888
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],7] to socket 3 cpus 8888
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],0] to socket 0 cpus 1111
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],1] to socket 0 cpus 1111
[curie1342:19946] [[40080,0],0] odls:default:fork binding child [[40080,1],2] to socket 1 cpus 2222

In the following paragraphs, we present the different possibilities of process distribution and binding. These options can be mixed (if possible).

Remark: the following examples use a whole Curie Thin node. We reserve 16 cores with #MSUB -n 16 and #MSUB -x in order to have all the cores available and full control over them. These are only examples for simple cases; in other cases, there may be conflicts with SLURM.

8.2.3.1. Process distribution

Block distribution by core:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 16 # Number of tasks to use
#MSUB -x # Require an exclusive node
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -A paxxxx # Project ID
#MSUB -E '--cpu_bind=none' # Disable default SLURM binding

mpirun --bycore -np 16 ./a.out

Cyclic distribution by socket:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 16 # Number of tasks to use
#MSUB -x # Require an exclusive node
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -A paxxxx # Project ID
#MSUB -E '--cpu_bind=none' # Disable default SLURM binding

mpirun --bysocket -np 16 ./a.out

Cyclic distribution by node:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 16 # Number of tasks to use
#MSUB -N 16
#MSUB -x # Require exclusive nodes
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -A paxxxx # Project ID
#MSUB -E '--cpu_bind=none' # Disable default SLURM binding

mpirun --bynode -np 16 ./a.out
8.2.3.2. Process binding

No binding:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 16 # Number of tasks to use
#MSUB -x # Require an exclusive node
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -A paxxxx # Project ID
#MSUB -E '--cpu_bind=none' # Disable default SLURM binding

mpirun --bind-to-none -np 16 ./a.out

Core binding:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 16 # Number of tasks to use
#MSUB -x # Require an exclusive node
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -A paxxxx # Project ID
#MSUB -E '--cpu_bind=none' # Disable default SLURM binding

mpirun --bind-to-core -np 16 ./a.out

Socket binding (the process and its threads can run on any core of a socket):

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 16 # Number of tasks to use
#MSUB -x # Require an exclusive node
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -A paxxxx # Project ID
#MSUB -E '--cpu_bind=none' # Disable default SLURM binding

mpirun --bind-to-socket -np 16 ./a.out

The user can specify the number of cores to assign to an MPI process:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 16 # Number of tasks to use
#MSUB -x # Require an exclusive node
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -A paxxxx # Project ID
#MSUB -E '--cpu_bind=none' # Disable default SLURM binding

mpirun --bind-to-socket --cpus-per-proc 4 -np 2 ./a.out

Here we assign 4 cores per MPI process.

8.2.3.3. Manual process management

BullxMPI gives the possibility to manually assign your processes through a hostfile and a rankfile. An example:

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 16 # Number of tasks to use
#MSUB -x # Require an exclusive node
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -A paxxxx # Project ID
#MSUB -E '--cpu_bind=none' # Disable default SLURM binding
hostname > hostfile.txt
echo "rank 0=${HOSTNAME} slot=0,1,2,3 " > rankfile.txt
echo "rank 1=${HOSTNAME} slot=4,5,6,7 " >> rankfile.txt
echo "rank 2=${HOSTNAME} slot=8,9,10,11" >> rankfile.txt
echo "rank 3=${HOSTNAME} slot=12,13,14,15" >> rankfile.txt
mpirun --hostfile hostfile.txt --rankfile rankfile.txt -np 4 ./a.out

In this example, there are several steps:

  • You have to create a hostfile (here hostfile.txt) containing the hostnames of all the nodes your run will use.
  • You have to create a rankfile (here rankfile.txt) which assigns to each MPI rank the cores where it can run. In our example, the process of rank 0 will have cores 0, 1, 2 and 3 as its affinity, and so on. Be careful: the numbering of the cores is different from the hwloc output.
  • You can then launch mpirun, specifying the hostfile and the rankfile.

8.3. Advanced MPI usage

8.3.1. Embarrassingly parallel jobs and MPMD jobs

An embarrassingly parallel job is a job which launches independent processes. These processes need little or no communication.

An MPMD job (Multiple Program Multiple Data) is a parallel job which launches different executables over the processes. An MPMD job can be parallel with MPI and can perform many communications.

These two concepts are different but we present them together because the way to launch them on Curie is similar. A simple example was already given in Section 4.5.1.2, “Script examples”.

In the following example, we use ccc_mprun to launch the job (srun can be used too). We want to launch bin0 on the MPI rank 0, bin1 on the MPI rank 1, and bin2 on the MPI rank 2. We write a shell script to describe the topology of our job:

launch_exe.sh:

#!/bin/bash
if [ $SLURM_PROCID -eq 0 ]
then
./bin0
fi
if [ $SLURM_PROCID -eq 1 ]
then
./bin1
fi
if [ $SLURM_PROCID -eq 2 ]
then
./bin2
fi

We can then launch our job with 3 processes:

bash-4.00 $ ccc_mprun -n 3 ./launch_exe.sh

The script launch_exe.sh must have execute permission. When ccc_mprun launches the job, it will initialise some environment variables. Among them, SLURM_PROCID defines the current MPI rank.

8.3.2. BullxMPI

8.3.2.1. MPMD jobs

BullxMPI (or OpenMPI) jobs can be launched with mpirun launcher. In this case, we have other ways to launch MPMD jobs. Based on the example proposed in Section 8.3.1, “Embarrassingly parallel jobs and MPMD jobs”, we present two other ways to launch the MPMD script:

  • Without the launch_exe.sh script:
    bash-4.00 $ mpirun -np 1 ./bin0 : -np 1 ./bin1 : -np 1 ./bin2
  • With the launch_exe.sh script modified as follows (SLURM_PROCID is replaced by OMPI_COMM_WORLD_RANK):
    #!/bin/bash
    if [ ${OMPI_COMM_WORLD_RANK} -eq 0 ]
    then
    ./bin0
    fi
    if [ ${OMPI_COMM_WORLD_RANK} -eq 1 ]
    then
    ./bin1
    fi
    if [ ${OMPI_COMM_WORLD_RANK} -eq 2 ]
    then
    ./bin2
    fi

    Then, the job is launched on three processes:

    bash-4.00 $ mpirun -np 3 ./launch_exe.sh
8.3.2.2. Tuning BullxMPI

BullxMPI is based on OpenMPI and can be tuned with parameters. The command ompi_info -a gives you the list of all parameters and their descriptions.

curie50$ ompi_info -a
(...)
  MCA mpi: parameter "mpi_show_mca_params" (current value: <none>, data source: defau
           Whether to show all MCA parameter values during MPI_INIT or not (good for
           - or a comma delimited combination of them
(...)

These parameters can be modified through environment variables set before the ccc_mprun command. The form of the corresponding environment variable is OMPI_MCA_xxxxx where xxxxx is the parameter name.

#!/bin/bash
#MSUB -r MyJob_Para # Request name
#MSUB -n 32 # Number of tasks to use
#MSUB -T 1800 # Elapsed time limit in seconds
#MSUB -o example_%I.o # Standard output. %I is the job id
#MSUB -e example_%I.e # Error output. %I is the job id
#MSUB -A paxxxx # Project ID

set -x
cd ${BRIDGE_MSUB_PWD}
export OMPI_MCA_mpi_show_mca_params=all
ccc_mprun ./a.out
8.3.2.3. Optimising with BullxMPI

To optimise BullxMPI, these parameters can be used:

curie50$ export OMPI_MCA_mpi_leave_pinned=1
curie50$ export OMPI_MCA_btl_openib_use_eager_rdma=0

Be careful, these parameters are not set by default. They can influence the behaviour of your code.

8.3.2.4. Debugging with BullxMPI

Sometimes, BullxMPI codes can hang in collective communications for large jobs. If this happens, you can try this parameter:

export OMPI_MCA_coll="^ghc,tuned"

This setting disables optimised collective communication: it can slow down your code if it uses many collective operations.

8.4. Advanced OpenMP usage

8.4.1. Environment variables

The OpenMP specification defines a set of environment variables that affect several parameters of application execution during runtime. These variables are:

  • OMP_NUM_THREADS num : sets the maximum number of threads to use in parallel regions if no other value is specified in the application. As of version 3.1 of the OpenMP specification, this environment variable takes a comma-separated number list that specifies the number of threads at each nesting level.
  • OMP_SCHEDULE type[,chunk_size] : sets the iteration scheduling policy for parallel loops to one of static, dynamic, guided or auto types, and optionally defines their chunk size.
  • OMP_DYNAMIC dynamic : enables or disables the dynamic adjustment of the number of threads the runtime system uses for parallel regions. Valid values for dynamic are true or false. When enabled, the runtime system adjusts the number of threads so that it makes the most efficient use of system resources. This is meaningful when the user has not specified the desired number of threads globally or on a per-region basis.
  • OMP_NESTED nested : enables or disables nested parallelism, i.e. whether OpenMP threads are allowed to create new thread teams. Valid values for nested are true or false.
  • OMP_MAX_ACTIVE_LEVELS levels : sets the maximum number of nested active parallel regions.
  • OMP_THREAD_LIMIT limit : sets the maximum number of threads participating in the OpenMP program. When the number of threads in the program is not specified (e.g. with OMP_NUM_THREADS variable), the number of threads used is the minimum between this variable and the total number of CPUs.
  • OMP_WAIT_POLICY policy : defines the behaviour of threads waiting on synchronisation events. When policy is set to ACTIVE, threads consume processor cycles while waiting (“busy-waiting”). When it is set to PASSIVE, threads do not consume CPU power but may require more time to resume.
  • OMP_STACKSIZE size[B|K|M|G] : sets the default thread stack size in kilobytes, unless the number is suffixed by B, K, M or G (bytes, KB, MB, GB, respectively).
  • OMP_PROC_BIND bind : specifies whether threads may be moved between processors. If set to true, OpenMP threads should not be moved. If set to false, the operating system may move them.
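For example, the following (illustrative) settings request 8 threads, guided scheduling with a minimum chunk of 4, a 64 MB thread stack and bound threads:

bash-4.00 $ export OMP_NUM_THREADS=8
bash-4.00 $ export OMP_SCHEDULE="guided,4"
bash-4.00 $ export OMP_STACKSIZE=64M
bash-4.00 $ export OMP_PROC_BIND=true
bash-4.00 $ ccc_mprun ./prog.exe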

Table 4 shows the default values for the aforementioned environment variables, both for the gcc and Intel compilers.

 

Table 4. Default OpenMP specifications

Environment variable: gcc default / icc default
OMP_NUM_THREADS: number of processors visible to the operating system (both compilers)
OMP_SCHEDULE: gcc: dynamic, chunk size = 1 / icc: static, no chunk size specified
OMP_DYNAMIC: FALSE (both compilers)
OMP_NESTED: FALSE (both compilers)
OMP_MAX_ACTIVE_LEVELS: unlimited (both compilers)
OMP_THREAD_LIMIT: unlimited (both compilers)
OMP_WAIT_POLICY: gcc: (undocumented) / icc: PASSIVE
OMP_STACKSIZE: gcc: system dependent / icc: IA-32 architecture: 2M; Intel 64 and IA-64 architectures: 4M
OMP_PROC_BIND: (undocumented) for both compilers

 

Besides the environment variables defined in the OpenMP standard, many implementations provide their own environment variables. Below we present briefly the most important ones for Intel and gcc compilers. The user is referred to the corresponding manuals for a more detailed description of the variables and their possible values.

Intel:

  • KMP_BLOCKTIME : sets the time (in milliseconds) that a thread should wait, after completing the execution of a parallel region, before sleeping.
  • KMP_DYNAMIC_MODE : selects the method used to determine the number of threads for a parallel region when OMP_DYNAMIC is set to true.
  • KMP_LIBRARY : selects the OpenMP run-time library execution mode.
  • KMP_AFFINITY : enables the runtime system to bind OpenMP threads to (logical) CPUs (that is, to cores, since Hyper-Threading is not enabled).

GNU:

  • GOMP_CPU_AFFINITY : specifies a list of CPUs where OpenMP threads should be bound.

8.4.2. Thread affinity

Thread affinity means assigning a thread to one or more CPUs. The user can specify on which CPUs of a node his application is allowed to run, and the OS scheduler will choose only among these CPUs when deciding where to run each thread (even when there are other CPUs sitting idle). By controlling how threads are being mapped to CPUs, the user can affect the interaction between the application and the underlying platform. Different thread-CPU mappings often entail notable differences in terms of performance.

8.4.2.1. Determining machine topology

The most important architectural parameters that need to be considered in order to enforce a particular thread-CPU mapping, often dictated by the application’s special needs, are:

  • Which CPUs reside within the same physical package (socket).
  • Which CPUs reside within the same core as different hardware threads (in a Hyper-Threaded architecture).
  • Which CPUs within the same package share a certain level of cache.
  • Which CPUs have different access paths to main memory (i.e., different NUMA links).
  • Which memory path is local to a certain CPU and which is remote.

The Linux kernel publishes this kind of information through the sys pseudo file system (sysfs). The directories and files that are most important for extracting the processor topology of a node are summarised in Table 5.

 

Table 5. List of files and directories to find system information

Files and directories Information provided
/sys/devices/system/cpu/cpuX/ Contains all information regarding CPU with id X (the unique processor id as seen by the OS)
/sys/devices/system/cpu/cpuX/cache/ Characteristics of cache hierarchy of X
/sys/devices/system/cpu/cpuX/cache/indexL/ Characteristics for level L cache of X
/sys/devices/system/cpu/cpuX/cache/indexL/shared_cpu_list List with CPUs that share cache level L with X
/sys/devices/system/cpu/cpuX/topology/ Details on X’s topology
/sys/devices/system/cpu/cpuX/topology/physical_package_id The physical package id where X resides. Package id’s are unique within the platform
/sys/devices/system/cpu/cpuX/topology/core_id The core id of X. Core id’s are unique within a specific package
/sys/devices/system/cpu/cpuX/topology/core_siblings_list List with cores that are in the same package as X
/sys/devices/system/cpu/cpuX/topology/thread_siblings_list In an architecture with multi-threaded cores (e.g. Intel Hyper-Threading), shows which other CPUs are in the same core with X, as different hardware threads
/sys/devices/system/node/nodeN/ Contains information about NUMA memory node N
/sys/devices/system/node/nodeN/cpulist List with CPUs that are adjacent (local) to N

 

For the Curie system, the processor topology within the Extra Large and Thin nodes, along with the processor id’s as seen by the OS, are presented in Section 2.1, “Curie’s configuration”. Once the user has knowledge of the processor topology, a number of different approaches can be used in order to pin the threads of an OpenMP program to specific processors.

8.4.2.2. The taskset utility

The taskset utility found in most modern Linux systems can be used to set the CPU affinity of an already running process, or to launch a new program with a given CPU affinity. The Linux scheduler will honour the requested processors to the application, and the application threads will be restricted to run on those processors only.

The CPU affinity is represented as a bitmask, with the lowest order bit corresponding to the first processor id (i.e., “/sys/devices/system/cpu/cpu0”). Not all CPUs may exist on a given system but a mask may specify more CPUs than are present. The masks are typically given in hexadecimal. For example, 0x00000001 represents CPU with processor id 0, while 0x00000003 represents the CPUs with processor id’s 0 and 1.

The most important command line arguments of taskset are:

  • -a : set the CPU affinity of all the threads for a given application.
  • -p : operate on an already running application (i.e., existing process ID) and do not launch a new application.
  • -c : specify a numerical list of processors instead of a bitmask. The numbers are separated by commas and may include ranges. E.g., 0,1,3,5-7

The default behaviour is to run a new application with a given affinity mask:

taskset mask application [application-arguments]

For example, in an OpenMP application that creates four threads, the following command will restrict the threads to run on processors with id’s 0, 2, 4 and 6.

taskset 0x55 ./my_app arg1 arg2

All threads of the application will get an affinity mask of 0x55. Strictly speaking, this means that any of them may execute on any of the specified processors, and nothing prevents them from being moved between those processors at runtime. In practice, however, the Linux scheduler will strive to keep each thread on a separate processor throughout execution, basically for performance reasons. As of OpenMP 3.1, the programmer can use the OMP_PROC_BIND environment variable initialised to true, to guarantee more strongly that threads will not bounce between the specified processors.
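For example, the following (illustrative) commands launch a 4-thread OpenMP application restricted to the processors with id's 0, 2, 4 and 6, using the -c list form of taskset:

bash-4.00 $ export OMP_NUM_THREADS=4
bash-4.00 $ export OMP_PROC_BIND=true
bash-4.00 $ taskset -c 0,2,4,6 ./my_app arg1 arg2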

8.4.2.3. Intel compiler

The KMP_AFFINITY environment variable provided by the Intel compiler offers a high-level, yet flexible interface for mapping threads to processors. It uses the following generic syntax:

KMP_AFFINITY = [<modifier>,...] <type> [,<permute>] [,<offset>]

Table 6 describes the arguments along with their most important possible values:

 

Table 6. KMP_AFFINITY main arguments and their possible values

Argument Value Description
modifier granularity=<specifier> Granularity describes the lowest level of processor topology that OpenMP threads are allowed to float (i.e. execute unbounded) within a topology map. Practically, it refers to machines with Hyper-Threading technology enabled. When specifier takes the value core (default), then OpenMP threads that are bound to a specific core may float between the different hardware contexts (Hyper-Threads) of the core. When specifier is initialized to fine or thread, each OpenMP thread will be bound to a specific hardware context of the core.
respect It is the default value for modifier. It respects the original affinity mask of the thread that initialised the OpenMP runtime library.
norespect It does not respect the original affinity mask of the initial thread. It binds OpenMP threads to all operating system processors.
proclist={<proc>} The user explicitly specifies the OpenMP thread assignment by using a list of processor id’s (as seen by the OS). It has effect when the explicit type is specified. OS processors specified in this list are then assigned to OpenMP threads, in order of OpenMP Global Thread IDs. If more OpenMP threads are created than there are elements in the list, the assignment occurs modulo the size of the list. Examples:
proclist=3,0-2 , and the application creates 6 threads: OpenMP thread 0 will run on processor with id 3, thread 1 on 0, thread 2 on 1, thread 3 on 2, thread 4 on 3 and thread 5 on 0.
proclist=[3,0,{1,2},{1,2}] , and the application creates 4 threads: OpenMP thread 0 will run on processor 3, thread 1 on 0, and threads 2 and 3 will be both allowed to float between processors 1 and 2.
verbose Prints messages concerning the supported affinity. The messages include information about the number of packages, number of cores in each package, number of thread contexts for each core, and OpenMP thread bindings to physical thread contexts.
type compact Assigns consecutive OpenMP threads to processors that are as close as possible to each other, in terms of processor topology. In a multi-socket, multi-core machine, it will first assign threads to all cores of a socket before proceeding to the next socket available.
scatter Distributes the threads as evenly as possible across the system, in a “breadth-first” fashion. It is the opposite of compact. In a multi-socket, multi-core machine, at first it will distribute threads on core #0 of all sockets, then on core #1, and so on.
explicit Assigns OpenMP threads to a list of processor ID’s that have been explicitly specified by using the proclist= modifier.
none Does not bind OpenMP to particular processors.
disabled Completely disables thread affinity. Any other affinity mechanism (e.g. affinity system calls) will have no effect.
permute <positive integer> Controls which levels are most significant when sorting the machine topology map (see the User Guide for a more detailed description).
offset <positive integer> Indicates the starting position for thread assignment.

 

For an extensive description of the KMP_AFFINITY arguments, along with a variety of examples, the user is referred to the Intel C++ compiler Reference Guide (online at http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/lin/compiler_c/index.htm)
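As an illustration, the following settings bind 16 threads compactly on a Thin node, or scatter them across the two sockets, respectively (the executable name is a placeholder):

bash-4.00 $ export OMP_NUM_THREADS=16
bash-4.00 $ export KMP_AFFINITY="verbose,granularity=core,compact"
bash-4.00 $ ./prog.exe

bash-4.00 $ export KMP_AFFINITY="granularity=core,scatter"
bash-4.00 $ ./prog.exe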

8.4.2.4. GNU compiler

The GNU compiler gcc provides the GOMP_CPU_AFFINITY environment variable to enforce OpenMP thread-to-CPU mappings, in a way similar to KMP_AFFINITY. The variable should contain a space-separated or comma-separated list of CPUs. This list may contain different kinds of entries:

  • Single CPU numbers in any order.
  • A range of CPUs (M-N).
  • A range with some stride (M-N:S).

OS processors specified in this list are assigned to OpenMP threads, in order of OpenMP Global Thread IDs. Examples:

  1. GOMP_CPU_AFFINITY="0 3 1-2 4-15:2" will bind the initial thread to CPU 0, the second to CPU 3, the third to CPU 1, the fourth to CPU 2, the fifth to CPU 4, the sixth through tenth to CPUs 6, 8, 10, 12, and 14 respectively and then start assigning back from the beginning of the list.
  2. GOMP_CPU_AFFINITY=0 binds all threads to CPU 0.

If this environment variable is omitted, the host system will handle the assignment of threads to CPUs.

Note also that GOMP_CPU_AFFINITY=<proc_list> is equivalent to KMP_AFFINITY=granularity=fine,proclist=[<proc_list>],explicit.

8.5. Hybrid programming

Hybrid MPI-OpenMP programming is a natural approach in large-scale hierarchical machines, where MPI is used for communication between nodes and OpenMP for parallelisation within a node. In its most typical form, there are a few MPI processes running in each node, and within each process multiple OpenMP threads operate in parallel on the process’s shared data. Usually, there are no MPI calls inside parallel regions. An example which corresponds to this scenario is shown below:

while ( i < iters ) {
  MPI_Send(prev_data);   /* pseudocode: exchange boundary data */
  MPI_Recv(curr_data);
  #pragma omp parallel for
  for (j = 0; j < n; j++) {
     /* process curr_data[j] in parallel */
  }
  ...
}

As of MPI-2.0, the programmer may use one of the following arguments in MPI_Init_thread function in order to specify the desired level of thread support:

  • MPI_THREAD_SINGLE : only one thread will execute.
  • MPI_THREAD_FUNNELED : the process may be multithreaded, but only the main thread will make MPI calls.
  • MPI_THREAD_SERIALIZED : the process may be multithreaded, and multiple threads may make MPI calls, but only one at a time: MPI calls are not made concurrently from two distinct threads.
  • MPI_THREAD_MULTIPLE : multiple threads may call MPI, with no restrictions.
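A minimal sketch of requesting the funneled level of thread support is shown below; the fallback check and the printed message are illustrative:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    /* Request MPI_THREAD_FUNNELED: only the master thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (provided < MPI_THREAD_FUNNELED && rank == 0) {
        printf("Warning: requested thread support not available (got %d)\n", provided);
    }

    /* ... hybrid MPI/OpenMP work as in the snippet above ... */

    MPI_Finalize();
    return 0;
}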

The reduction of MPI processes within a node and the employment of OpenMP threads usually have a number of advantages:

  • The MPI overhead (e.g. message packing / unpacking) for intra-node communication is eliminated; communication between OpenMP threads is performed solely through shared memory.
  • The memory footprint is reduced, due to the reduction of message buffers and various bookkeeping structures internal to MPI library.
  • The pressure on the network is mitigated, due to the reduction of communication channels and total number of messages. It is often more beneficial to have fewer but larger messages being exchanged, than the opposite.
  • MPI calls are reduced.
  • Load balancing can be improved when utilising OpenMP dynamic work-sharing constructs. This would be hard to implement with pure MPI.
  • Under several cases, cache sharing in multi-core nodes may be exploited by OpenMP threads in a beneficial way. For example, when threads work under common data, they effectively have larger cache space available or opportunities for mutual data prefetching. This would not be possible in pure MPI implementations, where each MPI process works on its private address space.

8.5.1. Optimal processes/threads strategy

Determining the optimal configuration of MPI processes and OpenMP threads in the hybrid model is not trivial. In order to achieve best performance, the programmer needs to focus on the following goals when examining a certain processes/threads combination:

  • Reduced MPI communication overhead.
  • Reduced synchronisation overhead.
  • Reduced memory consumption.
  • Reduced load imbalance.

Let C be the total number of cores in a system node. We consider the following alternatives regarding the processes/threads configuration, denoted by P x T, where P is the number of MPI processes and T the number of threads. We also assume that, in all cases, there is no over- or under-subscription of a node, i.e. P x T = C.

  • C x 1This corresponds to a pure MPI implementation. It treats all MPI processes as peers, disregarding the fact that some cores (i.e., those in the same node) can use the shared-memory infrastructure to communicate. This approach can be applied “as is” in a hierarchical system, without requiring major code restructuring. Its drawback is the MPI overhead (e.g. message processing) that must be paid for communication that happens within the same node. In practice, the MPI library will optimize this case and force communication to happen via shared memory, in a way that is transparent to the programmer. However, the user lacks the flexibility offered by the shared memory programming model, and the ability to leverage certain key features of OpenMP (e.g. automatic work scheduling and load balancing).
  • 1 x C : This corresponds to an (almost) pure OpenMP implementation, where a single MPI process does all the communication and the C OpenMP threads do the computations. One drawback of this strategy is that all OpenMP threads need to wait while the MPI process communicates data. Depending on the application, this communication overhead could quickly become the limiting factor for scalability. Furthermore, the total inter-node network bandwidth would remain under-utilised. A possible way to overcome both problems would be to overlap computation with communication, by assigning some OpenMP threads to perform communication in parallel with computation. For many applications, however, this approach is not straightforward to implement. Another possible problem with this scheme arises when two or more threads tend to work on the same data, following a read-write pattern. In this case the shared data must travel between the threads’ private caches. If the two threads share e.g. the L2 cache, then the communication will be rather fast. However, if they are bound to different sockets which do not share any level of cache, then the communication will be relatively slow, since it has to go off-chip. The problem becomes more intense as the number of sharers, or their distance in terms of processor topology, increases. In several cases (e.g. with a more judicious placement of threads on cores) the negative effects of read-write sharing can be mitigated.
  • N x M : This approach lies between the two previous extremes, and represents the most typical case for the hybrid model. There are N MPI processes, each of which is paired with a team of M OpenMP threads. Again, the MPI processes do all the communication and the OpenMP threads the computation, as shown in the code snippet in the previous section. The literature has shown that, in most cases, this configuration best balances the pros and cons of the C x 1 and 1 x C schemes and yields optimal performance.

There is no rule-of-thumb for choosing the best configuration of the N x M scheme (i.e., the best values for N and M). However, especially for NUMA systems, an option that matches well the hierarchical memory subsystem and the hierarchical nature of the hybrid MPI-OpenMP model is to use as many MPI processes as the number of NUMA domains (i.e., the NUMA memory node together with its local processor socket), and as many OpenMP threads (per MPI process) as the number of cores in each domain. This choice promotes data sharing between cores of the same NUMA domain (fast), avoids data exchange between cores on different NUMA domains (slow), and at the same time offers all the advantages of the N x M scheme discussed so far.

For instance, on the Curie Thin nodes, where each node features 2 NUMA domains (sockets + memory modules) and each socket contains 8 cores, a reasonable configuration would be to use 2 MPI processes with 8 OpenMP threads each. As we discuss in Section 8.7.1.3, “Hybrid MPI-OpenMP”, it is essential to bind each MPI process (along with its thread team) to a certain NUMA domain. However, since different applications have different characteristics (in terms of communication, synchronisation, locality, load balancing, etc.), it is advisable for the programmer to test configurations “neighbouring” 2 x 8, e.g. 1 x 16 or 4 x 4, in order to find the best one. In any case, the total number of threads should be equal to the total number of cores in the node, i.e. 16.

8.6. GPGPU programming

This section is intended as a guide for the programmer who wants to move a legacy code to a GPGPU implementation. It applies to both FORTRAN and C (even C++), yet most of the text refers to FORTRAN. As it is meant to give guidelines, the details of the reasoning behind the advice have been either omitted or given tersely.

Moving towards GPGPU can potentially deliver important speedups. Yet the main bottleneck is often found in the algorithm itself: if it has been designed only for sequential use, reworking it may be the main point of effort.

The algorithm should promote massive parallelism

Do not forget that Amdahl’s Law[1] will always rule your code. It means that all sequential parts should be kept to a minimum (avoided altogether if possible).

Amdahl’s Law is forever!

Once you know that your algorithm should behave nicely in parallel, the most important part of the work is to adapt the code. For that we propose the following methodology of development.

  1. Reorganise the code
  2. Profile the result
  3. Introduce GPGPU

This will be the structure of the following text.

8.6.1. Reorganise the code

The reorganisation of the code is the essential phase. It is the unique occasion to revisit the source in detail. The first point to verify is the architecture of the source.

8.6.1.1. Architecture

The architecture is the skeleton of the code. It should make it easy to understand the purpose of the code (why), not how it is done (the details of the actual computations).

Clearly separate why from how

A piece of code like:

Call Fourier(an_array)                  <- Why
Do i = 1, N
Other_array(i) = an_array(i) + 1/sin(i) <- How
End do
Call Fourier_reverse(other_array)       <- Why

is all too often seen in legacy codes. It is clearly a mix of purposes, which should be forbidden.

8.6.1.2. Dynamic memory

Dynamic memory is expensive at runtime and is not available on the GPU (kernel side). Therefore we have the rule:

No ALLOCATABLE and allocations in a compute subroutine

Following this rule means leaving memory allocation up to the architecture level and letting compute subroutines do only computations on regular arrays. In other words, temporaries have to be declared outside the compute subroutines.

8.6.1.3. Interfaces of compute subroutines

Most of the time, using a GPU forces us to “go out” of the FORTRAN world. This introduces strong constraints on our programming habits. For example, kernels on a GPU are executed asynchronously, so it is impossible to have a value returned automatically by the compiler. This forces the following rule:

SUBROUTINE only

By the same token, having to deal with remote execution forces the subroutines to have simple, fixed interfaces. Therefore no variable or optional arguments should be used.

Neither FORTRAN 90 shapes nor (even) array syntax

To be able to easily transform a code into another language, one should favour the simplest possible style, where everything is explicit. This was the case with FORTRAN 77. On the other hand, FORTRAN 90 introduced clearer interfaces for subroutines. The following features of FORTRAN 90 are mandatory, yet some programmers do not bother with them:

  • INTENT
  • IMPLICIT NONE
  • INTRINSICS

Even though FORTRAN 90 provides all the functions needed to deal with array bounds, it turns out that they are not available once outside of the FORTRAN world. Therefore we recommend that a program should:

Use explicit dimensions (FORTRAN 77 style) and pass the dimensions as actual arguments

MODULE and USE are pretty comfortable in a CPU-only code. The drawback is that variables accessed through a USE are equivalent to global variables (in C/C++). The problem is that global variables are invisible from the GPU, and they make it really difficult to track variable usage across the code (without a good set of tools). Therefore:

Limit MODULE/USE for TYPES declarations!

Limit MODULE/USE to the upper level of the architecture should they contain variables

Otherwise use arguments. Using arguments has a number of good properties (and the drawback of being rather verbose):

  • It defines a clear associated contract with the subroutine
  • It helps the compiler (optimizer)
  • It promotes GPU usage since relying only on arguments makes it easy to “cut out” the current implementation of a routine and replace it by its GPU equivalent.

8.6.1.4. F90

Do not use all the possibilities of the language. They tend to hide many computations and data movements which have to be reproduced on the GPU, most of the time in a very different way.

Use only a subset of the FORTRAN 90 syntax on a FORTRAN 77 code

Do not forget the F2003 iso_c_binding module: it opens a nice way to use C routines efficiently. For example, it allows the usage of pointers created in C.

Some aspects of F90 should be avoided altogether since they are misleading. As an example, a pointer to a member of a structure is quite unsuitable for a GPU port:

Arrays_1D%Ro_ptr => Hydro_vars%ro( first:last, j, k)

8.6.1.5. SOA or AOS[2]

The data layout of a code is a rather essential issue: it has a dramatic impact on performance. As a rule of thumb, CPUs prefer AOS. This is true only IF many members of the structure are used at the same time. Otherwise use a SOA.

CPUs prefer AOS

Unfortunately, this statement is false for a GPU for which SOA is the preferred layout. Your design of the data structure should take its most frequent usage into account.

GPUs prefer SOA

8.6.1.6. Study array streams

To benefit from the GPU, a program must limit the number of transfers between the GPU and the CPU, since those transfers can slow down the program by a huge factor. To achieve this, we recommend that the usage of arrays in the code be well understood and seen as a stream. This is a mandatory step towards improving the data movement. We found that using a table[3] to describe where an array is defined (allocated), filled, used and released, together with its status (in a module or not), helps a lot to get a clear view of the program.

Fill a table of array’s usages

 

Table 7. Array’s usages

             Array 1   Array 2
Subroutine 1 Def       Def, Fill
Subroutine 2 Temp      Use

 

Using this table the programmer must be able to state what set of arrays should go on the GPU and how long they should stay on the GPU.

We have noticed that the INTENT clause helps a lot to fill the array table. An automatic tool that would be able to create the table would be helpful.

8.6.2. Profile the result

Once the program has been restructured, the programmer must have a clear knowledge of the performance of each subroutine. For this, use all the tools available on your machine (gprof, timers, Paraver, TAU, Scalasca – see the corresponding sections in this guide).

Please note that mastering different performance tools is an investment which pays off in the long run.

Once an optimisation effort has been decided, it is important to focus only on the useful sections (remember the 80/20 rule: 80% of the compute time is spent in 20% of the source code). One nice trick we promote is to introduce permanent timers in the source. They are an important tool to measure the code’s evolution throughout its life.

First of all: really optimise the sequential code

8.6.3. Introduce GPGPU

When the code is in a state which is suitable for GPU usage (meaning after the optimisation phase), you can choose the language (HMPP, PGI, CUDA, OpenCL, OpenACC). HMPP is a good one to start with.

8.6.3.1. Oddities of a GPU

The usage of a GPU introduces some constraints on the way an algorithm is written to reach the maximum performances. The first goal is to maximise the “compute intensity”[4].

The compute intensity should be at least 2.

On these architectures, what is expensive are memory accesses and tests. For this second point, fuse loops whenever possible (to limit the loop overhead and tests). Loop fusion is easier if all the code has been made explicit (i.e. no array syntax involved), as illustrated below.
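
The following C sketch (purely illustrative) shows the kind of fusion meant here: two loops over the same index range are merged into one, removing one pass over the data and one set of loop tests.

/* Before fusion: two passes over the data, two sets of loop tests. */
void unfused(int n, const double *b, const double *c, double *a, double *d)
{
  int i;
  for (i = 0; i < n; i++)
    a[i] = b[i] + c[i];
  for (i = 0; i < n; i++)
    d[i] = 2.0 * a[i];
}

/* After fusion: a single pass; a[i] is reused while still in a register. */
void fused(int n, const double *b, const double *c, double *a, double *d)
{
  int i;
  for (i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
    d[i] = 2.0 * a[i];
  }
}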

Compute instead of store

On a GPU a good approximation is to claim that compute is free. Temporaries which can easily be recomputed should not be stored.

Limit conditionals or use masks

GPUs implement the SIMT[5] paradigm. As a consequence, the hardware does not take branches efficiently. Hence, conditionals should be avoided if possible. In the best case, pull the conditional up to the caller, or use masks (as was done on vector machines).

res[i] = cond[i] * a[i] + (1 - cond[i]) * b[i]

with cond[i] = 1|0, implements:

if (cond[i]) { res[i] = a[i] } else { res[i] = b[i] }

No I/O functions (PRINT/WRITE/READ).

Since input/output routines are not yet widely available for GPUs, leave them to the CPU.

Reactivate vectorized subroutines.

A vectorised code can easily be transferred to a GPU. Furthermore, we are also seeing vector units coming to future architectures (e.g. in the Knights series from Intel). Vectorisation techniques can usefully be reactivated.

8.6.3.2. Porting to CUDA

At this stage of the methodology, moving to CUDA (or HMPP) should be a straightforward process.

Create interfaces for CUDA routines.

First create interfaces to the new code (see above). Then duplicate the compute routines in CUDA while keeping the original code active for validation purposes.

Implement proxy routines.

Using a proxy subroutine is often the easiest way to switch from one implementation to the other. Then concentrate on the manual management of the GPU memory and data movements. The array table will be very useful at this step.
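
The following C sketch (hypothetical names, not taken from a real application) illustrates the proxy idea: both implementations share the same argument list, so switching between them for validation is a one-line change.

#include <stddef.h>

/* CPU reference implementation (kept for validation). */
static void compute_kernel_cpu(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = 2.0 * in[i];          /* placeholder computation */
}

/* GPU implementation: in a real code this would launch a CUDA kernel
   (or an HMPP codelet); here it is only a stub with the same interface. */
static void compute_kernel_gpu(const double *in, double *out, size_t n)
{
    compute_kernel_cpu(in, out, n);    /* stand-in for the CUDA version */
}

/* Proxy: the rest of the code always calls compute_kernel(); switching
   implementations is confined to this single routine. */
void compute_kernel(const double *in, double *out, size_t n, int use_gpu)
{
    if (use_gpu)
        compute_kernel_gpu(in, out, n);
    else
        compute_kernel_cpu(in, out, n);
}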

8.6.3.3. Porting to HMPP

Porting to HMPP requires less code development (no subroutine creation) but rather some code annotation (comments or pragmas). The source code annotation can be done according to the following steps:

  • Compute on the GPU, but initially allow all the arrays to be transferred at each call. Verify that the code is still valid (output should be invariant). Performance is poor (normal in this mode).
  • Iteratively minimise the transfers using our array table. In the end, only the minimum set of transfers should be left active.
  • Optimise the compute subroutines. The timers described above should help to focus on a minimal number of routines to optimise for the GPU.

Optimise transfers iteratively

8.6.3.4. In all cases

Do NOT reprogram what is available as a library

Experts are paid full time to deliver the most optimised versions for all architectures. This is the case, for example, for Intel’s MKL or NVIDIA’s CUBLAS. Do not try to beat those implementations.

As of TODAY

Start preparing your codes for future architectures now. This will require mandatory coding practices as well as a revision of the algorithms for hyper-parallelism. Doing so will be beneficial even for sequential or MPI codes. Since parallelism will be present at all levels, applications MUST be hyper-parallel from the ground up. As a consequence, a sequential run should be a degenerate case of a parallel run.

The future will be hyper parallel

8.7. Memory optimisation

One of the most important features in modern multicore designs is the Non-Uniform Memory Access capabilities they support. Such systems incorporate multiple on-chip memory controllers and buses, so that each socket (or group of sockets) can have its own path to main memory. This alleviates bus contention and offers better scalability, but at the cost of non-uniform memory access times (there are different costs when accessing different parts of the address space). A processor can read or write on a local NUMA memory node in a fast and uncontended way, but to access a remote one incurs additional latency and may introduce contention. Both the Curie Extra Large and Thin nodes feature such an organisation of memory modules, incorporating a total of 16 and 2 NUMA memory nodes, respectively, as illustrated in Figure 2 and Figure 5.

This variation in access times often translates to notable differences in ultimate performance. This means that an arbitrary thread and/or data placement can lead to suboptimal execution times for different classes of applications. It is therefore essential to consider memory affinity (i.e., on which memory node data should be allocated) together with cpu affinity when optimising for performance.

8.7.1. Memory affinity

The latest Linux kernels provide system calls to control data placement at low level (e.g. at page granularity). At a higher level, the numactl package provides a set of utilities and library calls to control memory allocation, or enforce a more generic memory policy. Memory policies represent a general strategy about how memory should be allocated across memory nodes. The idea behind this is to allow existing code to work reasonably well in a NUMA environment, without having to modify the source code by fine-tuning malloc calls. The numactl utility can be used to control the memory policy for a specific application. Its syntax is as follows:

numactl [options] <application> [application-arguments]

numactl launches the application and enforces the memory policy specified in the options. The policy is inherited by all application’s children (e.g., OpenMP threads created within an MPI process). The most important options and policy settings that numactl can take are the following:

  • --interleave=nodes : Set a memory interleave policy. Memory will be allocated using round robin on nodes. When memory cannot be allocated on the current interleave target, allocation will fall back to other nodes. Multiple nodes may be specified on --interleave, --membind and --cpunodebind, using the NUMA node numbers as seen by the OS (see Section 8.4.2.1, “Determining machine topology”). It is possible to specify all, in which case all nodes that are local to the current cpuset will be used. nodes may also be specified using “,” to separate distinct values, “-” to define ranges, and “!” to invert a selection. E.g., 0,2-4 means that allocations should be performed on nodes 0, 2, 3 and 4.
  • --membind=nodes : Only allocate memory from nodes. Allocation will fail when there is not enough memory available on these nodes. nodes may be specified as above.
  • --cpunodebind=nodes : Only execute the process on the CPUs that are local to the memory nodes in nodes. This option controls cpu affinity rather than memory affinity. Implicitly, the memory will be allocated in a best-effort manner on those nodes as well.
  • --physcpubind=cpus : Only execute the process on cpus. This is similar to the --cpunodebind option, apart from the fact that processor id’s (as seen by the OS) are given as arguments, instead of memory node id’s. Again, it is possible to specify all in order to use all CPUs in the current cpuset.
  • --localalloc : Always allocate memory on the current node.
  • --preferred=node : Preferably allocate memory on node, but if memory cannot be allocated there, fall back to other nodes. This option takes only a single node id.
  • --show : Show the NUMA policy settings of the current process.

Some examples are given below:

  • numactl --interleave=all ./program : Run program with its memory interleaved on all memory nodes.
  • numactl --cpunodebind=0 --membind=0,1 ./program : Run program on any core local to node 0, with memory allocated on nodes 0 and 1.
  • numactl --physcpubind=0-4,8-12 ./program : Run program on CPUs 0 to 4 and 8 to 12.

When the user does not specify any memory policy with numactl or lower-level library calls, the Linux kernel by default employs a “local allocation on first touch” policy. According to this, memory is allocated on the NUMA node that is local to the cpu that first accesses the corresponding data. That is, the location of the process that first touches the data determines the location of the data.

In the following, when we say that a process or thread runs (or executes, resides, etc.) within a memory node, we mean that it runs on a core that is local to that node, i.e. both the node and the core belong to the same NUMA domain.

8.7.1.1. MPI

In a pure-MPI program, the default memory policy implies that each process will have its data allocated on the node where the process was running while it initialised them. Even if it migrates later to another node, the data will be resident on the first one, and the accesses of course will be more expensive. Therefore, in order to guarantee a more stable and predictable behaviour of the application during runtime, it is desirable to specify in advance both the cpu and memory affinity of each MPI process.

On a Curie Extra Large node, we can achieve this using a script like the one below. It is actually a wrapper around the user’s MPI program (i.e., it can be given as an argument to the ccc_mprun command). It assigns the cpu and memory affinity in such a way that each MPI process is bound to a certain NUMA node, and processes are spread cyclically across the available NUMA nodes. This implies that any memory allocation of a process will be satisfied by its local node, and the process itself will be scheduled on a specific core within that node.

#!/bin/bash

# nd: NUMA domain. Assign a different NUMA domain id for each
# process according to its local rank, so that consecutive
# processes are distributed cyclically across domains
# (i.e., MPI process 0 will be assigned to domain 0, process
# 1 to 1, 2 to 2, ..., 15 to 15, 16 to 0, and so on)
nd=$(expr $OMPI_COMM_WORLD_LOCAL_RANK % 16)

# Set memory and cpu affinity for this MPI process.
# Local ranks [0,16,32,...,112] will be assigned to different cores
# of domain 0, ranks [1,17,...,113] to cores of domain 1, ranks [2,18,...,114]
# to cores of domain 2, ..., ranks [15,31,...,127] to cores of domain 15.
numactl --preferred=$nd --physcpubind=$OMPI_COMM_WORLD_LOCAL_RANK ./my_mpi_prog

8.7.1.2. OpenMP

In OpenMP there is a single address space shared by all threads. The first-touch policy implies that the location of data depends on the region of the program where the data was first written. If memory allocation and initialisation happen within the serial part (“main thread”), then data allocation will start from the NUMA node close to the core where the program was running at that time. Having data in a single (or few) memory node(s), however, would give unbalanced access times if threads are spread across different sockets. We therefore need to place threads and data in such a way that data is evenly distributed across NUMA nodes, and threads mostly make local accesses.

For an application that has its threads scattered on all packages of the platform, there are two basic options to achieve balanced data distribution in an easy way:

  • Have data allocations and initialisations occur within parallel regions, i.e. on behalf of each thread. This causes each thread to bring the corresponding data to its local memory node (see the sketch after this list). The drawback of this approach is that it may need code modifications, in order to move allocations/initialisations from the serial to the parallel portion of the code. Furthermore, it might incur some overhead if multiple malloc’s are called simultaneously on a time-critical path.
  • Use the --interleave option of numactl, passing as argument all the memory node id’s of the platform (0-15 on Curie Extra Large nodes). This approach evenly distributes data across all memory nodes and requires no code modifications, but it does not enforce affinity between a thread and the data it works on. Thus, memory accesses might end up all local, all remote, or anything in between.
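
The first option can be illustrated with the following minimal C sketch (the array size is arbitrary; compile with OpenMP support, e.g. -fopenmp): under the first-touch policy, each thread places the pages of the chunk it initialises on its local NUMA node.

#include <stdlib.h>

#define N 100000000L

int main(void)
{
    double *a = malloc(N * sizeof(double));
    long i;

    /* First touch: each thread initialises the chunk it will later work on,
       so the corresponding pages are allocated on its local NUMA node. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++)
        a[i] = 0.0;

    /* Later parallel loops with the same (static) schedule will then
       mostly access locally allocated pages. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++)
        a[i] = 2.0 * a[i];

    free(a);
    return 0;
}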

If none of the above options are suitable for a specific case, the user can always resort to a low-level NUMA API in order to explicitly control thread and data mappings. Such an API provides functions for allocating a specific data buffer on a certain NUMA node, overriding any generic policy enforced by the OS. From an algorithmic perspective, this would require a-priori knowledge of which data segments are processed by each thread, which might not be easy to find for all applications. From an implementation aspect, it would need fine-tuning of memory allocations using the low-level libnuma API, in order to direct the kernel to allocate each data segment on the node where its respective thread will execute. This approach is efficient (does not introduce much overhead during runtime), strongly guarantees that data placement will occur as desired by the programmer, but it is usually difficult to implement.

If the application uses only a subset of the available sockets, we could use numactl to restrict memory allocations to happen on the corresponding memory nodes only. For example, to run an application on NUMA nodes 0 and 1, we could use the following wrapper script:

#!/bin/bash
export OMP_PROC_BIND=true
numactl --membind=0,1 --cpunodebind=0,1 ./my_openmp_prog

The --cpunodebind argument restricts thread execution to cores of NUMA nodes 0 and 1, and setting OMP_PROC_BIND to true guarantees that threads will not move between these cores.

8.7.1.3. Hybrid MPI-OpenMP

A similar rationale as in the pure MPI and pure OpenMP cases applies to the hybrid MPI-OpenMP case as well. To enforce a scheme of 16 MPI processes x 8 OpenMP threads, which naturally fits the architecture of a Curie Extra Large node, we could use a script like the one that follows. The script assumes the GCC (libgomp) implementation of OpenMP. It pins each MPI process, along with its thread team, on a separate NUMA node and, additionally, causes any memory allocation (either from the process or from its threads) to happen on that node.

#!/bin/bash

# 8 OpenMP threads per MPI process
export OMP_NUM_THREADS=8

# nd: NUMA domain. Set a different NUMA domain id for each
# process and initialise it to the process's local MPI rank
# (will be a number between 0 and 15 when 16 MPI processes
# per-node are specified)
nd=$OMPI_COMM_WORLD_LOCAL_RANK

# For each MPI process, bind its OpenMP threads to different
# cores of the package that it will be assigned on (e.g., processors
# [0,16,32,...,112] of package 0, for MPI process 0)
export GOMP_CPU_AFFINITY=$nd-127:16

# Set memory and thread affinity for this MPI process to the
# specified NUMA domain, and launch it
numactl --preferred=$nd --cpunodebind=$nd ./my_mpi_prog

8.7.2. Memory Allocation tuning

As described in Section 8.7.1.2, “OpenMP”, to control the affinity of memory allocations in a more explicit and fine-grained fashion than the generic memory policies allow, one should use the lower-level functions provided by the libnuma library. Many of these functions act as substitutes for the malloc function, taking additional arguments for memory affinity. The most important of them are presented in the following paragraphs. For a more detailed description of their arguments and return values, as well as of the rest of the libnuma functions, the user is referred to the manual pages (man numa).

The following functions perform memory allocation at system page size granularity (i.e., the specified bytes are rounded up to a multiple of page size). They are relatively slow compared to the malloc family of functions, and thus their use is recommended outside the critical path of the application. The allocated memory must be freed with numa_free.

  • void *numa_alloc_onnode(size_t size, int node) : Allocates size bytes of memory on a specific node.
  • void *numa_alloc_local(size_t size) : Allocates size bytes of memory on the local node.
  • void *numa_alloc_interleaved(size_t size) : Allocates size bytes of memory interleaved on all nodes. The interleaving works at page level and will only show an effect when the area is large.
  • void *numa_alloc_interleaved_subset(size_t size, struct bitmask *nodemask) : Attempts to allocate size bytes of memory interleaved on the nodes specified in the bitmask. The interleaving works at page level and will only show an effect when the area is large.
  • void *numa_alloc(size_t size) : Allocates size bytes of memory with the current NUMA policy.
  • void numa_free(void *start, size_t size) : Frees size bytes of memory starting at start. The size argument will be rounded up to a multiple of the system page size.
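
As an illustration (a minimal sketch, not taken from a real application), the following C program allocates a buffer on NUMA node 0 with numa_alloc_onnode and releases it with numa_free; the program must be linked with -lnuma.

#include <stdio.h>
#include <numa.h>

int main(void)
{
    size_t bytes = 64UL * 1024 * 1024;   /* 64 MB, arbitrary size */
    double *buf;

    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* Allocate the buffer on NUMA node 0, regardless of where the
       calling thread is currently running. */
    buf = numa_alloc_onnode(bytes, 0);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* ... use the buffer ... */

    numa_free(buf, bytes);   /* memory from libnuma must be freed with numa_free */
    return 0;
}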

Other useful functions follow in brief:

  • numa_move_pages : moves a list of pages in the address space of the currently executing or current process. It simply uses the move_pages system call.
  • numa_migrate_pages : uses the migrate_pages system call to cause the pages of the calling thread, or a specified thread, to be migrated from one set of nodes to another.
  • numa_set_preferred : sets the preferred node for the current thread. The system will attempt to allocate memory from the preferred node, but will fall back to other nodes if no memory is available on the preferred node.
  • numa_set_interleave_mask : sets the memory interleave mask for the current thread. All new memory allocations are page interleaved over all nodes in the interleave mask. Interleaving can be turned off again by passing an empty mask. The page interleaving only occurs on the actual page fault that puts a new page into the current address space. It is also only a hint: the kernel will fall back to other nodes if no memory is available on the interleave target.
  • numa_interleave_memory : interleaves memory page by page on the specified nodes. This is a lower level function to interleave allocated but not yet faulted in memory. Not yet faulted in means the memory is allocated using mmap (or shmat), but has not been accessed by the current process yet.
  • numa_bind : binds the current thread and its children to the specified nodes. They will only run on the CPUs of the specified nodes and only be able to allocate memory from them.
  • numa_set_localalloc : sets the memory allocation policy for the calling thread to local allocation. In this mode, the preferred node for memory allocation is effectively the node where the thread is executing at the time of a page allocation.
  • numa_set_membind : sets the memory allocation mask. The thread will only allocate memory from the nodes set in the mask.

9. Case studies

9.1. Case study 1: Executing the STREAM benchmark on Extra Large nodes

This section deals with the Extra Large nodes of the machine, which are targeted to codes that require large memory and/or multithreading capabilities, as well as pre and/or post processing tasks.

As such, we investigate the capabilities of these nodes in terms of memory bandwidth, which is a key factor in achieving high performance in many multithreaded codes. Furthermore, the NUMA characteristics of the node must be taken into account when deploying multithreaded applications. Each NUMA node of the platform provides independent and uncontended access to main memory for threads running on cores adjacent to it. Therefore, the maximum attainable bandwidth differs between the case where a number of threads are placed so that they share a single NUMA node (or a few of them) and the case where each thread has its own NUMA node at its disposal; the contention in the first case leads to much lower per-thread and aggregate bandwidth. In conclusion, the placement configuration greatly affects memory performance.

The memory-bound benchmark STREAM was used to measure the impact of placement on attainable memory bandwidth. It was run on a single Curie Extra Large node for 1, 2, 4, 8, 16, 32, 64 and 128 threads, and for various placement configurations. In the results that follow, the placement configurations are denoted as MxPxC, meaning that a specific run used “M” S6010 modules out of the four that make up an Extra Large node, “P” packages on each module, and “C” cores on each package. In every configuration, each thread runs on its own core and accesses the NUMA node that is local to it. CPU affinity was enforced with the taskset utility, and memory affinity was implied by the fact that all threads work on private data.

Figure 6 shows four different options for placing a 16-thread version of the STREAM benchmark on an Extra Large node. Each option implies a different model for accessing the NUMA nodes, and therefore yields different performance, as we will see in the results that follow.

We used the following configuration of the STREAM benchmark:

  • Context: STREAM, OpenMP version
  • Parameters: Array size = 205000000, Offset =0, NTIMES=10.
  • Kernel used: TRIAD

To run e.g. configuration 1x2x8, we used the following commands:

export OMP_PROC_BIND=true
export OMP_NUM_THREADS=16
taskset -c 0,16,32,48,64,80,96,112,1,17,33,49,65,81,97,113 ./stream_omp

 

Figure 6. Examples of thread placement for the case of 16 threads (shown in red). Configuration: #Modules x #Packages_per_module x #Cores_per_package. (a) 1x2x8 configuration, (b) 2x1x8 configuration, (c) 2x4x2 configuration, (d) 4x4x1 configuration.

 

Figure 7 and Figure 8 depict the maximum aggregate and per-thread bandwidth, respectively, for various placement configurations, as reported by the STREAM benchmark. These results are also presented in more detail in Table 8.

As expected, from these results we see that configurations that tend to spread a given number of threads across available packages (e.g. 4x4x1), rather than trying to pack them in small number of packages (e.g. 1x2x8), yield in general much better overall performance. This is because the former cases involve more dedicated and uncontended links with local memory, while the latter involve fewer shared and contended links. Another observation is that the maximum memory bandwidth is not achieved when all 128 cores are utilised (4x4x8), obviously because this configuration introduces some amount of memory contention. Instead, the more moderate configuration of 2x4x4 is the one that yields the maximum attainable bandwidth (100113.9 MB/s), because it best balances thread-level parallelism and memory contention.

From Figure 8 we can also see which configurations lead to increased memory contention, by comparing the maximum bandwidth achieved by a single thread in the standalone case (1x1x1) with the per-thread bandwidth achieved in the other co-execution cases. Again, based on the explanation given above, when threads are placed in separate packages (i.e., the *x*x1 configurations) they suffer much less memory contention compared to other placements, and their bandwidth is nearly identical to the standalone case. They suffer even less contention when they are placed in separate S6010 modules (e.g. compare 4x1x1 with 2x2x1 or 1x4x1).

 

Figure 7. Maximum aggregate memory bandwidth for various placement configurations on a Curie XL node using the STREAM benchmark


 

 

Figure 8. Maximum per-thread memory bandwidth for various placement configurations on a Curie XL node using the STREAM benchmark


 

 

Table 8. Maximum memory bandwidth for various placement configurations on a Curie XL node

Placement (#Modules x #Packages_per_module x #Cores_per_package) Number of threads Aggregate bandwidth (MB/s) Per-thread bandwidth (MB/s)
1x1x1 1 5702.4 5702.4
1x1x2 2 10286 5143
1x2x1 2 10852.4 5426.2
2x1x1 2 11354.4 5677.2
1x1x4 4 14313.4 3578.3
1x2x2 4 19103.4 4775.8
1x4x1 4 21652.3 5413.0
2x1x2 4 20507.4 5126.8
2x2x1 4 21691.3 5422.8
4x1x1 4 22688.7 5672.1
1x1x8 8 15492.9 1936.6
1x2x4 8 26055.6 3256.9
1x4x2 8 37918.8 4739.8
2x1x4 8 28617.1 3577.1
2x2x2 8 38052.8 4756.6
2x4x1 8 43291.5 5411.4
4x1x2 8 40830.8 5103.8
4x2x1 8 43422.7 5427.8
1x2x8 16 28173.5 1760.8
1x4x4 16 49492.4 3093.2
2x1x8 16 31002.6 1937.6
2x2x4 16 52120.4 3257.5
2x4x2 16 75928.1 4745.5
4x1x4 16 57102.5 3568.9
4x2x2 16 76585.3 4786.5
4x4x1 16 85224 5326.5
1x4x8 32 44954.5 1404.8
2x2x8 32 56265.8 1758.3
2x4x4 32 100113.9 3128.5
4x1x8 32 61857.1 1933.0
4x2x4 32 97141.1 3035.6
4x4x2 32 70359.1 2198.7
2x4x8 64 43834.2 684.9
4x2x8 64 41951.6 655.4
4x4x4 64 78309.3 1223.5
4x4x8 128 48208.8 376.6

 

9.2. Case study 2: Executing the SpMV and Jacobi kernels on Extra Large and Thin nodes

9.2.1. The SpMV kernel

9.2.1.1. Description

An important and ubiquitous computational kernel with a streaming memory access pattern is the Sparse Matrix-Vector Multiplication (SpMV). It is used in a large variety of applications in scientific computing and engineering. For example, it is the basic operation of iterative solvers, such as Conjugate Gradient and Generalised Minimum Residual, which are extensively used to solve sparse linear systems resulting from the simulation of physical processes described by partial differential equations. Furthermore, SpMV is recognised as one of the so-called “Seven Dwarfs of Parallel Computation”, which are classes of applications that are believed to be important for at least the next decade.

The distinguishing characteristic of sparse matrices is that they are dominated by a large number of zeros (as can be seen in Figure 9), making it highly inefficient to perform operations using typical dense array structures. Special storage schemes are used instead, which target both the reduction of the storage requirements of the matrix and the efficient execution of various operations by performing only the necessary computations. Thus, the common approach is to store only the non-zero values of the matrix and employ additional indexing information representing the position of these values. However, SpMV still suffers from memory bottlenecks because it performs O(n²) operations on O(n²) data, which means that the ratio of memory accesses to floating-point operations is very high, and it is considered a memory-bound kernel. As a result, the level of compression applied to the matrix is the major factor that determines the performance of the SpMV kernel.

 

Figure 9. Example of sparse matrices: (a) Chebyshev4, (b) bmw7st_1.

 

The most common storage scheme is Compressed Sparse Row (CSR). The CSR format uses three arrays in order to describe the matrix efficiently. The first array, values, contains all the non-zero elements of the matrix. The second array, colind, has the same length as values and stores the column index of each non-zero element. Finally, the third array, rowptr, whose size equals the number of rows plus one, contains indices into colind that point to the first element of each row. An example of the CSR storage format (Figure 10) and the code of SpMV using the CSR storage format can be seen below.

 

Figure 10. Example of the CSR storage format


for (i = 0; i < nr_rows; i++) {
  y[i] = 0;
  for(j = rowptr[i]; j < rowptr[i+1]; j++)
    y[i] += values[j] * x[colind[j]];
}
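
As a concrete illustration (a hypothetical 4x4 matrix, not the matrix of Figure 10), the three CSR arrays could be declared as follows:

/* Hypothetical 4x4 matrix with 6 non-zero elements:
 *
 *   [ 5 0 0 1 ]
 *   [ 0 8 0 0 ]
 *   [ 0 0 3 0 ]
 *   [ 4 0 0 9 ]
 */
double values[] = { 5, 1, 8, 3, 4, 9 };
int    colind[] = { 0, 3, 1, 2, 0, 3 };
int    rowptr[] = { 0, 2, 3, 4, 6 };   /* nr_rows + 1 entries */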

 

9.2.1.2. OpenMP implementation

It is easily observed that the input vector x and the three arrays of the CSR format (values, colind, rowptr) are read-only, and only the output vector y is written. As a result, the outer loop can be divided into multi-threaded tasks, because each iteration of this loop accesses only the respective element of vector y; for example, the ith iteration writes only the ith element of y. If the inner loop were parallelised instead, two or more threads could access the same element of vector y simultaneously. In order to avoid this unwanted situation and keep the threads properly synchronised, only the outer loop is parallelised. The code of SpMV, using the CSR storage format and OpenMP for multi-threaded support, can be seen below.

#pragma omp parallel for private(j)
for (i = 0; i < nr_rows; i++) {
  y[i] = 0;
  for(j = rowptr[i]; j < rowptr[i+1]; j++)
    y[i] += values[j] * x[colind[j]];
}

One important characteristic of this implementation is that the length of the inner loop (the number of non-zero elements of the current row) is not constant. More specifically, a row may have from 0 to n non-zero elements in an n × n sparse matrix. As a result, the tasks created by OpenMP do not necessarily have the same workload and, in some cases, are highly imbalanced. One easy and simple way to reduce this negative effect is to make efficient use of the scheduling algorithms defined by OpenMP.

There are four different scheduling options in OpenMP for dividing the tasks created by a parallel loop amongst the available threads (an example of their use is shown below). If the auto option is used, the division of tasks amongst threads is decided by the system. The handicap of this option is that it clearly ignores the distinguishing characteristics of each problem, and it does not perform well in some cases. If the static option is used, the tasks are divided evenly between threads. It is preferable for parallelising loops where the number of operations per iteration is constant. In other cases, where the workload is unequally divided between tasks, it can create imbalance between threads and scale poorly. In contrast, the dynamic option avoids imbalance between threads, but adds an overhead from dynamically assigning tasks, due to the synchronisation operations required during execution. Finally, with the guided option, a large part of the tasks is assigned statically to threads; then, each thread that completes its work is assigned a share of the remaining tasks, until no tasks remain. It is worth mentioning that the static, dynamic and guided schedule options support an extra parameter named chunksize, which is the number of loop iterations assigned to a single OpenMP task.
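
As an illustration (reusing the variables of the previous snippet), the schedule and chunksize are selected in the directive itself; the values below are only examples, since the best choice is matrix-dependent:

#pragma omp parallel for schedule(dynamic, 32) private(j)
for (i = 0; i < nr_rows; i++) {
  y[i] = 0;
  for(j = rowptr[i]; j < rowptr[i+1]; j++)
    y[i] += values[j] * x[colind[j]];
}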

9.2.1.3. MPI implementation

The efficient parallelisation of the SpMV kernel for distributed-memory platforms is challenging, because the sparsity pattern of the matrix determines the communication pattern between compute nodes. In this context, several research papers have proposed the adoption of graph-partitioning techniques in the division of the matrix, in an effort to minimize inter-node communication. At the same time, however, the total computational load has to be kept balanced across nodes. For many sparse matrices, these two requirements can be conflicting, and the challenge is to find a partitioning scheme that makes the best compromise.

For the purposes of evaluating SpMV performance on Curie, we have devised a simple parallelisation approach, which is shown in Figure 11 (a minimal sketch of the corresponding row partitioning is given below). The matrix (A) is split into horizontal strips, so that each one contains the same number of non-zero elements. Each strip, along with its corresponding parts of the input and output vectors (all depicted in the same colour), is assigned to a specific compute node. In a shared-memory environment, this scheme balances the computational load across different cores. On a distributed-memory platform, however, it does not tell us anything about communication balance. For example, node 1, in order to compute y[0] (shown in red), has to collect the “red” parts of x from all the other nodes. On the contrary, node 3 does not need to receive anything from the other nodes in order to compute its part of y. Furthermore, in our parallelisation scheme, at the end of each iteration every MPI process does not “blindly” send its whole part of x to every other process, but only the elements that are needed, and only to the processes that need them.
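
The following C sketch (illustrative only; the function name and interface are not from the actual benchmark) shows one way to compute such a partitioning from the CSR rowptr array, so that each of nprocs strips holds roughly the same number of non-zeros:

/* row_start[p] .. row_start[p+1]-1 are the rows owned by process p. */
void partition_rows(const int *rowptr, int nr_rows, int nprocs, int *row_start)
{
    long nnz = rowptr[nr_rows];
    int  p, row = 0;

    for (p = 0; p < nprocs; p++) {
        long target = nnz * (p + 1) / nprocs;   /* cumulative non-zero target */
        row_start[p] = row;
        while (row < nr_rows && rowptr[row + 1] <= target)
            row++;
    }
    row_start[nprocs] = nr_rows;
}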

 

Figure 11. Partitioning of sparse matrix based on the number of non-zero elements


 

In order to examine the behaviour of the kernel for different communication patterns, we have used large matrices with different sparsity patterns. Their characteristics are presented in Table 9. Figure 12 depicts macroscopic views of the non-zero elements of the matrices. Note that matrix 14.rajat31 differs from 30.boneS10 in that its first row and column are full of non-zero elements. This leads to significant communication pressure on the node to which the first row is mapped, exactly as happens with node 1 in the example of Figure 11. In general, the pattern of both the 11.wb-edu.graph and 14.rajat31 matrices implies a great amount of data that needs to be exchanged between MPI processes. Compared to 14.rajat31, however, the communication in 11.wb-edu.graph is much more balanced. On the other hand, the localised organisation of non-zero elements in matrices 22.m_t1 and 30.boneS10 leads to a small communication volume.

 

Figure 12. Sparsity pattern of the matrices: 11.wb-edu.graph, 30.boneS10, 14.rajat31, 22.m_t1.

 

 

Table 9. Characteristics of sparse matrices used

Matrix Dimension Non-zero elements
11.wb-edu.graph 9845725 x 9845725 57156537
30.boneS10 914898 x 914898 55468422
14.rajat31 4690002 x 4690002 20316253
22.m_t1 97578 x 97578 9753570

 

9.2.2. The Jacobi kernel

9.2.2.1. Description

The second application we used to evaluate the various parallelisation and deployment options on the Curie machine was the Jacobi kernel, a common stencil computation. In our case, we use the Jacobi method to solve the 3-dimensional heat equation, an important partial differential equation which describes the distribution of heat (or variation in temperature) in a given region over time.

This equation can be solved by approximating all the derivatives by finite differences. The 3D domain is partitioned using a mesh x=1,…,X, y=1,…,Y, z=1,…,Z, and in time using a discretisation t=0,…,T. The following pseudo-code describes a serial implementation of the algorithm.

current=1;
previous=0;
for ( t=0; t < T-1; t++ ) {
  for ( i=1; i <=X; i++ )
     for ( j=1; j <=Y; j++ )
        for ( k=1; k <=Z; k++ )
           u[current][i][j][k] = 1.0/6 * (
             u[previous][i-1][j][k] +
             u[previous][i+1][j][k] +
             u[previous][i][j-1][k] +
             u[previous][i][j+1][k] +
             u[previous][i][j][k-1] +
             u[previous][i][j][k+1]);

  swap(current, previous);
}

9.2.2.2. OpenMP implementation

The parallelisation of the 3D heat equation with OpenMP is almost trivial. Since there are no data dependences between the inner loops of the computational kernel, we can apply the parallel for directive to parallelise the outermost spatial loop, which iterates over dimension X. Furthermore, since the computational load is the same for all iterations and there is no data reuse of any kind, we have used static scheduling with a chunk size equal to 1. In our experiments, this proved to be the optimal configuration for OpenMP, since any other choice of scheduling policy or chunk size failed to give higher performance. The following pseudo-code shows this parallelisation approach for an X x Y x Z mesh.

current=1;
previous=0;
#pragma omp parallel private(t,i,j,k)
{
  for ( t=0; t < T-1; t++ ) {
     #pragma omp for
     for ( i=1; i <=X; i++ )
        for ( j=1; j <=Y; j++ )
           for ( k=1; k <=Z; k++ )
              u[current][i][j][k] = 1.0/6 * (
                u[previous][i-1][j][k] +
                u[previous][i+1][j][k] +
                u[previous][i][j-1][k] +
                u[previous][i][j+1][k] +
                u[previous][i][j][k-1] +
                u[previous][i][j][k+1]);

    #pragma omp single
    {
      swap(current, previous);
    }
  }
}

We have chosen to place the omp parallel directive outside the outer loop, so as to minimise the thread management overhead, i.e. the fork-join overhead. For data consistency reasons, all loop variables have been declared as private, while matrix u is by default shared, to serve communication between different threads.

Finally, the omp single directive is used to define that a single thread shall undertake the task of swapping the previous and current indices before the next time step, which is necessary for the correctness of the implementation. All threads are synchronised at the end of this code block by the directive’s implied barrier, updating their local copies to be consistent with the shared memory’s data, before proceeding to the next iteration over time.

9.2.2.3. MPI implementation

To parallelise the 3D heat equation with MPI, we adopt a domain decomposition approach. The 3D space is divided into 3D sub-domains, which are distributed among processes. Since there is no shared memory between different processes, data exchange is required: the computation of the boundary points of each sub-domain at time t relies on the values of their neighbouring points at time t-1, which are stored in the local memory of other processes. Hence, each process needs to communicate with all processes that hold the neighbouring sub-domains and receive the necessary data chunks.

The pattern of communication is now straightforward: each process sends its boundary 2D planes to its 6 neighbours and receives 6 2D planes, each from a different neighbour, to accomplish the computation. This pattern is shown in Figure 13, which depicts the communication pattern for a 3D space.

We have chosen to utilise the Cartesian virtual topology offered by MPI, to distribute the 3D subspaces over a virtual 3D grid of processes, which better reflects the logical communication pattern of the processes compared to linear ranking (a minimal sketch is shown below). We should note that a virtual topology acts as an advisor to the runtime system for the assignment of processes to physical processors in a manner that improves communication performance.
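
The following C sketch (illustrative only, not the actual benchmark code) shows how such a 3D Cartesian topology and the ranks of the 6 neighbours might be obtained:

#include <mpi.h>

int main(int argc, char *argv[])
{
    int nprocs, rank;
    int dims[3] = {0, 0, 0};      /* let MPI choose a balanced factorisation */
    int periods[3] = {0, 0, 0};   /* non-periodic boundaries */
    int coords[3];
    int west, east, south, north, back, front;
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Dims_create(nprocs, 3, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 3, coords);

    /* Ranks of the 6 neighbours, used for the halo exchange of 2D planes
       (boundary neighbours become MPI_PROC_NULL). */
    MPI_Cart_shift(cart, 0, 1, &west,  &east);
    MPI_Cart_shift(cart, 1, 1, &south, &north);
    MPI_Cart_shift(cart, 2, 1, &back,  &front);

    /* ... allocate the local sub-domain, exchange boundary planes with the
       neighbour ranks above, compute, and iterate over time ... */

    MPI_Finalize();
    return 0;
}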

In our experiments with the Jacobi kernel, we have selected four different dimensions for the 3D space, namely 128, 256, 512 and 1024. Each dimension yields a different working set size, which in turn implies varying behaviour in terms of cache reuse, memory access pattern and communication pattern over the network.

 

Figure 13. Communication pattern for the MPI implementation of the 3D Jacobi solver


 

9.2.3. Optimising the SpMV kernel using OpenMP on Extra Large nodes

In order to test the effect of the various OpenMP scheduling algorithms on the performance of SpMV, an experimental evaluation of SpMV using the CSR storage format was performed on a Curie Extra Large node for all the scheduling options. Actually, we utilised only one of the four S6010 modules of an Extra Large node, and so had a 4-way eight-core system totalling 32 cores. The experiments were conducted by measuring the execution time of 128 consecutive SpMV operations with randomly created input vectors. The threads were always scheduled to run as “economically” as possible; for example, 8 threads were scheduled on the cores sharing the same NUMA memory node, instead of capturing cores from other memory nodes. For every schedule except auto, the chunksize which produced the best possible result was selected.

 

Figure 14. SpMV performance for different scheduling algorithms on a single S6010 module of a Curie Extra Large node: (a) Chebyshev4, (b) bmw7st_1.

 

Figure 14 shows the performance of SpMV using the different scheduling policies. It can be observed that the dynamic schedule performs better when utilising up to eight threads (one memory node), because it can better manage the imbalances inherent in SpMV. However, when more than one memory node is used, the synchronisation overhead of dynamic assignment grows, and SpMV with dynamic scheduling scales poorly. As a result, static scheduling, which does not require communication between threads during runtime, performs better in these cases. It is worth mentioning that a more NUMA-aware implementation would result in better performance and would give us better insight into scheduling algorithms on NUMA systems such as Curie. Finally, super-linear scaling is observed for matrix bmw7st_1 from 16 to 32 threads, because the aggregate cache size, when 32 threads are utilised, is larger than the total data needed by SpMV for this matrix, and costly cache misses are avoided.

Apart from the scheduling algorithm itself, another very important parameter is the chunksize, especially when the dynamic option is used. If the chunksize is small, a large number of tasks will be created and the overhead of assigning these tasks to threads will be significant compared to the overall SpMV execution time. Moreover, if the chunksize is very large, the created tasks may be highly imbalanced. So, a moderate chunksize must be chosen in order to obtain good scaling of SpMV. The performance for different chunksizes, when the dynamic option is used, can be seen in Figure 15. It is easily noted that the best chunksize is 32 for matrix Chebyshev4 and 1024 for matrix bmw7st_1, which means that the matrix characteristics must be taken into consideration when choosing the chunksize.

 

Figure 15. SpMV performance for different chunksizes of the dynamic policy: (a) Chebyshev4, (b) bmw7st_1.

 

9.2.4. Executing hybrid SpMV and Jacobi kernels on Extra Large nodes

In order to shed light on the performance behaviour of the various configurations of MPI processes and OpenMP threads in the hybrid implementations, we have evaluated these versions initially on a single Extra Large node. The large number of cores in this node, and the multiple levels of their organisation, yield a large number of options for configuring and deploying the application, which makes the task of choosing the best among them more difficult.

We have evaluated the kernels with thread counts 1, 2, 4, 8, …, 128, and for each thread count we have tested all possible combinations of MPI processes and OpenMP threads. For example, for 32 threads, we have tested the following configurations: 32×1, 16×2, 8×4, 4×8, 2×16 and 1×32, where P×T denotes the number of MPI processes used times the number of OpenMP threads per process. In every case, we present the breakdown of the total execution time into computation, communication and packing/unpacking. Of course, in this case, where only a single node is used, no actual transfers over the network occur; all requests are satisfied through shared memory. Finally, the speedup presented is calculated on the basis of the net computation time of a single process.

Figure 16 and Figure 17 show the performance of hybrid SpMV and Jacobi kernels, respectively, when executed on a single Extra Large node using various configurations. From these results we can draw the following observations:

  • Even though executing in a shared memory environment (single node), it is evident that OpenMP, the natural choice for parallelisation in shared memory platforms, does not provide the optimal performance, both for SpMV and Jacobi.
  • In general, optimal performance is achieved using configurations with many MPI processes, such as 32×2, 64×2, 64×1, 32×4, 128×1, etc. To be more precise, performance deteriorates notably when an MPI process hosts more than 8 OpenMP threads, and this is particularly evident for large numbers of threads and for the SpMV kernel. The reason for this behaviour is that an Extra Large node package contains 8 cores, which means that when an MPI process hosts more OpenMP threads than that number (i.e. 16, 32, 64, 128), the extra threads must be placed on a remote package, on the same board or, even worse, on a remote board. The remote threads, however, need to access data residing on the local NUMA node of the MPI process (i.e. the NUMA node where the MPI process was initially spawned). These accesses are more expensive compared to the ones made by the “local” threads, and they furthermore put additional pressure on the local NUMA node. This is the reason why the computation time for SpMV increases significantly for 16 threads or more. This phenomenon becomes more intense as the OpenMP threads of a single MPI process are spread across more packages. On the contrary, configurations with at most 8 OpenMP threads per MPI process guarantee that threads will always consume data on the local NUMA node. Furthermore, if MPI processes are placed in such a way that no two or more processes share the same NUMA node, they will have separate paths to main memory and hence higher aggregate bandwidth. In our experiments, this is true for configurations with 16 or fewer MPI processes, since the total number of packages contained in an Extra Large node is 16. We can observe the impact of this effect particularly for the case of SpMV, whose performance is greatly dependent on the available memory bandwidth. For matrices 22.m_t1 and 30.boneS10 we can see that performance slightly degrades for configurations with 32 MPI processes or more (compared to ones with fewer MPI processes and with 8 or fewer OpenMP threads), possibly due to memory contention issues.

    Jacobi is an intensely memory-bound application with insignificant data reuse, incurring a high rate of capacity misses in the last-level cache. Although a first sweep allows us to exclude all configurations with more than 4 OpenMP threads, the decision between 1, 2 or 4 OpenMP threads is harder (especially since communication time is, with the exception of the 128^3 test case, a very small fraction of total execution time). Intuitively, the isolation provided by the data decomposition achieved with MPI should result in more efficient cache utilisation, namely a better degree of temporal reuse and fewer capacity misses. With that in mind, a best practice for stencil codes would be to restrict each MPI process to at most 2 OpenMP threads. For the same reasons, Jacobi does not follow the SpMV behaviour for more than 16 MPI processes and continues to scale, as it is favoured by the further decomposition.

  • In general, we can conclude that for single-node runs, the hybrid model transparently offers NUMA-aware characteristics, allowing each MPI process and its OpenMP threads to work on adjacent data. This is due to a combination of features: the address space isolation naturally provided by MPI, the placement of every MPI process onto a separate socket using utilities such as taskset, and the first-touch policy of Linux, which allows each MPI process to allocate its (private) pages on its local NUMA node.
  • As a rule of thumb, it is good practice to map each MPI process, together with a moderate number of OpenMP threads (e.g. 4, 2 or 1), onto its own package. Provided that the application does not spawn more MPI processes than the total number of packages, this mapping can be enforced automatically by the Linux scheduler. This scheme usually gives optimal performance, particularly for memory-bound applications; a launch sketch is given below.
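
A minimal sketch of this mapping on one Extra Large node (16 packages of 8 cores each) is shown below: one MPI process with 4 OpenMP threads on every package, i.e. a 16×4 configuration. The launcher options follow the ccc_mprun/SLURM conventions described earlier in this guide, and the binary name is illustrative.

  export OMP_NUM_THREADS=4               # 4 OpenMP threads per MPI process
  ccc_mprun -n 16 -c 4 ./spmv_hybrid     # 16 processes, 4 cores reserved per process

With 4 cores reserved per process, the threads of each process are typically kept within a single package and therefore access mostly their local NUMA node.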

 

Figure 16. Performance of hybrid SpMV on a single Extra Large node

Figure 17. Performance of hybrid Jacobi on a single Extra Large node

9.2.5. Executing hybrid SpMV and Jacobi kernels on Thin nodes

On Curie Thin nodes, we conducted three sets of experiments in an effort to answer each of the following questions:

  1. How does pure MPI perform on a single Thin node?
  2. Given a specific number of cores (or, alternatively, a fixed budget of core-hours), what would be the optimal selection of those cores in terms of their topology?
  3. How do the kernels scale when the hybrid model is used, and under various combinations of MPI processes and OpenMP threads?

9.2.5.1. Pure MPI on a single machine

Figure 18 and Figure 19 present the performance behaviour of pure MPI on a single Thin node, for the SpMV and Jacobi kernels, respectively.

A first observation is that, even though MPI is not a natural choice for intra-node parallelisation, it generally provides fair scalability.

For the SpMV kernel, the fraction of time spent in communication varies across matrices. Even though communication in this experiment is accomplished solely through shared memory, the amount of data that needs to be exchanged between processes still determines the communication time. As explained in Section 9.2.1.3, “MPI implementation”, the sparsity pattern of matrices 11.wb-edu.graph and 14.rajat31 implies that a large amount of data needs to be communicated, and this translates to increased communication time, as shown in Figure 18. Since communication time cannot be reduced as efficiently as computation time with an increasing number of processes, mainly due to load balancing issues, the speedup achieved for matrices 11 and 14 is lower than for the others.

For the Jacobi kernel, computation time does not scale with the number of processes as efficiently as in SpMV, since Jacobi is a memory-bound application and the available memory bandwidth acts as a limiting factor. On the other hand, communication is completely balanced; it takes place only through shared memory and scales almost linearly, implying a very efficient protocol for intra-node MPI communication.

 

Figure 18. Performance of pure-MPI SpMV on a single Thin node

Figure 19. Performance of pure-MPI Jacobi on a single Thin node

9.2.5.2. Hybrid MPI-OpenMP – Best utilisation of a given core budget

In this series of experiments we evaluate different placements of the application processes/threads on a fixed number of cores spread over multiple nodes. More specifically, we assume that we have a certain number of cores (or core-hours) at our disposal and try to find out which arrangement of those cores gives the best performance. At one extreme, we could select a single core from each node, therefore utilising the maximum number of nodes. This implies a pure MPI scheme, which may suffer from high communication costs; from a resource management perspective it could also lead to inefficient resource utilisation (e.g. increased power consumption). At the other extreme, we could use all cores of every node, thus utilising the smallest possible number of nodes. This could introduce performance degradation due to memory contention or inefficient exploitation of the NUMA characteristics. We are looking for the best compromise between these two extremes. The results are shown in Figure 20 and Figure 21.

For SpMV, we assumed 64 cores in total, since beyond that number the kernel exhibits excessive communication times in some cases. The reason is the relatively small size of the input matrices: configurations with many MPI processes yield a very large number of small messages, which cannot be handled effectively by the InfiniBand interconnect. For Jacobi, we assumed 256 cores in total.
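
To make the notion of “arrangement” concrete, the same 64-core budget, here as a 16×4 configuration of MPI processes and OpenMP threads, can be spread over different numbers of Thin nodes (16 cores each). The sketch below follows the launcher conventions of this guide; the exact options and the binary name are illustrative and should be adapted to the actual job script.

  export OMP_NUM_THREADS=4
  ccc_mprun -n 16 -c 4 -N 4  ./spmv_hybrid   # 4 nodes, fully packed (16 cores in use per node)
  ccc_mprun -n 16 -c 4 -N 8  ./spmv_hybrid   # 8 nodes, half packed (8 cores in use per node)
  ccc_mprun -n 16 -c 4 -N 16 ./spmv_hybrid   # 16 nodes, one process (4 cores) per node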

From the results we can draw the following conclusions:

  • In contrast to the observations regarding SpMV in Section 9.2.4, “Executing hybrid SpMV and Jacobi kernels on Extra Large nodes”, where computation dominated the total time, it is clear that communication is now the dominant factor, regardless of the total number of nodes. The reason is that communication now has to be performed over the interconnection network, not through shared memory; in addition, as explained above, the size of the matrices is such that communication becomes the primary bottleneck. A further observation from Figure 20 is that, in 3 out of 4 matrices, communication overhead, and hence overall execution time, clearly depends on the total number of MPI processes, regardless of the total number of nodes employed; the smaller the number of MPI processes per node, the better the performance achieved, due to reduced communication time. Even though we have not been able to verify it, we suspect that this is due to some kind of bottleneck on the network card or the interconnect, which intensifies with the number of MPI processes per node.
  • In the Jacobi case, communication appears to be the erratic factor, exhibiting unpredictable spikes (computation behaves as expected: almost equal for 1, 2 and 4 OpenMP threads, higher for 8, and prohibitive for 16 threads). We can infer certain network properties, such as that the interconnection network is insensitive to the number of nodes involved and the distance between them, based on the fact that there is no significant increase in communication time from 16 nodes to 256 nodes. However, it is highly sensitive to traffic from other users of the machine. The spikes observed are most likely caused by messages being blocked and delayed on a network switch on their way to the destination node.
  • As a general conclusion, a small number of nodes can deliver performance equal to that of the largest possible number of nodes, assuming the same total number of cores in both cases. This behaviour is consistently observed for most cases, in both the SpMV and Jacobi kernels. Therefore, as a rule of thumb, it is advisable to spend a given core-hour budget by spreading the cores over an intermediate number of nodes. In any case, even if a single node is used, hybrid MPI-OpenMP implementations should be preferred over pure MPI ones, taking care not to spread the OpenMP threads of a process across multiple sockets.

 

Figure 20. Performance of hybrid SpMV for various deployment choices utilising 64 cores in total

Figure 21. Performance of hybrid Jacobi for various deployment choices utilising 256 cores in total

9.2.5.3. Hybrid MPI-OpenMP – Scalability performance

In this series of experiments we explore how performance scales in the hybrid implementations of the SpMV and Jacobi kernels, using multiple nodes and reaching large core counts: up to 256 cores for SpMV and up to 4096 cores for Jacobi. In SpMV, we increase the degree of parallelism by first filling all cores of a node and then adding more nodes. In Jacobi, on the contrary, we increase the parallelism by adding nodes from the very beginning. The results are shown in Figure 22 and Figure 23.
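
The two strategies differ only in how many nodes are requested for a given core count. A hedged sketch for 32 cores on Thin nodes, with illustrative options and binary names:

  # SpMV strategy: fill each 16-core Thin node before adding another one
  ccc_mprun -n 32 -c 1 -N 2  ./spmv_hybrid    # 32 processes packed onto 2 nodes

  # Jacobi strategy: bring in more nodes from the start, leaving them partly filled
  ccc_mprun -n 32 -c 1 -N 8  ./jacobi_hybrid  # 32 processes spread over 8 nodes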

While remaining within a single node, SpMV exhibits poor scalability in half of the cases, which is expected considering the memory-intensive nature of the kernel. When more nodes are added, however, scalability improves because computation time decreases while communication time stays at roughly the same level. This trend continues up to 64 or 128 cores in 3 out of 4 matrices; at larger core counts communication becomes a significant bottleneck. Again, this is primarily due to the relatively small size of the input matrices. On the contrary, matrix 11 does not exhibit this behaviour, because of its large size and the large communication volume associated with its sparsity pattern.

In the small test cases of Jacobi (128^3, 256^3) the speedup is not as high as expected. Although computation scales well, communication time decreases at a slow rate, then remains constant and finally increases, causing scalability to break down at 4096 processes. This abrupt increase in communication time is caused by contention among the very large number of messages arriving at each node, whose latency can no longer be hidden or overlapped, since message sizes are small and transmission times tiny. Moreover, due to the high communication-to-computation ratio, the communication overhead dominates overall performance. Both large test cases (512^3, 1024^3) demonstrate a very satisfactory speedup, with both computation and communication times scaling up to 4096 cores.

 

Figure 22. Scalability performance of hybrid SpMV up to 256 cores

Figure 23. Scalability performance of hybrid Jacobi up to 4096 cores


[1] See the definition of Amdahl’s law on Wikipedia.

[2] SOA = structure of arrays; AOS = array of structures.

[3] Created using LibreOffice Calc, for example.

[4] The compute intensity is defined as C = number of floating-point operations / number of memory accesses (reads + writes).

[5] Single Instruction Multiple Threads