Best Practice Guide – JUGENE

Florian Janetzko


Gilles Civario


Maciej Cytowski


Maciej Szpindler


Stefanie Janetzko


Alexander Schnurpfeil


Brian Wylie


Huub Stoffers


19 June 2012

Table of Contents

1. Introduction
2. System Architecture and Configuration
2.1. Processor architecture
2.2. Building blocks
2.2.1. Chip
2.2.2. Compute card
2.2.3. Node card
2.2.4. Rack
2.2.5. Special nodes
2.3. Operating system
2.4. Memory architecture
2.5. Networks
2.5.1. Three-dimensional torus: point-to-point network
2.5.2. Global tree and collective network
2.5.3. Global interrupt network
2.5.4. Functional or storage-access network
2.5.5. Control network
2.6. I/O subsystem
2.6.1. I/O nodes
2.6.2. Shared file system (GPFS/NSD) servers
2.6.3. Dealing with contention on the functional network
2.7. File systems
2.7.1. WORK
2.7.2. HOME
2.7.3. ARCH
2.7.4. Performance of file systems
2.8. Further reading, information and references
3. System Access
3.1. Application for an account
3.2. How to contact FZJ-Dispatch
3.3. Access to JUGENE
3.3.1. Key generation
3.3.2. Upload of the public key
3.3.3. Login to JUGENE
3.4. Further reading, information and references
4. Production Environment
4.1. Accounting
4.1.1. Monthly quota
4.1.2. Billing formula
4.1.3. Overdrawn quota
4.1.4. Querying quota status
4.1.5. Querying project status by the project leader
4.2. Module environment
4.3. Running and monitoring jobs
4.3.1. SMP mode
4.3.2. DUAL mode
4.3.3. Virtual Node mode
4.3.4. Mode selection
4.3.5. Running and monitoring batch jobs with LoadLeveler Job classes and resource reservations Job submission and monitoring Job command file (job script)
command Job command file example for pure MPI codes Job command file example for hybrid MPI/OpenMP codes
Running interactive jobs with
The monitoring tool
4.4. Further reading, information and references
5. Programming Environment/Basic Porting
5.1. Compilers
5.1.1. Available compilers
5.1.2. Compiler flags XL Compiler family GNU Compiler Collection
5.2. Available libraries
Engineering and Scientific Subroutine Library (ESSL)

5.3. MPI
5.3.1. MPI implementation Compiling MPI applications Running MPI applications Exceptions to the MPI standard MPI extensions for Blue Gene/P
5.4. OpenMP
5.4.1. Compiler flags
5.4.2. Using OpenMP in different execution modes
5.5. Further reading, information and references
6. Performance Analysis
6.1. Available performance analysis tools
6.1.1. gprof gprof examples
6.1.2. Performance Application Programming Interface (PAPI) Further information
6.1.3. Scalasca Instrumenting your code Analyzing (running) the instrumented code Examine Scalasca analysis results Refine the measurement configuration Scalasca example Further information
6.1.4. Tau Further information
6.1.5. Vampir Further information
6.2. Further reading, information and references
7. Tuning Applications
7.1. Single-core optimization
7.1.1. Advanced compiler options and optimization strategies Advanced compiler flags Diagnostic compiler flags Making efficient use of the double FPU (“double hummer”) Optimization strategies
7.2. Advanced environment and MPI tuning
7.2.1. Environment variables
7.2.2. Network
7.2.3. Shape
7.2.4. Mapping Explicit mapfiles
7.3. Advanced OpenMP usage
7.3.1. Environment variables
7.3.2. Thread affinity
7.4. Hybrid programming
7.4.1. Optimal tasks / threads strategy
7.5. I/O optimization
7.5.1. General guidelines Using the right file system Adhere to block size and pay attention to block alignment Optimize hierarchical I/O schemes by distributing the load
over all available I/O nodes Consider not to use a hierarchical scheme Revise I/O of ported applications if its design assumes node
local storage
7.5.2. HDF5
7.5.3. MPI I/O
7.5.4. netCDF
7.5.5. SIONlib
7.6. Advanced job command language
7.6.1. Using multiple job steps
7.6.2. Job command file variables
7.6.3. Run-time environment variables
7.7. Further reading, information and references
8. Debugging
8.1. Compiler flags
8.1.1. Debugging options of the compiler
8.1.2. Compiler flags for using debuggers
8.2. Available debuggers
8.2.1. DDT Running DDT on JUGENE
8.2.2. TotalView Using TotalView interactively Using TotalView in batch mode
8.3. Analyzing core dumps
Core dump file analysis using
8.3.2. Core dump file analysis using debuggers DDT TotalView
8.4. Further reading, information and references
Bibliographic references

1. Introduction

The IBM Blue Gene/P system JUGENE (Jülich Blue Gene/P) is hosted
by the Gauss Centre for Supercomputing (GCS) at the Forschungszentrum
Jülich (FZJ) in Germany. It was installed as a 16-rack system in
October 2007 and was No. 2 in the Top500 list in November 2007. It was
one of the prototype systems in the PRACE Preparatory Phase project.
After an upgrade to a 72-rack system in 2009, JUGENE became Europe’s
first Peta-FLOP/s supercomputer, and on July 1, 2010 it became the
first Tier-0 system available in the PRACE project for successful
PRACE resource applications.

Figure 1. IBM Blue Gene/P JUGENE hosted by Forschungszentrum Jülich.



This best practice guide provides information about JUGENE to
enable users of the system to achieve good performance with their
applications. The guide covers a wide range of topics: a detailed
description of the hardware, the basic production environment
(including how to log in and the accounting procedure), porting
applications and submitting jobs, and tools and strategies for
analyzing and improving the performance of applications.

Information about JUGENE is available online:

User information, Software, FAQs

Message of the day for JUGENE

2. System Architecture and Configuration

The Blue Gene/P has a massively parallel supercomputer
architecture with different types of nodes and networks. In total
JUGENE has 72 racks and contains 73,728 compute nodes or 294,912
cores.

This section describes the specific hardware and the configuration of
the Blue Gene/P JUGENE, from the processor architecture through the
different networks to the file systems which are available. Much of
the information in this section is taken from the Redbook
IBM System Blue Gene Solution: Blue Gene/P Application Development
(RedBook AD).

2.1. Processor architecture

JUGENE has four types of nodes with partly different hardware
and software characteristics. There are two different processor
architectures used in the different nodes.

Login nodes and service nodes of JUGENE contain IBM 64-bit Power6
processors (p6 550) with 4.2 GHz.

The microprocessor of JUGENE’s compute nodes and I/O nodes
is a PowerPC 450, a Book E compliant, 32-bit microprocessor with a
clock speed of 850 MHz. The PowerPC 450 microprocessor, with its
double-precision floating-point multiply add unit (double FPU), can
deliver four floating-point operations per cycle, which corresponds
to 3.4 Giga-FLOP/s per core.

2.2. Building blocks

The Blue Gene/P system contains the following components: chip,
compute card, node card, rack. These building blocks are illustrated
in Figure 2.

Figure 2. Building blocks (chip, compute card, node card and rack) of
JUGENE system


2.2.1. Chip

The Blue Gene/P processor chip contains four PowerPC 450 cores
(quad-core chip) where each core has a double-precision floating-point multiply
add unit (double FPU). The processor chip has a peak performance of 13.6 GFLOP/s.
It is sometimes referred to as
Each core has 64 kB private L1 cache (32 kB data and 32 kB
instruction cache) and an L2 prefetch cache of 14×256 bytes.

2.2.2. Compute card

The compute card contains the quad-core processor chip with 8 MB
shared L3 cache, 2 GB of shared DDR2 SDRAM and connectors for
and for power supply. It is also referred to as a
compute node.
Compute nodes have no local file system; they must route I/O
operations to an external device via I/O nodes.

2.2.3. Node card

32 compute cards are collected into one node card.
Up to two I/O cards can be added optionally. The node cards are

2.2.4. Rack

32 node cards are collected into one rack.
Half a rack (16 node cards) is called a midplane
and is the smallest building block for the TORUS network.
One rack contains 1,024 compute nodes or 4,096 cores.
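
The node, core and peak-performance figures quoted in this section follow directly from the building blocks. The short Python sketch below is only an illustration of that arithmetic; the constant and function names are made up for this example, and the result is a theoretical peak, not a measured value.

```python
# Theoretical peak performance of JUGENE, using only figures quoted
# in this guide. All names are illustrative, not part of any tool.
CLOCK_HZ = 850e6            # PowerPC 450 clock speed
FLOPS_PER_CYCLE = 4         # double FPU: four operations per cycle
CORES_PER_CHIP = 4          # quad-core chip, one chip per compute card
NODES_PER_NODE_CARD = 32    # compute cards per node card
NODE_CARDS_PER_RACK = 32
RACKS = 72


def peak_flops(cores: int) -> float:
    """Theoretical peak in FLOP/s for the given number of cores."""
    return cores * CLOCK_HZ * FLOPS_PER_CYCLE


nodes_per_rack = NODES_PER_NODE_CARD * NODE_CARDS_PER_RACK
total_nodes = nodes_per_rack * RACKS
total_cores = total_nodes * CORES_PER_CHIP

print(f"core:   {peak_flops(1) / 1e9:.1f} GFLOP/s")
print(f"chip:   {peak_flops(CORES_PER_CHIP) / 1e9:.1f} GFLOP/s")
print(f"rack:   {peak_flops(nodes_per_rack * CORES_PER_CHIP) / 1e12:.1f} TFLOP/s")
print(f"system: {peak_flops(total_cores) / 1e15:.2f} PFLOP/s "
      f"({total_nodes} nodes, {total_cores} cores)")
```

The system value of roughly one PFLOP/s is consistent with the description of the 72-rack JUGENE as a Peta-FLOP/s machine.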

2.2.5. Special nodes

As already mentioned, JUGENE possesses three further types of
nodes besides the compute nodes which were described above. Two of
these types are relevant for users:

  • I/O nodes:
    The hardware is identical to that of the compute nodes. It is not
    possible to log in to either the compute nodes or the I/O nodes.
    The I/O nodes take care of the I/O to external devices (file
    systems). JUGENE possesses 600 I/O nodes in total. Details can be
    found in Section 2.6.1.
  • Login nodes:
    JUGENE has two front-end or login nodes, each with eight Power6
    cores (4.2 GHz) and 32 GB of memory.

2.3. Operating system

The JUGENE compute nodes run a lightweight, proprietary
operating system referred to as the Compute Node Kernel (CNK). The
CNK is a Linux-like 32-bit operating system supporting a large subset
of Linux compatible system calls as well as threads and dynamic
linking. The CNK redirects all of the file system load and network
traffic to the I/O node. As a result, the CNK interferes very little
with the applications that run on the compute nodes.

The I/O nodes provide access to external devices. All I/O
requests are routed through these nodes. The nodes run an optimized
version of the Linux operating system.

The login nodes (front-end nodes) provide the working
environment for the users with a SuSE Linux (SLES 10, 64 bit)
full-featured operating system.

2.4. Memory architecture

The Blue Gene/P system has a distributed-memory architecture
which includes an on-chip cache hierarchy and an off-chip memory. It
contains optimized on-chip symmetrical multiprocessing (SMP) support
for locking and communication between the four processors (cores) of
the compute node. The aggregate memory of the system is distributed
without any hardware sharing between nodes. The total amount of
physical memory per JUGENE compute node is 2 GB and the memory is
laid out as a single, flat and fixed-sized virtual address space
shared between the operating system kernel and the user application.

The first level (L1) cache is contained in the PowerPC 450
core. The PowerPC 450 L1 cache is 64-way set-associative. The
prefetching has been disabled for performance reasons. The
L1 cache line size is 32 bytes.


  • The usage of the FPUs (floating-point units) and FPU
    registers can be optimized using the XL compiler directives or
    assembler instructions.
  • To benefit from the SIMD (single instruction multiple data)
    instructions, data must fit in the L1 cache.

The second level (L2R and L2W) caches are fully associative and
coherent. They act as prefetch and write-back buffers for L1 data.
The L2 cache line is 128
bytes in size. Each L2 cache has one connection toward the L1
instruction cache running at full processor frequency. Each L2 cache
also has two connections toward the L1 data cache, one for the writes
and one for the loads, each running at full processor frequency. Read
and write are 16 bytes wide.

The third level (L3) cache is 8-way set associative, 8 MB in
size, with 128-byte lines. Both banks can be accessed by all
processor cores. The L3 cache has three write queues and three read
queues: one for each processor core and one for the 10 Gigabit
network. Ethernet and direct memory access (DMA) share the L3 ports.
Only one unit can use the port at a time.

2.5. Networks

The JUGENE system uses five different networks dedicated to
tasks and functionalities of the machine. Each network has a
different structure and topology of connections between the nodes of
the system.

2.5.1. Three-dimensional torus: point-to-point network

The torus network is used for general-purpose, point-to-point
message passing and multicast operations to a selected subset of
nodes. The topology is a three-dimensional torus constructed with
point-to-point, serial links between routers embedded within the
Blue Gene/P nodes. A three-dimensional torus topology is available
to the user application only if the assigned partition of the
machine is a multiple of a midplane. In that case, each of the nodes
is connected to six nearest neighbors. The target hardware bandwidth
for each torus link is 425 MB/s in each direction of the link, for a
total of 5.1 GB/s bidirectional bandwidth per node. The
three-dimensional torus network has a hardware latency between
100 ns and 800 ns.

2.5.2. Global tree and collective network

The global collective network is a high-bandwidth, one-to-all
network used for collective communication operations, such as
broadcast and reductions, and to move process and application
data from the I/O nodes to the compute nodes.
Each compute and I/O node has three links to the global collective
network at 850 MB/s per direction, for a total of 5.1 GB/s
bidirectional bandwidth per node. Latency on the global collective
network is less than 2 µs from the bottom to the top of the
collective, with an additional 2 µs latency to broadcast to all.
2.5.3. Global interrupt network

The global interrupt network is a separate set of wires based
on asynchronous logic, which forms another network that enables
fast, low latency signalling of global interrupts and barriers.
Round-trip latency to perform a global barrier over this network for
a 72 K node partition is approximately 1.3 µs.

2.5.4. Functional or storage-access network

The 10 Gb Ethernet network consists of all I/O nodes
that are connected to a 10 Gb Ethernet switch. The compute nodes
are not directly connected to this network. All traffic is passed
from the compute node over the global collective network to the I/O
node and then onto the 10 Gigabit Ethernet network. A more detailed
description can be found in Section 2.6.

2.5.5. Control network

The control network is used for system boot, debug, and
monitoring. It provides run-time non-invasive service support as
well as non-invasive access to performance counters.

2.6. I/O subsystem

The I/O subsystem of JUGENE is a system of three hardware layers
and the connectivity layers between them, shown schematically in
Figure 3. Within the Blue Gene/P itself there are a number of
I/O nodes. These constitute the interface of the I/O subsystem to
JUGENE’s computational resources. The I/O nodes act as shared file
system clients. They do not contain any storage components
themselves, nor do they connect to dedicated storage components that
could be considered local to any particular I/O node.

For all file systems to which users have read and write access, the
IBM General Parallel File System (GPFS) technology is used. A cluster
of file servers presents Network Storage Devices (NSD) to their
clients, performs meta-data operations and is responsible for the
integrity of the shared file system. NSDs are the distributed
shareable storage units from which GPFS file systems are built. They
use logical units (LUN) that are served by a storage back-end of
storage controllers and disk enclosures. The storage controllers use
hardware RAID technology to pack several physical disks into an
aggregate LUN with enhanced performance as well as better protection
against failure than each of the individual physical disks involved
can provide on its own.

Figure 3. The three layers of the I/O subsystem schematically


The fibre channel connectivity layer consists entirely of
cabling and contains no switching hardware, while the file server
network connectivity layer is implemented with switching technology.

2.6.1. I/O nodes

The I/O nodes of the Blue Gene/P are nodes dedicated to I/O and do
not permit the user to run any tasks of the user’s parallel
application. Each I/O node is connected via a 10 Gigabit Ethernet
adapter to a network over which it accesses the GPFS shared file
systems. This network, named in Figure 3 as the file serving network
connectivity layer, is the functional network.
The compute nodes, which run the application processes, do not have a
connection to the functional network. Nevertheless, a user process
can simply use I/O related system calls like chdir(), open(),
write(), or even socket(). “Under the hood” the CNK forwards file
system and socket related operations to an I/O node using the
global tree network.
This service is provided automatically and transparently by the CNK.

On the I/O node the forwarded operations are handled by the
Control and I/O daemon (CIOD), which subsequently hands them to the
appropriate component of the GPFS client software. The typical path
travelled by data which is handed to and returned by I/O functions
on a compute node is presented schematically in Figure 4.

Figure 4.  I/O forwarding, function shipment, in more detail. The
bi-directional red arrows show the typical path travelled by data
associated with the function calls issued by an application process
running on a compute node.


The assignment of a set of compute nodes to their dedicated I/O node
in the tree topology is static. It is determined by the number of I/O
nodes plugged into a node board and the position of node boards with
I/O nodes in a midplane. The set of compute nodes that share the
same I/O node is usually called a processor set (pset).
Figure 5 shows a pset from the perspective of an I/O node on top.

Figure 5.  I/O node with some of the topologically nearest compute
nodes in its pset


The 10 Gb/s bandwidth of the I/O node’s Ethernet interface
together with the characteristics of the I/O nodes themselves is
rate limiting for the I/O.


Asynchronous I/O is not supported on Blue Gene/P. For example, if
asynchronous MPI I/O routines (MPI_Iwrite ...) are used,
the code will run as if the corresponding blocking routines were
used.
JUGENE’s 600 I/O nodes are distributed across the system in the
following way:

  • 71 racks have 4 I/O nodes per midplane (one I/O node for 128
    compute nodes, that means a pset contains 128 compute nodes)
  • 1 rack (R87) has 16 I/O nodes per midplane (one I/O node for
    32 compute nodes, that means a pset contains 32 compute nodes).
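
The total of 600 I/O nodes and the two pset sizes follow from the rack layout. The sketch below only re-derives these numbers; the function names are invented for this example, and it uses the figures from Section 2.2 (two midplanes per rack, 16 node cards of 32 compute nodes per midplane).

```python
MIDPLANES_PER_RACK = 2
NODES_PER_MIDPLANE = 16 * 32   # 16 node cards x 32 compute nodes


def io_nodes(racks: int, io_nodes_per_midplane: int) -> int:
    """Number of I/O nodes in a group of identically equipped racks."""
    return racks * MIDPLANES_PER_RACK * io_nodes_per_midplane


def pset_size(io_nodes_per_midplane: int) -> int:
    """Compute nodes served by one I/O node."""
    return NODES_PER_MIDPLANE // io_nodes_per_midplane


total_io_nodes = io_nodes(71, 4) + io_nodes(1, 16)

print(f"I/O nodes in total:        {total_io_nodes}")
print(f"pset size, standard racks: {pset_size(4)}")
print(f"pset size, rack R87:       {pset_size(16)}")
```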

The following table summarizes the maximum and average I/O
performance for a rack with 4 I/O nodes per midplane.

Table 1. Theoretical maximum and average available bandwidths at
various levels of aggregation of computational resources

System unit     Max. bandwidth   Avg. bandwidth
midplane        3.00 GB/s        2.00 – 3.00 GB/s
pset            0.75 GB/s        0.5 – 0.75 GB/s
compute node    850 MB/s         10 MB/s
single core     850 MB/s         2.5 MB/s


2.6.2. Shared file system (GPFS/NSD) servers

The GPFS clients on JUGENE’s I/O nodes are served by a cluster
of 28 IBM p6-570 servers, each equipped with 8 cores. Of these, 4
servers are exclusively dedicated to storing GPFS meta-data of all
file systems. The remaining 24 serve as data NSDs belonging to one of
the three file system (classes):


All servers are equipped with identical resources to access storage
capacity on one side, and to communicate with their GPFS clients on
the other side:

  • 8 x 4 Gb/s Fibre Channel (FC4) host adapter ports
  • 4 x 10 Gb/s Ethernet links into the functional network of
    the Blue Gene/P

The maximum aggregate bandwidth of the cluster of 24 data servers on
the Blue Gene/P functional network is 120 GB/s. The maximum
aggregate bandwidth of the server cluster on the side of the fibre
channel adapters that connect to the storage boxes is 96 GB/s.
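
Both aggregate figures can be reproduced from the per-server resources listed above. The following sketch is only a check of that arithmetic; the function name is made up for this example and 8 bits per byte is the only conversion used.

```python
DATA_SERVERS = 24   # the 4 meta-data servers are not counted here


def aggregate_gbytes_per_s(servers: int, links_per_server: int,
                           gbit_per_link: float) -> float:
    """Aggregate bandwidth in GB/s of all links of all servers."""
    return servers * links_per_server * gbit_per_link / 8.0


ethernet_side = aggregate_gbytes_per_s(DATA_SERVERS, 4, 10)   # 4 x 10 Gb/s
fibre_channel_side = aggregate_gbytes_per_s(DATA_SERVERS, 8, 4)  # 8 x 4 Gb/s

print(f"functional network side: {ethernet_side:.0f} GB/s")
print(f"fibre channel side:      {fibre_channel_side:.0f} GB/s")
```

The storage side is the narrower of the two, so it bounds what the data servers can deliver even if the network side is fully used.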

2.6.3. Dealing with contention on the functional network

The functional network of JUGENE consists of four fabrics, each
with a Force10 E1200i switch with 224 10 Gb/s Ethernet ports. Even
in these switches the overall bandwidth on all ports still exceeds
the capacity that the backplane and forwarding engine can handle.
Groups of four adjacent switch ports share a common path onto the
backplane and into the forwarding engine. Members of the same 4-port
group are competing for the same bandwidth whenever two or more of
them are concurrently handling traffic. So, theoretically there is a
4:1 oversubscription on each of the four fabrics.

The following port allocation strategy was developed and
applied to reduce the probability that interfaces will actually be
contending for the same bandwidth:

  • Every GPFS/NSD server is directly linked to each of the four
    fabrics by plugging one of its interfaces into the first port of a
    4-port group. This guarantees that no two GPFS/NSD server
    interfaces within the same fabric are ever competing for the same
    bandwidth.
  • The four I/O nodes of a midplane are each hooked up to a
    different fabric. Thus, even the smallest jobs have the aggregate
    bandwidth of all four fabrics at their disposal.
  • I/O nodes of neighboring midplanes in the same fabric are
    distributed over different 4-port groups of the switch.

To better grasp how the fairly balanced scheme of port allocation
nonetheless allows considerable differences in the I/O bandwidth at
the level of the fabric switch between identically sized jobs,
Figure 6 presents a logical model of the port allocation on such a
switch. Each of the 224 ports is represented by a square, and the
squares are grouped in 56 groups. 142 unmarked green squares
represent the connections of the midplanes of the production
environment. Spreading them evenly across the available groups
results in 30 groups with 3 midplane connections and 26 groups with
2 midplane connections: (30 x 3) + (26 x 2) = 142. The light green
squares, marked a and b, represent the connections of the I/O nodes
of the deviant rack (R87) described in Section 2.6.1.
It shows that the bandwidth of these midplanes is also four times
that of normal midplanes at the switch level. The yellow squares
represent the GPFS server connections. The darker yellow ones,
marked M1 – M4, represent the meta-data servers. The grey squares,
marked with a question mark, are not necessarily empty. Some may
also connect a front end node or some auxiliary systems not directly
related to JUGENE at all. Some hypothetical jobs, A, B, and C,
having a certain number of midplanes at their disposal, have been
added to the picture. Other jobs would normally be running
concurrently as well, using other midplanes, but are not shown.
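
The split into groups with 3 and groups with 2 midplane connections is what an even distribution of 142 connections over 56 port groups produces. The sketch below reproduces that split; the function name is invented for this example and the code says nothing about which physical ports are actually used.

```python
PORTS_PER_SWITCH = 224
PORTS_PER_GROUP = 4


def spread_evenly(connections: int, groups: int) -> dict:
    """Map 'connections per group' to 'number of such groups'."""
    base, extra = divmod(connections, groups)
    # 'extra' groups receive one connection more than the others.
    return {base + 1: extra, base: groups - extra}


groups = PORTS_PER_SWITCH // PORTS_PER_GROUP
distribution = spread_evenly(142, groups)

for per_group, count in sorted(distribution.items(), reverse=True):
    print(f"{count} groups with {per_group} midplane connections")
```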

Figure 6.  Logical model of port allocation on a fabric switch with 56
4-port groups. Each of the 56 groups has a 10 Gb/s bandwidth
channel that is shared among its 4 members. The port allocation to
I/O nodes, servers, front end nodes etc. is static and hard wired.
The allocation of jobs to midplanes, and hence to the ports used by
the I/O nodes of these midplanes, is dynamically determined by a


Jobs A and C both utilize 16 midplanes. But at the switch level
the bandwidth of job A is much more likely to be “squeezed” than
that of job C. The peak bandwidth that job A can achieve is about 80
GB/s since it has only 8 channels at its disposal. Job C has 16
channels, hence its peak bandwidth at the switch is twice that of
job A. Job C and job B would no doubt sometimes be contending for
the bandwidth of channels 33 – 38. But such contention would be
incidental, since the jobs are not likely to be continuously engaged
in I/O. In contrast, the I/O of the tasks within a job is very
likely to be orchestrated and more or less simultaneous. So the
tasks of job A are likely to be contending among themselves for the
bandwidth of channels 6 – 13 when they simultaneously engage in I/O.
2.7. File systems

On JUGENE three file systems for user files are available: WORK,
HOME, and ARCH. In the user environment three corresponding shell
environment variables are set to directories (entry points in these
file systems) where the user has read and write access:

  • $WORK denotes the absolute path to the user’s scratch directory
  • $HOME denotes the absolute path to the user’s home directory
  • $ARCH denotes the absolute path to the user’s archive directory

Users are strongly recommended to address file system locations
solely in terms of these variables. Please note that all file systems
are shared GPFS file systems. Consequently, $WORK, $HOME, and $ARCH
are to be regarded as absolute pointers into a unified namespace that
is shared by all nodes. So, for example, a given path below $WORK
will denote exactly the same file system location to all processes of
a user, irrespective of the node they happen to run on.
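
A script that follows this recommendation never hard-codes a file system path but builds every location from one of the three variables. The Python sketch below shows one way to do this; the helper name user_path and the file names in the usage comment are hypothetical.

```python
import os
from pathlib import Path


def user_path(variable: str, *parts: str) -> Path:
    """Build a path below $WORK, $HOME or $ARCH.

    Raises RuntimeError if the variable is not set, instead of
    silently falling back to a hard-coded location.
    """
    root = os.environ.get(variable)
    if not root:
        raise RuntimeError(f"environment variable ${variable} is not set")
    return Path(root).joinpath(*parts)


# Usage (directory and file names are examples only):
#   scratch_file = user_path("WORK", "run_001", "checkpoint.dat")
#   input_file = user_path("HOME", "inputs", "parameters.txt")
```

Because the namespace is shared by all nodes, a path built this way can be passed unchanged between processes of the same user.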

2.7.1. WORK

WORK is primarily intended as a “scratch space” for jobs. It is
implemented as a single file system with a net total storage
capacity of 2.4 PB. WORK is the best file system to use for large
scale and demanding I/O. So in many cases it is also the recommended
file system to store the end results of jobs. If a job’s end results
have to be kept for a longer time, they will have to be saved to
$HOME or $ARCH.


  • Files that are older than 90 days will automatically be deleted
  • There is no backup service in place for this file system

Limits are enforced on a per group basis, both for the amount
of data and for the number of files. In this context anything that
occupies an I-node (a separate meta-data record in the file system)
is counted as a “file”: regular files, directories, and symbolic
links. The per-group data limit is 20 TB. The per-group I-node limit
is 4

2.7.2. HOME

HOME is intended as a general repository of user resources:
source codes, binaries, libraries, files with job input parameters,
job outputs that are consulted on a regular basis, etc. Currently
there are three file systems in this class: one with a total net
storage capacity of 600 TB, and two file systems with a net storage
capacity of 300 TB. Despite the difference in size, all three file
systems are comparable in performance. They are configured for
somewhat less demanding I/O requests and somewhat smaller file sizes
than the WORK file system.

A daily incremental backup service is in place, creating
backups of all new and recently modified files. The backup service
will skip files that are open and in the process of being modified
by other processes while the backup service is running. But a file
that has resided unmodified for at least 24 hours on a home file
system will have a backup copy. Limits are enforced on a per group
basis both for data and number of I-nodes. The per group data limit
is 6 TB. The per-group I-node limit is 2 million.

2.7.3. ARCH

ARCH is intended as an archive facility. Job output results that
have to be kept for some time (at most for the life time of the
user’s project) may very well become too large to be kept on the
HOME file system. Basically all files that are to be kept but will
not be actively used for some time can be moved to the archive.
Currently there are three archive file systems. They are identical
in terms of capacity and performance. Automated migration policies
are in place on all archive file systems. Files that have resided on
an archive file system for a while will have been migrated to tape.
A daily incremental backup service is in place. If a file is
migrated to tape, an independent backup copy, residing on another
tape, exists as well.

Limits are enforced on a per group basis for the number of
I-nodes. The per-group I-node limit is 2 million. There is no
automatic enforcement of a data limit on archive file systems. PRACE
projects are expected to respect a data limit of 20 TB.


Users with sets of many small files are urged to organize such
files into sets that are put in a single container file and to
subsequently put only the container file in the archive. This
significantly reduces the number of I-nodes within the archive file
systems and thus will have a very beneficial effect on the overall
performance of many routines that handle meta-data. Furthermore, it
takes about 2 minutes per file to retrieve data from tape. Thus,
trying to retrieve several thousand files from the archive is
simply impossible.
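
A rough estimate shows why retrieving many small files is impractical. The sketch below assumes that files are retrieved one after another at the quoted 2 minutes per file; that sequential behaviour is an assumption made for this illustration, not a statement about how the tape back-end schedules requests.

```python
MINUTES_PER_FILE = 2   # figure quoted above


def retrieval_days(n_files: int) -> float:
    """Estimated retrieval time in days, assuming sequential retrieval."""
    return n_files * MINUTES_PER_FILE / (60 * 24)


for n_files in (1, 720, 5000):
    print(f"{n_files:5d} files: about {retrieval_days(n_files):.2f} days")
```

Under this assumption 720 files already take a full day, whereas a single container file holding the same data counts as one file.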

Optionally the container file can also be compressed. The size
of the (compressed) container file should not exceed 1 TB because of
considerations concerning the efficiency of the tape back-end. The
tar command is the most appropriate tool to create such container
files. It also has options to list files in a tar archive, extract
files, etc. Please consult the online tar(1) manual page for details.
2.7.4. Performance of file systems

The performance of GPFS file systems is summarized in
Table 2.

Table 2. File system performance

File system   File system bandwidth   GPFS block size   # midplanes
WORK          34.0 GB/s               4 MB              16
HOME           8.5 GB/s               1 MB              4
ARCH           2.8 GB/s               2 MB              1


The aggregate maximum sustained performance for each file
system is shown in column 2. Column 3 lists the GPFS block size. The
last column shows the minimum number of midplanes which a job
must use in order to mobilize the full potential of the file system
with I/O requests from the client side.

RAID5 and RAID6 technology tend to have slightly better read
than write performance. Writing is also much more sensitive to
having a non-optimal file system block size. The figure in
column 2 is probably too optimistic for write access to the ARCH file
system. From a practical point of view, file system bandwidth matters
mainly when something is put into the archive. When retrieving data
from the archive, many files tend to be on tape and have to be
restored to disk first. This process will take about 3 hours,
due to limitations of the tape back end rather than to limitations of
the archive file systems.

The GPFS block size of a file system is also included
because it is a very important factor in tuning the I/O of an
application. To get the best performance out of the GPFS file system
the I/O request size should be a multiple of this block size and
address block-aligned sections of a file.
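
One way to follow this guideline when several tasks write disjoint parts of one file is to give each task a contiguous region that starts on a block boundary and spans a whole number of blocks. The sketch below illustrates such a partitioning; the function name is invented for this example, and it assumes that the block sizes in Table 2 are binary units (4 MB taken as 4 x 1024 x 1024 bytes).

```python
MIB = 1024 * 1024


def rank_region(rank: int, n_ranks: int, total_bytes: int, block_size: int):
    """Return (offset, size) of the file region written by task `rank`.

    Every offset is a multiple of block_size. Only the last non-empty
    region may be shorter than a whole number of blocks.
    """
    n_blocks = -(-total_bytes // block_size)       # ceiling division
    blocks_per_rank = -(-n_blocks // n_ranks)      # ceiling division
    offset = rank * blocks_per_rank * block_size
    size = max(0, min(blocks_per_rank * block_size, total_bytes - offset))
    return offset, size


# Example: 4 tasks share a file of 40 MiB + 1 byte on a 4 MiB block size.
block = 4 * MIB
for rank in range(4):
    offset, size = rank_region(rank, 4, 40 * MIB + 1, block)
    print(f"rank {rank}: offset {offset // MIB:3d} MiB, size {size} bytes")
```

The same offsets could then be used with whatever I/O interface the application employs; the partitioning itself is independent of it.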

The numbers given are only rough estimates based on the
following assumptions:

  • To deliver its maximum bandwidth to the application, the
    back end must be addressed in a balanced way.
  • To have a fair chance of doing that, the application must
    have at least the same nominal bandwidth, evenly distributed over
    the four fabrics of the network, as the collective of servers
    that serve the file system.
  • For reasons explained in Section 2.6.3
    there will be some effect of decreasing marginal returns when more
    than one midplane is used. To compensate for that in a very crude
    way, the estimate is set at 1.25 x the nominal value of the
    network bandwidth when more than a single midplane is used.
3. System Access

3.1. Application for an account

Members of an approved PRACE project apply online
for their accounts on JUGENE.

For authentication an email-callback procedure starts.
Afterwards, the applicant is asked for the ID assigned to the project
by FZJ (this has been made known to the project leader by
FZJ-Dispatch by mail; the ID usually starts with the prefix pr).
After filling in the project ID, the applicant is asked for his or
her personal data and any further information needed. During this
procedure an ssh key has to be uploaded; detailed information about
creating such a key can be found in Section 3.3.
Finally, the application has to be submitted electronically.
The printed version also has to be sent to FZJ-Dispatch after it has
been signed by the applicant and the project leader. The applicant
will be informed by mail as soon as the account has been created on
the system.

3.2. How to contact FZJ-Dispatch

For questions applicants can contact FZJ-Dispatch. The application
for an account must be sent by FAX or by postal mail:

Forschungszentrum Jülich GmbH
Jülich Supercomputing Centre, Dispatch
phone: +49 2461 61 5642
fax: +49 2461 61 2810

3.3. Access to JUGENE

Users can log in to JUGENE on the so-called front-end or login
nodes only. Interactive access to the I/O or compute nodes is not
possible. JUGENE has two different login nodes named jugene3 and
jugene4 which may be addressed with

The front-end nodes have identical environments, but multiple
sessions of one user may reside on different nodes, which must be
taken into account when killing processes.
Access is only possible by using ssh connections with keys. The
ssh key pair can be generated on the system from which you want to
access JUGENE.

3.3.1. Key generation


In order to generate a key pair open a shell and use the
following command

ssh-keygen -b 2048 -t rsa

You are asked for a file name and location where the key should be
saved. Simply accept the default by pressing the Enter key. This will
generate the ssh key in the .ssh directory of your home directory
($HOME/.ssh). Next, you are asked for a passphrase. Please choose a
secure passphrase. It should be at least 8 characters long and
should contain numbers, letters and special characters.


You are
allowed to leave the passphrase


You can generate the key pair using for example the PuTTYgen tool,
which is provided by the PuTTY project. Start PuTTYgen, choose
SSH-2 RSA at the bottom of the window, set the ‘number of bits in
the generated key’ to 2048 and press the ‘Generate’ button. PuTTYgen
will prompt you to generate some randomness by moving the mouse over
the blank area. Once this is done, a new public key will be
displayed at the top of the window. Enter a secure passphrase. It
should be at least 8 characters long and should contain numbers,
letters and special characters.


You are
allowed to leave the passphrase

Save the public and the
private key. We recommend to use ‘’ for the public and
‘id_rsa’ for the private part.

3.3.2. Upload of the public key

The contents of the files key must be uploaded through
the JCS online WEB interface when applying for an account. The
public key can also be uploaded later if necessary.

The key will be saved on your JUGENE account at



  • If the file
    already exists it will be overwritten.
  • Make sure there is no write access for group or world on the
    directory, otherwise ssh does not work.
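The permission requirement in the last bullet can be checked and fixed with a few shell commands. The following is only a sketch: the function name secure_ssh_dir is hypothetical, and the default path $HOME/.ssh is an assumption borrowed from Section 3.3.1, since the directory is not named here.

```shell
#!/bin/sh
# Sketch: remove group/world write access from an ssh directory.
# secure_ssh_dir is a hypothetical helper name; the default path
# $HOME/.ssh is an assumption taken from Section 3.3.1.
secure_ssh_dir() {
    ssh_dir="${1:-$HOME/.ssh}"
    if [ ! -d "$ssh_dir" ]; then
        echo "no such directory: $ssh_dir" >&2
        return 1
    fi
    # Drop write permission for group and others only.
    chmod go-w "$ssh_dir"
    # Print the resulting permission string for a visual check.
    ls -ld "$ssh_dir" | cut -c1-10
}

# Usage (not executed here):
#   secure_ssh_dir              # operates on $HOME/.ssh
#   secure_ssh_dir /some/dir    # operates on the given directory
```

Only the write bits for group and others are removed; all other permissions are left as they are.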

3.3.3. Login to JUGENE

After you have generated and uploaded the key you can access
JUGENE in the following way (Linux/UNIX):

ssh <userid>

On Windows please use your ssh client, choose the
method ‘public-key’, import the key pair and login
with your account


  • X11 forwarding (Linux/UNIX)

    If you log in via multiple machines, X11 forwarding has to
    be enabled in the file


    Add the following lines to this file on your local system:

    PubkeyAuthentication yes
    ForwardAgent yes
    ForwardX11 yes

    or use the -X flag when using ssh

    ssh -X ...

  • Login shell

    The login shell on JUGENE is the Bourne-again shell (bash). Users
    are not allowed to change the login shell but they can switch to
    a personal shell within the login process. Users will find a
    template within the initial FZJ $HOME/.profile.
    You can choose
    your shell by setting the environment variable
    accordingly, for example setting


    will enable the C shell. Please see
    for details.

3.4. Further reading, information and references

JUGENE documentation on the web

4. Production Environment

4.1. Accounting

Job accounting is done via a central database at FZJ. The
information about all jobs is gathered once a day around midnight,
based on information obtained from the LoadLeveler batch system. The
FZJ billing procedure calculates the resources that each user in the
project actually uses in so-called contingent units (KE). The
calculation formula is specified below.

4.1.1. Monthly quota

Every approved PRACE project has been granted a quota of CPU hours
for usage on the system by the PRACE panel. In order to ensure that
the system is used evenly over time, quotas are divided evenly
across the months of usage. The amount of CPU time available for the
current month (CQ) is calculated from this monthly quota (MQ)
and any time left over from the previous month (PQ):

CQ = PQ + MQ

Remaining quota from earlier months is no longer available, so
“saving of CPU time” is not possible. At the end of an allocation
period, any remaining quota is forfeited. Each day, the quota used by
each job is debited to CQ. This account may be overdrawn by the
amount of the monthly quota of the following month (up to -MQ)
without affecting the processing priority of the job. Only when the
quota of the following month has also been used up will all
subsequent jobs be assigned the lowest priority for the rest of that
month. Using the quota of the following month is possible, but only
if the user’s allowance for computing resources includes the
following month. An overdrawn account in the preceding month is
carried over into the current month as a negative quota.
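The quota arithmetic above can be illustrated with a small shell sketch. The function names are hypothetical, whole CPU hours are assumed, and the overdraft limit is assumed to be exactly one monthly quota (monthly quotas are equal because the total is divided evenly).

```shell
#!/bin/sh
# Sketch of the monthly quota arithmetic; function names are hypothetical.

# CQ = PQ + MQ (formula from the text), in whole CPU hours.
current_quota() {   # usage: current_quota PQ MQ
    echo $(( $1 + $2 ))
}

# Assumption: the account may be overdrawn down to -MQ before jobs
# drop to the lowest priority.
priority_for() {    # usage: priority_for BALANCE MQ
    if [ "$1" -ge $(( 0 - $2 )) ]; then
        echo normal
    else
        echo lowest
    fi
}

current_quota -200 1000     # previous month overdrawn by 200: prints 800
priority_for  -500 1000     # within the overdraft limit: prints normal
priority_for -1200 1000     # beyond the overdraft limit: prints lowest
```

A negative PQ models an overdrawn previous month, which reduces the quota available in the current month.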

4.1.2. Billing formula

Jobs will be charged for wall clock time, that is, for the time the
physical nodes are occupied. Jobs will be charged regardless of
whether the nodes are processing the user’s application or idling.
This leads to the following billing formula:

KE = fc * NN * TIME

NN is the total number of nodes in the requested partitions used by
the job and TIME is the wall clock time for which the job occupied
the nodes (in hours). The factor fc is the price for the usage of one
node; currently fc is 0.039.
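As a worked example of the billing formula, the following sketch computes the charge for a job from its node count and wall clock time. The function name job_cost is hypothetical; the factor 0.039 is the value quoted above and may have changed since.

```shell
#!/bin/sh
# Sketch: KE = fc * NN * TIME with fc = 0.039 as quoted in the text.
# job_cost is a hypothetical helper name.
job_cost() {   # usage: job_cost NODES HOURS
    awk -v nn="$1" -v t="$2" 'BEGIN { printf "%.3f\n", 0.039 * nn * t }'
}

job_cost 512 6      # one midplane for six hours: prints 119.808
job_cost 2048 12    # the partition of Example 1 for twelve hours
```

Because the charge depends on the occupied nodes and the elapsed time only, a job that idles on 512 nodes for six hours costs the same as one that computes for the whole time.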

4.1.3. Overdrawn quota

Jobs run at a normal priority unless three monthly quotas
(previous, current and next month’s quota) are consumed. If the
current quota account is overdrawn, then all subsequent batch jobs
for all users from the same project will be put on hold until CPU
capacity is free. The maximum time that can be requested to process
jobs is reduced to 6 hours. Jobs that require more computing time
will be rejected. The time used by the jobs will still be debited to
the quota account as described above. Users will be informed by
mail, when they run out of CPU quota.

4.1.4. Querying quota status

This section and the next describe the commands,
which are also available for PRACE users. PRACE users can
also use the Distributed Accounting Report Tool (DART)
developed in the
and PRACE projects:


Users get information about their current quota status or the
usage of jobs by using the command

q_cpuquota <options>

Useful options are given in Table 3.

Table 3. Selected options for the q_cpuquota command

Option          Description
-?              usage information and all options
-j <jobstepid>  for a single job with specified id
-t <time>       for all jobs in the specified time period, for example:
                -t 01.05.2011-15.05.2011
-d <number>     for the last <number> days (positive integer)


4.1.5. Querying project status by the project leader

Project leaders are informed monthly by mail about the quota
of their project and the usage of each project member.
They can also query the project status at any time. The project ID
has to be specified and for authorization an email-verification
method is used.

4.2. Module environment

On JUGENE, compilers, tools, some general and scientific
applications as well as libraries are provided to users through the
module tool. For each application a specific module script is
provided that defines everything needed to run the corresponding
application. In other words, the user’s environment is updated
dynamically in a suitable manner. Among other configurations, module
scripts include definitions of environment variables to make known
the paths to executables, libraries, header files etc.

The module concept eases the use of software on JUGENE. There is
no need for the user to figure out what is needed and where needed
components are stored in the file system in order to use a certain
piece of software. Furthermore, it allows providing different
versions of a software package to the user; conflicts between
different versions are avoided due to the possibility to load a
certain configuration set while unloading another one.

On JUGENE there is a distinction between applications running on
front-end (login) nodes and back-end (compute) nodes. Applications
that are made available by modules on the compute nodes are stored
under the directory
whereas applications to be used on the front end are given in
Therefore, it is important to load the needed modules within the
job script if the corresponding application (or library etc.) is
needed for the runs on the compute nodes.

The module tool is used on the command line in the following way:

module <option>

Table 4 shows the most important options of the module command
together with short explanations.

Table 4. Selected options for the module command

Option                                 Description
<no option>                            Lists the available options
avail                                  Lists all available modules
list                                   Lists all currently loaded modules
add|load <module1> [<module2> ...]     Loads a module. More than one module can be
                                       loaded per command invocation.
rm|unload <module1> [<module2> ...]    Removes loaded modules. More than one module
                                       can be removed per command invocation.
switch|swap <module1> [<module2> ...]  Unloads <module1> and loads <module2>
show <module>                          Shows the actual changes that will take place
                                       in the environment if <module> is loaded
help <module>                          Provides the user with some information about
                                       the module and the corresponding software
                                       package.


The command module avail shows that the modules are organized in
categories in order to facilitate the identification of available
software packages. The following categories – again distinguishing
between front-end and back-end – are defined on JUGENE
(Table 5).

Table 5. Module categories available for the front-end and the compute
nodes on JUGENE.

front-end back-end
/usr/local/modulefiles/TOOLS /bgsys/local/modulefiles/TOOLS
/usr/local/modulefiles/SCIENTIFIC /bgsys/local/modulefiles/SCIENTIFIC
/usr/local/modulefiles/MATH /bgsys/local/modulefiles/MATH
/usr/local/modulefiles/COMPILER /bgsys/local/modulefiles/COMPILER
/usr/local/modulefiles/IO /bgsys/local/modulefiles/IO
/usr/local/modulefiles/MISC /bgsys/local/modulefiles/MISC


Besides those modules that point directly to the corresponding
applications, there is a further module named UNITE which itself
contains a collection of modules that focus on analysis and
debugging. The command
module load UNITE
makes them available to the user. These tools will be discussed in
detail in Section 6 and Section 8.

4.3. Running and monitoring jobs

A user process executes on Blue Gene/P compute nodes in one of
three available modes: symmetric multiprocessing (SMP) mode, which
is the default execution mode, DUAL mode and Virtual Node mode (also
called Quad mode).

4.3.1. SMP mode

Figure 7. Application allocation in SMP execution mode (source:
Lawrence Livermore National Laboratory)

In symmetric multiprocessing (SMP) mode each compute node executes
a single process (MPI process). SMP mode is the default execution
mode on the Blue Gene/P. In this mode a process may create threads
on each of the available cores in the node, resulting in up to four
threads per MPI process and per compute node, as depicted in
Figure 7. In SMP mode all four cores on the node work symmetrically.
This enables all processes and threads on one node to access the full
memory of the node. Executing in SMP mode also gives the maximum
amount of memory per process. OpenMP and Pthreads are supported in
SMP mode and thus a mixed MPI/OpenMP parallelization model can be
used.

4.3.2. DUAL mode

In DUAL mode (schematically shown in Figure 8), each compute node
executes two processes (MPI processes) with a default maximum of
two threads per process. In DUAL mode half of the available memory
and cores in the node is assigned to each process. DUAL mode also
supports the MPI/OpenMP programming model.

Figure 8. DUAL execution mode (source: Lawrence Livermore National
Laboratory)

4.3.3. Virtual Node mode

The Virtual Node (VN) mode allows four separate processes to be
executed on one compute node (Figure 9). In this mode MPI
applications can use all cores in a node (four MPI tasks per node).
In Virtual Node mode threading is not available. Resources and
network links of the node are shared by all processes. VN mode is
intended to be used with applications parallelized with MPI without
threads.

Figure 9. VN execution mode (source: Lawrence Livermore National
Laboratory)


The choice of the execution mode is the user’s decision and
largely depends on the application implementation and requirements.

  • Applications using a mixed MPI and OpenMP or Pthreads
    parallelization technique may benefit from SMP or DUAL mode.
  • Single-threaded applications that use only MPI for
    parallelization should run in VN mode to use the system
    efficiently.
  • I/O-intensive tasks that require a large amount of data to be
    transferred between nodes and CPU-bound applications without
    large memory requirements benefit from using Virtual Node mode.
  • Applications that can use a large number of processors with good
    scaling behavior may be considered for execution in VN mode.

4.3.4. Mode selection

The default execution mode is SMP. To run in DUAL or Virtual Node
mode, users must pass the corresponding mode option to the mpirun
command:

mpirun -mode VN
mpirun -mode Dual

For further information
about the mpirun command see

4.3.5. Running and monitoring batch jobs with LoadLeveler

Application execution on JUGENE is managed by the
LoadLeveler scheduling system. LoadLeveler allocates computing
resources to run users’ jobs. A user submits a job using a job
command file. The LoadLeveler scheduler attempts to find resources
within the machine to satisfy the requirements of the user’s job.
The scheduling of jobs depends on the availability of resources and
the job priority. Job priority on JUGENE is proportional to the
number of nodes requested; this means jobs requesting larger core
counts are preferred.

Once the job is submitted LoadLeveler controls its execution and
monitors its state. Jobs submitted with LoadLeveler run in batch
mode, which means that users in general cannot interact with the
application during execution. Users may use command-line tools to
check the job status.

Job classes and resource reservations

Jobs are submitted for execution to a job queue or a job class. A
job class is a classification to which a job can belong according to
its attributes. On JUGENE classes are chosen automatically according
to the number of nodes the user requests and the user does not
specify a class. For example, if the user wants to run on 512 nodes
(1 midplane) the job will be automatically queued in the class
m001. The default wallclock time for jobs requesting 512 nodes or
more is 6 hours, the maximum wallclock time is 24 hours. Jobs
requesting fewer than 512 nodes can run for at most 30 minutes.

There are two exceptions:

  1. If users need to run jobs on the front-end nodes on JUGENE
    job_type serial
    needs to be chosen (see section
  2. For debugging purposes using less than one midplane a class
    exists which allows jobs to run on smaller numbers of cores for at
    most 4 hours. This queue is only available upon special request.

A list of available classes as well as additional information about
each class (such as the maximum wall time allowed, the maximum number
of jobs which will be executed in parallel, etc.) can be obtained
using the LoadLeveler command


The priority of a job is set according to the number of nodes
requested; the larger the number of nodes requested the higher the
priority of the job. The following table provides an overview over
the available classes (ordered with descending priority).

Table 6. Available LoadLeveler classes on JUGENE ordered with
descending priority. R87 denotes a dedicated rack with additional
I/O nodes in order to support jobs requesting less than 512 nodes.

Class Name  Max. wall time  Default wall time  Max. jobs per user  Comment/Running on
m144        24:00:00        06:00:00                               On demand only
m128        24:00:00        06:00:00                               On demand only
m064        24:00:00        06:00:00                               All
m032        24:00:00        06:00:00                               All
m016        24:00:00        06:00:00                               All
m008        24:00:00        06:00:00                               All
m004        24:00:00        06:00:00                               All
m002        24:00:00        06:00:00                               All
m001        24:00:00        06:00:00                               All
small       00:30:00        00:30:00           2                   R87
dbglong     04:00:00        04:00:00           4                   On demand only

Job submission and monitoring

Once the command file is created (see
the job is submitted using the

llsubmit <Job Command File>

LoadLeveler submits the job to the queue and assigns a unique job
id. The job waits in the queue until the requested resources
are available. When resources are ready LoadLeveler initializes the
nodes and computing environment and starts the user application. To
monitor the job status and print its attributes, a user
may use the command


To cancel the job execution the

llcancel <jobid>

command is used.

The following table summarizes the LoadLeveler command-line tools.

Table 7. Basic LoadLeveler commands for job control and monitoring.

Command   Option              Description
llsubmit  <Job command file>  Submits a job to the queuing system
llq       <no option>         Lists queued jobs
          -l <Job ID>         Displays the attributes of the job
          -s <Job ID>         Displays the job status
          -u <user>           Displays all jobs of <user>
          -x -l <Job ID>      Displays more details about the job
llqx      <no option>         Shows detailed information about all jobs
llcancel  <Job ID>            Cancels (kills) a job
llclass   <no option>         Lists existing classes
llstatus  <no option>         Displays the LoadLeveler status

Job command file (job script)

A job file is a script that contains the job attributes,
specifications of the requested resources and commands to run the
application. This file should have two blocks of instructions:

  1. a LoadLeveler job keywords block at the beginning of a file
  2. one or more application script blocks

LoadLeveler keyword-statement lines begin with #@, where the two
characters can be separated by any number of blanks. The application
block is a regular shell script which can contain any shell
commands. The LoadLeveler scheduler uses a number of generic keywords
and some additional keywords dedicated to the Blue Gene/P system.
Selected general LoadLeveler keywords are given in Table 8.
Blue Gene/P specific LoadLeveler keywords are given in Table 9.

Table 8. Selected LoadLeveler job file keywords

Keyword             Value             Description                                Default
#@job_name          <Job Name>        Name of the job                            -
#@ll_res_id         <Reservation ID>  Reservation to be used for resource        -
                                      allocation
#@notification      start             Notification when job has started          never
                    complete          Notification when job has ended
                    error             Notification when an error has occurred
                    always            =start+complete+error
                    never             No notification is sent
#@notify_user       <Mail address>    Address to use for notifications           $USER@localhost
#@wall_clock_limit  <hh:mm:ss>        Time limit for the job duration            Depends on job size
#@queue             <queue name>      Places the job in the corresponding queue  -
#@input, #@output,  <file name>       Name of the files to use as standard
#@error                               input / output / error
#@environment       <ENVVAR>          Specifies initial environment variables
                                      set by LL when your job step starts. If
                                      COPY_ALL is given all environment
                                      variables are passed to the job.


Table 9. Selected Blue Gene/P specific LoadLeveler job file keywords

Keyword          Value                    Description                                  Default
#@job_type       serial|bluegene          Specifies the type of job step to process.
                                          If serial is specified, the job will run
                                          on the front-end node, otherwise it will
                                          be executed on the compute nodes.
#@bg_size        <int>                    Size of the Blue Gene job in terms of
                                          number of compute nodes. The keywords
                                          bg_size and bg_shape are mutually
                                          exclusive.
#@bg_shape       <int> x <int> x <int>    Specifies the requested shape of a Blue      No default
                                          Gene job in terms of number of midplanes
                                          in x, y, and z direction, respectively.
                                          The maximum shape on JUGENE is 9x4x4.
                                          The keywords bg_size and bg_shape are
                                          mutually exclusive.
#@bg_rotate      TRUE|FALSE               Specifies whether the scheduler should
                                          consider all possible rotations of the
                                          given shape of the job when searching
                                          for a partition for the job.
#@bg_connection  MESH|PREFER_TORUS|TORUS  Type of wiring requested for the partition.  MESH

Sample job scripts can be found on the system in the directory


Since LoadLeveler automatically selects the appropriate partition to run
the job on, the option #@bg_partition cannot be used on JUGENE.

On JUGENE applications always need to be executed with the mpirun
command in order to run on the compute nodes. Thus, the job command
file must contain appropriate arguments and a correct call of mpirun
that will execute the user application. Please see
for further information.

Programs can be launched and controlled on JUGENE
using the mpirun command. The general syntax of this command is

mpirun <options>

mpirun offers the possibility to control the environment and the execution
of an application using numerous parameters which can either be set
by command-line options or by environment variables. The
command-line options have higher priority, this means they override
parameters specified by environment variables. In the following we
describe some useful parameters for mpirun giving the command-line
option and the corresponding environment variable.

Table 10. Command-line options and environment variables for the
mpirun command

Command-line option  Environment variable         Description
-args "prg_args"     MPIRUN_ARGS="prg_args"       Passes prg_args to the launched application
                                                  on the compute node.
-env "ENVVAR=value"  MPIRUN_ENV="ENVVAR=value"    Sets the environment variable ENVVAR in the
                                                  environment of the job on the compute nodes.
-env_all             MPIRUN_EXP_ENV_ALL           Exports all environment variables in the
                                                  current environment to the job on the
                                                  compute nodes.
-exe executable      MPIRUN_EXE=executable        Specifies the full path to the executable to
                                                  be launched on the compute nodes. The path
                                                  must be specified as seen by the I/O and the
                                                  compute nodes.
-exp_env ENVVAR      MPIRUN_EXP_ENVVAR="ENVVAR"   Exports the environment variable ENVVAR in
                                                  the current environment of mpirun to the job
                                                  on the compute nodes. More than one variable
                                                  can be exported with a comma-separated list.
-mapfile mapfile     MPIRUN_MAPFILE="mapfile"     Specifies the order in which the MPI tasks
                                                  are distributed across the nodes and cores
                                                  of the partition reserved (see Section 7.2.4
                                                  for details).
-mode <mode>                                      Specifies the mode in which the job will run.
-np n                MPIRUN_NP="n"                Creates n MPI tasks (only needed if not the
                                                  full reserved partition will be used).
-verbose 1|2|3|4     MPIRUN_VERBOSE=1|2|3|4       Sets the verbosity level of mpirun.
                                                  Information generated by mpirun are passed to



  • In general it is not necessary to use the -np option of the mpirun
    command, because the number of MPI tasks is automatically
    computed according to the size of the partition requested and the
    execution mode (VN, DUAL, SMP) selected. For example, requesting
    1024 nodes in DUAL mode will automatically create 2048 MPI tasks
    (2 tasks per node) without specifying the -np option. In fact, it
    is recommended to omit this option unless you want to use fewer
    tasks.
  • It is recommended to use the -verbose 2 option even if your job
    finishes with exit code 0. Events might occur during the run of
    your application which are not critical (that means the
    application does not crash) but which can provide hints on how to
    set certain parameters in order to obtain better performance.
    Such events are reported only at verbosity level 2 or higher.

Job command file example for pure MPI codes

The following example specifies a simple MPI job using a bash shell
script. Lines beginning with #@ are LoadLeveler keywords.

Example 1. Job command file for pure MPI codes

#@job_name         = example_1
#@comment          = "BGP Job by Size"
#@error            = $(job_name).$(jobid).err
#@output           = $(job_name).$(jobid).out
#@environment      = COPY_ALL
#@wall_clock_limit = 12:00:00
#@notification     = error
#@notify_user      = my_address@my.institution
#@job_type         = bluegene
#@bg_size          = 2048
#@bg_connection    = TORUS

mpirun -mode VN -mapfile TXYZ -verbose 2 -exe application.x


mpirun will launch the executable application.x in Virtual Node
mode. All environment variables are exported to the job environment.
The topology of the requested partition is set to TORUS and the
mapping order is set to TXYZ, which means the MPI tasks are
distributed in such a way that each node is filled with 4 MPI tasks
(T dimension first) before the next node is filled. 8192 MPI tasks
are created in total. The user is notified by email if an error
occurs during the execution.

Job command file example for hybrid MPI/OpenMP codes

The following example shows how to launch a hybrid application:

Example 2. Job command file for hybrid MPI/OpenMP codes

#@job_name         = example_2
#@error            = $(job_name).$(jobid).out
#@output           = $(job_name).$(jobid).out
#@environment      = COPY_ALL
#@wall_clock_limit = 12:00:00
#@notify_user      = my_address@my.institution
#@notification     = complete
#@job_type         = bluegene
#@bg_connection    = TORUS
#@bg_size          = 1024

module load scalapack
mpirun -exe application.x -env OMP_NUM_THREADS=4


First, the module for the scalapack library is loaded. Then mpirun
will launch the executable application.x in SMP mode (the default
mode, so it is not necessary to specify -mode SMP here) and each task
will start 4 threads. In total, 1024 MPI tasks are created. The user
will be notified upon completion of the job.
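The task counts quoted for the two examples follow from the partition size and the execution mode. A minimal sketch of that calculation is given below; the function name mpi_tasks is hypothetical, and the tasks-per-node values are those stated for SMP, DUAL and VN mode in Section 4.3.

```shell
#!/bin/sh
# Sketch: MPI tasks = compute nodes * tasks per node
# (SMP: 1, DUAL: 2, VN: 4). mpi_tasks is a hypothetical helper name.
mpi_tasks() {   # usage: mpi_tasks BG_SIZE MODE
    case "$2" in
        SMP)  per_node=1 ;;
        DUAL) per_node=2 ;;
        VN)   per_node=4 ;;
        *)    echo "unknown mode: $2" >&2; return 1 ;;
    esac
    echo $(( $1 * per_node ))
}

mpi_tasks 2048 VN     # Example 1: prints 8192
mpi_tasks 1024 SMP    # Example 2: prints 1024
mpi_tasks 1024 DUAL   # prints 2048
```

This is the number of tasks mpirun creates when the -np option is omitted and the full partition is used.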

Running interactive jobs with llrun

While developing applications it may be necessary to execute a job
under interactive control. llrun creates and submits a LoadLeveler
job under the direct control of the user’s shell. llrun is invoked
in the following way:

llrun [llrun_options] [mpirun_options] <executable>

It can be used with the same command-line options as mpirun. There
are only a few additional parameters, which are listed in Table 11.

Table 11. Options for the llrun command

Option       Description
-B           submit batch job
-w hh:mm:ss  submit job with specified wallclock limit (Default:
-o filename  do not run/submit job but save it to file
-q           do not queue interactive job if system has insufficient resources
-v           Verbose output
-h|?         Print help
-tv          Start the application with

The monitoring tool llview

llview is a client-server based application which allows monitoring
of the utilization of the BG/P system and was developed at the Jülich
Supercomputing Centre. It gives a quick and compact summary of
different information like the usage of nodes/processors and the
required resources of running and waiting jobs. In order to use
llview the display needs to be exported correctly (login with
ssh -X). Then the tool is invoked by


A typical display of llview is shown in Section 4.3.7.
Graphical elements of llview are a node display, a usage bar, which
gives a direct view of the job granularity, a list of running jobs
and a list of waiting jobs, and a graph chart displaying the number
of jobs in the different queues. For JUGENE two different views
exist. The physical view shows which racks/midplanes the jobs are
running on physically. The logical view shows how the partitions the
jobs are running on reside within the TORUS network. That means,
midplanes which are logically adjacent in the TORUS network are
depicted next to each other.

Figure 10.  llview client displaying the status of JUGENE (physical
view)


5. Programming Environment/Basic Porting

5.1. Compilers

The Blue Gene/P programming environment consists of two sets
of compilers: the GNU Compiler Collection for the Blue Gene
architecture and the IBM XL compiler suite. The system supports
cross-compilation and the compilers run on the front-end nodes. The
compilers for the Blue Gene/P system have specific optimizations for
the BG/P architecture. In particular, the XL family of compilers
generates code appropriate for the double floating-point unit (FPU)
of the Blue Gene/P system (“double hummer”). They also incorporate
code optimizations specific to the Blue Gene/P instruction scheduling
and memory hierarchy characteristics. The Blue Gene/P system has
compilers for the C, C++, and Fortran programming languages. There is
also support for the execution of Python-based user applications.

5.1.1. Available compilers

Since the JUGENE front-end nodes differ from the back-end nodes
different compilers are necessary to generate executables for both
types of nodes.

For the back-end nodes the IBM XL C/C++ and XL Fortran family of
compilers should be the first choice for compilation because of
their specific optimization possibilities targeted at the Blue
Gene/P architecture. The XL compilers fully support the OpenMP
version 2.5 standard and the Fortran 2003 language standard. Each of
the compilers from the XL family and the corresponding MPI wrappers
has a thread-safe version that should be used for any multi-threaded
application. The thread-safe version is invoked by appending “_r” to
the corresponding compiler name, for example mpixlc_r.
For Fortran, XL compiler version 11.1 is available; for C and C++,
XL compiler version 9.0.

The available XL compilers and compiler wrappers are summarized in
Table 12.

Table 12. XL family of compilers – compiler invocation for the back-end
compute nodes

Language  Compiler invocation             Compiler invocation – MPI wrapper
C         bgxlc, bgc89, bgc99, bgcc       mpixlc, mpixlc_r
C++       bgxlc++, bgxlC                  mpixlcxx, mpixlcxx_r
Fortran   bgxlf, bgxlf90, bgxlf95,        mpixlf77, mpixlf90, mpixlf95,
          bgxlf2003, bgf77, bgf90, bgf95  mpixlf2003, mpixlf77_r, mpixlf90_r,
                                          mpixlf95_r, mpixlf2003_r


A collection of GNU compilers for C, C++ and Fortran is also
available on JUGENE (version 4.1.2 and 4.3.2). These compilers do
not support some features of the Blue Gene/P architecture. In
particular, GNU compilers do not generate highly optimized code for
the Blue Gene/P platform and do not automatically generate code for
the double FPU. GNU compilers are thread-safe but do not support
OpenMP in the default version (4.1.2) of the compiler collection of the
Blue Gene/P system. In order to use OpenMP with the GNU compilers
you must use version 4.3.2 (see
Section 5.4).

Table 13. GNU compiler collection for Blue Gene/P – compiler
invocation for the back-end compute nodes.

Language  Compiler invocation         Compiler invocation – MPI wrapper
C         powerpc-bgp-linux-cc        mpicc
C++       powerpc-bgp-linux-g++       mpicxx
Fortran   powerpc-bgp-linux-gfortran  mpif77, mpif90


Additionally the corresponding compilers for the front-end nodes are
available (XL and GNU compilers). Table 14 gives an overview of the
invocation of the XL and GNU compilers for the front-end nodes.

Table 14. Compiler invocation for the front-end nodes of JUGENE. The
commands for the XL and GNU family of compilers are shown.

Language  XL compiler family          GNU compiler family
C         xlc                         gcc
C++       xlc++, xlC                  g++
Fortran   xlf, xlf90, xlf95, xlf2003  gfortran


5.1.2. Compiler flags

XL Compiler family

The default for compilation is -qarch=450d,
which means the “double hummer” (450d) is used for the calculation if
nothing else is specified.

This may be suboptimal. Experience with various applications has
shown that the double hummer needs some preconditions to work
optimally (see
for further information). Therefore, the use of -qarch=450d
does not always lead to an optimum but can sometimes even increase
the calculation time.

It is therefore recommended to test all applications with both
-qarch=450 and -qarch=450d
and compare the results and the calculation times. The option
should always be specified as

We recommend starting with
-O2 -qarch=450 -qtune=450
and increase the level of optimization stepwise according to
Table 15.
Always check that the numerical results are still correct when
using more aggressive optimization flags. For OpenMP codes, please
add -qsmp=omp -qthreaded.

Table 15. Compiler flags for XL compilers in the order of increasing
optimization potential.

Optimization level                  Description
-O2 -qarch=450                      Basic optimization
-O3 -qstrict -qarch=450 -qtune=450  More aggressive optimizations are performed, no
                                    impact on accuracy
-O3 -qhot -qarch=450 -qtune=450     Aggressive optimization that may impact the
                                    accuracy (high order transformations of loops)
-O4 -qarch=450 -qtune=450           Interprocedural optimization at compile time,
                                    SIMD instructions
-O5 -qarch=450 -qtune=450           Interprocedural optimization at link time,
                                    whole-program analysis, SIMD instructions
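The stepwise approach can be organized as a loop over the flag sets of Table 15. The sketch below only prints one compile line per flag set and does not invoke a compiler. The wrapper mpixlc_r is taken from Table 12; the source file app.c, the output names app_<n>.x and the function name print_build_lines are hypothetical.

```shell
#!/bin/sh
# Sketch: print (not run) one compile line per Table 15 flag set, in
# order of increasing optimization. app.c and the app_<n>.x output
# names are hypothetical placeholders.
print_build_lines() {
    level=0
    while IFS= read -r flags; do
        echo "mpixlc_r $flags -o app_$level.x app.c"
        level=$(( level + 1 ))
    done <<'EOF'
-O2 -qarch=450
-O3 -qstrict -qarch=450 -qtune=450
-O3 -qhot -qarch=450 -qtune=450
-O4 -qarch=450 -qtune=450
-O5 -qarch=450 -qtune=450
EOF
}

print_build_lines
```

Building one executable per level makes it easy to compare both the run times and the numerical results of each variant before settling on a flag set.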


Once you have determined the optimal flags you can switch to
-qarch=450d, which activates the double hummer, and see whether this
improves the performance further. If this option has no effect or
even slows down the performance of your code, please check
for further information. If you benefit from using the -qarch=450d
option you can try further optimization with
option together with
option. This will force the compiler to generate SIMD instructions
for double hummer FPU. Note that
is only supported with
switched on.


The double FPU does not support exceptions, thus options related
to exception handling are not valid with the

The following XL compiler options are not supported:

  • -q64
    (The Blue Gene/P processor is a 32-bit architecture and cannot use
    64-bit mode)
  • -qaltivec,
    (The Blue Gene/P processor does not support vector data types nor VMX instructions)
  • -qpdf,
    (Profile directed feedback (PDF) is not fully supported on Blue

Please see
for a more detailed discussion and additional compiler flags for
optimizing and debugging applications on JUGENE.


File extension:
The XLF compilers treat Fortran source files differently depending
on their file extensions (suffix):

  • Files with suffixes
    .f, .f77, .f90, .f95, .f03
    are treated as files containing pure Fortran code (no
    pre-processor directives included)
  • Files with suffixes
    .F, .F77, .F90, .F95, .F03
    are assumed to contain pre-processor directives and are
    pre-processed before compilation. By default the pre-processed
    files are deleted. If you want to keep them, specify the
    -d compiler flag.

GNU Compiler Collection

The GNU Compiler Collection does not offer any compiler options
specific to Blue Gene/P. To tune an application with the GNU
compilers, users may start with the default set of options
and apply the reference compile options if there are any. To check
which options are in effect by default, use the following command:

mpicc -Q -v -c

5.2. Available libraries

Numerous mathematical, scientific and communication libraries
are available on JUGENE. You can get an overview by using the
following command:

module avail

Libraries for the applications running on the JUGENE compute
nodes are listed under


The libraries for the JUGENE front-end node can be found under


Further information about how to link your application with the
different libraries can be obtained with

module help <library_name>[/<library_version>]


module show <library_name>[/<library_version>]

should be replaced by the name of the library (such as
by the corresponding version number (for example, 3.2.2). The latter one can
be omitted. In this case information about the default version is

Engineering and Scientific Subroutine Library (ESSL)

The vendor optimized numerical library for JUGENE is the ESSL
(Engineering and Scientific Subroutine Library) provided by IBM.
ESSL is a collection of high performance mathematical subroutines
providing a wide range of functions for many common scientific and
engineering applications. The mathematical subroutines are divided
into nine computational areas:

  • Linear Algebra Subprograms
  • Matrix Operations
  • Linear Algebraic Equations
  • Eigensystem Analysis
  • Fourier Transforms, Convolutions, Correlations and Related
  • Sorting and Searching
  • Interpolation
  • Numerical Quadrature
  • Random Number Generation

The ESSL Blue Gene Serial Library and the ESSL Blue Gene SMP
Library run in a 32-bit integer, 32-bit pointer environment. Both
libraries are tuned for IBM Blue Gene/P. A subset of the subroutines
in the ESSL Blue Gene Serial Library uses SIMD algorithms that
utilize the PowerPC 450 dual FPUs. The ESSL Blue Gene SMP library
provides thread-safe versions of the ESSL subroutines. A subset of
these subroutines is also multithreaded and supports the shared
memory parallel processing programming model. Some of these
multithreaded subroutines also use
SIMD algorithms that utilize the PowerPC 450 dual FPUs. The ESSL
subroutines can be called from application programs written in
Fortran, C, and C++.

Compiling and linking a Fortran90 program with the sequential
and multithreaded version of ESSL, respectively, on JUGENE looks as
follows:

mpixlf90_r name.f -L/bgsys/local/lib -lesslbg

mpixlf90_r name.f -L/bgsys/local/lib -lesslsmpbg

Here, name.f is a Fortran program calling routines from ESSL.

Compiling and linking a C program with the sequential version
of ESSL on JUGENE looks as follows:

mpixlc_r name.c -L/bgsys/local/lib -lesslbg \
-L/opt/ibmcmp/xlf/bg/11.1/lib -lxl -lxlopt -lxlf90_r \
-lxlfmath \
-L/opt/ibmcmp/xlsmp/bg/1.7/lib -lxlomp_ser -lpthread

To use the multithreaded version you need to use the
following command:

mpixlc_r name.c -L/bgsys/local/lib -lesslsmpbg \
-L/opt/ibmcmp/xlf/bg/11.1/lib -lxl -lxlopt -lxlf90_r \
-lxlfmath \
-L/opt/ibmcmp/xlsmp/bg/1.7/lib -lxlomp_ser -lpthread

For further information on ESSL, please see the
IBM ESSL documentation.

5.3. MPI

5.3.1. MPI implementation

The MPI implementation for JUGENE is based on the MPICH2
implementation of the Mathematics and Computer Science Division (MCS)
at Argonne National Laboratory. It supports the MPI 2.1 standard
(with some exceptions, see below).

Compiling MPI applications

To compile your MPI applications, please use the
corresponding MPI compiler wrapper (either for the GNU or the XL
suite of compilers) described in
Section 5.1.1.
For example, to compile C code with MPI support using the
IBM XL compiler:

mpixlc -o application.x routine.c

For further information about compilers and compiler flags please
see Section 5.1.1 and
Section 7.1.1
of this guide.

Running MPI applications

MPI applications can be run on JUGENE interactively with
or as batch jobs. For details please see
Section 4.3.6 and
Section 4.3.5.

Exceptions to the MPI standard

Some features of the MPI 2.1 standard are not supported on

  • Asynchronous (non-blocking) I/O:
    Asynchronous I/O is not supported by the Blue Gene/P hardware.
    The MPI routines for non-blocking I/O (MPI_File_iwrite,
    MPI_File_iread) can be used without causing errors. However,
    blocking I/O will actually be performed.

MPI extensions for Blue Gene/P

IBM provides extensions to MPICH2 in order to ease the use of the
BG/P hardware. These extensions start with
instead of
Currently, only a C/C++ interface and no Fortran interface is
available. In order to use the extensions, please include

#include <mpix.h>

and compile your program with the usual MPI compiler wrappers
(mpixlc, mpixlcxx, …).

The following routines are available:

int MPIX_Cart_comm_create(MPI_Comm *cart_comm)
int MPIX_Pset_same_comm_create(MPI_Comm *pset_comm)
int MPIX_Pset_diff_comm_create(MPI_Comm *pset_comm)

unsigned MPIX_torus2rank(unsigned x, unsigned y, unsigned z,
                         unsigned t)
void     MPIX_rank2torus(unsigned rank, unsigned *x, unsigned *y,
                         unsigned *z, unsigned *t)

unsigned MPIX_Comm_torus2rank(MPI_Comm comm, unsigned x,
                              unsigned y, unsigned z, unsigned t)
void     MPIX_Comm_rank2torus(MPI_Comm comm, unsigned rank,
                              unsigned *x, unsigned *y,
                              unsigned *z, unsigned *t)

int MPIX_Get_properties(MPI_Comm comm, int *prop_array)
int MPIX_Get_property(MPI_Comm comm, int prop, int *result)
int MPIX_Set_property(MPI_Comm comm, int prop, int value)


creates a four-dimensional Cartesian communicator that mimics the
exact hardware on which it is run. It will only work properly if
the application runs on
nodes of a partition. For example, if you reserve 512 nodes your
must not
use fewer than 512 nodes. Because of MPICH2 dimension ordering, the
associated arrays (like coords, sizes, and periods) are in [t, z,
y, x] order so that the rank in
matches the rank in


creates a communicator such that all nodes in the same communicator
are served by the same I/O node.


creates a communicator such that all nodes in the same communicator
are served by a different I/O node.


returns the mapped rank based on the physical X, Y, Z, and T


returns the physical X, Y, Z, and T coords based on the mapped


returns the communicator rank based on the physical X, Y, Z, and T


returns the physical X, Y, Z, and T coords based on the
communicator rank.


retrieves the values of all the properties of the specified


gets/sets the corresponding property for the specified
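To make the relation between torus coordinates and ranks more concrete, the following sketch models the arithmetic in plain Python. It does not call the MPIX routines. It assumes a mapping in which x varies fastest, followed by y, z and t, and an 8 x 8 x 8 node partition with four tasks per node; both the mapping order and the partition shape are assumptions made for illustration, and the mapping actually in effect for a job may differ.

```python
def torus2rank(x, y, z, t, shape):
    """Map (x, y, z, t) to a rank, with x varying fastest (assumed order)."""
    nx, ny, nz, nt = shape
    for value, size in zip((x, y, z, t), shape):
        if not 0 <= value < size:
            raise ValueError("coordinate outside the partition")
    return x + nx * (y + ny * (z + nz * t))


def rank2torus(rank, shape):
    """Inverse of torus2rank: recover (x, y, z, t) from a rank."""
    nx, ny, nz, nt = shape
    if not 0 <= rank < nx * ny * nz * nt:
        raise ValueError("rank outside the partition")
    x, rest = rank % nx, rank // nx
    y, rest = rest % ny, rest // ny
    z, t = rest % nz, rest // nz
    return x, y, z, t


# Hypothetical partition: 8 x 8 x 8 = 512 nodes, 4 tasks per node.
SHAPE = (8, 8, 8, 4)
neighbour_in_x = torus2rank(1, 0, 0, 0, SHAPE)
second_task_on_first_node = torus2rank(0, 0, 0, 1, SHAPE)
```

Under these assumptions, ranks that differ only in t share a node but are 512 ranks apart, which is one reason to query the coordinates rather than to infer placement from the rank number.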

5.4. OpenMP

OpenMP (Open Multi-Processing) is a well-known set of compiler
directives and an API (Application Programming Interface) for developing
parallel codes in a shared-memory environment. It defines directives
for the C, C++ and Fortran programming languages. On Blue Gene/P, the
full OpenMP 2.5 standard is supported by the IBM compilers, with the
limitation that one has to use the
of the compilers. This means that as soon as you intend to compile a
code with OpenMP support enabled, you have to use the
with names ending with “_r”. See
Section 5.1.1
for more details.

Moreover, one should be aware that for the back-end nodes the GNU
compilers support OpenMP only in version 4.2 or higher.
The default version of the GNU compilers on JUGENE is 4.1.2. Therefore,
one has to load the necessary module to compile OpenMP code with a
GNU compiler:

module load gcc/4.3.2

5.4.1. Compiler flags

For the IBM compilers, the OpenMP support is enabled with

In addition, the thread-safety of all the optimizations performed
and libraries linked has to be ensured. This is done by using the
thread-safe compilers:
bgxlc_r, and
bgxlc++_r, or
mpixlc_r, and
for the mpi wrappers, depending on the targeted language. Using
those versions of compilers will by default add the
option. Any attempt to use
to compile OpenMP code without using the “_r” version of the compiler
typically produces many errors at link time, such as:

libxlsmp.a(parregion.32.o): In function `SpinWait':
parregion.c:(.text+0x340): undefined reference to `pthread_yield'
parregion.c:(.text+0x36c): undefined reference to `pthread_yield'

For the GNU compilers of version greater or equal to
the OpenMP support is enabled with

Since those compilers generate thread-safe code by default, no
specific version or extra switch needs to be added.

5.4.2. Using OpenMP in different execution modes

OpenMP is supported on the Blue Gene/P for the two modes that
support shared memory. Those are the SMP mode described in
Section 4.3.1
and the DUAL mode described in
Section 4.3.2.
Either of the two modes should be selected with the corresponding
option to the
command, as described in
Section 4.3.4.
In addition, the
environment variable should be set and passed to
consistently with the chosen mode.


In general there is no need to adjust the number of tasks to run
as it is done automatically by the system to reflect the number of
nodes and the mode chosen.

The command line for running an OpenMP code should then look like:

  • In SMP mode

    mpirun -env "OMP_NUM_THREADS=4" -exe myOMPcode.x [...]

  • In DUAL mode

    mpirun -mode DUAL -env "OMP_NUM_THREADS=2" -exe myOMPcode.x
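The relation between execution mode, MPI tasks and OpenMP threads used in the two command lines above can be written down as a small helper. This is a sketch under the assumption of four cores per compute node, which is implied by the thread counts shown; the function names layout and mpirun_command are hypothetical.

```python
CORES_PER_NODE = 4  # assumption, implied by OMP_NUM_THREADS=4 in SMP mode
TASKS_PER_NODE = {"SMP": 1, "DUAL": 2, "VN": 4}


def layout(nodes, mode):
    """Return (total MPI tasks, OpenMP threads per task) for a mode."""
    tasks_per_node = TASKS_PER_NODE[mode]
    threads_per_task = CORES_PER_NODE // tasks_per_node
    return nodes * tasks_per_node, threads_per_task


def mpirun_command(mode, exe):
    """Build an mpirun line with a thread count matching the mode."""
    _, threads = layout(1, mode)
    return f'mpirun -mode {mode} -env "OMP_NUM_THREADS={threads}" -exe {exe}'


dual_line = mpirun_command("DUAL", "myOMPcode.x")
tasks_smp, threads_smp = layout(32, "SMP")
```

VN mode leaves one core per task, so it offers no room for additional OpenMP threads.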

6. Performance Analysis

6.1. Available performance analysis tools

6.1.1. gprof

gprof – the GNU Profiler – is the basic tool to analyze program
execution profiles. It is a good starting point for profiling your
application. gprof produces an execution profile on the basis of the
call graph profile file (
by default). The profile file is created by programs that are
and/or linked with the
option. It works for C and Fortran programs and both GNU and XL

To produce an execution profile of your application you need to
compile and link your application with the
option. After the program execution is finished the profiling data
is collected in
files, where
corresponds to the MPI rank of a process that produced the data.
This is extended functionality of the GNU toolchain introduced on
Blue Gene/P to support MPI based application with gprof. Note that
support for threaded applications is not implemented in gprof on
Blue Gene/P. In case of threaded programs running in SMP mode on
computing nodes, most likely the master thread only will be

To start with the most basic timer-tick profiling information at the
machine instruction level, use the
option on the link command only. Without any additional options in
the compile command, the least overhead is added to the
execution time. Furthermore, no call
graph information will be collected.

To proceed with procedure-level profiling you need to
include the
option in the compile command for all application source code files.
This additionally collects call graph information, at the cost of
further performance overhead. Note also that compilation with
aggressive optimization options may alter the structure of the
reported call graph due to the possible compiler-based code
To enable the full level of profiling you need to include the
option in all compile and link commands. In that case call-graph,
statement-level, basic block, and machine-instruction profiling data
is collected. This level of profiling will introduce the most
overhead in overall program performance.

To analyze the data collected in the
files during the program execution you need to start gprof on a
front-end node to obtain a flat profile:

gprof -p program-binary profile-file-list

If you use the
option in the gprof command line, it would merge all
files into one

In cases where a large number of
files is collected, for example, if a large number of compute nodes is
used for execution, users may find it impossible to process all of
these files due to the limit on input arguments in the Linux system.
To overcome this limitation use the
option on the gprof command line. This will instruct gprof to search the
current directory for all
files with
in a continuous sequence.

gprof examples

The following example shows the flat profile for a simple parallel
program that operates on matrices. The profile shows that the
program spends most of its time in functions called
There is also a significant part of the runtime consumed by

Example 3. Flat Profile

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 30.60   4655.25  4655.25       48    96.98    96.98  mat_cmp
 27.23   8798.60  4143.35       64    64.74    64.74  mat_mul
 23.02  12301.23  3502.63                             sin
 19.10  15207.63  2906.40                             cos
 0.00   15213.24     0.36                             mmap
 0.00   15213.76     0.00       16     0.00   549.91  main


Each row in a profile corresponds to a function in the analyzed
program. Functions are sorted by decreasing runtime. The
meaning of the fields in a profile is as follows:

  • time

    The percentage of the total execution time that the program
    spent in this function.

  • cumulative seconds

    Total number of seconds that the program spent executing this
    function, plus the time spent in all the functions above this
  • self seconds

    The number of seconds accounted for this function only.

  • calls

    The total number of calls to the function.

  • self s/call

    The average time in seconds spent in this function per call.

  • total s/call

    The average time in seconds spent in this function and its
    descendants per call.

  • name

    The name of the function.

If a field in the table is left blank, the corresponding function was
never called or the information cannot be determined. In this
example the
functions are not profiled because the versions from the math
library were used, which is not compiled with profiling enabled.
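The relations between the flat profile fields can be checked with a few lines of Python. The sketch below recomputes the percentage, cumulative and per-call columns for the four dominant rows of Example 3, using the total of 15213.76 seconds reported in the profile. The helper name derive_fields is hypothetical, and a call count of None stands for a blank field.

```python
TOTAL_SECONDS = 15213.76  # total run time reported in the profile

# (name, self seconds, calls) copied from Example 3; None means blank.
ROWS = [
    ("mat_cmp", 4655.25, 48),
    ("mat_mul", 4143.35, 64),
    ("sin", 3502.63, None),
    ("cos", 2906.40, None),
]


def derive_fields(rows, total_seconds):
    """Recompute % time, cumulative seconds and self s/call per row."""
    derived = []
    cumulative = 0.0
    for name, self_seconds, calls in rows:
        cumulative += self_seconds
        derived.append({
            "name": name,
            "percent": round(100.0 * self_seconds / total_seconds, 2),
            "cumulative": round(cumulative, 2),
            # A blank call count leaves the per-call field undefined.
            "self_per_call": round(self_seconds / calls, 2) if calls else None,
        })
    return derived


profile = derive_fields(ROWS, TOTAL_SECONDS)
```

The recomputed values agree with the columns printed by gprof for these rows, which confirms that self s/call is simply self seconds divided by calls.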

Example 4. Line Level Flat Profile

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self           self     total
 time   seconds   seconds  calls Ts/call Ts/call name
 24.51   4018.92  4018.92                        mat_mul (cannon_cmp.c:22 @ 1001524)
 11.44   5894.25  1875.33                        __sin (s_sin.c:334 @ 1138ed0)
  8.67   7315.83  1421.58                        mat_cmp (cannon_cmp.c:35 @ 10016d8)
  7.93   8616.51  1300.68                        mat_cmp (cannon_cmp.c:34 @ 10016b0)
  7.13   9786.35  1169.84                        __cos (s_sin.c:578 @ 1138078)
  7.13  10955.63  1169.28                        mat_cmp (cannon_cmp.c:33 @ 1001688)
  6.47  12017.09  1061.46                        mat_mul (cannon_cmp.c:21 @ 1001584)
  4.89  12819.35   802.26                        mat_cmp (cannon_cmp.c:32 @ 1001714)
  4.13  13496.15   676.80                        __sin (s_sin.c:89 @ 1138d64)
  2.92  13975.28   479.13                        __cos (s_sin.c:343 @ 1137dbc)
  1.55  14228.67   253.39                        __sin (s_sin.c:100 @ 1138d3c)
  1.49  14472.77   244.10                        __cos (s_sin.c:348 @ 1137d9c)
  1.44  14709.26   236.49                        __cos (s_sin.c:348 @ 1137d88)
  1.44  14945.63   236.37                        __sin (s_sin.c:89 @ 1138d38)
  1.43  15179.81   234.18                        __sin (s_sin.c:100 @ 1138d28)
  1.43  15413.78   233.97                        __cos (s_sin.c:343 @ 1137d98)
  1.08  15591.18   177.40                        __cos (s_sin.c:343 @ 1137db4)
  0.90  15738.35   147.17                        __sin (s_sin.c:89 @ 1138d44)
  0.82  15872.69   134.34                        __sin (s_sin.c:89 @ 1138d5c)
  0.80  16004.63   131.94                        __sin (s_sin.c:89 @ 1138d54)
  0.76  16129.12   124.49                        __cos (s_sin.c:578 @ 113806c)
  0.76  16253.48   124.36                        __cos (s_sin.c:343 @ 1137da4)
  0.71  16370.64   117.16                        __cos (s_sin.c:352 @ 1138070)
  0.06  16380.59     9.95                        __cos (s_sin.c:343 @ 1137d8c)
  0.03  16385.94     5.35                        __sin (s_sin.c:89 @ 1138d2c)
  0.01  16387.07     1.13                        mat_mul (cannon_cmp.c:20 @ 10015a0)
  0.01  16387.98     0.91                        mat_cmp (cannon_cmp.c:31 @ 1001730)
  0.01  16388.81     0.83                        mat_mul (cannon_cmp.c:21 @ 100150c)
  0.00  16391.47     0.36                        mmap
  0.00  16396.00     0.00     64   0.00    0.00  mat_mul (cannon_cmp.c:15 @ 10014a0)
  0.00  16396.00     0.00     48   0.00    0.00  mat_cmp (cannon_cmp.c:25 @ 1001600)
  0.00  16396.00     0.00     16   0.00    0.00  main (cannon_cmp.c:39 @ 10017a0)


To analyze the run-time profile at source-code-line-level
accuracy use the
option with the gprof command. For complete
line-by-line profiling information add the
options to the compile command of the target source files. The
example above shows line-by-line profile output from gprof. The
profile consists of the same information as the previous flat profile
but with execution time samples split between the actual lines in the
source code.
Note that generation of the line-by-line profile for complex
programs can take significant time.

The call graph shows how much time was spent in each function
and its children. The next example comes from the same simple program
as the flat profile in the previous example.

Example 5. Call Graph

                     Call graph (explanation follows)

granularity: each sample hit covers 4 byte(s) for 0.00% of 15213.76 seconds

index % time    self  children    called     name
                0.00 8798.60      16/16          generic_start_main [2]
[1]     57.8    0.00 8798.60      16         main [1]
             4655.25    0.00      48/48          mat_cmp [3]
             4143.35    0.00      64/64          mat_mul [4]
[2]     57.8    0.00 8798.60                 generic_start_main [2]
                0.00 8798.60      16/16          main [1]
             4655.25    0.00      48/48          main [1]
[3]     30.6 4655.25    0.00      48         mat_cmp [3]
             4143.35    0.00      64/64          main [1]
[4]     27.2 4143.35    0.00      64         mat_mul [4]
[5]     23.0 3502.63    0.00                 sin [5]
[6]     19.1 2906.40    0.00                 cos [6]
[7]     0.0    0.36    0.00                 mmap [7]


A detailed description of the call graph fields can be found in
the gprof output. The gprof call graph shows entries separated by
horizontal rows. Each entry in the call graph consists of several
lines. The line with the index number at the left hand margin lists
the current function. The lines above it list the functions that
called this function, and the lines below it list the functions this
one called.

Profiling data can be collected in an alternate format. To turn on
the sequential storing of the observed executing instruction
addresses at each interrupt instead of normal gprof counters, set
before the execution of the program. In that case data is collected
files. This allows the sequence of addresses to be reconstructed and
it is a Blue Gene/P enhancement to the GNU gprof tool. Also these
files tend to be much smaller than the
files. To process the
files, use the

command, which is a Blue Gene/P specific version and will recognize
the file format. It will produce the same output as in the previous
examples. Use the
option with the Blue Gene/P gprof command to display the program
counter values in the order in which they were collected and the
option to aggregate data into a
file that is in standard

6.1.2. Performance Application Programming Interface (PAPI)

PAPI is a library and associated utilities for portable
access to hardware/system performance counters.
A core component for CPU/processor counters provides
counts of instructions, floating-point operations,
branches predicted/taken, cache accesses/misses,
TLB misses, cycles, stall cycles, etc., though not all
events are supported by BG/P hardware. Additional
counters provide counts of events and bytes transferred
over the BG/P torus network. Predefined
events are derived from available native counters, for example,
the total number of floating-point operations (excluding
loads and stores) is constructed with

            + PNE_BGP_PU0_FPU_MULT_1 + PNE_BGP_PU1_FPU_MULT_1
            + 2*PNE_BGP_PU0_FPU_FMA_2 + 2*PNE_BGP_PU1_FPU_FMA_2
            + 13*PNE_BGP_PU0_FPU_DIV_1 + 13*PNE_BGP_PU1_FPU_DIV_1
            + 2*PNE_BGP_PU0_FPU_ADD_SUB_2 + 2*PNE_BGP_PU1_FPU_ADD_SUB_2
            + 2*PNE_BGP_PU0_FPU_MULT_2 + 2*PNE_BGP_PU1_FPU_MULT_2
            + 4*PNE_BGP_PU0_FPU_FMA_4 + 4*PNE_BGP_PU1_FPU_FMA_4
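The weighting scheme of this derived event can be illustrated with a short Python sketch. It covers only the terms that appear in the listing above, so the result is a partial sum rather than the full preset. The counter values are hypothetical and the function name weighted_flops is introduced here for illustration only.

```python
# Weight per native event, as given in the terms listed above.
WEIGHTS = {
    "FPU_MULT_1": 1,
    "FPU_FMA_2": 2,
    "FPU_DIV_1": 13,
    "FPU_ADD_SUB_2": 2,
    "FPU_MULT_2": 2,
    "FPU_FMA_4": 4,
}


def weighted_flops(counts):
    """Sum the listed native counters of PU0 and PU1 with their weights."""
    total = 0
    for unit in ("PU0", "PU1"):
        for event, weight in WEIGHTS.items():
            total += weight * counts.get(f"PNE_BGP_{unit}_{event}", 0)
    return total


# Hypothetical counter readings, not taken from a real measurement.
counts = {
    "PNE_BGP_PU0_FPU_MULT_1": 5,
    "PNE_BGP_PU0_FPU_FMA_2": 10,
    "PNE_BGP_PU1_FPU_DIV_1": 2,
}
flops = weighted_flops(counts)
```

The weights reflect how many floating-point operations a single event of each kind is counted as, for example two for a fused multiply-add and four for its SIMD variant.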

To be able to use PAPI, load the following modules:

module load UNITE papi

Available PAPI preset and BG/P native counters are
documented in $PAPI_ROOT/etc.

On BG/P, PAPI is not usable in virtual node (VN) mode.
PAPI also doesn’t distinguish events counted by shared
hardware counters, such as those for the BG/P network.

The PAPI API can be used to manually configure and access
counters; however, it is more commonly used by tools such as
Scalasca (Section 6.1.3),
TAU (Section 6.1.4)
and Vampir (Section 6.1.5).
Since hardware counters are shared resources, they can’t
simultaneously be used by performance measurement tools
and the subject application itself.

Further information

Since a complete description of PAPI is far beyond the
scope of this guide, please see the

PAPI homepage

for further information.

6.1.3. Scalasca

Scalasca is a software tool that supports the performance
optimization of parallel programs by measuring and analyzing their
runtime behavior. The analysis identifies potential performance
bottlenecks – in particular those concerning communication and
synchronization, and offers guidance in exploring their causes.

Analyzing your code with Scalasca involves the following steps:

  1. Instrumenting the application code
  2. Analyzing the instrumented code by running it with
    measurement configured
  3. Examining the analysis results
  4. (Optional) Refining the measurement configuration
    and re-running

Instrumenting your code

First you need to load the corresponding modules.

module load UNITE scalasca

You can obtain a short introduction/help using

module help scalasca

or running the
command on its own.

After you have loaded the Scalasca module you need to
recompile your application to instrument it for using Scalasca.
Simply prepend your compiler command with

scalasca -instrument

For example, if you are using the
compiler wrapper to compile your application, use

scalasca -instrument mpixlf90_r

Example 6. Makefile modified for Scalasca instrumentation

PRE  = scalasca -instrument
F90  = $(PRE) mpixlf90_r
%.o: %.f90
        $(F90) $(F90FLAGS) -c -o $@ $<
$(EXE): $(OBJ)
        $(F90) $(F90FLAGS) $(LDFLAGS) -o $@ $(OBJ)
...

Analyzing (running) the instrumented code

To analyze your code with Scalasca, run the
instrumented executable of your application the same way as you
do without Scalasca, but prefix your usual
mpirun command line with scalasca -analyze.
You may also need to quote certain
options by enclosing them in quotation marks as shown in
the following job command file example.

Example 7. Job command file for a Scalasca experiment

#@job_name         = scalasca_profile
#@comment          = "Scalasca profile analysis"
#@output           = scalasca_prof.out
#@error            = scalasca_prof.err
#@environment      = COPY_ALL
#@job_type         = bluegene
#@notification     = never
#@bg_size          = 32
#@wall_clock_limit = 00:30:00

module load UNITE scalasca
scalasca -analyze mpirun "-mode VN" application.x args


This job file will launch your application in VN mode on 32
nodes (128 MPI tasks) and perform a Scalasca profile analysis.

As a result you will obtain a directory named
where the Scalasca results are stored. Furthermore, information
is printed to
They could look similar to the following examples:

Example 8. STDOUT example of a Scalasca run

[00000]EPIK: Created new measurement archive ./epik_application_128_sum
[00000]EPIK: Activated ./epik_application_128_sum [NO TRACE] (0.024s)
       [STDOUT of application.x]
[00000]EPIK: Closing experiment ./epik_application_128_sum
[00000]EPIK: 224 unique paths (223 max paths, 6 max frames, 0 unknowns)
[00000]EPIK: Unifying... done (0.017s)
[00000]EPIK: Collating... done (0.088s)
[00000]EPIK: Closed experiment ./epik_application_128_sum (0.125s)


Example 9. STDERR example of a Scalasca run

S=C=A=N: Scalasca 1.3.3 runtime summarization
S=C=A=N: ./epik_application_128_sum experiment archive
S=C=A=N: Mon Sep 26 09:12:01 2011: Collect start
/usr/local/bin/mpirun -mode VN -E EPK_TITLE application_128_sum\
-E EPK_LDIR . -E EPK_SUMMARY 1 -E EPK_TRACE 0 application.x
S=C=A=N: Mon Sep 26 09:14:42 2011: Collect done (status=0) 161s
S=C=A=N: ./epik_application_128_sum complete. Examine Scalasca analysis results

Example 7
will produce profile data of the run of the application
application.x in an epik_application_128_sum

To interactively explore the analysis results
with Scalasca’s CUBE GUI use the following command:

scalasca -examine epik_application_128_sum


In order to be able to use the
GUI please make sure you are logged in with

ssh -X

If you are not directly connected to JUGENE, make sure you are
using the
option for all ssh connections and that your local system (laptop,
PC) has a running X

For further information on how to use the GUI see

module help cube

and refer to the

Scalasca documentation

The file
contains the values of all Scalasca environment variables that
were in effect during the run of the experiment. When necessary,
one or more of these environment variables can be adjusted
to configure subsequent measurement runs, for example by
adding one or more PAPI hardware counters
(Section 6.1.2).

Furthermore, the file
contains the call tree of your application execution.

To analyze the results further execute the following command:

scalasca -examine -s epik_application_128_sum

As a result you get an additional file
which contains a table where all functions of your application are
listed, the type of the function (such as an MPI function, a user
function, etc.) and the time the application spent in this function
(absolute and relative).

Finally, Scalasca provides you with an estimate of the
amount of disk space necessary to perform a trace experiment, that is,
a time-stamped record of all events executed by your application:

Estimated aggregate size of event trace (total_tbc): 8111014272 bytes
Estimated size of largest process trace (max_tbc):   35620782 bytes

This means that when you run a trace experiment you will need
about 8 GB of disk space (total_tbc). Furthermore,
the largest trace size for a single process will be around 36 MB
which exceeds the default trace buffer size of 10 MB. To
avoid disruptive flushing of trace buffers to disk during
measurement, larger buffers can be configured by
setting the environment variable ELG_BUFFER_SIZE
to at least max_tbc and/or refining measurement
by specifying a filter file to reduce measurement overhead.
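The sizing rule described above can be expressed as a short calculation. The sketch below takes the two estimates printed by Scalasca and derives a buffer size that is at least max_tbc, rounded up to the next million bytes. It interprets the 10 MB default as 10,000,000 bytes, which is an assumption, and the helper name recommended_buffer is hypothetical.

```python
DEFAULT_BUFFER_BYTES = 10 * 10**6  # "10 MB" read as decimal megabytes (assumption)


def recommended_buffer(max_tbc, default=DEFAULT_BUFFER_BYTES, step=10**6):
    """Return a value for ELG_BUFFER_SIZE, or None if the default suffices."""
    if max_tbc <= default:
        return None
    # Round up to the next multiple of `step` so the value is >= max_tbc.
    return -(-max_tbc // step) * step


# Estimates copied from the score output shown above.
total_tbc = 8111014272
max_tbc = 35620782

buffer_bytes = recommended_buffer(max_tbc)
disk_gb = total_tbc / 10**9
```

For the estimates above this gives a buffer of 36,000,000 bytes and roughly 8.1 GB of disk space for the whole trace. Applying a filter file reduces both numbers, so it is worth scoring again after filtering before settling on a buffer size.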

A measurement filter can be used to exclude some functions or subroutines
from the experiment, for example because they introduce too
much overhead and are not particularly relevant for
your analysis or belong to external libraries. In
this case simply list the names of all functions (one per
line) in an ASCII
file filter.txt. The names of routines to be used in
the filter can be identified from the epik.score file.
Add this filter to Scalasca analysis runs using the option
-f filter.txt.

Refine the measurement configuration

To analyze the wasted waiting time associated with communication
and synchronization inefficiencies such as late senders and
load imbalances, a Scalasca trace experiment can be configured
and the instrumented application rerun. To minimize measurement
collection and storage overhead, remember to configure a
measurement filter and/or adequate trace buffer sizes, best
determined from the score report of a prior summary experiment
as discussed in Section

Here is an example job command file to perform this trace

Example 10. Job command file for a Scalasca trace experiment

#@job_name         = scalasca_trace
#@comment          = "Scalasca trace analysis"
#@output           = scalasca_trace.out
#@error            = scalasca_trace.err
#@environment      = COPY_ALL
#@job_type         = bluegene
#@notification     = never
#@bg_size          = 32
#@wall_clock_limit = 00:30:00

module load UNITE scalasca
export ELG_BUFFER_SIZE=36000000
scalasca -analyze -t -f filter.txt  mpirun "-mode VN" application.x args


option specifies that a trace will be collected and analyzed,
stored in an experiment directory epik_application_128_trace.

The results of this experiment can be examined as described in
Section

Scalasca example

Scalasca has been used to analyse the execution performance of the
three-dimensional reservoir simulator PFLOTRAN on
JUGENE, using a test case comprising 850x1000x160 grid cells with 15
chemical species for 2E9 degrees of freedom. Although it scaled well
to 65,536 processes, various inefficiencies became increasingly
significant. Developed by LANL/ORNL/PNNL, PFLOTRAN combines solvers
for non-isothermal, multi-phase groundwater flow and reactive,
multi-component contaminant transport. It consists of approximately
80,000 lines of Fortran90 and employs PETSc, LAPACK, BLAS and HDF5 I/O
libraries, with MPI used both directly by the application and
indirectly via these libraries.

The PFLOTRAN application and PETSc library were both instrumented using
Scalasca, resulting in over 1,100 routines in an initial summary
analysis report: those routines not found on execution callpaths to MPI
operations were filtered during subsequent measurements, such that
measurement dilation with respect to the uninstrumented application
execution was reduced to an acceptable 5-15%. For executions of 10
simulation timesteps, each process required a trace buffer size
(ELG_BUFFER_SIZE) of 32 MB, and total trace size grew to
2.0 TB with 65,536 processes, taking an additional 25 minutes for
Scalasca trace collection and parallel analysis.
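The reported total trace size is consistent with the per-process buffer size, as the following back-of-the-envelope calculation shows. It assumes binary units (1 MB = 2^20 bytes) and that every process fills its buffer exactly once; neither assumption is stated in the text, so the result is only an upper-bound estimate.

```python
processes = 65536
buffer_mib = 32  # ELG_BUFFER_SIZE per process, read as binary megabytes (assumption)

# Upper bound: every process fills its trace buffer once.
total_bytes = processes * buffer_mib * 2**20
total_tib = total_bytes / 2**40
```

With these assumptions the product is 2^41 bytes, which matches the 2.0 TB quoted above.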

Figure 11
shows the Scalasca analysis report explorer GUI (CUBE) with a trace analysis of
an 8192-process SMP-mode execution on JUGENE. Load imbalance on 6-7% of
processes along the top/front edge is clearly visible and due to the
computational characteristics of that part of the problem geometry. A
more significant imbalance that was found arises from the decomposition
of grid cells onto processes. Both imbalances in (local) computation time
combine to manifest as waiting time in associated MPI communication and
synchronization quantified by Scalasca trace analysis (8% of total
execution time as highlighted in
Figure 11).
Even after addressing these imbalances, the large numbers of MPI_Allreduce
operations performed each simulation timestep are the primary scalability inhibitor
and grow to dominate execution time at large scales.
MPI File I/O done with HDF5 also constitutes 8% of total execution time
at this scale. Although this initialization cost can be amortized over
longer simulation times, it was also identified as an important area to
be reworked by the developers to improve large-scale performance.

Figure 11. Scalasca analysis report explorer GUI (CUBE) showing trace analysis
of PFLOTRAN application execution on JUGENE with 8192 MPI processes.

Further information

Scalasca event traces can also be visualized with Vampir
(Section 6.1.5), and
Scalasca analysis reports (cubefiles) can also be examined with
the TAU/ParaProf GUI (Section 6.1.4).

Since a complete description of the Scalasca tool is far beyond the
scope of this guide, please see the Scalasca homepage for further
information.

6.1.4. Tau

The TAU performance system is a toolkit integrating a variety of
instrumentation, measurement, analysis and visualization components,
and with extensive bridges to and from other performance tools
such as Scalasca and Vampir. ParaProf profile displays of a Scalasca
analysis report (see the Scalasca section above)
are demonstrated in
Figure 12.
TAU is highly customizable with multiple profiling and tracing
capabilities, targets all parallel programming/execution paradigms and is ported
to a wide range of computer systems.

Figure 12
shows a color-coded routine execution time profile (upper right),
a routine-wise performance breakdown for all processes (upper left),
a distribution histogram of MPI_Allreduce time by processes (lower right)
and a three-dimensional profile view. Note that by default ParaProf shows
both routine and callpath profiles simultaneously, and one or the other
typically needs to be disabled (as done in this example).

Before using TAU you need to load the corresponding modules.

module load UNITE tau

Figure 12. TAU/ParaProf GUI displays of Scalasca trace analysis report
of PFLOTRAN application execution on JUGENE with 8192 MPI processes.


In order to be able to use the TAU
GUI please make sure you are logged in with

ssh -X
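
Whether X forwarding is actually active in the current session can be checked before starting a GUI tool. This is a minimal sketch, assuming only that ssh sets the DISPLAY variable when forwarding is enabled. The function name is illustrative.

```shell
#!/bin/sh
# Report whether X11 forwarding appears to be active in this session.
# Assumption: ssh -X sets DISPLAY on the remote side when forwarding works.
check_x11() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "DISPLAY is set to $DISPLAY"
    else
        echo "DISPLAY is not set; log in again with ssh -X on every hop"
        return 1
    fi
}

check_x11 || echo "GUI tools will not be able to open windows in this session"
```

A set DISPLAY variable is a necessary condition only. The local X server must still be running for windows to appear.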

If you are not directly connected to JUGENE, make sure you are
using the -X
option for all ssh connections and that your local system (laptop,
PC) has a running X
server.

Further information

Since a complete description of the TAU tool is far beyond the
scope of this guide, please see the TAU homepage for further
information.

6.1.5. Vampir

Vampir is a tool for interactive event trace analysis,
offering a visual presentation of dynamic execution behavior
in an event timeline chart showing process/thread states and
interactions, along with additional displays of communication
statistics, summaries and much more. Browsing via zooming
and scrolling updates linked displays and statistics to
the visible time interval, and events can be selected for
detailed examination as shown in
Figure 13.

Figure 13. Vampir visualizations of full Scalasca trace and seventh timestep
of PFLOTRAN application execution on JUGENE with 8192 MPI processes.


Vampir visualizes traces collected with VampirTrace
or Scalasca (see the Scalasca section above),
and comes with an additional parallel server to handle larger traces.
Figure 13
shows an initial view of the entire execution trace where timelines
for 50 of the 8192 MPI processes are visible, along with a matrix
of MPI message transfer times organized by sender/receiver, and a
routine execution time summary profile. MPI_Allreduce
events and time are highlighted in red and seen to dominate the
execution of pairs of processes. After zooming in to the interval
corresponding to the seventh timestep of ten, all of the displays
update to show that interval, where it becomes possible to
identify individual communication and synchronization events.

Before using Vampir you need to load the corresponding modules.

module load UNITE vampir


In order to be able to use Vampir
please make sure you are logged in with

ssh -X

If you are not directly connected to JUGENE, make sure you are
using the -X
option for all ssh connections and that your local system (laptop,
PC) has a running X
server.

Further information

Since a complete description of the Vampir tool is far beyond the
scope of this guide, please see the Vampir homepage for further
information.

6.2. Further reading, information and references

JUGENE documentation on the web

General related information

7. Tuning Applications

This section provides more detailed information on how to
optimize and tune applications for JUGENE.

7.1. Single-core optimization

7.1.1. Advanced compiler options and optimization strategies

For application tuning and code optimization the IBM XL compilers
will be used as a reference. These compilers offer many more options
to optimize code execution on the Blue Gene/P system.

Advanced compiler flags

Note that aggressive compiler optimizations may result in
relaxed conformance to the IEEE floating-point standard, alter the
application results or even cause errors in application execution.
In this section a detailed discussion of different compiler flags
is provided and advanced compiler options are discussed. The
options are ordered alphabetically.

  • -On (n=0,2,3,4,5)

    Compiler optimization includes five base optimization levels (the
    -O1 option is not supported). The first level, -O0,
    minimizes all the optimization transformations performed by the
    compiler. It is the most suitable for program-debugging purposes.
    While it results in a shorter compile time, the executables will
    not show good performance.

    The -O2 option involves strong low-level compiler optimizations. It
    applies to a unit or subroutine scope and can include function
    inlining. Almost all programs benefit from this optimization.

    At the -O3
    optimization level the compiler performs additional low-level
    transformations and introduces basic high-level analysis and loop
    transformation (see -qhot below). At this level the compiler can perform
    optimizations that may or may not be beneficial. It should also
    be noted that the -O3
    option relaxes strict conformance in ways
    that can alter floating-point accuracy, and therefore the
    correctness of the program results needs to be checked when using
    this option.

    Further optimizations are introduced with the -O4
    option, including detailed loop optimization and basic
    inter-procedural analysis at link time. The most important
    addition at this level of optimization is the -qipa
    option, which activates the interprocedural analysis (IPA). IPA
    is capable of code optimizations and inlining from one
    compilation unit to another. To benefit from IPA the -qipa
    option must be specified in both the compilation and
    the link step. IPA-based optimizations may result in very long
    compilation times.

    -O5 is the highest XL compiler optimization level. It uses
    interprocedural analysis at a higher level. Note that this
    level of optimization may consume a lot
    of machine resources at compile time. The -O5
    option should be used only when the application benefits from
    lower optimization levels and performs as expected.

  • -qfloat

    This option provides control over the compiler optimizations that
    result in code that is not IEEE 754 compliant. To disable certain
    optimizations the proper suboptions should be used. The most common
    suboptions include:

    • -qfloat=[no]maf
      will enable or disable combined multiply-add instructions.
      Sometimes the nomaf
      suboption must be used to reproduce identical results to those
      on other architectures. Disabling multiply-add instructions
      will likely result in significant performance degradation.
    • -qfloat=[no]rrm
      controls if round-to-nearest rounding mode is used. The
      default is
    • -qfloat=[no]rsqrt
      optimizes division by a square root by replacing it
      with multiplication by the reciprocal of the square root.
    • -qfloat=[no]single
      controls if single-precision arithmetic instructions are used
      for single-precision floating-point values.
    • -qfloat=[no]rngchk
      specifies the range-checking usage for input arguments for
      software divide and inlined square root operations.
    • -qfloat=[no]hscmplx
      enables optimizations on complex division and complex absolute
      value calculation (disabled by default).
  • -qhot

    High-order transformation (HOT) is a part of the XL compiler
    devoted to loop transformations. These optimizations are implied
    by default at the
    levels. With the
    option the
    suboption is implied. Loop optimization techniques implemented in
    HOT include loop nest interchange, loop fusion, unrolling of
    loops and data reorganization.
    also includes the
    suboption for SIMD mode loop execution (vectorization).

  • -qipa

    Interprocedural analysis (IPA) optimization basically operates on
    the entire program code and tries to apply inter-procedural
    optimization for the whole program. The IPA mechanism performs
    transformations at both compile and link time. In order to
    maximize the benefit the IPA optimizer must be used on both the
    compile and the link step. IPA may be used on different levels,
    without a level, or
    will result in
    option applies
    IPA level 0 would reduce compilation time, but performs only a
    limited analysis.

    Other IPA specific suboptions are
    in order to generate object files for IPA at link time only,
    allows additional compilation threads to speed up the IPA-based
    optimizations, and other options that can be found in the

    XL compiler reference guide

  • -qunroll

    This option controls loop unrolling. It is implied by any
    optimization level higher than 0. Further suboptions can be
    specified to control the level of automatic loop unrolling.

Diagnostic compiler flags

Sometimes the compiler is not able to perform certain code
optimizations or it performs unwanted changes. In order to check
what is actually done by the compiler a number of diagnostics can
be produced in the form of a listing or messages that may guide the
user in application tuning and in finding performance bottlenecks.

The diagnostic messages for each subroutine are generally written
into a separate listing file (such as mult.lst in Example 12).
This file consists of different sections, depending on the
compiler flags used. Some of these sections are:

  • Header section

    One line containing information about the compiler, source
    file and date and time of compilation.

  • Options section

    Displays the settings for all compiler options that are used in
    the current compilation step. If the compiler flag
    is used the settings for
    compiler options are given.

  • Source section

    This section appears only if the
    flag is used. It contains the source code with line numbers and
    (if present) the error messages that occurred during compilation.

  • Loop transformation section

    This section appears if the
    flag is used. It reports if and how loops were transformed by the
    compiler. It contains valuable information for optimization of
    the code, for example, whether SIMD vectorization (
    of a loop was possible or not, whether loop unrolling has been
    performed or if the order of loops has been changed.

  • Object section

    This section is created when using the
    flag and contains the object code listing, which shows the source
    line number, the instruction offset in hexadecimal notation, the
    assembler mnemonic of the instruction, and the hexadecimal value
    of the instruction. On the right side, it also shows the cycle
    time of the instruction and the intermediate language of the
    compiler. Finally, the total number of machine instructions that
    are produced and the total cycle time (straight-line execution
    time) are displayed. There is a separate section for each
    compilation unit.

The following options (referring to XL family of compilers) can
be used to obtain more information about the changes/optimizations
made by the compiler.

  • -qflag=<listing_severity>:<terminal_severity>

    Defines the minimum severity level of diagnostic messages to be
    written to the listing file or to the user terminal. The message
    severity levels are:

    i : informational messages
    w : warning messages
    e : error, severe error and unrecoverable error messages (C only)
    s : severe error and unrecoverable error messages
  • -qlist

    The compiler produces a compiler listing that includes an
    object listing. You can use the object listing to help understand
    the performance characteristics of the generated code and to
    diagnose execution problems.

  • -qlistopt

    With this option the compiler produces a compiler listing
    that displays all the options that were in effect when the
    compiler was invoked.

  • -qreport

    The compiler will generate for each source file
    a file
    with pseudo code and a description of the kind of code
    optimizations which were actually performed during compilation.

  • -qsource

    The compiler produces a compiler listing that includes
    source code.

  • -qsrcmsg

    This option adds the corresponding source code lines to the
    diagnostic messages in the stderr file.

  • -qxflag=diagnostic

    This flag causes the compiler to print information about
    code optimization during compile time to stdout.

  • -v

    Instructs the compiler to report information on the
    progress of the compilation, and names the programs being invoked
    within the compiler and the options being specified to each

As an example, the following Fortran routine
was compiled with compiler diagnostic option

Example 11. Fortran subroutine mult

  subroutine mult(c,a,ndim)

    implicit none

    integer          :: ndim,i,j
    double precision :: a(ndim),c(ndim,ndim)

  ! Loop
    do i=1,1000
      do j=1,1000
        c(i,j) = a(i)
      end do
    end do

  end subroutine mult


The loop transformation section in the file
looks as follows:

Example 12. Loop transformation section in mult.lst

              1|         SUBROUTINE mult (c, a, ndim)
              1|           $.CSE0 = ndim
             10|           IF (.FALSE.) THEN
                             $.LoopIV0 = 0
                Id=5         DO $.LoopIV0 = $.LoopIV0, 1844674
             11|               IF (.FALSE.) GOTO lab_40
                               $.LoopIV1 = 0
             12|               $.ICM0 = $.CSE0
                               $.CSE1 = max($.CSE0,0)
                               $.ICM1 = $.CSE1
                               $.ICM2 = ($.CSE1 * (-8) - 8)
                               $.ICM3 = ($.LoopIV0 + 1)
                               $.ICM4 = ($.CSE1 * 8)
             11|Id=6           DO $.LoopIV1 = $.LoopIV1, 999
                                 ! DIR_INDEPENDENT loopId = 0 
                                 ! DIR_INDEPENDENT loopId = 0 
             12|                 $.CSE2 = ($.LoopIV1 + 1)
                                 c($.CSE2,$.ICM3) = a($.CSE2)
             13|               ENDDO
             14|             ENDDO
             10|           IF (.FALSE.) GOTO lab_9
                           $.CIV2 = 0
                Id=1       DO $.CIV2 = $.CIV2, 499
             11|             IF (.FALSE.) GOTO lab_11
                             $.LoopIV1 = 0
             12|             $.ICM0 = $.CSE0
                             $.CSE3 = max($.CSE0,0)
                             $.ICM1 = $.CSE3
                             $.ICM2 = ($.CSE3 * (-8) - 8)
                             $.CSE4 = ($.CIV2 * 2)
                             $.ICM5 = ($.CSE4 + 1)
                             $.ICM4 = ($.CSE3 * 8)
                             $.ICM6 = ($.CSE4 + 2)
             11|Id=2         DO $.LoopIV1 = $.LoopIV1, 999
                               ! DIR_INDEPENDENT loopId = 0 
                               ! DIR_INDEPENDENT loopId = 0 
             12|               $.CSE5 = ($.LoopIV1 + 1)
                               $.CSE6 = a($.CSE5)
                               c($.CSE5,$.ICM5) = $.CSE6
                               c($.CSE5,$.ICM6) = $.CSE6
             13|             ENDDO
             14|           ENDDO
             16|           RETURN
                         END SUBROUTINE mult
         Source     Source     Loop Id    Action / Information
         File       Line
         ---------- ---------- ---------- --------------------
                  0         10          1 Loop interchanging 
                                          applied to loop nest
                  0         10          1 Outer loop has been 
                                          unrolled 2 time(s)


This indicates that loop interchange has been performed (the inner
loop is now
id=2) and that the outer loop has been unrolled 2 times.
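
A listing such as the one above can be searched for the transformation summary without reading through the pseudo code. The following sketch assumes that the source file is called mult.f90, that the compiler writes mult.lst next to it, and that -O3 -qhot -qreport is the flag combination in use. All three are assumptions made for illustration. If bgxlf_r is not available (for example on a workstation) the script only prints the command.

```shell
#!/bin/sh
# Build the compile command for a loop-transformation report.
# Assumptions: source file mult.f90, listing file mult.lst, flags as below.
src=mult.f90
lst=mult.lst
cmd="bgxlf_r -O3 -qhot -qreport -c $src"

if command -v bgxlf_r >/dev/null 2>&1 && [ -f "$src" ]; then
    $cmd
else
    echo "would run: $cmd"
fi

# Print the lines of a listing that report loop transformations.
summarize_listing() {
    grep -n -i -E 'interchang|unrolled|vectorized' "$1"
}

if [ -f "$lst" ]; then
    summarize_listing "$lst" || echo "no transformation messages found in $lst"
fi
```

The search patterns are taken from the message texts shown in this guide. Other compiler versions may word their messages differently.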

For further information about diagnostic compiler flags, please see
the IBM documentation.

Making efficient use of the double FPU (“double hummer”)

To make use of the double FPU unit (450d mode) within the
BlueGene/P compute processor user applications must meet the
following general guidelines and requirements. Because the double
FPU is capable of operating on vector data in SIMD mode (Single
Instruction Multiple Data mode), selected parts of program code
(loops in particular) are often said to be

  • Use compiler options to generate SIMD code

    Only the XL family of compilers offers support for the double FPU,
    thus all the information in this section refers to the XL C or XL
    Fortran compilers only. Two different sets of compiler options
    can generate code for the double FPU (SIMD instructions):

    • default
      option together with
    • high-order optimization
      options or
      with more specific
      (valid only with
  • Proper data alignment

    Using the double FPU requires data to be 16-byte aligned. This means
    that program data structures (arrays, structs, etc.) must be
    placed in memory at addresses that are a multiple of 16 bytes.
    Normally the XL compilers try to allocate data with proper
    alignment, but in some cases this may not be sufficient. You may look
    for SIMD problems related to improper alignment in the compiler
    listing file. To have a listing file generated, use the
    -qreport -qlist
    options on the compile command.

    If the compiler failed to align the data for SIMD execution you
    may need to use alignment assertions (for a
    being an array):

    • __alignx(16,a)
      in C
    • call alignx(16, a(1))
      in Fortran

    These assertion intrinsics are supported by XL compilers only.

  • Checking if Double FPU is used

    To ensure that the compilers have generated SIMD instructions,
    one may examine the compiler listing file. Compile your program
    with the
    -qlist -qsource
    options
    to include the assembler listing in the file. The complete list
    of SIMD assembler instructions for the double FPU can be found
    in the
    IBM Redbook “IBM System Blue Gene Solution: Blue Gene/P
    Application Development”
    RedBook AD.

    If any of these instructions appear in a listing file, then
    at least some part of the program is using the double FPU. For
    example, the following line in the .lst file:

    46| 000640 fpmadd   00611020   1     FPMADD    fp3,fp35=fp2,fp34,\

    indicates that the fpmadd
    instruction for SIMD fused multiply-add is used.

    You may also use
    option on the compile command to see additional information
    regarding SIMD optimizations performed by the compiler (see

  • Using SIMD intrinsic routines

    Another method of exploiting the double FPU is to use SIMD
    compiler intrinsics. These are direct equivalents for SIMD
    assembler instructions discussed earlier, supported by XL
    compiler family. Functions used for elementary SIMD arithmetic
    operations are callable from both C and Fortran. Again, for more
    information refer to the IBM Redbook
    RedBook AD.

  • Optimized SIMD library routines

    The Blue Gene/P system offers mathematical libraries optimized
    for the double FPU. This is often the easiest way of making
    efficient use of double FPU. For more information on SIMD
    libraries see the

  • Other issues that influence SIMD execution

    Many more factors may inhibit the generation of SIMD instructions
    and, furthermore, lower the performance of the application.
    Typical issues that should be avoided are
    non-stride data access, loops with subroutine/function
    calls, many-branch or conditional instructions, pointer
    aliasing and non-trivial data dependencies.
    Some of these issues can be detected by the compiler. Use the
    option (see
    in the compile command and examine the listing file for
    potential SIMD related issues.

Optimization strategies

In this section we provide some optimization strategies for
serial code performance with the help of examples.

Consider the following example of a basic Fortran program that
multiplies two square double-precision matrices. The program uses
the direct method of matrix multiplication with two nested loops
and a third, implicitly nested inner loop (using the sum
intrinsic function in Fortran).

Example 13. Matrix multiplication in Fortran

      program mat_mul
      Implicit None

      Integer, parameter :: n = 1024
      Real*8 :: a(n,n), b(n,n), c(n,n)
      Real*8 :: alf
      Real :: timef, time1, time0
      Integer :: i, j

      call mat_init(a, b, c, n)
      time0 = timef()
      Do i = 1,n
       Do j = 1,n
        c(i,j) = sum(a(i,:)*b(:,j))
       End Do
      End Do
      time1 = timef()
      write(0,'(A,F12.9,A)')'mat_mul in ',0.001*(time1-time0),&
     &                      ' seconds'

      call mat_cmp(c, n)

      end program mat_mul


The optimization strategy is to execute the loops in SIMD mode
using the double FPU. In order to measure the execution time, the
intrinsic XL Fortran function timef is used. This function returns the
time in milliseconds. Only the time spent in the loops is
measured. Additional external subroutines are used for initializing
the matrices (mat_init) and for further computation (mat_cmp).
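
The Mflops figures quoted in the following steps can be reproduced from the measured times. This is a minimal sketch, assuming the conventional count of 2n^3 floating-point operations for an n x n matrix product and a peak of 3400 Mflops per core (850 MHz with 4 floating-point operations per cycle). The function names are illustrative.

```shell
#!/bin/sh
# Mflops for an n x n matrix multiplication that took the given seconds.
# Assumption: 2*n^3 floating-point operations (one multiply, one add per term).
mflops() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 2 * n * n * n / t / 1.0e6 }'
}

# Share of an assumed per-core peak of 3400 Mflops, in percent.
percent_of_peak() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", 100 * m / 3400 }'
}

rate=$(mflops 1024 0.831952870)
echo "rate: $rate Mflops, $(percent_of_peak "$rate")% of assumed peak"
```

Called with n = 1024 and the time of 0.831952870 seconds measured in the last step, the helper reproduces the 2581.26 Mflops quoted there.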

  • Basic compiler optimization

    Compile with the basic optimization options for BlueGene/P

    bgxlf_r -O2 -qtune=450 -qarch=450 -o mat_mul.exe mat_mul.f90

    and run the
    program on a processor core. Time spent in main loops should be
    as follows:

    mat_mul in 70.269523621 seconds

    which corresponds to about 30.5 Mflops, below 1% of the theoretical
    peak performance of a processor core.

  • More compiler optimization

    Change the optimization level from -O2 to -O3
    and add the -qstrict
    option to avoid numerical differences caused by aggressive code
    optimization:

    bgxlf_r -O3 -qstrict -qtune=450 -qarch=450 ...

    The result is 52.3 seconds.

  • Automatic SIMD code generation

    Switch to the -qarch=450d
    option to turn on compiler generation of SIMD instructions for the
    double FPU:

    bgxlf_r -O3 -qstrict -qtune=450 -qarch=450d ...

    The resulting time is 60.05 seconds, which is even worse than with the
    previous setting.

  • Cache friendly loop ordering

    One of the reasons for inefficient SIMD usage is data access
    which is not cache aware. Matrices in Fortran are stored in
    memory column by column, and the loop order should therefore maximize
    column-based data access. Code with reordered loops for matrix
    multiplication may look as in the following example:

    Example 14. Cache-friendly matrix multiplication in Fortran

          program mat_mul
          Implicit None
          Integer, parameter :: n = 1024
          Real*8 :: a(n,n), b(n,n), c(n,n)
          Real*8 :: alf
          Real :: timef, time1, time0
          Integer :: i, j, k
          call mat_init(a, b, c, n)
          time0 = timef()
          Do j = 1,n
           Do k = 1,n
            Do i = 1,n
             c(i,j) = c(i,j) + a(i,k)*b(k,j)
            End Do
           End Do
          End Do
          time1 = timef()
          write(0,'(A,F12.9,A)')'mat_mul in ',0.001*(time1-time0),&
         &                      ' seconds'
          call mat_cmp(c, n)
          end program mat_mul


    Note that the sum
    function is no longer used and a third nested loop is explicitly
    introduced. Moreover, the loop order is changed from i, j, k
    (with k implicit in sum) to j, k, i.

    Leave the -qarch=450d
    option turned on for compilation:

    bgxlf_r -O3 -qstrict -qtune=450 -qarch=450d ...

    Run the code:

    mat_mul in 3.469617128 seconds

    Note the dramatic change in the program runtime! This
    corresponds to roughly 619 Mflops.

  • High order transformations

    One could try compilation with high-order compiler transformations:

    bgxlf_r -O3 -qhot -qstrict -qtune=450 -qarch=450d ...

    which is not beneficial in this case:

    mat_mul in 7.878117085 seconds

    A more inquiring user may notice that the -qhot
    option results in the substitution of a BLAS
    routine for the original triple loop. Compile with the additional
    -qreport -qlist
    options to produce a compiler listing, which will be placed in a
    listing file. Open the listing file to see the compiler-optimized code:

    23|           $.__xl_matmul__ALPHA_10 =  1.0000000000000000E+000
                  $.__xl_matmul__BETA_00 =  1.0000000000000000E+000
                  $.__xl_matmul__transab_N0 = 78
                  $.__xl_matmul__I0 = 1024
                  $.__xl_matmul__J0 = 1024
                  $.__xl_matmul__K0 = 1024
                  $.__xl_matmul__LDA0 = 1024
                  $.__xl_matmul__LDB0 = 1024
                  $.__xl_matmul__LDC0 = 1024
                  CALL __xl_dgemm($.__xl_matmul__transab_N0,&
                &   $.__xl_matmul__transab_N0,$.__xl_matmul__I0,\
                &   $.__xl_matmul__K0,$.__xl_matmul__ALPHA_10,0 + a,&
                &   $.__xl_matmul__LDA0,0 + b,$.__xl_matmul__LDB0,&
                &   $.__xl_matmul__BETA_00,0 + c,$.__xl_matmul__LDC0)

    That means the compiler is capable of recognizing the matrix
    multiplication code pattern and tries to use what is theoretically the
    most optimized solution. The problem is that the XL compiler
    intrinsic BLAS library call (__xl_dgemm)
    is not as optimal as it could be.

    In the listing file you may also find some messages about
    SIMD optimization:

    1586-551 (I) Loop (loop index 9) at mat_mul.f90 <line 16> was not\
    SIMD vectorized because it contains unsupported vector data types.
    1586-542 (I) Loop (loop index 10 with nest-level 1 and iteration \
    count 1024) at mat_mul.f90 <line 15> was SIMD vectorized.
    1586-543 (I) <SIMD info> Total number of the innermost loops\
    considered <"5">. Total number of the innermost loops SIMD\
    vectorized <"2">.

    This information may guide you to a more SIMD-optimized
    version of your code.

  • Use SIMD intrinsics

    Another step to increase the level of double FPU usage
    may be to introduce the SIMD intrinsic functions for elementary
    arithmetic operations. This probably needs a significant amount
    of work. Moreover, it may require advanced knowledge of
    processor programming.

    The list of XL compiler SIMD intrinsic functions may be found in
    the IBM Redbook for “IBM System Blue Gene Solution: Blue Gene/P
    Application Development”
    RedBook AD.

    In this book you may find a SIMD Fortran routine for
    double-precision square matrix-matrix multiplication. Look for
    example 8-19 on page 136; the routine name is dsqmm.
    You can copy it into your program and call it directly instead of
    the triple loop. Compile the modified program for the double FPU:

    bgxlf_r -O3 -qstrict -qtune=450 -qarch=450d ...

    It can be easily seen that SIMD optimization was still
    profitable – the program is twice as fast with a performance of
    1267.64 Mflops:

    dsqmm in 1.694082141 seconds

  • Use optimized SIMD math library

    Often the most effective way to raise the performance is to
    use a highly optimized SIMD library. On the JUGENE system the
    most optimized mathematical library is the IBM ESSL
    library. It provides extended BLAS functionality
    for optimized linear algebra.

    To use the BLAS routine for matrix multiplication (double-precision
    general case:
    dgemm) one needs to modify the example as follows:

    Example 15. DGEMM matrix multiplication in Fortran

          program mat_mul
          Implicit None
          Integer, parameter :: n = 1024
          Real*8 :: a(n,n), b(n,n), c(n,n)
          Real*8 :: alf
          Real :: timef, time1, time0
          Integer :: i, j
          Character :: transa
          External dgemm
          call mat_init(a, b, c, n)
          transa = 'N'
          alf = 1.0
          time0 = timef()
          call dgemm(transa,transa,n,n,n,alf,a,n,b,n,alf,c,n)
          time1 = timef()
          write(0,'(A,F12.9,A)')'dgemm in ',0.001*(time1-time0),&
         &                      ' seconds'
          call mat_cmp(c, n)
          end program mat_mul


    Leave the compilation options unchanged, recompile and run
    the program:

    dgemm in 7.894634247 seconds

    Note that the result is practically the same as with the compiler
    BLAS substitution. The same (not optimal) compiler intrinsic
    version of the library was used.

    In order to use the ESSL library you need to link your program with the
    -lesslbg option. On JUGENE the ESSL library is located in the
    /opt/ibmmath/essl/4.4/lib
    directory. Recompile the code with the correct linking options:

    bgxlf_r -O3 -qstrict -qtune=450 -qarch=450d -o mat_mul.exe \
    mat_mul.f90 -L/opt/ibmmath/essl/4.4/lib -lesslbg

    Run the BLAS and ESSL version to measure the performance:

    dgemm in 0.831952870 seconds

    The performance is doubled again! We have reached 2581.26
    Mflops which is about 75% of the peak performance.
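
The compile lines used in the steps above differ only in a few flags, so they can be generated in one place when repeating the experiment. This is a sketch only: the source file name mat_mul.f90 and the ESSL path are taken from the commands above, the per-variant executable names are an assumption, and the script merely prints the commands when bgxlf_r is not installed.

```shell
#!/bin/sh
# Print (or run) the compile command for one optimization variant.
# $1: short label used in the executable name (assumed naming scheme)
# $2: variant-specific flags; $3: optional extra link flags
build_variant() {
    cmd="bgxlf_r $2 -qtune=450 -o mat_mul_$1.exe mat_mul.f90 $3"
    if command -v bgxlf_r >/dev/null 2>&1 && [ -f mat_mul.f90 ]; then
        $cmd
    else
        echo "$cmd"
    fi
}

build_variant o2   "-O2 -qarch=450"
build_variant o3   "-O3 -qstrict -qarch=450"
build_variant simd "-O3 -qstrict -qarch=450d"
build_variant hot  "-O3 -qhot -qstrict -qarch=450d"
build_variant essl "-O3 -qstrict -qarch=450d" "-L/opt/ibmmath/essl/4.4/lib -lesslbg"
```

Each variant still has to be paired with the matching version of the source (original loops, reordered loops, or the dgemm call), since the steps above change the code as well as the flags.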

7.2. Advanced environment and MPI tuning

7.2.1. Environment variables

Here we list some environment variables (in alphabetical order)
which can be used in order to tune the performance of applications.
A more comprehensive list of available Blue Gene/P MPI environment
variables can be found in the Blue Gene/P Application Development
RedBook AD.


    Defines whether core dumps are generated or not in case the
    application aborts due to an error. The default is
    (no core dumps are generated). To enable the generation of core
    dumps set this variable to
    Section 8.3
    for further information).


    This environment variable defines how data is exchanged on
    collective reads and writes. Possible values are 0 (use
    and 1 (use
    The default is 0. When using a large number of tasks in a
    memory-demanding application the performance of the
    might scale worse than the point-to-point version for the I/O or
    the application might even crash with a
    signal 6.
    Setting this variable to 1 might help in this case and improve
    the performance (see also


    This variable can be used to tune how aggregate file domains are
    calculated. Possible values are 0 (Evenly calculate file domains
    across aggregators, also use
    to exchange domain information) and 1 (Align file domains with the
    underlying file system’s block size, also use
    to exchange domain information). The default is 1. This variable
    can be used together with the
    variable (see above), for example,
    to avoid


    require 6 arrays each of size
    to be setup before communication begins. If your application does
    not use
    or needs as much memory as possible you can turn off
    pre-allocating these arrays by setting the variable to N. The
    default setting is Y, which means allowing the pre-allocation.


    Use this variable to control the output of information associated
    with the Direct Memory Access (DMA) messaging device. Possible
    values are 0 (no DMA information output) and 1 (DMA information
    output). The default is 0. If the job encounters events that do
    not cause the application to fail but might be useful for
    application developers (for example RAS Events) this information
    will be displayed by setting
    When using the option
    -verbose 2
    with the
    command this variable is automatically set to 1.

  • DCMF_EAGER=message size (bytes)

    This environment variable sets the message size (in bytes) above
    which the MPI rendezvous protocol is used (for further details
    about the MPI protocols available, please see the Blue Gene/P
    Application Development Redbook, RedBook AD).
    The default size is 1200 bytes. The MPI rendezvous protocol is
    optimized for maximum bandwidth. However, there is an initial
    handshake between the communication partners, which increases the
    latency. If your application uses many short messages you
    might want to decrease this message size limit (even down to 0).
    On the other hand, if your application can be mapped well to the
    torus network and uses mainly large messages, increasing the limit
    might lead to better performance.


    If set to 1 the interrupt driven communication is turned on.
    This can be beneficial to some applications in order to overlap
    communication with computation and is required if you are using
    Global Arrays or ARMCI. The default setting is 0.

  • DCMF_RECFIFO=buffer size (bytes)

    Packets that arrive off the network are placed into a reception
    buffer. The environment variable DCMF_RECFIFO
    can be used to set the size of this buffer. The default size is 8
    MB per process. If a process is busy and does not call often
    this buffer can become full and the movement of further packets is
    stopped until the buffer is free again. This can slow down the
    application. In this case the buffer should be increased.

  • DCMF_RGETFIFO=buffer size (bytes)

    When a remote get packet arrives off the network, the packet
    contains a descriptor describing data that is to be retrieved and
    sent back to the node that originated the remote get packet. The
    DMA injects that descriptor into a remote get injection buffer.
    The DMA then processes that injected descriptor by sending the
    data back to the originating node. Remote gets are commonly done
    during point-to-point communications when large data buffers are
    involved (typically larger than 1200 bytes). DCMF_RGETFIFO
    can be used to set the size of the buffer. The default size is 32
    KB. When a large number of remote get packets are received by a
    node, the remote get injection buffer may become full of
    descriptors. In this case the size of the buffer can be increased
    using DCMF_RGETFIFO.


    Setting this variable to 1 turns on sender-side matching and
    can speed up point-to-point messaging in well-behaved
    applications. The default setting is 0.
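
The effect of the DCMF_EAGER limit described in the list above can be illustrated with a few lines of Python. This is only a sketch for reasoning about message sizes: the helper name uses_rendezvous is hypothetical, and reading "above" as "strictly greater than" is an assumption, not something the text confirms.

```python
DEFAULT_EAGER_LIMIT = 1200  # bytes, the default quoted in the text


def uses_rendezvous(message_size, eager_limit=DEFAULT_EAGER_LIMIT):
    """Return True if a message of message_size bytes would use the
    rendezvous protocol for the given DCMF_EAGER limit.

    Assumption: the switch happens for sizes strictly above the limit.
    """
    return message_size > eager_limit


# Compare the default limit with a raised limit (hypothetical value).
for size in (64, 1200, 1201, 65536):
    print(size, uses_rendezvous(size), uses_rendezvous(size, eager_limit=100000))
```

Such a helper can be used to estimate which share of an application's messages would change protocol before experimenting with different DCMF_EAGER values in real runs.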

7.2.2. Network

The choice of the network (connection) can have a big influence on
the performance of applications. The Blue Gene/P offers two networks
which can be used by applications for the communication on and
between compute nodes. The
network interconnects all compute nodes and has a topology of a
three-dimensional torus, that means each node has six nearest neighbours.
network is a global collective tree network. It interconnects all
compute and I/O nodes.

The default network is the
network. You can change it using the LoadLeveler keyword
bg_connection, specifying the desired network
as shown in the following example:

Example 16. Specifying the network with #@bg_connection

#@job_name = Job_example
#@comment = "BGP Job with torus network"
#@job_type = bluegene
#@bg_connection = TORUS


The example shows part of a job script for using the torus network.
The choice can have a big influence on the performance of your
application. In case of doubt choose
For partitions with less than 512 nodes only
can be used.


In general, applications benefit from using the torus
network. Therefore, it is recommended to use
#@bg_connection = TORUS.

7.2.3. Shape

The extension of a partition in X, Y and Z direction is called the
shape of a partition. The shape can be specified in units of midplanes
using the LoadLeveler keyword bg_shape,
where 1 midplane contains 512 nodes (=2048 cores):

Example 17. Specifying the shape with #@bg_shape

#@job_name = Shape_example
#@comment = "BGP Job by Shape"
#@job_type = bluegene
#@bg_connection = TORUS
#@bg_shape = 1x1x2
#@bg_rotate = TRUE


This job script reserves a partition with 1 midplane in X and Y, and
2 midplanes in Z direction using in total 1024 nodes (4096 cores).
The keyword bg_rotate
tells LoadLeveler whether the job can run on any partition with the
correct size (TRUE;
this is the default) or if you want to have exactly the specified
shape (FALSE). In the first case the next free partition with 1024 nodes will be
used, regardless of its shape (for example, a partition of shape
could be used). The optimal shape for an application depends on the
communication pattern of the code. For example, if the application
uses a communicator of dimensions
a shape like
might show a better performance than for example
For further information, please see also
Section 7.2.4.
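
The relation between a bg_shape value and the resulting partition size can be checked with a short calculation. The following Python sketch uses only the figures given above (512 nodes per midplane, 4 cores per node); the function name partition_size is hypothetical.

```python
NODES_PER_MIDPLANE = 512  # from the text: 1 midplane contains 512 nodes
CORES_PER_NODE = 4        # from the text: 512 nodes correspond to 2048 cores


def partition_size(bg_shape):
    """Return (nodes, cores) for a shape string such as '1x1x2',
    given in units of midplanes."""
    x, y, z = (int(n) for n in bg_shape.lower().split("x"))
    nodes = x * y * z * NODES_PER_MIDPLANE
    return nodes, nodes * CORES_PER_NODE


nodes, cores = partition_size("1x1x2")
print("1x1x2:", nodes, "nodes,", cores, "cores")
```

For the shape used in Example 17 this reproduces the 1024 nodes (4096 cores) stated above.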

If you use partition sizes of 1 midplane or less on JUGENE the
partitions will have the dimensions NX, NY and NZ in units of nodes
(32 nodes is the smallest size you can reserve on JUGENE) as shown in
Table 16.

Table 16. Dimensions in units of nodes in NX, NY and NZ directions for
partitions with up to 512 nodes on JUGENE.

Number of nodes  NX  NY  NZ
             32   4   4   2
             64   4   4   4
            128   4   4   8
            256   8   4   8
            512   8   8   8


7.2.4. Mapping

The default mapping on JUGENE is in XYZT
order. Here, X, Y and Z are the torus coordinates of the compute
nodes within a partition and T is the core coordinate within a node
(T=0,1,2,3). Each core is therefore uniquely identified by these four
coordinates. When mpirun
launches a parallel application it distributes the MPI tasks in
such a way that the first coordinate of the mapping is increased
first. Therefore, with XYZT
mapping the first MPI task is executed on the core with the
coordinates (0,0,0,0), the second on the core (1,0,0,0),
the third on the core (2,0,0,0),
and so on. Since in general adjacent tasks are not executed on
adjacent cores, this might not be the optimal mapping for all
applications. You can change the mapping using the -mapfile
option of the mpirun command:

mpirun -mapfile mapping -exe myproc

Here, mapping can either be any permutation of X, Y, Z and T or a file
which contains explicit instructions on how to map the MPI tasks to the
cores.
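
To see what a permutation mapping means in practice, the placement rule stated above (the first coordinate of the mapping is increased first) can be written down in a few lines of Python. This is a sketch only: the function name task_coordinates is hypothetical, and the shape of one midplane in VN mode (8x8x8 nodes, 4 cores per node) is used as an example.

```python
from itertools import islice


def task_coordinates(mapping, shape):
    """Yield the (X, Y, Z, T) coordinates of MPI tasks 0, 1, 2, ...

    mapping: a permutation of the letters X, Y, Z and T, e.g. "TXYZ"
    shape:   dict with the extent of each of the four coordinates
    """
    extents = [shape[axis] for axis in mapping]
    total = 1
    for extent in extents:
        total *= extent
    for rank in range(total):
        coord = {}
        rest = rank
        for axis, extent in zip(mapping, extents):
            coord[axis] = rest % extent  # first letter varies fastest
            rest //= extent
        yield tuple(coord[axis] for axis in "XYZT")


shape = {"X": 8, "Y": 8, "Z": 8, "T": 4}  # one midplane in VN mode
for mapping in ("XYZT", "TXYZ"):
    print(mapping, list(islice(task_coordinates(mapping, shape), 5)))
```

With TXYZ the first four tasks share one node and differ only in T, whereas with XYZT consecutive tasks are placed on different nodes along the X direction.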


Since for most applications the default mapping XYZT
is not optimal, we recommend testing at least the option
-mapfile TXYZ.

Explicit mapfiles

An explicit mapping of tasks to cores can be specified using
a mapfile. Each line of such a file contains four integers which
represent coordinates of the four-dimensional torus in the order X,
Y, Z and T. The tasks are mapped to the specified coordinates in
ascending order (see example below). The optimal mapping depends on
the communicator used by the application and the shape of the
partition used for the run. Therefore, it is not possible to choose
an optimal mapping without detailed knowledge about the
application. In order to describe the strategy to follow we discuss
an example case.

Suppose we have an application which uses a three-dimensional
communicator of dimensions 16x16x8
which should run on one midplane in VN mode (that means a
four-dimensional torus of shape 8x8x8x4).
The goal is to factorize the shape dimensions by choosing a
corresponding mapping in such a way that it fits the
communicator of the application. In this case the factorization
16x16x8 = (8x2)x(8x2)x8, in which the four cores of each node are
split into 2x2 and combined with the X and Y directions,
would be optimal. The following mapfile (only shown in parts)
realizes the described mapping (everything after “#” is only a
comment):

0 0 0 0 # task  0; communicator coordinates ( 0, 0, 0)
1 0 0 0 # task  1; communicator coordinates ( 1, 0, 0)
2 0 0 0 # task  2; communicator coordinates ( 2, 0, 0)
7 0 0 0 # task  7; communicator coordinates ( 7, 0, 0) 
0 0 0 1 # task  8; communicator coordinates ( 8, 0, 0)
1 0 0 1 # task  9; communicator coordinates ( 9, 0, 0)
7 0 0 1 # task 15; communicator coordinates (15, 0, 0)  
0 1 0 0 # task 16; communicator coordinates ( 0, 1, 0)
1 1 0 0 # task 17; communicator coordinates ( 1, 1, 0)
7 1 0 0 # task 23; communicator coordinates ( 7, 1, 0)
0 1 0 1 # task 24; communicator coordinates ( 8, 1, 0)
1 1 0 1 # task 25; communicator coordinates ( 9, 1, 0)
0 0 0 2 # task 128; communicator coordinates ( 0, 8, 0)
1 0 0 2 # task 129; communicator coordinates ( 1, 8, 0)
2 0 0 2 # task 130; communicator coordinates ( 2, 8, 0)
7 0 0 2 # task 135; communicator coordinates ( 7, 8, 0)
0 0 0 3 # task 136; communicator coordinates ( 8, 8, 0)
1 0 0 3 # task 137; communicator coordinates ( 9, 8, 0)

When using explicit mapfile you must use the LoadLeveler keyword
in your job script.
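
Writing such a mapfile by hand is error-prone, so it is usually generated by a small script. The following Python sketch reproduces the mapping of the listing above. It assumes a communicator of dimensions 16x16x8 (inferred from the communicator coordinates in the comments of the listing) on a torus of shape 8x8x8x4; the function name mapfile_lines is hypothetical.

```python
COMM = (16, 16, 8)    # communicator dimensions (assumption, see lead-in)
TORUS = (8, 8, 8, 4)  # X, Y, Z, T extents of one midplane in VN mode


def mapfile_lines(comm=COMM, torus=TORUS):
    """Return one 'X Y Z T' line per MPI task, in ascending task order.

    The first communicator direction is folded onto X and the lower half
    of T, the second onto Y and the upper half of T, the third onto Z.
    """
    nx, ny, nz, nt = torus
    cx_max, cy_max, cz_max = comm
    assert cx_max == 2 * nx and cy_max == 2 * ny and cz_max == nz and nt == 4
    lines = []
    for cz in range(cz_max):
        for cy in range(cy_max):
            for cx in range(cx_max):  # first coordinate varies fastest
                x, tx = cx % nx, cx // nx
                y, ty = cy % ny, cy // ny
                lines.append("%d %d %d %d" % (x, y, cz, tx + 2 * ty))
    return lines


lines = mapfile_lines()
print(len(lines), "tasks; first entries:", lines[:3])
```

The returned lines can be written to a file and passed to mpirun with the -mapfile option. Every core of the partition appears exactly once, which is a useful sanity check for any generated mapfile.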

7.3. Advanced OpenMP usage

The Blue Gene/P compute node architecture offers
very few opportunities for advanced OpenMP
features. Here are a few elements one can consider when running an
OpenMP or a hybrid MPI+OpenMP code on Blue Gene/P.

7.3.1. Environment variables

Only one environment variable is really worth mentioning here: XLSMPOPTS.
It is useful when the code needs a large stack for running.
OpenMP uses a per-thread private stack that is by default limited to
4 MB. This might be insufficient for codes that require a
larger stack for storing automatic variables.

To set the per-thread stack size to the desired value, one can
for example use:

mpirun -env "XLSMPOPTS=stack=6000000" ...

This will set it to a per-thread value of 6000000 bytes.

For OpenMP codes compiled using the GNU compiler version 4.3.2
onward, one can use the environment variable GOMP_STACKSIZE
to achieve the same result. Note that for GOMP_STACKSIZE the given
number represents the desired per-thread stack size in kilobytes,
not in bytes. It can be set for example like this:

mpirun -env "GOMP_STACKSIZE=6000" ...

7.3.2. Thread affinity

Affinity is a concept that is not supported on Blue
Gene/P. There is therefore no point in trying to set it for
multi-threaded codes.

7.4. Hybrid programming

The Blue Gene/P machine is first and foremost a massively
parallel architecture. It is primarily designed for running MPI
parallel codes that are as large as possible. Since its back-end nodes
also allow shared-memory parallel programming with OpenMP, hybrid
MPI+OpenMP parallel codes are a natural solution.

The main advantages of such an approach are:

  • To increase the per-MPI task available memory without wasting
    the corresponding CPU capacity. By using the DUAL or SMP mode
    rather than the VN (virtual node) one, one can double or quadruple
    the amount of memory available for each MPI task. However, if a
    second level of intra-node parallelization is not introduced in the
    code, only one of the two or four cores of the node will be
    exploited. Adding OpenMP directives to your MPI code might solve
    this issue.
  • To reduce the number of MPI tasks without reducing the number
    of involved CPUs. The parallel design or scalability of some codes
    is inherently limited by a maximum number of possible MPI tasks. By
    allowing a given number of such tasks to run on two to four
    times as many cores, one can push this limit further
    without sacrificing the code’s efficiency.

This hybrid MPI+OpenMP approach has proved very successful on
many architectures, including Blue Gene/P.

7.4.1. Optimal tasks / threads strategy

For defining the optimal tasks versus threads strategy, one has to
understand that the Blue Gene/P compute nodes do not allow for more
user processes or threads running on a node than the number of
available cores. Moreover, as described in
Section 5.4.2,
OpenMP can be enabled on back-end nodes only in SMP or
DUAL mode.

Under those conditions, the number of OpenMP threads, set with
OMP_NUM_THREADS, can only be
1, 2, 3 or 4
for SMP mode and
1 or 2
for DUAL mode.

In practice, the only reason for running an OpenMP code in a
configuration other than DUAL mode with OMP_NUM_THREADS=2
or SMP mode with OMP_NUM_THREADS=4
would be that the available memory is insufficient for running with
all the possible threads. This might happen if, for example, the
private stack size needed by each OpenMP thread is especially large
and does not allow all the threads to fit in memory.

Only in this very unlikely case might it be necessary to set
OMP_NUM_THREADS to 2 or 3 rather than 4 in SMP mode, or to 1 in DUAL
mode, to solve the issue.
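
The rule behind these numbers is simply that tasks per node times threads per task must not exceed the four cores of a node. The following Python sketch makes this explicit; the names TASKS_PER_NODE and allowed_thread_counts are hypothetical, and the tasks per node for each mode follow from the description above (SMP: one task, DUAL: two tasks, VN: four tasks per node).

```python
CORES_PER_NODE = 4
TASKS_PER_NODE = {"SMP": 1, "DUAL": 2, "VN": 4}


def allowed_thread_counts(mode):
    """Return the thread counts per MPI task that fit on one node."""
    max_threads = CORES_PER_NODE // TASKS_PER_NODE[mode]
    return list(range(1, max_threads + 1))


for mode in ("SMP", "DUAL", "VN"):
    counts = allowed_thread_counts(mode)
    print(mode, "threads per task:", counts, "-> full node with", counts[-1])
```

The single value obtained for VN mode reflects that in this mode each core is already occupied by an MPI task, so no additional OpenMP threads can be used.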

7.5. I/O optimization

This section briefly mentions general guidelines for efficient I/O on
JUGENE and subsequently presents readily available tools that can be
of help in separate subsections. A general understanding of the
structure of JUGENE’s I/O subsystem and the organization of its file
system as documented in this guide in
Section 2.6 and
Section 2.7
is assumed.

7.5.1. General guidelines

Using the right file system

The scratch file system (WORK), accessed by means of the shell
environment variable $WORK,
is the file system of choice for jobs with demanding I/O. Direct
reading or writing of files on the HOME file system is reasonable
only for jobs that structurally use rather small I/O requests
– this means less than 4 MB per read or write action.

Adhere to block size and pay attention to block alignment

As with most file systems, I/O operations on WORK as well as HOME
are most efficient when the request sizes are exact multiples of
the file system’s block size and kept well aligned with file system
block boundaries. The block size for the WORK file system is 4 MB.
The block size for the HOME file systems is 1 MB. Users that want
to avoid the hard coding of block sizes can use the standard POSIX
stat() call in the initialization phase of their application. The field
st_blksize of the returned stat structure contains the block size of
the file system in bytes.
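
The same query can be tried out quickly from a script before it is built into an application. The following Python sketch uses os.stat, which exposes the st_blksize field of the POSIX stat structure, and rounds a request size up to a multiple of the block size. The function names and the example request size are hypothetical.

```python
import os


def fs_block_size(path):
    """Return the preferred I/O block size (st_blksize) for path, in bytes."""
    return os.stat(path).st_blksize


def align_up(nbytes, block_size):
    """Round nbytes up to the next multiple of block_size."""
    return -(-nbytes // block_size) * block_size


block_size = fs_block_size(".")  # use a directory on the target file system
request = 5000000                # hypothetical request size in bytes
print("block size:", block_size)
print("aligned request size:", align_up(request, block_size))
```

In a C or Fortran application the equivalent call would be made once during initialization, and the result used to size and align all subsequent I/O requests.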

The primary focus of special libraries for parallel I/O, such as
parallel HDF5 or parallel netCDF is on organizing output into
portable platform independent formats. Doing efficient I/O is to
some extent delegated to these libraries by using their routines,
but it is not their primary concern. These libraries are built on
top of MPI I/O. MPI I/O can be “hinted”, rather than instructed, to
be more efficient by using MPI_Info
key-value pairs to specify desired buffer size, stripe size, etc.

Optimize hierarchical I/O schemes by distributing the load over all available I/O nodes

I/O operations that stripe over a larger number of blocks are more
efficient, because they are capable of utilizing more of the
underlying storage resources in parallel. But they are costly in
terms of buffer memory needed in user space and may prove difficult
to organize on an architecture like the Blue Gene/P that offers a
fairly limited amount of memory per task. If such a hierarchical
scheme, where presumably a subset of the tasks is doing large I/O
operations, is implemented on JUGENE, it should be optimized to
spread the work well over all I/O nodes available to the job. IBM’s
implementation of MPI contains two Blue Gene/P specific extensions
that were added for this purpose:
Both are collective operations that create a set of
communicators, of which each node only sees the one that
corresponds to its place in the topology. The first call creates
communicators for every Pset that contain only the members of the
Pset. All I/O nodes are used if, for example, every node 0 in these
communicators takes the role of a master node for the rest of the
nodes. The second call creates a set of orthogonal communicators in
which no two members of a given communicator belong to the same
Pset.

Consider not using a hierarchical scheme

The same efficiency that is associated with striping over a large
number of file system blocks can in principle also be achieved by a
large number of parallel tasks engaging in operations on the same
file, each on its own fairly small number of blocks. The tasks must
be orchestrated to operate on distinct file system blocks rather
than operate on – from the perspective of file system organization
– arbitrary ranges of a file that overlap in file system block
usage. If all tasks are involved, doing I/O themselves, all I/O
nodes are involved as well. However, task-local files must be
avoided since handling of thousands of individual files can cause a
severe performance bottleneck (see
for further information).
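
The idea of letting every task operate on its own file system blocks of one shared file can be sketched as a simple offset calculation. The following Python example is an illustration under assumptions, not the layout used by any particular library: each task gets a contiguous chunk whose size is rounded up to a multiple of the block size, and the function name aligned_chunks is hypothetical.

```python
def aligned_chunks(ntasks, bytes_per_task, block_size):
    """Return a list of (rank, offset, chunk_size) tuples for one shared
    file in which no two tasks touch the same file system block."""
    # Round the per-task size up to a whole number of blocks.
    chunk = -(-bytes_per_task // block_size) * block_size
    return [(rank, rank * chunk, chunk) for rank in range(ntasks)]


BLOCK = 4 * 1024 * 1024  # 4 MB, the block size of WORK given above
for rank, offset, chunk in aligned_chunks(4, 5000000, BLOCK):
    print("task", rank, "offset", offset, "chunk", chunk)
```

The price of this layout is unused space at the end of each chunk, which is usually acceptable compared with the cost of several tasks competing for the same block.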

SIONlib (Section 7.5.5),
which is a library that is currently being developed at Jülich,
may be a helpful tool for automating the organization of I/O
along these lines.

Revise I/O of ported applications if their design assumes node-local storage

Applications ported from “Beowulf” clusters and other architectures
that typically have node local scratch space often have an
organization of I/O in which all task specific data – intermediate
states kept for check pointing, error logs, final output data – are
written to task specific files stored on local storage for scratch.
On these architectures this is indeed a fairly straightforward and
efficient way of making use of all the distributed I/O resources
that the platform has allocated to the job. On JUGENE it is not,
since there is no distributed node local scratch space. Trivial
adaptations of the application’s I/O organization, that simply keep
the multitude of task specific files and merely solve possible
filename conflicts that can occur because the scratch file system
on this platform has a global name space, lead to other severe
performance issues. A multitude of directories and/or files must be
created in the startup phase of a job, typically under a common
root in the file system. This leads to severe “hot spots” in
meta-data handling, and thus to congestion and severe slowdown for
the job – possibly even to slowdown of the I/O of unrelated jobs
that experience delay from the busy meta-data servers. Replace such
I/O schemes by an alternative that better fits JUGENE’s
architecture, for example, the hierarchical or the non-hierarchical
scheme discussed above.

7.5.2. HDF5

The HDF5 Library is available on JUGENE and can be loaded with

module load hdf5

Further information about how to use it on JUGENE can be
obtained with

module help hdf5

news hdf5

Further information about HDF5 are available at the

7.5.3. MPI I/O

MPI I/O is the part of the MPI standard that deals specifically with
parallel I/O. It offers routines that implement the opening,
reading, writing, and closing of files as collective actions. It is
generic and flexible. Both hierarchical and
non-hierarchical schemes can be implemented by means of MPI
I/O. It also offers routines for collective and non-collective handling
of files that have been opened collectively.

MPI I/O introduces its own data types. MPI files are lists of
MPI datatypes. Basic types are pre-defined. Derived types can be
tailored to suit the application’s needs. MPI I/O is available on
other platforms as well and files written by MPI I/O are portable
between different platforms. Usage of MPI I/O however does not
automatically tune file system access. The data types of MPI I/O
mainly make sense to the application, but are not necessarily well
aligned with respect to file system specific parameters.

Parallel versions of netCDF and HDF are built on top of MPI
I/O to enable parallel file access.

7.5.4. netCDF

Versions 3.6.3 and 4.1.1 of netCDF are installed on JUGENE.
Currently, no module is available, but you can find the libraries on


Documentation can be found in



Further information about netCDF are available


7.5.5. SIONlib

SIONlib is a small library that focuses primarily on tuning
massive parallel file access to the underlying storage system. It
uses MPI to manage opening and closing of files as collective
actions with standard POSIX I/O routines. It does not introduce its
own datatypes. Rather it takes the traditional Unix view that for
the I/O routines a file is a generic stream of binary data. Thus its
introduction into existing source code is not very intrusive. By
replacing a limited number of I/O calls by SIONlib alternatives that
internally take care of the orchestration of block size alignment,
the tuning of massive parallel I/O operations to the underlying file
system is significantly improved.

Several versions of SIONlib are available on JUGENE. You can
manage which one you want to use by means of environment modules.
To use the default version simply enter:

module load sionlib

To see which versions are available enter module avail sionlib,
and pick the version that suits you best.

The module command puts SIONlib tools like sioncat, siondump,
and sionsplit in your search path. These are for extracting data from
a sion file, dumping its meta data, and splitting a sion file into separate ones. But
most importantly this also puts the sionconfig tool into your search path.
Use this tool to obtain the correct values for the include files and libraries that you
need.
We assume usage of the IBM compiler by means of the mpixlc_r C
compiler wrapper script. To obtain the correct compiler switches for parallel I/O using
SIONlib on JUGENE, use the sionconfig tool as follows:

sionconfig --be --mpi --cflags

The --be option denotes the Blue Gene “Back End”
architecture as the target to generate code for, as opposed to the “Front End”
(--fe) architecture of the login nodes. Used this way, the sionconfig tool produces
something like the following (path details depend on the version):

-I/usr/local/sionlib/v1.2p2/include -DBGP -DSION_MPI -D_SION_BGP

which can be used in your Makefile or build script.

To obtain the correct switches for the linking of objects with libraries use:

sionconfig --be --mpi --libs

This produces something like the following (path details depend on the version):

-L/usr/local/sionlib/v1.2p2/lib -lsion_32 -lsionser_32

Thus, the following example would compile a file mysource1.c
to mysource1.o and subsequently link that object file with the SIONlib
libraries to produce the executable binary mybinary1.

mpixlc_r -I/usr/local/sionlib/v1.2p2/include -DBGP -DSION_MPI \
         -D_SION_BGP -c mysource1.c
mpixlc_r -o mybinary1 mysource1.o -L/usr/local/sionlib/v1.2p2/lib \
         -lsion_32 -lsionser_32

Further information about SIONlib is available at the SIONlib site.

7.6. Advanced job command language

7.6.1. Using multiple job steps

The LoadLeveler batch system allows users to create and control the
execution of more complex jobs. Users may define a sequence of job
steps which may depend on each other. For each of the job
steps a corresponding LoadLeveler keyword block must be defined in
the job command file. A job step keyword block must begin with the
step_name keyword, which defines the name of the step, and must contain
the queue keyword. LoadLeveler will execute all job steps as independent
jobs unless the keyword dependency
is used. Dependency means that the start of a job step
depends on the exit status of a previous job step.

Example 18. Job command file with multiple steps

#@job_name = example_multiple_steps
#@environment = COPY_ALL
#@notification = error
#@notify_user = my_address@my.institution
#@job_type = bluegene
#@bg_size = 32
#@step_name = step_1
#@error = step_1.err
#@output = step_1.out
#@queue
#@step_name = step_2
#@dependency = (step_1 == 0)
#@error = step_2.err
#@output = step_2.out
#@queue
#@step_name = step_3
#@error = step_3.err
#@output = step_3.out
#@queue

# Select the command to run according to the name of the current job step
case $LOADL_STEP_NAME in
   step_1 )
      mpirun -exe my_app.x -mode VN -np 4 ;;
   step_2 )
      mpirun -exe my_app_2.x -mode VN -np 16 ;;
   step_3 )
      mpirun -exe my_app_3.x -mode VN -np 64 ;;
esac


This example contains three separate job steps: step_1, step_2 and step_3.
The second step, step_2,
depends on the first step
and will run only if step_1
exits with the exit status 0. The last step, step_3,
is independent of the other job steps.

Note that all of the job steps use different executables. On the
Blue Gene system each of the executables should be started with the
mpirun command in order to be properly executed on the compute nodes. To
control the execution of the steps the LOADL_STEP_NAME
variable may be used. Another way of identifying job steps is the
$(stepid) variable in a job command file, which contains the job step
identifier and increases with each job step.

7.6.2. Job command file variables

LoadLeveler furthermore provides variables that can be used in
a job command file. The syntax for all LoadLeveler variables is
$(variable_name). The following list enumerates the most helpful
available variables:

  • $(home)

    The home directory of the user used to run the job.

  • $(host)

    The hostname of the machine from which the job was submitted.
    $(host) and $(hostname) are equivalent.

  • $(jobid)

    The id number assigned to this job by LoadLeveler.

  • $(stepid)

    The id number assigned to this job step when multiple job
    steps are defined.

  • $(user)

    The user name of the user submitting the job.

7.6.3. Run-time environment variables

For a complete reference of the LoadLeveler run time
environment variables please refer to the LoadLeveler manual.

7.7. Further reading, information and references

JUGENE documentation on the web

General related information

8. Debugging

If an application aborts unexpectedly it is useful to monitor the
execution of the application in more detail in order to check which
branches of the code are actually executed, what the actual values
of variables are, which part of the memory is used etc. The simplest way
to do this
is to use print
statements in the code in order to get the desired information.
However, this is tedious (each time a print
statement is added the source needs to be recompiled and rerun).
Furthermore, since the code is modified the runtime conditions change
and may influence the behavior of the application. Therefore, this
way of debugging is not recommended.

Instead, in the first place the compiler offers the possibility
to check for certain errors during the compilation of the code. For
this special compiler flags have to be used which will be described in
more detail in the next section. It is recommended to go this way
first when debugging is necessary, because the usage is quite easy and
does not require any additional software.

But not all errors can be detected this way since some occur only at
run time. In this case debuggers
need to be employed. Debuggers are powerful tools to analyze the
execution of applications on the fly, meaning while they are running. In
general, the corresponding applications need to be recompiled once
using appropriate compiler flags and are then executed under the
control of the debugger.

8.1. Compiler flags

8.1.1. Debugging options of the compiler

In this section useful debugging options for the XL compilers are
listed and explained. Simply add them to the compile command you
usually use for your application. The information is taken from the
man pages of the XL compilers. For further information about
compiler flags just type
man bgxlf or
man bgxlc.

  • -O0

    With this option all optimizations performed by the compiler are
    switched off. Sometimes errors can occur due to too aggressive
    compiler optimizations (rounding of floating point numbers,
    rearrangement of loops and/or operations etc.). If you encounter
    problems that might be connected to such issues (for example,
    wrong or inaccurate numeric results) try this option and check
    whether the problem persists. If not, increase the
    optimization level moderately. See
    Section 5.1.2
    for further details.

  • -qcheck[=<suboptions_list>]

    For Fortran this option is identical to the -C
    option (see list of flags for Fortran codes below). For C/C++
    codes this option enables different runtime checks depending on the
    <suboptions_list> (colon-separated list, see below for suboptions)
    specified and raises a runtime exception (SIGTRAP
    signal) if a violation is encountered.

    • all

      Enables all suboptions.

    • bounds

      Performs runtime checking of addresses when subscripting
      within an object of known size.

    • divzero

      Performs runtime checking of integer division. A trap will
      occur if an attempt is made to divide by zero.

    • nullptr

      Performs runtime checking of addresses contained in pointer
      variables used to reference storage.

  • -qflttrap[=<suboptions_list>]

    Generates instructions to detect and trap runtime floating-point
    exceptions. <suboptions_list>
    is a colon-separated list of one or more of the following
    suboptions:

    • enable
    • imprecise

      Only checks for the specified exceptions on subprogram
      entry and exit.

    • inexact

      Detects floating-point inexact exceptions.

    • invalid

      Detects floating-point invalid operation exceptions.

    • nanq

      Generates code to detect and trap NaNQ (Quiet Not-a-Number)
      exceptions handled or generated by floating-point operations.

    • overflow

      Detects floating-point overflow.

    • underflow

      Detects floating-point underflow.

    • zerodivide

      Detects floating-point division by zero.

  • -qhalt=<sev>

    Stops the compiler after the first phase if the severity
    level of errors detected equals or exceeds the specified level
    <sev>. The severity levels in increasing order of severity are:

    • i
      = informational messages
    • l
      = language-level messages (Fortran only)
    • w
      = warning messages
    • e
      = error messages
    • s
      = severe error messages
    • u
      = unrecoverable error messages (Fortran only)
  • -qinitauto=[<hex_value>]

    Initializes each byte or word of storage for automatic variables
    to the specified hexadecimal value <hex_value>.
    This generates extra code and should only be used for error
    determination. If you specify -qinitauto
    without a <hex_value>,
    the compiler initializes the value of each byte of automatic
    storage to zero.

The following flags can be used only with Fortran codes:

  • -C

    Checks each reference to an array element, array section, or
    character substring for correctness. This way some array-bound
    violations can be detected.

  • -qinit=f90ptr

    Makes the initial association status of pointers
    disassociated instead of undefined. This option applies to Fortran
    90 and above. The default association status of pointers is
    undefined.

  • -qsigtrap[=<trap_handler>]

    Sets up the specified trap handler to catch SIGTRAP
    exceptions when compiling a file that contains a main program.
    This option enables you to install a handler for SIGTRAP signals
    without calling the SIGNAL subprogram in the program.

The following flags apply only to C/C++ codes:

  • -qformat=[<options_list>]

    Warns of possible problems with string input and output format
    specifications. Functions diagnosed are printf,
    scanf, strftime,
    strfmon family functions and functions marked with format
    attributes. <options_list>
    is a comma-separated list of one or more of the following
    options:

    • all

      Turns on all format diagnostic messages.

    • exarg

      Warns if excess arguments appear in printf and scanf style
      function calls.

    • nlt

      Warns if a format string is not a string literal, unless
      the format function takes its format arguments as a va_list.

    • sec

      Warns of possible security problems in use of format
      functions.

    • y2k

      Warns of strftime formats that produce a 2-digit year.

    • zln

      Warns of zero-length formats.

  • -qinfo[=[<suboption>][,<groups_list>]]

    Produces or suppresses additional informational messages.
    <groups_list> is a colon-separated list. If a
    <groups_list> is specified along with a
    <suboption>, a
    colon must separate them. The suboptions are:

    • all

      Enables all diagnostic messages for all groups.

    • private

      Lists shared variables that are made private to a parallel
      loop.

    • reduction

      Lists variables that are recognized as reduction variables
      inside a parallel loop.

    The list of groups that can be specified is extensive. Here only a
    few are given. For a complete list please refer to the manual page
    of the compiler.

    • c99

      C code that might behave differently between C89 and C99
      language levels

    • cls

      C++ classes

    • cmp

      Possible redundancies in unsigned comparisons

    • cnd

      Possible redundancies or problems in conditional
      expressions

    • gen

      General diagnostic messages

    • ord

      Unspecified order of evaluation

    • ppt

      Trace of preprocessor actions

    • uni

      Uninitialized variables

8.1.2. Compiler flags for using debuggers

In order to run your code under the control of a debugger, you
need to recompile your application including the following compiler
flags (XL compilers):

-g -qfullpath

Additionally, the flag -qkeepparm may be useful. When specified, it
ensures that function parameters are stored on the stack even if the
application is optimized. As a result, parameters remain in the
expected memory location, so that debuggers can access the values of
these incoming parameters.

8.2. Available debuggers

Once you have compiled your application with the correct compiler
flags (Section 8.1.2)
you can run your application under the control of a debugger and
monitor the behavior on the fly in detail. We introduce the debuggers
which are available on JUGENE in the following subsections.

8.2.1. DDT

The Distributed Debugging Tool (DDT) is a graphical debugger
supporting C, C++, Fortran 77, and Fortran 90 programs. Among other
features it offers:

  • Multi-process and multi-threaded
  • 1D + 2D array data visualization
  • Support for MPI parallel debugging (automatic attach, message
    queues)
  • Support for OpenMP (Version 2.x and later)
  • Job submission from within the debugger

Running DDT on JUGENE


In order to be able to use the graphical user interface,
please make sure you are logged in with

ssh -X

If you are not directly connected to JUGENE, make sure you are
using the -X option for all ssh connections and that your local
system (laptop, PC) has a running X server.

In order to debug your program, load the UNITE and ddt
modules first:

module load UNITE ddt

Then start the DDT debugger from the command line.

After clicking on the DDT logo a dialog box appears
(Figure 14).

Figure 14. DDT welcome dialog box

Click on Run and Debug a Program and, in the next dialog box
(Figure 15), select your application (after compilation with the
appropriate flags, see Section 8.1.2). Adjust the number of nodes
and the OpenMP settings if applicable; further options are available
from the same dialog box. Finally, submit the job from this dialog
box.

Figure 15. DDT dialog box for choosing the executable and runtime

The application is submitted to the batch system and queued. Once
the job is launched DDT will attach to the application, the DDT
process window will appear
(Figure 16)
and you can start to debug your application. For further
information about the DDT debugger and its capabilities please see
the DDT documentation (Allinea Software).

Figure 16. DDT process window.

8.2.2. TotalView

TotalView is a powerful debugger supporting C, C++,
Fortran 77, Fortran 90, PGI HPF and assembler programs. Among other
features it offers:

  • Multi-process and multi-threaded
  • C++ support (templates, inheritance, inline functions)
  • F90 support (user types, pointers, modules)
  • 1D + 2D array data visualization
  • Support for parallel debugging (MPI: automatic attach,
    message queues, OpenMP, pthreads)
  • Scripting and batch debugging
  • Memory debugging
  • Reverse debugging with ReplayEngine

Using TotalView interactively


In order to be able to use the graphical user interface
please make sure you are logged in with

ssh -X

If you are not directly connected to JUGENE, make sure you are
using the -X option for all ssh connections and that your local
system (laptop, PC) has a running X server.

In order to debug your program with TotalView, load the UNITE and
totalview modules first:

module load UNITE totalview

The most common way to use TotalView (like any other debugger) is
interactively through its graphical user interface. To do so, start
your application (after compilation with the appropriate flags,
see Section 8.1.2) with llrun (Section 4.3.6) using the option -tv.
For example:

llrun -np <ntasks> -mode VN -tv [-env OMP_NUM_THREADS=<nthreads>] application.x

This will start the program application.x with <ntasks> tasks and
<nthreads> threads per task in VN mode. If your application is a pure
MPI code, you can omit the -env OMP_NUM_THREADS=<nthreads> option.

After the corresponding partition is booted, TotalView will launch
three windows: the root window (Figure 19), the process window
(Figure 18), and the startup-parameter window (Figure 17).

Figure 17. TotalView startup-parameters window

Figure 18. TotalView process window

Figure 19. TotalView root window

In the startup-parameter window (Figure 17) you have the four tabs
Debugging Options, Arguments, Standard I/O and Parallel. If you wish
to activate memory debugging, check the corresponding box in the
Debugging Options tab. If you would like to change or add arguments
passed to your application or to the launch command, you can do so
under Arguments. Please do not change anything in Parallel. Once you
have made all changes needed, confirm them to close the window.

Now click on GO in the process window of TotalView (Figure 18).
TotalView will proceed to execute the launch command and start your
application. This may take several minutes depending on the size of
the partition you have requested (which is determined by the number
of tasks you would like to use).

Finally, a dialog box (Figure 20) appears. Confirm it, and after a
few seconds the source code of the main program of your application
appears in the process window and you can start debugging your code.

Figure 20. A dialog window appears after clicking on GO


Since a detailed description of the usage of TotalView is far
beyond the scope of this guide, please refer to the TotalView
documentation (Rogue Wave Software) for a user’s guide and further
information about TotalView.

Using TotalView in batch mode

Sometimes using the interactive GUI for debugging is not
straightforward, for example in cases where the error occurs after
several hours of execution. In this case it would be very
cumbersome to wait until the code has reached the corresponding
point.

In such cases TotalView can be executed in batch mode. Prepare a
job command file (see Figure 20) and launch your application with
tvscript instead of mpirun.

The general syntax for tvscript on JUGENE is

tvscript [options] -mpi BlueGene -np <ntasks> "filename [mpi-arguments] [-args program_args]"

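Here filename is the name of the executable to debug; it must be
the first entry of the quoted argument string. It is followed by the
arguments that are usually passed to the job launch command
(mpi-arguments) and, after -args, by the arguments of the program
itself (program_args).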

Example 19 shows an example job command file using tvscript
to debug an application.

Example 19. Job command file using tvscript

#@job_name         = tvscript_dbg
#@comment          = "batch debugging"
#@output           = tvscript_dbg.out
#@error            = tvscript_dbg.err
#@environment      = COPY_ALL
#@job_type         = bluegene
#@notification     = never
#@bg_size          = 32
#@wall_clock_limit = 00:30:00

module load UNITE totalview
tvscript -create_actionpoint "function=>display_backtrace\
         -show_arguments" -mpi BlueGene -np 4\
         -starter_args "application.x -mode VN" mpirun


The executable to debug is application.x and should run with 4 tasks
in VN mode. At the beginning of the function named function an action
point is created. When a task reaches that action point, tvscript logs
a backtrace and the function’s arguments.

Running this job script, two log files are created by tvscript. The
summary log file contains a summary of the events that occurred. In
the example above, this file contains four lines (one for each task):
Actionpoint function hit, performing action display_backtrace with \
    options -show_arguments
Actionpoint function hit, performing action display_backtrace with \
    options -show_arguments
Actionpoint function hit, performing action display_backtrace with \
    options -show_arguments
Actionpoint function hit, performing action display_backtrace with \
    options -show_arguments

This indicates that all tasks reached the defined action point and
performed the corresponding action (showing the arguments of the
function). The second log file contains more detailed information.
In this case it lists (for each task) the names and values of the
arguments of the function.

For further information about tvscript and a complete list of
options, please see the TotalView documentation.

8.3. Analyzing core dumps

If an application aborts due to an error, the current state of the
application's memory can be written to disk (core dump files) before
the execution stops. Because writing core files from thousands of
nodes takes (too) much time, the generation of core files is
suppressed by default. However, you can enable the generation of core
dumps by setting the environment variable BG_COREDUMPDISABLED to 0
in your job command file:

mpirun -env BG_COREDUMPDISABLED=0 <other mpirun options> application.x

Here application.x is your application. Please use the -g
option when compiling your application in case you would like to
analyze core dump files.


Use this option with care, because a core dump file
for each process
is generated. Running with 16000 MPI tasks means that 16000
core files are generated! Before using this option try to reproduce
the error with the least number of tasks possible! Alternatively, you
can limit the number of core files using setrlimit(RLIMIT_CORE).
In this case you need to modify your source code. See the manual page of
setrlimit (man setrlimit) for further information.

8.3.1. Core dump file analysis using addr2line

Core dump files are plain text files that include traceback information
in hexadecimal. To convert the hexadecimal addresses into readable
source locations, the addr2line tool can be used. Assuming your
application is called application.x, you can convert a hexadecimal
address <hexaddr> from the core dump file to readable text in the
following way:

addr2line -e application.x <hexaddr>

For further information about addr2line please use

man addr2line

or

addr2line -h

8.3.2. Core dump file analysis using debuggers

Core dump files can also be analyzed using the DDT debugger
(Section 8.2.1) or the TotalView debugger (Section 8.2.2).

DDT

To debug using core dump files, start DDT as described in
Section 8.2.1.
Then click the Open Core Files button on the
welcome screen
(Figure 14).
This opens the Open Core Files window, which allows you to
select an executable and a set of core dump files. Confirm the
selection to open the core dump files and start debugging them. While
DDT is in this mode, you cannot play, pause or step (because there is
no process active). You are, however, able to evaluate expressions and
browse the variables and stack frames saved in the core dump files. The
End Session menu option will return DDT to its normal mode of
operation.

TotalView

Start TotalView as described in Section 8.2.2.
After the source code of your application appears in the process
window, select New Program from the menu. Select Open a core file
in the dialog box which appears and choose a core dump file. The Process
window displays the core dump file, with the Stack Trace, Stack Frame,
and Source Panes showing the state of the process when it dumped
core. The title bar of the Process Window names the signal that
caused the core dump. The right arrow in the line number area of
the Source Pane indicates the value of the program counter (PC)
when the process encountered the error.

8.4. Further reading, information and references

JUGENE documentation on the web

General related information

Bibliographic references

[RedBook AD] Carlos Sosa and Brant Knudson. IBM System Blue Gene Solution: Blue Gene/P Application
4th edition. 11 September 2009. Copyright © 2007, 2009 International Business Machines Corporation.

The text has been taken in parts from the IBM ESSL documentation.

This is the default mode, so adding -mode SMP is unnecessary.