Jülich Research on Petaflop Architectures
Information provided by
University of Warsaw
Table of Contents
- 1. Introduction
- 2. System architecture and configuration
- 3. System Access
- 4. User environment and programming
- 5. Tuning applications and Performance analysis
- 5.1. General optimization for x86 architecture
- 5.2. System specific optimization
- 5.2.1. How to benefit from SMT
- 5.2.2. Optimising the available memory per core
- 5.2.3. Reducing the memory consumption of MPI connections
- 5.2.4. Using dynamic memory allocation for MPI connections
- 5.2.5. Pinning OpenMP threads to processors
- 5.2.6. Requesting large amounts of memory on the front-end nodes
- 5.2.7. I/O tuning
- 5.3. Available generic x86 performance analysis tools
The supercomputer JUROPA is based on a cluster configuration of NovaScale servers from the French computer specialist Bull, and on blade servers from the American company Sun with Intel Nehalem processors. The system was designed by experts from the Jülich Supercomputing Centre and implemented together with the partner companies Bull, Sun, Intel, Mellanox and ParTec. It consists of 2208 computing nodes with a total computing capacity of 207 Tflop/s. JUROPA’s operating system was developed by Forschungszentrum Jülich and ParTec. JUROPA is financed by Forschungszentrum Jülich within the framework of funding by the Helmholtz Association.
Figure 1. JUROPA compute racks, source: Forschungszentrum Jülich
JUROPA is FZJ/GCS production system based on the Sun Constellation System with a total peak performance of 207 Tflop/s.The cluster consists of compute nodes connected via high-speed InfiniBand QDR network and utilizes a Lustre storage subsytem.JUROPA has 2208 computing nodes with a following characteristics:
- Sun Blade 6048 system
- 2 Intel Xeon X5570 (Nehalem-EP) quad-core processors
- Processor clocked at 2.93 GHz with SMT (Simultaneous Multithreading also known as Intel Hyper-Threading)
- 24 GB of DDR3 physical memory
- InfiniBand QDR networking module
In total, JUROPA is composed of 2208 computing nodes with 17664 cores.
Figure 2. JUROPA architecture, source: Forschungszentrum Jülich
JUROPA compute nodes are equipped with a two Intel Nehalem-EP (Xeon quad-core X5570) processors. The Intel Xeon X5570 processor is a 64-bit instruction set (x86_64) processor with the following architecture details:
- 4 physical cores (quad-core)
- 2-way SMT (hardware support for 8 concurrent threads)
- clock speed: 2.93 GHz
- L1: 64 KB per core
- L2: 256 KB per core
- L3: 8 MB shared by 4 cores
- memory bandwidth at the level of the processor: 32 GB/s
JUROPA is using login nodes for interactive work and job submission. Additionally GPFS nodes are in use for data exchange between the different HPC systems.
The operating system used is Linux SUSE SLES 11 with ParaStation Cluster Management system.
Each computing node contains a total of 24 GB DDR3 memory, clocked at 1066 MHz. With three processor memory channels the maximum possible memory bandwidth is 32 GB/s.
The JUROPA network uses a Fat Tree topology for InfiniBand inter-connect between the computing nodes and other system components. The network treeis composed of 92 blocks, 24 nodes each.
JUROPA is using a Lustre storage subsystem which is mounted on login nodes, compute nodes and GPFS nodes. Lustre Storage Pool has the following components:
- 4 Meta Data Servers (MDS)
- 2 x Bull NovaScale R423-E2 (Nehalem-EP quad-core)
- 2 x Bull NovaScale R423-E2 (Westmere-EP, 6-core)
- 98 TB for meta data (2 x EMC² CX4-240)
- 14 Object Storage Servers (OSS) for the home file systems
- Sun Fire X4170 Server
- 500 TB user data (28 x Sun Storage J4400 Array)
- 8 Object Storage Servers (OSS) for the home file systems
- 8 x Bull NovaScale R423-E2 (Nehalem-EP quad-core)
- 500 TB user data (2 x DDN SFA10000 storage)
- 8 Object Storage Servers (OSS) for the scratch file system
- 8 x Bull NovaScale R423-E2 (Westmere-EP, 6-core)
- 834 TB user data (2 x DDN SFA10000 storage)
- Aggregated data rate 50 GB/s
- Overall storage capacity: 1.8 PB
The Lustre Storage Pool is available on login, compute and GPFS nodes with the following file systems:
Table 1. Lustre filesystems on JUROPA
|$WORK||800 TB||group quota: 3 TB, 2 million files||no backup, files older than 28 days will be deleted||recommended for large temporary files and high performance requirements|
|$HOME||29 TB per file system, distributed among user groups||group quota: 3 TB, 2 million files||daily backup||recommended for permanent program data with low performance requirements|
Home directories reside in the Lustre filesystem. In order to hide the details of the home filesystem layout the full path to the home directory of each user is stored in the shell environment variable $HOME. References to files in the home directory should always be made through the $HOME environment variable. The initialization of $HOME will be performed during the login process.
The other important storage locations is the GPFS archive filesystem ($GPFSARCH).
Table 2. Archive filesystem on JUROPA
|$GPFSARCH||group quota: 2 million files||Storage for large files that are used infrequently. Long-term storage is done on magnetic tape. Thus, the retrieval of data may incur a significant time-delay. Backup of all files located in this file system is performed on a daily basis.|
Table 3. Lustre and GPFS filesystems visibility on JUROPA
|Primary name||Mount point||Type||Access|
|$GPFSARCH||GPFS||GPFS nodes only|
It is highly recommended to access files always with help of these variables. The values of these variables are automatically set during login.
For more information related to Section 2, “System architecture and configuration” please refer to the FZJ website:
JUROPA is using a set of login nodes for accesing the system. Users should use ssh connections for accessing the system, using:
ssh [-X] firstname.lastname@example.org
When using the generic name, a connection is established to one of the Login nodes in the node pool (
juropa01 ... juropa07). Initiating two logins in sequence may lead to sessions on different nodes. In order to force the session to be started on the same nodes use the specific node names instead, for example when using the login node <code>juropa01</code> (accordingly for the other login nodes):
ssh [-X] email@example.com
Users can’t login by suppling username/password credentials. Instead, login based on SSH key exchange is r
The public/private ssh key pair has to be generated. On Linux or UNIX-based systems, the key pair can be generated by executing:
ssh-keygen -t [dsa|rsa]
It is not allowed to use ssh key without a passphrase! Please protect the ssh key with a non-trivial pass phrase to fulfill the FZJ security policy.
Note that too many accesses (ssh or scp) within a short amount of time will be interpreted as an intrusion and will lead to automatically disabling the originating system at the FZJ firewall. For transferring multiple files in a single scp session, the -r option can be used, which allows to transfer whole directories.
If X11-based graphical tools are to be used on JUROPA, it is necessary to enable X11 forwarding in the file
$HOME/.ssh/config on your workstation or to use the <code>-X</code> option when connecting to JUROPA.
JUROPA is using an Intel programming environment including Intel Professional Fortran, C/C++ Compiler, Intel Cluster Tools Intel Math Kernel Library and OpenMP Intra-Node Programming Model.
It is recommended to use the Intel compilers in order to get executables with highest possible performance for these platforms.
Following libraries are available on the JUROPA:
- Numerical libraries:
- NAG Libraries, LAPACK, GSL, FFTW, GMP,
- Parallel libraries for distributed Memory Machines:
- ScaLAPACK, PARPACK, PETSc, MUMPS, SPRNG, ParMETIS, hypre, sundials.
The common programming environment is maintained with the concept of software modules in directory
/usr/local. The framework provides a set of installed libraries and applications (including multiple-version support) and an easy to use interface (module command) to set up the correct shell environment.
For more information on using
Modules please refer to the PRACE Generic x86 Best Practice Guide.
The persistent settings of the shell environment are governed by the content of
.profile or scripts sourced from within these files. Please use these files for storing your personal settings.
The batch system on JUROPA is Moab with the underlying resource manager Torque. The batch system is responsible for managing jobs on the machine, returning job outputs to the user and provides job control mechanism on the user’s or system administrator’s request.
Compute nodes are used exclusively by jobs of a single user; no node sharing between jobs is done. The smallest allocation unit is one node (8 processors). Users will be charged for the number of compute nodes multiplied with the wall-clock time used. Approximately 22 GB of memory per node are available for applications.
Table 4. Job limits on JUROPA
|Interactive jobs||max. wallclock time||6 h|
|default wallclock time||30 min|
|max. number of nodes||8|
|default number or nodes||1|
|max. no. of running jobs (including batch jobs)||15|
|Batch jobs||max. wallclock time||24 h|
|default wallclock time||30 min|
|max. number of nodes||1024|
|default number or nodes||1|
|max. no. of running jobs||15|
The Moab batch system is using batch scripts for job submission. The minimal template to be filled is:
Example 1. Template for job batch script
#!/bin/bash -x #MSUB -l nodes=<no of nodes>:ppn=<no of procs /node> #MSUB -l walltime=<hh:mm:ss> #MSUB -e <full path for error file> # if keyword omitted : default is submitting directory #MSUB -o <full path for output file> # if keyword omitted : default is submitting directory #MSUB -v tpt=<no of threads per task> # for OpenMP/hybrid jobs only ### start of jobscript export OMP_NUM_THREADS=<no of threads/task> # for OpenMP jobs only cd $PBS_O_WORKDIR echo "workdir: $PBS_O_WORKDIR" NSLOTS=<nodes * ppn> echo "running on $NSLOTS cpus ..." mpiexec -np $NSLOTS [--exports=var1,...] <executable>
For parallel (MPI based) applications
mpiexec together with application program (
executable) must be used.
--exports along with a comma separated list of environment variables ensures the export of all specified variables from the current job script to the processes spawned by the mpiexec command. This is necessary for instance if
OMP_NUM_THREADS is defined for OpenMP. The use of
-x for exporting of all environment variables is deprecated.
To submit the job defined with job script
jobscript use the following command:
On success, msub returns the job ID of the submitted job.
The following script is an example on how to define hybrid jobs (using MPI and OpenMP) for your application:
Example 2. Hybrid job (MPI and OpenMPI) allocating 8 nodes with 8 cores per node and starting 8 threads on each node
#!/bin/bash -x #MSUB -l nodes=8:ppn=8 #MSUB -e /home/jhome3/test_user/my-error.txt #MSUB -o /home/jhome3/test_user/my-out.txt #MSUB -v tpt=8 ### start of jobscript export OMP_NUM_THREADS=8 cd $PBS_O_WORKDIR echo "workdir: $PBS_O_WORKDIR" # NSLOTS = nodes * ppn / tpt = 8 * 8 / 8 = 8 NSLOTS=8 mpiexec -np $NSLOTS --exports=OMP_NUM_THREADS ./hybrid_application
To start an interactive session for 30 minutes on two nodes with 8 processors each use the
-I option for
msub -I -l nodes=2:ppn=8,walltime=00:30:00
You will then get interactive access to a node and can start your applications right there.
The following table summarizes important
msub command options:
|-l||set job limits (controlled by suboptions)|
|nodes||number of compute nodes used by the job|
|:ppn||processes per node|
|:turbomode||enable CPU over-clocking|
|walltime||wallclock timelimit for the job|
|-e||define file name of job’s error output|
|-o||define file name of job’s standard output|
|-j||oe||join standard output and error output into one file|
|-v||tpt||define the number of threads per MPI task for an OpenMP job|
|-I||submit an interactive job|
|-M||define mail address to receive mail notification|
|-m||define when to send a mail notification|
|b||at job begin|
|e||at job end|
|a||in case of job abort|
|-N||define the job’s name|
|-W||depend||define job ID this job depends on|
|afterok:||only start job, if previous job in the chain was ok|
Other useful Moab commands are summarized below:
- show status of all (running) jobs
- cancel a job
mjobctl -q starttime jobid
- show estimated starttime of specified job
- show detailed information about this command
checkjob -v jobid
- get detailed information about a job
For further information please see also Moab documentation.
MPI library implementation on JUROPA is ParTec MPI Message Passing Interface. Different versions of the MPI are available. Please use
module avail parastation for more details.
The memory consumption for static inter-node (InfiniBand) communication buffers depends on the number of communicating MPI processes (tasks). Users should be aware of memory usage for application runs using a large number of tasks. For details please see section Section 5.2.3, “Reducing the memory consumption of MPI connections” of this guide.
Please refer to the section on application tunning for more information on how to address MPI connections memory consumption. The JUROPA Memory Logger (jumel) can be used to analyse the memory consumption of an application. Please see
module help jumel for further information.
For detailed information on code optimization on x86 architecture refer to the PRACE Generic x86 Best Practice Guide.
The Intel Nehalem processor offers the possibility of Simultaneous Multi-Threading (SMT) which Intel formerly also called Hyper-Threading (HT). In SMT mode each processor core can execute two threads or tasks (called processes in the following for simplicity) simultaneously, leading to an execution of at most 16 processes per JUROPA compute node. The two “slots” of each core for running processes are called Hardware Threads (HWT).
Each JUROPA compute node consists of two quad-core CPUs, located on socket 0 and 1, respectively. The cores are numbered 0 to 7 and the HWT are named 0 to 15 in a round-robin fashion.
In order to use
SMT on JUROPA you need to set
ppn=16 in your Moab job script. The following examples show how to use SMT for pure MPI and MPI/OpenMP hybrid applications.
Example 3. Pure MPI code
#!/bin/bash -x #MSUB -N SMT_MPI_64x1_job #MSUB -l nodes=4:ppn=16 ### start of jobscript ### mpiexec -np 64 application.exe
This script will start application.exe on 4 nodes using 16 MPI tasks per node, where two MPI tasks will be executed on each core.
Example 4. Hybrid MPI/OpenMP code
#!/bin/bash -x #MSUB -N SMT_hybrid_8x8_job #MSUB -l nodes=4:ppn=16 #MSUB -v tpt=8 ### start of jobscript ### export OMP_NUM_THREADS=8 mpiexec -np 8 --exports=OMP_NUM_THREADS application.exe
This script will start application.exe on 4 nodes using 2 MPI tasks per node and 8 OpenMP threads per task.
Processes which are running on the same core will need to share the resources available to that particular core. Therefore, applications will profit most from SMT if processes are running on one core which are complementary in their usage of resources. For example, while one of the two processes is performing some kind of computation the second one is accessing the memory. This way the two processes do not compete for resources. If on the other hand the two processes would need to access the memory at the same time they would need to share the caches and memory bandwith and would therefore hamper each other. We recommend to test whether your code profits from SMT or not.
In order to check whether your application profits from SMT you should compare timings t of two runs on the same number of physical cores (i.e. the specification of nodes should be the same for both jobs), one run without SMT (two) and the second one with SMT (tw). If two/tw is greater than 1 your application profits using SMT, for two/tw less than 1 it does not.
According to experience you can expect a maximum speed-up of up to two/tw = 1.5 for codes profiting from SMT. However, applications may show a smaller benefit or might even slow down when using SMT. In such cases SMT should not be applied.
By default the processes for pure MPI applications are mapped to the cores of each JUROPA node in a round-robin fashion. This means that the first 8 processes are allocated on the HWT 0 to 7 with process 0 on HWT 0 and process 7 on HWT 7 and the next 8 processes are allocated on HWT 8 to HWT 15.
For hybrid applications (MPI/OpenMP) the tasks and threads will be distributed over the cores in such a way that all threads belonging to one task will share the physical cores among themselves. Three scenarios for the distribution of tasks and threads for each node are handled by default:
- 2 MPI tasks with 8 OpenMP threads per task per node (2×8)
- 4 MPI tasks with 4 OpenMP threads per task per node (4×4)
- 8 MPI tasks with 2 OpenMP threads per task per node (8×2)
The mapping described here does not ensure that threads belonging to one task are pinned to a certain HWT. The mapping of tasks and threads just ensures that the m threads belonging to the certain task will be executed on the HWT assigned to this task.
You can customize the mapping of processes to the HWT by using the variable __PSI_CPUMAP (please note the two underscores “_” at the beginning of the variable). This variable should contain a comma-separated list of HWT, to which threads should be mapped. The threads will be distributed in the order of tasks.
Example 5. Hybrid MPI/OpenMP code (2×8, customized mapping)
#!/bin/bash -x #MSUB -N SMT_hybrid_2x8_job #MSUB -l nodes=1:ppn=16 #MSUB -v tpt=8 ### start of jobscript ### export OMP_NUM_THREADS=8 export __PSI_CPUMAP="0-3,12-15,4-11" mpiexec -np 2 --exports=OMP_NUM_THREADS application.exe
This will allocate the threads of the first MPI task on the HWT 0-3 and 12-15 and the threads of the second MPI task on the HWT 4-11.
The compute nodes of JUROPA have a NUMA (Non-Uniform Memory Architecture) design, i.e. on these systems the memory (24 GB/node) is divided into two memory nodes of 12 GB each, where each memory node is bound to one CPU socket containing 4 cores. Since latency and bandwidth of memory accesses are significantly worse, if the other socket is involved, processes are pinned to specific cores and bound to the local memory node. By default, the first 4 tasks are distributed across the first socket and further tasks are distributed across the second socket. If 3 GB of memory per core is not enough for your needs, then this ratio can be increased in several ways. Some examples are given in the table below:
Table 6. Task allocation parameters
|8||8||—||1||12||Only one task is posed on a node, i.e. only the memory connected to the first socket is available.|
|8||8||export||1||24||If one task is posed on a node and the variable __PSI_NO_MEMBIND is set, the whole memory is available.|
|8||1||—||8||3||8 tasks are distributed across the node.|
|8||4||—||2||12||Four cores are reserved for each task, the first task is set on the first socket the second task is set on the second socket. Both tasks can use their local memory exclusively.|
|8||2||—||4||6||Two cores are reserved for each task, so 6 GB of memory are dedicated to each task.|
Abbreviations in the above table:
ppnProcesses per node; this value is specified in your job script or
msubcommand through the ppn option
tppThreads per process; this value is specified through the environment variable PSI_TPP in your job script (see text below)
nobindNo memory binding; this value is specified through the environment variable __PSI_NO_MEMBIND in your job script (see text below)
tasksLists the effective number of tasks per node resulting from the settings of ppn and tpp
memLists the available memory per task in GB resulting from the settings of ppn and tpp
PSI_TPP and __PSI_NO_MEMBIND have to be exported within a batch script. Instead of exporting PSI_TPP it is also possible to set its value implicitly with the following Moab command within your batch script:
#MSUB -v tpt=1.
Then, of course, the value of tpt has to be set to your needs. Please be aware that, on the one hand, the access to the memory of a process from one socket to the other is assured through:
(provided that bash is used), but, on the other hand, the performance of this memory access might be reduced considerably, so it is a good idea to weigh up this option carefully.
In some cases it might be necessary to reduce the amount of memory that has to be dedicated to MPI connections. This will apply in particular in cases where the job runs on many cores (>2000) and all-to-all communication is required. (As long as all-to-all communication is not needed, the use of on-demand connections will be beneficial; for this case see PSP_ONDEMAND below.)
In ParaStation MPI, roughly 0.55 MB of memory are needed for each MPI connection. If a program runs on 4000 cores and the communication pattern includes all-to-all communication, a total of 4000 times 3999 times 0.55 MB = 8797.8 GB of memory is needed just for the MPI connections, that is about 2.2 GB out of a total of 3 GB available per core.
ParaStation MPI uses 16 send buffers and 16 receive buffers per connection by default. Each buffer has a size of 16 KB. While the size of the buffers cannot be changed, the number of buffers can be modified via the environment variables PSP_OPENIB_SENDQ_SIZE and PSP_OPENIB_RECVQ_SIZE. If you want to reduce the number of these buffers you have to set the corresponding variables in your batch script:
export PSP_OPENIB_SENDQ_SIZE=3 (3 buffers for send)
export PSP_OPENIB_RECVQ_SIZE=3 (3 buffers for receive)
provided that bash is used. Both sizes might be modified independently.
The following table gives some figures for this purpose:
Table 7. Number of Q_SIZE buffers and MPI throughput
But please be aware that a reduced Q_SIZE might degrade the MPI’s throughput and messaging rate. Furthermore, each Q_SIZE must be at least 3 in order to prevent deadlocks.
As described in the previous chapter, each MPI connection needs a certain amount of memory by default. The more processes are started, the more memory will be used for the MPI connections. As a rule of thumb, each MPI connection needs roughly 0.55 MB of memory. Besides the memory needed for your application, you have to take this fact into account.
On JUROPA, all MPI connections of your application will be established in the beginning of the run. This might lead to a memory shortage due to the reasons descibed above. If the memory allocation of your application plus the memory allocati
on for the MPI connections exceeds the memory capacity of the compute node, the job will fail.
The environment variable PSP_ONDEMAND influences the memory allocation for MPI connections. On JUROPA, the default of this variable is PSP_ONDEMAND=0, i.e. all needed memory will be allocated in the beginning of the run. If you alter the variable into PSP_ONDEMAND=1 within your batch script, then the memory allocation will take place dynamically. Dependent on the communication pattern of your application, this setting might circumvent your problem. But, if you do have all-to-all communication in your application, your program is likely to fail, since all MPI connections will have to be established when the all-to-all communication takes place.
The Intel compiler’s OpenMP runtime library provides the possibility to influence the binding of OpenMP threads to physical processing units. The behaviour is controlled by the environment variable KMP_AFFINITY. Information on this issue can be found on the Intel webpage.
On JUROPA, the default for the pinning is controlled by the PSI daemon. This means, the first four threads (0-3) are bound to the four cores of the first socket, all other threads (4-7) are bound to the cores of the second socket. If the user wants to change this setting, it is necessary to switch off the default setting explicitly by setting the environment variable __PSI_NO_PINPROC. Furthermore, the memory binding has to be switched off. The following example should clarify the general proceeding:
mpiexec —exports=OMP_NUM_THREADS,KMP_AFFINITY -n 1 ./application
Environment variables belonging to the PSI daemon need not to be exported explicitly in the mpiexec command.
Taking into account the setting from the example above the binding is as follows:
Figure 3. OpenMP threads binding example, source: Forschungszentrum Jülich
The threads are represented by circles in the picture above. KMP_AFFINITY=scatter distributes the threads as evenly as possible across the node.
Pre- and post-processing of input and output data is typically done on one of JUROPA’s front-end nodes.
While all login nodes and most of the GPFS nodes possess 24 GB of main memory, there are two GPFS nodes (juropagpfs04, juropagpfs05) that are equipped with 192 GB each. These nodes must be used, if interactive pre- or post-processing requires more than 24 GB of memory. Please note that the GPFS nodes are interactive front-end nodes, they cannot be used by batch jobs.
To access the GPFS nodes login with ssh:
ssh firstname.lastname@example.org or ssh email@example.com
In order to optimize the I/O performance of programs on JUROPA and to avoid any disturbance of other users you should follow a few rules when writing your software for the system:
- Write your data in large blocks instead of small portions to disk.
- Use buffered I/O whenever possible and avoid to flush your output frequently.
- Avoid task-local file I/O (each MPI task reads/writes to its own file). Use parallel I/O, for example MPI I/O or dedicated libraries like SIONlib, HDF5, pnetCDF, etc.
Depending on the programming language used, there are some specific rules:
- FORTRANWhen using the ifort compiler, make sure that the environment variable
FORT_BUFFERED = true
is set. This is the case in the default environment, when the module parastation is loaded; in case you unload this module make sure that you set the FORT_BUFFERED variable properly.
Alternatively, you can use the compiler option
-assume buffered_ioto switch on I/O buffering for FORTRAN programs. The Intel Fortran Compiler (
ifort) offers the non-standard parameters BUFFERED, BLOCKSIZE and BUFFERCOUNT for the open statement in order to customize the buffering of the output. Here are typical values, recommended for JUROPA.
Table 8. Compiler parameters for buffered I/O management in Fortran
enables buffering of WRITE commands
sets the size of the buffer to 1 MB
uses n buffers; in general n=1 is appropriate
- CDo not use fflush (except at the end of your output/program).
- C++Do not use flush (except at the end of your output/program). Do not use endl, because this performs an additional flush each time called. If you need a line break, please use n instead.
The following generic profiling and tracing tools are available on JUROPA:
- HPCToolkit sampling profiler
- mpiP MPI profiling library
- Tau performance analysis system
- VampirTrace tracing library
- Scalasca performance analyzer
- Paraver tracing tool
For more information related to Section 5, “Tuning applications and Performance analysis” please refer to the FZJ website: