PRACE Batch Systems

Introduction

This document briefly describes the Batch Systems currently deployed on the PRACE platforms. It describes some basic commands of each system and then lists some site-specific commands which may be of interest. For more detailed information, the user should visit the platform’s online User Guide.

Overview

This document lists all Batch Systems responsible for managing jobs on PRACE Tier-0 and Tier-1 computing systems.

An external link to each site’s online User Guide is provided where available.

Where no site-specific guide is available, the official documentation of the corresponding Batch System can be consulted.

For any problem or further information, contact your site by sending an email to prace-support@(yoursite).(countrydomain)

Tier-0 Systems

Site | System | Platform | Batch System | External Link
BSC | MareNostrum | IBM iDataPlex | LSF | Site Documentation (PDF)
CEA | CURIE | Bull Bullx B505, B510, BCS | Slurm | Site page
FZJ | JUQUEEN | IBM BlueGene/Q | LoadLeveler | Site page
HLRS | Hazel Hen | Cray XC40 | Torque/Moab | Site page
LRZ | SuperMUC phase 1 | IBM iDataPlex | LoadLeveler | http://www.lrz.de/services/compute/supermuc/loadleveler
LRZ | SuperMUC phase 2 | Lenovo NeXtScale nx360M5 WCT | LoadLeveler | http://www.lrz.de/services/compute/supermuc/loadleveler

Tier-1 Systems

Site | System | Platform | Batch System | External Link
BSC | Minotauro | Bull Bullx B505 | Slurm | Site Documentation (PDF)
CaSToRC | CY-TERA | IBM iDataPlex | Slurm | http://cytera.cyi.ac.cy/index.php/user-support/documentation.html#RunningApplications
CINECA | GALILEO | IBM NeXtScale nx360M5 + GPU/MIC | PBS Pro | Site page: PBS on Galileo
CSC | Sisu | Cray XC40 | Slurm | Site page
CSCS | Rosa | Cray XE6 | Slurm | Site page
CYFRONET | Zeus (BigMem) | HP BL685c G7 | PBS Pro | Site page
CYFRONET | Zeus (GPGPU) | HP SL390s | PBS Pro | Site page
ICHEC | Kay | Intel / Penguin Computing | Slurm | Site page
EPCC | ARCHER | Cray XC30 | PBS Pro/ALPS | http://www.archer.ac.uk/documentation/user-guide/batch.php
IDRIS | Turing | IBM BlueGene/Q | LoadLeveler | Site page
IPB | PARADOX | HP ProLiant SL250s | PBS Pro | Site page
IT4I-VSB TUO | Anselm | Bull Bullx B510/B515 | PBS Pro | Site page
IT4I-VSB TUO | Salomon | SGI ICE-X Intel Haswell + Phi | PBS Pro | Site page
NCSA | EA-ECNIS | IBM BlueGene/P | LoadLeveler | Site page
NIIF | NIIFI SC | HP Cluster Platform 4000SL | Slurm | Site page
NIIF | Seged | HP Cluster Platform 4000BL | Slurm | Site page
NIIF | Leo | HP ProLiant SL250s | Slurm | Site page
NIIF | PHItagoras | HP ProLiant SL250s | Slurm | Site page
PDC | Beskow | Cray XC40 | Slurm | Site page
PSNC | Cane | SGI Rackable C1103-G15 | Slurm | Site page
PSNC | Chimera | SGI UV 1000 | Slurm | Site page
STFC | Blue Joule | IBM BlueGene/Q | LoadLeveler | Site page
SURFsara | Cartesius | Bull Bullx B720/B710/B515 | Slurm | Site page
UHEM | Karadeniz | HP ProLiant BL460 | LSF | Site page
UIO | Abel | MEGWARE MiriQuid | Slurm | Site page
WCSS | Supernova | HP Cluster Platform 3000BL2x220 | PBS | Site page

Loadleveler

This section describes common LoadLeveler batch commands, LoadLeveler directives and, finally, an example MPI batch script.

LoadLeveler Commands

llsubmit <job_script>

Submit a job script, called ‘job_script’, for execution.

llq

Check the status of your job(s).

llcancel <job_id>

Cancel a job.

llstatus

Returns the status of the machines managed by LoadLeveler.
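As an illustrative sketch (the script name job.ll and the job identifier are placeholders), a typical submit, monitor and cancel cycle looks like this:

llsubmit job.ll          # prints the assigned job identifier
llq -u $USER             # list only your own jobs and their states
llcancel <job_id>        # remove the job using the identifier reported above
llstatus                 # check the overall state of the machines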

LoadLeveler Directives

This section lists some typical and important LoadLeveler directives, both recommended and discouraged. A general overview of LoadLeveler is provided by the corresponding IBM documentation.

General LoadLeveler Directives
#@ wall_clock_limit = <HH:MM:SS> [,HH:MM:SS]

The first value sets the hard limit on execution time in hours, minutes and seconds. The optional second value sets a soft limit at which the process receives a warning.

#@ requirements = (Feature == "PRACE")

This directive is required for all PRACE jobs.

#@ notify_user = <your-personal-email>

Emails the user once the job is complete. Note: use your personal email address, not an account local to the execution platform.

#@ notification = error

Sends a notification only when the job fails.

#@ shell = /bin/bash

Specifies which shell to employ.

#@ queue

Must appear at the end of the list of directives.

Please be aware that some other keywords can produce unexpected behavior. Therefore, the following keywords must be avoided in PRACE jobs:

#@ requirements = (X) with a value for X different from Feature == "PRACE"

#@ class
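Putting the recommended directives together (and avoiding the keywords just listed), a minimal general LoadLeveler header could look like the following sketch; the limits and email address are placeholders and must be adapted to the target site.

#@ shell = /bin/bash
#@ job_name = myjob
#@ output = $(job_name).$(jobid).out
#@ error = $(job_name).$(jobid).err
#@ wall_clock_limit = 01:00:00,00:55:00
#@ requirements = (Feature == "PRACE")
#@ notify_user = <your-personal-email>
#@ notification = error
#@ queue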

BG/Q Directives
#@ job_type = BLUEGENE

Required to set up a job step running on the BG/Q.

#@ bg_size = xxxxx

Required to define the size (number of compute nodes) of a BlueGene job.

#@ bg_connection = {MESH|TORUS|PREFER_TORUS}

Specifies the requested network topology of the BlueGene partition.

Multi-step jobs are accepted (with steps running on the Front-end Node for pre- and post-processing purposes).

Please also be aware that some other keywords can produce unexpected behavior. Therefore, the following keywords must be avoided in PRACE jobs:

#@ bg_shape

#@ bg_partition

Example of BlueGene MPI script for PRACE users

#!/bin/bash
# @ job_name = myjob
# @ error = $(job_name).$(jobid).out
# @ output = $(job_name).$(jobid).out
# @ wall_clock_limit = 00:30:00
# @ notify_user = <replace with your email>
# @ notification = error
# @ job_type = bluegene
# @ bg_size = 128
# @ queue
# 
# Run the program in Virtual Node Mode on the BlueGene/Q:
# 
# Executable statements follow

Executable statements for BlueGene PRACE users

module load prace
cp my_code $PRACE_SCRATCH
# Warning: if you need to transfer important volumes
# of data, please use a multi-step job

cp input.data $PRACE_SCRATCH
cd $PRACE_SCRATCH

mpirun -mode VN -np 256 -mapfile TXYZ -exe ./my_code

# $LOADL_STEP_INITDIR is the submission directory
cp output.data $LOADL_STEP_INITDIR

Please be aware that $PRACE_DATA and $PRACE_HOME are not usable under the BlueGene job step type, and using them can produce unexpected behavior.
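Since a multi-step job is recommended for transferring large volumes of data, the following sketch shows one possible layout: a serial step running on the Front-end Node stages the data, and a dependent BlueGene step runs the computation. Step names, sizes, limits and file names are placeholders only.

#!/bin/bash
# @ job_name = myjob_multistep
# @ step_name = stage_in
# @ job_type = serial
# @ wall_clock_limit = 00:10:00
# @ queue
# @ step_name = compute
# @ dependency = (stage_in == 0)
# @ job_type = bluegene
# @ bg_size = 128
# @ wall_clock_limit = 00:30:00
# @ queue

case $LOADL_STEP_NAME in
  stage_in)
    # Runs on the Front-end Node: copy the executable and input data to scratch
    cp my_code input.data $PRACE_SCRATCH
    ;;
  compute)
    # Runs on the BlueGene partition once the staging step has completed
    cd $PRACE_SCRATCH
    mpirun -mode VN -np 256 -mapfile TXYZ -exe ./my_code
    cp output.data $LOADL_STEP_INITDIR
    ;;
esac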

Power Directives
#@ total_tasks = <number of cores>

Sets number of cores for an MPI job.

#@ resources = ConsumableCpus(N)

Reserves N CPUs per task; used for multithreaded parallel jobs (for example, hybrid MPI/OpenMP).

#@ job_type = parallel

Declares a parallel job; it can be set to serial if needed.

#@ data_limit = MEMORY

where MEMORY is the amount of data memory required. This directive is optional: the default value is set to the recommended value.

#@ stack_limit = MEMORY

same as data_limit above, but for stack memory. This directive is optional: the default value is set to the recommended value.

Please be aware that some other keywords can produce unexpected behavior. Therefore, the following keywords must be avoided in PRACE jobs:

#@ blocking with a value different from unlimited

#@ max_processors

#@ min_processors

#@ network.X=Y

#@ node

#@ node_usage

#@ task_geometry

#@ tasks_per_node

Example of a POWER MPI script for PRACE users

#@ job_name = MyJob
#@ output = MyModel/output.$(jobid).log_ll
#@ error = MyModel/output.$(jobid).log_ll
#@ notify_user = <replace with your email>
#@ notification = error
#@ shell = /bin/bash
#@ requirements = (Feature == "PRACE")
#@ job_type = parallel
#@ total_tasks = 128
#@ wall_clock_limit = 06:00:00,05:50:00
#@ data_limit = 512mb
#@ stack_limit = 400mb
#@ queue

module load prace cpmd
cd $PRACE_SCRATCH
cp $PRACE_DATA/MyModel/CPMD/* .
$CPMD input.in > cpmd.out
cp * $PRACE_DATA/MyModel/CPMD

Note that, in this example, both the standard output and standard error are sent into the same file MyModel/output.$(jobid).log_ll which will reside in the $PRACE_HOME directory.

System specifics

LRZ SuperMUC

Detailed documentation on the usage of LoadLeveler with LRZ resources is available at http://www.lrz.de/services/compute/supermuc/loadleveler/. This document highlights the LRZ-specific directives that should be added to the <job_script> file in order to successfully run a job on SuperMUC (or Thin Node Islands) and on SuperMIG (or Fat Node Island).

Depending on the requirements of the job, users of SuperMUC can choose one of three queues:

Name | Max. nodes (max. cores) | Max. wall clock (hours) | Run limit per user
test | 32 (512) | 2 | 1
general | 512 (8192) | 48 | tbd
large | 2048 (32768) | 48 | tbd

Users of SuperMIG can choose between two queues:

Name | Max. nodes (max. cores) | Max. wall clock (hours) | Run limit per user
test | 4 (160) | 2 | 1
general | 52 (2080) | 48 | 1

The destination queue is specified with the required class = <queue name> keyword. More details are available on the LRZ Job Class page.

Another detail deserving special attention is the node allocation policy: only complete nodes (that is, 16 cores on SuperMUC and 40 cores on SuperMIG) are assigned to users for their jobs. This means that the accounting criterion is the following:

Accounted time on SuperMUC = (Number of allocated Nodes) * (Walltime) * (16)

Accounted time on SuperMIG = (Number of allocated Nodes) * (Walltime) * (40)

Users are strongly encouraged to use entire nodes for their jobs. Basically, two approaches are possible:

  • Specify the number of nodes and the number of tasks per node: the number of tasks to be started on each compute node is expressed by means of the tasks_per_node = <number> keyword. Due to the design of the system, this number should be less than or equal to 16 on SuperMUC and less than or equal to 40 on SuperMIG. The number of nodes is requested by means of the node = <number> keyword. The node keyword can also be employed in the form node = <min>,<max>, so that the scheduler tries to allocate up to max nodes but does not start the job unless min nodes can be reserved.
  • Specify the total number of tasks: use the total_tasks = <number> keyword. Even if it is not mandatory, it is recommended to specify the number of nodes too, using the syntax explained in the previous bullet.

A more comprehensive explanation and some examples can be found in the Keywords for Node and Core allocation section of the LRZ user documentation page. Please note that if more than 512 cores are requested on SuperMUC it is necessary to specify the keyword island_count in the job description.

Finally, it is necessary to set the wall clock limit as wall_clock_limit = hh:mm:ss. More details are available in the Limits paragraph of the site documentation.

Users submitting jobs to SuperMUC are strongly encouraged to use energy tags:

  • max_perf_decrease_allowed, setting an acceptable performance degradation
  • energy_saving_req, specifying the required energy saving

If these keywords are not present in the job description, LoadLeveler uses default values and the job will not run at the maximum speed, but only at 2.3 GHz. A detailed explanation is available on the LRZ page dedicated to energy-aware jobs.
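As an assumption-laden sketch only (the class, node counts, limits and performance-degradation value are placeholders to be checked against the LRZ documentation), a SuperMUC job header combining the keywords discussed above might look like this:

#@ job_type = parallel
#@ class = general
#@ node = 4
#@ tasks_per_node = 16
#@ wall_clock_limit = 02:00:00
#@ max_perf_decrease_allowed = 10
#@ queue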

LSF

This section describes common LSF batch commands and directives.

LSF Commands

bqueues -l

Display queue information

bsub < <jobscript>

Submits the commands contained in <jobscript>. NB: the input jobscript MUST be redirected with the < character.

bjobs -a

Shows submitted jobs

bkill <jobid>

Cancels the job <jobid> from the queuing system.

bhist <jobid> 

Display historical information about the job <jobid>

LSF Directives

#BSUB -n <number of cores>

Specifies number of cores

#BSUB -W <HH:MM>

Sets maximum wall clock time

#BSUB -o %J.out 

Sets location of standard output

#BSUB -e %J.err

Sets location of standard error
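The directives above can be combined into a job script. The following is only a sketch (the job name, core count, wall time and launch line are placeholders, and the appropriate MPI launcher is site-dependent); it would be submitted with bsub < job.lsf:

#!/bin/bash
#BSUB -J myjob              # job name
#BSUB -n 64                 # number of cores
#BSUB -W 01:00              # wall clock limit (HH:MM)
#BSUB -o %J.out             # standard output
#BSUB -e %J.err             # standard error

# Launch the MPI program (the exact launcher depends on the site)
mpirun ./my_code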

System specifics

UHEM Karadeniz Cluster

For information on how to run jobs on the Karadeniz Cluster at UHEM (UYBHM), please check the PDF guide: http://uhem.itu.edu.tr/documents/ka…

Moab+Torque

Torque Commands

qsub <jobscript>

Submit a jobscript for execution.

qstat -a

Display all batch jobs.

qstat -q

Display all batch queues with resource limit settings.

qdel <job_id>

Delete a job.

Example MPI Script

Examples for PBS (Torque) options in job scripts.

You can submit batch jobs using qsub. A very simple qsub script for an MPI job, with PBS (Torque) directives (#PBS …) supplying the options of qsub, looks like this:

#!/bin/bash
#
# Simple PBS batch script that reserves two nodes and runs one
# MPI process on each node
# The default walltime is 10min !
#
#PBS -l nodes=2:nehalem
cd $HOME/testdir
mpirun -np 2 -hostfile $PBS_NODEFILE ./mpitest

It is very important that you specify a shell in the first line of your batch script. It is equally important, if you use the openmpi module, to omit the -hostfile option; otherwise an error will occur.

If you want to run two MPI processes on each node, this can be done like this:

#!/bin/bash
#
# Simple PBS batch script that reserves two nodes and runs an
# MPI program on four processors (two on each node)
# The default walltime is 10min !
#
#PBS -l nodes=2:nehalem:ppn=2
cd $HOME/testdir
mpirun -np 4 -hostfile machines ./mpitest

System specifics

HLRS Laki

For more information on the NEC Nehalem Cluster Laki @HLRS, please visit: https://wickie.hlrs.de/platforms/in…

KTH/PDC Lindgren

For more information on how to submit jobs on CRAY XE6 "Lindgren" @KTH/PDC, please visit: http://www.pdc.kth.se/resources/com…

PBS Pro

PBS Pro Commands

To submit a job to the queues for execution, use the qsub command:

qsub <jobscript>

To view all currently queued, held and running jobs use the qstat command:

qstat

and you can use the -u option to limit the display to just your jobs:

qstat -u <username>

To display the status of the queues use:

qstat -Q

To delete a job from the queue you use the qdel command:

qdel <job_id>
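As a brief illustrative sequence (the script name is a placeholder, and the format of the job identifier varies between sites):

qsub job.pbs            # prints the identifier of the submitted job
qstat -u <username>     # show only your jobs
qdel <job_id>           # remove the job if it is no longer needed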

For more information, please see your site’s local documentation. Note that not all features will be implemented at every site.

PBS Pro Directives

#PBS -l mppwidth=<Total number of parallel tasks>

Reserves the total number of parallel tasks. This is usually the number of MPI tasks multiplied by the number of threads per MPI task; for example, 256 MPI tasks each running 4 OpenMP threads require mppwidth=1024. It can also be the number of images in the PGAS programming model.

#PBS -l mppnppn=<number of cores per node>

Set the number of parallel tasks per node. Usually this will be set to the total number of cores on a node.

#PBS -l walltime=<HH:MM:SS>

Set the maximum wall clock time for the job. Consult local documentation for the maximum value this can be set to.

#PBS -N <Job name>

Set the name of the job.

#PBS -A <Budget code>

Set the budget code to charge the job to.

#PBS -m e

Send e-mail when the job ends.

#PBS -M user@home.eu

E-mail address for notification email.

Example MPI script

This is an example job submission script for running a 1 hour, 4096 core MPI job on a Cray XE/XT system that has 32 cores per node:

#!/bin/bash --login

#PBS -N MyJob
#PBS -l walltime=1:0:0
#PBS -l mppwidth=4096
#PBS -l mppnppn=32
#PBS -m e
#PBS -M user@home.eu
#PBS -A budget_code

# Load the prace module to access PCPE
module load prace

# Copy the input file to the scratch file system
cp input.dat $PRACE_SCRATCH

cd $PRACE_SCRATCH
aprun -n 4096 -N 32 /path/to/my/program
cp output.dat ~/

System specifics

CINES Jade

For more information on the SGI ICE 8200 JADE installed at CINES, please visit: http://www.cines.fr/spip.php?rubrique300

EPCC ARCHER XC30

This section describes some localised PBS Pro directives particular to ARCHER, the Cray XC30 machine located at EPCC. For more information, please see the ARCHER user guide linked in the Tier-1 table above.

To run parallel jobs on ARCHER you use the aprun command within your job submission script. For example, to launch a 1536-core job that uses 24 cores per node:

aprun -n 1536 myProgram.x

You should avoid placing serial CPU- and memory-intensive commands in your job submission scripts as these will run directly on the job submission nodes (rather than the compute nodes) and will cause problems for the batch submission system. To launch a serial executable (or command) on the compute nodes you should precede the command with aprun -n 1.
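For example, a serial post-processing command can be placed on a compute node rather than the submission node as in the sketch below (the file name is a placeholder):

# Compress the results on a compute node instead of the submission node
aprun -n 1 gzip $PRACE_SCRATCH/output.dat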

WCSS Supernova

For information on how to run jobs on the Supernova Cluster at WCSS, please visit: http://kdm.wcss.wroc.pl/wiki/Runnin…

Slurm

This section contains basic Slurm commands and system-specific notes for BSC MinoTauro, CEA CURIE, CSCS Rosa and SIGMA Abel.
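Where a site exposes the native Slurm interface rather than wrappers, a minimal job script sketch looks like the following (node counts, limits and the program name are placeholders); it is submitted with sbatch, monitored with squeue -u $USER and cancelled with scancel <job_id>:

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=01:00:00
#SBATCH --output=myjob.%j.out
#SBATCH --error=myjob.%j.err

# Launch the MPI program with the Slurm process launcher
srun ./my_code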

System specifics

BSC MinoTauro

MinoTauro uses Slurm as its only batch system. Commands and directives for submitting jobs are similar to those used on MareNostrum thanks to special software wrappers. Details and information are available on the BSC website: http://www.bsc.es/marenostrum-suppo…

CEA CURIE

This section contains basic Slurm commands specific to CEA CURIE.

For more information on the CURIE Tier-0 system installed at CEA, please visit: http://www-hpc.cea.fr/en/complexe/t…

Best practices and the full CURIE documentation are also available online.

Basic Commands

ccc_msub <jobscript>

Submit a jobscript for execution.

ccc_mpp -r

Display ‘running’ batch jobs.

ccc_mpp -p

Display ‘pending’ batch jobs.

ccc_mdel <job_id>

Delete a job.
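The CEA wrappers typically read #MSUB directives from the job script. The following is only an assumption-based sketch (directive names, the queue, the project identifier and the time format should all be checked against the CURIE documentation):

#!/bin/bash
#MSUB -r myjob                 # job name
#MSUB -n 64                    # number of tasks
#MSUB -T 3600                  # wall clock limit in seconds
#MSUB -o myjob_%I.out          # standard output (%I is the job identifier)
#MSUB -e myjob_%I.err          # standard error
#MSUB -q standard              # queue/partition (placeholder)
#MSUB -A <project_id>          # project to charge

# ccc_mprun wraps the MPI launcher on CURIE
ccc_mprun ./my_code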

CSCS Rosa

For information on how to run jobs on the Cray XE6 system Rosa at CSCS, please visit: http://user.cscs.ch/running_batch_jobs/

SIGMA Abel

For information on how to run jobs on the Abel system at UNINETT Sigma, please visit: (link to be added)