SGI Altix UV at PSNC
Table of Contents
- 1. Introduction
- 2. System Architecture and Configuration
- 3. System Access
- 4. User environment and programming
- 5. Performance Analysis
This document presents best practices for users of the SGI Altix UV system “Chimera”, installed at Poznan Supercomputing and Networking Center (PSNC) in Poland. Chimera is a compute system built on the widely used x86_64 architecture, with a unique hardware SMP technology that makes the whole system memory usable in single-image mode.
Figure 1. “Chimera” SGI Altix UV, source: Poznan Supercomputing and Networking Center
The SGI Altix UV system is a shared memory machine based on the widely recognised Intel x86_64 architecture. This section describes the SGI UV installation located at PSNC, Poznan, Poland.
“Chimera” is an SGI Altix UV 1000 shared memory cluster, also referred to as a multiprocessor distributed shared memory (DSM) system. It consists of 2048 Intel Xeon cores with 16 TB of memory, controlled by a single cache-coherent Linux system image. This means that all of the memory is available to a single application and is shared by all of the processors in the system. The shared memory capability of the machine is a hardware extension to the commodity-processor-based nodes of the cluster.
The peak performance of Chimera is 21.8 TFlop/s.
Chimera is powered by 256 Intel Xeon E7-8837 processors with the following parameters:
- Nehalem microarchitecture
- clocked at 2.66 GHz
- 24 MB L3 cache
- 8 x 256 KB L2 cache
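The quoted peak figure can be reproduced from the processor parameters listed above. The calculation below is only a sketch: it takes 8 cores per processor (2048 cores divided by 256 processors) and assumes 4 double-precision floating-point operations per core per cycle. That last value is an assumption typical of SSE-era Xeon processors and is not stated in this guide.

```bash
#!/bin/bash
# Reproduce the peak performance figure from the hardware parameters.
processors=256        # from the text
cores_per_cpu=8       # 2048 cores / 256 processors
clock_ghz=2.66        # from the text
flops_per_cycle=4     # ASSUMPTION: double-precision flops per core per cycle

# cores * GHz * flops/cycle gives GFlop/s; divide by 1000 for TFlop/s
peak_tflops=$(LC_ALL=C awk -v p="$processors" -v c="$cores_per_cpu" \
    -v g="$clock_ghz" -v f="$flops_per_cycle" \
    'BEGIN { printf "%.1f", p * c * g * f / 1000 }')

echo "Peak performance: ${peak_tflops} TFlop/s"
```

Under these assumptions the result rounds to 21.8 TFlop/s, which matches the figure given for Chimera.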
SGI Altix UV is a cache-coherent non-uniform memory access (ccNUMA) architecture with global shared memory available as a single system image (SSI). The system uses processor caches to reduce memory latency. Data from local or remote memory is held in various processor caches throughout the system, and a cache coherency mechanism keeps the cached copies consistent.
Distributed shared memory (DSM) means that the memory is physically distributed between the processor nodes of the system and is also placed at various distances from the processors within a given node. As a result, memory access time depends on the physical placement of the memory and is non-uniform across the node (NUMA).
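Because access time depends on where memory is placed, it can help to inspect the NUMA layout before tuning a program. The sketch below counts NUMA nodes through the standard Linux sysfs directory. The helper name numa_node_count is hypothetical, and the numactl commands shown in the comments assume that numactl is installed, which this guide does not confirm.

```bash
#!/bin/bash
# Count NUMA nodes by listing the nodeN directories that Linux exposes in sysfs.
# The root directory is a parameter only so the function can be tried on a copy.
numa_node_count() {
    local root="${1:-/sys/devices/system/node}"
    local count=0 dir
    for dir in "$root"/node[0-9]*; do
        [ -d "$dir" ] && count=$((count + 1))
    done
    echo "$count"
}

echo "NUMA nodes visible: $(numa_node_count)"

# Possible follow-up, ASSUMING numactl is available:
#   numactl --hardware                                     # node sizes and distances
#   numactl --cpunodebind=0 --membind=0 ./my-program.exe   # keep CPUs and memory on node 0
```

Binding a process and its memory to the same node keeps accesses local, at the cost of limiting the memory it can use to that node.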
One 10 Gigabit Ethernet port is active on Chimera. It is shared between the local connection to storage and the public IP address of the server.
SGI UV uses a vendor-specific interconnect called “NUMAlink” to provide global memory shared between the cluster nodes. NUMAlink 5 is capable of 15 GB/s of peak bandwidth through two 7.5 GB/s unidirectional links.
Additional hardware support for parallel execution is available to MPI-based applications through the MPI Offload Engine, which is also an SGI technology.
The SGI UV system has a hardware accelerator for MPI point-to-point and collective communication. The acceleration is performed by the UV MPI Offload Engine (MOE) through message queues, synchronisation and multicast implemented in hardware. The MOE reduces MPI latency and improves the performance of some commonly used MPI collective communication operations.
All MPI platforms installed on Chimera support this hardware extension.
Chimera has internal disks with a capacity of 500 GB for scratch, mounted at the /disks directory, and RAM disks with capacities of 8 TB and 3 TB, accessed from directories such as /dev/shm. The RAM disks share their capacity with the memory intended for programs.
Two real storage areas on disk and two pseudo file systems located in RAM are available on Chimera. Only the home directories are protected by a regular backup.
Table 1. File systems available on Chimera

| mount point | file system | capacity | notes |
|---|---|---|---|
| | network, GPFS via NFS | 51 TB | ho |
| | local, ext4 | 523 GB | scratch directory |
| | RAM disk, tmpfs | 3 TB | scratch directory, all data are lost on system failure |
| | RAM disk, tmpfs | 8 TB | scratch, all data are lost on system failure |
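Since RAM disk contents do not survive a failure and take up memory that programs could otherwise use, scratch data are best kept in a private directory that is removed when the job ends. The following is a minimal sketch, assuming /dev/shm is writable for the user. The SCRATCH_ROOT variable and the make_workdir helper are hypothetical names, and the fallback to /tmp is only there so that the sketch also runs on other machines.

```bash
#!/bin/bash
# Work in a private scratch directory and remove it when the script exits.
SCRATCH_ROOT="${SCRATCH_ROOT:-/dev/shm}"     # RAM disk directory named in the text
# Fall back to a temporary directory if the RAM disk is missing or read-only.
[ -d "$SCRATCH_ROOT" ] && [ -w "$SCRATCH_ROOT" ] || SCRATCH_ROOT="${TMPDIR:-/tmp}"

make_workdir() {
    # mktemp -d creates a uniquely named directory and prints its path
    mktemp -d "${SCRATCH_ROOT}/${USER:-job}.XXXXXX"
}

workdir=$(make_workdir) || exit 1
trap 'rm -rf "$workdir"' EXIT                # free the space when the script ends

df -h "$workdir"                             # check free space before writing large files
echo "scratch directory: $workdir"
```

Results that must be kept should be copied out of the scratch directory before the script exits, because the trap deletes everything in it.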
Users apply for access to the system with a computational grant. The grant application form is available via PSNC’s HPC Portal: http://hpc.man.poznan.pl. European researchers can apply for core hours through PRACE DECI calls.
- User access is available via SSH: ssh email@example.com
- Access with gsissh: gsissh firstname.lastname@example.org -p 2222
Chimera is a single-image system running a standard Linux distribution. Job scheduling is handled by the SLURM batch system.
Users log in directly to the machine itself; there are no separate access nodes, as there would be on a cluster.
Common programming tools are installed on Chimera: Intel Composer XE 2011, Intel VTune Amplifier XE 2011 and Intel Inspector XE 2011.
- GNU compilers 4.4
- Intel Compilers
- Intel MPI 4.0 u3
- Intel MKL 2011
- Intel Threading Building Blocks
For more information on using GNU and Intel compilers please refer to the PRACE Generic x86 Best Practice Guide.
The default shell for users is bash.
All preferred utilities (compilers, MPI, etc.) are accessible on the command line. All are placed in the .bashrc file. User-specific environment settings should be made at the beginning of the .bashrc file. Settings made at the end of this file are activated only in a login shell and are inactive in job scripts.
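The effect of placement within .bashrc can be illustrated with a small experiment. The sketch below assumes that the file contains a guard that stops non-interactive shells part way through, which is common on Linux systems but is not confirmed for Chimera. The file contents and the variable names EARLY_SETTING and LATE_SETTING are hypothetical.

```bash
#!/bin/bash
# Demonstrate why settings belong at the beginning of .bashrc.
rc=$(mktemp)
cat > "$rc" <<'EOF'
export EARLY_SETTING=visible            # at the beginning: applied in every shell
case $- in *i*) ;; *) return ;; esac    # ASSUMED guard: non-interactive shells stop here
export LATE_SETTING=hidden              # at the end: interactive shells only
EOF

# Source the file from a non-interactive shell, much as a job script would.
result=$(bash -c 'source "$1"; echo "${EARLY_SETTING:-unset} ${LATE_SETTING:-unset}"' _ "$rc")
echo "$result"
rm -f "$rc"
```

Run this way, the script prints `visible unset`: only the setting placed before the guard reaches the non-interactive shell.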
The queue system on Chimera is SLURM. Queues in SLURM are called partitions.
Table 2. SLURM partitions available for users.
The default partition is standard. Chimera is a single-image system, so there is only one node in the partitions: chimera. Processors requested as job resources have to be reserved as tasks. Current information on partitions and time limits can be obtained with the sinfo command.
The recommended way to run jobs on Chimera is batch mode rather than interactive mode. Jobs are submitted with the sbatch command. The script name has to be given as the last argument. All arguments of the sbatch command can instead be placed at the beginning of the script file on lines prefixed with #SBATCH.
The schema of a script for an MPI job is presented below:
    #!/bin/bash
    #SBATCH -J MY_JOB_NAME
    #SBATCH --partition=QUEUE
    #SBATCH --get-user-env
    #SBATCH --ntasks=NUMBER_OF_PROCESSORS
    #SBATCH --mail-type=end
    #SBATCH --mail-user=HERE_PUT_YOUR_email
    #SBATCH --time=01:00:00
    mpiexec my-program.exe options...
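Before running this script, the user has to modify the placeholder fields (MY_JOB_NAME, QUEUE, NUMBER_OF_PROCESSORS, HERE_PUT_YOUR_email), the time limit, and the program name with its options. If the program uses a parallel paradigm other than MPI, there is no need to use mpiexec.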
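For a program that is not MPI-parallel, the schema can be adapted so that the program is started directly. The sketch below writes such a script to a file and checks its syntax. It assumes an OpenMP program that reads OMP_NUM_THREADS, and the job name, file name, task count and program name are placeholders. Processors are still reserved as tasks, as described above, and SLURM_NTASKS is the variable SLURM sets from --ntasks.

```bash
#!/bin/bash
# Write a batch script for a threaded (non-MPI) program and check its syntax.
job_file="my_openmp_job.sh"      # placeholder file name

cat > "$job_file" <<'EOF'
#!/bin/bash
#SBATCH -J my_openmp_job
#SBATCH --partition=standard
#SBATCH --ntasks=16
#SBATCH --time=01:00:00

# Use one thread per reserved task; fall back to 1 outside a SLURM job.
export OMP_NUM_THREADS="${SLURM_NTASKS:-1}"

# The program is started directly, without an MPI launcher.
./my-program.exe options...
EOF

# bash -n parses the script without executing it.
bash -n "$job_file" && echo "wrote $job_file"
# Submit with: sbatch my_openmp_job.sh
```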
By default, all environment variables are propagated to jobs. To avoid propagating variables, use the option:
To export specific environment variables to the batch job (program), use the --export=VARIABLE option, for example:
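In the SLURM documentation, the --export option of sbatch accepts NONE to suppress propagation of the user environment, or a comma-separated list of variable names to export only those. The helper below builds that argument and is a hypothetical convenience, not part of SLURM. The variable names and my-job.sh are placeholders, and it is assumed that the SLURM version installed on Chimera accepts both forms.

```bash
#!/bin/bash
# Build the --export argument for sbatch from a list of variable names.
export_arg() {
    if [ "$#" -eq 0 ]; then
        echo "--export=NONE"          # no names given: propagate nothing
    else
        local IFS=,                   # join the names with commas
        echo "--export=$*"
    fi
}

export_arg
export_arg OMP_NUM_THREADS MY_INPUT_DIR

# Possible use on the command line:
#   sbatch "$(export_arg OMP_NUM_THREADS MY_INPUT_DIR)" my-job.sh
```

The two calls print `--export=NONE` and `--export=OMP_NUM_THREADS,MY_INPUT_DIR`. The same option can also be written as an #SBATCH line at the beginning of the job script.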
There is no need to use MPT, the SGI-specific MPI implementation. Support for hardware acceleration of MPI is included in Intel MPI, which is the default on Chimera.
Generic x86 performance analysis tools available on Chimera include:
- VTune Performance Analyzer
- Intel Trace Analyzer
For more information on using VTune Performance Analyzer please refer to the PRACE Generic x86 Best Practice Guide.
There are no system-specific performance analysis tools available on Chimera.