Data Transfer with GridFTP

This document describes the deployment and usage of the data transfer facility GridFTP, a component of the Globus middleware suite. GridFTP allows data to be transferred efficiently between PRACE sites as well as between a non-PRACE site and a PRACE site.

Large scientific applications running on the PRACE HPC infrastructure typically both analyse and generate huge amounts of data. These data sets cannot be expected to be stored permanently at the respective PRACE sites. In addition, an alternative site-to-site file transfer mechanism is required in case a partner site cannot be connected to the PRACE shared filesystem for technical reasons. To enable users to stage data in and out of PRACE HPC systems that have no common shared filesystem, PRACE sites provide an efficient file transfer service based on GridFTP.

GridFTP is a protocol for high-performance, secure and reliable data transfer over high-bandwidth WANs such as the PRACE internal network. PRACE users also need to move data between the PRACE infrastructure and their local file systems. For this purpose, some sites provide GridFTP servers that can be reached via the public internet. These so-called Door Nodes are listed in a table in section "Access from a machine outside of the PRACE environment".

PRACE uses the GridFTP server implementation that comes with the Globus Toolkit.

How Can GridFTP Be Used for File Transfer

To use GridFTP for file transfer, you need either a GridFTP client program running on your local workstation, which provides the interface between you and a remote GridFTP server, or access to a PRACE site with a GridFTP installation. Several GridFTP clients are available. One of them is globus-url-copy, a command line tool that can transfer files using the GridFTP protocol as well as other protocols such as http and ftp. globus-url-copy is distributed with the Globus Toolkit, is usually available on machines where the Globus Toolkit is installed, and is mandatory on all sites participating in the PRACE network.

Globus URL copy Syntax

The basic syntax of the globus-url-copy command is:

globus-url-copy [options] sourceURL destinationURL

where the arguments are described in the following table.

Arguments of globus-url-copy

Argument Description
[options] The optional command line switches, as described in the section "Command Line Options For globus-url-copy".
sourceURL The URL of the file(s) to be copied. If it is a directory, it must end with a slash (/), and all files within that directory will be copied.
destinationURL The URL to which to copy the file(s). To copy several files to one destination URL, destinationURL must be a directory and must be terminated with a slash (/).

globus-url-copy supports multiple protocols, so the format of the source and destination URLs can be either

file://path

when you refer to a local file or directory or

protocol://host[:port]/path

when you refer to a remote file or directory. While globus-url-copy also supports other protocols such as http, https and ftp, only the GridFTP protocol can be used in the PRACE infrastructure: gsiftp://

The port number can be omitted if the GridFTP server listens on the default port 2811.

The path

  • must be an absolute path for file://
  • should be an absolute path for gsiftp://
  • must be terminated with a slash (/) if it refers to a directory
  • may be a relative path for gsiftp:// (experts only); in this case the path, which is relative to the user’s home directory, must be given right after the specified port, without a space
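
The URL rules above can be sketched as a small shell helper. This is only an illustration: the function name gsiftp_url and the example paths are hypothetical, and the host names are taken from the server table later in this document.

```shell
#!/bin/sh
# Build a gsiftp:// URL from a host, an absolute path and an optional port.
# Usage: gsiftp_url HOST PATH [PORT]   (hypothetical helper, not part of Globus)
gsiftp_url() {
    host=$1
    path=$2
    port=${3:-2811}              # 2811 is the GridFTP default port
    case $path in
        /*) ;;                   # absolute path: accepted
        *)  echo "gsiftp_url: path must be absolute: $path" >&2
            return 1 ;;
    esac
    if [ "$port" = 2811 ]; then
        # The default port can be omitted from the URL
        printf 'gsiftp://%s%s\n' "$host" "$path"
    else
        printf 'gsiftp://%s:%s%s\n' "$host" "$port" "$path"
    fi
}

# A file on a server listening on the default port
gsiftp_url gftp.prace.bsc.es /scratch/user/myfile
# A directory (note the trailing slash) on a server listening on port 2812
gsiftp_url gridftp-fr1.hww.de /scratch/user/mydirectory/ 2812
```

The helper only assembles the string; it does not check that the server or the path exists.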

To transfer data with globus-url-copy using the gsiftp:// protocol, the user must have valid credentials, as will be described in the Copying files and directories with globus-url-copy section. Normally you will use file:// for addressing a local file or directory, and gsiftp:// for addressing a remote file or directory. However, note that the GridFTP protocol supports so-called Third Party Transfers where you can transfer data between two remote servers. In this case you have to use gsiftp:// both for the source and the destination URL.
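
A Third Party Transfer could look like the following sketch. The two host names are taken from the server table later in this document, while the paths are placeholders that you have to replace. The script only prints the command so that it can be inspected before it is run on a machine with the Globus client tools and a valid proxy credential.

```shell
#!/bin/sh
# Third Party Transfer: both URLs use gsiftp://, so the data flows directly
# between the two remote servers.
SRC_HOST=gftp.prace.bsc.es:2811
DST_HOST=supermuc-prace.lrz.de:2811
SRC_PATH=/path/on/source/myfile          # placeholder path (assumption)
DST_PATH=/path/on/destination/myfile     # placeholder path (assumption)

# Assemble the command and print it instead of executing it.
CMD="globus-url-copy gsiftp://$SRC_HOST$SRC_PATH gsiftp://$DST_HOST$DST_PATH"
echo "$CMD"
```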

Command Line Options For globus-url-copy

For example, the structure of a command that transfers a local file to a remote host looks like the following:

globus-url-copy file:///<path>/<file> gsiftp://<host>:<port>/<path>/<file>

We present the most important command line options. For a much more comprehensive description of available options, see the documentation on the Globus website http://www.globus.org/.

When you use the optional parameters given in the table below, you will get additional information:

Optional parameters

Option Description
-help Prints usage information for the globus-url-copy program.
-version Prints the version of the globus-url-copy program.
-vb During the transfer, displays: (1) number of bytes transferred (2) performance since the last update (every 5 seconds) (3) average performance for the whole transfer

The following table lists parameters which you can set to optimize the performance of your data transfer:

Parameters to optimize performance

Option Description
-tcp-bs <size> Specifies the size (in bytes) of the TCP buffer to be used by the underlying GridFTP data channels.
-p <number of parallel streams> Specifies the number of parallel streams to be used in the GridFTP transfer.
-stripe Use this parameter to initiate a "striped" GridFTP transfer that uses more than one node at both the source and destination. As multiple nodes contribute to the transfer, each using its own network interface, a larger amount of the network bandwidth can be consumed than with a single system. Thus, at least for "big" (> 100 MB) files, striping can considerably improve performance.

How to choose values for these parameters?

Concerning the first two parameters, namely the TCP buffer size and the number of parallel streams, the optimal values depend on factors such as the latency between the source and destination sites, the available bandwidth and the network traffic. Some of these factors are fixed and can be measured (for instance, you can measure the latency yourself using ping), whilst others, such as the limiting bandwidth, are known only to the network administrators at the various PRACE sites. As a rule of thumb, we recommend the following values:

  • four parallel streams should be enough.
  • for the typical latencies that occur in the PRACE network use 4MB for the TCP buffer size.
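
Put together, the recommended values could be passed to globus-url-copy as in the following sketch. It assumes that 4MB means 4 * 1024 * 1024 bytes, and the source and destination URLs are placeholders. The script prints the assembled command rather than running it.

```shell
#!/bin/sh
# Rule-of-thumb tuning values from the list above.
STREAMS=4
TCP_BUFFER=$((4 * 1024 * 1024))   # 4 MB in bytes, assuming 1 MB = 1024 * 1024 bytes

# Placeholder source and destination (assumptions); replace with your own URLs.
SRC="file:///path/to/myfile"
DST="gsiftp://gftp.prace.bsc.es:2811/path/to/myfile"

# -vb reports progress, -p sets the number of parallel streams,
# -tcp-bs sets the TCP buffer size in bytes.
CMD="globus-url-copy -vb -p $STREAMS -tcp-bs $TCP_BUFFER $SRC $DST"
echo "$CMD"
```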

If you plan many transfers of big files, it may be advisable to vary these values and monitor how they influence performance. For instance, a TCP buffer size higher than the one recommended above could give better performance between sites with a larger latency. However, more memory is then used, which may in turn affect the transfer performance.
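
When experimenting with the buffer size, one common starting point (not taken from this guide) is the bandwidth-delay product, that is, the bandwidth multiplied by the round-trip time. The sketch below computes it with integer arithmetic; the function name bdp_bytes and the example figures are assumptions for illustration only.

```shell
#!/bin/sh
# Bandwidth-delay product in bytes.
# Usage: bdp_bytes BANDWIDTH_MBIT_PER_S RTT_MS   (hypothetical helper)
# 1 Mbit/s = 125000 bytes/s and 1 ms = 1/1000 s, hence the factor 125.
bdp_bytes() {
    echo $(( 125 * $1 * $2 ))
}

# Example: a 1000 Mbit/s path with a round-trip time of 32 ms
# (the round-trip time can be measured with ping).
bdp_bytes 1000 32
```

The result is only a candidate value for -tcp-bs; the limiting bandwidth is often unknown to users, so measured transfer rates remain the better guide.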

With regard to striping, the following PRACE sites currently support multi-striping: CINECA, CSC, IDRIS, HLRS, LRZ, IT4I (VSB-TUO)

First, make sure both the PRACE and the Globus modules have been loaded via the ‘module load’ command. This sets the $PRACE_HOME and $PRACE_SCRATCH environment variables and makes the Globus client commands, like globus-url-copy, available. For example:

module load prace globus

echo $PRACE_HOME
/prace/lrz/home/pr3d0001/pr3d15ab

echo $PRACE_DATA
/prace/lrz/data/pr3d0001/pr3d15ab

Please remember that if you want to move files to or from the client machine where globus-url-copy is available, a local installation of a GridFTP server is not required. The command line tools can address the client file system using the ‘file://’ syntax, as explained above. The remote system involved in the transfer must have a GridFTP service running (set up by the site administrator).

The ‘prace_service’ command offers a convenient way to refer to the PRACE sites hosting a GridFTP server on the internal PRACE network. Type ‘prace_service -l’ to obtain a list of the available PRACE sites together with their short names. Use prace_service -i -f <short site name> to get the URL identifying the remote machine. A list of the internal PRACE GridFTP servers follows:

Site Machine Short name for ‘prace_service’ URL:Port
Barcelona Supercomputing Center MareNostrum bsc gftp.prace.bsc.es:2811
Commissariat à l’Énergie Atomique et aux Énergies alternatives Curie cea garbin-prace.eole.ccc.cea.fr:2812
The Finnish IT Center for Science CSC Sisu csc gridftp-prace.csc.fi:2811
Cyfronet cyfronet prace-int.cyfronet.pl:2811
Edinburgh Parallel Computing Centre EPCC Hector epcc dtn01-prace.rdf.ac.uk:2811
Forschungszentrum Jülich FZJ Juqueen fzj juqueen1p.fz-juelich.de:2813
High Performance Computing Center Stuttgart HLRS Hazelhen hlrs-hazelhen gridftp-fr1.hww.de:2812
ICHEC Fionn ichec fionn.ichec.ie:2811
Institut du Développement et des Ressources en Informatique Scientifique IDRIS Turing idris turing2-d.idris.fr:2812
Leibniz Supercomputing Centre LRZ SuperMUC lrz-supermuc supermuc-prace.lrz.de:2811
National Center for Supercomputing Applications NCSA ncsa bg-fen.scc.acad.bg:2811
NIIFI NIIFI SC prace-login.sc.niif.hu:2811
NIIFI NIIFI SEGED headnode-vlan907.szeged.hpc.niif.hu:2811
NIIFI NIIFI LEO leo-login.sc.niif.hu:2811
NIIFI NIIFI PHIT phitagoras.sc.niif.hu:2811
SURFsara Computing and Networking Service Cartesius surfsara int2-prace.cartesius.surfsara.nl:2812
University of Oslo UiOsigma2 uio gridftp1.prace.uio.no:2811
Wroclaw Centre for Networking and Supercomputing WCSS wcss prace-int.wcss.pl:2811
IT4Innovations VSB-TUO Anselm gridftp-prace.anselm.it4i.cz:2812
IT4Innovations VSB-TUO Salomon gridftp-prace.salomon.it4i.cz:2812

The table reflects the status of deployment as of November 2015.
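
On machines where prace_service is not available, the table could be mirrored in a small lookup function, as in the sketch below. The function name gridftp_endpoint is hypothetical, only some of the sites are included, and the entries are copied from the table above, so they share its November 2015 status. On PRACE systems, prace_service -i -f remains the preferred way to obtain the address.

```shell
#!/bin/sh
# Map a prace_service short name to the URL:Port listed in the table above.
# Usage: gridftp_endpoint SHORT_NAME   (hypothetical helper)
gridftp_endpoint() {
    case $1 in
        bsc)           echo gftp.prace.bsc.es:2811 ;;
        cea)           echo garbin-prace.eole.ccc.cea.fr:2812 ;;
        csc)           echo gridftp-prace.csc.fi:2811 ;;
        fzj)           echo juqueen1p.fz-juelich.de:2813 ;;
        hlrs-hazelhen) echo gridftp-fr1.hww.de:2812 ;;
        idris)         echo turing2-d.idris.fr:2812 ;;
        lrz-supermuc)  echo supermuc-prace.lrz.de:2811 ;;
        surfsara)      echo int2-prace.cartesius.surfsara.nl:2812 ;;
        *)  echo "gridftp_endpoint: unknown site: $1" >&2
            return 1 ;;
    esac
}

# Print the base URL of the GridFTP server at FZJ
echo "gsiftp://$(gridftp_endpoint fzj)/"
```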

A fundamental prerequisite for a successful transfer is that the user account exists and is authorized for the GridFTP service on both the local and the remote sites. In case of any problems, please contact the Helpdesk.

Finally, an example: to copy the local file ‘myfile’ to the $PRACE_HOME and $PRACE_SCRATCH directories mounted on the GridFTP server at, say, RZG, use the following commands:

globus-url-copy file://`pwd`/myfile gsiftp://`prace_service -i -f rzg`/$PRACE_HOME/myfile

globus-url-copy file://`pwd`/myfile gsiftp://`prace_service -i -f rzg`/$PRACE_SCRATCH/myfile

Access from a Machine Outside of the PRACE Environment

At the moment no PRACE site offers GridFTP services to the public Internet.

Data Transfer With globus-url-copy

In this subsection we describe how you can use globus-url-copy to

  • copy data between your local workstation and the PRACE infrastructure
  • copy data from one PRACE platform to another PRACE platform

After that we give some concrete examples that show how to use the globus-url-copy command.

Copying Data Between a Local Workstation and the PRACE Infrastructure

To transfer files from your local workstation to PRACE, you need to have a Globus installation available and to use one of the PRACE GridFTP Door Nodes listed in the table above.

On the GridFTP Door Node server

  • the Globus Toolkit has been installed,
  • connections to the PRACE network, and thus to the GridFTP servers at every PRACE site, are available,
  • the machine can be accessed from the public internet.

This requires access to PRACE systems, that is the user should:

Copying Files with globus-url-copy

Here we show how to copy files and entire directories. As a general rule, to avoid quota, accessibility and performance problems, we recommend using the temporary directory $PRACE_SCRATCH.

Before using globus-url-copy, you have to generate a proxy credential based on your credentials (i.e. your permanent public/private key pair) with the command grid-proxy-init. This process, which involves both your credentials and the trusted CA certificates, is also described in the PRACE Certificates FAQ.

grid-proxy-init
Your identity: /C=DE/O=GridGermany/OU=Leibniz-Rechenzentrum/OU=HLS/CN=Gabriel Mateescu
Enter GRID pass phrase for this identity:
Creating proxy .................................. Done
Your proxy is valid until: Fri Mar 10 05:09:41 2012
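
Before a long transfer it can be useful to check how much lifetime the proxy has left. The sketch below assumes that grid-proxy-info -timeleft, which ships with the Globus client tools, prints the remaining lifetime in seconds. The helper name proxy_has_time_left and the one-hour threshold are hypothetical choices.

```shell
#!/bin/sh
# Succeed if SECONDS_LEFT is a number that is at least MIN_SECONDS.
# Usage: proxy_has_time_left SECONDS_LEFT MIN_SECONDS   (hypothetical helper)
proxy_has_time_left() {
    left=$1
    min=$2
    case $left in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: treat as no valid proxy
    esac
    [ "$left" -ge "$min" ]
}

# On a machine with the Globus client tools the check could be used like this:
#   left=$(grid-proxy-info -timeleft 2>/dev/null)
#   proxy_has_time_left "$left" 3600 || grid-proxy-init

# Stand-alone demonstration with fixed values
proxy_has_time_left 43200 3600 && echo "proxy is valid long enough"
proxy_has_time_left "" 3600 || echo "run grid-proxy-init first"
```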

Copy a file

Let’s assume that you have stored a large file "myfile" in the current working directory of your local workstation, and that you want to use it as an input file for a calculation on a PRACE production system. To upload it to PRACE using GridFTP, you can use either CINECA’s or LRZ’s GridFTP service.

globus-url-copy file://`pwd`/myfile gsiftp://gftp-prace.cineca.it/<PRACE_HOME>/myfile

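Here <PRACE_HOME> stands for the value of the $PRACE_HOME variable on the PRACE system, which you can display there after loading the prace module:

module load prace

echo $PRACE_HOME
/prace/lrz/home/pr3d0001/pr3d15ab

echo $PRACE_SCRATCH
/prace/lrz/data/pr3d0001/pr3d15a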

Copy a directory

We will describe how to copy the subdirectory "mydirectory" of the current directory to the user’s remote PRACE_HOME directory:

globus-url-copy -cd -r file://`pwd`/mydirectory/ gsiftp://gftp-prace.cineca.it/<PRACE_HOME>/mydirectory/

where the -cd option stands for "create directory"; its purpose is to create the directory named "mydirectory" as a subdirectory of the remote PRACE_HOME directory. To include subdirectories, use the recursive copy option -r. Note that we terminate the URLs with a / to indicate that we refer to a directory. Note also that we have omitted the port number, since 2811 is the GridFTP default port.
