This document describes the deployment and usage of the data transfer facility GridFTP, a component of the Globus middleware suite. It allows data to be transferred efficiently between PRACE sites as well as between a non-PRACE site and a PRACE site.
Large scientific applications running on the PRACE HPC infrastructure typically both analyse and generate huge amounts of data. These data sets cannot be expected to be stored permanently at the respective PRACE sites. In addition, an alternative site-to-site file transfer mechanism is required in case a partner site cannot be connected to the PRACE shared filesystem for technical reasons. To enable users to stage data in and out of PRACE HPC systems without common shared filesystems, PRACE sites provide an efficient file transfer service based on GridFTP.
GridFTP is a protocol for high-performance, secure and reliable data transfer over high-bandwidth WANs such as the PRACE internal network. PRACE users also need to move data between the PRACE infrastructure and their local file systems. For this purpose, some sites provide GridFTP servers that can be reached via the public internet. These so-called Door Nodes are listed in a table in section "Access from a machine outside of the PRACE environment".
PRACE uses the GridFTP server implementation that comes with the Globus Toolkit.
How Can GridFTP Be Used for File Transfer
To use GridFTP for file transfer, you need either a GridFTP client program running on your local workstation, which provides the interface between you and a remote GridFTP server, or access to a PRACE site with a GridFTP installation. Several GridFTP clients are available. One of them is globus-url-copy, a command line tool which can transfer files using the GridFTP protocol as well as other protocols such as http and ftp. globus-url-copy is distributed with the Globus Toolkit, is usually available on machines where the Globus Toolkit is installed, and is mandatory on all sites participating in the PRACE network.
Globus URL copy Syntax
The basic syntax of the globus-url-copy command is:
globus-url-copy [options] sourceURL destinationURL
where the arguments are described in the following table.
Arguments of globus-url-copy
|[options]||The optional command line switches as described in the Command line options for globus-url-copy section.|
|sourceURL||The URL of the file(s) to be copied. If it is a directory, it must end with a slash (/), and all files within that directory will be copied.|
|destinationURL||The URL to which to copy the file(s). To copy several files to one destination URL, destinationURL must be a directory and be terminated with a slash (/).|
globus-url-copy supports multiple protocols, so the format of the source and destination URLs can be either
file://<path>
when you refer to a local file or directory, or
gsiftp://<host>:<port>/<path>
when you refer to a remote file or directory. Although globus-url-copy also supports other protocols such as http, https and ftp, only the GridFTP protocol (gsiftp://) can be used in the PRACE infrastructure.
The port number can be omitted if the GridFTP server listens on the default port 2811. The path
- must be an absolute path for file://
- should be an absolute path for gsiftp://
- must be terminated with a slash (/) if it refers to a directory
- may be a relative path for gsiftp:// (experts only); in this case the path, relative to the user’s home directory, must be given right after the specified port without a space
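The path and port rules above can be sketched as two small shell helpers. This is only an illustration: `local_url` and `remote_url` are hypothetical function names, not part of the Globus Toolkit, and `gftp.example.org` is a placeholder host.

```shell
# Hypothetical helpers (not part of Globus) that assemble URLs per the rules above.

# local_url PATH: file:// requires an absolute path.
local_url() {
    case "$1" in
        /*) printf 'file://%s\n' "$1" ;;
        *)  echo "local_url: path must be absolute: $1" >&2; return 1 ;;
    esac
}

# remote_url HOST PORT PATH: the port is left out when it is the default 2811.
remote_url() {
    case "$3" in
        /*) ;;
        *)  echo "remote_url: expected an absolute path: $3" >&2; return 1 ;;
    esac
    if [ "$2" = "2811" ]; then
        printf 'gsiftp://%s%s\n' "$1" "$3"
    else
        printf 'gsiftp://%s:%s%s\n' "$1" "$2" "$3"
    fi
}

local_url /tmp/myfile
remote_url gftp.example.org 2811 /scratch/user/
remote_url gftp.example.org 2812 /scratch/user/myfile
```

The helpers only build strings; they do not check that the host exists or that a directory path ends with a slash.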
To transfer data with globus-url-copy using the gsiftp:// protocol, the user must have valid credentials, as will be described in the Copying files and directories with globus-url-copy section. Normally you will use file:// for addressing a local file or directory, and gsiftp:// for addressing a remote file or directory. However, note that the GridFTP protocol supports so-called Third Party Transfers where you can transfer data between two remote servers. In this case you have to use gsiftp:// both for the source and the destination URL.
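A Third Party Transfer might be prepared as in the following sketch. The function name `third_party_cmd` and both host names are placeholders chosen for illustration. The function only prints the command so that it can be inspected before it is run with valid credentials.

```shell
# third_party_cmd SRC DST: print a globus-url-copy command for a transfer
# between two remote servers. Both URLs must use gsiftp://.
third_party_cmd() {
    for url in "$1" "$2"; do
        case "$url" in
            gsiftp://*) ;;
            *) echo "third-party transfers need gsiftp:// on both ends: $url" >&2
               return 1 ;;
        esac
    done
    printf 'globus-url-copy %s %s\n' "$1" "$2"
}

# Placeholder servers; substitute GridFTP servers you are authorized on.
third_party_cmd gsiftp://source.example.org/data/output.dat \
                gsiftp://target.example.org:2812/scratch/output.dat
```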
Command Line Options For globus-url-copy
For example, the structure of a transfer command from a local file to a remote host would look like the following:
globus-url-copy file:///<path>/<file> gsiftp://<host>:<port>/<path>/<file>
We present the most important command line options. For a much more comprehensive description of available options, see the documentation on the Globus website http://www.globus.org/.
When you use the optional parameters given in the table below, you will get additional information:
|-help||Prints usage information for the globus-url-copy program.|
|-version||Prints the version of the globus-url-copy program.|
|-vb||During the transfer, displays: (1) number of bytes transferred (2) performance since the last update (every 5 seconds) (3) average performance for the whole transfer|
The following table lists parameters which you can set to optimize the performance of your data transfer:
Parameters to optimize performance
|-tcp-bs <size>||Specifies the size (in bytes) of the TCP buffer to be used by the underlying GridFTP data channels.|
|-p <parallelism>||Specifies the number of parallel streams to be used in the GridFTP transfer.|
|-stripe||Use this parameter to initiate a "striped" GridFTP transfer that uses more than one node at both the source and destination. As multiple nodes contribute to the transfer, each using its own network interface, a larger amount of the network bandwidth can be consumed than with a single system. Thus, at least for "big" (> 100 MB) files, striping can considerably improve performance.|
How to choose values for these parameters?
Concerning the first two parameters, namely the TCP buffer size and the number of parallel streams, the optimal values depend on factors such as the latency between the source and destination sites, the available bandwidth, network traffic, etc. Some of these factors can be determined by the user (for instance, you can measure the latency yourself using ping), whilst others, such as the limiting bandwidth, are only known to the network administrators at the various PRACE sites. However, as a rule of thumb, we recommend the following values:
- four parallel streams should be enough.
- for the typical latencies that occur in the PRACE network use 4MB for the TCP buffer size.
If you plan many transfers of big files, it may be advisable to vary these values and monitor how they influence performance. For instance, a larger TCP buffer size than the one recommended above could give better performance between sites with a larger latency. However, a larger buffer uses more memory, which may in turn affect the transfer performance.
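As a rough sketch, the rule-of-thumb values can be turned into command line switches as shown below. Here -p and -tcp-bs are the globus-url-copy switches for parallel streams and TCP buffer size, and -vb prints transfer performance. The sketch assumes 1 MB means 1024*1024 bytes. The bandwidth-delay product used in `bdp_bytes` (a hypothetical helper) is a common starting point for sizing TCP buffers, not a PRACE recommendation, and its input figures are made up.

```shell
# Rule-of-thumb values from the text: 4 parallel streams, 4 MB TCP buffer.
STREAMS=4
TCP_BUFFER=$((4 * 1024 * 1024))   # -tcp-bs expects bytes; assumes 1 MB = 1024*1024

# bdp_bytes MBIT_PER_S RTT_MS: bandwidth-delay product in bytes.
# 1 Mbit/s sustained for 1 ms carries 125 bytes, hence the factor 125.
bdp_bytes() {
    echo $(( $1 * 125 * $2 ))
}

# Print the command with placeholders left in; nothing is transferred here.
echo "globus-url-copy -vb -p $STREAMS -tcp-bs $TCP_BUFFER <sourceURL> <destinationURL>"

# Made-up example figures: a 1000 Mbit/s path with a 30 ms round-trip time.
bdp_bytes 1000 30
```

If the estimate for a given path is far above 4 MB, that is a hint to experiment with a larger buffer while monitoring throughput with -vb.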
With regard to striping, the following PRACE sites currently support multi-striping: CINECA, CSC, IDRIS, HLRS, LRZ, IT4I (VSB-TUO).
First, make sure both the PRACE and the Globus modules have been loaded via the ‘module load’ command. This sets the $PRACE_HOME and $PRACE_SCRATCH environment variables and makes the Globus client commands, such as globus-url-copy, available. For example:
module load prace globus
echo $PRACE_SCRATCH
/prace/lrz/home/pr3d0001/pr3d15ab
echo $PRACE_DATA
/prace/lrz/data/pr3d0001/pr3d15ab
Please remember that if you want to move files to or from the client machine where globus-url-copy is available, a local installation of a GridFTP server is not required. The command line tools can address the client file system using the ‘file://’ syntax explained above. The remote system involved in the transfer must have a GridFTP service running (set up by the site administrator).
The ‘prace_service’ command offers a convenient way to refer to the PRACE sites hosting a GridFTP server on the internal PRACE network. Type ‘prace_service -l’ to obtain a list of the available PRACE sites together with their short names. Use
prace_service -i -f <short site name>
to get the URL identifying the remote machine. A list of the internal PRACE GridFTP servers follows:
|Site||Machine||Short name for ‘prace_service’||URL:Port|
|Barcelona Supercomputing Center||MareNostrum||bsc||gftp.prace.bsc.es:2811|
|Commissariat à l’Énergie Atomique et aux Énergies alternatives||Curie||cea||garbin-prace.eole.ccc.cea.fr:2812|
|The Finnish IT Center for Science CSC||Sisu||csc||gridftp-prace.csc.fi:2811|
|Edinburgh Parallel Computing Centre EPCC||Hector||epcc||dtn01-prace.rdf.ac.uk:2811|
|Forschungszentrum Jülich FZJ||Juqueen||fzj||juqueen1p.fz-juelich.de:2813|
|High Performance Computing Center Stuttgart HLRS||Hazelhen||hlrs-hazelhen||gridftp-fr1.hww.de:2812|
|Institut du Développement et des Ressources en Informatique Scientifique IDRIS||Turing||idris||turing2-d.idris.fr:2812|
|Leibniz Supercomputing Centre LRZ||SuperMUC||lrz-supermuc||supermuc-prace.lrz.de:2811|
|National Center for Supercomputing Applications NCSA||||ncsa||bg-fen.scc.acad.bg:2811|
|SURFsara Computing and Networking Service||Cartesius||surfsara||int2-prace.cartesius.surfsara.nl:2812|
|University of Oslo UiOsigma2||uio||gridftp1.prace.uio.no:2811|
|Wroclaw Centre for Networking and Supercomputing WCSS||||wcss||prace-int.wcss.pl:2811|
The table reflects the status of deployment as of November 2015.
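The lookup described above could be wrapped as in the following sketch. `site_endpoint` is a hypothetical helper. It calls prace_service when that command is available and otherwise falls back to a few entries copied from the table, which may be out of date.

```shell
# site_endpoint SHORTNAME: print host:port of a site's internal GridFTP server.
site_endpoint() {
    if command -v prace_service >/dev/null 2>&1; then
        prace_service -i -f "$1"
        return
    fi
    # Fallback: a few entries copied from the table above (may be outdated).
    case "$1" in
        bsc)          echo "gftp.prace.bsc.es:2811" ;;
        csc)          echo "gridftp-prace.csc.fi:2811" ;;
        lrz-supermuc) echo "supermuc-prace.lrz.de:2811" ;;
        *) echo "site_endpoint: unknown site: $1" >&2; return 1 ;;
    esac
}

# Build the base URL of a remote site for use with globus-url-copy.
echo "gsiftp://$(site_endpoint csc)/"
```

Preferring prace_service over the hard-coded entries keeps the script correct when a site changes its host name or port.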
A fundamental prerequisite for a successful transfer is that the user account exists and is authorized for the GridFTP service on both the local and the remote sites. In case of any problems, please contact the Helpdesk.
Finally, an example: to copy the local file ‘myfile’ to the $PRACE_HOME and $PRACE_SCRATCH directories mounted on the GridFTP server at RZG, say, use the following commands:
globus-url-copy file://`pwd`/myfile gsiftp://`prace_service -i -f rzg`/$PRACE_HOME/myfile
globus-url-copy file://`pwd`/myfile gsiftp://`prace_service -i -f rzg`/$PRACE_SCRATCH/myfile
Access from a Machine Outside of the PRACE Environment
At the moment, no PRACE site offers GridFTP services on the public Internet.
Data Transfer With globus-url-copy
In this subsection we describe how you can use globus-url-copy to
- copy data between your local workstation and the PRACE infrastructure
- copy data from one PRACE platform to another PRACE platform
After that we give some concrete examples that show how to use the globus-url-copy command.
Copying Data Between a Local Workstation and the PRACE Infrastructure
To transfer files from your local workstation to PRACE, you need a Globus installation and must use one of the PRACE GridFTP Door Nodes listed in the table above.
On the GridFTP Door Node server
- the Globus Toolkit has been installed,
- connections to the PRACE network, and thus to the GridFTP servers at every PRACE site, are available,
- the machine can be accessed from the public internet.
This requires access to PRACE systems, that is, the user should:
- have a valid PRACE account
- obtain an X.509 certificate (please refer to PRACE Certificate FAQ for more details)
- access the PRACE infrastructure, as explained in the Interactive Access to HPC resources section of the PRACE User Documentation
Copying Files with globus-url-copy
Here we show how to copy files and entire directories. As a general rule, to avoid quota, accessibility and performance problems, we recommend using the temporary directory $PRACE_SCRATCH.
Before using globus-url-copy, you have to generate a proxy credential based on your credentials (i.e. your permanent public/private key pair) with the command grid-proxy-init. This process, which involves both your credentials and the trusted CA certificates, is also described in the PRACE Certificates FAQ.
grid-proxy-init
Your identity: /C=DE/O=GridGermany/OU=Leibniz-Rechenzentrum/OU=HLS/CN=Gabriel Mateescu
Enter GRID pass phrase for this identity:
Creating proxy .................................. Done
Your proxy is valid until: Fri Mar 10 05:09:41 2012
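Because a proxy expires, it can be useful to check its remaining lifetime before starting a long transfer. The sketch below relies on grid-proxy-info -timeleft, which reports the remaining validity in seconds. The function name `proxy_seconds_left` and the one-hour threshold are assumptions made for illustration.

```shell
# proxy_seconds_left: seconds of validity left on the current proxy.
# Prints 0 if there is no proxy, the tool is missing, or the output is unexpected.
proxy_seconds_left() {
    left=$(grid-proxy-info -timeleft 2>/dev/null) || left=0
    case "$left" in
        ''|*[!0-9]*) left=0 ;;
    esac
    echo "$left"
}

MIN_SECONDS=3600   # hypothetical threshold: one hour
if [ "$(proxy_seconds_left)" -lt "$MIN_SECONDS" ]; then
    echo "Proxy missing or expiring soon; run grid-proxy-init first."
fi
```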
Copy a file
Let’s assume that you have stored a large file "myfile" in the current working directory of your local workstation, and that you want to use it as an input file for a calculation on a PRACE production system. To upload it to PRACE using GridFTP, you can use either CINECA’s or LRZ’s GridFTP service.
globus-url-copy file://`pwd`/myfile gsiftp://gftp-prace.cineca.it/<PRACE_HOME>/myfile
module load prace
echo $PRACE_HOME
/prace/lrz/home/pr3d0001/pr3d15ab
echo $PRACE_SCRATCH
/prace/lrz/data/pr3d0001/pr3d15a
Copy a directory
We will describe how to copy the subdirectory "mydirectory" of the current directory to the user’s remote PRACE_HOME directory:
globus-url-copy -cd -r file://`pwd`/mydirectory/ gsiftp://gftp-prace.cineca.it/<PRACE_HOME>/mydirectory/
where the -cd option stands for "create directory" and its purpose is to create the directory named "mydirectory" as a subdirectory of the remote PRACE_HOME directory. To include subdirectories, use the recursive copy option -r. Note that we terminate the URLs with a / to indicate that we refer to a directory. Note also that we have omitted the port number, because 2811 is the GridFTP default port.
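A forgotten trailing slash is an easy mistake in directory copies, so a small wrapper can add it. In this sketch, `with_slash` and `dir_copy_cmd` are hypothetical helper names and `gftp.example.org` is a placeholder host. The command is printed rather than executed.

```shell
# with_slash URL: append a trailing slash unless one is already present.
with_slash() {
    case "$1" in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}

# dir_copy_cmd SRC DST: print (not run) a recursive directory copy command.
dir_copy_cmd() {
    printf 'globus-url-copy -cd -r %s %s\n' "$(with_slash "$1")" "$(with_slash "$2")"
}

dir_copy_cmd "file://$PWD/mydirectory" "gsiftp://gftp.example.org/scratch/mydirectory"
```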