From 5e93381bc99e19dc541a1cd61bf0a0e92e3c0a12 Mon Sep 17 00:00:00 2001
From: Matthew K Defenderfer <mdefende@uab.edu>
Date: Tue, 20 Aug 2024 13:41:17 -0500
Subject: [PATCH] add instructions for transferring data

---
 README.md | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/README.md b/README.md
index 8f193ff..31113d5 100644
--- a/README.md
+++ b/README.md
@@ -88,6 +88,37 @@ In order to access the container, you will need to add a personal access token f
 
 ### Pre-parse output for Python
 
+### Parallel Transfer Using s5cmd
+
+In cases where a large amount of data needs to be offloaded from GPFS to LTS, Globus is not sufficiently performant. Instead, the `s5cmd` parallel transfer tool should be used. Scripts for this purpose are located in `transfer-gpfs-with-s5cmd`. The shell script reads a formatted GPFS parquet dataset and finds the files located in a given directory. Those files are divided into batches, and a throttled array job is submitted in which each task transfers one batch.
+
+This script uses the `gpfs-policy` container, so no environment setup is needed. An AWS CLI credentials file is required. Its default location is `${HOME}/.aws/credentials`, and it has the following form:
+
+```
+# Default profile #
+[default]
+aws_access_key_id = <lts_access_key>
+aws_secret_access_key = <lts_secret_key>
+```
+
+More than one profile can be added to the same credentials file as long as the profile names in `[]` are unique. The `default` profile is used unless specified otherwise.
+
+#### Usage
+
+```
+Usage: ./run-fpart-db.sh [ -h ]
+         [ -n | --ntasks ] [ -p | --partition] [ -t | --time ]
+         [ -m | --mem ] [ -c | --split-count ] [ -d | --part-dir ]
+         [ -a | --aws-credentials-file ] [ -u | --credentials-profile ]
+         filter input_parquet destination
+```
+
+- `filter`: directory to transfer from the parquet list (e.g. `/scratch/$USER`)
+- `input_parquet`: parquet dataset to get the file list from
+- `destination`: LTS bucket to sync files to
+
+All other options can be seen by running `./run-fpart-db.sh -h`.
+
 ## Running reports
 
 ### Disk usage by top level directies
-- 
GitLab