From 5e93381bc99e19dc541a1cd61bf0a0e92e3c0a12 Mon Sep 17 00:00:00 2001
From: Matthew K Defenderfer <mdefende@uab.edu>
Date: Tue, 20 Aug 2024 13:41:17 -0500
Subject: [PATCH] add instructions for transferring data

---
 README.md | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/README.md b/README.md
index 8f193ff..31113d5 100644
--- a/README.md
+++ b/README.md
@@ -88,6 +88,37 @@ In order to access the container, you will need to add a personal access token f
 
 ### Pre-parse output for Python
 
+### Parallel Transfer Using s5cmd
+
+When a large amount of data needs to be offloaded from GPFS to LTS, Globus is not sufficiently performant. Use the `s5cmd` parallel transfer tool instead. Scripts for this purpose are located in `transfer-gpfs-with-s5cmd`. The shell script reads a formatted GPFS parquet dataset and finds the files located in a given directory. Those files are divided into groups, and a throttled array job is submitted in which each task transfers one group.
+
+This script uses the `gpfs-policy` container, so no environment setup is needed. An AWS CLI credentials file is required. Its default location is `${HOME}/.aws/credentials`, and it has the following form:
+
+```
+# Default profile #
+[default]
+aws_access_key_id = <lts_access_key>
+aws_secret_access_key = <lts_secret_key>
+```
+
+More than one profile can be added to the same credentials file as long as the profile names in `[]` are unique. The `default` profile is used unless another is specified.
+
+#### Usage
+
+```
+Usage: ./run-fpart-db.sh [ -h ] 
+        [ -n | --ntasks ] [ -p | --partition] [ -t | --time ] 
+        [ -m | --mem ] [ -c | --split-count ] [ -d | --part-dir ] 
+        [ -a | --aws-credentials-file ] [ -u | --credentials-profile ] 
+        filter input_parquet destination
+```
+
+- `filter`: directory whose files should be transferred, matched against the parquet file list (e.g. `/scratch/$USER`)
+- `input_parquet`: parquet dataset to get file list from
+- `destination`: LTS bucket to sync files to
+
+All other options can be listed by running `./run-fpart-db.sh -h`.
+
 ## Running reports
 
 ### Disk usage by top level directies
-- 
GitLab