Commit 5e93381b authored by Matthew K Defenderfer

add instructions for transferring data

parent b56e9346
1 merge request: !9 Draft: Partition parquet dataset for sync with s5cmd
@@ -88,6 +88,37 @@ In order to access the container, you will need to add a personal access token f
### Pre-parse output for Python
### Parallel Transfer Using s5cmd
When a large amount of data needs to be offloaded from GPFS to LTS, Globus is not sufficiently performant, and the `s5cmd` parallel transfer tool should be used instead. Scripts for this purpose are located in `transfer-gpfs-with-s5cmd`. The shell script reads a formatted GPFS parquet dataset and finds the files located in a given directory. Those files are divided into batches, and a throttled array job is submitted in which each task transfers one batch.
This script uses the `gpfs-policy` container, so no environment setup is needed. An AWS CLI credentials file is required. Its default location is `${HOME}/.aws/credentials`, and it has the following form:
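The batching idea can be illustrated with a minimal sketch. This is not the repository's script: it assumes GNU `split`, a Slurm scheduler, a plain-text file list instead of the parquet dataset, and a hypothetical per-task script named `transfer-task.sh`. It only prints the submission command rather than submitting anything.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical inputs: a plain-text file list, a batch count, and a throttle.
workdir="$(mktemp -d)"
printf '%s\n' /scratch/user/a /scratch/user/b /scratch/user/c \
              /scratch/user/d /scratch/user/e > "${workdir}/files.txt"
split_count=2      # number of batches to create (assumed meaning)
max_concurrent=1   # how many array tasks may run at once

# GNU split: divide the list into N chunks without breaking lines,
# using numeric suffixes (part-000, part-001, ...).
split -n "l/${split_count}" -d -a 3 "${workdir}/files.txt" "${workdir}/part-"

# One array task per batch; "%N" is Slurm's syntax for throttling an array.
n_parts="$(find "${workdir}" -name 'part-*' | wc -l)"
array_spec="0-$((n_parts - 1))%${max_concurrent}"

# transfer-task.sh is a placeholder for whatever moves one batch.
echo "sbatch --array=${array_spec} transfer-task.sh"
```

Each array task would then pick up the batch file whose suffix matches its task index and transfer only those files.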
```
# Default profile #
[default]
aws_access_key_id = <lts_access_key>
aws_secret_access_key = <lts_secret_key>
```
More than one profile can be added to the same credentials file as long as the profile names in `[]` are unique. The `default` profile is used unless another is specified.
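As an illustration of a multi-profile file, the sketch below writes two profiles to a temporary path (so an existing `${HOME}/.aws/credentials` is left untouched) and lists the profile names it finds. The second profile name, `project`, and its placeholder keys are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write to a scratch location instead of ~/.aws/credentials.
cred_file="$(mktemp)"
chmod 600 "${cred_file}"   # keep the keys readable by the owner only

cat > "${cred_file}" <<'EOF'
[default]
aws_access_key_id = <lts_access_key>
aws_secret_access_key = <lts_secret_key>

# "project" is a hypothetical second profile name
[project]
aws_access_key_id = <project_access_key>
aws_secret_access_key = <project_secret_key>
EOF

# List the profile names, i.e. the text between [ and ] on header lines.
profiles="$(sed -n 's/^\[\(.*\)\]$/\1/p' "${cred_file}")"
echo "${profiles}"
```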
#### Usage
```
Usage: ./run-fpart-db.sh [ -h ]
[ -n | --ntasks ] [ -p | --partition] [ -t | --time ]
[ -m | --mem ] [ -c | --split-count ] [ -d | --part-dir ]
[ -a | --aws-credentials-file ] [ -u | --credentials-profile ]
filter input_parquet destination
```
- `filter`: directory to transfer from parquet list (e.g. `/scratch/$USER`)
- `input_parquet`: parquet dataset to get file list from
- `destination`: LTS bucket to sync files to
All other options can be seen by running `./run-fpart-db.sh -h`.
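A possible invocation, assembled from the usage text above, might look like the following sketch. The parquet path, the bucket name and its `s3://` form, and the option values are all hypothetical; check `./run-fpart-db.sh -h` for what each option actually expects. The command is printed for review instead of being executed.

```shell
#!/usr/bin/env bash
set -euo pipefail

filter="/scratch/${USER:-someuser}"        # directory to transfer
input_parquet="/path/to/parquet-dataset"   # hypothetical dataset location
destination="s3://example-bucket"          # hypothetical bucket; format assumed

# Option values are illustrative assumptions, not documented defaults.
cmd=(./run-fpart-db.sh
     --split-count 10
     --credentials-profile default
     "${filter}" "${input_parquet}" "${destination}")

# Print the command for review; run it with "${cmd[@]}" once it looks right.
printf '%s ' "${cmd[@]}"
echo
```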
## Running reports
### Disk usage by top level directories