Commit 90b63609 authored by Matthew K Defenderfer

add instructions in README

parent 8dd5cb12
1 merge request: !8 Automate conversion of GPFS policy outputs to parquet without Jupyter
@@ -65,6 +65,24 @@ The output file is an unsorted list of files in uncompressed ASCII. Further proc
### Pre-parse output for Python
Processing GPFS log outputs is controlled by the `run-convert-to-parquet.sh` script and assumes the GPFS log has been split into a number of files of the form `list-XXX.gz` where `XXX` is an incrementing numeric index. This creates an array job where each task in the array reads the quoted text in one file, parses it into a dataframe, and exports it as a parquet file with the name `list-XXX.parquet`.
While each file is being parsed, the top-level directory (`tld`) is extracted for each entry and added as a separate column to make common aggregations easier.
This script is written to parse the `list-policy-external` policy format with quoted special characters.
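A minimal sketch of what each array task does, assuming pandas and pyarrow are available and that the URL-quoted file path is the last whitespace-separated field on each line (hypothetical layout; the actual parser handles the full set of policy attributes emitted by the list policy):

```
# sketch.py -- illustrative only; column layout is an assumption
import gzip
from pathlib import PurePosixPath
from urllib.parse import unquote

import pandas as pd  # writing parquet also needs pyarrow or fastparquet

def parse_list_file(path: str) -> pd.DataFrame:
    records = []
    with gzip.open(path, "rt") as fh:
        for line in fh:
            # Assumption: the URL-quoted file path is the last field.
            fpath = unquote(line.rstrip().split()[-1])
            parts = PurePosixPath(fpath).parts
            # tld: first component below the filesystem root, e.g.
            # /data/project/... -> "project" (assumed convention).
            tld = parts[2] if len(parts) > 2 else ""
            records.append({"path": fpath, "tld": tld})
    return pd.DataFrame.from_records(records)

df = parse_list_file("list-001.gz")
df.to_parquet("list-001.parquet")
```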
```
Usage: ./run-convert-to-parquet.sh [ -h ]
       [ -o | --outdir ] [ -n | --ntasks ] [ -p | --partition ]
       [ -t | --time ] [ -m | --mem ]
       gpfs_logdir
```
- `outdir`: Path to save parquet outputs. Defaults to `${gpfs_logdir}/parquet`
- `gpfs_logdir`: Directory path containing the split log files as `*.gz`
All other options control the array job resources. The default resources can parse files of around 5 million lines in approximately 3 minutes, so they should cover most common use cases. An example invocation is shown below.
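For example, to convert all split logs under a hypothetical `/data/gpfs-logs` directory using the default job resources:

```
./run-convert-to-parquet.sh /data/gpfs-logs
```

With no `-o | --outdir` given, the parquet files land in `/data/gpfs-logs/parquet`, one `list-XXX.parquet` per input `list-XXX.gz`.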
## Running reports
### Disk usage by top level directories