From 90b636091c7317c1f4a9270b46ccac9360115383 Mon Sep 17 00:00:00 2001
From: Matthew K Defenderfer <mdefende@uab.edu>
Date: Tue, 20 Aug 2024 14:18:07 -0500
Subject: [PATCH] add instructions in README

---
 README.md | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/README.md b/README.md
index 37cd4dd..132f64e 100644
--- a/README.md
+++ b/README.md
@@ -65,6 +65,59 @@ The output file is an unsorted list of files in uncompressed ASCII.  Further proc
 
 ### Pre-parse output for Python
 
+Processing GPFS log output is controlled by the `run-convert-to-parquet.sh` script, which assumes the GPFS log has been split into files of the form `list-XXX.gz`, where `XXX` is an incrementing numeric index. The script submits an array job in which each task reads the quoted text in one file, parses it into a dataframe, and exports it as a parquet file named `list-XXX.parquet`.
+
+While each file is being parsed, the top-level directory (`tld`) of every entry is extracted and added as a separate column to make common aggregations easier.
+
+This script is written to parse the `list-policy-external` policy format with quoted special characters.
+
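+As a rough illustration of what each array task does, here is a minimal sketch (not the repository's implementation) that reads one split file, unquotes each path, extracts the `tld`, and writes a parquet file. The record layout, the `path` column name, and the use of pandas are assumptions for this example, and `to_parquet` requires `pyarrow` or `fastparquet`:
+
+```
+# Hypothetical sketch of one array task; the field layout is an assumption.
+import gzip
+import sys
+from pathlib import PurePosixPath
+from urllib.parse import unquote
+
+import pandas as pd
+
+def convert(infile: str, outdir: str) -> None:
+    paths = []
+    with gzip.open(infile, "rt") as fh:
+        for line in fh:
+            # Assume the quoted file path is the last whitespace-separated
+            # field; real list-policy-external records carry more fields.
+            paths.append(unquote(line.rstrip("\n").split(" ")[-1]))
+
+    df = pd.DataFrame({"path": paths})
+    # Add the top-level directory of each path as its own column.
+    df["tld"] = df["path"].str.lstrip("/").str.split("/").str[0]
+    stem = PurePosixPath(infile).stem  # e.g. "list-001.gz" -> "list-001"
+    df.to_parquet(f"{outdir}/{stem}.parquet")
+
+if __name__ == "__main__":
+    convert(sys.argv[1], sys.argv[2])
+```
+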
+```
+Usage: ./run-convert-to-parquet.sh [ -h ]
+        [ -o | --outdir ] [ -n | --ntasks ] [ -p | --partition ]
+        [ -t | --time ] [ -m | --mem ]
+        gpfs_logdir
+```
+
+- `outdir`: Path to save parquet outputs. Defaults to `${gpfs_logdir}/parquet`
+- `gpfs_logdir`: Directory path containing the split log files as `*.gz`
+
+All other options control the resources for the array job. The default resources can parse files of around 5 million lines in approximately 3 minutes, so they should cover all common use cases.
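+
+For example, a hypothetical invocation that converts the split files under a placeholder log directory, writing parquet output to a custom location, might look like:
+
+```
+./run-convert-to-parquet.sh -o /scratch/mylogs/parquet /scratch/mylogs
+```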
+
 ## Running reports
 
 ### Disk usage by top-level directories
-- 
GitLab