From 51b788bab0ca94606c3a93087db8c08949838b1e Mon Sep 17 00:00:00 2001 From: Matthew K Defenderfer <mdefende@uab.edu> Date: Mon, 16 Sep 2024 17:52:08 -0500 Subject: [PATCH] Fixed up documentation on converting logs to parquet --- README.md | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index cf75898..4e6a6c1 100644 --- a/README.md +++ b/README.md @@ -44,6 +44,8 @@ All other arguments are Slurm directives dictating resource requests. The defaul - `time`: `24:00:00` - `mem-per-cpu`: `8G` +This script was written using the default `python3` interpreter on Cheaha (version `3.6.8`), so no environment needs to be active. Running this script in an environment with a newer Python version may cause unexpected errors or behavior. + ### Run Any Policies (admins) Any defined policy file can be run using the `submit-pol-job` by running the following: @@ -98,23 +100,30 @@ Split files will have the form `${outdir}/list-XXX.gz` where XXX is an increment ### Pre-parse output for Python -Processing GPFS log outputs is controlled by the `run-convert-to-parquet.sh` script and assumes the GPFS log has been split into a number of files of the form `list-XXX.gz` where `XXX` is an incrementing numeric index. This creates an array job where each task in the array reads the quoted text in one file, parses it into a dataframe, and exports it as a parquet file with the name `list-XXX.parquet`. +Processing GPFS log outputs is controlled by the `run-convert-to-parquet.sh` script and assumes the GPFS log has been split into a number of files of the form `list-XXX.gz` where `XXX` is an incrementing numeric index. This creates an array job where each task in the array reads the quoted text in one file, parses it into a dataframe, and exports it as a parquet file with the name `list-XXX.parquet`.
While the file is being parsed, the top-level-directory (`tld`) is extracted for each entry and added as a separate column to make common aggregations easier. This script is written to parse the `list-path-external` policy format with quoted special characters. -``` -Usage: ./run-convert-to-parquet.sh [ -h ] - [ -o | --outdir ] [ -n | --ntasks ] [ -p | --partition] - [ -t | --time ] [ -m | --mem ] - gpfs_logdir" +```bash +./run-convert-to-parquet.sh [ -h ] [ -o | --outdir ] + [ -n | --ntasks ] [ -p | --partition ] + [ -t | --time ] [ -m | --mem ] + gpfs_logdir ``` - `outdir`: Path to save parquet outputs. Defaults to `${gpfs_logdir}/parquet` - `gpfs_logdir`: Directory path containing the split log files as `*.gz` -All other options control the array job resources. The default resources can parse 5 million line files in approximately 3 minutes so should cover all common use cases. +All other options control the array job resources. Default values are as follows: + +- `ntasks`: 1 +- `mem`: `16G` +- `time`: `02:00:00` +- `partition`: `amd-hdr100` + +The default resources can parse 5-million-line files in approximately 3 minutes, so they should cover all common use cases. ## Running reports @@ -124,7 +133,6 @@ A useful report is the top level directory (tld) report. This is akin to runnin ### Comparing directory similarity - ## Scheduling regular policy runs via cron The policy run can be scheduled automatically with the cronwrapper script. -- GitLab
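As an illustration of the parsing step the patched README describes (decoding quoted special characters and extracting the top-level directory), here is a minimal stdlib-only Python sketch. The simplified field layout and the `parse_line` helper are assumptions for illustration only; the actual conversion script's implementation may differ.

```python
from urllib.parse import unquote

def parse_line(line):
    """Parse one GPFS list-policy entry (hypothetical simplified layout).

    Assumes the metadata fields are separated from the URL-quoted file
    path by ' -- ', in the style of list-path-external policy output.
    """
    meta, _, quoted_path = line.partition(' -- ')
    # Special characters in the path are URL-quoted (e.g. '%20' for space)
    path = unquote(quoted_path.strip())
    # Top-level directory: first path component under the filesystem root
    parts = [p for p in path.split('/') if p]
    tld = parts[0] if parts else ''
    return {'path': path, 'tld': tld}

# Example: '%20' decodes to a space in the stored path
record = parse_line('515840 429411662 0 -- /scratch/user%20name/data.txt')
```

In the real script, each task in the array job would apply this kind of per-line parsing to its `list-XXX.gz` file, collect the records into a dataframe, and write it out as `list-XXX.parquet`.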