From 51b788bab0ca94606c3a93087db8c08949838b1e Mon Sep 17 00:00:00 2001 From: Matthew K Defenderfer <mdefende@uab.edu> Date: Mon, 16 Sep 2024 17:52:08 -0500 Subject: [PATCH] Fixed up documentation on converting logs to parquet --- README.md | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index cf75898..4e6a6c1 100644 --- a/README.md +++ b/README.md @@ -44,6 +44,8 @@ All other arguments are Slurm directives dictating resource requests. The defaul - `time`: `24:00:00` - `mem-per-cpu`: `8G` +This script was written using the default `python3` interpreter on Cheaha (version `3.6.8`), so no environment needs to be active. Running this script in an environment with a newer Python version may cause unexpected errors or behavior. + ### Run Any Policies (admins) Any defined policy file can be run using the `submit-pol-job` by running the following: @@ -98,23 +100,30 @@ Split files will have the form `${outdir}/list-XXX.gz` where XXX is an increment ### Pre-parse output for Python -Processing GPFS log outputs is controlled by the `run-convert-to-parquet.sh` script and assumes the GPFS log has been split into a number of files of the form `list-XXX.gz` where `XXX` is an incrementing numeric index. This creates an array job where each task in the array reads the quoted text in one file, parses it into a dataframe, and exports it as a parquet file with the name `list-XXX.parquet`. +Processing GPFS log outputs is controlled by the `run-convert-to-parquet.sh` script and assumes the GPFS log has been split into a number of files of the form `list-XXX.gz` where `XXX` is an incrementing numeric index. This creates an array job where each task in the array reads the quoted text in one file, parses it into a dataframe, and exports it as a parquet file with the name `list-XXX.parquet`.
While the file is being parsed, the top-level-directory (`tld`) is extracted for each entry and added as a separate column to make common aggregations easier. This script is written to parse the `list-path-external` policy format with quoted special characters. -``` -Usage: ./run-convert-to-parquet.sh [ -h ] - [ -o | --outdir ] [ -n | --ntasks ] [ -p | --partition] - [ -t | --time ] [ -m | --mem ] - gpfs_logdir" +```bash +./run-convert-to-parquet.sh [ -h ] [ -o | --outdir ] + [ -n | --ntasks ] [ -p | --partition ] + [ -t | --time ] [ -m | --mem ] + gpfs_logdir ``` - `outdir`: Path to save parquet outputs. Defaults to `${gpfs_logdir}/parquet` - `gpfs_logdir`: Directory path containing the split log files as `*.gz` -All other options control the array job resources. The default resources can parse 5 million line files in approximately 3 minutes so should cover all common use cases. +All other options control the array job resources. Default values are as follows: + +- `ntasks`: 1 +- `mem`: `16G` +- `time`: `02:00:00` +- `partition`: `amd-hdr100` + +The default resources can parse 5-million-line files in approximately 3 minutes, so they should cover all common use cases. ## Running reports @@ -124,7 +133,6 @@ A useful report is the top level directory (tld) report. This is akin to runnin ### Comparing directory similarity - ## Scheduling regular policy runs via cron The policy run can be scheduled automatically with the cronwrapper script. -- GitLab
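As an illustration of the parsing step the patched README describes (decoding quoted special characters and extracting the top-level directory), here is a minimal stdlib-only Python sketch. The simplified field layout and the `parse_line` helper are assumptions for illustration only; the actual conversion script's implementation may differ.

```python
from urllib.parse import unquote

def parse_line(line):
    """Parse one GPFS list-policy entry (hypothetical simplified layout).

    Assumes the metadata fields are separated from the URL-quoted file
    path by ' -- ', in the style of list-path-external policy output.
    """
    meta, _, quoted_path = line.partition(' -- ')
    # Special characters in the path are URL-quoted (e.g. '%20' for space)
    path = unquote(quoted_path.strip())
    # Top-level directory: first path component under the filesystem root
    parts = [p for p in path.split('/') if p]
    tld = parts[0] if parts else ''
    return {'path': path, 'tld': tld}

# Example: '%20' decodes to a space in the stored path
record = parse_line('515840 429411662 0 -- /scratch/user%20name/data.txt')
```

In the real script, each task in the array job would apply this kind of per-line parsing to its `list-XXX.gz` file, collect the records into a dataframe, and write it out as `list-XXX.parquet`.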