@@ -44,6 +44,8 @@ All other arguments are Slurm directives dictating resource requests. The defaul
- `time`: `24:00:00`
- `mem-per-cpu`: `8G`
This script was written for the default `python3` interpreter on Cheaha (version `3.6.8`), so no environment needs to be activated. Running it in an environment with a newer Python version may cause unexpected errors or behavior.
### Run Any Policies (admins)
Any defined policy file can be run with `submit-pol-job` as follows:
...
@@ -98,23 +100,30 @@ Split files will have the form `${outdir}/list-XXX.gz` where XXX is an increment
### Pre-parse output for Python
Processing of GPFS log output is controlled by the `run-convert-to-parquet.sh` script, which assumes the GPFS log has been split into files of the form `list-XXX.gz`, where `XXX` is an incrementing numeric index. The script creates an array job in which each task reads the quoted text in one file, parses it into a dataframe, and exports it as a parquet file named `list-XXX.parquet`.
While each file is being parsed, the top-level directory (`tld`) is extracted for every entry and added as a separate column to make common aggregations easier.
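The `tld` extraction and the output naming can be illustrated with a minimal sketch. It is not the code used by `run-convert-to-parquet.sh`: the helper names `extract_tld` and `parquet_name`, the idea of passing an explicit root directory, and the example paths are all assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional


def extract_tld(path: str, root: str) -> Optional[str]:
    """Return the first path component below `root`, or None if there is none.

    Hypothetical helper: the real parser may define the tld differently.
    """
    try:
        relative = PurePosixPath(path).relative_to(root)
    except ValueError:
        # The path does not live under the given root.
        return None
    # relative.parts is empty when path == root.
    return relative.parts[0] if relative.parts else None


def parquet_name(gz_name: str) -> str:
    """Map a split log name such as list-007.gz to list-007.parquet."""
    return PurePosixPath(gz_name).with_suffix(".parquet").name


# Example with made-up paths:
print(extract_tld("/data/project/lab1/a/b.txt", "/data/project"))
print(parquet_name("list-007.gz"))
```

Storing the value as its own column means a per-directory aggregation becomes a simple group-by on `tld`, with no need to re-parse every path.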
This script is written to parse the `list-path-external` policy format with quoted special characters.
- `outdir`: Path to save parquet outputs. Defaults to `${gpfs_logdir}/parquet`
- `gpfs_logdir`: Directory path containing the split log files as `*.gz`
All other options control the array job resources. Default values are as follows:
- `ntasks`: 1
- `mem`: `16G`
- `time`: `02:00:00`
- `partition`: `amd-hdr100`
With the default resources, a file of 5 million lines is parsed in approximately 3 minutes, so the defaults should cover all common use cases.
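To get a feel for how much headroom the default time limit leaves, the figures above can be turned into a rough estimate. This is only a back-of-the-envelope sketch: it assumes parse time scales linearly with line count, which the source does not state, and `parse_hms` is a hypothetical helper.

```python
def parse_hms(hms: str) -> int:
    """Convert an HH:MM:SS string into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures taken from the text: 5 million lines in roughly 3 minutes.
BENCH_LINES = 5_000_000
BENCH_SECONDS = 3 * 60

time_limit = parse_hms("02:00:00")  # default `time` value

# How many times longer than the benchmark a task may run.
headroom = time_limit / BENCH_SECONDS

# Largest file one task could parse within the limit,
# ASSUMING linear scaling with line count.
max_lines = BENCH_LINES * time_limit // BENCH_SECONDS

print(f"headroom: {headroom:.0f}x, approx. max lines per task: {max_lines:,}")
```

Under that assumption a split file would need to be far larger than 5 million lines before the default time limit became a concern; memory use is a separate question that this estimate does not address.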
## Running reports
...
@@ -124,7 +133,6 @@ A useful report is the top level directory (tld) report. This is akin to runnin
### Comparing directory similarity
## Scheduling regular policy runs via cron
The policy run can be scheduled automatically with the cronwrapper script.