Commit 51b788ba authored by Matthew K Defenderfer

Fixed up documentation on converting logs to parquet

parent c149847a
Merge request !15: Update main to v0.1.1
All other arguments are Slurm directives dictating resource requests. The default values are as follows:
- `time`: `24:00:00`
- `mem-per-cpu`: `8G`
This script was written using the default `python3` interpreter on Cheaha (version `3.6.8`), so no environment needs to be active. Running this script in an environment with a newer Python version may cause unintended errors and effects.
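A small guard can make that interpreter assumption explicit. This helper is purely illustrative and not part of the repository's scripts; it simply compares the running interpreter's major/minor version against the expected pair:

```python
import sys

# Hypothetical guard: the scripts target the default Cheaha python3 (3.6.8),
# so check whether the running interpreter matches that minor version.
def matches_expected(version=sys.version_info, expected=(3, 6)):
    """Return True when (major, minor) of `version` equals `expected`."""
    return tuple(version[:2]) == expected
```

Running under any other Python, the function returns `False`, which signals that behavior may differ from the documented environment.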
### Run Any Policies (admins)
Any defined policy file can be run using `submit-pol-job` as follows:
Split files will have the form `${outdir}/list-XXX.gz` where `XXX` is an incrementing numeric index.
### Pre-parse output for Python
Processing GPFS log outputs is controlled by the `run-convert-to-parquet.sh` script and assumes the GPFS log has been split into a number of files of the form `list-XXX.gz` where `XXX` is an incrementing numeric index. This creates an array job where each task in the array reads the quoted text in one file, parses it into a dataframe, and exports it as a parquet file with the name `list-XXX.parquet`.
While the file is being parsed, the top-level directory (`tld`) is extracted for each entry and added as a separate column to make common aggregations easier.
This script is written to parse the `list-path-external` policy format with quoted special characters.
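The task-to-file mapping in the array job can be sketched as follows. The three-digit zero-padded suffix (as produced by `split -d -a 3`) and the variable names here are assumptions for illustration, not taken from `run-convert-to-parquet.sh` itself:

```shell
# Hypothetical sketch: each array task derives its input and output file
# names from its Slurm task ID; the real script's naming may differ.
gpfs_logdir="${gpfs_logdir:-/tmp/gpfs-logs}"            # assumed variable name
idx=$(printf '%03d' "${SLURM_ARRAY_TASK_ID:-0}")        # assumed zero-padding
infile="${gpfs_logdir}/list-${idx}.gz"
outfile="${gpfs_logdir}/parquet/list-${idx}.parquet"
echo "${infile} -> ${outfile}"
```

Outside a Slurm job, `SLURM_ARRAY_TASK_ID` is unset and the sketch falls back to task `0`.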
```bash
./run-convert-to-parquet.sh [ -h ] [ -o | --outdir ]
                            [ -n | --ntasks ] [ -p | --partition ]
                            [ -t | --time ] [ -m | --mem ]
                            gpfs_logdir
```
- `outdir`: Path to save parquet outputs. Defaults to `${gpfs_logdir}/parquet`
- `gpfs_logdir`: Directory path containing the split log files as `*.gz`
All other options control the array job resources. Default values are as follows:
- `ntasks`: 1
- `mem`: `16G`
- `time`: `02:00:00`
- `partition`: `amd-hdr100`
The default resources can parse files of 5 million lines in approximately 3 minutes, so they should cover all common use cases.
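In outline, the parse step each array task performs looks something like the sketch below. The ` -- ` separator, the URL-style quoting, and the `tld` pattern are assumptions about the `list-path-external` output (the actual fields depend on the policy's SHOW clause), and the real script builds a dataframe and writes parquet rather than returning dicts:

```python
import gzip
import re
from urllib.parse import unquote

# Assumed line shape: "<attrs> -- <url-quoted path>".
# Assumed tld position: third path component, e.g. /data/project/<tld>/...
TLD_RE = re.compile(r"^/[^/]+/[^/]+/([^/]+)")

def parse_line(line):
    """Split one policy-output line into its unquoted path and tld."""
    _attrs, sep, quoted = line.rstrip("\n").partition(" -- ")
    if not sep:
        return None  # malformed line; skip it
    path = unquote(quoted)
    match = TLD_RE.match(path)
    return {"path": path, "tld": match.group(1) if match else None}

def parse_log(fname):
    """Parse every line of one split file, e.g. list-000.gz."""
    with gzip.open(fname, "rt") as fh:
        return [rec for rec in (parse_line(line) for line in fh) if rec]
```

From there, loading the records into a dataframe and calling its parquet writer yields the `list-XXX.parquet` outputs described above.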
## Running reports
A useful report is the top level directory (tld) report. This is akin to runnin
### Comparing directory similarity
## Scheduling regular policy runs via cron
The policy run can be scheduled automatically with the cronwrapper script.
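For example, a crontab entry might look like the following. Every path and the schedule here are placeholders, not taken from this repository:

```
# Hypothetical: run the policy weekly (Sunday 02:00) via the cronwrapper
0 2 * * 0 /path/to/repo/cronwrapper /path/to/repo/submit-pol-job
```

The cronwrapper handles the non-interactive environment, so the scheduled command is the same one an admin would run by hand.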