Automate conversion of GPFS policy outputs to parquet without Jupyter
Created a set of scripts to parse our standard GPFS policy outputs and save them as a parquet dataset without needing a Jupyter notebook. Iterated off of `parquet-list-policy-data.ipynb`.
Changes
- Simplified parsing algorithm
- Automatically extracts the "top-level directory" (`tld`). Can parse paths under `/data/user`, `/data/user/home`, `/data/project`, and `/scratch`.
- Sets the `tld` as the index within each parquet file for faster aggregation later
- Variable output directory: defaults to `log_dir/parquet` but can be specified elsewhere
- Environment is controlled through a Singularity container (defaults to `daskdev/dask:2024.8.0-py3.12`) but is variable. If the container is not specified, the default is automatically downloaded and used.
- Parallelization: each part of a policy output is processed individually in an array task. Processing log parts of 5 million lines apiece takes ~3 minutes.
- Controlled via the command-line script `run-convert-to-parquet.sh`
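The `tld` extraction above can be sketched roughly as follows. This is a minimal illustration, not the scripts' actual implementation: it assumes the `tld` is the first path component after one of the four supported prefixes, and the helper name `extract_tld` is hypothetical.

```python
import re

# Supported prefixes from the changelog; /data/user/home must come before
# /data/user in the alternation so the longer prefix matches first.
TLD_PATTERN = re.compile(
    r"^/(?:data/user/home|data/user|data/project|scratch)/([^/]+)"
)

def extract_tld(path: str) -> str | None:
    """Return the top-level directory (e.g. a user or project name),
    or None if the path is not under a supported prefix."""
    match = TLD_PATTERN.match(path)
    return match.group(1) if match else None
```

With `tld` extracted per record, setting it as each parquet file's index is what makes later per-user/per-project aggregation cheap.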
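The array-task parallelization might look like the sketch below: each task maps its array id to one part of the split policy output and converts only that part. The part-file naming scheme, the `select_part` helper, and the `convert-to-parquet.py` script name are all illustrative assumptions, not the repository's actual layout.

```shell
# Hypothetical per-task dispatch for the array job.
select_part() {
  # Map an array task id (0, 1, 2, ...) to a zero-padded part filename.
  printf 'list-policy_part%03d' "$1"
}

# Inside the job script, something along the lines of:
#   PART="$(select_part "$SLURM_ARRAY_TASK_ID")"
#   singularity exec "$CONTAINER" python convert-to-parquet.py "$PART"
```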
Edited by Matthew K Defenderfer