
Automate conversion of GPFS policy outputs to parquet without Jupyter

Matthew K Defenderfer requested to merge convert-to-parquet into main

Created a set of scripts that parse our standard GPFS policy outputs and save them as a parquet dataset without needing a Jupyter notebook. Based on parquet-list-policy-data.ipynb.
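For context, the core conversion each part goes through looks roughly like the sketch below. This is illustrative only, not the code in this MR: the record layout after the leading inode field depends on the policy's SHOW clause, so the `'<inode> <gen> <snap> -- <attrs> -- <path>'` layout, the function names, and the column names are all assumptions.

```python
import pandas as pd

def parse_line(line: str) -> dict:
    """Parse one list-policy record.

    Assumes a layout like '<inode> <gen> <snap> -- <attrs> -- <path>';
    the real field order depends on the policy's SHOW clause.
    """
    meta, _, path = line.rstrip("\n").rpartition(" -- ")
    inode = meta.split()[0]
    return {"inode": int(inode), "path": path}

def convert(part_file: str, out_file: str) -> None:
    """Convert one text part of a policy output into one parquet file."""
    with open(part_file) as fh:
        records = [parse_line(ln) for ln in fh if ln.strip()]
    pd.DataFrame.from_records(records).to_parquet(out_file)
```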

Changes

  • Simplified parsing algorithm
  • Automatically extracts the "top-level directory" (tld) from each file path. Can parse paths under /data/user, /data/user/home, /data/project, and /scratch (see the first sketch after this list)
    • Sets the tld as the index within each parquet file for faster aggregation later
  • Configurable output directory: defaults to log_dir/parquet but can be set to any path
  • Environment is provided by a Singularity container (defaults to daskdev/dask:2024.8.0-py3.12) but can be overridden
    • If no container is specified, the default image is automatically downloaded and used
  • Parallelization: each part of a policy output is processed independently in its own array task. Processing parts of 5 million lines apiece takes ~3 minutes per part (see the second sketch after this list)
  • Controlled from the command line via run-convert-to-parquet.sh
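
A hedged sketch of the tld extraction described above. The regex and names are illustrative, not copied from the MR; the key detail is trying the longer /data/user/home prefix before /data/user so home paths resolve correctly.

```python
import re
import pandas as pd

# Longest prefix first so /data/user/home wins over /data/user.
TLD_RE = re.compile(r"^/(?:data/user/home|data/user|data/project|scratch)/([^/]+)")

def extract_tld(path: str) -> str | None:
    """Return the directory immediately under a known filesystem root."""
    m = TLD_RE.match(path)
    return m.group(1) if m else None

# Example: set tld as the index before writing each parquet file.
df = pd.DataFrame({"path": ["/data/user/home/jdoe/file.txt",
                            "/scratch/asmith/run1/out.log"]})
df["tld"] = df["path"].map(extract_tld)
df = df.set_index("tld")  # faster per-tld aggregation downstream
```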
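And a sketch of how one array task might select its part file, reusing the hypothetical convert() helper from the first sketch. SLURM_ARRAY_TASK_ID is the standard Slurm variable, but the LOG_DIR variable and the part-file naming scheme are assumptions, not this MR's actual conventions.

```python
import os
from pathlib import Path

# Each Slurm array task converts exactly one part of the policy output.
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
log_dir = Path(os.environ.get("LOG_DIR", "."))  # hypothetical env var

# Hypothetical naming: parts are pre-split files like policy.list.files.001
parts = sorted(log_dir.glob("*.files.*"))
part = parts[task_id]

out_dir = log_dir / "parquet"  # default output location from this MR
out_dir.mkdir(exist_ok=True)
convert(str(part), str(out_dir / f"part-{task_id:03d}.parquet"))
```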