Build out a post-processing pipeline to generate a parquet dataset from a policy run.
Opening an issue so we can collaborate on implementing a full pipeline to generate parquet files from the policy run output.
Start a policy run with the submit script, saving output to a designated output directory, for the specific GPFS path whose files should be listed.
PROJECTDIR=/data/rc/gpfs-policy
Run the policy on the target dir. Set OUTDIR to where the output listing should be saved and TARGETDIR to the path for which to create the listing.
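For example (these are placeholder paths, not values from an actual run):
OUTDIR=/scratch/policy-output
TARGETDIR=/data/user/example-lab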
./submit-pol-job $OUTDIR /data/rc/gpfs-policy/policy/list-path-external-defer 4 24 4G amd-hdr100 $TARGETDIR 180
Current manual steps after the policy run completes.
cd $OUTDIR
Example: simple confirmation that the policy run output file was generated.
[root@cheaha-master01 data]# ls -lhr
total 450G
-rw------- 1 root root 231G Jul 23 13:27 list-28643138.list.gather-info
-rw------- 1 root root 219G May 7 17:47 list-27778270.list.gather-info
The listing files can be large but compress well, so we split them to make it easier to work in parallel with an array job. The split script places the split files (named list-000, list-001, and so on) in a directory named after the listing file with a .d extension, alongside the original output file.
cd $PROJECTDIR
./split-info-file $OUTDIR/list-28643138.list.gather-info
Example: count the split files for array sizing.
[root@cheaha-master01 data]# ls -ltr list-28643138.list.gather-info.d/ | wc -l
112
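If useful, the array range for the compression step below can be derived from this count instead of being read off by hand (a sketch, assuming the .d directory contains only the split files):
NSPLITS=$(ls list-28643138.list.gather-info.d/ | wc -l)   # plain ls, so no 'total' line in the count
echo "array range: 0-$((NSPLITS-1))"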
Compress the split files in an array job.
cd list-28643138.list.gather-info.d/
sbatch --array=0-111 --mem-per-cpu=4G -p amd-hdr100 --wrap='set -x ; fname=`printf "list-%.3d" $SLURM_ARRAY_TASK_ID`; echo $fname; gzip $fname'
Run a check to make sure the compression array tasks didn't have any errors.
sacct -j <ARRAYJOBID> -s F
Example: after compression it's only 17G.
[root@cheaha-master01 data]# du -sh list-28643138.list.gather-info*
231G list-28643138.list.gather-info
17G list-28643138.list.gather-info.d
Change the ownership to user jpr, group atlab.
chown -R jpr.atlab list-28643138.list.gather-info.d/
Move the directory with compressed data to the rc data dir.
srun mv list-28643138.list.gather-info.d/ /data/rc/gpfs-policy/data/
Remove the raw source file.
rm list-28643138.list.gather-info
Create a friendly name for the dataset: in the data dir, add a symlink to the .d directory that encodes the run info in the symlink name.
cd /data/rc/gpfs-policy/data
ln -s list-28643138.list.gather-info.d/ list-policy_data-user_2024-07-23
Create the parquet files. This currently uses a conda env with parquet and papermill installed; it would be better to switch to a dedicated venv.
module load Anaconda3/2021.11
conda activate mpd3
Submit the batch job. The Jupyter notebooks depend on a kernel that has the correct modules; this should be cleaned up. Parameters are specified via env vars set on job invocation. The notebook does the parquet selection. The wrap-list-pickle.sh script is a generic wrapper for the notebooks.
export NBKERNEL=mpd3
dirname=data/list-policy_data-user_2024-07-23 NB=parquet-list-policy-data sbatch --array=0-111 --mem-per-cpu=16G --partition=amd-hdr100 wrap-list-pickle.sh
This results in a parquet/ subdir inside the .d dir (the one holding the gzipped listing files), with one parquet file for each gzip file. The parquet files can be loaded efficiently into pandas dataframes (see the example below).
Note that the gzip list files are the raw output from the policy run, just split and compressed.
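As a quick sanity check, one of the parquet files can be read back into a dataframe. A minimal sketch, run from /data/rc/gpfs-policy, assuming the active environment provides pandas with a parquet engine (e.g. pyarrow) and that the parquet files follow the list-NNN naming of the splits (illustrative, not verified):
python -c 'import pandas as pd; df = pd.read_parquet("data/list-policy_data-user_2024-07-23/parquet/list-000.parquet"); print(df.shape); print(df.head())'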