Build out a post-processing pipeline to generate a parquet dataset from a policy run.
Opening an issue so we can collaborate on implementing a full pipeline to generate parquet files from the policy run output.
Start a policy run with the submit script, saving output to a designated output directory, for the specific GPFS path whose files should be listed.
PROJECTDIR=/data/rc/gpfs-policy
Run the policy on the target dir. Set OUTDIR to where the output listing should be saved and TARGETDIR to the path for which to create the listing.
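For example (these are placeholder paths, not values from an actual run):
OUTDIR=/scratch/policy-output
TARGETDIR=/data/user/example-lab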
./submit-pol-job $OUTDIR /data/rc/gpfs-policy/policy/list-path-external-defer 4 24 4G amd-hdr100 $TARGETDIR 180
Current manual steps after the policy run completes.
cd $OUTDIR
Example: simple confirmation that the policy run output file was generated.
[root@cheaha-master01 data]# ls -lhr
total 450G
-rw------- 1 root root 231G Jul 23 13:27 list-28643138.list.gather-info
-rw------- 1 root root 219G May 7 17:47 list-27778270.list.gather-info
The listing files can be large but compress well, so we split them to make it easier to work in parallel with an array job. The split script places the split files (named list-000, list-001, and so on) in a directory named after the listing file with a .d extension, alongside the original output file.
cd $PROJECTDIR
./split-info-file $OUTDIR/list-28643138.list.gather-info
Example: count the split files for array sizing.
[root@cheaha-master01 data]# ls -ltr list-28643138.list.gather-info.d/ | wc -l
112
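If useful, the array range for the compression step below can be derived from this count instead of being read off by hand (a sketch, assuming the .d directory contains only the split files):
NSPLITS=$(ls list-28643138.list.gather-info.d/ | wc -l)   # plain ls, so no 'total' line in the count
echo "array range: 0-$((NSPLITS-1))"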
Compress the split files in an array job.
cd list-28643138.list.gather-info.d/
sbatch --array=0-111 --mem-per-cpu=4G -p amd-hdr100 --wrap='set -x ; fname=`printf "list-%.3d" $SLURM_ARRAY_TASK_ID`; echo $fname; gzip $fname'
Run a check to make sure the compression array tasks didn't have any errors.
sacct -j <ARRAYJOBID> -s F
Example: after compression it's only 17G.
[root@cheaha-master01 data]# du -sh list-28643138.list.gather-info*
231G list-28643138.list.gather-info
17G list-28643138.list.gather-info.d
Change the ownership to user jpr, group atlab.
chown -R jpr.atlab list-28643138.list.gather-info.d/
Move the directory with compressed data to the rc data dir.
srun mv list-28643138.list.gather-info.d/ /data/rc/gpfs-policy/data/
Remove the raw source file.
rm list-28643138.list.gather-info
Create a friendly name for the dataset: in the data dir, add a symlink to the .d directory that encodes the run info in the symlink name.
cd /data/rc/gpfs-policy/data
ln -s list-28643138.list.gather-info.d/ list-policy_data-user_2024-07-23
Create the parquet files. This currently uses a conda env with parquet and papermill installed; it would be better to switch to a dedicated venv.
module load Anaconda3/2021.11
conda activate mpd3
Submit the batch job. The Jupyter notebooks depend on a kernel that has the correct modules; this should be cleaned up. Parameters are specified via env vars set on job invocation. The notebook does the parquet selection. The wrap-list-pickle.sh script is a generic wrapper for the notebooks.
export NBKERNEL=mpd3
dirname=data/list-policy_data-user_2024-07-23 NB=parquet-list-policy-data sbatch --array=0-111 --mem-per-cpu=16G --partition=amd-hdr100 wrap-list-pickle.sh
This results in a parquet/ subdir inside the .d dir (the one holding the gzipped listing files), with one parquet file for each gzip file. The parquet files can be loaded efficiently into pandas dataframes (see the example below).
Note that the gzip list files are the raw output from the policy run, just split and compressed.
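As a quick sanity check, one of the parquet files can be read back into a dataframe. A minimal sketch, run from /data/rc/gpfs-policy, assuming the active environment provides pandas with a parquet engine (e.g. pyarrow) and that the parquet files follow the list-NNN naming of the splits (illustrative, not verified):
python -c 'import pandas as pd; df = pd.read_parquet("data/list-policy_data-user_2024-07-23/parquet/list-000.parquet"); print(df.shape); print(df.head())'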