From 526b4b8df56d4dad55ce12b339263f78f36374a6 Mon Sep 17 00:00:00 2001
From: Matthew K Defenderfer <mdefende@uab.edu>
Date: Mon, 16 Sep 2024 17:57:56 -0500
Subject: [PATCH] couple more changes to tld

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 59783c7..1fb6044 100644
--- a/README.md
+++ b/README.md
@@ -102,8 +102,6 @@ Split files will have the form `${outdir}/list-XXX.gz` where XXX is an increment
 
 Processing GPFS log outputs is controlled by the `run-convert-to-parquet.sh` script and assumes the GPFS log has been split into a number of files of the form `list-XXX.gz` where `XXX` is an incrementing numeric index. This creates an array job where each task in the array reads the quoted text in one file, parses it into a dataframe, and exports it as a parquet file with the name `list-XXX.parquet`.
 
-While the file is being parsed, the top-level-directory (`tld`) is extracted for each entry and added as a separate column to make common aggregations easier.
-
 This script is written to parse the `list-path-external` policy format with quoted special characters.
 
 ```bash
@@ -125,6 +123,8 @@ All other options control the array job resources. Default values are as follows
 
 The default resources can parse 5 million line files in approximately 3 minutes so should cover all common use cases.
 
+#### tld
+
 For all policies run on filesets in `/data/user`, `/data/project`, `/home`, or `/scratch` will automatically have their "top-level directory" (`tld`) computed and added to the parquet output. This is defined as the directory just under any of those specified filesets. For example, a file with path `/data/project/datascienceteam/example.txt` will have `tld` set to `datascienceteam`. Any files in a directory outside those specified filesets will have `tld` set to `None`.
 
-- 
GitLab
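
Reviewer note: the `tld` rule described in the new README section can be illustrated with a minimal sketch. This is not the code the repository uses; the function name `extract_tld` and the `FILESETS` constant are hypothetical, and the sketch assumes that the path component directly under a fileset counts as the `tld` regardless of whether it is a directory or a file.

```python
from pathlib import PurePosixPath
from typing import Optional

# Filesets listed in the README text (hypothetical constant name).
FILESETS = ("/data/user", "/data/project", "/home", "/scratch")


def extract_tld(path: str) -> Optional[str]:
    """Return the path component directly under a known fileset, else None."""
    parts = PurePosixPath(path).parts
    for fileset in FILESETS:
        root = PurePosixPath(fileset).parts
        # Compare whole components so "/data/projectx" does not match "/data/project".
        if parts[:len(root)] == root and len(parts) > len(root):
            return parts[len(root)]
    # Path is outside every listed fileset, or is the fileset root itself.
    return None


print(extract_tld("/data/project/datascienceteam/example.txt"))
```

Matching on whole path components rather than string prefixes is a design choice of this sketch; the actual parser may handle edge cases such as files sitting directly in `/scratch` differently.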