Commit 526b4b8d authored by Matthew K Defenderfer

couple more changes to tld

parent 5cc47f5f
1 merge request: !15 Update main to v0.1.1
@@ -102,8 +102,6 @@ Split files will have the form `${outdir}/list-XXX.gz` where XXX is an increment
Processing GPFS log outputs is controlled by the `run-convert-to-parquet.sh` script and assumes the GPFS log has been split into a number of files of the form `list-XXX.gz` where `XXX` is an incrementing numeric index. This creates an array job where each task in the array reads the quoted text in one file, parses it into a dataframe, and exports it as a parquet file with the name `list-XXX.parquet`.
While the file is being parsed, the top-level-directory (`tld`) is extracted for each entry and added as a separate column to make common aggregations easier.
This script is written to parse the `list-path-external` policy format with quoted special characters.
```bash ```bash
@@ -125,6 +123,8 @@ All other options control the array job resources. Default values are as follows
The default resources can parse 5-million-line files in approximately 3 minutes, so they should cover all common use cases.
#### tld
Entries from policies run on filesets in `/data/user`, `/data/project`, `/home`, or `/scratch` automatically have their "top-level directory" (`tld`) computed and added to the parquet output. The `tld` is defined as the directory directly beneath one of those filesets. For example, a file with path `/data/project/datascienceteam/example.txt` will have `tld` set to `datascienceteam`.
Any files in a directory outside those specified filesets will have `tld` set to `None`.
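
The `tld` rule described above can be illustrated with a minimal Python sketch. This is not the repository's actual implementation: the function name `extract_tld` and the constant `FILESETS` are hypothetical, and the sketch assumes that whatever path component sits directly beneath a fileset root counts as the `tld`.

```python
from pathlib import PurePosixPath
from typing import Optional

# Fileset roots named in the text (hypothetical constant for this sketch).
FILESETS = ("/data/user", "/data/project", "/home", "/scratch")


def extract_tld(path: str) -> Optional[str]:
    """Return the path component directly beneath a known fileset, else None.

    Assumption: the first component below the fileset root is treated as the
    tld, whether or not it is actually a directory on disk.
    """
    parts = PurePosixPath(path).parts
    for fileset in FILESETS:
        root = PurePosixPath(fileset).parts
        # The path must start with the fileset root and have at least one
        # more component below it. Comparing whole components avoids
        # matching lookalike prefixes such as /data/projectx.
        if len(parts) > len(root) and parts[: len(root)] == root:
            return parts[len(root)]
    return None


print(extract_tld("/data/project/datascienceteam/example.txt"))
print(extract_tld("/tmp/example.txt"))
```

The first call prints `datascienceteam`, matching the example in the text, and the second prints `None` because `/tmp` is outside the listed filesets.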