From 526b4b8df56d4dad55ce12b339263f78f36374a6 Mon Sep 17 00:00:00 2001
From: Matthew K Defenderfer <mdefende@uab.edu>
Date: Mon, 16 Sep 2024 17:57:56 -0500
Subject: [PATCH] README: move tld description under its own heading

---
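
Note for reviewers: the `tld` rule described in the README text of this patch can be illustrated with a minimal sketch. This is not the parser's actual code. The names `FILESETS` and `extract_tld` are hypothetical, and the sketch assumes absolute POSIX paths and simply returns the first path component below a matching fileset.

```python
from pathlib import PurePosixPath

# Filesets named in the README text; paths elsewhere get tld = None.
FILESETS = ("/data/user", "/data/project", "/home", "/scratch")


def extract_tld(path):
    """Return the path component just under a known fileset, or None.

    Hypothetical helper for illustration only. It does not check whether
    that component is a directory or a file sitting directly in the fileset.
    """
    parts = PurePosixPath(path).parts
    for fileset in FILESETS:
        root = PurePosixPath(fileset).parts
        # Match whole components so "/homework" is not treated as "/home",
        # and require at least one component below the fileset root.
        if len(parts) > len(root) and parts[:len(root)] == root:
            return parts[len(root)]
    return None


print(extract_tld("/data/project/datascienceteam/example.txt"))
print(extract_tld("/tmp/example.txt"))
```

With the README's own example path, this sketch yields `datascienceteam`; a path outside the four filesets yields `None`.
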
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 59783c7..1fb6044 100644
--- a/README.md
+++ b/README.md
@@ -102,8 +102,6 @@ Split files will have the form `${outdir}/list-XXX.gz` where XXX is an increment
 
 Processing GPFS log outputs is controlled by the `run-convert-to-parquet.sh` script and assumes the GPFS log has been split into a number of files of the form `list-XXX.gz` where `XXX` is an incrementing numeric index. This creates an array job where each task in the array reads the quoted text in one file, parses it into a dataframe, and exports it as a parquet file with the name `list-XXX.parquet`.
 
-While the file is being parsed, the top-level-directory (`tld`) is extracted for each entry and added as a separate column to make common aggregations easier.
-
 This script is written to parse the `list-path-external` policy format with quoted special characters.
 
 ```bash
@@ -125,6 +123,8 @@ All other options control the array job resources. Default values are as follows
 
 The default resources can parse 5 million line files in approximately 3 minutes so should cover all common use cases.
 
+#### tld
+
 For all policies run on filesets in `/data/user`, `/data/project`, `/home`, or `/scratch` will automatically have their "top-level directory" (`tld`) computed and added to the parquet output. This is defined as the directory just under any of those specified filesets. For example, a file with path `/data/project/datascienceteam/example.txt` will have `tld` set to `datascienceteam`.
 
 Any files in a directory outside those specified filesets will have `tld` set to `None`.
-- 
GitLab