Compare revisions
......@@ -7,81 +7,127 @@ The relevant [documentation is available from IBM](https://www.ibm.com/docs/en/s
This project focuses on scheduled execution of lifecycle policies to gather and process data about
file system objects and issue actions against those objects based on policy.
## Applying Policies
At the base level, applying a policy to a fileset is done through the `mmapplypolicy` command. This repo contains wrapper scripts that call that command with a specified policy file on a given fileset; each wrapper offers a different level of functionality meant for a different group of users in RC. All scripts are stored in `src/run-policy`:
- `run-mmpol`: the main script that calls `mmapplypolicy`. Generally not invoked on its own.
- `submit-pol-job`: general wrapper that sets up the Slurm job `run-mmpol` executes in. Admins can execute a policy run from this level using any policy file they have defined.
- `run-submit-pol-job.py`: a Python wrapper for `submit-pol-job` meant specifically for running list policy jobs. This wrapper can be run by specific non-admins who have been given `sudo` permissions on this file alone. It can only run one of two policies: `list-path-external` and `list-path-dirplus`.
The production versions of these scripts are kept in `/data/rc/list-gpfs-dirs`. Admins can run any of these scripts from anywhere, but non-admins are only granted `sudo` privileges on the `run-submit-pol-job.py` file in that directory.
### List Policies (non-admin)
A list policy can be executed via `run-submit-pol-job.py` with the following command:
``` bash
sudo run-submit-pol-job.py [-h] [-o OUTDIR] [-f LOG_PREFIX] [--with-dirs]
                           [-N NODES] [-c CORES] [-p PARTITION] [-t TIME]
                           [-m MEM_PER_CPU]
                           device
```
The arguments are as follows:
- `outdir`: specifies the directory the output log should be saved to. Defaults to `/data/rc/gpfs-policy/data`.
- `log-prefix`: string the name of the policy output begins with. Metadata containing the policy file name, Slurm job ID, and run time is appended to this prefix. Defaults to `list-policy_<device>`. See below for `device`.
  - **Note: this is currently non-functional**
- `--with-dirs`: changes the policy file from `list-path-external` to `list-path-dirplus`. The only difference is that directories are included in the policy output.
- `device`: the fileset or directory to apply the policy to.
All other arguments are Slurm directives dictating resource requests. The default parameters are as follows:

- `nodes`: 1
- `cores`: 16
- `partition`: `amd-hdr100, medium`
- `time`: `24:00:00`
- `mem-per-cpu`: `8G`

The command is aligned to run on specific nodes by way of arguments to `mmapplypolicy`. The command is technically not run inside of the job reservation, so the resource constraints are imperfect. The goal is to use the scheduler to ensure the policy run does not conflict with existing resource allocations on the cluster.
This script was written using the default `python3` interpreter on Cheaha (version `3.6.8`) so no environment needs to be active. Running this script in an environment with a newer Python version may cause unintended errors and effects.
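For example, a hypothetical non-admin invocation listing everything under `/scratch`, including directories, with the default resource requests:

```bash
# Hypothetical example: list all objects, including directories, under /scratch
sudo /data/rc/list-gpfs-dirs/run-submit-pol-job.py --with-dirs /scratch
```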
### Run Any Policies (admins)
Any defined policy file can be run using `submit-pol-job`:
``` bash
sudo ./submit-pol-job [ -h ] [ -o | --outdir ] [ -f | --outfile ] [ -P | --policy ]
                      [ -N | --nodes ] [ -c | --cores ] [ -p | --partition ]
                      [ -t | --time ] [ -m | --mem ]
                      device
```
The only difference here is that a path to the policy file can be specified using `-P` or `--policy`. All other arguments are the same and have the same defaults. Use absolute paths for all directory arguments to avoid potential confusion.
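As a sketch, an admin run applying a specific policy file to the `scratch` fileset might look like the following (paths here are illustrative):

```bash
# Illustrative example: apply a specific policy file to the scratch fileset
sudo ./submit-pol-job -P ./policy-def/list-path-external \
    -o /data/rc/gpfs-policy/data scratch
```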
### Output
The `list-path-external` policy provides an efficient tool to gather file stat data into a URL-encoded ASCII text file. The output file can then be processed by downstream tools to create reports on storage patterns and use. Make sure the output directory has sufficient space to hold the resulting file listing; it can reach hundreds of gigabytes for a large collection of files.
The Slurm job output file will be local to the directory from which the command was executed. It can be watched to observe progress in the generation of the file list. A listing of hundreds of millions of files may take a couple of hours to generate and consume several hundred gigabytes for the output file.
The output file is an unsorted list of files in uncompressed ASCII. Further processing is desirable, both to use less storage space and to provide organized collections of data.
#### List Policy Specific Outputs
The raw output file for list policies in `outdir` will be named `list-<jobid>.list.gather-info`.
The output file contains one line per file object stored under `device`. No directories or non-file objects are included in this listing unless the `list-path-dirplus` policy is used. Each entry is a space-separated set of file attributes selected by the SHOW command in the LIST rule. Entries are encoded according to RFC 3986 URI percent encoding, so all spaces and special characters inside fields are encoded, making it easy to split lines into fields using the space separator.
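Because all separators inside values are percent-encoded, a line can be tokenized on spaces and decoded afterwards. A minimal sketch in Python (the example line is illustrative, not taken from a real run):

```python
from urllib.parse import unquote

# Illustrative percent-encoded policy output line: an attribute, then a path
line = "access=2024-01-01%2012%3A00%3A00 /data/project/datascienceteam/my%20file.txt"

# Spaces inside values are encoded as %20, so splitting on spaces is safe
fields = [unquote(f) for f in line.split(" ")]
print(fields)
# ['access=2024-01-01 12:00:00', '/data/project/datascienceteam/my file.txt']
```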
## Processing the output file
### Split and compress
Policy outputs generated using `list-path-external` or `list-path-dirplus` can be split into multiple smaller log files to facilitate out-of-core computation on very large filesets using tools such as dask. The policy output can be split and compressed using the `src/split-info-file.sh` script. See the following for usage:
```bash
./split-info-file.sh [ -h ] [ -l | --lines ] [ -o | --outdir ]
                     [ -n | --ntasks ] [ -p | --partition ] [ -t | --time ] [ -m | --mem ]
                     log
```
- `lines`: the max number of lines to include in each split file. Defaults to 5000000.
- `outdir`: directory to store the split files in. Defaults to `${log}.d` in the log's parent directory.
- `log`: path to the GPFS policy log. Can be either uncompressed or `gzip`-compressed.
All other options specify job resource parameters. Defaults are as follows:
- `ntasks`: 4
- `partition`: `amd-hdr100`
- `time`: `12:00:00`
- `mem`: `16G`
Split files will have the form `${outdir}/list-XXX.gz`, where `XXX` is an incrementing index. Files are automatically compressed.
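A hypothetical invocation, splitting a gzipped policy log into the default 5-million-line chunks:

```bash
# Hypothetical example: splits land in /scratch/jsmith/list-policy_scratch.gz.d/
# as list-000.gz, list-001.gz, ...
./split-info-file.sh /scratch/jsmith/list-policy_scratch.gz
```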
### Pre-parse output for Python
Processing GPFS log outputs is controlled by the `run-convert-to-parquet.sh` script and assumes the GPFS log has been split into a number of files of the form `list-XXX.gz`, where `XXX` is an incrementing numeric index. This creates an array job where each task in the array reads the quoted text in one file, parses it into a dataframe, and exports it as a parquet file with the name `list-XXX.parquet`. The script is written to parse the `list-path-external` policy format with quoted special characters.

While each file is being parsed, the top-level directory (`tld`) is extracted for each entry and added as a separate column to make common aggregations easier.
```bash
./run-convert-to-parquet.sh [ -h ] [ -o | --outdir ]
                            [ -n | --ntasks ] [ -p | --partition ]
                            [ -t | --time ] [ -m | --mem ]
                            gpfs_logdir
```
- `outdir`: Path to save parquet outputs. Defaults to `${gpfs_logdir}/parquet`
- `gpfs_logdir`: Directory path containing the split log files as `*.gz`
All other options control the array job resources. Default values are as follows:
- `ntasks`: 1
- `mem`: `16G`
- `time`: `02:00:00`
- `partition`: `amd-hdr100`
The default resources can parse 5 million line files in approximately 3 minutes so should cover all common use cases.
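Once converted, the parquet dataset can be read lazily for aggregate reports. A minimal sketch, assuming `dask` is installed and that the output contains `tld` and a file-size column (the column names here are assumptions, not a documented schema):

```python
import dask.dataframe as dd

# Read every per-chunk parquet file as one lazy dataframe
df = dd.read_parquet("/path/to/gpfs_logdir/parquet/list-*.parquet")

# Example aggregation: total bytes per top-level directory
totals = df.groupby("tld")["size"].sum().compute()
print(totals.nlargest(10))
```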
#### tld
All policies run on filesets in `/data/user`, `/data/project`, `/home`, or `/scratch` will automatically have their "top-level directory" (`tld`) computed and added to the parquet output. Here, `tld` is defined as the directory just under any of those filesets. For example, a file with path `/data/project/datascienceteam/example.txt` will have `tld` set to `datascienceteam`.
Any files in a directory outside those specified filesets will have `tld` set to `None`.
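The extraction itself is a single prefix match. A minimal sketch mirroring the regular expression used in `convert-to-parquet.py` (the example paths are hypothetical):

```python
import re

# Fileset roots after which the top-level directory is captured
TLD_RE = re.compile(r'(?:/data/user(?:/home)?/|/data/project/|/scratch/)([^/]+)')

for path in ["/data/project/datascienceteam/example.txt", "/opt/other/file.txt"]:
    m = TLD_RE.match(path)
    print(path, "->", m.group(1) if m else None)
# /data/project/datascienceteam/example.txt -> datascienceteam
# /opt/other/file.txt -> None
```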
## Running reports
......@@ -91,7 +137,6 @@ A useful report is the top level directory (tld) report. This is akin to runnin
### Comparing directory similarity
## Scheduling regular policy runs via cron
The policy run can be scheduled automatically with the cronwrapper script.
......
......@@ -31,7 +31,12 @@ def parse_line(line):
    d = dict([re.match(r'([\w]+)=(.*)', l).groups() for l in details.split('|')])
    # Match the top-level directory; paths outside the known fileset roots
    # yield no match, so tld falls back to None instead of raising
    grp = re.match(r'(?:/data/user(?:/home)?/|/data/project/|/scratch/)([^/]+)', path)
    if grp:
        tld = grp.group(1)
    else:
        tld = None
    d.update({'path': path,
              'tld': tld})
    return d
......
......@@ -11,7 +11,6 @@ mem="16G"
time="02:00:00"
partition="amd-hdr100"
outdir=""
############################################################
# Help #
......@@ -87,15 +86,12 @@ if [[ -z "$outdir" ]]; then
outdir="${gpfs_logdir}/parquet"
fi
nlogs=$(ls ${gpfs_logdir}/list-* | wc -l)
cmd="singularity exec --bind /data,/scratch gpfs.sif python3 convert-to-parquet.py -o ${outdir} -f \${log}"
cmd="python3 convert-to-parquet.py -o ${outdir} -f \${log}"
>&2 cat << EOF
--------------------------------------------------------------------------------
output dir: ${outdir}
GPFS logs: ${gpfs_logdir}
......@@ -127,6 +123,8 @@ mkdir -p err
#SBATCH --error=err/%A_%a.err
#SBATCH --array=1-${nlogs}
source /data/rc/gpfs-policy/venv/bin/activate
log=\$(ls ${gpfs_logdir}/list-* | awk "NR==\${SLURM_ARRAY_TASK_ID} { print \$1 }")
${cmd}
......
......@@ -4,13 +4,97 @@ set -euxo pipefail
# run an mmapply policy across the cluster via slurm
# gather info to map mmapplypolicy to runtime configuration
# arguments passed via job env and runtime context
############################################################
# Default Values #
############################################################
outdir="/data/rc/gpfs-policy/data"
policy_file="./policy-def/list-path-external"
device="scratch"
output_log_prefix=""
############################################################
# Help #
############################################################
usage()
{
>&2 cat << EOF
Usage: $0 [ -h ] [ -o | --outdir ] [ -f | --output-prefix ] [ -P | --policy-file ] device
EOF
exit 1
}
help()
{
>&2 cat << EOF
Runs mmapplypolicy on the specified device/fileset. The policy file dictates the actions performed, including list, delete, add, etc. This is most often called by the submit-pol-job wrapper instead of invoked directly.
Usage: $0 [ -h ] [ -o | --outdir ] [ -f | --output-prefix ] [ -P | --policy-file ] device
options:
  -h|--help           Print this Help.

Required:
  device              GPFS fileset/directory to apply the policy to. Can be
                      specified as either the name of the fileset or the
                      full path to the directory
                      (Examples: scratch, /data/user/[username])

Path:
  -o|--outdir         Parent directory to save policy output to
                      (default: /data/rc/gpfs-policy/data)
  -f|--output-prefix  Prefix of the policy output file. Appended with a
                      metadata string containing the policy name, job ID,
                      and date

Policy Options:
  -P|--policy-file    Path to policy file to apply to the given GPFS device
EOF
exit 0
}
args=$(getopt -a -o ho:f:P: --long help,outdir:,output-prefix:,policy-file: -- "$@")
if [[ $? -gt 0 ]]; then
  usage
fi
eval set -- ${args}
while :
do
  case $1 in
    -h | --help)          help ;;
    -o | --outdir)        outdir=$2            ; shift 2 ;;
    -f | --output-prefix) output_log_prefix=$2 ; shift 2 ;;
    -P | --policy-file)   policy_file=$2       ; shift 2 ;;
    --) shift; break ;;
    *) >&2 echo "Unsupported option: $1"
       usage ;;
  esac
done
if [[ $# -eq 0 ]]; then
  usage
fi
device="$1"
# Ensure device is specified
if [[ -z "$device" ]]; then
  echo "Error: Specify either the name of a fileset or a directory path"
  usage
fi
# create default output_log_prefix if not specified in the arguments
if [[ -z "$output_log_prefix"]]; then
modified_device=$(echo "$device" | sed -e 's|^/||' -e 's|/$||' -e 's|/|-|g')
output_log_prefix="list-policy_${modified_device}"
fi
# create temporary working directory for list aggregation
tmpglobal=$outdir/slurm-tmp-${SLURM_JOBID}
tmpscratch=$outdir/slurm-tmp-${SLURM_JOBID}
mkdir -p $tmpglobal
nodes=`scontrol show hostnames "${SLURM_JOB_NODELIST}" | tr '\n' ',' | sed -e 's/,$//'`
......@@ -18,17 +102,17 @@ cores="${SLURM_CPUS_PER_TASK}"
DATESTR=`date +'%Y-%m-%d-%H:%M:%S'`
policy=`basename $policy_file`
filetag="${policy}_slurm-${SLURM_JOBID}_${DATESTR}"
cmd="mmapplypolicy ${filesystem} -I defer \
-P $policyfile \
cmd="mmapplypolicy ${device} -I defer \
-P $policy_file \
-g $tmpglobal \
-s $tmpscratch \
-f ${DIR}/list-${SLURM_JOBID} \
-M FILEPATH=${filesystem} \
-f ${outdir}/list-${SLURM_JOBID} \
-M FILEPATH=${device} \
-M JOBID=${SLURM_JOBID} \
-M LIST_OUTPUT_FILE=${OUTFILE:-/tmp/gpfs-list-policy}
-M LIST_OUTPUT_FILE=${output_prefix} \
-N ${nodes} -n ${cores} -m ${cores}"
# report final command in job log
......@@ -41,6 +125,6 @@ $cmd
outfile=`ls -t $tmpglobal | head -1`
if [[ "$outfile" != "" ]]
then
  mv -n $tmpglobal/$outfile $tmpglobal/../${output_log_prefix}_$filetag
fi
rmdir $tmpglobal
......@@ -110,8 +110,8 @@ fi
slurm_out="out/pol-%A-$(basename ${policy})-$(basename ${device}).out"
mkdir -p out
run_mmpol_cmd="./run-mmpol.sh -o ${outdir} -f ${outfile} -P ${policy} ${device}"
sbatch \
  -N $nodes \
  -c $cores \
......@@ -119,4 +119,4 @@ sbatch \
  --mem-per-cpu=$mem_per_cpu \
  -p $partition \
  -o ${slurm_out} \
  "${run_mmpol_cmd}"
#!/bin/bash
set -euxo pipefail
############################################################
# Default Values #
......@@ -11,6 +11,7 @@ mem="16G"
time="12:00:00"
partition="amd-hdr100"
lines=5000000
outdir=""
############################################################
# Help #
......@@ -18,9 +19,11 @@ lines=5000000
usage()
{
>&2 cat << EOF
Usage: $0 [ -h ] [ -l | --lines ] [ -o | --outdir ]
          [ -n | --ntasks ] [ -p | --partition ] [ -t | --time ] [ -m | --mem ]
          log
EOF
exit 0
}
help()
......@@ -28,18 +31,22 @@ help()
# Display Help
>&2 cat << EOF
Splits a GPFS policy log into multiple parts for batch array processing
Usage: $0 [ -h ] [ -l | --lines ] [ -o | --outdir ]
          [ -n | --ntasks ] [ -p | --partition ] [ -t | --time ] [ -m | --mem ]
          log
General:
  -h|--help       Print this help.
Required:
  log             Path to the log file to split

Split Parameters:
  -l|--lines      Max number of records to save in each split (default: 5000000)

File Parameters:
  -o|--outdir     Directory path to store split files in. Defaults to log.d in
                  log's parent directory.

Job Parameters:
  -n|--ntasks     Number of job tasks (default: 4)
  -p|--partition  Partition to submit tasks to (default: amd-hdr100)
......@@ -49,7 +56,7 @@ EOF
exit 0
}
args=$(getopt -a -o hl:o:n:p:t:m: --long help,lines:,outdir:,ntasks:,partition:,time:,mem: -- "$@")
if [[ $? -gt 0 ]]; then
  usage
fi
......@@ -60,11 +67,12 @@ while :
do
  case $1 in
    -h | --help)      help ;;
    -l | --lines)     lines=$2     ; shift 2 ;;
    -o | --outdir)    outdir=$2    ; shift 2 ;;
    -n | --ntasks)    ntasks=$2    ; shift 2 ;;
    -p | --partition) partition=$2 ; shift 2 ;;
    -t | --time)      time=$2      ; shift 2 ;;
    -m | --mem)       mem=$2       ; shift 2 ;;
    --) shift; break ;;
    *) >&2 echo "Unsupported option: $1"
       usage ;;
......@@ -76,8 +84,17 @@ if [[ $# -eq 0 ]]; then
fi
log=$1
if [[ -z "${log}" ]]; then
echo "Log path is required"
usage
fi
if [[ -z "${outdir}" ]]; then
outdir="$(readlink -f ${log}).d"
fi
prefix=${outdir}/list-
split_cmd="cat ${log} | split -a 3 -d -l ${lines} - ${prefix}"
zip_cmd="ls ${prefix}* | xargs -i -P 0 bash -c 'gzip {} && echo {} done'"
......@@ -89,6 +106,7 @@ fi
>&2 cat << EOF
--------------------------------------------------------------------------------
GPFS log: ${log}
Output Directory: ${outdir}
Lines per File: ${lines}
ntasks: ${ntasks}
......@@ -101,7 +119,7 @@ zip cmd: ${zip_cmd}
--------------------------------------------------------------------------------
EOF
mkdir -p ${outdir}
mkdir -p out
mkdir -p err
......