Reliable file name parsing
The existing LIST policy rules generate files containing a record of the data gathered during the policy job, and they appear to do the work requested. These files are formatted according to the GPFS policy record format:
iAggregate:WEIGHT:INODE:GENERATION:SIZE:iRule:resourceID:attr_flags:path-length!PATH_NAME:pool-length!POOL_NAME[;show-length>!SHOW]end-of-record-character
This is a somewhat complex format, but we have been able to use it to generate file lists of interest by creating a |-separated collection of fields in the SHOW clause:
SHOW ('|size=' || varchar(FILE_SIZE) ||
'|kballoc='|| varchar(KB_ALLOCATED) ||
'|access=' || varchar(ACCESS_TIME) ||
'|create=' || varchar(CREATION_TIME) ||
'|modify=' || varchar(MODIFICATION_TIME) ||
'|uid=' || varchar(USER_ID) ||
'|gid=' || varchar(GROUP_ID) ||
'|heat=' || varchar(FILE_HEAT) ||
'|pool=' || varchar(POOL_NAME) ||
'|path=' || varchar(PATH_NAME) ||
'|misc=' || varchar(MISC_ATTRIBUTES) ||
'|'
)
WHERE include_list
We've been able to narrow in on the fields of interest and load them into a dataframe in our notebook via read_csv():
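import pandas as pd

# file, fields, usecols and splitters are assumed to be defined earlier in the notebook.
df = pd.read_csv(file,
                 lineterminator='\n',
                 sep="|",
                 header=0,  # with names= set, the first line of the file is discarded
                 #on_bad_lines="warn",
                 index_col=False,
                 #nrows=1000000,
                 names=fields,
                 usecols=usecols,
                 converters=splitters,
                 parse_dates=['atime', 'ctime', 'mtime'],
                 )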
The problem we run into is that GPFS doesn't restrict the characters that can appear in file names. File names in GPFS, at least from the metadata view, are just a string of bytes terminated by a null (\0). This means file names can contain our pipe separator |. In a large file system with many applications following their own conventions, this does happen, and when it does it breaks our name parsing.
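The failure can be reproduced in a few lines. The sketch below is only an illustration: the record layout is a shortened, invented version of our SHOW string, and show_record and naive_fields are hypothetical helper names, not part of our notebook.

```python
# Hypothetical, shortened version of our '|'-separated SHOW layout; values are invented.
def show_record(path):
    return "|size=42|uid=1000|path=" + path + "|misc=FA|"

def naive_fields(record):
    # Split on '|' and drop the empty strings left by the leading and trailing delimiter.
    return [field for field in record.split("|") if field]

good = naive_fields(show_record("/data/project/results.txt"))
bad = naive_fields(show_record("/data/project/a|b.txt"))

print(len(good), good)  # four fields, as intended
print(len(bad), bad)    # five fields: the path was cut at the embedded '|'
```

The second record gains an extra column and its path is truncated, which is the kind of misalignment a delimiter-based reader such as read_csv() then has to cope with.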
GPFS can encode a file name as a URL-encoded (percent-encoded) string via the ESCAPE '%' term. The issue is that the ESCAPE term is only valid on EXTERNAL LIST and EXTERNAL POOL rules:
RULE 'xp' EXTERNAL POOL 'pool-name' EXEC 'script-name' ESCAPE '%'
RULE 'xl' EXTERNAL LIST 'list-name' EXEC 'script-name' ESCAPE '%/+@#'
These rules link a list of file names generated by a LIST rule (the ones we have written) with an external program that is meant to operate on the list, like a script that will move a file to a backup location or do some sort of other action.
We haven't been pairing our LIST rules with an EXTERNAL LIST rule. Whenever we run our LIST rules our job output reports a warning:
[W] Attention: In RULE 'list-path' LIST name 'no_extern_list' appears but there is no corresponding "EXTERNAL LIST 'no_extern_list' EXEC ... OPTS ..." rule to specify a program to process the matching files.
This warning makes sense now. The LIST rule generates a named list of files and the EXTERNAL LIST rule consumes the named list, passing it on to an external program for processing. The warning is saying that we name our list of files "no_extern_list" but then don't do anything to process it, as we should.
As a result, it seems that the resulting collection of candidate files (the ones that matched the conditions of the LIST rule) is just gathered up and dumped in a job output file in the global path. This is the file we have been targeting for our parsing. However, since we aren't following the defined record formatting (which indicates the path length explicitly), we've been relying on a weak delimiter that is now failing.
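To see why an explicit length is more robust than a delimiter, here is a minimal sketch of reading a length-prefixed field. It assumes the length is a decimal byte count followed by '!', which is how we read the path-length!PATH_NAME part of the record format; that interpretation is not verified against real policy output, and the sample bytes and the read_length_prefixed name are invented for illustration.

```python
def read_length_prefixed(buf, pos=0):
    """Read one 'LENGTH!VALUE' field from bytes; return (value, next_position)."""
    bang = buf.index(b"!", pos)      # end of the decimal length
    length = int(buf[pos:bang])      # assumed to be a byte count
    start = bang + 1
    end = start + length
    if end > len(buf):
        raise ValueError("field is shorter than its declared length")
    return buf[start:end], end

# Invented fragment: a 13-byte path containing '|' and ':', then a 6-byte pool name.
sample = b"13!/data/a|b:c.x:6!system"

path, pos = read_length_prefixed(sample)
pool, _ = read_length_prefixed(sample, pos + 1)  # skip the ':' between the fields

print(path, pool)
```

Because the reader takes exactly the declared number of bytes, separator characters inside the name are never inspected and cannot split the field.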
The correct solution appears to be to create a named list via a LIST command and then consume the list in a script via the EXTERNAL LIST command. That script can then capture the data we are generating via the SHOW command. Because we can use the ESCAPE term, we can also URL encode the file name (and all other data) to avoid getting parsing errors on file names. The resulting file list format is also much simpler:
InodeNumber GenNumber SnapId [OptionalShowArgs] -- FullPathToFile
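A record in this format could be read along the following lines. This is a sketch under assumptions, not tested against real policy output: it assumes ESCAPE '%' is in effect, so that spaces and '|' inside the path and the SHOW text arrive percent-encoded and the first ' -- ' is the true separator. The sample record and the parse_list_record name are hypothetical.

```python
from urllib.parse import unquote

def parse_list_record(line):
    """Split one 'Inode Gen SnapId [Show] -- Path' record into decoded fields."""
    head, sep, raw_path = line.rstrip("\n").partition(" -- ")
    if not sep:
        raise ValueError("no ' -- ' separator in record: %r" % line)
    parts = head.split(None, 3)
    if len(parts) < 3:
        raise ValueError("expected inode, generation and snapshot id: %r" % line)
    inode, gen, snap_id = (int(p) for p in parts[:3])
    raw_show = parts[3] if len(parts) > 3 else ""

    # surrogateescape keeps bytes that are not valid UTF-8 recoverable after decoding.
    def decode(text):
        return unquote(text, errors="surrogateescape")

    return {
        "inode": inode,
        "generation": gen,
        "snap_id": snap_id,
        "show": decode(raw_show),
        "path": decode(raw_path),
    }

# Invented record: the path /data/a|b.txt with '/' and '|' percent-encoded.
rec = parse_list_record("12345 67890 0 size%3D42 -- %2Fdata%2Fa%7Cb.txt\n")
print(rec["path"], rec["show"])
```

Note that if the SHOW string itself is percent-encoded, our '|' field separators inside it would be encoded too, so the SHOW layout may need a separator from the unreserved set or from the characters listed after '%' in the ESCAPE term.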