Improve hive conversion functionality
A few changes are introduced to improve performance and efficiency of `hivize`.
Skipping TLDs with existing data
A `--no-clobber` flag was added to the CLI command to avoid submitting hundreds of tasks we already know will fail. Instead, the `tld` list is generated either from the passed parameter or from the unique `tld` values in the dataset. Any tld which already has parquet files in the hive directory for the given acquisition date is removed from the list before further processing, so only tlds which need processing are submitted as tasks in the first place. The `no_clobber` option in `hivize` itself was left as-is in case it is run from an interactive session.
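For illustration, a minimal sketch of that filtering step is below. The names (`hive_dir`, `acq`) and the `tld=<tld>/acq=<acq>` partition layout are assumptions, not the actual implementation:

```python
from pathlib import Path

def filter_clobbered_tlds(tlds, hive_dir, acq):
    """Drop tlds that already have parquet output for the given acquisition date.

    Assumes a hive layout like <hive_dir>/tld=<tld>/acq=<acq>/*.parquet;
    the real partition scheme may differ.
    """
    remaining = []
    for tld in tlds:
        part = Path(hive_dir) / f"tld={tld}" / f"acq={acq}"
        if any(part.glob("*.parquet")):
            continue  # output already exists, skip this tld
        remaining.append(tld)
    return remaining
```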
Grouped memory requests
Prior versions submitted each `tld` requesting the same amount of RAM, which was much higher than necessary based on outliers in the dataset. This decreased throughput and efficiency by causing the QoS limit to be reached while running fewer total tasks at once. The basic `--mem` option has been removed and replaced with the `--mem-factor` parameter. Each `tld` is assigned an estimated size based on the total size reported by the input parquet headers and the percentage of the log files that the tld accounts for. This value generally underestimates the actual memory needs, so `--mem-factor` was added to compensate.
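As a rough illustration only (the metadata fields and helper names here are assumptions, not the actual implementation), the per-tld estimate could be derived from the parquet headers like this:

```python
import pyarrow.parquet as pq

def estimate_tld_sizes(parquet_paths, tld_counts):
    """Hypothetical sketch: split the total uncompressed size reported by the
    parquet headers across tlds in proportion to their share of the log files.

    tld_counts maps tld -> number of log entries attributed to that tld.
    """
    total_bytes = 0
    for path in parquet_paths:
        meta = pq.ParquetFile(path).metadata
        # Sum uncompressed sizes from the row-group headers
        for rg in range(meta.num_row_groups):
            total_bytes += meta.row_group(rg).total_byte_size
    total_rows = sum(tld_counts.values())
    return {tld: total_bytes * (n / total_rows) for tld, n in tld_counts.items()}
```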
The estimated memory value for each TLD is multiplied by the `--mem-factor` value (default 3), and the resulting values are cut into groups. Each task in a group is assigned the same requested memory value, and the group is submitted as an array job. The memory groups are 8, 16, 32, 64, and 128 GB; anything above 64 GB is set to 128 GB regardless of the actual estimated dataset size. Increasing the memory factor does not change the breakpoint values for the groups; it just moves some TLDs from a smaller memory group to a larger one. This decreases throughput, but you could run the conversion twice with a higher memory factor and `--no-clobber` active the second time. Most of the tasks would succeed in the initial run, leaving only a few, if any, tasks that need to be run in a larger memory group.
A parquet file with the size group and estimated size for each `tld` is saved in the same directory as the tld group files that each array job reads from.
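A minimal sketch of that grouping step, assuming pandas and purely illustrative column and file names:

```python
import pandas as pd

# Breakpoints for the memory groups, in GB; estimates above 64 GB all land in 128 GB
MEM_BINS = [0, 8, 16, 32, 64, float("inf")]
MEM_GROUPS = [8, 16, 32, 64, 128]

def assign_mem_groups(est_sizes_gb, mem_factor=3):
    """est_sizes_gb: mapping of tld -> estimated size in GB."""
    df = pd.DataFrame(
        {"tld": list(est_sizes_gb), "est_gb": list(est_sizes_gb.values())}
    )
    scaled = df["est_gb"] * mem_factor
    df["mem_gb"] = pd.cut(
        scaled, bins=MEM_BINS, labels=MEM_GROUPS, include_lowest=True
    ).astype(int)
    return df

# Each mem_gb group would then get its own tld list submitted as one array job,
# and the full frame saved alongside those group files, e.g.:
# assign_mem_groups(est).to_parquet("tld_size_groups.parquet")
```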
Misc
- Added a `--dry-run` option to `convert-to-hive` that does everything except submit the batch jobs. Tld group files as well as the parquet file detailing the size groups will be generated as normal.
- Reduced the number of default CPUs for hive batch jobs.
- Added a `cli_args` parameter to `convert_to_hive` and `parse_args` for testing in interactive contexts. These parameters default to `None`, which causes the CLI arguments to be parsed (see the sketch below).
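For context, this follows the common argparse pattern sketched here; the argument set shown is illustrative, not the full CLI:

```python
import argparse

def parse_args(cli_args=None):
    """When cli_args is None, argparse falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(prog="convert-to-hive")
    # Only a few of the flags mentioned above are shown here
    parser.add_argument("--no-clobber", action="store_true")
    parser.add_argument("--mem-factor", type=int, default=3)
    parser.add_argument("--dry-run", action="store_true")
    return parser.parse_args(cli_args)

def convert_to_hive(cli_args=None):
    args = parse_args(cli_args)
    ...

# In a test or interactive session the CLI can be exercised without sys.argv:
# convert_to_hive(["--no-clobber", "--mem-factor", "4"])
```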