Improve hive conversion functionality

This merge request introduces several changes to improve the performance and efficiency of hivize.

Skipping TLDs with existing data

A --no-clobber option was added to the CLI command to avoid submitting hundreds of tasks we already know will fail. The TLD list is generated either from the passed parameter or from the unique tld values in the dataset, and any TLD that already has parquet files in the hive directory for the given acquisition date is removed from the list before further processing. Only TLDs that actually need processing are submitted as tasks in the first place. The no_clobber option in hivize itself was left as-is in case hivize is run from an interactive session.
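As a rough illustration, the pre-filter could look like the sketch below. It assumes a hive partition layout of tld=<tld>/acquisition=<date> under the hive directory; the names hive_dir, acq_date, and filter_unprocessed_tlds are illustrative, not the actual hivize identifiers.

```python
from pathlib import Path

def filter_unprocessed_tlds(tlds: list[str], hive_dir: Path, acq_date: str) -> list[str]:
    """Return only TLDs with no parquet output yet for the given acquisition date."""
    remaining = []
    for tld in tlds:
        # Assumed partition layout: <hive_dir>/tld=<tld>/acquisition=<date>/
        partition = hive_dir / f"tld={tld}" / f"acquisition={acq_date}"
        if any(partition.glob("*.parquet")):
            continue  # output already exists; skip instead of submitting a doomed task
        remaining.append(tld)
    return remaining
```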

Grouped memory requests

Prior versions requested the same amount of RAM for every TLD, an amount set much higher than necessary in order to accommodate outliers in the dataset. This decreased throughput and efficiency because the QoS limit was reached while fewer total tasks were running at once. The basic --mem option has been removed and replaced with the --mem-factor parameter. Each TLD is assigned an estimated size based on the total size reported by the input parquet headers and the percentage of the log files that the TLD accounts for. This value generally underestimates the actual memory needs, so --mem-factor was added to compensate.
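The estimate can be derived from the parquet footers alone, without reading any data. A minimal sketch with pyarrow, assuming the fraction of log lines belonging to each TLD is already known (the function name and arguments here are illustrative):

```python
import pyarrow.parquet as pq

def estimate_tld_bytes(parquet_files, tld_fraction):
    """Estimate in-memory bytes for one TLD from parquet metadata.

    parquet_files: iterable of paths to the input parquet files
    tld_fraction:  fraction (0-1) of the log lines that belong to this TLD
    """
    total = 0
    for path in parquet_files:
        meta = pq.read_metadata(path)  # reads only the footer, not the data
        # total_byte_size per row group approximates the uncompressed size
        total += sum(meta.row_group(i).total_byte_size for i in range(meta.num_row_groups))
    return total * tld_fraction
```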

The estimated memory value for each TLD is multiplied by the --mem-factor value (default 3), and the scaled values are then cut into groups. Each task in a group is assigned the same requested memory value, and the group is submitted as an array job. The memory groups are 8, 16, 32, 64, and 128 GB; anything estimated above 64 GB is placed in the 128 GB group regardless of the actual estimated dataset size. Increasing the memory factor does not change the breakpoints between groups; it only moves some TLDs from a smaller memory group to a larger one. That does decrease throughput, but a practical pattern is to run the conversion twice, with --no-clobber active and a higher memory factor on the second pass. Most tasks will have succeeded in the initial run, leaving only a few, if any, that need to be re-run in a larger memory group.
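For illustration, the binning could be implemented as in the sketch below. The breakpoints and the default factor of 3 come from the description above; everything else is an assumption.

```python
import bisect

GROUPS_GB = [8, 16, 32, 64, 128]  # requested memory per group, in GB

def assign_mem_group(estimated_bytes: float, mem_factor: float = 3.0) -> int:
    """Map a TLD's estimated size to one of the fixed memory groups (GB)."""
    scaled_gb = estimated_bytes * mem_factor / 2**30
    # Find the first breakpoint the scaled estimate fits under; anything
    # above 64 GB falls into the 128 GB group.
    idx = bisect.bisect_left(GROUPS_GB, scaled_gb)
    return GROUPS_GB[min(idx, len(GROUPS_GB) - 1)]
```

Raising --mem-factor only increases scaled_gb, so a TLD can move from, say, the 16 GB group to the 32 GB group, but the breakpoints themselves stay fixed.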

A parquet file recording the size group and the actual estimated size for each TLD is saved in the same directory as the TLD group files that each array job reads from.
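A minimal sketch of writing that bookkeeping file with pandas; the column and file names are assumptions, since the MR only states that the groups and estimates are saved as parquet alongside the TLD group files.

```python
import pandas as pd

def write_group_manifest(assignments: dict, out_dir: str) -> None:
    """assignments maps tld -> (estimated_bytes, mem_group_gb); names are illustrative."""
    df = pd.DataFrame(
        [(tld, est, grp) for tld, (est, grp) in assignments.items()],
        columns=["tld", "estimated_bytes", "mem_group_gb"],
    )
    # Written next to the TLD group files so each array job can look up its group
    df.to_parquet(f"{out_dir}/tld_size_groups.parquet", index=False)
```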

Misc

  • Added a --dry-run option to convert-to-hive that does everything except submit the batch jobs. TLD group files, as well as the parquet file detailing the size groups, are still generated as normal.
  • Reduced the number of default CPUs for hive batch jobs.
  • Added a cli_args parameter to convert_to_hive and parse_args for testing in interactive contexts. It defaults to None, in which case the actual CLI arguments are parsed (see the sketch below).
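The last bullet follows the standard argparse pattern, where parse_args(None) falls back to sys.argv. A hedged sketch, with the option set trimmed to the ones discussed above:

```python
import argparse
from typing import Optional, Sequence

def parse_args(cli_args: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="convert-to-hive")
    parser.add_argument("--no-clobber", action="store_true")
    parser.add_argument("--mem-factor", type=float, default=3.0)
    parser.add_argument("--dry-run", action="store_true")
    # argparse parses sys.argv[1:] when cli_args is None, so normal CLI
    # invocation is unchanged; tests can pass an explicit list instead.
    return parser.parse_args(cli_args)

# Interactive/test usage:
args = parse_args(["--dry-run", "--mem-factor", "4"])
```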

Merge request reports

Merged by Matthew K Defenderfer (Apr 29, 2025 10:20pm UTC)
