This is a complete overhaul of the backend compute for this package. Prior versions used a combination of pandas and dask with CUDA capabilities to improve compute power and address pandas's performance issues with large datasets. This caused a number of usability and maintenance problems, such as complicating installation via `pip`.

Switching the compute backend to Polars addresses these problems. Polars is written in Rust, is highly performant, and natively parallelizes wherever possible. Additionally, the most recent releases include an updated streaming engine that enables fast, out-of-core computation on larger-than-memory datasets using only CPUs. This reduces the resources required for processing, which should increase job/task throughput on the cluster.
As part of this overhaul, breaking changes were made to a few of the primary functions. Many functions were reduced in scope, for example by removing automatic data I/O. This improves long-term usability by making these functions more modular.
- Removed the `cudf`, `dask_cuda`, `dask`, and `pandas` dependencies, among others.
- Removed the `compute` module.
- Removed the `process.factory` submodule.
- Removed `pynvml` and `GPUtil`. GPU integration will be added back later when Cheaha's OS is updated so the newer versions of `cudf-polars` can be correctly compiled.
- `cli` submodule functions are now imported in `cli.__init__.py`.
- `policy` submodule functions are now imported in `policy.__init__.py`.
- Removed `cli.gpfs_preproc.py`.
- Removed `report/general-report.qmd`.
- `hivize` no longer includes a preparation and grouping step (`prep_hivize`).
- `bytes_to_human_readable_size`, `create_size_bin_labels`, and `calculate_size_distribution` moved from `.policy.hive` to `.utils.{core,units}`.
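For context, a helper like `bytes_to_human_readable_size` typically looks something like the following. This is only a minimal sketch of the idea; the actual implementation in `.utils.units` may differ in signature, rounding, and units.

```python
def bytes_to_human_readable_size(n_bytes: int) -> str:
    """Convert a byte count to a human-readable string (binary units).

    Minimal sketch; the real .utils.units implementation may differ.
    """
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size = float(n_bytes)
    for unit in units:
        # Stop once the value fits in the current unit (or we run out).
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024

print(bytes_to_human_readable_size(5_000_000))  # → "4.8 MiB"
```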
- `as_path` and `parse_scontrol` moved to `.utils.core`.
- `as_datetime`, `as_timedelta`, `create_timedelta_breakpoints`, and `create_timedelta_labels` moved/added to `.utils.datetime`.
- Added `prep_{age,size}_distribution` functions that return the correctly formatted breakpoints and labels used in `pl.Series.cut`. The `calculate_*` versions of those commands now call those functions, apply `cut`, and return the groups as a Series.
- `aggregate_gpfs_dataset` can now group on file size in addition to file age.
- `aggregate_gpfs_dataset` now takes a DataFrame/LazyFrame and returns the aggregated DataFrame; no data I/O is performed.
- `create_db` is now more general purpose and takes a table definition as input. `create_churn_db` was added for convenience to specifically create the churn database/table.
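The split described in the last bullet can be sketched as follows. Everything here is assumed for illustration: the real functions' signatures, backend, and the churn table's columns are not specified in these notes.

```python
import sqlite3

def create_db(path: str, table_def: str) -> None:
    """Create a database from a caller-supplied table definition.

    Hypothetical sketch of the generalized create_db: the table
    definition is an argument rather than being hard-coded.
    """
    with sqlite3.connect(path) as conn:
        conn.execute(table_def)

def create_churn_db(path: str) -> None:
    # Convenience wrapper with the churn table baked in (columns assumed).
    create_db(
        path,
        "CREATE TABLE IF NOT EXISTS churn (tld TEXT, log_dt TEXT, bytes INTEGER)",
    )

create_churn_db(":memory:")
```

The design choice mirrors the rest of the release: the general function stays small and reusable, while the convenience wrapper preserves the old one-call workflow.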