# Convert to polars for backend compute
## Overview
This is a complete overhaul of the package's compute backend. Prior versions used a combination of pandas and Dask with CUDA capabilities to improve compute power and address pandas's performance issues with large datasets. This caused a number of usability and maintenance problems:
- There were 4 possible compute backends, each requiring slight algorithmic modifications, so essentially 4 different versions of the same function/method had to be written, one per backend.
- The CUDA and Dask packages are extremely heavy and caused very long load times even for simple actions. Workarounds that altered where packages were loaded in the script made the code flow more confusing.
- Dask, while performant, is prone to crashes when creating and using its local cluster functionality.
- There was an over-reliance on A100s being available in order to complete processing in a timely manner.
- Installation on Cheaha is slightly cumbersome because most of the packages must come from the conda repository rather than being installable with `pip`.
Switching the compute backend to Polars addresses all of these problems. Polars is written in Rust, is highly performant, and natively parallelizes wherever possible. Additionally, the most recent releases include an updated streaming engine that enables fast, out-of-core computation using only CPUs. This reduces the resources required for processing, which should increase job/task throughput on the cluster.
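For a sense of what this looks like in practice, here is a minimal sketch of a lazy, streaming Polars query. The file path and column names are hypothetical, and `engine="streaming"` assumes a recent Polars release:

```python
import polars as pl

# Lazily scan a parquet dataset that may be larger than RAM, aggregate,
# and let the streaming engine execute out-of-core on CPU only.
result = (
    pl.scan_parquet("gpfs_log.parquet")  # lazy: nothing is read yet
    .filter(pl.col("size") > 0)
    .group_by("tld")
    .agg(pl.col("size").sum().alias("total_bytes"))
    .collect(engine="streaming")  # stream batches instead of materializing
)
print(result)
```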
As part of this, breaking changes were made to a few of the primary functions. Many functions are reduced in scope, such as by removing automatic data I/O. This increases usability in the long run by making these functions more modular.
## Changelog
### Major Changes
- Updated all parquet/dataframe logic to Polars
- Removed `cudf`, `dask_cuda`, `dask`, and `pandas` dependencies, among others.
- Removed the `compute` module
- Removed the `process.factory` submodule
- Removed all GPU interfaces including `pynvml` and `GPUtil`. GPU integration will be added back later when Cheaha's OS is updated so that newer versions of `cudf-polars` can be correctly compiled.
### Minor Changes
- All primary `cli` submodule functions are now imported in `cli.__init__.py`
- All primary `policy` submodule functions are now imported in `policy.__init__.py`
- Deleted unused `cli.gpfs_preproc.py`
- Deleted unused `report/general-report.qmd`
- `hivize` no longer includes a preparation and grouping step (`prep_hivize`)
- `bytes_to_human_readable_size`, `create_size_bin_labels`, and `calculate_size_distribution` moved from `.policy.hive` to `.utils.{core,units}`
- `as_path` and `parse_scontrol` moved to `.utils.core`
- `as_datetime`, `as_timedelta`, `create_timedelta_breakpoints`, and `create_timedelta_labels` moved/added to `.utils.datetime`
- Added `prep_{age,size}_distribution` functions that return the correctly formatted breakpoints and labels used in `pl.Series.cut`. The `calculate_*` versions of those commands now call these functions, apply `cut`, and return the groups as a Series (a sketch follows this list)
- `aggregate_gpfs_dataset` can now group on file size in addition to file age
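A minimal sketch of the `prep_*`/`calculate_*` split described above, using made-up size bins; the actual breakpoints, labels, and signatures live in the package:

```python
import polars as pl

def prep_size_distribution() -> tuple[list[float], list[str]]:
    # Return breakpoints and matching labels formatted for pl.Series.cut.
    # These bins are illustrative only.
    breaks = [1e3, 1e6, 1e9]  # 1 kB, 1 MB, 1 GB
    labels = ["<1 kB", "1 kB-1 MB", "1 MB-1 GB", ">1 GB"]
    return breaks, labels

def calculate_size_distribution(sizes: pl.Series) -> pl.Series:
    # Bin each file size into its labeled group and return the groups.
    breaks, labels = prep_size_distribution()
    return sizes.cut(breaks, labels=labels)

groups = calculate_size_distribution(pl.Series("size", [512.0, 2_048.0, 5e7]))
```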
### Scope Changes
- `aggregate_gpfs_dataset` now takes in a dataframe/lazyframe and returns the aggregated dataframe. No data I/O is performed (see the sketch below).
- `create_db` is more general purpose and takes a table definition as input. `create_churn_db` was added for convenience to specifically create the churn database/table.
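A sketch of how the reduced-scope `aggregate_gpfs_dataset` might be used. The import path and exact signature are assumptions for illustration; only the behavior (pure transform, no I/O) comes from the changelog above:

```python
import polars as pl

# Placeholder import; the real package/module name differs.
from mypackage.policy import aggregate_gpfs_dataset

lf = pl.scan_parquet("gpfs_log.parquet")      # caller handles reading
agg = aggregate_gpfs_dataset(lf)              # returns the aggregated frame
agg.write_parquet("gpfs_aggregated.parquet")  # caller handles writing
```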
### Cleanup
- Removed notebooks which were no longer relevant
## Issues
- Fixes #55
- Fixes #44
- Fixes #43
- Closes #56
- Closes #31
- Closes #45