Add automated compute backend determination
Need to add a way to automatically determine which compute backend is appropriate for analyzing a given dataset. This is important for two reasons:
- It allows the aggregation to run as part of a full pipeline without the user having to choose a backend themselves, which is time-consuming to do manually
- It lowers the barrier to entry for users who are not familiar with Dask or cuDF
The user should be able to initialize a compute backend with a single command and then pass that backend to any further functions. Those functions should be written so they can adapt the code being run based on the type of backend.
Backend types

- `pandas`: No GPU is present and the full dataset can fit into memory with room to store intermediate computations
- `cudf` (`cudf.pandas`): A single GPU is available and the dataset can fit into VRAM with some headroom
- `dask`: The dataset cannot fit into RAM and no GPU is available. Only creates a local cluster with the job's allocated cores and memory
- `dask_cuda`: Either multiple GPUs are present, or a single GPU is present and the dataset cannot fit into VRAM
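As a concrete illustration of these rules, here is a minimal selection sketch. The `Resources` container, the `headroom` safety factor, and the name `choose_backend` are assumptions for illustration, not existing code in this project:

```python
from dataclasses import dataclass

@dataclass
class Resources:
    """Hypothetical container for the resources assigned to the current job."""
    ram_bytes: int
    n_gpus: int
    vram_bytes: int  # memory of the smallest available GPU

def choose_backend(dataset_bytes: int, res: Resources, headroom: float = 2.0) -> str:
    """Pick a backend name following the rules listed above.

    `headroom` is an assumed safety factor for intermediate computations;
    it would need tuning against real workloads.
    """
    fits_in_ram = dataset_bytes * headroom <= res.ram_bytes
    fits_in_vram = res.n_gpus > 0 and dataset_bytes * headroom <= res.vram_bytes

    if res.n_gpus == 0:
        return "pandas" if fits_in_ram else "dask"
    if res.n_gpus == 1 and fits_in_vram:
        return "cudf"
    return "dask_cuda"
```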
Luckily, the actual processing commands are very similar across all four compute backend types, generally differing only in the main module loaded and some Dask-specific implementation details.
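To make that concrete, the same aggregation expressed under each backend (assuming a hypothetical Parquet dataset with `key` and `value` columns) differs only in which module is imported, plus a `.compute()` call for the lazy Dask variants:

```python
import pandas as pd                 # pandas backend
# import cudf as pd                 # cudf backend (or use cudf.pandas to patch pandas)
# import dask.dataframe as pd       # dask / dask_cuda backends

df = pd.read_parquet("dataset/")                 # hypothetical dataset path
result = df.groupby("key")["value"].mean()
# result = result.compute()                      # only the Dask backends are lazy
```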
To-Do

- Determine the compute resources assigned to the current job
- Estimate the total in-memory dataset size; using Parquet file sizes is misleading because of how well compressed they are (see the first sketch after this list)
- Determine which backend is acceptable based on resources and dataset size
- Manage automated module imports based on the backend used (see the second sketch after this list)
  - For example, if `dask_cuda` needs to be used, automatically import `dask_cuda` and `dask.dataframe` into the global namespace for convenience
- If a Dask backend is necessary, return an object containing the client and cluster objects so the user can interact with them
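For the first two items, a sketch of what resource detection and size estimation could look like. It assumes `psutil`, `pyarrow`, and optionally `pynvml` are available, and that summing the uncompressed row-group sizes from the Parquet metadata is an acceptable proxy for the in-memory footprint:

```python
import glob
import os

import psutil                     # RAM visible to the job
import pyarrow.parquet as pq      # Parquet metadata for uncompressed sizes

def job_resources():
    """Best-effort detection of the resources assigned to the current job."""
    n_cores = len(os.sched_getaffinity(0))          # cores actually allocated (Linux)
    ram_bytes = psutil.virtual_memory().available   # a scheduler limit may be tighter
    try:
        import pynvml
        pynvml.nvmlInit()
        n_gpus = pynvml.nvmlDeviceGetCount()
        vram_bytes = min(
            pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(i)).total
            for i in range(n_gpus)
        ) if n_gpus else 0
    except Exception:                               # no driver or pynvml not installed
        n_gpus, vram_bytes = 0, 0
    return n_cores, ram_bytes, n_gpus, vram_bytes

def estimate_in_memory_bytes(dataset_dir: str) -> int:
    """Approximate the in-memory dataset size from Parquet metadata.

    Summing row-group `total_byte_size` gives the uncompressed data size,
    which is a much better proxy than the compressed file sizes on disk
    (the true pandas/cuDF footprint can still differ, e.g. for strings).
    """
    total = 0
    for path in glob.glob(os.path.join(dataset_dir, "**", "*.parquet"), recursive=True):
        meta = pq.ParquetFile(path).metadata
        total += sum(meta.row_group(i).total_byte_size for i in range(meta.num_row_groups))
    return total
```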
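For the last two items, a sketch of backend start-up: dynamic imports via `importlib`, a local cluster sized to the job's allocation for `dask`, and an object carrying the client/cluster back to the user. The names `ComputeBackend` and `start_backend` are placeholders, not an existing API:

```python
import importlib
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ComputeBackend:
    """Hypothetical return object: the backend name plus the Dask client/cluster
    when one was started (both None for the pandas and cudf backends)."""
    name: str
    client: Optional[Any] = None
    cluster: Optional[Any] = None

def start_backend(name: str, n_cores: int, ram_bytes: int) -> ComputeBackend:
    """Import what the chosen backend needs and start a cluster when required."""
    if name == "pandas":
        return ComputeBackend("pandas")
    if name == "cudf":
        # cudf.pandas accelerator mode patches pandas to run on the GPU.
        importlib.import_module("cudf.pandas").install()
        return ComputeBackend("cudf")

    # Both Dask variants need dask.dataframe; note that assigning into globals()
    # only affects this module's namespace, so exposing `dd` in the user's own
    # session would need a different mechanism (e.g. returning the module).
    globals()["dd"] = importlib.import_module("dask.dataframe")
    from distributed import Client

    if name == "dask":
        from distributed import LocalCluster
        # Local cluster limited to the job's allocated cores and memory.
        cluster = LocalCluster(n_workers=n_cores,
                               memory_limit=ram_bytes // max(n_cores, 1))
    else:  # "dask_cuda"
        cluster = importlib.import_module("dask_cuda").LocalCUDACluster()

    return ComputeBackend(name, client=Client(cluster), cluster=cluster)
```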