Add automated compute backend determination
Need to add a way to automatically determine which compute backend is appropriate for analyzing a given dataset. This is important for two reasons:
- It allows the aggregation to run as part of a full pipeline without the user having to choose a backend themselves, which is time-consuming to do manually
- It lowers the barrier to entry for users who are not familiar with Dask or cuDF
The user should be able to initialize a compute backend with a single command and then pass that backend to any further functions. Those functions should be written so they can adapt the code being run based on the type of backend.
Backend types

- `pandas`: No GPU is present and the full dataset can fit into memory with room to store intermediate computations
- `cudf` (`cudf.pandas`): A single GPU is available and the dataset can fit into VRAM with some headroom
- `dask`: The dataset cannot fit into RAM and no GPU is available. Only creates a local cluster with the job's allocated cores and memory
- `dask_cuda`: Either multiple GPUs are present, or a single GPU is present and the dataset cannot fit into VRAM
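As a concrete illustration of these rules, here is a minimal selection sketch. The `Resources` container, the `headroom` safety factor, and the name `choose_backend` are assumptions for illustration, not existing code in this project:

```python
from dataclasses import dataclass

@dataclass
class Resources:
    """Hypothetical container for the resources assigned to the current job."""
    ram_bytes: int
    n_gpus: int
    vram_bytes: int  # memory of the smallest available GPU

def choose_backend(dataset_bytes: int, res: Resources, headroom: float = 2.0) -> str:
    """Pick a backend name following the rules listed above.

    `headroom` is an assumed safety factor for intermediate computations;
    it would need tuning against real workloads.
    """
    fits_in_ram = dataset_bytes * headroom <= res.ram_bytes
    fits_in_vram = res.n_gpus > 0 and dataset_bytes * headroom <= res.vram_bytes

    if res.n_gpus == 0:
        return "pandas" if fits_in_ram else "dask"
    if res.n_gpus == 1 and fits_in_vram:
        return "cudf"
    return "dask_cuda"
```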
Luckily, the actual processing commands are very similar across all four compute backend types, generally differing only in the main module loaded and some Dask-specific implementation details.
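To make that concrete, the same aggregation expressed under each backend (assuming a hypothetical Parquet dataset with `key` and `value` columns) differs only in which module is imported, plus a `.compute()` call for the lazy Dask variants:

```python
import pandas as pd                 # pandas backend
# import cudf as pd                 # cudf backend (or use cudf.pandas to patch pandas)
# import dask.dataframe as pd       # dask / dask_cuda backends

df = pd.read_parquet("dataset/")                 # hypothetical dataset path
result = df.groupby("key")["value"].mean()
# result = result.compute()                      # only the Dask backends are lazy
```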
To-Do

- Determine the compute resources assigned to the current job
- Estimate the total in-memory dataset size; using Parquet file sizes is misleading because of how well compressed they are (see the first sketch after this list)
- Determine which backend is acceptable based on resources and dataset size
- Manage automated module imports based on the backend used (see the second sketch after this list)
  - For example, if `dask_cuda` needs to be used, automatically import `dask_cuda` and `dask.dataframe` into the global namespace for convenience
- If a Dask backend is necessary, return an object containing the client and cluster objects so the user can interact with them
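For the first two items, a sketch of what resource detection and size estimation could look like. It assumes `psutil`, `pyarrow`, and optionally `pynvml` are available, and that summing the uncompressed row-group sizes from the Parquet metadata is an acceptable proxy for the in-memory footprint:

```python
import glob
import os

import psutil                     # RAM visible to the job
import pyarrow.parquet as pq      # Parquet metadata for uncompressed sizes

def job_resources():
    """Best-effort detection of the resources assigned to the current job."""
    n_cores = len(os.sched_getaffinity(0))          # cores actually allocated (Linux)
    ram_bytes = psutil.virtual_memory().available   # a scheduler limit may be tighter
    try:
        import pynvml
        pynvml.nvmlInit()
        n_gpus = pynvml.nvmlDeviceGetCount()
        vram_bytes = min(
            pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(i)).total
            for i in range(n_gpus)
        ) if n_gpus else 0
    except Exception:                               # no driver or pynvml not installed
        n_gpus, vram_bytes = 0, 0
    return n_cores, ram_bytes, n_gpus, vram_bytes

def estimate_in_memory_bytes(dataset_dir: str) -> int:
    """Approximate the in-memory dataset size from Parquet metadata.

    Summing row-group `total_byte_size` gives the uncompressed data size,
    which is a much better proxy than the compressed file sizes on disk
    (the true pandas/cuDF footprint can still differ, e.g. for strings).
    """
    total = 0
    for path in glob.glob(os.path.join(dataset_dir, "**", "*.parquet"), recursive=True):
        meta = pq.ParquetFile(path).metadata
        total += sum(meta.row_group(i).total_byte_size for i in range(meta.num_row_groups))
    return total
```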
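For the last two items, a sketch of backend start-up: dynamic imports via `importlib`, a local cluster sized to the job's allocation for `dask`, and an object carrying the client/cluster back to the user. The names `ComputeBackend` and `start_backend` are placeholders, not an existing API:

```python
import importlib
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ComputeBackend:
    """Hypothetical return object: the backend name plus the Dask client/cluster
    when one was started (both None for the pandas and cudf backends)."""
    name: str
    client: Optional[Any] = None
    cluster: Optional[Any] = None

def start_backend(name: str, n_cores: int, ram_bytes: int) -> ComputeBackend:
    """Import what the chosen backend needs and start a cluster when required."""
    if name == "pandas":
        return ComputeBackend("pandas")
    if name == "cudf":
        # cudf.pandas accelerator mode patches pandas to run on the GPU.
        importlib.import_module("cudf.pandas").install()
        return ComputeBackend("cudf")

    # Both Dask variants need dask.dataframe; note that assigning into globals()
    # only affects this module's namespace, so exposing `dd` in the user's own
    # session would need a different mechanism (e.g. returning the module).
    globals()["dd"] = importlib.import_module("dask.dataframe")
    from distributed import Client

    if name == "dask":
        from distributed import LocalCluster
        # Local cluster limited to the job's allocated cores and memory.
        cluster = LocalCluster(n_workers=n_cores,
                               memory_limit=ram_bytes // max(n_cores, 1))
    else:  # "dask_cuda"
        cluster = importlib.import_module("dask_cuda").LocalCUDACluster()

    return ComputeBackend(name, client=Client(cluster), cluster=cluster)
```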