Add example notebook using dask for ad-hoc analysis
example-dask-setup.ipynb
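For reference, here's a rough sketch of the setup the rest of these notes assume. The `ClusterManager` wrapper is a hypothetical stand-in for the notebook's `manager` object; `rmm_managed_memory` is the dask-cuda option behind the managed-memory note further down.

```
from dask.distributed import Client
from dask_cuda import LocalCUDACluster


class ClusterManager:
    """Hypothetical stand-in for the notebook's `manager` object."""

    def __init__(self):
        # Managed (unified) memory lets GPU allocations spill to host
        # RAM instead of failing outright on large shuffles.
        self.cluster = LocalCUDACluster(rmm_managed_memory=True)
        self.client = Client(self.cluster)

    def shutdown(self):
        # Close the client before the cluster so workers exit cleanly.
        self.client.close()
        self.cluster.close()


manager = ClusterManager()
```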
If you're on the same network as the compute node (via VPN or sshuttle), you can reach the dask dashboard in your browser at `<node_ip>:8787`. For example, a job on `c0241` has its dashboard at `172.20.201.241:8787`. You can also print the link via the client's `dashboard_link` property, but it will most likely report `127.0.0.1` as the IP, which won't work from your machine.
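If you want a link that actually works, one option (assuming the hostname resolves to the node's routable address, which it usually does on the compute nodes) is:

```
import socket

# `dashboard_link` tends to report 127.0.0.1; build the URL from the
# node's real IP instead (e.g. 172.20.201.241 on c0241).
node_ip = socket.gethostbyname(socket.gethostname())
print(f"http://{node_ip}:8787/status")
```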
It's imperative to shut down the cluster when you're done working. There have been a number of instances where I restarted the kernel in the middle of a dask compute task and the worker processes could not be killed. That caused the dask watchdog process to time out, which led NHC to put the node into a drain state; load then climbed steadily until the node became unresponsive and had to be rebooted. Before ending your job, call `manager.shutdown()` to close both the dask client and the cluster objects.
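In other words, the last cell of any session should look like this (with the manual equivalent shown in case you hold the client and cluster objects directly rather than through the `manager` sketched above):

```
# Final cell: run before ending the Slurm job.
manager.shutdown()

# Manual equivalent if you created the objects yourself:
# client.close()
# cluster.close()
```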
This setup assumes a LocalCUDACluster, so it sets the default dataframe backend to `cudf` instead of `pandas`. Remember that every partition is a `cudf.DataFrame`, which is mostly, but not entirely, compatible with a pandas-style workflow. A common pain point is anything semi-complicated with datetimes, like cutting them into groups; it's generally better to convert datetimes to unix timestamps (ints) first and work from there.
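A sketch of the backend switch and the datetime workaround (the parquet path and column names here are made up):

```
import dask
import dask.dataframe as dd

# New dask dataframes get cudf partitions instead of pandas ones.
dask.config.set({"dataframe.backend": "cudf"})

ddf = dd.read_parquet("/data/logs.parquet")  # hypothetical path

# Datetime binning is where cudf most often diverges from pandas;
# int64 nanoseconds since the epoch are much safer to group on.
ddf["ts_ns"] = ddf["access_time"].astype("int64")
ddf["hour"] = ddf["ts_ns"] // (3600 * 10**9)
```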
If you're using the flat parquet, it's strongly advised not to set an index after constructing the dataframe unless the `path` column is excluded from the read. Setting an index triggers a large shuffle that must happen mostly in memory, and the `path` column alone can exceed 80 GB in `data-project` GPFS logs. The workaround is to read only the columns you need and to enable managed memory (`rmm`) in the cluster options, which was done at the beginning of the notebook. That lets most computations run, but I wouldn't depend on it for everything.
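A sketch of that workaround (path and column names are made up); the managed-memory half is the `rmm_managed_memory=True` option in the setup sketch at the top:

```
import dask.dataframe as dd

# Read only the columns you need; leaving `path` out avoids shuffling
# 80+ GB of strings when the index is set.
ddf = dd.read_parquet(
    "/data/flat-parquet/",                   # hypothetical location
    columns=["access_time", "size", "uid"],  # hypothetical columns
)

# With `path` excluded and managed memory on, the shuffle behind
# set_index is usually survivable.
ddf = ddf.set_index("access_time")
```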