Remove amperenodes-medium from OOD partition options
Since OOD jobs don't have a defined endpoint like batch jobs, they usually stop in one of two ways: timeout or manual cancellation. Timeouts are problematic because most users estimate time limits inaccurately, and a user will not be actively using an interactive session for the full runtime of a multi-day job. For in-demand resources like the A100s, this can create an artificial bottleneck for other jobs, since an idle interactive job is effectively a short-term reservation.
Over the past year, 1774 OOD jobs have run on the amperenodes-medium partition. Of those, only 358 jobs (20%) did not request the maximum 48-hour time limit, and only 10% requested less than a 24-hour runtime.
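For anyone who wants to reproduce these counts, here is a minimal sketch of the tally. It assumes job records exported as pipe-delimited `JobID|Timelimit|State` lines, in the style of `sacct --parsable2`, with time limits written as `[days-]HH:MM:SS`. The sample records and the function names are hypothetical, and filtering down to OOD jobs on the partition is assumed to happen before this step.

```python
from datetime import timedelta

# Hypothetical sample records in JobID|Timelimit|State form.
# These are illustrative only, not real accounting data.
SAMPLE = """\
1001|2-00:00:00|TIMEOUT
1002|2-00:00:00|CANCELLED by 1234
1003|12:00:00|COMPLETED
1004|1-00:00:00|TIMEOUT
1005|2-00:00:00|TIMEOUT
"""


def parse_timelimit(text):
    """Convert a [days-]HH:MM:SS time limit into a timedelta.

    Assumes a full HH:MM:SS clock part; values such as UNLIMITED
    are not handled in this sketch.
    """
    days, _, clock = text.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days or 0), hours=hours,
                     minutes=minutes, seconds=seconds)


def summarize(records, max_limit=timedelta(hours=48)):
    """Count jobs by requested time limit and timeouts at the maximum."""
    summary = {"total": 0, "below_max": 0, "below_day": 0, "timed_out_at_max": 0}
    for line in records.strip().splitlines():
        _job_id, limit_text, state = line.split("|")
        limit = parse_timelimit(limit_text)
        summary["total"] += 1
        if limit < max_limit:
            summary["below_max"] += 1
        if limit < timedelta(hours=24):
            summary["below_day"] += 1
        if limit == max_limit and state.startswith("TIMEOUT"):
            summary["timed_out_at_max"] += 1
    return summary


summary = summarize(SAMPLE)
share_below_max = 100 * summary["below_max"] / summary["total"]
print(summary, f"{share_below_max:.0f}% requested less than the maximum")
```

Applied to the real numbers above, the same ratio is 358 / 1774, which rounds to 20%.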
Overall, 726 OOD jobs timed out when requesting a 2-day time limit. We don't have a measure of GPU activity we can look back on over the full calendar year, but it's unlikely compute was being utilized during the full runtime of any of those jobs. This not only impacts queued amperenodes-medium jobs but also amperenodes jobs, since amperenodes-medium is just a subset of amperenodes.
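To put a rough scale on this, the sketch below estimates how many GPU-hours those timed-out jobs held. The 726 jobs and the 48-hour limit come from the numbers above. One GPU per job is an assumption that makes the result a lower bound, and the active fraction is a purely hypothetical placeholder, since we have no historical GPU activity data.

```python
TIMED_OUT_JOBS = 726    # OOD jobs that ran into the 2-day limit (from above)
TIME_LIMIT_HOURS = 48   # maximum time limit on amperenodes-medium


def held_gpu_hours(jobs, hours_per_job, gpus_per_job=1):
    """GPU-hours held by jobs that ran for their full time limit.

    gpus_per_job=1 is an assumption, so the result is a lower bound.
    """
    return jobs * hours_per_job * gpus_per_job


def idle_gpu_hours(held, active_fraction):
    """GPU-hours held but unused, for an assumed active fraction."""
    return held * (1 - active_fraction)


held = held_gpu_hours(TIMED_OUT_JOBS, TIME_LIMIT_HOURS)
print(held)  # 726 * 48 = 34848

# Hypothetical: if sessions were actively used a quarter of the time.
print(idle_gpu_hours(held, active_fraction=0.25))
```

Even under a generous guess for active use, most of the roughly 34,848 held GPU-hours would have been idle time that queued jobs could not use.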
I think we should remove the interactive use of amperenodes-medium. Instead, code development should use amperenodes, and the finished code should then be turned into a batch script that uses amperenodes-medium if necessary. This would prevent unauthorized pseudo-reservations from clogging the A100 queue.