A100 `amperenodes` communication to User Community
TODO
For this issue:
- ETA of release
- Review of hpc-announce email
Timeline
This is tentative and subject to change.
- 2023-07-27: Physical installation complete
- 2023-08-04: Slurm/node configuration complete (first pass)
- 2023-08-09: Single-GPU testing complete
- 2023-09-25: Necessary fixes complete
- 2023-09-25: Release
For A100s generally
- Plan for local NVMe drives
  - Mike proposed RAID 0 striping of the two drives (performance?); see the setup sketch at the end of this section
  - Mount path would be `/local`
- Remaining tasks for A100s
  - Node definitions in `slurm.conf` (see the verification sketch at the end of this section)
  - QoS definitions in `slurm.conf`
  - Consistent shell variable for `/local` (see the usage sketch at the end of this section)
  - Validating A100s
  - Testing A100s ()
  - Performance comparison of A100s to P100s
  - Add `amperenodes` to live OOD: #461 (closed)
  - Node definitions in
  - CUDA: rc/cluster-software#103 (see the version check at the end of this section)
    - At least CUDA/toolkit >= 11.8
    - Ideally >= 12.0
    - cuDNN compiled against CUDA/toolkit
    - tensorrt compiled against CUDA/toolkit [optional]
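A rough sketch of what the proposed RAID 0 layout for the two local NVMe drives could look like. Only the RAID 0 striping and the `/local` mount path come from this page; the device names, the XFS filesystem, and the fstab entry are assumptions for illustration.

```bash
# Sketch only: stripe the two node-local NVMe drives into a single RAID 0 array
# and mount it at /local. Device names and filesystem choice are assumptions.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
mkfs.xfs /dev/md0
mkdir -p /local
mount /dev/md0 /local
# Persist the mount across reboots (assumed to be wanted); nofail avoids boot hangs
echo '/dev/md0  /local  xfs  defaults,nofail  0 0' >> /etc/fstab
```

RAID 0 aggregates the bandwidth and capacity of both drives but provides no redundancy, which is generally acceptable for node-local scratch space like `/local`.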
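Once the node, partition, and QoS definitions are in place, a few read-only Slurm commands can confirm they are visible to the scheduler. The node name below is a placeholder; only the `amperenodes` partition name comes from this page.

```bash
# Confirm the amperenodes partition is up and advertises the A100 GRES
sinfo -p amperenodes -o "%P %a %D %G %N"

# Inspect a single node's CPU/memory/GRES definition (node name is a placeholder)
scontrol show node c0200

# Review the QoS definitions and their limits
sacctmgr show qos format=Name,Priority,MaxWall,MaxTRESPerUser%30
```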
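If a consistent shell variable pointing at `/local` is exported on these nodes, job scripts can stage data onto the local NVMe without hard-coding the path. `LOCAL_SCRATCH`, the partition name, and the GRES request below are illustrative assumptions, not decided names.

```bash
#!/bin/bash
#SBATCH --partition=amperenodes
#SBATCH --gres=gpu:1

# LOCAL_SCRATCH is a hypothetical variable name; fall back to /local if it is unset.
workdir="${LOCAL_SCRATCH:-/local}/$SLURM_JOB_ID"
mkdir -p "$workdir"

cp -r "$SLURM_SUBMIT_DIR/input" "$workdir/"   # stage input onto the fast local NVMe
cd "$workdir"
./run_workload.sh                             # placeholder for the actual job
cp -r results "$SLURM_SUBMIT_DIR/"            # copy results back to shared storage
rm -rf "$workdir"                             # free the node-local space
```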
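For the CUDA requirement, a quick sanity check once the toolkit is installed. The module name and version below are guesses at how the toolkit might be exposed; rc/cluster-software#103 tracks the actual packaging.

```bash
module avail cuda              # see which CUDA toolkit modules actually exist
module load CUDA/12.0.0        # placeholder module name/version
nvcc --version                 # toolkit release; must be >= 11.8, ideally 12.x
nvidia-smi                     # driver version and the maximum CUDA version it supports
```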
Release coordination
- `slurm.conf`
  - Remove restrictions on access for `amperenodes*` partitions - Done: https://gitlab.rc.uab.edu/rc/rc-slurm/-/merge_requests/38
- Add `amperenodes*` partitions to OOD Prod - #461 (closed)
- Communications Prepared
  - Shell MOTD
  - OOD MOTD
  - Docs Announcement
  - Docs Pages
  - HPC Announce
- Remove reservation in `scontrol` (see the sketch after this list)
- Release HPC Announce
- Notify and close relevant ServiceNow tickets
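A hedged sketch of the reservation-removal step; the reservation name below is a placeholder and would be taken from the `scontrol show reservation` output.

```bash
# List current reservations and identify the one holding back the amperenodes* nodes
scontrol show reservation

# Delete it at release time (reservation name below is a placeholder)
scontrol delete ReservationName=amperenodes_burnin
```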
See the wiki page for current state of information to communicate.