HPC Announce Email
To our Research Computing Community,
UAB Research Computing has good news to share! We have installed 40 A100 GPUs across 20 nodes, 2 GPUs per node, for immediate use by our research community. To get started quickly, use partitions "amperenodes" and/or "amperenodes-medium". For more information about changes, known limitations, how to make the most of the A100 GPUs, and hardware details, please read on.
Changes to CUDA software
To use the latest version of CUDA, please use "module load CUDA/12.2.0". To use the latest version of cuDNN, please use "module load cuDNN/12.1.0". For more information please see our documentation.
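As a quick sanity check, loading the modules and confirming the toolkit version might look like the sketch below. The module names follow this announcement; the "module reset" line assumes an Lmod-style module system and can be omitted if it is not available.
#!/bin/bash
# Sketch: load the new CUDA and cuDNN modules and confirm the CUDA toolkit version.
# If these exact versions are not found, run "module avail CUDA cuDNN" to list what is installed.
module reset                 # start from a clean module environment (Lmod)
module load CUDA/12.2.0
module load cuDNN/12.1.0
nvcc --version               # should report CUDA release 12.2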
Hardware Specification
Each A100 node has two A100 GPUs, each of which has 80 GB of memory. The nodes also have 128 cores split across two CPU dies and 512 GB of main system memory. 6 TB of NVMe storage is available in a striped (RAID 0) configuration for I/O performance. For more information please see our hardware page and GPU page. Please also read about how to ensure IO performance.
Known Limitations
For TensorFlow users: we are researching how to make the TensorRT library available as a module. You may see warnings in TensorFlow that TensorRT was not found. Missing TensorRT may or may not affect performance, but the warning does not prevent model training or reduce the quality of trained models.
Further Reading
- A100 and amperenodes FAQ
- Cheaha GPU Documentation Page
- Ensuring IO Performance with A100 GPUs
- CUDA Module Changes
Questions and Concerns
If you have any questions or concerns, please reply to this email to create a support ticket, or email support@listserv.uab.edu.
Thank you!
The UAB Research Computing Team
OOD MOTD Announcement Box
LAST UPDATED 2023-09-25
🎉 Important news! 🎉
We've installed 40 A100 80 GB GPUs on Cheaha across 20 nodes, ready for immediate use.
- To get started using these GPUs please see our [A100 FAQ](https://docs.rc.uab.edu/cheaha/slurm/gpu/#frequently-asked-questions-faq-about-a100-gpus).
- For more details about the A100 GPUs see our [GPU Page](https://docs.rc.uab.edu/cheaha/slurm/gpu/).
- For the best I/O performance, please see [Ensuring IO Performance with A100 GPUs](https://docs.rc.uab.edu/cheaha/slurm/gpu/).
- As part of this release, we have made changes to our [CUDA modules](https://docs.rc.uab.edu/cheaha/slurm/gpu/).
After this point: all of the following changes have been made to the docs and elsewhere. Put it in various places in the docs.
Questions and Answers:
- How do I access the A100 nodes?
- Request jobs on the "amperenodes" partition (up to 12 hours) or "amperenodes-medium" partition (up to 48 hours).
- How many GPUs can I request at once?
- Up to four GPUs may be requested by any one researcher at once.
- There are two GPUs per node, so requesting four GPUs will allocate two nodes. See the example request below.
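For reference, a minimal job request for these nodes might look like the sketch below. The time limit, CPU, and memory values are illustrative placeholders; only the partition name and GPU syntax come from the answers above.
#!/bin/bash
#SBATCH --partition=amperenodes      # or amperenodes-medium for jobs up to 48 hours
#SBATCH --time=12:00:00              # amperenodes jobs may run up to 12 hours
#SBATCH --gres=gpu:2                 # both A100 GPUs on a single node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8            # illustrative CPU request
#SBATCH --mem=64G                    # illustrative memory request
# module load CUDA/12.2.0 ...
# YOUR GPU WORKLOAD GOES HERE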
- What performance improvements can I expect over the P100 GPUs?
- Performance improvements depend on the software and algorithms being used.
- Swapping a single P100 for a single A100, you can generally expect a 3x to 20x improvement.
- ADD DOCUMENTATION FROM NVIDIA
- How can I make the most efficient use of the A100 GPUs?
- A100s process data very rapidly. Ideally, we want the A100 to be the bottleneck during processing.
- Possible ways to improve performance include...
- ...copying your input data onto /local/$SLURM_JOB_ID (node-specific NVMe drives) before processing.
- ...using a larger number of CPU cores for data loading and preprocessing (see the sketch after this list).
Need to qualify this statement, open up a dialog.
- This is becoming a concern as models and data increase in size and scope. We want to learn with you which configurations benefit most from additional resources.
- This may need to go to the docs.
- ...verifying improvements empirically by recording timings with different setups.
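As a sketch of the last two points (the CPU count and the payload command are illustrative placeholders, not recommendations), you could request extra cores for data loading and record simple wall-clock timings around your payload:
#!/bin/bash
#SBATCH --partition=amperenodes
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16           # illustrative; extra cores can help data loading and preprocessing
# module load ...
start=$(date +%s)
# YOUR PAYLOAD GOES HERE, e.g. a training script that uses $SLURM_CPUS_PER_TASK data-loading workers
end=$(date +%s)
# Record the timing so different setups (core counts, local vs. GPFS data) can be compared empirically.
echo "Payload wall time: $((end - start)) seconds"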
- Where are the A100 nodes physically located, and will this impact my workflows?
- The A100 nodes are located in the DC BLOX Data Center, west of UAB Campus.
- Because Cheaha storage (GPFS) is located on campus, there may be slightly higher latency when transferring data between the A100 nodes and GPFS. Impacts will only occur if very small amounts of data are transferred very frequently, which is unusual for most GPU workflows.
- We strongly recommend copying your input data onto /local/$SLURM_JOB_ID prior to processing.
- What will happen to the P100 GPUs?
- We intend to retain all of the 18 existing P100 GPU nodes.
- 9 nodes are available now.
- 9 nodes have been temporarily taken offline as we reconfigure hardware, and will be reallocated based on demand.
- What else should I be aware of?
- Please be sure to clean your data off of /local/$SLURM_JOB_ID as soon as you no longer need it, before the job finishes. A sketch of one way to automate this cleanup follows below.
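One possible way to reduce the risk of leaving data behind is a bash trap that removes the temporary directory whenever the script exits. This is a sketch only; it does not cover jobs that are forcibly killed, for example at the job time limit.
TMPDIR="/local/$SLURM_JOB_ID"
mkdir -p "$TMPDIR"
# Remove the local copy when the script exits, whether the payload succeeds or fails.
trap 'rm -rf "$TMPDIR"' EXIT
cp -r "$DATA_SOURCE_DIRECTORY" "$TMPDIR"
# YOUR PAYLOAD GOES HERE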
Sample script for automating data transfers
Add to docs and put URL in email.
Check docs and remove references to /scratch/local
Conceptually, it is possible to wrap existing sbatch script payloads with a cp command to move the data to the local SSD, then clean up the local SSD afterward. The specifics of the following example may need to be tweaked to meet the needs of the specific script.
#!/bin/bash
#SBATCH ...
#SBATCH --partition=amperenodes
#SBATCH --gres=gpu:1
#SBATCH --array=0-9
# module load ...
# COPY RESEARCH DATA TO LOCAL TEMPORARY DIRECTORY
# Be sure to make unique directories for each array task!
# Avoids having one task delete another task's data.
# Your data lives in $DATA_SOURCE_DIRECTORY
TMPDIR="/local/$SLURM_JOB_ID"
mkdir -p "$TMPDIR"
cp -r "$DATA_SOURCE_DIRECTORY" "$TMPDIR"
# YOUR ORIGINAL PAYLOAD GOES HERE
# CLEAN UP TEMPORARY DIRECTORY
# WARNING! Changing the following line can cause research data to be permanently deleted unexpectedly!
rm -rf "$TMPDIR"
The example above assumes one GPU on one node, a single processing step, and no intermediate data stored on GPFS. Any intermediate data would also need to be copied and cleaned up.
For workflows across multiple nodes, take care to ensure data is moved to local drives on the same node that the data will be processed on. Please contact us for more information if you need to use A100s on multiple nodes in a single job.
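For reference, one possible per-node staging pattern (a sketch only; the node and GPU counts are illustrative) uses srun to run the staging and cleanup commands once on every allocated node, so each node's local NVMe drive holds its own copy of the data:
#!/bin/bash
#SBATCH --partition=amperenodes
#SBATCH --nodes=2                    # illustrative node count
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2                 # two A100 GPUs on each node
TMPDIR="/local/$SLURM_JOB_ID"
# Stage the data once per node so every node has its own local copy.
srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 mkdir -p "$TMPDIR"
srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 cp -r "$DATA_SOURCE_DIRECTORY" "$TMPDIR"
# YOUR MULTI-NODE PAYLOAD GOES HERE
# Clean up the local copy on every node before the job finishes.
srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 rm -rf "$TMPDIR"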