## Initial Announcement

UAB Research Computing has good news to share! We are currently installing, configuring, and testing 40 A100 GPUs (20 nodes) on Cheaha, with a planned release date of "DATE".

Due to limitations on electrical power in the GPU node rack, we have taken 9 of the P100 nodes offline. The reduced capacity may lead to increased wait times for GPU jobs until the A100 nodes are released. We intend to repurpose the P100 nodes within our cloud platform to increase capacity there.

## Current Communication

UAB Research Computing has good news to share! We have installed 40 A100 GPUs across 20 nodes, 2 GPUs per node, for immediate use by our research community. To get started quickly, use the "amperenodes" and/or "amperenodes-medium" partitions. For more information about the installation, known limitations, how to make the most of the A100 GPUs, and hardware details, please read on.

**Changes to CUDA software**

To use the latest version of CUDA, please use `module load CUDA/12.2.0`. To use the latest version of cuDNN, please use `module load cuDNN/12.1.0`.

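As a minimal sketch, the following job script requests a single A100 and loads the modules above. The job name, time, CPU, and memory values are placeholders to adapt to your workload, not recommendations; the partition time limits are described in the Questions and Answers below.

```bash
#!/bin/bash
#SBATCH --job-name=a100-example      # placeholder name
#SBATCH --partition=amperenodes      # up to 12 hours; amperenodes-medium allows up to 48
#SBATCH --time=02:00:00
#SBATCH --gres=gpu:1                 # each A100 node has 2 GPUs
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# Load the CUDA and cuDNN modules described above.
module load CUDA/12.2.0
module load cuDNN/12.1.0

# Replace with your actual workload.
nvidia-smi
```
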
**Hardware Specification**

Each A100 node has two A100 GPUs, each of which has 80 GB of memory. The nodes also have 128 cores split across two CPU dies and 512 GB of main system memory. 6 TB of NVMe storage is available in a striped (RAID 0) configuration for I/O performance.

**Known Limitations**

- The A100 nodes are not yet available via Open OnDemand. We intend to close this gap as soon as possible.

- The NVMe mount points are currently in flux, and environment variables may change to ensure a consistent experience across partitions on Cheaha. We will communicate finalized information as soon as possible.

- We are researching how to make the TensorRT library available as a module. You may see warnings from TensorFlow that TensorRT was not found. The lack of TensorRT may or may not impact performance, but the warning itself does not prevent model training or affect its quality.

Previously, we had virtualized A100 nodes available in restricted partitions upon request. These partitions were `amperenodes`, `amperenodes-medium`, and `amperenodes-debug`. We are retaining the first two and removing the `amperenodes-debug` partition, as its purpose has been served.

**Questions and Answers**:

- How do I access the A100 nodes?
    - Request jobs on the "amperenodes" partition (up to 12 hours) or the "amperenodes-medium" partition (up to 48 hours).
- How many GPUs can I request at once?
    - Up to four GPUs may be requested by any one researcher at once.
- What performance improvements can I expect over the P100 GPUs?
    - Performance improvements depend on the software and algorithms being used.
    - Swapping a single P100 for a single A100, you can generally expect a 3x to 20x improvement.
- How can I make the most efficient use of the A100 GPUs?
    - A100s process data very rapidly. Ideally, we want the A100 to be the bottleneck during processing.
    - Possible ways to improve performance include...
        - ...copying your input data onto `/local` (node-specific NVMe drives) before processing.
        - ...using a larger number of CPU cores for data loading and preprocessing.
        - ...verifying improvements empirically by recording timings with different setups (see the sketch after this list).
- Where are the A100 nodes physically located, and will this impact my workflows?
    - The A100 nodes are located in the DC BLOX Data Center, west of UAB Campus.
    - Because Cheaha storage (GPFS) is located on campus, there may be slightly higher latency when transferring data between the A100 nodes and GPFS. Impacts will only occur if very small amounts of data are transferred very frequently, which is unusual for most GPU workflows.
    - We strongly recommend copying your input data onto `/local` prior to processing.
- What will happen to the P100 GPUs?
    - We intend to retain all of the 18 existing P100 GPU nodes.
    - 9 nodes are still available now.
    - 9 nodes have been temporarily taken offline due to space and power limitations from the incoming A100 nodes.
    - More information will be forthcoming.
- What else should I be aware of?
    - Please be sure to clean your data off of `/local` as soon as you no longer need it, before the job finishes.

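As a rough sketch of the timing advice above, assuming your dataset has already been staged onto the local drive (see the sample script at the end of this document): run the same payload against both locations and compare wall-clock times. The `train.py` payload and the dataset paths are hypothetical placeholders.

```bash
# Run inside a job on an A100 node. Mount points are still in flux
# (see Known Limitations), so adjust /local if needed.
LOCAL_DIR="/local/$USER/$SLURM_JOB_ID"

for DATA_DIR in "$HOME/my-dataset" "$LOCAL_DIR/my-dataset"; do
    START=$(date +%s)
    python train.py --data "$DATA_DIR"   # placeholder payload
    END=$(date +%s)
    echo "Reading from $DATA_DIR took $((END - START)) seconds"
done
```
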
## Future Announcements

### Draft of future information

The new partitions will be the `amperenodes*` partitions and will mirror the existing `pascalnodes*` partitions. We anticipate a full release by `DATE`. For machine learning, deep learning, and AI training, we anticipate up to 20 times the previous throughput. Specific features of the platform must be used to achieve this increased throughput. Additionally, until release, we will need to temporarily reduce GPU compute capacity on Cheaha. Please read on for more details about our transition plan and about the A100 GPU nodes.

#### Transition Plan

Rack space limitations in our data center require removing some of the P100 nodes to make physical space for the new A100 nodes. As a consequence, there will be a period of reduced GPU compute capacity on Cheaha until release. We intend to retain the removed P100 nodes for future use elsewhere on our system.

Below are details of current and upcoming changes to Cheaha. Completed items are checked.

- [ ] Unrestricting `amperenodes` and `amperenodes-medium` partitions.
- [ ] Removing `amperenodes-debug` partition.
- [ ] Adding 20 A100 nodes.
    - [ ] 20 A100 nodes in `amperenodes`.
    - [ ] 8 A100 nodes shared between `amperenodes-medium` and `amperenodes`.
- [ ] Removing 6 virtualized A100 nodes.
- [x] 9 P100 nodes have been physically removed from Cheaha racks and from `pascalnodes` and `pascalnodes-medium`.
- [x] Retaining 9 P100 nodes.
    - [x] 9 P100 nodes in `pascalnodes`.
    - [x] 4 P100 nodes shared between `pascalnodes-medium` and `pascalnodes`.

The changes are also summarized in the table below. Entries marked with an asterisk are the virtualized A100 nodes, which were available only upon request.

|                      | Before |      |     | After |      |
| -------------------- | ------ | ---- | --- | ----- | ---- |
| Partition            | P100   | A100 |     | P100  | A100 |
| `pascalnodes`        | 18     |      |     | 9     |      |
| `pascalnodes-medium` | 7      |      |     | 9     |      |
| `amperenodes`        |        | 6*   |     |       | 20   |
| `amperenodes-medium` |        | 6*   |     |       | 8    |
| `amperenodes-debug`  |        | 6*   |     |       |      |

#### Performance

The A100 nodes have the following hardware, which is a substantial improvement over the P100 nodes.

- 2x A100 GPUs, each with 80 GB of GPU memory
- 2x 64-core AMD CPUs
- 512 GB system memory
- 2x 3 TB NVMe local drives

The local NVMe drives and additional CPUs should help enhance data loading rates, moving workflow bottlenecks onto the GPUs. These drives are mounted at `/local/scratch`.

A100 GPUs are capable of greater machine learning, deep learning, and AI throughput than P100 GPUs, by a factor of up to 20. Part of achieving these throughput increases is making use of the node-local NVMe solid-state drives, which may involve changes to your scripts and workflows. Specifically, input data will need to be moved from GPFS to the node-local drives before processing and **cleaned up afterward.**

Please be mindful that the A100 nodes will be located in the DC BLOX data center, while GPFS storage is located in the TIC data center on campus. For best performance, be sure to move data to the local drive before processing.

## Sample script for automating data transfers

Conceptually, it is possible to wrap an existing `sbatch` script payload with `cp` commands that move the data to the local SSD beforehand and clean up the local SSD afterward. The specifics of the following example may need to be tweaked to meet the needs of your script.

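The sketch below assumes the node-local NVMe drive is mounted at `/local` (mount points are still in flux, as noted in Known Limitations). The dataset paths and the `train.py` payload are placeholders; substitute your own.

```bash
#!/bin/bash
#SBATCH --job-name=a100-staged-data   # placeholder name
#SBATCH --partition=amperenodes
#SBATCH --time=12:00:00
#SBATCH --gres=gpu:1

# Per-job directory on the node-local NVMe drive, so concurrent jobs on
# the same node do not collide.
LOCAL_DIR="/local/$USER/$SLURM_JOB_ID"
INPUT_DIR="$HOME/my-dataset"          # placeholder: your input data on GPFS

# Clean up the local SSD when the job ends, even if the payload fails.
trap 'rm -rf "$LOCAL_DIR"' EXIT

# 1. Stage input data from GPFS onto the local SSD.
mkdir -p "$LOCAL_DIR"
cp -r "$INPUT_DIR" "$LOCAL_DIR/"

# 2. Original payload, pointed at the local copy (placeholder command).
python train.py --data "$LOCAL_DIR/my-dataset" --output "$LOCAL_DIR/results"

# 3. Copy results back to GPFS before the trap removes the local copy.
cp -r "$LOCAL_DIR/results" "$HOME/results-$SLURM_JOB_ID"
```

The `trap ... EXIT` line is one way to honor the "clean your data off of `/local`" guidance above even when the payload exits early.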