|
|
UAB Research Computing has good news to share! We are currently installing, configuring, and testing 40 A100 GPUs (20 nodes) on Cheaha, with a planned release date of "DATE".
|
|
|
|
|
|
Due to limitations on electrical power in the GPU node rack, we have taken 9 of the P100 nodes offline. The reduced capacity may lead to increased wait times for GPU jobs until the A100 nodes are released. We intend to repurpose the P100 nodes within our cloud platform to increase capacity there.
|
|
|
|
|
|
### Draft of future information
|
|
|
|
|
|
The new `amperenodes*` partitions will mirror the existing `pascalnodes*` partitions. We anticipate a full release by `DATE`. For machine learning, deep learning, and AI training, we anticipate up to 20 times the previous throughput. Achieving that increase requires making use of specific platform features, described in the Performance section below. Additionally, until release, we will need to temporarily reduce GPU compute capacity on Cheaha. Please read on for more details about our transition plan and about the A100 GPU nodes.
|
|
|
|
|
|
#### Transition Plan
|
|
|
|
|
|
Rack space limitations in our data center require removing some of the P100 nodes to make physical space for the new A100 nodes. As a consequence, until release, there will be a period of reduced GPU compute capacity on Cheaha. We intend to retain the removed P100 nodes for future use elsewhere on our system.
|
|
|
|
|
|
Previously, we had virtualized A100 nodes available upon request in restricted partitions: `amperenodes`, `amperenodes-medium`, and `amperenodes-debug`. We are retaining the first two and removing the `amperenodes-debug` partition, as its purpose has been served.
|
|
|
|
|
|
Below are details of current and upcoming changes to Cheaha. Completed items are checked.

- [ ] Unrestricting the `amperenodes` and `amperenodes-medium` partitions.
- [ ] Removing the `amperenodes-debug` partition.
- [ ] Adding 20 A100 nodes.
    - [ ] 20 A100 nodes in `amperenodes`.
    - [ ] 8 A100 nodes shared between `amperenodes-medium` and `amperenodes`.
- [ ] Removing 6 virtualized A100 nodes.
- [x] Physically removing 9 P100 nodes from the Cheaha racks and from `pascalnodes` and `pascalnodes-medium`.
- [x] Retaining 9 P100 nodes.
    - [x] 9 P100 nodes in `pascalnodes`.
    - [x] 4 P100 nodes shared between `pascalnodes-medium` and `pascalnodes`.
|
|
|
|
|
|
The changes are also summarized in the table below.
|
|
|
|
|
|
| Partition            | P100 (before) | A100 (before) | P100 (after) | A100 (after) |
| -------------------- | ------------- | ------------- | ------------ | ------------ |
| `pascalnodes`        | 18            |               | 9            |              |
| `pascalnodes-medium` | 7             |               | 9            |              |
| `amperenodes`        |               | 6*            |              | 20           |
| `amperenodes-medium` |               | 6*            |              | 8            |
| `amperenodes-debug`  |               | 6*            |              |              |

\* Virtualized A100 nodes.
|
|
|
|
|
|
#### Performance
|
|
|
|
|
|
The A100 nodes have the following hardware, which is a substantial improvement over the P100 nodes.
|
|
|
- 2x A100 GPUs, 80 GB GPU memory
|
|
|
- 2x 64-core AMD CPUs
|
|
|
- 512 GB system memory
|
|
|
- 2x 3 TB NVMe local drives
|
|
|
|
|
|
The local NVMe drives and additional CPU cores should improve data loading rates, moving workflow bottlenecks onto the GPUs. These drives are mounted at `/local/scratch`.
|
|
|
|
|
|
A100 GPUs are capable of up to 20 times greater machine learning, deep learning, and AI throughput than P100 GPUs. Part of achieving these throughput increases is making use of the node-local NVMe solid-state drives, which may involve changes to your scripts and workflows. Specifically, input data will need to be moved from GPFS to the node-local drives before processing and **cleaned up afterward.**
|
|
|
|
|
|
Please be mindful that the A100 nodes will be located in the DC Blox data center, while GPFS storage is located in the TIC data center on campus. Because of this physical separation, reading data directly from GPFS during processing incurs additional network latency, so for best performance, be sure to move data to the local drive before processing.
|
|
|
|
|
|
Conceptually, it is possible to wrap an existing `sbatch` script payload with `cp` commands that stage the data onto the local SSD beforehand and clean the local SSD up afterward. The specifics of the following example may need to be tweaked to meet the needs of your script.
|
|
|
|
|
|
```shell
|
|
|
#!/bin/bash
|
|
|
#SBATCH ...
|
|
|
#SBATCH --partition=amperenodes
|
|
|
#SBATCH --gres=gpu:1
|
|
|
#SBATCH --array=0-9
|
|
|
|
|
|
# module load ...
|
|
|
|
|
|
# COPY DATA TO LOCAL DRIVE
|
|
|
# Be sure to make unique directories for each array task!
|
|
|
# Avoids having one task delete another task's data.
|
|
|
SSD_TEMP_DIR="/local/scratch/$USER/my_project/$SLURM_ARRAY_TASK_ID"
|
|
|
mkdir -p "$SSD_TEMP_DIR"
|
|
|
# Replace $DATA_SOURCE_DIRECTORY with the path to your input data on GPFS.
cp -r "$DATA_SOURCE_DIRECTORY" "$SSD_TEMP_DIR"
|
|
|
|
|
|
# YOUR ORIGINAL PAYLOAD GOES HERE
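# Be sure the payload reads its input from "$SSD_TEMP_DIR" (the local copy)
# rather than from the original GPFS location.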
|
|
|
|
|
|
# CLEAN UP LOCAL DRIVE
|
|
|
rm -r "$SSD_TEMP_DIR"
|
|
|
```
|
|
|
|
|
|
The example above assumes one GPU on one node, a single processing step, and no intermediate data stored on GPFS. Any intermediate data would also need to be copied to the local drive and cleaned up afterward.
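
If your payload produces results on the local drive that you want to keep, they must be copied back to GPFS before the cleanup step removes them. The following is a minimal sketch of that pattern; it assumes the payload writes its output under `$SSD_TEMP_DIR/results`, and `$GPFS_RESULTS_DIRECTORY` is a hypothetical placeholder for a results location of your choosing on GPFS.

```shell
# COPY RESULTS BACK TO GPFS, THEN CLEAN UP LOCAL DRIVE
# Assumes the payload wrote its output under "$SSD_TEMP_DIR/results".
# $GPFS_RESULTS_DIRECTORY is a hypothetical placeholder for a directory on GPFS.
mkdir -p "$GPFS_RESULTS_DIRECTORY/$SLURM_ARRAY_TASK_ID"
cp -r "$SSD_TEMP_DIR/results" "$GPFS_RESULTS_DIRECTORY/$SLURM_ARRAY_TASK_ID"
rm -r "$SSD_TEMP_DIR"
```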
|
|
|
|
|
|
For workflows across multiple nodes, take care to ensure that data is moved to the local drives of the same nodes on which it will be processed. Please contact us for more information if you need to use A100s on multiple nodes in a single job.
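
As a rough illustration only, not a complete multi-node recipe, one common Slurm pattern is to launch the staging and cleanup commands once per allocated node with `srun`, so that every node's local drive holds its own copy of the data. The sketch below assumes `$SSD_TEMP_DIR` and `$DATA_SOURCE_DIRECTORY` are defined as in the example above and would sit inside a multi-node `sbatch` script.

```shell
# Stage input data onto the local drive of every allocated node
# by running the copy once per node.
srun --nodes="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 mkdir -p "$SSD_TEMP_DIR"
srun --nodes="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 cp -r "$DATA_SOURCE_DIRECTORY" "$SSD_TEMP_DIR"

# YOUR MULTI-NODE PAYLOAD GOES HERE

# Clean up the local drive on every node afterward.
srun --nodes="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 rm -r "$SSD_TEMP_DIR"
```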