**Known Limitations**

- The A100 nodes are not yet available via Open OnDemand. We intend to close this gap as soon as possible.

- The NVMe mount points are currently in flux, and environment variables may change to ensure a consistent experience across partitions on Cheaha. We will communicate finalized information as soon as possible.

- For TensorFlow users: we are researching how to make the TensorRT library available as a module. You may see warnings in TensorFlow that TensorRT was not found. The lack of TensorRT may or may not impact performance, but the warning does not prevent model training or reduce the quality of trained models. A quick way to confirm that TensorFlow still sees the GPUs is shown after this list.
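The following is a minimal sanity check, assuming TensorFlow is installed in your active module or environment, that the TensorRT warning has not stopped TensorFlow from seeing the GPUs:

```shell
# Minimal check: list the GPUs TensorFlow can detect.
# The TensorRT warning may still print; the GPUs should be listed regardless.
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```

An empty list means TensorFlow cannot see a GPU at all, which is a separate issue from the missing TensorRT library.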
## After this point...

Put it in various places in the docs

**Questions and Answers**:

- Request jobs on the "amperenodes" partition (up to 12 hours) or the "amperenodes-medium" partition (up to 48 hours). A minimal example request is shown after this list.

- How many GPUs can I request at once?

    - Up to four GPUs may be requested by any one researcher at once.

    - There are two GPUs per node, so requesting four GPUs will allocate two nodes.

- What performance improvements can I expect over the P100 GPUs?

    - Performance improvements depend on the software and algorithms being used.

    - Swapping a single P100 for a single A100, you can generally expect a 3x to 20x improvement.

    - ADD DOCUMENTATION FROM NVIDIA

- How can I make the most efficient use of the A100 GPUs?

    - A100s process data very rapidly. Ideally, we want the A100, not data loading, to be the bottleneck during processing.

    - Possible ways to improve performance include...

        - ...copying your input data onto `/local/$SLURM_JOB_ID` (node-specific NVMe drives) before processing.

        - ...using a larger number of CPU cores for data loading and preprocessing. `Need to qualify this statement, open up a dialog`

            - This is becoming a concern as models and data increase in size and scope. We want to learn with you what configurations benefit most from additional resources.

            - This may need to go to docs

        - ...verifying improvements empirically by recording timings with different setups (see the timing sketch after this list).

- Where are the A100 nodes physically located, and will this impact my workflows?

    - The A100 nodes are located in the DC BLOX Data Center, west of UAB Campus.

    - Because Cheaha storage (GPFS) is located on campus, there may be slightly higher latency when transferring data between the A100 nodes and GPFS. Impacts will only occur if very small amounts of data are transferred very frequently, which is unusual for most GPU workflows.

    - We strongly recommend copying your input data onto `/local/$SLURM_JOB_ID` prior to processing.

- What will happen to the P100 GPUs?

    - We intend to retain all 18 existing P100 GPU nodes.

    - 9 nodes are available now.

    - 9 nodes have been temporarily taken offline as we reconfigure hardware, and will be reallocated based on demand.

- What else should I be aware of?

    - Please be sure to clean your data off of `/local/$SLURM_JOB_ID` as soon as you no longer need it, before the job finishes (see the sample script section below for an example).
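As a starting point for requesting the A100 nodes, here is a minimal `sbatch` header sketch. The partition names and time limits come from the answers above; the `--gres`, `--cpus-per-task`, and `--mem` values are assumptions to adjust for your own workload, and the GPU request syntax shown is standard Slurm rather than anything Cheaha-specific.

```shell
#!/bin/bash
#SBATCH --partition=amperenodes   # or amperenodes-medium for jobs up to 48 hours
#SBATCH --time=12:00:00           # amperenodes allows up to 12 hours
#SBATCH --gres=gpu:2              # assumption: standard Slurm GPU syntax; two GPUs is the per-node maximum
#SBATCH --cpus-per-task=8         # assumption: extra cores can help with data loading and preprocessing
#SBATCH --mem=64G                 # assumption: size this to your data

# module load ...
# your payload here
```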
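To compare setups empirically, one lightweight approach is to record wall-clock timings around the processing step, for example when comparing reads from GPFS against reads from `/local/$SLURM_JOB_ID`, or different CPU core counts. A minimal sketch:

```shell
# Record a wall-clock timing for one configuration of your payload.
start=$(date +%s)

# YOUR PAYLOAD GOES HERE (e.g., the training or preprocessing step being compared)

end=$(date +%s)
echo "Elapsed seconds: $((end - start))"
```

Completed jobs can also be compared with Slurm accounting, for example `sacct -j <jobid> --format=JobID,Elapsed,MaxRSS`.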
## Sample script for automating data transfers

Add to docs and put URL in email.

Check docs and remove references to `/scratch/local`

Conceptually, it is possible to wrap an existing `sbatch` script payload with a `cp` command that copies the data to the local SSD beforehand and a cleanup step that removes it afterward. The specifics of the following example may need to be tweaked to meet the needs of your specific script.
```shell
# module load ...

# COPY RESEARCH DATA TO LOCAL TEMPORARY DIRECTORY
# Your data lives in $DATA_SOURCE_DIRECTORY.
# $SLURM_JOB_ID is unique to every job, including each task in a job array,
# so each task gets its own directory and cannot delete another task's data.
TMPDIR="/local/$SLURM_JOB_ID"
mkdir -p "$TMPDIR"
cp -r "$DATA_SOURCE_DIRECTORY" "$TMPDIR"

# YOUR ORIGINAL PAYLOAD GOES HERE

# CLEAN UP TEMPORARY DIRECTORY
# WARNING! Changing the following line can cause research data to be permanently deleted unexpectedly!
rm -rf "$TMPDIR"
```
The example above assumes one GPU on one node, a single processing step, and no intermediate data stored on GPFS. Any intermediate data would also need to be copied to the local drive and cleaned up, and any results your payload writes under `$TMPDIR` must be copied back to GPFS before the cleanup step, as in the sketch below.
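A minimal sketch of persisting results before cleanup, assuming the payload writes its output under a hypothetical `$TMPDIR/outputs` directory and that `$RESULTS_DIRECTORY` stands in for your own GPFS project or scratch path:

```shell
# Hypothetical example: copy outputs back to GPFS before deleting the local directory.
# $TMPDIR/outputs and $RESULTS_DIRECTORY are placeholders; substitute your own paths.
mkdir -p "$RESULTS_DIRECTORY"
cp -r "$TMPDIR/outputs" "$RESULTS_DIRECTORY/"

# Only then clean up the local drive, as in the script above.
rm -rf "$TMPDIR"
```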