Metrics, logging, observability (Epic)
High-level overview of metrics needed for transparency and accountability to the University.
What is Needed
- Cheaha Data (a sketch of pulling these fields with sacct follows this list)
- CPU allocation (sacct) & usage (sacct, mostly) & capacity (hardware)
- GPU allocation (sacct) & usage (nvidia-smi polling) & capacity (hardware)
- Mem allocation (sacct) & usage (??) & capacity (hardware)
- Job wait time (sacct)
- Job run time (sacct)
- Openstack Data
- CPU allocation (??) & usage (??) & capacity (hardware)
- GPU allocation (??) & usage (nvidia-smi polling) & capacity (hardware)
- Mem allocation (??) & usage (??) & capacity (hardware)
- Other OpenStack-specific resources (floating IPs, etc.)
- Time an allocation is active (??)
- Storage Data
- GPFS5 allocation & usage & capacity
- break down by `/data/user` and `/data/project` and `/scratch`
- Ceph Core allocation & usage & capacity
- break down by `/data/user` and `/data/project` and `/scratch`
- LTS allocation & usage & capacity
- break down by individual allocations and shared allocations
- Account Data
- Cloud Equivalent Costs
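Most of the Cheaha fields above can be pulled in one pass with `sacct`. The following is a minimal sketch, assuming `sacct` is on the PATH and the caller can read the relevant accounts; the date range and field list are illustrative, not a fixed specification.

```python
# Minimal sketch: pull allocation and timing fields from sacct and derive
# job wait time. Field names are standard sacct output columns; AllocTRES
# typically carries GPU allocations as gres/gpu=N when GRES tracking is on.
import subprocess
from datetime import datetime

FIELDS = "JobID,AllocCPUS,ReqMem,AllocTRES,Submit,Start,End,Elapsed,State"
TIME_FMT = "%Y-%m-%dT%H:%M:%S"

def fetch_jobs(start_date, end_date):
    """Yield one dict per job-level record between start_date and end_date."""
    out = subprocess.run(
        ["sacct", "--allusers", "--allocations", "--noheader", "--parsable2",
         f"--format={FIELDS}", "--starttime", start_date, "--endtime", end_date],
        capture_output=True, text=True, check=True,
    ).stdout
    names = FIELDS.split(",")
    for line in out.splitlines():
        yield dict(zip(names, line.split("|")))

def wait_seconds(rec):
    """Job wait time = Start - Submit; None for jobs that have not started."""
    if rec["Start"] in ("", "Unknown", "None"):
        return None
    return (datetime.strptime(rec["Start"], TIME_FMT)
            - datetime.strptime(rec["Submit"], TIME_FMT)).total_seconds()

if __name__ == "__main__":
    # Hypothetical date range; Elapsed already gives the run time per job.
    for rec in fetch_jobs("2024-01-01", "2024-02-01"):
        print(rec["JobID"], rec["AllocCPUS"], rec["ReqMem"], wait_seconds(rec))
```

`--parsable2` gives pipe-delimited output without trailing delimiters, which keeps the parsing trivial; `--allocations` limits output to job-level records rather than steps.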
What Views are Needed
- Overall summaries:
- sum, mean, median, as applicable
- Throughput/timeline data
- At each moment in time, what is allocation, usage, capacity?
- Think like a Grafana graph.
- This data is generally not directly available and has to be reconstructed.
- One approach to producing it from, e.g., sacct: https://gitlab.rc.uab.edu/rc-data-science/data-science-internal/sacct-analysis (a minimal sketch follows this list)
- Allocation, usage, capacity weighting based on relative performance
- GPU example: P100 = 1, V100 = 4.3, A100 = 12.9, H200 = 25.8
- Sources
- The same weighting can be applied to CPUs, but relative-performance information may be harder to find (see the weighting sketch after this list).
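For the throughput/timeline view, one way to reconstruct an allocation-over-time series from per-job records (e.g., parsed out of sacct as sketched above) is a sweep over start/end events. A minimal sketch, assuming each record already carries start/end datetimes and an allocated-CPU count; the field layout and toy data are placeholders.

```python
# Minimal sketch: build "allocated CPUs at each moment" from job intervals.
# Each start adds the job's CPUs, each end removes them; sorting the change
# points and accumulating gives a step series suitable for resampling.
from collections import defaultdict
from datetime import datetime

def allocation_timeline(jobs):
    """jobs: iterable of (start, end, alloc_cpus) tuples.
    Returns [(time, allocated_cpus)] at every point where allocation changes."""
    deltas = defaultdict(int)
    for start, end, cpus in jobs:
        deltas[start] += cpus
        deltas[end] -= cpus
    level, series = 0, []
    for t in sorted(deltas):
        level += deltas[t]
        series.append((t, level))
    return series

if __name__ == "__main__":
    # Toy data, not real jobs.
    jobs = [
        (datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 4), 8),
        (datetime(2024, 1, 1, 2), datetime(2024, 1, 1, 6), 16),
    ]
    for t, cpus in allocation_timeline(jobs):
        print(t.isoformat(), cpus)
```

The resulting change points can be resampled onto a regular grid before plotting in Grafana or a similar tool.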
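The relative-performance weighting itself is a simple lookup. A minimal sketch using the example GPU weights above; the inventory counts are hypothetical.

```python
# Minimal sketch: performance-weighted GPU capacity in P100-equivalents,
# using the example weights from the list above.
GPU_WEIGHTS = {"P100": 1.0, "V100": 4.3, "A100": 12.9, "H200": 25.8}

def weighted_gpu_capacity(inventory):
    """inventory: {gpu_model: count}. Returns capacity in P100-equivalents."""
    return sum(GPU_WEIGHTS[model] * count for model, count in inventory.items())

# Hypothetical counts, for illustration only.
print(weighted_gpu_capacity({"P100": 18, "V100": 72, "A100": 80}))
```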
Currently Available Resources
- Hardware information: https://gitlab.rc.uab.edu/rc-data-science/metrics/rc-hardware
- Historical and current sacct data: https://gitlab.rc.uab.edu/rc-data-science/metrics/noctua_db
- Building throughputs from sacct data: https://gitlab.rc.uab.edu/rc-data-science/data-science-internal/sacct-analysis
- Admin metric reporting: https://gitlab.rc.uab.edu/rc-data-science/metrics/admin-facing-reporting
- Cloud costs: https://gitlab.rc.uab.edu/rc-data-science/metrics/third-party-cloud-cost-comparison
- Account history: https://gitlab.rc.uab.edu/rc-data-science/metrics/researcher-registration-date-source
- Current accounts (self reg db):
- Schema and processes: https://gitlab.rc.uab.edu/rc/rabbitmq_agents
- Data: https://gitlab.rc.uab.edu/mhanby/rc-users.git
- Tracking account affiliation/activity: https://gitlab.rc.uab.edu/rc-data-science/metrics/categorize-active-cheaha-accounts
Detailed Notes
- CPU usage from sacct is not ideal.
- CPUTime (and CPUTimeRAW) is Elapsed multiplied by the allocated CPU count, so it is really a form of "allocated time" rather than actual usage. Applies to the Job.
- SystemCPU and UserCPU are actual usages, but only for parent processes, not child processes. If a parent process sits and does nothing, these will be artificially low. Applies to Job Steps.
- Use, e.g., `ps` for polling (see the polling sketch at the end of these notes).
- Mem usage from sacct is not ideal.
- MaxRSS and AveRSS are actual usages, but are polled periodically and can change much faster than sacct can capture.
- Very rarely, sacct will capture an OOM event before the OS terminates the process and report more memory used than allocated.
- Use polling here as well; e.g., `ps` reports per-process RSS (see the polling sketch at the end of these notes).
- GPU usage is not recorded by sacct at all.
- Use, e.g., `nvidia-smi` for polling (see the GPU polling sketch at the end of these notes).
- Gathering polling data on a per-job basis requires more than just monitoring hardware usage.
- Usage must be broken out by PID.
- PIDs must be associated to Jobs and Steps (`scontrol listpids $jobid`)
- Example of augmenting `squeue` with context switching data: https://gitlab.rc.uab.edu/rc/squeue-with-context-switches
- The same principle could be applied to CPU, memory, and GPU usages, recorded periodically (a PID-to-job association sketch is at the end of these notes).
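Sketches for the polling ideas above follow. First, per-PID CPU time and resident memory via `ps`. This is a minimal sketch assuming a procps-ng `ps` that supports the `cputimes` and `rss` output fields; samples would be taken on some interval and aggregated (max, mean, or time-integrated) downstream.

```python
# Minimal sketch: per-PID CPU seconds and RSS (KiB) via ps, as an alternative
# to sacct's SystemCPU/UserCPU and MaxRSS/AveRSS fields.
import subprocess

def poll_ps():
    """Return {pid: (cpu_seconds, rss_kib)} for all visible processes."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=,cputimes=,rss="],  # trailing '=' drops headers
        capture_output=True, text=True, check=True,
    ).stdout
    samples = {}
    for line in out.splitlines():
        pid, cpu_s, rss = line.split()
        samples[int(pid)] = (int(cpu_s), int(rss))
    return samples
```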
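Second, per-process GPU memory via `nvidia-smi`. The compute-apps query reports GPU memory per PID; per-PID utilization would need something like nvidia-smi accounting mode or DCGM, so only memory is shown in this sketch.

```python
# Minimal sketch: per-PID GPU memory usage via nvidia-smi's compute-apps query.
import subprocess

def poll_gpu_processes():
    """Return a list of (pid, gpu_uuid, used_memory_mib) tuples."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,gpu_uuid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = []
    for line in out.splitlines():
        pid, uuid, mem = [field.strip() for field in line.split(",")]
        rows.append((int(pid), uuid, int(mem)))
    return rows
```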
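Third, associating polled PIDs with Jobs and Steps. `scontrol listpids` (run on the compute node, with no argument to cover all local jobs) maps PIDs to JobID/StepID, assuming the usual `PID JOBID STEPID LOCALID GLOBALID` output columns; the second function shows how per-PID samples from the `ps` sketch could be rolled up to jobs.

```python
# Minimal sketch: map PIDs to Slurm jobs/steps and attribute polled usage.
import subprocess

def pid_to_job():
    """Return {pid: (jobid, stepid)} for Slurm-managed processes on this node."""
    out = subprocess.run(
        ["scontrol", "listpids"],
        capture_output=True, text=True, check=True,
    ).stdout
    mapping = {}
    for line in out.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 3:
            pid, jobid, stepid = parts[0], parts[1], parts[2]
            mapping[int(pid)] = (jobid, stepid)
    return mapping

def attribute_usage(ps_samples, mapping):
    """Aggregate per-PID (cpu_seconds, rss_kib) samples up to (jobid, stepid)."""
    totals = {}
    for pid, (cpu_s, rss) in ps_samples.items():
        key = mapping.get(pid)
        if key is None:
            continue  # not a Slurm-managed process
        cpu_total, rss_total = totals.get(key, (0, 0))
        totals[key] = (cpu_total + cpu_s, rss_total + rss)
    return totals
```

Recording these aggregates on a fixed interval would give the per-job usage series that sacct alone cannot provide.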