Metrics, logging, observability (Epic)
High-level overview of metrics needed for transparency and accountability to the University.
What is Needed
- Cheaha Data (a sketch of pulling these fields with sacct follows this list)
- CPU allocation (sacct) & usage (sacct, mostly) & capacity (hardware)
- GPU allocation (sacct) & usage (nvidia-smi polling) & capacity (hardware)
- Mem allocation (sacct) & usage (??) & capacity (hardware)
- Job wait time (sacct)
- Job run time (sacct)
- Openstack Data
- CPU allocation (??) & usage (??) & capacity (hardware)
- GPU allocation (??) & usage (nvidia-smi polling) & capacity (hardware)
- Mem allocation (??) & usage (??) & capacity (hardware)
- Other OpenStack-specific resources (floating IPs, etc.)
- Time an allocation is active (??)
- Storage Data
- GPFS5 allocation & usage & capacity
- break down by `/data/user` and `/data/project` and `/scratch`
- Ceph Core allocation & usage & capacity
- break down by `/data/user` and `/data/project` and `/scratch`
- LTS allocation & usage & capacity
- break down by individual allocations and shared allocations
- Account Data
- Cloud Equivalent Costs
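Most of the Cheaha fields above can be pulled in one pass with `sacct`. The following is a minimal sketch, assuming `sacct` is on the PATH and the caller can read the relevant accounts; the date range and field list are illustrative, not a fixed specification.

```python
# Minimal sketch: pull allocation and timing fields from sacct and derive
# job wait time. Field names are standard sacct output columns; AllocTRES
# typically carries GPU allocations as gres/gpu=N when GRES tracking is on.
import subprocess
from datetime import datetime

FIELDS = "JobID,AllocCPUS,ReqMem,AllocTRES,Submit,Start,End,Elapsed,State"
TIME_FMT = "%Y-%m-%dT%H:%M:%S"

def fetch_jobs(start_date, end_date):
    """Yield one dict per job-level record between start_date and end_date."""
    out = subprocess.run(
        ["sacct", "--allusers", "--allocations", "--noheader", "--parsable2",
         f"--format={FIELDS}", "--starttime", start_date, "--endtime", end_date],
        capture_output=True, text=True, check=True,
    ).stdout
    names = FIELDS.split(",")
    for line in out.splitlines():
        yield dict(zip(names, line.split("|")))

def wait_seconds(rec):
    """Job wait time = Start - Submit; None for jobs that have not started."""
    if rec["Start"] in ("", "Unknown", "None"):
        return None
    return (datetime.strptime(rec["Start"], TIME_FMT)
            - datetime.strptime(rec["Submit"], TIME_FMT)).total_seconds()

if __name__ == "__main__":
    # Hypothetical date range; Elapsed already gives the run time per job.
    for rec in fetch_jobs("2024-01-01", "2024-02-01"):
        print(rec["JobID"], rec["AllocCPUS"], rec["ReqMem"], wait_seconds(rec))
```

`--parsable2` gives pipe-delimited output without trailing delimiters, which keeps the parsing trivial; `--allocations` limits output to job-level records rather than steps.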
What Views are Needed
- Overall summaries:
- sum, mean, median, as applicable
- Throughput/timeline data
- At each moment in time, what is allocation, usage, capacity?
- Think like a Grafana graph.
- This data is generally not directly available and has to be reconstructed.
- One approach to producing it from, e.g., sacct: https://gitlab.rc.uab.edu/rc-data-science/data-science-internal/sacct-analysis (a minimal sketch follows this list)
- Allocation, usage, capacity weighting based on relative performance
- GPU example: P100 = 1, V100 = 4.3, A100 = 12.9, H200 = 25.8
- Sources
- The same weighting can be applied to CPUs, but relative-performance information may be harder to find (see the weighting sketch after this list).
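For the throughput/timeline view, one way to reconstruct an allocation-over-time series from per-job records (e.g., parsed out of sacct as sketched above) is a sweep over start/end events. A minimal sketch, assuming each record already carries start/end datetimes and an allocated-CPU count; the field layout and toy data are placeholders.

```python
# Minimal sketch: build "allocated CPUs at each moment" from job intervals.
# Each start adds the job's CPUs, each end removes them; sorting the change
# points and accumulating gives a step series suitable for resampling.
from collections import defaultdict
from datetime import datetime

def allocation_timeline(jobs):
    """jobs: iterable of (start, end, alloc_cpus) tuples.
    Returns [(time, allocated_cpus)] at every point where allocation changes."""
    deltas = defaultdict(int)
    for start, end, cpus in jobs:
        deltas[start] += cpus
        deltas[end] -= cpus
    level, series = 0, []
    for t in sorted(deltas):
        level += deltas[t]
        series.append((t, level))
    return series

if __name__ == "__main__":
    # Toy data, not real jobs.
    jobs = [
        (datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 4), 8),
        (datetime(2024, 1, 1, 2), datetime(2024, 1, 1, 6), 16),
    ]
    for t, cpus in allocation_timeline(jobs):
        print(t.isoformat(), cpus)
```

The resulting change points can be resampled onto a regular grid before plotting in Grafana or a similar tool.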
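The relative-performance weighting itself is a simple lookup. A minimal sketch using the example GPU weights above; the inventory counts are hypothetical.

```python
# Minimal sketch: performance-weighted GPU capacity in P100-equivalents,
# using the example weights from the list above.
GPU_WEIGHTS = {"P100": 1.0, "V100": 4.3, "A100": 12.9, "H200": 25.8}

def weighted_gpu_capacity(inventory):
    """inventory: {gpu_model: count}. Returns capacity in P100-equivalents."""
    return sum(GPU_WEIGHTS[model] * count for model, count in inventory.items())

# Hypothetical counts, for illustration only.
print(weighted_gpu_capacity({"P100": 18, "V100": 72, "A100": 80}))
```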
Currently Available Resources
- Hardware information: https://gitlab.rc.uab.edu/rc-data-science/metrics/rc-hardware
- Historical and current sacct data: https://gitlab.rc.uab.edu/rc-data-science/metrics/noctua_db
- Building throughputs from sacct data: https://gitlab.rc.uab.edu/rc-data-science/data-science-internal/sacct-analysis
- Admin metric reporting: https://gitlab.rc.uab.edu/rc-data-science/metrics/admin-facing-reporting
- Cloud costs: https://gitlab.rc.uab.edu/rc-data-science/metrics/third-party-cloud-cost-comparison
- Account history: https://gitlab.rc.uab.edu/rc-data-science/metrics/researcher-registration-date-source
- Current accounts (self reg db):
- Schema and processes: https://gitlab.rc.uab.edu/rc/rabbitmq_agents
- Data: https://gitlab.rc.uab.edu/mhanby/rc-users.git
- Tracking account affiliation/activity: https://gitlab.rc.uab.edu/rc-data-science/metrics/categorize-active-cheaha-accounts
Detailed Notes
- CPU usage from sacct is not ideal.
- CPUTime (and CPUTimeRAW) is Elapsed multiplied by the allocated CPU count, so it is really a form of "allocated time" rather than actual usage. Applies to the Job.
- SystemCPU and UserCPU are actual usages, but only for parent processes, not child processes. If a parent process sits and does nothing, these will be artificially low. Applies to Job Steps.
- Use, e.g., `ps` for polling (see the polling sketch at the end of these notes).
- Mem usage from sacct is not ideal.
- MaxRSS and AveRSS are actual usages, but are polled periodically and can change much faster than sacct can capture.
- Very rarely, sacct will capture an OOM event before the OS terminates the process and report more memory used than allocated.
- Use polling here as well; e.g., `ps` reports per-process RSS (see the polling sketch at the end of these notes).
- GPU usage is not recorded by sacct at all.
- Use, e.g., `nvidia-smi` for polling (see the GPU polling sketch at the end of these notes).
- Gathering polling data on a per-job basis requires more than just monitoring hardware usage.
- Usage must be broken out by PID.
- PIDs must be associated to Jobs and Steps (`scontrol listpids $jobid`)
- Example of augmenting `squeue` with context switching data: https://gitlab.rc.uab.edu/rc/squeue-with-context-switches
- The same principle could be applied to CPU, memory, and GPU usages, recorded periodically (a PID-to-job association sketch is at the end of these notes).
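Sketches for the polling ideas above follow. First, per-PID CPU time and resident memory via `ps`. This is a minimal sketch assuming a procps-ng `ps` that supports the `cputimes` and `rss` output fields; samples would be taken on some interval and aggregated (max, mean, or time-integrated) downstream.

```python
# Minimal sketch: per-PID CPU seconds and RSS (KiB) via ps, as an alternative
# to sacct's SystemCPU/UserCPU and MaxRSS/AveRSS fields.
import subprocess

def poll_ps():
    """Return {pid: (cpu_seconds, rss_kib)} for all visible processes."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=,cputimes=,rss="],  # trailing '=' drops headers
        capture_output=True, text=True, check=True,
    ).stdout
    samples = {}
    for line in out.splitlines():
        pid, cpu_s, rss = line.split()
        samples[int(pid)] = (int(cpu_s), int(rss))
    return samples
```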
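Second, per-process GPU memory via `nvidia-smi`. The compute-apps query reports GPU memory per PID; per-PID utilization would need something like nvidia-smi accounting mode or DCGM, so only memory is shown in this sketch.

```python
# Minimal sketch: per-PID GPU memory usage via nvidia-smi's compute-apps query.
import subprocess

def poll_gpu_processes():
    """Return a list of (pid, gpu_uuid, used_memory_mib) tuples."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,gpu_uuid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = []
    for line in out.splitlines():
        pid, uuid, mem = [field.strip() for field in line.split(",")]
        rows.append((int(pid), uuid, int(mem)))
    return rows
```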
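Third, associating polled PIDs with Jobs and Steps. `scontrol listpids` (run on the compute node, with no argument to cover all local jobs) maps PIDs to JobID/StepID, assuming the usual `PID JOBID STEPID LOCALID GLOBALID` output columns; the second function shows how per-PID samples from the `ps` sketch could be rolled up to jobs.

```python
# Minimal sketch: map PIDs to Slurm jobs/steps and attribute polled usage.
import subprocess

def pid_to_job():
    """Return {pid: (jobid, stepid)} for Slurm-managed processes on this node."""
    out = subprocess.run(
        ["scontrol", "listpids"],
        capture_output=True, text=True, check=True,
    ).stdout
    mapping = {}
    for line in out.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 3:
            pid, jobid, stepid = parts[0], parts[1], parts[2]
            mapping[int(pid)] = (jobid, stepid)
    return mapping

def attribute_usage(ps_samples, mapping):
    """Aggregate per-PID (cpu_seconds, rss_kib) samples up to (jobid, stepid)."""
    totals = {}
    for pid, (cpu_s, rss) in ps_samples.items():
        key = mapping.get(pid)
        if key is None:
            continue  # not a Slurm-managed process
        cpu_total, rss_total = totals.get(key, (0, 0))
        totals[key] = (cpu_total + cpu_s, rss_total + rss)
    return totals
```

Recording these aggregates on a fixed interval would give the per-job usage series that sacct alone cannot provide.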