Manage tiering strategy to ensure GPFS + Ceph usage does not exceed entitlements
Problem
Currently, on GPFS 5, quotas apply only to the GPFS tier, not the Ceph tier. If data is moved from GPFS to Ceph, the GPFS quota remains unchanged, effectively granting researchers additional total storage, which may exceed their entitlement.
Solution
Combine metadata from GPFS and Ceph to ensure total usage does not exceed allowed quota.
Implementation
We will likely need to use GPFS policy runs to gather the data, as there is no real-time mechanism, and no live cache, for the information we need.
Ceph usage per file is part of the policy run data, in a separate column from GPFS usage. Stub files have 0 GPFS usage and >0 Ceph usage, so we can use a policy run to get the Ceph total and GPFS total, and adjust quotas following the policy run.
Because policy runs take time and consume I/O and compute resources, we'll want to run them periodically, probably once per day around midnight.
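The per-allocation totals from a policy run can be aggregated with a short script. The sketch below assumes a simplified record format of `<path> <gpfs_bytes> <ceph_bytes>` per file, which is an assumption for illustration only; the real policy-run LIST output will need its own parser.

```python
from collections import defaultdict

def aggregate_policy_run(lines):
    """Sum GPFS-resident and Ceph-resident bytes per /data/project/<slug>.

    Assumes each record is "<path> <gpfs_bytes> <ceph_bytes>"; the real
    policy-run output format will differ and needs its own parser.
    """
    totals = defaultdict(lambda: {"gpfs": 0, "ceph": 0})
    for line in lines:
        path, gpfs_bytes, ceph_bytes = line.split()
        slug = path.split("/")[3]  # /data/project/<slug>/...
        # Stub files tiered to Ceph show 0 GPFS bytes and >0 Ceph bytes.
        totals[slug]["gpfs"] += int(gpfs_bytes)
        totals[slug]["ceph"] += int(ceph_bytes)
    return dict(totals)
```

The resulting totals would then feed the quota-adjustment step, e.g. lowering each fileset's GPFS limit by its Ceph-resident bytes.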
Because researchers can (and do) have varying quotas, we need a new data resource to manage the "ground truth" of total allowed quota per shared allocation. We will start with an SQLite DB with appropriate information.
Allocation Ground Truth SQLite DB
We need a data resource to hold the actual allowed quota for each researcher, as we can no longer store that in the GPFS fileset metadata. We will start with an SQLite DB with two related tables.
- Allocation Table (PK is field 1, blazerid)
  - Owner blazerid
  - /data/project/slug
  - Ground-truth quota (bytes)
- Policy Run Table (PK is fields 1 + 2, slug + datetime stamp)
  - /data/project/slug (FK)
  - ISO 8601 datetime stamp of most recent policy run
  - Cached ceph usage of most recent policy run (bytes)
  - Cached gpfs usage of most recent policy run (bytes)
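A minimal sketch of the two-table schema, using Python's stdlib sqlite3. The table and column names here are assumptions, not settled decisions.

```python
import sqlite3

# Assumed table/column names; the real schema may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS allocation (
    blazerid     TEXT PRIMARY KEY,      -- owner blazerid
    slug         TEXT NOT NULL UNIQUE,  -- /data/project/<slug>
    quota_bytes  INTEGER NOT NULL       -- ground-truth quota (bytes)
);
CREATE TABLE IF NOT EXISTS policy_run (
    slug             TEXT NOT NULL REFERENCES allocation(slug),
    run_timestamp    TEXT NOT NULL,     -- ISO 8601 datetime of the policy run
    ceph_usage_bytes INTEGER NOT NULL,
    gpfs_usage_bytes INTEGER NOT NULL,
    PRIMARY KEY (slug, run_timestamp)
);
"""

def init_db(path):
    """Open (or create) the ground-truth DB and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```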
How does this affect shared allocation creation procedures?
We will update the shared allocation process to include the following steps:
- Check whether the blazerid is already in the Allocation Table. If so, don't create the allocation; we need to talk with the requester first.
- If not already in the table, add an appropriate entry to the Allocation Table.
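The steps above can be sketched as one function; the `allocation` table and column names are assumptions carried over from the proposed schema.

```python
import sqlite3

def create_allocation(conn, blazerid, slug, quota_bytes):
    """Add a shared allocation entry, refusing duplicates.

    If the blazerid already has an entry, we stop and talk with the
    requester instead of creating the allocation.
    """
    exists = conn.execute(
        "SELECT 1 FROM allocation WHERE blazerid = ?", (blazerid,)
    ).fetchone()
    if exists:
        raise ValueError(f"{blazerid} already has an allocation; talk with the requester")
    conn.execute(
        "INSERT INTO allocation (blazerid, slug, quota_bytes) VALUES (?, ?, ?)",
        (blazerid, slug, quota_bytes),
    )
    conn.commit()
```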
How does this affect shared allocation quota change procedures?
Update the appropriate entry in the table with the new quota.
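A minimal sketch of the quota-change update, again assuming the proposed `allocation` table names.

```python
import sqlite3

def change_quota(conn, slug, new_quota_bytes):
    """Set a new ground-truth quota for an existing allocation."""
    cur = conn.execute(
        "UPDATE allocation SET quota_bytes = ? WHERE slug = ?",
        (new_quota_bytes, slug),
    )
    if cur.rowcount != 1:
        raise LookupError(f"no allocation entry for {slug}")
    conn.commit()
```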
What about DB backups?
Let's co-locate the DB with the user reg DB and back up that directory to LTS periodically.
How do researchers know their quota and usage?
We need to build a script that queries the database. Given a /data/project/slug as input, it returns the following. Default output should be human-readable in some reasonable way.
- Quota allowed TiB
- Cached ceph usage TiB
- Cached gpfs usage TiB
- Unused quota TiB
- ISO 8601 datetime stamp of last update
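The human-readable output could look like the sketch below; the exact labels and formatting are placeholders.

```python
TIB = 1024 ** 4  # bytes per TiB

def format_report(slug, quota_bytes, ceph_bytes, gpfs_bytes, run_stamp):
    """Render the researcher-facing usage report from cached DB values.

    Fields mirror the list above: quota, cached ceph/gpfs usage, unused
    quota (all in TiB), and the ISO 8601 stamp of the last policy run.
    """
    unused = quota_bytes - ceph_bytes - gpfs_bytes
    return "\n".join([
        f"Allocation:        /data/project/{slug}",
        f"Quota allowed:     {quota_bytes / TIB:.2f} TiB",
        f"Cached ceph usage: {ceph_bytes / TIB:.2f} TiB",
        f"Cached gpfs usage: {gpfs_bytes / TIB:.2f} TiB",
        f"Unused quota:      {unused / TIB:.2f} TiB",
        f"Last update:       {run_stamp}",
    ])
```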
We will need to document this command in the RF docs. The mmlsquota command will be inaccurate, so steer researchers away from it; the only accurate source will be our script.
How do we know overall quotas and usages?
We also need a script that gives us a table of the same information as above, for every allocation. Researchers should not have access to this script. Basically, it loops over the same underlying data as the researcher script. Include --csv, --json, and --yaml output formats, please!
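The output-format handling could be as simple as the sketch below. Row keys are placeholders, and the YAML is emitted by hand here purely to keep the sketch stdlib-only; a real script might use a YAML library.

```python
import csv
import io
import json

def render_table(rows, fmt="csv"):
    """Render per-allocation rows as csv, json, or (hand-rolled) yaml.

    `rows` is a list of dicts sharing the same keys, e.g. the fields
    from the researcher-facing report.
    """
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "yaml":
        lines = []
        for row in rows:
            prefix = "- "
            for key, value in row.items():
                lines.append(f"{prefix}{key}: {value}")
                prefix = "  "  # subsequent keys indent under the list item
        return "\n".join(lines)
    # Default: csv
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```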