EPIC: A/B-style migration user testing
Background
I will defer to @jpr for corrections on anything I've missed or gotten wrong. Here is my understanding of the plan for the researcher-facing part of migration. We want the process to be as minimally invasive as possible, but we recognize we can't do an all-at-once migration and expect an issue-free experience.
We are looking to migrate a subset of our researchers from GPFS4 to GPFS5 in advance of other users. The researchers we migrate first would ideally be those who have shown a willingness to collaborate, are technologically savvy, and are frequent users of our system, who will help us identify issues while also being understanding of the steps and time required to resolve them. Matt is working to identify candidates to communicate with.
The A/B-style testing will necessarily split our user base into two groups: an "A" group who is migrated to the new GPFS5 platform, and a "B" group who remains with GPFS4. Users in group A will be on GPFS5 and should not be able to interact with GPFS4 in any way. Likewise, users in group B will be on GPFS4 and should not be able to interact with GPFS5 in any way. For the group A users to have a minimally-invasive user experience, all data they interact with should be migrated entirely to GPFS5 (with some potential, carefully communicated exceptions).
Proposed Strategy
Mirror all services except the Slurm Controller and Queue. Original services are accessible only by group B; mirrored services are accessible only by group A. The Slurm Controller and Queue are accessible by both groups, with separate partitions for each.
Formal Invariants
The overall requirement can be expressed more formally as a pair of invariants:
- Group A cannot interact with GPFS4
- Group B cannot interact with GPFS5
Data Access
Preventing inappropriate access to data (i.e., group A accessing GPFS4 and group B accessing GPFS5) requires tracing each possible data transmission path, starting with how users interface with the Cheaha platform. It is critical that users not be allowed to access their counterpart interfaces: Group A can access Login A, OOD A, and Globus A only, and likewise for B with the B services. There are three primary interfaces through which users can access data on Cheaha:
- SSH to the login node (`login004`, cheaha.rc.uab.edu).
- HTTPS via browser to Open OnDemand (OOD) (hosted on `login005`, https://rc.uab.edu).
- HTTPS via Globus to Data Transfer Nodes (hosted on DTNs, https://app.globus.org).
Starting at the login node:
- users can directly access GPFS data via CLI.
- users can initiate transfers between the login node and other locations on the internet.
- users can submit jobs to Slurm to gain access to compute nodes.
- users can ssh from Login A to Login B and vice-versa. This must be prevented.
Starting at OOD:
- users can directly access GPFS data via the OOD File Browser.
- users can emulate SSH access to a particular login node via the OOD Terminal. The OOD Terminal for Group A must connect only to Login A.
- users can submit jobs to Slurm to gain access to compute nodes.
Continuing from Slurm job access to compute nodes:
- users can directly access GPFS data via CLI.
- users can initiate transfers between the compute nodes and other locations on the internet.
- users can ssh from Compute A to Login B and from Compute B to Login A. This must be prevented.
Starting at Globus:
- users can directly access GPFS data via the Globus File Manager.
- users can initiate transfers between GPFS endpoints and other endpoints.
- users can initiate transfers between Globus A and Globus B. This must be prevented.
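The access paths listed above can be checked mechanically against the two invariants. The following is a minimal sketch, assuming the paths are modeled as a directed graph; the node labels ("Partitions A", "Compute A", and so on) are illustrative names taken from this epic rather than real hostnames, and the shared Slurm Controller is represented only through its per-group partitions.

```python
from collections import deque


def reachable(edges, start):
    """Return every node reachable from `start` by following directed edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def side(group, storage):
    """Intended data paths for one side of the mirror (labels are illustrative)."""
    return {
        f"Login {group}": {storage, f"Partitions {group}"},
        f"OOD {group}": {storage, f"Login {group}", f"Partitions {group}"},
        f"Globus {group}": {storage},
        f"Partitions {group}": {f"Compute {group}"},
        f"Compute {group}": {storage},
    }


# Paths that should exist once the split is complete.
ALLOWED = {**side("A", "GPFS5"), **side("B", "GPFS4")}

# Paths called out above as "must be prevented".
CROSSINGS = [
    ("Login A", "Login B"), ("Login B", "Login A"),
    ("Compute A", "Login B"), ("Compute B", "Login A"),
    ("Globus A", "Globus B"), ("Globus B", "Globus A"),
]


def with_edges(edges, extra):
    """Return a copy of `edges` with the extra (source, target) pairs added."""
    merged = {node: set(targets) for node, targets in edges.items()}
    for src, dst in extra:
        merged.setdefault(src, set()).add(dst)
    return merged


def violations(edges):
    """List (entry point, forbidden filesystem) pairs that break an invariant."""
    found = []
    for group, forbidden in (("A", "GPFS4"), ("B", "GPFS5")):
        for interface in ("Login", "OOD", "Globus"):
            entry = f"{interface} {group}"
            if forbidden in reachable(edges, entry):
                found.append((entry, forbidden))
    return found
```

Calling `violations(ALLOWED)` returns an empty list, while adding any of the `CROSSINGS` edges back with `with_edges` reports which user interfaces would then reach the wrong filesystem. A check like this could be rerun whenever a new path is identified.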
Known Required Steps and Changes
The following prerequisites must be met to fully separate group A from GPFS4:
- Identify early adopters.
- Cluster users via project spaces at gpfs5-clustering.
- Build a user account migration process.
- See below.
- Build a data migration process.
- TBD
- Build a login node dedicated to group A (Login A). Covers SSH access to data from the internet.
- Prevent access to Login B from Login A, and prevent access to Login A from Login B. Covers SSH access between login nodes.
- Build SSH redirection from Login B to Login A.
- #543 (closed)
- Group A users must have SSH routed from `login004` to new login node.
- Consider using Ballast software to automate group to node mapping.
- Build separate OOD experience for group A (OOD A). Covers HTTPS access to data.
- #542 (closed)
- TBD
- Build separate Slurm experience for group A (Slurm A).
- Dedicate nodes in each partition for use exclusively by Group A (Compute A).
- Compute A will reside in DCBlox.
- Compute B will reside in TIC.
- Prevent access to Login B from Compute A, and prevent access to Login A from Compute B. Covers SSH access to data from compute nodes.
- Create mirrored Slurm partitions for use exclusively by Group A (Partitions A).
- Mirrored partitions should have identical names plus an appended, static string like `_A`.
- Group A partitions must have only Group A nodes, and not Group B nodes.
- Group A partitions must be accessible only by Group A users, and not by Group B users.
- Group B partitions are the inverse of Group A partitions (no action required).
- The ScienceDMZ partition must be considered as a special case because internet access is routed through DTNs.
- Automate Slurm partition selection with all methods of Slurm access: `sbatch`, `srun`, `salloc`. Covers Slurm Job access to data from compute nodes.
- #544 (closed)
- Research Job Submit Plugins to intercept jobs and modify their options.
- The plugin should take a submitted job and, if the "partition" option was specified, append `_A`. If the partition option was not specified, it should be added using the default value with `_A` appended.
- Build separate Globus experience for group A (Globus A). Covers Globus access to data.
- TBD
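The partition-selection rule from the Slurm steps above can be sketched as follows. This illustrates the rewriting logic only: an actual Slurm job submit plugin would be written in C or Lua, and the Python here is not that plugin. The function names, the `in_group_a` flag, and the example partition names are hypothetical. Two behaviors are assumptions beyond the text above: a comma-separated partition list is rewritten element by element, and a name that already ends in `_A` is left unchanged so the suffix is never appended twice.

```python
GROUP_A_SUFFIX = "_A"  # static suffix proposed above for mirrored partitions


def mirror_partition(name, suffix=GROUP_A_SUFFIX):
    """Return the Group A mirror of a partition name.

    Assumption: a name already carrying the suffix is returned as-is.
    """
    return name if name.endswith(suffix) else name + suffix


def select_partitions(requested, default_partition, in_group_a,
                      suffix=GROUP_A_SUFFIX):
    """Rewrite a job's partition option for Group A users.

    requested         -- value of the job's "partition" option, or None if unset
    default_partition -- the cluster's default partition name
    in_group_a        -- True when the submitting user belongs to Group A
    """
    if not in_group_a:
        # Group B jobs are left untouched (no action required).
        return requested

    # Split a comma-separated partition list and drop empty entries.
    names = [p.strip() for p in (requested or "").split(",") if p.strip()]
    if not names:
        # No partition requested: fall back to the default, then mirror it.
        names = [default_partition]
    return ",".join(mirror_partition(p, suffix) for p in names)
```

For example, with hypothetical partitions `express` and `medium`, a Group A job requesting `express` would be routed to `express_A`, and a Group A job with no partition option would be routed to the mirrored default, `medium_A`.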
User Migration Diagram
- User interfaces are shown in green.
- Possible hardware data paths are shown in blue.
- Slurm components are shown in purple.
- The red line indicates the "mirror" which no arrows should cross.
- Orange lines cross the "mirror" and the indicated actions need to be prevented.
User Migration Process
One possible user migration process:
- Store current account state. (transparent)
- Account state transitions to "Hold" to prevent access to Cheaha. (visible, need to communicate)
- Assign users to a new Linux group. (transparent)
- Migrate user data from GPFS4 to GPFS5. (transparent)
- Restore previous account state. (visible, need to communicate)
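The ordering of these steps can be sketched as a single function. This is a rough illustration under stated assumptions, not the actual account tooling: the `Account` class, the `migrate_data` and `notify` callbacks, the "Ok" state name, and the `gpfs5` group name are all hypothetical. One further assumption is that a failed data migration leaves the account on "Hold" rather than restoring access.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    username: str
    state: str                              # "Hold" is from the process above; other names are assumed
    groups: set = field(default_factory=set)


def migrate_user(account, new_group, migrate_data, notify):
    """Run the migration steps in order for one account.

    migrate_data(username) -- hypothetical callback that copies GPFS4 data to GPFS5
    notify(username, event) -- hypothetical callback for user-visible communication
    """
    saved_state = account.state             # 1. store current account state (transparent)

    account.state = "Hold"                  # 2. block access to Cheaha (visible)
    notify(account.username, "hold")

    account.groups.add(new_group)           # 3. assign the new Linux group (transparent)

    # 4. migrate data; if this raises, the account stays on "Hold" (assumption)
    migrate_data(account.username)

    account.state = saved_state             # 5. restore previous account state (visible)
    notify(account.username, "restored")
    return account
```

Keeping the saved state in a local variable means the account is only released after the data copy returns successfully, which matches the intent that users never see a partially migrated home.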