As a user, I should have a good experience using `/tmp/` and `$LOCAL`
Problem:
- Node-specific directories (`/tmp/` and `/local`, or `$LOCAL`) have a global namespace by default, so all users place temporary files into the same directory.
- Additionally, `/tmp/` should not hold any research data, or derived research data.
Possible Solutions
/tmp/
Research data should not touch `/tmp/`. It is meant for OS and support application use, not for research data: it is simply too small and not performant enough. We cannot use `chmod o-rwx /tmp/`, or nodes will become unusable by researchers. Ideally, research workflows that use `/tmp/` for large research data should fail as soon as possible, so researchers don't waste time.
We can use PAM modules!
- `pam_tmpdir` - creates a dedicated directory `$HOME/tmp/` for every researcher.
  - Pros:
    - assigns data to researcher quota
  - Cons:
    - unnecessary dependency on GPFS, a network resource
    - may cause unexpected failures of OOD if `data` quota is full
    - no help with `/local/`
- `pam_mktemp` - creates `/tmp/$USER/` for every researcher.
  - Pros:
    - per-user directory, more secure
  - Cons:
    - no change in how fast failure occurs
    - no help with `/local/`
- `pam_namespace` - creates `/tmp/$USER/` for every researcher.
  - Pros:
    - per-user directories for both `/tmp/` and `/local/` (and anywhere else)
  - Cons:
    - no change in how fast failure occurs
- Make `/tmp/` a symbolic link (or bind mount) to `/local/`.
  - Pros:
    - transparent behavior of `/tmp/`
  - Cons:
    - no change in how fast failure occurs
`pam_namespace` provides the most comprehensive solution, but none of the solutions listed above causes fast failure. There is also `pam_setquota`, which can be used to configure a quota for every user and could assist with faster failure. Additionally, Slurm could be configured to create `$SLURM_JOB_ID` subdirectories under `/local/$USER/`.
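To make the `pam_namespace` option more concrete, here is a rough provisioning sketch that writes a candidate `namespace.conf`. It assumes the four-column format documented in `namespace.conf(5)` (polydir, instance prefix, method, exempt users). The function name `write_namespace_conf` and the instance parent directories `/tmp/tmp-inst/` and `/local/local-inst/` are assumptions for illustration. They differ from the `/tmp/$USER/` layout discussed here because `pam_namespace` expects the instance parent to have mode 0000 by default, which would not be workable for `/tmp/` itself.

```shell
#!/bin/sh
# Sketch only: write a candidate namespace.conf to the file given as $1.
# The function name and the *-inst directory names are hypothetical.
write_namespace_conf() {
    cat > "$1" <<'EOF'
# polydir  instance_prefix     method  exempt users
/tmp       /tmp/tmp-inst/      user    root
/local     /local/local-inst/  user    root
EOF
}

# Assumed follow-up steps on each node (not performed by this sketch):
#   - create the instance parents with mode 0000, for example
#     mkdir -p /tmp/tmp-inst && chmod 000 /tmp/tmp-inst
#   - enable the module in the relevant PAM session stack:
#     session required pam_namespace.so
```

With the `user` method, each researcher would get an instance directory named after their username under the instance prefix, bind mounted over `/tmp/` and `/local/` for their session. How this interacts with Slurm-launched jobs would still need to be verified.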
Proposal
- Configure `pam_namespace` to create `/tmp/$USER/` and `/local/$USER/` and make them appear as `/tmp/` and `/local/`, respectively.
- Configure `pam_setquota` to reduce the user-apparent storage in `/tmp/` to cause faster job failure.
- Configure Slurm to create `/local/$USER/$SLURM_JOB_ID/` on job start and delete it after job completion. Override `$TMPDIR` (and related env vars `$TMP` and `$TEMP`) with this path.
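The Slurm part of the proposal could be sketched roughly as follows. This assumes the work is split across Slurm's `Prolog`, `TaskProlog`, and `Epilog` scripts. The helper names (`job_dir`, `prolog_create`, `task_prolog_env`, `epilog_remove`) and the `LOCAL_BASE` variable are hypothetical. The sketch does not address how these paths interact with `pam_namespace`.

```shell
#!/bin/sh
# Sketch only: helper functions for Slurm Prolog, TaskProlog, and Epilog scripts.
# LOCAL_BASE and all function names are hypothetical.
LOCAL_BASE="${LOCAL_BASE:-/local}"

# Print the per-job scratch path for user $1 and job ID $2.
job_dir() {
    printf '%s/%s/%s\n' "$LOCAL_BASE" "$1" "$2"
}

# Prolog (runs as root before the job): create the directory, private to the user.
prolog_create() {
    dir=$(job_dir "$1" "$2")
    mkdir -p "$dir" || return 1
    chmod 700 "$LOCAL_BASE/$1" "$dir"
    # Handing ownership to the job user requires root; skip when run unprivileged.
    if [ "$(id -u)" -eq 0 ]; then
        chown "$1" "$LOCAL_BASE/$1" "$dir"
    fi
}

# TaskProlog: Slurm adds each "export NAME=value" line printed on stdout
# to the environment of the task.
task_prolog_env() {
    dir=$(job_dir "$1" "$2")
    for var in TMPDIR TMP TEMP; do
        echo "export $var=$dir"
    done
}

# Epilog (runs as root after the job): remove the per-job directory.
epilog_remove() {
    # Refuse to run with empty arguments so rm never targets LOCAL_BASE itself.
    [ -n "$1" ] && [ -n "$2" ] || return 1
    dir=$(job_dir "$1" "$2")
    rm -rf -- "$dir"
}
```

In this sketch, a Prolog script would call `prolog_create "$SLURM_JOB_USER" "$SLURM_JOB_ID"`, a TaskProlog script would call `task_prolog_env "$USER" "$SLURM_JOB_ID"`, and an Epilog script would call `epilog_remove "$SLURM_JOB_USER" "$SLURM_JOB_ID"`.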