README

The srun command is a means of allocating jobs using the Slurm scheduler. It has two modes: job allocation and job step allocation. See the srun documentation for Slurm 18.08.9 for full details.

  • For Job Allocation: When used at the terminal, srun schedules a new job and blocks terminal execution. This mode is rarely useful because the job is terminated if the terminal process dies or is killed (for example, if your internet connection drops). It can be useful for quick, ad-hoc tests using the --pty flag to get an interactive terminal in a job context. We recommend against using srun for job allocation and instead recommend OOD HPC Desktop jobs for increased robustness.

  • For Job Step Allocation: When used as a command in an sbatch script, srun starts a job step with the requested resources and blocks execution of the remainder of the script. If you need to avoid blocking the script, append & to the srun command (commonly combined with nohup) to run it in the background so script execution continues; see the sketch just below this list.
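
For a quick interactive test (job allocation), something like srun --ntasks=1 --mem=1G --time=00:10:00 --pty /bin/bash can be used; the resource values here and below are placeholders, not recommendations. A minimal sketch of job step allocation inside an sbatch script might look like the following:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:10:00

# Blocking job step: the script waits here until hostname finishes.
srun --ntasks=1 hostname

# Non-blocking job step: nohup plus & backgrounds the step so the
# script continues immediately; wait pauses until all steps complete.
nohup srun --ntasks=1 hostname &
wait
```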

Relevant srun Flag Reference

The following srun flags are relevant in this context (a combined example appears after the list):

  • --distribution:
    • How to distribute tasks on allocated hardware.
    • Defaults to block.
    • Not needed for this use case.
  • --exclusive: Makes the allocated resources exclusive to this srun job step; no other job steps can use these resources while it is running.
  • --nodes: Total number of nodes to allocate for this srun job step.
  • --ntasks:
    • Total number of tasks to allocate for this srun job step.
    • This is a total across all nodes, not a per-node count; use --ntasks-per-node to set tasks per node.
  • --cpus-per-task:
    • Number of CPUs to allocate for each task.
    • Total CPUs for this srun job step is --ntasks * --cpus-per-task.
  • --cpu-bind:
    • Takes one or two options, separated by a comma. The first option is either verbose or quiet (default). The second option is a binding type. The default binding type is sufficient for most jobs.
    • Use --cpu-bind=verbose for helpful logging information about which CPUs are bound to this srun job step.
  • --mem: How much memory to allocate per node.
  • --mem-per-cpu: How much memory to allocate per CPU.
  • --mem-bind:
    • Takes one or two options, separated by a comma. The first option is either verbose or quiet (default). The second option is a binding type.
    • Does not appear to have any effect on our system.
  • --export=ALL: Export all environment variables to the srun environment. This should be the default behavior, but you may need to use the flag explicitly.
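
As an illustration of how these flags combine in a single job step (the values are placeholders, not recommendations), an srun command inside an sbatch script might look like:

```bash
# One job step: one node, one task, one CPU, 1 GB per CPU.
# --exclusive keeps other job steps off these resources,
# --cpu-bind=verbose logs the CPU binding, and --export=ALL
# forwards the current environment to the step.
srun --nodes=1 --ntasks=1 --cpus-per-task=1 --mem-per-cpu=1G \
     --exclusive --cpu-bind=verbose --export=ALL hostname
```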

Putting It All Together

See script.sh for a "Hello World!" style example of how to use srun job steps in a for loop within sbatch to run many similar tasks in parallel. We typically recommend using the sbatch --array flag for this type of execution, but there are niche cases where knowledge of this style of parallel execution is helpful, such as with Quantum Espresso's uspex command.

The sbatch job runs across three nodes, using one CPU per node. The job steps are all started together in the for loop because each srun command is run in the background with nohup and &, so script execution continues after each step is launched. Note the use of $SLURM_* environment variables to pass the sbatch options through to the srun job steps for consistency. A sketch of the pattern is shown below.
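
The script.sh in this repository is the authoritative example; the following is only a rough sketch of the same pattern, with placeholder values, to show how the pieces fit together:

```bash
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=00:10:00

# Start one backgrounded job step per task so the loop does not block.
# The $SLURM_* variables pass the sbatch settings through to each step.
for i in $(seq 1 "$SLURM_NTASKS"); do
    nohup srun --nodes=1 --ntasks=1 \
        --cpus-per-task="$SLURM_CPUS_PER_TASK" \
        --cpu-bind=verbose --exclusive hostname &
done

# Wait for all backgrounded job steps to finish before the job ends.
wait
```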

The script should produce output that looks something like the following. Note that the output order is not sequential because the srun job steps run in parallel.

```
cpu-bind=NULL - c0220, task  0  0 [38051]: mask 0x200
c0220
cpu-bind=NULL - c0222, task  0  0 [34410]: mask 0x40
cpu-bind=NULL - c0221, task  0  0 [92378]: mask 0x80
c0222
c0221
```