diff --git a/README.md b/README.md
index ef2a6cc7f091b8d628a3ca6812941d925925f906..91627495c586baa462ce0737209e0fdf69d584ec 100644
--- a/README.md
+++ b/README.md
@@ -69,6 +69,9 @@ cd -
 
 Slurm Configuration
 
+**NOTE:** NHC doesn’t have to use slurmd to do its job. It can be run via cron, another framework, or even a different HPC scheduler (PBS, SGE…). The following settings are only needed if you want `slurmd` to invoke `nhc-wrapper`. The current configuration on Cheaha does use `slurmd`, hence the changes below.
+
+
 > `HealthCheckProgram`: Fully qualified pathname of a script to execute as user root periodically on all compute nodes that are not in the `NOT_RESPONDING` state. This program may be used to verify the node is fully operational and DRAIN the node or send email if a problem is detected. Any action to be taken must be explicitly performed by the program (e.g. execute "`scontrol update NodeName=foo State=drain Reason=tmp_file_system_full`" to drain a node). The execution interval is controlled using the `HealthCheckInterval` parameter. Note that the `HealthCheckProgram` will be executed at the same time on all nodes to minimize its impact upon parallel programs. This program will be killed if it does not terminate normally within 60 seconds. This program will also be executed when the `slurmd` daemon is first started and before it registers with the `slurmctld` daemon. By default, no program will be executed.
 
 > `HealthCheckInterval`: The interval in seconds between executions of `HealthCheckProgram`. The default value is zero, which disables execution.
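
For illustration only, the two parameters quoted in the diff might be set in `slurm.conf` roughly as follows. This is a minimal sketch: the wrapper path and the 300-second interval are assumptions chosen for the example, not values taken from this change or from the Cheaha configuration.

```conf
# Hypothetical slurm.conf fragment (path and interval are assumed, adjust to your site)

# Fully qualified path of the script slurmd runs as root on each compute node.
HealthCheckProgram=/usr/sbin/nhc-wrapper

# Seconds between runs; the default of 0 disables the health check.
HealthCheckInterval=300
```

Because Slurm kills the program if it has not finished within 60 seconds, whatever checks the wrapper runs would need to complete inside that window.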