Commit 25d51904 authored by Mike Hanby

Update README.md

parent 2e4680bc
@@ -69,6 +69,9 @@ cd -
Slurm Configuration
**NOTE:** NHC doesn’t need `slurmd` to do its job. It can be run via cron, another framework, or even a different HPC scheduler (PBS, SGE…). The following settings are only needed if you want `slurmd` to invoke `nhc-wrapper`. The current configuration on Cheaha does use `slurmd`, hence the changes below.
> `HealthCheckProgram`: Fully qualified pathname of a script to execute as user root periodically on all compute nodes that are not in the `NOT_RESPONDING` state. This program may be used to verify the node is fully operational and DRAIN the node or send email if a problem is detected. Any action to be taken must be explicitly performed by the program (e.g. execute "`scontrol update NodeName=foo State=drain Reason=tmp_file_system_full`" to drain a node). The execution interval is controlled using the `HealthCheckInterval` parameter. Note that the `HealthCheckProgram` will be executed at the same time on all nodes to minimize its impact upon parallel programs. This program will be killed if it does not terminate normally within 60 seconds. This program will also be executed when the `slurmd` daemon is first started and before it registers with the `slurmctld` daemon. By default, no program will be executed.
> `HealthCheckInterval`: The interval in seconds between executions of `HealthCheckProgram`. The default value is zero, which disables execution.
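
As a rough illustration of how the two parameters quoted above fit together, a `slurm.conf` fragment might look like the sketch below. The install path of `nhc-wrapper` and the 300-second interval are assumptions chosen for the example, not values taken from the Cheaha configuration; adjust both to match the local installation.

```
# Sketch only: the path and the interval are assumed values, not Cheaha's.
# slurmd runs this program as root on every node that is not NOT_RESPONDING.
HealthCheckProgram=/usr/sbin/nhc-wrapper
# Seconds between runs; the default of 0 disables execution.
HealthCheckInterval=300
```

Because `slurmd` kills the program if it has not finished within 60 seconds, the checks it runs should be kept well under that limit.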