Update README.md

2e4680bc · Mike Hanby · 561de532 · 2e4680bc
Commit 2e4680bc authored 2 years ago by Mike Hanby
--- a/README.md
+++ b/README.md
@@ -16,12 +16,11 @@ mkdir -p /data/rc/installers/nhc && cd $_
 wget https://github.com/mej/nhc/releases/download/1.4.3/lbnl-nhc-1.4.3-1.el7.noarch.rpm
 ```
-Create a staging area for the upstream repo and the Cheaha config files
+Create a staging area for the upstream repo and the Cheaha config files. I do this under `~/build/nhc` so that the directory can hold files and directories other than our repo; the git clone of our repo therefore ends up at `~/build/nhc/nhc`.
 ```shell
 mkdir -p ~/build/nhc
 cd ~/build/nhc
-wget https://github.com/mej/nhc/releases/download/1.4.3/lbnl-nhc-1.4.3-1.el7.noarch.rpm
 git clone git@gitlab.rc.uab.edu:rc/nhc.git
 ```
@@ -29,13 +28,13 @@ There are 2 config files that will be deployed to the compute nodes:
 - [/etc/nhc/nhc.conf](nhc.conf) : Contains the rules used by NHC to verify the health of the node
 - [/etc/sysconfig/nhc](nhc.etc.sysconfig) : Environment variables needed by NHC (ex: `$PATH`, `$SLURMHOMEDIR`, ...)
-I have created two test files on `/data` and `/scratch` that are used by the `check_file_test` test in the [nhc.conf](nhc.conf) file. These will be used as a test to ensure that a file on GPFS is readable.
+I have created two test files on `/data` and `/scratch` that are used by the `check_file_test` test in the [nhc.conf](nhc.conf) file. These will be used as a test to ensure that a file on GPFS is readable. This only has to be done once (I have already done it), unless we decide to run `check_file_test` against other file systems or other paths on the same file system.
 ```shell
 sudo touch /data/.nhc-test /scratch/.nhc-test
 ```
-Install the [NHC RPM](https://github.com/mej/nhc/releases/tag/1.4.3) in the compute node image(s) (currently v1.4.3) and copy the config files.
+(BrightCM Instructions) Install the [NHC RPM](https://github.com/mej/nhc/releases/tag/1.4.3) in the compute node image(s) (currently v1.4.3) and copy the config files. This is already done on the physical compute nodes and their images. If building an image/system not managed by BrightCM, install the appropriate package for the OS (rpm, deb, from source, etc.) using the appropriate framework (Ansible, ...).
 ```shell
 cd /cm/images
@@ -52,35 +51,38 @@ done
 cd -
 ```
-Deploy the RPM and Config files to the running compute nodes
+(BrightCM Instructions) Instead of waiting for all of our nodes to reboot to pick up the changes, we'll deploy the RPM and config files to the running compute nodes:
- Install the RPM
+  - Install the RPM
-  ```shell
+    ```shell
-  sudo ansible -i /etc/ansible/hosts computenodes -m shell --one-line --fork=20 -a 'yum --disablerepo=\* localinstall -y /data/rc/installers/nhc/lbnl-nhc-1.4.3-1.el7.noarch.rpm' 
+    sudo ansible -i /etc/ansible/hosts computenodes -m shell --one-line --fork=20 -a 'yum --disablerepo=\* localinstall -y /data/rc/installers/nhc/lbnl-nhc-1.4.3-1.el7.noarch.rpm' 
-  ```
+    ```
- Copy the config files
+  - Copy the config files
-  ```shell
+    ```shell
-  sudo ansible -i /etc/ansible/hosts computenodes --one-line -m copy --forks=40 -a 'src=/cm/images/compute-cm82-el7.9-kernel-3.10.0-1160.42-mlnx-ceph/etc/nhc/nhc.conf dest=/etc/nhc/nhc.conf'
+    sudo ansible -i /etc/ansible/hosts computenodes --one-line -m copy --forks=40 -a 'src=/cm/images/compute-cm82-el7.9-kernel-3.10.0-1160.42-mlnx-ceph/etc/nhc/nhc.conf dest=/etc/nhc/nhc.conf'
-  sudo ansible -i /etc/ansible/hosts computenodes --one-line -m copy --forks=40 -a 'src=/cm/images/compute-cm82-el7.9-kernel-3.10.0-1160.42-mlnx-ceph/etc/sysconfig/nhc dest=/etc/sysconfig/nhc'
+    sudo ansible -i /etc/ansible/hosts computenodes --one-line -m copy --forks=40 -a 'src=/cm/images/compute-cm82-el7.9-kernel-3.10.0-1160.42-mlnx-ceph/etc/sysconfig/nhc dest=/etc/sysconfig/nhc'
-  ```
+    ```
-Add the following lines to `/etc/slurm/slurm.conf`
+Slurm Configuration
-> `HealthCheckProgram`: Fully qualified pathname of a script to execute as user root periodically on all compute nodes that are not in the `NOT_RESPONDING` state. This program may be used to verify the node is fully operational and DRAIN the node or send email if a problem is detected.  Any action to be taken must be explicitly performed by the program (e.g. execute "`scontrol update NodeName=foo State=drain Reason=tmp_file_system_full`" to drain a node).  The execution interval is controlled using the `HealthCheckInterval` parameter.  Note that the `HealthCheckProgram` will be executed at the same time on all nodes to minimize its impact upon parallel programs.  This program will be  killed  if it does not terminate normally within 60 seconds.  This program will also be executed when the `slurmd` daemon is first started and before it registers with the `slurmctld` daemon.  By default, no program will be executed.
-> `HealthCheckInterval`: The interval in seconds between executions of `HealthCheckProgram`.  The default value is zero, which disables execution.
-```shell
+  > `HealthCheckProgram`: Fully qualified pathname of a script to execute as user root periodically on all compute nodes that are not in the `NOT_RESPONDING` state. This program may be used to verify the node is fully operational and DRAIN the node or send email if a problem is detected.  Any action to be taken must be explicitly performed by the program (e.g. execute "`scontrol update NodeName=foo State=drain Reason=tmp_file_system_full`" to drain a node).  The execution interval is controlled using the `HealthCheckInterval` parameter.  Note that the `HealthCheckProgram` will be executed at the same time on all nodes to minimize its impact upon parallel programs.  This program will be  killed  if it does not terminate normally within 60 seconds.  This program will also be executed when the `slurmd` daemon is first started and before it registers with the `slurmctld` daemon.  By default, no program will be executed.
-## 20230424 - MJH - Adding Node Health Check (NHC)
+  > `HealthCheckInterval`: The interval in seconds between executions of `HealthCheckProgram`.  The default value is zero, which disables execution.
-HealthCheckProgram=/usr/sbin/nhc-wrapper
-HealthCheckInterval=300
-HealthCheckNodeState=ANY,CYCLE
-```
-Instruct the `slurmd` clients to reread the `slurm.conf` file
+- Add the following lines to `/etc/slurm/slurm.conf`. **SLURM ADMINISTRATOR ONLY**. I have already performed this step. `slurm.conf` and the other files in `/etc/slurm` are symlinks that point to files under the NFS mount `/cm/shared/apps/slurm/var/etc/`. Once the changes are made and the file is saved, all of the Slurm nodes see the change, and the daemons must then be reloaded.
-```shell
+  ```shell
-scontrol_admin reconfigure
+  ## 20230424 - MJH - Adding Node Health Check (NHC)
-```
+  HealthCheckProgram=/usr/sbin/nhc-wrapper
\ No newline at end of file
+  HealthCheckInterval=300
+  HealthCheckNodeState=ANY,CYCLE
+  ```
+- Instruct the `slurmd` clients to reread the `slurm.conf` file.
+  ```shell
+  scontrol_admin reconfigure
+  ```
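
After the reconfigure step it can be useful to confirm that the three `HealthCheck*` lines are actually present and uncommented in the shared `slurm.conf`. The following is a minimal sketch, not part of this commit: `show_healthcheck_settings` is a hypothetical helper name, and the default path is assumed to be the `/etc/slurm/slurm.conf` symlink described above.

```shell
#!/bin/sh
# Sketch only: list the HealthCheck* settings found in a slurm.conf-style file.
# show_healthcheck_settings is a hypothetical helper, not part of Slurm or NHC.
show_healthcheck_settings() {
    # Assumed default: the slurm.conf path used in this README.
    conf="${1:-/etc/slurm/slurm.conf}"
    # Match uncommented HealthCheck* lines (case-insensitive),
    # tolerating leading whitespace, then strip that whitespace.
    grep -iE '^[[:space:]]*HealthCheck(Program|Interval|NodeState)=' "$conf" \
        | sed 's/^[[:space:]]*//'
}
```

Calling `show_healthcheck_settings` with no argument reads `/etc/slurm/slurm.conf`; pass another path to check a staged copy. This only inspects the file on disk. To see what the running controller has loaded, `scontrol show config | grep -i healthcheck` should report the same three parameters.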