# NHC Install and Configuration

[NHC Docs](https://lbnl-node-health-check.readthedocs.io/en/latest/README.html) on https://readthedocs.io

[Node Health Check (NHC)](https://github.com/mej/nhc) will replace the BrightCM health checker. Its primary purpose is to determine whether a compute node is healthy enough to run jobs. Checks can include hardware validation (does the kernel report the expected number of CPU cores?), file system availability, free capacity, available memory, expected processes running, and so on.
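
For a sense of what those rules look like, here is an illustrative excerpt in `nhc.conf` syntax. The check names are real NHC checks, but the node match and thresholds below are placeholders rather than our actual configuration:

```shell
# Each rule pairs a node-name match with a check; "*" applies the check to every node.
* || check_hw_cpuinfo 2 24 48           # expected sockets / cores / threads (placeholder values)
* || check_fs_mount_rw -f /scratch      # /scratch is mounted read-write
* || check_fs_free /scratch 5%          # at least 5% free space on /scratch (placeholder threshold)
* || check_ps_service -u root -S sshd   # sshd is running as root; start it if not
```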

A few current scenarios in particular have us looking at NHC:
- Available resources look fine as far as Slurm is concerned, yet new jobs hang while starting
- Possibly the same issue as above, but a new user's SSH session into a compute node hangs; `ps` shows `/usr/bin/lua /usr/share/lmod/lmod/libexec/lmod zsh --initial_load restore` as the process that may be hung
- Node system load is higher than the number of cores, which usually indicates the system isn't performing well for existing or future jobs

Download the RPM to `/data/rc/installers/nhc`

```shell
mkdir -p /data/rc/installers/nhc && cd $_
wget https://github.com/mej/nhc/releases/download/1.4.3/lbnl-nhc-1.4.3-1.el7.noarch.rpm
```

Create a staging area for the upstream repo and the Cheaha config files. I do this under `~/build/nhc` so that files and directories other than our repo can live there too; the git clone of our repo will therefore end up as `~/build/nhc/nhc`.

```shell
mkdir -p ~/build/nhc
cd ~/build/nhc
git clone git@gitlab.rc.uab.edu:rc/nhc.git
```

There are 2 config files that will be deployed to the compute nodes:
- [/etc/nhc/nhc.conf](nhc.conf) : Contains the rules used by NHC to verify the health of the node
- [/etc/sysconfig/nhc](nhc.etc.sysconfig) : Environment variables needed by NHC (ex: `$PATH`, `$SLURMHOMEDIR`, ...)
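
For reference, `/etc/sysconfig/nhc` typically just sets a handful of shell variables that NHC reads at startup. A rough sketch with placeholder values (see [nhc.etc.sysconfig](nhc.etc.sysconfig) for the real contents):

```shell
# Illustrative values only -- not our production settings
PATH=/sbin:/usr/sbin:/bin:/usr/bin   # plus any site-specific bin directories (e.g. Slurm's)
TIMEOUT=300                          # watchdog timeout, in seconds, for the whole NHC run
DEBUG=0                              # set to 1 for verbose logging
MARK_OFFLINE=1                       # allow NHC to drain/online nodes via the resource manager
```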

I have created two test files on `/data` and `/scratch` that are used by the `check_file_test` check in the [nhc.conf](nhc.conf) file. They verify that a file on GPFS is readable. This only has to be done once (i.e. it's already done), unless we decide to run `check_file_test` against other file systems or additional paths on the same file system.

```shell
sudo touch /data/.nhc-test /scratch/.nhc-test
```
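
The corresponding rule in [nhc.conf](nhc.conf) looks something like the following sketch (the real rule and any extra test operators may differ):

```shell
# Fail the node if either GPFS test file is not readable
* || check_file_test -r /data/.nhc-test /scratch/.nhc-test
```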

(BrightCM Instructions) Install the [NHC RPM](https://github.com/mej/nhc/releases/tag/1.4.3) (currently v1.4.3) in the compute node image(s) and copy the config files. This is already done on the physical compute nodes and their images. If building an image/system not managed by BrightCM, install the appropriate package for the OS (rpm, deb, from source, etc.) using the appropriate framework (Ansible, ...).

```shell
cd /cm/images
for img in $(ls | grep -Ei 'compute'); do 
  echo $img
  sudo yum --installroot=/cm/images/${img} --disablerepo=\* localinstall -y /data/rc/installers/nhc/lbnl-nhc-1.4.3-1.el7.noarch.rpm
  sudo cp ~/build/nhc/nhc/nhc.conf /cm/images/${img}/etc/nhc/
  sudo chown root:root /cm/images/${img}/etc/nhc/nhc.conf
  sudo chmod 644 /cm/images/${img}/etc/nhc/nhc.conf
  sudo cp ~/build/nhc/nhc/nhc.etc.sysconfig /cm/images/${img}/etc/sysconfig/nhc
  sudo chown root:root /cm/images/${img}/etc/sysconfig/nhc
  sudo chmod 644 /cm/images/${img}/etc/sysconfig/nhc
done
cd -
```

(BrightCM Instructions) Instead of waiting for all of our nodes to reboot to pick up the changes, we'll deploy the RPM and config files to the running compute nodes:

  - Install the RPM

    ```shell
    sudo ansible -i /etc/ansible/hosts computenodes -m shell --one-line --forks=20 -a 'yum --disablerepo=\* localinstall -y /data/rc/installers/nhc/lbnl-nhc-1.4.3-1.el7.noarch.rpm'
    ```

  - Copy the config files

    ```shell
    sudo ansible -i /etc/ansible/hosts computenodes --one-line -m copy --forks=40 -a 'src=/cm/images/compute-cm82-el7.9-kernel-3.10.0-1160.42-mlnx-ceph/etc/nhc/nhc.conf dest=/etc/nhc/nhc.conf'
    sudo ansible -i /etc/ansible/hosts computenodes --one-line -m copy --forks=40 -a 'src=/cm/images/compute-cm82-el7.9-kernel-3.10.0-1160.42-mlnx-ceph/etc/sysconfig/nhc dest=/etc/sysconfig/nhc'
    ```

## Slurm Configuration

**NOTE:** NHC doesn't have to use `slurmd` to do its job. It can be run via cron, another framework, or even a different HPC scheduler (PBS, SGE, ...). The following settings are only needed if you want `slurmd` to invoke `nhc-wrapper`. The current configuration on Cheaha does use `slurmd`, hence the changes below.
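
For example, driving it from cron instead of `slurmd` (which we do not do on Cheaha) would look roughly like this hypothetical `/etc/cron.d/` entry:

```shell
# /etc/cron.d/nhc -- run the NHC wrapper every 5 minutes as root (illustrative only)
*/5 * * * *  root  /usr/sbin/nhc-wrapper
```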


  > `HealthCheckProgram`: Fully qualified pathname of a script to execute as user root periodically on all compute nodes that are not in the `NOT_RESPONDING` state. This program may be used to verify the node is fully operational and DRAIN the node or send email if a problem is detected.  Any action to be taken must be explicitly performed by the program (e.g. execute "`scontrol update NodeName=foo State=drain Reason=tmp_file_system_full`" to drain a node).  The execution interval is controlled using the `HealthCheckInterval` parameter.  Note that the `HealthCheckProgram` will be executed at the same time on all nodes to minimize its impact upon parallel programs.  This program will be killed if it does not terminate normally within 60 seconds.  This program will also be executed when the `slurmd` daemon is first started and before it registers with the `slurmctld` daemon.  By default, no program will be executed.
  >
  > `HealthCheckInterval`: The interval in seconds between executions of `HealthCheckProgram`.  The default value is zero, which disables execution.

- Add the following lines to `/etc/slurm/slurm.conf`. **SLURM ADMINISTRATOR ONLY**. I have already performed this step. `slurm.conf`, along with the other files in `/etc/slurm`, is a symlink pointing to a file under the NFS mount `/cm/shared/apps/slurm/var/etc/`. Once the changes are made and the file is saved, all of the Slurm nodes see the change, and the daemons must be told to reread it. (`HealthCheckNodeState=ANY,CYCLE` runs the check on nodes in any state and cycles through the nodes over the interval rather than hitting them all at once.)

  ```shell
  ## 20230424 - MJH - Adding Node Health Check (NHC)
  HealthCheckProgram=/usr/sbin/nhc-wrapper
  HealthCheckInterval=300
  HealthCheckNodeState=ANY,CYCLE
  ```

- Instruct the `slurmd` clients to reread the `slurm.conf` file.

  ```shell
  scontrol_admin reconfigure
  ```
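
- To verify the deployment, NHC can also be run by hand on a compute node, and any node it drains (with the reason it set) shows up in `sinfo`. A quick sanity check, assuming the RPM placed the scripts under `/usr/sbin`:

  ```shell
  # Run the configured checks interactively on a compute node
  sudo /usr/sbin/nhc

  # From a login/management node, list down/drained nodes along with the reason
  sinfo -R
  ```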