NHC Install and Configuration

NHC Docs on https://readthedocs.io

Node Health Check (NHC) will replace the BrightCM health checker. The primary purpose of this tool is to determine whether a compute node is healthy enough to run jobs. Checks can include hardware validation (does the kernel report the expected number of CPU cores?), file system availability, free file system capacity, available memory, and expected running processes.
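
For example, a few representative checks in an nhc.conf might look like the following. The host pattern * matches every node; the socket/core counts, mount points, and thresholds below are illustrative placeholders, not Cheaha's actual values.

# Hardware: does the kernel report the expected sockets/cores/threads? (illustrative counts)
* || check_hw_cpuinfo 2 64 64
# File systems mounted read-write and not full
* || check_fs_mount_rw -f /data
* || check_fs_free /scratch 5%
# Some physical memory still free
* || check_hw_physmem_free 1gb
# Expected daemons running
* || check_ps_service -u root -S sshd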

A few current scenarios in particular have us looking at NHC:

  • Available resources look fine as far as Slurm is concerned, yet new jobs hang while starting.
  • Possibly the same issue as above, but a new user's SSH session into a compute node hangs; ps shows /usr/bin/lua /usr/share/lmod/lmod/libexec/lmod zsh --initial_load restore as the process that may be hung.
  • Node system load is higher than the number of cores. This is usually an indicator that the system isn't performing well for existing or future jobs (see the load-average check sketched after this list).
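
The load scenario above maps onto NHC's load-average check. A minimal sketch, assuming check_ps_loadavg takes the 1-, 5-, and 15-minute load limits as arguments; the limit here is an illustrative value, not a tuned one:

# Flag the node if the 1-minute load average exceeds 128 (tune to the node's core count)
* || check_ps_loadavg 128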

Download the RPM to /data/rc/installers/nhc

mkdir -p /data/rc/installers/nhc && cd $_
wget https://github.com/mej/nhc/releases/download/1.4.3/lbnl-nhc-1.4.3-1.el7.noarch.rpm
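
Optionally, sanity-check the downloaded package before staging it; this step is not in the original notes, and rpm -qpi simply prints the package metadata:

rpm -qpi lbnl-nhc-1.4.3-1.el7.noarch.rpm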

Create a staging area for the upstream repo and the Cheaha config files

mkdir -p ~/build/nhc
cd ~/build/nhc
wget https://github.com/mej/nhc/releases/download/1.4.3/lbnl-nhc-1.4.3-1.el7.noarch.rpm
git clone git@gitlab.rc.uab.edu:rc/nhc.git

There are two config files that will be deployed to the compute nodes: nhc.conf (installed as /etc/nhc/nhc.conf) and nhc.etc.sysconfig (installed as /etc/sysconfig/nhc).

I have created two test files, on /data and /scratch, that are used by the check_file_test check in the nhc.conf file. They serve as a test to ensure that a file on GPFS is readable.

sudo touch /data/.nhc-test /scratch/.nhc-test
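
The corresponding line in nhc.conf checks that both files are readable, along these lines (the exact entry in the repo's nhc.conf may differ):

* || check_file_test -r /data/.nhc-test /scratch/.nhc-test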

Install the NHC RPM in the compute node image(s) (currently v1.4.3) and copy the config files.

cd /cm/images
for img in $(ls | grep -Ei 'compute'); do 
  echo $img
  sudo yum --installroot=/cm/images/${img} --disablerepo=\* localinstall -y /data/rc/installers/nhc/lbnl-nhc-1.4.3-1.el7.noarch.rpm
  sudo cp ~/build/nhc/nhc/nhc.conf /cm/images/${img}/etc/nhc/
  sudo chown root:root /cm/images/${img}/etc/nhc/nhc.conf
  sudo chmod 644 /cm/images/${img}/etc/nhc/nhc.conf
  sudo cp ~/build/nhc/nhc/nhc.etc.sysconfig /cm/images/${img}/etc/sysconfig/nhc
  sudo chown root:root /cm/images/${img}/etc/sysconfig/nhc
  sudo chmod 644 /cm/images/${img}/etc/sysconfig/nhc
done
cd -
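
A quick way to confirm the package landed in each image (not part of the original procedure; rpm's --root flag queries the RPM database inside the image root):

cd /cm/images
for img in $(ls | grep -Ei 'compute'); do
  sudo rpm --root=/cm/images/${img} -q lbnl-nhc
done
cd -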

Deploy the RPM and Config files to the running compute nodes

  • Install the RPM

    sudo ansible -i /etc/ansible/hosts c0183 -m shell --one-line --forks=20 -a 'yum --disablerepo=\* localinstall -y /data/rc/installers/nhc/lbnl-nhc-1.4.3-1.el7.noarch.rpm'
  • Copy the config files

    sudo ansible -i /etc/ansible/hosts computenodes --one-line -m copy --forks=40 -a 'src=/cm/images/compute-cm82-el7.9-kernel-3.10.0-1160.42-mlnx-ceph/etc/nhc/nhc.conf dest=/etc/nhc/nhc.conf'
    
    sudo ansible -i /etc/ansible/hosts computenodes --one-line -m copy --forks=40 -a 'src=/cm/images/compute-cm82-el7.9-kernel-3.10.0-1160.42-mlnx-ceph/etc/sysconfig/nhc dest=/etc/sysconfig/nhc'
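
  • Optionally, verify NHC runs cleanly on the nodes. This sketch goes beyond the original steps; it assumes the RPM installs the driver script at /usr/sbin/nhc and that a non-zero exit status means a check failed.

    sudo ansible -i /etc/ansible/hosts computenodes -m shell --one-line --forks=40 -a '/usr/sbin/nhc'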

Add the following lines to /etc/slurm/slurm.conf

For reference, from the slurm.conf man page:

HealthCheckProgram: Fully qualified pathname of a script to execute as user root periodically on all compute nodes that are not in the NOT_RESPONDING state. This program may be used to verify the node is fully operational and DRAIN the node or send email if a problem is detected. Any action to be taken must be explicitly performed by the program (e.g. execute "scontrol update NodeName=foo State=drain Reason=tmp_file_system_full" to drain a node). The execution interval is controlled using the HealthCheckInterval parameter. Note that the HealthCheckProgram will be executed at the same time on all nodes to minimize its impact upon parallel programs. This program will be killed if it does not terminate normally within 60 seconds. This program will also be executed when the slurmd daemon is first started and before it registers with the slurmctld daemon. By default, no program will be executed.

HealthCheckInterval: The interval in seconds between executions of HealthCheckProgram. The default value is zero, which disables execution.

## 20230424 - MJH - Adding Node Health Check (NHC)
HealthCheckProgram=/usr/sbin/nhc-wrapper
HealthCheckInterval=300
HealthCheckNodeState=ANY,CYCLE

Instruct the slurmd clients to reread the slurm.conf file

scontrol_admin reconfigure
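
To confirm that slurmctld picked up the new settings, an extra sanity check (not in the original notes) is to grep the running configuration:

scontrol show config | grep -i HealthCheck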