
WIP: Fix the issue with loading sacctmgr config due to conn timeout on slurmdb.

Created by: eesaanatluri

The Packer build for the ohpc image fails at the Ansible task in the ohpc_install role that loads the sacctmgr config.
The failure occurs because sacctmgr cannot communicate with slurmdb. The sacctmgr command needs a working connection to slurmdb, where the cluster name must be registered, so the task fails with an error like this:

sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to ohpc:7031: Connection timed out
sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection timed out
sacctmgr: error: Problem talking to the database: Connection timed out

This PR fixes the problem described above as follows.

The database connection times out because ohpc gets its private IP address from DHCP, while /etc/hosts, which is rendered from an Ansible template, uses the statically assigned headnode_private_ip variable defined in the group_vars/all file. The two addresses can differ, so the hostname ohpc can resolve to the wrong address. To reach the Slurm database, the template should instead use the host's actual private IP address, which is available through the Ansible inventory variable hostvars. hostvars contains the facts that have been gathered about each host.
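
As a rough illustration of the change described above, the /etc/hosts template line could look something like the sketch below. The fact name ansible_default_ipv4 is an assumption, as is the use of inventory_hostname to select the host; the actual change in this PR may read a different fact or interface. The hostname ohpc is taken from the error message. Fact gathering must be enabled for the play, otherwise the fact is undefined.

```jinja
{# Hypothetical /etc/hosts template line, not the exact change in this PR. #}
{# Previously the address came from the static headnode_private_ip variable. #}
{# ansible_default_ipv4 is an assumed fact name; adjust to the interface in use. #}
{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}  ohpc
```

Reading the address from gathered facts keeps /etc/hosts in step with whatever DHCP assigns at build time, instead of relying on a value in group_vars/all that can go stale.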
