The vnfs build step (creating the system image for a compute node in ohpc) is based on a somewhat time-consuming cpio step, especially when the master is a vm running on a laptop.
The warewulf framework has the ability to import a pre-built vnfs image. This could change the ansible step from a build to a file copy.
This feature tracks replacing the default vnfs build step with an option to simply import an existing vnfs.
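For context, a minimal sketch of what the import could look like as an ansible task, assuming Warewulf 3's wwsh vnfs import subcommand; the variable names for the image path and name are made up here, not ones from the repo:

```yaml
# Hypothetical sketch only: vnfs_image_source and compute_chroot are placeholder names.
- name: import a pre-built vnfs image instead of building one
  command: "wwsh -y vnfs import {{ vnfs_image_source }}/{{ compute_chroot }}.vnfs --name={{ compute_chroot }}"
```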
I've taken an initial pass at the customization needed to support importing images.
This includes variables to define the source path of the images and a placeholder var for conditional tests; ideally this should be a single flag that selects import vs. build.
It also adds a modified build file that selects a new role for import,
plus the new import role, which replaces the build steps (and everything before the build) with a simple import.
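A rough sketch of the flag and conditional role selection, assuming placeholder variable and role names rather than the ones actually used in CRI_XCBC:

```yaml
# Hypothetical play: choose between the existing build role and the new import role.
- hosts: headnode
  vars:
    vnfs_build: true                     # set false to import a pre-built image instead
    vnfs_image_source: /vagrant/images   # where the exported .vnfs/bootstrap files live
  roles:
    - { role: compute_build_vnfs,  when: vnfs_build | bool }
    - { role: compute_import_vnfs, when: not (vnfs_build | bool) }
```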
The import works, but the cluster build fails at the compute node boot. The ansible scripts wait for the node to boot, but the compute node then fails to load the image.
The boot failure of the compute node in the initial test was due to an existing ohpc master on the compute network, which interfered with the boot sequence.
Rerunning the vagrant up with a clean compute network mostly succeeds but fails on the last step of setting the nodes to the idle state:
```
ohpc: TASK [nodes_vivify : enable slurmd on compute nodes] ***************************
ohpc: changed: [ohpc]
ohpc:
ohpc: TASK [nodes_vivify : Waiting for slurmd to enable on the compute nodes] ********
ohpc: Pausing for 30 seconds
ohpc: (ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
ohpc: ok: [ohpc]
ohpc:
ohpc: TASK [nodes_vivify : update slurm status on nodes] *****************************
ohpc: fatal: [ohpc]: FAILED! => {"changed": true, "cmd": "scontrol update nodename=$(wwsh node list | tail --lines=+3 | cut -f 1 -d' '| tr '\\n' ',') state=IDLE", "delta": "0:00:00.119136", "end": "2018-09-27 14:55:22.442929", "msg": "non-zero return code", "rc": 1, "start": "2018-09-27 14:55:22.323793", "stderr": "slurm_update error: Invalid node state specified", "stderr_lines": ["slurm_update error: Invalid node state specified"], "stdout": "", "stdout_lines": []}
ohpc:   to retry, use: --limit @/vagrant/CRI_XCBC/site.retry
ohpc:
ohpc: PLAY RECAP *********************************************************************
ohpc: ohpc                       : ok=72   changed=65   unreachable=0    failed=1
The SSH command responded with a non-zero exit status. Vagrant assumes that this means the command failed. The output for this command should be in the log above. Please read the output to determine what went wrong.
```
The acceptance test also fails (hangs) so something is not right with the cluster.
```
jpr@laptop:~/projects/ohpc_vagrant$ vagrant ssh -c "srun hostname"
srun: Required node not available (down, drained or reserved)
srun: job 2 queued and waiting for resources
^Csrun: Job allocation 2 has been revoked
srun: Force Terminated job 2
Connection to 127.0.0.1 closed.
```
According to sinfo, the c0 node is in the down* state:
```
[root@ohpc vagrant]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
low*         up 2-00:00:00      1  down* c0
```
The scontrol command from the last ansible task can be run manually without throwing an error on the idle state. But this leaves the node in the idle* state (idle but not responding) and jobs still cannot run.
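Slurm usually records why it marked a node down or non-responsive. Something along these lines (a debugging sketch, not part of the repo's playbooks) would surface the State and Reason fields:

```yaml
# Ad-hoc diagnostic tasks: dump slurm's view of the compute node c0.
- name: query slurm's record for the compute node
  command: scontrol show node c0
  register: c0_state

- name: print the node record (check the State and Reason fields)
  debug:
    var: c0_state.stdout_lines
```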
Trying to debug the node via a root ssh login results in a prompt for a password. Root should be able to log in to the compute nodes freely:
```
[vagrant@ohpc ~]$ sudo ssh c0
no such identity: /root/.ssh/identity: No such file or directory
no such identity: /root/.ssh/id_rsa: No such file or directory
no such identity: /root/.ssh/id_dsa: No such file or directory
root@c0's password:
```
The root ssh error is likely due to using a vnfs image from one cluster build on a different ohpc instance without preserving the cluster_root.pub file from the cluster that built it. The pub key file is copied into the vnfs image, into the compute node's authorized_keys file, and would ordinarily allow root to ssh.
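Presumably the build does something along these lines when it authorizes root (the path to the pub key and the chroot variable here are guesses, not the repo's actual task):

```yaml
# Guessed illustration of the step described above: append the master's cluster
# public key to the compute chroot's authorized_keys so root can ssh to the nodes.
- name: authorize the cluster root key inside the compute image
  shell: cat /root/.ssh/cluster_root.pub >> {{ compute_chroot_loc }}/root/.ssh/authorized_keys
```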
The keys are generated earlier, during the ohpc_config tasks. It is likely we will need to capture all the credentials from the instance that built the vnfs image and then register them on a new instance whenever that specific vnfs image is used.
Either that, or edit the vnfs image with the newly created instance's ssh keys, which would take just as long to unpack and repack as the original build step, so this would be a wash.
After some discussion with louistw, there are several bits of surgery that would be required to export the root and node credentials for the cluster. It's not a world of effort, but it does require keeping track of these bits in addition to the vnfs and boot files.
This mainly has to do with pre-seeding the host and root ssh keys for the wwinit ssh_keys command in the ohpc_config role.
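Pre-seeding the root key might look something like the sketch below, assuming the key pair saved from the image-building instance is shipped alongside the vnfs (the bundle path, file names, and the expectation that wwinit ssh_keys reuses existing keys rather than regenerating them are all assumptions):

```yaml
# Hypothetical pre-seeding tasks: restore the cluster root key pair saved from the
# instance that built the vnfs, before ohpc_config runs wwinit ssh_keys. The bundle
# is assumed to already be on the master (e.g. via a vagrant shared folder).
- name: restore saved cluster root ssh keys from the image bundle
  copy:
    src: "{{ vnfs_image_source }}/{{ item }}"
    dest: "/root/.ssh/{{ item }}"
    mode: "0600"
    remote_src: yes
  loop:
    - cluster_root
    - cluster_root.pub
```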
Given that effort, it seems best to let this feature improvement go for now and focus instead on ease of iteration with the Open OnDemand role.