The vnfs build step (creating the system image for a compute node in ohpc) is based on a somewhat time-consuming cpio step, especially when the master is a vm running on a laptop.
The warewulf framework has the ability to import a pre-built vnfs image. This could change the ansible step from a build to a file copy.
This feature tracks replacing the default vnfs build step with an option to simply import an existing vnfs.
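For context, a minimal sketch of what the import could look like as an ansible task, assuming Warewulf 3's wwsh vnfs import subcommand; the variable names for the image path and name are made up here, not ones from the repo:

```yaml
# Hypothetical sketch only: vnfs_image_source and compute_chroot are placeholder names.
- name: import a pre-built vnfs image instead of building one
  command: "wwsh -y vnfs import {{ vnfs_image_source }}/{{ compute_chroot }}.vnfs --name={{ compute_chroot }}"
```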
I've taken an initial pass at the customization needed to support importing images.
This includes variables to define the source path of the images and a placeholder var for conditional tests; ideally this should be a single flag that selects import vs. build.
It also adds a modified build file that selects a new role for import,
plus the new import role, which replaces the build steps (and everything before the build) with a simple import.
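A rough sketch of the flag and conditional role selection, assuming placeholder variable and role names rather than the ones actually used in CRI_XCBC:

```yaml
# Hypothetical play: choose between the existing build role and the new import role.
- hosts: headnode
  vars:
    vnfs_build: true                     # set false to import a pre-built image instead
    vnfs_image_source: /vagrant/images   # where the exported .vnfs/bootstrap files live
  roles:
    - { role: compute_build_vnfs,  when: vnfs_build | bool }
    - { role: compute_import_vnfs, when: not (vnfs_build | bool) }
```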
The import works, but the cluster build fails at the compute node boot. The ansible scripts wait for the node to boot, but the compute node then fails to load the image.
The boot failure of the compute node in the initial test was due to an existing ohpc master on the compute network, which interfered with the boot sequence.
Rerunning the vagrant up with a clean compute network mostly succeeds but fails on the last step of setting the nodes to the idle state:
```
ohpc: TASK [nodes_vivify : enable slurmd on compute nodes] ***************************
ohpc: changed: [ohpc]
ohpc:
ohpc: TASK [nodes_vivify : Waiting for slurmd to enable on the compute nodes] ********
ohpc: Pausing for 30 seconds
ohpc: (ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
ohpc: ok: [ohpc]
ohpc:
ohpc: TASK [nodes_vivify : update slurm status on nodes] *****************************
ohpc: fatal: [ohpc]: FAILED! => {"changed": true, "cmd": "scontrol update nodename=$(wwsh node list | tail --lines=+3 | cut -f 1 -d' '| tr '\\n' ',') state=IDLE", "delta": "0:00:00.119136", "end": "2018-09-27 14:55:22.442929", "msg": "non-zero return code", "rc": 1, "start": "2018-09-27 14:55:22.323793", "stderr": "slurm_update error: Invalid node state specified", "stderr_lines": ["slurm_update error: Invalid node state specified"], "stdout": "", "stdout_lines": []}
ohpc:   to retry, use: --limit @/vagrant/CRI_XCBC/site.retry
ohpc:
ohpc: PLAY RECAP *********************************************************************
ohpc: ohpc                       : ok=72   changed=65   unreachable=0    failed=1
The SSH command responded with a non-zero exit status. Vagrant assumes that this means the command failed. The output for this command should be in the log above. Please read the output to determine what went wrong.
```
The acceptance test also fails (hangs) so something is not right with the cluster.
```
jpr@laptop:~/projects/ohpc_vagrant$ vagrant ssh -c "srun hostname"
srun: Required node not available (down, drained or reserved)
srun: job 2 queued and waiting for resources
^Csrun: Job allocation 2 has been revoked
srun: Force Terminated job 2
Connection to 127.0.0.1 closed.
```
According to sinfo, the c0 node is in the down* state:
```
[root@ohpc vagrant]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
low*         up 2-00:00:00      1  down* c0
```
The scontrol command from the last ansible task can be run manually without throwing an error on the idle state. But this leaves the node in the idle* state (idle but not responding) and jobs still cannot run.
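Slurm usually records why it marked a node down or non-responsive. Something along these lines (a debugging sketch, not part of the repo's playbooks) would surface the State and Reason fields:

```yaml
# Ad-hoc diagnostic tasks: dump slurm's view of the compute node c0.
- name: query slurm's record for the compute node
  command: scontrol show node c0
  register: c0_state

- name: print the node record (check the State and Reason fields)
  debug:
    var: c0_state.stdout_lines
```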
Trying to debug the node via a root ssh login results in a prompt for a password. Root should be able to log in to the compute nodes freely:
```
[vagrant@ohpc ~]$ sudo ssh c0
no such identity: /root/.ssh/identity: No such file or directory
no such identity: /root/.ssh/id_rsa: No such file or directory
no such identity: /root/.ssh/id_dsa: No such file or directory
root@c0's password:
```
The root ssh error is likely due to using a vnfs image from one cluster build on a different ohpc instance without preserving the cluster_root.pub file from the cluster that built it. The pub key file is copied into the vnfs image, into the compute node's authorized_keys file, and would ordinarily allow root to ssh.
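Presumably the build does something along these lines when it authorizes root (the path to the pub key and the chroot variable here are guesses, not the repo's actual task):

```yaml
# Guessed illustration of the step described above: append the master's cluster
# public key to the compute chroot's authorized_keys so root can ssh to the nodes.
- name: authorize the cluster root key inside the compute image
  shell: cat /root/.ssh/cluster_root.pub >> {{ compute_chroot_loc }}/root/.ssh/authorized_keys
```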
The keys are generated earlier, during the ohpc_config tasks. It is likely we will need to capture all the credentials from the instance that built the vnfs image and then register them on a new instance whenever that specific vnfs image is used.
Either that, or edit the vnfs image with the newly created instance's ssh keys, which would take just as long to unpack and repack as the original build step, so this would be a wash.
After some discussion with louistw, there are several bits of surgery that would be required to export the root and node credentials for the cluster. It's not a world of effort, but it does require keeping track of these bits in addition to the vnfs and boot files.
This mainly has to do with pre-seeding the host and root ssh keys for the wwinit ssh_keys command in the ohpc_config role.
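Pre-seeding the root key might look something like the sketch below, assuming the key pair saved from the image-building instance is shipped alongside the vnfs (the bundle path, file names, and the expectation that wwinit ssh_keys reuses existing keys rather than regenerating them are all assumptions):

```yaml
# Hypothetical pre-seeding tasks: restore the cluster root key pair saved from the
# instance that built the vnfs, before ohpc_config runs wwinit ssh_keys. The bundle
# is assumed to already be on the master (e.g. via a vagrant shared folder).
- name: restore saved cluster root ssh keys from the image bundle
  copy:
    src: "{{ vnfs_image_source }}/{{ item }}"
    dest: "/root/.ssh/{{ item }}"
    mode: "0600"
    remote_src: yes
  loop:
    - cluster_root
    - cluster_root.pub
```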
Given that effort, it seems best to let this feature improvement go for now and focus instead on ease of iteration with the Open OnDemand role.