[16:11] <cook_key> Hi, I'm having an issue with ssh host keys. It appears that cloud-init is removing the host keys but not generating them. This is causing sshd to fail to start, and then we do not have ssh access to the VM. I'm seeing this behavior after creating a snapshot of the VM and then booting another instance of the VM. So this would be the 2nd run of
[16:11] <cook_key> cloud-init. On the first run, the keys are generated.
[16:23] <minimal> cook_key: many of cloud-init's actions are performed only on 1st boot by design
[16:23] <minimal> so if you clone a VM instance you need to "clean" it before use
[16:33] <cook_key> how do I clean it so it'll generate those keys again?
[16:50] <smoser> well, the goal is that you don't ever really *need* to clean. you should be able to capture an existing instance and the things that need to run per-instance (like ssh key generation) will run.
[16:51] <smoser> cook_key: you'll need to either see a console log or capture the failed instance and then look at logs on it.
[16:56] <minimal> smoser: wouldn't at the very least a "cloud-init clean" need to be done when cloning the VM?
[16:59] <smoser> you should not need to do that.
[16:59] <smoser> for a stock image it should not be necessary.  you could have done things in user-data or in your own use that need to be cleaned up, but cloud-init should handle its things correctly.
[17:00] <cook_key> I've looked at the logs. It says that it's removing the host keys and I'd expect to see regeneration of the keys after that, but I do not. It seems like it's just skipping it altogether. It's like the ssh_genkeytypes is empty and it skips the loop: https://github.com/canonical/cloud-init/blob/22.1/cloudinit/config/cc_ssh.py#L246
[17:00] <minimal> smoser: assuming the instance-id is different, right?
[17:00] <smoser> minimal: yeah. that is kind of an assumption ;)
[17:01] <minimal> smoser: I was making a more general point that there are likely to be (non cloud-init related) host-specific things in a VM that need to be cleaned out before using a clone of it
[17:02] <minimal> cook_key: which Data Source are you using?
[17:04] <cook_key> minimal: OpenStack
[17:05] <minimal> cook_key: so does the meta-data it provides to VMs have a unique instance id?
[17:07] <minimal> cook_key: as smoser and I were alluding to, if old cloud-init config from the cloned VM is around it should not be used because, in theory, any new VM should be presented with a different instance-id by OpenStack - if it is presented with the same instance-id as was given to the VM that was cloned then cloud-init won't know to do all the 1st time configuration
[17:23] <cook_key> minimal: Yes, I believe I get a new instance ID after booting the snapshot:
[17:23] <cook_key> # ls /var/lib/cloud/instances/
[17:23] <cook_key> 200cf044-f88c-44eb-be80-4de19248f823  4a08a20f-4362-46ee-b0d0-86739fa2697a
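Editor's note: the per-instance behaviour discussed above can be pictured as a marker file kept under a directory named after the instance ID. The sketch below is a simplified illustration only, not cloud-init's code; the function name `run_once_per_instance` and the marker contents are hypothetical, and the directory layout is assumed from the `/var/lib/cloud/instances/<id>/sem/` paths that appear in this log.

```python
import os
import tempfile


def run_once_per_instance(cloud_dir, instance_id, name, action):
    """Run action() only if no marker exists for this instance ID.

    Hypothetical helper: the layout mimics the instances/<id>/sem/<name>
    paths seen in the log, but this is not cloud-init's implementation.
    """
    sem_dir = os.path.join(cloud_dir, "instances", instance_id, "sem")
    os.makedirs(sem_dir, exist_ok=True)
    marker = os.path.join(sem_dir, name)
    if os.path.exists(marker):
        return False  # already ran for this instance ID, so skip
    action()
    with open(marker, "w") as fh:
        fh.write("done\n")
    return True


def demo():
    """Boot the same instance twice, then boot a clone with a new ID."""
    ran = []
    with tempfile.TemporaryDirectory() as cloud_dir:
        for instance_id in ("instance-a", "instance-a", "instance-b"):
            run_once_per_instance(
                cloud_dir, instance_id, "config_ssh",
                lambda: ran.append(instance_id),
            )
    return ran
```

With a new instance ID the marker is missing, so the action runs again; this is why a snapshot booted under a fresh ID is expected to redo per-instance work such as host key handling.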
[17:24] <minimal> ok, have you tried enabling debugging in the cloud-init config as then you'll have more details in /var/log/cloud-init.log to look to see what's going on?
[17:25] <cook_key> Yes, the debug messages are in the log, but I can't quite trace why it's skipping the key generation
[17:27] <cook_key> 2022-06-17 17:09:43,267 - handlers.py[DEBUG]: start: init-network/config-ssh: running config-ssh with frequency once-per-instance
[17:27] <cook_key> 2022-06-17 17:09:43,268 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/4a08a20f-4362-46ee-b0d0-86739fa2697a/sem/config_ssh - wb: [644] 25 bytes
[17:27] <cook_key> 2022-06-17 17:09:43,270 - util.py[DEBUG]: Restoring selinux mode for /var/lib/cloud/instances/4a08a20f-4362-46ee-b0d0-86739fa2697a/sem/config_ssh (recursive=False)
[17:27] <cook_key> 2022-06-17 17:09:43,272 - util.py[DEBUG]: Restoring selinux mode for /var/lib/cloud/instances/4a08a20f-4362-46ee-b0d0-86739fa2697a/sem/config_ssh (recursive=False)
[17:27] <cook_key> 2022-06-17 17:09:43,273 - helpers.py[DEBUG]: Running config-ssh using lock (<FileLock using file '/var/lib/cloud/instances/4a08a20f-4362-46ee-b0d0-86739fa2697a/sem/config_ssh'>)
[17:27] <cook_key> 2022-06-17 17:09:43,284 - util.py[DEBUG]: Attempting to remove /etc/ssh/ssh_host_rsa_key
[17:27] <cook_key> 2022-06-17 17:09:43,284 - util.py[DEBUG]: Attempting to remove /etc/ssh/ssh_host_rsa_key.pub
[17:28] <cook_key> 2022-06-17 17:09:43,284 - util.py[DEBUG]: Attempting to remove /etc/ssh/ssh_host_dsa_key
[17:28] <cook_key> 2022-06-17 17:09:43,284 - util.py[DEBUG]: Attempting to remove /etc/ssh/ssh_host_dsa_key.pub
[17:28] <cook_key> 2022-06-17 17:09:43,284 - util.py[DEBUG]: Attempting to remove /etc/ssh/ssh_host_ecdsa_key
[17:28] <cook_key> 2022-06-17 17:09:43,285 - util.py[DEBUG]: Attempting to remove /etc/ssh/ssh_host_ecdsa_key.pub
[17:28] <cook_key> 2022-06-17 17:09:43,285 - util.py[DEBUG]: Attempting to remove /etc/ssh/ssh_host_ed25519_key
[17:28] <cook_key> 2022-06-17 17:09:43,285 - util.py[DEBUG]: Attempting to remove /etc/ssh/ssh_host_ed25519_key.pub
[17:28] <cook_key> 2022-06-17 17:09:43,288 - util.py[DEBUG]: Reading from /etc/ssh/sshd_config (quiet=False)
[17:28] <cook_key> 2022-06-17 17:09:43,306 - util.py[DEBUG]: Read 4268 bytes from /etc/ssh/sshd_config
[17:28] <cook_key> 2022-06-17 17:09:43,324 - util.py[DEBUG]: Changing the ownership of /home/centos/.ssh to 1001:1002
[17:28] <cook_key> 2022-06-17 17:09:43,326 - util.py[DEBUG]: Restoring selinux mode for /home/centos/.ssh (recursive=False)
[17:28] <cook_key> 2022-06-17 17:09:43,376 - util.py[DEBUG]: Writing to /home/centos/.ssh/authorized_keys - wb: [600] 0 bytes
[17:28] <cook_key> Sorry for flooding the chat with that
[17:29] <minimal> this is the important bit: "Writing to /home/centos/.ssh/authorized_keys - wb: [600] 0 bytes"
[17:29] <minimal> see the "0 bytes" bit...
[17:29] <minimal> so is your user-data correct regarding how you specify the key(s) to add?
[17:31] <cook_key> I believe so. I'm able to ssh in with the right ssh_authorized_key for the user, but not until I manually run `ssh-keygen -A` to create the ssh host keys. If I don't do that, sshd will not start. I'm actually having to enable a password for the user so I can get in via the console the first time to generate the keys manually.
[17:34] <minimal> sorry, was getting confused between host keys and user keys :-)
[17:35] <cook_key> No worries
[17:44] <minimal> cook_key: do you specify "ssh_keys" in your user-data?
[17:54] <cook_key> Yes
[18:00] <minimal> ok, "ssh_keys" is for specifying *host* keys, not user keys - host keys will not be generated if you specify them via "ssh_keys" entry
[18:01] <minimal> https://cloudinit.readthedocs.io/en/latest/topics/modules.html#ssh
[18:01] <minimal> the "Host Keys" section, 2nd and 3rd paragraphs
[18:02] <minimal> "If no host keys are specified using ssh_keys, then keys will be generated using ssh-keygen"
[18:03] <minimal> so it is doing exactly what you told it to do :-)
[18:06] <minimal> pesky computers, always doing exactly what I tell them to do, not what I intended for them to do ;-)
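Editor's note: the rule quoted from the documentation ("If no host keys are specified using ssh_keys, then keys will be generated") amounts to a two-way branch. A minimal sketch, assuming a plain dict for the merged config; `plan_host_keys` and `ASSUMED_DEFAULT_TYPES` are hypothetical names, and the four key types are taken from the host key files named in the log above rather than from cloud-init's source.

```python
# Assumption: these four types mirror the ssh_host_*_key files in the log.
ASSUMED_DEFAULT_TYPES = ("rsa", "dsa", "ecdsa", "ed25519")


def plan_host_keys(cfg):
    """Return what would happen to host keys for a given config dict.

    Hypothetical illustration of the documented rule, not cloud-init code.
    """
    if "ssh_keys" in cfg:
        # Host keys were supplied explicitly, so nothing is generated.
        return ("install", sorted(cfg["ssh_keys"]))
    # No host keys supplied, so fall back to generating them.
    return ("generate", list(ASSUMED_DEFAULT_TYPES))
```

Note that `ssh_authorized_keys` plays no part in this branch: it concerns user keys, so setting it alone leaves the "generate" path in effect.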
[18:06] <cook_key> Oh sorry, I misspoke. We are not setting `ssh_keys`, but we are setting `ssh_authorized_keys`
[18:07] <minimal> ah, ok, then the keygen should occur
[18:08] <minimal> do you specify "ssh_genkeytypes" ?
[18:10] <cook_key> Originally I wasn't, because I wanted it to use the default for that. But I also tried adding it in and specifying all 4 keys and it still did not generate the keys.
[18:10] <cook_key> Based on the code, it seems like something is setting ssh_genkeytypes to [] and it skips the generation loop
[18:10] <cook_key> all 4 key types*
[18:14] <minimal> yeah that's why I wondered if you were setting it
[18:14] <minimal> the default value is set here: https://github.com/canonical/cloud-init/blob/728098325657cb2fec559cf321ccd5235e786381/cloudinit/config/cc_ssh.py#L172
[18:38] <cook_key> I tried a workaround of using rc.local to make sure the keys are there but cloud-init seems to run afterwards and just delete them on me. This is frustrating.
[18:42] <minimal> cook_key: well you could set "ssh_deletekeys: False" in user-data to stop the deletion
[18:42] <minimal> however it is strange that the list of keys to create appears to be empty
[18:42] <minimal> which version of cloud-init are you using?
[18:43] <cook_key> I'll give that a shot.
[18:43] <cook_key> I'm using 22.1-1.el8
[18:43] <cook_key> CentOS 8.4
[18:44] <minimal> that's just a workaround, you should try and nail down what is going wrong
[18:45] <cook_key> that's why I'm here ;)
[18:48] <minimal> have you checked the contents of /etc/cloud/cloud.cfg and any files in /etc/cloud/cloud.cfg.d/ to see if they are setting anything related?
[18:51] <cook_key> Would this do it?
[18:51] <cook_key> ssh_deletekeys:   1
[18:51] <cook_key> ssh_genkeytypes:  ~
[18:51] <cook_key> This is in /etc/cloud/cloud.cfg
[18:54] <cook_key> What does the `~` mean in this case? Default?
[18:57] <minimal> I think "ssh_genkeytypes:  ~" means it's a broken config ;-)
[18:57] <minimal> that might explain your problems
[19:03] <minimal> cook_key: any idea how that got in the cloud.cfg file?
[19:04] <minimal> "ssh_deletekeys:   1" I assume is equivalent to "True"
[19:05] <cook_key> I'm not sure, but it's not the only place in the file:
[19:05] <cook_key> $ cat /etc/cloud/cloud.cfg | grep \~
[19:05] <cook_key> mount_default_fields: [~, ~, 'auto', 'defaults,nofail,x-systemd.requires=cloud-init.service', '0', '2']
[19:05] <cook_key> ssh_genkeytypes:  ~
[19:05] <cook_key> syslog_fix_perms: ~
[19:09] <minimal> cook_key: for the mount_default_fields value that is normal I think
[19:11] <minimal> the syslog_fix_perms entry, I see that in the cloud.cfg.tmpl file for Redhat (and derivatives). Perhaps "~" means something in YAML
[19:11] <minimal> anyway try commenting out the "ssh_genkeytypes" entry and see if things work as expected
[19:16] <cook_key> > The tilde is one of the ways the null value can be written
[19:16] <cook_key> ha, that's got to be it, right?
[19:16] <cook_key> Thank you very much for the help minimal. I hope this is it.
[19:18] <minimal> cook_key: sounds like it, I guess it's then an empty list (rather than not being set at all) and so the for loop does nothing
[19:18] <minimal> sounds like a bug to be raised ;-)
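Editor's note: the conclusion reached above can be reproduced in a few lines. In YAML, `~` is a spelling of null, which Python YAML loaders return as `None`, so `ssh_genkeytypes:  ~` produces a key that is present but null. The sketch below assumes, as minimal guesses, that a present-but-null value becomes an empty list while the default applies only when the key is absent. `get_option_list`, `keys_to_generate` and `ASSUMED_DEFAULT_TYPES` are hypothetical names for illustration, not cloud-init's API.

```python
# Assumption: default types inferred from the host key files in the log.
ASSUMED_DEFAULT_TYPES = ["rsa", "dsa", "ecdsa", "ed25519"]


def get_option_list(cfg, key, default):
    """Hypothetical lookup: the default applies only if the key is absent."""
    if key not in cfg:
        return list(default)
    value = cfg[key]
    if value is None:
        return []  # "key: ~" lands here: present, but null
    if isinstance(value, list):
        return list(value)
    return [str(value)]


def keys_to_generate(cfg):
    """List the host key paths a generation loop would visit."""
    paths = []
    for key_type in get_option_list(cfg, "ssh_genkeytypes", ASSUMED_DEFAULT_TYPES):
        paths.append(f"/etc/ssh/ssh_host_{key_type}_key")
    return paths


# Key absent: the default list is used and four keys are planned.
absent = keys_to_generate({})
# "ssh_genkeytypes: ~" parsed to None: the loop body never runs.
null_value = keys_to_generate({"ssh_genkeytypes": None})
```

Under these assumptions, commenting out the `ssh_genkeytypes` line in /etc/cloud/cloud.cfg (as suggested above) moves the config from the second case back to the first.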