gellertp | Hi, we're running EC2 instances on centos 7 and noticed that the cloud-init stage 'init-local' is suspiciously taking >50s in the EC2LocalDatasource step. Does anyone have any idea on what could be causing this? | 15:20 |
---|---|---|
gellertp | In particular, the following log lines in /var/log/cloud-init.log were suspicious: | 15:21 |
gellertp | 2021-10-06 13:42:55,162 - util.py[DEBUG]: Resolving URL: http://169.254.169.254 took 40.044 seconds | 15:21 |
gellertp | 2021-10-06 13:43:05,174 - util.py[DEBUG]: Resolving URL: http://instance-data.:8773 took 10.011 seconds | 15:21 |
minimal | gellertp: "resolving" sounds strange for the 1st url as it's an IP address | 15:48 |
rharper | That sounds vaguely familiar with an issue RedHat saw in their build where the /etc/resolv.conf file was left in the cloud-image and included bogus DNS entries | 15:52 |
rharper | aw .... gellertp is gone | 15:52 |
minimal | rharper: think the use of "resolving" in the debug message is misleading as util.py uses urllib to parse the url but doesn't seem to actually do any DNS/hosts lookups (from a cursory look at urllib's parse.py) | 15:58 |
rharper | lemme get the details ... it was something deep | 15:59 |
rharper | minimal: I'm not finding it in my logs, but the source of the issue was that centos/rhel cloud images end up with an /etc/resolv.conf that had bogus 10.X nameservers. The is_resolvable_url() check in the ec2 datasource attempts to resolve "bogus" domain names on purpose; normally this isn't an issue on systems during local time as we've not *yet* applied network config, but in these images, having the bad nameserver entry meant | 16:13 |
rharper | cloud-init waited for those DNS requests to bogus servers to timeout | 16:13 |
rharper | https://opendev.org/openstack/kayobe/commit/9c1d085d2e52396d05397afb0f658224bda0087c this is old, but represents the issue; something related to how the cloud images get built | 16:13 |
ubottu | Commit 9c1d085 in openstack/kayobe "Workaround issue in CentOS cloud images with resolv.conf" | 16:13 |
rharper | it happens off and on over the years of building images; | 16:14 |
rharper | oh, and i see gellertp mentions centos7 now; so likely that. Sometimes it can be triggered if you're customizing an image and you "boot it up in a vm" and don't know to clear out certain files. Ubuntu images use /etc/resolv.conf as symlink into /run so it's always ephemeral ; | 16:15 |
minimal | rharper: why would it try and resolve 169.254.169.254 though? | 16:15 |
minimal | i.e. an IP address | 16:16 |
rharper | it's part of the url_handler.py logic , I'm not sure we parse the URL with the intent of avoiding ips | 16:16 |
rharper | ah, utils.is_resolvable() | 16:20 |
rharper | for each of the metadata urls in the datasource, it checks if it's resolvable. I don't quite have the history of why we do the resolvable check on the IP, but the second URL in the metadata urls to try is a hostname, and it would fail the same way; | 16:21 |
minimal | I remember the good old days when a domain could NOT begin with a number, so detecting an IP address was easy :-) | 16:24 |
rharper | heh | 16:24 |
minimal | eeek! although the RFCs appear to say that a fully numeric domain name, such as 1.2.3.4. , is not permitted (as a TLD cannot begin with a number), I'm guessing that 1.2.3.4 (no trailing dot) still has to be resolved, as going through the search path this could be the valid name 1.2.3.4.mycompany.com | 16:33 |
rharper | the code adds the trailing dot IIRC | 16:35 |
holmanb | minimal: regarding resolving the URL - https://cloudinit.readthedocs.io/en/latest/topics/datasources/ec2.html | 17:05 |
=== mamercad46 is now known as mamercad | ||
akutz | Howdy. I just saw a bug come through on VMware's internal tracker that Cloud-Init v21.3's update in Photon is causing ssh key generation on existing hosts. Do we know if there's any known issues with 21.3 related to SSH key-gen? | 18:09 |
rharper | akutz: there was the recent merge of a drop-in conf to prevent race between cloud-init and sshd-keygen@.service | 18:09 |
rharper | if photon is RHEL derivative, it may be that race | 18:10 |
akutz | Photon is homegrown, but does use dnf. | 18:10 |
rharper | https://github.com/canonical/cloud-init/pull/1028 | 18:10 |
ubottu | Pull 1028 in canonical/cloud-init "Add sshd-keygen disable drop-in conf" [Merged] | 18:10 |
rharper | well, in particular if they use RH based sshd package, it will have sshd-keygen@.service enabled (unless they disable it in the build) | 18:11 |
akutz | Thank you for the quick response rharper! | 18:11 |
akutz | Yep, they have the sshd-keygen service - https://github.com/vmware/photon/blob/3.0/SPECS/openssh/sshd-keygen.service | 18:13 |
rharper | cool, hopefully that PR should give them something to check; without that, the best case was keygen would run first, make keys, cloud-init would delete it and regen. and then you're good. Across reboots, we've not seen any issues that I know; | 18:30 |
rharper | https://bugs.launchpad.net/bugs/1946644 | 18:30 |
ubottu | Launchpad bug 1946644 in cloud-init "After restart cloud-init reconfigured the machine hostname and ssh keypairs" [High, Confirmed] | 18:30 |
rharper | akutz: ^ | 18:30 |
rharper | maybe that one if not the keygen | 18:30 |
akutz | Ah, thanks! | 18:31 |
akutz | There's nothing in the CI v21.3 upgrade of which we're aware that would cause a "cloud-init cleanup", including wiping out any indication of previous boots, right? | 21:15 |
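
On the earlier question of why an IP address would be "resolved" at all, and on the stale /etc/resolv.conf explanation: the following is a minimal Python sketch, not cloud-init's actual code. The helper names `host_is_ip_literal` and `nameservers` are hypothetical. It shows one way to tell whether a metadata URL's host is already an IP literal (so no DNS lookup would be needed), and how to list the nameserver entries from resolv.conf text so that leftover 10.X entries in an image are easy to spot.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip_literal(url):
    """Return True if the URL's host is an IPv4/IPv6 literal (hypothetical helper)."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as an address, so it is a name that needs DNS.
        return False
    return True


def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text (hypothetical helper)."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


if __name__ == "__main__":
    # The two URLs from the log lines quoted in the discussion.
    for url in ("http://169.254.169.254", "http://instance-data.:8773"):
        print(url, host_is_ip_literal(url))
    try:
        with open("/etc/resolv.conf") as fh:
            print(nameservers(fh.read()))
    except OSError:
        print("no readable /etc/resolv.conf on this system")
```

Running this inside an affected image before first boot would print whatever nameservers were baked into the image; an image whose /etc/resolv.conf is empty or a symlink into /run should show none.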