/srv/irclogs.ubuntu.com/2020/08/13/#cloud-init.txt

AnhVoMSFT@Odd_Bloke @blackboxsw_ @rharper any rough ETA on when the next SRU of cloud-init will be?15:59
Odd_BlokeAnhVoMSFT: Within the next couple of weeks: once https://github.com/canonical/cloud-init/pull/528 is landed, and once https://github.com/canonical/cloud-init/pull/514 has been approved (that's a packaging change, so it won't "land" per se), we'll be starting to look at it.16:09
Odd_BlokeFYI for fokls who aren't already aware: I will be on vacation for a couple of weeks starting at the end of this week.17:41
Odd_Bloke("fokls" <-- I am evidently in need of a vacation ;)17:42
AnhVoMSFTThanks @Odd_Bloke18:33
rharpereandersson: hey, I saw your response on the openstack timeout issue;19:38
rharpereandersson: I was hoping to see a cloud-init log (or if you know) whether the timeout is seen during cloud-init-local.service where it's bringing up the ephemeral dhcp client, or if it's seen much later ...  and if you're using latest cloud-init on ubuntu or something else?  I'd really like to rule out the classless static route issue19:39
eanderssonrharper http://paste.openstack.org/show/Udy604fiNCxNqP9pTV7D/19:42
eanderssonSo I have seen a couple examples of this and we can reproduce it pretty easily.19:43
meenarharper: what is, "the classless static route issue"?19:43
eanderssonThis is on CentOS. I haven't tried it on Ubuntu.19:44
eanderssonBut the underlying issue is basically either network or something not quite fast enough on the OpenStack side.19:44
eanderssonWhen creating a ton of VMs (e.g. 200+ VMs at the same time)19:45
eanderssonWhen I tested by allowing it to retry at least a few times it would never fail.19:45
rharperhttps://bugs.launchpad.net/cloud-init/+bug/182110219:45
ubot5Ubuntu bug 1821102 in cloud-init "Cloud-init should not setup ephemeral ipv4 if apply_network_config is False for OpenStack" [High,Fix released]19:45
rharpermeena: ^19:45
rharperit's since been fixed19:46
rharpereandersson: thanks, lemme look19:46
rharperscale testing sounds like dhcp response times are slow; so it's likely unrelated to the bug I mentioned;19:46
eanderssonMy motivation is really that if something like the EC2 (Amazon) data source has a retry, it is very reasonable that the OpenStack should be allowed to retry at least a couple of times.19:47
eanderssonBut it's easy to just override the default cloud-init pages so it's not a hill I would die on. :D19:48
eandersson*settings19:48
rharperright, I'm still trying to see if we're failing to bring up networking correctly, or if it's the IMDS itself that's failing;  it's hard to see the whole context from the snippet19:48
Odd_BlokeA big difference (until recently, of course) is that EC2 only had Intel hardware, so we would always be able to determine that we were on EC2 reliably (and maybe we still can in their ARM, I don't recall).19:49
eanderssonYea - I unfortunately don't have the full logs at the moment.19:49
Odd_BlokeSo EC2 can configure a timeout because that timeout should only affect EC2 instances.19:49
eanderssonBut this issue is for sure that either the network (e.g. vxlan) or dhcp / network namespace / security group isn't ready.19:50
rharpereandersson: it does look like this is during early boot and the ephemeral dhcp; I see the ip commands which tear down what we brought up19:50
rharpereandersson: if possible, it would be great to see an Ubuntu log where I know we've got the classless static routes issue fixed;19:51
eanderssonHow would that bug be presented?19:51
eanderssonReading19:51
rharperso,  looking at the logs for the bug I posted19:52
rharperthey look almost exactly like yours19:52
rharperhttp://paste.openstack.org/show/796830/19:53
eanderssonIt's for sure not the same issue because we don't have an additional route for 16919:53
eanderssonand the metadata proxy is local on the hypervisor19:54
rharperok; if it's not the bug and it's a slow metadata service; and 50 seconds isn't enough (10 second timeout * 5 retries); we're left with bumping a default for openstack which makes the remaining network datasources wait longer;20:00
eanderssonYea - I mean an alternative would just be to add some documentation tbh.20:00
rharperOdd_Bloke: one area we didn't continue the discussion on was whether we should sort the datasource_list by total wait time (timeout * retries)20:00
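The ordering idea above can be sketched roughly as follows. This is only an illustration, not cloud-init code: `DsWait`, `sort_by_total_wait`, and the `ExampleFast`/`ExampleSlow` entries are hypothetical names and values. Only the OpenStack figures (10 second timeout, 5 retries) come from the discussion.

```python
from typing import List, NamedTuple


class DsWait(NamedTuple):
    """Hypothetical record of a datasource's wait settings."""
    name: str
    timeout: int   # seconds per attempt
    retries: int   # number of attempts

    @property
    def total_wait(self) -> int:
        # Worst-case time spent on this datasource: timeout * retries.
        return self.timeout * self.retries


def sort_by_total_wait(sources: List[DsWait]) -> List[DsWait]:
    """Order datasources so the cheapest worst case is probed first."""
    return sorted(sources, key=lambda s: s.total_wait)


candidates = [
    DsWait("ExampleSlow", timeout=30, retries=3),   # made-up values
    DsWait("OpenStack", timeout=10, retries=5),     # defaults quoted in the log
    DsWait("ExampleFast", timeout=2, retries=1),    # made-up values
]
ordered = sort_by_total_wait(candidates)
print([(s.name, s.total_wait) for s in ordered])
```

With this ordering, a datasource with a large worst-case wait no longer delays the ones that would have answered quickly.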
rharperconfigdrive is still an option for openstack;20:01
eanderssonI have never seen it fail more than once btw.20:03
eanderssonI probably could make it fail more, but would need to create a lot of VMs :D20:04
rharperwell, it already retries 5 times; with 10 second timeouts before retry20:04
eanderssonIt does not20:04
rharperwell, then I want to see that log20:04
rharperwhat you posted showed exactly one; the default for openstack is 10 second timeout and 5 retries20:04
eandersson> Giving up on OpenStack md from ['http://169.254.169.254/openstack'] after 0 seconds20:04
eanderssonThe TCP timeout is indeed 10s20:04
eanderssonWhich is why it "waits for 10s"20:04
eanderssonThat code path does not adhere to the retries20:05
rharpereandersson: I don't understand; https://github.com/canonical/cloud-init/blob/master/cloudinit/sources/__init__.py#L172  this is the default config for all datasources, unless the subclass overrides;20:07
rharperDataSourceOpenStack.py does not have an override (except in your PR);20:07
rharperso, I see no reason why OS would *not* retry the expected 5 times;   unless we're not getting 404; and there's some url_helper path that's skipping the retry ...20:08
rharperhence needing the logs20:08
eanderssonIt's because the code path uses wait_for_url20:09
eanderssonhttps://github.com/canonical/cloud-init/blob/411dbbebd328163bcb1c676cc711f3e5ed805375/cloudinit/sources/DataSourceOpenStack.py#L8220:09
eanderssonIt does not even pass on the retries20:09
eanderssonIt basically only uses the max_wait_seconds to determine how long it will retry for20:10
eanderssonIn this case it's -1 (no retries) so it never retries20:10
johnsonshiOdd_Bloke: Thanks for always reviewing my PRs, Odd_Bloke! I believe my PR (https://github.com/canonical/cloud-init/pull/468/) is finally ready.20:10
eanderssonIt does however use the tcp timeout. So it may look like it tries for "10 s"20:10
blackboxsw_Odd_Bloke: you mentioned a couple of prs that needed review, I just +1'd https://github.com/canonical/cloud-init/pull/531 were there 2 others that needed eyes?20:10
eanderssonhttps://github.com/canonical/cloud-init/blob/master/cloudinit/url_helper.py#L344 wait_for_url is defined here20:11
rharperyeah; I see the try once with -1 ; which is the default;20:13
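The "try once with -1" behaviour described above can be modelled with a small sketch. This is a simplified stand-in, not the real `wait_for_url` in `url_helper.py`: the function name `wait_for_url_sketch` and the injectable `fetch`, `sleep`, and `clock` parameters are assumptions made so the example runs without a network.

```python
import time
from typing import Callable, Optional, Tuple


def wait_for_url_sketch(
    url: str,
    fetch: Callable[[str], bool],
    max_wait: Optional[int] = -1,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[str], int]:
    """Poll url until fetch() succeeds or the max_wait budget is used up.

    Returns (url or None, number of attempts made).
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        if fetch(url):
            return url, attempts
        # A max_wait of zero or less counts as "time is already up",
        # so the loop ends after the first failed attempt.
        if max_wait is not None and (max_wait <= 0 or clock() - start > max_wait):
            return None, attempts
        sleep(1)


def always_fail(_url: str) -> bool:
    # Stand-in for a metadata service that is not ready yet.
    return False


result, tries = wait_for_url_sketch(
    "http://169.254.169.254/openstack", always_fail, max_wait=-1
)
print(result, tries)  # prints: None 1
```

In this model the per-request timeout (not shown) would still make the single attempt take up to 10 seconds, which matches the observation that it only looks like it waits for "10 s".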
eanderssonYea - that code path is just so unforgiving20:25
rharperso, CloudStack, Ec2, Exoscale, MAAS, and Openstack (ignoring ovf since it's local ds mostly); all have max_wait set to something other than default;20:26
eanderssonAnd there are so many things that happen at the same time when spawning a new VM.20:26
Odd_Blokeblackboxsw_: Ryan beat you to them. :)20:26
Odd_Bloke(Thanks for the review!)20:26
blackboxsw_that's the spirit !20:26
rharperonly OpenStack and MAAS, I think support multi-arch20:26
Odd_BlokeObviously more thanks to rharper for doing multiple reviews. ^_^20:26
rharperOdd_Bloke: sure20:27
rharperman, Xenial is the killer here (and maybe other OS which use cloud-init but don't use ds-identify)20:27
Odd_Blokejohnsonshi: My pleasure!  I'm taking another look through the code to confirm that I'm happy with it, but note that I would like you to revert your removal of the type annotations before it lands.20:28
rharperit will walk each DS at network time; so putting a large max_wait means pain for anything coming after it20:28
AnhVoMSFT@Odd_Bloke - regarding typehints, what are your recommendations regarding other distros like RHEL/CentOS/SUSE, who have many enterprise customers that continue using it for at least another 3-5 years? There are features that we should be delivering them irrespective of the python version. Do you recommend we send another PR to stable-19.4 with the exact change except dropping the typehints20:31
AnhVoMSFTin that PR?20:31
johnsonshiOdd_Bloke: Thanks for the note about the type hints. Any recommendations for the pre-existing RHEL and CentOS image maintainers out there? Those images will still be supported for a few more years.20:31
AnhVoMSFToops, didn't realize johnsonshi was on IRC just now and asking same question :-)20:32
Odd_BlokeI don't have any particular recommendations, no: it really depends on how they handle backporting cloud-init changes (and, for that matter, how their distro/company handles backporting Python 3-only projects to their Python 2-only releases more generally).  Or, expressed another way: I would expect them to know better than any recommendation I could give. :)20:36
Odd_BlokeI think using `stable-19.4` would be appropriate, but note that the type annotations are not the only way in which cloud-init code cannot run on Python 2 any longer, so just removing them may not be sufficient.20:37
Odd_BlokeOf course, it's only worth using `stable-19.4` if the maintainers of the existing images will use it, so you'd need to coordinate that with them.20:37
AnhVoMSFTare typehints going to be a requirement for all new methods going into cloud-init?20:39
rharperOdd_Bloke: so I think the best test of the cost of changing openstack max_wait to something other than -1 would be ec2 xenial boots; which at this time still are not ds-identify strict; so it will walk all of the ds in the list;  we could look at the current boot time now;  make the change to openstack per eandersson PR and retest  boot to see the impact20:39
Odd_BlokeAnhVoMSFT: Not a requirement, but I would expect they will become used more generally as people adapt to being able to use them.20:40
Odd_BlokeAnd I would like to discuss the timeline for dropping 3.4 support at the summit, after which point we would be able to use the `typing` module; I would expect an uptick in their use then, too.20:40
Odd_Bloke(We can only annotate simple cases without `typing`, which obviously reduces the number of annotations across the codebase.)20:41
AnhVoMSFTGiven that this PR brings more clarity due to the refactoring and additional telemetry ( we previously did not capture the call to extract goal state ), I think the typehints isn't the major enhancement and dropping it doesn't really make the code worse (the code is being made better by the refactoring, it's not being made worse because the typehints weren't there to start with)20:41
AnhVoMSFTby extract goal state I mean extracting the certificates.xml - duh20:42
Odd_BlokeTo be clear, I didn't say that the PR made our codebase worse, I said that type annotations make our codebase better.20:43
Odd_BlokeI'd be happy to accept the type annotations in a follow-up PR, if that would give you an easier-to-backport commit?20:44
johnsonshiOdd_Bloke AnhVoMSFT: Thanks to both of you for clarifying. :) What should the next steps be for my PR?20:44
johnsonshiAnhVoMSFT Odd_Bloke: Ok I'll be going with that suggestion and have that in a follow up PR instead. Thanks folks.20:46
AnhVoMSFTI think we should address the typehints in a follow-up PR, perhaps adding typehints to all the methods in DataSourceAzure, that'll be cleaner and provide better static analysis during testing too20:46
Odd_BlokeAnhVoMSFT: We won't be able to add them to all the methods until we have `typing` access, but I agree that typing as many as we can currently would be great!20:47
AnhVoMSFTThanks @Odd_Bloke - I think one topic for the cloud-summit this time should be on how distro maintainers (mostly looking at RHEL, SUSE, Oracle Linux) plan on packaging cloud-init to support their python 2.7 and python 3.4 customers20:50
Odd_Blokejohnsonshi: AnhVoMSFT: Landed! :)20:50
johnsonshiOdd_Bloke: Woooo! :) I think that was the first substantial PR I made in the cloud-init codebase. Thanks for reviewing! :)20:51
Odd_Blokejohnsonshi: Thank you for your work, and congratulations! :)20:53
Odd_BlokeAnhVoMSFT: Agreed, that would be a good topic to include.20:54
Odd_Blokerick_h: AnhVoMSFT has suggested including a summit topic on how distro maintainers who still have to maintain packages for Python 2.7 systems (and, once we drop 3.4 support, 3.4 systems) can collaborate.  We wouldn't have any skin in that game as Ubuntu (xenial is on 3.5), but I think it would be a really valuable topic for us to facilitate as cloud-init upstream.20:56
AnhVoMSFTjohnsonshi I'm looking forward to the follow-up PR. I think there are quite a few important things to follow up20:56
rick_hOdd_Bloke:  AnhVoMSFT that makes sense. I'll add it to the topic list. I think this is another place that having the discourse category for cloud-init might be of use. I don't have permission to set it up but working on getting it enabled.21:02
robjoWhat determines the Data source dependencies?21:04
robjoSeeing some flakiness in Azure21:05
Odd_Blokerobjo: "dependencies" in what sense?21:05
robjoLooking for data source in: ['Azure', 'None'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM', 'NETWORK']21:05
robjoIN this case the Azure data source does not get loaded, i.e. I end up with Searching for network data source in: ['DataSourceNone']21:06
robjobut when I get: "Looking for data source in: ['Azure', 'None'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM']21:06
robjothen the Azure data source gets loaded and the instance gets provisioned21:07
robjothere is no message in the importer, thus I don't know if we are trying to load the Azure data source and it fails and gets discarded, that would then point to Azure21:07
rharperrobjo: in sources/DataSourceAzure.py   at the bottom, there should be 'datasources = []'21:07
rharperrobjo: Azure has been local (DEP_FILESYSTEM) for some time21:08
rharperit uses the EphemeralDHCP to bring up networking to fetch the network config from IMDS; so it wants to only run at local time;21:08
rharperwell, not an empty array, but that's where it's defined;21:09
robjoIt has "datasources = [21:10
robjo    (DataSourceAzure, (sources.DEP_FILESYSTEM, )),21:10
robjo]21:10
robjo"21:10
robjoso that looks OK21:10
rharperyrd21:10
rharperyes21:10
rharperthat's what's in master21:11
robjostill running 19.4 but that's there already21:11
robjoso where does the "NETWORK" part come from and why do I end up with None being loaded instead of Azure?21:12
robjoNone has both dependency on "sources.DEP_FILESYSTEM, sources.DEP_NETWORK"21:13
rharperso stages.py:init runs a fetch() which calls the datasource fetch code, this looks for <site-packages>/cloudinit/sources/*.py;  for each of those files, it checks for the datasources attribute; and extracts the array and sees if there is a datasource class named DataSourceXXX that has deps that match;21:14
rharperrobjo: it sounds like something went wrong at cloud-init-local.service time;  such that when cloud-init.service runs (which runs with mode=dsnet NETWORK deps) that it's still looking for a datasource21:14
rharperthere isn't a NETWORK Azure datasource, it only runs at local time21:15
robjoso the Azure data source gets discarded because of no network dependency21:15
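The matching described above can be illustrated with a short sketch. It assumes the dependency check is an exact set comparison, which is consistent with the two log lines quoted earlier, but it is a simplified model rather than the actual cloud-init lookup. `match_by_deps` and the string stand-ins for the classes and `DEP_*` constants are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

# String stand-ins for sources.DEP_FILESYSTEM / sources.DEP_NETWORK.
DEP_FILESYSTEM = "FILESYSTEM"
DEP_NETWORK = "NETWORK"

# Mirrors the per-module 'datasources = [...]' lists quoted in the log:
# Azure is filesystem-only (local stage), None needs both.
DATASOURCES: List[Tuple[str, Sequence[str]]] = [
    ("DataSourceAzure", (DEP_FILESYSTEM,)),
    ("DataSourceNone", (DEP_FILESYSTEM, DEP_NETWORK)),
]


def match_by_deps(
    depends: Iterable[str], ds_list: List[Tuple[str, Sequence[str]]]
) -> List[str]:
    """Return the datasources whose declared deps equal the requested set."""
    wanted = set(depends)
    return [name for name, deps in ds_list if set(deps) == wanted]


# Local stage (cloud-init-local.service): filesystem only.
print(match_by_deps([DEP_FILESYSTEM], DATASOURCES))
# Network stage (cloud-init.service): filesystem + network.
print(match_by_deps([DEP_FILESYSTEM, DEP_NETWORK], DATASOURCES))
```

Under this model, a boot that skips the local stage never asks for the filesystem-only set, so Azure is never a candidate and only DataSourceNone remains.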
robjofair enough, but how do I get there in the first place?21:15
rharperthe question in the log is why wasn't Azure found at local time21:16
rharperif you make it to cloud-init.service (stage 2) on Azure without finding the datasource then it's all going downhill from there21:17
rharperAzure is found via several checks, azure_chassis in DMI data, or if it has an azure seed dir or a specific ovf-env.xml file;21:18
eanderssonThanks rharper. Let me know how it goes. I have alternative ideas if the impact is too large.21:19
rharpereandersson: ok, I'll append the suggested test to the PR;21:21
robjoOK, yes, 'local' never runs, the log file starts with: "Cloud-init v. 19.4 running 'init' ...."21:23
rharperthat's do it21:25
rharper*that'll do it*21:25
robjoInteresting that this is not consistent and region dependent and storage class dependent....21:26
rharperthat sounds quite odd21:27
rharpermaybe then a race21:27
rharperso sometimes if there's a systemd unit race, systemd will evict a unit21:27
rharpercan you check in the logs for 'cycle' I think is what systemd emits21:27
robjosystemd expects a live system, AFAIK, given that the system with the condition never boots I can only extract information by attaching the system disk from the failed system21:29
rharpershould be in syslog or messages; or if your image writes persistent journal, then you can offline journalctl --directory /path/to/journal  to dump21:31
robjothanks, time to go dig, messages file is empty so there are obviously other issues and journald is setup to forward to rsyslog21:38
AnhVoMSFTI felt like I ran into this problem before, with init running but not init-local22:03
AnhVoMSFTwhich distro is this?22:04
rharperAnhVoMSFT: likely suse related22:08
