[15:59] @Odd_Bloke @blackboxsw_ @rharper any rough ETA on when the next SRU of cloud-init will be?
[16:09] AnhVoMSFT: Within the next couple of weeks: once https://github.com/canonical/cloud-init/pull/528 is landed, and once https://github.com/canonical/cloud-init/pull/514 has been approved (that's a packaging change, so it won't "land" per se), we'll be starting to look at it.
[17:41] FYI for fokls who aren't already aware: I will be on vacation for a couple of weeks starting at the end of this week.
[17:42] ("fokls" <-- I am evidently in need of a vacation ;)
[18:33] Thanks @Odd_Bloke
[19:38] eandersson: hey, I saw your response on the openstack timeout issue;
[19:39] eandersson: I was hoping to see a cloud-init log (or if you know) whether the timeout is seen during cloud-init-local.service where it's bringing up the ephemeral dhcp client, or if it's seen much later ... and if you're using the latest cloud-init on ubuntu or something else? I'd really like to rule out the classless static route issue
[19:42] rharper http://paste.openstack.org/show/Udy604fiNCxNqP9pTV7D/
[19:43] So I have seen a couple examples of this and we can reproduce it pretty easily.
[19:43] rharper: what is "the classless static route issue"?
[19:44] This is on CentOS. I haven't tried it on Ubuntu.
[19:44] But the underlying issue is basically either network or something not quite fast enough on the OpenStack side.
[19:45] When creating a ton of VMs (e.g. 200+ VMs at the same time)
[19:45] When I tested by allowing it to retry at least a few times it would never fail.
[19:45] https://bugs.launchpad.net/cloud-init/+bug/1821102
[19:45] Ubuntu bug 1821102 in cloud-init "Cloud-init should not setup ephemeral ipv4 if apply_network_config is False for OpenStack" [High,Fix released]
[19:45] meena: ^
[19:46] it's since been fixed
[19:46] eandersson: thanks, lemme look
[19:46] scale testing sounds like dhcp response times are slow; so it's likely unrelated to the bug I mentioned;
[19:47] My motivation is really that if something like the EC2 (Amazon) data source has a retry, it is very reasonable that OpenStack should be allowed to retry at least a couple of times.
[19:48] But it's easy to just override the default cloud-init pages so it's not a hill I would die on. :D
[19:48] *settings
[19:48] right, i'm still trying to see if we're failing to bring up networking correctly, or if it's the IMDS itself that's failing; from the snippet it's hard to see the whole context
[19:49] A big difference (until recently, of course) is that EC2 only had Intel hardware, so we would always be able to determine that we were on EC2 reliably (and maybe we still can in their ARM, I don't recall).
[19:49] Yea - I unfortunately don't have the full logs at the moment.
[19:49] So EC2 can configure a timeout because that timeout should only affect EC2 instances.
[19:50] But this issue is for sure that either the network (e.g. vxlan) or dhcp / network namespace / security group isn't ready.
[19:50] eandersson: it does look like this is during early boot and the ephemeral dhcp; I see the ip commands which tear down what we brought up
[19:51] eandersson: if possible, it would be great to see an Ubuntu log where I know we've got the classless static routes issue fixed;
[19:51] How would that bug present itself?
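[Editor's note: for readers who want to try the workaround eandersson alludes to above ("override the default cloud-init settings"), a minimal sketch of such a drop-in follows, assuming your cloud-init version reads max_wait / timeout / retries from the OpenStack datasource config as current versions do. The file name and the values are illustrative, not recommendations.]

```yaml
# /etc/cloud/cloud.cfg.d/99-openstack-md-retries.cfg  (hypothetical file name)
datasource:
  OpenStack:
    # Total seconds to keep polling the metadata service before giving up.
    max_wait: 120
    # Per-request TCP timeout, in seconds.
    timeout: 10
    # Retries per request attempt.
    retries: 5
```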
[19:51] Reading
[19:52] so, looking at the logs for the bug I posted
[19:52] they look almost exactly like yours
[19:53] http://paste.openstack.org/show/796830/
[19:53] It's for sure not the same issue because we don't have an additional route for 169
[19:54] and the metadata proxy is local on the hypervisor
[20:00] ok; if it's not the bug and it's a slow metadata service; and 50 seconds isn't enough (10 second timeout * 5 retries); we're left with bumping a default for openstack, which makes the remaining network datasources wait longer;
[20:00] Yea - I mean an alternative would just be to add some documentation tbh.
[20:00] Odd_Bloke: one area we didn't continue the discussion on was whether we should sort the datasource_list by total wait time (timeout * retries)
[20:01] configdrive is still an option for openstack;
[20:03] I have never seen it fail more than once btw.
[20:04] I probably could make it fail more, but would need to create a lot of VMs :D
[20:04] well, it already retries 5 times; with 10 second timeouts before retry
[20:04] It does not
[20:04] well, then I want to see that log
[20:04] what you posted showed exactly one; the default for openstack is 10 second timeout and 5 retries
[20:04] > Giving up on OpenStack md from ['http://169.254.169.254/openstack'] after 0 seconds
[20:04] The TCP timeout is indeed 10s
[20:04] Which is why it "waits for 10s"
[20:05] That code path does not adhere to the retries
[20:07] eandersson: I don't understand; https://github.com/canonical/cloud-init/blob/master/cloudinit/sources/__init__.py#L172 this is the default config for all datasources, unless the subclass overrides;
[20:07] DataSourceOpenStack.py does not have an override (except in your PR);
[20:08] so, I see no reason why OS would *not* retry the expected 5 times; unless we're not getting 404; and there's some url_helper path that's skipping the retry ...
[20:08] hence needing the logs
[20:09] It's because the code path uses wait_for_url
[20:09] https://github.com/canonical/cloud-init/blob/411dbbebd328163bcb1c676cc711f3e5ed805375/cloudinit/sources/DataSourceOpenStack.py#L82
[20:09] It does not even pass on the retries
[20:10] It basically only uses the max_wait_seconds to determine how long it will retry for
[20:10] In this case it's -1 (no retries) so it never retries
[20:10] Odd_Bloke: Thanks for always reviewing my PRs, Odd_Bloke! I believe my PR (https://github.com/canonical/cloud-init/pull/468/) is finally ready.
[20:10] It does however use the tcp timeout. So it may look like it tries for "10 s"
[20:10] Odd_Bloke: you mentioned a couple of prs that needed review, I just +1'd https://github.com/canonical/cloud-init/pull/531 were there 2 others that needed eyes?
[20:11] https://github.com/canonical/cloud-init/blob/master/cloudinit/url_helper.py#L344 wait_for_url is defined here
[20:13] yeah; I see the try-once with -1; which is the default;
[20:25] Yea - that code path is just so unforgiving
[20:26] so, CloudStack, Ec2, Exoscale, MAAS, and Openstack (ignoring ovf since it's local ds mostly); all have max_wait set to something other than default;
[20:26] And there are so many things that happen at the same time when spawning a new VM.
[20:26] blackboxsw_: Ryan beat you to them. :)
[20:26] (Thanks for the review!)
[20:26] that's the spirit!
[20:26] only OpenStack and MAAS, I think, support multi-arch
[20:26] Obviously more thanks to rharper for doing multiple reviews.
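[Editor's note: to make the behaviour eandersson describes concrete, here is a paraphrased Python sketch of the wait_for_url() control flow under discussion — not the verbatim cloud-init implementation; see the url_helper.py link above for the real code. The key point: a max_wait of -1 makes the time-up check true immediately, so the URL list is swept exactly once and the only delay the caller observes is the per-request TCP timeout.]

```python
import time
import requests

def timeup(max_wait, start_time):
    # max_wait <= 0 counts as "time already up", so the loop below
    # gets exactly one sweep over the URLs before giving up.
    return max_wait <= 0 or (time.monotonic() - start_time > max_wait)

def wait_for_url(urls, max_wait=-1, timeout=10, sleep_time=1):
    start = time.monotonic()
    while True:
        for url in urls:
            try:
                # The "10 s wait" seen in the logs is just this
                # per-request TCP timeout, not a retry loop.
                resp = requests.get(url, timeout=timeout)
                if resp.ok:
                    return url
            except requests.RequestException:
                pass  # try the next URL / next sweep
        if timeup(max_wait, start):
            return None  # with max_wait=-1 we land here after one sweep
        time.sleep(sleep_time)
```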
^_^
[20:27] Odd_Bloke: sure
[20:27] man, Xenial is the killer here (and maybe other OSes which use cloud-init but don't use ds-identify)
[20:28] johnsonshi: My pleasure! I'm taking another look through the code to confirm that I'm happy with it, but note that I would like you to revert your removal of the type annotations before it lands.
[20:28] it will walk each DS at network time; so putting a large max_wait means pain for anything coming after it
[20:31] @Odd_Bloke - regarding typehints, what are your recommendations regarding other distros like RHEL/CentOS/SUSE, who have many enterprise customers that will continue using them for at least another 3-5 years? There are features that we should be delivering them irrespective of the python version. Do you recommend we send another PR to stable-19.4 with the exact change except dropping the typehints
[20:31] in that PR?
[20:31] Odd_Bloke: Thanks for the note about the type hints. Any recommendations for the pre-existing RHEL and CentOS image maintainers out there? Those images will still be supported for a few more years.
[20:32] oops, didn't realize johnsonshi was on IRC just now and asking the same question :-)
[20:36] I don't have any particular recommendations, no: it really depends on how they handle backporting cloud-init changes (and, for that matter, how their distro/company handles backporting Python 3-only projects to their Python 2-only releases more generally). Or, expressed another way: I would expect them to know better than any recommendation I could give. :)
[20:37] I think using `stable-19.4` would be appropriate, but note that the type annotations are not the only way in which cloud-init code cannot run on Python 2 any longer, so just removing them may not be sufficient.
[20:37] Of course, it's only worth using `stable-19.4` if the maintainers of the existing images will use it, so you'd need to coordinate that with them.
[20:39] are typehints going to be a requirement for all new methods going into cloud-init?
[20:39] Odd_Bloke: so I think the best test of the cost of changing openstack max_wait to something other than -1 would be ec2 xenial boots; which at this time still are not ds-identify strict; so it will walk all of the ds in the list; we could look at the current boot time now; make the change to openstack per eandersson's PR and retest boot to see the impact
[20:40] AnhVoMSFT: Not a requirement, but I would expect they will become used more generally as people adapt to being able to use them.
[20:40] And I would like to discuss the timeline for dropping 3.4 support at the summit, after which point we would be able to use the `typing` module; I would expect an uptick in their use then, too.
[20:41] (We can only annotate simple cases without `typing`, which obviously reduces the number of annotations across the codebase.)
[20:41] Given that this PR brings more clarity due to the refactoring and additional telemetry (we previously did not capture the call to extract goal state), I think the typehints aren't the major enhancement, and dropping them doesn't really make the code worse (the code is being made better by the refactoring; it's not being made worse because the typehints weren't there to start with)
[20:42] by extract goal state I mean extracting the certificates.xml - duh
[20:43] To be clear, I didn't say that the PR made our codebase worse, I said that type annotations make our codebase better.
[20:44] I'd be happy to accept the type annotations in a follow-up PR, if that would give you an easier-to-backport commit?
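[Editor's note: on the `typing` point above — function annotations with built-in types work on any Python 3, but parameterized types need the `typing` module, which only entered the standard library in Python 3.5; hence the "simple cases only" restriction while 3.4 is supported. The function names below are made up for illustration.]

```python
# Fine without the `typing` module: plain built-in types as annotations.
# (Hypothetical function, for illustration only.)
def render_hostname(name: str, fqdn: bool) -> str:
    return name + ".internal" if fqdn else name

# Needs the `typing` module (stdlib only since Python 3.5), so it is
# off-limits while Python 3.4 must still be supported:
#
#   from typing import Dict, List
#
#   def parse_routes(raw: str) -> List[Dict[str, str]]:
#       ...
```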
[20:44] Odd_Bloke AnhVoMSFT: Thanks to both of you for clarifying. :) What should the next steps be for my PR?
[20:46] AnhVoMSFT Odd_Bloke: Ok I'll be going with that suggestion and have that in a follow-up PR instead. Thanks folks.
[20:46] I think we should address the typehints in a follow-up PR, perhaps adding typehints to all the methods in DataSourceAzure; that'll be cleaner and provide better static analysis during testing too
[20:47] AnhVoMSFT: We won't be able to add them to all the methods until we have `typing` access, but I agree that typing as many as we can currently would be great!
[20:50] Thanks @Odd_Bloke - I think one topic for the cloud-summit this time should be on how distro maintainers (mostly looking at RHEL, SUSE, Oracle Linux) plan on packaging cloud-init to support their python 2.7 and python 3.4 customers
[20:50] johnsonshi: AnhVoMSFT: Landed! :)
[20:51] Odd_Bloke: Woooo! :) I think that was the first substantial PR I made in the cloud-init codebase. Thanks for reviewing! :)
[20:53] johnsonshi: Thank you for your work, and congratulations! :)
[20:54] AnhVoMSFT: Agreed, that would be a good topic to include.
[20:56] rick_h: AnhVoMSFT has suggested including a summit topic on how distro maintainers who still have to maintain packages for Python 2.7 systems (and, once we drop 3.4 support, 3.4 systems) can collaborate. We wouldn't have any skin in that game as Ubuntu (xenial is on 3.5), but I think it would be a really valuable topic for us to facilitate as cloud-init upstream.
[20:56] johnsonshi I'm looking forward to the follow-up PR. I think there are quite a few important things to follow up
[21:02] Odd_Bloke: AnhVoMSFT that makes sense. I'll add it to the topic list. I think this is another place that having the discourse category for cloud-init might be of use. I don't have permission to set it up but working on getting it enabled.
[21:04] What determines the data source dependencies?
[21:05] Seeing some flakiness in Azure
[21:05] robjo: "dependencies" in what sense?
[21:05] Looking for data source in: ['Azure', 'None'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM', 'NETWORK']
[21:06] In this case the Azure data source does not get loaded, i.e. I end up with "Searching for network data source in: ['DataSourceNone']"
[21:06] but when I get: "Looking for data source in: ['Azure', 'None'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM']"
[21:07] then the Azure data source gets loaded and the instance gets provisioned
[21:07] there is no message in the importer, so I don't know if we are trying to load the Azure data source and it fails and gets discarded; that would then point to Azure
[21:07] robjo: in sources/DataSourceAzure.py at the bottom, there should be 'datasources = []'
[21:08] robjo: Azure has been local (DEP_FILESYSTEM) for some time
[21:08] it uses the EphemeralDHCP to bring up networking to fetch the network config from IMDS; so it wants to only run at local time;
[21:09] well, not an empty array, but that's where it's defined;
[21:10] It has "datasources = [
[21:10] (DataSourceAzure, (sources.DEP_FILESYSTEM, )),
[21:10] ]
[21:10] "
[21:10] so that looks OK
[21:10] yrd
[21:10] yes
[21:11] that's what's in master
[21:11] still running 19.4 but that's there already
[21:12] so where does the "NETWORK" part come from and why do I end up with None being loaded instead of Azure?
[21:13] None depends on both "sources.DEP_FILESYSTEM, sources.DEP_NETWORK"
[21:14] so stages.py:init runs a fetch() which calls the datasource fetch code; this looks for /cloudinit/sources/*.py; for each of those files, it checks for the datasources attribute; and extracts the array and sees if there is a datasource class named DataSourceXXX that has deps that match;
[21:14] robjo: it sounds like something went wrong at cloud-init-local.service time; such that when cloud-init.service runs (which runs with mode=dsnet NETWORK deps) it's still looking for a datasource
[21:15] there isn't a NETWORK Azure datasource, it only runs at local time
[21:15] so the Azure data source gets discarded because of no network dependency
[21:15] fair enough, but how do I get there in the first place?
[21:16] the question in the log is why wasn't Azure found at local time
[21:17] if you make it to cloud-init.service (stage 2) on Azure without finding the datasource then it's all going downhill from there
[21:18] Azure is found via several checks: azure_chassis in DMI data, or if it has an azure seed dir or a specific ovf-env.xml file;
[21:19] Thanks rharper. Let me know how it goes. I have alternative ideas if the impact is too large.
[21:21] eandersson: ok, I'll append the suggested test to the PR;
[21:23] OK, yes, 'local' never runs, the log file starts with: "Cloud-init v. 19.4 running 'init' ...."
[21:25] that's do it
[21:25] *that'll do it*
[21:26] Interesting that this is not consistent, and is region-dependent and storage-class-dependent....
[21:27] that sounds quite odd
[21:27] maybe then a race
[21:27] so sometimes if there's a systemd unit race, systemd will evict a unit
[21:27] can you check in the logs for 'cycle'? I think that's what systemd emits
[21:29] systemd expects a live system, AFAIK; given that the system with the condition never boots, I can only extract information by attaching the system disk from the failed system
[21:31] should be in syslog or messages; or if your image writes a persistent journal, then you can run journalctl --directory /path/to/journal offline to dump it
[21:38] thanks, time to go dig; the messages file is empty so there are obviously other issues, and journald is set up to forward to rsyslog
[22:03] I felt like I ran into this problem before, with init running but not init-local
[22:04] which distro is this?
[22:08] AnhVoMSFT: likely suse related
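[Editor's note: to summarize the dependency-matching mechanics rharper walks through above, here is a paraphrased Python sketch — not the verbatim cloudinit/sources/__init__.py code. Each datasource module exports a `datasources` list of (class, deps) tuples, and each boot stage keeps only the entries whose deps match that stage. Azure declares only DEP_FILESYSTEM, so if the local stage misses it, the network stage (which asks for FILESYSTEM + NETWORK) can never match it and falls through to DataSourceNone.]

```python
DEP_FILESYSTEM = "FILESYSTEM"
DEP_NETWORK = "NETWORK"

# What each datasource module exports, paraphrased:
azure_datasources = [("DataSourceAzure", (DEP_FILESYSTEM,))]
none_datasources = [("DataSourceNone", (DEP_FILESYSTEM, DEP_NETWORK))]

def list_sources(entries, wanted_deps):
    """Keep only datasource classes whose declared deps match the stage."""
    return [name for name, deps in entries if set(deps) == set(wanted_deps)]

all_entries = azure_datasources + none_datasources

# cloud-init-local.service stage: filesystem deps only -> Azure matches.
print(list_sources(all_entries, (DEP_FILESYSTEM,)))
# ['DataSourceAzure']

# cloud-init.service (network) stage: Azure no longer matches, so only
# DataSourceNone is left -- matching the log robjo pasted.
print(list_sources(all_entries, (DEP_FILESYSTEM, DEP_NETWORK)))
# ['DataSourceNone']
```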