=== paride3 is now known as paride === cpaelzer_ is now known as cpaelzer [06:42] hi cloud-init party people, I have a system that is maas deployed [06:42] on a reboot I know see that it seems to have trouble communicating [06:42] due to that boot is super slow [06:42] I get on the SOL console a bunch of "failed posting event" [06:42] and that slows down the network init "a lot" [06:43] in fact - I assume due to timeouts - I see a one minute delay between every action [06:43] example: [06:43] [ 316.066379] cloud-init[1075]: 2022-03-29 06:33:32,161 - handlers.py[WARNING]: failed posting event: finish: init-network/activate-datasource: SUCCESS: activating datasource [06:43] [ 376.131293] cloud-init[1075]: 2022-03-29 06:34:32,226 - handlers.py[WARNING]: failed posting event: start: init-network/config-migrator: running config-migrator with frequency always [06:43] [ 436.243233] cloud-init[1075]: 2022-03-29 06:35:32,338 - handlers.py[WARNING]: failed posting event: start: init-network/config-write-files: running config-write-files with frequency once-per-instance [06:43] [ 496.317115] cloud-init[1075]: 2022-03-29 06:36:32,412 - handlers.py[WARNING]: failed posting event: start: init-network/config-growpart: running config-growpart with frequency always [06:43] [ 556.477652] cloud-init[1075]: 2022-03-29 06:37:32,572 - handlers.py[WARNING]: failed posting event: finish: init-network/config-resizefs: SUCCESS: config-resizefs ran successfully [06:43] [ 616.581319] cloud-init[1075]: 2022-03-29 06:38:32,676 - handlers.py[WARNING]: failed posting event: start: init-network/config-update_hostname: running config-update_hostname with frequency always [06:43] [ 676.673728] cloud-init[1075]: 2022-03-29 06:39:32,768 - handlers.py[WARNING]: failed posting event: finish: init-network/config-update_etc_hosts: SUCCESS: config-update_etc_hosts ran successfully [06:43] [ 736.753728] cloud-init[1075]: 2022-03-29 06:40:32,848 - handlers.py[WARNING]: failed posting event: finish: init-network/config-rsyslog: SUCCESS: config-rsyslog previously ran [06:43] [ 796.809045] cloud-init[1075]: 2022-03-29 06:41:32,904 - handlers.py[WARNING]: failed posting event: finish: init-network/config-users-groups: SUCCESS: config-users-groups previously ran [06:43] [ 856.881363] cloud-init[1075]: 2022-03-29 06:42:32,976 - handlers.py[WARNING]: failed posting event: finish: init-network: SUCCESS: searching for network datasources [06:43] once finished the rest of the system is happy [06:44] I'm clear that this is already an error path as it can't post the events - so slowness is due to the underlying issue whatever it will be [06:44] two questions for this [06:44] 1. what would be the best place to start looking why it fails to post these? [06:45] 2. should we make the error path somewhat less-waiting? For example start with the 60 second timeout that we seem to have, but then over time dimish that to 60 -> 45 -> 30 -> 15 -> 5 seconds so that the bad-path isn't "that slow" ? [06:45] holmanb: falcojr: blackboxsw: ^^ ? === FergusL2 is now known as FergusL [11:26] cpaelzer: Hi. Which version of cloud-init are you using? have you defined any reporting handlers? which DataSource are you using? [12:06] hi minimal (and anyone else reading this later) [12:06] as I mentioned this is a maas deployed system, so as I'd expect it uses DataSourceMAAS [12:07] 2021-11-25 10:52:05,249 - stages.py[INFO]: Loaded datasource DataSourceMAAS - DataSourceMAAS [http://10-245-168-0--21.maas-internal:5248/MAAS/metadata/] [12:07] version is: 22.1-14-g2e17a0d6-0ubuntu1~20.04.3 [12:07] maas defines reporting handlers AFAIK, I'd need to look what exactly it had set up [12:11] I see the url_helper to try to open a connection [12:11] that is when the 60 second timeout happens [12:12] and once the 60 sec are done I get the "failed posting event" [12:13] url_helper.py[DEBUG]: [0/1] open 'http://10-245-168-0--21.maas-internal:5248/MAAS/metadata/status/wmy6y6' with {'url': 'http://10-245-168-0--21.maas-internal:5248/MAAS/metadata/status/wmy6y6', 'allow_redirects': True, 'method': 'POST', 'headers': {'Authorization': }} configuration [13:01] cpaelzer: ok, I misread "maas deployed" as a typo to "mass deployed" [13:02] I see :-) [13:02] I'm suspecting that it is trying to post events before a network is actually up (and so such posts fail) [13:03] you'll notice all the warnings you posted mentioned "init-network" [13:13] minimal: oh I have seen it fail later as well [13:14] e.g. here from modile-final (that is later) [13:14] cpaelzer: ok, my point is that a request cannot be made to a HTTP url without a working network connection [13:14] Mar 29 09:20:55 node-horsea cloud-init[5577]: Cloud-init v. 22.1-14-g2e17a0d6-0ubuntu1~20.04.3 finished at Tue, 29 Mar 2022 09:19:55 +0000. Datasource DataSourceMAAS [http://10-245-168-0--21> [13:14] Mar 29 09:20:55 node-horsea cloud-init[5577]: 2022-03-29 09:20:55,962 - handlers.py[WARNING]: failed posting event: finish: modules-final/config-power-state-change: SUCCESS: config-power-sta> [13:14] Mar 29 09:21:56 node-horsea cloud-init[5577]: 2022-03-29 09:21:56,031 - handlers.py[WARNING]: failed posting event: finish: modules-final: SUCCESS: running modules for final [13:16] so that would explain the earlier warnings [13:16] maybe there is a way I can re-create a valid request out of that url_helper.py[DEBUG] log entry? [13:16] I could then issue that "now" and see if we see e.g. failed-auth, or routing or ... [13:18] I see no reference to port 5248 in the MAAS cloud-init DataSource code and so assume that this url is somehow added to the cloud-init config (i.e. /etc/cloud/cloud.cfg or file in /etc/cloud/cloud.cfg.d/) [13:19] so it seems like its a MAAS-specific configuration issue (of the VM image you are using) rather than than a cloud-init one [13:21] cpaelzer: https://bugs.launchpad.net/cloud-init/+bug/1910552 [13:21] Launchpad bug 1910552 in MAAS "machines fail to boot if MAAS doesn't respond to cloud-init" [Medium, Triaged] [13:21] IIRC, maas has a configuration to automatically turn the reporting behavior on [13:22] in the near future, I'd like to make this reporting behavior off the main thread as to not block other things [13:24] falcojr: so cloud-init would boot faster again then, and the reportes will (in a different thread) try to submit as they do now? [13:25] cpaelzer: yes [13:25] falcojr: have you seen jerzy asking if they could set "only report on initial boot" somehow? [13:25] because only on deploy maas will listen (and care) [13:26] are they asking how to do, or do they know and plan to add this in 3.3.0 [13:27] cpaelzer: I took it to mean they know and are planning to do it on the maas side [13:29] cpaelzer: on a maas instance, at /etc/cloud/cloud.cfg.d/90_dpkg_local_cloud_config.cfg, I get [13:29] # written by cloud-init debian package per preseed entry [13:29] # cloud-init/local-cloud-config [13:29] manage_etc_hosts: true [13:29] manual_cache_clean: true [13:29] reporting: [13:29] maas: [13:29] consumer_key: FBq7akSpdqmhesRNQj [13:29] endpoint: http://10-10-10-0--24.maas-internal:5248/MAAS/metadata/status/pwcbwg [13:29] token_key: SD6q9qBqWaRxkhCM3r [13:29] token_secret: cU5ktnQvnPLEqaEVyJHNrcRhzANtCfhF [13:29] type: webhook [13:30] (that's on an internal VM so not concerned about secrets) [13:33] falcojr: btw when I was looking for docs on c-i reporting in general the only docs as such I could find was doc/examples/cloud-config-reporting.txt [13:33] there's no real info (apart from the source code) into the various reporting dests [13:34] minimal: yeah, we definitely need to add some documentation around it [13:34] falcojr: e.g. I didn't know there was Azure/Hyper-V reporting until I noticed it in the source [13:36] wondering if that would give similar error as the relevant hyperv/azure kernel module and daemon might not be ready before c-i would try to talk to it [13:41] don't know the specifics of that off the top of my head, but I imagine it would be the same as long as there's some kind of timeout to the request [13:45] falcojr: I'm working on a OS image for Azure currently and at present I don't start the hv_kvp daemon until the default runlevel whereas c-i (e.g. cloud-init-local) obviously starts earlier === EugenMayer1 is now known as EugenMayer [19:14] holmanb: https://github.com/canonical/cloud-init/pull/1357 because tox -e do_format no longer works locally due to https://github.com/psf/black/issues/2964 [19:14] Pull 1357 in canonical/cloud-init "black: bump pinned version to 22.3.0 to avoid click dependency issues" [Open] [19:14] Issue 2964 in psf/black "Incompatible with click 8.1.0 (ImportError: cannot import name '_unicodefun' from 'click')" [Closed] [19:16] Now that we've completed a successful SRU of cloud-init with you driving the downstream ubuntu release,I think it's time we close out on commit bits for the project per https://discourse.ubuntu.com/t/commit-rights-to-cloud-init-for-brett-holman/26271. [19:17] we'll cobble up an email to the mailing list announcing commit rights and hope to get you owning PR review and merging permissions in short order. [19:18] the PR above might be a simple one we can go through to test commit bit access [19:31] json schema-wise this PR is fairly straight-forward https://github.com/canonical/cloud-init/pull/1358 though I probably should consolidate rtd/examples/cloud-config-mount-points into the meta schema examples for the module [19:31] Pull 1358 in canonical/cloud-init "schema: add JSON schema for mcollective, migrator and mounts modules" [Open] [19:53] @blackboxsw: I can review 1358 once it passes ci :) [19:55] blackboxsw: otherwise falcojr suggested #1311 sidechannel as a first candidate so I might do that first [19:56] +1 on 1311. now that 1357 is landed. since you've already reviewed 1311 anyway. [20:50] @blackboxsw: oops, I missed the comment about do_format earlier today [20:50] @blackboxsw: will look into fixing that [22:43] holmanb: no worries James merged it. [22:44] and looks like you got the ds-identify LXD detection for VMS PR merged good deal. I'd like us to generate and test a new-upstream-snapshot release to Jammy if we could this week. [22:49] as we have hit beta freeze and coming up on final freeze, it'd be good to get our content from cloud-init into this release to avoid 0-day SRU type situations.