=== paride3 is now known as paride | ||
=== cpaelzer_ is now known as cpaelzer | ||
cpaelzer | hi cloud-init party people, I have a system that is maas deployed | 06:42 |
---|---|---|
cpaelzer | on a reboot I know see that it seems to have trouble communicating | 06:42 |
cpaelzer | due to that boot is super slow | 06:42 |
cpaelzer | I get on the SOL console a bunch of "failed posting event" | 06:42 |
cpaelzer | and that slows down the network init "a lot" | 06:42 |
cpaelzer | in fact - I assume due to timeouts - I see a one minute delay between every action | 06:43 |
cpaelzer | example: | 06:43 |
cpaelzer | [ 316.066379] cloud-init[1075]: 2022-03-29 06:33:32,161 - handlers.py[WARNING]: failed posting event: finish: init-network/activate-datasource: SUCCESS: activating datasource | 06:43 |
cpaelzer | [ 376.131293] cloud-init[1075]: 2022-03-29 06:34:32,226 - handlers.py[WARNING]: failed posting event: start: init-network/config-migrator: running config-migrator with frequency always | 06:43 |
cpaelzer | [ 436.243233] cloud-init[1075]: 2022-03-29 06:35:32,338 - handlers.py[WARNING]: failed posting event: start: init-network/config-write-files: running config-write-files with frequency once-per-instance | 06:43 |
cpaelzer | [ 496.317115] cloud-init[1075]: 2022-03-29 06:36:32,412 - handlers.py[WARNING]: failed posting event: start: init-network/config-growpart: running config-growpart with frequency always | 06:43 |
cpaelzer | [ 556.477652] cloud-init[1075]: 2022-03-29 06:37:32,572 - handlers.py[WARNING]: failed posting event: finish: init-network/config-resizefs: SUCCESS: config-resizefs ran successfully | 06:43 |
cpaelzer | [ 616.581319] cloud-init[1075]: 2022-03-29 06:38:32,676 - handlers.py[WARNING]: failed posting event: start: init-network/config-update_hostname: running config-update_hostname with frequency always | 06:43 |
cpaelzer | [ 676.673728] cloud-init[1075]: 2022-03-29 06:39:32,768 - handlers.py[WARNING]: failed posting event: finish: init-network/config-update_etc_hosts: SUCCESS: config-update_etc_hosts ran successfully | 06:43 |
cpaelzer | [ 736.753728] cloud-init[1075]: 2022-03-29 06:40:32,848 - handlers.py[WARNING]: failed posting event: finish: init-network/config-rsyslog: SUCCESS: config-rsyslog previously ran | 06:43 |
cpaelzer | [ 796.809045] cloud-init[1075]: 2022-03-29 06:41:32,904 - handlers.py[WARNING]: failed posting event: finish: init-network/config-users-groups: SUCCESS: config-users-groups previously ran | 06:43 |
cpaelzer | [ 856.881363] cloud-init[1075]: 2022-03-29 06:42:32,976 - handlers.py[WARNING]: failed posting event: finish: init-network: SUCCESS: searching for network datasources | 06:43 |
cpaelzer | once finished the rest of the system is happy | 06:43 |
cpaelzer | I'm clear that this is already an error path as it can't post the events - so slowness is due to the underlying issue whatever it will be | 06:44 |
cpaelzer | two questions for this | 06:44 |
cpaelzer | 1. what would be the best place to start looking why it fails to post these? | 06:44 |
cpaelzer | 2. should we make the error path somewhat less-waiting? For example start with the 60 second timeout that we seem to have, but then over time dimish that to 60 -> 45 -> 30 -> 15 -> 5 seconds so that the bad-path isn't "that slow" ? | 06:45 |
cpaelzer | holmanb: falcojr: blackboxsw: ^^ ? | 06:45 |
=== FergusL2 is now known as FergusL | ||
minimal | cpaelzer: Hi. Which version of cloud-init are you using? have you defined any reporting handlers? which DataSource are you using? | 11:26 |
cpaelzer | hi minimal (and anyone else reading this later) | 12:06 |
cpaelzer | as I mentioned this is a maas deployed system, so as I'd expect it uses DataSourceMAAS | 12:06 |
cpaelzer | 2021-11-25 10:52:05,249 - stages.py[INFO]: Loaded datasource DataSourceMAAS - DataSourceMAAS [http://10-245-168-0--21.maas-internal:5248/MAAS/metadata/] | 12:07 |
cpaelzer | version is: 22.1-14-g2e17a0d6-0ubuntu1~20.04.3 | 12:07 |
cpaelzer | maas defines reporting handlers AFAIK, I'd need to look what exactly it had set up | 12:07 |
cpaelzer | I see the url_helper to try to open a connection | 12:11 |
cpaelzer | that is when the 60 second timeout happens | 12:11 |
cpaelzer | and once the 60 sec are done I get the "failed posting event" | 12:12 |
cpaelzer | url_helper.py[DEBUG]: [0/1] open 'http://10-245-168-0--21.maas-internal:5248/MAAS/metadata/status/wmy6y6' with {'url': 'http://10-245-168-0--21.maas-internal:5248/MAAS/metadata/status/wmy6y6', 'allow_redirects': True, 'method': 'POST', 'headers': {'Authorization': <keys/tokens removed>}} configuration | 12:13 |
minimal | cpaelzer: ok, I misread "maas deployed" as a typo to "mass deployed" | 13:01 |
cpaelzer | I see :-) | 13:02 |
minimal | I'm suspecting that it is trying to post events before a network is actually up (and so such posts fail) | 13:02 |
minimal | you'll notice all the warnings you posted mentioned "init-network" | 13:03 |
cpaelzer | minimal: oh I have seen it fail later as well | 13:13 |
cpaelzer | e.g. here from modile-final (that is later) | 13:14 |
minimal | cpaelzer: ok, my point is that a request cannot be made to a HTTP url without a working network connection | 13:14 |
cpaelzer | Mar 29 09:20:55 node-horsea cloud-init[5577]: Cloud-init v. 22.1-14-g2e17a0d6-0ubuntu1~20.04.3 finished at Tue, 29 Mar 2022 09:19:55 +0000. Datasource DataSourceMAAS [http://10-245-168-0--21> | 13:14 |
cpaelzer | Mar 29 09:20:55 node-horsea cloud-init[5577]: 2022-03-29 09:20:55,962 - handlers.py[WARNING]: failed posting event: finish: modules-final/config-power-state-change: SUCCESS: config-power-sta> | 13:14 |
cpaelzer | Mar 29 09:21:56 node-horsea cloud-init[5577]: 2022-03-29 09:21:56,031 - handlers.py[WARNING]: failed posting event: finish: modules-final: SUCCESS: running modules for final | 13:14 |
minimal | so that would explain the earlier warnings | 13:16 |
cpaelzer | maybe there is a way I can re-create a valid request out of that url_helper.py[DEBUG] log entry? | 13:16 |
cpaelzer | I could then issue that "now" and see if we see e.g. failed-auth, or routing or ... | 13:16 |
minimal | I see no reference to port 5248 in the MAAS cloud-init DataSource code and so assume that this url is somehow added to the cloud-init config (i.e. /etc/cloud/cloud.cfg or file in /etc/cloud/cloud.cfg.d/) | 13:18 |
minimal | so it seems like its a MAAS-specific configuration issue (of the VM image you are using) rather than than a cloud-init one | 13:19 |
falcojr | cpaelzer: https://bugs.launchpad.net/cloud-init/+bug/1910552 | 13:21 |
ubottu | Launchpad bug 1910552 in MAAS "machines fail to boot if MAAS doesn't respond to cloud-init" [Medium, Triaged] | 13:21 |
falcojr | IIRC, maas has a configuration to automatically turn the reporting behavior on | 13:21 |
falcojr | in the near future, I'd like to make this reporting behavior off the main thread as to not block other things | 13:22 |
cpaelzer | falcojr: so cloud-init would boot faster again then, and the reportes will (in a different thread) try to submit as they do now? | 13:24 |
falcojr | cpaelzer: yes | 13:25 |
cpaelzer | falcojr: have you seen jerzy asking if they could set "only report on initial boot" somehow? | 13:25 |
cpaelzer | because only on deploy maas will listen (and care) | 13:25 |
cpaelzer | are they asking how to do, or do they know and plan to add this in 3.3.0 | 13:26 |
falcojr | cpaelzer: I took it to mean they know and are planning to do it on the maas side | 13:27 |
falcojr | cpaelzer: on a maas instance, at /etc/cloud/cloud.cfg.d/90_dpkg_local_cloud_config.cfg, I get | 13:29 |
falcojr | # written by cloud-init debian package per preseed entry | 13:29 |
falcojr | # cloud-init/local-cloud-config | 13:29 |
falcojr | manage_etc_hosts: true | 13:29 |
falcojr | manual_cache_clean: true | 13:29 |
falcojr | reporting: | 13:29 |
falcojr | maas: | 13:29 |
falcojr | consumer_key: FBq7akSpdqmhesRNQj | 13:29 |
falcojr | endpoint: http://10-10-10-0--24.maas-internal:5248/MAAS/metadata/status/pwcbwg | 13:29 |
falcojr | token_key: SD6q9qBqWaRxkhCM3r | 13:29 |
falcojr | token_secret: cU5ktnQvnPLEqaEVyJHNrcRhzANtCfhF | 13:29 |
falcojr | type: webhook | 13:29 |
falcojr | (that's on an internal VM so not concerned about secrets) | 13:30 |
minimal | falcojr: btw when I was looking for docs on c-i reporting in general the only docs as such I could find was doc/examples/cloud-config-reporting.txt | 13:33 |
minimal | there's no real info (apart from the source code) into the various reporting dests | 13:33 |
falcojr | minimal: yeah, we definitely need to add some documentation around it | 13:34 |
minimal | falcojr: e.g. I didn't know there was Azure/Hyper-V reporting until I noticed it in the source | 13:34 |
minimal | wondering if that would give similar error as the relevant hyperv/azure kernel module and daemon might not be ready before c-i would try to talk to it | 13:36 |
falcojr | don't know the specifics of that off the top of my head, but I imagine it would be the same as long as there's some kind of timeout to the request | 13:41 |
minimal | falcojr: I'm working on a OS image for Azure currently and at present I don't start the hv_kvp daemon until the default runlevel whereas c-i (e.g. cloud-init-local) obviously starts earlier | 13:45 |
=== EugenMayer1 is now known as EugenMayer | ||
blackboxsw | holmanb: https://github.com/canonical/cloud-init/pull/1357 because tox -e do_format no longer works locally due to https://github.com/psf/black/issues/2964 | 19:14 |
ubottu | Pull 1357 in canonical/cloud-init "black: bump pinned version to 22.3.0 to avoid click dependency issues" [Open] | 19:14 |
ubottu | Issue 2964 in psf/black "Incompatible with click 8.1.0 (ImportError: cannot import name '_unicodefun' from 'click')" [Closed] | 19:14 |
blackboxsw | Now that we've completed a successful SRU of cloud-init with you driving the downstream ubuntu release,I think it's time we close out on commit bits for the project per https://discourse.ubuntu.com/t/commit-rights-to-cloud-init-for-brett-holman/26271. | 19:16 |
blackboxsw | we'll cobble up an email to the mailing list announcing commit rights and hope to get you owning PR review and merging permissions in short order. | 19:17 |
blackboxsw | the PR above might be a simple one we can go through to test commit bit access | 19:18 |
blackboxsw | json schema-wise this PR is fairly straight-forward https://github.com/canonical/cloud-init/pull/1358 though I probably should consolidate rtd/examples/cloud-config-mount-points into the meta schema examples for the module | 19:31 |
ubottu | Pull 1358 in canonical/cloud-init "schema: add JSON schema for mcollective, migrator and mounts modules" [Open] | 19:31 |
holmanb | @blackboxsw: I can review 1358 once it passes ci :) | 19:53 |
holmanb | blackboxsw: otherwise falcojr suggested #1311 sidechannel as a first candidate so I might do that first | 19:55 |
blackboxsw | +1 on 1311. now that 1357 is landed. since you've already reviewed 1311 anyway. | 19:56 |
holmanb | @blackboxsw: oops, I missed the comment about do_format earlier today | 20:50 |
holmanb | @blackboxsw: will look into fixing that | 20:50 |
blackboxsw | holmanb: no worries James merged it. | 22:43 |
blackboxsw | and looks like you got the ds-identify LXD detection for VMS PR merged good deal. I'd like us to generate and test a new-upstream-snapshot release to Jammy if we could this week. | 22:44 |
blackboxsw | as we have hit beta freeze and coming up on final freeze, it'd be good to get our content from cloud-init into this release to avoid 0-day SRU type situations. | 22:49 |
Generated by irclog2html.py 2.7 by Marius Gedminas - find it at mg.pov.lt!