ananke | how can I identify which cloud-init module is erroring out? cloud-init status --long shows 'degraded done', but 'errors' shows 'errors: []' | 14:06 |
---|---|---|
ananke | the exit code I get is '2', while the full message is https://dpaste.org/pbAAp | 14:15 |
ananke | while cloud-init.log ends with main.py[DEBUG] Ran 12 modules with 0 failures | 14:26 |
ananke | and handlers.py[DEBUG] finish: modules-final: SUCCESS: running modules for final | 14:26 |
holman | ananke: the exit 2 relates to the information you see after 'recoverable_errors' | 14:37 |
holman | ananke: does your userdata try to use the `cc_refresh_rmc_and_interface` key? | 14:38 |
ananke | it doesn't. frankly, that's the first time I see that module. I'll look more in the logs to see what's going on | 14:39 |
holman | ananke: the other thing going on is deprecation warnings due to using keys that will go away | 14:39 |
holman | ananke: which distro? | 14:40 |
ananke | kali linux, rolling, which is debian based these days | 14:40 |
ananke | behind the scenes the process is a bit convoluted. I'm building an AMI from the official upstream AMI, using packer. then I take the resulting AMI and I do the same thing with a trimmed down packer/cloud-init configuration | 14:42 |
holman | gotcha | 14:42 |
ananke | both packer recipes start with cloud-init status --wait, and the first build process is fine. so it's something I introduce during the first stage, or that I feed during the second stage. second stage cloud-init contains bare minimum, and I've trimmed it down to the point of being just nothing | 14:43 |
ananke | so now I'm dissecting things more. the problem is that the logs provide very little clue, there's nothing with 'error', etc | 14:44 |
holman | ananke: can you tell me what `grep rmc /etc/cloud/cloud.cfg` shows? | 14:45 |
ananke | if I could scope it down to a module, it would be ideal. sure, just a sec. starting a new run | 14:45 |
ananke | on an instance running AMI from first stage, that returns: └─# grep rmc /etc/cloud/cloud.cfg | 14:46 |
ananke | - reset_rmc | 14:46 |
ananke | - refresh_rmc_and_interface | 14:46 |
holman | ananke: did that config come stock with the image? | 14:47 |
ananke | ahh, and cloud-init status produces exit code 2 there too | 14:47 |
ananke | holman: yes, it did | 14:47 |
ananke | we don't touch /etc/cloud/cloud.cfg | 14:47 |
ananke | cloud-init 24.1.1-1 | 14:48 |
holman | ananke: that specific module was removed in 23.2 | 14:49 |
holman | so the distro provider needs to drop that key from the config | 14:49 |
ananke | I don't suppose you have an idea what module is this? I can't find it in the documentation | 14:49 |
ananke | ohh, I see. thank you. and this explains, I was looking at 24.1 documentation | 14:49 |
holman | yeah | 14:50 |
holman | so probably a debian change? | 14:50 |
ananke | wouldn't surprise me. I can file a bug with Kali, and see if they'll file it upstream | 14:50 |
holman | ananke: perfect, thanks! | 14:50 |
holman | ananke: could you link me the bug once it's filed? | 14:51 |
ananke | certainly. one quick question: as a temporary workaround can I override the inclusion of this module via a cfg.d file, or will I have to remove it from cloud.cfg? | 14:51 |
holman | just remove it from cloud.cfg and that should go away | 14:52 |
ananke | cheers. I'll test that | 14:53 |
holman | and once you update your cloud-config keys to use the non-deprecated ones, I would expect status to exit 0 | 14:53 |
holman | ananke: just to reiterate: all that you need to debug an exit 2 error code should be visible in the recoverable errors key | 15:17 |
holman | ananke: and here's some bedtime reading material if you want more background on it -> https://cloudinit.readthedocs.io/en/latest/explanation/return_codes.html | 15:18 |
ananke | thank you. I'm currently rebuilding the image in stage 1 to see if removing those rmc modules will suffice. I'll also read that documentation. once I'm certain this is the issue, I'll file bug with Kali and pass you the info | 15:21 |
holman | sounds good :-) | 15:22 |
holman | looking forward to hearing back from you | 15:22 |
holman | minimal: ping (when you're around) | 15:23 |
holman | minimal: I know you took a peek at #5120 and had some questions around gdate / etc | 15:55 |
holman | minimal: that PR is approved by falcojr but I'll wait to merge in case you have any outstanding concerns | 15:56 |
holman | minimal: I added context related to the usage of gnu date to the PR context | 15:56 |
minimal | holman: thanks, having a read of the comments now | 16:02 |
holman | minimal: thanks :) | 16:03 |
minimal | BTW yes Busybox ash is not necessarily 100% POSIX compliant (depending on its config when built) but from memory "shellcheck -s sh ..." does complain about $'..' not being POSIX | 16:03 |
holman | yeah I haven't found anything that is 100% POSIX compliant | 16:04 |
holman | dash isn't either | 16:04 |
holman | good to know on the shellcheck -s sh | 16:05 |
minimal | well ash *CAN* be if you don't enable any of the non-POSIX compile options | 16:05 |
holman | good to know | 16:05 |
minimal | i.e. one of the links you referenced mentioned the ENABLE_ASH_BASH_COMPAT compile option - turn that off and you're 1 step closer to POSIX compliance | 16:06 |
holman | but I honestly don't care about 100% compliance without a real use-case | 16:06 |
holman | "local" is really nice | 16:06 |
minimal | Alpine's ash turns that on: https://git.alpinelinux.org/aports/tree/main/busybox/busyboxconfig#n1143 | 16:07 |
minimal | yeah, "local" is probably the only non-POSIX feature I tend to use in my shellscripts | 16:07 |
holman | I recently discovered the freebsd man page for sh (also an Almquist derivative) | 16:10 |
minimal | you can never have enough shell options eh? ;-) | 16:10 |
holman | it's really good | 16:11 |
holman | https://man.freebsd.org/cgi/man.cgi?sh | 16:11 |
holman | hehe, so many options | 16:11 |
holman | minimal: one more thing | 16:15 |
holman | minimal: do you have any pointers for installing python3.12 on edge? | 16:15 |
holman | I can just wait for 3.12 to become available if not | 16:16 |
minimal | there are no "released" py3.12 packages for Edge - basically as a lot of python packages are breaking with 3.12 ncopa hasn't yet pushed a 3.12 package as that would mess up Edge | 16:17 |
holman | gotcha | 16:17 |
holman | I'd have to build from source then to repro that issue? | 16:18 |
minimal | however he does have a "personal" repo with the 3.12 packages he created and there were notes posted on IRC how to make use of this for others working on fixing 3.12-related issues | 16:18 |
minimal | I'll dig up those notes... | 16:18 |
holman | thanks | 16:19 |
minimal | cloud-init already has some testing for py3.12, right? | 16:21 |
minimal | I thought I saw a reference to 3.12 a while ago in some of the github workflow stuff | 16:21 |
holman | minimal: 3.13 even :-) | 16:42 |
holman | https://github.com/canonical/cloud-init/blob/93f5a0165069603b2eb45ec20983393170fe78a9/.github/workflows/unit.yml#L28 | 16:43 |
minimal | holman: so where would the 3.12 logs be visible? | 16:52 |
minimal | BTW re the $'...' thing, https://www.shellcheck.net/wiki/SC2039 | 16:52 |
holman | Python 3.12 pytest logs should be visible in every PR under the "Checks" tab -> open the "Unit tests" dropdown | 16:57 |
holman | ex: https://github.com/canonical/cloud-init/pull/5162/checks | 16:58 |
-ubottu:#cloud-init- Pull 5162 in canonical/cloud-init "Deprecate the users ssh-authorized-keys property" [Open] | 16:58 | |
holman | which triggered this ci job: https://github.com/canonical/cloud-init/actions/runs/8626580327/job/23665266600?pr=5162 | 16:58 |
minimal | so then that ci unittest passing points to the issue likely being specific to musl and py3.12. I wonder if freebsd would have similar issues with python 3.12 | 17:03 |
holman | minimal: agreed | 17:50 |
holman | okay just figured out a reproducer | 17:51 |
minimal | oh? interesting | 18:08 |
ananke | hmm, more deprecated stuff: Invalid user-data /var/lib/cloud/instances/i-0b277ba864e274251/cloud-config.txt | 18:40 |
ananke | Error: Cloud config schema errors: system_info: Additional properties are not allowed ('system_info' was unexpected) | 18:40 |
ananke | what was system_info replaced with? | 18:41 |
minimal | ananke: system_info isn't user-data, it's specified in /etc/cloud/cloud.cfg | 18:50 |
minimal | used to specify things like the distro name, default user name, etc | 18:51 |
minimal | https://cloudinit.readthedocs.io/en/latest/reference/base_config_reference.html#system-info-keys | 18:52 |
ananke | uhmm, user-data can provide cloud-init config. this wasn't an issue in earlier versions of cloud-init | 18:52 |
minimal | from that page: "Anything under system_info cannot be overridden by vendor data, user data, or any other handlers or transforms." | 18:53 |
ananke | hah. so we've had it all wrong for years, and cloud-init just started enforcing it | 18:53 |
* ananke facepalms | 18:53 | |
minimal | "use the docs Luke..." ;-) | 18:54 |
ananke | minimal: that's easy to say when you know what docs to look in. I kept looking | 18:55 |
minimal | to find that I just used the "search" functionality on the RTD site | 18:55 |
ananke | I did too, but missed that section | 18:55 |
minimal | that was the 3rd result returned for "system_info": "Base configuration" | 18:56 |
ananke | been fighting this stuff too long today. the fact that it didn't become an issue until now is what was throwing me off | 18:56 |
ananke | literally the same config at the start of the process is valid, and it becomes invalid at the end of it, while cloud-init is updated | 18:57 |
minimal | it doesn't become "invalid" at the end, it was always invalid, it just wasn't validated automatically in (some) previous releases | 18:57 |
ananke | so I'm wondering how in the world this is going to work moving forward | 18:57 |
minimal | invalid config is always invalid config, even if you don't realise/aren't told so | 18:58 |
minimal | um, you validate any user-data before using it? | 18:59 |
ananke | minimal: validate _how_? | 18:59 |
minimal | using the cli validation? | 18:59 |
ananke | no, I don't. like I said, this particular configuration was never a problem until the new version of cloud-init | 19:00 |
minimal | it wasn't a "problem" but it was still incorrect/invalid | 19:00 |
ananke | problems with the syntax in the past were more apparent | 19:00 |
minimal | "cloud-init schema ..." to check | 19:00 |
ananke | minimal: part of the problem is that I can't pre-validate it before feeding it as user-data. chicken & egg: there's no cloud-init in place to validate it | 19:01 |
minimal | i.e. "cloud-init schema -t cloud-config -c myconfig.yml" | 19:01 |
minimal | you use another machine/VM to run the validation on? | 19:02 |
ananke | not for user-data. and I'd need to somehow have cloud-init of a given version available. this process uses a single container image with packer/aws tools/etc. it's used to build images for many different distros, each one with different version of cloud-init | 19:03 |
minimal | well as what represents valid user-data can change between cloud-init releases your infra should take this into account | 19:04 |
ananke | easier said than done | 19:05 |
minimal | ok but that doesn't change the fact | 19:05 |
ananke | not sure why you're beating the dead horse at this point | 19:05 |
minimal | i.e. if you're creating user-data using some form of templating then that template could be written to cater for differing versions | 19:06 |
ananke | as to being invalid in previous versions, I find it dubious. the default user/gecos fields I'm feeding via user-data _are_ used on first boot. | 19:14 |
ananke | or rather, they were with an earlier version | 19:15 |
ananke | I'll do some more digging later | 19:15 |
minimal | how are you providing them? inside a top-level "users:" section? | 19:15 |
ananke | system_info: default_user: | 19:16 |
minimal | which earlier cloud-init version? | 19:16 |
ananke | sorry, had to drive to the office. let me check | 19:34 |
ananke | removing ability to specify default_user via user-data would be fairly problematic. we rely on it, and it has worked just fine on ubuntu/centos/debian/etc for years. | 19:38 |
ananke | so it worked fine on 23.3.1, and schema check returns no issues there. after cloud-init is updated to 24.1.1 and system is rebooted, schema is no longer accepted | 19:40 |
minimal | 23.3.1? I'm not seeing that in the github cloud-init releases list for some reason, there is 23.3.2 and 23.3 but not 23.3.1 | 19:44 |
ananke | cloud-init 23.3.1-1 all initialization system for infrastructure cloud instances | 19:44 |
minimal | I'm guessing that release was "pulled" for some reason | 19:45 |
ananke | and this would mean anything previous was accepting these keys too | 19:45 |
ananke | our 'default_user' fed via user-data hasn't changed in years | 19:46 |
minimal | anyway, the 23.2.2 docs state the same about system_info not being overridden by user-data, so then I'd say the behaviour you see with 23.3.1 is a bug as it was not intended to work | 19:46 |
minimal | so what led you to expect a "system_info" section in cloud-config to work? Was there a document at some time that showed this as valid? | 19:47 |
ananke | must have been. and it clearly worked across multiple distros/versions. we've had this for the past 5 years | 19:49 |
ananke | ubuntu 16/18/20/22, centos 7, debian 10/11, kali, etc | 19:49 |
minimal | ok, but it seems likely that was unintended behaviour | 19:49 |
holman | ananke: that used to be supported, apparently back in 0.7.2: https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1090482 | 19:49 |
-ubottu:#cloud-init- Launchpad bug 1090482 in cloud-init (Ubuntu Raring) "over-riding distro config still broken" [High, Fix Released] | 19:49 | |
minimal | normally to change aspects of the default user in cloud-config you'd specify: "users:\n - default\n" and then some attributes | 19:50 |
ananke | hah. https://cloudinit.readthedocs.io/en/22.1_a/topics/examples.html?highlight=default_user | 19:50 |
ananke | so it was in the sample config in documentation | 19:51 |
holman | ananke: but it looks like that was reverted in 18.4 in this commit: https://github.com/canonical/cloud-init/commit/f0ff194054da90b7b49620b5658342e52156d68e | 19:52 |
-ubottu:#cloud-init- Commit f0ff194 in canonical/cloud-init "stages: Fix bug causing datasource to have incorrect sys_cfg." | 19:52 | |
holman | as a fix for this bug https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1787459 | 19:52 |
-ubottu:#cloud-init- Launchpad bug 1787459 in cloud-init (Ubuntu) "datasource.sys_cfg gets different values in local stage and after." [Medium, Fix Released] | 19:52 | |
ananke | holman: so I wonder how it continued to work for us, until now | 19:53 |
holman | ananke: did it actually work? or did it just fail silently? | 19:53 |
holman | ananke: schema validation isn't blocking configurations, it's just warning about invalid ones | 19:53 |
minimal | the cloud-init.log with debug should show exactly what happens or doesn't happen with that cloud-config | 19:54 |
ananke | holman: it worked. we've built hundreds of images, with our custom default username fed via user-data's default_user option | 19:54 |
ananke | minimal: the users - default doesn't allow you to specify what the default user will be. just includes the default username configured via default_user | 19:55 |
minimal | ananke: I believe you can do "users:\n - default\n name: whatyouwant\n" to override the default name | 19:56 |
ananke | I can certainly try and see if that works | 19:56 |
minimal | "name:" being one of the attributes that can be specified | 19:56 |
minimal | I'm fairly sure I've used that in the past | 19:57 |
Odd_Bloke | "default" is a string not a mapping, so I don't think that's valid YAML. | 20:06 |
minimal | hmm, I could have sworn I'd overridden aspects of the default user in the past | 20:10 |
Odd_Bloke | I think perhaps `user:` can be used to override the default default user? | 20:10 |
Odd_Bloke | Or the default user defaults, perhaps more accurately (if not less confusingly :p). | 20:11 |
minimal | well yeah, if you specify "users:" but don't also specify "- default" then it won't be created and then you could have a "full" user definition instead | 20:11 |
ananke | so how would one translate this config to something using the user module? https://dpaste.org/XdcZx/raw | 20:11 |
minimal | "users:\n - name: student\n gecos: Student\n shell: /bin/bash" would definitely work | 20:12 |
Odd_Bloke | I _think_ the content of your `default_user:` block goes under `user:` at the top level. | 20:12 |
ananke | the nice thing about using this approach was that one would just override the bare minimum needed: username/gecos/shell. The rest was inherited from whatever the distro provides | 20:12 |
minimal | as by NOT specifying "- default" when you specify "users:" then the default is not created | 20:13 |
minimal | that's mentioned in the docs for the users_groups module | 20:13 |
minimal | oh, you want to inherit also.....hmm | 20:14 |
ananke | well, inheritance is a secondary goal, at this juncture I just need to figure out how to make this work with whatever is the correct syntax | 20:14 |
ananke | the problem is though, distro provides its own system_info with default_user: wonder how this is going to work | 20:15 |
Odd_Bloke | https://github.com/canonical/cloud-init/blob/main/cloudinit/distros/ug_util.py#L172-L173 sets old_user to cfg["user"], and https://github.com/canonical/cloud-init/blob/main/cloudinit/distros/ug_util.py#L207 merges that with distro.get_default_user(). | 20:16 |
ananke | wonder if I'm a canary, and other people will come out with the same problem :) | 20:17 |
minimal | ah, there is "user:" to override the default | 20:17 |
minimal | it's shown in "Example7" for the users_groups module | 20:17 |
minimal | so "user:\n name: whatever\n" | 20:18 |
ananke | will try both and see what comes out | 20:19 |
ananke | gotta get kid from practice, will finish this tonight | 20:22 |
minimal | Odd_Bloke: I'd missed that you referenced "user:" earlier | 20:23 |
Odd_Bloke | Haha, I did wonder, all good! | 20:35 |
Odd_Bloke | At least we reached the same conclusion. | 20:35 |
falcojr | hmmm, after trying it looks like you actually can override the default_user in system_info using user data...but that's not going to be true for most keys in system_info and using `user` as already mentioned here is the supported way to go | 21:08 |
holman | ananke: thanks for reporting on this | 21:31 |
holman | it sounds like we have some docs to update and deprecation(s) to add to schema | 21:32 |
holman | ananke: we've warned on invalid keys for a while now I think, I assume you're digging into this because of the exit 2 error code? | 21:32 |
holman | unfortunately there's been some spurious things like this to fix | 21:36 |
holman | which is one reason why we haven't made cloud-init hard error on invalid config, but rather warn more loudly when it thinks it's got something that isn't right | 21:37 |
holman | cloud-init's configuration was never fully documented, and it's difficult to audit all of the places that the config is accessed, but we're getting closer :-) | 21:37 |
ananke | holman: yes, it all started after running into that exit code 2. when building images with packer we leverage cloud-init as much as possible, so we can have more uniform build recipes, and the first build step is to run cloud-init status --wait. up until now this step never complained about this particular schema | 21:50 |
holman | ananke: gotcha | 21:51 |
ananke | but yeah, back when we started documentation wasn't anywhere near as complete, so we relied on example configs. once it was working, that part wasn't touched, and we were none the wiser | 21:55 |
ananke | now it's a matter of moving to whatever syntax is valid and replicates previous functionality. will have to use the serial console, because packer can't seem to be able to connect anymore | 21:58 |
ananke | ohh, this is interesting. looks like ssh_authorized_keys for that user may be wiping packer's ssh key | 22:11 |
ananke | so a couple observations: 1) no, omitting keys for a user via the users: config does NOT merge it with the ones for default_user:. this includes things like groups, sudoers, and so on | 22:20 |
ananke | which leads me to believe that's not how you can specify the default user, because more importantly 2) this new user does not have ssh keys specified in AWS metadata | 22:21 |
minimal | ananke: "user:" config or "users:" config? from the earlier chat the consensus was to use "user:" for changing default user stuff | 22:21 |
ananke | minimal: I must have misread it. I've been using 'users:' | 22:22 |
minimal | I mentioned "Example7" for the users_groups module | 22:23 |
ananke | I haven't figured out what Example7 means. I've been looking at https://cloudinit.readthedocs.io/en/latest/reference/examples.html | 22:23 |
ananke | ahh, I see, https://cloudinit.readthedocs.io/en/latest/reference/modules.html#users-and-groups | 22:24 |
minimal | go to https://cloudinit.readthedocs.io/en/latest/reference/modules.html | 22:24 |
minimal | then go to the "Users and Groups" section | 22:24 |
minimal | then click on the "Examples" tab in that section | 22:24 |
minimal | then scroll down to "Example7" | 22:25 |
ananke | yep, found it, thank you | 22:25 |
minimal | also click on the "Config schema" tab in that section and look at "user" | 22:25 |
minimal | "The user dictionary values override the default_user configuration from /etc/cloud/cloud.cfg." | 22:26 |
minimal | "The user dictionary keys support for the default_user are the same as the users schema" - so you can specify things like "ssh_authorized_keys" | 22:27 |
ananke | it's certainly been a long day. user: vs users: blends in | 22:27 |
ananke | minimal: the problem wasn't with what was supplied via ssh_authorized_keys, rather what's provided via AWS metadata service | 22:28 |
minimal | "users:" is the general way to create additional users, whereas "user:" is to modify the default user's settings | 22:28 |
minimal | you're referring to metadata? or to metadata/user-data/network config provided via the metadata *service* (IMDS)? | 22:29 |
minimal | as I'd expect ssh keys coming via IMDS to be from user-data provided (by you) to AWS cli/API when a VM is created | 22:31 |
ananke | these keys are created automatically by packer, and accessible via metadata service under public-keys/, presumably that's where cloud-init pulls them from | 22:32 |
ananke | point being, when I tried using the 'users:' section to specify our username, that user was created, but the packer ssh key wasn't present in that account | 22:33 |
ananke | moving to user: fixed it | 22:33 |
minimal | right, as "users:" was creating a new user (with no ssh_authorized_keys specified) | 22:35 |
ananke | but interestingly enough, there seemed to be _no_ default user | 22:36 |
minimal | in which scenario? | 22:37 |
ananke | in users: | 22:37 |
minimal | ok, as mentioned earlier, when "users:" is used (i.e. to create NEW users) unless you specify "- default" then the default user will NOT be created | 22:37 |
minimal | this is mentioned in the docs I referred you to | 22:37 |
ananke | which is what made the mistake of 'users:' vs 'user:' confusing. | 22:38 |
ananke | right, but all along I was trying to manipulate the default user | 22:38 |
minimal | In the "Summary" tab: "Omission of default as the first item in the users list skips creation [typo] the default user." | 22:39 |
ananke | that's not the point :) | 22:39 |
minimal | that explains why the default user was not created when you used "users:" | 22:40 |
ananke | the explanation is much simpler: I kept mistaking 'users:' for 'user:'. didn't realize there was a separate distinct config option 'user:' | 22:40 |
minimal | that's why you should check "Config schema" for the relevant module in the docs ;-) | 22:41 |
ananke | doesn't help in case where you think user: == users: | 22:42 |
minimal | wouldn't it help as that part of the docs clearly lists "user:" and "users:" *separately* one after the other? | 22:44 |
minimal | that section (in my opinion) makes it very clear that they are 2 different things | 22:45 |
ananke | not sure there's much more point in telling me I made a mistake | 22:46 |
ananke | I realize that. I was merely explaining what happened | 22:47 |
minimal | I wasn't doing that, I'm just saying that (in my opinion) the relevant section of the docs, if consulted, is quite clear | 22:48 |
minimal | obviously if it is not consulted that's a different matter | 22:48 |
ananke | the fact that user manipulation is spread across 'default_user' 'user' and 'users' doesn't help | 22:49 |
ananke | but I digress. the issue in this particular case is solved, thank you everybody for help and patience | 22:49 |
minimal | "user" and "users" are documented in the same place, "default_user" is documented in base config which, from memory, mentions this is config setting for distros/vendors, not for "end users" | 22:50 |
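
The exit-code-2 triage discussed at the start of the log (read the `recoverable_errors` key rather than `errors`) can be sketched in Python. This is a minimal sketch, not part of the conversation: it assumes the installed cloud-init supports `cloud-init status --format json` and that the JSON carries a `recoverable_errors` mapping of level to messages. The function names and the `SAMPLE` messages are hypothetical and only illustrate the shape of the data.

```python
import json
import subprocess


def summarise_recoverable_errors(status):
    """Flatten the 'recoverable_errors' mapping into (level, message) pairs."""
    pairs = []
    for level, messages in sorted(status.get("recoverable_errors", {}).items()):
        for message in messages:
            pairs.append((level, message))
    return pairs


def read_status():
    """Run cloud-init and parse its JSON status (requires cloud-init on PATH)."""
    # Exit code 2 signals recoverable errors, so a non-zero exit is not fatal here.
    proc = subprocess.run(
        ["cloud-init", "status", "--format", "json"],
        capture_output=True,
        text=True,
        check=False,
    )
    return proc.returncode, json.loads(proc.stdout)


# Hypothetical sample shaped like the status output; the messages are invented.
SAMPLE = {
    "status": "done",
    "errors": [],
    "recoverable_errors": {
        "WARNING": ["hypothetical warning about an unknown module"],
        "DEPRECATED": ["hypothetical notice about a deprecated key"],
    },
}

if __name__ == "__main__":
    for level, message in summarise_recoverable_errors(SAMPLE):
        print(f"{level}: {message}")
```

On a real instance, `read_status()` would replace `SAMPLE`; an empty `errors` list together with a non-empty `recoverable_errors` mapping matches the situation ananke described.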
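
The fix the conversation converged on, moving the contents of `system_info: default_user:` to a top-level `user:` key in user-data, can be illustrated as a small dictionary transformation. This is a hedged sketch only: the helper name `move_default_user_to_user` is hypothetical, it assumes the config has already been parsed into plain dictionaries and that any existing `user` value is a mapping, and the sample values reuse the `student` example quoted in the log.

```python
import copy


def move_default_user_to_user(config):
    """Return a copy of config with system_info.default_user moved under 'user'."""
    migrated = copy.deepcopy(config)
    system_info = migrated.get("system_info", {})
    default_user = system_info.pop("default_user", None)
    if default_user is not None:
        # Keys already present under 'user' take precedence over migrated ones.
        migrated["user"] = {**default_user, **migrated.get("user", {})}
    if not system_info:
        # Drop an empty system_info so the schema check no longer sees the key.
        migrated.pop("system_info", None)
    return migrated


# Old-style user-data as a parsed mapping; values taken from the log's example.
OLD = {
    "system_info": {
        "default_user": {
            "name": "student",
            "gecos": "Student",
            "shell": "/bin/bash",
        }
    }
}

if __name__ == "__main__":
    print(move_default_user_to_user(OLD))
```

Any other keys left under `system_info` are kept as they are, since the log only establishes `user:` as the replacement for `default_user`.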