=== blackboxsw_away is now known as blackboxsw | ||
hwrd | welp... back at it... I kicked up a brand new instance and it looks like it grabbed the user-data first try, but it's still ignoring the runcmd https://gist.github.com/hahuang65/a9b042587c33709b330d8841ecbe6e4b | 16:58 |
---|---|---|
hwrd | blackboxsw minimal don't know if you guys are around, but I trolled thru the above log ^, and shortened it to what I thought were the relevant bits for user-data https://gist.github.com/hahuang65/4ea7f1b36930131ed4c89631c48f5015 | 17:13 |
hwrd | looks like it SHOULD be working properly... | 17:13 |
minimal | "Skipping module named runcmd, no 'runcmd' key in configuration" | 17:14 |
hwrd | BUT also.. it says it writes the user-data... but | 17:14 |
hwrd | cat: /var/lib/cloud/instances/i-0492894c085df3346/user-data.txt: No such file or directory | 17:14 |
hwrd | >_> | 17:14 |
minimal | what is the contents of your user-data? | 17:15 |
hwrd | oh right... so how do I get that, from an instance that works? I see on this other instance that the `/var/lib/cloud/instances/<id>/user-data.txt` is some encoded format | 17:16 |
hwrd | otherwise, I have to get it out of terraform, which stitches a bunch of yaml files together. | 17:16 |
minimal | from however you "launched" the Vm and provided user-data at launch time | 17:16 |
hwrd | minimal these are the 3 parts that Terraform merges together into my user-data https://gist.github.com/hahuang65/3a3ddf7a378ce854bab8b3546aa71c5f | 17:22 |
hwrd | I can't easily get to the end result... but I know this is working on Amazon Linux 2, with cloudinit v19.x.x | 17:22 |
minimal | it is not just "runcmd", there is no sign of it creating the "hhhuang" user either | 17:27 |
hwrd | minimal yes but... | 17:28 |
hwrd | [hhhuang@ip-172-16-52-97 ~]$ whoamihhhuang | 17:28 |
minimal | or doing write_files | 17:28 |
hwrd | whoops, newline didn't paste | 17:28 |
hwrd | but the user IS there... and so are all the writefiles. | 17:28 |
hwrd | for example | 17:29 |
minimal | hang on, you have "users:" in base.yaml, rails.yaml, and users.yaml - are you sure this is merged correctly? | 17:29 |
minimal | and merged by "what"? terraform? | 17:29 |
hwrd | merged by terraform | 17:29 |
minimal | well the logfile does NOT show the users_groups (which creates users specified in cloud-config user-data) module being called | 17:30 |
hwrd | minimal https://registry.terraform.io/providers/hashicorp/cloudinit/2.2.0/docs/data-sources/cloudinit_config | 17:30 |
hwrd | minimal yeah, I understand. so something is jacked. the log doesn't seem to show that user-data exists... but something is running the user-data... but I also can't find a trace of the user-data on the filesystem. | 17:31 |
hwrd | I do have a machine that I have the `user-data.txt` as cloudinit sees it and writes it down... but it's encoded, and I'm not sure how to decode it. | 17:32 |
hwrd | but my gut tells me I need to figure out why this new machine isnt logging the user-data, but is running parts of it any wy. | 17:32 |
hwrd | anyway* | 17:33 |
minimal | encoded? /var/lib/cloud/instance/user-data.txt should show the actual user-data used by cloud-init, not any "encoded" version of it AFAIK | 17:33 |
hwrd | if I do `sudo head /var/lib/cloud/instance/user-data.txt` | 17:35 |
hwrd | I get this | 17:35 |
hwrd | minimal https://imgur.com/a/WgttfMf | 17:35 |
minimal | I don't tend to use imgur as it wants to load lots of JS from 101 "random" places... | 17:39 |
hwrd | sorry, any good place to upload a screenshot minimal | 17:43 |
minimal | dunno | 17:45 |
minimal | anyway looking at /var/lib/cloud/instance/user-data.txt on a VM here it starts with "#cloud-config" as that is exactly what the user-data provided to cloud-init started with | 17:46 |
hwrd | oh, is it because this is gzipped and base64'd | 17:46 |
minimal | whereas the user-data.txt.i file begins with "Content-Type: multipart/mixed" | 17:47 |
hwrd | ah, yeah, I have to `mv user-data.txt{,.gz}` and then `gunzip user-data.txt.gz` and now I can read it | 17:49 |
hwrd | lemme paste that | 17:49 |
minimal | so what does the user-data.txt.i file contain? | 17:52 |
hwrd | minimal https://gist.github.com/hahuang65/03fa989b8dbfb527c5552b995c87841c < that's user-data.txt | 17:53 |
hwrd | lemme get the .i | 17:53 |
hwrd | oh the .i is pretty similar. do you want the paste | 17:54 |
hwrd | I had to remove some sensitive data/sections out of the file | 17:54 |
minimal | if you notice the user-data.txt is not "plain", it is a 3-part document with each of your original YAML files unmerged | 17:56 |
minimal | this looks more like what I'd expect for a user-data.txt.i file | 17:56 |
minimal | perhaps blackboxsw has some ideas... | 17:56 |
hwrd | would that cause cloudinit to not write the file down in the first place? | 17:56 |
hwrd | cuz on the busted instance, there's no `/var/lib/cloud/instance` directory at all | 17:57 |
minimal | I notice each of the 3 parts has Content-Type: text/plain, whereas when I look at a (single part) user-data.txt file here it has Content-type: text/cloud-config | 17:58 |
hwrd | hrmmm. | 17:59 |
hwrd | I WONDER if I need to update the terraform provider. | 17:59 |
minimal | I have not to-date used user-data merging so can't help regarding that | 18:01 |
hwrd | hrm, I'm guessing this isnt it. I went from 2.3.2 to 2.3.4 | 18:02 |
hwrd | wow someone is having the same issue as I am https://stackoverflow.com/questions/78769521/cloud-init-fails-on-amazon-linux-2023-but-works-on-amazonlinux2 | 18:05 |
minimal | that sounds like it may be related to the issue I mentioned the other day | 18:10 |
hwrd | I'm still confused why nothing shows up in the logs, and yet parts of my config are being run. | 18:20 |
hwrd | hrm... why does `cloud-init.log` have entries from multiple dates... if I'm tearing down the EC2 instance in between runs? | 18:29 |
hwrd | minimal you think it's worth trying that user-data.txt file from the working instance on the new instance? | 18:36 |
minimal | the file you provided only has entries for 22nd July | 18:36 |
hwrd | minimal I cut out the above dates | 18:36 |
hwrd | there were more lines above | 18:36 |
minimal | that makes no sense at all, unless the AMI already had a non-empty cloud-init.log file | 18:37 |
hwrd | oh yeah that's gotta be what it is. Packer fires up a new instance to make the AMI | 18:38 |
minimal | Packer? you mean this is NOT an official AWS AmazonLinux3 AMI? | 18:38 |
hwrd | hah, I guess I should have mentioned that. Yeah it's NOT an official AMI | 18:39 |
minimal | well then all bets are off | 18:39 |
hwrd | I should try with the official one. | 18:39 |
minimal | Packer modifies a *running* VM | 18:39 |
minimal | cloud-init is designed to run on 1st boot of a VM | 18:39 |
minimal | correction, to do *most* of its work on 1st boot | 18:40 |
blackboxsw | bah, sorry I thought you said terraform to start? ok so w/ packer you've created a custom AMI that happened to run your user creation or user-data on first boot during the packer AMI image build. So, that's why your user exists, then the way you are trying to redeploy that dirty AMI (which had run cloud-init once already) through terraform is not providing the user-data somehow to the ec2 instance via the terraform launch? | 18:41 |
minimal | so if you're using Packer then you need to ensure that you tidy up the VM to remove any cloud-init "state" before saving it as an AMI for later use | 18:41 |
hwrd | blackboxsw nope, sorry for the confusion. Everything I sent you guys was post-AMI-creation. I'm not (intentionally) running any cloudinit when I build the packer AMI. | 18:43 |
blackboxsw | if you are creating a golden image via some tooling that you wish to reboot and have cloud-init use that AMI and run as if it was first boot, you'd need to run `sudo cloud-init clean --logs --machine-id` in that MI before snapshotting that AMI. You'd probably also likely want to remove the custom user you created too unless that's an artifact you want in all VMs launched with that AMI | 18:43 |
minimal | hwrd: why are you modifying the original AMI in the first place? | 18:43 |
blackboxsw | hwrd: I get that, it's packer that runs cloud-init to customize the AMI under the hood. | 18:43 |
hwrd | minimal mostly to compile Ruby, so that instances created by Terraform don't take 30 minutes to bootstrap | 18:44 |
hwrd | I'll try the cloud-init clean | 18:45 |
hwrd | the full list of things I do in Packer is... install dev libs (for compiling Ruby), install tools we want on every instance (zsh, datadog, mosh, ssm-agent, codedeploy-agent), compile Ruby | 18:47 |
hwrd | then, when Terraform fires up an instance, we have it populate scripts specific to that instance, and set up users | 18:48 |
blackboxsw | I presume given the previous discussion though, that cloud-init clean is only going to drop previous logs from your system, you are still going to be in a state on "first boot" of your custom AMI image in terraform that user-data content being presented isn't being seen/processed by cloud-init on the new VMs 'first boot'. But, at least you'll have a clean state and logs to better determine the source of the problem. | 18:49 |
hwrd | oh I think the issue with my `runcmd` is that I can't do it when I make the AMI. My `runcmd` sets up a bunch of dirs, and it needs the EBS volume to be attached. | 18:49 |
hwrd | blackboxsw I think you're right. | 18:50 |
hwrd | it's just a bit confounding that this all worked on the AL2 AMI. Never expected to hit this hurdle when my original task was to update our AMIs to AL2023 | 18:50 |
blackboxsw | I still think this may be related to the user-data format or compression not being understood by cloud-init and so it may be ignoring that user-data content on final AMI launch in terraform. | 18:51 |
hwrd | thought I was going to mainly be fighting with dependencies. | 18:51 |
hwrd | blackboxsw you may be right, but how do you explain that most of my user-data is run? | 18:51 |
hwrd | even though, there's no trace of it running | 18:52 |
minimal | hwrd: isn't the point that the user-data is actions when Packer is run, not later when the revised AMI is used? | 18:52 |
minimal | s/actions/actioned/ | 18:53 |
hwrd | minimal not if there are certain dependencies that aren't met until a concrete instance is required? | 18:53 |
minimal | I don't understand | 18:54 |
hwrd | when Packer is run, an AMI is built. I then want to boot up instances using that AMI, which will run user-data for the instance, and not the AMI | 18:55 |
minimal | I mean if the user is created when Packer runs then that user will be present when the AMI is later run and so it doesn't matter that cloud-init then doesn't create the user as it was previously created during Packer run | 18:55 |
blackboxsw | hwrd: correct per minimal's comment, I believe packer is going to trigger a cloud-init run during AMI/image creation (which is creating your default user etc), so that's when your user-data is being consumed. When you then launch images in terraform trying to reference your custom AMI, I'm guessing cloud-init had already run once during that AMI creation which created that default user. | 18:55 |
hwrd | for example, I don't have a hostname for the AMI, but will for the instance | 18:55 |
hwrd | I can't link EBS volumes for an AMI if they're going to be detached again | 18:55 |
hwrd | right, I understand what you guys are saying about how cloudinit is intended for | 18:56 |
hwrd | but regardless of the intentions... there is a technical reason why it's no longer working for me. whether or not that was an intended "fix" from cloudinit's perspective, I dont know | 18:57 |
hwrd | All I know is, it was working on AL2 with cloudinit 19... and now on AL2023 and cloudinit 22... it's not. My process has remained the same | 18:58 |
hwrd | but I guess if my end result is the same (I can upgrade our systems to AL2023), it doesn't matter how I get there | 18:58 |
hwrd | are you guys suggesting, I move all my user-data to when Packer is creating the AMI? | 18:59 |
hwrd | really, everything is working, except this: https://gist.github.com/hahuang65/c104af2ea2644dc69889db8015f90b72 | 19:01 |
hwrd | but, like I said, `/data/` requires my final EBS volume to be attached and mounted | 19:01 |
hwrd | and there's something funky with codedeploy-agent and ASGs that don't work well, unless I specifically disable them in the AMI, and start it with `runcmd`. | 19:02 |
hwrd | I'm sorry if what I'm saying sounds totally whack-o | 19:08 |
hwrd | yeh I think if this doesn't work out today and tomorrow, I'm gonna migrate the `runcmd`s to codedeploy hooks. | 19:16 |
hwrd | runcmd runs once... and bootcmd runs every time? | 19:18 |
hwrd | blackboxsw `/usr/bin/cloud-init: error: unrecognized arguments: --machine-id` is that a new flag, beyond v22? | 19:35 |
blackboxsw | hwrd: because the packer image build -> terraform reuse of that AMI involved multiple boots where cloud-init is involved. I'm guessing that the way user-data is being provided via terraform in the VM launch using your custom AMI is what is somehow presenting user-data in a way that cloud-init in AL2023 doesn't like which results in redacting all user-data (or skipping processing it). | 19:35 |
hwrd | I can see that... but it ISN'T skipping processing it. only the `runcmd` portion... for some reason. | 19:36 |
blackboxsw | hwrd: oops yes, I think on AL2023 you can `echo "uninitialized" > /etc/machine-id` in absence of that --machine-id setting | 19:36 |
blackboxsw | is it possible that whatever mime part is adding the `runcmd:` section it is being ignored? `sudo cloud-init query userdata` on your VM that was booted was showing no content. | 19:37 |
hwrd | blackboxsw I have no idea. yeah `cloud-init query userdata` shows no content cuz `/var/lib/cloud/instance/user-data.txt` doesn't exist | 19:38 |
hwrd | is it worth schlepping over `user-data.txt` from another machine to the problematic machine, and run that somehow? | 19:39 |
blackboxsw | Given that this appears to be a bug affecting others in the wild (given your stackoverflow link) it might be worth filing a bug against cloud-init with the attached full logs (via cloud-init collect-logs) and representing how you created the AMI in packer and launched the AMI in terraform. then we can see how terraform config is providing user-data and what the clean boot logs are on the terraform deployed system. | 19:40 |
blackboxsw | https://github.com/canonical/cloud-init/issues/new/choose | 19:40 |
blackboxsw | The fact that no /var/lib/cloud/instance symlink exists is a pointer to a problem in datasource detection I believe | 19:41 |
hwrd | yes, but in my logs...`2024-07-22 18:04:54,076 - util.py[DEBUG]: Creating symbolic link from '/var/lib/cloud/instance' => '/var/lib/cloud/instances/i-0d45a58160fcd4eea'` | 19:42 |
hwrd | and | 19:42 |
hwrd | `2024-07-22 18:04:54,268 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-0d45a58160fcd4eea/user-data.txt - wb: [600] 12342 bytes` | 19:42 |
hwrd | is that bizarre or what? | 19:42 |
hwrd | I'll file an issue for sure | 19:43 |
hwrd | blackboxsw is this weird: Failed collecting file(s) due to error:[('/run/cloud-init/cloud-id', '/tmp/tmpuclmpz4q/cloud-init-logs-2024-07-22/run/cloud-init/cloud-id', "[Errno 2] No such file or directory: '/run/cloud-init/cloud-id'")] | 19:45 |
hwrd | phew... took a bit, but here's the issue blackboxsw https://github.com/canonical/cloud-init/issues/5533 | 20:15 |
-ubottu:#cloud-init- Issue 5533 in canonical/cloud-init "Previously working configuration on Amazon Linux 2 no longer works on Amazon Linux 2023" [Open] | 20:15 | |
hwrd | ah yes, so if I copy the user-data from the working machine to the broken machine, and run `cloud-init schema --system --annotate`, it tells me `# E1: File None needs to begin with "#cloud-config"` | 20:23 |
hwrd | whereas cloudinit on the working machine (v19.x.x), `cloud-init schema` isn't even a commn. | 20:23 |
hwrd | command* | 20:23 |
hwrd | still doesn't explain why parts of my user-data were run though. | 20:26 |
esv | hey folks, over the weekend I tried to convince a VM(cloud image) built with ssh_pwauth false to set it to true, haven't been able to. | 20:27 |
esv | normally I would just edit /etc/cloud/cloud.cfg and set ssh_pwauth to true and be done with it, but after doing a "cloud-init clean --logs --seed" , the setting is just ignored | 20:46 |
esv | ok, here we go... the VM is using: /bin/cloud-init 23.1.1-11.el8_9.1 | 20:47 |
esv | original /etc/cloud/cloud.cfg: https://bpa.st/2HDQ ; cloud-init query --all : https://bpa.st/QDMQ | 20:47 |
esv | so, I guess the question here is, is there a way to alter the behavior or is it cooked for the life of the deployed server? | 20:49 |
minimal | hwrd: I said earlier that it is NOT the case that "parts of my user-data were run though" from the cloud-init.log output you provided, none of your user-data appears to be run THEN | 22:39 |
minimal | it may have been run earlier (i.e. when you used Packer) | 22:39 |
hwrd | that shouldn't be the case. I agree that it appears that none of it ran. however packer doesn't have access to the user data code, so there's no chance it was run with packer. | 22:40 |
hwrd | the user-data code only exists in the Terraform repository | 22:40 |
minimal | hwrd: sigh, when you run a VM using Packer then cloud-init will be started during *that* boot | 22:41 |
hwrd | right | 22:41 |
minimal | which "is not good" | 22:41 |
minimal | as then, unless you tidy things up, when you launch a VM using Terraform then cloud-init may think this is not "at boot" | 22:42 |
hwrd | I can boot up the packer built AMI and see that none of the user-data stuff was run | 22:42 |
minimal | s/at/1st | 22:42 |
minimal | well if the cloud-init log does NOT show it creating etc then somehow they were otherwise created | 22:42 |
hwrd | my user doesn't even exist. I can't ssh in unless I use the shared key | 22:42 |
minimal | s/etc/users etc/ | 22:43 |
hwrd | or, is it possible there's some other process running cloudinit and wiping it? | 22:43 |
minimal | you tell us, it's your AMI/VM | 22:43 |
hwrd | cuz I see in cloudinit logs that it's writing out the user-data file. but when I look, it's gone. | 22:43 |
hwrd | how can I check? | 22:43 |
minimal | I don't know, I'm only commenting on the logs you provided which appear to show cloud-init basically doing nothing with user-data | 22:44 |
hwrd | yeah I'm pretty suspicious about the logs | 22:44 |
hwrd | I'm not sure what to believe | 22:45 |
minimal | the logs show cloud-init complaining about the user-data schema | 22:45 |
hwrd | the logs also show cloudinit saying it's writing out a file, but that file is non existent | 22:46 |
minimal | though that is a warning, rather than an error | 22:46 |
minimal | you earlier said there were older logs entries. Have you tried tidying up cloud-init in Packer and then testing the "new" AMI again? | 22:47 |
hwrd | oh yeah so I ran my user-data from the other instance thru cloud-init schema and it did complain that it didn't start with the cloudinit comment | 22:47 |
minimal | which other instance? | 22:47 |
hwrd | minimal: nope cuz AWS didn't let me SSH in during packer build. I gotta try to rebuild the AMI again | 22:47 |
hwrd | the older Amazon Linux 2 instance, the one that everything is working on | 22:48 |
minimal | "cloudinit comment"? you mean "#cloud-config"? | 22:48 |
hwrd | yeah, sorry typing from a phone | 22:48 |
minimal | technically that has *always* been a requirement for valid cloud-config user-data YAML | 22:49 |
hwrd | https://github.com/canonical/cloud-init/issues/5533#issuecomment-2243753663 | 22:49 |
-ubottu:#cloud-init- Issue 5533 in canonical/cloud-init "Previously working configuration on Amazon Linux 2 no longer works on Amazon Linux 2023" [Open] | 22:49 | |
hwrd | see the comment I linked. | 22:49 |
minimal | it is just that older versions of cloud-init didn't do a validation of the provided user-data against the schema whereas new versions do | 22:49 |
hwrd | got it | 22:51 |
hwrd | so then Terraform has been doing it wrong. | 22:51 |
hwrd | will cloudinit refuse to run it if it's not valid? | 22:51 |
minimal | so your user-data was *always* "wrong" (i.e. syntactically invalid) | 22:51 |
minimal | I already pointed out the message in the logs is a WARNING, not an ERROR | 22:51 |
hwrd | right | 22:52 |
hwrd | but whys it warning... when there's no user-data? | 22:52 |
hwrd | there are so many little weird inconsistencies here | 22:52 |
minimal | sigh | 22:52 |
minimal | basically I would ignore EVERYTHING until you test a VM creation where there are no cloud-init log entries dated from before you created the VM | 22:53 |
minimal | as the earlier timestamped logs signal to me that things are not right | 22:54 |
hwrd | okay, I'll try that | 22:55 |
minimal | also remember the other day when I pointed out there was a AWS related bug ("ec2: Support double encoded userdata #4276) whose fix is NOT present in the cloud-init version you're using? This may or may not be a factor in your problems (the format of the /var/lib/cloud/user-data.txt you mentioned earlier makes me wonder) | 22:57 |
hwrd | ah yeah, so that didn't pan out. I disabled gzip and base64 encoding. same issue | 22:58 |
hwrd | I'll come back after I get that fresh AMI build | 22:59 |
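
Two observations from the discussion above can be checked mechanically: the stored `user-data.txt` was gzipped (17:49), and each of the three MIME parts declared `Content-Type: text/plain` rather than `text/cloud-config` (17:58), while `cloud-init schema` later complained about a missing `#cloud-config` header (20:23). The following is a minimal Python sketch, not part of cloud-init or Terraform, that gunzips a user-data blob if needed and reports, for each part, its declared content type and whether the body starts with `#cloud-config`. The function names, the `SAMPLE` document, and the `MIMEBOUNDARY` string are made up for illustration; base64 layers are not handled.

```python
import gzip
from email import message_from_bytes
from typing import List, Tuple

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def decode_user_data(raw: bytes) -> bytes:
    """Return raw unchanged, or gunzipped if it starts with the gzip magic."""
    if raw.startswith(GZIP_MAGIC):
        return gzip.decompress(raw)
    return raw


def summarize_parts(data: bytes) -> List[Tuple[str, bool]]:
    """List (declared content type, starts with '#cloud-config') per part."""
    msg = message_from_bytes(data)
    if not msg.is_multipart():
        # A plain, single-document user-data file has no MIME headers at all.
        return [("(not multipart)", data.lstrip().startswith(b"#cloud-config"))]
    summary = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the multipart/mixed container itself
        body = part.get_payload(decode=True) or b""
        summary.append(
            (part.get_content_type(), body.lstrip().startswith(b"#cloud-config"))
        )
    return summary


# Hypothetical two-part document: one part resembling the ones in the log
# (text/plain, no header comment) and one that is explicitly cloud-config.
SAMPLE = (
    b'Content-Type: multipart/mixed; boundary="MIMEBOUNDARY"\n'
    b"MIME-Version: 1.0\n\n"
    b"--MIMEBOUNDARY\n"
    b"Content-Type: text/plain\n\n"
    b"users:\n  - name: example\n"
    b"--MIMEBOUNDARY\n"
    b"Content-Type: text/cloud-config\n\n"
    b"#cloud-config\nruncmd:\n  - echo hello\n"
    b"--MIMEBOUNDARY--\n"
)

if __name__ == "__main__":
    for ctype, has_header in summarize_parts(decode_user_data(gzip.compress(SAMPLE))):
        print(ctype, "starts with #cloud-config:", has_header)
```

On a real instance the input would be the bytes of `/var/lib/cloud/instance/user-data.txt` (read with sudo) instead of `SAMPLE`. The sketch only reports what is in the file; it does not say how any particular cloud-init version will treat a given part.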