eb3095-Vultr | So I'm having an issue with the Vultr datasource. When it fails to DHCP or reach the metadata server, the datasource fails out and it defaults to NoCloud, resetting the host keys, the root passwords, etc. Is this expected behavior? Should we be invoking the cache in some way in this instance to prevent this behavior? | 15:49 |
minimal | eb3095-Vultr: first off you could change the defined/active datasource list to be only Vultr (or Vultr and None) so it can't fall back to NoCloud or anything else | 15:58 |
minimal | is this happening on 1st boot or on (some) later boots? have you enabled debugging for the cloud-init logs to see more info about what is going on? | 15:59 |
eb3095-Vultr | Sorry, I did mean None, not NoCloud, and yes, we are defining the datasources in cloud.cfg as just the Vultr datasource and we are still seeing this behavior. It's happening on boots after the first. I see what's going on though: it was trying to either DHCP or reach the metadata, and then an error is thrown in the datasource, which fails out the datasource and it falls back to None. | 16:02 |
minimal | so there is a DHCP fallback but I'm surprised that SSH host keys would be (re)generated after initial boot. Can you enable debugging and then pastebin (or similar) the cloud-init.log? | 16:07 |
eb3095-Vultr | Sure one sec. | 16:08 |
eb3095-Vultr | While I am getting that for you (I have someone pulling it for me), I was looking at this: https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/helpers/vultr.py In the get_metadata function I'm raising the exception from the DHCP failure, assuming that's an issue here. Should I be checking for a cache and returning that instead? This seems like the underlying cause. | 16:12 |
eb3095-Vultr | https://pastebin.com/sfGfpX6k thats the cloud-init.log | 16:12 |
minimal | still reading through it but can see DHCP being setup and info successfully fetched from Vultr metadata server | 16:17 |
eb3095-Vultr | further down, the guy replicating this for me deployed then broke networking to reproduce. | 16:22 |
minimal | so the initial run looks fine to me | 16:41 |
minimal | in the 2nd set of log entries I noticed: | 16:41 |
minimal | 2023-03-15 14:46:47,866 - stages.py[DEBUG]: cache invalid in datasource: DataSourceVultr | 16:41 |
minimal | 2023-03-15 14:46:47,866 - handlers.py[DEBUG]: finish: init-local/check-cache: SUCCESS: cache invalid in datasource: DataSourceVultr | 16:41 |
minimal | 2023-03-15 14:46:47,866 - util.py[DEBUG]: Attempting to remove /var/lib/cloud/instance | 16:41 |
minimal | also trying to do DHCP fails for some reason | 16:43 |
eb3095-Vultr | yeah in that particular instance the dhcp failure is intentional. Concerned about the cache being invalid. | 16:44 |
minimal | DHCP failure is intentional? how so? | 16:44 |
eb3095-Vultr | in this case we were purposely breaking it to replicate an outage to test this scenario. As when it fails we are seeing cloud-init re-init on the none datasource and its breaking installs | 16:45 |
minimal | breaking installs? the "install" has already occurred, i.e. the 1st boot | 16:46 |
minimal | if Vultr is failing to provide a DHCP response then isn't that a Vultr issue? | 16:46 |
eb3095-Vultr | By breaking installs I mean breaking existing deployed servers by rolling host keys and changing passwords on subsequent boots. And yes, DHCP failure would be a Vultr issue, but that's not the concern I have. It's that when an outage happens where DHCP or the metadata is down, we are seeing customer instances that reboot in this window have their servers altered. I'm trying to determine if this is expected behavior in this | 16:48 |
eb3095-Vultr | instance or if we did something wrong in the datasource that we need to fix. | 16:48 |
minimal | AFAIK the server is being altered as the None datasource is being used as the Vultr datasource is not usable | 16:49 |
minimal | the Vultr datasource is not usable due to DHCP failure and therefore non-access to the Vultr metadata server | 16:50 |
eb3095-Vultr | yeah, that does seem to be the case. Is that what is expected to happen here, or should I throw a merge request in to prevent this somehow, so even if the DHCP fails we can fall back on the cache? | 16:50 |
minimal | there has been a recent change to treat an enabled datasource list of either "DS" or ["DS","None"] as hardcoded to only use the specified datasource | 16:52 |
eb3095-Vultr | was that very recently? We are using a copy from like a month or two ago from the main branch because we had our own fixes in that we needed to pull down here. I'm wondering if me raising that exception in get_metadata is the cause of this behavior or not | 16:53 |
minimal | hmm, it is OpenStack only at present it seems, support for other DSes is not yet done: https://github.com/canonical/cloud-init/commit/d1ffbea556a06105d1ade88b4143ad43f53692c4 | 16:54 |
-ubottu:#cloud-init- Commit d1ffbea in canonical/cloud-init "source: Force OpenStack when it is only option (#2045)" | 16:54 |
minimal | so that's in release 23.1.1 | 16:55 |
eb3095-Vultr | hmm, I'll happily add us to that list. So I'm guessing this is in fact the expected behavior here and we are not doing something explicitly wrong in our own datasource? Adding ourselves to this here will also prevent this behavior? | 16:55 |
minimal | probably best to discuss with holmanb, the author of that PR | 16:57 |
minimal | basically the ssh module only regenerates SSH host keys "once-per-instance" (i.e. not once per boot), so this only happens if c-i thinks the instance has changed, and this is being triggered by the switch from the Vultr DS to the None DS | 16:59 |
minimal | it would be useful to see logs from when the problem happens, rather than when the problem is simulated as the 2 may not have identical behaviour | 16:59 |
eb3095-Vultr | Yeah, I get what's happening, I was just concerned that the change in DS was because we were throwing that exception and should be doing something else here. Wasn't sure if we were breaking cloud-init or if this is just how cloud-init works. | 17:00 |
minimal | I'd assume basically that any DS using a metadata server relies upon that metadata server being available to it upon each boot | 17:00 |
minimal | so I wonder why no DHCP response is occasionally happening | 17:01 |
eb3095-Vultr | it's rare, but outages happen from time to time. Just hoping we had a path to avoid breaking instances that reboot in those windows | 17:01 |
minimal | hopefully holmanb should be on channel shortly for his input | 17:04 |
minimal | eb3095-Vultr: are you Vultr staff? | 17:04 |
eb3095-Vultr | Yes, and the guy who wrote that datasource | 17:04 |
minimal | interesting. I'm the Alpine maintainer/packager for cloud-init, I've been meaning to give it a test on Vultr | 17:07 |
eb3095-Vultr | We just added alpine recently | 17:07 |
minimal | ok, I don't see it listed on your Operating Systems page | 17:10 |
eb3095-Vultr | erm, i thought we released it, oof still in beta, one sec, ima deal with that right now haha | 17:13 |
eb3095-Vultr | Hmm does look like that commit is the solution, and 2 weeks ago at that, so gonna rebuild and run a trial to see if that solves my problem here. | 17:40 |
eb3095-Vultr | thanks for the heads up on that | 17:40 |
minimal | that fix is in cloud-init 23.1.1, I've packaged that for Alpine Edge (a couple of days ago) but Alpine 3.17 is still on cloud-init 22.4 at present | 17:48 |
WannaBeGeekster | Hey everyone. Trying to use packages: on Ubuntu 22.04. I look in the logs and it is trying to use yum instead of apt to install the packages so it is erroring out. Not sure if anyone has seen that before or not? | 18:14 |
minimal | WannaBeGeekster: are you using an Ubuntu cloud image? | 18:15 |
WannaBeGeekster | I didn't build the images. I was basically just told the same thing from my manager. Let me try a different image and see if it fixes the issue. | 18:16 |
minimal | is it some sort of "official" image or did someone locally create it? | 18:17 |
WannaBeGeekster | The image I was using was built using Packer. | 18:17 |
minimal | my point being, perhaps cloud-init was not correctly configured when the image was created | 18:18 |
WannaBeGeekster | Yes, I will double check that for sure. I think it is the most likely issue right now. | 18:18 |
eb3095-Vultr | for my issue, that patch didn't change anything, it still falls back to None on failure and rolls the keys and passwords again | 18:37 |
minimal | eb3095-Vultr: so the underlying issue is that DHCP failure means Vultr DS selection fails | 18:57 |
eb3095-Vultr | well, I wouldn't say the DS selection fails as it does choose it, it's more that it falls back to None and borks the server. I'd rather it do nothing than do that | 18:57 |
minimal | but what would you expect the DS, or cloud-init, to do when DHCP requests fail? | 18:57 |
eb3095-Vultr | preferably nothing, just fail and move on | 18:58 |
eb3095-Vultr | I'm still not even sure if that's what is supposed to happen or not lol. I'm still not convinced that me throwing that exception isn't explicitly at fault, or some other weird nonsense I did in that datasource. | 18:58 |
minimal | well the Vultr DS selection does fail: | 18:59 |
minimal | 2023-03-15 14:47:48,917 - handlers.py[DEBUG]: finish: init-local/search-Vultr: FAIL: no local data found from DataSourceVultr | 18:59 |
minimal | "fail and move on"? but then it doesn't have access to meta-data etc - i.e.g. things like network config are checked on every boot (the interfaces config could have changed since last boot) | 19:00 |
eb3095-Vultr | I guess that's just semantics on my behalf, the datasource does indeed fail. I'm wondering if it should fail in that fashion if that data isn't available, or if I should be invoking the cache in some way, or avoiding the exception. I'm aware that if it moves on it won't change configs, but stale configs are still a better result than locking the user out of the VM and rolling host keys. | 19:00 |
minimal | any of blackboxsw, falcojr, or smoser around to give some input? | 19:02 |
minimal | eb3095-Vultr: after the DHCP failure is logged | 19:10 |
minimal | main.py[DEBUG]: No local datasource found | 19:10 |
falcojr | rolling host keys? | 19:11 |
eb3095-Vultr | it changes all the host keys and the root password when it fails over to none | 19:11 |
falcojr | sorry, let me read back further :D | 19:12 |
minimal | because that is seen as a new instance and so "1st boot" actions occur | 19:13 |
eb3095-Vultr | indeed | 19:13 |
minimal | looking at cloudinit/cmd/main.py perhaps passing "--force" to cloud-init may do what you want | 19:13 |
minimal | e.g. "Force running even if no datasource is found (use at your own risk)." | 19:14 |
eb3095-Vultr | i would think that does the opposite haha | 19:15 |
minimal | well if you don't use that then a DS must be found, and the one found is None in your case as Vultr fails ;-) | 19:16 |
falcojr | eb3095-Vultr: if you wanna be really bleeding edge, you could try out this PR https://github.com/canonical/cloud-init/pull/2060 | 19:21 |
-ubottu:#cloud-init- Pull 2060 in canonical/cloud-init "datasource: Optimize datasource detection, fix bugs" [Open] | 19:21 |
falcojr | ^ takes the openstack fix mentioned by minimal and applies it to all datasources | 19:22 |
minimal | I think this is part of the issue: | 19:22 |
minimal | https://github.com/canonical/cloud-init/blob/main/cloudinit/net/dhcp.py#L151 | 19:22 |
minimal | it doesn't define how to handle non-zero exit codes, like whatever exit code dhclient returns for "couldn't get a lease" | 19:23 |
minimal | falcojr: ah, I was mixing that PR up with the other, already merged, one | 19:24 |
eb3095-Vultr | I built off of main and it still happens, I can provide a log. falcojr, can I get confirmation of whether or not this is expected behavior? If it's not, then I can totally guess a few things I might have gotten wrong in the datasource | 19:24 |
blackboxsw | good pt on dhcp.py#L151.... so we probably want to return [] upon failure there. | 19:25 |
minimal | blackboxsw: the question is what to do with failure? wait a few seconds and try again? how many retries before giving up? etc | 19:25 |
blackboxsw | reading the rest of the context from today too. if dhcp fails during datasource detection and we don't handle an exception in get_data, we skip that datasource and move on to the next in the list | 19:25 |
minimal | I'd assume when a DS that uses a metadata server is selected then the DS *really* needs to talk to the metadata server rather than giving up at the first attempt | 19:26 |
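As a rough sketch of the retry idea being discussed (hypothetical function and parameter names, not cloud-init's actual dhcp.py API): retry DHCP a few times on one interface, then return an empty lease list instead of raising.

```python
import subprocess
import time


def dhcp_discovery_with_retries(interface, attempts=3, delay=5):
    """Hypothetical sketch: try DHCP on one interface a few times.

    Returns a (fake) lease list on success, or an empty list once all
    attempts have failed, instead of raising on the first failure.
    """
    for attempt in range(attempts):
        result = subprocess.run(
            ["dhclient", "-1", interface],  # -1: try once, then exit
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            # The real code would parse the dhclient lease file here.
            return [{"interface": interface}]
        if attempt < attempts - 1:
            time.sleep(delay)
    # All attempts failed: report "no leases" rather than raising.
    return []
```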
falcojr | dhcp.py exception should be better, but it looks like it's being caught in vultr code | 19:28 |
eb3095-Vultr | I don't mind this behavior on first boot; on additional boots though it's an issue. I'd prefer in these cases for it to just do nothing and cloud-init fail, not switch to None. I'm guessing that I can take that as confirmation this is expected behavior though. I'm guessing my next best option would be to make it so that even on this failure, if it's not the very first boot, the datasource just doesn't fail. Any recommendation | 19:28 |
eb3095-Vultr | on how to proceed there? | 19:28 |
falcojr | https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/helpers/vultr.py/#L43 | 19:28 |
eb3095-Vultr | yeah, I'm catching it and rethrowing it. This is for instances where we have to cycle through multiple interfaces. | 19:31 |
blackboxsw | hrm /me had to look at lru_cache and raising exceptions to make sure we don't just cache the raised exception path per vultr.get_metadata invocation. | 19:38 |
blackboxsw | safe there. multiple calls to an lru_cached function still get in and attempt to run the code. It seems only successful return values are cached. | 19:39 |
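That lru_cache behaviour is easy to verify in isolation with a small, Vultr-independent example: a call that raises is not cached, so the next call runs the function body again.

```python
from functools import lru_cache

calls = 0


@lru_cache(maxsize=None)
def fetch_metadata():
    global calls
    calls += 1
    if calls == 1:
        raise OSError("simulated DHCP/metadata failure")
    return {"instance-id": "example"}


try:
    fetch_metadata()
except OSError:
    pass  # the failed call is NOT stored in the cache

print(fetch_metadata())  # {'instance-id': 'example'}
print(calls)             # 2 -- the body ran again after the earlier exception
```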
minimal | that's a good question - how to handle the multi-interface scenario? how to differentiate between "interface has no DHCP response as it is not supposed to be used for metadata" and "interface has no DHCP response but it is supposed to, as it is for metadata access"? | 19:40 |
eb3095-Vultr | I actually think that was a holdover from a previous implementation. I don't even know if it's needed anymore, it was there because that was getting called multiple times but I don't think it is anymore. When the networking kicks back in though, it hasn't shown it returning an exception where it should not. | 19:40 |
eb3095-Vultr | looks like you came to the same conclusion, so nevermind haha | 19:40 |
falcojr | an exception in _get_data shouldn't be invalidating cache...something else seems to be going on | 19:43 |
blackboxsw | ok, and reading that helpers.vultr.get_metadata, we only raise the last exception if all get_interface_list() fail. so if none of the interfaces can DHCP, DataSourceVultr.get_data() will eventually raise that error and fail to discover the datasource, and cloud-init will fall back to None. | 19:43 |
eb3095-Vultr | yeah that's what I figured | 19:43 |
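The "cycle through interfaces and only re-raise the last error" shape described above boils down to roughly the following simplified sketch (not the actual helper; the `fetch` callable stands in for the ephemeral-DHCP-plus-metadata-request step).

```python
def get_metadata_sketch(interfaces, fetch):
    """Simplified shape of the 'cycle interfaces, raise last error' flow.

    `interfaces` is a list of NIC names; `fetch` is any callable that
    either returns metadata for a given interface or raises on failure.
    """
    last_error = RuntimeError("No interfaces available")
    for interface in interfaces:
        try:
            # In the real helper this step brings up ephemeral DHCP on the
            # interface and then talks to the metadata server.
            return fetch(interface)
        except Exception as err:
            # Remember the failure and move on to the next interface.
            last_error = err
    # Every interface failed: the last error propagates up to get_data(),
    # the Vultr datasource is not detected, and cloud-init falls back to None.
    raise last_error


# Example: every interface fails, so the last exception is re-raised.
def always_fail(interface):
    raise ConnectionError(f"no DHCP lease on {interface}")


try:
    get_metadata_sketch(["eth0", "eth1"], always_fail)
except ConnectionError as err:
    print(err)  # no DHCP lease on eth1
```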
blackboxsw | it seems I have some food stuck under my "p" key as it kees not firing :/ | 19:44 |
eb3095-Vultr | so I know cloud-init is caching this somewhere, and it seems like throwing that exception is the issue. Is there a way for me to invoke that cache and return what's cached instead of throwing that exception? Would that be a better solution here? | 19:44 |
blackboxsw | right James, something is telling cloud-init to recrawl get_data every boot and if dhcp isn't active on that reboot then we get a raised exception here in get_data, failing to discover vultr datasource https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/DataSourceVultr.py#L50 | 19:46 |
blackboxsw | generally that doesn't happen per boot unless instance-id changes | 19:47 |
eb3095-Vultr | oh, it's recrawling because I am telling it to in the cloud.cfg, because we use it to configure networking which is subject to change and relies on the metadata. | 19:47 |
eb3095-Vultr | I'm ok with a stale config here on a failure though | 19:47 |
blackboxsw | how are you telling it to recrawl? | 19:52 |
eb3095-Vultr | https://www.irccloud.com/pastebin/g7XEaVa4/ | 19:53 |
falcojr | what's in /run/cloud-init/cloud.cfg ? | 19:53 |
eb3095-Vultr | what I just posted above. That's why it's recrawling, that's intentional as we want it to configure any network config changes and it pulls that info from metadata, that's why it's hitting that every boot. | 19:54 |
blackboxsw | yeah I'm trying to peek at plumbing for the network updates definitions to see how we handle failure for just the subsequent get_data call after datasource is already detected. I'm expecting this is the same failure path for normal datasource detection (update_metadata_if_supported or some such function) | 19:54 |
blackboxsw | if we are saying recrawl get_data for network updates, but the network isn't there, I'd lean toward saying the datasource (or network backplane of the cloud) should probably be in place, or more resilient, in the unlikely event of DHCP not being available. | 19:55 |
eb3095-Vultr | I'd imagine so. Which is why I'm wondering if, instead of throwing that exception, I should just invoke cloud-init's cache, grab the last known metadata, and return that instead, and if it's not available, like on a first boot, then let it fail over as it sees fit. | 19:55 |
blackboxsw | expected use-cases like this could also be 'sandboxing' a VM without network for a while for triage/debugging etc. | 19:56 |
falcojr | are we somehow changing the instance-id at the same time? | 19:57 |
falcojr | "just" failing to _get_data shouldn't result in a datasource change | 19:57 |
falcojr | or failing to read the instance-id? | 19:57 |
eb3095-Vultr | i can provide the latest log if you like | 19:57 |
falcojr | yeah, that'd help | 19:58 |
eb3095-Vultr | https://pastebin.com/XKhvfrQS this is from the build on the main branch, so bleeding edge build | 19:58 |
eb3095-Vultr | it was installed AFTER it booted on 22.4.2 though | 19:59 |
blackboxsw | while we are debugging this, I think the problem is here https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/__init__.py#L858 with respect to the Network scope PER_BOOT updates. If get_data() no longer works because DHCP is unavailable in Vultr on this particular boot, we'll raise an exception here just after clearing all cached userdata, vendordata, metadata on the cached datasource. | 20:01 |
blackboxsw | I wonder if we see "Unable to render networking. Network config is likely broken: %s" | 20:03 |
eb3095-Vultr | If it helps I can spin up an instance or credit an account if you want hands on here | 20:06 |
blackboxsw | the problem is, if we are telling cloud-init to render per boot, and the DHCP infrastructure or IMDS isn't available per boot, what should our failure mode be? Generally I'd say we want the datasource or cloud to retry if it's an expected failure path that should be resolved with retries on that cloud. If we always default to using the cached datasource, we prevent folks from migrating from one datasource to another | 20:06 |
blackboxsw | use-case (albeit not very widely used): a NoCloud CIDATA drive is mounted, which cloud-init discovers on one boot to create a 'golden image'. Then the device is removed and the image is launched on AWS or GCP etc and we expect cloud-init to detect the new cloud IMDS. | 20:07 |
blackboxsw | if we always fall back to reusing the cached datasource when get_data fails, we prevent migration of images from one cloud to another. | 20:08 |
eb3095-Vultr | wouldn't defining the DS on the cmdline in grub fix that? | 20:09 |
blackboxsw | ^ this use-case is probably not used often (or intentionally). But it's a benefit of cloud-init being able to discover any viable datasource across boot | 20:09 |
blackboxsw | I realize the use-case isn't too strong. yes, ci.ds=Oracle or something on the kernel cmdline does fix that across migration, but it depends on whether or not someone has access to manipulate the cmdline. Sure, it's easy with physical systems and kvm to extend the kernel cmdline, but other cloud platforms don't expose that ability... but you can import images (which may be a dirty image in which cloud-init has run once before in a slightly different | 20:11 |
blackboxsw | environment) | 20:11 |
blackboxsw | sorry, being pulled away for a bit. good discussion here though. I look forward to participating more | 20:11 |
blackboxsw | and feel free to discount that use-case/suggestion | 20:11 |
falcojr | I don't actually think that's the problem as that cache clear line isn't clearing our entire cached datasource | 20:22 |
falcojr | eb3095-Vultr: any chance this could be failing? https://github.com/canonical/cloud-init/blob/23.1.1/cloudinit/sources/helpers/vultr.py/#L110 | 20:23 |
falcojr | and by failing I mean returning False | 20:23 |
falcojr | I think minimal pointed this out earlier, but that line in the one log: "2023-03-15 14:46:47,866 - stages.py[DEBUG]: cache invalid in datasource: DataSourceVultr" | 20:23 |
falcojr | comes from https://github.com/canonical/cloud-init/blob/23.1.1/cloudinit/stages.py/#L303 | 20:24 |
falcojr | and the "check_instance_id" call is overridden in the Vultr datasource | 20:25 |
falcojr | so it seems the only way we would be getting that log is if that call is returning false | 20:25 |
eb3095-Vultr | nah, that is not failing, we use KVM and are feeding that through the DMI, and it's there throughout | 20:26 |
eb3095-Vultr | also, everything works perfectly right up to the point where DHCP throws an exception | 20:27 |
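For context, a DMI-backed check_instance_id conceptually reduces to comparing the cached instance-id against a value the hypervisor exposes via sysfs. The sketch below uses an example DMI field and is not the actual Vultr implementation; returning False is what produces the "cache invalid in datasource" log line.

```python
from pathlib import Path


def check_instance_id_sketch(cached_instance_id, dmi_field="product_serial"):
    """Return True when the instance looks unchanged across boots.

    Reads a DMI value the hypervisor injects (the field name here is just
    an example) from sysfs and compares it to the instance-id cached in
    /var/lib/cloud. Returning False triggers the "cache invalid in
    datasource" path and a full re-crawl.
    """
    dmi_path = Path("/sys/class/dmi/id") / dmi_field
    try:
        current = dmi_path.read_text().strip()
    except OSError:
        return False  # DMI not readable: treat the cache as invalid
    return current == cached_instance_id
```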
falcojr | lines 885 and after in https://pastebin.com/sfGfpX6k | 20:27 |
falcojr | that's pre dhcp and bad | 20:27 |
falcojr | erhm, sorry, 1671 I mean | 20:28 |
falcojr | bah, nevermind, ignore me. Just now seeing the DataSourceNone there was already detected | 20:29 |
eb3095-Vultr | np lol, very verbose log | 20:30 |
falcojr | going back to 885 though, what's interesting is that it says cache invalid but then detects Vultr again | 20:31 |
falcojr | yeah, ok, I was messing up different "cache invalid" lines. 885 does look concerning to me and that is happening right before the DHCP failure | 20:33 |
minimal | "Thats why its recrawling, thats intentional as we want it to configure any network config changes and it pulls that info from metadata, thats why its hitting that every boot." - don't all the DSes that use metadata servers check network config from metadata server on every boot? | 20:43 |
falcojr | no, most don't actually | 20:44 |
falcojr | https://canonical-cloud-init.readthedocs-hosted.com/en/latest/explanation/events.html BOOT_NEW_INSTANCE is most common | 20:45 |
minimal | falcojr: hmm, ok, I thought if for example I added a 2nd network interface while the VM was running then that's what HOTPLUG was for, but if I added it while the VM was not running then it would check network-config from IMDS (e.g. EC2, GCE, Azure, Vultr, etc) upon next boot and deal with it | 20:48 |
minimal | likewise if I changed the IP, subnet, etc of any interface while the VM wasn't running | 20:49 |
falcojr | depends on the datasource, but if it defaults to BOOT_NEW_INSTANCE and hasn't issued a new instance id, then no, it won't | 20:49 |
eb3095-Vultr | manual_cache_clean: true seems to fix the issue on our end if put in the cloud.cfg | 20:52 |
falcojr | ah yeah, that should work. I completely forgot about that option | 20:55 |
falcojr | probably worth figuring out why check_instance_id is returning false though. | 20:56 |
minimal | falcojr: ok so from a quick check of DSes, Azure, ConfigDrive, Ec2, OpenStack, RbxCloud, Scaleway, and SmartOS have both BOOT and BOOT_NEW_INSTANCE, all the others use default of only BOOT_NEW_INSTANCE | 20:58 |
minimal | so perhaps Vultr DS should also define BOOT as well as BOOT_NEW_INSTANCE to get network config changes rather than the way they're currently configuring things? | 20:59 |
eb3095-Vultr | we do, in our own cloud.cfg on the images we use. I'll go back and put a PR in for the code though, totally forgot lol | 21:00 |
minimal | yeah that's what I meant, in DataSourceVultr.py rather than in cloud.cfg | 21:02 |
minimal | as it means anyone, like myself, who builds their own Vultr OS images will have the DS set up in that fashion ;-) | 21:03 |
eb3095-Vultr | indeed, an oversight on my behalf, I made this change on our end and never got around to it haha | 21:03 |
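For anyone following up on that PR idea: based on how other datasources (e.g. Ec2) declare per-boot network updates, the in-code change would look roughly like the snippet below; the attribute and enum names should be double-checked against the current cloud-init source.

```python
from cloudinit.event import EventScope, EventType

# Set on the DataSourceVultr class so that network config is re-applied on
# every boot, not only when a new instance-id is detected.
default_update_events = {
    EventScope.NETWORK: {
        EventType.BOOT_NEW_INSTANCE,  # the implicit default
        EventType.BOOT,               # also refresh on every reboot
    }
}
```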
minimal | I also have been meaning to buy a Round Tuit for several years :-) | 21:04 |