/srv/irclogs.ubuntu.com/2023/03/15/#cloud-init.txt

[15:49] <eb3095-Vultr> So having an issue with the Vultr datasource. When it fails to dhcp/reach metadata, the datasource fails out and it defaults to nocloud, resetting the host keys, the root passwords, etc. Is this expected behavior? Should we be in some way invoking the cache in this instance to prevent this behavior?
[15:58] <minimal> eb3095-Vultr: first off you could change the defined/active datasource list to be only Vultr (or Vultr and None) so it can't fall back to NoCloud or anything else
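For reference, pinning the datasource list looks roughly like this (a minimal sketch; on most distros it would live in /etc/cloud/cloud.cfg or a drop-in under /etc/cloud/cloud.cfg.d/, and the filename below is hypothetical):

    # /etc/cloud/cloud.cfg.d/90-vultr.cfg
    # Only probe the Vultr datasource; None remains the built-in fallback.
    datasource_list: [ Vultr, None ]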
[15:59] <minimal> is this happening on 1st boot or on (some) later boots? have you enabled debugging for the cloud-init logs to see more info about what is going on?
[16:02] <eb3095-Vultr> Sorry, I did mean None, not NoCloud, and yes, we are defining the datasources in cloud.cfg as just the Vultr datasource and we are still seeing this behavior. It's happening on boots after the first. I see what's going on though: it was trying to either dhcp or reach the metadata, and then an error is thrown in the datasource, which fails out the datasource and it falls back to None.
[16:07] <minimal> so there is a DHCP fallback, but I'm surprised that SSH host keys would be (re)generated after initial boot. Can you enable debugging and then pastebin (or similar) the cloud-init.log?
[16:08] <eb3095-Vultr> Sure, one sec.
[16:12] <eb3095-Vultr> While I am getting that for you (have someone pulling it for me), I was looking at this: https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/helpers/vultr.py In the get_metadata function I'm raising the exception from the dhcp failure, assuming that's an issue here. Should I be checking for a cache and returning that instead? This seems like the underlying cause.
[16:12] <eb3095-Vultr> https://pastebin.com/sfGfpX6k that's the cloud-init.log
[16:17] <minimal> still reading through it but can see DHCP being set up and info successfully fetched from the Vultr metadata server
[16:22] <eb3095-Vultr> further down; the guy replicating this for me deployed, then broke networking to reproduce.
[16:41] <minimal> so the initial run looks fine to me
[16:41] <minimal> in the 2nd set of log entries I noticed:
[16:41] <minimal> 2023-03-15 14:46:47,866 - stages.py[DEBUG]: cache invalid in datasource: DataSourceVultr
[16:41] <minimal> 2023-03-15 14:46:47,866 - handlers.py[DEBUG]: finish: init-local/check-cache: SUCCESS: cache invalid in datasource: DataSourceVultr
[16:41] <minimal> 2023-03-15 14:46:47,866 - util.py[DEBUG]: Attempting to remove /var/lib/cloud/instance
[16:43] <minimal> also trying to do DHCP fails for some reason
[16:44] <eb3095-Vultr> yeah, in that particular instance the dhcp failure is intentional. Concerned about the cache being invalid.
[16:44] <minimal> DHCP failure is intentional? how so?
[16:45] <eb3095-Vultr> in this case we were purposely breaking it to replicate an outage, to test this scenario. As when it fails we are seeing cloud-init re-init on the None datasource and it's breaking installs
[16:46] <minimal> breaking installs? the "install" has already occurred, i.e. the 1st boot
[16:46] <minimal> if Vultr is failing to provide a DHCP response then isn't that a Vultr issue?
[16:48] <eb3095-Vultr> By breaking installs I mean breaking existing deployed servers by rolling host keys and changing passwords on subsequent boots. And yes, DHCP failure would be a Vultr issue, but that's not the concern I have. It's that when an outage happens where DHCP or the metadata is down, we are seeing customer instances that reboot in this window have their servers altered. I'm trying to determine if this is expected behavior in this instance or if we did something wrong in the datasource that we need to fix.
[16:49] <minimal> AFAIK the server is being altered as the None datasource is being used, as the Vultr datasource is not usable
[16:50] <minimal> the Vultr datasource is not usable due to DHCP failure and therefore non-access to the Vultr metadata server
[16:50] <eb3095-Vultr> yeah, that does seem to be the case. Is that what's expected to happen here, or should I throw a merge request in to prevent this somehow, so even if the DHCP fails we can fall back on the cache?
[16:52] <minimal> there has been a recent change to treat an enabled datasource of either "DS" or ["DS","None"] as hardcoded to only use the specified datasource
[16:53] <eb3095-Vultr> was that very recently? We are using a copy from like a month or two ago from the main branch, because we had our own fixes in that we needed to pull down here. I'm wondering if me raising that exception in get_metadata is the cause of this behavior or not
[16:54] <minimal> hmm, it is OpenStack-only at present it seems; support for other DSes not yet done: https://github.com/canonical/cloud-init/commit/d1ffbea556a06105d1ade88b4143ad43f53692c4
[16:54] -ubottu:#cloud-init- Commit d1ffbea in canonical/cloud-init "source: Force OpenStack when it is only option (#2045)"
[16:55] <minimal> so that's in release 23.1.1
[16:55] <eb3095-Vultr> hmm, I'll happily add us to that list. So I'm guessing this is in fact the expected behavior here and we are not doing something explicitly wrong in our own datasource? Adding ourselves to this here will also prevent this behavior?
[16:57] <minimal> probably best to discuss with holmanb, the author of that PR
[16:59] <minimal> basically the ssh module only regenerates SSH host keys "once-per-instance" (i.e. not once per boot), so this only happens if c-i thinks the instance has changed; this is being triggered by the switch from the Vultr DS to the None DS
[16:59] <minimal> it would be useful to see logs from when the problem happens, rather than when the problem is simulated, as the 2 may not have identical behaviour
[17:00] <eb3095-Vultr> Yeah, I get what's happening; I was just concerned that the change in DS was because we were throwing that exception and should be doing something else here. Wasn't sure if we were breaking cloud-init or this is just how cloud-init works.
[17:00] <minimal> I'd assume basically that any DS using a metadata server relies upon that metadata server being available to it upon each boot
[17:01] <minimal> so I wonder why no DHCP response is occasionally happening
[17:01] <eb3095-Vultr> it's rare, but outages happen from time to time. Just hoping we had a path to avoid breaking instances that reboot in those windows
[17:04] <minimal> hopefully holmanb should be on channel shortly for his input
[17:04] <minimal> eb3095-Vultr: are you Vultr staff?
[17:04] <eb3095-Vultr> Yes, and the guy who wrote that datasource
[17:07] <minimal> interesting. I'm the Alpine maintainer/packager for cloud-init; I've been meaning to give it a test on Vultr
[17:07] <eb3095-Vultr> We just added Alpine recently
[17:10] <minimal> ok, I don't see it listed on your Operating Systems page
[17:13] <eb3095-Vultr> erm, I thought we released it. Oof, still in beta; one sec, I'ma deal with that right now haha
[17:40] <eb3095-Vultr> Hmm, it does look like that commit is the solution, and from 2 weeks ago at that, so gonna rebuild and run a trial to see if that solves my problem here.
[17:40] <eb3095-Vultr> thanks for the heads up on that
[17:48] <minimal> that fix is in cloud-init 23.1.1; I've packaged that for Alpine Edge (a couple of days ago) but Alpine 3.17 is still on cloud-init 22.4 at present
[18:14] <WannaBeGeekster> Hey everyone. Trying to use packages: on Ubuntu 22.04. I look in the logs and it is trying to use yum instead of apt to install the packages, so it is erroring out. Not sure if anyone has seen that before or not?
[18:15] <minimal> WannaBeGeekster: are you using an Ubuntu cloud image?
[18:16] <WannaBeGeekster> I didn't build the images. I was basically just told the same thing from my manager. Let me try a different image and see if it fixes the issue.
[18:17] <minimal> is it some sort of "official" image or did someone locally create it?
[18:17] <WannaBeGeekster> The image I was using was built using Packer.
[18:18] <minimal> my point being, perhaps cloud-init was not correctly configured when the image was created
[18:18] <WannaBeGeekster> Yes, I will double check that for sure. I think it is the most likely issue right now.
[18:37] <eb3095-Vultr> for my issue, that patch didn't change anything; it still falls back to None on failure and rolls the keys and passwords again
[18:57] <minimal> eb3095-Vultr: so the underlying issue is that DHCP failure means Vultr DS selection fails
[18:57] <eb3095-Vultr> well, I wouldn't say the DS selection fails, as it does choose it; it's more that it falls back to None and borks the server. Would rather it do nothing than it do that
[18:57] <minimal> but what would you expect the DS, or cloud-init, to do when DHCP requests fail?
[18:58] <eb3095-Vultr> preferably nothing, just fail and move on
[18:58] <eb3095-Vultr> I'm still not even sure if that's what is supposed to happen or not lol. I'm still not convinced me throwing that exception isn't explicitly at fault, or some other weird nonsense I did in that datasource.
[18:59] <minimal> well, the Vultr DS selection does fail:
[18:59] <minimal> 2023-03-15 14:47:48,917 - handlers.py[DEBUG]: finish: init-local/search-Vultr: FAIL: no local data found from DataSourceVultr
[19:00] <minimal> "fail and move on"? but then it doesn't have access to meta-data etc - e.g. things like network config are checked on every boot (the interfaces config could have changed since last boot)
[19:00] <eb3095-Vultr> I guess that's just semantics on my behalf; the datasource does indeed fail. I'm wondering if it should fail in that fashion if that data isn't available, or if I should be invoking the cache in some way, or avoiding the exception. I'm aware that if it moves on it won't change configs, but stale configs is still a better result than locking the user out of the VM and rolling host keys.
[19:02] <minimal> any of blackboxsw, falcojr, or smoser around to give some input?
[19:10] <minimal> eb3095-Vultr: after the DHCP failure is logged:
[19:10] <minimal> main.py[DEBUG]: No local datasource found
[19:11] <falcojr> rolling host keys?
[19:11] <eb3095-Vultr> it changes all the host keys and the root password when it fails over to None
[19:12] <falcojr> sorry, let me read back further :D
[19:13] <minimal> because that is seen as a new instance and so "1st boot" actions occur
[19:13] <eb3095-Vultr> indeed
[19:13] <minimal> looking at cloudinit/cmd/main.py, perhaps passing "--force" to cloud-init may do what you want
[19:14] <minimal> e.g. "Force running even if no datasource is found (use at your own risk)."
[19:15] <eb3095-Vultr> I would think that does the opposite haha
[19:16] <minimal> well, if you don't use that then a ds must be found, and the one found is None in your case as Vultr fails ;-)
[19:21] <falcojr> eb3095-Vultr: if you wanna be really bleeding edge, you could try out this PR https://github.com/canonical/cloud-init/pull/2060
[19:21] -ubottu:#cloud-init- Pull 2060 in canonical/cloud-init "datasource: Optimize datasource detection, fix bugs" [Open]
[19:22] <falcojr> ^ takes the OpenStack fix mentioned by minimal and applies it to all datasources
[19:22] <minimal> I think this is part of the issue:
[19:22] <minimal> https://github.com/canonical/cloud-init/blob/main/cloudinit/net/dhcp.py#L151
[19:23] <minimal> it doesn't define how to handle non-zero exit codes, like whatever code dhclient returns for "couldn't get a lease"
[19:24] <minimal> falcojr: ah, I was mixing that PR up with the other, already merged, one
[19:24] <eb3095-Vultr> I built off of main and it still happens; I can provide a log. falcojr, can I get confirmation of whether or not this is expected behavior? If it's not, then I can totally guess a few things I might have gotten wrong in the datasource
[19:25] <blackboxsw> good point on dhcp.py#L151... so we probably want to return [] upon failure there.
[19:25] <minimal> blackboxsw: the question is what to do with failure? wait a few seconds and try again? how many retries before giving up? etc
[19:25] <blackboxsw> reading the rest of the context from today too. if dhcp fails during datasource detection and we don't handle an exception in get_data, we skip that datasource and move on to the next in the list
[19:26] <minimal> I'd assume when a DS that uses a metadata server is selected then the DS *really* needs to talk to the metadata server rather than giving up at the first attempt
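A retry wrapper along these lines is one way to avoid giving up after a single DHCP attempt (a hedged sketch, not actual cloud-init code; the discover callable and the attempt/delay numbers are assumptions):

    import time

    def dhcp_with_retries(discover, attempts=3, delay=2.0):
        """Call a DHCP discovery function, retrying on failure.

        `discover` is any callable that raises when a lease attempt fails
        (e.g. the client exiting non-zero); the last exception is re-raised
        once all attempts are exhausted.
        """
        last_error = None
        for attempt in range(1, attempts + 1):
            try:
                return discover()
            except Exception as error:  # ideally a narrower lease-failure type
                last_error = error
                time.sleep(delay * attempt)  # simple linear backoff
        raise last_error

The open questions raised above (how long to wait, how many retries before giving up) are exactly the parameters such a wrapper would have to pick.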
[19:28] <falcojr> the dhcp.py exception should be better, but it looks like it's being caught in vultr code
[19:28] <eb3095-Vultr> I don't mind this behavior on first boot; on additional boots though it's an issue. I'd prefer in these cases for it to just do nothing and cloud-init to fail, not switch to None. I'm guessing that I can take that as confirmation this is expected behavior though. I'm guessing my next best option would be to make it so even on this failure, if it's not the very first boot, the datasource just doesn't fail. Any recommendation on how to proceed there?
[19:28] <falcojr> https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/helpers/vultr.py/#L43
[19:31] <eb3095-Vultr> yeah, I'm catching it and rethrowing it. This is for instances where we have to cycle through multiple interfaces.
[19:38] <blackboxsw> hrm /me had to look at lru_cache and raising exceptions to make sure we don't just cache the raised exception path per vultr.get_metadata invocation.
[19:39] <blackboxsw> safe there. multiple calls to an lru_cached function still get in and attempt to run the code. It seems only successful return values are cached.
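That behavior is easy to confirm standalone: functools.lru_cache only memoizes successful returns, so a raised exception is never cached and the function body runs again on the next call. A minimal demonstration:

    from functools import lru_cache

    calls = 0

    @lru_cache(maxsize=None)
    def flaky():
        global calls
        calls += 1
        if calls == 1:
            raise RuntimeError("first attempt fails")
        return "ok"

    try:
        flaky()
    except RuntimeError:
        pass

    print(flaky())  # "ok" -- the exception from the first call was not cached
    print(calls)    # 2 -- the body executed again after the failure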
[19:40] <minimal> that's a good question - how to handle the multi-interface scenario? how to differentiate between "interface has no DHCP response as it is not supposed to be used for metadata" and "interface has no DHCP response but it is supposed to, as it is for metadata access"?
[19:40] <eb3095-Vultr> I actually think that was a holdover from a previous implementation. I don't even know if it's needed anymore; it was there because that was getting called multiple times, but I don't think it is anymore. When the networking kicks back in though, it hasn't shown it returning an exception where it should not.
[19:40] <eb3095-Vultr> looks like you came to the same conclusion, so nevermind haha
[19:43] <falcojr> an exception in _get_data shouldn't be invalidating cache... something else seems to be going on
[19:43] <blackboxsw> ok, and reading that helpers.vultr.get_metadata, we only raise the last exception if all get_interface_list() interfaces fail. so if none of the interfaces can DHCP, DataSourceVultr.get_data() will eventually raise that error and fail to discover the datasource, and cloud-init will fall back to None.
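The pattern being described has roughly this shape (a paraphrased sketch of the loop, not the exact vultr.py code; the names are illustrative):

    def get_metadata(interfaces, fetch_via):
        """Try each candidate NIC in turn; re-raise the last failure only
        if every interface fails to DHCP / reach the metadata service."""
        last_error = None
        for nic in interfaces:
            try:
                return fetch_via(nic)  # DHCP on this NIC, then query the IMDS
            except Exception as error:
                last_error = error     # remember the failure, try the next NIC
        # Every interface failed: the error bubbles up out of get_data(),
        # datasource discovery fails, and cloud-init falls back to None.
        raise last_error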
[19:43] <eb3095-Vultr> yeah, that's what I figured
[19:44] <blackboxsw> it seems I have some food stuck under my "p" key as it kees not firing :/
[19:44] <eb3095-Vultr> so I know cloud-init is caching this somewhere, and it seems like throwing that exception is the issue. Is there a way for me to invoke that cache and return what's cached instead of throwing that exception? Would that be a better solution here?
[19:46] <blackboxsw> right James, something is telling cloud-init to recrawl get_data every boot, and if dhcp isn't active on that reboot then we get a raised exception here in get_data, failing to discover the Vultr datasource: https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/DataSourceVultr.py#L50
[19:47] <blackboxsw> generally that doesn't happen per boot unless instance-id changes
[19:47] <eb3095-Vultr> oh, it's recrawling because I am telling it to in the cloud.cfg, because we use it to configure networking, which is subject to change and relies on the metadata.
[19:47] <eb3095-Vultr> I'm ok with a stale config here on a failure though
[19:52] <blackboxsw> how are you telling it to recrawl?
[19:53] <eb3095-Vultr> https://www.irccloud.com/pastebin/g7XEaVa4/
[19:53] <falcojr> what's in /run/cloud-init/cloud.cfg?
[19:54] <eb3095-Vultr> what I just posted above. That's why it's recrawling; that's intentional, as we want it to configure any network config changes and it pulls that info from metadata. That's why it's hitting that every boot.
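The paste itself isn't preserved, but per the events documentation linked later in the discussion, opting into per-boot network recrawls is configured along these lines (an assumed reconstruction of the relevant cloud.cfg fragment):

    # cloud.cfg fragment: re-apply network config on every boot,
    # not just on the first boot of a new instance
    updates:
      network:
        when: ['boot']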
[19:54] <blackboxsw> yeah, I'm trying to peek at the plumbing for the network updates definitions to see how we handle failure for just the subsequent get_data call after the datasource is already detected. I'm expecting this is the same failure path as for normal datasource detection (update_metadata_if_supported or some such function)
[19:55] <blackboxsw> if we are saying recrawl get_data for network updates, but the network isn't there, I'd lean toward the datasource (or network backplane of the cloud) should probably be in place or more resilient in the unlikely event of DHCP not being available.
[19:55] <eb3095-Vultr> I'd imagine so. Which is why I'm wondering if, instead of throwing that exception, I should just invoke cloud-init's cache, grab the last known metadata, and return that instead; and if it's not available, like on a first boot, then let it fail over as it sees fit.
[19:56] <blackboxsw> expected use-cases like this could also be 'sandboxing' a VM without network for a while for triage/debugging etc.
[19:57] <falcojr> are we somehow changing the instance-id at the same time?
[19:57] <falcojr> "just" failing in _get_data shouldn't result in a datasource change
[19:57] <falcojr> or failing to read the instance-id?
[19:57] <eb3095-Vultr> I can provide the latest log if you like
[19:58] <falcojr> yeah, that'd help
[19:58] <eb3095-Vultr> https://pastebin.com/XKhvfrQS this is from the build on the main branch, so a bleeding-edge build
[19:59] <eb3095-Vultr> it was installed AFTER it booted on 22.4.2 though
[20:01] <blackboxsw> while we are debugging this, I think the problem is here: https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/__init__.py#L858 with respect to the Network scope PER_BOOT updates. If get_data() no longer works because DHCP is unavailable in Vultr on this particular boot, we'll raise an exception here just after clearing all cached userdata, vendordata, and metadata on the cached datasource.
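Condensed, the failure path being described looks something like this (a loose sketch of the control flow as explained above, not the actual cloudinit/sources/__init__.py code; the helper names are illustrative):

    # Hypothetical condensation of the per-boot update path:
    def update_metadata_if_supported(datasource, boot_events):
        if not supports_events(datasource, boot_events):
            return False
        datasource.clear_cached_attrs()  # userdata/vendordata/metadata dropped here
        return datasource.get_data()     # a DHCP failure raises AFTER the clear,
                                         # leaving no usable cached data behind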
[20:03] <blackboxsw> I wonder if we see "Unable to render networking. Network config is likely broken: %s"
[20:06] <eb3095-Vultr> If it helps, I can spin up an instance or credit an account if you want hands-on here
[20:06] <blackboxsw> the problem is, if we are telling cloud-init to render per boot, and the dhcp infrastructure or IMDS isn't available per boot, what should our failure mode be? Generally I'd say we want the datasource or cloud to retry if it's an expected failure path that should be resolved with retries on that cloud. If we always default to using the cached datasource, we prevent folks from migrating from one datasource to another
[20:07] <blackboxsw> use-case (albeit not very widely used): a NoCloud CIDATA drive is mounted, which cloud-init discovers on one boot to create a 'golden image'. Then the device is removed and the image is launched on AWS or GCP etc., and we expect cloud-init to detect the new cloud's IMDS.
[20:08] <blackboxsw> if we always fall back to reusing the cached datasource when get_data fails, we prevent migration of images from one cloud to another.
[20:09] <eb3095-Vultr> wouldn't defining the ds in the cmdline in grub fix that?
[20:09] <blackboxsw> ^ this use-case is probably not used often (or intentionally). But it's a benefit of cloud-init being able to discover any viable datasource across boots
[20:11] <blackboxsw> I realize the use-case isn't too strong. yes, ci.ds=Oracle or something on the kernel cmdline does fix that across migration, but whether or not someone has access to manipulate the cmdline varies. Sure, it's easy with physical systems and kvm to extend the kernel cmdline, but other cloud platforms don't expose that ability... but you can import images (which may be a dirty image in which cloud-init has run once before in a slightly different environment)
[20:11] <blackboxsw> sry, being pulled away for a bit. good discussion here though. I look forward to participating more
[20:11] <blackboxsw> and feel free to discount that use-case/suggestion
[20:22] <falcojr> I don't actually think that's the problem, as that cache clear line isn't clearing our entire cached datasource
[20:23] <falcojr> eb3095-Vultr: any chance this could be failing? https://github.com/canonical/cloud-init/blob/23.1.1/cloudinit/sources/helpers/vultr.py/#L110
[20:23] <falcojr> and by failing I mean returning False
[20:23] <falcojr> I think minimal pointed this out earlier, but that line in the one log: "2023-03-15 14:46:47,866 - stages.py[DEBUG]: cache invalid in datasource: DataSourceVultr"
[20:24] <falcojr> comes from https://github.com/canonical/cloud-init/blob/23.1.1/cloudinit/stages.py/#L303
[20:25] <falcojr> and the "check_instance_id" call is overridden in the Vultr datasource
[20:25] <falcojr> so it seems the only way we would be getting that log is if that call is returning false
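For context, a datasource's check_instance_id override typically compares the cached instance-id against identity readable without networking, such as the DMI system UUID. Roughly (a sketch of the usual pattern, not the exact Vultr implementation):

    def check_instance_id(self, sys_cfg) -> bool:
        """Return True if the cached instance matches the one we booted on.

        This runs on every boot BEFORE networking is up, so it must rely on
        locally readable identity (here, the DMI system-uuid exposed by KVM).
        A False return invalidates the cache and forces a full recrawl.
        """
        cached_id = self.metadata.get("instance-id")
        # helper assumed from cloudinit.sources
        return instance_id_matches_system_uuid(cached_id)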
[20:26] <eb3095-Vultr> nah, that is not failing; we use KVM and are feeding that through the DMI, and it's there throughout
[20:27] <eb3095-Vultr> also, everything works perfectly exactly up to the point where dhcp throws an exception
[20:27] <falcojr> lines 885 and after in https://pastebin.com/sfGfpX6k
[20:27] <falcojr> that's pre-dhcp, and bad
[20:28] <falcojr> erhm, sorry, 1671 I mean
[20:29] <falcojr> bah, nevermind, ignore me. Just now seeing the DataSourceNone there was already detected
[20:30] <eb3095-Vultr> np lol, very verbose log
[20:31] <falcojr> going back to 885 though, what's interesting is that it says cache invalid but then detects Vultr again
[20:33] <falcojr> yeah, ok, I was mixing up different "cache invalid" lines. 885 does look concerning to me, and that is happening right before the DHCP failure
[20:43] <minimal> "That's why it's recrawling; that's intentional, as we want it to configure any network config changes and it pulls that info from metadata" - don't all the DSes that use metadata servers check network config from the metadata server on every boot?
[20:44] <falcojr> no, most don't actually
[20:45] <falcojr> https://canonical-cloud-init.readthedocs-hosted.com/en/latest/explanation/events.html BOOT_NEW_INSTANCE is most common
[20:48] <minimal> falcojr: hmm, ok. I thought if, for example, I added a 2nd network interface while the VM was running then that's what HOTPLUG was for, but if I added it while the VM was not running then it would check network-config from the IMDS (e.g. EC2, GCE, Azure, Vultr, etc) upon next boot and deal with it
[20:49] <minimal> likewise if I changed the IP, subnet, etc of any interface while the VM wasn't running
[20:49] <falcojr> depends on the datasource, but if it defaults to BOOT_NEW_INSTANCE and hasn't issued a new instance-id, then no, it won't
[20:52] <eb3095-Vultr> manual_cache_clean: true seems to fix the issue on our end if put in the cloud.cfg
[20:55] <falcojr> ah yeah, that should work. I completely forgot about that option
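manual_cache_clean is a documented top-level cloud.cfg setting: when enabled, cloud-init keeps using its cached datasource even when validation fails, and only an explicit operator action (e.g. `cloud-init clean`, or removing /var/lib/cloud) triggers first-boot behavior again. The fragment in question is presumably just:

    # cloud.cfg: never auto-invalidate the cached datasource;
    # require a manual `cloud-init clean` to force re-discovery
    manual_cache_clean: true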
[20:56] <falcojr> probably worth figuring out why check_instance_id is returning false though.
[20:58] <minimal> falcojr: ok, so from a quick check of DSes, Azure, ConfigDrive, Ec2, OpenStack, RbxCloud, Scaleway, and SmartOS have both BOOT and BOOT_NEW_INSTANCE; all the others use the default of only BOOT_NEW_INSTANCE
[20:59] <minimal> so perhaps the Vultr DS should also define BOOT as well as BOOT_NEW_INSTANCE to get network config changes, rather than the way they're currently configuring things?
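In datasource code that opt-in looks roughly like this (a sketch patterned after the datasources minimal lists above; the exact placement in DataSourceVultr.py would need checking against the tree):

    from cloudinit import sources
    from cloudinit.event import EventScope, EventType

    class DataSourceVultr(sources.DataSource):
        # Re-crawl network config on every boot, not only on a new instance
        default_update_events = {
            EventScope.NETWORK: {
                EventType.BOOT_NEW_INSTANCE,
                EventType.BOOT,
            }
        }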
[21:00] <eb3095-Vultr> we do, in our own cloud.cfg on the images we use. I'll go back and put a PR in for the code though; totally forgot lol
[21:02] <minimal> yeah, that's what I meant: in DataSourceVultr.py rather than in cloud.cfg
[21:03] <minimal> as it means anyone, like myself, who builds their own Vultr OS images will have the DS set up in that fashion ;-)
[21:03] <eb3095-Vultr> indeed, an oversight on my behalf; I made this change on our end and never got around to it haha
[21:04] <minimal> I also have been meaning to buy a Round Tuit for several years :-)
