[15:49] <eb3095-Vultr> So having an issue with the Vultr datasource. When it fails to dhcp/reach metadata, the datasource fails out and it defaults to nocloud, resetting the host keys, the root passwords, etc. Is this expected behavior? Should we be in some way invoking the cache in this instance to prevent this behavior?
[15:58] <minimal> eb3095-Vultr: first off you could change the defined/active data source list to be only Vultr (or Vult and None) so it can't fallback to NoCloud or anything else
[15:59] <minimal> is this happening on 1st boot or on (some) later boots? have you enabled debugging for the cloud-init logs to see more info about what is going on?
[16:02] <eb3095-Vultr> Soeey, I did mean none, not nocloud, and yes, we are defining the datasources in cloud.cfg as just the vultr datasource and we are still seeing this behavior. Its happening on boots after the first. I see whats going on though, it was trying to either dhcp or reach the metadata, and then an error is thrown in the datasource which fails out the datasource and it fallsback to none.
[16:07] <minimal> so there is a DHCP fallback but I'm surprised that SSH host keys would be (re)generated after initial boot. Can you enable debugging and then pastebin (or similar) the cloud-init.log?
[16:08] <eb3095-Vultr> Sure one sec.
[16:12] <eb3095-Vultr> While I am getting that for you, have someone pulling it for me, I was looking at this. https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/helpers/vultr.py In the get_metadata function im raising the exception from the dhcp failure, assuming thats an issue here. Should I be checking for a cache and returning that instead? This seems like the underlying cause.
[16:12] <eb3095-Vultr> https://pastebin.com/sfGfpX6k thats the cloud-init.log
[16:17] <minimal> still reading through it but can see DHCP being setup and info successfully fetched from Vultr metadata server
[16:22] <eb3095-Vultr> further down, the guy replicating this for me deployed then broke networking to reproduce.
[16:41] <minimal> so the initial run looks fine to me
[16:41] <minimal> in the 2nd set of log entries I noticed:
[16:41] <minimal> 2023-03-15 14:46:47,866 - stages.py[DEBUG]: cache invalid in datasource: DataSourceVultr
[16:41] <minimal> 2023-03-15 14:46:47,866 - handlers.py[DEBUG]: finish: init-local/check-cache: SUCCESS: cache invalid in datasource: DataSourceVultr
[16:41] <minimal> 2023-03-15 14:46:47,866 - util.py[DEBUG]: Attempting to remove /var/lib/cloud/instance
[16:43] <minimal> also trying to do DHCP fails for some reason
[16:44] <eb3095-Vultr> yeah in that particular instance the dhcp failure is intentional. Concerned about the cache being invalid.
[16:44] <minimal> DHCP failure is intentional? how so?
[16:45] <eb3095-Vultr> in this case we were purposely breaking it to replicate an outage to test this scenario. As when it fails we are seeing cloud-init re-init on the none datasource and its breaking installs
[16:46] <minimal> breaking installs? the "install" has already occurred, i.e. the 1st boot
[16:46] <minimal> if Vultr is failing to provide a DHCP response then isn't that a Vultr issue?
[16:48] <eb3095-Vultr> By breaking installs I mean breaking existing deployed servers by rolling hostkeys and changing passwords, on subsequent boots. And yess, DHCP failure would be a Vultr issue, but thats not the concern I have. Its that when an outage happens where DHCP or the metadata is down we are seeing customer instances that reboot in this window have their servers altered. Im trying to determine if this is expected behavior in this 
[16:48] <eb3095-Vultr> instance or if we did something wrong in the datasource that we need to fix.
[16:49] <minimal> AFAIK the server is being altered as the None datasource is being used as the Vultr datasource is not usable
[16:50] <minimal> the Vultr datasource is not usable due to DHCP failure and therefore non-access to the Vultr metadata server
[16:50] <eb3095-Vultr> yeah that does seem to be the case. Is thats what is expected to happen here or should I throw a merge request in to prevent this somehow, so even if the DHCP fails we can fallback on the cache
[16:52] <minimal> there has been a recent change in treat enabled datasource of either "DS" or ["DS","None"] as "hardcoded to only use the specified datasource
[16:52] <minimal> s/in treat/to treat an/
[16:53] <eb3095-Vultr> was that very recently? We are using a copy from like a month or two ago from the main branch because we had our own fixes in we needed to pull down here. Im wondering if me raising that exception in the get_metadata is the cause of this behavior or not
[16:54] <minimal> hmm, it is Openstack only at present it seems, support for other DSes not yet done: https://github.com/canonical/cloud-init/commit/d1ffbea556a06105d1ade88b4143ad43f53692c4
[16:54] -ubottu:#cloud-init- Commit d1ffbea in canonical/cloud-init "source: Force OpenStack when it is only option (#2045)"
[16:55] <minimal> so that's in release 23.1.1
[16:55] <eb3095-Vultr> hmm, I'll happily add us to that list. So I'm guessing this is in fact the expected behavior here and we are not doing something explicitly wrong in our own datasource? Adding ourselves to this here will also prevent this behavior?
[16:57] <minimal> probably best to discuss with holmanb, the author of that PR
[16:59] <minimal> basically the ssh module only regenerates SSH host keys "once-per-instance" (i.e. not once per boot) so this only happens if c-i thinks the instance has changed, this is being trigger by the switch from Vultr DS to None DS
[16:59] <minimal> it would be useful to see logs from when the problem happens, rather than when the problem is simulated as the 2 may not have identical behaviour
[17:00] <eb3095-Vultr> Yeah I get whats happening, I was just concerned that change in DS was because we were throwing that exception and should be doing something else here. Wasnt sure if we were breaking cloud-init or this is just how cloud-init works.
[17:00] <minimal> I'd assume basically that any DS using a metadata server relies upon that metadata server being available to it upon each boot
[17:01] <minimal> so I wonder why no DHCP response if occasionally happening
[17:01] <minimal> s/if/is/
[17:01] <eb3095-Vultr> its rare, but outages happen from time to time. Just hoping we had a path to avoid breaking instances that reboot in those windows
[17:04] <minimal> hopefully holmanb should be on channel shortly for his input
[17:04] <minimal> eb3095-Vultr: are you Vultr staff?
[17:04] <eb3095-Vultr> Yes, and the guy who wrote that datasource
[17:07] <minimal> interesting. I'm the Alpine maintainer/packager for cloud-init, I've been meaning to give it a test on Vultr
[17:07] <eb3095-Vultr> We just added alpine recently
[17:10] <minimal> ok, I don't see it listed on your Operating Systems page
[17:13] <eb3095-Vultr> erm, i thought we released it, oof still in beta, one sec, ima deal with that right now haha
[17:40] <eb3095-Vultr> Hmm does look like that commit is the solution, and 2 weeks ago at that, so gonna rebuild and run a trial to see if that solves my problem here.
[17:40] <eb3095-Vultr> thanks for the heads up on that
[17:48] <minimal> that fix is in cloud 23.1.1, I've packaged that for Alpine Edge (a couple of days ago) but Alpine 3.17 is still on cloud-init 22.4 at present
[18:14] <WannaBeGeekster> Hey everyone.  Trying to use packages: on Ubuntu 22.04.  I look in the logs and it is trying to use yum instead of apt to install the packages so it is erroring out.  Not sure if anyone has seen that before or not?
[18:15] <minimal> WannaBeGeekster: are you using an Ubuntu cloud image?
[18:16] <WannaBeGeekster> I didn't build the images.  I was basically just told the same thing from my manager.  Let me try a different image and see if it fixes the issue.
[18:17] <minimal> is it some sort of "official" image or did some locally create it?
[18:17] <minimal> s/some/someone/
[18:17] <WannaBeGeekster> The image I was using was build using packer.
[18:18] <minimal> my point being, perhaps cloud-init was not correctly configured when the image was created
[18:18] <WannaBeGeekster> Yes, I will double check that for sure.  I think it is the most likely issue right now.
[18:37] <eb3095-Vultr> for my issue, that patch didnt change anything, still falls back to None on failure and rolls the keys and passwords again
[18:57] <minimal> eb3095-Vultr: so the underlying issue is that DHCP failure means Vultr DS selection fails
[18:57] <eb3095-Vultr> well, i wouldnt say the DS selection fails as it does chose it, its more of a it falls back to None and borks the server. Would rather it do nothing than it do that
[18:57] <minimal> but what would you expect the DS, or cloud-init, to do when DHCP requests fail?
[18:58] <eb3095-Vultr> preferably nothing, just fail and move on
[18:58] <eb3095-Vultr> im still not even sure if thats what is suppose to happen or not lol. I'm still not convinced me throwing that exception isnt explicitly at fault or some other weird nonsense i did in that datasource.
[18:59] <minimal> well the Vultr DS selection does fail:
[18:59] <minimal> 2023-03-15 14:47:48,917 - handlers.py[DEBUG]: finish: init-local/search-Vultr: FAIL: no local data found from DataSourceVultr
[19:00] <minimal> "fail and move on"? but then it doesn't have access to meta-data etc - i.e.g. things like network config are checked on every boot (the interfaces config could have changed since last boot)
[19:00] <eb3095-Vultr> i guess just semantics on my behalf, the datasource does indeed fail. Im wondering if it should fail in that fashion if that data isnt available, or if I should be invoking the cache in some way, or avoiding the exception. Im aware if it moves on it wont change configs, but stale configs is still a better result then locking the user out of the vm and rolling host keys.
[19:02] <minimal> any of blackboxsw, falcojr, or smoser around to give some input?
[19:10] <minimal> eb3095-Vultr: after the DHCP failure is logged
[19:10] <minimal> main.py[DEBUG]: No local datasource found
[19:11] <falcojr> rolling host keys?
[19:11] <eb3095-Vultr> it changes all the host keys and the root password when it fails over to none
[19:12] <falcojr> sorry, let me read back further :D
[19:13] <minimal> because that is seen as a new instance and so "1st boot" actions occur
[19:13] <eb3095-Vultr> indeed
[19:13] <minimal> looking at cloudinit/cmd/main.py perhaps passing "--force" to cloud-init may do what you want
[19:14] <minimal> e.g. Force running even if no datasource is found (use at your own risk)."
[19:15] <eb3095-Vultr> i would think that does the opposite haha
[19:16] <minimal> wel if you don't use that then a ds must be found and the one found is None in your case as Vultr fails ;-)
[19:21] <falcojr> eb3095-Vultr: if you wanna be really bleeding edge, you could try out this PR https://github.com/canonical/cloud-init/pull/2060
[19:21] -ubottu:#cloud-init- Pull 2060 in canonical/cloud-init "datasource: Optimize datasource detection, fix bugs" [Open]
[19:22] <falcojr> ^ takes the openstack fix mentioned by minimal and applies it to all datasources
[19:22] <minimal> I think this is part of the issue:
[19:22] <minimal> https://github.com/canonical/cloud-init/blob/main/cloudinit/net/dhcp.py#L151
[19:23] <minimal> it doesn't define how to handle non-zero exit codes, like whatever code dhclient uses for the "couldn't get a lease" returns
[19:24] <minimal> falcojr: ah, I was mixing that PR up with the other, already merged, one
[19:24] <eb3095-Vultr> i built off of main and it still happens, i can provide a log. falcojr can I get confirmation of whether or not this is expected behavior? If its not than I can totally guess a few things I might of gotten wrong in the datasource
[19:25] <blackboxsw> good pt on dhcp.py#L151.... so we probably want to return [] upon failure there.
[19:25] <minimal> blackboxsw: the question is what to do with failure? wait a few seconds and try again? how many retries before giving up? etc
[19:25] <blackboxsw> reading the rest of the context from today too. if dhcp fails during datasource detection and we don't handle and exception in get_data, we skip that datasource and move on to the next in the list
[19:26] <minimal> I'd assume when a DS that uses a metadata server is selected then the DS *really* needs to talk to the metadata server rather than giving up at the first attempt
[19:28] <falcojr> dhcp.py exception should be better, but it looks like it's being caught in vultr code
[19:28] <eb3095-Vultr> i dont mind this behavior on first boot, on additional boots though its an issue. Ide prefer in these cases for it to just do nothing and cloudinit fail, not switch to None. I'm guessing that I can take that as confirmation this is expected behavior though. Im guessing my next best option would be to make it so even on this failure, if its not the very first boot, that the datasource just doesnt fail. Any recommendation 
[19:28] <eb3095-Vultr> on how to proceed there.
[19:28] <falcojr> https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/helpers/vultr.py/#L43
[19:31] <eb3095-Vultr> yeah im catching it and rethrowing it. This is for instances where we have to cycle through multiple interfaces.
[19:38] <blackboxsw> hrm /me had to looing lru_cache and raising exceptions to make sure we don't just cache the raised excetion path per vultr.get_metadata invocation. 
[19:38] <blackboxsw> s/*looing/look at/
[19:39] <blackboxsw> safe there. multiple calls to an lru_cached function still get in and attempt to run the code. It seems only successful return values are cached.
[19:40] <minimal> that's a good question - how to handle multi-interface scenario? how to differenciate between "interface has no DHCP response as it is not supposed to used for metadata" from "interface has no DHCP response but it is supposed ot as it is for metadata access"?
[19:40] <eb3095-Vultr> I actually think that was a hold over from a previous implementation. I dont even know if its needed anymore, it was there because that was getting called multiple times but I dont think it is anymore. When the networking kicks back in though it hasnt shown it returning an exception where it should not.
[19:40] <eb3095-Vultr> looks like you came to the same conclusion, so nevermind haha
[19:43] <falcojr> an exception in _get_data shouldn't be invalidating cache...something else seems to be going on
[19:43] <blackboxsw> ok and reading that helpers.vurltr.get_metadata we only raise the last excetion if all get_interface_list() fail. so if none of the interfaces can DHCP DataSourceVultr.get_data() will eventually raise that error and fail to discover the datasource, and cloud-init will fallback to None.
[19:43] <eb3095-Vultr> yeah thats what i figured
[19:44] <blackboxsw> it seems I have some food stuck under my "p" key as it kees not firing :/
[19:44] <eb3095-Vultr> so I know cloud-init is caching this somewhere, and it seems like throwing that exception is the issue. Is there a way for me to invoke that cache and return whats cached instead of throwing that exception? Would that be a better solution here?
[19:46] <blackboxsw> right James, something is telling cloud-init to recrawl get_data every boot and if dhcp isn't active on that reboot then we get a raised exception here in get_data, failing to discover vultr datasource  https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/DataSourceVultr.py#L50 
[19:47] <blackboxsw> generally that doesn't happen per boot unless instance-id changes
[19:47] <eb3095-Vultr> oh its recrawling because I am telling it to in the cloud.cfg because we use it to configure networking which is subject to change and relies on the metadata.
[19:47] <eb3095-Vultr> im ok with a stale config here on a failure though
[19:52] <blackboxsw> how are you telling it to recrawl? 
[19:53] <eb3095-Vultr> https://www.irccloud.com/pastebin/g7XEaVa4/
[19:53] <falcojr> whats in /run/cloud-init/cloud.cfg ?
[19:54] <eb3095-Vultr> what I just posted above. Thats why its recrawling, thats intentional as we want it to configure any network config changes and it pulls that info from metadata, thats why its hitting that every boot.
[19:54] <blackboxsw> yeah I'm trying to peek at plumbing for the network updates definitions to see how we handle failure for just the subsequent get_data call after datasource is already detected. I'm expecting this is the same failure path for normal datasource detection (update_metadata_if_supported or some such function) 
[19:55] <blackboxsw> if we are saying recrawl get_data for network updates, but network isn't there, I'd lean toward the datasource (or network backplane of the cloud) should probably be in place or more resilient in the unlikely event of DHCP not available.
[19:55] <eb3095-Vultr> i'de imagine so. Which is why im wondering if instead of throwing that exception I should just invoke cloudinits cache, grab the last known metadata, and return that instead, and if its not available, like a first boot, then let it fail over as it sees fit.
[19:56] <blackboxsw> expected use-cases like this could also be 'sandboxing' a VM without network for a while for triage/debugging etc.
[19:57] <falcojr> are we somehow changing the instance-id at the same time?
[19:57] <falcojr> "just" failing to _get_data shouldn't result in a datasource change
[19:57] <falcojr> or failing to read the instance-id?
[19:57] <eb3095-Vultr> i can provide the latest log if you like
[19:58] <falcojr> yeah, that'd help
[19:58] <eb3095-Vultr> https://pastebin.com/XKhvfrQS this is from the build on the main branch, so bleeding edge build
[19:59] <eb3095-Vultr> it was installed AFTER it booted on 22.4.2 though
[20:01] <blackboxsw> while we are debugging this. I think the problem is here https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/__init__.py#L858 with respect the Network scope PER_BOOT updates. If get_data() no longer works because DHCP is unavailable in Vultr on this particular boot, we'll raise an exception here just after clearing all cached userdata, vendordata, metadata on the cached datasource.
[20:03] <blackboxsw> I wonder if we see                 "Unable to render networking. Network config is "               
[20:03] <blackboxsw>                 "likely broken: %s", 
[20:06] <eb3095-Vultr> If it helps I can spin up an instance or credit an account if you want hands on here
[20:06] <blackboxsw> problem is if we are telling cloud-init to render per boot, and the dhcp infrastructure or IMDS isn't available per boot, what should our failure mode be. Generally I'd say we want the datasource or cloud to retry if it's an expected failure path that should be resolved with retries on that cloud. If we always default to use the cached datasource, we prevent folks from migrating from one datasource to another
[20:07] <blackboxsw> use-case (albeit not very widely used): NoCloud CIDATA drive Mounted which cloud-init discovers on one boot to create a 'golden image'. Then device is removed and image is launched on AWS or GCP etc and we expect cloud-init to detect new cloud IMDS.
[20:08] <blackboxsw> if we always fallback to reuse the cached datsource if get_data fails, we prevent migration of images from one cloud to another.
[20:09] <eb3095-Vultr> wouldnt defining the ds in the cmdline in grub fix that?
[20:09] <blackboxsw> ^ this use-case is probably not used often (or intentionally). But it's a benefit of cloud-init being able to discover any viable datasource across boot
[20:11] <blackboxsw> I realize the use-case isn't too stong. yes, ci.ds=Oracle or something on kernel cmdline does fix that across migration, but whether or not someone has access to manipulate cmdline. Sure, it's easy with physical systems and kvm to extend kernel cmdline, but other cloud platforms don't expose that ability... but you can import images (which may be a dirty image in which cloud-init has run once before in a slightly different 
[20:11] <blackboxsw> environment)
[20:11] <blackboxsw> sry being pulled away for a bit. good discussion here though. I look forward to participating more
[20:11] <blackboxsw> and feel free to discount that use-case/suggestion
[20:22] <falcojr> I don't actually think that's the problem as that cache clear line isn't clearing our entire cached datasource
[20:23] <falcojr> eb3095-Vultr: any chance this could be failing? https://github.com/canonical/cloud-init/blob/23.1.1/cloudinit/sources/helpers/vultr.py/#L110
[20:23] <falcojr> and by failing I mean returning False
[20:23] <falcojr> I think minimal pointed this out earlier, but that line in the one log: "2023-03-15 14:46:47,866 - stages.py[DEBUG]: cache invalid in datasource: DataSourceVultr"
[20:24] <falcojr> comes from https://github.com/canonical/cloud-init/blob/23.1.1/cloudinit/stages.py/#L303
[20:25] <falcojr> and the "check_instance_id" call is overridden in the Vultr datasource
[20:25] <falcojr> so it seems the only way we would be getting that log is if that call is returning false
[20:26] <eb3095-Vultr> nah that is not failing we use kvm and are feeding that through the dmi, and its there throughout
[20:27] <eb3095-Vultr> also everything works perfectly exactly up the point where dhcp throws an exception
[20:27] <falcojr> lines 885 and after in https://pastebin.com/sfGfpX6k
[20:27] <falcojr> that's pre dhcp and bad
[20:28] <falcojr> erhm, sorry, 1671 I mean
[20:29] <falcojr> bah, nevermind, ignore me. Just now seeing the DataSourceNone there was already detected
[20:30] <eb3095-Vultr> np lol, very verbose log
[20:31] <falcojr> going back to 885 though, what's interesting is that it says cache invalid but then detects Vultr again
[20:33] <falcojr> yeah, ok, I was messing up different "cache invalid" lines. 885 does look concerning to me and that is happening right before the DHCP failure
[20:43] <minimal> "Thats why its recrawling, thats intentional as we want it to configure any network config changes and it pulls that info from metadata, thats why its hitting that every boot." - don't all the DSes that use metadata servers check network config from metadata server on every boot?
[20:44] <falcojr> no, most don't actually
[20:45] <falcojr> https://canonical-cloud-init.readthedocs-hosted.com/en/latest/explanation/events.html BOOT_NEW_INSTANCE is most commont
[20:48] <minimal> falcojr: hmm, ok, I thought if for example I added a 2nd network interface while the VM was running then that's what HOTPLUG was for, but if I added it while the VM was not running then it would check network-config from IMDS (e.g. EC2, GCE, Azure, Vultr, etc) upon next boot and deal with it
[20:49] <minimal> likewise if I changed the IP, subnet, etc of any interface while the VM wasn't running
[20:49] <falcojr> depends on the datasource, but if it defaults to BOOT_NEW_INSTANCE and hasn't issued a new instance id, then no, it won't
[20:52] <eb3095-Vultr> manual_cache_clean: true seems to fix the issue on our end if put in the cloud.cfg
[20:55] <falcojr> ah yeah, that should work. I completely forgot about that option
[20:56] <falcojr> probably worth figuring out why check_instance_id is returning false though. 
[20:58] <minimal> falcojr: ok so from a quick check of DSes, Azure, ConfigDrive, Ec2, OpenStack, RbxCloud, Scaleway, and SmartOS have both BOOT and BOOT_NEW_INSTANCE, all the others use default of only BOOT_NEW_INSTANCE
[20:59] <minimal> so perhaps Vultr DS should also define BOOT as well as BOOT_NEW_INSTANCE to get network config changes rather than the way they're currently configuring things?
[21:00] <eb3095-Vultr> we do, in our own cloud.cfg on the images we use. I'll go back and put a PR in for the code though, totally forgot lol
[21:02] <minimal> yeah that's what I meant, in DataSourceVultr.py rather than in cloud.cfg
[21:03] <minimal> as it means anyone, like myself, who build's their own Vultr OS images will have the DS setup in that fashion ;-)
[21:03] <eb3095-Vultr> indeed oversight on my behalf, i made this change on our end and never got around to it haha
[21:04] <minimal> I also have been meaning to buy a Round Tuit for several years :-)