[05:36] Hi cloud-init team, I opened https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1842562 [05:36] Launchpad bug 1842562 in cloud-init (Ubuntu Eoan) "AWS: Add udev rule to set Instance Store device IO timeouts" [Medium,In progress] [05:36] You can ping me, or ddstreet if you have any questions. I hope cloud-init is the right place for it [05:37] There's still some debate going on in the SF case, but I think cloud-init is the best place [13:16] ok idgi, I'm trying to use {{ v1.local-hostname }} within my cloud-config.txt [13:17] but when I do I get unicode rendering errors [13:28] Mechanismus: What version of cloud-init are you using? Where do you see the errors? [13:31] version: /usr/bin/cloud-init 19.1-1-gbaa47854-0ubuntu1~18.04.1 [13:31] errors when I run `cloud-init query --list-keys` [13:31] UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte [13:33] Mechanismus: Oh, that's not good! Could you file a bug at https://bugs.launchpad.net/cloud-init/+filebug and attach the tarball that `cloud-init collect-logs` generates, please? [13:36] eh... I'll have to desensitize the file [13:36] ...after I find it [13:36] it's being generated via terraform and supplied to an azure vm in user data [13:37] Find it? `cloud-init collect-logs` gathers all the data we would need, so you shouldn't need to go find anything. :) [13:39] oh [13:39] I thought you meant the gzip that was provided to the vm at launch [13:39] :) [13:42] actually I have to redeploy the VM to get logs with specifically this situation [13:42] that'll take a minute [14:16] rharper: blackboxsw: We don't have a template for Oracle yet; which other template would you suggest basing it on? [14:16] Odd_Bloke: the openstack one ? [14:16] at least for now, it's mostly Openstack API right ? [14:16] until we switch datasources ? [14:17] it's been a while since I used the cli [14:17] something else might fit better [14:18] I think I'll just launch instances using the web interface. (I don't believe Oracle's user-facing API is OpenStack-compatible.) [14:19] So I guess I was really asking which one is most recently updated? [14:20] Looks like Openstack probably is the best one. [15:14] morning, yeah I'd agree on basing the manual verification script on openstack or azure since you'll probably be calling oracle's api instead of a launch-oracle.py script as we don't have that yet [15:36] Odd_Bloke: ok so it looks like Azure's part of the cloud config is running, but my custom bits fail to merge in and I still get the error with cloud-init query [15:37] Is there anything I can look at in the logs for something I might be doing wrong before I actually open a bug? [15:39] Mechanismus: It's hard to know, because I don't really understand the problem you're seeing. Honestly, a bug would make this a lot easier to work through. Is there a particular reason you don't want to file one? [15:40] Odd_Bloke: not really except that if it's a bug then the turnaround time to get this working is arbitrary and I'm trying to get this working today [15:40] I'm about to try an alternative approach in generating my cloud config in terraform though which would let me work around this for now [15:41] I mean, the same people who would help you in IRC are telling you to file a bug, so I'm not sure why you think the turnaround would be any different. ;)
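For context on the UnicodeDecodeError quoted above: byte 0x8b at position 1 matches the second byte of the gzip magic number (b"\x1f\x8b"), so one plausible reading (an assumption here, not a confirmed diagnosis) is that gzip-compressed user data is being decoded as UTF-8 somewhere along the way. A minimal Python sketch of how that exact error arises:

    # Hypothetical illustration only: decoding a gzip blob (e.g. compressed
    # user data) as UTF-8 fails on byte 0x8b in position 1.
    import gzip

    blob = gzip.compress(b"#cloud-config\nhostname: example\n")
    print(blob[:2])  # b'\x1f\x8b' -- the gzip magic number

    try:
        blob.decode("utf-8")
    except UnicodeDecodeError as err:
        # 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
        print(err)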
[15:42] blackboxsw: I'm actually not going to document exactly how to launch an Oracle instance; the UI makes it fairly easy to work out, and I don't want us to have out-of-date docs when they change things. I _will_ document the one thing that caught me out (remembering to add your SSH key). [15:42] I mean that if it's a matter of I'm trying to do something unsupported then I can fix that. If the fact that the UnicodeDecodeError shouldn't happen even when I'm being dumb constitutes a bug then I get that, but getting the _machine_ working is kind of my top priority right now [15:43] Odd_Bloke: thanks, makes sense [15:43] that's all I was really hoping, was if it was a complicated instance launch in any manner [15:44] Mechanismus: Well, we can always close out a bug Invalid if it turns out that you are doing something unsupported, but we would never expect to see a traceback. [15:44] So I expect there is a valid bug, even if it's "we should message better when we see bad input" or whatever. [15:44] And a bug gives us a place to attach logs etc. and have discussion that won't get lost in IRC backlog. [15:45] we can reflect the bug link in channel here too to improve response time on resolution [15:46] UnicodeDecodeError rings a bell when handling some of the cloud metadata in the past. but we probably can/should address that in cloud-init proper if it's causing your general cloud-init query --all to fail. [15:46] Odd_Bloke: I fully agree with you and will be happy to look into it once I fix the issue I was working on when I ran into this [15:47] * blackboxsw just realized I'm waay out of date on this discussion. I'll read the origin of this conversation to catch up [15:52] Mechanismus: couple things you can/should try to test whether your given jinja query syntax is valid, on a booted vm you can run: cloud-init query --format "{{ v1.local-hostname }}" to see if it's an accessible template variable. [15:53] Mechanismus: specifically for our v1 standardized metadata, I *think* you needed {{v1.local_hostname }} instead of {{ v1.local-hostname}} as the hyphen gets interpreted as subtraction [15:53] on my system: cloud-init query --format "{{v1.local-hostname}}" [15:53] WARNING: Ignoring jinja template for query commandline: Undefined jinja variable: "local-hostname". Jinja tried subtraction. Perhaps you meant "local_hostname" [15:54] blackboxsw: Good point, v1.local-hostname is in instance-data.json but doesn't work with the query. However, v1.local_hostname works, though that's exactly what I have in cloud-config on this machine with the errors [15:55] also Mechanismus inside the template, you can use python as a workaround... so you could do something like {{ v1.local_hostname.decode('utf-8') }} [15:55] or {{ v1.keys() }} to see available subkeys under v1 [15:59] Mechanismus: you could also check cloud-init query userdata (which would provide your cloud-config yaml) and you'd be able to process that content for your hostname: or fqdn: declarations [15:59] but that's a bigger lift :/ [16:00] if i had to guess.. [16:00] 0dc3a77f41f4544e4cb5a41637af7693410d4cdf [16:01] would fix Mechanismus [16:01] although that should not occur with python 3 [16:03] hrm, though I thought he was on v. 19.1.1 and that commitish was in ~18.5 [16:04] oh. you're right. i just looked at the date [16:05] and assumed the april commit didn't get into 19.1 [16:05] yeah true initially.
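The hyphen-vs-underscore point above can be shown with plain Jinja2, outside cloud-init's query code; the data dict below is made up for illustration. In a Jinja expression, v1.local-hostname parses as subtraction of two names, so only v1.local_hostname resolves:

    # Requires the jinja2 package; a standalone sketch, not cloud-init code.
    from jinja2 import Environment, StrictUndefined

    env = Environment(undefined=StrictUndefined)
    data = {"v1": {"local_hostname": "myhost", "local-hostname": "myhost"}}

    # Underscore form: attribute/key lookup on the v1 dict, renders "myhost".
    print(env.from_string("{{ v1.local_hostname }}").render(**data))

    # Hyphen form: parsed as (v1.local - hostname); both operands are
    # undefined, so rendering raises UndefinedError.
    try:
        env.from_string("{{ v1.local-hostname }}").render(**data)
    except Exception as err:
        print(type(err).__name__, err)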
[16:05] I mean, yes I thought so too initially [16:06] heh interesting on Azure for my SRU test [16:06] https://bugs.launchpad.net/cloud-init/+bug/1801364 is related, but not really. [16:06] Launchpad bug 1801364 in cloud-init "persisting OpenStack metadata fails" [Undecided,Confirmed] [16:06] ubuntu@my-e1:~$ sudo cloud-init query userdata [16:06] ../sethostname.yaml [16:06] i'm assuming he is not on python 2 [16:06] I expected the metadata service to actually report the user-data, not the file name I used when launching the instance [16:06] are you sure you have userdata ? [16:07] and not just the name of a file ? [16:07] checking my azcli launch command [16:08] to my knowledge 'az' takes only a custom-data blob in --custom-data [16:08] not a reference to a file [16:08] ahh interesting, I specified a nonexistent file on --custom-data [16:09] if file does exist I think it gets populated properly... checking our latest SRU run [16:09] Yeah, my shell history strongly suggests you can pass a file to --custom-data. [16:09] smoser: https://github.com/cloud-init/ubuntu-sru/blob/master/manual/azure-sru-19.2.21.txt [16:10] yeah if the file does exist (sethostname.yaml in that example ^) it works and sets SRU-worked- [16:10] but if file doesn't exist, azure just provides the string to the vm [16:10] and doesn't error (because it's 'flexible' in allowing blob or file) [16:11] gcloud has --metadata-from-file distinct from --metadata, which I prefer, I think. [16:11] +1 Odd_Bloke, yeah explicit intent/failures [16:12] the other better path is @filename [16:12] like curl does [16:12] true [16:13] * blackboxsw tries that @ w/ azcli to see if it'll fail on file absent [16:13] or even succeed on file presence [16:14] ok failure when providing --custom-data @ [16:14] Deployment failed. Correlation ID: 91ddbf3f-e296-4ea4-aab7-2189c314fe66. { [16:14] "error": { [16:14] "code": "PropertyChangeNotAllowed", [16:14] "message": "Changing property 'customData' is not allowed.", [16:14] "target": "customData" [16:14] } [16:14] } [16:14] :) [16:16] well, we missed cloud-init status meeting Monday due to US holiday. We'll shift it to next Monday, and I'll send an email to the list === blackboxsw changed the topic of #cloud-init to: Reviews: http://bit.ly/ci-reviews | Meeting minutes: https://goo.gl/mrHdaj | Next status meeting Sept 9 16:15 UTC | cloud-init v 19.2 (07/17) | https://bugs.launchpad.net/cloud-init/+filebug [16:33] * Odd_Bloke is going to look at SRU verification for https://bugs.launchpad.net/cloud-init/+bug/1812857 [16:33] Launchpad bug 1812857 in cloud-init "RuntimeError: duplicate mac found! both 'ens4' and 'bond0' have mac '9c:XX:XX:46:5d:91'" [Medium,Fix released] [16:41] Odd_Bloke: good deal, reviewing oracle run now. I just pushed https://github.com/cloud-init/ubuntu-sru/pull/44 with correct mtu v1 and v2 inputs (which fixed the diffs of v2 output so it is now limited to just dict ordering diffs) [16:46] Odd_Bloke: I can't remember w/ Oracle. Upon reboot, upgraded cloud-init changed from detecting DataSourceOpenStackLocal to detecting DataSourceOpenStack (net). I realize it's the same datasource, but was a bit surprised that it switched to !Local detection [16:47] * blackboxsw tries to see if we have that same transition for Ec2Local -> Ec2 [16:50] it may be worth peeking at 'Crawl of metadata service' in cloud-init.log post the clean reboot to see why cloud-init balks on OpenStackLocal post upgrade [16:50] maybe there was an ephemeral dhcp response issue there? [17:28] Looking.
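On the curl-style @filename idea above, here is a hedged sketch (illustrative only, not how az or gcloud actually behave) of why an explicit marker gives the "explicit intent/failures" property being praised: a missing file becomes a hard error instead of the literal string being silently passed through as user data.

    # Hypothetical helper; the "@path" convention is borrowed from curl.
    import os


    def resolve_custom_data(value):
        """Return user-data bytes from an inline blob or an @file reference."""
        if value.startswith("@"):
            path = value[1:]
            if not os.path.isfile(path):
                # Fail loudly rather than passing "@missing.yaml" through.
                raise FileNotFoundError("custom-data file not found: %s" % path)
            with open(path, "rb") as stream:
                return stream.read()
        return value.encode("utf-8")  # anything else is treated as an inline blob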
[17:31] blackboxsw: on your pull request with the netplan v2 bits; shouldn't you pull the mtu values from the devices in the verification ? [17:32] rharper: btw, you were right that netplan raises a warning about missing definitions for the bond_interfaces [17:32] yeah [17:33] so for verification there, I'd read the MTU values on the bond and member interfaces pre-upgrade (v1) and post-upgrade (v2) [17:33] it's fine to hard code the paths in the test since we're constructing the config (and interface names); [17:33] sure rharper agreed, I can grep -B 2 -i mtu and we'll see the interfaces in most cases [17:33] I'm grabbing bug #1806701 [17:33] bug 1806701 in cloud-init "cloud-init may hang OS boot process due to grep for the entire ISO file when it is attached" [Medium,Fix released] https://launchpad.net/bugs/1806701 [17:34] blackboxsw: you can read /sys/class/net/<interface>/mtu [17:34] if you're not playing with ipv6 mtu [17:34] rharper: that would be if I created a vm with that config and applied. I didn't do that, I was just running net-convert in the test [17:34] ah [17:34] ok [17:34] I saw NoCloud [17:34] so was thinking you were doing a VM [17:35] right, only because I did check on an lxc with -proposed enabled [17:35] to make sure our -proposed bits had the logic [17:35] instead of just testing tip [17:35] ok [17:35] that's fine actually since it's about the netconfig generated [17:48] OK, so there is a change in behaviour, and I think it's to do with network configuration; digging in more now. [17:49] (When I said there weren't new tracebacks at stand-up, I was mistaken.) [17:55] OK, I think the issue is coming from the classless static route support we now have for ephemeral DHCP. [17:56] And if the interface already has an address, we handle failing to set it gracefully, but we will still attempt to apply the routes to it. [17:56] And that fails, causing the DS to not be considered. [17:58] Good catch, Chad. [18:05] OK, I've got a fairly small patch which seems correct to me. I'll propose it and we can discuss it. [18:23] Odd_Bloke: ah, yes, we really should have a net_is_up check in the oracle/openstack ds [18:23] in that, if networking is up, no need to bring up ephemeral DHCP [18:23] that said, it can't hurt to be more defensive in Ephemeral DHCP as well [18:26] rharper: blackboxsw: https://code.launchpad.net/~daniel-thewatkins/cloud-init/+git/cloud-init/+merge/372289 <-- what do you think of that? [18:27] Odd_Bloke: ahh good deal. hrm. ok, so we caught a potential regression then. Sure, let's review what you've got when available and we'll get that in [18:28] reading now. [18:32] Odd_Bloke: left a comment, not quite sure what to do; to me, if we called EphemeralDHCP, then I really expect it to do a DHCP, not skip the dhcp + setup if the interface already has an IP ... ; should we raise an exception instead? and for Oracle/OpenStack, (or any user of EphemeralDHCP) we should check net.is_up(self.fallback_interface) before using the EphemeralDHCP [18:41] Having to check net.is_up before using it leads to slightly awkward code like get_metadata_from_imds in DataSourceAzure.py; if net.is_up(): do_thing() else: with EphemeralDHCP: do_thing() [18:41] But I agree that there's no point getting the lease when we're going to throw it away immediately without using any of it.
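A rough sketch of the calling pattern being debated here (ephemeral_dhcp, nic_is_up and crawl_metadata are stand-in names, not cloud-init's real API): guarding the ephemeral-DHCP context manager at every call site forces the metadata-crawl body to be spelled out twice, which is the awkwardness mentioned above.

    from contextlib import contextmanager


    @contextmanager
    def ephemeral_dhcp(nic):
        # Stand-in for obtaining a temporary lease and tearing it down on exit.
        print("dhclient on", nic)
        try:
            yield
        finally:
            print("tearing down ephemeral config on", nic)


    def nic_is_up(nic):
        # Stand-in for something like reading /sys/class/net/<nic>/operstate.
        return True


    def crawl_metadata():
        # Stand-in for fetching instance metadata over the network.
        return {"local-hostname": "example"}


    def get_metadata(nic="ens3"):
        if nic_is_up(nic):  # networking already configured (e.g. iSCSI root)
            return crawl_metadata()
        with ephemeral_dhcp(nic):  # otherwise bring up a temporary lease first
            return crawl_metadata()  # note: crawl_metadata() appears twice

Folding that check into the context manager itself (e.g. the EphemeralDHCP(force=True) idea floated below) would let callers write the body once.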
[18:41] rharper: Odd_Bloke, probably fair to think about things that way, though existing behavior is to bail on all other setup if the interface already has an IP, regardless of Odd_Bloke's fix [18:43] and Odd_Bloke agree it is awkward to have every call site check is_net_up() or EphemeralDHCP. it'd be nice to have that failsafe logic within the EphemeralDHCP contextmgr... maybe we could have an EphemeralDHCP(force=True) if we really want to force a dhclient run on an interface even if it already has config [18:46] In fact, I think get_metadata_from_imds is wrong because of this; it will report errors differently depending on whether or not an ephemeral lease was needed. [18:47] (Not a big deal, but this is why avoiding having to spell out do_thing() twice is good.) [18:47] Odd_Bloke: do we have the traceback you saw on Oracle somewhere [18:47] You mean you can't see it in my terminal? [18:47] Odd_Bloke: no, I'm just sniffing your browser traffic to your banks [18:47] lol [18:47] There we go: https://paste.ubuntu.com/p/jt8hNMJjKb/ [18:47] thanks man [18:48] Oh, OK, you could just have sniffed that URL then. [18:48] But I'll make it easier for you. [18:48] Yeah, so I think my fix is probably too far down the stack. [18:48] * blackboxsw wonders really if we should be checking the same failure condition we already are for ['ip', '-family', 'inet', 'addr', 'add', cidr, 'broadcast', [18:48] * blackboxsw self.broadcast, 'dev', self.interface], [18:48] * blackboxsw The 'File exists' in stderr [18:49] as in we can try all setup commands and only queue cleanup for the commands which succeed [18:49] We should perhaps do that, but I don't think that's the root of the problem here. [18:49] and ignore the setup commands for routes or addrs that already exist [18:49] We should be able to know that we don't need to do DHCP at all here. [18:50] Odd_Bloke: agreed there too [18:50] FWIW, we already do have support for not-DHCP'ing in the context manager, if we pass in a connectivity_url. [18:50] So the context manager already doesn't _always_ DHCP. [18:53] hrm, as in we could pass connectivity_url=self.metadata_address to EphemeralDHCPv4 maybe? [18:54] hrm no that wouldn't work, doesn't get set until you _crawl_metadata [18:59] We could refactor that though, I think. Regardless, connectivity_url is broken because it doesn't consider 403s to be an indication that you have connectivity. [18:59] (Which you obviously do, to get any sort of response!) [18:59] do IMDS return 403s ? would you want your connectivity url to do that ? [19:00] 403 indicates connectivity. [19:00] (Perhaps the argument is named incorrectly. :p) [19:01] And yes, on Oracle, `curl http://169.254.169.254` gives a 403. [19:02] *sigh* [19:02] =) [19:02] The same thing would happen on Google, too, at least; they expect a specific header in their requests. [19:02] And connectivity_url doesn't allow specifying anything other than the URL string, obvs. [19:09] Looks like nothing has ever used connectivity_url, so it wouldn't be super-surprising for there to be wrinkles with it, actually. [19:13] I guess, to step back for a minute, is it worth fixing the OpenStack DS for Oracle when we're about to switch over to their dedicated DS? [19:18] Odd_Bloke: I guess I'm still trying to understand why the network is up and configured already in local timeframe after a reboot [19:20] blackboxsw: iscsi [19:20] ahh ahh [19:26] Odd_Bloke: rharper. I *think* it probably makes sense for this to go with Odd_Bloke's branch to avoid the time cost of !detecting OpenstackLocal. As that issue could potentially affect other private openstack clouds using iscsi root or providing network config on the kernel cmdline, wouldn't it?
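Circling back to the connectivity_url point above, a hedged sketch (not cloud-init's actual helper) of the argument that any HTTP response, including the 403 that Oracle's metadata service returns, already proves connectivity, so a probe should not insist on a 2xx status:

    # Hypothetical probe; urllib raises HTTPError for non-2xx responses.
    import urllib.error
    import urllib.request


    def has_connectivity(url, timeout=5):
        try:
            urllib.request.urlopen(url, timeout=timeout)
            return True
        except urllib.error.HTTPError:
            # Got an HTTP status back (403, 404, ...): the network path works.
            return True
        except OSError:
            # No route, connection refused, timeout: no connectivity.
            return False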
[19:26] Well, my branch is really too far down the stack. [19:27] There, if we've been given routes then we should be applying them regardless. [19:28] The change should be at least one frame further up, so that we don't even DHCP if we already have networking. [19:28] first, is this a regression on Oracle, or has it been this way ? ie, do we need to apply a fix and respin the SRU ? [19:28] This is a regression [19:28] related to the rfc3442 stuff ? [19:28] Yep. [19:29] I guess I don't understand why, if we never went down this path before [19:29] Because we try to apply routes that already exist and don't handle that erroring. [19:29] but previously we didn't ? are we really DHCP'ing again on top of iscsi root ? [19:29] how does that even work ? [19:29] where did the lease response come from ? [19:31] We ephemerally DHCP, and then in EphemeralIPv4Network._bringup_device the first util.subp call fails. That failure is handled gracefully and, before, was the last thing that __enter__ did. [19:31] However, __enter__ now unconditionally continues on to apply the routes that the DHCP response included, and that's what fails. [19:32] I see [19:32] And that failure means that DataSourceOpenStackLocal doesn't find metadata, so we fall through to DataSourceOpenStack later on. [19:32] (Which just uses the networking that the system already has, of course.) [19:33] well, we set try dhcp to false for non-local [19:33] in the datasource [19:33] Right. [19:33] we really shouldn't DHCP if network is up [19:33] Yeah, agreed. [19:34] so we have new errors in local, but I don't think anything functional fails; [19:35] rharper: right, we still ultimately detect init-network DataSourceOpenStack, just a time cost of failing @ init-local timeframe [19:35] and seeing more traces [19:35] well, I wonder if non-iscsi we'd see a network-config failure [19:36] in the iscsi case, we already use iscsi network-config instead of ds network-config [19:36] if local failed to render network-config, then we write out fallback I think, then at net time, we can crawl, all is well; [19:40] blackboxsw: What's the time cost in your view, OOI? [19:40] AFAICT, the time taken by the network traffic dominates, and we have to pay that cost in either route. [19:40] But I may be missing another consequence. [19:41] (I didn't notice the two different data sources, so I'm clearly not on my top game today, lol) [19:44] Odd_Bloke: not big, certainly subsecond cost. lemme check, it looked like it was +00.09700s [19:45] OK, cool, just making sure I wasn't missing something else. [20:18] https://bugs.launchpad.net/cloud-init/+bug/1842752 <-- the bug we just discussed me filing [20:18] Launchpad bug 1842752 in cloud-init "Additional traceback in logs when using DataSourceOpenStackLocal on Oracle" [Low,Triaged] [20:19] thanks Odd_Bloke [20:40] The traceback does appear on upgrade. [20:55] what do we want to do about verifying https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1833192 [20:55] Launchpad bug 1833192 in cloud-init (Ubuntu) "VMware: post custom script isn't run correctly" [Undecided,Fix released] [20:59] rharper: Could we just comment on the bug asking for help? [20:59] And maybe reach out to what VMWare contacts we do have?
[21:00] we can ping them directly [21:00] I'll send an email [21:45] rharper: Odd_Bloke so I can validate that cloud-init behaves as expected for bug #1840080 in the SRU [21:45] bug 1840080 in cloud-init (Ubuntu) "cloud-init cc_ubuntu_drivers does not set up /etc/default/linux-modules-nvidia" [High,Fix released] https://launchpad.net/bugs/1840080 [21:45] \o/ [21:45] it emits proper debconf-set-selections, yet ubuntu-drivers-common doesn't actually install linux-modules-nvidia packages [21:49] so, not really sure what we should do on this front. I *think* that behavior of cloud-init is correct, but we have yet to see the plumbing from ubuntu-drivers-common [21:51] blackboxsw, hit him up early tomorrow and ask; for now move on [21:54] yeah nothing else remains to move on to, waiting on CDO QA review, validation of ubuntu-drivers behavior on an aws GPU eoan instance, and I think Odd_Bloke is working the last remaining bug: #1812857 [21:54] bug 1812857 in cloud-init "RuntimeError: duplicate mac found! both 'ens4' and 'bond0' have mac '9c:XX:XX:46:5d:91'" [Medium,Fix released] https://launchpad.net/bugs/1812857 [21:55] powersj: I'll publish to copr el-testing. But, I think the rest of validation is grabbed/blocked or done. [21:56] so, we can touch base tomorrow to see if there is anything else of note that would require SRU-regen [21:56] sounds good